
Archive for the ‘Shell’ Category

Hourglass distribution

August 6, 2013

A few days ago, I came across an article on mutation rate mentioning the “hourglass distribution”. The distribution is pretty obvious from the plot in the article: for some chromosomes, there is a clear reduction in the number of SNPs around the centromere. I wondered whether the same phenomenon could be observed in Bos taurus, so I had a look at the dbSNP138 data to get some clues.

First version


#Retrieve dbSNP
wget ftp://ftp.ncbi.nih.gov/snp/organisms/cow_9913/VCF/vcf_*vcf.gz

And then just run the following R code.

#Open an output file
png("hourglass.png",1920,1080)
#Split the screen in 30 panels (3 rows x 10 columns)
par(mfrow=c(3,10))
#Create a vector of chromosome names
chr=c(1:29,"X")
#Run a loop over chromosomes
for(i in chr){
 #Read the dbSNP file for this chromosome
 Pos=read.table(paste("vcf_chr_",i,".vcf.gz",sep=""),skip=15,fill=TRUE)
 #Plot the SNP density (positions are in column V2)
 plot(density(Pos$V2),col="grey",main=paste("BTA : ",i,sep=""))
 polygon(density(Pos$V2),col=rgb(0,0,0.3,0.04))
}
dev.off()

Notes on the code:

  • read.table is used directly on a gzipped file (a very handy trick)
  • dbSNP files have a 15-line header, so I use the skip=15 option
  • I had some glitches while reading some files (a problem with the number of fields); the fill=TRUE option fixes it
  • Plots are nice… but polygons are even better, so I first plot the density and then add a polygon on top of it
  • The rgb function is a simple way to obtain transparency: after the three values for red, green and blue, I set the alpha to 0.04

And after a while, you’ll obtain something like this:

[Figure: hourglass.png — SNP density for each bovine chromosome]

I must say I was a bit disappointed: there is no clear pattern like the one seen in humans. All bovine chromosomes are acrocentric, which may explain why, in general, no clear decrease in SNP density can be seen. The patterns observed on chromosomes 12, 18 and 10 were even more surprising; I wonder whether there could be some sampling bias. As for the pattern on BTA23, it could be due to the MHC, which is known to exhibit great diversity. The density computation may also blur things a bit.

Second version

The basic work being done, we can try to investigate other hypotheses. For instance, are SNPs and indels distributed the same way along the genome? With some slight changes, the code becomes:

#Open an output file
png("SNPandindelDistribution.png",1920,1080)
#Split the screen in 30 panels (3 rows x 10 columns)
par(mfrow=c(3,10))
#Create a vector of chromosome names
chr=c(1:29,"X")
#Run a loop over chromosomes
for(i in chr){
 #Read the dbSNP file for this chromosome
 Pos=read.table(paste("vcf_chr_",i,".vcf.gz",sep=""),skip=15,fill=TRUE)
 #Plot the snp density (variants flagged "snp" in column V8)
 plot(density(Pos$V2[grep("snp",Pos$V8,perl=TRUE)]),col="grey",main=paste("BTA : ",i,sep=""))
 polygon(density(Pos$V2[grep("snp",Pos$V8,perl=TRUE)]),col=rgb(0,0,0.3,0.04))
 #Add the in-del line, scaled by 0.5 to keep it on a comparable scale
 dense=density(Pos$V2[grep("in-del",Pos$V8,perl=TRUE)])
 lines(dense$x,0.5*dense$y,col="red")
}
dev.off()

Notes on the code:

  • In dbSNP, variants are coded either as snp or in-del; we extract the corresponding lines with the grep function
  • I scaled the indel line (by 0.5) in order to avoid scale problems

[Figure: SNPandindelDistribution.png — SNP density (grey polygon) and scaled indel density (red line) for each chromosome]

We observe roughly the same pattern for SNPs and indels, although the indel distribution may be smoother. I was expecting some discrepancies (relying on the article by Montgomery et al.), but here we are only dealing with 1-base indels, which are not really representative of short indels in general. I may try to check these results against my own data.

Serial snapshots with IGV batch

July 16, 2013

IGV is a very handy tool. Nevertheless, scrolling from one position to another can be tedious, and the downside of this very user-friendly software is that you can spend hours looking here and there, ending up with no record of your discoveries.

Thankfully, IGV can run in batch mode, which allows it, for a targeted list of positions, to take screenshots and store them in a folder. We’ll illustrate this with a small example:

Setting up a session

To test our script, we will first download some publicly available data.

#Download a vcf file (from the beagle4 website; you may have to change the file name according to the latest release)
wget http://faculty.washington.edu/browning/beagle/test.r1099.vcf.gz
gunzip test.r1099.vcf.gz
#Create an index
igvtools index test.r1099.vcf
#(Alternatively, you can run igvtools from the IGV GUI)
#Launch igv
igv.sh

To check that everything is right, load the vcf file test.r1099.vcf, then move to position chr22:20,000,000-20,100,000.

[Screenshot: the IGV window with the test.r1099.vcf track loaded]

Set everything to your taste and save the session (File, Save Session); you should obtain an xml file.

Running IGV in batch mode

Now, let’s say you are interested in some particular positions, stored in a csv file. Our aim is to create an IGV batch file.

Basically, we’ll have to load the session, set a directory to store the screenshots, and then move from one position to the other. A very crude version could therefore be the following. (I know: why csv? Because a lot of people still use Excel…)

#Create a fake positions list
cat >Liste.csv <<EOF
chr22;20070000
chr22;20081000
EOF

#Create a screenshot directory
mkdir IMG
#Write the header of the script
cat >Batch.igv <<EOF
new
load igv_session.xml
snapshotDirectory IMG
EOF
#Now parse the csv file
gawk -F ";" -v R=10000 '{print "goto "$1":"$2 -R"-"$2 + R"\nsnapshot Screen"NR".png"}END{print "exit"}' Liste.csv >>Batch.igv
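
For the two positions in Liste.csv, Batch.igv should now look like this (the expected content, given R=10000 and the header written above):

new
load igv_session.xml
snapshotDirectory IMG
goto chr22:20060000-20080000
snapshot Screen1.png
goto chr22:20071000-20091000
snapshot Screen2.png
exit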

From IGV, go to Tools, Run Batch Script, and load Batch.igv. When the whole process is done, IGV will terminate and you’ll find your screenshots in IMG.

For an even more automated version, you can use the script “PrepIgvBatch.sh” available in the scriptotheque.

Vcf processing with awk

May 21, 2013

Processing genome-wide data implies juggling between the genome and chromosome scales. For instance, phasing is better done chromosome by chromosome, while annotation or recalibration may be better done at the genome-wide scale. Splitting and merging vcf files by chromosome may thus be useful; in this quick and dirty post, I’ll report some ways to handle this problem.

Split a vcf file

The general idea when splitting a vcf is to write each line of the file to a distinct output according to the chromosome (first field, $1) it pertains to. This can conveniently be done with:

gawk '/^chr/{print >($1".tmp")}' Genome.vcf

where each line starting with “chr” (i.e. not part of the header) is printed to a file whose name is the concatenation of the first field ($1) and the extension “.tmp”.

This is not perfect: we haven’t copied the header, so the output won’t follow the vcf standard. We need to add the header.

for i in *.tmp
do
 name=`echo $i | sed 's/.tmp/.vcf/g' `
 gawk '/^#/{print}' Genome.vcf >$name
 cat $i >>$name
 rm $i
done

Here, for each “tmp” file created, we:

  1. create a new vcf name (obtained by switching the tmp extension to vcf)
  2. write the header (lines starting with “#”) into the vcf file
  3. concatenate the tmp file to the vcf file

As everyone should prefer one-liners, here comes the one-liner version.

gawk 'BEGIN{Ch=""}{if($1~"#"){H[FNR]=$0;nh++}else{if($1!=Ch){Ch=$1;Out=$1".vcf";for(i=1;i<=nh;i++){print H[i] >Out}}; print $0 >Out}}' Genome.vcf

First, in the BEGIN block, we define Ch (a variable holding the current chromosome) and set it to the empty string. Then, while reading the file, if we are within the header (first field starting with “#”), we store the current line ($0) in element FNR of an array H and count the number of header lines (nh). Once past the header, if the first field ($1) differs from Ch (i.e. we’ve moved to another chromosome), we update Ch, set the output file name (Out) to $1".vcf", and write out, in order (from 1 to nh), all the header lines stored in H. In any case, we write the current line to the current output file.

Merge vcf files

Once you’ve split your vcf file, sooner or later you’ll need to merge the pieces back ;-). So let’s assume we have 31 vcf files (corresponding to chromosomes 1 to 29, plus X and M) that we want to merge; the following one-liner should do the trick.

gawk '{if(FILENAME==ARGV[1]){print}else{if($1 !~ "^#"){print}}}' `for i in $(seq 1 29) X M ; do echo chr${i}.vcf ; done ` >Genome.vcf

We have to read all the vcf files in order, write the first file in its entirety, and, for the following files, write everything but the header. First, we generate the list of vcf files with `for i in $(seq 1 29) X M ; do echo chr${i}.vcf ; done`.

Then, if we are reading the first file (FILENAME==ARGV[1]), we write every line. Otherwise, we print the current line if and only if it does not start with “#” (i.e. is not part of the header).
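
As a quick sanity check (a sketch, using the file names above), the merged file should contain exactly as many records as the per-chromosome files put together:

#Count the records (non-header lines) in the merged file...
grep -vc "^#" Genome.vcf
#...and compare with the total over the per-chromosome files
cat `for i in $(seq 1 29) X M ; do echo chr${i}.vcf ; done` | grep -vc "^#"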

Categories: Awk, NGS, Shell, SNP

eMarch 5: Tramp

March 29, 2013

We are all living in a modern world: the cloud is everywhere, and of course also in emacs. So this closing post of the series will talk about Tramp. Tramp is so feature-rich that one post won’t be enough; I’ll focus on the features I’ve used so far.

Basic use

I use ssh connections a lot. Tramp allows me to navigate through remote directories as if they were on my local machine. The basic commands C-x d and C-x C-f just need a special syntax describing the path to the file or directory you want to access.

With ssh, the syntax is similar to the scp one. Hence, to access the file test.txt in your home directory “~” on the remote host “rem1”, just do:

C-x C-f
/ssh:rem1:~/test.txt

To access the directory “/work/you/project” on the remote host “remote1.com” (or directly through its ip, “127.0.0.1”):

C-x d
/ssh:remote1.com:/work/you/project
/ssh:127.0.0.1:/work/you/project

Last, while visiting a remote directory, if you start a shell buffer with “M-x shell”, it will be started on the remote host. If a shell buffer is already open on your localhost, M-x shell will send you back to that buffer; the trick here is to use C-u M-x shell to start a second shell buffer.

Some other hints

Of course, the neat way to access a remote host is to create an ssh config file. Still, if your login on the remote host differs from the one you generally use, or if you go through a tunnel on a different port, you’ll have to modify your path.

Taking the former example, assume now that your login is “Name” and that you connect on port 9022 (an access via an ssh tunnel, say):

C-x d
/ssh:Name@remote1.com#9022:/work/you/project
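
Alternatively, the same information can live in your ssh config file (a sketch; since Tramp goes through ssh, the host alias works in the path too):

Host remote1
    HostName remote1.com
    User Name
    Port 9022

With such an entry in ~/.ssh/config, /ssh:remote1:/work/you/project should be enough.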

Other useful tips and alternatives can be found on the emacs wiki Tramp page, such as:

  • setting a default tramp user with (setq tramp-default-user "ClassicalLogin")
  • setting a default tramp protocol with (setq tramp-default-method "ssh")
Categories: emacs, Linux, Shell

Three tips making ssh connections easier

August 8, 2012

When working with linux on a grid, we generally spend a lot of time starting ssh connections, entering passwords, and opening new terminals again and again. In the following post, we’ll see three small tips that make life much easier.

Using an ssh key without a passphrase

To avoid entering a password, one can use ssh public/private keys. SSH allows you to generate a pair of keys: one will remain on your computer, the other (the public key) will be copied to the server where you want to log in. Once the public key is added on this server, you should not be asked for a password anymore. For illustration purposes, we will consider that your login is “you” and that the ssh server you want to connect to is “remote.host”.

1. Key generation

ssh-keygen -t rsa

  1. The normal place to store the key is ~/.ssh/, so just hit enter to accept this default
  2. The trick is here: hit enter to give an empty passphrase
  3. Hit enter a second time

2. Export the key to the remote server

ssh-copy-id -i ~/.ssh/id_rsa.pub you@remote.host

You’ll be asked for your password (normally for the last time) and then be disconnected.

3. Test the connection

Now just type:

ssh you@remote.host

And you’ll be automatically connected! You’ll then only have to repeat step 2 for each server on which you have ssh access.

Notes:

  • If you prefer to keep a passphrase (which is more secure, but less convenient), you can use ssh-agent (see the sketch below); more details can be found here.
  • If you already have a key (whether its passphrase is empty or not), you can change the passphrase with “ssh-keygen -f ~/.ssh/id_rsa -p”
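
For reference, the ssh-agent route mentioned in the first note boils down to this (a sketch, assuming your key is in the default location):

#Start the agent and load its environment variables into the current shell
eval `ssh-agent`
#Register your private key (you'll be prompted for the passphrase once per session)
ssh-add ~/.ssh/id_rsa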

Create an alias for your connection

I basically have ssh accounts on more than 10 different servers, each with its specificities (port number, X server availability, etc.).
So, to automate connections, I generally use aliases. An alias simply associates a keyword with a command.

Hence, in order to connect to the server remote.host with X forwarding enabled (-X) on port number 2222 (-p 2222), one can type in a terminal:

alias Remote="ssh -X you@remote.host -p 2222"

And then, from now on, typing Remote in the terminal will automatically connect you to remote.host: the shell executes the command associated with the keyword Remote.

The alias is defined interactively here, so the association will only last until your terminal is closed. In order to have this alias set up every time a terminal starts, you should add the previous alias command to either ~/.profile or ~/.bashrc (this depends on your configuration), as shown below.
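
For instance (a sketch; pick the file that your shell actually reads):

#Append the alias to ~/.bashrc so it is defined in every new terminal
echo 'alias Remote="ssh -X you@remote.host -p 2222"' >>~/.bashrc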

Screen

Screen is a very handy tool which allows you to multiplex terminals… and much more. Suppose you are not allowed to use ssh keys and must enter a password every time you connect through ssh. If screen is available on the server, you can connect to it only once and then create several “screens”, which are basically terminals.

So, to connect to the ssh server and create a first screen named Term1, type:

ssh you@remote.host
screen -S Term1

A new, clean terminal will appear. You can start to work and run, for instance, a very long process.
In order to let the process run while starting a new terminal, detach your first terminal with “Ctrl+a d”; you’ll then go back to your login terminal.
From the latter, you’ll be able to start a second screen.

screen -S Term2

which you can also detach, and so on…

To list all the screens that have been created, type from the login terminal:

screen -ls

You’ll see something like:
There are screens on:
22712.Term2    (08/07/12 22:07:24)    (Detached)
22682.Term1    (08/07/12 22:06:47)    (Detached)
2 Sockets in /var/run/screen/S-you.

To reattach a screen (e.g. Term1), use the -r flag with either the pid or the screen name:

screen -r Term1
screen -r 22682

This way you entered your password only once, but you have several “terminals”. Furthermore, a great advantage of screen is that you can close your ssh connection without killing the screens: you can let a long process run and reattach its screen from time to time to check on it.

Categories: Linux, Shell

Remember my name?

July 19, 2012

Some SGE options look apparently useless but, on second thought, turn out to be very powerful. -N is one such option. This post will try to exemplify the benefits of giving a name to your jobs.

Usage:

To give a name to your job, you just need to add -N Name to qsub. Thus, to give the name Job1 to a script, just type:

qsub -N Job1 Job.sh

… and that’s it! Ok, let’s now put it in a working context.

Usefulness of -N: situation 1

Suppose you submit the same shell script every month, let’s say “GenomicEvaluation.sh”. If you submit your job with the same name every month, with the following command:

qsub -N GenomicEvaluation GenomicEvaluation.sh

Then, a simple command such as:

qacct -j GenomicEvaluation | gawk '{if($1 ~ "start_time"){$1="";printf "%s , ",$0}; if($1 ~ "maxvmem"){print $2}}'  >MemConsumption.csv

would create a file reporting the evolution over time of the memory needed by your job!

Usefulness of -N: situation 2

Suppose you have 3 scripts, Job1.sh, Job2.sh and FinalJob.sh, and you don’t want FinalJob.sh to start before the first two have finished. The trick is then to submit the first two with explicit names:

qsub -N Job1 Job1.sh
qsub -N Job2 Job2.sh

and then submit FinalJob.sh, this time with the -hold_jid option followed by the two job names:

qsub -N FinalJob -hold_jid Job1,Job2   FinalJob.sh

Nice, isn’t it?

In both cases, remember that job IDs are assigned by SGE (in order of submission, based on the jobseqnum indication in ${SGE_ROOT}/default/spool/qmaster), so they can hardly be predicted, whereas names are totally under your control!

Usefulness of -N: situation 3

A more trivial use of -N is to pass a variable to your script: within the job environment, the name given at submission is assigned to the JOB_NAME environment variable, as the sketch below illustrates.
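
A minimal sketch (the script content and the name chr12 are made up for the example):

#Write a toy script that prints the name it was submitted under
#(the heredoc delimiter is quoted so ${JOB_NAME} is expanded at run time, not now)
cat >Job.sh <<'EOF'
#!/bin/bash
echo "I was submitted as ${JOB_NAME}"
EOF
#Submit it under an arbitrary name
qsub -N chr12 Job.sh
#The job output file (chr12.o<jobid>) should contain: I was submitted as chr12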

Categories: Linux, SGE, Shell

Tips: reading compressed files with fortran and named pipes

May 7, 2012

As genomic data keep accumulating, at some point we have to find solutions to reduce data size as much as possible. Working with beagle (a java software) made me notice that most “modern” languages have standard libraries to read compressed files directly. Alas, most of my tools are written in fortran, and the few examples I’ve seen of direct reading of compressed files used sockets or bindings to C system calls… which are far from the ease of use of the python or java libraries.

Fortunately, I recently discovered named pipes. The idea is to create a “file” that behaves like a pipe: one process reads whatever enters the pipe while another writes into it.

The advantages here are:

  1. You don’t have to make any change to your fortran program
  2. You don’t need to decompress a file entirely to read it (and therefore you spare some disk space)

One example:

You have a compressed file containing genotypes. You need to read the genotypes from your program, but you don’t have enough disk space to decompress the file entirely. The idea is then to create a named pipe.

Named pipe creation

A named pipe can be created with the following command:

mkfifo MyNamedPipe

A simple ls -al should confirm the creation of the named pipe with the following line:

prw-rw-r-- 1 francois francois     0 2012-01-07 19:53 MyNamedPipe

Redirections to the named pipe

Suppose you want to compress, on the fly, a big file generated by some (let’s say fortran) program. We’ll proceed in two steps (see the shell sketch below):

  1. In a first terminal, start the compression process reading from the pipe (e.g. cat MyNamedPipe | gzip >T.gz)
  2. In a second terminal, write some data to the named pipe
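
In pure shell, these two steps look like this (a sketch; seq simply stands in for any program producing data):

#Terminal 1: compress whatever comes through the pipe (blocks until a writer opens it)
cat MyNamedPipe | gzip >T.gz

#Terminal 2: write some data into the pipe; once seq finishes, gzip terminates and T.gz is complete
seq 1 100000 >MyNamedPipe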

Example with fortran

program testfifo

implicit none

integer :: i,j

open(11,file="MyNamedPipe")

do i=1,10000
 do j=i,10000
  write(11,*)i,j,j*i
 end do
end do

end program testfifo

Compilation

gfortran -o testfifo testfifo.f90

Execution

./testfifo

=> As nothing is reading from the named pipe yet, the program blocks, waiting for a reader.

In another shell:

cat MyNamedPipe | gzip >T.gz

And then, voilà: the compression starts and, in turn, the fortran program continues its normal execution.

Fortran

Things work, but it would be nice to handle the mkfifo step more smoothly. This can be achieved with the EXECUTE_COMMAND_LINE subroutine, called from within a fortran program.

Two problems must then be fixed:

  1. We should not wait for the system call to finish before starting the reading process.
  2. The reading loop should know when the file is totally decompressed (indeed, as any other stream could be redirected to the named pipe, no end-of-file signal will be received by the loop).

The first point is handled by the “&” sign, which puts the decompression process in the background. The second point is handled by writing an “end of file” character (EOT) to the named pipe once decompression is over. So basically, we end up with something like the following code.

program testfifo
implicit none
integer :: i,j,k,io
character(len=30) :: filename
character(len=600) :: Instruction
write(*,*)"Name of the xz file ?"
read(*,*)filename
!//Create a named pipe//
call EXECUTE_COMMAND_LINE("rm -f MyUnCompressedFile ; mkfifo MyUnCompressedFile")
!//Build the decompression instruction: decompress to the pipe, then write an//
!//end-of-file character (EOT); the trailing & puts the whole thing in background//
write(Instruction,'(a,a,a)')"(xz -dc ",trim(filename)," >MyUnCompressedFile; printf '\004' >MyUnCompressedFile) &"
!//Execute//
call EXECUTE_COMMAND_LINE(Instruction)
!//Open a connection to the named pipe//
open(11,file="MyUnCompressedFile")
!//Execute a normal reading loop; the EOT character triggers a read error, ending the loop//
do
 read(11,*,iostat=io)i,j,k
 if(io/=0)exit
 write(*,*)i,j,k
end do
!//Remove the named pipe//
call EXECUTE_COMMAND_LINE("rm -f MyUnCompressedFile")
end program testfifo

Compilation

gfortran -o testfifo testfifo.f90

Execution

echo T1.xz | ./testfifo

Note: we compile the program with gfortran. In fact, ifort doesn’t recognize (for the moment) the EXECUTE_COMMAND_LINE subroutine, which is part of the new fortran 2008 standard. To get the above program working with ifort, you should use “call system” instead of “call EXECUTE_COMMAND_LINE” (the former being an extension recognized by both ifort and gfortran).

One last step beyond

In order to get an even smoother way to read compressed data, we can write a module with functions that work like open and read… you can find such a module here. By the way, the reading part works fine, but I still haven’t found a good way to open a named pipe for output compression (any hints would be very appreciated!).

Categories: Fortran, Shell