
Archive for March, 2011

How many sires’ genotypes fit on a floppy disk?

March 23, 2011

I admit this question is more a fun way to compare several compression tools than a totally rigorous or fair comparison!

This will be an occasion to write some code in Shell/AWK and R, to remember those good old floppy disks… and, most importantly, it may start a fun discussion at the coffee break!

So let’s start. The tools tested are the ones easily available on any Linux distribution:

  1. bzip2
  2. gzip
  3. xz

The idea of our script is to create a genotype file (or at least to mimic one). We will progressively increase its size and, at each step, compress the file to see whether the compressed file still fits on a floppy! When the compressed file gets bigger than 1.44 MB, we stop the script.

The genotype file is a flat file containing one line per marker and per sample, i.e. 54001 lines per sample. Due mostly to the length of the SNP names, we need around 2.5 MB to store all the genotypes of one sample.
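The original TYP.csv used below is not provided, but a file with the right shape is easy to mock up. Here is a minimal sketch, where the sample IDs, the SNP naming scheme and the semicolon layout are pure assumptions (in the real file, longer SNP names account for the ~2.5 MB per sample):

#Hypothetical mock-up of TYP.csv: 600 samples of 54001 marker lines each
gawk 'BEGIN{
srand()
for(s=1;s<=600;s++)
 for(m=1;m<=54001;m++)
  printf "IND%4.4i;SNP-MARKER-%5.5i;%1i;%1i\n",s,m,1+int(2*rand()),1+int(2*rand())
}' >TYP.csv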

for Zip in gzip bzip2 xz
do
echo "Testing $Zip"
N=0 ; T=0
#Loop while the compressed file still fits on a 1.44 MB floppy (~1475 kB)
while [[ $T -le 1475 ]]; do
#Add 10000 lines to the file
echo $N
N=$(( N + 10000 ))
#Extract the N first lines, then compress them while timing the compression
gawk -v N=${N} 'NR<=N {print $0}' TYP.csv >out
/usr/bin/time -o Stat -a $Zip out
#Size (in kB) of the compressed file
T=`du -k out.* | gawk '{print $1}'`
echo $Zip $T $N >>Stat
rm out*
done
done

After some hours we get a file called Stat, mixing several kinds of information (for each iteration, two lines of GNU time output followed by our own echo line). Let’s distill it into a clean CSV file with a gawk one-liner:

gawk '{if(NR%3==1){gsub(/user/," ",$1);gsub(/system/," ",$2);printf "%5.3f,%5.3f,%5.3f,",$1,$2,$1+$2};if(NR%3==0){print $1","$2/1024","$3/54001}}' Stat >Stat.csv
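For the record, each iteration leaves three lines in Stat: two written by GNU time and one by our echo. A group looks roughly like this (the timings below are made-up values, for illustration only):

1.23user 0.04system 0:01.30elapsed 97%CPU (0avgtext+0avgdata 4096maxresident)k
0inputs+2880outputs (0major+310minor)pagefaults 0swaps
xz 1102 310000

The one-liner turns every such group into one CSV line: user time, system time, total CPU time, tool, size in MB and number of sires (here 310000 lines / 54001 lines per sire):

1.230,0.040,1.270,xz,1.07617,5.74063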

And now let’s try to draw some nice graphics from these data!

#Read the data (columns: user, system and total CPU time, tool, size in MB, number of sires)
Stat <- read.csv("Stat.csv", header=FALSE)
#Declare all the compression tools tested
#(we avoid calling this vector T, which would mask R's shorthand for TRUE)
Algo <- c("gzip","bzip2","xz")
#Set a vector for the maximum per algorithm
M <- rep(0,3) ; names(M) <- Algo
#Compute the largest number of sires per compression algorithm
for(Typ in 1:3){ M[Typ] <- max(Stat$V6[Stat$V4==Algo[Typ]]) }
#Plot it
barplot(M, col="blue", main="Number of genotyped sires in a floppy disk\ndepending on compression algorithm", ylab="Number of sires")
#Plot the compression time (xz first, as it has the largest values)
plot(Stat$V6[Stat$V4=="xz"], Stat$V3[Stat$V4=="xz"], col="blue", t="l", xlab="Number of individuals in the file", ylab="Compression time (in CPU seconds)", main="Evolution of compression time with the number of individuals in the file")
lines(Stat$V6[Stat$V4=="bzip2"], Stat$V3[Stat$V4=="bzip2"], col="green", t="l")
lines(Stat$V6[Stat$V4=="gzip"], Stat$V3[Stat$V4=="gzip"], col="red", t="l")
legend("bottomright", c("xz","bzip2","gzip"), col=c("blue","green","red"), lty=1)

[Figure: comparison of the number of sires per compression algorithm]

With regard to the number of genotyped sires that can fit on a floppy disk, the difference between compression algorithms is just HUGE.

These differences nevertheless have a cost: compression time (which fortunately increases linearly as the size of the file increases). While bzip2 and gzip take respectively 0.07 s and 1 s of CPU per new sire, xz takes 4 seconds of CPU, which can be really problematic. But we will discuss this point in another post… I have some data to deliver!
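As a side note, these per-sire costs can be roughly recovered from Stat.csv by dividing, for each tool, the total CPU time by the total number of individuals (a quick sketch: since the times grow roughly linearly from zero, this ratio approximates the per-sire slope):

gawk -F, '{S[$4]+=$3; N[$4]+=$6} END{for(a in S){printf "%s: %.2f CPU s per sire\n",a,S[a]/N[a]}}' Stat.csv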

Categories: Awk, Linux, R, Shell, SNP

Learning by examples (3): Creating a map

March 16, 2011

Although not important for several kinds of analysis, we may need a map file to describe the simulated data.

So we need to create a map file containing one line per marker, with a meaningful name for each SNP, the chromosome number and a position in cM. This can be done with this tiny piece of code:


#Enumerate the possible chromosomes
for i in `seq 1 5`
do
#For each chromosome (1 to 5), write one line per SNP in the corresponding map file
gawk -v BTA=$i 'END{for (i=1;i<=(NF-1)/2;i++){printf "%6i M%1i%10.10i %2i %5.2f\n",i,BTA,i,BTA,(i-1)*0.05}}' typ$i >map$i
done

Explanations:

Line 2, we use a for loop with the seq command, delimited by two backquotes (AltGr+7 on a French keyboard); this basically executes seq, which writes the numbers from 1 to 5, and all this output is assigned in turn to the variable i.

Line 5, gawk is called with the -v argument, which passes one variable into the executed awk code. While processing the data, gawk considers that the variable BTA exists and has the value passed through -v, i.e. $i, the chromosome being processed. Don’t be confused here: normally, any variable defined within the shell is unknown to awk, and variables used by awk (for instance, i in the for instruction) have nothing in common with the ones used by the shell (they are private to awk).

We use awk here in a somewhat particular way: all the lines are read (even if no action is given for them), but the code to execute appears only in the END block, i.e. once we’ve been through the entire file. The reading step served to learn the number of fields in the file; we then subtract the ID column and divide by two (as there are always two alleles per SNP) to get the number of lines to write in the map file.
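A minimal illustration of this END trick, on a made-up two-line file with one SNP per line (an ID and two alleles, i.e. three fields):

printf '1 1 2\n2 2 1\n' | gawk 'END{print "fields:", NF, "SNP:", (NF-1)/2}'
#fields: 3 SNP: 1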

We make extensive use of printf for the output; the format description follows the C style.
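One can check what a single map line will look like, for instance for the third SNP of chromosome 2 (a toy call of the same printf):

gawk 'BEGIN{i=3;BTA=2;printf "%6i M%1i%10.10i %2i %5.2f\n",i,BTA,i,BTA,(i-1)*0.05}'
#     3 M20000000003  2  0.10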

We are now ready for phasing and QTL detection!

Categories: Awk, QTL-detection, Shell, SNP

Learning by examples (2): Data formatting

March 14, 2011

Next step for our simulated data. As mentioned on the website of the QTLMAS Workshop: "“Genotypes” gives the SNP genotypes for each individual (3220 individuals = 20 sires, 200 dams and 3000 progenies). SNP are sorted by chromosomes and location on the chromosome. Two alleles are given for each SNP. This file contains 3220 lines and 19981 columns which correspond to: ID, (chromosome1, SNP1, allele1), (allele2), (chromosome1, SNP2, allele1), (allele2) … (chromosome1, SNP1998, allele1), (allele2), (chromosome2, SNP1, allele1), (allele2), … (chromosome5, SNP1998, allele1), (allele2)."

First, is it true?

Let’s check with the following one-liner:

gawk '{NCol[NF]++}END{for (i in NCol){print i" "NCol[i]}}' genotype

Explanations:

Awk reads all the lines and adds one (the ++ operator) to the entry of the hash table NCol indexed by NF (the number of fields in the line).
When the end of the file is reached, the END instruction asks awk to go through all the elements of the hash table NCol and to print all the keys (all the possible line lengths encountered in the file) together with their values (here, the number of times a line with NF columns was seen).

If everything is right, we obtain a single line of results, "19981 3220", meaning that awk read 3220 lines containing 19981 fields each.
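The behaviour of the one-liner is easy to check on a toy input (note that awk does not guarantee the order of the keys of a hash table):

printf 'a b\na b\na\n' | gawk '{NCol[NF]++}END{for (i in NCol){print i" "NCol[i]}}'
#1 1
#2 2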

Now that we have checked that the file is correct, we’ll split it into 5 files (one per chromosome): we’ll then be able to run 5 jobs in parallel, and most of our software works chromosome by chromosome.

This can be done with this piece of code:

gawk '{
for (i=2;i<=NF;i++){
#Start of a new chromosome chunk: compute its number and write the ID in a new file
if((i-1)%((NF-1)/5)==1){BTA=1+(i-1)/((NF-1)/5); outfile="typ"int(BTA); printf "%6i ",$1>outfile}
printf "%1i ",$(i)>outfile
#End of the chunk: terminate the line in the current file
if((i-1)%((NF-1)/5)==0){print " ">outfile}
}
}' genotype

Explanations:

The idea is, for each line, to go through all the fields (the instruction for (i=2;i<=NF;i++)) and to change the output file every time we have to.

First, we must identify when to change files: as we have 5 chromosomes of the same size, we make 5 chunks of (NF-1)/5 columns (the minus 1 takes the ID column into account). We start a new chromosome when the modulo, obtained with the operator "%", of the current column by the number of columns per file equals 1; similarly, we finish a file when that modulo equals 0. The number of each chromosome is computed in a similar way; as we want the division to give an integer value, we use the function int(), which keeps the integer part of the result. The name of the output file is then created by a simple assignment, outfile="typ"int(BTA).

Every time we want to print a result, we use the redirection ">" to force awk to write into a specific file (instead of the standard output). Last point: by default, print adds an implicit end of line, so we use printf (formatted print) instead; this way we are able to control the exact format of the output.

We obtain 5 files called typ1, typ2, typ3, typ4 and typ5, which fit our needs.
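To be sure the split went well, we can reuse the column-counting one-liner from above on each new file; each one should report 3220 lines of 3997 fields (the ID column plus 19980/5 = 3996 alleles):

for f in typ1 typ2 typ3 typ4 typ5
do
echo $f
gawk '{NCol[NF]++}END{for (i in NCol){print i" "NCol[i]}}' $f
done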

That’s all, folks!

Categories: Awk, Shell, SNP

Learning by examples (1): Data download

March 9, 2011

Two (or three) things happened to enter my TODO list (I could have decided to focus only on the most important task, but okay, let’s try to manage all at once):

  1. I have to set up some sample data in order to help people work on the (hopefully) soon-to-be-declared “in production” cluster
  2. We recently discussed with colleagues the idea of testing the simulated data available for the coming 15th QTLMAS Workshop, held in Rennes
  3. I’d like to present sample shell script code that helps automate the retrieval of data

So let’s start with the first step of the task, data download: I just had to run this little script to obtain all the simulated data available on the site.

#Declare the page containing links to the gzipped files
Add=https://colloque.inra.fr/qtlmas/Home-page/News/The-simulated-data-set
#First, retrieve the website page
wget -O webpage $Add
#Second, find all the addresses of the gzipped files and download them
for File in `gawk -F \" '/\.gz"/{print $2}' webpage`
do
wget $File
done
#Gunzip the files (gunzip removes the .gz archives by itself)
gunzip *.gz
#Clean the directory before leaving
rm webpage

Explanation:

First, I went (manually :-s) on the web site to find the page where the links to the data were available. That address is then declared as the variable “Add”.

Second, I use wget (generally available on most Linux distros), which downloads the page/document located at the given URL (here, the content of “Add”); thanks to -O, I indicate the name of the file where the download should be stored (here, webpage).

Then, among all the text contained in the file “webpage”, we only want the lines containing the address of a gzipped file (so containing a “.gz” extension). As we are looking for something declared in HTML code, we know that it’s highly probable that this address will be contained within HTML code like:

<a href="something.gz">

We therefore tell gawk that the field separator is the double quote ("), with the help of the -F flag.
So basically gawk first looks at all the lines containing the characters .gz followed by a quote (this is the /\.gz"/ part of our instruction), then, for these lines, prints only the second field.
I know, we are pretty lucky to have a well-written web page with one line per file to download; it could have been much more tricky, but let’s say that such simple cases can also happen!
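You can convince yourself of the extraction logic with a single toy HTML line:

echo '<a href="something.gz">' | gawk -F \" '/\.gz"/{print $2}'
#something.gz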

The whole awk instruction is contained within two backquotes (the ones obtained with AltGr+7); this simply asks the shell to execute the expression, and all its output is then handed to the for loop.
The variable File thus takes, one after the other, each address where a file to download is located, and wget downloads them.

At last, we only have to gunzip the files and leave the place as clean as possible.

Isn’t it marvelous? Next step: make some changes to the data so that they fit our expected data format.

A change in my habits?

March 8, 2011


For several reasons, which could be summarized in one, “too much work”, I didn’t find time to post lately. Nevertheless, I think I still have some little programming tricks worth publishing on this website (maybe the one true problem is somehow a tendency to perfectionism).

So I’ll try to turn from my Debian habit, “release when it’s ready”, to a more pragmatic way of thinking, “release early, release often”. From now on, I’ll try to write maybe shorter posts, just in order to keep a good writing rhythm.

I have furthermore put in the manuals section of the site several works (some of which date back to the first years of my PhD)… and which are somehow still not totally perfect, version < 1.0 (even though they’ve been used by quite a lot of people as far as I know!). I guess that sometimes waiting too long for a perfect release, and thus publishing nothing, is by far the worst option.

Categories: Uncategorized