How many sires’ genotypes in a floppy disk ?

I admit this question is more a fun way to compare several compression tools than a strictly correct or fair comparison!

This will be an occasion to write some code in Shell/AWK and R, to remember the good old floppy disks... and, most importantly, perhaps to start a fun discussion at coffee break!

So let's start. The tools tested are the ones readily available on any Linux distribution:

  1. bzip2
  2. gzip
  3. xz

The idea of our script is to create a genotype file (or at least to mimic one). We progressively increase its size and, at each step, compress the file to see whether the compressed file still fits on a floppy. As soon as the compressed file is bigger than 1.44 MB, the script stops.

The genotype file is a flat file containing one line per marker and per sample, i.e. 54001 lines per sample. Mostly because of the length of the SNP names, we need around 2.5 MB to store all the genotypes of one sample.
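The layout of TYP.csv is not shown here, so anyone wanting to try the loop without real data needs a stand-in file. The sketch below is only one way to mimic it: the function name `make_mock_genotypes`, the `SIRE…` identifiers, the SNP name pattern and the `;` separator are all hypothetical, and random alleles will not compress like real genotypes.

```shell
# Use gawk when available, otherwise fall back to the system awk
AWK=$(command -v gawk || command -v awk)

# Hypothetical layout: sample id; SNP name; allele 1; allele 2
# (one line per marker and per sample, as described in the text)
make_mock_genotypes() {
  # $1 = number of samples, $2 = number of markers per sample
  "$AWK" -v S="$1" -v M="$2" 'BEGIN {
    srand(42)                          # fixed seed so the file is reproducible
    split("A C G T", allele, " ")
    for (s = 1; s <= S; s++)
      for (m = 1; m <= M; m++)
        printf "SIRE%06d;Hypothetical-SNP-name-%06d;%s;%s\n", s, m, allele[int(rand() * 4) + 1], allele[int(rand() * 4) + 1]
  }'
}

# Small demonstration: 1 sample, 3 markers
make_mock_genotypes 1 3
```

Calling `make_mock_genotypes 100 54001 > TYP.csv` would then give a file with the same number of lines per sample as the real one, which is enough to exercise the script but not to reproduce the compression ratios.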

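for Zip in gzip bzip2 xz; do
  echo "Testing $Zip"
  N=0 ; T=0
  # Loop while the compressed file still fits on a floppy (size in KB)
  while [[ $T -le 1475 ]]; do
    # Add 10000 lines to the file
    echo $N
    N=$(( N + 10000 ))
    # Extract the first N lines
    gawk -v N=${N} '{if(NR<=N ){print $0}}' TYP.csv > out
    # Compress, appending the CPU timing to the Stat file
    /usr/bin/time -o Stat -a $Zip out
    # Size of the compressed file in KB
    T=$(du -k out.* | gawk '{print $1}')
    echo $Zip $T $N $Zip >> Stat
    rm out*
  done
done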

After some hours we get a file called Stat, mixing a lot of different information (for each step, two lines of timing written by /usr/bin/time, then our own echo line). Let's analyse it with some shell scripting:

gawk '{if(NR%3==1){gsub(/user/," ",$1);gsub(/system/," ",$2);printf "%5.3f,%5.3f,%5.3f,",$1,$2,$1+$2};if(NR%3==0){print $1","$2/1024","$3/54001}}' Stat
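To see what this one-liner produces, here is a small worked example on three made-up lines shaped like one step of the Stat file, assuming the default output format of GNU time. The timings and sizes are invented for illustration; the awk program is the same logic written over several lines.

```shell
# Use gawk when available, otherwise fall back to the system awk
AWK=$(command -v gawk || command -v awk)

# One step of the loop writes three lines: two from /usr/bin/time
# (GNU default format), then the echo line. Values below are made up.
sample_stat='0.52user 0.01system 0:00.54elapsed 99%CPU (0avgtext+0avgdata 1234maxresident)k
0inputs+2048outputs (0major+300minor)pagefaults 0swaps
gzip 1024 54001 gzip'

parsed=$(printf '%s\n' "$sample_stat" | "$AWK" '
  NR % 3 == 1 {   # first timing line: user and system CPU seconds
    gsub(/user/, "", $1); gsub(/system/, "", $2)
    printf "%5.3f,%5.3f,%5.3f,", $1, $2, $1 + $2
  }
  NR % 3 == 0 {   # echo line: tool, size in KB, number of lines
    print $1 "," $2 / 1024 "," $3 / 54001
  }')

echo "$parsed"
# prints: 0.520,0.010,0.530,gzip,1,1
```

Each step therefore becomes one CSV row with six columns: user CPU, system CPU, total CPU, tool, compressed size in MB and number of sires.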

And now let's try to draw some nice graphics from these data!

#Read the data (assuming the output of the gawk command above was saved as Stat.csv, a hypothetical file name)
Stat <- read.csv("Stat.csv", header=FALSE)
#Declare all the compression tools tested
Tools <- c("gzip","bzip2","xz")
#Set a vector for the maximum per algorithm
M <- rep(0,3) ; names(M) <- Tools
#Compute the largest number of sires per compression algorithm
for(Typ in 1:3){M[Typ] <- max(Stat[Stat[,4]==Tools[Typ],6])}
#Plot it
barplot(M,col="blue",main="Number of genotyped sires in a floppy disk \n depending on compression algorithm",ylab="Number of sires")
#Plot the compression time for xz
xz <- Stat[,4]=="xz"
plot(Stat[xz,6],Stat[xz,3],col="blue",type="l",xlab="Number of individuals in the file",ylab="Compression Time (in CPU second)",main="Evolution of compression time with number of individuals in the file")

Comparison of the number of sires

With regard to the number of genotyped sires that could fit on a floppy disk, the difference between compression algorithms is just HUGE.

These differences nevertheless have a cost: compression time (which fortunately increases linearly as the size of the file increases). While bzip2 and gzip take respectively 0.07 s and 1 s of CPU per new sire, xz takes 4 seconds of CPU, which can be really problematic. But we will discuss this point in another post... I have to deliver some data!
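For readers who want to check the "CPU seconds per new sire" figures on their own runs, the slope can be estimated per tool with an ordinary least-squares fit of total CPU time on the number of sires. This is only a sketch: the function name `cpu_per_sire` is hypothetical, it assumes the six-column CSV produced by the gawk command earlier (total CPU in column 3, tool in column 4, number of sires in column 6), and the rows fed to it below are made up and exactly linear.

```shell
# Use gawk when available, otherwise fall back to the system awk
AWK=$(command -v gawk || command -v awk)

cpu_per_sire() {
  # Reads CSV rows on stdin: user,system,total,tool,size,sires
  # Prints "tool slope", the least-squares slope of total CPU on sires
  "$AWK" -F',' '
    { t = $4; n[t]++; sx[t] += $6; sy[t] += $3; sxx[t] += $6 * $6; sxy[t] += $6 * $3 }
    END {
      for (t in n) {
        d = n[t] * sxx[t] - sx[t] * sx[t]
        if (d != 0) printf "%s %.3f\n", t, (n[t] * sxy[t] - sx[t] * sy[t]) / d
      }
    }' | sort
}

# Made-up timings, only to exercise the function
slopes=$(printf '%s\n' \
  '0.9,0.1,1.0,gzip,0.5,1' \
  '1.8,0.2,2.0,gzip,1.0,2' \
  '2.7,0.3,3.0,gzip,1.4,3' \
  '3.9,0.1,4.0,xz,0.1,1' \
  '7.8,0.2,8.0,xz,0.2,2' \
  '11.7,0.3,12.0,xz,0.3,3' | cpu_per_sire)

echo "$slopes"
# prints:
# gzip 1.000
# xz 4.000
```

On real measurements the points will not be perfectly aligned, so the slope is an average cost per additional sire rather than an exact one.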

Categories: Awk, Linux, R, Shell, SNP
  1. April 4, 2011 at 2:56 pm

    Is the 4 seconds of CPU time such a big problem? I mean the compression is much-much-much better with xz according to this, so if you start the process before a coffee break, it should be fine… Or not?

    • April 4, 2011 at 6:57 pm

      As long as you take coffee breaks longer than 4 seconds, you are right! In fact, for me the xz compression format is really a very nice solution, as long as it is integrated in a well-designed script that runs in the background.
      Anyway, there is a way to go even faster... I am just preparing a post on this!
