Archive for May, 2013

Vcf processing with awk

May 21, 2013 Leave a comment

Processing genome-wide data imply juggling between genome and chromosome scale. As instance, phasing is better done chromosome by chromosome, while annotation or recalibration may be better done at a genome wide scale. Thus splitting and merging vcf file by chromosome may be useful, I’ll report in this quick and dirty post ways to handle this problem.

Split a vcf file

The general idea when splitting vcf, is to write the content of the vcf file in a distinct file according to the chromosome (first field, $1)  they pertain to. This can conveniently be done with :

gawk '/^chr/{print >$1".tmp"}' Genome.vcf

Where for each line starting by “chr” (i.e. not the header file), line is printed in a file whose name is the concatenation of the what may be seen in the first field ($1) ant the extension “.tmp”.

The later is not perfect (we haven’t copied the header, thus this output won’t follow vcf standard). We need to add the header.

for i in *.tmp
name=`echo $i | sed 's/.tmp/.vcf/g' `
gawk '/^#/{print }' Genome.vcf >$name
cat $i >>$name
rm $i

Here for each “tmp” file created, we

  1. create a new vcf name (the later being created by switching tmp extension by vcf)
  2. Write is the vcf file the cheader (line starting with “#”
  3. Concatenate the tmp file to the vcf file.

As every one should prefer oneliner,  here come the oneliner version.

gawk 'BEGIN{Ch=""}{if($1~"#"){H[FNR]=$0;nh++}else{if($1!=Ch){Ch=$1;Out=$1".vcf";for(i=1;i<=nh;i++){print H[i] >Out}}; print $0 >Out}}' ../Genome.vcf

First, in the BEGIN instruction, we define Ch (a variable assigned to the current chromosome) and set it to null.  Then, when we start to read the file, if we are within the header (first field starting (^) with “#”), we store in each of the “FNR” element of an array (H) the current line ($0), and count the number of header line (nh). If we are not any longer reading a header line, then if the first field ($1) is different from the Ch variable (we’ve skipped to another chromosome), we change the output file name (Out) according to the first field($1), write in order (from 1 to nh) all the header lines stored in H,  and update the Ch variable to the value read in the first field. In any case we’ll write in the current line in the current output file.

Merge vcf files

Once you’ve split your vcf file, sooner or later, you’ll need to merge them ;-). So let’s assume, we have 31 vcf files (corresponding to chromosomes 1 to 29, + X and M), that we want to merge, the later oneliner should do the trick.

gawk '{if(FILENAME==ARGV[1]){print}else{if($1 !~ "^#"){print}}}' `for i in $(seq 1 29) X M ; do echo chr${i}.vcf ; done ` >Genome.vcf

We’ll have to read all the vcf file in order, write the whole first file, and in the later files, everything but the header. First, we generate a list of vcf file, `for i in $(seq 1 29) X M ; do echo chr${i}.vcf ; done `.

Then, if we read the first file (FILENAME==ARGV[1]), we write every line. Otherwise, if and only if, the line is not starting with “#” (i.e. is not in the header), we print the current line in the output file.

Categories: Awk, NGS, Shell, SNP Tags:

Sources of proudness

May 18, 2013 Leave a comment

taurusastronautmodelsA2evolutionjpgApril have been the occasion of several changes and achievements on my parents’ farm.

Milking robots

First and foremost, after milking something like 1 Millions cows (~30 years,  ~50 cows, 2X a day…figures got impressive !), my parents invested in a milking robot. Launching was pretty challenging, since cows have to bear a pretty disruptive change in their life style (from a gregarious behavior to autonomy in fact). I took a whole week of “holidays” to  help for the switch between milking parlor to automate milking. Despite the 4 hours of sleep per day (we were handling cows 24/7) I really enjoyed the experience (and guess what, I’ve been pretty efficient reading articles in the night, while cows were being milked !).

A second achievement, related to the previous one, is that after one month using the robots three cows outperform our previous milking record per day (mostly the effect of being milked more often), so 3 cows reached 60Kg/milk per day, the best being able to produce 70kg/day. For those of my INRA’s colleagues lacking some references, I generally turn this into “These cows produce around 3 time their weight in a month, producing almost 2 tonnes of milk…which won’t be enough to pay the rent of my flat in Paris’ suburbs 😦 (The milk price was 299€/1000 l for 2013-1) …this is essentially why I don’t own a cow at home ! The impact of robots milking on production is pretty well known, I am just happy to witness it so quickly. Recently one of our first cow out of a “genomic sire” also reached an impressive 50kg milk per day, which is pretty nice for a first lactation…and a sign that our gEBV are not so bad (isn’t it ?)!

Production improvements

A third nice news –still regarding productivity– is that my parents ranked 35th in France in the last “Top List” (the latter rank herd based on their average productivity). The herd reach an average of 11500 kg milk (305d) per cow @ 32 Protein and 34 Fat. While we’ve been for decade within the top ranked ISU herd in France, this is always nice to see that genetic is not disconnected from phenotypes ! I am eager to see what our results will be next year, with the automated milking !

anzeige-hi-13-05-de-bossA forth (maybe a bit outdated) event was that we had our first cow reaching a total life production of 100 000 kg of milk. A funny story is that I used to tell Vincent Ducrocq (ages ago, when I was in master degree) that this cow was the living proof that we believed in the longevity index since her sire had a strong longevity index and was chosen on purpose. Years later, I also wan’t to see here a proof that longevity index are also verified on farms !

Radieuse’s great grand sons on the top gRZG

Last, the new gRZG leader in germany Boss is a great grand son of Radieuse (born and raised on our farm), and its half-brothers ranked 22nd (Platin) and 33rd (Sharp) in gRZG. While we already had several sires positively progeny tested, out of our cows, none reached a “leader” status. So I am very proud to see that this family of cow give such good results abroad.

Recent articles

May 3, 2013 1 comment

Hot out of the press:

  • A critique of Encode once again, maybe a bit harder to read than Graur’s one .
  • A very nice article on Wallace works — this remind me of an anecdotes, I heard in a CSH video session in Liege. Apparently one reviewer of Darwin’s work advised him to talk a bit more about pigeon…Wallace on his side ask Darwin to review his work….yes, knocking at the right door can make differences !
  • A nice article on short indel from the 1000 Genome project
  • One approach I never thought about but that can be very useful for sequence data analysis, source code available here
  • The funny  article of the month 😉

Software update

May 1, 2013 1 comment

April’s quick review of software’s updates :

GATK 2.5

Still some speed improvement, our bug got finally fixed ! Just wonder how broad’s staff will manage to make Haplotype Caller faster than Unified genotyper ! Wait and see.

BWA 0.7.4

After some small bugs, this version offer the same stability for BWA-MEM than for BWA-backtrack, nice to hear.

Picard-Tools 1.90

Still on the same releases rhythm.

IGV 2.3

The motif finder is a very nice new feature.


Not really an update, but ensembl 71 added sift score for cow.

Categories: Linux, NGS