Home > Awk, NGS, Shell, SNP > Vcf processing with awk

Vcf processing with awk

Processing genome-wide data imply juggling between genome and chromosome scale. As instance, phasing is better done chromosome by chromosome, while annotation or recalibration may be better done at a genome wide scale. Thus splitting and merging vcf file by chromosome may be useful, I’ll report in this quick and dirty post ways to handle this problem.

Split a vcf file

The general idea when splitting vcf, is to write the content of the vcf file in a distinct file according to the chromosome (first field, $1)  they pertain to. This can conveniently be done with :

gawk '/^chr/{print >$1".tmp"}' Genome.vcf

Where for each line starting by “chr” (i.e. not the header file), line is printed in a file whose name is the concatenation of the what may be seen in the first field ($1) ant the extension “.tmp”.

The later is not perfect (we haven’t copied the header, thus this output won’t follow vcf standard). We need to add the header.

for i in *.tmp
name=`echo $i | sed 's/.tmp/.vcf/g' `
gawk '/^#/{print }' Genome.vcf >$name
cat $i >>$name
rm $i

Here for each “tmp” file created, we

  1. create a new vcf name (the later being created by switching tmp extension by vcf)
  2. Write is the vcf file the cheader (line starting with “#”
  3. Concatenate the tmp file to the vcf file.

As every one should prefer oneliner,  here come the oneliner version.

gawk 'BEGIN{Ch=""}{if($1~"#"){H[FNR]=$0;nh++}else{if($1!=Ch){Ch=$1;Out=$1".vcf";for(i=1;i<=nh;i++){print H[i] >Out}}; print $0 >Out}}' ../Genome.vcf

First, in the BEGIN instruction, we define Ch (a variable assigned to the current chromosome) and set it to null.  Then, when we start to read the file, if we are within the header (first field starting (^) with “#”), we store in each of the “FNR” element of an array (H) the current line ($0), and count the number of header line (nh). If we are not any longer reading a header line, then if the first field ($1) is different from the Ch variable (we’ve skipped to another chromosome), we change the output file name (Out) according to the first field($1), write in order (from 1 to nh) all the header lines stored in H,  and update the Ch variable to the value read in the first field. In any case we’ll write in the current line in the current output file.

Merge vcf files

Once you’ve split your vcf file, sooner or later, you’ll need to merge them ;-). So let’s assume, we have 31 vcf files (corresponding to chromosomes 1 to 29, + X and M), that we want to merge, the later oneliner should do the trick.

gawk '{if(FILENAME==ARGV[1]){print}else{if($1 !~ "^#"){print}}}' `for i in $(seq 1 29) X M ; do echo chr${i}.vcf ; done ` >Genome.vcf

We’ll have to read all the vcf file in order, write the whole first file, and in the later files, everything but the header. First, we generate a list of vcf file, `for i in $(seq 1 29) X M ; do echo chr${i}.vcf ; done `.

Then, if we read the first file (FILENAME==ARGV[1]), we write every line. Otherwise, if and only if, the line is not starting with “#” (i.e. is not in the header), we print the current line in the output file.

Categories: Awk, NGS, Shell, SNP Tags:
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: