Learning by examples (2): Data formatting

The next step with our simulated data concerns the genotype file. As stated on the website of the QTLMAS Workshop: “‘Genotypes’ gives the SNP genotypes for each individual (3220 individuals = 20 sires, 200 dams and 3000 progenies). SNP are sorted by chromosomes and location on the chromosome. Two alleles are given for each SNP. This file contains 3220 lines and 19981 columns which correspond to: ID, (chromosome1, SNP1, allele1), (allele2), (chromosome1, SNP2, allele1), (allele2) … (chromosome1, SNP1998, allele1), (allele2), (chromosome2, SNP1, allele1), (allele2), … (chromosome5, SNP1998, allele1), (allele2).”
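As a quick sanity check of the figures quoted above, the expected column count can be recomputed from the description: one ID column plus 5 chromosomes times 1998 SNPs times 2 alleles. This is only a small arithmetic sketch; the variable name ncol is chosen here for illustration, and plain awk is used instead of gawk so that it runs with any POSIX awk.

```shell
# Expected number of columns: 1 ID column + 5 chromosomes x 1998 SNPs x 2 alleles
ncol=$(awk 'BEGIN { print 1 + 5 * 1998 * 2 }')
echo "$ncol"    # prints 19981
```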

First, is it true?

Let’s check with the following one-liner:

gawk '{NCol[NF]++}END{for (i in NCol){print i" "NCol[i]}}' genotype

Explanations:

Awk reads every line and, for each one, increments (++ operator) the entry of the hash table NCol whose key is NF (the number of fields in the line).
When the end of the file is reached, the END block loops over all the elements of NCol and prints each key (every line length encountered in the file) together with its value (here, the number of lines that had NF columns).

If everything is right, we obtain a single line of output, “19981 3220”, meaning that awk read 3220 lines containing 19981 fields each.
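To see what the check reports when a file is not well formed, here is a minimal sketch on a toy file. The file name toy_genotype and its content are made up for illustration, and awk stands in for gawk. Because the order of a "for (i in ...)" loop is not defined in awk, the output is piped through sort to make it stable.

```shell
# Toy file (hypothetical): two lines with 3 fields and one truncated line with 2 fields
printf '1 0 1\n2 1 1\n3 0\n' > toy_genotype

# Same counting idea as the one-liner: NCol[NF] counts the lines having NF fields
counts=$(awk '{NCol[NF]++} END {for (i in NCol) print i, NCol[i]}' toy_genotype | sort -n)
echo "$counts"
# 2 1
# 3 2
```

More than one line of output, as here, means that not all lines have the same number of fields.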

Now that we have checked that the file is correct, we will split it into 5 files (one per chromosome): most of our software works chromosome by chromosome, so we will then be able to run 5 jobs in parallel.

This can be done with this piece of code :

gawk '{for (i=2;i<=NF;i++){if((i-1)%((NF-1)/5)==1){BTA=1+(i-1)/((NF-1)/5); outfile="typ"int(BTA);printf "%6i ",$1>outfile};printf "%1i ",$(i)>outfile; if((i-1)%((NF-1)/5)==0){print " ">outfile}}}' genotype

Explanations:

The idea is, for each line, to walk through all the columns (the loop for (i=2;i<=NF;i++)) and to switch output file whenever necessary. First we need to identify when to switch. As we have 5 chromosomes of the same size, we make 5 chunks of (NF-1)/5 columns (the minus 1 accounts for the ID column). We start a new chromosome when the modulo (operator “%”) of the current column position, i-1, by the number of columns per file is equal to 1. Similarly, we end the line in the current file when the modulo is equal to 0.

The chromosome number is computed in a similar way, as 1+(i-1)/((NF-1)/5). As this division does not give an integer value, we use the function int(), which keeps only the integer part of the result. The name of the output file is built by a simple assignment, outfile="typ"int(BTA).

Every time we print a result we use the redirection “>” to make awk write to a specific file (instead of the standard output). Unlike in the shell, “>” in awk truncates the file only when it is first opened; later writes during the same run are appended to it.

Last point: by default “print” appends an end of line, so we use printf (formatted print) instead, which lets us control the exact format of the output.
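The modulo logic is easier to follow on a tiny input. The sketch below applies the same splitting idea to a made-up file with 2 individuals and 5 chromosomes of 1 SNP each (so 10 allele columns plus the ID). The names toy_genotype and toy_typ1 to toy_typ5 are assumptions chosen so that nothing overwrites the real typ files; awk and %d are used in place of gawk and %i for portability.

```shell
# Toy input (hypothetical): ID + 5 chromosomes x 1 SNP x 2 alleles = 11 columns
printf '101 1 2 1 1 2 2 1 2 2 1\n102 2 2 1 2 1 1 2 1 1 1\n' > toy_genotype

# Same splitting logic, spread over several lines; writes toy_typ1 .. toy_typ5
awk '{
    n = (NF - 1) / 5                      # allele columns per chromosome (2 here)
    for (i = 2; i <= NF; i++) {
        if ((i - 1) % n == 1) {           # first column of a chunk: choose the file, write the ID
            outfile = "toy_typ" int(1 + (i - 1) / n)
            printf "%6d ", $1 > outfile
        }
        printf "%d ", $i > outfile        # write the allele
        if ((i - 1) % n == 0)             # last column of a chunk: end the line
            print "" > outfile
    }
}' toy_genotype

cat toy_typ3    # each line: the ID followed by the 2 alleles of chromosome 3
```

Note that the test "== 1" assumes at least 2 columns per chromosome, which always holds here since every SNP has two alleles.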

We obtain 5 files called typ1, typ2, typ3, typ4 and typ5, which fit our needs.
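The field-counting trick used on the input file can be reused to check the result. The helper below is only a sketch: the function name check_cols is made up for illustration, and awk is used in place of gawk. If the split went as described, each file should report 3997 fields (the ID plus (19981-1)/5 = 3996 allele columns) on 3220 lines.

```shell
# Hypothetical helper: for each file, print "file: number_of_fields number_of_lines"
check_cols() {
    for f in "$@"; do
        awk -v name="$f" '{NCol[NF]++} END {for (i in NCol) print name ": " i " " NCol[i]}' "$f"
    done
}

# Check the five chromosome files, skipping any that does not exist
for f in typ1 typ2 typ3 typ4 typ5; do
    if [ -f "$f" ]; then
        check_cols "$f"
    fi
done
```

As before, more than one line of output for a given file would point to lines of unequal length.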

That’s all, folks!
