Home > Awk, Genomic Selection, Shell, SNP > Aw{k}esomeness (1)

Aw{k}esomeness (1)

Awesomness personifiedBeeing pretty addicted to Awk for most of the data manipulation I commonly have to make during my working days, some colleagues comes to my office asking for the same kind of problem, so I was thinking it could be good to start a series of post on these common tricks every one should know in awk.

This post will be the first of a (long ?)  series and will present the table facilities in awk.

Let’s describe some common problem you can have with genomic data.

  1. You want to know if all the line of your file have the same number of columns
  2. You want to know how many allele are observed for one particular marker

Let say, we are reading a file called MyFile, for the problem 1, we will execute this line of code

gawk '{N[NF]++}END{for (i in N){print i,N[i]}}' MyFile

Explanations :
The first part of the instruction, is looking line by line to the number of column (Value of NF), then in the hash table N in the line associated to NF, we just add 1. So basically, for each line with let’s says 3 columns, one is added to the count of line with 3 columns. The ++ notation is exactly the same +=1 or N[NF]=N[NF]+1, but you know when you can be lazy, be lazy, (furthermore if you are addicted to code golf, this helps to have good score).

The ending part (starting with the keyword END), will be executed when the file has been read.  The idea here is to look into the table N, and for all its elements (i) print the associated values. So let’s say, the files we read had lines with 3, 4, or 5 columns, the table N will have 3 elements, containing the numbers of times we encountered a line with 3, 4 or 5 columns. The print statement will print the key (i) of table N and then the value stored in N for this specific key

Note, the for …. in …. instruction, will look to all the element of the inspected table, with no specific order. If you want to order this, you’ll have to use the sort function.

For the second example let’s have more fun ! Let’s say that we are interested by the allele at a marker. We just know that this allele is located in the column 4

gawk '{N[$4]++}END{for (i in N){print i,N[i]}}' MyFile

Explanations :
The only change is that now we read the key of our table the allele read in column 4. At the end of the reading step, we will have to look to all the allele that have been stored in N and print the number of time they appear.

Too easy isn’t it ? So let’s say that we now want to have a code that can count the allele at any column (col).

gawk -v Col=4 '{N[$(Col)]++}END{for (i in N){print i,N[i]}}' MyFile

We use here the option -v, which allow us to declare a variable that will be known through out the reading process. The value will be called “Col” and will be assigned a value of 4. As we’ve done something that look “smarter”, we could be tempted to go on and test all the markers of a file with a code like this .

#Count the number of fields in MyFile
NField=`head -n1 MyFile | wc -w` #yes I know it's certainly not the best way to do it, but I like to change my habits !
for Mark in `seq 3 ${NField} `
 echo "Looking at SNP in col $Mark"
 gawk -v Col=$Mark '{N[$(Col)]++}END{for (i in N){print i,N[i]}}' MyFile

But we won’t because it’s definitely not efficient at all, suppose you have 50 000 columns, you will read 49 999 times the file for nothing !

In my opinion, the efficient way should look more like this :

gawk '{for (i=3; i<=NF;i++){N[i,$(i)]+=1}}END{for(elt in N){split(elt,T,"\034");print T[1],T[2],N[T[1],T[2]]}}' MyFile

Explanations :

So just remember that a good way to progress is to go step by step. So to obtain this one-liner, I first start  with my previous program. Then, I tackle the fact that we wanted to go through all the markers for each line. This is done by the for (i=2;i<=NF;i++), we can also access any element of a 2D table by a pair of keys. Here the keys are i (the number of the column inspected, and $(i), the value (here the allele) read in column i. Last but not the least we want to store out all the element of the table n. Here the important thing to know is that keys are merge together and stored as a 1D table, we therefore have to split the key (with the function split) to remove the ascii character “\ 034” which links both parts of the key and obtain two elements that will be stored in the table T, and then print both keys (T[1] and T[2]) and the value  assigned to them in table N.

Once you got the principle in mind,  I imagine that you’ll find a lot of possible application in your day to day work. So have fun !

Categories: Awk, Genomic Selection, Shell, SNP
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: