
A subtle difference

A colleague came to my office today with a nasty problem. From her first description of the symptoms, I was ready to bet on the classic carriage return problem. In fact, the real cause was another classic subtlety of awk syntax.

Let’s describe the problem

We have a fairly classical genotype file that looks like this:

AClassicalID,A_WAY_TOO_LONG_SNP_NAME,A,C
AClassicalID,ANOTHER_TOO_LONG_SNP_NAME,A,A
AClassicalID,A_SHORTER_SNP_NAME,-,-

So there are 4 comma-separated values: an individual ID, a SNP name, and two alleles.

The aim was to print the lines where alleles were reported or, put another way, to exclude the lines with “-,-” alleles.

The basic awk one-liner for this could be:

gawk -F"," '{if($3 != "-"){print $0}}' file

We tell gawk that fields are separated by a comma (-F option) and test the value of the 3rd column: if it differs from “-”, we print the line.
The above code works perfectly... but my colleague, who is very cautious, wondered whether that was enough, so she tried the following bit of code:

gawk -F"," '{if($4 != "-"){print $0}}' file

And alas, this almost identical code doesn’t seem to work. All the lines of the file are printed.

Would you be smart enough to see why?

So the problem seems to come from field 4. In fact, you may have noticed that this last field has no comma on its right. Here the last field is not “-” but “-” followed by some spaces. Since != compares the whole string, the condition if($4 != "-") is always TRUE.
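A quick way to confirm this kind of diagnosis is to make the invisible characters visible. The sketch below is only an illustration: the sample file is made up for the occasion (the trailing spaces are an assumption mirroring the case described above), and it uses plain awk, which should behave like gawk here.

```shell
# Hypothetical sample: the last line carries three trailing spaces after the final "-".
sample=$(mktemp)
printf 'AClassicalID,A_WAY_TOO_LONG_SNP_NAME,A,C\nAClassicalID,A_SHORTER_SNP_NAME,-,-   \n' > "$sample"

# Bracket the 4th field and print its length so that padding becomes obvious.
shown=$(awk -F"," '{print "[" $4 "]", length($4)}' "$sample")
echo "$shown"
```

A length greater than 1 on a “-” line gives the game away. On GNU systems, `cat -A file` is another option: it marks each end of line with `$` and shows carriage returns as `^M`.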

How to fix this?

I did not have much time, so a quick and dirty trick is simply to use this code instead:

gawk -F"," '{if($4 !~ "-"){print $0}}' file

The change from “!=” to “!~” means that we are now looking for lines where field 4 does not match “-”. The "-" is then treated as a regular expression that can match anywhere in the field, so this excludes lines where field 4 is, for instance, “-” or “-   ”, but also “A-”.

As we know that the fourth column should contain nothing but A, C, G, T or “-”, the fix is acceptable.
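When there is a bit more time, a stricter alternative is to keep the exact comparison and strip the padding first. This is a minimal sketch on a made-up sample file, not what was actually run that day; the variable name `a` is just a scratch name, and the `\r` in the pattern is there in case a carriage return is also lurking.

```shell
# Hypothetical sample: only the "-,-" line has trailing spaces.
sample=$(mktemp)
printf 'ID1,SNP_A,A,C\nID1,SNP_B,A,A\nID1,SNP_C,-,-   \n' > "$sample"

# Copy field 4 into a scratch variable, strip trailing blanks/CR, then compare exactly.
# Working on a copy leaves $0 untouched, so kept lines are printed as they were read.
kept=$(awk -F"," '{a = $4; sub(/[ \t\r]+$/, "", a); if (a != "-") print $0}' "$sample")
echo "$kept"
```

Modifying `$4` directly would also work for the test, but awk would then rebuild the line with its output separator (a space by default), so the commas would disappear from the printed lines.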

Remarks

For other problems, this fix might not be as convenient. Sometimes it could even produce surprising results!

Another point worth noting: in standard Illumina genotype files, genotypes are, as far as I know, in alphabetical order, so you should never see a line with genotype “C,A” but rather “A,C”.

Last but not least, unknown genotypes are always paired: you should not see any “A,-” or the like. So my colleague could have been less conscientious and nobody would have noticed.
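To give an idea of such a surprise, here is a small made-up case (the file layout and values are purely hypothetical): a two-column file where “-” marks a missing measurement, but where real values can be negative.

```shell
# Hypothetical sample: s2 holds a legitimate negative value, s3 is missing.
sample=$(mktemp)
printf 's1,0.5\ns2,-0.7\ns3,-\n' > "$sample"

# The loose match drops every line whose 2nd field contains a "-" anywhere.
loose=$(awk -F"," '$2 !~ "-"' "$sample")
echo "$loose"
```

Only the s1 line survives: the negative value is discarded along with the missing one, which is probably not what was intended.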

Categories: Awk, Linux, SNP
  1. October 14, 2011 at 6:41 am

    For sure one could also use regular expressions (did I mention that I used a quick and dirty trick?)

    It could give something like:

    gawk -F"," '$4 !~ /-/' file

    Because the problem was more related to the 4th column.
    And even better:

    gawk -F"," '$4 !~ /^[-][ ]+/' file

    Because we are looking for a 4th column starting with one “-” followed by one or more spaces. Finally, refining an awk one-liner is always a matter of format definition: the better you define it, the safer.
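One caveat with this pattern: because it requires at least one space, a clean “-” with no padding at all would not match and its line would still be printed. A possible variant, sketched on a made-up sample and assuming the padding is made of plain spaces only, anchors both ends and allows zero or more spaces:

```shell
# Hypothetical sample: one clean "-" and one padded "-" in the 4th column.
sample=$(mktemp)
printf 'ID1,SNP_A,A,C\nID1,SNP_B,-,-\nID1,SNP_C,-,-   \n' > "$sample"

# /^-[ ]*$/ : field 4 is exactly one "-" followed by zero or more spaces, nothing else.
kept=$(awk -F"," '$4 !~ /^-[ ]*$/' "$sample")
echo "$kept"
```

Both “-” lines are dropped here, while a value such as “A-” would be kept since the whole field has to match.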

    PS: Thank you for the comment, because both minimalist instructions (using implied instructions like “print”) and regexps could be nice topics for future posts!
