A subtle difference

October 4, 2011

My colleague came to my office with a nasty problem today. From her first description of the symptoms, I was ready to bet on the classic carriage return problem. In the end, the real problem came from another classic subtlety of awk.

Let’s describe the problem

We have a fairly classic genotype file that looks like this:

AClassicalID,A_WAY_TOO_LONG_SNP_NAME,A,C
AClassicalID,ANOTHER_TOO_LONG_SNP_NAME,A,A
AClassicalID,A_SHORTER_SNP_NAME,-,-

So there are 4 comma-separated values: an individual ID, a SNP name, and two alleles.

The aim was to print the lines where alleles were reported or, put another way, to exclude the lines with “-,-” alleles.

The basic awk one-liner for this could be:

gawk -F"," '{if($3 != "-"){print $0}}' file

We tell gawk that fields are separated by a comma (-F option) and we test the value of the 3rd column: if it is different from “-”, we print the line.
The above code works perfectly... but my colleague, who is very cautious, wondered whether it was enough, so she tried the following bit of code:

gawk -F"," '{if($4 != "-"){print $0}}' file

Alas, this almost identical code doesn’t seem to work: all the lines of the file are printed.

Would you be smart enough to see why?

So the problem seems to come from field 4. You may have noticed that this last field has no comma on its right, so everything up to the end of the line belongs to it. Here the last field is not “-” but “-” followed by some spaces. The condition if($4 != "-") is thus always TRUE.

How to fix this?

I did not have a lot of time, so a quick and dirty trick is to use this code instead:
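A quick way to check this kind of suspicion is to make the invisible characters visible. The sketch below is only an illustration: the file name sample_genotypes.csv and its two lines are made up for the example, and plain awk is used so that it also runs where gawk is not installed (gawk should behave the same way here).

```shell
# Hypothetical two-line sample; the last line ends with two trailing spaces.
printf 'AClassicalID,A_WAY_TOO_LONG_SNP_NAME,A,C\nAClassicalID,A_SHORTER_SNP_NAME,-,-  \n' > sample_genotypes.csv

# Show field 4 between brackets, with its length, so stray whitespace becomes visible.
awk -F"," '{print "[" $4 "] length=" length($4)}' sample_genotypes.csv

# Dump the last line byte by byte; a carriage return would show up as \r.
tail -n 1 sample_genotypes.csv | od -c
```

On this sample, the awk command reports the second line as [-  ] length=3 instead of [-] length=1, which is enough to explain why the comparison with "-" fails.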

gawk -F"," '{if($4 !~ "-"){print $0}}' file

Changing “!=” to “!~” means that we are now looking for lines where field 4 does not match the regular expression “-”, that is, does not contain a “-” anywhere. It will therefore exclude lines where field 4 is, for instance, “-” or “-   ”, but also “A-”.

As we know that the fourth column should contain nothing but “A”, “C”, “G”, “T” or “-”, the fix is acceptable.
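With a bit more time, a stricter variant is possible. The sketch below is not what we used that day; it assumes that the extra characters are only spaces, tabs or a carriage return, and it runs on a made-up sample file (sample_genotypes.csv is a hypothetical name). Plain awk is used so that it runs without gawk.

```shell
# Hypothetical sample; only the last line has trailing spaces after the final "-".
printf 'AClassicalID,A_WAY_TOO_LONG_SNP_NAME,A,C\nAClassicalID,ANOTHER_TOO_LONG_SNP_NAME,A,A\nAClassicalID,A_SHORTER_SNP_NAME,-,-  \n' > sample_genotypes.csv

# Option 1: anchored regex, excludes only a lone "-" followed by optional whitespace.
awk -F"," '{if($4 !~ /^-[ \t\r]*$/){print $0}}' sample_genotypes.csv

# Option 2: strip trailing whitespace from a copy of field 4, then compare exactly.
awk -F"," '{a = $4; sub(/[ \t\r]+$/, "", a); if(a != "-"){print $0}}' sample_genotypes.csv
```

Both commands print the two lines with reported alleles and skip the “-,-” line, while a value such as “A-” would no longer be excluded by accident.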

Remarks

For other problems, this fix might not be as convenient. Sometimes it could even produce surprising results!

Another point to note: in Illumina standard genotype files, genotypes are, as far as I know, in alphabetical order, so you should never see a line with genotype “C,A” but rather “A,C”.

Last but not least, unknown genotypes are always paired: you should not see any “A,-” or the like. So my colleague could have been less conscientious and nobody would have noticed.
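To give an idea of such a surprise, here is a made-up case that has nothing to do with the genotype file: we want to drop the lines whose second field is exactly “NA”. The file name sample_status.csv and its content are purely hypothetical.

```shell
# Hypothetical sample: only the first line has a second field equal to "NA".
printf 's1,NA\ns2,DNA_ok\ns3,ok\n' > sample_status.csv

# Unanchored match: also drops "DNA_ok" because it contains "NA".
awk -F"," '{if($2 !~ "NA"){print $0}}' sample_status.csv

# Anchored match: drops only the field that is exactly "NA".
awk -F"," '{if($2 !~ /^NA$/){print $0}}' sample_status.csv
```

The first command prints only s3,ok, while the second one keeps both s2,DNA_ok and s3,ok.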

Categories: Awk, Linux, SNP