
Aw{k}esomeness (2)

Second (short) post of the series on awk.

One of Perl’s mottos is the famous “there is more than one way to do it!” This is certainly true, and it can be convenient for writing a one-liner pretty fast, but as discussed here, it sometimes leads to poor performance that can be a real problem in routine work.

Let’s talk about a common case: you have a list of individuals (or Ids) in a flat genotype file, e.g. “Geno.csv”. Let’s say this file has 4 columns: the first contains the Id of the genotyped individual, the second a marker name, and the 3rd and 4th columns contain the alleles.

We now want to create a new file containing only the information related to individual “JB007”.

A common way to extract these lines is with grep:

grep JB007 Geno.csv > Out

This executes pretty fast on a file of reasonable size… leading you to the not-quite-true conclusion that you have reached acceptable efficiency.

First note that, if you believe your time is too valuable to be wasted on silly mistakes, you should think about what you are really asking your computer to do and how it could be “misunderstood”. Here, you are searching for the string “JB007” at any position in any line of the file. If you have an individual called “HJB0077”, the latter will also end up in your output! A good idea is therefore to use a better-defined search string, with the help of regular expressions, and to think about ways to check that the file you obtain matches your expectations.
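To see the pitfall concretely, here is a tiny sketch with a made-up two-line genotype file (the file content is purely illustrative):

```shell
# Hypothetical two-individual genotype file
printf 'JB007,SNP1,A,B\nHJB0077,SNP1,A,A\n' > Geno.csv

# Naive search: counts 2 lines, because "JB007" also occurs inside "HJB0077"
grep -c 'JB007' Geno.csv

# Anchored search: counts only the 1 intended line
grep -c '^JB007,' Geno.csv
```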

So if individual JB007 has been genotyped on a bovine50 chip, we may execute these lines to extract it from a larger file.

grep '^JB007,' Geno.csv > Out
# and to check that we have 54001 lines
wc -l Out

Explanation:

We are now looking in the file “Geno.csv” for any line starting with “JB007,”. The sign “^” is a regular expression anchor meaning the start of a line (or expression), and the addition of the comma “,” makes sure we are matching only the first field of the csv file.

Unfortunately, most of the time you will want to extract more than one individual. If “more than one” means two, you can still use regular expressions; for instance, our last script can turn into:

grep -E "^JB007,|^JB008," Geno.csv > Out


The additional -E indicates that grep has to interpret its argument as an extended regular expression; here the latter is composed of two expressions separated by a “|” standing for “or”. Note that each alternative needs its own “^” anchor, otherwise the second pattern would match anywhere in the line.

For sure, we would still get rather good efficiency, but if we had to extend our search to 5 or more expressions, it would become tedious and error prone. We must think about a better way to work. If you have read the last post, you might think of a for loop in a shell script, appending (“>>”) to the “Out” file, but here again this won’t be efficient.

Another way is to write, line by line, all the expressions to search for in a file called MyList.txt; then the following command makes the extraction you want:

 grep -f  MyList.txt Geno.csv >Out
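As a concrete sketch (the file contents below are made up, with one anchored pattern per line of MyList.txt):

```shell
# Tiny genotype file in the 4-column layout described above
printf 'JB007,SNP1,A,B\nJB008,SNP1,A,A\nJB009,SNP1,B,B\n' > Geno.csv

# One anchored pattern per line
printf '^JB007,\n^JB008,\n' > MyList.txt

grep -f MyList.txt Geno.csv > Out
grep -c '' Out   # counts the extracted lines: JB007 and JB008
```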

Explanation:
The “-f” flag asks grep to read from a file all the expressions to look for. But you will probably notice that if you are searching for hundreds of expressions, the process starts to take a very long time (which can be useful if it’s time for a coffee break :-). A trick worth mentioning, because I really found it very smart, is to check whether you wouldn’t be better off searching for the expressions *not* to be included; sometimes the problem is much easier seen that way. In the case where you only want to exclude several individuals from your file, something like this should work!

 grep -v -f  MyList.txt Geno.csv >Out

Explanation:

The “-v” flag inverts grep’s behavior; hence, now none of the strings listed in MyList.txt should be present in the out file.
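A quick sketch of the inverted match, again with made-up file contents:

```shell
printf 'JB007\n' > MyList.txt
printf 'JB007,SNP1,A,B\nJB008,SNP1,A,A\n' > Geno.csv

# -v keeps only the lines that match NONE of the listed patterns
grep -v -f MyList.txt Geno.csv > Out
cat Out   # only the JB008 line survives
```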

But of course there is a way to do all this with awk. As we saw in the last post, awk makes it very convenient to use hash tables. Using the same files as above, the idea is to store in a table which individuals we want to keep and which ones we want to exclude, then read the whole genotype file and print the desired lines. The following piece of code should work:

gawk -F"," '{if(NF==1){Inc[$1]=1}else{if(Inc[$1]==1){print $0}}}' MyList.txt Geno.csv > Out

Explanation:

  1. First note the order of the files to be read: we need the complete list of the individuals to be included before we start reading the genotype file, so MyList.txt must come first.
  2. The file to be read is a comma-separated values file, so with the “-F” flag we indicate that fields are separated by the sign “,”.
  3. The first instruction checks whether the number of fields equals 1 (in fact I assume we only have one column in MyList.txt); if so, we associate the value 1 with the expression in field 1.
  4. Then, when we are reading a file (the second one) with more than 1 field, we test whether the value stored in the Inc table for the expression in field 1 equals 1, and if so we print the whole line.
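As a side note, a slightly more idiomatic way to write the same two-file logic uses awk’s NR==FNR idiom (true only while the first file is being read) instead of the NF==1 test; this sketch, with made-up file contents, behaves like the script above:

```shell
printf 'JB007\nJB008\n' > MyList.txt
printf 'JB007,SNP1,A,B\nJB009,SNP1,B,B\nJB008,SNP2,A,A\n' > Geno.csv

# NR==FNR holds only for the first file: store the Ids to keep.
# For the second file, print lines whose first field is in the Inc table.
awk -F"," 'NR==FNR {Inc[$1]=1; next} Inc[$1]==1 {print $0}' MyList.txt Geno.csv > Out

cat Out   # the JB007 and JB008 lines
```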

This code is pretty efficient when you are working with both a huge list of animals to be selected and a huge genotype file. It lets you control where in the line matching should occur (with no use of regular expressions). And last, with some extra instructions you will be able to add information, or change one Id (let’s say a lab_Id) into another Id (an Interbull_ID). In fact, you’ll be able to do the equivalent of grep and join.
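For instance, a join-like Id swap could be sketched as follows (the IdMap.txt file and the Interbull-style Ids are purely hypothetical):

```shell
# Hypothetical two-column mapping file: lab_Id,Interbull_ID
printf 'JB007,FRA000123\nJB008,FRA000124\n' > IdMap.txt
printf 'JB007,SNP1,A,B\nJB009,SNP1,B,B\n' > Geno.csv

# First file: remember the new Id for each lab_Id.
# Second file: keep only mapped individuals and rewrite field 1.
awk -F"," -v OFS="," 'NR==FNR {New[$1]=$2; next} $1 in New {$1=New[$1]; print}' IdMap.txt Geno.csv > Out

cat Out   # the JB007 line, now carrying its Interbull-style Id
```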

Note that we could have checked more strictly whether we were reading MyList.txt or Geno.csv by using the FILENAME variable… we may look at it in deeper detail in a forthcoming post.

Categories: Awk, Genomic Selection, SNP