Archive for February, 2011

Aw{k}esomeness (2)

Second (short) post of the series on awk.

One of Perl’s mottos is the famous “there is more than one way to do it!” This is for sure true, and can be somewhat convenient for writing a one-liner pretty fast, but as discussed here, it sometimes leads you to poor performance, which can be a real problem in routine work.

Let’s talk about a regular case: you have a list of individuals (or Ids) in a flat genotype file, e.g. “Geno.csv”. Let’s say this file has 4 columns, the first containing the Id of the genotyped individual, the second one a marker name, and the 3rd and 4th columns containing the alleles.

We now want to create a new file containing only the information related to individual “JB007”.

A common way to extract these lines is by using grep:

grep JB007 Geno.csv >Out


This should be executed pretty fast on a file of reasonable size… leading you to the not-so-true conclusion that you have reached acceptable efficiency.

First note that, if you believe your time is too important to be wasted by stupid mistakes, you should think about what you are really asking your computer to do and how it could be “misunderstood”. Here, you are searching for the string “JB007” at any position of any line in the file. If you have an individual called “HJB0077”, the latter will also be included in your output! A good idea is therefore to use a better-defined search string, with the help of regular expressions, and to think about ways to test that the file you obtain fits your expectations.
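To see the pitfall concretely, here is a minimal sketch; the file Geno_demo.csv and the Ids it contains are invented for the demonstration:

```shell
# Build a tiny genotype file with two Ids that overlap as substrings
printf 'JB007,SNP1,A,B\nHJB0077,SNP1,A,A\n' > Geno_demo.csv
# Unanchored search: matches both lines, because "JB007" occurs inside "HJB0077"
grep -c 'JB007' Geno_demo.csv     # counts 2 matching lines
# Anchored search on the first field: matches only the intended individual
grep -c '^JB007,' Geno_demo.csv   # counts 1 matching line
```

The anchored pattern is exactly the trick used in the next command of the post.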

So if the individual JB007 has been genotyped on a bovine 50K chip, we may execute these lines to extract it from a larger file.

grep "^JB007," Geno.csv >Out
# and to check that we have 54001 lines
wc -l Out


Explanation :

We are now looking in the file “Geno.csv” for any line starting with “JB007,”. The character “^” is a regular expression anchor meaning the start of a line (or expression); the addition of the comma “,” makes sure that we are matching only the first field of the CSV file.

Unfortunately, most of the time you will want to extract more than one individual. If “more than one” means two, you can still use regular expressions; for instance, our last command can turn into:

grep -E "^JB007,|^JB008," Geno.csv >Out


Explanation:

The additional -E flag indicates that grep has to search for an extended regular expression; in fact, the latter is composed of two regular expressions separated by a “|”, standing for “or”.

For sure, we should still get rather good efficiency, but if we have to extend our search to 5 or more expressions, it will turn out to be tedious and error-prone. We must think about a better way to work. If you haven’t read the last post, you might think about a for loop in a shell script, plus concatenation with “>>” into the “Out” file, but here again, this won’t be efficient.
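For the record, the loop-based approach dismissed above would look something like this (file names and Ids are made up; the point is that each iteration re-reads the whole genotype file, which is exactly why it does not scale):

```shell
# Toy genotype file (invented content)
printf 'JB007,SNP1,A,B\nJB008,SNP1,A,A\nJB009,SNP1,B,B\nJB010,SNP1,A,B\n' > Geno_demo.csv
> Out_demo
# One full pass over the file per individual: N individuals means N reads of the file
for Id in JB007 JB008 JB009
do
    grep "^${Id}," Geno_demo.csv >> Out_demo
done
```

With three Ids the cost is invisible; with hundreds it becomes the coffee-break situation described below.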

Another way is to write, line by line, all the expressions to search for in a file called MyList.txt; then the following command should perform the extraction you want:

 grep -f  MyList.txt Geno.csv >Out


Explanation :
The “-f” flag asks grep to read from a file all the expressions to seek. But you’ll probably see that if you are looking for hundreds of expressions, the process will start to be very long (which can be useful if it’s time for a coffee break :-)). A trick worth mentioning, because I really found it very smart, is to check whether you wouldn’t be better off searching for the expressions not to be included; sometimes the problem is much easier seen this way. In case you only want to exclude several individuals from your file, something like this should work!

 grep -v -f  MyList.txt Geno.csv >Out


Explanation :

The “-v” flag inverts grep’s behavior; hence, none of the strings listed in MyList.txt will be present in the output file.
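A quick sanity check of the two modes, on invented file names and Ids: the lines selected by -f and the lines rejected by -v -f should partition the original file.

```shell
# Toy files (names and content invented for the demo)
printf 'JB007,SNP1,A,B\nJB008,SNP1,A,A\nJB009,SNP1,B,B\n' > Geno_demo.csv
printf '^JB007,\n^JB008,\n' > MyList_demo.txt
grep -f MyList_demo.txt Geno_demo.csv > Kept
grep -v -f MyList_demo.txt Geno_demo.csv > Excluded
# The kept and excluded lines should add up to the whole file
wc -l Kept Excluded Geno_demo.csv
```

This kind of check (line counts adding up) is a cheap way to test that the output fits the expectations, as advocated earlier in the post.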

But of course, there is a way to do all this with awk. As we’ve seen in the last post, awk allows a very convenient use of hash tables. Using the same files as above, the idea is to store in a table which individuals we want to keep, then read the whole genotype file and print the desired lines. The following piece of code should work:

gawk -F","  '{if(NF==1){Inc[$1]=1}else{if(Inc[$1]==1){print $0}' MyList.txt Geno.csv >Out  Explanation : 1. First note the order of the file to be read. We need to have the complete list of the individuals to be included before starting to read the genotype file. So MyList.txt should come first. 2. The file to be read is a comma separated values file, so we indicate that the fields flag “-F” will be separated by the sign “,” 3. The first instruction check that the number of fields equal 1,(in fact I assume we only have one column in MyList.txt),if so, we associate to the expression in field 1, a value of 1. 4. Then when we are reading a file (the second one) with more than 1 field, we test if the value associated to the variable in field 1 stored in Inc table is equal to 1, and if so I print the whole line. This code is pretty efficient if you are working with both huge list of animal to be selected and huge genotype file. It allow to control where in the file you should look for matching (with no use of regular expression). And last, with some extra instructions you will be able to add information, or change the Id (let’s say lab_Id) to another Id (Interbull_ID). In fact, you’ll be able to do the equivalent of grep and join. Note that we could have been more strict to check if we were reading MyList.txt or Geno.csv, by using the FILENAME instruction…we maybe look at tit in deeper details in a forthcoming post. Categories: Awk, Genomic Selection, SNP Aw{k}esomeness (1) February 7, 2011 Leave a comment Beeing pretty addicted to Awk for most of the data manipulation I commonly have to make during my working days, some colleagues comes to my office asking for the same kind of problem, so I was thinking it could be good to start a series of post on these common tricks every one should know in awk. This post will be the first of a (long ?) series and will present the table facilities in awk. Let’s describe some common problem you can have with genomic data. 1. 
You want to know if all the line of your file have the same number of columns 2. You want to know how many allele are observed for one particular marker Let say, we are reading a file called MyFile, for the problem 1, we will execute this line of code gawk '{N[NF]++}END{for (i in N){print i,N[i]}}' MyFile  Explanations : The first part of the instruction, is looking line by line to the number of column (Value of NF), then in the hash table N in the line associated to NF, we just add 1. So basically, for each line with let’s says 3 columns, one is added to the count of line with 3 columns. The ++ notation is exactly the same +=1 or N[NF]=N[NF]+1, but you know when you can be lazy, be lazy, (furthermore if you are addicted to code golf, this helps to have good score). The ending part (starting with the keyword END), will be executed when the file has been read. The idea here is to look into the table N, and for all its elements (i) print the associated values. So let’s say, the files we read had lines with 3, 4, or 5 columns, the table N will have 3 elements, containing the numbers of times we encountered a line with 3, 4 or 5 columns. The print statement will print the key (i) of table N and then the value stored in N for this specific key Note, the for …. in …. instruction, will look to all the element of the inspected table, with no specific order. If you want to order this, you’ll have to use the sort function. For the second example let’s have more fun ! Let’s say that we are interested by the allele at a marker. We just know that this allele is located in the column 4 gawk '{N[$4]++}END{for (i in N){print i,N[i]}}' MyFile


Explanations :
The only change is that now the key of our table is the allele read in column 4. At the end of the reading step, we look at all the alleles that have been stored in N and print the number of times each one appears.

Too easy, isn’t it? So let’s say that we now want code that can count the alleles at any column (Col).

gawk -v Col=4 '{N[$(Col)]++}END{for (i in N){print i,N[i]}}' MyFile


We use here the option -v, which allows us to declare a variable that will be known throughout the reading process. The variable will be called “Col” and will be assigned a value of 4. As we’ve done something that looks “smarter”, we could be tempted to go on and test all the markers of a file with code like this:

#Count the number of fields in MyFile
NField=$(head -n1 MyFile | wc -w)
#yes I know it's certainly not the best way to do it, but I like to change my habits !
for Mark in $(seq 3 ${NField})
do
echo "Looking at SNP in col $Mark"
gawk -v Col=$Mark '{N[$(Col)]++}END{for (i in N){print i,N[i]}}' MyFile
done

But we won’t, because it’s definitely not efficient at all: suppose you have 50 000 columns, you would read the file 49 999 times for nothing! In my opinion, the efficient way should look more like this:

gawk '{for (i=3; i<=NF;i++){N[i,$(i)]+=1}}END{for(elt in N){split(elt,T,"\034");print T[1],T[2],N[T[1],T[2]]}}' MyFile


Explanations :

So just remember that a good way to progress is to go step by step. To obtain this one-liner, I first started with my previous program. Then I tackled the fact that we wanted to go through all the markers for each line: this is done by the for (i=3;i<=NF;i++). We can also access any element of a 2D table by a pair of keys; here the keys are i (the number of the column inspected) and $(i), the value (here the allele) read in column i. Last but not least, we want to print out all the elements of the table N. Here the important thing to know is that the keys are merged together and stored as a 1D table; we therefore have to split the key (with the function split) on the ASCII character “\034” which links both parts of the key, obtaining two elements that are stored in the table T, and then print both keys (T[1] and T[2]) and the value assigned to them in table N.

Once you’ve got the principle in mind, I imagine that you’ll find a lot of possible applications in your day-to-day work. So have fun!

Categories: Awk, Genomic Selection, Shell, SNP

How many sires must one genotype ?

That is for sure not a forgotten lyric of “Blowin’ in the Wind”, but a pretty frequent question: how many sires should be genotyped in order to compute phases (and maybe do some imputation)?

In fact, most of the time the question is biased, because everybody would like to hear “genotype one individual, and you should be able to phase the whole population!”. Then, most of the time, people forget to think about:

1. What their main focus is (what do they want to do with their genotypes)
2. What the reference figures should be, to assess whether they have a good way to prove what they want to prove

If we try to view the problem from a statistical perspective: SNPs are bi-allelic, so at each marker an individual can be either homozygous (Ho) or heterozygous (He). If the Minor Allele Frequency (MAF) is equal to p, with $p \sim U(0,1)$, we have the classical probability for one individual to be homozygous:

$P(\text{Ho}) = p^2 + (1-p)^2$
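Staying in the spirit of the awk posts above, this probability can also be averaged numerically over the same MAF grid as the R code below uses; a quick one-liner sketch:

```shell
# Average P(Ho) = p^2 + (1-p)^2 over a MAF grid in [0.05, 0.95]
awk 'BEGIN{for(p=0.05;p<=0.95;p+=0.0001){s+=p*p+(1-p)*(1-p);n++};printf "%.3f\n", s/n}'
# prints 0.635
```

That is, on average about 63.5% of markers are homozygous, hence the 36.5% heterozygosity figure used later in the post.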

We can plot how this probability evolves for different MAF values with this piece of code:

#Defining a sequence of MAF (considering we have rejected SNP with MAF<5%)
Maf<-seq(0.05,0.95,0.0001)
#Make a graph
plot(Maf,((1-Maf)**2)+Maf**2,t="l",col="blue",ylim=c(0,1),xlab="MAF",ylab="Probability of being homozygous",main="Evolution of the probability of being homozygous depending on MAF of the SNP")
#Checking for correctness
summary(((1-Maf)**2)+Maf**2)


Looking at the results, we would say for one marker that we have a pretty big chance of “not failing in assigning the phase of any SNP”… this would be true, but there was actually no chance to fail at this task! Furthermore, if we are trying to obtain phases, we are seeking somehow “sharper” information, creating variation that can be related to phenotypes… which is not really what we obtain with homozygous markers.

In fact, the true challenge is to assign, for the remaining 36.5% of cases, the right allele to its phase… which, even with a random assignment, should be right with a probability of $p=\frac{1}{2}$! To put it in a nutshell, the real baseline for any phasing algorithm is on average $0.635 + 0.365/2 \approx 81.7\%$ of correct assignments!

#Defining a sequence of MAF (considering we have rejected SNP with MAF<5%)
Maf<-seq(0.05,0.95,0.0001)
#Make a graph
plot(Maf,((1-Maf)**2)+Maf**2,t="l",col="blue",ylim=c(0,1),xlab="MAF",ylab="Probability of being homozygous",main="Evolution of the probability of being homozygous depending on MAF of the SNP")
#Adding the expectation of correct assignment obtained randomly
lines(Maf,((1-Maf)*Maf)+((1-Maf)**2)+Maf**2,t="l",col="red")
#Computing new mean
summary(((1-Maf)*Maf)+((1-Maf)**2)+Maf**2)
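The same back-of-the-envelope check as before works for the baseline: adding half of the heterozygous cases, i.e. averaging $p^2+(1-p)^2+p(1-p)$ over the MAF grid, mirroring the R summary above:

```shell
# Baseline = P(Ho) + P(He)/2 = p^2 + (1-p)^2 + p(1-p), averaged over MAF in [0.05, 0.95]
awk 'BEGIN{for(p=0.05;p<=0.95;p+=0.0001){s+=p*p+(1-p)*(1-p)+p*(1-p);n++};printf "%.2f\n", s/n}'
# prints 0.82
```

This matches the ~81.7% baseline quoted above for a random assignment of the heterozygous cases.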


Now that we have the baseline figures in mind, let’s look at the improvement that can be achieved by a good choice of animals to genotype. For the (on average) 36.5% of markers at which an individual is heterozygous, we should be able to estimate the phase correctly if we have one parent (or one offspring) that is homozygous, which will be the case 63.5% of the time. If both parents are genotyped, the probability of not being able to assign the correct phase because both parents are heterozygous is $P(\text{2 parents He}) = P(\text{He})^{2}$, something around 13% of cases.

For the progeny, the first good point is that we can have more than two offspring for an individual. The probability of having no homozygous offspring among n is $P(\text{no Ho off among } n) = P(\text{He})^{n}$, which falls to a very small probability very fast with n!
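With $P(\text{He}) \approx 0.365$ on average, the chance that none of n offspring is homozygous indeed collapses quickly; a quick numerical check:

```shell
# P(all n offspring heterozygous) = 0.365^n, for n = 1..5
awk 'BEGIN{for(n=1;n<=5;n++){printf "n=%d %.4f\n", n, 0.365^n}}'
```

Already with 3 offspring the probability of being stuck drops below 5%, and below 1% with 5.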

So now consider the probability of a sire being correctly phased when one of its offspring has been genotyped: $P = P_{\text{parent heterozygous}} \cdot P_{\text{offspring homozygous}} + P_{\text{parent homozygous}}$

The first part is the contribution of a really “smart” sampling of genotyped individuals, whereas the important second part is the proportion of markers phased simply thanks to the existence of homozygotes.

#Defining a sequence of MAF (considering we have rejected SNP with MAF<5%)
Maf<-seq(0.05,0.95,0.0001)
#Make a graph
plot(Maf,(2*Maf*(1-Maf))*(1-(2*(1-Maf)*Maf))+(Maf**2)+(1-Maf)**2,t="l",col="blue",ylim=c(0,1),xlab="MAF",ylab="",main="Probability of phasing \n if one offspring is genotyped ")
#Add the contribution of n genotyped offspring
for(i in 1:10){lines(Maf,(2*Maf*(1-Maf))*(1-(2*(1-Maf)*Maf)**i),col=i)}


From the above graphic, we see at the top a blue line showing the probability of obtaining a correct phasing when a parent has one genotyped offspring. The black line at the bottom represents the contribution of the genotyped offspring to the total probability. The series of coloured lines above this line represents the contribution of more (2 to 10) genotyped offspring. We can see:

1. The extra well-phased markers due to genotyped offspring will always appear few compared to the de facto well-phased homozygous markers
2. Genotyping new offspring is of interest up to 4 or 5 offspring; beyond that, the extra benefit vanishes
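To put rough numbers on point 2, here is a back-of-the-envelope sketch of my own (a simplification: a marker is counted as phased if the sire is homozygous, or heterozygous with at least one homozygous offspring; parents and LD are ignored):

```shell
# Average P(marker phased) = P(Ho) + P(He)*(1 - P(He)^i) over MAF in [0.05, 0.95],
# for i genotyped offspring, with P(He) = 2p(1-p) at MAF p
awk 'BEGIN{for(i=1;i<=6;i++){s=0;n=0;for(p=0.05;p<=0.95;p+=0.001){he=2*p*(1-p);s+=(1-he)+he*(1-he^i);n++};printf "offspring=%d %.3f\n", i, s/n}}'
```

The averaged probability climbs quickly for the first few offspring and then flattens out, in line with the “4 or 5 offspring” observation above.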

Last, with a 50K chip you can rely on LD to some extent, assuming you have a good map. In fact, information given by flanking markers can always help to estimate the most likely phase for a marker that falls within the 36.5% of bad situations!

If you are really interested in phasing, meaning being able to identify variation at the chromosome level (which in fact excludes the “perfectly assigned to a phase” homozygous markers), you should avoid genotyping individuals that are not related to another genotyped individual one generation apart.

So, to try to give an answer, I would say that if you can genotype only a restricted number of individuals:

1. Focus on individuals with a lot of progeny (which in turn could also have a lot of progeny)
2. Keep an eye on your basic aim (if it is, for instance, QTL detection, then the phenotype quality of the individuals that will be genotyped and can easily be phased should be the most important thing to look at)
Categories: phasing, R, SNP