### Archive

Archive for the ‘phasing’ Category

## Drawing a dag file obtained with beagle

There are no better way to explain what a DAG is than drawing it. Unfortunately, with dense SNP chip or even sequence data , there are no way one can make manually such draw. We therefore have to  think about automated ways to represent the DAG. Here is an attempt to do so.  We basically want to represent node and edges to do so we’ll use the graphviz language to describe  our plot and the dot tool to compile the instructions.

## Toy example

Graphviz allow to easily describe relations between node within graph object, let look at a minimal complete example. In a simple text file “e.g. Ex.dot”, we describe our graph with Graphviz syntax and then compile it to get an image of our edge.

#Create description file
cat >Ex.dot <<EOF
graph {
"Node1" -- "Node2" ;
}
EOF
#Compile it
dot  Ex.dot -Tpng -o Ex.png


And you should obtain something like the following image.

Note other formats can be generated, depending on your configuration and image to display, some other format could be more appropriate.

## One step forward

Now let’s go one step forward and work with a useful example, the DAG file generated by beagle.
Retrieve a DAG file from beagle website and extract it

#Download
wget http://faculty.washington.edu/browning/beagle/beagle_example.zip
#Extract dag file from the archive (forget about the nesting within directory)
unzip -j *.zip "*.dag*"
#Gunzip the file and keep the 6 first fields
gunzip -c *.dag* | gawk '{print $1,$2,$3,$4,$5,$6} ' >beagle.dag


We want to describe edges (the one between one parent at a given marker and a child at the following marker in the dag file). Basically in the dag file, we must print the child  and parent fields (respectively fields 3 and 4)  and declare that they are linked. To avoid confusion, an indication on which marker we are referring to will be necessary. Finally, our file should start with “graph {” and finish with “}”, which lead us to this first line of command .

#Create dot file
gawk 'BEGIN{print "graph G{"}{if(NR > 1 && NF>4){Et="\""$1"."$3"\"";Et2=Et $(1)+1"."$4;print Et" --\""$1+1"."$4"\"; "Et}}END{print "}"}' beagle.dag >Ex.dot
#Compile
dot -Tpng:cairo:gd Ex.dot -o Ex.png


Note :

• We  have to protect the ” character with \, watch out code can became pretty messy !
• Note also here the use of  “-Tpng:cairo:gd”, in fact the simple “-Tpng” may not work properly when number of markers grows up.

In order to get things that convey more information, we can add some eyes candy properties to our graph. Color of a node or edge can be set with the “[color=]” instruction whereas width of an edge, will be define with “penwidth=” instruction. We’ll define functions which based on the allele value (field 5) will color the edge either in red or blue, and another which depending on the count value, will define the width of the edges. Our command thus become :

#First version
gawk -v NTot=4000 'BEGIN{print "graph G{"}{if(NR > 1 && NF>4){Et="\""$1"."$3"\"";Et2=$(1)+1"."$4;print Et" --\""Et2"\"[color=\""Col($5)",1,1 \",style=filled, penwidth="Intensity($6,NTot)"];"}}END{print "}"}function Col(a){if(a==1){return 1}else{return 0.7}}; function Intensity(n,t){return (10*n/t)}  ' beagle.dag >Ex.dot
dot -Tpng:cairo:gd Ex.dot -o Ex.png
#Second version with a node color according to the number of haplotypes associated to each hidden state
gawk -v NTot=4000 'BEGIN{print "graph G{"}{if(NR > 1 && NF>4){if($1 != mp ||$3 != sp){print Et "[color=\"0.15,"(Cpt)/NTot",1 \",style=filled];";Cpt=$6}else{Cpt=Cpt+$6};mp=$1;sp=$3;Et="\""$1"."$3"\"";Et2=$(1)+1"."$4;print Et" --\""Et2"\"[color=\""Col($5)",1,1 \",style=filled, penwidth="Intensity($6,NTot)"];"}}END{print "}"}function Col(a){if(a==1){return 1}else{return 0.7}}; function Intensity(n,t){return (10*n/t)}  ' beagle.dag >Ex.dot
dot -Tpng:cairo:gd Ex.dot -o Ex.png


And voilà ! you should get something like the following picture.

Each column represent nodes associated to each marker. Going from left to right we can observe the evolution of the DAG from the beginning of the chromosome to the 50th marker. Color of the edges indicate the allele associated to it. The width of the edges is proportional to its frequency, so we can distinguish common pathway threw node. The color intensity of the node also indicate their frequency (light yellow node being rare).

## Evolutions of LD across generation

A classical way to present LD is ploting LD structure (with e.g. Haploview) anyway, while giving us a good overview of the LD in a certain population, this does not give any hints on how this structure evolve across generations.

So I decided to write for fun a little script, looking over 50 non overlapping generations the LD plot obtained by Haploview.

The population contain 1000 individuals per generation. The marker map have 100 markers spanning 7cM.

I then split my simulated genotypes file according to generations. And for each of them generate a LD plot.

And last but not the least I merged all the plot to obtain an animated gif with ImageMagick.

And here is the result !

A very funny thing is to see how fast some SNP are fixed (I made no assumption on MAF in founders population). I tested several scenario, with varying selection pressure . The one used here is rather simple :

• Generations 1 to 5 use 100 males 500 females,
• Generations 6 to 10 use 50 males/250 females,
• Generations 11 to 15  10 males/200 females,
• Generations 16 to 50  20 males/200 females.

While no LD is supposed at generation 1, we can see pretty fast that LD is created, and once pattern are observed they remain stable for a while.

## It has to happen !

You may have goals, that you haven’t reached yet , but you keep believing that one day you’ll achieve them, and… one day it finally happen . This has been the case for two of my secret goal !

As an unconditional emacs user, I knew that soon or later I would try to write my own emacs mode. I had the occasion to achieve this objective while testing a phasing program.

When comparing phased data, you generally want to align two phases and see on which SNP do you observe discrepancies. Nevertheless with thousands SNP, this looks like searching for a needle in a hay stack.  This can be done more efficiently with colored data. The idea is to just to associate a color to a list a character.

But as I am still a bit confused with lisp syntax, I just try to find some good example on the net.

And based on this write this piece of lisp code.

;;Define some attribute for each allele
(setq myAllele
'((" [1Aa]" . font-lock-function-name-face)
(" [2Cc]" . font-lock-warning-face)
(" [Gg]" . font-lock-constant-face)
(" [Tt]" . font-lock-comment-face)
(" [0\.]" . font-lock-builtin-face)
))
(define-derived-mode SNP-mode fundamental-mode
(setq font-lock-defaults '(myAllele))
(setq mode-name "SNP mode")
(setq truncate-lines t)
)

All this code is in a file called SNP.el stored in my ~/.emacs.d/

1. The first block of code define association between allele and font-lock attribute
2. The second block define the mode and ask for line truncation

And in order to allow use this mode automatically, I add in my .emacs

load "~/.emacs.d/SNP.el")

To go one step forward, as I know that most of my genotype or phase file are named like “phase*” or “typ*”, I add also a hook i.e. these 2 lines

(setq auto-mode-alist
(cons '("typ+[1-9]" . SNP-mode) auto-mode-alist)
(cons '("typ+[1-9]" . SNP-mode) auto-mode-alist))

Now everytime I open a genotype file (stariting either by “typ” or “phase”, the SNP-mode is loaded and I thus obtain this kind of screen.

Now there no way I could miss something awesome in the genome !

For the laziest of you the SNP-mode is available here

And for those wondering what was my other achievement, you are certainly not aware of the bedrace World championship !

## How many sires must one genotype ?

That is, for sure not a forgotten lyrics of “blowing in the wind”, but a pretty frequent question : How many sires should be genotyped in order to compute phases (and maybe do some imputation) ?

In fact, most of the time the question is biased because every body would like to hear “genotype one individual, you should be able to phase the whole population !”. Then most of the time people forget to think about what

1. Is there main focus (what do they want to do with their genotype)
2. Should be the reference figures to assess whether they have a good way to prove what they want to prove

If we try to view the problem in statistical perspective, SNP are bi-allelic, so for each marker an individual can be either Homozygous (Ho) or  Heterozygous (He). If the Minor Allelic Frequency (MAF) is equal to p, with $p \sim U(0,1)$

We will have the classical probability for one individual to be homozygous.

$P(\text{Ho}) = p^2 + (1-p)^2$

If we plot a simple graphic on how this probability evolve for different MAF, with this piece of code.

#Defining a sequence of MAF (considering we have rejected SNP with MAF<5%)
Maf<-seq(0.05,0.95,0.0001)
#Make a graph
plot(Maf,((1-Maf)**2)+Maf**2,t="l",col="blue",ylim=c(0,1),xlab="MAF",ylab="Probability of beeing homozygous",main="Evolution of the probability of beeing homozygous depending on MAF of the SNP")
#Checking for correctness
summary(((1-Maf)**2)+Maf**2)


So looking at the results, we would say for one marker that we have a pretty big chance of  “not failing in assigning the phase of any SNP”….this would be true, but there were no chance to fail in this task ! Furthermore, if we are trying to obtain phases, we seek for somehow a “sharper” information creating variations to be related to phenotypes…which is not really what we obtain with homozygous markers.

In fact the true challenge is to assign for the remaining 36.5 % case the right allele to its phase…which even with a random assignment, should be right with a probability of $p=\frac{1}{2}$ !  To put in a nutshell, the real baseline for any phasing algorithm is on average 81 % of correct assignment !

#Defining a sequence of MAF (considering we have rejected SNP with MAF<5%)
Maf<-seq(0.05,0.95,0.0001)
#Make a graph
plot(Maf,((1-Maf)**2)+Maf**2,t="l",col="blue",ylim=c(0,1),xlab="MAF",ylab="Probability of beeing homozygous",main="Evolution of the probability of beeing homozygous depending on MAF of the SNP")
#Adding the expectation of correct assignment obtain randomly
lines(Maf,((1-Maf)*Maf)+((1-Maf)**2)+Maf**2,t="l",col="red")
#Computing new mean
summary(((1-Maf)*Maf)+((1-Maf)**2)+Maf**2)


Now, that we have in mind the baseline figures, let’s look at the improvement that can be achieved by a good choice of animal to genotype. Among the (on average) 36.5% of individuals that won’t be homozygous, we should be able to estimate correctly the phase if we have one parents (or one offspring) that is homozygous, which should be the case in 63.5% of the time. If both parents are genotyped, the probability of not beeing able to assign correct phase because both parents are heterozygous is $p ( \text{2 Parents He})= p(\text{he})^{2}$ something around  13% of the case

For the progeny, the first good point is that we can have more than two offsprings for an individual. The probability to have one homozygous offspring among n as $p ( \text{1 Ho off among n})= p(\text{he})^{n}$ which will fall to a small probabilities very fast with n !

So now consider the probability of a sire to be correctly phase when one of its offspring have been genotyped. p  $p = (P_{\text{parent heterozygous}} . P_{\text{off homozygous}}) + (P_{\text{parent homozygous}})$

The first part  is the contribution of a “smart” sampling of genotype individuals really whereas, the important second part is  proportion of phased marker due to the existence of homozygous.

#Defining a sequence of MAF (considering we have rejected SNP with MAF<5%)
Maf<-seq(0.05,0.95,0.0001)
#Make a graph
plot(Maf,(2*Maf*(1-Maf))*(1-(2*(1-Maf)*Maf))+(Maf**2)+(1-Maf)**2,t="l",col="blue",ylim=c(0,1),xlab="MAF",ylab="",main="Probability of phasing \n if one offspring is genotyped ")
#Add the contribution of n genotyped offspring
for(i in 1:10){lines(Maf,(2*Maf*(1-Maf))*(1-(2*(1-Maf)*Maf)**i),col=i)}


From the above graphic, we see at the top a blue line showing the probability of having a good phasing if any parent have one offspring genotyped. The black line at the bottom, represent the contribution of the genoted offspring to the total probability. The series of coloured lines above this line represent the contribution of more (2 to 10) offspring genotyped. We can see :

1. The extra well phased marker due to offspring genotyped will always appears small compared to the de facto well phased homozygous marker
2. Genotyping new offspring have an interest till 4 or 5, then the extra benefits vanished

Last,  as with 50k chip you can rely on LD to some extent and assuming you got a good map. In fact information given by flanking markers can always help to estimate the most likely phase for a marker that would fall within the 36,5% of the bad situations !

If you are really interested in phasing, meaning beeing able to identify variation at the chromosome level, which exclude in fact the “perfectly assigned to a phase” homozygous marker, you should avoid genotyping individuals that are not related to another genotyped individual one generation apart.

So to try to give an answer, I would says if you can genotype a restricted number of individuals,

1. Focus on individuals with a lot of progeny (which in turn could also have a lot of progeny)
2. Keep an eye on your basic aim (if it’s for instance QTL detection, then phenotype’s quality of individual that will be genotype and could easily be phased, should be the most important thing to look at)
Categories: phasing, R, SNP