Archive for April, 2012

Drawing a dag file obtained with beagle

April 8, 2012 Leave a comment

There are no better way to explain what a DAG is than drawing it. Unfortunately, with dense SNP chip or even sequence data , there are no way one can make manually such draw. We therefore have to  think about automated ways to represent the DAG. Here is an attempt to do so.  We basically want to represent node and edges to do so we’ll use the graphviz language to describe  our plot and the dot tool to compile the instructions.

Toy example

Graphviz allow to easily describe relations between node within graph object, let look at a minimal complete example. In a simple text file “e.g.”, we describe our graph with Graphviz syntax and then compile it to get an image of our edge.

#Create description file
cat > <<EOF
graph {
"Node1" -- "Node2" ;
#Compile it
dot -Tpng -o Ex.png

Result1 And you should obtain something like the following image.

Note other formats can be generated, depending on your configuration and image to display, some other format could be more appropriate.

One step forward

Now let’s go one step forward and work with a useful example, the DAG file generated by beagle.
Retrieve a DAG file from beagle website and extract it

#Extract dag file from the archive (forget about the nesting within directory)
unzip -j *.zip "*.dag*"
#Gunzip the file and keep the 6 first fields
gunzip -c *.dag* | gawk '{print $1,$2,$3,$4,$5,$6} ' >beagle.dag

We want to describe edges (the one between one parent at a given marker and a child at the following marker in the dag file). Basically in the dag file, we must print the child  and parent fields (respectively fields 3 and 4)  and declare that they are linked. To avoid confusion, an indication on which marker we are referring to will be necessary. Finally, our file should start with “graph {” and finish with “}”, which lead us to this first line of command .

#Create dot file
gawk 'BEGIN{print "graph G{"}{if(NR > 1 && NF>4){Et="\""$1"."$3"\"";Et2=Et $(1)+1"."$4;print Et" --\""$1+1"."$4"\"; "Et}}END{print "}"}' beagle.dag >
dot -Tpng:cairo:gd -o Ex.png

Note :

  • We  have to protect the ” character with \, watch out code can became pretty messy !
  • Note also here the use of  “-Tpng:cairo:gd”, in fact the simple “-Tpng” may not work properly when number of markers grows up.

In order to get things that convey more information, we can add some eyes candy properties to our graph. Color of a node or edge can be set with the “[color=]” instruction whereas width of an edge, will be define with “penwidth=” instruction. We’ll define functions which based on the allele value (field 5) will color the edge either in red or blue, and another which depending on the count value, will define the width of the edges. Our command thus become :

#First version
gawk -v NTot=4000 'BEGIN{print "graph G{"}{if(NR > 1 && NF>4){Et="\""$1"."$3"\"";Et2=$(1)+1"."$4;print Et" --\""Et2"\"[color=\""Col($5)",1,1 \",style=filled, penwidth="Intensity($6,NTot)"];"}}END{print "}"}function Col(a){if(a==1){return 1}else{return 0.7}}; function Intensity(n,t){return (10*n/t)}  ' beagle.dag >
dot -Tpng:cairo:gd -o Ex.png
#Second version with a node color according to the number of haplotypes associated to each hidden state
gawk -v NTot=4000 'BEGIN{print "graph G{"}{if(NR > 1 && NF>4){if($1 != mp || $3 != sp){print Et "[color=\"0.15,"(Cpt)/NTot",1 \",style=filled];";Cpt=$6}else{Cpt=Cpt+$6};mp=$1;sp=$3;Et="\""$1"."$3"\"";Et2=$(1)+1"."$4;print Et" --\""Et2"\"[color=\""Col($5)",1,1 \",style=filled, penwidth="Intensity($6,NTot)"];"}}END{print "}"}function Col(a){if(a==1){return 1}else{return 0.7}}; function Intensity(n,t){return (10*n/t)}  ' beagle.dag >
dot -Tpng:cairo:gd -o Ex.png

And voilà ! you should get something like the following picture.

Each column represent nodes associated to each marker. Going from left to right we can observe the evolution of the DAG from the beginning of the chromosome to the 50th marker. Color of the edges indicate the allele associated to it. The width of the edges is proportional to its frequency, so we can distinguish common pathway threw node. The color intensity of the node also indicate their frequency (light yellow node being rare).