### Archive

Archive for the ‘Awk’ Category

## Hourglass distribution

August 6, 2013 1 comment

Some days ago, I went through an article on mutation rate mentioning the “hourglass distribution”. The illustration of this distribution is pretty obvious from the plot in the article, with for some chromosome a pretty clear reduction in number of SNP around centromeres. I just wondered if we could also observe this phenomenon on bos taurus. So i just had a look on dbSNP138 data to have some clues.

### First version

```
#Retrieve dbSNP
wget ftp://ftp.ncbi.nih.gov/snp/organisms/cow_9913/VCF/vcf_*vcf.gz
```

And then just run the following R code.

```#Open an output file
png("hourglass.png",1920,1080)
#split the screen in 30
par(mfrow=c(3,10))
#Create a vector of chromosomes
chr=c(seq(1:29),"X")
#run a loop on chromosomes
for( i in chr){
#plot the snp density
plot(density(Pos\$V2),col="grey",main=paste("BTA : ",i,sep=""))
polygon(density(Pos\$V2),col=rgb(0,0,0.3,0.04))
}
dev.off()
```

Notes on code :

• read.table is used directly on a gzipped file (very handy trick)
• dbsnp file have a 15 lines long header, so I use the skip =15 option
• I had some glitches while reading some files, (a problem with # of fields) option fill=TRUE, just fix it
• Plot are nice….but polygon are even better, so I first plot the density and then add a polygon on it
• rgb function is a simple way to obtain transparency, so after the three values for red, green and blue, I add the alpha setting to a value of 0.04

And after a while….you ‘ll obtain something like this

I must say I was a bit disappointed. At least, there are no clear pattern as can be seen in human. All the bovine chromosomes are acrocentric  this may explain why generally no clear decrease in SNP density can be seen. The pattern observed on chromosome 12, 18 and 10 were even more surprising. I am wondering if there could be some sampling bias. Concerning the pattern on BTA23, the latter could be due to MHC, known to exhibit a great diversity. Density computation may also blur a bit things.

### Second version

The basic work being done, we can try to investigate others hypotheses. As instance, are SNP and Indel distributed the same way along the genome ? With some slight changes, the code become :

```#Open an output file
png("SNPandindelDistribution.png",1920,1080)
#split the screen in 30
par(mfrow=c(3,10))
#Create a vector of chromosomes
chr=c(seq(1:29),"X")
#run a loop on chromosomes
for( i in chr){
#plot the snp density
plot(density(Pos\$V2[grep("snp",Pos\$V8,perl=TRUE)]),col="grey",main=paste("BTA : ",i,sep=""))
polygon(density(Pos\$V2[grep("snp",Pos\$V8,perl=TRUE)]),col=rgb(0,0,0.3,0.04))
dense=density(Pos\$V2[grep("in-del",Pos\$V8,perl=TRUE)])
lines(dense\$x,0.5*dense\$y,col="red")
}
dev.off()
```

Notes on code :

• in dbsnp variant are coded either as snp or in-del, we extract line with the grep function accordingly
• I tweaked a bit the indel line in order to avoid scale problems.

We observe roughly the same pattern between snp and indel, albeit indel distribution may be smoother. I was expecting some discrepancies (relying on the article by Montgomery et al. but here again, we are only dealing with 1 base indel, which is not really representative of short indel  in general). I may try to check this results with my own results.

## Serial snapshots with IGV batch

IGV is a very handy tool. Nevertheless, scrolling from one position to another may be fastidious. Second, the bad aspect of this very user-friendly software, is that you can spend hours looking here and there, with at last no backtrack of your discoveries.

Thankfully, IGV can run in batch mode, allowing, for a targeted list of positions, to take screenshots and store the later in a folder. We’ll illustrate in the following with a small example :

### Setting up a session

To test our script, we will first download some publicly available data.

```#Download a vcf file (on beagle4 website, you may have to change file name according to last release)
wget http://faculty.washington.edu/browning/beagle/test.r1099.vcf.gz
gunzip test.r1099.vcf.gz
#Create an index
igvtools index test.r1099.vcf
#(You can alternatively prefer to use igvtools from igv GUI)
#Launch igv
igv.sh
```

To check if everything is right, load the vcf file test.r1099.vcf. Then move to positions chr22:20,000,000-20,100,000.

Set everything to your taste, and save the session :  File, save session, you should obtain a xml file.

### Running IGV in batch mode

Now, let’s consider you are interested in some particular position. Let’s say we’ve stored several positions of interest in a csv file. Our aim is to create an IGV batch file.

Basically, we’ll have to load the session, set a directory to store the screenshots, and then move from one position to the other. A very crude version could therefore be. (I know : Why csv ? Because a lot of person still use excel 😦 )

```#Create a fake positions list
cat >Liste.csv <<EOF
chr22;20070000
chr22;20081000
EOF

#Create a ss directory
mkdir IMG
#Write the header of the script
cat >Batch.igv <<EOF
new
snapshotDirectory IMG
EOF
#Now parse the csv file
gawk -F ";" -v R=10000 '{print "goto "\$1":"\$2 -R"-"\$2 + R"\nsnapshot Screen"NR".png"}END{print "exit"}' Liste.csv >>Batch.igv
```

From, igv, go to Tools, Run a batch script. Load Batch.igv, when all the process will be done, IGV will terminate and you’ll find your screenshots in IMG.

For an even more automated version you can use the script “PrepIgvBatch.sh” available in the scriptotheque

## Vcf processing with awk

Processing genome-wide data imply juggling between genome and chromosome scale. As instance, phasing is better done chromosome by chromosome, while annotation or recalibration may be better done at a genome wide scale. Thus splitting and merging vcf file by chromosome may be useful, I’ll report in this quick and dirty post ways to handle this problem.

## Split a vcf file

The general idea when splitting vcf, is to write the content of the vcf file in a distinct file according to the chromosome (first field, \$1)  they pertain to. This can conveniently be done with :

`gawk '/^chr/{print >\$1".tmp"}' Genome.vcf`

Where for each line starting by “chr” (i.e. not the header file), line is printed in a file whose name is the concatenation of the what may be seen in the first field (\$1) ant the extension “.tmp”.

The later is not perfect (we haven’t copied the header, thus this output won’t follow vcf standard). We need to add the header.

```for i in *.tmp
name=`echo \$i | sed 's/.tmp/.vcf/g' `
gawk '/^#/{print }' Genome.vcf >\$name
cat \$i >>\$name
rm \$i
done
```

Here for each “tmp” file created, we

1. create a new vcf name (the later being created by switching tmp extension by vcf)
2. Write is the vcf file the cheader (line starting with “#”
3. Concatenate the tmp file to the vcf file.

As every one should prefer oneliner,  here come the oneliner version.

```gawk 'BEGIN{Ch=""}{if(\$1~"#"){H[FNR]=\$0;nh++}else{if(\$1!=Ch){Ch=\$1;Out=\$1".vcf";for(i=1;i<=nh;i++){print H[i] >Out}}; print \$0 >Out}}' ../Genome.vcf
```

First, in the BEGIN instruction, we define Ch (a variable assigned to the current chromosome) and set it to null.  Then, when we start to read the file, if we are within the header (first field starting (^) with “#”), we store in each of the “FNR” element of an array (H) the current line (\$0), and count the number of header line (nh). If we are not any longer reading a header line, then if the first field (\$1) is different from the Ch variable (we’ve skipped to another chromosome), we change the output file name (Out) according to the first field(\$1), write in order (from 1 to nh) all the header lines stored in H,  and update the Ch variable to the value read in the first field. In any case we’ll write in the current line in the current output file.

## Merge vcf files

Once you’ve split your vcf file, sooner or later, you’ll need to merge them ;-). So let’s assume, we have 31 vcf files (corresponding to chromosomes 1 to 29, + X and M), that we want to merge, the later oneliner should do the trick.

```gawk '{if(FILENAME==ARGV[1]){print}else{if(\$1 !~ "^#"){print}}}' `for i in \$(seq 1 29) X M ; do echo chr\${i}.vcf ; done ` >Genome.vcf
```

We’ll have to read all the vcf file in order, write the whole first file, and in the later files, everything but the header. First, we generate a list of vcf file, `for i in \$(seq 1 29) X M ; do echo chr\${i}.vcf ; done `.

Then, if we read the first file (FILENAME==ARGV[1]), we write every line. Otherwise, if and only if, the line is not starting with “#” (i.e. is not in the header), we print the current line in the output file.

Categories: Awk, NGS, Shell, SNP Tags:

## Drawing a dag file obtained with beagle

There are no better way to explain what a DAG is than drawing it. Unfortunately, with dense SNP chip or even sequence data , there are no way one can make manually such draw. We therefore have to  think about automated ways to represent the DAG. Here is an attempt to do so.  We basically want to represent node and edges to do so we’ll use the graphviz language to describe  our plot and the dot tool to compile the instructions.

## Toy example

Graphviz allow to easily describe relations between node within graph object, let look at a minimal complete example. In a simple text file “e.g. Ex.dot”, we describe our graph with Graphviz syntax and then compile it to get an image of our edge.

```#Create description file
cat >Ex.dot <<EOF
graph {
"Node1" -- "Node2" ;
}
EOF
#Compile it
dot  Ex.dot -Tpng -o Ex.png
```

And you should obtain something like the following image.

Note other formats can be generated, depending on your configuration and image to display, some other format could be more appropriate.

## One step forward

Now let’s go one step forward and work with a useful example, the DAG file generated by beagle.
Retrieve a DAG file from beagle website and extract it

```#Download
wget http://faculty.washington.edu/browning/beagle/beagle_example.zip
#Extract dag file from the archive (forget about the nesting within directory)
unzip -j *.zip "*.dag*"
#Gunzip the file and keep the 6 first fields
gunzip -c *.dag* | gawk '{print \$1,\$2,\$3,\$4,\$5,\$6} ' >beagle.dag
```

We want to describe edges (the one between one parent at a given marker and a child at the following marker in the dag file). Basically in the dag file, we must print the child  and parent fields (respectively fields 3 and 4)  and declare that they are linked. To avoid confusion, an indication on which marker we are referring to will be necessary. Finally, our file should start with “graph {” and finish with “}”, which lead us to this first line of command .

```#Create dot file
gawk 'BEGIN{print "graph G{"}{if(NR > 1 && NF>4){Et="\""\$1"."\$3"\"";Et2=Et \$(1)+1"."\$4;print Et" --\""\$1+1"."\$4"\"; "Et}}END{print "}"}' beagle.dag >Ex.dot
#Compile
dot -Tpng:cairo:gd Ex.dot -o Ex.png
```

Note :

• We  have to protect the ” character with \, watch out code can became pretty messy !
• Note also here the use of  “-Tpng:cairo:gd”, in fact the simple “-Tpng” may not work properly when number of markers grows up.

In order to get things that convey more information, we can add some eyes candy properties to our graph. Color of a node or edge can be set with the “[color=]” instruction whereas width of an edge, will be define with “penwidth=” instruction. We’ll define functions which based on the allele value (field 5) will color the edge either in red or blue, and another which depending on the count value, will define the width of the edges. Our command thus become :

```#First version
gawk -v NTot=4000 'BEGIN{print "graph G{"}{if(NR > 1 && NF>4){Et="\""\$1"."\$3"\"";Et2=\$(1)+1"."\$4;print Et" --\""Et2"\"[color=\""Col(\$5)",1,1 \",style=filled, penwidth="Intensity(\$6,NTot)"];"}}END{print "}"}function Col(a){if(a==1){return 1}else{return 0.7}}; function Intensity(n,t){return (10*n/t)}  ' beagle.dag >Ex.dot
dot -Tpng:cairo:gd Ex.dot -o Ex.png
#Second version with a node color according to the number of haplotypes associated to each hidden state
gawk -v NTot=4000 'BEGIN{print "graph G{"}{if(NR > 1 && NF>4){if(\$1 != mp || \$3 != sp){print Et "[color=\"0.15,"(Cpt)/NTot",1 \",style=filled];";Cpt=\$6}else{Cpt=Cpt+\$6};mp=\$1;sp=\$3;Et="\""\$1"."\$3"\"";Et2=\$(1)+1"."\$4;print Et" --\""Et2"\"[color=\""Col(\$5)",1,1 \",style=filled, penwidth="Intensity(\$6,NTot)"];"}}END{print "}"}function Col(a){if(a==1){return 1}else{return 0.7}}; function Intensity(n,t){return (10*n/t)}  ' beagle.dag >Ex.dot
dot -Tpng:cairo:gd Ex.dot -o Ex.png
```

And voilà ! you should get something like the following picture.

Each column represent nodes associated to each marker. Going from left to right we can observe the evolution of the DAG from the beginning of the chromosome to the 50th marker. Color of the edges indicate the allele associated to it. The width of the edges is proportional to its frequency, so we can distinguish common pathway threw node. The color intensity of the node also indicate their frequency (light yellow node being rare).

## Using Fortran preprocessor (2)

Following last post on Fortran and preprocessing, we ‘ll try to see other use of preprocessing instructions.

Fortran accept several preprocessor instructions, we’ve illustrated the conditional preprocessor’s instructions, let’s look now at some case studies and how preprocessor’s instructions can help.

## Debugging

Suppose you are given a very buggy code, (I presume that you would not dare writing bugs by your own :-p) your compiler can’t help you to trace the origin of the error (which is obviously a naïve statement), so you decide to start a long and boring task consisting in adding through out the code some tags that will tell you were are you in the code…the normal way is to write several time a line like :

```write(*,*)"program reach the point A"
....
write(*,*)"program reach the point B"
```

And then you ‘ll have to find where is A or B, to identify where the “first” bugs happens.
Thanks to the “__LINE__” instruction, you’ll be able to retrieve the line of code executed. So our instructions will be something like this

```write(*,*)"program reach the line",__LINE__
```

Note, that the “_” is doubled before and after the LINE keyword.

Let’s assume your source code is split in several files, so to avoid any confusion, we’ll use another instruction

__FILE__“, which, as you’ve probably guess, contain the name of the file containing the code that is being executed.
We now have something like :

```write(*,*)"program reach the line",__LINE__," in file ",__FILE__
```

And now the “hammer”, insert this line in your code with someting like :

```for FIC in *.F90
do
mv \$FIC \${FIC}.bck
gawk '{if((NR%10)==0){print \$0 "\n write(*,*)\"program reach the line\",__LINE__,\" in file \",__FILE__"}else{print \$0}}'&nbsp;  \${FIC}.bck >\$FIC
done
```

Compile and test !

NB : The \” symbols are there because otherwise ” would be interpreted by awk as the end of a character string.

Once, you’ve corrected all the bugs, turn things back to normal with let’s say… grep :

```for FIC in *F90
do
mv \$FIC \${FIC}.dbg
grep -v&nbsp; "program reach the line"&nbsp;\${FIC}.dbg >>\${FIC}
done
```

## Avoiding debugging

Most often, when people report a bug, they never mention where does the binary they used come from, this, of course, can very often explain why the binary doesn’t make the new feature describe in your manual (at least the user read the manual !), but even with some indications on the software version, it is not always easy to know when the code has been compiled (and if you’ve eventually change your makefile as instance).

So of course, here I’ll present both __DATE__ and __TIME__ instructions, the latter giving the date of compilation and the time (in conjunction with a makefile under version control, you should be able to know everything about how your binary was compiled).

A nice way to use these instructions is by mean of command arguments, with a flag “–version” as instance.

so the example given in the command line post would then be :

```...
if(narg>0)then
!loop across options
do cptArg=1,narg
call get_command_argument(cptArg,name)
case("--version","-v")
write(*,*)"This is program TestArg : Version 0.1"
write(*,*)"Binary compiled the",__DATE__," at ",__TIME__
stop
case default
end select
end do
end if
...
```

Note for people using svn as instance, you can also add an \$Id: \$ flag in your code so that you ‘ll then know both the revision ID and the compilation date.

```...
if(narg>0)then
!loop across options
do cptArg=1,narg
call get_command_argument(cptArg,name)
case("--version","-v")
write(*,*)"This is program TestArg : Version \$Id: \$"
write(*,*)"Binary compiled the",__DATE__," at ",__TIME__
stop
case default
end select
end do
end if
...
```

So now, you’ve seen two possible uses of preprocessor’s instructions, we’ll see in the next post of the serie some other use of preprocessor.

Categories: Awk, Fortran, Linux, Shell

## A subtle difference

October 4, 2011 1 comment

My colleague came to my office for a vicious problem today. Based on the first description of the symptom, I was about to bet for the classical carriage return problem. Anyway, the true problem just came from another classical subtlety in awk syntax.

### Let’s describe the problem

We have a somehow classical genotype file looking like this :

```AClassicalID,A_WAY_TOO_LONG_SNP_NAME,A,C
AClassicalID,ANOTHER_TOO_LONG_SNP_NAME,A,A
AClassicalID,A_SHORTER_SNP_NAME,-,-```

So 4 comma separated values : one individual ID, a SNP name, two alleles.

The aim was to print the lines where alleles were reported, or said in another way exclude line with the “-,-” alleles.

The basic awk one liner for this could be :

```gawk -F"," '{if(\$3 != "-"){print \$0}}' file
```

We tell gawk that fields will be separated by a comma (-F option), we test the value of the 3rd column, if the value is different from “-” then we print the line.
The above code work perfectly….but then my colleague which is very cautious just wonder if it was enough, though she try the following bit of code

```gawk -F"," '{if(\$4 != "-"){print \$0}}' file
```

And alas this almost identical code doesn’t seem to work. All the line of the file are printed.

### Would you be smart enough to see why ?

So the problem seems to come from variable 4. In fact, you may have notice that this last field have no comma on the right. Here the last field is  not “-” but “-” plus some spaces. The condition if(\$4 != “-“) is thus always TRUE.

### How to fix this ?

I had not a lot of time so a quick and dirty trick is just to use this code instead :

```gawk -F"," '{if(\$4 !~ "-"){print \$0}}' file
```

The change “!~” instead of “!=” means that now we are looking for line where the variable 4 is not matching “-“. This imply that it will exclude line where le variable 4 is equal to as instance “-“,”-   “, but also “A-“.

As we know that we should not find in the fourth column anything but “A,C,G,T or -“, the fix is acceptable.

### Remarks

For other problems, this fix might not be as convenient. Sometime it could also create surprising results !

Another point to notice, in Illumina standard genotype files, genotypes are, as far as I know, in alphabetical order, so that you should never see any line with genotype “C,A ” but rather “A,C”.

Last but no the least, unknown genotype are always paired, you should not see any “A,-” or what so ever. So my colleague could have been less conscientious, nobody would notice.

Categories: Awk, Linux, SNP

## Two other classical problem…and their solutions

After the recent course I gave on Unix and awk basics, several of my colleagues came to my office with some strange case. Problem I already encountered a long time ago, so that I totally forgot to mention it.

First problem :  wc is not counting line properly

You open a file, read it, and observe that the number of line is let’s  say 10 but wc -l tell you that he count 9 or 11 or more lines

What’s wrong ?

There is for some reason a problem in the last line of your file. Some software doesn’t write properly the last line of a file so wc find the end of file sign in the last line containing text (normaly the end of file should be found alone on the last line of the file), this can explain why wc -l will think you got 9 lines whereas you have actually 10.

An easy way to fix is to open the problematic file with any editor  go to the last line press enter to move the cursor to the last empty line. This way your

In case wc -l tell you you have 11 or more lines, you may have additional empty lines at the end of your file, just open the problematic file and erase the extra empty lines.

Second problem :  floating points figure  are written with a comma by gawk

This bug  is in fact related to the fact that i am French (nobody’s perfect) and that on one server among all the other available at the office, in fact in this server the LANG variable is set to “fr_FR”. The only way to fix this is to change the value of LANG to either  “us_US” or better “fr_FR.UTF-8”.

This could be conveniently done by adding this line  in any script before using awk (or by adding this in your .profile or .bashrc) .

```export LANG="fr_FR.UTF-8"
```

Should we consider that all our scripts should be protected for such problem ?

Honestly, this problem are pretty rare and with well designed script on properly configured platform you shouldn’t have this problem. In fact the problem with wc generally appear after  manual editing of a file. And the LANG problem only appear if you don’t use UTF-8 encoding (which is not so usual nowadays). So just remind that this kind of problem can appear from time to time, but don’t focus too much on it.

Categories: Awk, Linux, Shell