Archive for the ‘QTL-detection’ Category

Software update

November 30, 2013 Leave a comment

Picard tools

The classical one 😉

GATK 2.7-4

A new experimental feature to assess PK for reference sites, plus some bug fixes.


A new version of the algorithm. Shortened argument flags. New side tools (data checking and graph handling).


A new version, 2.12, has been released: faster and with additional features. Note that a GCTA forum is now available for discussion and technical problems.


A more flexible way to define the reference population and to handle the data. Multi-threading is available during the sampling step.

Delly 0.12

Apart from the removal of translocation detection (not for long?), this version ships a lot of awaited features, like VCF output.

Varscan 2.3.6

Still some work on VCF compatibility improvements.

Categories: Linux, NGS, QTL-detection, SNP

Serial snapshots with IGV batch

July 16, 2013 Leave a comment

IGV is a very handy tool. Nevertheless, scrolling from one position to another can be tedious. Worse, the flip side of this very user-friendly software is that you can spend hours looking here and there, and end up with no record of your discoveries.

Thankfully, IGV can run in batch mode, allowing you, for a targeted list of positions, to take screenshots and store them in a folder. We'll illustrate this below with a small example:

Setting up a session

To test our script, we will first download some publicly available data.

#Download a vcf file (on beagle4 website, you may have to change file name according to last release)
gunzip test.r1099.vcf.gz
#Create an index
igvtools index test.r1099.vcf
#(You may alternatively prefer to run igvtools from the igv GUI)
#Launch igv (the launcher may be named igv.sh depending on your install)
igv &

To check that everything is right, load the vcf file test.r1099.vcf, then move to position chr22:20,000,000-20,100,000.


Set everything to your taste and save the session (File → Save Session); you should obtain an XML file.

Running IGV in batch mode

Now, let's say you are interested in some particular positions, which we have stored in a csv file. Our aim is to create an IGV batch file.

Basically, we'll have to load the session, set a directory to store the screenshots, and then move from one position to the next. A very crude version could therefore be the following. (I know: why csv? Because a lot of people still use Excel 😦)

#Create a fake positions list (chromosome;position)
cat >Liste.csv <<EOF
chr22;20050000
chr22;20080000
EOF
#Create a screenshot directory
mkdir IMG
#Write the header of the script
cat >Batch.igv <<EOF
load igv_session.xml
snapshotDirectory IMG
EOF
#Now parse the csv file (one goto + snapshot per position, +/- 10kb around it)
gawk -F ";" -v R=10000 '{print "goto "$1":"$2 -R"-"$2 + R"\nsnapshot Screen"NR".png"}END{print "exit"}' Liste.csv >>Batch.igv

From IGV, go to Tools → Run Batch Script and load Batch.igv. When the whole process is done, IGV will terminate and you'll find your screenshots in IMG.

For an even more automated version, you can use the script available in the scriptotheque.


June 30, 2012 Leave a comment

After some months of work, our article was accepted in Bioinformatics and the accompanying software GLASCOW was released (available on the GIGA website).

Basically, the software performs a generalized linear mixed model analysis on (preferably) hidden states (obtained e.g. with PHASEBOOK), after a correction for population stratification. This approach has already been successfully applied in Durkin et al., Sartelet et al. and Dupuis et al.

I contributed to the software with some optimizations and code rewriting. The code is thus a mix of several coding practices and styles; I must admit that, to some extent, this can make the code hard to follow.

Seen on this blog

Concerning programming, several details of the package are covered on this blog:

  1. The general framework of the package was set by the script
  2. I use command-line arguments as described here
  3. The use of Lapack is conditional, so I added some preprocessor instructions to the code
  4. To record the compilation date, I also added some preprocessor instructions

…coming soon on this blog ?

Other key features are:

  • Parallelized reading of the files (we rely on one phase/hidden-states file per chromosome, obtained with PHASEBOOK); this trick, plus a neat storage of the partial relationships between individuals, yields great speed improvements and lower memory consumption.
  • Computing the score on residuals allows both a general speed-up and the ability to restart an analysis from these last steps only.
  • Permutations are done in parallel. This last part was very tricky to implement, and I must say that the speed-up didn't fulfill my expectations... I'll try to see how to fix this.
  • We added an option to output results in a “.assoc” file, for a better viewing experience in IGV.

So even if the code still contains some old-fashioned Fortran (goto, etc.), the adventure was fun, inspired some posts, gave me some experience with GLMMs and added one article to my publication list! Not so bad, is it?

This so small…and so HUGE

February 18, 2012 Leave a comment

I guess this is THE BIG NEWS of the week :

Nanopore announcing their new product

Nanopore DNA sequencing from Oxford Nanopore on Vimeo.

Which, of course, has already been commented on here or here and here, with some extra information here.

In a nutshell :

  • a small, portable, disposable device “adaptable” for DNA sequencing
  • a new technology that could cut several fixed costs in the genotyping process

I am really eager to see how well it will work and to what extent the minION could be used out of the lab, but this product will surely have a huge impact on our day-to-day work!

It has to happen !

November 18, 2011 Leave a comment

You may have goals that you haven't reached yet, but you keep believing that one day you'll achieve them, and... one day it finally happens. This has been the case for two of my secret goals!

As an unconditional emacs user, I knew that sooner or later I would try to write my own emacs mode. I got the occasion to achieve this objective while testing a phasing program.

When comparing phased data, you generally want to align two phases and see at which SNPs you observe discrepancies. Nevertheless, with thousands of SNPs, this looks like searching for a needle in a haystack. This can be done more efficiently with colored data: the idea is simply to associate a color with a class of characters.

But as I am still a bit confused by lisp syntax, I just tried to find some good examples on the net.

And based on these, I wrote this piece of lisp code.

;;Define a font-lock attribute for each allele
(setq myAllele
 '((" [1Aa]" . font-lock-function-name-face)
   (" [2Cc]" . font-lock-warning-face)
   (" [Gg]" . font-lock-constant-face)
   (" [Tt]" . font-lock-comment-face)
   (" [0.]" . font-lock-builtin-face)))
;;Define the mode and load these new attributes
(define-derived-mode SNP-mode fundamental-mode "SNP mode"
 (setq font-lock-defaults '(myAllele))
 (setq truncate-lines t))

All this code is in a file called SNP.el stored in my ~/.emacs.d/

  1. The first block of code defines the association between alleles and font-lock attributes
  2. The second block defines the mode and asks for line truncation

And in order to load this mode automatically, I added to my .emacs

(load "~/.emacs.d/SNP.el")

To go one step further, as I know that most of my genotype or phase files are named like “phase*” or “typ*”, I also added the corresponding auto-mode associations, i.e. these lines:

(setq auto-mode-alist
 (cons '("typ[1-9]+" . SNP-mode)
  (cons '("phase[1-9]+" . SNP-mode) auto-mode-alist)))

Now, every time I open a genotype file (starting either with “typ” or “phase”), the SNP-mode is loaded and I thus obtain this kind of screen.

Now there is no way I could miss something awesome in the genome!

For the laziest of you the SNP-mode is available here

And for those wondering what my other achievement was: you are certainly not aware of the bedrace World championship!

Hash function race !

June 30, 2011 Leave a comment

I have lately spent a lot of time optimizing my evaluation software. By doing so, I noticed that, with increasing amounts of data, the time spent hashing was becoming very long.

So I looked around the internet to see whether or not I could optimize my hash function, and finally ended up with several good resources, among which this website:

My problem was that the hash functions proposed were in C (and in a marvelous world, I'd prefer a Fortran version... which I didn't find anywhere on the web... so I decided to write it myself).

I took a list of 235930 pairs of integers (some that are hashed in real evaluations). For each pair, I computed a hash value both with a prime-number hash function and with Jenkins' lookup3.c function.

Technical details :

The prime-number hashing function taken as a reference was the one provided by I. Misztal (et al.) in his blupf90 package. Basically, each coordinate to be hashed is multiplied by a specific prime number, and the function keeps this figure modulo the size of the storage array where the hash keys are stored. If, for a specific hash key, the cell is already used, we look at the cell 17 positions forward.

In Jenkins' function, we apply several bit operations to the coordinates we want to hash. In case of collision, we hash our coordinates again until we find an available cell. One important point is that the storage array size should be a power of two.

Let’s look at the results :

Below are the total number of collisions and the maximum number of collisions for one pair of integers.

Basically, Jenkins' hash reduces the total number of collisions by a factor of 4. Furthermore, for a matrix filled at 90%, Jenkins' function won't produce more than 413 successive collisions for one single pair of coordinates, whereas this value could be as high as 8213 with the prime function.

If we observe the evolution of this figure with an increasing proportion of filled cells in our vector, we clearly see that once only 40% of the cells are still available, the prime hash function tends to generate a lot of collisions, while Jenkins' hash remains pretty efficient up to 70% of filled cells.

Due to its lower level of collisions, Jenkins' hash should be more efficient than the prime function. To test this hypothesis, I compared the CPU time needed by each algorithm, with and without the -O3 optimization option, with both ifort and gfortran. The test was repeated 100 times.

Except when compiled with gfortran and no options, Jenkins' hash was significantly faster than the prime function. The difference doesn't sound huge (about 0.04 CPU s), but should be worthwhile on bigger problems. We can furthermore assume that, in real use, we would also get some gains when looking up data in the hash matrix.

For interested people, you can find here a module which can replace the hashvr function in Misztal's programs.

Beware, you should also :

  1. add “use hash” in all the subroutines that need hashvr (these subroutines are in sparse.f90)
  2. modify the increment size for the hash matrix, “hash_incr” (it should be 2)
  3. change the default hash size (it should be a power of two)
  4. comment out the old hashvr function in sparse2.f (to avoid a name conflict)
  5. update your makefile!

Last hint: if you have a rough idea of the hash matrix size you need, let's say “H”, you can compute the closest power of two with a few lines of code.


You could thus avoid the cost of successive copies of your hash matrix.

Learning by examples (3) : Creating a map

March 16, 2011 Leave a comment

Although not important for several kinds of analysis, we may need a map file to describe the simulated data.

So we need to create a map file containing as many lines as markers: a meaningful name for each SNP, the number of the chromosome, and a position in cM. This can be done with this tiny piece of code:

#Enumerate the possible chromosomes
for i in `seq 1 5`
do
#For each chromosome (1 to 5), write one line per SNP in the corresponding map file
gawk -v BTA=$i 'END{for (i=1;i<=(NF-1)/2;i++){printf "%6i M%1i%10.10i %2i %5.2f\n",i,BTA,i,BTA,(i-1)*0.05}}' typ$i >map$i
done

Explanations :

Line 2, we use a for loop with the seq command, delimited by two backquotes (AltGr+7): this will execute seq, which writes the numbers from 1 to 5, and this output provides the successive values taken by the variable i.

Line 5, gawk is called with the -v argument, which passes one variable to the executed awk code. While processing the data, gawk will consider that the variable BTA exists and has the value passed through -v, i.e. $i, the chromosome being processed. Don't be confused here: normally, any variable defined within the shell is unknown to awk. Variables used by awk (i in the for instruction, for instance) have nothing in common with the ones used by the shell (they are private to awk).

We use awk in a somewhat peculiar way: all the lines are read (even if it's not mentioned), but the code is executed only once we've been through the entire file. Here the reading step helps to know the number of fields in the file. We then divide by two (as there are always two alleles per SNP) to get the number of lines to be written in the map file.

We extensively use printf for the output; the format description follows the C format style.

We are now ready for Phasing and QTL detection !

Categories: Awk, QTL-detection, Shell, SNP