Recent articles
Just a small review of the articles I went through recently :
- A surprising results on sex bias in Holstein. Huge impact on the industry could be thought of, if the figure are confirmed and impact better estimated (I mean, conventional semen give 50% of females. So the worst case scenario against which +450 kg more milk at the end of the second lactation could be produced…is not the common situation).
- Analysis of selective sweep in cattle with sequence data.
- Some hints on way to correct for population stratification
- The funny article of the month, a compression exercise with all the results included (Side effect of the OpenData policy ? …lots of information you’ll never really use ?)
Recent articles
Just a small review of the articles I went through recently :
- Deserving the “catchy title award” another discussion on p-value.
- You missed articles on Encode Controversy ?
- Seriously cool : a new natural selection in modern human browser.
- New approaches to studies population structure here and there
- Xmas is not so far, a useful article if you are a great UNO player
Software update
The classical one 😉
A new experimental feature to assess PK for reference site. Plus some bug fixing
A new version of the algorithm. Shortened arguments flags. New sides tools (data checking and graph handling).
A new version released 2.12 : faster and with additional feature. Note a GCTA forum is now available for discussion and technical problem.
A more flexible way to define reference population and the way data are handled. Multi-threading available during sampling step.
Apart from translocation detection removal (not for long ?), this version ship a lot of awaited features like vcf output.
Still some work on vcf compatibility improvement.
Recent articles
Just a small review of the articles I went through recently :
- A large study on population structure of domesticated cattle
- Several Large scale project papers (post by nextgenseek ) I would add the TCGA article
- Methylation as an assessment of someone’s age. Note, in the same area a dedicated issue on methylation by nature insight
- A nice article on taboo around genetics . As long as these debate are avoided, we won’t be able to evolve in our relation to our genes.
- A nice piece of history around Bayes’s Essay True Title figuring out how science was spread by always strike me out. I loved this article as much as the excellent “Fisher, Neyman, and the creation of classical statistics”
Recent article
- A very nice initiative for newbies in bioinformatics, the latter coming with a set of important rules in NGS analysis and a reflexion on bioinformatics training.
- An interesting article on ameiotic evolution
- A very detailed article on INLA applied to animal model.
- Why partially identifiable covariance matrix should be consider with utmost care. I wondered if when dealing with Genomic relationship based on haplotypes, the mentioned phenomenon doesn’t apply.
- Hints on the results of my survey (soon on the blog)
Software update
The classical one 😉 Note that some changes were done for java 7 compatibility. So, after GATK, turning to java 7 as default may be on its way.
Another update (r1128), not documented so far.
The new version now use all variants in the reference panel (snp, indels, SV)
The openmp specification 4.0 are out ! Now support Fortran2003 and prepare the support for accelerator. Note the next intel compiler version already support a large number of the new specifications.
This library now in its version 1.4 . Support for new GPUs were added as well as additional subroutines. I wish more Fortran interface were added…maybe next time !
As previously mentioned, the last version of cuda is now available as rpm/deb package (allowing a much easier install).
More on variant distribution with dbsnp and Vep
Just a follow-up post, there are so many questions one can wonder about genome, that I thought it would be nice to elaborate a bit on the dbSNP data. So to move forward, we’ll see how to obtain sift score for the dbnsp ressources with Vep.
Install Vep
#Download VEP wget www.ensembl.org/info/docs/variation/vep/vep_script.html -O Out && wget `gawk '/variant_effect_predictor.tar.gz/ && /latest/{split($0,T,"\"");print T[2]}' Out ` -O vep.tar.gz #Extract it gunzip vep.tar.gz ; tar -xvf vep.tar #Go into the directory, create a cache folder cd variant_effect_predictor ; mkdir .vep #Run install, answers should be yes to ask for the use of cache file, and the number <=> bos Taurus perl INSTALL.pl -c .vep
Note :
- We first download Vep page
- Then the html code is parsed, and the link to the latest Vep verions is extracted and donwloaded
- During install, you’ll have to indicate a local vep directory (here .vep)
- Prefer a local cache file
Running Vep on vcf
#Download dbSNP vcf wget ftp://ftp.ncbi.nih.gov/snp/organisms/cow_9913/VCF/vcf_*vcf.gz for BTA in `seq 1 29 ` X MT do #Decompress zcat vcf_chr_${BTA}.vcf.gz >$BTA.vcf #Run Vep perl variant_effect_predictor.pl --offline --species bos_taurus -i ${BTA}.vcf --vcf --html --sift b --dir .vep --output_file Vep${BTA} done
Note :
- We use the Vep in local mode, so you’ll have to declare the .vep directory explicitly “–dir .vep”
- Output will be in vcf format (to avoid handling too many different file format) “–vcf”
- sift score are available for cow since ensembl 71 nevertheless you must ask for them in Vep “–sift b”
Location of “deleterious” variant
The vcf now have some annotations appended. We just go back to last post’s R code, but wonder this time where the variations supposed to be deleterious are ?
#Open an output file png("DeleteriousDistribution.png",1920,1080) #split the screen in 30 par(mfrow=c(3,10)) #Create a vector of chromosomes chr=c(seq(1:29),"X") #run a loop on chromosomes for( i in chr){ #Read file Pos=read.table(paste("Vep",i,sep=""),skip=15,fill=TRUE) #plot the snp density plot(density(Pos$V2[grep("snp",Pos$V8,perl=TRUE)]),col="grey",main=paste("BTA : ",i,sep="")) polygon(density(Pos$V2[grep("snp",Pos$V8,perl=TRUE)]),col=rgb(0,0,0.3,0.04)) #Add in-del line dense=density(Pos$V2[grep("deleterious",Pos$V8,perl=TRUE)]) lines(dense$x,0.5*dense$y,col="red") } dev.off()
Note: A shortcut would be to download the (not so up-to-date) vcf available at ensembl ftp site. The code is essentially the same as the one used in the previous post.
And you should obtain something like the following plot.
Once again this is still a quick and very dirty result. I wonder if there are any good story in these graphs (I mean one story that would not instantly vanished due to assembly problem or obvious bias !).
Hourglass distribution
Some days ago, I went through an article on mutation rate mentioning the “hourglass distribution”. The illustration of this distribution is pretty obvious from the plot in the article, with for some chromosome a pretty clear reduction in number of SNP around centromeres. I just wondered if we could also observe this phenomenon on bos taurus. So i just had a look on dbSNP138 data to have some clues.
First version
#Retrieve dbSNP wget ftp://ftp.ncbi.nih.gov/snp/organisms/cow_9913/VCF/vcf_*vcf.gz
And then just run the following R code.
#Open an output file png("hourglass.png",1920,1080) #split the screen in 30 par(mfrow=c(3,10)) #Create a vector of chromosomes chr=c(seq(1:29),"X") #run a loop on chromosomes for( i in chr){ #Read file Pos=read.table(paste("vcf_chr_",i,".vcf.gz",sep=""),skip=15,fill=TRUE) #plot the snp density plot(density(Pos$V2),col="grey",main=paste("BTA : ",i,sep="")) polygon(density(Pos$V2),col=rgb(0,0,0.3,0.04)) } dev.off()
Notes on code :
- read.table is used directly on a gzipped file (very handy trick)
- dbsnp file have a 15 lines long header, so I use the skip =15 option
- I had some glitches while reading some files, (a problem with # of fields) option fill=TRUE, just fix it
- Plot are nice….but polygon are even better, so I first plot the density and then add a polygon on it
- rgb function is a simple way to obtain transparency, so after the three values for red, green and blue, I add the alpha setting to a value of 0.04
And after a while….you ‘ll obtain something like this
I must say I was a bit disappointed. At least, there are no clear pattern as can be seen in human. All the bovine chromosomes are acrocentric this may explain why generally no clear decrease in SNP density can be seen. The pattern observed on chromosome 12, 18 and 10 were even more surprising. I am wondering if there could be some sampling bias. Concerning the pattern on BTA23, the latter could be due to MHC, known to exhibit a great diversity. Density computation may also blur a bit things.
Second version
The basic work being done, we can try to investigate others hypotheses. As instance, are SNP and Indel distributed the same way along the genome ? With some slight changes, the code become :
#Open an output file png("SNPandindelDistribution.png",1920,1080) #split the screen in 30 par(mfrow=c(3,10)) #Create a vector of chromosomes chr=c(seq(1:29),"X") #run a loop on chromosomes for( i in chr){ #Read file Pos=read.table(paste("vcf_chr_",i,".vcf.gz",sep=""),skip=15,fill=TRUE) #plot the snp density plot(density(Pos$V2[grep("snp",Pos$V8,perl=TRUE)]),col="grey",main=paste("BTA : ",i,sep="")) polygon(density(Pos$V2[grep("snp",Pos$V8,perl=TRUE)]),col=rgb(0,0,0.3,0.04)) #Add in-del line dense=density(Pos$V2[grep("in-del",Pos$V8,perl=TRUE)]) lines(dense$x,0.5*dense$y,col="red") } dev.off()
Notes on code :
- in dbsnp variant are coded either as snp or in-del, we extract line with the grep function accordingly
- I tweaked a bit the indel line in order to avoid scale problems.
We observe roughly the same pattern between snp and indel, albeit indel distribution may be smoother. I was expecting some discrepancies (relying on the article by Montgomery et al. but here again, we are only dealing with 1 base indel, which is not really representative of short indel in general). I may try to check this results with my own results.
Recent articles
Just a small review of the articles I went through recently :
- Why proper definition of any statistic matter. An article on the numerous Fst definitions, and their respective behavior with varying assumptions : MAF spectrum, samples size etc. The article is very comprehensive and give a nice review of main concept around Fst.
- Still around population genetics an article of my PhD supervisor on selective sweeps identification. The supporting information are also pretty worthy.
- A nice review on measure of dependence it’s a real relief to see alternative to the correlation coefficient do exist ! I must admit that pointless statements like “X and Y are correlated at 5%” just kill me.
- “Encode returns” a reply to Graur’s article by Mattick and Dinger.
- While at first glance its title may sound rather tautological, this article on long intergenic non coding RNA nicely shows that similar behavior (Ribosome occupancy) does not imply similar results (Protein coding). I wondered if the absence of stop codon in lincRNA (and thus the absence of ribosome release) impact the cell, but apparently lincRNA are not numerous enough to have any substantial effect.
Software update
So what are the last noticeable update of this latest months ?
At last HaplotypeCaller is now a real alternative for Big project ! <grumpy> The bad thing <\grumpy> is that now GATK need java7, which is still not so common nowadays and (thanks to oracle) no that easy to install !
I am actually running it on a batch of 100 bulls @12&+X(so far so good, I will complete a whole genome variant calling in 2 weeks).
Beagle 4
Now allowing imputation ! And this just make everything so simple !
This tool has been released in its 5.0 version.
This is the yearly release. Not a lot of new things but it’s always worthy to upgrade.