Home > Agriculture, Awk, Funny science, Misc, NGS, R, Shell, SNP > Hourglass distribution

## Hourglass distribution

Some days ago, I went through an article on mutation rate mentioning the “hourglass distribution”. The illustration of this distribution is pretty obvious from the plot in the article, with for some chromosome a pretty clear reduction in number of SNP around centromeres. I just wondered if we could also observe this phenomenon on bos taurus. So i just had a look on dbSNP138 data to have some clues.

### First version

```
#Retrieve dbSNP
wget ftp://ftp.ncbi.nih.gov/snp/organisms/cow_9913/VCF/vcf_*vcf.gz
```

And then just run the following R code.

```#Open an output file
png("hourglass.png",1920,1080)
#split the screen in 30
par(mfrow=c(3,10))
#Create a vector of chromosomes
chr=c(seq(1:29),"X")
#run a loop on chromosomes
for( i in chr){
#plot the snp density
plot(density(Pos\$V2),col="grey",main=paste("BTA : ",i,sep=""))
polygon(density(Pos\$V2),col=rgb(0,0,0.3,0.04))
}
dev.off()
```

Notes on code :

• read.table is used directly on a gzipped file (very handy trick)
• dbsnp file have a 15 lines long header, so I use the skip =15 option
• I had some glitches while reading some files, (a problem with # of fields) option fill=TRUE, just fix it
• Plot are nice….but polygon are even better, so I first plot the density and then add a polygon on it
• rgb function is a simple way to obtain transparency, so after the three values for red, green and blue, I add the alpha setting to a value of 0.04

And after a while….you ‘ll obtain something like this

I must say I was a bit disappointed. At least, there are no clear pattern as can be seen in human. All the bovine chromosomes are acrocentric  this may explain why generally no clear decrease in SNP density can be seen. The pattern observed on chromosome 12, 18 and 10 were even more surprising. I am wondering if there could be some sampling bias. Concerning the pattern on BTA23, the latter could be due to MHC, known to exhibit a great diversity. Density computation may also blur a bit things.

### Second version

The basic work being done, we can try to investigate others hypotheses. As instance, are SNP and Indel distributed the same way along the genome ? With some slight changes, the code become :

```#Open an output file
png("SNPandindelDistribution.png",1920,1080)
#split the screen in 30
par(mfrow=c(3,10))
#Create a vector of chromosomes
chr=c(seq(1:29),"X")
#run a loop on chromosomes
for( i in chr){
#plot the snp density
plot(density(Pos\$V2[grep("snp",Pos\$V8,perl=TRUE)]),col="grey",main=paste("BTA : ",i,sep=""))
polygon(density(Pos\$V2[grep("snp",Pos\$V8,perl=TRUE)]),col=rgb(0,0,0.3,0.04))
dense=density(Pos\$V2[grep("in-del",Pos\$V8,perl=TRUE)])
lines(dense\$x,0.5*dense\$y,col="red")
}
dev.off()
```

Notes on code :

• in dbsnp variant are coded either as snp or in-del, we extract line with the grep function accordingly
• I tweaked a bit the indel line in order to avoid scale problems.

We observe roughly the same pattern between snp and indel, albeit indel distribution may be smoother. I was expecting some discrepancies (relying on the article by Montgomery et al. but here again, we are only dealing with 1 base indel, which is not really representative of short indel  in general). I may try to check this results with my own results.