### Archive

Archive for the ‘R’ Category

## Hourglass distribution

August 6, 2013

Some days ago, I went through an article on mutation rate mentioning the “hourglass distribution”. The illustration of this distribution is pretty obvious from the plot in the article, with, for some chromosomes, a pretty clear reduction in the number of SNPs around centromeres. I just wondered whether we could also observe this phenomenon in Bos taurus, so I had a look at the dbSNP138 data to get some clues.

### First version


#Retrieve dbSNP
wget ftp://ftp.ncbi.nih.gov/snp/organisms/cow_9913/VCF/vcf_*vcf.gz


And then just run the following R code.

#Open an output file
png("hourglass.png",1920,1080)
#split the screen in 30
par(mfrow=c(3,10))
#Create a vector of chromosomes
chr=c(seq(1:29),"X")
#run a loop on chromosomes
for( i in chr){
#read the dbSNP file for the chromosome (the exact file name pattern may need
#to be adapted to the downloaded files); skip the 15 header lines
Pos=read.table(paste("vcf_chr_",i,".vcf.gz",sep=""),skip=15,fill=TRUE)
#plot the snp density
plot(density(Pos$V2),col="grey",main=paste("BTA : ",i,sep=""))
polygon(density(Pos$V2),col=rgb(0,0,0.3,0.04))
}
dev.off()


Notes on code :

• read.table can be used directly on a gzipped file (a very handy trick)
• dbSNP files have a 15-line header, so I use the skip=15 option
• I had some glitches while reading some files (a problem with the number of fields); the option fill=TRUE fixes it
• Plots are nice… but polygons are even better, so I first plot the density and then add a polygon on it
• The rgb function is a simple way to obtain transparency: after the three values for red, green and blue, I set the alpha value to 0.04
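The gzipped-file trick in the first note is worth illustrating on its own; here is a minimal sketch in Python (the toy header and data lines are invented) showing the same “read compressed, skip the header” pattern as read.table with skip=15:

```python
import gzip
import io

# Build a toy gzipped "VCF": 15 header lines followed by two data lines
# (chromosome, position, id) -- the content is invented for illustration.
header = b"".join(b"##header line %d\n" % i for i in range(15))
body = b"1\t12345\trs1\n1\t22345\trs2\n"
blob = gzip.compress(header + body)

# Decompress transparently and skip the 15-line header, like skip=15 in read.table
with gzip.open(io.BytesIO(blob), "rt") as fh:
    rows = [line.rstrip("\n").split("\t") for line in fh][15:]

positions = [int(r[1]) for r in rows]
print(positions)   # [12345, 22345]
```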

And after a while… you'll obtain something like this:

I must say I was a bit disappointed. At least, there is no clear pattern like the one seen in human. All the bovine chromosomes are acrocentric, which may explain why, in general, no clear decrease in SNP density can be seen. The patterns observed on chromosomes 10, 12 and 18 were even more surprising; I am wondering whether there could be some sampling bias. As for the pattern on BTA23, it could be due to the MHC, which is known to exhibit great diversity. Density computation may also blur things a bit.

### Second version

The basic work being done, we can try to investigate other hypotheses. For instance, are SNPs and indels distributed the same way along the genome? With some slight changes, the code becomes:

#Open an output file
png("SNPandindelDistribution.png",1920,1080)
#split the screen in 30
par(mfrow=c(3,10))
#Create a vector of chromosomes
chr=c(seq(1:29),"X")
#run a loop on chromosomes
for( i in chr){
#read the dbSNP file for the chromosome (adapt the file name pattern if needed)
Pos=read.table(paste("vcf_chr_",i,".vcf.gz",sep=""),skip=15,fill=TRUE)
#plot the snp density
plot(density(Pos$V2[grep("snp",Pos$V8,perl=TRUE)]),col="grey",main=paste("BTA : ",i,sep=""))
polygon(density(Pos$V2[grep("snp",Pos$V8,perl=TRUE)]),col=rgb(0,0,0.3,0.04))
dense=density(Pos$V2[grep("in-del",Pos$V8,perl=TRUE)])
lines(dense$x,0.5*dense$y,col="red")
}
dev.off()


Notes on code :

• In dbSNP, variants are coded either as snp or in-del; we extract the matching lines with the grep function accordingly
• I scaled down the indel line a bit in order to avoid scale problems.
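The grep-based split between variant classes can be mimicked in a few lines; a sketch in Python with invented INFO strings (the exact dbSNP INFO format is not reproduced here):

```python
import re

# Toy INFO fields carrying a variant-class tag, invented for illustration
info = ["RS=1;VC=snp", "RS=2;VC=in-del", "RS=3;VC=snp"]

snps = [s for s in info if re.search("snp", s)]
indels = [s for s in info if re.search("in-del", s)]
print(len(snps), len(indels))   # 2 1
```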

We observe roughly the same pattern between SNPs and indels, although the indel distribution may be smoother. I was expecting some discrepancies (relying on the article by Montgomery et al.), but here again we are only dealing with 1-base indels, which are not really representative of short indels in general. I may try to check these results with my own results.

## More on polled

I’ve collected 23 answers so far to my little survey (by the way, a great “thank you” to all of you who answered). Just to remind those of you who haven’t bet yet why being a polled and red carrier could be of interest for a breeder, here comes some R code. First, you’ll have to download the French Holstein gEBV here. Then start R in a terminal.

#Find the file name (I assume the file is unique and in the current directory)
File=dir(pattern="export_genomiques_")
#Read it (separator and header options may need adapting to the actual file format)
gEBV=read.table(File,header=TRUE)

#Polled genotype===================================
#create a vector of status "pp=>Normal,Pp=>heterozygous polled,PP=>Homozygous polled"
PO=rep("pp",dim(gEBV)[1])
#Identify heterozygous polled sires with grep applied to the name of the sire
PO[grep(" P | P$",gEBV$NOTAUR,perl=TRUE)]="Pp"
#Identify homozygous polled sires
PO[grep(" PP | PP$",gEBV$NOTAUR,perl=TRUE)]="PP"
#Draw a boxplot
boxplot(gEBV$X0 ~ PO ,col="blue")
#Add the number of observations
for(i in 1:3){text(i,0.95*max(gEBV$X0),paste("N= ",table(PO)[i]))}

#=Now the Red factor================================
RF=rep("rr",dim(gEBV)[1])
#Identify red carriers and homozygous red sires from the name
RF[grep(" RF | RF$",gEBV$NOTAUR,perl=TRUE)]="Rr"
RF[grep(" RED | RED$",gEBV$NOTAUR,perl=TRUE)]="RR"
#Draw a boxplot
boxplot(gEBV$X0 ~ RF ,col="red")
#Add the number of observations
for(i in 1:dim(table(RF))){text(i,0.95*max(gEBV$X0),paste("N= ",table(RF)[i]))}
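The name-based genotype coding above relies on the “ P ”/“ PP ” suffix convention in sire names; a small Python sketch of the same logic (the sire names are hypothetical), testing the more specific pattern first:

```python
import re

def polled_status(name):
    # check homozygous polled first, then the single-P heterozygous pattern
    if re.search(r" PP | PP$", name):
        return "PP"
    if re.search(r" P | P$", name):
        return "Pp"
    return "pp"

names = ["ALPHA P", "BRAVO PP", "CHARLIE", "DELTA P RED"]   # hypothetical names
print([polled_status(n) for n in names])   # ['Pp', 'PP', 'pp', 'Pp']
```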


So, for a given genotype, the chances to be in the “Elite” group will vary a lot. Iris and her brother have a pedigree-based gEBV of 155, and their parents’ reliability is 0.7.

mu=155 ; VarG=25 ; Rel=0.25*(0.7+0.7)
sigma=sqrt(1-Rel)*VarG
#Boxplot of gEBV depending on genotypes (in the actual sire population)
boxplot(gEBV$X0 ~ RF + PO ,col="grey",main="gEBV according to genotype")
#Add E(EBV) and trace a 95% CI area
abline(h=155,col="green")
polygon(x=c(0,0,10,10),y=c(155-(1.96*sigma),155+(1.96*sigma),155+(1.96*sigma),155-(1.96*sigma)),col=rgb(0,0.8,0.1,0.1))


So we can clearly see that, for Iris and her brother, their career will mostly depend on the fact that they are red and polled carriers!

Categories: Agriculture, Funny science, R

## Why R (and not SAS) again.

February 19, 2013

Ten years ago, when I first came to INRA, I learned SAS. It was THE statistical analysis language. Unfortunately, SAS was not really affordable (this is still true), so back at school, if I wanted to run some analysis, I had to make an appointment with my professor, who was the only person with SAS on his computer. Not a very convenient way to move forward. So I tried to figure out whether there was an alternative to SAS… and that is how I discovered R!

I have already acknowledged Phillipe Besse (who at the time offered both SAS and R scripts for data mining on his website) after a friend’s PhD defense, but I’d like to repeat my thanks to him and to the R community: I really learned a very valuable skill for my career.

Once fully convinced that R was a good solution (and that SAS might be a problem), I tried to convince my colleagues that moving from SAS to R could be worthwhile in the long term. Unfortunately, I got the same old song: SAS was very popular, and all kinds of not-so-bulletproof arguments. With time, using R became more and more obvious, so that I fortunately had to argue less and less often about why R and not SAS. In fact:

1. Most students now have basic skills in R (and less often in SAS)
2. R implements most newly published methods
3. R is pretty versatile (though not perfect in every area)

The excellent freakonometrics just published a presentation introducing R and adding a small review of R’s popularity.
The licence fees are pretty self-explanatory, and I hope that referring to them will give me more credit next time I try to convince people that the cost/benefit ratio of SAS is questionable.

Categories: R

## How many sires’ genotypes in a floppy disk ?

March 23, 2011

I admit this question is more a funny way to compare several compression tools than a totally correct or fair comparison! It will be the occasion to write some code in Shell/AWK and R, to remember those good old floppy disks… and, most importantly, it may start a funny discussion at coffee break! So let’s start. The tools tested are the ones easily available on any Linux distribution:

1. bzip2
2. gzip
3. xz

The idea of the script is to create a genotype file (or at least to mimic one). We progressively increase its size and, at each step, compress the file to see whether the compressed file still fits on a floppy. When the compressed file gets bigger than 1.44 Mb, the script stops. The genotype file is a flat file containing one line per marker and per sample, i.e. 54001 lines per sample. Due mostly to the length of the SNP names, we need around 2.5 Mb to store all the genotypes of one sample.

for Zip in gzip bzip2 xz
do
echo Testing $Zip
N=0 ; T=0
while [[ $T -le 1475 ]]; do
#Add 10000 lines to the file
echo $N
N=$(( $N + 10000 ))
#extract the N first lines and time the compression
gawk -v N=${N} '{if(NR<=N){print $0}}' TYP.csv > out
/usr/bin/time -o Stat -a $Zip out
#size (in kb) of the compressed file
T=$(du -k out.* | gawk '{print $1}')
echo $Zip $T $N >> Stat
rm out*
done
done


After some hours we get a file called Stat, mixing a lot of different information. Let’s analyse it with some shell script:

gawk '{if(NR%3==1){gsub(/user/," ",$1);gsub(/system/," ",$2);printf "%5.3f,%5.3f,%5.3f,",$1,$2,$1+$2};if(NR%3==0){print $1","$2/1024","$3/54001}}' Stat
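To get a rough feel for how the three tools rank, the same comparison can be sketched self-contained in Python on a fake genotype file (the SNP-name pattern and sample id are invented; absolute sizes will differ from real data):

```python
import bz2
import gzip
import lzma

# One line per marker for one sample, 54001 markers, long invented SNP names
data = "".join(
    "FAKE-SNP-NAME-%06d SIRE0001 %d %d\n" % (i, i % 2, (i // 3) % 2)
    for i in range(54001)
).encode()

# Compress the same payload with the three algorithms and compare sizes
sizes = {
    "raw": len(data),
    "gzip": len(gzip.compress(data)),
    "bzip2": len(bz2.compress(data)),
    "xz": len(lzma.compress(data)),
}
print(sizes)
```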


And now try to draw a nice graphics from these data !

#Read the data (assuming the gawk output above was redirected to a file, e.g. Stat.csv)
Stat <- read.table("Stat.csv",sep=",")
#Declare all the compression tested
T <- c("gzip","bzip2","xz")
#Set a vector for the max algorithm
M <- rep(0,3) ;names(M)=T
#Compute the largest amount of sire per compression algorithm
for(Typ in 1:3){M[Typ]=max(Stat[6][Stat[4]==T[Typ]])}
#plot it
barplot(M,col="blue",main="Number of genotyped sires in a floppy disk \n depending on compression algorithm",ylab="Number of sires")
#Plot the compression time
plot(Stat[6][Stat[4]=="xz"],Stat[3][Stat[4]=="xz"],col="blue",t="l",xlab="Number of individuals in the file",ylab="Compression Time (in CPU second)",main="Evolution of compression time with number of individuals in the file")
lines(Stat[6][Stat[4]=="bzip2"],Stat[3][Stat[4]=="bzip2"],col="green",t="l")
lines(Stat[6][Stat[4]=="gzip"],Stat[3][Stat[4]=="gzip"],col="red",t="l")
legend("bottomright",c("xz","bzip","gzip"),col=c("blue","green","red"),lty=1)


With regard to the number of genotyped sires that could fit on a floppy disk, the differences between compression algorithms are just HUGE.

These differences nevertheless have a cost: compression time (which fortunately increases linearly with the size of the file). While bzip2 and gzip take respectively 0.07 s and 1 s of CPU per new sire, xz takes 4 seconds of CPU, which can be really problematic. But we will discuss this point in another post… I have some data to deliver!

Categories: Awk, Linux, R, Shell, SNP

## How many sires must one genotype ?

That is, for sure, not a forgotten lyric of “Blowin’ in the Wind”, but a pretty frequent question: how many sires should be genotyped in order to compute phases (and maybe do some imputation)?

In fact, most of the time the question is biased, because everybody would like to hear “genotype one individual and you should be able to phase the whole population!”. Most of the time, people forget to think about:

1. What their main focus is (what do they want to do with the genotypes)
2. What the reference figures should be, to assess whether they have a good way to prove what they want to prove

If we try to view the problem from a statistical perspective, SNPs are bi-allelic, so for each marker an individual can be either homozygous (Ho) or heterozygous (He). If the Minor Allele Frequency (MAF) is equal to $p$, with $p \sim U(0,1)$, we have the classical probability for one individual to be homozygous:

$P(\text{Ho}) = p^2 + (1-p)^2$

Let's plot a simple graphic of how this probability evolves for different MAF values, with this piece of code:

#Defining a sequence of MAF (considering we have rejected SNP with MAF<5%)
Maf<-seq(0.05,0.95,0.0001)
#Make a graph
plot(Maf,((1-Maf)**2)+Maf**2,t="l",col="blue",ylim=c(0,1),xlab="MAF",ylab="Probability of being homozygous",main="Evolution of the probability of being homozygous depending on MAF of the SNP")
#Checking for correctness
summary(((1-Maf)**2)+Maf**2)


So, looking at the results, we would say that for one marker we have a pretty big chance of “not failing in assigning the phase of any SNP”… This would be true, but there was no chance to fail at this task anyway! Furthermore, if we are trying to obtain phases, we are after somewhat “sharper” information, i.e. variation that can be related to phenotypes… which is not really what we obtain with homozygous markers.

In fact, the true challenge is to assign, for the remaining 36.5 % of cases, the right allele to its phase… which, even with a random assignment, will be right with probability $p=\frac{1}{2}$! To put it in a nutshell, the real baseline for any phasing algorithm is on average around 81 % of correct assignments!
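The 36.5 % and 81 % figures can be checked in a few lines; a quick sketch in Python, using the same MAF grid as the R code:

```python
# MAF grid from 0.05 to 0.95 by steps of 0.0001, as in the R code
maf = [0.05 + 0.0001 * i for i in range(9000)]

# Average heterozygosity, and the baseline obtained by flipping a coin
# on heterozygous markers while homozygous markers come "for free"
p_ho = [p**2 + (1 - p)**2 for p in maf]
het = 1 - sum(p_ho) / len(p_ho)
baseline = (1 - het) + 0.5 * het

print(round(het, 3), round(baseline, 3))
```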

#Defining a sequence of MAF (considering we have rejected SNP with MAF<5%)
Maf<-seq(0.05,0.95,0.0001)
#Make a graph
plot(Maf,((1-Maf)**2)+Maf**2,t="l",col="blue",ylim=c(0,1),xlab="MAF",ylab="Probability of being homozygous",main="Evolution of the probability of being homozygous depending on MAF of the SNP")
#Adding the expectation of correct assignment obtained randomly
lines(Maf,((1-Maf)*Maf)+((1-Maf)**2)+Maf**2,t="l",col="red")
#Computing new mean
summary(((1-Maf)*Maf)+((1-Maf)**2)+Maf**2)


Now that we have the baseline figures in mind, let's look at the improvement that can be achieved by a good choice of which animals to genotype. Among the (on average) 36.5% of cases that are not homozygous, we should be able to estimate the phase correctly if one parent (or one offspring) is homozygous, which should be the case 63.5% of the time. If both parents are genotyped, the probability of not being able to assign the correct phase because both parents are heterozygous is $p(\text{2 parents He}) = p(\text{He})^{2}$, i.e. around 13% of cases.

For the progeny, the first good point is that an individual can have more than two offspring. The probability of having no homozygous offspring among $n$ is $p(\text{no Ho off among } n) = p(\text{He})^{n}$, which falls to a very small probability very fast with $n$!
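A quick numerical check of these two probabilities, using the 36.5 % average heterozygosity from above:

```python
p_he = 0.365   # average heterozygosity over the MAF grid, from the post

# Both parents heterozygous: the uninformative case with two genotyped parents
both_parents_he = p_he**2
print(round(both_parents_he, 3))   # 0.133, i.e. about 13% of cases

# Probability that none of n offspring is homozygous: vanishes quickly with n
no_ho_offspring = [p_he**n for n in range(1, 6)]
print([round(x, 3) for x in no_ho_offspring])   # [0.365, 0.133, 0.049, 0.018, 0.006]
```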

So now consider the probability of a sire being correctly phased when one of its offspring has been genotyped: $p = (P_{\text{parent heterozygous}} \cdot P_{\text{off homozygous}}) + P_{\text{parent homozygous}}$

The first part is the contribution of a “smart” sampling of genotyped individuals, whereas the second, more important, part is the proportion of markers phased simply because they are homozygous.

#Defining a sequence of MAF (considering we have rejected SNP with MAF<5%)
Maf<-seq(0.05,0.95,0.0001)
#Make a graph
plot(Maf,(2*Maf*(1-Maf))*(1-(2*(1-Maf)*Maf))+(Maf**2)+(1-Maf)**2,t="l",col="blue",ylim=c(0,1),xlab="MAF",ylab="",main="Probability of phasing \n if one offspring is genotyped ")
#Add the contribution of n genotyped offspring
for(i in 1:10){lines(Maf,(2*Maf*(1-Maf))*(1-(2*(1-Maf)*Maf)**i),col=i)}


From the above graphic, the blue line at the top shows the probability of correct phasing if a parent has one offspring genotyped. The black line at the bottom represents the contribution of the genotyped offspring to the total probability, and the series of coloured lines above it represents the contribution of more (2 to 10) genotyped offspring. We can see:

1. The extra well-phased markers due to genotyped offspring will always appear small compared to the de facto well-phased homozygous markers
2. Genotyping new offspring is worthwhile up to 4 or 5, after which the extra benefit vanishes
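The diminishing returns in point 2 can be checked numerically; a minimal sketch in Python, using the post's approximation that offspring heterozygosity equals the population heterozygosity 2·MAF·(1−MAF):

```python
def p_phased(maf, n):
    # probability that a sire's phase is recovered at a marker with the given MAF
    # when n of its offspring are genotyped (homozygous sires are phased "for free")
    p_he = 2 * maf * (1 - maf)
    return p_he * (1 - p_he**n) + (1 - p_he)

# Example at MAF = 0.3: most of the gain is already there with 4-5 offspring
for n in (1, 2, 3, 5, 10):
    print(n, round(p_phased(0.3, n), 3))
```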

Last, with a 50k chip you can also rely on LD to some extent, assuming you have a good map: the information given by flanking markers can always help to estimate the most likely phase for a marker that falls within the 36.5% of bad situations!

If you are really interested in phasing, meaning being able to identify variation at the chromosome level (which in fact excludes the “perfectly assigned” homozygous markers), you should avoid genotyping individuals that are not related to another genotyped individual one generation apart.

So, to try to give an answer: if you can only genotype a restricted number of individuals,

1. Focus on individuals with a lot of progeny (which in turn could also have a lot of progeny)
2. Keep an eye on your basic aim (if it is, for instance, QTL detection, then the phenotype quality of the individuals that will be genotyped, and could easily be phased, should be the most important thing to look at)
Categories: phasing, R, SNP