Home > phasing, R, SNP > How many sires must one genotype ?

## How many sires must one genotype ?

That is, for sure not a forgotten lyrics of “blowing in the wind”, but a pretty frequent question : How many sires should be genotyped in order to compute phases (and maybe do some imputation) ?

In fact, most of the time the question is biased because every body would like to hear “genotype one individual, you should be able to phase the whole population !”. Then most of the time people forget to think about what

1. Is there main focus (what do they want to do with their genotype)
2. Should be the reference figures to assess whether they have a good way to prove what they want to prove

If we try to view the problem in statistical perspective, SNP are bi-allelic, so for each marker an individual can be either Homozygous (Ho) or  Heterozygous (He). If the Minor Allelic Frequency (MAF) is equal to p, with $p \sim U(0,1)$

We will have the classical probability for one individual to be homozygous.

$P(\text{Ho}) = p^2 + (1-p)^2$

If we plot a simple graphic on how this probability evolve for different MAF, with this piece of code.

#Defining a sequence of MAF (considering we have rejected SNP with MAF<5%)
Maf<-seq(0.05,0.95,0.0001)
#Make a graph
plot(Maf,((1-Maf)**2)+Maf**2,t="l",col="blue",ylim=c(0,1),xlab="MAF",ylab="Probability of beeing homozygous",main="Evolution of the probability of beeing homozygous depending on MAF of the SNP")
#Checking for correctness
summary(((1-Maf)**2)+Maf**2)


So looking at the results, we would say for one marker that we have a pretty big chance of  “not failing in assigning the phase of any SNP”….this would be true, but there were no chance to fail in this task ! Furthermore, if we are trying to obtain phases, we seek for somehow a “sharper” information creating variations to be related to phenotypes…which is not really what we obtain with homozygous markers.

In fact the true challenge is to assign for the remaining 36.5 % case the right allele to its phase…which even with a random assignment, should be right with a probability of $p=\frac{1}{2}$ !  To put in a nutshell, the real baseline for any phasing algorithm is on average 81 % of correct assignment !

#Defining a sequence of MAF (considering we have rejected SNP with MAF<5%)
Maf<-seq(0.05,0.95,0.0001)
#Make a graph
plot(Maf,((1-Maf)**2)+Maf**2,t="l",col="blue",ylim=c(0,1),xlab="MAF",ylab="Probability of beeing homozygous",main="Evolution of the probability of beeing homozygous depending on MAF of the SNP")
#Adding the expectation of correct assignment obtain randomly
lines(Maf,((1-Maf)*Maf)+((1-Maf)**2)+Maf**2,t="l",col="red")
#Computing new mean
summary(((1-Maf)*Maf)+((1-Maf)**2)+Maf**2)


Now, that we have in mind the baseline figures, let’s look at the improvement that can be achieved by a good choice of animal to genotype. Among the (on average) 36.5% of individuals that won’t be homozygous, we should be able to estimate correctly the phase if we have one parents (or one offspring) that is homozygous, which should be the case in 63.5% of the time. If both parents are genotyped, the probability of not beeing able to assign correct phase because both parents are heterozygous is $p ( \text{2 Parents He})= p(\text{he})^{2}$ something around  13% of the case

For the progeny, the first good point is that we can have more than two offsprings for an individual. The probability to have one homozygous offspring among n as $p ( \text{1 Ho off among n})= p(\text{he})^{n}$ which will fall to a small probabilities very fast with n !

So now consider the probability of a sire to be correctly phase when one of its offspring have been genotyped. p  $p = (P_{\text{parent heterozygous}} . P_{\text{off homozygous}}) + (P_{\text{parent homozygous}})$

The first part  is the contribution of a “smart” sampling of genotype individuals really whereas, the important second part is  proportion of phased marker due to the existence of homozygous.

#Defining a sequence of MAF (considering we have rejected SNP with MAF<5%)
Maf<-seq(0.05,0.95,0.0001)
#Make a graph
plot(Maf,(2*Maf*(1-Maf))*(1-(2*(1-Maf)*Maf))+(Maf**2)+(1-Maf)**2,t="l",col="blue",ylim=c(0,1),xlab="MAF",ylab="",main="Probability of phasing \n if one offspring is genotyped ")
#Add the contribution of n genotyped offspring
for(i in 1:10){lines(Maf,(2*Maf*(1-Maf))*(1-(2*(1-Maf)*Maf)**i),col=i)}


From the above graphic, we see at the top a blue line showing the probability of having a good phasing if any parent have one offspring genotyped. The black line at the bottom, represent the contribution of the genoted offspring to the total probability. The series of coloured lines above this line represent the contribution of more (2 to 10) offspring genotyped. We can see :

1. The extra well phased marker due to offspring genotyped will always appears small compared to the de facto well phased homozygous marker
2. Genotyping new offspring have an interest till 4 or 5, then the extra benefits vanished

Last,  as with 50k chip you can rely on LD to some extent and assuming you got a good map. In fact information given by flanking markers can always help to estimate the most likely phase for a marker that would fall within the 36,5% of the bad situations !

If you are really interested in phasing, meaning beeing able to identify variation at the chromosome level, which exclude in fact the “perfectly assigned to a phase” homozygous marker, you should avoid genotyping individuals that are not related to another genotyped individual one generation apart.

So to try to give an answer, I would says if you can genotype a restricted number of individuals,

1. Focus on individuals with a lot of progeny (which in turn could also have a lot of progeny)
2. Keep an eye on your basic aim (if it’s for instance QTL detection, then phenotype’s quality of individual that will be genotype and could easily be phased, should be the most important thing to look at)