Demography and the age of rare variants

Iain Mathieson, Gil McVean

(Submitted on 16 Jan 2014)

Recently, large whole-genome sequencing projects have provided access to much of the rare variation in human populations. This variation is highly informative about population structure and recent demography. In this paper, we show how the age of rare variants can be estimated from patterns of haplotype sharing and how this information can detect and quantify historical relationships between populations. We investigate the distribution of the age of f2 variants in a worldwide sample sequenced by the 1,000 Genomes Project, revealing enormous variation across populations. The median age of f2 variants shared within continents is 50 to 160 generations for Europe and Asia, and 170 to 320 generations for Africa. Variants shared between continents are much older with median ages ranging from 320 to 670 generations between Europe and Asia, and 1,000 to 2,400 generations between African and Non-African populations. The distribution of the ages of variants shared across populations is informative about their demography, revealing recent bottlenecks, ancient splits, and more modern connections between populations. We see the signature of selection in the observation that functional variants are significantly younger than nonfunctional variants of the same frequency. This approach is relatively insensitive to mutation rate and complements other nonparametric methods for demographic inference.


I finally took a look at this paper and I think it’s quite interesting. However, I am somewhat puzzled by the result that, if haplotypes could be ascertained exactly, then the MLE of the age does not depend on N. The derivation seems believable to me but I can’t wrap my head around how a coalescent-based derivation (which scales everything by N) can result in an estimate of a quantity unscaled by N. For example, if I were to simulate data in ms with theta = 1, 5, and 10, where I’m mentally thinking “in all these cases the mutation rate is the same, really the difference is that the population size is bigger”, would I really get the same number for the mean estimate of the age of the f2 haplotypes, regardless of how I set theta?

My suspicion is that I am missing a crucial step in the thought process but I’d like to see if the authors or anyone else can help me out.

Hi Josh, and thanks for the comment.

Yes, sorry, this could probably be a bit clearer in the methods. It’s the time in generations which doesn’t depend on N. The time in coalescent units scales linearly with N, and then that factor drops out when you convert back to real time. So if you simulated with different thetas, you’d certainly rescale the tree and therefore the estimates (in coalescent time), but when you converted back to generations you’d get the same number. I guess another way to think about it is that the rate at which mutations (or recombinations) accumulate on a particular lineage in real time doesn’t depend on N.

– Iain

Iain,

Thanks! I think that makes sense… conditional on the tree all that matters is the real time, so it ends up canceling out, I guess.
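
The cancellation discussed in this thread can be illustrated with a small numerical sketch. This is not the paper's estimator. It assumes a deliberately simplified model in which the genetic length L (in Morgans) of an exactly known shared haplotype is Gamma-distributed with shape 2 and rate 2t, where t is the age in generations. In coalescent units (time measured in 2N generations, scaled recombination rate 4N per Morgan) that rate is 4N·tau. The function names and the example length of 0.5 cM are hypothetical, chosen only for illustration.

```python
# Simplified illustration (assumed model, not the authors' implementation):
# L ~ Gamma(shape=2, rate=rho * tau), with rho = 4N per Morgan and
# tau the age in coalescent units (units of 2N generations).

def mle_age_coalescent_units(length_morgans, n_e):
    """MLE of the age in coalescent units under the assumed Gamma model."""
    rho_per_morgan = 4 * n_e  # scaled recombination rate per Morgan
    # For a Gamma with shape 2 and rate r, the MLE of r is 2 / L;
    # here r = rho * tau, so tau_hat = 2 / (rho * L).
    return 2.0 / (rho_per_morgan * length_morgans)

def to_generations(tau, n_e):
    """Convert coalescent units back to generations (one unit = 2N generations)."""
    return 2 * n_e * tau

if __name__ == "__main__":
    length = 0.005  # hypothetical haplotype length: 0.5 cM
    for n_e in (1000, 5000, 10000):
        tau_hat = mle_age_coalescent_units(length, n_e)
        t_hat = to_generations(tau_hat, n_e)
        print(f"N={n_e}: tau_hat={tau_hat:.5f} coalescent units, "
              f"t_hat={t_hat:.1f} generations")
```

Under these assumptions the estimate in coalescent units shrinks as N grows, while the estimate in generations is 1/L (about 200 generations for this length) whatever value of N is used, which is the point made above: the factor of N introduced by the scaling is removed again when converting back to real time.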