Razib Khan One-stop-shopping for all of my content

March 25, 2020

The Human Genome Diversity Project at high-coverage!

Filed under: Human Evolution,Population genomics — Razib Khan @ 1:19 am

After a few years of presentations and preprints, the new high-quality whole-genome analysis of the HGDP dataset is finally published in Science, Insights into human genetic variation and population history from 929 diverse genomes. The HGDP dates back 30 years, so this is the culmination of a long line of research. The authors sequenced nearly 1,000 HGDP individuals at high coverage, meaning that they had extremely good confidence in their calls of the state of each base across all 3 billion base pairs.

This is in contrast to the ~600,000 markers in the original HGDP analyses from the 2000s, which came from the results of a “SNP-array.” A SNP-array of this form focuses on variation by looking only at polymorphic sites (sites which vary in the population). How did they originally determine what was polymorphic? Unfortunately, they had to rely on European populations, so the original analyses were using a quite skewed measuring stick. Whole-genome analyses bypass these problems because you get the totality of sequence information, and the high coverage means you can confidently call very rare variants in some of these individuals (they’re not false positives).
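To see why depth matters for calling rare variants, here is a back-of-the-envelope sketch. It assumes per-site read depth follows a Poisson distribution around the mean coverage (real sequencing data is more dispersed than this), and the 4x, 30x, and 8-read figures are illustrative choices of mine, not numbers from the paper.

```python
import math


def prob_depth_below(mean_coverage: float, min_reads: int) -> float:
    """P(depth < min_reads) at a site, assuming depth ~ Poisson(mean_coverage).

    The Poisson model is a simplifying assumption; real depth varies more.
    """
    return sum(
        math.exp(-mean_coverage) * mean_coverage ** k / math.factorial(k)
        for k in range(min_reads)
    )


# Illustrative (hypothetical) depths: a low-coverage and a high-coverage genome.
for coverage in (4, 30):
    p = prob_depth_below(coverage, min_reads=8)
    print(f"{coverage}x mean coverage: P(fewer than 8 reads at a site) = {p:.3g}")
```

At low coverage most sites have too few reads to tell a real singleton from a sequencing error, while at high coverage almost every site clears the bar.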

The HGDP was assembled by L. L. Cavalli-Sforza and curated from ethnographically interesting populations. Therefore, it is useful to compare it to the 1000 Genomes, which tends to focus on more conventional populations. The 1000 Genomes has 2,500 individuals, sequenced at somewhat lower coverage on average. While this project yielded 70 million polymorphisms, the 1000 Genomes Project had 85 million. Most of these are rare. The power to detect rare polymorphisms is useful in elucidating population structure because rare polymorphisms tend to be evolutionarily new, and so reflect more recent differentiation.

For example, they compared Yoruba, Mbuti, and non-Africans. Looking at common polymorphisms the Yoruba are closer to non-Africans while looking at rare ones they are closer to Mbuti. Why? The rarer polymorphisms reflect recent differentiation, and there has been recent gene flow between Mbuti and Yoruba.

On the whole, they recapitulated earlier findings, but by using more sophisticated methods that leveraged their whole-genome data they added some wrinkles. For example, some populations diverged in a very sharp and distinct fashion, such as Han and Yakuts, or Druze and Sardinians. But for the populations that diverged between 150,000 and 50,000 years ago, mostly within Africa, the separations were more gradual and probably characterized by repeated gene flow between the descendant groups (e.g., non-Africans, Yoruba, Mbuti, San, etc.).

This reiterates that there isn’t a one-size-fits-all narrative we can use to talk about the emergence of modern populations and the way those populations are patterned. There are debates about whether we are a “clinal” species or not. I don’t think that’s a good question, because as implied in this paper a great deal of the past diversity has been collapsed through recent admixture events. The authors also detect deep and complex structure and differentiation. They’re clearly just scratching the surface.

Finally, there is more reiteration of the nature of Neanderthal and Denisovan admixture. The Neanderthals who mixed into early humans were either quite homogeneous or not very numerous. The haplotypes are not too numerous, and they don’t exhibit the patterns you’d expect from different admixtures and source populations. The diversity is too great to have come from a single individual, but it could have come from a small number. The main caution I would suggest here is that Neanderthals seem to often be quite homogeneous on the local scale.

The Denisovans are a different story. They detect the difference between Oceanian and non-Oceanian Denisovan ancestry (the Oceanian source Denisovans were quite distinct from the Altai Denisovans). But they also detect a different Denisovan contribution to the genomes of the Cambodians. The indigenous people of the Philippines also harbor different Denisovan ancestry (not in this paper). The “Denisovans” seem to have been a cluster of different lineages that persisted in parallel for a long time.

Where is there to go next with the HGDP? At some point, better technologies will allow for a more thorough exploration of structural variation. I’ve emphasized that this is an analysis of the sequence because that’s what it is. There is more information in non-sequence variation that they’ll get to one day (there was some structural analysis in this paper, but I believe that we are currently technology-limited).

July 17, 2019

The genomics of the Viking Age

Filed under: Population genomics,Vikings — Razib Khan @ 3:11 pm

A huge new preprint on Vikings (as well as the Bronze Age, Iron Age, and comparisons to moderns), Population genomics of the Viking world:

…we sequenced the genomes of 442 ancient humans from across Europe and Greenland ranging from the Bronze Age (c. 2400 BC) to the early modern period (c. 1600 CE), with particular emphasis on the Viking Age. We find that the period preceding the Viking Age was accompanied by foreign gene flow into Scandinavia from the south and east: spreading from Denmark and eastern Sweden to the rest of Scandinavia. Despite the close linguistic similarities of modern Scandinavian languages, we observe genetic structure within Scandinavia, suggesting that regional population differences were already present 1,000 years ago. We find evidence for a majority of Danish Viking presence in England, Swedish Viking presence in the Baltic, and Norwegian Viking presence in Ireland, Iceland, and Greenland. Additionally, we see substantial foreign European ancestry entering Scandinavia during the Viking Age. We also find that several of the members of the only archaeologically well-attested Viking expedition were close family members. By comparing Viking Scandinavian genomes with present-day Scandinavian genomes, we find that pigmentation-associated loci have undergone strong population differentiation during the last millennia. Finally, we are able to trace the allele frequency dynamics of positively selected loci with unprecedented detail, including the lactase persistence allele and various alleles associated with the immune response. We conclude that the Viking diaspora was characterized by substantial foreign engagement: distinct Viking populations influenced the genomic makeup of different regions of Europe, while Scandinavia also experienced increased contact with the rest of the continent.

A few notes:

– Though the broad patterns seem to have been established with the expansion between 3,000 and 2,500 BC from the Yamnaya steppe (at least in Northern Europe), some subtle details in genome-wide ancestry shifted in subsequent periods. This data set seems to show a decline in “Neolithic Farmer” and increase in hunter-gatherer and steppe ancestry after the Bronze Age, with some increase in the former by the Viking Age. This suggests that there is some sort of skew in sampling which misses populations enriched for hunter-gatherer ancestry (I suspect these groups live in the most marginal land and are the most mobile).

– There is structure by the Viking Age, which is not surprising. But the authors also report a few regions of southern Sweden where samples are enriched for Neolithic farmer ancestry down to the Viking Age, suggesting that even ancient structure wasn’t well mixed (yet).

– Most of the selection for the phenotype which characterizes modern-day Northern European populations seems to have been completed over the 2,000 years between the Bronze Age and the Viking Age.

July 1, 2019

The vines around the tree trunks

Filed under: Human Evolution,Population genomics — Razib Khan @ 10:22 pm


A lot of the understanding of scientific theories and models in the public domain is communicated by evocative metaphors and turns of phrase. For example, Charles Darwin famously wrote:

It is interesting to contemplate a tangled bank, clothed with many plants of many kinds, with birds singing on the bushes, with various insects flitting about, and with worms crawling through the damp earth, and to reflect that these elaborately constructed forms, so different from each other, and dependent upon each other in so complex a manner, have all been produced by laws acting around us….

When it comes to understanding the origin of our own species and the broader human lineage over the past two million years, I’ve started to come to a mental model of a weighted-graph with edges. Some of the edges traverse time and have strong weights. These are analogous to the normal phylogenetic tree model, representing phyletic gradualism and anagenesis along each branch before some bifurcation event. But, some of the edges move horizontally between others. These represent migration and/or gene flow between the primary lineages.
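For readers who do think in graphs, here is a minimal sketch of that mental model. The class, method names, and node labels are hypothetical, invented for illustration; the two gene-flow fractions are borrowed from the preprint abstract quoted further down, and the descent weights are simply what is left over.

```python
from dataclasses import dataclass, field


@dataclass
class LineageGraph:
    """Weighted directed graph of lineages (an illustrative toy, not a real tool)."""

    # Maps (source, target) -> (edge kind, weight).
    edges: dict = field(default_factory=dict)

    def add_descent(self, ancestor, descendant, weight):
        """A 'vertical' edge through time: the usual tree branch."""
        self.edges[(ancestor, descendant)] = ("descent", weight)

    def add_gene_flow(self, donor, recipient, fraction):
        """A 'horizontal' edge: migration or introgression between lineages."""
        self.edges[(donor, recipient)] = ("gene_flow", fraction)

    def sources_of(self, node):
        """Everything that contributes ancestry to a node, with edge kind and weight."""
        return {src: kw for (src, dst), kw in self.edges.items() if dst == node}


graph = LineageGraph()
graph.add_descent("Neanderthal-Denisovan ancestor", "Neanderthal", 0.97)
graph.add_gene_flow("ancient humans", "Neanderthal", 0.03)
graph.add_descent("Neanderthal-Denisovan ancestor", "Denisovan", 0.99)
graph.add_gene_flow("super-archaic lineage", "Denisovan", 0.01)

for source, (kind, weight) in graph.sources_of("Neanderthal").items():
    print(f"{source} -> Neanderthal: {kind}, weight {weight}")
```

The strongly weighted descent edges are the tree; the lightly weighted gene-flow edges are what a pure tree model leaves out.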

I’m not sure though that a graph-theory-derived mental model helps many people, so I’ll use another one: imagine large trunks defining the primary lineages, and vines tying them together representing gene flow events. The above figure is from a new preprint, Mapping gene flow between ancient hominins through demography-aware inference of the ancestral recombination graph. This is a methods-heavy preprint. It utilizes an “ancestral recombination graph” (so a model of the genealogy of genes in the genome) and MCMC to generate Bayesian probabilities of particular events (e.g., introgression of a lineage that diverged x years ago at fraction y).

The abstract presents some specific findings:

…While much attention has been paid to the relatively recent gene flow from Neanderthals and Denisovans into modern humans, other instances of introgression leave more subtle genomic evidence and have received less attention. Here, we present an extended version of the ARGweaver algorithm, ARGweaver-D, which can infer local genetic relationships under a user-defined demographic model that includes population splits and migration events. This Bayesian algorithm probabilistically samples ancestral recombination graphs (ARGs) that specify not only tree topology and branch lengths along the genome, but also indicate migrant lineages…We show that this method is well powered to detect the archaic migration into modern humans, even with only a few samples…We apply it to human, Neanderthal, and Denisovan genomes, looking for signatures of older proposed migration events, including ancient humans into Neanderthal, and unknown archaic hominins into Denisovans. We identify 3% of the Neanderthal genome that is putatively introgressed from ancient humans, and estimate that the gene flow occurred between 200-300kya. We find no convincing evidence that negative selection acted against these regions. We also identify 1% of the Denisovan genome which was likely introgressed from an unsequenced hominin ancestor, and note that 15% of these regions have been passed on to modern humans through subsequent gene flow.

ARGweaver-D is gnarly. Not in a bad way. But you should never really trust computational wizardry of this sort unless you’ve taken it for a test drive, or it’s been around for decades and people have validated it. A “play with the parameters” phase is necessary for these packages to become more than magic.

That being said, for about half a decade people have been detecting evidence of a “super-archaic” lineage within Denisovans. This is just another confirmation with another method. The super-archaic hypothesis seems plausible as an explanation of the patterns in the data (there may be other explanations). Second, there’s a lot of circumstantial evidence for gene flow into Neanderthals from moderns. E.g., mtDNA replacement in Neanderthals. Though not in the abstract, the preprint mentions the likelihood of “super-archaic” introgression into Neanderthals as well. From a recent ancient DNA paper on Nuclear DNA from two early Neandertals reveals 80,000 years of genetic continuity in Europe:

We find that population split times between HST and other Neandertals of less than 150 ka ago make the occurrence of a mitochondrial time to the most recent common ancestor (TMRCA) of 270 ka ago unlikely (1.2% of all simulated loci have such a deep TMRCA; note S11). We note that this result is robust to uncertainties in the estimates of the Neandertal population size and of the mitochondrial TMRCA (note S11). The presence of this deeply divergent mtDNA in HST thus suggests a more complex scenario in which HST carries some ancestry from a genetically distant population.

It seems entirely likely that we’re going to see “shadows of forgotten ancestors” in our genomes. But wait, there’s more!

…ARGweaver-D only detected a small amount of Sup→Afr introgression, which was somewhat lower than our estimated false positive rate. One aspect to note here is that the power to identify introgression from an unsequenced population is highly dependent on the population size of the recipient population. The larger the population, the deeper the coalescences are within that population, making it more difficult to discern which long branches might be explained by super-archaic introgression…If we had used a smaller population size, ARGweaver-D would have produced more Sup→Afr predictions, but most of these would be false positives unless that smaller population size is closer to the truth. Overall, we caution that the problem of detecting super-archaic introgression into a large and structured population such as Africas is very difficult and that claims of such introgression need to be robust to the demographic model used in analysis. It may not be possible to address the question of ancient introgression into Africans without directly sequencing fossils from the introgressing population.

In northern Eurasia, in particular, one might imagine a scenario with large fluctuations in population size, and patchy landscapes. This would reduce gene flow between populations, and also foster drift to produce distinct lineages. Simple stylized models of gene flow at particular times across disparate lineages make a great deal of sense in this context. But if Africa had larger populations of humans, with more interconnected networks with continuous, if variable, levels of gene flow, then the stylized models will mislead in important features.
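The population-size point in the quoted passage can be made concrete with the textbook neutral coalescent. This is a rough sketch assuming a constant-size, randomly mating diploid population, which is exactly the kind of stylized model that may not hold in Africa; the sizes and the 20,000-generation horizon are illustrative numbers of mine, not estimates from the preprint.

```python
import math


def prob_uncoalesced(generations: float, diploid_size: float) -> float:
    """P(two lineages have not yet found a common ancestor after `generations`).

    Standard neutral coalescent with constant diploid size N: pairs coalesce
    at rate 1 / (2N) per generation, so survival is exp(-t / (2N)).
    """
    return math.exp(-generations / (2.0 * diploid_size))


def expected_pairwise_tmrca(diploid_size: float) -> float:
    """Expected time to common ancestry for two lineages, in generations (2N)."""
    return 2.0 * diploid_size


# Illustrative (hypothetical) sizes: a small population versus a large one.
for size in (2_000, 20_000):
    p_deep = prob_uncoalesced(20_000, size)
    print(
        f"N = {size}: expected pairwise TMRCA = {expected_pairwise_tmrca(size):.0f} "
        f"generations, P(still uncoalesced at 20,000 generations) = {p_deep:.3g}"
    )
```

In the small population a very long branch is rare enough to look like a signal of outside ancestry; in the large one, long branches are ordinary, so they carry little information about introgression.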

This preprint is likely reporting some true robust results that will hold up. But I think the bigger picture is that it will lead us toward moving beyond the extremely simple models in vogue a generation ago, to a more subtle understanding of complex emergence and collapse of human population structure over the last two million years.

November 3, 2018

It’s raining selective sweeps

Filed under: Population genetics,Population genomics,Selection — Razib Khan @ 11:44 pm

A week ago a very cool new preprint came out, Identifying loci under positive selection in complex population histories. It’s something that you couldn’t have even imagined just ten years ago. The authors basically figure out ways to identify deviations of markers from the allele frequency expected under a null neutral evolutionary model. The method is put first, which I really like, before getting to results or discussion. Additionally, they did a lot of simulation ahead of time. This is the sort of simulation that was really not possible before we had the sort of computational resources we have now.

Here’s the abstract:

Detailed modeling of a species’ history is of prime importance for understanding how natural selection operates over time. Most methods designed to detect positive selection along sequenced genomes, however, use simplified representations of past histories as null models of genetic drift. Here, we present the first method that can detect signatures of strong local adaptation across the genome using arbitrarily complex admixture graphs, which are typically used to describe the history of past divergence and admixture events among any number of populations. The method – called Graph-aware Retrieval of Selective Sweeps (GRoSS) – has good power to detect loci in the genome with strong evidence for past selective sweeps and can also identify which branch of the graph was most affected by the sweep. As evidence of its utility, we apply the method to bovine, codfish and human population genomic data containing multiple population panels related in complex ways. We find new candidate genes for important adaptive functions, including immunity and metabolism in under-studied human populations, as well as muscle mass, milk production and tameness in particular bovine breeds. We are also able to pinpoint the emergence of large regions of differentiation due to inversions in the history of Atlantic codfish.
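The core idea, flagging markers whose allele frequencies deviate more than neutral drift would predict, can be sketched in a few lines. To be clear, this is not the GRoSS algorithm, which works branch by branch on an admixture graph; it is a deliberately crude two-population stand-in, and the function name and the frequencies below are made up for illustration.

```python
import math


def neutral_outlier_pvalues(freqs_a, freqs_b):
    """Approximate p-values for per-site frequency differences between two populations.

    Crude assumption: after standardizing by sqrt(p * (1 - p)), the differences
    at neutral sites share one variance, estimated here as the genome-wide mean
    squared difference (a stand-in for the amount of drift). The scaled squared
    difference is then compared to a chi-squared distribution with 1 degree of
    freedom. Sites must be polymorphic (mean frequency strictly between 0 and 1).
    """
    standardized = []
    for p_a, p_b in zip(freqs_a, freqs_b):
        mean = (p_a + p_b) / 2.0
        standardized.append((p_a - p_b) / math.sqrt(mean * (1.0 - mean)))

    drift_scale = sum(z * z for z in standardized) / len(standardized)
    if drift_scale == 0:
        return [1.0] * len(standardized)

    # Survival function of chi-squared (1 df): P(X > x) = erfc(sqrt(x / 2)).
    return [math.erfc(math.sqrt(z * z / drift_scale / 2.0)) for z in standardized]


# Hypothetical allele frequencies at eight sites; the last one differs sharply.
population_a = [0.50, 0.30, 0.72, 0.41, 0.60, 0.25, 0.55, 0.10]
population_b = [0.52, 0.27, 0.70, 0.44, 0.58, 0.27, 0.53, 0.85]

pvalues = neutral_outlier_pvalues(population_a, population_b)
for index, p in enumerate(pvalues):
    print(f"site {index}: p = {p:.3g}")
```

One obvious weakness: the outlier itself inflates the drift estimate, which is part of why real methods lean on an explicit demographic model and a lot of simulation.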

On a related note in regards to selection, On the well-founded enthusiasm for soft sweeps in humans: a reply to Harris, Sackman, and Jensen. The authors are responding to a recent preprint criticizing their earlier work. The reason that it’s fascinating to me is that these sorts of arguments today are really concrete and not so theoretical. There’s a lot of data for analytic techniques to chew through, and computation has really transformed the possibilities.

A generation ago these sorts of debates would be a sequence of “you’re wrong!” vs. “no, you’re wrong!” Today the disputes involve a lot of data, and so have a reasonable chance of resolution.

The first preprint identifies the usual candidates in humans that you normally see, and expected targets in cattle and cod. Sure, that will give biologists more interested in mechanisms and pathways things to chew upon, but imagine once researchers have large numbers of genomes for thousands and thousands of species. Then they’ll be testing deviations from neutral allele frequencies across many trees, and getting a more general and abstract sense of the parameter space that selection explores, conditional on particularities of evolutionary history.

This is why I’m excited about plans to sequence lots and lots of species.

October 6, 2018

The population genetic structure of China (through noninvasive prenatal testing)


This week a big whole genome analysis of China was published in Cell, Genomic Analyses from Non-invasive Prenatal Testing Reveal Genetic Associations, Patterns of Viral Infections, and Chinese Population History. The abstract:

We analyze whole-genome sequencing data from 141,431 Chinese women generated for non-invasive prenatal testing (NIPT). We use these data to characterize the population genetic structure and to investigate genetic associations with maternal and infectious traits. We show that the present day distribution of alleles is a function of both ancient migration and very recent population movements. We reveal novel phenotype-genotype associations, including several replicated associations with height and BMI, an association between maternal age and EMB, and between twin pregnancy and NRG1. Finally, we identify a unique pattern of circulating viral DNA in plasma with high prevalence of hepatitis B and other clinically relevant maternal infections. A GWAS for viral infections identifies an exceptionally strong association between integrated herpesvirus 6 and MOV10L1, which affects piwi-interacting RNA (piRNA) processing and PIWI protein function. These findings demonstrate the great value and potential of accumulating NIPT data for worldwide medical and genetic analyses.

In The New York Times write-up there is an interesting detail, “This study served as proof-of-concept, he added. His team is moving forward on evaluating prenatal testing data from more than 3.5 million Chinese people.” So what he’s saying is that this study with >100,000 individuals is a “pilot study.” Let that sink in.

The PCA at the top of the post is a bit busy, so I want to highlight the salient aspect. These results confirm that 5-10% of the ancestry of the Hui, Chinese-speaking Muslims, is West Eurasian. The Uygur and Kazakh, on the left of the plot, are about 40%. The authors note that the Manchus overlapped almost perfectly with individuals sampled from Northern China. This is expected because by the end of the Qing dynasty most of the Manchus had been fully Sinicized, and in the 20th century they were fully assimilated. Recently, due to an emphasis on “national minorities” and some privileges granted therein, many people have identified as Manchu on the basis of some ancestry, though they are in all other ways simply northern Han (the Manchu language is moribund).

The sections on particular adaptations which vary by region are not surprising. In books like The Retreat of the Elephants the slow, gradual, and inexorable expansion of the Chinese beyond the Yangzi basin is described in a way that makes it clear that southern diseases and climate were a major impediment. But through a process of acclimation, assimilation of local peoples, and adaptation, by 1000 AD the center of demographic gravity had shifted to the south.

There is a section of the text which I think will be falsified though:

After removing participants with 49bp read length and with sequencing error rate >0.00325, a principal component analysis of 45,387 self-reported Han Chinese from the 31 administrative divisions showed that the greatest differentiation of Han Chinese is along a latitudinal gradient (Figures S3E and S3F), consistent with previous studies (Chen et al., 2009, Xu et al., 2009). In contrast, there is, perhaps surprisingly, very little differentiation from East to West. This observation may be explained by the fact that a large proportion of the western Han populations in China are recent immigrants organized by the central government starting from 1949 when the People’s Republic of China was founded (Liang and White, 1996).

I don’t think there’s any need to make recourse to migration from 1949 and after. The argument in Guns, Germs, and Steel suffices: it’s just easier to move east or west along the same latitude than to move north or south across latitudes. The people of the north eat noodles made from wheat, and the people of the south eat rice. This is a big cultural transition for peasants to make, and so it didn’t happen as often as moving to the coast, or inland. We have documented instances of mass migrations from adjacent provinces due to famine and political instability. In the 17th century, conflicts resulted in the depopulation of Sichuan and the arrival of large numbers of people from Hunan and Hubei to the east.

The plot below is one of the more interesting ones from the paper. From left to right: private alleles from the HapMap Utah whites (CEU) that are also found among all individuals in a given province, then among just the Han; then private alleles from ethnic Telugu Indians (from South India) that are also found among all individuals in a given province, and then among just the Han.

The first thing to notice is that there is a correlation between the Han and non-Han. This shouldn’t be surprising. Plenty of ethnic groups have become Han through acculturation and become demographically absorbed. This is probably truer in parts of the south than in the north, but southern Chinese ethnic minorities are genetically and culturally much more like the Han in the first place.

The sharing of private alleles with Northern Europeans (CEU) almost certainly has to do with the interaction sphere of the steppe pastoralists, which extends from the Carpathians to Mongolia. The relatively high frequency of R1a, and to a lesser extent R1b, among many Turkic/Central Asian peoples is a pretty good sign of where this West Eurasian ancestry comes from.

The Indian affinity is perhaps more interesting. To be honest I was surprised at the high affinity in Yunnan and Hainan. Tibet has strong cultural connections to India through its form of Buddhism. But it’s interesting that Qinghai, where many Tibetans also live, does not have the affinity with India. What’s going on in the other provinces? I suspect that the aboriginal peoples assimilated by the Han and other groups in this region probably had some distant connections to the non-West Eurasian ancestry in South Asia.
