# Razib Khan: One-stop-shopping for all of my content

## September 18, 2017

### Population structure in Neanderthals leads to genetic homogeneity

Filed under: Human Genetics,Paleoanthropology — Razib Khan @ 11:26 pm

The above tweet is in response to an article which reports on a finding published this past month in PNAS, Early history of Neanderthals and Denisovans. It’s open access, you should read it. I don’t think I’ve reviewed it because I haven’t dug through the supplements. To be frank this is a paper where you pretty much have to read the supplements because they’re introducing a somewhat different model here than is the norm.

I talked to Alan Rogers at SMBE about this paper. Broadly, I think there might be something to it, and it’s because of what David says above. It is simply hard to imagine that Neanderthals could be extremely successful with such low genetic diversity as we see, and spread so thin. Now, the Quanta Magazine article tries to emphasize that the effective population is not the true census population, but I wish it would have explained it more clearly. Basically, the size that is relevant for breeding is obviously not going to be the same as a head count. And, because effective populations are highly sensitive to bottlenecks, you can get really small numbers even when the extant population at any given time may be large.
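To make the bottleneck point concrete, here is a minimal sketch using the standard textbook approximation that long-term effective size behaves like the harmonic mean of the per-generation sizes. The function name and the numbers are made up for illustration; none of this comes from the PNAS paper.

```python
def harmonic_mean_ne(sizes):
    """Approximate long-term effective size as the harmonic mean
    of per-generation breeding sizes (textbook approximation)."""
    if not sizes or any(n <= 0 for n in sizes):
        raise ValueError("sizes must be a non-empty list of positive numbers")
    return len(sizes) / sum(1.0 / n for n in sizes)


# Hypothetical history: 99 generations at 10,000 and one bottleneck of 50.
sizes = [10_000] * 99 + [50]
arithmetic_mean = sum(sizes) / len(sizes)   # the typical head count
ne = harmonic_mean_ne(sizes)                # what drift actually "feels"

print(f"arithmetic mean: {arithmetic_mean:.1f}")
print(f"harmonic mean (Ne): {ne:.1f}")
```

A single bad generation out of a hundred pulls the harmonic mean down to roughly a third of the typical head count, which is the sense in which effective size is dominated by bottlenecks.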

The PNAS paper makes some novel inferences, and I’ll set that to the side until I read the supplements. But I don’t think it’s crazy that population structure within Neanderthals could be leading to lower total genetic diversity.

### Release the UK Biobank! (the prediction of height edition)

Filed under: Genomic prediction,Human Genetics,Human Genomics,UK Biobank — Razib Khan @ 9:25 pm

There’s so much science coming out of the UK Biobank it’s not even funny. It’s like getting the palantír or something.

Anyway, a preprint, submitted for your approval. A vision of things to come? Accurate Genomic Prediction Of Human Height:

We construct genomic predictors for heritable and extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). Replication tests show that these predictors capture, respectively, ~40, 20, and 9 percent of total variance for the three traits. For example, predicted heights correlate ~0.65 with actual height; actual heights of most individuals in validation samples are within a few cm of the prediction. The variance captured for height is comparable to the estimated SNP heritability from GCTA (GREML) analysis, and seems to be close to its asymptotic value (i.e., as sample size goes to infinity), suggesting that we have captured most of the heritability for the SNPs used. Thus, our results resolve the common SNP portion of the “missing heritability” problem – i.e., the gap between prediction R-squared and SNP heritability. The ~20k activated SNPs in our height predictor reveal the genetic architecture of human height, at least for common SNPs. Our primary dataset is the UK Biobank cohort, comprised of almost 500k individual genotypes with multiple phenotypes. We also use other datasets and SNPs found in earlier GWAS for out-of-sample validation of our results.

A scatter-plot is worth a thousand derivations.

You know what’s better than 500,000 samples? One billion samples! A nerd can dream….

## September 17, 2017

### Massive genomic sample sizes = detecting evolution in real time

Filed under: Human Genetics — Razib Khan @ 7:22 pm

The recent PLOS BIOLOGY paper, Identifying genetic variants that affect viability in large cohorts, seems to have triggered a feeding frenzy in the media. For example, Big Think has put up Researchers Find Evidence That Human Evolution Is Still Actively Happening.

I wasn’t paying close attention because of course human evolution is still happening actively. From a genetic perspective, evolution is just change in allele frequencies. Populations aren’t infinite, so even if there wasn’t any selection, stochastic forces would shift allele frequencies. But of course selection is probably happening. For adaptation by natural selection to occur you need heritable variation on a trait where there are fitness differences as a function of variation within the population. It seems implausible that these conditions don’t still apply. There’s plenty of fitness variation in the population, and it’s unlikely to be random as a function of heritable variation.
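The point that finite populations drift even without selection can be sketched with a bare-bones Wright-Fisher simulation. This is a toy under the usual idealized assumptions (constant size, random mating, no selection or mutation); the function name, parameters, and seed are arbitrary choices for illustration.

```python
import random


def wright_fisher(p0, pop_size, generations, seed=0):
    """Simulate neutral drift of one allele in a diploid population.

    Each generation, 2 * pop_size gene copies are drawn at random
    from the previous generation's allele frequency.
    """
    rng = random.Random(seed)
    copies = 2 * pop_size
    p = p0
    freqs = [p]
    for _ in range(generations):
        count = sum(1 for _ in range(copies) if rng.random() < p)
        p = count / copies
        freqs.append(p)
    return freqs


# Hypothetical run: start at 50% in a population of 100 diploids.
trajectory = wright_fisher(p0=0.5, pop_size=100, generations=50, seed=1)
print(trajectory[0], trajectory[-1])
```

No fitness differences are coded anywhere, yet the frequency wanders away from its starting value; smaller `pop_size` values make the wandering faster.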

But the devil is in the details. And last year Field et al. used the modern genomic tools available to detect selection occurring over the past 2,000 years. It is not credible that it would have magically stopped a few centuries ago.

So why is this new paper such a big deal? (note that it’s in PLOS BIOLOGY, not PLOS GENETICS) Because the method they use is ingenious and simple. Basically, they’re looking at changes in allele frequencies as a function of age in huge populations. It’s a little more complicated than that, they used a logistic regression to control for some of the other variables. But they found some biologically plausible hits with their data set of 50,000-150,000. And, they replicated their hits from a European sample to a non-European one.
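The core intuition, stripped of the regression machinery, is just tabulating allele frequency by age. The sketch below does only that on a made-up toy data set; it is not the paper's method (which, as noted, used logistic regression with covariates), and the function and variable names are hypothetical.

```python
from collections import defaultdict


def allele_freq_by_age_bin(samples, bin_width=10):
    """Return {bin_start: alternate-allele frequency} for diploid samples.

    `samples` is an iterable of (age, alt_count) pairs, where alt_count
    is the number of alternate alleles carried (0, 1, or 2).
    """
    totals = defaultdict(lambda: [0, 0])  # bin start -> [alt alleles, all alleles]
    for age, alt_count in samples:
        start = (age // bin_width) * bin_width
        totals[start][0] += alt_count
        totals[start][1] += 2
    return {start: alt / total for start, (alt, total) in sorted(totals.items())}


# Toy data (invented): the allele is rarer among the oldest individuals.
toy = [(45, 1), (48, 2), (52, 1), (57, 1), (83, 0), (86, 1), (88, 0), (89, 0)]
freqs = allele_freq_by_age_bin(toy, bin_width=20)
print(freqs)
```

If an allele reduces survival, carriers are progressively under-represented in older age bins. With real data you would need very large samples, and controls for things like ancestry and birth cohort, before reading anything into such a trend.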

This does bring me back to a discussion I observed a while back. An evolutionary geneticist who works with Drosophila mentioned offhand that in his field there really wasn’t that much of a need for more data. They could spend all their time doing analysis. A prominent human geneticist whose work focused on biomedicine piped up that that wasn’t true at all for their field. There are some differences in the scientific questions, but there are also differences in terms of what you can do with humans as a model organism.

In the paper they look forward to the day when sample sizes are an order of magnitude beyond where they are now. At some point in the near future, large fractions of entire nations will be sequenced at medical grade level (30x coverage).

Anyway, you should read Identifying genetic variants that affect viability in large cohorts. It’s pretty straightforward.

## September 14, 2017

### After agriculture, before bronze

The above plot shows genetic distance/variation between highland and lowland populations in Papua New Guinea (PNG). It is from a paper in Science that I have been anticipating for a few months (I talked to the first author at SMBE), A Neolithic expansion, but strong genetic structure, in the independent history of New Guinea.

What does “strong genetic structure” mean? Basically Fst is showing the proportion of genetic variation which is partitioned between groups. Intuitively it is easy to understand, in that if ~1% of the genetic variation is partitioned between groups in one case, and ~10% in another, then it is reasonable to suppose that the genetic distance between groups in the second case is larger than in the first case. On a continental scale Fst between populations is often on the order of ~0.10. That is the value for example when you pool the variation amongst Northern Europeans and Chinese, and assess how much of it can be apportioned in a manner which differentiates populations (so it’s about ~10% of the variation).
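As a rough illustration of "proportion of variation partitioned between groups," here is a single-locus sketch of Wright's Fst for two equally weighted populations, computed as (H_T − H_S) / H_T from expected heterozygosities. The allele frequencies are invented, and real analyses use estimators that average over many loci and correct for sample size, so treat this as a toy rather than what the paper computed.

```python
def fst_two_pops(p1, p2):
    """Single-locus Fst for two equally weighted populations.

    p1 and p2 are the frequencies of one allele at a biallelic locus.
    H_T is the expected heterozygosity of the pooled population,
    H_S the average expected heterozygosity within populations.
    """
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    if h_t == 0:
        return 0.0  # locus is fixed for the same allele in both populations
    return (h_t - h_s) / h_t


# Hypothetical frequencies: a modest difference vs. a larger one.
print(fst_two_pops(0.45, 0.55))  # small differentiation, about 0.01
print(fst_two_pops(0.35, 0.65))  # about 0.09, near continental-scale values
```

Note how the frequencies have to differ by quite a lot at a locus before Fst approaches ~0.10; most variation is still shared within each group.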

This is why ancient DNA results which reported that Mesolithic hunter-gatherers and Neolithic farmers in Central Europe who coexisted in rough proximity for thousands of years exhibited differences on the order of ~0.10 elicited surprise. These are values we are now expecting from continental-scale comparisons. Perhaps an appropriate analogy might be the coexistence of Pygmy groups and Bantu agriculturalists? Though there is some gene flow, the two populations exist in symbiosis and exhibit local ecological segregation.

In PNG continental scale Fst values are also seen among indigenous people. The differences between the peoples who live in the highlands and lowlands of PNG are equivalent to those between huge regions of Eurasia. This is not entirely surprising because there has been non-trivial gene flow into lowland populations from Austronesian groups, such as the Lapita culture. Many lowland groups even speak Austronesian languages today.

Using standard ADMIXTURE analysis the paper shows that many lowland groups have significant East Asian ancestry (red), while none of the highland groups do (some individuals with East Asian admixture seem to be due to very recent gene flow). But even within the highlands the genetic differences are striking. The Fst values between Finns and Southern European groups such as Spaniards are very high in a European context (due to Finnish Siberian ancestry as well as drift through a bottleneck), but most comparisons within the highland groups in PNG still exceed this.

The paper also argues that genetic differences between Papuans and the natives of Australia pre-date the rising sea levels at the beginning of the Holocene, when Sahul divided between its various constituents. This is not entirely surprising considering that the ecology of the highlands during the Pleistocene would have been considerably different from Australia to the south, resulting in sharp differences in the hunter-gatherer lifestyles. Additionally, there does not seem to have been a genetic cline. Papuans are symmetrically related to all Australian groups they had samples from.

Using coalescence-based genomic methods they inferred that separation between highlands and some lowland groups occurred ~10-20,000 years ago. That is, after the Last Glacial Maximum. For the highlands, the differences seem to date to within the last 10,000 years. The Holocene. Additionally, they see population increases in the highlands, correlating with the shift to agriculture (cultivation of taro).

None of the above is entirely surprising, though I would take the date inferences with a grain of salt. The key is to observe that large genetic differences, as well as cultural differences, accrued in the highlands of PNG during the Holocene. In the paper they have a social and cultural explanation for what’s going on:

Fst values in PNG fall between those of hunter-gatherers and present-day populations of west Eurasia, suggesting that a transition to cultivation alone does not necessarily lead to genetic homogenization.

A key difference might be that PNG had no Bronze Age, which in west Eurasia was driven by an expansion of herders and led to massive population replacement, admixture, and cultural and linguistic change (7, 8), or Iron Age such as that linked to the expansion of Bantu-speaking farmers in Africa (24). Such cultural events have resulted in rapid Y-chromosome lineage expansions due to increased male reproductive variance (25), but we consistently find no evidence for this in PNG (fig. S13). Thus, in PNG, we may be seeing the genetic, linguistic, and cultural diversity that sedentary human societies can achieve in the absence of massive technology-driven expansions.

Peter Turchin in books like Ultrasociety has argued that one of the theses in Steven Pinker’s The Better Angels of Our Nature is incorrect: violence has not decreased monotonically, but rather peaked in less complex agricultural societies. PNG is clearly a case of this, as endemic warfare was a feature of highland societies when they encountered Europeans. Lawrence Keeley’s War Before Civilization: The Myth of the Peaceful Savage gives so much attention to highland PNG because it is a contemporary illustration of a Neolithic society which until recently had not developed state-level institutions.

What papers like these are showing is that cultural and anthropological dynamics strongly shape the nature of genetic variation among humans. Simple models which assume as a null hypothesis that gene flow occurs through diffusion processes across a landscape where only geographic obstacles are relevant simply do not capture enough of the dynamic. Human cultures strongly shape the nature of interactions, and therefore the genetic variation we see around us.

## September 11, 2017

### Inbreeding causing issues in Osama bin Laden’s family

Filed under: Human Genetics,Osama bin Laden — Razib Khan @ 5:07 pm

I didn’t figure I would have to say much about 9/11 really that others could not say (aside from perhaps you should read Marc Sageman’s Understanding Terror Networks if you want an ethnography of the Salafi jihadist movement which led to al-Qaeda). But The Daily Beast has a profile of one of Osama bin Laden’s sons:

Moreover, by this time, bin Laden already had two wives. But Najwa, the first of them, encouraged him to pursue Khairia, believing that having someone with her training permanently on hand would help her son Saad and his brothers and sisters, some of whom also suffered from developmental disorders.

Osama bin Laden had some two dozen children. But it was strange to me to see mention of several children with developmental disorders. Inbreeding is a major burden for Arab Muslim societies. And sure enough, Osama bin Laden’s first wife was his first cousin. She gave birth to around 10 children. Her father was Osama bin Laden’s mother’s brother. With the possibility of several generations of cousin marriage their relatedness may have been closer than normal half-siblings.
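For reference, the baseline numbers can be worked out with the standard path-counting rule: each path of L parent-child links through a shared ancestor contributes (1/2)^L to the coefficient of relationship. The sketch below assumes the shared ancestors are themselves unrelated and not inbred, which is exactly the assumption that repeated cousin marriage violates; the function name is just for illustration.

```python
def relatedness(path_lengths):
    """Coefficient of relationship from pedigree paths.

    Each entry is the number of parent-child links on one path connecting
    two relatives through a common ancestor. Assumes the common ancestors
    are unrelated to each other and not inbred.
    """
    return sum(0.5 ** length for length in path_lengths)


half_siblings = relatedness([2])                  # one shared parent
first_cousins = relatedness([4, 4])               # two shared grandparents
double_first_cousins = relatedness([4, 4, 4, 4])  # four shared grandparents

# Expected inbreeding coefficient of a child of first cousins.
child_inbreeding = first_cousins / 2

print(half_siblings, first_cousins, double_first_cousins, child_inbreeding)
```

Ordinary first cousins come out at 1/8, against 1/4 for half-siblings. Every additional loop of shared ancestry from earlier cousin marriages adds more paths, so in a heavily consanguineous pedigree the realized figure can drift upward from the 1/8 baseline.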

Note: Osama bin Laden’s father was from Yemen and his mother from Syria. So he was most certainly not inbred.

## August 29, 2017

### Why do percentage estimates of “ancestry” vary so much?

Filed under: Genetics,Human Genetics — Razib Khan @ 10:36 pm

When looking at the results in Ancestry DNA, 23andMe, and Family Tree DNA my “East Asian” percentage is:

– 19%
– 13%
– 6%

What’s going on here? In science we often make a distinction between precision and accuracy. Precision is how much your results vary when you re-run an experiment or measurement. Basically, can you reproduce your result? Accuracy refers to how close your measurement is to the true value. A measurement can be quite precise, but consistently off. Similarly, a measurement may be imprecise, but it bounces around the true value…so it is reasonably accurate if you get enough measurements to cancel out the errors (which are random).
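A quick numerical sketch of the distinction, with invented measurements of a quantity whose true value is taken to be 10: the average offset from the truth stands in for (in)accuracy and the standard deviation for (im)precision.

```python
from statistics import mean, pstdev


def bias_and_spread(measurements, true_value):
    """Return (bias, spread): mean offset from the truth and the
    population standard deviation of the measurements."""
    return mean(measurements) - true_value, pstdev(measurements)


true_value = 10.0
precise_but_off = [12.0, 12.1, 11.9, 12.0]   # tight cluster, wrong place
noisy_but_centered = [7.0, 13.0, 9.0, 11.0]  # scattered, right on average

bias_a, spread_a = bias_and_spread(precise_but_off, true_value)
bias_b, spread_b = bias_and_spread(noisy_but_centered, true_value)

print(f"precise but off:    bias={bias_a:.2f}, spread={spread_a:.2f}")
print(f"noisy but centered: bias={bias_b:.2f}, spread={spread_b:.2f}")
```

Averaging more measurements helps the second case, because the random errors cancel, but does nothing for the first, where the error is systematic.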

The values above are precise. That is, if you got re-tested on a different chip, the results aren’t going to be much different. The tests are using as input variation on 100,000 to 1 million markers, so a small proportion will give different calls than in the earlier test. But that’s not going to change the end result in most instances, even though these methods often have a stochastic element.

But what about accuracy? I am not sure that old chestnuts about accuracy apply in this case, because the percentages that these services provide are summaries and distillations of the underlying variation. The model of precision and accuracy that I learned would be more applicable to the DNA SNP array which returns calls on the variants; that is, how close are the calls of the variant to the true value (last I checked these arrays are around 99.5% accurate in terms of matching the true state).

What you see when these services pop out a percentage for a given ancestry is the outcome of a series of conscious choices that designers of these tests made keeping in mind what they wanted to get out of these tests. At a high level here’s what’s going on:

1. You have a model of human population history and dynamics with various parameters
2. You have data that varies that you put into that model
3. You have results which come back with values which are the best fit of that data to the model you specified

Basically you are asking the computational framework a question, and it is returning its best answer to the question posed. To ask whether the answer is accurate or not is almost not even wrong. The frameworks vary because they are constructed by humans with different preferences and goals.

Almost, but not totally wrong. You can for example simulate populations whose histories you know, and then test the models on the data you generated. Since you already know the “truth” about the simulated data’s population structure and history, you can see how well your framework can infer what you already know from the patterns of variation in the generated data.

Going back to my results, why do my East Asian percentages vary so much? The short answer is that one of the major variables in the model alluded to above is the nature of the reference population set and the labels you give them.

Looking at Bengalis, the ethnic group I’m from, it is clear that in comparison to other South Asian populations they are East Asian shifted. That is, it seems clear I do have some East Asian ancestry. But how much?

The “simple” answer is to model my ancestry as a mix of two populations, an Indian one and an East Asian one, and then see what the values are for my ancestry across the two components. But here is where semantics becomes important: what is Indian and East Asian? Remember, these are just labels we give to groups of people who share genetic affinities. The labels aren’t “real”, the reality is in the raw read of the sequence. But humans are not capable of really getting anything from millions of raw SNPs assigned to individuals. We have to summarize and re-digest the data.

The simplest explanation for what’s going on here is that the different companies have different populations put into the boxes which are “Indian/South Asian” and “East Asian.” If you are using fundamentally different measuring sticks, then there are going to be problems with doing apples to apples comparisons.

My personal experience is that 23andMe tends to give very high percentages of South Asian ancestry for all South Asians. Because “South Asian” is a very diverse category when tests come back that someone is 95-99% South Asian…it’s not really telling you much. In contrast, some of the other services may be using a small subset of South Asians, who they define as “more typical”, and so giving lower percentages to people from Pakistan and Bengal, who have admixture from neighboring regions to the west and east respectively.*

Something similar can occur with East Asian ancestry. If the “donor” ancestral groups are South Asian and East Asian for me, then the proportions of each are going to vary by how close the donor groups selected by the company are to the true ancestral group. If, for example, Family Tree DNA chose a more Northeastern Asian population than Ancestry DNA, then my East Asian percentage would vary between the two services because I know my East Asian ancestry is more Southeast Asian.
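Here is a toy version of that effect. The target's allele frequencies are constructed as a known 85/15 mix of two made-up source populations, and then the mixing proportion is re-estimated by least squares against two different "East Asian" reference panels. All frequencies and names are invented, and commercial tests use far more elaborate models than this two-source least-squares fit; the only point is that the same data return different percentages when the reference changes.

```python
def admixture_fraction(target, ref_a, ref_b):
    """Least-squares estimate of alpha in: target ≈ alpha * ref_a + (1 - alpha) * ref_b.

    Each argument is a list of allele frequencies at the same markers.
    """
    num = sum((t - b) * (a - b) for t, a, b in zip(target, ref_a, ref_b))
    den = sum((a - b) ** 2 for a, b in zip(ref_a, ref_b))
    if den == 0:
        raise ValueError("reference populations are identical at these markers")
    return num / den


# Hypothetical allele frequencies at four markers.
south_asian = [0.10, 0.20, 0.30, 0.40]
southeast_asian = [0.50, 0.60, 0.70, 0.80]   # the "true" donor in this toy
northeast_asian = [0.90, 0.80, 0.90, 1.00]   # a more diverged stand-in

# Build a target with a known 15% contribution from the true donor.
target = [0.85 * s + 0.15 * e for s, e in zip(south_asian, southeast_asian)]

alpha_true_ref = admixture_fraction(target, southeast_asian, south_asian)
alpha_other_ref = admixture_fraction(target, northeast_asian, south_asian)

print(f"East Asian fraction, matching reference:   {alpha_true_ref:.3f}")
print(f"East Asian fraction, mismatched reference: {alpha_other_ref:.3f}")
```

With the matching reference the fit recovers the 15% that was built in; with the mismatched reference the same individual comes back at roughly 9%. Neither run is "broken"; they are answering slightly different questions.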

The moral of the story is that the values you obtain are conditional on the choices you make, and those choices emerge from the process of reducing and distilling the raw genetic variation into a manner which is human interpretable. If the companies decided to use the same model, they would come out with the same results.

* I helped develop an earlier version of MyOrigins, and so can attest to this firsthand.

## July 25, 2017

### Ancient Europeans: isolated, always on the edge of extinction

Filed under: Europe,Human Genetics,Scandinavia — Razib Khan @ 12:19 am

A few years ago I suggested to the paleoanthropologist Chris Stringer that the first modern humans who arrived in Europe did not contribute appreciable ancestry to modern populations in the continent (appreciable as in 1% or more of the genome).* It seems I may have been right according to results from a 2016 paper, The genetic history of Ice Age Europe. The very oldest European ancient genome samples “failed to contribute appreciably to the current European gene pool.”

Why did I make this claim? Two reasons:

1) 40,000 years is a long time, and there was already substantial evidence of major population turnovers across northern Eurasia by this point. You go far enough into the future and it’s not likely that a local population leaves any descendants. So just work that logic backward.

2) There was already evidence of low population sizes and high isolation levels between groups in Pleistocene and Mesolithic/Neolithic Europe. This would again argue in favor of a high likelihood of local extinctions given enough time.

This does not apply only to modern humans, descendants of southern, likely African, populations. Neanderthals themselves show evidence of high homogeneity, and expansions through bottlenecks over the ~600,000 years of their flourishing.

The reason that these dynamics characterized modern humans and earlier hominins in northern Eurasia is what ecologists would term an abiotic factor: the Ice Age. Obviously humans could make a go of it on the margins of the tundra (the Neanderthals seem less adept at penetrating the very coldest of terrain in comparison to their modern human successors; they likely frequented the wooded fringes, see The Humans Who Went Extinct). We have the evidence of several million years of continuous habitation by our lineage. But many of the ancient genomes from these areas, whether they be Denisovan, Neanderthal, or Mesolithic European hunter-gatherer, show indications of being characterized by very low effective population sizes. Things only change with the arrival of farming and agro-pastoralism.

For two obvious reasons we happen to have many ancient European genomes. First, many of the researchers are located in Europe, and the continent has a well developed archaeological profession which can provide well preserved samples with provenance and dates. And second, Europe is cool enough that degradation rates are going to be lower than if the climate was warmer. But if Europe, as part of northern Eurasia, is subject to peculiar exceptional demographic dynamics we need to be cautious about generalizing in terms of the inferences we make about human population genetic history. Remember that ancient Middle Eastern farmers already show evidence of having notably larger effective population sizes than European hunter-gatherers.

Two new preprints confirm the long term population dynamics typical of European hunter-gatherers, Assessing the relationship of ancient and modern populations and Genomics of Mesolithic Scandinavia reveal colonization routes and high-latitude adaptation. The first preprint is rather methods heavy, and seems more of a pathfinder toward new ways to extract more analytic juice from ancient DNA results. Those who have worked with population genomic data are probably not surprised at the emphasis on collecting numbers of individuals as opposed to single genome quality. That is, for the questions population geneticists are interested in “two samples sequenced to 0.5x coverage provide better resolution than a single sample sequenced to 2x coverage.”

I encourage readers (and “peer reviewers”) to dig into the appendix of Assessing the relationship of ancient and modern populations. I won’t pretend I have (yet). Rather, I want to highlight an interesting empirical finding when the method was applied to extant ancient genomic samples: “we found that no ancient samples represent direct ancestors of modern Europeans.”

This is not surprising. The ‘hunter-gatherer’ resurgence of the Middle Neolithic notwithstanding, Northern Europe was subject to two major population replacements, while Southern Europe was subject to one, but of a substantial nature. Recall that the Bell Beaker paper found that “spread of the Beaker Complex to Britain was mediated by migration from the continent that replaced >90% of Britain’s Neolithic gene pool within a few hundred years.” This means that less than 10% of modern Britons’ ancestry is a combination of hunter-gatherers and Neolithic farmers.

And yet if you look at various forms of model-based admixture analyses it seems as if modern Europeans have substantial dollops of hunter-gatherer ancestry (and hunter-gatherer U5 mtDNA and Y chromosomal lineages I1 and I2, associated with Pleistocene Europeans, are found at ~10% frequency in modern Europe in the aggregate; though I suspect this is a floor). What gives? Let’s look at the second preprint, which is more focused on new empirical results from ancient Scandinavian genomes, Genomics of Mesolithic Scandinavia reveal colonization routes and high-latitude adaptation. From early on in the preprint:

Based on SF12’s high-coverage and high-quality genome, we estimate the number of single nucleotide polymorphisms (SNPs) hitherto unknown (that are not recorded in dbSNP (v142)) to be c. 10,600. This is almost twice the number of unique variants (c. 6,000) per Finnish individual (Supplementary Information 3) and close to the median per European individual in the 1000 Genomes Project (23) (c. 11,400, Supplementary Information 3). At least 17% of these SNPs that are not found in modern-day individuals, were in fact common among the Mesolithic Scandinavians (seen in the low coverage data conditional on the observation in SF12), suggesting that a substantial fraction of human variation has been lost in the past 9,000 years (Supplementary Information 3). In other words, the SHGs (as well as WHGs and EHGs) have no direct descendants, or a population that show direct continuity with the Mesolithic populations (Supplementary Information 6) (13–17). Thus, many genetic variants found in Mesolithic individuals have not been carried over to modern-day groups.

The gist of the paper in terms of archaeology and demographic history is that Scandinavian hunter-gatherers were a compound population. One component of their ancestry is what we term “Western hunter-gatherers” (WHG), who descended from the late Pleistocene Villabruna cluster (see paper mentioned earlier). Samples from Belgium, Switzerland, and Spain all belong to this cluster. The second element is “Eastern hunter-gatherers” (EHG). These samples derive from the Karelia region, to the east of modern Finland, bound by the White Sea to the north. EHG populations exhibit affinities to both WHG as well as Siberian populations who contributed ancestry to Amerindians, the “Ancestral North Eurasians” (ANE). There is a question at this point whether EHG are the product of a pulse admixture between an ANE and WHG population, or whether there was a long existent ANE-WHG east-west cline which the EHG were situated upon. That is neither here nor there (the Tartu group has a paper addressing this leaning toward isolation-by-distance from what I recall).

Explicitly testing models against the genetic data, the authors conclude that there was a migration of EHG populations with a specific archaeological culture around the north fringe of Scandinavia, down the Norwegian coast. Conversely, a WHG population presumably migrated up from the south and somewhat to the east (from the Norwegian perspective).

And yet the distinctiveness of the very high quality genome as inferred from unique SNPs they have suggests to them that very little of the ancestry of modern Scandinavians (and Finns to be sure) derives from these ancient populations. Very little does not mean none. There is a lot of functional analysis in the paper and supplements which I will not discuss in this post, and one aspect is that it seems some adaptive alleles for high latitudes might persist down to the present in Nordic populations as a gift from these ancient forebears. This is no surprise, not all regions of the genome are created equal (a more extreme case is the Denisovan derived high altitude adaptation haplotype in modern Tibetans).

Nevertheless, there was a great disruption. First, the arrival of farmers whose ultimate origins were Anatolia ~6,000 years ago to the southern third of Scandinavia introduced a new element which came in force (agriculture spread over the south in a few centuries). A bit over a thousand years later the Corded Ware people, who were likely Indo-European speakers, arrived. These Indo-European speakers brought with them a substantial proportion of ancestry related to the hunter-gatherers because they descended in major fraction from the EHG (and later accrued more European hunter-gatherer ancestry from both the early farmers and likely some residual hunter-gatherer populations who switched to agro-pastoralism**).

For several years I’ve had discussions with researchers whose daily bread & butter are the ancient DNA data sets of Europe. I’ve gotten some impressions implicitly, and also from things they’ve said directly. It strikes me that the Bantu expansion may not be a bad analogy in regards to the expansion of farming in Europe (and later agro-pastoralism). Though the expanding farmers initially mixed with hunter-gatherers on the frontier, once they got a head of steam they likely replaced small hunter-gatherer groups in totality, except in areas like Scandinavia and along the maritime fringe where ecological conditions were such that hunter-gatherers were at an advantage (War Before Civilization seems to describe a massive farmer vs. coastal forager war on the North Sea).

But this is not the end of the story for Norden. At SMBE I saw some ancient genome analysis from Finland on a poster. Combined with ancient genomic analysis from the Baltic, along with deeper analysis of modern Finnish mtDNA, it seems likely that the expansion of Finno-Samic languages occurred on the order of ~2,000 years ago. After the initial expansion of Corded Ware agro-pastoralists.

The Sami in particular seem to have followed the same path along the northern fringe of Scandinavia that the EHG blazed. Though they herd reindeer, they were also Europe’s last indigenous hunter-gatherers. Genetically they exhibit the same minority eastern affinities in their ancestry that the Finns do, though to a greater extent. But their mtDNA harbors some distinctive lineages, which might be evidence of absorption of ancient Scandinavian substrate.

I’ll leave it to someone else to explain how and why the Finns and Sami came to occupy the areas where they currently dominate (note that historically Sami were present much further south in Norway and Sweden than they are today). But note that in Latvia and Lithuania the N1c Y chromosomal lineage is very common, despite no language shift, indicating that there was a great deal of reciprocal mixing on the Baltic.

Overall the story is of both population and cultural turnover. This should not surprise when one considers that northern Eurasia is on the frontier of the human range. And perhaps it should temper the inferences we make about other areas of the world.

* You may notice that this threshold is lower than the Neanderthal admixture proportions in the non-African genome. Why is this old admixture still detectable while modern human lineages go extinct? Because it seems to have occurred when non-African humans had a very small effective population, and was mixed thoroughly. Because of the even genomic distribution this ancestry has not been lost in any of the daughter populations.

** Haplogroup I1, which descends from European late Pleistocene populations, exhibits a star phylogeny of similar time depth as R1b and R1a.

## July 17, 2017

### Castes are not just of mind

Filed under: Caste,Human Genetics,India — Razib Khan @ 8:31 pm

Before Nicholas Dirks was a controversial chancellor of UC Berkeley, he was a well regarded historian of South Asia. He wrote Castes of Mind: Colonialism and the Making of Modern India. I read it, along with other books on the topic, in the mid-2000s.

Here is the Amazon summary from Library Journal:

Is India’s caste system the remnant of ancient India’s social practices or the result of the historical relationship between India and British colonial rule? Dirks (history and anthropology, Columbia Univ.) elects to support the latter view. Adhering to the school of Orientalist thought promulgated by Edward Said and Bernard Cohn, Dirks argues that British colonial control of India for 200 years pivoted on its manipulation of the caste system. He hypothesizes that caste was used to organize India’s diverse social groups for the benefit of British control. His thesis embraces substantial and powerfully argued evidence. It suffers, however, from its restricted focus to mainly southern India and its near polemic and obsessive assertions. Authors with differing views on India’s ethnology suffer near-peremptory dismissal. Nevertheless, this groundbreaking work of interpretation demands a careful scholarly reading and response.

The condensation is too reductive. Dirks does not assert that caste structures (and jati) date to the British period, but the thrust of the book clearly leaves the impression that this particular identity’s formative shape on the modern landscape derives from the colonial experience. The British did not invent caste, but the modern relevance seems to date to the British period.

This is in keeping with a mode of thought flourishing today under the rubric of postcolonialism, with roots back to Edward Said’s Orientalism. As a scholar of literature Said’s historical analysis suffered from the lack of deep knowledge. A cursory reading of Orientalism picks up all sorts of errors of fact. But compared to his heirs Said was actually a paragon of analytical rigor. I say this after reading some contemporary postcolonial works, and going back and re-reading Orientalism.

Not to put too fine a point on it, postcolonialism is more about a rhetorical posture which aims to destroy what it perceives as Western hegemonic culture. In the process it transforms the modern West into the causal root of almost all social and cultural phenomena, especially those that are not egalitarian. Anyone with a casual grasp of world history can see this, which basically means very few can, since so few actually care about details of fact.

Castes of Mind is an interesting book, and a denser piece of scholarship than Orientalism. Its perspective is clear, and though it is not without qualification, many people read it to mean that caste was socially constructed by the British.

This seems false. It has become quite evident that even the classical varna categories seem to correlate with genome-wide patterns of relatedness. And the Indian jatis have been endogamous for on the order of two thousand years. From The New York Times, In South Asian Social Castes, a Living Lab for Genetic Disease:

The Vysya may have other medical predispositions that have yet to be characterized — as may hundreds of other subpopulations across South Asia, according to a study published in Nature Genetics on Monday. The researchers suspect that many such medical conditions are related to how these groups have stayed genetically separate while living side by side for thousands of years.

This is not really a new finding. It was clear in 2009’s Reconstructing Indian Population History. It’s more clear now in The promise of disease gene discovery in South Asia.

Unfortunately, though, the science is not well known in any depth among the general public. The ascendancy of social constructionism is such that a garbled and debased view that “caste was invented by the British” will continue to be the “smart” and fashionable view among many elites.

## July 10, 2017

### The great Bantu expansion was massive

Filed under: History,Human Genetics,Punt — Razib Khan @ 12:01 am

Lots of stuff at SMBE of interest to me. I went to the Evolution meeting last year, and it was a little thin on genetics for me. And I go to ASHG pretty much every year, but there’s a lot of medical stuff that is not to my taste. SMBE was really pretty much my style.

In any case one of the more interesting talks was given by Pontus Skoglund (soon of the Crick Institute). He had several novel African genomes to talk about, in particular from Malawi hunter-gatherers (I believe dated to 3,000 years before the present), and one from a pre-Bantu pastoralist.

At one point Skoglund presented a plot showing what looked like an isolation by distance dynamic between the ancient Ethiopian Mota genome and a modern day Khoisan sample, with the Malawi population about $\frac{2}{3}$ of the way toward the Khoisan from the Ethiopian sample. Some of my friends from a non-human genetics background were at the talk and were getting quite excited at this point, because there is a general feeling that the Reich lab emphasizes the stylized pulse admixture model a bit too much. Rather than expansion of proto-Ethiopian-like populations and proto-Khoisan-like populations they interpreted this as evidence of a continuum or cline across East Africa. I’m not sure if this is the right interpretation of the plot presented, but it’s a reasonable one.

Malawi is considerably to the north of modern Khoisan populations. This is not surprising. From what I have read Khoisan archaeological remains seem to be found as far north as Zimbabwe, while others have long suggested a presence as far afield as Kenya. Perhaps more curiously: the Malawi hunter-gatherers exhibit no evidence of having contributed genes to modern Bantu residents of Malawi.

Surprising, but not really. If you look at a PCA plot of Bantu genetic variation it really only starts showing evidence of local substrate (Khoisan) in South Africa. From Cameroon to Mozambique it looks like the Bantu simply overwhelmed local populations, they are clustered so tight. Though it is true that African populations harbor a lot of diversity, that diversity is not necessarily partitioned between the populations. The Bantu expansion is why.

Of more interest from the perspective of non-African history is the Tanzanian pastoralist. This individual is about 38% West Eurasian, and that ancestry has the strongest affinities with Levantine Neolithic farmers. Specifically, the PPN, which dates to between 8500-5500 BCE. More precisely, this individual was exclusively “western farmer” in the Lazaridis et al. formulation. Additionally, Skoglund also told me that the Cushitic (and presumably Semitic) peoples to the north and east had some “eastern farmer.” I immediately thought back to Hodgson et al., Early Back-to-Africa Migration into the Horn of Africa, which suggested multiple layers. Finally, Pagani et al. (2012) suggested that admixture in the Ethiopian plateau occurred on the order of ~3,000 years ago.

Bringing all of this together, it suggests to me two things:

1. The migration back from Eurasia occurred multiple times, with an early wave arriving well before the Copper/Bronze Age east-west and west-east gene flow in the Near East (also, there was backflow to West Africa, but that’s a different post….).
2. The migration was patchy; the Mota sample dates to 4,500 years ago, and lacks any Eurasian ancestry, despite the likelihood that the first Eurasian backflow was already occurring.

Skoglund will soon have the preprint out.

## July 9, 2017

### SLC24A5 is very important, but we don’t know why

The golden age of pigmentation genetics started in 2005 with SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. Prior to that pigmentation genetics was really to a great extent coat color genetics, done in mice and other organisms which have a lot of pelage variation.

Of course there was work on humans, mostly related to melanocortin 1. But more interesting were classical pedigree studies which indicated that the number of loci controlling variation in pigmentation was not that high. Thus, it was a mildly polygenic trait insofar as some large effect quantitative trait loci could be discerned in the inheritance patterns.

From The Genetics of Human Populations, written in the 1960s, but still useful today because of its comprehensive survey of the classical period:

Depending on which study samples you use, variation at the SLC24A5 locus explains less than 10% or more than 30% of the total variance. But it is probably the biggest effect locus on the whole in human populations when you pool them all together (obviously it explains little variance in Africans or eastern non-Africans since it is homozygous ancestral by and large in both groups).

One aspect of the derived SNP in this locus is that it seems to be under strong selection. In a European 1000 Genomes sample there are 1003 copies of the derived variant, and 3 of the ancestral. Curiously this allele was absent in Western European Mesolithic hunter-gatherers, though it was present in hunter-gatherers on the northern and eastern fringes of the continent. It was also present in Caucasian hunter-gatherers and farmers from the Middle East who migrated to Europe. It seems very likely that these sorts of high frequencies are due to selection in Europe.

The variant is also present at appreciable frequencies in many South Asian populations, and there seems to have been in situ selection there too, as well as in the Near East. In Ethiopia it also seems to be under selection.

It could be something due to radiation…but the Near East and South Asia are quite high intensity in that regard. As are the highlands of Ethiopia. About seven years ago I suggested that rather than UV radiation as such, the depigmentation that has occurred across the Holocene might be due to agriculture and changes in diet.

But a new result from southern Africa presented at the SMBE meeting this year suggests that this cannot be a comprehensive answer. Meng Lin in Brenna Henn’s lab uses a broad panel of KhoeSan populations to find that the derived allele on SLC24A5 reaches ~40% frequency. Being generous, the fraction of West Eurasian admixture in these groups is probably around 10%. Where did this allele come from? The results from Joe Pickrell a few years back are sufficient to explain: there was a movement of pastoralists with distant West Eurasian ancestry who brought cattle to southern Africa, and so resulted in the ethnogenesis of groups such as the Nama people (there is also Y chromosomal work by Henn on this).
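The arithmetic behind that puzzle can be sketched in a few lines. This is a back-of-the-envelope calculation, not anything from Lin's analysis: it assumes the derived allele was essentially fixed in the West Eurasian source and absent in the KhoeSan before admixture, and the function name is my own.

```python
def expected_freq_from_admixture(admix_fraction, freq_source, freq_recipient=0.0):
    """Allele frequency expected right after admixture, with no selection or drift."""
    return admix_fraction * freq_source + (1.0 - admix_fraction) * freq_recipient

# Assumed inputs: ~10% West Eurasian ancestry (the generous figure in the text),
# derived allele near fixation in the source, absent in the recipient.
expected = expected_freq_from_admixture(0.10, 1.0)
observed = 0.40  # approximate frequency reported for the KhoeSan panel

print(f"expected from admixture alone: {expected:.2f}")
print(f"observed: {observed:.2f}")
print(f"excess over the admixture-only expectation: {observed / expected:.1f}x")
```

If those assumptions are even roughly right, the observed frequency is several times what admixture alone would deliver, which is why selection in southern Africa is the natural suspect.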

Lin reports that the haplotype around SLC24A5 is the same one as in Western Eurasia. Iain Mathieson (who is now at Penn if anyone is looking for something to do in grad school or a post-doc) has told me that the haplotype in the Motala Mesolithic hunter-gatherers and in the hunter-gatherers from the Caucasus is the same. It seems that this haplotype was widespread early in the Holocene. Curiously, the Motala hunter-gatherers also carry the East Asian haplotype around their derived EDAR variant.

I don’t know what to make of this. My intuition is that if a haplotype like this was so widespread ~10,000 years ago, recombination would have broken it apart into smaller pieces so that haplotype structure would be easier to discern. As it is that doesn’t seem to be the case.

And we also don’t know what’s going on with SLC24A5. Obviously it impacts skin color. It has been shown to do so in admixed populations. But it is hard to believe that that is the sole target of natural selection here.

## June 14, 2017

Filed under: Diet,FADS,Genetics,Human Genetics — Razib Khan @ 7:21 pm

Food is a big deal for humans. Without it we die. Unlike some animals (here’s looking at you pandas) we’re omnivorous. We eat fruit, nuts, greens, meat, fish, and even fungus. Some of us even eat things which give off signals of being dangerous or unpalatable, whether it be hot sauce or lutefisk.

This ability to eat a wide variety of items is a human talent. Those who have put their cats on vegetarian diets know this. After a million or so years of being hunters and gatherers with a presumably varied diet, for thousands and thousands of years most humans at any given time ate some form of grain-based gruel. Though I am sympathetic to the argument that in terms of quality of life this was a detriment to median human well being, agriculture allowed our species to extract orders of magnitude more calories from a unit of land, though there were exceptions, such as in marine environments (more on this later).

Ergo, some scholars, most prominently Peter Bellwood, have argued that farming did not spread through cultural diffusion. Rather, farmers simply reproduced at much higher rates because of the efficiency of their lifestyle in comparison to that of hunter-gatherers. The latest research, using ancient DNA, broadly confirms this hypothesis. More precisely, it seems that cultural revolutions in the Holocene have shaped most of the genetic variation we see around us.

But genetic variation is not just a matter of genealogy. That is, the pattern of relationships, ancestor to descendant, and the extent of admixtures across lineages. Selection is also another parameter in evolutionary genetics. This can even have genome-wide impacts. It seems quite possible that current levels of Neanderthal ancestry are lower than might otherwise have been the case due to selection against functional variants derived from Neanderthals, which are less fit against a modern human genetic background.

The importance of selection has long been known and explored. Sickle-cell anemia only exists because of balancing selection. Ancient DNA has revealed that many of the salient traits we associate with a given population, e.g., lactose tolerance or blue eyes, have undergone massive changes in population wide frequency over the last 10,000 years. Some of this is due to population replacement or admixture. But some of it is due to selection after the demographic events. To give a concrete example, the frequency of variants associated with blue eyes in modern Europeans dropped rapidly with the expansion of farmers from the Near East ~10,000 years ago, but has gradually increased over time until it is the modal allele in much of Northern Europe. Lactase persistence in contrast is not an ancient characteristic which has had its ups and downs, but something new that evolved due to the cultural shock of the adoption of dairy consumption by humans as adults. The region around lactase is one of the strongest signals of natural selection in the European genome, and ancient DNA confirms that the ubiquity of the lactase persistent allele is a very recent phenomenon.

But obviously lactase is not going to be the only target of selection in the human genome. Not only can humans eat many different things, but we change our portfolio of proportions rather quickly. In A Farewell to Alms the economic historian Gregory Clark observed that English peasants ate very differently before and after the Black Death. As any ecologist knows populations are resource constrained when they are near the carrying capacity, and in England during the High Medieval period there was massive population growth due to gains in productivity (e.g., the moldboard plough) as well as intensification of farming and utilization of all the marginal land.

After the Black Death (which came in waves repeatedly) there was a massive population decline across much of Europe. Because institutions and practices were optimized toward maintaining a much higher population, European peasants lived a much better lifestyle after the population crash because the pie was being cut into far fewer pieces. In other words, centuries of life on the margins just scraping by did not mean that English peasants couldn’t live large when the times allowed for it. We were somewhat pre-adapted.

Our ability to eat a variety of items, and the constant varying of the proportions and kind of elements which go into our diet, mean that sciences like nutrition are very difficult. And, it also means that attempts to construct simple stories of adaptation and functional patterns from regions of the genome implicated in diet often fail. But with better analytic technologies (whole genome sequencing, large sample sizes) and some elbow grease some scientists are starting to get a better understanding.

A group of researchers at Cornell has been taking a closer look at the FADS genes over the past few years (as well as others at CTEG). These are three nearby genes, FADS1, FADS2, and FADS3 (they probably underwent duplication). These genes are involved in the metabolization of fatty acids, and dietary regime turns out to have a major impact on variation around these loci.

The most recent paper out of the Cornell group, Dietary adaptation of FADS genes in Europe varied across time and geography:

Fatty acid desaturase (FADS) genes encode rate-limiting enzymes for the biosynthesis of omega-6 and omega-3 long-chain polyunsaturated fatty acids (LCPUFAs). This biosynthesis is essential for individuals subsisting on LCPUFA-poor diets (for example, plant-based). Positive selection on FADS genes has been reported in multiple populations, but its cause and pattern in Europeans remain unknown. Here we demonstrate, using ancient and modern DNA, that positive selection acted on the same FADS variants both before and after the advent of farming in Europe, but on opposite (that is, alternative) alleles. Recent selection in farmers also varied geographically, with the strongest signal in southern Europe. These varying selection patterns concur with anthropological evidence of varying diets, and with the association of farming-adaptive alleles with higher FADS1 expression and thus enhanced LCPUFA biosynthesis. Genome-wide association studies reveal that farming-adaptive alleles not only increase LCPUFAs, but also affect other lipid levels and protect against several inflammatory diseases.

The paper itself can be difficult to follow because they’re juggling many things in the air. First, they’re not just looking at variants (e.g., SNPs, indels, etc.), but also the haplotypes that the variants are embedded in. That is, the sequence of markers which define an association of variants which indicate descent from common genealogical ancestors. Because recombination can break apart associations one has to engage with care in historical reconstruction of the arc of selection due to a causal variant embedded in different haplotypes.

But the great thing about this paper is that in the case of Europe they can access ancient DNA. So they perform inferences utilizing whole genomes from many extant human populations, but also inspect change in allele frequency trajectories over time because of the density of the temporal transect. The figure to the left shows variants in both an empirical and modeling framework, and how they change in frequency over time.

In short, variants associated with higher LCPUFA synthesis actually decreased over time in Pleistocene Europe. This is similar to the dynamic you see in the Greenland Inuit. With the arrival of farmers the dynamic changes. Some of this is due to admixture/replacement, but some of it cannot be accounted for by admixture and replacement. In other words, there was selection for the variants which synthesize more LCPUFA.

This is not just limited to Europe. The authors refer to other publications which show that the frequency of alleles associated with LCPUFA production is high in places like South Asia, notable for a cultural preference for plant-based diets, reinforced by the reality that animal protein was in very short supply. In Europe they can look at ancient DNA because we have it, but the lesson here is probably general: alternative allelic variants are being whipsawed in frequency by protean shifts in human cultural modes of production.

In War Before Civilization Lawrence Keeley observed that after the arrival of agriculture in Northern Europe in a broad zone to the northwest of the continent, facing the Atlantic and North Sea, farming halted rather abruptly for centuries. Keeley then recounts evidence of organized conflict between two populations across a “no man’s land.”

But why didn’t the farmers just roll over the old populations as they had elsewhere? Probably because they couldn’t. It is well known that marine regions can often support very high densities of humans engaged in a gathering lifestyle. Though not farmers, these peoples are often also not nomadic, and occupy areas at high density. The tribes of the Pacific Northwest, dependent upon salmon fisheries, are classic examples. Even today much of the Northern European maritime fringe relies on the sea. High density means they had enough numbers to resist the human wave of advance of farmers. At least for a time.

Just as cultural forms wane and wax, so do some of the underlying genetic variants. If you dig into the guts of this paper you see much of the variation dates to the out of Africa period. There were no great sweeps which expunged all variation (at least in general). Rather, just as our omnivorous tastes are protean and changeable, so the genetic variation changes over time and space in a difficult to reduce manner. The flux of lifestyle change is probably usually faster than biological evolution can respond, so variation reducing optimization can never complete its work.

The modern age of the study of natural selection in the human genome began around when A Map of Recent Positive Selection In the Human Genome was published. And it continues with methods like SDS, which indicate that selection operates to this day. Not a great surprise, but solidifying our intuitions. In the supplements to the above paper the authors indicate that the focal alleles that they are interrogating exhibit coefficients of selection around ~0.5% or so. This is rather appreciable. The fact that fixation has not occurred indicates in part that selection has reversed or halted, as they noted. But another aspect is that there are correlated responses; the FADS genes are implicated in many things, as the authors note in relation to inflammatory diseases. But I’m not sure that the selection effects of these are really large in any case. I bet there are more important things going on that we haven’t discovered or understood.
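To get a feel for what a selection coefficient of ~0.5% means, here is a rough sketch under a deliberately simple model of my own choosing: deterministic genic (additive) selection in a very large population, where the log-odds of the allele frequency rise by about s each generation. The start and end frequencies and the 29-year generation time are illustrative assumptions, not numbers from the paper.

```python
import math

def generations_to_shift(p_start, p_end, s):
    """Generations for an allele to move from p_start to p_end under
    deterministic genic selection: logit(p) grows by ~s per generation."""
    def logit(p):
        return math.log(p / (1.0 - p))
    return (logit(p_end) - logit(p_start)) / s

s = 0.005  # ~0.5% per generation, the order of magnitude quoted above
gens = generations_to_shift(0.10, 0.90, s)  # illustrative shift, 10% -> 90%
years = gens * 29  # assumed generation time in years
print(f"{gens:.0f} generations, roughly {years:,.0f} years")
```

On these assumptions a shift from 10% to 90% takes on the order of 900 generations. That is fast by evolutionary standards, but slow enough that a reversal or pause in selection could easily leave an allele short of fixation.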

Obviously genome-wide analyses are going to continue for the foreseeable future. Ten years ago my late friend Mike McKweon predicted that at some point genomics was going to have to be complemented by detailed follow up through bench-work. I’m not sure if we’re there yet, but there are only so many populations you can sequence, and only to a particular coverage to obtain any more information. Some selection sweeps will be simple stories with simple insights. But I suspect many more like FADS will be more complex, with the threads of the broader explanatory tapestry assembled publication by publication over time.

Citation: Ye, K., Gao, F., Wang, D., Bar-Yosef, O. & Keinan, A. Dietary adaptation of FADS genes in Europe varied across time and geography. Nat. Ecol. Evol. 1, 0167 (2017).

## May 10, 2017

### The Bronze Age demographic transformation of Britain

Filed under: Bell Beaker,Britain,Evolution,History,Human Genetics,Human Genomics — Razib Khan @ 8:52 am

In his excellent The Isles: A History, Norman Davies mentions offhand that unlike the Irish the British to a great extent have forgotten their own mythology. This is one reason that J. R. R. Tolkien created Middle Earth: to give the Anglo-Saxons the same sort of mythos that the Irish and Norse had.

But to some extent I think we can update our assessments. Science is bringing myth to life. The legendary “Bell Beaker paper” is now available in preprint form, The Beaker Phenomenon And The Genomic Transformation Of Northwest Europe. The methods are not too abstruse if you have read earlier works in this vein (i.e., no Nick Patterson authored methodological supplement that I saw). And the results are straightforward.

And what are those results?

First, the Bell Beaker phenomenon was both cultural and demographic. Cultural in that it began in the Iberian peninsula, and was transmitted to Central Europe, without much gene flow from what they can see. Demographic in that its push west into what is today the Low Countries and France and the British Isles was accompanied by massive gene flow.

In their British samples they conclude that 90% of the ancestry of early Bronze Age populations derives from migrants from Central Europe with some steppe-like ancestry. In other words, in a few hundred years there was a 90% turnover of ancestry. The preponderance of the male European R1b lineage also dates to this period. It went from ~0% to ~75-90% in Britain over a few hundred years.

If most of the genetic-demographic character of modern Britain was established during the Bronze Age*, then there has been significant selection since the Bronze Age. The figure to the left shows ancient (Neolithic/Bronze Age) frequencies of selected SNPs, with modern frequencies in the British in dashed red. The top-left SNP is for HERC2-OCA2, the region related to brown vs. blue eye color, and also associated with some more general depigmentation. The top-right SNP is in SLC45A2, the second largest effect skin color locus in Europeans. The bottom SNP is for a mutation on LCT, which allows for the digestion of milk sugar as adults.

The vast majority of the allele frequency change in Britons for digestion of milk sugar post-dates the demographic turnover. In other words, the modern allele frequency is a function of post-Bronze Age selection. This is not surprising, as it supports the result in Eight thousand years of natural selection.

At least as interesting are the pigmentation loci. The fact that the derived frequency in HERC2-OCA2 is lower in both British and Central European Beaker people samples indicates that the lower proportion is not an artifact of sampling. Britons have gotten more blue-eyed over the last 4,000 years. Second, SLC45A2 is at shockingly low proportions for modern European populations.

In the 1000 Genomes the 4% ancestral allele frequency is almost certainly a function of the Siberian (non-European) ancestry. In modern Iberians the ancestral frequency is 18% (and it is even higher in Sardinians last I checked), but in Tuscans it is ~2%. Though not diagnostic of Europeans in the way the derived SNP at SLC24A5 is, SLC45A2 derived variants are much more constrained to Europe. Individuals who are homozygous ancestral for SNPs at SLC45A2 are rare in modern Northern Europeans (pretty much nonexistent actually). But even as late as the Bronze Age they would have been present at low but appreciable frequencies.

This particular result convinces me that the method in Field et al. which detected lots of recent (last 2,000 years) selection on pigmentation in British populations is not just a statistical artifact. Though these papers are solving much of European prehistory, they are also going to be essential windows into the trajectory of natural selection in human populations over the last 5,000 years.

* In the context of this paper the Anglo-Saxon migrations tackled by the PoBI paper are minor affairs because the two populations were already genetically rather close. Additionally, the PoBI paper found that the German migrations were significant demographic events, but most of the ancestry across Britain does date to the previous period.

## May 1, 2017

### So what’s the point of demographic models which leave you scratching your head

Filed under: Genomics,History,Human Genetics,Selection,Tibetans — Razib Khan @ 10:45 pm

There’s a new paper on Tibetan adaptation to high altitudes, Evolutionary history of Tibetans inferred from whole-genome sequencing. The focus of the paper is on the fact that more genes than have previously been analyzed seem to be the targets of natural selection. And I buy most of their analyses (not sure about the estimate of Denisovan ancestry being 0.4%…these sorts of things can be tricky).

But they fancy it up with a ∂a∂ model of population history, as well as using MSMC to account for gene flow. I don’t understand why they didn’t use something simpler like TreeMix, which can also handle more complex models. I guess because they wanted to focus on only a few populations?

Years ago I asked the developer of MSMC, Stephan Schiffels, if assuming an admixed population is not admixed might cause weird inferences. Why yes, it would. For example, admixed populations might show higher effective population since they’re pooling the histories of two separate populations. As for ∂a∂, the model above leaves me literally scratching my head.

…predicted that the initial divergence between Han and Tibetan was much earlier, at 54kya (bootstrap 95% C.I 44 kya to 58 kya). However, for the first 45ky, the two populations maintained substantial gene flow ($6.8\times10^{-4}$ and $9.0\times10^{-4}$ per generation per chromosome). After 9.4 kya (bootstrap 95% C.I 8.6 kya to 11.2 kya), the gene flow rate dramatically dropped ($1.3\times10^{-11}$ and $4\times10^{-7}$ per generation per chromosome), which is consistent with the estimate from MSMC.

Mystifying. The separation between Chinese and Tibetans is pretty much immediately after modern humans arrive in East Asia. Then there’s a lot of reciprocal gene flow…which ends during the Holocene.

We’re being told here that there are two populations which persisted in some form for ~45,000 years. Is this believable? That these two populations maintained some sort of continuity, and remained in close enough proximity to engage in gene flow? And then ~10,000 years ago the ancestors of the Tibetans separated from the ancestors of the modern Han Chinese.

The latter scenario I can imagine. It’s this ~45,000 year dance I’m confused by. If there is substantial gene flow between the two groups why did they keep enough distinctive drift to be separate populations?
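One way to make the puzzle concrete is to ask how much of a population's ancestry would trace to migrants after 45,000 years at the quoted rates. The sketch below is a simplification I'm imposing, not the paper's ∂a∂ model: it treats the rate as the fraction of chromosomes replaced by migrants each generation, ignores drift and the reciprocal flow, and assumes a 29-year generation time.

```python
def migrant_ancestry_fraction(m, generations):
    """Fraction of lineages that trace to migrants after `generations`
    of one-way gene flow at rate m per generation."""
    return 1.0 - (1.0 - m) ** generations

generations = round(45_000 / 29)  # assumed generation time of 29 years
for m in (6.8e-4, 9.0e-4):        # the two rates quoted from the paper
    frac = migrant_ancestry_fraction(m, generations)
    print(f"m = {m:.1e}: ~{frac:.0%} migrant-derived ancestry")
```

Under these assumptions well over half of each population's ancestry would have been exchanged over the period, which is the sense in which two distinct populations persisting through it is hard to picture.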

With what we know about ancient DNA from Europe, if we posited such a model for that continent we’d be way off. There have been too many population turnovers. Is East Asia different? I’m moderately skeptical of that. I think perhaps researchers should be very aware of the limitations of ∂a∂ when it comes to fine-grained population genomic analyses.

Note: This is a cool paper, and this small section is not entirely relevant. Which is why I’m confused about it since it seems the weakest part of the analysis in terms of originality, and the least believable.

## April 28, 2017

### Beyond “Out of Africa” and multiregionalism: a new synthesis?

Filed under: Africa,Evolution,Genetics,Genomics,Human Evolution,Human Genetics — Razib Khan @ 4:14 pm

For several decades before the present era there have been debates between proponents of the recent African origin of modern humans, and the multiregionalist model. Though molecular methods in a genetic framework have come to the fore of late these were originally paleontological theories, with Chris Stringer and Milford Wolpoff being the two most prominent public exponents of the respective paradigms.

Oftentimes the debate got quite heated. If you read books from the 1990s, when multiregionalism in particular was on the defensive, there were arguments that the recent out of Africa model was more inspirational in regards to our common humanity. As a riposte the multiregionalists asserted that those suggesting recent African origins with total replacement were saying that our species came into being through genocide.

Though some had long warned against this, the dominant perception outside of population genetics was that results such as the “mitochondrial Eve” had given strong support to the recent African origin of modern humans, to the exclusion of other ancestry. 2002’s Dawn of Human Culture took it for granted that the recent African origin of modern humans to the total exclusion of other hominin lineages was established fact.

In 2008 I went to a talk where Svante Paabo presented some recent Neanderthal ancient mtDNA work. It was rather ho-hum, as Paabo showed that the Neanderthal lineages were highly diverged from modern ones, and did not leave any descendants. Though of course most modern human lineages did not leave any descendants from that period, Paabo took this as evidence supporting the proposition that Neanderthals did not contribute to the modern human gene pool.

When his lab reported autosomal Neanderthal admixture in 2010, it was after initial skepticism and shock internally. I know Milford Wolpoff felt vindicated, while Chris Stringer began to emphasize that the recent African origin of modern humanity also was defined by regional assimilation of other lineages. The data have ultimately converged to a position somewhere between the extreme models of total replacement or balanced and symmetrical gene flow.

This is not surprising. Extreme positions are often rhetorically useful and popular when there’s no data. But reality does not usually conform to our prejudices, so ultimately one has to come down at some point.

The data for non-Africans is rather unequivocal. The vast majority (>90%) of the ancestry of non-Africans seems to go back to a small number of common ancestors ~60,000 years ago, perhaps in the range of ~1,000 individuals. These individuals seem to be a node within a phylogenetic tree where all the other branches are occupied by African populations. Between this period and ~15,000 years ago these non-Africans underwent a massive range expansion, until modern humans were present on all continents except Antarctica. Additionally, during the Holocene some of these non-African groups also experienced huge population growth due to intensive agricultural practice.

To give a sense of what I’m getting at, the bottleneck and common ancestry of non-Africans goes back ~60,000 years, but the shared ancestry of Khoisan peoples and non-Khoisan peoples goes back ~150,000-200,000 years. A major lacuna of the current discussion is that the dynamics which characterize non-Africans are often assumed to be applicable to Africans. But they are not.

A 2014 paper illustrates one major difference by inferring effective population from whole genomes: African populations have not gone through the major bottleneck which is imprinted on the genomes of all non-African populations. The Khoisan peoples, the most famous of which are the Bushmen of the Kalahari, have the largest long term effective populations of any human group. The Yoruba people of Nigeria have a history where they were subject to some population decline, but not to the same extent as non-Africans.

What do we take away from this?

One thing is that we have to consider that the assimilationist model which seems to be necessary for non-Africans also applies to Africans. For years some geneticists have been arguing that some proportion of African ancestry as well is derived from lineages outside of the main line leading up to anatomically modern humans. Without the smoking gun of ancient genomes this will probably remain a speculative hypothesis. I hope that Lee Berger’s recent assertion that they’ve now dated Homo naledi to ~250,000 years before the present may offer up the possibility that ancient DNA will help resolve the question of African archaic admixture (i.e., if naledi is related to the “ghost population”?).

The second dynamic is that the bottleneck-then-range-expansion which is so important in defining the recent prehistory of non-Africans is not as relevant to Africans during the Pleistocene. The very deep split dates being inferred from whole genome analysis of African populations make me wonder if multiregional evolution is actually much more important within Africa in the development of modern humans in the last few hundred thousand years. Basically, the deep split dates may highlight that there was recurrent gene flow over hundreds of thousands of years between different closely related hominin populations in Africa.

Ultimately, it doesn’t seem entirely surprising that the “Out of Africa” model does not quite apply within Africa.

Addendum: Over the past ~5,000 years we have seen the massive expansion of agricultural populations within the continent. The “deep structure” therefore may have been erased to a great extent, with Pygmies, Khoisan, and Hadza being the tip of the iceberg in terms of the genetic variation which had characterized Africa during the Pleistocene.

## April 23, 2017

### The logic of human destiny was inevitable 1 million years ago

Filed under: Evolution,Genetics,Genomics,Human Evolution,Human Genetics — Razib Khan @ 1:11 pm

Robert Wright’s best book, Nonzero: The Logic of Human Destiny, was published nearly 20 years ago. At the time I was moderately skeptical of his thesis. It was too teleological for my tastes. And, it does pander to a bias in human psychology whereby we look to find meaning in the universe.

But this is 2017, and I have somewhat different views.

In the year 2000 I broadly accepted the thesis outlined a few years later in The Dawn of Human Culture. That our species, our humanity, evolved and emerged in rapid sequence, likely due to biological changes of a radical kind, ~50,000 years ago. This is the thesis of the “great leap forward” of behavioral modernity.

Today I have come closer to models proposed by Michael Tomasello in The Cultural Origins of Human Cognition and Terrence Deacon in The Symbolic Species: The Co-evolution of Language and the Brain. Rather than a punctuated event, an instance in geological time, humanity as we understand it was a gradual process, driven by general dynamics and evolutionary feedback loops.

The conceit at the heart of Robert J. Sawyer’s often overly preachy Neanderthal Parallax series, that if our own lineage went extinct but theirs did not they would have created a technological civilization, is I think in the main correct. It may not be entirely coincidental that hyper-drive cultural flexibility evolved in African modern humans first. There may have been sufficient biological differences to enable this to be likely. But I believe that if African modern humans were removed from the picture Neanderthals would have “caught up” and been positioned to begin the trajectory we find ourselves in during the current Holocene inter-glacial.

The data indicate that all human lineages were subject to increased encephalization. That process trailed off ~200,000 years ago, but it illustrates the general evolutionary pressures, ratchets, or evolutionary “logic”, that applied to all of them. Overall there were some general trends in the hominin lineage that began to characterize us about a million years ago. We pushed into new territory. Our rate of cultural change seems to have gradually increased across our whole range.

One of the major holy grails I see now and then in human evolutionary genetics is to find “the gene that made us human.” The scramble is definitely on now that more and more whole genome sequences from ancient hominins are coming online. But I don’t think any such gene will ever be found. There isn’t “a gene,” but a broad set of genes which were gradually selected upon in the process of making us human.

In the lingo, it wasn’t just a hard sweep from a de novo mutation. It was as much, or even more, soft sweeps from standing variation.

## April 19, 2017

### Mouse fidelity comes down to the genes

Filed under: Genetics,Genomics,Human Genetics — Razib Khan @ 10:02 pm

While birds tend to be at least nominally monogamous, this is not the case with mammals. This strikes some people as strange because humans seem to be monogamous, at least socially, and often we take ourselves to be typically mammalian. But of course we’re not. Like many primates we’re visual creatures, rather than relying on smell and hearing. Obviously we’re also bipedal, which is not typical for mammals. And, our sociality scales up to massive agglomerations of individuals.

How monogamous we are is up for debate. Desmond Morris, who is well known to many from his roles in television documentaries, has been a major promoter of the idea that humans are monogamous, with a focus on pair-bonds. In contrast, other researchers have highlighted our polygamous tendencies. In The Mating Mind Geoffrey Miller argues for polygamy, and suggests that pair-bonds in a pre-modern environment were often temporary, rather than lifetime (Miller is now writing a book on polyamory).

The fact that in many societies high status males seem to engage in polygamy, despite monogamy being more common, is one phenomenon which confounds attempts to quickly generalize about the disposition of our species. What is preferred may not always be what is practiced, and the external social adherence to norms may be quite violated in private.

Adducing behavior is simpler in many other organisms, because their range of behavior is more delimited. When it comes to studying mating patterns in mammals voles have long been of interest as a model. There are vole species which are monogamous, and others which are not. Comparing the diverged lineages could presumably give insight as to the evolutionary genetic pathways relevant to the differences.

But North American deer mice, Peromyscus, may turn out to be an even better bet: there are two lineages which exhibit different mating patterns and which are phylogenetically close enough that they can interbreed. That is crucial, because it allows one to generate crosses and see how the characteristics distribute themselves across subsequent generations. Basically, it allows for genetic analysis.

And that’s what a new paper in Nature does, The genetic basis of parental care evolution in monogamous mice. In figure 3 you can see the distribution of behaviors in parental generations, F1 hybrids, and the F2, which is a cross of F1 individuals. The widespread distribution of F2 individuals is likely indicative of a polygenic architecture of the traits. Additionally, they found that some traits are correlated with each other in the F2 generation (probably due to pleiotropy, the same gene having multiple effects), while others were independent.
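To make the link between a wide F2 distribution and polygenicity concrete, here is a minimal sketch of a toy additive model. It is not the paper's analysis: it assumes the two parental species are fixed for different alleles at every locus, that loci are unlinked, and that each allele contributes equally. The function name `f2_allele_count_distribution` is hypothetical. Under those assumptions every F1 individual is identical (heterozygous everywhere), while the number of alleles an F2 individual inherits from one parental species follows a binomial distribution.

```python
from math import comb


def f2_allele_count_distribution(n_loci):
    """Probability that an F2 individual carries k alleles from parental
    species B, for k = 0 .. 2 * n_loci.

    Toy assumptions (not from the paper): parents are fixed for different
    alleles at every locus, loci are unlinked, and F1s are heterozygous
    at all loci, so each of the 2 * n_loci inherited alleles is a fair coin flip.
    """
    total_alleles = 2 * n_loci
    return [comb(total_alleles, k) / 2 ** total_alleles
            for k in range(total_alleles + 1)]


# One locus gives three discrete classes; several loci give a broad,
# roughly bell-shaped spread of intermediate values.
for n_loci in (1, 6):
    dist = f2_allele_count_distribution(n_loci)
    print(n_loci, [round(p, 3) for p in dist])
```

With a single locus the F2 falls into three classes in a 1:2:1 ratio; with six loci there are thirteen possible classes, which in practice blur into a continuous distribution once environmental noise is added.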

With the F2 generation they ran a genetic analysis which looked for associations between traits and regions of the genome. They found 12 quantitative trait loci (QTLs), basically zones of the genome associated with variation on one or more of the six traits. From this analysis they immediately realized there was sexual dimorphism in terms of the genetic architecture; the same locus might have a different effect in the opposite sex. This is evolutionarily interesting.

Because the QTLs are rather large in terms of physical genomic units the authors looked to see which were plausible candidates in terms of function. One of their hits was vasopressin, which should be familiar to many from vole work, as well as some human studies. Though the QTL work as well as their pup-switching experiment (which I did not describe) is persuasive, the fact that a gene you’d expect shows up as a candidate really makes it an open and shut case.

The extent of the variation explained by any given QTL seems modest. In the extended figures you can see it’s mostly in the 1 to 5 percent range. In Carl Zimmer’s excellent write up he ends:

But Dr. Bendesky cautioned that the vasopressin gene would probably turn out to be just one of many that influence oldfield mice. Though it is strongly linked to parental behavior, the vasopressin gene accounts for 6.7 percent of the variation in nest building among males, and only 2.9 percent among females.

The genetic landscape of human parenting will turn out to be even more rugged, Dr. Bendesky predicted.

“You cannot do a 23andMe test and find out if your partner is going to be a good father,” he said.

Sort of. The genetic architecture above is polygenic…but not incredibly diffuse. The proportion of variation explained by the largest effect allele is more than for height, and far more than for education. If human research follows up on this, I wouldn’t be surprised if you could develop a polygenic risk score.

But I don’t have a good intuition on how much variation in humans there really is for these sorts of traits that are heritable. I assume some. But I don’t know how much. And how much of the variance in behavior might be explained by human QTLs? Humans don’t lick or build nests, or retrieve pups. Also, as one knows from Genetics and Analysis of Quantitative Traits sexually dimorphic traits take a long time to evolve. These are two deer mice species. Within humans there may not have been enough time for this sort of heritable complexity of behavior to evolve.

There are a lot of philosophical issues here about translating to a human context.

Nevertheless, this research shows that ingenious animal models can powerfully elucidate the biological basis of behavior.

Citation: The genetic basis of parental care evolution in monogamous mice. Nature (2017) doi:10.1038/nature22074

## April 18, 2017

### Women hate going to India

Filed under: Anthroplogy,Genetics,Human Genetics,India,Parsi — Razib Khan @ 9:11 pm

For some reason women do not seem to migrate much into South Asia. In the late 2000s I, along with others, noticed a strange discrepancy in the Y and mtDNA lineages which trace one’s direct male and female lines: in South Asia the male lineages were likely to cluster with populations to the north and west, while the female lines did not. South Asia’s female lines in fact had a closer relationship to the mtDNA lineages of Southeast and East Asia, albeit distantly.

One solution which presented itself was to contend there was no paradox at all: that the Y chromosomal lineages found in South Asia were basal to those to the west and north. In particular, there were some papers suggesting that perhaps R1a1a originated in South Asia at the end of the Pleistocene. Whole genome sequencing of Y chromosomes does not bear this out though. R1a1a went through rapid expansion recently, and ancient DNA has found it in Russia first. But in 2009 David Reich came out with Reconstructing Indian population history, which offered up somewhat of a possible solution.

What Reich and his coworkers found was that South Asia seems to be characterized by the mixture of two very different types of populations. One set, ANI (Ancestral North Indian), are basically another western or northwestern Eurasian group. ASI (Ancestral South Indian), are indigenous, and exhibit distant affinities to the Andaman Islanders. The India-specific mtDNA then were from ASI, while the Y chromosomes with affinities to people to the north and west were from ANI. In other words, the ANI mixture into South Asia was probably through a mass migration of males.

But the pattern is not limited to this one case. A minority of South Asians speak Austro-Asiatic languages. The most interesting of these populations are the Munda, who tend to occupy uplands in east-central India. Older books on Indian history often suggest that the Munda are the earliest aboriginals of the subcontinent, but that has to confront the fact that most Austro-Asiatic languages are spoken in Southeast Asia. There was no true consensus on where they were present first.

Genetics seems to have solved this question. The evidence is building up that Austro-Asiatic languages arrived with rice farmers from Southeast Asia. Though most of the ancestry of the Munda is of ANI-ASI mix, a small fraction is clearly East Asian. And interestingly, though they carry no East Asian mtDNA, they do carry East Asian Y. Again, gene flow mediated by males.

The same is true of India’s Bene Israel Jewish community.

A new preprint on biorxiv confirms that the Parsis are another instance of the same dynamic: The genetic legacy of Zoroastrianism in Iran and India: Insights into population structure, gene flow and selection:

Zoroastrianism is one of the oldest extant religions in the world, originating in Persia (present-day Iran) during the second millennium BCE. Historical records indicate that migrants from Persia brought Zoroastrianism to India, but there is debate over the timing of these migrations. Here we present novel genome-wide autosomal, Y-chromosome and mitochondrial data from Iranian and Indian Zoroastrians and neighbouring modern-day Indian and Iranian populations to conduct the first genome-wide genetic analysis in these groups. Using powerful haplotype-based techniques, we show that Zoroastrians in Iran and India show increased genetic homogeneity relative to other sampled groups in their respective countries, consistent with their current practices of endogamy. Despite this, we show that Indian Zoroastrians (Parsis) intermixed with local groups sometime after their arrival in India, dating this mixture to 690-1390 CE and providing strong evidence that the migrating group was largely comprised of Zoroastrian males. By exploiting the rich information in DNA from ancient human remains, we also highlight admixture in the ancestors of Iranian Zoroastrians dated to 570 BCE-746 CE, older than admixture seen in any other sampled Iranian group, consistent with a long-standing isolation of Zoroastrians from outside groups. Finally, we report genomic regions showing signatures of positive selection in present-day Zoroastrians that might correlate to the prevalence of particular diseases amongst these communities.

The paper uses lots of fancy ChromoPainter methodologies which look at the distributions of haplotypes across populations. But some of the primary results are obvious using much simpler methods.

1) About 2/3 of the ancestry of Indian Parsis derives from an Iranian population
2) About 1/3 of the ancestry of Indian Parsis derives from an Indian population
3) Almost all the Y chromosomes of Indian Parsis can be accounted for by Iranian ancestry
4) Almost all the mtDNA haplogroups of Indian Parsis can be accounted for by Indian ancestry
5) Iranian Zoroastrians are mostly endogamous
6) Genetic isolation has resulted in drift and selection on Zoroastrians

The fact that the ancestry proportion is clearly more than 50% Iranian for Parsis indicates that there was more than one generation of males who migrated. They did not contribute mtDNA, but they did contribute genome-wide to Iranian ancestry. There are wide intervals on the dating of this admixture event, but they are consonant with the oral history that was later written down by the Parsis.
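The reasoning about "more than one generation" can be sketched with a toy model. This is my illustration, not the preprint's method: assume that in each wave every father is a newly arrived, fully Iranian male and every mother comes from the existing community, which starts out as fully Indian. The function name `iranian_fraction_after_waves` is hypothetical.

```python
def iranian_fraction_after_waves(waves):
    """Autosomal Iranian ancestry in the community after `waves` generations
    in which all fathers are new Iranian migrants (toy assumption) and all
    mothers are drawn from the community's previous generation.
    """
    mothers = 0.0  # Iranian ancestry of the mothers; initially local Indian women
    for _ in range(waves):
        # Each child gets half its autosomes from a fully Iranian father
        # and half from a mother with the community's current ancestry.
        mothers = (1.0 + mothers) / 2.0
    return mothers


for waves in range(4):
    print(waves, iranian_fraction_after_waves(waves))
```

A single wave of males caps autosomal Iranian ancestry at 50%, and two full waves give 75%. An observed value of about 2/3 falls between the two, which is what you would expect if male migration continued past the founding generation but tapered off. Throughout, mtDNA stays Indian and the Y chromosome stays Iranian in this toy model, matching points 3 and 4 above.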

So there you have it. Another example of a population formed from admixture because women hate going to India.

Citation: The genetic legacy of Zoroastrianism in Iran and India: Insights into population structure, gene flow and selection.
Saioa Lopez, Mark G Thomas, Lucy van Dorp, Naser Ansari-Pour, Sarah Stewart, Abigail L Jones, Erik Jelinek, Lounes Chikhi, Tudor Parfitt, Neil Bradman, Michael E Weale, Garrett Hellenthal
bioRxiv 128272; doi: https://doi.org/10.1101/128272

## April 15, 2017

### Genetic variation in human populations and individuals

Filed under: Genetics,Genomics,Human Genetics,Polymorphisms,SNPs — Razib Khan @ 9:25 pm

I’m old enough to remember when we didn’t have a good sense of how many genes humans had. I vaguely recall numbers around 100,000 at first, which in hindsight seems rather like a round and large number. A guess. Then it went to 40,000 in the early 2000s and then further until it converged to some number just below 20,000.

But perhaps more fascinating is that we have a much better catalog of the variation across the whole human genome now. Often friends ask me questions of the form: “so DTC genomic company X has about 800,000 SNPs, is that enough to do much?” To answer such a question you need some basic numbers in your head, as well as what you want to “do.”

First, the human genome has about 3 billion base pairs (3 Gb). That’s a lot. But most of the genome famously doesn’t code for proteins. The exome, the proportion of the genome where bases directly translate into a protein accounts for 1% of the whole genome. That’s 30 million bases (30 Mb). But this small region of the genome is very important, as the vast majority of major disease mutations are found in the exome.

When it comes to a standard 800K SNP chip, which samples 800,000 positions across the 3 Gb genome, it is likely that the designers enriched the marker set for functional positions relevant to diseases. Not all marker positions are created equal. Though even outside of those functional positions there are often nearby SNPs that can “tag” them, so you can infer one from the state of the other.

But are 800,000 positions enough to make good ancestry inference? (to give one example) Yes. 800,000 is actually a substantial proportion of the polymorphism in any given genome. There have been some papers which improved on the numbers in 2015’s A global reference for human genetic variation, but it’s still a good comprehensive review to get an order-of-magnitude sense. The table below gives you a sense of individual variation:

Median autosomal variant sites per genome

When it comes to single nucleotide polymorphisms (SNPs), what SNP chips are getting at, an 800K array should get a substantial proportion of your genome-wide variation. More than enough for ancestry inference or forensics. The singleton column shows mutations specific to the individual. When focusing on new mutations specific to an individual that might cause disease, singleton large deletions and nonsynonymous SNPs are really where I’d look.

But what about whole populations? The plot to the left shows the count of variants as a function of alternative allele frequency. When we say “SNP”, we really mean variants which exhibit polymorphism at a particular cut-off frequency for the minor allele (often 1%). It is clear that as the minor allele frequency increases in relation to the human reference genome the number of variants decreases.

From the paper:

The majority of variants in the data set are rare: ~64 million autosomal variants have a frequency <0.5%, ~12 million have a frequency between 0.5% and 5%, and only ~8 million have a frequency >5% (Extended Data Fig. 3a). Nevertheless, the majority of variants observed in a single genome are common: just 40,000 to 200,000 of the variants in a typical genome (1–4%) have a frequency <0.5% (Fig. 1c and Extended Data Fig. 3b). As such, we estimate that improved rare variant discovery by deep sequencing our entire sample would at least double the total number of variants in our sample but increase the number of variants in a typical genome by only ~20,000 to 60,000.

An 800K SNP chip will be biased toward the 8 million or so variants with a frequency above 5%. This number gives you a sense of the limited scope of variation in the human genome: those ~8 million common sites make up about 0.27% of the 3 Gb genome, yet they capture a lot of the polymorphism.
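The back-of-the-envelope numbers in this post are easy to check. The sketch below uses only figures quoted above (a 3 Gb genome, a 1% exome, an 800K chip, and the three frequency bins from the paper); the variable names are mine.

```python
GENOME_BP = 3_000_000_000      # ~3 Gb genome, as stated in the post
EXOME_FRACTION = 0.01          # the exome is ~1% of the genome
CHIP_SNPS = 800_000            # positions on a standard 800K array
COMMON_VARIANTS = 8_000_000    # variants with frequency >5% in the paper

exome_bp = GENOME_BP * EXOME_FRACTION
chip_pct = 100 * CHIP_SNPS / GENOME_BP
common_pct = 100 * COMMON_VARIANTS / GENOME_BP

# Autosomal variant counts per frequency bin, as quoted from the paper.
bins = {"<0.5%": 64_000_000, "0.5-5%": 12_000_000, ">5%": 8_000_000}
total_variants = sum(bins.values())
shares = {label: 100 * count / total_variants for label, count in bins.items()}

print(f"exome size: {exome_bp:,.0f} bp")
print(f"chip positions as share of genome: {chip_pct:.3f}%")
print(f"common variants as share of genome: {common_pct:.2f}%")
for label, share in shares.items():
    print(f"variants with frequency {label}: {share:.1f}% of catalog")
```

The chip itself touches only about 0.027% of the genome's positions, while the common variants it is biased toward are roughly a tenth of the ~84 million cataloged variants.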

Citation: 1000 Genomes Project Consortium. “A global reference for human genetic variation.” Nature 526.7571 (2015): 68-74.

## April 8, 2017

### Why only one migrant per generation keeps divergence at bay

The best thing about population genetics is that because it’s a way of thinking and modeling the world it can be quite versatile. If Thinking Like An Economist is a way to analyze the world rationally, thinking like a population geneticist allows you to have the big picture on the past, present, and future, of life.

I have some personal knowledge of this as a transformative experience. My own background was in biochemistry before I became interested in population genetics as an outgrowth of my lifelong fascination with evolutionary biology. It’s not exactly useless knowing all the steps of the Krebs cycle, but it lacks in generality. In his autobiography I recall Isaac Asimov stating that one of the main benefits of his background as a biochemist was that he could rattle off the names on medicine bottles with fluency. Unless you are an active researcher in biochemistry your specialized research is quite abstruse. Population genetics tends to be more applicable to general phenomena.

In a post below I made a comment about how one migrant per generation or so is sufficient to prevent divergence between two populations. This is an old heuristic which goes back to Sewall Wright, and is encapsulated in the formalism Fst ≈ 1/(4mN + 1). Basically the divergence, as measured by Fst, is the inverse of 4 times the proportion of migrants times the total population, plus 1. The mN is equivalent to the number of migrants per generation (proportion times the total population). As the mN becomes very large, the Fst converges to zero.
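As a quick numerical illustration of the relationship described above, Fst ≈ 1/(4mN + 1), here is a minimal sketch. The function name `equilibrium_fst` is my own, and the formula is the standard island-model approximation rather than anything specific to a particular dataset.

```python
def equilibrium_fst(migrants_per_generation):
    """Approximate equilibrium Fst under Wright's island model,
    where migrants_per_generation is the product m * N."""
    return 1.0 / (4.0 * migrants_per_generation + 1.0)


for mn in (0.0, 0.1, 1.0, 10.0, 100.0):
    print(f"mN = {mn:>5}: Fst ~ {equilibrium_fst(mn):.3f}")
```

With no migration the populations diverge completely (Fst of 1). A single migrant per generation already pulls Fst down to 0.2, and ten migrants take it to about 0.02.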

The intuition is pretty simple. Imagine you have two populations which separate at a specific time. For example, sea level rises, so now you have a mainland and an island population. Since before sea level rise the two populations were one random mating population, their initial allele frequencies are the same at t = 0. But once they are separated random drift should begin to subject them to divergence, so that more and more of their genes exhibit differences in allele frequencies (ergo, Fst, the between population proportion of genetic variation, increases from 0).

Now add to this the parameter of migration. Why is one migrant per generation sufficient to keep divergence low? The two extreme scenarios are like so:

1. Large populations change allele frequency very slowly due to drift, so only a small proportion of migration is needed to prevent them from diverging
2. Small populations change allele frequency very fast due to drift, so a larger proportion of migration is needed to prevent them from drifting

Within a large population one migrant is a small proportion, but drift is occurring very slowly. Within a small population drift is occurring fast, but one migrant is a relatively large proportion of a small population.
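The balance described above can be put in rough numbers. In an idealized diploid Wright-Fisher population of size N, drift erodes heterozygosity at a rate of about 1/(2N) per generation, while one migrant per generation corresponds to a migration proportion of 1/N. The sketch below (hypothetical helper names, idealized assumptions) shows that the ratio of the two does not depend on N.

```python
def drift_rate(population_size):
    """Per-generation loss of heterozygosity from drift, ~1/(2N),
    in an idealized diploid Wright-Fisher population."""
    return 1.0 / (2.0 * population_size)


def migration_proportion(population_size, migrants=1):
    """Proportion of the population replaced by migrants each generation."""
    return migrants / population_size


for n in (100, 10_000, 1_000_000):
    ratio = migration_proportion(n) / drift_rate(n)
    print(f"N = {n:>9,}: m = {migration_proportion(n):.6f}, "
          f"drift = {drift_rate(n):.7f}, ratio = {ratio:.1f}")
```

Whether the population is a hundred or a million, one migrant per generation is always twice the per-generation drift rate, which is why the heuristic is stated in migrants rather than in migration proportion.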

Obviously this is a stylized fact with many details which need elaborating. Some conservation geneticists believe that the focus on one migrant is wrongheaded, and the number should be set closer to 10 migrants.

But it still gets at a major intuition: gene flow is extremely powerful and effective at reducing differences between groups. This is why most geneticists are skeptical of sympatric speciation. Though the focus above is on drift, the same intuition applies to selective divergence. Gene flow between populations works at cross-purposes with selection which drives two groups toward different equilibrium frequencies.

This is why it was surprising when results showed that Mesolithic hunter-gatherers and farmers in Europe remained extremely genetically distinct while living in close proximity for on the order of 1,000 years. That being said, strong genetic differentiation persists between Pygmy peoples and their agriculturalist neighbors, despite a long history of living near each other (Pygmies do not have their own indigenous languages, but speak the tongue of their farmer neighbors). In the context of animals physical separation is often necessary for divergence, but for humans cultural differences can enforce surprisingly strong taboos. Culture is as strong a phenomenon as mountains or rivers….

## April 2, 2017

### The future shall, and should, be sequenced

Filed under: Genomics,GWAS,Human Genetics — Razib Khan @ 10:32 pm

Last fall I talked about a preprint, Human demographic history impacts genetic risk prediction across diverse populations. It’s now published in AJHG, with the same informative title, Human Demographic History Impacts Genetic Risk Prediction across Diverse Populations. Even though I talked about this before, I thought it would be useful to highlight again.

To recap, GWAS is a pretty big deal, but only in the last 15 years or so. With genome-wide data researchers began to explore associations between diseases and population genetic variation. In some cases they discovered strong associations between characteristics and genetic variants, but in many cases it turned out that though a trait is highly heritable (e.g., schizophrenia) the causal variants are either not common or do not explain much of the variation in the population (or both).

But as the second decade of GWAS proceeds the sample sizes are getting larger, and researchers are moving from SNP-chips, with their various biases, to high quality whole-genome sequences. One of the major sorts of low hanging fruit in the minds of many people are rare variants. Basically SNP-chips are geared toward finding common variations within large populations, since they have a finite number of markers they are going to interrogate. Sequencing though is a comprehensive catalog of the genome in a relative sense. If you have high coverage (so you sample the site many times) you can easily discover rare mutations within an individual genome that makes them distinctive from almost the rest of the human race (these may be de novo mutations, or, they could be mutations private to their extended pedigree).

But context matters. Martin et al. find that confirmed GWAS hits in Europeans tend to exhibit decreased portability as a function of genetic distance. This isn’t entirely surprising, especially if rarer variants are part of the explanation. Rare variants usually emerged later in history, after the differentiation between geographic races.

A solution would be to have a diverse panel of populations in your studies. For many reasons this was not to be. Northwest Europeans are enormously enriched in current data sets. Martin et al. observe that recently this has diminished somewhat, from 95% European to less than 80%. But they observe that this is mostly due to the inclusion of “Asian” samples, as opposed to African and Native Americans, who remain as underrepresented as they were several years ago.

The African and Native American samples present somewhat different problems. The Native American groups are quite drifted due to bottlenecks. Likely they have their own variants due to the combined effects of mutation and selection through 15,000 to 20,000 years of isolation from other human populations. In contrast, the African groups have lots of diversity with a high time depth due to their ancestral histories, which are less subject to bottleneck effects. The prediction ability into Africans of current GWAS looks to be rather pathetic. This is reasonable because their diversity is poorly captured in Eurocentric study designs, and, they are more genetically diverged from Europeans than Asians are.

Ultimately I think, and hope, this portability question will be of short term utility. As sequencing gets cheap, and studies become more numerous, we’ll fill in the gaps of understudied populations. Finally, ethics is above my paygrade, but I do hope those who demand a strenuous bar on consent keep in mind that this will result in slower growth of these study populations. Academics want to do a good job, but they also want to stay on the good side of IRB.

Citation: Martin, Alicia R., et al. “Human demographic history impacts genetic risk prediction across diverse populations.” bioRxiv (2016): 070797.
