Live not by visualization alone


[Figure: Synthetic map, PC1]

In the age of 500,000-SNP studies of genetic variation across dozens of populations, we’re obviously a bit beyond lists of ABO blood group frequencies. No human is going to be able to discern, at this marker density, the patterns of correlated allele frequency variation which point to genetic differences between populations. So you rely on techniques which extract the general patterns from the data and present them to you in a human-comprehensible format. But there’s an unfortunate tendency for humans to imbue the products of technique with an authority which they should not always have.

The History and Geography of Human Genes is arguably the most important historical genetics work of the past generation. It has surely influenced many within the field of genetics, and because of its voluminous and elegant visual displays of genetic data it is also a primary source for those outside of genetics trying to make sense of phylogenetic relations between human populations. And yet one aspect of this great work which never caught on was the utilization of “synthetic maps” to visualize components of genetic variation between populations. This may have been fortunate: a few years ago a paper was published, Interpreting principal components analyses of spatial population genetic variation, which suggested that the gradients you see on the map above may be artifacts:

Nearly 30 years ago, Cavalli-Sforza et al. pioneered the use of principal component analysis (PCA) in population genetics and used PCA to produce maps summarizing human genetic variation across continental regions. They interpreted gradient and wave patterns in these maps as signatures of specific migration events. These interpretations have been controversial, but influential, and the use of PCA has become widespread in analysis of population genetics data. However, the behavior of PCA for genetic data showing continuous spatial variation, such as might exist within human continental groups, has been less well characterized. Here, we find that gradients and waves observed in Cavalli-Sforza et al.’s maps resemble sinusoidal mathematical artifacts that arise generally when PCA is applied to spatial data, implying that the patterns do not necessarily reflect specific migration events. Our findings aid interpretation of PCA results and suggest how PCA can help correct for continuous population structure in association studies.
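To make the artifact concrete, here is a minimal sketch of my own, not the authors' code. It assumes a toy covariance between demes on a line that simply decays exponentially with distance (the function names, the 30 demes, and the decay scale are all illustrative choices). There is no migration event anywhere in this setup, only isolation by distance, and the leading components are extracted with plain power iteration.

```python
import math

def spatial_covariance(n_demes, scale):
    """Toy covariance that decays with distance along a line (assumed form)."""
    return [[math.exp(-abs(i - j) / scale) for j in range(n_demes)]
            for i in range(n_demes)]

def double_center(K):
    """Remove row and column means, as PCA does when it centers the data."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    grand = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + grand for j in range(n)]
            for i in range(n)]

def top_components(M, k, iters=300):
    """Leading k eigenvectors of a symmetric matrix via power iteration."""
    n = len(M)
    comps = []
    for c in range(k):
        # arbitrary fixed starting vector, so the run is deterministic
        v = [math.sin(1.0 + (c + 1) * i) for i in range(n)]
        for _ in range(iters):
            for u in comps:  # stay orthogonal to components already found
                proj = sum(a * b for a, b in zip(u, v))
                v = [a - proj * b for a, b in zip(v, u)]
            w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        comps.append(v)
    return comps

def sign_changes(v):
    """Number of times consecutive entries switch sign along the line."""
    return sum(1 for a, b in zip(v, v[1:]) if a * b < 0)

K = spatial_covariance(n_demes=30, scale=15.0)
pc1, pc2 = top_components(double_center(K), k=2)
print(sign_changes(pc1), sign_changes(pc2))
```

In this toy setting PC1 should cross zero once (a gradient running from one end of the line to the other) and PC2 twice (a wave), even though nothing ever migrated. That is the flavor of the sinusoidal artifact the abstract describes, though the real analysis deals with two-dimensional maps and actual genotype data.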

A paper earlier this year took the earlier work further and used a series of simulations to show how the nature of the gradients varied. In light of recent preoccupations the results are of interest. Principal Component Analysis under Population Genetic Models of Range Expansion and Admixture:

In a series of highly influential publications, Cavalli-Sforza and colleagues used principal component (PC) analysis to produce maps depicting how human genetic diversity varies across geographic space. Within Europe, the first axis of variation (PC1) was interpreted as evidence for the demic diffusion model of agriculture, in which farmers expanded from the Near East ∼10,000 years ago and replaced the resident hunter-gatherer populations with little or no interbreeding. These interpretations of the PC maps have been recently questioned as the original results can be reproduced under models of spatially covarying allele frequencies without any expansion. Here, we study PC maps for data simulated under models of range expansion and admixture. Our simulations include a spatially realistic model of Neolithic farmer expansion and assume various levels of interbreeding between farmer and resident hunter-gatherer populations. An important result is that under a broad range of conditions, the gradients in PC1 maps are oriented along a direction perpendicular to the axis of the expansion, rather than along the same axis as the expansion. We propose that this surprising pattern is an outcome of the “allele surfing” phenomenon, which creates sectors of high allele-frequency differentiation that align perpendicular to the direction of the expansion.

The first figure shows the general framework with which they performed the simulations:

[Figure: the simulation framework]

You have a lattice which consists of demes, population units, all across Europe. They modulated parameters such as population growth (r), carrying capacity (C), and migration (m). Additionally, they had various scenarios of expansion from the southwest or southeast, as well as two expansions one after another, to mimic the re-population of Europe after the Ice Age by Paleolithic groups and their later replacement by Neolithic groups. They modulated admixture and introgression of genes from the Paleolithic group into the Neolithics so that you had the full range, from final Europeans who were mostly Neolithic to final Europeans who were mostly Paleolithic.
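To give a feel for what a lattice of demes with these three parameters does, here is a deliberately stripped-down sketch. It is an assumption-laden toy, not the paper's simulation: one dimension instead of a map of Europe, deterministic population sizes, no genetics, and parameter values (r = 0.5, C = 1000, m = 0.2, 10 founders) that I picked purely for illustration.

```python
def step(pop, r, capacity, m):
    """One generation: logistic growth in each deme, then neighbour migration."""
    grown = [n + r * n * (1.0 - n / capacity) for n in pop]
    last = len(grown) - 1
    nxt = [0.0] * len(grown)
    for i, n in enumerate(grown):
        out = m * n / 2.0           # emigrants sent in each direction
        nxt[i] += n - 2.0 * out     # residents who stay put
        # at the edges of the lattice the would-be emigrants stay home
        if i > 0:
            nxt[i - 1] += out
        else:
            nxt[i] += out
        if i < last:
            nxt[i + 1] += out
        else:
            nxt[i] += out
    return nxt

def simulate(n_demes, generations, r=0.5, capacity=1000.0, m=0.2, founders=10.0):
    """Start with a few founders in deme 0 and let them spread along the line."""
    pop = [0.0] * n_demes
    pop[0] = founders
    for _ in range(generations):
        pop = step(pop, r, capacity, m)
    return pop

def occupied(pop, capacity=1000.0):
    """Count demes that have filled to more than half of carrying capacity."""
    return sum(1 for n in pop if n > capacity / 2.0)

early = simulate(n_demes=40, generations=20)
late = simulate(n_demes=40, generations=200)
print(occupied(early), occupied(late))
```

The count of filled demes grows with time as the wave of advance moves down the line. The paper's simulations layer stochastic genetics and admixture between residents and newcomers on top of this kind of demographic skeleton.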

Below are some of the figures which show the results:

[Figure: simulation results]

As you can see, the strange thing is that in some models the synthetic map gradient is rotated 90 degrees from the axis of demographic expansion! In this telling the famous synthetic map showing Neolithic expansion might instead be showing expansion from Iberia. Perhaps a radiation from a post-Ice Age southern refuge?

One explanation might be “allele surfing” on the demographic “wave of advance.” Basically, as a population expands very rapidly, stochastic forces such as random genetic drift and bottlenecks can produce diversification along the edge of the population wave front. The reason is that the expanding front passes through a series of bottlenecks, each followed by demographic expansion, which produces genetic distinctiveness among the many differentiated demes bubbling along the edge of expansion. Alleles which may have been at low frequency in the ancestral population can “fix” in descendant populations on the edge of the demographic wave of advance. This is the explanation, more or less, that one group gave last year for the very high frequencies of R1b1b2 in Western Europe. With this, they overturned the classic assumption that R1b1b2 was a Paleolithic marker, and suggested it was a Neolithic one.
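The surfing idea can be caricatured in a few lines. The sketch below is a serial founder model, which is my simplification and not the spatially explicit model used in the paper. Each new deme on the front is founded by a handful of gene copies drawn from the deme behind it, and the starting frequency (0.1), the 10 founder copies, the 40 demes, and the 200 replicate paths are all assumed values.

```python
import random

def colonize(p0, n_demes, founder_copies, rng):
    """Allele frequency in each deme along one colonization path."""
    freqs = [p0]
    for _ in range(n_demes - 1):
        p = freqs[-1]
        # each new deme is founded by a small sample of gene copies
        # drawn from the deme immediately behind the front
        count = sum(1 for _ in range(founder_copies) if rng.random() < p)
        freqs.append(count / founder_copies)
    return freqs

rng = random.Random(1)  # fixed seed so the run is repeatable
paths = [colonize(0.1, 40, 10, rng) for _ in range(200)]

front = [path[-1] for path in paths]       # frequency at the edge of the wave
fixed = sum(1 for p in front if p == 1.0)  # allele surfed to fixation
lost = sum(1 for p in front if p == 0.0)   # allele dropped out on the way
print(fixed, lost, len(front) - fixed - lost)
```

Most paths lose the allele, but a minority carry it all the way to fixation at the front, starting from a frequency of only 10% and with no selection at all. Line up many such paths side by side along a two-dimensional front and you get neighboring sectors fixed for different alleles, which is the kind of differentiation perpendicular to the expansion that the authors invoke.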

Here’s their conclusion from the paper:

A previous study showed that the original patterns observed in PCA might not reflect any expansion events (Novembre and Stephens 2008). Here, we find that under very general conditions, the pattern of molecular diversity produced by an expansion may be different than what was expected in the literature. In particular, we find conditions where an expansion of Neolithic farmers from the southeast produces a greatest axis of differentiation running from the southwest to the northeast. This surprising result is seemingly due to allele surfing leading to sectors that create differentiation perpendicular to the expansion axis. Although a lot of our results can be explained by the surfing phenomenon, some interesting questions remain open. For example, the phase transition observed for relatively small admixture rates between Paleolithic resident and Neolithic migrant populations occurs at a value that is dependent on our simulation settings, and further investigations would be needed to better characterize this critical value as a function of all the model parameters. Another unsolved question is to know why the patterns generally observed in PC2 maps for our simulation settings sometimes arise in PC1 maps instead. These unexplained examples remind us that PCA is summarizing patterns of variation in the sample due to multiple factors (ancestral expansions and admixture, ongoing limited migration, habitat boundary effects, and the spatial distribution of samples). In complex models such as our expansion models with admixture in Europe, it may be difficult to tease apart what processes give rise to any particular PCA pattern. Our study emphasizes that PC (and AM) should be viewed as tools for exploring the data but that the reverse process of interpreting PC and AM maps in terms of past routes of migration remains a complicated exercise. 
Additional analyses—with more explicit demographic models—are more than ever essential to discriminate between multiple explanations available for the patterns observed in PC and AM maps. We speculate that methods exploiting the signature of alleles that have undergone surfing may be a powerful approach to study range expansions.

What’s the big picture here? In the textbook Human Evolutionary Genetics it is asserted that synthetic maps never became very popular compared to PCA itself. I think this is correct. But, the original synthetic maps have become prominent for many outside of genetics. They figure in Peter Bellwood’s First Farmers, and are taken as a given by many pre-historians, such as Colin Renfrew. And yet a reliance on these sorts of tools must not be blind to the reality that the more layers of abstraction you put between your perception and comprehension of concrete reality, the more likely you are to be led astray by quirks and biases of method.

In this case I do think first-order intuition would tell us that synthetic maps which display PCs would be showing gradients as a function of demographic pulses. And yet the intuition may not be right, and with the overturning of old orthodoxies in the past generation of inferences from the variation patterns in modern populations, we should be very cautious.

Citation: Olivier François, Mathias Currat, Nicolas Ray, Eunjung Han, Laurent Excoffier, & John Novembre (2010). Principal Component Analysis under Population Genetic Models of Range Expansion and Admixture. Mol Biol Evol


Razib Khan