Razib Khan One-stop-shopping for all of my content

September 16, 2017

Carving nature at its joints more realistically

Filed under: Admixture,construct,phylogenetics,Population genetics,Structure — Razib Khan @ 10:23 pm

If you are working on phylogenetic questions on a coarse evolutionary scale (that is, “macroevolutionary,” though I know some evolutionary geneticists will shoot me the evil eye for using that word) generating a tree of relationships is quite informative and relatively straightforward, since it has a comprehensible mapping onto what really occurred in nature. When your samples are different enough that the biological species concept works well and gene flow doesn’t occur between nodes, then a tree is a tree (one reason Y and mtDNA results are so easy to communicate to the general public in personal genomics).

Everything becomes more problematic when you are working on a finer phylogenetic scale (or in taxa where inter-species gene flow is common, as is often the case with plants). And I’m using problematic here in the way that denotes a genuine substantive analytic issue, as opposed to connoting something that one has moral or ethical objections to.

It is intuitively clear that there is often genetic population structure within species, but how to summarize and represent that variation is not a straightforward task.

In 2000 the paper Inference of Population Structure Using Multilocus Genotype Data in Genetics introduced the sort of model-based clustering most famously implemented with Structure. The paper illustrates limitations with the neighbor-joining tree methods which were in vogue at the time, and contrasts them with a method which defines a finite set of populations and assigns proportions of each putative group to various individuals.

The model-based methods were implemented in numerous packages over the 2000s, and today they’re pretty standard parts of the phylogenetic and population genetic toolkits. The reason for their popularity is obvious: they are quite often clear and unambiguous in their results. This may be one reason that they emerged to complement visualization methods like PCA and MDS, which make fewer a priori assumptions.

But of course, crisp clarity is not always reality. Sometimes nature is fuzzy and messy. The model-based methods take inputs and will produce crisp results, even if those results are not biologically realistic. They can’t be utilized in a robotic manner without attention to the assumptions and limitations (see A tutorial on how (not) to over-interpret STRUCTURE/ADMIXTURE bar plots).

This is why it is exciting to see a new preprint which addresses many of these issues, Inferring Continuous and Discrete Population Genetic Structure Across Space*:

A classic problem in population genetics is the characterization of discrete population structure in the presence of continuous patterns of genetic differentiation. Especially when sampling is discontinuous, the use of clustering or assignment methods may incorrectly ascribe differentiation due to continuous processes (e.g., geographic isolation by distance) to discrete processes, such as geographic, ecological, or reproductive barriers between populations. This reflects a shortcoming of current methods for inferring and visualizing population structure when applied to genetic data deriving from geographically distributed populations. Here, we present a statistical framework for the simultaneous inference of continuous and discrete patterns of population structure….

The whole preprint should be read by anyone interested in phylogenomic inference, as there is extensive discussion and attention to many problems and missteps that occur when researchers attempt to analyze variation and relationships across a species’ range. Basically, the sort of thing that might be mentioned in peer review feedback, but isn’t likely to be included in any final write-ups.

As noted in the abstract the major issue being addressed here is the problem that many clustering methods do not include within their model the reality that genetic variation within a species may be present due to continuous gene flow defined by isolation by distance dynamics. This goes back to the old “clines vs. clusters” debates. Many of the model-based methods assume pulse admixtures between population clusters which are randomly mating. This is not a terrible assumption when you consider perhaps what occurred in the New World when Europeans came in contact with the native populations and introduced Africans. But it is not so realistic when it comes to the North European plain, which seems to have become genetically differentiated only within the last ~5,000 years, and has likely seen extensive gene flow.

The figure below shows the results from the conStruct method (left), and the more traditional fastStructure (right):

There are limitations to the spatial model they use (e.g., ring species), but that’s true of any model. The key is that it’s a good first step to account for continuous gene flow, and not shoehorning all variation into pulse admixtures.

Though in beta, the R package is already available on github (easy enough to download and install). I’ll probably have more comment when I test drive it myself….

* I am friendly with the authors of this paper, so I am also aware of their long-held concerns about the limitations and/or abuses of some phylogenetic methods. These concerns are broadly shared within the field.

September 14, 2017

After agriculture, before bronze

 

The above plot shows genetic distance/variation between highland and lowland populations in Papua New Guinea (PNG). It is from a paper in Science that I have been anticipating for a few months (I talked to the first author at SMBE), A Neolithic expansion, but strong genetic structure, in the independent history of New Guinea.

What does “strong genetic structure” mean? Basically Fst is showing the proportion of genetic variation which is partitioned between groups. Intuitively it is easy to understand, in that if ~1% of the genetic variation is partitioned between groups in one case, and ~10% in another, then it is reasonable to suppose that the genetic distance between groups in the second case is larger than in the first case. On a continental scale Fst between populations is often on the order of ~0.10. That is the value for example when you pool the variation amongst Northern Europeans and Chinese, and assess how much of it can be apportioned in a manner which differentiates populations (so it’s about ~10% of the variation).
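To make the "proportion of variation partitioned between groups" idea concrete, here is a minimal sketch for a single biallelic site in two populations. It assumes equal population sizes and uses the simple heterozygosity-based definition, Fst = (H_T - H_S) / H_T; the function name and the example allele frequencies are made up for illustration, and real estimators (which correct for sample size and average over many sites) are more involved.

```python
def fst_two_populations(p1, p2):
    """Heterozygosity-based Fst at one biallelic site for two equally sized populations.

    p1 and p2 are the frequencies of the same allele in each population.
    """
    p_bar = (p1 + p2) / 2
    # Expected heterozygosity if the two populations were pooled into one.
    h_total = 2 * p_bar * (1 - p_bar)
    # Average expected heterozygosity within each population.
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    if h_total == 0:
        return 0.0  # the site is monomorphic, so there is nothing to partition
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    # Hypothetical frequencies chosen to land near a continental-scale value.
    print(fst_two_populations(0.34, 0.66))
    # Identical frequencies mean no between-group component at all.
    print(fst_two_populations(0.5, 0.5))
```

With the hypothetical frequencies 0.34 and 0.66, roughly a tenth of the heterozygosity is between populations, which is the kind of value the continental comparisons above produce.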

This is why ancient DNA results which reported that Mesolithic hunter-gatherers and Neolithic farmers in Central Europe who coexisted in rough proximity for thousands of years exhibited differences on the order of ~0.10 elicited surprise. These are values we are now expecting from continental-scale comparisons. Perhaps an appropriate analogy might be the coexistence of Pygmy groups and Bantu agriculturalists? Though there is some gene flow, the two populations exist in symbiosis and exhibit local ecological segregation.

In PNG continental scale Fst values are also seen among indigenous people. The differences between the peoples who live in the highlands and lowlands of PNG are equivalent to those between huge regions of Eurasia. This is not entirely surprising because there has been non-trivial gene flow into lowland populations from Austronesian groups, such as the Lapita culture. Many lowland groups even speak Austronesian languages today.

Using standard ADMIXTURE analysis the paper shows that many lowland groups have significant East Asian ancestry (red), while none of the highland groups do (some individuals with East Asian admixture seem to be due to very recent gene flow). But even within the highlands the genetic differences are striking. The Fst values between Finns and Southern European groups such as Spaniards are very high in a European context (due to Finnish Siberian ancestry as well as drift through a bottleneck), but most comparisons within the highland groups in PNG still exceed this.

The paper also argues that genetic differences between Papuans and the natives of Australia pre-date the rising sea levels at the beginning of the Holocene, when Sahul divided between its various constituents. This is not entirely surprising considering that the ecology of the highlands during the Pleistocene would have been considerably different from Australia to the south, resulting in sharp differences in the hunter-gatherer lifestyles. Additionally, there does not seem to have been a genetic cline. Papuans are symmetrically related to all Australian groups they had samples from.

Using coalescence-based genomic methods they inferred that separation between highlands and some lowland groups occurred ~10-20,000 years ago. That is, after the Last Glacial Maximum. For the highlands, the differences seem to date to within the last 10,000 years. The Holocene. Additionally, they see population increases in the highlands, correlating with the shift to agriculture (cultivation of taro).

None of the above is entirely surprising, though I would take the date inferences with a grain of salt. The key is to observe that large genetic differences, as well as cultural differences, accrued in the highlands of PNG during the Holocene. In the paper they have a social and cultural explanation for what’s going on:

  Fst values in PNG fall between those of hunter-gatherers and present-day populations of west Eurasia, suggesting that a transition to cultivation alone does not necessarily lead to genetic homogenization.

A key difference might be that PNG had no Bronze Age, which in west Eurasia was driven by an expansion of herders and led to massive population replacement, admixture, and cultural and linguistic change (7, 8), or Iron Age such as that linked to the expansion of Bantu-speaking farmers in Africa (24). Such cultural events have resulted in rapid Y-chromosome lineage expansions due to increased male reproductive variance (25), but we consistently find no evidence for this in PNG (fig. S13). Thus, in PNG, we may be seeing the genetic, linguistic, and cultural diversity that sedentary human societies can achieve in the absence of massive technology-driven expansions.

Peter Turchin in books like Ultrasociety has argued that one of the theses in Steven Pinker’s The Better Angels of Our Nature is incorrect: violence has not decreased monotonically, but peaked in less complex agricultural societies. PNG is clearly a case of this, as endemic warfare was a feature of highland societies when they encountered Europeans. Lawrence Keeley’s War Before Civilization: The Myth of the Peaceful Savage gives so much attention to highland PNG because it is a contemporary illustration of a Neolithic society which until recently had not developed state-level institutions.

What papers like these are showing is that cultural and anthropological dynamics strongly shape the nature of genetic variation among humans. Simple models which assume as a null hypothesis that gene flow occurs through diffusion processes across a landscape where only geographic obstacles are relevant simply do not capture enough of the dynamic. Human cultures strongly shape the nature of interactions, and therefore the genetic variation we see around us.

September 10, 2017

Quantitative genomics, adaptation, and cognitive phenotypes

The human brain utilizes about ~20% of the calories you take in per day. It’s a large and metabolically expensive organ. Because of this fact there are lots of evolutionary models which focus on the brain. In Catching Fire: How Cooking Made Us Human Richard Wrangham suggests that our need for calories to feed our brain is one reason we started to use fire to pre-digest our food. In The Mating Mind Geoffrey Miller seems to suggest that all the things our big complex brain does allows for a signaling of mutational load. And in Grooming, Gossip, and the Evolution of Language Robin Dunbar suggests that it’s social complexity which is driving our encephalization.

These are all theories. Interesting hypotheses and models. But how do we test them? A new preprint on bioRxiv is useful because it shows how cutting-edge methods from evolutionary genomics can be used to explore questions relating to cognitive neuroscience and psychopathology, Polygenic selection underlies evolution of human brain structure and behavioral traits:

…Leveraging publicly available data of unprecedented sample size, we studied twenty-five traits (i.e., ten neuropsychiatric disorders, three personality traits, total intracranial volume, seven subcortical brain structure volume traits, and four complex traits without neuropsychiatric associations) for evidence of several different signatures of selection over a range of evolutionary time scales. Consistent with the largely polygenic architecture of neuropsychiatric traits, we found no enrichment of trait-associated single-nucleotide polymorphisms (SNPs) in regions of the genome that underwent classical selective sweeps (i.e., events which would have driven selected alleles to near fixation). However, we discovered that SNPs associated with some, but not all, behaviors and brain structure volumes are enriched in genomic regions under selection since divergence from Neanderthals ~600,000 years ago, and show further evidence for signatures of ancient and recent polygenic adaptation. Individual subcortical brain structure volumes demonstrate genome-wide evidence in support of a mosaic theory of brain evolution while total intracranial volume and height appear to share evolutionary constraints consistent with concerted evolution…our results suggest that alleles associated with neuropsychiatric, behavioral, and brain volume phenotypes have experienced both ancient and recent polygenic adaptation in human evolution, acting through neurodevelopmental and immune-mediated pathways.

The preprint takes a kitchen-sink approach, throwing a lot of methods of selection at the phenotype of interest. Also, there is always the issue of cryptic population structure generating false positive associations, but they try to address it in the preprint. I am somewhat confused by this passage though:

Paleobiological evidence indicates that the size of the human skull has expanded massively over the last 200,000 years, likely mirroring increases in brain size.

From what I know human cranial sizes leveled off in growth ~200,000 years ago, peaked ~30,000 years ago, and have declined ever since then. That being said, they find signatures of selection around genes associated with ‘intracranial volume.’

There are loads of results using different methods in the paper, but I was curious to note that schizophrenia had hits for ancient and recent adaptation. A friend who is a psychologist pointed out to me that when you look within families “unaffected” siblings of schizophrenics often exhibit deviation from the norm in various ways too; so even if they are not impacted by the disease, they are somewhere along a spectrum of ‘wild type’ to schizophrenic. In any case in this paper they found recent selection for alleles ‘protective’ of schizophrenia.

There are lots of theories one could spin out of that singular result. But I’ll just leave you with the fact that when you have a quantitative trait with lots of heritable variation it seems unlikely it’s been subject to a long period of unidirectional selection. Various forms of balancing selection seem to be at work here, and we’re only in the early stages of understanding what’s going on. Genuine comprehension will require:

– attention to population genetic theory
– large genomic data sets from a wide array of populations
– novel methods developed by population genomicists
– and functional insights which neuroscientists can bring to the table

June 27, 2017

Why you should learn some population genetics

Filed under: Population genetics — Razib Khan @ 10:03 pm

From reader surveys I know a substantial portion of the people who will see this post are financially well off (of those who aren’t, a large number are students). Therefore, you can invest in some books.

Often people ask me questions related to population genetics in the comments (sometimes I get emails). That is all well and good. But it is always better to be able to fish than have to ask for fish. Additionally, learning some population and quantitative genetics allows you to develop some tacit schemas through which you can process information coming at you, and through which you can develop some general intuition.

If you have a modest level of mathematical fluency and the disposable income, here are three indispensable books which are like the keys to the kingdom:

* Elements of Evolutionary Genetics
* Principles of Population Genetics
* Introduction to Quantitative Genetics.

If you don’t have the cash to spare, there are online notes which are pretty good:

* Graham Coop’s Population Genetics notes
* Joe Felsenstein’s Theoretical Evolutionary Genetics

There are other online resources, but they are not as comprehensive. John Gillespie’s Population Genetics: A Concise Guide is good as very gentle introductions go, but if you are going to spend money, I think just plumping down for a more comprehensive textbook (which will have more genomics in it) is better over the long run.

The goal of getting these books isn’t to make you a population geneticist, but, if you are interested in evolutionary questions it gives you a powerful toolkit. Really nothing in evolutionary process makes sense except in the light of population genetics.

April 25, 2017

Dost thou know the equilibrium at panmixia?

Filed under: Genetics,Population genetics — Razib Khan @ 3:58 pm

If you read a blog about Biblical criticism from a Christian perspective it would probably be best if you were familiar with the Bible. You don’t have to have read much scholarly commentary, rather, just the New Testament. Barring that, at least the synoptic gospels!

At this point, with over 400 individuals responding to the reader survey, it is strange to consider that more people believe they have a handle on what Fst is than the Hardy-Weinberg Equilibrium. First, Fst is a more subtle concept than people often think it is. And second, because the HWE is so easy, important, and foundational to population genetics. I mean p^2 + 2pq + q^2 = 1. Could it be simpler???
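For anyone who wants to poke at it, the Hardy-Weinberg expectation fits in a few lines. This is only a sketch for one locus with two alleles at frequencies p and q = 1 - p, assuming random mating and no selection, migration, mutation, or drift; the function name is made up for illustration.

```python
def hardy_weinberg(p):
    """Expected genotype frequencies (AA, Aa, aa) for allele frequency p under HWE."""
    q = 1 - p
    # p^2 + 2pq + q^2 is just (p + q)^2 expanded, so the three terms always sum to 1.
    return p * p, 2 * p * q, q * q


if __name__ == "__main__":
    aa_dom, het, aa_rec = hardy_weinberg(0.3)
    print(aa_dom, het, aa_rec, aa_dom + het + aa_rec)
```

The heterozygote term is 2pq because there are two ways to draw one copy of each allele, which is the part people most often get wrong.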

So a quick ask. If you are one of the people who doesn’t understand HWE or why it is important, please get yourself a copy of John Gillespie’s Population Genetics: A Concise Guide. I understand that not everyone has the time, interest, or money for Principles of Population Genetics, or any of the more “hardcore” texts. But Population Genetics: A Concise Guide will surely suffice to follow anything on this blog.

Or, barring that, please review the online resources which you have available. Two examples:

Graham Coop’s Notes on Population Genetics or Joe Felsenstein’s unpublished textbook Theoretical Evolutionary Genetics.

April 23, 2017

Why the rate of evolution may only depend on mutation

Filed under: Evolutionary Genetics,Genetics,Population genetics — Razib Khan @ 10:07 pm

Sometimes people think evolution is about dinosaurs.

It is true that natural history plays an important role in inspiring and directing our understanding of evolutionary process. Charles Darwin was a natural historian, and evolutionary biologists often have strong affinities with the natural world and its history. Though many people exhibit a fascination with the flora and fauna around us during childhood, often the greatest biologists retain this wonderment well into adulthood (if you read W. D. Hamilton’s collections of papers, Narrow Roads of Gene Land, which have autobiographical sketches, this is very evidently true of him).

But another aspect of evolutionary biology, which began in the early 20th century, is the emergence of formal mathematical systems of analysis. So you have fields such as phylogenetics, which have gone from intuitive and aesthetic trees of life, to inferences made using the most new-fangled Bayesian techniques. And, as told in The Origins of Theoretical Population Genetics, in the 1920s and 1930s a few mathematically oriented biologists constructed much of the formal scaffold upon which the Neo-Darwinian Synthesis was constructed.

The product of evolution

At the highest level of analysis evolutionary process can be described beautifully. Evolution is beautiful, in that its end product generates the diversity of life around us. But a formal mathematical framework is often needed to clearly and precisely model evolution, and so allow us to make predictions. R. A. Fisher’s aim when he wrote The Genetical Theory of Natural Selection was to create for evolutionary biology something equivalent to the laws of thermodynamics. I don’t really think he succeeded in that, though there are plenty of debates around something like Fisher’s fundamental theorem of natural selection.

But the revolution of thought that Fisher, Sewall Wright, and J. B. S. Haldane unleashed has had real yields. As geneticists they helped us reconceptualize evolutionary process as more than simply heritable morphological change, but an analysis of the units of heritability themselves, genetic variation. That is, evolution can be imagined as the study of the forces which shape changes in allele frequencies over time. This reduces a big domain down to a much simpler one.

Genetic variation is a concrete currency with which one can track evolutionary process. Initially this was done via inferred correlations between marker traits and particular genes in breeding experiments. Ergo, the origins of “the fly room”.

But with the discovery of DNA as the physical substrate of genetic inheritance in the 1950s the scene was set for the revolution in molecular biology, which also touched evolutionary studies with the explosion of more powerful assays. Lewontin & Hubby’s 1966 paper triggered an order of magnitude increase in our understanding of molecular evolution through both theory and results.

The theoretical side occurred in the form of the development of the neutral theory of molecular evolution, which also gave birth to the nearly neutral theory. Both of these theories hold that most of the variation within and between species at polymorphic sites is due to random processes. In particular, genetic drift. As a null hypothesis neutrality was very dominant for the past generation, though in recent years some researchers are suggesting that selection has been undervalued as a parameter for various reasons.

Setting aside the live scientific debate, which continues to this day, one of the predictions of neutral theory is that the rate of evolution will depend only on the rate of mutation. More precisely, the rate of substitution of new mutations (where the allele goes from a single copy to fixation of ~100%) is equal to the rate of mutation of new alleles. Population size doesn’t matter.

The algebra behind this is straightforward.

First, remember that the frequency of a new mutation within a population is \frac{1}{2N}, where N is the population size (the 2 is because we’re assuming diploid organisms with two gene copies). This is also the probability of fixation of a new mutation in a neutral scenario; its probability is just equal to its initial frequency (it’s a random walk process between 0 and 1.0 proportions). The rate of mutations is defined by \mu, the number of expected mutations at a given site per generation (this is a pretty small value, for humans it’s on the order of 10^{-8}). Again, there are 2N gene copies, so you have 2N\mu to count the number of new mutations.

The probability of fixation of a new mutation multiplied by the number of new mutations is:

    \[ \left( \frac{1}{2N} \right) \times 2N\mu = \mu \]

So there you have it. The rate of fixation of these new mutations is just a function of the rate of mutation.
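If the algebra feels too slick, you can check the 1/(2N) fixation probability by brute force. The sketch below is a bare-bones Wright-Fisher simulation, assuming a constant population of N diploids, non-overlapping generations, and strict neutrality; the function names, the seed, and the tiny population size are choices made here for illustration, not anything from the literature.

```python
import random


def fixation_probability(n_diploid, replicates, seed=0):
    """Estimate how often a single new neutral mutation drifts all the way to fixation."""
    rng = random.Random(seed)
    total_copies = 2 * n_diploid
    fixed = 0
    for _ in range(replicates):
        count = 1  # the new mutation starts as one copy out of 2N
        while 0 < count < total_copies:
            freq = count / total_copies
            # Each gene copy in the next generation is drawn at random from the current one.
            count = sum(1 for _ in range(total_copies) if rng.random() < freq)
        if count == total_copies:
            fixed += 1
    return fixed / replicates


def substitution_rate(n_diploid, mu):
    """New mutations per generation (2N * mu) times their fixation probability (1 / 2N)."""
    return (2 * n_diploid * mu) * (1 / (2 * n_diploid))


if __name__ == "__main__":
    print(fixation_probability(10, 20000))  # should hover near 1 / (2 * 10)
    for n in (100, 10_000, 1_000_000):
        print(n, substitution_rate(n, 1e-8))  # N cancels out every time
```

The simulation is only there to make the first factor believable; the second function is the cancellation itself, and it returns the mutation rate no matter what N you feed it.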

Simple formalisms like this have a lot more gnarly math that extend them and from which they derive. But they’re often pretty useful to gain a general intuition of evolutionary processes. If you are genuinely curious, I would recommend Elements of Evolutionary Genetics. It’s not quite a core dump, but it is a way you can borrow the brains of two of the best evolutionary geneticists of their generation.

Also, you will be able to answer the questions on my survey better the next time!

April 14, 2017

Why overdominance probably isn’t responsible for much polymorphism

Filed under: Genetics,Population genetics — Razib Khan @ 10:54 pm

Hybrid vigor is a concept that many people have heard of, because it is very useful in agricultural genetics, and makes some intuitive sense. Unfortunately it often gets deployed in a variety of contexts, and its applicability is often overestimated. For example, many people seem to think (from personal communication) that it may somehow be responsible for the genetic variation around us.

This is just not so. As you may know each human carries millions of genetic variants within their genome. Populations have various levels of polymorphism at particular positions in the genome. How’d they get there? In the early days of population genetics there were two broad schools, the “balance” and “classical.” The former made the case for the importance of balancing selection in maintaining variation. The latter suggested that the variation we see around us is simply a transient between fixation of a favored mutation from a low frequency or extinction of a disfavored variant (perhaps environmental conditions changed and a high frequency variant is now disfavored). Arguably the rise of neutral theory and empirical results from molecular evolution supported the classical model more than the balance framework (at least this was Richard Lewontin’s argument, and I follow his logic here).

But even in relation to alleles which are maintained at polymorphism through balancing selection, overdominance isn’t going to be the major player.

Sickle cell disease is a classic consequence of overdominance; the heterozygote is more fit than either the wild-type homozygote or the mutant homozygote, which suffers the recessive disease. Obviously polymorphism is maintained despite the decreased fitness of the mutant homozygote because the heterozygote is so much more fit than the wild type. The final proportion of the alleles segregating in the population will be conditional on the fitness drag of the mutant homozygote, because as per HWE it will be present in the population at a frequency of ~q^2.
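The balance point can be worked out with the textbook one-locus model. In the sketch below the heterozygote has fitness 1, the wild-type homozygote 1 - s, and the mutant homozygote 1 - t, in which case the mutant allele settles at q = s / (s + t). The selection coefficients used are hypothetical round numbers picked for illustration, not estimates for the actual sickle cell locus, and the function names are made up.

```python
def next_mutant_freq(q, s, t):
    """One generation of selection with fitnesses 1 - s (AA), 1 (Aa), 1 - t (aa)."""
    p = 1 - q
    mean_fitness = p * p * (1 - s) + 2 * p * q + q * q * (1 - t)
    # Mutant alleles passed on by heterozygotes (half their alleles) and mutant homozygotes.
    return (p * q + q * q * (1 - t)) / mean_fitness


def iterate_to_equilibrium(q_start, s, t, generations=10000):
    """Apply selection repeatedly, starting from mutant allele frequency q_start."""
    q = q_start
    for _ in range(generations):
        q = next_mutant_freq(q, s, t)
    return q


if __name__ == "__main__":
    s, t = 0.1, 0.8  # hypothetical: mild cost to wild type, severe cost to mutant homozygote
    print(iterate_to_equilibrium(0.01, s, t))  # value reached by iteration
    print(s / (s + t))                         # analytical equilibrium for comparison
```

Starting the allele rare or common makes no difference to where it ends up, which is what makes this a stable polymorphism rather than a transient.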

The problem is that this is clearly not going to scale across loci. That is, even if the fitness drag is more minimal than is the case with the sickle cell locus, one can imagine a cumulative situation. The segregation load is just going to be too high. Overdominance is probably a transient strategy which fades away as populations evolve more efficient ways to adapt that don’t have such a fitness load.
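Here is a back-of-the-envelope version of that scaling argument. It assumes the same one-locus overdominance model (fitnesses 1 - s, 1, 1 - t), where mean fitness at equilibrium is 1 - st / (s + t), and it further assumes that loci are independent and combine multiplicatively. That last assumption is the contestable one, and the selection coefficients and locus counts are hypothetical.

```python
def equilibrium_mean_fitness(s, t):
    """Mean fitness at the overdominant equilibrium, relative to the best genotype."""
    return 1 - (s * t) / (s + t)


def multilocus_mean_fitness(s, t, n_loci):
    """Mean fitness across n_loci independent overdominant loci, combined multiplicatively."""
    return equilibrium_mean_fitness(s, t) ** n_loci


if __name__ == "__main__":
    # Hypothetical: a weak 1% disadvantage to each homozygote at every locus.
    for n_loci in (1, 100, 1000, 10000):
        print(n_loci, multilocus_mean_fitness(0.01, 0.01, n_loci))
```

Even with selection this weak, a thousand such loci push mean fitness under these assumptions below 1% of what an individual heterozygous everywhere would have.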

So how does balancing selection still lead to variation without heterozygote advantage? W. D. Hamilton argued that much of it was due to negative frequency dependent selection. Co-evolution with pathogens is the best case of this. As strategies get common pathogens adapt, so rare strategies encoded by rare alleles gain in fitness. As these alleles increase in frequency their fitness decreases because the pathogens adapt to them. Their frequency declines, eventually the pathogens lose their adaptation to them, and their frequency increases again.
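A toy model shows why rarity alone can maintain variation. The sketch below is a deliberately stripped-down haploid model with two alleles in which each allele's fitness simply falls as its own frequency rises; it leaves the pathogen out entirely, so it settles to a stable mix instead of cycling the way a real host-pathogen system with time lags might. The fitness function and the value of s are assumptions made for illustration.

```python
def next_freq(p, s):
    """One generation of negative frequency dependent selection on allele A."""
    w_a = 1 - s * p          # A does worse the more common it is
    w_b = 1 - s * (1 - p)    # the alternative allele follows the same rule
    mean_fitness = p * w_a + (1 - p) * w_b
    return p * w_a / mean_fitness


def run(p_start, s, generations):
    """Track allele A for a number of generations, starting from frequency p_start."""
    p = p_start
    for _ in range(generations):
        p = next_freq(p, s)
    return p


if __name__ == "__main__":
    print(run(0.01, 0.1, 2000))  # a rare allele climbs toward the middle
    print(run(0.99, 0.1, 2000))  # a common allele falls toward the middle
```

No heterozygote advantage is involved anywhere here, since there are no heterozygotes; the polymorphism is held in place purely by the advantage of being rare.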

April 8, 2017

Why only one migrant per generation keeps divergence at bay

The best thing about population genetics is that because it’s a way of thinking and modeling the world it can be quite versatile. If Thinking Like An Economist is a way to analyze the world rationally, thinking like a population geneticist allows you to have the big picture on the past, present, and future, of life.

I have some personal knowledge of this as a transformative experience. My own background was in biochemistry before I became interested in population genetics as an outgrowth of my lifelong fascination with evolutionary biology. It’s not exactly useless knowing all the steps of the Krebs cycle, but it lacks in generality. In his autobiography I recall Isaac Asimov stating that one of the main benefits of his background as a biochemist was that he could rattle off the names on medicine bottles with fluency. Unless you are an active researcher in biochemistry your specialized research is quite abstruse. Population genetics tends to be more applicable to general phenomena.

In a post below I made a comment about how one migrant per generation or so is sufficient to prevent divergence between two populations. This is an old heuristic which goes back to Sewall Wright, and is encapsulated in the formalism F_{ST} \approx \frac{1}{4mN + 1}. Basically the divergence, as measured by Fst, is approximately the inverse of 4 times the proportion of migrants times the total population, plus 1. The mN is equivalent to the number of migrants per generation (proportion times the total population). As the mN becomes very large, the Fst converges to zero.
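Plugging a few values into Wright's approximation, Fst ≈ 1 / (4mN + 1), makes the heuristic easy to see. This is just arithmetic on that formula; the function name is made up for illustration, and the approximation itself assumes an idealized island model at equilibrium with a small migration rate.

```python
def equilibrium_fst(migrants_per_generation):
    """Wright's approximation: Fst = 1 / (4 * mN + 1), where mN is migrants per generation."""
    return 1 / (4 * migrants_per_generation + 1)


if __name__ == "__main__":
    for mn in (0, 0.1, 1, 10, 100):
        print(mn, equilibrium_fst(mn))
```

With no migration the expected Fst is 1, a single migrant per generation pulls it down to 0.2, and ten migrants bring it under 0.03, which is where the "one migrant versus ten" argument among conservation geneticists comes from.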

The intuition is pretty simple. Imagine you have two populations which separate at a specific time. For example, sea level rise, so now you have a mainland and island population. Since before sea level rise the two populations were one random mating population their initial allele frequencies are the same at t = 0. But once they are separated random drift should begin to subject them to divergence, so that more and more of their genes exhibit differences in allele frequencies (ergo, Fst, the between population proportion of genetic variation, increases from 0).

Now add to this the parameter of migration. Why is one migrant per generation sufficient to keep divergence low? The two extreme scenarios are like so:

  1. Large populations change allele frequency very slowly due to drift, so only a small proportion of migration is needed to prevent them from diverging
  2. Small populations change allele frequency very fast due to drift, so a larger proportion of migration is needed to prevent them from drifting

Within a large population one migrant is a small proportion, but drift is occurring very slowly. Within a small population drift is occurring fast, but one migrant is a relatively large proportion of a small population.
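The trade-off between those two scenarios can be checked numerically. The sketch below iterates the standard island-model recursion for Fst, in which drift adds 1/(2N) each generation and migration discounts the total by (1 - m)^2. It assumes an idealized island model with many equally sized populations and no mutation; the population sizes are arbitrary, and the migration rate is set to 1/N so that each population always receives exactly one migrant per generation.

```python
def island_model_fst(n_diploid, m, tolerance=1e-12):
    """Iterate F' = (1 - m)^2 * (1/(2N) + (1 - 1/(2N)) * F) until it stops changing."""
    drift = 1 / (2 * n_diploid)
    f = 0.0  # populations start out identical, as in the sea-level example
    while True:
        f_next = (1 - m) ** 2 * (drift + (1 - drift) * f)
        if abs(f_next - f) < tolerance:
            return f_next
        f = f_next


if __name__ == "__main__":
    # One migrant per generation (m = 1/N) across very different population sizes.
    for n in (50, 500, 5000):
        print(n, island_model_fst(n, 1 / n))
```

Small populations drift fast but one migrant is a big fraction of them, large populations drift slowly but one migrant is a tiny fraction, and the two effects cancel so that the equilibrium stays close to 0.2 in every case.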

Obviously this is a stylized fact with many details which need elaborating. Some conservation geneticists believe that the focus on one migrant is wrongheaded, and the number should be set closer to 10 migrants.

But it still gets at a major intuition: gene flow is extremely powerful and effective at reducing differences between groups. This is why most geneticists are skeptical of sympatric speciation. Though the focus above is on drift, the same intuition applies to selective divergence. Gene flow between populations works at cross-purposes with selection which drives two groups toward different equilibrium frequencies.

This is why it was surprising when results showed that Mesolithic hunter-gatherers and farmers in Europe were extremely genetically distinct in close proximity for on the order of 1,000 years. That being said, strong genetic differentiation persists between Pygmy peoples and their agriculturalist neighbors, despite a long history of living near each other (Pygmies do not have their own indigenous languages, but speak the tongue of their farmer neighbors). In the context of animals physical separation is often necessary for divergence, but for humans cultural differences can enforce surprisingly strong taboos. Culture is as strong a phenomenon as mountains or rivers….

December 18, 2012

Unveiling the genealogical lattice

To understand nature in all its complexity we have to cut the riotous variety down to size. For ease of comprehension we formalize with math, verbalize with analogies, and visualize with representations. These approximations of reality are not reality, but when we look through the glass darkly they give us filaments of essential insight. Dalton’s model of the atom is false in important details (e.g., fundamental particles turn out to be divisible into quarks), but it still has conceptual utility.

Likewise, the phylogenetic trees popularized by L. L. Cavalli-Sforza in The History and Geography of Human Genes are still useful in understanding the shape of the human demographic past. But it seems that the bifurcating model of the tree must now be strongly tinted by the shades of reticulation. In a stylized sense inter-specific phylogenies, which assume the approximate truth of the biological species concept (i.e., little gene flow across lineages), mislead us when we think of the phylogeny of species on the microevolutionary scale of population genetics. On an intra-specific scale gene flow is not just a nuisance parameter in the model, it is an essential phenomenon which must be accommodated into the framework.


This is on my mind because of the emergence of packages such as TreeMix and AdmixTools. Using software such as these on the numerous public data sets allows one to perceive the reality of admixture, and overlay lateral gene flow upon the tree as a natural expectation. But perhaps a deeper result is that the character of the tree itself is torn asunder. The figure above is from a new paper, Efficient moment-based inference of admixture parameters and sources of gene flow, which debuts MixMapper. The authors bring a lot of mathematical heft to their exposition, and I can’t say I follow all of it (though some of the details are very similar to Pickrell et al.’s). But in short it seems that, in comparison to TreeMix, MixMapper allows for more powerful inference on a narrower set of populations, selected for exploring very specific questions. In contrast, TreeMix explores the whole landscape with minimal supervision. Having used the latter, I can testify that this is true.

The big result from MixMapper is that it extends the result of Patterson et al., and confirms that modern Europeans seem to be an admixture between a “north Eurasian” population, and a vague “west Eurasian” population. Importantly, they find evidence of admixture in Sardinians, which implies that Patterson et al.’s original methods were not sensitive to admixture in putative reference populations (note that Patterson is a coauthor on this paper as well). The rub, as noted in the paper, is that it is difficult to estimate admixture when you don’t have “pure” ancestral reference populations. And yet here the takeaway for me is that we may need to rethink our whole conception of pure ancestral populations, and imagine a human phylogenetic tree as a series of lattices in eternal flux, with admixed nodes periodically expanding so as to generate the artifice of a diversifying tree. The closer we look, the more likely it seems that most of the populations which have undergone demographic expansion in the past 10,000 years are also the products of admixture. Any story of the past 10,000 years, and likely the past 100,000 years, must give space at the center of the narrative arc to lateral gene flow across populations.

Cite: arXiv:1212.2555 [q-bio.PE]

December 10, 2012

Is Daniel MacArthur ‘desi’?

My initial inclination in this post was to discuss a recent ordering snafu which resulted in many of my friends being quite peeved at 23andMe. But browsing through their new ‘ancestry composition’ feature I thought I had to discuss it first, because of some nerd-level intrigue. Though I agree with many of Dienekes’ concerns about this new feature, I have to admit that at least this method doesn’t give out positively misleading results. For example, I had complained earlier that ‘ancestry painting’ gave literally crazy results when they weren’t trivial. It said I was ~60 percent European, which makes some coherent sense in their non-optimal reference population set, but then stated that my daughter was >90 percent European. Since 23andMe did confirm she was 50% identical by descent with me these results didn’t make sense; some readers suggested that there was a strong bias in their algorithms to assign ambiguous genomic segments to ‘European’ heritage (this was a problem for East Africans too).

Here’s my daughter’s new chromosome painting:

One aspect of 23andMe’s new ancestry composition feature is that it is very Eurocentric. But, most of the customers are white, and presumably the reference populations they used (which are from customers) are also white. Though there are plenty of public domain non-white data sets they could have used, I assume they’d prefer to eat their own data dog-food in this case. But that’s really a minor gripe in the grand scheme of things. This is a huge upgrade from what came before. Now, it’s not telling me, as a South Asian, very much. But, it’s not telling me ludicrous things anymore either!

But in regards to omission I am curious to know why this new feature rates my family as only ~3% East Asian, when other analyses put us in the 10-15% range. The problem with very high values is that South Asians often have some residual ‘eastern’ signal, which I suspect is not real admixture, but is an artifact. Nevertheless, northeast Indians, including Bengalis, often have genuine East Asian admixture. On PCA plots my family is shifted considerably toward East Asians. The signal they are picking up probably isn’t noise. Almost every apportionment of East Asian ancestry I’ve seen for my family yields a greater value for my mother, and that holds here. It’s just that the values are implausibly low.

In any case, that’s not the strangest thing I saw. I was clicking around people who I had “shared” genomes with, and I stumbled upon this:

As you can guess from the screenshot this is Daniel MacArthur’s profile. And according to this ~25% of chromosome 10 is South Asian! On first blush this seemed totally nonsensical to me, so I clicked around other profiles of people of similar Northern European background…and I didn’t see anything equivalent.

What to do? It’s going to take more evidence than this to shake my prior assumptions, so I downloaded Dr. MacArthur’s genotype. Then I merged it with three HapMap populations, the Utah whites (CEU), the Gujaratis (GIH), and the Chinese from Denver (CHD). The last was basically a control. I pulled out chromosome 10. I also added Dan’s wife Ilana to the data set, since I believe she got typed with the same Illumina chip, and is of similar ethnic background (i.e., very white). It is important to note that only 28,000 SNPs remained in the data set. But usually 10,000 is more than sufficient on SNP data for model-based clustering with inter-continental scale variation.

I did two things:

1) I ran ADMIXTURE at K = 3, unsupervised

2) I ran an MDS, which visualized the genetic variation in multiple dimensions

Before I go on, I will state what I found: these methods supported the inference from 23andMe. On chromosome 10 Dr. MacArthur seems to have an affinity with South Asians (i.e., this is his ‘curry chromosome’). Here are the average (median) values in tabular format, with MacArthur and his wife presented for comparison.

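ADMIXTURE results for chromosome 10

| Population / individual | K 1 | K 2 | K 3 |
|---|---|---|---|
| CEU | 0.04 | 0.02 | 0.93 |
| GIH | 0.87 | 0.05 | 0.08 |
| CHD | 0.01 | 0.97 | 0.01 |
| Daniel MacArthur | 0.29 | 0.07 | 0.64 |
| Ilana Fisher | 0.01 | 0.06 | 0.94 |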
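Because the run was unsupervised, the K columns carry no names; one reads them by seeing which reference population dominates each. A minimal sketch of that reading, using only the median values from the table above. The labels “European,” “South Asian,” and “East Asian” are my shorthand for CEU, GIH, and CHD, and the helper names are hypothetical.

```python
# Median ADMIXTURE fractions for chromosome 10, copied from the table above.
table = {
    "CEU": (0.04, 0.02, 0.93),
    "GIH": (0.87, 0.05, 0.08),
    "CHD": (0.01, 0.97, 0.01),
    "Daniel MacArthur": (0.29, 0.07, 0.64),
    "Ilana Fisher": (0.01, 0.06, 0.94),
}

# Shorthand labels for the three reference populations (an assumption of this sketch).
references = {"CEU": "European", "GIH": "South Asian", "CHD": "East Asian"}


def label_clusters(table, references):
    """Name each K column after the reference population that loads highest on it."""
    labels = {}
    for population, name in references.items():
        fractions = table[population]
        dominant = max(range(len(fractions)), key=fractions.__getitem__)
        labels[dominant] = name
    return labels


def describe(sample, table, labels):
    """Re-express one row of the table with named components."""
    return {labels[k]: fraction for k, fraction in enumerate(table[sample])}


labels = label_clusters(table, references)
for sample in ("Daniel MacArthur", "Ilana Fisher"):
    print(sample, describe(sample, table, labels))
```

Read this way, K 1 is the Gujarati-dominated component, which is where MacArthur's 0.29 sits against his wife's 0.01.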

You probably want a distribution. Out of the non-founder CEU sample none went above 20% South Asian. Though it did surprise me that a few were that high, making it more plausible to me that MacArthur’s results on chromosome 10 were a fluke:

And here’s the MDS with the two largest dimensions:

Again, it’s evident that this chromosome 10 is shifted toward South Asians. If I had more time right now what I’d do is probably get that specific chromosomal segment, phase it, and then compare it to various South Asian populations. But I don’t have time now, so I went and checked out the results from the Interpretome. I cranked up the settings to reduce the noise, and so that it would only spit out the most robust and significant results. As you can see, again chromosome 10 comes up as the one which isn’t quite like the others.

Is there a plausible explanation for this? Perhaps Dr. MacArthur can call up a helpful relative? From what I recall his parents are immigrants from the United Kingdom, and it isn’t unheard of that white Britons do have South Asian ancestry which dates back to the 19th century. Though to be totally honest I’m rather agnostic about all this right now. This genotype has been “out” for years now, so how is it that no one has noticed this peculiarity? Perhaps the issue is that everyone was looking at the genome-wide average, and it just doesn’t rise to the level of notice? What I really want to do is look at the distribution of all chromosomes and see how Daniel MacArthur’s chromosome 10 then stacks up. It might be a random act of nature yet.

Also, I guess I should add that at ~1.5% South Asian that would be consistent with one of MacArthur’s great-great-great-great grandparents being Indian. Assuming 25 year generation times that puts them in the mid-19th century. Of course, at such a low proportion the variance is going to be high, so it is quite possible that you need to push the real date of admixture one generation back, or one generation forward.
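The arithmetic here is just repeated halving: each generation back, the expected contribution of a single ancestor drops by half. A small sketch of that calculation, assuming the 25-year generation time used above. The function names are mine, and these are expectations only; as noted, the realized share varies a good deal at this depth.

```python
def expected_fraction(generations_back: int) -> float:
    """Expected genome share from one ancestor g generations back (parents = 1)."""
    return 0.5 ** generations_back


def years_before_descendant(generations_back: int, generation_time: int = 25) -> int:
    """Rough number of years between the ancestor's birth and the descendant's."""
    return generations_back * generation_time


# A great-great-great-great grandparent is 6 generations back;
# 5 and 7 bracket it by one generation either way.
for g in (5, 6, 7):
    print(f"{g} generations: {expected_fraction(g):.2%} expected, "
          f"about {years_before_descendant(g)} years earlier")
```

Six generations back gives 1/64, or about 1.6%, which is where the ~1.5% figure lands; one generation either way doubles or halves it.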

December 1, 2012

Northern Europeans and Native Americans are not more closely related than previously thought

A new press release is circulating on the paper which I blogged a few months ago, Ancient Admixture in Human History. Unlike the paper, the title of the press release is misleading, and unfortunately I notice that people are circulating it, and probably misunderstanding what is going on. Here’s the title and first paragraph:

Native Americans and Northern Europeans More Closely Related Than Previously Thought

Released: 11/30/2012 2:00 PM EST
Source: Genetics Society of America

Newswise — BETHESDA, MD – November 30, 2012 — Using genetic analyses, scientists have discovered that Northern European populations—including British, Scandinavians, French, and some Eastern Europeans—descend from a mixture of two very different ancestral populations, and one of these populations is related to Native Americans. This discovery helps fill gaps in scientific understanding of both Native American and Northern European ancestry, while providing an explanation for some genetic similarities among what would otherwise seem to be very divergent groups. This research was published in the November 2012 issue of the Genetics Society of America’s journal GENETICS

 

The reality is that Native Americans and Northern Europeans are not more “closely related” genetically than they were before this paper. There has been no great change to standard genetic distance measures or phylogeographic understanding of human genetic variation. A measure of relatedness is to a great extent a summary of historical and genealogical processes, and as such it collapses a great deal of disparate elements together into one description. What the paper in Genetics outlined was the excavation of specific historically contingent processes which result in the summaries of relatedness which we are presented with, whether they be principal component analysis, Fst, or model-based clustering.

What I’m getting at can be easily illustrated by a concrete example. To the left is a 23andMe chromosome 1 “ancestry painting” of two individuals. On the left is me, and the right is a friend. The orange represents “Asian ancestry,” and the blue represents “European” ancestry. We are both ~50% of both ancestral components. This is a correct summary of our ancestry, as far as it goes. But you need some more information. My friend has a Chinese father and a European mother. In contrast, I am South Asian, and the end product of an ancient admixture event. You can’t tell that from a simple recitation of ancestral quanta. But it is clear when you look at the distribution of ancestry on the chromosomes. My components have been mixed and matched by recombination, because there have been many generations between the original admixture and myself. In contrast, my friend has not had any recombination events between his ancestral components, because he is the first generation of that combination.
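The contrast can be shown with a toy example. The two “paintings” below are invented for illustration, not real 23andMe output: each individual is two chromosome copies split into 20 windows, labeled A (“Asian”) or E (“European”). The summary proportions are identical, but the number of ancestry switches along the chromosome, which recombination accumulates over generations, is not.

```python
from itertools import groupby


def ancestry_proportions(haplotypes):
    """Share of each ancestry label across all windows of both chromosome copies."""
    windows = "".join(haplotypes)
    return {label: windows.count(label) / len(windows) for label in sorted(set(windows))}


def ancestry_switches(haplotypes):
    """Number of places where the ancestry label changes along each copy."""
    return sum(len(list(groupby(hap))) - 1 for hap in haplotypes)


# Hypothetical first-generation individual: one copy from each parent, unbroken.
recent_mix = ("A" * 20, "E" * 20)

# Hypothetical product of old admixture: both copies are mosaics.
ancient_mix = ("AAEEAAEEAAEEAAEEAAEE", "EEEAAAEEAAAEEEAAEEAA")

for name, individual in (("recent", recent_mix), ("ancient", ancient_mix)):
    print(name, ancestry_proportions(individual), ancestry_switches(individual))
```

Both toy individuals come out 50/50, yet the first has no switches at all and the second has many, which is the information a bare percentage throws away.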

So what the paper publicized in the press release does is present methods to reconstruct exactly how patterns of relatedness came to be, rather than reiterating well understood patterns of relatedness. With the rise of whole-genome sequencing and more powerful computational resources to reconstruct genealogies we’ll be seeing much more of this to come in the future, so it is important that people are not misled as to the details of the implications.

October 10, 2012

A plea for population genetics

Filed under: Population genetics — Razib Khan @ 9:31 pm

The title here is somewhat misleading. This is not just a plea for population genetics, but for quantitative genetics as well. Genetics is a big field. But today it is defined by and large by DNA, the concrete entity in which the abstraction of the gene is embedded. Look at the header of this website, or the background to my Twitter account. Mind you, I’m pathetically informed about molecular genetics, and don’t have a strong interest in the topic! I did consider using the H.W.E. or the breeder’s equation for the header, but in the end I judged it too abstruse and unfamiliar to most readers. DNA dominates when it comes to the modern mental conception of genetics, and we have to live with it to some extent.
But there is also great value in the genetics which has intellectual roots in the pre-DNA Mendelians and biometricians. This genetics exhibits a symbiotic, but not necessary, association with genetics as a branch of biophysics. Yet I come here not to insult or impugn my friends who toil in the trenches of the molecular wars. Rather, I simply want ...

September 27, 2012

Paleopopulation Genetics

Filed under: Genetics,Population genetics — Razib Khan @ 9:57 pm

It seems a new field is being born! Jeff Wall & Monty Slatkin have a pretty thorough review out, Paleopopulation Genetics:

Paleopopulation genetics is a new field that focuses on the population genetics of extinct groups and ancestral populations (i.e., populations ancestral to extant groups). With recent advances in DNA sequencing technologies, we now have unprecedented ability to directly assay genetic variation from fossils. This allows us to address issues, such as past population structure, changes in population size, and evolutionary relationships between taxa, at a much greater resolution than can traditional population genetics studies. In this review, we discuss recent developments in this emerging field as well as prospects for the future.

Nothing very new for close readers of this weblog, but the references are useful for later mining.

August 28, 2012

Evolutionary & population genetics preprints – Haldane’s Sieve

OK, perhaps I can help with that. Dr. Coop speaks of the collaboration between himself & Dr. Joseph Pickrell, Haldane’s Sieve, which I added to my RSS days ago (and you can see me pushing it to my Pinboard). From the “About”:

As described above, most posts to Haldane’s Sieve will be basic descriptions of relevant preprints, with little to no commentary. All posts will have comment sections where discussion of the papers will be welcome. A second type of post will be detailed comments on a preprint of particular interest to a contributor. These posts could take the style of a journal review, or may simply be some brief comments. We hope they will provide useful feedback to the authors of the preprint. Finally, there will be posts by authors of preprints in which they describe their work and place it in broader context.

We ask the commenters to remember that by submitting articles to preprint servers the authors (often biologists) are taking a somewhat unusual step. Therefore, comments should be phrased in a constructive manner to aid the authors.

It might be helpful if other evolution/genetics bloggers ...

July 22, 2012

What is inbreeding?

Filed under: Inbreeding,Population genetics — Razib Khan @ 9:23 pm

I’ve put up a bunch of posts relating to inbreeding recently (1, 2, 3, 4). But I haven’t really defined it. First, let’s stipulate what inbreeding is not: it is not the same as incest. Acts of incest can include individuals who have no blood relationship to each other (e.g., Hamlet). Additionally, there are instances of inbreeding which are not necessarily incestuous. If a population is highly inbred, then individuals who are not relations by social custom may still be so genetically similar to a point where the pairing can not credibly be stated as an outcross. But still, what do I mean? To refresh myself I re-read the section on inbreeding in Hartl & Clark. And I think that helped clarify one implicit assumption which I have which may not be clear to everyone, and I’ll get to that.

In any case, first, what’s the deal with inbreeding? The short answer is that inbreeding is a measure of the probability of identity by descent of two alleles at a given locus in a given individual. This concise definition itself is the problem. These are all abstract concepts, close to being ...
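To make the definition a little less abstract: for a simple pedigree the probability of identity by descent can be computed with Wright's path-counting formula, F = Σ (1/2)^(n1 + n2 + 1) (1 + F_A), summed over each ancestor shared by the two parents. This formula is the standard textbook one and is not quoted from the post; the function name and the pedigree encodings below are my own.

```python
def inbreeding_coefficient(paths):
    """Wright's path-counting formula for an individual's inbreeding coefficient.

    Each path is (n_sire, n_dam, f_ancestor): the number of generations from
    each parent up to a shared ancestor, and that ancestor's own inbreeding
    coefficient.
    """
    return sum(0.5 ** (n1 + n2 + 1) * (1.0 + f_a) for n1, n2, f_a in paths)


# Parents who are first cousins share two grandparents, each two generations up.
first_cousins = [(2, 2, 0.0), (2, 2, 0.0)]
# Full siblings share both of their own parents, one generation up.
full_sibs = [(1, 1, 0.0), (1, 1, 0.0)]
# Half siblings share a single parent.
half_sibs = [(1, 1, 0.0)]

for name, paths in (("first cousins", first_cousins),
                    ("full sibs", full_sibs),
                    ("half sibs", half_sibs)):
    print(f"offspring of {name}: F = {inbreeding_coefficient(paths)}")
```

The offspring of first cousins comes out at 1/16: at any given locus, that is the chance its two alleles are copies of the same allele in a shared great-grandparent.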

July 1, 2012

On theoretical evolutionary genetics

Filed under: Population genetics,theoretical evolutionary genetics — Razib Khan @ 10:13 pm

Joe Felsenstein in the comments:

The books you have listed are good ones, by fine people. But may I immodestly suggest a book of mine? If you want to work your way through the theory of theoretical population genetics, I have set of notes for my Genome 562 course, a textbook. It is a freely-downloadable PDF (start with my website by clicking on my name in this comment). It’s not for everyone but I think those interested in knowing how the theory actually works in more detail will benefit from it. As it’s free, I have no monetary interest in calling your attention to it, just pure ego. (And if you want a one-locus population genetics simulation program, try PopG at my lab’s website too — Google “Felsenstein PopG”).

Many of the books I recommended below are rather expensive. Theoretical Evolutionary Genetics (PDF) is not. Unfortunately much of the discourse of contemporary science is beyond the financial means of much of the world’s population, whether it be in university press textbooks or gated journals. So I’m quite happy in putting up a link to this text-in-progress.

Learning population and evolutionary genetics

Filed under: Population genetics — Razib Khan @ 11:44 am

A reader emailed me to ask what I thought would be a good way to better understand some of the more technical posts I put up.

First, two course notes which I’ve found useful as personal references:

- EEB 5348 — Population Genetics

- Evolutionary Quantitative Genetics, Uppsala University (if you are ambitious, bookmark this too)

Some people might argue that John Gillespie’s Population Genetics: A Concise Guide (Kindle edition) is a touch too abstruse and cryptic for the introductory reader. It’s short, and the mathematics isn’t challenging, but because of its concision the author can sometimes unleash upon you nearly cryptic formalism, perhaps defeating the purpose of a soft introduction in the first place. To get the most out of this book you probably, ironically, have to have a more thorough textbook on hand to clear up those particular points which you find confusing. But to get the general logic of population genetics and establish familiarity this seems to be the right entry point (assuming you’re not too terrified by algebra).

Of course most readers of this weblog are focused ...

March 7, 2012

Where the wild clines aren’t

Filed under: Anthroplogy,Human Genetics,Human Genomics,Population genetics,race — Razib Khan @ 7:38 pm

In the recent ‘do human races exist’ controversy Nick Matzke’s post Continuous geographic structure is real, “discrete races” aren’t has become something of a touchstone (perhaps a post like Cosma Shalizi’s on I.Q. and heritability).* In the post Matzke emphasized the idea of clines, roughly a continuous gradient of genetic change over space. Fair enough. But in the map above I traced two linear transects. I would suggest that anyone who has a general understanding of the demographics of South-Central Eurasia would immediately anticipate that these transects would reveal a relatively sharp break in allele frequencies. True, there are intermediate populations between the two end points, in Nepal, and on the fringes of India’s northeastern states. But clearly about halfway through the southwest-northeast transect you’ll see a rapid shift in allele frequencies. The blue transect is different, insofar as the change occurs very near its eastern pole. In Bengal, 85% of the length of the transect from its western terminus, the populations will still be far closer genetically to those on the western pole than those just to the east!

 

I thought of this when I saw that Zack had posted a Tibetan data set from Qinghai. As the crow flies Qinghai is closer to the plains of North India than peninsular South India, but Zack found Tibetans from this region to be only ~1 percent South Asian. That’s likely to be close to noise. I assume this does not surprise anyone. Despite the fact that North India is very populous in relation to Tibet, it turns out that geographical barriers are very strong in discouraging gene flow (note that Tibet and North India are actually culturally related; Tibetan Buddhism has its origins in the Tantric Buddhism of Bengal). This is one of my major “beefs” with the idea that “race does not exist” because of clines. I think this is a robust point when it comes to there being no Middle Eastern race vs. Scandinavian race. The clines are real and gradual between these two population sets. But I do think there has been strong differentiation between populations from the antipodes of Eurasia. I suspect that the emergence of more flexible lifestyles (e.g., oasis agriculture, horse nomadism) has in fact resulted in far greater connections between the isolated zones of Western and Eastern Eurasia over the past 10,000 years than before. In fact, one can conceptualize it as a twofold process. On the one hand you had very powerful expansions from small initial founder groups across macro-regions such as Western Eurasia and the Far East. This resulted in a decrease of genetic difference within these zones through the power of homogenization, though increased Fst in the few zones of direct contact across the zones. But, the “empty zones” of Central Eurasia may also have filled up with “proto-Silk Road” centers over the past ~10,000 years, resulting in more frequent long term connections between the macro-regions than had heretofore been possible.

* I guess I should divulge that I have socialized with Nick Matzke and that we share common friends.

February 12, 2012

The social and biological construction of race

Filed under: Anthroplogy,Hispanics,Latinos,Population genetics,race — Razib Khan @ 2:45 pm

Many of our categories are human constructions which map upon patterns in nature which we perceive rather darkly. The joints about which nature turns are as they are, our own names and representations are a different thing altogether. This does not mean that our categories have no utility, but we should be careful of confusing empirical distributions, our own models of those distributions, and reality as it is stripped of human interpretative artifice.

I have argued extensively on this weblog that:

1) Generating a phylogeny of human populations and individuals within those populations is trivial. You don’t need many markers, depending on the grain of your phylogeny (e.g., to differentiate West Africans vs. Northern Europeans you actually can use one marker!).

2) These phylogenies reflect evolutionary history, and the trait differences are not just superficial (i.e., “skin deep”).

The former proposition I believe is well established. A group such as “black American” has a clear distribution of ancestries in a population genetic sense. The latter proposition is more controversial and subject to contention. My own assumption is that we will know the truth of the matter within the generation.


A black American

But that is the biological construction of race. Subject to fudge and fuzziness, but mapping upon a genuine reality. What about the social construction? Due to its flexibility this is a much more difficult issue to characterize in a succinct manner. Consider the cultural conditionals which render G. K. Butterfield “black” and Luis Guzman “Hispanic.” Both individuals are products of an admixture between people of mixed African and European ancestry (and likely some Amerindian in Guzman’s case). It turns out that the genes have segregated out such that Butterfield reflects more his European ancestry in traits. Guzman’s phenotype is more mixed. The perception of these two individuals is weighted by two different strains in modern American racial ideology. First, that of hypodescent, where one drop of black blood means that an individual is black, without equivocation. Halle Berry appealed to this framework to argue why her daughter, who is less than 1/4 African in ancestry (Berry’s African American father almost certainly had some European ancestry), was black. No matter that hypodescent’s origins were to buttress white racial supremacy and purity. Today black Americans espouse it for purposes of community solidarity (the black American community as we know it is partly a product of hypodescent, which forced mixed-race blacks into the African American community).


Not a black American

The second issue, which has crystallized in our time, but has roots back decades, is the peculiar position of “Hispanics/Latinos” in the American racial system. As A. D. Powell has observed Hispanics seem to be able to evade the one drop rule, unless their African features are extremely dominant (e.g., pre-skin whitening Sammy Sosa). I’ve looked at the genotypes of enough Latin Americans to assume that some level of African ancestry (e.g., ~5%) is present in the vast majority of those who are not the children of recent European immigrants or from indigenous communities. For example, Mexico’s large slave population seems to have been totally absorbed, to the point where their past existence has been nearly forgotten. Mexicans of mestizo or white identity routinely have African ancestry, they just don’t know it, nor is it part of their racial identity. And it isn’t just Latinos. People of Middle Eastern ancestry, in particular Arabs, often have some African ancestry. But they are not classified as black (unlike Hispanics/Latinos they don’t have their own ethnic category, but are put into the “white” box, irrespective of their race, from Afro-Arab to Syrian).

This broader coexistence of frameworks persists on the implicit level. We don’t usually explicitly flesh out these details. Rather, we take these social constructions as givens. The major problem is when the problems and artificialities of these social constructions begin to bleed over into attempts to understand patterns of biological variation. Because of America’s fixation on the black-white dichotomy rooted in skin color people routinely offer up the fact that the human phylogeny is not well correlated with pigmentation as a refutation of the concept of race. What biology is doing is refuting a peculiar social construction of race. It is not negating the reality of human population substructure. Sociology and cultural anthropology are empires of imagination to a much greater extent than human biology.

I’m thinking of this because with the birth of my daughter I confronted the bleeding over of the social into the biological. For medical purposes her race had to be assessed. One side of her ancestry was not problematic; white European. But I had to argue for why her other half should not be listed as “Asian.” For sociological purposes I have no great issue with the term Asian American which is inclusive of South and East Asians (I am not denying that this is a recent political identity, I am saying that I do not personally find it objectionable and routinely enter my race as “Asian American” into public forms). But for biological purposes this is an incoherent and misleading classification. I know when my sister was born my parents put her race as “Asian,” which even at the time I felt was totally without purpose as far as biological taxonomy went. At the end of it all my daughter had “South Asian” entered in by hand. Better that her information be discarded than aggregated into a data set in a misleading fashion.

Obviously disentangling the social and biological is not necessarily impossible. Rather, it takes a little care and explicitness, as it is easy to move between the two domains so fluidly as to elide their differences. And to some extent they do inform each other. Personal genomics is adding a new twist, but the general problem is as old as human systematics. The only cure is care.

Image credit: Wikipedia
