Razib Khan One-stop-shopping for all of my content

April 30, 2018

Is American genetic diversity enough?

Filed under: Historical Population Genetics,Human Genetic Variation — Razib Khan @ 8:51 pm

In the nearly 20 years since the draft of the human genome was complete,* we’ve moved on to bigger and better things. In particular, researchers are looking to diversify their panels of human genetic diversity, because differences between groups matter. You can’t just substitute them for each other genetically.

There have been efforts to diversify the population panels recently, but that prompts the question of whether American population coverage is sufficient. My first thought is that the genetic diversity in the USA is probably getting us 90% of the way there. Consider Spencer’s comment about Queens: it’s the most ethnically diverse large conurbation in the country.

There are some gaps though. In Who We Are David Reich points out the distinctiveness of Indian population genetics. The subcontinent has lots of populations with large census sizes in which deleterious alleles have drifted upward in frequency due to long-term endogamy. And, many of these populations don’t have a strong representation in the Diaspora.

In contrast, much of the rest of the world is panmictic enough that an American panel can pick up most of the variation. American Chinese are skewed toward Guangdong and Fujian, but a substantial number of people from other parts of China have arrived in the last generation. Regional structure is not so strong that you’ll miss out on too much, aside from very rare variants which exist at the scale of extended pedigrees rather than populations.

There are small populations such as Hadza, Khoikhoi, and Pygmies in Africa which are probably going to be missed by American population panels, but the total census size of these groups is pretty low (for comparison, there are 1 million Pulayar Dalits in the state of Kerala alone). Much of the rest of Africa is West African variation well represented in African Americans, and Bantu and Nilotic variation is probably captured by immigrant communities.

I’d propose supplementing American genetic diversity with sampling Cape Coloureds in South Africa.

* No discussions about how the genome isn’t totally complete. I know that.

March 29, 2018

Ancestry does not always match up with appearance

Filed under: Human Genetic Variation,megan bowen — Razib Khan @ 9:04 pm

A few years ago I watched a bunch of Megan Bowen’s YouTubes about living in Korea as an expat. In one episode she had explained that the reason she had a black American accent (she’s from Georgia I think) is that she is a black American. Just a very light-skinned one.

In other videos, you can see that her skin is a little darker without typical Korean makeup, though she is still very light-skinned. And her natural hair is quite curly. But it would not be implausible to assume that she is one of the 10% or so of African Americans who are more than 50% white.

I didn’t think much about this until today. As part of my job, I watch ancestry-related YouTube videos to get a sense of how people interpret their results, and Megan Bowen showed up!

So I watched her video. There are some photos of her parents, and both look darker in complexion and more typically African American in their appearance. She also admitted that she was so light at birth that her father took a paternity test, and she was his.

The results for her ancestry came back…and she’s 65% Sub-Saharan African! This is curious because arguably Megan Bowen looks more “white” than the actress Megalyn Echikunwoke, who is 50% European (American) and 50% Nigerian (or half-Shona half-English Thandie Newton, the list could go on).

We have the genome-wide data. Megan is 65% Sub-Saharan African. And ~32% European.

Ultimately this is a pretty clear issue of the fact that only a subset of genes are responsible for the features which we deem ancestrally informative in a naive manner. Skin color, hair form, and facial features.

To the right is a plot from a paper which looked for variants affecting skin color in a Cape Verde sample. They used ~900,000 SNPs to assess ancestry, so you know that that’s right. They also used a melanin index generated with a spectrophotometer. You see that 44% of the variation in skin color can be predicted by ancestry in this admixed population.

There’s a clear correlation between ancestry and complexion, but because the number of loci affecting the variation of complexion in humans is relatively small for a polygenic trait, the relationship can get decoupled rather easily (a few large effect genetic loci explain a lot of the rest of the variation).

If you looked at pigmentation loci in Megan Bowen and did local ancestry analysis, you’d see a strong enrichment for European segments. Far greater than the genome-wide 32%. It happens. It’s probability, not magic.
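To put a rough number on "probability, not magic," here is a minimal sketch. It assumes, purely for illustration, that 10 large-effect pigmentation loci each draw their two allele copies independently from a genome that is 32% European. The number of loci and the independence assumption are hypothetical; real admixture segments are correlated along chromosomes.

```python
from math import comb

def prob_at_least(k_min, n_alleles, ancestry_fraction):
    """Binomial tail: P(at least k_min of n_alleles trace to one ancestry)."""
    return sum(
        comb(n_alleles, k)
        * ancestry_fraction ** k
        * (1 - ancestry_fraction) ** (n_alleles - k)
        for k in range(k_min, n_alleles + 1)
    )

# Hypothetical setup: 10 large-effect pigmentation loci, two copies each.
N_ALLELES = 20
GENOME_WIDE_EUROPEAN = 0.32  # genome-wide figure quoted in the post

# Chance that at least half of those 20 allele copies are European.
p_half_or_more = prob_at_least(10, N_ALLELES, GENOME_WIDE_EUROPEAN)
print(f"P(>= 50% European at the pigmentation loci) = {p_half_or_more:.3f}")
```

Under these toy assumptions the outcome is uncommon but nowhere near impossible, which is all the argument needs.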

January 29, 2011

“Asian” in all the right places

mtDNA haplogroup G1a2

The pith: In this post I examine the most recent results from 23andMe for my family in the context of familial and regional (Bengal) history. I also use these results to offer up a framework for the ethnogenesis of the eastern Bengali people within the last 1,000 years, and their relationship to other South Asian and Southeast Asian populations.

Since I received my 23andMe results last May I’ve been blogging about it a fair amount. In a recent post I inferred that perhaps I had a recent ancestor who was an ethnic Burman or some related group. My reasoning was that this explained a pattern of elevated matches on chromosomal segments with populations from southwest China in the HGDP data set. But now we have more than my genome to go on. This week I got the first V3 chip results from a sibling. And finally, yesterday the results from my parents came in. One thing that I immediately found interesting was my father’s mtDNA haplogroup assignment, G1a2. This came from his maternal grandmother, and as you can see it has a distribution which ...

March 28, 2010

More on recombination & natural selection

A follow up to the post below, see John Hawks, Selection’s genome-wide effect on population differentiation and p-ter’s Natural selection and recombination. As I said, it’s a dense paper, and I didn’t touch on many issues.

March 26, 2010

Natural selection & recombination in the human genome

Filed under: Genetics,Genomics,Human Genetic Variation,Population genetics — Razib Khan @ 8:28 am

If you are like me, and if you are reading this weblog there is a significant probability you are like me, you read L. L. Cavalli-Sforza’s History and Geography of Human Genes in the 1990s, and in the early aughts Spencer Wells’ The Journey of Man. Science has come very far in the last 10-15 years; even Cavalli-Sforza’s magnum opus pales in comparison to the literal tsunami of data and analysis which the “post-genomic era” has ushered in. Instead of a gene here and there, or even the mtDNA and Y chromosome, researchers are now looking at hundreds of thousands of genetic variants, SNPs, across genomes. We’re rapidly approaching the era of whole genome sequencing, even if we’re not quite there yet.

But what’s the purpose of advances in technique and computation? Though the long-term project is to understand human variation and genetic function so as to have biomedical utility, in the short-term there is an enormous wealth of more abstract population genetic insight which can be extracted. Because of the biomedical focus of contemporary genomics we take a somewhat anthropocentric view, which is fine by me as I am an unregenerate speciesist. The fish, fowl and crawling things of the earth can come later. And in any case, the beauty of the human focus of modern evolutionary genomics is that there are whole disciplines such as paleoanthropology which can serve as partners in interdisciplinary projects.

Humans are like any other organism, buffeted by conventional evolutionary genetic dynamics, drift, migration, natural selection, as well as processes which are more biophysically rooted such as recombination and mutation. Each of these processes leaves its tell-tale marks on the genome. Mutation replenishes variation which drift and selection often eliminate, the former by chance and the latter in the form of negative selection. Migration serves to homogenize across populations through gene flow, while diversifying within populations by introducing novel variants. Finally, recombination breaks up linear associations of genetic variants along a DNA sequence, and has been used to explain sex.

In regards to H. sapiens it seems that our recent evolutionary history is dominated by a few big events. Within the last 100,000 years we underwent an extremely rapid population expansion from a small founding group within Africa, and radiated adaptively across all continents except for Antarctica. We are then a relatively genetically homogeneous population, with much of the extant variation remaining within Africa, and the non-African groups getting progressively less diverse with distance from that continent. Basically a model whereby our species spread across the world via serial founder events. This simple model suffices in the broad sketch, but there is much more to the story. Over the past few years the older idea that current continental populations are the descendants of the first settlers, that is, the first modern humans who displaced the archaic populations which preceded them, has come to seem unlikely to be totally correct in all cases. It is likely wrong in Europe and to some extent India, no trivial exceptions.

There is much which can be said about details of demographic history in regards to the possibility of mass migrations, but today I want to focus on another dynamic: the effect of natural selection on the human genome. There are some researchers who are very skeptical of the efficacy of selection in shaping the patterns of variation we see, constraining it to a few loci such as that which confers lactase persistence or resistance to malaria. Others feel that selection’s power in shaping the genome is far more pervasive. Finally, there is a middle path, which emphasizes a diverse and complex portfolio.

A new paper explores the extent and nature of selection in human genomes through combining a rather old population genetic statistic with new expanded data sets and powerful statistical techniques, Human Population Differentiation Is Strongly Correlated with Local Recombination Rate:

Allele frequency differences across populations can provide valuable information both for studying population structure and for identifying loci that have been targets of natural selection. Here, we examine the relationship between recombination rate and population differentiation in humans by analyzing two uniformly-ascertained, whole-genome data sets. We find that population differentiation as assessed by inter-continental FST shows negative correlation with recombination rate, with FST reduced by 10% in the tenth of the genome with the highest recombination rate compared with the tenth of the genome with the lowest recombination rate (P ≪ 10^−12). This pattern cannot be explained by the mutagenic properties of recombination and instead must reflect the impact of selection in the last 100,000 years since human continental populations split. The correlation between recombination rate and FST has a qualitatively different relationship for FST between African and non-African populations and for FST between European and East Asian populations, suggesting varying levels or types of selection in different epochs of human history.

You know of FST, even if you don’t know what FST is. You have heard that 15% of the variation in human genes is between races, and 85% within races. That 15% is an FST of 0.15. In other words, FST is a population genetic statistic which partitions the variance in genes between and within populations. If you have two populations and both have allele frequencies of 0.50 for two alleles, A & B, at one locus, then the FST would naturally be 0 as there is no between population difference; you can swap individuals from either group interchangeably for purposes of comparison. In contrast if the frequencies were disjoint so that all individuals in one population were of allele A and all individuals in the other of B, then naturally the FST would be 1, as all the variance is between populations, and all the information you need is found within population substructure. A perfect opportunity for profiling!
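The two extreme cases can be checked with a few lines of code. This is a minimal sketch using the textbook heterozygosity form of the statistic, FST = (HT − HS) / HT, for one biallelic locus and two equally weighted populations; the paper itself uses its own estimator on real genotype data, so treat the function below as illustrative only.

```python
def fst_two_pops(p1, p2):
    """FST at one biallelic locus for two equally weighted populations.

    p1 and p2 are the frequencies of allele A in each population.
    """
    p_bar = (p1 + p2) / 2
    # Expected heterozygosity if both populations were pooled into one.
    h_total = 2 * p_bar * (1 - p_bar)
    # Average expected heterozygosity within each population.
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    if h_total == 0:
        return 0.0  # locus is fixed for the same allele everywhere
    return (h_total - h_within) / h_total

print(fst_two_pops(0.5, 0.5))  # identical frequencies: no differentiation
print(fst_two_pops(1.0, 0.0))  # disjoint frequencies: complete differentiation
print(fst_two_pops(0.6, 0.4))  # modest difference: small FST
```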

But this varies by gene and genomic region. As you know from above most variance is within races, not between them. But for the gene SLC24A5 almost all the variance is between Europeans and Africans, not within them. Similarly, all the variance on this gene is between Europeans and East Asians. Finally, there is no variance between East Asians and Africans on this gene. Why? Because it looks like this gene has recently increased in frequency in West Eurasia, to the point where a new mutation has replaced the ancestral variant, which is common in Africa and East Eurasia. Additionally, it is notable that this genetic variant seems to account for 30-40% of the skin tone difference between Africans and Europeans. The point is that total genome variation is not always a good indicator of the evolutionary history of a specific gene. This is a problem especially in the case of those which we might find of interest.

In the paper above they find that areas of high recombination are negatively correlated with FST on a global scale. In other words, genomic regions which recombine more often across DNA strands and so shuffle genetic variation about and break apart linear associations show lower FST values. Reduced between population variation.

It’s rather clear in their first figure. Before we jump to that, let me note that they’re using the Perlegen data set, which has two dozen African Americans, Chinese and Europeans, respectively, and 1 million SNPs. Each panel has the FST values on the Y axis, and the recombination rate on the X axis.


It’s pretty clear what’s going on just through inspection. There’s an average decrease of 4% in FST for every 1 cM/Mb increase in recombination rate. The correlation estimates for FST and median recombination rates are for each panel:

A: −0.962 (P = 8.9×10^−6)
B: −0.815 (P = 0.0041)
C: −0.931 (P = 0.0001)
D: −0.361 (P = 0.306)

The correlation estimates tell you something you can see visually: there’s a big difference in the relationships contingent upon which populations you’re using to calculate FST. In particular, a lot of the linear relationship between FST and recombination rate is actually due to the African vs. non-African difference. This is not a total surprise, Africans have a lot of genetic variation. In terms of genes one can think of non-Africans as simply a branch of Africans in many ways. Additionally, there’s been suggestive data for a decade now that when Africans left Africa they were subject to new selection pressures which seem common to Eurasians as a whole (though to be fair these signatures of selective sweeps in Eurasia may simply be false positives generated by population bottlenecks and the like).
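For readers who want to see what a correlation between FST and median recombination rate looks like mechanically, here is a toy sketch. The data are synthetic (I bake in a 4% drop in FST per cM/Mb plus noise), the ten equal-sized bins are my assumption, and none of this reproduces the paper's actual pipeline.

```python
import random
from statistics import mean, median

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def binned_correlation(rates, fsts, n_bins=10):
    """Sort SNPs by recombination rate, split into equal bins, then correlate
    each bin's median rate with its mean FST."""
    pairs = sorted(zip(rates, fsts))
    size = len(pairs) // n_bins
    bin_rates, bin_fsts = [], []
    for b in range(n_bins):
        chunk = pairs[b * size:(b + 1) * size]
        bin_rates.append(median([r for r, _ in chunk]))
        bin_fsts.append(mean([f for _, f in chunk]))
    return pearson(bin_rates, bin_fsts)

# Synthetic SNPs: rate in cM/Mb, FST falling 4% per cM/Mb, plus noise.
rng = random.Random(0)
rates = [rng.uniform(0.0, 5.0) for _ in range(5000)]
fsts = [0.15 * (1 - 0.04 * r) + rng.gauss(0.0, 0.05) for r in rates]

r_binned = binned_correlation(rates, fsts)
print(f"binned correlation: {r_binned:.3f}")
```

Binning averages away per-SNP noise, which is why a modest per-SNP trend can show up as a very strong correlation across bins.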

Yes, I will work back to selection from genetic variation; the two are related, though the relation can be subtle. So why the negative correlation between FST and recombination? Consider an SNP, a single DNA base pair, which is subject to positive selection. It can increase rapidly in frequency so that it goes from ~0 to ~1 as a proportion of the population. Fair enough, but SNPs do not exist in an abstract universe, they’re physically embedded in DNA, and so are flanked by many other bases. If an SNP is subject to powerful directional selection which drives its frequency upward, then adjacent bases also “hitchhike” along in frequency. In other words, powerful directional selection can reorder the variation of whole genomic regions, depending on how powerful the selection is. It can sweep away the noisy scatter of variants introduced over many generations by mutation and replace them with a long sequence of alleles from an ancestral genome which harbored the selected variant. Over time mutation can mask the homogenization as it replenishes variation, but there is also another dynamic which blocks the long march to genetic uniformity: recombination. Recombination can tear apart blocks of alleles as they sweep up in frequency, and the more recombination, the greater the counterforce to the homogenizing power of selection on the local genome as the block is chopped up ever more finely.
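Hitchhiking and its erosion by recombination can be illustrated with a standard deterministic two-locus haploid model. This is a sketch under assumptions that are mine, not the paper's: a beneficial allele A with advantage s starts rare and sits entirely on haplotypes carrying a neutral allele B (initial frequency 0.1), and c is the recombination fraction between the two sites.

```python
def linked_allele_after_sweep(c, s=0.05, generations=2000):
    """Final frequency of neutral allele B after beneficial allele A sweeps.

    Tracks the four haplotype frequencies AB, Ab, aB, ab. Allele A starts at
    frequency 0.001, always alongside B; B starts at 0.1 overall.
    """
    x_AB, x_Ab, x_aB, x_ab = 0.001, 0.0, 0.099, 0.9
    for _ in range(generations):
        # Selection: haplotypes carrying A have relative fitness 1 + s.
        w_bar = 1 + s * (x_AB + x_Ab)
        x_AB, x_Ab = x_AB * (1 + s) / w_bar, x_Ab * (1 + s) / w_bar
        x_aB, x_ab = x_aB / w_bar, x_ab / w_bar
        # Recombination: linkage disequilibrium d is eroded at rate c.
        d = x_AB * x_ab - x_Ab * x_aB
        x_AB, x_ab = x_AB - c * d, x_ab - c * d
        x_Ab, x_aB = x_Ab + c * d, x_aB + c * d
    return x_AB + x_aB

for c in (0.0, 0.01, 0.1):
    print(f"c = {c}: final frequency of B = {linked_allele_after_sweep(c):.3f}")
```

With no recombination the neutral allele rides the sweep almost to fixation; as c grows relative to s, it gets dragged along less and less.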

Now consider the nature of selection in different populations. Let us stipulate that light skin in East Asians and in Europeans is an adaptation; we know that it is conferred by different SNPs in each group. In other words, selection operated on different genetic variants to produce the adaptation (though by and large across the same set of genes). Therefore, FST on pigmentation genes is relatively high because of between population difference, and these genes tend to be surrounded by regions of homogenization as they seem to have swept up to high frequency rapidly and dragged along many nearby alleles. Holding this effect of natural selection equal, recombination would naturally tend to work against increased FST by reducing the number of linked alleles being dragged along, breaking apart the genomic blocks on their way up in frequency.

A similar effect occurs in the case of background selection, which is operative around deleterious mutations. When there are alleles which are negatively selected their neighbors are also affected; consider it a sort of population genetic “property value.” Because negative selection tends to reduce effective population size, on a large geographical scale it can result in increased FST (consider the genetic uniqueness of isolated populations which have gone through bottlenecks). Again, recombination can blunt the impact of a deleterious allele on its neighbors. The authors do note that the particular dynamics of positive and background selection differ, the latter is a gentler affair by far, a repetitive tap as opposed to a sledgehammer, but the genomic resolution of their analysis is such that they lack power to explore these differences.

I’ll let the authors describe the peculiarities of the European-Chinese panel:

…The weaker correlation for the FST between European and Chinese populations is driven by a dip in differentiation at very low recombination rate loci…which is not at all what is seen in the comparison of African and non-African populations…This curve shows a qualitatively non-monotonic pattern, which motivated us to perform a quadratic regression fitted within the bootstrapping framework. The regression is concave and includes very significant linear (P = 3.0×10^−4) as well as quadratic (P = 1.8×10^−5) terms. Conversely, quadratic regression gives a non-significant quadratic term for FST between African Americans and each of the other two populations and if anything is slightly convex. As expected, for single SNP analysis (without binning by recombination rate), linear regression is very significant for FST between African Americans and either non-African population (P ≪ 10^−12). For FST between Chinese and Europeans, however, linear regression is not significant (P = 0.81), while a quadratic regression is very significant (P ≪ 10^−12)….

Non-monotonic is just a way to say that the trend reverses direction. In other words, the linear model isn’t too good a fit on what’s going on with the variation between Europeans and Chinese, and how it relates to recombination rates. They offer two speculative possibilities for the “inverted U-shaped” nature of the relationship between FST and recombination in Europeans and Chinese. First, the smaller effective population sizes of non-African groups result in greater efficacy of background selection. As random genetic drift tends to increase the frequency of deleterious alleles, powerful negative selection is given opportunity to work against that region of the genome. This results in more background selection as adjacent genomic regions are impacted. Because of differing population sizes the balance between positive and background selection is different for Africans and non-Africans. A second hypothesis is that gene flow between the two Eurasian groups allowed for selective sweeps to move from one group to the other. In other words, between population variance can be reduced if a favored allele spreads across all populations from one original group (lactase persistence in much of northern Eurasia may be a case of just this).
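To make "concave" concrete: in a quadratic fit y = a + b·x + c·x², an inverted U shows up as a negative c. The sketch below fits that curve by ordinary least squares on five made-up points. The numbers are invented for illustration, and the paper's bootstrapped regression is considerably more involved.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x**2; returns (a, b, c)."""
    # Normal equations (X^T X) beta = X^T y, stored as an augmented 3x4 matrix.
    s = [sum(x ** k for x in xs) for k in range(5)]  # sums of x^0 .. x^4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))

# Invented inverted-U data: FST dips at both low and high recombination rates.
rates = [0, 1, 2, 3, 4]
fsts = [0.10, 0.13, 0.14, 0.13, 0.10]

a, b, c = fit_quadratic(rates, fsts)
shape = "concave (inverted U)" if c < 0 else "convex"
print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f} -> {shape}")
```

A straight line through these five points would have essentially zero slope, which is the same reason the linear regression for Chinese versus Europeans comes out non-significant while the quadratic term is strong.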

Intriguingly they found the linear relationship between FST and recombination to be stronger in genomic regions which are coding (i.e., they have genes which code for proteins), and in particular in the X chromosome. The second aligns with other recent work which indicates that the X chromosome may be subject to stronger selective pressures than the rest of the genome because of its peculiar expression pattern (males have only one copy of the gene, and females express only one copy per cell due to X chromosome inactivation).

Finally, they replicated their results using other data sets. I’ll just show the figure from HapMap3:

I obviously rotated so you could see the labels at higher resolution. Here are the populations:

WAF (“West African”) is a combined sample of YRI (Yoruba in Ibadan, Nigeria) and LWK (Luhya in Webuye, Kenya)

EAS (“East Asia”) is a combined sample of CHB (Han Chinese in Beijing, China), CHD (Chinese in Metropolitan Denver, CO, USA), and JPT (Japanese in Tokyo, Japan)

EUR (“Europe”) is a combined sample of CEU (ancestry from Northern and Western Europe) and TSI (Toscani in Italia)

GIH is a sample of Gujarati Indians in Houston, TX, USA

MKK is a sample of Maasai in Kinyawa, Kenya; and CHI (Chinese) is a combined sample of CHB and CHD.

Their analysis of these findings is cautious:

A striking result is that the relationship between FST and recombination rate is stronger for FST between pairs of closely-related populations, whether within or outside Africa: FST between a West African sample and Maasai (of mixed West African and East African ancestry…decreases by an average of 6% for every 1 cM/Mb…FST between Italians and individuals of North-Western European ancestry decreases by 10% for every cM/Mb …and FST between Japanese and individuals of Chinese ancestry decreases by 4%...In view of the large effective population size in recent human history since each of these pairs of populations have split, these observations support the possibility that the different patterns observed between different pairs of populations are due to natural selection operating more efficiently in the context of larger population sizes. We observed a weak convex relationship with recombination rate for FST between closely-related populations in a quadratic regression analysis …which is intriguingly opposite to what was observed between Europeans and Asians…On the other hand, these observations do not seem to support the possibility that the different patterns are due to selective sweeps being shared to different extent across different pairs of populations since the level of gene flow between HapMap 3 closely-related populations likely have had been higher than that between continents. These results, while interesting, should be viewed with caution due to the confounder of ascertainment bias. It will be possible to test these observations further by analyzing data from the 1000 Genomes Project, where whole-genome sequencing will generate data that is largely free of ascertainment bias for many of the HapMap 3 populations as well as additional populations

The reason that selection would be more powerful at large population sizes is that the noise of random genetic drift is less likely to interfere with its deterministic process. Additionally, one presumes there would be more extant genetic variation in large populations than small ones. But though these results are interesting, they don’t seem to put too much stock in them.
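The point about drift swamping selection in small populations can be sketched with Kimura's classic diffusion approximation for the fixation probability of a new mutation. That formula is not something the paper uses here; I am bringing it in as a standard textbook result, with a diploid population of size N, an initial frequency of 1/(2N), and a selection coefficient s that I picked arbitrarily.

```python
from math import exp

def fixation_probability(n, s):
    """Kimura's approximation for a single new mutation in a diploid
    population of size n with selective advantage s."""
    if s == 0:
        return 1 / (2 * n)  # neutral case: pure drift
    return (1 - exp(-2 * s)) / (1 - exp(-4 * n * s))

def advantage_over_neutral(n, s):
    """How many times likelier fixation is for the selected allele than for
    a neutral one in a population of the same size."""
    return fixation_probability(n, s) / fixation_probability(n, 0)

S = 0.001  # arbitrary small advantage, for illustration only
for n in (100, 1_000, 10_000):
    print(f"N = {n}: selected/neutral = {advantage_over_neutral(n, S):.1f}")
```

In the small population the favored allele does barely better than a neutral one, while in the large population the same tiny advantage makes a big difference.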

Let me finish with the authors’ conclusion:

In conclusion, we have shown that genome-wide human population differentiation in allele frequencies is significantly correlated with recombination rate on a megabase scale, demonstrating that natural selection has had a profound effect on allele frequency distributions averaged over the last hundred thousand years. While these results likely reflect the effects of hitchhiking and background selection, disentangling the strengths of these two forces will require extending the analyses presented in this paper. One important direction is to use genetic maps that have fine spatial resolution, which may shed light on the detailed distribution of selective coefficients that have shaped allele frequency differentiation. A second direction in which these results can be extended is to compare more populations of continentally diverse ancestry. This should facilitate an exploration of the relationship between recombination rate and population differentiation during different epochs of human evolution, and should allow a better understanding of how demographic history has shaped the impact of natural selection on patterns of human genetic variation.

Note: I left a lot out in this treatment. It’s Open Access so you can read the whole thing!

Citation: Keinan A, Reich D, 2010 Human Population Differentiation Is Strongly Correlated with Local Recombination Rate. PLoS Genet 6(3): e1000886. doi:10.1371/journal.pgen.1000886
