Razib Khan One-stop-shopping for all of my content

March 23, 2017

Your ancestry inference is precise and accurate(ish)

Filed under: 23andMe,Ancestry,Culture,Family Tree DNA,Genetics,Genomics — Razib Khan @ 6:29 am

For about three years I consulted for Family Tree DNA. It was a great experience, and I met a lot of cool people through that connection. But perhaps the most interesting aspect was the fact that I can understand the various pressures that direct-to-consumer genomics firms face from the demand side. The science is one thing, but when you are working on a consumer-facing product, other variables come into play which you are not cognizant of when you are thinking of it from a point of pure analysis. I’m pretty sure that my insights working with Family Tree DNA can generalize to the other firms as well (23andMe, Ancestry, and Genographic*).

The science behind the ancestry inference elements of the product on offer is not particularly controversial or complex, but the customer aspect of how these results are received can become an intractable nightmare. The basic theory was outlined in the year 2000 in Pritchard et al.’s Inference of Population Structure Using Multilocus Genotype Data. You have lots of data thanks to better genomic technology (e.g., 300,000 SNPs). You have computers to analyze that data. And, you have scientific models of population history and dynamics which you can test that data against. The shape of the data will determine the parameters of the model, and it is those parameters that yield “your ancestry.”
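To make "the data determine the parameters of the model" concrete, here is a minimal sketch in Python. It is a simplification, not the method of Pritchard et al. or of any company: it assumes the allele frequencies of the source populations are already known (the original method infers them jointly with the ancestry proportions), it treats SNPs as independent, and it runs on simulated data. The function names, the two source populations, and the 70/30 split are all made up for illustration.

```python
import random

def estimate_admixture(genotypes, ref_freqs, n_iter=100):
    """EM estimate of one individual's admixture proportions.

    genotypes: per-SNP counts (0, 1 or 2) of a chosen allele "A".
    ref_freqs: ref_freqs[k][l] is the frequency of "A" at SNP l in source
               population k (assumed known here, which is a simplification).
    """
    k_pops = len(ref_freqs)
    q = [1.0 / k_pops] * k_pops          # start from equal proportions
    n_copies = 2 * len(genotypes)        # two allele copies per SNP
    for _ in range(n_iter):
        new_q = [0.0] * k_pops
        for l, g in enumerate(genotypes):
            # Probability that a random allele copy is "A" under the current q.
            p_a = sum(q[k] * ref_freqs[k][l] for k in range(k_pops))
            p_not_a = 1.0 - p_a
            for k in range(k_pops):
                f = ref_freqs[k][l]
                if g > 0:    # share of the "A" copies attributed to population k
                    new_q[k] += g * q[k] * f / p_a
                if g < 2:    # share of the non-"A" copies attributed to k
                    new_q[k] += (2 - g) * q[k] * (1.0 - f) / p_not_a
        q = [x / n_copies for x in new_q]
    return q

def simulate_individual(true_q, ref_freqs, rng):
    """Draw each allele copy from a source population chosen with weights true_q."""
    genotypes = []
    for l in range(len(ref_freqs[0])):
        g = 0
        for _copy in range(2):
            k = rng.choices(range(len(true_q)), weights=true_q)[0]
            g += rng.random() < ref_freqs[k][l]
        genotypes.append(g)
    return genotypes

rng = random.Random(0)
n_snps = 2000
# Hypothetical reference panels: two well-differentiated source populations.
ref_freqs = [[rng.uniform(0.05, 0.95) for _ in range(n_snps)] for _ in range(2)]
true_q = [0.7, 0.3]
genotypes = simulate_individual(true_q, ref_freqs, rng)
q_hat = estimate_admixture(genotypes, ref_freqs)
print("estimated proportions:", [round(x, 3) for x in q_hat])
```

With sources this different, the estimate should land close to the simulated 70/30 split. The interesting failures begin when the reference populations barely differ.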

In broad sketches the results make sense for most people. It’s in the finer details that the confusions emerge. To the left you see my son’s 23andMe ancestry deconvolution. The color coding is such you can tell that his maternal and paternal chromosomes have very different ancestry profiles (mostly Northern European and South Asian, respectively).

But his “Northern European” chromosomes also are more richly colored, with alternative segments denoting ancestry from different parts of Northern Europe. So in terms of proportions I am told my son is about 15 percent French and German, and 10 percent Scandinavian and 10 percent British and Irish. This is reasonable. On the other side he’s nearly 50 percent “broadly South Asian.” The balance is accounted for by my East Asian ancestry, which is correct, as my South Asian ethnicity is from Bengal, where there is a fair amount of East Asian ancestry (my family’s origin is on the eastern edge of Bengal itself).

And it is here that the non-scientific concerns of consumer genomics come into focus. The genetic differences and distance between various South Asian groups are far higher than those between various Northern European groups. Depending on the statistical measure you use, intra-South Asian variation is about one order of magnitude greater than intra-Northern European differences. This is due to geographic partitioning, the caste system, and differential admixture in South Asians between extremely diverged ancestral elements (about half of South Asian ancestry is very similar to Europeans and Middle Easterners, and half of it is extremely different, so how far you are from the 50 percent mark determines a lot).

Broadly South Asian

In Northern Europe there is very little genetic variation from the British Isles all the way to the Baltic. The reason for this is historical: massive population turnover in the region 4,500 years ago means that much of the genetic divergence between the groups dates to the Bronze Age. It is this genetic divergence, the variation, that is the raw material for the inferences and proportions you see in ancestry calculators. There’s just not that much raw material for Northern Europeans.

Broadly South Asian

Remember, the methods require lots of variation in the data as a raw input. You’re making the inference machine work real hard to produce a reasonably robust result if you don’t have that much variation. In contrast to the situation with Northern Europeans, with South Asians the companies are leaving raw material on the table, and just combining diverse groups together.
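A rough way to see how hard the machine has to work: under the simple admixture model (each allele copy drawn independently from one of two sources), the Fisher information about the admixture fraction q at a SNP is 2(p_a - p_b)^2 / (p(1 - p)), where p_a and p_b are the source frequencies and p = q p_a + (1 - q) p_b. Summing over SNPs and taking 1/sqrt gives a best-case standard error. The sketch below uses made-up frequencies and assumes independent SNPs, which real, linked SNPs are not, so treat the numbers as optimistic lower bounds.

```python
import math

def admixture_standard_error(freqs_a, freqs_b, q=0.5):
    """Best-case (Cramer-Rao) standard error for an admixture fraction q.

    freqs_a, freqs_b: per-SNP allele frequencies in the two source populations.
    Assumes independent SNPs and known source frequencies.
    """
    info = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        p = q * pa + (1.0 - q) * pb                     # frequency in the admixed genome
        info += 2.0 * (pa - pb) ** 2 / (p * (1.0 - p))  # two allele copies per SNP
    return 1.0 / math.sqrt(info)

n_snps = 300_000
# Hypothetical: every SNP differs by 0.02 between two closely related sources...
similar = admixture_standard_error([0.50] * n_snps, [0.52] * n_snps)
# ...versus every SNP differing by 0.30 between two strongly diverged sources.
diverged = admixture_standard_error([0.35] * n_snps, [0.65] * n_snps)
print("closely related sources:", similar)
print("diverged sources:", diverged)
```

In this toy setup the error for the closely related pair is roughly fifteen times that for the diverged pair, even with the same 300,000 SNPs. Errors in the reference panels themselves, which the sketch ignores, only widen the gap.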

What’s going on here? As you might have guessed this is an economically motivated decision. Most South Asians know their general heritage due to caste and regional origins (though many Bengalis exhibit some lacunae about their East Asian ancestry). In contrast, many Americans of Northern European ancestry with an interest in genealogy are extremely curious about explicit proportional breakdowns between Northern European nationalities. The direct-to-consumer genomic firms attempt to cater to this demand as best as they can.

As I have stated many times, racial background is to various extents both biological and social. When it comes to the difference between Lithuanians and Nigerians the biological differences due to evolutionary history are straightforward, and clear and distinct. You can generate a phylogenetic history and perform a functional analysis of the differences. Additionally, you also have to note that the social differences exist, but are not straightforward. Like Lithuanians, Nigerians of Igbo background are generally Roman Catholic, while most other Nigerians are not. The linguistic differences between Nigerian languages are great enough that it is defensible to suggest that the Afro-Asiatic language of the Hausa is closer to Lithuanian in its phylogenetic history than it is to the language of the Yoruba.

A Lithuanian American

Contrast this to the situation where you differentiate Lithuanians from French. To any European the differences here are incredibly huge. The history of France, what was Roman Gaul, goes back 2,000 years. After the collapse of the West Roman Empire by any measure the people who became French were at the center of European history. In contrast, Lithuanians were a marginal tribe, who did not enter Christian civilization until the late 14th century. In social-cultural terms, due to history, the differences between French and Lithuanians are extremely salient to people of French and Lithuanian ancestry. But genetically the differences are modest at best.

If a direct-to-consumer genetic testing company tells you that you are 90 percent Northern European and 10 percent West African, that is a robust result that has a clear historical genetic interpretation. The two elements of one’s ancestry have been relatively distinct for on the order of 100,000 years, with the Northern European element really just a proxy for non-Africans (though it is easy to drill down within Eurasia). In contrast, notice how 23andMe, with some of the best scientists in the business, tells people they are “French-German,” and not French or German. What the hell is a “French-German”? Someone from Alsace-Lorraine? A German descendant of Huguenots? Obviously not.

“French-German” is a cluster almost certainly because there are no clear and distinct genetic differences between French and Germans. Yes, there is a continuum of allele frequencies between these two groups, but having looked at a fair number of people of French and German background in Family Tree DNA’s database I can tell you that France and Germany have a lot of local structure even among people of indigenous ancestry. Germans from the Rhineland are quite often genetically closer to French from Normandy than they are to Germans from eastern Saxony. Some of this is due to gene flow between neighboring regions, but some of this is due to cultural fluidity as to who exactly is German. It is clear that some Germans from the eastern regions are Germanized Slavs. Some Germans from the north exhibit strong affinities to Scandinavians, while Germans from Bavaria and Austria are classically Central European (whatever that means). The average German is distinct from the average French person, but the genetic clustering of the two groups is not clear and distinct.

Remember earlier I explained that the science is predicated on aligning data and models. The cultural model of Northern Europeans is conditioned on diversity and difference which has been very salient for the past few thousand years since the rise and fall of Rome. But the evolutionary genetic history is one where there are far fewer differences. The data do not fit a model that makes much sense to the average consumer (e.g., “you descend from a mix of Bronze Age migrants from the west-central steppe of Eurasia and Mesolithic indigenous hunter-gatherers and Neolithic farmers”). What makes sense to the average American consumer are histories of nationalities, so direct-to-consumer genetic companies try to satisfy this need. Because the needs of the consumer and their cultural expectations are poorly served by the data (genetic variation) and models of population history, you have a lot of awkward kludges and strange results.

Imagine, for example, you want to estimate how “German” someone is. What do you use for your reference population of Germans? Looking at the data there are clearly three major clusters within Germany when you weight the numbers appropriately, with affinities to the northern French, Slavs, and Scandinavians, and various proportions in between. Your selection of your sample is going to mean that some Germans are going to be more German than other Germans. If you select an eastern German sample then western Germans whose ancestors have been speaking a Germanic language far longer than eastern Germans are going to come out as less German. Or, you could just pick all of these disparate groups…in which case, lots of Northern Europeans become “German.”

Consumers want genetic tests to reflect strong cultural memories which were forged in the fires of rapidly protean and distinction-making process of cultural evolution. But biological and cultural evolution exhibit different modes (the latter generates huge between group differences) and tempos (those differences emerge fast). The ancestry results many people get are the outcomes of compromises to thread the needle and square the circle.

All the above is half the story. Next I’ll explain why “deep history” has to be massaged to make recent history informative and comprehensible….

* Also, I have a little historical perspective because of my friendship with the person who arguably created this sector, Spencer Wells.

November 28, 2016

Genomics Is Not Magic, There Is No Magic

Filed under: Genetics,science — Razib Khan @ 3:26 am
MIT Technology Review has an article up, Do Your Family Members Have a Right to Your Genetic Code?, which is now part of the genomics-human-interest-piece genre you see regularly. Here you have the exemplar of this sort of narrative: what do you do when one twin gets a test and the other does not, and...

November 26, 2016

Afro-Asiatic and Eurasian Backflow

Filed under: Africa,Genetics,science — Razib Khan @ 11:55 pm
If you follow Y genealogy you know that the distribution of R1ba2 exhibits a peculiar pattern. R1b is the most common haplogroup in Western Eurasia, and shares a deep common ancestry with R1a. It seems to have risen to high frequencies in Europe only during the Bronze Age, though has been found in earlier periods....

The species barriers between Neanderthals and anatomically modern humans

Filed under: Genetics,Neanderthals,science — Razib Khan @ 3:30 pm

A new paper in The American Journal of Human Genetics, The Divergence of Neandertal and Modern Human Y Chromosomes, reports on possible reasons why we don’t see Y chromosomes in modern humans from this archaic lineage, despite exhibiting detectable levels of autosomal admixture. As you might recall the clear lack of deep branching Y and mtDNA lineages was long one of the major genetic rationales for why gene flow between Neanderthals and modern humans was presumably not very significant. This, despite suggestive evidence from morphological analysis as well as inferences from autosomal data. The problem is that it is harder to do the sort of clean phylogenetic reconstruction via a coalescent model utilizing autosomal data (which recombines, as opposed to the Y and mtDNA, which do not for the regions of interest), so ancient genome sequences were really what was needed to convince most people with these sorts of markers.

This makes us ask: why are Neanderthal Y and mtDNA lineages not found in modern humans who exhibit indications of gene flow from other hominin lineages? After all, the lack of these really led many people off on the wrong track for years. I recall in 2008 going to a talk by Svante Pääbo who reported that the Neanderthal mtDNA he had sequenced was definitely very different from anything in the current databases for our species, which confirmed his assumption that there was no admixture into modern populations (Pääbo changed his tune very soon after due to the whole genome sequencing obviously). One simple explanation is that because effective population sizes of Y and mtDNA are smaller than autosomal regions of the genome they’ll be more strongly subject to drift, and exhibit higher extinction rates. In other words, it wouldn’t be that surprising if all Neanderthal Y and mtDNA went extinct after admixture because they were a small minority, and most lineages go extinct in any case. Researchers who work in non-human phylogeography who relied on mtDNA in particular can tell many stories of being led astray by looking at one informative locus.
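The drift argument is easy to check with a toy neutral Wright-Fisher simulation. This is a sketch under stated assumptions, not an analysis of real data: the population sizes, the 5 percent starting frequency, and the 100 generation window are arbitrary, and the four-fold difference in the number of gene copies is the textbook ratio between autosomes and Y or mtDNA when equal numbers of males and females breed.

```python
import random

def fraction_lost(n_copies, start_freq, generations, replicates, rng):
    """Share of neutral Wright-Fisher runs in which the introduced lineage is lost.

    n_copies: number of gene copies in the population at this locus.
    start_freq: initial frequency of the introduced (e.g. archaic) lineage.
    """
    lost = 0
    for _rep in range(replicates):
        count = round(start_freq * n_copies)
        for _gen in range(generations):
            p = count / n_copies
            # Binomial resampling: each copy in the next generation picks a parent.
            count = sum(rng.random() < p for _ in range(n_copies))
            if count == 0 or count == n_copies:
                break  # lost or fixed, nothing more can change
        lost += count == 0
    return lost / replicates

rng = random.Random(1)
# Hypothetical sizes: 400 autosomal copies versus 100 Y (or mtDNA) copies.
autosomal = fraction_lost(400, 0.05, 100, 200, rng)
uniparental = fraction_lost(100, 0.05, 100, 200, rng)
print("lost within 100 generations, autosomal locus:", autosomal)
print("lost within 100 generations, Y/mtDNA:", uniparental)
```

The minority lineage disappears noticeably more often at the locus with fewer copies over the same number of generations. And a genome carries many autosomal loci but only one Y and one mtDNA, so some archaic autosomal segments will survive even when the uniparental markers are lost entirely.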

But chance may not be what is at work here. Buried in the discussion of the paper:

…polypeptides from several Y-chromosome genes act as male-specific minor histocompatibility (H-Y) antigens that can elicit a maternal immune response during gestation. Such effects could be important drivers of secondary recurrent miscarriages30 and might play a role in the fraternal birth order effect of male sexual orientation.31 Interestingly, all three genes with potentially functional missense differences between the Neandertal and modern humans sequences are H-Y genes, including KDM5D, the first H-Y gene characterized…It is tempting to speculate that some of these mutations might have led to genetic incompatibilities between modern humans and Neandertals and to the consequent loss of Neandertal Y chromosomes in modern human populations. Indeed, reduced fertility or viability of hybrid offspring with Neandertal Y chromosomes is fully consistent with Haldane’s rule, which states that “when in the [first generation] offspring of two different animal races one sex is absent, rare, or sterile, that sex is the [heterogametic] sex.”

The origin of species is obviously one of the founding questions which arose with the emergence of evolutionary biology. Haldane’s rule dates to the 1920s. In mammals the heterogametic sex is male, so these are the hybrids which will be selected against (or, they may be sterile). There’s been a lot of research of late on why Neanderthals went extinct, and whether there were speciation barriers in keeping with the biological species concept between our two lineages. This result suggests that there is going to be interesting stuff coming out of the population genomics of ancient hominins in the near future….

December 4, 2013

How genetics are rewriting the history of the Caribbean

Filed under: Christopher Columbus,Genetics,Native Americans,science — Razib Khan @ 2:47 am
Over the past century history has been approached from many different angles, despite the stereotype of scholars haunting dusty archives. Adventurers once called antiquarians became archaeologists, and inspired the fictional Indiana Jones. Today it is the turn of the geneticists to put their stamp upon history. By tracing patterns of variation they gain insights as to the [...]

November 8, 2013

Selection happens; but where, when, and why?

Filed under: Anthroplogy,Genetics,Genomics,Pigmentation — Razib Khan @ 1:49 am
Distribution of SLC24A5 variation at SNP rs1426654. Credit, HGDP Browser

Nina Davuluri, Miss America 2014, Credit: Andy Jones

One of the secondary issues which cropped up with Nina Davuluri winning Miss America is that it seems implausible that someone with her complexion would be able to win any Indian beauty contest. A quick skim of Google images “Miss India” will make clear the reality that I’m alluding to. The Indian beauty ideal, especially for females, is skewed to the lighter end of the complexion distribution of native South Asians. Nina Davuluri herself is not particularly dark skinned if you compare her to the average South Asian; in fact she is likely at the median. But it would be surprising to see a woman who looks like her held up as conventionally beautiful in the mainstream Indian media. When I’ve pointed this peculiar aspect out to Indians* some of them will submit that there are dark skinned female celebrities, but when I look up the actresses in question they are invariably not very dark skinned, though perhaps by comparison to what is the norm in that industry they may be. But whatever the cultural reality is, the fraught relationship of color variation to aesthetic variation prompts us to ask, why are South Asians so diverse in their complexions in the first place? A new paper in PLoS Genetics, The Light Skin Allele of SLC24A5 in South Asians and Europeans Shares Identity by Descent, explores this genetic question in depth.

Much of the low hanging fruit in this area was picked years ago. A few large effect genetic variants which are known to be polymorphic across many populations in Western Eurasia segregate within South Asian populations. What this means in plainer language is that a few genes which cause major changes in phenotype are floating around in alternative flavors even within families among people of Indian subcontinental origin. Ergo, you can see huge differences between full siblings in complexion (African Americans, as an admixed population, are analogous). While loss of pigmentation in eastern and western Eurasia seems to be a case of convergent evolution (different mutations in overlapping sets of genes), the H. sapiens sapiens ancestral condition of darker skin is well conserved from Melanesia to Africa.


So what’s the angle on this paper you may ask? Two things. The first is that it has excellent coverage of South Asian populations. This matters because to understand variation in complexion you should probably look at populations which vary a great deal. Much of the previous work has focused on populations at the extremes of the human distribution, Africans and Europeans. There are obvious limitations to this approach. If you are looking at variant traits, then focusing on populations where the full range of variation is expressed can be useful. Second, this paper digs deeply into the subtle evolutionary and phylogenomic questions which are posed by the diversification of human pigmentation. It is often said that race is only skin deep, as if to dismiss the importance of human biological variation. But skin is a rather big deal. It’s our biggest organ, and the pigmentation loci do seem to be rather peculiar.

You probably know that on the order of ~20% of genetic variation is partitioned between continental populations (races). But this is not the case at all genes. And pigmentation ones tend to be particularly notable exceptions to the rule. In late 2005 a paper was published which arguably ushered in the era of modern pigmentation genomics, SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. The authors found that one nonsynonymous mutation was responsible for on the order of 25 to 33% of the variation in skin color difference between Africans and Europeans. And, the allele frequency was nearly disjoint across the two populations, and between Europeans and East Asians. When comparing Europeans to Africans and East Asians almost all the variation was partitioned across the populations, with very little within them. The derived SNP, which differs from the ancestral state, is found at ~100% frequency in Europeans, and ~0% in Africans and East Asians. It is often stated (you can Google it!) that this variant is the second most ancestrally informative allele in the human genome in relation to Europeans vs. Africans.
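To put a number on "almost all the variation was partitioned across the populations," one can compute Wright's F_ST for a single biallelic locus in two equally weighted populations: one minus the ratio of average within-population heterozygosity to the heterozygosity of the pooled population. The frequencies below are illustrative stand-ins (0.99 and 0.01 for the "~100%" and "~0%" above, and an arbitrary 0.50 versus 0.30 for an ordinary SNP), not values taken from either paper.

```python
def fst_two_populations(p1, p2):
    """Wright's F_ST at one biallelic locus for two equally weighted populations."""
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)      # heterozygosity if the two were pooled
    if h_total == 0.0:
        return 0.0                             # locus is monomorphic everywhere
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return 1.0 - h_within / h_total

slc24a5_like = fst_two_populations(0.99, 0.01)   # nearly disjoint frequencies
ordinary = fst_two_populations(0.50, 0.30)       # a more typical difference
print(round(slc24a5_like, 3), round(ordinary, 3))  # → 0.96 0.042
```

At the nearly disjoint locus about 96 percent of the variation sits between the two populations; at the ordinary one it is about 4 percent. That contrast is what makes a variant like this one so ancestrally informative.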

SLC24A5 was just the beginning. SLC45A2, TYR, OCA2, and KITLG are just some of the alphabet soup of loci which have come to be understood to affect normal human variation in pigmentation. Despite the relatively large roll call of pigmentation genes one can safely say that between any two reasonably distinct geographic populations ~90 percent of the between population variation in the trait is going to be due to ~10 genes. Often there is a power law distribution as well. The first few genes of large effect are over 50% of the variance, while subsequent loci are progressively less important.

So how does this work to push the overall results forward?

– With their population coverage the authors confirm that SLC24A5 seems to be polymorphic in all Indo-European and Dravidian speaking populations in the subcontinent. The frequency of the derived variant ranges from ~90% in the Northwest, and ~80% in Brahmin populations all over the subcontinent, to ~10-20% in some tribal groups.

– Though there is a north-south gradient, it is modest, with a correlation of ~0.25. There is a much stronger correlation with longitude, but I’m rather sure that this is an artifact of their low sampling of Indo-European populations in the eastern Gangetic plain. As hinted in the piece the correlation with longitude has to do with the fact that Tibetan and Burman populations in these fringe regions tend to lack the West Eurasian allele.

– Using haplotype based tests of natural selection the authors infer that the frequency of this allele has been driven up positively in north, but not south, India. It could be that the authors lack power to detect selection in the south because of lower frequency of the derived allele. And, I did wonder if selection in the north was simply an echo of what occurred in West Eurasia. But if you look at the frequency of the A allele in the north most of the populations seem to have a higher frequency of the derived variant than they do of inferred “Ancestral North Indian”.

What’s perhaps more interesting is the bigger picture of human evolutionary dynamics and phylogenetics that these results illuminate. Resequencing the region around SLC24A5 these researchers confirmed it does look like the derived variant is identical by descent in all populations across Western Eurasia and into South Asia. What this means is that this mutation arose in someone at some point around the Last Glacial Maximum, after West Eurasians separated from East Eurasians. The authors give some numbers using some standard phylogenetic techniques, but admit that it is ancient DNA that will give true clarity on the deeper questions. When I see something written like that my hunch, and hope, is that more papers are coming soon.

When I first read The Light Skin Allele of SLC24A5 in South Asians and Europeans Shares Identity by Descent, I thought that it was essential to read Ancient DNA Links Native Americans With Europe and Efficient moment-based inference of admixture parameters and sources of gene flow. The reason goes back to the plot which I generated at the top of this post: notice that Native Americans do not carry the West Eurasian variant of SLC24A5. What the find of the ~24,000-year-old Siberian boy, and his ancient DNA, suggests is that there was a population with affinities closer to West Eurasians than East Eurasians that contributed to the ancestry of Native Americans. The lack of the European variant of SLC24A5 in Native Americans suggests to me that the sweep had not begun, or, that the European variant was disfavored. What the other paper reports is that on the order of 20-40% of the ancestry of Europeans may be derived from an ancient North Eurasian population, unrelated to West Eurasians (or at least not closely related). It is likely that this population has something to do with the Siberian boy. Since Europeans are fixed for the derived variant of SLC24A5, that implies to me that the sweep must have occurred after 24,000 years ago.

At this point I have to admit that I believe we need to be careful calling this a “European variant.” Just because it is nearly fixed in Europe does not imply that the variant arose in Europe. If you look at the frequency of the derived variant you see it is rather high in the northern Middle East. Looking at some of the populations in the Middle Eastern panel the ancestral variant might be all explained by admixture in historical time from Africa. If the sweep began during the last Ice Age, then most of Europe would have been uninhabited. The modern distribution is informative, but it surely does not tell the whole story.

Where we are is that SLC24A5, and pigmentation as a whole, is coming to be genomically characterized fully. We don’t know the whole story of why light skin was selected so strongly. And we don’t quite know where the selection began, and when it began. But through gradually filling in pieces of the puzzle we may come to grips with this adaptively significant trait in the near future.

Citation: Basu Mallick C, Iliescu FM, Möls M, Hill S, Tamang R, et al. (2013) The Light Skin Allele of SLC24A5 in South Asians and Europeans Shares Identity by Descent. PLoS Genet 9(11): e1003912. doi:10.1371/journal.pgen.1003912

* From my personal experience American born Indians often do not share the same prejudices and biases, partly because subtle shades of brown which are relevant in the Indian context seem ludicrous in the United States.

The post Selection happens; but where, when, and why? appeared first on Gene Expression.

Selection happens; but where, when, and why?

Filed under: Anthroplogy,Genetics,Genomics,Pigmentation — Razib Khan @ 1:49 am
Distribution of SLC452 variation at SNP rs1426654. Credit, HGDP Browser

Distribution of SLC452 variation at SNP rs1426654. Credit, HGDP Browser

Nina Davuluri, Miss America 2014, Credit: Andy Jones

Nina Davuluri, Miss America 2014, Credit: Andy Jones

One of the secondary issues which cropped up with Nina Davuluri winning Miss America is that it seems implausible that someone with her complexion would be able to win any Indian beauty contest. A quick skim of Google images “Miss India” will make clear the reality that I’m alluding to. The Indian beauty ideal, especially for females, is skewed to the lighter end of the complexion distribution of native South Asians. Nina Davuluri herself is not particularly dark skinned if you compared her to the average South Asian; in fact she is likely at the median. But it would be surprising to see a woman who looks like her held up as conventionally beautiful in the mainstream Indian media. When I’ve pointed this peculiar aspect out to Indians* some of them of will submit that there are dark skinned female celebrities, but when I look up the actresses in question they are invariably not very dark skinned, though perhaps by comparison to what is the norm in that industry they may be. But whatever the cultural reality is, the fraught relationship of color variation to aesthetic variation prompts us to ask, why are South Asians so diverse in their complexions in the first place? A new paper in PLoS Genetics, The Light Skin Allele of SLC24A5 in South Asians and Europeans Shares Identity by Descent, explores this genetic question in depth.

Much of the low hanging fruit in this area was picked years ago. A few large effect genetic variants which are known to be polymorphic across many populations in Western Eurasia segregate within South Asian populations. What this means in plainer language is that a few genes which cause major changes in phenotype are floating around in alternative flavors even within families among people of Indian subcontinental origin. Ergo, you can see huge differences between full siblings in complexion (African Americans, as an admixed population, are analogous). While loss of pigmentation in eastern and western Eurasia seems to be a case of convergent evolution (different mutations in overlapping sets of genes), the H. sapiens sapiens ancestral condition of darker skin is well conserved from Melanesia to Africa.


So what’s the angle on this paper you may ask? Two things. The first is that it has excellent coverage of South Asian populations. This matters because to understand variation in complexion you should probably look at populations which vary a great deal. Much of the previous work has focused on populations at the extremes of the human distribution, Africans and Europeans. There are obvious limitations using this approach. If you are looking at variant traits, then focusing on populations where the full range of variation is expressed can be useful. Second, this paper digs deeply into the subtle evolutionary and phylogenomic questions which are posed by the diversification of human pigmentation. It is often said that race is often skin deep, as if to dismiss the importance of human biological variation. But skin is a rather big deal. It’s our biggest organ, and the pigmentation loci do seem to be rather peculiar.

You probably know that on the order of ~20% of genetic variation is partitioned between continent populations (races). But this is not the case at all genes. And pigmentation ones tend to be particular notable exceptions to the rule. In late 2005 a paper was published which arguably ushered in the era of modern pigmentation genomics, SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. The authors found that one nonsynonomous mutation was responsible for on the order of 25 to 33% of the variation in skin color difference between Africans and Europeans. And, the allele frequency was nearly disjoint across the two populations, and between Europeans and East Asians. When comparing Europeans to Africans and East Asians almost all the variation was partitioned across the populations, with very little within them. The derived SNP, which differs from the ancestral state, is found at ~100% frequency in Europeans, and ~0% in Africans and East Asians. It is often stated (you can Google it!) that this variant is the second most ancestrally informative allele in the human genome in relation to Europeans vs. Africans.

SLC24A5 was just the beginning. SLC45A2, TYR, OCA2, and KITLG are just some of the numerous alphabet soup of loci which has come to be understood to affect normal human variation in pigmentation. Despite the relatively large roll call of pigmentation genes one can safely say that between any two reasonably distinct geographic populations ~90 percent of the between population variation in the trait is going to be due to ~10 genes. Often there is a power law distribution as well. The first few genes of large effect are over 50% of the variance, while subsequent loci are progressively less important.

So how does this work to push the overall results forward?

- With their population coverage the authors confirm that SLC24A5 seems to be polymorphic in all Indo-European and Dravidian speaking populations in the subcontinent. The frequency of the derived variant ranges from ~90% in the Northwest, and ~80% in Brahmin populations all over the subcontinent, to ~10-20% in some tribal groups.

- Though there is a north-south gradient, it is modest, with a correlation of ~0.25. There is a much stronger correlation with longitude, but I’m rather sure that this is an artifact of their low sampling of Indo-European populations in the eastern Gangetic plain. As hinted in the piece the correlation with longitude has to do with the fact that Tibetan and Burman populations in these fringe regions tend to lack the West Eurasian allele.

- Using haplotype based tests of natural selection the authors infer that the frequency of this allele has been driven up positively in north, but not south, India. It could be that the authors lack power to detect selection in the south because of lower frequency of the derived allele. And, I did wonder if selection in the north was simply an echo of what occurred in West Eurasia. But if you look at the frequency of the A allele in the north most of the populations seem to have a higher frequency of the derived variant than they do of inferred “Ancestral North Indian”.

What’s perhaps more interesting is the bigger picture of human evolutionary dynamics and phylogenetics that these results illuminate. Resequencing the region around SLC24A5 these researchers confirmed it does look like the derived variant is identical by descent in all populations across Western Eurasia and into South Asia. What this means is that this mutation arose in someone at some point around the Last Glacial Maximum, after West Eurasians separated from East Eurasians. The authors give some numbers using some standard phylogenetic techniques, but admit that it is ancient DNA that will give true clarity on the deeper questions. When I see something written like that my hunch, and hope, is that more papers are coming soon.

When I first read The Light Skin Allele of SLC24A5 in South Asians and Europeans Shares Identity by Descent, I thought that it was essential to read Ancient DNA Links Native Americans With Europe and Efficient moment-based inference of admixture parameters and sources of gene flow. The reason goes back to the plot which I generated at the top of this post: notice that Native Americans do not carry the West Eurasian variant of SLC24A5. What the find of the ~24,000-year-old Siberian boy, and his ancient DNA, suggests is that there was a population with affinities closer to West Eurasians than East Eurasians that contributed to the ancestry of Native Americans. The lack of the European variant of SLC24A5 in Native Americans suggests to me that the sweep had not begun, or, that the European variant was disfavored. What the other paper reports is that on the order of 20-40% of the ancestry of Europeans may be derived from an ancient North Eurasian population, unrelated to West Eurasians (or at least not closely related). It is likely that this population has something to do with the Siberian boy. Since Europeans are fixed for the derived variant of SLC24A5, that implies to me that the sweep must have occurred after 24,000 years ago.

At this point I have to admit that I believe we need to be careful calling this a “European variant.” Just because it is nearly fixed in Europe, does not imply that the variant arose in Europe. If you look at the frequency of the derived variant you see it is rather high in the northern Middle East. Looking at some of the populations in the Middle Eastern panel the ancestral variant might be all explained by admixture in historical time from Africa. If the sweep began during the last Ice Age, then most of Europe would have been uninhabited. The modern distribution is informative, but it surely does not tell the whole story.

Where we are is that SLC24A5, and pigmentation as a whole, is coming to be genomically characterized fully. We don’t know the whole story of why light skin was selected so strongly. And we don’t quite know where the selection began, and when it began. But through gradually filling in pieces of the puzzle we may come to grips with this adaptively significant trait in the near future.

Citation: Basu Mallick C, Iliescu FM, Möls M, Hill S, Tamang R, et al. (2013) The Light Skin Allele of SLC24A5 in South Asians and Europeans Shares Identity by Descent. PLoS Genet 9(11): e1003912. doi:10.1371/journal.pgen.1003912

* From my personal experience American born Indians often do not share the same prejudices and biases, partly because subtle shades of brown which are relevant in the Indian context seem ludicrous in the United States.

The post Selection happens; but where, when, and why? appeared first on Gene Expression.

August 9, 2013

Indo-Aryan Assimilation Theory (AAT)

Filed under: Genetics — Razib Khan @ 2:33 am
See my post and Zack Ajmal’s. Also, in case I didn’t make it clear enough in the post: I think that post-Indo-Aryan incursions were probably non-trivial. I now lean toward the proposition that the Jat populations of northwest India and … Continue reading

January 5, 2013

Why the future won’t be genetically homogeneous

While reading The Founders of Evolutionary Genetics I encountered a chapter where the late James F. Crow admitted that he had a new insight every time he reread R. A. Fisher’s The Genetical Theory of Natural Selection. This prompted me to put down The Founders of Evolutionary Genetics after finishing Crow’s chapter and pick up my copy of The Genetical Theory of Natural Selection. I’ve read it before, but this is as good a time as any to give it another crack.

Almost immediately Fisher aims at one of the major conundrums of 19th century theory of Darwinian evolution: how was variation maintained? The logic and conclusions strike you like a hammer. Charles Darwin and most of his contemporaries held to a blending model of inheritance, where offspring reflect a synthesis of their parental values. As it happens this aligns well with human intuition. Across their traits offspring are a synthesis of their parents. But blending presents a major problem for Darwin’s theory of adaptation via natural selection, because it erodes the variation which is the raw material upon which selection must act. It is a famously peculiar fact that the abstraction of the gene was formulated over 50 years before the concrete physical embodiment of the gene, DNA, was ascertained with any confidence. In the first chapter of The Genetical Theory R. A. Fisher suggests that the logical reality of persistent copious heritable variation all around us should have forced scholars to the inference that inheritance proceeded via particulate and discrete means, as these processes do not diminish variation indefinitely in the manner which is entailed by blending.

More formally the genetic variance decreases by a factor of 1/2 every generation in a blending model. This is easy enough to understand. But I wanted to illustrate it myself, so I slapped together a short simulation script. The specifications are as follows:

1) Fixed population size, in this case 100 individuals

2) 100 generations

3) All individuals have 2 offspring, and mating is random (no consideration of sex)

4) The offspring trait value is the mid-parent value of the parents, though I also include a “noise” parameter in some of the runs, so that the outcome deviates somewhat in a random fashion from expected parental values

In terms of the data structure the ultimate outcome is a 100 ✕ 100 matrix, with rows corresponding to generations, and each cell an individual in that generation. The values in each cell span the range from 0 to 1. In the first generation I imagine the combining of two populations with totally different phenotypic values; 50 individuals coded 1 and 50 individuals coded 0. If a 1 and 1 mate, they produce only 1’s. Likewise with 0’s. On the other hand a 0 and a 1 produce a 0.5. And so forth. The mating is random in each generation.
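The specification above can be sketched in a few lines of Python. This is not the original script, just a minimal reconstruction under stated assumptions: individuals are shuffled and paired off, each pair leaves two offspring, the noise is Gaussian, and values are clipped back into the range 0 to 1. The names `simulate_blending`, `variances`, and `noise_sd` are mine.

```python
import random
import statistics

def simulate_blending(n=100, generations=100, noise_sd=0.0, seed=0):
    """Return a list of generations; each generation is a list of n trait values.

    Assumes n is even so that everyone can be paired off.
    """
    rng = random.Random(seed)
    # Generation 0: two merged populations, half coded 1 and half coded 0.
    current = [1.0] * (n // 2) + [0.0] * (n // 2)
    history = [current]
    for _generation in range(generations - 1):
        parents = current[:]
        rng.shuffle(parents)                # random mating, no sexes
        offspring = []
        for i in range(0, n, 2):            # neighbours after the shuffle form a pair
            mid_parent = (parents[i] + parents[i + 1]) / 2
            for _child in range(2):         # two offspring per pair keeps n fixed
                value = mid_parent + rng.gauss(0.0, noise_sd)
                # Assumption: clip so that trait values stay within 0 and 1.
                offspring.append(min(1.0, max(0.0, value)))
        current = offspring
        history.append(current)
    return history

def variances(history):
    """Population variance of the trait in each generation."""
    return [statistics.pvariance(generation) for generation in history]

if __name__ == "__main__":
    no_noise = variances(simulate_blending())
    noisy = variances(simulate_blending(noise_sd=0.05))
    for t in (0, 1, 2, 5, 10, 50):
        print(t, round(no_noise[t], 6), round(noisy[t], 6))
```

The returned `history` is the generations-by-individuals matrix described above. Without noise the variance can only fall, because each generation consists of pair means. With independent noise of standard deviation s added to every child, the recursion V' = V/2 + s² settles near 2s² rather than at zero.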

The figure to the left illustrates the decay in the variance of the trait value over generation time in different models. The red line is the idealized decay: 1/2 decrease in variance per generation. The blue line is one simulation. It roughly follows the decay pattern, though it is deviated somewhat because it seems that there was some assortative mating randomly (presumably if I used many more individuals it would converge upon the analytic curve). Finally you see one line which follows the trajectory of a simulation with noise. Though this population follows the theoretical decay more closely initially, it converges upon a different equilibrium value, one where some variance remains. That’s because the noise parameter continues to inject this every generation. The relevant point is that most of the variation disappears in fewer than 5 generations, and it is basically gone by the 10th generation. To maintain variation in a blending inheritance model requires a great deal of mutation, the extent of which is just not plausible.

To get a different sense of what occurred in these two particular simulations, here are heat maps. Values in the interval between 0 and 1 are now represented by the shading in each cell. I am displaying only 50 generations here. The top panel is one without noise, while the bottom panel has the noise parameter.

The contrast with a Mendelian model is striking. Imagine that 0 and 1 are now coded by two homozygote genotypes, with heterozygotes exhibiting a value of 0.5. If all the variation is controlled by the genotypes, then you have three genotypes, and three trait values. If I change the scenario above to a Mendelian one then variance will initially decrease, but the equilibrium will be maintained at a much higher level, as 50% of the population will be heterozygotes (0.5), and the other 50% homozygotes, split evenly between the two varieties (0 and 1). With the persistence of heritable variation natural selection can operate to change the allele frequencies over time without the worry that the trait values within a breeding population will converge upon each other too rapidly. This is true even in cases of polygenic traits. Height and I.Q. remain variant, because they are fundamentally heritable through discrete and digital processes.
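To make the Mendelian case concrete, here is a small sketch that tracks the expected trait variance under random mating. It assumes an effectively infinite population (so there is no drift), a single locus with two alleles, and no selection or mutation. The function names are illustrative only.

```python
def hardy_weinberg(p):
    """Genotype frequencies keyed by trait value; p is the frequency of the '1' allele."""
    q = 1.0 - p
    return {0.0: q * q, 0.5: 2 * p * q, 1.0: p * p}

def allele_frequency(genotypes):
    """Frequency of the '1' allele: all 1-homozygotes plus half of the heterozygotes."""
    return genotypes[1.0] + 0.5 * genotypes[0.5]

def trait_variance(genotypes):
    """Variance of the trait value across the population."""
    mean = sum(value * freq for value, freq in genotypes.items())
    return sum(freq * (value - mean) ** 2 for value, freq in genotypes.items())

def mendelian_variances(generations=10):
    """Variance per generation, starting from a 50:50 mix of the two homozygotes."""
    genotypes = {0.0: 0.5, 0.5: 0.0, 1.0: 0.5}
    result = [trait_variance(genotypes)]
    for _ in range(generations):
        # Random mating restores Hardy-Weinberg proportions in one generation.
        genotypes = hardy_weinberg(allele_frequency(genotypes))
        result.append(trait_variance(genotypes))
    return result

print(mendelian_variances(5))
```

Under these assumptions the variance falls once, from 0.25 to 0.125, when the heterozygotes appear, and then stays there indefinitely, instead of halving every generation as in the blending model.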

All this is of course why the “blond gene” won’t disappear, redheads won’t go extinct, nor will humans converge upon a uniform olive shade in a panmictic future. A child is a genetic cross between parents, but only between 50% of each parent’s genetic makeup. And that is one reason they are not simply an “averaging” of parental trait values.

January 4, 2013

Mitochondrial Eve: a de facto deception?

The above image, and the one to the left, are screenshots from my father’s 23andMe profile. Interestingly, his mtDNA haplogroup is not particularly common among ethnic Bengalis, who are more than ~80% on a branch of M. This reality is clear in the map above which illustrates the Central Asian distribution of my father’s mtDNA lineage. In contrast, his whole genome is predominantly South Asian, as is evident in the estimate that 23andMe provided via their ancestry composition feature, which utilizes the broader genome. The key takeaway here is that the mtDNA is informative, but it should not be considered to be representative, or anything like the last word on one’s ancestry in this day and age.


As a matter of historical record mtDNA looms large in human population genetics and phylogeography for understandable reasons. Mitochondrial DNA is present in many more copies per cell than nuclear DNA, and so it was the lowest hanging fruit in the pre-PCR era. Additionally, because mtDNA lineages do not recombine they are well suited to a coalescent framework, where an idealized inverted treelike phylogeny converges upon a common ancestor. Finally, mtDNA was presumed to be neutral, so reflective of demographic events unperturbed by adaptation, and characterized by a high mutation rate, yielding a great amount of variation with which to differentiate the branches of the human family tree.

Many of these assumptions are now disputable. But that’s not the point of this post. In the age of dense 1 million marker SNP-chips why are we still focusing on the history of one particular genetic region? In a word: myth. Eve, the primal woman. The “mother of us all,” who even makes cameos in science fiction finales!

In 1987 a paper was published which found that Africans harbored the greatest proportion of mtDNA variation among human populations. Additionally, these lineages coalesced back to a common ancestor on the order of 150,000 years ago. Since mtDNA is present in humans, there was a human alive 150,000 years ago who carried this ancestral lineage, from which all modern lineages derive. Mitochondrial DNA is passed from mothers to their offspring, so this individual must have been a woman. In the press she was labeled Eve, for obvious reasons. The scientific publicity resulted in a rather strange popular reaction, culminating in a Newsweek cover where Adam and Eve are depicted as naked extras from Eddie Murphy’s Coming to America film.

The problem is that people routinely believe that mtDNA Eve was the only ancestress of all modern humans from the period in which she lived. Why they believe this is common sense, and requires no great consideration. The reality is that the story being told by science is the story of mtDNA, with inferences about the populations which serve as hosts for mtDNA being incidental. These inferences need to be made cautiously and with care. It is basic logic that a phylogeny will coalesce back to a common ancestor at some point. Genetic lineages over time go extinct, and so most mtDNA lineages from the time of Eve went extinct. There were many women who were alive during the same time as Eve, who contributed at least as much, perhaps more, to the genetic character of modern humans today. All we can say definitively is that their mtDNA lineage is no longer present. As mtDNA is passed from mother to daughter (males obviously have mtDNA, but we are dead ends, and pass it to no one), all one needs for a woman’s mtDNA lineage to go extinct is for her to have only sons. Though she leaves no imprint on the mtDNA phylogeny, obviously her sons may contribute genes to future generations.
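The inevitability of coalescence is easy to see in a toy model. The sketch below is my own illustration, not anything from the papers discussed: it assumes a constant number of women, non-overlapping generations, and that each woman in the next generation draws her mother at random (a Wright-Fisher scheme). It tracks only the mtDNA label, so it says nothing about how much nuclear ancestry any founder leaves behind.

```python
import random

def surviving_lineages(n_women=200, seed=1, max_generations=100_000):
    """Number of distinct founding mtDNA lineages left in each generation.

    Stops once a single lineage remains (or at max_generations).
    """
    rng = random.Random(seed)
    lineages = list(range(n_women))         # every founder starts with her own label
    counts = [len(set(lineages))]
    for _ in range(max_generations):
        if counts[-1] == 1:
            break
        # Each daughter inherits the mtDNA label of a randomly chosen mother.
        lineages = [rng.choice(lineages) for _ in range(n_women)]
        counts.append(len(set(lineages)))
    return counts

if __name__ == "__main__":
    counts = surviving_lineages()
    print("founding lineages:", counts[0])
    print("generations until one lineage remains:", len(counts) - 1)
```

Whichever founder ends up as the sole surviving label is that toy population’s “Eve.” Nothing marks her out in advance, and the other founders were just as real.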

Prior to ancient DNA and the proliferation of dense SNP data sets scholars were a bit too ambitious about what they believed they could infer from mtDNA and Y lineages (e.g., The Real Eve: Modern Man’s Journey Out of Africa). We are in a different time now; inferences made about the past rest on more than one leg. But the legend of Eve of the mtDNA persists, not because of its compelling scientific nature, but because this is a case where science piggy-backs upon prior conceptual furniture. This yields storytelling power, but a story which is based on a thin basis of fact becomes just another tall tale.

All this is on my mind because one of the scientists involved with Britain’s DNA, Jim Wilson, has penned a response to Vincent Plagnol’s Exaggerations and errors in the promotion of genetic ancestry testing (see here for more on this controversy). Overall I don’t find Wilson’s rebuttal too persuasive. It is well written, but it has the air of sophistry and lawyerly precision. I have appreciated Wilson’s science before, so I am not casting aspersions at his professional competence. Rather, some of the more enthusiastic and uninformed spokespersons for his firm have placed him in a delicate and indefensible situation, and he is gamely attempting to salvage the best of a bad hand. Importantly, he does not reassure me in the least that his firm did not use Britain’s atrocious libel laws as a threat to mute forceful criticism of their business model on scientific grounds. A more general issue here is that Wilson is in a situation where he must not damage the prospects of his firm, all the while maintaining his integrity as a scientist. From what I have seen once science becomes a business one must abandon the pretense of being a scientist first and foremost, no matter how profitable that aura of objectivity may be. The nature of marketing is such that the necessary caution and qualification essential for science becomes a major liability in the process of communicating. It’s about selling, not convincing.

Going back to Eve, Wilson marshals a very strange argument:

“The claim that Adam and Eve really existed, as you suggest, refers to the most recent common ancestors of the mtDNA and non-recombining part of the Y chromosome. I don’t agree that there is nothing special about these individuals: there must have been a reason why mitochondrial Eve was on the front cover of Time magazine in the late 80s!….”

A minor quibble, but I suspect he means the Newsweek cover. More seriously, this line of argumentation is bizarre on scientific grounds. Rather, it is a tack which is more rational when aiming toward a general audience which might purchase a kit which they believe might tell them of their relationship to “Eve.”

In the wake of the discussion at Genomes Unzipped I participated in further exchanges with Graham Coop and Aylwyn Scally on Twitter, and decided to spend 20 minutes this afternoon asking people what they thought about mitochondrial Eve. By “people,” I mean individuals who are pursuing graduate educations in fields such as genetics and forensics. My cursory “field research” left me very alarmed. Naturally these were individuals who did not make elementary mistakes in regards to the concept, but there was great confusion. I can only wonder what’s going through the minds of the public.

Analogies, allusions, and equivalences are useful when they leverage categories and concepts which we are solidly rooted in, and transpose them upon a foreign cognitive landscape. By pointing to similarities of structure and relation one can understand more fully the novel ground which one is exploring. Saying that the president of India is analogous to the queen of England is an informative analogy. These are both positions where the individual is a largely ceremonial head of state. In contrast, the president of the United States and the queen of England are very different figures, because the American executive is not ceremonial at all. This is not a useful analogy, even though superficially it sees no lexical shift.

Who was Eve? A plain reading is that she is the ancestor of all humans, and more importantly, the singular ancestress of all humans back to the dawn of time. This is a concept which the public grasps intuitively. Who is mtDNA Eve? A woman who flourished 150,000 years ago, who happened to carry the mtDNA lineage which would drift to fixation in the ancestors of modern humans. I think this is a very different thing indeed. For purposes of poetry and marketing the utilization of the name Eve is justifiable. But on scientific grounds all it does is confuse, obfuscate, and mislead.

The fiasco that Vincent Plagnol stumbled upon is just a symptom of a broader problem. Scientists need to engage in massive conceptual clean up, as catchy phrases such as “mitochondrial Eve” and “Y Adam” have permeated the culture over the past generation, and misled many sincere and engaged seekers of truth. This is of the essence because personal genomics, and the scientific understanding of genealogy, are now moving out of the ghetto of hobbyists, enthusiasts, and researchers. Though I doubt this industry will be massive, it will be ubiquitous, and a seamless part of our information portfolio. If people still have ideas like mitochondrial Eve in their head it is likely to cloud their perception of the utility of the tools at hand, and their broader significance.

December 21, 2012

The causes of evolutionary genetics

A few days ago I was browsing Haldane’s Sieve, when I stumbled upon an amusing discussion which arose on its “About” page. This “inside baseball” banter got me to thinking about my own intellectual evolution. Over the past few years I’ve been delving more deeply into phylogenetics and phylogeography, enabled by the rise of genomics, the proliferation of ‘big data,’ and accessible software packages. This entailed an opportunity cost. I did not spend much time focusing on classical population and evolutionary genetic questions. Strewn about my room are various textbooks and monographs I’ve collected over the years, and which have fed my intellectual growth. But I must admit that it is a rare day now that I browse Hartl and Clark or The Genetical Theory of Natural Selection without specific aim or mercenary intent.

R. A. Fisher

Like a river inexorably coursing over a floodplain, with the turning of the new year it is now time to take a great bend, and double-back to my roots, such as they are. This is one reason that I am now reading The Founders of Evolutionary Genetics. Fisher, Wright, and Haldane, are like old friends, faded, but not forgotten, while Muller was always but a passing acquaintance. But ideas 100 years old still have power to drive us to explore deep questions which remain unresolved, but where new methods and techniques may shed greater light. A study of the past does not allow us to make wise choices which can determine the future with any certitude, but it may at least increase the luminosity of the tools which we have to illuminate the depths of the darkness. The shape of nature may become just a bit less opaque through our various endeavors.

Figure from “Directional Positive Selection on an Allele of Arbitrary Dominance”, Teshima KM, Przeworski M

So what of this sieve of Haldane? As noted at Haldane’s Sieve the concept is simple. Imagine two mutations, one which expresses a trait in a recessive fashion, and another in a dominant one. The sieve operates by favoring the emergence of dominantly expressing variants out of the low frequency zone where stochastic forces predominate (i.e., even if an allele confers a large fitness benefit, at low frequencies the power of random chance may still imply that it is highly likely to go extinct). An example of this would be lactase persistence, which in the modal Eurasian variant seems to exhibit dominance. The converse case, where beneficial mutations are recessive in expression, suffers from a structural problem where their benefit is more theoretical than realized.

The mathematics of this is exceedingly simple, a consequence of the Hardy-Weinberg dynamics of diploid random mating organisms. Let’s use the gene which is implicated in variation in lactase persistence as an example, LCT. Consider two alleles, LP and LNP, where the former confers persistence (one can digest lactose sugar as an adult), and the latter manifests the conventional mammalian ‘wild type’ (the production of lactase ceases as one leaves the life stage when nursing is feasible). LP is clearly the novel mutant. In a small population it is not unimaginable that by random chance the frequency of LP rises to ~10%. What now? At HWE you have:

$p^2 + 2pq + q^2 = 1$, where $q$ is the frequency of the LP allele. At ~10% the numbers substituted would be:

$(0.90)^2 + 2(0.90)(0.10) + (0.10)^2$

This is where dominance or recessive expression is highly relevant. The reality is that LP is a dominant trait. So in this population the frequency of LP as a trait would be:

$(0.10)^2 + 2(0.90)(0.10) = 0.19$, i.e. 19%

Now imagine a model where LP is favored, but it expresses in a recessive fashion. Then the frequency of the trait would equal $q^2$, the homozygote LP-allele proportion. That is, 1%. Though population genetics is often constructed on an algebraic foundation, the results lend themselves to intuition. A structural parameter endogenous to the genetic system, dominant or recessive expression, can have longstanding consequences in terms of the likely trajectory of the alleles. Selection only “sees” the trait, so a recessive trait with sterling qualities may as well be a trait with no qualities. In contrast, a dominantly expressed allele can cut like a scythe through a population, because every copy “counts.”
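The arithmetic above generalizes to any allele frequency. Here is a short sketch, assuming Hardy-Weinberg proportions and complete dominance or complete recessiveness, with a function name (`trait_frequency`) that I made up for the occasion:

```python
def trait_frequency(q, dominant):
    """Fraction of individuals expressing the trait, given allele frequency q.

    Assumes Hardy-Weinberg proportions and fully dominant or fully recessive expression.
    """
    p = 1.0 - q
    homozygotes = q * q            # two copies of the allele
    heterozygotes = 2 * p * q      # one copy of the allele
    if dominant:
        return homozygotes + heterozygotes
    return homozygotes

for q in (0.01, 0.05, 0.10, 0.50):
    print(f"q = {q:.2f}  dominant: {trait_frequency(q, True):.2%}  "
          f"recessive: {trait_frequency(q, False):.2%}")
```

At q = 0.10 this reproduces the 19% and 1% figures. The gap is widest precisely when the allele is rare, which is when the allele is most at risk of being lost by chance.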

In preparation for this post I revisited the section on Haldane’s Sieve in the encyclopedic Elements of Evolutionary Genetics. The authors note that this phenomenon, though of vintage character as these things can be reckoned in a field as young as evolutionary genetics, is still a live one. The dominance of favored mutations in wild populations, or the recessive character of deleterious ones in laboratory stock, may reflect the different regimes which these two gene pools are subject to. The nature of things is such that it is easier to generate recessive mutations than dominant ones (i.e., loss is easier than gain), so the preponderance of dominant variants in wild stocks subject to positive selective pressure lends credence to the idea that evolutionary rather than developmental forces and constraints shape the genetic character of many species.

And yet things are not quite so tidy. Haldane’s Sieve, and the framework of dominant versus recessive alleles, operates differently in the area of sex chromosomes. In many lineages there is a ‘heterogametic sex’ which carries only one functional chromosome for most of the genome. In mammals this is the male (XY), while in birds this is the female (ZW). As males have only one functional copy of most genes on the sex chromosome, the masking effect of recessive expression does not apply to them in mammals. This may imply that because of the exposure of many deleterious recessive variants to natural selection within the heterogametic sex one would see different allelic distributions and genetic landscapes on these chromosomes (e.g., more rapid adaptation because of the exposure of nominally recessive alleles in the heterogametic sex, as well as more purifying selection on deleterious variants). But the reality is more complex, and the literature in this area is somewhat muddled. More precisely, it seems phylogenetically sensitive. Validation of the theory in mammals founders once one moves to Drosophila.

And that is why research in evolutionary genetics continues. The theory stimulates empirical exploration, and is tested against it. Much of the formal theory of classical evolutionary genetics, which crystallized in the years before World War II, is now gaining renewed relevance because of empirical testability in the era of big data and big computation. This is a domain where the past is not simply of interest to historians. Scientists themselves, chasing the next grant, and producing the expected stream of publications, may benefit from a little historical perspective by standing upon the shoulders of giants.

December 13, 2012

We are Nature

Filed under: Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 8:03 am

There’s an interesting piece in Slate, The Great Schism in the Environmental Movement, which seems to be a distillation of trends which have been bubbling within the modern environmentalist movement for a generation now (I’ve read earlier manifestos in a similar vein). I can’t assess the magnitude of the shift, but here’s the top-line:

But that is a false construct that scientists and scholars have been demolishing the past few decades. Besides, there’s a growing scientific consensus that the contemporary human footprint—our cities, suburban sprawl, dams, agriculture, greenhouse gases, etc.—has so massively transformed the planet as to usher in a new geological epoch. It’s called the Anthropocene.

Modernist greens don’t dispute the ecological tumult associated with the Anthropocene. But this is the world as it is, they say, so we might as well reconcile the needs of people with the needs of nature. To this end, Kareiva advises conservationists to craft “a new vision of a planet in which nature—forests, wetlands, diverse species, and other ancient ecosystems—exists amid a wide variety of modern, human landscapes.”


Let’s take this debate as a given. It is fundamentally normative. That is, it is about values. We need to tread carefully before projecting values across disputants. Far too often in this domain people seem to presume normative alignments, and therefore confuse ideological disagreement for rejection of factual truths. But, one thing to consider is that it is probable that human beings have already radically reshaped the ecological character of the world over the past 100,000 years. The implicit model that many older environmental activists seem to present is a framework pitting man & the machine vs. nature (the Shire vs. Mordor). But it is just not a useful dichotomy for many.

It is possible that there was, and is, no “pristine” nature. These disparate perspectives come to the fore in particular in post-colonial landscapes settled by Europeans. There is a long tradition in these areas of transforming ‘natives’ into ‘Noble Savages,’ who have attained some idealized harmony with Nature. The reality is that it is not harmony that was attained, but equilibrium. The arrival of anatomically modern humans to Australia and the New World resulted in a ‘shock’ to the ecological system, as megafauna went extinct due to the new variable of human predation. Even if H. sapiens were not the sufficient condition for these extinctions (populations naturally go through cycles), it is likely they were necessary (i.e., humans might extirpate species during times of low census size). But it is not just the initial impact in terms of species turnover. Australian and Amerindian populations seem to have reshaped the long term character of the landscape through fire. Charles C. Mann argues in 1491 that the vast forests which colonial and early American settlers cleared were in fact second growth, which emerged in the wake of massive die-offs of indigenous peoples due to Old World disease.

All of this is fundamentally complicated. Instead of a decision tree with two options, ‘Civilization’ vs. ‘Nature,’ there is actually a space populated with a multitude of positions. As someone touched by a moderate amount of biophilia my vision for the future is one of arcology based urbanism, massively scaled up algaculture, and megafaunal rewilding through genetic engineering and ancient DNA. Rather than idealize a mythic past we should endeavor to forge a new future. So it was, and so shall it ever be.

December 12, 2012

A lighter shade of brown: Dan MacArthur, look east or south!

Filed under: Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 2:58 pm

South Indian Udupi cuisine

In the post below I offered up my supposition that Dan MacArthur’s ancestry is unlikely to be Northwest Indian, which precludes a Romani origin for his South Asian ancestry. Indeed this is almost certainly so; Dienekes Pontikos followed up my crude analyses with IBD-sharing calculations (IBD = ‘identity by descent,’ which is basically what you would think it is). The South Asian population which MacArthur has the closest affinity to is from Karnataka, which is one of the Dravidian speaking states of the South. This does not necessarily refute my earlier contention, as aside from Brahmins most Bengalis seem to have broad South Indian affinities, except for the fact that they often have more East Asian ancestry.


Now, I may seem a touch obsessive on this issue at this point. There are several things motivating me. First, this was lying around in plain sight, but we missed it for years! Second, I’ve known Dan for a while, so this is very amusing on a personal level. Third, Dienekes has been pushing me to continue my exploration in a friendly competition. None of this is very difficult, and I’ve been going at it in the early hours of the day before work, or right before I go to sleep. In short, I’m doing this in part to show that you don’t need to just talk genomics, you too can do genomics. Ironically the age of “Big Data” is also the age of distributed data.

Don’t wait to have children!

Filed under: Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 7:34 am

The New Republic has a piece up, How Older Parenthood Will Upend American Society, which won’t have surprising data for readers of this weblog. But it’s nice to see this sort of thing go “mainstream.” My daughter was born when her parents were in their mid-30s, so I know all the statistics. They aren’t good bed-time reading (she’s healthy and robust so far!). If I had to do it over again I definitely wouldn’t have waited this long. Becoming a father brought home to me that waiting was one of the worst decisions of my life. Why postpone something this incredible for the far more prosaic pleasures of an extended adolescence? Granted, I’m not sure that I would have been the best father at 25, but I don’t think there’s much I can say in reply to the argument that I should have become a father by 30.

More concretely, we would have had sperm and egg “banked” if we had been smart about delaying parenthood. The article notes that storage of sperm costs $850 up front, and $300 to $500 per year after that, and that many balk at the cost. And how much do you spend on your cell phone every year? The issue here seems to be time preference.


Of course one of the aspects of the piece which is implicit is its class-orientation. It finishes up:

It won’t be easy to make the world more baby-friendly, but if we were to try, we’d have to restructure the professions so that the most intensely competitive stage of a career doesn’t occur right at the moment when couples should be lavishing attention on infants….

Though the fertility transition has occurred across all groups, it has been most striking among the professional classes. But the professional classes are a minority of the population. The reality is that we’d expect fertility to rebound over time as professional class people, whether for genetic or cultural (likely both) reasons, fail to reproduce, and are replaced by those who produce large families early, and continue to do so in a heritable fashion.

 
The plot to the left illustrates the dynamic which I’m alluding to. If at t = 0 you have two populations where one outnumbers the second by a factor of 100, and the population which is at ~1% proportion has a doubling-time of 30 years vs. replacement for the majority, then it takes ~200 years for the minority to outnumber the majority. Of course, we need to be careful about projections, something I’ve criticized more gullible or historically naive readers about. Mormon fundamentalists or Old Order Amish have very high fertility rates in the context of being embedded in advanced societies which have very low fertility (at least in relation). But context-free models are useless. For example, some hyper-fertile groups, such as Mormon fundamentalists and Hasidic Jews, have been known to exploit the welfare state to subsidize their natalism. Such a subsidy lasts only so long as the ratio between producers and consumers favors the former. If a trend is unsustainable, it won’t be sustained. In the short term the “limits to growth” are probably not capped by resources, but human capital, which allows us to have increasing marginal returns. Systems do return to equilibria eventually.
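For anyone who wants to check the ~200 year figure, here is a minimal Python sketch of the toy model. It assumes the simplest possible reading: the majority stays exactly at replacement (constant size) and the minority doubles every 30 years with no mixing between the groups. The function names are mine, purely for illustration.

```python
import math


def years_until_parity(ratio, doubling_time):
    """Years for a doubling minority to equal a constant-size majority.

    ratio is how many times larger the majority is at t = 0.
    The minority needs log2(ratio) doublings to catch up.
    """
    return doubling_time * math.log2(ratio)


def minority_share(years, ratio=100.0, doubling_time=30.0):
    """Minority fraction of the total population after the given years."""
    minority = 2.0 ** (years / doubling_time)  # minority starts at size 1
    majority = ratio                           # majority stays at replacement
    return minority / (minority + majority)


if __name__ == "__main__":
    print(f"parity after {years_until_parity(100, 30):.1f} years")
    for t in (0, 60, 120, 180, 210):
        print(t, f"{minority_share(t):.1%}")
```

A ratio of 100 needs a little under seven doublings, which is where the roughly two centuries comes from. The sketch is as context-free as the plot, so the same caveats apply.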

As I don’t see a shift back to younger families among the professional classes, what’s the solution? Massive sperm and egg banking of 20 year old individuals seems like an economically feasible and effective strategy. I wouldn’t be surprised if the math works out that this reduces our long term health care costs, by cutting down on children born with congenital defects and decreasing the lifetime morbidity rate of the population as a whole. Want to bend the cost curve? This might be a way! Combined with pre-implantation genetic diagnosis the ‘solution’ then is probably more technology, not less.

The long term fix of course is to reform the professional system and the culture of careerism so that what is rewarded is genuine productivity, rather than signalling. By this, I mean that I suspect many people who work “80 hours a week” aren’t all that productive beyond about 50, but they do signal to their superiors that they are “committed.” And of course now the whole peer group has to shift toward this irrational norm. Better workplace monitoring and productivity assessments might be a way to address this issue, so that superiors can actually evaluate gains to hours of putative labor.

Finally, the General Social Survey asks people about the ideal number of children they’d like to have, versus the actual number of children they do have (CHLDIDEL and CHILDS). To remove demographic confounds I limited the sample below to non-Hispanic white women age 45 and up between the years 2006 and 2010, and compared across educational attainment.

December 11, 2012

We don’t know why Ethiopians breathe easy

Filed under: Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 9:40 pm

Most people are aware that altitude imposes constraints on individual performance and function. Much of this is flexible; athletes who train at high altitudes may gain a performance edge. But over the long term there are costs, just as there are with computers which are ‘overclocked.’ This is the point where you make the transition from physiology to evolution. Residence at high altitude entails strong selective pressures on populations. Over the past few years there has been a great deal of exploration of the genetics of long resident high altitude groups, the Tibetans, Peruvians, and Ethiopians.

Why is high altitude adaptation of interest? First, it’s clear and distinct. Coding the two phenotypes is not that difficult. Second, the adaptive value is clear. Though people can survive at high altitudes, most do not flourish. And third, you have several distinct populations which are phylogenetically diverse. In other words, you can test evolution across multiple lineages, and see if the phenotypic and genetic features resemble each other, or differ. To some extent we know the answer: though there are overlaps, altitude adaptations differ both in their physiological presentation, and in their genetic architecture. Because it is such a powerful force, the shape of adaptation to high altitudes gives us a better sense of the arc of evolution across diverse populations and times.

With all that under our belt, I have to admit that a new paper in PLoS Genetics, The Genetic Architecture of Adaptations to High Altitude in Ethiopia, is quite peculiar in my opinion. To not put too fine a point on it, it seems hurried and confused. Granted, from what I can gather the authors confirmed the likelihood that the genetic architecture of high altitude adaptations in Ethiopian populations is such that variance exhibits a more polygenic character, resulting in weaker ability to detect genes of large effect. But if you read the paper you notice that they go off on a particular tangent on the adaptation to diseases on immune loci. Much of the work over the past 10 years on recent evolution in humans has focused on immune response, so I don’t see why that was necessary for the broader paper.

Additionally, there’s one element which I think needs to be explored more: the phylogenetic history of the Ethiopian people. The paper makes the blithe assertion that Ethiopians have been resident in the highlands of that nation for tens of thousands of years, but the latest research is that that just isn’t so. In short, it may be that Ethiopians as we understand them are recent compounds of a long resident Sub-Saharan African population and West Eurasian migrants. I’m not sure that this necessarily impacts the high altitude adaptations, as the admixed populations would presumably inherit the alleles from the local residents if they did not exterminate them. But, it does seem to complicate the story that you’d want to tell.

Overall, I think this sort of paper, which follows on the heels of other work, should have been presented in a more explicitly evolutionary context. If these adaptations are old, then there may be many variants in the highlands of Ethiopia. In contrast, the Peruvian adaptations are presumably new, and from what I have been told they are more like improvisations on the fly. In a way the spectrum of subtlety of adaptation, with Ethiopians being the most nuanced, and Peruvians the least (with Tibetans in the middle), gives us a sense of the different power of evolution on the scale of 50,000 years vs. 10,000 years. Additionally, the region around Ethiopia has long been inhabited by hominins. No prior population was resident on the Altiplano before the arrival of the first Amerindians ~10,000 years before the present.

I’m hoping that the Nielsen Group comes out with a better paper. I know they’re going to revisit the topic of altitude adaptation (so I was told at ASHG 2012).

Citation: Alkorta-Aranburu G, Beall CM, Witonsky DB, Gebremedhin A, Pritchard JK, et al. (2012) The Genetic Architecture of Adaptations to High Altitude in Ethiopia. PLoS Genet 8(12): e1003110. doi:10.1371/journal.pgen.1003110

December 10, 2012

Is Daniel MacArthur ‘desi’?

My initial inclination in this post was to discuss a recent ordering snafu which resulted in many of my friends being quite peeved at 23andMe. But browsing through their new ‘ancestry composition’ feature I thought I had to discuss it first, because of some nerd-level intrigue. Though I agree with many of Dienekes’ concerns about this new feature, I have to admit that at least this method doesn’t give out positively misleading results. For example, I had complained earlier that ‘ancestry painting’ gave literally crazy results when they weren’t trivial. It said I was ~60 percent European, which makes some coherent sense in their non-optimal reference population set, but then stated that my daughter was >90 percent European. Since 23andMe did confirm she was 50% identical by descent with me these results didn’t make sense; some readers suggested that there was a strong bias in their algorithms to assign ambiguous genomic segments to ‘European’ heritage (this was a problem for East Africans too).

Here’s my daughter’s new chromosome painting:

One aspect of 23andMe’s new ancestry composition feature is that it is very Eurocentric. But, most of the customers are white, and presumably the reference populations they used (which are from customers) are also white. Though there are plenty of public domain non-white data sets they could have used, I assume they’d prefer to eat their own data dog-food in this case. But that’s really a minor gripe in the grand scheme of things. This is a huge upgrade from what came before. Now, it’s not telling me, as a South Asian, very much. But, it’s not telling me ludicrous things anymore either!

But in regards to omission I am curious to know why this new feature rates my family as only ~3% East Asian, when other analyses put us in the 10-15% range. The problem with very high values is that South Asians often have some residual ‘eastern’ signal, which I suspect is not real admixture, but is an artifact. Nevertheless, northeast Indians, including Bengalis, often have genuine East Asian admixture. On PCA plots my family is shifted considerably toward East Asians. The signal they are picking up probably isn’t noise. Almost every apportionment of East Asian ancestry I’ve seen for my family yields a greater value for my mother, and that holds here. It’s just that the values are implausibly low.

In any case, that’s not the strangest thing I saw. I was clicking around people who I had “shared” genomes with, and I stumbled upon this:

As you can guess from the screenshot this is Daniel MacArthur’s profile. And according to this ~25% of chromosome 10 is South Asian! On first blush this seemed totally nonsensical to me, so I clicked around other profiles of people of similar Northern European background…and I didn’t see anything equivalent.

What to do? It’s going to take more evidence than this to shake my prior assumptions, so I downloaded Dr. MacArthur’s genotype. Then I merged it with three HapMap populations, the Utah whites (CEU), the Gujaratis (GIH), and the Chinese from Denver (CHD). The last was basically a control. I pulled out chromosome 10. I also added Dan’s wife Ilana to the data set, since I believe she got typed with the same Illumina chip, and is of similar ethnic background (i.e., very white). It is important to note that only 28,000 SNPs remained in the data set. But usually 10,000 is more than sufficient on SNP data for model-based clustering with inter-continental scale variation.

I did two things:

1) I ran ADMIXTURE at K = 3, unsupervised

2) I ran an MDS, which visualized the genetic variation in multiple dimensions

Before I go on, I will state what I found: these methods supported the inference from 23andMe. On chromosome 10 Dr. MacArthur seems to have an affinity with South Asians (i.e., this is his ‘curry chromosome’). Here are the average (median) values in tabular format, with MacArthur and his wife presented for comparison.

ADMIXTURE results for chromosome 10

Population          K 1    K 2    K 3
CEU                 0.04   0.02   0.93
GIH                 0.87   0.05   0.08
CHD                 0.01   0.97   0.01
Daniel MacArthur    0.29   0.07   0.64
Ilana Fisher        0.01   0.06   0.94
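For those who want to build a summary like this themselves, here is a minimal Python sketch of taking per-population medians. It assumes the ADMIXTURE output has already been parsed into one row of K proportions per individual, in the same order as a list of population labels. The toy numbers are invented for illustration and are not my actual output.

```python
from collections import defaultdict
from statistics import median


def median_q_by_population(q_rows, pop_labels):
    """Median of each ancestry component within each population.

    q_rows: one list of K proportions per individual.
    pop_labels: population label for each individual, in the same order.
    """
    grouped = defaultdict(list)
    for row, pop in zip(q_rows, pop_labels):
        grouped[pop].append(row)
    # zip(*rows) turns per-individual rows into per-component columns
    return {
        pop: [median(column) for column in zip(*rows)]
        for pop, rows in grouped.items()
    }


if __name__ == "__main__":
    # Invented toy values, K = 3
    toy_q = [
        [0.05, 0.02, 0.93],
        [0.03, 0.01, 0.96],
        [0.04, 0.03, 0.93],
        [0.88, 0.04, 0.08],
        [0.86, 0.06, 0.08],
    ]
    toy_pops = ["CEU", "CEU", "CEU", "GIH", "GIH"]
    for pop, medians in median_q_by_population(toy_q, toy_pops).items():
        print(pop, [round(m, 2) for m in medians])
```

One side effect of using medians: each component is summarized separately, so a population's row need not sum to exactly 1.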

You probably want a distribution. Out of the non-founder CEU sample none went above 20% South Asian. Though it did surprise me that a few were that high, making it more plausible to me that MacArthur’s results on chromosome 10 were a fluke:

And here’s the MDS with the two largest dimensions:

Again, it’s evident that this chromosome 10 is shifted toward South Asians. If I had more time right now what I’d do is probably get that specific chromosomal segment, phase it, and then compare it to various South Asian populations. But I don’t have time now, so I went and checked out the results from the Interpretome. I cranked up the settings to reduce the noise, and so that it would only spit out the most robust and significant results. As you can see, again chromosome 10 comes up as the one which isn’t quite like the others.

Is there a plausible explanation for this? Perhaps Dr. MacArthur can call up a helpful relative? From what I recall his parents are immigrants from the United Kingdom, and it isn’t unheard of that white Britons do have South Asian ancestry which dates back to the 19th century. Though to be totally honest I’m rather agnostic about all this right now. This genotype has been “out” for years now, so how is it that no one has noticed this peculiarity??? Perhaps the issue is that everyone was looking at the genome wide average, and it just doesn’t rise to the level of notice? What I really want to do is look at the distribution of all chromosomes and see how Daniel MacArthur’s chromosome 10 then stacks up. It might be a random act of nature yet.

Also, I guess I should add that at ~1.5% South Asian that would be consistent with one of MacArthur’s great-great-great-great grandparents being Indian. Assuming 25 year generation times that puts them in the mid-19th century. Of course, at such a low proportion the variance is going to be high, so it is quite possible that you need to push the real date of admixture one generation back, or one generation forward.
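The arithmetic behind that estimate is just repeated halving, and can be sketched in a few lines of Python. This assumes a single ancestor from the source population and ignores the variance from recombination mentioned above. The function names are hypothetical, chosen for illustration.

```python
import math


def expected_fraction(generations_back):
    """Expected autosomal share from one ancestor that many generations back."""
    return 0.5 ** generations_back


def greats_to_generations(n_greats):
    """Parents are 1 generation back, grandparents 2, so great^n grandparents are n + 2."""
    return n_greats + 2


def generations_for_fraction(fraction):
    """Generations back at which a single ancestor contributes this share on average."""
    return -math.log2(fraction)


if __name__ == "__main__":
    g = greats_to_generations(4)  # great-great-great-great grandparent
    print(g, f"{expected_fraction(g):.2%}")
    print(f"{generations_for_fraction(0.015):.2f} generations for 1.5%")
    print(g * 25, "years before the descendant's birth at 25 years per generation")
```

A great-great-great-great grandparent is six generations back, for an expected 1/64 of the genome, which is where the ~1.5% comes from. Shifting one generation either way gives roughly 3% or 0.8%.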

December 9, 2012

The origins of the Romani determined definitively

Filed under: Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 11:52 am

In many cases there are questions of a historical and ethnographic nature which are subject to controversy and debate. Scholarly arguments are laid out, and further dispute ensues. For decades progress seems fleeting, as one hypothesis is accepted, only to be subject to later revision. This sort of pattern gives succor to the most cynical and jaded of the ‘Post Modern’ set, especially when the ‘discourse’ in question is in the domain of science.

But thankfully these debates can come to an end in some cases. So it is with the origins of the European Romani, better known as ‘Gypsies’ (though the Roma are the most well known of the Romani, other groups within Europe have different ethnonyms). Obviously many of the basic elements have long been there, but I think the most recent genetic work now establishes a level of closure. Taking a step back, what do we know?

1) The Romani language seems to be Indo-Aryan, with a likely affinity with the northwest group of Indo-Aryan languages

2) The Romani presence in Europe only dates to the past ~1,000 years, with an entry point in the Byzantine Empire

3) They are an admixture between an ancestral Indian element, and local populations

4) Their history of endogamy has resulted in a strong genetic drift effect

The two papers which seem to nail the coffin shut on these questions use somewhat different methodologies. One relies on Y chromosomal STRs (hypervariable repeat regions) to generate a paternal phylogeny. Focusing just on the paternal phylogeny allows for one to make very robust genealogical inferences. Additionally, the authors had a very large data set across India. Their goal was to ascertain the exact region of origin of the Romani before they left India. As noted in bullet #1 there is already some evidence from their language that this must be in northwest India. The second paper uses a SNP-chip; hundreds of thousands of autosomal markers. This has been done to death for other populations, so the method isn’t new. Rather, it is that it is now being applied to the Romani.

First, the Y chromosomal paper. The Phylogeography of Y-Chromosome Haplogroup H1a1a-M82 Reveals the Likely Indian Origin of the European Romani Populations:

Linguistic and genetic studies on Roma populations inhabited in Europe have unequivocally traced these populations to the Indian subcontinent. However, the exact parental population group and time of the out-of-India dispersal have remained disputed. In the absence of archaeological records and with only scanty historical documentation of the Roma, comparative linguistic studies were the first to identify their Indian origin. Recently, molecular studies on the basis of disease-causing mutations and haploid DNA markers (i.e. mtDNA and Y-chromosome) supported the linguistic view. The presence of Indian-specific Y-chromosome haplogroup H1a1a-M82 and mtDNA haplogroups M5a1, M18 and M35b among Roma has corroborated that their South Asian origins and later admixture with Near Eastern and European populations. However, previous studies have left unanswered questions about the exact parental population groups in South Asia. Here we present a detailed phylogeographical study of Y-chromosomal haplogroup H1a1a-M82 in a data set of more than 10,000 global samples to discern a more precise ancestral source of European Romani populations. The phylogeographical patterns and diversity estimates indicate an early origin of this haplogroup in the Indian subcontinent and its further expansion to other regions. Tellingly, the short tandem repeat (STR) based network of H1a1a-M82 lineages displayed the closest connection of Romani haplotypes with the traditional scheduled caste and scheduled tribe population groups of northwestern India.

 

Two trees illustrate the results succinctly:

The bottom line:

- This particular Y chromosomal lineage which is highly diagnostic of South Asian origin in the Romani shows that the Romani seem to derive from the populations of northwest India

- Additionally, within these populations the Romani Y chromosomal lineages derive from the lower caste elements, the scheduled castes and scheduled tribes

But the above results don’t get directly at genome-wide admixture. The second paper does, using hundreds of thousands of markers to explore the Romani affinity to other populations. Reconstructing the Population History of European Romani from Genome-wide Data:

The Romani, the largest European minority group with approximately 11 million people…constitute a mosaic of languages, religions, and lifestyles while sharing a distinct social heritage. Linguistic…and genetic…studies have located the Romani origins in the Indian subcontinent. However, a genome-wide perspective on Romani origins and population substructure, as well as a detailed reconstruction of their demographic history, has yet to be provided. Our analyses based on genome-wide data from 13 Romani groups collected across Europe suggest that the Romani diaspora constitutes a single initial founder population that originated in north/northwestern India ∼1.5 thousand years ago (kya). Our results further indicate that after a rapid migration with moderate gene flow from the Near or Middle East, the European spread of the Romani people was via the Balkans starting ∼0.9 kya. The strong population substructure and high levels of homozygosity we found in the European Romani are in line with genetic isolation as well as differential gene flow in time and space with non-Romani Europeans. Overall, our genome-wide study sheds new light on the origins and demographic history of European Romani.

The plot to the left illustrates the relationship of the Romani to world-wide populations using multi-dimensional scaling, where genetic variation is decomposed into dimensions, and individuals are plotted on those dimensions. In short, the Romani exhibit a classic admixture cline pattern. That is, they are the products of a two-way admixture between populations which occupy distinct positions along a cline, and Romani individuals and populations are distributed along the cline in proportion to their admixture. One notable aspect is that the Romani are actually two clusters; one which manifests a strong ‘east’-‘west’ distribution, and another which seems located purely within the European cluster. The latter seems to be the Welsh Romani, who in the neighbor-joining tree (see the supplements) fall on the same branch as European populations, as opposed to the other Romani, who form their own clade.

To drill down further you need to ascertain admixture with a model-based clustering algorithm. Ergo, ADMIXTURE. I’ve reedited the figure to illustrate the salient points. In particular, it is clear that the Roma populations except the Welsh have significant South Asian ancestry. The question is how much? To answer this question you need to know the source population in South Asia. A peculiar aspect of this plot is that the Romani have very little of the green ancestral component, which happens to be modal in the Middle East (not shown). This element happens to be highly enriched in many Pakistani populations, but not necessarily northwest Indian ones. Nevertheless, the issue that leaves me suspicious of this particular finding is that many of the European populations, in particular those groups (e.g., Balkans) which may have admixed with the Romani, have this element to an extent not evident in one of their presumed ‘daughter’ populations. I wonder if perhaps the peculiarities of Romani inbreeding have skewed the allele frequency distribution so much that you get strangeness like this. I am not showing higher K’s because those break out with a Romani-cluster. Just like the Kalash-cluster this is to a great extent a feature of the long term endogamy of these communities. With high levels of drift the allele frequency of these groups moves into a very peculiar space in relation to their parental populations, but one must not become confused and assume that the Romani or Kalash are themselves appropriate independent clusters in the same way that Europeans or East Asians are.

Using various forms of admixture analysis the authors seem to conclude that the Balkan Romani are 30-50% South Asian. This seems in line with intuition. But that still leaves open the question of who those South Asians were. As I noted above the most thorough Y chromosomal data point to the lower caste elements of northwest India. What do the autosomes say?

I don’t want to get into the technical details of how they tested the models, but it seems that one of the likely parental populations to the Romani had a close relationship to the Meghwal, a scheduled caste from northwest India. In other words, the autosome results align very well with the Y chromosomal inferences. Additionally, the models tested imply that the Romani likely left South Asia ~1,000 years before the present, which aligns well with what is known from the historical record (though this is a case where I put much more stock in the historical record than inferences from population genetic models; look at the intervals).

Finally, there is the question of inbreeding. One aspect of the Romani genome that jumps out at you is that they have many long “runs-of-homozygosity” (ROH). This is totally expected, as decades of uniparental analyses suggested a great deal of population bottleneck events as the Romani spread throughout Europe. But the ROH patterns also unearth an interesting fact: some of the Balkan Romani clearly have recent European admixture, while the non-Balkan Romani had an initial period of admixture followed by endogamy. The latter scenario seems to resemble Ashkenazi Jews, while the former would suggest that the boundary between Romani and non-Romani in the Balkans is more fluid than is sometimes portrayed.

So there we have it. The Romani derive from lower caste populations from the northwest Indian subcontinent who seem to have left ~1,000 years ago. Over time they admixed with local populations, and are now 50-70% non-South Asian, with some groups being ~90% European (e.g., Welsh Romani). And, they have a long history as an endogamous group, judging by their inbreeding.

December 2, 2012

TreeMix: Who were the West Eurasian ancestors of Ethiopians?

Filed under: Anthroplogy,Ethiopia,Genetics,Genomics — Razib Khan @ 3:46 pm

One of the primary concerns/questions I had about Luca Pagani’s paper on the genetic origin of Ethiopians is that he found that their West Eurasian ancestor was closer to Levantine than Arabian. I was confused by this because on model-based clustering (e.g., Admixture) when you push down to a fine level of granularity you always see that the Ethiopians cluster with the Yemenis for their non-African ancestry. More precisely, Yemeni Jews are often ~100% component X, which is ~50% of the ancestry of Ethiopians.

From what I recall Pagani et al. used haplotype windows which they assigned to Eurasian or African ancestral components, and they compared these to the populations related to the putative ancestral groups. Because Pagani et al. used blocks of the genome, rather than just specific genotypes, I weight their finding more strongly. But I wanted to double check with TreeMix if the finding in Admixture was peculiar.

So again, I took a ~150,000 SNP set and ran it on TreeMix with migration = 5.

Again, you see that the gene flow to the Ethiopians is coming from a position on the tree rather close to Yemenite Jews. One model which may explain this, and still align with Pagani’s findings, is that Arabians themselves are a synthetic population. A “pure” Yemenite Jew may have ancient admixture of African affinity beneath an intrusive element from the north. The parallelism between Ethiopia and Arabia in this model is clear, with the major difference being magnitude of the source population admixture (greater in Arabia), as well as some differences of the target population.

This again reminds us to be careful about trusting first-blush summaries.

Layering genetic histories

Filed under: Anthroplogy,Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 12:14 pm

As a follow up to my post from yesterday, I decided to run TreeMix on a data set I happened to have had on hand (see Inference of Population Splits and Mixtures from Genome-Wide Allele Frequency Data for more on TreeMix). Basically I wanted to display a tree with, and without, gene flow.

The technical details are straightforward. I LD pruned ~550,000 SNPs down to ~150,000. I ran TreeMix without and with migration parameters with the Bantu Kenya population being the root. Finally, when I did turn on the migration parameter I set it for 5. You can see the results below.

Most of the flows are pretty expected. The West Eurasian flow from the Turks to the Uygurs makes sense, because there is a large West Asian component to what the Uygurs have (from East Iranians?). The Chuvash are a Turkic group with a minor, but significant, Turkic component. The HGDP Russian sample does have some East Eurasian ancestry. And the Moroccans also have African ancestry. But your guess is as good as mine with the Bantu flow in. These are, I think, from Kenya, so it might be trying to interpret Nilotic admixture as generalized Eurasian.

A minor note: installing TreeMix and generating the appropriate files from pedigree format is not too difficult. But you might have confusion in how to generate the pedigree input file. You do it like so in PLINK:

./plink --noweb --bfile YourFile --freq --within YourGroupNamesFile --out YourOutPutFile

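It’s the output of that last option (the stratified frequency file) which you want to put into TreeMix’s python conversion script. The YourGroupNamesFile is basically the first two columns of the .fam file (family and individual IDs) with an extra column, the population name for each individual.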
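If you’d rather script the group-names file than edit it by hand, here is a minimal Python sketch. It assumes PLINK’s documented cluster-file layout for --within (family ID, individual ID, cluster name, whitespace separated) and a lookup from individual ID to population that you supply yourself. The function name, the toy IDs, and the population labels are all hypothetical.

```python
def within_lines(fam_lines, pop_by_iid):
    """Build 'FID IID POPULATION' rows for PLINK's --within option.

    fam_lines: lines of a .fam file (FID, IID, father, mother, sex, phenotype).
    pop_by_iid: dict mapping each individual ID to a population name.
    """
    rows = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        fid, iid = fields[0], fields[1]
        rows.append(f"{fid} {iid} {pop_by_iid[iid]}")
    return rows


if __name__ == "__main__":
    # Toy .fam content and labels, invented for illustration
    toy_fam = [
        "F1 ind1 0 0 1 -9",
        "F2 ind2 0 0 2 -9",
    ]
    toy_pops = {"ind1": "BantuKenya", "ind2": "Yemenite_Jew"}
    for row in within_lines(toy_fam, toy_pops):
        print(row)
```

Writing the rows to disk is then a matter of joining them with newlines. Keep population names free of spaces, since the file is whitespace-delimited.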
