Razib Khan One-stop-shopping for all of my content

October 11, 2017

The architecture of skin color variation in Africa

Filed under: Human Genetics,Human Genome,Human Genomics,Pigmentation — Razib Khan @ 3:20 pm
Baby of hunter-gatherers in Southern Africa

Very interesting abstract of a plenary presentation at the ASHG meeting, Novel loci associated with skin pigmentation identified in African populations. This is clearly the work that one of the comments on this weblog alluded to last summer during SMBE. There I was talking about the likely introduction of the derived SLC24A5 variant to the Khoisan peoples and its positive selection in peoples in southern Africa.

Below is the abstract in full. Those who follow the literature on this see the usual suspects in relation to genes, but also new ones:

Despite the wide range of variation in skin pigmentation in Africans, little is known about its genetic basis. To investigate this question we performed a GWAS on pigmentation in 1,593 Africans from populations in Ethiopia, Tanzania, and Botswana. We identify significantly associated loci in or near SLC24A5, MFSD12, TMEM138…, OCA2 and HERC2. Allele frequencies at these loci in global populations are strongly correlated with UV exposure. At SLC24A5 we find that a non-synonymous mutation associated with depigmentation in non-Africans was introduced into East Africa by gene flow, and subsequently rose to high frequency. At MFSD12, we identify novel variants that are strongly correlated with dark pigmentation in populations with Nilo-Saharan ancestry. Functional assays reveal that MFSD12 codes for a lysosomal protein that influences pigmentation in cultured melanocytes, zebrafish and mice. CRISPR knockouts of murine Mfsd12 display reduced pheomelanin pigmentation similar to the grizzled mouse mutant (gr/gr). Exome sequencing of gr/gr mice identified a 9 bp in-frame deletion in exon two of Mfsd12. Thus, using human GWAS data we were able to map a classic mouse pigmentation mutant. At TMEM138… we identify mutations in melanocyte-specific regulatory regions associated with expression of UV response genes. Variants associated with light pigmentation at this locus show evidence of a selective sweep in Eurasians. At OCA2 and HERC2 we identify novel variants associated with pigmentation and at OCA2, the oculocutaneous albinism II gene, we find evidence for balancing selection maintaining alleles associated with both light and dark skin pigmentation. We observe at all loci that variants associated with dark pigmentation in African populations are identical by descent in southern Asian and Australo-Melanesian populations and did not arise due to convergent evolution.
Further, the alleles associated with skin pigmentation at all loci but SLC24A5 are ancient, predating the origin of modern humans. The ancestral alleles at the majority of predicted causal SNPs are associated with light skin, raising the possibility that the ancestors of modern humans could have had relatively light skin color, as is observed in the San population today. This study sheds new light on the evolutionary history of pigmentation in humans.

Much of this is not surprising. Looking at patterns of variation around pigmentation loci researchers suggested years ago that Melanesians and Africans exhibited evidence of similarity and functional constraint. That is, the dark skin alleles date back to Africa and did not deviate from their state due to selection pressures. In contrast, light skin alleles in places like eastern and western Eurasia are quite different.

Nyakim Gatwech

This abstract also confirms something I said in a comment on the same thread, that Nilotic peoples are the ones likely to have been subject to selection for dark skin in the last 10,000 years. You see above that variants on MFSD12 are correlated with dark complexion, in particular in Nilo-Saharan groups. The model Nyakim Gatwech is of South Sudanese nationality and has a social media account famous for spotlighting her dark skin. Gatwech and the San Bushman child above are so different in color that I think it would be clear these two individuals come from very distinct populations.

The fascinating element of this abstract is the finding that most of the alleles which are correlated with lighter skin are very ancient and that they are the ancestral alleles more often than the derived! We’ll have to wait until the paper comes out. My assumption is that after the presentation Science will put it on their website. But until then here are some comments:

  • There is obviously a bias in the studies of pigmentation toward those which highlight European variability.
  • The theory of balancing selection makes sense to me because ancient DNA is showing OCA2 “blue eye” alleles which are not ancestral in places outside of Western Europe. And in East Asia there are their own variants.
  • Lots of variance in pigmentation not accounted for in mixed populations (again, lots of the early genomic studies focused on populations which were highly diverged and had nearly fixed differences). Presumably, African research will pick a lot of this up.
  • This also should make us skeptical of the idea that Western Europeans were necessarily very dark skinned, as now we know that human pigmentation architecture is complex enough that sampling modern populations expands our understanding a great deal.
  • Finally, it’s long been assumed that at some stage early on humans were light skinned on most of their body because we had fur. When we lost our fur is when we would need to have developed dark skin. This abstract is not clear on how long ago light and dark alleles coalesce to common ancestors.

October 7, 2017

The Tibeto-Burman and Austro-Asiatic ancestry of Bengalis

Filed under: Bangladeshi,Bengali,Bengali Genetics,Human Genetics,Human Genomics — Razib Khan @ 11:53 am
My father’s mtDNA lineage phylogeography

When I first got my father’s 23andMe results the Y and mtDNA were an interesting contrast. He, and therefore myself, carried Y lineage R1a1a, the lord of the paternal lineages. That was not that great a surprise. In the 1000 Genomes results for the Bangladeshi sample 20% of the men were direct paternal descendants of the R1a1a progenitor.

The mtDNA was a surprise. It was G1a2. This was curious to me since Bangladesh has some of the highest frequencies in the world of haplogroup M, the subhaplogroups in question being mostly restricted to South Asia. I wasn’t surprised that I was R1a1a, but I was even more confident that my maternal lineage was going to be an M, as would my father’s (my own mtDNA is U2b, not common, but not so surprising). As you can see from the map 23andMe places my father’s maternal lineage somewhere in Northeast Asia. The only information I could get about the geography was for G1a, “G1a has been found in samples from China (Daur, Hui, Kazakh, Korean, Manchu, and a sample of the general population of the city of Shenyang), Japan, Korea, Vietnam, and Siberia (Yakut).”

The biggest sample of mtDNA results from Bangladesh I could find, at N = 240, does not find any G at all, let alone G1a2. So this is clearly a rare haplogroup in the region. But the authors do classify 13% of the Bangladeshis as carrying an “East Eurasian” haplogroup. Haplogroup A is found among Southeast Asians and in Southern China, though not among Austronesians. Haplogroup F seems to have a similar distribution, as do D and B. The other haplogroups also seem “correctly” assigned in terms of modal distribution. They are all mostly East Asian.

Looking at the Y chromosome haplogroups in the 1000 Genomes there are two each of O2 and O3, and one of C3, which are clearly of Southeast Asian origin. With N = 5 out of 44 samples that is ~11%. O2 is interesting because it is found at very high frequencies among the Austro-Asiatic populations in South Asia, whether it be the Khasi, or Munda groups (general O2a). O3 seems associated with Tibeto-Burman populations, and C3 with East Asia more generally.

If you know much about the ethnolinguistics of South Asia you know that the two major language families are Indo-Aryan and Dravidian. But there are other groups. In the northwest you have various other Indo-European speaking populations, and along the northern and northeast fringe, you have Tibeto-Burman languages being spoken. But most anomalous is the distribution of Austro-Asiatic languages. The most numerous Austro-Asiatic language in the world today is Vietnamese, followed by the language of the Khmers.

But there are numerous other Austro-Asiatic languages in Southeast and South Asia. The indigenous people of the deep forests of the Malay peninsula, including the Negritos, speak Austro-Asiatic languages. As one moves west there are Austro-Asiatic languages in Burma, such as Mon, which used to be far more common. And in India there are two groups: the language of the Khasi of the northeast, which seems to share some affinity with the Palaungic dialects of interior Burma and southern China, and the Munda languages farthest west, which seem very distinct from all the other branches.

The genetics seems to suggest that the Munda tribes do have East Asian ancestry, but it is almost totally male-mediated. Their Y chromosomal lineages are very distinctive, with high proportions of O2a, but their mtDNA lineages are overwhelmingly South Asian macro-haplogroup M. The Khasi of the hills north of Bangladesh occupy a different position, with both maternal and paternal East Asian heritage, as well as much higher genome-wide ancestry that is not South Asian. At this point, I am convinced that the Austro-Asiatic language groups came into South Asia from the east to the west.

The other language family with East Asian connections in South Asia is that of the Tibeto-Burmans. Unlike the Austro-Asiatic group, these peoples tend to occupy only the periphery of South Asia, the far north and east.

Finally, there are historically attested Tai peoples who migrated into South Asia. The most famous of these are the Ahoms of Assam. These were part of the same migrations ~1,000 years ago that led to the shift of Thailand from being a zone dominated by Mon and Khmer Austro-Asiatic peoples, to Tai peoples. In Burma, the Tai migrations resulted in the Shan states of the uplands, though the Burman and Mon polities were able to fight off the attempts at takeover.

Ultimately the Ahom became totally Indianized. Their traditional language became relegated to ritual, and they adopted the Indo-Aryan Assamese language. Additionally, at some point, they converted to orthodox Hinduism. This became so much a part of their identity that by the 17th century they were checking Islamic expansion to the east by defeating the Mughals.

All of this ultimately goes back to the question: how did my father get his mtDNA? If you read my post from a few years back, How did Bengalis get East Asian?, you will know that it is probably a mix of Austro-Asiatic and Tibeto-Burman ancestry. Can we say any more at this stage?

Some Austronesian data sets have come online. So I thought I’d give it another shot. Additionally, I spent several hours removing outliers and combining populations to generate a full data set. The number of markers was 195,000 SNPs.

| Label | N | Notes |
|---|---|---|
| AA | 17 | Munda (outliers removed) |
| BD | 74 | Bangladesh, 1K BEB (outliers removed) |
| Borneo | 31 | Orang Asli tribes (outliers removed) |
| Burmese | 20 | Bamar ethnicity |
| Cambodians | 39 | Outliers removed |
| Dai | 40 | |
| Han_C | 47 | Pooled Han from HGDP and 1K |
| Han_N | 28 | Pooled Han from HGDP and 1K |
| Han_S | 29 | Pooled Han from HGDP and 1K |
| Japanese | 28 | |
| Malay | 21 | |
| Miao | 10 | |
| Phil | 16 | Luzon and Visaya |
| Phil_Highland | 15 | Igorot tribesman Luzon (outliers removed) |
| Telugu | 34 | 1K STU (outliers removed) |
| Viet | 18 | |

I ran ADMIXTURE at K = 4 on the full data set. Please click on the image if you want details, but the results are straightforward:

yellow = South Asian (modal in Telugu)

green = Northeast Asian (modal in Japan and northern Han)

navy = Southeast Asian/Austro-Asiatic (modal in Cambodians)

red = Austronesian (modal in Igorot tribesman from the highlands of the Philippines)

The two bottom population groups are Bangladeshis and Munda. You can see that all are mostly yellow. That is, they’re mostly South Asian. But the Munda have a much lower South Asian proportion than the Bangladeshis. This is not surprising. The Munda language and mythology are very distinct from those of other South Asians. Clearly, they have ancient East Asian connections, and this shows in their genome-wide ancestry.

But notice a difference between Bangladeshis and Munda: most of the Bangladeshis have a green component, which is common among Northeast Asians, while none of the Munda do. The total fractions are 38% navy (Austro-Asiatic) for the Munda, and 7% each for navy and green (Northeast Asian) for the Bangladeshis.

The two components also exhibit a negative correlation in the Bangladeshis of -0.47. Why? My own suspicion is that some population structure and clinal variation exist within Bangladesh. As I’ve noted before my parents are among the most East Asian of Bangladeshis I’ve ever analyzed…and it is no surprise that we are from the east of eastern Bengal. In contrast, when I’ve looked at genotypes from West Bengalis, they tend to have less East Asian ancestry, though still an appreciable amount in a broader South Asian context (in fact, even Bengali Brahmins have East Asian ancestry, though at smaller fractions).

This seems to be a pretty clear rejection of the model where Bangladeshis are a two-population mix of Munda tribesmen and a more conventional South Asian group.

Here are the average percentages by population:

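| Group | Austro-Asiatic | Austronesian | South Asian | Northeast Asian |
|---|---|---|---|---|
| AA | 38% | 0% | 62% | 0% |
| BD | 7% | 2% | 84% | 7% |
| Borneo | 61% | 38% | 0% | 0% |
| Burmese | 29% | 0% | 23% | 48% |
| Cambodians | 73% | 1% | 15% | 11% |
| Dai | 49% | 7% | 0% | 44% |
| Han_C | 16% | 5% | 0% | 79% |
| Han_N | 1% | 1% | 2% | 96% |
| Han_S | 27% | 7% | 0% | 66% |
| Japanese | 0% | 1% | 2% | 97% |
| Malay | 64% | 16% | 13% | 7% |
| Miao | 24% | 3% | 0% | 73% |
| Phil | 34% | 37% | 6% | 22% |
| Phil_Highland | 0% | 100% | 0% | 0% |
| Telugu | 0% | 3% | 96% | 0% |
| Viet | 45% | 7% | 0% | 48% |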
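
Averages like these are just per-population means of the per-individual ancestry fractions that ADMIXTURE reports. Here is a minimal sketch of that step, assuming you already have one label and one row of K fractions per individual. The function name and the toy numbers are hypothetical, made up for illustration; they are not the data behind the table.

```python
from collections import defaultdict


def mean_ancestry_by_population(labels, q_rows):
    """Average per-individual ancestry fractions within each population.

    labels: one population label per individual.
    q_rows: one list of K ancestry fractions per individual (rows sum to ~1).
    """
    sums = {}
    counts = defaultdict(int)
    for label, row in zip(labels, q_rows):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        for k, fraction in enumerate(row):
            sums[label][k] += fraction
        counts[label] += 1
    return {label: [s / counts[label] for s in totals]
            for label, totals in sums.items()}


# Hypothetical toy input: two populations, K = 2 components.
labels = ["popA", "popA", "popB"]
q_rows = [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]
means = mean_ancestry_by_population(labels, q_rows)
```

A simple mean hides the spread within a population, which matters here because the Bangladeshi individuals vary quite a bit in their East Asian fractions.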

I’m 99% sure that “South Asian” is in some of these cases a proxy for anything that’s not East Asian. But the Malay and Cambodian results are probably South Asian. And the Burmese certainly are.

Click to enlarge the PCA plot to the left, but PC1 is South Asian to East Asian, and PC2 is Northeast Asian to Southeast Asian.

Both the Malays and the Burmese exhibit a “South Asia cline.” This is due to admixture. But the Burmese project toward the position of the central Han, while the Malays are shifted toward a Southeastern Asian population.

Both the Bangladeshis and Munda samples are East Asia shifted, but the Munda sample clearly skews toward the Southeast Asian populations. The Bangladeshi samples do not seem to exhibit this clear pattern.

Then I ran Treemix with blocks of 1000 SNPs and no migration edges as well as global rearrangements turned on and rooted with the Telugu.
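
For anyone who wants to try a run like this, here is a rough sketch of how the command line might be assembled. It assumes the standard TreeMix options `-i`, `-o`, `-k`, `-root`, `-global` and `-m`. The helper function and both file names are placeholders of mine, since the post does not give them, and nothing is executed here.

```python
import shlex


def treemix_command(input_path, output_stem, root, block_size=1000,
                    migration_edges=0, global_rearrangements=True):
    """Build an argv list for a TreeMix run; the program is not launched."""
    argv = ["treemix", "-i", input_path, "-o", output_stem,
            "-k", str(block_size), "-root", root]
    if global_rearrangements:
        argv.append("-global")
    if migration_edges > 0:
        argv += ["-m", str(migration_edges)]
    return argv


# Placeholder file names; blocks of 1000 SNPs, no migration edges, Telugu root.
no_migration = treemix_command("frequencies.treemix.gz", "tree_m0",
                               root="Telugu")
print(shlex.join(no_migration))
```

Passing `migration_edges=3` to the same helper gives the kind of run used further down for the reduced population set.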

 

The results are absolutely unsurprising. Unfortunately adding migration edges doesn’t really add much value with so many populations, as there is a great deal of complex population history in Southeast Asia.

Removing many of the populations and setting the migration edges to 3, you get:

 

The Austro-Asiatic connection between Cambodians and Munda is always clear no matter what you do. The Bangladeshis tend to have more complex relationships, but often the edges are toward the Burmese, who are a compound between South Asian, Austro-Asiatic, and Northeast Asian.

At this point I ran a “three population test.” Basically, you take an outgroup, and compare it to a clade of two other populations, and see how good the fit of the data to the model is. If there is “complex population history” you’ll get a negative f3 statistic. Complex population history means that there is almost certainly gene flow between the outgroup and one of the ingroups.
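
To make the statistic concrete, here is a minimal sketch of the textbook form, f3(target; source1, source2), as the average over SNPs of (c - a)(c - b), where c, a and b are allele frequencies in the three populations. What I call the outgroup above plays the role of the target. This is a simplification: real implementations add a finite-sample correction and get the standard error from a block jackknife, both omitted here, and the allele frequencies are hypothetical.

```python
def f3(target, source1, source2):
    """Simple f3(target; source1, source2) from allele frequencies.

    Averages (c - a) * (c - b) over SNPs. This omits the finite-sample
    correction that real implementations apply.
    """
    products = [(c - a) * (c - b)
                for c, a, b in zip(target, source1, source2)]
    return sum(products) / len(products)


# Hypothetical allele frequencies at four SNPs in two source populations.
pop_a = [0.10, 0.80, 0.35, 0.60]
pop_b = [0.70, 0.20, 0.55, 0.10]

# A target built as a 50/50 mixture of the two sources.
mixed = [0.5 * a + 0.5 * b for a, b in zip(pop_a, pop_b)]

f3_mixed = f3(mixed, pop_a, pop_b)   # negative: the mixture signal
f3_self = f3(pop_a, pop_a, pop_b)    # zero: target identical to one source
```

In the mixed case every per-SNP product is -0.25 times the squared frequency difference between the sources, so the average cannot be positive. That is why a clearly negative f3 is read as evidence of admixture.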

Below are results where the Bangladeshis are the outgroup, and f3 statistics are negative (sorted most negative to least).

| Outgroup | Pop1 | Pop2 | f3 | f3-error | Z-score |
|---|---|---|---|---|---|
| BD | Telugu | Miao | -0.00240554 | 6.21107e-05 | -38.7298 |
| BD | Telugu | Han_S | -0.00238905 | 5.49332e-05 | -43.4901 |
| BD | Telugu | Dai | -0.00238103 | 5.73977e-05 | -41.4831 |
| BD | Telugu | Han_C | -0.00237904 | 5.74148e-05 | -41.4359 |
| BD | Telugu | Viet | -0.0023151 | 5.63663e-05 | -41.0725 |
| BD | Telugu | Han_N | -0.00229979 | 5.55838e-05 | -41.3752 |
| BD | Telugu | Japanese | -0.00225745 | 5.65642e-05 | -39.9095 |
| BD | Telugu | Phil_Highland | -0.00225153 | 6.87595e-05 | -32.745 |
| BD | Telugu | Borneo | -0.00219619 | 5.91978e-05 | -37.0992 |
| BD | Telugu | Phil | -0.00209752 | 5.97396e-05 | -35.1111 |
| BD | Telugu | Cambodians | -0.00198719 | 4.88719e-05 | -40.6613 |
| BD | Telugu | Malay | -0.00195706 | 5.32466e-05 | -36.7547 |
| BD | Telugu | Burmese | -0.00183415 | 4.79121e-05 | -38.2816 |
| BD | AA | Telugu | -0.000744786 | 4.17995e-05 | -17.818 |

 

The model where Bangladeshis are a combination of Austro-Asiatic populations and conventional South Asians is not crazy. But observe that there is a jump in the f3 statistics between that row and the previous row. Bangladeshis almost certainly have non-Austro-Asiatic ancestry, which is why the scores are more extreme for cases such as (Bangladesh(Telugu, Vietnamese)).
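
As a quick sanity check on the table, the Z-score is just the f3 value divided by its standard error, and the "jump" can be put as a ratio. The numbers below are copied from the last two rows of the table above; the variable names are my own.

```python
# f3 and standard error copied from the last two rows of the table above.
rows = {
    ("Telugu", "Burmese"): (-0.00183415, 4.79121e-05),
    ("AA", "Telugu"): (-0.000744786, 4.17995e-05),
}

# Z-score = f3 / standard error.
z_scores = {pair: f3 / err for pair, (f3, err) in rows.items()}

# How much more negative the weakest East Asian pairing is than the Munda one.
jump = rows[("Telugu", "Burmese")][0] / rows[("AA", "Telugu")][0]
```

The ratio comes out at roughly 2.5, so even the least negative Telugu plus East Asian pairing is well over twice as negative as the Munda plus Telugu one.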

What I’ve established then are:

  • Bangladeshi East Asian ancestry is not sufficiently explained by Munda ancestry.
  • A minority of Bangladeshi Y and mtDNA lineages have East Asian connections, and this can not be explained exclusively by Munda ancestry.
  • Some of these Y and mtDNA lineages seem to be of Tibeto-Burman affinity.
  • Admixture analysis genome-wide indicates ancestry from non-Munda populations of East Asian origin.
  • The fraction of Austro-Asiatic ancestry is balanced with more “northern” elements, while in Burma the northern element is a greater proportion than in Bangladesh.
  • There is a moderate negative correlation between Austro-Asiatic ancestry and Northeast Asian ancestry in the Bangladeshi sample.
  • Bangladeshis seem to have moderate signatures of gene flow from a wide range of East Asian populations.
  • In contrast, the Mundas seem to have a connection most strongly with Cambodians.

A paper from several years ago looking at the patterns of genetic ancestry in the Bangladeshi population found that a single pulse of admixture around 500 AD from an East Asian population was a good fit for the origins of the variation they saw. A two-pulse model with more ancient and more recent admixture events did not improve the fit.

I assume that there is a true signal there. But the model may still be too parsimonious.

My own predictions are as follows:

  • There will be an east-west cline of Tibeto-Burman ancestry.
  • There will be a more constant fraction of Austro-Asiatic ancestry.
  • The ratio of Austro-Asiatic ancestry will be reversed from the Tibeto-Burman cline.
  • Two admixture events will eventually be detected. A strong sex-balanced pulse at 500 AD and later. And an older continuous event that will be more male skewed, as it will involve absorption of Munda substrate.
  • The Padma river will turn out to be a major differentiator, with much more Tibeto-Burman ancestry to the east (Bengali dialects from east of the Padma show more Tibeto-Burman influence).

 Note: a separate issue that I did not want to explore is that the South Asian ancestry of the Munda seems to show almost no Indo-Aryan influence. The Bengali population does have a small, but consistent, “Indo-Aryan” signature that you can not find in the Telugu sample. Naturally this will bias the statistics a touch.

September 18, 2017

Release the UK Biobank! (the prediction of height edition)

Filed under: Genomic prediction,Human Genetics,Human Genomics,UK Biobank — Razib Khan @ 9:25 pm


There’s so much science coming out of the UK Biobank it’s not even funny. It’s like getting the palantír or something.

Anyway, a preprint, submitted for your approval. A vision of things to come? Accurate Genomic Prediction Of Human Height:

We construct genomic predictors for heritable and extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). Replication tests show that these predictors capture, respectively, ~40, 20, and 9 percent of total variance for the three traits. For example, predicted heights correlate ~0.65 with actual height; actual heights of most individuals in validation samples are within a few cm of the prediction. The variance captured for height is comparable to the estimated SNP heritability from GCTA (GREML) analysis, and seems to be close to its asymptotic value (i.e., as sample size goes to infinity), suggesting that we have captured most of the heritability for the SNPs used. Thus, our results resolve the common SNP portion of the “missing heritability” problem – i.e., the gap between prediction R-squared and SNP heritability. The ~20k activated SNPs in our height predictor reveal the genetic architecture of human height, at least for common SNPs. Our primary dataset is the UK Biobank cohort, comprised of almost 500k individual genotypes with multiple phenotypes. We also use other datasets and SNPs found in earlier GWAS for out-of-sample validation of our results.
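
The two headline numbers for height agree with each other, since variance captured is the square of the correlation between predicted and actual values. A small sketch of the arithmetic follows. The correlations for the other two traits are implied by the quoted variance figures, not stated in the abstract.

```python
import math

# Height: correlation between predicted and actual values, from the abstract.
height_correlation = 0.65
height_variance_captured = height_correlation ** 2   # about 0.42, i.e. "~40%"

# Going the other way for the other two traits (implied, not quoted).
bone_density_correlation = math.sqrt(0.20)
education_correlation = math.sqrt(0.09)
```

So a predictor capturing only 9 percent of variance still corresponds to a correlation of about 0.3 between predicted and actual values.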

A scatter-plot is worth a thousand derivations.

You know what’s better than 500,000 samples? One billion samples! A nerd can dream….

September 14, 2017

After agriculture, before bronze

 

The above plot shows genetic distance/variation between highland and lowland populations in Papua New Guinea (PNG). It is from a paper in Science that I have been anticipating for a few months (I talked to the first author at SMBE), A Neolithic expansion, but strong genetic structure, in the independent history of New Guinea.

What does “strong genetic structure” mean? Basically Fst is showing the proportion of genetic variation which is partitioned between groups. Intuitively it is easy to understand, in that if ~1% of the genetic variation is partitioned between groups in one case, and ~10% in another, then it is reasonable to suppose that the genetic distance between groups in the second case is larger than in the first case. On a continental scale Fst between populations is often on the order of ~0.10. That is the value for example when you pool the variation amongst Northern Europeans and Chinese, and assess how much of it can be apportioned in a manner which differentiates populations (so it’s about ~10% of the variation).
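
To make "proportion of variation partitioned between groups" concrete, here is a minimal sketch for a single biallelic SNP and two equally weighted populations, using Fst = (H_T - H_S) / H_T. Genome-wide estimators combine many SNPs and correct for sample size, which this sketch does not attempt. The allele frequencies are hypothetical, picked only to show the scale.

```python
def fst_two_populations(p1, p2):
    """Fst at one biallelic SNP for two equally weighted populations.

    Uses (H_T - H_S) / H_T, where H_S is the mean within-population expected
    heterozygosity and H_T is the heterozygosity of the pooled population.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    if h_total == 0:
        return 0.0  # no variation at all, nothing to partition
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies, chosen only to illustrate the scale.
similar = fst_two_populations(0.50, 0.55)
diverged = fst_two_populations(0.30, 0.70)
```

With frequencies of 0.30 and 0.70 the single-SNP value is 0.16, while 0.50 versus 0.55 gives well under 0.01. A genome-wide value near 0.10 therefore reflects sizeable frequency differences at many loci.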

This is why ancient DNA results which reported that Mesolithic hunter-gatherers and Neolithic farmers in Central Europe who coexisted in rough proximity for thousands of years exhibited differences on the order of ~0.10 elicited surprise. These are values we are now expecting from continental-scale comparisons. Perhaps an appropriate analogy might be the coexistence of Pygmy groups and Bantu agriculturalists? Though there is some gene flow, the two populations exist in symbiosis and exhibit local ecological segregation.

In PNG continental scale Fst values are also seen among indigenous people. The differences between the peoples who live in the highlands and lowlands of PNG are equivalent to those between huge regions of Eurasia. This is not entirely surprising because there has been non-trivial gene flow into lowland populations from Austronesian groups, such as the Lapita culture. Many lowland groups even speak Austronesian languages today.

Using standard ADMIXTURE analysis the paper shows that many lowland groups have significant East Asian ancestry (red), while none of the highland groups do (some individuals with East Asian admixture seem to be due to very recent gene flow). But even within the highlands the genetic differences are striking. The Fst values between Finns and Southern European groups such as Spaniards are very high in a European context (due to Finnish Siberian ancestry as well as drift through a bottleneck), but most comparisons within the highland groups in PNG still exceed this.

The paper also argues that genetic differences between Papuans and the natives of Australia pre-date the rising sea levels at the beginning of the Holocene, when Sahul divided between its various constituents. This is not entirely surprising considering that the ecology of the highlands during the Pleistocene would have been considerably different from Australia to the south, resulting in sharp differences in the hunter-gatherer lifestyles. Additionally, there does not seem to have been a genetic cline. Papuans are symmetrically related to all Australian groups they had samples from.

Using coalescence-based genomic methods they inferred that separation between highlands and some lowland groups occurred ~10-20,000 years ago. That is, after the Last Glacial Maximum. For the highlands, the differences seem to date to within the last 10,000 years. The Holocene. Additionally, they see population increases in the highlands, correlating with the shift to agriculture (cultivation of taro).

None of the above is entirely surprising, though I would take the date inferences with a grain of salt. The key is to observe that large genetic differences, as well as cultural differences, accrued in the highlands of PNG during the Holocene. In the paper they have a social and cultural explanation for what’s going on:

  Fst values in PNG fall between those of hunter-gatherers and present-day populations of west Eurasia, suggesting that a transition to cultivation alone does not necessarily lead to genetic homogenization.

A key difference might be that PNG had no Bronze Age, which in west Eurasia was driven by an expansion of herders and led to massive population replacement, admixture, and cultural and linguistic change (7, 8), or Iron Age such as that linked to the expansion of Bantu-speaking farmers in Africa (24). Such cultural events have resulted in rapid Y-chromosome lineage expansions due to increased male reproductive variance (25), but we consistently find no evidence for this in PNG (fig. S13). Thus, in PNG, we may be seeing the genetic, linguistic, and cultural diversity that sedentary human societies can achieve in the absence of massive technology-driven expansions.

Peter Turchin in books like Ultrasociety has argued that one of the theses in Steven Pinker’s The Better Angels of Our Nature is incorrect: violence has not decreased monotonically, but rather peaked in less complex agricultural societies. PNG is clearly a case of this, as endemic warfare was a feature of highland societies when they encountered Europeans. Lawrence Keeley’s War Before Civilization: The Myth of the Peaceful Savage gives so much attention to highland PNG because it is a contemporary illustration of a Neolithic society which until recently had not developed state-level institutions.

What papers like these are showing is that cultural and anthropological dynamics strongly shape the nature of genetic variation among humans. Simple models which assume as a null hypothesis that gene flow occurs through diffusion processes across a landscape where only geographic obstacles are relevant simply do not capture enough of the dynamic. Human cultures strongly shape the nature of interactions, and therefore the genetic variation we see around us.

July 9, 2017

SLC24A5 is very important, but we don’t know why


The golden age of pigmentation genetics started in 2005 with SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. Prior to that pigmentation genetics was really to a great extent coat color genetics, done in mice and other organisms which have a lot of pelage variation.

Of course there was work on humans, mostly related to melanocortin 1. But more interesting were classical pedigree studies which indicated that the number of loci controlling variation in pigmentation was not that high. Thus, it was a mildly polygenic trait insofar as some large effect quantitative trait loci could be discerned in the inheritance patterns.

From The Genetics of Human Populations, written in the 1960s, but still useful today because of its comprehensive survey of the classical period:

Depending on what study samples you use, variation at the SLC24A5 locus explains less than 10% or more than 30% of the total variance. But it is probably the biggest effect locus on the whole in human populations when you pool them all together (obviously it explains little variance in Africans or eastern non-Africans since it is homozygous ancestral by and large in both groups).

One aspect of the derived SNP in this locus is that it seems to be under strong selection. In a European 1000 Genomes sample there are 1003 copies of the derived variant, and 3 of the ancestral. Curiously this allele was absent in Western European Mesolithic hunter-gatherers, though it was present in hunter-gatherers on the northern and eastern fringes of the continent. It was also present in Caucasian hunter-gatherers and farmers from the Middle East who migrated to Europe. It seems very likely that these sorts of high frequencies are due to selection in Europe.

The variant is also present at appreciable frequencies in many South Asian populations, and there seems to have been in situ selection there too, as well as in the Near East. In Ethiopia it also seems to be under selection.

It could be something due to radiation…but the Near East and South Asia are quite high intensity in that regard. As are the highlands of Ethiopia. About seven years ago I suggested that rather than UV radiation as such the depigmentation that has occurred across the Holocene might be due to agriculture and changes in diet.

But a new result from southern Africa presented at the SMBE meeting this year suggests that this can not be a comprehensive answer. Meng Lin in Brenna Henn’s lab uses a broad panel of KhoeSan populations to find that the derived allele on SLC24A5 reaches ~40% frequency. Even being generous, the fraction of West Eurasian admixture in these groups is probably only around 10%. Where did this allele come from? The results from Joe Pickrell a few years back are sufficient to explain: there was a movement of pastoralists with distant West Eurasian ancestry who brought cattle to southern Africa, and so resulted in the ethnogenesis of groups such as the Nama people (there is also Y chromosomal work by Henn on this).
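
The gap between ~10% admixture and ~40% allele frequency is the whole argument, so it is worth spelling out. Under neutral admixture the expected frequency is just a weighted average of the source and recipient frequencies. The sketch below assumes the allele was fixed in the incoming group and absent before contact. Those two values are my assumption for illustration, and they give the most generous neutral expectation possible.

```python
def expected_frequency(admixture_fraction, source_freq, recipient_freq):
    """Allele frequency expected right after neutral admixture."""
    return (admixture_fraction * source_freq
            + (1 - admixture_fraction) * recipient_freq)


admixture_fraction = 0.10   # the generous admixture estimate given above
observed = 0.40             # approximate frequency reported in the talk

# Assumption: allele fixed in the incoming group, absent before contact.
expected = expected_frequency(admixture_fraction, 1.0, 0.0)
excess = observed / expected
```

Even with these generous assumptions the observed frequency is about four times the neutral expectation, which is the sort of excess you would attribute to selection rather than to drift alone.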

Sad human with two derived alleles of SNP of interest

Lin reports that the haplotype around SLC24A5 is the same one as in Western Eurasia. Iain Mathieson (who is now at Penn if anyone is looking for something to do in grad school or a post-doc) has told me that the haplotype in the Motala Mesolithic hunter-gatherers and in the hunter-gatherers from the Caucasus are the same. It seems that this haplotype was widespread early in the Holocene. Curiously, the Motala hunter-gatherers also carry the East Asian haplotype around their derived EDAR variant.

I don’t know what to make of this. My intuition is that if a haplotype like this is so widespread nearly ~10,000 years ago recombination would have broken it apart into smaller pieces so that haplotype structure would be easier to discern. As it is that doesn’t seem to be the case.

And we also don’t know what’s going on with SLC24A5. Obviously it impacts skin color. It has been shown to do so in admixed populations. But it is hard to believe that that is the sole target of natural selection here.

May 10, 2017

The Bronze Age demographic transformation of Britain

Filed under: Bell Beaker,Britain,Evolution,History,Human Genetics,Human Genomics — Razib Khan @ 8:52 am

In his excellent The Isles: A History, Norman Davies mentions offhand that, unlike the Irish, the British have to a great extent forgotten their own mythology. This is one reason that J. R. R. Tolkien created Middle Earth: it gave the Anglo-Saxons the same sort of mythos that the Irish and Norse had.

But to some extent I think we can update our assessments. Science is bringing myth to life. The legendary “Bell Beaker paper” is now available in preprint form, The Beaker Phenomenon And The Genomic Transformation Of Northwest Europe. The methods are not too abstruse if you have read earlier works in this vein (i.e., no Nick Patterson authored methodological supplement that I saw). And the results are straightforward.

And what are those results?

First, the Bell Beaker phenomenon was both cultural and demographic. Cultural in that it began in the Iberian peninsula, and was transmitted to Central Europe, without much gene flow from what they can see. Demographic in that its push west into what is today the Low Countries and France and the British Isles was accompanied by massive gene flow.

In their British samples they conclude that 90% of the ancestry of early Bronze Age populations derives from migrants from Central Europe with some steppe-like ancestry. In other words, in a few hundred years there was a 90% turnover of ancestry. The preponderance of the male European R1b lineage also dates to this period. It went from ~0% to ~75-90% in Britain over a few hundred years.

If most of the genetic-demographic character of modern Britain was established during the Bronze Age*, then there has been significant selection since the Bronze Age. The figure to the left shows ancient (Neolithic/Bronze Age) frequencies of selected SNPs, with modern frequencies in the British in dashed red. The top-left SNP is for HERC2-OCA2, the region related to brown vs. blue eye color, and also associated with some more general depigmentation. The top-right SNP is in SLC45A2, the second largest effect skin color locus in Europeans. The bottom SNP is for a mutation on LCT, which allows for the digestion of milk sugar as adults.

The vast majority of the allele frequency change in Britons for digestion of milk sugar post-dates the demographic turnover. In other words, the modern allele frequency is a function of post-Bronze Age selection. This is not surprising, as it supports the result in Eight thousand years of natural selection.

1000 Genomes derived SLC45A2 SNP frequency

At least as interesting are the pigmentation loci. The fact that the derived frequency in HERC2-OCA2 is lower in both British and Central European Beaker people samples indicates that the lower proportion is not an artifact of sampling. Britons have gotten more blue-eyed over the last 4,000 years. Second, the derived allele at SLC45A2 is at shockingly low proportions compared to modern European populations.

HGDP derived SLC45A2 SNP frequency

In the 1000 Genomes the 4% ancestral allele frequency is almost certainly a function of the Siberian (non-European) ancestry. In modern Iberians the ancestral frequency is 18% (and it is even higher in Sardinians last I checked), but in Tuscans it is ~2%. Though not diagnostic of Europeans in the way the derived SNP at SLC24A5 is, SLC45A2 derived variants are much more constrained to Europe. Individuals who are homozygous ancestral for SNPs at SLC45A2 are rare in modern Northern Europeans (pretty much nonexistent actually). But even as late as the Bronze Age they would have been present at low but appreciable frequencies.

This particular result convinces me that the method in Field et al. which detected lots of recent (last 2,000 years) selection on pigmentation in British populations is not just a statistical artifact. Though these papers are solving much of European prehistory, they are also going to be essential windows into the trajectory of natural selection in human populations over the last 5,000 years.

* In the context of this paper the Anglo-Saxon migrations tackled by the PoBI paper are minor affairs because the two populations were already genetically rather close. Additionally, the PoBI paper found that the Germanic migrations were significant demographic events, but most of the ancestry across Britain does date to the previous period.

January 5, 2013

Why the future won’t be genetically homogeneous

While reading The Founders of Evolutionary Genetics I encountered a chapter where the late James F. Crow admitted that he had a new insight every time he reread R. A. Fisher’s The Genetical Theory of Natural Selection. This prompted me to put down The Founders of Evolutionary Genetics after finishing Crow’s chapter and pick up my copy of The Genetical Theory of Natural Selection. I’ve read it before, but this is as good a time as any to give it another crack.

Almost immediately Fisher aims at one of the major conundrums of 19th century theory of Darwinian evolution: how was variation maintained? The logic and conclusions strike you like a hammer. Charles Darwin and most of his contemporaries held to a blending model of inheritance, where offspring reflect a synthesis of their parental values. As it happens this aligns well with human intuition. Across their traits offspring are a synthesis of their parents. But blending presents a major problem for Darwin’s theory of adaptation via natural selection, because it erodes the variation which is the raw material upon which selection must act. It is a famously peculiar fact that the abstraction of the gene was formulated over 50 years before the concrete physical embodiment of the gene, DNA, was ascertained with any confidence. In the first chapter of The Genetical Theory R. A. Fisher suggests that the logical reality of persistent copious heritable variation all around us should have forced scholars to the inference that inheritance proceeded via particulate and discrete means, as these processes do not diminish variation indefinitely in the manner which is entailed by blending.

More formally the genetic variance decreases by a factor of 1/2 every generation in a blending model. This is easy enough to understand. But I wanted to illustrate it myself, so I slapped together a short simulation script. The specifications are as follows:

1) Fixed population size, in this case 100 individuals

2) 100 generations

3) All individuals have 2 offspring, and mating is random (no consideration of sex)

4) The offspring trait value is the mid-parent value of the parents, though I also include a “noise” parameter in some of the runs, so that the outcome deviates somewhat in a random fashion from the expected parental values

In terms of the data structure the ultimate outcome is a 100 ✕ 100 matrix, with rows corresponding to generations, and each cell an individual in that generation. The values in each cell span the range from 0 to 1. In the first generation I imagine the combining of two populations with totally different phenotypic values; 50 individuals coded 1 and 50 individuals coded 0. If a 1 and 1 mate, they produce only 1′s. Likewise with 0′s. On the other hand a 0 and a 1 produce a 0.5. And so forth. The mating is random in each generation.
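For anyone who wants to try this at home, the specifications above can be sketched roughly as follows. This is a reconstruction from the description, not the original script: the function name, the Gaussian form of the noise, the clipping of values to the 0 to 1 range, and the reading of "2 offspring" as two per mated pair (so that population size stays fixed) are all assumptions.

```python
import random
import statistics

def simulate_blending(n=100, generations=100, noise_sd=0.0, seed=1):
    """Return the trait variance in each generation under blending inheritance.

    n must be even. noise_sd is the standard deviation of optional Gaussian
    noise added to each offspring (an assumed form for the "noise" parameter).
    """
    rng = random.Random(seed)
    # Generation 0: two founding populations with trait values 0 and 1.
    pop = [0.0] * (n // 2) + [1.0] * (n // 2)
    variances = [statistics.pvariance(pop)]
    for _ in range(generations - 1):
        # Random mating with no regard to sex: shuffle, then pair neighbours.
        rng.shuffle(pop)
        next_pop = []
        for i in range(0, n, 2):
            mid_parent = (pop[i] + pop[i + 1]) / 2.0
            # Each pair leaves two offspring, keeping population size fixed.
            for _ in range(2):
                child = mid_parent
                if noise_sd > 0:
                    child += rng.gauss(0.0, noise_sd)
                # Keep trait values inside the 0..1 range (assumption).
                next_pop.append(min(1.0, max(0.0, child)))
        pop = next_pop
        variances.append(statistics.pvariance(pop))
    return variances

no_noise = simulate_blending()
with_noise = simulate_blending(noise_sd=0.05)
print(no_noise[0], no_noise[10], with_noise[-1])
```

The sketch stores only the per-generation variance; collecting each generation's `pop` in a list instead would give the 100 ✕ 100 matrix described above.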

The figure to the left illustrates the decay in the variance of the trait value over generation time in different models. The red line is the idealized decay: 1/2 decrease in variance per generation. The blue line is one simulation. It roughly follows the decay pattern, though it deviates somewhat, seemingly because some assortative mating occurred by chance (presumably if I used many more individuals it would converge upon the analytic curve). Finally you see one line which follows the trajectory of a simulation with noise. Though this population follows the theoretical decay more closely initially, it converges upon a different equilibrium value, one where some variance remains. That’s because the noise parameter continues to inject variance every generation. The relevant point is that most of the variation disappears in fewer than 5 generations, and it is basically gone by the 10th generation. To maintain variation in a blending inheritance model requires a great deal of mutation, the extent of which is just not plausible.

To get a different sense of what occurred in these two particular simulations, here are heat maps. Values in the interval from 0 to 1 are now shown as shading in each cell. I am displaying only 50 generations here. The top panel is one without noise, while the bottom panel has the noise parameter.

The contrast with a Mendelian model is striking. Imagine that 0 and 1 are now coded by two homozygote genotypes, with heterozygotes exhibiting a value of 0.5. If all the variation is controlled by the genotypes, then you have three genotypes, and three trait values. If I change the scenario above to a Mendelian one then variance will initially decrease, but the equilibrium will be maintained at a much higher level, as 50% of the population will be heterozygotes (0.5), and 50% homozygotes, split evenly between the two varieties (0 and 1). With the persistence of heritable variation natural selection can operate to change the allele frequencies over time without the worry that the trait values within a breeding population will converge upon each other too rapidly. This is true even in cases of polygenic traits. Height and I.Q. remain variable, because they are fundamentally heritable through discrete and digital processes.
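The Mendelian equilibrium can be checked with a few lines of arithmetic. The sketch below assumes a single additive locus at Hardy-Weinberg proportions in an effectively infinite population (so no drift); the function name is just for illustration.

```python
def mendelian_variance(p):
    """Trait variance for one additive locus at Hardy-Weinberg proportions.

    Genotype values are 0, 0.5 and 1; p is the frequency of the allele
    whose homozygote has value 1.
    """
    q = 1.0 - p
    freqs = {0.0: q * q, 0.5: 2 * p * q, 1.0: p * p}
    mean = sum(value * freq for value, freq in freqs.items())
    return sum(freq * (value - mean) ** 2 for value, freq in freqs.items())

# Founders: half 0 and half 1, so the starting variance is 0.25.
print(mendelian_variance(0.5))
```

With the two alleles at equal frequency the variance drops from 0.25 in the founders to 0.125 after one round of random mating, and then stays there, instead of halving again every generation as it does under blending.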

All this is of course why the “blond gene” won’t disappear, redheads won’t go extinct, nor will humans converge upon a uniform olive shade in a panmictic future. A child is a genetic cross between parents, but only between 50% of each parent’s genetic makeup. And that is one reason they are not simply an “averaging” of parental trait values.

January 4, 2013

Mitochondrial Eve: a de facto deception?

The above image, and the one to the left, are screenshots from my father’s 23andMe profile. Interestingly, his mtDNA haplogroup is not particularly common among ethnic Bengalis, who are more than 80% on a branch of M. This reality is clear in the map above which illustrates the Central Asian distribution of my father’s mtDNA lineage. In contrast, his whole genome is predominantly South Asian, as is evident in the estimate that 23andMe provided via their ancestry composition feature, which utilizes the broader genome. The key takeaway here is that the mtDNA is informative, but it should not be considered to be representative, or anything like the last word on one’s ancestry in this day and age.


As a matter of historical record mtDNA looms large in human population genetics and phylogeography for understandable reasons. Mitochondrial DNA is present in far more copies per cell than nuclear DNA, and so was the lowest hanging fruit in the pre-PCR era. Additionally, because mtDNA lineages do not recombine they are well suited to a coalescent framework, where an idealized inverted treelike phylogeny converges upon a common ancestor. Finally, mtDNA was presumed to be neutral, so reflective of demographic events unperturbed by adaptation, and characterized by a high mutation rate, yielding a great amount of variation with which to differentiate the branches of the human family tree.

Many of these assumptions are now disputable. But that’s not the point of this post. In the age of dense 1 million marker SNP-chips why are we still focusing on the history of one particular genetic region? In a word: myth. Eve, the primal woman. The “mother of us all,” who even makes cameos in science fiction finales!

In 1987 a paper was published which found that Africans harbored the greatest proportion of mtDNA variation among human populations. Additionally, these lineages coalesced back to a common ancestor on the order of 150,000 years ago. Since mtDNA is present in humans, there was a human alive 150,000 years ago who carried this ancestral lineage, from which all modern lineages derive. Mitochondrial DNA is passed from mothers to their offspring, so this individual must have been a woman. In the press she was labeled Eve, for obvious reasons. The scientific publicity resulted in a rather strange popular reaction, culminating in a Newsweek cover where Adam and Eve are depicted as naked extras from Eddie Murphy’s Coming to America film.

The problem is that people routinely believe that mtDNA Eve was the only ancestress of all modern humans from the period in which she lived. Why they believe this is common sense, and requires no great consideration. The reality is that the story being told by science is the story of mtDNA, with inferences about the populations which serve as hosts for mtDNA being incidental. These inferences need to be made cautiously and with care. It is basic logic that a phylogeny will coalesce back to a common ancestor at some point. Genetic lineages over time go extinct, and so most mtDNA lineages from the time of Eve went extinct. There were many women who were alive during the same time as Eve, who contributed at least as much, perhaps more, to the genetic character of modern humans today. All we can say definitively is that their mtDNA lineage is no longer present. As mtDNA is passed from mother to daughter (males obviously have mtDNA, but we are dead ends, and pass it to no one), all one needs for a woman’s mtDNA lineage to go extinct is for her to have only sons. Though she leaves no imprint on the mtDNA phylogeny, obviously her sons may contribute genes to future generations.
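The inevitability of an "Eve" is easy to see in a toy simulation. The sketch below assumes a constant number of women per generation, neutrality, and each daughter drawing her mother at random (a bare-bones Wright-Fisher model); the function name and parameter values are illustrative only.

```python
import random

def generations_until_one_lineage(n_women=50, seed=0, max_generations=100_000):
    """Follow founder mtDNA lineages until only one survives.

    Returns (generations elapsed, index of the founder whose mtDNA fixed).
    """
    rng = random.Random(seed)
    # Each founding woman starts her own mtDNA lineage, labelled by index.
    lineages = list(range(n_women))
    for generation in range(1, max_generations + 1):
        # Every daughter inherits the mtDNA of a randomly chosen mother.
        lineages = [rng.choice(lineages) for _ in range(n_women)]
        if len(set(lineages)) == 1:
            return generation, lineages[0]
    raise RuntimeError("lineages did not coalesce within max_generations")

generations, eve = generations_until_one_lineage()
print(generations, eve)
```

Whatever the seed, one founder's mtDNA eventually takes over by chance alone. The other 49 founders were just as real, and in an actual population they could still have passed on nuclear genes through their sons, which this mtDNA-only toy does not track.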

Prior to ancient DNA and the proliferation of dense SNP data sets scholars were a bit too ambitious about what they believed they could infer from mtDNA and Y lineages (e.g., The Real Eve: Modern Man’s Journey Out of Africa). We are in a different time now; inferences made about the past rest on more than one leg. But the legend of Eve of the mtDNA persists, not because of its compelling scientific nature, but because this is a case where science piggy-backs upon prior conceptual furniture. This yields storytelling power, but a story which is based on a thin basis of fact becomes just another tall tale.

All this is on my mind because one of the scientists involved with Britain’s DNA, Jim Wilson, has penned a response to Vincent Plagnol’s Exaggerations and errors in the promotion of genetic ancestry testing (see here for more on this controversy). Overall I don’t find Wilson’s rebuttal too persuasive. It is well written, but it has the air of sophistry and lawyerly precision. I have appreciated Wilson’s science before, so I am not casting aspersions at his professional competence. Rather, some of the more enthusiastic and uninformed spokespersons for his firm have placed him in a delicate and indefensible situation, and he is gamely attempting to salvage the best of a bad hand. Importantly, he does not reassure me in the least that his firm did not use Britain’s atrocious libel laws as a threat to mute forceful criticism of their business model on scientific grounds. A more general issue here is that Wilson is in a situation where he must not damage the prospects of his firm, all the while maintaining his integrity as a scientist. From what I have seen once science becomes a business one must abandon the pretense of being a scientist first and foremost, no matter how profitable that aura of objectivity may be. The nature of marketing is such that the necessary caution and qualification essential for science becomes a major liability in the process of communicating. It’s about selling, not convincing.

Going back to Eve, Wilson marshals a very strange argument:

“The claim that Adam and Eve really existed, as you suggest, refers to the most recent common ancestors of the mtDNA and non-recombining part of the Y chromosome. I don’t agree that there is nothing special about these individuals: there must have been a reason why mitochondrial Eve was on the front cover of Time magazine in the late 80s!….”

A minor quibble, but I suspect he means the Newsweek cover. More seriously, this line of argumentation is bizarre on scientific grounds. Rather, it is a tack which is more rational when aiming toward a general audience which might purchase a kit which they believe might tell them of their relationship to “Eve.”

In the wake of the discussion at Genomes Unzipped I participated in further exchanges with Graham Coop and Aylwyn Scally on Twitter, and decided to spend 20 minutes this afternoon asking people what they thought about mitochondrial Eve. By “people,” I mean individuals who are pursuing graduate educations in fields such as genetics and forensics. My cursory “field research” left me very alarmed. Naturally these were individuals who did not make elementary mistakes in regards to the concept, but there was great confusion. I can only wonder what’s going through the minds of the public.

Analogies, allusions, and equivalences are useful when they leverage categories and concepts which we are solidly rooted in, and transpose them upon a foreign cognitive landscape. By pointing to similarities of structure and relation one can understand more fully the novel ground which one is exploring. Saying that the president of India is analogous to the queen of England is an informative analogy. These are both positions where the individual is a largely ceremonial head of state. In contrast, the president of the United States and the queen of England are very different figures, because the American executive is not ceremonial at all. This is not a useful analogy, even though superficially it sees no lexical shift.

Who was Eve? A plain reading is that she is the ancestor of all humans, and more importantly, the singular ancestress of all humans back to the dawn of time. This is a concept which the public grasps intuitively. Who is mtDNA Eve? A woman who flourished 150,000 years ago, who happened to carry the mtDNA lineage which would drift to fixation in the ancestors of modern humans. I think this is a very different thing indeed. For purposes of poetry and marketing the utilization of the name Eve is justifiable. But on scientific grounds all it does is confuse, obfuscate, and mislead.

The fiasco that Vincent Plagnol stumbled upon is just a symptom of a broader problem. Scientists need to engage in massive conceptual clean up, as catchy phrases such as “mitochondrial Eve” and “Y Adam” permeated the culture over the past generation, and misled many sincere and engaged seekers of truth. This is of the essence because personal genomics, and the scientific understanding of genealogy, are now moving out of the ghetto of hobbyists, enthusiasts, and researchers. Though I doubt this industry will be massive, it will be ubiquitous, and a seamless part of our information portfolio. If people still have ideas like mitochondrial Eve in their head it is likely to cloud their perception of the utility of the tools at hand, and their broader significance.

December 18, 2012

Buddy, can you spare some ascertainment?

The above map shows the population coverage for the Geno 2.0 SNP-chip, put out by the Genographic Project. Their paper outlining the utility and rationale of the chip is now out on arXiv. I saw this map last summer, when Spencer Wells hosted a webinar on the launch of Geno 2.0, and it was the aspect which really jumped out at me. The number of markers that they have on this chip is modest, only a little over 100,000 on the autosome, with a few tens of thousands more on the X, Y, and mtDNA. In contrast, the Axiom® Genome-Wide Human Origins 1 Array Plate being used by Patterson et al. has ~600,000 SNPs. But as is clear from the map above Geno 2.0 is ascertained in many more populations than the other comparable chips (Human Origins 1 Array uses 12 populations). It’s obvious that if you are only catching variation in a few populations, all the extra markers may not give you much bang for the buck (not to mention the biases that that may introduce in your population genetic and phylogenetic inferences).


To the left is the list of populations against which the Human Origins 1 Array was ascertained, and they look rather comprehensive to me. In contrast, for Geno 2.0 ‘ancestrally informative markers’ were ascertained on 450 populations. The ultimate question for me is this: is all the extra ascertainment on diverse and obscure groups worth it? On first inspection Geno 2.0′s number of SNPs looks modest as I stated, but in my experience when you quality control and merge different panels together you are often left with only a few hundred thousand SNPs in any case. 100-200,000 SNPs is also sufficient to elucidate relationships even in genetically homogeneous regions such as Europe in my experience (it’s more than enough for model-based clustering, and seems to be overkill for MDS or PCA). One issue that jumps out at me about the Affymetrix chip is that it is ascertained toward the antipodes. In contrast, Geno 2.0 takes into account the Eurasian heartland. I suspect, for example, that Geno 2.0 would be better for population or ancestry assignment for South Asians because it would have more informative markers for those populations.

Ultimately I can’t really say much more until I use both marker sets in different and similar contexts. Since Geno 2.0 consciously excludes many functional and medically relevant SNPs its utility is primarily in the domain of demographics and history. If the populations in question are well covered by the Human Origins 1 Array, I see no reason why one shouldn’t go with it. Not only does it have more information about biological function, but the number of markers is many fold greater. On the other hand, Geno 2.0 may be more useful on the “blank zones” of the Affy chip. Hopefully the Genographic Project results paper for Geno 2.0 will come out soon and I can pull down their data set and play with it.

Cite: arXiv:1212.4116

Unveiling the genealogical lattice

To understand nature in all its complexity we have to cut the riotous variety down to size. For ease of comprehension we formalize with math, verbalize with analogies, and visualize with representations. These approximations of reality are not reality, but when we look through the glass darkly they give us filaments of essential insight. Dalton’s model of the atom is false in important details (e.g., atoms turn out to be divisible into smaller particles, some of which are in turn made of quarks), but it still has conceptual utility.

Likewise, the phylogenetic trees popularized by L. L. Cavalli-Sforza in The History and Geography of Human Genes are still useful in understanding the shape of the human demographic past. But it seems that the bifurcating model of the tree must now be strongly tinted by the shades of reticulation. In a stylized sense inter-specific phylogenies, which assume the approximate truth of the biological species concept (i.e., little gene flow across lineages), mislead us when we think of the phylogeny of species on the microevolutionary scale of population genetics. On an intra-specific scale gene flow is not just a nuisance parameter in the model, it is an essential phenomenon which must be accommodated into the framework.


This is on my mind because of the emergence of packages such as TreeMix and AdmixTools. Using software such as this on the numerous public data sets allows one to perceive the reality of admixture, and overlay lateral gene flow upon the tree as a natural expectation. But perhaps a deeper result is that the character of the tree itself is torn asunder. The figure above is from a new paper, Efficient moment-based inference of admixture parameters and sources of gene flow, which debuts MixMapper. The authors bring a lot of mathematical heft to their exposition, and I can’t say I follow all of it (though some of the details are very similar to Pickrell et al.’s). But in short it seems that in comparison to TreeMix MixMapper allows for more powerful inference of a narrower set of populations, selected for exploring very specific questions. In contrast, TreeMix explores the whole landscape with minimal supervision. Having used the latter I can testify that that is true.

The big result from MixMapper is that it extends the result of Patterson et al., and confirms that modern Europeans seem to be an admixture between a “north Eurasian” population, and a vague “west Eurasian” population. Importantly, they find evidence of admixture in Sardinians, which implies that Patterson et al.’s original analyses were not sensitive to admixture in putative reference populations (note that Patterson is a coauthor on this paper as well). The rub, as noted in the paper, is that it is difficult to estimate admixture when you don’t have “pure” ancestral reference populations. And yet here the takeaway for me is that we may need to rethink our whole conception of pure ancestral populations, and imagine a human phylogenetic tree as a series of lattices in eternal flux, with admixed nodes periodically expanding so as to generate the artifice of a diversifying tree. The closer we look, the more likely it seems that most of the populations which have undergone demographic expansion in the past 10,000 years are also the products of admixture. Any story of the past 10,000 years, and likely the past 100,000 years, must give space at the center of the narrative arc to lateral gene flow across populations.

Cite: arXiv:1212.2555 [q-bio.PE]

December 13, 2012

We are Nature

Filed under: Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 8:03 am

There’s an interesting piece in Slate, The Great Schism in the Environmental Movement, which seems to be a distillation of trends which have been bubbling within the modern environmentalist movement for a generation now (I’ve read earlier manifestos in a similar vein). I can’t assess the magnitude of the shift, but here’s the top-line:

But that is a false construct that scientists and scholars have been demolishing the past few decades. Besides, there’s a growing scientific consensus that the contemporary human footprint—our cities, suburban sprawl, dams, agriculture, greenhouse gases, etc.—has so massively transformed the planet as to usher in a new geological epoch. It’s called the Anthropocene.

Modernist greens don’t dispute the ecological tumult associated with the Anthropocene. But this is the world as it is, they say, so we might as well reconcile the needs of people with the needs of nature. To this end, Kareiva advises conservationists to craft “a new vision of a planet in which nature—forests, wetlands, diverse species, and other ancient ecosystems—exists amid a wide variety of modern, human landscapes.”


Let’s take this debate as a given. It is fundamentally normative. That is, it is about values. We need to tread carefully before projecting values across disputants. Far too often in this domain people seem to presume normative alignments, and therefore confuse ideological disagreement for rejection of factual truths. But, one thing to consider is that it is probable that human beings have already radically reshaped the ecological character of the world over the past 100,000 years. The implicit model that many older environmental activists seem to present is a framework pitting man & the machine vs. nature (the Shire vs. Mordor). But it is just not a useful dichotomy for many.

It is possible that there was, and is, no “pristine” nature. These disparate perspectives come to the fore in particular in post-colonial landscapes settled by Europeans. There is a long tradition in these areas of transforming ‘natives’ into ‘Noble Savages,’ who have attained some idealized harmony with Nature. The reality is that it is not harmony that was attained, but equilibrium. The arrival of anatomically modern humans to Australia and the New World resulted in a ‘shock’ to the ecological system, as megafauna went extinct due to the new variable of human predation. Even if H. sapiens were not the sufficient condition for these extinctions (populations naturally go through cycles), it is likely they were necessary (i.e., humans might extirpate species during times of low census size). But it is not just the initial impact in terms of species turnover. Australian and Amerindian populations seem to have reshaped the long term character of the landscape through fire. Charles C. Mann argues in 1491 that the vast forests which colonial and early American settlers cleared were in fact second growth, which emerged in the wake of massive die-offs of indigenous peoples due to Old World disease.

All of this is fundamentally complicated. Instead of a decision tree with two options, ‘Civilization’ vs. ‘Nature,’ there is actually a space populated with a multitude of positions. As someone touched by a moderate amount of biophilia my vision for the future is one of arcology based urbanism, massively scaled up algaculture, and megafaunal rewilding through genetic engineering and ancient DNA. Rather than idealize a mythic past we should endeavor to forge a new future. So it was, and so shall it ever be.

December 12, 2012

A lighter shade of brown: Dan MacArthur, look east or south!

Filed under: Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 2:58 pm

South Indian Udupi cuisine

In the post below I offered up my supposition that Dan MacArthur’s ancestry is unlikely to be Northwest Indian, which precludes a Romani origin for his South Asian ancestry. Indeed this is almost certainly so: Dienekes Pontikos followed up my crude analyses with IBD-sharing calculations (IBD = ‘identity by descent,’ which is basically what you would think it is). The South Asian population which MacArthur has the closest affinity to is from Karnataka, which is one of the Dravidian speaking states of the South. This does not necessarily refute my earlier contention, as aside from Brahmins most Bengalis seem to have broad South Indian affinities, except for the fact that they often have more East Asian ancestry.


Now, I may seem a touch obsessive on this issue at this point. There are several things motivating me. First, this was lying around in plain sight, but we missed it for years! Second, I’ve known Dan for a while, so this is very amusing on a personal level. Third, Dienekes has been pushing me to continue my exploration in a friendly competition. None of this is very difficult, and I’ve been going at it in the early hours of the day before work, or right before I go to sleep. In short, I’m doing this in part to show that you don’t need to just talk genomics, you too can do genomics. Ironically the age of “Big Data” is also the age of distributed data.

A lighter shade of brown: the Dan MacArthur chronicles, not a Romani

Filed under: Anthroplogy,Daniel MacArthur,Human Genetics,Human Genomics — Razib Khan @ 9:25 am

Pakistani honor guard

A few days ago I suggested that Dr. Daniel MacArthur might have South Asian ancestry. Now, when confronted with surprise the best option is to stick with your prior assumption, unless that surprise is powerful enough for you to “update” your model. After a few days of further analysis I will update: I do think Dan MacArthur has South Asian ancestry. Dienekes dug further, and noticed that there are hallmarks of “Ancestral South Indian” ancestry along the first 2/3 or so of chromosome 10. Now, you do have to remember that this genomic region is only half South Asian. The other half is European.

But in any case, one question that some people brought up: perhaps MacArthur has Romani heritage? I’m skeptical of this partly because:

1) There weren’t that many Romani in Britain in the 19th century

2) The British Romani are already very highly admixed

Another friend, who is a population genomicist himself, expressed some skepticism that such a long segment wasn’t broken up by recombination over the generations. My only moderately informed answer is this: we’d only notice the long segments, because if a very small region of ‘exotic’ ancestry was embedded within the dominant ancestral component it probably would not show up on some of these tests (or, we’d assume it was noise). Dan has another segment of South Asian ancestry, but much smaller in size. It may be there are other regions which we could find if we used better reference populations.

Here’s what I tentatively want to do with Dan’s data now. First, take the 80 Mb or so which has South Asian ancestry, and phase it. That way I’d have a South Asian chromosome and a European one, and we could look for matches for only the South Asian one. But being busy I didn’t have time to do this. What I did have time to do was reduce the chromosomal region under consideration, and then run an IBS distance analysis in a private data set I have. This is a crude, but not always uninformative analysis. But by looking at the relationships I can now conclude that Dan MacArthur probably does not have Romani ancestry. Why? Because the Romani are of Northwest Indian heritage, and MacArthur’s match pattern using the diploid genotype (so South Asian + European) does not match what I expect would emerge from such a combination.
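For those unfamiliar with the idea, an IBS ("identity by state") distance between two people can be sketched as below. This is a generic illustration, not the exact statistic or software behind the numbers in this post: the 0/1/2 genotype coding, the handling of missing calls, and the function name are all assumptions.

```python
def ibs_distance(genotypes_a, genotypes_b):
    """Proportion of alleles NOT shared identical-by-state between two people.

    Genotypes are counts of one allele per SNP (0, 1 or 2); None marks a
    missing call, and SNPs missing in either person are skipped.
    """
    shared = 0
    compared = 0
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:
            continue
        # Two diploid genotypes share 2, 1 or 0 alleles by state.
        shared += 2 - abs(a - b)
        compared += 2
    if compared == 0:
        raise ValueError("no SNPs with calls in both individuals")
    return 1.0 - shared / compared

# Toy example with four SNPs, one of them missing in the first person.
print(ibs_distance([0, 1, 2, None], [1, 1, 0, 2]))
```

Computing this between one individual and everyone in a reference panel, restricted to the SNPs in the region of interest, and then sorting by distance gives a ranked match list. Because it works on unphased diploid genotypes, it mixes the signal from both chromosomes, which is why it is crude.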

The full table is below, but to me the fact that he has so many matches with Northwest Indian populations is evidence that his ancestry was not Northwest Indian. Otherwise, he would be matching more Utah white (CEU samples) more often. Rather, someone with a mix of more conventional South Asian ancestry and European ancestry often resembles some of the less South Asian populations of South Asia (e.g., Brahui) in these crude measures. In fact, one of the closest matches to Dan’s IBS profile’s is that of my own mother. She is a rather vanilla ethnic Bengali, so I think there is a strong chance that his Indian ancestry is similar. This weak genetic data isn’t really the primary reason. The British East India company operated out of Bengal for much of its history, and there are simply a lot of Bengalis.

There’s a lot more that can be done here. Since I don’t have time, here’s the pedigree file if anyone wants to play with them (Dan is DGM001).

Population Genetic distance from Dan Standardized distance
Brahui 0.253 81.268
Burusho 0.257 82.736
Razib’s Mother 0.258 82.783
CEU 0.258 82.993
Burusho 0.258 83.024
CEU 0.26 83.547
Sakilli 0.26 83.555
Brahui 0.261 83.831
Brahui 0.261 83.857
GIH 0.261 83.955
CEU 0.261 83.972
CEU 0.261 83.985
CEU 0.262 84.043
North Kannadi 0.262 84.169
CEU 0.262 84.207
CEU 0.262 84.318
CEU 0.262 84.33
CEU 0.263 84.391
Paniya 0.263 84.408
CEU 0.263 84.437
CEU 0.263 84.445
CEU 0.263 84.488
CEU 0.263 84.606
CEU 0.263 84.609
CEU 0.264 84.691
Brahui 0.264 84.709
CEU 0.264 84.752
CEU 0.264 84.764
Brahui 0.264 84.822
GIH 0.264 84.826
Burusho 0.264 84.841
CEU 0.264 84.898
CEU 0.264 84.975
North Kannadi 0.264 84.992
CEU 0.265 85.087
Paniya 0.265 85.212
CEU 0.265 85.226
CEU 0.265 85.25
CEU 0.265 85.25
CEU 0.265 85.278
CEU 0.265 85.299
North Kannadi 0.265 85.3
Burusho 0.265 85.309
Burusho 0.266 85.328
CEU 0.266 85.363
CEU 0.266 85.409
North Kannadi 0.266 85.412
CEU 0.266 85.436
Burusho 0.266 85.446
Bene Israel 0.266 85.508
CEU 0.266 85.521
GIH 0.266 85.618
GIH 0.267 85.661
CEU 0.267 85.696
CEU 0.267 85.722
CEU 0.267 85.732
Brahui 0.267 85.777
GIH 0.267 85.793
CEU 0.267 85.799
CEU 0.267 85.816
Cochin Jews 0.267 85.85
CEU 0.267 85.943
Brahui 0.268 85.996
CEU 0.268 86.005
Cochin Jews 0.268 86.011
CEU 0.268 86.08
CEU 0.268 86.115
CEU 0.268 86.18
GIH 0.268 86.229
Cochin Jews 0.268 86.234
CEU 0.268 86.244
Burusho 0.268 86.265
CEU 0.268 86.277
CEU 0.268 86.278
CEU 0.269 86.288
CEU 0.269 86.291
CEU 0.269 86.318
CEU 0.269 86.325
CEU 0.269 86.326
GIH 0.269 86.327
CEU 0.269 86.329
CEU 0.269 86.354
CEU 0.269 86.387
CEU 0.269 86.463
CEU 0.269 86.515
CEU 0.269 86.517
CEU 0.269 86.55
CEU 0.27 86.609
Paniya 0.27 86.682
CEU 0.27 86.687
CEU 0.27 86.696
CEU 0.27 86.717
CEU 0.27 86.733
Sakilli 0.27 86.74
CEU 0.27 86.866
Malayan 0.27 86.879
North Kannadi 0.27 86.883
CEU 0.271 86.937
Brahui 0.271 86.952
Burusho 0.271 86.956
CEU 0.271 86.957
CEU 0.271 86.977
North Kannadi 0.271 86.995
GIH 0.271 87.018
CEU 0.271 87.042
CEU 0.271 87.066
CEU 0.271 87.07
Brahui 0.271 87.09
Bene Israel 0.271 87.094
Sakilli 0.271 87.141
CEU 0.271 87.2
CEU 0.271 87.24
North Kannadi 0.272 87.253
CEU 0.272 87.297
Burusho 0.272 87.307
CEU 0.272 87.327
GIH 0.272 87.353
CEU 0.272 87.355
Cochin Jews 0.272 87.381
CEU 0.272 87.384
CEU 0.272 87.5
CEU 0.272 87.535
CEU 0.273 87.594
Malayan 0.273 87.676
CEU 0.273 87.702
CEU 0.273 87.741
Burusho 0.273 87.806
CEU 0.273 87.846
Cambodians 0.274 87.932
North Kannadi 0.274 87.951
CEU 0.274 87.951
Burusho 0.274 88.03
CEU 0.274 88.047
CEU 0.274 88.081
CEU 0.274 88.089
CEU 0.274 88.101
CEU 0.274 88.179
CEU 0.274 88.19
North Kannadi 0.275 88.243
CEU 0.275 88.32
GIH 0.275 88.325
CEU 0.275 88.349
Brahui 0.275 88.393
CEU 0.275 88.402
CEU 0.275 88.457
Bene Israel 0.276 88.552
CEU 0.276 88.577
CEU 0.276 88.603
CEU 0.276 88.647
CEU 0.276 88.7
CEU 0.276 88.729
CEU 0.276 88.814
CEU 0.276 88.85
Brahui 0.276 88.855
CEU 0.277 88.923
GIH 0.277 88.99
Paniya 0.277 89.082
CEU 0.277 89.118
CEU 0.277 89.15
CEU 0.277 89.151
CEU 0.277 89.17
CEU 0.278 89.184
Cambodians 0.278 89.208
Cambodians 0.278 89.233
Cambodians 0.278 89.383
CEU 0.278 89.45
CEU 0.278 89.493
Cambodians 0.279 89.522
CEU 0.279 89.595
CEU 0.279 89.679
CEU 0.279 89.753
CEU 0.279 89.762
CEU 0.279 89.807
Cambodians 0.28 89.942
GIH 0.28 90.085
CEU 0.281 90.178
Brahui 0.281 90.364
Cambodians 0.282 90.543
Cambodians 0.282 90.559
Cambodians 0.282 90.77
Cambodians 0.283 90.898
CEU 0.283 90.956
CEU 0.284 91.316
CHD 0.289 92.952
Sakilli 0.29 93.103
Bene Israel 0.29 93.122
CHD 0.291 93.619
CHD 0.291 93.663
CHD 0.293 94.125
CHD 0.293 94.248
CHD 0.294 94.451
CHD 0.294 94.629
CHD 0.296 94.965
CHD 0.296 95.279
Yorubas 0.297 95.298
CHD 0.297 95.368
CHD 0.297 95.438
CHD 0.297 95.441
Yorubas 0.297 95.567
CHD 0.298 95.678
CHD 0.298 95.828
CHD 0.299 96.032
CHD 0.299 96.127
CHD 0.3 96.349
CHD 0.3 96.403
CHD 0.3 96.443
CHD 0.3 96.508
CHD 0.3 96.523
CHD 0.3 96.533
CHD 0.301 96.575
CHD 0.301 96.598
CHD 0.301 96.624
CHD 0.301 96.625
CHD 0.301 96.738
CHD 0.301 96.758
CHD 0.301 96.869
Yorubas 0.302 97.106
CHD 0.303 97.37
CHD 0.303 97.41
Yorubas 0.304 97.681
CHD 0.304 97.713
CHD 0.304 97.747
Yorubas 0.304 97.829
CHD 0.304 97.838
CHD 0.305 98.106
CHD 0.306 98.309
Yorubas 0.307 98.499
CHD 0.307 98.546
CHD 0.307 98.547
CHD 0.307 98.606
CHD 0.307 98.764
CHD 0.307 98.78
CHD 0.307 98.803
Yorubas 0.308 98.947
Yorubas 0.308 99.03
Yorubas 0.309 99.411
Yorubas 0.309 99.417
CHD 0.309 99.452
CHD 0.31 99.624
Yorubas 0.311 100

A lighter shade of brown: the Dan MacArthur chronicles, not a Romani

Filed under: Anthroplogy,Daniel MacArthur,Human Genetics,Human Genomics — Razib Khan @ 9:25 am

Pakistani honor guard

A few days ago I suggested that Dr. Daniel MacArthur might have South Asian ancestry. Now, when confronted with surprise the best option is to stick with your prior assumption, unless that surprise is powerful enough for you to “update” your model. After a few days of further analysis I will update: I do think Dan MacArthur has South Asian ancestry. Dienekes dug further, and noticed that there are hallmarks of “Ancestral South Indian” ancestry along the first 2/3 or so of chromosome 10. Now, you do have to remember that this genomic region is only half South Asian. The other half is European.

But in any case, one question that some people brought up: perhaps MacArthur has Romani heritage? I’m skeptical of this partly because:

1) There weren’t that many Romani in Britain in the 19th century

2) The British Romani are already very highly admixed

Another friend, who is a population genomicist himself, expressed some skepticism that such a long segment wasn’t broken up by recombination over the generations. My only moderately informed answer is this: we’d only notice the long segments, because if a very small region of ‘exotic’ ancestry was embedded within the dominant ancestral component it probably would not show up on some of these tests (or, we’d assume it was noise). Dan has another segment of South Asian ancestry, but much smaller in size. It may be there are other regions which we could find if we used better reference populations.

Here’s what I tentatively want to do with Dan’s data now. First, take the 80 Mb or so which has South Asian ancestry, and phase it. That way I’d have a South Asian chromosome and a European one, and we could look for matches for only the South Asian one. But being busy I didn’t have time to do this. What I did have time to do was reduce the chromosomal region under consideration, and then run an IBS (identity-by-state) distance analysis in a private data set I have. This is a crude, but not always uninformative, analysis. But by looking at the relationships I can now conclude that Dan MacArthur probably does not have Romani ancestry. Why? Because the Romani are of Northwest Indian heritage, and MacArthur’s match pattern using the diploid genotype (so South Asian + European) does not match what I expect would emerge from such a combination.
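
For readers who want to try something similar, here is a minimal sketch of an IBS-style distance ranking. It is not the analysis behind the table below: the post does not say which software or exact distance definition was used, so the 0/1/2 genotype coding, the per-SNP distance |g1 - g2| / 2, and the function names `ibs_distance` and `rank_by_distance` are all assumptions made for illustration.

```python
def ibs_distance(g1, g2):
    """Mean allele-sharing distance between two diploid genotype vectors.

    Assumed coding: 0, 1 or 2 copies of a reference allele, None for a
    missing call. Per-SNP distance is |g1 - g2| / 2, so identical genotypes
    score 0 and opposite homozygotes score 1.
    """
    total = 0.0
    compared = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # skip SNPs without a call in both samples
        total += abs(a - b) / 2
        compared += 1
    if compared == 0:
        raise ValueError("no SNPs with calls in both samples")
    return total / compared


def rank_by_distance(target, panel):
    """Return (distance, label) pairs for a panel, closest sample first."""
    return sorted((ibs_distance(target, g), label) for label, g in panel.items())


if __name__ == "__main__":
    # Toy genotypes, invented purely to show the calling convention.
    target = [0, 1, 2, 1, 0, 2]
    panel = {
        "sample_A": [0, 1, 2, 1, 1, 2],
        "sample_B": [2, 1, 0, 1, 2, 0],
    }
    for distance, label in rank_by_distance(target, panel):
        print(label, round(distance, 3))
```

Because the target genotype here is diploid and unphased, a half-European, half-South Asian segment will sit somewhere between the two parental groups, which is why the ranking has to be read with care.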

The full table is below, but to me the fact that he has so many matches with Northwest Indian populations is evidence that his ancestry was not Northwest Indian. Otherwise, he would be matching Utah whites (CEU samples) more often. Rather, someone with a mix of more conventional South Asian ancestry and European ancestry often resembles some of the less South Asian populations of South Asia (e.g., Brahui) in these crude measures. In fact, one of the closest matches to Dan’s IBS profile is that of my own mother. She is a rather vanilla ethnic Bengali, so I think there is a strong chance that his Indian ancestry is similar. This weak genetic data isn’t really the primary reason. The British East India Company operated out of Bengal for much of its history, and there are simply a lot of Bengalis.

There’s a lot more that can be done here. Since I don’t have time, here’s the pedigree file if anyone wants to play with it (Dan is DGM001).

Population Genetic distance from Dan Standardized distance
Brahui 0.253 81.268
Burusho 0.257 82.736
Razib’s Mother 0.258 82.783
CEU 0.258 82.993
Burusho 0.258 83.024
CEU 0.26 83.547
Sakilli 0.26 83.555
Brahui 0.261 83.831
Brahui 0.261 83.857
GIH 0.261 83.955
CEU 0.261 83.972
CEU 0.261 83.985
CEU 0.262 84.043
North Kannadi 0.262 84.169
CEU 0.262 84.207
CEU 0.262 84.318
CEU 0.262 84.33
CEU 0.263 84.391
Paniya 0.263 84.408
CEU 0.263 84.437
CEU 0.263 84.445
CEU 0.263 84.488
CEU 0.263 84.606
CEU 0.263 84.609
CEU 0.264 84.691
Brahui 0.264 84.709
CEU 0.264 84.752
CEU 0.264 84.764
Brahui 0.264 84.822
GIH 0.264 84.826
Burusho 0.264 84.841
CEU 0.264 84.898
CEU 0.264 84.975
North Kannadi 0.264 84.992
CEU 0.265 85.087
Paniya 0.265 85.212
CEU 0.265 85.226
CEU 0.265 85.25
CEU 0.265 85.25
CEU 0.265 85.278
CEU 0.265 85.299
North Kannadi 0.265 85.3
Burusho 0.265 85.309
Burusho 0.266 85.328
CEU 0.266 85.363
CEU 0.266 85.409
North Kannadi 0.266 85.412
CEU 0.266 85.436
Burusho 0.266 85.446
Bene Israel 0.266 85.508
CEU 0.266 85.521
GIH 0.266 85.618
GIH 0.267 85.661
CEU 0.267 85.696
CEU 0.267 85.722
CEU 0.267 85.732
Brahui 0.267 85.777
GIH 0.267 85.793
CEU 0.267 85.799
CEU 0.267 85.816
Cochin Jews 0.267 85.85
CEU 0.267 85.943
Brahui 0.268 85.996
CEU 0.268 86.005
Cochin Jews 0.268 86.011
CEU 0.268 86.08
CEU 0.268 86.115
CEU 0.268 86.18
GIH 0.268 86.229
Cochin Jews 0.268 86.234
CEU 0.268 86.244
Burusho 0.268 86.265
CEU 0.268 86.277
CEU 0.268 86.278
CEU 0.269 86.288
CEU 0.269 86.291
CEU 0.269 86.318
CEU 0.269 86.325
CEU 0.269 86.326
GIH 0.269 86.327
CEU 0.269 86.329
CEU 0.269 86.354
CEU 0.269 86.387
CEU 0.269 86.463
CEU 0.269 86.515
CEU 0.269 86.517
CEU 0.269 86.55
CEU 0.27 86.609
Paniya 0.27 86.682
CEU 0.27 86.687
CEU 0.27 86.696
CEU 0.27 86.717
CEU 0.27 86.733
Sakilli 0.27 86.74
CEU 0.27 86.866
Malayan 0.27 86.879
North Kannadi 0.27 86.883
CEU 0.271 86.937
Brahui 0.271 86.952
Burusho 0.271 86.956
CEU 0.271 86.957
CEU 0.271 86.977
North Kannadi 0.271 86.995
GIH 0.271 87.018
CEU 0.271 87.042
CEU 0.271 87.066
CEU 0.271 87.07
Brahui 0.271 87.09
Bene Israel 0.271 87.094
Sakilli 0.271 87.141
CEU 0.271 87.2
CEU 0.271 87.24
North Kannadi 0.272 87.253
CEU 0.272 87.297
Burusho 0.272 87.307
CEU 0.272 87.327
GIH 0.272 87.353
CEU 0.272 87.355
Cochin Jews 0.272 87.381
CEU 0.272 87.384
CEU 0.272 87.5
CEU 0.272 87.535
CEU 0.273 87.594
Malayan 0.273 87.676
CEU 0.273 87.702
CEU 0.273 87.741
Burusho 0.273 87.806
CEU 0.273 87.846
Cambodians 0.274 87.932
North Kannadi 0.274 87.951
CEU 0.274 87.951
Burusho 0.274 88.03
CEU 0.274 88.047
CEU 0.274 88.081
CEU 0.274 88.089
CEU 0.274 88.101
CEU 0.274 88.179
CEU 0.274 88.19
North Kannadi 0.275 88.243
CEU 0.275 88.32
GIH 0.275 88.325
CEU 0.275 88.349
Brahui 0.275 88.393
CEU 0.275 88.402
CEU 0.275 88.457
Bene Israel 0.276 88.552
CEU 0.276 88.577
CEU 0.276 88.603
CEU 0.276 88.647
CEU 0.276 88.7
CEU 0.276 88.729
CEU 0.276 88.814
CEU 0.276 88.85
Brahui 0.276 88.855
CEU 0.277 88.923
GIH 0.277 88.99
Paniya 0.277 89.082
CEU 0.277 89.118
CEU 0.277 89.15
CEU 0.277 89.151
CEU 0.277 89.17
CEU 0.278 89.184
Cambodians 0.278 89.208
Cambodians 0.278 89.233
Cambodians 0.278 89.383
CEU 0.278 89.45
CEU 0.278 89.493
Cambodians 0.279 89.522
CEU 0.279 89.595
CEU 0.279 89.679
CEU 0.279 89.753
CEU 0.279 89.762
CEU 0.279 89.807
Cambodians 0.28 89.942
GIH 0.28 90.085
CEU 0.281 90.178
Brahui 0.281 90.364
Cambodians 0.282 90.543
Cambodians 0.282 90.559
Cambodians 0.282 90.77
Cambodians 0.283 90.898
CEU 0.283 90.956
CEU 0.284 91.316
CHD 0.289 92.952
Sakilli 0.29 93.103
Bene Israel 0.29 93.122
CHD 0.291 93.619
CHD 0.291 93.663
CHD 0.293 94.125
CHD 0.293 94.248
CHD 0.294 94.451
CHD 0.294 94.629
CHD 0.296 94.965
CHD 0.296 95.279
Yorubas 0.297 95.298
CHD 0.297 95.368
CHD 0.297 95.438
CHD 0.297 95.441
Yorubas 0.297 95.567
CHD 0.298 95.678
CHD 0.298 95.828
CHD 0.299 96.032
CHD 0.299 96.127
CHD 0.3 96.349
CHD 0.3 96.403
CHD 0.3 96.443
CHD 0.3 96.508
CHD 0.3 96.523
CHD 0.3 96.533
CHD 0.301 96.575
CHD 0.301 96.598
CHD 0.301 96.624
CHD 0.301 96.625
CHD 0.301 96.738
CHD 0.301 96.758
CHD 0.301 96.869
Yorubas 0.302 97.106
CHD 0.303 97.37
CHD 0.303 97.41
Yorubas 0.304 97.681
CHD 0.304 97.713
CHD 0.304 97.747
Yorubas 0.304 97.829
CHD 0.304 97.838
CHD 0.305 98.106
CHD 0.306 98.309
Yorubas 0.307 98.499
CHD 0.307 98.546
CHD 0.307 98.547
CHD 0.307 98.606
CHD 0.307 98.764
CHD 0.307 98.78
CHD 0.307 98.803
Yorubas 0.308 98.947
Yorubas 0.308 99.03
Yorubas 0.309 99.411
Yorubas 0.309 99.417
CHD 0.309 99.452
CHD 0.31 99.624
Yorubas 0.311 100

Don’t wait to have children!

Filed under: Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 7:34 am

The New Republic has a piece up, How Older Parenthood Will Upend American Society, which won’t have surprising data for readers of this weblog. But it’s nice to see this sort of thing go “mainstream.” My daughter was born when her parents were in their mid-30s, so I know all the statistics. They aren’t good bed-time reading (she’s healthy and robust so far!). If I had to do it over again I definitely wouldn’t have waited this long. Becoming a father brought home to me that waiting was one of the worst decisions of my life. Why postpone something this incredible for the far more prosaic pleasures of an extended adolescence? Granted, I’m not sure that I would have been the best father at 25, but I don’t think there’s much I can say in reply to the argument that I should have become a father by 30.

More concretely, we would have had sperm and egg “banked” if we had been smart about delaying parenthood. The article notes that storage of sperm costs $850 up front, and $300 to $500 per year after that, and that many balk at the cost. And how much do you spend on your cell phone every year? The issue here seems to be time preference.


Of course one of the aspects of the piece which is implicit is its class-orientation. It finishes up:

It won’t be easy to make the world more baby-friendly, but if we were to try, we’d have to restructure the professions so that the most intensely competitive stage of a career doesn’t occur right at the moment when couples should be lavishing attention on infants….

Though the fertility transition has occurred across all groups, it has been most striking among the professional classes. But the professional classes are a minority of the population. The reality is that we’d expect fertility to rebound over time as professional class people, whether for genetic or cultural (likely both) reasons, fail to reproduce, and are replaced by those who produce large families early, and continue to do so in a heritable fashion.

 
The plot to the left illustrates the dynamic which I’m alluding to. If at t = 0 you have two populations where one outnumbers the second by a factor of 100, and the population which is at ~1% proportion has a doubling-time of 30 years vs. replacement for the majority, then it takes ~200 years for the minority to outnumber the majority. Of course, we need to be careful about projections, something I’ve criticized more gullible or historically naive readers about. Mormon fundamentalists or Old Order Amish have very high fertility rates in the context of being embedded in advanced societies which have very low fertility (at least in relation). But context-free models are useless. For example, some hyper-fertile groups, such as Mormon fundamentalists and Hasidic Jews, have been known to exploit the welfare state to subsidize their natalism. Such a subsidy lasts only so long as the ratio between producers and consumers favors the former. If a trend is unsustainable, it won’t be sustained. In the short term the “limits to growth” are probably not capped by resources, but by human capital, which allows us to have increasing marginal returns. Systems do return to equilibria eventually.
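
The ~200 year figure can be checked with a few lines of arithmetic. This is a sketch under the same simplifying assumptions as the paragraph above (a majority held exactly at replacement, a minority that doubles every 30 years, no mixing between the two groups); the function name `years_to_parity` is just an illustrative label.

```python
import math


def years_to_parity(initial_ratio, doubling_time_years):
    """Years until a doubling minority equals a majority held at replacement.

    The minority must grow by a factor of `initial_ratio`, which takes
    log2(initial_ratio) doublings.
    """
    return doubling_time_years * math.log2(initial_ratio)


if __name__ == "__main__":
    # Majority outnumbers minority 100 to 1; minority doubles every 30 years.
    print(round(years_to_parity(100, 30), 1))  # about 199.3 years
```

That is a little under seven doublings, which is why the answer is so sensitive to whether the doubling time actually holds for two centuries.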

As I don’t see a shift back to younger families among the professional classes, what’s the solution? Massive sperm and egg banking of 20 year old individuals seems like an economically feasible and effective strategy. I wouldn’t be surprised if the math works out that this reduces our long term health care costs, by cutting down on children born with congenital defects and decreasing the lifetime morbidity rate of the population as a whole. Want to bend the cost curve? This might be a way! Combined with pre-implantation genetic diagnosis the ‘solution’ then is probably more technology, not less.

The long term fix of course is to reform the professional system and the culture of careerism so that what is rewarded is genuine productivity, rather than signalling. By this, I mean that I suspect many people who work “80 hours a week” aren’t all that productive beyond about 50, but they do signal to their superiors that they are “committed.” And of course now the whole peer group has to shift toward this irrational norm. Better workplace monitoring and productivity assessments might be a way to address this issue, so that superiors can actually evaluate gains to hours of putative labor.

Finally, the General Social Survey asks people about the ideal number of children they’d like to have, versus the actual number of children they do have (CHLDIDEL and CHILDS). To remove demographic confounds I limited the sample below to non-Hispanic white women age 45 and up between the years 2006 and 2010, and compared across educational attainment.

December 11, 2012

We don’t know why Ethiopians breathe easy

Filed under: Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 9:40 pm

Most people are aware that altitude imposes constraints on individual performance and function. Much of this is flexible; athletes who train at high altitudes may gain a performance edge. But over the long term there are costs, just as there are with computers which are ‘overclocked.’ This is the point where you make the transition from physiology to evolution. Residence at high altitude entails strong selective pressures on populations. Over the past few years there has been a great deal of exploration of the genetics of long resident high altitude groups, the Tibetans, Peruvians, and Ethiopians.

Why is high altitude adaptation of interest? First, it’s clear and distinct. Coding the two phenotypes is not that difficult. Second, the adaptive value is clear. Though people can survive at high altitudes, most do not flourish. And third, you have several distinct populations which are phylogenetically diverse. In other words, you can test evolution across multiple lineages, and see if the phenotypic and genetic features resemble each other, or differ. To some extent we know the answer: though there are overlaps, altitude adaptations differ both in their physiological presentation, and in their genetic architecture. Because it is such a powerful force, the shape of adaptation to high altitudes gives us a better sense of the arc of evolution across diverse populations and times.

With all that under our belt, I have to admit that a new paper in PLoS Genetics, The Genetic Architecture of Adaptations to High Altitude in Ethiopia, is quite peculiar in my opinion. To not put too fine a point on it, it seems hurried and confused. Granted, from what I can gather the authors confirmed the likelihood that the genetic architecture of high altitude adaptations in Ethiopian populations is such that variance exhibits a more polygenic character, resulting in weaker ability to detect genes of large effect. But if you read the paper you notice that they go off on a particular tangent on the adaptation to diseases at immune loci. Much of the work over the past 10 years on recent evolution in humans has focused on immune response, so I don’t see why that was necessary for the broader paper.

Additionally, there’s one element which I think needs to be explored more: the phylogenetic history of the Ethiopian people. The paper makes the blithe assertion that Ethiopians have been resident in the highlands of that nation for tens of thousands of years, but the latest research suggests that this just isn’t so. In short, it may be that Ethiopians as we understand them are recent compounds of a long resident Sub-Saharan African population and West Eurasian migrants. I’m not sure that this necessarily impacts the high altitude adaptations, as the admixed populations would presumably inherit the alleles from the local residents if they did not exterminate them. But, it does seem to complicate the story that you’d want to tell.

Overall, I think this sort of paper, which follows on the heels of other work, should have been presented in a more explicitly evolutionary context. If these adaptations are old, then there may be many variants in the highlands of Ethiopia. In contrast, the Peruvian adaptations are presumably new, and from what I have been told they are more like improvisations on the fly. In a way the spectrum of subtlety of adaptation, with Ethiopians being the most nuanced, and Peruvians the least (with Tibetans in the middle), gives us a sense of the different power of evolution on the scale of 50,000 years vs. 10,000 years. Additionally, the region around Ethiopia has long been inhabited by hominins. No prior population was resident on the Altiplano before the arrival of the first Amerindians ~10,000 years before the present.

I’m hoping that the Nielsen Group comes out with a better paper. I know they’re going to revisit the topic of altitude adaptation (so I was told at ASHG 2012).

Citation: Alkorta-Aranburu G, Beall CM, Witonsky DB, Gebremedhin A, Pritchard JK, et al. (2012) The Genetic Architecture of Adaptations to High Altitude in Ethiopia. PLoS Genet 8(12): e1003110. doi:10.1371/journal.pgen.1003110

December 10, 2012

Is Daniel MacArthur ‘desi’?

My initial inclination in this post was to discuss a recent ordering snafu which resulted in many of my friends being quite peeved at 23andMe. But browsing through their new ‘ancestry composition’ feature I thought I had to discuss it first, because of some nerd-level intrigue. Though I agree with many of Dienekes’ concerns about this new feature, I have to admit that at least this method doesn’t give out positively misleading results. For example, I had complained earlier that ‘ancestry painting’ gave literally crazy results when they weren’t trivial. It said I was ~60 percent European, which makes some coherent sense in their non-optimal reference population set, but then stated that my daughter was >90 percent European. Since 23andMe did confirm she was 50% identical by descent with me these results didn’t make sense; some readers suggested that there was a strong bias in their algorithms to assign ambiguous genomic segments to ‘European’ heritage (this was a problem for East Africans too).

Here’s my daughter’s new chromosome painting:

One aspect of 23andMe’s new ancestry composition feature is that it is very Eurocentric. But, most of the customers are white, and presumably the reference populations they used (which are from customers) are also white. Though there are plenty of public domain non-white data sets they could have used, I assume they’d prefer to eat their own data dog-food in this case. But that’s really a minor gripe in the grand scheme of things. This is a huge upgrade from what came before. Now, it’s not telling me, as a South Asian, very much. But, it’s not telling me ludicrous things anymore either!

But in regards to omission I am curious to know why this new feature rates my family as only ~3% East Asian, when other analyses put us in the 10-15% range. The problem with very high values is that South Asians often have some residual ‘eastern’ signal, which I suspect is not real admixture, but is an artifact. Nevertheless, northeast Indians, including Bengalis, often have genuine East Asian admixture. On PCA plots my family is shifted considerably toward East Asians. The signal they are picking up probably isn’t noise. Almost every apportionment of East Asian ancestry I’ve seen for my family yields a greater value for my mother, and that holds here. It’s just that the values are implausibly low.

In any case, that’s not the strangest thing I saw. I was clicking around people who I had “shared” genomes with, and I stumbled upon this:

As you can guess from the screenshot this is Daniel MacArthur’s profile. And according to this ~25% of chromosome 10 is South Asian! At first blush this seemed totally nonsensical to me, so I clicked around other profiles of people of similar Northern European background…and I didn’t see anything equivalent.

What to do? It’s going to take more evidence than this to shake my prior assumptions, so I downloaded Dr. MacArthur’s genotype. Then I merged it with three HapMap populations, the Utah whites (CEU), the Gujaratis (GIH), and the Chinese from Denver (CHD). The last was basically a control. I pulled out chromosome 10. I also added Dan’s wife Ilana to the data set, since I believe she got typed with the same Illumina chip, and is of similar ethnic background (i.e., very white). It is important to note that only 28,000 SNPs remained in the data set. But usually 10,000 is more than sufficient on SNP data for model-based clustering with inter-continental scale variation.

I did two things:

1) I ran ADMIXTURE at K = 3, unsupervised

2) I ran an MDS, which visualized the genetic variation in multiple dimensions

Before I go on, I will state what I found: these methods supported the inference from 23andMe. On chromosome 10 Dr. MacArthur seems to have an affinity with South Asians (i.e., this is his ‘curry chromosome’). Here are the average (median) values in tabular format, with MacArthur and his wife presented for comparison.

ADMIXTURE results for chromosome 10 (K = 3)
Sample              K1      K2      K3
CEU                 0.04    0.02    0.93
GIH                 0.87    0.05    0.08
CHD                 0.01    0.97    0.01
Daniel MacArthur    0.29    0.07    0.64
Ilana Fisher        0.01    0.06    0.94

You probably want a distribution. Out of the non-founder CEU sample none went above 20% South Asian, though it did surprise me that a few were that high, which makes it more plausible to me that MacArthur’s results on chromosome 10 are a fluke:

And here’s the MDS with the two largest dimensions:

Again, it’s evident that this chromosome 10 is shifted toward South Asians. If I had more time right now what I’d do is probably get that specific chromosomal segment, phase it, and then compare it to various South Asian populations. But I don’t have time now, so I went and checked out the results from the Interpretome. I cranked up the settings to reduce the noise, and so that it would only spit out the most robust and significant results. As you can see, again chromosome 10 comes up as the one which isn’t quite like the others.

Is there a plausible explanation for this? Perhaps Dr. MacArthur can call up a helpful relative? From what I recall his parents are immigrants from the United Kingdom, and it isn’t unheard of that white Britons do have South Asian ancestry which dates back to the 19th century. Though to be totally honest I’m rather agnostic about all this right now. This genotype has been “out” for years now, so how is it that no one has noticed this peculiarity? Perhaps the issue is that everyone was looking at the genome wide average, and it just doesn’t rise to the level of notice? What I really want to do is look at the distribution of all chromosomes and see how Daniel MacArthur’s chromosome 10 then stacks up. It might be a random act of nature yet.

Also, I guess I should add that at ~1.5% South Asian that would be consistent with one of MacArthur’s great-great-great-great grandparents being Indian. Assuming 25 year generation times that puts them in the mid-19th century. Of course, at such a low proportion the variance is going to be high, so it is quite possible that you need to push the real date of admixture one generation back, or one generation forward.
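
The genealogical arithmetic behind that estimate is short enough to spell out. This is a sketch of the expected value only: it assumes a single unadmixed ancestor and simple halving of ancestry each generation, and it ignores the large variance that recombination introduces this far back. The helper names are illustrative.

```python
def generations_back(n_greats):
    """Generations separating you from a (great-)^n grandparent.

    A grandparent is 2 generations back; each 'great' adds one more.
    """
    return n_greats + 2


def expected_ancestry_fraction(generations):
    """Expected share of the genome inherited from one ancestor."""
    return 0.5 ** generations


if __name__ == "__main__":
    g = generations_back(4)               # great-great-great-great grandparent
    print(g)                              # 6 generations
    print(expected_ancestry_fraction(g))  # 0.015625, i.e. about 1.6%
    print(g * 25)                         # 150 years at 25 years per generation
```

One generation either way changes the expectation to roughly 3.1% or 0.8%, which is the sense in which the date of admixture is only loosely pinned down.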

December 9, 2012

The origins of the Romani determined definitively

Filed under: Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 11:52 am

In many cases there are questions of a historical and ethnographic nature which are subject to controversy and debate. Scholarly arguments are laid out, and further dispute ensues. For decades progress seems fleeting, as one hypothesis is accepted, only to be subject to later revision. This sort of pattern gives succor to the most cynical and jaded of the ‘Post Modern’ set, especially when the ‘discourse’ in question is in the domain of science.

But thankfully these debates can come to an end in some cases. So it is with the origins of the European Romani, better known as ‘Gypsies’ (though the Roma are the most well known of the Romani, other groups within Europe have different ethnonyms). Obviously many of the basic elements have long been there, but I think the most recent genetic work now establishes a level of closure. Taking a step back, what do we know?

1) The Romani language seems to be Indo-Aryan, with a likely affinity with the northwest group of Indo-Aryan languages

2) The Romani presence in Europe only dates to the past ~1,000 years, with an entry point in the Byzantine Empire

3) They are an admixture between an ancestral Indian element, and local populations

4) Their history of endogamy has resulted in a strong genetic drift effect

The two papers which seem to nail the coffin shut on these questions use somewhat different methodologies. One relies on Y chromosomal STRs (hypervariable repeat regions) to generate a paternal phylogeny. Focusing just on the paternal phylogeny allows for one to make very robust genealogical inferences. Additionally, the authors had a very large data set across India. Their goal was to ascertain the exact region of origin of the Romani before they left India. As noted in bullet #1 there is already some evidence from their language that this must be in northwest India. The second paper uses a SNP-chip; hundreds of thousands of autosomal markers. This has been done to death for other populations, so the method isn’t new. Rather, it is that it is now being applied to the Romani.

First, the Y chromosomal paper. The Phylogeography of Y-Chromosome Haplogroup H1a1a-M82 Reveals the Likely Indian Origin of the European Romani Populations:

Linguistic and genetic studies on Roma populations inhabited in Europe have unequivocally traced these populations to the Indian subcontinent. However, the exact parental population group and time of the out-of-India dispersal have remained disputed. In the absence of archaeological records and with only scanty historical documentation of the Roma, comparative linguistic studies were the first to identify their Indian origin. Recently, molecular studies on the basis of disease-causing mutations and haploid DNA markers (i.e. mtDNA and Y-chromosome) supported the linguistic view. The presence of Indian-specific Y-chromosome haplogroup H1a1a-M82 and mtDNA haplogroups M5a1, M18 and M35b among Roma has corroborated that their South Asian origins and later admixture with Near Eastern and European populations. However, previous studies have left unanswered questions about the exact parental population groups in South Asia. Here we present a detailed phylogeographical study of Y-chromosomal haplogroup H1a1a-M82 in a data set of more than 10,000 global samples to discern a more precise ancestral source of European Romani populations. The phylogeographical patterns and diversity estimates indicate an early origin of this haplogroup in the Indian subcontinent and its further expansion to other regions. Tellingly, the short tandem repeat (STR) based network of H1a1a-M82 lineages displayed the closest connection of Romani haplotypes with the traditional scheduled caste and scheduled tribe population groups of northwestern India.

 

Two trees illustrate the results succinctly:

The bottom line:

- This particular Y chromosomal lineage, which is highly diagnostic of South Asian origin in the Romani, indicates that the Romani derive from the populations of northwest India.

- Additionally, within these populations the Romani Y chromosomal lineages derive from the lower caste elements: the scheduled castes and scheduled tribes.

But the above results don’t get directly at genome-wide admixture. The second paper does, using hundreds of thousands of markers to explore the Romani affinity to other populations. Reconstructing the Population History of European Romani from Genome-wide Data:

The Romani, the largest European minority group with approximately 11 million people…constitute a mosaic of languages, religions, and lifestyles while sharing a distinct social heritage. Linguistic…and genetic…studies have located the Romani origins in the Indian subcontinent. However, a genome-wide perspective on Romani origins and population substructure, as well as a detailed reconstruction of their demographic history, has yet to be provided. Our analyses based on genome-wide data from 13 Romani groups collected across Europe suggest that the Romani diaspora constitutes a single initial founder population that originated in north/northwestern India ∼1.5 thousand years ago (kya). Our results further indicate that after a rapid migration with moderate gene flow from the Near or Middle East, the European spread of the Romani people was via the Balkans starting ∼0.9 kya. The strong population substructure and high levels of homozygosity we found in the European Romani are in line with genetic isolation as well as differential gene flow in time and space with non-Romani Europeans. Overall, our genome-wide study sheds new light on the origins and demographic history of European Romani.

The plot to the left illustrates the relationship of the Romani to world-wide populations using multi-dimensional scaling, where genetic variation is decomposed into dimensions, and individuals are plotted on those dimensions. In short, the Romani exhibit a classic admixture cline pattern. That is, they are the products of a two-way admixture between populations which occupy distinct positions along a cline, and Romani individuals and populations are distributed along the cline in proportion to their admixture. One notable aspect is that the Romani actually form two clusters: one which manifests a strong ‘east’-‘west’ distribution, and another which seems to be located purely within the European cluster. The latter seems to be the Welsh Romani, who in the neighbor-joining tree (see the supplements) fall on the same branch as European populations, as opposed to the other Romani, who form their own clade.
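To make "distributed along the cline in proportion to their admixture" concrete, here is a minimal sketch of how one might read a rough admixture fraction off a plot like this. It is a geometric heuristic of my own for illustration, not the method used in the paper, and the coordinates are made up: it projects an individual onto the line between two hypothetical source-population centroids and reports how far along that line the individual sits.

```python
def admixture_fraction(point, source_a, source_b):
    """Fraction of the way from source_a to source_b at which `point` projects.

    All arguments are coordinate tuples in the same MDS space.
    0.0 means "at source_a", 1.0 means "at source_b".
    """
    axis = [b - a for a, b in zip(source_a, source_b)]
    offset = [p - a for a, p in zip(source_a, point)]
    axis_len_sq = sum(x * x for x in axis)
    if axis_len_sq == 0:
        raise ValueError("source centroids must differ")
    # Scalar projection of the individual onto the source_a -> source_b axis.
    t = sum(o * x for o, x in zip(offset, axis)) / axis_len_sq
    # Individuals beyond either centroid are clamped to the ends of the cline.
    return min(1.0, max(0.0, t))


# Hypothetical centroids and individual (illustrative numbers only).
european = (0.0, 0.0)
south_asian = (1.0, 0.2)
romani_like = (0.4, 0.08)
fraction = admixture_fraction(romani_like, european, south_asian)
print(round(fraction, 3))
```

Positions in an MDS plot are only approximately linear in ancestry, and drift pulls isolated groups off the line, so a number like this is a sanity check at best. That is why the model-based estimates matter.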

To drill down further you need to ascertain admixture with a model-based clustering algorithm. Ergo, ADMIXTURE. I’ve reedited the figure to illustrate the salient points. In particular, it is clear that the Roma populations except the Welsh have significant South Asian ancestry. The question is how much? To answer this question you need to know the source population in South Asia. A peculiar aspect of this plot is that the Romani have very little of the green ancestral component, which happens to be modal in the Middle East (not shown). This element happens to be highly enriched in many Pakistani populations, but not necessarily northwest Indian ones. Nevertheless, the issue that leaves me suspicious of this particular finding is that many of the European populations, in particular those groups (e.g., Balkans) which may have admixed with the Romani, have this element to an extent not evident in one of their presumed ‘daughter’ populations. I wonder if perhaps the peculiarities of Romani inbreeding have skewed the allele frequency distribution so much that you get strangeness like this.

I am not showing higher K’s because those break out with a Romani-cluster. Just like the Kalash-cluster, this is to a great extent a feature of the long term endogamy of these communities. With high levels of drift the allele frequencies of these groups move into a very peculiar space in relation to their parental populations, but one must not become confused and assume that the Romani or Kalash are themselves appropriate independent clusters in the same way that Europeans or East Asians are.

Using various forms of admixture analysis the authors seem to conclude that the Balkan Romani are 30-50% South Asian. This seems in line with intuition. But that still leaves open the question of who those South Asians were. As I noted above the most thorough Y chromosomal data point to the lower caste elements of northwest India. What do the autosomes say?

I don’t want to get into the technical details of how they tested the models, but it seems that one of the likely parental populations to the Romani had a close relationship to the Meghwal, a scheduled caste from northwest India. In other words, the autosomal results align very well with the Y chromosomal inferences. Additionally, the models tested imply that the Romani likely left South Asia ~1,000 years before the present, which aligns well with what is known from the historical record (though this is a case where I put much more stock in the historical record than inferences from population genetic models; look at the intervals).

Finally, there is the question of inbreeding. One aspect of the Romani genome that jumps out at you is that they have many long “runs-of-homozygosity” (ROH). This is totally expected, as decades of uniparental analyses suggested many population bottleneck events as the Romani spread throughout Europe. But the ROH patterns also unearth an interesting fact: some of the Balkan Romani clearly have recent European admixture, while the non-Balkan Romani had an initial period of admixture followed by endogamy. The latter scenario seems to resemble Ashkenazi Jews, while the former would suggest that the boundary between Romani and non-Romani in the Balkans is more fluid than is sometimes portrayed.
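For readers who haven’t run into ROH before, the idea can be sketched in a few lines. This is a deliberately naive version, not what the paper or standard tools do (those typically use sliding windows and tolerate a few heterozygous or missing calls). It assumes genotypes coded as 0/1/2 copies of the alternate allele along one chromosome, and the thresholds and toy data are my own placeholders.

```python
def find_roh(positions, genotypes, min_snps=3, min_length=1000):
    """Return (start_bp, end_bp, n_snps) for runs of consecutive homozygous calls.

    positions: base-pair coordinates, sorted ascending.
    genotypes: 0/1/2 alternate-allele counts; 1 is heterozygous and ends a run.
    """
    if len(positions) != len(genotypes):
        raise ValueError("positions and genotypes must have the same length")
    runs = []
    run_start = None  # index of the first SNP in the current homozygous run
    for i, genotype in enumerate(genotypes):
        if genotype in (0, 2):
            if run_start is None:
                run_start = i
        elif run_start is not None:
            runs.append((run_start, i - 1))
            run_start = None
    if run_start is not None:
        runs.append((run_start, len(genotypes) - 1))

    # Keep only runs that are long enough in both SNP count and physical span.
    kept = []
    for first, last in runs:
        n_snps = last - first + 1
        span = positions[last] - positions[first]
        if n_snps >= min_snps and span >= min_length:
            kept.append((positions[first], positions[last], n_snps))
    return kept


# Toy data: one heterozygous call (index 3) splits the chromosome into two runs.
toy_positions = [100, 600, 1500, 2000, 2600, 4000, 5200]
toy_genotypes = [0, 2, 2, 1, 0, 0, 2]
toy_runs = find_roh(toy_positions, toy_genotypes)
print(toy_runs)
```

The length distribution is what carries the signal: many long runs point to recent inbreeding, while shorter runs reflect older bottlenecks, since recombination has had more time to break the segments up.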

So there we have it. The Romani derive from lower caste populations of the northwest Indian subcontinent who seem to have left ~1,000 years ago. Over time they admixed with local populations, and are now 50-70% non-South Asian, with some groups being ~90% European (e.g., Welsh Romani). And they have a long history as an endogamous group, judging by their inbreeding.

December 2, 2012

More diversity in the Personal Genome Project

Filed under: Genomics,Human Genomics,PGP — Razib Khan @ 1:42 pm

Seeking Diversity (Especially Families):

Because the PGP is self-recruiting, we don’t have a very balanced set of participants. “Self-recruitment” means that all participants have enrolled in our project through word of mouth, finding our website and enrolling online. To put it bluntly, that means we mostly end up with young white men….

…Research within one or two racial/ethnic categories isn’t necessarily a virtue, biracial and multiracial heritage may be even more interesting to some researchers and can open more areas for future….

In particular, NIST is looking for “trios”: two parents and a child. Researchers like to use samples from trios because they know every piece of DNA in the child comes from one of the parents. This makes it easier to assess error rates — and that sort of quality control is what NIST expects the genome material to be used for. We think all such family groups are valuable, but current trios in the PGP haven’t been the most diverse….

Reader Paul brought this to my attention. I haven’t been too interested in the PGP for myself because it’s just so slow to “play” with whole genomes (~3 GB) as opposed to 1 million SNPs. But over Christmas I’ll look into signing up, and see if they are interested in my own “trio.” I also thought I’d pass this along to readers, though my readership actually looks almost exactly like current PGP participants, so I don’t know if I’ll be contributing to the problem.
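To make the trio point concrete, here is a minimal sketch of the kind of quality check the quoted passage alludes to: counting sites where a child’s genotype could not have come from the parents. The 0/1/2 genotype coding and the toy data are my own assumptions, and real pipelines also have to handle missing calls, sex chromosomes and genuine de novo mutations.

```python
def gametes(genotype):
    """Alleles (0 = reference, 1 = alternate) a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian_consistent(mother, father, child):
    """True if the child's genotype is one allele from each parent."""
    return any(m + f == child for m in gametes(mother) for f in gametes(father))


def mendelian_error_rate(trio_genotypes):
    """Fraction of sites whose (mother, father, child) genotypes are inconsistent."""
    if not trio_genotypes:
        raise ValueError("need at least one site")
    errors = sum(
        1 for mother, father, child in trio_genotypes
        if not is_mendelian_consistent(mother, father, child)
    )
    return errors / len(trio_genotypes)


# Toy sites as (mother, father, child); the last two are impossible without an error.
toy_sites = [(0, 0, 0), (0, 2, 1), (1, 1, 2), (0, 0, 1), (2, 2, 1)]
toy_rate = mendelian_error_rate(toy_sites)
print(toy_rate)
```

Since true de novo mutations are rare, nearly all such inconsistencies in a real trio are genotyping or sequencing errors, which is why the rate works as a quality metric.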

Layering genetic histories

Filed under: Anthroplogy,Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 12:14 pm

As a follow up to my post from yesterday, I decided to run TreeMix on a data set I happened to have had on hand (see Inference of Population Splits and Mixtures from Genome-Wide Allele Frequency Data for more on TreeMix). Basically I wanted to display a tree with, and without, gene flow.

The technical details are straightforward. I LD-pruned ~550,000 SNPs down to ~150,000. I ran TreeMix without and with migration parameters, with the Bantu Kenya population as the root. Finally, when I did turn on the migration parameter I set it to 5. You can see the results below.
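For anyone who wants to reproduce the two runs, here is a small sketch that assembles the TreeMix command lines. The file names and the population label are placeholders of mine, not the ones I actually used; the flags (-i, -o, -root, -m) are the standard TreeMix options for input, output prefix, root population and number of migration edges.

```python
def treemix_command(input_gz, out_prefix, root, migrations=0):
    """Build the argument list for a TreeMix run (nothing is executed here)."""
    cmd = ["treemix", "-i", input_gz, "-o", out_prefix, "-root", root]
    if migrations > 0:
        # -m sets how many migration edges TreeMix adds to the tree.
        cmd += ["-m", str(migrations)]
    return cmd


# Hypothetical file names and root label, for illustration only.
no_flow = treemix_command("treemix_input.gz", "out_m0", "BantuKenya")
with_flow = treemix_command("treemix_input.gz", "out_m5", "BantuKenya", migrations=5)
print(" ".join(no_flow))
print(" ".join(with_flow))
# To actually run one of them: subprocess.run(with_flow, check=True)
```

The root label has to match a population name in the input file exactly.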

Most of the flows are pretty much as expected. The West Eurasian flow from the Turks to the Uygurs makes sense, because there is a large West Asian component to what the Uygurs have (from East Iranians?). The Chuvash are a Turkic-speaking group with a minor, but significant, Turkic component. The HGDP Russian sample does have some East Eurasian ancestry. And the Moroccans also have African ancestry. But your guess is as good as mine with the flow into the Bantu. These are, I think, the Kenyan Bantu, so it might be trying to interpret Nilotic admixture as generalized Eurasian.

A minor note: installing TreeMix and generating the appropriate files from pedigree format is not too difficult. But you might be confused about how to generate the pedigree input file. You do it like so in PLINK:

./plink --noweb --bfile YourFile --freq --within YourGroupNamesFile --out YourOutPutFile

It’s the output of that last option (PLINK writes it as YourOutPutFile.frq.strat) that you want to feed into TreeMix’s python conversion script. The YourGroupNamesFile is basically the .fam file with an extra column, the population names for each individual.
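If you don’t have a group names file lying around, here is a minimal sketch for building one. It assumes the three-column layout that the PLINK documentation describes for --within (family ID, individual ID, cluster name) and a lookup from individual ID to population that you supply yourself. The sample IDs and labels below are invented.

```python
def build_within_lines(fam_lines, population_of):
    """Return 'FID IID POPULATION' lines for PLINK's --within option.

    fam_lines: lines from a .fam file (family ID and individual ID come first).
    population_of: dict mapping individual ID -> population label.
    """
    out = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        fid, iid = fields[0], fields[1]
        # A missing individual raises KeyError, so gaps are caught early.
        out.append(f"{fid} {iid} {population_of[iid]}")
    return out


# Invented example rows in .fam layout: FID IID father mother sex phenotype.
toy_fam = [
    "F1 S001 0 0 1 -9",
    "F2 S002 0 0 2 -9",
    "",
]
toy_populations = {"S001": "BantuKenya", "S002": "Uygur"}
toy_within = build_within_lines(toy_fam, toy_populations)
print("\n".join(toy_within))
```

Write the resulting lines to a text file and pass it as YourGroupNamesFile. Avoid spaces in the population labels, since PLINK splits the file on whitespace.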
