Razib Khan One-stop-shopping for all of my content

September 21, 2018

The Munda arrived in India 4,000 years ago (probably)

Filed under: India Genetics,Indian Genetics — Razib Khan @ 8:26 pm

I didn’t plan to talk about the Munda any time soon, in part because I recently wrote a post, The Munda as upland rice cultivators, which outlined my views. But there is a new preprint with new samples which attempts to estimate admixture times using genome-wide data. You can see the results above; note also that they found similar estimates using Y chromosome SNP variation around haplogroup O2a1.

The preprint is, The genetic legacy of continental scale admixture in Indian Austroasiatic speakers:

Surrounded by speakers of Indo-European, Dravidian and Tibeto-Burman languages, around 11 million Munda (a branch of Austroasiatic language family) speakers live in the densely populated and genetically diverse South Asia. Their genetic makeup holds components characteristic of South Asians as well as Southeast Asians. The admixture time between these components has been previously estimated on the basis of archaeology, linguistics and uniparental markers. Using genome-wide genotype data of 102 Munda speakers and contextual data from South and Southeast Asia, we retrieved admixture dates between 2000 – 3800 years ago for different populations of Munda. The best modern proxies for the source populations for the admixture with proportions 0.78/0.22 are Lao people from Laos and Dravidian speakers from Kerala in India, while the South Asian population(s), with whom the incoming Southeast Asians intermixed, had a smaller proportion of West Eurasian component than contemporary proxies. Somewhat surprisingly Malaysian Peninsular tribes rather than the geographically closer Austroasiatic languages speakers like Vietnamese and Cambodians show highest sharing of IBD segments with the Munda. In addition, we affirmed that the grouping of the Munda speakers into North and South Munda based on linguistics is in concordance with genome-wide data.

There is a weird pattern of affinities in the f3 statistics and the IBD results in this preprint. I think the explanation that they give, that Vietnamese and Cambodians have been subject to later admixture, probably explains it. In the case of the Vietnamese, it’s southern Chinese ancestry. In the case of the Cambodians…it might be Indian ancestry! This might strike you as strange, but the Indian ancestry in the Cambodians may be more enriched for the West Asian component that’s not found in the Munda specifically: the element brought in by the Indo-Aryans.

The peninsular Malay groups are “proto-Malays,” and these groups tend to be somewhat higher in AASI-like ancestry as well as lower in Austronesian ancestry. High shared drift tendencies with Lao and groups in more isolated areas of Malaysia may be a function of the fact that these are less cosmopolitan populations, with less Indian and Chinese ancestry, than other mainland Southeast Asians and Malays proper.

These results are broadly in line with the Narasimhan et al. preprint, which the new preprint cites. In that preprint the Reich group outlines its general model, where modern South Asians can be thought of as a compound of several different ancestral populations of different affinities. The Munda in particular are enriched for “Ancient Ancestral South Asian” (AASI) relative to any other group, and the hypothesis given is that the Southeast Asians mixed first with an AASI group which lacked the admixture with West Asians, and then mixed again with “Ancestral South Indians”, which had some West Asian (“Iranian Farmer”) ancestry.

Since ALDER-based methods, last I checked, tend to pick up the last admixture event, the more recent time for northern Munda groups makes sense. Looking at the Y chromosomes it is pretty clear to me that some of the East Asian ancestry in Bengali-speaking agriculturalists in the lower Gangetic plain is from Munda groups. Conversely, some of the Munda probably admixed with populations coming in from the west practicing intensive rice agriculture, which apparently did not become a feature of the landscape until after 1000 BC.
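
For intuition on where such dates come from: LD-based methods like ALDER estimate admixture time from how fast admixture-induced linkage disequilibrium decays with genetic distance, roughly as exp(-n·d) for n generations and a distance d in Morgans. The sketch below is not ALDER itself. It fits a single exponential to a synthetic, noise-free curve and converts generations to years. The amplitude, the 130-generation value, the distance grid, and the 29-year generation time are all illustrative assumptions, not figures from the preprint.

```python
import math

# Assumption: years per generation; published analyses use various values.
GENERATION_YEARS = 29.0


def fit_generations(distances_morgans, weighted_ld):
    """Estimate n in LD(d) ~ A * exp(-n * d) via least squares on log(LD)."""
    logs = [math.log(v) for v in weighted_ld]
    count = len(distances_morgans)
    mean_d = sum(distances_morgans) / count
    mean_log = sum(logs) / count
    sxx = sum((d - mean_d) ** 2 for d in distances_morgans)
    sxy = sum((d - mean_d) * (y - mean_log)
              for d, y in zip(distances_morgans, logs))
    return -sxy / sxx  # the slope of log(LD) against d is -n


# Synthetic, noise-free curve: hypothetical amplitude and admixture time.
true_generations = 130.0
distances = [0.005 * i for i in range(1, 41)]  # 0.5 cM to 20 cM, in Morgans
curve = [0.01 * math.exp(-true_generations * d) for d in distances]

estimated_generations = fit_generations(distances, curve)
estimated_years = estimated_generations * GENERATION_YEARS
print(round(estimated_generations, 3), round(estimated_years))
```

On this toy curve the fit simply recovers the 130 generations it was built with. Real curves are noisy, include a constant offset term, and can blend more than one pulse of admixture, which is why a single fitted date has to be interpreted with care.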

One of my points in the post I wrote on the Munda, mentioned above, is that the common words across Austro-Asiatic languages indicate that they were upland rice farmers. This is exactly the modern distribution of the Munda. One hypothesis, which I am now skeptical of, is that the Munda once occupied the bottomlands and were driven into the hills by people from the west and south. I no longer believe this. Rather, the Munda may always have preferred the uplands, and so traversed the flat lands between the Khasi hills and the Chota Nagpur plateau. This preference for uplands may strike us as strange, but it’s not that rare. Yankee farmers in Ohio preferred upland zones, even though these were less agriculturally rich (farmers moving up from the South didn’t have this aversion).

A point observed and implied in the preprint is that the expansions of the Indo-Aryans, Dravidians, and Munda seem to have happened rather close together in time. Though the northwest region of the subcontinent seems to have had a long-standing settled agricultural society by 3000 BC, its expansion was limited by climatic restrictions on its crop toolkit. But by 2500 BC it seems pastoralists were already pushing into the Deccan via the dry zone on the eastern edge of the Thar, down from the Punjab. The Toda people of the far south of India are probably representative of the lifestyle of these peoples, who were Dravidian-speaking.

A few centuries after this period is probably when the proto-Munda began pushing out of Southeast Asia. The DNA evidence is pretty strong this was a hugely male-skewed event once it got beyond the Khasi hills. Why? My hypothesis is that these were not quite small-scale peoples. Perhaps the male-mediation of a lot of gene flow in South Asia is due to the emergence of militarized confederacies where elite lineages engaged in conquest of territory from native groups. The Munda have very low frequencies of R1a, and very high frequencies of O2a. The admixture with Dravidian and Indo-Aryan speaking peoples that occurred between 2000 BC and 0 AD was probably overwhelmingly female-mediated.

The narrative above suggests that most of the genetic changes in South Asia that produced the landscape of the present occurred in the period between 2500 BC and 500 BC. About 2,000 years. And yet agriculture of some form arrived in Mehrgarh in western Pakistan 9,000 to 7,500 years ago, depending on what dates you trust. What took so long? Similarly, millet and rice agriculture in China is 7,000 years old, but only around 4,000 years ago did rice farmers start pushing south (and probably west in the case of the Munda).

I’ll present the hypothesis here that this coincidence wasn’t a coincidence, and that certain things in relation to social complexity have a particular rate of change. In general I agree with economic historians who say that our need to posit an “Industrial Revolution,” or a “Neolithic Revolution,” is somewhat of an imposition because humans don’t want to think quantitatively. It probably takes small-scale societies a certain amount of time to move from hunting and gathering to full-blown agriculture, and then to proceed to the greater social complexity that enables migration driven by more than simple natural increase and Malthusian expansion. Mainland India beyond what is today Pakistan and much of Southeast Asia were “filled up” by agricultural peoples around the same time after a long incubation to the west and north because similar social forces were at play.

July 15, 2018

India vs. China, genetically diverse vs. homogeneous

Filed under: China,China genetics,Human Population Genetics,India,India Genetics — Razib Khan @ 1:50 pm

About 36% of the world’s population are citizens of the People’s Republic of China and the Republic of India. Including the other nations of South Asia (Pakistan, Bangladesh, etc.), 43% of the population lives in China and/or South Asia.

But, as David Reich mentions in Who We Are and How We Got Here, China is dominated by one ethnicity, the Han, while India is a constellation of ethnicities. And this is reflected in the genetics. The relative diversity of India stands in contrast to the homogeneity of China.

At the current time, the best research on population genetic variation within China is probably the preprint A comprehensive map of genetic variation in the world’s largest ethnic group – Han Chinese. The authors used low-coverage sequencing of over 10,000 women to get a huge sample of variation all across China. The PCA analysis recapitulated earlier work. Genetic relatedness among the Han of China is geographically structured. The largest component of variance is north-south, but a smaller component is also east-west. The north-south element explains more than 4.5 times as much variance as the east-west.

Another dimension of the variation is that different parts of China are characterized by different levels of admixture between the Han and other groups. In Northwest China, there is gene flow from West Eurasian sources. In all likelihood, this is through proxy populations, such as Mongols, who are about 10% West Eurasian. Also, during the period between the fall of the Han Dynasty and the rise of the Sui-Tang Dynasty much of northern China was dominated by barbarian groups from the steppe, and these groups settled down and were absorbed. In Northeast China, the source of admixture is Siberian and Tungusic groups. Again, this makes geographical sense.

In contrast in South China, the gene flow is from indigenous Chinese national groups, such as Dai. This is in keeping with the historical record, whereby South China became Han in the period between 0 and 1000 AD through migration, intermarriage, and acculturation.

I have my own small private dataset of Chinese individuals. Some with provenance. Some without. But using known populations I was able to divide China along the north to south cline.  Individuals from Guangdong in the south, those from Shaanxi in the north, and from Zhejiang to Sichuan in the center.

Using Punjabis as a West Eurasian outgroup I was able to plot these individuals on a PCA. If you click to enlarge you will see that a substantial minority of the Han_N sample is shifted to the left of the plot. This is toward the Punjabis. This is not because they have Punjabi ancestry, but because Punjabis are reasonable proxies for West Eurasians.

More importantly, I want to compare South Asia to China. To do that I created a small dataset that merged the Han with representative South Asian groups. The first two PCs, 1 and 2, illustrate the contrast. All three Chinese groups, sampled from the north to the south, occupy a very tight cluster, while the South Asians span PC 2. The Bengalis are shifted a bit toward the Chinese, but most of the variance is due to within-South Asian genetic differences.

I ran PCA to 10 dimensions. Only at PC 10 did the Han Chinese separate along the north-south axis. Most of the earlier PCs separated out specific castes (e.g., Patels, because of their large number in the Gujarati sample, were PC 3). Here are the eigenvalues: 53.0682, 2.5641, 2.31876, 1.97058, 1.90652, 1.88879, 1.7935, 1.69375, 1.61516, and 1.54207. The large value for PC 1 is what you’d expect; it’s a continental-scale difference. PC 2 differentiates South Asia from north to south. It’s much more modest. The other PCs get progressively smaller, but within the data, it’s clear that the continental-scale difference is the big one. The variance between north and south China is a small one on a South Asian scale.
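
To put rough numbers on that drop-off, here is a small sketch using only the ten eigenvalues quoted above. One caveat, flagged as an assumption: the shares are computed relative to the sum of these ten values, not the total variance of the dataset, since the remaining eigenvalues are not reported.

```python
# The ten eigenvalues quoted in the post, PC 1 through PC 10.
eigenvalues = [
    53.0682, 2.5641, 2.31876, 1.97058, 1.90652,
    1.88879, 1.7935, 1.69375, 1.61516, 1.54207,
]

# How much larger is the continental-scale axis than the next one?
ratio_pc1_to_pc2 = eigenvalues[0] / eigenvalues[1]

# Share of each PC relative to the top ten only (not the total variance).
top10_total = sum(eigenvalues)
shares = [value / top10_total for value in eigenvalues]

print(f"PC1 / PC2 eigenvalue ratio: {ratio_pc1_to_pc2:.1f}")
for index, share in enumerate(shares, start=1):
    print(f"PC {index}: {share:.1%} of the top-ten sum")
```

PC 1 comes out roughly twenty times the size of PC 2, and PC 10, where the north-south Han split finally shows up, is smaller still.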

Pairwise Fst is more ambiguous. That’s probably because most of the South Asian samples have structure within them. Merging them into one pooled population just confuses the issue.

Using a South Asian dataset where groups are disaggregated makes a lot more sense, and you see the structure between the different groups.

Running Treemix gives similar results. The South Asian groups exhibit a fan-shaped topology, whereas the Han cluster tightly together. Since I removed Bengalis from Treemix, adding migration edges doesn’t do anything between the two clusters, so I omitted those results.

Finally, of course I ran some admixture analysis. Using South Asians + Han Chinese, I thought K = 4 would be reasonable. Even if you don’t enlarge, the results are straightforward: the Han Chinese have very little diversity in unsupervised mode. A small South Asian-like component, which has affinities with Punjabis, is found in northern Han. This confirms other results with other methods that the northern Han have some West Eurasian gene flow. Some of the southern and central Han have an affinity with one of the South Indian clusters. I think this is artifactual, due to deep structure within Eastern Eurasian populations and affinities between those groups that the Han absorbed as they moved south.

This post doesn’t really shed new light on anything we didn’t know. Rather, it’s just a review of what jumps out at anyone who works with genotype data: there is not very much genetic diversity in China and there is a great deal of genetic diversity in India. Why? These are not questions genetics can really answer directly, though it can give us clues and support certain models over others.

Anyone who has read much about Chinese history knows that the cultural ideal of meritocracy is deeply ingrained, even if it is honored in the breach quite often. Chinese civilization has been characterized by the domination of extended pedigrees (e.g., the Xianbei-Han ruling faction among the Tang), but those pedigrees never became ethno-religious castes. The exception occurred during the Yuan (Mongol) period, when Kublai Khan adopted a divide-and-rule policy. But that was a short period which had no longer-term cultural consequences.

In contrast, South Asia is characterized by long-term endogamy. This is not surprising to anyone who knows anything about South Asian history. The genetic evidence suggests that modern jati-barriers emerged around 2,000 years ago. Not only do South Asian groups differ a great deal in biogeographic ancestry (deep ancestry), but historical endogamy has resulted in further drift between these groups.

June 27, 2018

South Asian Genotype Project, Summer 2018 Update

Filed under: India Genetics,India genomics,South Asian Genotype Project — Razib Khan @ 12:54 am


I’ve put another update on the South Asian Genotype Project. If you’ve contributed since March check it out.

Again, if you are interested: send me a 23andMe, Ancestry, MyHeritage, Family Tree DNA raw genotype file to contactgnxp -at- gmail.com.

In the subject please put:

  1. “South Asian Genotype Project”
  2. The state/province your family is from
  3. Ethnolinguistic group
  4. If applicable, caste

I decided to do some poking around with some of the higher quality samples people have given me. 180,000 SNPs with almost no genotyping error. I also removed “relatives.” That means that a lot of Muslim groups from Pakistan had individuals dropping out. In the PCA above you can see 4 Burushos left! Not too many Pathans either.

First, I decided to look at the Brahmin samples I had.

– Uttar Pradesh, Bihar, and the Gujarati Brahmin(s) I had are one cluster
– South Indian Brahmins (mostly Iyer) are another

To my surprise, the two Maharashtra Brahmins that I have are firmly in the South Indian cluster. The Bengali Brahmin is more like the North Indians. But there is a subtle skew toward the distant Bangladesh cluster. This individual seems less East Asian than even the typical Bengali Brahmin, but I think Bengali Brahmins can be modeled as North Indian Brahmin with non-Brahmin (and therefore East Asian) ancestry.

Next, I wanted to look at Gujaratis. The 1000 Genomes has a large sample of this population…but there’s no group identity. Years ago Zack Ajmal of Harappa DNA concluded that a large and relatively related cluster in these data were “Patels.” Someone who is a Bohra Muslim of presumably Patel background sent me their data. They did not fall in the Patel cluster. Rather, they were in the “Gujurati_ANI_1” group, which is more like Pakistanis than other Gujaratis. In fact, the Gujarati Brahmin is not in this cluster. An individual who is Solanki seems to be more ASI-shifted, like the Patels and Gujurati_ANI_4.

Overall, Gujarat has a lot of population structure in a rather small state (yes, I can’t spell Gujarat as you can see in my population labels).

From Maharashtra, right to the south of Gujarat in western India, I have two Brahmins and one Kayastha. For non-South Asians, my understanding is that Kayasthas are literate non-Brahmin castes. In Bengal, they take the place of the Kshatriya in the caste hierarchy, and with Brahmins formed the traditional Hindu educated classes. I have seen Bengali Kayastha genotypes, and they look rather like other Bengalis (my mother’s father’s family was Kayastha before their conversion to Islam, judging from their customary surname).

There are Kayasthas in other parts of South Asia. I have a Kayastha sample from Maharashtra. Curiously on the PCA this individual is in the same position as the two Brahmins from the region, and South Indian Brahmins. I don’t know what this means.

Next, some odds and ends from the northwest of the subcontinent. I have a few Jatts who are not related. This group from Punjab is quite ANI-shifted. Someone who claims to be a Rajput from Rajasthan is where they should be on account of geography. The Punjabi 1000 Genomes group is quite diverse. I have a Ramgarhia individual who seems to be somewhere between Punjabi_ANI_1 and Punjabi_ANI_2. The Jatts are on the edge (ANI-shifted) of Punjabi_ANI_1.

I have two individuals who claim to be Kashmiri. A Butt and a Syed. I have no idea what that means. But both are Punjabi_ANI_2…but they look somewhat East Asian shifted. This is not surprising. The curious thing about Kashmiris is that they are culturally and geographically quite distinct from Indians to their south. But genetically they are not so different. In fact, they are “more South Asian” (ASI) than Jatt, and considerably more than Iranian speaking groups like Pathans.

Finally, there is a Marwari individual. This community is from Rajasthan, though they occupy a mercantile role across the subcontinent. Strangely (or not?) they are very close to the Patels. Much more ASI-enriched than the Rajput.

Shifting to South Indian samples, I plotted the Chamar samples, who I believe were collected from Uttar Pradesh in the north. These Dalits actually seem to cluster with a subset of the 1000 Genomes Tamil and Telugu samples I believe are Scheduled Caste (Dalit) as well. The Chamar are somewhat distinct. They are more ANI-shifted. But notice that the bulk of Tamils and Telugus are still more ANI-shifted than the Chamars are! This surprised me.

I have some Velama individuals, as well as a Reddy from Andhra Pradesh and a Padmashali. All these individuals are in the main distribution of South Indians. I do have a Mudaliar Tamil sample, and this individual is placed among the Chamars. Though not really in the Tamil Scheduled Caste group.

Finally some odds & ends. The Nasrani samples from Kerala are between the South Indian Brahmins and middle caste South Indians. I suspect this is due to the origin of the Nasranis in the Nair community, who have mixed some with Brahmins. The Vania sample from Gujarat is clustered with South Indian Brahmins. The Dusadhs, an agricultural group from Uttar Pradesh and Bihar, that is depressed in some manner in relation to the dominant groups, are not quite Chamars, but they are quite ASI-shifted.

Some of you will be asking about admixture. I ran K = 4 unsupervised on the data set. You can find it here.

June 12, 2018

No steppe ancestry in the Rakhigarhi samples = non sequitur

Filed under: India Genetics,Indian Genetics — Razib Khan @ 10:36 pm

Harappan site of Rakhigarhi: DNA study finds no Central Asian trace, junks Aryan invasion theory:

The much-awaited DNA study of the skeletal remains found at the Harappan site of Rakhigarhi, Haryana, shows no Central Asian trace, indicating the Aryan invasion theory was flawed and Vedic evolution was through indigenous people.

“The Rakhigarhi human DNA clearly shows a predominant local element — the mitochondrial DNA is very strong in it. There is some minor foreign element which shows some mixing up with a foreign population, but the DNA is clearly local,” Shinde told ET. He went on to add: “This indicates quite clearly, through archeological data, that the Vedic era that followed was a fully indigenous period with some external contact.”

I haven’t heard anything definitive, but this is what I have heard: that the genetics they could analyze indicates continuity, but shows none of the steppe element ubiquitous in modern North India (and that there was contamination in the Korean lab). The Rakhigarhi samples date to 2500 to 2250 BC last I checked. That means they shouldn’t have any steppe ancestry if the model of the relatively late demographic impact of Indo-Aryans after 2000 BC is correct.

Basically, the whole article is kind of a non sequitur. I do understand that many archaeologists think there was continuity culturally. And there could have been. But taking into account the genetics of the modern region of India where Rakhigarhi is located, there was a major demographic perturbation after 2250 BC.

April 27, 2018

Rakhigarhi sample doesn’t have steppe ancestry (probably “Indus Periphery”)

Filed under: India Genetics,India genomics,Rakhigarhi — Razib Khan @ 12:02 am

We’ve been waiting for two years now, and it looks like they’re about to pull the trigger, Indus Valley People Did Not Have Genetic Contribution From The Steppes: Head Of Ancient DNA Lab Testing Rakhigarhi Samples:

Niraj Rai, the head of the Ancient DNA Laboratory at Lucknow’s Birbal Sahni Institute of Palaeosciences (BSIP), where the DNA samples from the Harappan site of Rakhigarhi in Haryana are being analysed, has revealed that a forthcoming paper on the work will show that there is no steppe contribution to the DNA of the Harappan people….

“It will show that there is no steppe contribution to the Indus Valley DNA,” Rai said. “The Indus Valley people were indigenous, but in the sense that their DNA had contributions from near eastern Iranian farmers mixed with the Indian hunter-gatherer DNA, that is still reflected in the DNA of the people of the Andaman islands.” He added that the paper based on the examination of the Rakhigarhi samples would soon be published on bioRxiv (pronounced “bio-archive”), a preprint repository of papers in the life sciences.

At this point none of this is surprising. I also wonder if this preprint was hastened by the release of The Genomic Formation of South and Central Asia. It seems that the results here are totally consonant with what came before. My expectation is that the lone sample that they got genetic material out of will be similar to the “Indus Periphery” (InPe) individuals in the earlier preprint: a mix of West Asian with ancestry strongly shifted toward eastern Iran, and indigenous South Asian “hunter-gatherer.”  That’s pretty much what Niraj Rai states in the piece. I think genetically the individual won’t be that different from the Chamars of modern day Punjab.

In fact, Rai, the lead researcher, ends by twisting the knife:

In other words, the preprint observes that the migration from the steppes to South Asia was the source of the Indo-European languages in the subcontinent. Commenting on this, Rai said, “any model of migration of Indo-Europeans from South Asia simply cannot fit the data that is now available.”

A major caveat here is that we’re talking about one sample from the eastern edge of the Indus Valley Civilization (IVC). I’m not sure that this should adjust our probabilities that much. From all the other things we know, as well as copious ancient DNA from Central Asia, our probability for the model which the Rakhigarhi result aligns with should already be quite high.

Again, since it’s one sample, we need to be cautious…but I bet once we have more samples from the IVC the Rakhigarhi individual will probably be enriched for AASI relative to other samples from the IVC. The InPe samples in The Genomic Formation of South and Central Asia exhibited some variation, and it’s likely that the IVC region was genetically heterogeneous.

But, this is going to be a DNA sample from an individual who lived 4,600 years ago within the orbit of the IVC when it was in its mature phase. That’s still a big deal. As most of you know the IVC is prehistory because we haven’t deciphered the seals which are associated with this civilization. But, the IVC clearly had relationships with West Asia and Central Asia, with parts of eastern Iran and the BMAC culture both being influenced by it and interacting with it. Traders who were likely from the IVC seem to be mentioned in Mesopotamian records.

Additionally, the genetics of one individual can be highly informative if it’s high-quality whole-genome data (I’m skeptical of that in this case). One could possibly even identify the time period that admixture between West Asian and AASI components occurred from a single genome, by looking at ancestry tract lengths.
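
A rough sketch of why tract lengths carry timing information: under a simple single-pulse model, recombination chops ancestry tracts a little more each generation, so tracts of one ancestry end up approximately exponentially distributed, with a rate of about (generations) × (fraction of the other ancestry) per Morgan. The function name, the 100-generation value, the 70% fraction, and the simulated tracts are all hypothetical. Real inference also has to deal with finite chromosomes, short tracts that go undetected, and errors in local ancestry calls.

```python
import random


def estimate_generations(tract_lengths_cm, other_ancestry_fraction):
    """Moment estimate of generations since a single admixture pulse.

    Assumes tract lengths are exponential with rate
    generations * other_ancestry_fraction per Morgan.
    """
    mean_morgans = sum(tract_lengths_cm) / len(tract_lengths_cm) / 100.0
    return 1.0 / (mean_morgans * other_ancestry_fraction)


# Simulate hypothetical tracts: 100 generations, 70% "other" ancestry.
rng = random.Random(0)
true_generations = 100
other_fraction = 0.7
rate_per_morgan = true_generations * other_fraction
tracts_cm = [100.0 * rng.expovariate(rate_per_morgan) for _ in range(20000)]

estimate = estimate_generations(tracts_cm, other_fraction)
print(round(estimate, 1))
```

The estimate lands close to the simulated 100 generations. The intuition is simply that shorter average tracts mean more generations of recombination since the mixture.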

A single sample isn’t going to falsify the idea held by some that steppe peoples were long present within the IVC. Perhaps they’ll show up in other samples? That’s possible, and it’s what I would argue if I held their position, but I think the constellation of evidence on the balance now does suggest that a relatively late incursion into South Asia is likely. The steppe ancestry with Northern European affinities shows up in BMAC only around 4,000 years ago. It is hard to imagine it was in South Asia before it was in Central Asia.

As I’ve been saying for a while it seems that though there will be more genetic work written on India in the near future, the real analysis is going to have to come out of archaeology and mythology.

It’s pretty clear that in Northern Europe the arrival of the Corded Ware peoples from the steppe zone resulted in great tumult. A linguistic analysis suggests that the languages of Northern Europe have words related to agriculture with a non-Indo-European origin, of common provenance.  But we don’t have much in the way of mythos about the arrival of the Corded Ware.

In contrast, India has a rich mythos which seems to date to the early period of the arrival of the Indo-Aryans. One interpretation has been that since these myths seem to take as a given that Indo-Aryans were autochthonous to India, they were. But the genetic data seem to be strongly suggesting that the arrival of pastoralists occurred in South Asia concomitant with their arrival in West Asia, and somewhat after their expansion westward into Europe. Indian tradition and mythos could actually be a window into the general process of how these pastoralists dealt with native peoples and an illustration of the sort of cultural synthesis that often occurred.

March 11, 2018

The population genomics of South Asia is complicated, and politics doesn’t make it easier

Filed under: India Genetics,India genomics — Razib Khan @ 11:27 am


Many people have been sending me links to this article, By rewriting history, Hindu nationalists aim to assert their dominance over India. Here’s a key section:

The RSS asserts that ancestors of all people of Indian origin – including 172 million Muslims – were Hindu and that they must accept their common ancestry as part of Bharat Mata, or Mother India. Modi has been a member of the RSS since childhood. An official biography of Culture Minister Sharma says he too has been a “dedicated follower” of the RSS for many years.

Sharma told Reuters he expects the conclusions of the committee to find their way into school textbooks and academic research. The panel is referred to in government documents as the committee for “holistic study of origin and evolution of Indian culture since 12,000 years before present and its interface with other cultures of the world.”

Sharma said this “Hindu first” version of Indian history will be added to a school curriculum which has long taught that people from central Asia arrived in India much more recently, some 3,000 to 4,000 years ago, and transformed the population

There are several threads here. First, it is a fact that the ancestors of South Asia’s non-Hindus were Hindu. There are minor exceptions, such as the Parsis, who are ~75% Iranian. One can quibble as to whether many tribal and peasant populations were truly Hindu in a formal and explicit sense. But I think this is a semantic dodge. Muslims would recognize these beliefs and practices as Hindu, no matter if one was a Brahmin monk or a member of a tribe which still sacrificed animals.

I’ve looked at the genotypes of a fair number of South Asians of Muslim background. The overwhelming (usually exclusive) proportion of their ancestry is South Asian. It’s a fact that the ancestors of non-Hindu South Asians were Hindu.

But the premise of the article, and a dominant theme in Hindu nationalism today, is that distinctive groups like Indo-Aryans are indigenous to South Asia. This is set against a narrative of invasions and migrations, which is presumed more friendly to a multicultural paradigm (I have a hard time keeping track of the political valence of all these things). To some extent, the reality of invasions and migrations cannot be denied, whether it be Alexander, the Kushans, or the various Muslim groups. But these historical invasions left little genetic imprint.

When 2009’s Reconstructing Indian Population History was published things changed for earlier migrations. By the time the ancient Greeks were noting India in Classical Antiquity, it was already noted as the most populous nation in the world. I was initially skeptical about the result in Reconstructing Indian Population History, that there was massive admixture between a West Eurasian and indigenous South Asian group, because that would imply massive migration. Additionally, phenotypically the pigmentation genes didn’t seem to work out if the source population was European-like.

Nearly 10 years on we have a lot more clarity. Ancient DNA has clarified a lot. Massive migrations are common. And, the pigmentation and genetic profile of modern Europeans is recent. The source population(s) for “Ancestral North Indians” may not have been Europeans in the way we’d understand them. In fact, a follow-up paper, Genetic Evidence for Recent Population Mixture in India, hinted at two admixtures. There’s a fair amount of circumstantial evidence now that one component of “Ancestral North Indian” relates to West Asian populations and another component to the more classical steppe Indo-Aryans. The former is more widespread than the latter.

I do understand Indians who want to interpret their own history through the lens of their own cultural priors. The problem is that genetic science has proceeded so fast in the last few years that many propositions which were speculative in the 20th century are testable in the 21st. Some Hindu nationalist friends and acquaintances express embarrassment and worry about the track that Indians are going on this. I don’t know what to say, but Americans have their own delusions and propaganda, so I’m not going to be the one pointing fingers. Other Indians have told me via Facebook that they “believe in the results from the 2000s” (when they were more congenial to their viewpoints?). I guess that’s one strategy; just keep up with the science until it starts refuting your model.

That being said, with the ubiquity of datasets one can explore questions oneself. Additionally, the Indian government may suppress analysis of ancient DNA through soft coercion and negative incentives, but I’m sure at some point Pakistan will let people dig things up.

I recently posted my South Asian Genotype Project results. Though the sample size is small, since I have provenance it allowed me to get more clarity on what’s going on in the 1000 Genomes samples for South Asians.

There are several things I can tell you. For example, I have samples for a Sindhi and Gujarati Lohanna. They cluster near each other, and with Sindhis. Similarly, a Gujarati Muslim sample is also in with the Sindhis. A Gujarati Muslim gives the same results.

In contrast, other Gujaratis are placed between Pakistani populations and South Indians. A Vania and Solanki sample helped me label “Gujarati Middle Caste.” The Gujarati Patel samples, in contrast, are even more shifted toward South Indians, and admixture analysis implies less Indo-Aryan in these samples than other Gujaratis.

Gujarat seems to be a state with a lot of connections to Sindh, so you see a wide range of variation in this state on the ANI-ASI cline.

I’ve collected enough Brahmin samples from the four states of Southern India to see that they are very similar genetically (one exception is a Niyogi Brahmin who seemed more ASI-shifted than usual). If they don’t come from the same migration event, they diverged at around the same time.

Brahmins from North India are different from South Indian Brahmins, in being more ANI-shifted. It seems that to a great extent Brahmins from outside the Gangetic heartland can be modeled as North Indian Brahmins with local admixture. Bengali and Maharashtra Brahmins have shifted away from North Indian Brahmins, but not as much as South Indian Brahmins. Bengali Brahmins are also East Asian shifted, confirming the reality of local admixture.

One result that surprises me is how genetically similar Dalits from North and South India are. The Chamar samples from Uttar Pradesh seem to have some levels of Indo-Aryan admixture, but overall they’re not that different from Dalits from Tamil Nadu.

Additionally, non-Brahmin and non-Dalit individuals from places like Bengal, Tamil Nadu/Sri Lanka, and Andhra Pradesh are very distinct from Dalits. In other words, the caste system is not simply the traditional upper castes vs. everyone else, but it’s deeply structured. The implication here is that caste may predate the Indo-Aryans (this is not a new inference). Or at least not be related to them.

Finally, there is a curious pattern where gene flow into southern Pakistan (Sindh) is more shifted toward the Middle East than in northern Pakistan (Punjab, the Pathan areas). I suspect this is due to dynamics which date deep into prehistory, rather than more recent events.

 

January 18, 2018

The Dravidianization of India

Filed under: Dravidian,India Genetics,India genomics,Indo-Aryan — Razib Khan @ 9:36 pm

On this week’s The Insight Spencer Wells and I talk about the Indo-Aryan arrival to South Asia. This was recorded very early last summer, and I’m rather unguarded (it’s well before I had the piece published in India Today).

I think 2018 will finally be the year that a lot of South Asia will be “solved.” There has been some foot-dragging on papers and results, but that can only go so long.

All that being said I suppose I should make some suppositions I have arrived at on this topic more explicit. In a discussion with an Indian friend, when I expressed them, he admitted he had no idea about some of my views, though he reads this weblog. That’s because they are speculative and my confidence in them is weak, though you can infer my opinions if you look very closely.

The figure to the left is from Genomic insights into the origin of farming in the ancient Near East, a paper published about a year and a half ago. You see various South Asian populations being modeled as a mixture of four different source populations. The Onge are an Andaman Islander population (and the closest we can get to the aboriginal peoples of South Asia). Iran_N represents Neolithic Iranians, the canonical “eastern farmer” population. Steppe_EMBA represents Yamnaya pastoralists, who are themselves modeled as a mixture of Eastern European Hunter-Gatherers (EHG) and a southern population which has affinities with the Iran_N cluster. EHG in their turn seem to exhibit ancestry from Western European Hunter-Gatherers (WHG), whose heritage dates to the late Pleistocene, and Ancient North Eurasians (ANE), who flourished in Siberia, and contributed ancestry to populations to the west and east (including the ancestors of Native Americans).

When I first saw this specific figure I was incredulous. I had long thought that “Ancestral North Indians” (ANI) were a compound of two elements, one related to the farmers of West Asia (Iran_N), and the other steppe Indo-European (Steppe_EMBA/Yamnaya). But the fraction of Yamnaya/Indo-European/Indo-Aryan ancestry seemed far too high.

A few years later I am now less certain about my skepticism. The fractions here in the details are debatable. Within the text of the paper, the author admits that the true ancestral populations are probably not represented by the model. But they are close. In most cases, the “Han” ancestry is probably indicative of the fact that the non-ANI component of South Asian ancestry is most closely related to the Onge, but is significantly different nonetheless.

The ratio of Iran_N and Steppe_EMBA is the key. Here is a selection from the paper:

Group            Iran_N   Steppe_EMBA   Ratio
Jew_Cochin       0.53     0.23          2.27
Brahui           0.60     0.30          1.98
Kharia           0.13     0.07          1.97
Balochi          0.57     0.32          1.75
Mala             0.23     0.18          1.25
Vishwabrahmin    0.25     0.20          1.21
GujaratiD        0.29     0.28          1.04
Sindhi           0.38     0.38          1.00
Bengali          0.22     0.25          0.91
Pathan           0.36     0.45          0.81
Punjabi          0.24     0.33          0.72
GujaratiB        0.27     0.38          0.72
Lodhi            0.21     0.29          0.72
Burusho          0.27     0.43          0.64
GujaratiC        0.23     0.37          0.61
Kalash           0.29     0.50          0.58
GujaratiA        0.26     0.46          0.57
Brahmin_Tiwari   0.23     0.44          0.51
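For anyone who wants to play with the ratios themselves, here is a minimal Python sketch that recomputes Iran_N / Steppe_EMBA from the rounded fractions in the table and ranks the groups. The function names are mine, not anything from the paper. Ratios recomputed this way differ slightly from the listed column (0.53 / 0.23 comes out near 2.30 rather than 2.27), presumably because the listed ratios were taken from unrounded estimates; that is my assumption, not something the table states.

```python
# Rounded (Iran_N, Steppe_EMBA) fractions copied from the table above.
FRACTIONS = {
    "Jew_Cochin": (0.53, 0.23),
    "Brahui": (0.60, 0.30),
    "Kharia": (0.13, 0.07),
    "Balochi": (0.57, 0.32),
    "Mala": (0.23, 0.18),
    "Vishwabrahmin": (0.25, 0.20),
    "GujaratiD": (0.29, 0.28),
    "Sindhi": (0.38, 0.38),
    "Bengali": (0.22, 0.25),
    "Pathan": (0.36, 0.45),
    "Punjabi": (0.24, 0.33),
    "GujaratiB": (0.27, 0.38),
    "Lodhi": (0.21, 0.29),
    "Burusho": (0.27, 0.43),
    "GujaratiC": (0.23, 0.37),
    "Kalash": (0.29, 0.50),
    "GujaratiA": (0.26, 0.46),
    "Brahmin_Tiwari": (0.23, 0.44),
}


def iran_to_steppe_ratio(iran_n, steppe_emba):
    """Return Iran_N / Steppe_EMBA for one group (hypothetical helper)."""
    return iran_n / steppe_emba


def rank_by_ratio(fractions):
    """Sort groups from most Iran_N-skewed to most Steppe_EMBA-skewed."""
    ratios = {group: iran_to_steppe_ratio(*pair) for group, pair in fractions.items()}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for group, ratio in rank_by_ratio(FRACTIONS):
        print(f"{group:<16}{ratio:5.2f}")
```

The point of ranking on the ratio rather than on either raw fraction is that it cancels out how much Onge-like ancestry a group carries, which is what lets a North Indian group be compared directly with a Pakistani one.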

Any way you slice it, a group like the Tiwari Brahmins of Northern India has more Onge-like ancestry than most of the groups in Pakistan. But also observe that the ratio toward Steppe_EMBA is more skewed in them than among even Pathans or Kalash. The Lodhi, a non-upper caste population from Uttar Pradesh in north-central South Asia, are more skewed toward Steppe_EMBA than Pathans.

It is important for me to reiterate that the key is to focus on ratios and not exact percentages. Though the Steppe_EMBA fraction did strike me as high, glimmers of these sorts of results were evident in model-based clustering approaches as early as 2010. The population in the list above most skewed toward Iran_N is the Cochin Jews. This group has known Middle Eastern ancestry. But next on the list are the Brahui, a Dravidian speaking group in Pakistan. There is a north-south cline within Pakistan, with northern populations (Burusho) being skewed toward Steppe_EMBA and southern ones (Sindhi) being skewed toward Iran_N. Additionally, Iranian groups such as Pathans and Baloch likely have had some continuous gene flow with Middle Eastern groups, probably inflating their Iran_N.

Trends I see in the data:

  1. There is a north-south cline within Pakistan with Steppe_EMBA vs. Iran_N
  2. There is a north-south cline within South Asia with Steppe_EMBA vs. Iran_N
  3. There is caste stratification within regions between Steppe_EMBA vs. Iran_N
  4. Though not clear in this table, there are strong suggestions that Indo-European speaking groups tend to be enriched in Steppe_EMBA, all things equal (e.g., the Bengalis in the 1000 Genomes look a lot like the middle-caste Telugus in the 1000 Genomes when you remove the East Asian ancestry…except for a noticeable small fraction of a component which I think points to Indo-European ancestry)

What does this mean in terms of a model of the settlement of South Asia over the past 4,000 years? One conclusion I have come to is that Dravidian speaking groups are not the aboriginal peoples of the subcontinent. Rather, their settlement across much of South Asia is very recent. Almost as recent as Indo-Aryan habitation. In First Farmers the archaeologist Peter Bellwood proposed this model, whereby Indo-Aryans and Dravidians both expanded across South Asia concurrently. Though I think elements of Bellwood’s model are incorrect, it’s far more correct in my opinion than I believed when I first encountered it.

Why do I believe this?

  1. The Neolithic begins in South India in 3000 BC.
  2. Sri Lanka is Indo-European speaking
  3. The Dravidian languages of South India don’t seem particularly diverged from each other
  4. There is ancestry/caste stratification in South India even excluding Brahmins (e.g., Reddys and Naidus in Andhra Pradesh look somewhat different from Dalits and tribals)
  5. Some scholars claim that there isn’t a Dravidian substrate in the Gangetic plain
  6. R1a1a-Z93, almost certainly associated with Indo-Aryans, is found in South Indian tribal populations
  7. Using LD-based methods researchers are rather sure that the last admixture events between ANI and ASI (“Ancestral South Indians”) populations occurred around ~4,000 years ago
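To make the last point concrete: LD-based dating rests on the fact that admixture-induced linkage disequilibrium decays roughly as exp(-n·d), where n is the number of generations since mixture and d is the genetic distance in Morgans. The sketch below is only an illustration of that single-pulse model on noiseless synthetic numbers; it is not the procedure of any particular paper or software package, the function names are hypothetical, and the 29-year generation time is an assumption you can swap out.

```python
import math

YEARS_PER_GENERATION = 29  # assumption: one commonly used human generation time


def expected_ld(distance_morgans, generations, amplitude=1.0):
    """Admixture LD expected at a given genetic distance under a single pulse."""
    return amplitude * math.exp(-generations * distance_morgans)


def fit_generations(distances, ld_values):
    """Least-squares slope of log(LD) against distance; the slope is -generations."""
    logs = [math.log(value) for value in ld_values]
    n = len(distances)
    mean_d = sum(distances) / n
    mean_l = sum(logs) / n
    cov = sum((d - mean_d) * (l - mean_l) for d, l in zip(distances, logs))
    var = sum((d - mean_d) ** 2 for d in distances)
    return -cov / var


# Synthetic, noiseless decay curve sampled from 0.5 cM to 10 cM (in Morgans).
distances = [0.005 * i for i in range(1, 21)]
synthetic_ld = [expected_ld(d, generations=138) for d in distances]

estimated_generations = fit_generations(distances, synthetic_ld)
estimated_years = estimated_generations * YEARS_PER_GENERATION
```

With 138 generations plugged in, the fit recovers 138 and the conversion lands near 4,000 years. Real data are noisy, and multiple pulses of admixture blur the curve, which is one reason the confidence intervals on these dates are wide.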

Here is my revised model as succinctly as I can outline it. The northwest fringes of South Asia, today Pakistan, and later to be the home of the Indus Valley Civilization (IVC), was populated by a mix of indigenous populations, a form of ASI, when West Asian agriculturalists arrived ~9,000 years ago from what is today Iran. These were the Iran_N or “eastern farmer” groups. The West Asian agricultural toolkit was serviceable in northwestern South Asia for reasons of climate and ecology, but could not expand further east and south for thousands of years.

This is where the first admixture occurred, which led to a population that was mixed between ANI and ASI. These people lacked Steppe_EMBA. They were pre-Indo-European. They were almost certainly not all Dravidian speaking. The Burusho people of northern Pakistan, for example, speak a language isolate (in India proper you have Nihali and Kusunda).

By ~3000 BC this proto-South Asian (in a modern sense) population began to expand, while the IVC matured and waxed. Eventually, the IVC waned, fragmented, and disappeared.

Around ~2000 BC, or perhaps somewhat later, Indo-Aryans arrive in South Asia. The situation at this stage is not one of a primordial and static Dravidian India, on which Indo-Aryans place themselves on top. Rather, it’s a dynamic one as the collapse of the IVC has opened up a disordered power vacuum, and a reconfiguration of cultural and sociopolitical alliances.

In the paper above the author alludes to the pervasiveness of both Iran_N and Steppe_EMBA ancestry in South Asia, including in South India. “Indo-European” Y chromosomal lineages are also found among many South Indian groups, albeit at attenuated proportions region-wide. In Peter Turchin’s formulation, I believe that “Indo-Aryan” and “Dravidian” identities became meta-ethnic coalitions in the post-IVC world. Genetically the two groups are different, on average. But some Dravidian populations assimilated and integrated Indo-Aryan tribes and bands, while Indo-Aryans as newcomers assimilated many Dravidian populations.

The reason that the ratio of Iran_N to Steppe_EMBA does not decline monotonically as one goes from west to east along the North Indian plain is that Indo-Aryans were not expanding into a Dravidian India. Dravidian India was expanding only somewhat ahead of Indo-Aryan India, and in some places not at all. In the northwest fringe of South Asia there had long been a settled population of peasants with West Asian ancestry with Iran_N affinities. In contrast, to the east the landscape was populated by nomadic tribal populations with ASI affinities. North Indian Brahmins may have more Steppe_EMBA than some populations in Pakistan and more ASI because they descend from Indo-Aryan groups who absorbed indigenous ASI populations as they expanded across the landscape.

Dravidian groups as they expanded also assimilated indigenous populations. This explains some groups with very high fractions of ASI. Their ASI ancestry is a compound, of an old admixture in Northwest India, and also later assimilation in South India. The presence of R1a1a-Z93 in these populations reflects the integration of some originally Indo-Aryan groups into the expanding Dravidian wavefront.

Where does this leave us?

  1. The Indo-Aryan vs. Dravidian dichotomy is not one of newcomers vs. aboriginals. It is of two different sociocultural configurations which came into their current shape in the waning days of the IVC. That is, it is less than 4,000 years old
  2. The two populations were clearly interacting closely around the time of the collapse and disintegration of the IVC and post-IVC societies. There has been gene flow between the two
  3. ~4000 years ago ANI and ASI populations existed in their “pure” form, but that is because ASI aboriginals still existed to the south and east of the IVC, while Indo-Aryans were a new intrusive presence in the Indian subcontinent

July 28, 2017

The Indo-Aryan migration to the Indian subcontinent

Filed under: India Genetics,Indian Genetics — Razib Khan @ 7:45 am

The piece is up at India Today. The headline and title are of course optimized for clicks. I would, for example, say that the Indo-Aryans came from the west, not the West.

In the course of writing this it has become clear that many people have very specific commitments on this issue. I think it is clear I do not. Genetic inference methods have wide shoulders of confidence in particular dates. So I’ll leave it to those with more archaeological knowledge to argue over specific dates. But it strikes me that the dates point to a likelihood that much of the expansion and diversification of Indo-Aryans may precede their expansion into the Gangetic plain ~1500 BCE, the date preferred by many scholars.

Apparently we shouldn’t have to wait too long for ancient DNA from Rakhigarhi (months, not years). But I doubt that will settle anything, as opposed to being preliminary and setting off new debates.

July 8, 2011

On the genetic structure of Afro-Indians

The Pith: Afro-Indians are mostly African, with a substantial Indian minority ancestry. The latter is disproportionately female mediated. It also seems that that ancestry is more northwest Indian, and that natural selection has been operating upon them outside of the African environment.

Along the western coast of South Asia, from Makran in southwest Pakistan, down to the Konkan coast of southwest India, there are isolated communities of Afro-Indians. They are called Siddis or Habshi. Their African origin is clear in their physical appearance, as well as aspects of their folk customs which tie them back to Sub-Saharan Africa. Nevertheless, they have assimilated to many Indian cultural traits. They generally speak the local language, and practice Islam, Hinduism, or Roman Catholic Christianity (in that order in proportion).

How and why did the Siddis arrive in India? The earliest date for their arrival almost certainly must be bounded by the period when Indo-Islamic polities rose to prominence in the early second millennium. The cosmopolitan melange of the armies of the Muslim warlords included diverse groups of Africans, some of whom took power, and established their own self-conscious Afro-Indian dynasties, set apart from the Turkish, Afghan, ...

June 24, 2011

Every South Asian “Arab” a descendant of Muhammad!

Filed under: Genetics,India Genetics,Syed — Razib Khan @ 1:23 pm

Y chromosomes of self-identified Syeds from the Indian subcontinent show evidence of elevated Arab ancestry but not of a recent common patrilineal origin:

Several cultural or religious groups claim descent from a common ancestor. The extent to which this claimed ancestry is real or socially constructed can be assessed by means of genetic studies. Syed is a common honorific title given to male Muslims belonging to certain families claiming descent from the Prophet Muhammad through his grandsons Hassan and Hussein, who lived 1,400 years ago and were the sons of the Prophet’s daughter Fatima. If all Syeds really are in direct descent from Hassan and Hussein, we would expect the Y chromosomes of Syeds to be less diverse than those of non-Syeds. Outside the Arab world, we would also expect to find that Syeds share Y chromosomes with Arab populations to a greater extent than they do with their non-Syed geographic neighbours. In this study, we found that the Y chromosomes of self-identified Syeds from India and Pakistan are no less diverse than those of non-Syeds from the same regions, suggesting that there is no biological basis to the belief that self-identified Syeds in this part of the world share a recent common ancestry. In addition to Syeds, we also considered members of other hereditary Muslim lineages, which either claim descent from the tribe or family of Muhammad or from the residents of Medinah. Here, we found that these lineages showed greater affinity to geographically distant Arab populations, than to their neighbours from the Indian subcontinent, who do not belong to an Islamic honorific lineage.

The results are pretty simple. First:

1) The Syed lineages don’t exhibit a “Syed modal haplotype.” What you should see is a Syed haplotype of ~50%, and then a range of other lineages which introgressed through people lying about their origins or women being unfaithful to their husbands. Instead there are a wide range of haplotypes. Being Syed is an honorific.

2) I don’t think that they really prove higher Arab ancestry as such. They include really diverse populations, from Algerians to Israeli Arabs to Sudanese. The Islamic Honorific Lineages are somewhat closer to these groups, but that could be generic West Asian ancestry. For example, Persian. Or perhaps more African ancestry in cosmopolitan Syed lineages. Or, perhaps Syeds are just former high caste Hindus, who have more West Asian affinities.
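The “less diverse” expectation in the abstract can be made concrete with a standard summary statistic, Nei’s haplotype diversity, H = n/(n-1) · (1 - Σ p_i²), alongside the share of the most common haplotype. The sketch below uses made-up toy haplotype labels, not the paper’s data, and the function names are my own. It only shows what a founder-dominated sample looks like next to a sample with no dominant lineage.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Nei's unbiased haplotype diversity for a list of haplotype labels."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    counts = Counter(haplotypes)
    sum_sq = sum((count / n) ** 2 for count in counts.values())
    return n / (n - 1) * (1 - sum_sq)


def modal_share(haplotypes):
    """Fraction of the sample carrying the single most common haplotype."""
    counts = Counter(haplotypes)
    return counts.most_common(1)[0][1] / len(haplotypes)


# Toy samples (hypothetical labels): one with a ~50% modal haplotype,
# one with ten haplotypes at equal frequency.
founder_like = ["H1"] * 10 + ["H2"] * 4 + ["H3"] * 3 + ["H4"] * 3
no_founder = [f"H{i}" for i in range(1, 11)] * 2

founder_h = haplotype_diversity(founder_like)
no_founder_h = haplotype_diversity(no_founder)
```

If Syeds really shared a recent patrilineal founder, their sample should look like the first list: a high modal share and lower diversity than their neighbours. The paper’s finding is that they look more like the second.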

Below is the PCA and list of Y chromosomal haplogroups. The paper is free at the link above.


June 16, 2011

Present genetic variation is a weak guide to past genetic variation

I’ve been harping on and on for the past few years that the patterns of contemporary genetic variation are probably only weakly tied to past patterns of genetic variation (though Henry Harpending warned me about this as far back as 2004). A major reason that scholars operated under this presupposition is the axiom that most of the variation we see around us crystallized during the Last Glacial Maximum (~20 thousand years before the present).

This may be true in some cases, but I doubt it is true in most cases. I was pointed to a classic case of this problem just today. A reader alerted me to a short paper from this spring which attempts to ascertain the point of origin of the dominant mtDNA haplogroup among the Onge tribe of the Andaman Islanders, M31a1. This is an interesting issue because some researchers have in the past proposed, plausibly, that these indigenous people in the Andaman Islands represent the descendants of the first wave “Out of Africa,” who took the rapid “beachcomber” path. Understanding the key to their genetics may then unlock the key to the “Out of Africa” event. Or ...

May 27, 2011

Indian genetics in the Indian media

Filed under: Genetics,Genomics,India Genetics — Razib Khan @ 1:47 pm

Most of you may know that the Times of India is working on a story about Zack’s Harappa Ancestry Project. Ruchira Paul points me to this story in OPEN about Indian genetics, The Story of Our Origins. They’re using the kit from National Geographic’s Genographic Project, which analyses paternal and maternal lineages. In other words, your father’s-father’s-father’s…father, and your mother’s-mother’s-mother’s…mother. This is a thin slice, but can be informative, and is useful for certain analyses such as molecular clock based time depths. But it is not the total genome, and so gives us only a small window into the past. In contrast Zack uses tens of thousands to hundreds of thousands of markers in his analyses. He may lack some of the specific analytic power in terms of temporal calibration, but he makes up for it in a much more rich and robust understanding of how Indian populations relate.

The story in OPEN tells the “orthodox” model inferred from uniparental lineages circa 2005. I think it’s somewhat outdated, in large part due to thick marker set analyses of the genome, but this part was pretty amusing:


THE MYTH OF THE INDO-EUROPEAN MARKER

This brings us to perhaps the most contentious of markers, Hartosh’s R1a1. The NGP states: ‘Some linguists believe that the Kurgans, nomadic horsemen roaming the steppes of southern Russia and the Ukraine, were the first to speak and spread a Proto-Indo-European language, some 5,000 to 10,000 years ago. Genetic data and the distribution of Indo-European speakers suggest the Kurgans … may have been descendents of M17 (the genetic marker that identifies the R1a1 haplogroup). Today a large concentration—around 40 per cent—of the men living from the Czech Republic across the steppes to Siberia, and south throughout Central Asia are descendants of this clan. In India, around 35 per cent of the men in Hindi-speaking populations carry the M17 marker, whereas the frequency in neighboring communities of Dravidian speakers is only about ten percent. This distribution adds weight to linguistic and archaeological evidence suggesting that a large migration from the Asian steppes into India occurred within the last 10,000 years.’

This NGP claim goes far beyond what the genetic data warrants. Says Bamezai, after looking through the NGP results published in this article, “For me as a scientist, it is necessary to be very conservative in my claims. Any broad conclusions require much more work and detailed study of not just haplogroups, but sub-haplogroups. I think the migration paths described in these cases are in question. I feel R1a1 originated here and contributed to Central Asia rather than the other way around.”

Full disclosure, both Zack and I are carriers of the R1a1 lineage.

Notice how instead of retreating to agnosticism, as his conservatism might entail, the Indian scientist offers his own contentious R1a1-Out-of-India story! Of course you knew that Indo-European migrations had to be a myth for an Indian publication. This is rather how American popular science always has to prove that race has no biological basis. As it is, I think Bamezai is probably close to the truth. I suspect most of the Indian R1a1 lineages are from a different and older diversification of this haplogroup than the northern Eurasian ones, which tend to be characterized by a set of common markers which Indians often lack. But we’ll see.

May 22, 2011

Proper methods and false results

The Pith: Honorable intent and punctilious adherence to proper form and method does not guarantee a set of results which flesh out a genuine phenomenon. Much of science is tragic.

Most of the time I point to and review papers on this weblog which excite me. But in the interests of “balance” and dampening the bias toward material I find interesting and salient I thought it would be interesting to look at a paper which I thought wasn’t too interesting. It’s in the Journal of Human Genetics, part of the Nature Publishing Group empire. Also, it is open access, so you can read it yourself and make your own individual judgments.

The Soliga, an isolated tribe from Southern India: genetic diversity and phylogenetic affinities:

India’s role in the dispersal of modern humans can be explored by investigating its oldest inhabitants: the tribal people. The Soliga people of the Biligiri Rangana Hills, a tribal community in Southern India, could be among the country’s first settlers. This forest-bound, Dravidian speaking group, lives isolated, practicing subsistence-level agriculture under primitive conditions. The aim of this study is to examine the phylogenetic relationships of the Soligas in relation to 29 worldwide, geographically targeted, ...

April 21, 2011

Visualization of genetic distances, part n

Zack Ajmal has been taking his Reference 3 data set for a stroll over at the Harappa Ancestry Project. Or, more accurately, he’s been driving his computer to crunch up ADMIXTURE results ascending up a ladder of K’s. Because it is the Harappa Ancestry Project Zack’s populations are overloaded a touch on South Asians. He managed to get a hold of the data set from Reconstructing Indian Population History. If you will recall this paper showed that the South Asian component which falls out of ancestry structure inference algorithms may actually be a stabilized hybrid of two ancient populations, “Ancestral North Indian” (ANI) and “Ancestral South Indian” (ASI). ANI are a population which can be compared pretty easily to other West Eurasians. There are no “pure” groups of ASI, but the indigenous peoples of the Andaman Islands are the closest, having diverged from the mainland ASI populations tens of thousands of years ago.

At K = 11, that is, 11 inferred ancestral populations, Zack seems to have now stumbled onto the patterns which one would expect from this hybrid model of South Asians. Let me quote him:

Now let’s take all the reference populations with an Onge component between 10% ...

March 28, 2011

Genetics as the myth buster: Indian edition

Filed under: Genetics,Genomics,India Genetics,India genomics,Vishwakarma — Razib Khan @ 12:17 pm

Whenever Zack Ajmal posts a new update to the Harappa Ancestry Project he appends some data to his ethnic database. This sends me to Wikipedia, because how many people are supposed to know what a “Muslim Rawther” means? Well, if you are a Muslim Rawther, and perhaps from Southern India, you would. But South Asian ethno-linguistic categories and hierarchies are notoriously Byzantine, and I have difficulty making sense of them. This isn’t too surprising in my case, as my family’s background is relatively mixed in the very recent past (e.g., Hindus and Muslims, and people of various caste backgrounds), so we’re not the sort who can go on at length about our pure ancestry and all that stuff. Unfortunately, Wikipedia isn’t always useful, because the people editing the entries on particular South Asian ethnic groups are often people from those ethnic groups, so you get a lot of extraneous information, and a particular slant on how awesome and high achieving the group is (also, sometimes there’s funny stuff about how notoriously good looking that particular caste is!). On occasion there are other sources which are informative. For example, Zack has several individuals from the Tamil Nadar caste. I know ...

December 20, 2010

Some of the Indo-Europeans found?

School girls in Hunza, Pakistan

A few days ago I observed that pseudonymous blogger Dienekes Pontikos seemed intent on throwing as much data and interpretation into the public domain via his Dodecad Ancestry Project as possible. What are the long term implications of this? I know that Dienekes has been cited in the academic literature, but it seems more plausible that this sort of project will simply distort the nature of academic investigation. Distort has negative connotations, but it need not be deleterious at all. Academic institutions have legal constraints on what data they can use and how they can use it (see why Genomes Unzipped started). Not so with Dienekes’ project. He began soliciting for data ~2 months ago, and Dodecad has already yielded a rich set of results (granted, it would not be possible without academically funded public domain software, such as ADMIXTURE). Even if researchers don’t cite his results (and no doubt some will), he’s reshaping the broader framework. In other words, he’s implicitly updating everyone’s priors. Sometimes it isn’t even a matter of new information, as much as putting a spotlight on information which was already there. Below is a slice of a bar plot from Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. It uses STRUCTURE with K = 7. To the right of the STRUCTURE slice are two plots of individual data on French and French Basque from the same HGDP data set using ADMIXTURE at K = 10 from Dodecad.



Repeated runs and higher K’s make it clear that the French Basque lack a “West Asian” aspect which other French, and Iberians as well, have. Some of this is clear in the paper I referenced above as well…the key is you have to look at the supplements at K = 6. Because the Basque are the only native non-Indo-European speakers in Western Europe, their origin and relationship to nearby populations has always been of interest (they also have the highest Rh- frequency of world populations). Granted, the French Basque are very similar genetically to the French as a whole. But, it is obviously highly informative that they lack an ancestral component in totality which seems to exist at low but consistent levels across Western European populations. The only other European population at K = 15 who lack the West Asian component in totality are Finns (the Lithuanians come very close).

This is all preamble to a discussion of a post Dienekes put up today, A solution to the problem of Indo-Aryan origins. Remember that Dienekes has been “playing” with ADMIXTURE for only a few months. To claim to have found a ‘solution’ to a problem as intellectually and politically intractable and explosive as this is rather bold. The crux of the matter is that at certain confluences of K’s and population sets Dienekes has discovered a distinctive signature of ancestry which seems to be modal on the north slope of the Caucasus, and spans India and Europe. He terms this “Dagestani,” due to the fact that among a population sample from this province in Russia this ancestral component is overwhelmingly dominant. The patterns of Dagestani admixture in Europe and India are curious and suggestive.

1 – In Europe the frequencies are low, but irregularly distributed (excepting around the North Caucasus). Scandinavians and British have appreciable fractions, Finns and Southern Europeans do not. Here’s Dienekes:

Interpreting this pattern is not easy, but it does seem that this component seems to have a V-like distribution, achieving its maximum in Caucasus and its environs, then undergoing a diminution, and achieving a secondary (lower) frequency mode in NW Europe.

The surprising appearance of the homonymous Dagestan component in India suggests a widespread presence of a common ancestry element. The West Asian element, by comparison seems to have a more normal /\-like distribution around its center in Anatolia-Caucasus-Iran region. It does reach the Atlantic coast, but is lacking in Scandinavia and Finland, and also in India itself.

2 – South Indian Brahmins have appreciable fractions, but non-Brahmins in the same region do not. In contrast, those who come from Indo-Aryan speaking backgrounds do seem to have Dagestani ancestral components, irrespective of other aspects of ancestry. For example Pakistanis don’t have that much more Dagestani than South Indian Brahmins or Gujaratis. Also compare the relatively narrow window of Dagestani ancestry variance among Dodecad South Asians (I’m DOD075). DOD088 is from what I recall a Reddy from Andhra Pradesh, a non-Brahmin but non-low caste. It is interesting that they have a high proportion of “Pakistan,” but no Dagestani. I have ~10% Dagestani, but no Pakistani.

Below is K = 10 for a selection of populations. Dienekes has now included two non-Indo-European speaking Pakistani populations: the Brahui (Dravidian) and Burusho (linguistic isolate in the mountains of Pakistan):

Some general patterns are evident. The light blue is indicative of generic “Indian” ancestry. It is not found in appreciable proportions outside of subcontinental populations (or those of recent subcontinental origin). The same with the red, and light orange. For your reference the dark orange is a “Northern European” component, modal in Lithuania. The light and dark green are both East Asian components. The dark blue is a “West Asian” component modal in Georgia, and prominent across Europe, declining as a function of distance from the eastern shore of the Black Sea (this is surely the West Asian which distinguishes the French from the French Basque). I believe that the light purple dominant in the Brahui and the light red dominant in the Burusho probably form as a compound the aforementioned Pakistani component. The dark purple is the Dagestani.

First, a word on the Brahui. These are a group of tribes who reside in northern Balochistan in Pakistan. A small number are even to be found in Afghanistan. Historically they have had close relations with the Baloch, an Iranian speaking cluster of tribes who totally envelop the Brahui. The Brahui do speak a Dravidian language, of a family dominant in South India and found in isolated regions of Central and Eastern India. There are two broad models for the existence of a Dravidian language in Pakistan. The first is that the Brahui are remnants of more widely spoken Dravidian languages which date back to the Indus Valley civilization. The second is that the Brahui arrived during the medieval period from another region of South Asia where Dravidian languages were more common. Assuming either model, it has long been presumed that their envelopment by the Baloch has had a strong impact on the Brahui genetically; the two groups are very close. This is evident in Dienekes’ results as well. But observe that the Baloch are the group which seems more cosmopolitan in ancestry than the Brahui. If the Brahui were Dravidians from deep in India it seems that they would have a greater residual component of India-specific ancestry (light blue and orange). This is not so. In fact the Baloch have more of the Indian ancestral component than the Brahui. The Brahui component is found across Pakistan, and into India, albeit at lower proportions. Naturally, the Baloch have the second highest fraction. I believe these results should shift us toward the position that the Brahui are indigenous in relation to the Baloch, and that the Baloch ethnic identity emerged through the shift of a Brahui substrate, as evidenced by the greater cosmopolitanism of the Baloch. Additionally, Dienekes observes that the Brahui have a lower proportion of the Dagestani component than most other Pakistani groups, and several Indo-Aryan groups in India proper.

The Burusho are even more interesting than the Brahui. Unlike the Brahui the Burusho are very isolated in the mountainous fastness of Baltistan in northern Pakistan. Additionally, their language, Burushaski, is a linguistic isolate. Others of the class are Basque and Sumerian. In general it is assumed that linguistic isolates were once part of broader families of languages which have gone extinct. Burushaski probably persists in large part because of the geography which its speakers inhabit. Mountainous areas often preserve ethnic and linguistic diversity because the terrain allows for the persistence of local variety. I believe it is plausible that the Burusho have been far more isolated than the Brahui. This seems to show up in the ADMIXTURE plot: the Burusho have a greater proportion of their modal ancestral component than the Brahui. Additionally, the Burusho have an even smaller component of Dagestani than the Brahui.

Below is a chart Dienekes constructed ordered by proportion of Dagestani for his South Asian populations. Next to it I’ve placed a chart from a PCA which has some of the same population samples. Compare & contrast:

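[Figure: South Asian populations ordered by Dagestani proportion, alongside a PCA plot of some of the same samples]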

The PCA is looking at between population variation in totality. So naturally the Dagestani component isn’t going to be predictive of that. Rather, it speaks to the possibility which Dienekes is mooting: that the Dagestani component spread in the Indian subcontinent with the Indo-Aryans specifically, overlying the local resident substrate. In South India this meant that Brahmins brought this, mixing with the indigenous Dravidian population. In Pakistan the Indo-Aryans, and Iranians, were overlain on a substrate made up of the ancestors of the Burusho and Brahui. The dominant signal of genetic relationship has to do with the substrate, not the Indo-Aryans. So that’s what’s going to show up on the PCA. In other PCA plots the model where South Indian Brahmins are a linear combination of a Pakistani-like population and a Dravidian population becomes clearer. But when you look at ancestry using something like ADMIXTURE you have the potential to tease apart different components, and so uncover relationships which may have been obscured when looking at aggregate variation.
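The “linear combination” idea can be sketched in a few lines. This is only an illustration: it assumes a simple two-source model, the function name is hypothetical, and the coordinates are made up rather than read off any of Dienekes’ plots. Given the positions of two source populations and a putatively mixed one, a least-squares projection gives the share of the first source.

```python
def mixing_fraction(mix, source_a, source_b):
    """Least-squares share of `source_a` in `mix`, modelled as
    alpha * source_a + (1 - alpha) * source_b.

    Each argument is a sequence of coordinates (e.g. PCA positions).
    """
    diff = [a - b for a, b in zip(source_a, source_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("source populations must differ")
    num = sum((m - b) * d for m, b, d in zip(mix, source_b, diff))
    return num / denom


# Made-up PCA coordinates for illustration only (not real data).
pakistani_like = (0.04, 0.01)
dravidian = (-0.02, -0.03)
brahmin = (0.01, -0.01)

alpha = mixing_fraction(brahmin, pakistani_like, dravidian)
print(f"Pakistani-like share: {alpha:.2f}")
```

A population that does not sit near the line between the two sources would still get a number from this projection, which is one reason a method like ADMIXTURE, which models several components at once, can show things a single axis hides.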

Dienekes’ model seems to posit three steps in rapid succession ~4,000 years ago. A background variable which must be mentioned is that one must account for the Mitanni, a dominant Syrian power circa 1500 BC where a non-Indo-European language was the lingua franca, and yet a definite Indo-Aryan element existed within the elite. Indo-Aryan specifically because the Indo-European element within the Mitanni was not Iranian, but specifically Indo-Aryan. An easy explanation for this is that the Indo-Aryan component of the Indo-Iranian branch of the Indo-European languages crystallized outside South Asia, and independently reached Syria and India. In Syria it went extinct, while in India it obviously did not. By Dienekes’ model the Mitanni would be rather closer to the urheimat of the Indo-Aryans.

An aspect of his model which I do not understand is why it has to be Indo-Aryan, instead of Indo-Iranian. The South Asian population in which the Dagestani component is modal, the Pathans, are Iranian, not Indo-Aryan. Additionally, this model seems to not speak in detail to the existence of the Dagestani element among Europeans. Here is a sorting of European populations (with Iranians included) by the Dagestani component:


Population Dagestan (%)
Urkarah 93
Lezgins 47.9
Stalskoe 38.7
Adygei 16.4
Orcadian (Orkney) 12.6
Georgians 12.4
White_Utahns 11.2
Iranian 10.9
Scandinavian_D 10.2
Armenian_D 9.9
German_D 9.1
Turks 8.8
Armenians 8.4
French 7.9
Hungarians 7.5
Russian_D 6.3
Spanish_D 4.6
North_Italian 4.5
Spaniards 4.4
Romanian 4.1
Finnish_D 4.1
Russian 4
Greek_D 3.8
Portuguese_D 3.6
Tuscan 3.5
Tuscans 3.4
Lithuanians 2.9
S_Italian_Sicilian_D 2.8
Belorussian 2.5
Cypriots 2
Sardinian 1.5
French_Basque 0.7

There is here a strange pattern of rapid drop off from the Caucasus, and a bounce back very far away, on the margins of Germanic Northwestern Europe. This to me indicates some sort of leapfrog dynamic. A well known illustration of this would be the Ugric languages. The existence of Hungarian in what was Roman Pannonia is a function of the mobility and power of Magyar horsemen, and their cultural domination over the Romance and Slavic speaking peasantry (their genetic impact seems to have been slight). No one believes that Germanic languages are closely related to Indo-Aryan (rather, if there is structure in Indo-European beyond Indo-Iranian, Celtic, etc., it would place the Indo-Iranian languages with Slavic). So what’s going on? I think perhaps the Dagestani component is in part a reflection of the common Indo-European origin in that region. For whatever reason that signal is diminished in much of the rest of Europe. Perhaps Southern Europe was much more densely populated when the Indo-Europeans arrived. Additionally, it seems highly likely that in places like Sardinia, much of Spain, and Cyprus, Indo-European speech came through cultural diffusion (elite emulation) and not population movement. Or perhaps we’re seeing the vague shadows of population admixtures on the Pontic steppe, where distinct Germanic and Indo-Iranian confederations admixed with a common North Caucasian substrate.

Going back to India, let’s revisit the model of a two-way admixture between “Ancestral North Indians,” who were genetically similar to Europeans and West Asians, and “Ancestral South Indians,” who were closer to, but not very close to, East Eurasians. The ANI & ASI. The ASI were probably one of the ancient populations along the fringe of southern Eurasia, all of whom have been submerged by demographic movements from other parts of Eurasia over the past 10,000 years, excepting a few groups such as the Andaman Islanders and some Southeast Asian tribes. The model was admittedly a simplification. But taking that model as a given, and accepting that the Dagestani element is indeed Indo-Aryan, we can infer that the ANI were not Indo-European. It is notable that the South Indian Brahmins have elevated fractions of both the Brahui and Burusho modal components. This is probably indicative of admixture of the Indo-Aryan element in the Indus Valley, prior to their expansion to other parts of India. I assume one of the languages spoken was Dravidian, though if ancient Mesopotamia was linguistically polyglot at the dawn of history I would not be surprised if the much more geographically extensive Indus Valley civilization was as well.

Aishwarya Rai

The irony is that today when someone refers to a “Dravidian” physical type, they’re not talking about someone who looks like a Pakistani. They’re talking about someone who looks South Indian, where most Dravidian languages are spoken. But combining the inference from Dienekes’ model and the previous two-way admixture model, you reach the conclusion that lighter skin and more West Asian features among South Asians may be more due to Dravidian-speaking ancestors in the Indus Valley, not Indo-Aryans! It goes to show the wisdom of differentiating linguistic classes from biological ones when discussing historical population genetics. Unfortunately it is a wisdom most of us interested in these topics do not show, alas.

As I like to say, interesting times….

Note: If you leave a comment, please don’t be smarter-than-thou in your tone. I have stopped publishing those sorts of comments because the reality is that most of them have not been that smart or informed. At least by my estimation. If you actually are smarter than the average bear, and impress me with your erudition and analytical clarity, I’ll probably let your comment through no matter your attitude. But I wouldn’t bet on it if I were you, so show some class and humility. Most of us are muddling through.

Image Credit: Georges Biard, iStockPhoto

December 17, 2010

South Asians too are sons of the farmers?

Filed under: Aryans,Genetic History,Genetics,Genomics,India,India Genetics,Indo-Aryans — Razib Khan @ 2:58 pm

I mentioned a few days ago that a friend was trying to get together some data to analyze the genetic variation of South Asians. By a strange coincidence Dienekes just published a more detailed analysis of South Asians…and uncovered something very interesting, though not that surprising. Some technical preliminaries:

A note of caution: The reduced marker set (~30k) means that a lot of noise is added in the admixture estimates. In particular, many individuals are likely to get low-level admixture from population sources that can be attributed to noise. But, as we will see, the small marker set does not really affect either the power of the GALORE approach, or of ADMIXTURE to infer meaningful clusters.

In addition to the various online sources of public data Dienekes got about a dozen South Asians. I was one of those South Asians, DOD075. In many ways I’m a rather standard issue South Asian, similar to Gujaratis, except that I have a substantial ‘East Asian’ component. More concretely, between 1/6 and 1/7 of my ancestry seems to be of eastern origin, far higher than the norm among South Asians. The rest of my ancestry was mostly South Asian specific, with a minor, but significant ‘West Asian’ component common across northern India.

Rerunning with more data and different samples, Dienekes came out with a different set of ancestral components. Of particular interest to me, he broke down the East Asian component into East Asian proper and Southeast Asian. Below are a selection of populations with ancestral components + me. I’ve also renamed a few components. North Kannadi = Dravidian and Irula = Indian tribal. Indian = Generic Indian. Looking at the Fst it seems that Indian endogamy and population bottlenecks have had an effect…look at the North Kannadi distance from everyone else.

[Figure: ancestral components for selected populations and DOD075, with Fst distances]

Remember that in the previous analysis I was very similar to a Gujarati, except with an East Asian element. My supposition that my ancestry has some connection to Burma seems to be supported by these results. Looking at my balanced ratio between East Asian and Southeast Asian, that is what one might expect from someone of a Burman ethnicity. I am not saying that I have recent Burman ancestry per se. Rather, Ahom, Mizo, Chakma, and a range of tribal populations from the liminal zone between South and Southeast Asia may suffice. The main other option is that I have a great deal of Munda ancestry. Not implausible in light of the likelihood that the Munda brought rice agriculture to northeast South Asia, and predate Indo-Aryans, and possibly Dravidians, in Bengal. How would I distinguish these possibilities? I’ve ordered 23andMe kits for both my parents. The most likely candidate for recent Southeast Asian ancestry is my paternal grandfather. If the admixture event was recent, if I have a recent ancestor(s) of “hill tribe” origin, I would expect to see longer linked regions of East/Southeast Asian origin than if the admixture was ancient (and so distributed more equitably across DNA strands due to recombination).
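The linkage argument can be made concrete with a back-of-the-envelope sketch. It assumes a single pulse of admixture and the rough approximation that, after g generations of recombination, ancestry tracts average about 100/g centimorgans. The function name and the 25-years-per-generation figure are illustrative assumptions, not anything taken from Dienekes or 23andMe, and the approximation is crude for very recent admixture.

```python
def mean_tract_length_cm(generations: float) -> float:
    """Approximate mean length (cM) of an admixed ancestry tract.

    Assumes a single admixture pulse `generations` ago and roughly one
    crossover per Morgan (100 cM) per generation, so tracts shrink as 100 / g.
    """
    if generations <= 0:
        raise ValueError("generations must be positive")
    return 100.0 / generations


YEARS_PER_GENERATION = 25  # assumed, for illustration only

scenarios = [
    ("great-grandparent", 75),
    ("~500 years ago", 500),
    ("~4,000 years ago", 4000),
]
for label, years in scenarios:
    g = years / YEARS_PER_GENERATION
    print(f"{label}: ~{mean_tract_length_cm(g):.1f} cM per tract")
```

Under these assumptions a recent “hill tribe” ancestor would leave a handful of long East/Southeast Asian segments, while ancient admixture would leave the same total ancestry chopped into many short ones.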

But the bigger point of Dienekes’ post is what he terms “Dagestani” ancestry across much of Eurasia. I’ll quote him:

The most exciting thing, however, is the fact that the origins of a part of the West Asian component of my previous analyses can be partially located: it is the purple component centered in Dagestan, i.e., among Northeast Caucasian speakers such as Lezgins, and the Dargins who inhabit Urkarah.

Readers of this blog may remember the surprising appearance of this Lezgin-specific component in the Balkans (but not Greeks) a few weeks ago. Now it has turned up as a substantial component in India as well.

Back then, I speculated that this component may derive from a prehistoric population that was spread in (but not limited to) the northern arc of the Black Sea from the Balkans to the Caucasus. Even in this analysis, you can see that both Romanians and Hungarians have some of it, and so do Lithuanians and Belorussians, while Tuscans (like the Greeks of my previous experiment) do not.

Hence, this component stretches from at least the Baltic to India, but is largely absent in southern Europe. I will go out on a limb and propose that this component is representative of a non-Indo-European component in the ancestors of the Indo-Iranians.

Paul Conroy observes that on this finer-grained analysis I don’t have any “West Asian” at all. What had previously been West Asian turns out to have been, in my case, a compound of Dagestani + European. I can’t say that I’m that surprised by this. Years ago I noticed that HGDP STRUCTURE analyses were always giving suggestive signs of a connection between West-Central Eurasia and South Asia.

Who were the Indo-Iranians? I lean toward the proposition that they do derive from the Andronovo culture of the Eurasian steppe. This would date the entrance and expansion of Indo-Aryans in northern India 3-4,000 years ago. I also contend that the dominant element of ancestry among modern South Asians is not Indo-Aryan. Rather, it is an ancient stabilized hybrid of pre-agricultural societies in the Indus valley and Neolithic farmers who originated from what is today western Iran and eastern Anatolia. Therefore, I posit that the “Aryanization” of the Indian subcontinent is properly modeled as the same process which led to the emergence of an Anatolian and Rumelian Turkish identity: a small elite population which forces an identity shift among the majority.

Back to farming:

As I’ve remarked in the past, Eurasia can be broadly seen as the playground of three major groups of people: the Caucasoids of the West, the Mongoloids of the East, and a southern group of people which is most strongly represented in South Asia, but whose presence can be detected in Southeast Asia as well, although in the latter case it has been marginalized and/or absorbed by the arrival of Mongoloids.

This southern group of people has sometimes been called “Australoid” because of its perceived resemblance to Australo-Melanesians. Indeed, in my K=5 mega-analysis an affinity between Papuans/Melanesians and people of South and Southeast Asia is apparent. These “Australoids” are very old populations, probably stemming from the early Out-of-Africa coastal dispersal route, and we shouldn’t be tricked by their phenotypic similarity into thinking that different groups of them are particularly close genetically. Just as “black Africans” are not the same, neither are the “Australoids” and mixed-”Australoids” at the shores of the Indian Ocean.

It is probably the invention of agriculture that is responsible for their marginalization. In Africa, the Pygmies and Bushmen have been absorbed or pushed aside by the demographic Bantu juggernaut, with a few other language groups also hitching a ride on the agriculture/pastoralism economy. In West Eurasia, where agriculture was invented earliest, pre-agricultural populations left no traces. In East Eurasia, the agriculturalists could not expand to the far north where many relic populations exist, but they could (and did) move to the south where they assimilated or drove away pre-existing populations, leaving a few of them, like the Taiwanese Atayal as partial remnants of the older population stratum.

The Irula are South Indian tribals, so they are the closest one can get to South Asian autochthons, and yet even they presumably have a large minor component of “Ancestral North Indian.” The tribal groups in Reconstructing Indian Population History all exhibited proportions on the order of ~40% ANI. It seems that agriculture “stalled” in the Indus valley and the highlands to the west for thousands of years in South Asia. During this period of stalling I believe that the farmers absorbed a great deal of genetic material from the indigenous hunter-gatherers, and so produced a “distinctive” Indian genetic profile. More West Eurasian than not, but with a very large dollop of the ancient substrate of southern Eurasia which had a distant, but closer, affinity with that of East Asia. Once social and cultural forces allowed for the rapid expansion of farmers there was a wave of advance from the Indus valley east and south. In the east the proto-Indians would have encountered Mundari-speaking groups, drifting in, who practiced rice agriculture, which they also adopted. In the south the proto-Indians would have encountered more hunter-gatherers. Many of the tribal people in India are today facultative hunter-gatherers, herders, and extensive farmers. I believe that these marginal proto-Indian groups assimilated hunter-gatherers more easily than would have otherwise been the case because some of the proto-Indians reverted to a hunter-gatherer lifestyle in the agriculturally unsuitable highlands of the Deccan and Chota Nagpur. The social boundaries in the uplands of South India were such that the line between hunter-gatherer and farmer was more fluid than elsewhere, explaining the former’s greater genetic impact through intermarriage and assimilation.

This sort of general dynamic probably applies to Indo-Europeans. There is no reason why the original Indo-European tribes could not have been compounds who picked up different ancestral components in their peregrinations. Compare the various Turkic people, Anatolian Turks, Chuvash, and Yakut. All of them have affinities with nearby peoples, despite having a common Turkic culture and genetic component. One notable trend in Europe is that while the French have a minor, but significant West Asian component, the Basque have none of it. Dienekes’ sample is small, but it looks as if Scandinavians have more of this than the Finns. This West Asian component may not have been the dominant one among the Indo-Europeans, but I suspect it was a significant one. If the original speakers of proto-Indo-European did not have it, they likely absorbed it early on, just as the West Asians absorbed a native South Asian element in the Indus valley.

Finally, as a general rule of thumb, I would now suggest that the primary way in which hunter-gatherer genes can persist is through an ecological stall on the part of farmers. During the stall gene flow naturally occurs, probably through exchange of females (coercive or not), or the integration of hunter-gatherer males into war-bands or as slaves. Over time the farmers on the frontier have changed genetically, so that when they start expanding rapidly due to a technological or cultural innovation, they share more with the hunter-gatherers whom they supersede than they otherwise would have.
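As a rough illustration of the stall argument, suppose (purely as an assumption) that in each generation a constant fraction m of the frontier farmers’ gene pool is replaced by hunter-gatherer ancestry. The farmer-derived share then decays as (1 - m)^t, so even a small m adds up over a long stall. The rate and durations below are invented for illustration and are not estimates from any data.

```python
def hunter_gatherer_fraction(m: float, generations: int) -> float:
    """Hunter-gatherer ancestry in the farmer population after a stall.

    Assumes constant one-way gene flow: each generation a fraction `m`
    of the farmers' gene pool comes from hunter-gatherers.
    """
    if not 0.0 <= m <= 1.0:
        raise ValueError("m must be between 0 and 1")
    return 1.0 - (1.0 - m) ** generations


# Hypothetical numbers: 1% gene flow per generation, stalls of varying length.
for gens in (10, 40, 120):
    share = hunter_gatherer_fraction(0.01, gens)
    print(f"{gens} generations: {share:.1%} hunter-gatherer ancestry")
```

The point of the toy model is only that duration matters as much as rate: a brief contact during a fast wave of advance leaves little trace, while the same trickle sustained over a long stall does not.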

October 28, 2010

Sons of the conquerors: the story of India?

[Map: distribution of the Austro-Asiatic languages and their subdivisions]

The past ten years have obviously been very active in the area of human genomics, but in the domain of South Asian genetic relationships in a world wide context they have seen veritable revolutions and counter-revolutions. The final outlines are still to be determined. In the mid-1990s the conventional wisdom was that South Asians were a branch of a broader West Eurasian cluster of peoples, albeit more distant from the core Middle Eastern-North-African-European-Caucasian clade. The older physical anthropological literature would have asserted that South Asians were predominantly Caucasoid, but with an Australoid element admixed in at varying proportions as a function of geography and caste. To put it more concretely, and I think accurately, a large degree of South Asian physical variety can be defined along the spectrum between A. R. Rahman and Nawaz Sharif. The regional and caste truisms are only correlations. Subrahmanyan Chandrasekhar was a Tamil Brahmin, but experienced anti-black racism in the United States. I think that is reasonable in light of his appearance.

This rough & ready mainstream understanding, supported by classical genetic markers, was overturned in the early years of the 21st century. One line of thought argued that South Asians were much more distinct from the broader Western Eurasian cluster of peoples. Representative of this body of work is a paper like The genetic heritage of the earliest settlers persists both in Indian tribal and caste populations. These researchers tended to start with the female lineages, mtDNA, and then supplement that with Y lineages, the paternal descent. A separate line of evidence, generally drawn from Y chromosomal results, indicated that there were deep connections between the people of India and those of Central Eurasia, in particular via the R1a haplogroup. Additionally, one aspect of the first set of results which was very surprising was that it actually placed South Asians closer to East, not West, Eurasians. But by the end of the aughts the uniparental studies had been supplemented by a range of results produced from SNP-chips, which looked at hundreds of thousands of genetic variants. These studies seemed to support the older view of South Asians being closer to West Eurasians than East Eurasians. Finally last year a paper came out which posited that almost all South Asian populations were actually an ancient stabilized hybrid between two groups, a European-like population, “Ancestral North Indians” (ANI), and another group which is no longer present in unadmixed form, “Ancestral South Indians” (ASI), of whom the Andaman Islanders are distant relatives. Though there was a slight bias toward ANI as a whole, the fraction of ASI increased as one went southeast, and down the caste ladder. The distinctive “South Asian” ancestral group in other words then may actually be conceived of as a compound of these two elements; an admixture of the native substrate against a European-like genetic background.

Strangely it sounds an awful lot like the older idea of a Caucasoid population with Australoid admixture. We know now that the connection between the tribal peoples of India, and the indigenous groups of South and Southeast Asia as a whole, to those of Australia and Melanesia, is tenuous at best. So the term “Australoid” is not really informative, and may even mislead. And in terms of historical linguistics I don’t think we’ve solved the problem by appealing to an “Aryan invasion.” The high fraction of ANI among South Indian tribal groups who are isolated from even Dravidian caste groups is a clue to the likelihood that the admixture event is very ancient, and probably precedes the arrival of the Aryans to the Indian subcontinent.

But there are more than two actors in this game. In Reconstructing Indian population history the authors acknowledge that their model is stylized, that reality is more complex. Additionally, they perceive in their data that some tribal groups from northeast India have an element which is outside of the purview of a two-way admixture event. They discarded this set from their broader analysis because this seemed to be a phenomenon restricted to these groups. A new paper in Molecular Biology and Evolution re-injects this third element into the picture. Population Genetic Structure in Indian Austroasiatic speakers: The Role of Landscape Barriers and Sex-specific Admixture:

The geographic origin and time of dispersal of Austroasiatic (AA) speakers, presently settled in South and Southeast Asia, remains disputed. Two rival hypotheses, both assuming a demic component to the language dispersal, have been proposed. The first of these places the origin of Austroasiatic speakers in Southeast Asia with a later dispersal to South Asia during the Neolithic, whereas the second hypothesis advocates pre-Neolithic origins and dispersal of this language family from South Asia. To test the two alternative models this study combines the analysis of uniparentally inherited markers with 610,000 common SNP loci from the nuclear genome. Indian AA speakers have high frequencies of Y chromosome haplogroup O2a; our results show that this haplogroup has significantly higher diversity and coalescent time (17-28 KYA) in Southeast Asia, strongly supporting the first of the two hypotheses. Nevertheless, the results of principal component and “structure-like” analyses on autosomal loci also show that the population history of AA speakers in India is more complex, being characterised by two ancestral components – one represented in the pattern of Y chromosomal and EDAR results, the other by mtDNA diversity and genomic structure. We propose that AA speakers in India today are derived from dispersal from Southeast Asia, followed by extensive sex-specific admixture with local Indian populations.

Some background is necessary here. South Asia is notoriously linguistically diverse, but, that diversity can be bracketed into several broad families. First, the Indo-European languages are represented by Indo-Aryan and Iranian dialects (and Germanic, if you include English). Second, the Dravidian languages are found across the subcontinent, from Brahui in Pakistan to Malto in Bangladesh. But they’re really the dominant languages in the southern cone of South Asia. That being said it seems likely that historically their distribution extended far into the north, with Brahui in western Pakistan being a relic of that period, as well as the fragmented tribal groups in Central India. There is also evidence down to historic periods of a Dravidian-speaking substrate in Maharashtra. And purely from a philological perspective it seems clear that many Indo-Aryan languages evolved within a Dravidian linguistic substrate.

Next, in the far north there are languages of Tibetan provenance and affinity. These are explicable in their origins and relationship. But in the northeast third of the Indian subcontinent there are two groups of Austro-Asiatic languages. The prefix “Austro” is indicative of the symbiotic relationship between historical linguistics and physical anthropology in the early 20th century (most famously illustrated in the transplantation of the social-linguistic term Aryan from a South Asian and Iranian context, to a racialized Northern European term). The map at the top of this post shows the distribution of the Austro-Asiatic languages, as well as their subdivisions. There is clearly an eastern and western wing to the group, but most scholars assume that this is an artifact of the historical eruption of the Burman and Thai peoples out of the southern fringes of the Chinese Empire and into mainland Southeast Asia.

Within India the Austro-Asiatic languages fall into two broad categories: the Munda and the Khasi. The Khasi inhabit the massif which separates Bengal and Assam. Their culture and society are at some variance from the norm in India (they are matrilocal, and animist or Christian). A close relationship to the people to the east is clear in both their language and their physical appearance. The Khasi, and other groups such as the Garo, are of the family of peoples and ethnicities which have arrived from the east and north relatively recently, making the transition from the world of Tibet and Burma to India. This is evident in the face of the Khasi child in the image to the left. Once passing out of their lands of origin these populations have assimilated to different degrees to the Indic domain. The Tripuri people for example retain a Tibeto-Burman language, but are adherents of Vaishnav Hinduism (my own family were once subjects of the Manikya dynasty). The Ahom of Assam were totally assimilated by the Indo-Aryan substrate. Like the Bulgars of Bulgaria their only influence was in the ethnonym that they contributed to their subjects. A quick survey of my own genetics, and those of other South Asians of eastern origin on 23andMe, clearly shows the influence of assimilated Tibeto-Burmans. One Bangladeshi Muslim individual clearly carries an East Asian Y chromosomal haplogroup.

The Munda are a somewhat different case. In older historical literature on South Asia there is some consideration that the Munda may be the earliest inhabitants of India; predating the Dravidians. Some readers of South Asian origin also point out that in the early Indo-Aryan language there may be more evidence of Munda, than Dravidian, influence. But the eastern connections of the Munda languages seem clear, albeit less explicable than those of the Khasi or the Tibeto-Burman peoples of the far northeast. If the Munda are the indigenous people then it stands to reason that the Mon-Khmer languages derive from South Asia. On the other hand the vast majority of the Austro-Asiatic languages exist in Southeast Asia, and, the Munda themselves have been hypothesized as being the bearers of rice-culture from the east.

This is where genetics comes into play. There has already been evidence of an eastern influence in the genes of the Munda from other researchers, so what this paper does is look at that in detail, instead of discarding it as a minor effect which muddles the broader picture. I’ve reformatted figure 3 to show how the groups relate to each other. On the left is a PCA. Most of the variance is west-east, ~6%, while some of it is north-south, ~1%. On the right is a bar plot generated from ADMIXTURE. I’ve edited out many of the populations. Focus on the Austro-Asiatic groups from India.

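[Figure: reformatted figure 3 from the paper, with a PCA on the left and an ADMIXTURE bar plot on the right]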

In the PCA you see the SE-NW axis of ANI-ASI admixture which is the primary aspect of genetic variation within South Asia. Numerically Dravidian and Indo-Aryan groups along this axis are the vast majority of South Asians. But the Munda and other Austro-Asiatic groups are not trivial; there are strong suggestions that the eastern Indo-Aryan groups, Oriya, Bengali, and Assamese, are to some extent shaped by influence from the Austro-Asiatic elements. The closer connection of the Khasi to East Asian populations is clear on the PCA. But the fact that the South Indian samples are further along axis-Y than the Munda is indicative of admixture in the Munda population. Looking at the bar plot that’s clear. The dominant dark-green signature of South Indian ancestry is also predominant among the Munda, and found at non-trivial amounts among Iranian, Khasi, and Southeast Asian populations, but the Munda clearly have an eastern component which is not found in South Indians. This is probably the element which perturbs them on the PCA.

But this just tells us the relationships in terms of total genome content. It doesn’t necessarily tell us the historical sequence of admixture events or the direction of migration. In fact, the evidence of Indian ancestry in Southeast Asia could suggest migration from South Asia to Southeast Asia (there is plenty of cultural evidence of transmission, though the presumption is that the demographic movements were marginal). They note in the paper that one phenomenon which could be obscuring and confusing our understanding is that much of gene flow occurs through isolation-by-distance (IBD), that is, village-to-village dynamics. In contrast to this you have folk wanderings, which result in a “leapfrog” aspect. The Hazara and Uyghur are both cases of leapfrogging, as their genetic makeup can’t be explained easily by IBD. So here the connections between the Munda and Southeast Asians, and the broader relationship between Southeast Asians and South Asians, could be IBD, or perhaps reflect deep ancient common ancestry. Perhaps the ASI group spanned the region from the Arabian Sea to the South China Sea, and was only later overlain by ANI and East Asian populations.

To explore these questions the authors drilled down to a more fine-grained scale, looking at uniparental lineages as well as EDAR, a gene at which recent selection seems to have operated in East Asians but not in other groups. Though uniparental lineages are only partially informative in terms of ancestry, they are very amenable to dating because of their haploid inheritance patterns. And the relationships between the terminal branches can give us historical information.

The following figure shows the relationship and distribution of a particular Y chromosomal haplogroup which the Munda carry, and other South Asians tend not to, which connects them to the east:

[Figure: relationship and distribution of Y chromosomal haplogroup O2a (M95)]

The haplogroup is O2a (M95). The results from the Y chromosomal data are not clear, though they do seem to reject the model whereby Southeast Asian O2a lineages derive from Indian ones. But it does not seem as if you have a scenario where one founder lineage entered South Asia from Southeast Asia; there are too many disparate branches of O2a found among Indians. Additionally, the coalescence time (back to the last common ancestor) is deeper in Southeast Asia, but still deep in South Asia among the Munda. From this it seems that the origin of Austro-Asiatic languages in South Asia can be rejected, but the details of the emergence of Austro-Asiatic in South Asia cannot be clearly perceived as of yet. From what I can gather the authors themselves do not necessarily believe that their results in this domain are robust (insensitive to varying the model’s assumptions even marginally).

An interesting point, though, is that the mtDNA, the female lineage, does not seem to diverge from other South Asians much at all. I find it intriguing that this is the same pattern we see along the major NW-SE axis of variation. It seems that mtDNA lineages unite South Asians, while the Y lineages separate them (by caste and region). The generality has many exceptions, but it points to a peculiar sex-mediated admixture process from both the northwest and northeast. Men on the move have reshaped the genetics and culture of South Asia, but the mtDNA lineages still point to an ancient Eurasian group with distant but stronger affinities to the east than the west. The mtDNA lineages are likely the purest distillation of ASI.

Finally, they look at frequencies of variants of EDAR among the South Asian groups. EDAR is in some ways diagnostic of East Asian ancestry; it seems that a variant which produces thick straight hair emerged relatively recently among East Asians. Here’s the result from the HGDP browser:

[Figure: EDAR variant frequencies from the HGDP browser]

[Figure: map of derived EDAR G allele frequency by population group]

The G allele exhibits co-dominance, so the GA phenotype has intermediate hair-thickness between AA and GG. Haplotype structure based tests of natural selection have indicated that the derived G allele is recent. The map to the right shows the frequency of the derived G variant by population group. The bubble size is proportional to frequency, while the colors represent language groups. Again the Khasi and Tibeto-Burman groups are as you’d expect: they exhibit a relatively high frequency of the derived variant. The Hazara are a group which only came into being within the last 1,000 years through an admixture event. The Tharu seem to have their origins in Nepal’s transitional zone, and all the Nepali populations have significant admixture with Tibetan groups even if they themselves are not Tibetan in language and culture. The interesting result is the Munda. The Dravidian groups lack the derived EDAR variant, as do Indo-European groups without a plausible East Asian source of admixture. But within the Munda the derived variant is found at a frequency of ~5%. This is far lower than the 60% among the Tibeto-Burmans of the northeast, or the 40% among the Khasi, but it is significant. And this result allows the authors to reject the IBD model of connection for Austro-Asiatic groups, because the Munda harbor the variant which other South Asian groups in their environs do not. Gene flow predicated on linguistic affiliation at such a remove seems implausible, so the most parsimonious explanation is that the Munda languages arrived in India from Southeast Asia as part of a leapfrog folk wandering.
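
To see why a variant that skips the intervening populations argues against isolation-by-distance, here is a toy sketch of my own, not anything from the paper. It assumes a deterministic one-dimensional stepping-stone model; the number of demes, the migration rate `m`, the number of generations, and the "leapfrog" frequencies are all made-up illustrative values. Under neighbor-to-neighbor exchange the frequency can only decline smoothly with distance from the source, so a distant group carrying the variant while its neighbors lack it does not fit.

```python
def stepping_stone(source_freq, n_demes, m, generations):
    """Deterministic allele-frequency spread along a line of demes.

    Deme 0 is held at source_freq; each generation every other deme
    exchanges a fraction m of its gene pool with its neighbours.
    All parameters are hypothetical, chosen only for illustration.
    """
    freqs = [source_freq] + [0.0] * (n_demes - 1)
    for _ in range(generations):
        new = freqs[:]
        for i in range(1, n_demes):
            left = freqs[i - 1]
            # The last deme has no right-hand neighbour, so it reflects.
            right = freqs[i + 1] if i + 1 < n_demes else freqs[i]
            new[i] = (1 - m) * freqs[i] + (m / 2) * (left + right)
        freqs = new
    return freqs


def declines_with_distance(freqs, tol=1e-12):
    """True if frequency never rises as we move away from deme 0."""
    return all(a + tol >= b for a, b in zip(freqs, freqs[1:]))


ibd_profile = stepping_stone(source_freq=0.6, n_demes=8, m=0.05, generations=200)

# Hypothetical leapfrog profile: a source, two intervening groups without
# the variant, then a distant group carrying it at ~5%.
leapfrog_profile = [0.6, 0.0, 0.0, 0.05]

print(declines_with_distance(ibd_profile))       # True
print(declines_with_distance(leapfrog_profile))  # False
```

Real populations are messier than a line of demes, of course, but the qualitative point carries: village-to-village diffusion produces a cline, not an island.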

But why the low frequency of the derived variant? Obviously the Munda have admixed with the local substrate, so dilution would be one explanation. Another could be that when the Munda left East Asia the frequency was lower. Additionally, whatever selective forces were driving the frequency up may have abated in South Asia, and it could be that there was selection against the derived variant! Whatever the truth of it, the existence of the derived EDAR variant among the Munda would be like finding the European LCT variant among an East Asian population: clear evidence of long distance gene flow and population movement.
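
The dilution explanation is easy to put rough numbers on. The sketch below is a back-of-the-envelope calculation of my own, assuming a single admixture pulse, no drift, no selection, and a local substrate that lacked the derived variant entirely. The 40% and 60% source frequencies are simply the Khasi and Tibeto-Burman figures quoted above, used as stand-ins; the actual frequency in whatever population the Munda derive from is unknown.

```python
def diluted_frequency(source_freq, eastern_fraction, local_freq=0.0):
    """Expected allele frequency after one admixture pulse (no drift, no selection)."""
    return eastern_fraction * source_freq + (1 - eastern_fraction) * local_freq


def implied_eastern_fraction(observed_freq, source_freq, local_freq=0.0):
    """Eastern ancestry fraction needed for dilution alone to give observed_freq."""
    return (observed_freq - local_freq) / (source_freq - local_freq)


observed = 0.05  # ~5% derived EDAR among the Munda, from the text

# Source frequencies are proxies borrowed from present-day groups (assumption).
proxies = [("Khasi-like source (40%)", 0.40),
           ("Tibeto-Burman-like source (60%)", 0.60)]

for label, source in proxies:
    frac = implied_eastern_fraction(observed, source)
    print(f"{label}: implied eastern fraction = {frac:.3f}")
```

With these stand-in values, dilution alone would require an eastern ancestry fraction of roughly 8% to 13%. If the true eastern fraction is higher than that, then one of the other explanations, a lower starting frequency or selection against the variant, has to be doing some of the work.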

So where does this lead us? First, let me observe that some of the authors on this paper are the same ones who argued for a predominantly indigenous origin for South Asians in the early 2000s based on mtDNA variation. In this paper they seem to be leaning against an indigenous origin for the Munda, or at least refuting the conjecture that the Munda are ur-Indians par excellence. I didn’t go into the details of the coalescence times because they’re rather a mess, but EDAR is probably a “tipping point” in arguing for a relatively recent exogenous origin for the Munda. The strong sex asymmetry in genetic variation is also suggestive; we have plenty of historical examples of genetic leapfrogs occurring through men-on-the-move. The asymmetry also seems to exist among the Khasi and other Tibeto-Burmans in India’s northeast (figure 2 of the paper).

The arguments about the history, culture, and genetics of South Asia have traditionally been disputed along the Aryan-Dravidian axis. I’m not interested in rehashing that aspect, but these data point us to another reality: on India’s northeast frontier there’s another component. As an ethnic Bengali myself I’ve always been somewhat aware of this. Some of my relatives and family acquaintances look much more like Garos than other South Asians. This component is even more evident on the faces of Assamese and Nepalis, whose languages are Indo-Aryan and whose religion is Hinduism, but whose appearance bespeaks a more variegated background. On some level South Asians from these regions are aware of their peculiarity, even if it isn’t spoken of much. I have read that in the wake of the victory of Japan over Russia in the early 20th century Bengali intellectuals expressed in public their pride at their Asiatic ancestry. With the rise of China in the 21st century I suspect more South Asians from Nepal, Bengal, and Assam will rediscover that aspect of their background which links them to the east, and not the west. The genetics is just telling us what we already knew.

Citation: Gyaneshwer Chaubey, Mait Metspalu, Ying Choi, Reedik Mägi, Irene Gallego Romero, Pedro Soares, Mannis van Oven, Doron M. Behar, Siiri Rootsi, Georgi Hudjashov, Chandana Basu Mallick, Monika Karmin, Mari Nelis, Jüri Parik, Alla Goverdhana Reddy, Ene Metspalu, George van Driem, Yali Xue, Chris Tyler-Smith, Kumarasamy Thangaraj, Lalji Singh, Maido Remm, Martin B. Richards, Marta Mirazon Lahr, Manfred Kayser, Richard Villems, & Toomas Kivisild (2010). Population Genetic Structure in Indian Austroasiatic speakers: The Role of Landscape Barriers and Sex-specific Admixture. Mol Biol Evol. DOI: 10.1093/molbev/msq288

Link acknowledgement: Dienekes Pontikos.

Addendum: This is more a speculative comment, so I will tack this on to the body of the main post. Here’s my current very tentative model for how South Asians came to be. At some point after the last Ice Age 10,000 years ago the ANI arrived, and hybridized with the ASI, who are descendants of the older original Out of Africa wave to South Asia. After this, but before the Aryans, the Munda arrived from the northeast, and pushed into lands inhabited by ANI-ASI groups. 4,000-3,000 years ago the Indo-Aryans arrived, and imposed themselves as an elite on the ANI-ASI hybrid population, before being assimilated biologically and imparting their language to the Indian majority. I don’t know where Dravidian came from, but perhaps it was the language of the ANI (its existence in fragments all across the swath of the northern Indian subcontinent is suggestive, as well as possible connections to ancient Elamite, the language of Bronze Age southwest Iran). Eventually the Aryanized ANI-ASI marginalized the Munda in northeast India and drove them to the highlands. Finally, the Tibeto-Burmans arrived in the historical period.

Image Credit: Wikimedia Commons
