July 8, 2011

On the genetic structure of Afro-Indians

The Pith: Afro-Indians are mostly African, with a substantial Indian minority ancestry. The latter is disproportionately female mediated. It also seems that this Indian ancestry is more northwest Indian, and that natural selection has been operating upon them outside of the African environment.

Along the western coast of South Asia, from Makran in southwest Pakistan, down to the Konkan coast of southwest India, there are isolated communities of Afro-Indians. They are called Siddis or Habshi. Their African origin is clear in their physical appearance, as well as aspects of their folk customs which tie them back to Sub-Saharan Africa. Nevertheless, they have assimilated to many Indian cultural traits. They generally speak the local language, and practice Islam, Hinduism, or Roman Catholic Christianity (in that order in proportion).

How and why did the Siddis arrive in India? The earliest date for their arrival almost certainly must be bounded by the period when Indo-Islamic polities rose to prominence in the early second millennium. The cosmopolitan melange of the armies of the Muslim warlords included diverse groups of Africans, some of whom took power, and established their own self-conscious Afro-Indian dynasties, set apart from the Turkish, Afghan, ...

June 24, 2011

Every South Asian “Arab” a descendant of Muhammad!

Y chromosomes of self-identified Syeds from the Indian subcontinent show evidence of elevated Arab ancestry but not of a recent common patrilineal origin:

Several cultural or religious groups claim descent from a common ancestor. The extent to which this claimed ancestry is real or socially constructed can be assessed by means of genetic studies. Syed is a common honorific title given to male Muslims belonging to certain families claiming descent from the Prophet Muhammad through his grandsons Hassan and Hussein, who lived 1,400 years ago and were the sons of the Prophet’s daughter Fatima. If all Syeds really are in direct descent from Hassan and Hussein, we would expect the Y chromosomes of Syeds to be less diverse than those of non-Syeds. Outside the Arab world, we would also expect to find that Syeds share Y chromosomes with Arab populations to a greater extent than they do with their non-Syed geographic neighbours. In this study, we found that the Y chromosomes of self-identified Syeds from India and Pakistan are no less diverse than those non-Syeds from the same regions, suggesting that there is no biological basis to the belief that self-identified Syeds in this part of the world share a recent common ancestry. In addition to Syeds, we also considered members of other hereditary Muslim lineages, which either claim descent from the tribe or family of Muhammad or from the residents of Medinah. Here, we found that these lineages showed greater affinity to geographically distant Arab populations, than to their neighbours from the Indian subcontinent, who do not belong to an Islamic honorific lineage.

The results are pretty simple. First:

1) The Syed lineages don’t exhibit a “Syed modal haplotype.” If the claim of common descent were true, what you should see is a Syed haplotype at ~50%, and then a range of other lineages which introgressed through people lying about their origins or women being unfaithful to their husbands. Instead there is a wide range of haplotypes. Being Syed is an honorific.

2) I don’t think that they really prove higher Arab ancestry as such. They include really diverse populations, from Algerians to Israeli Arabs to Sudanese. The Islamic Honorific Lineages are somewhat closer to these groups, but that could be generic West Asian ancestry. For example, Persian. Or perhaps more African ancestry in cosmopolitan Syed lineages. Or, perhaps Syeds are just former high caste Hindus, who have more West Asian affinities.
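The expectation in point 1 can be made concrete with a minimal sketch. It assumes haplotypes are available as simple string labels and computes two summaries: the share of the most common haplotype, and Nei's unbiased haplotype diversity. The sample labels and counts are invented for illustration; they are not data from the paper, and the paper's own statistics may differ.

```python
from collections import Counter


def modal_frequency(haplotypes):
    """Return (most common haplotype, its share of the sample)."""
    counts = Counter(haplotypes)
    label, n = counts.most_common(1)[0]
    return label, n / len(haplotypes)


def haplotype_diversity(haplotypes):
    """Nei's unbiased haplotype diversity: n/(n-1) * (1 - sum(p_i^2))."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)


# Hypothetical samples (made up for illustration only).
# Founder-like: one lineage dominates, as a shared patriline would predict.
founder_like = ["H1"] * 10 + ["H2"] * 4 + ["H3"] * 3 + ["H4"] * 3
# Diverse: no dominant lineage, closer to the pattern described above.
diverse = ["H1", "H2", "H3", "H4", "H5"] * 4

for name, sample in [("founder-like", founder_like), ("diverse", diverse)]:
    label, freq = modal_frequency(sample)
    print(name, label, round(freq, 2), round(haplotype_diversity(sample), 3))
```

A group with a real common patrilineal founder should look like the first sample: a high modal share and lower diversity than its neighbours.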

Below is the PCA and list of Y chromosomal haplogroups. The paper is free at the link above.

June 16, 2011

Present genetic variation is a weak guide to past genetic variation

I’ve been harping on and on for the past few years that the patterns of contemporary genetic variation are probably only weakly tied to past patterns of genetic variation (though Henry Harpending warned me about this as far back as 2004). A major reason that scholars operated under the presupposition that present variation reflects past variation is the axiom that most of the variation we see around us crystallized during the Last Glacial Maximum (~20 thousand years before the present).

This may be true in some cases, but I doubt it is true in most cases. I was pointed to a classic case of this problem just today. A reader alerted me to a short paper from this spring which attempts to ascertain the point of origin of M31a1, the dominant mtDNA haplogroup among the Onge tribe of the Andaman Islands. This is an interesting issue because some researchers have plausibly proposed in the past that these indigenous people of the Andaman Islands represent the descendants of the first wave “Out of Africa,” who took the rapid “beachcomber” path. Understanding their genetics may then unlock the key to the “Out of Africa” event. Or ...

May 27, 2011

Indian genetics in the Indian media

Most of you may know that the Times of India is working on a story about Zack’s Harappa Ancestry Project. Ruchira Paul points me to this story in OPEN about Indian genetics, The Story of Our Origins. They’re using the kit from National Geographic’s Genographic Project, which analyses paternal and maternal lineages. In other words, your father’s-father’s-father’s…father, and your mother’s-mother’s-mother’s…mother. This is a thin slice, but can be informative, and is useful for certain analyses such as molecular clock based time depths. But it is not the total genome, and so gives us only a small window into the past. In contrast, Zack uses tens of thousands to hundreds of thousands of markers in his analyses. He may lack some of the specific analytic power in terms of temporal calibration, but he makes up for it with a much richer and more robust understanding of how Indian populations relate.
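To see just how thin the uniparental slice is, here is a small arithmetic sketch. It counts genealogical ancestor "slots" per generation (2 to the power g) and notes that the paternal and maternal lineages together pass through at most two of them (two for a male, one for a female, who carries no Y chromosome). The function name is hypothetical, and the count ignores pedigree collapse, where the same person fills several slots.

```python
def uniparental_share(generations_back):
    """Fraction of genealogical ancestor slots at a given generation that the
    Y-chromosome and mtDNA lineages together trace (at most two slots)."""
    if generations_back < 1:
        raise ValueError("generations_back must be at least 1")
    slots = 2 ** generations_back   # ancestor slots in that generation
    traced = 2                      # one patrilineal + one matrilineal ancestor
    return traced / slots


for g in (1, 2, 5, 10):
    print(g, 2 ** g, uniparental_share(g))
```

Ten generations back the two lineages cover 2 of 1,024 slots, which is why genome-wide marker sets say so much more about overall ancestry.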

The story in OPEN tells the “orthodox” model inferred from uniparental lineages circa 2005. I think it’s somewhat outdated, in large part due to thick marker set analyses of the genome, but this part was pretty amusing:


This brings us to perhaps the most contentious of markers, Hartosh’s R1a1. The NGP states: ‘Some linguists believe that the Kurgans, nomadic horsemen roaming the steppes of southern Russia and the Ukraine, were the first to speak and spread a Proto-Indo-European language, some 5,000 to 10,000 years ago. Genetic data and the distribution of Indo-European speakers suggest the Kurgans … may have been descendents of M17 (the genetic marker that identifies the R1a1 haplogroup). Today a large concentration—around 40 per cent—of the men living from the Czech Republic across the steppes to Siberia, and south throughout Central Asia are descendants of this clan. In India, around 35 per cent of the men in Hindi-speaking populations carry the M17 marker, whereas the frequency in neighboring communities of Dravidian speakers is only about ten percent. This distribution adds weight to linguistic and archaeological evidence suggesting that a large migration from the Asian steppes into India occurred within the last 10,000 years.’

This NGP claim goes far beyond what the genetic data warrants. Says Bamezai, after looking through the NGP results published in this article, “For me as a scientist, it is necessary to be very conservative in my claims. Any broad conclusions require much more work and detailed study of not just haplogroups, but sub-haplogroups. I think the migration paths described in these cases are in question. I feel R1a1 originated here and contributed to Central Asia rather than the other way around.”

Full disclosure, both Zack and I are carriers of the R1a1 lineage.

Notice how instead of retreating to agnosticism, as his conservatism might entail, the Indian scientist offers his own contentious R1a1-Out-of-India story! Of course you knew that Indo-European migrations had to be a myth for an Indian publication. This is rather how American popular science always has to prove that race has no biological basis. As it is, I think Bamezai is probably close to the truth. I suspect most of the Indian R1a1 lineages are from a different and older diversification of this haplogroup than the northern Eurasian ones, which tend to be characterized by a set of common markers which Indians often lack. But we’ll see.

May 22, 2011

Proper methods and false results

The Pith: Honorable intent and punctilious adherence to proper form and method do not guarantee a set of results which flesh out a genuine phenomenon. Much of science is tragic.

Most of the time I point to and review papers on this weblog which excite me. But in the interests of “balance” and dampening the bias toward material I find interesting and salient I thought it would be interesting to look at a paper which I thought wasn’t too interesting. It’s in the Journal of Human Genetics, part of the Nature Publishing Group empire. Also, it is open access, so you can read it yourself and make your own individual judgments.

The Soliga, an isolated tribe from Southern India: genetic diversity and phylogenetic affinities:

India’s role in the dispersal of modern humans can be explored by investigating its oldest inhabitants: the tribal people. The Soliga people of the Biligiri Rangana Hills, a tribal community in Southern India, could be among the country’s first settlers. This forest-bound, Dravidian speaking group, lives isolated, practicing subsistence-level agriculture under primitive conditions. The aim of this study is to examine the phylogenetic relationships of the Soligas in relation to 29 worldwide, geographically targeted, ...

April 21, 2011

Visualization of genetic distances, part n

Zack Ajmal has been taking his Reference 3 data set for a stroll over at the Harappa Ancestry Project. Or, more accurately, he’s been driving his computer to crunch up ADMIXTURE results ascending up a ladder of K’s. Because it is the Harappa Ancestry Project, Zack’s populations are overloaded a touch on South Asians. He managed to get a hold of the data set from Reconstructing Indian History. If you will recall, this paper showed that the South Asian component which falls out of ancestry structure inference algorithms may actually be a stabilized hybrid of two ancient populations, “Ancestral North Indian” (ANI) and “Ancestral South Indian” (ASI). ANI are a population which can be compared pretty easily to other West Eurasians. There are no “pure” groups of ASI, but the indigenous peoples of the Andaman Islands are the closest, having diverged from the mainland ASI populations tens of thousands of years ago.
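For readers unfamiliar with what "K inferred ancestral populations" means in practice, here is a minimal sketch of the mixture model that ADMIXTURE-style programs fit. It is not ADMIXTURE's code or interface: the function names and all numbers are assumptions for illustration. Each individual has ancestry fractions q over K components, each component has its own allele frequency p at a SNP, and the individual's expected allele frequency is the q-weighted average.

```python
def expected_allele_frequency(q, p):
    """Mixture allele frequency for one individual at one SNP.

    q: ancestry fractions over K ancestral populations (should sum to 1)
    p: allele frequency of the same SNP in each ancestral population
    """
    if len(q) != len(p):
        raise ValueError("q and p must have the same length")
    if abs(sum(q) - 1.0) > 1e-9:
        raise ValueError("ancestry fractions must sum to 1")
    return sum(qk * pk for qk, pk in zip(q, p))


def genotype_probabilities(f):
    """Probabilities of carrying 0, 1 or 2 copies of the allele, assuming the
    two copies are independent draws with frequency f."""
    return ((1 - f) ** 2, 2 * f * (1 - f), f ** 2)


# Hypothetical numbers for a two-component (K = 2) case.
q = [0.6, 0.4]   # 60% from component A, 40% from component B
p = [0.9, 0.1]   # made-up allele frequencies in A and B
f = expected_allele_frequency(q, p)
print(f, genotype_probabilities(f))
```

Fitting the model means searching for the q and p values that make the observed genotypes most likely across many SNPs; raising K simply adds more components to the mixture.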

At K = 11, that is, 11 inferred ancestral populations, Zack seems to have now stumbled onto the patterns which one would expect from this hybrid model of South Asians. Let me quote him:

Now let’s take all the reference populations with an Onge component between 10% ...

March 28, 2011

Genetics as the myth buster: Indian edition

Whenever Zack Ajmal posts a new update to the Harappa Ancestry Project he appends some data to his ethnic database. This sends me to Wikipedia, because how many people are supposed to know what a “Muslim Rawther” means? Well, if you are a Muslim Rawther, and perhaps from Southern India, you would. But South Asian ethno-linguistic categories and hierarchies are notoriously Byzantine, and I have difficulty making sense of them. This isn’t too surprising in my case, as my family’s background is relatively mixed in the very recent past (e.g., Hindus and Muslims, and people of various caste backgrounds), so we’re not the sort who can go on at length about our pure ancestry and all that stuff. Unfortunately, Wikipedia isn’t always useful, because the people editing the entries on particular South Asian ethnic groups are often people from those ethnic groups, so you get a lot of extraneous information, and a particular slant on how awesome and high achieving the group is (also, sometimes there’s funny stuff about how notoriously good looking that particular caste is!). On occasion there are other sources which are informative. For example, Zack has several individuals from the Tamil Nadar caste. I know ...

December 20, 2010

Some of the Indo-Europeans found?

School girls in Hunza, Pakistan

A few days ago I observed that pseudonymous blogger Dienekes Pontikos seemed intent on throwing as much data and interpretation into the public domain via his Dodecad Ancestry Project as possible. What are the long term implications of this? I know that Dienekes has been cited in the academic literature, but it seems more plausible that this sort of project will simply distort the nature of academic investigation. Distort has negative connotations, but it need not be deleterious at all. Academic institutions have legal constraints on what data they can use and how they can use it (see why Genomes Unzipped started). Not so with Dienekes’ project. He began soliciting for data ~2 months ago, and Dodecad has already yielded a rich set of results (granted, it would not be possible without academically funded public domain software, such as ADMIXTURE). Even if researchers don’t cite his results (and no doubt some will), he’s reshaping the broader framework. In other words, he’s implicitly updating everyone’s priors. Sometimes it isn’t even a matter of new information, as much as putting a spotlight on information which was already there. Below is a slice of a bar plot from Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. It uses STRUCTURE with K = 7. To the right of the STRUCTURE slice are two plots of individual data on French and French Basque from the same HGDP data set using ADMIXTURE at K = 10 from Dodecad.


Repeated runs and higher K’s make it clear that the French Basque lack a “West Asian” aspect which other French, and Iberians as well, have. Some of this is clear in the paper I referenced above as well…the key is you have to look at the supplements at K = 6. Because the Basque are the only native non-Indo-European speakers in Western Europe, their origin and relationship to nearby populations has always been of interest (they also have the highest Rh- frequency of world populations). Granted, the French Basque are very similar genetically to the French as a whole. But, it is obviously highly informative that they lack an ancestral component in totality which seems to exist at low but consistent levels across Western European populations. The only other European population at K = 15 who lack the West Asian component in totality are Finns (the Lithuanians come very close).

This is all preamble to a discussion of a post Dienekes put up today, A solution to the problem of Indo-Aryan origins. Remember that Dienekes has been “playing” with ADMIXTURE for only a few months. To claim to have found a ‘solution’ to a problem as intellectually and politically intractable and explosive as this is rather bold. The crux of the matter is that at certain confluences of K’s and population sets Dienekes has discovered a distinctive signature of ancestry which seems to be modal on the north slope of the Caucasus, and spans India and Europe. He terms this “Dagestani,” due to the fact that among a population sample from this province in Russia this ancestral component is overwhelmingly dominant. The patterns of Dagestani admixture in Europe and India are curious and suggestive.

1 – In Europe the frequencies are low, but irregularly distributed (excepting around the North Caucasus). Scandinavians and British have appreciable fractions, Finns and Southern Europeans do not. Here’s Dienekes:

Interpreting this pattern is not easy, but it does seem that this component seems to have a V-like distribution, achieving its maximum in Caucasus and its environs, then undergoing a diminution, and achieving a secondary (lower) frequency mode in NW Europe.

The surprising appearance of the homonymous Dagestan component in India suggests a widespread presence of a common ancestry element. The West Asian element, by comparison seems to have a more normal /\-like distribution around its center in Anatolia-Caucasus-Iran region. It does reach the Atlantic coast, but is lacking in Scandinavia and Finland, and also in India itself.

2 – South Indian Brahmins have appreciable fractions, but non-Brahmins in the same region do not. In contrast, those who come from Indo-Aryan speaking backgrounds do seem to have Dagestani ancestral components, irrespective of other aspects of ancestry. For example Pakistanis don’t have that much more Dagestani than South Indian Brahmins or Gujaratis. Also compare the relatively narrow window of Dagestani ancestry variance among Dodecad South Asians (I’m DOD075). DOD088 is, from what I recall, a Reddy from Andhra Pradesh, a non-Brahmin but non-low caste. It is interesting that they have a high proportion of “Pakistan,” but no Dagestani. I have ~10% Dagestani, but no Pakistani.

Below is K = 10 for a selection of populations. Dienekes has now included two non-Indo-European speaking Pakistani populations: the Brahui (Dravidian) and Burusho (linguistic isolate in the mountains of Pakistan):

Some general patterns are evident. The light blue is indicative of generic “Indian” ancestry. It is not found in appreciable proportions outside of subcontinental populations (or those of recent subcontinental origin). The same with the red, and light orange. For your reference the dark orange is a “Northern European” component, modal in Lithuania. The light and dark green are both East Asian components. The dark blue is a “West Asian” component modal in Georgia, and prominent across Europe, declining as a function of distance from the eastern shore of the Black Sea (this is surely the West Asian which distinguishes the French from the French Basque). I believe that the light purple dominant in the Brahui and the light red dominant in the Burusho probably form as a compound the aforementioned Pakistani component. The dark purple is the Dagestani.

First, a word on the Brahui. These are a group of tribes who reside in northern Balochistan in Pakistan. A small number are even to be found in Afghanistan. Historically they have had close relations with the Baloch, an Iranian speaking cluster of tribes who totally envelop the Brahui. The Brahui do speak a Dravidian language, of a family dominant in South India and found in isolated regions of Central and Eastern India. There are two broad models for the existence of a Dravidian language in Pakistan. The first is that the Brahui are remnants of more widely spoken Dravidian languages which date back to the Indus Valley civilization. The second is that the Brahui arrived during the medieval period from another region of South Asia where Dravidian languages were more common. Assuming either model, it has long been presumed that their involution by the Baloch has had a strong impact on the Brahui genetically; the two groups are very close. This is evident in Dienekes’ results as well. But observe that the Baloch are the group which seems more cosmopolitan in ancestry than the Brahui. If the Brahui were Dravidians from deep in India it seems that they would have a greater residual component of India-specific ancestry (light blue and orange). This is not so. In fact the Baloch have more of the Indian ancestral component than the Brahui. The Brahui component is found across Pakistan, and into India, albeit at lower proportions. Naturally, the Baloch have the second highest fraction. I believe these results should shift us toward the position that the Brahui are indigenous in relation to the Baloch, and that the Baloch ethnic identity emerged through the shift of a Brahui substrate, as evidenced by the greater cosmopolitanism of the Baloch. Additionally, Dienekes observes that the Brahui have a lower proportion of the Dagestani component than most other Pakistani groups, and several Indo-Aryan groups in India proper.

The Burusho are even more interesting than the Brahui. Unlike the Brahui the Burusho are very isolated in the mountainous fastness of Baltistan in northern Pakistan. Additionally, their language, Burushaski, is a linguistic isolate. Others of the class are Basque and Sumerian. In general it is assumed that linguistic isolates were once part of broader families of languages which have gone extinct. Burushaski probably persists in large part because of the geography which its speakers inhabit. Mountainous areas often preserve ethnic and linguistic diversity because the terrain allows for the persistence of local variety. I believe it is plausible that the Burusho have been far more isolated than the Brahui. This seems to show up in the ADMIXTURE plot: the Burusho have a greater proportion of their modal ancestral component than the Brahui. Additionally, the Burusho have an even smaller component of Dagestani than the Brahui.

Below is a chart Dienekes constructed ordered by proportion of Dagestani for his South Asian populations. Next to it I’ve placed a chart from a PCA which has some of the same population samples. Compare & contrast:


The PCA is looking at between-population variation in totality. So naturally the Dagestani component isn’t going to be predictive of that. Rather, it speaks to the possibility which Dienekes is mooting: that the Dagestani component spread in the Indian subcontinent with the Indo-Aryans specifically, overlying the local resident substrate. In South India this meant that Brahmins brought this, mixing with the indigenous Dravidian population. In Pakistan the Indo-Aryans, and Iranians, were overlain on a substrate which were the ancestors of the Burusho and Brahui. The dominant signal of genetic relationship has to do with the substrate, not the Indo-Aryans. So that’s what’s going to show up on the PCA. In other PCA plots the model where South Indian Brahmins are a linear combination of a Pakistani-like population and a Dravidian population becomes clearer. But when you look at ancestry using something like ADMIXTURE you have the potential to tease apart different components, and so uncover relationships which may have been obscured when looking at aggregate variation.
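The "linear combination" idea can be sketched in a few lines. Assuming one had allele frequencies for a target population and two candidate source populations over the same SNPs, a least-squares estimate of the mixing fraction falls out directly. This is an illustrative toy with made-up frequencies and a hypothetical function name; it ignores drift and sampling noise, and it is not what PCA or ADMIXTURE actually compute.

```python
def mixing_fraction(target, source_a, source_b):
    """Least-squares alpha such that
    target ~ alpha * source_a + (1 - alpha) * source_b,
    where each argument lists allele frequencies over the same SNPs."""
    if not (len(target) == len(source_a) == len(source_b)):
        raise ValueError("all frequency lists must have the same length")
    num = sum((t - b) * (a - b) for t, a, b in zip(target, source_a, source_b))
    den = sum((a - b) ** 2 for a, b in zip(source_a, source_b))
    if den == 0:
        raise ValueError("source populations are identical at these SNPs")
    return num / den


# Made-up allele frequencies at four SNPs for two source populations.
source_a = [0.80, 0.10, 0.55, 0.30]
source_b = [0.20, 0.70, 0.35, 0.90]
# Construct a target that is exactly 30% A and 70% B.
target = [0.3 * a + 0.7 * b for a, b in zip(source_a, source_b)]

print(round(mixing_fraction(target, source_a, source_b), 3))
```

On a PCA plot the same relationship shows up geometrically: a population that is a mixture of two others tends to fall along the line between them.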

Dienekes’ model seems to posit three steps in rapid succession ~4,000 years ago. A background variable which must be mentioned is that one must account for the Mitanni, a dominant Syrian power circa 1500 BC where a non-Indo-European language was the lingua franca, and yet a definite Indo-Aryan element existed within the elite. Indo-Aryan specifically because the Indo-European element within the Mitanni was not Iranian, but specifically Indo-Aryan. An easy explanation for this is that the Indo-Aryan component of the Indo-Iranian branch of the Indo-European languages crystallized outside South Asia, and independently reached Syria and India. In Syria it went extinct, while in India it obviously did not. By Dienekes’ model the Mitanni would be rather closer to the urheimat of the Indo-Aryans.

An aspect of his model which I do not understand is why it has to be Indo-Aryan, instead of Indo-Iranian. The South Asian population in which the Dagestani component is modal, the Pathans, are Iranian, not Indo-Aryan. Additionally, this model does not seem to speak in detail to the existence of the Dagestani element among Europeans. Here is a sorting of European populations (with Iranians included) by the Dagestani component:

Population Dagestan (%)
Urkarah 93
Lezgins 47.9
Stalskoe 38.7
Adygei 16.4
Orcadian (Orkney) 12.6
Georgians 12.4
White_Utahns 11.2
Iranian 10.9
Scandinavian_D 10.2
Armenian_D 9.9
German_D 9.1
Turks 8.8
Armenians 8.4
French 7.9
Hungarians 7.5
Russian_D 6.3
Spanish_D 4.6
North_Italian 4.5
Spaniards 4.4
Romanian 4.1
Finnish_D 4.1
Russian 4
Greek_D 3.8
Portuguese_D 3.6
Tuscan 3.5
Tuscans 3.4
Lithuanians 2.9
S_Italian_Sicilian_D 2.8
Belorussian 2.5
Cypriots 2
Sardinian 1.5
French_Basque 0.7

There is here a strange pattern of rapid drop off from the Caucasus, and a bounce back very far away, on the margins of Germanic Northwestern Europe. This to me indicates some sort of leapfrog dynamic. A well known illustration of this would be the Ugric languages. The existence of Hungarian in what was Roman Pannonia is a function of the mobility and power of Magyar horsemen, and their cultural domination over the Romance and Slavic speaking peasantry (their genetic impact seems to have been slight). No one believes that Germanic languages are closely related to Indo-Aryan (rather, if there is structure in Indo-European beyond Indo-Iranian, Celtic, etc., it would place the Indo-Iranian languages with Slavic). So what’s going on? I think perhaps the Dagestani component is in part a reflection of the common Indo-European origin in that region. For whatever reason that signal is diminished in much of the rest of Europe. Perhaps Southern Europe was much more densely populated when the Indo-Europeans arrived. Additionally, it seems highly likely that in places like Sardinia, much of Spain, and Cyprus, Indo-European speech came through cultural diffusion (elite emulation) and not population movement. Or perhaps we’re seeing the vague shadows of population admixtures on the Pontic steppe, where distinct Germanic and Indo-Iranian confederations admixed with a common North Caucasian substrate.
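The drop-off and bounce-back pattern is easy to check against the table itself. The sketch below copies the table's values (read as percentages) and pulls out the populations at or above 10% and those below 3%. The thresholds and helper names are my own choices for illustration, not anything from Dienekes' analysis.

```python
# Dagestani component by population, copied from the table above.
DAGESTANI = [
    ("Urkarah", 93.0), ("Lezgins", 47.9), ("Stalskoe", 38.7),
    ("Adygei", 16.4), ("Orcadian (Orkney)", 12.6), ("Georgians", 12.4),
    ("White_Utahns", 11.2), ("Iranian", 10.9), ("Scandinavian_D", 10.2),
    ("Armenian_D", 9.9), ("German_D", 9.1), ("Turks", 8.8),
    ("Armenians", 8.4), ("French", 7.9), ("Hungarians", 7.5),
    ("Russian_D", 6.3), ("Spanish_D", 4.6), ("North_Italian", 4.5),
    ("Spaniards", 4.4), ("Romanian", 4.1), ("Finnish_D", 4.1),
    ("Russian", 4.0), ("Greek_D", 3.8), ("Portuguese_D", 3.6),
    ("Tuscan", 3.5), ("Tuscans", 3.4), ("Lithuanians", 2.9),
    ("S_Italian_Sicilian_D", 2.8), ("Belorussian", 2.5),
    ("Cypriots", 2.0), ("Sardinian", 1.5), ("French_Basque", 0.7),
]


def at_least(rows, threshold):
    """Names of populations whose component is >= threshold."""
    return [name for name, value in rows if value >= threshold]


def below(rows, threshold):
    """Names of populations whose component is < threshold."""
    return [name for name, value in rows if value < threshold]


print("10% or more:", at_least(DAGESTANI, 10))
print("under 3%:", below(DAGESTANI, 3))
```

The upper group mixes Caucasus samples with Orcadians, Utah whites and Scandinavians, while the lower group includes both Mediterranean and Baltic populations, which is the irregular distribution discussed above.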

Going back to India, let’s revisit the model of a two-way admixture between “Ancestral North Indians,” who were genetically similar to Europeans and West Asians, and “Ancestral South Indians,” who were closer to, but not very close to, East Eurasians. The ANI & ASI. The ASI were probably one of the ancient populations along the fringe of southern Eurasia, all of whom have been submerged by demographic movements from other parts of Eurasia over the past 10,000 years, excepting a few groups such as the Andaman Islanders and some Southeast Asian tribes. The model was admittedly a simplification. But taking that model as a given, and accepting that the Dagestani element is indeed Indo-Aryan, we can infer that the ANI were not Indo-European. It is notable that the South Indian Brahmins have elevated fractions of both the Brahui and Burusho modal components. This is probably indicative of admixture of the Indo-Aryan element in the Indus Valley, prior to their expansion to other parts of India. I assume one of the languages spoken was Dravidian, though if ancient Mesopotamia was linguistically polyglot at the dawn of history I would not be surprised if the much more geographically extensive Indus Valley civilization was as well.

Aishwarya Rai

The irony is that today when someone refers to a “Dravidian” physical type, they’re not talking about someone who looks like a Pakistani. They’re talking about someone who looks South Indian, where most Dravidian languages are spoken. But combining the inference from Dienekes’ model and the previous two-way admixture model, you reach the conclusion that lighter skin and more West Asian features among South Asians may be more due to Dravidian-speaking ancestors in the Indus Valley, not Indo-Aryans! It goes to show the wisdom of differentiating linguistic classes from biological ones when discussing historical population genetics. Unfortunately, it is a wisdom that most of us interested in these topics do not show, alas.

As I like to say, interesting times….

Note: If you leave a comment, please don’t be smarter-than-thou in your tone. I have stopped publishing those sorts of comments because the reality is that most of them have not been that smart or informed. At least by my estimation. If you actually are smarter than the average bear, and impress me with your erudition and analytical clarity, I’ll probably let your comment through no matter your attitude. But I wouldn’t bet on it if I were you, so show some class and humility. Most of us are muddling through.

December 17, 2010

South Asians too are sons of the farmers?

I mentioned a few days ago that a friend was trying to get together some data to analyze the genetic variation of South Asians. By a strange coincidence Dienekes just published a more detailed analysis of South Asians…and uncovered something very interesting, though not that surprising. Some technical preliminaries:

A note of caution: The reduced marker set (~30k) means that a lot of noise is added in the admixture estimates. In particular, many individuals are likely to get low-level admixture from population sources that can be attributed to noise. But, as we will see, the small marker set does not really affect either the power of the GALORE approach, or of ADMIXTURE to infer meaningful clusters.

In addition to the various online sources of public data Dienekes got about a dozen South Asians. I was one of those South Asians, DOD075. In many ways I’m a rather standard issue South Asian, similar to Gujaratis, except that I have a substantial ‘East Asian’ component. More concretely, between 1/6 and 1/7 of my ancestry seems to be of eastern origin, far higher than the norm among South Asians. The rest of my ancestry was mostly South Asian specific, with a minor, but significant ‘West Asian’ component common across northern India.

Rerunning with more data and different samples, Dienekes came out with a different set of ancestral components. Of particular interest to me, he broke down the East Asian between East Asian proper and Southeast Asian. Below is a selection of populations with ancestral components + me. I’ve also renamed a few components. North Kannadi = Dravidian and Irula = Indian tribal. Indian = Generic Indian. Looking at the Fst it seems that Indian endogamy and population bottlenecks have had an effect…look at the North Kannadi distance from everyone else.
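For anyone who has not met Fst before, here is a minimal single-locus sketch using the textbook definition, Fst = (H_T - H_S) / H_T, for two populations of assumed equal size. It is only meant to show why drift in a small endogamous group inflates the statistic; it is not the estimator behind the values Dienekes reports, and the example frequencies are invented.

```python
def fst_two_populations(p1, p2):
    """Single-locus Fst for two equally weighted populations.

    p1, p2: frequency of one allele at a biallelic locus in each population.
    H_S is the mean within-population expected heterozygosity, H_T the
    expected heterozygosity of the pooled population.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # locus is fixed for the same allele in both populations
    return (h_t - h_s) / h_t


# Invented frequencies: similar populations versus a strongly drifted one.
print(fst_two_populations(0.50, 0.50))
print(fst_two_populations(0.60, 0.40))
print(fst_two_populations(0.95, 0.30))
```

A bottlenecked, endogamous group drifts toward extreme allele frequencies at many loci, so its Fst against every other population rises even without any distinct ancestry.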


Remember that in the previous analysis I was very similar to a Gujarati, except with an East Asian element. My supposition that my ancestry has some connection to Burma seems to be supported by these results. Looking at my balanced ratio between East Asian and Southeast Asian, that is what one might expect from someone of a Burman ethnicity. I am not saying that I have recent Burman ancestry per se. Rather, Ahom, Mizo, Chakma, and a range of tribal populations from the liminal zone between South and Southeast Asia may suffice. The main other option is that I have a great deal of Munda ancestry. Not implausible in light of the likelihood that Munda brought rice agriculture to northeast South Asia, and pre-date Indo-Aryans, and possibly Dravidians, in Bengal. How would I distinguish these possibilities? I’ve ordered 23andMe kits for both my parents. The most likely candidate for recent Southeast Asian ancestry is my paternal grandfather. If the admixture event was recent, if I have a recent ancestor(s) of “hill tribe” origin, I would expect to see more linked regions of East/Southeast Asian origin than if the admixture was ancient (and so distributed more equitably across DNA strands due to recombination).
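The logic of recent versus ancient admixture can be put in rough numbers. A common back-of-envelope approximation is that recombination breaks up an ancestry segment about once per Morgan per generation, so the mean unbroken segment is on the order of 100/g centimorgans after g generations. The sketch below uses only that approximation; it ignores admixture proportion, chromosome ends and continuing gene flow, and the function name is hypothetical.

```python
def mean_tract_length_cm(generations_since_admixture):
    """Rough expected length, in centimorgans, of an unbroken ancestry
    segment, assuming about one crossover per Morgan per generation."""
    if generations_since_admixture < 1:
        raise ValueError("generations_since_admixture must be at least 1")
    return 100.0 / generations_since_admixture


# A great-great-grandparent (4 generations) versus admixture 100 generations ago.
for g in (2, 4, 20, 100):
    print(g, mean_tract_length_cm(g))
```

Under this rough model a contribution from a few generations back should appear as a handful of long blocks, while the same total fraction from an ancient event would be scattered in many short pieces.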

But the bigger point of Dienekes’ post is what he terms “Dagestani” ancestry across much of Eurasia. I’ll quote him:

The most exciting thing, however, is the fact that the origins of a part of the West Asian component of my previous analyses can be partially located: it is the purple component centered in Dagestan, i.e., among Northeast Caucasian speakers such as Lezgins, and the Dargins who inhabit Urkarah.

Readers of this blog may remember the surprising appearance of this Lezgin-specific component in the Balkans (but not Greeks) a few weeks ago. Now it has turned up as a substantial component in India as well.

Back then, I speculated that this component may derive from a prehistoric population that was spread in (but not limited to) the northern arc of the Black Sea from the Balkans to the Caucasus. Even in this analysis, you can see that both Romanians and Hungarians have some of it, and so do Lithuanians and Belorussians, while Tuscans (like the Greeks of my previous experiment) do not.

Hence, this component stretches from at least the Baltic to India, but is largely absent in southern Europe. I will go out on a limb and propose that this component is representative of a non-Indo-European component in the ancestors of the Indo-Iranians.

Paul Conroy observes that on this finer-grained analysis I don’t have any “West Asian” at all. What had previously been West Asian turns out to have been, in my case, a compound of Dagestani + European. I can’t say that I’m that surprised by this. Years ago I noticed that HGDP STRUCTURE analyses were always giving suggestive signs of a connection between West-Central Eurasia and South Asia.

Who were the Indo-Iranians? I lean toward the proposition that they do derive from the Andronovo culture of the Eurasian steppe. This would date the entrance and expansion of Indo-Aryans in northern India to 3-4,000 years ago. I also contend that the dominant element of ancestry among modern South Asians is not Indo-Aryan. Rather, it is an ancient stabilized hybrid of pre-agricultural societies in the Indus valley and Neolithic farmers who originated from what is today western Iran and eastern Anatolia. Therefore, I posit that the “Aryanization” of the Indian subcontinent is properly modeled as the same process which led to the emergence of an Anatolian and Rumelian Turkish identity: a small elite population which forces an identity shift among the majority.

Back to farming:

As I’ve remarked in the past, Eurasia can be broadly seen as the playground of three major groups of people: the Caucasoids of the West, the Mongoloids of the East, and a southern group of people which is most strongly represented in South Asia, but whose presence can be detected in Southeast Asia as well, although in the latter case it has been marginalized and/or absorbed by the arrival of Mongoloids.

This southern group of people has sometimes been called “Australoid” because of its perceived resemblance to Australo-Melanesians. Indeed, in my K=5 mega-analysis an affinity between Papuans/Melanesians and people of South and Southeast Asia is apparent. These “Australoids” are very old populations, probably stemming from the early Out-of-Africa coastal dispersal route, and we shouldn’t be tricked by their phenotypic similarity into thinking that different groups of them are particularly close genetically. Just as “black Africans” are not the same, neither are the “Australoids” and mixed-”Australoids” at the shores of the Indian Ocean.

It is probably the invention of agriculture that is responsible for their marginalization. In Africa, the Pygmies and Bushmen have been absorbed or pushed aside by the demographic Bantu juggernaut, with a few other language groups also hitching a ride on the agriculture/pastoralism economy. In West Eurasia, where agriculture was invented earliest, pre-agricultural populations left no traces. In East Eurasia, the agriculturalists could not expand to the far north where many relic populations exist, but they could (and did) move to the south where they assimilated or drove away pre-existing populations, leaving a few of them, like the Taiwanese Atayal, as partial remnants of the older population stratum.

The Irula are South Indian tribals, so they are the closest one can get to South Asian autochthons, and yet even they presumably have a large minor component of “Ancestral North Indian.” The tribal groups in Reconstructing Indian Population History all exhibited proportions on the order of ~40% ANI. It seems that agriculture “stalled” in the Indus valley and the highlands to the west for thousands of years in South Asia. During this period of stalling I believe that the farmers absorbed a great deal of genetic material from the indigenous hunter-gatherers, and so produced a “distinctive” Indian genetic profile. More West Eurasian than not, but with a very large dollop of the ancient substrate of southern Eurasia which had a distant, but closer, affinity with that of East Asia. Once social and cultural forces allowed for the rapid expansion of farmers there was a wave of advance from the Indus valley east and south. In the east the proto-Indians would have encountered Mundari-speaking groups drifting who practiced rice agriculture, which they also adopted. In the south the proto-Indians would have encountered more hunter-gatherers. Many of the tribal people in India are today facultative hunter-gatherers, herders, and extensive farmers. I believe that these marginal proto-Indian groups assimilated hunter-gatherers more easily than would have otherwise been the case because some of the proto-Indians reverted to a hunter-gatherer lifestyle in the agriculturally unsuitable highlands of the Deccan and Chota Nagpur. The social boundaries in the uplands of South India were such that the line between hunter-gatherer and farmer was more fluid than elsewhere, explaining the former’s greater genetic impact through intermarriage and assimilation.

This sort of general dynamic probably applies to Indo-Europeans. There is no reason why the original Indo-European tribes could not have been compounds who picked up different ancestral components in their peregrinations. Compare the various Turkic people, Anatolian Turks, Chuvash, and Yakut. All of them have affinities with nearby peoples, despite having a common Turkic culture and genetic component. One notable trend in Europe is that while the French have a minor, but significant West Asian component, the Basque have none of it. Dienekes’ sample is small, but it looks as if Scandinavians have more of this than the Finns. This West Asian component may not have been the dominant one among the Indo-Europeans, but I suspect it was a significant one. If the original speakers of proto-Indo-European did not have it, they likely absorbed it early on, just as the West Asians absorbed a native South Asian element in the Indus valley.

Finally, as a general rule of thumb, I would now suggest that the primary way in which hunter-gatherer genes can persist is through an ecological stall on the part of farmers. During the stall gene flow naturally occurs, probably through exchange of females (coercive or not), or the integration of hunter-gatherer males into war-bands or as slaves. Over time the farmers on the frontier have changed genetically, so that when they start expanding rapidly due to a technological or cultural innovation, they share more with the hunter-gatherers whom they supersede than they otherwise would have.
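To make the “stall” argument concrete, here is a minimal sketch of how a small, steady trickle of hunter-gatherer gene flow adds up while farmers sit on a frontier. It assumes one-way gene flow at a constant rate `m` per generation into the farmer population, with no selection and no back-migration. The rate and the duration are hypothetical numbers chosen for illustration, not estimates from any dataset.

```python
def hunter_gatherer_fraction(m, generations):
    """Fraction of frontier-farmer ancestry derived from hunter-gatherers.

    Assumes a constant one-way gene flow rate `m` per generation, so the
    original farmer ancestry decays as (1 - m) ** generations.
    """
    if not 0.0 <= m <= 1.0:
        raise ValueError("m must be between 0 and 1")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return 1.0 - (1.0 - m) ** generations


# Hypothetical: 0.5% gene flow per generation over a 100-generation stall
# (about 2,500 years at an assumed 25 years per generation).
fraction = hunter_gatherer_fraction(0.005, 100)
print(f"Hunter-gatherer ancestry after the stall: {fraction:.1%}")
```

With these made-up inputs the frontier farmers end up with roughly 39% hunter-gatherer ancestry, which shows how a long stall with modest intermarriage could yield the kind of large minority component discussed above without any single dramatic admixture event.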

October 28, 2010

Sons of the conquerors: the story of India?


The past ten years have obviously been very active in the area of human genomics, but in the domain of South Asian genetic relationships in a worldwide context they have seen veritable revolutions and counter-revolutions. The final outlines are still to be determined. In the mid-1990s the conventional wisdom was that South Asians were a branch of a broader West Eurasian cluster of peoples, albeit more distant from the core Middle Eastern-North-African-European-Caucasian clade. The older physical anthropological literature would have asserted that South Asians were predominantly Caucasoid, but with an Australoid element admixed in at varying proportions as a function of geography and caste. To put it more concretely, and I think accurately, a large degree of South Asian physical variety can be defined along the spectrum between A. R. Rahman and Nawaz Sharif. The regional and caste truisms are only correlations. Subrahmanyan Chandrasekhar was a Tamil Brahmin, but experienced anti-black racism in the United States. I think that is reasonable in light of his appearance.

This rough & ready mainstream understanding, supported by classical genetic markers, was overturned in the early years of the 21st century. One line of thought argued that South Asians were much more distinctive from the broader Western Eurasian cluster of peoples. Representative of this body of work is a paper like The genetic heritage of the earliest settlers persists both in Indian tribal and caste populations. These researchers tended to start with the female lineages, mtDNA, and then supplement that with Y lineages, the paternal descent. A separate line of evidence, generally drawn from Y chromosomal results, indicated that there were deep connections between the people of India and those of Central Eurasia, in particular via the R1a haplogroup. Additionally, one aspect of the first set of results which was very surprising was that it actually placed South Asians closer to East, not West, Eurasians. But by the end of the aughts the uniparental studies had been supplemented by a range of results produced from SNP-chips, which looked at hundreds of thousands of genetic variants. These studies seemed to support the older view of South Asians being closer to West Eurasians than East Eurasians. Finally, last year a paper came out which posited that almost all South Asian populations were actually an ancient stabilized hybrid between two groups: a European-like population, “Ancestral North Indians” (ANI), and another group which is no longer present in unadmixed form, “Ancestral South Indians” (ASI), of whom the Andaman Islanders are distant relatives. Though there was a slight bias toward ANI as a whole, the fraction of ASI increased as one went southeast, and down the caste ladder. The distinctive “South Asian” ancestral group, in other words, may actually be conceived of as a compound of these two elements: an admixture of the native substrate against a European-like genetic background.

Strangely it sounds an awful lot like the older idea of a Caucasoid population with Australoid admixture. We know now that the connection between the tribal peoples of India, and the indigenous groups of South and Southeast Asia as a whole, to those of Australia and Melanesia, is tenuous at best. So the term “Australoid” is not really informative, and may even mislead. And in terms of historical linguistics I don’t think we’ve solved the problem by appealing to an “Aryan invasion.” The high fraction of ANI among South Indian tribal groups who are isolated from even Dravidian caste groups is a clue to the likelihood that the admixture event is very ancient, and probably precedes the arrival of the Aryans to the Indian subcontinent.

But there are more than two actors in this game. In Reconstructing Indian population history the authors acknowledge that their model is stylized, that reality is more complex. Additionally, they perceive in their data that some tribal groups from northeast India have an element which is outside of the purview of a two-way admixture event. They discarded this set from their broader analysis because this seemed to be a restricted phenomenon to these groups. A new paper in Molecular Biology and Evolution re-injects this third element into the picture. Population Genetic Structure in Indian Austroasiatic speakers: The Role of Landscape Barriers and Sex-specific Admixture:

The geographic origin and time of dispersal of Austroasiatic (AA) speakers, presently settled in South and Southeast Asia, remains disputed. Two rival hypotheses, both assuming a demic component to the language dispersal, have been proposed. The first of these places the origin of Austroasiatic speakers in Southeast Asia with a later dispersal to South Asia during the Neolithic, whereas the second hypothesis advocates pre-Neolithic origins and dispersal of this language family from South Asia. To test the two alternative models this study combines the analysis of uniparentally inherited markers with 610,000 common SNP loci from the nuclear genome. Indian AA speakers have high frequencies of Y chromosome haplogroup O2a; our results show that this haplogroup has significantly higher diversity and coalescent time (17-28 KYA) in Southeast Asia, strongly supporting the first of the two hypotheses. Nevertheless, the results of principal component and “structure-like” analyses on autosomal loci also show that the population history of AA speakers in India is more complex, being characterised by two ancestral components – one represented in the pattern of Y chromosomal and EDAR results, the other by mtDNA diversity and genomic structure. We propose that AA speakers in India today are derived from dispersal from Southeast Asia, followed by extensive sex-specific admixture with local Indian populations.

Some background is necessary here. South Asia is notoriously linguistically diverse, but that diversity can be bracketed into several broad families. First, the Indo-European languages are represented by Indo-Aryan and Iranian dialects (and Germanic, if you include English). Second, the Dravidian languages are found across the subcontinent, from Brahui in Pakistan to Malto in Bangladesh. But they’re really the dominant languages only in the southern cone of South Asia. That being said, it seems likely that historically their distribution extended far into the north, with Brahui in western Pakistan being a relic of that period, as well as the fragmented tribal groups in Central India. There is also evidence down to historic periods of a Dravidian-speaking substrate in Maharashtra. And purely from a philological perspective it seems clear that many Indo-Aryan languages evolved within a Dravidian linguistic substrate.

Next, in the far north there are languages of Tibetan provenance and affinity. These are explicable in their origins and relationship. But in the northeast third of the Indian subcontinent there are two groups of Austro-Asiatic languages. The prefix “Austro” is indicative of the symbiotic relationship between historical linguistics and physical anthropology in the early 20th century (most famously illustrated in the transplantation of the social-linguistic term Aryan from a South Asian and Iranian context, to a racialized Northern European term). The map at the top of this post shows the distribution of the Austro-Asiatic languages, as well as their subdivisions. There is clearly an eastern and western wing to the group, but most scholars assume that this is an artifact of the historical eruption of the Burman and Thai peoples out of the southern fringes of the Chinese Empire and into mainland Southeast Asia.

Within India the Austro-Asiatic languages fall into two broad categories: the Munda and the Khasi. The Khasi inhabit the massif which separates Bengal and Assam. Their culture and society is at some variance from the norm in India (they are matrilocal, and animist or Christian). A close relationship to the people to the east is clear in both their language and their physical appearance. The Khasi, and other groups such as the Garo, are of the family of peoples and ethnicities which have arrived from the east and north relatively recently, making the transition from the world of Tibet and Burma to India. This is evident in the face of the Khasi child in the image to the left. Once passing out of their lands of origin these populations have assimilated to different degrees to the Indic domain. The Tripuri people for example retain a Tibeto-Burman language, but are adherents of Vaishnav Hinduism (my own family were once subjects of the Manikya dynasty). The Ahom of Assam were totally assimilated by the Indo-Aryan substrate. Like the Bulgars of Bulgaria their only influence was in the ethnonym that they contributed to their subjects. A quick survey of my own genetics, and those of other South Asians of eastern origin on 23andMe, clearly shows the influence of assimilated Tibeto-Burmans. One Bangladeshi Muslim individual clearly carries an East Asian Y chromosomal haplogroup.

The Munda are a somewhat different case. In older historical literature on South Asia there is some consideration that the Munda may be the earliest inhabitants of India; predating the Dravidians. Some readers of South Asian origin also point out that in the early Indo-Aryan language there may be more evidence of Munda, than Dravidian, influence. But the eastern connections of the Munda languages seem clear, albeit less explicable than those of the Khasi or the Tibeto-Burman peoples of the far northeast. If the Munda are the indigenous people then it stands to reason that the Mon-Khmer languages derive from South Asia. On the other hand the vast majority of the Austro-Asiatic languages exist in Southeast Asia, and, the Munda themselves have been hypothesized as being the bearers of rice-culture from the east.

This is where genetics comes into play. There has already been evidence of an eastern influence in the genes of the Munda from other researchers, so what this paper does is look at that in detail, instead of discarding it as a minor effect which muddles the broader picture. I’ve reformatted figure 3 to show how the groups relate to each other. On the left is a PCA. Most of the variance is west-east, ~6%, while some of it is north-south, ~1%. On the right is a bar plot generated from ADMIXTURE. I’ve edited out many of the populations. Focus on the Austro-Asiatic groups from India.


In the PCA you see the SE-NW axis of ANI-ASI admixture which is the primary aspect of genetic variation within South Asia. Numerically, Dravidian and Indo-Aryan groups along this axis are the vast majority of South Asians. But the Munda and other Austro-Asiatic groups are not trivial; there are strong suggestions that the eastern Indo-Aryan groups, Oriya, Bengali, and Assamese, are to some extent shaped by influence from the Austro-Asiatic elements. The closer connection of the Khasi to East Asian populations is clear on the PCA. But the fact that the South Indian samples are further along axis-Y than the Munda is indicative of admixture in the Munda population. Looking at the bar plot, that’s clear. The dominant dark-green signature of South Indian ancestry is also predominant among the Munda, and found at non-trivial amounts among Iranian, Khasi, and Southeast Asian populations, but the Munda clearly have an eastern component which is not found in South Indians. This is probably the element which perturbs them on the PCA.

But this just tells us the relationships in terms of total genome content. It doesn’t necessarily tell us the historical sequence of admixture events or the direction of migration. In fact the evidence of Indian ancestry in Southeast Asia could be suggesting migration from South Asia to Southeast Asia (there is plenty of cultural evidence of transmission, though the presumption is that the demographic movements were marginal). They note in the paper that one phenomenon which could be obscuring and confusing our understanding is that much of gene flow occurs through isolation-by-distance (IBD): village-to-village dynamics. In contrast to this you have folk wanderings, which result in a “leapfrog” aspect. The Hazara and Uyghur are both cases of leapfrogging, as their genetic makeup can’t be explained easily by IBD. So here the connections between the Munda and Southeast Asians, and the broader relationship between Southeast Asians and South Asians, could be IBD, or perhaps reflect deep ancient common ancestry. Perhaps the ASI group spanned the region from the Arabian Sea to the South China Sea, and were only later overlain by ANI and East Asian populations.

To explore these questions the authors tunneled down to a more fine-grained scale, and looked at uniparental lineages as well as a gene at which recent selection seems to have operated upon East Asians in distinction to other groups, EDAR. Though uniparental lineages are only partially informative in terms of ancestry, they are very amenable to dating because of their haploid inheritance patterns. And the relationships between the branches of the termini can give us historical information.

The following figure shows the relationship and distribution of a particular Y chromosomal haplogroup which the Munda carry, and other South Asians tend not to, which connects them to the east:


The haplogroup is O2a (M95). The results from the Y chromosomal data are not clear, though they do seem to reject the model whereby Southeast Asian O2a lineages derive from Indian ones. But it does not seem as if you have a scenario where one founder lineage entered into South Asia from Southeast Asia; there are too many disparate branches of O2a found among Indians. Additionally, the coalescence time (back to last common ancestor) is deeper in Southeast Asia, but still deep in South Asia among the Munda. From this it seems that the origin of Austro-Asiatic languages in South Asia can be rejected, but the details of the emergence of Austro-Asiatic in South Asia cannot be clearly perceived as of yet. From what I can gather the authors themselves do not necessarily believe that their results in this domain are robust (insensitive to varying the model’s assumptions even marginally).

An interesting point though is that the mtDNA, the female lineage, does not seem to diverge from other South Asians much at all. I find it intriguing that this is the same pattern we see along the major NW-SE axis of variation. It seems that mtDNA lineages unite South Asians, while the Y lineages separate them (by caste and region). The generality has many exceptions, but it points to a peculiar sex mediated admixture process from both the northwest and northeast. Men on the move have reshaped the genetics and culture of South Asia, but the mtDNA lineages still point to an ancient Eurasian group with distant but stronger affinities to the east than the west. The mtDNA are likely the purest distillation of ASI.

Finally, they look at frequencies of variants of EDAR among the South Asian groups. EDAR is in some ways diagnostic of East Asian ancestry; it seems that a variant which produces thick straight hair emerged relatively recently among East Asians.  Here’s the result from the HGDP browser:


The G allele exhibits co-dominance, so the GA phenotype has intermediate hair-thickness between AA and GG. Haplotype structure based tests of natural selection have indicated that the derived G allele is recent. The map to the right shows the frequency of the derived G variant by population group. The bubble size is proportional to frequency, while the colors represent language groups. Again the Khasi and Tibeto-Burman groups are as you’d expect: they exhibit a relatively high frequency of the derived variant. The Hazara are a group which only came into being within the last 1,000 years through an admixture event. The Tharu seem to have their origins in Nepal’s transitional zone, and all the Nepali populations have significant admixture with Tibetan groups even if they themselves are not Tibetan in language and culture. The interesting result is the Munda. The Dravidian groups lack the derived EDAR variant, as do Indo-European groups without a plausible East Asian source of admixture. But within the Munda the derived variant is found at a frequency of ~5%. This is far lower than the 60% among the Tibeto-Burmans of the northeast, or the 40% among the Khasi, but it is significant. And this result allows the authors to reject the IBD model of connection for Austro-Asiatic groups, because the Munda harbor the variant which other South Asian groups in their environs do not. Gene flow predicated on linguistic affiliation at such a remove seems implausible, so the most parsimonious explanation is that the Munda languages arrived in India from Southeast Asia as part of a leapfrog folk wandering.

But why the low frequency of the derived variant? Obviously the Munda have admixed with the local substrate, so dilution would be one explanation. Another could be that when the Munda left East Asia the frequency was lower. Additionally, whatever selective forces were driving the frequency up may have abated in South Asia, and it could be that there was selection against the derived variant! Whatever the truth of it the existence of the derived EDAR variant among the Munda would be like finding the European LCT variant among an East Asian population: clear evidence of long distance gene flow and population movement.
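The dilution explanation can be given a back-of-the-envelope form. The sketch below assumes the simplest possible model: the Munda frequency of the derived variant is a straight mixture of an eastern source population and a local population that lacks the variant entirely, with no drift or selection since. The true frequency in the source population at the time of migration is unknown, so the present-day Khasi and Tibeto-Burman figures quoted above are used purely as stand-ins; the function name is my own.

```python
def admixture_fraction(p_mixed, p_source, p_local=0.0):
    """Eastern ancestry fraction implied by simple two-way dilution.

    Solves p_mixed = a * p_source + (1 - a) * p_local for a, assuming no
    drift or selection after the admixture event.
    """
    if p_source == p_local:
        raise ValueError("source and local frequencies must differ")
    return (p_mixed - p_local) / (p_source - p_local)


munda_frequency = 0.05  # ~5% derived EDAR allele, as quoted in the post

# Stand-in source frequencies taken from the present-day groups above.
stand_in_sources = {
    "Khasi-like source (40%)": 0.40,
    "Tibeto-Burman-like source (60%)": 0.60,
}

for label, p_source in stand_in_sources.items():
    a = admixture_fraction(munda_frequency, p_source)
    print(f"{label}: implied eastern ancestry ~{a:.1%}")
```

Under these assumptions the 5% frequency would correspond to something like 8-13% eastern ancestry. If the source frequency was lower when the Munda left, or if selection later worked against the variant, the real figure would be higher, so this is a floor-ish illustration rather than an estimate.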

So where does this lead us? First, let me observe that some of the authors on this paper are the same ones who argued for a predominantly indigenous origin for South Asians in the early 2000s based on mtDNA variation. In this paper they seem to be leaning against an indigenous origin for the Munda, or at least refuting the conjecture that the Munda are ur-Indians par excellence. I didn’t go into the details of the coalescence times because they’re rather a mess, but EDAR is probably a “tipping point” in arguing for a relatively recent exogenous origin for the Munda. The strong sex asymmetry in genetic variation is also suggestive; we have plenty of historical examples of genetic leapfrogs occurring through men-on-the-move. The asymmetry also seems to exist among the Khasi and other Tibeto-Burmans in India’s northeast (figure 2 of the paper).

The arguments about the history, culture, and genetics of South Asia have traditionally been disputed along the Aryan-Dravidian axis. I’m not interested in rehashing that aspect, but these data point us to another reality: on India’s northeast frontier there’s another component. As an ethnic Bengali myself I’ve always been somewhat aware of this. Some of my relatives and family acquaintances look much more like Garos than other South Asians. This component is even more evident on the face of Assamese and Nepali, whose languages are Indo-Aryan and religion is Hinduism, but whose appearance bespeaks a more variegated background. On some level South Asians from these regions are aware of their peculiarity, even if it isn’t spoken of much. I have read that in the wake of the victory of Japan over Russia in the early 20th century Bengali intellectuals expressed in public their pride at their Asiatic ancestry. With the rise of China in the 21st century I suspect more South Asians from Nepal, Bengal, and Assam, will rediscover that aspect of their background which links them to the east, and not the west. The genetics is just telling us what we already knew.

Citation: Gyaneshwer Chaubey, Mait Metspalu, Ying Choi, Reedik Mägi, Irene Gallego Romero, Pedro Soares, Mannis van Oven, Doron M. Behar, Siiri Rootsi, Georgi Hudjashov, Chandana Basu Mallick, Monika Karmin, Mari Nelis, Jüri Parik, Alla Goverdhana Reddy, Ene Metspalu, George van Driem, Yali Xue, Chris Tyler-Smith, Kumarasamy Thangaraj, Lalji Singh, Maido Remm, Martin B. Richards, Marta Mirazon Lahr, Manfred Kayser, Richard Villems, & Toomas Kivisild (2010). Population Genetic Structure in Indian Austroasiatic speakers: The Role of Landscape Barriers and Sex-specific Admixture Mol Biol Evol : 10.1093/molbev/msq288

Link acknowledgement: Dienekes Pontikos.

Addendum: This is more a speculative comment, so I will tack this on to the body of the main post. Here’s my current very tentative model for how South Asians came to be. At some point after the last Ice Age 10,000 years ago the ANI arrived, and hybridized with the ASI, who are descendants of the older original Out of Africa wave to South Asia. After this, but before the Aryans, the Munda arrived from the northeast, and pushed into lands inhabited by ANI-ASI groups. 4,000-3,000 years ago the Indo-Aryans arrive, and impose themselves as an elite on the ANI-ASI hybrid population, before being assimilated biologically and imparting their language to the Indian majority. I don’t know where Dravidian came from, but perhaps it was the language of the ANI (its existence in fragments all across the swath of the northern Indian subcontinent is suggestive, as well as possible connections to ancient Elamite, the language of Bronze Age southwest Iran). Eventually the Aryanized ANI-ASI marginalized the Munda in northeast India and drove them to the highlands. Finally, the Tibeto-Burmans arrived in the historical period.

