Razib Khan One-stop-shopping for all of my content

January 18, 2018

The Dravidianization of India

Filed under: Dravidian,India Genetics,India genomics,Indo-Aryan — Razib Khan @ 9:36 pm

On this week’s The Insight Spencer Wells and I talk about the Indo-Aryan arrival to South Asia. This was recorded very early last summer, and I’m rather unguarded (it’s well before I had the piece published in India Today).

I think 2018 will finally be the year that a lot of South Asia will be “solved.” There has been some foot-dragging on papers and results, but that can only go on so long.

All that being said, I suppose I should make some suppositions I have arrived at on this topic more explicit. In a discussion with an Indian friend, when I expressed them, he admitted he had no idea about some of my views, though he reads this weblog. That’s because they are speculative and my confidence in them is weak, though you can infer my opinions if you look very closely.

The figure to the left is from Genomic insights into the origin of farming in the ancient Near East, a paper published about a year and a half ago. You see various South Asian populations being modeled as a mixture of four different source populations (Onge, Iran_N, Steppe_EMBA, and Han). The Onge are an Andaman Islander population (and the closest we can get to the aboriginal peoples of South Asia). Iran_N represents Neolithic Iranians, the canonical “eastern farmer” population. Steppe_EMBA represents Yamnaya pastoralists, who are themselves modeled as a mixture of Eastern European Hunter-Gatherers (EHG) and a southern population which has affinities with the Iran_N cluster. EHG in their turn seem to exhibit ancestry from Western European Hunter-Gatherers (WHG), whose heritage dates to the late Pleistocene, and Ancient North Eurasians (ANE), who flourished in Siberia, and contributed ancestry to populations to the west and east (including the ancestors of Native Americans).

When I first saw this specific figure I was incredulous. I had long thought that “Ancient North Indians” (ANI) were a compound of two elements, one related to the farmers of West Asia (Iran_N), and the other steppe Indo-European (Steppe_EMBA/Yamnaya). But the fraction of Yamnaya/Indo-European/Indo-Aryan ancestry seemed far too high.

A few years later I am now less certain about my skepticism. The fractions here in the details are debatable. Within the text of the paper, the author admits that the true ancestral populations are probably not represented by the model. But they are close. In most cases, the “Han” ancestry is probably indicative of the fact that the non-ANI component of South Asian ancestry is most closely related to the Onge, but is significantly different nonetheless.

The ratio of Iran_N and Steppe_EMBA is the key. Here is a selection from the paper:

Group            Iran_N  Steppe_EMBA  Ratio (Iran_N / Steppe_EMBA)
Jew_Cochin       0.53    0.23         2.27
Brahui           0.60    0.30         1.98
Kharia           0.13    0.07         1.97
Balochi          0.57    0.32         1.75
Mala             0.23    0.18         1.25
Vishwabrahmin    0.25    0.20         1.21
GujaratiD        0.29    0.28         1.04
Sindhi           0.38    0.38         1.00
Bengali          0.22    0.25         0.91
Pathan           0.36    0.45         0.81
Punjabi          0.24    0.33         0.72
GujaratiB        0.27    0.38         0.72
Lodhi            0.21    0.29         0.72
Burusho          0.27    0.43         0.64
GujaratiC        0.23    0.37         0.61
Kalash           0.29    0.50         0.58
GujaratiA        0.26    0.46         0.57
Brahmin_Tiwari   0.23    0.44         0.51

Any way you slice it, a group like the Tiwari Brahmins of Northern India has more Onge-like ancestry than most of the groups in Pakistan. But also observe that the ratio is more skewed toward Steppe_EMBA in them than among even Pathans or Kalash. The Lodhi, a non-upper caste population from Uttar Pradesh in north-central South Asia, are more skewed toward Steppe_EMBA than Pathans.

It is important for me to reiterate that the key is to focus on ratios and not exact percentages. Though the Steppe_EMBA fraction did strike me as high, glimmers of these sorts of results were evident in model-based clustering approaches as early as 2010. The population in the list above most skewed toward Iran_N is the Cochin Jews. This group has known Middle Eastern ancestry. But next on the list are the Brahui, a Dravidian-speaking group in Pakistan. There is a north-south cline within Pakistan, with northern populations (Burusho) being skewed toward Steppe_EMBA and southern ones (Sindhi) being skewed toward Iran_N. Additionally, Iranian groups such as Pathans and Baloch likely have had some continuous gene flow with Middle Eastern groups, probably inflating their Iran_N.
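For readers who want to check the comparisons themselves, here is a minimal Python sketch that recomputes the Iran_N to Steppe_EMBA ratio for a handful of groups from the rounded fractions in the table. Two caveats: ratios recomputed from rounded values will differ slightly from the table's own Ratio column (the Brahui come out at about 2.00 rather than 1.98), presumably because that column was derived from unrounded estimates. The "remainder" is also my own shorthand, not a quantity from the paper: it assumes the four modeled sources sum to 1, so whatever is not Iran_N or Steppe_EMBA is the combined Onge plus Han share.

```python
# Rounded fractions copied from the table above: (Iran_N, Steppe_EMBA).
FRACTIONS = {
    "Brahui": (0.60, 0.30),
    "Balochi": (0.57, 0.32),
    "Sindhi": (0.38, 0.38),
    "Pathan": (0.36, 0.45),
    "Lodhi": (0.21, 0.29),
    "Kalash": (0.29, 0.50),
    "Brahmin_Tiwari": (0.23, 0.44),
}


def summarize(iran_n, steppe_emba):
    """Return (Iran_N / Steppe_EMBA ratio, share assigned to neither source)."""
    ratio = iran_n / steppe_emba
    # Assumption: the four modeled sources sum to 1, so this remainder
    # is the combined Onge + Han share, not Onge alone.
    remainder = 1.0 - iran_n - steppe_emba
    return ratio, remainder


SUMMARY = {group: summarize(*pair) for group, pair in FRACTIONS.items()}

# Print groups from most Iran_N-skewed to most Steppe_EMBA-skewed.
for group, (ratio, remainder) in sorted(
    SUMMARY.items(), key=lambda item: item[1][0], reverse=True
):
    print(f"{group:15s} ratio={ratio:.2f} remainder={remainder:.2f}")
```

Ordering by ratio rather than by raw Steppe_EMBA fraction is the point: the Tiwari Brahmins and the Pathans have nearly the same Steppe_EMBA fraction, but very different ratios and remainders.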

Trends I see in the data:

  1. There is a north-south cline within Pakistan with Steppe_EMBA vs. Iran_N
  2. There is a north-south cline within South Asia with Steppe_EMBA vs. Iran_N
  3. There is caste stratification within regions between Steppe_EMBA vs. Iran_N
  4. Though not clear in this table, there are strong suggestions that Indo-European speaking groups tend to be enriched in Steppe_EMBA, all things equal (e.g., the Bengalis in the 1000 Genomes look a lot like the middle-caste Telugus in the 1000 Genomes when you remove the East Asian ancestry…except for a noticeable small fraction of a component which I think points to Indo-European ancestry)

What does this mean in terms of a model of the settlement of South Asia over the past 4,000 years? One conclusion I have come to is that Dravidian-speaking groups are not the aboriginal peoples of the subcontinent. Rather, their settlement across much of South Asia is very recent. Almost as recent as Indo-Aryan habitation. In First Farmers the archaeologist Peter Bellwood proposed this model, whereby Indo-Aryans and Dravidians both expanded across South Asia concurrently. Though I think elements of Bellwood’s model are incorrect, it’s far more correct in my opinion than I believed when I first encountered it.

Why do I believe this?

  1. The Neolithic begins in South India in 3000 BC.
  2. Sri Lanka is Indo-European speaking
  3. The Dravidian languages of South India don’t seem particularly diverged from each other
  4. There is ancestry/caste stratification in South India even excluding Brahmins (e.g., Reddys and Naidus in Andhra Pradesh look somewhat different from Dalits and tribals)
  5. Some scholars claim that there isn’t a Dravidian substrate in the Gangetic plain
  6. R1a1a-Z93, almost certainly associated with Indo-Aryans, is found in South Indian tribal populations
  7. Using LD-based methods researchers are rather sure that the last admixture events between ANI and ASI (“Ancestral South Indians”) populations occurred ~4,000 years ago

Here is my revised model as succinctly as I can outline it. The northwest fringes of South Asia, today Pakistan, and later to be the home of the Indus Valley Civilization (IVC), were populated by a mix of indigenous populations, a form of ASI, when West Asian agriculturalists arrived ~9,000 years ago from what is today Iran. These were the Iran_N or “eastern farmer” groups. The West Asian agricultural toolkit was serviceable in northwestern South Asia for reasons of climate and ecology, but could not expand further east and south for thousands of years.

This is where the first admixture occurred, leading to a population that was mixed between ANI and ASI. These people lacked Steppe_EMBA. They were pre-Indo-European. They were almost certainly not all Dravidian speaking. The Burusho people of northern Pakistan, for example, speak a language isolate (elsewhere in South Asia you have Nihali in India and Kusunda in Nepal).

By ~3000 BC this proto-South Asian (in a modern sense) population began to expand, while the IVC matured and waxed. Eventually, the IVC waned, fragmented, and disappeared.

Around 2000 BC, or perhaps somewhat later, Indo-Aryans arrive in South Asia. The situation at this stage is not one of a primordial and static Dravidian India, on top of which Indo-Aryans place themselves. Rather, it’s a dynamic one, as the collapse of the IVC has opened up a disordered power vacuum and a reconfiguration of cultural and sociopolitical alliances.

In the paper above the author alludes to the pervasiveness of both Iran_N and Steppe_EMBA ancestry in South Asia, including in South India. “Indo-European” Y chromosomal lineages are also found among many South Indian groups, albeit at attenuated proportions region-wide. In Peter Turchin’s formulation, I believe that “Indo-Aryan” and “Dravidian” identities became meta-ethnic coalitions in the post-IVC world. Genetically the two groups are different, on average. But some Dravidian populations assimilated and integrated Indo-Aryan tribes and bands, while Indo-Aryans as newcomers assimilated many Dravidian populations.

The reason that the ratio of Iran_N to Steppe_EMBA does not decline monotonically as one goes from west to east along the North Indian plain is that Indo-Aryans were not expanding into a Dravidian India. Dravidian India was expanding only somewhat ahead of Indo-Aryan India, and in some places not at all. In the northwest fringe of South Asia there had long been a settled population of peasants with West Asian ancestry with Iran_N affinities. In contrast, to the east the landscape was populated by nomadic tribal populations with ASI affinities. North Indian Brahmins may have more Steppe_EMBA than some populations in Pakistan and more ASI because they descend from Indo-Aryan groups who absorbed indigenous ASI populations as they expanded across the landscape.

Dravidian groups as they expanded also assimilated indigenous populations. This explains some groups with very high fractions of ASI. Their ASI ancestry is a compound, of an old admixture in Northwest India, and also later assimilation in South India. The presence of R1a1a-Z93 in these populations reflects the integration of some originally Indo-Aryan groups into the expanding Dravidian wavefront.

Where does this leave us?

  1. The Indo-Aryan vs. Dravidian dichotomy is not one of newcomers vs. aboriginals. It is of two different sociocultural configurations which came into their current shape in the waning days of the IVC. That is, it is less than 4,000 years old
  2. The two populations were clearly interacting closely around the time of the collapse and disintegration of the IVC and post-IVC societies. There has been gene flow between the two
  3. ~4000 years ago ANI and ASI populations existed in their “pure” form, but that is because ASI aboriginals still existed to the south and east of the IVC, while Indo-Aryans were a new intrusive presence in the Indian subcontinent

December 20, 2010

Some of the Indo-Europeans found?

School girls in Hunza, Pakistan

A few days ago I observed that pseudonymous blogger Dienekes Pontikos seemed intent on throwing as much data and interpretation into the public domain via his Dodecad Ancestry Project as possible. What are the long term implications of this? I know that Dienekes has been cited in the academic literature, but it seems more plausible that this sort of project will simply distort the nature of academic investigation. Distort has negative connotations, but it need not be deleterious at all. Academic institutions have legal constraints on what data they can use and how they can use it (see why Genomes Unzipped started). Not so with Dienekes’ project. He began soliciting for data ~2 months ago, and Dodecad has already yielded a rich set of results (granted, it would not be possible without academically funded public domain software, such as ADMIXTURE). Even if researchers don’t cite his results (and no doubt some will), he’s reshaping the broader framework. In other words, he’s implicitly updating everyone’s priors. Sometimes it isn’t even a matter of new information, as much as putting a spotlight on information which was already there. Below is a slice of a bar plot from Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. It uses STRUCTURE with K = 7. To the right of the STRUCTURE slice are two plots of individual data on French and French Basque from the same HGDP data set using ADMIXTURE at K = 10 from Dodecad.


Repeated runs and higher K’s make it clear that the French Basque lack a “West Asian” aspect which other French, and Iberians as well, have. Some of this is clear in the paper I referenced above as well…the key is you have to look at the supplements at K = 6. Because the Basque are the only native non-Indo-European speakers in Western Europe, their origin and relationship to nearby populations has always been of interest (they also have the highest Rh- frequency of world populations). Granted, the French Basque are very similar genetically to the French as a whole. But, it is obviously highly informative that they lack an ancestral component in totality which seems to exist at low but consistent levels across Western European populations. The only other European population at K = 15 who lack the West Asian component in totality are Finns (the Lithuanians come very close).

This is all preamble to a discussion of a post Dienekes put up today, A solution to the problem of Indo-Aryan origins. Remember that Dienekes has been “playing” with ADMIXTURE for only a few months. To claim to have found a ‘solution’ to a problem as intellectually and politically intractable and explosive as this is rather bold. The crux of the matter is that at certain confluences of K’s and population sets Dienekes has discovered a distinctive signature of ancestry which seems to be modal on the north slope of the Caucasus, and spans India and Europe. He terms this “Dagestani,” due to the fact that among a population sample from this province in Russia this ancestral component is overwhelmingly dominant. The patterns of Dagestani admixture in Europe and India are curious and suggestive.

1 – In Europe the frequencies are low, but irregularly distributed (excepting around the North Caucasus). Scandinavians and British have appreciable fractions, Finns and Southern Europeans do not. Here’s Dienekes:

Interpreting this pattern is not easy, but it does seem that this component seems to have a V-like distribution, achieving its maximum in Caucasus and its environs, then undergoing a diminution, and achieving a secondary (lower) frequency mode in NW Europe.

The surprising appearance of the homonymous Dagestan component in India suggests a widespread presence of a common ancestry element. The West Asian element, by comparison seems to have a more normal /\-like distribution around its center in Anatolia-Caucasus-Iran region. It does reach the Atlantic coast, but is lacking in Scandinavia and Finland, and also in India itself.

2 – South Indian Brahmins have appreciable fractions, but non-Brahmins in the same region do not. In contrast, those who come from Indo-Aryan speaking backgrounds do seem to have Dagestani ancestral components, irrespective of other aspects of ancestry. For example Pakistanis don’t have that much more Dagestani than South Indian Brahmins or Gujaratis. Also compare the relatively narrow window of Dagestani ancestry variance among Dodecad South Asians (I’m DOD075). DOD088 is, from what I recall, a Reddy from Andhra Pradesh, a non-Brahmin but non-low caste. It is interesting that they have a high proportion of “Pakistan,” but no Dagestani. I have ~10% Dagestani, but no Pakistani.

Below is K = 10 for a selection of populations. Dienekes has now included two non-Indo-European speaking Pakistani populations: the Brahui (Dravidian) and Burusho (linguistic isolate in the mountains of Pakistan):

Some general patterns are evident. The light blue is indicative of generic “Indian” ancestry. It is not found in appreciable proportions outside of subcontinental populations (or those of recent subcontinental origin). The same with the red, and light orange. For your reference the dark orange is a “Northern European” component, modal in Lithuania. The light and dark green are both East Asian components. The dark blue is a “West Asian” component modal in Georgia, and prominent across Europe, declining as a function of distance from the eastern shore of the Black Sea (this is surely the West Asian which distinguishes the French from the French Basque). I believe that the light purple dominant in the Brahui and the light red dominant in the Burusho probably form as a compound the aforementioned Pakistani component. The dark purple is the Dagestani.

First, a word on the Brahui. These are a group of tribes who reside in northern Balochistan in Pakistan. A small number are even to be found in Afghanistan. Historically they have had close relations with the Baloch, an Iranian speaking cluster of tribes who totally envelop the Brahui. The Brahui do speak a Dravidian language, of a family dominant in South India and found in isolated regions of Central and Eastern India. There are two broad models for the existence of a Dravidian language in Pakistan. The first is that the Brahui are remnants of more widely spoken Dravidian languages which date back to the Indus Valley civilization. The second is that the Brahui arrived during the medieval period from another region of South Asia where Dravidian languages were more common. Assuming either model, it has long been presumed that their involution by the Baloch has had a strong impact on the Brahui genetically; the two groups are very close. This is evident in Dienekes’ results as well. But observe that the Baloch are the group which seems more cosmopolitan in ancestry than the Brahui. If the Brahui were Dravidians from deep in India it seems that they would have a greater residual component of India-specific ancestry (light blue and orange). This is not so. In fact the Baloch have more of the Indian ancestral component than the Brahui. The Brahui component is found across Pakistan, and into India, albeit at lower proportions. Naturally, the Baloch have the second highest fraction. I believe these results should shift us toward the position that the Brahui are indigenous in relation to the Baloch, and that the Baloch ethnic identity emerged through the shift of a Brahui substrate, as evidenced by the greater cosmopolitanism of the Baloch. Additionally, Dienekes observes that the Brahui have a lower proportion of the Dagestani component than most other Pakistani groups, and several Indo-Aryan groups in India proper.

The Burusho are even more interesting than the Brahui. Unlike the Brahui the Burusho are very isolated in the mountainous fastness of Baltistan in northern Pakistan. Additionally, their language, Burushaski, is a linguistic isolate. Others of the class are Basque and Sumerian. In general it is assumed that linguistic isolates were once part of broader families of languages which have gone extinct. Burushaski probably persists in large part because of the geography which its speakers inhabit. Mountainous areas often preserve ethnic and linguistic diversity because the terrain allows for the persistence of local variety. I believe it is plausible that the Burusho have been far more isolated than the Brahui. This seems to show up in the ADMIXTURE plot: the Burusho have a greater proportion of their modal ancestral component than the Brahui. Additionally, the Burusho have an even smaller component of Dagestani than the Brahui.

Below is a chart Dienekes constructed ordered by proportion of Dagestani for his South Asian populations. Next to it I’ve placed a chart from a PCA which has some of the same population samples. Compare & contrast:


The PCA is looking at between population variation in totality. So naturally the Dagestani component isn’t going to be predictive of that. Rather, it speaks to the possibility which Dienekes is mooting: that the Dagestani component spread in the Indian subcontinent with the Indo-Aryans specifically, overlying the local resident substrate. In South India this meant that Brahmins brought this, mixing with the indigenous Dravidian population. In Pakistan the Indo-Aryans, and Iranians, were overlain on a substrate which were the ancestors of the Burusho and Brahui. The dominant signal of genetic relationship has to do with the substrate, not the Indo-Aryans. So that’s what’s going to show up on the PCA. In other PCA plots the model where South Indian Brahmins are a linear combination of a Pakistani-like population and a Dravidian population becomes clearer. But when you look at ancestry using something like ADMIXTURE you have the potential to tease apart different components, and so uncover relationships which may have been obscured when looking at aggregate variation.

Dienekes’ model seems to posit three steps in rapid succession ~4,000 years ago. A background variable which must be mentioned is that one must account for the Mitanni, a dominant Syrian power circa 1500 BC where a non-Indo-European language was the lingua franca, and yet a definite Indo-Aryan element existed within the elite. Indo-Aryan specifically because the Indo-European element within the Mitanni was not Iranian, but specifically Indo-Aryan. An easy explanation for this is that the Indo-Aryan component of the Indo-Iranian branch of the Indo-European languages crystallized outside South Asia, and independently reached Syria and India. In Syria it went extinct, while in India it obviously did not. By Dienekes’ model the Mitanni would be rather closer to the urheimat of the Indo-Aryans.

An aspect of his model which I do not understand is why it has to be Indo-Aryan, instead of Indo-Iranian. The South Asian population in which the Dagestani component is modal, the Pathans, are Iranian, not Indo-Aryan. Additionally, this model seems to not speak in detail to the existence of the Dagestani element among Europeans. Here is a sorting of European populations (with Iranians included) by the Dagestani component:

Population             Dagestan (%)
Urkarah                93
Lezgins                47.9
Stalskoe               38.7
Adygei                 16.4
Orcadian (Orkney)      12.6
Georgians              12.4
White_Utahns           11.2
Iranian                10.9
Scandinavian_D         10.2
Armenian_D             9.9
German_D               9.1
Turks                  8.8
Armenians              8.4
French                 7.9
Hungarians             7.5
Russian_D              6.3
Spanish_D              4.6
North_Italian          4.5
Spaniards              4.4
Romanian               4.1
Finnish_D              4.1
Russian                4
Greek_D                3.8
Portuguese_D           3.6
Tuscan                 3.5
Tuscans                3.4
Lithuanians            2.9
S_Italian_Sicilian_D   2.8
Belorussian            2.5
Cypriots               2
Sardinian              1.5
French_Basque          0.7

There is here a strange pattern of rapid drop off from the Caucasus, and a bounce back very far away, on the margins of Germanic Northwestern Europe. This to me indicates some sort of leapfrog dynamic. A well known illustration of this would be the Ugric languages. The existence of Hungarian on what was Roman Pannonia is a function of the mobility and power of Magyar horsemen, and their cultural domination over the Romance and Slavic speaking peasantry (their genetic impact seems to have been slight). No one believes that Germanic languages are closely related to Indo-Aryan (rather, if there is structure in Indo-European beyond Indo-Iranian, Celtic, etc., it would place the Indo-Iranian languages with Slavic). So what’s going on? I think perhaps the Dagestani component is in part a reflection of the common Indo-European origin in that region. For whatever reason that signal is diminished in much of the rest of Europe. Perhaps Southern Europe was much more densely populated when the Indo-Europeans arrived. Additionally, it seems highly likely that in places like Sardinia, much of Spain, and Cyprus, Indo-European speech came through cultural diffusion (elite emulation) and not population movement. Or perhaps we’re seeing the vague shadows of population admixtures on the Pontic steppe, where distinct Germanic and Indo-Iranian confederations admixed with a common North Caucasian substrate.

Going back to India, let’s revisit the model of a two-way admixture between “Ancestral North Indians,” who were genetically similar to Europeans and West Asians, and “Ancestral South Indians,” who were closer to, but not very close to, East Eurasians. The ANI & ASI. The ASI were probably one of the ancient populations along the fringe of southern Eurasia, all of whom have been submerged by demographic movements from other parts of Eurasia over the past 10,000 years, excepting a few groups such as the Andaman Islanders and some Southeast Asian tribes. The model was admittedly a simplification. But taking that model as a given, and accepting that the Dagestani element is indeed Indo-Aryan, we can infer that the ANI were not Indo-European. It is notable that the South Indian Brahmins have elevated fractions of both the Brahui and Burusho modal components. This is probably indicative of admixture of the Indo-Aryan element in the Indus Valley, prior to their expansion to other parts of India. I assume one of the languages spoken was Dravidian, though if ancient Mesopotamia was linguistically polyglot at the dawn of history I would not be surprised if the much more geographically expansive Indus Valley civilization was as well.

Aishwarya Rai

The irony is that today when someone refers to a “Dravidian” physical type, they’re not talking about someone who looks like a Pakistani. They’re talking about someone who looks South Indian, where most Dravidian languages are spoken. But combining the inference from Dienekes’ model and the previous two-way admixture model, you reach the conclusion that lighter skin and more West Asian features among South Asians may be more due to Dravidian-speaking ancestors in the Indus Valley, not Indo-Aryans! It goes to show the wisdom of differentiating linguistic classes from biological ones when discussing historical population genetics. Unfortunately, it is a wisdom most of us interested in these topics do not show, alas.

As I like to say, interesting times….

Note: If you leave a comment, please don’t be smarter-than-thou in your tone. I have stopped publishing those sorts of comments because the reality is that most of them have not been that smart or informed. At least by my estimation. If you actually are smarter than the average bear, and impress me with your erudition and analytical clarity, I’ll probably let your comment through no matter your attitude. But I wouldn’t bet on it if I were you, so show some class and humility. Most of us are muddling through.

Image Credit: Georges Biard, iStockPhoto
