Razib Khan One-stop-shopping for all of my content

March 30, 2019

South Asian Genotype Project, Spring 2019 update

Filed under: South Asian Genetics,South Asian Genotype Project — Razib Khan @ 11:01 pm

It’s been a while since I updated the South Asian Genotype Project. Well, I updated almost everyone. A few people had strangely formatted text files, so I’ll go add them tomorrow. Thanks to everyone who has submitted so far!

One of the main things that I’ve been curious about is undersampled groups. I finally got an Uttar Pradesh Kayastha in the data set (well, technically my second…but the first is a friend). I also got a submission of a Bengali Brahmin with origins in the west, and another in the east (in fact, from Comilla, which is where my own family is from). And, I got the submission of another West Bengali Kayastha.

Finally, I got another Maharashtra Kayastha.

If you click the image above you see some obvious things:

  • Bengali Brahmins don’t seem to be geographically structured. The eastern and western individuals are right near each other on the PCA. Additionally, they are very close to Uttar Pradesh Brahmins, not the main Bangladesh cluster.
  • In contrast, the West Bengal Kayastha is positioned close to the Bangladeshis, though outside of their cluster.

In other words, to some extent Bengal’s landscape reflects both aspects of South Asian genetic variation: it is strongly structured by caste, and geography also plays a role. People from western Bengal have less East Asian ancestry and more affinity with peoples to the west on the Gangetic plain. But Bengali Brahmins are genetically quite distinct from other Bengalis.

The dissimilar position of Kayastha groups across South Asia is in contrast to Brahmins. Though Brahmin groups in Bengal and South India seem to have mixed with local groups (they are always somewhat shifted to the regional substrate), overall their genetic character indicates shared common ancestry. In contrast, the different Kayastha groups seem much more likely to be a case of local populations who arose to fill a particular occupational niche that emerged with polities which required a bureaucratic class.

March 13, 2019

Swidden rice farming does not lead to high population density

Filed under: Genetics,South Asian Genetics — Razib Khan @ 12:18 am
Admixture on K = 5

I’ve been looking at the data from the recent Munda paper. Standard stuff: admixture, treemix, and f-statistics. The northern Munda samples were collected in Bangladesh. So I thought: I can test the hypothesis that the East Asian ancestry in Bangladesh is in large part Santhal. After looking at it every which way, I think that in fact, the Munda may not have ever been very populous in much of northeast India. The Santhal are just not a good donor population to Bengalis, at least not when compared with mixes such as Dai + Tamil.

Additionally, the Santhal are really not that well modeled by mixing South Asians with any particular Southeast Asian group, though it works. I think that’s suggestive of the possibility that the Austro-Asiatic group which gave rise to the Munda doesn’t exist in its current form anywhere in Southeast Asia. Additionally, I think the Lao samples that are provided in the new paper may have Indian ancestry via admixture from Austro-Asiatic Mon or Khmer groups.

Basically, there is so much bidirectional gene flow that I think it’s really hard to get a grip on what’s going on. Additionally, the Burmese and northeast Indian populations (e.g., the Mizos) clearly have a strand of ancestry that derives from relatively recent migrants that came down from the region of eastern Tibet, and perhaps Sichuan or even further north. And this component shows up in Bengalis as well.

On top of this, there is the “Australo-Melanesian” substrate that is present all across Southeast Asia, and probably was present in modern southern China in the early Holocene, which has distant affinities with the “Ancient Ancestral South Indians” (AASI).

At this point, I keep my own counsel. But there may be an interesting story to tell related to how efficient and effective different forms of agriculture were, and how that interplayed with genes and language.

September 20, 2018

Indic civilization came to Southeast Asia because Indian people came to Southeast Asia. Lots of them

I’m reading Indonesia: Peoples and Histories. I selected it because unlike many books it wasn’t incredibly skewed to the early modern and postcolonial period. The author makes the interesting point that the Islamicization of western Indonesia and the rise of the great Javanese Hindu kingdom of Majapahit occurred around the same time. This is in contrast to the skein of Indic civilization which had been layered over maritime Southeast Asia for hundreds of years before the medieval period, starting around 500 AD with polities such as that of Kalingga.

As is usual in these sorts of books, it is emphasized that Indian civilization spread through cultural diffusion (in contrast to the fact that though Chinese trade was evident and present early on, the cultural impact was minimal). Any migrations are dismissed as legends, with the possible exception of a few elite religious functionaries.

I now believe this is wrong. I’ve discussed this extensively in the past, but the Singapore Genome Variation Project (SGVP) data set along with more Southeast Asians allows me to illustrate rather clearly the issues. The short of it is that it is highly likely that substantial South Asian ancestry exists within Southeast Asia, and that that ancestry is not just a function of colonial contact (e.g., as certainly occurred in Malaysia).


Merging the various data sets together I got 172,000 SNPs. The initial PC plot shows that Southeast Asian populations exist on a cline to Indians (these are Tamils from the SGVP). The Burmese and Malays in particular have a wide distribution toward the Indians, indicative of a range of ancestry due to continuous admixture. I separated the SGVP Malays into two groups: Malay, and MalayMix. The MalayMix are those Malays who are more shifted toward the Indians, and like the Burmese show wide variance. The Malays proper form a more straightforward cluster, shifted toward Indians more as a group.


Zooming in you see that Malays (not MalayMix) are not too different from Cambodians, but are slightly shifted toward Papuans. Filipino samples are similar, but further from Indians. Please note that Malaysia and the Philippines both are somewhat shifted toward the Papuans, and these are two nations where there are still extant Negrito populations (in contrast to Cambodia).

Groups like Lahu, Dai, Koreans, and the Dayak samples from Borneo I put in there partly because I assumed they would be less admixed with South Asians.


Running the samples in an admixture model with K = 5, the results are pretty clean even in unsupervised mode. Part of this is that I did do some outlier analysis and pruning ahead of time.

The Melanesian sample has admixture from something that is maximized in Filipinos and the Borneo samples. This is clearly Austronesian. Notice that the Melanesian samples don’t have any other Southeast Asian ancestry. This indicates that the cosmopolitan nature of some Austronesian groups in maritime Southeast Asia was due to later admixture. In particular, I accept the argument of Lipson et al. that there was an Austro-Asiatic substrate that was absorbed by incoming Austronesians.

Because I was very particular about sample selection, the Indians are nearly fixed for their modal ancestral component. Notice which groups don’t have the Indian ancestry in Southeast Asia: the Borneo samples. Additionally, the frequency in the Philippines may be due to European ancestry. Notice that in the Filipino samples the more diverse individuals tend to have more Indian ancestry, perhaps indicative of cosmopolitanism.

The Lahu and Dai do not have any of the Indian modal ancestry, suggesting that this was not present when the Southeast Asians arrived.

The Cambodians have the Indian modal ancestry, as do many of the Malays. The MalayMix population has a lot, as expected. They are rather like the Burmese samples in that way. Some of the Malays don’t have Indian ancestry though. I think this may be due to the reality that the Malay population is actually cosmopolitan in origin, absorbing Indians, Chinese, and Orang Asli groups, the last of which may not have had Indian ancestry.


Next I ran some Treemix. Cambodians and MalayMix have affinity with Indians, as you’d expect. The Malay group gets gene flow from the Borneo population, and is positioned rather closer to Indians.

Here are some f3-statistics, at least those with a z-score less than -2.

out p1 p2 f3 z
Burm Korea Indian -0.00371314 -40.1063
Burm Dai Indian -0.00368793 -36.4354
Burm Lahu Indian -0.00363462 -33.3115
Burm Borneo Indian -0.00297696 -30.3724
Burm Filipino Indian -0.00222445 -24.3581
Burm Korea Papuan -0.00243075 -19.9711
Burm Dai Papuan -0.00213815 -15.6106
Burm Malay Indian -0.00133932 -15.233
Burm Korea NAN_Melanesian -0.00158736 -12.6428
Burm Cambodia Indian -0.000991136 -10.9863
Burm Lahu Papuan -0.00185199 -10.8255
Burm Dai NAN_Melanesian -0.0011808 -8.7863
Burm Lahu NAN_Melanesian -0.00136834 -8.35811
Burm Korea MalayMix -0.000470731 -7.64052
Burm Borneo Papuan -0.00105531 -7.04586
Burm Korea Cambodia -0.000388278 -6.74484
Cambodia Dai Indian -0.00166543 -19.5634
Cambodia Borneo Indian -0.00135571 -16.9002
Cambodia Lahu Indian -0.00106449 -10.9303
Cambodia Dai Papuan -0.00128886 -9.86858
Cambodia Borneo Papuan -0.000607278 -4.5826
Cambodia Dai NAN_Melanesian -0.000449035 -3.69865
Cambodia Lahu Papuan -0.000455081 -2.64151
Filipino Borneo Papuan -0.000462553 -3.8874
Filipino Borneo NAN_Melanesian -0.000325208 -3.54648
Malay Filipino Cambodia -0.000763086 -32.6034
Malay Borneo Indian -0.0020853 -29.1425
Malay Borneo Cambodia -0.000613918 -26.5048
Malay Borneo Papuan -0.00223031 -20.037
Malay Dai Indian -0.00136434 -14.4879
Malay Borneo NAN_Melanesian -0.00131484 -14.4241
Malay Dai Papuan -0.00188121 -13.6787
Malay Filipino Indian -0.000850623 -12.4534
Malay Borneo Burm -0.000447661 -11.0181
Malay Dai NAN_Melanesian -0.00122082 -10.1649
Malay Lahu Indian -0.000658295 -6.56147
Malay Filipino Papuan -0.00061669 -6.52747
Malay Borneo MalayMix -0.000237474 -5.75298
Malay Lahu Papuan -0.000942326 -5.35136
Malay Lahu NAN_Melanesian -0.000755618 -5.0158
Malay Korea Papuan -0.000473046 -3.65977
Malay Dai MalayMix -8.93679E-05 -2.12082
MalayMix Borneo Indian -0.00469843 -45.6919
MalayMix Filipino Indian -0.00377864 -39.6124
MalayMix Dai Indian -0.00412557 -35.9643
MalayMix Malay Indian -0.0028506 -33.0568
MalayMix Lahu Indian -0.00345861 -28.1738
MalayMix Korea Indian -0.00281846 -23.528
MalayMix Borneo Papuan -0.00322593 -21.9346
MalayMix Cambodia Indian -0.00192058 -19.5189
MalayMix Dai Papuan -0.00302494 -19.2884
MalayMix Borneo NAN_Melanesian -0.00208153 -15.6894
MalayMix Dai NAN_Melanesian -0.00213561 -14.4382
MalayMix Filipino Papuan -0.0019272 -14.0354
MalayMix Korea Papuan -0.00198522 -12.4299
MalayMix Cambodia NAN_Melanesian -0.00114701 -11.2074
MalayMix Malay Papuan -0.00123309 -10.69
MalayMix Cambodia Papuan -0.00119651 -10.578
MalayMix Lahu Papuan -0.00212514 -10.5372
MalayMix Malay NAN_Melanesian -0.00100416 -9.70624
MalayMix Lahu NAN_Melanesian -0.0017095 -9.61884
MalayMix Korea NAN_Melanesian -0.00120984 -7.54544
MalayMix Filipino NAN_Melanesian -0.000920147 -6.96775
MalayMix Borneo Burm -0.000446434 -5.966
MalayMix Filipino Cambodia -0.000336794 -5.33937
MalayMix Filipino Burm -0.000279165 -4.31016
MalayMix Burm Malay -0.000236247 -4.15308
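For readers who have not met the statistic before, here is a minimal sketch of what an f3 value measures, assuming the first column of the table is the target population and p1 and p2 are the two candidate sources. The function name and the toy allele frequencies are made up for illustration. Real implementations also apply a finite-sample correction and get the z-score from a block jackknife, both of which this sketch leaves out.

```python
from statistics import mean


def naive_f3(target, source1, source2):
    """Naive f3(target; source1, source2) from per-SNP allele frequencies.

    Averages (c - a) * (c - b) across SNPs. A clearly negative value is the
    classic signal that the target is a mixture of populations related to
    the two sources. No bias correction or jackknife is done here.
    """
    if not (len(target) == len(source1) == len(source2)):
        raise ValueError("frequency lists must have the same length")
    return mean((c - a) * (c - b) for c, a, b in zip(target, source1, source2))


# Hypothetical allele frequencies at four SNPs in two source populations.
source_a = [0.1, 0.8, 0.3, 0.6]
source_b = [0.9, 0.2, 0.7, 0.0]
# A toy 50/50 mixture of the two sources.
mixed = [0.5 * a + 0.5 * b for a, b in zip(source_a, source_b)]

print(naive_f3(mixed, source_a, source_b))   # negative: mixture signal
print(naive_f3(source_a, mixed, source_b))   # positive: no mixture signal
```

The point of the toy data is only the sign: the mixed population gives a negative value when it is the target, and a positive one when it is wrongly placed as a source.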

No big surprises.

I’m trying to get rolloff to work on one of the Reich lab datasets, but it isn’t working (it says there are not enough SNPs, but the file has 350,000!). I need to establish the admixture date. Perhaps I’ll look to use fineStructure?

Note that this paper shows that of 125 male Cambodians, 9 of them carry R1a1a. This is unlikely to come from French, and Cambodia, unlike Malaysia, doesn’t have a colonial Indian community.

More to come….

July 9, 2018

The main interesting thing about Bangladeshi genetics is how East Asian Bangladeshis are

Filed under: Bangladeshi Genetics,South Asian Genetics — Razib Khan @ 11:29 pm


I got a question about endogamy and Bangladeshis on one of my other weblogs, as well as their relatedness to western (e.g., Iranian) and eastern (e.g., Southeast Asian) populations. Instead of talking, what do the data say? Most of you have probably seen me write about this before, but I think it might be useful to post again for Google (or Quora, as Quora seems to like my blog posts as references).

The 1000 Genomes project collected samples from a whole lot of Bangladeshis in Dhaka. The figure at the top shows that the Bangladeshis overwhelmingly form a relatively tight cluster that is strongly shifted toward East Asians. There is one exception: about five individuals, several of whom were collected right after each other (their sample IDs are sequential), who show almost no East Asian shift.

This to me is very strange.

Looking at other Bengali samples, whether it be a Kayastha and/or Brahmins from West Bengal, there tends to be a noticeable East Asian shift. The Brahmins though are mostly genetically similar to Brahmins from further north and west, with a minority of their ancestry probably indigenous Bengali, judging by the fact that they usually have less than 5% East Asian ancestry (depending on your metric). I have one Bengali Brahmin in the sample. You can see it as the outlier shifted to the Northwest Indian/Pakistani populations. This individual has very little East Asian shift. In contrast, the West Bengal Kayasthas, a typical “middle caste”, look similar to the Bangladeshi samples, except they have a lot less East Asian ancestry. In other words, they would plot between the Bangladeshi cluster and the other South Asian populations.

I suspect that the individuals with no East Asian ancestry may be from one of the Telugu migrant Dalit communities which settled in Dhaka during the British period.

There are some other East Bengalis on the plot that I added from the SAGP. Four of them are from Comilla. Though now a city, traditionally this region encompassed the area to the south and east of Dhaka, to the border with Tripura. Two of these individuals are my parents. I also added several from Sylhet, which borders Assam to the north (Sylhet people speak a language that is unintelligible to speakers of standard Bengali, similar to the people of Chittagong and the Rohingya). Two things to note.


First, the Comilla individuals are found in the most East Asian shifted portion of the distribution. This suggests, along with the position of West Bengalis, that the eastern ancestry in Bengalis exhibits a west to east cline. My father is somewhat atypical, in that he is shifted out of the main Bangladesh cluster ever so slightly. A genealogical fact though is that his maternal grandfather was reputedly from a Bengali Brahmin family (more likely, looking at the attenuated extent of the skew, his mother’s paternal grandfather was a Brahmin).

Second, the Sylhet individuals seem to have something of a shift to Northwest Indians and Pakistanis. But the individual who sent me these data noted that several of the individuals have family records and memory of partial descent from Muslims from Afghanistan and such. That seems likely looking at their position.

Using Treemix, it is notable to me that both the Sylhet and Comilla groups show gene flow more directly from the Dai than the Bangladeshis more generally. I think this is likely an artifact…but there is some slight structure in the Bangladesh population which is probably being missed. The eastern ancestry in Bengalis probably comes from both Austro-Asiatic and Tibeto-Burman people, and this fraction must vary across the region (or reflect normal variation as part of Mendelian segregation).


Earlier I said the Bangladeshi population is relatively unstructured. Click the Treemix plot above. Or check this Admixture run at K = 5. The Bangladeshi sample has only modest quantitative differences in comparison to most of the other South Asians. To the left are plots of Telugus sampled in the UK, along with South Indian Brahmins. Notice the relatively large range of variance. This is not atypical in sampling from South Asian populations. You see the same pattern in the Gujaratis sampled from Houston, and the Punjabis sampled from Lahore. The partial exception here is the Tamils from Sri Lanka sampled in the UK. There are a small number of individuals who cluster with Dalit groups, but far fewer than among the Telugus. Why? I suspect that panmixia is somewhat further along in Sri Lankan Tamil populations.

So what’s going on in eastern Bengal? One thing to note is that Muslim Punjabi populations seem to have a huge amount of genetic variation, on par with what you see in Indian populations. The relatively well-mixed character of eastern Bengalis isn’t just a function of caste-less Islam. Bengali Muslims are no more strictly Muslim than Muslims from Punjab. In fact, the stereotype arguably goes in the other direction. Additionally, the variation in East Asian ancestry in Bangladeshis is significant, but aside from what are likely scheduled caste (Dalit) groups which may descend from Indian colonial migrants, I suspect that range in quantum is probably mostly due to geography. The only group of Bangladeshi genotypes where I’ve seen a higher East Asian fraction than my own is of individuals from Chittagong, which is entirely expected based on history and geography.

Based on LD decay the admixture between the East and South Asian components in modern Bangladeshis dates to about 52 generations ago. That’s 1,300 years ago assuming 25 year generation times. A single pulse admixture is a better fit than two distinct events. Because of the range of physical appearance in my family, from mildly East Asian looking (I have family members who can easily pass for Malay or Filipino, at least judging by the languages people speak to them in cosmopolitan areas) to not very East Asian looking, I had assumed that a great deal of the admixture was recent due to proximity (several branches of my family lived in the princely state of Tripura; my grandmother was almost killed by a rampaging elephant owned by the Maharani of Tripura). So, I was surprised that my parents both had about the same amount of East Asian ancestry (~15%). This is not entirely shocking, but consider that my mental model of the admixture process was similar to that of African Americans. So if you are a black American, and your parents both turn out to be ~15% European, rather than say 12% vs. 18%, you have to start wondering. So when the LD decay estimate suggested an older, but singular, admixture I was not entirely surprised.
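To make the dating arithmetic concrete, here is a small sketch. The first function is just the generations-to-years conversion used above. The second assumes the textbook single-pulse model, in which admixture LD between two markers decays as exp(-n × d) for n generations and a genetic distance of d Morgans; that is my shorthand for how such estimates work in general, not the exact method behind the 52-generation figure. Function names are hypothetical.

```python
import math


def generations_to_years(generations, years_per_generation=25):
    """Convert an admixture date in generations to years."""
    return generations * years_per_generation


def expected_admixture_ld(distance_morgans, generations, amplitude=1.0):
    """Expected admixture LD under a single pulse: amplitude * exp(-n * d)."""
    return amplitude * math.exp(-generations * distance_morgans)


print(generations_to_years(52))  # 1300

# Older admixture means LD has decayed further at the same genetic distance.
for n in (10, 52):
    print(n, expected_admixture_ld(0.01, n))  # markers 1 cM apart
```

Fitting the decay rate of that curve across many marker pairs is what gives the number of generations; a recent pulse leaves long-range LD, an old one does not.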

In The Rise Of Islam And The Bengal Frontier 1204-1760 the author presents a model whereby the collapse of Hindu rule in Bengal in the 13th century was coincident with the emergence of a frontier society which expanded the zone of intensive agriculture through reclamation projects. Though the expansion and settlement was directed by Muslims of originally West Asian provenance (Turanian Turks and Afghans), the settlers themselves were peasants who spoke the proto-Bengali language. Curiously, both West Bengalis and Burmese individuals have told me that there is a belief that the indigenous people of Bengal were of East Asian character. The LD decay statistics indicate that most of the admixture occurred well before the arrival of Islam in Bengal, but if settlers were already drawn from the eastern fringe of Indo-Aryan speech, then they would be more enriched for East Asian ancestry. That still leaves one to explain the west-to-east cline of East Asian ancestry even within Bangladesh (East Bengal). That is probably due to secondary admixture, combined with further gene flow from the west diluting the original admixture signal.

Going back to the original question about the affinities of Bengalis to western and eastern populations: there is a northwest to southeast gradient of “Ancestral North Indian” (Iranian farmer + Indo-Aryan) ancestry in South Asia, and that is evident in Bengal. But Bengalis clearly have a substantial minority ancestral component from Eastern Eurasia, probably via Austro-Asiatic and Tibeto-Burman tribes. Though some Bengalis have a small proportion of West Asian ancestry that is distinct from what is found typically in South Asians, that’s about an order of magnitude less significant than the East Asian ancestry.

All of this was pretty clear about five years ago. The more genotypes I get, the more clear and obvious the above assertions are.

January 28, 2018

Punjabi genetic variation in 1000 Genomes: Hindu caste in the Land of the Pure?

Filed under: Genetics,Punjabi,South Asian Genetics — Razib Khan @ 8:58 pm

In the 1000 Genomes, there is a Punjabi dataset. Here is the description:

These cell lines and DNA samples were prepared from blood samples collected in Lahore, Pakistan. The samples are from a mix of parent- adult child trios and unrelated individuals who identified themselves and their parents as Punjabi.

A few years ago I did an analysis of the population structure in the 1000 Genomes dataset. In the Chinese data, there seemed to be some curious structure (there were two clusters of South Chinese). But the biggest issues predictably were in the South Asians. To give concrete examples, there were a few Brahmins in the Telugu data. A subset of Tamils and Telugus were highly ASI shifted. The Gujarati were highly heterogeneous, and one subcluster was almost certainly Patels (the samples were collected in Houston). The ASI shifted groups were almost certainly Scheduled Castes (Dalits) because I could see that they clustered with those samples from the Estonian Biocentre dataset.

There was something curious about the samples from Pakistan and Bangladesh. Aside from a small number of individuals, whose samples were collected at the same time judging by their IDs (these individuals cluster with Scheduled Castes), the Bangladeshi sample didn’t have much South Asian style structure. That is, there wasn’t a cline or lots of substructure within the ethnicity.

As noted by some commenters, the Punjabi samples were very different. Like the Gujarati samples, there was a huge variance along the ANI-ASI cline. To me, this was somewhat surprising. To make the 1000 Genomes more useful I used PCA and divided both Gujaratis and Punjabis into groups based on their position on the ANI-ASI cline, so that ANI_1 is the subpopulation with the most ANI and ANI_4 the least.
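A rough sketch of that kind of binning, assuming each individual already has a single score along the cline (say, a PC coordinate oriented so that higher means more ANI) and assuming equal-sized rank-based bins. Both are assumptions made for illustration; the actual cutoffs could just as well be drawn by eye from the plot. The sample IDs and scores below are invented.

```python
def bin_along_cline(scores, n_bins=4, prefix="ANI"):
    """Label samples by rank along a cline; bin 1 has the highest scores.

    scores: dict mapping sample ID -> position on the cline.
    Returns a dict mapping sample ID -> label such as "ANI_1".
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    labels = {}
    for rank, sample in enumerate(ranked):
        # Integer arithmetic splits the ranking into n_bins equal slices.
        bin_index = rank * n_bins // len(ranked) + 1
        labels[sample] = f"{prefix}_{bin_index}"
    return labels


# Hypothetical cline positions for eight individuals.
example_scores = {f"sample{i}": 1.0 - 0.1 * i for i in range(8)}
example_labels = bin_along_cline(example_scores)
for sample, label in sorted(example_labels.items()):
    print(sample, label)
```

Each resulting group can then be treated as its own population label in downstream runs.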

Using Treemix produced some weird results. As you can see above Punjabi_ANI_1 looks like an Iranian population with gene flow from Punjabi_ANI_3. Punjabi_ANI_2 looks like a North Indian population with Iranian gene flow (so it is more ASI). Punjabi_ANI_3 are less ANI shifted than Uttar Pradesh Brahmins, but more than Uttar Pradesh Kshatriya. Finally, Punjabi_ANI_4 actually is very similar to Punjabi_ANI_2, except it has gene flow from a Dalit-like population.

With the South Asian Genotype Project I have a few Punjabi samples. All of them are within Punjabi_ANI_1.

I don’t know what’s going on here. Is this really caste-like structure in Punjab? Or are we seeing lots of admixture among people who are called “Punjabi” today? For example, the gene flow edges suggest lots of mixing between quite South Asian types of groups and an Iranian sort. Perhaps this is the absorption of Pathans into South Asian groups? Could it be Muhajir people who mixed with local Punjabis and identified as such?

I was curious to see if I could find something similar in relation to the three Jatts. As you can see with Treemix, no. Jatts are just very ANI-shifted. I added Lithuanians and Georgians, and you can see that Uttar Pradesh Brahmins get gene flow from a Lithuanian shifted group, while South Indian Brahmins have a more Georgian gene flow. This is just an artifact I suspect of the fact that South Indian Brahmins have a lot of admixture from non-Brahmin South Indians, who are more Georgian than Lithuanian (Iran_N as opposed to Yamnaya).

Finally, going back to the Bengali (Bangladeshi) vs. Punjabi contrast, it is really interesting. If Punjab has such deep caste-like structures it really goes to show how within South Asia caste is a very very powerful institution, and ~1,000 years of Muslim rule and in western Punjab a majority Muslim population did not break down the institution. In contrast, in Bangladesh, there doesn’t seem to be much caste structure. I am routinely the most East Asian shifted Bengali in datasets, but my family is also from the eastern edge of eastern Bengal. Why the difference?

In The Rise of Islam and the Bengal Frontier the author posits that the Islamicization of eastern Bengal was to a great extent the function of the opening up of lands for cultivation under the supervision of Muslim elites under the rule of Afghans and later Mughals. This would explain the lack of caste structure because presumably, caste structure would be difficult to maintain in a frontier landscape, where the cultural elite does not promote or accept caste (though the elite West Asian Muslims were racially exclusive, they were also a very small minority).

In contrast, the Punjab has long been settled by Indo-Aryan peoples, and despite its long history of Islam, it was not recently a frontier society.

Anyway, that’s all I got to say for that. I’m sure readers will have more insight on this pattern than I do….

January 25, 2018

How South Asian populations relate to each other

Filed under: Population genetics,South Asian Genetics — Razib Khan @ 12:45 am

Since people have been asking me about this, and I’m running the South Asian Genotype Project, I thought I would post two non-PCA visualizations of how various South Asian groups relate to each other (along with a few outgroups).

The radial plot above is a neighbor-joining tree visualized from pairwise Fst statistics (basically a proxy for genetic distance).
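For anyone curious what goes into such a tree, here is a sketch of the input: a table of pairwise Fst values. I’m assuming Hudson’s estimator computed as a ratio of sums across SNPs, which is one common choice and not necessarily the one used for the plot above. The population names, allele frequencies, and sample sizes are invented, and the neighbor-joining step itself is left to a phylogenetics package.

```python
def hudson_fst(p1, p2, n1, n2):
    """Hudson's Fst between two populations, as a ratio of sums over SNPs.

    p1, p2: per-SNP allele frequencies in each population.
    n1, n2: number of sampled chromosomes in each population.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(p1, p2):
        numerator += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        denominator += a * (1 - b) + b * (1 - a)
    return numerator / denominator if denominator > 0 else float("nan")


def pairwise_fst(freqs, sizes):
    """Return {(pop_x, pop_y): Fst} for every unordered pair of populations."""
    names = sorted(freqs)
    return {
        (x, y): hudson_fst(freqs[x], freqs[y], sizes[x], sizes[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }


# Hypothetical frequencies at three SNPs; PopA and PopB are close, PopC is distant.
toy_freqs = {
    "PopA": [0.1, 0.5, 0.9],
    "PopB": [0.2, 0.5, 0.8],
    "PopC": [0.9, 0.5, 0.1],
}
toy_sizes = {"PopA": 100, "PopB": 100, "PopC": 100}
toy_matrix = pairwise_fst(toy_freqs, toy_sizes)
for pair, value in toy_matrix.items():
    print(pair, round(value, 4))
```

The tree-building step only ever sees this distance table, which is why it is a summary of the data and not a fitted model.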

I also used Treemix to generate a plot. You see patterns similar to the one above, though the two methods are different. Treemix tests a bunch of models and sees how the data fit those models. The visualization of Fst is just a way of representing the summary statistic.

I added 5 migration edges to the plot to the right. Not sure if they add anything, but you can see that some of the nodes move around because they are so mixed.

January 20, 2018

South Asian Genotype Project, update

Filed under: South Asian Genetics,South Asian Genotype Project — Razib Khan @ 10:08 pm

I’ve been working on the South Asian Genotype Project. Again, if you are interested: send me a 23andMe, Ancestry, or Family Tree DNA raw genotype file to contactgnxp -at- gmail.com.

In the subject please put:

  1. “South Asian Genotype Project”
  2. The state/province your family is from
  3. Ethnolinguistic group
  4. If applicable, caste

I changed the reference populations because the earlier ones were too complicated. You can see the population averages from public data sets for some groups.  The results for project members are here. I re-ran everyone who has sent data in so far. I’ll leave commentary for later.

At this point, I think the easiest way to update project members is to create a mailing list. If you have submitted genotypes, please join:

Subscribe to the South Asian Genotype Project


January 6, 2018

South Asian Genotype Project update

Just a quick update. I know I haven’t been responsive, but I’ve been traveling and spending time with the family and working a lot for the past few weeks. I’m going to make some revisions to my pipeline as well. I will get back to generating results soon (as in a week or so). So please keep sending data to contactgnxp@gmail.com.

June 16, 2011

Present genetic variation is a weak guide to past genetic variation

I’ve been harping on and on for the past few years that the patterns of contemporary genetic variation are probably only weakly tied to past patterns of genetic variation (though Henry Harpending warned me about this as far back as 2004). A major reason that scholars operated under this presupposition is the axiom that most of the variation we see around us crystallized during the Last Glacial Maximum (~20 thousand years before the present).

This may be true in some cases, but I doubt it is true in most cases. I was pointed to a classic case of this problem just today. A reader alerted me to a short paper from this spring which attempts to ascertain the point of origin of the dominant mtDNA haplogroup among the Onge tribe of the Andaman Islanders, M31a1. This is an interesting issue because some researchers proposed, plausibly in the past, that these indigenous people in the Andaman Islands represent the descendants of the first wave “Out of Africa,” who took the rapid “beachcomber” path. Understanding the key to their genetics may then unlock the key to the “Out of Africa” event. Or ...

March 12, 2011

Harappa Ancestry Project @ N ~ 50

Zack Ajmal now has over 50 participants in the Harappa Ancestry Project. This does not include the Pakistani populations in the HGDP, the HapMap Gujaratis, or the Indians from the SGVP. Nevertheless, all these samples still barely cover the vast heart of South Asia, the Indo-Gangetic plain. Here is the provenance of the submitted samples Zack has so far:

Punjab: 7; Iran: 7; Tamil: 6; Bengal: 5; Andhra Pradesh: 2; Bihar: 2; Karnataka: 2; Caribbean Indian: 2; Kashmir: 2; Uttar Pradesh: 2; Sri Lankan: 2; Kerala: 2; Iraqi Arab: 2; Anglo-Indian: 1; Roma: 1; Goa: 1; Rajasthan: 1; Baloch: 1; Unknown: 1; Egyptian/Iraqi Jew: 1; Maharashtra: 1

Again, note the underrepresentation of two of India’s most populous states, Uttar Pradesh, ~200 million, and Bihar, ~100 million. Nevertheless, there are already some interesting yields from the project. Below I’ve reedited Zack’s static images (though go to his website for something more dynamic) with the labels of individuals. I’ve highlighted myself and my parents with the red pointers.

To the left is a set of plots and tables which I’ve spliced together from Zack’s various posts. What you need to know is that this is at K = 12, and I’ve used the labels that Zack gave the various putative “ancestral populations” which emerged out ...

January 24, 2011

Harappa Ancestry Project, update

Last week I announced the Harappa Ancestry Project. It now has its own dedicated website, http://www.harappadna.org. Additionally, it has its own Facebook page. For Zack to get his own URL he needs about 10 more “likes,” so please like it! (if you are so disposed) Finally, from what I’ve heard the first wave of the 23andMe holiday sale results are coming online this week. Actually, one of the relatives who I purchased the kit for is in processing currently, so I know that we should have a bunch of new people in the system very, very, soon.

Speaking of people, last I heard Zack had gotten about a dozen responses. That’s enough to start an initial round of runs, but obviously he needs more people. More importantly, the goal here is to get better population coverage. One of the things we know intuitively and also from the most current research is the existence of a lot of within-region population variation in South Asia which is structured by community. In other words, a sample of 30 people, where you have 3 from 10 different communities exhibiting geographical and ...

August 10, 2010

PCA, Razib around the world (a little)

I have put up a few posts warning readers to be careful of confusing PCA plots with real genetic variation. PCA plots are just ways to capture variation in large data sets and extract out the independent dimensions. It’s great at detecting population substructure because the largest components of variation often track between population differences, which consist of sets of correlated allele frequencies. Remember that PCA plots usually are constructed from the two largest dimensions of variation, so they will be drawn from just these correlated allele frequency differences between populations which emerge from historical separation and evolutionary events. Observe that African Americans are distributed along an axis between Europeans and West Africans. Since we know that these are the two parental populations this makes total sense; the between population differences (e.g., SLC24A5 and Duffy) are the raw material from which independent dimensions can pop out. But on a finer scale one has to be cautious because the distribution of elements on the plot as a function of principal components is sensitive to the variation you input to generate the dimensions in the first place.
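A toy illustration of that point, written as a sketch with made-up genotypes (0, 1, or 2 copies of an allele at six SNPs). It finds the first principal component by power iteration on mean-centered data, using only the standard library. Real tools also scale each SNP by its allele frequency and extract many components at once, which this leaves out; the function names are hypothetical.

```python
import math
import random


def center_columns(matrix):
    """Subtract each SNP's mean genotype from that column."""
    n = len(matrix)
    means = [sum(column) / n for column in zip(*matrix)]
    return [[g - m for g, m in zip(row, means)] for row in matrix]


def first_pc_scores(genotypes, iterations=200, seed=0):
    """Score each individual on the first principal component."""
    x = center_columns(genotypes)
    n_snps = len(x[0])
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(n_snps)]  # random starting direction
    for _ in range(iterations):
        scores = [sum(g * w for g, w in zip(row, v)) for row in x]  # X v
        v = [sum(s * row[j] for s, row in zip(scores, x)) for j in range(n_snps)]  # X^T X v
        norm = math.sqrt(sum(w * w for w in v))
        if norm == 0:
            raise ValueError("genotype matrix has no variation")
        v = [w / norm for w in v]
    return [sum(g * w for g, w in zip(row, v)) for row in x]


# Two individuals from each of two diverged populations, plus one admixed person.
toy_genotypes = [
    [2, 2, 0, 0, 2, 0],  # population 1
    [2, 2, 0, 0, 1, 0],  # population 1
    [0, 0, 2, 2, 0, 2],  # population 2
    [0, 0, 2, 2, 1, 2],  # population 2
    [1, 1, 1, 1, 1, 1],  # admixed
]
toy_scores = first_pc_scores(toy_genotypes)
for label, score in zip(["pop1", "pop1", "pop2", "pop2", "admixed"], toy_scores):
    print(label, round(score, 3))
```

The two populations land on opposite sides of the axis and the admixed individual sits between them, which is the same logic as the African American example. The sign of the axis is arbitrary, and the picture would change if different individuals were fed in.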

I can give you a concrete example: me. I showed you my 23andMe ancestry painting yesterday. I didn’t show you my position on the HGDP data set because I’ve shared genes with others and I don’t want to take the step of displaying other peoples’ genetic data, even if at a remove. But, I have reedited some “demo” screenshots and placed where I am on the plot to illustrate what I’m talking about above. The first shot is my position on the two-dimensional plot of first and second principal components of genetic variation from the HGDP data set.

No surprise that I’m in the Central/South Asian cluster. But what may surprise you is that I’m not in the South Asian cluster, I’m in the Central Asian cluster. In the Central Asian cluster are Uyghurs and Hazaras. These are two hybrid populations, a mixture of West and East Eurasian elements. The Uyghurs are likely the outcome of a process of admixture between the Iranian and Tocharian Indo-European populations of the cities of the Tarim basin, and later Turkic speaking settlers who arrived in the wake of the expansion and later collapse of the first Uyghur Empire (the historical connection between the current Uyghurs and ancient Uyghurs is tenuous at best, and complicated). The Hazaras are a more recent population, likely emerging as the product of intermarriages between Mongol soldiers who arrived in the 13th century, and indigenous women, Persians, Turks, and assorted Indo-Iranian groups between the Zagros and Khyber Pass. It is somewhat ironic that I’m on the edge of the Hazara cluster since they are almost certainly in part descended from Genghis Khan’s family, and my own surname is Khan. But I know that my Y chromosomal lineage is R1a1, very common across Central and Southern Eurasia, and not a Mongolian one at all.

Zoom! Now we’ve constrained the input data set to the Central/South Asian groups. First, look at the Kalash. They’re strange, which is no surprise, they’re an inbred mountain group in Pakistan who have not adopted Islam. The Pakistani Taliban looks to be ending them as we speak. I really would prefer that they were just thrown out of the data set for this zoom view, because on this fine-grained scale I don’t think they add much at all. They’re just an example of what long term endogamy can do to your allele frequencies. The bigger picture is the axis between the populations of Pakistan, and those of Central Asia. Observe that I’ve changed position. Whereas when taking worldwide genetic variation into account I clustered with Central Asians, now I’m 2/3 of the way to the South Asian cluster. I will tell you that I’ve shared “genes” with around 50 South Asians now, from various parts of the subcontinent, and in the 23andMe plot they overlay the South Asians nearly perfectly. I’ve put labels at the approximate ethno-linguistic position. I’m an outlier. 23andMe tells me that I’m 43% “East Asian.” The typical South Asian is in the 10-30% range. My first assumption was that I have a lot of ancient South Indian, which just shows up as East Asian in their algorithm. With this in mind I tried sharing with a lot of South and East Indians, and found out two interesting points. First, South Indians seem no higher than 30-35% East Asian. Bengalis on the other hand are more East Asian, with Bangladeshis more East Asian than West Bengalis. My sample size for Bengalis is small, so take that with caution. Second, the PCA plots put the South Indians firmly in the South Asian cluster, but the Bengalis trail out toward my own position. This indicates again that different methods are telling you slightly different things. The PCA is only a thin slice of variation, but it’s highly informative of between-population differences.
A Bengali and a South Indian with the same “East Asian” fraction in the ancestry painting nevertheless have consistently different positions on the PCA, with Bengalis closer to the East Asians. Additionally, there’s an ethnic Persian in this zoom plot that I’m describing, and they are positioned near the Balochi. But on the worldwide plot they’re on the margins of the European cluster. Another illustration that the position of an element is sensitive to the input data because of how the dimensions are generated.
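The dependence on input data can be sketched with a small simulation. This is a toy under stated assumptions, not a reconstruction of the 23andMe or HGDP plots: the four populations ("far_1", "far_2", "near_1", "near_2"), their drift parameters, and the `separation` measure are all hypothetical names and numbers chosen for illustration. Two closely related groups are run through PCA twice, once alongside two strongly diverged groups (the "world" view) and once on their own (the "zoom" view).

```python
import numpy as np

rng = np.random.default_rng(1)
n_snps, n_per_pop = 3000, 40


def drifted(base, sd):
    """Allele frequencies that have drifted away from a shared ancestor."""
    return np.clip(base + rng.normal(0.0, sd, base.size), 0.02, 0.98)


def genotypes(freqs):
    """Draw diploid genotypes (0, 1 or 2 copies) for one population."""
    return rng.binomial(2, freqs, size=(n_per_pop, freqs.size)).astype(float)


base = rng.uniform(0.1, 0.9, n_snps)
panel = {
    "far_1": genotypes(drifted(base, 0.25)),   # strongly diverged
    "far_2": genotypes(drifted(base, 0.25)),   # strongly diverged
    "near_1": genotypes(drifted(base, 0.08)),  # mildly diverged
    "near_2": genotypes(drifted(base, 0.08)),  # mildly diverged
}


def top_two_pcs(names):
    """PCA on the chosen populations only; returns PC1/PC2 scores per group."""
    geno = np.vstack([panel[name] for name in names])
    centered = geno - geno.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :2] * s[:2]
    return {name: scores[i * n_per_pop:(i + 1) * n_per_pop]
            for i, name in enumerate(names)}


def separation(coords, a, b):
    """Centroid gap on the 2-D plot, in units of spread along that gap."""
    gap_vec = coords[b].mean(axis=0) - coords[a].mean(axis=0)
    gap = np.linalg.norm(gap_vec)
    axis = gap_vec / gap
    spread = np.mean([np.std(coords[a] @ axis), np.std(coords[b] @ axis)])
    return gap / spread


world = top_two_pcs(["far_1", "far_2", "near_1", "near_2"])
zoom = top_two_pcs(["near_1", "near_2"])

sep_world = separation(world, "near_1", "near_2")
sep_zoom = separation(zoom, "near_1", "near_2")
print(f"near_1 vs near_2 on the world plot: {sep_world:.1f}")
print(f"near_1 vs near_2 on the zoom plot:  {sep_zoom:.1f}")
```

In the world view the first two components are spent on the strongly diverged groups, so the two related groups nearly overlap; once the panel is restricted to them, the same genotypes pull apart. Nothing about the individuals changed, only the data used to build the axes.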

Blaine Bettinger, who inspired me to post this, told a story with his ancestry painting which was plausible. What can I say? First, I have less than 1% African ancestry. This could be noise. But, I do observe that the South Asians with Muslim names are enriched in the set of those who I’ve shared genes with and who have less than 1%, but not 0%, African ancestry. Just as Muslim South Asians have non-trivial West Asian ancestry, I suspect that many of us have Sub-Saharan African ancestry through the same dynamic. Sub-Saharan African soldiers were prominent across South Asia with the arrival of Muslims. Bengal even had a period of rule by Abyssinian rulers. But the bigger issue for me is the East Asian component. Here is a figure from a paper published 4 years ago:


The figure shows Fst values comparing Indian Americans with Europeans and East Asians. Fst measures between-population differences in allele frequency, in this case the alleles being 207 indels. Take a look at the Bengalis. These are West Bengalis, who I believe have a lesser East Asian component, but even there the allele frequency difference to East Asians is near that of Europeans. The Assamese, who speak a language very close to Bengali, are similar. Assam was ruled by a Tibeto-Burman people for nearly 600 years. The Oriya speakers, from the southwest of Bengal, are more distant from East Asians. As one goes south and east, and west and north, the distance from East Asians increases. This shouldn’t be that surprising, but nice to confirm. The fact that the genetic distance increases as one goes south means that for northeast South Asia you need to complexify the model from a two-way admixture with “ancient North Indians” and “ancient South Indians.” Set next to these two is an East Asian element, which is also clear in the Indo-Aryan peoples of Nepal.
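For readers who want to see what an Fst number summarizes, here is a minimal sketch in Python with NumPy. It uses a simple heterozygosity-based estimator for two equally weighted populations, Fst = (H_T - H_S) / H_T, summed across loci before taking the ratio. That estimator is my assumption for illustration; the paper's exact method is not described here, and the function name `fst_two_pops` and the allele frequencies below are made up.

```python
import numpy as np


def fst_two_pops(freqs_a, freqs_b):
    """Fst for two populations from per-locus allele frequencies.

    H_S is the average expected heterozygosity within each population,
    H_T is the expected heterozygosity of the pooled population. Both are
    summed over loci before the ratio is taken.
    """
    p_a = np.asarray(freqs_a, dtype=float)
    p_b = np.asarray(freqs_b, dtype=float)
    p_bar = (p_a + p_b) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    h_within = (2.0 * p_a * (1.0 - p_a) + 2.0 * p_b * (1.0 - p_b)) / 2.0
    return (h_total.sum() - h_within.sum()) / h_total.sum()


# Hypothetical frequencies at five loci, for illustration only.
group_x = [0.10, 0.40, 0.55, 0.80, 0.30]
similar_to_x = [0.12, 0.38, 0.60, 0.78, 0.33]
different_from_x = [0.60, 0.05, 0.90, 0.35, 0.75]

fst_similar = fst_two_pops(group_x, similar_to_x)
fst_different = fst_two_pops(group_x, different_from_x)
print(f"Fst against the similar group:   {fst_similar:.4f}")
print(f"Fst against the different group: {fst_different:.4f}")
```

Identical frequencies give an Fst of 0 and fixed differences give 1, so smaller values mean the two groups share more of their allele frequency profile.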

Of course anyone who knows Bengalis won’t be totally surprised by an East Asian component to their ancestry. To the left are head shots of the two women who have dominated Bangladeshi politics for the past two decades, Khaleda Zia and Sheikh Hasina. They’re both Bengalis, but they do look different, and I know many people who look like one or the other (or a combination). My family is from one of the easternmost districts of Bengal, next to Tripura. In fact my late maternal grandmother lived in Tripura for some of her childhood (she was almost trampled to death by the Maharani of Tripura’s insane elephant as a young girl!). When I was a young child I once saw a black and white photo from my father’s college days, and I was curious who the Asiatic looking young man in the middle of the photograph was. Turns out it was my father! Sometimes our expectations affect how we perceive people. I have never perceived my father to have an Asian cast to his features as a more mature man, but others have told me that he does still exhibit them.

There is still the question of how Bengalis came to have this particular admixture. I think the most plausible scenario probably synthesizes conventional village-to-village intermarriage and isolation-by-distance, along with some component of migrationism. Tribes such as the Chakma have left Burma in historical time. The Chakma of Bangladesh now speak a dialect of Bengali, not their ancestral Sino-Tibetan tongue. I believe that a non-trivial portion of Bengalis have ancestors who were tribal people who shifted their religious identity to that of Hinduism or Islam (from Theravada Buddhism in the case of the Chakma, or animism in the case of the Garos before their Christianization). But eastern South Asia is adjacent to mainland Southeast Asia, and it stands to reason that continuous gene flow would over time also have introduced East Asian alleles into the Bengali gene pool.

Image Credit: TopNews.in
