Razib Khan One-stop-shopping for all of my content

June 15, 2011

The Cape Coloureds are a mix of everything

A Cape Coloured family

I’ve mentioned the Cape Coloureds of South Africa on this weblog before. Culturally they’re Afrikaans in language and Dutch Reformed in religion (the possibly related Cape Malay group is Muslim, though also traditionally Afrikaans speaking). But racially they’re a very diverse lot. In this way they can be analogized to black Americans, who are ~75% West African and ~25% Northern European, with the variance in ancestral proportions being such that ~10% are ~50% or more European in ancestry. The Cape Coloureds though are much more complex. Some of their ancestry is almost certainly Bantu African. This element is related to the West African affinities of black Americans. And, they have a Northern European element, which likely came in via the Dutch, German, and Huguenot settlers (mostly males). But the Cape Coloureds also have other contributions to their genetic heritage. Firstly, they have Khoisan ancestry, whether from Bushmen or Khoi. This is well known in their oral memory. The hinterlands of the Cape of Good Hope are beyond the ecological range of the Bantu agricultural toolkit, so the region was still dominated ...

April 23, 2011

Resolutions in the Indian genetic layer cake

Filed under: Genetics,Genomics,Indian Genetics,Indian genomics — Razib Khan @ 7:54 pm

Two years ago Reconstructing Indian Genetic History reframed how we should view South Asian historical genomics. In short, Indians can be viewed as a hybrid between a West Eurasian group, “Ancestral North Indians” (ANI) and a very different group, “Ancestral South Indians” (ASI), which had distant connections to West and East Eurasians. At least to a first approximation. Last fall I posted on a new paper which surveyed the Austro-Asiatic speaking peoples of India, and concluded that they were exogenous to the subcontinent. This is an interesting point. Prehistoric treatments of South Asia often use linguistic terms to denote putative ancient populations. One model is that first it was the Munda, the most ancient Austro-Asiatics. Then the Dravidians. And finally the Indo-Aryans. These genetic data imply that the Munda arrived after the initial ANI-ASI synthesis. The Munda people of India can be thought of as ANI-ASI, with an overlay of East Eurasian ancestry.

Zack Ajmal’s K = 11 ADMIXTURE run has highlighted some further issues. He has a set of Austro-Asiatic samples, as well as a host of Indo-Aryan and Dravidian speaking populations. I now believe we can further clarify and refine our model of the peopling ...

March 12, 2011

Harappa Ancestry Project @ N ~ 50

Zack Ajmal now has over 50 participants in the Harappa Ancestry Project. This does not include the Pakistani populations in the HGDP, the HapMap Gujaratis, or the Indians from the SGVP. Nevertheless, all these samples still barely cover the vast heart of South Asia, the Indo-Gangetic plain. Here is the provenance of the submitted samples Zack has so far:

Punjab: 7
Iran: 7
Tamil: 6
Bengal: 5
Andhra Pradesh: 2
Bihar: 2
Karnataka: 2
Caribbean Indian: 2
Kashmir: 2
Uttar Pradesh: 2
Sri Lankan: 2
Kerala: 2
Iraqi Arab: 2
Anglo-Indian: 1
Roma: 1
Goa: 1
Rajasthan: 1
Baloch: 1
Unknown: 1
Egyptian/Iraqi Jew: 1
Maharashtra: 1

Again, note the underrepresentation of two of India’s most populous states, Uttar Pradesh, ~200 million, and Bihar, ~100 million. Nevertheless, there are already some interesting yields from the project. Below I’ve reedited Zack’s static images (though go to his website for something more dynamic) with the labels of individuals. I’ve highlighted myself and my parents with the red pointers.

To the left is a set of plots and tables which I’ve spliced together from Zack’s various posts. What you need to know is that this is at K = 12, and I’ve used the labels that Zack gave the various putative “ancestral populations” which emerged out ...

January 24, 2011

Harappa Ancestry Project, update

Last week I announced the Harappa Ancestry Project. It now has its own dedicated website, http://www.harappadna.org. Additionally, it has its own Facebook page. For Zack to get his own URL he needs about 10 more “likes,” so please like it! (if you are so disposed) Finally, from what I’ve heard the first wave of the 23andMe holiday sale results are coming online this week. Actually, one of the relatives who I purchased the kit for is in processing currently, so I know that we should have a bunch of new people in the system very, very soon.

Speaking of people, last I heard Zack had gotten about a dozen responses. That’s enough to start an initial round of runs, but obviously he needs more people. More importantly, the goal here is to get better population coverage. One of the things we know intuitively and also from the most current research is the existence of a lot of within-region population variation in South Asia which is structured by community. In other words, a sample of 30 people, where you have 3 from 10 different communities exhibiting geographical and ...

August 10, 2010

PCA, Razib around the world (a little)

I have put up a few posts warning readers to be careful of confusing PCA plots with real genetic variation. PCA plots are just ways to capture variation in large data sets and extract out the independent dimensions. It’s great at detecting population substructure because the largest components of variation often track between population differences, which consist of sets of correlated allele frequencies. Remember that PCA plots usually are constructed from the two largest dimensions of variation, so they will be drawn from just these correlated allele frequency differences between populations which emerge from historical separation and evolutionary events. Observe that African Americans are distributed along an axis between Europeans and West Africans. Since we know that these are the two parental populations this makes total sense; the between population differences (e.g., SLC24A5 and Duffy) are the raw material from which independent dimensions can pop out. But on a finer scale one has to be cautious because the distribution of elements on the plot as a function of principal components is sensitive to the variation you input to generate the dimensions in the first place.

I can give you a concrete example: me. I showed you my 23andMe ancestry painting yesterday. I didn’t show you my position on the HGDP data set because I’ve shared genes with others and I don’t want to take the step of displaying other people’s genetic data, even if at a remove. But, I have reedited some “demo” screenshots and placed where I am on the plot to illustrate what I’m talking about above. The first shot is my position on the two-dimensional plot of first and second principal components of genetic variation from the HGDP data set.

No surprise that I’m in the Central/South Asian cluster. But what may surprise you is that I’m not in the South Asian cluster, I’m in the Central Asian cluster. In the Central Asian cluster are Uyghurs and Hazaras. These are two hybrid populations, a mixture of West and East Eurasian elements. The Uyghurs are likely the outcome of a process of admixture between the Iranian and Tocharian Indo-European populations of the cities of the Tarim basin, and later Turkic speaking settlers who arrived in the wake of the expansion and later collapse of the first Uyghur Empire (the historical connection between the current Uyghurs and ancient Uyghurs is tenuous at best, and complicated). The Hazaras are a more recent population, likely emerging as the product of intermarriages between Mongol soldiers who arrived in the 13th century, and indigenous women, Persians, Turks, and assorted Indo-Iranian groups between the Zagros and Khyber Pass. It is somewhat ironic that I’m on the edge of the Hazara cluster since they are almost certainly in part descended from Genghis Khan’s family, and my own surname is Khan. But I know that my Y chromosomal lineage is R1a1, very common across Central and Southern Eurasia, and not a Mongolian one at all.

Zoom! Now we’ve constrained the input data set to the Central/South Asian groups. First, look at the Kalash. They’re strange, which is no surprise: they’re an inbred mountain group in Pakistan who have not adopted Islam. The Pakistani Taliban looks to be ending them as we speak. I really would prefer that they were just thrown out of the data set for this zoom view, because on this fine grained scale I don’t think they add much at all. They’re just an example of what long term endogamy can do to your allele frequencies. The bigger picture is the axis between the populations of Pakistan, and those of Central Asia. Observe that I’ve changed position. Whereas when taking world wide genetic variation into account I clustered with Central Asians, now I’m 2/3 of the way to the South Asian cluster. I will tell you that I’ve shared “genes” with around 50 South Asians now, from various parts of the subcontinent, and in the 23andMe plot they overlay the South Asians nearly perfectly. I’ve put labels at the approximate ethno-linguistic position.

I’m an outlier. 23andMe tells me that I’m 43% “East Asian.” The typical South Asian is in the 10-30% range. My first assumption was that I have a lot of ancient South Indian, which just shows up as East Asian in their algorithm. With this in mind I tried sharing with a lot of South and East Indians, and found out two interesting points. First, South Indians seem no higher than 30-35% East Asian. Bengalis on the other hand are more East Asian, with Bangladeshis more East Asian than West Bengalis. My sample size for Bengalis is small, so take that with caution. Second, the PCA plots put the South Indians firmly in the South Asian cluster, but the Bengalis trail out toward my own position. This indicates again that different methods are telling you slightly different things. The PCA is only a thin slice of variation, but it’s highly informative of between population differences. A Bengali and a South Indian with the same “East Asian” fraction in the ancestry painting nevertheless have consistently different positions on the PCA, with Bengalis closer to the East Asians.

Additionally, there’s an ethnic Persian in this zoom plot that I’m describing, and they are positioned near the Balochi. But on the world wide plot they’re on the margins of the European cluster. This is another illustration that the position of an element is sensitive to the input data because of how the dimensions are generated.

Blaine Bettinger, who inspired me to post this, told a story with his ancestry painting which was plausible. What can I say? First, I have less than 1% African ancestry. This could be noise. But, I do observe that the South Asians with Muslim names are enriched in the set of those who I’ve shared genes with and who have less than 1%, but not 0%, African ancestry. Just as Muslim South Asians have non-trivial West Asian ancestry, I suspect that many of us have Sub-Saharan African ancestry through the same dynamic. Sub-Saharan African soldiers were prominent across South Asia with the arrival of Muslims. Bengal even had a period of Abyssinian rule. But the bigger issue for me is the East Asian component. Here is a figure from a paper published 4 years ago:


The figure shows Fst values comparing Indian Americans with Europeans and East Asians. Fst measures between population differences in allele frequency, in this case the alleles being 207 indels. Take a look at the Bengalis. These are West Bengalis, who I believe have a lesser East Asian component, but even there the allele frequency difference to East Asians is near that of Europeans. The Assamese, who speak a language very close to Bengali, are similar. Assam was ruled by a Tibeto-Burman people for nearly 600 years. The Oriya speakers, from the southwest of Bengal, are more distant from East Asians. As one goes south and east, and west and north, the distance from East Asians increases. This shouldn’t be that surprising, but it is nice to confirm. The fact that the genetic distance increases as one goes south means that for northeast South Asia you need to complexify the model from a two-way admixture with “ancient North Indians” and “ancient South Indians.” Set next to these two is an East Asian element, which is also clear in the Indo-Aryan peoples of Nepal.

Of course anyone who knows Bengalis won’t be totally surprised by an East Asian component to their ancestry. To the left are head shots of the two women who have dominated Bangladeshi politics for the past two decades, Khaleda Zia and Sheikh Hasina. They’re both Bengalis, but they do look different, and I know many people who look like one or the other (or a combination). My family is from one of the easternmost districts of Bengal, next to Tripura. In fact my late maternal grandmother lived in Tripura for some of her childhood (she was almost trampled to death by the Maharani of Tripura’s insane elephant as a young girl!). When I was a young child I once saw a black and white photo from my father’s college days, and I was curious who the Asiatic looking young man in the middle of the photograph was. Turns out it was my father! Sometimes our expectations affect how we perceive people. I have never perceived my father to have an Asian cast to his features as a more mature man, but others have told me that he does still exhibit them.

There is still the question of how Bengalis came to have this particular admixture. I think the most plausible scenario probably synthesizes conventional village-to-village intermarriage and isolation-by-distance, along with some component of migrationism. Tribes such as the Chakma have left Burma in historical time. The Chakma of Bangladesh now speak a dialect of Bengali, not their ancestral Sino-Tibetan tongue. I believe that a non-trivial portion of Bengalis have ancestors who were tribal people who shifted their religious identity to that of Hinduism or Islam (from Theravada Buddhism in the case of the Chakma, or animism in the case of the Garos before their Christianization). But eastern South Asia is adjacent to mainland Southeast Asia, and it stands to reason that continuous gene flow would over time also have introduced East Asian alleles into the Bengali gene pool.

Image Credit: TopNews.in

August 6, 2010

Strange genetic variation in South Asia

Filed under: Genetics,Genomics,Indian Genetics,Indian genomics — Razib Khan @ 12:11 am

Dienekes has a post up where he highlights the fact that the recent paper on South Asian metabolic diseases has a figure which elucidates population structure within the region. Accounting for structure is important for genome-wide associations since you might get spurious correlations if trait value/disease frequency is simply tracking cryptic population variation. Dienekes says:

The existence of two clusters is kind of obvious, while their interpretation is not as dots of the same color appear in both clusters: a placement of these individuals in a global context might have been useful here. Things are clearer at the top cluster which shows a clear gradient anchored by Punjabi Sikh and Hindu Tamils on either end.

Also of interest is the group of isolated Muslim/Christian individuals on the left which deviate strongly from the mainstream; these probably represent exogenous elements that don’t resemble the bulk of the Indian population.

The second issue is easily addressed. The Christian outliers both give English as their native language. That suggests to me that they’re Anglo-Indian, a community of mixed South Asian and European origin. South Asian Muslims are overwhelmingly of indigenous origin. But, a minority of the Muslim elite are West Asian, or have substantial West Asian ancestry, as is evident by the fact that they look white. Benazir Bhutto’s mother was of Kurdish and Persian ethnic background (her family was from Esfahan in Iran). I’ve reedited the religious & linguistic PC plots to fit onto the screen.


So what’s going on with the cluster which extends along the second principal component? The first component is probably just a European/West Asian-South Asian axis of variation. But I don’t understand where the variation for the second is coming from. Observe that the one South Indian group, Tamil speakers, are not represented in the secondary cluster. The plot reminded me of something I saw last fall.

Below is figure S4 from the supplements of Reconstructing Indian population history. I added some labels. The Indian cluster is tight when the genetic variation includes non-Indian groups. But, when you constrain the variation to Europeans and South Asians only, something strange happens:

The Gujarati sample is from Houston, and is from HapMap Phase 3. I have a suspicion that the secondary cluster among the Gujaratis here is of the same class of phenomenon as the secondary cluster in the first plot. The Anglo-Indians and West Asian Muslims serve as rough proxies for Europeans, and you have an expected European-South Asian axis. But you also have this strange orthogonal component. I had assumed that the plot from the Reich et al. paper was an anomaly, but I’m not so sure seeing the second paper.

July 23, 2010

One principal component to rule them all?

Despite the reality that I’ve cautioned against taking PCA plots too literally as Truth, unvarnished and without any interpretive juice needed, papers which rely on them are almost magnetically attractive to me. They transform complex patterns of variation which you are not privy to via your gestalt psychology into a two or at most three dimensional representation which you can grok immediately. That is why History and Geography of Genes was so engrossing. You recognize patterns which were otherwise unrecognizable. But how you interpret those patterns, that’s a wholly different matter. And how those patterns arise is also not something one can ignore.

First, let’s start with an easy case. To the left is a PCA plot with four populations. Nigerians, East Asians (Chinese + Japanese), Europeans (whites from Utah), and finally, African Americans. The x-axis is the first principal component of variation, and the y-axis the second. That means that the x-axis is the independent dimension of variation within the patterns of genetic data which explains the largest fraction of the total amount of genetic variation. The sum totality of the variation can be decomposed into a large set of independent dimensions which can be rank ordered from the largest explanatory components to the smaller ones, successively by number. In a human genetic context the first principal component invariably separates Africans from non-Africans, and the second principal component often maps onto a west-east axis from Europe to the New World. Subsequent principal components can often be useful in smoking out fine scale distinctions, or relationships which are confused by the existence of similar but different signals in admixed populations.

The interpretation of this plot is rather easy. You see that African Americans lie along a continuum between Nigerians and Europeans, skewed toward Nigerians, with some outliers toward East Asians. We know from other genetic findings that ~20% of the African American ancestral quantum is European, but that quantum is not equally distributed across the population. ~10% of the African American population is more than 50% European in ancestry, while 90% is less than 50% European. And so you have a distribution which reflects this variation. As for the outliers, I will speculate and suggest that these are indications of Native American ancestry among some African Americans.
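To make the "rank ordered dimensions" and the admixture continuum concrete, here is a minimal sketch in Python with NumPy. Everything in it is an assumption for illustration: the allele frequencies are made up, the populations `pop_a` and `pop_b` are hypothetical stand-ins for two parental groups, and every admixed individual is given the same 80/20 ancestry split, which real populations do not have. It is not the analysis behind the plot.

```python
import numpy as np

def pca(genotypes):
    """Return (sample coordinates, variance share) for every principal component."""
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)                  # center each SNP column
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    explained = s**2 / np.sum(s**2)         # singular values come sorted, largest first
    return U * s, explained

rng = np.random.default_rng(0)              # fixed seed, only for repeatability

# Made-up allele frequencies at 200 SNPs for two hypothetical parental populations.
freq_a = np.array([0.9] * 100 + [0.1] * 100)
freq_b = 1.0 - freq_a

def sample(freqs, n):
    """Draw n diploid genotypes (0, 1, or 2 copies of an allele per SNP)."""
    return rng.binomial(2, freqs, size=(n, len(freqs)))

pop_a = sample(freq_a, 10)
pop_b = sample(freq_b, 10)
# Admixed individuals: 80% of ancestry from A, 20% from B (an assumed proportion).
admixed = sample(0.8 * freq_a + 0.2 * freq_b, 10)

coords, explained = pca(np.vstack([pop_a, pop_b, admixed]))
pc1 = coords[:, 0]

print("variance share of PC1 and PC2:", explained[:2].round(3))
print("mean PC1 for A, B, admixed:",
      pc1[:10].mean().round(2), pc1[10:20].mean().round(2), pc1[20:].mean().round(2))
```

In this toy setup the admixed group's mean on the first component should fall between the two parental means and closer to A, which is the same qualitative picture as the continuum described above. The sign of each component is arbitrary, so only relative positions matter.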

The story I presented above is probably plausible as an explanation of the visual because we have a wealth of historical data to corroborate the plausibility of that narrative. The fit between the results from the technique of analysis of genetic variation and what scholars have long inferred from textual sources is relatively easy. It is far more difficult to look at a PCA plot, and generate a plausible narrative that you yourself accept with a high degree of confidence with little external support. It is with that caveat in mind that I present Toward a more uniform sampling of human genetic diversity: A survey of worldwide populations by high-density genotyping:

High-throughput genotyping data are useful for making inferences about human evolutionary history. However, the populations sampled to date are unevenly distributed, and some areas (e.g., South and Central Asia) have rarely been sampled in large-scale studies. To assess human genetic variation more evenly, we sampled 296 individuals from 13 worldwide populations that are not covered by previous studies. By combining these samples with a data set from our laboratory and the HapMap II samples, we assembled a final dataset of ~ 250,000 SNPs in 850 individuals from 40 populations. With more uniform sampling, the estimate of global genetic differentiation (FST) substantially decreases from ~ 16% with the HapMap II samples to ~ 11%. A panel of copy number variations typed in the same populations shows patterns of diversity similar to the SNP data, with highest diversity in African populations. This unique sample collection also permits new inferences about human evolutionary history. The comparison of haplotype variation among populations supports a single out-of-Africa migration event and suggests that the founding population of Eurasia may have been relatively large but isolated from Africans for a period of time. We also found a substantial affinity between populations from central Asia (Kyrgyzstani and Mongolian Buryat) and America, suggesting a central Asian contribution to New World founder populations.

The studies which came out of the original HapMap had northern Europeans, Yoruba from Nigeria, and Chinese & Japanese. These three populations can tell us a lot, but there’s something lacking in the coverage. The HGDP sample is better. But specifically because of political considerations it was not feasible to collect Indian samples, so Pakistani ones are used in their stead. Additionally, the HGDP sample is a touch biased toward isolated and distinctive populations, such as the Kalash of Pakistan. This genetic distinctiveness is important to catalog because it is fast disappearing. But the Kalash are so unique because of their long history of isolation, so one can’t really use them as a proxy population for Pakistanis, as one could with Sindhis. The POPRES sample seems to complement the HGDP well, but I don’t see it being used so much. Since the next phase of the HapMap has more populations, some of the deficiencies which emerged with the utilization of just three terminal groups (in a World Island context) will soon no longer be an issue.

But until that time it’s nice when studies come out which close some of the gaps in our knowledge of world wide genetic variation. This is one such study. I’m somewhat familiar with the samples already because I’ve seen it in an analysis of Indian populations. It seems that it is somewhat skewed toward South and Southeast Asian populations, but hey, these are groups which need to draw the long straw sometimes as well.

Before I go any further I should mention that they use a SNP-chip with hundreds of thousands of markers. Additionally, they looked at copy number variation. Two rather different types of variation within the genome, probably to double check that the outcomes were the same. Population historical events which shape patterns of genomic variation would presumably have a similar large scale effect on both types of variation. In their results that checked out, or so they claimed, as the paper is a manuscript without the supplements attached.

Though there’s some interesting fine-grained analysis to be had, they draw some macro-scale and deep time inferences as well. First, you probably know the famous fact that 15% of variation in genes is between races, and 85% within races. That’s derived from the Fst statistic, which is basically partitioning between and within population variance across two populations. Obviously the value of Fst varies by the set of populations you’re comparing. That between Mbuti Pygmies and Japanese is far higher than between Chinese and Japanese. Using the HapMap the Fst was 16%. About what you’d expect. To equalize sample sizes with the HapMap they randomly selected individuals from a pooled set grouped by continent from their populations, and calculated Fst. They found values around 11%. Why the difference? Because their data set included populations which were between the three clusters within the HapMap.
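The effect of adding "in between" populations on Fst can be sketched with a few lines of Python. This uses one common textbook form of the statistic, (H_T - H_S) / H_T with expected heterozygosity 2p(1-p) and equal weight per population; it is an assumption that this is close enough to illustrate the point, and it is not necessarily the estimator used in the paper. The allele frequencies are invented.

```python
def expected_het(p):
    """Expected heterozygosity at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)

def fst(freqs_by_pop):
    """Fst as (H_T - H_S) / H_T, summed over loci.

    freqs_by_pop: one list of allele frequencies per population,
    with loci in the same order in every list.
    """
    n_pops = len(freqs_by_pop)
    n_loci = len(freqs_by_pop[0])
    numerator = 0.0
    denominator = 0.0
    for locus in range(n_loci):
        ps = [pop[locus] for pop in freqs_by_pop]
        h_s = sum(expected_het(p) for p in ps) / n_pops   # mean within-population
        h_t = expected_het(sum(ps) / n_pops)              # pooled population
        numerator += h_t - h_s
        denominator += h_t
    return numerator / denominator

# Three hypothetical "terminal" groups, each distinctive at one of three loci.
terminal = [
    [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.1, 0.9],
]
# The same three plus made-up populations sitting halfway between each pair.
with_intermediates = terminal + [
    [0.5, 0.5, 0.1],
    [0.1, 0.5, 0.5],
    [0.5, 0.1, 0.5],
]

fst_terminal = fst(terminal)
fst_mixed = fst(with_intermediates)
print(f"terminal groups only: {fst_terminal:.3f}")
print(f"with intermediates:   {fst_mixed:.3f}")
```

The pooled allele frequencies are identical in the two sets, so the total heterozygosity does not change; what changes is that the intermediate populations carry more variation within themselves, which shrinks the between-population share.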

This is naturally not a surprising result at all, but it does reiterate one issue which sometimes crops up: Platonism in relation to race. The northern European whites in the HapMap are the whites par excellence. Turks, who are perhaps more centrally located in the genetic variation of West Eurasian and North African peoples, what used to be termed “Caucasoid,” are “less white.” Similarly, Nigerians are more African than Ethiopians. Chinese and Japanese are more Asian than Burmese. And so forth. When modeling between group differences there is I think a somewhat old-fashioned tendency to consider some populations racial archetypes. That modulates the input which modifies the results somewhat. The analytical techniques may be as cold as stone, but they are used by flesh and blood human beings.

There is also some funny business going on with haplotype and SNP heterozygosities which I think needs to be highlighted, and speaks to the fact that SNP-chips are not perfect. They’re tools, and human tools are impacted by arbitrary or instrumental choices humans make. Let me quote:

We also compared the SNP and haplotype heterozygosity values in each population (Figure 2B). These two quantities are generally highly correlated, although there are several exceptions: First, SNP heterozygosity is higher than haplotype heterozygosity in European and Central Asian populations. This may reflect a SNP ascertainment bias, since many of these polymorphisms were historically selected to maximize heterozygosity in European populations. Second, the Pygmy sample shows a low SNP heterozygosity despite relatively high haplotype heterozygosity. This unusual pattern could be caused by stronger effects of SNP ascertainment bias in this population than in others. Indeed, a recent study of Khoisan individuals (another hunter-gatherer group from Africa) showed a similar pattern: despite high SNP heterozygosity (~60%) in whole-genome sequence data, a Khoisan individual showed low heterozygosity on the SNP microarray genotypes (~22%) . Alternatively, this difference could also reflect unique attributes of population history.

In plain English the gene chips were designed with Europeans in mind, so they don’t necessarily pick up all the variation in non-European groups, who are believe it or not genetically different. This issue cropped up (as alluded to in the above text) with the recent paper which sequenced some Bushmen as well as Desmond Tutu. The Bushmen have a lot of variation, this is well known, but they have variation at markers where Europeans don’t, and if Europeans don’t the chips may not look for polymorphism at that locus. This sort of thing probably doesn’t affect broad population relationships, but if you want to zoom in and do analysis which is sensitive to fine distinctions and quantitative differences, then it might be problematic.
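A toy calculation shows how ascertainment alone can flip a diversity ranking. The sketch below is purely illustrative: the eight SNPs and their allele frequencies are invented, `design` and `other` are hypothetical population labels, and the selection rule (keep a SNP if its heterozygosity in the design population is at least 0.3) is an assumption standing in for however a real chip was built.

```python
def het(p):
    """Expected heterozygosity at a biallelic SNP with allele frequency p."""
    return 2.0 * p * (1.0 - p)

# (frequency in the design population, frequency in the other population)
snps = [
    (0.50, 0.50), (0.40, 0.30),                               # variable in both
    (0.00, 0.50), (0.02, 0.40), (0.00, 0.30), (0.01, 0.45),   # variable mostly in "other"
    (0.50, 0.05), (0.45, 0.10),                               # variable mostly in "design"
]

def mean_het(snp_list, index):
    """Average heterozygosity over a SNP list for one population (0 or 1)."""
    return sum(het(s[index]) for s in snp_list) / len(snp_list)

# The hypothetical chip keeps only SNPs that are highly variable in the design population.
chip = [s for s in snps if het(s[0]) >= 0.3]

all_design, all_other = mean_het(snps, 0), mean_het(snps, 1)
chip_design, chip_other = mean_het(chip, 0), mean_het(chip, 1)

print(f"all SNPs:  design {all_design:.3f}, other {all_other:.3f}")
print(f"chip SNPs: design {chip_design:.3f}, other {chip_other:.3f}")
```

Across all eight SNPs the other population is the more diverse one, but on the four chip SNPs it looks less diverse than the design population, simply because the SNPs where it varies most were never put on the chip.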

Let’s jump to the pretty charts. First, a PCA plot with all of the individuals from all of the populations:


Note that PC 1 accounts for nearly eight times as much variation as PC 2. This speaks to the African vs. non-African gap. Because their data set is relatively thick in “intermediate” groups you see a spectrum. The vertical axis is obviously mostly east-west. And here’s the accompanying bar plot derived from the ADMIXTURE program. K = putative ancestral populations.


With this many populations at K = 12 I think you could write a fantasy novel worthy of Tolkien. K = 4 is more realistic. Among the African populations you see likely Eurasian admixture in some eastern, and it seems Bushmen, individuals. In Eurasia itself you see a clinal gradation of admixture between putative ancestral components that seems to follow longitude rather well.

Because so much of the variation in the total sample is due to Africans, removing them from the picture will allow us to focus more on the relationships of the Eurasian groups. And so that’s exactly what they did. Note that focusing on the Eurasian groups does not mean simply magnifying or zooming in on the Eurasian section of the PCA plot, rather, the plots are regenerated with a subset of the previous genetic variation. In other words, the dimensions will shake out a bit differently.
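The difference between zooming in and regenerating the plot can be sketched in Python with NumPy. The setup is invented: `out`, `east`, and `west` are hypothetical groups with made-up allele frequencies, chosen so that the outgroup differs from everyone at 200 SNPs while the other two groups differ from each other only at a separate 100 SNPs.

```python
import numpy as np

def pc_coords(genotypes, n_components=2):
    """Project samples onto the top principal components of this data set only."""
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)                  # center on the samples actually included
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

rng = np.random.default_rng(1)              # fixed seed, only for repeatability

# Made-up frequencies: SNPs 0-199 set the outgroup apart,
# SNPs 200-299 differ only between the east and west groups.
freq_out = np.array([0.9] * 200 + [0.5] * 100)
freq_east = np.array([0.1] * 200 + [0.7] * 100)
freq_west = np.array([0.1] * 200 + [0.3] * 100)

def sample(freqs, n=10):
    """Draw n diploid genotypes (0, 1, or 2 copies of an allele per SNP)."""
    return rng.binomial(2, freqs, size=(n, len(freqs)))

out, east, west = sample(freq_out), sample(freq_east), sample(freq_west)

full = pc_coords(np.vstack([out, east, west]))   # rows 0-9 out, 10-19 east, 20-29 west
subset = pc_coords(np.vstack([east, west]))      # rows 0-9 east, 10-19 west

# Gaps between group means on the first component of each analysis.
gap_out = abs(full[:10, 0].mean() - full[10:, 0].mean())
gap_full = abs(full[10:20, 0].mean() - full[20:, 0].mean())
gap_subset = abs(subset[:10, 0].mean() - subset[10:, 0].mean())

print(f"full data, PC1: outgroup vs rest gap  = {gap_out:.2f}")
print(f"full data, PC1: east vs west gap      = {gap_full:.2f}")
print(f"subset only, PC1: east vs west gap    = {gap_subset:.2f}")
```

With the outgroup included, the first component is spent on the outgroup contrast and barely distinguishes east from west. Once the outgroup is removed and the components are recomputed, the east-west difference becomes the first component, even though no genotype has changed.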

The first plot shows Eurasian populations as a whole. The second removes Europeans and Near Easterners.


Notice again the scale. The vast majority of the variance seems to be east-west. But, there is a noticeable north-south split. For the South Asian population it looks like they had Pakistanis who were farmers of modest means (Arain), high caste South Indians, and very low caste or tribal South Indians. For this Indian sample there’s a problem, and it’s the same problem which plagued the Up Series: they are looking at the very top and bottom of Indian society and ignoring the middle. Presumably the middle is going to be somewhere in the middle genetically as well, but nevertheless that’s something to consider in a paper which presumes to fill in the patchiness of others. In contrast, the Nepali sample was notably ethnically diverse, including both the dominant Indo-Aryan segment as well as the Tibeto-Burman Newar.

In the first panel there are some curious patterns with the Southeast Asian groups. Culturally, as in language and history, the Thai and Vietnamese have relatively recent roots in the southern regions of modern China. The Dai of Yunnan are the same people in origin as the Thai of Thailand and the Lao of Laos. Both derive from migrations from Yunnan. This is historically attested, even if somewhat fragmentarily. The heartland of the Vietnamese was in the Red River valley and north into southern China, and they spread down the coast and toward the Mekong only within the last 1,000 years. Southeast Asia was not uninhabited during this period. It was dominated by the Khmer Empire, which was slowly consumed by the expanding Thai and Vietnamese polities. Some scholars argue that French colonialism actually preserved an independent Khmer nation, which otherwise would have been divided between Thailand and Vietnam, as Poland was between Germany and Russia. So the Khmer are the indigenous people, while the Thai and Vietnamese are intrusive.

What do the PCA plots tell us? I do not know where the Vietnamese samples were collected. If they were from South Vietnam, then their close position to the Chinese suggests to me that there was substantial demographic replacement or expansion from the Red River valley. In contrast, the Thai are relatively distant from the Chinese. In fact, the Cambodians are somewhat closer to the Chinese! The samples here are small, and the sets overlap, so I wouldn’t put too much stock in that. But, Thailand is geographically closer to South Asia, so isolation by distance models would predict this pattern. It seems that the ethnogenesis of the Thai occurred through the expansion of the Thai identity, likely among Khmer peoples. And it is intriguing that the Iban, an indigenous people of Borneo, are closer to the Vietnamese than they are to the Cambodians. We know that there was substantial migration between coastal Vietnam and Maritime Southeast Asia; the Chams of central Vietnam, who were dominant in the southern half of the nation before the Vietnamese expansion, are a Malayan people who may have migrated from Borneo.

Shifting to the second panel there’s more here to say about the South Asians. First, geography. The two lower caste groups are actually Dalits from Andhra Pradesh, a South Indian state. Dalits used to be called outcastes, so they aren’t even lower caste, but without caste. The upper caste groups are Brahmins from Andhra Pradesh and Tamil Nadu. Finally, the Irula are tribal people from Tamil Nadu. To me the tribal samples often produce weird results, and I suspect that has to do with population bottlenecks and their demographic isolation. People leave the tribes (becoming part of the Hindu society, or converting to Islam or Christianity), but few join them. The Pakistani sample is Arain, a group of conventional Punjabi farmers who have a made-up ancestry from Arabs (obviously made up because they don’t cluster with Near Easterners). Let’s compare to a chart from Reich et al.:


It seems to me that they’re in rough agreement (Reich et al. use the same two low caste groups from Andhra Pradesh for low caste South Indians, by the way). Though South Indian Brahmins speak South Indian languages, and reside amongst other South Indian groups, their genetic heritage is somewhat different. Similarly, tribal peoples are also distinct from caste Hindus. Reich et al. posit that South Asians can be modeled as a composite of two groups, Ancestral North Indians, ANI, and Ancestral South Indians, ASI. Presumably the former are intrusive to the subcontinent in relation to the latter. There seem to be two clear dimensions along which the ratio of ANI to ASI varies: geography and caste. The proportion of ASI seems to increase from the northwest to the southeast. And the proportion of ANI seems to increase from tribal to low caste to upper caste. The Pakistani sample does not seem to be from an elite caste (or it does not seem they were converted from an elite caste), but they have more affinity with West Eurasian populations than South Indian Brahmins do. It is likely that the latter are intrusive to the south, and have admixed with the local population.
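To make the "composite of two groups" idea concrete, here is a minimal sketch of how a mixture proportion could be estimated from allele frequencies. It assumes a simple linear model in which an allele's frequency in a mixed group is a weighted average of its frequencies in the two source populations. This is not the method of Reich et al., who rely on more elaborate statistics; the function name and the toy frequencies are made up for illustration.

```python
def estimate_mixture_fraction(p_mix, p_a, p_b):
    """Least-squares estimate of alpha in p_mix = alpha * p_a + (1 - alpha) * p_b.

    Each argument is a list of allele frequencies, one entry per SNP.
    This is a toy linear model, not the estimator used in any published paper.
    """
    # Rearranged model: (p_mix - p_b) = alpha * (p_a - p_b), fitted across SNPs.
    numerator = sum((m - b) * (a - b) for m, a, b in zip(p_mix, p_a, p_b))
    denominator = sum((a - b) ** 2 for a, b in zip(p_a, p_b))
    if denominator == 0:
        raise ValueError("source populations have identical frequencies")
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical allele frequencies at five SNPs (invented numbers).
    source_a = [0.10, 0.80, 0.55, 0.30, 0.95]
    source_b = [0.60, 0.20, 0.45, 0.70, 0.40]
    true_alpha = 0.4
    mixed = [true_alpha * a + (1 - true_alpha) * b
             for a, b in zip(source_a, source_b)]
    print(estimate_mixture_fraction(mixed, source_a, source_b))
```

With real data the estimate would be noisy, since drift after the mixture event and sampling error both push observed frequencies off the line. That is part of why small or bottlenecked samples, like the tribal ones, can behave oddly.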

Finally, a word on the Nepali sample. On top of the ANI-ASI mixture, the Nepali groups have varying levels of Tibeto-Burman, and so East Asian, affinity. This is not a surprise if you have met Nepalis. The Assamese, and to a lesser extent Bengalis, also exhibit this pattern of Tibeto-Burman admixture. The Brahmins of Nepal are intrusive like the Brahmins of South India, and like the South Indians they admixed with the local substrate.

Next let’s move to an ADMIXTURE plot.


The selection of a particular K obviously is conditioned by the patterns which “fit” with what you know, and what you expect. With that caution aired, the population represented by red can easily be thought of as a Middle Eastern group which expanded with agriculture. That seems to be what the authors favor. The brown population is the modal Indian ancestral population, which has little presence outside the subcontinent (nice color coding by the way! Brown people are brown). The green component is one on which the tribal group, the Irula, is heavily weighted. This reminds me too much of the Kalash. I suspect that the Irula went through some bottleneck or other distinctive event, and some have assimilated to various low status groups in South India.

I’m not a fantasist intent on world-building, so I’ll stop with that in reading the tea leaves of the charts. But there’s an important section which I skipped over, and will move back to now. And that’s the deep time aspect:

A more likely explanation for the OoA bottleneck is that Eurasia was populated by a larger population that had been relatively isolated from other modern human populations for tens of thousands of years prior to the expansion. The first fossil evidence for modern humans outside of Africa is in the Middle East at Skhul and Qafzeh between 80,000-100,000 years ago, which is at least 20,000 years prior to the Eurasian diaspora. If a population of modern humans remained in the Middle East until the expansion into Eurasia, there would have been sufficient time for genetic drift to reduce heterozygosity dramatically before the Eurasia expansion. This “Middle East isolation” hypothesis provides a robust explanation for the relative homogeneity of European and Asian populations relative to African populations (see Figures 3A-B) and is supported by a recent maximum likelihood estimate of 140,000 years ago for the time of Eurasian-West African population separation. Interestingly, a recent study of the Neandertal genome suggests that the non-African individuals, but not the Africans, contain similar amount of admixture (1-4%) with the Neandertals. The authors suggest that the admixture must have happened between the Neandertals with an ancestral non-African population before the Eurasian expansion. Given the fossil, archaeological, and genetic evidence, the Middle East isolation hypothesis warrants rigorous evaluation as whole-genome sequence data become available.

Like the vast majority of genetic studies this work supports the Out of Africa hypothesis. Non-Africans are all branches from a specific African branch. Or more accurately, an African branch which left Africa. The reduction in heterozygosity, a measure of genetic variation, from Africans to Eurasians was large. Additionally, within Africa south of the Sahara there’s little difference in heterozygosity as a function of geography, but outside of Africa it drops off as a function of distance from Africa. A plausible model then is a radiation from a small ancestral population to the four corners of the world, going through a series of bottlenecks along the way. Or at least that’s a model supported by genomic data. But the drop in heterozygosity is so great that a quick separation from the parental African population would require an implausibly small number of founders (fewer than 10 in one generation). So, to explain the data, they are suggesting here that the original population was not quite so small, but was isolated from the large African population for thousands of years. They assume genetic drift reduced heterozygosity, but if the model is correct I suspect that the way it worked was that bottlenecks due to climatic fluctuations swept clean a lot of the genetic variation. But in the interregnum the isolated population may have interbred with Neandertals. In fact, perhaps they picked up genes from Neandertals when their own effective population was extremely small.
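The trade-off between founder number and isolation time can be sketched with the textbook result for drift in an idealized population of constant effective size N: expected heterozygosity decays by a factor of (1 - 1/(2N)) each generation. The snippet below is only an illustration under that assumption. The 10% loss figure and the function names are hypothetical, not values or methods taken from the paper.

```python
def heterozygosity_after(h0, n_e, generations):
    """Expected heterozygosity after drift in a constant-size ideal population."""
    return h0 * (1.0 - 1.0 / (2.0 * n_e)) ** generations


def size_for_retention(retained, generations):
    """Constant effective size at which a fraction `retained` of the
    starting heterozygosity survives the given number of generations."""
    per_generation = retained ** (1.0 / generations)
    return 1.0 / (2.0 * (1.0 - per_generation))


if __name__ == "__main__":
    retained = 0.90  # hypothetical: 10% of heterozygosity lost in total
    for generations in (1, 10, 100, 1000):
        n_e = size_for_retention(retained, generations)
        print(f"{generations:>5} generations -> effective size ~ {n_e:.1f}")
```

The same total loss that demands a handful of founders in a single generation is compatible with an effective size in the thousands if the isolation lasts on the order of a thousand generations, which is the logic behind the isolation argument. Real populations fluctuate in size, so a few sharp bottlenecks could do much of the work.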

In any case, a wide-ranging paper. They manage to tie their results into two other blockbuster papers.

H/T Dienekes

Citation: Xing J, Watkins WS, Shlien A, Walker E, Huff CD, Witherspoon DJ, Zhang Y, Simonson TS, Weiss RB, Schiffman JD, Malkin D, Woodward SR, & Jorde LB (2010). Toward a more uniform sampling of human genetic diversity: A survey of worldwide populations by high-density genotyping. Genomics. PMID: 20643205
