Razib Khan One-stop-shopping for all of my content

July 15, 2018

India vs. China, genetically diverse vs. homogeneous

Filed under: China,China genetics,Human Population Genetics,India,India Genetics — Razib Khan @ 1:50 pm

About 36% of the world’s population are citizens of the People’s Republic of China and the Republic of India. Including the other nations of South Asia (Pakistan, Bangladesh, etc.), 43% of the world’s population lives in China and/or South Asia.

But, as David Reich mentions in Who We Are and How We Got Here, China is dominated by one ethnicity, the Han, while India is a constellation of ethnicities. And this is reflected in the genetics. The relative diversity of India stands in contrast to the homogeneity of China.

At the current time, the best research on population genetic variation within China is probably the preprint A comprehensive map of genetic variation in the world’s largest ethnic group – Han Chinese. The authors used low-coverage sequencing of over 10,000 women to get a huge sample of variation all across China. The PCA recapitulated earlier work. Genetic relatedness among the Han of China is geographically structured. The largest component of variance is north-south, but a smaller component is also east-west. The north-south element explains more than 4.5 times as much variance as the east-west.

Another dimension of the variation is that different parts of China are characterized by different levels of admixture between the Han and other groups. In Northwest China, there is gene flow from West Eurasian sources. In all likelihood, this is through proxy populations, such as Mongols, who are about 10% West Eurasian. Also, during the period between the fall of the Han Dynasty and the rise of the Sui-Tang Dynasty, much of northern China was dominated by barbarian groups from the steppe, and these groups settled down and were absorbed. In Northeast China, the source of admixture is Siberian and Tungusic groups. Again, this makes geographical sense.

In contrast, in South China the gene flow is from indigenous Chinese national groups, such as the Dai. This is in keeping with the historical record, whereby South China became Han in the period between 0 and 1000 AD through migration, intermarriage, and acculturation.

I have my own small private dataset of Chinese individuals. Some with provenance. Some without. But using known populations I was able to divide China along the north to south cline: individuals from Guangdong in the south, those from Shaanxi in the north, and those from Zhejiang to Sichuan in the center.

Using Punjabis as a West Eurasian outgroup, I was able to plot these individuals on a PCA. On the plot, a substantial minority of the Han_N sample is shifted to the left. This is toward the Punjabis. This is not because they have Punjabi ancestry, but because Punjabis are reasonable proxies for West Eurasians.

More importantly, I want to compare South Asia to China. To do that I created a small dataset that merged the Han with representative South Asian groups. The first two PCs, 1 and 2, illustrate the contrast. All three Chinese groups, sampled from the north to the south, occupy a very tight cluster, while the South Asians span PC 2. The Bengalis are shifted a bit toward the Chinese, but most of the variance is due to within-South Asian genetic differences.

I ran PCA to 10 dimensions. Only at PC 10 did the Han Chinese separate along the north-south axis. Most of the earlier PCs separated out specific castes (e.g., Patels, because of their large number in the Gujarati sample, were PC 3). Here are the eigenvalues: 53.0682, 2.5641, 2.31876, 1.97058, 1.90652, 1.88879, 1.7935, 1.69375, 1.61516, and 1.54207. The large value for PC 1 is what you’d expect; it’s a continental scale difference. PC 2 differentiates South Asia from north to south. It’s much more modest. The other PCs get progressively smaller, but within the data it’s clear that the continental scale difference is the big one. The variance between north and south China is a small one on a South Asian scale.
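To put numbers on how lopsided those eigenvalues are, here is a minimal sketch that just does the arithmetic on the ten values listed above. The shares are relative to the sum of these ten eigenvalues only; the eigenvalues beyond PC 10 are not reported, so this is not the fraction of total variance.

```python
# Eigenvalues for PCs 1-10, copied from the post.
eigenvalues = [53.0682, 2.5641, 2.31876, 1.97058, 1.90652,
               1.88879, 1.7935, 1.69375, 1.61516, 1.54207]

# How many times larger PC 1 is than each later PC.
ratios_to_pc1 = [eigenvalues[0] / ev for ev in eigenvalues[1:]]

# Share of each PC relative to the sum of the ten listed eigenvalues
# (an assumption of convenience: the full spectrum is not available).
total_top10 = sum(eigenvalues)
share_of_top10 = [ev / total_top10 for ev in eigenvalues]

print(f"PC1 / PC2  = {ratios_to_pc1[0]:.1f}")
print(f"PC1 / PC10 = {ratios_to_pc1[-1]:.1f}")
print(f"PC1 share of the ten listed eigenvalues = {share_of_top10[0]:.1%}")
```

PC 1 comes out at roughly twenty times PC 2, and PC 10, the one that finally splits the Han north to south, is smaller still.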

Pairwise Fst is more ambiguous. That’s probably because most of the South Asian samples have structure within them. Merging them into one pooled population just confuses the issue.

Using a South Asian dataset where groups are disaggregated makes a lot more sense, and you see the structure between the different groups.

Running Treemix gives similar results. The South Asian groups exhibit a fan-shaped topology, whereas the Han cluster tightly together. Since I removed Bengalis from Treemix, adding migration edges doesn’t do anything between the two clusters, so I omitted those results.

Finally, of course I ran some admixture analysis. Using South Asians + Han Chinese, I thought K = 4 would be reasonable. Even if you don’t enlarge, the results are straightforward: the Han Chinese have very little diversity in unsupervised mode. A small South Asian-like component, which has affinities with Punjabis, is found in northern Han. This confirms other results with other methods that the northern Han have some West Eurasian gene flow. Some of the southern and central Han have an affinity with one of the South Indian clusters. I think this is artifactual, due to deep structure within Eastern Eurasian populations and affinities between those groups that the Han absorbed as they moved south.

This post doesn’t really tell us anything we didn’t already know. Rather, it’s just a review of what jumps out at anyone who works with genotype data: there is not very much genetic diversity in China and there is a great deal of genetic diversity in India. Why? These are not questions genetics can really answer directly, though it can give us clues and support certain models over others.

Anyone who has read much about Chinese history knows that the cultural ideal of meritocracy is deeply ingrained, even if it is honored in the breach quite often. Chinese civilization has been characterized by the domination of extended pedigrees (e.g., the Xianbei-Han ruling faction among the Tang), but those pedigrees never became ethno-religious castes. The exception occurred during the Yuan (Mongol) period, when Kublai Khan adopted a divide-and-rule policy. But that was a short period which had no longer-term cultural consequences.

In contrast, South Asia is characterized by long-term endogamy. This is not surprising to anyone who knows anything about South Asian history. The genetic evidence suggests that modern jati-barriers emerged around 2,000 years ago. Not only do South Asian groups differ a great deal in biogeographic ancestry (deep ancestry), but historical endogamy has resulted in further drift between these groups.

May 1, 2018

Hui have a lot of West Eurasian Y chromosomes

Filed under: China genetics,Hui — Razib Khan @ 1:26 am
O C R1a R1b R2 E1b G H I1 I2 J1 J2 L N Q T Total N
Han 258 12 2 2 2 1 1 2 1 1 7 9 2 300
Hui 24 7 21 1 9 1 3 1 1 4 1 11 1 3 14 4 106
Tibetan 49 11 18 1 1 3 3 3 3 7 1 100

It’s been a while since I checked in on the genetics of the Hui people. I found the paper, Analysis of 17 Y‐STR loci haplotype and Y‐chromosome haplogroup distribution in five Chinese ethnic groups. About 50% of the Hui Y chromosomal haplogroups are normally classified as “West Eurasian” (R, E, G, I and J). But curiously, a fraction of the Han have these too, as do some Tibetans.
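The "about 50%" figure can be checked against the Hui row of the table above. This is a quick sketch, not part of the paper's analysis. It uses only the Hui row because that row has one count for each of the sixteen haplogroup columns and can be read off unambiguously; the Han and Tibetan rows have fewer entries than columns, so their column positions cannot be recovered from the table as it appears here.

```python
# Hui row of the table above, one count per haplogroup column.
hui_counts = {
    "O": 24, "C": 7, "R1a": 21, "R1b": 1, "R2": 9, "E1b": 1, "G": 3, "H": 1,
    "I1": 1, "I2": 4, "J1": 1, "J2": 11, "L": 1, "N": 3, "Q": 14, "T": 4,
}

# Haplogroups the post treats as "West Eurasian": R, E, G, I and J.
west_eurasian_prefixes = ("R", "E", "G", "I", "J")

total = sum(hui_counts.values())
west_eurasian = sum(
    count for haplogroup, count in hui_counts.items()
    if haplogroup.startswith(west_eurasian_prefixes)
)

print(f"{west_eurasian} of {total} Hui Y chromosomes "
      f"({west_eurasian / total:.1%}) fall in R, E, G, I or J")
```

The counts sum to the row total of 106, and 52 of them fall in the five listed haplogroups.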

Additionally, we know that some Mongols also have R1a1a. It’s hard to differentiate different periods of admixture. But to me the presence of R2 and J2 points to a Central/South Asian origin of a lot of the Hui R1a as well.

August 1, 2017

The great genetic map and history of China

Filed under: China genetics — Razib Khan @ 8:49 pm


About 20 percent of the world’s population is Chinese (and since over 90% of Chinese citizens are ethnically Han, by Chinese here I mean Han to a first approximation). In comparison to other non-European groups, a fair amount of genetics research has been done with Chinese populations. But in comparison to their overall numbers, not too much has really been done. That will change.

A new preprint, A comprehensive map of genetic variation in the world’s largest ethnic group – Han Chinese, aims to enrich our knowledge set somewhat. The authors used low coverage next generation sequencing to increase their sample sizes greatly (it is cheaper). By low coverage, I mean instead of hitting each genetic position on average 30 times or more, as is the norm in medical genomics, they sampled a position closer to twice.
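To get a feel for what "closer to twice" means for a single genome, here is a rough sketch under a simplifying assumption that is mine and not the preprint's: read depth at a position is Poisson distributed around the mean coverage. Real sequencing depth is usually more uneven than that, so treat the numbers as optimistic.

```python
import math


def poisson_pmf(k: int, mean_depth: float) -> float:
    """Probability of exactly k reads at a site, assuming Poisson depth."""
    return math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)


def prob_at_most(k_max: int, mean_depth: float) -> float:
    """Probability of seeing k_max or fewer reads at a site."""
    return sum(poisson_pmf(k, mean_depth) for k in range(k_max + 1))


# Compare the low-coverage design (~2x) with the medical-genomics norm (30x).
for mean_depth in (2.0, 30.0):
    p_no_reads = prob_at_most(0, mean_depth)
    p_one_or_fewer = prob_at_most(1, mean_depth)
    print(f"{mean_depth:>4.0f}x: P(no reads) = {p_no_reads:.3g}, "
          f"P(at most one read) = {p_one_or_fewer:.3g}")
```

Under that assumption a meaningful fraction of positions in any one 2x genome get zero or one read, which is why single-genome calls are weak and the strength of the design comes from pooling information across thousands of individuals.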

But while any given genome was usually not given much close attention, their overall sample size of individuals was 11,670 Han Chinese women. Impressive. This means that if they called a position as a variant, they could assess their confidence that it was a variant by looking at how many times it was called as a variant across their data set (as coverage declines, one’s confidence that a call of a variant is a true call declines, because there is a relatively high base rate of error set against the proportion of true expected polymorphisms; in contrast, if you sample 30 times the error rate gets overwhelmed by repeated sampling). Overall they counted 25,057,223 variants, which sounds about right. They also found 548,401 novel variants with at least a count of 10 in the data set (a ~0.04% allele frequency, so a very low cut-off).
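The ~0.04% figure follows directly from the sample size. A minimal check, assuming autosomal sites where each of the 11,670 women carries two copies:

```python
n_individuals = 11_670
min_allele_count = 10

# Assumption: diploid autosomal sites, so two chromosomes per individual.
n_chromosomes = 2 * n_individuals

min_allele_frequency = min_allele_count / n_chromosomes
print(f"{min_allele_count} copies out of {n_chromosomes} chromosomes "
      f"= {min_allele_frequency:.4%}")
```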

The most important thing about this preprint is not that the sample size is large enough that they could detect low frequency variants and add to the catalog. No, for me, it is that they sampled so many of the provinces. As you can see in the figure up top, just like Europe, China’s Han population recapitulates the map of China. That is, populations arrange themselves spatially when projected onto a principal components analysis plot in the same manner that they do geographically. This is a new finding in some ways, because previous sampling strategies had not been robust enough to detect the east-west cline (though to be honest, if you looked at the Chinese samples in the 1000 Genomes there was a suggestion of this).

All that being said, please note that the PCA is not to scale, insofar as most of the variation is north-south (4 to 5 times more than east-west). Rather like Europe in this regard. Part of this difference is due to the fact that gene flow from non-Han populations, particularly in the South, inflates the genetic variation on the first dimension. Another aspect of interest is that genetic variation between Han populations is rather low to begin with.

One way to visualize this is a matrix like the one to the left. You see pairwise population Fst statistics. The largest is between Guangdong in the south, home to Hong Kong and Guangzhou (Canton), and the northern provinces. The Fst value between Guangdong and Shanxi in the center-north is 0.0029. You may know that the Fst value between Han Chinese and Northern Europeans is ~0.10. That is a factor of 34 difference, more than one order of magnitude. As a point of comparison, you can find Fst tables which show that the values between English and Croatians and between English and Spaniards are about the same as between Guangdong and Shanxi.
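The factor of 34 is just the ratio of the two Fst values quoted above. A short sketch of the arithmetic, treating the ~0.10 Han versus Northern European value as exact for illustration:

```python
import math

fst_guangdong_shanxi = 0.0029   # largest within-Han value quoted in the post
fst_han_north_european = 0.10   # approximate value quoted in the post

ratio = fst_han_north_european / fst_guangdong_shanxi
orders_of_magnitude = math.log10(ratio)

print(f"ratio = {ratio:.1f}, i.e. about {orders_of_magnitude:.1f} "
      f"orders of magnitude")
```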

What is just as interesting is the very low genetic differentiation on the North China plain. Why is this? There are two reasons I can think of. The easy explanation is that across politically unified flat landscapes gene flow occurs so easily that genetic differences disappear over time.

But this presupposes there were genetic differences in the first place. The reason I say this is that though there was an early period of migration from the north to the south (from the Han dynasty onward), and absorption of non-Chinese peoples, there were also periods when much of China north of the Yangtze river valley was under barbarian domination or politically unstable. Elite northern families fled to the south, and eventually, when political stability reemerged, migrated back to the north (similarly, persistent north-south migration occurred, as the Hakka people of South China are clearly of northern provenance).

The low genetic differentiation across northern China may then be thought of as the outcome of structural fixtures of the landscape (no mountains to obstruct gene flow), as well as possibly due to historical instances of copious back-migration from various regions of southern China (or perhaps more accurately Central China, as I’m presuming much of the settlement would come from the lower Yangtze river valley). Both of these dynamics may have led to little intra-regional structure. In contrast you notice that genetic distance between Fujian and Guangdong, two regions adjacent to each other in the South, is still higher than between any of the northern regions.

Again, this is not surprising due to both geography and history. The dialect map of China shows that southeast China is more fragmented than the north (or southwest). These differences are long-standing and date to the initial founding of Han communities in the south via migrants from the north. Unlike North China, South China is a topographically diverse landscape, with beautiful escarpments and deep gorges. Fujian literally hugs the ocean, and has long had a relationship to overseas communities for this reason. Geographic barriers mean there are genetic barriers. Combined with admixture with local populations, this means it is not surprising that there are greater genetic differences between southern regions than in the north.

Additionally, China south of the Yangtze has been relatively shielded from foreign conquest and invasion compared to the North China plain. Obviously events like the Taiping rebellion and famine more generally had impacts on South China, but North China has had more periods of domination in a destabilizing manner by non-Chinese invaders over the past 2,000 years.

Perhaps more intriguing than the modern genetic relationships within China are the relations with non-Chinese populations. It is not surprising that the South Chinese populations show evidence of admixture with Dai and Taiwanese aboriginals (the basal group of the Austronesian migration). The genetics and cultural practices in parts of South China have long suggested relationships to indigenous groups, as well as Sinicization. Honestly, I suspect many were surprised how similar North and South Chinese were, indicating either continuous gene flow or descent from a large demographic expansion.

More curious is that some North Chinese seem to show evidence of admixture with West Eurasians. In particular, they show affinities with European populations. Again, this is not surprising. Some earlier analyses have shown evidence of European-like admixture in northern China, and among ethnic groups like Mongolians. More precisely there are strong signals of European-like admixture in the northwestern provinces of Gansu, Shaanxi, and Shanxi.

The details here are important though. The authors note that Hellenthal et al. detected admixture from Northern Europeans into North China, dated with haplotype-based methods to around 1200 AD. This preprint finds a similar admixture date. But they caution that these admixture dates may only signal the latest of the events.

As for what that event could be, there was clearly turmoil on the Silk Road in the years around 1000 AD. After 750 AD, for all practical purposes the Chinese lost control of their portion of the Silk Road, what is now Xinjiang. Turkic groups like Uyghurs and Iranian ones such as Sogdians were prominent in China due to a power vacuum (the Uyghurs were used by the Tang emperors like the Germans were used by the later Roman Emperors, as federates). Later on one saw the emergence of Tanguts, various groups from Manchuria, and finally the Mongols. Since both haplotype-based methods and this preprint suggest something around 1000 AD, the most likely candidate was the absorption of Central Asians with some European-like ancestry into the Chinese substrate. The Uyghur conquest of the major cities of the region in the centuries before the rise of the Mongols famously resulted in the assimilation of a European-like population which had earlier spoken Indo-European languages.

But admixture was not a feature of just recent Chinese history. The figure to the right is somewhat difficult to read, but it shows on the y-axis variance in the f3 statistic. In short, how well does the Chinese data set here form a clade with the outgroup, and how much does that statistic vary between groups? The x-axis is for the D statistic, which measures the relationship of four populations, with two clades. On the bottom left you see the Siberian genome from 45,000 years ago. On the y-axis you can see all provinces show very little variation, and that’s because the Siberian genome is old enough that it is basal to all the Chinese and Europeans. The D statistic indicates no gene flow between the Siberian populations and modern groups. Not so with other populations. You see the Pleistocene European populations are shifted to the right, and that’s because they all contribute to later Europeans. The Chinese-European clade is not a good fit. This is true across the Chinese populations (so the variance of the f3 statistic is very low).
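For readers who have not met the D statistic, here is a toy sketch of the allele-frequency form commonly used for it, D(W, X; Y, Z) = Σ(w − x)(y − z) / Σ(w + x − 2wx)(y + z − 2yz), summed over SNPs. The frequencies below are invented purely for illustration and have nothing to do with the preprint's data, and the function name is my own. A value near zero is consistent with (W, X) and (Y, Z) forming two clean clades; a consistent shift away from zero suggests excess allele sharing across the clades.

```python
def d_statistic(w, x, y, z):
    """Allele-frequency D statistic for populations (W, X; Y, Z).

    Each argument is a list of allele frequencies, one per SNP,
    for the same allele at the same sites.
    """
    numerator = sum((wi - xi) * (yi - zi)
                    for wi, xi, yi, zi in zip(w, x, y, z))
    denominator = sum((wi + xi - 2 * wi * xi) * (yi + zi - 2 * yi * zi)
                      for wi, xi, yi, zi in zip(w, x, y, z))
    return numerator / denominator


# Hypothetical frequencies at three SNPs (made up for this example).
pop_w = [0.9, 0.1, 0.8]
pop_x = [0.5, 0.5, 0.5]
pop_y = [0.9, 0.1, 0.8]   # tracks W, so W and Y share alleles
pop_z = [0.5, 0.5, 0.5]

d_shared = d_statistic(pop_w, pop_x, pop_y, pop_z)
d_null = d_statistic(pop_w, pop_w, pop_y, pop_z)  # W vs itself: no asymmetry

print(f"D with W tracking Y: {d_shared:.3f}")
print(f"D with W compared to itself: {d_null:.3f}")
```

In practice these statistics are computed over hundreds of thousands of sites with a block jackknife for standard errors, which this sketch leaves out.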

Also in the text they note that there is high shared drift with the three “Ancient North Eurasian” (ANE) samples from Siberia. This is discussed extensively in the supplements to Lazaridis et al. 2016. Another replicated finding is that the Chinese share drift with ancient European hunter-gatherers. The drift declines later on, likely because the Chinese do not share as much drift with the early farmers. This is due in part to the “Basal Eurasian” (BEu) element. But in Fu et al. 2016 they observe that drift between East Eurasians and European hunter-gatherers increases after 15,000 years BP, when there was a genetic turnover, and the Villabruna cluster (in their terminology) came to dominate the landscape.

The most probable, though not certain, explanation for this pattern is that ANE populations contributed ancestry to both antipodes of Eurasia: to European hunter-gatherers, and to the ancestors of the Chinese in Pleistocene East Asia (remember that there was a fusion between a proto-East Asian population and ANE to give rise to the ancestors of Amerindians 15-20,000 years ago). Another explanation could be East Asian gene flow rather early on into Europe, some time after the Last Glacial Maximum ~20,000 years ago. We don’t have the sample density outside of Europe to really say with certainty.

Finally, I have to mention that at SMBE Melinda Yang of Qiaomei Fu’s lab gave a talk about the Tianyuan genome. Their group has found that the Tianyuan individual, who dates to 40,000 years ago, is the likely ancestor of modern East Asians. That is, Tianyuan shares more drift with modern East Asians than Europeans. No huge surprise. What was surprising though is that Tianyuan also shared appreciable drift with GoyetQ116, a 35,000 year old sample from Belgium, whose descendants seem to have played a role in the emergence of the Magdalenian culture. But not later European hunter-gatherer populations. The Tianyuan sample also seemed to share some drift with Australasian samples (a possible resolution for why some Amerindians share drift with Oceanians presents itself here obviously). Overall, the group’s conclusion was that this might be evidence of ancient population structure rather early on in the “Out of Africa” populations, which eventually carried over as the groups dispersed (rather than each geographic region being direct descendants from a single panmictic “Out of Africa” group). The implications here are beyond the purview of Chinese genetics so I’ll address it in a later post.

I have to mention there is a fair amount within this paper on selection as well as medical genetics. I didn’t tackle that in this post since there’s so much phylogenomics one could talk about.

Charleston W. K. Chiang, Serghei Mangul, Christopher R. Robles, Warren W. Kretzschmar, Na Cai, Kenneth S. Kendler, Sriram Sankararaman, Jonathan Flint

July 27, 2011

How Chinese genetics is like Chinese food

Representatives of Szechuan and Shandong cuisine

The Pith: The Han Chinese are genetically diverse, due to geographic scale of range, hybridization with other populations, and possibly local adaptation.

In the USA we often speak of “Chinese food.” This is rather peculiar because there isn’t any generic “Chinese cuisine.” Rather, there are regional cuisines, which share a broad family similarity. Similarly, American “Mexican food” and “Indian food” also have no true equivalent in Mexico or India (naturally the novel American culinary concoctions often exhibit biases in the regions from which they sample due to our preferences and connections; non-vegetarian Punjabi elements dominate over Udupi, while much authentic Mexican American food has a bias toward the northern states of that nation). But to a first approximation there is some sense in speaking of a general class of cuisine which exhibits a lot of internal structure and variation, so long as one understands that there is an important finer grain of categorization.

Some of the same applies to genetic categorizations. Consider two of the populations in the original HapMap, the Yoruba from Nigeria, and the Chinese from Beijing. There are ~30 million ...
