Razib Khan One-stop-shopping for all of my content

October 11, 2018

Why PCA and genetics are a match made in heaven

Filed under: Evolution,Genetics,science — Razib Khan @ 8:13 pm
Insitome customers and selected populations

The image above is not the work of a small child trying to sketch out a B-2 Stealth Bomber. Rather, it is a PCA plot, which shows the distribution of a subset of Insitome’s customers who have purchased the Regional Ancestry Insight — in terms of how they relate to each other genetically.

In green, I have added some British individuals, in red some Africans from Nigeria, and in blue individuals who are ethnically Chinese. The majority of our customers are of Northern European heritage, but a substantial minority are African-American or Asian-American and various mixes therein.

So why do we use Principal Component Analysis (PCA) in the first place? And how do abstruse mathematical formulae end up matching our intuitions about relatedness?

Why we use PCA in genetics

Real genetic variation…a little bit

Consider this slice of diversity to the left. Six individuals, top to bottom, genotyped on a small number of genetic positions, left to right. You should recognize the letters, as they are DNA base pairs, A, C, G, and T. You can see above that there are variations between the positions across individuals. Now imagine attempting to gain insight from looking at thousands of individuals (rows) across hundreds of thousands of markers (columns).

Raw genetic data is basically just a huge text file. When you are concerned with the variation at a single position, you can view the results for individuals or populations in a table and expect most people to immediately understand the implications. Europeans who are lactose tolerant have a variant on a particular marker. If you are TT or CT you can digest milk sugar, lactose, as an adult. If you are CC, you can’t. There are only a few things to keep track of: the person, and their genotype.

Summarizing variation on a single marker, a single variable, isn’t necessary because the human mind can process all that information directly. In contrast, lots of simultaneous variables are impossible to understand just by looking at a table. PCA is just one of many excellent ways of extracting signal from the noise.

The plot to the left was generated from ~30,000 markers on a few hundred individuals from eight populations. This is not a large dataset today. The time it took to run the function which generated the raw PCA result output was the period between me pressing “enter” on the keyboard and me looking at the computer screen.

And yet, despite the modesty of this dataset, can you imagine me looking at 30,000 variables across 200 samples and obtaining any understanding? Perhaps if I devoted my life to the project!

What about the math?

Mathematically, PCA takes the voluminous raw data, which is totally incomprehensible to the human mind, and summarizes it into a set of independent equations, which is what makes it so essential to the analytical toolkit. The data is actually a “matrix.” PCA transforms it with a series of distinct equations which together account for the total variation of the underlying data.

A matrix of genotypes

These equations, or more properly dimensions, are arrayed in order of proportion of variation in the data explained. On a conventional PCA plot, you see the first two dimensions, which explain the largest and second largest proportion of the variation, as the x and y-axes. But there are many more dimensions you can break the data apart by, though quite often for genetic analysis the largest ones are sufficient to smoke out the population structure that you are interested in. The values of individuals in each dimension that drops out of the data can then be placed onto a coordinate system, which is much easier to digest than a table of raw variation.
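To make the description above concrete, here is a minimal sketch of PCA on a genotype matrix using NumPy's singular value decomposition. It is not the pipeline used for the plots in this post. The function name genotype_pca, the 0/1/2 allele-count encoding, and the two simulated populations are all assumptions made for illustration.

```python
import numpy as np

def genotype_pca(genotypes, n_components=2):
    """Project individuals (rows) onto the top principal components.

    genotypes: individuals x markers array; each entry is assumed to be
    the count (0, 1 or 2) of one chosen allele at that marker.
    """
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)  # subtract each marker's mean
    # SVD of the centered matrix; singular values are returned in
    # descending order, so the first columns explain the most variation.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    coords = u[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)  # proportion of variation per dimension
    return coords, explained[:n_components]

# Toy data (hypothetical): two simulated populations whose allele
# frequencies have drifted apart at 500 markers.
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.1, 0.9, size=500)
freq_b = np.clip(freq_a + rng.normal(0, 0.2, size=500), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(20, 500))
pop_b = rng.binomial(2, freq_b, size=(20, 500))

coords, explained = genotype_pca(np.vstack([pop_a, pop_b]))
# coords[:, 0] and coords[:, 1] are the x and y values of a conventional plot.
```

Plotting the two columns of coords against each other gives the familiar scatter, with each individual placed by its value on the first two dimensions.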

The branching of human populations

But how can a mathematical framework make biological variation comprehensible through maps so well — especially with regards to genetic differences between populations? The answer to this is straightforward: human evolutionary history has a pattern, and that pattern leaves its stamp on the genome. PCA is just a pattern extraction method.

The raw material of variation is mutation, and the pattern of mutations in any human genome is defined by a pedigree back to common ancestors. People who share common ancestors tend to share mutations, and those mutations are the genetic variation that PCA summarizes.

When used in evolutionary genetics, PCA should ideally recapitulate the phylogenetic tree. Assuming that sample sizes are balanced, the first principal component of variation in worldwide human datasets is invariably a dimension that separates Africans from non-Africans.

Why? Because this is the earliest separation between large lineages, and so the two lineages have had the most time to accumulate distinct and unique mutations. The second dimension is usually one that defines the difference between people from the eastern portion of Eurasia and those from the western portion of Eurasia. Again, this is an important phylogenetic distinction because these two groups seem to have diverged soon after their ancestors left Africa.

And so on. PCA is not the only way to visualize the data. If you run a computer program that counted up raw similarities and differences between individuals at each genetic position, you would notice that some individuals are more similar to others, some groups more similar to other groups, and this too would reflect the phylogenetic history. If you had more time and wanted to dig deeper, you could construct various models of population history, and see how well the data fit those models.
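The "count up raw similarities and differences" idea can be sketched in a few lines. This is only an illustration under assumptions: genotypes are encoded as 0/1/2 allele counts, the function name pairwise_mismatch is made up, and the four toy individuals are invented.

```python
import numpy as np

def pairwise_mismatch(genotypes):
    """Fraction of markers at which each pair of individuals differs."""
    g = np.asarray(genotypes)
    # Broadcasting compares every individual (row) with every other one.
    differs = g[:, None, :] != g[None, :, :]
    return differs.mean(axis=2)

# Toy data (hypothetical): four individuals genotyped at six markers.
toy = np.array([
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 0, 2],
    [1, 2, 1, 0, 1, 1],
    [1, 2, 1, 1, 1, 1],
])
dist = pairwise_mismatch(toy)
# Individuals 0 and 1 differ at one marker, as do individuals 2 and 3,
# while pairs drawn across the two groups differ at four or more.
```

Even this crude tally groups the first two individuals together and the last two together, which is the same structure a PCA plot would display.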

PCA is not the only way to understand genetic variation, and PCA itself is not the genetic variation but a way to represent it. But it is a fast method that starts with few assumptions and lends itself to easy graphical representation. It’s no coincidence that it remains popular to this day.

Interested in learning where your ancestors came from? Check out Regional Ancestry by Insitome to discover various regional migration stories and more!


Why PCA and genetics are a match made in heaven was originally published in Insitome on Medium, where people are continuing the conversation by highlighting and responding to this story.

September 26, 2018

The Insight Show Notes — Season 2, Episode 3: ANI, ASI, IVC and The Genetics of India

Filed under: Genetics,History,India,science — Razib Khan @ 3:49 pm
A scene from an ancient Indian epic

This week on The Insight (Apple Podcasts, Stitcher and Google Play) we discussed how the genetics of 25% of the world’s population, the people of South Asia, came to be. It’s a journey of thousands of years.

We cited the preprint, The Genomic Formation of South and Central Asia.

Additionally, we cite a chapter in David Reich’s Who We Are and How We Got Here, where he discusses the genetics of India, and how it’s analogous to Europe.

A cover story from India Today, 4500-year-old DNA from Rakhigarhi reveals evidence that will unsettle Hindutva nationalists, was also referenced. Please read with caution! The research has not been published, and there are likely going to be changes based on new results (actually, probably certainly from what I have heard)….

There was a discussion of some technical, but important, statistical genetic tests to infer admixture. The paper in Genetics, Ancient Admixture in Human History, outlines these methods in detail. The three and four population tests, as well as LD decay estimates of admixture time, are all discussed in this paper. All are alluded to or discussed in the podcast.

Linguistic families in South Asia

There was extensive discussion of the various language families in India, in particular, Indo-Aryan, Dravidian, and Munda. We discussed the results of a recent paper, A Bayesian phylogenetic study of the Dravidian language family, which indicates a recent expansion of this language family in South Asia. Also, a new preprint on Munda, The genetic legacy of continental scale admixture in Indian Austroasiatic speakers, suggests that the Munda emerged around the same time as the Dravidians.

A lot of ethnographic terms were thrown around without deeper exploration. If you want to follow up, look into the Elamites of ancient Iran, the Indo-European Sintashta culture, and the Bactria-Margiana culture of Central Asia.

We talked about ANI and ASI. The 2009 paper, Reconstructing Indian Population History, introduced these terms and constructs. The Kalash and Pulayar people of Pakistan and southern India respectively were mentioned as modern-day exemplars of ANI and ASI.

Distribution of R1a1a

The distribution of R1a1a in India and Eastern Europe was also discussed, along with its association with expanding steppe populations. Also, caste and its antiquity were discussed, in particular, that modern boundaries between groups seem to have emerged around 2,000 years ago, after several thousand years of admixture between disparate Indian groups. The promise of disease gene discovery in South Asia is a preprint that explores the relevance of this endogamy today for health risks.

Linguistic isolates Burusho and Nihali were mentioned. And, the development of the “Yankee” identity, which Razib analogized to Indo-Aryans!


The Insight Show Notes — Season 2, Episode 3: ANI, ASI, IVC and The Genetics of India was originally published in Insitome on Medium, where people are continuing the conversation by highlighting and responding to this story.

September 24, 2018

Podcast on South Asian genetics this week

Filed under: Genetics — Razib Khan @ 10:07 am

As some of you know I co-host a podcast on genetics and history with Spencer Wells. The very first podcast we recorded in late June of 2017 was about India, but we were still getting the hang of it to be honest, and we didn’t cover much territory.

A lot has happened between then and now, and so it’s time for an “update,” which is going to cover many more topics. That being said, we haven’t recorded yet, so I’m open to “questions from the audience” that we might integrate. So please use this post to leave comments about specific topics. (Please note we have only about an hour, so we might not get to everything.)

September 19, 2018

The Insight Show Notes — Season 2, Episode 2: The Greatest Human Journey

Filed under: Genetics,hawaii,Podcast,science — Razib Khan @ 8:10 pm

This week on The Insight (Apple Podcasts, Stitcher and Google Play) we touched upon arguably one of the greatest journeys of humankind, the expansion of the Polynesians across the Pacific.

Bishop Museum

Spencer discussed his visit to the Bishop Museum in Hawaii.

We discussed broadly the interesting confluence of biology, geology, and history one can see in Hawaii. The book The Monkey’s Voyage: How Improbable Journeys Shaped the History of Life discusses the biogeographic characteristics of many islands, including Hawaii.

We discussed the context of Polynesian languages and culture as part of the broader zone of Austronesian language and culture.

The extent of Austronesian languages

Austronesian societies spread over the last 6,000 years from Taiwan to the far west in Madagascar, and far east in Easter Island. The expansion into Polynesia was prefigured by the expansion of the Lapita culture between 1500 BC and 500 BC.

The Lapita culture is defined by its unique pottery. But curiously the usage of pottery disappeared among the Polynesians, the likely later descendants of the Lapita people. Razib mentioned how there is some evidence that cultural bottlenecks and small populations can result in loss of skills such as pottery.

On the other hand, Spencer pointed out that the Polynesians also did not practice rice agriculture, unlike other Austronesian societies. Instead, they expanded with a cultural toolkit of taro, which likely was adopted from the peoples of Near Oceania, New Guinea, and Melanesia.

Sweet Potato

Additionally, Spencer brought up the fact that the cultivation of sweet potatoes in Polynesia likely indicates contact between Polynesians and the peoples of South America. The genomic evidence that Polynesian sweet potatoes derive from South American ones is conflicting. Spencer mentioned that the word for “sweet potato” in Quechua, the language of highland Peru, is kumar. In Hawaiian, it is ʻuala.

We mentioned in passing Thor Heyerdahl’s view that there was a South American migration to Polynesia. But the genetic, cultural, and archaeological evidence does not support this.

The Polynesian mtDNA motif was mentioned. With a high frequency in Polynesia, the mtDNA lineage seems to have spread from the west, in line with the idea of a migration to the east. In contrast, the Polynesian Y chromosomes show a mix of Asian and Melanesian heritage.

Much of the debate hinges on whether the expansion of Austronesians into the Pacific was via the “slow boat” or “express train” model. The slow boat model suggests gradual, widespread cultural and genetic mixture as the Austronesians expanded through Melanesia. The express train model implies a more rapid migration with far less interaction. Culturally the adoption of taro cultivation aligns with the slow boat thesis, as does the existence of Melanesian Y chromosomes across the range of Polynesians. But the overwhelming Asian nature of Polynesian mtDNA lineages fits the express train model.

One way that scholars have reconciled this is that there was a slow expansion of the Lapita people, but that they only assimilated Papuan and Melanesian men into their matrilineal communities. This broad framework was reinforced with the publication of genetic results from native Hawaiians, which showed a minority ancestry from a Papuan-like population.

But wait, there was a twist! Ancient DNA now shows that the Lapita people had almost no admixture with Melanesian people! Follow-up results from Vanuatu and Tonga confirm that the Lapita people had no admixture from Melanesians. Rather, in Vanuatu 2,500 years ago the Lapita people are replaced by an almost entirely Melanesian population, and the Melanesian ancestry begins to show up in Polynesians after this period. The conclusion then is there were multiple migrations into Polynesia!

Spencer and I concluded that the broad sketch is now established, but a lot of complicated details need to be worked out. Instead of express trains or slow boats, some researchers now wonder if Polynesia was more like a subway network.


The Insight Show Notes — Season 2, Episode 2: The Greatest Human Journey was originally published in Insitome on Medium, where people are continuing the conversation by highlighting and responding to this story.

Hawaii: complicated a journey to paradise

Filed under: anthropology,Genetics,hawaii,science — Razib Khan @ 7:11 pm
The extent of Austronesian Diaspora

Ask any American what they think when you say the word “Hawaii,” and certain words will no doubt reoccur from person to person. That’s because certain images, feelings, come to mind. A gentle breeze, beaches, and volcanoes. The 50th state has been the byword for paradise on the mainland. A certain sense of Hawaii is part of American popular culture.

But Hawaii is a real place with real people. It isn’t a dreamland. Rather, it is one of the most isolated large islands in the world. Over 2,500 miles from the nearest continent, there is only a single terrestrial mammal native to the islands: predictably, a bat!

Obviously, the island is crawling with mammals today. Nearly 1,000 years ago voyagers from the lands of the western Pacific landed on the Society Islands, which include famed Tahiti, and then sailed northward to the Hawaiian archipelago. When the ancient Polynesians settled Hawaii they did not arrive alone. They brought with them pigs, chickens, and dogs. Naturally, rats tagged along as unwanted passengers.

Humans arrived in Hawaii in catamarans

But the settlement of Hawaii by humans was the end of a long journey which began thousands of years earlier in the mists of prehistory. Six thousand years ago a small group of stone-age seafarers, whom we call Austronesians, journeyed south from Taiwan and settled the northern Philippines.

But they did not stop there. Over a period of thousands of years, these ancient mariners spread out over Southeast Asia, sometimes introducing intensive forms of rice agriculture and their distinctive language. For whatever reason, these were a people who wondered what was over the horizon, even if it was the deep blue ocean. They moved on west and east. Over 1,000 years ago their descendants reached the western Indian Ocean, mixing with the Bantu farmers of eastern Africa and occupying the island of Madagascar. In the other direction, Austronesians moved into Oceania, abandoning rice and adopting taro from Melanesians. Less than 1,000 years ago the Pacific expansion finally crested, as Polynesians settled New Zealand, off the coast of Australia, and Easter Island, 2,300 miles west of South America. And of course, they ventured north to Hawaii, an isolated, ecologically rich, and unique jewel in the midst of the Pacific.

In Southeast Asia, the Austronesians merged with native populations of farmers who had migrated out of southern China earlier. But as they moved west and east they encountered very different populations, whether it be African farmers and pastoralists, on the one hand, or Melanesians in the case of the ancestors of the Polynesians.

Citation: Kim SK, Gignoux CR, Wall JD, Lum-Jones A, Wang H, Haiman CA, et al. (2012) Population Genetic Structure and Origins of Native Hawaiians in the Multiethnic Cohort Study. PLoS ONE 7(11): e47881

And just as the people of Madagascar, despite speaking a language closest to those spoken in Borneo, have blended with nearby populations, Polynesians carry signatures of interactions with the peoples of Near Oceania, which includes New Guinea, Australia, and Melanesian islands in the western Pacific, such as the Solomon Islands and New Caledonia.

As genomics began to illuminate all the relationships between human populations, in 2012 a paper was published that surveyed the genomes of many native Hawaiians. The results were clear: the indigenous peoples of Hawaii had a dominant signature of ancestry shared with mainland Asian peoples, but also a minority component that had more affinities with the peoples of Near Oceania.

Lapita culture sites

This result was relevant to what traditionally had been termed the “express train vs. slow boat” models of the settlement of Polynesia. The “express train” hypothesis implies that the Austronesian Lapita culture rapidly pushed out of maritime Southeast Asia, with minimal interaction with local Papuans and other Melanesians. In contrast, the “slow boat” model meant that the expanding proto-Polynesians mixed with Papuans and Melanesians as they spread eastward more gradually, creating a fused culture which pushed onward into the far Pacific.

The results above, along with maternal and Y chromosomal lineages, seem to support the “slow boat” model. Not only are all Polynesians, including Hawaiians, descended from Southeast Asian farmers, but their ancestors also include the people who first pushed to the edge of the Pacific. These were the ancestors of Oceanians who settled New Guinea, Near Oceania, and Australia more than 40,000 years ago with the first “Out of Africa” migration.

Citation: Skoglund, P., Posth, C., Sirak, K., Spriggs, M., Valentin, F., Bedford, S., … & Fu, Q. (2016). Genomic insights into the peopling of the Southwest Pacific. Nature, 538(7626), 510.

So case closed? Not exactly. Science and history are often more complex than our elegant human imaginings. Over the past few years, the field of ancient DNA has come upon the scene to disturb hypotheses and provoke the development of new ones. Now researchers can see snapshots of the past with much crisper detail than would have been the case in the past.

Two papers have helped reshape our understanding of the peopling of Polynesia. First, a 2016 paper showed that samples of ancient Lapita people don’t show any admixture from Melanesians. This is in accordance with the “express train” model, which the genetic heritage of modern Polynesians presumably refuted!

An immediate solution to this conundrum is that the old models were too simple: there wasn’t just a single migration outward, but rather several, and Melanesian ancestry arrived later, within the last 2,000 years.

A paper published in 2018 added more nuance and clarity to what may have been going on. Today Vanuatu is considered to be Melanesian and is settled by people of predominantly Oceanian heritage. Yet ancient DNA from 3,000 years ago yielded individuals of nearly total Asian heritage. By about 2,000 years ago these people had been replaced by the ancestors of modern Melanesians, as later samples show overwhelming Oceanian heritage.

Poke is a melange of flavors and ingredients from the four corners of the world

Where does this leave us? Appropriately, a paper with the title “Human Genetics: Busy Subway Networks in Remote Oceania?” was penned as a response to all this uncertainty and confusion. The title says it all, doesn’t it?

These findings may actually be consonant with recent archaeological results that eastern Polynesia and New Zealand were subject to a massive demographic expansion and radiation beginning ~1,000 years ago.

Today modern Hawaii is a melange of peoples, reflected in its cuisines, such as Poke, which has been inflected and modified by new ingredients brought by immigrants from the mainland and Asia. And yet perhaps this was always so, as paradise was never as serene and eternal as we may dream in our imaginings. Rather, Hawaii and the Hawaiians were products of daring voyages generation after generation, and the waxing and waning of peoples and cultures, bringing together diverse and disparate threads of the human expansion out of Africa.


Hawaii: complicated a journey to paradise was originally published in Insitome on Medium, where people are continuing the conversation by highlighting and responding to this story.

September 12, 2018

Season 2, Episode 1: The Legacy of L. L. Cavalli-Sforza

Filed under: anthropology,Evolution,Genetics,History — Razib Khan @ 1:15 pm
L. L. Cavalli-Sforza, 1922–2018

This week on The Insight (Apple Podcasts, Stitcher and Google Play) we discussed the life and legacy of L. L. Cavalli-Sforza, who died on August 31st, 2018. See the Stanford obit. From John Hawks, The man who tried to catalog humanity on Medium is probably the most thorough review of his life and works.

Cavalli-Sforza’s magnum opus

We discussed the popularization of PCA and phylogenetic trees by Cavalli-Sforza and his colleagues, outlined in his magnum opus The History and Geography of Human Genes. The methods of analysis developed in the 20th century prepared us for the avalanche of data we confront in the 21st century.

Also discussed was Cavalli-Sforza’s collaboration with Marcus Feldman, which planted the seed for the field of cultural evolution. See the book Cultural Transmission and Evolution. Today the field of cultural evolution is being pushed by researchers such as Joe Henrich.

Cavalli-Sforza was also instrumental in synthesizing the application of formal demographic models to archaeology. See The Neolithic Transition and the Genetics of Populations in Europe.

L. L. Cavalli-Sforza remained active into the 21st century. See this interview from 2012.

After his retirement, several labs have carried on the project of exploring population genetic history. See David Reich’s lab’s publications, and also Eske Willerslev’s work.

Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. A 2008 study that uses the HGDP sample on a SNP-array, bringing The History and Geography of Human Genes into the 21st century!

Finally, the cultural anthropologist wrote a full-length biography of L. L. Cavalli-Sforza’s life, with a focus on his scientific contributions, A Genetic and Cultural Odyssey.


Season 2, Episode 1: The Legacy of L. L. Cavalli-Sforza was originally published in Insitome on Medium, where people are continuing the conversation by highlighting and responding to this story.

September 11, 2018

The Munda as upland rice cultivators

Filed under: Genetics,Munda — Razib Khan @ 7:32 pm


I’m reading Ben Kiernan’s Viet Nam: A History from Earliest Times to the Present. I picked it up mostly because over half the book does not consist of the history of the Vietnam War (a major failing I’ve noticed with books which are histories of Vietnam, as opposed to histories of Vietnamese-American relations).

The section on Austro-Asiatic languages (Vietnamese is one) has something of relevance to the “Munda question”. But before that, let me review a few things.

Until very recently many historians and prehistorians of India have suggested that the Munda people, who speak very distinctive dialects related to the Austro-Asiatic languages of Southeast Asia, are the primal people. That is, they are the aboriginals. The original adivasis.

I do not believe that this case is tenable. Because I am a geneticist, I make this judgment on genetic grounds. Chaubey et al., Population Genetic Structure in Indian Austroasiatic Speakers: The Role of Landscape Barriers and Sex-Specific Admixture, reveals what we know about the genome-wide patterns in the Munda.

1) They are highly enriched for East Asian ancestry compared to other South Asians.

2) Many Munda males carry a haplogroup, O-K18 (once O2a), that is very common in Southeast Asia, especially Austro-Asiatic groups. Additionally, it is more diverse in Southeast Asia. The Munda O-K18 branch seems to be a side shoot from the broader Southeast Asian tree.

3) The Munda mtDNA, defining the maternal line, is uniformly South Asian. This is in contrast to the situation with Bengalis, who have East Asian Y and mtDNA. This indicates that the Munda migration was heavily male-mediated.

4) The Munda carry mutations in genes that are associated with recent selective sweeps in East Asians (e.g., on the EDAR locus). Though this may be a parallelism, it’s unlikely. Rather, it is through shared common descent that this occurs.

The Genomic Formation of South and Central Asia has a graph which shows population relationships and gene flow that illustrates important aspects of the Munda ethnogenesis (Juang below):

AASI in this model = Ancient Ancestral South Indians. These are very distantly related to Andaman Islanders, Australo-Melanesian Southeast Asians, and more distantly to eastern Eurasians generally. They are likely aboriginal people to South Asia, with no West Eurasian ancestry.

The model above indicates that an East Asian (Austro-Asiatic) population encountered an AASI population and produced a daughter population. Then, that daughter population mixed with an ASI population, ASI being an old and stable mix of West Eurasian Iranian farmer (~25%) and AASI (75%).

This means two things for the Munda. First, they are very AASI enriched. This is obvious in any analysis. And, their West Eurasian ancestry is almost all Iranian farmer and not steppe. This is totally not surprising either. Using more naive model-based clustering Munda samples always seem to lack the components which are most easily adduced to be Indo-Aryan. They have very low frequencies of Y haplogroup R1a1a-Z93.
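The two-step mixture described above can be turned into simple arithmetic. The sketch below is only illustrative. The 25%/75% split for ASI comes from the text, but the two mixing fractions passed to the function are hypothetical placeholders, not estimates from the paper, and the function name is made up.

```python
# From the text: ASI is roughly 25% Iranian farmer and 75% AASI.
ASI_IRANIAN_FARMER = 0.25
ASI_AASI = 0.75

def munda_like_ancestry(east_asian_share, asi_share):
    """Expected ancestry fractions after the two mixture events.

    east_asian_share: East Asian fraction of the first daughter
        population (hypothetical input).
    asi_share: fraction contributed by ASI in the second mixture
        (hypothetical input).
    """
    # Step 1: East Asian (Austro-Asiatic) mixes with unmixed AASI.
    daughter = {
        "East Asian": east_asian_share,
        "AASI": 1.0 - east_asian_share,
        "Iranian farmer": 0.0,
    }
    asi = {
        "East Asian": 0.0,
        "AASI": ASI_AASI,
        "Iranian farmer": ASI_IRANIAN_FARMER,
    }
    # Step 2: the daughter population mixes with ASI.
    return {
        name: (1.0 - asi_share) * daughter[name] + asi_share * asi[name]
        for name in daughter
    }

# Purely illustrative inputs: an even split at both steps.
example = munda_like_ancestry(east_asian_share=0.5, asi_share=0.5)
# With these placeholder inputs: East Asian 0.25, AASI 0.625, Iranian farmer 0.125.
```

Whatever fractions are plugged in, AASI enters at both steps and steppe ancestry enters at neither, so the result is AASI-enriched and its only West Eurasian component is Iranian farmer.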

Let’s take a step back now. The fact that the Austro-Asiatic males arrived when there were unmixed AASI indicates that this was somewhat early. There are no unmixed AASI on the Indian subcontinent today. When we reach the Iron Age, by 500 BCE it is clear that Indo-Aryan society had pushed at least to Bihar. This component would bring steppe ancestry, as well as mixing into any remnant AASI.

So when could the Austro-Asiatics have arrived at the earliest? Two papers with extensive ancient DNA, Ancient genomes document multiple waves of migration in Southeast Asian prehistory and The prehistoric peopling of Southeast Asia, give us a good sense. It seems that the expansion of Austro-Asiatic farmers dates to about 4,000 years ago. That is when the transition seems to occur in northern Vietnam.

One thing that is also evident: the East Asian gene flow into the Munda seems to come from northern Austro-Asiatic groups in Thailand, not the southern branch which resulted in the people of the Nicobar Islands and was eventually submerged by Austronesians. On a final note, a site in northern Burma yielded an individual who was clearly Tibeto-Burman, and not Austro-Asiatic, 3,000 years ago. So even at that date mainland Southeast Asia was heterogeneous.

But, considering that there is no evidence of Tibeto-Burman ancestry in the Munda, whose Austro-Asiatic ancestry seems to have come through Burma via a mainland route (as opposed to up from maritime Southeast Asia), I think one should push the date of their arrival before 1000 BCE. With the expansion of farming in mainland Southeast Asia at ~4,000 years ago, that puts the arrival of a distinctive Munda culture in South Asia between 2000 BCE and 1000 BCE. It is entirely reasonable that during this period there were unmixed AASI in eastern South Asia, though the admixture graph may also be picking up assimilation of Australo-Melanesian ancestry in southern China/Southeast Asia.

This is where Viet Nam: A History from Earliest Times to the Present comes in: the author suggests that the early Austro-Asiatic farmers were dry-land rice farmers who occupied uplands. The reason is that the reconstructed common Austro-Asiatic words for rice culture are indicative of dry-land practices, with later wet-rice terminology often being borrowed from Tai and Austronesian languages.

I don’t know enough Indian archaeology and agricultural history to comment further, but, a visual inspection of where Munda are concentrated does suggest upland farming….

September 8, 2018

Ancient Indian Genetics At ASHG

Filed under: Genetics — Razib Khan @ 10:19 am

At ASHG next Monday Niraj Rai will be presenting this poster, Reconstructing the peopling of old world south Asia: From modern to ancient genomes.

South Asia was one of the first geographic regions to be peopled by modern humans after their African exodus. Today, the diverse ethnic groups of South Asia comprise an array of tribes, castes, and religious groups, who are largely endogamous and have hence developed complex, multi-layered genetic differentiation. From such a complex structure, several questions have stood out from the research of our group and others that are only beginning to be resolved using modern sequencing techniques and targeted sampling of populations and archaeological specimens. Here, for the first time we have used ancient genomics approach to understand the deep population ancestry of Indian Sub-continent. Despite the rich sources available of modern Indian populations, success from ancient DNA specimens in the subcontinent have been limited. We have successfully analysed several museum samples and fresh excavation from the different part of India which provides us a wonderful opportunity to be able to relate these modern populations genetically with those in the past and build complex models of population mixture and migration in India. Using ancient genomics data from the human remains who have lived about 4-5 thousand years before present in North West and South of India, we are trying to understand the population history of Iron age people and their genetic relation with the North West of Indians and Iranian Farmers. Furthermore, we are providing a solid Genetic evidence that substantiates archaeological and linguistic evidence for the origins of Dravidian languages and the language of the Indus valley people.

I’ll probably be trying to make sure I catch Rai at the poster. I’m most interested in the South Indian samples. If they date to more than 4,000 years before the present, it will be quite interesting.

Below the fold is my response to a comment on The Roots of Indo-Iranian cultural genesis. My response is in bold. JR’s responses to my original comment are in italics.

as you know there are more than those 3 now. none of the harappan era individuals exhibit evidence of noticeable steppe drift.

What I know, and this is second hand info, is that there are 15 more InPe samples found by the Reich team, and they have enormous variation in the AASI levels. There is also ydna and mtdna which is South Asian. But I know nothing about the ANE/WSHG levels of these populations.

It seems likely that ANE/WSHG gene flow occurred into South Asia before the arrival of steppe people.

by sampling one individual you are reconstructing the phylogeny of whole populations.

That is true but we now know that the Rakhigarhi sample is so low coverage that it is practically useless.

I’m talking about more than something like PSMC. You can tune your priors even on a little data. Assume you have a population you don’t know about. You have 100 markers from one individual. That individual maps onto African populations. It doesn’t guarantee everyone else will map in the same way, but it tunes your priors.

Similarly, all InPe samples so far lack steppe ancestry. The size of the sample is small, and the quality varies, but we are getting more confident, not less.
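To make the "tuning your priors" point concrete, here is a minimal sketch in Python. It assumes a flat Beta(1, 1) prior and treats the samples as independent draws, which real ancient DNA samples from related sites are not. The function names, the sample counts (3, then 3 + 15), and the hypothetical 30% true share are illustrative assumptions, not figures from any paper.

```python
from fractions import Fraction


def posterior_mean(n_sampled, n_with_steppe, prior_a=1, prior_b=1):
    """Posterior mean share of the population carrying steppe ancestry.

    A Beta(prior_a, prior_b) prior plus binomial sampling gives a
    Beta(prior_a + hits, prior_b + misses) posterior.
    """
    hits = n_with_steppe
    misses = n_sampled - n_with_steppe
    return Fraction(prior_a + hits, prior_a + prior_b + hits + misses)


def prob_all_lack(true_share, n_sampled):
    """Chance that every one of n independent samples lacks steppe ancestry
    if `true_share` of the population actually carried it."""
    return (1 - true_share) ** n_sampled


# Illustrative sample counts only: 3 samples, then 3 + 15 = 18.
for n in (3, 18):
    print(n, float(posterior_mean(n, 0)), prob_all_lack(0.3, n))
```

Under these assumptions, every additional sample without steppe ancestry pulls the posterior toward zero and makes a substantial hidden steppe share harder to sustain. That is the sense in which confidence grows even when the samples are few and uneven in quality.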

So we have to fall back on the InPe samples. The InPe that we know so far were definitely not a homogeneous group, as you well know. The question was really whether these groups lacked steppe_mlba which modern Indians have. But the paper unfortunately failed to demonstrate this conclusively.

Final conclusions are going to come down to:

  1. large sample sizes
  2. good geographic representation

But we can make preliminary bets. It seems likely to me that the bulk of steppe_mlba arrived after 2000 BC to South Asia. Not definitive, but likely.

If you look into the Supplementary section on South Asians in the Narasimhan paper, they only use Swat_IA, Swat_H, Punjabi & Mala to represent the post Harappan South Asians. And in the various qpAdm models they generate (using 2 separate sets of outgroups), the combination with the highest p-value for both the Punjabi & the Mala involve older steppe groups beside the steppe_mlba. It is only the Swat samples that probably show steppe_mlba preference. But then again, the Swat samples show greater preference for steppe_mlba_west rather than steppe_mlba_east, which is again making no sense archaeologically.

So there are no clear answers. And to top it all, the authors implicitly assume in their qpAdm models that South Asians have to be modelled as comprising 3 streams – steppe, Iran_N groups, AASI – to arrive at their conclusion.

the Loschbour ancient genome from luxembourg was the first mesolithic western european hunter-gatherer. *every subsequent mesolithic hunter-gatherer has been genetically very similar*, across all of western europe. that’s because they went through a bottleneck and range expansion

But we do not know that such a range expansion took place among Harappans. And it appears that they are arguing Harappans being endogamous already and having population structure which goes against the idea of a range expansion.

This is not correct. When you undergo range expansion you will admix with the population beyond the frontier. This will produce dilution (or not, depending on population densities).

At any rate, the 2 scenarios are different, we are comparing a hunter gatherer society with a complex Bronze Age urbanised society. In Harappan times, if there was a range expansion, it was likely to be more among the less developed populations in Inner India or among the Harappans it is more likely to be restricted to the elites.

when i say *steppe* i’m talking specifically of the genetic element that is Yamnaya+EEF, which is the distinctive european back-migration into the steppe after the corded ware western migration. this component probably has added iranian farmer+ANE/WSHG in some fractions.

Though I was initially reluctant to accept this – it appears that the more steppe rich South Asian groups also have slightly more ANF/EEF. The Damgaard et al. paper argued that these Northern groups also have CHG which is lacking in others. One may argue that this could only come from steppe_mlba. But then the Harappans were already interacting with Iranians, Central Asians and the Mesopotamians and probably the Caucasus groups as well. This could very well have brought in ANF and CHG in their genetic mix and we do not need to resort to a steppe_mlba explanation for it, especially as the archaeological evidence is quite lacking.

otoh the incoming group could have been heterogeneous with non-steppe lineages ‘hitchhiking’. this seems to have occurred in europe: haplogroup I1 shows a star-phylogeny just like r1b and r1a. i’m pretty sure it was integrated into the indo-european expansion after cultural and biological assimilation.

Could be. Or it may be that the reality was very different from how it is being interpreted.

perhaps r1a1a-z93 isn’t connected to indo-aryans necessarily, though it is sufficiently

I think that the Harappans, East Iranians and Central Asians were already IE. So even if R1a z93 expanded locally, considering how Poznik et al. dated the South Asian z93 to between 2500 – 2000 BC, it is more likely, in my opinion, to be related to Indo-Aryan expansion. Nothing else that we know of historically can explain the widespread South Asian presence of this Bronze Age lineage.

But this would of course mean that the steppe_mlba also received R1a Z93 from South Asia. At the present state of our knowledge, there is no evidence so far to support such a claim unless we can show some sort of admixture in steppe_mlba from Harappans or BMAC or Shahr I Sokhta.

It is possible for Harappans to have been IE and not Indo-Aryan. I am skeptical of this scenario, but it’s not impossible. Indo-Aryan is clearly related to Indo-Iranian cultures which came out of the Eurasian steppe.

September 1, 2018

Takeaways from the golden age of Indian population genetics

Filed under: Genetics — Razib Khan @ 1:01 am

There are lots of strange takes on the India Today piece, 4500-year-old DNA from Rakhigarhi reveals evidence that will unsettle Hindutva nationalists. I’m friendly with the author and saw an early draft. So I’m going to address a few things.

The genetic results are becoming more and more clear. A scaffold is building and becoming very firm. In the 2020s there will be a lot of medical genomics in India. But before that, there will be population genetics. Ancient DNA will be the cherry on the cake.

Here’s what genetics tells us. First, a component of South Asian ancestry, especially in North India, and especially in North Indian upper caste groups, seems to be the same as ancient agro-pastoralists who ranged between modern Ukraine and modern Tajikistan. Genetically, these people are very similar to certain peoples of Central and Eastern Europe of this time, though there is a varied dynamic of uptake of local Central Eurasian elements as they ranged eastward.

This ancestral component is often called “steppe.” This ancestral component is a synthesis of ancient European hunter-gatherer, Siberian, and West Asian. The steppe component seems to arrive in Central and South Asia after 2000 BC.

Second, another component of South Asian ancestry is very distinctive to the region. It is deeply but distantly related to branches of humanity which dominate Melanesia and eastern Eurasia, up into Siberia. The magnitude of the distance probably dates to ~50 thousand years ago, when the dominant element of modern humans expanded outward from West Asia, east, north, and west. These people are called “Ancient Ancestral South Indians,” or AASI. Their closest relatives today may be the natives of the Andaman Islands, but this is a very distant relationship.

AASI is the dominant component of what was once called “Ancestral South Indians,” or ASI. It turns out that “ASI” themselves were a compound synthetic population. This was long suspected by many (e.g., David W.). What was ASI a compound of? About ~75 percent of its ancestry was AASI, but the balance seems to have been a West Eurasian component related to farmers from western Iran. We can call this group “farmers.”

With a few samples from outside of the IVC region, and one (or two) samples from within the IVC region, geneticists are converging upon the likelihood that the profile in the greater IVC region before 2000 BC was a compound of these farmers with the AASI. But even within the IVC region, there seems to have been a range of variation in ancestry. The IVC was a huge zone. It may not have been dominated by a single ethnolinguistic group (even today there is the Burusho linguistic isolate in northern Pakistan). Note that the much smaller Mesopotamian civilization was multiethnic, with a non-Semitic south and a Semitic north (Sumer and Akkad).

The key point is that it is very likely the IVC lacked the steppe ancestral component, that it did have an AASI component, and that it did have a farmer component with likely ultimate provenance in western Iran. Additionally, there were smaller components derived from pre-steppe Central Eurasian people.

While the steppe people arrived in the last 4,000 years, and at least some of the ancestors of the AASI are likely to have been in South Asia for 40,000 years, the presence of the AASI-farmer synthesis genetically is conditional on when a massive presence of western farmers came to affect the northwestern quarter of South Asia. It seems unlikely to have been before Mehrgarh was settled 8,500 years ago. The genetic inferences to estimate the time of admixture between AASI and farmer are currently imprecise, but it seems likely to have begun at least a few thousand years before 2000 BC. A range of between 8,500 and 6,000 years ago seems reasonable.

So 4,000 years ago the expanse of the IVC was dominated by a variable mix of farmer and AASI. One can call this “Indus Valley Indian” (IVI).

Just like ASI, there was an earlier abstract construct, “Ancestral North Indian” (ANI). Today it seems that it too was a compound. To be concise, ANI is a synthesis of steppe with IVI. The Kalash of northern Pakistan are very close genetically to ANI. This means that ASI had West Eurasian ancestry, albeit to a minor extent, and ANI had AASI ancestry, albeit to a minor extent. The main qualitative difference is that ANI had a substantial minority of steppe ancestry.

To a great extent, the algebra of genetic composition across South Asia can be thought of as modulating these three components, farmer, steppe, and AASI.* Consider:

  • Bhumihar people in Bihar tend to have more steppe than typical, but not more farmer than typical, and average amounts of AASI.
  • Sindhi people in Pakistan tend to have lots of farmer, some steppe, and not much AASI.
  • Reddy people in South India have lots of farmer, very little steppe, and average amounts of AASI.
  • Kallar people in South India have some farmer, very little steppe, and lots of AASI.

For details of where I’m getting this, you can look at The Genomic Formation of South and Central Asia for quantities. But as a stylized fact farmer ancestry tends to peak around the Sindh. In Pakistan steppe ancestry increases as you go north. As you go east and south AASI increases pretty steadily, but there are groups further east, such as Jatts and Brahmins, who have a lot of steppe, almost as much as northern Pakistani groups. And curiously you get a pattern where some groups have more steppe and AASI, and less farmer, than is the case to the west (you see this in the Swat valley transect, as steppe & AASI increase in concert).

Going back to the history, by the time the steppe people arrived in South Asia, in the period between 2000 BC and 1000 BC, it may be that the IVI ancestry is what they mixed with predominantly. Though it is likely that the southern and eastern peripheries had “pure” AASI, by the time steppe people spread their culture to these fringes they were already thoroughly mixed with IVI populations, and so already had some AASI ancestry.

In contrast, the farmer populations likely mixed extensively with AASI in situations where the two populations were initially quite distinct.

Please note I have not used the words “Aryan” or “Dravidian.” The reason is that these are modern ethnolinguistic terms. Genetics is arriving at certain truths about population changes and connections, but we don’t have a time machine to go back to the past and determine what language people were speaking 4,000 years ago.

Our inferences rest on supposition, and a shaky synthesis of historical linguistics and archaeology and genetic demography, a synthesis which is unlikely to ever be brought together in one person due to the vast chasm of disciplinary method and means.

It is highly likely that the steppe component is associated with Indo-European speaking peoples. Probably Indo-Aryan speaking peoples. The reason is that by historical time, the period after 1000 BC, Iran and Turan seem to already have been dominated by Indo-Iranian peoples. But, in the period around 2000 BC, western Iran was not Indo-Iranian. People like the Guti and the Elamites were not Indo-European, and they were not Semitic. We have some genetic transects which show that steppe ancestry did arrive in parts of Turan and Iran in the period after 2000 BC.

Where did the Dravidian languages come from? We don’t know. They could have been spoken by an AASI group. Or, they could be associated with farmers from the west. We don’t know. Ultimately, we may never know. Unlike Indo-European languages, there are no Dravidian languages outside of South Asia.

Various toponymic evidence indicates that Dravidian languages were spoken at least as far north and west as Gujarat. And Brahui exists today in Balochistan. Though I don’t have strong opinions, I think Dravidian languages probably are descended from a group of extinct languages that were present in Neolithic Iran.

Though unlike Indo-Aryan languages, Dravidian exploded onto the scene after a long period of incubation within South Asia, as part of at least one of the language groups dominant within the IVC and pre-IVC societies.

At least that’s my general assessment. I have strong opinions about the genetics. But I am much more curious about what others have to say about linguistics and archaeology.

* Some groups, such as Munda and Indo-Aryan groups in Northeast India, have East Asian ancestry. Some groups in coastal Pakistan have African ancestry.

August 31, 2018

Genetics stories in India Today

Filed under: Genetics — Razib Khan @ 8:47 am

4500-year-old DNA from Rakhigarhi reveals evidence that will unsettle Hindutva nationalists:

The ‘petrous bone’ is an inelegant but useful chunk of the human skull — basically it protects your inner ear. But that’s not all it protects. In recent years, genetic scientists working to extract DNA from ancient skeletons have discovered that, thanks to the extreme density of a particular region of the petrous bone (the bit shielding the cochlea, since you ask), they could sometimes harvest 100 times more DNA from it than from any other remaining tissue.

Now this somewhat macabre innovation may well resolve one of the most heated debates about the history of India.

And, from me, 3 strands of ancestry.

Nothing new for close readers. I would caution

1) Many Hindu nationalists really don’t care and are not perturbed by these findings. I know, because I know them.

2) I don’t know if the paper is going to be published soon. It may, but we’ve been waiting two years now.

August 29, 2018

On the genetics of Bengal and Southeast Asia

Filed under: Genetics — Razib Khan @ 10:57 pm

Over at my other weblog, a genetics post some readers might have an interest in. I think in the near future I’ll be talking more about the genetics of Southeast Asians and how they were influenced by Indians. Long story short: there’s a significant Indian genetic impact in many areas of Southeast Asia that can’t be ascribed to colonialism. Rather, the spread of Indian culture in the region was probably catalyzed by Indians….

August 26, 2018

The coming genetic invasion of history, and the rage to come

Filed under: Genetics,History — Razib Khan @ 10:42 am

About ten years ago I reviewed Bryan Sykes’ book Saxons, Vikings, and Celts: The Genetic Roots of Britain and Ireland. It was what it was, a product of the Y/mtDNA era. Therefore, there were a fair number of conclusions which in hindsight turn out to be wrong. Sykes, and other genetic historians, such as Stephen Oppenheimer, have annoyed historians for years with their genetic imperialism. More frequently, genetic research has been an accent or inflection on historical work. Peter Heather has integrated some genetic results in his earlier books, though you can ignore those and still obtain the general conclusions.

The recent work on near antiquity is a hint that that is going to be blown apart. Ancient DNA in the historical period has been a slow simmer for a while now. The reason is simple: ancient DNA returns more on the investment for prehistory, where there aren’t historical documents. Until recently ancient DNA techniques were expensive in a variety of ways. The industrial process described in Who We Are and How We Got Here is going to change that.

In the near future, a large number of projects are going to surface which test hypotheses and conjectures offered by historians.

You would think that testing hypotheses, generally with demographic predictions, would be something that historians would welcome. The problem is that the test will mean some scholars are going to turn out to be wrong. People who spent decades building up a particular model or understanding of the past are going to have that torn away from them.

The normal human reaction is to get defensive. But the problem is that many historians are not well trained in genetic methods. In fact, many geneticists are not well trained in the abstruse statistical methods developed by scholars in ancient DNA.

We’ve seen some of the same from archaeologists. But archaeologists had models which were, to be frank, more speculative than those historians cling to. Even if a particular historical model may be wrong, it is likely there are reasonable grounds to have held on to that position. If ancient DNA falsifies it, the reaction will be even more strident, I suspect.

Of course, geneticists need the help of historians. So when the bad feelings clear I think the synthesis will get us to a better understanding of the past.

August 15, 2018

The Insight Show Notes: Episode 32, So you want to be a geneticist…

Filed under: anthropology,Archaeology,Genetics,science — Razib Khan @ 5:45 pm
Drosophila

This week on The Insight (Apple Podcasts, Stitcher and Google Play) we talk to an “early career” geneticist, Austin Reynolds. A graduate of Indiana University and University of Texas-Austin, he is currently a post-doctoral fellow at University of California-Davis.

Alfred H. Sturtevant in his own “fly lab”

As a field, genetics is officially a bit over a century old, though Gregor Mendel made his key discoveries fifty years before. Since the year 2000 genetics has undergone a revolution driven by sequencing technology and more powerful computing. Around 2010, a different revolution began, which Austin has been a part of, involving the synthesis of archaeology and genetics with the field of ancient DNA.

The first ancient whole-genome analysis, Ancient human genome sequence of an extinct Palaeo-Eskimo. Also, the Neanderthal paper which revolutionized our understanding of our relation to this lineage.

An excellent review of the state of the current research, Ancient Human Genomics: The First Decade. And a preview of the future, Tales of Human Migration, Admixture, and Selection in Africa.

David Reich’s book Who We Are and How We Got Here is a good primer on ancient DNA and population genetics. Highly accessible to the lay audience without sacrificing any of the scientific content.

Loci associated with skin pigmentation identified in African populations.

Nuclear DNA sequences from the Middle Pleistocene Sima de los Huesos hominins.

On career issues, Track the fate of postdocs to help the next generation of scientists.

Interested in learning where your ancestors came from? Check out Regional Ancestry by Insitome to discover various regional migration stories and more!


The Insight Show Notes: Episode 32, So you want to be a geneticist… was originally published in Insitome on Medium, where people are continuing the conversation by highlighting and responding to this story.

August 6, 2018

Genetics is not about “dunking” on Hindu nationalists

Filed under: Aryans,Genetics — Razib Khan @ 2:17 pm

I need to weigh in real quick about something I’ve been noticing: geneticists don’t do genetics because they are excited about debunking views promoted by some Hindu nationalists and other Indians of a variety of political stripes. In fact, most non-Indian scientists (as in people who don’t live in India) are not totally savvy to the political and social context in South Asia, and so are not aware of how their results may be taken.

Unlike some scientists, I tend to take a dim view of those who assert we need to be careful about how results are going to be interpreted. Science is science. Interpretation is society. Therefore, I don’t particularly care if someone’s cherished views are refuted.

That being said, I have seen on Twitter and elsewhere exultation by anti-Hindu nationalists about new genetic findings, where individuals are wrong in many details of the implications. In the general broad sketch, they understand some implications, but they clearly haven’t paid attention to the science closely, nor do they comprehend it.

There are many examples of confusions and misimpressions. Here is one: the idea that “Vedic civilization” is exogenous to South Asia. I think we need to be very careful about this because I think one can make the case (and this is my position) that by the time most of the archaic mythos of the Indian Aryans crystallized these people were already highly Indianized. To put the political implications on the table, they were much more assimilated in their elite culture than the Muslim rulers of India or the British ever were (and let’s be honest, these are the comparisons people care about).

Rough back-of-the-envelope calculations on my part suggest that ~15% of the total ancestry of all South Asians is steppe derived. That is, about 50% ANI, which is 30% steppe (70% Indus Periphery). Is this a lot? Or not a lot?
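The arithmetic behind that figure is a single multiplication, sketched here in Python. The two input values are the ones quoted in the paragraph above; the variable and function names are only for illustration.

```python
def steppe_share(ani_share, steppe_within_ani):
    """Total steppe ancestry = (share that is ANI) x (steppe share within ANI)."""
    return ani_share * steppe_within_ani


ani_share = 0.50          # rough share of total South Asian ancestry that is ANI
steppe_within_ani = 0.30  # rough share of ANI that is steppe

total_steppe = steppe_share(ani_share, steppe_within_ani)
# The remaining 70% of ANI is Indus Periphery.
indus_periphery_via_ani = ani_share * (1 - steppe_within_ani)

print(f"steppe: {total_steppe:.0%}")
print(f"Indus Periphery via ANI: {indus_periphery_via_ani:.0%}")
```

Changing either input moves the total proportionally, so the ~15% should be read as an order-of-magnitude estimate, not a measured quantity.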

Interpretations differ.

August 3, 2018

Why I don’t accept the para-Munda hypothesis

Filed under: Genetics — Razib Khan @ 10:50 pm


There has been a discussion of Michael Witzel’s ideas in the comments below. Long familiar with his thesis that a Munda-like language was dominant in the northern Indus valley and in the Gangetic plain, I have also been long skeptical of it.

The reason for me is simple: I have leaned to the position that Munda are intrusive from Southeast Asia. Over the past 10 years my confidence in this proposition has grown. Let’s review:

1) They speak an Austro-Asiatic language. Most Austro-Asiatic languages are in Southeast Asia and seem to have spread from the north to the south.

2) The Munda have genetic signatures on the Y chromosome and some of their traits which are distinctive to East Asians and totally unrelated to any other South Asians. These genetic signatures are not found in South Asia outside of the Munda areas, and northeast India (i.e., they are not present in the Indus or Gangetic plains).

3) The most common Y chromosome of the Munda seems to be from Southeast Asia. That is, Southeast Asian lineages are basal and more diverse than the ones in India.

4) Genetic data from ancient DNA indicate that Austro-Asiatic people did not arrive in northern Vietnam until 4,000 years ago. To me, this implies they arrived in India well after 4,000 years ago.

5) We now suspect that Indo-Aryans arrived well after 4,000 years ago to the Indus valley. The Munda and Indo-Aryans could not have met in that region 3,500 years ago in any reasonable scenario.

Let’s assume that Witzel and others are correct that the early Indo-Aryans and the languages/toponyms of the Gangetic plains do not show Dravidian influence. How could that be? It could be that in the northern Indus valley a non-Dravidian language was dominant. Consider Burusho, a linguistic isolate. Mesopotamia was long divided between a Semitic north and a Sumerian south.

Second, the genetic data seem to suggest that some Indo-Aryan groups have more AASI and more steppe than groups to their west. North Indian Brahmins vs. Sindhis are an example. To me, this is indicative of the possibility that the Indo-Aryans pushed past areas where Dravidian languages were dominant, into areas where only AASI hunter-gatherers were flourishing. The lack of a Dravidian substrate is because the AASI groups the Indo-Aryans encountered were not Dravidian speakers.

 

August 2, 2018

Rakhigarhi sneak-peeks

Filed under: Genetics — Razib Khan @ 11:30 pm

Over at my other weblog, I am noting that the Indian press is finally starting to simply report the substantive contents of the Rakhigarhi results. As we all know the media can distort and misrepresent, so we need to be cautious and wait on the final paper, mostly because with that the authors can speak freely and without intermediation. But, I have heard through the grapevine the general results, and the results are exactly what Outlook India is currently reporting.

The Rakhigarhi samples themselves aren’t that interesting to me. But, Niraj Rai seems to be pushing the admixture event with Indo-Aryans after 1500 BC. This could be a misquote, or, it could be that the researchers from various groups now have enough data to fine-tune their parameters so as to narrow down various admixture timing events.

Ancient pigmentation pathways and modern genomics

Filed under: Forensics,Genetics,skin — Razib Khan @ 1:02 am
Piebald horses emerge out of common pigmentation pathways found in humans

Unlike most mammals humans are highly dependent on our sense of sight. This is due to the diurnal nature of many primates. Our ancestors foraged for bright fruit, and so we developed stereoscopic color vision. But eventually the human lineage left the forests of our ancestors, and ventured out to the savanna. We turned our eyes to other uses than detecting fruit, from hunting, to developing a keen eye for art.

Humans are pre-adapted toward color vision

It is not surprising then that humans have had a fixation on the color of our skin and the pelage of our domesticates. Skin is our largest organ, and our complexion is one of the best indicators of ill health.

Additionally, humans have utilized the skin as a canvas upon which to apply tattoos and other coloration so as to indicate group membership. And, as humans from very different geographic regions began to meet each other, any differences in pallor were salient indicators of difference and distinction. Whole peoples were defined by their color!

In the ancient Near East the Egyptians termed themselves red, while their neighbors to the south were black, and West Asians from the Levant were yellow. Greeks and Arabs distinguished between the ruddy peoples of the north, and the black and brown peoples to the south, with their own ethnicity often defined as being at some sort of equipoise.

Nubians were depicted accurately by the ancient Egyptians

And yet for such an important trait, the genetic elucidation of skin color, and pigmentation more generally, has evaded us until very recently. To be fair, the genetic elucidation of most traits in humans evaded us until the last decade or so, because we did not have genomic tools to explore the whole range of possible genetic sites.

In 2003 the evolutionary biologist Armand Leroi wrote in the afterword of his book Mutants that it was surprising that geneticists were still unclear about what underlay normal variation on the trait of human skin color. This passage was written at an opportune moment. In 2006 a review paper was published, A golden age of pigmentation genetics, which reflected the fact that much had changed since Leroi had written that passage just three years before.

Through analysis of British mixed-race pedigrees geneticists in the 1950s concluded that skin color was controlled by many genes, but that much of the variation was localized to only a few loci. That is, variation on a few genes had a large impact. This means that genomic methods pioneered in the 2000s were well placed to discover the genetic basis of the variation of the trait. If the impact of the mutation was large, then you didn’t need a large sample size to detect it.
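A rough way to see why large effects need only small samples is the standard normal-approximation power calculation, sketched below in Python. It assumes the required sample size scales as (z_alpha + z_power)^2 divided by the fraction of trait variance the locus explains, a textbook approximation that is reasonable only when that fraction is small. The threshold of 5e-8 is the conventional genome-wide significance level. The two variance-explained values are hypothetical and are not estimates for any real pigmentation gene.

```python
import math
from statistics import NormalDist


def samples_needed(variance_explained, alpha=5e-8, power=0.8):
    """Approximate sample size to detect a locus explaining the given
    fraction of trait variance, at two-sided significance `alpha`
    with the requested statistical power."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return math.ceil((z_alpha + z_power) ** 2 / variance_explained)


# Hypothetical effect sizes: one large-effect locus, one tiny-effect locus.
large_effect_n = samples_needed(0.20)
small_effect_n = samples_needed(0.001)
print(large_effect_n, small_effect_n)
```

Under this approximation the required sample size is inversely proportional to the variance explained, so a locus with two hundred times the effect needs roughly one two-hundredth of the sample.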

75% of the variation in eye color in Europeans is due to one gene

And so they have. Researchers now know that about half the variation in skin color across populations is due to variation on about ten or so genes. The other half is mostly distributed across the genome. Additionally, they know that the gene that is correlated with blue eye color also affects skin color. Similarly, the gene that causes much of the blondness in Northern Europe is also correlated with skin color. The pigmentation characteristics are usually correlated together. Skin, hair and eyes are all often controlled by the same set of genes.

Though East Asians and Europeans achieve light skin through different mutations, it is also the case that those mutations are found on an overlapping set of genes. Pigmentation pathways are highly conserved in human populations. The wheel is always reinvented in the same way. In fact, the same genes show up over and over across vertebrates. The genetic mutation that results in blonde hair causes the piebald pelage in horses. The mutations associated with red hair in humans are found in the gene that is important in mouse coat color. The gene responsible for much of the difference in pigmentation between Europeans and Africans also has a lightening effect in zebrafish.

There is a great deal to be done to understand the genetic basis of many diseases and complex behavioral traits. But with pigmentation, genomics has yielded incredible results, producing forensic applications with utility in a wide range of contexts. This is because tens of thousands of years have produced humans who come in all colors, but through simple fine-tuning of the pigmentation pathways which vertebrates had utilized for hundreds of millions of years.

Skin color is a complex topic with numerous historical and anthropological layers. But when it comes to genetics it’s actually surprisingly simple.

You can see your skin, but are you curious about what your genes say about your pigment? Check out Neanderthal by Insitome to learn more!


Ancient pigmentation pathways and modern genomics was originally published in Insitome on Medium, where people are continuing the conversation by highlighting and responding to this story.

July 25, 2018

The Insight show notes: episode 30, Genetics and educational attainment

Filed under: Education,Genetics,Psychology — Razib Khan @ 3:47 pm

This week Razib and Spencer discussed the relationship between educational attainment and genetics on The Insight (Apple Podcasts, Stitcher and Google Play) with James Lee, lead author of Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals (published in Nature Genetics).

Here are some more resources: FAQs about “Gene discovery and polygenic prediction from a 1.1-million-person GWAS of educational attainment”. The Atlantic and The New York Times also covered the paper. There was also an op-ed in The New York Times, Why Progressives Should Embrace the Genetics of Education.

The three laws of behavior genetics and the fourth law of behavior genetics are both mentioned. The study was a meta-analysis of genome-wide association studies (GWAS), and may have been the largest GWAS published to date.

Much of the discussion centered around intelligence. The podcast with Stuart Ritchie was cited as a useful primer (remember to subscribe with Apple Podcasts, Stitcher and Google Play). You might want to check out Ritchie’s book, Intelligence.

Population stratification was mentioned. Martin et al., and two preprints, Berg et al. and Sohail et al., tackle this issue in relation to disease and height, and how it confounds our understanding. Lee discussed LD score regression as a way to account for stratification in this particular analysis.

There was extensive discussion of the concept of heritability, the proportion of the variation in a trait that is explained by genetic variation.

The Social Science Genetic Association Consortium (SSGAC) and its research projects were referenced extensively.

Each allele seems to have an effect of ~1 week of education. The authors returned more than 1,000 statistically significant markers.

Spencer brought up the “omnigenic” model. This comes from Boyle et al., An Expanded View of Complex Traits: From Polygenic to Omnigenic.

James mentioned some of Camille Benbow’s work, in particular Life Paths and Accomplishments of Mathematically Precocious Males and Females Four Decades Later.

Interested in learning where your ancestors came from? Check out Regional Ancestry by Insitome to discover various regional migration stories and more!


The Insight show notes: episode 30, Genetics and educational attainment was originally published in Insitome on Medium, where people are continuing the conversation by highlighting and responding to this story.

July 23, 2018

The genetics of education

Filed under: Education,Genetics,Intelligence,Psychology — Razib Khan @ 11:49 am
Yale University

In the modern world, obtaining an education is a rite of passage. Not only does education provide one with skills useful for the modern economy, but it also helps to form one’s values and socializes one with peers who go through the same life experiences. Education isn’t just learning about various disciplines; it is a way to learn how to live in the modern world.

It is a topic which intersects with sociology, politics, and even ethics. As it turns out, education, or the attainment thereof, also intersects with genetics. This follows naturally from the first law of behavior genetics, “all human behavioral traits are heritable.” By “heritable,” geneticists refer to the fact that variation of a trait correlates with variation in genes. That variation tracks a causal relationship — so that genetic variants in some way cause a particular outcome.

This is easy enough to illustrate with an example. Imagine a genetic variant that changes the production of a biochemical that impacts whether someone is hyperactive or not. Hyperactivity is a behavioral characteristic with a lot of variables. Someone who drinks too much coffee will exhibit hyperactivity. But it is surely true that some of the variation in these personality traits is due to cognitive neurological differences, some of which are in turn due to genetic differences between people. We may not be “born that way,” but we are probably “disposed to be that way.”

That’s a lot of caveats, and that is accounted for in the third law of behavior genetics: “a substantial portion of the variation in complex human behavioral traits is not accounted for by the effects of genes or families.” When it comes to behavior, environment matters. What constitutes “environment” is not always clear, but any understanding of the genetic basis of behavioral variation needs to account for the fact that much of behavioral variation has nothing to do with genetics.
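
To make the idea of "variation explained by genes" concrete, here is a minimal Python sketch. It assumes a toy model in which a trait is simply the sum of a genetic contribution and an environmental one, drawn independently. The two spreads (GENETIC_SD and ENVIRONMENT_SD) are made-up numbers for illustration, not estimates for any real trait.

```python
import random
from statistics import pvariance

rng = random.Random(42)
N = 20_000
GENETIC_SD = 1.0      # hypothetical spread of genetic contributions
ENVIRONMENT_SD = 1.5  # hypothetical spread of everything else

# Each person's trait is a genetic part plus an independent environmental part.
genetic = [rng.gauss(0.0, GENETIC_SD) for _ in range(N)]
environment = [rng.gauss(0.0, ENVIRONMENT_SD) for _ in range(N)]
trait = [g + e for g, e in zip(genetic, environment)]

# Heritability here is the share of trait variance that tracks genetic variance.
heritability = pvariance(genetic) / pvariance(trait)
print(f"share of trait variation tracking genetic variation: {heritability:.2f}")
```

In this toy setup most of the variation is environmental, yet the trait is still heritable in the geneticist's sense. Real traits are messier, since genes and environments are not independent of each other.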

Galton’s classic illustration of parent-child correlation on height

And when it comes to educational attainment, there is obviously no one “gene for education.” Whether or not you obtain a degree is impacted by many factors, from family encouragement and resources to inspirational teachers, intelligence, and your own conscientiousness. But some of these characteristics, in particular the ones having to do with intelligence and personality, are impacted by your genes.

It has long been known through indirect methods that intelligence and personality are heritable. Identical twins are much more similar on these characteristics than conventional siblings, and relatives are much more similar than non-relatives. But, finding the biophysical genetic basis has been difficult because of the fourth law of behavior genetics: “A typical human behavioral trait is associated with very many genetic variants, each of which accounts for a very small percentage of the behavioral variability.”

A traditional way for geneticists to discover the cause of a trait is to target particular genetic locations and see if they are associated with the trait in question. This “candidate gene” method has been useful for many diseases, where a single defective mutation is responsible for much of the cause of the disease. But, it has been an utter failure in behavior genetics because of the fourth law of behavior genetics. To establish a connection between a genetic variant and a behavioral trait requires enormous sample sizes and a good knowledge of the human genome.

Until the year 2000, we didn’t have the sequence of a human genome, and until the past decade, dense assays of human genomic variation were expensive. This meant that studies were limited to small sample sizes and only a few genes. Most of the published results did not replicate, because the reported effects of the gene on the trait in question were not real.

Recently, all that has changed. Thanks to “next generation sequencing” and “chip technology,” researchers now have access to hundreds of thousands of markers in any given person, and cheaply at that. This cost-effectiveness allows for an increase in sample size, as many more people can be tested. This shatters the barriers implied by both the third and fourth laws of behavior genetics: small effect sizes no longer impede discovering ‘the needles in the haystack’ of complex traits. Bigger sample sizes and more subtle statistical methods are producing results that only a few years ago would have seemed fantastical.
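
Why do tiny effects demand huge samples? A rough Python sketch can show the scaling. It assumes a textbook normal approximation for testing a single variant, the conventional genome-wide significance threshold of 5e-8, and 80% power. The function name and the example effect sizes are illustrative assumptions, not figures from any of the papers discussed here.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(variance_explained, alpha=5e-8, power=0.8):
    """Approximate N needed to detect one variant that explains the given
    fraction of trait variance (normal approximation, two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    r2 = variance_explained
    return ceil((z_alpha + z_power) ** 2 * (1 - r2) / r2)

# Hypothetical effect sizes: 1%, 0.1% and 0.01% of trait variance.
for r2 in (0.01, 0.001, 0.0001):
    n = required_sample_size(r2)
    print(f"variant explaining {r2:.2%} of variance: N of about {n:,}")
```

Under these assumptions the required sample grows roughly in inverse proportion to the variance a variant explains, so a variant explaining a hundredth of a percent of the variation needs a sample in the hundreds of thousands.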

A new paper in Nature Genetics illustrates this starkly, Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. The authors identified 1,271 independent genome-wide-significant SNPs. This is a big achievement, considering that five years ago a paper with ~125,000 individuals identified just 3 SNPs that were significant for this trait!

Though it is hard to generalize about 1,000+ genetic variants, the figure to the left illustrates that the genes that these variants are found in are highly over-expressed in the nervous system. This is exactly what you see in most genetic analyses of complex traits that are behavioral. The genetic “hits” are found disproportionately in genes that control variation in neurological function because behavior is downstream of brain function. To be fair, many genes are expressed in the brain, so that alone would not be a surprise. Rather, the authors compared each gene’s expression level to that of the typical gene.

Curiously, these hits are not particularly over-expressed in genes associated with the development of glial cells, those cells in the central nervous system which are not neurons. Because these cells form the tissue which scaffolds the connections between neurons, the authors suggest that this might mean that differences in cognitive ability between individuals may not be a function of “transmission speed.” This highlights the fact that these sorts of abstruse statistical analyses ultimately aim to uncover underlying biological phenomena.

And yet such a paper, with over 1 million samples from numerous cohorts, will have to get into the statistical weeds. One of the major issues that crops up in these analyses is “stratification.” This means that the genetic variation in the sample is correlated with variables such as geographical population structure. Therefore, some of the positive hits for any of these sorts of analyses might easily be picking up the overall population genetic variation and differences between groups, which may not have a genetic basis at all (e.g., British tend to drink tea, Americans tend to drink coffee).
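
Stratification is easy to reproduce in a toy simulation. The Python sketch below assumes two hypothetical populations that differ both in the frequency of a marker and in their average trait value, with the marker having no effect at all on the trait. All the numbers are invented for illustration.

```python
import random
from statistics import fmean

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate_population(n, allele_freq, trait_mean, rng):
    """Genotypes (0, 1 or 2 copies) and a trait that does NOT depend on genotype."""
    genotypes = [(rng.random() < allele_freq) + (rng.random() < allele_freq)
                 for _ in range(n)]
    traits = [rng.gauss(trait_mean, 1.0) for _ in range(n)]
    return genotypes, traits

rng = random.Random(0)
# Hypothetical populations: different allele frequency AND different trait mean.
g_a, t_a = simulate_population(5000, 0.2, 0.0, rng)
g_b, t_b = simulate_population(5000, 0.8, 1.0, rng)

pooled = slope(g_a + g_b, t_a + t_b)  # analysis that ignores population structure
within_a = slope(g_a, t_a)            # analysis inside population A only
within_b = slope(g_b, t_b)            # analysis inside population B only

print(f"pooled slope: {pooled:.2f}")
print(f"within-population slopes: {within_a:.2f}, {within_b:.2f}")
```

The pooled analysis reports a clear "effect" of the marker, while within each population the slope hovers around zero. The marker is only a proxy for group membership, which is exactly the trap the authors need to avoid.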

Empirical genetic relationship of siblings

To get around this, the authors look at a sub-sample of 20,000 sibling-pairs. Many of the issues presented by population stratification do not apply within families. Families have the same broad genetic background, and also control for many environmental differences (since siblings are raised in the same family and socioeconomic context). But, there is still genetic variation among siblings, and some of this variation is responsible for variation in traits between siblings. After all, height tends to run in families, but the difference in height between same-sex siblings is not usually due to differences in nutrition (at least in the developed world).

Looking at the associations between genetic variation and educational attainment within families the authors found “that within-family effect sizes are roughly 40% smaller than GWAS effect sizes.” In other words, there are factors that seem to result in the overestimation of the genetic effects on educational attainment within the broader population. The authors note that the same does not apply to height.

What might account for this then? One possibility is that some of the genes that a parent has, but does not transmit to the offspring, might result in a more beneficial environment. This is often termed the “parental effect.”
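
The logic of the sibling comparison can be sketched with another toy simulation in Python. It assumes each person has a single genetic score, that siblings share a family average plus their own random deviation, and that the family average also shapes the home environment. BETA and GAMMA are hypothetical values picked so that the toy example shows a gap of about 40%; they are not estimates from the paper.

```python
import random
from statistics import fmean

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(1)
BETA = 0.3   # hypothetical direct effect of a person's own genetic score
GAMMA = 0.4  # hypothetical effect of the family environment tied to the family's average score

pop_scores, pop_traits = [], []
diff_scores, diff_traits = [], []
for _ in range(20_000):  # sibling pairs
    family_mean = rng.gauss(0.0, 0.5 ** 0.5)
    sibs = []
    for _ in range(2):
        score = family_mean + rng.gauss(0.0, 0.5 ** 0.5)
        trait = BETA * score + GAMMA * family_mean + rng.gauss(0.0, 1.0)
        sibs.append((score, trait))
        pop_scores.append(score)
        pop_traits.append(trait)
    # Differences between siblings cancel out everything shared by the family.
    diff_scores.append(sibs[0][0] - sibs[1][0])
    diff_traits.append(sibs[0][1] - sibs[1][1])

population_slope = slope(pop_scores, pop_traits)
within_family_slope = slope(diff_scores, diff_traits)
print(f"population estimate:    {population_slope:.2f}")
print(f"within-family estimate: {within_family_slope:.2f}")
```

The population-wide estimate absorbs the family environment and comes out larger than the direct effect, while the sibling-difference estimate recovers something close to BETA. This is only one mechanism that could produce such a gap; the sketch does not show that it is the mechanism at work in the real data.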

The paper looked at more than just educational attainment. With sample sizes in the hundreds of thousands they also looked at cognitive performance and mathematical ability (self-reported). Using the same methods as for educational attainment, the authors predicted around 10% of the variance.

Of course there are limitations. The sample size is large, but not diverse genetically. The sample was overwhelmingly of European origin, and the authors found that their method could predict less than 5% of the variance in African Americans. This is not surprising, because genome-wide associations often do not predict well across different populations.

Additionally, there is the reality that these methods focus on common variation within populations. The heritability of most behavioral traits using more indirect classical methods is much higher than this ~10% of variation explained would imply, so there is still a genetic component to be accounted for. Perhaps this variation is found in rare genetic variants, which are not explored in this sort of research.

Ultimately, we may look back at this 1 million-person analysis as the first in a scholarly tradition of massive GWAS sample sizes. Genomics is cheap enough that it is possible that genetic sample sizes in the range of a billion are feasible within 15 years. That will probably require a whole different set of esoteric methods but will probably yield many novel results.


The genetics of education was originally published in Insitome on Medium, where people are continuing the conversation by highlighting and responding to this story.

July 18, 2018

The Insight show notes: Episode 29, The Genetics of China, Han & Beyond

Filed under: China,Genetics,History,science — Razib Khan @ 3:39 pm

This week Razib and Spencer discussed the genetics and history of China on The Insight (iTunes, Stitcher and Google Play).

Chinese history looms large in the podcast, and there are many books one can read on the topic. In particular, John King Fairbank’s China: A New History is one of the best comprehensive treatments. To understand what’s going on in China today it’s probably good to have at least one survey book or course of its past under your belt!

For the purposes of this episode, though, you can just check out a list of Chinese dynasties if you want a visual outline of the timeframe and period that Razib and Spencer covered in the podcast.

In relation to the genetics alluded to, for genome-wide patterns of relatedness across Chinese regions, see Genetic Structure of the Han Chinese Population Revealed by Genome-wide SNP Variation. This 2009 paper uses 350,000 markers from 10 provinces to perform exploratory analysis of genetic structure within China.

More recently, A comprehensive map of genetic variation in the world’s largest ethnic group — Han Chinese, is a preprint that utilizes whole-genome sequencing to assemble an even larger dataset.

For maternal mtDNA, see Large-Scale mtDNA Screening Reveals a Surprising Matrilineal Complexity in East Asia and Its Implications to the Peopling of the Region. For Y chromosomes on the paternal side, see Y Chromosomes of 40% Chinese Descend from Three Neolithic Super-Grandfathers.

To get a sense of how China’s population has grown, as inferred from genetics, see Robust and scalable inference of population history from hundreds of unphased whole-genomes. The figure to the left shows the “Out of Africa” bottleneck, and then demographic expansion in the last 50,000 years. “CHB” represents Chinese sampled in Beijing. Along with “GIH”, who are Gujaratis, and “CEU”, a Northern European American cohort from Utah, the Chinese exhibit explosive growth in the last 10,000 years.

There is extensive discussion of the environment and geography of China, and how it related to agricultural expansion and migration southward. The Retreat of the Elephants by Mark Elvin chronicles this process of the expansion of rice farming into the jungles of southern China through natural history and human geography.

Though most people are aware of the Mongols, fewer are cognizant of the interregnum between the Han and Sui-Tang, when many steppe nomads settled in China, Buddhism took root, and many elite Han lineages migrated from the north to the south. For those curious about this period, China Between Empires: The History of the Northern and Southern Dynasties is an excellent introduction accessible to all.

Finally, there was extensive discussion about the future of Chinese science. For a deeper exploration of that, see A Chinese Province Is Sequencing One Million of Its Residents’ Genomes and China Has Already Gene-Edited 86 People With CRISPR.


The Insight show notes: Episode 29, The Genetics of China, Han & Beyond was originally published in Insitome on Medium, where people are continuing the conversation by highlighting and responding to this story.
