Razib Khan One-stop-shopping for all of my content

November 16, 2017

PCA remains the swiss-army-knife to explore population structure

Filed under: PCA,Population Structure,Population Substructure — Razib Khan @ 5:14 pm


I put up a poll without context yesterday to gauge which methods people preferred when it came to population genetic structure.* PCA came out on top with a plurality. More explicitly model-based methods, such as Structure/Admixture, came in right behind it. Curiously, the oldest method, pairwise Fst comparisons (greater Fst means more variance partitioned between the groups), and Treemix, the newest method, have lower proportions of adherence.

Why is PCA so popular? Unlike Treemix or pairwise Fst you don’t have to label populations ahead of time. You just put the variation in there, and the individuals shake out by themselves. Pairwise Fst and Treemix both require you to stipulate which population individuals belong to a priori. This means you often end up using PCA or some other method to do a pre-analysis stage. Structure/Admixture model-based methods make you select the number of distinct populations you want to explore, and often assume an underlying model of pulse admixture between populations (Treemix does this too when you have an admixture edge).

PCA is also better at smoking out structure than Structure/Admixture for the same number of markers, and it’s pretty fast as well. This is why the first thing I do when I get population genetic data where I want to explore structure is run a PCA and look for clusters and outliers. After this pre-analysis stage, I can move on to other methods.
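To make the "no labels needed" point concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the simulated allele frequencies, the sample sizes, and the helper names `simulate_genotypes` and `first_pc_scores` are hypothetical, and simple power iteration stands in for the optimized eigen-solvers that dedicated PCA software uses. The population labels are never passed to the PCA step; the two groups separate along the first component on their own.

```python
import random


def simulate_genotypes(n_per_pop=20, n_snps=500, shift=0.2, seed=1):
    """Simulate 0/1/2 genotypes for two populations (hypothetical data).

    Population B's allele frequency at each SNP is population A's
    frequency shifted up or down by `shift`.
    """
    rng = random.Random(seed)
    freqs_a = [rng.uniform(0.2, 0.8) for _ in range(n_snps)]
    freqs_b = [p + rng.choice((-shift, shift)) for p in freqs_a]

    def draw_individual(freqs):
        # Two independent allele draws per SNP give a 0/1/2 genotype.
        return [(rng.random() < p) + (rng.random() < p) for p in freqs]

    pop_a = [draw_individual(freqs_a) for _ in range(n_per_pop)]
    pop_b = [draw_individual(freqs_b) for _ in range(n_per_pop)]
    return pop_a + pop_b  # rows are individuals, columns are SNPs


def first_pc_scores(genotypes, n_iter=200, seed=0):
    """Return a unit vector proportional to each individual's PC1 score."""
    n_ind = len(genotypes)
    n_snps = len(genotypes[0])
    # Center each SNP (column) at its mean genotype.
    means = [sum(row[j] for row in genotypes) / n_ind for j in range(n_snps)]
    centered = [[row[j] - means[j] for j in range(n_snps)] for row in genotypes]
    # Individual-by-individual Gram matrix of the centered genotypes.
    gram = [[sum(x * y for x, y in zip(r1, r2)) for r2 in centered]
            for r1 in centered]
    # Power iteration converges to the leading eigenvector of the Gram
    # matrix, which is proportional to the individuals' PC1 scores.
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in range(n_ind)]
    for _ in range(n_iter):
        vec = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = sum(v * v for v in vec) ** 0.5
        vec = [v / norm for v in vec]
    return vec


genotypes = simulate_genotypes()
scores = first_pc_scores(genotypes)
# Individuals 0-19 come from population A, 20-39 from population B.
print("PC1, population A:", [round(s, 2) for s in scores[:20]])
print("PC1, population B:", [round(s, 2) for s in scores[20:]])
```

In this toy setup the two groups land on opposite sides of zero on the first component, which is the "clusters shake out by themselves" behavior described above.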


* I stipulated “genotyped-based” methods to set aside some of the new-fangled techniques, which often assume phasing and analysis of haplotypes, such as Chromopainter or explicit local ancestry deconvolution (some local ancestry deconvolution does not require phased haplotypes, but the most popular do).

January 28, 2012

Population structure using haplotype data

The Pith: New software which gives you a more fine-grained understanding of relationships between populations and individuals.

According to the reader survey >50 percent of you don’t know how to interpret PCA or model-based (e.g., ADMIXTURE) genetic plots, so I am a little hesitant to point to this new paper in PLoS Genetics, Inference of Population Structure using Dense Haplotype Data, as it extends the results of those earlier methods. But it’s an important paper, and at some point I’ll start using their software. The “big picture” is that earlier methods left “some information on the table.” That’s partly due to the fact that they were developed (or in the case of PCA leveraged, as it’s a very general technique) in an era where very dense marker data sets were not available (today we’re shifting to full genome sequences in many cases!). The information left on the table would be haplotype structure. Genetic variation in a concrete form manifests as sequences along a line, many of them physically connected. These correlations of nearby variant markers represent haplotypes of great interest, because they are excellent clues to admixture or divergence events across populations. In contrast, the older methods were looking at variation from marker to marker, each in turn independently, which collapses some of the important genomic structure that we can now inspect (in fact, linkage disequilibrium due to these correlations can distort some of the results in the older methods, so you want to “thin” your marker set).
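As a rough illustration of what "thinning" a marker set can mean, here is a minimal sketch of greedy pruning on pairwise correlation. It is an assumption-laden toy, not the procedure of any particular tool: the function names `r_squared` and `thin_markers`, the threshold of 0.5, and the window counted in kept markers (rather than physical distance) are all hypothetical choices.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic marker carries no correlation signal
    return cov * cov / (var_x * var_y)


def thin_markers(genotypes, r2_threshold=0.5, window=50):
    """Greedy thinning: keep a marker only if it is not strongly
    correlated with any of the last `window` markers already kept.

    `genotypes` is a list of individuals, each a list of 0/1/2 calls,
    with markers assumed to be ordered along the chromosome.
    """
    n_markers = len(genotypes[0])
    columns = [[row[j] for row in genotypes] for j in range(n_markers)]
    kept = []
    for j in range(n_markers):
        recent = kept[-window:]
        if all(r_squared(columns[j], columns[k]) < r2_threshold for k in recent):
            kept.append(j)
    return kept


# Toy data: marker 1 duplicates marker 0; marker 2 is uncorrelated with it.
toy = [
    [0, 0, 1],
    [1, 1, 2],
    [2, 2, 0],
    [1, 1, 0],
    [0, 0, 1],
    [2, 2, 2],
]
print(thin_markers(toy))  # → [0, 2]
```

The redundant marker is dropped because it adds no independent information, which is the distortion the marker-by-marker methods are trying to avoid.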

Let me make this concrete for you. On 23andMe you can see where your friends shake out on a PCA plot using the HGDP data set as a reference. What this means is that the HGDP data set is used to generate independent dimensions of genetic variation. As is the usual case in these analyses the largest dimension separates Africans from everyone else, and the second largest dimension separates Asians from Europeans and Africans. 23andMe customers are then projected upon this variation, so you can get a sense where you are positioned in the clusters. To the left is a zoom in on the section for Central/South Asians. You can see that one of my friends, highlighted with a green color, falls almost perfectly in the Uygur cluster. According to ancestry estimates my friend is 50 percent Asian and 50 percent European. The “representative” Uygur in the 23andMe chromosome painting gives about the same results. But these are total genome estimates. The historical nature of my friend’s admixture and that of the Uygur woman is very different, as one can see in the below figure.

 

My friend is to the right, and the Uygur woman is to the left. Why the big difference? My friend has an East Asian parent and a European parent. The Uygur woman is the product of a marriage between Uygurs, a population which is due to admixture between East Asians and Europeans one to two thousand years ago. Recombination has broken apart the perfect linkage between European and East Asian regions among the Uygurs. Obviously this isn’t the case with my friend, as recombination has had no time to generate alternative sequences of ancestry. This is critical information which genome-wide estimates displayed on PCA or ADMIXTURE will miss out on.
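The contrast can be sketched with a toy simulation. This is a deliberately crude model, and every parameter is an assumption: a single pulse of admixture, crossovers placed as a Poisson process at one per Morgan per generation, each segment's ancestry drawn independently, a 1.5 Morgan chromosome, and 60 generations as a stand-in for "one to two thousand years." The function name `ancestry_tracts` is hypothetical.

```python
import random


def ancestry_tracts(generations, length_morgans=1.5,
                    east_asian_fraction=0.5, seed=0):
    """Simulate ancestry tracts along one chromosome copy.

    Returns a list of (ancestry_label, tract_length_in_morgans).
    """
    rng = random.Random(seed)
    # Breakpoints accumulated over `generations` of recombination.
    breakpoints = []
    position = 0.0
    if generations > 0:
        while True:
            position += rng.expovariate(generations)
            if position >= length_morgans:
                break
            breakpoints.append(position)
    edges = [0.0] + breakpoints + [length_morgans]

    tracts = []
    for start, end in zip(edges, edges[1:]):
        label = "East Asian" if rng.random() < east_asian_fraction else "European"
        if tracts and tracts[-1][0] == label:
            # Adjacent segments of the same ancestry form one longer tract.
            tracts[-1] = (label, tracts[-1][1] + (end - start))
        else:
            tracts.append((label, end - start))
    return tracts


recent = ancestry_tracts(generations=0)   # chromosome inherited from one parent
old = ancestry_tracts(generations=60)     # chromosome from an old admixed population
print("recent admixture:", len(recent), "tract")
print("old admixture:", len(old), "tracts, mean length",
      round(sum(length for _, length in old) / len(old), 3), "Morgans")
```

Both chromosomes can yield the same genome-wide ancestry fraction on average, yet the recent one is a single unbroken block while the old one is a fine mosaic.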

As for this particular paper and method, I want to point you to figure 5. The darker/bluish colors indicate higher coancestry estimates, and yellower colors lower ones. Red is in the middle. The diagonal tends to be blue/red because that represents populations’ correlations with themselves, which one would expect to be high. You can’t really read the labels, but I wanted to highlight the Italian and Sardinian blocks. Explanation below.

You can see an ADMIXTURE plot underneath the heat-map. What’s going on? Sardinians exhibit the hallmarks of an isolated population with a smaller effective population size, which has undergone more genetic drift than Italians over the same amount of time. This is naturally one reason that they “break out” rather quickly in ADMIXTURE and PCA. You see this in South Asia with the Kalash, who often emerge as their own cluster rather quickly, and separate out in a PCA as well. This is simply a function of their isolation and lower effective population size. Most of the people who use ADMIXTURE and PCA know this, but those reading these plots do not. Without that knowledge one can make incorrect inferences. The methods outlined in the paper allow one to observe these trends visually and immediately, while keeping broader worldwide correlations across populations in mind. This is a big step forward not only in data analysis, but in result visualization.
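The link between small effective population size and drift can be sketched with a basic Wright-Fisher simulation. The population sizes, number of generations, and the function name `drift_variance` are hypothetical choices for illustration; they are not estimates for Sardinians, Italians, or the Kalash.

```python
import random


def drift_variance(pop_size, generations=20, n_loci=100,
                   start_freq=0.5, seed=0):
    """Variance across loci of the allele frequency after neutral drift.

    Each generation, 2 * pop_size gene copies are resampled from the
    current frequency (a basic Wright-Fisher model, assumed here).
    """
    rng = random.Random(seed)
    n_copies = 2 * pop_size
    final_freqs = []
    for _ in range(n_loci):
        freq = start_freq
        for _ in range(generations):
            count = sum(rng.random() < freq for _ in range(n_copies))
            freq = count / n_copies
        final_freqs.append(freq)
    mean = sum(final_freqs) / n_loci
    return sum((f - mean) ** 2 for f in final_freqs) / n_loci


small = drift_variance(pop_size=50)    # stand-in for an isolated population
large = drift_variance(pop_size=500)   # stand-in for a larger population
print(f"variance after drift, N=50:  {small:.4f}")
print(f"variance after drift, N=500: {large:.4f}")
```

Starting from identical frequencies, the smaller population's allele frequencies wander much further in the same number of generations, which is why such groups tend to pull away as their own cluster.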

If you are more interested in this topic, the first author has a comparison of the various tools up. Both Dienekes and Eurogenes are using the new software. Get the software at PaintMyChromosomes.com!

Citation: Lawson DJ, Hellenthal G, Myers S, Falush D (2012) Inference of Population Structure using Dense Haplotype Data. PLoS Genet 8(1): e1002453. doi:10.1371/journal.pgen.1002453

July 27, 2011

How Chinese genetics is like Chinese food

Representatives of Szechuan and Shandong cuisine

The Pith: The Han Chinese are genetically diverse, due to geographic scale of range, hybridization with other populations, and possibly local adaptation.

In the USA we often speak of “Chinese food.” This is rather peculiar because there isn’t any generic “Chinese cuisine.” Rather, there are regional cuisines, which share a broad family similarity. Similarly, American “Mexican food” and “Indian food” also have no true equivalent in Mexico or India (naturally the novel American culinary concoctions often exhibit biases in the regions from which they sample due to our preferences and connections; non-vegetarian Punjabi elements dominate over Udupi, while much authentic Mexican American food has a bias toward the northern states of that nation). But to a first approximation there is some sense in speaking of a general class of cuisine which exhibits a lot of internal structure and variation, so long as one understands that there is an important finer grain of categorization.

Some of the same applies to genetic categorizations. Consider two of the populations in the original HapMap, the Yoruba from Nigeria, and the Chinese from Beijing. There are ~30 million ...

April 6, 2011

Africa in 12 ADMIXTURE chunks

Some have asked what the point is in poking around African population structure when Tishkoff et al. and Henn et al. have done such a good job in terms of coverage. First, it is nice to run your own analyses so you can slice & dice to your preference, and not rely on the constrained menu provided by others. There’s value in home cooking; you can flavor to your taste. Second, you never know what data people might leave on your doorstep. I’ve received the genotypes of three Somalis. Nothing too surprising, a touch more Cushitic than the Ethiopians in Behar et al., but interesting nonetheless.

Also, you can see how ADMIXTURE tends to come to weird conclusions in certain circumstances. Below is a K = 12 run with ~50,000 SNPs. I’ve added a few Behar et al. and HGDP populations to the Henn et al. set, and pruned a lot of the African groups which seem redundant in terms of information. I’ve added a few geographically informative labels as well.

Observe below that there is a Fulani cluster. I think this is pretty much an artifact. At K = 7 the Fulani have a majority component ...

February 1, 2011

A genomic map of human variation, where we’re at

Zack has started exploring the K’s of his merged data set for HAP. A commenter suggests that:

As you have begun interpreting the reference results, let me make a friendly warning: you have to keep in mind that most of the reference populations of ethnic groups are extremely limited in sample size (with only between 2 and 25 individuals) and from very obscure sources, and you should keep away from drawing conclusions about millions of people based on such limited number of individuals.

This seems a rather reasonable caution. But I don’t think such a vague piece of advice really adds any value. These sorts of caveats are contingent upon:

- The scope of the question being asked (i.e., how fine a grain is the variation you are attempting to measure going to be)

- The sample size

- The representativeness

- The thickness of the marker set (10 autosomal markers vs. 500,000 SNPs)


This isn’t a qualitative issue, easy to divide into “right” and “wrong.” Sometimes an N = 1 is very insightful. That’s why the whole genome of one Bushman was very useful. In fact, the whole genome of any random Sub-Saharan African, and the whole genome of any random non-African ...

January 29, 2011

“Asian” in all the right places


mtDNA haplogroup G1a2

The pith: In this post I examine the most recent results from 23andMe for my family in the context of familial and regional (Bengal) history. I also use these results to offer up a framework for the ethnogenesis of the eastern Bengali people within the last 1,000 years, and their relationship to other South Asian and Southeast Asian populations.

Since I received my 23andMe results last May I’ve been blogging about it a fair amount. In a recent post I inferred that perhaps I had a recent ancestor who was an ethnic Burman or some related group. My reasoning was that this explained a pattern of elevated matches on chromosomal segments with populations from southwest China in the HGDP data set. But now we have more than my genome to go on. This week I got the first V3 chip results from a sibling. And finally, yesterday the results from my parents came in. One thing that I immediately found interesting was my father’s mtDNA haplogroup assignment, G1a2. This came from his maternal grandmother, and as you can see it has a distribution which ...

December 14, 2010

To classify humanity is not that hard

In my post below I quoted my interview with L. L. Cavalli-Sforza because I think it gets to the heart of some confusions which have emerged since the finding that most variation on any given locus is found within populations, rather than between them. The standard figure is that 85% of genetic variance is within continental races, and 15% is between them. You can see some Fst values on Wikipedia to get an intuition. Concretely, at a given locus X in population 1 the frequency of allele A may be 40%, while in population 2 it may be 45%. Obviously the populations differ, but the small difference is not going to be very informative of population substructure when most of the difference is within populations.
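The 40% versus 45% example can be turned into an Fst value with a short calculation. This is a minimal sketch assuming a single biallelic locus, two equally weighted populations, and the heterozygosity-based definition Fst = (H_T - H_S) / H_T; the function name `fst_two_populations` is hypothetical, and the 99% versus 1% locus is an invented contrast for a variant near fixation in opposite directions.

```python
def fst_two_populations(p1, p2):
    """Fst for one biallelic locus in two equally weighted populations.

    Uses expected heterozygosity: Fst = (H_T - H_S) / H_T.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


print(round(fst_two_populations(0.40, 0.45), 4))  # → 0.0026
print(round(fst_two_populations(0.99, 0.01), 4))  # → 0.9604
```

At the first locus well under 1% of the variance lies between the populations, while at the second nearly all of it does. A genome-wide average can sit near 15% while individual loci range across this whole span.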

But there are loci which are much more informative. Interestingly, one controls variation on a trait which you are familiar with, skin color (unless you happen to lack vision). A large fraction (on the order of 25-40%) of the between population variance in the complexion of Africans and Europeans can be predicted by substitution on one SNP in the gene SLC24A5. The substitution has a major phenotypic effect, and, exhibits a great deal of between population variation. One variant is nearly fixed in Europeans, and another is nearly fixed in Africans. In other words the component of genetic variance on this trait that is between population is nearly 100%, not 15%. This illustrates that the 15% value was an average across the genome, and in fact there are significant differences on the genetic level which can be ancestrally informative. You can take this to the next level: increase the number of ancestrally informative markers to obtain a fine-grained picture of population structure. In the illustration above the top panel shows the frequencies at the SNP mentioned earlier on SLC24A5. The second panel shows variation at another SNP controlling skin color, SLC45A2. This second SNP is useful in separating South and Central Asians from Europeans and Middle Easterners, if not perfectly so. In other words, the more markers you have, the better your resolution of inter-population difference. This is why I found the following comment very interesting:

Razib’s final concession (that genetic variation exists) is revealing because I think that’s as far as the argument can really be taken. It’s a bit of a strawman, in that people who argue that race is entirely a social construct don’t actually deny that human genetic variation exists. What they deny is that there are non-arbitrary and mutually exclusive categories into which humans can be resolved. This is, I think, the point being made by the “Race by Fingerprints” etc. rhetorical device cited earlier.

In other words, it may be possible for any particular phenotypic trait or genetic locus to be resolved into a strictly cladistic system but humans, being an amalgam of such traits and loci, defy such resolution. So while the study of human genetic variation does, indeed, have “instrumental utility” the concept of biological races is, itself, an archaic relic.

As I noted below, the comment doesn’t make sense. Here is a PCA of world populations using 250,000 markers:

[Figure: PCA of world populations using 250,000 markers]

The relationships between individuals are hypothesis-free. That is, the two largest components of variance in the data just happen to produce clusters which neatly map onto geographic realities. If you think about this a little, it makes total sense: populations share a history of intermarriage, so over time they will develop population-specific distinctiveness. It may be true that most of the variance is within populations, but it is not difficult at all to discriminate populations, or generate clusters which are not arbitrary as a function of geography or social identity.

There are relationships which do not match intuition. Or at least intuition as it crystallized during the period of the rise of modern taxonomic science. The various phenotypically “black” peoples of the world, Africans, Melanesians, and some South Asians, do not cluster together. Rather, all non-Africans are separated from Africans by the largest component of variance within the data set. The traits used to make inferences of taxonomy in “folk biology” and early scientific attempts to generate a systematic tree of life in relation to the human races were not necessarily representative of total genome variation, which captures the evolutionary history of a population with greater accuracy and precision.

And obviously you don’t need 250,000 markers, let alone all ~3 billion base pairs in the human genome, to distinguish on the level of continental races/populations. A paper in 2002 laid out the parameters. δ is a measure of between population difference on genes.

[Images: sig1, sig2]

From the paper:

…we can estimate that about 120 unselected SNPs or 20 highly selected SNPs can distinguish group CA from NA, AA from AS and AA from NA. A few hundred random SNPs are required to separate CA from AA, CA from AS and AS from NA, or about 40 highly selected loci. STRP loci are more powerful and have higher effective δ values because they have multiple alleles. Table 3 reveals that fewer than 100 random STRPs, or about 30 highly selected loci, can distinguish the major racial groups. As expected, differentiating Caucasians and Hispanic Americans, who are admixed but mostly of Caucasian ancestry, is more difficult and requires a few hundred random STRPs or about 50 highly selected loci. These results also indicate that many hundreds of markers or more would be required to accurately differentiate more closely related groups, for example populations within the same racial category.
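The logic of combining many weakly informative markers can be sketched with a small simulation. It does not reproduce the paper's δ values or its marker counts. The per-marker frequency difference of 0.1, the use of known population frequencies, the likelihood-ratio assignment rule, and the function name `assignment_accuracy` are all assumptions made for illustration.

```python
import math
import random


def assignment_accuracy(n_markers, delta=0.1, n_test=400, seed=0):
    """Fraction of simulated individuals assigned to the right population
    using `n_markers` SNPs whose frequencies differ by `delta`."""
    rng = random.Random(seed)
    freqs_a = [rng.uniform(0.3, 0.7) for _ in range(n_markers)]
    freqs_b = [p + rng.choice((-delta, delta)) for p in freqs_a]
    # Per-allele log-likelihood ratios (population A versus B).
    llr_alt = [math.log(a / b) for a, b in zip(freqs_a, freqs_b)]
    llr_ref = [math.log((1 - a) / (1 - b)) for a, b in zip(freqs_a, freqs_b)]

    correct = 0
    for i in range(n_test):
        from_a = i % 2 == 0  # alternate the true population of origin
        freqs = freqs_a if from_a else freqs_b
        score = 0.0
        for p, up, down in zip(freqs, llr_alt, llr_ref):
            n_alt = (rng.random() < p) + (rng.random() < p)  # 0/1/2 genotype
            score += n_alt * up + (2 - n_alt) * down
        # A positive total score means the genotype is more likely under A.
        if (score > 0) == from_a:
            correct += 1
    return correct / n_test


for k in (1, 10, 100, 1000):
    print(k, "markers:", assignment_accuracy(k))
```

With a single marker the assignment is barely better than a coin flip, but accuracy climbs steadily as markers are added, because small per-locus differences accumulate across loci.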

The paper was written in 2002. Since then much has changed. Here is an image from a post from last summer:

[Image: village1]

People within European villages tend to be relatively closely related. Again, it is totally reasonable that given enough markers you could assign individuals to different villages with a high confidence. Concretely, person X may show up in the pedigree of individuals from village 1 ~100 times at a given generation, while the same person may show up in the pedigree of individuals from village 2 ~10 times at a given generation. This isn’t rocket science, the basic logic as to why populations shake out based on geography and endogamy patterns is pretty obvious when you think about it.

At about the same time as the above work, A. W. F. Edwards, a statistical geneticist, published a paper titled Lewontin’s Fallacy which took direct aim at the misunderstanding of the human Fst statistic and its relevance for classification. Here is Edwards answering why he wrote the article in 2002 (my co-blogger at GNXP, David B, is doing the questioning):

4. Your recent article on ‘Lewontin’s Fallacy’ criticises the claim that human geographical races have no biological meaning. As the article itself points out, it could have been written at any time in the last 30 years. So why did it take so long – and have you had any reactions from Lewontin or his supporters? [David B's question -R]

I can only speak for myself as to why it took me so long. Others closer to the field will have to explain why the penny did not drop earlier, but the principal cause must be the huge gap in communication that exists between anthropology, especially social anthropology, on the one hand, and the humdrum world of population and statistical genetics on the other. When someone like Lewontin bridges the gap, bearing from genetics a message which the other side wants to hear, it spreads fast – on that side. But there was no feedback. Others might have noticed Lewontin’s 1972 paper but I had stopped working in human and population genetics in 1968 on moving to Cambridge because I could not get any support (so I settled down to writing books instead). In the 1990s I began to pick up the message about only 15% of human genetic variation being between, as opposed to within, populations with its non-sequitur that classification was nigh impossible, and started asking my population-genetics colleagues where it came from. Most had not heard of it, and those that had did not know its source. I regret now that in my paper I did not acknowledge the influence of my brother John, Professor of Genetics in Oxford, because he was independently worrying over the question, inventing the phrase ‘the death of phylogeny’ which spurred me on.

Eventually the argument turned up unchallenged in Nature and the New Scientist and I was able to locate its origin. I only started writing about it after lunch one day in Caius during which I had tried to explain the fallacy across the table to a chemist, a physicist, a physiologist and an experimental psychologist – all Fellows of the Royal Society – and found myself faltering. I like to write to clear my mind. Then I met Adam Wilkins, the editor of BioEssays, and he urged me to work my notes up into a paper.

I have had no adverse reaction to it at all, but plenty of plaudits from geneticists, many of whom told me that they too had been perplexed. Perhaps the communication gap is still too large, or just possibly the point has been taken. After all, Fisher made it in 1925 in Statistical Methods which was written for biologists so it is hardly new. [my emphasis -R]

Richard Dawkins repeated Edwards’ argument in The Ancestor’s Tale. You can read Edwards’ full essay online. Also see p-ter’s lucid exposition at GNXP.

So far I’ve been talking mostly about genes. But in terms of classification there isn’t anything magical about genes. Biological anthropologists using more robust morphometric traits have discerned an “Out of Africa” movement, just as geneticists have. You have above five individuals. All of them have dark hair and dark eyes. There’s total overlap on those traits. And yet I’m pretty sure you can assign their rough population identity to each. Why? Because humans take a look at correlated clusters of traits in assigning population identity intuitively. Some traits are more salient, such as skin color, but early geographers understood that East Asians and Europeans were different populations despite similarity of light complexion. The ancient Greeks understood that Indians and Ethiopians were different groups despite their similar complexions, because they differed on other informative traits.

Let’s bring it back down to earth. Population structure exists. Phylogenetic analyses of humans are trivial in their difficulty. They track geography rather closely, at least before the age of mass migration. Additionally, they tend to follow endogamous social groups, such as Ashkenazi Jews. A South Asian is going to be more genetically related to a South Asian than they are to an African. There are many cosmetic differences between populations. But there are also less cosmetic differences which are very important. You can even assign different regions of a chromosome to different ancestral components.

Where does this leave us? Ultimately, it’s about the “R-word.” “Race is a myth.” Or, as PBS stated, an illusion. Here’s some of the precis of the PBS documentary:

Everyone can tell a Nubian from a Norwegian, so why not divide people into different races? That’s the question explored in “The Difference Between Us,” the first hour of the series. This episode shows that despite what we’ve always believed, the world’s peoples simply don’t come bundled into distinct biological groups. We begin by following a dozen students, including Black athletes and Asian string players, who sequence and compare their own DNA to see who is more genetically similar. The results surprise the students and the viewer, when they discover their closest genetic matches are as likely to be with people from other “races” as their own.

Much of the program is devoted to understanding why. We look at several scientific discoveries that illustrate why humans cannot be subdivided into races and how there isn’t a single characteristic, trait – or even one gene - that can be used to distinguish all members of one race from all members of another.

Modern humans – all of us – emerged in Africa about 150,000 to 200,000 years ago. Bands of humans began migrating out of Africa only about 70,000 years ago. As we spread across the globe, populations continually bumped into one another and mixed their mates and genes. As a species, we’re simply too young and too intermixed to have evolved into separate races or subspecies.

So what about the obvious physical differences we see between people? A closer look helps us understand patterns of human variation:

  • In a virtual “walk” from the equator to northern Europe, we see that visual characteristics vary gradually and continuously from one population to the next. There are no boundaries, so how can we draw a line between where one race ends and another begins?
  • We also learn that most traits – whether skin color, hair texture or blood group – are influenced by separate genes and thus inherited independently one from the other. Having one trait does not necessarily imply the existence of others. Racial profiling is as inaccurate on the genetic level as it is on the New Jersey Turnpike.
  • We also learn that many of our visual characteristics, like different skin colors, appear to have evolved recently, after we left Africa, but the traits we care about – intelligence, musical ability, physical aptitude – are much older, and thus common to all populations. Geneticists have discovered that 85% of all genetic variants can be found within any local population, regardless of whether they’re Poles, Hmong or Fulani. Skin color really is only skin deep. Beneath the skin, we are one of the most similar of all species.

Certainly a few gene forms are more common in some populations than others, such as those controlling skin color and inherited diseases like Tay Sachs and sickle cell. But are these markers of “race?” They reflect ancestry, but as our DNA experiment shows us, that’s not the same thing as race. The mutation that causes sickle cell, we learn, was passed on because it conferred resistance to malaria. It is found among people whose ancestors came from parts of the world where malaria was common: central and western Africa, Turkey, India, Greece, Sicily and even Portugal – but not southern Africa.

This documentary came out in 2003. In late 2005 scientists discovered the role that SLC24A5 plays in skin color. It is the second most ancestrally informative locus typed so far to differentiate Europeans and Africans. It actually does come close to being a single gene which differentiates two populations! It is true that human populations have mixed. I probably have ancestors who were resident in China and Northern Europe within the last 1,000 years. That’s the way genealogy works. All Eurasians may be able to find a genealogical line of ancestry back to Genghis Khan (though not necessarily distinctive genes attributable to him). But that does not negate the fact that some of your ancestors show up in your pedigree orders of magnitude more than others of your ancestors. The vast majority of my ancestors within the last 1,000 years were South Asian, though a substantial minority were Southeast Asian. The question of our youth as a species and its relation to our differentiation into races and subspecies is an empirical matter, not an a priori one determined by a fixed number of years. Since races and subspecies are fuzzy categories they’re easy to refute: just pick the definition which is refutable. I have no idea how they adduce that traits like intelligence, musical ability, and physical aptitude are that much older than the “Out of Africa” migration. Humans have been getting much more gracile over the last 10,000 years as a whole, while I don’t know how one can know about the musical abilities of anatomically modern humans in Africa 200,000 years ago. These traits are quantitative, and based on standing genetic variation, so the architecture is qualitatively different from that of skin color (though in 2003 we didn’t know the architecture of skin color, so the confusion is explainable).

The old concept of “race” as outlined by anthropologists in the early 20th century, and accepted broadly, was often unclear, ad hoc, and not empirical. Over the past generation, by way of refuting the concept of race, people are wont to make unclear, ad hoc, and non-empirical assertions. The reason that scholars discuss race and refute it is to eliminate confusions and misconceptions from the public, but their presentation has produced more confusions and misconceptions. The idea that human phylogeny is impossible is in the air; I have heard it from many intelligent people. It is easy to see why they believe it: the way the subject is presented by many scholars makes the implication clear that phylogeny is impossible and that differences are trivial. Both of these are false impressions. I do not believe that the real problems mixed-race people have obtaining organs with the appropriate tissue match are a trivial affair. Human genetic differences have plenty of concrete impacts which are not socially constructed.

Personally I have no problem with abandoning the word race and all the baggage which that entails. But there’s no reason to throw the baby out with the bathwater here. In the “post-genomic” era human population substructure is taken for granted. The outlines of the history of our species, and its various branches, are getting clearer and clearer. There’s no point in replacing old rubbish with new rubbish. We have the possibility for clear and useful thought, if we choose to grasp it.

December 11, 2010

Excavating the Neolithic genetic strata

After linking to Marnie Dunsmore’s blog on the Neolithic expansion, and reading Peter Bellwood’s First Farmers, I’ve been thinking a bit on how we might integrate some models of the rise and spread of agriculture with the new genomic findings. Bellwood’s thesis basically seems to be that the contemporary world pattern of expansive macro-language families (e.g., Indo-European, Sino-Tibetan, Afro-Asiatic, etc.) is a shadow of the rapid demographic expansions in prehistory of farmers, in particular hoe-farmers rapidly pushing into virgin lands. First Farmers was published in 2005, and so it had access mostly to mtDNA and Y chromosomal studies. Today we have a richer data set, from hundreds of thousands of markers per person, to mtDNA and Y chromosomal results from ancient DNA. I would argue that the new findings tend to reinforce the plausibility of Bellwood’s thesis somewhat.

The primary datum I want to enter into the record in this post, which was news to me, is this: the island of Cyprus seems to have been first settled (at least in anything but trivial numbers) by Neolithic populations from mainland Southwest Asia.* In fact, the first farmers in Cyprus perfectly replicated the physical culture of the nearby mainland in toto. This implies that the genetic heritage of modern Cypriots is probably attributable on the whole to expansions of farmers from Southwest Asia. With this in mind let’s look at Dienekes’ Dodecad results at K = 10 for Eurasian populations (I’ve re-edited a bit):


[Figure: Dodecad results at K = 10 for Eurasian populations]

Modern Cypriots exhibit genetic signatures which shake out into three putative ancestral groups. West Asian, which is modal in the Caucasus region. South European, modal in Sardinia. And Southwest Asian, which is modal in the Arabian peninsula. Cypriots basically look like Syrians, but with less Southwest Asian, more balance between West Asian and South European, and far less of the minor components of ancestry.

Just because an island was settled by one group of farmers, it does not mean that subsequent invasions or migrations could not have an impact. The indigenous tribes of Taiwan seem to be the original agriculturalists of that island, and after their settlement there were thousands of years of gradual and continuous cultural change in situ. But within the last 300 years settlers from Fujian on the Chinese mainland have demographically overwhelmed the native Taiwanese peoples.

During the Bronze Age it seems Cyprus was part of the Near East political and cultural system. The notional kings of Cyprus had close diplomatic relations with the pharaohs of Egypt. But between the end of the Bronze Age and the Classical Age Cyprus became part of the Greek cultural zone. Despite centuries of Latin and Ottoman rule, it has remained so, albeit with a prominent Turkish minority.

One thing notable about Cyprus, and which distinguishes it from mainland Greece, is the near total absence of a Northern European ancestral component. Therefore we can make the banal inference that Northern Europeans were not initially associated with the demographic expansions of farmers from the Middle East. Rather, I want to focus on the West Asian and Southern European ancestral components. One model for the re-population of Europe after the last Ice Age is that hunter-gatherers expanded from the peninsular “refugia” of Iberia and Italy, later being overlain by expansions of farmers from the Middle East, and perhaps Indo-Europeans from the Pontic steppe. I have a sneaking suspicion though that what we’re seeing among Mediterranean populations are several waves of expansion out of the Near East. I now would offer the tentative hypothesis that the South European ancestral element at K = 10 is a signature of the first wave of farmers which issued out of the Near East. The West Asians were a subsequent wave. I assume that the two groups must correspond to some sort of cultural or technological shift, though I have no hypothesis as to that.

From the above assertions, it is clear that I believe modern Sardinians are descendants of that first wave of farmers, unaffected by later demographic perturbations. I believe that the Basques, then, are a people who emerged from an amalgamation of the same wave of seafaring agriculturalists with the indigenous populations preceding them (the indigenes were likely the descendants of a broad group of northern Eurasians who expanded after the end of the last Ice Age from the aforementioned refugia). These farmers leap-frogged across fertile regions of the Mediterranean and pushed up the valleys of southern France, and out through the Straits of Gibraltar. Interestingly, the Basques lack the West Asian minority element evident in Dienekes’ Spaniards and Portuguese, as well as the HGDP French (even up to K = 15 they don’t shake out as anything but a two-way admixture, while the Sardinians show a minor West Asian component). Also, the West Asian and Southern European elements are proportionally several times better represented among Scandinavians than among Finns. The Southern European element is not found among the Uyghur, though the Northern European and West Asian ones are. I infer from all these patterns that the Southern European element derived from pre-Indo-European farmers who pushed west from the Near East. It is the second largest component across much of Northwestern Europe, and the largest across much of Southern Europe, including Greece.

A second issue which First Farmers clarified is the difference between the spread of agriculture from the Near East to Europe and to South Asia. It seems that the spread of agriculture across South Asia was more gradual, or at least had a longer pause, than in Europe. A clear West Asian transplanted culture arrived in what is today Pakistan ~9,000 years ago. But it does not seem that the Neolithic arrived in the far south of India until ~4,000 years ago. I think that a period of “incubation” in the northwest part of the subcontinent explains the putative hybridization between “Ancestral North Indians” and “Ancestral South Indians” (ASI) described in Reconstructing Indian population history. The high proportion of “Ancestral North Indian,” on the order of ~40%, as well as Y chromosomal markers such as R1a1a, among South Indian tribal populations, is a function of the fact that these groups are themselves secondary amalgamations between shifting cultivators expanding from the Northwest and local resident hunter-gatherer groups, which were related to the ASI whom the original West Asian agriculturalists encountered and assimilated in ancient Pakistan (Pathans are ~25% ASI). I believe that the Dravidian languages arrived from the Northwest to the south of India only within the last 4,000-5,000 years with the farmers (some of whom may have reverted to facultative hunter-gathering, as is common among tribals). To my mind, this relatively late arrival of Dravidian-speaking groups explains why Sri Lanka has an Indo-European presence; the island was probably only lightly settled by farming Dravidian speakers, if at all, allowing Indo-European speakers from Gujarat and Sindh to leap-frog and quickly replace the native Veddas, who were hunter-gatherers.

Note: Here is K = 15.

* Wikipedia says there were hunter-gatherers, but even here the numbers were likely very small.

March 28, 2010

More on recombination & natural selection

A follow-up to the post below: see John Hawks, Selection’s genome-wide effect on population differentiation, and p-ter’s Natural selection and recombination. As I said, it’s a dense paper, and I didn’t touch on many issues.
