Razib Khan One-stop-shopping for all of my content

November 3, 2018

It’s raining selective sweeps

Filed under: Population genetics,Population genomics,Selection — Razib Khan @ 11:44 pm

A week ago a very cool new preprint came out, Identifying loci under positive selection in complex population histories. It’s something that you couldn’t even have imagined just ten years ago. The authors basically figure out ways to identify deviations of markers from the allele frequencies expected under a null neutral evolutionary model. The method is put first, which I really like, before getting to results or discussion. Additionally, they did a lot of simulation ahead of time. The sort of simulation that was really not possible before we had the sort of computational resources we have now.

Here’s the abstract:

Detailed modeling of a species’ history is of prime importance for understanding how natural selection operates over time. Most methods designed to detect positive selection along sequenced genomes, however, use simplified representations of past histories as null models of genetic drift. Here, we present the first method that can detect signatures of strong local adaptation across the genome using arbitrarily complex admixture graphs, which are typically used to describe the history of past divergence and admixture events among any number of populations. The method – called Graph-aware Retrieval of Selective Sweeps (GRoSS) – has good power to detect loci in the genome with strong evidence for past selective sweeps and can also identify which branch of the graph was most affected by the sweep. As evidence of its utility, we apply the method to bovine, codfish and human population genomic data containing multiple population panels related in complex ways. We find new candidate genes for important adaptive functions, including immunity and metabolism in under-studied human populations, as well as muscle mass, milk production and tameness in particular bovine breeds. We are also able to pinpoint the emergence of large regions of differentiation due to inversions in the history of Atlantic codfish.
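To make the general idea concrete, here is a minimal sketch of testing an observed allele frequency against a neutral drift null. It is not GRoSS, which works on admixture graphs. It is a single-population Wright-Fisher simulation, and the function names, population size, and number of generations are all assumptions chosen for illustration.

```python
import random

def simulate_drift(p0, pop_size, generations, rng):
    """Return the allele frequency after neutral Wright-Fisher drift."""
    p = p0
    n_copies = 2 * pop_size  # diploid population
    for _ in range(generations):
        # Binomial sampling of allele copies for the next generation.
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = count / n_copies
        if p in (0.0, 1.0):  # allele lost or fixed, nothing left to drift
            break
    return p

def empirical_p_value(p0, observed, pop_size, generations, n_sims=500, seed=1):
    """Share of neutral simulations that shift at least as far as the observed frequency."""
    rng = random.Random(seed)
    shift = abs(observed - p0)
    extreme = sum(
        1 for _ in range(n_sims)
        if abs(simulate_drift(p0, pop_size, generations, rng) - p0) >= shift
    )
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_sims + 1)

# Hypothetical marker: ancestral frequency 0.5, observed at 0.95 in a descendant population.
print(empirical_p_value(0.5, 0.95, pop_size=100, generations=20))
```

A marker whose shift is rarely matched by the neutral simulations is a candidate for selection. The hard part, and the part the preprint addresses, is getting the null right when populations split and mix in complicated ways.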

On a related note in regards to selection, see On the well-founded enthusiasm for soft sweeps in humans: a reply to Harris, Sackman, and Jensen. The authors are responding to a recent preprint criticizing their earlier work. The reason that it’s fascinating to me is that these sorts of arguments today are really concrete and not so theoretical. There’s a lot of data for analytic techniques to chew through, and computation has really transformed the possibilities.

A generation ago these sorts of debates would be a sequence of “you’re wrong!” vs. “no, you’re wrong!” Today the disputes involve a lot of data, and so have a reasonable chance of resolution.

The first preprint identifies the usual candidates in humans that you normally see, and expected targets in cattle and cod. Sure, that will give biologists more interested in mechanisms and pathways things to chew upon, but imagine once researchers have large numbers of genomes for thousands and thousands of species. Then they’ll be testing deviations from neutral allele frequencies across many trees, and getting a more general and abstract sense of the parameter space that selection explores, conditional on particularities of evolutionary history.

This is why I’m excited about plans to sequence lots and lots of species.

October 22, 2018

The phylogenetic trees falling on the tundra

Filed under: Historical Population Genetics,Population genetics,Siberia — Razib Khan @ 9:59 pm

A massive new ancient DNA preprint just dropped, The population history of northeastern Siberia since the Pleistocene:

…Here, we report 34 ancient genome sequences, including two from fragmented milk teeth found at the ~31.6 thousand-year-old (kya) Yana RHS site, the earliest and northernmost Pleistocene human remains found. These genomes reveal complex patterns of past population admixture and replacement events throughout northeastern Siberia, with evidence for at least three large-scale human migrations into the region. The first inhabitants, a previously unknown population of “Ancient North Siberians” (ANS), represented by Yana RHS, diverged ~38 kya from Western Eurasians, soon after the latter split from East Asians. Between 20 and 11 kya, the ANS population was largely replaced by peoples with ancestry from East Asia, giving rise to ancestral Native Americans and “Ancient Paleosiberians” (AP), represented by a 9.8 kya skeleton from Kolyma River. AP are closely related to the Siberian ancestors of Native Americans, and ancestral to contemporary communities such as Koryaks and Itelmen. Paleoclimatic modelling shows evidence for a refuge during the last glacial maximum (LGM) in southeastern Beringia, suggesting Beringia as a possible location for the admixture forming both ancestral Native Americans and AP. Between 11 and 4 kya, AP were in turn largely replaced by another group of peoples with ancestry from East Asia, the “Neosiberians” from which many contemporary Siberians derive. We detect additional gene flow events in both directions across the Bering Strait during this time, influencing the genetic composition of Inuit, as well as Na Dene-speaking Northern Native Americans, whose Siberian-related ancestry components is closely related to AP. Our analyses reveal that the population history of northeastern Siberia was highly dynamic, starting in the Late Pleistocene and continuing well into the Late Holocene. 
The pattern observed in northeastern Siberia, with earlier, once widespread populations being replaced by distinct peoples, seems to have taken place across northern Eurasia, as far west as Scandinavia.

The preprint is very interesting and thorough, and the supplements are well over 100 pages. I read the genetics and linguistics portions. They make for some deep reading, and I really regret making fun of Iosif Lazaridis’ fondness for acronyms now.

I will make some cursory and general observations. First, the authors got really high coverage (so high quality) genomes from the Yana RHS site. Notice that they’re doing more data-intense analytic methods. Second, they did not find any population with the affinities to Australo-Melanesian that several research groups have found among some Amazonians. Likely they are hiding somewhere…but the ancient DNA sampling is getting pretty good. We’re missing something. Third, I am not sure what to think about the very rapid bifurcation of lineages we’re seeing around ~40,000 years ago.

The ANS population, ancestral by and large to ANE, seems to be about ~75% West Eurasian (without much Basal Eurasian) and ~25% East Eurasian. Or at least that’s one model. Did they then absorb other peoples? Or, was there an ancient population structure in the primal ur-human horde pushing out of the Near East? That is, are the “West Eurasians” and “East Eurasians” simply the descendants of original human tribes venturing out of Africa ~50,000 years ago? Also, rather than discrete West Eurasian and East Eurasian components, perhaps there was a genetic cline where the proto-ANS occupied a position closer to the former, as opposed to some later pulse admixture?

Without more ancient DNA we probably won’t be able to resolve the various alternative models.

October 9, 2018

The post-neutral human genome (the Kern-Hahn era)

Filed under: Neutral Theory,Population genetics — Razib Khan @ 6:50 pm

If you have any background in evolutionary biology you are probably aware of the controversy around the neutral theory of molecular evolution. Fundamentally a theoretical framework, and instrumentally a null hypothesis, it came to the foreground in the 1970s just as empirical molecular data in evolutionary biology was becoming a thing.

At the same time that Motoo Kimura and colleagues were developing the formal mathematical framework for the neutral theory, empirical evolutionary geneticists were leveraging molecular biology to more directly assay natural allelic variation. In 1966 Richard Lewontin and John Hubby presented results which suggested far more variation than they had been expecting. Lewontin argued in the early 1970s that their data and the neutral model actually was a natural extension of the “classical” model of expected polymorphism as outlined by R. A. Fisher, as opposed to the “balance school” of Sewall Wright. In short, Lewontin proposed that the extent of polymorphism was too great to explain in the context of the dynamics of the balance school (e.g., segregation load and its impact on fitness), where numerous selective forces maintained variation. The classical school emphasized both strong selective sweeps on favored alleles and strong constraint against most new mutations.

And yet one might expect low levels of polymorphism from the classical school. The way in which the neutral framework was a more natural extension of this model is that even if most inter-specific variation, most substitutions across species, are due to selectively neutral variants, most variants could nevertheless be deleterious and so constrained. Alleles which increase in frequency may have done so through positive selection, or, just random drift. Not balancing forces like diversifying selection and overdominance.

The general argument around neutral theory generated much acrimony and spilled out from the borders of population genetics and molecular evolution to evolutionary biology writ large. Stephen Jay Gould, Simon Conway Morris, and Richard Dawkins were all under the shadow of neutral theory in their meta-scientific spats about adaptation and contingency.

That was then, this is now. I’ve already stated that sometimes people overplay how much genomics has transformed our understanding of evolutionary biology. But in the arguments around neutral theory, I do think it has had a salubrious impact on the tone and quality of the discourse. Neutral theory and the great controversies flowered and flourished in an age where there was some empirical data to support everyone’s position. But there was never enough data to resolve the debates.

From where I stand, I think we’re moving beyond that phase in our intellectual history. To be frank, some of the older researchers who came up in the trenches when Kimura and his bête noire John Gillespie were engaged in a scientific dispute which went beyond conventional collegiality seem to retain the scars of that era. But younger scientists are more sanguine, whatever their current position might be, because they anticipate that the data will ultimately adjudicate, since there is so much of it.

With that historical context, consider a new paper, Background selection and biased gene conversion affect more than 95% of the human genome and bias demographic inferences:

Disentangling the effect on genomic diversity of natural selection from that of demography is notoriously difficult, but necessary to properly reconstruct the history of species. Here, we use high-quality human genomic data to show that purifying selection at linked sites (i.e. background selection, BGS) and GC-biased gene conversion (gBGC) together affect as much as 95% of the variants of our genome. We find that the magnitude and relative importance of BGS and gBGC are largely determined by variation in recombination rate and base composition. Importantly, synonymous sites and non-transcribed regions are also affected, albeit to different degrees. Their use for demographic inference can lead to strong biases. However, by conditioning on genomic regions with recombination rates above 1.5 cM/Mb and mutation types (C↔G, A↔T), we identify a set of SNPs that is mostly unaffected by BGS or gBGC, and that avoids these biases in the reconstruction of human history.

This is not an entirely surprising result. Some researchers in human genetics have been arguing for the pervasiveness of background selection, selection against deleterious alleles which affects nearby regions, for nearly a decade. In contrast, there are others who argue selective sweeps driven by positive selection are important in determining variation. Unlike the 1970s and 1980s these researchers don’t evince much acrimony, in part because the data keeps coming, and ultimately they’ll probably converge on the same position. And, the results may differ by species or taxon.
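As an aside, the SNP filter described in the abstract is simple to express in code. The sketch below keeps only sites with a local recombination rate above 1.5 cM/Mb whose two alleles are C and G or A and T, the mutation types the abstract conditions on. The `Snp` record and its field names are hypothetical, not the paper’s data format.

```python
from dataclasses import dataclass

# Allele pairs named in the abstract: C<->G and A<->T. Neither changes GC content,
# so GC-biased gene conversion should not favor one allele over the other.
GC_CONSERVATIVE = {frozenset("CG"), frozenset("AT")}

@dataclass
class Snp:
    snp_id: str                    # hypothetical identifier
    ref: str                       # reference allele
    alt: str                       # alternate allele
    recomb_rate_cm_per_mb: float   # local recombination rate

def passes_filter(snp, min_rate=1.5):
    """True if the SNP sits in a high-recombination region and is C<->G or A<->T."""
    high_recombination = snp.recomb_rate_cm_per_mb > min_rate
    gc_conservative = frozenset((snp.ref, snp.alt)) in GC_CONSERVATIVE
    return high_recombination and gc_conservative

# Made-up example records.
snps = [
    Snp("snp1", "C", "G", 2.3),  # kept
    Snp("snp2", "C", "T", 2.3),  # dropped: C<->T changes GC content
    Snp("snp3", "A", "T", 0.4),  # dropped: low recombination
]
print([s.snp_id for s in snps if passes_filter(s)])
```

The cost of a filter like this is that it throws away most of the genome, which is only affordable now that whole-genome data is cheap.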

If you want a less technical overview than the paper, Kelley Harris has an excellent comment accompanying it. If you want to know what I mean by the Kern-Hahn era, it’s a joke due to the publication of The Neutral Theory in Light of Natural Selection.

Finally, some of you might wonder about the implications for demographic inference which preoccupies me so much on this weblog. In the big picture, it probably won’t change a lot, but it will be important for the details. So this is a step forward. That being said, the possibility of variable mutation rates and recombination rates across time and between lineages is also probably quite important.

October 6, 2018

The derived SNP that causes dry earwax was not found in all non-Africans

Filed under: earwax,Population genetics,rs17822931 — Razib Khan @ 11:26 am

A new paper on Chinese genomics using low-coverage data from hundreds of thousands of NIPT screenings is making some waves. I’ll probably talk about the paper at some point. But I want to highlight the frequency of rs17822931 in Han Chinese. It’s pretty incredible how high it is.

Because the derived variant of the SNP, which is correlated with dry flaky earwax when present in homozygous genotypes, is also associated with less body odor, it has been studied extensively by East Asian geneticists. Basically, individuals who are homozygous for the ancestral variant, which is the norm in Europe, the Middle East, and Africa, tend to have more body odor. In societies and contexts where this is offensive these people are subject to more ostracism, and in East Asia they are a minority (some of the studies in Japan were motivated by conscripts who elicited complaints from their colleagues).
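Since the dry earwax phenotype shows up in homozygotes, it helps to see how allele frequency maps to genotype frequency. Here is a minimal sketch assuming Hardy-Weinberg proportions (random mating, no selection), which is my simplification. The allele frequency of 0.9 is a made-up illustration, not a number from the paper.

```python
def genotype_frequencies(derived_freq):
    """Expected genotype frequencies under Hardy-Weinberg proportions."""
    q = derived_freq
    p = 1.0 - q  # ancestral allele frequency
    return {
        "ancestral/ancestral": p * p,
        "heterozygote": 2 * p * q,
        "derived/derived": q * q,  # the dry earwax genotype
    }

# Hypothetical derived allele frequency, for illustration only.
for genotype, freq in genotype_frequencies(0.9).items():
    print(genotype, round(freq, 3))
```

The point is that homozygotes for the ancestral variant go as the square of its frequency, so once the derived variant is very common the people with the ancestral phenotype become a small minority very quickly.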

The relatively low frequency in Guangxi is to be expected. This province was Sinicized only recently. As in, the last 500 years. And it still retains a huge ethnic minority population, and many of the Han in the province likely have that ancestry. But the question still arises: why do the Han have such a high frequency of rs17822931?

Here’s a plot of frequencies:


But the ALFRED database has more details. Sardinians, Somalis, Ethiopian Jews, and Dani from the New Guinea highlands all have very low proportions or none of the derived variant. The Ethiopian Jews are about 40% West Eurasian, due to Middle Eastern agriculturalist ancestry. Groups like the Masai also have Middle Eastern agriculturalist ancestry. I think the low frequencies of the derived variant in the Middle East are due to migration from eastern Eurasia in the relatively recent past. The frequencies of the derived variant in Europe probably came with the Ancestral North Eurasian ancestry of the steppe people. In South and Southeast Asia the frequencies are indicative of balancing selection, even if there is no such selection, while in the New World the derived variant is at low, but appreciable frequencies.

As I mentioned in an earlier post, a 40,000 year old Siberian had the derived variant (heterozygote). I suspect the Basal Eurasians did not.

The population genetic structure of China (through noninvasive prenatal testing)

Filed under: China genetics,Han genetics,Population genetics,Population genomics — Razib Khan @ 10:03 am

This week a big whole genome analysis of China was published in Cell, Genomic Analyses from Non-invasive Prenatal Testing Reveal Genetic Associations, Patterns of Viral Infections, and Chinese Population History. The abstract:

We analyze whole-genome sequencing data from 141,431 Chinese women generated for non-invasive prenatal testing (NIPT). We use these data to characterize the population genetic structure and to investigate genetic associations with maternal and infectious traits. We show that the present day distribution of alleles is a function of both ancient migration and very recent population movements. We reveal novel phenotype-genotype associations, including several replicated associations with height and BMI, an association between maternal age and EMB, and between twin pregnancy and NRG1. Finally, we identify a unique pattern of circulating viral DNA in plasma with high prevalence of hepatitis B and other clinically relevant maternal infections. A GWAS for viral infections identifies an exceptionally strong association between integrated herpesvirus 6 and MOV10L1, which affects piwi-interacting RNA (piRNA) processing and PIWI protein function. These findings demonstrate the great value and potential of accumulating NIPT data for worldwide medical and genetic analyses.

In The New York Times write-up there is an interesting detail, “This study served as proof-of-concept, he added. His team is moving forward on evaluating prenatal testing data from more than 3.5 million Chinese people.” So what he’s saying is that this study with >100,000 individuals is a “pilot study.” Let that sink in.

The PCA at the top of the post is a bit busy, so I want to highlight the salient aspect. These results confirm that 5-10% of the ancestry of the Hui, Chinese speaking Muslims, is West Eurasian. The Uygur and Kazakh are about 40% on the left of the plot. The authors note that the Manchus overlapped almost perfectly with individuals sampled from Northern China. This is expected because by the end of the Ching dynasty most of the Manchus had been fully Sinicized, and in the 20th century fully assimilated. Recently, due to an emphasis on “national minorities” and some privileges granted therein, many people have identified as Manchu due to some ancestry who are in all other ways simply northern Han (the Manchu language is moribund).

The sections on particular adaptations which vary by region are not surprising. In books like The Retreat of the Elephants the slow, gradual, and inexorable expansion of the Chinese beyond the Yangzi basin is described in a way that makes it clear that southern diseases and climate were a major impediment. But through a process of acclimation, assimilation of local peoples, and adaptation, by 1000 AD the center of demographic gravity had shifted to the south.

There is a section of the text which I think will be falsified though:

After removing participants with 49bp read length and with sequencing error rate >0.00325, a principal component analysis of 45,387 self-reported Han Chinese from the 31 administrative divisions showed that the greatest differentiation of Han Chinese is along a latitudinal gradient (Figures S3E and S3F), consistent with previous studies (Chen et al., 2009, Xu et al., 2009). In contrast, there is, perhaps surprisingly, very little differentiation from East to West. This observation may be explained by the fact that a large proportion of the western Han populations in China are recent immigrants organized by the central government starting from 1949 when the People’s Republic of China was founded (Liang and White, 1996).

I don’t think there’s any need to make recourse to migration from 1949 and after. The argument in Guns, Germs, and Steel suffices: it’s just easier to move along a line of latitude, east to west, than across latitudes, north to south. The people of the north eat noodles made from wheat, and the people of the south eat rice. This is a big cultural transition for peasants to make, and so it didn’t happen as often as moving to the coast, or inland. We have documented instances of mass migrations from adjacent provinces due to famine and political instability. In the 17th century conflicts resulted in the depopulation of Sichuan and the arrival of large numbers of people from Hunan and Hubei to the east.

The plot below is one of the more interesting ones from the paper. From left to right, private alleles found in the HapMap Utah whites also found in all individuals in a given province, and then just Han, and then private alleles to ethnic Telugu Indians (from South India) found in all individuals in a given province, and then just Han.


The first thing to notice is that there is a correlation between the Han and non-Han. This shouldn’t be surprising. Plenty of ethnic groups have become Han through acculturation and become demographically absorbed. This is probably truer in parts of the south than in the north, but southern Chinese ethnic minorities are genetically and culturally much more like the Han in the first place.

Private alleles shared with Northern Europeans (CEU) almost certainly have to do with the interaction sphere of the steppe pastoralists, which extends from the Carpathians to Mongolia. The relatively high frequency of R1a, and to a lesser extent R1b, among many Turkic/Central Asian peoples is a pretty good sign of where this West Eurasian ancestry comes from.

The Indian affinity is perhaps more interesting. To be honest I was surprised at the high affinity in Yunnan and Hainan. Tibet has strong cultural connections to India through its form of Buddhism. But it’s interesting that Qinghai, where many Tibetans also live, does not have the affinity with India. What’s going on in the other provinces? I suspect that the aboriginal peoples assimilated by the Han and other groups in this region probably had some distant connections to the non-West Eurasian ancestry in South Asia.

October 4, 2018

Chinese and Indian American population genetic structure

Filed under: Population genetics — Razib Khan @ 3:14 pm

In Who We Are and How We Got Here: Ancient DNA and the New Science of the Human Past David Reich makes the observation that India is a nation of many different ethnicities, while China is dominated by a single ethnicity, the Han. This is obviously true, more or less. Even today the vast majority of Indians seem to be marrying within their own communities, jati.

Over the years I’ve collected many different genotypes of Americans of various origins who have purchased personal genomics kits, and given me their raw results. I decided to go through my collection and strip detailed ethnic labels and simply group together all those individuals from India, and China, who have had their genotypes done from one of the major services.

I suspect that these individuals are representative of “Indian Americans” and “Chinese Americans.” So what’s their genetic structure?

Here’s the variation of the consumers:

Merging with 1000 Genomes and some HGDP samples, here’s what I get:

Let’s zoom in on the Chinese:


In case you don’t know, the American Chinese community has been historically biased toward being mostly Cantonese. More recently, there have been migrants from Fujian. The Indian American distribution should be self-evident.
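For readers who haven’t run this sort of analysis, the plots above come from principal component analysis of genotype data. Below is a toy sketch of the idea on simulated data, not my actual pipeline. It standardizes each SNP, builds the individual-by-individual similarity matrix, and pulls out the first component by power iteration. The two simulated populations, their sizes, and the amount of frequency divergence are all assumptions for illustration.

```python
import random

def simulate_two_populations(n_per_pop=20, n_snps=1000, drift_sd=0.15, seed=0):
    """Simulate 0/1/2 genotypes for two populations with diverged allele frequencies."""
    rng = random.Random(seed)
    rows_a = [[] for _ in range(n_per_pop)]
    rows_b = [[] for _ in range(n_per_pop)]
    for _ in range(n_snps):
        base = rng.uniform(0.1, 0.9)  # shared ancestral frequency
        for rows in (rows_a, rows_b):
            # Each population drifts away from the ancestral frequency on its own.
            freq = min(max(base + rng.gauss(0, drift_sd), 0.05), 0.95)
            for row in rows:
                # Two independent allele draws per diploid individual.
                row.append((rng.random() < freq) + (rng.random() < freq))
    return rows_a + rows_b

def first_pc(genotypes, n_iter=200, seed=0):
    """Score each individual (row of 0/1/2 counts) on the first principal component."""
    n, m = len(genotypes), len(genotypes[0])
    cols = []
    for j in range(m):
        col = [row[j] for row in genotypes]
        freq = sum(col) / (2 * n)
        if 0 < freq < 1:  # skip monomorphic SNPs
            scale = (2 * freq * (1 - freq)) ** 0.5
            cols.append([(g - 2 * freq) / scale for g in col])
    # n x n similarity (Gram) matrix between individuals.
    gram = [[sum(c[a] * c[b] for c in cols) for b in range(n)] for a in range(n)]
    # Power iteration converges to the leading eigenvector of the Gram matrix.
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(n_iter):
        w = [sum(gram[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(gram[a][b] * v[b] for b in range(n)) for a in range(n))
    return [x * eigenvalue ** 0.5 for x in v]

genotypes = simulate_two_populations()
pc1 = first_pc(genotypes)
print([round(x, 1) for x in pc1])
```

With real data you would use a dedicated tool and look at more than one component, but the logic is the same: individuals who share more drift end up near each other.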

October 3, 2018

Nomads, cosmopolitan predators, and peasants, xenophobic producers

Ten years ago when I read Peter Heather’s Empires and Barbarians, its thesis that the migrations and conquests of the post-Roman period were at least in part folk wanderings, where men, women, and children swarmed into the collapsing Empire en masse, was somewhat edgy. Today Heather’s model has to a large extent been validated. The recent paper on the Lombard migration, the discovery that the Lombards were indeed by and large genetically coherent as a transplanted German tribe in Pannonia and later northern Italy, confirms the older views which Heather attempted to resurrect. Additionally, the Lombards also seem to have been defined by a dominant group of elite male lineages.

Why is this even surprising? Because to a great extent, the ethnic and tribal character of the post-Roman power transfer between Late Antique elites and the newcomers was diminished and dismissed for decades. I can still remember the moment in 2010 when I was browsing books on Late Antiquity at Foyles in London and opened to a page in a monograph devoted to the society of the Vandal kingdom in North Africa. The author explained that though the Vandals were defined by a particular set of cultural codes and mores, they were to a great extent an ad hoc group of mercenaries and refugees, whose ethnic identity emerged de novo on the post-Roman landscape.

In the next few years, we will probably get Vandal DNA from North Africa. I predict that they will be notably German (though with admixture, especially as time progresses). Additionally, I predict most of the males will be haplogroup R1b or I1. But the Vandal kingdom was actually one where there was a secondary group of barbarians: the Alans. It was Regnum Vandalorum et Alanorum. I predict that Alan males will be R1a. In particular, R1a1a-Z93.

But this post is not about the post-Roman world. Rather, it’s about the Inner Asian forest steppe. The sea of grass, stretching from the Altai to the Carpathians. A new paper in Science adds more samples to the story of the Srubna, Cimmerians, Scythians, and Sarmatians. Ancient genomes suggest the eastern Pontic-Caspian steppe as the source of western Iron Age nomads. The abstract is weirdly nonspecific, though accurate:

For millennia, the Pontic-Caspian steppe was a connector between the Eurasian steppe and Europe. In this scene, multidirectional and sequential movements of different populations may have occurred, including those of the Eurasian steppe nomads. We sequenced 35 genomes (low to medium coverage) of Bronze Age individuals (Srubnaya-Alakulskaya) and Iron Age nomads (Cimmerians, Scythians, and Sarmatians) that represent four distinct cultural entities corresponding to the chronological sequence of cultural complexes in the region. Our results suggest that, despite genetic links among these peoples, no group can be considered a direct ancestor of the subsequent group. The nomadic populations were heterogeneous and carried genetic affinities with populations from several other regions including the Far East and the southern Urals. We found evidence of a stable shared genetic signature, making the eastern Pontic-Caspian steppe a likely source of western nomadic groups.

The German groups which invaded the Western Roman Empire were agropastoralists. That is, they were slash and burn farmers who raised livestock. Though they were mobile, they were not nomads of the open steppe. Man for man the Germans of Late Antiquity had more skills applicable to the military life than the Roman peasant. This explains in part their representation in the Roman armed forces in large numbers starting in the 3rd century. But the people of the steppe, pure nomads, were even more fearsome. Ask the Goths about the Huns.

Whole German tribes, like the Cimbri, might coordinate for a singular migration for new territory, but for the exclusive pastoralist, their whole existence was migration. Groups such as the Goths and Vandals might settle down, and become primary producers again, but pure pastoralists probably required some natural level of predation and extortion upon settled peoples to obtain a lifestyle beyond marginal subsistence. Which is to say that some of the characterizations of Late Antique barbarians as ad hoc configurations might apply more to steppe hordes.

There has been enough work on these populations over the past few years to admit that various groups have different genetic characteristics, indicative of a somewhat delimited breeding population. But, invariably there are outliers here and there, and indications of periodic reversals of migration and interactions with populations from other parts of Eurasia.

Earlier I noted that Heather seems to have been correct that the barbarian invasions of the Roman Empire were events that involved the migration of women and children, as well as men. The steppe was probably a bit different. Here are the Y and mtDNA results for males from these data that are new to this paper:

Culture | mtDNA haplogroup | Y haplogroup
Late Sarmatian | U5b2b | R1b1a1a2?
Scythian | U5a2a1 | R1b1a1a2?
Late Sarmatian | D4q | R1b1a1a2
Scythian | J2b1a6 | R1b1a1a2
Scythian | U5a1a1 | R1b1a1a2
Scythian | U5b2a3 | R1b1a1a2
Scythian | U4* | R1b1a1a2
Scythian | U5a2b | R1b1a1a2
Cimmerian | H9a | R1b1a
Srubno-alakulskaya | T2a1 | R1a1a1?
Srubno-alakulskaya | J1c3a | R1a1a1
Srubno-alakulskaya | H | R1a1a1
Srubno-alakulskaya | HV0a | R1a1a1
Srubno-alakulskaya | U5a1 | R1a1a1
Srubno-alakulskaya | HV0a | R1a1a1
Late Sarmatian | T1a1 | R1a1a
Cimmerian | C5c (50%) | Q1a1

I’m assuming you aren’t surprised. These steppe tribes seem to be defined by extended paternal lineage networks. The Srubna people are R1a1a1, as is dominant in Eastern Europe today. But, an ancient Srubna male dating to 1800 BC was found to have the Asian variant of R1a1a1, found in South and Central Asia, not the one predominant among Slavic peoples.


Speaking of South Asians, there is some interesting discussion on this issue in the paper. I’ll quote a few sections:

The Bronze Age Srubnaya-Alakulskaya individuals from Kazburun 1/Muradym 8 presented genetic similarities to the previously published Srubnaya individuals. However, in f4 statistics, they shared more drift with representatives of the Andronovo and Afanasievo populations compared to the published Srubnaya individuals. Those apparently West Eurasian people lacked significant Siberian components (NEA and SEA) in ADMIXTURE analyses but carried traces of the SA component that could represent an earlier connection to ancient Bactria. The presence of an SA component (as well as finding of metals imported from Tien Shan Mountains in Muradym 8) could therefore reflect a connection to the complex networks of the nomadic transmigration patterns characteristic of seasonal steppe population movements….

There are two ways, not exclusive, that I can explain the “South Asian” component you find in some of the steppe individuals. First, the “South Asian” component is found in the Neolithic Iranian sample. And, you can see in another plot that the Scythians are enriched for West Asian ancestry in comparison to the Srubna. As noted above there was probably south to north migration of these Indo-European nomadic groups. So yes, just as with the East Asian ancestry which periodically appears, this is evidence of an “Inner Asian International.”

A second possibility though is that the South Asian ancestry is artifactual and that it’s just emerging in ADMIXTURE because of shared ancestry between the Srubna and South Asians because of gene flow from the steppe into South Asia (and since South Asians have “Iranian farmer” ancestry it also pops up in the Iranian Neolithic sample).

The Srubna flourished between the 18th and 12th centuries BC. According to Wikipedia:

Philological and linguistic evidence indicates that the bulk of the Rigveda Samhita was composed in the northwestern region of the Indian subcontinent, most likely between c. 1500 and 1200 BC.

Mitanni Indo-Aryan is attested in Syria in 1380 BC.

In the centuries around 1500 BC it seems quite possible that there was an “Indo-Aryan Inner Asian International”, just as in the first millennium AD there emerged a Turkic International, and for more than a century after 1200 AD there was a Mongol International. In the north, the Indo-Aryans were absorbed by Iranian and Uralic peoples. In West Asia they didn’t have a major cultural impact, aside from introducing chariots. It is in India by happenstance that Indo-Aryan linguistic culture and aspects of their folk memory are preserved to this day.

This isn’t that amazing. Half of the speakers of Turkic languages are ethnic Turks, who live in Turkey. Anatolia genetically isn’t really very East Asian, though there is some of that. But the cultural heritage of the ancient Turks remains stronger there than in areas anciently inhabited by Turks, such as western Mongolia (where the people are genetically more like the original Turks were in the first millennium AD).

What’s the upshot here? I think that there is a spectrum of passivity and xenophobia in the modes of production outlined above. Sedentary peasant peoples are the most conservative and xenophobic. They are also the least warlike because their skill set is the least transferable to warfare. They specialize in production, not extortion.

Pure nomads are the least xenophobic and most open to various forms of cultural innovation. The Mongol horde rapidly expanded in the decades of Genghis Khan’s rule through assimilation of various Turkic and Tungusic peoples. Though Genghis Khan put his sons by his first wife Borte in all the major positions, competent individuals outside of his own family line were elevated to power and authority. We have enough evidence now that these social dynamics are also strongly driven by the reality of migrating males, who marry a variety of conquered peoples.

Though Mongols were religiously tolerant and relatively accepting of ethnic diversity so long as subordinate peoples did not rebel, they were fundamentally an extortive order where organized mass violence was always the weapon of first resort. They were almost certainly not atypical, but continuing an Inner Asian tradition which probably dates to the Bronze Age, and matured 1,000 years later with groups like the Scythians.

Agropastoralists, such as the people of Northern Europe during antiquity, were probably somewhere in between peasants and nomads. Not as xenophobic as peasants, but definitely more inward looking than the steppe nomads.

October 2, 2018

How related should you expect relatives to be?

Filed under: Population genetics — Razib Khan @ 12:44 am

Like many Americans in the year 2018 I’ve got a whole pedigree plugged into personal genomic services. I’m talking from grandchild to grandparent to great-aunt/uncles. A non-trivial pedigree. So we as a family look closely at these patterns, and we’re not surprised at this point to see really high correlations in some cases compared to what you’d expect (or low).

This means that you can see empirically the variation between relatives of the same nominal degree of separation from a person of interest. For example, each of my children’s grandparents contributes 25% of their autosomal genome without any prior information. But I actually know the variation of contribution empirically. For example, my father is enriched in my daughter. My mother is enriched in my sons.

The same principle applies to siblings. Though they should be 50% related on their autosomal genome, it turns out there is variation. I’ve seen some papers with large data sets (e.g., 20,000 sibling pairs) which give a standard deviation of 3.7% in relatedness. But what about other degrees of relation?

I didn’t find empirical data on that (imagine assembling a dataset with large numbers of known third cousins…perhaps in Iceland), but I did find this paper, Variation in actual relationship as a consequence of Mendelian sampling and linkage that was useful. The authors modeled the expectation and variance (and so standard deviation) of identity by descent, genomic relatedness. One of their models gives 3.84% standard deviation for siblings, so that seems pretty close to the empirical mark. Here is a table I put together from a subset of their results:

Relationship                 Relatedness   Standard deviation
Parent-child                 0.5           0
Full sibling                 0.5           0.0384
Grandparent-Grandchildren    0.25          0.0251
Uncle-Aunt/Nephew-Niece      0.25          0.0251
Cousin                       0.125         0.0241
2nd Cousin                   0.0312        0.0117
3rd Cousin                   0.0078        0.0054

The distribution of relatedness among siblings seems about normal. So there are individuals who are less than 40% related to their “full-sibling” while others are more than 60% related. Notice that when it comes to third cousins the variation in expected relatedness is in the same range as expected relatedness. Some “3rd cousins” won’t share any genomic relatedness as defined by identity by descent from recent ancestors.
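To put rough numbers on this, here is a minimal Python sketch that plugs the table values into a normal approximation. The normality assumption is only the "seems about normal" observation above applied to siblings, not something taken from the paper. For the more distant relationships the sketch reports only the ratio of standard deviation to expected relatedness, because a distribution bounded at zero will not be normal. The function names are illustrative.

```python
import math

# (expected relatedness, standard deviation), copied from the table above
RELATEDNESS = {
    "full sibling": (0.5, 0.0384),
    "grandparent-grandchild": (0.25, 0.0251),
    "cousin": (0.125, 0.0241),
    "2nd cousin": (0.0312, 0.0117),
    "3rd cousin": (0.0078, 0.0054),
}


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def fraction_outside(low, high, mean, sd):
    """Share of a normal distribution falling below `low` or above `high`."""
    return normal_cdf(low, mean, sd) + (1.0 - normal_cdf(high, mean, sd))


def coefficient_of_variation(mean, sd):
    """Standard deviation expressed as a fraction of the expected value."""
    return sd / mean


# Assumption: sibling relatedness is treated as exactly normal.
sib_mean, sib_sd = RELATEDNESS["full sibling"]
sib_tail = fraction_outside(0.40, 0.60, sib_mean, sib_sd)
print(f"siblings under 40% or over 60% related: {sib_tail:.2%}")

for name, (mean, sd) in RELATEDNESS.items():
    cv = coefficient_of_variation(mean, sd)
    print(f"{name}: SD is {cv:.0%} of expected relatedness")
```

Under these assumptions the sibling tails are small but not negligible, and the ratio climbs steeply with each step out from siblings to third cousins.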

Related: How much of your genome do you inherit from a particular grandparent?

September 19, 2018

David Burbridge’s 10 questions for A. W. F. Edwards in 2006

Filed under: A. W. F. Edwards,Population genetics — Razib Khan @ 9:30 pm

A few years ago I watched a documentary about the rise of American-influenced rock music in Britain in the 1960s. At some point, one of the Beatles, probably Paul McCartney, or otherwise Eric Clapton, was quoted as saying that they wanted to introduce Americans to “their famous people.” Though patronizing and probably wrong, what they were talking about is that there were particular blues musicians who were very influential in some British circles but were lingering in obscurity in the United States of America due to racial prejudice. The bigger picture is that there are brilliant people who for whatever reason are not particularly well known to the general public.

This is why I am now periodically “re-upping” interviews with scientists that we’ve done on this weblog over the past 15 years. These are people who should be more famous. But aren’t necessarily.

In 2006 David Burbridge, a contributor to this weblog and a historian of things Galtonian, interviewed the statistical geneticist A. W. F. Edwards. Edwards was one of R. A. Fisher’s last students, so he has a connection to a period of history that is passing us by.

I do want to say that his book, Foundations of Mathematical Genetics, really gave me a lot of insights when I first read it in 2005 and began to be deeply interested in pop gen. It’s dense. But short. Additionally, I have also noticed that there is now a book out which is a collection of Edwards’ papers, with commentaries, Phylogenetic Inference, Selection Theory, and a History of Science. Presumably, it is like W. D. Hamilton’s Narrow Roads of Gene Land series. I wish more eminent researchers would publish these sorts of compilations near the end of their careers.

There have been no edits below (notice the British spelling). But I did add some links!

David’s interview begins after this point:

A. W. F. (Anthony) Edwards is one of Britain’s most distinguished geneticists. He studied genetics at Cambridge as one of the last students of R. A. Fisher, and like Fisher he has contributed actively to both genetics and statistics. In genetics his work includes several influential papers on the reconstruction of phylogenies, and a widely-read recent article on ‘Lewontin’s Fallacy’. In statistics he is known especially for his development and advocacy of the concept of Likelihood as a criterion for scientific inference. He has also made a notable contribution to combinatorial mathematics by finding a method of constructing Venn diagrams for any number of sets. In addition to many scientific papers, he has written four books: Likelihood (1972; expanded edition 1992); Foundations of Mathematical Genetics (1977; 2nd edition 2000); Pascal’s Arithmetical Triangle: the Story of a Mathematical Idea (1987; expanded edition 2002); and Cogwheels of the Mind: the Story of Venn Diagrams (2004). He has written extensively on the history of genetics, mathematics, and statistics, and has co-edited (with H. A. David) Annotated Readings in the History of Statistics (2001), and (with Milo Keynes and Robert Peel) A Century of Mendelism in Human Genetics (2004). He is also a champion glider pilot.

1. You were among the last students of R. A. Fisher. Can you share with us some reminiscences of him?

I first met Fisher in the summer of 1956 and had much contact with him until his death six years later. I saw him last during the Second Human Genetics Conference in Rome in 1961 and subsequently corresponded with him. I have published quite a few of my reminiscences of those times in a number of different places. I have been fortunate in having had a lot of contact with the older generation who knew him better than I did – people like Barnard, Bartlett, Finney, Yates, Race, Ruth Sanger and Bennett – and with members of his family, especially Rose, Harry and Joan (his biographer) amongst his children. Being a fellow of the same Cambridge college (Caius) as Fisher, though not at the same time, has meant daily contact with people who knew him well. But the most important thing is his science, and there everyone can get to know him through his writings, which reveal a mind of extraordinary power and vigour. That is the Fisher whom succeeding generations should learn about and admire.

2. Like Fisher you have worked in both statistics and genetics. How do you see the relationship between them, both in your own work and more generally?

In a sense I have benefitted from being an amateur in both fields so that I see no boundary between them. Though I qualified in genetics the subject almost immediately changed so radically through advances in molecular biology that most geneticists would not now regard me as one of themselves anyway. My generation thought genetics was the study of inheritance; theirs thinks it is the study of genes. As to statistics, I attended eight lectures by Henry Daniels in Cambridge but am otherwise self-taught, being hugely influenced by Fisher’s book Statistical Methods for Research Workers which he told me to buy (and then signed for me). Genetical statistics has changed fundamentally too: our problem was the paucity of data, especially for man, leading to an emphasis on elucidating correct principles of statistical inference. Modern practitioners have too much data and are engaged in a theory-free reduction of it under the neologism ‘bioinformatics’. We had to navigate by the stars; they have GPSs.

3. Much of your early work (some of it in collaboration with L. L. Cavalli-Sforza) was on methods of inferring phylogenies. How do you assess the progress in this field since the 1960s, and how have your own methods stood up to empirical tests?

All my work was in collaboration with Luca Cavalli-Sforza. It was his idea. He hired me to join his group in Pavia in Italy, not specifically to work on phylogenies but to apply the new-fangled computers to human genetics generally. The late delivery of the Olivetti computer was a blessing in disguise because it left us time to talk about what we would do with it when it came. I was initially sceptical because I knew that linkage was statistically difficult and here was Luca proposing what looked like linkage on a tree whose very shape also required estimating!

I think progress on the theoretical side has been incredibly slow, despite the best efforts of Joe Felsenstein, the leading practitioner. In a few months in 1962 and 1963 Luca and I thought up three ways of tackling the problem: least-squares on an additive tree (his), minimum evolution or parsimony (mine) and maximum-likelihood on a stochastic model (very much a joint effort). Forty-odd years on people are still arguing about the relative merits of the descendants of our methods when all along they should have been concentrating on refining the statistical approach through maximum-likelihood, which was our real contribution. Of course, from a practical point of view the computer packages have taken over in a development parallel to that in human genetics, from shakey inferences based on too little data and doubtful logic to computer algorithms trying to digest too much.

4. Your recent article on ‘Lewontin’s Fallacy’ criticises the claim that human geographical races have no biological meaning. As the article itself points out, it could have been written at any time in the last 30 years. So why did it take so long – and have you had any reactions from Lewontin or his supporters?

I can only speak for myself as to why it took me so long. Others closer to the field will have to explain why the penny did not drop earlier, but the principal cause must be the huge gap in communication that exists between anthropology, especially social anthropology, on the one hand, and the humdrum world of population and statistical genetics on the other. When someone like Lewontin bridges the gap, bearing from genetics a message which the other side wants to hear, it spreads fast – on that side. But there was no feedback. Others might have noticed Lewontin’s 1972 paper but I had stopped working in human and population genetics in 1968 on moving to Cambridge because I could not get any support (so I settled down to writing books instead). In the 1990s I began to pick up the message about only 15% of human genetic variation being between, as opposed to within, populations with its non-sequitur that classification was nigh impossible, and started asking my population-genetics colleagues where it came from. Most had not heard of it, and those that had did not know its source. I regret now that in my paper I did not acknowledge the influence of my brother John, Professor of Genetics in Oxford, because he was independently worrying over the question, inventing the phrase ‘the death of phylogeny’ which spurred me on.

Eventually the argument turned up unchallenged in Nature and the New Scientist and I was able to locate its origin. I only started writing about it after lunch one day in Caius during which I had tried to explain the fallacy across the table to a chemist, a physicist, a physiologist and an experimental psychologist – all Fellows of the Royal Society – and found myself faltering. I like to write to clear my mind. Then I met Adam Wilkins, the editor of BioEssays, and he urged me to work my notes up into a paper.

I have had no adverse reaction to it at all, but plenty of plaudits from geneticists, many of whom told me that they too had been perplexed. Perhaps the communication gap is still too large, or just possibly the point has been taken. After all, Fisher made it in 1925 in Statistical Methods which was written for biologists so it is hardly new.

5. You have written several articles about Fisher’s Fundamental Theorem of Natural Selection. Following a groundbreaking reinterpretation by George Price in the early 1970s, it is now generally accepted that the theorem as intended by Fisher is valid, but some biologists would still question its practical use or importance. Can you explain in non-technical terms the meaning of the theorem, how the correct interpretation differs from earlier misunderstandings of it, and your own view on its biological importance?

Oh, it’s very simple. You must first recall the precise name of Fisher’s book in which it is the centrepiece: The Genetical Theory of Natural Selection. He is studying the mechanisms of natural selection from the point of view of populations regarded as aggregates of genes. Of course he knows, and stresses, that this is not the whole story. But to him selection’s defining effect is to change gene frequencies. He sees that this will only happen if there is variability in the survival rates of different genes.

Animal breeders promote artificial selection by imposing different ‘fitnesses’ on their stock according to desirability, breeding from some and not from others. They thus raise the mean value in the population of the character desired. Fisher saw that this process implicitly relies on a correlation between the character and fitness, so that progress will depend both on the magnitude of this correlation and the extent to which the character is genetically determined. What happens, he then asked, if we designate fitness itself as the character, making the correlation perfect? The answer is that the mean fitness of the genes will increase by an amount that depends on the extent to which fitness is determined by them. This is the fundamental theorem (in a modern paraphrase): ‘The rate of increase in the mean fitness ascribable to natural selection acting through changes in gene frequencies is equal to the additive genetic variance in fitness’.

The theorem does not involve the mean genotypic fitness – that is, the weighted mean of the fitnesses of the genotypes – which is where most interpreters of it went wrong. Fisher’s repeated denials that his theorem referred to the mean genotypic fitness, itself immortalised in Sewall Wright’s ‘adaptive landscapes’, went unheeded. In 1941 Fisher even published an example in which gene frequencies were changed under natural selection but the mean genotypic fitness stayed constant. Nobody noticed.

The brilliance of the fundamental theorem is not merely that it expresses the central dogma of natural selection – the connection between genetic variability and selective change – but that it does so exactly. Fisher discovered what the rate of change was proportional to: not to the total variance in fitness of the genotypes but only to that part of it found by fitting a weighted linear regression to the genotypic fitnesses. This is the part accounted for by the regression itself, the so-called additive genetic variance. Animal breeders know it as the variance of the breeding values of the genotypes. The fundamental theorem disregards the way the genes are distributed through the population, which will depend on the amount of heterosis in fitness, the extent of assortative mating, and similar possibly transient effects. What matters to it are the changes to the mean fitness brought about by changing gene frequencies.
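The regression described here can be sketched for the simplest case: one locus with two alleles. The following Python is an illustration under stated assumptions, not Fisher's or Edwards' own calculation. It assumes random mating (Hardy-Weinberg genotype frequencies), regresses genotypic fitness on the number of copies of one allele, and uses made-up fitness values. The function name is hypothetical.

```python
def variance_components(p, w_AA, w_Aa, w_aa):
    """Return (mean fitness, total variance, additive variance) at one locus.

    p is the frequency of allele A and must lie strictly between 0 and 1.
    Assumption: genotypes occur in Hardy-Weinberg proportions.
    """
    q = 1.0 - p
    freqs = [p * p, 2.0 * p * q, q * q]   # regression weights: AA, Aa, aa
    copies = [2.0, 1.0, 0.0]              # number of A alleles per genotype
    fitnesses = [w_AA, w_Aa, w_aa]

    mean_x = sum(f * x for f, x in zip(freqs, copies))
    mean_w = sum(f * w for f, w in zip(freqs, fitnesses))
    var_x = sum(f * (x - mean_x) ** 2 for f, x in zip(freqs, copies))
    cov_xw = sum(
        f * (x - mean_x) * (w - mean_w)
        for f, x, w in zip(freqs, copies, fitnesses)
    )
    total = sum(f * (w - mean_w) ** 2 for f, w in zip(freqs, fitnesses))

    slope = cov_xw / var_x                # weighted least-squares slope
    additive = slope * slope * var_x      # variance accounted for by the line
    return mean_w, total, additive


# Hypothetical fitnesses: heterozygote advantage at its balanced frequency.
_, het_total, het_additive = variance_components(0.5, 0.5, 1.0, 0.5)
# Hypothetical fitnesses: each copy of A adds the same amount.
_, add_total, add_additive = variance_components(0.3, 1.0, 0.9, 0.8)

print(f"heterosis: total={het_total:.4f}, additive={het_additive:.4f}")
print(f"additive:  total={add_total:.4f}, additive={add_additive:.4f}")
```

In the heterosis case the genotypes differ in fitness, yet the fitted line is flat, so the additive part is zero and selection produces no change in gene frequency. In the purely additive case the line fits perfectly and the two variances coincide.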

This, then, is the theorem whose ‘practical use or importance’ ‘some biologists would still question’. Let them ask the animal breeders if it is any ‘use’, and let them ask themselves whether they think Darwin’s theory of evolution by natural selection is of any ‘importance’. If they do, then the fundamental theorem should help them to a deeper, Mendelian, understanding of it. If, however, they hanker after a theory that can make evolutionary predictions, like Wright’s adaptive landscapes were thought to do at one time, they are crying for the moon. Possession of the fundamental theorem will no more enable you to predict the flow of evolution than possession of Newton’s law of gravitation will enable you to predict the time of high tide at London Bridge.

It should not be forgotten, however, that shorn of its genetical complexities the theorem does have predictive power, just as the law of gravitation does when applied to the celestial movements that underlie the tides. ‘In a subdivided population the rate of change of the overall growth-rate is proportional to the variance in growth rates’. The ‘populations’ could be economic sectors, for example, or even one’s own savings accounts.
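The savings-account version can be checked numerically. The sketch below is a discrete-time illustration with made-up balances and interest rates. In discrete time, the change in the overall growth rate over one period works out to the size-weighted variance in rates divided by one plus the current overall rate. The continuous-time statement quoted above drops that divisor. Function names are illustrative.

```python
def mean_growth_rate(sizes, rates):
    """Overall growth rate: the size-weighted mean of the sector rates."""
    total = sum(sizes)
    return sum(s * r for s, r in zip(sizes, rates)) / total


def variance_in_growth_rates(sizes, rates):
    """Size-weighted variance of the sector rates."""
    total = sum(sizes)
    mean = mean_growth_rate(sizes, rates)
    return sum(s * (r - mean) ** 2 for s, r in zip(sizes, rates)) / total


def grow_one_period(sizes, rates):
    """Let every sector grow at its own fixed rate for one period."""
    return [s * (1.0 + r) for s, r in zip(sizes, rates)]


# Hypothetical accounts: balances and their annual interest rates.
sizes = [1000.0, 2000.0, 500.0]
rates = [0.01, 0.03, 0.08]

before = mean_growth_rate(sizes, rates)
after = mean_growth_rate(grow_one_period(sizes, rates), rates)
change = after - before
predicted = variance_in_growth_rates(sizes, rates) / (1.0 + before)

print(f"observed change in overall rate: {change:.8f}")
print(f"variance / (1 + overall rate):   {predicted:.8f}")
```

No individual rate changes. The overall rate rises only because the faster-growing accounts come to make up a larger share of the total, which is the analogue of changing gene frequencies.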

6. Your career since the 1950s spans the period in which computers, and off-the-shelf programs, have become widely available. Has this been an unmixed blessing, and do you think the development of statistics or genetics would have been very different if computers had been available in, say, 1900?

A mixed blessing of course, because the existence of programs hinders the development of the underlying theory. This is particularly true in statistics where, despite assertions to the contrary by Bayesians, the underlying theory is still a matter for discussion. The phenomenon can be seen in the field of phylogenetic trees, where programs based on different methods proliferate.

1900 is a peculiarly well-chosen date on which to hang the question. Not only was it the year in which Mendel’s results became widely known but it was also the year of the publication of the second edition of Karl Pearson’s The Grammar of Science, which included chapters on biological science for the first time. The Grammar of Science was hugely influential in its day, proclaiming that the function of science was ‘not to explain, but to describe by conceptual shorthand our perceptual experience’. ‘The man who classifies facts of any kind whatever, who sees their mutual relation and describes their sequences, is applying the scientific method’. The computer implementation of this sterile philosophy would have had a devastating effect, particularly on the development of statistical theory and the acceptance of Mendelism. All Pearson’s formidable energy would have been devoted to amassing vast quantities of information to be sifted for correlations. William Bateson’s 1894 six-hundred-page Materials for the Study of Variation treated with especial regard to Discontinuity in the Origin of Species would have been digitally scanned and computer programmers urged to uncover its secrets. It doesn’t bear thinking about!

7. In statistics you are especially known for developing and advocating the concept of Likelihood and its use in scientific inference. Can you explain how Likelihood differs from probability, and why Likelihood methods are useful in evaluating hypotheses?

Likelihood compares statistical hypotheses; it has nothing to say about a hypothesis on its own, like a test of significance does. Imagine two statistical hypotheses, each of which predicts the probabilities of all the possible outcomes of an experiment – which need be no more complex than tossing a biassed coin a number of times and counting the heads. The experiment is performed, the heads counted. Given this count, was the probability of heads p1 (the first hypothesis) or p2 (the second hypothesis)?

Now imagine doing the experiment lots of times assuming the first, and then the second, hypothesis. Would you not prefer the hypothesis that had the shorter expected waiting time until the exact number of heads observed turned up? If so, you have just chosen the one with the greater likelihood. The likelihood of a hypothesis is proportional to the probability of the data given the hypothesis. Meaningless for a hypothesis by itself because of the undefined constant of proportionality, with two hypotheses to be compared on the same data this constant is irrelevant, and the ratio of their likelihoods (or the difference in their log-likelihoods) becomes a measure of the support for one hypothesis versus the other.

Likelihoods therefore derive from probabilities, but unlike the latter are not additive. Whereas you can sum the probabilities of two possible outcomes of an experiment to form the probability of ‘either one or the other’, you cannot do the same for the likelihood of two hypotheses; ‘either one hypothesis or the other’ is not in itself a hypothesis enabling the probabilities of outcomes to be computed, so no likelihood for it is defined. But you can graph the likelihood as a function of p and pay special attention to its maximum, the maximum-likelihood estimate of the probability of heads.
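The coin example can be made concrete in a few lines of Python. This is a sketch with made-up numbers (7 heads in 10 tosses, hypotheses p1 = 0.5 and p2 = 0.7). The grid search stands in for "graphing the likelihood as a function of p", and the function names are illustrative rather than taken from Edwards' book.

```python
import math


def probability_of_count(p, heads, tosses):
    """Binomial probability of exactly `heads` heads in `tosses` tosses."""
    return math.comb(tosses, heads) * p**heads * (1.0 - p) ** (tosses - heads)


def expected_waiting_time(p, heads, tosses):
    """Mean number of repeat experiments until the exact count turns up."""
    return 1.0 / probability_of_count(p, heads, tosses)


def log_likelihood(p, heads, tosses):
    """Log-likelihood of p. The binomial coefficient is dropped because it
    is part of the arbitrary constant and cancels in any comparison."""
    return heads * math.log(p) + (tosses - heads) * math.log(1.0 - p)


def support(p_a, p_b, heads, tosses):
    """Difference in log-likelihoods: support for p_a over p_b."""
    return log_likelihood(p_a, heads, tosses) - log_likelihood(p_b, heads, tosses)


def maximum_likelihood_estimate(heads, tosses, grid_size=1000):
    """Grid value of p in (0, 1) at which the likelihood is largest."""
    grid = [i / grid_size for i in range(1, grid_size)]
    return max(grid, key=lambda p: log_likelihood(p, heads, tosses))


heads, tosses = 7, 10          # hypothetical data
p1, p2 = 0.5, 0.7              # the two hypotheses being compared

print("waiting time under p1:", expected_waiting_time(p1, heads, tosses))
print("waiting time under p2:", expected_waiting_time(p2, heads, tosses))
print("support for p2 over p1:", support(p2, p1, heads, tosses))
print("maximum-likelihood estimate:", maximum_likelihood_estimate(heads, tosses))
```

The hypothesis with the shorter expected waiting time is the one with the positive support, and the grid maximum lands on the observed proportion of heads.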

The concept of the likelihood function is fundamental to all approaches to statistical inference, whether Bayesian, Neyman-Pearson, or Fisherian. Not everyone agrees that it is meaningful standing alone by itself, but I (and others before me) believe it is. Doubters can always fall back on the above ‘how long to wait’ argument, which I think was due to David Sprott.

8. You have written extensively on the history of genetics, statistics, and mathematics. Apart from the intrinsic interest of historical studies, how important do you think a knowledge of the history of science is for practising scientists?

I find it essential, and cannot imagine doing science without it. Much of what counts as science nowadays is rather theory-free. We don’t really have a word for it. Sequencing the human genome, for example, is a marvellous achievement relying on technical advances of great ingenuity but it did not require historical understanding. It differs intellectually from, say, the associated activity of trying to estimate linkage values between gene loci. The history of the latter, on which I have written recently, is an essential part of the study of the problem, and much modern work suffers from its neglect.

Celebrating the centenary of the publication of The Origin of Species in 1959, Fisher said: More attention to the History of Science is needed, as much by scientists as by historians, and especially by biologists, and this should mean a deliberate attempt to understand the thoughts of the great masters of the past, to see in what circumstances or intellectual milieu their ideas were formed, where they took the wrong turning or stopped short on the right track.
I agree.

9. R. A. Fisher was a keen eugenist. What are your own views on the role (if any) of eugenics in the modern world?

Fisher’s world was so different from ours, in three ways in particular. Then (say the period between the wars) nation-states were much more independent of each other so that it was possible to discuss population matters for Britain in relative isolation; secondly, it was a time of concern about the possibility of a declining home population; and thirdly many scientists were in the first flush of enthusiasm for the application of Mendelian principles – so recently elucidated – to man. None of this is true today.

For myself, though I was once a grateful holder of a Darwin Research Fellowship of the Eugenics Society (now the Galton Institute), since boyhood I have been more concerned about the quantity of people on earth rather than their quality. In the early 1960s I was a founder-member of a body called, I think, the Conservation Society, which does not seem to exist today. Its main platform was that too large a population would be unsustainable. At the time there was much discussion about over-population which was seen as one of the greatest dangers facing mankind. Interestingly, the worse the problem gets, the less it is discussed. Yet the mounting dangers we face, such as the possibility of global warming, are all exacerbated by too high a world population, given its enthusiasm for motor-cars, aeroplanes, and environmentally-damaging activity generally. It seems that people fear the charge of racism if they comment on population growth – they intuitively understand Fisher’s fundamental theorem.

10. Like yourself, your brother, J. H. Edwards, is also a distinguished geneticist. Nature, nurture, or sibling rivalry?

Well, certainly not sibling rivalry. It is true that we have been sufficiently alike at some stages of our lives to have been mistaken for each other. At the Rome Conference of Human Genetics in 1961 we were in a lift with the Swedish geneticist Jan Lindsten when he engagingly introduced us to another participant as ‘the two most confused brothers in genetics’.

In fact I am 7 1/2 years younger than John, and due to mother’s illness, father’s war service, and wartime privations generally, I hardly encountered him until the end of the war when I was ten. We developed boyhood enthusiasms for science quite independently, he for biology, me for astronomy. But there was a common factor in our education from thirteen to eighteen. We both attended Uppingham School, though of course not at the same time, and were exceptionally well-taught in science and mathematics, in some cases by the same teachers. I cannot stress this influence too strongly. Since I only went to Uppingham because my elder brother did, is that nature or nurture?

Subsequently John’s main influence was when I was learning about likelihood (see the preface to my book Likelihood). He, being medically qualified, keeps me straight on medical matters and I try to keep him straight on things statistical. I deliberately stayed off linkage theory so as not to get too close to his interests. John was more influenced by Lancelot Hogben and J. B. S. Haldane than I was. There is a wonderful letter from Fisher to R. R. Race in 1960 in which he refers to me as ‘my Edwards from Cambridge’ and to John as ‘only one of Hogben’s [pupils]’, so at least Fisher got us straight.

And John introduced me to gliding. Though not exactly a ‘champion’, to use your word, I have enjoyed fifty years gliding and hope for a few more yet.

September 5, 2018

My interview of James F. Crow in 2006

Filed under: Jim Crow,Population genetics — Razib Khan @ 3:13 pm

Since the death of L. L. Cavalli-Sforza I’ve been thinking about the great scientists who have passed on. Last fall, I mentioned that Mel Green had died. There was a marginal personal connection there. I had the privilege to talk to Green at length about sundry issues, often nonscientific. He was someone who had been doing science so long he had talked to Charles Davenport in the flesh (he was not complimentary of Davenport’s understanding of Mendelian principles). It was like engaging with a history book!

A few months before I emailed Cavalli-Sforza, I had sent a message on a lark to James F. Crow. It was really a rather random thing, I never thought that Crow would respond. But in fact he emailed me right back! And he answered 10 questions from me, as you can see below the fold. The truth is I probably wouldn’t have thought to try and get in touch with Cavalli-Sforza if it hadn’t been so easy with Crow.

If you are involved in population genetics you know who Crow is. No introduction needed. Some of the people he supervised, such as Joe Felsenstein, have gone on to transform evolutionary biology in their own turn.

Born in 1916, Crow’s scientific career spanned the emergence of population genetics as a mature field, to the discovery of the importance of DNA, to molecular evolution & genomics. He had a long collaboration with Motoo Kimura, the Japanese geneticist instrumental in pushing forward the development of “neutral theory.”

He died in 2012.

Below are the questions I asked 12 years ago. My interests have changed somewhat, so it’s interesting to see what I was curious about back then. And of course fascinating to read Crow’s responses.

Originally posted June 26, 2006

James F. Crow is Professor Emeritus of Genetics at the University of Wisconsin. A collaborator with Motoo Kimura on Neutral Theory, he remains an active member of the evolutionary genetics community.

1) In 2002 in “Perspective: Here’s to Fisher, additive genetic variance, and the fundamental theorem of natural selection,” you conclude, “is there any other quantity that captures so much evolutionary meaning in such a simple way?” in reference to additive genetic variance. And yet, what about other factors like statistical epistasis? Do gene-gene interactions pack enough of an evolutionary punch to be anything more than a footnote in God’s Book? Have you seen Loren Rieseberg’s work at Indiana which points to the importance of loci of large effect?

The remarkable thing about additive genetic variance is that it predicts the effect of selection, even in the presence of dominance and epistasis. Nature seems to follow least-squares principles. The result is that the additive component of variance pulls out of dominance and epistatic variance those components associated with allele frequency change under selection. Of course the theory is not exact, but it is a very good first approximation. Fisher did not ignore epistasis, as some have said; rather he showed how selection can utilize epistatic (and dominance) components of variance.

On a more technical level, Kimura showed that under selection with loose linkage the population rather soon attains a state in which the linkage-disequilibrium variance approximately cancels the epistatic variance. Thus, under this circumstance the effects of selection are better predicted by ignoring additive by additive epistatic variance than by including it. See my book with Kimura (1970, p. 217 ff).

I am aware of Rieseberg’s work on sunflowers. QTL mapping and various other molecular methods are indeed finding alleles with large effect in many species. It is inevitable that the first genes discovered will be those with largest effect, so I expect alleles with smaller effects to follow. How large a part genes with large effect have played in evolution is still up in the air, as far as I know. But they are getting more emphasis now than in the recent past.

2) R.A. Fisher is reputed to have aimed for an “ideal gas law” of evolutionary genetics (The Fundamental Theorem of Natural Selection?). In the paper above you state that you expect “mathematical theory” to become more “general and rigorous.” How near are we to an “ideal gas law” for evolutionary genetics which takes the step beyond a qualitative heuristic, if such a thing is possible?

It is not surprising that Fisher, who was trained in classical physics, would use physical analogies. Various mathematical geneticists, such as Tom Nagylaki of the University of Chicago, have found more general and accurate expressions, and I expect this to continue. I don’t expect evolution to imitate classical physics in such things as an ideal gas law. For example, Fisher’s analogizing fitness with entropy is better regarded as a metaphor than as rigorous science.

3) Computational methods have come to the fore within the past generation as an alternative to analytic modes for attacking theoretical problems. Do you believe this has been wholly a good thing, and if not, can you elaborate?

Yes, I think it is a good thing. Many problems in population genetics cannot be solved by a mathematician, no matter how gifted. Although I expect improvements in the mathematical theory, it is already clear that computer methods are very powerful. This is good. It also permits people with limited mathematical knowledge to work on important problems; but I don’t expect it to entirely replace mathematical theory.

4) The 1966 Lewontin and Hubby allozyme papers reported a great deal more polymorphism than either the followers of Wright or Fisher expected (i.e., Balance School and Classical School). The work with Neutral Theory and its successors stepped into the theoretical breach. In hindsight, does it seem that Neutral Theory was plausible a priori, or did the evolutionary geneticists of the pre-DNA era simply miss the possibility (and ubiquity) of neutral substitutions because they did not have a good mental model of variation on the molecular level?

The amount of variability disclosed by Lewontin and Hubby was more than some expected, although it did not seem particularly surprising to me. It is important to say, as Lewontin was the first to articulate, that the difference between the classical and balance schools does not lie in the amount of variability (variability is an observable and not a theoretical quantity). Rather the difference in the two schools was the way in which variability was thought to be maintained: mainly by mutation-selection balance or mainly by heterosis.

I think neutral variability came as a surprise to almost everybody. Of course, it was an outgrowth of molecular methodology, which made possible the study of DNA itself rather than phenotypic traits. I don’t think it was the absence of a mental model as much as not knowing in advance the enormous number of nucleotides in the genome, and how little of the DNA, especially in mammals, is protein-coding.

5) Do you believe that group selection (i.e., inter-demic selection) might have played a significant role in the evolution of H. sapiens sapiens?

I’m sure it did, for our ancestors for many years had a tribal existence with competition, even wars, between groups. I suspect that group structure may be responsible for much altruistic behavior. In a small group everyone is related, so behaving cooperatively or altruistically toward members of a group is the genetic equivalent of kin-selection. Muller and others emphasized this idea. There is a level of relatedness in a group at which the welfare of the group prevails over the welfare of individuals. Egbert Leigh quantified this as did Aoki and I. 

6) When your commentary on Arthur Jensen’s infamous Harvard Educational Review article on the inheritance of IQ and racial differences was published in 1969, did you have any inkling that the issues raised by Jensen would remain largely unresolved over thirty-five years later? What kind of evidence do you think would decide these issues one way or the other?

I did not expect the issues to be resolved soon, for there were no new methods that promised to be more informative. Of course, the structure of DNA had been discovered, but the powerful methods now available had not yet been developed. I think further identification of individual genes, usually by molecular methods, and a combination of statistical and molecular methods are pointing the way toward a solution. I don’t expect racial differences to be either entirely genetic or entirely environmental, but of course I don’t know the relative amount; it is likely to be different for different traits and different human groups.

7) In your recent review of “Genes in Conflict” you state in reference to Robert Trivers’ papers published in the 1970s that, “They were ignored by most social scientists, who were reluctant to consider natural selection as a cause of human behavioral traits, and they were bitterly attacked by Marxists for reasons of doctrine.” Recently the University of Chicago evolutionary genomicist Bruce Lahn has come under fire (as profiled in The Wall Street Journal, June 16th edition) for his study of ASPM, a locus implicated in brain development, from both geneticists and non-geneticists because of the sensitivity of the possibility of intergroup variation due to differential evolutionary forces within the past 40,000 years. Last year the paper put forward by Gregory Cochran, Henry Harpending and Jason Hardy that argued high Ashkenazi IQ was due to recent natural selection also ignited a firestorm. It seems that we are entering a new era of human genetics as a great deal of data will soon be available for theorists to analyze (e.g. the HapMap and its successors). Are “controversial” questions still going to be off limits, or will the science compel the political and cultural taboos to step aside?

I hope that such questions can be approached with the same objectivity as that when we study inheritance of bristle number in Drosophila, but I don’t expect it soon. There are too many strongly held opinions. I thought Lahn had a clever idea in thinking that the normal alleles of head-reducing mutants might be responsible for evolution of larger heads in human ancestry. Likewise, I think that Cochran et al. are fully entitled to consider the reasons for Jewish intelligence and I found their arguments interesting. In my view it is wrong to say that research in this area — assuming it is well done — is out of order. I feel strongly that we should not discourage a line of research because someone might not like a possible outcome. 

8) If a budding evolutionary thinker had to read one book or paper that excluded Charles Darwin’s body of work, what would you recommend?

I would recommend Fisher’s “Genetical Theory of Natural Selection”. But the reader should be prepared to find it tough going. Fisher’s elegant obscurity has left many of us baffled, but entranced. Your “budding thinker” might want to stop before the last four chapters, which are more dated than the rest of the book. And by all means, read the 1999 variorum edition. Its appendices explain many of the book’s obscurities.

9) You’ve defended “bean bag genetics” (Nature, 2001). Lynn Margulis has complimented you personally, but seems to dismiss the whole endeavor of theoretical evolutionary biology as trivial and irrelevant when set next to the concrete realities of molecular and cell biology. Over the past generation molecular biology has dethroned physics as the “Queen of Sciences” in regards to prestige, and many young biologists seemed to take the work of Fisher, Wright, Haldane, Kimura and yourself for granted and do not concern themselves with the abstract “big picture” when mechanistic details on the DNA scale needed to be elucidated. Do you believe that over the next generation more young people will begin to look once more at evolutionary biology in its grandest abstract reaches as the “low hanging fruit” in molecular biology is exhausted?

Lynn Margulis is a long-time personal friend and has done important work on the origin of cellular organelles, but I disagree with her on this issue. It is true that the elegant theory of Fisher, Wright, Haldane, Kimura, and Malécot was less useful than might have been expected, because of lack of good data to which the theory was applicable. But that is no longer true. Molecular evolution has provided an abundance of data and the theory now has plenty of important applications. In particular, the neutral theory of molecular evolution has had great heuristic and predictive value, and it owes a great deal to Kimura’s earlier theoretical work, which built on the foundations of the pioneers. Lynn might change her mind if she looked at some of the striking results gotten by combining molecular measurements with population genetics theory. Maybe I should ask her!

10) If you had to have one last glass of beer, and your drinking partner was going to be either Fisher, Wright or Haldane, who would you choose, and why?

I would choose Haldane, for his uninhibited willingness to speculate, his enormous erudition, his interest in almost everything, his irreverence, his wit, and his enjoyment of conversation. I am told that much of the good biology in Huxley’s “Brave New World” is the result of his drinking partnership with Haldane.


July 26, 2018

Local ancestry deconvolution made simpler (?)

Filed under: Local ancestry,Population genetics — Razib Khan @ 11:37 pm

I’ve been waiting for a local ancestry deconvolution method to come out of Simon Myers’ group for a few years. Well, I think we’re there: Fine-scale Inference of Ancestry Segments without Prior Knowledge of Admixing Groups. Here’s the abstract:

We present an algorithm for inferring ancestry segments and characterizing admixture events, which involve an arbitrary number of genetically differentiated groups coming together. This allows inference of the demographic history of the species, properties of admixing groups, identification of signatures of natural selection, and may aid disease gene mapping. The algorithm employs nested hidden Markov models to obtain local ancestry estimation along the genome for each admixed individual. In a range of simulations, the accuracy of these estimates equals or exceeds leading existing methods that return local ancestry. Moreover, and unlike these approaches, we do not require any prior knowledge of the relationship between sub-groups of donor reference haplotypes and the unseen mixing ancestral populations. Instead, our approach infers these in terms of conditional “copying probabilities”. In application to the Human Genome Diversity Panel we corroborate many previously inferred admixture events (e.g. an ancient admixture event in the Kalash). We further identify novel events such as complex 4-way admixture in San-Khomani individuals, and show that Eastern European populations possess 1-5% ancestry from a group resembling modern-day central Asians. We also identify evidence of recent natural selection favouring sub-Saharan ancestry at the HLA region, across North African individuals. We make available an R and C ++ software library, which we term MOSAIC (which stands for MOSAIC Organises Segments of Ancestry In Chromosomes).

The truth is I’ve only done a quick skim of the preprint and not run the method myself to see how it works. But to be honest I can’t see where the part about Eastern Europeans is in the manuscript (I checked the supporting text)? That being said, if you run a PCA many Northern and most Eastern Europeans are clearly shifted toward East Asians compared to Southern Europeans. So I accept it.

In any case, always remember, all models are wrong. But some of them have insight.

July 13, 2018

Tutorial to run supervised admixture analyses

Filed under: Admixture,Data Analysis,Population genetics — Razib Khan @ 11:03 pm
ID               Dai    Gujrati  Lithuanians  Sardinian  Tamil
razib_23andMe    0.14   0.26     0.02         0.00       0.58
razib_ancestry   0.14   0.26     0.02         0.00       0.58
razib_ftdna      0.14   0.26     0.02         0.00       0.57
razib_daughter   0.05   0.14     0.29         0.18       0.34
razib_son        0.07   0.17     0.28         0.19       0.30
razib_son_2      0.06   0.19     0.29         0.19       0.27
razib_wife       0.00   0.07     0.55         0.38       0.00

This is a follow-up to my earlier post, Tutorial To Run PCA, Admixture, Treemix And Pairwise Fst In One Command. Hopefully you’ll be able to run supervised admixture analysis with less hassle after reading this.

The above results are from a supervised admixture analysis of my family and myself. The fact that there are three replicates of me is due to the fact that I converted my 23andMe, Ancestry, and Family Tree DNA raw data into plink files. Notice that the results are broadly consistent. This emphasizes that discrepancies between DTC companies in their results are due to their analytic pipeline, not because of data quality.

The results for my family are not surprising. I’m ~14% “Dai”, reflecting East Asian admixture into Bengalis. My wife is ~0% “Dai”. My children are somewhere in between. At such a low fraction you expect some variance in the F1.
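As a rough check on those numbers: if you ignore the randomness of recombination and segregation, a child's expected fraction for any component is just the average of the two parents' fractions. A minimal sketch using the Dai column from the table above (the variable names are only for illustration):

```shell
# Expected "Dai" fraction in a child = average of the two parents' fractions.
father=0.14   # razib_23andMe, Dai column
mother=0.00   # razib_wife, Dai column
expected=$(awk -v f="$father" -v m="$mother" 'BEGIN { printf "%.2f", (f + m) / 2 }')
echo "expected child Dai fraction: $expected"
```

The three children come out at 0.05, 0.07, and 0.06, which is the sort of scatter around the midpoint you would expect.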

Now below are results for three Swedes with the same reference panel:

Group    ID         Dai    Gujrati  Lithuanians  Sardinian  Tamil
Sweden   Sweden17   0.00   0.09     0.63         0.28       0.00
Sweden   Sweden18   0.00   0.08     0.62         0.31       0.00
Sweden   Sweden20   0.00   0.05     0.72         0.23       0.00

All these were run on supervised admixture frameworks where I used Dai, Gujrati, Lithuanians, Sardinians, and Tamils, as the reference “ancestral” populations. Another way to think about it is: taking the genetic variation of these input groups, what fractions does a given test focal individual shake out at?

The commands are rather simple. For my family:
bash rawFile_To_Supervised_Results.sh TestScript

For the Swedes:
bash supervisedTest.sh Sweden TestScript

The commands need to be run in a folder: ancestry_supervised/.

You can download the zip file.

Here is what the scripts do in two different situations. Imagine you have raw genotype files downloaded from 23andMe, Ancestry, and Family Tree DNA.

Download the files as usual. Rename them in an intelligible way, because the file names are going to be used for generating IDs. So above, I renamed them “razib_23andMe.txt” and such. Leave the extensions as they are. Obviously, you need to make sure they are not compressed. Then place them all in ancestry_supervised/RAWINPUT.

The script looks for the files in there. You don’t need to specify names, it will find them. In plink the family ID and individual ID will be taken from the text before the extension in the file name. Output files will also have the file name.
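To preview the IDs you will end up with, you can strip the extension from each file name yourself. This is a sketch of the naming rule only, assuming the ID is the file name minus its last extension; it is not the code inside the script, and the temporary folder and empty files exist just to make the demo self-contained.

```shell
# Sketch: show the ID that would be derived from each raw file name.
workdir=$(mktemp -d)
mkdir -p "$workdir/RAWINPUT"
touch "$workdir/RAWINPUT/razib_23andMe.txt" "$workdir/RAWINPUT/razib_ancestry.txt"

ids=""
for f in "$workdir"/RAWINPUT/*; do
    name=$(basename "$f")     # e.g. razib_23andMe.txt
    id="${name%.*}"           # strip the last extension
    echo "$name -> $id"
    ids="$ids $id"
done
rm -rf "$workdir"
```

If two files would produce the same ID, rename one of them before running anything.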

Aside from the raw genotype files, you need to determine a reference file. In REFERENCEFILES/ you see the binary pedigree/plink file Est1000HGDP, the same file from the earlier post. It would be crazy to run supervised admixture on the dozens of populations in this file. You need to create a subset.

For the above I did this:
grep -E "Dai|Guj|Lithua|Sardi|Tamil" REFERENCEFILES/Est1000HGDP.fam > keep.keep

./plink --bfile REFERENCEFILES/Est1000HGDP --keep keep.keep --make-bed --out REFERENCEFILES/TestScript

When the script runs, it converts the raw genotype files into plink files and puts them in INDIVPLINKFILES/. Then it takes each plink file and uses it as a test against the reference population file. That file has a prefix on the group/family IDs of the form AA_Ref_. This is essential for the script to understand that this is a reference population. The .pop files are automatically generated, and the script sets the correct K by counting the unique populations.
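If you are curious what a .pop file looks like, here is a minimal sketch of the idea, not the actual code in the script. As I understand the format, admixture in supervised mode reads one line per individual, in the same order as the .fam file, with a population label for reference individuals and "-" for everyone else. The file names and sample IDs below are made up for the demo.

```shell
# Sketch: build a .pop file from a merged .fam file. Reference rows are
# recognised by the AA_Ref_ prefix on the family ID; everyone else gets "-".
workdir=$(mktemp -d)
cat > "$workdir/merged.fam" <<'EOF'
AA_Ref_Dai dai1 0 0 1 -9
AA_Ref_Dai dai2 0 0 1 -9
AA_Ref_Tamil tamil1 0 0 2 -9
test1 test1 0 0 0 -9
EOF

awk '{ if ($1 ~ /^AA_Ref_/) { sub(/^AA_Ref_/, "", $1); print $1 } else print "-" }' \
    "$workdir/merged.fam" > "$workdir/merged.pop"

# K is the number of distinct reference labels.
K=$(grep -v '^-$' "$workdir/merged.pop" | sort -u | wc -l | tr -d ' ')
pop_lines=$(tr '\n' ' ' < "$workdir/merged.pop")
echo "K = $K"
rm -rf "$workdir"
```

With two reference groups in the toy file, K comes out as 2, and the test individual is the only row marked "-".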

The admixture is going to be slow. I recommend you modify runadmixture.pl by adding the number-of-cores parameter so it can go multi-threaded.

When the script is done it will put the results in RESULTFILES/. They will be .csv files with strange names (they will have the original filename you provided, but there are timestamps in there so that if you run the files with a different reference and such it won’t overwrite everything). Each individual is run separately and has a separate output file (a .csv).

But this is not always convenient. Sometimes you want to test a larger batch of individuals. Perhaps you want to use the reference file I provided? For the Swedes I did this:
grep "Swede" REFERENCEFILES/Est1000HGDP.fam > keep.keep

./plink --bfile REFERENCEFILES/Est1000HGDP --keep keep.keep --make-bed --out INDIVPLINKFILES/Sweden

Please note the folder. There are modifications you can make, but the script assumes that the test files are in INDIVPLINKFILES/. The next part is important. The Swedish individuals will have AA_Ref_ prepended on each row since you got them out of Est1000HGDP. You need to remove this. If you don’t remove it, it won’t work. In my case, I modified using the vim editor:
vim Sweden.fam

You can do it with a text editor too. It doesn’t matter. Though it has to be the .fam file.
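If you would rather not open an editor, sed can strip the prefix for you. A sketch on a two-row toy file (the rows are made up; only the AA_Ref_ prefix comes from the real data). It writes through a backup copy rather than using sed -i, because the in-place flag behaves differently on Mac and Ubuntu.

```shell
workdir=$(mktemp -d)
printf '%s\n' \
    'AA_Ref_Sweden Sweden17 0 0 0 -9' \
    'AA_Ref_Sweden Sweden18 0 0 0 -9' > "$workdir/Sweden.fam"

# Strip the prefix from the start of each row, keeping a backup copy.
cp "$workdir/Sweden.fam" "$workdir/Sweden.fam.bak"
sed 's/^AA_Ref_//' "$workdir/Sweden.fam.bak" > "$workdir/Sweden.fam"

first_row=$(head -n 1 "$workdir/Sweden.fam")
echo "$first_row"
rm -rf "$workdir"
```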

After the script is done, it will put the .csv file in RESULTFILES/. It will be a single .csv with multiple rows. Each individual is tested separately though, so what the script does is append each result to the file. If you have 100 individuals, it will take a long time. You may want to look in the .csv file as the individuals are being added to make sure it looks right.

The convenience of these scripts is that they do some merging/flipping/cleaning for you. And they format the output so you don’t have to.

I originally developed these scripts on a Mac, but to get them to work on Ubuntu I made a few small modifications. I don’t know if they still work on Mac, but you should be able to make the modifications if not. Remember for a Mac you will need the Mac versions of plink and admixture.

For supervised analysis, the reference populations need to make sense and be coherent. Please check the earlier tutorial and use the PCA functions to remove outliers.

July 12, 2018

Running your own analyses

Filed under: Genetics,Population genetics,Scripts — Razib Khan @ 8:25 am

For the technically inclined people here: Tutorial To Run PCA, Admixture, Treemix And Pairwise Fst In One Command.

July 11, 2018

Tutorial to run PCA, Admixture, Treemix and pairwise Fst in one command

Filed under: Admixture,data,Fst,PCA,PLINK,Population genetics,TreeMix — Razib Khan @ 11:50 pm

Today on Twitter I stated that “if the average person knew how to run PCA with plink and visualize with R they wouldn’t need to ask me anything.” What I meant by this is that the average person often asks me “Razib, is population X closer to population Y than Z?” To answer this sort of question I dig through my datasets and run a few exploratory analyses, and get back to them.

I’ve been meaning to write up and distribute a “quickstart” for a while to help people do their own analyses. So here I go.

The audience of this post is probably two-fold:

  1. “Trainees” who are starting graduate school and want to dig in quickly into empirical data sets while they’re really getting a handle on things. This tutorial will probably suffice for a week. You should quickly move on to three population and four population tests, and Eigensoft and AdmixTools, as well as fineSTRUCTURE.
  2. The larger audience is technically oriented readers who are not, and never will be, geneticists professionally. 

What do you need? First, you need to be able to work in a Linux (or similar Unix-like) environment. I work both in Ubuntu and on a Mac, but this tutorial and these scripts were tested on Ubuntu. They should work OK on a Mac, but there may need to be some modifications on the bash scripts and such.

Assuming you have a Linux environment, you need to download this zip or tar.xz file. Once you open this file it should decompress into a folder, ancestry/.

There are a bunch of files in there. Some of them are scripts I wrote. Some of them are output files that aren’t cleaned up. Some of them are packages that you’ve heard of. Of the latter:

  • admixture
  • plink
  • treemix

You can find these online too, though these versions should work out of the box on Ubuntu. If you have a Mac, you need the Mac versions. Just put the Mac versions into the folder ancestry/ in their place. You may need some libraries installed on Ubuntu too if you recompile them yourself. Check the errors and make search engines your friends.

You will need to install R (or R Studio). If you are running Mac or Ubuntu on the command line you know how to get R. If not, Google it.

I also put some data in the file. In particular, a plink set of files Est1000HGDP. These are merged from the Estonian Biocentre, HGDP, and 1000 Genomes. There are 4,899 individuals in the data, with 135,000 high quality SNPs (very low missingness).

If you look in the “family” file you will see an important part of the structure. So do:

less Est1000HGDP.fam

You’ll see something like this:
Abhkasians abh154 0 0 1 -9
Abhkasians abh165 0 0 1 -9
Abkhazian abkhazian1_1m 0 0 2 -9
Abkhazian abkhazian5_1m 0 0 1 -9
Abkhazian abkhazian6_1m 0 0 1 -9
AfricanBarbados HG01879 0 0 0 -9
AfricanBarbados HG01880 0 0 0 -9

There are 4,899 rows, one for each individual. I have used the first column to label the ethnic/group identity. The second column is the individual ID. You can ignore the last 4 columns.

There is no way you want to analyze all the different ethnic groups. Usually, you want to look at a few. For that, you can use lots of commands, but what you need is a subset of the rows above. The grep command matches and returns rows with particular patterns. It’s handy. Let’s say I want just Yoruba, British (who are in the group GreatBritain), Gujarati, Han Chinese, and Druze. The command below will work (note that Han matches HanBeijing, Han_S, Han_N, etc.).

grep -E "Yoruba|Great|Guj|Han|Druze" Est1000HGDP.fam > keep.txt

The file keep.txt has the individuals you want. Now you put it through plink to generate a new file:

./plink --bfile Est1000HGDP --keep keep.txt --make-bed --out EstSubset

This new file has only 634 individuals. That’s more manageable. But more important is that there are far fewer groups for visualization and analysis.
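Since grep matches substrings, it is worth checking which groups a pattern actually caught before you run anything slow. The sketch below does that on a toy .fam file with made-up rows. It uses grep -E so that | is treated as "or" (plain grep would look for a literal | character), then counts individuals per group from column 1.

```shell
workdir=$(mktemp -d)
cat > "$workdir/demo.fam" <<'EOF'
Druze d1 0 0 1 -9
HanBeijing h1 0 0 1 -9
Han_S h2 0 0 2 -9
Lithuanians l1 0 0 1 -9
Yoruba y1 0 0 2 -9
EOF

grep -E "Yoruba|Great|Guj|Han|Druze" "$workdir/demo.fam" > "$workdir/keep.txt"

# Count kept individuals per group (column 1).
counts=$(cut -d' ' -f1 "$workdir/keep.txt" | LC_ALL=C sort | uniq -c \
    | awk '{ print $2 "=" $1 }' | tr '\n' ' ')
echo "$counts"
rm -rf "$workdir"
```

Here both Han groups are kept and Lithuanians are dropped. On the real data the same cut/sort/uniq pipeline on keep.txt will show you if a short pattern pulled in a group you did not intend.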

As for that analysis, I have a Perl script with a bash script within it (and some system commands). Here is what they do:

1) they perform PCA to 10 dimensions
2) then they run admixture on the number of K clusters you want (unsupervised), and generate a .csv file you can look at
3) then I wrote a script to do pairwise Fst between populations, and output the data into a text file
4) finally, I create the input file necessary for the treemix package and then run treemix with the number of migrations you want

There are lots of parameters and specifications for these packages. You don’t get those unless you edit the scripts or make them more extensible (I have versions that are more flexible but I think newbies will just get confused so I’m keeping it simple).

Assuming I create the plink file above, running the following command means that admixture does K = 2 and treemix does 1 migration edge (that is, -m 1). The PCA and pairwise Fst run automatically.

perl pairwise.perl EstSubset 2 1

Just walk away from your box for a while. The admixture will take the longest. If you want to speed it up, figure out how many cores you have, and edit the file makecluster.sh, go to line 16 where you see admixture. If you have 4 cores, then type -j4 as a parameter. It will speed admixture up and hog all your cores.
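If you don't know how many cores you have, the machine can tell you. A small sketch, assuming nproc is available on Linux and falling back to sysctl on a Mac:

```shell
# Number of available cores: nproc on Linux, sysctl on a Mac.
if command -v nproc >/dev/null 2>&1; then
    cores=$(nproc)
else
    cores=$(sysctl -n hw.ncpu)
fi
echo "admixture flag: -j$cores"
```

You may want to use one or two fewer than the number reported if you plan to keep using the machine while admixture runs.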

There is a .csv that has the admixture output: EstSubset.admix.csv. If you open it you see something like this:
Druze HGDP00603 0.550210 0.449790
Druze HGDP00604 0.569070 0.430930
Druze HGDP00605 0.562854 0.437146
Druze HGDP00606 0.555205 0.444795
GreatBritain HG00096 0.598871 0.401129
GreatBritain HG00097 0.590040 0.409960
GreatBritain HG00099 0.592654 0.407346
GreatBritain HG00100 0.590847 0.409153

Column 1 will always be the group, column 2 the individual, and all subsequent columns will be the K’s. Since K = 2, there are two columns. Space separated. You should be able to open the .csv or process it however you want to process it.
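Because the layout is that regular, summarizing it takes one awk command. The sketch below averages each K column within each group, using four of the rows shown above copied into a temporary file (on your own run you would point it at EstSubset.admix.csv instead).

```shell
workdir=$(mktemp -d)
cat > "$workdir/demo.admix.csv" <<'EOF'
Druze HGDP00603 0.550210 0.449790
Druze HGDP00604 0.569070 0.430930
GreatBritain HG00096 0.598871 0.401129
GreatBritain HG00097 0.590040 0.409960
EOF

# Average every K column (3 onward) within each group (column 1).
means=$(awk '
    { n[$1]++; for (i = 3; i <= NF; i++) sum[$1, i] += $i; nf = NF }
    END {
        for (g in n) {
            line = g
            for (i = 3; i <= nf; i++) line = line sprintf(" %.3f", sum[g, i] / n[g])
            print line
        }
    }' "$workdir/demo.admix.csv" | LC_ALL=C sort)
echo "$means"
rm -rf "$workdir"
```

The script works for any K, since it loops over however many columns follow the individual ID.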

You’ll also see two other files: plink.eigenval plink.eigenvec. These are generic output files for the PCA. The .eigenvec file has the individuals along with the values for each PC. The .eigenval file shows the magnitude of the dimension. It looks like this:

Basically, this means that PC 1 explains twice as much of the variance as PC 2. Beyond PC 4 it looks like they’re really bunched together. You can open up this file as a .csv and visualize it however you like. But I gave you an R script. It’s RPCA.R.
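If you want numbers rather than eyeballing it, you can turn the eigenvalues into shares. This is a sketch with made-up eigenvalues, assuming the file has one eigenvalue per line. Note that the shares are relative to the PCs that were computed (10 here), not to the total variance in the data.

```shell
workdir=$(mktemp -d)
# Made-up eigenvalues, one per line, standing in for plink.eigenval.
printf '%s\n' 40 20 10 5 > "$workdir/demo.eigenval"

# Share of each PC relative to the sum of the listed eigenvalues.
shares=$(awk '{ v[NR] = $1; total += $1 }
    END { for (i = 1; i <= NR; i++) printf "PC%d %.3f\n", i, v[i] / total }' \
    "$workdir/demo.eigenval")
echo "$shares"
rm -rf "$workdir"
```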

You need to install some packages. First, open R or R studio. If you want to go command line at the terminal, type R. Then type:

Once those packages are loaded you can use the script:

Then, to generate the plot at the top of this post:

There are some useful parameters in this function. The plot to the left adds some shape labels to highlight two populations. A third population I label by individual ID. This second option is important if you want to do outlier pruning, since there are mislabels, or just plain outlier individuals, in a lot of data (including in this dataset). I also zoomed in.

Here’s how I did that:
plinkPCA(subVec = c("Druze","GreatBritain"),labelPlot = c("Lithuanians"),xLim=c(-0.01,0.0125),yLim=c(0.05,0.062))

To look at stuff besides PC 1 and PC 2 you can do plinkPCA(PC=c("PC3","PC6")).

I put the PCA function in the script, but to remove individuals you will want to run the PCA manually:

./plink --bfile EstSubset --pca 10

You can remove individuals manually by creating a remove file. What I like to do though is something like this:
grep "randomID27 " EstSubset.fam >> remove.txt

The double greater-than sign (>>) appends to the remove.txt file, so you can add individuals in the terminal in one window while running PCA and visualizing with R in the other (Eigensoft has an automatic outlier removal feature). Once you have the individuals you want to remove, then:

./plink --bfile EstSubset --remove remove.txt --make-bed --out EstSubset
./plink --bfile EstSubset --pca 10

Then visualize!
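The trailing space in the grep pattern above is there so that randomID27 doesn't also match randomID270. If you collect a longer list of outliers, an exact match on the ID column is safer. A sketch, assuming a hypothetical outliers.txt with one individual ID per line and a made-up toy .fam file:

```shell
workdir=$(mktemp -d)
cat > "$workdir/demo.fam" <<'EOF'
Druze id2 0 0 1 -9
Druze id27 0 0 1 -9
Druze id270 0 0 2 -9
EOF
printf '%s\n' id27 > "$workdir/outliers.txt"   # hypothetical list of IDs to drop

# Keep family ID and individual ID for rows whose column 2 is in the list.
awk 'NR == FNR { drop[$1]; next } ($2 in drop) { print $1, $2 }' \
    "$workdir/outliers.txt" "$workdir/demo.fam" > "$workdir/remove.txt"

removed=$(cat "$workdir/remove.txt")
echo "$removed"
rm -rf "$workdir"
```

Only id27 ends up in the remove file; id2 and id270 are left alone.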

To make use of the pairwise Fst you need the fst.R script. If everything is set up right, all you need to do is type:

It will load the file and generate the tree. You can modify the script so you have an unrooted tree too.

The R script is what generates the FstMatrix.csv file, which has the matrix you know and love.

So now you have the PCA, Fst and admixture. What else? Well, there’s treemix.

I set the number of SNPs per block to 1000 (so -k 1000) and turned on global rearrangement. You can change the details in the perl script itself. Look to the bottom. I think the main utility of my script is that it generates the input files. The treemix package isn’t hard to run once you have those input files.

Also, as you know treemix comes with R plotting functions. So run treemix with however many migration edges (you can have 0), and then when the script is done, load R.


But actually, you don’t need to do the above. I added a script to generate a .png file with the treemix plot in pairwise.perl. It’s called TreeMix.TreeMix.Tree.png.

OK, so that’s it.

To review:

Download zip or tar.xz file. Decompress. All the packages and scripts should be in there, along with a pretty big dataset of modern populations. If you are on a non-Mac Linux you are good to go. If you are on a Mac, you need the Mac versions of admixture, plink, and treemix. I’m going to warn you compiling treemix can be kind of a pain. I’ve done it on Linux and Mac machines, and gotten it to work, but sometimes it took time.

You need R and/or R Studio (or something like R Studio). Make sure to install the packages or the scripts for visualizing results from PCA and pairwiseFst won’t work.*

There is already a .csv output from admixture. The PCA also generates expected output files. You may want to sort, so open it in a spreadsheet.

This is potentially just the start. But if you are a layperson with a nagging question and can’t wait for me, this should get you where you need to go!

* I wrote a lot of these things piecemeal and often a long time ago. It may be that not all the packages are even used. Don’t bother to tell me.

June 21, 2018

The fault in our parameters

Filed under: Population genetics — Razib Khan @ 10:49 pm

Of the books I own, Elements of Evolutionary Genetics is one I consult frequently because of its range and comprehensiveness. The authors, Brian Charlesworth and Deborah Charlesworth, have an encyclopedic knowledge of the literature. To truly understand the evolutionary process in all its texture and nuance it is important to absorb a fair amount of theory, and Elements of Evolutionary Genetics does do that (though it’s not as abstruse as something like An Introduction to Population Genetics Theory).

When I see a paper by one of the Charlesworths, I try to read it. Not because I have a love of Drosophila or Daphnia, but because to develop strong population genetics intuitions it always helps to stand on the shoulders of giants. So with that, I pass on this preprint, Mutational load, inbreeding depression and heterosis in subdivided populations:

This paper examines the extent to which empirical estimates of inbreeding depression and inter-population heterosis in subdivided populations, as well as the effects of local population size on mean fitness, can be explained in terms of estimates of mutation rates, and the distribution of selection coefficients against deleterious mutations provided by population genomics data. Using results from population genetics models, numerical predictions of the genetic load, inbreeding depression and heterosis were obtained for a broad range of selection coefficients and mutation rates. The models allowed for the possibility of very high mutation rates per nucleotide site, as is sometimes observed for epiallelic mutations. There was fairly good quantitative agreement between the theoretical predictions and empirical estimates of heterosis and the effects of population size on genetic load, on the assumption that the deleterious mutation rate per individual per generation is approximately one, but there was less good agreement for inbreeding depression. Weak selection, of the order of magnitude suggested by population genomic analyses, is required to explain the observed patterns. Possible caveats concerning the applicability of the models are discussed.

June 19, 2018

Burmese are a bit Bengali

Filed under: Burma,Population genetics — Razib Khan @ 12:48 am

About ten years ago I read the book The River of Lost Footsteps: Histories of Burma. Though I have read books where Burma figures prominently (e.g., Strange Parallels), this is the only history of Burma I have read. The author is Burmese, and provides something much more than a travelogue, as might have been the case if he were of Western background. By chance over the past month or so I’ve been in contact with the author, who made a few inquiries as to the genetics of his own family (he came with genotypes in hand). But this brought us to the issue of the genetics of the Burmese people, and their position in the historical-genetic landscape.

The author of The River of Lost Footsteps reminded me of something that’s curious about Southeast Asia: its Indic influences tend to be from the south of the subcontinent. In particular, the native scripts derive from a South Indian parent. Could genetics confirm this connection as well? Also, could genetics give some insights as to the timing of admixture/gene-flow?

In theory, yes.

I had a lot of Southeast Asian datasets to play with, and did a lot of pruning to remove outliers (e.g., people with obvious recent Chinese ancestry). First, comparing them to Bangladeshis, it seems even without local ancestry tract analysis that Burmese and Malays have more varied, and so likely recent, exogenous ancestry than Bangladeshis. At least this is evident on the PCA plot, where these two groups exhibit strong admixture clines toward South Asians.

But what about the question of Southeast Asian affinities? This needs deeper analysis. Three-population tests, which measure admixture with outgroups when compared to a dyad of populations which are modeled as a clade, can be informative.
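For readers who have not met the statistic, this is the standard definition of f3 as I understand it from the admixture-testing literature, not something specific to my run. Write a, b, and c for the allele frequencies at a SNP in populations A, B, and C, and take the average over SNPs:

```latex
f_3(C; A, B) = \mathbb{E}\left[(c - a)(c - b)\right]
```

A clearly negative value (a z-score below roughly -3 is the usual rule of thumb) indicates that C carries ancestry from populations related to both A and B. I read the table that follows with C as the first column, and Pop1 and Pop2 in the roles of A and B.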

Outgroup Pop1 Pop2 f3 z
Bangladeshi Telugu Cambodians -0.00183999 -46.3322
Bangladeshi Telugu Han -0.00220121 -46.046
Burma Telugu Han -0.00406071 -51.0018
Burma Han Bangladeshi -0.00348186 -49.1398
Burma Han Punjabi_ANI_2 -0.00418193 -47.2351
Cambodians Telugu Viet -0.00126923 -16.91
Cambodians Punjabi_ANI_2 Viet -0.00129881 -15.6039
Cambodians Bangladeshi Viet -0.000970022 -14.5642
Malay Igorot Telugu -0.00249795 -18.758
Malay Igorot Bangladeshi -0.00223454 -18.5212
Malay Igorot Punjabi_ANI_2 -0.00250732 -18.3027
Malay Igorot Cambodians -0.00107817 -16.6214
Viet Han Cambodians -0.000569337 -13.1139
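
For readers who want to see why a negative f3 points to admixture, here is a minimal sketch in Python. It assumes the naive form of the statistic, the average over SNPs of (c - a)(c - b), where c is the allele frequency in the tested population (the first column of the table, going by how the results are read below) and a and b are the frequencies in the two candidate sources. The allele frequencies are made up for illustration, and the finite-sample correction and block-jackknife standard errors that real tools apply are left out.

```python
def f3(target, source1, source2):
    """Naive f3(target; source1, source2) from per-SNP allele frequencies.

    Illustrative only: omits the sample-size correction used in practice.
    """
    terms = [(c - a) * (c - b) for c, a, b in zip(target, source1, source2)]
    return sum(terms) / len(terms)


# Hypothetical allele frequencies at five SNPs in two source populations.
pop_a = [0.10, 0.80, 0.30, 0.95, 0.50]
pop_b = [0.70, 0.20, 0.90, 0.15, 0.40]

# A target built as a 60/40 mixture of the two sources.
mixed = [0.6 * a + 0.4 * b for a, b in zip(pop_a, pop_b)]

f3_mixed = f3(mixed, pop_a, pop_b)   # negative: target sits between the sources
f3_pop_a = f3(pop_a, mixed, pop_b)   # positive: pop_a is not a mix of the others
print(f3_mixed, f3_pop_a)
```

In this toy setup the mixed population's frequency always falls between the two sources, so every per-SNP product is zero or negative. Drift in the target after admixture pushes the statistic back up, which is why a negative value is taken as evidence of admixture while a positive one does not rule it out.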

Bangladeshis show strong signatures with both Cambodians and Han. This is in accordance with earlier analysis which suggests Austro-Asiatic and Tibeto-Burman contributions to the “East Asian” element of Bengali ancestry. The Burmese always have Han ancestry, with a South Asian donor as well. This aligns with other PCA analysis which shows the Burmese samples skewed toward Han Chinese. Burma is a compound of different ethnic groups. Some are Austro-Asiatic. The Bamar, the core “Burman” group, have some affinities to Tibetans. And the Shan are a Tai people who are relatively late arrivals.

Cambodians have a weaker admixture signature and are paired with a South Asian group and their geographic neighbors the Vietnamese. The Malays are similar to Cambodians but have the Igorot people from the Philippines as one of their donors. And finally, not surprisingly the Vietnamese show some mixture between Han-like and Cambodian-like ancestors.

Further PCA analysis shows that while Cambodians and Malays tend to skew somewhat neutrally to South Asians (the recent Indian migration to Malaysia is mostly Tamil), the Burmese are shifted toward Bangladeshis:


Finally, I ran some admixture analyses.

First, I partitioned the samples with an unsupervised set of runs (K = 4 and K = 5). In this way I obtained reified reference groups as follows:

“Austronesians” (Igorot tribesmen from the Philippines)
“Austro-Asiatic” (a subset of Cambodians with the least exogenous admixture)
“North Indians” (Punjabis)
“South Indians” (A subset of middle-caste Telugus highest on the modal element in South Indians)
“Han” (a proxy for “northern” East Asian)

The results are mostly as you’d expect. In line with three-population tests, the Vietnamese are Han and Austro-Asiatic. More of the former than latter. There is a minor Austronesian component. Notice there is no South Asian ancestry in this group.

In contrast, Cambodians have low levels of both North and South Indian. These out-of-sample Cambodians are still highly modal for Austro-Asiatic though.

Malays are more Austro-Asiatic than Austronesian, which might surprise. But the Igorot samples are highly drifted and distinct. I think these runs are underestimating Austronesian in the Malays. Notice that some of the Malays have South Asian ancestry, but a substantial number do not. This large range in admixture is what you see in PCA as well. I think this strongly points to the fact that Malays have been receiving gene-flow from India recently, as it is not well mixed into the population.
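
The link between a wide range of individual ancestry fractions and recent admixture can be sketched with a toy calculation. It assumes a single pulse in which a fraction m of the population is entirely "foreign", followed by random mating in which each child’s ancestry fraction is the average of its two parents’. Under those assumptions the variance among individuals halves every generation. The model and the value of m are illustrative, and it ignores continuous gene-flow, assortative mating, and the small variance that recombination keeps in play.

```python
def ancestry_variance(m, generations):
    """Variance of individual ancestry fractions some generations after one pulse.

    Toy model: at the pulse a fraction m of individuals have ancestry 1 and the
    rest 0; afterwards each child averages two randomly chosen parents, which
    halves the variance every generation.
    """
    return m * (1 - m) / 2 ** generations


# Hypothetical 10% pulse, looked at after 0, 1, 5 and 20 generations.
for t in (0, 1, 5, 20):
    print(t, ancestry_variance(0.1, t))
```

Read in reverse, a population where some people carry a lot of a given ancestry and others carry none is hard to square with an old, single episode of mixing.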

The Bangladeshi outgroup is mostly a mix of North and South Indian, with a slight bias toward the latter. No surprise. As I suggested earlier you can see that the Bangladeshi samples are hard to model as just a mix of Burmese with South Asians. The Austro-Asiatic component is higher in them than the Burmese. This could be because Burma had recent waves of northern migration (true), and, eastern India prior to the Indo-Aryan expansion was mostly inhabited by Austro-Asiatic Munda (probably true). That being said, the earlier analysis suggested that the Munda cannot be the sole source of East Asian ancestry in Bengalis.

Finally, every single Burmese sample has South Asian ancestry. Much higher than Cambodians. And, there is variance. I think that leads us to the likely conclusion that Burma has been subject to continuous gene-flow as well as recent pulses of admixture from South Asia. The variation in South Asian ancestry in the Burmese is greater than East Asian ancestry in Bengalis. I believe this is due to more recent admixture in Burmese due to British colonial Indian settlement in that country.

The cultural and historical context of this discussion is the nature of South Asian, Indic, influence on Southeast Asia. One cannot deny that there has been some gene-flow between Southeast Asia and South Asia. In prehistoric times it seems that Austro-Asiatic languages moved from mainland Southeast Asia to India. More recently there are historically attested, and genetically confirmed, instances of colonial Indian migration. But, the evidence from Cambodia suggests that this is likely also ancient, as unlike Malaysia or Burma, Cambodia did not have any major flow of Indian migrants during the colonial period. One could posit that perhaps the Cambodian Indian affinity is a function of “Ancestral South Indian.” But the Cambodians are not skewed toward ASI-enriched groups in particular. And, I know for a fact that appreciable frequencies of R1a1a exist within the male Khmer population (this lineage is common in South Asia, especially the north and upper castes).

As far as Burma goes, I think an older period of South Indian cultural influence, and some gene-flow seems likely. But, with the expansion of Bengali settlement to the east over the past 2,000 years, more recent South Asian ancestry is probably enriched for that ethnolinguistic group.

I’m going to try and follow-up with some ancestry tract analysis….

May 30, 2018

Soft & hard selection vs. soft & hard sweeps

Filed under: Population genetics,Quantitative Genetics — Razib Khan @ 12:09 am

When I was talking to Matt Hahn I made a pretty stupid semantic flub, confusing “soft selection” with “soft sweeps.” Matt pointed out that soft/hard selection were terms more appropriate to quantitative genetics rather than population genomics. His viewpoint is defensible, though going back into the literature on soft/hard selection, e.g., Soft and hard selection revisited, the main thinkers pushing the idea were population geneticists who were also considering ecological questions.*

The strange thing is that I had already known the definitions of hard and soft selection on some level because I had read about them as I was getting confused with hard and soft sweeps! But this was more than ten years ago now, and since then I haven’t given the matter enough thought obviously, as I defaulted back to confusing the two classes of terms, just as I used to.

Matt pointed out that truncation selection is a form of hard selection. All individuals below (or above) a certain phenotype value have a fitness of zero, as they don’t reproduce. In a single locus context, hard selection would involve deleterious lethal alleles, whose impact on the genotype was the same irrespective of ecological context. So hard selection operates by reducing the fitness of individuals/genotypes to zero.

For soft selection, context matters much more, and you would focus more on relative fitness differences across individuals/genotypes. Some definitions of soft vs. hard selection emphasize that in the former case fitness is defined relative to the local ecological patch, while the latter is a universal estimate. Soft selection does not necessarily operate through the zero fitness value for a genotype, but rather differential fitness. Hard selection can crash your population size. Soft selection does not necessarily do that.
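
One way to make the contrast concrete is a small sketch that follows the density-dependent reading of the two terms. Under hard selection each genotype survives according to its own absolute viability, so the total number of survivors can fall. Under soft selection the local patch yields a fixed number of survivors and only relative fitness decides who they are. The genotype counts, viabilities, and the `capacity` parameter are all hypothetical.

```python
def hard_selection(counts, viability):
    """Each genotype survives with its own absolute viability; total size can drop."""
    return {g: counts[g] * viability[g] for g in counts}


def soft_selection(counts, viability, capacity):
    """The patch always yields `capacity` survivors; only relative fitness matters."""
    weighted = {g: counts[g] * viability[g] for g in counts}
    total = sum(weighted.values())
    return {g: capacity * w / total for g, w in weighted.items()}


# Hypothetical patch of 1,000 individuals and made-up viabilities.
counts = {"AA": 250, "Aa": 500, "aa": 250}
viability = {"AA": 1.0, "Aa": 0.9, "aa": 0.5}

hard = hard_selection(counts, viability)
soft = soft_selection(counts, viability, capacity=1000)

print(sum(hard.values()))  # about 825: the population shrinks
print(sum(soft.values()))  # about 1000: the patch stays full
```

The genotype proportions among survivors are the same in both cases here. What differs is whether selection changes the size of the population.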

Though I won’t outline the details, one of the originators of the soft/hard selection concept analogized them to density-dependent/independent dynamics in ecology. If you know the ecological models, the correspondence probably is obvious to you.

As for hard and soft sweeps, these are particular terms of relevance to genomics, because genome-wide data has allowed for their detection through the impact they have on the variation in the genome. A “sweep” is a strong selective event that tends to sweep away variation around the focus of selection. A hard sweep begins with a single mutant, and positive selection tends to drive it toward fixation.

A classical example is lactase persistence in Northern Europeans and Northwest South Asians (e.g., Punjabis). The mutation in the LCT gene is the same across a huge swath of Eurasia. And, the region of the genome around it is also the same, because regions of the genome adjacent to that single mutation increased in frequency as well (they “hitchhiked”). This produces a genetic block of highly reduced diversity since the hard selective sweep increases the frequency of so many variants which are associated with the advantageous one, and may drive to extinction most other competitive variants.
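
To get a feel for how quickly a single new mutation can rise, here is a minimal deterministic sketch. It assumes a haploid model in which the odds of the favored allele multiply by (1 + s) each generation, and it ignores drift, dominance, and recombination. The starting frequency and the selection coefficients are hypothetical and are not estimates for the LCT variant.

```python
def generations_to_sweep(p0, s, p_end=0.99):
    """Generations for an allele with advantage s to rise from p0 to p_end."""
    p, t = p0, 0
    while p < p_end:
        # Standard haploid selection recursion.
        p = p * (1 + s) / (1 + s * p)
        t += 1
    return t


# One new copy among a hypothetical 20,000 chromosomes.
fast = generations_to_sweep(p0=1 / 20000, s=0.05)
slow = generations_to_sweep(p0=1 / 20000, s=0.01)
print(fast, slow)
```

The faster the rise, the less time recombination has to break up the flanking haplotype, which is what leaves the long block of reduced diversity.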

Someone is free to correct me in the comments, but it strikes me that many hard selective sweeps are driven by soft selection. Fitness differentials between those with the advantageous alleles and those without it are not so extreme, and obviously context dependent, even in cases of hard sweeps on a single locus.

The key to understanding soft sweeps is that there isn’t a focus on a singular mutation. Rather, selection can target multiple mutations, which may have the same genetic position, but be embedded within different original gene copies. In fact, soft sweeps often operate on standing variation, preexistent alleles which were segregating in the population at low frequencies or were totally neutral. Genetic signatures of these events are less striking than those for hard sweeps because there is far less diminishment of diversity, since it’s not the increase in the frequency of a singular mutation and the hitchhiking of its associated flanking genomic region.
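
The difference in signature can be sketched with haplotype homozygosity among carriers of the favored allele, meaning the chance that two randomly drawn copies sit on the same haplotype background. This is a simplified illustration: the background shares are hypothetical, and recombination and new mutation during the sweep are ignored.

```python
def haplotype_homozygosity(background_shares):
    """Chance two sampled copies of the selected allele share a haplotype background."""
    return sum(share ** 2 for share in background_shares)


hard_sweep = [1.0]                  # one new mutation, so one background
soft_sweep = [0.4, 0.3, 0.2, 0.1]   # hypothetical: four pre-existing backgrounds

h_hard = haplotype_homozygosity(hard_sweep)
h_soft = haplotype_homozygosity(soft_sweep)
print(h_hard, h_soft)
```

With four backgrounds in these proportions the homozygosity is about 0.3 rather than 1, which is the "far less diminishment of diversity" that makes soft sweeps harder to spot.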

Soft sweeps can clearly occur with soft selection. But truncation selection can occur on polygenic traits, so depending on the architecture of the trait (i.e., effect size distribution across the loci) one can imagine them associated with hard selection as well.

Going back to the conversation I had with Matt the reason semantics is important is that terms in population genetics are informationally rich, and lead you down a rabbit-hole of inferences. If population genetics is a toolkit for decomposing reality, then you need to have your tools well categorized and organized. On occasion it is important to rectify the names.

* There are two somewhat related definitions of soft/hard selection. I’ll follow Wallace’s original line here, though I’m not sure they differ that much.

May 21, 2018

The mutation accumulation controversy continues….

Filed under: Mutation,Population genetics — Razib Khan @ 10:56 pm

Every few years I check to see if the great mutation accumulation controversy has resolved itself. I don’t know if anyone calls it that, but that’s what I think of it as. There are two major issues that matter here: mutation rates are a critical parameter in evolutionary models, and, mutation accumulation over time matters for parental age effects when it comes to disease (speaking as an older father!).

In the latter case, I’m talking about the reasons that people freeze their eggs or sperm. In the former case, I’m talking about whether we can easily extrapolate mutation rates over evolutionary time as semi-fixed, so we can infer dates of last common ancestry and such. To give a concrete example of what I’m talking about, if mutation rates varied a lot over the evolutionary history of our hominin lineage, then we might need to rethink some of the inferred timings.
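
A back-of-the-envelope sketch shows why the rate matters so much for dating. It assumes the simplest molecular clock, in which the per-site divergence between two lineages equals twice the per-year mutation rate times the time since they split. The divergence value and both rates below are illustrative round numbers, not figures from either preprint.

```python
def split_time_years(divergence_per_site, mutations_per_site_per_year):
    """Years since two lineages split, assuming a constant rate on both branches."""
    return divergence_per_site / (2 * mutations_per_site_per_year)


d = 0.012  # illustrative per-site divergence between two lineages
for mu in (0.5e-9, 1.0e-9):  # illustrative per-year mutation rates
    print(mu, split_time_years(d, mu))
```

Halving the assumed rate doubles the inferred date, so any systematic change in rate along a lineage feeds straight into the timings.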

Today two preprints came out on mutation accumulation. First, Overlooked roles of DNA damage and maternal age in generating human germline mutations. Second, Reproductive longevity predicts mutation rates in primates. What a coincidence in synchronicity!

Additionally, the last author on the second preprint, Matt Hahn, is someone I’ll be doing a podcast with this week. So aside from talking about neutral theory, and his book Molecular Population Genetics, I’m going to have to bring up this mutation business.

The figure above from the first preprint shows that the proportion of mutations derived from the father doesn’t increase over time, as textbooks generally state. Why would we expect this? Sperm keeps replicating after puberty so you should be gaining more mutations. In contrast, the eggs are arrested in meiosis. There are various mechanistic reasons that the authors of the first preprint give for why the ratio does not change between paternal and maternal mutations (e.g., non-replicative mutations seem to be the primary one). The authors are using a very “pedigree” strategy, rather than an “evolutionary” one. They’re looking at sequenced trios, and noticing patterns. I think in the near future they’ll be far more sure of what’s going on because they’ll have bigger sample sizes. They admit the effects are subtle (also, some of the p-values are getting close to 0.05).

Instead of focusing on a human pedigree, the second preprint does some sequencing on owl monkeys (I had no idea there were “owl monkeys” before this paper). They find that the mutation rate is ~32% lower in owl monkeys than in humans. Why is this?

The plot to the left shows that mutations increase with age across species (though the number of data points is pretty small). The authors contend that:

The association between mutation rates and reproductive longevity implies that changes in life history traits rather than changes to the mutational machinery are responsible for the evolution of these rates. Species that have evolved greater reproductive longevity will have a higher mutation rate per generation without any underlying change to the replication, repair, or proofreading proteins.

If I read this right: owl monkeys reproduce fast and don’t have as much reproductive longevity. Ergo, lower mutation rates (less mutational build-up from paternal side).
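
The logic can be sketched with a toy linear model in which the mutations passed on per generation are those already present at puberty plus a fixed yearly gain afterwards. The same "machinery" (identical parameters) is used for both species, and only the life history differs. Every number here is hypothetical and chosen only to show the shape of the argument, not taken from the preprint.

```python
def mutations_per_generation(at_puberty, per_year_after, puberty_age, parent_age):
    """Toy linear model: mutations carried at puberty plus a fixed yearly gain."""
    return at_puberty + per_year_after * (parent_age - puberty_age)


# Same hypothetical machinery (30 at puberty, 2 more per year), different life histories.
short_lived = mutations_per_generation(30, 2, puberty_age=1, parent_age=6)
long_lived = mutations_per_generation(30, 2, puberty_age=13, parent_age=30)
print(short_lived, long_lived)
```

Under these assumptions the longer-lived species passes on more mutations per generation without any change to replication or repair.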

After all these years I’m still not convinced about anything. I assume that eventually bigger data sets will come online and we’ll resolve this. Someone has to be right!

(not too many people on Twitter get what’s going on either)

May 8, 2018

The peoples of the Maghreb have some Pleistocene roots

Filed under: Human Population Genetics,North Africa,Population genetics — Razib Khan @ 11:58 pm
Moroccan Berber man

The Maghreb is an important and interesting place. In the history of Western civilization, the tension between Carthage, the ancient port city based out of modern-day Tunisia, and Rome, is one of the more dramatic and tragic rivalries that has resonances down through the ages. Read Adrian Goldsworthy’s chapter on the Battle of Cannae in The Punic Wars for what I’m alluding to (and of course there was Cato the Elder’s dramatic remonstrations).

Later Roman Africa, which really encompassed northern Morocco, coastal Algeria, and Tunisia and Tripolitania, became a major social and economic pillar of the Imperium. Not only did men such as the emperor Septimius Severus and St. Augustine have roots in the region, but these provinces were a major economic bulwark for the Western Empire in its last century. The wealthy Senators of the 4th and 5th century were often absentee landlords of vast estates in North Africa. The fall of these provinces to the Vandals and Alans in the 430s began the transformation of the Western Empire based in Rome into a more regional player, rather than a true hegemon (perhaps an analogy here can be made to the loss of Anatolia by the Byzantines in the 11th century).

Another important aspect of North Africa is that it is the westernmost extension of the region possibly settled by Near Eastern farmers in Africa. The native Afro-Asiatic Berber languages seem to have been dominant in the region despite the influence and prestige of Punic and Latin in the cities when Muslim Arabs conquered the region in the late 7th century. The genetic-demographic characteristics of the region are relevant to attempts to understand the origins of the Afro-Asiatic languages more generally since Berber is part of the clade with the Semitic languages.

A preprint and a paper utilizing ancient DNA have shed a great deal of light on these questions recently. The paper is in Science, Pleistocene North African genomes link Near Eastern and sub-Saharan African human populations. The preprint is Ancient genomes from North Africa evidence prehistoric migrations to the Maghreb from both the Levant and Europe. They are in broad agreement, though they cover somewhat different periods.

The figure below is the big finding of the Science paper:

They retrieved some genotypes from a site in northern Morocco, Taforalt, which dates to ~15,000 years before the present. This is a Pleistocene site, before the rise of agriculture. The Taforalt individuals are about 65% Eurasian in affinity, and 35% Sub-Saharan African. This confirms that the Eurasian back-migration to northern Africa predates the Holocene, just as many archaeologists and geneticists have reported earlier.

The samples from the preprint date to a later time. IAM in the samples dates to 7,200 years before the present, and KEB to ~5,000 years before the present. It seems pretty clear that the IAM samples in the preprint exhibit continuity with the Taforalt samples. Though it is not too emphasized in the preprint the lower K’s seem to strongly suggest that the IAM samples have Sub-Saharan African ancestry, just like the Taforalt samples which are nearly 8,000 years older. In the KEB samples, the fraction drops, probably diluted in part by ancestry related to what we elsewhere term “Early European Farmer” (EEF), related to the Anatolian farming expansion.

Both the Taforalt and IAM samples, in particular, seem to exhibit strong affinities to Natufian/Levantine peoples. Additionally, many of these samples carry Y chromosome haplogroup E1b, just like some of the Natufians. These results indicate that the Natufian-North African populations were exchanging genes, or were on one cline, rather deep in the Pleistocene.

Though various methods have suggested that there is a lot of recent Sub-Saharan African admixture, dating to the Arab period, in North Africa, these results suggest that much of it is far older. The Mozabites, as an isolated Berber group, reflect this tendency. Though some individuals have inflated African ancestry due to recent admixture, much of it is older and more evenly distributed. And yet the Mozabites seem to have less Sub-Saharan African ancestry on average than the IAM sample.

There aren’t enough data points to make a strong inference about the temporal transect, but these few results imply a decline in Sub-Saharan ancestral component after the Pleistocene with further farming migration, and then a rise again with the trans-Saharan slave trade during the Muslim period. Another issue, highlighted in the preprint, is likely heterogeneity within the Maghreb in ancestry (lowland populations in modern North Africa tend to have more Sub-Saharan ancestry due to where slaves were settled).

In the Science paper the authors make an attempt to adduce the origin of the Sub-Saharan contribution to the Taforalt individuals. The result is that there is no modern or ancient proxy that totally fits the bill. These individuals have affinities to many Sub-Saharan African populations. The Sub-Saharan component is likely heterogeneous, but attempts to model European genetic variation during the Ice Age ran into the trouble that divergence from modern populations was quite great. Until we get more ancient DNA there probably won’t be too much more clarity.

On the issue of the Eurasian ancestry, it’s clearly quite like the Natufians. But curiously the authors find that the Neanderthal ancestry in these samples is greater than that found in early Holocene Iran samples. From this, the authors conclude that they may have had a lower fraction of “Basal Eurasian” (BEu) than those populations further to the east. But already 15,000 years ago BEu populations were mixed with more generic West Eurasians to generate the back-migration to Africa. If BEu diverged from other Eurasians >50,000 years ago, then it may have merged back into the “Out-of-Africa” populations around or before the Last Glacial Maximum, ~20,000 years ago.

Finally, the authors looked at some pigmentation genes. Curiously the Taforalt and IAM individuals did not carry the derived variants for pigmentation found in many West and South Eurasians, but the KEB did. This confirms results from Europe, and population genomic inference in modern samples, that selection for derived pigmentation variants is relatively recent in the Holocene.

I do want to add that one possibility about the Sub-Saharan ancestry in the Taforalt, and probably all modern North Africans to a lesser extent, is that it is ancient and local. We now know proto-modern humans were present in the region >300,000 years ago. Northwest Africa may have been part of the multi-regional metapopulation of H. sapiens, as opposed to the Eurasian biogeographic zone in which it is often placed, before a post-LGM back migration of Eurasians.

February 24, 2018

Are Turks Armenians under the hood?

Filed under: Historical Population Genetics,Population genetics — Razib Khan @ 8:31 pm

Benedict Anderson’s Imagined Communities: Reflections on the Origin and Spread of Nationalism is one of those books I haven’t read, but should. In contrast, I have read Azar Gat’s Nations, which is a book-length counterpoint to Imagined Communities. To take a stylized and extreme caricature, Imagined Communities posits nations to be recent social and historical constructions, while Nations sees them as primordial, and at least originally founded on ties of kinship and blood.

The above doesn’t capture the subtlety of Gat’s book, and I’m pretty sure it doesn’t capture that of Anderson’s either. But, those are the caricatures that people take away and project in public, especially Anderson’s (since Gat’s is not as famous).

When it comes to “imagined communities” I recently have been thinking about how well that of modern Turks fits into the framework. Though forms of pan-Turkic nationalism can be found as early as 9th-century Baghdad, the ideology truly emerges in force in the late 19th century, concomitantly with the development of a Turkish identity in Anatolia which is distinct from the Ottoman one.

The curious thing is that though Turkic and Turkish identity is fundamentally one of language and secondarily of religion (the vast majority of Turkic peoples are Muslim, and there are periods, such as the 17th century when the vast majority of Muslims lived in polities ruled by people of Turkic origin*), there are some attempts to engage in biologism. This despite the fact that the physical dissimilarity of Turks from Turkey and groups like the Kirghiz and Yakut is manifestly clear.

Several years ago this was made manifestly clear in the paper The Genetic Legacy of the Expansion of Turkic-Speaking Nomads across Eurasia. This paper clearly shows that Turkic peoples across Eurasia have been impacted by the local genetic substrate. In plainer language, the people of modern-day Turkey mostly resemble the people who lived in Turkey before the battle of Manzikert and the migration of Turkic nomads into the interior of the peninsula in the 11th century A.D. Of course, there is some genetic element which shows that there was a migration of an East Asian people into modern day Anatolia, but this component is the minority one.**

Sometimes the Turkish fascination with the biological comes out in strange ways: see Turkish genealogy database fascinates, frightens Turks. Much of the discussion has to do with prejudice against Armenians and Jews. But the reality is that most Turks at some level do understand that they are descended from Greeks, Armenians, Georgians, etc.

To interrogate this further I decided to look at a data set of Greeks, Turks, Armenians, Georgians, and a few other groups, including Yakuts, who are the most northeastern of Turkic peoples. The SNP panel was >200,000, and I did some outlier pruning. Additionally, I didn’t have provenance on a lot of the Greeks, except some labeled as from Thessaly. I therefore just split those up with “1” being closest to the Thessaly sample and “3” the farthest.

First, let’s look at the PCA.

The Turks are shifted toward the Yakuts, but not too much. In contrast, there is much more of a Yakut shift in Tajiks, and especially Turkmens. These are two groups from further east, closer to the heart of the zone of Turkic expansion. Curiously, the Tajiks, who are the dominant non-Turkic Iranian speaking people of Central Asia, actually have more East Asian ancestry than the Turks of Turkey. This goes to show that ethnicity is somewhat fluid, and Turkic people have assimilated into the Tajik identity. That being said, please note that the Turkmen are notably more east-shifted than the Tajik.

Let’s see how this looks on pairwise Fst.
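
For reference, pairwise Fst can be sketched from allele frequencies alone. The version below assumes Wright’s definition for two equally weighted populations, computed as a ratio of sums over SNPs, with no correction for sample size. The frequencies are invented, and this is not necessarily the estimator behind the figures in this post.

```python
def fst(freqs_1, freqs_2):
    """Wright's Fst for two equally weighted populations, as a ratio of sums over SNPs."""
    h_total = h_within = 0.0
    for p1, p2 in zip(freqs_1, freqs_2):
        p_bar = (p1 + p2) / 2
        # Expected heterozygosity of the pooled population.
        h_total += 2 * p_bar * (1 - p_bar)
        # Average expected heterozygosity within each population.
        h_within += (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


# Hypothetical frequencies at three SNPs: one close pair, one distant pair.
close = fst([0.50, 0.20, 0.70], [0.52, 0.22, 0.68])
far = fst([0.50, 0.20, 0.70], [0.90, 0.60, 0.10])
print(close, far)
```

Because each pair is boiled down to a single number, very distant populations in the same matrix stretch the scale and make the small differences among close neighbors hard to see.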

Fst is kind of difficult for fine distinctions when you have outgroups like Yakuts and Dai. So let’s look at Treemix with five migrations:

On this, you can see that the relationship of the Greece clusters on Treemix to Lithuanians matches PCA. Greece1 is the closest, Greece3 the farthest.

The Turks are close to the Georgians and Armenians, but not the Kurds, or Tajiks. And, they receive gene flow from the Turkmen-Yakut region of the graph. So do the Tajiks…but the Tajiks also receive gene flow from the Lithuanians. The admixture plot makes it more clear what’s happening I think.

Yellow ~ modal in Southern Europe, green ~ modal Northern Europe, red ~ Central Asian, while blue and purple are northern and southern East Asian. In comparison to Turks of Anatolia Tajiks have a lot more Northern European affinity, probably because of the common steppe heritage. Not surprisingly, Turks have more Southern European like ancestry.

Curiously the East Asian ancestry in the Turkic people seems to be both Yakut and Dai like, so perhaps it was more cosmopolitan than we might think? The Yakuts after all are from the northern edge of the range, and may have absorbed a lot of indigenous Siberian ancestry.

Georgians have none of the Northern European sort of ancestry, but Armenians do, and Turks even more. One could posit that this is due to Slavic ancestry arriving with the Rumelian Turks who arrived in the 20th century, but just as likely is the possibility that Turks have a lot of ancestry from western Anatolia which was Greek, and Greeks have more of this than Armenians.

It’s hard to tell from these results whether Turks have more of an affinity with Greeks or Armenians as their non-Turkic ancestors. So I ran a three population test.

Outgroup X1 X2 f3 error z
Turkey Armenians Yakut -0.00253688 6.70852e-05 -37.8158
Turkey Greece3 Yakut -0.00246931 6.72384e-05 -36.7247
Turkey Georgian Yakut -0.00256555 7.60158e-05 -33.7502
Turkey Armenians Dai -0.00246779 7.40038e-05 -33.3468
Turkey Greece3 Dai -0.0024101 7.34629e-05 -32.8071
Turkey Georgian Dai -0.00249174 8.11957e-05 -30.688
Turkey Greece2 Yakut -0.00222382 7.62368e-05 -29.1699
Turkey Greece2 Dai -0.00231001 8.39207e-05 -27.5261
Turkmen Turkey Dai -0.00288213 0.000108049 -26.6742
Turkmen Turkey Yakut -0.00254805 0.000102816 -24.7826
Turkey Greece1 Yakut -0.00225638 9.94722e-05 -22.6836
Turkey GreekCentral Dai -0.00235681 0.000104014 -22.6587
Turkey Greece3 Tajik -0.000622671 2.76666e-05 -22.5063
Turkey GreekCentral Yakut -0.00221985 0.000101654 -21.8373
Turkey Greece1 Dai -0.00243254 0.000112011 -21.717
Turkey Greece3 Turkmen -0.000640439 3.33529e-05 -19.2019
Turkey GreekThessaly Yakut -0.00208436 0.00011042 -18.8767
Turkey Dai GreekThessaly -0.00225435 0.00012241 -18.4163
Turkey Greece2 Turkmen -0.000584983 3.29819e-05 -17.7365
Turkey Armenians Turkmen -0.000520887 3.07253e-05 -16.953
Turkey Armenians Tajik -0.000421139 2.55274e-05 -16.4975
Tajik Turkey Dai -0.00140423 8.51697e-05 -16.4875
Tajik Turkey Yakut -0.00124601 7.60725e-05 -16.3793
Turkey Georgian Turkmen -0.000532496 3.80694e-05 -13.9875
Turkey Greece2 Tajik -0.000412419 3.04172e-05 -13.5587
Turkey Armenians Lithuanians -0.000459831 3.75838e-05 -12.2348
Turkey Greece1 Turkmen -0.000570715 4.7753e-05 -11.9514
Turkey Kurds Yakut -0.00146087 0.000124799 -11.7058
Turkey GreekThessaly Turkmen -0.000516877 4.46683e-05 -11.5714
Turkey Georgian Tajik -0.000328859 3.02443e-05 -10.8734
Turkey GreekCentral Turkmen -0.000504962 4.92555e-05 -10.2519
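
A quick check on how to read the columns: in the rows above, z is simply f3 divided by the error column. The snippet below recomputes it for three rows copied from the table. By the usual convention a z-score below about -3 is treated as significant evidence of admixture, though the exact threshold is a matter of convention rather than something this table establishes.

```python
# (X1, X2, f3, error, reported z) for three rows of the table, target = Turkey.
rows = [
    ("Armenians", "Yakut", -0.00253688, 6.70852e-05, -37.8158),
    ("Greece3", "Yakut", -0.00246931, 6.72384e-05, -36.7247),
    ("Kurds", "Yakut", -0.00146087, 0.000124799, -11.7058),
]

for x1, x2, f3_value, std_error, z_reported in rows:
    z = f3_value / std_error
    print(x1, x2, round(z, 2), z_reported)
```

This also shows why the ordering by z is not the same as the ordering by f3: a pair with a slightly less negative f3 but a tighter error can still rank higher.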

Armenians beat out Greece3 by a bit, but really it’s hard to say from this that the result is definitive. It’s likely that my Turkish sample has both, and/or the original Turkic nomads had Iranian-like ancestry which was more like Armenian than Greek? Hard to say. Additionally, the fact that Greece3 is better than the other options suggests to me that the source is Anatolian Greeks who were less impacted by migrations from the north than Greeks in Greece proper.


* The Mughals were Central Asian Turks, while the Safavids were mostly Azeri Turks.

** Since the Turks who arrived in Anatolia had long sojourned in Turan and Iran it is important not to assume that their contribution is limited only to the East Asian component of ancestry.
