Razib Khan One-stop-shopping for all of my content

April 23, 2017

The logic of human destiny was inevitable 1 million years ago

Filed under: Evolution,Genetics,Genomics,Human Evolution,Human Genetics — Razib Khan @ 1:11 pm

Robert Wright’s best book, Nonzero: The Logic of Human Destiny, was published nearly 20 years ago. At the time I was moderately skeptical of his thesis. It was too teleological for my tastes. And, it does pander to a bias in human psychology whereby we look to find meaning in the universe.

But this is 2017, and I have somewhat different views.

In the year 2000 I broadly accepted the thesis outlined a few years later in The Dawn of Human Culture. That our species, our humanity, evolved and emerged in rapid sequence, likely due to biological changes of a radical kind, ~50,000 years ago. This is the thesis of the “great leap forward” of behavioral modernity.

Today I have come closer to models proposed by Michael Tomasello in The Cultural Origins of Human Cognition and Terrence Deacon in The Symbolic Species: The Co-evolution of Language and the Brain. Rather than a punctuated event, an instance in geological time, humanity as we understand it was a gradual process, driven by general dynamics and evolutionary feedback loops.

The conceit at the heart of Robert J. Sawyer’s often overly preachy Neanderthal Parallax series, that if our own lineage went extinct but theirs did not they would have created a technological civilization, is I think in the main correct. It may not be entirely coincidental that hyper-drive cultural flexibility evolved in African modern humans first. There may have been sufficient biological differences to enable this to be likely. But I believe that if African modern humans were removed from the picture Neanderthals would have “caught up” and been positioned to begin the trajectory we find ourselves in during the current Holocene inter-glacial.

Luke Jostins’ figure showing across-the-board encephalization

The data indicate that all human lineages were subject to increased encephalization. That process trailed off ~200,000 years ago, but it illustrates the general evolutionary pressures, ratchets, or evolutionary “logic”, that applied to all of them. Overall there were some general trends in the hominin lineage that began to characterize us about a million years ago. We pushed into new territory. Our rate of cultural change seems to have gradually increased across our whole range.

One of the major holy grails I see now and then in human evolutionary genetics is to find “the gene that made us human.” The scramble is definitely on now that more and more whole genome sequences from ancient hominins are coming online. But I don’t think any such gene will ever be found. There isn’t “a gene,” but a broad set of genes which were gradually selected upon in the process of making us human.

In the lingo, it wasn’t just a hard sweep from a de novo mutation. It was as much, or even more, soft sweeps from standing variation.

April 20, 2017

Aryan marauders from the steppe came to India, yes they did!

Filed under: Genetics,Genomics,History,India — Razib Khan @ 10:21 pm

It seems every post on Indian genetics elicits dissents from loquacious commenters who are woolly on the details of the science, but convinced in their opinions (yes, they operate through uncertainty and obfuscation in their rhetoric, but you know where the axe is lodged). This post is an attempt to answer some questions so I don’t have to address this in the near future, as ancient DNA papers will finally start to come out soon, I hope (at least earlier than Winds of Winter).

In 2001’s The Eurasian Heartland: A continental perspective on Y-chromosome diversity Wells et al. wrote:

The current distribution of the M17 haplotype is likely to represent traces of an ancient population migration originating in southern Russia/Ukraine, where M17 is found at high frequency (>50%). It is possible that the domestication of the horse in this region around 3,000 B.C. may have driven the migration (27). The distribution and age of M17 in Europe (17) and Central/Southern Asia is consistent with the inferred movements of these people, who left a clear pattern of archaeological remains known as the Kurgan culture, and are thought to have spoken an early Indo-European language (27, 28, 29). The decrease in frequency eastward across Siberia to the Altai-Sayan mountains (represented by the Tuvinian population) and Mongolia, and southward into India, overlaps exactly with the inferred migrations of the Indo-Iranians during the period 3,000 to 1,000 B.C. (27). It is worth noting that the Indo-European-speaking Sourashtrans, a population from Tamil Nadu in southern India, have a much higher frequency of M17 than their Dravidian-speaking neighbors, the Yadhavas and Kallars (39% vs. 13% and 4%, respectively), adding to the evidence that M17 is a diagnostic Indo-Iranian marker. The exceptionally high frequencies of this marker in the Kyrgyz, Tajik/Khojant, and Ishkashim populations are likely to be due to drift, as these populations are less diverse, and are characterized by relatively small numbers of individuals living in isolated mountain valleys.

In a 2002 interview with the India site Rediff, the first author was more explicit:

Some people say Aryans are the original inhabitants of India. What is your view on this theory?

The Aryans came from outside India. We actually have genetic evidence for that. Very clear genetic evidence from a marker that arose on the southern steppes of Russia and the Ukraine around 5,000 to 10,000 years ago. And it subsequently spread to the east and south through Central Asia reaching India. It is on the higher frequency in the Indo-European speakers, the people who claim they are descendants of the Aryans, the Hindi speakers, the Bengalis, the other groups. Then it is at a lower frequency in the Dravidians. But there is clear evidence that there was a heavy migration from the steppes down towards India.

But some people claim that the Aryans were the original inhabitants of India. What do you have to say about this?

I don’t agree with them. The Aryans came later, after the Dravidians.

Over the past few years I’ve gotten to know the above first author Spencer Wells as a personal friend, and I think he would be OK with me relaying that to some extent he was under strong pressure to downplay these conclusions. Not only were, and are, these views not popular in India, but the idea of mass migration was in bad odor in much of the academy during this period. Additionally, there was later work which was less clear, and perhaps supported an Indian origin for R1a1a. Spencer himself told me that it was not impossible for R1a to have originated in India, but a branch eventually back-migrated to southern Asia.

But even researchers from the group at Stanford where he had done his postdoc did not support this model by the middle 2000s, Polarity and Temporality of High-Resolution Y-Chromosome Distributions in India Identify Both Indigenous and Exogenous Expansions and Reveal Minor Genetic Influence of Central Asian Pastoralists. In 2009 a paper out of an Indian group was even stronger in its conclusion for a South Asian origin of R1a1a, The Indian origin of paternal haplogroup R1a1* substantiates the autochthonous origin of Brahmins and the caste system.

By 2009 one might have admitted that perhaps Spencer was wrong. I was certainly open to that possibility. There was very persuasive evidence that the mtDNA lineages of South Asia had little to do with Europe or the Middle East.

Yet a closer look at the above papers reveals two major systematic problems.

First, ancient DNA has made it clear that there has been major population turnover during the Holocene, but this was not the null hypothesis in the 2000s. Looking at extant distributions of lineages can give one a distorted view of the past. Frankly, the 2009 Indian paper was egregious in this way because they included Turkic groups in their Central Asian data set. Even in 2009 there was a whole lot of evidence that Central Asian Turkic groups were likely very different from Indo-European Turanian populations which would have been the putative ancestors of Indo-Aryans. Honestly the authors either consciously loaded the dice to reduce the evidence for gene flow from Central Asia, or they were ignorant (the nature of the samples is much clearer in the supplements than the primary text for what it’s worth).

Second, Y chromosomal marker sets in the 2000s were constrained to fast mutating microsatellite regions or less than 100 variant SNPs on the Y. Because it is so repetitive the Y chromosome is hard to sequence, and it really took the technologies of the last ten years to get it done. Both the above papers estimate the coalescence of extant R1a1a lineages to be 10-15,000 years before the present. In particular, they suggest that European and South Asian lineages date back to this period, pushing back any possible connection between the groups, and making it possible that European R1a1a descended from a South Asian founder group which was expanding after the retreat of the ice sheets. The conclusions were not unreasonable based on the methods they had.  But now we have better methods.*

Whole genome sequencing of the Y, as well as ancient DNA, seems to falsify the above dates. Though microsatellites are good for very coarse grain phylogenetic inferences, one has to be very careful about them when looking at more fine grain population relationships (they are still useful in forensics to cheaply differentiate between individuals, since they accumulate variation very quickly). They mutate fast, and their clock may be erratic.

Additionally, diversity estimates were based on a subset of SNPs that were clearly not robust. R1a1a is not diverse anywhere, though basal lineages seem to be present in ancient DNA on the Pontic steppe in some cases.

To show how lacking in diversity R1a1a is, here are the results of a 2016 paper which performed whole genome sequencing on the Y. Instead of relying on the order of 10 to 100 SNPs, this paper discovered over 65,000 Y variants worldwide. Notice how little difference there is between different South Asian groups below, indicative of a massive population expansion relatively recently in time which didn’t even have time to exhibit regional population variation. They note that “The most striking are expansions within R1a-Z93 [the South Asian clade], ~4.0–4.5 kya. This time predates by a few centuries the collapse of the Indus Valley Civilization, associated by some with the historical migration of Indo-European speakers from the western steppes into the Indian sub-continent.”

(BEB = Bengali, GIH = Gujarati, PJL = Punjabi, STU = Sri Lanka Tamil, ITU = Indian Telugu)

The spatial distribution of Z93 lineages of R1a is as you can see to the left. There are branches in South Asia, Central Asia, and in the Altai region. Ancient DNA from Bronze Age Mongolia has found Z93. Modern Mongolians clearly have a small, but appreciable, fraction of West Eurasian ancestry. Some also carry R1a1a. Z93 has also been found in North-Central Asian steppe samples that date to ~4,500 years before the present.

Today with ancient DNA we’re discovering individuals who lived around the time of the massive expansion alluded to above. What are these individuals like? They are a mix of European, Central Eurasian, Near Eastern, and Siberian. Many of them share quite a bit of ancestry with South Asian populations, in particular those from the northwest of the subcontinent, as well as upper castes more generally.

A new paper using ancient DNA from Scythians (Iranian speakers) also shows that they carried Z93. Some of them had East Asian admixture. These were the ones from the eastern steppe. So not entirely surprising. In the supplements of the paper they have an admixture plot with many populations. At K = 15 in supplementary figure 14 you see many ancient Central Eurasian populations run against modern groups. At this K there is a South Asian modal cluster which is found in South Asians as well as nearby Iranian groups from Afghanistan.

It is not light green or dark blue. You can see that this salmon color is modal in tribal South Indian populations, or non-Brahmin South Indians. It drops in frequency as you move north and west, and as you move up the caste ladder. Observe that it is present even among the relatively isolated Kalash people of Chitral.

Outside of South Asia-Afghanistan, this salmon component is found among Thai and Cambodians. From talking to various researchers, and recent published findings, it seems clear that this signature is not spurious, but is indicative of some migration from South Asia to Southeast Asia in the historical period, as one might infer based on cultural affinities. It is also found at lower frequencies among the Uyghur of Xinjiang. This is not entirely surprising either. This region of the Tarim basin was connected to Kashmir across the Pamirs. The 4th century Buddhist monk from the Tarim basin city of Kucha, who was instrumental in the translation of texts into Chinese, Kumārajīva, may have had a Kashmiri father.

Even before Islam much of Northwest India and Central Asia were under the rule of the same polity, and after Islam there is extensive record of the enslavement of many Indians in the cities of the eastern Islamic world, as well as the travel of some Indian merchants and intellectuals into these regions.

And yet this South Asia cluster is not present in the ancient steppe samples carrying R1a1a-Z93. None of them to my knowledge. Many ancient samples share ancestry with South Asians. For example it seems that many ancient West Asian samples from Iran share common history as evident in genetic drift patterns with many South Asians. And, there is good evidence that a subset of South Asians, skewed toward northwest and upper caste groups, share drift with steppe Yamna samples. But South Asians are often clearly composites of these exogenous populations and an indigenous component with affinities with Andaman Islanders, and more distantly Southeast Asians and other eastern non-Africans.

How can you reconcile this with migration out of South Asia? The path is found in publications such as Genetic Evidence for Recent Population Mixture in India. Here you have a paper which models mixing between Ancestral North Indians (ANI) and Ancestral South Indians (ASI). The ANI would be the source population for the ancestry shared with West Eurasians. And, they would lack ASI ancestry because the mixing had not occurred. The admixture dates in the paper are between two and four thousand years before the present.

There is a problem though. These methods detect the last admixture events. Therefore, they are a lower bound on major mixing events, not a record of when there was no mixing. Secondarily, but not less importantly, recent work indicates that because of the pulse admixture simplification these methods likely underestimate the time period of admixture.

Another issue for me is the idea that ANI and ASI could be so separate within India. If ANI is the source of gene flow into other parts of Eurasia from South Asia, then I believe that ASI is intrusive to the subcontinent. I don’t think that ASI being intrusive is so implausible. Southeast Asia has undergone massive genetic changes over the Holocene, and it may be that there was much more ASI ancestry in places like Burma before the arrival of Austro-Asiatic rice farmers. The presence of Austro-Asiatic languages in northeast India and central India shows a precedent of migration from Southeast Asia into the subcontinent.

In sum, the balance of evidence suggests male mediated migration into South Asia from Central Asia on the order of ~4-5,000 years ago. There are lots of details to be worked out, and this is not an assured model in terms of data, but it is the most likely. In the near future ancient DNA will clear up confusions. Writing very long but confused comments just won’t change this state of affairs. New data will.

Addendum: Indian populations have finally been relatively well sampled, thanks to Mait Metspalu’s group in Estonia, David Reich’s lab, the Indian collaborators of both, and the 1000 Genomes (HGDP gave us Pakistanis). Additionally, Zack Ajmal’s Harappa website did some work filling in some holes in the early 2010s.

* A Facebook argument broke out about one of my posts where one interlocutor asserted that he leaned on papers from the late 2000s, not all the new stuff. That’s obviously because the new stuff did not support his preferred position, while the old stuff did. I would prefer that faster-than-light travel were possible, so I’ll just stick to physics before 1910?

Oxford Nanopore finally giving hope to biologist’s dreams

Filed under: Genomics,Oxford Nanopore,Sequencing — Razib Khan @ 5:48 pm

I don’t talk too much about genomic technology because it changes so fast. Being up-to-date on the latest machines and tools often requires regular deep-dives right now, though I believe at some point technological improvements will plateau as the data returned will be cheap and high quality enough that there won’t be much to gain on the margin.

Of course we’ve already come a long way. Fifteen years ago a “whole human genome” cost on the order of billions of dollars. Today a high quality whole human genome will run you on the order of $1,000. This is fundamentally a technology driven change, with big metal machines automatically generating reads and powerful computers to process them. One couldn’t imagine such a scenario 30 years ago because the technology wasn’t there.

I’ve stated before that I don’t think genomics fundamentally alters what we know and understand about evolution. At least so far. But it is a huge change in the domain of medicine. Clearly the human genomicists, especially Francis Collins, overhyped the yield of the technology in relation to healthcare in the 2000s. But with cheap and ubiquitous sequencing we may see the end of Mendelian diseases in our lifetime (through screening and possibly at some point CRISPR therapy).

This has been driven by technological innovation in the private sector around a few firms. The famous chart showing the massive decline in the cost of genomic sequencing over the past 15 years is due in large part to the successes of Illumina. But, Illumina has also had a quasi-monopoly on the field over the past five years (or more), and that shows with the leveling off of the decline in cost. Until the past year….

What gives? Many people believe that Illumina is moving again in part because a genuine challenger is emerging, or at least the flicker of a challenge, in the form of Oxford Nanopore. Oxford Nanopore has been around since 2005, but it really came into the public eye around 2010 or so. But like many tech companies it overpromised in the early years. I remember skeptically listening to a friend in the fall of 2011 talk about how quickly Nanopore was going to change the game…. I didn’t put too much stock into these sorts of presentations to hopeful researchers because I remember Pacific Biosciences making the same sort of pitch to amazed biologists in 2008. Pac Bio is still around, but has turned out to be a bit player, rather than a challenger to Illumina.

But I have to admit that Nanopore has really started to step up its game of late. Probably one of the major things it has accomplished is that it’s made us reimagine what sequencing technology should look like. Rather than refrigerators of various sizes, Oxford Nanopore allows us to imagine sequencing technology which exhibits a form factor more analogous to a USB thumb drive. The first time I saw a Nanopore machine in the flesh I knew intellectually what I was going to see…but because of my deep intuitions I still overlooked the two Nanopore machines lying on the workbench in front of me.

Despite their amazing form factor, these early Nanopore machines had limited application. They didn’t generate much data, and so were utilized by researchers who worked with smaller genomes. Scientists who worked with bacteria seem to have been using them a lot, for example. Additionally the machines were error prone and people were working out their kinks in real time in laboratories (one tech told me early on they were so small that he swore they were affected by ambient vibrations so he found ways to dampen that source of error).

A new preprint suggests we may be turning the corner though, Nanopore sequencing and assembly of a human genome with ultra-long reads:

Nanopore sequencing is a promising technique for genome sequencing due to its portability, ability to sequence long reads from single molecules, and to simultaneously assay DNA methylation. However until recently nanopore sequencing has been mainly applied to small genomes, due to the limited output attainable. We present nanopore sequencing and assembly of the GM12878 Utah/Ceph human reference genome generated using the Oxford Nanopore MinION and R9.4 version chemistry. We generated 91.2 Gb of sequence data (~30x theoretical coverage) from 39 flowcells. De novo assembly yielded a highly complete and contiguous assembly (NG50 ~3Mb). We observed considerable variability in homopolymeric tract resolution between different basecallers. The data permitted sensitive detection of both large structural variants and epigenetic modifications. Further we developed a new approach exploiting the long-read capability of this system and found that adding an additional 5x-coverage of “ultra-long” reads (read N50 of 99.7kb) more than doubled the assembly contiguity. Modelling the repeat structure of the human genome predicts extraordinarily contiguous assemblies may be possible using nanopore reads alone. Portable de novo sequencing of human genomes may be important for rapid point-of-care diagnosis of rare genetic diseases and cancer, and monitoring of cancer progression. The complete dataset including raw signal is available as an Amazon Web Services Open Dataset at: https://github.com/nanopore-wgs-consortium/NA12878.

30x just means that each base is typically sampled 30 times, so that you have a very accurate and precise read on its state. 30x has become the default standard in medical genomics. If Nanopore can do 30x on human genomes at reasonable cost it won’t be a niche player much longer.
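As a rough check on the "~30x theoretical coverage" figure, mean coverage is just total sequenced bases divided by genome size. The sketch below uses the 91.2 Gb and 39 flowcells quoted in the abstract. The two genome sizes (a round 3.0 Gb and a slightly larger 3.1 Gb) are my assumptions for illustration, not numbers from the preprint.

```python
def mean_coverage(total_bases: float, genome_size: float) -> float:
    """Average number of times each base is expected to be sequenced."""
    return total_bases / genome_size


# Figures quoted in the abstract.
SEQUENCED_BASES = 91.2e9
FLOWCELLS = 39

# Assumed genome sizes, for illustration only.
ROUND_GENOME = 3.0e9
LONGER_GENOME = 3.1e9

coverage_round = mean_coverage(SEQUENCED_BASES, ROUND_GENOME)
coverage_longer = mean_coverage(SEQUENCED_BASES, LONGER_GENOME)
gb_per_flowcell = SEQUENCED_BASES / FLOWCELLS / 1e9

print(f"coverage: {coverage_longer:.1f}x to {coverage_round:.1f}x")
print(f"yield per flowcell: {gb_per_flowcell:.2f} Gb")
```

Either assumed genome size lands close to 30x, and the per-flowcell yield gives a feel for why 39 flowcells were needed.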

The read length is important because last I checked the human genome still had large holes in it. The typical Illumina machine produces average read lengths in the low hundreds of base pairs. If you have large repetitive regions of the human genome (and you do have these), you’re never going to span them with such short yardsticks. Additionally, these short reads have to be tiled together when you assemble a genome from raw results, and this is a computationally really intensive task. It’s good when you have a reference genome you can align to as a scaffold. But researchers who don’t work on humans or model organisms may not have a good reference genome, or in many cases a reference genome at all.
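The abstract summarizes read length with an N50 (99.7kb for the "ultra-long" reads). A minimal sketch of how that statistic is usually computed: sort the lengths from longest to shortest and walk down until half of all bases are accounted for. NG50 is the same idea measured against an estimated genome size rather than the total of the lengths. The read lengths below are made up, and the function name is my own.

```python
def n50(lengths):
    """Length L such that items of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_of_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_of_total:
            return length


# Toy read lengths, not real data.
toy_reads = [100, 80, 50, 30, 20, 20]
print(n50(toy_reads))
```

Because the statistic is weighted by bases rather than by read count, a handful of very long reads can move it a lot, which is why it is a natural way to describe an "ultra-long" library.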

Pac Bio occupies a space where it provides really long reads for a high price point. Most of the time this isn’t necessary, but imagine you work on a disease which is caused by large repetitive regions. You are likely willing to pay the price that is asked. And because Pac Bio generates very long reads it makes de novo assembly much easier, as your algorithm has to tile together far fewer contiguous sequences, and long sequences are less likely to have lots of repetitive matches in the genome.

But Pac Bio machines are expensive and huge. In the abstract above it alludes to “Portable de novo sequencing of human genomes.” This is a huge deal. The dream, as whispered by some genomicists I have known, is that at a point in the future biologists would carry portable sequencers which would produce very long reads so that they could de novo assemble sequences on the spot. A concrete example might be a health inspector checking on the sorts of microbes found on the counter of a restaurant, or a field ecologist who might be sampling various fungi to discover cryptic species.

Obviously this is still a dream. The preprint above makes it clear that to do what they did required a lot of novel techniques and development of new tools. This isn’t beta technology, it’s early alpha. But because it’s 2017 the outlines of the dream are coming into public view.

Citation: Nanopore sequencing and assembly of a human genome with ultra-long reads
Miten Jain, Sergey Koren, Josh Quick, Arthur C Rand, Thomas A Sasani, John R Tyson, Andrew D Beggs, Alexander T Dilthey, Ian T Fiddes, Sunir Malla, Hannah Marriott, Karen H Miga, Tom Nieto, Justin O’Grady, Hugh E Olsen, Brent S Pedersen, Arang Rhie, Hollian Richardson, Aaron Quinlan, Terrance P Snutch, Louise Tee, Benedict Paten, Adam M. Phillippy, Jared T Simpson, Nicholas James Loman, Matthew Loose
bioRxiv 128835; doi: https://doi.org/10.1101/128835

April 19, 2017

Mouse fidelity comes down to the genes

Filed under: Genetics,Genomics,Human Genetics — Razib Khan @ 10:02 pm

While birds tend to be at least nominally monogamous, this is not the case with mammals. This strikes some people as strange because humans seem to be monogamous, at least socially, and often we take ourselves to be typically mammalian. But of course we’re not. Like many primates we’re visual creatures, rather than relying on smell and hearing. Obviously we’re also bipedal, which is not typical for mammals. And, our sociality scales up to massive agglomerations of individuals.

How monogamous we are is up for debate. Desmond Morris, who is well known to many from his roles in television documentaries, has been a major promoter of the idea that humans are monogamous, with a focus on pair-bonds. In contrast, other researchers have highlighted our polygamous tendencies. In The Mating Mind Geoffrey Miller argues for polygamy, and suggests that pair-bonds in a pre-modern environment were often temporary, rather than lifetime (Miller is now writing a book on polyamory).

The fact that in many societies high status males seem to engage in polygamy, despite monogamy being more common, is one phenomenon which confounds attempts to quickly generalize about the disposition of our species. What is preferred may not always be what is practiced, and the external social adherence to norms may be quite violated in private.

Adducing behavior is simpler in many other organisms, because their range of behavior is more delimited. When it comes to studying mating patterns in mammals voles have long been of interest as a model. There are vole species which are monogamous, and others which are not. Comparing the diverged lineages could presumably give insight as to the evolutionary genetic pathways relevant to the differences.

But North American deer mice, Peromyscus, may turn out to be an even better bet: there are two lineages which exhibit different mating patterns which are phylogenetically close enough to the point where they can interbreed. That is crucial, because it allows one to generate crosses and see how the characteristics distribute themselves across subsequent generations. Basically, it allows for genetic analysis.

And that’s what a new paper in Nature does, The genetic basis of parental care evolution in monogamous mice. In figure 3 you can see the distribution of behaviors in parental generations, F1 hybrids, and the F2, which is a cross of F1 individuals. The widespread distribution of F2 individuals is likely indicative of a polygenic architecture of the traits. Additionally, they found that some traits are correlated with each other in the F2 generation (probably due to pleiotropy, the same gene having multiple effects), while others were independent.

With the F2 generation they ran a genetic analysis which looked for associations between traits and regions of the genome. They found 12 quantitative trait loci (QTLs), basically zones of the genome associated with variation on one or more of the six traits. From this analysis they immediately realized there was sexual dimorphism in terms of the genetic architecture; the same locus might have a different effect in the opposite sex. This is evolutionarily interesting.

Because the QTLs are rather large in terms of physical genomic units the authors looked to see which were plausible candidates in terms of function. One of their hits was vasopressin, which should be familiar to many from vole work, as well as some human studies. Though the QTL work as well as their pup-switching experiment (which I did not describe) is persuasive, the fact that a gene you’d expect shows up as a candidate really makes it an open and shut case.

The extent of the variation explained by any given QTL seems modest. In the extended figures you can see it’s mostly in the 1 to 5 percent range. In Carl Zimmer’s excellent write up he ends:

But Dr. Bendesky cautioned that the vasopressin gene would probably turn out to be just one of many that influence oldfield mice. Though it is strongly linked to parental behavior, the vasopressin gene accounts for 6.7 percent of the variation in nest building among males, and only 2.9 percent among females.

The genetic landscape of human parenting will turn out to be even more rugged, Dr. Bendesky predicted.

“You cannot do a 23andMe test and find out if your partner is going to be a good father,” he said.

Sort of. The genetic architecture above is polygenic…but not incredibly diffuse. The proportion of variation explained by the largest effect allele is more than for height, and far more than for education. If human research follows up on this, I wouldn’t be surprised if you could develop a polygenic risk score.
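To get a feel for what "6.7 percent of the variation" means in effect-size terms, here is a back-of-the-envelope sketch. It assumes a purely additive locus in an F2 cross, with genotype values -a, 0, +a at frequencies 1/4, 1/2, 1/4, so the variance contributed by the locus is a²/2. That model is my simplification, not the paper's analysis, and the function name is hypothetical.

```python
import math


def additive_effect_in_sd(variance_explained: float) -> float:
    """Additive effect a, in phenotypic SD units, under the assumed F2 model.

    With genotype values -a, 0, +a at frequencies 1/4, 1/2, 1/4 the locus
    variance is a**2 / 2, so a = sqrt(2 * variance_explained).
    """
    if not 0.0 <= variance_explained <= 1.0:
        raise ValueError("variance_explained must be a proportion in [0, 1]")
    return math.sqrt(2.0 * variance_explained)


# Proportions quoted from the write-up for nest building.
male_effect = additive_effect_in_sd(0.067)
female_effect = additive_effect_in_sd(0.029)

print(f"males: about {male_effect:.2f} SD per allele")
print(f"females: about {female_effect:.2f} SD per allele")
```

Under those assumptions the two homozygotes differ by 2a, so a locus explaining a few percent of the variance still separates them by a sizeable fraction of a standard deviation.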

But I don’t have a good intuition on how much variation in humans there really is for these sorts of traits that are heritable. I assume some. But I don’t know how much. And how much of the variance in behavior might be explained by human QTLs? Humans don’t lick or build nests, or retrieve pups. Also, as one knows from Genetics and Analysis of Quantitative Traits sexually dimorphic traits take a long time to evolve. These are two deer mice species. Within humans there may not have been enough time for this sort of heritable complexity of behavior to evolve.

There are a lot of philosophical issues here about translating to a human context.

Nevertheless, this research shows that ingenious animal models can powerfully elucidate the biological basis of behavior.

Citation: The genetic basis of parental care evolution in monogamous mice. Nature (2017) doi:10.1038/nature22074

April 15, 2017

Genetic variation in human populations and individuals

Filed under: Genetics,Genomics,Human Genetics,Polymorphisms,SNPs — Razib Khan @ 9:25 pm


I’m old enough to remember when we didn’t have a good sense of how many genes humans had. I vaguely recall numbers around 100,000 at first, which in hindsight seems rather like a round and large number. A guess. Then it went to 40,000 in the early 2000s and then further until it converged to some number just below 20,000.

But perhaps more fascinating is that we have a much better catalog of the variation across the whole human genome now. Often friends ask me questions of the form: “so DTC genomic company X has about 800,000 SNPs, is that enough to do much?” To answer such a question you need some basic numbers in your head, as well as what you want to “do.”

First, the human genome has about 3 billion base pairs (3 Gb). That’s a lot. But most of the genome famously doesn’t code for proteins. The exome, the proportion of the genome where bases directly translate into a protein accounts for 1% of the whole genome. That’s 30 million bases (30 Mb). But this small region of the genome is very important, as the vast majority of major disease mutations are found in the exome.

When it comes to a standard 800K SNP chip, which samples 800,000 positions across the 3 Gb genome, it is likely that the designers enriched the marker set for functional positions relevant to diseases. Not all marker positions are created equal. Though even outside of those functional positions there are often nearby SNPs that can “tag” them, so you can infer one from the state of the other.

But are 800,000 positions enough to make good ancestry inferences, to give one example? Yes. 800,000 is actually a substantial proportion of the polymorphism in any given genome. There have been some papers which improved on the numbers in 2015’s A global reference for human genetic variation, but it’s still a good comprehensive review to get an order-of-magnitude sense. The table below gives you a sense of individual variation:

Median autosomal variant sites per genome

When it comes to single nucleotide polymorphisms (SNPs), what SNP chips are getting at, an 800K array should get a substantial proportion of your genome-wide variation. More than enough for ancestry inference or forensics. The singleton column shows mutations specific to the individual. When focusing on new mutations specific to an individual that might cause disease, singleton large deletions and nonsynonymous SNPs are really where I’d look.

But what about whole populations? The plot to the left shows the count of variants as a function of alternative allele frequency. When we say “SNP”, we really mean variants which exhibit polymorphism at a particular cut-off frequency for the minor allele (often 1%). It is clear that as the minor allele frequency increases in relation to the human reference genome the number of variants decreases.

From the paper:

The majority of variants in the data set are rare: ~64 million autosomal variants have a frequency <0.5%, ~12 million have a frequency between 0.5% and 5%, and only ~8 million have a frequency >5% (Extended Data Fig. 3a). Nevertheless, the majority of variants observed in a single genome are common: just 40,000 to 200,000 of the variants in a typical genome (1–4%) have a frequency <0.5% (Fig. 1c and Extended Data Fig. 3b). As such, we estimate that improved rare variant discovery by deep sequencing our entire sample would at least double the total number of variants in our sample but increase the number of variants in a typical genome by only ~20,000 to 60,000.

An 800K SNP chip will be biased toward the 8 million or so variants with a frequency above 5%. This number gives you a sense of the limited scope of variation in the human genome. Those 8 million sites are only 0.27% of the genome, yet they capture a lot of the polymorphism.
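
The arithmetic behind these figures is easy to check. Here is a minimal sketch in Python using only the round numbers quoted above; the variable names are mine, and the last ratio assumes (generously) that every chip marker is one of the common variants.

```python
# Round numbers quoted in the post; variable names are illustrative.
GENOME_BP = 3_000_000_000        # ~3 Gb human genome
EXOME_FRACTION = 0.01            # exome is ~1% of the genome
CHIP_SNPS = 800_000              # markers on a typical DTC SNP chip
COMMON_VARIANTS = 8_000_000      # variants with frequency >5% (1000 Genomes)

exome_bp = GENOME_BP * EXOME_FRACTION
common_share_of_genome = COMMON_VARIANTS / GENOME_BP
# Upper bound: assumes every chip marker is a common variant.
chip_share_of_common = CHIP_SNPS / COMMON_VARIANTS

print(f"Exome size: {exome_bp / 1e6:.0f} Mb")
print(f"Common variants as share of genome: {common_share_of_genome:.2%}")
print(f"Chip markers as share of common variants: {chip_share_of_common:.0%}")
```

So, on these rough numbers, a chip could in principle type about one in ten of the common variable sites directly, before counting anything it “tags” through nearby SNPs.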

Citation: 1000 Genomes Project Consortium. “A global reference for human genetic variation.” Nature 526.7571 (2015): 68-74.

April 7, 2017

Why humans have so many pulse admixtures

Filed under: Admixture,Evolution,Genetics,Genomics — Razib Khan @ 5:38 pm

The Blank Slate is one of my favorite books (though I’d say The Language Instinct is unjustly overshadowed by it). There is obviously a substantial biological basis in human behavior which is mediated by genetics. When The Blank Slate came out in the early 2000s one could envisage a situation in 2017 when empirically informed realism dominated the intellectual landscape. But that was not to be. In many ways, for example in sex differences, we’ve gone backward, while there is still undue overemphasis in our society on the environmental impact parents have on children (as opposed to society more broadly).

But genes do not determine everything, obviously. Several years after reading The Blank Slate I read Not by Genes Alone: How Culture Transformed Human Evolution. In this work Peter Richerson and Robert Boyd outline their decades long project of modeling cultural variation and evolution formally in a manner reminiscent of biological evolution. Richerson and Boyd’s program does not start from a “blank slate” assumption. Rather, it is focused on broad macro-social dynamics where cultural variation “swamps” out biological variation.

Recall that in classic population genetic theory a major problem with group level selection is that gene flow between adjacent groups quickly removes between group variation. One migrant between two groups per generation is enough for them not to diverge genetically. For group selection to occur the selective effect has to be very strong or the between group difference has to be very high. Rather than talking about genetics though, where the debate is still live, and the majority consensus is still that biological group selection is not that common (depending on how you define it), let’s talk about human culture.
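
The “one migrant per generation” rule of thumb comes out of Wright’s island model, where equilibrium differentiation is approximately F_ST ≈ 1 / (4Nm + 1), with N the deme size and m the migration rate, so that Nm is the number of migrants per generation. A small sketch follows; the function name is mine, and the formula assumes an idealized island model at equilibrium, which real populations only loosely resemble.

```python
def island_model_fst(migrants_per_generation: float) -> float:
    """Equilibrium F_ST under Wright's island model: 1 / (4*Nm + 1)."""
    return 1.0 / (4.0 * migrants_per_generation + 1.0)

# Differentiation depends on the number of migrants (Nm), not on deme size alone.
for nm in (0.1, 0.5, 1, 5, 10):
    print(f"Nm = {nm:>4}: F_ST ~ {island_model_fst(nm):.3f}")
```

In this model one migrant per generation caps equilibrium F_ST at about 0.2 whatever the deme size, and a handful of migrants pushes it toward zero, which is why between-group genetic variation is so hard to maintain.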

Here the group level differences are extreme and the boundaries can be sharp. Historically it seems likely that most groups which were adjacent to each other looked rather similar because of gene flow and similar selective pressures. Even though in medieval Spain there was a generality, probably true, that Muslims were swarthier than Christians*, there was a palpable danger in battle of identifying friend from foe because the two groups overlapped too much in appearance.

This brings up how one might delineate differences culturally. In battle opposing armies wear distinct uniforms and colors so that the distinction can be made. But obviously one can change uniforms surreptitiously (perhaps taking the garb from the enemy dead). This is why physical adornments such as tattoos are useful, as they are “hard to fake.” Perhaps the clearest illustration of this dynamic is the Biblical story for the origin of the term shibboleth. Even slight differences in accent are clear to all, and often difficult to mimic once in adulthood.

Biological evolution mediated through genes is relatively slow and constrained compared to cultural evolution. Whole regions of central and northern Europe shifted from adherence to Roman Catholicism to forms of Protestantism on the order of 10 years. Of course religion is an aspect of culture where change can happen very rapidly, but even language shifts can occur in only a few generations (e.g., the decline of regional German and Italian dialects in the face of standard forms of the language).

Cultural evolution as a formally modeled neofunctionalism is credibly outlined in works such as Peter Turchin’s Ultrasociety: How 10,000 Years of War Made Humans the Greatest Cooperators on Earth. That’s not what I want to focus on here. Rather, I contend that the reality of massive pulse admixtures evident in the human genome over the past 10,000 years, at minimum, is a function of the fact that human cultural evolutionary processes result in winner-take-all genetic consequences.

A concrete example of what I’m talking about would compare the peoples of the Italian peninsula and the Iberian peninsula around 1500. The two populations are not that different genetically, and up to that point shared many cultural traits (and continue to do so). But, a combination of geography and history resulted in Iberian demographic expansion in the several hundred years after 1500, whereby today there are probably many more descendants of Iberians than Italians. This is not a function of any deep genetic difference between the two groups. There aren’t deep genetic differences in fact. Rather, the social and demographic forces which propelled Iberia to imperial status redounded upon the demographic production of Iberians in the future. In addition, the New World underwent a massive pulse admixture between Iberians and native Amerindians, as well as Africans, usually brought over as slaves, due to the cultural and political history of the period.

The pulse admixture question is rather interesting academically. To some extent current methods are biased toward detection of pulse admixtures, and even fit continuous gene flow as pulse admixtures. A rapid burst of gene flow, followed by recombination breaking apart the associations between markers on ancestrally informative haplotypes, is something you can test for. But I think we can agree that the gene flow triggered by the Columbian Exchange was a pulse admixture, and there’s too much concurrent evidence from uniparental lineage turnover in the ancient DNA to dismiss the non-historically corroborated signatures of pulses as simply artifacts.
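
To make “something you can test for” concrete: after a single pulse, admixture-induced linkage disequilibrium between markers is expected to decay roughly as exp(-t·d), where t is the number of generations since admixture and d is the genetic distance in Morgans. The toy sketch below recovers t from a noise-free synthetic decay curve. The function names and the 20-generation scenario are my own illustrative assumptions; real dating methods also have to deal with noise, background LD, and finite samples.

```python
import math

def expected_admixture_ld(distance_morgans, generations, amplitude=1.0):
    """Expected admixture LD at a given genetic distance after one pulse."""
    return amplitude * math.exp(-generations * distance_morgans)

def estimate_generations(distances, ld_values):
    """Least-squares slope of log(LD) against distance; slope = -generations."""
    logs = [math.log(v) for v in ld_values]
    n = len(distances)
    mean_d = sum(distances) / n
    mean_l = sum(logs) / n
    cov = sum((d - mean_d) * (l - mean_l) for d, l in zip(distances, logs))
    var = sum((d - mean_d) ** 2 for d in distances)
    return -cov / var

# Synthetic curve: a hypothetical pulse 20 generations ago,
# sampled at genetic distances from 0.5 cM to 20 cM.
distances = [0.005 * i for i in range(1, 41)]
ld_curve = [expected_admixture_ld(d, generations=20) for d in distances]
estimate = estimate_generations(distances, ld_curve)
print(f"Estimated generations since admixture: {estimate:.1f}")
```

Continuous gene flow would instead produce a mixture of exponentials with different rates, which a single-pulse fit will still happily summarize with one date. That is one way to see the bias toward pulses mentioned above.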

Nevertheless continuous gene flow does occur. That is, normal exchange of individuals between neighboring demes as a slow simmer over time. But the idea that we are a clinal ring species or something like that isn’t right in my opinion. Part of the story is strong geographical barriers. But another major part is that cultural revolutions and advantages introduce huge short-term demographic advantages to particular groups, and the shake out of inter-group competition can be dramatic.

Therefore, I make a prediction: the more cultural evolutionary dynamics a species is subject to, the more pulse admixture you’ll be able to detect. For example, pulse admixture should be more important in social insects than their solitary relatives.

* Not only was some of the ancestry of Muslims North African, Muslim rule was longest in the southern and southeastern regions, where people were not as fair as in the north.

April 4, 2017

Sex bias in migration from the steppe (revisited)

Filed under: Anthroplogy,Genetics,Genomics,History — Razib Khan @ 11:21 pm

Last fall I blogged a preprint which eventually came out as a paper in PNAS, Ancient X chromosomes reveal contrasting sex bias in Neolithic and Bronze Age Eurasian migrations. The upshot is that the authors found that there was far less steppe ancestry on the X chromosomes of Bronze Age Central Europeans than across the whole genome. The natural inference here is that you had migrations of males into territory where they had to find local wives.

But the story does not end there. Iosif Lazaridis and David Reich have put out a short note on bioRxiv, Failure to Replicate a Genetic Signal for Sex Bias in the Steppe Migration into Central Europe. It’s short, so I suggest you read the note yourself, but the major issue seems to be that on X chromosomes ADMIXTURE in supervised mode seems to behave really strangely. Lazaridis and Reich find that there seems to be a downward bias of steppe ancestry. Ergo, the finding was an artifact.

Goldberg et al. almost immediately responded, Reply To Lazaridis And Reich: Robust Model-Based Inference Of Male-Biased Admixture During Bronze Age Migration From The Pontic-Caspian Steppe. Their response seems to be that yes, ADMIXTURE does behave strangely, but the overall finding is still robust.

With these uncertainties I do wonder if it’s hard at this point to evaluate the alternative models. But, we do have archaeology and mtDNA. What do those say? On that basis, from what little I know, I am inclined to suspect a strong male bias of migration.

Citation: Reply To Lazaridis And Reich: Robust Model-Based Inference Of Male-Biased Admixture During Bronze Age Migration From The Pontic-Caspian Steppe, Amy Goldberg, Torsten Gunther, Noah A Rosenberg, Mattias Jakobsson
bioRxiv 122218; doi: https://doi.org/10.1101/122218

Citation: Failure to Replicate a Genetic Signal for Sex Bias in the Steppe Migration into Central Europe, Iosif Lazaridis, David Reich, bioRxiv 114124; doi: https://doi.org/10.1101/114124

How Tibetans can function at high altitudes

Filed under: Altitude Adaptation,Evolution,Genetics,Genomics,Human Evolution,Tibetans — Razib Khan @ 11:10 am


About seven years ago I wrote two posts about how Tibetans manage to function at very high altitudes. And it’s not just physiological functioning, that is, fitness straightforwardly understood. High altitudes can cause a sharp reduction in reproductive fitness because women cannot carry pregnancies to term. In other words, high altitude is a very strong selection pressure. You adapt, or you die off.

For me there have been two things of note since those original papers came out. First, one of those loci seems to have been introgressed from a Denisovan genetic background. I want to be careful here, because the initial admixture event may not have been into the Tibetans proper, but earlier hunter-gatherers who descend from Out of Africa groups, who were assimilated into the Tibetans as they expanded 5-10,000 years ago. Second, it turns out that dogs have also been targeted for selection on EPAS1 (the “Denisovan” introgression) for altitude adaptation.

This shows that in mammals at least there are a few genes which show up again and again. The fact that EPAS1 and EGLN1 were hits on relatively small sample sizes also reinforces their powerful effect. When the EPAS1 results initially came out they were highlighted as the strongest and fastest instance of natural selection in human evolutionary history. One can quibble about whether this was literally true, but that it was a powerful selective event no one could deny.

A new paper in PNAS, Genetic signatures of high-altitude adaptation in Tibetans, revisits the earlier results with a much larger sample size (the research group is in China) comparing Han Chinese and Tibetans. They confirm the earlier results, but, they also find other loci which seem likely targets of selection in Tibetans. Below is the list:

SNP          A1  A2  Freq. of A1 (Tibetan)  Freq. of A1 (EAS, Han)  P value   FST    Nearest gene
rs1801133    A   G   0.238                  0.333                   6.30E-09  0.021  MTHFR
rs71673426   C   T   0.102                  0.013                   1.50E-08  0.1    RAP1A
rs78720557   A   T   0.498                  0.201                   4.70E-08  0.191  NEK7
rs78561501   A   G   0.599                  0.135                   6.10E-15  0.414  EGLN1
rs116611511  G   A   0.447                  0.003                   3.60E-19  0.57   EPAS1
rs2584462    G   A   0.211                  0.549                   3.90E-09  0.203  ADH7
rs4498258    T   A   0.586                  0.287                   1.70E-08  0.171  FGF10
rs9275281    G   A   0.095                  0.365                   1.10E-10  0.162  HLA-DQB1
rs139129572  GA  G   0.316                  0.449                   5.80E-09  0.036  HCAR2
P value indicates the P value from the MLMA-LOCO analysis. FST is the FST value between Tibetans and EASs. Nearest gene indicates the nearest annotated gene to the top differentiated SNP at each locus except EGLN1, which is known to be associated with high-altitude adaptation; rs139129572 is an insertion SNP with two alleles: GA and G. A1, allele 1; A2, allele 2.

Many of these genes are familiar. Observe the allele frequency differences between the Tibetans and other East Asians (mostly Han). The sample sizes are on the order of thousands, and the SNP-chip had nearly 300,000 markers. What they found was that the between population Fst of Han to Tibetan was ~0.01. So only 1% of the SNP variance in their data was partitioned between the two groups. These alleles are huge outliers.
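
To see how extreme these loci are against the ~0.01 genome-wide background, one can compute a rough per-SNP F_ST straight from the two frequencies in the table. The sketch below uses the large-sample form of Hudson’s estimator, (p1 − p2)² / (p1(1 − p2) + p2(1 − p1)). That choice is my assumption and is not necessarily the estimator used in the paper, so the values will not match the table’s FST column exactly.

```python
# Allele frequencies (Tibetan, Han) copied from the table above.
FREQS = {
    "MTHFR": (0.238, 0.333),
    "RAP1A": (0.102, 0.013),
    "NEK7": (0.498, 0.201),
    "EGLN1": (0.599, 0.135),
    "EPAS1": (0.447, 0.003),
    "ADH7": (0.211, 0.549),
    "FGF10": (0.586, 0.287),
    "HLA-DQB1": (0.095, 0.365),
    "HCAR2": (0.316, 0.449),
}
GENOME_WIDE_FST = 0.01  # approximate Han-Tibetan background quoted in the post

def hudson_fst(p1, p2):
    """Large-sample Hudson F_ST for one biallelic SNP (no sample-size correction)."""
    numerator = (p1 - p2) ** 2
    denominator = p1 * (1 - p2) + p2 * (1 - p1)
    return numerator / denominator

per_snp_fst = {gene: hudson_fst(*freqs) for gene, freqs in FREQS.items()}
# Print loci from most to least differentiated.
for gene, fst in sorted(per_snp_fst.items(), key=lambda kv: -kv[1]):
    print(f"{gene:<9} F_ST ~ {fst:.3f} ({fst / GENOME_WIDE_FST:.0f}x background)")
```

Even with this crude estimator every locus sits above the background, and EPAS1 and EGLN1 come out tens of times higher.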

The authors used some sophisticated statistical methods to correct for exigencies of population structure, drift, admixture, etc., to converge upon these hits, but even through inspection the deviation on these alleles is clear. And as they note in the paper it isn’t clear all of these genes are selected simply for hypoxia adaptation. MTHFR, which is quite often a signal of selection, may have something to do with folate production (higher altitudes have more UV). ADH7 is part of a set of genes which always seem to be under selection, and HLA is never a surprise.

Rather than get caught up in the details it is important to note here that expansion into novel habitats results in lots of changes in populations, so that two groups can diverge quite fast on functional characteristics. The PCA makes it clear that Tibetans and Han have very little West Eurasian admixture, and the Fst based analysis puts their divergence on the order of 5,000 years before the present. The authors admit honestly that this is probably a lower bound value, but I also think it is quite likely that Tibetans, and probably Han too, are compound populations, and a simple bifurcation model from a common ancestral population is probably shaving away too many realistic edges. In plainer language, there has been gene flow between Han and Tibetans probably <5,000 years ago, and Tibetans themselves probably assimilated more deeply diverged populations in the highlands as they expanded as agriculturalists. A single divergence estimate quite often forces a complex history into too simple a model.

The take home: understanding population history is probably important to get a better sense of the dynamics of adaptation.

Citation: Jian Yang, Zi-Bing Jin, Jie Chen, Xiu-Feng Huang, Xiao-Man Li, Yuan-Bo Liang, Jian-Yang Mao, Xin Chen, Zhili Zheng, Andrew Bakshi, Dong-Dong Zheng, Mei-Qin Zheng, Naomi R. Wray, Peter M. Visscher, Fan Lu, and Jia Qu, Genetic signatures of high-altitude adaptation in Tibetans, PNAS 2017 ; published ahead of print April 3, 2017, doi:10.1073/pnas.1617042114

April 2, 2017

The future shall, and should, be sequenced

Filed under: Genomics,GWAS,Human Genetics — Razib Khan @ 10:32 pm

Last fall I talked about a preprint, Human demographic history impacts genetic risk prediction across diverse populations. It’s now published in AJHG, with the same informative title, Human Demographic History Impacts Genetic Risk Prediction across Diverse Populations. Even though I talked about this before, I thought it would be useful to highlight again.

To recap, GWAS is a pretty big deal, but only in the last 15 years or so. With genome-wide data researchers began to explore associations between diseases and population genetic variation. In some cases they discovered strong associations between characteristics and genetic variants, but in many cases it turned out that though a trait is highly heritable (e.g., schizophrenia) the causal variants are either not common or do not explain much of the variation in the population (or both).

But as the second decade of GWAS proceeds the sample sizes are getting larger, and researchers are moving from SNP-chips, with their various biases, to high quality whole-genome sequences. One of the major sorts of low hanging fruit in the minds of many people are rare variants. Basically SNP-chips are geared toward finding common variations within large populations, since they have a finite number of markers they are going to interrogate. Sequencing though is a comprehensive catalog of the genome in a relative sense. If you have high coverage (so you sample the site many times) you can easily discover rare mutations within an individual genome that make them distinct from almost all the rest of the human race (these may be de novo mutations, or, they could be mutations private to their extended pedigree).

But context matters. Martin et al. find that confirmed GWAS hits in Europeans tend to exhibit decreased portability as a function of genetic distance. This isn’t entirely surprising, especially if rarer variants are part of the explanation. Rare variants usually emerged later in history, after the differentiation between geographic races.

A solution would be to have a diverse panel of populations in your studies. For many reasons this was not to be. Northwest Europeans are enormously enriched in current data sets. Martin et al. observe that recently this has diminished somewhat, from 95% European to less than 80%. But they observe that this is mostly due to the inclusion of “Asian” samples, as opposed to African and Native Americans, who remain as underrepresented as they were several years ago.

The African and Native American samples present somewhat different problems. The Native American groups are quite drifted due to bottlenecks. Likely they have their own variants due to the combined effects of mutation and selection through 15 to 20,000 years of isolation from other human populations. In contrast, the African groups have lots of diversity with a high time depth due to their ancestral histories, which are less subject to bottleneck effects. The prediction ability into Africans of current GWAS looks to be rather pathetic. This is reasonable because their diversity is poorly captured in Eurocentric study designs, and, they are more genetically diverged from Europeans than Asians are.

Ultimately I think, and hope, this portability question will be of short term utility. As sequencing gets cheap, and studies become more numerous, we’ll fill in the gaps of understudied populations. Finally, ethics is above my paygrade, but I do hope those who demand a strenuous bar on consent keep in mind that that will result in slower growth of these study populations. Academics want to do a good job, but they also want to stay on the good side of IRB.

Citation: Martin, Alicia R., et al. “Human demographic history impacts genetic risk prediction across diverse populations.” bioRxiv (2016): 070797.

March 27, 2017

Adaptation is ancient: the story of Duffy

Filed under: Duffy allele,Duffy antigen,Genomics,History,Malaria — Razib Khan @ 10:06 pm

Anyone with a passing familiarity with human population genetics will know of the Duffy system, and the fact that there is a huge difference between Sub-Saharan Africans and other populations on this locus. Specifically, the classical Duffy allele exhibits a nearly disjoint distribution from Africa to non-Africa. It was naturally one of the illustrations in The Genetics of Human Populations, a classic textbook from the 1960s.

Today we know a lot more about human variation. On most alleles we don’t see such sharp distinctions. Almost certainly the detection of these very differentiated alleles early on in human genetics was partly a function of selection bias. The methods, techniques, and samples, were underpowered and limited, so only the largest differences would be visible. Today we often use single base pair variations, single nucleotide polymorphisms, and the frequency differences are much more modest on average. Ergo, the reality that only a minority of genetic variation is partitioned across geographic races.

Why is Duffy different? Obviously it could be random. Assuming you have a polymorphism, you’ll get a range of frequencies across populations, and in some cases those frequencies will map onto different geographic zones just by chance. Imagine constant mutation, and highly structured bottlenecks. You could get a sequence of derived mutations fixing in populations one after the other, just by chance.

This is probably not the case with Duffy. I’ll quote from Wikipedia:

The Duffy antigen is located on the surface of red blood cells, and is named after the patient in which it was discovered. The protein encoded by this gene is a glycosylated membrane protein and a non-specific receptor for several chemokines. The protein is also the receptor for the human malarial parasites Plasmodium vivax and Plasmodium knowlesi. Polymorphisms in this gene are the basis of the Duffy blood group system.

Malaria is one of the strongest selection pressures known to humanity. The balancing selection which results in sickle-cell disease is well known even among the general public. But the likely selection pressures due to the vivax variety are less commonly talked about, partly because they don’t induce a serious disease as a side-effect. Duffy may be canonical if you are a human population geneticist, but it is of less interest more generally.

But a recent paper in PLOS GENETICS shows just how dynamic the evolutionary genetic past of our species was, through the lens of the Duffy system, Population genetic analysis of the DARC locus (Duffy) reveals adaptation from standing variation associated with malaria resistance in humans. Here’s the author summary:

Infectious diseases have undoubtedly played an important role in ancient and modern human history. Yet, there are relatively few regions of the genome involved in resistance to pathogens that show a strong selection signal in current genome-wide searches for this kind of signal. We revisit the evolutionary history of a gene associated with resistance to the most common malaria-causing parasite, Plasmodium vivax, and show that it is one of regions of the human genome that has been under strongest selective pressure in our evolutionary history (selection coefficient: 4.3%). Our results are consistent with a complex evolutionary history of the locus involving selection on a mutation that was at a very low frequency in the ancestral African population (standing variation) and subsequent differentiation between European, Asian and African populations.

Why is it that regions of the genome subject to selection due to co-evolution with pathogens are hard to detect in scans for selection? My response would be that it’s because selection and adaptation are always happening in these regions, constantly erasing their own footprints.

You may be familiar with the fact that the major histocompatibility complex (MHC) is one of the most diverse regions of the genome. That’s because negative frequency dependent selection makes it so that rare variants never go extinct, as the rarer they get the more favored they are.
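
A toy deterministic model makes the logic visible. Assume, purely for illustration, two alleles in a haploid population where each allele’s fitness is 1 + s·(1 − its own frequency), so whichever allele is rarer is always the fitter one. The fitness function and the value of s are my assumptions, not anything measured at the MHC, and the model ignores drift.

```python
def next_frequency(p, s=0.1):
    """One generation of negative frequency-dependent selection on allele A.

    Fitness of each allele rises as it gets rarer: w = 1 + s * (1 - frequency).
    """
    w_a = 1 + s * (1 - p)      # fitness of allele A (frequency p)
    w_b = 1 + s * p            # fitness of allele B (frequency 1 - p)
    mean_w = p * w_a + (1 - p) * w_b
    return p * w_a / mean_w

def trajectory(p0, generations, s=0.1):
    """Allele A frequency over time, starting from p0."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_frequency(freqs[-1], s))
    return freqs

rare = trajectory(0.01, 300)
common = trajectory(0.99, 300)
print(f"Start 0.01 -> after 300 generations: {rare[-1]:.3f}")
print(f"Start 0.99 -> after 300 generations: {common[-1]:.3f}")
```

Whether the allele starts rare or common, it is pulled back toward an intermediate frequency and neither variant is lost, which is the sense in which this kind of selection preserves diversity.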

Many classical and modern techniques for detecting selection assume less protean dynamics in the model they attempt to detect. Basically, many of the standard selection detection methods are looking for a simple perturbation in the pattern of variation that’s expected. A strong powerful recent sweep on a single mutation is like the spherical cow of evolutionary genetics. It happens. And it’s easy to model and detect. But it may not be nearly as important as our ability to detect these “hard sweeps” may suggest to us.

In contrast, if selection targets a larger number of independent mutations, then you get a “soft sweep,” which is harder to detect, because it is no singular event. Complexity is the enemy of detection. As a thought experiment, if you selected for height within a population you may catch some large effect alleles that would leave strong signals, but most of the dynamic would leave a polygenic footprint, distributed across innumerable genes.

The Duffy locus is somewhat in the middle. The authors distinguish between selection on standing variation (the allele frequency is higher than a single new mutation within the population) and a soft sweep, where multiple variants against different haplotypes are subject to selection. Their models and results strongly support selection on standing variation for the FY*O variant, and perhaps selection for the FY*A variant.

These selection events were very old, and very strong. Selection coefficients on the order of 4% are hard to believe in a natural environment. Curiously the coalescence times for the haplotypes of some of these alleles indicate that selection was contemporaneous with the emergence of modern humans out of Africa, ~50,000 years ago. From their sequence data analysis the different alleles have been segregating for a long time in the collective human population, and powerful sweeps fixed FY*O in both the ancestors of the Bantu and Pygmies before they diverged from each other. In contrast the Khoisan samples suggest that FY*O introgressed into their population from newcomers, while variants of FY*A are ancestral.
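
To get a feel for what a selection coefficient around 4% means, here is a deterministic sketch of a favored allele’s trajectory under simple genic selection, p' = p(1 + s) / (1 + s·p). The starting frequency, the target frequency, and the 25-year generation time are illustrative assumptions of mine, not values from the paper, and the model ignores drift, dominance, and population structure.

```python
def generations_to_reach(p_start, p_target, s):
    """Count generations for a favored allele to rise from p_start to p_target."""
    p, generations = p_start, 0
    while p < p_target:
        p = p * (1 + s) / (1 + s * p)   # genic selection recursion
        generations += 1
    return generations

S = 0.043  # selection coefficient quoted in the author summary
gens = generations_to_reach(p_start=0.001, p_target=0.99, s=S)
years = gens * 25  # assumed generation time
print(f"0.1% -> 99% takes about {gens} generations (~{years:,} years)")
```

Under these assumptions the sweep completes in a few hundred generations, a blink on the timescale of human evolution.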

The big picture here is that selection is ancient, that it is powerful, and it was a dynamic even before our species diversified into various lineages.

If you read the paper, and you should, it’s pretty clear that a lot of the adaptive story was suspected. It’s just with modern genomics and fancy ABC methods you can put point estimates and intervals on these hunches. But another issue, as they note in the piece, is that we have a better grasp of African population structure today than in the past, and this allows for better framing.

But it is here I have some caution to throw. At one point citing a 2012 paper the authors suggest “The KhoeSan peoples are a highly diverse set of southern African populations that diverged from all other populations approximately 100 kya.” I can tell you that some credible researchers who have access to whole genome sequences and have been looking at this question peg the divergence date closer to 200,000 years. Some of the issue here is that you need to decompose later gene flow, which will reduce the distance between populations. Easier said than done.

The genetic prehistory of the African continent is almost certainly much more complex than what is presented in the paper, largely due to lack of ancient DNA within Africa. Northern Eurasia turned out to be far more complex than had earlier been guessed…and it is likely that Northern Eurasia has had a simpler history because of its much shorter time of habitation.

If I had to guess I suspect that the ancestors of the Khoisan as we understand them were a separate and distinct group who diverged between ~100,000 and ~200,000 years ago from other extant African populations. But I suspect our clarity is very low in relation to the sort of structure which eventually resulted in the shake-out of only a few large groups of Sub-Saharan Africans aside from the Khoisan.

Citation: Population genetic analysis of the DARC locus (Duffy) reveals adaptation from standing variation associated with malaria resistance in humans.

March 23, 2017

Ancestry inference won’t tell you things you don’t care about (but could)

Filed under: Anthroplogy,Genetics,Genomics,Personal genomics — Razib Khan @ 5:59 pm

The figure above is from Noah Rosenberg’s relatively famous paper, Clines, Clusters, and the Effect of Study Design on the Inference of Human Population Structure. The context of the publication is that it was one of the first prominent attempts to use genome-wide data on a variety of human populations (specifically, from the HGDP data set) and attempt model-based clustering. There are many details of the model, but the one that will jump out at you here is that the parameter K defines the number of putative ancestral populations you are hypothesizing. Individuals then shake out as proportions of each of the K elements. Remember, this is a model in a computer, and you select the parameters and the data. The output is not “wrong,” it’s just the output based on how you set up the program and the data you input yourself.

These sorts of computational frameworks are innocent, and may give strange results if you want to engage in mischief. For example, let’s say that you put in 200 individuals, of whom 95 are Chinese, 95 are Swedish, and 10 are Nigerian. From a variety of disciplines we know that, to a first approximation, non-Africans form a monophyletic clade in relation to Africans. In plain English, all non-Africans descend from a group of people who diverged from Africans more than 50,000 years ago. That means if you imagine two populations, the first division should be between Africans and non-Africans, to reflect this historical demography. But if you skew the sample size, as the program looks for the maximal amount of variation in the data set it may decide that dividing between Chinese and Swedes as the two ancestral populations is the most likely model given the data.

This is not wrong as such. As the number of Africans in the data converges on zero, obviously the dividing line is between Swedes and Chinese. If you overload particular populations within the data, you may marginalize the variation you’re trying to explore, and the history you’re trying to uncover.
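
A toy calculation shows why sample composition matters. As a stand-in for model-based clustering (an assumption on my part, since programs like ADMIXTURE fit a likelihood model rather than k-means), treat each population as a single point and ask which two-cluster partition minimizes the within-cluster sum of squares. The squared distances below are hypothetical values chosen only to respect the tree: Nigerians are farther from both non-African groups than those groups are from each other.

```python
from itertools import combinations

# Hypothetical squared genetic distances between population centroids.
SQ_DIST = {
    frozenset({"Chinese", "Swedish"}): 0.10,
    frozenset({"Chinese", "Nigerian"}): 0.17,
    frozenset({"Swedish", "Nigerian"}): 0.15,
}

def best_two_way_split(sample_sizes):
    """Return the population split off alone by the lowest-cost K=2 partition.

    With three point-like populations, each partition isolates one population
    and merges the other two; merging groups of size n1 and n2 separated by
    squared distance d2 costs n1 * n2 / (n1 + n2) * d2.
    """
    costs = {}
    for a, b in combinations(sample_sizes, 2):
        n1, n2 = sample_sizes[a], sample_sizes[b]
        merged_cost = n1 * n2 / (n1 + n2) * SQ_DIST[frozenset({a, b})]
        (isolated,) = set(sample_sizes) - {a, b}
        costs[isolated] = merged_cost
    return min(costs, key=costs.get)

balanced = {"Chinese": 95, "Swedish": 95, "Nigerian": 95}
skewed = {"Chinese": 95, "Swedish": 95, "Nigerian": 10}
print("Balanced sample isolates:", best_two_way_split(balanced))
print("Skewed sample isolates:", best_two_way_split(skewed))
```

With equal sample sizes the cheapest split separates the Nigerians from everyone else, as the tree suggests. With only 10 Nigerians, lumping them in with one of the large groups costs so little that the split falls between Chinese and Swedes instead.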

I’ve written all of this before. But I’m writing this in context of the earlier post, Ancestry Inference Is Precise And Accurate(Ish). In that post I showed that consumers drive genomics firms to provide results where the grain of resolution and inference varies a lot as a function of space. That is, there is a demand that Northern Europe be divided very finely, while vast swaths of non-European continents are combined into one broad cluster.

Less than 5% Ancient North Eurasian

Another aspect though is time. These model-based admixture frameworks can implicitly traverse time as one ascends up and down the number of K’s. It is always important to explain to people that the number of K’s may not correspond to real populations which all existed at the same time. Rather, they’re just explanatory instruments which illustrate phylogenetic distance between individuals. In a well-balanced data set for humans K = 2 usually separates Africans from non-Africans, and K = 3 then separates West Eurasians from other populations. Going across K’s it is easy to imagine that one is traversing successive bifurcations.

A racially mixed man, 15% ANE, 30% CHG, 30% WHG, 30% EEF

But today we know that it’s more complicated than that. Three years ago Pickrell and Reich published Toward a new history and geography of human genes informed by ancient DNA, where they report the result that more powerful methods and data imply most human populations are relatively recent admixtures between extremely diverged lineages. What this means is that the origin of groups like Europeans and South Asians is very much like the origin of the mixed populations of the New World. Since then this insight has become only more powerful, as ancient DNA has shed light on massive population turnovers over the last 5,000 to 10,000 years.

These are to some extent revolutionary ideas, not well known even among the science press (which is too busy doing real journalism, i.e. the art of insinuation rather than illumination). As I indicated earlier direct-to-consumer genomics firms use national identities in their cluster labels because these are comprehensible to people. Similarly, they can’t very well tell Northern Europeans that they are an outcome of a successive series of admixtures between diverged lineages from the late Pleistocene down to the Bronze Age. Though Northern Europeans, like South Asians, Middle Easterners, Amerindians, and likely Sub-Saharan Africans and East Asians, are complex mixes between disparate branches of humanity, today we view them as indivisible units of understanding, to make sense of the patterns we see around us.

Personal genomics firms therefore give results which are historically comprehensible. As a trivial example, the genomic data makes it rather clear that Ashkenazi Jews emerged in the last few thousand years via a process of admixture between antique Near Eastern Jews, and the peoples of Western Europe. After the initial admixture this group became an endogamous population, so that most Ashkenazi Jews share many common ancestors in the recent past with other Ashkenazi Jews. This is ideal for the clustering programs above, as Ashkenazi Jews almost always fit onto a particular K with ease. Assuming there are enough Ashkenazi Jews in your data set you will always be able to find the “Jewish cluster” as you increase the value of K.

But the selection of a K which satisfies this comprehensibility criterion is a matter of convenience, not necessity. Most people are vaguely aware that Jews emerged as a people at a particular point in history. In the case of Ashkenazi Jews they emerged rather late in history. At certain K’s Ashkenazi Jews exhibit mixed ancestral profiles, placing them between Europeans and Middle Eastern peoples. What this reflects is the earlier history of the ancestors of Ashkenazi Jews. But for most personal genomics companies this earlier history is not something that they want to address, because it doesn’t fit into the narrative that their particular consumers want to hear. People want to know if they are part-Jewish, not that they are part antique Middle Eastern and Southwest European.

Perplexment of course is not just for non-scientists. When Joe Pickrell’s TreeMix paper came out five years ago there was a strange signal of gene flow between Northern Europeans and Native Americans. There was no obvious explanation at the time…but now we know what was going on.

It turns out that Northern Europeans and Native Americans share common ancestry from Pleistocene Siberians. The relationship between Europeans and Native Americans has long been hinted at in results from other methods, but it took ancient DNA for us to conceptualize a model which would explain the patterns we were seeing.

An American with recent Amerindian (and probably African) ancestry

But in the context of the United States shared ancestry between Europeans and Native Americans is not particularly illuminating. Rather, what people want to know is if they exhibit signs of recent gene flow between these groups. In particular, many white Americans are curious if they have Native American heritage. They do not want to hear an explanation which involves the fusion of an East Asian population with Siberians that occurred 15,000 to 20,000 years ago, and then the emergence of Northern Europeans through successive amalgamations between Pleistocene, Neolithic, and Bronze Age Eurasians.

In some of the inference methods Northern Europeans, often those with Finnic ancestry or relationship to Finnic groups, may exhibit signs of ancestry from the “Native American” cluster. But this is almost always a function of circumpolar gene flow, as well as the aforementioned Pleistocene admixtures. One way to avoid this would be to simply not report proportions which are below 0.5%. That way, people with higher “Native American” fractions would receive the results, and the proportions would be high enough that it was almost certainly indicative of recent admixture, which is what people care about.
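The reporting cutoff suggested above is simple enough to sketch. The function name, the cluster labels, and the decision to pool the hidden fractions into an "Unassigned" bucket are all my assumptions for illustration; no firm's actual pipeline is being described here.

```python
def reportable(proportions, threshold=0.005):
    """Keep clusters at or above the threshold; pool the rest as 'Unassigned'.

    proportions maps a cluster label to its estimated ancestry fraction.
    The default threshold of 0.005 corresponds to the 0.5% cutoff.
    """
    kept = {name: p for name, p in proportions.items() if p >= threshold}
    hidden = sum(p for p in proportions.values() if p < threshold)
    if hidden > 0:
        # Hypothetical design choice: show the remainder without a label.
        kept["Unassigned"] = hidden
    return kept


# A trace signal of the sort produced by deep shared ancestry is suppressed...
print(reportable({"Northern European": 0.996, "Native American": 0.004}))
# ...while a fraction large enough to suggest recent admixture is reported.
print(reportable({"Northern European": 0.94, "Native American": 0.06}))
```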

Why am I telling you this? Because many journalists who report on direct-to-consumer genomics don’t understand the science well enough to grasp what’s being sold to the consumer (frankly, most biologists don’t know this field well either, even if they might use a barplot here and there).

And, the reality is that consumers have very specific parameters of what they want in terms of geographic and temporal information. They don’t want to be told true but trivial facts (e.g., they are Northern European). But neither do they want to know things which are so novel and at such far remove from their interpretative frameworks that they simply can’t digest them (e.g., that Northern Europeans are a recent population construction which threads together very distinct strands with divergent deep time histories). In the parlance of cognitive anthropology consumers want their infotainment the way they want their religion, minimally counterintuitive. Consume some surprise. But not too much.

Your ancestry inference is precise and accurate(ish)

Filed under: 23andMe,Ancestry,Culture,Family Tree DNA,Genetics,Genomics — Razib Khan @ 6:29 am

For about three years I consulted for Family Tree DNA. It was a great experience, and I met a lot of cool people through that connection. But perhaps the most interesting aspect was that I came to understand the various pressures that direct-to-consumer genomics firms face from the demand side. The science is one thing, but when you are working on a consumer-facing product, other variables come into play which you are not cognizant of when you are thinking of it from a point of pure analysis. I’m pretty sure that my insights working with Family Tree DNA can generalize to the other firms as well (23andMe, Ancestry, and Genographic*).

The science behind the ancestry inference elements of the product on offer is not particularly controversial or complex, but the customer aspect of how these results are received can become an intractable nightmare. The basic theory was outlined in the year 2000 in Pritchard et al.’s Inference of Population Structure Using Multilocus Genotype Data. You have lots of data thanks to better genomic technology (e.g., 300,000 SNPs). You have computers to analyze that data. And, you have scientific models of population history and dynamics which you can test that data against. The shape of the data will determine the parameters of the model, and it is those parameters that yield “your ancestry.”
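To make the idea of fitting parameters to data concrete, here is a minimal sketch of a much simpler cousin of that model. It assumes the allele frequencies of two source populations are already known and estimates a single individual's ancestry fraction by maximum likelihood over a grid. The method in the Pritchard et al. paper infers the population frequencies and everyone's ancestry jointly, so this is a simplification rather than their algorithm, and the frequencies and genotypes are made up.

```python
import math


def log_likelihood(q, genotypes, freqs_a, freqs_b):
    """Log-likelihood of genotypes if a fraction q of ancestry is from population A.

    genotypes holds counts (0, 1, or 2) of a reference allele at each SNP;
    freqs_a and freqs_b hold that allele's frequency in each source population.
    The binomial constant is dropped because it does not depend on q.
    """
    total = 0.0
    for g, fa, fb in zip(genotypes, freqs_a, freqs_b):
        p = q * fa + (1.0 - q) * fb  # expected allele frequency for this person
        total += g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return total


def estimate_ancestry(genotypes, freqs_a, freqs_b, steps=100):
    """Grid-search the ancestry fraction q that maximizes the likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: log_likelihood(q, genotypes, freqs_a, freqs_b))


# Hypothetical frequencies at four SNPs (kept strictly between 0 and 1).
freqs_a = [0.9, 0.1, 0.8, 0.2]
freqs_b = [0.1, 0.9, 0.2, 0.8]

print(estimate_ancestry([2, 0, 2, 0], freqs_a, freqs_b))  # looks like population A
print(estimate_ancestry([1, 1, 1, 1], freqs_a, freqs_b))  # heterozygous everywhere
```

The less the two sets of frequencies differ, the flatter the likelihood becomes in q, which is the sense in which the methods need variation as raw material.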

In broad sketches the results make sense for most people. It’s in the finer details that the confusions emerge. To the left you see my son’s 23andMe ancestry deconvolution. The color coding is such that you can tell that his maternal and paternal chromosomes have very different ancestry profiles (mostly Northern European and South Asian, respectively).

But his “Northern European” chromosomes also are more richly colored, with alternative segments denoting ancestry from different parts of Northern Europe. So in terms of proportions I am told my son is about 15 percent French and German, and 10 percent Scandinavian and 10 percent British and Irish. This is reasonable. On the other side he’s nearly 50 percent “broadly South Asian.” The balance is accounted for by my East Asian ancestry, which is correct, as my South Asian ethnicity is from Bengal, where there is a fair amount of East Asian ancestry (my family’s origin is on the eastern edge of Bengal itself).

And it is here that the non-scientific concerns of consumer genomics come into focus. The genetic differences and distance between various South Asian groups are far higher than those between various Northern European groups. Depending on the statistical measure you use intra-South Asian variation is about one order of magnitude greater than intra-Northern European differences. This is due to geographic partitioning, the caste system, and differential admixture in South Asians between extremely diverged ancestral elements (about half of South Asian ancestry is very similar to Europeans and Middle Easterners, and half of it is extremely different, so how far you are from the 50 percent mark determines a lot).

Broadly South Asian

In Northern Europe there is very little genetic variation from the British Isles all the way to the Baltic. The reason for this is historical: massive population turnover in the region 4,500 years ago means that much of the genetic divergence between the groups dates to the Bronze Age. It is this genetic divergence, the variation, that is the raw material for the inferences and proportions you see in ancestry calculators. There’s just not that much raw material for Northern Europeans.

Broadly South Asian

Remember, the methods require lots of variation in the data as a raw input. You’re making the inference machine work real hard to produce a reasonably robust result if you don’t have that much variation. In contrast to the situation with Northern Europeans, with South Asians the companies are leaving raw material on the table, and just combining diverse groups together.

What’s going on here? As you might have guessed this is an economically motivated decision. Most South Asians know their general heritage due to caste and regional origins (though many Bengalis exhibit some lacunae about their East Asian ancestry). In contrast, many Americans of Northern European ancestry with an interest in genealogy are extremely curious about explicit proportional breakdowns between Northern European nationalities. The direct-to-consumer genomic firms attempt to cater to this demand as best as they can.

As I have stated many times, racial background is to various extents both biological and social. When it comes to the difference between Lithuanians and Nigerians the biological differences due to evolutionary history are straightforward, and clear and distinct. You can generate a phylogenetic history and perform a functional analysis of the differences. You also have to note that the social differences exist, but are not straightforward. Like Lithuanians, Nigerians of Igbo background are generally Roman Catholic, while most other Nigerians are not. The linguistic differences between Nigerian languages are great enough that it is defensible to suggest that the Afro-Asiatic language of the Hausa is closer to Lithuanian in its phylogenetic history than it is to the language of the Yoruba.

A Lithuanian American

Contrast this to the situation where you differentiate Lithuanians from French. To any European the differences here are incredibly huge. The history of France, what was Roman Gaul, goes back 2,000 years. After the collapse of the Western Roman Empire by any measure the people who became French were at the center of European history. In contrast, Lithuanians were a marginal tribe, who did not enter Christian civilization until the late 14th century. In social-cultural terms, due to history, the differences between French and Lithuanians are extremely salient to people of French and Lithuanian ancestry. But genetically the differences are modest at best.

If a direct-to-consumer genetic testing company tells you that you are 90 percent Northern European and 10 percent West African, that is a robust result that has a clear historical genetic interpretation. The two elements of one’s ancestry have been relatively distinct for on the order of 100,000 years, with the Northern European element really just a proxy for non-Africans (though it is easy to drill down within Eurasia). In contrast, notice how 23andMe, with some of the best scientists in the business, tells people they are “French-German,” and not French or German. What the hell is a “French-German”? Someone from Alsace-Lorraine? A German descendant of Huguenots? Obviously not.

“French-German” is a cluster almost certainly because there are no clear and distinct genetic differences between French and Germans. Yes, there is a continuum of allele frequencies between these two groups, but having looked at a fair number of people of French and German background in Family Tree DNA’s database I can tell you that France and Germany have a lot of local structure even among people of indigenous ancestry. Germans from the Rhineland are quite often genetically closer to French from Normandy than they are to Germans from eastern Saxony. Some of this is due to gene flow between neighboring regions, but some of this is due to cultural fluidity as to who exactly is German. It is clear that some Germans from the eastern regions are Germanized Slavs. Some Germans from the north exhibit strong affinities to Scandinavians, while Germans from Bavaria and Austria are classically Central European (whatever that means). The average German is distinct from the average French person, but the genetic clustering of the two groups is not clear and distinct.

Remember earlier I explained that the science is predicated on aligning data and models. The cultural model of Northern Europeans is conditioned on diversity and difference which has been very salient for the past few thousand years since the rise and fall of Rome. But the evolutionary genetic history is one where there are far fewer differences. The data do not fit a model that makes much sense to the average consumer (e.g., “you descend from a mix of Bronze Age migrants from the west-central steppe of Eurasia and Mesolithic indigenous hunter-gatherers and Neolithic farmers”). What makes sense to the average American consumer are histories of nationalities, so direct-to-consumer genetic companies try to satisfy this need. Because the needs of the consumer and their cultural expectations are poorly served by the data (genetic variation) and models of population history, you have a lot of awkward kludges and strange results.

Imagine, for example, you want to estimate how “German” someone is. What do you use for your reference population of Germans? Looking at the data there are clearly three major clusters within Germany when you weight the numbers appropriately, with affinities to the northern French, Slavs, and Scandinavians, and various proportions in between. Your selection of your sample is going to mean that some Germans are going to be more German than other Germans. If you select an eastern German sample then western Germans whose ancestors have been speaking a Germanic language far longer than eastern Germans are going to come out as less German. Or, you could just pick all of these disparate groups…in which case, lots of Northern Europeans become “German.”

Consumers want genetic tests to reflect strong cultural memories which were forged in the fires of the rapidly protean and distinction-making process of cultural evolution. But biological and cultural evolution exhibit different modes (the latter generates huge between group differences) and tempos (those differences emerge fast). The ancestry results many people get are the outcomes of compromises to thread the needle and square the circle.

All the above is half the story. Next I’ll explain why “deep history” has to be massaged to make recent history informative and comprehensible….

* Also, I have a little historical perspective because of my friendship with the person who arguably created this sector, Spencer Wells.

November 30, 2016

Razib Khan at the Center of Eurasia

Filed under: Genomics,science — Razib Khan @ 6:41 am
The Eurogenes blog is running a fundraiser. I chipped in mostly to support his continued blogging. I don't agree with everything he posts, but the site is a good and valuable resource. "Genome blogging" hasn't gotten as far as I'd have thought it would have in 2010, mostly because the initial burst of enthusiasm wasn't...

November 8, 2013

Selection happens; but where, when, and why?

Filed under: Anthroplogy,Genetics,Genomics,Pigmentation — Razib Khan @ 1:49 am
Distribution of SLC24A5 variation at SNP rs1426654. Credit: HGDP Browser

Nina Davuluri, Miss America 2014, Credit: Andy Jones

One of the secondary issues which cropped up with Nina Davuluri winning Miss America is that it seems implausible that someone with her complexion would be able to win any Indian beauty contest. A quick skim of Google images “Miss India” will make clear the reality that I’m alluding to. The Indian beauty ideal, especially for females, is skewed to the lighter end of the complexion distribution of native South Asians. Nina Davuluri herself is not particularly dark skinned if you compare her to the average South Asian; in fact she is likely at the median. But it would be surprising to see a woman who looks like her held up as conventionally beautiful in the mainstream Indian media. When I’ve pointed this peculiar aspect out to Indians* some of them will submit that there are dark skinned female celebrities, but when I look up the actresses in question they are invariably not very dark skinned, though perhaps by comparison to what is the norm in that industry they may be. But whatever the cultural reality is, the fraught relationship of color variation to aesthetic variation prompts us to ask, why are South Asians so diverse in their complexions in the first place? A new paper in PLoS Genetics, The Light Skin Allele of SLC24A5 in South Asians and Europeans Shares Identity by Descent, explores this genetic question in depth.

Much of the low hanging fruit in this area was picked years ago. A few large effect genetic variants which are known to be polymorphic across many populations in Western Eurasia segregate within South Asian populations. What this means in plainer language is that a few genes which cause major changes in phenotype are floating around in alternative flavors even within families among people of Indian subcontinental origin. Ergo, you can see huge differences between full siblings in complexion (African Americans, as an admixed population, are analogous). While loss of pigmentation in eastern and western Eurasia seems to be a case of convergent evolution (different mutations in overlapping sets of genes), the H. sapiens sapiens ancestral condition of darker skin is well conserved from Melanesia to Africa.


So what’s the angle on this paper you may ask? Two things. The first is that it has excellent coverage of South Asian populations. This matters because to understand variation in complexion you should probably look at populations which vary a great deal. Much of the previous work has focused on populations at the extremes of the human distribution, Africans and Europeans. There are obvious limitations to this approach. If you are looking at variant traits, then focusing on populations where the full range of variation is expressed can be useful. Second, this paper digs deeply into the subtle evolutionary and phylogenomic questions which are posed by the diversification of human pigmentation. It is often said that race is only skin deep, as if to dismiss the importance of human biological variation. But skin is a rather big deal. It’s our biggest organ, and the pigmentation loci do seem to be rather peculiar.

You probably know that on the order of ~20% of genetic variation is partitioned between continental populations (races). But this is not the case at all genes. And pigmentation ones tend to be particularly notable exceptions to the rule. In late 2005 a paper was published which arguably ushered in the era of modern pigmentation genomics, SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. The authors found that one nonsynonymous mutation was responsible for on the order of 25 to 33% of the variation in skin color difference between Africans and Europeans. And, the allele frequency was nearly disjoint across the two populations, and between Europeans and East Asians. When comparing Europeans to Africans and East Asians almost all the variation was partitioned across the populations, with very little within them. The derived SNP, which differs from the ancestral state, is found at ~100% frequency in Europeans, and ~0% in Africans and East Asians. It is often stated (you can Google it!) that this variant is the second most ancestrally informative allele in the human genome in relation to Europeans vs. Africans.
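What it means for variation at a locus to be partitioned between populations can be made concrete with a heterozygosity-based Fst for one biallelic SNP. This is a minimal sketch assuming two populations weighted equally; the allele frequencies in the examples are hypothetical round numbers, not values from the paper.

```python
def fst(p1, p2):
    """Heterozygosity-based Fst at one biallelic locus for two equally weighted populations.

    p1 and p2 are the frequencies of the same allele in each population.
    Returns the share of total heterozygosity that lies between populations.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                        # pooled heterozygosity
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2   # mean within-population
    if h_total == 0:
        return 0.0  # locus is monomorphic overall, nothing to partition
    return (h_total - h_within) / h_total


print(fst(0.6, 0.4))  # a modest frequency difference: little variation between groups
print(fst(1.0, 0.0))  # a nearly disjoint locus: all variation lies between groups
```

A locus where one population sits near 100% and the other near 0% lands at the top of the scale, which is why such an allele is so ancestrally informative.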

SLC24A5 was just the beginning. SLC45A2, TYR, OCA2, and KITLG are just some of the alphabet soup of loci which have come to be understood to affect normal human variation in pigmentation. Despite the relatively large roll call of pigmentation genes one can safely say that between any two reasonably distinct geographic populations ~90 percent of the between population variation in the trait is going to be due to ~10 genes. Often there is a power law distribution as well. The first few genes of large effect are over 50% of the variance, while subsequent loci are progressively less important.

So how does this work to push the overall results forward?

- With their population coverage the authors confirm that SLC24A5 seems to be polymorphic in all Indo-European and Dravidian speaking populations in the subcontinent. The frequency of the derived variant ranges from ~90% in the Northwest, and ~80% in Brahmin populations all over the subcontinent, to ~10-20% in some tribal groups.

- Though there is a north-south gradient, it is modest, with a correlation of ~0.25. There is a much stronger correlation with longitude, but I’m rather sure that this is an artifact of their low sampling of Indo-European populations in the eastern Gangetic plain. As hinted in the piece the correlation with longitude has to do with the fact that Tibetan and Burman populations in these fringe regions tend to lack the West Eurasian allele.

- Using haplotype based tests of natural selection the authors infer that the frequency of this allele has been driven up positively in north, but not south, India. It could be that the authors lack power to detect selection in the south because of lower frequency of the derived allele. And, I did wonder if selection in the north was simply an echo of what occurred in West Eurasia. But if you look at the frequency of the A allele in the north most of the populations seem to have a higher frequency of the derived variant than they do of inferred “Ancestral North Indian”.

What’s perhaps more interesting is the bigger picture of human evolutionary dynamics and phylogenetics that these results illuminate. Resequencing the region around SLC24A5 these researchers confirmed it does look like the derived variant is identical by descent in all populations across Western Eurasia and into South Asia. What this means is that this mutation arose in someone at some point around the Last Glacial Maximum, after West Eurasians separated from East Eurasians. The authors give some numbers using some standard phylogenetic techniques, but admit that it is ancient DNA that will give true clarity on the deeper questions. When I see something written like that my hunch, and hope, is that more papers are coming soon.

When I first read The Light Skin Allele of SLC24A5 in South Asians and Europeans Shares Identity by Descent, I thought that it was essential to read Ancient DNA Links Native Americans With Europe and Efficient moment-based inference of admixture parameters and sources of gene flow. The reason goes back to the plot which I generated at the top of this post: notice that Native Americans do not carry the West Eurasian variant of SLC24A5. What the find of the ~24,000-year-old Siberian boy, and his ancient DNA, suggests is that there was a population with affinities closer to West Eurasians than East Eurasians that contributed to the ancestry of Native Americans. The lack of the European variant of SLC24A5 in Native Americans suggests to me that the sweep had not begun, or, that the European variant was disfavored. What the other paper reports is that on the order of 20-40% of the ancestry of Europeans may be derived from an ancient North Eurasian population, unrelated to West Eurasians (or at least not closely related). It is likely that this population has something to do with the Siberian boy. Since Europeans are fixed for the derived variant of SLC24A5, that implies to me that the sweep must have occurred after 24,000 years ago.

At this point I have to admit that I believe we need to be careful calling this a “European variant.” Just because it is nearly fixed in Europe, does not imply that the variant arose in Europe. If you look at the frequency of the derived variant you see it is rather high in the northern Middle East. Looking at some of the populations in the Middle Eastern panel the ancestral variant might be all explained by admixture in historical time from Africa. If the sweep began during the last Ice Age, then most of Europe would have been uninhabited. The modern distribution is informative, but it surely does not tell the whole story.

Where we are is that SLC24A5, and pigmentation as a whole, is coming to be genomically characterized fully. We don’t know the whole story of why light skin was selected so strongly. And we don’t quite know where the selection began, and when it began. But through gradually filling in pieces of the puzzle we may come to grips with this adaptively significant trait in the near future.

Citation: Basu Mallick C, Iliescu FM, Möls M, Hill S, Tamang R, et al. (2013) The Light Skin Allele of SLC24A5 in South Asians and Europeans Shares Identity by Descent. PLoS Genet 9(11): e1003912. doi:10.1371/journal.pgen.1003912

* From my personal experience American born Indians often do not share the same prejudices and biases, partly because subtle shades of brown which are relevant in the Indian context seem ludicrous in the United States.

The post Selection happens; but where, when, and why? appeared first on Gene Expression.

Selection happens; but where, when, and why?

Filed under: Anthroplogy,Genetics,Genomics,Pigmentation — Razib Khan @ 1:49 am
Distribution of SLC452 variation at SNP rs1426654. Credit, HGDP Browser

Distribution of SLC452 variation at SNP rs1426654. Credit, HGDP Browser

Nina Davuluri, Miss America 2014, Credit: Andy Jones

Nina Davuluri, Miss America 2014, Credit: Andy Jones

One of the secondary issues which cropped up with Nina Davuluri winning Miss America is that it seems implausible that someone with her complexion would be able to win any Indian beauty contest. A quick skim of Google images “Miss India” will make clear the reality that I’m alluding to. The Indian beauty ideal, especially for females, is skewed to the lighter end of the complexion distribution of native South Asians. Nina Davuluri herself is not particularly dark skinned if you compared her to the average South Asian; in fact she is likely at the median. But it would be surprising to see a woman who looks like her held up as conventionally beautiful in the mainstream Indian media. When I’ve pointed this peculiar aspect out to Indians* some of them of will submit that there are dark skinned female celebrities, but when I look up the actresses in question they are invariably not very dark skinned, though perhaps by comparison to what is the norm in that industry they may be. But whatever the cultural reality is, the fraught relationship of color variation to aesthetic variation prompts us to ask, why are South Asians so diverse in their complexions in the first place? A new paper in PLoS Genetics, The Light Skin Allele of SLC24A5 in South Asians and Europeans Shares Identity by Descent, explores this genetic question in depth.

Much of the low hanging fruit in this area was picked years ago. A few large effect genetic variants which are known to be polymorphic across many populations in Western Eurasia segregate within South Asian populations. What this means in plainer language is that a few genes which cause major changes in phenotype are floating around in alternative flavors even within families among people of Indian subcontinental origin. Ergo, you can see huge differences between full siblings in complexion (African Americans, as an admixed population, are analogous). While loss of pigmentation in eastern and western Eurasia seems to be a case of convergent evolution (different mutations in overlapping sets of genes), the H. sapiens sapiens ancestral condition of darker skin is well conserved from Melanesia to Africa.


So what’s the angle on this paper you may ask? Two things. The first is that it has excellent coverage of South Asian populations. This matters because to understand variation in complexion you should probably look at populations which vary a great deal. Much of the previous work has focused on populations at the extremes of the human distribution, Africans and Europeans. There are obvious limitations using this approach. If you are looking at variant traits, then focusing on populations where the full range of variation is expressed can be useful. Second, this paper digs deeply into the subtle evolutionary and phylogenomic questions which are posed by the diversification of human pigmentation. It is often said that race is often skin deep, as if to dismiss the importance of human biological variation. But skin is a rather big deal. It’s our biggest organ, and the pigmentation loci do seem to be rather peculiar.

You probably know that on the order of ~20% of genetic variation is partitioned between continent populations (races). But this is not the case at all genes. And pigmentation ones tend to be particular notable exceptions to the rule. In late 2005 a paper was published which arguably ushered in the era of modern pigmentation genomics, SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. The authors found that one nonsynonomous mutation was responsible for on the order of 25 to 33% of the variation in skin color difference between Africans and Europeans. And, the allele frequency was nearly disjoint across the two populations, and between Europeans and East Asians. When comparing Europeans to Africans and East Asians almost all the variation was partitioned across the populations, with very little within them. The derived SNP, which differs from the ancestral state, is found at ~100% frequency in Europeans, and ~0% in Africans and East Asians. It is often stated (you can Google it!) that this variant is the second most ancestrally informative allele in the human genome in relation to Europeans vs. Africans.

SLC24A5 was just the beginning. SLC45A2, TYR, OCA2, and KITLG are just some of the numerous alphabet soup of loci which has come to be understood to affect normal human variation in pigmentation. Despite the relatively large roll call of pigmentation genes one can safely say that between any two reasonably distinct geographic populations ~90 percent of the between population variation in the trait is going to be due to ~10 genes. Often there is a power law distribution as well. The first few genes of large effect are over 50% of the variance, while subsequent loci are progressively less important.

So how does this work to push the overall results forward?

– With their population coverage the authors confirm that SLC24A5 seems to be polymorphic in all Indo-European and Dravidian speaking populations in the subcontinent. The frequency of the derived variant ranges from ~90% in the Northwest, and ~80% in Brahmin populations all over the subcontinent, to ~10-20% in some tribal groups.

– Though there is a north-south gradient, it is modest, with a correlation of ~0.25. There is a much stronger correlation with longitude, but I’m rather sure that this is an artifact of their low sampling of Indo-European populations in the eastern Gangetic plain. As hinted in the piece the correlation with longitude has to do with the fact that Tibetan and Burman populations in these fringe regions tend to lack the West Eurasian allele.

– Using haplotype based tests of natural selection the authors infer that the frequency of this allele has been driven up positively in north, but not south, India. It could be that the authors lack power to detect selection in the south because of lower frequency of the derived allele. And, I did wonder if selection in the north was simply an echo of what occurred in West Eurasia. But if you look at the frequency of the A allele in the north most of the populations seem to have a higher frequency of the derived variant than they do of inferred “Ancestral North Indian”.

What’s perhaps more interesting is the bigger picture of human evolutionary dynamics and phylogenetics that these results illuminate. Resequencing the region around SLC24A5 these researchers confirmed it does look like the derived variant is identical by descent in all populations across Western Eurasia and into South Asia. What this means is that this mutation arose in someone at some point around the Last Glacial Maximum, after West Eurasians separated from East Eurasians. The authors give some numbers using some standard phylogenetic techniques, but admit that it is ancient DNA that will give true clarity on the deeper questions. When I see something written like that my hunch, and hope, is that more papers are coming soon.

When I first read The Light Skin Allele of SLC24A5 in South Asians and Europeans Shares Identity by Descent, I thought that it was essential to read Ancient DNA Links Native Americans With Europe and Efficient moment-based inference of admixture parameters and sources of gene flow. The reason goes back to the plot which I generated at the top of this post: notice that Native Americans do not carry the West Eurasian variant of SLC24A5. What the find of the ~24,000 year old Siberian boy, and his ancient DNA, suggests is that there was a population with affinities closer to West Eurasians than East Eurasians that contributed to the ancestry of Native Americans. The lack of the European variant of SLC24A5 in Native Americans suggests to me that the sweep had not begun, or, that the European variant was disfavored. What the other paper reports is that on the order of 20-40% of the ancestry of Europeans may be derived from an ancient North Eurasian population, unrelated to West Eurasians (or at least not closely related). It is likely that this population has something to do with the Siberian boy. Since Europeans are fixed for the derived variant of SLC24A5, that implies to me that the sweep must have occurred after 24,000 years ago.

At this point I have to admit that I believe we need to be careful calling this a “European variant.” Just because it is nearly fixed in Europe, does not imply that the variant arose in Europe. If you look at the frequency of the derived variant you see it is rather high in the northern Middle East. Looking at some of the populations in the Middle Eastern panel the ancestral variant might be all explained by admixture in historical time from Africa. If the sweep began during the last Ice Age, then most of Europe would have been uninhabited. The modern distribution is informative, but it surely does not tell the whole story.

Where we are is that SLC24A5, and pigmentation as a whole, is coming to be genomically characterized fully. We don’t know the whole story of why light skin was selected so strongly. And we don’t quite know where the selection began, and when it began. But through gradually filling in pieces of the puzzle we may come to grips with this adaptively significant trait in the near future.

Citation: Basu Mallick C, Iliescu FM, Möls M, Hill S, Tamang R, et al. (2013) The Light Skin Allele of SLC24A5 in South Asians and Europeans Shares Identity by Descent. PLoS Genet 9(11): e1003912. doi:10.1371/journal.pgen.1003912

* From my personal experience American born Indians often do not share the same prejudices and biases, partly because subtle shades of brown which are relevant in the Indian context seem ludicrous in the United States.

The post Selection happens; but where, when, and why? appeared first on Gene Expression.

January 8, 2013

Using your 23andMe data: exploring with MDS

Filed under: 23andMe,Genomics — Razib Khan @ 2:09 am

Note: please read the earlier post on this topic if you haven’t.

The above image is from 23andMe. It’s from a feature which seems to have been marginalized a bit with their ancestry composition. Basically it is projecting 23andMe customers on a visualization of genetic variation from the HGDP data set. This is actually a rather informative sort of representation of variation. But there has always been an issue with the 23andMe representation: you are projected onto their invariant data set. In other words, you can’t mix & match the populations so as to explore different relationships. The nature of the algorithm and representation produces strange results, so varying the population sets is often useful in smoking out the true shape of things.

With the MDS feature I wrote about yesterday you can now compute positions with different weights of populations and mixes. This post will focus on how to manipulate the overall data set. You should have PHYLO from the earlier post. Open up the .fam file. It should look like this:

Malayan A382 0 0 1 -9
Paniya D36 0 0 1 -9
BiakaPygmies HGDP00479 0 0 1 -9
BiakaPygmies HGDP00985 0 0 1 -9
BiakaPygmies HGDP01094 0 0 1 -9
MbutiPygmies HGDP00982 0 0 1 -9
Mandenkas HGDP00911 0 0 1 -9
Mandenkas HGDP01202 0 0 1 -9
Yorubas HGDP00927 0 0 1 -9
BiakaPygmies HGDP00461 0 0 1 -9
BiakaPygmies HGDP00986 0 0 1 -9
MbutiPygmies HGDP00449 0 0 1 -9
Mandenkas HGDP00912 0 0 1 -9
Mandenkas HGDP01283 0 0 1 -9
Yorubas HGDP00928 0 0 2 -9

And so forth. PHYLO has 1,500+ individuals. This is a bit much, which is why the --genome command took so long. To ask particular questions it is often useful to prune the population down. I have a friend who is 1/4 Filipino who is curious as to whether his ancestry was more Chinese or native Filipino. How to answer this?

- You want a range of East Asian populations, north to south.

- You want a good out group. I’ll use the Utah whites.

All you need to do is go through the .fam file and keep only those lines you want, and put them into a new file, keep.txt. Then you run this command:

plink --noweb --bfile PHYLO --keep keep.txt --make-bed --out PHYLONARROW
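Going through the .fam file by hand gets tedious with 1,500+ lines, so the selection can be scripted. The following is a minimal sketch, assuming the family id in the first column is the population label (as in the listing above). The function names are hypothetical, and the population names in the example come from the excerpt above rather than from the East Asian set you would actually want. It writes only the family id and individual id for each kept line, which are the two columns PLINK needs to identify an individual.

```python
def select_keep_lines(fam_lines, wanted_populations):
    """Return 'FID IID' strings for every .fam line whose family id is wanted."""
    wanted = set(wanted_populations)
    selected = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        family_id, individual_id = fields[0], fields[1]
        if family_id in wanted:
            selected.append(f"{family_id} {individual_id}")
    return selected


def write_keep_file(fam_path, keep_path, wanted_populations):
    """Read a .fam file and write the matching individuals to keep_path."""
    with open(fam_path) as fam:
        selected = select_keep_lines(fam, wanted_populations)
    with open(keep_path, "w") as out:
        out.write("\n".join(selected) + "\n")
    return len(selected)


# Example on a few lines copied from the listing above.
example = [
    "Malayan A382 0 0 1 -9",
    "BiakaPygmies HGDP00479 0 0 1 -9",
    "Yorubas HGDP00927 0 0 1 -9",
    "Yorubas HGDP00928 0 0 2 -9",
]
for row in select_keep_lines(example, ["Yorubas", "Malayan"]):
    print(row)
```

On a real data set you would call something like write_keep_file("PHYLO.fam", "keep.txt", [...]) with whatever population labels appear in your own .fam file.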

So I’ve now made a new pedigree data set which is a subset of the original. Now I merged my friend’s and my daughter’s genotypes into this data set. What if I wanted to remove some individuals, for example, the ones in keep.txt? You do it like so:

plink --noweb --bfile PHYLO --remove keep.txt --make-bed --out PHYLOAFEWGONE

With --keep and --remove, and making files drawn from the .fam file(s), you can customize your own data set for your own purposes. Again you want to produce an MDS, so run:

plink --noweb --bfile PHYLONARROW --genome

plink --noweb --bfile PHYLONARROW --read-genome plink.genome --mds-plot 6

This time --genome will run very fast, because there are far fewer individuals. Here is my plot of the result (my friend is “RF,” my daughter is “RD”):

Note that RF is aligned straight toward the “Dai” population, an ethnic group from South China, but not Han (they are related to the Thai). It seems plausible that my friend is of mixed Chinese and Filipino background. My daughter’s minimal East Asian ancestry is indeed Southeast Asian, and this is clear from this plot, as she is shifted further toward the Cambodians (this may be due to South Asian affinities as well).

The point is not to rely on one plot, but to generate many so as to explore the possibilities, and develop an intuition.
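One way to put a rough number next to the visual impression from a plot like this is to rank the reference populations by how far their centroids sit from a focal individual in MDS space. This is a crude nearest-centroid heuristic, not an admixture estimate, and it inherits every quirk of the plot it is computed from. The function names are hypothetical and the coordinates below are made-up toy values, not output from PHYLO.

```python
import math
from collections import defaultdict


def population_centroids(records, exclude_iids=()):
    """Average the coordinates of each population (family id).

    records is a list of (family_id, individual_id, coordinates) tuples.
    """
    sums = {}
    counts = defaultdict(int)
    for family_id, individual_id, coords in records:
        if individual_id in exclude_iids:
            continue
        if family_id not in sums:
            sums[family_id] = [0.0] * len(coords)
        for k, value in enumerate(coords):
            sums[family_id][k] += value
        counts[family_id] += 1
    return {fid: [t / counts[fid] for t in totals] for fid, totals in sums.items()}


def rank_populations(records, focal_iid):
    """Return population labels sorted from nearest to farthest centroid."""
    focal = next(coords for _, iid, coords in records if iid == focal_iid)
    centroids = population_centroids(records, exclude_iids={focal_iid})
    distances = [(math.dist(focal, c), fid) for fid, c in centroids.items()]
    return [fid for _, fid in sorted(distances)]


# Toy coordinates (invented for illustration only).
toy = [
    ("Dai", "D1", (0.10, 0.02)),
    ("Dai", "D2", (0.12, 0.00)),
    ("Han", "H1", (0.20, 0.05)),
    ("Utah", "U1", (-0.50, 0.01)),
    ("me", "RF", (0.05, 0.01)),
]
print(rank_populations(toy, "RF"))
```

Rerunning the ranking on several differently pruned data sets is the numerical counterpart of generating many plots.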

January 7, 2013

Using your 23andMe data in Plink

Filed under: Genomics,PLINK — Razib Khan @ 2:58 am

With the recent $99 price point for 23andMe many of my friends have purchased kits (finally!). 23andMe’s interpretive results are pretty rich now, but there are still things missing. There are plenty of third party tools you can use, but I know some people might want to do their own data analysis. There are many ways you could go about this, but I want to put up some posts on DIY genomic data analysis to make the learning curve a little less steep, and get people started. Motivation to actually begin going down this road is a big issue, but I think once you get over the hump it gets a lot easier.

First, you need Plink. It is really preferable that you work on a Mac or in Linux to engage in heavy duty analysis, but in this post I’ll assume you are working on the Windows platform. Again, the point here is to make this accessible. Download Plink if you don’t have it, and extract it wherever you like.


Plink is a command line tool, which means that you need to go into the folder with the old MS-DOS interface. So use the cd command to get into that folder. Here is a screenshot of my shell:

The selection “plink --noweb --bfile PhyloF --genome” is a command that I entered. It is not part of the directory structure. If you don’t know about the cd command, please see the Wikipedia entry. It’s really just a simple way to step through the directory structure of your files and folders.

Now you have Plink. We need to put your 23andMe data into pedigree format. Additionally it would be convenient to have other reference data sets. Go to here. You now need to click the ZIP option. That will download a 74 MB zip containing all the files you see listed to the left. Most of that is in the two zip files, which are pedigree file data sets that I have provided for your future use. More on that later. First you need to use “CONVERT_23AME_PED.pl.” This is a Perl script which takes the 23andMe text file, and converts it to pedigree format which Plink can use. You need to have Perl to use this script.

If you are on Windows you need to get ActivePerl. Download it. Again you have to open the command prompt and go into the appropriate folder. On my computer (this is the first time I’m using Perl on Windows in 10 years, the sacrifices I make for the readership of this weblog!) it is in the C:\ directory, so you probably have to move “up” the directory tree twice by typing “cd ..” (if you do this you’ll see what I mean). Once you are in the Perl directory you need to go into the bin directory. Remember to move the Perl script into the Perl directory. Here is a screen of what I get when I try and run the Perl script without any parameters:

Basically there needs to be a file for the script to process. You should have a 23andMe text file, your raw data. It will start like so: “genome_”. If you don’t have it, go into your account, and click “Browse Raw Data.”  On this page there will be options to download various peoples’ data if you have multiple accounts. It will download whoever is selected in your profile (for most it will be just one person of course).

Now you need to just select the button and enter your password. An 8 MB zip file will come down from the server. Put it into your Perl/bin folder by extracting it. Do not try and process the zip file! Once in there you now add it as your first parameter. I’d rename it something short and sweet since you’ll be typing it in. You don’t need to put a unique id parameter in, but I would if I were you. Try “me.” And “me” for the family id. At some point you’ll do more sophisticated things and need less silly ids, but not right now.

Here’s a screenshot of me running the Perl script with my own data (I renamed the text file). If the file name isn’t recognized make sure that you didn’t add the file extension within Windows, that might confuse it (e.g., for razibdata.txt if that’s what you see in the directory, you’d have to enter in razibdata.txt.txt in the parameter value since the extension is hidden):

There are two output files. In my case they are razibdata.ped and razibdata.map. As you can see they are named from your original file. You need to move both into the Plink directory. The .ped file has your individual data, the first half a dozen columns being the same as the parameters you may, or may not, have entered above. But it is very large because the whole line is filled out with your 23andMe genotype. The .map file basically has the information about the SNPs. These are both text files, and unwieldy.  You need to make it into a binary file. At the end of this there are three new files of the same name with extensions .bed, .bim, .fam:

You can see a lot of information. Most of it is not relevant to you, but note the number of SNPs. So now you have a pedigree file! Great. What do you do with it? Lots of stuff. You can look at the Plink documentation. Because the .bed file is a binary, never open it. The .bim has SNP info. You shouldn’t open this. On the other hand when you merge data sets .fam is useful. It’s a text file with all your individual and family id information. In this case with one file it isn’t informative, though you could change the id by editing the .fam file.
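For readers curious what the conversion from the raw 23andMe text file to .ped and .map involves, here is a minimal sketch of the idea. It is not the Perl script linked above, and the function name and defaults are hypothetical. It assumes the raw file has comment lines starting with "#" followed by whitespace-separated columns of rsid, chromosome, position and genotype, that no-calls are written with dashes, and that single-letter genotypes (hemizygous calls) can be doubled. Chromosome labels are passed through unchanged.

```python
def convert_23andme(raw_lines, family_id="me", individual_id="me"):
    """Turn 23andMe raw-data lines into one PED line and a list of MAP lines.

    PED columns: family id, individual id, father, mother, sex, phenotype,
    then two alleles per SNP. MAP columns: chromosome, SNP id, genetic
    distance (set to 0 here), base-pair position.
    """
    map_rows = []
    alleles = []
    for line in raw_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed lines
        rsid, chrom, position, genotype = fields[:4]
        if len(genotype) == 1:
            genotype = genotype * 2  # assumption: treat hemizygous calls as homozygous
        if "-" in genotype:
            allele_1 = allele_2 = "0"  # 0 is the missing-allele code in PED files
        else:
            allele_1, allele_2 = genotype[0], genotype[1]
        map_rows.append(f"{chrom}\t{rsid}\t0\t{position}")
        alleles.extend([allele_1, allele_2])
    # Sex 0 = unknown, phenotype -9 = missing.
    ped_line = " ".join([family_id, individual_id, "0", "0", "0", "-9"] + alleles)
    return ped_line, map_rows


# Tiny made-up input in the assumed raw-data layout.
raw_example = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs1\t1\t100\tAG",
    "rs2\tX\t200\tC",
    "rs3\t2\t300\t--",
]
ped, map_lines = convert_23andme(raw_example)
print(ped)
print("\n".join(map_lines))
```

The six leading columns of the .ped line are the same half dozen fields that reappear in the .fam file once the data set has been made binary.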

One thing you can do with just one individual is look for runs of homozygosity. The command is:

plink --bfile mydata --homozyg

You enter your binary pedigree file name, without the extension. Observe that now we are using --bfile instead of --file. Many commands will be bCommand instead of just Command if you are using binary files instead of the conventional ones. Binary files are smaller and the commands finish much faster, so use them! The output files, unless you use the --out command at the end to define them, usually begin with plink. So above you have plink.hom. It has some interesting information about the runs of homozygosity, but it is probably not too illuminating unless you suspect you are inbred!

Ultimately what I want you to do by the end of this is compute an MDS with your own data against a reference set. That’s PHYLO in the data I’ve provided. It has 99,000 SNPs that overlap with 23andMe, and 1,500 individuals. I’ve altered the .fam file so that all the family ids are recognizable as populations. This will make analysis of the output easier for you. First you need to merge the files. It will be useful for you to prune your data set down, since you have a lot of extra SNPs.

Assuming you’ve extracted PHYLO out of the zip you downloaded, here is my command writing out the list of SNPs within PHYLO:

You can see from reading this that this data set has ~99,000 SNPs. I pruned it so that it ran quicker for phylogenetic analysis. This is more than a sufficient number for most analysis. What you want to do next is create a copy of your own data which doesn’t have so many SNPs, so you can merge them well. Because I created this data set I can tell you that all of the above SNPs are probably in your 23andMe file. With the commands above there is a file, plink.snplist, which you will use to filter your data set.

Here’s how to do it:

Now we’ve got it ready to merge. I will warn you that this takes forever on Windows! No idea why. Also, Windows tends to do strange things with the file extensions. If Plink tells you that a .fam does not exist, look to the file extension. If you label something as something.fam, it might actually be something.fam.fam. In any case, here’s how you merge:

This is going to give you lots of warnings. Often this won’t matter, but sometimes it will tell you that you might need to “flip” one of the files. Try flipping it. If it still doesn’t work I would remove the SNPs causing problems. Something like this:

Honestly you might have to do a lot of things to get data sets to merge. But this particular combination of 23andMe genotype and PHYLO shouldn’t be too bad. Let’s assume that your merge worked. What do you want to do? One thing that might be interesting is an MDS plot (it’s like a PCA plot).

First you run the genome command, which takes forever to finish. It might be best if you did this before you go to sleep, and just check in in the morning. The genome command will produce an output that you’ll use next.

Notice the input file. That was generated in the previous step. The value 6 is a parameter that defines how many dimensions you want to output. My experience with this is that it doesn’t take too long, so I go for 6 at least. The final result of this is that you have a plink.mds file with an ordered list of family and individual ids, along with positions for 6 dimensions. It should be straightforward to import this into Excel, and then plot your MDS, emphasizing your own position. Since I can no longer use Excel I couldn’t be bothered to figure out how to plot my own position, but the distribution should be familiar.
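If the spreadsheet import gives you trouble, the .mds file can be turned into a comma-separated file first. This is a minimal sketch, assuming the file is whitespace-delimited with a header row of the form FID IID SOL C1 C2 and so on, where the C columns are the dimensions. The function names are hypothetical and the sample rows are made up.

```python
import csv
import io


def parse_mds(lines):
    """Parse whitespace-delimited MDS output into (header, list of row dicts)."""
    rows = [line.split() for line in lines if line.strip()]
    header = rows[0]
    records = []
    for fields in rows[1:]:
        record = dict(zip(header, fields))
        for name in header:
            if name.startswith("C"):
                record[name] = float(record[name])  # dimension columns C1, C2, ...
        records.append(record)
    return header, records


def mds_to_csv(lines):
    """Return the MDS table as CSV text that a spreadsheet can open directly."""
    header, records = parse_mds(lines)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up rows in the assumed layout.
sample = [
    "    FID    IID    SOL      C1       C2",
    "    Dai     D1      0     0.1    -0.02",
    "     me     RF      0    0.05     0.01",
]
print(mds_to_csv(sample))
```

With a real file you would pass open("plink.mds") in place of the sample list, then filter on the IID column to pick out your own row for highlighting.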

That’s about it for now. I’ll put up another post focusing less on phylogenetics, using the HapMap data set that I provided. I don’t know if I can continue to do this in Windows, but hopefully this illustrated how easy (if tedious) most of this is.

January 5, 2013

Why the future won’t be genetically homogeneous

While reading The Founders of Evolutionary Genetics I encountered a chapter where the late James F. Crow admitted that he had a new insight every time he reread R. A. Fisher’s The Genetical Theory of Natural Selection. This prompted me to put down The Founders of Evolutionary Genetics after finishing Crow’s chapter and pick up my copy of The Genetical Theory of Natural Selection. I’ve read it before, but this is as good a time as any to give it another crack.

Almost immediately Fisher aims at one of the major conundrums of 19th century theory of Darwinian evolution: how was variation maintained? The logic and conclusions strike you like a hammer. Charles Darwin and most of his contemporaries held to a blending model of inheritance, where offspring reflect a synthesis of their parental values. As it happens this aligns well with human intuition. Across their traits offspring are a synthesis of their parents. But blending presents a major problem for Darwin’s theory of adaptation via natural selection, because it erodes the variation which is the raw material upon which selection must act. It is a famously peculiar fact that the abstraction of the gene was formulated over 50 years before the concrete physical embodiment of the gene, DNA, was ascertained with any confidence. In the first chapter of The Genetical Theory R. A. Fisher suggests that the logical reality of persistent copious heritable variation all around us should have forced scholars to the inference that inheritance proceeded via particulate and discrete means, as these processes do not diminish variation indefinitely in the manner which is entailed by blending.

More formally the genetic variance decreases by a factor of 1/2 every generation in a blending model. This is easy enough to understand. But I wanted to illustrate it myself, so I slapped together a short simulation script. The specifications are as follows:

1) Fixed population size, in this case 100 individuals

2) 100 generations

3) All individuals have 2 offspring, and mating is random (no consideration of sex)

4) The offspring trait value is the mid-parent value of the parents, though I also include a “noise” parameter in some of the runs, so that the outcome deviates somewhat in a random fashion from expected parental values

In terms of the data structure the ultimate outcome is a 100 ✕ 100 matrix, with rows corresponding to generations, and each cell an individual in that generation. The values in each cell span the range from 0 to 1. In the first generation I imagine the combining of two populations with totally different phenotypic values; 50 individuals coded 1 and 50 individuals coded 0. If a 1 and 1 mate, they produce only 1’s. Likewise with 0’s. On the other hand a 0 and a 1 produce a 0.5. And so forth. The mating is random in each generation.
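The specifications above can be sketched in a few lines. This is not the original script, just a minimal reconstruction under stated assumptions: individuals are shuffled and paired off each generation, each pair leaves two offspring, the optional noise is Gaussian with standard deviation noise_sd, and values are clamped to the 0 to 1 range. The function name and parameters are hypothetical.

```python
import random
from statistics import pvariance


def simulate_blending(pop_size=100, generations=100, noise_sd=0.0, seed=1):
    """Return a generations x pop_size matrix of trait values under blending."""
    if pop_size % 2:
        raise ValueError("pop_size must be even so everyone can be paired")
    rng = random.Random(seed)
    # Generation 0: half the population coded 1, half coded 0.
    current = [1.0] * (pop_size // 2) + [0.0] * (pop_size // 2)
    history = [current]
    for _ in range(generations - 1):
        parents = current[:]
        rng.shuffle(parents)  # random mating, no consideration of sex
        offspring = []
        for i in range(0, pop_size, 2):
            mid_parent = (parents[i] + parents[i + 1]) / 2.0
            for _ in range(2):  # every pair leaves two offspring
                value = mid_parent
                if noise_sd > 0.0:
                    value += rng.gauss(0.0, noise_sd)
                # Assumption: keep trait values inside the 0 to 1 interval.
                offspring.append(min(1.0, max(0.0, value)))
        current = offspring
        history.append(current)
    return history


plain = simulate_blending()
noisy = simulate_blending(noise_sd=0.05)
for generation in (0, 1, 2, 5, 10):
    print(generation, pvariance(plain[generation]), pvariance(noisy[generation]))
```

Each row of the returned matrix is one generation, so the per-generation variance is just pvariance of that row.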

The figure to the left illustrates the decay in the variance of the trait value over generation time in different models. The red line is the idealized decay: 1/2 decrease in variance per generation. The blue line is one simulation. It roughly follows the decay pattern, though it deviates somewhat, apparently because some assortative mating occurred by chance (presumably if I used many more individuals it would converge upon the analytic curve). Finally you see one line which follows the trajectory of a simulation with noise. Though this population follows the theoretical decay more closely initially, it converges upon a different equilibrium value, one where some variance remains. That’s because the noise parameter continues to inject variance every generation. The relevant point is that most of the variation disappears in fewer than 5 generations, and it is basically gone by the 10th generation. To maintain variation in a blending inheritance model requires a great deal of mutation, the extent of which is just not plausible.
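The halving itself takes one line to derive. As a sketch, assume the two parents' trait values $x$ and $y$ are independent draws from a generation with variance $V_t$, and that any noise $\varepsilon$ is added independently with variance $\sigma^2$ (these symbols are introduced here for illustration and do not appear in the original).

```latex
\begin{align*}
z &= \frac{x + y}{2}
  &&\Rightarrow&
  V_{t+1} &= \operatorname{Var}(z)
           = \tfrac{1}{4}\bigl(\operatorname{Var}(x) + \operatorname{Var}(y)\bigr)
           = \tfrac{1}{2} V_t, \\
V_t &= V_0 \left(\tfrac{1}{2}\right)^{t}
  &&\Rightarrow&
  V_5 &= \tfrac{0.25}{32} \approx 0.0078, \qquad
  V_{10} = \tfrac{0.25}{1024} \approx 0.00024, \\
z &= \frac{x + y}{2} + \varepsilon
  &&\Rightarrow&
  V_{t+1} &= \tfrac{1}{2} V_t + \sigma^2,
  \qquad V^{*} = 2\sigma^{2}.
\end{align*}
```

Here $V_0 = 0.25$ is the variance of the starting population of 50 ones and 50 zeros. The last line shows why a run with noise settles at a nonzero equilibrium rather than decaying to zero.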

To get a different sense of what occurred in these two particular simulations, here are heat maps. Values in the interval from 0 to 1 are now shown as shading in each cell. I am displaying only 50 generations here. The top panel is one without noise, while the bottom panel has the noise parameter.

The contrast with a Mendelian model is striking. Imagine that 0 and 1 are now coded by two homozygote genotypes, with heterozygotes exhibiting a value of 0.5. If all the variation is controlled by the genotypes, then you have three genotypes, and three trait values. If I change the scenario above to a Mendelian one then variance will initially decrease, but the equilibrium will be maintained at a much higher level, as 50% of the population will be heterozygotes (0.5), and the other 50% homozygotes, split evenly between the two varieties (0 and 1). With the persistence of heritable variation natural selection can operate to change the allele frequencies over time without the worry that the trait values within a breeding population will converge upon each other too rapidly. This is true even in cases of polygenic traits. Height and I.Q. remain variant, because they are fundamentally heritable through discrete and digital processes.
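The Mendelian scenario can be sketched the same way as the blending one. This is a minimal illustration under stated assumptions: a single locus with two alleles coded 0 and 1, a trait value equal to the average of an individual's two alleles, shuffled pairs leaving two offspring each, and each offspring receiving one randomly chosen allele from each parent. The function name is hypothetical, and a larger population than 100 is used in the example so that drift does not obscure the pattern.

```python
import random
from statistics import pvariance


def simulate_mendelian(pop_size=100, generations=100, seed=1):
    """Return the trait variance in each generation under one-locus inheritance."""
    if pop_size % 2:
        raise ValueError("pop_size must be even so everyone can be paired")
    rng = random.Random(seed)
    # Generation 0: half are 1/1 homozygotes, half are 0/0 homozygotes.
    current = [(1, 1)] * (pop_size // 2) + [(0, 0)] * (pop_size // 2)
    variances = []
    for _ in range(generations):
        variances.append(pvariance([(a + b) / 2.0 for a, b in current]))
        parents = current[:]
        rng.shuffle(parents)  # random mating, no consideration of sex
        offspring = []
        for i in range(0, pop_size, 2):
            first, second = parents[i], parents[i + 1]
            for _ in range(2):
                # Each parent passes on one of its two alleles at random.
                offspring.append((rng.choice(first), rng.choice(second)))
        current = offspring
    return variances


variances = simulate_mendelian(pop_size=2000, generations=20)
for generation in (0, 1, 2, 10, 19):
    print(generation, round(variances[generation], 4))
```

The variance drops from 0.25 toward roughly 0.125 after the first round of mating (a quarter of the population at each of 0 and 1, half at 0.5) and then stays in that neighborhood instead of continuing to halve.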

All this is of course why the “blond gene” won’t disappear, redheads won’t go extinct, nor will humans converge upon a uniform olive shade in a panmictic future. A child is a genetic cross between parents, but only between 50% of each parent’s genetic makeup. And that is one reason they are not simply an “averaging” of parental trait values.

January 4, 2013

Mitochondrial Eve: a de facto deception?

The above image, and the one to the left, are screenshots from my father’s 23andMe profile. Interestingly, his mtDNA haplogroup is not particularly common among ethnic Bengalis, who are more than ~80% on a branch of M. This reality is clear in the map above which illustrates the Central Asian distribution of my father’s mtDNA lineage. In contrast, his whole genome is predominantly South Asian, as is evident in the estimate that 23andMe provided via their ancestry composition feature, which utilizes the broader genome. The key takeaway here is that the mtDNA is informative, but it should not be considered to be representative, or anything like the last word on one’s ancestry in this day and age.


As a matter of historical record mtDNA looms large in human population genetics and phylogeography for understandable reasons. Mitochondria produce more genetic material than is found in the nucleus, and so were the lowest hanging fruit in the pre-PCR era. Additionally, because mtDNA lineages do not recombine they are well suited to a coalescent framework, where an idealized inverted treelike phylogeny converges upon a common ancestor. Finally, mtDNA was presumed to be neutral, so reflective of demographic events unperturbed by adaptation, and characterized by a high mutation rate, yielding a great amount of variation with which to differentiate the branches of the human family tree.

Many of these assumptions are now disputable. But that’s not the point of this post. In the age of dense 1 million marker SNP-chips why are we still focusing on the history of one particular genetic region? In a word: myth. Eve, the primal woman. The “mother of us all,” who even makes cameos in science fiction finales!

In 1987 a paper was published which found that Africans harbored the greatest proportion of mtDNA variation among human populations. Additionally, these lineages coalesced back to a common ancestor on the order of 150,000 years ago. Since mtDNA is present in humans, there was a human alive 150,000 years ago who carried this ancestral lineage, from which all modern lineages derive. Mitochondrial DNA is passed from mothers to their offspring, so this individual must have been a woman. In the press she was labeled Eve, for obvious reasons. The scientific publicity resulted in a rather strange popular reaction, culminating in a Newsweek cover where Adam and Eve are depicted as naked extras from Eddie Murphy’s Coming to America film.

The problem is that people routinely believe that mtDNA Eve was the only ancestress of all modern humans from the period in which she lived. Why they believe this is common sense, and requires no great consideration. The reality is that the story being told by science is the story of mtDNA, with inferences about the populations which serve as hosts for mtDNA being incidental. These inferences need to be made cautiously and with care. It is basic logic that a phylogeny will coalesce back to a common ancestor at some point. Genetic lineages over time go extinct, and so most mtDNA lineages from the time of Eve went extinct. There were many women who were alive during the same time as Eve, who contributed at least as much, perhaps more, to the genetic character of modern humans today. All we can say definitively is that their mtDNA lineage is no longer present. As mtDNA is passed from mother to daughter (males obviously have mtDNA, but we are dead ends, and pass it to no one), all one needs for a woman’s mtDNA lineage to go extinct is for her to have only sons. Though she leaves no imprint on the mtDNA phylogeny, obviously her sons may contribute genes to future generations.
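The inevitability of coalescence is easy to see in a toy model. The sketch below assumes a constant number of women per generation, with every daughter drawing her mother uniformly at random from the previous generation (a Wright-Fisher style assumption that is not from the original post). The function name is hypothetical. It tracks only maternal lineages, so it says nothing about the nuclear genes that the other founding women continue to pass on through sons and daughters alike.

```python
import random


def surviving_lineages(n_women=50, seed=1, max_generations=100_000):
    """Return the number of distinct founding mtDNA lineages in each generation.

    Every founder starts with her own lineage label; the simulation stops once
    a single label remains (or after max_generations, as a safety cap).
    """
    rng = random.Random(seed)
    lineages = list(range(n_women))
    counts = [len(set(lineages))]
    while counts[-1] > 1 and len(counts) <= max_generations:
        # Each woman in the new generation inherits her mother's label.
        lineages = [rng.choice(lineages) for _ in range(n_women)]
        counts.append(len(set(lineages)))
    return counts


counts = surviving_lineages()
print("founding lineages:", counts[0])
print("generations until one lineage remains:", len(counts) - 1)
```

The count of surviving lineages can only fall or stay level, so sooner or later one founder's label is the only one left. She is the "Eve" of that toy population, even though she was one of 50 women alive at the start.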

Prior to ancient DNA and the proliferation of dense SNP data sets scholars were a bit too ambitious about what they believed they could infer from mtDNA and Y lineages (e.g., The Real Eve: Modern Man’s Journey Out of Africa). We are in a different time now; inferences made about the past rest on more than one leg. But the legend of Eve of the mtDNA persists, not because of its compelling scientific nature, but because this is a case where science piggy-backs upon prior conceptual furniture. This yields storytelling power, but a story which is based on a thin basis of fact becomes just another tall tale.

All this is on my mind because one of the scientists involved with Britain’s DNA, Jim Wilson, has penned a response to Vincent Plagnol’s Exaggerations and errors in the promotion of genetic ancestry testing (see here for more on this controversy). Overall I don’t find Wilson’s rebuttal too persuasive. It is well written, but it has the air of sophistry and lawyerly precision. I have appreciated Wilson’s science before, so I am not casting aspersions at his professional competence. Rather, some of the more enthusiastic and uninformed spokespersons for his firm have placed him in a delicate and indefensible situation, and he is gamely attempting to salvage the best of a bad hand. Importantly, he does not reassure me in the least that his firm did not use Britain’s atrocious libel laws as a threat to mute forceful criticism of their business model on scientific grounds. A more general issue here is that Wilson is in a situation where he must not damage the prospects of his firm, all the while maintaining his integrity as a scientist. From what I have seen once science becomes a business one must abandon the pretense of being a scientist first and foremost, no matter how profitable that aura of objectivity may be. The nature of marketing is such that the necessary caution and qualification essential for science becomes a major liability in the process of communicating. It’s about selling, not convincing.

Going back to Eve, Wilson marshals a very strange argument:

“The claim that Adam and Eve really existed, as you suggest, refers to the most recent common ancestors of the mtDNA and non-recombining part of the Y chromosome. I don’t agree that there is nothing special about these individuals: there must have been a reason why mitochondrial Eve was on the front cover of Time magazine in the late 80s!….”

A minor quibble, but I suspect he means the Newsweek cover. More seriously, this line of argumentation is bizarre on scientific grounds. Rather, it is a tack which is more rational when aiming toward a general audience which might purchase a kit which they believe might tell them of their relationship to “Eve.”

In the wake of the discussion at Genomes Unzipped I participated in further exchanges with Graham Coop and Aylwyn Scally on Twitter, and decided to spend 20 minutes this afternoon asking people what they thought about mitochondrial Eve. By “people,” I mean individuals who are pursuing graduate educations in fields such as genetics and forensics. My cursory “field research” left me very alarmed. Naturally these were individuals who did not make elementary mistakes in regards to the concept, but there was great confusion. I can only wonder what’s going through the minds of the public.

Analogies, allusions, and equivalences are useful when they leverage categories and concepts which we are solidly rooted in, and transpose them upon a foreign cognitive landscape. By pointing to similarities of structure and relation one can understand more fully the novel ground which one is exploring. Saying that the president of India is analogous to the queen of England is an informative analogy. These are both positions where the individual is a largely ceremonial head of state. In contrast, the president of the United States and the queen of England are very different figures, because the American executive is not ceremonial at all. This is not a useful analogy, even though superficially it sees no lexical shift.

Who was Eve? A plain reading is that she is the ancestor of all humans, and more importantly, the singular ancestress of all humans back to the dawn of time. This is a concept which the public grasps intuitively. Who is mtDNA Eve? A woman who flourished 150,000 years ago, who happened to carry the mtDNA lineage which would drift to fixation in the ancestors of modern humans. I think this is a very different thing indeed. For purposes of poetry and marketing the utilization of the name Eve is justifiable. But on scientific grounds all it does is confuse, obfuscate, and mislead.

The fiasco that Vincent Plagnol stumbled upon is just a symptom of a broader problem. Scientists need to engage in massive conceptual clean up, as catchy phrases such as “mitochondrial Eve” and “Y Adam” have permeated the culture over the past generation, and misled many sincere and engaged seekers of truth. This is of the essence because personal genomics, and the scientific understanding of genealogy, are now moving out of the ghetto of hobbyists, enthusiasts, and researchers. Though I doubt this industry will be massive, it will be ubiquitous, and a seamless part of our information portfolio. If people still have ideas like mitochondrial Eve in their head it is likely to cloud their perception of the utility of the tools at hand, and their broader significance.

December 21, 2012

The causes of evolutionary genetics

A few days ago I was browsing Haldane’s Sieve, when I stumbled upon an amusing discussion which arose on its “About” page. This “inside baseball” banter got me to thinking about my own intellectual evolution. Over the past few years I’ve been delving more deeply into phylogenetics and phylogeography, enabled by the rise of genomics, the proliferation of ‘big data,’ and accessible software packages. This entailed an opportunity cost. I did not spend much time on classical population and evolutionary genetic questions. Strewn about my room are various textbooks and monographs I’ve collected over the years, and which have fed my intellectual growth. But I must admit that it is a rare day now that I browse Hartl and Clark or The Genetical Theory of Natural Selection without specific aim or mercenary intent.

R. A. Fisher

Like a river inexorably coursing over a floodplain, with the turning of the new year it is now time to take a great bend, and double back to my roots, such as they are. This is one reason that I am now reading The Founders of Evolutionary Genetics. Fisher, Wright, and Haldane are like old friends, faded, but not forgotten, while Muller was always but a passing acquaintance. But ideas 100 years old still have power to drive us to explore deep questions which remain unresolved, but where new methods and techniques may shed greater light. A study of the past does not allow us to make wise choices which can determine the future with any certitude, but it may at least increase the luminosity of the tools which we have to illuminate the depths of the darkness. The shape of nature may become just a bit less opaque through our various endeavors.

Figure from “Directional Positive Selection on an Allele of Arbitrary Dominance”, Teshima KM, Przeworski M

So what of this sieve of Haldane? As noted at Haldane’s Sieve, the concept is simple. Imagine two mutations, one which expresses a trait in a recessive fashion, and another in a dominant one. The sieve operates by favoring the emergence of dominantly expressing variants out of the low-frequency zone where stochastic forces predominate (i.e., even if an allele confers a large fitness benefit, at low frequencies the power of random chance may still imply that it is highly likely to go extinct). An example of this would be lactase persistence, which in the modal Eurasian variant seems to exhibit dominance. In the converse case, beneficial mutations that are recessive in expression suffer from a structural problem: their benefit is more theoretical than realized.
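The sieve can be illustrated with a toy simulation. What follows is a minimal sketch, not anything taken from Haldane’s Sieve or from the paper behind the figure above. It assumes a Wright-Fisher population of 50 diploids, a selection coefficient s = 0.1, and a dominance coefficient h (h = 1 for a fully dominant beneficial allele, h = 0 for a fully recessive one). All function names and parameter values are hypothetical choices made for illustration.

```python
import random


def freq_after_selection(q, s, h):
    """Expected frequency of the favored allele after one round of selection.

    Assumed relative fitnesses (illustrative): 1 + s for two copies,
    1 + h*s for one copy, 1 for zero copies.
    """
    p = 1.0 - q
    w_hom, w_het, w_wt = 1.0 + s, 1.0 + h * s, 1.0
    mean_w = q * q * w_hom + 2 * p * q * w_het + p * p * w_wt
    return (q * q * w_hom + p * q * w_het) / mean_w


def mutant_fixes(n_diploid, s, h, rng):
    """Follow a single new copy until it is lost or fixed (Wright-Fisher drift)."""
    total = 2 * n_diploid
    copies = 1
    while 0 < copies < total:
        q_sel = freq_after_selection(copies / total, s, h)
        # Binomial sampling of the next generation's gene copies.
        copies = sum(1 for _ in range(total) if rng.random() < q_sel)
    return copies == total


def fixation_fraction(h, trials=1000, n_diploid=50, s=0.1, seed=1):
    """Fraction of independent new mutants that reach fixation."""
    rng = random.Random(seed)
    fixed = sum(mutant_fixes(n_diploid, s, h, rng) for _ in range(trials))
    return fixed / trials


dominant = fixation_fraction(h=1.0)
recessive = fixation_fraction(h=0.0)
print(f"dominant mutant fixed in  {dominant:.1%} of trials")
print(f"recessive mutant fixed in {recessive:.1%} of trials")
```

With the same fitness benefit in both cases, the dominant mutant should escape the low-frequency zone far more often than the recessive one, which is the sieve in action. The exact fractions depend on the assumed population size, selection coefficient, and seed.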

The mathematics of this is exceedingly simple, a consequence of the Hardy-Weinberg dynamics of diploid, randomly mating organisms. Let’s use the gene which is implicated in variation in lactase persistence as an example, LCT. Consider two alleles, LP and LNP, where the former confers persistence (one can digest lactose sugar as an adult), and the latter manifests the conventional mammalian ‘wild type’ (the production of lactase ceases as one leaves the life stage when nursing is feasible). LP is clearly the novel mutant. In a small population it is not unimaginable that by random chance the frequency of LP rises to ~10%. What now? At Hardy-Weinberg equilibrium (HWE) you have:

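p^2 + 2pq + q^2 = 1, where q is the frequency of the LP allele and p = 1 - q is the frequency of the LNP allele. At ~10% the numbers substituted would be:

(0.90)^2 + 2(0.90)(0.10) + (0.10)^2 = 0.81 + 0.18 + 0.01 = 1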

This is where dominance or recessive expression is highly relevant. The reality is that LP is a dominant trait. So in this population the frequency of LP as a trait would be:

(0.10)^2 + 2(0.90)(0.10) = 0.01 + 0.18 = 0.19, or 19%

Now imagine a model where LP is favored, but it expresses in a recessive fashion. Then the frequency of the trait would equal q^2, the homozygote LP-allele proportion. That is, 1%. Though population genetics is often constructed on an algebraic foundation, the results lend themselves to intuition. A structural parameter endogenous to the genetic system, dominant or recessive expression, can have longstanding consequences in terms of the likely trajectory of the alleles. Selection only “sees” the trait, so a recessive trait with sterling qualities may as well be a trait with no qualities. In contrast, a dominantly expressed allele can cut like a scythe through a population, because every copy “counts.”
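The arithmetic above is easy to check for any allele frequency. Here is a small sketch that assumes Hardy-Weinberg proportions hold exactly; the function names are hypothetical and exist only for this illustration.

```python
def genotype_frequencies(q):
    """Hardy-Weinberg genotype frequencies for allele frequency q.

    Returns (LNP/LNP, LNP/LP, LP/LP), i.e. (p^2, 2pq, q^2) with p = 1 - q.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    p = 1.0 - q
    return p * p, 2 * p * q, q * q


def trait_frequency(q, dominant):
    """Fraction of individuals expressing the trait carried by the q allele."""
    _, het, hom = genotype_frequencies(q)
    # A dominant trait shows in heterozygotes too; a recessive one does not.
    return hom + het if dominant else hom


q = 0.10
print(f"dominant LP:  {trait_frequency(q, dominant=True):.0%} express the trait")
print(f"recessive LP: {trait_frequency(q, dominant=False):.0%} express the trait")
```

At q = 0.10 this gives 19% for the dominant case and 1% for the recessive one, matching the figures in the text. Lowering q widens the gap further, since the recessive trait frequency shrinks as the square of q.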

In preparation for this post I revisited the section on Haldane’s Sieve in the encyclopedic Elements of Evolutionary Genetics. The authors note that this phenomenon, though of vintage character as these things can be reckoned in a field as young as evolutionary genetics, is still a live one. The dominance of favored mutations in wild populations, or the recessive character of deleterious ones in laboratory stock, may reflect the different regimes which these two gene pools are subject to. The nature of things is such that it is easier to generate recessive mutations than dominant ones (i.e., loss is easier than gain), so the preponderance of dominant variants in wild stocks subject to positive selective pressure lends credence to the idea that evolutionary rather than developmental forces and constraints shape the genetic character of many species.

And yet things are not quite so tidy. Haldane’s Sieve, and the framework of dominant versus recessive alleles, operates differently in the area of sex chromosomes. In many lineages there is a ‘heterogametic sex’ which carries only one functional copy of most sex-linked genes. In mammals this is the male (XY), while in birds this is the female (ZW). As males have only one functional copy of most genes on the sex chromosome, the masking effect of recessive expression does not apply to them in mammals. This may imply that because of the exposure of many deleterious recessive variants to natural selection within the heterogametic sex one would see different allelic distributions and genetic landscapes on these chromosomes (e.g., more rapid adaptation because of the exposure of nominally recessive alleles in the heterogametic sex, as well as more purifying selection on deleterious variants). But the reality is more complex, and the literature in this area is somewhat muddled. More precisely, it seems phylogenetically sensitive. Validation of the theory in mammals founders once one moves to Drosophila.
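The loss of masking in the heterogametic sex can be put in the same numerical terms as the autosomal case. This is a simplified sketch, assuming a recessive allele on the X (or Z) chromosome at frequency q, equal allele frequencies in both sexes, and Hardy-Weinberg proportions in the homogametic sex. The function name is hypothetical.

```python
def exposed_fraction(q, hemizygous):
    """Fraction of individuals in which a recessive sex-linked allele is expressed.

    hemizygous=True  : heterogametic sex (one copy, so no masking): q
    hemizygous=False : homogametic sex (two copies, HWE assumed):   q^2
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    return q if hemizygous else q * q


q = 0.10
print(f"heterogametic sex: {exposed_fraction(q, hemizygous=True):.0%}")
print(f"homogametic sex:   {exposed_fraction(q, hemizygous=False):.0%}")
```

Under these assumptions, a recessive allele at 10% frequency is visible to selection in 10% of the heterogametic sex but only 1% of the homogametic sex. This is the exposure effect described above, and it says nothing about the messier empirical picture.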

And that is why research in evolutionary genetics continues. The theory stimulates empirical exploration, and is tested against it. Much of the formal theory of classical evolutionary genetics, which crystallized in the years before World War II, is now gaining renewed relevance because of empirical testability in the era of big data and big computation. This is a domain where the past is not simply of interest to historians. Scientists themselves, chasing the next grant, and producing the expected stream of publications, may benefit from a little historical perspective by standing upon the shoulders of giants.
