Razib Khan One-stop-shopping for all of my content

January 5, 2013

Why the future won’t be genetically homogeneous

While reading The Founders of Evolutionary Genetics I encountered a chapter where the late James F. Crow admitted that he had a new insight every time he reread R. A. Fisher’s The Genetical Theory of Natural Selection. This prompted me to put down The Founders of Evolutionary Genetics after finishing Crow’s chapter and pick up my copy of The Genetical Theory of Natural Selection. I’ve read it before, but this is as good a time as any to give it another crack.

Almost immediately Fisher aims at one of the major conundrums of 19th century theory of Darwinian evolution: how was variation maintained? The logic and conclusions strike you like a hammer. Charles Darwin and most of his contemporaries held to a blending model of inheritance, where offspring reflect a synthesis of their parental values. As it happens this aligns well with human intuition. Across their traits offspring are a synthesis of their parents. But blending presents a major problem for Darwin’s theory of adaptation via natural selection, because it erodes the variation which is the raw material upon which selection must act. It is a famously peculiar fact that the abstraction of the gene was formulated over 50 years before the concrete physical embodiment of the gene, DNA, was ascertained with any confidence. In the first chapter of The Genetical Theory R. A. Fisher suggests that the logical reality of persistent copious heritable variation all around us should have forced scholars to the inference that inheritance proceeded via particulate and discrete means, as these processes do not diminish variation indefinitely in the manner which is entailed by blending.

More formally the genetic variance decreases by a factor of 1/2 every generation in a blending model. This is easy enough to understand. But I wanted to illustrate it myself, so I slapped together a short simulation script. The specifications are as follows:

1) Fixed population size, in this case 100 individuals

2) 100 generations

3) All individuals have 2 offspring, and mating is random (no consideration of sex)

4) The offspring trait value is the mid-parent value of the parents, though I also include a “noise” parameter in some of the runs, so that the outcome deviates somewhat at random from the expected mid-parent value

In terms of the data structure the ultimate outcome is a 100 ✕ 100 matrix, with rows corresponding to generations, and each cell an individual in that generation. The values in each cell span the range from 0 to 1. In the first generation I imagine the combining of two populations with totally different phenotypic values; 50 individuals coded 1 and 50 individuals coded 0. If a 1 and a 1 mate, they produce only 1’s. Likewise with 0’s. On the other hand a 0 and a 1 produce a 0.5. And so forth. The mating is random in each generation.
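A minimal sketch of a simulation along these lines, in Python. This is not the original script: the function name `simulate_blending`, the `noise_sd` parameter, Gaussian noise, and pairing by shuffling the population are all assumptions made for illustration.

```python
import random
import statistics


def simulate_blending(pop_size=100, generations=100, noise_sd=0.0, seed=0):
    """Return one list of trait values per generation under blending inheritance."""
    if pop_size % 2:
        raise ValueError("pop_size must be even so everyone can be paired")
    rng = random.Random(seed)
    # Generation 0: two founding populations with trait values 1 and 0.
    current = [1.0] * (pop_size // 2) + [0.0] * (pop_size // 2)
    history = [current]
    for _ in range(generations - 1):
        # Random mating, ignoring sex: shuffle, then pair off neighbours.
        parents = current[:]
        rng.shuffle(parents)
        offspring = []
        for i in range(0, pop_size, 2):
            midparent = (parents[i] + parents[i + 1]) / 2
            # Each couple leaves two offspring, so population size stays fixed.
            for _ in range(2):
                # Assumed noise model: Gaussian deviation around the mid-parent value.
                offspring.append(midparent + rng.gauss(0.0, noise_sd))
        current = offspring
        history.append(current)
    return history


# Rows are generations, columns are individuals, as in the matrix described above.
history = simulate_blending()
variances = [statistics.pvariance(generation) for generation in history]
```

Plotting `variances` against generation number, with and without a nonzero `noise_sd`, would give curves of the kind discussed in the text. Note that with noise the values are not clipped to the 0 to 1 interval here.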

The figure to the left illustrates the decay in the variance of the trait value over generation time in different models. The red line is the idealized decay: 1/2 decrease in variance per generation. The blue line is one simulation. It roughly follows the decay pattern, though it deviates somewhat, apparently because some assortative mating occurred by chance (presumably if I used many more individuals it would converge upon the analytic curve). Finally you see one line which follows the trajectory of a simulation with noise. Though this population follows the theoretical decay more closely initially, it converges upon a different equilibrium value, one where some variance remains. That’s because the noise parameter continues to inject variance every generation. The relevant point is that most of the variation disappears in fewer than 5 generations, and it is basically gone by the 10th generation. To maintain variation in a blending inheritance model requires a great deal of mutation, the extent of which is just not plausible.
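The halving, and the nonzero equilibrium with noise, can be written out in a few lines. This is a sketch under stated assumptions: the two parental values X and Y are treated as independent draws from a population with variance V_t (random mating in a large population), and the noise term has variance \sigma_\varepsilon^2 and is independent of the parents. The symbols are introduced here for illustration and are not from the original post.

```latex
\begin{align}
Z &= \tfrac{1}{2}(X + Y) + \varepsilon \\
V_{t+1} = \operatorname{Var}(Z)
  &= \tfrac{1}{4}\left[\operatorname{Var}(X) + \operatorname{Var}(Y)\right] + \sigma_\varepsilon^2
   = \tfrac{1}{2} V_t + \sigma_\varepsilon^2 \\
\sigma_\varepsilon^2 = 0 &\;\Longrightarrow\; V_t = \left(\tfrac{1}{2}\right)^{t} V_0 \\
V^{*} = \tfrac{1}{2} V^{*} + \sigma_\varepsilon^2 &\;\Longrightarrow\; V^{*} = 2\sigma_\varepsilon^2
\end{align}
```

Starting from V_0 = 0.25 (half the population at 0, half at 1) and no noise, this gives 0.25/32 ≈ 0.008 after 5 generations and 0.25/1024 ≈ 0.0002 after 10, which is why the variation is essentially gone by then.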

To get a different sense of what occurred in these two particular simulations, here are heat maps. Values in the interval from 0 to 1 are now shown as shading in each cell. I am displaying only 50 generations here. The top panel is one without noise, while the bottom panel has the noise parameter.

The contrast with a Mendelian model is striking. Imagine that 0 and 1 are now coded by two homozygote genotypes, with heterozygotes exhibiting a value of 0.5. If all the variation is controlled by the genotypes, then you have three genotypes, and three trait values. If I change the scenario above to a Mendelian one then variance will initially decrease, but the equilibrium will be maintained at a much higher level, as 50% of the population will be heterozygotes (0.5), and the other 50% homozygotes, split evenly between the two varieties (0 and 1). With the persistence of heritable variation natural selection can operate to change the allele frequencies over time without the worry that the trait values within a breeding population will converge upon each other too rapidly. This is true even in cases of polygenic traits. Height and I.Q. remain variant, because they are fundamentally heritable through discrete and digital processes.
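The Mendelian case can be checked with a short deterministic calculation. This is a sketch, assuming a single locus with two alleles, random mating, an effectively infinite population (no drift), and no selection or mutation; the function names are hypothetical.

```python
def genotype_frequencies(p):
    """Hardy-Weinberg genotype frequencies, keyed by trait value (0, 0.5, 1).

    p is the frequency of the allele carried by the '1' homozygotes.
    """
    q = 1.0 - p
    return {0.0: q * q, 0.5: 2 * p * q, 1.0: p * p}


def trait_variance(freqs):
    """Population variance of the trait given genotype frequencies."""
    mean = sum(value * f for value, f in freqs.items())
    return sum(f * (value - mean) ** 2 for value, f in freqs.items())


def allele_frequency(freqs):
    """Frequency of the '1' allele: all of the 1 homozygotes, half of the heterozygotes."""
    return freqs[1.0] + 0.5 * freqs[0.5]


# Founders: half the population is homozygous 0, half homozygous 1, no heterozygotes.
founders = {0.0: 0.5, 0.5: 0.0, 1.0: 0.5}
variances = [trait_variance(founders)]
p = allele_frequency(founders)
for _ in range(10):
    freqs = genotype_frequencies(p)  # one round of random mating
    variances.append(trait_variance(freqs))
    p = allele_frequency(freqs)      # unchanged without selection or drift
```

Under these assumptions the variance drops from 0.25 to 0.125 after the first round of random mating and then stays there, in contrast to the blending model, where it keeps halving.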

All this is of course why the “blond gene” won’t disappear, redheads won’t go extinct, nor will humans converge upon a uniform olive shade in a panmictic future. A child is a genetic cross between parents, but only between 50% of each parent’s genetic makeup. And that is one reason they are not simply an “averaging” of parental trait values.

January 4, 2013

Mitochondrial Eve: a de facto deception?

The above image, and the one to the left, are screenshots from my father’s 23andMe profile. Interestingly, his mtDNA haplogroup is not particularly common among ethnic Bengalis, who are more than ~80% on a branch of M. This reality is clear in the map above, which illustrates the Central Asian distribution of my father’s mtDNA lineage. In contrast, his whole genome is predominantly South Asian, as is evident in the estimate that 23andMe provided via their ancestry composition feature, which utilizes the broader genome. The key takeaway here is that the mtDNA is informative, but it should not be considered to be representative, or anything like the last word on one’s ancestry in this day and age.


As a matter of historical record mtDNA looms large in human population genetics and phylogeography for understandable reasons. Mitochondria produce more genetic material than is found in the nucleus, and so were the lowest hanging fruit in the pre-PCR era. Additionally, because mtDNA lineages do not recombine they are well suited to a coalescent framework, where an idealized inverted treelike phylogeny converges upon a common ancestor. Finally, mtDNA was presumed to be neutral, so reflective of demographic events unperturbed by adaptation, and characterized by a high mutation rate, yielding a great amount of variation with which to differentiate the branches of the human family tree.

Many of these assumptions are now disputable. But that’s not the point of this post. In the age of dense 1 million marker SNP-chips why are we still focusing on the history of one particular genetic region? In a word: myth. Eve, the primal woman. The “mother of us all,” who even makes cameos in science fiction finales!

In 1987 a paper was published which found that Africans harbored the greatest proportion of mtDNA variation among human populations. Additionally, these lineages coalesced back to a common ancestor on the order of 150,000 years ago. Since mtDNA is present in humans, there was a human alive 150,000 years ago who carried this ancestral lineage, from which all modern lineages derive. Mitochondrial DNA is passed from mothers to their offspring, so this individual must have been a woman. In the press she was labeled Eve, for obvious reasons. The scientific publicity resulted in a rather strange popular reaction, culminating in a Newsweek cover where Adam and Eve are depicted as naked extras from Eddie Murphy’s Coming to America film.

The problem is that people routinely believe that mtDNA Eve was the only ancestress of all modern humans from the period in which she lived. Why they believe this is common sense, and requires no great consideration. The reality is that the story being told by science is the story of mtDNA, with inferences about the populations which serve as hosts for mtDNA being incidental. These inferences need to be made cautiously and with care. It is basic logic that a phylogeny will coalesce back to a common ancestor at some point. Genetic lineages over time go extinct, and so most mtDNA lineages from the time of Eve went extinct. There were many women who were alive during the same time as Eve, who contributed at least as much, perhaps more, to the genetic character of modern humans today. All we can say definitively is that their mtDNA lineage is no longer present. As mtDNA is passed from mother to daughter (males obviously have mtDNA, but we are dead ends, and pass it to no one), all one needs for a woman’s mtDNA lineage to go extinct is for her to have only sons. Though she leaves no imprint on the mtDNA phylogeny, obviously her sons may contribute genes to future generations.
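The logic of lineage extinction and coalescence can be illustrated with a toy simulation. This is a sketch under strong assumptions, not a model of real human demography: a constant number of women per generation, each daughter drawing her mother uniformly at random (a neutral Wright-Fisher style model), with the function name and parameters chosen for illustration.

```python
import random


def surviving_lineages(n_women=200, max_generations=10_000, seed=1):
    """Count how many founding mtDNA lineages remain in each generation."""
    rng = random.Random(seed)
    # Every founding woman starts with her own lineage label.
    lineages = list(range(n_women))
    counts = [n_women]
    for _ in range(max_generations):
        # Each woman in the next generation inherits mtDNA from a random mother.
        lineages = [rng.choice(lineages) for _ in range(n_women)]
        counts.append(len(set(lineages)))
        if counts[-1] == 1:
            break  # every living woman now traces her mtDNA to one founder
    return counts


counts = surviving_lineages()
generations_to_single_lineage = len(counts) - 1
```

In this model one founding label always ends up as the only survivor, yet nothing distinguishes that founder from her contemporaries at the start. The simulation tracks only mtDNA, so the women whose labels vanish may still have descendants through sons or through daughters of sons.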

Prior to ancient DNA and the proliferation of dense SNP data sets scholars were a bit too ambitious about what they believed they could infer from mtDNA and Y lineages (e.g., The Real Eve: Modern Man’s Journey Out of Africa). We are in a different time now; inferences made about the past rest on more than one leg. But the legend of Eve of the mtDNA persists, not because of its compelling scientific nature, but because this is a case where science piggy-backs upon prior conceptual furniture. This yields storytelling power, but a story which is based on a thin basis of fact becomes just another tall tale.

All this is on my mind because one of the scientists involved with Britain’s DNA, Jim Wilson, has penned a response to Vincent Plagnol’s Exaggerations and errors in the promotion of genetic ancestry testing (see here for more on this controversy). Overall I don’t find Wilson’s rebuttal too persuasive. It is well written, but it has the air of sophistry and lawyerly precision. I have appreciated Wilson’s science before, so I am not casting aspersions at his professional competence. Rather, some of the more enthusiastic and uninformed spokespersons for his firm have placed him in a delicate and indefensible situation, and he is gamely attempting to salvage the best of a bad hand. Importantly, he does not reassure me in the least that his firm did not use Britain’s atrocious libel laws as a threat to mute forceful criticism of their business model on scientific grounds. A more general issue here is that Wilson is in a situation where he must not damage the prospects of his firm, all the while maintaining his integrity as a scientist. From what I have seen once science becomes a business one must abandon the pretense of being a scientist first and foremost, no matter how profitable that aura of objectivity may be. The nature of marketing is such that the necessary caution and qualification essential for science becomes a major liability in the process of communicating. It’s about selling, not convincing.

Going back to Eve, Wilson marshals a very strange argument:

“The claim that Adam and Eve really existed, as you suggest, refers to the most recent common ancestors of the mtDNA and non-recombining part of the Y chromosome. I don’t agree that there is nothing special about these individuals: there must have been a reason why mitochondrial Eve was on the front cover of Time magazine in the late 80s!….”

A minor quibble, but I suspect he means the Newsweek cover. More seriously, this line of argumentation is bizarre on scientific grounds. Rather, it is a tack which is more rational when aiming toward a general audience which might purchase a kit which they believe might tell them of their relationship to “Eve.”

In the wake of the discussion at Genomes Unzipped I participated in further exchanges with Graham Coop and Aylwyn Scally on Twitter, and decided to spend 20 minutes this afternoon asking people what they thought about mitochondrial Eve. By “people,” I mean individuals who are pursuing graduate educations in fields such as genetics and forensics. My cursory “field research” left me very alarmed. Naturally these were individuals who did not make elementary mistakes in regards to the concept, but there was great confusion. I can only wonder what’s going through the minds of the public.

Analogies, allusions, and equivalences are useful when they leverage categories and concepts which we are solidly rooted in, and transpose them upon a foreign cognitive landscape. By pointing to similarities of structure and relation one can understand more fully the novel ground which one is exploring. Saying that the president of India is analogous to the queen of England is an informative analogy. These are both positions where the individual is a largely ceremonial head of state. In contrast, the president of the United States and the queen of England are very different figures, because the American executive is not ceremonial at all. This is not a useful analogy, even though superficially it sees no lexical shift.

Who was Eve? A plain reading is that she is the ancestor of all humans, and more importantly, the singular ancestress of all humans back to the dawn of time. This is a concept which the public grasps intuitively. Who is mtDNA Eve? A woman who flourished 150,000 years ago, who happened to carry the mtDNA lineage which would drift to fixation in the ancestors of modern humans. I think this is a very different thing indeed. For purposes of poetry and marketing the utilization of the name Eve is justifiable. But on scientific grounds all it does is confuse, obfuscate, and mislead.

The fiasco that Vincent Plagnol stumbled upon is just a symptom of a broader problem. Scientists need to engage in massive conceptual clean up, as catchy phrases such as “mitochondrial Eve” and “Y Adam” have permeated the culture over the past generation, and misled many sincere and engaged seekers of truth. This is of the essence because personal genomics, and the scientific understanding of genealogy, are now moving out of the ghetto of hobbyists, enthusiasts, and researchers. Though I doubt this industry will be massive, it will be ubiquitous, and a seamless part of our information portfolio. If people still have ideas like mitochondrial Eve in their head it is likely to cloud their perception of the utility of the tools at hand, and their broader significance.

December 18, 2012

Buddy, can you spare some ascertainment?

The above map shows the population coverage for the Geno 2.0 SNP-chip, put out by the Genographic Project. Their paper outlining the utility and rationale of the chip is now out on arXiv. I saw this map last summer, when Spencer Wells hosted a webinar on the launch of Geno 2.0, and it was the aspect which really jumped out at me. The number of markers that they have on this chip is modest, only somewhat more than 100,000 on the autosome, with a few tens of thousands more on the X, Y, and mtDNA. In contrast, the Axiom® Genome-Wide Human Origins 1 Array Plate being used by Patterson et al. has ~600,000 SNPs. But as is clear from the map above Geno 2.0 is ascertained in many more populations than the other comparable chips (Human Origins 1 Array uses 12 populations). It’s obvious that if you are only catching variation on a few populations, all the extra million markers may not give you much bang for the buck (not to mention the biases that that may introduce in your population genetic and phylogenetic inferences).


To the left is the list of populations against which the Human Origins 1 Array was ascertained, and they look rather comprehensive to me. In contrast, for Geno 2.0 ‘ancestrally informative markers’ were ascertained on 450 populations. The ultimate question for me is this: is all the extra ascertainment on diverse and obscure groups worth it? On first inspection Geno 2.0’s number of SNPs looks modest as I stated, but in my experience when you quality control and merge different panels together you are often left with only a few hundred thousand SNPs in any case. 100-200,000 SNPs is also sufficient to elucidate relationships even in genetically homogeneous regions such as Europe in my experience (it’s more than enough for model-based clustering, and seems to be overkill for MDS or PCA). One issue that jumps out at me about the Affymetrix chip is that it is ascertained toward the antipodes. In contrast, Geno 2.0 takes into account the Eurasian heartland. I suspect, for example, that Geno 2.0 would be better for population or ancestry assignment for South Asians because it would have more informative markers for those populations.

Ultimately I can’t really say much more until I use both marker sets in different and similar contexts. Since Geno 2.0 consciously excludes many functional and medically relevant SNPs its utility is primarily in the domain of demographics and history. If the populations in question are well covered by the Human Origins 1 Array, I see no reason why one shouldn’t go with it. Not only does it have more information about biological function, but the number of markers is many fold greater. On the other hand, Geno 2.0 may be more useful on the “blank zones” of the Affy chip. Hopefully the Genographic Project results paper for Geno 2.0 will come out soon and I can pull down their data set and play with it.

Cite: arXiv:1212.4116

Unveiling the genealogical lattice

To understand nature in all its complexity we have to cut the riotous variety down to size. For ease of comprehension we formalize with math, verbalize with analogies, and visualize with representations. These approximations of reality are not reality, but when we look through the glass darkly they give us filaments of essential insight. Dalton’s model of the atom is false in important details (e.g., fundamental particles turn out to be divisible into quarks), but it still has conceptual utility.

Likewise, the phylogenetic trees popularized by L. L. Cavalli-Sforza in The History and Geography of Human Genes are still useful in understanding the shape of the human demographic past. But it seems that the bifurcating model of the tree must now be strongly tinted by the shades of reticulation. In a stylized sense inter-specific phylogenies, which assume the approximate truth of the biological species concept (i.e., little gene flow across lineages), mislead us when we think of the phylogeny of species on the microevolutionary scale of population genetics. On an intra-specific scale gene flow is not just a nuisance parameter in the model, it is an essential phenomenon which must be accommodated into the framework.


This is on my mind because of the emergence of packages such as TreeMix and AdmixTools. Using software such as these on the numerous public data sets allows one to perceive the reality of admixture, and overlay lateral gene flow upon the tree as a natural expectation. But perhaps a deeper result is that the character of the tree itself is torn asunder. The figure above is from a new paper, Efficient moment-based inference of admixture parameters and sources of gene flow, which debuts MixMapper. The authors bring a lot of mathematical heft to their exposition, and I can’t say I follow all of it (though some of the details are very similar to Pickrell et al.’s). But in short it seems that in comparison to TreeMix MixMapper allows for more powerful inference of a narrower set of populations, selected for exploring very specific questions. In contrast, TreeMix explores the whole landscape with minimal supervision. Having used the latter I can testify that that is true.

The big result from MixMapper is that it extends the result of Patterson et al., and confirms that modern Europeans seem to be an admixture between a “north Eurasian” population, and a vague “west Eurasian” population. Importantly, they find evidence of admixture in Sardinians, which implies that Patterson et al.’s original work was not sensitive to admixture in putative reference populations (note that Patterson is a coauthor on this paper as well). The rub, as noted in the paper, is that it is difficult to estimate admixture when you don’t have “pure” ancestral reference populations. And yet here the takeaway for me is that we may need to rethink our whole conception of pure ancestral populations, and imagine a human phylogenetic tree as a series of lattices in eternal flux, with admixed nodes periodically expanding so as to generate the artifice of a diversifying tree. The closer we look, the more likely it seems that most of the populations which have undergone demographic expansion in the past 10,000 years are also the products of admixture. Any story of the past 10,000 years, and likely the past 100,000 years, must give space at the center of the narrative arc to lateral gene flow across populations.

Cite: arXiv:1212.2555 [q-bio.PE]

December 13, 2012

We are Nature

Filed under: Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 8:03 am

There’s an interesting piece in Slate, The Great Schism in the Environmental Movement, which seems to be a distillation of trends which have been bubbling within the modern environmentalist movement for a generation now (I’ve read earlier manifestos in a similar vein). I can’t assess the magnitude of the shift, but here’s the top-line:

But that is a false construct that scientists and scholars have been demolishing the past few decades. Besides, there’s a growing scientific consensus that the contemporary human footprint—our cities, suburban sprawl, dams, agriculture, greenhouse gases, etc.—has so massively transformed the planet as to usher in a new geological epoch. It’s called the Anthropocene.

Modernist greens don’t dispute the ecological tumult associated with the Anthropocene. But this is the world as it is, they say, so we might as well reconcile the needs of people with the needs of nature. To this end, Kareiva advises conservationists to craft “a new vision of a planet in which nature—forests, wetlands, diverse species, and other ancient ecosystems—exists amid a wide variety of modern, human landscapes.”


Let’s take this debate as a given. It is fundamentally normative. That is, it is about values. We need to tread carefully before projecting values across disputants. Far too often in this domain people seem to presume normative alignments, and therefore confuse ideological disagreement for rejection of factual truths. But, one thing to consider is that it is probable that human beings have already radically reshaped the ecological character of the world over the past 100,000 years. The implicit model that many older environmental activists seem to present is a framework pitting man & the machine vs. nature (the Shire vs. Mordor). But it is just not a useful dichotomy for many.

It is possible that there was, and is, no “pristine” nature. These disparate perspectives come to the fore in particular in post-colonial landscapes settled by Europeans. There is a long tradition in these areas of transforming ‘natives’ into ‘Noble Savages,’ who have attained some idealized harmony with Nature. The reality is that it is not harmony that was attained, but equilibrium. The arrival of anatomically modern humans to Australia and the New World resulted in a ‘shock’ to the ecological system, as megafauna went extinct due to the new variable of human predation. Even if H. sapiens were not the sufficient condition for these extinctions (populations naturally go through cycles), it is likely they were necessary (i.e., humans might extirpate species during times of low census size). But it is not just the initial impact in terms of species turnover. Australian and Amerindian populations seem to have reshaped the long term character of the landscape through fire. Charles C. Mann argues in 1491 that the vast forests which colonial and early American settlers cleared were in fact second growth, which emerged in the wake of massive die-offs of indigenous peoples due to Old World disease.

All of this is fundamentally complicated. Instead of a decision tree with two options, ‘Civilization’ vs. ‘Nature,’ there is actually a space populated with a multitude of positions. As someone touched by a moderate amount of biophilia my vision for the future is one of arcology based urbanism, massively scaled up algaculture, and megafaunal rewilding through genetic engineering and ancient DNA. Rather than idealize a mythic past we should endeavor to forge a new future. So it was, and so shall it ever be.

December 12, 2012

A lighter shade of brown: Dan MacArthur, look east or south!

Filed under: Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 2:58 pm

South Indian Udupi cuisine

In the post below I offered up my supposition that Dan MacArthur’s ancestry is unlikely to be Northwest Indian, which precludes a Romani origin for his South Asian ancestry. Indeed this is almost certainly so; Dienekes Pontikos followed up my crude analyses with IBD-sharing calculations (IBD = ‘identity by descent,’ which is basically what you would think it is). The South Asian population which MacArthur has the closest affinity to is from Karnataka, which is one of the Dravidian speaking states of the South. This does not necessarily refute my earlier contention, as aside from Brahmins most Bengalis seem to have broad South Indian affinities, except for the fact that they often have more East Asian ancestry.


Now, I may seem a touch obsessive on this issue at this point. There are several things motivating me. First, this was lying around in plain sight, but we missed it for years! Second, I’ve known Dan for a while, so this is very amusing on a personal level. Third, Dienekes has been pushing me to continue my exploration in a friendly competition. None of this is very difficult, and I’ve been going at it in the early hours of the day before work, or right before I go to sleep. In short, I’m doing this in part to show that you don’t need to just talk genomics, you too can do genomics. Ironically the age of “Big Data” is also the age of distributed data.

A lighter shade of brown: the Dan MacArthur chronicles, not a Romani

Filed under: Anthroplogy,Daniel MacArthur,Human Genetics,Human Genomics — Razib Khan @ 9:25 am

Pakistani honor guard

A few days ago I suggested that Dr. Daniel MacArthur might have South Asian ancestry. Now, when confronted with surprise the best option is to stick with your prior assumption, unless that surprise is powerful enough for you to “update” your model. After a few days of further analysis I will update: I do think Dan MacArthur has South Asian ancestry. Dienekes dug further, and noticed that there are hallmarks of “Ancestral South Indian” ancestry along the first 2/3 or so of chromosome 10. Now, you do have to remember that this genomic region is only half South Asian. The other half is European.

But in any case, one question that some people brought up: perhaps MacArthur has Romani heritage? I’m skeptical of this partly because:

1) there weren’t that many Romani in Britain in the 19th century

2) The British Romani are already very highly admixed

Another friend, who is a population genomicist himself, expressed some skepticism that such a long segment wasn’t broken up by recombination over the generations. My only moderately informed answer is this: we’d only notice the long segments, because if a very small region of ‘exotic’ ancestry was embedded within the dominant ancestral component it probably would not show up on some of these tests (or, we’d assume it was noise). Dan has another segment of South Asian ancestry, but much smaller in size. It may be there are other regions which we could find if we used better reference populations.

Here’s what I tentatively want to do with Dan’s data now. First, take the 80 Mb or so which has South Asian ancestry, and phase it. That way I’d have a South Asian chromosome and a European one, and we could look for matches for only the South Asian one. But being busy I didn’t have time to do this. What I did have time to do was reduce the chromosomal region under consideration, and then run an IBS distance analysis in a private data set I have. This is a crude, but not always uninformative analysis. But by looking at the relationships I can now conclude that Dan MacArthur probably does not have Romani ancestry. Why? Because the Romani are of Northwest Indian heritage, and MacArthur’s match pattern using the diploid genotype (so South Asian + European) does not match what I expect would emerge from such a combination.
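For readers unfamiliar with IBS (identity by state) distance, here is a minimal sketch of one common way to compute it from diploid genotypes. It is not the analysis actually run for this post, whose exact settings are not given: the 0/1/2 allele-count coding, the handling of missing calls, and the function names are assumptions, and the genotypes below are toy values invented for illustration.

```python
def ibs_distance(genotypes_a, genotypes_b):
    """Mean allele-sharing distance between two diploid genotype vectors.

    Genotypes are coded as counts of one allele (0, 1 or 2). None marks a
    missing call, and any SNP missing in either individual is skipped.
    Returns 0.0 for identical genotypes and 1.0 for opposite homozygotes.
    """
    total = 0
    compared = 0
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:
            continue
        total += abs(a - b)  # number of alleles not shared at this SNP
        compared += 1
    if compared == 0:
        raise ValueError("no SNPs with calls in both individuals")
    return total / (2 * compared)


def rank_by_distance(target, panel):
    """Return (label, distance) pairs sorted from closest to farthest."""
    distances = [(label, ibs_distance(target, g)) for label, g in panel.items()]
    return sorted(distances, key=lambda pair: pair[1])


# Toy genotypes invented purely for illustration; not real data.
target = [0, 1, 2, 1, 0, 2]
panel = {
    "sample_A": [0, 1, 2, 1, 0, 2],
    "sample_B": [0, 1, 1, 1, 0, 2],
    "sample_C": [2, 1, 0, 1, 2, 0],
}
ranking = rank_by_distance(target, panel)
```

Because the measure works on unphased diploid genotypes, a region that is half South Asian and half European is compared as a mixture, which is why the text treats the result as crude.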

The full table is below, but to me the fact that he has so many matches with Northwest Indian populations is evidence that his ancestry was not Northwest Indian. Otherwise, he would be matching Utah whites (CEU samples) more often. Rather, someone with a mix of more conventional South Asian ancestry and European ancestry often resembles some of the less South Asian populations of South Asia (e.g., Brahui) in these crude measures. In fact, one of the closest matches to Dan’s IBS profile is that of my own mother. She is a rather vanilla ethnic Bengali, so I think there is a strong chance that his Indian ancestry is similar. This weak genetic data isn’t really the primary reason. The British East India Company operated out of Bengal for much of its history, and there are simply a lot of Bengalis.

There’s a lot more that can be done here. Since I don’t have time, here’s the pedigree file if anyone wants to play with it (Dan is DGM001).

Population   Genetic distance from Dan   Standardized distance
Brahui 0.253 81.268
Burusho 0.257 82.736
Razib’s Mother 0.258 82.783
CEU 0.258 82.993
Burusho 0.258 83.024
CEU 0.26 83.547
Sakilli 0.26 83.555
Brahui 0.261 83.831
Brahui 0.261 83.857
GIH 0.261 83.955
CEU 0.261 83.972
CEU 0.261 83.985
CEU 0.262 84.043
North Kannadi 0.262 84.169
CEU 0.262 84.207
CEU 0.262 84.318
CEU 0.262 84.33
CEU 0.263 84.391
Paniya 0.263 84.408
CEU 0.263 84.437
CEU 0.263 84.445
CEU 0.263 84.488
CEU 0.263 84.606
CEU 0.263 84.609
CEU 0.264 84.691
Brahui 0.264 84.709
CEU 0.264 84.752
CEU 0.264 84.764
Brahui 0.264 84.822
GIH 0.264 84.826
Burusho 0.264 84.841
CEU 0.264 84.898
CEU 0.264 84.975
North Kannadi 0.264 84.992
CEU 0.265 85.087
Paniya 0.265 85.212
CEU 0.265 85.226
CEU 0.265 85.25
CEU 0.265 85.25
CEU 0.265 85.278
CEU 0.265 85.299
North Kannadi 0.265 85.3
Burusho 0.265 85.309
Burusho 0.266 85.328
CEU 0.266 85.363
CEU 0.266 85.409
North Kannadi 0.266 85.412
CEU 0.266 85.436
Burusho 0.266 85.446
Bene Israel 0.266 85.508
CEU 0.266 85.521
GIH 0.266 85.618
GIH 0.267 85.661
CEU 0.267 85.696
CEU 0.267 85.722
CEU 0.267 85.732
Brahui 0.267 85.777
GIH 0.267 85.793
CEU 0.267 85.799
CEU 0.267 85.816
Cochin Jews 0.267 85.85
CEU 0.267 85.943
Brahui 0.268 85.996
CEU 0.268 86.005
Cochin Jews 0.268 86.011
CEU 0.268 86.08
CEU 0.268 86.115
CEU 0.268 86.18
GIH 0.268 86.229
Cochin Jews 0.268 86.234
CEU 0.268 86.244
Burusho 0.268 86.265
CEU 0.268 86.277
CEU 0.268 86.278
CEU 0.269 86.288
CEU 0.269 86.291
CEU 0.269 86.318
CEU 0.269 86.325
CEU 0.269 86.326
GIH 0.269 86.327
CEU 0.269 86.329
CEU 0.269 86.354
CEU 0.269 86.387
CEU 0.269 86.463
CEU 0.269 86.515
CEU 0.269 86.517
CEU 0.269 86.55
CEU 0.27 86.609
Paniya 0.27 86.682
CEU 0.27 86.687
CEU 0.27 86.696
CEU 0.27 86.717
CEU 0.27 86.733
Sakilli 0.27 86.74
CEU 0.27 86.866
Malayan 0.27 86.879
North Kannadi 0.27 86.883
CEU 0.271 86.937
Brahui 0.271 86.952
Burusho 0.271 86.956
CEU 0.271 86.957
CEU 0.271 86.977
North Kannadi 0.271 86.995
GIH 0.271 87.018
CEU 0.271 87.042
CEU 0.271 87.066
CEU 0.271 87.07
Brahui 0.271 87.09
Bene Israel 0.271 87.094
Sakilli 0.271 87.141
CEU 0.271 87.2
CEU 0.271 87.24
North Kannadi 0.272 87.253
CEU 0.272 87.297
Burusho 0.272 87.307
CEU 0.272 87.327
GIH 0.272 87.353
CEU 0.272 87.355
Cochin Jews 0.272 87.381
CEU 0.272 87.384
CEU 0.272 87.5
CEU 0.272 87.535
CEU 0.273 87.594
Malayan 0.273 87.676
CEU 0.273 87.702
CEU 0.273 87.741
Burusho 0.273 87.806
CEU 0.273 87.846
Cambodians 0.274 87.932
North Kannadi 0.274 87.951
CEU 0.274 87.951
Burusho 0.274 88.03
CEU 0.274 88.047
CEU 0.274 88.081
CEU 0.274 88.089
CEU 0.274 88.101
CEU 0.274 88.179
CEU 0.274 88.19
North Kannadi 0.275 88.243
CEU 0.275 88.32
GIH 0.275 88.325
CEU 0.275 88.349
Brahui 0.275 88.393
CEU 0.275 88.402
CEU 0.275 88.457
Bene Israel 0.276 88.552
CEU 0.276 88.577
CEU 0.276 88.603
CEU 0.276 88.647
CEU 0.276 88.7
CEU 0.276 88.729
CEU 0.276 88.814
CEU 0.276 88.85
Brahui 0.276 88.855
CEU 0.277 88.923
GIH 0.277 88.99
Paniya 0.277 89.082
CEU 0.277 89.118
CEU 0.277 89.15
CEU 0.277 89.151
CEU 0.277 89.17
CEU 0.278 89.184
Cambodians 0.278 89.208
Cambodians 0.278 89.233
Cambodians 0.278 89.383
CEU 0.278 89.45
CEU 0.278 89.493
Cambodians 0.279 89.522
CEU 0.279 89.595
CEU 0.279 89.679
CEU 0.279 89.753
CEU 0.279 89.762
CEU 0.279 89.807
Cambodians 0.28 89.942
GIH 0.28 90.085
CEU 0.281 90.178
Brahui 0.281 90.364
Cambodians 0.282 90.543
Cambodians 0.282 90.559
Cambodians 0.282 90.77
Cambodians 0.283 90.898
CEU 0.283 90.956
CEU 0.284 91.316
CHD 0.289 92.952
Sakilli 0.29 93.103
Bene Israel 0.29 93.122
CHD 0.291 93.619
CHD 0.291 93.663
CHD 0.293 94.125
CHD 0.293 94.248
CHD 0.294 94.451
CHD 0.294 94.629
CHD 0.296 94.965
CHD 0.296 95.279
Yorubas 0.297 95.298
CHD 0.297 95.368
CHD 0.297 95.438
CHD 0.297 95.441
Yorubas 0.297 95.567
CHD 0.298 95.678
CHD 0.298 95.828
CHD 0.299 96.032
CHD 0.299 96.127
CHD 0.3 96.349
CHD 0.3 96.403
CHD 0.3 96.443
CHD 0.3 96.508
CHD 0.3 96.523
CHD 0.3 96.533
CHD 0.301 96.575
CHD 0.301 96.598
CHD 0.301 96.624
CHD 0.301 96.625
CHD 0.301 96.738
CHD 0.301 96.758
CHD 0.301 96.869
Yorubas 0.302 97.106
CHD 0.303 97.37
CHD 0.303 97.41
Yorubas 0.304 97.681
CHD 0.304 97.713
CHD 0.304 97.747
Yorubas 0.304 97.829
CHD 0.304 97.838
CHD 0.305 98.106
CHD 0.306 98.309
Yorubas 0.307 98.499
CHD 0.307 98.546
CHD 0.307 98.547
CHD 0.307 98.606
CHD 0.307 98.764
CHD 0.307 98.78
CHD 0.307 98.803
Yorubas 0.308 98.947
Yorubas 0.308 99.03
Yorubas 0.309 99.411
Yorubas 0.309 99.417
CHD 0.309 99.452
CHD 0.31 99.624
Yorubas 0.311 100

A lighter shade of brown: the Dan MacArthur chronicles, not a Romani

Filed under: Anthroplogy,Daniel MacArthur,Human Genetics,Human Genomics — Razib Khan @ 9:25 am

Pakistani honor guard

A few days ago I suggested that Dr. Daniel MacArthur might have South Asian ancestry. Now, when confronted with surprise the best option is to stick with your prior assumption, unless that surprise is powerful enough for you to “update” your model. After a few days of further analysis I will update: I do think Dan MacArthur has South Asian ancestry. Dienekes dug further, and noticed that there are hallmarks of “Ancestral South Indian” ancestry along the first 2/3 or so of chromosome 10. Now, you do have to remember that this genomic region is only half South Asian. The other half is European.

But in any case, one question that some people brought up: perhaps MacArthur has Romani heritage? I’m skeptical of this partly because:

1) there weren’t that many Romani in Britain in the 19th century

2) The British Romani are already very highly admixed

Another friend, who is a population genomicist himself, expressed some skepticism that such a long segment wasn’t broken up by recombination over the generations. My only moderately informed answer is this: we’d only notice the long segments, because if a very small region of ‘exotic’ ancestry was embedded within the dominant ancestral component it probably would not show up on some of these tests (or, we’d assume it was noise). Dan has another segment of South Asian ancestry, but much smaller in size. It may be there are other regions which we could find if we used better reference populations.

Here’s what I tentatively want to do with Dan’s data now. First, take the 80 Mb or so which has South Asian ancestry, and phase it. That way I’d have a South Asian chromosome and a European one, and we could look for matches for only the South Asian one. But being busy I didn’t have time to do this. What I did have time to do was reduce the chromosomal region under consideration, and then run an IBS (identity-by-state) distance analysis on a private data set I have. This is a crude, but not always uninformative, analysis. But by looking at the relationships I can now conclude that Dan MacArthur probably does not have Romani ancestry. Why? Because the Romani are of Northwest Indian heritage, and MacArthur’s match pattern using the diploid genotype (so South Asian + European) does not match what I expect would emerge from such a combination.

The full table is below, but to me the fact that he has so many matches with Northwest Indian populations is evidence that his ancestry was not Northwest Indian. Otherwise, he would be matching Utah whites (the CEU samples) more often. Rather, someone with a mix of more conventional South Asian ancestry and European ancestry often resembles some of the less South Asian populations of South Asia (e.g., Brahui) in these crude measures. In fact, one of the closest matches to Dan’s IBS profile is that of my own mother. She is a rather vanilla ethnic Bengali, so I think there is a strong chance that his Indian ancestry is similar. These weak genetic data aren’t really the primary reason, though. The British East India Company operated out of Bengal for much of its history, and there are simply a lot of Bengalis.

There’s a lot more that can be done here. Since I don’t have time, here’s the pedigree file if anyone wants to play with it (Dan is DGM001).

Population Genetic distance from Dan Standardized distance
Brahui 0.253 81.268
Burusho 0.257 82.736
Razib’s Mother 0.258 82.783
CEU 0.258 82.993
Burusho 0.258 83.024
CEU 0.26 83.547
Sakilli 0.26 83.555
Brahui 0.261 83.831
Brahui 0.261 83.857
GIH 0.261 83.955
CEU 0.261 83.972
CEU 0.261 83.985
CEU 0.262 84.043
North Kannadi 0.262 84.169
CEU 0.262 84.207
CEU 0.262 84.318
CEU 0.262 84.33
CEU 0.263 84.391
Paniya 0.263 84.408
CEU 0.263 84.437
CEU 0.263 84.445
CEU 0.263 84.488
CEU 0.263 84.606
CEU 0.263 84.609
CEU 0.264 84.691
Brahui 0.264 84.709
CEU 0.264 84.752
CEU 0.264 84.764
Brahui 0.264 84.822
GIH 0.264 84.826
Burusho 0.264 84.841
CEU 0.264 84.898
CEU 0.264 84.975
North Kannadi 0.264 84.992
CEU 0.265 85.087
Paniya 0.265 85.212
CEU 0.265 85.226
CEU 0.265 85.25
CEU 0.265 85.25
CEU 0.265 85.278
CEU 0.265 85.299
North Kannadi 0.265 85.3
Burusho 0.265 85.309
Burusho 0.266 85.328
CEU 0.266 85.363
CEU 0.266 85.409
North Kannadi 0.266 85.412
CEU 0.266 85.436
Burusho 0.266 85.446
Bene Israel 0.266 85.508
CEU 0.266 85.521
GIH 0.266 85.618
GIH 0.267 85.661
CEU 0.267 85.696
CEU 0.267 85.722
CEU 0.267 85.732
Brahui 0.267 85.777
GIH 0.267 85.793
CEU 0.267 85.799
CEU 0.267 85.816
Cochin Jews 0.267 85.85
CEU 0.267 85.943
Brahui 0.268 85.996
CEU 0.268 86.005
Cochin Jews 0.268 86.011
CEU 0.268 86.08
CEU 0.268 86.115
CEU 0.268 86.18
GIH 0.268 86.229
Cochin Jews 0.268 86.234
CEU 0.268 86.244
Burusho 0.268 86.265
CEU 0.268 86.277
CEU 0.268 86.278
CEU 0.269 86.288
CEU 0.269 86.291
CEU 0.269 86.318
CEU 0.269 86.325
CEU 0.269 86.326
GIH 0.269 86.327
CEU 0.269 86.329
CEU 0.269 86.354
CEU 0.269 86.387
CEU 0.269 86.463
CEU 0.269 86.515
CEU 0.269 86.517
CEU 0.269 86.55
CEU 0.27 86.609
Paniya 0.27 86.682
CEU 0.27 86.687
CEU 0.27 86.696
CEU 0.27 86.717
CEU 0.27 86.733
Sakilli 0.27 86.74
CEU 0.27 86.866
Malayan 0.27 86.879
North Kannadi 0.27 86.883
CEU 0.271 86.937
Brahui 0.271 86.952
Burusho 0.271 86.956
CEU 0.271 86.957
CEU 0.271 86.977
North Kannadi 0.271 86.995
GIH 0.271 87.018
CEU 0.271 87.042
CEU 0.271 87.066
CEU 0.271 87.07
Brahui 0.271 87.09
Bene Israel 0.271 87.094
Sakilli 0.271 87.141
CEU 0.271 87.2
CEU 0.271 87.24
North Kannadi 0.272 87.253
CEU 0.272 87.297
Burusho 0.272 87.307
CEU 0.272 87.327
GIH 0.272 87.353
CEU 0.272 87.355
Cochin Jews 0.272 87.381
CEU 0.272 87.384
CEU 0.272 87.5
CEU 0.272 87.535
CEU 0.273 87.594
Malayan 0.273 87.676
CEU 0.273 87.702
CEU 0.273 87.741
Burusho 0.273 87.806
CEU 0.273 87.846
Cambodians 0.274 87.932
North Kannadi 0.274 87.951
CEU 0.274 87.951
Burusho 0.274 88.03
CEU 0.274 88.047
CEU 0.274 88.081
CEU 0.274 88.089
CEU 0.274 88.101
CEU 0.274 88.179
CEU 0.274 88.19
North Kannadi 0.275 88.243
CEU 0.275 88.32
GIH 0.275 88.325
CEU 0.275 88.349
Brahui 0.275 88.393
CEU 0.275 88.402
CEU 0.275 88.457
Bene Israel 0.276 88.552
CEU 0.276 88.577
CEU 0.276 88.603
CEU 0.276 88.647
CEU 0.276 88.7
CEU 0.276 88.729
CEU 0.276 88.814
CEU 0.276 88.85
Brahui 0.276 88.855
CEU 0.277 88.923
GIH 0.277 88.99
Paniya 0.277 89.082
CEU 0.277 89.118
CEU 0.277 89.15
CEU 0.277 89.151
CEU 0.277 89.17
CEU 0.278 89.184
Cambodians 0.278 89.208
Cambodians 0.278 89.233
Cambodians 0.278 89.383
CEU 0.278 89.45
CEU 0.278 89.493
Cambodians 0.279 89.522
CEU 0.279 89.595
CEU 0.279 89.679
CEU 0.279 89.753
CEU 0.279 89.762
CEU 0.279 89.807
Cambodians 0.28 89.942
GIH 0.28 90.085
CEU 0.281 90.178
Brahui 0.281 90.364
Cambodians 0.282 90.543
Cambodians 0.282 90.559
Cambodians 0.282 90.77
Cambodians 0.283 90.898
CEU 0.283 90.956
CEU 0.284 91.316
CHD 0.289 92.952
Sakilli 0.29 93.103
Bene Israel 0.29 93.122
CHD 0.291 93.619
CHD 0.291 93.663
CHD 0.293 94.125
CHD 0.293 94.248
CHD 0.294 94.451
CHD 0.294 94.629
CHD 0.296 94.965
CHD 0.296 95.279
Yorubas 0.297 95.298
CHD 0.297 95.368
CHD 0.297 95.438
CHD 0.297 95.441
Yorubas 0.297 95.567
CHD 0.298 95.678
CHD 0.298 95.828
CHD 0.299 96.032
CHD 0.299 96.127
CHD 0.3 96.349
CHD 0.3 96.403
CHD 0.3 96.443
CHD 0.3 96.508
CHD 0.3 96.523
CHD 0.3 96.533
CHD 0.301 96.575
CHD 0.301 96.598
CHD 0.301 96.624
CHD 0.301 96.625
CHD 0.301 96.738
CHD 0.301 96.758
CHD 0.301 96.869
Yorubas 0.302 97.106
CHD 0.303 97.37
CHD 0.303 97.41
Yorubas 0.304 97.681
CHD 0.304 97.713
CHD 0.304 97.747
Yorubas 0.304 97.829
CHD 0.304 97.838
CHD 0.305 98.106
CHD 0.306 98.309
Yorubas 0.307 98.499
CHD 0.307 98.546
CHD 0.307 98.547
CHD 0.307 98.606
CHD 0.307 98.764
CHD 0.307 98.78
CHD 0.307 98.803
Yorubas 0.308 98.947
Yorubas 0.308 99.03
Yorubas 0.309 99.411
Yorubas 0.309 99.417
CHD 0.309 99.452
CHD 0.31 99.624
Yorubas 0.311 100
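
For anyone who wants to reproduce this kind of table, here is a minimal Python sketch of a pairwise IBS (identity-by-state) distance and of the rescaling in the last column. It assumes genotypes coded as 0, 1, or 2 copies of an allele, with None for missing calls, and it assumes the standardized column is simply each distance divided by the largest distance, times 100. That assumption is consistent with the table given three-decimal rounding (0.253 / 0.311 is about 81.4, against 81.268 listed), but the software and settings actually used are not stated in the post. The function names and toy genotypes are invented for illustration.

```python
def ibs_distance(g1, g2):
    """Mean allele-sharing distance between two genotype vectors.

    Genotypes are coded 0/1/2 (copies of one allele); None marks a
    missing call and that site is skipped. Result is in [0, 1].
    """
    diffs = [abs(a - b) / 2 for a, b in zip(g1, g2)
             if a is not None and b is not None]
    if not diffs:
        raise ValueError("no overlapping genotype calls")
    return sum(diffs) / len(diffs)


def standardize(distances):
    """Rescale distances so the largest equals 100 (assumed convention)."""
    top = max(distances.values())
    return {name: 100 * d / top for name, d in distances.items()}


# Toy data, invented for illustration only.
target = [0, 1, 2, 1, 0, None, 2, 1]
panel = {
    "sample_A": [0, 1, 2, 1, 1, 2, 2, 1],
    "sample_B": [2, 1, 0, 1, 0, 0, 2, 0],
}
raw = {name: ibs_distance(target, g) for name, g in panel.items()}
for name, score in sorted(standardize(raw).items(), key=lambda kv: kv[1]):
    print(name, round(raw[name], 3), round(score, 3))
```

Sorting by the standardized score puts the closest matches first, which is how the table above is ordered.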

Don’t wait to have children!

Filed under: Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 7:34 am

The New Republic has a piece up, How Older Parenthood Will Upend American Society, which won’t have surprising data for readers of this weblog. But it’s nice to see this sort of thing go “mainstream.” My daughter was born when her parents were in their mid-30s, so I know all the statistics. They aren’t good bed-time reading (she’s healthy and robust so far!). If I had to do it over again I definitely wouldn’t have waited this long. Becoming a father brought home to me that waiting was one of the worst decisions of my life. Why postpone something this incredible for the far more prosaic pleasures of an extended adolescence? Granted, I’m not sure that I would have been the best father at 25, but I don’t think there’s much I can say in reply to the argument that I should have become a father by 30.

More concretely, we would have had sperm and egg “banked” if we had been smart about delaying parenthood. The article notes that storage of sperm costs $850 up front, and $300 to $500 per year after that, and that many balk at the cost. And how much do you spend on your cell phone every year? The issue here seems to be time preference.


Of course one of the aspects of the piece which is implicit is its class-orientation. It finishes up:

It won’t be easy to make the world more baby-friendly, but if we were to try, we’d have to restructure the professions so that the most intensely competitive stage of a career doesn’t occur right at the moment when couples should be lavishing attention on infants….

Though the fertility transition has occurred across all groups, it has been most striking among the professional classes. But the professional classes are a minority of the population. The reality is that we’d expect fertility to rebound over time as professional class people, whether for genetic or cultural (likely both) reasons, fail to reproduce, and are replaced by those who produce large families early, and continue to do so in a heritable fashion.

 
The plot to the left illustrates the dynamic which I’m alluding to. Suppose that at t = 0 you have two populations where one outnumbers the second by a factor of 100. If the population which is at ~1% proportion has a doubling-time of 30 years, while the majority is merely at replacement, then it takes ~200 years for the minority to outnumber the majority. Of course, we need to be careful about projections, something I’ve criticized more gullible or historically naive readers about. Mormon fundamentalists or Old Order Amish have very high fertility rates in the context of being embedded in advanced societies which have very low fertility (at least relatively). But context-free models are useless. For example, some hyper-fertile groups, such as Mormon fundamentalists and Hasidic Jews, have been known to exploit the welfare state to subsidize their natalism. Such a subsidy lasts only so long as the ratio between producers and consumers favors the former. If a trend is unsustainable, it won’t be sustained. In the short term the “limits to growth” are probably not capped by resources, but by human capital, which allows us to have increasing marginal returns. Systems do return to equilibria eventually.
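
The ~200 year figure follows from simple exponential arithmetic, sketched below in Python. The function name and parameters are made up for illustration, and the model assumes a perfectly constant doubling time for the minority and exact replacement (zero growth) for the majority, which is precisely the kind of context-free assumption cautioned against above.

```python
import math


def years_to_parity(initial_ratio, doubling_time_years):
    """Years until a doubling minority equals a static majority.

    initial_ratio: majority size divided by minority size at t = 0.
    Solves initial_ratio = 2 ** (t / doubling_time_years) for t.
    """
    return doubling_time_years * math.log2(initial_ratio)


print(round(years_to_parity(100, 30), 1))  # → 199.3
```

A 100-fold head start is fewer than seven doublings, which is why the crossover arrives so quickly on paper.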

As I don’t see a shift back to younger families among the professional classes, what’s the solution? Massive sperm and egg banking of 20 year old individuals seems like an economically feasible and effective strategy. I wouldn’t be surprised if the math works out that this reduces our long term health care costs, by cutting down on children born with congenital defects and decreasing the lifetime morbidity rate of the population as a whole. Want to bend the cost curve? This might be a way! Combined with pre-implantation genetic diagnosis the ‘solution’ then is probably more technology, not less.

The long term fix of course is to reform the professional system and the culture of careerism so that what is rewarded is genuine productivity, rather than signalling. By this, I mean that I suspect many people who work “80 hours a week” aren’t all that productive beyond about 50, but they do signal to their superiors that they are “committed.” And of course now the whole peer group has to shift toward this irrational norm. Better workplace monitoring and productivity assessments might be a way to address this issue, so that superiors can actually evaluate gains to hours of putative labor.

Finally, the General Social Survey asks people about the ideal number of children they’d like to have, versus the actual number of children they do have (CHLDIDEL and CHILDS). To remove demographic confounds I limited the sample below to non-Hispanic white women age 45 and up between the years 2006 and 2010, and compared across educational attainment.

December 11, 2012

We don’t know why Ethiopians breathe easy

Filed under: Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 9:40 pm

Most people are aware that altitude imposes constraints on individual performance and function. Much of this is flexible; athletes who train at high altitudes may gain a performance edge. But over the long term there are costs, just as there are with computers which are ‘overclocked.’ This is the point where you make the transition from physiology to evolution. Residence at high altitude entails strong selective pressures on populations. Over the past few years there has been a great deal of exploration of the genetics of long resident high altitude groups, the Tibetans, Peruvians, and Ethiopians.

Why is high altitude adaptation of interest? First, it’s clear and distinct. Coding the two phenotypes is not that difficult. Second, the adaptive value is clear. Though people can survive at high altitudes, most do not flourish. And third, you have several distinct populations which are phylogenetically diverse. In other words, you can test evolution across multiple lineages, and see if the phenotypic and genetic features resemble each other, or differ. To some extent we know the answer: though there are overlaps, altitude adaptations differ both in their physiological presentation, and in their genetic architecture. Because it is such a powerful force, the shape of adaptation to high altitudes gives us a better sense of the arc of evolution across diverse populations and times.

With all that under our belt, I have to admit that a new paper in PLoS Genetics, The Genetic Architecture of Adaptations to High Altitude in Ethiopia, is quite peculiar in my opinion. To not put too fine a point on it, it seems hurried and confused. Granted, from what I can gather the authors confirmed the likelihood that the genetic architecture of high altitude adaptations in Ethiopian populations is such that variance exhibits a more polygenic character, resulting in weaker ability to detect genes of large effect. But if you read the paper you notice that they go off on a particular tangent about adaptation to diseases at immune loci. Much of the work over the past 10 years on recent evolution in humans has focused on immune response, so I don’t see why that was necessary for the broader paper.

Additionally, there’s one element which I think needs to be explored more: the phylogenetic history of the Ethiopian people. The paper makes the blithe assertion that Ethiopians have been resident in the highlands of that nation for tens of thousands of years, but the latest research suggests that this just isn’t so. In short, it may be that Ethiopians as we understand them are recent compounds of a long resident Sub-Saharan African population and West Eurasian migrants. I’m not sure that this necessarily impacts the high altitude adaptations, as the admixed populations would presumably inherit the alleles from the local residents if they did not exterminate them. But it does seem to complicate the story that you’d want to tell.

Overall, I think this sort of paper, which follows on the heels of other work, should have been presented in a more explicitly evolutionary context. If these adaptations are old, then there may be many variants in the highlands of Ethiopia. In contrast, the Peruvian adaptations are presumably new, and from what I have been told they are more like improvisations on the fly. In a way the spectrum of subtlety of adaptation, with Ethiopians being the most nuanced, and Peruvians the least (with Tibetans in the middle), gives us a sense of the different power of evolution on the scale of 50,000 years vs. 10,000 years. Additionally, the region around Ethiopia has long been inhabited by hominins. No prior population was resident on the Altiplano before the arrival of the first Amerindians ~10,000 years before the present.

I’m hoping that the Nielsen Group comes out with a better paper. I know they’re going to revisit the topic of altitude adaptation (so I was told at ASHG 2012).

Citation: Alkorta-Aranburu G, Beall CM, Witonsky DB, Gebremedhin A, Pritchard JK, et al. (2012) The Genetic Architecture of Adaptations to High Altitude in Ethiopia. PLoS Genet 8(12): e1003110. doi:10.1371/journal.pgen.1003110

December 10, 2012

Is Daniel MacArthur ‘desi’?

My initial inclination in this post was to discuss a recent ordering snafu which resulted in many of my friends being quite peeved at 23andMe. But browsing through their new ‘ancestry composition’ feature I thought I had to discuss it first, because of some nerd-level intrigue. Though I agree with many of Dienekes’ concerns about this new feature, I have to admit that at least this method doesn’t give out positively misleading results. For example, I had complained earlier that ‘ancestry painting’ gave literally crazy results when they weren’t trivial. It said I was ~60 percent European, which makes some coherent sense in their non-optimal reference population set, but then stated that my daughter was >90 percent European. Since 23andMe did confirm she was 50% identical by descent with me these results didn’t make sense; some readers suggested that there was a strong bias in their algorithms to assign ambiguous genomic segments to ‘European’ heritage (this was a problem for East Africans too).

Here’s my daughter’s new chromosome painting:

One aspect of 23andMe’s new ancestry composition feature is that it is very Eurocentric. But, most of the customers are white, and presumably the reference populations they used (which are from customers) are also white. Though there are plenty of public domain non-white data sets they could have used, I assume they’d prefer to eat their own data dog-food in this case. But that’s really a minor gripe in the grand scheme of things. This is a huge upgrade from what came before. Now, it’s not telling me, as a South Asian, very much. But, it’s not telling me ludicrous things anymore either!

But in regards to omission I am curious to know why this new feature rates my family as only ~3% East Asian, when other analyses put us in the 10-15% range. The problem with very high values is that South Asians often have some residual ‘eastern’ signal, which I suspect is not real admixture, but is an artifact. Nevertheless, northeast Indians, including Bengalis, often have genuine East Asian admixture. On PCA plots my family is shifted considerably toward East Asians. The signal they are picking up probably isn’t noise. Almost every apportionment of East Asian ancestry I’ve seen for my family yields a greater value for my mother, and that holds here. It’s just that the values are implausibly low.

In any case, that’s not the strangest thing I saw. I was clicking around people who I had “shared” genomes with, and I stumbled upon this:

As you can guess from the screenshot this is Daniel MacArthur’s profile. And according to this ~25% of chromosome 10 is South Asian! At first blush this seemed totally nonsensical to me, so I clicked around other profiles of people of similar Northern European background…and I didn’t see anything equivalent.

What to do? It’s going to take more evidence than this to shake my prior assumptions, so I downloaded Dr. MacArthur’s genotype. Then I merged it with three HapMap populations, the Utah whites (CEU), the Gujaratis (GIH), and the Chinese from Denver (CHD). The last was basically a control. I pulled out chromosome 10. I also added Dan’s wife Ilana to the data set, since I believe she got typed with the same Illumina chip, and is of similar ethnic background (i.e., very white). It is important to note that only 28,000 SNPs remained in the data set. But usually 10,000 is more than sufficient on SNP data for model-based clustering with inter-continental scale variation.

I did two things:

1) I ran ADMIXTURE at K = 3, unsupervised

2) I ran an MDS, which visualized the genetic variation in multiple dimensions

Before I go on, I will state what I found: these methods supported the inference from 23andMe. On chromosome 10 Dr. MacArthur seems to have an affinity with South Asians (i.e., this is his ‘curry chromosome’). Here are the average (median) values in tabular format, with MacArthur and his wife presented for comparison.

ADMIXTURE results for chromosome 10
Population          K1    K2    K3
CEU                 0.04  0.02  0.93
GIH                 0.87  0.05  0.08
CHD                 0.01  0.97  0.01
Daniel MacArthur    0.29  0.07  0.64
Ilana Fisher        0.01  0.06  0.94

You probably want a distribution. Out of the non-founder CEU sample none went above 20% South Asian, though it did surprise me that a few were that high, which makes it more plausible to me that MacArthur’s results on chromosome 10 are a fluke:

And here’s the MDS with the two largest dimensions:

Again, it’s evident that this chromosome 10 is shifted toward South Asians. If I had more time right now what I’d do is probably get that specific chromosomal segment, phase it, and then compare it to various South Asian populations. But I don’t have time now, so I went and checked out the results from the Interpretome. I cranked up the settings to reduce the noise, and so that it would only spit out the most robust and significant results. As you can see, again chromosome 10 comes up as the one which isn’t quite like the others.

Is there a plausible explanation for this? Perhaps Dr. MacArthur can call up a helpful relative? From what I recall his parents are immigrants from the United Kingdom, and it isn’t unheard of for white Britons to have South Asian ancestry which dates back to the 19th century. Though to be totally honest I’m rather agnostic about all this right now. This genotype has been “out” for years now, so how is it that no one has noticed this peculiarity? Perhaps the issue is that everyone was looking at the genome-wide average, and it just doesn’t rise to the level of notice? What I really want to do is look at the distribution of all chromosomes and see how Daniel MacArthur’s chromosome 10 then stacks up. It might be a random act of nature yet.

Also, I guess I should add that at ~1.5% South Asian that would be consistent with one of MacArthur’s great-great-great-great grandparents being Indian. Assuming 25 year generation times that puts them in the mid-19th century. Of course, at such a low proportion the variance is going to be high, so it is quite possible that you need to push the real date of admixture one generation back, or one generation forward.
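
The arithmetic behind that estimate is just repeated halving. Here is a quick Python sketch; the function name is hypothetical, and it gives only the expected genome-wide share, since recombination makes the realized share vary around that expectation.

```python
def expected_share(generations_back):
    """Expected autosomal fraction inherited from one ancestor
    who lived `generations_back` generations ago (parent = 1)."""
    return 0.5 ** generations_back


# A great-great-great-great grandparent is 6 generations back.
for g in (5, 6, 7):
    print(g, f"{expected_share(g):.2%}", f"~{25 * g} years at 25 years/generation")
```

Six generations back gives 1/64, or about 1.6%, and moving one generation either way doubles or halves that figure.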

December 9, 2012

The origins of the Romani determined definitively

Filed under: Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 11:52 am

In many cases there are questions of a historical and ethnographic nature which are subject to controversy and debate. Scholarly arguments are laid out, and further dispute ensues. For decades progress seems fleeting, as one hypothesis is accepted, only to be subject to later revision. This sort of pattern gives succor to the most cynical and jaded of the ‘Post Modern’ set, especially when the ‘discourse’ in question is in the domain of science.

But thankfully these debates can come to an end in some cases. So it is with the origins of the European Romani, better known as ‘Gypsies’ (though the Roma are the most well known of the Romani, other groups within Europe have different ethnonyms). Obviously many of the basic elements have long been there, but I think the most recent genetic work now establishes a level of closure. Taking a step back, what do we know?

1) The Romani language seems to be Indo-Aryan, with a likely affinity with the northwest group of Indo-Aryan languages

2) The Romani presence in Europe only dates to the past ~1,000 years, with an entry point in the Byzantine Empire

3) They are an admixture between an ancestral Indian element, and local populations

4) Their history of endogamy has resulted in a strong genetic drift effect

The two papers which seem to nail the coffin shut on these questions use somewhat different methodologies. One relies on Y chromosomal STRs (hypervariable repeat regions) to generate a paternal phylogeny. Focusing just on the paternal phylogeny allows for one to make very robust genealogical inferences. Additionally, the authors had a very large data set across India. Their goal was to ascertain the exact region of origin of the Romani before they left India. As noted in bullet #1 there is already some evidence from their language that this must be in northwest India. The second paper uses a SNP-chip; hundreds of thousands of autosomal markers. This has been done to death for other populations, so the method isn’t new. Rather, it is that it is now being applied to the Romani.

First, the Y chromosomal paper. The Phylogeography of Y-Chromosome Haplogroup H1a1a-M82 Reveals the Likely Indian Origin of the European Romani Populations:

Linguistic and genetic studies on Roma populations inhabited in Europe have unequivocally traced these populations to the Indian subcontinent. However, the exact parental population group and time of the out-of-India dispersal have remained disputed. In the absence of archaeological records and with only scanty historical documentation of the Roma, comparative linguistic studies were the first to identify their Indian origin. Recently, molecular studies on the basis of disease-causing mutations and haploid DNA markers (i.e. mtDNA and Y-chromosome) supported the linguistic view. The presence of Indian-specific Y-chromosome haplogroup H1a1a-M82 and mtDNA haplogroups M5a1, M18 and M35b among Roma has corroborated that their South Asian origins and later admixture with Near Eastern and European populations. However, previous studies have left unanswered questions about the exact parental population groups in South Asia. Here we present a detailed phylogeographical study of Y-chromosomal haplogroup H1a1a-M82 in a data set of more than 10,000 global samples to discern a more precise ancestral source of European Romani populations. The phylogeographical patterns and diversity estimates indicate an early origin of this haplogroup in the Indian subcontinent and its further expansion to other regions. Tellingly, the short tandem repeat (STR) based network of H1a1a-M82 lineages displayed the closest connection of Romani haplotypes with the traditional scheduled caste and scheduled tribe population groups of northwestern India.

 

Two trees illustrate the results succinctly:

The bottom line:

- This particular Y chromosomal lineage which is highly diagnostic of South Asian origin in the Romani shows that the Romani seem to derive from the populations of northwest India

- Additionally, within these populations the Romani Y chromosomal lineages derive from the lower caste elements, the scheduled castes and scheduled tribes

But the above results don’t get directly at genome-wide admixture. The second paper does, using hundreds of thousands of markers to explore the Romani affinity to other populations. Reconstructing the Population History of European Romani from Genome-wide Data:

The Romani, the largest European minority group with approximately 11 million people…constitute a mosaic of languages, religions, and lifestyles while sharing a distinct social heritage. Linguistic…and genetic…studies have located the Romani origins in the Indian subcontinent. However, a genome-wide perspective on Romani origins and population substructure, as well as a detailed reconstruction of their demographic history, has yet to be provided. Our analyses based on genome-wide data from 13 Romani groups collected across Europe suggest that the Romani diaspora constitutes a single initial founder population that originated in north/northwestern India ∼1.5 thousand years ago (kya). Our results further indicate that after a rapid migration with moderate gene flow from the Near or Middle East, the European spread of the Romani people was via the Balkans starting ∼0.9 kya. The strong population substructure and high levels of homozygosity we found in the European Romani are in line with genetic isolation as well as differential gene flow in time and space with non-Romani Europeans. Overall, our genome-wide study sheds new light on the origins and demographic history of European Romani.

The plot to the left illustrates the relationship of the Romani to world-wide populations using multi-dimensional scaling, where genetic variation is decomposed into dimensions, and individuals are plotted on those dimensions. In short, the Romani exhibit a classic admixture cline pattern. That is, they are the products of a two-way admixture between populations which occupy distinct positions along a cline, and Romani individuals and populations are distributed along the cline in proportion to their admixture. One notable aspect is that the Romani are actually two clusters; one which manifests a strong ‘east’-‘west’ distribution, and another which seems located purely within the European cluster. The latter seems to be the Welsh Romani, who in the neighbor-joining tree (see the supplements) fall on the same branch as European populations, as opposed to the other Romani, who form their own clade.

To drill down further you need to ascertain admixture with a model-based clustering algorithm. Ergo, ADMIXTURE. I’ve reedited the figure to illustrate the salient points. In particular, it is clear that the Roma populations except the Welsh have significant South Asian ancestry. The question is how much? To answer this question you need to know the source population in South Asia. A peculiar aspect of this plot is that the Romani have very little of the green ancestral component, which happens to be modal in the Middle East (not shown). This element happens to be highly enriched in many Pakistani populations, but not necessarily northwest Indian ones. Nevertheless, the issue that leaves me suspicious of this particular finding is that many of the European populations, in particular those groups (e.g., Balkans) which may have admixed with the Romani, have this element to an extent not evident in one of their presumed ‘daughter’ populations. I wonder if perhaps the peculiarities of Romani inbreeding have skewed the allele frequency distribution so much that you get strangeness like this. I am not showing higher K’s because those break out with a Romani-cluster. Just like the Kalash-cluster this is to a great extent a feature of the long term endogamy of these communities. With high levels of drift the allele frequency of these groups moves into a very peculiar space in relation to their parental populations, but one must not become confused and assume that the Romani or Kalash are themselves appropriate independent clusters in the same way that Europeans or East Asians are.

Using various forms of admixture analysis the authors seem to conclude that the Balkan Romani are 30-50% South Asian. This seems in line with intuition. But that still leaves open the question of who those South Asians were. As I noted above the most thorough Y chromosomal data point to the lower caste elements of northwest India. What do the autosomes say?

I don’t want to get into the technical details of how they tested the models, but it seems that one of the likely parental populations to the Romani had a close relationship to the Meghwal, a scheduled caste from northwest India. In other words, the autosome results align very well with the Y chromosomal inferences. Additionally, the models tested imply that the Romani likely left South Asia ~1,000 years before the present, which aligns well with what is known from the historical record (though this is a case where I put much more stock in the historical record than inferences from population genetic models; look at the intervals).

Finally, there is the question of inbreeding. One aspect of the Romani genome that jumps out at you is that they have many long “runs-of-homozygosity” (ROH). This is totally expected, as decades of uniparental analyses suggested a great deal of population bottleneck events as the Romani spread throughout Europe. But the ROH patterns also unearth an interesting fact: some of the Balkan Romani clearly have recent European admixture, while the non-Balkan Romani had an initial period of admixture followed by endogamy. The latter scenario seems to resemble Ashkenazi Jews, while the former would suggest that the boundary between Romani and non-Romani in the Balkans is more fluid than is sometimes portrayed.
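
To make the idea of a run of homozygosity concrete, here is a minimal Python sketch. It assumes genotypes coded as 0, 1, or 2 copies of the alternate allele along one chromosome (an illustrative encoding, not the paper's pipeline), and it treats any stretch of at least min_snps consecutive homozygous sites as a run. The function name, the threshold, and the toy genotype vector are all hypothetical. Real ROH callers work in physical distance and tolerate the occasional heterozygous or missing call.

```python
def runs_of_homozygosity(genotypes, min_snps=5):
    """Return (start, end) index pairs for runs of homozygous calls.

    genotypes: sequence of 0/1/2 alternate-allele counts (1 = heterozygous).
    min_snps:  hypothetical minimum run length, counted in SNPs.
    """
    runs = []
    start = None  # index where the current homozygous stretch began
    for i, g in enumerate(genotypes):
        if g in (0, 2):          # homozygous site: open or extend a stretch
            if start is None:
                start = i
        else:                    # heterozygous site: close the stretch
            if start is not None and i - start >= min_snps:
                runs.append((start, i - 1))
            start = None
    # Close a stretch that reaches the end of the chromosome.
    if start is not None and len(genotypes) - start >= min_snps:
        runs.append((start, len(genotypes) - 1))
    return runs


# Toy data: two long homozygous stretches broken up by heterozygous sites.
toy = [0, 2, 2, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0, 2]
print(runs_of_homozygosity(toy))  # [(0, 4), (9, 14)]
```

Many long runs point to parents who share recent ancestors, which is why ROH serves as a readout of endogamy and bottlenecks.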

So there we have it. The Romani derive from lower caste populations from the northwest Indian subcontinent who seem to have left ~1,000 years ago. Over time they admixed with local populations, and are now 50-70% non-South Asian, with some groups being ~90% European (e.g., Welsh Romani). And, they have a long history as an endogamous group, judging by their inbreeding.

December 2, 2012

More diversity in the Personal Genome Project

Filed under: Genomics,Human Genomics,PGP — Razib Khan @ 1:42 pm

Seeking Diversity (Especially Families):

Because the PGP is self-recruiting, we don’t have a very balanced set of participants. “Self-recruitment” means that all participants have enrolled in our project through word of mouth, finding our website and enrolling online. To put it bluntly, that means we mostly end up with young white men….

…Research within one or two racial/ethnic categories isn’t necessarily a virtue, biracial and multiracial heritage may be even more interesting to some researchers and can open more areas for future….

In particular, NIST is looking for “trios”: two parents and a child. Researchers like to use samples from trios because they know every piece of DNA in the child comes from one of the parents. This makes it easier to assess error rates — and that sort of quality control is what NIST expects the genome material to be used for. We think all such family groups are valuable, but current trios in the PGP haven’t been the most diverse….

Reader Paul brought this to my attention. I haven’t been too interested in the PGP for myself because it’s just so slow to “play” with whole genomes (~3 GB) as opposed to 1 million SNPs. But over Christmas I’ll look into signing up, and see if they are interested in my own “trio.” I also thought I’d pass this along to readers, though my readership actually looks almost exactly like current PGP participants, so I don’t know if I’ll be contributing to the problem.

Layering genetic histories

Filed under: Anthropology,Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 12:14 pm

As a follow up to my post from yesterday, I decided to run TreeMix on a data set I happened to have had on hand (see Inference of Population Splits and Mixtures from Genome-Wide Allele Frequency Data for more on TreeMix). Basically I wanted to display a tree with, and without, gene flow.

The technical details are straightforward. I LD pruned ~550,000 SNPs down to ~150,000. I ran TreeMix without and with migration parameters, with the Bantu Kenya population as the root. Finally, when I did turn on the migration parameter I set it to 5. You can see the results below.

Most of the flows are pretty much as expected. The West Eurasian flow from the Turks to the Uygurs makes sense, because there is a large West Asian component to what the Uygurs have (from East Iranians?). The Chuvash are a Turkic group with minor, but significant, Turkic component. The HGDP Russian sample does have some East Eurasian ancestry. And the Moroccans also have African ancestry. But your guess is as good as mine with the Bantu flow in. These are, I think, from Kenya, so it might be trying to interpret Nilotic admixture as generalized Eurasian.

A minor note: installing TreeMix and generating the appropriate files from pedigree format is not too difficult. But you might be confused about how to generate the pedigree input file. You do it like so in PLINK:

./plink --noweb --bfile YourFile --freq --within YourGroupNamesFile --out YourOutPutFile

It’s the last of these, the output file, that you want to put into TreeMix’s Python conversion script. The YourGroupNamesFile is basically the .fam file with an extra column, the population names for each individual.
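
For anyone who would rather script the group names file, here is a minimal Python sketch. It assumes PLINK's documented cluster-file layout, which reads family ID, individual ID, and cluster label from the first three columns, so it keeps only the first two .fam columns and appends the label. The function name, the sample IDs, and the population labels are all hypothetical.

```python
def make_within_lines(fam_lines, population_of):
    """Build 'FID IID POPULATION' lines for PLINK's --within option.

    fam_lines:     lines of a .fam file (FID IID PAT MAT SEX PHENO).
    population_of: dict mapping individual ID -> population label
                   (a hypothetical lookup you supply yourself).
    """
    out = []
    for line in fam_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        fid, iid = fields[0], fields[1]
        out.append(f"{fid} {iid} {population_of[iid]}")
    return out


# Toy .fam content with made-up IDs and labels.
fam = [
    "F1 I1 0 0 1 -9",
    "F2 I2 0 0 2 -9",
    "F3 I3 0 0 1 -9",
]
labels = {"I1": "Bantu_Kenya", "I2": "Uygur", "I3": "Chuvash"}

within_lines = make_within_lines(fam, labels)
print("\n".join(within_lines))
```

Writing within_lines to a text file gives you something to pass as YourGroupNamesFile in the command above.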

December 1, 2012

Africa’s hidden people hold the keys to the past

I mentioned this in passing on my post on ASHG 2012, but it seems useful to make explicit. For the past few years there has been word of research pointing to connections between the Khoisan and the Cushitic people of Ethiopia. To a great extent in the paper which is forthcoming there is the likely answer to the question of who lived in East Africa before the Bantu, and before the most recent back-migration of West Eurasians. On one level I’m confused as to why this has to be something of a mystery, because the most recent genetic evidence suggests an admixture on the order of 2-3,000 years before the present.* If the admixture was so recent we should find many of the “first people,” no? As it is, we don’t. I think these groups, and perhaps the Sandawe, are the closest we’ll get.

Publication is imminent at this point (of this, I was assured), so I’m going to just state the likely candidate population (or at least one of them): the Sanye, who speak a Cushitic language with possible Khoisan influences. There really isn’t that much information on these people, which is why when I first heard about the preliminary results a few years back and looked around for Khoisan-like populations in Kenya I wasn’t sure I’d hit upon the right group. But at ASHG I saw some STRUCTURE plots with the correct populations, and the Sanye were one of them. I would have liked to see something like TreeMix, but the STRUCTURE results were of a quality that I could accept that these populations were not being well modeled by the variation which dominated their data set. Though Cushitic in language the Sanye had far less of the West Eurasian element present among other Cushitic speaking populations of the Horn of Africa. Neither were their African ancestral components quite like that of the Nilotic or Bantu populations. The clustering algorithm was having a “hard time” making sense of them (it seemed to want to model them as linear combinations of more familiar groups, but was doing a bad job of it).

Here is an interesting article on these groups: Little known tribe that census forgot. Like the Sandawe this is a population which seems to have been hunter-gatherers very recently, and to some extent still engages in this lifestyle. In this way I think they are fundamentally different from Indian tribal populations, who are often held up to be the “first people” of the subcontinent. More and more it seems that the tribes of India are less the descendants of the original inhabitants of the subcontinent, at least when compared to the typical Indian peasant, and more simply those segments of the Indian population which were marginalized and pushed into less productive territory. Over time they naturally diverged culturally because of their isolation, but the difference was not primal. In contrast, groups like the Sanye and Sandawe may have mixed to a great extent with their neighbors (and lost their language like the Pygmies), but evidence of full featured hunting & gathering lifestyles implies a sort of direct cultural continuity with the landscape of eastern Africa before the arrival of farmers and pastoralists from the west and north.

* I understand some readers refuse to accept the likelihood of these results because of other lines of information. I am just relaying the results of the geneticists. I am not interested in re-litigating prior discussions on this. We’ll probably have a resolution soon enough.

Northern Europeans and Native Americans are not more closely related than previously thought

A new press release is circulating on the paper which I blogged a few months ago, Ancient Admixture in Human History. Unlike the paper, the title of the press release is misleading, and unfortunately I notice that people are circulating it, and probably misunderstanding what is going on. Here’s the title and first paragraph:

Native Americans and Northern Europeans More Closely Related Than Previously Thought

Released: 11/30/2012 2:00 PM EST
Source: Genetics Society of America

Newswise — BETHESDA, MD – November 30, 2012 — Using genetic analyses, scientists have discovered that Northern European populations—including British, Scandinavians, French, and some Eastern Europeans—descend from a mixture of two very different ancestral populations, and one of these populations is related to Native Americans. This discovery helps fill gaps in scientific understanding of both Native American and Northern European ancestry, while providing an explanation for some genetic similarities among what would otherwise seem to be very divergent groups. This research was published in the November 2012 issue of the Genetics Society of America’s journal GENETICS

 

The reality is that Native Americans and Northern Europeans are not more “closely related” genetically than they were before this paper. There has been no great change to standard genetic distance measures or phylogeographic understanding of human genetic variation. A measure of relatedness is to a great extent a summary of historical and genealogical processes, and as such it collapses a great deal of disparate elements together into one description. What the paper in Genetics outlined was the excavation of specific historically contingent processes which result in the summaries of relatedness which we are presented with, whether they be principal component analysis, Fst, or model-based clustering.

What I’m getting at can be easily illustrated by a concrete example. To the left is a 23andMe chromosome 1 “ancestry painting” of two individuals. On the left is me, and the right is a friend. The orange represents “Asian ancestry,” and the blue represents “European” ancestry. We are both ~50% of both ancestral components. This is a correct summary of our ancestry, as far as it goes. But you need some more information. My friend has a Chinese father and a European mother. In contrast, I am South Asian, and the end product of an ancient admixture event. You can’t tell that from a simple recitation of ancestral quanta. But it is clear when you look at the distribution of ancestry on the chromosomes. My components have been mixed and matched by recombination, because there have been many generations between the original admixture and myself. In contrast, my friend has not had any recombination events between his ancestral components, because he is the first generation of that combination.
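
The same contrast can be put in a toy Python sketch. It assumes each homolog of a chromosome is a string of equal-sized bins labeled "A" (Asian) or "E" (European). The 20-bin resolution, the two made-up individuals, and the function names are purely illustrative and are not taken from real 23andMe output. Both individuals come out at 50% for each component, but the tract lengths tell two very different stories.

```python
from itertools import groupby


def ancestry_fraction(homologs, label):
    """Fraction of all bins, across both homologs, carrying this label."""
    total = sum(len(h) for h in homologs)
    return sum(h.count(label) for h in homologs) / total


def tract_lengths(homolog):
    """Lengths of consecutive same-ancestry tracts along one homolog."""
    return [len(list(group)) for _, group in groupby(homolog)]


# First-generation mix: one unbroken homolog from each parent.
first_generation = ["A" * 20, "E" * 20]

# Old admixture: recombination has chopped both homologs into short tracts.
old_admixture = ["AAEEEAAAAEEAEEEAAAEE", "EEAAAEEEEAAEAAAEEEAA"]

for name, person in [("first generation", first_generation),
                     ("old admixture", old_admixture)]:
    print(name, ancestry_fraction(person, "A"),
          [tract_lengths(h) for h in person])
```

The summary statistic is identical for both, while the tract lengths (one block of 20 versus many blocks of 1 to 4) carry the information about when the mixing happened.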

So what the paper publicized in the press release does is present methods to reconstruct exactly how patterns of relatedness came to be, rather than reiterating well understood patterns of relatedness. With the rise of whole-genome sequencing and more powerful computational resources to reconstruct genealogies we’ll be seeing much more of this to come in the future, so it is important that people are not misled as to the details of the implications.

November 11, 2012

The Genographic Project’s Scientific Grants Program

While I was at Spencer Wells’ poster at ASHG I was primarily curious about bar plots. He’s got really good spatial coverage, so I’m moderately excited about the paper (though I didn’t see much explicit testing of phylogenetic hypotheses, which I think this sort of paper has to do now; we’re beyond PCA and bar plots only papers). That being said, Spencer was more interested in me promoting the Scientific Grants Program. Here’s some more information:

The Genographic Project’s Scientific Grants Program awards grants on a rolling basis for projects that focus on studying the history of the human species utilizing innovative anthropological genetic tools. The variety of projects supported by the scientific grants will aim to construct our ancient migratory and demographic history while developing a better understanding of the phylogeographic structure of world populations. Sample research topics could include subjects like the origin and spread of the Indo-European languages, genetic insights into Papua New Guinea’s high linguistic diversity, the number and routes of migrations out of Africa, the origin of the Inca, or the genetic impact of the spread of maize agriculture in the Americas.

Recipients will typically be population geneticists, students, linguists, and other researchers or scientists interested ...

Reflections on the evolution at ASHG 2012

As most readers know I was at ASHG 2012. I’m going to divide this post in half. First, the generalities of the meeting. And second, specific posters, etc.

Generalities:

- Life Technologies/Ion Torrent apparently hires d-bag bros to represent them at conferences. The poster people were fine, but the guys manning the Ion Torrent Bus were total jackasses if they thought it would be funny/amusing/etc. Human resources acumen is not always a reflection of technological chops, but I sure don’t expect organizational competence if they (HR) thought it was smart to hire guys who thought (the d-bags) it would be amusing to alienate a selection of conference goers at ASHG. Go Affy & Illumina!

- Speaking of sequencing, there were some young companies trying to pitch technologies which will solve the problem of lack of long reads. I’m hopeful, but after the Pacific Biosciences fiasco of the late 2000s, I don’t think there’s a point in putting hopes on any given firm.

- I walked the poster hall, read the titles, and at least skimmed all 3,000+ posters’ abstracts. No surprise that genomics was all over the place. But perhaps a moderate ...

November 4, 2012

Inflammatory bowel syndrome is nature’s side effect

Last week Luke Jostins (soon to be Dr. Luke Jostins) published an interesting paper in Nature. To be fair, this paper has an extensive author list, but from what I understand this is the fruit of the first author’s Ph.D. project. In any case, you may know Luke because I have used his loess curve on hominin encephalization for years. His bread & butter is statistical genetics, and it shows in this Nature paper. God knows how he managed to cram so much density into ~5.5 pages of plain text. Luke is also a contributor to Genomes Unzipped, and has put up a post over there on one implication of the paper, Dozens of new IBD genes, but can they predict disease? The short answer is that for individual prediction complex traits are going to be a hard haul over the long term.*

They are subject to what Jim Manzi would term “high causal density.” A simple way to state this is that outcome X is dependent on a host of variables, and if you capture only a small number of variables, you aren’t going to be explaining much in a general ...
