Razib Khan One-stop-shopping for all of my content

March 23, 2017

Your ancestry inference is precise and accurate(ish)

Filed under: 23andMe,Ancestry,Culture,Family Tree DNA,Genetics,Genomics — Razib Khan @ 6:29 am

For about three years I consulted for Family Tree DNA. It was a great experience, and I met a lot of cool people through that connection. But perhaps the most interesting aspect was the fact that I can understand the various pressures that direct-to-consumer genomics firms face from the demand side. The science is one thing, but when you are working on a consumer-facing product, other variables come into play which you are not cognizant of when you are thinking of it from a point of pure analysis. I’m pretty sure that my insights working with Family Tree DNA can generalize to the other firms as well (23andMe, Ancestry, and Genographic*).

The science behind the ancestry inference elements of the product on offer is not particularly controversial or complex, but the customer aspect of how these results are received can become an intractable nightmare. The basic theory was outlined in the year 2000 in Pritchard et al.’s Inference of Population Structure Using Multilocus Genotype Data. You have lots of data thanks to better genomic technology (e.g., 300,000 SNPs). You have computers to analyze that data. And, you have scientific models of population history and dynamics which you can test that data against. The shape of the data will determine the parameters of the model, and it is those parameters that yield “your ancestry.”
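To make the "data determine the parameters" idea concrete, here is a minimal sketch in Python. It is not the published method of Pritchard et al., which uses MCMC and also estimates the population allele frequencies. This toy version assumes the allele frequencies of the source populations are already known (and strictly between 0 and 1), and swaps in a simple EM loop to estimate one individual's ancestry proportions. The function names and the simulated numbers are made up for illustration.

```python
import random


def simulate_genotypes(q_true, freqs, rng):
    """Simulate 0/1/2 allele counts for one admixed individual.

    q_true: ancestry proportions, one per source population (assumed known here).
    freqs:  freqs[k][l] is the allele frequency in population k at locus l.
    """
    genotypes = []
    for locus in range(len(freqs[0])):
        count = 0
        for _ in range(2):  # two allele copies per locus
            pop = rng.choices(range(len(q_true)), weights=q_true)[0]
            count += int(rng.random() < freqs[pop][locus])
        genotypes.append(count)
    return genotypes


def estimate_admixture(genotypes, freqs, n_iter=100):
    """EM estimate of ancestry proportions, given fixed population frequencies."""
    k = len(freqs)
    q = [1.0 / k] * k  # start from equal proportions
    n_copies = 2 * len(genotypes)
    for _ in range(n_iter):
        totals = [0.0] * k
        for locus, g in enumerate(genotypes):
            # g copies carry the allele, 2 - g copies do not
            for count, carries in ((g, True), (2 - g, False)):
                if count == 0:
                    continue
                weights = [
                    q[j] * (freqs[j][locus] if carries else 1.0 - freqs[j][locus])
                    for j in range(k)
                ]
                norm = sum(weights)
                for j in range(k):
                    # expected number of these copies drawn from population j
                    totals[j] += count * weights[j] / norm
        q = [t / n_copies for t in totals]
    return q


if __name__ == "__main__":
    rng = random.Random(0)
    # Two hypothetical source populations with unrelated allele frequencies.
    freqs = [[rng.uniform(0.05, 0.95) for _ in range(2000)] for _ in range(2)]
    genotypes = simulate_genotypes([0.7, 0.3], freqs, rng)
    print(estimate_admixture(genotypes, freqs))
```

The estimate is only as sharp as the frequency differences between the source populations allow: if the two rows of `freqs` were nearly identical, the proportions would be poorly determined.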

In broad sketches the results make sense for most people. It’s in the finer details that the confusions emerge. To the left you see my son’s 23andMe ancestry deconvolution. The color coding is such you can tell that his maternal and paternal chromosomes have very different ancestry profiles (mostly Northern European and South Asian, respectively).

But his “Northern European” chromosomes also are more richly colored, with alternative segments denoting ancestry from different parts of Northern Europe. So in terms of proportions I am told my son is about 15 percent French and German, and 10 percent Scandinavian and 10 percent British and Irish. This is reasonable. On the other side he’s nearly 50 percent “broadly South Asian.” The balance is accounted for by my East Asian ancestry, which is correct, as my South Asian ethnicity is from Bengal, where there is a fair amount of East Asian ancestry (my family’s origin is on the eastern edge of Bengal itself).

And it is here that the non-scientific concerns of consumer genomics come into focus. The genetic differences and distance between various South Asian groups are far higher than those between various Northern European groups. Depending on the statistical measure you use, intra-South Asian variation is about one order of magnitude greater than intra-Northern European differences. This is due to geographic partitioning, the caste system, and differential admixture in South Asians between extremely diverged ancestral elements (about half of South Asian ancestry is very similar to Europeans and Middle Easterners, and half of it is extremely different, so how far you are from the 50 percent mark determines a lot).

Broadly South Asian

In Northern Europe there is very little genetic variation from the British Isles all the way to the Baltic. The reason for this is historical: massive population turnover in the region 4,500 years ago means that much of the genetic divergence between the groups dates to the Bronze Age. It is this genetic divergence, the variation, that is the raw material for the inferences and proportions you see in ancestry calculators. There’s just not that much raw material for Northern Europeans.

Remember, the methods require lots of variation in the data as a raw input. You’re making the inference machine work real hard to produce a reasonably robust result if you don’t have that much variation. In contrast to the situation with Northern Europeans, with South Asians the companies are leaving raw material on the table, and just combining diverse groups together.

What’s going on here? As you might have guessed this is an economically motivated decision. Most South Asians know their general heritage due to caste and regional origins (though many Bengalis exhibit some lacunae about their East Asian ancestry). In contrast, many Americans of Northern European ancestry with an interest in genealogy are extremely curious about explicit proportional breakdowns between Northern European nationalities. The direct-to-consumer genomic firms attempt to cater to this demand as best as they can.

As I have stated many times, racial background is to various extents both biological and social. When it comes to the difference between Lithuanians and Nigerians the biological differences due to evolutionary history are straightforward, and clear and distinct. You can generate a phylogenetic history and perform a functional analysis of the differences. Additionally, you also have to note that the social differences exist, but are not straightforward. Like Lithuanians, Nigerians of Igbo background are generally Roman Catholic, while most other Nigerians are not. The linguistic differences between Nigerian languages are great enough that it is defensible to suggest that the Afro-Asiatic dialects of Hausa speakers are closer to Lithuanian in their phylogenetic history than to the dialects of the Yoruba.

A Lithuanian American

Contrast this to the situation where you differentiate Lithuanians from French. To any European the differences here are incredibly huge. The history of France, what was Roman Gaul, goes back 2,000 years. After the collapse of the Western Roman Empire by any measure the people who became French were at the center of European history. In contrast, Lithuanians were a marginal tribe, who did not enter Christian civilization until the late 14th century. In social-cultural terms, due to history, the differences between French and Lithuanians are extremely salient to people of French and Lithuanian ancestry. But genetically the differences are modest at best.

If a direct-to-consumer genetic testing company tells you that you are 90 percent Northern European and 10 percent West African, that is a robust result that has a clear historical genetic interpretation. The two elements of one’s ancestry have been relatively distinct for on the order of 100,000 years, with the Northern European element really just a proxy for non-Africans (though it is easy to drill-down within Eurasia). In contrast, notice how 23andMe, with some of the best scientists in the business, tells people they are “French-German,” and not French or German. What the hell is a “French-German”? Someone from Alsace-Lorraine? A German descendant of Huguenots? Obviously not.

“French-German” is a cluster almost certainly because there are no clear and distinct genetic differences between French and Germans. Yes, there is a continuum of allele frequencies between these two groups, but having looked at a fair number of people of French and German background in Family Tree DNA’s database I can tell you that France and Germany have a lot of local structure even among people of indigenous ancestry. Germans from the Rhineland are quite often genetically closer to French from Normandy than they are to Germans from eastern Saxony. Some of this is due to gene flow between neighboring regions, but some of this is due to cultural fluidity as to who exactly is German. It is clear that some Germans from the eastern regions are Germanized Slavs. Some Germans from the north exhibit strong affinities to Scandinavians, while Germans from Bavaria and Austria are classically Central European (whatever that means). The average German is distinct from the average French person, but the genetic clustering of the two groups is not clear and distinct.

Remember earlier I explained that the science is predicated on aligning data and models. The cultural model of Northern Europeans is conditioned on diversity and difference which has been very salient for the past few thousand years since the rise and fall of Rome. But the evolutionary genetic history is one where there are far fewer differences. The data do not fit a model that makes much sense to the average consumer (e.g., “you descend from a mix of Bronze Age migrants from the west-central steppe of Eurasia and Mesolithic indigenous hunter-gatherers and Neolithic farmers”). What makes sense to the average American consumer are histories of nationalities, so direct-to-consumer genetic companies try to satisfy this need. Because the needs of the consumer and their cultural expectations are poorly served by the data (genetic variation) and models of population history, you have a lot of awkward kludges and strange results.

Imagine, for example, you want to estimate how “German” someone is. What do you use for your reference population of Germans? Looking at the data there are clearly three major clusters within Germany when you weight the numbers appropriately, with affinities to the northern French, Slavs, and Scandinavians, and various proportions in between. Your selection of your sample is going to mean that some Germans are going to be more German than other Germans. If you select an eastern German sample then western Germans whose ancestors have been speaking a Germanic language far longer than eastern Germans are going to come out as less German. Or, you could just pick all of these disparate groups…in which case, lots of Northern Europeans become “German.”

Consumers want genetic tests to reflect strong cultural memories which were forged in the fires of the rapidly protean and distinction-making process of cultural evolution. But biological and cultural evolution exhibit different modes (the latter generates huge between group differences) and tempos (those differences emerge fast). The ancestry results many people get are the outcomes of compromises to thread the needle and square the circle.

All the above is half the story. Next I’ll explain why “deep history” has to be massaged to make recent history informative and comprehensible….

* Also, I have a little historical perspective because of my friendship with the person who arguably created this sector, Spencer Wells.

February 8, 2011

Health care costs and ancestry

Filed under: Ancestry,Genetics,Genomics,Health,Medical Genetics,race — Razib Khan @ 1:07 am

The Pith: In this post I examine the relationship between racial ancestry and cancer mortality risks conditioned on particular courses of treatment. I review research which indicates that the amount of Native American ancestry can be a very important signal as to your response to treatment if you have leukemia, as measured by probability of relapse.

If you are an engaged patient who has been prescribed medication I assume you’ve done your due diligence and double-checked your doctor’s recommendations (no, unfortunately an M.D. does not mean that an individual is omniscient). Several times when I’ve been prescribed a medication I have seen a note about different recommended dosages by race when I did further research. Because of my own personal background I am curious when it says “Asian.” The problem with this term in medical literature is that “Asian” in the American context is derived from a Census category constructed in 1980 for bureaucratic and political purposes. It amalgamates populations which are genetically relatively close, East and Southeast Asians, with more distant ones, South Asians (when my siblings were born I remember that my parents listed their race as “Asian” ...

December 15, 2010

“Genome blogging”

Nature profiles Dodecad, the Pickrell Affair, and the emergence of amateur genomicists in a new piece. Interestingly David of BGA is going to try and get something through peer review. In particular, the relationship of Assyrians and Jews.

So we have Genomes Unzipped, Dodecad, and BGA. What next? Who next? I hope Dienekes doesn’t mind if I divulge the fact that the computational resources needed to utilize ADMIXTURE as he has are within the theoretical capability of everyone reading this post. Rather, the key is getting familiar with PLINK and writing some code to merge data sets. After you do that, to really add value you’d probably want to get raw data from more than what you can find in the HGDP, HapMap and other public resources.

But here I make an open offer: if you start a blog or a project which replicates the methods of Dodecad and BGA I’ll link to you and promote you. When Dienekes began Dodecad I actually started to play around with the data sets in ADMIXTURE, but I’ve personally held off until seeing what he and David find. What their pitfalls and successes might be. Here’s to 2011 being more interesting than we can imagine!

Update: Already had a friend with a computational background contact me about doing something on South Asian genomics. So again: if you get a site/blog set up, and start pumping out plots, I will promote you. In particular, if you need 23andMe raw data files of geographical region X it might be useful to try and get the word out via blogs and what not.
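For anyone wondering what "writing some code to merge data sets" involves, much of it is bookkeeping. PLINK does the actual merge; the Python sketch below only illustrates the preliminary step of working out which variants two data sets have in common. It assumes PLINK-style .bim files (six whitespace-separated columns, with the variant ID in the second). The function names and the toy rs IDs are hypothetical, not taken from Dodecad or BGA.

```python
def variant_ids(bim_lines):
    """Collect variant IDs (second column) from the lines of a PLINK .bim file."""
    ids = set()
    for line in bim_lines:
        fields = line.split()
        if len(fields) >= 6:  # skip blank or malformed lines
            ids.add(fields[1])
    return ids


def shared_variants(bim_a, bim_b):
    """Return the sorted variant IDs present in both data sets."""
    return sorted(variant_ids(bim_a) & variant_ids(bim_b))


# Toy input; real data would come from reading the two .bim files line by line.
panel_a = ["1 rs0001 0 1000 A G", "1 rs0002 0 2000 C T", "2 rs0003 0 3000 G T"]
panel_b = ["1 rs0002 0 2000 C T", "2 rs0003 0 3000 G T", "2 rs0004 0 4000 A C"]
print(shared_variants(panel_a, panel_b))
```

A real merge also has to reconcile strand and allele coding between chips, which this sketch ignores.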

November 23, 2010

Eurogenes 500K SNP BioGeographicAncestry Project

Filed under: Admixture,Ancestry,BGA,Genetics,Genomes Unzipped,Genomics — Razib Khan @ 12:11 am

Since I have been promoting the Dodecad Ancestry Project, it seems only fair to bring to your attention Eurogenes 500K SNP BioGeographicAncestry Project. The sample populations are a bit different from Dodecad, but again ADMIXTURE is the primary tool. But the author also makes recourse to other methodologies to explore more than simply population level variation. For example, his most recent post is Locating and visualizing minority non-European admixtures across our genomes:

Imagine, for example, a white American carrying a couple of tiny segments of West African origin, from an ancestor who lived 250 years ago, and an eastern Finn with no Asian ancestors in the last 4000 years or more. If we run an inter-continental ADMIXTURE analysis with these two, it’s very likely the American will score 100% European, while the eastern Finn will probably come out around 9% North and East Asian due to really old Uralic influence.

That sort of thing isn’t a huge problem when comparing the genetic structure of populations. Obviously, overall, eastern Finns rather than white Americans are genetically closer to North Asians, and that’s basically what ADMIXTURE picks up. However, if the focus is also on individuals, this certainly can become an issue. Our hypothetical American might be aware of that African ancestor, with solid paperwork backing up their genealogical connection, but he’s pulling his hair out because nothing’s showing up via genetic tests.

So let’s take a look at a real life example of how RHHcounter can pick up segments of potentially recent Sub-Saharan African origin…

Olivia Munn & Uyghur woman

The basic issue here is that in terms of genomic variation old admixture looks different from new admixture. Someone who is a first generation Eurasian, with a Chinese and European parent, may be about the same ancestral mix proportionally as a Uyghur. They would resemble a Uyghur on STRUCTURE and be placed within that cluster on a PCA chart (this is what happens in 23andMe). But, the Uyghur “Eastern” and “Western” genetic heritage has been reshuffled to a great extent by recombination over the past 1,000-2,000 years. In contrast, a first generation Eurasian will have huge swaths of their genome which are Eastern or Western on alternating strands (from their respective parents). In population genetic language a group of first generation hybrids would exhibit a lot of linkage disequilibrium (LD). In a panmictic hybrid population LD will decay due to recombination, which breaks apart the distinctive allelic associations inherited from the parental populations.

This is the key to differentiating between the old “Asian” ancestry which sometimes falls out of the genetic variation of Finns at low frequencies, and more recent Asian ancestry. For example, the paleoanthropologist Vance Haynes is apparently a great-grandson of one of the original “Siamese Twins,” Chang Bunker. Chang Bunker was a Chinese Thai, so presumably Vance Haynes would come out to be ~10% Asian, and would be shifted toward the Asian cluster in relation to other Europeans. On the other hand, a closer look at his genome would indicate differences from a Turk who was ~10% Asian, because Vance Haynes’ Asian ancestry has only had three generations for recombination to break apart the original allelic associations which were passed down from Chang Bunker. After only these few generations the genome would still show many segments of clustered ancestry with distinctive sets of markers characteristic of Han Chinese.
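The arithmetic behind this can be sketched in a few lines of Python. Two standard textbook expectations are used: an ancestor g generations back contributes on average (1/2)^g of your genome, and under random mating the LD between two loci with recombination fraction r shrinks by a factor of (1 - r) each generation. The function names, the choice of r = 0.01, and the figure of 60 generations (roughly 1,500 years at an assumed 25 years per generation) are illustrative assumptions, not values from the analyses discussed here.

```python
def expected_ancestry_fraction(generations_back):
    """Average share of the genome inherited from one ancestor g generations back."""
    return 0.5 ** generations_back


def ld_remaining(recombination_fraction, generations):
    """Fraction of the initial LD left after random mating: (1 - r) ** t."""
    return (1.0 - recombination_fraction) ** generations


# A great-grandparent is three generations back.
print(expected_ancestry_fraction(3))

# LD between two loci with r = 0.01, after recent versus old admixture.
print(ld_remaining(0.01, 3))   # three generations: almost all of the LD is intact
print(ld_remaining(0.01, 60))  # assumed ~1,500 years: much of it has decayed
```

The first figure is an expectation of 12.5 percent, in line with the "~10%" above; the actual share inherited from any one great-grandparent varies from person to person.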

Let’s make this more concrete. Below are two “ancestry paintings” from 23andMe. One is of a reference example, a Uyghur woman, and another is of a Eurasian individual. The difference is pretty obvious:


For the record, 23andMe says that the Eurasian man is 50% Asian, 50% European. For the Uyghur woman, 52% European, 48% Asian. As I indicated above, Eurasian individuals who are projected onto the variation of the HGDP sample tend to cluster with the Uyghurs. In the image to the left the black mark indicates the Eurasian man. The Uyghurs are green. The purple rectangles are Hazaras.

But obviously this is a trivial example. What’s the point of sniffing around for non-European ancestry in individuals whose non-European ancestry is 1) visible, and 2) recent and immediate? No, the bigger question here concerns claims and suggestions by some white Americans that they have significant non-European ancestry. Usually this is Native American. But in the case of one of the European-origin samples which “Polako” (the principal behind the BGA Project) analyzed it seems there is a suggestion of West African ancestry.

This individual is Dr. Don Conrad of Genomes Unzipped. In particular, Polako found that there were two nearby segments on two chromosomes which exhibited a pattern of population atypical heterozygosity in Dr. Don Conrad’s genome. Look at chromosomes 7 and 13. Contrast the pattern with my distant paternal cousin, Dr. Daniel MacArthur. He also exhibits points of heterozygosity, but they’re randomly distributed across the genome. It’s old admixture or just noise.

Polako doesn’t make much of Dr. Don Conrad’s results, and neither do I (presumably as Dr. Don Conrad is a member of Genomes Unzipped it’s easy to talk about his results without any of the ethical or moral hassles about confidentiality). On the other hand, unlike Dr. Dan MacArthur, a little utilization of the powers of the interwebs indicates that Dr. Don Conrad is an American. In particular, of recent Midwestern background. Though I’m not a total creep, so I didn’t start poking around Ancestry.com. But after the Pickrell affair I am probably just a touch more hesitant to laugh off peculiar results from these sorts of analyses as simply algorithms-gone-meshugana.

Image Credit: Colegota

November 12, 2010

Was the Pocahontas exception necessary?

Filed under: Ancestry,Genetics,Genomics,race — Razib Khan @ 12:11 am

In Jonathan Spiro’s Defending the Master Race it is recounted that as American states were passing more robust anti-miscegenation laws and legally enshrining the concept of the one-drop-rule an exception was made in Virginia for those with 1/16th or less Native American ancestry. The reason for this was practical: many of the aristocratic “First Families of Virginia” claimed descent from Pocahontas. Included within this set was Senator Harry F. Byrd Sr. of Virginia, who was 1/16th Native American, being a great-great-grandson of Pocahontas. This sort of background was probably not exceptional among the “Founding Stock” of Anglo-Americans whose ancestors were resident within the boundaries of the American republic at independence. Only around 1700 did the white population of the American British colonies exceed the indigenous, so no doubt some amalgamation did occur.

But from what I’ve seen the extent of admixture with the indigenous substrate was very marginal, especially in comparison to white populations in Argentina or Brazil. Or so I thought. In conversation a friend recently claimed that over 50% of American whites were 5% or more non-European in ancestry. I expressed skepticism, and he dug up the citation. Genetic ancestry: A new look at racial disparities in head and neck cancer:

The study included 358 patients; 37 percent were African American.

The researchers examined diagnosis (late versus early stage) and overall survival for African Americans with HNSCC based on self-reported race and genetic West African ancestry.

During the past decade, many groups have developed and characterized sets of single nucleotide polymorphism markers that can distinguish genetic ancestry among major ethnic groups such as Asian and West African, called ancestry informative markers (AIMs).

For the study, genetic ancestry was based on a panel of 100 AIMs to estimate genetic background.

“Using these genetic markers gives you additional statistical power. It’s no longer just two categories – Black or White; it becomes a continuous variable. Race is not equal to genetics. Genetic markers don’t define specific races,” says Dr. Worsham.

Ultimately, the study found no correlation between West African genetic ancestry and HNSCC outcomes. Only self-reported race was associated with head and neck cancer stage.

Only 5 percent of self-reported African Americans had more than 95 percent West African ancestry, with 27 percent having less than 60 percent West African ancestry. By comparison, 48 percent who self-reported as Caucasian had more than 95 percent European American ancestry.

I’m not too worried about the number of markers. 100 should be sufficient on the scale of continents if well selected. But I’m curious about the representativeness of the sample. The African American one seems more European than others I’ve seen previously. And I really haven’t seen that much admixture with non-Europeans in the CEPH Utah white sample in the HapMap. But perhaps the Utah whites aren’t representative? Dienekes ran ADMIXTURE on the HapMap3 populations a few weeks ago, and I don’t see any elevated component of non-European ancestry in the Utah whites when compared to the Tuscans from Italy.


A factoid such as that less than 50% of white Americans are 95% or more European in ancestry can get traction quickly. But I think we should wait a bit and just get more samples. The results are from a presentation at a conference, not even a paper. Of course there’s a possibility that many people have more interesting backgrounds than multi-generational families which settled in Utah rather early. Time will tell.

Addendum: I believe that Native American admixture is going to be more common among the white Americans of the South than Yankees from New England. The reason I would give is that powerful and populous tribes and confederacies such as the Creek and Cherokee persisted in the Southern highlands far longer than in New England. The CEPH sample is going to be biased toward Yankees, as well as European converts from the British Isles and Scandinavia, so perhaps giving a somewhat lower result for non-European ancestry in American whites.

Addendum II: I thought about it more. Something went wrong in their analysis, or they had a very unrepresentative sample. Perhaps they had many Latinos and only coded their self-identified race and not ethnicity (50% of American Latinos identify as white). Maybe the AIMs aren’t good. I don’t know. But I do know that American genealogy buffs who assume Native American ancestry are often very disappointed. They seem to far outnumber those who find surprising non-white ancestry.
