Razib Khan One-stop-shopping for all of my content

September 18, 2018

On the whole genomics will not be individually transformative…for now

Filed under: Crispr,Genomics,Personal Genetics,Personal Genome,Personal genomics — Razib Khan @ 4:51 pm

A new piece in The Guardian, ‘Your father’s not your father’: when DNA tests reveal more than you bargained for, is an example of one of the two major genres in media writing on personal genomics right now (there are exceptions). First, there is the genre where genetics doesn’t do anything for you. It’s a waste of money! Second, there is the genre where genetics rocks our whole world, and it’s dangerous to one’s own self-identity. And so on. Basically, the two peaks in this field of journalism are the banal and the sinister.

In response to this I stated that for most people personal genomics will probably have an impact somewhere in the middle. To be fair, someone reading the headline of the comment I co-authored in Genome Biology, Consumer genomics will change your life, whether you get tested or not, may wonder at the seeming contradiction.

But the contradiction is not really there. On the aggregate social level genomics is going to have a non-trivial impact on health and lifestyle. These account for a large proportion of our GDP. So it’s “kind of a big deal” in that sense. But, for many individuals the outcomes will be quite modest. For a small minority of individuals there will be real and important medical consequences. In these cases the outcomes are a big deal. But for most people genetic dispositions and risks are diffuse, of modest effect, and often backloaded in one’s life. Even though it will impact most of society in the near future, its touch will be gentle.

An analogy here can be made with BMI, or body-mass-index. As an individual predictor and statistic it leaves a lot to be desired. But, for public health scientists and officials aggregate BMI distributions are critical to get a sense of the landscape.

Finally, this is focusing on genomics where we read the sequence (or get back genotype results). The next stage that might really be game-changing is the write revolution: CRISPR genetic engineering. In the 2020s I assume that CRISPR applications will mostly be in critical health contexts (e.g., “fixing” Mendelian diseases), or in non-human contexts (e.g., agricultural genetics). Like genomics, the ubiquity of genetic engineering will be kind of a big deal economically in the aggregate, but it won’t be a big deal for individuals.

If you are a transhumanist, or whatever they call themselves now, you can imagine a scenario where a large portion of the population starts “re-writing” themselves. That would be both a huge aggregate and individual impact. But we’re a long way from that….

August 26, 2018

There could be 100 million genotyping kits sold by January 1st 2020

Filed under: D.T.C. Personal Genomics,Personal genomics — Razib Khan @ 4:55 pm


The figure to the right is from the comment David Mittelman and I wrote for Genome Biology, Consumer genomics will change your life, whether you get tested or not. The original numbers are from ISOGG, which does a great job collating information from a variety of sources. When final revisions for the comment were due, we only found data up to 5/1/2018.

That being said, I thought it would be useful to generate a chart where I combined and smoothed the results from the various companies. It is clear that the period after 2016 is when you see massive takeoff and adoption, driven first by Ancestry, but later by 23andMe joining the race. The other companies have been increasing their sales as well, with new players such as MyHeritage making a big play.

All this makes me wonder: what does the future have in store? Year-to-year the total number of kits in circulation was doubling in 2013 and 2014. That rate dropped to ~1.6-fold increases in 2015 and 2016. A lot of this is due to 23andMe turning away from customer acquisition (more marketing always leads to more sales). With 23andMe competing with Ancestry again in 2017 one saw a >2.5-fold increase in the number of kits sold.

My back-of-the-envelope calculations indicate that around 1.8 million kits were being sold per month between the big players in the first 4 months of 2018. That’s about 18 million kits this year. That means 29 million kits total in circulation by January 1st of 2019. The wildcard here though is that this space is “consumer”, which means that a disproportionate number of kits are going to be sold between Halloween and Christmas. Extrapolating from the period between January 1st and May 1st, as I’m doing above, could be way too conservative.

The sales in markets outside of the USA, along with customer acquisition through marketing, need to keep increasing up until January 1st of 2020 for there to be 100 million kits sold. But I think it’s very possible. I’m on the bubble of saying even likely. The wholesale price of arrays (the chips) keeps decreasing, so the price point of the consumer product is also decreasing. This isn’t a situation where the market is growing linearly; it’s exponential. With a few positive shocks here and there, 100 million by January 1st of 2020 may seem conservative.
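To make the arithmetic behind that target explicit, here is a rough sketch in Python. It uses only figures quoted above (about 29 million kits in circulation by January 1st of 2019, the 100 million target, and the approximate year-over-year fold changes). The variable names are illustrative, and the 2017 value is a lower bound because the post gives it as ">2.5-fold".

```python
# Back-of-the-envelope check of the 100 million target, using figures from the post.
projected_2019 = 29e6   # projected kits in circulation by January 1st, 2019
target_2020 = 100e6     # hoped-for total by January 1st, 2020

# Approximate year-over-year fold increases quoted in the post (2017 is a lower bound).
fold_by_year = {2013: 2.0, 2014: 2.0, 2015: 1.6, 2016: 1.6, 2017: 2.5}

required_fold = target_2020 / projected_2019   # growth needed during 2019
kits_needed = target_2020 - projected_2019     # kits that must be sold in 2019
monthly_needed = kits_needed / 12              # average monthly sales in 2019

print(f"Required fold increase during 2019: {required_fold:.2f}x")
print(f"Kits to sell during 2019: {kits_needed / 1e6:.0f} million")
print(f"Average monthly sales needed: {monthly_needed / 1e6:.1f} million")
for year, fold in fold_by_year.items():
    print(f"{year}: ~{fold}x")
```

On these numbers the total has to grow roughly 3.4-fold during 2019, which is above the approximate fold changes quoted for 2013 through 2017. That is why sales have to keep accelerating for the target to be reached.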

Addendum: There has been some confusion in the media between sequencing and genotyping platforms. These are different technologies. Genotyping platforms, SNP-arrays, are targeting a genome-wide subset of polymorphisms. 23andMe’s current chip seems to probe about 630,000 markers. The whole genome consists of 3 billion bases. In the 2020s sequencing will probably replace targeted genotyping arrays in consumer products, but it will probably really come to the fore first in the medical space.
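For a sense of scale, the two figures in the addendum can be put side by side. This is only the arithmetic on the approximate numbers quoted above; the variable names are illustrative.

```python
markers_on_chip = 630_000       # approximate markers probed by the chip (from the post)
genome_bases = 3_000_000_000    # approximate size of the whole genome in bases

fraction = markers_on_chip / genome_bases
print(f"Share of bases assayed directly: {fraction:.4%}")
print(f"Roughly one marker per {genome_bases / markers_on_chip:,.0f} bases")
```

In other words, the array reads on the order of 0.02% of positions directly, which is the gap that sequencing closes.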

August 20, 2018

Consumer Genomics in 2018, beyond the future’s threshold

Filed under: Personal genomics — Razib Khan @ 9:07 pm

In 2013 David Mittelman and I wrote Rumors of the death of consumer genomics are greatly exaggerated. This was in the wake of the FDA controversy with 23andMe, and continuing worries about DNA and privacy. Today David and I came out with a new comment in Genome Biology, Consumer genomics will change your life, whether you get tested or not.

Really transformative technology becomes beneath comment. As long as we’re having to comment about genomics, it isn’t really mainstream. But I think in 2018 it is much clearer that the 2020s will see legitimate mainstreaming. The numbers speak for themselves. I hadn’t realized in a visceral manner how much had changed since our original comment came out. It’s pretty much an order of magnitude shift.

My hypothesis for why 23andMe plateaued for a while at ~1 million is that this was the sample size that gave them the statistical power they wanted to catch loci of particular effect sizes. In the initial years, 23andMe was not just buying customers with marketing, it was subsidizing the array costs. Today Illumina SNP arrays are well under $50 (some people say less than $25) wholesale. So I think at some point in early 2017 they realized that, even though 10 million wasn’t worth much more to them than 1 million for GWAS, they were going to lose the luster of being “market leader” to Ancestry, who were acquiring customers at a massive clip through their marketing. (My understanding is that at some point Illumina was having issues processing the samples that Ancestry was returning to them because the scale was so high; higher than Ancestry had anticipated!)

July 2, 2018

At least today we can explore personal genomics

A very long piece on the “personal genomics industry.” Lots of quotes from my boss Spencer Wells, since he has been in the game so long.

The piece covers all the bases. I actually think some of the criticisms of direct-to-consumer genetics are on base. I just don’t think they’re insoluble problems, or problems so large that they should discourage the industry from growing. I think part of the problem is that many of the people journalists can talk to who can comment on the industry are based in academia, and academia has a different focus when it comes to genetics than the nascent industry. For rational reasons academics need to be very careful when it comes to ethics. Consumer products I think are somewhat different.

But I do think we need to reflect on how far we’ve come in 10 years. Back in the 2000s when I was reading stuff on Y, mtDNA and autosomal studies, I honestly didn’t imagine that I would know my own haplogroups and genome-wide ancestry decomposition. It seemed like science fiction. That all changed rather rapidly over a few years, and I purchased kits in the early years when the price was still high. Today it’s a mass industry, with a sub-$100 price point in many cases.

Yes, there are plenty of cautions and worries we need to consider. But the future is already the present, and the horse has left the barn.

April 10, 2018

Notes from the personal genomic inflection point

Filed under: D.T.C. Personal Genomics,DTC personal genomics,Personal genomics — Razib Khan @ 11:21 pm


There’s a debate that periodically crops up online about the utility, viability, and morality of returning results from genetic tests to consumers. Consumers here means people like you or me. Pretty much everyone.

If you want to caricature two stylized camps, there are information maximalists who proclaim a utopia now, where people can find out so much about themselves through their genome. And then there are information elitists, who emphasize that the public can’t handle the truth. Or, more accurately, that throwing information without context and interpretation from someone who knows better is not just useless, it’s dangerous.

Of course, most people will stake out more nuanced, complex positions. That’s not the point. Here is my bottom-line, which I’ve probably held since ~2010:

  1. The value for most people in actionable information in direct-to-consumer genetics is probably not there yet when set against the cost.
  2. With the reduction in the cost of genotyping and sequencing, there’s no way that we have enough trained professionals to handle the surfeit of information. And there will really be no way in 10 years when a large proportion of the American population will be sequenced.

At some point, the cost will come down enough, and the science will probably be strong enough, that direct-to-consumer genetics moves away from novelty and early adopters to the mass market. At that point, we need to be able to make the best use of that data. Genetic counselors, geneticists, and doctors all cost a fair amount of money and have a finite amount of labor supply to provide to the public. They need to focus on serious, complex, and consequential cases.

To some extent, we need to reduce much of the interpretation in the personal genomics space to an information technology problem. For example, if someone’s genotype pulls out a bunch of statistically significant hits of interest, the tool should automatically condition significance on that individual’s genetic background.

Yes, there are primitive forms of these sorts of tools out there already. But they’re not good enough. And that’s because there isn’t the market need. But there will be.

April 5, 2018

Our family’s pedigree in 23andMe

Filed under: 23andMe,D.T.C. Personal Genomics,Personal genomics — Razib Khan @ 7:18 pm


With 23andMe’s new update to its ancestry, the results for my family have changed. Not for me, since I’m not of European descent, and this looks Euro-focused (no surprise). But my wife and kids are different.

My wife has two great-grandparents who were born in Norway. 23andMe is picking that up immediately. It also picks it up in my children, from left to right, my daughter, my younger son, and my older son. With more than 3 million in their database 23andMe has knowledge of which haplotypes are unique to Norway, and which are not. When you click “Norway,” it says “We predict you had ancestors that lived in Norway within the last 200 years.” That’s telling me that they detect IBD segments uniquely found in Norwegian populations of a particular length threshold.

My youngest is on a new chip, but I’m not sure I believe some of the European admixture estimates. The two boys exhibit very little drop off in Scandinavian. But my daughter is way lower. This is not unreasonable, but they exhibit differences in East Asian ancestry. I’ve looked but I can’t detect this on PCA plots. My daughter is in fact more distant from Han Chinese than my sons.

March 6, 2018

The 23andMe BRCA test

In case you were sleeping under a rock, 23andMe got FDA approval for DTC testing of markers related to BRCA risk. Obviously, this is a pretty big step, in principle.

But the short-term implications are not that earth-shaking.

From the FDA release:

The three BRCA1/BRCA2 hereditary mutations detected by the test are present in about 2 percent of Ashkenazi Jewish women, according to a National Cancer Institute study, but rarely occur (0 percent to 0.1 percent) in other ethnic populations. All individuals, whether they are of Ashkenazi Jewish descent or not, may have other mutations in BRCA1 or BRCA2 genes, or other cancer-related gene mutations that are not detected by this test. For this reason, a negative test result could still mean that a person has an increased risk of cancer due to gene mutations….

Apparently, women with one of these variants have a 45-85% chance of developing breast cancer by age 70. So the penetrance is high.

It seems that you’ll know if this sort of test is going to have utility for you based on family history.

The big thing is the transition to DTC. This will increase availability and drive the price down. That’s probably going to mean more work for those engaged in interpretation and education. False positives are going to start being a major thing….
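One way to see why false positives matter more as testing spreads beyond high-risk groups is a positive-predictive-value calculation. The sketch below uses the carrier frequencies from the FDA release (about 2 percent, and 0.1 percent at the upper end for other populations). The sensitivity and specificity values are hypothetical, chosen only for illustration; they are not the measured performance of 23andMe’s test.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that a positive result is a true positive (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical error rates, used only to show the effect of prevalence.
SENSITIVITY = 0.99
SPECIFICITY = 0.999

# Carrier frequencies taken from the FDA release quoted in the post.
ppv_high_prevalence = positive_predictive_value(0.02, SENSITIVITY, SPECIFICITY)
ppv_low_prevalence = positive_predictive_value(0.001, SENSITIVITY, SPECIFICITY)

print(f"PPV at 2% carrier frequency:   {ppv_high_prevalence:.1%}")
print(f"PPV at 0.1% carrier frequency: {ppv_low_prevalence:.1%}")
```

With the same assumed error rates, a positive result is much more likely to be real in the higher-prevalence group than in the lower-prevalence one, so broad DTC availability shifts more of the burden onto confirmation and interpretation.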

December 12, 2017

Helix kit price waived until December 26 at 2:59am EST

Filed under: Personal genomics — Razib Khan @ 7:56 am

Happy Hanukkah! My main qualm with wishing you a happy holiday is that I’m a thorough assimilator and I don’t want to be disemboweled.

For the context, listen to the Stuff You Missed in History Class episode on the Maccabean Revolt. As a Jewish friend of mine once observed, the Maccabees were kind of the Al-Qaeda of their day (today she would have said ISIS).

With that out of the way, I want to give you a heads up that Helix has a sale going until December 26 at 2:59am EST where the $80 kit cost for purchase of any app is waived if you haven’t purchased an app before. Just enter the promotion code HOLIDAY at checkout.

That means presales of Insitome’s Regional Ancestry are no more than $19.99, while Neanderthal is $29.99 and Metabolism is $39.99 (this applies to all of Helix’s products except embodyDNA by Lose It! and Geno 2.0 by National Geographic).

Why does it matter? Again, Helix banks a high quality exome+ (the + is for non-exonic positions) when you purchase any of their apps. If you want subsequent apps you don’t have to send another kit in, you just buy the app and get the results. Also, I do have to say that from what I’ve seen and heard Helix’s laboratory facilities are top-notch in terms of getting results turned around rapidly.

December 8, 2017

Genomic ancestry tests are not cons, part 2: the problem of ethnicity

Filed under: Personal genomics — Razib Khan @ 12:51 am

The results to the left are from 23andMe for someone whose paternal grandparents were immigrants from southern Germany. Their mother had a father who was of English American background (his father was a Yankee American with an English surname and his mother was an immigrant from England), and grandparents who were German (Rhinelander) and French Canadian respectively on their maternal side.

Looking at the results from 23andMe one has to wonder: why is this individual only a bit under 25% French & German, when genealogical records show places of birth that indicate they should be 75% French & German (more precisely, 62.5% German and 12.5% French)? Though their ancestry is 25% English, only 13% of their ancestry is listed as such.
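The genealogical expectation quoted here follows from simple pedigree arithmetic: each grandparent contributes an expected 1/4, and each great-grandparent an expected 1/8. A minimal sketch in Python, with labels and variable names that are illustrative and taken from the description of this person’s family above:

```python
from collections import defaultdict
from fractions import Fraction

# (label, expected share) for each ancestral line described in the post.
ancestor_lines = [
    ("German", Fraction(1, 4)),   # paternal grandfather, southern Germany
    ("German", Fraction(1, 4)),   # paternal grandmother, southern Germany
    ("English", Fraction(1, 4)),  # maternal grandfather, English American
    ("German", Fraction(1, 8)),   # maternal great-grandparent, Rhinelander
    ("French", Fraction(1, 8)),   # maternal great-grandparent, French Canadian
]

expected = defaultdict(Fraction)  # Fraction() starts at 0
for label, share in ancestor_lines:
    expected[label] += share

for label, share in sorted(expected.items()):
    print(f"{label}: {float(share):.1%}")
```

These are expected values from the pedigree. The share of DNA actually inherited from each grandparent varies around 1/4 because of recombination, which is separate from the labeling issue discussed in the rest of this post.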

First, notice that nearly half of their ancestry is “Broadly Northwestern European.” Last I checked, 23andMe uses phased haplotypes to detect segments of ancestry. This is a very powerful method and is often quite good at zeroing in on people of European ancestry. But with Americans of predominant, but mixed, Northern European background, rather than giving back precise proportions you often obtain results of the form “Broadly…” because, presumably, recombination has generated novel haplotypes in white Americans.

But this isn’t the whole story. Why, for example, are many of the Finnish people I know on 23andMe assigned as >90% Finnish, while a Danish friend is 40% Scandinavian?

The issue here is that to be “Finnish” and “Scandinavian” are not equivalent units in terms of population genetics. Finns are a relatively homogeneous ethnic group who seem to have undergone a recent population bottleneck. In contrast, Scandinavia encompasses several different, albeit related, ethnicities which are geographically widely distributed.

Ethnic identities are socially and historically constructed. Additionally, they are often clear and distinct. This is not always the case for population genetic classifications. On a continental scale, racial classification is trivial, and feasible with only a modest number of genetic markers. Why? Because the demographic and evolutionary histories of Melanesians and West Africans, to give two concrete examples, are distinct over tens of thousands of years. Population genetic analyses which attempt to identify or differentiate these groups have a lot of raw material to work with.

To the left, you see a PCA plot of Papuans, Yoruba, and Swedes. They are clear and distinct populations. I pruned the marker set down to 750 SNPs. Now, since these were SNPs selected to be variable in human populations, they aren’t just random markers. They are biased toward being informative of population history. That being said, notice how distinct the groups are.

The Yoruba and Swedes and Papuans are separated by 50,000 to 100,000 years of history. That history is reflected in the genetic variation. And the social construct of an ethnocultural identity is nested within that demographic history. The Yoruba people are a coherent cultural unit. Similarly, the Swedes emerged in the last 1,000 years through a fusion of tribes such as the Geats and Svear. The Papuans are a different case, as “Papuan” brackets a whole range of groups. To a great extent, one can argue that a self-conscious Papuan identity is a product of the 20th century, because of political forces (the independence of Papua New Guinea), and large-scale contact with Europeans and Austronesians. Nevertheless, when comparing extremely different groups, an artificial catchall ethnic identity such as “Papuan” is quite informative.

Using the same marker set I plotted individuals from the Yoruba and Esan ethnic groups from the southwest and south of Nigeria, respectively. It is immediately clear that you can barely differentiate the Esan from the Yoruba genetically. At least with 750 SNPs.

The Esan and Yoruba have distinct identities, but culturally they are not too distinct from each other. They even share some traditional deities. Being close neighbors there has likely been a great deal of gene flow, as the shared common ancestors are much closer in time to the present than in the cases I illustrated above.

But when I increased the marker set to ~250,000 SNPs the Yoruba and Esan were clearly distinct populations. This is not surprising. Often today we are wont to assert that ethnic identities are recent historically contingent creations. The reality is many ethnic identities were assembled out of clear and distinct preexistent elements, which had their own history, and so could be reflected in genetics.
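The marker-count effect can be reproduced on simulated data. The sketch below is not the analysis behind the plots in this post: it simulates two populations rather than using real samples, and `drift_sd` (how far each population’s allele frequencies wander from a shared ancestor) and the 75,000-marker panel are arbitrary assumptions picked to keep the run quick. It runs PCA with NumPy and reports how far apart the two groups sit on the first principal component.

```python
import numpy as np

def simulate_two_populations(n_snps, drift_sd, n_per_pop=50, seed=0):
    """Simulate 0/1/2 genotypes for two populations that drifted from one ancestor.

    drift_sd is a made-up knob: the SD of each population's allele-frequency
    shift away from the shared ancestral frequency.
    """
    rng = np.random.default_rng(seed)
    ancestral = rng.uniform(0.1, 0.9, size=n_snps)
    blocks = []
    for _ in range(2):
        shifted = ancestral + rng.normal(0.0, drift_sd, size=n_snps)
        freq = np.clip(shifted, 0.01, 0.99)
        blocks.append(rng.binomial(2, freq, size=(n_per_pop, n_snps)))
    genotypes = np.vstack(blocks).astype(float)
    labels = np.repeat([0, 1], n_per_pop)
    return genotypes, labels

def pc1_separation(genotypes, labels):
    """Gap between population means on PC1, in pooled within-population SDs."""
    centered = genotypes - genotypes.mean(axis=0)
    # PCA via the small (individuals x individuals) Gram matrix.
    eigvals, eigvecs = np.linalg.eigh(centered @ centered.T)
    pc1 = eigvecs[:, -1] * np.sqrt(eigvals[-1])  # eigh sorts eigenvalues ascending
    a, b = pc1[labels == 0], pc1[labels == 1]
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    return abs(a.mean() - b.mean()) / pooled_sd

# Strongly diverged pair, small panel (a stand-in for a continental-scale contrast).
distant_few = pc1_separation(*simulate_two_populations(750, drift_sd=0.15))
# Closely related pair, small panel, then the same pair with many more markers.
close_few = pc1_separation(*simulate_two_populations(750, drift_sd=0.015))
close_many = pc1_separation(*simulate_two_populations(75_000, drift_sd=0.015))

print(f"distant pair, 750 SNPs:    {distant_few:.1f} SDs apart on PC1")
print(f"close pair, 750 SNPs:      {close_few:.1f} SDs apart on PC1")
print(f"close pair, 75,000 SNPs:   {close_many:.1f} SDs apart on PC1")
```

In this toy setup the strongly diverged pair should separate cleanly with 750 markers, while the closely related pair should only pull apart once the panel is much larger. That is the same qualitative pattern as the Yoruba and Esan plots, though the specific numbers depend entirely on the assumed drift values.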

That being said, the closer two ethnic groups are geographically and socioculturally, the more likely the two groups are to overlap genetically (more precisely, they can be much harder to differentiate). Sometimes though genetics and culture are very different. The Basque people of northern Spain and southwest France are only mildly genetically distinct from their Romance-speaking neighbors, but they are an ethnolinguistic isolate. The cultural chasm in language is huge. But the genetic chasm is much smaller.

Scandinavia is a coherent ethnolinguistic category which encompasses various northern Germanic people who were relatively untouched by Roman cultural influences. This is in contrast to many Germanic tribes to the south, such as the Franks, who emerged in dynamic tension with the rise of the Roman Empire. The final Scandinavian conversion to Christianity, and so admission into the post-Roman European world, began about two centuries after the conversion of the pagan Saxons by Charlemagne.

Later, the two centuries of the Kalmar Union brought all the modern nations of Scandinavia under one ruler. Today, the concept of Norden, which includes non-Scandinavian Finland, expresses the cultural and social connections of the northern peoples.

And yet genetically the reality is more muddled. Looking at samples of Germans, Danes, Swedes and Norwegians, the geographic patterning is clear. Danes occupy a position between Germans on the one hand, and Norwegians and Swedes on the other. Because of Sami ancestry in many Norwegians and Sami and Finnish ancestry in many Swedes they are genetically distinct from continental Germanic peoples to the south, including Danes.

So what is a Scandinavian? A Scandinavian is a Swede, Dane, or Norwegian (or an Icelander). Scandinavians share 1,000 years of history since their integration into the European system. As a cultural category Scandinavians are clear and distinct.

But as a genetic cluster things are not so clear. First, there is the Danish connection to Germany. This is due to both history and geography. People from northern Germany are clearly genetically close to the Danes. While the Angles and Jutes were from modern Denmark, the Saxons were from northern Germany. Yet in Britain, they fused seamlessly into one people. Before the mass conversion of the continental Saxons under the Carolingians, the cultural barriers between the peoples of Jutland and Saxony must have been marginal at best.

Second, an enormous number of Swedes in particular seem to be highly admixed with Finnic peoples. Many Swedes are highly “Finn-shifted”, both due to Sami assimilation in the past few hundred years, and the long history of Finnish migration into Sweden (which dominated Finland either politically or culturally for nearly 1,000 years). But culturally, and in their ethnolinguistic identity, these people are nothing but Scandinavian at this point.

Going back to the results of the 23andMe user above, who genealogically is more than 60% German, but comes back as 25% German, how to make sense of it? Anyone who has looked at German data realizes that it is very difficult to identify a ‘prototypical’ German. Germans are people who speak Germanic languages, whose ancestors emerged out of the European Bronze Age, when much of Northern European population structure was established. But being at the center of Europe means that Germans have been subject to gene flow from peoples in all directions. Also, some ethnic Germans in the eastern regions clearly descend from Slavic tribes, and more recently there were migrations of peoples such as French Huguenots.

A PCA of Danes, English, French, and Germans shows differences across the groups. But Germans overlap a great deal with the English, and a substantial minority overlap with Danes. Also, many more of the Germans are “French-shifted” than the English.

The point is that to be German is to be many things. At least in the context of Northern European peoples.

There are powerful methods of ancestry inference using more information than just genotypes, such as fineSTRUCTURE. And, there are methods relying on rare variants, which allow for much more fine-grained distinctions. But all these methods suffer from the fact that one has to define populations with labels in the first place. Genetically Germany has several closely related clusters, and all of them are arguably authentically German.

Because ethnolinguistic categories are constructions of human history and social preferences they do not always map onto genetic differences at a fine-grain. But, because ethnolinguistic categories were created by humans to give intelligibility to national and cultural variation they are incredibly powerful ways in which to communicate classification to the general public.

Some people believe that personal genomics tests are wrong and false because of discrepancies such as the one I highlight in this post. Actually, the issue is that the language we use shapes our preconceptions, and these companies are attempting to leverage categories and classes which are highly informative to give us a general sense of the patterns they are detecting. Language does not shape reality, but it shapes our perception of reality. To say someone is 25% French-German is more informative to the end-user than to say someone is 25% Generic Continental North European, even though really they are basically the same thing. And yet, if you told someone they were 25% Generic Continental North European they might be less likely to cross-reference that result with their genealogy, because the term is so expansive and vague that one does not assume ethnolinguistic precision.

Ultimately I don’t think there is a right answer on this sort of issue. My own preference is clearly to avoid national and ethnic terms to which people bring their own preconceptions. At least when possible.


Going back to the results of the 23andMe user above, who genealogically is more than 60% German, but comes back as 25% German, how to make sense of it? Anyone who has looked at German data realizes that it is very difficult to identify a ‘prototypical’ German. Germans are people who speak Germanic languages, whose ancestors emerged out of the European Bronze Age, when much of Northern European population structure was established. But being at the center of Europe means that Germans have been subject to gene flow from peoples in all directions. Also, some ethnic Germans in the eastern regions clearly descend from Slavic tribes, and more recently there were migrations of peoples such as French Huguenots.

A PCA of Danes, English, French, and Germans shows differences across the groups. But Germans overlap a great deal with the English, and a substantial minority overlap with Danes. Also, many more of the Germans are “French-shifted” than the English.

The point is that to be German is to be many things. At least in the context of Northern European peoples.

There are powerful methods of ancestry inference using more information than just genotypes, such as fineSTRUCTURE. And there are methods relying on rare variants, which allow for much more fine-grained distinctions. But all these methods suffer from the fact that one has to define populations with labels in the first place. Genetically Germany has several closely related clusters, and all of them are arguably authentically German.

Because ethnolinguistic categories are constructions of human history and social preferences they do not always map onto genetic differences at a fine-grain. But, because ethnolinguistic categories were created by humans to give intelligibility to national and cultural variation they are incredibly powerful ways in which to communicate classification to the general public.

Some people believe that personal genomics tests are wrong and false because of discrepancies such as the one I highlight in this post. Actually, the issue is that the language we use shapes our preconceptions, and these companies are attempting to leverage categories and classes which are highly informative to give us a general sense of the patterns they are detecting. Language does not shape reality, but it shapes our perception of reality. To say someone is 25% French-German is more informative to the end-user than to say someone is 25% Generic Continental North European, even though really they are basically the same thing. And yet, if you told someone they were 25% Generic Continental North European they might be less likely to cross-reference that result with their genealogy, because the term is so expansive and vague that one does not assume ethnolinguistic precision.

Ultimately I don’t think there is a right answer on this sort of issue. My own preference is clearly to avoid national and ethnic terms to which people bring their own preconceptions. At least when possible.

December 4, 2017

Genomic ancestry tests are not cons, part 1

Filed under: DTC personal genomics,Personal genomics — Razib Khan @ 6:45 pm

As someone who is part of the personal genomics sector, I keep track of media representations of the industry very closely. There is the good and the bad, some justified and some not.

But there is one aspect which I need to weigh in on because it is close to my interests and professional focus, and it is one where I have a lot of experience: ancestry inference on human data.

Periodically I see in my Twitter timeline an article shared by a biologist which is filled with misrepresentations, confusions, and even falsehoods. Of course, some of the criticisms are correct. The problem is that when you mix truth and falsehood or sober analysis and critique with sensationalism the whole product is debased.

I’m going to address some of the most basic errors and misimpressions. This post is “part 1” because I might have follow-ups, as I feel like this is a situation where I have to put out fires periodically, as people write about things they don’t know about, and then those articles get widely shared to a credulous public.

First, if an article mentions STRs or microsatellites or a test with fewer than 1,000 markers in a direct to consumer genomic context, ignore the article. This is like a piece where the author dismisses air travel because it’s noisy due to propeller-driven planes. Propeller-driven planes are a very small niche. Similarly, the major direct to consumer firms which have sold close to ~10 million kits do not use STRs or microsatellites, very much a technology of the 1990s and 2000s. Any mention of STRs or microsatellites or low-density analyses indicates the journalist didn’t do their homework, or simply doesn’t care to be accurate.

Second, there is constant harping on the fact that different companies give different results. This is because tests don’t really give results so much as interpretations. The raw results consist of your genotype. On the major SNP-chip platforms this will be a file on the order of 20 MB. The companies could provide this as the product, but most humans have difficulty grokking over 100,000 variables.

So what’s the solution? The same that scientists have been using for decades: reduce the variation into a much smaller set of elements which are human digestible, often through tables or visualization.

For example, consider a raw data set consisting of my three genotypes from 23andMe, Ancestry, and Family Tree DNA. Merged with public data these are ~201,000 single nucleotide markers. You can download the plink formatted data yourself and look at it. The PCA below shows where my three genotypes are positioned, by the Tamil South Asians. Observe that my genotypes are basically at the same point:

The differences between the different companies have nothing to do with the raw data, because with hundreds of thousands of markers they capture enough of the relevant between population differences in my genome (do you need to flip a coin 1 million times after you’ve flipped it 100,000 times to get a sense of whether it is fair?). The law of large numbers is kicking in at this point, with genotyping errors on the order of 0.5% not being sufficient to differentiate the files.
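To put numbers on the coin analogy, here is a minimal Python sketch (not part of the original analysis) that computes the standard error of the observed heads fraction for a fair coin. The function name and the choice of a rough 95% interval are assumptions made for illustration.

```python
import math

def standard_error(n_flips, p=0.5):
    """Standard error of the observed heads fraction after n_flips tosses."""
    return math.sqrt(p * (1 - p) / n_flips)

for n in (1_000, 100_000, 1_000_000):
    se = standard_error(n)
    # A rough 95% interval is the estimate plus or minus 1.96 standard errors.
    print(f"{n:>9,} flips: +/- {1.96 * se:.4f}")
```

Going from 100,000 to 1 million flips only tightens the interval from roughly 0.3 to 0.1 percentage points either way, which is the sense in which the extra flips add little.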

Sure enough raw genotype files of the three services match pretty closely. 99.99% for Family Tree DNA and 23andMe, 99.7% for Family Tree DNA and Ancestry, and 99.6% for Ancestry and 23andMe. For whatever reason Ancestry is the outlier here. My personal experience looking at genotype data from Illumina chips is that most are pretty high quality, but it’s not shocking to see instances with 0.5% no call or bad call rates. For phylogenetic purposes if the errors are not systematic it’s not a big deal.
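A match rate like the ones above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure actually used for the figures in this post: the marker IDs and genotype strings are made up, the `"--"` no-call symbol is an assumption about the file format, and strand flips between platforms are not handled.

```python
def concordance(calls_a, calls_b, no_call="--"):
    """Fraction of shared markers with identical genotype calls.

    calls_a and calls_b map a marker ID to an unphased genotype string
    such as "AG". Markers missing or uncalled in either file are skipped.
    """
    compared = matched = 0
    for marker, geno_a in calls_a.items():
        geno_b = calls_b.get(marker)
        if geno_b is None or no_call in (geno_a, geno_b):
            continue
        compared += 1
        # Sort alleles so that "AG" and "GA" count as the same call.
        if sorted(geno_a) == sorted(geno_b):
            matched += 1
    return matched / compared if compared else float("nan")

# Hypothetical toy calls, not real data.
file_a = {"rs1": "AG", "rs2": "CC", "rs3": "TT", "rs4": "--"}
file_b = {"rs1": "GA", "rs2": "CT", "rs3": "TT", "rs4": "AA"}
print(concordance(file_a, file_b))
```

In the toy data one marker is skipped as a no-call and two of the remaining three agree.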

The identity to other populations is consistent. About 74% to Tamils. 72-73% for other Eurasians. 71% for the Surui, an isolated Amazonian group. And 69% to Yoruba. Observe that this recapitulates the phylogenetic history of what we know for the population which I am from, Bengalis. The greater the genetic distance between two populations due to distinct evolutionary histories the greater the genetic divergence. This is not rocket science. This gets to the point that the raw results make a lot more sense when you integrate and synthesize them with other information you have. Most customers are not going into the process of getting a personal genomic ancestry test blind…but that causes pitfalls as well as opportunities.
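The post does not say exactly how these identity figures were computed. One common definition is mean identity-by-state, where genotypes are coded as allele counts and each marker is scored by how many alleles two people share. A minimal sketch under that assumption, with hypothetical function names and toy data:

```python
def allele_sharing(g1, g2):
    """Mean identity-by-state between two genotype vectors.

    Genotypes are coded as counts of a reference allele (0, 1, or 2).
    Each marker scores 1.0 for identical genotypes, 0.5 for a difference
    of one allele, and 0.0 for opposite homozygotes.
    """
    if len(g1) != len(g2):
        raise ValueError("genotype vectors must cover the same markers")
    return sum(1 - abs(a - b) / 2 for a, b in zip(g1, g2)) / len(g1)

def identity_to_population(individual, population):
    """Average allele sharing between one individual and each member of a group."""
    return sum(allele_sharing(individual, other) for other in population) / len(population)

# Hypothetical toy genotypes at five markers, not real data.
me = [0, 1, 2, 1, 0]
group = [[0, 1, 2, 2, 0], [1, 1, 2, 0, 0]]
print(identity_to_population(me, group))
```

With real data the same average would be taken over hundreds of thousands of markers and every reference individual in the population.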

But most people do not receive statistics of the form:

SNP Identity
You Yoruba 0.69
You German 0.72
You Japanese 0.73
You Tamil 0.74

Mind you, this is informative. It’s basically saying I am most genetically distant from Yoruba and closer in sequence to Tamils. But this is somewhat thin gruel for most people. Consider the below which is a zoom in of PC 2 vs. PC 4. I am blue and the purple/pink are Tamils, and the population at the bottom left are East Asians.

If you look at enough PCA plots it becomes rather clear I am shifted toward East Asians in comparison to most other South Asians. The high identity that I have with Japanese and Dai is due in part to the fact that I have relatively recent admixture from an East Asian population, above and beyond what is typical in South Asians. Remember, all three of my genotypes are basically on the same spot on PCA plots. That’s because they’re basically the same. Genotyping error is rather low.

How do we summarize this sort of information for a regular person? The standard method today is giving people a set of proportions with specific population labels. Why? People seem to understand population labels and proportions, but can be confused by PCA plots. Additionally, the methods that give out populations and proportions are often better at capturing pulse admixture events relatively recent in time than PCA, and for most consumers of ancestry services, this is an area that they are particularly focused on (i.e., Americans).

An easy way to make one’s genetic variation comprehensible to the general public is to model it as a mixture of various populations that they already know of. So consider the ones above in the plink file. I ran ADMIXTURE in supervised mode, progressively removing populations, for my three genotypes. The results are below.

  Dai Druze German Japanese Papuan Sardinian Surui Tamil Yoruba
Razib23andMe 11% 3% 8% 4% 1% 0% 1% 73% 1%
RazibAncestry 10% 2% 8% 4% 1% 0% 1% 73% 1%
RazibFTDNA 11% 2% 8% 3% 1% 0% 1% 72% 1%
                   
  Dai Druze German Japanese Papuan Sardinian Surui Tamil  
Razib23andMe 11% 3% 8% 4% 1% 0% 1% 73%  
RazibAncestry 10% 3% 8% 4% 1% 0% 1% 74%  
RazibFTDNA 11% 3% 8% 3% 1% 0% 1% 73%  
                   
  Dai Druze Japanese Papuan Surui Tamil      
Razib23andMe 10% 9% 4% 1% 1% 74%      
RazibAncestry 10% 9% 4% 1% 1% 75%      
RazibFTDNA 11% 9% 4% 1% 1% 74%      
                   
  Dai Japanese Surui Tamil          
Razib23andMe 11% 4% 1% 84%          
RazibAncestry 10% 4% 1% 85%          
RazibFTDNA 11% 3% 1% 84%        

Please observe again that they are broadly congruent. These methods exhibit a stochastic element, so there is some noise baked into the cake, but with 200,000+ markers and a robust number of reference populations the results come out the same across all methods (also, 23andMe and Family Tree DNA seem to correlate a bit more, which makes sense since these two genotypes are more similar to each other than they are to Ancestry).

Observe that until I remove all other West Eurasian populations the Tamil fraction in my putative ancestry is rather consistent. Why? Because my ancestry is mostly Tamil-like, but social and historical evidence would point to the likelihood of some exogenous Indo-Aryan component. Additionally, seeing as how very little of my ancestry could be modeled as West African removing that population had almost no impact.

When there were three West Eurasian populations, Germans, Druze, and Sardinians, the rank order was in that sequence. After removing Germans and Sardinians, the Druze picked up most of that ancestral component. This is a supervised method, so I’m assigning the empirical populations as reified clusters which can be used to reconstitute the variation you see in my own genotype. No matter what I put into the reference data, the method tries its best to assign proportions to populations.

The question then becomes one of the subtle choices one makes to obtain the most informative inferences for the customer. These are not always matters of different results in terms of accuracy or precision, but often of presentation. If West Eurasian populations are removed entirely, my Tamil fraction inflates. That’s because Tamils are the closest to the West Eurasian populations among those left in the data. In contrast, the East Asian fraction remains the same because I’ve left the two proxy populations in the data (I rigged the die here because I know I have Tibeto-Burman admixture which is a combination of Northeast and Southeast Asian).

Let’s do something different. I’m going to swap out the West Eurasian populations with equivalents.

  Armenians Dai French_Basque Japanese Mandenka Surui Sweden Tamil
Razib23andMe 6% 11% 0% 4% 1% 1% 5% 72%
RazibAncestry 5% 11% 0% 4% 1% 1% 5% 73%
RazibFTDNA 6% 11% 0% 4% 1% 1% 5% 72%
                 
German Papuan Yoruba          
Razib23andMe 68% 20% 13%          
RazibAncestry 68% 20% 13%          
RazibFTDNA 68% 20% 13%          
                 
French_Basque Tamil            
Razib23andMe 8% 92%            
RazibAncestry 7% 93%            
RazibFTDNA 8% 92%            
                 
Tamil Yoruba            
Razib23andMe 97% 3%            
RazibAncestry 97% 3%            
RazibFTDNA 97% 3%          

I have no ancestry from French Basque, but I do have ancestry from Armenians and Swedes in this model. Why? If you keep track of the most recent population genomic ancestry this all makes sense. But if you don’t, well, it’s harder to unpack. This is part of the problem with these sorts of tests: how to make it comprehensible to the public while maintaining fidelity to the latest research.

This is not always easy, and differences between companies in terms of interpretation are not invidious as some of the press reports would have you think, but a matter of difficult choices and trade-offs one needs to make to give value to customers. True, this could all be ironed out if there was a ministry of genetic interpretation and a rectification of names in relation to population clusters, but right now there isn’t. This both allows for brand differentiation and engenders confusion.

In most of the models with a good number of populations, my Tamil ancestry is in the low 70s. Notice then that some of these results are relatively robust to the populations one specifies. Some of the patterns are so striking and clear that one would have to work really hard to iron them out and mask them in interpretation. But what happens when I remove Tamils and include populations I’m only distantly related to? This is a ridiculous model, but the algorithm tries its best. My affinity is greatest to Germans, both because of shared ancestry, and in the case of Papuans, their relatively high drift from other East Eurasians and Denisovan ancestry. But both Papuan and Yoruba ancestry are assigned because I’m clearly not 100% German, and I share alleles with both these populations. In models where there are not enough populations to “soak up” an individual’s variation, but you include Africans, it is not uncommon for African ancestry to show up at low fractions. If you take Europeans, Africans, and East Asians, and force two populations out of this mix, then Europeans are invariably modeled as a mix of Africans and East Asians, with greater affinity to the latter.

Even when you model my ancestry as Tamil or Yoruba, you see that there is a Yoruba residual. I have too much genetic variation that comes from groups not closely related to the variation you find in Tamils to eliminate this residual.

Just adding a few populations fixes this problem:

  Dai Tamil Yoruba  
Razib23andMe 14% 83% 2%  
RazibAncestry 14% 84% 2%  
RazibFTDNA 14% 83% 2%  
         
  Dai German Tamil Yoruba
Razib23andMe 15% 10% 74% 1%
RazibAncestry 14% 9% 75% 1%
RazibFTDNA 15% 10% 74% 1%

Notice how my Tamil fraction is almost the same as when I had included many more reference populations. Why? My ancestral history is complex, like most humans, but it’s not that complex. The goal for public comprehensibility is to reduce the complexity into digestible units which give insight.

Of course, I could just say read Inference of Population Structure Using Multilocus Genotype Data. The basic framework was laid out in that paper 17 years ago for model-based clustering of the sort that is very common in direct to consumer services (some use machine learning and do local ancestry decomposition across the chromosome, but really the frameworks are an extension of the original logic). But that’s not feasible for most people, including journalists.

Consider this piece at Gizmodo, Why a DNA Test Is Actually a Really Bad Gift. I pretty much disagree with a lot of the privacy concerns, seeing as how I’ve had my public genotype downloadable for seven years. But this portion jumped out at me: “Ancestry tests are based on sound science, but variables in data sets and algorithms mean results are probabilities, not facts, as many people expect.”

Yes, there are probabilities involved. But if a DNA test using the number of markers above tells you you are 20% Sub-Saharan African and 80% European in ancestry, that result carries the same sort of confidence as your determining that a coin is fair after 100,000 flips. True, you can’t be totally sure after 100,000 flips that you have a fair coin, but you can be pretty confident. With hundreds of thousands of markers, a quantum of 20% Sub-Saharan African in a person of predominantly European heritage is an inference made with a degree of confidence that verges upon certitude within a percentage or so.

As for the idea that they are not “facts.” I don’t even know what that means in this context. And I doubt the journalist does either. Which is one of my main gripes with these sorts of stories: unless they talk to a small subset of scientists the journalists just don’t know what they are talking about when it comes to the statistical genetics.

Finally, there is the issue of what it even means to be some percent of population X, Y, or Z. Even many biologists routinely reify and confuse the population clusters with something real and concrete in a Platonic sense. But deep down when you think about it we all need to recall we’re collapsing genealogies of many different segments of DNA into broad coarse summaries when we say “population.” And populations themselves are by their nature often somewhat open and subject to blending and flow with others. A population genomic understanding of structure does not bring into clarity Platonic facts, but it gives one instruments and tools to smoke out historical insight.

The truth, in this case, is not a thing in and of itself, but a dynamic which refines our intuitions of a fundamentally alien process of Mendelian assortment and segregation.

Genomic ancestry tests are not cons, part 1

Filed under: DTC personal genomics,Personal genomics — Razib Khan @ 6:45 pm

As someone who is part of the personal genomics sector, I keep track of media representations of the industry very closely. There is the good and the bad, some justified and some not.

But there is one aspect which I need to weigh in on because it is close to my interests and professional focus, and it is one where I have a lot of experience: ancestry inference on human data.

Periodically I see in my Twitter timeline an article shared by a biologist which is filled with either misrepresentation, confusions, and even falsehoods. Of course, some of the criticisms are correct. The problem is that when you mix truth and falsehood or sober analysis and critique with sensationalism the whole product is debased.

I’m going to address some of the most basic errors and misimpressions. This post is “part 1” because I might have follow-ups, as I feel like this is a situation where I have to put out fires periodically, as people write about things they don’t know about, and then those articles get widely shared to a credulous public.

First, if an article mentions STRs or microsatellites or a test with fewer than 1,000 markers in a direct to consumer genomic context, ignore the article. This is like an piece where the author dismisses air travel because it’s noisy due to propeller-driven planes. Propeller-driven planes are a very small niche. Similarly, the major direct to consumer firms which have sold close to ~10 million kits do not use STRs or microsatellites, very much a technology for the 1990s and 2000s. Any mention of STRs or microsatellites or low-density analyses indicate the journalist didn’t do their homework, or simply don’t care to be accurate.

Second, there is constant harping on the fact that different companies give different results. This is because tests don’t really give results as much is interpretations. The raw results consist of your genotype. On the major SNP-chip platforms this will be a file on the order of 20 MBs. The companies could provide this as the product, but most humans have difficulty grokking over 100,000 variables.

So what’s the solution? The same that scientists have been using for decades: reduce the variation into a much smaller set of elements which are human digestible, often through tables or visualization.

For example, consider a raw data set consisting of my three genotypes from 23andMe, Ancestry, and Family Tree DNA. Merged with public data these are ~201,000 single nucleotide markers. You can download the plink formatted data yourself and look at it. The PCA below shows where my three genotypes are positioned, by the Tamil South Asians. Observe that my genotypes are basically at the same point:

The differences between the different companies have nothing to do with the raw data, because with hundreds of thousands of markers they capture enough of the relevant between population differences in my genome (do you need to flip a coin 1 million times after you’ve flipped it 100,000 times to get a sense of whether it is fair?). The law of large numbers is kicking in at this point, with genotyping errors on the order of 0.5% not being sufficient to differentiate the files.

Sure enough raw genotype files of the three services match pretty closely. 99.99% for Family Tree DNA and 23andMe, 99.7% for Family Tree DNA and Ancestry, and 99.6% for Ancestry and 23andMe. For whatever reason Ancestry is the outlier here. My personal experience looking at genotype data from Illumina chips is that most are pretty high quality, but it’s not shocking to see instances with 0.5% no call or bad call rates. For phylogenetic purposes if the errors are not systematic it’s not a big deal.

The identity to other populations is consistent. About 74% to Tamils. 72-73% for other Eurasians. 71% for the Surui, an isolated Amazonian group. And 69% to Yoruba. Observe that this recapitulates the phylogenetic history of what we know for the population which I am from, Bengalis. The greater the genetic distance between two populations due to distinct evolutionary histories the greater the genetic divergence. This is not rocket science. This gets to the point that the raw results make a lot more sense when you integrate and synthesize them with other information you have. Most customers are not going into the process of getting a personal genomic ancestry test blind…but that causes pitfalls as well as opportunities.

But most people do not receive statistics of the form:

SNP Identity
You Yoruba 0.69
You German 0.72
You Japanese 0.73
You Tamil 0.74

Mind you, this is informative. It’s basically saying I am most genetically distant from Yoruba and closer in sequence to Tamils. But this is somewhat thin gruel for most people. Consider the below which is a zoom in of PC 2 vs. PC 4. I am blue and the purple/pink are Tamils, and the population at the bottom left are East Asians.

If you looked at enough PCA plots it will become rather clear I am shifted toward East Asians in comparison to most other South Asians. The high identity that I have with Japanese and Dai is due in part to the fact that I have relatively recent admixture from an East Asian population, above and beyond what is typical in South Asians. Remember, all three of my genotypes are basically on the same spot on PCA plots. That’s because they’re basically the same. Genotyping error is rather low.

How do we summarize this sort of information for a regular person? The standard method today is giving people a set of proportions with specific population labels. Why? People seem to understand population labels and proportions, but can be confused by PCA plots. Additionally, the methods that give out populations and proportions are often better at capturing pulse admixture events relatively recent in time than PCA, and for most consumers of ancestry services, this is an area that they are particularly focused on (i.e., Americans).

An easy way to make one’s genetic variation comprehensible to the general public is to model them as a mixture of various populations that they already know of. So consider the ones above in the plink file. I ran ADMIXTURE in supervised model progressively removing populations for my three genotypes. The results are below.

  Dai Druze German Japanese Papuan Sardinian Surui Tamil Yoruba
Razib23andMe 11% 3% 8% 4% 1% 0% 1% 73% 1%
RazibAncestry 10% 2% 8% 4% 1% 0% 1% 73% 1%
RazibFTDNA 11% 2% 8% 3% 1% 0% 1% 72% 1%
                   
  Dai Druze German Japanese Papuan Sardinian Surui Tamil  
Razib23andMe 11% 3% 8% 4% 1% 0% 1% 73%  
RazibAncestry 10% 3% 8% 4% 1% 0% 1% 74%  
RazibFTDNA 11% 3% 8% 3% 1% 0% 1% 73%  
                   
  Dai Druze Japanese Papuan Surui Tamil      
Razib23andMe 10% 9% 4% 1% 1% 74%      
RazibAncestry 10% 9% 4% 1% 1% 75%      
RazibFTDNA 11% 9% 4% 1% 1% 74%      
                   
  Dai Japanese Surui Tamil          
Razib23andMe 11% 4% 1% 84%          
RazibAncestry 10% 4% 1% 85%          
RazibFTDNA 11% 3% 1% 84%        

Please observe again that they are broadly congruent. These methods exhibit a stochastic element, so there is some noise baked into the cake, but with 200,000+ markers and a robust number of reference populations the results come out the same across all methods (also, 23andMe and Family Tree DNA seem to correlate a bit more, which makes sense since these two genotypes are more similar to each other than they are to Ancestry).

Observe that until I remove all other West Eurasian populations the Tamil fraction in my putative ancestry is rather consistent. Why? Because my ancestry is mostly Tamil-like, but social and historical evidence would point to the likelihood of some exogenous Indo-Aryan component. Additionally, seeing as how very little of my ancestry could be modeled as West African removing that population had almost no impact.

When there were three West Eurasian populations, Germans, Druze, and Sardinians, the rank order was in that sequence. Removing Germans and Sardinians and the Druze picked up most of that ancestral component. This a supervised method, so I’m assigning the empirical populations as reified clusters which can be used to reconstitute the variation you see in my own genotype. No matter what I put into the reference data, the method tries its best to assign proportions to populations.

The question then comes into the stage of subtle choices one makes to obtain the most informative inferences for the customer. These are not always matters of different results in terms of accuracy or precision, but often of presentation. If West Eurasian populations are removed entirely, my Tamil fraction inflates. That’s the closest to the West Eurasian populations left in the data. In contrast, the East Asian fraction remains the same because I’ve left the two proxy populations in the data (I rigged the die here because I know I have Tibeto-Burman admixture which is a combination of Northeast and Southeast Asian).

Let’s do something different. I’m going to swap out the West Eurasian populations with equivalents.

  Armenians Dai French_Basque Japanese Mandenka Surui Sweden Tamil
Razib23andMe 6% 11% 0% 4% 1% 1% 5% 72%
RazibAncestry 5% 11% 0% 4% 1% 1% 5% 73%
RazibFTDNA 6% 11% 0% 4% 1% 1% 5% 72%
                 
German Papuan Yoruba          
Razib23andMe 68% 20% 13%          
RazibAncestry 68% 20% 13%          
RazibFTDNA 68% 20% 13%          
                 
French_Basque Tamil            
Razib23andMe 8% 92%            
RazibAncestry 7% 93%            
RazibFTDNA 8% 92%            
                 
Tamil Yoruba            
Razib23andMe 97% 3%            
RazibAncestry 97% 3%            
RazibFTDNA 97% 3%          

I have no ancestry from French Basque, but I do have ancestry from Armenians and Swedes in this model. Why? If you keep track of the most recent population genomic ancestry this all makes sense. But if you don’t, well, it’s harder to unpack. This is part of the problem with these sorts of tests: how to make it comprehensible to the public while maintaining fidelity to the latest research.

This is not always easy, and differences between companies in terms of interpretation are not invidious as some of the press reports would have you think, but a matter of difficult choices and trade-offs one needs to make to give value to customers. True, this could all be ironed out if there was a ministry of genetic interpretation and a rectification of names in relation to population clusters, but right now there isn’t. This allows for both brand differentiation and engenders confusion.

In most of the models with a good number of populations, my Tamil ancestry is in the low 70s. Notice then that some of these results are relatively robust to the populations one specifies. Some of the patterns are so striking and clear that one would have to work really hard to iron them out and mask them in interpretation. But what happens when I remove Tamils and include populations I’m only distantly related to? This is a ridiculous model, but the algorithm tries its best. My affinity is greatest to Germans, both because of shared ancestry, and in the case of Papuans, their relatively high drift from other East Eurasians and Denisovan ancestry. But both Papuan and Yoruba ancestry are assigned because I’m clearly not 100% German, and I share alleles with both these populations. In models where there are not enough populations to “soak up” an individual’s variation, but you include Africans, it is not uncommon for African ancestry to show up at low fractions. If you take Europeans, Africans, and East Asians, and force two populations out of this mix, then Europeans are invariably modeled as a mix of Africans and East Asians, with greater affinity to the latter.

Even when you model my ancestry as Tamil or Yoruba, you see that there is a Yoruba residual. I have too much genetic variation that comes from groups not closely related to the variation you find in Tamils to eliminate this residual.

Just adding a few populations fixes this problem:

               Dai  Tamil  Yoruba
Razib23andMe   14%  83%    2%
RazibAncestry  14%  84%    2%
RazibFTDNA     14%  83%    2%

               Dai  German  Tamil  Yoruba
Razib23andMe   15%  10%     74%    1%
RazibAncestry  14%  9%      75%    1%
RazibFTDNA     15%  10%     74%    1%

Notice how my Tamil fraction is almost the same as when I included many more reference populations. Why? My ancestral history is complex, like that of most humans, but it’s not that complex. The goal for public comprehensibility is to reduce the complexity into digestible units which give insight.

Of course, I could just say read Inference of Population Structure Using Multilocus Genotype Data. The basic framework was laid out in that paper 17 years ago for model-based clustering of the sort that is very common in direct to consumer services (some use machine learning and do local ancestry decomposition across the chromosome, but really the frameworks are an extension of the original logic). But that’s not feasible for most people, including journalists.

Consider this piece at Gizmodo, Why a DNA Test Is Actually a Really Bad Gift. I pretty much disagree with a lot of the privacy concerns, seeing as how I’ve had my public genotype downloadable for seven years. But this portion jumped out at me: “Ancestry tests are based on sound science, but variables in data sets and algorithms mean results are probabilities, not facts, as many people expect.”

Yes, there are probabilities involved. But if a DNA test using the number of markers above tells you that you are 20% Sub-Saharan African and 80% European in ancestry, that probability carries the same sort of confidence as your determining that a coin is fair after 100,000 flips. True, you can’t be totally sure after 100,000 flips that you have a fair coin, but you can be pretty confident. With hundreds of thousands of markers, a quantum of 20% Sub-Saharan African in a person of predominantly European heritage is an inference made with a degree of confidence that verges upon certitude within a percentage or so.
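
To make that concrete, here is a minimal Python sketch of how the precision of an ancestry fraction tightens as markers are added. Everything in it is a toy assumption of mine rather than any company’s pipeline: two hypothetical source populations “A” and “B”, allele frequencies drawn uniformly at random and treated as known, markers treated as independent (real markers are linked, and real populations are usually less differentiated than this), and a plain maximum-likelihood fit for a single individual. The function names are illustrative only.

```python
import math
import random


def simulate_genotypes(freqs_a, freqs_b, frac_a, rng):
    """Draw 0/1/2 allele counts for one admixed individual.

    Each of the two allele copies at a marker is the counted allele with
    probability frac_a * f_A + (1 - frac_a) * f_B.
    """
    genotypes = []
    for fa, fb in zip(freqs_a, freqs_b):
        p = frac_a * fa + (1.0 - frac_a) * fb
        genotypes.append(int(rng.random() < p) + int(rng.random() < p))
    return genotypes


def log_likelihood(frac_a, genotypes, freqs_a, freqs_b):
    """Binomial log-likelihood of the genotypes (constant terms dropped)."""
    total = 0.0
    for g, fa, fb in zip(genotypes, freqs_a, freqs_b):
        p = frac_a * fa + (1.0 - frac_a) * fb
        total += g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return total


def estimate_fraction(genotypes, freqs_a, freqs_b, tol=1e-4):
    """Maximise the concave log-likelihood over [0, 1] by ternary search."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if (log_likelihood(m1, genotypes, freqs_a, freqs_b)
                < log_likelihood(m2, genotypes, freqs_a, freqs_b)):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0


def run_demo(n_markers, true_frac=0.20, seed=1):
    """Simulate one individual who is 20% 'A' and re-estimate that fraction."""
    rng = random.Random(seed)
    # Frequencies kept away from 0 and 1 so the logarithms stay finite.
    freqs_a = [rng.uniform(0.05, 0.95) for _ in range(n_markers)]
    freqs_b = [rng.uniform(0.05, 0.95) for _ in range(n_markers)]
    genotypes = simulate_genotypes(freqs_a, freqs_b, true_frac, rng)
    return estimate_fraction(genotypes, freqs_a, freqs_b)


if __name__ == "__main__":
    for n in (200, 2_000, 20_000):
        print(f"{n:>6} markers: estimated fraction from A = {run_demo(n):.3f}")
```

The error of the estimate shrinks roughly with the square root of the number of markers, which is the same logic as the coin: more flips, tighter answer. The sketch says nothing about whether “A” and “B” are the right populations to put in the model, which is a separate question.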

As for the idea that they are not “facts,” I don’t even know what that means in this context. And I doubt the journalist does either. Which is one of my main gripes with these sorts of stories: unless they talk to a small subset of scientists, the journalists just don’t know what they are talking about when it comes to the statistical genetics.

Finally, there is the issue of what it even means to be some percentage of population X, Y, or Z. Even many biologists routinely reify and confuse the population clusters with something real and concrete in a Platonic sense. But deep down when you think about it we all need to recall we’re collapsing genealogies of many different segments of DNA into broad coarse summaries when we say “population.” And populations themselves are by their nature often somewhat open and subject to blending and flow with others. A population genomic understanding of structure does not bring into clarity Platonic facts, but it gives one instruments and tools to smoke out historical insight.

The truth, in this case, is not a thing in and of itself, but a dynamic which refines our intuitions of a fundamentally alien process of Mendelian assortment and segregation.

December 2, 2017

South Asian Genotype Project

Filed under: Personal genomics,South Asian Genotype Project — Razib Khan @ 6:02 pm


It’s been a few years since I’ve done any serious “Genome Blogging.” Mostly I’ve been very busy and there isn’t much low-hanging fruit left as it is. But today I want to announce that I’ll be running the generically titled “South Asian Genotype Project.”

The way it works is simple: send me a 23andMe, Ancestry, or Family Tree DNA raw genotype file to contactgnxp -at- gmail.com (though 23andMe’s new chip has far less overlap with the other platforms than its earlier chips did, so it’s probably best if you were typed before August 2017).

In the subject please put:

  1. “South Asian Genotype Project”
  2. The state/province your family is from
  3. Ethnolinguistic group
  4. If applicable, caste

In the body of the email you can put Y and mtDNA and any other information you want. Obviously your data is confidential and I won’t identify you by name, just ethnolinguistic group and such.

Since the last time I did this I have some scripts that make this a lot easier, so hopefully I’ll be adding individuals to this spreadsheet every few days. I’ll give project members an ID and try to email them when the results are up.

The main motivator for this project on my part is that people still ask me questions about Sinhalese, Nasrani Christians, and other assorted groups which we don’t have answers to because current research projects haven’t focused on them.

Since Zack worked on the Harappa Ancestry Project we know a lot more about South Asian ancestry. Basically, there is an ANI-ASI cline, and some South Asians have exogenous ancestry off this cline. Indian Jews have Middle Eastern ancestry, while Bengalis have East Asian ancestry, and some groups in Pakistan have African ancestry. With that in mind I’ll be testing a smaller number of populations. The marker set is 240,000 SNPs by the way.

Below are some representative results. You can see that my results from three DTC services are basically the same. Also, some South Indian groups (see Pulliyar) show “Dai” ancestry, when I’m pretty sure it’s just that I didn’t sample as much on the extreme portion of the ASI-cline.

ID                           Armenians  Belorussian  C_India  Dai  Nigerian  NWIndia  S_India  YemeniteJews
Balochi                      34%        1%           0%       0%   0%        66%      0%       0%
Bangladesh_Razib (23andMe)   0%         0%           14%      14%  0%        15%      57%      0%
Bangladesh_Razib (Ancestry)  0%         0%           14%      14%  0%        15%      57%      0%
Bangladesh_Razib (ftDNA)     0%         0%           13%      14%  0%        15%      58%      0%
Chenchus                     0%         0%           1%       1%   0%        0%       98%      0%
Dharkars                     0%         0%           16%      2%   0%        38%      44%      0%
Dusadh                       0%         0%           21%      1%   0%        2%       76%      0%
Iranians                     65%        2%           1%       2%   0%        20%      0%       10%
Kallar                       0%         0%           0%       0%   0%        0%       100%     0%
Kurumba                      0%         0%           0%       0%   0%        4%       96%      0%
Meghawal                     0%         0%           10%      0%   0%        26%      64%      0%
MumbaiJews                   18%        0%           4%       0%   0%        39%      28%      11%
Naga                         0%         0%           0%       90%  0%        0%       10%      0%
NorthKannadi                 0%         0%           0%       2%   0%        0%       98%      0%
Pakistani                    3%         7%           19%      6%   0%        38%      23%      4%
Pathan                       12%        3%           1%       1%   0%        80%      3%       0%
TamilNadu_Iyer               0%         1%           2%       0%   0%        42%      54%      0%
TamilNadu_Nadar              0%         0%           0%       1%   0%        0%       99%      0%
UP_Kayastha                  0%         0%           17%      2%   0%        42%      39%      0%
WestBengal_Kayastha          0%         2%           15%      6%   0%        14%      64%      0%
Pulliyar                     0%         0%           0%       7%   0%        0%       93%      0%
DalitTN                      0%         0%           0%       1%   0%        0%       99%      0%
Velama                       0%         0%           9%       0%   0%        22%      68%      0%

November 27, 2017

Razib Khan’s raw genotype data on 23andMe, Family Tree DNA, Geno 2.0 and Ancestry

It has been a while since I posted an update on my genotype. Since then I’ve been tested on most of the major platforms. I don’t see any harm in releasing this to the public or researchers who want to look at it (though I don’t know why anyone would).

You can download all the files here.

Having my genotypes public is pretty useful for me. If I inquire about someone’s genetics, oftentimes people get weirdly defensive and ask “what about you?” I just invite them to look at my raw data and analyze it for themselves! I’m not a hypocrite about this.

Over the years I’ve had researchers inquire about my ethnicity when they stumble upon my genotype on platforms such as openSNP. So in full disclosure, most of my ancestry is pretty standard eastern Bengali. I’m more East Asian shifted than most Bangladeshi samples in the 1000 Genomes project, but then my family is from Comilla, in the far east of eastern Bengal (anyone who cares, my Y is of course R1a1a-Z93 and my mtDNA U2b).

As before, I’ll put the genotype under a Creative Commons license.

November 9, 2017

Bank your exome with Helix for free ($0.00)

Filed under: Personal Genetics,Personal Genome,Personal genomics — Razib Khan @ 4:06 pm

I wasn’t going to do this again, but I’ve decided to promote Helix’s special discount. It ends at 2:59 AM EDT November 10th. Eight hours from when I push this post.

Obviously, there is a conflict of interest as I work for one of Helix’s partners. What does that mean?

  • Helix does an exome+ sequence and stores your data.
  • Then, you buy applications which use that data.
  • The company I work for is one of the application providers.
  • “Exome” means that Helix does a very accurate medical grade sequence of all your genes. The “+” points to the fact that they include a substantial number of positions which are not within genes (in the “junk DNA”). That totals up to 30,000,000+ markers (the exome is 1% of your whole genome). This is not trivial. Current direct-to-consumer genomics companies are looking at 500,000 to 1,000,000 markers with SNP arrays.
  • Helix keeps this data. Within a few months, you can buy the data at cost (it won’t be cheap!). But the model is that you buy à la carte apps, which will be affordable (our products are affordable).

I’m laying this all out very plainly because many people are asking me about these details right now as the sale winds down, and this includes people who are pretty savvy about personal genomics. Here is why I think you should get the kits now:

  1. It gets my company more customers. That’s the self-interested part, and less important for the target audience.
  2. For you, it gets you an exome that you can buy later without any upfront cost. For the next eight hours, Helix is basically waiving the kit costs by dropping the price $100.

Our Neanderthal product is now $9.99. Our Metabolism product is $19.99. These products are great, as they give you functional information in a very user-friendly manner. But a lot of my readers can analyze their own data, so what’s the incentive then? Again, the incentive is that you get an exome for free, and can later buy it if you want, or, perhaps even a savvy personal genomics consumer will find an app they’ll want to purchase. Normally the kit is $80, so buying it now means you’ll never have to pay this cost. If you are the type of person who has qualms about a private company keeping your data, this may not be for you.

Of course, there are other app developers in the Helix store, so just buy whatever you want. This is a way to get your exome sequenced for free now. I will tell you that the Insitome apps are among the cheapest.

Finally, a lot of people are buying “family-pack” quantities. I got four kits for example for my immediate family. Unfortunately, there are some issues with the Helix site and the extra purchases. You can buy more than one easily at Amazon right now. Our Neanderthal product is not in low stock. The Metabolism product has only a few left, though I don’t know what that means.

Note: The discount is client-side, so you may need to switch browsers if you are going to the Helix site to buy (or turn off ad-block). From what I can see Amazon does not have these issues.

November 2, 2017

Introducing DNAGeeks.com

Filed under: DTC personal genomics,Personal genomics — Razib Khan @ 8:27 am

Four years ago my friend David Mittleman and I wrote Rumors of the death of consumer genomics are greatly exaggerated. The context was the FDA crackdown on 23andMe. Was the industry moribund before it began? The title gives away our opinion. We were personally invested. David and I were both working for Family Tree DNA, which is part of the broader industry. But we were sincere too.

Both of us have moved on to other things. But we still stand by our original vision. And to a great extent, we think we had it right. The consumer genomics segment in DTC is now nearing 10 million individuals genotyped (Ancestry itself seems to have gone north of 5 million alone).

One of the things that we observed in the Genome Biology piece is that personal genomics was still looking for a “killer app”, like the iPhone. Since then the Helix startup has been attempting to create an ecosystem for genomics with a variety of apps. Though ancestry has driven nearly ten million sales, there still isn’t something as ubiquitous as the iPhone. We’re still searching, but I think we’ll get there. Data in search of utility….

David and I are still evangelizing in this space, and together with another friend we came up with an idea: DNAGeeks. We’re starting with t-shirts because it’s something everyone understands, but which can also relay our (and your) passion about genomics. We started with “Haplotees”: basically, the most common Y and mtDNA lineages. This might seem silly to some, but it’s something a lot of people have an interest in, and it’s also a way to get ‘regular people’ interested in genetics. Genealogy isn’t scary, and it’s accessible.

We are also field-testing other ideas. If there is a demand we might roll out a GNXP t-shirt (logo only?). The website is obscure enough that it won’t make sense to a lot of people, but perhaps it will make sense to the people who you want it to make sense to!

Anyway, as they say, “keep watching this space!” We don’t know where DNAGeeks is going, but we’re aiming to have fun with genomics and make a little money too.

October 4, 2017

A very special episode of South Park

Filed under: Personal genomics — Razib Khan @ 5:17 pm

September 4, 2017

The issue is with the model, not precision!

Filed under: Personal genomics — Razib Khan @ 6:38 pm

The Wirecutter has a thorough review of direct-to-consumer ancestry testing services. Since I now work at a human personal genomics company I’m not going to comment on the merits of any given service. But, I do want to clarify something in regards to the precision of these tests. Before the author quotes Jonathan Marks, he says:

For Jonathan Marks, anthropology professor at University of North Carolina at Charlotte, the big unknown for users is the margin for error with these estimates….

The issue I have with this quote is that the margin of error on these tests is really not that high. Margin of error itself is a precise concept. If you sample 1,000 individuals you’ll have a lower margin of error than if you sample 100 individuals. That’s common sense.

But for direct-to-consumer genomic tests you are sampling 100,000 to 1 million markers on SNP arrays (the exact number used for ancestry inference is often lower than the total number on the array). For ancestry testing you are really interested in the 10 million or so (order of magnitude) markers which vary between populations, and a random sampling of 100,000 to 1 million is going to be pretty representative (consider that election year polling usually surveys a few thousand people to represent an electorate of tens of millions).
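
The polling comparison can be put in rough numbers with the textbook margin-of-error formula for a proportion, including the finite-population correction. This is a back-of-the-envelope sketch in Python, not a description of how any ancestry test computes its results: it treats each marker as an independent random draw, which linked markers on a real array are not, and the poll of 2,000 people out of 50 million voters is just my illustrative stand-in for “a few thousand” out of “tens of millions.” The function name is hypothetical.

```python
import math


def sampling_margin(sample_size, pool_size, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion estimated from a
    simple random sample drawn without replacement from a finite pool."""
    if not 0 < sample_size <= pool_size:
        raise ValueError("sample_size must be between 1 and pool_size")
    if pool_size == 1:
        return 0.0
    standard_error = math.sqrt(p * (1.0 - p) / sample_size)
    # Finite-population correction: sampling a large share of the pool
    # leaves less room for the sample to be unrepresentative.
    correction = math.sqrt((pool_size - sample_size) / (pool_size - 1))
    return z * standard_error * correction


if __name__ == "__main__":
    scenarios = [
        ("poll, 2,000 of 50,000,000 voters", 2_000, 50_000_000),
        ("array, 100,000 of 10,000,000 markers", 100_000, 10_000_000),
        ("array, 1,000,000 of 10,000,000 markers", 1_000_000, 10_000_000),
    ]
    for label, n, pool in scenarios:
        margin = 100 * sampling_margin(n, pool)
        print(f"{label}: +/- {margin:.2f} percentage points")
```

Under these assumptions the poll comes out at a couple of percentage points, while the marker samples come out at a fraction of a point, which is why sampling error is not where the disagreement between companies comes from.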

If you run a package like Admixture you can repeat the calculation for a given individual multiple times. In most cases there is very little variation between replicates in relation to the percentage breakdowns, even though a random seed initializes the process as it begins to stochastically explore the parameter space (the variance is going to be higher if you try to resolve clusters which are extremely phylogenetically close, of course).
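
A toy version of the replicate experiment looks like the Python sketch below. To be clear about what it is not: this is not the algorithm Admixture itself uses, and it is much simpler than a real run. It assumes two hypothetical source populations with known, randomly generated allele frequencies, independent markers, and a single individual whose fraction from population “A” is fit by a plain EM loop. Each replicate starts from a different randomly seeded initial guess, and the point is only to show that the answers agree.

```python
import random


def simulate_genotypes(freqs_a, freqs_b, frac_a, rng):
    """Draw 0/1/2 allele counts for one individual with fraction frac_a from A."""
    genotypes = []
    for fa, fb in zip(freqs_a, freqs_b):
        p = frac_a * fa + (1.0 - frac_a) * fb
        genotypes.append(int(rng.random() < p) + int(rng.random() < p))
    return genotypes


def em_fraction(genotypes, freqs_a, freqs_b, start, iterations=200):
    """Estimate the fraction of allele copies from A with a basic EM loop."""
    q = start
    n_copies = 2 * len(genotypes)
    for _ in range(iterations):
        expected_from_a = 0.0
        for g, fa, fb in zip(genotypes, freqs_a, freqs_b):
            # E-step: chance that a copy came from A, given which allele it is.
            alt = q * fa / (q * fa + (1.0 - q) * fb)
            ref = q * (1.0 - fa) / (q * (1.0 - fa) + (1.0 - q) * (1.0 - fb))
            expected_from_a += g * alt + (2 - g) * ref
        # M-step: the new fraction is the average of those chances.
        q = expected_from_a / n_copies
    return q


def replicate_runs(n_markers=2_000, n_replicates=4, true_frac=0.20, data_seed=7):
    """Fit the same simulated individual from several random starting points."""
    data_rng = random.Random(data_seed)
    freqs_a = [data_rng.uniform(0.05, 0.95) for _ in range(n_markers)]
    freqs_b = [data_rng.uniform(0.05, 0.95) for _ in range(n_markers)]
    genotypes = simulate_genotypes(freqs_a, freqs_b, true_frac, data_rng)
    estimates = []
    for seed in range(n_replicates):
        start = random.Random(seed).uniform(0.05, 0.95)
        estimates.append(em_fraction(genotypes, freqs_a, freqs_b, start))
    return estimates


if __name__ == "__main__":
    for seed, estimate in enumerate(replicate_runs()):
        print(f"replicate with seed {seed}: fraction from A = {estimate:.4f}")
```

In this one-parameter toy the likelihood has a single peak, so every start ends up in the same place. Real unsupervised runs with many clusters have a rougher surface, which is where the extra variance for very closely related clusters comes from.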

As I have stated before, the reason these different companies offer varied results is that they start out with different models. When I learned the basic theory around phylogenetics in graduate school the philosophy was definitely Bayesian; vary the model parameters and the model and see what happens. But you can’t really vary the model all the time between customers, can you? It starts to become a nightmare in relation to customer service.

There are certain population clusters that customers are interested in. To provide a service to the public a company has to develop a model that answers those questions which are in demand. If you are designing a model for purely scientific purposes then you’d want to highlight the maximal amount of phylogenetic history. That isn’t always the same, though, as the history that customers want to know about. This means that direct-to-consumer ethnicity tests, in terms of the specification of their models, deviate from pure scientific questions, and result in a lot of judgment calls based on company evaluations of their client base.

Addendum: There is a lot of talk about the reference population sets. The main issue is representativeness, not sample size. You don’t really need more than 10-100 individuals from a given population in most cases. But you want to sample the real population diversity that is out there.
