Razib Khan: One-stop shopping for all of my content

December 12, 2017

Helix kit price waived until December 26 at 2:59am EST

Filed under: Personal genomics — Razib Khan @ 7:56 am

Happy Hanukkah! My main qualm with wishing you a happy holiday is that I’m a thorough assimilator and I don’t want to be disemboweled.

For context, listen to the Stuff You Missed in History Class episode on the Maccabean Revolt. As a Jewish friend of mine once observed, the Maccabees were kind of the Al-Qaeda of their day (today she would have said ISIS).

With that out of the way, I want to give you a heads up that Helix has a sale going until December 26 at 2:59am EST where the $80 kit cost is waived with the purchase of any app, if you haven’t purchased an app before. Just enter the promotion code HOLIDAY at checkout.

That means presales of Insitome’s Regional Ancestry are no more than $19.99, while Neanderthal is $29.99 and Metabolism is $39.99 (this applies to all of Helix’s products except embodyDNA by Lose It! and Geno 2.0 by National Geographic).

Why does it matter? Again, Helix banks a high quality exome+ (the + is for non-exonic positions) when you purchase any of their apps. If you want subsequent apps you don’t have to send another kit in, you just buy the app and get the results. Also, I do have to say that from what I’ve seen and heard Helix’s laboratory facilities are top-notch in terms of getting results turned around rapidly.

December 8, 2017

Genomic ancestry tests are not cons, part 2: the problem of ethnicity

Filed under: Personal genomics — Razib Khan @ 12:51 am

The results to the left are from 23andMe for someone whose paternal grandparents were immigrants from southern Germany. Their mother’s father was of English American background (his father was a Yankee American with an English surname and his mother was an immigrant from England), and their mother’s maternal grandparents were German (Rhinelander) and French Canadian respectively.

Looking at the results from 23andMe one has to wonder: why is this individual only a bit under 25% French & German, when genealogical records of places of birth indicate they should be 75% French & German (more precisely, 62.5% German and 12.5% French)? And though their ancestry is 25% English, only 13% of their ancestry is listed as such.

First, notice that nearly half of their ancestry is “Broadly Northwestern European.” Last I checked, 23andMe uses phased haplotypes to detect segments of ancestry. This is a very powerful method and is often quite good at zeroing in on people of European ancestry. But with Americans of predominant, but mixed, Northern European background, rather than precise proportions you often obtain results of the form “Broadly…”, presumably because recombination has generated novel haplotypes in white Americans.

But this isn’t the whole story. Why, for example, are many of the Finnish people I know on 23andMe assigned as >90% Finnish, while a Danish friend is 40% Scandinavian?

The issue here is that to be “Finnish” and “Scandinavian” are not equivalent units in terms of population genetics. Finns are a relatively homogeneous ethnic group who seem to have undergone a recent population bottleneck. In contrast, Scandinavia encompasses several different, albeit related, ethnicities which are geographically widely distributed.

Ethnic identities are socially and historically constructed. Additionally, they are often clear and distinct. This is not always the case for population genetic classifications. On a continental scale, racial classification is trivial, and feasible with only a modest number of genetic markers. Why? Because the demographic and evolutionary history of Melanesians and West Africans, to give two concrete examples, are distinct over tens of thousands of years. Population genetic analyses which attempt to identify or differentiate these groups have a lot of raw material to work with.

To the left, you see a PCA plot of Papuans, Yoruba, and Swedes. They are clear and distinct populations. I pruned the marker set down to 750 SNPs. Now, since these were SNPs selected to be variable in human populations, they aren’t just random markers. They are biased toward being informative of population history. That being said, notice how distinct the groups are.
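
If you want to see the mechanics for yourself, here is a minimal sketch in Python of this kind of exercise. The drift levels, sample sizes, and population labels are invented stand-ins for illustration, not the actual HGDP data behind the plot:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n_snps, n_per_pop = 750, 30

# Balding-Nichols-style drift: each population's allele frequencies are
# a beta-distributed perturbation of shared ancestral frequencies.
# fst = 0.15 is a made-up stand-in for deep continental divergence.
p_anc = rng.uniform(0.1, 0.9, n_snps)
fst = 0.15
pop_freqs = [rng.beta(p_anc * (1 - fst) / fst, (1 - p_anc) * (1 - fst) / fst)
             for _ in range(3)]  # stand-ins for Yoruba, Swedes, Papuans

# Diploid genotypes coded as 0/1/2 copies of the alternate allele.
G = np.vstack([rng.binomial(2, p, size=(n_per_pop, n_snps))
               for p in pop_freqs])

# Center the genotype matrix and project onto the top two PCs.
pcs = PCA(n_components=2).fit_transform(G - G.mean(axis=0))
for i in range(3):
    block = pcs[i * n_per_pop:(i + 1) * n_per_pop]
    print(f"pop{i + 1} mean position: PC1={block[:, 0].mean():8.2f}, "
          f"PC2={block[:, 1].mean():8.2f}")
# Even with only 750 SNPs, three deeply diverged groups fall into three
# well-separated clusters on the first two PCs.
```

With divergence this deep, 750 markers are plenty; the point of the Esan/Yoruba comparison below is that the same marker count fails when the divergence is shallow.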

The Yoruba and Swedes and Papuans are separated by 50,000 to 100,000 years of history. That history is reflected in the genetic variation. And the social construct of an ethnocultural identity is nested within that demographic history. The Yoruba people are a coherent cultural unit. Similarly, the Swedes emerged in the last 1,000 years through a fusion of tribes such as the Geats and Svear. The Papuans are a different case, as “Papuan” brackets a whole range of groups. To a great extent, one can argue that a self-conscious Papuan identity is a product of the 20th century, because of political forces (the independence of Papua New Guinea), and large-scale contact with Europeans and Austronesians. Nevertheless, when comparing extreme different groups, an artificial catchall ethnic identity such as “Papuan” is quite informative.

Using the same marker set I plotted individuals from the Yoruba and Esan ethnic groups from the southwest and south of Nigeria, respectively. It is immediately clear that you can barely differentiate the Esan from the Yoruba genetically, at least with 750 SNPs.

The Esan and Yoruba have distinct identities, but culturally they are not too distinct from each other. They even share some traditional deities. Being close neighbors there has likely been a great deal of gene flow, as their shared common ancestors are much closer in time to the present than in the cases I illustrated above.

But when I increased the marker set to ~250,000 SNPs the Yoruba and Esan were clearly distinct populations. This is not surprising. Often today we are wont to assert that ethnic identities are recent, historically contingent creations. The reality is many ethnic identities were assembled out of clear and distinct preexistent elements, which had their own history, and so could be reflected in genetics.

That being said, the closer two ethnic groups are geographically and socioculturally, the more likely the two groups are to overlap genetically (more precisely, they can be much harder to differentiate). Sometimes though genetics and culture are very different. The Basque people of northern Spain and southwest France are only mildly genetically distinct from their Romance-speaking neighbors, but they are an ethnolinguistic isolate. The cultural chasm in language is huge. But the genetic chasm is much smaller.

Scandinavia is a coherent ethnolinguistic category which encompasses various northern Germanic people who were relatively untouched by Roman cultural influences. This is in contrast to many Germanic tribes to the south, such as the Franks, who emerged in dynamic tension with the rise of the Roman Empire. The final Scandinavian conversion to Christianity, and so admission into the post-Roman European world, began about two centuries after the conversion of the pagan Saxons by Charlemagne.

Later, the two centuries of the Kalmar Union brought all the modern nations of Scandinavia under one ruler. Today, the concept of Norden, which includes non-Scandinavian Finland, expresses the cultural and social connections of the northern peoples.

And yet genetically the reality is more muddled. Looking at samples of Germans, Danes, Swedes and Norwegians, the geographic patterning is clear. Danes occupy a position between Germans on the one hand, and Norwegians and Swedes on the other. Because of Sami ancestry in many Norwegians and Sami and Finnish ancestry in many Swedes they are genetically distinct from continental Germanic peoples to the south, including Danes.

So what is a Scandinavian? A Scandinavian is a Swede, Dane, or Norwegian (or an Icelander). Scandinavians share 1,000 years of history since their integration into the European system. As a cultural category Scandinavians are clear and distinct.

But as a genetic cluster things are not so clear. First, there is the Danish connection to Germany. This is due to both history and geography. People from northern Germany are clearly genetically close to the Danes. While the Angles and Jutes were from modern Denmark, the Saxons were from northern Germany. Yet in Britain, they fused seamlessly into one people. Before the mass conversion of the continental Saxons under the Carolingians, the cultural barriers between the peoples of Jutland and Saxony must have been marginal at best.

Second, an enormous number of Swedes in particular seem to be highly admixed with Finnic peoples. Many Swedes are highly “Finn-shifted”, both due to Sami assimilation in the past few hundred years, and the long history of Finnish migration into Sweden (Sweden dominated Finland either politically or culturally for nearly 1,000 years). But culturally, and in their ethnolinguistic identity, these people are nothing but Scandinavian at this point.

Going back to the results of the 23andMe user above, who genealogically is more than 60% German but comes back as 25% German, how does one make sense of it? Anyone who has looked at German data realizes that it is very difficult to identify a ‘prototypical’ German. Germans are people who speak Germanic languages, and whose ancestors emerged out of the European Bronze Age, when much of Northern European population structure was established. But being at the center of Europe means that Germans have been subject to gene flow from peoples in all directions. Also, some ethnic Germans in the eastern regions clearly descend from Slavic tribes, and more recently there were migrations of peoples such as French Huguenots.

A PCA of Danes, English, French, and Germans shows differences across the groups. But Germans overlap a great deal with the English, and a substantial minority overlap with Danes. Also, many more of the Germans are “French-shifted” than the English.

The point is that to be German is to be many things. At least in the context of Northern European peoples.

There are powerful methods of ancestry inference using more information than just genotypes, such as fineSTRUCTURE. And there are methods relying on rare variants, which allow for much more fine-grained distinctions. But all these methods suffer from the fact that one has to define populations with labels in the first place. Genetically Germany has several closely related clusters, and all of them are arguably authentically German.

Because ethnolinguistic categories are constructions of human history and social preferences they do not always map onto genetic differences at a fine grain. But, because ethnolinguistic categories were created by humans to give intelligibility to national and cultural variation, they are incredibly powerful ways in which to communicate classification to the general public.

Some people believe that personal genomics tests are wrong and false because of discrepancies such as the one I highlight in this post. Actually, the issue is that the language we use shapes our preconceptions, and these companies are attempting to leverage categories and classes which are highly informative to give us a general sense of the patterns they are detecting. Language does not shape reality, but it shapes our perception of reality. To say someone is 25% French-German is more informative to the end-user than to say someone is 25% Generic Continental North European, even though they are basically the same thing. And yet, if you told someone they were 25% Generic Continental North European they might be less likely to cross-reference that result with their genealogy, because the term is so expansive and vague that one does not assume ethnolinguistic precision.

Ultimately I don’t think there is a right answer on this sort of issue. My own preference is clearly to avoid national and ethnic terms to which people bring their own preconceptions. At least when possible.

December 4, 2017

Genomic ancestry tests are not cons, part 1

Filed under: DTC personal genomics,Personal genomics — Razib Khan @ 6:45 pm

As someone who is part of the personal genomics sector, I keep track of media representations of the industry very closely. There is the good and the bad, some justified and some not.

But there is one aspect which I need to weigh in on because it is close to my interests and professional focus, and it is one where I have a lot of experience: ancestry inference on human data.

Periodically I see in my Twitter timeline an article shared by a biologist which is filled with misrepresentations, confusions, and even falsehoods. Of course, some of the criticisms are correct. The problem is that when you mix truth and falsehood, or sober analysis and critique with sensationalism, the whole product is debased.

I’m going to address some of the most basic errors and misimpressions. This post is “part 1” because I might have follow-ups, as I feel like this is a situation where I have to put out fires periodically, as people write about things they don’t know about, and then those articles get widely shared to a credulous public.

First, if an article mentions STRs or microsatellites or a test with fewer than 1,000 markers in a direct to consumer genomic context, ignore the article. This is like a piece where the author dismisses air travel because it’s noisy due to propeller-driven planes. Propeller-driven planes are a very small niche. Similarly, the major direct to consumer firms, which have sold close to ~10 million kits, do not use STRs or microsatellites, very much a technology for the 1990s and 2000s. Any mention of STRs or microsatellites or low-density analyses indicates the journalist didn’t do their homework, or simply doesn’t care to be accurate.

Second, there is constant harping on the fact that different companies give different results. This is because tests don’t really give results as much as interpretations. The raw results consist of your genotype. On the major SNP-chip platforms this will be a file on the order of 20 MBs. The companies could provide this as the product, but most humans have difficulty grokking over 100,000 variables.

So what’s the solution? The same that scientists have been using for decades: reduce the variation into a much smaller set of elements which are human digestible, often through tables or visualization.

For example, consider a raw data set consisting of my three genotypes from 23andMe, Ancestry, and Family Tree DNA. Merged with public data these are ~201,000 single nucleotide markers. You can download the plink formatted data yourself and look at it. The PCA below shows where my three genotypes are positioned, right by the Tamil South Asians. Observe that my genotypes are basically at the same point:

The differences between the different companies have nothing to do with the raw data, because with hundreds of thousands of markers they capture enough of the relevant between population differences in my genome (do you need to flip a coin 1 million times after you’ve flipped it 100,000 times to get a sense of whether it is fair?). The law of large numbers is kicking in at this point, with genotyping errors on the order of 0.5% not being sufficient to differentiate the files.

Sure enough, the raw genotype files of the three services match pretty closely: 99.99% for Family Tree DNA and 23andMe, 99.7% for Family Tree DNA and Ancestry, and 99.6% for Ancestry and 23andMe. For whatever reason Ancestry is the outlier here. My personal experience looking at genotype data from Illumina chips is that most are pretty high quality, but it’s not shocking to see instances with 0.5% no call or bad call rates. For phylogenetic purposes, if the errors are not systematic it’s not a big deal.
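
Checking this kind of concordance yourself is straightforward. Here is a rough sketch assuming both raw files are in the 23andMe-style layout (tab-separated rsid, chromosome, position, genotype, with “#” comment lines); the file names are placeholders, and a production version would also have to handle strand flips and indel notation differences across platforms:

```python
def load_raw(path):
    """Parse a 23andMe-style raw file into {rsid: sorted genotype}."""
    calls = {}
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip header comments
            rsid, _chrom, _pos, genotype = line.split()[:4]
            calls[rsid] = "".join(sorted(genotype))  # treat AG == GA
    return calls

a = load_raw("razib_23andme.txt")  # placeholder file names
b = load_raw("razib_ftdna.txt")

# Compare only SNPs present and called in both files ("-" is a no-call).
shared = [r for r in a.keys() & b.keys()
          if "-" not in a[r] and "-" not in b[r]]
matches = sum(a[r] == b[r] for r in shared)
print(f"{matches / len(shared):.4%} concordance over {len(shared):,} SNPs")
```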

The identity to other populations is consistent. About 74% to Tamils. 72-73% for other Eurasians. 71% for the Surui, an isolated Amazonian group. And 69% to Yoruba. Observe that this recapitulates what we know of the phylogenetic history of the population I am from, Bengalis. The more distinct the evolutionary histories of two populations, the greater the genetic divergence between them, and so the lower the identity. This is not rocket science. This gets to the point that the raw results make a lot more sense when you integrate and synthesize them with other information you have. Most customers are not going into the process of getting a personal genomic ancestry test blind…but that causes pitfalls as well as opportunities.

But most people do not receive statistics of the form:

Pair            SNP identity
You / Yoruba    0.69
You / German    0.72
You / Japanese  0.73
You / Tamil     0.74

Mind you, this is informative. It’s basically saying I am most genetically distant from the Yoruba and closest in sequence to Tamils. But this is somewhat thin gruel for most people. Consider the plot below, which is a zoom-in of PC 2 vs. PC 4. I am blue, the purple/pink are Tamils, and the population at the bottom left is East Asian.

If you look at enough PCA plots it becomes rather clear that I am shifted toward East Asians in comparison to most other South Asians. The high identity that I have with the Japanese and Dai is due in part to the fact that I have relatively recent admixture from an East Asian population, above and beyond what is typical in South Asians. Remember, all three of my genotypes are basically on the same spot on PCA plots. That’s because they’re basically the same. Genotyping error is rather low.

How do we summarize this sort of information for a regular person? The standard method today is giving people a set of proportions with specific population labels. Why? People seem to understand population labels and proportions, but can be confused by PCA plots. Additionally, the methods that give out populations and proportions are often better than PCA at capturing pulse admixture events relatively recent in time, and for most consumers of ancestry services (i.e., Americans), this is an area they are particularly focused on.

An easy way to make one’s genetic variation comprehensible to the general public is to model it as a mixture of various populations that they already know of. So consider the populations above in the plink file. I ran ADMIXTURE in supervised mode on my three genotypes, progressively removing populations. The results are below.

  Dai Druze German Japanese Papuan Sardinian Surui Tamil Yoruba
Razib23andMe 11% 3% 8% 4% 1% 0% 1% 73% 1%
RazibAncestry 10% 2% 8% 4% 1% 0% 1% 73% 1%
RazibFTDNA 11% 2% 8% 3% 1% 0% 1% 72% 1%
                   
  Dai Druze German Japanese Papuan Sardinian Surui Tamil  
Razib23andMe 11% 3% 8% 4% 1% 0% 1% 73%  
RazibAncestry 10% 3% 8% 4% 1% 0% 1% 74%  
RazibFTDNA 11% 3% 8% 3% 1% 0% 1% 73%  
                   
  Dai Druze Japanese Papuan Surui Tamil      
Razib23andMe 10% 9% 4% 1% 1% 74%      
RazibAncestry 10% 9% 4% 1% 1% 75%      
RazibFTDNA 11% 9% 4% 1% 1% 74%      
                   
  Dai Japanese Surui Tamil          
Razib23andMe 11% 4% 1% 84%          
RazibAncestry 10% 4% 1% 85%          
RazibFTDNA 11% 3% 1% 84%        

Please observe again that they are broadly congruent. These methods exhibit a stochastic element, so there is some noise baked into the cake, but with 200,000+ markers and a robust number of reference populations the results come out the same across all methods (also, 23andMe and Family Tree DNA seem to correlate a bit more, which makes sense since these two genotypes are more similar to each other than they are to Ancestry).

Observe that until I remove all other West Eurasian populations the Tamil fraction in my putative ancestry is rather consistent. Why? Because my ancestry is mostly Tamil-like, but social and historical evidence would point to the likelihood of some exogenous Indo-Aryan component. Additionally, seeing as how very little of my ancestry could be modeled as West African, removing that population had almost no impact.

When there were three West Eurasian populations, Germans, Druze, and Sardinians, the rank order was in that sequence. When I removed Germans and Sardinians, the Druze picked up most of that ancestral component. This is a supervised method, so I’m assigning the empirical populations as reified clusters which can be used to reconstitute the variation you see in my own genotype. No matter what I put into the reference data, the method tries its best to assign proportions to populations.
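
A toy version of this kind of supervised decomposition is easy to write down. The sketch below is not ADMIXTURE’s maximum-likelihood model; it just solves a non-negative least squares fit of an individual’s allele dosages against made-up reference allele frequencies. But it shows the key property at work: the weights must be distributed over whatever references you supply.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n_snps = 5000

# Made-up reference allele frequencies; columns are reference pops,
# e.g. Tamil-, Dai-, and German-like stand-ins.
F = rng.uniform(0.05, 0.95, size=(n_snps, 3))
true_w = np.array([0.75, 0.15, 0.10])  # hypothetical mixture weights

# Simulate a diploid individual drawn from the mixed frequencies,
# expressed as allele dosage / 2 so it lives on the frequency scale.
g = rng.binomial(2, F @ true_w) / 2.0

# Solve g ~ F @ w subject to w >= 0, then renormalize to proportions.
w, _ = nnls(F, g)
print((w / w.sum()).round(3))   # recovers roughly [0.75, 0.15, 0.10]

# Drop the third reference column and re-fit: the weight it carried is
# forced onto the remaining references, just as in the tables above.
w2, _ = nnls(F[:, :2], g)
print((w2 / w2.sum()).round(3))
```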

The question then becomes one of the subtle choices one makes to obtain the most informative inferences for the customer. These are not always matters of different results in terms of accuracy or precision, but often of presentation. If West Eurasian populations are removed entirely, my Tamil fraction inflates, because Tamils are then the population left in the data closest to the removed West Eurasians. In contrast, the East Asian fraction remains the same because I’ve left the two proxy populations in the data (I rigged the die here because I know I have Tibeto-Burman admixture, which is a combination of Northeast and Southeast Asian).

Let’s do something different. I’m going to swap out the West Eurasian populations with equivalents.

  Armenians Dai French_Basque Japanese Mandenka Surui Sweden Tamil
Razib23andMe 6% 11% 0% 4% 1% 1% 5% 72%
RazibAncestry 5% 11% 0% 4% 1% 1% 5% 73%
RazibFTDNA 6% 11% 0% 4% 1% 1% 5% 72%
                 
German Papuan Yoruba          
Razib23andMe 68% 20% 13%          
RazibAncestry 68% 20% 13%          
RazibFTDNA 68% 20% 13%          
                 
French_Basque Tamil            
Razib23andMe 8% 92%            
RazibAncestry 7% 93%            
RazibFTDNA 8% 92%            
                 
Tamil Yoruba            
Razib23andMe 97% 3%            
RazibAncestry 97% 3%            
RazibFTDNA 97% 3%          

I have no ancestry from French Basque, but I do have ancestry from Armenians and Swedes in this model. Why? If you keep up with recent population genomics this all makes sense. But if you don’t, well, it’s harder to unpack. This is part of the problem with these sorts of tests: how to make them comprehensible to the public while maintaining fidelity to the latest research.

This is not always easy, and differences between companies in terms of interpretation are not as invidious as some of the press reports would have you think, but a matter of difficult choices and trade-offs one needs to make to give value to customers. True, this could all be ironed out if there were a ministry of genetic interpretation and a rectification of names in relation to population clusters, but right now there isn’t. This both allows for brand differentiation and engenders confusion.

In most of the models with a good number of populations, my Tamil ancestry is in the low 70s. Notice then that some of these results are relatively robust to the populations one specifies. Some of the patterns are so striking and clear that one would have to work really hard to iron them out and mask them in interpretation. But what happens when I remove Tamils and include populations I’m only distantly related to? This is a ridiculous model, but the algorithm tries its best. My affinity is greatest to the Germans, both because of shared ancestry and because, in the case of the Papuans, their relatively high drift from other East Eurasians and their Denisovan ancestry reduce my affinity to them. But both Papuan and Yoruba ancestry are assigned because I’m clearly not 100% German, and I share alleles with both these populations. In models where there are not enough populations to “soak up” an individual’s variation, but you include Africans, it is not uncommon for African ancestry to show up at low fractions. If you take Europeans, Africans, and East Asians, and force two populations out of this mix, then Europeans are invariably modeled as a mix of Africans and East Asians, with greater affinity to the latter.

Even when you model my ancestry as just Tamil and Yoruba, you see that there is a Yoruba residual. I have too much genetic variation that comes from groups not closely related to Tamils to eliminate this residual.

Just adding a few populations fixes this problem:

  Dai Tamil Yoruba  
Razib23andMe 14% 83% 2%  
RazibAncestry 14% 84% 2%  
RazibFTDNA 14% 83% 2%  
         
  Dai German Tamil Yoruba
Razib23andMe 15% 10% 74% 1%
RazibAncestry 14% 9% 75% 1%
RazibFTDNA 15% 10% 74% 1%

Notice how my Tamil fraction is almost the same as when I included many more reference populations. Why? My ancestral history is complex, like most humans’, but it’s not that complex. The goal for public comprehensibility is to reduce the complexity into digestible units which give insight.

Of course, I could just say read Inference of Population Structure Using Multilocus Genotype Data. The basic framework was laid out in that paper 17 years ago for model-based clustering of the sort that is very common in direct to consumer services (some use machine learning and do local ancestry decomposition across the chromosome, but really the frameworks are an extension of the original logic). But that’s not feasible for most people, including journalists.

Consider this piece at Gizmodo, Why a DNA Test Is Actually a Really Bad Gift. I pretty much disagree with a lot of the privacy concerns, seeing as how I’ve had my public genotype downloadable for seven years. But this portion jumped out at me: “Ancestry tests are based on sound science, but variables in data sets and algorithms mean results are probabilities, not facts, as many people expect.”

Yes, there are probabilities involved. But if a DNA test using the number of markers above tells you you are 20% Sub-Saharan African and 80% European in ancestry, that comes with the same sort of confidence as determining that a coin is fair after 100,000 flips. True, you can’t be totally sure after 100,000 flips that you have a fair coin, but you can be pretty confident. With hundreds of thousands of markers, a quantum of 20% Sub-Saharan African in a person of predominantly European heritage is an inference made with a degree of confidence that verges upon certitude, within a percentage point or so.
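
To make the coin-flip analogy concrete: if you pretend each marker is an independent draw, the standard error of an estimated proportion is sqrt(p(1-p)/n). Real markers are correlated through linkage, so the effective n is smaller than the raw marker count, but the back-of-the-envelope below (with illustrative numbers) shows why hundreds of thousands of markers pin the estimate down:

```python
from math import sqrt

p = 0.20  # estimated Sub-Saharan African fraction
for n in (1_000, 100_000, 500_000):  # illustrative marker counts
    se = sqrt(p * (1 - p) / n)       # binomial standard error
    print(f"n = {n:>7,}: 95% interval ~ {p - 1.96 * se:.4f} "
          f"to {p + 1.96 * se:.4f}")
# At n = 500,000 the naive interval is about +/- 0.1 percentage points;
# marker correlation widens this, but it stays within a point or so.
```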

As for the idea that they are not “facts,” I don’t even know what that means in this context. And I doubt the journalist does either. Which is one of my main gripes with these sorts of stories: unless they talk to a small subset of scientists the journalists just don’t know what they are talking about when it comes to the statistical genetics.

Finally, there is the issue of what it even means to be some percentage of population X, Y, or Z. Even many biologists routinely reify the population clusters and confuse them with something real and concrete in a Platonic sense. But deep down when you think about it we all need to recall we’re collapsing genealogies of many different segments of DNA into broad coarse summaries when we say “population.” And populations themselves are by their nature often somewhat open and subject to blending and flow with others. A population genomic understanding of structure does not bring into clarity Platonic facts, but it gives one instruments and tools to smoke out historical insight.

The truth, in this case, is not a thing in and of itself, but a dynamic which refines our intuitions of a fundamentally alien process of Mendelian assortment and segregation.

December 2, 2017

South Asian Genotype Project

Filed under: Personal genomics,South Asian Genotype Project — Razib Khan @ 6:02 pm


It’s been a few years since I’ve done any serious “Genome Blogging.” Mostly I’ve been very busy and there isn’t much low-hanging fruit left as it is. But today I want to announce that I’ll be running the generically titled “South Asian Genotype Project.”

The way it works is simple: send me a 23andMe, Ancestry, or Family Tree DNA raw genotype file to contactgnxp -at- gmail.com (though 23andMe’s new chip has far less overlap with earlier platforms, so it’s probably best if you were typed before August 2017).

In the subject please put:

  1. “South Asian Genotype Project”
  2. The state/province your family is from
  3. Ethnolinguistic group
  4. If applicable, caste

In the body of the email you can put Y and mtDNA and any other information you want. Obviously your data is confidential and I won’t identify you by name, just ethnolinguistic group and such.

Since the last time I did this I have some scripts that make this a lot easier, so hopefully I’ll be adding individuals to this spreadsheet every few days. I’ll give project members an ID and try to email them when the results are up.
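
For the curious, most of the work in scripts like these is file conversion and merging. Below is a rough sketch of the sort of pipeline involved; it shells out to PLINK 1.9 (whose --23file converter ingests 23andMe-style raw files), and the file names and IDs are placeholders, not my actual setup:

```python
import subprocess

# Convert a submitted 23andMe-style raw file to binary PLINK format.
# FID/IID here are placeholder identifiers assigned to the submitter.
subprocess.run(
    ["plink", "--23file", "submission.txt", "PROJ1", "ID001",
     "--make-bed", "--out", "submission"],
    check=True,
)

# Merge the new sample into the reference panel, restricted to the
# intersecting marker set (a list of shared rsids prepared beforehand).
subprocess.run(
    ["plink", "--bfile", "reference_panel", "--bmerge", "submission",
     "--extract", "shared_snps.txt", "--make-bed", "--out", "merged"],
    check=True,
)
# The merged .bed/.bim/.fam files then go into ADMIXTURE or PCA.
```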

The main motivator for this project on my part is that people still ask me questions about Sinhalese, Nasrani Christians, and other assorted groups which we don’t have answers to because current research projects haven’t focused on them.

Since Zack worked on the Harappa Ancestry Project we know a lot more about South Asian ancestry. Basically, there is an ANI-ASI cline, and some South Asians have exogenous ancestry off this cline. Indian Jews have Middle Eastern ancestry, while Bengalis have East Asian ancestry, and some groups in Pakistan have African ancestry. With that in mind I’ll be testing a smaller number of populations. The marker set is 240,000 SNPs by the way.

Below are some representative results. You can see that my results from three DTC services are basically the same. Also, some South Indian groups (see Pulliyar) show “Dai” ancestry, when I’m pretty sure it’s just that I didn’t sample as much of the extreme portion of the ASI cline.

ID                           Armenians  Belorussian  C_India  Dai  Nigerian  NWIndia  S_India  YemeniteJews
Balochi                      34%        1%           0%       0%   0%        66%      0%       0%
Bangladesh_Razib (23andMe)   0%         0%           14%      14%  0%        15%      57%      0%
Bangladesh_Razib (Ancestry)  0%         0%           14%      14%  0%        15%      57%      0%
Bangladesh_Razib (ftDNA)     0%         0%           13%      14%  0%        15%      58%      0%
Chenchus                     0%         0%           1%       1%   0%        0%       98%      0%
Dharkars                     0%         0%           16%      2%   0%        38%      44%      0%
Dusadh                       0%         0%           21%      1%   0%        2%       76%      0%
Iranians                     65%        2%           1%       2%   0%        20%      0%       10%
Kallar                       0%         0%           0%       0%   0%        0%       100%     0%
Kurumba                      0%         0%           0%       0%   0%        4%       96%      0%
Meghawal                     0%         0%           10%      0%   0%        26%      64%      0%
MumbaiJews                   18%        0%           4%       0%   0%        39%      28%      11%
Naga                         0%         0%           0%       90%  0%        0%       10%      0%
NorthKannadi                 0%         0%           0%       2%   0%        0%       98%      0%
Pakistani                    3%         7%           19%      6%   0%        38%      23%      4%
Pathan                       12%        3%           1%       1%   0%        80%      3%       0%
TamilNadu_Iyer               0%         1%           2%       0%   0%        42%      54%      0%
TamilNadu_Nadar              0%         0%           0%       1%   0%        0%       99%      0%
UP_Kayastha                  0%         0%           17%      2%   0%        42%      39%      0%
WestBengal_Kayastha          0%         2%           15%      6%   0%        14%      64%      0%
Pulliyar                     0%         0%           0%       7%   0%        0%       93%      0%
DalitTN                      0%         0%           0%       1%   0%        0%       99%      0%
Velama                       0%         0%           9%       0%   0%        22%      68%      0%

November 27, 2017

Razib Khan’s raw genotype data on 23andMe, Family Tree DNA, Geno 2.0 and Ancestry

It has been a while since I posted an update on my genotype. Since then I’ve been tested on most of the major platforms. I don’t see any harm in releasing this to the public or researchers who want to look at it (though I don’t know why anyone would).

You can download all the files here.

Having my genotypes public is pretty useful for me. If I inquire about someone’s genetics oftentimes people get weirdly defensive and ask “what about you?” I just invite them to look at my raw data and analyze it for themselves! I’m not a hypocrite about this.

Over the years I’ve had researchers inquire about my ethnicity when they stumble upon my genotype on platforms such as openSNP. So in full disclosure, most of my ancestry is pretty standard eastern Bengali. I’m more East Asian shifted than most Bangladeshi samples in the 1000 Genomes project, but then my family is from Comilla, in the far east of eastern Bengal (for anyone who cares, my Y is of course R1a1a-Z93 and my mtDNA is U2b).

As before, I’ll put the genotype under a Creative Commons license.

November 9, 2017

Bank your exome with Helix for free ($0.00)

Filed under: Personal Genetics,Personal Genome,Personal genomics — Razib Khan @ 4:06 pm

I wasn’t going to do this again, but I’ve decided to promote Helix’s special discount. It ends at 2:59 AM EDT November 10th. Eight hours from when I push this post.

Obviously, there is a conflict of interest as I work for one of Helix’s partners. What does that mean?

  • Helix does an exome+ sequence and stores your data.
  • Then, you buy applications which use that data.
  • The company I work for is one of the application providers.
  • “Exome” means that Helix does a very accurate medical-grade sequence of all your genes. The “+” points to the fact that they include a substantial number of positions which are not within genes (in the “junk DNA”). That totals up to 30,000,000+ markers (the exome is 1% of your whole genome). This is not trivial. Current direct-to-consumer genomics companies are looking at 500,000 to 1,000,000 markers with SNP arrays.
  • Helix keeps this data. Within a few months, you can buy the data at cost (it won’t be cheap!). But the model is that you buy à la carte apps, which will be affordable (our products are affordable).

I’m laying this all out very plainly because many people are asking me about these details right now as the sale winds down, and this includes people who are pretty savvy about personal genomics. Here is why I think you should get the kits now:

  1. It gets my company more customers. That’s the self-interested part, and less important for the target audience.
  2. For you, it banks an exome that you can buy later without any upfront cost. For the next eight hours, Helix is basically waiving the kit cost by dropping the price by $100.

Our Neanderthal product is now $9.99. Our Metabolism product is $19.99. These products are great, as they give you functional information in a very user-friendly manner. But a lot of my readers can analyze their own data, so what’s the incentive then? Again, the incentive is that you get an exome for free and can later buy it if you want, or perhaps even a savvy personal genomics consumer will find an app they’ll want to purchase. Normally the kit is $80, so buying it now means you’ll never have to pay this cost. If you are the type of person who has qualms about a private company keeping your data, this may not be for you.

Of course, there are other app developers in the Helix store, so just buy whatever you want. This is a way to get your exome sequenced for free now. I will tell you that the Insitome apps are among the cheapest.

Finally, a lot of people are buying “family-pack” quantities. I got four kits, for example, for my immediate family. Unfortunately, there are some issues with the Helix site and the extra purchases. You can buy more than one easily at Amazon right now. Our Neanderthal product is not in low stock. The Metabolism product has only a few left, though I don’t know what that means.

Note: The discount is client-side, so you may need to switch browsers if you are going to the Helix site to buy (or turn off ad-block). From what I can see Amazon does not have these issues.

November 2, 2017

Introducing DNAGeeks.com

Filed under: DTC personal genomics,Personal genomics — Razib Khan @ 8:27 am

Four years ago my friend David Mittleman and I wrote Rumors of the death of consumer genomics are greatly exaggerated. The context was the FDA crackdown on 23andMe. Was the industry moribund before it began? The title gives away our opinion. We were personally invested. David and I were both working for Family Tree DNA, which is part of the broader industry. But we were sincere too.

Both of us have moved on to other things. But we still stand by our original vision. And to a great extent, we think we had it right. The consumer genomics segment in DTC is now nearing 10 million individuals genotyped (Ancestry itself seems to have gone north of 5 million alone).

One of the things that we observed in the Genome Biology piece is that personal genomics was still looking for a “killer app”, like the iPhone. Since then the Helix startup has been attempting to create an ecosystem for genomics with a variety of apps. Though ancestry has driven nearly ten million sales, there still isn’t something as ubiquitous as the iPhone. We’re still searching, but I think we’ll get there. Data in search of utility….

David and I are still evangelizing in this space, and together with another friend we came up with an idea: DNAGeeks. We’re starting with t-shirts because it’s something everyone understands, but it can also relay our (and your) passion for genomics. We started with “Haplotees,” basically the most common Y and mtDNA lineages. This might seem silly to some, but it’s something a lot of people have an interest in, and it’s also a way to get ‘regular people’ interested in genetics. Genealogy isn’t scary, and it’s accessible.

We are also field-testing other ideas. If there is demand we might roll out a GNXP t-shirt (logo only?). The website is obscure enough that it won’t make sense to a lot of people, but perhaps it will make sense to the people you want it to make sense to!

Anyway, as they say, “keep watching this space!” We don’t know where DNAGeeks is going, but we’re aiming to have fun with genomics and make a little money too.

October 4, 2017

A very special episode of South Park

Filed under: Personal genomics — Razib Khan @ 5:17 pm

September 4, 2017

The issue is with the model, not precision!

Filed under: Personal genomics — Razib Khan @ 6:38 pm

The Wirecutter has a thorough review of direct-to-consumer ancestry testing services. Since I now work at a human personal genomics company I’m not going to comment on the merits of any given service. But, I do want to clarify something in regards to the precision of these tests. Before the author quotes Jonathan Marks, he says:

For Jonathan Marks, anthropology professor at University of North Carolina at Charlotte, the big unknown for users is the margin for error with these estimates….

The issue I have with this quote is that the margin of error on these tests is really not that high. Margin of error itself is a precise concept. If you sample 1,000 individuals you’ll have a lower margin of error than if you sample 100 individuals. That’s common sense.

But for direct-to-consumer genomic tests you are sampling 100,000 to 1 million markers on SNP arrays (the exact number used for ancestry inference is often lower than the total number on the array). For ancestry testing you are really interested in the 10 million or so (order of magnitude) markers which vary between populations, and a random sampling of 100,000 to 1 million is going to be pretty representative (consider that election year polling usually surveys a few thousand people to represent an electorate of tens of millions).

If you run a package like ADMIXTURE you can repeat the calculation for a given individual multiple times. In most cases there is very little variation between replicates in relation to the percentage breakdowns, even though a random seed initializes the process as it begins to stochastically explore the parameter space (the variance is going to be higher if you try to resolve clusters which are extremely phylogenetically close, of course).
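
If you have ADMIXTURE installed you can verify this yourself by re-running the same input with different seeds (ADMIXTURE’s -s flag sets the seed; the file name and K below are placeholders) and comparing the resulting Q matrices:

```python
import shutil
import subprocess

K = 6  # illustrative number of clusters
for seed in range(1, 6):
    # -s sets ADMIXTURE's random seed for initialization.
    subprocess.run(["admixture", "-s", str(seed), "data.bed", str(K)],
                   check=True)
    # Each run writes data.6.Q; keep a per-seed copy for comparison.
    shutil.copy(f"data.{K}.Q", f"data.{K}.seed{seed}.Q")
# Diffing the per-seed .Q files generally shows replicate-to-replicate
# differences well under a percentage point for well-separated clusters.
```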

As I have stated before, the reason these different companies offer varied results is that they start out with different models. When I learned the basic theory around phylogenetics in graduate school the philosophy was definitely Bayesian; vary the model parameters and the model and see what happens. But you can’t really vary the model all the time between customers, can you? It starts to become a nightmare in relation to customer service.

There are certain population clusters that customers are interested in. To provide a service to the public a company has to develop a model that answers those questions which are in demand. If you are designing a model for purely scientific purposes then you’d want to highlight the maximal amount of phylogenetic history. That isn’t always the same, though, as the history that customers want to know about. This means that direct-to-consumer ethnicity tests deviate from pure scientific questions in the specification of their models, and result in a lot of judgment calls based on company evaluations of their client base.

Addendum: There is a lot of talk about the reference population sets. The main issue is representativeness, not sample size. You don’t really need more than 10-100 individuals from a given population in most cases. But you want to sample the real population diversity that is out there.

August 28, 2017

When journalists get out of their depth on genetic genealogy

Filed under: DTC personal genomics,Genealogy,Genetics,Personal genomics — Razib Khan @ 7:39 pm

For some reason The New York Times tasked Gina Kolata with covering genetic genealogy and its societal ramifications, With a Simple DNA Test, Family Histories Are Rewritten. The problem here is that to my knowledge Kolata doesn’t cover this as part of her beat, and so isn’t well equipped to write an accurate and in-depth piece on the topic in relation to the science.

This is a general problem in journalism. I notice it most often when it comes to genetics (a topic I know a lot about for professional reasons) and the Middle East and Islam (topics I know a lot about because I’m interested in them). It’s unfortunate, but it has also made me a lot more skeptical of journalists whose track record I’m unfamiliar with.* To give a contrasting example, Christine Kenneally is a journalist without a background in genetics who nevertheless is immersed in genetic genealogy, so that she could have written this sort of piece without objection from the likes of me (she did write a book on the topic, The Invisible History of the Human Race: How DNA and History Shape Our Identities and Our Futures, which I had a small role in fact-checking).

What are the problems with the Kolata piece? I think the biggest issue is that she didn’t go in to test any particular proposition, and leaned on the wrong person for the science. She quotes Joe Pickrell, who knows this stuff like the back of his hand. But more space is given to Jonathan Marks, an anthropologist who is quite opinionated and voluble, and so probably a “good source” for any journalist.

Marks seems well respected in anthropology from what I can tell, but he’s also the person who put up a picture of L. L. Cavalli-Sforza juxtaposed with a photo of Josef Mengele in the late 1990s during a presentation at Stanford. Perhaps this is why anthropologists respect him, I don’t know, but I do not like him because of his nasty tactics (I wouldn’t be surprised if, given power, Marks would make sure people like me were put in political prison camps; his rhetoric is often so unhinged).

Marks’ quotes wouldn’t be much of an issue if Kolata could figure out when he’s making sense, and when he’s just bullshitting. But she can’t. For example:

…“tells me I’m 95 percent Ashkenazi Jewish and 5 percent Korean, is that really different from 100 percent Ashkenazi Jewish and zero percent Korean?”

The precise numbers offered by some testing services raise eyebrows among genetics researchers. “It’s all privatized science, and the algorithms are not generally available for peer review,” Dr. Marks said.

The part about precise numbers is an issue, though a lot less of an issue with high-density SNP-chips (the real issue is sensitivity to reference population and other such parameters). But if a modern test says you are 95 percent Ashkenazi Jewish and 5 percent Korean it really is different from 100% Ashkenazi. Someone who comes up as 5% Korean against an Ashkenazi Jewish background is most definitely of some East Asian heritage. In the early 2000s with ancestrally informative markers and microsatellite-based tests you’d get somewhat weird results like this, but with the methods used by the major DTC companies (and in academia) today these sorts of proportions are just not reported as false positives. Marks may not know because this isn’t his area, but Pickrell would have. Kolata probably did not think to double-check with him, but that’s because she isn’t able to smell out tendentious assertions. She has no feel for the science, and is flying blind.

Second, Marks notes that the science is privatized, and it isn’t totally open. But it’s just false that the algorithms are not generally available for peer review. Not every detail of the pipeline is downloadable on GitHub, but the core ancestry estimation methods are well known. Eric Durand, who wrote the original 23andMe ancestry composition methodology, presented on it at ASHG 2013. I know because I was there during his session.

You can find a white paper for 23andMe’s method and Ancestry‘s. Not everything is as transparent as open science would dictate (though there are scientific papers and publications which also mask or hide elements which make reproducibility difficult), but most geneticists with domain experience can figure out what’s going on and whether it is legitimate. It is. The people who work at the major DTC companies often come out of academia, and are known to academic scientists. This isn’t blackbox voodoo science like “soccer genomics.”

Then Marks says this really weird thing:

“That’s why their ads always specify that this is for recreational purposes only: lawyer-speak for, ‘These results have no scientific standing.’”

Actually, it’s lawyer-speak for “do not sue us, as we aren’t providing you actionable information.” Perhaps I’m ignorant, but lawyers don’t get to define “scientific standing”.

The problem, which is real, is that the public is sometimes not entirely clear on what the science is saying. This is a problem of communication from the companies to the public. I’ve even been in scientific sessions where geneticists who don’t work in population genomics have weak intuition on what the results mean!

Earlier Kolata states:

Scientists simply do not have good data on the genetic characteristics of particular countries in, say, East Africa or East Asia. Even in more developed regions, distinguishing between Polish and, for instance, Russian heritage is inexact at best.

This is not totally true. We have good data now on China and Japan. Korea also has some data. Using haplotype-based methods you can do a lot of interesting things, including distinguish someone who is Polish from Russian. But these methods are computationally expensive and require lots of information on the reference samples (Living DNA does this for British people). The point is that the science is there. Reading this sort of article is just going to confuse people.

On the other hand a lot of Kolata’s piece is more human interest. The standard stuff about finding long lost relatives, or discovering your father isn’t your father. These are fine and not objectionable factually, though they’ve been done extensively before and elsewhere. I actually enjoyed the material in the second half of the piece, which had only a tenuous connection to scientific detail. I just wish these sorts of articles represented the science correctly.

Addendum: Just so you know, three journalists who regularly cover topics I can make strong judgments on, and are always pretty accurate: Carl Zimmer, Antonio Regalado, and Ewen Callaway.

* I don’t follow Kolata very closely, but to be frank I’ve heard from scientist friends long ago that she parachutes into topics, and gets a lot of things wrong. Though I can only speak on this particular piece.

July 11, 2017

23andMe ancestry only is $49.99 for Prime Day

Filed under: 23andMe,D.T.C. Personal Genomics,Personal genomics — Razib Khan @ 11:10 am


23andMe has gone below $50 for “Prime Day”! For those of us who bought kits (albeit more fully featured) at $399 or even more this is pretty incredible. But from what I understand these sorts of SNP-chips can now be purchased from Illumina for well under $50, so this isn’t charity.

At minimum, it’s a way to get a raw genotype you can bank for later.

June 27, 2017

Genome sequencing for the people is near

Filed under: Genomics,Personal genomics — Razib Khan @ 7:22 am

When I first began writing on the internet genomics was an exciting field of science. Somewhat abstruse, but newly relevant and well known due to the completion of the draft of the human genome. Today it’s totally different. Genomics is ubiquitous. Instead of a novel field of science, it is transitioning into a personal technology.

But life comes at you fast. For all practical purposes the $1,000 genome is here.

And yet we haven’t seen a wholesale change in medicine. What happened? Obviously a major part of it is polygenicity of disease. Not to mention that a lot of illness will always have a random aspect. People who get back a “clean” genome and live a “healthy” life will still get cancer.

Another issue is a chicken & egg problem. When a large proportion of the population is sequenced and phenotyped we’ll probably discover actionable patterns. But until that moment the yield is not going to be too impressive.

Consider this piece in MIT Tech, DNA Testing Reveals the Chance of Bad News in Your Genes:

Out of 50 healthy adults [selected from a random 100] who had their genomes sequenced, 11—or 22 percent—discovered they had genetic variants in one of nearly 5,000 genes associated with rare inherited diseases. One surprise is that most of them had no symptoms at all. Two volunteers had genetic variants known to cause heart rhythm abnormalities, but their cardiology tests were normal.

There’s another possible consequence of people having their genome sequenced. For participants enrolled in the study, health-care costs rose an average of $350 per person compared with a control group in the six months after they received their test results. The authors don’t know whether those costs were directly related to the sequencing, but Vassy says it’s reasonable to think people might schedule follow-up appointments or get more testing on the basis of their results.

Researchers worry about this problem of increased costs. It’s not a trivial problem, and one that medicine doesn’t have a response to, as patients often find a way to follow up on likely false positives. But it seems that this is a phase we’ll have to go through. I see no chance that a substantial proportion of the American population in the 2020s will not be sequenced.

June 12, 2017

10 million DTC dense marker genotypes by end of 2017?


Today I got an email from 23andMe saying that they’d hit the 2 million customer mark. Since reaching its goal of 1 million kits sold the company seems to have taken its foot off the pedal of customer base growth to focus on other things (in particular, how to get phenotypic data from those who have been genotyped). In contrast Ancestry has been growing at a faster rate of late. After talking to Spencer Wells (who was there at the birth of this sector) we estimated that the direct-to-consumer genotyping kit business is now north of 5 million individuals served. Probably closer to 6 or 7 million, depending on the numbers you assume for the various companies (I’m counting autosomal only).

This is pretty awesome. Each of these firms genotypes in the range of 100,000 to 1 million variant markers, or single nucleotide base pairs. 20 years ago this would have been an incredible achievement, but today we’re all excited about long-read sequencing from Oxford Nanopore. SNP-chips are almost ho-hum.

But though sequencing is the cutting edge, the final frontier and terminal technology of reading your DNA code, genotyping in humans will be around for a while because of cost. At ASHG last year a medical geneticist was claiming bulk price points for high-density SNP-chips in the range of the low tens of dollars per unit. A good high coverage genome sequence is still many times more expensive (perhaps an order of magnitude or more depending on who you believe). It also can impose more data processing costs than a SNP-chip in my experience.

Here’s a slide from Spencer:

I suspect genotyping will go S-shaped before 2025, after its current explosive growth. Some people will opt out. A minority of the population, but a substantial proportion. At the other extreme of the preference distribution you will have those who will start getting sequenced. Researchers will begin to talk about genotyping platforms like they talk about microarrays (yes, I know at places like the Broad they already talk about genotyping like that, but we can’t all be like the Broad!).

Here’s an article from 2007 on 23andMe in Wired. They’re excited about $1,000 genotyping services…now the cost of the cheapest high-quality (30x) whole genome sequences. Though 23andMe has a higher price point for its medical services, many of the companies are pushing their genotyping+ancestry below $100, a value it had stabilized at for a few years. Family Tree DNA has a Father’s Day sale for $69 right now. Ancestry looks to be $79. The Israeli company MyHeritage is also pushing a $69 sale price (the CSO there is advertising that he’s hiring human geneticists, just so you know). It seems very likely that a $50 price point is within sight in the next few years as SNP-chip costs become trivial and all the expenses are on the data storage/processing and visualization side. I think psychologically for many people paying $50 is not cheap, but it is definitely not expensive. $100 feels expensive.

Ultimately I do wonder if I was a bit too optimistic that 50% of the US population will be sequenced at 30x by 2025. But the dynamic is quite likely to change rapidly because of a technological shift as the sector goes through a productivity uptick. We’re talking about exponential growth, which humans have weak intuition about….

Addendum: Go into the archives of Genomes Unzipped and read the older posts. Those guys knew where we were heading…and we’re pretty much there.

April 7, 2017

Direct-to-consumer genomics, it’s back on!

Filed under: 23andMe,DTC,Genetics,Personal genomics — Razib Khan @ 8:11 am

For the past three and a half years, and arguably longer, something of a dark night has been passing over direct to consumer (DTC) personal genomics. The regulatory issues have been unclear to unfavorable. If you have read this blog you know 23andMe‘s saga with the Food and Drug Administration.

It looks like in 2017 DTC is finally turning a regulatory corner, with some clarity and freedom to operate, FDA Opens Genetic Floodgates with 23andMe Decision:

Today, the U.S. Food and Drug Administration told gene-testing company 23andMe that it will be allowed to directly tell consumers whether their DNA puts them at higher risk for 10 different diseases, including late-onset Alzheimer’s disease and Parkinson’s.

The decision to allow these direct-to-consumer tests is a big vindication for 23andMe, which in 2013 was forced to cease marketing such results after the FDA said they could be inaccurate and risky to consumers, and that they required regulatory approval.

I still agree with my assessment in 2013: this won’t mean anything in the long run. DTC is here to stay, and if the decentralization of medical testing and services doesn’t happen in the USA, it’ll happen elsewhere, and at some point medical tourism will get cheap enough that any restrictions in this nation won’t be of relevance. But this particular decision alters the timeline in the grand scheme of things, and matters a great deal for specific players.

It’s on!

March 23, 2017

Ancestry inference won’t tell you things you don’t care about (but could)

Filed under: Anthroplogy,Genetics,Genomics,Personal genomics — Razib Khan @ 5:59 pm

The figure above is from Noah Rosenberg’s relatively famous paper, Clines, Clusters, and the Effect of Study Design on the Inference of Human Population Structure. The context of the publication is that it was one of the first prominent attempts to use genome-wide data on a variety of human populations (specifically, from the HGDP data set) and attempt model-based clustering. There are many details of the model, but the one that will jump out at you here is that the parameter K defines the number of putative ancestral populations you are hypothesizing. Individuals then shake out as proportions of each of the K elements. Remember, this is a model in a computer, and you select the parameters and the data. The output is not “wrong,” it’s just the output based on how you set up the program and the data you input yourself.

These sorts of computational frameworks are innocent, but may give strange results if you want to engage in mischief. For example, let’s say that you put in 200 individuals, of whom 95 are Chinese, 95 are Swedish, and 10 are Nigerian. From a variety of disciplines we know that, to a first approximation, non-Africans form a monophyletic clade in relation to Africans. In plain English, all non-Africans descend from a group of people who diverged from Africans more than 50,000 years ago. That means if you imagine two populations, the first division should be between Africans and non-Africans, to reflect this historical demography. But if you skew the sample size, as the program looks for the maximal amount of variation in the data set it may decide that dividing between Chinese and Swedes as the two ancestral populations is the most likely model given the data.

This is not wrong as such. As the number of Africans in the data converges on zero, obviously the dividing line is between Swedes and Chinese. If you overload particular populations within the data, you may marginalize the variation you’re trying to explore, and the history you’re trying to uncover.
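You can see this dynamic in a toy simulation. What follows is my own sketch, not any published pipeline: NumPy simulates allele frequencies under a crude out-of-Africa topology, and scikit-learn’s KMeans stands in for STRUCTURE/ADMIXTURE-style model-based clustering. All of the drift magnitudes, SNP counts, and sample sizes are assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
n_snps = 500

def drifted(base, sd):
    # Allele frequencies after genetic drift of magnitude `sd` away from `base`.
    return np.clip(base + rng.normal(0, sd, n_snps), 0.01, 0.99)

ancestral  = rng.uniform(0.1, 0.9, n_snps)
eurasian   = drifted(ancestral, 0.15)   # stand-in for the out-of-Africa bottleneck
f_chinese  = drifted(eurasian, 0.08)
f_swedish  = drifted(eurasian, 0.08)
f_nigerian = drifted(ancestral, 0.08)

def draw(freqs, n):
    # Diploid genotypes (0/1/2 copies of an allele) for n individuals.
    return rng.binomial(2, freqs, size=(n, n_snps))

def k2_split(n_chn, n_swe, n_nga):
    X = np.vstack([draw(f_chinese, n_chn), draw(f_swedish, n_swe),
                   draw(f_nigerian, n_nga)])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    pops = np.array(["Chinese"] * n_chn + ["Swedish"] * n_swe
                    + ["Nigerian"] * n_nga)
    for pop in ("Chinese", "Swedish", "Nigerian"):
        counts = np.bincount(labels[pops == pop], minlength=2)
        print(f"  {pop}: cluster assignments {counts}")

print("Balanced sample (65/65/65):")
k2_split(65, 65, 65)
print("Skewed sample (95/95/10):")
k2_split(95, 95, 10)
```

With the balanced sample the K = 2 split falls between the African and non-African individuals; with the skewed sample it tends to fall between the two large Eurasian groups, just as described above.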

I’ve written all of this before. But I’m writing it again in the context of the earlier post, Ancestry Inference Is Precise And Accurate(ish). In that post I showed that consumers drive genomics firms to provide results where the grain of resolution and inference varies a lot as a function of space. That is, there is a demand that Northern Europe be divided very finely, while vast swaths of non-European continents are combined into one broad cluster.

Less than 5% Ancient North Eurasian

Another aspect, though, is time. These model-based admixture frameworks can implicitly traverse time as one moves up and down the number of K‘s. It is always important to explain to people that the K clusters may not correspond to real populations which all existed at the same time. Rather, they’re just explanatory instruments which illustrate phylogenetic distance between individuals. In a well-balanced data set for humans K = 2 usually separates Africans from non-Africans, and K = 3 then separates West Eurasians from other populations. Going across K‘s it is easy to imagine that one is traversing successive bifurcations.

A racially mixed man, 15% ANE, 30% CHG, 30% WHG, 30% EEF

But today we know it’s more complicated than that. Three years ago Pickrell and Reich published Toward a new history and geography of human genes informed by ancient DNA, where they reported that more powerful methods and data imply most human populations are relatively recent admixtures between extremely diverged lineages. What this means is that the origin of groups like Europeans and South Asians is very much like the origin of the mixed populations of the New World. Since then this insight has only become more powerful, as ancient DNA has shed light on massive population turnovers over the last 5,000 to 10,000 years.

These are to some extent revolutionary ideas, not well known even among the science press (which is too busy doing real journalism, i.e. the art of insinuation rather than illumination). As I indicated earlier, direct-to-consumer genomics firms use national identities in their cluster labels because these are comprehensible to people. Similarly, they can’t very well tell Northern Europeans that they are the outcome of a successive series of admixtures between diverged lineages from the late Pleistocene down to the Bronze Age. Though Northern Europeans, like South Asians, Middle Easterners, Amerindians, and likely Sub-Saharan Africans and East Asians, are complex mixes between disparate branches of humanity, today we view them as indivisible units of understanding, to make sense of the patterns we see around us.

Personal genomics firms therefore return results which are historically comprehensible. As a trivial example, the genomic data make it rather clear that Ashkenazi Jews emerged in the last few thousand years via a process of admixture between antique Near Eastern Jews and the peoples of Western Europe. After the initial admixture this group became an endogamous population, so that most Ashkenazi Jews share many common ancestors in the recent past with other Ashkenazi Jews. This is ideal for the clustering programs above, as Ashkenazi Jews almost always fit onto a particular K with ease. Assuming there are enough Ashkenazi Jews in your data set, you will always be able to find the “Jewish cluster” as you increase the value of K.

But the selection of a K which satisfies this comprehensibility criterion is a matter of convenience, not necessity. Most people are vaguely aware that Jews emerged as a people at a particular point in history. In the case of Ashkenazi Jews they emerged rather late in history. At certain K‘s Ashkenazi Jews exhibit mixed ancestral profiles, placing them between Europeans and Middle Eastern peoples. What this reflects is the earlier history of the ancestors of Ashkenazi Jews. But for most personal genomics companies this earlier history is not something that they want to address, because it doesn’t fit into the narrative that their particular consumers want to hear. People want to know if they are part-Jewish, not that they are part antique Middle Eastern and Southwest European.
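To make that concrete, here is a toy version of the traversal across K‘s, again my own sketch rather than any company’s method. A “founded” population is simulated as a 50/50 admixture of two parental pools followed by drift under endogamy, and scikit-learn’s GaussianMixture plays the role of the model-based clustering; every name and parameter below is invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
n_snps = 400

def drifted(base, sd):
    # Allele frequencies after drift of magnitude `sd` away from `base`.
    return np.clip(base + rng.normal(0, sd, n_snps), 0.01, 0.99)

european     = rng.uniform(0.1, 0.9, n_snps)
near_eastern = drifted(european, 0.12)
# 50/50 admixture of the two parental pools, then drift under endogamy:
founded      = drifted((european + near_eastern) / 2, 0.05)

def draw(freqs, n):
    return rng.binomial(2, freqs, size=(n, n_snps))

X = np.vstack([draw(european, 60), draw(near_eastern, 60), draw(founded, 60)])

for k in (2, 3):
    gm = GaussianMixture(n_components=k, covariance_type="diag",
                         n_init=5, random_state=0).fit(X)
    # Average membership of the founded group across the K components:
    mix = gm.predict_proba(X[120:]).mean(axis=0).round(2)
    print(f"K={k}: founded group's mean memberships {mix}")
```

At K = 2 the founded group tends to divide its membership between the two parental clusters; at K = 3 it usually claims a component of its own, which is the “Jewish cluster” phenomenon in miniature.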

Perplexity, of course, is not just for non-scientists. When Joe Pickrell’s TreeMix paper came out five years ago there was a strange signal of gene flow between Northern Europeans and Native Americans. There was no obvious explanation at the time…but now we know what was going on.

It turns out that Northern Europeans and Native Americans share common ancestry from Pleistocene Siberians. The relationship between Europeans and Native Americans has long been hinted at in results from other methods, but it took ancient DNA for us to conceptualize a model which would explain the patterns we were seeing.

An American with recent Amerindian (and probably African) ancestry

But in the context of the United States shared ancestry between Europeans and Native Americans is not particularly illuminating. Rather, what people want to know is if they exhibit signs of recent gene flow between these groups; in particular, many white Americans are curious if they have Native American heritage. They do not want to hear an explanation which involves the fusion of an East Asian population with Siberians 15,000 to 20,000 years ago, and then the emergence of Northern Europeans through successive amalgamations between Pleistocene, Neolithic, and Bronze Age Eurasians.

In some of the inference methods Northern Europeans, often those with Finnic ancestry or a relationship to Finnic groups, may exhibit signs of ancestry from the “Native American” cluster. But this is almost always a function of circumpolar gene flow, as well as the aforementioned Pleistocene admixtures. One way to avoid this would be simply not to report proportions below 0.5%. That way, people with higher “Native American” fractions would still receive those results, and the proportions would be high enough that they almost certainly indicate recent admixture, which is what people care about.
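That reporting rule is trivial to state in code. A minimal sketch, with invented cluster names and the 0.5% floor suggested above:

```python
def filter_report(proportions, floor=0.005):
    """Drop ancestry components below `floor`, renormalizing the rest."""
    kept = {name: p for name, p in proportions.items() if p >= floor}
    total = sum(kept.values())
    return {name: p / total for name, p in kept.items()}

# Illustrative numbers only:
raw = {"Northern European": 0.952,
       "Sub-Saharan African": 0.045,
       "Native American": 0.003}
print(filter_report(raw))
# The 0.3% "Native American" signal, plausibly circumpolar or Pleistocene
# in origin, never reaches the customer; the 4.5% African fraction does.
```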

Why am I telling you this? Because many journalists who report on direct-to-consumer genomics don’t understand the science well enough to grasp what’s being sold to the consumer (frankly, most biologists don’t know this field well either, even if they might use a barplot here and there).

And, the reality is that consumers have very specific parameters for what they want in terms of geographic and temporal information. They don’t want to be told true but trivial facts (e.g., that they are Northern European). But neither do they want to know things which are so novel and at such a remove from their interpretive frameworks that they simply can’t digest them (e.g., that Northern Europeans are a recent population construction which threads together very distinct strands with divergent deep-time histories). In the parlance of cognitive anthropology, consumers want their infotainment the way they want their religion: minimally counterintuitive. Some surprise, but not too much.

November 28, 2013

The total information world

Filed under: Personal genomics — Razib Khan @ 1:46 pm

Credit: Cryteria

Happy Thanksgiving (if you are an American)!

It’s been a busy few days in the world of personal genomics. By coincidence I have a coauthored comment in Genome Biology out, Rumors of the death of consumer genomics are greatly exaggerated (it was written and submitted a while back). If you haven’t, please read the FDA’s letter, and 23andMe’s response, as much as there is one right now. Since Slate ran my piece on Monday a lot of people have offered smart, and better informed, takes. On the one hand you have someone like Alex Tabarrok, with “Our DNA, Our Selves”, which is close to a libertarian cri de coeur. Then you have cases like Christine Gorman, “FDA Was Right to Block 23andMe”. It will be no surprise that I am much closer to Tabarrok than I am to Gorman (she doesn’t even seem to be aware that 23andMe offers a genotyping, not sequencing, service, though fuzziness on the details doesn’t discourage strong opinions from her). An interesting aspect is that many who are not deeply in the technical weeds of the issue are exhibiting politicized responses. I’ve noticed this on Facebook, where some seem to think that 23andMe and the Tea Party have something to do with each other, and the Obama administration and the FDA are basically stand-ins. In other words, some liberals are seeing this dispute as another attempt to evade government regulation, regulation they support on prior grounds. Though Tabarrok is better informed than the average person (his wife is a biologist), there are others from the right wing who are taking 23andMe’s side on normative grounds as well. Ultimately I’m not interested in this argument, because it’s not going to have any significant lasting power. No one will remember it in 20 years. As I implied in my Slate piece, 23andMe the company now is less interesting than personal genomics the industry sector in the future. Over the long term I’m optimistic that it will evolve into a field which impacts our lives broadly. Nothing the United States government can do will change that.

Yet tunneling down to the level of 23andMe’s specific issues with the regulatory process, there is the reality that it has to deal with the US government and the FDA, no matter what the details of its science are. It’s a profit-making firm. Matt Herper has a judicious take on this, 23andStupid: Is 23andMe Self-Destructing? I don’t have any “inside” information, so I’m not going to offer the hypothesis that this is part of some grand master plan by Anne Wojcicki. I hope it is, but that’s because I want 23andMe to continue to subsidize genotyping services (I’ve heard that though 23andMe owns the machines, the typing is done by LabCorp. And last I checked the $99 upfront cost is a major loss leader; they’re paying you to get typed). I’m afraid that they goofed here, and miscalculated. As I said above, it won’t make a major difference in the long run, but I have many friends who were waiting until this Christmas to purchase kits from 23andMe.


Then there are “the scientists,” or perhaps more precisely the genoscenti. Matt Herper stated something to the effect that the genoscenti have libertarian tendencies, and I objected. In part this is because I am someone who has conservative and/or libertarian tendencies, and I’m pretty well aware that I’m politically out of step with most individuals deeply involved in genetics, who are at most libertarian-leaning moderate liberals, and more often conventional liberal Democrats. Michael Eisen has a well thought out post, FDA vs. 23andMe: How do we want genetic testing to be regulated? Eisen doesn’t have a political ax to grind, and is probably representative of most working geneticists in the academy (he is on 23andMe’s board, but you should probably know that these things don’t mean that much). I may not know much about the FDA regulatory process, but like many immersed in genomics I’m well aware that many people talking about these issues don’t know much about the cutting edge of the modern science. Talk to any geneticist about conversations with medical doctors and genetic counselors, and they will usually express concern that these “professionals” and “gatekeepers” are often wrong, unclear, or confused on many of the details. A concrete example: when a friend explained to a veteran genetic counselor how my wife used pedigree information combined with genomic data to infer that my daughter did not have an autosomal dominant condition, the counselor asserted that you can’t know whether there were two recombination events within the gene, which might invalidate these inferences. Though my friend was suspicious, they did not say anything, because they were not a professional. As a matter of fact there just aren’t enough recombinations across the genome for an intra-genic event to be a likely occurrence (also, recombination likelihood is not uniformly distributed, and not necessarily independent, insofar as there may be suppression of very close events). And this was a very well informed genetic counselor.
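For the curious, here is the back-of-envelope arithmetic behind that claim, using assumed but representative figures (a large gene spanning roughly 0.1 cM, where 1 cM corresponds by definition to a 1% recombination fraction per meiosis):

```python
gene_size_cm = 0.1           # assumed span of a fairly large gene, in map units
p_one = gene_size_cm / 100   # 1 cM = 1% recombination fraction per meiosis
p_two = p_one ** 2           # two independent crossovers; an overestimate,
                             # since interference suppresses nearby events
print(f"P(one crossover within the gene)  ~ {p_one:.0e}")   # ~1e-03
print(f"P(two crossovers within the gene) ~ {p_two:.0e}")   # ~1e-06
```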

Additionally, there are the two major objections to 23andMe’s service which some on Twitter have been pointing me to. First, they return results which are highly actionable. The FDA explicitly used the example of a woman who goes and gets a mastectomy due to a 23andMe result. I don’t think this is a very strong objection: no doctor would perform a mastectomy based on 23andMe results alone. So that’s not an issue. Then there are those who allude to psychological harm. This could be a problem, but 23andMe requires multiple opt-in steps, so that you have to proactively seek out information on these sorts of diseases. Call me a libertarian if you will, but I object on principle to the idea that medical professionals necessarily have to be involved in the dissemination of information about my own genome as a regulatory matter. Obviously when it comes to a course of treatment they will be consulted, and no doubt there will be replications of any actionable results. But I don’t trust the medical sector to be populated by angels. To illustrate why I don’t trust medical professionals to always behave out of the goodness of their hearts, consider that deaths from hospital infections started dropping sharply when Medicare stopped paying for the treatment of those infections. Workers in the health care sector do care about patients, but even here incentives matter, and the human cognitive budget is such that outcomes shift greatly when nurses and doctors are reminded that washing hands is going to impact the bottom line (the reality is that hospitals probably instituted much stricter measures). What does this have to do with personal genomics? You are your own best advocate, and one of the major reasons that those in higher socioeconomic strata have better health outcomes is that they are so much less passive as patients. The more detailed the information you have on your own health, the better you can advocate for yourself and be involved in the decision-making process. And the reality is that with dropping prices in sequencing, and the ability to design software to do interpretation, without draconian measures there’s almost nothing the United States government will be able to do to prevent anyone with a moderate amount of motivation from getting this sort of information.

A second objection is that the SNPs returned are of small and very probabilistic effect. This is embedded in the issues surrounding “missing heritability” and the reality that most complex diseases are due to many factors. Because of small effect sizes and, until recently, small sample sizes, this literature has been littered with false positives which passed arbitrary statistical thresholds. The argument then boils down to the claim that 23andMe in many cases is not really adding any informative value. If that’s the case, though, then why the urgency to regulate it? Horoscopes and diet books do not add informative value either. This problem with small-effect SNPs is widely known, so bringing it up as if it were revelatory is rather strange to me. Additionally, as Eric Lander and others have pointed out, the locus which helped us discover statins is of very small effect. As long as they’re not false positives, small-effect SNPs are likely a good way to go in understanding biological pathways for pharmaceutical products. But that doesn’t speak to the risk prediction models. I think there the possibilities are murkier even in the long run, because complex traits are complex. Even when we have massive GWAS with sample sizes in the millions and 100x whole-genome coverage (this will happen), the environmental factors may still be significant enough that researchers will balk at definitive risk predictions.
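To see why small-effect SNPs illuminate pathways better than they predict individuals, consider a naive risk calculation. Everything below is invented for illustration (SNP IDs, odds ratios, baseline risk), and a real model would also adjust for population allele frequencies; the point is only the arithmetic:

```python
import math

# (SNP, risk-allele dosage 0/1/2, per-allele odds ratio) -- all hypothetical
genotype = [("rs0001", 1, 1.08), ("rs0002", 2, 1.05),
            ("rs0003", 0, 1.12), ("rs0004", 1, 0.94)]

baseline_risk = 0.10  # assumed population lifetime risk

log_odds = math.log(baseline_risk / (1 - baseline_risk))
for _, dosage, odds_ratio in genotype:
    log_odds += dosage * math.log(odds_ratio)

risk = 1 / (1 + math.exp(-log_odds))
print(f"Adjusted risk {risk:.3f} vs. baseline {baseline_risk:.2f}")
# Odds ratios this close to 1 move the estimate by only about a point
# of absolute risk, which is why individual prediction stays murky.
```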

Ultimately where I think personal genomics is going is alluded to in the Genome Biology piece: it will become part of a broader suite of information services, probably centralized, filtered, and curated by a helper artificial intelligence. What cognitive science and behavioral economics are telling us is that individuals operate under mental budget constraints. Dan MacArthur is probably right that personal genomics enthusiasts overestimated how involved the average person on the street would want to get in interpreting their own returned results. The reality is that even genetic counselors can barely keep up. Someday the field will stabilize, but this is not that day. Overall the information overload is going to get worse, not better, and the real upside, the game-changer, will be in the domain of computational tools which help us make decisions with a minimum of effort. A cartoon model of this might be an artificial intelligence which talks to you through an earbud all day, and takes your genomic, epigenomic, and biomarker status into account when advising you on whether you should pass on the dessert. But to get from here to there is going to require innovation. The end point is inevitable, barring a collapse of post-industrial civilization. The question is where it is going to happen. Here in the United States we have the technology, but we also have cultural and institutional roadblocks to this sort of future. If those roadblocks are onerous enough, it doesn’t take a genius to predict that high-tech lifestyle advisement firms, whose aim is to replace the whole gamut of self-help sectors with rationally designed applications and appliances, will simply decamp to Singapore or Dubai.

Personal genomics is a small piece of that. And 23andMe is a small piece of personal genomics. But they are not trivial pieces.


November 25, 2013

The FDA and 23andMe

Filed under: Personal genomics — Razib Khan @ 10:30 am

First, download your 23andMe raw results now if you have them. If you don’t know what’s going on, the FDA has finally started to move aggressively against the firm. Unfortunately this is not surprising, as it was foreshadowed years ago. And 23andMe has been moving aggressively to emphasize its medical, as opposed to genealogical, services over the past year. But this isn’t the story of one firm. This is the story of the government’s response to very important structural shifts occurring in the medical delivery system of the United States. The government could potentially bankrupt 23andMe, but taking a step back, that would still be like the RIAA managing to take down Napster. The information is coming, and if there’s one thing that can overpower state planning it is consumer demand. Unless the US government wants to ban its citizens from receiving their own genetic data, it’s just putting off the inevitable outsourcing of various interpretation services. Engagement would probably be the better long-term bet, but I don’t see that happening.


November 7, 2013

The future always advances

Filed under: DTC personal genomics,Personal genomics — Razib Khan @ 12:56 am

The last week has seen a lot of chatter about the slapping down of Sequenom’s diagnostic patent. Judge Invalidates Patent for a Down Syndrome Test:

A federal judge has invalidated the central patent underlying a noninvasive method of detecting Down syndrome in fetuses without the risk of inducing a miscarriage.

The ruling is a blow to Sequenom, a California company that introduced the first such noninvasive test in 2011 and has been trying to lock out competitors in a fast-growing market by claiming they infringe on the patent.

Sequenom’s stock fell 23 percent on Thursday, to $1.92.

The judge, Susan Illston of the United States District Court in Northern California, issued a ruling on Wednesday that the patent was invalid because it covered a natural phenomenon — the presence of DNA from the fetus in the mother’s blood.

The justification for intellectual property is a utilitarian one. That is, these are institutions meant to further the cause of creativity and innovation. So will this ruling mean an abandonment of the push toward technological innovation in this domain? Coincidentally, in the last week of October Sequenom put out a press release which heralded some advances in its panel:

…The MaterniT21 PLUS test will begin reporting additional findings for the presence of subchromosomal microdeletions and autosomal trisomies for chromosomes 16 and 22, in addition to the previously announced additional findings for sex chromosome aneuploidies involving an abnormal number of the X or Y chromosomes. These additional findings complement the MaterniT21 PLUS test core identification of trisomies for chromosome 21, chromosome 18 and chromosome 13. With this expansion, the MaterniT21 PLUS test is the first-of-its-kind noninvasive prenatal technology (NIPT) to provide these comprehensive results from a maternal blood draw.

Sequenom Laboratories will begin reporting on these select, clinically relevant microdeletions, including 22q11.2 deletion syndrome (DiGeorge), Cri-du-chat syndrome, Prader-Willi/Angelman syndrome, 1p36 deletion syndrome, as well as trisomies 16 and 22 the last week of October. Results from a method validation study….

It seems that the firm’s main path to profit and riches is going to be to innovate faster, gaining market share, brand recognition, and economies of scale along the way. That seems a greater good for the public than rents extracted through intellectual property monopolies.


December 18, 2012

Buddy, can you spare some ascertainment?

The above map shows the population coverage for the Geno 2.0 SNP-chip, put out by the Genographic Project. Their paper outlining the utility and rationale for the chip is now out on arXiv. I saw this map last summer, when Spencer Wells hosted a webinar on the launch of Geno 2.0, and it was the aspect which really jumped out at me. The number of markers on this chip is modest, just over 100,000 on the autosome, with a few tens of thousands more on the X, Y, and mtDNA. In contrast, the Axiom® Genome-Wide Human Origins 1 Array Plate being used by Patterson et al. has ~600,000 SNPs. But as is clear from the map above, Geno 2.0 was ascertained in many more populations than other comparable chips (the Human Origins 1 Array uses 12 populations). It’s obvious that if you are only capturing variation in a few populations, the extra half-million markers may not give you much bang for the buck (not to mention the biases they may introduce into your population genetic and phylogenetic inferences).


To the left is the list of populations against which the Human Origins 1 Array was ascertained, and they look rather comprehensive to me. In contrast, for Geno 2.0 ‘ancestrally informative markers’ were ascertained on 450 populations. The ultimate question for me is this: is all the extra ascertainment on diverse and obscure groups worth it? On first inspection Geno 2.0’s number of SNPs looks modest, as I stated, but in my experience when you quality-control and merge different panels together you are often left with only a few hundred thousand SNPs in any case. 100,000 to 200,000 SNPs is also sufficient to elucidate relationships even in genetically homogeneous regions such as Europe (it’s more than enough for model-based clustering, and seems to be overkill for MDS or PCA). One issue that jumps out at me about the Affymetrix chip is that its ascertainment is weighted toward the antipodes. In contrast, Geno 2.0 takes into account the Eurasian heartland. I suspect, for example, that Geno 2.0 would be better for population or ancestry assignment for South Asians, because it would have more informative markers for those populations.
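The merge arithmetic I have in mind is nothing more than a set intersection over marker lists. A sketch under assumed file names, reading PLINK-style .bim files:

```python
import pandas as pd

BIM_COLS = ["chrom", "snp", "cm", "pos", "a1", "a2"]  # standard .bim layout

def snp_ids(bim_path):
    # The SNP identifiers on one chip's marker list.
    return set(pd.read_csv(bim_path, sep=r"\s+", names=BIM_COLS)["snp"])

# Hypothetical exports of each chip's marker list:
geno2   = snp_ids("geno2.bim")
origins = snp_ids("human_origins.bim")

shared = geno2 & origins
print(f"Geno 2.0: {len(geno2):,}  Human Origins: {len(origins):,}  "
      f"shared: {len(shared):,}")
# Any merged analysis (clustering, PCA, MDS) is limited to `shared`,
# which is why combined panels often shrink to a few hundred thousand SNPs.
```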

Ultimately I can’t really say much more until I use both marker sets in different and similar contexts. Since Geno 2.0 consciously excludes many functional and medically relevant SNPs, its utility is primarily in the domain of demographics and history. If the populations in question are well covered by the Human Origins 1 Array, I see no reason why one shouldn’t go with it. Not only does it have more information about biological function, but the number of markers is many-fold greater. On the other hand, Geno 2.0 may be more useful in the “blank zones” of the Affy chip. Hopefully the Genographic Project results paper for Geno 2.0 will come out soon, so I can pull down their data set and play with it.

Cite: arXiv:1212.4116

December 17, 2012

Buyer beware in ancestry testing!

Filed under: Personal genomics — Razib Khan @ 10:20 pm

Over at Genomes Unzipped Vincent Plagnol has put up a post, Exaggerations and errors in the promotion of genetic ancestry testing, which to my mind is an understated and soft-touch old-fashioned “fisking” of the pronouncements of a spokesperson for an outfit termed Britain’s DNA. The whole post is worth reading, but this is the gravest aspect of the company’s response:

…The main reason is that listening to this radio interview prompted my UCL colleagues David Balding and Mark Thomas to ask questions to the Britain’s DNA scientific team; the questions have not been satisfactorily answered. Instead, a threat of legal action was issued by solicitors for Mr Moffat. Any type of legal threat is an ominous sign for an academic debate. This motivated me to point out some of the incorrect, or at the very least exaggerated, statements made in this interview. Importantly, while I received comments from several people for this post, the opinion presented here is entirely mine and does not involve any of my colleagues at Genomes Unzipped.

From what I can gather this firm is charging two to three times more than 23andMe for state-of-the-art scientific genealogy, circa 2002. So if you can’t be bothered to read the piece: it looks like Britain’s DNA is threatening litigation against researchers who had the temerity to point out that the firm is providing substandard services at above-market costs. Plagnol’s critique lays out a point-by-point refutation of the firm’s assertions, but the interpretation services on offer seem to resemble nothing so much as genetically rooted epic fantasy. A triumph of marketing over science.


In other scientific genealogy news, a friend recently sent me results for his family from Ancestry.com’s AncestryDNA service. Looking at the pie charts, I can say one thing: they were whack! But the question then is, are they truly just whack, or does their peculiarity indicate real genetic insight? I have no way to judge, because they still aren’t providing raw data downloads, though they promise to soon. I actually talked to a scientist from Ancestry.com for a little while at ASHG 2012, and he claimed that they were tweaking the algorithms even as we spoke. Nevertheless, bizarre results still seem to abound. It would be nice to figure out the method to this madness.

Finally, the genomic angle to the Dan MacArthur → Dan MacCurry saga is approaching closure. My friend Zack Ajmal promises to put up his analysis before he goes on vacation. I asked Zack to look into the matter because he has a very large database of South Asians, and I wanted to see if he could find the best match to Dan’s chromosome 10. If it does turn out to be highly probable that Dan’s South Asian ancestry is Bengali, then I’ll have to make sure he’s introduced to the aloo bhorta which his ancestors no doubt relished (and which is unpalatable to people of other South Asian ethnic groups because of the mustard oil).
