Razib Khan One-stop-shopping for all of my content

November 30, 2016

Razib Khan at the Center of Eurasia

Filed under: Genomics,science — Razib Khan @ 6:41 am
The Eurogenes blog is running a fundraiser. I chipped in mostly to support his continued blogging. I don't agree with everything he posts, but the site is a good and valuable resource. "Genome blogging" hasn't gotten as far as I'd have thought it would have in 2010, mostly because the initial burst of enthusiasm wasn't...

November 8, 2013

Selection happens; but where, when, and why?

Filed under: Anthroplogy,Genetics,Genomics,Pigmentation — Razib Khan @ 1:49 am
Distribution of SLC452 variation at SNP rs1426654. Credit, HGDP Browser

Distribution of SLC452 variation at SNP rs1426654. Credit, HGDP Browser

Nina Davuluri, Miss America 2014, Credit: Andy Jones

Nina Davuluri, Miss America 2014, Credit: Andy Jones

One of the secondary issues which cropped up with Nina Davuluri winning Miss America is that it seems implausible that someone with her complexion would be able to win any Indian beauty contest. A quick skim of Google images “Miss India” will make clear the reality that I’m alluding to. The Indian beauty ideal, especially for females, is skewed to the lighter end of the complexion distribution of native South Asians. Nina Davuluri herself is not particularly dark skinned if you compared her to the average South Asian; in fact she is likely at the median. But it would be surprising to see a woman who looks like her held up as conventionally beautiful in the mainstream Indian media. When I’ve pointed this peculiar aspect out to Indians* some of them of will submit that there are dark skinned female celebrities, but when I look up the actresses in question they are invariably not very dark skinned, though perhaps by comparison to what is the norm in that industry they may be. But whatever the cultural reality is, the fraught relationship of color variation to aesthetic variation prompts us to ask, why are South Asians so diverse in their complexions in the first place? A new paper in PLoS Genetics, The Light Skin Allele of SLC24A5 in South Asians and Europeans Shares Identity by Descent, explores this genetic question in depth.

Much of the low hanging fruit in this area was picked years ago. A few large effect genetic variants which are known to be polymorphic across many populations in Western Eurasia segregate within South Asian populations. What this means in plainer language is that a few genes which cause major changes in phenotype are floating around in alternative flavors even within families among people of Indian subcontinental origin. Ergo, you can see huge differences between full siblings in complexion (African Americans, as an admixed population, are analogous). While loss of pigmentation in eastern and western Eurasia seems to be a case of convergent evolution (different mutations in overlapping sets of genes), the H. sapiens sapiens ancestral condition of darker skin is well conserved from Melanesia to Africa.


So what’s the angle on this paper you may ask? Two things. The first is that it has excellent coverage of South Asian populations. This matters because to understand variation in complexion you should probably look at populations which vary a great deal. Much of the previous work has focused on populations at the extremes of the human distribution, Africans and Europeans. There are obvious limitations using this approach. If you are looking at variant traits, then focusing on populations where the full range of variation is expressed can be useful. Second, this paper digs deeply into the subtle evolutionary and phylogenomic questions which are posed by the diversification of human pigmentation. It is often said that race is often skin deep, as if to dismiss the importance of human biological variation. But skin is a rather big deal. It’s our biggest organ, and the pigmentation loci do seem to be rather peculiar.

You probably know that on the order of ~20% of genetic variation is partitioned between continent populations (races). But this is not the case at all genes. And pigmentation ones tend to be particular notable exceptions to the rule. In late 2005 a paper was published which arguably ushered in the era of modern pigmentation genomics, SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. The authors found that one nonsynonomous mutation was responsible for on the order of 25 to 33% of the variation in skin color difference between Africans and Europeans. And, the allele frequency was nearly disjoint across the two populations, and between Europeans and East Asians. When comparing Europeans to Africans and East Asians almost all the variation was partitioned across the populations, with very little within them. The derived SNP, which differs from the ancestral state, is found at ~100% frequency in Europeans, and ~0% in Africans and East Asians. It is often stated (you can Google it!) that this variant is the second most ancestrally informative allele in the human genome in relation to Europeans vs. Africans.

SLC24A5 was just the beginning. SLC45A2, TYR, OCA2, and KITLG are just some of the numerous alphabet soup of loci which has come to be understood to affect normal human variation in pigmentation. Despite the relatively large roll call of pigmentation genes one can safely say that between any two reasonably distinct geographic populations ~90 percent of the between population variation in the trait is going to be due to ~10 genes. Often there is a power law distribution as well. The first few genes of large effect are over 50% of the variance, while subsequent loci are progressively less important.

So how does this work to push the overall results forward?

- With their population coverage the authors confirm that SLC24A5 seems to be polymorphic in all Indo-European and Dravidian speaking populations in the subcontinent. The frequency of the derived variant ranges from ~90% in the Northwest, and ~80% in Brahmin populations all over the subcontinent, to ~10-20% in some tribal groups.

- Though there is a north-south gradient, it is modest, with a correlation of ~0.25. There is a much stronger correlation with longtitude, but I’m rather sure that this is an artifact of their low sampling of Indo-European populations in the eastern Gangetic plain. As hinted in the piece the correlation with longitude has to do with the fact that Tibetan and Burman populations in these fringe regions tend to lack the West Eurasian allele.

- Using haplotype based tests of natural selection the authors infer that the frequency of this allele has been driven up positively in north, but not south, India. It could be that the authors lack power to detect selection in the south because of lower frequency of the derived allele. And, I did wonder if selection in the north was simply an echo of what occurred in West Eurasia. But if you look at the frequency of the A allele in the north most of the populations seem to have a higher frequency of the derived variant than they do of inferred “Ancestral North Indian”.

What’s perhaps more interesting is the bigger picture of human evolutionary dynamics and phylogenetics that these results illuminate. Resequencing the region around SLC24A5 these researchers confirmed it does look like the derived variant is identical by descent in all populations across Western Eurasia and into South Asia. What this means is that this mutation arose in someone at some point around the Last Glacial Maximum, after West Eurasians separated from East Eurasians. The authors gives some numbers using some standard phylogenetic techniques, but admit that it is ancient DNA that will give true clarity on the deeper questions. When I see something written like that my hunch, and hope, is that more papers are coming soon.

When I first read The Light Skin Allele of SLC24A5 in South Asians and Europeans Shares Identity by Descent, I thought that it was essential to read Ancient DNA Links Native Americans With Europe and Efficient moment-based inference of admixture parameters and sources of gene flow. The reason goes back to the plot which I generated at the top of this post: notice that Native Americans do not carry the West Eurasian variant of SLC24A5. What the find of the ~24,000 Siberian boy, and his ancient DNA, suggest is that there was a population with affinities closer to West Eurasians than East Eurasians that contributed to the ancestry of Native Americans. The lack of the European variant of SLC24A5 in Native Americans suggests to me that the sweep had not begun, or, that the European variant was disfavored. What the other paper reports is that on the order of 20-40% of the ancestry of Europeans may be derived from an ancient North Eurasian population, unrelated to West Eurasians (or at least not closely related). It is likely that this population has something to do with the Siberian boy. Since Europeans are fixed for the derived variant of SLC24A5, that implies to me that sweep must have occurred after 24,000 years ago.

journal.pgen.1003912.g002At this point I have to admit that I believe need to be careful calling this a “European variant.” Just because it is nearly fixed in Europe, does not imply that the variant arose in Europe. If you look at the frequency of the derived variant you see it is rather high in the northern Middle East. Looking at some of the populations in the Middle Eastern panel the ancestral variant might be all explained by admixture in historical time from Africa. If the sweep began during the last Ice Age, then most of Europe would have been uninhabited. The modern distribution is informative, but it surely does not tell the whole story.

Where we are is that SLC24A5 , and pigmentation as a whole, is coming to be genomically characterized fully. We don’t know the whole story of why light skin was selected so strongly. And we don’t quite know where the selection began, and when it began. But through gradually filling in pieces of the puzzle we may come to grips with this adaptively significant trait in the nearly future.

Citation: Basu Mallick C, Iliescu FM, Möls M, Hill S, Tamang R, et al. (2013) The Light Skin Allele of SLC24A5 in South Asians and Europeans Shares Identity by Descent. PLoS Genet 9(11): e1003912. doi:10.1371/journal.pgen.1003912

* From my personal experience American born Indians often do not share the same prejudices and biases, partly because subtle shades of brown which are relevant in the Indian context seem ludicrous in the United States.

The post Selection happens; but where, when, and why? appeared first on Gene Expression.

January 8, 2013

Using your 23andMe data: exploring with MDS

Filed under: 23andMe,Genomics — Razib Khan @ 2:09 am

Note: please read the the earlier post on this topic if you haven’t.

The above image is from 23andMe. It’s from a feature which seems to have been marginalized a bit with their ancestry composition. Basically it is projecting 23andMe customers on a visualization of genetic variation from the HGDP data set. This is actually a rather informative sort of representation of variation. But there has always been an issue with the 23andMe representation: you are projected onto their invariant data set. In other words, you can’t mix & match the populations so as to explore different relationships. The nature of the algorithm and representation produces strange results, so varying the population sets is often useful in smoking out the true shape of things.

With the MDS feature I wrote about yesterday you can now compute positions with different weights of populations and mixes. This post will focus on how to manipulate the overall data set. You should have PHYLO from the the earlier post. Open up the .fam file. It should look like this:

Malayan A382 0 0 1 -9
Paniya D36 0 0 1 -9
BiakaPygmies HGDP00479 0 0 1 -9
BiakaPygmies HGDP00985 0 0 1 -9
BiakaPygmies HGDP01094 0 0 1 -9
MbutiPygmies HGDP00982 0 0 1 -9
Mandenkas HGDP00911 0 0 1 -9
Mandenkas HGDP01202 0 0 1 -9
Yorubas HGDP00927 0 0 1 -9
BiakaPygmies HGDP00461 0 0 1 -9
BiakaPygmies HGDP00986 0 0 1 -9
MbutiPygmies HGDP00449 0 0 1 -9
Mandenkas HGDP00912 0 0 1 -9
Mandenkas HGDP01283 0 0 1 -9
Yorubas HGDP00928 0 0 2 -9

And so forth. PHYLO has 1,500+ individuals. This is a bit much, which is why the – -genome command took so long. To ask particular questions it is often useful to prune the population down. I have a friend who is 1/4 Filipino who is curious as to whether his ancestry was more Chinese or native Filipino. How to answer this?

- You want a range of East Asian populations, north to south.

- You want a good out group. I’ll use the Utah whites.

All you need to do is go through the .fam file and keep only those lines you want, and put them into a new file, keep.txt. Then you run this command:

plink – -noweb – -bfile PHYLO – -keep keep.txt – -make-bed – -out PHYLONARROW

So I’ve now made a new pedigree data set which is a subset of the original. Now I merged my friend and my daughter’s genotype into this data set. What about if I wanted to remove some individuals, for examples, the ones in keep.txt? You do it like so:

plink – -noweb – -bfile PHYLO – -remove keep.txt – -make-bed – -out PHYLOAFEWGONE

With – -keep and – -remove, and making files drawn from the .fam file(s), you can customize your own data set for your own purposes. Again you want to produce an MDS, so run:

- -plink – -noweb – -bfile PHYLONARROW – -genome

-plink – -noweb – -bfile PHYLONARROW – -read-genome plink.genome – -mds-plot 6

This time – -genome will run very fast, because there are far fewer individuals. Here is my plot of the result of the outcome (my friend is “RF,” my daughter is “RD”):

Note that RF is aligned straight toward the “Dai” population, an ethnic group from South China, but not Han (they are related to the Thai). It seems plausible that my friend is of mixed Chinese and Filipino background. My daughter’s minimal East Asian ancestry is indeed Southeast Asian, and this is clear from this plot, as she is shifted further toward the Cambodians (this may be due to South Asian affinities as well).

The point is not to rely on one plot, but to generate many so as to explore the possibilities, and develop and intuition.

January 7, 2013

Using your 23andMe data in Plink

Filed under: Genomics,PLINK — Razib Khan @ 2:58 am

With the recent $99 price point for 23andMe many of my friends have purchased kits (finally!). 23andMe’s interpretive results are pretty rich now, but there are still things missing. There are plenty of third party tools you can use, but I know some people might want to do their own data analysis. There are many ways you could go about this, but I want to put up some posts on DIY genomic data analysis to making the learning curve a little less steep, and get people started. Motivation to actually begin going down this road is a big issue, but I think once you get over the hump it gets a lot easier.

First, you need Plink. It is really preferable that you work on a Mac or in Linux to engage in heavy duty analysis, but in this post I’ll assume you are working on the Windows platform. Again, the point here is to make this accessible. Download Plink if you don’t have it, and extract it where ever you like.


Plink is a command line tool, which means that you need to into the folder with the old MS-DOS interface. So use the cd command to get into that folder. Here is a screenshot of my shell:

The selection “plink –noweb –bfile PhyloF –genome” is a command that I entered. It is not part of the directory structure. If you don’t know about the cd command, please see the Wikipedia entry. It’s really just a simple way to step through the directory structure of your files and folders.

Now you have Plink. We need to put your 23andMe data into pedigree format. Additionally it would be convenient to have other reference data sets . Go to here. You now need to click the ZIP option. That will download a 74 MB zip containing all the files you see listed to the left. Most of that is in the two zip files, which are pedigree file data sets that I have provided for your future use. More on that later. First you need to use “CONVERT_23AME_PED.pl”  This a Perl script which takes the 23andMe text file, and converts it to pedigree format which Plink can use. You need to have Perl to use this script.

If you are on Windows you need to get ActivePerl. Download it. Again you have to open the command prompt and go into the appropriate folder. On my computer (this is the first time I’m using Perl on Windows in 10 years, the sacrifices I make for the readership of this weblog!) it is in the C:\ directory, so you probably have to move “up” the directory tree twice by typing “cd ..” (if you do this you’ll see what I mean). Once you are in the Perl directory you need to go into the bin directory. Remember to move the Perl script into the Perl directory. Here is a screen of what I get when I try and run the Perl script without any parameters:

Basically there needs to be a file for the script to process. You should have a 23andMe text file, your raw data. It will start like so: “genome_”. If you don’t have it, go into your account, and click “Browse Raw Data.”  On this page there will be options to download various peoples’ data if you have multiple accounts. It will download whoever is selected in your profile (for most it will be just one person of course).

Now you need to just select the button and enter your password. An 8 MB zip file will come down from the server. Put it into your Perl/bin folder by extracting it. Do not try and process the zip file! Once in there you now add it as your first parameter. I’d rename it something short and sweet since you’ll be typing it in. You don’t need to put a unique id parameter in, but I would if I were you. Try “me.” And “me” for the family id. At some point you’ll do more sophisticated things and need less silly ids, but not right now.

Here’s a screenshot of me running the Perl script with my own data (I renamed the text file). If the file name isn’t recognized make sure that you didn’t add the file extension within Windows, that might confuse it (e.g., for razibdata.txt if that’s what you see in the directory, you’d have to enter in razibdata.txt.txt in the parameter value since the extension is hidden):

There are two output files. In my case they are razibdata.ped and razibdata.map. As you can see they are named from your original file. You need to move both into the Plink directory. The .ped file has your individual data, the first half a dozen columns being the same as the parameters you may, or may not, have entered above. But it is very large because the whole line is filled out with your 23andMe genotype. The .map file basically has the information about the SNPs. These are both text files, and unwieldy.  You need to make it into a binary file. At the end of this there are three new files of the same name with extensions .bed, .bim, .fam:

You can see a lot of information. Most of it is not relevant to you, but note the number of SNPs. So now you have a pedigree file! Great. What do you do with it? Lots of stuff. You can look at the Plink documentation. Because the .bed file is a binary, never open it. The .bim has SNP info. You shouldn’t open this. On the other hand when you merge data sets .fam is useful. It’s a text file with all your individual and family id information. In this case with one file it isn’t informative, though you could change the id by editing the .fam file.

One thing you can do with just one individual is look for runs of homozygosity. The command is:

plink –bfile mydata –homozyg

You enter your binary pedigree file name, without the extension. Observe that now we are use –bfile instead of –file. Many commands will be bCommand instead of just Command if you are using binary files instead of the conventional ones. Binary files are smaller and the commands finish much faster, so use them! The output files, unless you use the –out command at the end to define them, usually begin with plink. So above you have plink.hom. It has some interesting information about the runs of homozygosity, but it is probably not too illuminating unless you suspect you are inbred!

Ultimately what I want you do by the end of this is compute an MDS with your own data against a reference set. That’s PHYLO in the data I’ve provided. It has 99,000 SNPs that overlap with 23andMe, and 1,500 individuals. I’ve altered the .fam file so that all the family ids are recognizable as populations. This will make analysis of the output easier for you. First you need to merge the files. It will be useful for you to prune your data set down, since you have a lot of extra SNPs.

Assuming you’ve extracted PHYLO out of the zip downloaded here is my command writing out the list of SNPs within PHYLO:

You can see from reading this that this data set has ~99,000 SNPs. I pruned it so that it ran quicker for phylogenetic analysis. This is more than a sufficient number for most analysis. What you want to next is create a copy of your own data which doesn’t have so many SNPs, so you can merge them well. Because I created this data set I can tell you that all of the above SNPs are probably in your 23andMe file. With the commands above there is a file, plink.snplist, which you will use to filter your data set.

Here’s how to do it:

Now we’ve got it ready to merge. I will warn you that this takes forever on Windows! No idea why. Also, Windows tends to do strange things with the file extensions. If Plink tells you that a .fam does not exist, look to the file extension. If you label something as something.fam, it might actually be something.fam.fam. In any case, here’s how you merge:

This is going to give you lots of warnings. Often this won’t matter, but sometimes it will tell you that you might need to “flip” one of the files. Try flipping it. If it still doesn’t work I would remove the SNPs causing problems. Something like this:

Honestly you might have to do a lot of things to get data sets to merge. But this particular combination of 23andMe genotype and PHYLO shouldn’t be too bad. Let’s assume that your merge worked. What do you want to do? One thing that might be interesting is an MDS plot (it’s like a PCA plot).

First you run the genome command, which takes forever to finish. It might be best if you did this before you go to sleep, and just check in in the morning. The genome command will produce an output that you’ll use next.

Notice the input file. That was generated in the previous step. The value 6 is a parameter that defines how many dimensions you want to output. My experience with this is that it doesn’t take too long, so I go for 6 at least. The final result of this is that you have an plink.mds file with an ordered list of family and individual ids, along with positions for 6 dimensions. It should be straightforward to import this into Excel, and then plot your MDS, emphasizing your own position. Since I can no longer use Excel I couldn’t be bothered to figure out how to plot my own position, but the distribution should be familiar.

That’s about it for now. I’ll put up another post focusing less on phylogenetics, using the HapMap data set that I provided. I don’t know if I can continue to do this in Windows, but hopefully this illustrated how easy (if tedious) most of this is.

January 5, 2013

Why the future won’t be genetically homogeneous

While reading The Founders of Evolutionary Genetics I encountered a chapter where the late James F. Crow admitted that he had a new insight every time he reread R. A. Fisher’s The Genetical Theory of Natural Selection. This prompted me to put down The Founders of Evolutionary Genetics after finishing Crow’s chapter and pick up my copy of The Genetical Theory of Natural Selection. I’ve read it before, but this is as good a time as any to give it another crack.

Almost immediately Fisher aims at one of the major conundrums of 19th century theory of Darwinian evolution: how was variation maintained? The logic and conclusions strike you like a hammer. Charles Darwin and most of his contemporaries held to a blending model of inheritance, where offspring reflect a synthesis of their parental values. As it happens this aligns well with human intuition. Across their traits offspring are a synthesis of their parents. But blending presents a major problem for Darwin’s theory of adaptation via natural selection, because it erodes the variation which is the raw material upon which selection must act. It is a famously peculiar fact that the abstraction of the gene was formulated over 50 years before the concrete physical embodiment of the gene, DNA, was ascertained with any confidence. In the first chapter of The Genetical Theory R. A. Fisher suggests that the logical reality of persistent copious heritable variation all around us should have forced scholars to the inference that inheritance proceeded via particulate and discrete means, as these processes do not diminish variation indefinitely in the manner which is entailed by blending.

More formally the genetic variance decreases by a factor of 1/2 every generation in a blending model. This is easy enough to understand. But I wanted to illustrate it myself, so I slapped together a short simulation script. The specifications are as follows:

1) Fixed population size, in this case 100 individuals

2) 100 generations

3) All individuals have 2 offspring, and mating is random (no consideration of sex)

4) The offspring trait value is the mid-parent value of the parents, though I also including a “noise” parameter in some of the runs, so that the outcome is deviated somewhat in a random fashion from expected parental values

In terms of the data structure the ultimate outcome is a 100 ✕ 100 matrix, with rows corresponding to generations, and each cell an individual in that generation. The values in each cell span the range from 0 to 1. In the first generation I imagine the combining of two populations with totally different phenotypic values; 50 individuals coded 1 and 50 individuals coded 0. If a 1 and 1 mate, the produce only 1′s. Likewise with 0′s. On the other hand a 0 and a 1 produce a 0.5. And so forth. The mating is random in each generation.

The figure to the left illustrates the decay in the variance of the trait value over generation time in different models. The red line is the idealized decay: 1/2 decrease in variance per generation. The blue line is one simulation. It roughly follows the decay pattern, though it is deviated somewhat because it seems that there was some assortative mating randomly (presumably if I used many more individuals it would converge upon the analytic curve). Finally you see one line which follows the trajectory of a simulation with noise. Though this population follows the theoretical decay more closely initially, it converges upon a different equilibrium value, one where some variance remains. That’s because the noise parameter continues to inject this every generation. The relevant point is that most of the variation disappears < 5 generations, and it is basically gone by the 10th generation. To maintain variation in a blending inheritance model requires a great deal of mutation, the extent of which is just not plausible.

To get a different sense of what occurred in these two particular simulations, here are heat maps. The interval 0 and  1 now have shading in each sell. I am displaying only 50 generations here. The top panel is one without noise, while the bottom panel has the noise parameter.

The contrast with a Mendelian model is striking. Imagine that 0 and 1 are now coded by two homozygote genotypes, with heterozygotes exhibiting a value of 0.5. If all the variation is controlled by the genotypes, then you have three genotypes, and three trait values. If I change the scenario above to a Mendelian one than variance will initially decrease, but the equilibrium will be maintained at a much higher level, as 50% of the population will be heterozygotes (0.5), and 50% homozygotes of each variety (0 and 1). With the persistence of heritable variation natural selection can operate to change the allele frequencies over time without the worry that the trait values within a breeding population will converge upon each other too rapidly. This is true even in cases of polygenic traits. Height and I.Q. remain variant, because they are fundamentally heritable through discrete and digital processes.

All this is of course why the “blond gene” won’t disappear, redheads won’t go extinct, nor will humans converge upon a uniform olive shade in a panmictic future. A child is a genetic cross between parents, but only between 50% of each parent’s genetic makeup. And that is one reason they are not simply an “averaging” of parental trait values.

January 4, 2013

Mitochondrial Eve: a de facto deception?

The above image, and the one to the left, are screenshots from my father’s 23andMe profile. Interestingly, his mtDNA haplogroup is not particularly common among ethnic Bengalis, who are more than ~80% on a branch of M. This reality is clear in the map above which illustrates the Central Asian distribution my father’s mtDNA lineage. In contrast, his whole genome is predominantly South Asianform, as is evident in the estimate that 23andMe provided via their ancestry composition feature, which utilizes the broader genome. The key takeaway here is that the mtDNA is informative, but it should not be considered to be representative, or anything like the last word on one’s ancestry in this day and age.


As a matter of historical record mtDNA looms large in human population genetics and phylogeography for understandable reasons. Mitchondria produce more genetic material than is found in the nucleus, and so were the lowest hanging fruit in the pre-PCR era. Additionally, because mtDNA lineages do not recombine they are well suited to a coalescent framework, where an idealized inverted treelike phylogeny converges upon a common ancestor. Finally, mtDNA was presumed to be neutral, so reflective of demographic events unperturbed by adaptation, and characterized by a high mutation rate, yielding a great amount of variation with which to differentiate the branches of the human family tree.

Many of these assumptions are are now disputable. But that’s not the point of this post. In the age of dense 1 million marker SNP-chips why are we still focusing on the history of one particular genetic region? In a word: myth. Eve, the primal woman. The “mother of us all,” who even makes cameos in science fiction finales!

In 1987 a paper was published which found that Africans harbored the greatest proportion of mtDNA variation among human populations. Additionally, these lineages coalesced back to a common ancestor on the order of 150,000 years ago. Since mtDNA is present in humans, there was a human alive 150,000 years ago who carried this ancestral lineage, from which all modern lineages derive. Mitochondrial DNA is passed from mothers to their offspring, so this individual must have been a woman. In the press she was labeled Eve, for obvious reasons. The scientific publicity resulted in a rather strange popular reaction, culminating in a Newsweek cover where Adam and Eve are depicted as naked extras from Eddie Murphy’s Coming to America film.

The problem is that people routinely believe that mtDNA Eve was the only ancestress of all modern humans from the period in which she lived. Why they believe this is common sense, and requires no great consideration. The reality is that the story being told by science is the story of mtDNA, with inferences about the populations which serve as hosts for mtDNA being incidental. These inferences need to be made cautiously and with care. It is basic logic that a phylogeny will coalesce back to a common ancestor at some point. Genetic lineages over time go extinct, and so most mtDNA lineages from the time of Eve went extinct. There were many woman who were alive during the same time as Eve, who contributed at least as much, perhaps more, to the genetic character of modern humans today. All we can say definitively is that their mtDNA lineage is no longer present. As mtDNA is passed from mother to daughter (males obviously have mtDNA, but we are dead ends, and pass it to no one), all one needs for a woman’s mtDNA lineage to go extinct is for her to have only sons. Though she leaves no imprint on the mtDNA phylogeny, obviously her sons may contribute genes to future generations.

Prior to ancient DNA and the proliferation of dense SNP data sets scholars were a bit too ambitious about what they believed they could infer from mtDNA and Y lineages (e.g., The Real Eve: Modern Man’s Journey Out of Africa). We are in a different time now, inferences made about the past rest on more than one leg. But the legend of Eve of the mtDNA persists, not because of its compelling scientific nature, but because this is a case where science piggy-backs upon prior conceptual furniture. This yields storytelling power, but a story which is based on a thin basis of fact becomes just another tall tale.

All this is on my mind because one of the scientists involved with Britain’s DNA, Jim Wilson, has penned a response to Vincent Plagnol’s Exaggerations and errors in the promotion of genetic ancestry testing (see here for more on this controversy). Overall I don’t find Wilson’s rebuttal too persuasive. It is well written, but it has the air of sophistry and lawyerly precision. I have appreciated Wilson’s science before, so I am not casting aspersions at his professional competence. Rather, some of the more enthusiastic and uninformed spokespersons for his firm have placed him in a delicate and indefensible situation, and he is gamely attempting to salvage the best of a bad hand. Importantly, he does not reassure me in the least that his firm did not use Britain’s atrocious libel laws as a threat to mute forceful criticism of their business model on scientific grounds. A more general issue here is that Wilson is in a situation where he must not damage the prospects of his firm, all the while maintaining his integrity as a scientist. From what I have seen once science becomes a business one must abandon the pretense of being a scientist first and foremost, no matter how profitable that aura of objectivity may be. The nature of marketing is such that the necessary caution and qualification essential for science becomes a major liability in the processing of communicating. It’s about selling, not convincing.

Going back to Eve, Wilson marshals a very strange argument:

“The claim that Adam and Eve really existed, as you suggest, refers to the most recent common ancestors of the mtDNA and non-recombining part of the Y chromosome. I don’t agree that there is nothing special about these individuals: there must have been a reason why mitochondrial Eve was on the front cover of Time magazine in the late 80s!….

A minor quibble, but I suspect he means the Newsweek cover. More seriously, this line of argumentation is bizarre on scientific grounds. Rather, it is a tack which is more rational when aiming toward a general audience which might purchase a kit which they believe might tell them of their relationship to “Eve.”

In the wake of the discussion at Genomes Unzipped I participated in further exchanges with Graham Coop and Aylwyn Scally on Twitter, and decided to spend 20 minutes this afternoon asking people what they thought about mitochondrial Eve. By “people,” I mean individuals who are pursuing graduate educations in fields such as genetics and forensics. My cursory “field research” left me very alarmed. Naturally these were individuals who did not make elementary mistakes in regards to the concept, but there was great confusion. I can only wonder what’s going through the minds of the public.

Analogies, allusions, and equivalences are useful when they leverage categories and concepts which we are solidly rooted in, and transpose them upon a foreign cognitive landscape. By pointing to similarities of structure and relation one can understand more fully the novel ground which one is exploring. Saying that the president of India is analogous to the queen of England is an informative analogy. These are both positions where the individual is a largely ceremonial head of state. In contrast, the president of the United States and the queen of England are very different figures, because the American executive is not ceremonial at all. This is not a useful analogy, even though superficially it sees no lexical shift.

Who was Eve? A plain reading is that she is the ancestor of all humans, and more importantly, the singular ancestress of all humans back to the dawn of time. This is a concept which the public grasps intuitively. Who is mtDNA Eve? A woman who flourished 150,000 years ago, who happened to carry the mtDNA lineage which would drift to fixation in the ancestors of modern humans. I think this is a very different thing indeed. For purposes of poetry and marketing the utilization of the name Eve is justifiable. But on scientific grounds all it does is confuse, obfuscate, and mislead.

The fiasco that Vincent Plagnol stumbled upon is just a symptom of a broader problem. Scientists need to engage in massive conceptual clean up, as catchy phrases such as “mitochondrial Eve” and “Y Adam” permeated the culture over the past generation, and mislead many sincere and engaged seekers of truth. This is of the essence because personal genomics, and the scientific understanding of genealogy, are now moving out of the ghetto of hobbyists, enthusiasts, and researchers. Though I doubt this industry will be massive, it will be ubiquitous, and a seamless part of our information portfolio. If people still have ideas like mitochondrial Eve in their head it is likely to cloud their perception of the utility of the tools at hand, and their broader significance.

December 21, 2012

The causes of evolutionary genetics

A few days ago I was browsing Haldane’s Sieve,when I stumbled upon an amusing discussion which arose on it’s “About” page. This “inside baseball” banter got me to thinking about my own intellectual evolution. Over the past few years I’ve been delving more deeply into phylogenetics and phylogeography, enabled by the rise of genomics, the proliferation of ‘big data,’ and accessible software packages. This entailed an opportunity cost. I did not spend much time focusing so much on classical population and evolutionary genetic questions. Strewn about my room are various textbooks and monographs I’ve collected over the years, and which have fed my intellectual growth. But I must admit that it is a rare day now that I browse Hartl and Clark or The Genetical Theory of Natural Selection without specific aim or mercenary intent.

R. A. Fisher

Like a river inexorably coursing over a floodplain, with the turning of the new year it is now time to take a great bend, and double-back to my roots, such as they are. This is one reason that I am now reading The Founders of Evolutionary Genetics. Fisher, Wright, and Haldane, are like old friends, faded, but not forgotten, while Muller was always but a passing acquaintance. But ideas 100 years old still have power to drive us to explore deep questions which remain unresolved, but where new methods and techniques may shed greater light. A study of the past does not allow us to make wise choices which can determine the future with any certitude, but it may at least increase the luminosity of the tools which we have iluminate the depths of the darkness. The shape of nature may become just a bit less opaque through our various endeavors.

Figure from “Directional Positive Selection on an Allele of Arbitrary Dominance”, Teshima KM, Przeworski M

So what of this sieve of Haldane? As noted at  Haldane’s Sieve the concept is simple. Imagine two mutations, one which expresses a trait in a recessive fashion, and another in a dominant one. The sieve operates by favoring the emergence out of the low frequency zone where stochastic forces predominate of dominantly expressing variants (i.e., even if an allele confers a large fitness benefit, at low frequencies the power of random chance may still imply that it is highly likely to go extinct). An example of this would be lactase persistence, which in the modal  Eurasian variant seems to exhibit dominance. The converse case, where beneficial mutations are recessive in expression suffer from a structural problem where their benefit is more theoretical than realized.

The mathematics of this is exceedingly simple, a consequence of the Hardy-Weinberg dynamics of diploid random mating organisms. Let’s use the gene which is implicated in variation in lactase persistence as an example, LCT. Consider two alleles, LP and LNP, where the former confers persistence (one can digest lactose sugar as an adult), and the latter manifests the conventional mammalian ‘wild type’ (the production of lactase ceases as one leaves the life stage when nursing is feasible). LP is clearly the novel mutant. In a small population it is not unimaginable that by random chance the frequency of LP rises to ~10%. What now? At HWE you have:

p2 + 2pq + q2 = 1, where q = LP allele. At ~10% the numbers substituted would be:

(0.90)2 + 2(0.90)(0.10) + (0.10)2

This is where dominance or recessive expression is highly relevant. The reality is that LP is a dominant trait. So in this population the frequency of LP as a trait would be:

(0.10)2 + 2(0.90)(0.10) = 19%

Now imagine a model where LP is favored, but it expresses in a recessive fashion. Then the frequency of the trait would equal q2, the homozygote LP-allele proportion. That is, 1%. Though population genetics is often constructed on an algebraic foundation, the results lend themselves to intuition. A structural parameter endogenous to the genetic system, dominant or recessive expression, can have longstanding consequences in terms of the likely trajectory of the alleles. Selection only “sees” the trait, so a recessive trait with sterling qualities may as well be a trait with no qualities. In contrast, a dominantly expressed allele can cut like a scythe through a population, because every copy “counts.”

In preparation for this post I revisited the selection on Haldane’s Sieve in the encyclopediac Elements of Evolutionary Genetics. The authors note that this phenomenon, though of vintage character as these things can be reckoned is a field as young as evolutionary genetics, is still a live one. The dominance of favored mutations in wild populations, or the recessive character of deleterious ones in laboratory stock, may reflect the different regimes which these two genes pools are subject to. The nature of things is such that is easier to generate recessive mutations than dominant ones (i.e., loss is easier than gain), so the preponderance of dominant variants in wild stocks subject to positive selective pressure lends credence to the idea that evolutionary rather than development forces and constraints shape the genetic character of many species.

And yet things are not quite so tidy. Haldane’s Sieve, and the framework of dominant versus recessive alleles, operates differently in the area of sex chromosomes. In many lineages there is a ‘heterogametic sex’ which carries only one functional chromosome for most of the genome. In mammals this is the male (XY), while in birds this is the female (ZW). As males have only one functional copy of most genes on the sex chromosome, the masking effect of recessive expression does not apply to them in mammals. This may imply that because of the exposure of many deleterious recessive variants to natural selection within the heterogametic sex one would see different allelic distributions and genetic landscapes on these chromosomes (e.g., more rapid adaptation because of the exposure of nominally recessive alleles in the heterogametic sex, as well as more purifying selection on deleterious variants). But the reality is more complex, and the literature in this area is somewhat muddled. More precisely, it seems phylogenetically sensitive. Validation of the theory in mammals founders once one moves to Drosphila.

And that is why research in evolutionary genetics continues. The theory stimulates empirical exploration, and is tested against it. Much of the formal theory of classical evolutionary genetics, which crystallized in the years before World War II, is now gaining renewed relevance because of empirical testability in the era of big data and big computation. This is an domain where the past is not simply of interest to historians. Scientists themselves, chasing the next grant, and producing the expected stream of publications, may benefit from a little historical perspective by standing upon the shoulders of giants.

December 13, 2012

We are Nature

Filed under: Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 8:03 am

There’s an interesting piece in Slate, The Great Schism in the Environmental Movement, which seems to be a distillation of trends which have been bubbling within the modern environmentalist movement for a generation now (I’ve read earlier manifestos in a similar vein). I can’t assess the magnitude of the shift, but here’s the top-line:

But that is a false construct that scientists and scholars have been demolishing the past few decades. Besides, there’s a growing scientific consensus that the contemporary human footprint—our cities, suburban sprawl, dams, agriculture, greenhouse gases, etc.—has so massively transformed the planet as to usher in a new geological epoch. It’s called the Anthropocene.

Modernist greens don’t dispute the ecological tumult associated with the Anthropocene. But this is the world as it is, they say, so we might as well reconcile the needs of people with the needs of nature. To this end, Kareiva advises conservationists to craft “a new vision of a planet in which nature—forests, wetlands, diverse species, and other ancient ecosystems—exists amid a wide variety of modern, human landscapes.”


Let’s take this debate as a given. It is fundamentally normative. That is, it is about values. We we need to tread carefully before projecting values across disputants. Far too often in this domain people seem to presume normative alignments, and therefore confuse ideological disagreement for rejection of factual truths. But, one thing to consider is that it is probable that human beings have already radically reshaped the ecological character of the world over the past 100,000 years. The implicit model that many older environmental activists seem to present is a framework pitting man & the machine vs. nature (the Shire vs. Mordor). But it is just not a useful dichotomy for many.

It is possible that there was, and is, no “pristine” nature. These disparate perspectives come to the fore in particular in post-colonial landscapes settled by Europeans. There is a long tradition in these areas of transforming ‘natives’ into ‘Noble Savages,’ who have attained some idealized harmony with Nature. The reality is that it is not harmony that was attained, but equilibrium. The arrival of anatomically modern humans to Australia and the New World resulted in a ‘shock’ to the ecological system, as megafauna went extinct due to the new variable of human predation. Even if H. sapiens were not the sufficient condition for these extinctions (populations naturally go through cycles), it is likely they were necessary (i.e., humans might extirpate species during times of low census size). But it is not just the initial impact in terms of species turnover. Australian and Amerindian populations seem to have reshaped the long term character of the landscape through fireCharles C. Mann argues in 1491 that  the vast forests which colonial and early American settlers cleared were in fact second growth, which emerged in the wake of massive die-offs of indigenous peoples due to Old World disease.

All of this is fundamentally complicated. Instead of a decision tree with two options, ‘Civilization’ vs. ‘Nature,’ there is actually a space populated with a multitude of positions. As someone touched by a moderate amount of biophilia my vision for the future is one of arcology based urbanism, massively scaled up algaculture, and megafaunal rewilding through genetic engineering and ancient DNA. Rather than idealize a mythic past we should endeavor to forge a new future. So it was, and so shall it ever be.

December 12, 2012

A lighter shade of brown: Dan MacArthur, look east or south!

Filed under: Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 2:58 pm

South Indian Udupi cuisine

In the post below I offered up my supposition that Dan MacArthur’s ancestry is unlikely to be Northwest Indian, which precludes a Romani origin for his South Asian ancestry. Indeed this is almost certainly so, Dienekes Pontikos followed up my crude analyses with IBD-sharing calculations (IBD = ‘identity by descent,’ which is basically what you would think it is). The South Asian population which MacArthur has the closest affinity to is from Karnataka, which is one of the Dravidian speaking states of the South. This does not necessarily refute my earlier contention, as aside from Brahmins most Bengalis seem to have broad South Indian affinities, except for the fact that they often have more East Asian ancestry.


Now, I may seem a touch obsessive on this issue at this point. There are several things motivating me. First, this was laying around in plain sight, but we missed it for years! Second, I’ve known Dan for a while, so this is very amusing on a personal level. Third, Dienekes’ has been pushing me to continue my exploration in a friendly competition. None of this is very difficult, and I’ve been going at it in the early hours of the day before work, or right before I go to sleep. In short, I’m doing this in part to show that you don’t need to just talk genomics, you too can do genomics. Ironically the age of “Big Data” is also the age of distributed data.

Don’t wait to have children!

Filed under: Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 7:34 am

The New Republic has a piece up, How Older Parenthood Will Upend American Society, which won’t have surprising data for readers of this weblog. But it’s nice to see this sort of thing go “mainstream.” My daughter was born when her parents were in their mid-30s, so I know all the statistics. They aren’t good bed-time reading (she’s healthy and robust so far!). If I had to do it over again I definitely wouldn’t have waited this long. After becoming a father it brought home to me that waiting was one of the worst decisions of my life. Why postpone something this incredible for the more far more prosaic pleasures of an extended adolescence? Granted, I’m not sure that I would have been the best father at 25, but I don’t think there’s much I can say in reply to the argument that I should have become a father by 30.

More concretely, we would have had sperm and egg “banked” if we had been smart delaying parenthood. The article notes that storage of sperm costs $850 up front, and $300 to $500 per year after that, and that many balk at the cost. And how much do you spend on your cell phone every year? The issue here seems to be time preference.


Of course one of the aspects of the piece which is implicit is its class-orientation. It finishes up:

It won’t be easy to make the world more baby-friendly, but if we were to try, we’d have to restructure the professions so that the most intensely competitive stage of a career doesn’t occur right at the moment when couples should be lavishing attention on infants….

Though the fertility transition has occurred across all groups, it has been most striking among the professional classes. But the the professional classes are a minority of the population. The reality is that we’d expect fertility to rebound over time as professional class people, whether for genetic or cultural (likely both) reasons, fail to reproduce, and are replaced by those who produce large families early, and continue to do so in a heritable fashion.

 
The plot to the left illustrates the dynamic which I’m alluding to. If at t = 0 you have two populations where one outnumbers the second by a factor of 100, if the population which is at ~1% proportion has a doubling-time of 30 years vs. replacement for the majority, then it takes ~200 years for the minority to outnumber the majority. Of course, we need to be careful about projections. Something I’ve criticized more gullible or historically naive readers about. Mormon fundamentalists or Old Order Amish have very high fertility rates in the context of being embedded in advanced societies which have very low fertility (at least in relation). But context-free models are useless. For example, some hyper-fertile groups, such as Mormon fundamentalists and Hasidic Jews, have been known to exploit the welfare state to subsidize their natalism. Such a subsidy lasts only so long as the ratio between producers and consumers favors the former. If a trend is unsustainable, it won’t be sustained. In the short term the “limits to growth” are probably not capped by resources, but human capital, which allows us to have increasing marginal returns. Systems do return to equilibria eventually.

As I don’t see a shift back to younger families among the professional classes, what’s the solution? Massive sperm and egg banking of 20 year old individuals seems like an economically feasible and effective strategy. I wouldn’t be surprised if the math works out that this reduces our long term health care costs, by cutting down on children born with congential defects and decreasing the lifetime morbidity rate of the population as a whole. Want to bend the cost curve? This might be a way! Combined with pre-implantation genetic diagnosis the ‘solution’ then is probably more technology, not less.

The long term fix of course is to reform the professional system and the culture of careerism so that what is rewarded is genuine productivity, rather than signalling. By this, I mean that I suspect many people who work “80 hours a week” aren’t all that productive beyond about 50, but they do signal to their superiors that they are “committed.” And of course now the whole peer group has to shift toward this irrational norm. Better workplace monitoring and productivity assessments might be a way to address this issue, so that superiors can actually evaluate gains to hours of putative labor.

Finally, the General Social Survey asks people about the ideal number of children they’d like to have, versus the actually number of children they do have (CHLDIDEL and CHILDS). To remove demographic confounds I limited the sample below to non-Hispanic white women age 45 and up between the year 2006 and 2010, and compared across educational attainment.

December 11, 2012

We don’t know why Ethiopians breathe easy

Filed under: Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 9:40 pm

Most people are aware that altitude imposes constraints on individual performance and function. Much of this is flexible; athletes who train at high altitudes may gain a performance edge. But over the long term there are costs, just as there are with computers which are ‘overclocked.’ This is the point where you make the transition from physiology to evolution. Residence at high altitude entails strong selective pressures on populations. Over the past few years there has been a great deal of exploration of the genetics of long resident high altitude groups, the Tibetans, Peruvians, and Ethiopians.

Why is high altitude adaptation of interest? First, it’s clear and distinct. Coding the two phenotypes is not that difficult. Second, the adaptive value is clear. Though people can survive at high altitudes, most do not flourish. And third, you have several distinct populations which are phylogenetically diverse. In other words, you can test evolution across multiple lineages, and see if the phenotypic and genetic features resemble each other, or differ. To some extent we know the answer: though there are overlaps, altitude adaptations differ both in their physiological presentation, and in their genetic architecture. Because it is such a powerful force, the shape of adaptation to high altitudes gives us a better sense of the arc of evolution across diverse populations and times.

With all that under our belt, I have to admit that a new paper in PLoS Genetics, The Genetic Architecture of Adaptations to High Altitude in Ethiopia, is quite peculiar in my opinion. To not put too fine a point on it, it seems hurried and confused. Granted, from what I can gather the authors confirmed the likelihood that the genetic architecture of high altitude adaptations in Ethiopian populations is such that variance exhibits a more polygenic character, resulting in weaker ability to detect genes of large effect. But if you read the paper you notice that they go off on a particular tangent on the adaptation to diseases on immune loci. Much of the work over the past 10 years on recent evolution in humans has focused on immune response, so I don’t see why that was necessary for the broader paper.

Additionally, there’s one element which I think needs to be explored more: the phylogenetic history of the Ethiopian people. The paper makes the blithe assertion that Ethiopians have been resident in the highlands of that nation for tens of thousands of years, but the latest research is that that just isn’t so. In short, it may be that Ethiopians as we understand them are recent compounds of a long resident Sub-Saharan African population and West Eurasian migrants. I’m not sure that this necessarily impacts the high altitude adaptations, as the admixed populations would presumably inherit the alleles from the local residents if they did not exterminate them. But, it does seem to complicate the story that you’d want to tell.

Overall, I think this sort of paper, which follows on the heels of other work, should have been presented in a more explicitly evolutionary context. If these adaptations are old, then there may be many variants in the highlands of Ethiopia. In contrast, the Peruvian adaptations are presumably new, and from what I have been told they are more like improvisations on the fly. In a way the spectrum of subtly of adaptation, with Ethiopians being the most nuanced, and Peruvians the least (which Tibetans in the middle), gives us a sense of the different power of evolution on the scale of 50,000 years vs. 10,000 years. Additionally, the region around Ethiopia has long been inhabited by hominins. No prior population was resident on the Altiplano before the arrival of the first Amerindians ~10,000 years before the present.

I’m hoping that the Nielsen Group comes out with a better paper. I know they’re going to revisit the topic of altitude adaptation (so I was told at ASHG 2012).

Citation: Alkorta-Aranburu G, Beall CM, Witonsky DB, Gebremedhin A, Pritchard JK, et al. (2012) The Genetic Architecture of Adaptations to High Altitude in Ethiopia. PLoS Genet 8(12): e1003110. doi:10.1371/journal.pgen.1003110

December 10, 2012

Is Daniel MacArthur ‘desi’?

My initial inclination in this post was to discuss a recent ordering snafu which resulted in many of my friends being quite peeved at 23andMe. But browsing through their new ‘ancestry composition’ feature I thought I had to discuss it first, because of some nerd-level intrigue. Though I agree with many of Dienekes concerns about this new feature, I have to admit that at least this method doesn’t give out positively misleading results. For example, I had complained earlier that ‘ancestry painting’ gave literally crazy results when they weren’t trivial. It said I was ~60 percent European, which makes some coherent sense in their non-optimal reference population set, but then stated that my daughter was >90 percent European. Since 23andMe did confirm she was 50% identical by descent with me these results didn’t make sense; some readers suggested that there was a strong bias in their algorithms to assign ambiguous genomic segments to ‘European’ heritage (this was a problem for East Africans too).

Here’s my daughter’s new chromosome painting:

One aspect of 23andMe’s new ancestry composition feature is that it is very Eurocentric. But, most of the customers are white, and presumably the reference populations they used (which are from customers) are also white. Though there are plenty of public domain non-white data sets they could have used, I assume they’d prefer to eat their own data dog-food in this case. But that’s really a minor gripe in the grand scheme of things. This is a huge upgrade from what came before. Now, it’s not telling me, as a South Asian, very much. But, it’s not telling me ludicrous things anymore either!

But in regards to omission I am curious to know why this new feature rates my family as only ~3% East Asian, when other analyses put us in the 10-15% range. The problem with very high values is that South Asians often have some residual ‘eastern’ signal, which I suspect is not real admixture, but is an artifact. Nevertheless, northeast Indians, including Bengalis, often have genuine East Asia admixture. On PCA plots my family is shifted considerably toward East Asians. The signal they are picking up probably isn’t noise. Almost every apportionment of East Asian ancestry I’ve seen for my family yields a greater value for my mother, and that holds here. It’s just that the values are implausibly low.

In any case, that’s not the strangest thing I saw. I was clicking around people who I had “shared” genomes with, and I stumbled upon this:

As you can guess from the screenshot this is Daniel MacArthur’s profile. And according to this ~25% of chromosome 10 is South Asian! On first blush this seemed totally nonsensical to me, so I clicked around other profiles of people of similar Northern European background…and I didn’t see anything equivalent.

What to do? It’s going to take more evidence than this to shake my prior assumptions, so I downloaded Dr. MacArthur’s genotype. Then I merged it with three HapMap populations, the Utah whites (CEU), the Gujaratis (GIH), and the Chinese from Denver (CHD). The last was basically a control. I pulled out chromosome 10. I also added Dan’s wife Ilana to the data set, since I believe she got typed with the same Illumina chip, and is of similar ethnic background (i.e., very white). It is important to note that only 28,000 SNPs remained in the data set. But usually 10,000 is more than sufficient on SNP data for model-based clustering with inter-continental scale variation.

I did two things:

1) I ran ADMIXTURE at K = 3, unsupervised

2) I ran an MDS, which visualized the genetic variation in multiple dimensions

Before I go on, I will state what I found: these methods supported the inference from 23andMe, on chromosome 10 Dr. MacArthur seems to have an affinity with South Asians (i.e., this is his ‘curry chromosome’). Here are the average (median) values in tabular format, with MacArthur and his wife presented for comparison.

ADMIXTURE results for chromosome 10
K 1 K 2 K 3
CEU 0.04 0.02 0.93
GIH 0.87 0.05 0.08
CHD 0.01 0.97 0.01
Daniel MacArthur 0.29 0.07 0.64
Ilana Fisher 0.01 0.06 0.94

You probably want a distribution. Out of the non-founder CEU sample none went above 20% South Asian. Though it did surprise me that a few were that high, making it more plausible to me that MacArthur’s results on chromosome 10 were a fluke:

And here’s the MDS with the two largest dimensions:

Again, it’s evident that this chromosome 10 is shifted toward South Asians. If I had more time right now what I’d do is probably get that specific chromosomal segment, phase it, and then compare it to various South Asian populations. But I don’t have time now, so I went and checked out the results from the Interpretome. I cranked up the settings to reduce the noise, and so that it would only spit out the most robust and significant results. As you can see, again chromosome 10 comes up as the one which isn’t quite like the others.

Is there is a plausible explanation for this? Perhaps Dr. MacArthur can call up a helpful relative? From what  recall his parents are immigrants from the United Kingdom, and it isn’t unheard of that white Britons do have South Asian ancestry which dates back to the 19th century. Though to be totally honest I’m rather agnostic about all this right now. This genotype has been “out” for years now, so how is it that no one has noticed this peculiarity??? Perhaps the issue is that everyone was looking at the genome wide average, and it just doesn’t rise to the level of notice? What I really want to do is look at the distribution of all chromosomes and see how Daniel MacArthur’s chromosome 10 then stacks up. It might be a random act of nature yet.

Also, I guess I should add that at ~1.5% South Asian that would be consistent with one of MacArthur’s great-great-great-great grandparents being Indian. Assuming 25 year generation times that puts them in the mid-19th century. Of course, at such a low proportion the variance is going to be high, so it is quite possible that you need to push the real date of admixture one generation back, or one generation forward.

December 9, 2012

The origins of the Romani determined definitively

Filed under: Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 11:52 am

In many cases there are questions of a historical and ethnographic nature which are subject to controversy and debate. Scholarly arguments are laid out, and further dispute ensues. For decades progress seems fleeting, as one hypothesis is accepted, only to be subject to later revision. This sort of pattern gives succor to the most cynical and jaded of ‘Post Modern’ set, especially when the ‘discourse’ in question is in the domain of science.

But thankfully these debates can come to an end in some cases. So it is with the origins of the European Romani, better known as ‘Gypsies’ (though the Roma are the most well known of the Romani, other groups within Europe have different ethnonyms). Obviously many of the basic elements have long been there, but I think the most recent genetic work now establishes a level of closure. Taking a step back, what do we know?

1) The Romani language seems to be Indo-Aryan, with a likely affinity with the northwest group of Indo-Aryan languages

2) The Romani presence in Europe only dates to the past ~1,000 years, with an entry point in the Byzantine Empire

3) They are an admixture between an ancestral Indian element, and local populations

4) Their history of endogamy has resulted in a strong genetic drift effect

The two papers which seem to nail the coffin shut on these questions use somewhat different methodologies. One relies on Y chromosomal STRs (hypervariable repeat regions) to generate a paternal phylogeny. Focusing just on the paternal phylogeny allows for one to make very robust genealogical inferences. Additionally, the authors had a very large data set across India. Their goal was to ascertain the exact region of origin of the Romani before they left India. As noted in bullet #1 there is already some evidence from their language that this must be in northwest India. The second paper uses a SNP-chip; hundreds of thousands of autosomal markers. This has been done to death for other populations, so the method isn’t new. Rather, it is that it is now being applied to the Romani.

First, the Y chromosomal paper. The Phylogeography of Y-Chromosome Haplogroup H1a1a-M82 Reveals the Likely Indian Origin of the European Romani Populations:

Linguistic and genetic studies on Roma populations inhabited in Europe have unequivocally traced these populations to the Indian subcontinent. However, the exact parental population group and time of the out-of-India dispersal have remained disputed. In the absence of archaeological records and with only scanty historical documentation of the Roma, comparative linguistic studies were the first to identify their Indian origin. Recently, molecular studies on the basis of disease-causing mutations and haploid DNA markers (i.e. mtDNA and Y-chromosome) supported the linguistic view. The presence of Indian-specific Y-chromosome haplogroup H1a1a-M82 and mtDNA haplogroups M5a1, M18 and M35b among Roma has corroborated that their South Asian origins and later admixture with Near Eastern and European populations. However, previous studies have left unanswered questions about the exact parental population groups in South Asia. Here we present a detailed phylogeographical study of Y-chromosomal haplogroup H1a1a-M82 in a data set of more than 10,000 global samples to discern a more precise ancestral source of European Romani populations. The phylogeographical patterns and diversity estimates indicate an early origin of this haplogroup in the Indian subcontinent and its further expansion to other regions. Tellingly, the short tandem repeat (STR) based network of H1a1a-M82 lineages displayed the closest connection of Romani haplotypes with the traditional scheduled caste and scheduled tribe population groups of northwestern India.

 

Two trees illustrate the results succinctly:

The bottom line:

- This particular Y chromosomal lineage which is highly diagnostic of South Asian origin in the Romani shows that the Romani seem to derive from the populations of northwest India

- Additionally, within these populations the Romani Y chromosomal lineages derive from the lower caste elements, the scheduled castes and scheduled tribes

But the above results don’t get directly at genome-wide admixture. The second paper does, using hundreds of thousands of markers to explore the Romani affinity to other populations. Reconstructing the Population History of European Romani from Genome-wide Data:

The Romani, the largest European minority group with approximately 11 million people…constitute a mosaic of languages, religions, and lifestyles while sharing a distinct social heritage. Linguistic…and genetic…studies have located the Romani origins in the Indian subcontinent. However, a genome-wide perspective on Romani origins and population substructure, as well as a detailed reconstruction of their demographic history, has yet to be provided. Our analyses based on genome-wide data from 13 Romani groups collected across Europe suggest that the Romani diaspora constitutes a single initial founder population that originated in north/northwestern India ∼1.5 thousand years ago (kya). Our results further indicate that after a rapid migration with moderate gene flow from the Near or Middle East, the European spread of the Romani people was via the Balkans starting ∼0.9 kya. The strong population substructure and high levels of homozygosity we found in the European Romani are in line with genetic isolation as well as differential gene flow in time and space with non-Romani Europeans. Overall, our genome-wide study sheds new light on the origins and demographic history of European Romani.

The plot to the left illustrates the relationship of the Romani to world-wide populations using multi-dimensional scaling, where genetic variation is decomposed into dimensions, and individuals are plotted on those dimensions. In short, the Romani exhibit a classic admixture cline pattern.That is, they are the products of a two-way admixture between populations which occupy distinct positions along a cline, and Romani individuals and populations are distributed along the cline in proportion to their admixture. One notable aspect is that the Romani are actually two clusters; one which manifests a strong ‘east’-'west’ distribution, and another which seems located purely within the European cluster. The latter seems to be the Welsh Romani, who in the neighbor-joining tree (see the supplements) fall on the same branch as European populations, as opposed to the other Romani, who form their own clade.

To drill down further you need to ascertain admixture with a model-based clustering algorithm. Ergo, ADMIXTURE. I’ve reedited the figure to illustrate the salient points. In particular, it is clear that the Roma populations except the Welsh have significant South Asian ancestry. The question is how much? To answer this question you need to know the source population in South Asia. A peculiar aspect of this plot is that the Romani have very little of the green ancestral component, which happens to be modal in the Middle East (not shown). This element happens to be highly enriched in many Pakistani populations, but not necessarily northwest Indian ones. Nevertheless, the issue that leaves me suspicious of this particular finding is that many of the European populations, in particular those groups (e.g., Balkans) which may have admixed with the Romani, have this element to extent not evident in one of their presumed ‘daughter’ populations. I wonder if perhaps the peculiarities of Romani inbreeding has skewed the allele frequency distribution so much that you get strangeness like this. I am not showing higher K’s because those break out with a Romani-cluster. Just like the Kalash-cluster this is to a great extent a feature of the long term endogamy of these communities. With high levels of drift the allele frequency of these groups moves into a very peculiar space in relation to their parental populations, but one must not become confused and assume that the Romani or Kalash are themselves appropriate independent clusters in the same way that Europeans or East Asians are.

Using various forms of admixture analysis the authors seem to conclude that the Balkan Romani are 30-50% South Asian. This seems in line with intuition. But that still leaves open the question of who those South Asians were. As I noted above the most thorough Y chromosomal data point to the lower caste elements of northwest India. What do the autosomes say?

I don’t want get into the technical details of how they tested the models, but it seems that one of the likely parental populations to the Romani had a close relationship to the Meghwal, a scheduled caste from northwest India. In other words, the autosome results align very well with the Y chromosomal inferences. Additionally, the models tested imply that the Romani likely left South Asian ~1,000 years before the present, which aligns well with what is known from the historical record (though this is a case where I put much more stock in the historical record than inferences from population genetic models; look at the intervals).

Finally, there is the question of inbreeding. One aspect of the Romani genome is jumps out you is that they have many long “runs-of-homozygosity” (ROH). This is totally expected, as decades of uniparental analyses suggested a great deal of population bottleneck events as the Romani spread throughout Europe. But the ROH patterns also unearth an interesting fact: some of the Balkan Romani clearly have recent European admixture, while the non-Balkan Romani had an initial period of admixture followed by endogamy. The latter scenario seems to resemble Askhenazi Jews, while the former would suggest that the boundary between Romani and non-Romani in the Balkans is more fluid than is sometimes portrayed.

So there we have it. The Romani derive from lower castes populations from the northwest Indian subcontinent who seem to have left ~1,000 years ago. Over time they admixed with local populations, and are now 50-70% non-South Asian, with some groups being ~90% European (e.g., Welsh Romani). And, they have a long history as an endogamous group, judging by their inbreeding.

December 2, 2012

TreeMix: Who were the West Eurasian ancestors of Ethiopians?

Filed under: Anthroplogy,Ethiopia,Genetics,Genomics — Razib Khan @ 3:46 pm

One of the primary concerns/questions I had about Luca Pagani’s paper on the genetic origin of Ethiopians is that he found that their West Eurasian ancestor was closer to Levantine than Arabian. I was confused by this because on model-based clustering (e.g., Admixture) when you push down to a fine level of granularity you always see that the Ethiopians cluster with the Yemenis for their non-African ancestry. More precisely, Yemeni Jews are often ~100% component X, which ~50% of the ancestry of Ethiopians.

From what I recall Pagani et al. used haplotype windows which they assigned to Eurasian or African ancestral components, and they compared these to the populations related to the putative ancestral groups. Because Pagani et al. used blocks of the genome, rather than just on specific genotypes, I weight their finding more strongly. But I wanted to double check with TreeMix if the finding in Admixture was peculiar.

So again, I took a ~150,000 SNP set ran it on TreeMix with migration = 5.

Again, you see that the gene flow to the Ethiopians is coming from a position on the tree rather close to Yemenite Jews. One model which may explain this, and still align with Pagani’s findings, is that Arabians themselves are a synthetic population. A “pure” Yemenite Jew may have ancient admixture of African affinity beneath an intrusive element from the north. The parallelism between Ethiopia and Arabia in this model is clear, with the major difference being magnitude of the source population admixture (greater in Arabia), as well as some differences of the target population.

This again reiterates us to be careful of trust first-blush summaries.

More diversity in the Personal Genome Project

Filed under: Genomics,Human Genomics,PGP — Razib Khan @ 1:42 pm

Seeking Diversity (Especially Families):

Because the PGP is self-recruiting, we don’t have a very balanced set of participants. “Self-recruitment” means that all participants have enrolled in our project through word of mouth, finding our website and enrolling online. To put it bluntly, that means we mostly end up with young white men….

…Research within one or two racial/ethnic categories isn’t necessarily a virtue, biracial and multiracial heritage may be even more interesting to some researchers and can open more areas for future….

In particular, NIST is looking for “trios”: two parents and a child. Researchers like to use samples from trios because they know every piece of DNA in the child comes from one of the parents. This makes it easier to assess error rates — and that sort of quality control is what NIST expects the genome material to be used for. We think all such family groups are valuable, but current trios in the PGP haven’t been the most diverse….

Reader Paul brought this to my attention. I haven’t been too interested in the PGP for myself because it’s just so slow to “play” with whole genomes (~3 GB) as opposed to 1 million SNPs. But over Christmas I’ll look into signing up, and see if they are interested in my own “trio.” I also thought I’d pass this along to readers, though my readership actually looks almost exactly like current PGP participants, so I don’t know if I’ll be contributing to the problem.

Layering genetic histories

Filed under: Anthroplogy,Genetics,Genomics,Human Genetics,Human Genomics — Razib Khan @ 12:14 pm

As a follow up to my post from yesterday, I decided to run TreeMix on a data set I happened to have had on hand (see Inference of Population Splits and Mixtures from Genome-Wide Allele Frequency Data for more on TreeMix). Basically I wanted to display a tree with, and without, gene flow.

The technical details are straightforward. I LD pruned ~550,000 SNPs down to ~150,000. I ran TreeMix without and with migration parameters with the Bantu Kenya population being the root. Finally, when I did turn on the migration parameter I set it for 5. You can see the results below.

Most of the flows are pretty expected. The West Eurasian flow from the Turks to the Uygurs makes sense, because there is a large West Asian component to what the Uygurs have (from East Iranians?). The Chuvash are a Turkic group with minor, but significant, Turkic component. The HGDP Russian sample does have some East Eurasian ancestry. And the Moroccans also have African ancestry. But your guess is as good as mine with the Bantu flow in. These are I think Kenya, so it might be trying to interpret Nilotic admixture as generalized Eurasian.

A minor note: installing TreeMix and generating the appropriate files from pedigree format is not to difficult. But you might have confusion in how to generate the pedigree input file. You do it like so in PLINK:

./plink --noweb --bfile YourFile --freq --within YourGroupNamesFile --out YourOutPutFile

It’s the last you want to put into TreeMix’s python conversion script. The YourGroupNamesFile is basically the .fam file with an extra column, the population names for each individual.

December 1, 2012

Africa’s hidden people hold the keys to the past

I mentioned this in passing on my post on ASHG 2012, but it seems useful to make explicit. For the past few years there has been word of research pointing to connections between the Khoisan and the Cushitic people of Ethiopia. To a great extent in the paper which is forthcoming there is the likely answer to the question of who lived in East Africa before the Bantu, and before the most recent back-migration of West Eurasians. On one level I’m confused as to why this has to be something of a mystery, because the most recent genetic evidence suggests a admixture on the order of 2-3,000 years before the past.* If the admixture was so recent we should find many of the “first people,” no? As it is, we don’t. I think these groups, and perhaps the Sandawe, are the closest we’ll get.

Publication is imminent at this point (of this, I was assured), so I’m going to just state the likely candidate population (or at least one of them): the Sanye, who speak a Cushitic language with possible Khoisan influences. There really isn’t that much information on these people, which is why when I first heard about the preliminary results a few years back and looked around for Khoisan-like populations in Kenya I wasn’t sure I’d hit upon the right group. But at ASHG I saw some STRUCTURE plots with the correct populations, and the Sanye were one of them. I would have liked to see something like TreeMix, but the STRUCTURE results were of a quality that I could accept that these populations were not being well modeled by the variation which dominated their data set. Though Cushitic in language the Sanye had far less of the West Eurasian element present among other Cushitic speaking populations of the Horn of Africa. Neither were their African ancestral components quite like that of the Nilotic or Bantu populations. The clustering algorithm was having a “hard time” making sense of them (it seemed to wanted to model them as linear combinations of more familiar groups, but was doing a bad job of it).

Here is an interesting article on these groups: Little known tribe that census forgot. Like the Sandawe this is a population which seems to have been hunter-gatherers very recently, and to some extent still engage in this lifestyle. In this way I think they are fundamentally different from Indian tribal populations, who are often held up to be the “first people” of the subcontinent.  More and more it seems that the tribes of India are less the descendants of the original inhabitants of the subcontinent, at least when compared to the typical Indian peasant, and more simply those segments of the Indian population which were marginalized and pushed into less productive territory. Over time they naturally diverged culturally because of their isolation, but the difference was not primal. In contrast, groups like the Sanye and Sandawe may have mixed to a great extent with their neighbors (and lost their language like the Pygmies), but evidence of full featured hunting & gathering lifestyles implies a sort of direct cultural continuity with the landscape of eastern Africa before the arrival of farmers and pastoralists from the west and north.

* I understand some readers refuse to accept the likelihood of these results because of other lines of information. I am just relaying the results of the geneticists. I am not interested in re-litigating prior discussions on this. We’ll probably have a resolution soon enough.

Northern Europeans and Native Americans are not more closely related than previously thought

A new press release is circulating on the paper which I blogged a few months ago, Ancient Admixture in Human History. Unlike the paper, the title of the press release is misleading, and unfortunately I notice that people are circulating it, and probably misunderstanding what is going on. Here’s the title and first paragraph:

Native Americans and Northern Europeans More Closely Related Than Previously Thought

Released: 11/30/2012 2:00 PM EST
Source: Genetics Society of America

Newswise — BETHESDA, MD – November 30, 2012 — Using genetic analyses, scientists have discovered that Northern European populations—including British, Scandinavians, French, and some Eastern Europeans—descend from a mixture of two very different ancestral populations, and one of these populations is related to Native Americans. This discovery helps fill gaps in scientific understanding of both Native American and Northern European ancestry, while providing an explanation for some genetic similarities among what would otherwise seem to be very divergent groups. This research was published in the November 2012 issue of the Genetics Society of America’s journal GENETICS

 

The reality is ta Native Americans and Northern Europeans are not more “closely related” genetically than they were before this paper. There has been no great change to standard genetic distance measures or phylogeographic understanding of human genetic variation. A measure of relatedness is to a great extent a summary of historical and genealogical processes, and as such it collapses a great deal of disparate elements together into one description. What the paper in Genetics outlined was the excavation of specific historically contingent processes which result in the summaries of relatedness which we are presented with, whether they be principal component analysis, Fst, or model-based clustering.

What I’m getting at can be easily illustrated by a concrete example. To the left is a 23andMe chromosome 1 “ancestry painting” of two individuals. On the left is me, and the right is a friend. The orange represents “Asian ancestry,” and the blue represents “European” ancestry. We are both ~50% of both ancestral components. This is a correct summary of our ancestry, as far as it goes. But you need some more information. My friend has a Chinese father and a European mother. In contrast, I am South Asian, and the end product of an ancient admixture event. You can’t tell that from a simple recitation of ancestral quanta. But it is clear when you look at the distribution of ancestry on the chromosomes. My components have been mixed and matched by recombination, because there have been many generations between the original admixture and myself. In contrast, my friend has not had any recombination events between his ancestral components, because he is the first generation of that combination.

So what the paper publicized in the press release does is present methods to reconstruct exactly how patterns of relatedness came to be, rather than reiterating well understood patterns of relatedness. With the rise of whole-genome sequencing and more powerful computational resources to reconstruct genealogies we’ll be seeing much more of this to come in the future, so it is important that people are not misled as to the details of the implications.

Northern Europeans and Native Americans are not more closely related than previously thought

A new press release is circulating on the paper which I blogged a few months ago, Ancient Admixture in Human History. Unlike the paper, the title of the press release is misleading, and unfortunately I notice that people are circulating it, and probably misunderstanding what is going on. Here’s the title and first paragraph:

Native Americans and Northern Europeans More Closely Related Than Previously Thought

Released: 11/30/2012 2:00 PM EST
Source: Genetics Society of America

Newswise — BETHESDA, MD – November 30, 2012 — Using genetic analyses, scientists have discovered that Northern European populations—including British, Scandinavians, French, and some Eastern Europeans—descend from a mixture of two very different ancestral populations, and one of these populations is related to Native Americans. This discovery helps fill gaps in scientific understanding of both Native American and Northern European ancestry, while providing an explanation for some genetic similarities among what would otherwise seem to be very divergent groups. This research was published in the November 2012 issue of the Genetics Society of America’s journal GENETICS

 

The reality is ta Native Americans and Northern Europeans are not more “closely related” genetically than they were before this paper. There has been no great change to standard genetic distance measures or phylogeographic understanding of human genetic variation. A measure of relatedness is to a great extent a summary of historical and genealogical processes, and as such it collapses a great deal of disparate elements together into one description. What the paper in Genetics outlined was the excavation of specific historically contingent processes which result in the summaries of relatedness which we are presented with, whether they be principal component analysis, Fst, or model-based clustering.

What I’m getting at can be easily illustrated by a concrete example. To the left is a 23andMe chromosome 1 “ancestry painting” of two individuals. On the left is me, and the right is a friend. The orange represents “Asian ancestry,” and the blue represents “European” ancestry. We are both ~50% of both ancestral components. This is a correct summary of our ancestry, as far as it goes. But you need some more information. My friend has a Chinese father and a European mother. In contrast, I am South Asian, and the end product of an ancient admixture event. You can’t tell that from a simple recitation of ancestral quanta. But it is clear when you look at the distribution of ancestry on the chromosomes. My components have been mixed and matched by recombination, because there have been many generations between the original admixture and myself. In contrast, my friend has not had any recombination events between his ancestral components, because he is the first generation of that combination.

So what the paper publicized in the press release does is present methods to reconstruct exactly how patterns of relatedness came to be, rather than reiterating well understood patterns of relatedness. With the rise of whole-genome sequencing and more powerful computational resources to reconstruct genealogies we’ll be seeing much more of this to come in the future, so it is important that people are not misled as to the details of the implications.

November 11, 2012

The Genographic Project’s Scientific Grants Program

While I was at Spencer Wells’ poster at ASHG I was primarily curious about bar plots. He’s got really good spatial coverage, so I’m moderately excited about the paper (though I didn’t see much explicit testing of phylogenetic hypotheses, which I think this sort of paper has to do now; we’re beyond PCA and bar plots only papers). That being said, Spencer was more interested in me promoting the Scientific Grants Program. Here’s some more information:

The Genographic Project’s Scientific Grants Program awards grants on a rolling basis for projects that focus on studying the history of the human species utilizing innovative anthropological genetic tools. The variety of projects supported by the scientific grants will aim to construct our ancient migratory and demographic history while developing a better understanding of the phylogeographic structure of world populations. Sample research topics could include subjects like the origin and spread of the Indo-European languages, genetic insights into Papua New Guinea’s high linguistic diversity, the number and routes of migrations out of Africa, the origin of the Inca, or the genetic impact of the spread of maize agriculture in the Americas.

Recipients will typically be population geneticists, students, linguists, and other researchers or scientists interested ...

Older Posts »

Powered by WordPress