Razib Khan One-stop-shopping for all of my content

October 16, 2011

Harappa Ancestry Project at 10 months

Filed under: Genome Blogging,Personal genomics — Razib Khan @ 12:36 pm

It’s been 10 months since Zack Ajmal first contacted me about the possibility of the Harappa Ancestry Project. I was of two minds. On the one hand I did think there was a major problem with undersampling some regions of South Asia. But, it seemed that the 1000 Genomes would fix that soon enough. As it turns out the 1000 Genomes has been a bit slower than I had anticipated (and I assume that the nixing of the Indian samples was a matter of politics not science). So I’m glad Zack started the project when he did.

At this point he’s hit the zone of diminishing marginal returns when it comes to participants. Looking through his samples he has a little over 100 non-founders of unadmixed South Asian ancestry (I’m not a founder because both my parents are in the database). I decided to prune the individuals down to this selection, and tack on a lot of his reference populations, with a bias toward South Asians, and see what I could find. I used his K = 11 ADMIXTURE run, since this seems maximally informative for South Asians. You can find the file here.

One interesting aspect of Zack’s project is that he began to collect Y and mtDNA haplogroups at a certain point. Not too surprisingly, there was a preponderance of R1a1a. For many years now this paternal marker has been suggested to have some association with Indo-Iranians, though more recently researchers have suggested that in fact it’s a very old haplogroup sharply differentiated between a European branch and a South Asian one. Zack has 56 individuals with Y and mtDNA information in his database. These have to be males. He has 14 individuals with mtDNA information and no Y information. These are probably females (obviously there could be males who are only entering their mtDNA information, but this seems unlikely given that most of the results come from 23andMe). 27 of the males are R1a1a. 29 are not. The mean “Onge” proportion of those with R1a1a is 24%. Without? 24%. The respective values for “South Asian” are 56 and 55 percent. In this likely skewed sample R1a1a doesn’t seem to predict the ancestral variation much.

How about we look at mtDNA. Haplogroup M is localized to South Asia. Dividing the population into M and not M you get the following values:

Not M, South Asian = 55%
Not M, Onge = 23%
M, South Asian = 56%
M, Onge = 23%
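The haplogroup comparisons above (R1a1a vs. not, M vs. not M) boil down to splitting the samples on one label and averaging an ADMIXTURE component within each group. A minimal sketch of that step follows. The field names and the four rows are made-up toy values for illustration only, not Zack's data, and `mean_by_group` is a hypothetical helper rather than anything from the HAP pipeline.

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(records, group_key, value_key):
    """Average `value_key` within each distinct value of `group_key`."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec[value_key])
    return {label: mean(values) for label, values in groups.items()}


# Toy rows (NOT real HAP participants); components are in percent.
toy_samples = [
    {"id": "toy1", "r1a1a": True, "onge": 20.0, "s_asian": 58.0},
    {"id": "toy2", "r1a1a": True, "onge": 28.0, "s_asian": 54.0},
    {"id": "toy3", "r1a1a": False, "onge": 22.0, "s_asian": 57.0},
    {"id": "toy4", "r1a1a": False, "onge": 26.0, "s_asian": 53.0},
]

onge_means = mean_by_group(toy_samples, "r1a1a", "onge")
s_asian_means = mean_by_group(toy_samples, "r1a1a", "s_asian")

for carrier in (True, False):
    print(f"R1a1a={carrier}: Onge {onge_means[carrier]:.0f}%, "
          f"S.Asian {s_asian_means[carrier]:.0f}%")
```

Swapping the `r1a1a` flag for an mtDNA label such as M / not M gives the second comparison. With groups this small, a difference of a point or two in the means would not be meaningful either way.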

There doesn’t seem to be that much in uniparental markers, which aligns with my intuition. At least at this scale of analysis. So let’s look at the autosomal genome. The total genetic variation. If you’ve been following HAP the following won’t be news, but for those who haven’t, I thought I’d generate some plots.

The two-way admixture aspect of South Asian populations is evident in the HAP data. “Onge” refers to an element affinal to those of Andaman Islanders. “S.Asian” seems to be some sort of compound, but with strong West Eurasian affinities. The axis is NW-SE, upper caste to lower caste, just as you’d expect.

There are two West Eurasian components which aren’t collapsed into “S.Asian”: “SW.Asian” and “European.” The names are rather self-evident. The interesting thing here is that “SW.Asian” tends to be elevated among South Indians, especially non-Brahmin upper castes. In contrast, there is far less “SW.Asian” amongst Northeast Indians, and proportionally more “European.” This is more evident when you look at populations in the reference set.

There are also some interesting caste/region patterns.

When you remove region from consideration it is interesting that Brahmins are somewhat “central” among South Asian populations.

In contrast, Punjabis are where geography would lead you to expect. That’s one reason it was somewhat problematic that the HGDP had only Pakistani groups for South Asians. They’re not too representative of South Asians.

Differences along the axis of caste become more clear when you correct for region, at least mostly.

Punjab is somewhat atypical here. I am now much more willing to credit migrations within the last 2,000 years accounting for the distinctiveness of groups like Jatts.

On a somewhat less exciting note, it looks like a lot of the genome blogging projects are losing steam. I’m pretty busy right now, so I haven’t been able to maintain AAP, though we’ll have another Merina soon. But I suspect it goes to show just how important collection of new data is to these endeavors. There’s only so much juice you can get out of the same data set. Right now we depend on research groups and the 1000 Genomes, as well as enthusiasts. At some point in the near future the genotypes won’t be the limiting factor. I think then you’ll see a renaissance of amateur ancestral genomics.

September 26, 2011

Dodecad Ancestry Project is at ~10,000

Filed under: Genome Blogging,Genomics,Personal genomics — Razib Khan @ 10:35 pm

A few days ago I noticed that the Dodecad Ancestry Project had nearly 10,000 individuals! ~500 are participants in the project (like myself, I’m DOD075). But most of the individuals were derived from public or shared data sets. You can see them in the Google spreadsheet with all the results. It’s quite an accomplishment, and I commend Dienekes for it. I also have to enter into the record that Dodecad prompted my own forays into genome blogging, and Dienekes also helped Zack with pointers for Harappa in the early days.

July 27, 2011

DIY admixture analysis

Dienekes Pontikos has just released DIY Dodecad, a DIY admixture analysis program. You can download the files yourself. It runs on both Linux and Windows. Since I already have tools in Linux I decided to try out the Windows version, and it seems to work fine. It is somewhat limited in that you start out with the parameters which Dienekes has set for you, but if you don’t want to write your own scripts and get familiar with all the scientific programs out there, I think this is a very good option. Additionally, it seems to run rather fast, so you won’t spend days experimenting with different parameters.

Dienekes has already run me, but I put my parents’ genotype files through the system. Here are the results:

Population          Razib   Mother   Father
East_European         6.9      6.5      4.3
West_European         1.7      3.1      5.5
Mediterranean         6.3      5.6      5.9
Neo_African           0        0        0
West_Asian            0        2        3.9
South_Asian          65.9     59.6     60.4
Northeast_Asian       2.9      3.8      3.6
Southeast_Asian      15.8     16.6     15.5
East_African          0        0        0.2
Southwest_Asian       0.5      2.5      0.7
Northwest_African     0        0        0
Palaeo_African        0        0.3      0

The main thing to notice is that my mother has more total East Asian ancestry than my father, and that she has a Southwest Asian component of a few percent. These are consistent findings across the dozens of ADMIXTURE runs I’ve done with various parameter settings and reference population mixes, so it’s nice that DIY Dodecad replicates those findings. Though the population sets seem a bit Eurocentric to ...
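To make the “total East Asian” comparison concrete, here is a small worked sum over the DIY Dodecad percentages reported above. Treating Northeast_Asian plus Southeast_Asian as the East Asian total is an assumption about what is meant, and the variable names are just illustrative.

```python
# Percentages copied from the DIY Dodecad results in the post.
results = {
    "Razib":  {"Northeast_Asian": 2.9, "Southeast_Asian": 15.8, "Southwest_Asian": 0.5},
    "Mother": {"Northeast_Asian": 3.8, "Southeast_Asian": 16.6, "Southwest_Asian": 2.5},
    "Father": {"Northeast_Asian": 3.6, "Southeast_Asian": 15.5, "Southwest_Asian": 0.7},
}

# Assumption: "total East Asian" = Northeast_Asian + Southeast_Asian.
east_asian_total = {
    person: round(comp["Northeast_Asian"] + comp["Southeast_Asian"], 1)
    for person, comp in results.items()
}

for person, total in east_asian_total.items():
    sw = results[person]["Southwest_Asian"]
    print(f"{person}: East Asian total {total}%, Southwest Asian {sw}%")
```

On these numbers the totals come to 20.4% for my mother against 19.1% for my father, a gap of a bit over one point, so the replication across many runs matters more than any single run.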

June 27, 2011

The impact of genetic ancestry testing

Filed under: Genome Blogging,Genomics — Razib Khan @ 12:07 am

Attitudes on DNA ancestry tests:

The DNA ancestry testing industry is more than a decade old, yet details about it remain a mystery: there remain no reliable, empirical data on the number, motivations, and attitudes of customers to date, the number of products available and their characteristics, or the industry customs and standard practices that have emerged in the absence of specific governmental regulations. Here, we provide preliminary data collected in 2009 through indirect and direct participant observation, namely blog post analysis, generalized survey analysis, and targeted survey analysis. The attitudes include the first available data on attitudes of those of individuals who have and have not had their own DNA ancestry tested as well as individuals who are members of DNA ancestry-related social networking groups. In a new and fluid landscape, the results highlight the need for empirical data to guide policy discussions and should be interpreted collectively as an invitation for additional investigation of (1) the opinions of individuals purchasing these tests, individuals obtaining these tests through research participation, and individuals not obtaining these tests; (2) the psychosocial and behavioral reactions of individuals obtaining their DNA ancestry information with attention given both to expectations prior to testing and ...

June 9, 2011

A genomic sketch of the Horn of Africa

Iman, a Somali model

Since I started up the African Ancestry Project one of the primary sources of interest has been from individuals whose families hail from Northeast Africa. More specifically, the Horn of Africa: Ethiopia, Eritrea, and Somalia. The problem seems to be that 23andMe’s “ancestry painting” algorithm uses West African Yoruba as a reference population, and East Africans are often not well modeled as derivative of West Africans. So, for example, the Nubian individual who I’ve analyzed supposedly comes up to be well over 50% “European” in ancestry painting. Then again, I’m 55-60% “European” as well according to that method! So we shouldn’t take these judgments to heart too much. Obviously something was off, and thanks to Genome Bloggers like Dienekes Pontikos we know what the problem was: the populations of the Horn of Africa have almost no distinctive “Bantu” element to connect them with West Africans like the Yoruba. Additionally, a closer inspection shows that the “Eurasian” component present in these populations is very specific as well, almost totally derived from Arabian-like sources. When breaking apart the West Eurasian populations it is no surprise ...

May 21, 2011

A map of genome blogging participants

Filed under: Genome Blogging,Genomics,Personal genomics — Razib Khan @ 11:02 am

Both Eurogenes and Harappa now have map interfaces where you can drop in your place of origin if you’re a participant. If you have submitted your data you should add your information in. We’re at a point where data is relatively plentiful, at least before the tsunami of whole genomes, so visualization and representation are of the essence.

Here’s HAP:

April 26, 2011

Two new genome bloggers

Filed under: Genome Blogging,Genomics,Personal genomics — Razib Khan @ 9:27 am

Zack pointed me to two new ones, Fennoscandia Biographic Project, and Magnus Ducatus Lituaniae Project – BGA analysis project for the territories of former Grand Duchy of Lithuania. So I guess the circum-Baltic region is getting some thick coverage. The latter is also releasing some format conversion tools which seem to work in Windows, if you want to play with the analytic software yourself.

April 19, 2011

Jared Diamond was right!

Filed under: Genome Blogging,Genomics — Razib Khan @ 1:26 am

At least about some things. In Guns, Germs, and Steel he argued that latitudinal diffusion of agricultural toolkits was much easier than longitudinal diffusion. This seems right, but I suspect that one thing Diamond did not emphasize enough, in hindsight, is that demographic diffusion and replacement can follow a similar pattern. I am probably not a “Neolithic population replacement” maximalist to the extent of someone like “Diogenes” or Peter Bellwood, but that is probably mostly a matter of my modest confidence about all of these sorts of issues. But, after running many trials of ADMIXTURE, along with perusing the results generated by Dienekes, David, and Zack, I am more confident in the position that agriculture and agriculture-bearing populations tend to initially follow paths of least ecological resistance. The distance between Lisbon and Damascus is roughly 4,000 kilometers, while between Helsinki and Damascus it is roughly 3,000 kilometers, but Lisbon has been much more affected by the migrations from the Middle East than Helsinki. The facilitation of water transportation as well as ecological similarities between Lisbon and Damascus, at least in relation to Helsinki, explains this phenomenon.
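For anyone who wants to check the Lisbon/Helsinki/Damascus distances, a quick great-circle calculation is enough. This is only a sketch: it uses the haversine formula on a spherical Earth, and the city coordinates are approximate values supplied here as assumptions, not taken from any of the data sets.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Approximate city-center coordinates (assumed values, degrees N / degrees E).
cities = {
    "Lisbon": (38.72, -9.14),
    "Helsinki": (60.17, 24.94),
    "Damascus": (33.51, 36.29),
}

lisbon_damascus = haversine_km(*cities["Lisbon"], *cities["Damascus"])
helsinki_damascus = haversine_km(*cities["Helsinki"], *cities["Damascus"])

print(f"Lisbon-Damascus:   {lisbon_damascus:,.0f} km")
print(f"Helsinki-Damascus: {helsinki_damascus:,.0f} km")
```

The point of the comparison survives the approximation: Helsinki is the closer city as the crow flies, by something like a quarter of the distance, yet it shows less Middle Eastern influence.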

To illustrate this issue more broadly, let’s look ...

April 16, 2011

The value of “open genomics”

Filed under: Genome Blogging,Genomics,Personal genomics — Razib Khan @ 6:49 pm

Zack Ajmal has been methodically working his way through issues in the public genomic data sets. Often it just involves noting duplicate samples across data sets, which need to be accounted for. But sometimes there seem to be problems within the uploaded data sets, for example relatively closely related individuals. Today he highlights an issue which early on was noticeable in the Behar et al. data set:

Behar as in the Behar et al paper/dataset and not the Indian state of Bihar. The Behar dataset contains 4 samples of Paniya, which apparently is a Dravidian language of some Scheduled Tribes in Kerala.

I had always been suspicious of those four samples since one of them had admixture proportions similar to other South Indians but the other three were like Southeast Asians.

Since the Austroasiatic Paniya samples originated from Behar et al, I guess at some point before the Behar data being submitted to the GEO database the Paniyas got mislabeled.

I pulled down the Behar et al. data set too, and the Paniya just look weird enough that I just avoided them. Ideally this sort of stuff should be caught, but errors happen. Best to get as many eyeballs looking over everything.
