Razib Khan One-stop-shopping for all of my content

June 7, 2011

Against social constructionism

Filed under: Culture,Genetics,Harappa Ancestry Project — Razib Khan @ 6:12 pm

My first post over at HAP. The theses of some scholars who argue for the social construction of caste have obviously gone too far, but I do want to add that I suspect there’s a lot correct about specific details. For example, it seems possible that the class “Kayastha” broadly refers to groups which were uplifted from the non-Brahmin substrate across the Indo-Gangetic plain relatively recently. We will know soon, as the 1000 Genomes will release a lot of Bengali Kayastha results, and we can compare them to Santhals and other groups.

March 29, 2011

The limits of computational power – shades of 1982

Filed under: Genetics,Genomics,Harappa Ancestry Project,Personal genomics — Razib Khan @ 12:44 pm

Two Steps Forward, Two Steps Back:

I got my daughter a netbook, so now my computer is doing Harappa Project work 24×7.

Also, Simranjit was nice enough to offer me the use of a server. For privacy reasons, I am not going to upload any of the participants’ data there but it is much faster than my machine and hence very useful for running Admixture on the reference data (especially with crossvalidation).

As for steps back, I downloaded the current 1000genomes data (1,212 samples, 2.4 million SNPs). It’s in vcf format. Using vcftools to convert it to ped format will take about 3 weeks. Yes, you heard that right. BTW, the good stuff from a South Asian point of view will come later this year with 100 Assamese Ahom, 100 Kayastha from Calcutta, 100 Reddys from Hyderabad, 100 Maratha from Bombay and 100 Lahori Punjabis.

Also, I spent most of Sunday evening and night in the ER and got a diagnosis of ureterolithiasis for my efforts. All I can say is: Three cheers for Percocet!!

First, wish Zack well. Second, he has over 70 individuals in the Harappa Ancestry Project data base (in addition to the public data sets). If ...

March 24, 2011

The genetic world in 3-D

Filed under: Anthroplogy,Genetics,Genomics,Harappa Ancestry Project,PCA — Razib Khan @ 5:44 pm

When Zack first mooted the idea of the Harappa Ancestry Project I had no idea what was coming down the pipe. I wonder if his daughter and wife are curious as to what’s happened to their computer! Since collecting the first wave of participants he’s been a result-generating machine. Today he produced a fascinating three-dimensional PCA (modifying Doug McDonald’s Javascript) using his “Reference 1” data set. He rescaled the dimensions appropriately so that they reflect how much of the genetic variance they explain. The largest principal component of variance is naturally Africa vs. non-Africa, the second is west to east in Eurasia, and the third is a north to south Eurasian axis.
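For anyone wondering what that rescaling amounts to, here is a minimal sketch. It is not Zack’s actual code. It assumes you already have PCA coordinates where every axis has been normalized to the same spread, plus the eigenvalue for each axis, and it stretches each axis by the square root of its eigenvalue relative to the first. The function names and the toy numbers are mine, purely for illustration.

```python
import math

def rescale_pcs(coords, eigenvalues):
    """Scale each PC axis by sqrt(eigenvalue), relative to the first axis.

    coords      -- list of rows, one per individual, one value per PC
                   (assumed here to be normalized to equal spread per axis)
    eigenvalues -- one eigenvalue per PC, largest first
    """
    weights = [math.sqrt(ev) for ev in eigenvalues]
    top = weights[0]
    return [[x * w / top for x, w in zip(row, weights)] for row in coords]

def variance_fractions(eigenvalues):
    """Share of variance per axis, among the axes supplied (not all PCs)."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

# Toy example: three axes with eigenvalues 4, 1 and 0.25.
scaled = rescale_pcs([[1.0, 1.0, 1.0]], [4.0, 1.0, 0.25])
```

With the axes scaled this way, a gap along the third dimension looks as small on screen as it is in the data, instead of being inflated to fill the cube.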

I decided to be a thief and take Zack’s Javascript and resize it a bit to fit the width of my blog, blow up the font size, as well as change the background color and aspects of positioning. All to suit my perverse taste. You see the classic “L” shaped distribution familiar from the two-dimensional plots, but observe the “pucker” in the third dimension of South Asian, and to a lesser extent Southeast Asian, populations.

The topology of the first three independent dimensions of ...

March 14, 2011

Analyzing ancestry with ADMIXTURE, step by step

Over the past few months I was hoping more people would start doing what Zack Ajmal, Dienekes, and David have been doing. There are public data sets, and open source software, so that anyone with nerdy inclination can explore their own questions out of curiosity. That way you can see the power and the limitations of genomics on your own desktop. I wonder if one of the biggest reasons that more people haven’t started doing this is formatting. It can be a pain to convert matrix formatted files into pedigree format, for example. But the data gusher isn’t ending, look at what’s coming out (and has come out) in the 1000 Genomes project!
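To give a flavor of the formatting chore, here is a rough sketch of a matrix-to-pedigree conversion. The input layout is an assumption on my part (one row per SNP, one column per individual, two-letter genotype calls, with “--” for a missing call); real data sets differ, so check yours. The output follows the PLINK .ped convention of six leading columns (family ID, individual ID, father, mother, sex, phenotype) followed by two alleles per SNP, with 0 for missing. The function name is hypothetical.

```python
def matrix_to_ped(snp_ids, sample_ids, genotypes, missing="--"):
    """Turn a SNP-by-sample genotype matrix into PLINK-style .ped lines.

    genotypes[i][j] is the two-letter call for SNP i in sample j
    (an assumed layout; adjust for your own source file).
    """
    lines = []
    for j, sample in enumerate(sample_ids):
        # Family ID = individual ID; parents, sex and phenotype unknown.
        fields = [sample, sample, "0", "0", "0", "-9"]
        for i in range(len(snp_ids)):
            call = genotypes[i][j]
            if call == missing or len(call) != 2:
                fields += ["0", "0"]  # PLINK's missing-allele code
            else:
                fields += [call[0], call[1]]
        lines.append(" ".join(fields))
    return lines

# Two SNPs, two individuals; the second individual lacks a call at rs1.
ped = matrix_to_ped(["rs1", "rs2"], ["A", "B"], [["AG", "--"], ["CC", "CT"]])
```

A matching .map file (one line per SNP with chromosome and position) is also needed before PLINK or ADMIXTURE will read the result; that part is left out here.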

I’ve been thinking I need to write up a post which is a “soft landing” for people so that we can reduce the “activation energy” for this sort of thing…once you get hooked, you only go deeper. Luckily an anonymous tipster has sent me the link to a URL with a huge data set which has been merged, already pedigree formatted. Here are the populations:

!Kung Buryats Hausa Mada Punjabi Arain Totonac Adygei Cambodian Hazara Makrani Pygmy Tu African Americans Chinese Hema Malayan Romanians Tujia Algeria Chinese Americans Hezhen Mandenka Russian Tunisia Altaians Chukchis Hungarians Maya Sahara Occ Turks Alur Chuvashs Iban Mbuti Sakilli Tuscans Ap Brahmin Cochin Jews Igbo Melanesian Samaritians Tuvinians Ap Madiga Colombian Iranian Jews Mexicans Samoan Urkarah Ap Mala Cypriots Iranians Miao San Utahn Whites Armenians Dai Iraq Jews Mongola San Nb Uygur Armenians B Daur Irula Mongolians Sandawe Uzbekistan Jews Ashkenazy Jews Dogon Italian Moroccans Sardinian Uzbeks Azerbaijan Jews Dolgans Japanese Morocco Jews Saudis Vietnamese Balochi Druze Jordanians Morocco N Selkups Greenlanders Bambaran Greenlanders Kaba Morocco ...

March 12, 2011

Harappa Ancestry Project @ N ~ 50

Zack Ajmal now has over 50 participants in the Harappa Ancestry Project. This does not include the Pakistani populations in the HGDP, the HapMap Gujaratis, or the Indians from the SGVP. Nevertheless, all these samples still barely cover the vast heart of South Asia, the Indo-Gangetic plain. Here is the provenance of the submitted samples Zack has so far:

- Punjab: 7
- Iran: 7
- Tamil: 6
- Bengal: 5
- Andhra Pradesh: 2
- Bihar: 2
- Karnataka: 2
- Caribbean Indian: 2
- Kashmir: 2
- Uttar Pradesh: 2
- Sri Lankan: 2
- Kerala: 2
- Iraqi Arab: 2
- Anglo-Indian: 1
- Roma: 1
- Goa: 1
- Rajasthan: 1
- Baloch: 1
- Unknown: 1
- Egyptian/Iraqi Jew: 1
- Maharashtra: 1

Again, note the underrepresentation of two of India’s most populous states, Uttar Pradesh, ~200 million, and Bihar, ~100 million. Nevertheless, there are already some interesting yields from the project. Below I’ve reedited Zack’s static images (though go to his website for something more dynamic) with the labels of individuals. I’ve highlighted myself and my parents with the red pointers.

To the left is a set of plots and tables which I’ve spliced together from Zack’s various posts. What you need to know is that this is at K = 12, and I’ve used the labels that Zack gave the various putative “ancestral populations” which emerged out ...

February 24, 2011

Harappa Ancestry Project update

Filed under: Genetic,Genetics,Genomics,Harappa Ancestry Project — Razib Khan @ 11:23 am

If you haven’t, please check in with what Zack Ajmal is doing. There is still a great deal more to be done in terms of population coverage, but I think we’re getting some sense of the overall picture, even if the error bars are kind of large at this point…

February 23, 2011

Better comprehension through visualization

Filed under: Genetics,Genomics,Harappa Ancestry Project — Razib Khan @ 11:56 pm

Zack has started to improve on static R plots with Google powered charts. Check it out. Alas, I can’t inject script tags into the body of my posts, so that’s not feasible for me. Notice on Zack’s plot that I’m more East Asian than either of my parents. The tendency first cropped up with 23andMe’s ancestry painting, and I have seen it in my own ADMIXTURE runs, so I don’t dismiss it as V2 vs. V3 chip anymore. Though I’ve ordered an upgrade myself, so we’ll see for sure. Also, though both my parents have about the same East Asian fraction, they exhibit a different balance of East Asian subcomponents. I’ve seen this in my own ADMIXTURE runs, and I’m going to check for more fine-grained matches with the HGDP East Asian populations soon to ascertain whether their eastern ancestral mix is different. Good times.

February 10, 2011

Personal genomics around the web

Filed under: Genetics,Genomics,Harappa Ancestry Project — Razib Khan @ 11:03 am

Just some pointers. Dr. Daniel MacArthur has put up a guest post where I outline my own experience with personal genomics. Cool times that we live in. Also, Zack Ajmal has started posting higher K’s of HAP participants. He’s now in the second batch. My parents will be in the third. Lots of Tamils and Punjabis. The Khans are the only Bengalis so far. One individual to represent all of Uttar Pradesh. Here’s a list of participants so far.

Finally, I know 3-D visualization is bad form, but I went for it anyway. Below is a cube which shows the positions of Gujaratis, Chinese, Mexican Americans, and Utah whites and Tuscans from the HapMap, along with a few extra samples from friends and family. Can you tell where my parents are?


February 8, 2011

Harappa Ancestry Project, K = 9, HRP001 to HRP010

Filed under: HAP,Harappa Ancestry Project — Razib Khan @ 9:41 pm

Zack keeps cranking away on the Harappa Ancestry Project. If you want to know what the bar plot below (or the title) means, see Zack’s post. He still needs people from the “cow belt.” Actually, he needs more people, period. So if you’re brown, Tibetan, or Burmese, get in touch!

Also, a PCA for you:


Dodecad open for submissions

Since I know plenty of friends are getting, or just got, their V3 results, I thought I’d pass this on, Open-ended submission opportunity for 23andMe data (#2):

Who is eligible

Everyone who is of European, Asian, or North African ancestry and all four of his/her grandparents are from the same European, Asian, or North African ethnic group or the same European, Asian, or North African country.

Also, Zack has more than 30 individuals in HAP. The “cow belt” is still way underrepresented. The only Bengalis in the data set are my parents.

February 2, 2011

Harappa Ancestry Project, first batch of results

Filed under: Genetics,Harappa Ancestry Project — Razib Khan @ 12:39 am

Zack’s started posting them. I’m HRP002.

February 1, 2011

A genomic map of human variation, where we’re at

Zack has started exploring the K’s of his merged data set for HAP. A commenter suggests that:

As you have begun interpreting the reference results, let me make a friendly warning: you have to keep in mind that most of the reference populations of ethnic groups are extremely limited in sample size (with only between 2 and 25 individuals) and from very obscure sources, and you should keep away from drawing conclusions about millions of people based on such limited number of individuals.

This seems a rather reasonable caution. But I don’t think such a vague piece of advice really adds any value. These sorts of caveats are contingent upon:

- The scope of the question being asked (i.e., how fine a grain is the variation you are attempting to measure going to be)

- The sample size

- The representativeness

- The thickness of the marker set (10 autosomal markers vs. 500,000 SNPs)
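To put a rough number on the sample size point, here is a small sketch of the textbook binomial standard error for an allele frequency estimated from N diploid individuals. It assumes the 2N chromosomes are sampled independently from the population, which endogamous or related samples will violate, so treat it as a best case. The function name is just for illustration.

```python
import math

def allele_freq_se(p, n_individuals):
    """Standard error of an allele frequency estimate at one SNP.

    p             -- true allele frequency in the population
    n_individuals -- number of diploid individuals sampled (2N chromosomes)
    Assumes independent sampling of chromosomes.
    """
    return math.sqrt(p * (1 - p) / (2 * n_individuals))

# Compare the commenter's two extremes for a SNP at 50% frequency.
se_small = allele_freq_se(0.5, 2)    # 2 individuals
se_large = allele_freq_se(0.5, 25)   # 25 individuals
```

At 2 individuals the error on any single SNP is 0.25, which is enormous, while at 25 it drops to roughly 0.07. The reason dense marker sets still work is that hundreds of thousands of SNPs each contribute a noisy estimate, and the noise averages out across them.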


This isn’t a qualitative issue, easy to divide into “right” and “wrong.” Sometimes an N = 1 is very insightful. That’s why the whole genome of one Bushman was very useful. In fact, the whole genome of any random Sub-Saharan African, and the whole genome of any random non-African ...

January 31, 2011

Harappa Ancestry Project, t-minus one day

Filed under: Harappa Ancestry Project,Personal genomics — Razib Khan @ 12:57 pm

Zack is going to post the first batch of results from HAP tomorrow. It looks like he’s going to be using mostly the merged HGDP, HapMap, SGVP, and Behar data set, supplemented by a second set which also merges the Xing et al. sample (the intersection of Xing et al. with the other results is a much smaller number of SNPs, but it includes better coverage of various South Asian groups). He’ll initially be posting ADMIXTURE estimates as you’ve seen on Dodecad. I’m especially interested in the Anglo-Indian and Roma individuals who have sent Zack their samples. I don’t know of any genomic investigation of the former community, while the published research on Roma genetics doesn’t include SNP-chip results (usually they’re mtDNA, Y, or only a few autosomal markers). I’d be curious for possible evidence of homozygosity or linkage disequilibrium in the Roma individual due to the population bottlenecks which other studies have detected (I assume that’ll be in the future). The Roma are to a good approximation an admixture of Indian, West Asian, and European (often Balkan) groups, but their history of endogamy and ...

January 28, 2011

Harappa Ancestry Project, before the first wave

Zack has been posting his data sources, as well as how he filtered and formatted them, all this week. I assume that the first wave of results will be online soon. As of yesterday, this is what he had (I know he got some more today):

- Punjab 7
- Bengal 1
- Bihar 1
- Tamil 5
- Karnataka 1
- Anglo-Indian 1
- Roma 1
- Iran 3

Whole swaths of north-central India are missing. I am hopeful that more people will join in after the first wave of results is put out there. But, from what I have discussed with Zack it looks plausible that the very first wave will have a richer set of results because of the necessity of preliminary steps. So there’s some benefit in getting in early. It’s really ridiculous to have literally 1 sample representing the 300 million people of Uttar Pradesh and Bihar. That’s 25% of South Asians represented by one person. I’ve gotten a commitment from one friend who was born in U.P. to give his data up once it comes in, but there have to be others out there. (the Bengali N should go up to 2 when I swap my parents ...

January 27, 2011

The “cow belt” is underrepresented in the Harappa Ancestry Project

Filed under: Genetics,Harappa Ancestry Project — Razib Khan @ 12:29 pm

On the Harappa Ancestry Project Zack says:

I have got 17 South Asians and 2 Iranians now. Punjabis and Tamils are well represented.

We definitely need participation from central India, north-central and east India. Uttar Pradesh has 300 million people but no samples so far.

This is a problem with many South Asian genetics surveys: “Aryavarta” is extremely undersampled! I have one friend from U.P. who is waiting on their results, but seriously, it’s ridiculous to have 200 million people represented by one individual!

January 24, 2011

Call for South Asian 23andMe samples!

Filed under: Genetics,Harappa Ancestry Project — Razib Khan @ 12:18 pm

The Harappa Ancestry Project now has its own website. Zack has 10 South Asian samples now, but only Punjabis are represented in multiples (this should make Zach L. happy!). We need more population coverage! If you know someone who is South Asian and has a 23andMe genotyping done, please forward the URL of the Harappa Ancestry Project on. Additionally, point people to the Facebook page. I know the samples are out there! Also remember we’re looking for Iranians, Tibetans, and Burmese, as outgroups.

In other news, Zack has disclosed that he is very inbred. This is one of the reasons we need good community coverage among South Asians: endogamy has probably generated a great deal of genetic distinctiveness.

Addendum: If you care, I am not inbred at all. But that makes sense given my family background.

Harappa Ancestry Project, update

Last week I announced the Harappa Ancestry Project. It now has its own dedicated website, http://www.harappadna.org. Additionally, it has its own Facebook page. For Zack to get his own URL he needs about 10 more “likes,” so please like it! (if you are so disposed) Finally, from what I’ve heard the first wave of the 23andMe holiday sale results are coming online this week. Actually, one of the relatives who I purchased the kit for is in processing currently, so I know that we should have a bunch of new people in the system very, very soon.

Speaking of people, last I heard Zack had gotten about a dozen responses. That’s enough to start an initial round of runs, but obviously he needs more people. More importantly, the goal here is to get better population coverage. One of the things we know intuitively and also from the most current research is the existence of a lot of within-region population variation in South Asia which is structured by community. In other words, a sample of 30 people, where you have 3 from 10 different communities exhibiting geographical and ...

January 17, 2011

Introducing the Harappa Ancestry Project

Filed under: Genetics,Genomics,Harappa Ancestry Project — Razib Khan @ 9:22 am

A few weeks ago I hinted at a South Asian equivalent to Dodecad & Eurogenes BGA. It is now public and in the data collection phase. You can read the whole thing here:

http://www.zackvision.com/weblog/2011/01/harappa-ancestry-project

This is the feed:

http://www.zackvision.com/feed/

If your ancestry is from these nations:

Afghanistan Bangladesh Bhutan Burma India Iran Maldives Nepal Pakistan Sri Lanka Tibet

Read on! If not, “for entertainment purposes only”….


I have been griping in public and in private about the “reference” populations used for South Asian genomics for years. Because of the Permit Raj the HGDP had to use Pakistani populations. Additionally, because of the HGDP’s mandate to focus on smaller groups which might harbor genetic uniqueness you have some very obscure tribes, but only one sample set from an Indo-Aryan speaking population. And even there, it was a minority, not the Punjabi speaking majority of Pakistan.

Some of this has changed in recent years. Papers such as Reconstructing Indian History and Genetic diversity in India and the inference of Eurasian population expansion have added more populations to the mix. The current phase of the HapMap has Gujaratis from Houston. But there is always a problem when you take a small population set to be representative of a broader group. There are ~1.3 ...
