Razib Khan One-stop-shopping for all of my content

September 27, 2012

A zoom in on Western Eurasia

Filed under: PCA — Razib Khan @ 1:00 am

September 25, 2012

Re-imagining genetic variation

Filed under: Genetics,Genomics,PCA — Razib Khan @ 11:39 pm

To the left is a PCA from The History and Geography of Human Genes. If you click it you will see a two dimensional plot with population labels. How were these plots generated? In short, they are visual representations of a matrix of genetic distances (those distances generally being FST), which L. L. Cavalli-Sforza and colleagues computed from classical autosomal markers. Basically, the distances measure how genetically different populations are from one another. The unwieldy matrix tables can be visualized as a neighbor-joining tree, or as a two dimensional plot as you see here. But that’s not the end of the story.
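As a rough illustration of how a distance matrix becomes a two dimensional plot, here is a minimal sketch of classical multi-dimensional scaling in Python with NumPy. The population names and distance values are made up for the example and are not taken from the book, and the function name `classical_mds` is just a label chosen here. Treating FST values as ordinary distances is a simplifying assumption.

```python
import numpy as np

def classical_mds(dist, n_dims=2):
    """Embed a symmetric distance matrix in n_dims dimensions (classical MDS)."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances to obtain an inner-product matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # Eigen-decompose and keep the axes with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    top_vals = np.clip(eigvals[order], 0.0, None)  # guard against small negatives
    return eigvecs[:, order] * np.sqrt(top_vals)

# Hypothetical pairwise distances between four invented populations.
labels = ["PopA", "PopB", "PopC", "PopD"]
toy_dist = np.array([
    [0.00, 0.02, 0.15, 0.17],
    [0.02, 0.00, 0.14, 0.16],
    [0.15, 0.14, 0.00, 0.03],
    [0.17, 0.16, 0.03, 0.00],
])

coords = classical_mds(toy_dist)
for name, (x, y) in zip(labels, coords):
    print(f"{name}: {x:+.3f} {y:+.3f}")
```

Populations that are close in the matrix should land close together in the plot. The signs of the axes are arbitrary, so a mirrored plot carries the same information.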

In the past ten years, with high density SNP-chip arrays, these plots can often illustrate the position of an individual instead of just representing the relationships of populations (the methods differ, from principal components analysis or principal coordinates analysis to multi-dimensional scaling, but the outcomes are the same).


For example, the famous genetic map of Europe. Here you see the colors representing nationalities, and centroid positions of the populations as well as individuals. In this ...

March 24, 2011

The genetic world in 3-D

Filed under: Anthropology,Genetics,Genomics,Harappa Ancestry Project,PCA — Razib Khan @ 5:44 pm

When Zack first mooted the idea of the Harappa Ancestry Project I had no idea what was coming down the pipe. I wonder if his daughter and wife are curious as to what’s happened to their computer! Since collecting the first wave of participants he’s been a result-generating machine. Today he produced a fascinating three dimensional PCA (modifying Doug McDonald’s Javascript) using his “Reference 1” data set. He rescaled the dimensions appropriately so that they reflect how much of the genetic variance they explain. The largest principal component of variance is naturally Africa vs. non-Africa, the second is west to east in Eurasia, and the third is a north to south Eurasian axis.
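One plausible way to do that kind of rescaling, sketched in Python with NumPy: if each component comes as a unit-length column, multiplying it by the square root of its eigenvalue makes the spread along each axis reflect the variance it explains. This is an assumption about the general technique, not a description of what Zack's script actually does. The eigenvalues and scores below are invented, and the function names are chosen only for this example.

```python
import numpy as np

def rescale_components(unit_scores, eigenvalues):
    """Scale unit-length PC columns so each axis's spread reflects its eigenvalue."""
    unit_scores = np.asarray(unit_scores, dtype=float)
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    # Broadcasting multiplies column k by sqrt(eigenvalues[k]).
    return unit_scores * np.sqrt(eigenvalues)

def variance_share(eigenvalues):
    """Fraction of variance per component, relative to the components listed."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return eigenvalues / eigenvalues.sum()

# Hypothetical inputs: 50 individuals, 3 components, made-up eigenvalues.
rng = np.random.default_rng(0)
raw = rng.normal(size=(50, 3))
unit_scores = raw / np.linalg.norm(raw, axis=0)  # each column has length 1
eigenvalues = np.array([9.0, 4.0, 1.0])

scaled = rescale_components(unit_scores, eigenvalues)
print(variance_share(eigenvalues))
print(np.linalg.norm(scaled, axis=0))  # column lengths are now 3, 2 and 1
```

Without this step all three axes would be drawn at the same length, which visually overstates the minor components.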

I decided to be a thief and take Zack’s Javascript and resize it a bit to fit the width of my blog, blow up the font size, as well as change the background color and aspects of positioning. All to suit my perverse taste. You see the classic “L” shaped distribution familiar from the two-dimensional plots, but observe the “pucker” in the third dimension of South Asian, and to a lesser extent Southeast Asian, populations.

The topology of the first three independent dimensions of ...

February 11, 2011


Filed under: Genetics,Genomics,PCA — Razib Khan @ 1:50 am

Long time readers know that I have a fixation on people not taking PCA too literally as something concrete. Tonight I finally merged the HGDP data set with some of the HapMap ones I’ve been playing with, and tacked my parents onto the sample. I took the ~50 HGDP populations, added the Tuscans, the two Kenyan groups, and the Gujaratis, and merged them. I thinned the marker set to 105,000 SNPs (I had to flip the HGDP strand too). Then I just let Eigensoft do its magic, and two hours later I had produced my own plot. I’m still getting the hang of the labeling issues, but first let’s look at what 23andMe produces (I’m green):

Now let’s see what I outputted:

I suspect that the gap between my parents and the main South Asian cluster is just an artifact of the lack of South and East Indians in the sample. Additionally, things would look different if I removed the Africans, since the first principal component would be freed up. More on that later. All in all, still pretty awesome that ...

January 26, 2011

Visualizing variation, input → output

I have noted a few times that one thing you have to be careful about in two dimensional plots which show genetic variance is that the dimensions onto which the data are projected are often generated from the data itself. So adding more data can change the spatial relationships of previous data points. Additionally, in 23andMe’s global similarity advanced plot you are projected onto the dimensions generated from the HGDP data set. There are some practical reasons for this. First, it’s computationally intensive to recalculate components of variance every time someone is added to the data set. Second, it isn’t as if the ethnic identity of any given individual is validated. What would you do if an alien sent in a kit and spuriously put “French” as their ancestry?
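To make the distinction concrete, here is a minimal Python/NumPy sketch contrasting the two approaches: fitting the axes on a reference panel and projecting a newcomer onto them, versus refitting the axes with the newcomer included. The genotypes are simulated toy data, and the function names `fit_axes` and `project` are invented for this example. It is not how 23andMe or any particular package implements projection.

```python
import numpy as np

def fit_axes(reference, n_components=2):
    """Derive principal axes from the reference panel only."""
    mean = reference.mean(axis=0)
    # Rows of vt are the principal axes of the centred reference data.
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(samples, mean, axes):
    """Place samples on already-fitted axes without refitting them."""
    return (np.atleast_2d(samples) - mean) @ axes.T

# Simulated 0/1/2 genotypes: two reference groups with different allele frequencies.
rng = np.random.default_rng(1)
n_snps = 200
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = np.clip(freq_a + rng.normal(0, 0.2, n_snps), 0.05, 0.95)
reference = np.vstack([
    rng.binomial(2, freq_a, size=(30, n_snps)),
    rng.binomial(2, freq_b, size=(30, n_snps)),
]).astype(float)
newcomer = rng.binomial(2, (freq_a + freq_b) / 2, size=n_snps).astype(float)

# Approach 1: fixed axes. The reference coordinates never move.
mean, axes = fit_axes(reference)
ref_fixed = project(reference, mean, axes)
new_pos = project(newcomer, mean, axes)

# Approach 2: refit with the newcomer included. Every coordinate can shift.
mean2, axes2 = fit_axes(np.vstack([reference, newcomer]))
ref_refit = project(reference, mean2, axes2)

print("newcomer on fixed axes:", new_pos.round(2))
print("largest shift in a reference coordinate after refitting:",
      np.abs(ref_fixed - ref_refit).max().round(3))
```

With fixed axes, adding a person costs one matrix multiplication and leaves everyone else where they were. Refitting repeats the decomposition and lets the new sample influence the axes, which is exactly where a mislabeled or unusual sample could distort the picture.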

So, in reply to this comment: “Let me rephrase: is there any difference when you switch to the world-wide plot? I imagine not, or you would’ve mentioned it.” Actually, there is a slight difference. Below on the right you have a “world view,” with my position being marked with green, and on the left a “zoom in” for Central/South Asia in the HGDP data set.

August 23, 2010

Just pushing buttons

Filed under: Analysis,Genetics,PCA,Tools — Razib Khan @ 11:05 pm

Mike the Mad Biologist, whose bailiwick is the domain of the small, asks in the comments:

I don’t mean to bring up a tangential point to the post, but why does the field of human genetics use PCA to visualize relationships? When I see plots like those shown here that have a ‘geometric pattern’ to them (the sharp right angles; another common pattern is a Y-shape), that tells me that there are lots of samples with zeros for many of the Y-variables (i.e., alleles that are unique to certain populations). Thus, the spatial arrangement of the points is largely an artifact of an inappropriate method: how does one calculate a correlation matrix when many of the things one is correlating have values of zero?

If one really was keen on using PCA, one could calculate a pairwise distance matrix and then use that instead of the correlation matrix (Principal Coordinates Analysis).

Since I know some human geneticists do read this weblog, I thought it was worth throwing the question out there.
