Razib Khan One-stop-shopping for all of my content

July 11, 2018

Tutorial to run PCA, Admixture, Treemix and pairwise Fst in one command

Filed under: Admixture,data,Fst,PCA,PLINK,Population genetics,TreeMix — Razib Khan @ 11:50 pm

Today on Twitter I stated that “if the average person knew how to run PCA with plink and visualize with R they wouldn’t need to ask me anything.” What I meant by this is that the average person often asks me “Razib, is population X closer to population Y than Z?” To answer this sort of question I dig through my datasets and run a few exploratory analyses, and get back to them.

I’ve been meaning to write up and distribute a “quickstart” for a while to help people do their own analyses. So here I go.

The audience of this post is probably two-fold:

  1. “Trainees” who are starting graduate school and want to dig in quickly into empirical data sets while they’re really getting a handle on things. This tutorial will probably suffice for a week. You should quickly move on to three population and four population tests, and Eigensoft and AdmixTools. As well fineStructure
  2. The larger audience is technically oriented readers who are not, and never will be, geneticists professionally. 

What do you need? First, you need to be able to work in a Linux or environment. I work both in Ubuntu and on a Mac, but this tutorial and these scripts were tested on Ubuntu. They should work OK on a Mac, but there may need to be some modifications on the bash scripts and such.

Assuming you have a Linux environment, you need to download this zip or tar.xz file. Once you open this file it should decompress a folderancestry/.

There are a bunch of files in there. Some of them are scripts I wrote. Some of them are output files that aren’t cleaned up. Some of them are packages that you’ve heard of. Of the latter:

  • admixture
  • plink
  • treemix

You can find these online too, though these versions should work out of the box on Ubuntu. If you have a Mac, you need the Mac versions. Just replace the Mac versions into the folderancestry/. You may need some libraries installed into Ubuntu too if you recompile yourselfs. Check the errors and make search engines your friends.

You will need to install R (or R Studio). If you are running Mac or Ubuntu on the command line you know how to get R. If not, Google it.

I also put some data in the file. In particular, a plink set of files Est1000HGDP. These are merged from the Estonian Biocentre, HGDP, and 1000 Genomes. There are 4,899 individuals in the data, with 135,000 high quality SNPs (very low missingness).

If you look in the “family” file you will see an important part of the structure. So do:

less Est1000HGDP.fam

You’ll see something like this:
Abhkasians abh154 0 0 1 -9
Abhkasians abh165 0 0 1 -9
Abkhazian abkhazian1_1m 0 0 2 -9
Abkhazian abkhazian5_1m 0 0 1 -9
Abkhazian abkhazian6_1m 0 0 1 -9
AfricanBarbados HG01879 0 0 0 -9
AfricanBarbados HG01880 0 0 0 -9

There are 4,899 rows corresponding to each individual. I have used the first column to label the ethnic/group identity. The second column is the individual ID. You can ignore the last 4 columns.

There is no way you want to analyze all the different ethnic groups. Usually, you want to look at a few. For that, you can use lots of commands, but what you need is a subset of the rows above. The grep command matches and returns rows with particular patterns. It’s handy. Let’s say I want just Yoruba, British (who are in the group GreatBritain), Gujurati, Han Chinese, and Druze. The command below will work (note that Han matches HanBeijing, Han_S, Han_N, etc.).

grep "Yoruba|Great|Guj|Han|Druze" Est1000HGDP.fam > keep.txt

The file keep.txt has the individuals you want. Now you put it through plink to generate a new file:

./plink --bfile Est1000HGDP --keep keep.txt --make-bed --out EstSubset

This new file has only 634 individuals. That’s more manageable. But more important is that there are far fewer groups for visualization and analysis.

As for that analysis, I have a Perl script with a bash script within it (and some system commands). Here is what they do:

1) they perform PCA to 10 dimensions
2) then they run admixture on the number of K clusters you want (unsupervised), and generate a .csv file you can look at
3) then I wrote a script to do pairwise Fst between populations, and output the data into a text file
4) finally, I create the input file necessary for the treemix package and then run treemix with the number of migrations you want

There are lots of parameters and specifications for these packages. You don’t get those unless you to edit the scripts or make them more extensible (I have versions that are more flexible but I think newbies will just get confused so I’m keeping it simple).

Assuming I create the plink file above, running the following commands mean that admixture does K = 2 and treemix does 1 migration edge (that is, -m 1). The PCA and pairwise Fst automatically runs.

perl pairwise.perl EstSubset 2 1

Just walk away from your box for a while. The admixture will take the longest. If you want to speed it up, figure out how many cores you have, and edit the file makecluster.sh, go to line 16 where you see admixture. If you have 4 cores, then type -j4 as a parameter. It will speed admixture up and hog all your cores.

There is as .csv that has the admixture output. EstSubset.admix.csv. If you open it you see something like this:
Druze HGDP00603 0.550210 0.449790
Druze HGDP00604 0.569070 0.430930
Druze HGDP00605 0.562854 0.437146
Druze HGDP00606 0.555205 0.444795
GreatBritain HG00096 0.598871 0.401129
GreatBritain HG00097 0.590040 0.409960
GreatBritain HG00099 0.592654 0.407346
GreatBritain HG00100 0.590847 0.409153

Column 1 will always be the group, column 2 the individual, and all subsequent columns will be the K’s. Since K = 2, there are two columns. Space separated. You should be able to open the .csv or process it however you want to process it.

You’ll also see two other files: plink.eigenval plink.eigenvec. These are generic output files for the PCA. The .eigenvec file has the individuals along with the values for each PC. The .eigenval file shows the magnitude of the dimension. It looks like this:

Basically, this means that PC 1 explains twice as much of the variance as PC 2. Beyond PC 4 it looks like they’re really bunched together. You can open up this file as a .csv and visualize it however you like. But I gave you an R script. It’s RPCA.R.

You need to install some packages. First, open R or R studio. If you want to go command line at the terminal, type R. Then type:

Once those packages are loaded you can use the script:

Then, to generate the plot at the top of this post:

There are some useful parameters in this function. The plot to the left adds some shape labels to highlight two populations. A third population I label by individual ID. This second is important if you want to do outlier pruning, since there are mislabels, or just plain outlier individuals, in a lot of data (including in this). I also zoomed in.

Here’s how I did that:
plinkPCA(subVec = c("Druze","GreatBritain"),labelPlot = c("Lithuanians"),xLim=c(-0.01,0.0125),yLim=c(0.05,0.062))

To look at stuff besides PC 1 and PC 2 you can do plinkPCA(PC=c("PC3","PC6")).

I put the PCA function in the script, but to remove individuals you will want to run the PCA manually:

./plink --bfile EstSubset --pca 10

You can remove individuals manually by creating a remove file. What I like to do though is something like this:
grep "randomID27 " EstSubset.fam >> remove.txt

The double-carat appends to the remove.txt file, so you can add individuals in the terminal in one window while running PCA and visualizing with R in the other (Eigensoft has an automatic outlier removal feature). Once you have the individuals you want to remove, then:

./plink --bfile EstSubset --remove remove.txt --make-bed --out EstSubset
./plink --bfile EstSubset --pca 10

Then visualize!

To make use of the pairwise Fst you need the fst.R script. If everything is set up right, all you need to do is type:

It will load the file and generate the tree. You can modify the script so you have an unrooted tree too.

The R script is what generates the FstMatrix.csv file, which has the matrix you know and love.

So now you have the PCA, Fst and admixture. What else? Well, there’s treemix.

I set the number of SNPs for the blocks to be 1000. So -k 1000. As well as global rearrangement. You can change the details in the perl script itself. Look to the bottom. I think the main utility of my script is that it generates the input files. The treemix package isn’t hard to run once you have those input files.

Also, as you know treemix comes with R plotting functions. So run treemix with however many migration edges (you can have 0), and then when the script is done, load R.


But actually, you don’t need to do the above. I added a script to generate a .png file with the treemix plot in pairwise.perl. It’s called TreeMix.TreeMix.Tree.png.

OK, so that’s it.

To review:

Download zip or tar.xz file. Decompress. All the packages and scripts should be in there, along with a pretty big dataset of modern populations. If you are on a non-Mac Linux you are good to go. If you are on a Mac, you need the Mac versions of admixture, plink, and treemix. I’m going to warn you compiling treemix can be kind of a pain. I’ve done it on Linux and Mac machines, and gotten it to work, but sometimes it took time.

You need R and/or R Studio (or something like R Studio). Make sure to install the packages or the scripts for visualizing results from PCA and pairwiseFst won’t work.*

There is already a .csv output from admixture. The PCA also generates expected output files. You may want to sort, so open it in a spreadsheet.

This is potentially just the start. But if you are a laypersonwith a nagging question and can’t wait for me, this should be you where you need to go!

* I wrote a lot of these things piecemeal and often a long time ago. It may be that not all the packages are even used. Don’t bother to tell me.

August 21, 2010

Desmond Tutu, Spaniards, and genetic distance

Filed under: African Genetics,Bushmen,Fst,Genetics,Genomics — Razib Khan @ 11:35 am

Since we’ve been talking about Fst a fair amount, I thought it might be nice to put it in some concrete graphical perspective. First, to review Fst in the genetic context measures the proportion of genetic variation which can be attributed to between population differences. To give a “toy” example if you randomly divided the population of a large Swedish village into two groups, and calculated their Fst, it should be ~ 0, because if you randomly select from an unstructured population by definition there shouldn’t be noticeable between population differences. In contrast, if you compare a Swedish village to a Japanese village, a large fraction of the genetic variation is going to be distinct to each population. Around ~10% of the genetic variation in fact will be between the two groups. Many of the genes will be extremely informative, so that if you know the allelic state from a given individual you can predict with a high degree of certitude which population that individual was from (e.g., SLC24A5 and EDAR). A small set of ancestrally informative alleles would produce a sequence of conditional probabilities of extremely high certitude (on the order of 10 genes for these two populations should suffice, perhaps three for “government work”).

But to put this in perspective, and show how genetic variation differs from locale to locale, I though I would compare continental-scale Fst values with that in a small region, southern Africa. The Fst values for the first I obtained from Investigation of the fine structure of European populations with applications to disease association studies, and the second, Complete Khoisan and Bantu genomes from southern Africa. The Bantu in this case is Desmond Tutu, who is from the Xhosa tribe, and has substantial admixture from the non-Bantu populations which were resident in South Africa prior to the arrival of the Bantus.

First, in tabular format:

Spain Sweden Russia Japan
France 0.0008 0.0023 0.0037 0.1116
Spain 0.0047 0.0059 0.1118
Sweden 0.0025 0.1095
Russia 0.1057

KB1 NB1 TK1 MD8 Desmond Tutu
KB1 0.021 0.024 0.022 0.08
NB1 -0.007 0.006 0.091
TK1 0.016 0.088
MD8 0.061

Second, two adjacent bar graphs. In the foreground I’ve simply take the Spain vs. other Eurasian population comparisons, while in the background Desmond Tutu is the reference for the four Bushmen.


In some ways this comparison is an exaggeration of the variation in African genes. The Bushmen and Bantu populations are of very distinct origins, as the latter spread over eastern and southern Africa only in the last 2,000 years. The Bushmen-Bantu cultural gap is one of sharp discontinuity, and despite gene flow it is still to some extent a genetic one as well. But there are other factors dampening Fst in this case. First, Tutu is himself of partial Khoisan ancestry (of whom there are other groups besides the Bushmen), so his genetic distance is likely to be smaller than someone from the Zulu tribe, which has presumably had less admixture with the indigenous populations, being a bit farther from the edge of the demographic “wave of advance.” Second, the gene chips are geared toward Eurasian populations, and presumably missed African, and particularly Bushmen, specific variants because they didn’t go looking.

My own confusion on these issues the past week illustrates I suppose the difficulty in mapping these abstruse and yet materially concrete patterns onto human categories. But quite often wrestling with the difficulties in the surest path to illumination.

Powered by WordPress