Razib Khan One-stop-shopping for all of my content

July 11, 2018

Tutorial to run PCA, Admixture, Treemix and pairwise Fst in one command

Filed under: Admixture,data,Fst,PCA,PLINK,Population genetics,TreeMix — Razib Khan @ 11:50 pm

Today on Twitter I stated that “if the average person knew how to run PCA with plink and visualize with R they wouldn’t need to ask me anything.” What I meant by this is that the average person often asks me “Razib, is population X closer to population Y than Z?” To answer this sort of question I dig through my datasets and run a few exploratory analyses, and get back to them.

I’ve been meaning to write up and distribute a “quickstart” for a while to help people do their own analyses. So here I go.

The audience of this post is probably two-fold:

  1. “Trainees” who are starting graduate school and want to dig in quickly into empirical data sets while they’re really getting a handle on things. This tutorial will probably suffice for a week. You should quickly move on to three population and four population tests, and Eigensoft and AdmixTools. As well fineStructure
  2. The larger audience is technically oriented readers who are not, and never will be, geneticists professionally. 

What do you need? First, you need to be able to work in a Linux or environment. I work both in Ubuntu and on a Mac, but this tutorial and these scripts were tested on Ubuntu. They should work OK on a Mac, but there may need to be some modifications on the bash scripts and such.

Assuming you have a Linux environment, you need to download this zip or tar.xz file. Once you open this file it should decompress a folderancestry/.

There are a bunch of files in there. Some of them are scripts I wrote. Some of them are output files that aren’t cleaned up. Some of them are packages that you’ve heard of. Of the latter:

  • admixture
  • plink
  • treemix

You can find these online too, though these versions should work out of the box on Ubuntu. If you have a Mac, you need the Mac versions. Just replace the Mac versions into the folderancestry/. You may need some libraries installed into Ubuntu too if you recompile yourselfs. Check the errors and make search engines your friends.

You will need to install R (or R Studio). If you are running Mac or Ubuntu on the command line you know how to get R. If not, Google it.

I also put some data in the file. In particular, a plink set of files Est1000HGDP. These are merged from the Estonian Biocentre, HGDP, and 1000 Genomes. There are 4,899 individuals in the data, with 135,000 high quality SNPs (very low missingness).

If you look in the “family” file you will see an important part of the structure. So do:

less Est1000HGDP.fam

You’ll see something like this:
Abhkasians abh154 0 0 1 -9
Abhkasians abh165 0 0 1 -9
Abkhazian abkhazian1_1m 0 0 2 -9
Abkhazian abkhazian5_1m 0 0 1 -9
Abkhazian abkhazian6_1m 0 0 1 -9
AfricanBarbados HG01879 0 0 0 -9
AfricanBarbados HG01880 0 0 0 -9

There are 4,899 rows corresponding to each individual. I have used the first column to label the ethnic/group identity. The second column is the individual ID. You can ignore the last 4 columns.

There is no way you want to analyze all the different ethnic groups. Usually, you want to look at a few. For that, you can use lots of commands, but what you need is a subset of the rows above. The grep command matches and returns rows with particular patterns. It’s handy. Let’s say I want just Yoruba, British (who are in the group GreatBritain), Gujurati, Han Chinese, and Druze. The command below will work (note that Han matches HanBeijing, Han_S, Han_N, etc.).

grep "Yoruba|Great|Guj|Han|Druze" Est1000HGDP.fam > keep.txt

The file keep.txt has the individuals you want. Now you put it through plink to generate a new file:

./plink --bfile Est1000HGDP --keep keep.txt --make-bed --out EstSubset

This new file has only 634 individuals. That’s more manageable. But more important is that there are far fewer groups for visualization and analysis.

As for that analysis, I have a Perl script with a bash script within it (and some system commands). Here is what they do:

1) they perform PCA to 10 dimensions
2) then they run admixture on the number of K clusters you want (unsupervised), and generate a .csv file you can look at
3) then I wrote a script to do pairwise Fst between populations, and output the data into a text file
4) finally, I create the input file necessary for the treemix package and then run treemix with the number of migrations you want

There are lots of parameters and specifications for these packages. You don’t get those unless you to edit the scripts or make them more extensible (I have versions that are more flexible but I think newbies will just get confused so I’m keeping it simple).

Assuming I create the plink file above, running the following commands mean that admixture does K = 2 and treemix does 1 migration edge (that is, -m 1). The PCA and pairwise Fst automatically runs.

perl pairwise.perl EstSubset 2 1

Just walk away from your box for a while. The admixture will take the longest. If you want to speed it up, figure out how many cores you have, and edit the file makecluster.sh, go to line 16 where you see admixture. If you have 4 cores, then type -j4 as a parameter. It will speed admixture up and hog all your cores.

There is as .csv that has the admixture output. EstSubset.admix.csv. If you open it you see something like this:
Druze HGDP00603 0.550210 0.449790
Druze HGDP00604 0.569070 0.430930
Druze HGDP00605 0.562854 0.437146
Druze HGDP00606 0.555205 0.444795
GreatBritain HG00096 0.598871 0.401129
GreatBritain HG00097 0.590040 0.409960
GreatBritain HG00099 0.592654 0.407346
GreatBritain HG00100 0.590847 0.409153

Column 1 will always be the group, column 2 the individual, and all subsequent columns will be the K’s. Since K = 2, there are two columns. Space separated. You should be able to open the .csv or process it however you want to process it.

You’ll also see two other files: plink.eigenval plink.eigenvec. These are generic output files for the PCA. The .eigenvec file has the individuals along with the values for each PC. The .eigenval file shows the magnitude of the dimension. It looks like this:

Basically, this means that PC 1 explains twice as much of the variance as PC 2. Beyond PC 4 it looks like they’re really bunched together. You can open up this file as a .csv and visualize it however you like. But I gave you an R script. It’s RPCA.R.

You need to install some packages. First, open R or R studio. If you want to go command line at the terminal, type R. Then type:

Once those packages are loaded you can use the script:

Then, to generate the plot at the top of this post:

There are some useful parameters in this function. The plot to the left adds some shape labels to highlight two populations. A third population I label by individual ID. This second is important if you want to do outlier pruning, since there are mislabels, or just plain outlier individuals, in a lot of data (including in this). I also zoomed in.

Here’s how I did that:
plinkPCA(subVec = c("Druze","GreatBritain"),labelPlot = c("Lithuanians"),xLim=c(-0.01,0.0125),yLim=c(0.05,0.062))

To look at stuff besides PC 1 and PC 2 you can do plinkPCA(PC=c("PC3","PC6")).

I put the PCA function in the script, but to remove individuals you will want to run the PCA manually:

./plink --bfile EstSubset --pca 10

You can remove individuals manually by creating a remove file. What I like to do though is something like this:
grep "randomID27 " EstSubset.fam >> remove.txt

The double-carat appends to the remove.txt file, so you can add individuals in the terminal in one window while running PCA and visualizing with R in the other (Eigensoft has an automatic outlier removal feature). Once you have the individuals you want to remove, then:

./plink --bfile EstSubset --remove remove.txt --make-bed --out EstSubset
./plink --bfile EstSubset --pca 10

Then visualize!

To make use of the pairwise Fst you need the fst.R script. If everything is set up right, all you need to do is type:

It will load the file and generate the tree. You can modify the script so you have an unrooted tree too.

The R script is what generates the FstMatrix.csv file, which has the matrix you know and love.

So now you have the PCA, Fst and admixture. What else? Well, there’s treemix.

I set the number of SNPs for the blocks to be 1000. So -k 1000. As well as global rearrangement. You can change the details in the perl script itself. Look to the bottom. I think the main utility of my script is that it generates the input files. The treemix package isn’t hard to run once you have those input files.

Also, as you know treemix comes with R plotting functions. So run treemix with however many migration edges (you can have 0), and then when the script is done, load R.


But actually, you don’t need to do the above. I added a script to generate a .png file with the treemix plot in pairwise.perl. It’s called TreeMix.TreeMix.Tree.png.

OK, so that’s it.

To review:

Download zip or tar.xz file. Decompress. All the packages and scripts should be in there, along with a pretty big dataset of modern populations. If you are on a non-Mac Linux you are good to go. If you are on a Mac, you need the Mac versions of admixture, plink, and treemix. I’m going to warn you compiling treemix can be kind of a pain. I’ve done it on Linux and Mac machines, and gotten it to work, but sometimes it took time.

You need R and/or R Studio (or something like R Studio). Make sure to install the packages or the scripts for visualizing results from PCA and pairwiseFst won’t work.*

There is already a .csv output from admixture. The PCA also generates expected output files. You may want to sort, so open it in a spreadsheet.

This is potentially just the start. But if you are a laypersonwith a nagging question and can’t wait for me, this should be you where you need to go!

* I wrote a lot of these things piecemeal and often a long time ago. It may be that not all the packages are even used. Don’t bother to tell me.

January 7, 2013

Using your 23andMe data in Plink

Filed under: Genomics,PLINK — Razib Khan @ 2:58 am

With the recent $99 price point for 23andMe many of my friends have purchased kits (finally!). 23andMe’s interpretive results are pretty rich now, but there are still things missing. There are plenty of third party tools you can use, but I know some people might want to do their own data analysis. There are many ways you could go about this, but I want to put up some posts on DIY genomic data analysis to making the learning curve a little less steep, and get people started. Motivation to actually begin going down this road is a big issue, but I think once you get over the hump it gets a lot easier.

First, you need Plink. It is really preferable that you work on a Mac or in Linux to engage in heavy duty analysis, but in this post I’ll assume you are working on the Windows platform. Again, the point here is to make this accessible. Download Plink if you don’t have it, and extract it where ever you like.

Plink is a command line tool, which means that you need to into the folder with the old MS-DOS interface. So use the cd command to get into that folder. Here is a screenshot of my shell:

The selection “plink –noweb –bfile PhyloF –genome” is a command that I entered. It is not part of the directory structure. If you don’t know about the cd command, please see the Wikipedia entry. It’s really just a simple way to step through the directory structure of your files and folders.

Now you have Plink. We need to put your 23andMe data into pedigree format. Additionally it would be convenient to have other reference data sets . Go to here. You now need to click the ZIP option. That will download a 74 MB zip containing all the files you see listed to the left. Most of that is in the two zip files, which are pedigree file data sets that I have provided for your future use. More on that later. First you need to use “CONVERT_23AME_PED.pl”  This a Perl script which takes the 23andMe text file, and converts it to pedigree format which Plink can use. You need to have Perl to use this script.

If you are on Windows you need to get ActivePerl. Download it. Again you have to open the command prompt and go into the appropriate folder. On my computer (this is the first time I’m using Perl on Windows in 10 years, the sacrifices I make for the readership of this weblog!) it is in the C:\ directory, so you probably have to move “up” the directory tree twice by typing “cd ..” (if you do this you’ll see what I mean). Once you are in the Perl directory you need to go into the bin directory. Remember to move the Perl script into the Perl directory. Here is a screen of what I get when I try and run the Perl script without any parameters:

Basically there needs to be a file for the script to process. You should have a 23andMe text file, your raw data. It will start like so: “genome_”. If you don’t have it, go into your account, and click “Browse Raw Data.”  On this page there will be options to download various peoples’ data if you have multiple accounts. It will download whoever is selected in your profile (for most it will be just one person of course).

Now you need to just select the button and enter your password. An 8 MB zip file will come down from the server. Put it into your Perl/bin folder by extracting it. Do not try and process the zip file! Once in there you now add it as your first parameter. I’d rename it something short and sweet since you’ll be typing it in. You don’t need to put a unique id parameter in, but I would if I were you. Try “me.” And “me” for the family id. At some point you’ll do more sophisticated things and need less silly ids, but not right now.

Here’s a screenshot of me running the Perl script with my own data (I renamed the text file). If the file name isn’t recognized make sure that you didn’t add the file extension within Windows, that might confuse it (e.g., for razibdata.txt if that’s what you see in the directory, you’d have to enter in razibdata.txt.txt in the parameter value since the extension is hidden):

There are two output files. In my case they are razibdata.ped and razibdata.map. As you can see they are named from your original file. You need to move both into the Plink directory. The .ped file has your individual data, the first half a dozen columns being the same as the parameters you may, or may not, have entered above. But it is very large because the whole line is filled out with your 23andMe genotype. The .map file basically has the information about the SNPs. These are both text files, and unwieldy.  You need to make it into a binary file. At the end of this there are three new files of the same name with extensions .bed, .bim, .fam:

You can see a lot of information. Most of it is not relevant to you, but note the number of SNPs. So now you have a pedigree file! Great. What do you do with it? Lots of stuff. You can look at the Plink documentation. Because the .bed file is a binary, never open it. The .bim has SNP info. You shouldn’t open this. On the other hand when you merge data sets .fam is useful. It’s a text file with all your individual and family id information. In this case with one file it isn’t informative, though you could change the id by editing the .fam file.

One thing you can do with just one individual is look for runs of homozygosity. The command is:

plink –bfile mydata –homozyg

You enter your binary pedigree file name, without the extension. Observe that now we are use –bfile instead of –file. Many commands will be bCommand instead of just Command if you are using binary files instead of the conventional ones. Binary files are smaller and the commands finish much faster, so use them! The output files, unless you use the –out command at the end to define them, usually begin with plink. So above you have plink.hom. It has some interesting information about the runs of homozygosity, but it is probably not too illuminating unless you suspect you are inbred!

Ultimately what I want you do by the end of this is compute an MDS with your own data against a reference set. That’s PHYLO in the data I’ve provided. It has 99,000 SNPs that overlap with 23andMe, and 1,500 individuals. I’ve altered the .fam file so that all the family ids are recognizable as populations. This will make analysis of the output easier for you. First you need to merge the files. It will be useful for you to prune your data set down, since you have a lot of extra SNPs.

Assuming you’ve extracted PHYLO out of the zip downloaded here is my command writing out the list of SNPs within PHYLO:

You can see from reading this that this data set has ~99,000 SNPs. I pruned it so that it ran quicker for phylogenetic analysis. This is more than a sufficient number for most analysis. What you want to next is create a copy of your own data which doesn’t have so many SNPs, so you can merge them well. Because I created this data set I can tell you that all of the above SNPs are probably in your 23andMe file. With the commands above there is a file, plink.snplist, which you will use to filter your data set.

Here’s how to do it:

Now we’ve got it ready to merge. I will warn you that this takes forever on Windows! No idea why. Also, Windows tends to do strange things with the file extensions. If Plink tells you that a .fam does not exist, look to the file extension. If you label something as something.fam, it might actually be something.fam.fam. In any case, here’s how you merge:

This is going to give you lots of warnings. Often this won’t matter, but sometimes it will tell you that you might need to “flip” one of the files. Try flipping it. If it still doesn’t work I would remove the SNPs causing problems. Something like this:

Honestly you might have to do a lot of things to get data sets to merge. But this particular combination of 23andMe genotype and PHYLO shouldn’t be too bad. Let’s assume that your merge worked. What do you want to do? One thing that might be interesting is an MDS plot (it’s like a PCA plot).

First you run the genome command, which takes forever to finish. It might be best if you did this before you go to sleep, and just check in in the morning. The genome command will produce an output that you’ll use next.

Notice the input file. That was generated in the previous step. The value 6 is a parameter that defines how many dimensions you want to output. My experience with this is that it doesn’t take too long, so I go for 6 at least. The final result of this is that you have an plink.mds file with an ordered list of family and individual ids, along with positions for 6 dimensions. It should be straightforward to import this into Excel, and then plot your MDS, emphasizing your own position. Since I can no longer use Excel I couldn’t be bothered to figure out how to plot my own position, but the distribution should be familiar.

That’s about it for now. I’ll put up another post focusing less on phylogenetics, using the HapMap data set that I provided. I don’t know if I can continue to do this in Windows, but hopefully this illustrated how easy (if tedious) most of this is.

March 14, 2011

Analyzing ancestry with ADMIXTURE, step by step

Over the past few months I was hoping more people would start doing what Zack Ajmal, Dienekes, and David, have been doing. There are public data sets, and open source software, so that anyone with nerdy inclination can explore their own questions out of curiosity. That way you can see the power and the limitations of  genomics on your own desktop. I wonder if one of the biggest reasons that more people haven’t started doing this is formatting. It can be a pain to convert matrix formatted files into pedigree format, for example. But the data gusher isn’t ending, look at what’s coming out (and has come out) in the 1000 Genomes project!

I’ve been thinking I need to write up a post which is a “soft landing” for people so that we can reduce the “activation energy” for this sort of thing…once you get hooked, you only go deeper. Luckily an anonymous tipster has sent me the link to a URL with a huge data set which has been merged, already pedigree formatted. Here are the populations:

!Kung Buryats Hausa Mada Punjabi Arain Totonac Adygei Cambodian Hazara Makrani Pygmy Tu African Americans Chinese Hema Malayan Romanians Tujia Algeria Chinese Americans Hezhen Mandenka Russian Tunisia Altaians Chukchis Hungarians Maya Sahara Occ Turks Alur Chuvashs Iban Mbuti Sakilli Tuscans Ap Brahmin Cochin Jews Igbo Melanesian Samaritians Tuvinians Ap Madiga Colombian Iranian Jews Mexicans Samoan Urkarah Ap Mala Cypriots Iranians Miao San Utahn Whites Armenians Dai Iraq Jews Mongola San Nb Uygur Armenians B Daur Irula Mongolians Sandawe Uzbekistan Jews Ashkenazy Jews Dogon Italian Moroccans Sardinian Uzbeks Azerbaijan Jews Dolgans Japanese Morocco Jews Saudis Vietnamese Balochi Druze Jordanians Morocco N Selkups Greenlanders Bambaran Greenlanders Kaba Morocco ...

Powered by WordPress