Razib Khan One-stop-shopping for all of my content

September 17, 2020

The genomic landscape of Brazil in 1950

Filed under: Admixture,Human Population Genetics,Human Variation,race — Razib Khan @ 12:12 am


A new whole-genome analysis out of Brazil has some interesting ancestry information. The preprint, Whole-genome sequencing of 1,171 elderly admixed individuals from the largest Latin American metropolis (São Paulo, Brazil):

As whole-genome sequencing (WGS) becomes the gold standard tool for studying population genomics and medical applications, data on diverse non-European and admixed individuals are still scarce. Here, we present a high-coverage WGS dataset of 1,171 highly admixed elderly Brazilians from a census-based cohort, providing over 76 million variants, of which ~2 million are absent from large public databases….

Admixed populations are useful for a lot of reasons. But let’s observe some things about this Brazilian population.

First, it’s old. The average age is 72, so these are people born around 1950. In many ways these are the genetic characteristics of Brazil in 1950, not today. This is why you see so many individuals who self-identify as Asian who are nearly 100% Asian. These individuals are the children of Japanese immigrants. In 1950 the endogamy of the community was high. Today the youngest generation of Japanese Brazilians is 60% mixed.

Second, most of the ancestry of self-identified Brazilian whites in this sample is European. Like the Japanese, a large number of these individuals are probably the children of European immigrants. I suspect this accounts for many of the 20% of the “white” sample that has no trace of non-European ancestry. But observe that around another 20% has trace proportions (~1%) of non-European ancestry, mostly African. My supposition, in this case, is that these are “old stock” white Brazilians. That is, one or both of their parents descend from Portuguese Brazilians who settled in overwhelmingly European areas and retain some non-European admixture due to long-term residence in Brazil. The remainder are white Brazilians who have substantial non-European ancestry, with a small minority whose proportions are quite high from a North American perspective.

A point of comparison is probably useful. About 95% of non-Hispanic whites in the United States seem to have almost no detectable non-European ancestry using this sort of model-based clustering. This illustrates the massive demographic difference between the USA and Latin American nations. The vast majority of white Latin Americans look quite Iberian, but the majority also have far more non-European ancestry than 95% of North American whites. This is partly a reflection of the smaller population sizes of native peoples in North America, and partly of hypodescent for people of any African ancestry in the United States, under which mixed individuals were absorbed into the African American population.

Third, the people who are “mixed” and black in Brazil are more European than you might expect. All the estimates of European ancestry I’ve seen for self-identified black Brazilians (a somewhat protean category due to social changes over the past few generations) indicate a higher European ancestry fraction than among African Americans (~20% median in the latter). Self-identified “mixed” Brazilians have more European ancestry than anything else.

The native category is interesting because most of these people have only a minor component of that ancestry. Additionally, a huge number of white, mixed, and black Brazilians have native ancestry. This is not surprising from previous work. Ancestry deconvolution indicates this is an old admixture, and mtDNA lineages are more native than Y chromosomes. There was a sex asymmetry in the early settlement, and native women married into the settler population. Both black and white Brazilians (and mixed) have lots of native ancestry.

Finally, though there is some overlap between these groups (despite their average differences), I assume that the overlap is much greater in contemporary cohorts in terms of genomic ancestry. Once we get temporal transects in Brazil, it will be interesting to see how assortative mating does, or doesn’t, work.

Looking forward to more of this from Latin America. So many opportunities for admixture mapping!

July 17, 2020

Ancestry analysis quickstart

Filed under: Admixture,Admixture analysis — Razib Khan @ 9:49 pm

Over the years I have posted periodic tutorials on how to do some simple admixture analysis. Initially this was to foster the growth of “genome blogging”, but that’s basically dead along with blogging as a whole (Eurogenes being the primary exception here).

But, unexpectedly it turns out a lot of baby-academics find my tutorials early in graduate school or when they are trying to transition into ancestry inference in population genetics. I have lots of scripts I use myself, but they are not too organized or clear to others. But, I did create two tutorials for simple pipelines that are useful to others, judging by how many emails I get.

The reason I get emails is that occasionally I delete the scripts in the process of housekeeping…but I’ve now created a folder “tutorials” that I will make sure NOT to delete. So all the files are now present at the links below:

Tutorial To Run PCA, Admixture, Treemix And Pairwise Fst In One Command

Tutorial To Run Supervised Admixture Analyses

In the near future, I’m going to clean up and post more scripts (e.g., some I use to make qpAdmin outputs earlier). But I will update this particular blog post since I think these posts are more useful/relevant for people doing web searches than regular readers.

September 20, 2018

Indic civilization came to Southeast Asia because Indian people came to Southeast Asia. Lots of them

I’m reading Indonesia: Peoples and Histories. I selected it because, unlike many books, it wasn’t incredibly skewed to the early modern and postcolonial period. The author makes the interesting point that the Islamicization of western Indonesia and the rise of the great Javanese Hindu kingdom of Majapahit occurred around the same time. This is in contrast to the skein of Indic civilization which had been layered over maritime Southeast Asia for hundreds of years before the medieval period, starting around 500 AD with polities such as that of Kalingga.

As is usual in these sorts of books, it is emphasized that Indian civilization spread through cultural diffusion (in contrast to the fact that though Chinese trade was evident and present early on, the cultural impact was minimal). Any migrations are dismissed as legends, with the possible exception of a few elite religious functionaries.

I now believe this is wrong. I’ve discussed this extensively in the past, but the Singapore Genome Variation Project (SGVP) data set along with more Southeast Asians allows me to illustrate rather clearly the issues. The short of it is that it is highly likely that substantial South Asian ancestry exists within Southeast Asia, and that that ancestry is not just a function of colonial contact (e.g., as certainly occurred in Malaysia).

Merging the various data sets together I got 172,000 SNPs. The initial PC plot shows that Southeast Asian populations exist on a cline to Indians (these are Tamils from the SGVP). The Burmese and Malays in particular have a wide distribution toward the Indians, indicative of a range of ancestry due to continuous admixture. I separated the SGVP Malays into two groups: Malay and MalayMix. The MalayMix are those Malays who are more shifted toward the Indians, and like the Burmese show wide variance. The Malays proper form a more straightforward cluster, shifted toward Indians more as a group.

Zooming in you see that Malays (not MalayMix) are not too different from Cambodians, but are slightly shifted toward Papuans. Filipino samples are similar, but further from Indians. Please note that Malaysia and the Philippines both are somewhat shifted toward the Papuans, and these are two nations where there are still extant Negrito populations (in contrast to Cambodia).

Groups like Lahu, Dai, Koreans, and the Dayak samples from Borneo I put in there partly because I assumed they would be less admixed with South Asians.

Running the samples in an admixture model with K = 5, the results are pretty clean even in unsupervised mode. Part of this is that I did do some outlier analysis and pruning ahead of time.

The Melanesian sample has admixture from something that is maximized in Filipinos and the Borneo samples. This is clearly Austronesian. Notice that the Melanesian samples don’t have any other Southeast Asian ancestry. This indicates that the cosmopolitan nature of some Austronesian groups in maritime Southeast Asia was due to later admixture. In particular, I accept the argument of Lipson et al. that there was an Austro-Asiatic substrate that was absorbed by incoming Austronesians.

Because I was very particular about sample selection, the Indians are nearly fixed for their modal ancestral component. Notice which groups don’t have the Indian ancestry in Southeast Asia: the Borneo samples. Additionally, the frequency in the Philippines may be due to European ancestry. Notice that in the Filipino samples the more diverse individuals tend to have more Indian ancestry, perhaps indicative of cosmopolitanism.

The Lahu and Dai do not have any of the Indian modal ancestry, suggesting that this was not present when the Southeast Asians arrived.

The Cambodians have the Indian modal ancestry, as do many of the Malays. The MalayMix population has a lot, as expected. They are rather like the Burmese samples in that way. Some of the Malays don’t have Indian ancestry though. I think this may be due to the reality that the Malay population is actually cosmopolitan in origin, absorbing Indians, Chinese, and Orang Asli groups, the last of which may not have had Indian ancestry.

Next I ran some Treemix. Cambodians and MalayMix have affinity with Indians, as you’d expect. The Malay group gets gene flow from the Borneo population, and is positioned rather closer to Indians.

Here are some f3-statistics, at least those with a z-score less than -2.

out       p1        p2              f3            z
Burm      Korea     Indian          -0.00371314   -40.1063
Burm      Dai       Indian          -0.00368793   -36.4354
Burm      Lahu      Indian          -0.00363462   -33.3115
Burm      Borneo    Indian          -0.00297696   -30.3724
Burm      Filipino  Indian          -0.00222445   -24.3581
Burm      Korea     Papuan          -0.00243075   -19.9711
Burm      Dai       Papuan          -0.00213815   -15.6106
Burm      Malay     Indian          -0.00133932   -15.233
Burm      Korea     NAN_Melanesian  -0.00158736   -12.6428
Burm      Cambodia  Indian          -0.000991136  -10.9863
Burm      Lahu      Papuan          -0.00185199   -10.8255
Burm      Dai       NAN_Melanesian  -0.0011808    -8.7863
Burm      Lahu      NAN_Melanesian  -0.00136834   -8.35811
Burm      Korea     MalayMix        -0.000470731  -7.64052
Burm      Borneo    Papuan          -0.00105531   -7.04586
Burm      Korea     Cambodia        -0.000388278  -6.74484
Cambodia  Dai       Indian          -0.00166543   -19.5634
Cambodia  Borneo    Indian          -0.00135571   -16.9002
Cambodia  Lahu      Indian          -0.00106449   -10.9303
Cambodia  Dai       Papuan          -0.00128886   -9.86858
Cambodia  Borneo    Papuan          -0.000607278  -4.5826
Cambodia  Dai       NAN_Melanesian  -0.000449035  -3.69865
Cambodia  Lahu      Papuan          -0.000455081  -2.64151
Filipino  Borneo    Papuan          -0.000462553  -3.8874
Filipino  Borneo    NAN_Melanesian  -0.000325208  -3.54648
Malay     Filipino  Cambodia        -0.000763086  -32.6034
Malay     Borneo    Indian          -0.0020853    -29.1425
Malay     Borneo    Cambodia        -0.000613918  -26.5048
Malay     Borneo    Papuan          -0.00223031   -20.037
Malay     Dai       Indian          -0.00136434   -14.4879
Malay     Borneo    NAN_Melanesian  -0.00131484   -14.4241
Malay     Dai       Papuan          -0.00188121   -13.6787
Malay     Filipino  Indian          -0.000850623  -12.4534
Malay     Borneo    Burm            -0.000447661  -11.0181
Malay     Dai       NAN_Melanesian  -0.00122082   -10.1649
Malay     Lahu      Indian          -0.000658295  -6.56147
Malay     Filipino  Papuan          -0.00061669   -6.52747
Malay     Borneo    MalayMix        -0.000237474  -5.75298
Malay     Lahu      Papuan          -0.000942326  -5.35136
Malay     Lahu      NAN_Melanesian  -0.000755618  -5.0158
Malay     Korea     Papuan          -0.000473046  -3.65977
Malay     Dai       MalayMix        -8.93679E-05  -2.12082
MalayMix  Borneo    Indian          -0.00469843   -45.6919
MalayMix  Filipino  Indian          -0.00377864   -39.6124
MalayMix  Dai       Indian          -0.00412557   -35.9643
MalayMix  Malay     Indian          -0.0028506    -33.0568
MalayMix  Lahu      Indian          -0.00345861   -28.1738
MalayMix  Korea     Indian          -0.00281846   -23.528
MalayMix  Borneo    Papuan          -0.00322593   -21.9346
MalayMix  Cambodia  Indian          -0.00192058   -19.5189
MalayMix  Dai       Papuan          -0.00302494   -19.2884
MalayMix  Borneo    NAN_Melanesian  -0.00208153   -15.6894
MalayMix  Dai       NAN_Melanesian  -0.00213561   -14.4382
MalayMix  Filipino  Papuan          -0.0019272    -14.0354
MalayMix  Korea     Papuan          -0.00198522   -12.4299
MalayMix  Cambodia  NAN_Melanesian  -0.00114701   -11.2074
MalayMix  Malay     Papuan          -0.00123309   -10.69
MalayMix  Cambodia  Papuan          -0.00119651   -10.578
MalayMix  Lahu      Papuan          -0.00212514   -10.5372
MalayMix  Malay     NAN_Melanesian  -0.00100416   -9.70624
MalayMix  Lahu      NAN_Melanesian  -0.0017095    -9.61884
MalayMix  Korea     NAN_Melanesian  -0.00120984   -7.54544
MalayMix  Filipino  NAN_Melanesian  -0.000920147  -6.96775
MalayMix  Borneo    Burm            -0.000446434  -5.966
MalayMix  Filipino  Cambodia        -0.000336794  -5.33937
MalayMix  Filipino  Burm            -0.000279165  -4.31016
MalayMix  Burm      Malay           -0.000236247  -4.15308
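
If statistics like these are saved to a whitespace-delimited text file with the five columns out, p1, p2, f3 and z, the z < -2 cutoff can be applied with a short awk filter. This is only a sketch under that assumption: the filter_f3 helper and the demo file are hypothetical and not part of the original analysis, and the third demo row is made up to show a row being dropped.

```shell
#!/usr/bin/env bash
# filter_f3 FILE [CUTOFF]: print the header plus every row whose z-score
# (column 5) is below CUTOFF (default -2). Hypothetical helper.
filter_f3() {
    awk -v cut="${2:--2}" 'NR == 1 || ($5 + 0) < (cut + 0)' "$1"
}

# Demo: two rows taken from the f3 table in this post, plus one invented
# non-significant row that the filter should drop.
demo=$(mktemp)
printf '%s\n' \
    'out p1 p2 f3 z' \
    'Burm Korea Indian -0.00371314 -40.1063' \
    'Malay Dai MalayMix -8.93679E-05 -2.12082' \
    'Fake1 Fake2 Fake3 0.00010000 1.5' > "$demo"
filter_f3 "$demo"
rm -f "$demo"
```

Passing a second argument, for example filter_f3 f3_all.txt -3, tightens the cutoff.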

No big surprises.

I’m trying to get rolloff to work on one of the Reich lab datasets, but it isn’t working (says not enough snps, but the file has 350,000!). I need to establish the admixture date. Perhaps I’ll look to use fineStructure?

Note that this paper shows that of 125 male Cambodians, 9 of them carry R1a1a. This is unlikely to come from French, and Cambodia, unlike Malaysia, doesn’t have a colonial Indian community.

More to come….

July 13, 2018

Tutorial to run supervised admixture analyses

Filed under: Admixture,Data Analysis,Population genetics — Razib Khan @ 11:03 pm

ID              Dai   Gujrati  Lithuanians  Sardinian  Tamil
razib_23andMe   0.14  0.26     0.02         0.00       0.58
razib_ancestry  0.14  0.26     0.02         0.00       0.58
razib_ftdna     0.14  0.26     0.02         0.00       0.57
razib_daughter  0.05  0.14     0.29         0.18       0.34
razib_son       0.07  0.17     0.28         0.19       0.30
razib_son_2     0.06  0.19     0.29         0.19       0.27
razib_wife      0.00  0.07     0.55         0.38       0.00

This is a follow-up to my earlier post, Tutorial To Run PCA, Admixture, Treemix And Pairwise Fst In One Command. Hopefully you’ll be able to run supervised admixture analysis with less hassle after reading this.

The above results are from a supervised admixture analysis of my family and myself. There are three replicates of me because I converted my 23andMe, Ancestry, and Family Tree DNA raw data into plink files. Notice that the results are broadly consistent. This emphasizes that discrepancies between DTC companies in their results are due to their analytic pipelines, not to data quality.

The results for my family are not surprising. I’m ~14% “Dai”, reflecting East Asian admixture into Bengalis. My wife is ~0% “Dai”. My children are somewhere in between. At such a low fraction you expect some variance in the F1.

Now below are results for three Swedes with the same reference panel:

Group   ID        Dai   Gujrati  Lithuanians  Sardinian  Tamil
Sweden  Sweden17  0.00  0.09     0.63         0.28       0.00
Sweden  Sweden18  0.00  0.08     0.62         0.31       0.00
Sweden  Sweden20  0.00  0.05     0.72         0.23       0.00

All these were run on supervised admixture frameworks where I used Dai, Gujrati, Lithuanians, Sardinians, and Tamils, as the reference “ancestral” populations. Another way to think about it is: taking the genetic variation of these input groups, what fractions does a given test focal individual shake out at?

The commands are rather simple. For my family:
bash rawFile_To_Supervised_Results.sh TestScript

For the Swedes:
bash supervisedTest.sh Sweden TestScript

The commands need to be run in a folder: ancestry_supervised/.

You can download the zip file.

Here is what the scripts do in two different situations. Imagine you have raw genotype files downloaded from 23andMe, Ancestry, and Family Tree DNA.

Download the files as usual. Rename them in an intelligible way, because the file names are going to be used for generating IDs. So above, I renamed them “razib_23andMe.txt” and such. Leave the extensions as they are. Obviously, you need to make sure they are not compressed. Then place them all in ancestry_supervised/RAWINPUT.

The script looks for the files in there. You don’t need to specify names, it will find them. In plink the family ID and individual ID will be taken from the text before the extension in the file name. Output files will also have the file name.
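
To make the naming rule concrete, here is a minimal sketch of how an ID can be derived from each file name in RAWINPUT/. It is not the code inside the provided scripts: the id_from_file helper is hypothetical, and the commented plink line only illustrates plink 1.9's --23file option, which handles 23andMe-format text; Ancestry and Family Tree DNA files are laid out differently and need their own conversion step.

```shell
#!/usr/bin/env bash
# id_from_file PATH: strip the directory and the last extension,
# e.g. RAWINPUT/razib_23andMe.txt -> razib_23andMe (hypothetical helper).
id_from_file() {
    base=$(basename "$1")
    printf '%s\n' "${base%.*}"
}

# Walk RAWINPUT/ and report the ID each file would get.
for f in RAWINPUT/*; do
    [ -f "$f" ] || continue    # skip when the folder is empty or missing
    id=$(id_from_file "$f")
    echo "$f -> FID=$id IID=$id"
    # For a 23andMe-format file, plink 1.9 can convert directly, e.g.:
    # ./plink --23file "$f" "$id" "$id" --make-bed --out "INDIVPLINKFILES/$id"
done
```

Because the ID is everything before the last dot, a file name with extra dots in it will keep them in the ID, which is one more reason to rename the files simply.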

Aside from the raw genotype files, you need to determine a reference file. In REFERENCEFILES/ you see the binary pedigree/plink file Est1000HGDP. This is the same file as in the earlier post. It would be crazy to run supervised admixture on the dozens of populations in this file. You need to create a subset.

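For the above I did this, from inside REFERENCEFILES/ (hence the ../ on the output). Note the -E flag: without it, grep treats the | characters literally and matches nothing:
grep -E "Dai|Guj|Lithua|Sardi|Tamil" Est1000HGDP.fam > ../keep.keep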

Then:
./plink --bfile REFERENCEFILES/Est1000HGDP --keep keep.keep --make-bed --out REFERENCEFILES/TestScript

When the script runs, it converts the raw genotype files into plink files and puts them in INDIVPLINKFILES/. Then it takes each plink file and uses it as a test against the reference population file. That file has a prefix on group/family IDs of the form AA_Ref_. This is essential for the script to understand that this is a reference population. The .pop files are automatically generated, and the script supplies the correct K by counting the unique populations.
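
For readers curious what a .pop file looks like: in supervised mode ADMIXTURE expects one line per row of the .fam file, holding a population label for reference individuals and "-" for individuals to be estimated. The sketch below builds one from a merged .fam file. It is not the logic of the provided scripts, the make_pop and count_k helpers are hypothetical, and it assumes that whatever follows AA_Ref_ in the family ID is the population name. The sample IDs in the demo are invented.

```shell
#!/usr/bin/env bash
# make_pop FAM: print one label per .fam row, in row order.
# Reference rows (family ID starting with AA_Ref_) get their population name,
# everyone else gets "-" so ADMIXTURE estimates their ancestry.
make_pop() {
    awk '{ if ($1 ~ /^AA_Ref_/) { sub(/^AA_Ref_/, "", $1); print $1 }
           else { print "-" } }' "$1"
}

# count_k FAM: number of distinct reference labels, i.e. the K to pass on.
count_k() {
    make_pop "$1" | grep -v '^-$' | sort -u | wc -l | tr -d ' '
}

# Demo on a tiny made-up .fam file (sample IDs are invented).
demo=$(mktemp)
printf '%s\n' \
    'AA_Ref_Dai ref1 0 0 1 -9' \
    'AA_Ref_Tamil ref2 0 0 2 -9' \
    'razib_23andMe razib_23andMe 0 0 0 -9' > "$demo"
make_pop "$demo"
echo "K = $(count_k "$demo")"
rm -f "$demo"
```

The output of make_pop would be redirected to a file sharing the prefix of the .bed file, with a .pop extension.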

The admixture run is going to be slow. I recommend you modify runadmixture.pl by adding the number-of-cores parameter so it can go multi-threaded.
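
ADMIXTURE takes its thread count through the -jN flag. As a rough sketch of the kind of command line runadmixture.pl would need to end up issuing, the snippet below detects the core count and prints (rather than runs) a supervised call. The admixture_cmd helper and the merged.bed file name are hypothetical, and how the Perl script actually assembles its call is not shown here.

```shell
#!/usr/bin/env bash
# Detect the core count: nproc on Linux, sysctl on macOS, else fall back to 1.
cores=$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1)

# admixture_cmd BED K THREADS: print a supervised ADMIXTURE command line.
# Hypothetical helper; it only echoes the command so it is safe to try.
admixture_cmd() {
    printf 'admixture --supervised %s %s -j%s\n' "$1" "$2" "$3"
}

# Assumed input: merged.bed with a matching merged.pop and 5 reference groups.
admixture_cmd merged.bed 5 "$cores"
```

Leaving a core or two free is sensible if the machine is being used for anything else during the run.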

When the script is done it will put the results in RESULTFILES/. They will be .csv files with strange names (they will have the original filename you provided, but there are timestamps in there so that if you run the files with a different reference and such it won’t overwrite everything). Each individual is run separately and has a separate output file (a .csv).

But this is not always convenient. Sometimes you want to test a larger batch of individuals. Perhaps you want to use the reference file I provided? For the Swedes I did this, from ancestry_supervised/:
grep "Swede" REFERENCEFILES/Est1000HGDP.fam > keep.keep

Then:
./plink --bfile REFERENCEFILES/Est1000HGDP --keep keep.keep --make-bed --out INDIVPLINKFILES/Sweden

Please note the folder. There are modifications you can make, but the script assumes that the test files are in INDIVPLINKFILES/. The next part is important. The Swedish individuals will have AA_Ref_ prepended on each row since you got them out of Est1000HGDP. You need to remove this. If you don’t remove it, it won’t work. In my case, I modified the file using the vim editor:
vim Sweden.fam

You can do it with any text editor. It doesn’t matter. Though it has to be the .fam file.
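
The same edit can be done without opening an editor. A minimal sketch, assuming the prefix sits at the very start of each row as described: the strip_ref_prefix helper is hypothetical, and it goes through a temporary file rather than sed -i because the in-place flag behaves differently on Mac and Linux.

```shell
#!/usr/bin/env bash
# strip_ref_prefix FAM: remove a leading AA_Ref_ from every row of a .fam file.
# Writes to a temp file first, then replaces the original.
strip_ref_prefix() {
    tmp=$(mktemp)
    sed 's/^AA_Ref_//' "$1" > "$tmp" && mv "$tmp" "$1"
}

# Apply it to the Swedish test file if it is present.
if [ -f INDIVPLINKFILES/Sweden.fam ]; then
    strip_ref_prefix INDIVPLINKFILES/Sweden.fam
fi
```

Only the .fam file changes; the .bed and .bim files are untouched, so it is worth checking afterward that the .fam still has the same number of rows.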

After the script is done, it will put the .csv file in RESULTFILES/. It will be a single .csv with multiple rows. Each individual is tested separately though, so what the script does is append each result to the file. If you have 100 individuals, it will take a long time. You may want to look in the .csv file as the individuals are being added to make sure it looks right.

The convenience of these scripts is that they do some merging/flipping/cleaning for you. And they format the output so you don’t have to.

I originally developed these scripts on a Mac, but to get them to work on Ubuntu I made a few small modifications. I don’t know if they still work on Mac, but you should be able to make the modifications if not. Remember that for a Mac you will need the Mac versions of plink and admixture.

For supervised analysis, the reference populations need to make sense and be coherent. Please check the earlier tutorial and use the PCA functions to remove outliers.
