# Razib Khan

One-stop-shopping for all of my content

## July 13, 2018

### Tutorial to run supervised admixture analyses

Filed under: Admixture,Data Analysis,Population genetics — Razib Khan @ 11:03 pm
| ID | Dai | Gujrati | Lithuanians | Sardinian | Tamil |
|---|---|---|---|---|---|
| razib_23andMe | 0.14 | 0.26 | 0.02 | 0.00 | 0.58 |
| razib_ancestry | 0.14 | 0.26 | 0.02 | 0.00 | 0.58 |
| razib_ftdna | 0.14 | 0.26 | 0.02 | 0.00 | 0.57 |
| razib_daughter | 0.05 | 0.14 | 0.29 | 0.18 | 0.34 |
| razib_son | 0.07 | 0.17 | 0.28 | 0.19 | 0.30 |
| razib_son_2 | 0.06 | 0.19 | 0.29 | 0.19 | 0.27 |
| razib_wife | 0.00 | 0.07 | 0.55 | 0.38 | 0.00 |

This is a follow-up to my earlier post, Tutorial To Run PCA, Admixture, Treemix And Pairwise Fst In One Command. Hopefully you’ll be able to run supervised admixture analysis with less hassle after reading this.

The above results are from a supervised admixture analysis of my family and myself. There are three replicates of me because I converted my 23andMe, Ancestry, and Family Tree DNA raw data into plink files. Notice that the results are broadly consistent. This emphasizes that discrepancies between DTC companies in their results are due to their analytic pipelines, not data quality.

The results for my family are not surprising. I’m ~14% “Dai”, reflecting East Asian admixture into Bengalis. My wife is ~0% “Dai”. My children are somewhere in between. At such a low fraction you expect some variance among the F1.
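The "somewhere in between" can be checked with a little arithmetic. This is a minimal sketch that averages the two parental rows from the table above to get a naive midparent expectation. It ignores the variance you get from segregation, and the variable names are just for illustration.

```shell
# Supervised admixture fractions copied from the table above
# (columns: ID Dai Gujrati Lithuanians Sardinian Tamil).
results=$(cat <<'EOF'
razib_23andMe 0.14 0.26 0.02 0.00 0.58
razib_wife 0.00 0.07 0.55 0.38 0.00
EOF
)

# Naive expectation for a child: the per-component average of the two parents.
midparent=$(printf '%s\n' "$results" | awk '
  { for (i = 2; i <= 6; i++) sum[i] += $i }
  END {
    out = "midparent"
    for (i = 2; i <= 6; i++) out = out sprintf(" %.3f", sum[i] / NR)
    print out
  }')
echo "$midparent"
```

The midparent “Dai” value is 0.07, and the three children come in at 0.05 to 0.07. For “Tamil” the midparent value is 0.29, against 0.27 to 0.34 in the children.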

Now below are results for three Swedes with the same reference panel:

| Group | ID | Dai | Gujrati | Lithuanians | Sardinian | Tamil |
|---|---|---|---|---|---|---|
| Sweden | Sweden17 | 0.00 | 0.09 | 0.63 | 0.28 | 0.00 |
| Sweden | Sweden18 | 0.00 | 0.08 | 0.62 | 0.31 | 0.00 |
| Sweden | Sweden20 | 0.00 | 0.05 | 0.72 | 0.23 | 0.00 |

All these were run on supervised admixture frameworks where I used Dai, Gujrati, Lithuanians, Sardinians, and Tamils, as the reference “ancestral” populations. Another way to think about it is: taking the genetic variation of these input groups, what fractions does a given test focal individual shake out at?

The commands are rather simple. For my family:
bash rawFile_To_Supervised_Results.sh TestScript

For the Swedes:
bash supervisedTest.sh Sweden TestScript

The commands need to be run in a folder: ancestry_supervised/.

Here is what the scripts do in two different situations. Imagine you have raw genotype files downloaded from 23andMe, Ancestry, and Family Tree DNA.

Download the files as usual. Rename them in an intelligible way, because the file names are going to be used for generating IDs. So above, I renamed them “razib_23andMe.txt” and such. Leave the extensions as they are. Make sure the files are not compressed. Then place them all in ancestry_supervised/RAWINPUT.

The script looks for the files in there. You don’t need to specify names, it will find them. In plink the family ID and individual ID will be taken from the text before the extension in the file name. Output files will also have the file name.
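Before running anything, it can help to preview which IDs the file names will produce and whether any file is still compressed. This is a minimal sketch on a scratch directory that mimics RAWINPUT/; the file names and the .gz/.zip check are assumptions for illustration, not what the script itself does.

```shell
# Scratch directory that mimics ancestry_supervised/RAWINPUT (hypothetical files).
workdir=$(mktemp -d)
mkdir -p "$workdir/RAWINPUT"
touch "$workdir/RAWINPUT/razib_23andMe.txt" "$workdir/RAWINPUT/razib_ftdna.csv"
touch "$workdir/RAWINPUT/razib_ancestry.txt.gz"   # still compressed: should be flagged

ids=""
flagged=""
for path in "$workdir"/RAWINPUT/*; do
  name=$(basename "$path")
  case "$name" in
    *.gz|*.zip) flagged="$flagged $name" ;;   # decompress these first
    *)          ids="$ids ${name%.*}" ;;      # the text before the extension
  esac
done
echo "IDs:$ids"
echo "Still compressed:$flagged"
```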

Aside from the raw genotype files, you need to determine a reference file. In REFERENCEFILES/ you will see the binary pedigree/plink file Est1000HGDP. It is the same file as in the earlier post. It would be crazy to run supervised admixture on the dozens of populations in this file. You need to create a subset.

For the above I did this from inside REFERENCEFILES/ (the -E flag is what makes grep treat | as “or”):
grep -E "Dai|Guj|Lithua|Sardi|Tamil" Est1000HGDP.fam > ../keep.keep

Then:
./plink --bfile REFERENCEFILES/Est1000HGDP --keep keep.keep --make-bed --out REFERENCEFILES/TestScript

When the script runs, it converts the raw genotype files into plink files and puts them in INDIVPLINKFILES/. Then it takes each plink file and uses it as a test against the reference population file. That file has a prefix of the form AA_Ref_ on its group/family IDs. This is essential for the script to recognize that a row belongs to a reference population. The .pop files are automatically generated, and the script sets the correct K by counting the unique reference populations.
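To make the .pop step concrete: in admixture's supervised mode, the .pop file has one line per row of the .fam file, with a population label for reference individuals and a "-" for individuals of unknown ancestry. Here is a minimal sketch of how such a file and K could be derived from the AA_Ref_ prefix. The toy .fam rows and the file name merged are made up, and this is not necessarily how the script does it.

```shell
workdir=$(mktemp -d)
# Toy merged .fam (made-up IDs): reference rows carry the AA_Ref_ prefix,
# the test individual does not.
cat > "$workdir/merged.fam" <<'EOF'
AA_Ref_Dai dai1 0 0 1 -9
AA_Ref_Dai dai2 0 0 1 -9
AA_Ref_Tamil tamil1 0 0 2 -9
razib_23andMe razib_23andMe 0 0 1 -9
EOF

# One line per .fam row: the population label for references, "-" for unknowns.
awk '{ if ($1 ~ /^AA_Ref_/) { sub(/^AA_Ref_/, "", $1); print $1 } else print "-" }' \
  "$workdir/merged.fam" > "$workdir/merged.pop"

# K is the number of distinct reference labels.
K=$(grep -v '^-$' "$workdir/merged.pop" | LC_ALL=C sort -u | wc -l | tr -d ' ')
cat "$workdir/merged.pop"
echo "K=$K"
```

With a .pop file sitting next to the .bed file under the same prefix, the admixture call would then take the form `admixture --supervised merged.bed $K`.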

The admixture step is going to be slow. I recommend you modify runadmixture.pl by adding the cores parameter to the admixture call (for example, -j4 for 4 cores) so it can go multi-threaded.

When the script is done it will put the results in RESULTFILES/. They will be .csv files with strange names (they will have the original filename you provided, but there are timestamps in there so that if you run the files with a different reference and such it won’t overwrite everything). Each individual is run separately and has a separate output file (a .csv).

But this is not always convenient. Sometimes you want to test a larger batch of individuals. Perhaps you want to use the reference file I provided? For the Swedes I did this:
grep "Swede" REFERENCEFILES/Est1000HGDP.fam > keep.keep

Then:
./plink --bfile REFERENCEFILES/Est1000HGDP --keep keep.keep --make-bed --out INDIVPLINKFILES/Sweden

Please note the folder. There are modifications you can make, but the script assumes that the test files are in INDIVPLINKFILES/. The next part is important. The Swedish individuals will have AA_Ref_ prepended on each row, since you got them out of Est1000HGDP. You need to remove this prefix. If you don’t remove it, the script won’t work. In my case, I edited the file with vim:
vim Sweden.fam

You can do it with a text editor too. It doesn’t matter. Though it has to be the .fam file.
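If you would rather not edit by hand, the prefix can be stripped with sed. A minimal sketch on a toy Sweden.fam follows (the sex codes are made up). It writes through a backup copy instead of using sed -i, because the -i flag behaves differently on Mac and Ubuntu.

```shell
workdir=$(mktemp -d)
# Toy .fam with the reference prefix still attached (sex codes are invented).
cat > "$workdir/Sweden.fam" <<'EOF'
AA_Ref_Sweden Sweden17 0 0 1 -9
AA_Ref_Sweden Sweden18 0 0 1 -9
AA_Ref_Sweden Sweden20 0 0 2 -9
EOF

# Keep a backup, then strip the prefix from the start of every row.
cp "$workdir/Sweden.fam" "$workdir/Sweden.fam.bak"
sed 's/^AA_Ref_//' "$workdir/Sweden.fam.bak" > "$workdir/Sweden.fam"

cat "$workdir/Sweden.fam"
# Count rows that still start with the prefix; this should be 0.
remaining=$(grep -c '^AA_Ref_' "$workdir/Sweden.fam")
echo "rows still prefixed: $remaining"
```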

After the script is done, it will put the .csv file in RESULTFILES/. It will be a single .csv with multiple rows. Each individual is tested separately though, so what the script does is append each result to the file. If you have 100 individuals, it will take a long time. You may want to look in the .csv file as the individuals are being added to make sure it looks right.

The convenience of these scripts is that they do some merging/flipping/cleaning for you. They also format the output so you don’t have to.

I originally developed these scripts on a Mac, but to get them to work on Ubuntu I made a few small modifications. I don’t know if they still work on a Mac, but you should be able to make the modifications if not. Remember that for a Mac you will need the Mac versions of plink and admixture.

For supervised analysis, the reference populations need to make sense and be coherent. Please check the earlier tutorial and use the PCA functions to remove outliers.

## July 12, 2018

Filed under: Genetics,Population genetics,Scripts — Razib Khan @ 8:25 am

For the technically inclined people here: Tutorial To Run PCA, Admixture, Treemix And Pairwise Fst In One Command.

## July 11, 2018

### Tutorial to run PCA, Admixture, Treemix and pairwise Fst in one command

Today on Twitter I stated that “if the average person knew how to run PCA with plink and visualize with R they wouldn’t need to ask me anything.” What I meant by this is that the average person often asks me “Razib, is population X closer to population Y than Z?” To answer this sort of question I dig through my datasets and run a few exploratory analyses, and get back to them.

I’ve been meaning to write up and distribute a “quickstart” for a while to help people do their own analyses. So here I go.

The audience of this post is probably two-fold:

1. “Trainees” who are starting graduate school and want to dig quickly into empirical data sets while they’re still getting a handle on things. This tutorial will probably suffice for a week. You should quickly move on to three-population and four-population tests, Eigensoft, and AdmixTools, as well as fineSTRUCTURE.
2. The larger audience is technically oriented readers who are not, and never will be, geneticists professionally.

What do you need? First, you need to be able to work in a Linux or Mac environment. I work both in Ubuntu and on a Mac, but this tutorial and these scripts were tested on Ubuntu. They should work OK on a Mac, but the bash scripts and such may need some modifications.

Assuming you have a Linux environment, you need to download this zip or tar.xz file. Once you open this file it should decompress into a folder, ancestry/.

There are a bunch of files in there. Some of them are scripts I wrote. Some of them are output files that aren’t cleaned up. Some of them are packages that you’ve heard of. Of the latter:

• admixture
• plink
• treemix

You can find these online too, though these versions should work out of the box on Ubuntu. If you have a Mac, you need the Mac versions. Just put the Mac versions into the folder ancestry/ in place of the Linux ones. You may need some libraries installed on Ubuntu too if you recompile yourself. Check the errors and make search engines your friend.

You will need to install R (or R Studio). If you are running Mac or Ubuntu on the command line you know how to get R. If not, Google it.

I also put some data in the file. In particular, a plink set of files Est1000HGDP. These are merged from the Estonian Biocentre, HGDP, and 1000 Genomes. There are 4,899 individuals in the data, with 135,000 high quality SNPs (very low missingness).

If you look in the “family” file you will see an important part of the structure. So do:

less Est1000HGDP.fam

You’ll see something like this:

    Abhkasians abh154 0 0 1 -9
    Abhkasians abh165 0 0 1 -9
    Abkhazian abkhazian1_1m 0 0 2 -9
    Abkhazian abkhazian5_1m 0 0 1 -9
    Abkhazian abkhazian6_1m 0 0 1 -9
    AfricanBarbados HG01879 0 0 0 -9
    AfricanBarbados HG01880 0 0 0 -9

There are 4,899 rows, one per individual. I have used the first column to label the ethnic/group identity. The second column is the individual ID. You can ignore the last 4 columns.

There is no way you want to analyze all the different ethnic groups. Usually, you want to look at a few. For that, you can use lots of commands, but what you need is a subset of the rows above. The grep command matches and returns rows with particular patterns. It’s handy. Let’s say I want just Yoruba, British (who are in the group GreatBritain), Gujarati, Han Chinese, and Druze. The command below will work (note that Han matches HanBeijing, Han_S, Han_N, etc., and that the -E flag is needed for | to mean “or”).

grep -E "Yoruba|Great|Guj|Han|Druze" Est1000HGDP.fam > keep.txt

The file keep.txt has the individuals you want. Now you put it through plink to generate a new file:

./plink --bfile Est1000HGDP --keep keep.txt --make-bed --out EstSubset

This new file has only 634 individuals. That’s more manageable. But more important is that there are far fewer groups for visualization and analysis.
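Because a short pattern like Han can match more groups than you intended, it is worth tallying the group labels in keep.txt before handing it to plink. A minimal sketch on a toy .fam follows; the rows are illustrative stand-ins, not a real extract of Est1000HGDP.fam.

```shell
workdir=$(mktemp -d)
# Toy .fam rows standing in for the real file.
cat > "$workdir/toy.fam" <<'EOF'
Yoruba yoruba1 0 0 1 -9
HanBeijing han_b1 0 0 1 -9
Han_S han_s1 0 0 2 -9
Druze HGDP00603 0 0 1 -9
Abkhazian abkhazian1_1m 0 0 2 -9
EOF

# -E enables extended regular expressions, where | means "or".
grep -E "Yoruba|Great|Guj|Han|Druze" "$workdir/toy.fam" > "$workdir/keep.txt"

# Tally kept individuals per group label (column 1).
groups=$(awk '{ print $1 }' "$workdir/keep.txt" | LC_ALL=C sort | uniq -c \
  | awk '{ print $2 ":" $1 }' | tr '\n' ' ')
kept=$(wc -l < "$workdir/keep.txt" | tr -d ' ')
echo "kept $kept rows: $groups"
```

If a group shows up in the tally that you did not want, tighten the pattern (for example, "HanBeijing" instead of "Han") and regenerate keep.txt.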

As for that analysis, I have a Perl script with a bash script within it (and some system commands). Here is what they do:

1) they perform PCA to 10 dimensions
2) then they run admixture on the number of K clusters you want (unsupervised), and generate a .csv file you can look at
3) then I wrote a script to do pairwise Fst between populations, and output the data into a text file
4) finally, I create the input file necessary for the treemix package and then run treemix with the number of migrations you want

There are lots of parameters and specifications for these packages. You don’t get those unless you edit the scripts or make them more extensible (I have versions that are more flexible but I think newbies will just get confused so I’m keeping it simple).

Assuming I create the plink file above, running the following command means that admixture does K = 2 and treemix does 1 migration edge (that is, -m 1). The PCA and pairwise Fst run automatically.

perl pairwise.perl EstSubset 2 1

Just walk away from your box for a while. The admixture will take the longest. If you want to speed it up, figure out how many cores you have, and edit the file makecluster.sh, go to line 16 where you see admixture. If you have 4 cores, then type -j4 as a parameter. It will speed admixture up and hog all your cores.
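If you are not sure how many cores you have, the shell can tell you. This is a small sketch using getconf, which should work on both Ubuntu and Mac; the echo line only shows what the matching admixture parameter would look like.

```shell
# Number of CPU cores currently online.
cores=$(getconf _NPROCESSORS_ONLN)
echo "Detected $cores cores; the matching admixture parameter would be -j$cores"
```

You may prefer to leave a core free if you want to keep using the machine while admixture runs.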

There is a .csv that has the admixture output: EstSubset.admix.csv. If you open it you see something like this:

    Druze HGDP00603 0.550210 0.449790
    Druze HGDP00604 0.569070 0.430930
    Druze HGDP00605 0.562854 0.437146
    Druze HGDP00606 0.555205 0.444795
    GreatBritain HG00096 0.598871 0.401129
    GreatBritain HG00097 0.590040 0.409960
    GreatBritain HG00099 0.592654 0.407346
    GreatBritain HG00100 0.590847 0.409153

Column 1 will always be the group, column 2 the individual, and all subsequent columns will be the K’s. Since K = 2, there are two columns. Space separated. You should be able to open the .csv or process it however you want to process it.
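Since the file is space separated with the group in column 1, per-group averages are easy to get with awk. A minimal sketch follows, using the eight rows shown above as a stand-in for the full file; it should work for any K because it loops over all columns from the third onward.

```shell
workdir=$(mktemp -d)
# The eight example rows from above, standing in for the full output file.
cat > "$workdir/EstSubset.admix.csv" <<'EOF'
Druze HGDP00603 0.550210 0.449790
Druze HGDP00604 0.569070 0.430930
Druze HGDP00605 0.562854 0.437146
Druze HGDP00606 0.555205 0.444795
GreatBritain HG00096 0.598871 0.401129
GreatBritain HG00097 0.590040 0.409960
GreatBritain HG00099 0.592654 0.407346
GreatBritain HG00100 0.590847 0.409153
EOF

# Average every K column (3 onward) within each group (column 1).
means=$(awk '
  { n[$1]++; nf = NF; for (i = 3; i <= NF; i++) sum[$1, i] += $i }
  END {
    for (g in n) {
      line = g
      for (i = 3; i <= nf; i++) line = line sprintf(" %.4f", sum[g, i] / n[g])
      print line
    }
  }' "$workdir/EstSubset.admix.csv" | LC_ALL=C sort)
echo "$means"
```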

You’ll also see two other files: plink.eigenval and plink.eigenvec. These are generic output files for the PCA. The .eigenvec file has the individuals along with the values for each PC. The .eigenval file shows the magnitude of each dimension. It looks like this:

    68.7974
    38.4125
    7.16859
    3.3837
    2.05858
    1.85725
    1.73196
    1.63946
    1.56449
    1.53666

Basically, this means that PC 1 explains nearly twice as much of the variance as PC 2 (68.8 vs. 38.4). Beyond PC 4 it looks like they’re really bunched together. You can open up this file as a .csv and visualize it however you like. But I gave you an R script. It’s RPCA.R.
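The relative sizes of the PCs can also be put into numbers. A minimal sketch follows that turns the ten eigenvalues above into shares. Note the assumption: these are shares of the ten reported eigenvalues only, not of the total variance in the data, since plink was asked for just 10 dimensions.

```shell
workdir=$(mktemp -d)
# The ten eigenvalues shown above, one per line, as in plink.eigenval.
printf '%s\n' 68.7974 38.4125 7.16859 3.3837 2.05858 \
  1.85725 1.73196 1.63946 1.56449 1.53666 > "$workdir/plink.eigenval"

# Share of each PC among the reported eigenvalues.
shares=$(awk '
  { val[NR] = $1; total += $1 }
  END { for (i = 1; i <= NR; i++) printf "PC%d %.1f%%\n", i, 100 * val[i] / total }
' "$workdir/plink.eigenval")
echo "$shares"
```

On these numbers PC 1 and PC 2 together account for more than 80% of what the first ten dimensions capture, which is why a PC 1 vs. PC 2 plot is usually the place to start.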

You need to install some packages. First, open R or R Studio. If you want to go command line at the terminal, type R. Then type:

    install.packages("ggplot2")
    install.packages("reshape2")
    install.packages("plyr")
    install.packages("ape")
    install.packages("igraph")

Once those packages are installed you can use the script:
source("RPCA.R")

Then, to generate the plot at the top of this post:
plinkPCA()

There are some useful parameters in this function. The plot to the left adds some shape labels to highlight two populations. A third population I label by individual ID. This second is important if you want to do outlier pruning, since there are mislabels, or just plain outlier individuals, in a lot of data (including in this). I also zoomed in.

Here’s how I did that:
plinkPCA(subVec = c("Druze","GreatBritain"),labelPlot = c("Lithuanians"),xLim=c(-0.01,0.0125),yLim=c(0.05,0.062))

To look at stuff besides PC 1 and PC 2 you can do plinkPCA(PC=c("PC3","PC6")).

I put the PCA function in the script, but to remove individuals you will want to run the PCA manually:

./plink --bfile EstSubset --pca 10

You can remove individuals manually by creating a remove file. What I like to do though is something like this:
grep "randomID27 " EstSubset.fam >> remove.txt

The >> operator appends to the remove.txt file, so you can add individuals in the terminal in one window while running PCA and visualizing with R in the other (Eigensoft has an automatic outlier removal feature). Once you have the individuals you want to remove, then:

    ./plink --bfile EstSubset --remove remove.txt --make-bed --out EstSubset
    ./plink --bfile EstSubset --pca 10

Then visualize!
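One thing to watch when building remove.txt with grep is partial matches: an ID like randomID27 is also a substring of randomID270, which is why the grep pattern above ends in a space. A minimal sketch of an alternative follows that matches the ID column exactly with awk and removes duplicates. The IDs are made up, and only the family and individual ID columns are written out.

```shell
workdir=$(mktemp -d)
# Toy .fam with made-up IDs, including one that is a prefix of another.
cat > "$workdir/EstSubset.fam" <<'EOF'
Druze randomID27 0 0 1 -9
Druze randomID270 0 0 2 -9
GreatBritain randomID3 0 0 1 -9
EOF

: > "$workdir/remove.txt"
# randomID27 is listed twice on purpose, to show the de-duplication.
for id in randomID27 randomID3 randomID27; do
  # Exact match on column 2; write family ID and individual ID.
  awk -v id="$id" '$2 == id { print $1, $2 }' "$workdir/EstSubset.fam" \
    >> "$workdir/remove.txt"
done
LC_ALL=C sort -u "$workdir/remove.txt" -o "$workdir/remove.txt"
cat "$workdir/remove.txt"
```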

To make use of the pairwise Fst you need the fst.R script. If everything is set up right, all you need to do is type:
source("fst.R")

It will load the file and generate the tree. You can modify the script so you have an unrooted tree too.

The R script is what generates the FstMatrix.csv file, which has the matrix you know and love.

So now you have the PCA, Fst and admixture. What else? Well, there’s treemix.

I set the number of SNPs for the blocks to be 1000. So -k 1000. As well as global rearrangement. You can change the details in the perl script itself. Look to the bottom. I think the main utility of my script is that it generates the input files. The treemix package isn’t hard to run once you have those input files.

Also, as you know treemix comes with R plotting functions. So run treemix with however many migration edges (you can have 0), and then when the script is done, load R.

Then:

    source("src/plotting_funcs.R")
    plot_tree("TreeMix")

But actually, you don’t need to do the above. I added a script to generate a .png file with the treemix plot in pairwise.perl. It’s called TreeMix.TreeMix.Tree.png.

OK, so that’s it.

To review:

Download zip or tar.xz file. Decompress. All the packages and scripts should be in there, along with a pretty big dataset of modern populations. If you are on a non-Mac Linux you are good to go. If you are on a Mac, you need the Mac versions of admixture, plink, and treemix. I’m going to warn you compiling treemix can be kind of a pain. I’ve done it on Linux and Mac machines, and gotten it to work, but sometimes it took time.

You need R and/or R Studio (or something like R Studio). Make sure to install the packages or the scripts for visualizing results from PCA and pairwise Fst won’t work.*

There is already a .csv output from admixture. The PCA also generates expected output files. You may want to sort, so open it in a spreadsheet.

This is potentially just the start. But if you are a layperson with a nagging question and can’t wait for me, this should get you where you need to go!

* I wrote a lot of these things piecemeal and often a long time ago. It may be that not all the packages are even used. Don’t bother to tell me.

## June 21, 2018

### The fault in our parameters

Filed under: Population genetics — Razib Khan @ 10:49 pm

Of the books I own, Elements of Evolutionary Genetics is one I consult frequently because of its range and comprehensiveness. The authors, Brian Charlesworth and Deborah Charlesworth, have an encyclopedic knowledge of the literature. To truly understand the evolutionary process in all its texture and nuance it is important to absorb a fair amount of theory, and Elements of Evolutionary Genetics does do that (though it’s not as abstruse as something like An Introduction to Population Genetics Theory).

When I see a paper by one of the Charlesworths, I try to read it. Not because I have a love of Drosophila or Daphnia, but because to develop strong population genetics intuitions it always helps to stand on the shoulders of giants. So with that, I pass on this preprint, Mutational load, inbreeding depression and heterosis in subdivided populations:

This paper examines the extent to which empirical estimates of inbreeding depression and inter-population heterosis in subdivided populations, as well as the effects of local population size on mean fitness, can be explained in terms of estimates of mutation rates, and the distribution of selection coefficients against deleterious mutations provided by population genomics data. Using results from population genetics models, numerical predictions of the genetic load, inbreeding depression and heterosis were obtained for a broad range of selection coefficients and mutation rates. The models allowed for the possibility of very high mutation rates per nucleotide site, as is sometimes observed for epiallelic mutations. There was fairly good quantitative agreement between the theoretical predictions and empirical estimates of heterosis and the effects of population size on genetic load, on the assumption that the deleterious mutation rate per individual per generation is approximately one, but there was less good agreement for inbreeding depression. Weak selection, of the order of magnitude suggested by population genomic analyses, is required to explain the observed patterns. Possible caveats concerning the applicability of the models are discussed.

## June 19, 2018

### Burmese are a bit Bengali

Filed under: Burma,Population genetics — Razib Khan @ 12:48 am

About ten years ago I read the book The River of Lost Footsteps: Histories of Burma. Though I have read books where Burma figures prominently (e.g., Strange Parallels), this is the only history of Burma I have read. The author is Burmese, and provides something much more than a travelogue, as might have been the case if he were of Western background. By chance over the past month or so I’ve been in contact with the author, who made a few inquiries as to the genetics of his own family (he came with genotypes in hand). But this brought us to the issue of the genetics of the Burmese people, and their position in the historical-genetic landscape.

The author of The River of Lost Footsteps reminded me of something that’s curious about Southeast Asia: its Indic influences tend to be from the south of the subcontinent. In particular, the native scripts derive from a South Indian parent. Could genetics confirm this connection as well? Also, could genetics give some insights as to the timing of admixture/gene-flow?

In theory, yes.

I had a lot of Southeast Asian datasets to play with, and did a lot of pruning to remove outliers (e.g., people with obvious recent Chinese ancestry). First, comparing them to Bangladeshis, it seems that even without local ancestry tract analysis Burmese and Malays have more varied, and so likely more recent, exogenous ancestry than Bangladeshis. At least this is evident on the PCA plot, where these two groups exhibit strong admixture clines toward South Asians.

But what about the question of Southeast Asian affinities? This needs deeper analysis. Three-population tests, which measure admixture with outgroups when compared to a dyad of populations which are modeled as a clade, can be informative.

| Outgroup | Pop1 | Pop2 | f3 | z |
|---|---|---|---|---|
| Bangladeshi | Telugu | Cambodians | -0.00183999 | -46.3322 |
| Bangladeshi | Telugu | Han | -0.00220121 | -46.046 |
| Burma | Telugu | Han | -0.00406071 | -51.0018 |
| Burma | Han | Bangladeshi | -0.00348186 | -49.1398 |
| Burma | Han | Punjabi_ANI_2 | -0.00418193 | -47.2351 |
| Cambodians | Telugu | Viet | -0.00126923 | -16.91 |
| Cambodians | Punjabi_ANI_2 | Viet | -0.00129881 | -15.6039 |
| Cambodians | Bangladeshi | Viet | -0.000970022 | -14.5642 |
| Malay | Igorot | Telugu | -0.00249795 | -18.758 |
| Malay | Igorot | Bangladeshi | -0.00223454 | -18.5212 |
| Malay | Igorot | Punjabi_ANI_2 | -0.00250732 | -18.3027 |
| Malay | Igorot | Cambodians | -0.00107817 | -16.6214 |
| Viet | Han | Cambodians | -0.000569337 | -13.1139 |

Bangladeshis show strong signatures with both Cambodians and Han. This is in accordance with earlier analysis which suggests Austro-Asiatic and Tibeto-Burman contributions to the “East Asian” element of Bengali ancestry. The Burmese always have Han ancestry, with a South Asian donor as well. This aligns with other PCA analysis which shows the Burmese samples skewed toward Han Chinese. Burma is a compound of different ethnic groups. Some are Austro-Asiatic. The Bamar, the core “Burman” group, have some affinities to Tibetans. And the Shan are a Thai people who are relatively late arrivals.

Cambodians have a weaker admixture signature and are paired with a South Asian group and their geographic neighbors the Vietnamese. The Malays are similar to Cambodians but have the Igorot people from the Philippines as one of their donors. And finally, not surprisingly the Vietnamese show some mixture between Han-like and Cambodian-like ancestors.
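A table like the one above can be screened mechanically. This is a minimal sketch, assuming the five-column layout of the table saved as space-separated text. The first column (labelled Outgroup in the table) is treated as the population being tested. It keeps rows with a negative f3 and z below -3, a commonly used cutoff, and reports the pair with the most negative z for each tested population. The file name f3.txt is made up.

```shell
workdir=$(mktemp -d)
# Rows copied from the three-population table above (header dropped).
cat > "$workdir/f3.txt" <<'EOF'
Bangladeshi Telugu Cambodians -0.00183999 -46.3322
Bangladeshi Telugu Han -0.00220121 -46.046
Burma Telugu Han -0.00406071 -51.0018
Burma Han Bangladeshi -0.00348186 -49.1398
Burma Han Punjabi_ANI_2 -0.00418193 -47.2351
Cambodians Telugu Viet -0.00126923 -16.91
Cambodians Punjabi_ANI_2 Viet -0.00129881 -15.6039
Cambodians Bangladeshi Viet -0.000970022 -14.5642
Malay Igorot Telugu -0.00249795 -18.758
Malay Igorot Bangladeshi -0.00223454 -18.5212
Malay Igorot Punjabi_ANI_2 -0.00250732 -18.3027
Malay Igorot Cambodians -0.00107817 -16.6214
Viet Han Cambodians -0.000569337 -13.1139
EOF

# For each tested population, keep the source pair with the most negative z.
best=$(awk '
  $4 < 0 && $5 < -3 {
    if (!($1 in z) || $5 + 0 < z[$1] + 0) { z[$1] = $5; pair[$1] = $2 "+" $3 }
  }
  END { for (t in z) print t, pair[t], z[t] }
' "$workdir/f3.txt" | LC_ALL=C sort)
echo "$best"
```

The most negative z only flags the strongest signal among the pairs that were tested. It does not say that pair is the true source of the admixture.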

Further PCA analysis shows that while Cambodians and Malays tend to skew somewhat neutrally to South Asians (the recent Indian migration to Malaysia is mostly Tamil), the Burmese are shifted toward Bangladeshis:

Finally, I ran some admixture analyses.

First, I partitioned the samples with an unsupervised set of runs (K = 4 and K = 5). In this way I obtained reified reference groups as follows:

“Austronesians” (Igorot tribesmen from the Philippines)
“Austro-Asiatic” (a subset of Cambodians with the least exogenous admixture)
“North Indians” (Punjabis)
“South Indians” (A subset of middle-caste Telugus highest on the modal element in South Indians)
“Han” (a proxy for “northern” East Asian)

The results are mostly as you’d expect. In line with three-population tests, the Vietnamese are Han and Austro-Asiatic. More of the former than latter. There is a minor Austronesian component. Notice there is no South Asian ancestry in this group.

In contrast, Cambodians have low levels of both North and South Indian. These out-of-sample Cambodians are still highly modal for Austro-Asiatic though.

Malays are more Austro-Asiatic than Austronesian, which might surprise. But the Igorot samples are highly drifted and distinct. I think these runs are underestimating Austronesian in the Malays. Notice that some of the Malays have South Asian ancestry, but a substantial number do not. This large range in admixture is what you see in PCA as well. I think this strongly points to the fact that Malays have been receiving gene-flow from India recently, as it is not well mixed into the population.

The Bangladeshi outgroup is mostly a mix of North and South Indian, with a slight bias toward the latter. No surprise. As I suggested earlier you can see that the Bangladeshi samples are hard to model as just a mix of Burmese with South Asians. The Austro-Asiatic component is higher in them than the Burmese. This could be because Burma had recent waves of northern migration (true), and, eastern India prior to the Indo-Aryan expansion was mostly inhabited by Austro-Asiatic Munda (probably true). That being said, the earlier analysis suggested that the Munda cannot be the sole source of East Asian ancestry in Bengalis.

Finally, every single Burmese sample has South Asian ancestry. Much higher than Cambodians. And, there is variance. I think that leads us to the likely conclusion that Burma has been subject to continuous gene-flow as well as recent pulses of admixture from South Asia. The variation in South Asian ancestry in the Burmese is greater than East Asian ancestry in Bengalis. I believe this is due to more recent admixture in Burmese due to British colonial Indian settlement in that country.

The cultural and historical context of this discussion is the nature of South Asian, Indic, influence on Southeast Asia. One cannot deny that there has been some gene-flow between Southeast Asia and South Asia. In prehistoric times it seems that Austro-Asiatic languages moved from mainland Southeast Asia to India. More recently there are historically attested, and genetically confirmed, instances of colonial Indian migration. But, the evidence from Cambodia suggests that this is likely also ancient, as unlike Malaysia or Burma, Cambodia did not have any major flow of Indian migrants during the colonial period. One could posit that perhaps the Cambodian Indian affinity is a function of “Ancestral South Indian.” But the Cambodians are not skewed toward ASI-enriched groups in particular. And, I know for a fact that appreciable frequencies of R1a1a exist within the male Khmer population (this lineage is common in South Asia, especially the north and upper castes).

As far as Burma goes, I think an older period of South Indian cultural influence, and some gene-flow seems likely. But, with the expansion of Bengali settlement to the east over the past 2,000 years, more recent South Asian ancestry is probably enriched for that ethnolinguistic group.

I’m going to try and follow-up with some ancestry tract analysis….

## May 30, 2018

### Soft & hard selection vs. soft & hard sweeps

Filed under: Population genetics,Quantitative Genetics — Razib Khan @ 12:09 am

When I was talking to Matt Hahn I made a pretty stupid semantic flub, confusing “soft selection” with “soft sweeps.” Matt pointed out that soft/hard selection were terms more appropriate to quantitative genetics than to population genomics. His viewpoint is defensible, though going back into the literature on soft/hard selection, e.g., Soft and hard selection revisited, the main thinkers pushing the idea were population geneticists who were also considering ecological questions.*

The strange thing is that I had already known the definitions of hard and soft selection on some level because I had read about them as I was getting confused with hard and soft sweeps! But this was more than ten years ago now, and since then I haven’t given the matter enough thought obviously, as I defaulted back to confusing the two classes of terms, just as I used to.

Matt pointed out that truncation selection is a form of hard selection. All individuals below (or above) a certain phenotype value have a fitness of zero, as they don’t reproduce. In a single locus context, hard selection would involve deleterious lethal alleles, whose impact on the genotype is the same irrespective of ecological context. So hard selection operates by reducing the fitness of individuals/genotypes to zero.

For soft selection, context matters much more, and you would focus more on relative fitness differences across individuals/genotypes. Some definitions of soft vs. hard selection emphasize that in the former case fitness is defined relative to the local ecological patch, while the latter is a universal estimate. Soft selection does not necessarily operate through the zero fitness value for a genotype, but rather differential fitness. Hard selection can crash your population size. Soft selection does not necessarily do that.

Though I won’t outline the details, one of the originators of the soft/hard selection concept analogized them to density-dependent/independent dynamics in ecology. If you know the ecological models, the correspondence probably is obvious to you.

As for hard and soft sweeps, these are particular terms of relevance to genomics, because genome-wide data has allowed for their detection through the impact they have on the variation in the genome. A “sweep” is a strong selective event that tends to sweep away variation around the focus of selection. A hard sweep begins with a single mutant, and positive selection tends to drive it toward fixation.

A classical example is lactase persistence in Northern Europeans and Northwest South Asians (e.g., Punjabis). The mutation in the LCT gene is the same across a huge swath of Eurasia. And, the region of the genome around the mutation is also the same, because regions adjacent to that single mutation increased in frequency as well (they “hitchhiked”). This produces a genetic block of highly reduced diversity, since the hard selective sweep increases the frequency of so many variants which are associated with the advantageous one, and may drive most other competing variants to extinction.

Someone is free to correct me in the comments, but it strikes me that many hard selective sweeps are driven by soft selection. Fitness differentials between those with the advantageous alleles and those without it are not so extreme, and obviously context dependent, even in cases of hard sweeps on a single locus.

The key to understanding soft sweeps is that there isn’t a focus on a singular mutation. Rather, selection can target multiple mutations, which may have the same genetic position, but be embedded within different original gene copies. In fact, soft sweeps often operate on standing variation, preexistent alleles which were segregating in the population at low frequencies or were totally neutral. Genetic signatures of these events are less striking than those for hard sweeps because there is far less diminishment of diversity, since it’s not the increase in the frequency of a singular mutation and the hitchhiking of its associated flanking genomic region.

Soft sweeps can clearly occur with soft selection. But truncation selection can occur on polygenic traits, so depending on the architecture of the trait (i.e., effect size distribution across the loci) one can imagine them associated with hard selection as well.

Going back to the conversation I had with Matt the reason semantics is important is that terms in population genetics are informationally rich, and lead you down a rabbit-hole of inferences. If population genetics is a toolkit for decomposing reality, then you need to have your tools well categorized and organized. On occasion it is important to rectify the names.

* There are two somewhat related definitions of soft/hard selection. I’ll follow Wallace’s original line here, though I’m not sure they differ that much.

## May 21, 2018

### The mutation accumulation controversy continues….

Filed under: Mutation,Population genetics — Razib Khan @ 10:56 pm

Every few years I check to see if the great mutation accumulation controversy has resolved itself. I don’t know if anyone calls it that, but that’s what I think of it as. There are two major issues that matter here: mutation rates are a critical parameter in evolutionary models, and, mutation accumulation over time matters for parental age effects when it comes to disease (speaking as an older father!).

In the latter case, I’m talking about the reasons that people freeze their eggs or sperm. In the former case, I’m talking about whether we can easily extrapolate mutation rates over evolutionary time as semi-fixed, so we can infer dates of last common ancestry and such. To give a concrete example of what I’m talking about, if mutation rates varied a lot over the evolutionary history of our hominin lineage, then we might need to rethink some of the inferred timings.

Today two preprints came out on mutation accumulation. First, Overlooked roles of DNA damage and maternal age in generating human germline mutations. Second, Reproductive longevity predicts mutation rates in primates. What a coincidence in synchronicity!

Additionally, the last author on the second preprint, Matt Hahn, is someone I’ll be doing a podcast with this week. So aside from talking about neutral theory, and his book Molecular Population Genetics, I’m going to have to bring up this mutation business.

The figure above from the first preprint shows that the proportion of mutations derived from the father doesn’t increase over time, as textbooks generally state. Why would we expect this? Sperm keeps replicating after puberty so you should be gaining more mutations. In contrast, the eggs are arrested in meiosis. There are various mechanistic reasons that the authors of the first preprint give for why the ratio does not change between paternal and maternal mutations (e.g., non-replicative mutations seem to be the primary one). The authors are using a very “pedigree” strategy, rather than an “evolutionary” one. They’re looking at sequenced trios, and noticing patterns. I think in the near future they’ll be far more sure of what’s going on because they’ll have bigger sample sizes. They admit the effects are subtle (also, some of the p-values are getting close to 0.05).

Instead of focusing on a human pedigree, the second preprint does some sequencing on owl monkeys (I had no idea there were “owl monkeys” before this paper). They find that the mutation rate is ~32% lower in owl monkeys than in humans. Why is this?

The plot to the left shows that mutations increase across age with species (though the number of data points is pretty small). The authors contend that:

The association between mutation rates and reproductive longevity implies that changes in life history traits rather than changes to the mutational machinery are responsible for the evolution of these rates. Species that have evolved greater reproductive longevity will have a higher mutation rate per generation without any underlying change to the replication, repair, or proofreading proteins.

If I read this right: owl monkeys reproduce fast and don’t have as much reproductive longevity. Ergo, lower mutation rates (less mutational build-up from paternal side).

After all these years I’m still not convinced about anything. I assume that eventually bigger data sets will come online and we’ll resolve this. Someone has to be right!

(not too many people on Twitter get what’s going on either)

## May 8, 2018

### The peoples of the Maghreb have some Pleistocene roots

Filed under: Human Population Genetics,North Africa,Population genetics — Razib Khan @ 11:58 pm

The Maghreb is an important and interesting place. In the history of Western civilization, the tension between Carthage, the ancient port city based out of modern-day Tunisia, and Rome, is one of the more dramatic and tragic rivalries that has resonances down through the ages. Read Adrian Goldsworthy’s chapter on the Battle of Cannae in The Punic Wars for what I’m alluding to (and of course there were Cato the Elder’s dramatic remonstrations).

Later Roman Africa, which really encompassed northern Morocco, coastal Algeria, and Tunisia and Tripolitania, became a major social and economic pillar of the Imperium. Not only did men such as the emperor Septimius Severus and St. Augustine have roots in the region, but these provinces were a major economic bulwark for the Western Empire in its last century. The wealthy Senators of the 4th and 5th century were often absentee landlords of vast estates in North Africa. The fall of these provinces to the Vandals and Alans in the 430s began the transformation of the Western Empire based in Rome into a more regional player, rather than a true hegemon (perhaps an analogy here can be made to the loss of Anatolia by the Byzantines in the 11th century).

Another important aspect of North Africa is that it is the westernmost extension of the region possibly settled by Near Eastern farmers in Africa. The native Afro-Asiatic Berber languages seem to have been dominant in the region despite the influence and prestige of Punic and Latin in the cities when Muslim Arabs conquered the region in the late 7th century. The genetic-demographic characteristics of the region are relevant to attempts to understand the origins of the Afro-Asiatic languages more generally since Berber is part of the clade with the Semitic languages.

A preprint and a paper utilizing ancient DNA have shed a great deal of light on these questions recently. The paper is in Science, Pleistocene North African genomes link Near Eastern and sub-Saharan African human populations. The preprint is Ancient genomes from North Africa evidence prehistoric migrations to the Maghreb from both the Levant and Europe. They are in broad agreement, though they cover somewhat different periods.

The figure below is the big finding of the Science paper:

They retrieved some genotypes from a site in northern Morocco, Taforalt, which dates to ~15,000 years before the present. This is a Pleistocene site, before the rise of agriculture. The Taforalt individuals are about 65% Eurasian in affinity, and 35% Sub-Saharan African. This confirms that the Eurasian back-migration to northern Africa predates the Holocene, just as many archaeologists and geneticists have reported earlier.

The samples from the preprint date to a later time. IAM in the samples dates to 7,200 years before the present, and KEB to ~5,000 years before the present. It seems pretty clear that the IAM samples in the preprint exhibit continuity with the Taforalt samples. Though it is not too emphasized in the preprint the lower K’s seem to strongly suggest that the IAM samples have Sub-Saharan African ancestry, just like the Taforalt samples which are nearly 8,000 years older. In the KEB samples, the fraction drops, probably diluted in part by ancestry related to what we elsewhere term “Early European Farmer” (EEF), related to the Anatolian farming expansion.

Both the Taforalt and IAM samples, in particular, seem to exhibit strong affinities to Natufian/Levantine peoples. Additionally, many of these samples carry Y chromosome haplogroup E1b, just like some of the Natufians. These results indicate that the Natufian and North African populations were exchanging genes, or were part of one cline, rather deep in the Pleistocene.

Though various methods have suggested that there is a lot of recent Sub-Saharan African admixture, dating to the Arab period, in North Africa, these results suggest that much of it is far older. The Mozabites, as an isolated Berber group, reflect this tendency. Though some individuals have inflated African ancestry due to recent admixture, much of it is older and evener. And yet the Mozabites seem to have less Sub-Saharan African ancestry on average than the IAM sample.

There aren’t enough data points to make a strong inference about the temporal transect, but these few results imply a decline in Sub-Saharan ancestral component after the Pleistocene with further farming migration, and then a rise again with the trans-Saharan slave trade during the Muslim period. Another issue, highlighted in the preprint, is likely heterogeneity within the Maghreb in ancestry (lowland populations in modern North Africa tend to have more Sub-Saharan ancestry due to where slaves were settled).

In the Science paper the authors make an attempt to adduce the origin of the Sub-Saharan contribution to the Taforalt individuals. The result is that there is no modern or ancient proxy that totally fits the bill. These individuals have affinities to many Sub-Saharan African populations. The Sub-Saharan component is likely heterogeneous, but attempts to model European genetic variation during the Ice Age ran into the same problem: divergence from modern populations was quite great. Until we get more ancient DNA there probably won’t be too much more clarity.

On the issue of the Eurasian ancestry, it’s clearly quite like the Natufians. But curiously the authors find that the Neanderthal ancestry in these samples is greater than that found in early Holocene Iran samples. From this, the authors conclude that they may have had a lower fraction of “Basal Eurasian” (BEu) than those populations further to the east. But already 15,000 years ago BEu populations were mixed with more generic West Eurasians to generate the back-migration to Africa. If BEu diverged from other Eurasians >50,000 years ago, then it may have merged back into the “Out-of-Africa” populations around or before the Last Glacial Maximum, ~20,000 years ago.

Finally, the authors looked at some pigmentation genes. Curiously the Taforalt and IAM individuals did not carry the derived variants for pigmentation found in many West and South Eurasians, but the KEB did. This confirms results from Europe, and population genomic inference in modern samples, that selection for derived pigmentation variants is relatively recent in the Holocene.

I do want to add that one possibility about the Sub-Saharan ancestry in the Taforalt, and probably all modern North Africans to a lesser extent, is that it is ancient and local. We now know proto-modern humans were present in the region >300,000 years ago. Northwest Africa may have been part of the multi-regional metapopulation of H. sapiens, as opposed to the Eurasian biogeographic zone in which it is often placed, before a post-LGM back migration of Eurasians.

## February 24, 2018

### Are Turks Armenians under the hood?

Filed under: Historical Population Genetics,Population genetics — Razib Khan @ 8:31 pm

Benedict Anderson’s Imagined Communities: Reflections on the Origin and Spread of Nationalism is one of those books I haven’t read, but should. In contrast, I have read Azar Gat’s Nations, which is a book-length counterpoint to Imagined Communities. To take a stylized and extreme caricature, Imagined Communities posits nations to be recent social and historical constructions, while Nations sees them as primordial, and at least originally founded on ties of kinship and blood.

The above doesn’t capture the subtlety of Gat’s book, and I’m pretty sure it doesn’t capture that of Anderson’s either. But, those are the caricatures that people take away and project in public, especially Anderson’s (since Gat’s is not as famous).

When it comes to “imagined communities” I recently have been thinking how well that of modern Turks fits into the framework. Though forms of pan-Turkic nationalism can be found as early as 9th-century Baghdad, the ideology truly emerges in force in the late 19th century, concomitantly with the development of a Turkish identity in Anatolia which is distinct from the Ottoman one.

The curious thing is that though Turkic and Turkish identity is fundamentally one of language and secondarily of religion (the vast majority of Turkic peoples are Muslim, and there are periods, such as the 17th century when the vast majority of Muslims lived in polities ruled by people of Turkic origin*), there are some attempts to engage in biologism. This despite the fact that the physical dissimilarity of Turks from Turkey and groups like the Kirghiz and Yakut is manifestly clear.

Several years ago this was made manifestly clear in the paper The Genetic Legacy of the Expansion of Turkic-Speaking Nomads across Eurasia. This paper clearly shows that Turkic peoples across Eurasia have been impacted by the local genetic substrate. In plainer language, the people of modern-day Turkey mostly resemble the people who lived in Turkey before the battle of Manzikert and the migration of Turkic nomads into the interior of the peninsula in the 11th century A.D. Of course, there is some genetic element which shows that there was a migration of an East Asian people into modern-day Anatolia, but this component is the minority one.**

Sometimes the Turkish fascination with the biological comes out in strange ways, Turkish genealogy database fascinates, frightens Turks. Much of the discussion has to do with prejudice against Armenians and Jews. But the reality is that most Turks at some level do understand that they are descended from Greeks, Armenians, Georgians, etc.

To interrogate this further I decided to look at a data set of Greeks, Turks, Armenians, Georgians, and a few other groups, including Yakuts, who are the most northeastern of Turkic peoples. The SNP panel was >200,000, and I did some outlier pruning. Additionally, I didn’t have provenance on a lot of the Greeks, except some labeled as from Thessaly. I therefore just split those up with “1” being closest to the Thessaly sample and “3” the farthest.

First, let’s look at the PCA.

The Turks are shifted toward the Yakuts, but not too much. In contrast, there is much more of a Yakut shift in Tajiks, and especially Turkmens. These are two groups from further east, closer to the heart of the zone of Turkic expansion. Curiously, the Tajiks, who are the dominant non-Turkic Iranian-speaking people of Central Asia, actually have more East Asian ancestry than the Turks of Turkey. This goes to show that ethnicity is somewhat fluid, and Turkic people have assimilated into the Tajik identity. That being said, please note that the Turkmen are notably more east-shifted than the Tajik.

Let’s see how this looks on pairwise Fst.

Fst is kind of difficult for fine distinctions when you have outgroups like Yakuts and Dai. So let’s look at Treemix with five migrations:

On this, you can see that the relationship of the Greece clusters on Treemix to Lithuanians matches PCA. Greece1 is the closest, Greece3 the farthest.

The Turks are close to the Georgians and Armenians, but not the Kurds, or Tajiks. And, they receive gene flow from the Turkmen-Yakut region of the graph. So do the Tajiks…but the Tajiks also receive gene flow from the Lithuanians. The admixture plot makes it more clear what’s happening I think.

Yellow ~ modal in Southern Europe, green ~ modal Northern Europe, red ~ Central Asian, while blue and purple are northern and southern East Asian. In comparison to Turks of Anatolia Tajiks have a lot more Northern European affinity, probably because of the common steppe heritage. Not surprisingly, Turks have more Southern European like ancestry.

Curiously the East Asian ancestry in the Turkic people seems to be both Yakut and Dai like, so perhaps it was more cosmopolitan than we might think? The Yakuts after all are from the northern edge of the range, and may have absorbed a lot of indigenous Siberian ancestry.

Georgians have none of the Northern European sort of ancestry, but Armenians do, and Turks even more. One could posit that this is due to Slavic ancestry arriving with the Rumelian Turks who arrived in the 20th century, but just as likely is the possibility that Turks have a lot of ancestry from western Anatolia which was Greek, and Greeks have more of this than Armenians.

It’s hard to tell from these results whether Turks have more of an affinity with Greek or Armenians as their non-Turkic ancestors. So I ran a three population test.

| Outgroup | X1 | X2 | f3 | error | z |
|---|---|---|---|---|---|
| Turkey | Armenians | Yakut | -0.00253688 | 6.70852e-05 | -37.8158 |
| Turkey | Greece3 | Yakut | -0.00246931 | 6.72384e-05 | -36.7247 |
| Turkey | Georgian | Yakut | -0.00256555 | 7.60158e-05 | -33.7502 |
| Turkey | Armenians | Dai | -0.00246779 | 7.40038e-05 | -33.3468 |
| Turkey | Greece3 | Dai | -0.0024101 | 7.34629e-05 | -32.8071 |
| Turkey | Georgian | Dai | -0.00249174 | 8.11957e-05 | -30.688 |
| Turkey | Greece2 | Yakut | -0.00222382 | 7.62368e-05 | -29.1699 |
| Turkey | Greece2 | Dai | -0.00231001 | 8.39207e-05 | -27.5261 |
| Turkmen | Turkey | Dai | -0.00288213 | 0.000108049 | -26.6742 |
| Turkmen | Turkey | Yakut | -0.00254805 | 0.000102816 | -24.7826 |
| Turkey | Greece1 | Yakut | -0.00225638 | 9.94722e-05 | -22.6836 |
| Turkey | GreekCentral | Dai | -0.00235681 | 0.000104014 | -22.6587 |
| Turkey | Greece3 | Tajik | -0.000622671 | 2.76666e-05 | -22.5063 |
| Turkey | GreekCentral | Yakut | -0.00221985 | 0.000101654 | -21.8373 |
| Turkey | Greece1 | Dai | -0.00243254 | 0.000112011 | -21.717 |
| Turkey | Greece3 | Turkmen | -0.000640439 | 3.33529e-05 | -19.2019 |
| Turkey | GreekThessaly | Yakut | -0.00208436 | 0.00011042 | -18.8767 |
| Turkey | Dai | GreekThessaly | -0.00225435 | 0.00012241 | -18.4163 |
| Turkey | Greece2 | Turkmen | -0.000584983 | 3.29819e-05 | -17.7365 |
| Turkey | Armenians | Turkmen | -0.000520887 | 3.07253e-05 | -16.953 |
| Turkey | Armenians | Tajik | -0.000421139 | 2.55274e-05 | -16.4975 |
| Tajik | Turkey | Dai | -0.00140423 | 8.51697e-05 | -16.4875 |
| Tajik | Turkey | Yakut | -0.00124601 | 7.60725e-05 | -16.3793 |
| Turkey | Georgian | Turkmen | -0.000532496 | 3.80694e-05 | -13.9875 |
| Turkey | Greece2 | Tajik | -0.000412419 | 3.04172e-05 | -13.5587 |
| Turkey | Armenians | Lithuanians | -0.000459831 | 3.75838e-05 | -12.2348 |
| Turkey | Greece1 | Turkmen | -0.000570715 | 4.7753e-05 | -11.9514 |
| Turkey | Kurds | Yakut | -0.00146087 | 0.000124799 | -11.7058 |
| Turkey | GreekThessaly | Turkmen | -0.000516877 | 4.46683e-05 | -11.5714 |
| Turkey | Georgian | Tajik | -0.000328859 | 3.02443e-05 | -10.8734 |
| Turkey | GreekCentral | Turkmen | -0.000504962 | 4.92555e-05 | -10.2519 |

Armenians beat out Greece3 by a bit, but really it’s hard to say from this that the result is definitive. It’s likely that my Turkish sample has both, and/or the original Turkic nomads had Iranian-like ancestry which was more like Armenian than Greek? Hard to say. Additionally, the fact that Greece3 is better than the other options suggests to me that the source is Anatolian Greeks who were less impacted by migrations from the north than Greeks in Greece proper.
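For readers who want to see what the three population statistic is doing, here is a minimal sketch. It assumes the first column of the table is the tested population C in f3(C; A, B), and it uses the naive form, the average over SNPs of (c - a)(c - b) on allele frequencies. Real tools apply a sample-size correction and estimate the standard error by resampling blocks of SNPs, neither of which is reproduced here. The function names and the simulated frequencies are hypothetical.

```python
import numpy as np


def f3_statistic(target, source1, source2):
    """Naive f3(target; source1, source2): mean over SNPs of (c - a) * (c - b)."""
    c = np.asarray(target, dtype=float)
    a = np.asarray(source1, dtype=float)
    b = np.asarray(source2, dtype=float)
    return float(np.mean((c - a) * (c - b)))


def z_score(f3, standard_error):
    """The z column is the statistic divided by its standard error."""
    return f3 / standard_error


# Simulated allele frequencies for two source populations (toy data only).
rng = np.random.default_rng(0)
a = rng.uniform(0.05, 0.95, size=1000)
b = rng.uniform(0.05, 0.95, size=1000)

# A population that is an 80/20 mix of the two sources.
mixed = 0.8 * a + 0.2 * b
toy_f3 = f3_statistic(mixed, a, b)
print("toy f3:", toy_f3)  # negative, the classic signature of admixture

# Recompute z for the first row of the table (Turkey; Armenians, Yakut).
row_z = z_score(-0.00253688, 6.70852e-05)
print("z for first row:", round(row_z, 2))
```

A strongly negative f3 is evidence that the tested population is a mixture of groups related to the other two, so the ranking by z is a rough guide to which pairs of proxies work best, not a measure of ancestry fractions.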

* The Mughals were Central Asian Turks, while the Safavids were mostly Azeri Turks.

** Since the Turks who arrived in Anatolia had long sojourned in Turan and Iran it is important not to assume that their contribution is limited only to the East Asian component of ancestry.

## January 27, 2018

### Genetic distances across Eurasia

Filed under: Human Population Genetics,Population genetics — Razib Khan @ 10:00 pm

I feel that, for whatever reason, over the past few years many people on this weblog have started to exhibit weak intuitions about the magnitude of between-population differences. Two suggestions for why this might occur.

* First, the proliferation of PCA plots with individuals can make it hard to discern averages

* Second, model-based admixture plots don’t explicitly quantify the differences between the different clusters

To get a better sense of between-group differences I decided to take a step back and look at Fst. Fst basically looks at all the genetic variance across a set of groups and quantifies the proportion that can be attributed to differences between groups.

The plot at the top of this post is from an Fst matrix I generated with Plink (I wrote a script to do the pairwise comparison). I did some PCA pruning of the populations to be clear (e.g., with both Cambodians and Filipinos I made them more distinct than they would otherwise be). The goal was to give people a sense of genetic distances within regions and between them.
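As a rough illustration of what a pairwise Fst comparison involves, here is a minimal sketch that works directly on allele frequencies. It uses the textbook variance-partition form, Fst = (HT - HS) / HT, summed across SNPs, with the two populations weighted equally. This is not the sample-size-corrected estimator that Plink reports, so the numbers will not match Plink output. The population names, frequencies, and function names are all hypothetical.

```python
from itertools import combinations

import numpy as np


def wright_fst(p1, p2):
    """Ratio-of-sums Fst = (HT - HS) / HT for two equally weighted populations."""
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    p_bar = (p1 + p2) / 2.0
    # Expected heterozygosity if the two groups were pooled.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    # Average expected heterozygosity within each group.
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    denom = h_total.sum()
    if denom == 0:
        return 0.0
    return float((h_total.sum() - h_within.sum()) / denom)


def pairwise_fst(freqs):
    """Build a symmetric Fst matrix from {population: allele frequencies}."""
    names = sorted(freqs)
    matrix = np.zeros((len(names), len(names)))
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        matrix[i, j] = matrix[j, i] = wright_fst(freqs[a], freqs[b])
    return names, matrix


# Toy frequencies at three SNPs (made up for illustration).
toy = {
    "PopA": [0.10, 0.50, 0.80],
    "PopB": [0.12, 0.48, 0.79],
    "PopC": [0.60, 0.10, 0.30],
}
names, fst = pairwise_fst(toy)
print(names)
print(np.round(fst, 4))
```

In the toy data PopA and PopB have nearly identical frequencies, so their Fst is close to zero, while PopC sits much further from both.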

I also generated a PCA plot and a Treemix plot, for the sake of comparison.

It’s also useful to look at a few group comparisons and judge them in a global context.

| Population 1 | Population 2 | Fst |
|---|---|---|
| Tamil | Telugu | 0.0011 |
| Tamil | Tamil Scheduled Caste | 0.0016 |
| Tamil | Bangladeshi | 0.0024 |
| Tamil | South Indian Brahmin | 0.0031 |
| Tamil | Uttar Pradesh Brahmin | 0.0041 |
| Tamil | Sindhi | 0.0087 |
| Tamil | Vietnamese | 0.0668 |
| Southern Chinese | Northern Chinese | 0.0033 |
| Southern Chinese | Vietnamese | 0.0034 |
| Southern Chinese | Korea | 0.0045 |
| Southern Chinese | Japanese | 0.0087 |
| Southern Chinese | Tamil | 0.0711 |
| Southern Chinese | Polish | 0.1141 |
| Gujurati_Patel | Telugu | 0.0062 |
| Gujurati_Patel | Uttar Pradesh Brahmin | 0.0065 |
| Gujurati_Patel | Bangladeshi | 0.0069 |
| Gujurati_Patel | Velama | 0.0094 |
| Gujurati_Patel | Sindhi | 0.0104 |
| Gujurati_Patel | Polish | 0.0405 |
| Gujurati_Patel | Japanese | 0.0781 |
| GreatBritain | Ireland | 0.0015 |
| GreatBritain | Polish | 0.0043 |
| GreatBritain | Sicily | 0.0077 |
| GreatBritain | Uttar Pradesh Brahmin | 0.0264 |
| GreatBritain | Tamil | 0.0430 |
| GreatBritain | Korea | 0.1130 |

The non-Brahmin and non-Dalit samples in the 1000 Genomes are not partitioned much by geography. The Tamil vs. Telugu difference is smaller than that between the British and Irish. Within Tamil Nadu, though, Brahmins are nearly as different from typical Tamils as Poles are from the English (most of the British sample is English). The biggest differences in Europe are between Sicilians and Northern European groups, which are similar in degree to that between South Indians and Pakistanis. The South Chinese sample is nearly as close to Vietnamese as it is to a North Chinese group, while the difference between Koreans and Chinese is relatively small when compared to the variance you see in South Asia and Europe.

Note: Drift tends to inflate Fst.

## January 25, 2018

### How South Asian populations relate to each other

Filed under: Population genetics,South Asian Genetics — Razib Khan @ 12:45 am

Since people are asking me about this, and I’m running the South Asian Genotype Project, I thought I would post two non-PCA visualizations of how various South Asian groups relate to each other (along with a few outgroups).

The radial plot above is a neighbor-joining tree visualized from pairwise Fst statistics (basically a proxy for genetic distance).

I also used Treemix to generate a plot. You see patterns similar to the one above, though the two methods are different. Treemix tests a bunch of models and sees how the data fit those models. The visualization of Fst is just a way of representing the summary statistic.

I added 5 migration edges to the plot to the right. Not sure if they add anything, but you can see that some of the nodes move around because they are so mixed.

## January 9, 2018

### Island demes in an empty world

Filed under: Beringia,Population genetics — Razib Khan @ 9:58 pm

As you probably know a new ancient genome paper was published last week in Nature, Terminal Pleistocene Alaskan genome reveals first founding population of Native Americans. There is at least one other involving Willerslev in the works for what it’s worth. Carl Zimmer has a good write-up in The New York Times, while Greg Cochran picked up the fact that the latest results show no evidence of “Australo-Melanesian” affinities that have been found in Amazonians.

The key issue here is that they found 11,500 year old remains from Alaska, one of which they sequenced at 17x coverage, which is rather good (not medical grade good, but really sufficient for a lot of population genomic work). It’s clear that the lineage represented by these remains is “basal” to that of other Native American peoples, whom David Reich’s group labeled “First Americans.” Later, the First Americans diverged into different populations, with the two in modern focus being a northern cluster, including the Aboriginal peoples of Canada and parts of the United States, and a southern one including everyone else. This does not mean that the Beringians were isolated outliers. There may have been many other peoples related to the Beringians who diversified, who went extinct as well. The settlement of Alaska by other peoples suggests to me that extreme conditions in the Arctic made it likely that there would be population turnover there. Also, the fact that these samples were located close to the source of settlement in the New World by modern humans makes their distant relation to all other New World populations unsurprising.

The big thing that the press is highlighting is the confirmation of the Beringian Standstill model, where modern humans percolated into the area between Siberia and Alaska, Beringia, and did not move east for thousands of years. Basically, the conditions were inclement toward human habitation on both sides of Beringia, while a relict modern human group likely occupied a pocket of more moderate climes for thousands of years, with minimal gene flow from the west, and blocked from migration to the east. Genetically the Beringia Standstill made sense for a long time…the divergence between Amerindian lineages and those of eastern Eurasia seemed too old to be accounted for by recent migration a bit more than 10,000 years ago (the old “Clovis first” hypothesis).

How old? This paper suggests that the portion of Native American ancestry which indicates an affinity to East Asians stopped exhibiting gene flow from that source around ~25,000 years ago, after diverging around ~36,000 years ago. This points to the fact that after modern humans came to dominate eastern Eurasia they began to diversify rapidly after 40,000 years ago, but gene flow between different populations did not always allow them to drift apart…at least initially. The ancestors of Native Americans and East Asians may have been in extremely separate locations by ~25,000 years ago, whether it be on the fringes of eastern Siberia, or somewhere in southern China (there is no reason that the modern Chinese have to have had ancestors resident on the North China plain before the Last Glacial Maximum).*

One aspect here I want to emphasize is that our image of a world thickly populated with humans may mislead us in our intuition about how patchy occupation was ~25,000 years ago. Yes, humans may have left artifacts all over the world, but that doesn’t mean that there weren’t centuries or millennia of no occupation, or, that meta-population dynamics were such that extinctions were common. For decades in population genetics there has been talk of “clines vs. clusters,” but if human population densities were far lower, or occupation patchier, then clines may have become much more important recently with high density than in the past.

Finally, back to the Australo-Melanesian issue. Either there is a lot of population structure in ancient Beringia to be explored, with diverse quasi-Asiatic groups, or there was an Australo-Melanesian group already in South America.

* Ancient North Eurasian ancestry came into Beringians ~20,000 years ago. Two groups which merged during the middle of the Last Glacial Maximum.

## December 20, 2017

### Natural selection in humans (OK, 375,000 British people)

Filed under: Natural Selection,Population genetics,Population genomics,Selection — Razib Khan @ 10:41 pm

The above figure is from Evidence of directional and stabilizing selection in contemporary humans. I’ll be entirely honest with you: I don’t read every UK Biobank paper, but I do read those where Peter Visscher is a co-author. It’s in PNAS, and a draft which is not open access. But it’s a pretty interesting read. Nothing too revolutionary, but confirms some intuitions one might have.

The abstract:

Modern molecular genetic datasets, primarily collected to study the biology of human health and disease, can be used to directly measure the action of natural selection and reveal important features of contemporary human evolution. Here we leverage the UK Biobank data to test for the presence of linear and nonlinear natural selection in a contemporary population of the United Kingdom. We obtain phenotypic and genetic evidence consistent with the action of linear/directional selection. Phenotypic evidence suggests that stabilizing selection, which acts to reduce variance in the population without necessarily modifying the population mean, is widespread and relatively weak in comparison with estimates from other species.

The stabilizing selection part is probably the most interesting part for me. But let’s hold up for a moment, and review some of the major findings. The authors focused on ~375,000 individuals who matched their sample criteria (white British individuals old enough that they are well past their reproductive peak), and the genotyping platforms had 500,000 markers. The dependent variable they’re focusing on is reproductive fitness. In this case specifically, “rLRS”, or relative lifetime reproductive success.

With these huge data sets and the large number of measured phenotypes they first used the classical Lande and Arnold method, which leveraged regression to measure directional and stabilizing selection. Basically, how does change in the phenotype impact reproductive fitness? So, it is notable that shorter women have higher reproductive fitness than taller women (shorter than the median). This seems like a robust result.
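A minimal sketch of the Lande and Arnold regression idea, on made-up data, may help here. Relative fitness w (offspring count divided by the population mean) is regressed on the standardized trait z. The slope from the linear-only fit estimates the directional gradient β, and the coefficient on ½z² from the quadratic fit estimates the nonlinear gradient γ, where a negative γ points to stabilizing selection. The function name and toy data are hypothetical, and none of the covariate adjustments used in the paper are included.

```python
import numpy as np


def selection_gradients(trait, offspring_count):
    """Estimate directional (beta) and quadratic (gamma) selection gradients."""
    trait = np.asarray(trait, dtype=float)
    fitness = np.asarray(offspring_count, dtype=float)

    z = (trait - trait.mean()) / trait.std()  # standardized trait
    w = fitness / fitness.mean()              # relative fitness, mean 1

    # Directional gradient: slope of w on z from the linear-only model.
    x_linear = np.column_stack([np.ones_like(z), z])
    beta = np.linalg.lstsq(x_linear, w, rcond=None)[0][1]

    # Quadratic gradient: coefficient on 0.5 * z**2 from the quadratic model.
    x_quad = np.column_stack([np.ones_like(z), z, 0.5 * z**2])
    gamma = np.linalg.lstsq(x_quad, w, rcond=None)[0][2]

    return float(beta), float(gamma)


# Toy data: fitness rises slightly with the trait but falls off at both extremes.
heights = np.linspace(150.0, 190.0, 41)
z_toy = (heights - heights.mean()) / heights.std()
offspring = 2.0 + 0.2 * z_toy - 0.1 * z_toy**2

beta, gamma = selection_gradients(heights, offspring)
print(f"beta = {beta:.3f}, gamma = {gamma:.3f}")
```

The ½ in front of z² follows the usual convention, so that γ can be read directly as the curvature of the fitness surface rather than half of it.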

The results using phenotypic correlations for directional (β) and stabilizing (γ) selection are shown below. The abbreviations are the same as above.

There are many cases where directional selection seems to operate in females, but not in males. But they note that that is often due to near zero non-significant results in males, not because there were opposing directions in selection. Height was the exception, with regression coefficients in opposite directions. For stabilizing selection there was no antagonistic trait.

A major finding was that compared to other organisms stabilizing selection was very weak in humans. There’s just not that much pressure against extreme phenotypes. This isn’t entirely surprising. First, you have the issue of the weirdness of a lot of studies in animal models, with inbred lines, or wild populations selected for their salience. Second, prior theory suggests that a trait with lots of heritable quantitative variation, like height, shouldn’t be subject to that much selection. If it had been, the genetic variation which was the raw material of the trait’s distribution wouldn’t be there.

Using more complex regression methods that take into account confounds, they pruned the list of significant hits. But, it is important to note that even at ~375,000, this sample size might be underpowered to detect really subtle dynamics. Additionally, the beauty of this study is that it added modern genomic analysis to the mix. Detecting selection through phenotypic analysis goes back decades, but interrogating the genetic basis of complex traits and their evolutionary dynamics is new.

To a first approximation, the results were broadly consonant across the two methods. But, there are interesting details where they differ. There is selection on height in females, but not in males. This implies that though empirically you see taller males with higher rLRS, the genetic variance that is affecting height isn’t correlated with rLRS, so selection isn’t occurring.

~375,000 may seem like a lot, but from talking to people who work in polygenic selection there is still statistical power to be gained by going into the millions (perhaps tens of millions?). These sorts of results are very preliminary but show the power of synthesizing classical quantitative genetic models and ways of thinking with modern genomics. And, it does have me wondering about how these methods will align with the sort of stuff I wrote about last year which detects recent selection on time depths of a few thousand years. The SDS method for example seems to be detecting selection for increasing height the world over…which I wonder is some artifact, because there’s a robust pattern of shorter women having higher fertility in studies going back decades.

## December 14, 2017

### A genetic map of the world

Filed under: Population genetics — Razib Khan @ 4:46 pm

The above map is from a new preprint on the patterns of genetic variation as a function of geography for humans, Genetic landscapes reveal how human genetic diversity aligns with geography. The authors assemble an incredibly large dataset to generate these figures. The orange zones are “troughs” of gene flow. Basically barriers to gene flow. It is no great surprise that so many of the barriers correlate with rivers, mountains, and deserts. But the aim of this sort of work seems to be to make precise and quantitative intuitions which are normally expressed verbally.

To me, it is curious how the borders of the People’s Republic of China are evident on this map (an artifact of sampling?). Additionally, one can see Weber’s line in Indonesia. There are the usual important caveats of sampling, and caution about interpreting present variation and dynamics back to the past. But I believe that these sorts of models and visualizations are important nulls against which we can judge perturbations.

As I said, these methods can confirm rigorously what is already clear intuitively. For example:

Several large-scale corridors are inferred that represent long-range genetic similarity, for example: India is connected by two corridors to Europe (a southern one through Anatolia and Persia ‘SC’, and a northern one through the Eurasian Steppe ‘NC’)

We still don’t have enough ancient DNA to be totally sure, but it’s hard to ignore the likelihood that “Ancestral North Indians” (ANI) actually represent two different migrations.

India also illustrates contingency of these barriers. Before the ANI migration, driven by the rise in agricultural lifestyles, there would likely have been a major trough of gene flow on India’s western border. In fact a deeper one than the one on the eastern border. And if the high genetic structure statistics from ancient DNA are further confirmed then the rate of gene flow was possibly much lower between demes in the past. Perhaps that would simply re-standardize equally so that the map itself would not be changed, but I suspect that we’d see many more “troughs” during the Pleistocene and early Holocene.

Because there are so many geographically distributed samples for humans, and frankly some of the best methods developers work with human data (thank you NIH), it is no surprise that our species would be mapped first. But I think some of the biggest insights may be with understanding the dynamics of gene flow of non-human species, and perhaps the nature and origin of speciation as it relates to isolation (or lack thereof).

## December 10, 2017

### Visualizing intra-European phylogenetic distances

Filed under: Europe,European genetics,Population genetics,Population genomics — Razib Khan @ 4:53 pm

In L. L. Cavalli-Sforza’s The History and Geography of Human Genes he used between population group genetic distances, as measured in FST values, to generate a series of visualizations, which then allowed him to infer historical processes. Basically the way it works is that you look at genetic variation, and see how much of it can be allocated to between groups. If none of it can be allocated to between groups, then in a population genetic sense it doesn’t make much sense to speak of distinctive groups, they’re basically one breeding population. The higher the FST statistic is, the more of the variation is partitioned between the groups.

Roughly this is used to correlate with genetic distance as well as evolutionary divergence. The longer two populations have been separated, the more and more genetic differences they’ll accumulate, inflating the FST value. There are a lot of subtleties that I’m eliding here (see Estimating and interpreting FST: the impact of rare variants for a survey of the recent literature on the topic and pathways forward), but for a long time, FST was the go-to statistic for making phylogenetic inferences on a within-species scale.
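To make the partitioning idea concrete, here is a minimal sketch of Wright's FST at a single biallelic locus for two populations. It assumes equal population sizes and known allele frequencies, and the function names are my own for illustration. Real estimators (Weir and Cockerham, Hudson) also correct for sample size, so treat this as the textbook definition rather than what any particular package computes.

```python
def heterozygosity(p):
    """Expected heterozygosity 2p(1-p) at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_two_pops(p1, p2):
    """Wright's FST = (H_T - H_S) / H_T for two equally weighted populations.

    H_T is the heterozygosity of the pooled population and H_S is the mean
    within-population heterozygosity. This is an idealized definition with
    no sample-size correction.
    """
    p_bar = (p1 + p2) / 2.0
    h_t = heterozygosity(p_bar)
    h_s = (heterozygosity(p1) + heterozygosity(p2)) / 2.0
    if h_t == 0.0:
        # Locus is monomorphic overall, so there is no variation to partition.
        return 0.0
    return (h_t - h_s) / h_t


if __name__ == "__main__":
    # Identical frequencies: none of the variation is between groups.
    print(fst_two_pops(0.5, 0.5))
    # Modest divergence, the sort of low value typical within Europe.
    print(fst_two_pops(0.4, 0.6))
    # Fixed differences: all of the variation is between groups.
    print(fst_two_pops(0.0, 1.0))
```

Identical frequencies give an FST of zero (one breeding population in the sense above), while fixed differences give one. Anything in between is the share of variation that sits between the groups.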

Today we have other techniques, Structure, Treemix, fineStructure, and various local ancestry packages.

But FST is still useful to give one a Gestalt sense of population genetic differences. Cavalli-Sforza admits in The History and Geography of Human Genes that European populations had very low pairwise FST, but because of the importance of Europe for sociocultural reasons a detailed analysis of the region was still provided in the text. Additionally, they had lots of European samples (non-European Caucasoids were thrown into one category for macro-group comparisons because there weren’t that many samples).

Using results from the 2015 paper Massive migration from the steppe was a source for Indo-European languages in Europe, I visualized pairwise genetic distances for European populations, ancient and modern (Han Chinese as an outgroup), on a tree. What the results illustrate is that:

1. Ancient populations were very distinct in Europe from modern ones.
2. Many modern groups are clustered close together.

The bulk of the population genetic structure in modern Europe seems to have been established in the period between 3000 BCE and 2000 BCE. This is not that much time for a lot of distinctiveness to develop, especially on the geographically open North European plain. I suspect with more and more Mesolithic and early to middle Neolithic DNA we’ll see that some of the modern population structure is a ghost of ancient substrate absorption.

Many of the ethno-national categories that are very significant in recent history, and impact the cultural memories of modern people and their genealogies, have very shallow roots. This does not mean they are not “real” (I don’t know what that’s supposed to mean at all), just that many of the identities which seem so salient to us today may be relatively recent in terms of their significance to large groups of humans….

## December 8, 2017

### The Saxon Panmixia

Filed under: Population genetics — Razib Khan @ 9:09 pm

One reason I quite like Norman Davies’ book The Isles is that it is a history of Britain and Ireland which explicitly aims to not privilege the story of the English inordinately. As the most powerful and numerous people of the British Isles the English loom large, but in the period between Gildas and Bede things were very different. In the early 600s the Welsh king Cadwallon ap Cadfan conquered and held Northumbria for a period, northern England from the Irish Sea to the North Sea. But this was the last time that a Celtic monarch held land in eastern England, unless you count the Tudors.

In The Isles, written at the turn of the century, Davies promotes the view dominant among historians at that time that the transition from British Celtic to Anglo-Saxon occurred through diffusion of elite culture. He points out that around the year 700 the law code of Wessex stated explicitly that the weregild paid for the death of a Saxon was many-fold greater than that paid for a Briton (of the same class status). This suggests that many Britons were still resident in the Anglo-Saxon kingdoms. The contrasting view, which was dominant in the early 20th century, was that the English replaced the Celts in toto. The Irish, Welsh, and to some extent the Scots, were viewed as racially distinct from the Germanic English.

2015’s The fine scale genetic structure of the British population answered many of these questions. It turns out the maximal positions were incorrect. The authors estimate that 10-40% of the ancestry in eastern and southern England (the red positions on the map) derives from Germanic peoples which we might term Saxons, Angles, and Jutes. Even if the fraction is as low as 10% that is not trivial. If we take a value closer to ~25%, unless there were massive reproductive advantages for elites, it could not have just been diffusion from the elite. Archaeologists also see wholesale changes in agricultural patterns in eastern England, indicative of a transfer of a whole folkway.

All that being said it is likely that the majority of the ancestry of the population of England proper descends from Britons. In fact, once the Anglo-Saxon cultural hegemony was established it seems that some elite Britons may also have changed their identity. It is always a curious fact that the names of the first kings in the genealogy of the House of Wessex are distinctively Celtic. Just as Romano-Gallic aristocrats began aping the styles and mores of the Frankish elite in the 6th century, so perhaps some British warlords became Saxons.

Using similar methods many of the same authors have now put out a preprint on Ireland, Insular Celtic population structure and genomic footprints of migration. Unlike the earlier work on Britain, they’ve acknowledged the ancient DNA work which has reshaped our understanding of population turnover in Ireland. That being said, they are focused on more recent events, as well as spatial structure in the modern era.

Though they don’t have access to as detailed a regional data set as in the earlier work on Britain, in this case, the authors managed to detect a lot of regional population structure within Ireland. Why? Though the Irish are relatively homogeneous, as all Northern Europeans are, looking at long tracts of the genome and the patterns therein can squeeze out more information.

The figure at the top of this post shows how well they can cluster individuals geographically: they’ve basically recapitulated the “map of the British Isles.” There aren’t too many surprises. Western Ireland seems to exhibit greater genetic differences as a function of distance. Probably because it’s less developed, and perhaps because it has been less impacted by outsiders. Ulster and southern Scotland are strongly connected genetically. There are two issues going on here. First, the famous migration of Protestants into this region of Ireland from Scotland and northern England that occurred after the conquest of the 16th century. And second, the earlier migration of Irish to Scotland, which resulted in the creation of the Dal Riata kingdom.

Additionally, the authors detect more admixture in several parts of Ireland from Norse than they had anticipated. The mixing of Scandinavians and Irish created a hybrid culture, the Norse-Gaels, which was highly influential around the Irish Sea. So it would not be exactly surprising if there was a greater Scandinavian contribution to Irish ancestry than had been anticipated.

Of greater interest to me is the impact of social-political institutions on the genetic structure or lack thereof. Both Britain and Ireland have homogenized modal clusters. In Britain, this is associated with the expanding cultural zone of Anglo-Saxon rule, and later became the core of England. In Ireland, it seems to be the Pale, where Anglo-Norman rule was dominant for many centuries. Rapid cultural change seems to induce a state of panmixia. Genetic distinctiveness in the British Isles seems to have persisted in populations which were geographically isolated, or politically insulated, from expansive, assimilative, and integrative cultures. The modal cluster in Ireland is far smaller than in England, which nicely correlates with the much more limited impact of the Anglo-Norman ascendency of the medieval period.

## November 24, 2017

### Soft selection for gentleness in Puerto Rican African Honeybees

Filed under: Population genetics,Population genomics,Soft Selection,Soft Sweep — Razib Khan @ 3:07 pm

When I was a kid “killer bees” were a major pop culture thing. There were movies about the bees, and we would get updates about their march northward in the news. They were a cautionary tale of our species’ hubris.

Today we have a little bit more perspective. These bees were actually just African honeybees, the population ancestral to European honeybees. European honeybees were introduced to the New World by Europeans centuries before the African honeybees arrived. African honeybees were not that different from European honeybees, but they were more aggressive and tended to outcompete European honeybee colonies. They are a major problem for the beekeeping industry, but not a major threat to human life.

Today the African and European populations in the United States seem to have stabilized in their ranges, with a hybrid zone between them. African bees’ migratory behavior makes them less competitive with European bees in colder climates.

A friend of mine once mentioned to me that if he had to do it all over again he would do research on the evolutionary genomics of Hymenoptera, and in particular bees. People care about bees. So it’s no surprise that I noticed this paper out in Nature Communications, A soft selective sweep during rapid evolution of gentle behavior in an Africanized honeybee:

Highly aggressive Africanized honeybees (AHB) invaded Puerto Rico (PR) in 1994, displacing gentle European honeybees (EHB) in many locations. Gentle AHB (gAHB), unknown anywhere else in the world, subsequently evolved on the island within a few generations. Here we sequence whole genomes from gAHB and EHB populations, as well as a North American AHB population, a likely source of the founder AHB on PR. We show that gAHB retains high levels of genetic diversity after evolution of gentle behaviour, despite selection on standing variation. We observe multiple genomic loci with significant signatures of selection. Rapid evolution during colonization of novel habitats can generate major changes to characteristics such as morphological or colouration traits, usually controlled by one or more major genetic loci. Here we describe a soft selective sweep, acting at multiple loci across the genome, that occurred during, and may have mediated, the rapid evolution of a behavioural trait.

Come for the bees, but stay for the soft selection! If you talk to anyone in evolutionary and population genomics you know that the future is in understanding patterns of soft selection and polygenic selection from standing variation. Though these are related phenomena which are associated with each other, they are all distinct.

Standing variation just refers to the diversity which is segregating in the population at any given time. At any given moment many loci exhibit polymorphism. This polymorphism can be a target of natural selection if it is correlated with heritable variation and differentials in fitness. Though soft selection can be quite woolly, its inverse, hard selection, is clear: in genetic terms hard selection can be seen in allele frequency changes at a single variant in a locus, going from the point where it is a novel mutation to nearly fixed in the population. In Haldane’s original conception hard selection involved excess deaths, and imposed a limit on the rate of evolution as well as the amount of variation you could expect within a given population. This model was convenient in the pre-genomic and early genomic era because empirical selection tests had to focus on large allele frequency changes around singular loci. Researchers didn’t have large numbers of whole-genome samples available (nor the computational ability to analyze them).

Today this is not a limitation. In the analysis above the authors had 30 individuals of the 3 populations sequenced at high quality (20x). They ended up with millions of genetic variants they could analyze.

The plot to the left shows that “gentle African honeybees” (gAHB) tend to be closer to the African honeybee populations (AHB) overall (though with some hybridization with European honeybees, EHB). This is not surprising.

But the key observation was that over 12 generations the African honeybees of Puerto Rico became progressively less aggressive, despite maintaining overall morphological similarities to the mainland Mexican African bees from which they likely derive. Though buried in the discussion, there is a rationale for why this behavioral change may have occurred: the Puerto Rican bees are subject to a lot of negative selection against aggression because of the density of the island, as well as the reality that aside from humans there aren’t many other species against which their aggressive tendencies are beneficial. Basically, if you are an aggressive colony, it’s harder to make a go in densely settled areas (the implication here then is that there are probably “gentle” African honeybee populations across Latin America, they just are never disaggregated from the broader meta-population).

It’s the genomics where the real evolutionary insight comes in: they found that there were multiple soft sweep events around genetic regions implicated in behavior. In their overall genome the gAHB of Puerto Rico resembled mainland AHB, but in this subset of genetic loci they resembled EHB. Many of these loci had also been known to be targets of selection when the original European bee population diverged from the ancestral African population. Basically this is a genomic illustration of convergent evolution.

Regular readers of this blog will recognize the ways they detected selection. They used a modified form of EHH, which is reasonable since the selection event was recent enough to have been associated with distinct haplotype blocks. Also, standard Fst analysis showed that these were outliers in relation to the broader genetic pattern of relatedness (these loci were more like EHB than AHB, while most loci were more like AHB than EHB).
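The Fst logic in that last sentence can be sketched in a few lines. This is not the authors' pipeline: the locus names and allele frequencies below are made up for illustration, and a real scan would use sample-size-corrected estimators in genomic windows and compare each window against the genome-wide distribution. The sketch only shows the core comparison, which is flagging loci where gAHB sits closer to EHB than to AHB.

```python
def fst(p1, p2):
    """Idealized two-population FST at one biallelic locus (no sample-size correction)."""
    p_bar = (p1 + p2) / 2.0
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    if h_t == 0.0:
        return 0.0
    h_s = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_t - h_s) / h_t


# Hypothetical allele frequencies per locus, ordered (gAHB, AHB, EHB).
# These numbers are invented purely to illustrate the comparison.
loci = {
    "locus_A": (0.80, 0.82, 0.20),
    "locus_B": (0.30, 0.28, 0.75),
    "locus_C": (0.25, 0.85, 0.22),  # here gAHB looks like EHB, not AHB
    "locus_D": (0.60, 0.63, 0.10),
}


def ehb_like_loci(freqs):
    """Return names of loci where gAHB is less differentiated from EHB than from AHB."""
    flagged = []
    for name, (gahb, ahb, ehb) in freqs.items():
        if fst(gahb, ehb) < fst(gahb, ahb):
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    print(ehb_like_loci(loci))
```

With these toy numbers three of the four loci follow the genome-wide pattern (gAHB close to AHB), and only `locus_C` is flagged as going the other way, which is what a candidate selected region would look like in this framing.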

So this is a form of polygenic selection. Remember, natural selection only knows genes through the phenotype (with intra-genomic selection being an exception). A behavior like aggression is probably subject to the fourth law of behavior genetics. That is, variation won’t be defined around a single genetic locus. Rather, variation across the genome will be correlated with variation in the phenotype. As selection favors a particular value of the phenotype across the distribution the allele frequencies across many genetic loci will shift, but they will not necessarily fix. Polygenic selection operates on the dispersed standing genetic variation which explains much of the variation of the phenotype in question. Instead of total sweeps to fixation due to large fitness differences between a given allele and its alternative form, the selection impact is distributed and diffused across the genome.
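A bit of arithmetic shows why small shifts at many loci are enough. The sketch below assumes a purely additive trait with 1,000 loci of equal effect, all starting at a frequency of 0.5. The number of loci, the effect size, and the frequencies are hypothetical values chosen for illustration, not estimates from the honeybee paper.

```python
import math


def trait_mean(freqs, effects):
    """Mean additive genetic value: sum over loci of 2 * p_i * a_i."""
    return sum(2.0 * p * a for p, a in zip(freqs, effects))


def genetic_sd(freqs, effects):
    """Additive genetic standard deviation: sqrt(sum of 2 * p_i * (1 - p_i) * a_i^2)."""
    return math.sqrt(sum(2.0 * p * (1.0 - p) * a * a for p, a in zip(freqs, effects)))


n_loci = 1000                      # hypothetical number of contributing loci
effects = [0.1] * n_loci           # hypothetical per-allele effect, arbitrary units
before = [0.50] * n_loci           # starting allele frequencies
after = [0.52] * n_loci            # each locus shifts by only two percentage points

shift = trait_mean(after, effects) - trait_mean(before, effects)
shift_in_sd = shift / genetic_sd(before, effects)

if __name__ == "__main__":
    print(f"mean shift: {shift:.2f} trait units")
    print(f"shift in genetic standard deviations: {shift_in_sd:.2f}")
```

Under these assumptions no locus comes anywhere near fixation, yet the population mean moves by roughly 1.8 genetic standard deviations. That is the sense in which the selection is distributed across the genome rather than concentrated in a sweep.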

Though most of the genetic variants seem to recapitulate the evolution of the less aggressive phenotype that occurred with the original migration north of African honeybees, some of the selection signatures were novel. This points to the reality that when you have soft selection on standing variation you may have similar phenotypes which evolve via different means. Additionally, the authors noted that these results were in contrast to controlled breeding experiments in mammals where selection for gentility (“domestication”) often targeted a few loci and exhibited strong pleiotropic effects (due to the genetic correlation). These results point to the limitations of inferences made from human-directed selection.

Soft selection is probably ubiquitous. Consider the evolution of skin color in humans. There are lots of variants and lots of variation, and most of the variation seems to be ancestral. Only at the locus SLC24A5 do you have a perfect illustration of a hard selective sweep, probably from a de novo mutation that emerged around the Last Glacial Maximum.

From a geneticist’s perspective evolution is basically conceived of as changes in allele frequencies over time. Much of this is due to natural selection. Now that the world of soft selection is opening up, I suspect that we’ll understand a lot more of what we see around us, at least in the generality.
