Razib Khan One-stop-shopping for all of my content

February 18, 2019

Not happening at genomic speed: diversification of GWAS panels

Filed under: Medical Genetics,Polygenic Risk Score,Quantitative Genetics — Razib Khan @ 12:02 am
  One of the things that is evident and the norm when you are interested in genetics and genomics is that things happen fast. There are some sciences which proceed at a normal and conventional pace. But, because genomics is fundamentally driven by the synergy of two technologies, modern automated sequencing, and computation, the field […]

May 30, 2018

Soft & hard selection vs. soft & hard sweeps

Filed under: Population genetics,Quantitative Genetics — Razib Khan @ 12:09 am

When I was talking to Matt Hahn I made a pretty stupid semantic flub, confusing “soft selection” with “soft sweeps.” Matt pointed out that soft/hard selection were terms more appropriate to quantitative genetics than to population genomics. His viewpoint is defensible, though going back into the literature on soft/hard selection, e.g., Soft and hard selection revisited, the main thinkers pushing the idea were population geneticists who were also considering ecological questions.*

The strange thing is that I had already known the definitions of hard and soft selection on some level because I had read about them as I was getting confused with hard and soft sweeps! But this was more than ten years ago now, and since then I haven’t given the matter enough thought obviously, as I defaulted back to confusing the two classes of terms, just as I used to.

Matt pointed out that truncation selection is a form of hard selection. All individuals below (or above) a certain phenotype value have a fitness of zero, as they don’t reproduce. In a single locus context, hard selection would involve deleterious lethal alleles, whose impact on the genotype was the same irrespective of ecological context. So hard selection operates by reducing the fitness of individuals/genotypes to zero.

For soft selection, context matters much more, and you would focus more on relative fitness differences across individuals/genotypes. Some definitions of soft vs. hard selection emphasize that in the former case fitness is defined relative to the local ecological patch, while the latter is a universal estimate. Soft selection does not necessarily operate through the zero fitness value for a genotype, but rather differential fitness. Hard selection can crash your population size. Soft selection does not necessarily do that.
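A minimal sketch of the contrast in Python may help. Everything in it is an assumption made for illustration: a single patch, three genotypes with made-up counts and viabilities, and a fixed patch capacity for the soft case. It is not a model taken from the literature mentioned above.

```python
def hard_selection(counts, viability):
    """Each genotype survives at its absolute viability, whatever the context."""
    return {g: counts[g] * viability[g] for g in counts}


def soft_selection(counts, viability, capacity):
    """The patch always yields `capacity` adults; viability only sets
    each genotype's share of those slots (relative fitness)."""
    weights = {g: counts[g] * viability[g] for g in counts}
    total = sum(weights.values())
    return {g: capacity * w / total for g, w in weights.items()}


# Hypothetical numbers, for illustration only.
counts = {"AA": 250, "Aa": 500, "aa": 250}
viability = {"AA": 1.0, "Aa": 1.0, "aa": 0.5}

hard = hard_selection(counts, viability)
soft = soft_selection(counts, viability, capacity=1000)

print(sum(hard.values()))   # total adults after hard selection
print(sum(soft.values()))   # total adults after soft selection
```

With these numbers the hard case loses adults outright, while the soft case keeps the patch at capacity and only shifts genotype proportions, which is the sense in which hard selection can shrink a population and soft selection need not.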

Though I won’t outline the details, one of the originators of the soft/hard selection concept analogized them to density-dependent/independent dynamics in ecology. If you know the ecological models, the correspondence probably is obvious to you.

As for hard and soft sweeps, these are particular terms of relevance to genomics, because genome-wide data has allowed for their detection through the impact they have on the variation in the genome. A “sweep” is a strong selective event that tends to sweep away variation around the focus of selection. A hard sweep begins with a single mutant, and positive selection tends to drive it toward fixation.

A classical example is lactase persistence in Northern Europeans and Northwest South Asians (e.g., Punjabis). The mutation in the LCT gene is the same across a huge swath of Eurasia. And the region of the genome around the mutation is also the same, because regions of the genome adjacent to that single mutation increased in frequency as well (they “hitchhiked”). This produces a genetic block of highly reduced diversity, since the hard selective sweep increases the frequency of so many variants which are associated with the advantageous one, and may drive most other competing variants to extinction.

Someone is free to correct me in the comments, but it strikes me that many hard selective sweeps are driven by soft selection. Fitness differentials between those with the advantageous alleles and those without them are not so extreme, and obviously context dependent, even in cases of hard sweeps on a single locus.

The key to understanding soft sweeps is that there isn’t a focus on a singular mutation. Rather, selection can target multiple mutations, which may have the same genetic position, but be embedded within different original gene copies. In fact, soft sweeps often operate on standing variation, preexistent alleles which were segregating in the population at low frequencies or were totally neutral. Genetic signatures of these events are less striking than those for hard sweeps because there is far less diminishment of diversity, since it’s not the increase in the frequency of a singular mutation and the hitchhiking of its associated flanking genomic region.

Soft sweeps can clearly occur with soft selection. But truncation selection can occur on polygenic traits, so depending on the architecture of the trait (i.e., effect size distribution across the loci) one can imagine them associated with hard selection as well.

Going back to the conversation I had with Matt, the reason semantics is important is that terms in population genetics are informationally rich, and lead you down a rabbit-hole of inferences. If population genetics is a toolkit for decomposing reality, then you need to have your tools well categorized and organized. On occasion it is important to rectify the names.

* There are two somewhat related definitions of soft/hard selection. I’ll follow Wallace’s original line here, though I’m not sure they differ that much.

August 19, 2012

Why northern Europeans are taller than southern Europeans?

Filed under: Height,Quantitative Genetics — Razib Khan @ 10:24 pm

In part, genes. Luke Jostins reported this from a conference last year, so not too surprising. Evidence of widespread selection on standing variation in Europe at height-associated SNPs. Let me jump to the summary:

In summary, we have provided an empirical example of widespread weak selection on standing variation. We observed genetic differences using multiple populations from across Europe, thereby showing that the adult height differences across populations of European descent are not due entirely to environmental differences but rather are, at least partly, genetic differences arising from selection. Height differences across populations of non-European ancestries may also be genetic in origin, but potential nongenetic factors, such as differences in timing of secular trends, mean that this inference would need to be directly tested with genetic data in additional populations. By aggregating evidence of directionally consistent intra-European frequency differences over many individual height-increasing alleles, none of which has a clear signal of selection on its own, we observed a combined signature of widespread weak selection. However, we were not able to determine whether this differential weak selection (either positive or negative) favored increased height in Northern Europe, decreased height in Southern Europe or both. One possibility is that sexual ...

January 29, 2012

Most people don’t understand “heritability”

Filed under: Heritability,Quantitative Genetics — Razib Khan @ 1:08 pm

According to the reader survey 88 percent said they understood what heritability was. But only 34 percent understood the concept of additive genetic variance. For the purposes of this weblog it highlights that most people don’t understand heritability in the technical sense, but rather “heritability” in the colloquial sense. The former is the definition which I use on this weblog; the latter is a synonym for inheritance, biological and cultural. Almost everyone who understands the technical definition of heritability will know what heritability in the ‘narrow sense’ is, often just informally termed heritability itself. It is the proportion of phenotype variability that can be attributed to additive genetic variation. Those who understand additive genetic variance and heritability in the survey were 32 percent of readers. If you understand heritability in the technical manner you have to understand additive genetic variance. This sets the floor for the number who truly understand the concept in the way I use on this weblog (I suspect some people who were exceedingly modest who basically understand the concept for ‘government purposes’ put themselves in the ‘maybe’ category). After nearly 10 years of blogging (the first year or so of which I myself wasn’t totally clear on the issue!) that’s actually a pretty impressive proportion. You take what you can get.

December 31, 2011

Are genes the key to the Yankee Empire?

That’s the question a commenter poses, albeit with skepticism. First, the background here. New England was a peculiar society for various demographic reasons. In the early 17th century there was a mass migration of Puritan Protestants from England to the colonies which later became New England because of their religious dissent from the manner in which the Stuart kings were changing the nature of the British Protestant church.* Famously, these colonies were themselves not aiming to allow for the flourishing of religious pluralism, with the exception of Rhode Island. New England maintained established state churches longer than other regions of the nation, down into the early decades of the 19th century.

Between 1630 and 1640 about 20,000 English arrived on the northeastern fringe of British settlement in North America. With the rise of co-religionists to power in the mid-17th century a minority of these emigres engaged in reverse-migration. After the mid-17th century migration by and large ceased. Unlike the Southern colonies these settlements did not have the same opportunities for frontiersmen across a broad and ecologically diverse hinterland, and their cultural mores were decidedly more constrained than the cosmopolitan Middle Atlantic. The growth in population in New England from the low tens of thousands to close to 1 million in the late 18th century was one of endogenous natural increase from the founding stock.

This high fertility regime persisted down into the middle of the 19th century, as the core New England region hit its Malthusian limit, and flooded over into upstate New York, to the irritation of the older Dutch population in that region. Eventually even New York was not enough, and New England swept out across much of the Old Northwest. The last became the “Yankee Empire,” founded by Yankees, but later demographically supplemented and superseded in its western reaches by immigrants from northwest Europe who shared many of the same biases toward order and moral probity which were the hallmarks of Yankees in the early Republic.

While the Yankees were waxing in numbers, and arguably cultural influence, the first decades of the American Republic also saw the waning of New England power and influence in relation to the South in the domain of politics. This led even to the aborted movement to secede from the union by the New England states in the first decade of the century. By the time of Andrew Jackson an ascendant Democrat configuration which aligned Southern uplanders and lowlanders with elements of the Middle Atlantic resistant to Yankee cultural pretension and demographic expansion would coalesce and dominate American politics down to the Civil War. It is illustrative that one of the prominent Northern figures in this alliance, President Martin Van Buren, was of Dutch New York background.

But this is a case where demographics was ultimate destiny. Not only were the Yankees fecund, but immigrants such as the German liberals fleeing the failures of the tumult of 1848 (e.g., Carl Schurz) were aligned with their anti-slavery enthusiasms (though they often took umbrage at the anti-alcohol stance of the Puritan moralists of the age, familiarizing the nation with beer in the 1840s). The Southern political ascendancy was simply not tenable in the face of Northern demographic robustness, fueled by both fertility and immigration. Because of overreach on the part of the Southern elite the segments of the Northern coalition which were opposed to the Yankees eventually fractured (Martin Van Buren allowed himself to be candidate for the anti-slavery Free Soil party at one point). Though there remained Northern Democrats down to the Civil War, often drawn from the “butternuts” whose ultimate origins were in the Border South, that period saw the shift in national politics from Democrat to Republican dominance (at least up to the New Deal). Curiously, the coalition was an inversion of the earlier coalition, with Yankees now being integral constituents in a broader Northern and Midwestern movement, and Southerners being marginalized as the odd-men-out.

 I review all this ethno-history because I think that to a great extent it is part of the “Dark Matter” of American political and social dynamics. Americans are known as “Yankees” to the rest of the world, and yet the reality is that the Yankee was one specific and very distinctive folkway on the American scene. But, that folkway has been very influential, often in a cryptic fashion.

Both Barack H. Obama and George W. Bush are not culturally identified as Yankees in a narrow sense. Obama is a self-identified black American who has adopted Chicago’s South Side as his community. The South Side is home to black culture which descends from those who arrived at the terminus of their own Great Migration from the American South. George W. Bush fancies himself a West Texan and a cowboy. He was governor of Texas, and makes his residence in Dallas, while much of his young adulthood was spent in Midland. But the reality is that both of these men have Yankee antecedents. This is clear in Bush’s case. His father is a quintessential Connecticut Yankee. Bush is the product of Andover Academy, Yale, and Harvard (by and large thanks to family connections). Barack H. Obama is a different case entirely. His racial identity as a black American is salient, but he grew up in one of the far flung outposts of the Yankee Empire, Hawaii. But perhaps more curiously, many of his mother’s ancestors were clearly Yankees. Obama has a great-grandfather named Ralph Waldo Emerson Dunham.

Within and outside of the United States there is often a stereotype that white Americans are an amorphous whole, a uniform herrenvolk who oppressed the black minority. This ideology was actually to some extent at the heart of the dominance of the early Democratic party before the rise of the Republicans fractured the coalition along sectional lines. In many Northern states one saw populist Democrats revoking property rights which were race-blind with universal white male suffrage. But white Americans, and Anglo-Americans of British stock at that, were not one. That was clear by the 1850s at the latest. And they exhibit a substantial amount of cultural variation which remains relevant today.

New England in particular stands out over the long historical scale. In many ways, of all the colonies of Great Britain it was the most peculiar in its relationship to the metropole. Unlike Australia or Canada it was not an open frontier, rich with natural resources which could absorb the demographic surplus of Britain. Unlike India it was not a possible source of rents from teeming culturally alien subjects. Unlike the South in the mid-19th century there was no complementary trade relationship. In economic terms New England was a potential and incipient rival to Old England. In cultural and social terms it may have aped Old England, but its “low church” Protestant orientation made it a throwback, and out of step with a metropole which was becoming more comfortable with the English Magisterial Reformation (which eventually led to the emergence of Anglo-Catholicism in the 19th century). Like modern day Japan, and England of its day, New England had to generate wealth from its human capital, its own ingenuity. This resulted in an inevitable conflict with the mother country, whose niche it was attempting to occupy (albeit, with exceptions, such as the early 19th century, before the rise of robust indigenous industry, and the reliance on trade). Today the American republic has pushed England aside as the center of the Anglosphere. And despite the romantic allure of the frontier and the surfeit of natural resources, it is ultimately defined by the spirit of Yankee ingenuity (rivaled by the cowboy, whose violent individualist ethos seems straight out of the Scots-Irish folklore of the South, transposed to the West).

What does this have to do with genetics? Let’s go back to the initial colonial period. As I’ve noted before: the Yankee colonies of New England engaged in selective immigration policies. Not only did they draw Puritan dissenters, but they were biased toward nuclear family units of middling background. By “middling,” that probably refers at least toward the upper quarter of English society of the period. They were literate, with at least some value-added skills. This is in contrast with the Irish Catholic migration of the 19th century, which emptied out Ireland of its tenant peasants (attempts to turn these Irish into yeoman farmers in the Midwest failed, with fiascoes such as the consumption of their seed corn and cattle over harsh Minnesota winters).

So the question is this: could “middle class” values be heritable? Yes, to some extent they are. Almost all behavioral tendencies are heritable to some extent. Adoption studies are clear on that. But, is one generation of selection sufficient to result in a long term shift? First, let’s dismiss the possibility of random genetic drift and therefore a bottleneck. The one generation shift in allele frequencies due to drift is inversely proportional to effective population size. If you assume that effective population size is ~5,000, then the inverse of that is 0.0002. So you’d expect the allele frequency at any given locus to shift by only a tiny fraction. So we have to look to selection.
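To put a rough number on “tiny,” here is a small Python sketch. It uses the textbook Wright-Fisher result that the variance of the one-generation change in allele frequency is p(1-p)/(2Ne) for a diploid population, which is a somewhat more specific statement than the 1/Ne argument above. The starting frequencies are arbitrary illustrations; only the Ne of ~5,000 comes from the text.

```python
import math


def drift_sd(p, ne):
    """SD of the one-generation change in allele frequency under
    Wright-Fisher drift: sqrt(p(1-p) / (2*Ne)) for a diploid population."""
    return math.sqrt(p * (1.0 - p) / (2.0 * ne))


ne = 5000                       # effective size assumed in the text
for p in (0.05, 0.25, 0.5):     # illustrative starting frequencies
    print(p, round(drift_sd(p, ne), 4))
```

Even at p = 0.5, where drift is strongest, the typical one-generation shift under this model is about half a percentage point, so the conclusion that drift alone does little here is unchanged.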

Let’s do some quick “back of the envelope” calculations. We’ll use IQ as a proxy for a whole host of traits because the numbers will at least be concrete, though the underlying logic of a quantitative continuous trait remains the same. First, the assumptions:

- Truncation selection on the trait which lops off the bottom 75 percent of the class distribution

- A correlation between the trait and genetic variation, so that you lop off the bottom 50 percent of the IQ distribution

- A heritability of IQ of 0.50

The top 50 percent of the IQ distribution has a median/mean IQ of ~110. Assuming 0.50 heritability implies half way regression back to the mean. Therefore, this model predicts that one generation of selection would entail a median IQ of 105 in the second generation, about 1/3 of a standard deviation above the norm in England.
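The arithmetic above can be sketched in Python with the breeder’s equation, R = h2 × S. The assumptions are labeled in the code: IQ is treated as normally distributed with mean 100 and standard deviation 15 (the conventional scaling, not something stated above), and the function name is just illustrative.

```python
from statistics import NormalDist


def truncation_response(keep_fraction, h2, mean=100.0, sd=15.0):
    """Mean of the selected group and expected offspring mean when only the
    top `keep_fraction` of a normal trait distribution reproduces."""
    std = NormalDist()
    z = std.inv_cdf(1.0 - keep_fraction)      # truncation point in SD units
    intensity = std.pdf(z) / keep_fraction    # mean of selected group, SD units
    selected_mean = mean + intensity * sd
    offspring_mean = mean + h2 * (selected_mean - mean)   # R = h2 * S
    return selected_mean, offspring_mean


# Assumptions from the text: keep the top 50 percent, heritability 0.50.
selected, offspring = truncation_response(keep_fraction=0.5, h2=0.5)
print(round(selected, 1), round(offspring, 1))
```

Under these assumptions the selected group averages roughly 112 and the next generation roughly 106, a little above the rounder 110 and 105 used above, so the “about 1/3 of a standard deviation” figure is, if anything, slightly conservative.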

Is this plausible, and could it result in the differences we see across American white ethnic groups? It is possible, but there are reasons to be skeptical. I think my guess of the top 25 percent of the class distribution is defensible from all I’ve read. But the correlation of this with IQ is probably going to be lower in the pre-modern era than today, where you have meritocratic institutions which channel people of different aptitudes. Second, the heritability of IQ was probably lower back then than now, because of wide environmental variance. Please note, I don’t dismiss the genetic explanation out of hand. Rather, this is a case where there are so many uncertainties that I’m not inclined to say much more than that it is possible, and that we may have an answer in the coming decades with widespread genomic sequencing.

But there’s another option, which on the face of it is easier to take in because so many of the parameters are well known and have been thoroughly examined. And that’s cultural selection. While we have to guess at the IQ distributions of the early Puritans, we know about the distribution of their cultural tendencies. They were almost all Calvinists, disproportionately literate. Because of its flexible nature culture can generate enormous inter-group differences in phenotypic variation. The genetic difference between New England and Virginia may have been small, but the cultural difference was wide (e.g., Yankee thrift vs. Cavalier generosity). Yankees who relocated to the South would assimilate Southern values, and the reverse (there is some suggestion that South Carolinian John C. Calhoun’s Unitarianism may have been influenced by his time at Yale, though overall it was obviously acceptable to the Deist inclined Southern elite of the period).

Before New England, human societies had an expectation that there would be a literate segment, and an illiterate one. By and large the substantial majority would be illiterate. In the Bronze Age world the scribal castes had almost a magic power by virtue of their mastery of the abstruse cuneiform and hieroglyph scripts. The rise of the alphabet (outside of East Asia) made literacy more accessible, but it seems likely that the majority of ancient populations, even in literary capitals such as Athens, were functionally illiterate. A small minority was sufficient for the production, dissemination, and propagation of literary works. Many ancient books were written with the ultimate understanding that their wider “reading” was going to occur in public forums where crowds gathered to listen to a reader. The printing press changed this with the possibility for at least nominal ownership of books by those with marginal surplus, the middle class. By limiting migration to these elements with the means to buy books, as well as an emphasis on reading the Bible common to scriptural Protestants, you had a society where the majority could be readers in the public forum.

What were the positive cultural feedback loops generated? And what sort of cultural dampeners may have allowed for the new stable cultural equilibrium to persist down the centuries? These are open questions, but they need to be explored. I’ll leave you with a map of public school expenditures in 2003. In the 1840s and 1850s one of the more notable aspects of the opening of the Western frontier was the huge difference between states settled by Yankees, such as Michigan, and those settled by Southerners, such as Arkansas. Both states were settled contemporaneously, but while Michigan had numerous grammar schools, Arkansas had hardly any….

* British Protestantism has shifted several times from a more “Catholic” to “Radical Protestant” direction. Its peak in officially sanctioned Radical Protestantism was probably during the reign of Edward VI, decades before the Stuart kings (the exception being the republic).

December 30, 2011

How do relatives correlate in traits?

Filed under: Correlation,Height,Quantitative Genetics — Razib Khan @ 1:24 am

The Pith: Even traits where most of the variation you see around you is controlled by genes still exhibit a lot of variation within families. That’s why there are siblings of very different heights or intellectual aptitudes.

In a post below I played fast and loose with the term correlation and caused some confusion. Correlation is obviously a set of precise statistical terms, but it also has a colloquial connotation. Additionally, I regularly talk about heritability. Heritability is in short the proportion of phenotypic variance which can be explained by genetic variance. In other words, if heritability is ~1 almost all the variation in the trait is due to variation in genes, while if heritability is ~0 almost none of it is. Correlation and heritability of traits across generations are obviously related, but they’re not the same.

This post is to clarify a few of these confusions, and sharpen some intuitions. Or perhaps more accurately, banish them.


The plot above shows the relationship between heights of fathers and heights of sons in standard deviation units (yes, I removed some of the values!). You see that the slope is ~0.45, and that’s the correlation. At this point you probably know that heritability of height is on the order of 0.8-0.9. So why is the correlation so low? A simple biological reason is that you don’t know the value of the mothers. If the parents are not strongly correlated (assortative mating) obviously the values of the sons are going to diverge from that of the father. That being said, you probably notice that the correlation here is about 1/2 that of the heritability you know has been confirmed in the literature. That’s no coincidence. One way to estimate heritability is to take the slope of the plot of offspring vs. parents, and multiply that by 2. Therefore, the correlation (which equals the slope) is 1/2 × h2, where h2 represents heritability.

Correlation (parent to offspring) = 1/2 × h2

1/2 turns out to be the coefficient of relatedness of a parent to offspring. I’ll spare you the algebra, but suffice it to say that this is not a coincidence. Where r = coefficient of relatedness the correlation between sets of relatives on a trait value is predicted to be:

Correlation (relative to relative) = r × h2

Where r is simply the coefficient of relatedness across the pair of relatives. Here are some values:

r relationship
0.5 (½) parent-offspring
0.25 (¼) grandparent-grandchild
1 identical twins; clones
0.5 (½) full siblings
0.25 (¼) half siblings
0.125 (⅛) first cousins
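
As a quick illustration, the values above can be plugged into the r × h2 formula in a few lines of Python. The heritability of 0.9 is just an illustrative figure for height, and the calculation assumes purely additive genetic variance with no shared environment, dominance, or assortative mating; the dictionary and function names are hypothetical.

```python
# Coefficients of relatedness, copied from the list above.
RELATEDNESS = {
    "identical twins": 1.0,
    "parent-offspring": 0.5,
    "full siblings": 0.5,
    "grandparent-grandchild": 0.25,
    "half siblings": 0.25,
    "first cousins": 0.125,
}


def expected_correlation(r, h2):
    """Correlation (relative to relative) = r * h2, additive variance only."""
    return r * h2


h2 = 0.9   # illustrative heritability for height
for relationship, r in RELATEDNESS.items():
    print(f"{relationship:24s} {expected_correlation(r, h2):.3f}")
```

For parent-offspring and full siblings this gives 0.45, matching the slope in the father-son plot discussed above.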

Here’s the kicker: the correlation coefficient of the midparent value and the offspring value does not equal the slope of the line of best fit. This is why I had second thoughts about using the term “correlation” so freely, and then switching to heritability. The formula is:

Correlation (midparent to offspring) = 1/√2 × h2

So the correlation of midparent to offspring is 0.71 × heritability.
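
A small simulation can make the slope-versus-correlation distinction tangible. This is only a sketch under stated assumptions: traits are in standard deviation units, parents mate at random, and the offspring value is built as h2 times the midparent value plus noise, so the slope equals h2 by construction. What it illustrates is that the correlation comes out lower than the slope, near h2/√2, because the midparent value has a smaller variance than the offspring value.

```python
import math
import random


def simulate(h2, n=20000, seed=1):
    """Return (midparent values, offspring values) for n simulated families."""
    rng = random.Random(seed)
    noise_sd = math.sqrt(1.0 - h2 * h2 / 2.0)   # keeps offspring variance at 1
    mids, kids = [], []
    for _ in range(n):
        father, mother = rng.gauss(0, 1), rng.gauss(0, 1)   # random mating
        mid = (father + mother) / 2.0
        mids.append(mid)
        kids.append(h2 * mid + rng.gauss(0, noise_sd))
    return mids, kids


def slope_and_correlation(x, y):
    """Least-squares slope of y on x, and the Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


h2 = 0.8   # illustrative heritability
slope, corr = slope_and_correlation(*simulate(h2))
print(round(slope, 2), round(corr, 2), round(h2 / math.sqrt(2), 2))
```

The first number printed should sit close to h2 and the second close to the third, up to sampling noise.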

Why is this something you might want to know? I think people are sometimes confused about how an extremely heritable trait, like height, where you’re given heritability values of 0.90, still yields families with such a wide range of heights. Well, recall that the coefficient of relatedness among siblings is 1/2. So their correlation is going to be the same as with parents. Therefore, the magnitude will be half that of the heritability. A correlation of 0.45 is not small, but neither is it extremely tight. The histogram below illustrates this with the above data set. The values are simply the real difference between fathers and sons:

December 29, 2011

How a “designer baby” might just work

Filed under: Bioethics,Quantitative Genetics,Quantitative Genomics — Razib Khan @ 1:41 pm

In earlier discussions I’ve been skeptical of the idea of “designer babies” for many traits which we may find of interest in terms of selection. For example, intelligence and height. Why? Because variation on these traits seems highly polygenic and widely distributed across the genome. Unlike cystic fibrosis (Mendelian recessive) or blue eye color (quasi-Mendelian recessive) you can’t just focus on one genomic region and then make a prediction about phenotype with a high degree of certainty. Rather, you need to know thousands and thousands of genetic variants, and we just don’t know them.

But I just realized one way that genomics might make it a little easier even without this specific information.

The method relies on the phenotypic correlation between relatives. Even before genomics, and genetics, biometricians could generate rough & ready predictions about phenotypic values based on parental values. The extent of the predictive power depends upon the heritability of the trait. A trait like height is ~80-90% heritable. That means that ~80-90% of the variation in the population of the trait is due to genes. The expected value of your height is strongly conditional upon the heights of your parents.
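
The biometrician’s rough-and-ready prediction can be written down in a couple of lines. This Python sketch assumes trait values expressed in standard deviation units relative to the population mean (within sex, for a trait like height) and purely additive inheritance; the couple in the example is hypothetical.

```python
def expected_offspring_z(father_z, mother_z, h2):
    """Expected offspring trait value, in SD units from the population mean,
    given both parents' values: h2 times the midparent deviation."""
    midparent = (father_z + mother_z) / 2.0
    return h2 * midparent


# Hypothetical couple: father 2 SD above average, mother 1 SD above.
print(expected_offspring_z(2.0, 1.0, h2=0.8))
```

The expectation regresses toward the population mean in proportion to 1 − h2, and it is only an expectation: individual children scatter widely around it.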

That’s all common sense. What does this have to do with genomics? Simple. You are 50% identical by descent with each parent. That means half your gene copies come from your mother and half from your father. You can’t change that unless you’re a clone. But, because of the law of segregation and recombination you are not necessarily 25% identical by descent from each grandparent! The expectation is that your coefficient of relatedness is 25%, but there is variation around this. A given parent either contributes their own paternal or maternal homologous chromosome. There’s a 50% chance that you’re going to inherit one or the other across your chromosomes, of independent probability. You have 22 autosomal chromosome pairs (non-sex chromosomes), so there’s a strong chance that you won’t be equally balanced between your opposite sex paternal and maternal grandparents (e.g., you have more genes identical by descent from your paternal grandfather than paternal grandmother).* Second, recombination is also going to generate new combinations. In the generation we’re concerned about this will work against the dynamic we’re relying on, by swapping segments across homologous chromosomes from the parents’ mother or father.

The ultimate logic here is to select for zygotes or gametes which are biased toward the grandparents with phenotypic values which you are interested in. To give a concrete example, if you have a parent who is moderately tall, whose own father was very tall, while the mother was somewhat short, and you want the tallest possible child, you’ll want to select zygotes with the most gene content identical by descent with the tall grandparent. The point isn’t to pick specific genetic variants, you don’t need to know that. All you know is that the tall grandfather probably had genes which resulted in a predisposition toward being tall. So just make sure that the grandchild has as much of that grandparent “in them.”

I still don’t know if this is going to be cost effective in the near term. But I began to think of it because in the near future I’ll be checking the genotype of a child who has a full pedigree of 1,000,000 SNPs of their parents and grandparents.

* Modeling it as a binomial, about 1 in 6 cases will have the expected 11 chromosomes from a focal grandparent. The standard deviation is more than 2 chromosomes. You need to have about 100 zygotes to expect to get any individuals who are 5 chromosomal units away from the expected value (i.e., the individual is 10-15% instead of 25% one grandparent, or 35-40%). Obviously you need more to be assured of getting zygotes of that value. And I neglected recombination, which would work against this, by swapping genomic regions….
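
The binomial model in this footnote is easy to check directly. The Python sketch below makes the same simplifying assumptions: each of the 22 autosomes a parent transmits is treated as one unit coming from either grandparent with probability 1/2, independently, and recombination is ignored. The function names are hypothetical.

```python
from math import comb, sqrt

N = 22   # autosomes transmitted by one parent, each treated as a single unit


def prob_exactly(k, n=N):
    """P(exactly k of n chromosomes come from the focal grandparent)."""
    return comb(n, k) / 2 ** n


def prob_at_least_away(d, n=N):
    """P(count is at least d chromosomes above or below n/2)."""
    centre = n // 2
    return sum(prob_exactly(k, n) for k in range(n + 1) if abs(k - centre) >= d)


print(round(prob_exactly(11), 3))        # chance of exactly 11 of 22
print(round(sqrt(N * 0.25), 2))          # standard deviation in chromosomes
print(round(prob_at_least_away(5), 3))   # 5 or more away, either direction
```

Under this model the chance of hitting exactly 11 is about 0.17, the standard deviation is about 2.3 chromosomes, and roughly 5 percent of zygotes land 5 or more chromosomes from the expectation in one direction or the other (about half that if only one direction is wanted).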

December 24, 2011

D. S. Falconer, 1913-2004

Filed under: D. S. Falconer,Douglas Falconer,Quantitative Genetics — Razib Khan @ 2:04 pm

In response to comments and queries below I’ve been poking around for more experimental material on quantitative genetics, and in particular the breeder’s equation. That’s how I stumbled upon this very interesting and informative obituary of D. S. Falconer in Genetics. It reviews not only the biographical details of Falconer’s life, but much of his science. It’s free to all now, so I highly recommend it! (as well as Introduction to Quantitative Genetics, which is quite pricey right now, but just keep watching, I recall getting a relatively cheap copy of the 1996 edition) Curiously, quantitative genetics is rather unknown to the general public in comparison to the biophysical sexiness of molecular genetics, but in most ways it’s the much better complement to the “folk genetics” which often crops up in our day to day life (e.g., “why is so-and-so’s son so short when so-and-so is so tall”). DNA illuminates the discontinuities of Mendelian inheritance, often in the gloomy realm of disease, but quantitative genetics sheds light on the continuities and variations we see across the generations.

How much do siblings differ in height?

Filed under: Height,Quantitative Genetics,variation in siblings — Razib Khan @ 1:28 pm

In the comments below a reader asks about the empirical difference in heights between siblings. I went looking…and I have to say that the data isn’t that easy to find; people are more interested in the deeper inferences one can make from the resemblances than the descriptive first-order data itself. But here’s one source I found:

Average difference   Identical twins   Identical twins raised apart   Full siblings
Height, inches       0.67              0.71                           1.8
Weight, pounds       4.2               9.9                            10.4
IQ                   5.9               8.2                            9.8

These data indicate that IQ and height variation among sibling cohorts is on the order of ~2/3rd to 3/4th of the variation that one can find within the general population (my estimate of standard deviation of 2.5 inches for height below is about right, if a slight underestimate according to the latest data). But I also found a paper with more detailed statistics.

The aim of the paper was to find outliers from expectation. In other words, which siblings diverged a lot from what you’d expect in terms of normal variation within the cohort? In the process they do report some statistics on inter-sibling variation. The correlation of height between siblings after correcting for age and sex is 0.43. This is what I’ve seen in the literature. Next, the standard deviation is 6.7 centimeters. This is about 2.6 inches. The average phenotypic difference between siblings was about 7.2 centimeters (D). Therefore, to a first approximation the recapitulation of population-wide variation in a continuous quantitative trait within sibling cohorts seems to hold. Though I’d be curious if readers can provide better and more diverse sources.

A mediocre man’s great son, a great man’s mediocre son

Filed under: Genetics of Height,Height,Kobe Bryant,Quantitative Genetics — Razib Khan @ 1:26 am

Kobe Bryant is an exceptional professional basketball player. His father was a “journeyman”. Similarly, Barry Bonds and Ken Griffey Jr. both surpassed their fathers as baseball players. Both of Archie Manning’s sons are superior quarterbacks in relation to their father. This is not entirely surprising. Though there is a correlation between parent and offspring in their traits, that correlation is imperfect.

Note though that I put journeyman in quotes above because any success at the professional level in major league athletics indicates an extremely high level of talent and focus. Kobe Bryant’s father was among the top 500 best basketball players of his age. His son is among the top 10. This is a large realized difference in professional athletics, but across the whole distribution of people playing basketball at any given time it is not so great of a difference.

What is more curious is how this relates to the reality of regression toward the mean. This is a very general statistical concept, but for our purposes we’re curious about its application in quantitative genetics. People often misunderstand the idea from what I can tell, and treat it as if there is an orthogenetic-like tendency of generations to regress back toward some idealized value.

Going back to the basketball example: Michael Jordan, the greatest basketball player in the history of the professional game, has two sons who are modest talents at best. The probability that either will make it to a professional league seems low, a reality acknowledged by one of them. In fact, from what I recall both received special attention and consideration because they were Michael Jordan’s sons. It is still noteworthy of course that both had the talent to make it onto a roster of a Division I NCAA team. This is not typical for any young man walking off the street. But the range in realized talent here is notable. Similarly, Joe Montana’s son has been bouncing around college football teams to find a roster spot. Again, it suggests a very high level of talent to be able to plausibly join a roster of a Division I football team. But for every Kobe Bryant there are many, many, Nate Montanas. There have been enough generations of professional athletes in the United States to illustrate regression toward the mean.

So how does it work? A few years ago a friend told me that the best way to think about it was a bivariate distribution, where the two random variables are additive genetic variation and environmental variation. Clearer? For many, probably not. To make it concrete, let’s go back to the old standby: the quantitative genetics of height.

For height in developed societies we know that ~80% of the variation of the trait in the population can be explained by variation of genes in the population. That is, the heritability of the trait is 0.80. This means that the correspondence between parents and offspring on this trait is rather high. Having tall or short parents is a decent predictor of having tall or short offspring. But the heritability is imperfect. There is a random “environmental” component of variation. I put environmental in quotations because that really just means it’s a random noise effect which we can’t capture in the additive or dominance components (this sort of thing may be why homosexual orientation in individuals is mostly biologically rooted, even if its population-wide heritability is modest). It could be biological, such as developmental stochasticity, or gene-gene interactions. The point is that this is the component which adds an element of randomness to our ability to predict the outcomes of offspring from parents. It is the darkening of the mirror of our perceptions.

Going back to height, the plot to the left shows an idealized normal distribution of height for males. I set the mean as 70 inches, or 5 feet 10 inches. The standard deviation is 2.5 inches, which is a measure of dispersion: about two-thirds of males fall within 2.5 inches of the mean, and if you randomly sampled any two males from the dataset the average difference between them would be about 2.8 inches. Obviously the height of a male is dependent upon the height of a father, but the mother matters as well (perhaps more due to maternal effects!). Here we have to note that there’s clearly a sex difference in height. How do you handle this problem? Actually, that’s easy. Just convert the heights of the parents to sex-controlled standard deviation units. For example, if you are 5 feet and 7.5 inches as a male you are 1 standard deviation unit below the mean. If you are a female at the same height you are 1.4 standard deviation units above the mean (assuming female mean height of 5 feet and 4 inches, and standard deviation of 2.5 inches). If height were ~100% heritable you’d just average the two parental values in standard deviation units to get the expectation of the offspring in standard deviation units. In this case, the offspring should be 0.2 standard deviation units above the mean.

But height is not ~100% heritable. There is an environmental component of variation which isn’t accounted for by the parental genotypic values (at least the ones with effects of interest to us, the additive components). If height is ~80% heritable then you’d expect the offspring to regress 1/5th of the way back to the population mean. For the example above, the expectation of the offspring would be 0.16 standard deviation units, not 0.20.
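The arithmetic of the last two paragraphs can be written out as a short sketch. The means, standard deviation, and heritability are the ones used in the text; the function names `to_sd_units` and `expected_offspring` are just illustrative labels of mine.

```python
MALE_MEAN, FEMALE_MEAN = 70.0, 64.0   # inches, as assumed in the text
SD = 2.5                              # inches, same for both sexes
H2 = 0.8                              # heritability of height

def to_sd_units(height_in, sex_mean, sd=SD):
    """Convert a height in inches to sex-controlled standard deviation units."""
    return (height_in - sex_mean) / sd

def expected_offspring(father_z, mother_z, h2=H2):
    """Midparent value shrunk toward the population mean by the heritability."""
    midparent = (father_z + mother_z) / 2
    return h2 * midparent

father_z = to_sd_units(67.5, MALE_MEAN)     # a 5'7.5" man
mother_z = to_sd_units(67.5, FEMALE_MEAN)   # a 5'7.5" woman
offspring_z = expected_offspring(father_z, mother_z)
son_height = MALE_MEAN + offspring_z * SD   # back to inches for a son

print(f"father {father_z:+.2f}, mother {mother_z:+.2f}, expected offspring {offspring_z:+.2f}")
print(f"expected height of a son: {son_height:.1f} inches")
```

Setting `h2=1.0` recovers the plain midparent average, which is the "~100% heritable" case described above.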

Let’s make this more concrete. Imagine you sampled a large number of couples whose midparent phenotypic value is 0.20 standard deviation units above the mean in height. This means that if you convert the father and mother into standard deviation units, their average is 0.20. So one pair could be 0.20 and 0.20, and another could be 2.0 and -1.6 standard deviation units. What’s the expected distribution of male offspring height?

The relevant points:

1) The midparent value naturally is constrained to have no variance (though as I indicate above since it’s an average the selected parents may have a wide variance)

2) The male offspring are somewhat above the average population in distribution of height

3) It remains a distribution. The expected value of the offspring is a specific value, but environmental and genetic variation remains to produce a range of outcomes (e.g., Mendelian segregation and recombination)

4) There has been some regression back to the population mean

I only displayed the males. There are obviously going to be females among the offspring generation. What would the outcome be if you mated the females with the males? Recall that the female heights would exhibit the same mean, 0.16 units above the original population mean. This is where many people get confused (frankly, those whose intelligence is somewhat closer to the mean!). They presume that a subsequent generation of mating would result in further regression back to the mean. No! Rather, the expected value of the offspring would be 0.16 units. Why?

Because through the process of selection you’ve created a new genetic population. The selection process is imperfect in ascertaining the exact causal underpinning of the trait value of a given individual. In other words, because height is imperfectly heritable some of the tall individuals you select are going to be tall for environmental reasons, and will not pass that trait to their offspring. But height is ~80% heritable, which means that the filtering process of genes by using phenotype is going to be rather good, and the genetic makeup of the subsequent population will be somewhat deviated from the original parental population. In other words, the reference population to which individuals “regress” has now changed. The environmental variation remains, but the additive genetic component around which the regression is anchored is now no longer the same.
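A toy simulation can make this point tangible. The sketch below is not the exact scenario above: it uses simple truncation selection (keep founders more than 1 standard deviation above the mean) rather than a fixed midparent value, and it assumes a purely additive model in which a child's genetic value is the parental average plus segregation noise. All names are hypothetical.

```python
import random
import statistics

H2 = 0.8                      # heritability from the text
VAR_A, VAR_E = H2, 1.0 - H2   # phenotypic variance scaled to 1

def person(rng, mean_a, var_a):
    """Return (additive genetic value, phenotype) for one individual."""
    a = rng.gauss(mean_a, var_a ** 0.5)
    return a, a + rng.gauss(0.0, VAR_E ** 0.5)

def next_generation(rng, parents, n):
    """Pair parents at random; each child gets the midparent genetic value plus segregation noise."""
    children = []
    for _ in range(n):
        (a1, _), (a2, _) = rng.sample(parents, 2)
        children.append(person(rng, (a1 + a2) / 2, VAR_A / 2))
    return children

def mean_phenotype(people):
    return statistics.fmean(p for _, p in people)

rng = random.Random(1)
founders = [person(rng, 0.0, VAR_A) for _ in range(50_000)]
selected = [ind for ind in founders if ind[1] > 1.0]      # keep only the tall founders
children = next_generation(rng, selected, 20_000)
grandchildren = next_generation(rng, children, 20_000)    # no further selection

print(f"selected parents: {mean_phenotype(selected):.2f}")
print(f"children:         {mean_phenotype(children):.2f}")
print(f"grandchildren:    {mean_phenotype(grandchildren):.2f}")
```

In this model the children regress partway toward the founder mean, while the grandchildren, bred from the children without further selection, stay where the children were.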

This is why I state that regression toward the mean is not magical in a biological sense. There is no population with fixed traits to which selected individuals naturally regress or revert to. Rather, populations are useful abstractions in making sense of the statistical correlations we see around us. The process of selection is informed by population-wide trends, so we need to bracket a set of individuals as a population. But what we really care about are the genetic variables which underpin the variation across the population. And those variables can change rather easily through selection. Obviously regression toward the mean would exhibit the magical reversion-toward-ideal-type property that some imagine if the variables were static and unchanging. But if this were the nature of things, then evolution by natural selection would never occur!

Therefore, in quantitative genetics regression toward the mean is a useful dynamic, a heuristic which allows us to make general predictions. But we shouldn’t forget that it’s really driven by biological processes. Many of the confusions which I see people engage in when talking about the dynamic seem to be rooted in the fact that individuals forget the biology, and adhere to the principle as if it is an unthinking mantra.

And that is why there is a flip side: even though the offspring of exceptional individuals are likely to regress back toward the mean, they are also much more likely to be even more exceptional than the parents than any random individual off the street! Let’s go back to height to make it concrete. Kobe Bryant is 6 feet 6 inches tall. His father is 6 feet 9 inches. I don’t know his mother’s height, but her brother was a basketball player whose height is 6 feet 2 inches. Let’s use him as a proxy for her (they’re siblings, so not totally inappropriate), and convert everyone to standard deviation units.

Kobe’s father: 4.4 units above mean

Kobe: 3.2 units above mean

Kobe’s mother: 1.6 units above the mean

Using the values above the expected value for the offspring of Kobe’s father & mother is a child 2.4 units above the mean. Kobe is somewhat above the expected value (assuming that Kobe’s mother is a taller than average woman, which seems likely from photographs). But here’s the important point: his odds of being this height are much higher with the parents he has than with any random parents. Using a perfect normal distribution (this is somewhat distorted by “fat-tailing”) the odds of an individual being Kobe’s height are around 1 in 1,500. But with his parents the odds that he’d be his height are closer to 1 out of 5. In other words, Kobe’s parentage increased the odds of his being 6 feet 6 inches by a factor of 300! The odds were still against him, but the die was loaded in his direction in a relative sense. By analogy, in the near future we’ll see many more children of professional athletes become professional athletes both due to nature and nurture. But, we’ll continue to see that most of the children of professional athletes will not have the requisite talent to become professional athletes.
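Here is a rough sketch of how odds like these can be computed under a perfect normal distribution. I read "being Kobe’s height" as "at least that tall". The spread of offspring around the parental prediction is my own assumption: under a purely additive model with phenotypic variance 1, the residual variance around the midparent prediction is 1 - h⁴/2. The helper `upper_tail` is just an illustrative name.

```python
from math import erfc, sqrt

H2 = 0.8   # heritability of height used throughout the post

def upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * erfc(z / sqrt(2))

kobe_z, father_z, mother_z = 3.2, 4.4, 1.6            # standard deviation units from the post
expected_z = H2 * (father_z + mother_z) / 2            # expected offspring value
# Assumption: offspring scatter around the prediction under a purely additive model
residual_sd = sqrt(1 - H2**2 / 2)

p_population = upper_tail(kobe_z)                                   # random parents
p_given_parents = upper_tail((kobe_z - expected_z) / residual_sd)   # these parents

print(f"random parents: about 1 in {1 / p_population:,.0f}")
print(f"these parents:  about 1 in {1 / p_given_parents:.0f}")
print(f"odds multiplied by roughly {p_given_parents / p_population:.0f}")
```

The output lands in the same ballpark as the figures quoted above. The exact multiplier depends on how the residual spread is modeled, so treat it as an order-of-magnitude check.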

Image Credit: Wikipedia

November 26, 2011

On the real possibility of human differences

I have discussed the reality that many areas of psychology are susceptible enough to false positives that the ideological preferences of the researchers come to the fore. CBC Radio contacted me after that post, and I asked them to consider that in 1960 psychologists discussed the behavior of homosexuality as if it was a pathology. Is homosexuality no longer a pathology, or have we as a society changed our definitions? In any given discipline when confronted with the specter of false positives which happen to meet statistical significance there is the natural tendency to align the outcome so that it is socially and professionally optimized. That is, the results support your own ideological preferences, and, they reinforce your own career aspirations. Publishing preferred positive results furthers both these ends, even if at the end of the day many researchers may understand on a deep level the likelihood that a specific set of published results are not robust.

This issue is not endemic to social sciences alone. I have already admitted this issue in medical sciences, where there is a lot of money at stake. But it crops up in more theoretical biology as well. In the early 20th century Charles Davenport’s research which suggested the inferiority of hybrids between human races was in keeping with the ideological preferences of the era. In our age Armand Leroi extols the beauty of hybrids, who have masked their genetic load through heterozygosity (a nation like Britain, which once had a public norm against ‘mongrelization,’ now promotes racial intermarriage in the dominant media!). There are a priori biological rationales for both positions, hybrid breakdown and vigor (for humans from what I have heard and seen there seems to be very little evidence overall for either once you control for the deleterious consequences of inbreeding). In 1900 and in 2000 there are very different and opposing social preferences on this issue (as opposed to individual preferences). The empirical distribution of outcomes will vary in any given set of cases, so researchers are incentivized to seek the results which align well with social expectations (here’s an example of heightened fatality due to mixing genetic backgrounds; it seems the exception rather than the rule).

Thinking about all this made me reread James F. Crow’s Unequal by nature: a geneticist’s perspective on human differences. Crow is arguably the most eminent living population geneticist (see my interview from 2006). Born in 1916, he has seen much come and go. For those of us who wonder how anyone could accept ideas which seem shocking or unbelievable today, I suspect Crow could give an answer. He was there. In any case, on an editorial note I think the essay should have been titled “Different by nature.” Inequality tends to connote a rank order of superiority or inferiority, though in the context of the essay the title is obviously accurate. Here is the most important section:

Two populations may have a large overlap and differ only slightly in their means. Still, the most outstanding individuals will tend to come from the population with the higher mean. The implication, I think, is clear: whenever an institution or society singles out individuals who are exceptional or outstanding in some way, racial differences will become more apparent. That fact may be uncomfortable, but there is no way around it.

The fact that racial differences exist does not, of course, explain their origin. The cause of the observed differences may be genetic. But it may also be environmental, the result of diet, or family structure, or schooling, or any number of other possible biological and social factors.

My conclusion, to repeat, is that whenever a society singles out individuals who are outstanding or unusual in any way, the statistical contrast between means and extremes comes to the fore. I think that recognizing this can eventually only help politicians and social policymakers.

You can, and should, read the whole thing. Let’s make it concrete. Imagine the following trait with two distributions (i.e., two populations):

- Mean = 100 and 105 (average value)
- Standard deviation = 15 (measure of dispersion)
- Let’s assume a normal distribution

Let’s plot the two distributions:

Observe the close overlap between the two distributions. Most of the variance occurs within both sets of populations. Now let’s impose a cut-off of about ~130 on the curves:

Now the similarity between the two curves is not as striking. As you move to the tails of the distribution they begin to diverge. In other words, the average of the two populations is pretty much interchangeable, but the values at the tails differ. Now let’s move the cut-off to 145:

The difference is now even more stark. Let’s compare the ratios of the area under the curve for the two populations as defined by the cut-offs:

Value at 100 = 1.26 (any given individual in the blue population is 1.26 times more likely to be above 100 than in the red population)
Value at 130 = 2.10
Value at 145 = 2.84
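These ratios can be recomputed directly from the parameters listed above. The sketch below assumes perfectly normal curves, takes "blue" to be the population with mean 105 and "red" the one with mean 100, and uses the cut-offs of exactly 100, 130, and 145. `tail_ratio` is an illustrative name of mine.

```python
from statistics import NormalDist

SD = 15
red = NormalDist(mu=100, sigma=SD)    # population with the lower mean
blue = NormalDist(mu=105, sigma=SD)   # population with the higher mean

def tail_ratio(cutoff):
    """How many times more likely a blue individual is to exceed the cutoff than a red one."""
    return (1 - blue.cdf(cutoff)) / (1 - red.cdf(cutoff))

for cutoff in (100, 130, 145):
    print(f"cut-off {cutoff}: ratio {tail_ratio(cutoff):.2f}")
```

The ratio keeps growing as the cut-off moves further out, which is the whole point: a 5-point gap in means is nearly invisible at the center and increasingly visible in the tail.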

A major caveat: quantitative traits are only approximately normally distributed, and there tends to be a “fat tail” dynamic, where deviation from the normal increases as one moves away from the mean. Concretely, this means that the ratios at the tails are probably not quite as extreme, as there are more individuals in all populations at the tails than you’d expect.

What does this entail concretely? As Crow noted above if you sample from the tails of the distribution then very modest differences between groups become rather salient. Consider long distance running. To be successful in international competitions one presumably has to be many, many, standard deviations above the norm. One can’t be a 1 out of 100, or 1 out of 1,000. Rather, presumably one should be 1 out of hundreds of thousands, at a minimum. This would be the fastest ~100,000 or so people in the world (out of 7 billion). With this in mind, we should not be surprised a priori at the success of the Kalenjin people of Kenya in this domain. They may have both the biological and social preconditions which allow their distribution of talent to be moderately above that of the human norm. Even a marginal shift can make a huge difference at the tails. 1 out of 100,000 is 4.26 deviation units above the mean. Increasing the mean of a population by half a standard deviation unit (e.g., if 100 is the mean, 15 is the standard deviation, then for the population with the higher mean you’d be at 107.5) results in a disproportion in ratio of above 8:1 at 4.26 units (as measured in the first population). This is modest, about 1 order of magnitude, but consider possible gene-environment correlations and synergies that might ensue when you have a critical mass of very fast individuals. This could amplify the effect of a difference in distributions on a single variate (more importantly I suspect, consider that virtuosity in many domains requires an intersection of aptitudes many units deviated from the norm across many traits).
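The 1-in-100,000 calculation runs in the other direction: start from a rarity, find the cut-off, then compare the two tails. A minimal sketch, again assuming ideal normal curves and working in standard deviation units (variable names are mine):

```python
from statistics import NormalDist

baseline = NormalDist(mu=0.0, sigma=1.0)
shifted = NormalDist(mu=0.5, sigma=1.0)   # mean raised by half a standard deviation

one_in = 100_000
cutoff = baseline.inv_cdf(1 - 1 / one_in)   # threshold exceeded by 1 in 100,000 at baseline
ratio = (1 - shifted.cdf(cutoff)) / (1 - baseline.cdf(cutoff))

print(f"cut-off: {cutoff:.2f} standard deviations above the baseline mean")
print(f"over-representation of the shifted population: {ratio:.1f} to 1")
```

Raising `one_in` pushes the cut-off further out and the ratio higher, subject to the fat-tail caveat above.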

In the early 2000s James F. Crow was responding to the Human Genome Project. As has been thoroughly covered elsewhere human genomics has probably underwhelmed in terms of outcomes 10 years out. But it is often the case that with new technologies we overestimate the short-term change which they will effect and underestimate their long-term consequences. I believe with the rise of mass genomics, a radical increase in population coverage and full genome sequencing, we may finally start to adduce the underpinnings of quantitative traits. We already have indirect methods, but I believe that by 2020 we will have direct means at our disposal. We’ll have a good sense how deeply humans are commensurable on a population genetic level. I doubt it will change much in our values, but it may entail some rhetorical adjustments.

July 6, 2011

Marry far and breed tall strong sons

The Pith: When it comes to the final outcome of a largely biologically specified trait like human height it looks as if it isn’t just the genes your parents give you that matter. Rather, the relationship of their genes also counts. The more dissimilar they are genetically, the taller you are likely to be (all things equal).

Dienekes points me to an interesting new paper in the American Journal of Physical Anthropology, Isolation by distance between spouses and its effect on children’s growth in height. The results are rather straightforward: the greater the distance between the origin of one’s parents, the taller one is likely to be, especially in the case of males. These findings were robust even after controlling for confounds such as socioeconomic status. Their explanation? Heterosis, whether through heterozygote advantage or the masking of recessive deleterious alleles.

The paper is short and sweet, but first one has to keep in mind the long history of this sort of research in the murky domain of human quantitative genetics. This is not a straight-forward molecular genetic paper where there’s a laser-like focus on one locus, and the mechanistic issues are ...

May 9, 2011

Pygmies are short because nature made them so

Aka Pygmies

The Pith: There has been a long running argument whether Pygmies in Africa are short due to “nurture” or “nature.” It turns out that non-Pygmies with more Pygmy ancestry are shorter and Pygmies with more non-Pygmy ancestry are taller. That points to nature.

In terms of how one conceptualizes the relationship of variation in genes to variation in a trait one can frame it as a spectrum with two extremes. On the one hand you have monogenic traits where the variation is controlled by differences on just one locus. Many recessively expressed diseases fit this pattern (e.g., cystic fibrosis). Because you have one gene with only a few variants of note it is easy to capture in one’s mind’s eye the pattern of Mendelian inheritance for these traits in a gestalt fashion. Monogenic traits are highly amenable to a priori logic because their atomic units are so simple and tractable. At the other extreme you have quantitative polygenic traits, where the variation of the trait is controlled by variation on many, many, genes. This may seem a simple ...

February 4, 2011

Introduction to Quantitative Genetics & Genetics and Analysis of Quantitative Traits

Short comment: The book to the left is the quantitative genetics text you have to read if you aren’t a specialist. The book to the right is the reference to have if you are a specialist. The former is the Torah. The latter is the Talmud.

February 3, 2011

Why siblings differ differently

The Pith: In this post I examine how looking at genomic data can clarify exactly how closely related siblings really are, instead of just assuming that they’re about 50% similar. I contrast this randomness among siblings to the hard & fast deterministic nature of parent-child inheritance. Additionally, I detail how the idealized spare concepts of genetics from 100 years ago are modified by what we now know about how genes are physically organized, and, reorganized. Finally, I explain how this clarification allows us to potentially understand with greater precision the nature of inheritance of complex traits which vary within families, and across the whole population.

Humans are diploid organisms. We have two copies of each gene, inherited from each parent (the exception here is for males, who have only one X chromosome inherited from the mother, and lack many compensatory genes on the Y chromosome inherited from the father). Our own parents have two copies of each gene, one inherited from each of their parents. Therefore, one can model a grandchild from two pairs of grandparents as a mosaic of the genes of the four ancestral grandparents. But, the relationship between ...

January 9, 2011

Of association & evolution

Two of the main avenues of research which I track rather closely in this space are genome-wide association studies (GWAS), which attempt to establish a connection between a trait/disease and particular genetic markers, and inquiries into the evolutionary parameters which shape the structure of variation within the human genome. Often with specific relation to a particular trait/disease. By evolutionary parameters I mean stochastic and deterministic forces; mutation, migration, random drift, and natural selection. These two angles are obviously connected. Both focus on phenomena which are proximate in relation to the broader evolutionary principle: the ultimate raison d’être, replication. Stochastic forces such as random genetic drift reflect the error of sampling of genes from generation to generation during the process of reproduction, while adaptation through natural selection is an outcome of the variation of reproductive fitness as a function of variation of heritable traits. Both of these forces have been implicated in diseases and traits which come under the purview of GWAS (and linkage mapping).

GWAS are regularly in the news because of their relevance in identifying the causal genetic factors for specific diseases. For example, schizophrenia. But they can be useful in a non-disease context as well. Human pigmentation is a character whose genetic architecture has been well elucidated thanks to a host of recent association studies. The common disease-common variant model has yielded spectacular results for pigmentation; it does seem a few common variants are responsible for most of the variation on this trait. But this has been the exception rather than the rule.

One reason for this disjunction between the promise of GWAS and the concrete tangible outcomes is that many traits/diseases of interest may be polygenic and quantitative. This implies that variation in phenotype is controlled by variation across many genes, and, that the variation itself exhibits gradual continuity (a continuity which can be modeled as a normal distribution of values). The power of GWAS to detect correlated variation across genes and traits of small marginal effect is obviously limited. In contrast, it seems that about half a dozen genes can explain most of the between population variation in pigmentation. One SNP is able to account for 25-40% of the difference in shade between Europeans and Africans. This SNP is fixed in Europeans, nearly absent in Africans and East Asians, and segregating in both ancestral and derived variants in groups such as South Asians and African Americans. On the other hand, traits such as schizophrenia and height are substantially heritable, in that much of the variation in the trait at the population level is explainable by variation in genes, yet the effect size at any given locus may be small, or the variation may be accumulated through the sum of larger effect variants of low frequency. In other words, many common variants of small effect, or numerous distinctive rare variants of large effect.

These nuances of genetic architecture are not irrelevant to the possible evolutionary arc of the traits in question. One model of the adaptation leading to the high frequency of a trait or disease is that a novel mutation rapidly “sweeps” to fixation, or nearly to fixation. In other words, it shifts from nearly ~0% to nearly ~100% frequency in the population of alleles at that locus, driven by positive selection. This sort of rapid “hard sweep” would also result in “hitchhiking” of associated variants in the genomic regions adjacent to the originally favored mutant, producing regions of high linkage disequilibrium in the genome and haplotype blocks of associated alleles across loci. Such a model does seem possible in the case of some of the variants which are responsible for diversity of pigmentation. But this neat dovetailing between the strong association of a few variants with trait variance, and signatures of positive selection being driven by adaptation, is not so easy to come by in many instances.
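To get a feel for how quickly a favored variant can go from nearly ~0% to nearly ~100%, here is a minimal deterministic sketch. It assumes the simplest textbook setup (haploid selection, constant selection coefficient `s`, no drift); the starting frequency and `s` are arbitrary illustrative values, not estimates for any real locus.

```python
def generations_to_sweep(p0, s, target=0.99):
    """Count generations for a favored allele to climb from frequency p0 to the target."""
    p, generations = p0, 0
    while p < target:
        p = p * (1 + s) / (1 + s * p)   # deterministic haploid selection update
        generations += 1
    return generations

for s in (0.01, 0.05, 0.10):
    print(f"s = {s:.2f}: {generations_to_sweep(p0=0.001, s=s)} generations")
```

In a real finite population a new mutation can also be lost by drift while rare, which this sketch ignores.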

There are other evolutionary possibilities in terms of what could drive a high frequency of particular alleles. Population bottlenecks and inbreeding can crank up the frequency of a variant simply through chance. This may be the origin of many traits and diseases expressed recessively or in quasi-Mendelian form which run in specific populations. Let’s set such stochastic possibilities to the side for now. The well of natural selection is not quite tapped out simply by models of positive selection drawing upon singular new mutations. Another model is that of “soft sweeps” operating upon standing genetic variation. Consider for example a trait which has a heritability of 0.50. 50% of the variance in trait value can be explained by variance in genes. Selection correlated with trait value can rapidly change the distribution of the trait within the population, as modeled by the breeder’s equation. But no new mutations are necessary in this model, rather, the frequencies of extant alleles change over time. In fact, as the proportions shift novel combinations of alleles which were once too rare to be found together in the same individual will emerge, and so offer up the possibility that the mean trait value in generation t + n may be outside of the range of trait values at t = 0.
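The breeder’s equation itself is simple: the response to selection R equals the heritability h² times the selection differential S. A small sketch for the heritability of 0.50 mentioned above, under the simplifying assumption (mine) that both h² and S stay constant from generation to generation and that S is measured in units of the original standard deviation:

```python
def response_to_selection(h2, selection_differential):
    """Breeder's equation: R = h^2 * S."""
    return h2 * selection_differential

def mean_after(generations, h2=0.5, s=1.0, start=0.0):
    """Population mean after repeated rounds of selection with constant h2 and S (an assumption)."""
    mean = start
    for _ in range(generations):
        mean += response_to_selection(h2, s)
    return mean

for n in (1, 5, 10):
    print(f"after {n} generations: mean shifted by {mean_after(n):.1f} original SDs")
```

With these illustrative inputs the mean moves several original standard deviations within ten generations, which is how a population can end up outside its starting range without any new mutations.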

Over time such selection on a quantitative trait theoretically exhausts its own fuel, genetic variation. But quite often this is not practically operative, because such traits are subject to a background level of novel mutation and balancing selection. Stabilizing selection around a median phenotype, as well as frequency dependence and shifting environmental pressures, may produce a circumstance where adaptation never moves beyond the transient flux toward a new equilibrium. The element of the eternal race is at the heart of the Red Queen’s Hypothesis, where pathogen and host engage in an evolutionary war, and host immune responses are subject to negative frequency dependence. As the frequency of an allele rises, its relative fitness declines. As its frequency declines, its fitness rises.
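Negative frequency dependence is easy to see in a toy model. In the sketch below the fitness of each of two alleles falls linearly with its own frequency. The linear form and the value of `s` are assumptions chosen for simplicity, not anything taken from the Red Queen literature.

```python
def next_frequency(p, s):
    """One generation of haploid selection where each allele's fitness drops as it gets common."""
    w_a = 1 - s * p          # fitness of allele A falls as A becomes common
    w_b = 1 - s * (1 - p)    # fitness of the alternative allele falls as it becomes common
    mean_w = p * w_a + (1 - p) * w_b
    return p * w_a / mean_w

def trajectory(p0, s, generations):
    """Allele A frequency over time, starting from p0."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_frequency(freqs[-1], s))
    return freqs

for start in (0.05, 0.95):
    path = trajectory(start, s=0.5, generations=200)
    print(f"start {start:.2f} -> gen 10: {path[10]:.3f}, gen 200: {path[-1]:.3f}")
```

Whether the allele starts rare or common, it is pulled back toward an intermediate frequency, so neither variant sweeps and the variation persists.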

Naturally such complex evolutionary models, subject to contingency and less powerful in their generality, only become appealing when simple hard sweep models no longer suffice. But it seems highly plausible that the genetic architecture of some traits, those which seem plagued by ‘missing heritability,’ are going to necessitate somewhat more baroque evolutionary models to explain their ultimate emergence & persistence. A new paper in PLoS Genetics tackles this complexity by looking at the patterns of variation of SNPs implicated in GWAS in the HGDP data set. Genome-Wide Association Study SNPs in the Human Genome Diversity Project Populations: Does Selection Affect Unlinked SNPs with Shared Trait Associations? First, the abstract:

Genome-wide association studies (GWAS) have identified more than 2,000 trait-SNP associations, and the number continues to increase. GWAS have focused on traits with potential consequences for human fitness, including many immunological, metabolic, cardiovascular, and behavioral phenotypes. Given the polygenic nature of complex traits, selection may exert its influence on them by altering allele frequencies at many associated loci, a possibility which has yet to be explored empirically. Here we use 38 different measures of allele frequency variation and 8 iHS scores to characterize over 1,300 GWAS SNPs in 53 globally distributed human populations. We apply these same techniques to evaluate SNPs grouped by trait association. We find that groups of SNPs associated with pigmentation, blood pressure, infectious disease, and autoimmune disease traits exhibit unusual allele frequency patterns and elevated iHS scores in certain geographical locations. We also find that GWAS SNPs have generally elevated scores for measures of allele frequency variation and for iHS in Eurasia and East Asia. Overall, we believe that our results provide evidence for selection on several complex traits that has caused changes in allele frequencies and/or elevated iHS scores at a number of associated loci. Since GWAS SNPs collectively exhibit elevated allele frequency measures and iHS scores, selection on complex traits may be quite widespread. Our findings are most consistent with this selection being either positive or negative, although the relative contributions of the two are difficult to discern. Our results also suggest that trait-SNP associations identified in Eurasian samples may not be present in Africa, Oceania, and the Americas, possibly due to differences in linkage disequilibrium patterns. This observation suggests that non-Eurasian and non-East Asian sample populations should be included in future GWAS

And now the author summary:

Natural selection exerts its influence by changing allele frequencies at genomic polymorphisms. Alleles associated with harmful traits decrease in frequency while those associated with beneficial traits become more common. In a simple case, selection acts on a trait controlled by a single polymorphism; a large change in allele frequency at this polymorphism can eliminate a deleterious phenotype from a population or fix a beneficial one. However, many phenotypes, including diseases like Type 2 Diabetes, Crohn’s disease, and prostate cancer, and physiological traits like height, weight, and hair color, are controlled by multiple genomic loci. Selection may act on such traits by influencing allele frequencies at a single associated polymorphism or by altering allele frequencies at many associated polymorphisms. To search for cases of the latter, we assembled groups of genomic polymorphisms sharing a common trait association and examined their allele frequencies across 53 globally distributed populations looking for commonalities in allelic behavior across geographical space. We find that variants associated with blood pressure tend to correlate with latitude, while those associated with HIV/AIDS progression correlate well with longitude. We also find evidence that selection may be acting worldwide to increase the frequencies of alleles that elevate autoimmune disease risk.

This is a paper where jumping to the methods might be useful. Though I’m sure that the authors did not intend it, sometimes it felt as if you were following the marble being manipulated by the carnival tender. Since I was not familiar with some of the terms for the statistics, a simple allusion to the methods without elaborating in detail did not suffice. In any case, the key here is that they focused on the set of SNPs which have been associated with trait variance in GWAS, and compared those to the total SNPs found in the HGDP data set of 53 populations. Note that not all SNPs in GWAS were in the HGDP SNP panel. But for the general questions being asked the intersection of SNPs sufficed. Additionally, they generated a further subset of SNPs which were highly likely to be associated with trait variance. These were SNPs where other SNPs of related function were within 1 MB, or, SNPs which were found in more than one GWAS.

There were four primary statistics within the paper: Delta, Fst, LLC, and iHS. Fst and iHS are familiar. Fst measures the extent of between population variance across a set of populations. High Fst means a great deal of population structure, while Fst ~ 0 means basically no population structure. iHS is a test to detect the probability of natural selection based on patterns of linkage disequilibrium in the genome. Basically the important thing for the purposes of this paper is that iHS tends to be good at detecting alleles at moderate frequencies still presumably going through sweeps. This is in contrast to the older EHH test, which only detects sweeps which are nearly complete. If the authors are focusing on polygenic traits and soft sweeps the likelihood of that showing up on EHH is low since that is predicated on hard, nearly complete, sweeps. LLC measures how strongly the frequency of a trait-associated variant correlates with latitude and longitude. Presumably this would be useful for smoking out those traits driven by ecological pressures (an obvious example in a general sense is the consistent change in area-to-volume ratio across taxa as organisms proceed from warmer to colder climes). Finally, Delta measures the allele frequency difference between the two populations (or pooled groups) in a comparison. The sign of Delta is simply a function of whether the allele frequency in question is higher in the first or second population in the comparison.
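To make the Delta and Fst definitions concrete, here is a minimal Python sketch for a single biallelic SNP in two populations. It uses the textbook formulation Fst = (H_T - H_S) / H_T with the two populations weighted equally. That choice is my assumption, not necessarily the estimator used in the paper, and the function names and allele frequencies are hypothetical.

```python
def delta(p1, p2):
    """Signed allele frequency difference: positive if the allele is
    more common in population 1, negative if more common in population 2."""
    return p1 - p2


def fst(p1, p2):
    """Textbook Fst for one biallelic SNP and two equally weighted populations."""
    p_bar = (p1 + p2) / 2                      # pooled allele frequency
    h_t = 2 * p_bar * (1 - p_bar)              # expected heterozygosity in the pooled population
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population heterozygosity
    if h_t == 0:
        return 0.0                             # allele fixed or absent everywhere: no structure
    return (h_t - h_s) / h_t


# Hypothetical frequencies for illustration only.
print(delta(0.9, 0.1), fst(0.9, 0.1))
print(delta(0.5, 0.5), fst(0.5, 0.5))
```

Identical frequencies give Fst of zero, and fixation for alternative alleles gives Fst of one, which matches the verbal description above.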

In doing their comparisons the authors did not simply compare across all 53 populations in a pairwise fashion. Rather, they often pooled continental or regional groups. To the left is a slice of table 1. It shows the populations used to generate the Delta values, and how they were pooled. The HGDP populations are broken down by region in a rather straightforward manner. But also note that some of the comparisons are between populations within regions, and those with different lifestyles. I assume that the comparisons highlighted within the paper were performed with the aim of squeezing maximal informative juice in such an exploratory endeavor. There are no obligate hunter-gatherers within the Eurasian populations in the HGDP data set to my knowledge, so a comparison between agriculturalists and hunter-gatherers would not be possible. There is such a comparison available in the African data set. The authors generated p-values by comparing the GWAS SNPs to random SNPs within the HGDP data set. In particular, they were looking for signatures of distinctiveness among the HGDP data set.

Such distinctiveness is expected. The set of SNPs associated with diseases and traits of note are not likely to be a representative subset of the SNPs across the whole genome. Remember that a neutral model of molecular evolution means that we should expect most genetic variation within the genome to be due to stochastic forces. Panel A of figure 1 shows that in fact the SNPs derived from GWAS did exhibit a different pattern from the total set of SNPs in the HGDP panel. Observe that the distribution of minor allele frequency (MAF) is somewhat skewed toward higher values for the GWAS SNPs. If the logic of GWAS is geared toward “common variants” which will be frequent enough within the population to generate an effect which is powerful enough to be picked up by the studies given their sample sizes, the bias toward more common variants (higher MAF) is understandable.

To the left are some SNPs and traits which had low p-values (i.e., they deviated from expectation beyond what you’d expect from random noise). Not very surprisingly they found that pigmentation related SNPs tended to show up strongly in all the measures of population differentiation and variation. rs28777 is found in SLC45A2, a locus which differentiates Europeans from non-Europeans. rs1834640 is in SLC24A5, which differentiates Europeans + Middle Easterners + Central/South Asians from other populations. rs12913832 is a “blue eye” related variant. That is, it’s one of the markers associated with blue vs. non-blue eye color differences in Europeans.

Seeing that pigmentation has been one of the few traits which has been well elucidated by the current techniques, it should be expected that more subtle and thorough methods aimed at detecting genetic variation across and within populations should stumble upon those markers first. The authors note that “SNPs and study groups associated with pigmentation and immunological traits made up a majority of those that reached significance in our analysis.” There has long been a tendency toward finding signatures of selection around pigmentation and disease related loci.

One pattern which was also evident in terms of geography in the patterns of low p-values was the tendency for Eurasian groups to be enriched. This is illustrated in figure 2. Most of the SNPs from the GWAS studies were derived from study populations which were European. Because of this there is probably a bias in the set of SNPs being evaluated which are particularly informative for Europeans and related populations. Additionally, it may also be that Eurasians were subject to different selective pressures as they left the ancestral African environment ~150-50,000 years B.P. In any case, for purposes of medical analysis the authors did find that using SNPs from East Asian populations produced somewhat different results than using those from European populations. Though some studies have shown a broad applicability of SNPs across populations, there are no doubt many variants in non-European populations which have simply not been detected because GWAS studies are not particularly focused on non-European populations. Consider:

… However, our results indicate that SNPs associated with pigmentation in GWAS display unusual allele frequency patterns almost exclusively in Europe, the Middle East, and Central Asia. This suggests to us that there may be SNPs, perhaps in or near genes other than SLC45A2, IRF4, TYR, SLC24A4, HERC2, MC1R, and ASIP, which are associated with pigmentation in non-Eurasian populations, but which have yet to be identified by GWAS. GWAS for pigmentation traits carried out using non-European subjects are needed to explore this possibility further.

There are two major other classes of trait/disease which were found to vary systematically across the HGDP populations:

- High blood pressure associated variants seemed to decrease with latitude

- Infectious and autoimmune disease SNPs had elevated scores. Specifically, there were some HIV related SNPs associated with Europeans which seem to confer resistance

The first set of traits would naturally come out of GWAS derived SNPs, since so much medical research goes into identifying risk and treating high blood pressure and other circulatory ailments. A consistent pattern where geography and not ancestry predict variation is an excellent tell for exogenous selective pressures. The physical nature of the earth is such that as mammals spread away from the equators their physiques will be reshaped by different sets of ecological parameters. Siberian populations have developed adaptations to cold stress, and there seem to be consistent cross-taxa shifts in body form to maximize or minimize heat radiation among mammals.

In the second case you have resistance to disease cropping up again, as well as pleiotropy, whereby genetic changes can have multiple downstream consequences. Often this is temporally simultaneous; consider the tame silver foxes. But sometimes you have a change in the past which has a subsequent consequence later in time due to different selective pressures. It is not that surprising that immunological responses can be multi-purpose, so even though Europeans did not develop resistance to HIV as a general selective pressure, similar pressures seem to have resulted in responses with general utility and now a specific use in relation to HIV. Selection can often be a blunt instrument, interposing itself into a network of interactions with multiple consequences, reshaping many traits simultaneously in the process of maximizing local fitness. This is most clear when you have a trait such as sickle-cell disease, which emerges only because the fitness benefit of heterozygosity is so great. But no doubt when it comes to many traits the byproducts are more subtle, or may seem cryptic to us. We still do not know why EDAR was driven to higher frequency in East Asians (less body odor and thick straight hair seem implausible targets for selection).

And just as natural selection can be blunt and rude in its impact on the covariance of genes and traits, so its relaxation may remove a suffocating vice. Consider the possibilities with blood pressure: perhaps the reason that northern Eurasians have lower blood pressure is that selection for other correlated traits associated with higher values were relaxed, allowing for fitness to be maximized in this particular dimension. Similarly, African Americans have a lower frequency of the sickle-cell disease than their ~80% West African ancestry would entail, because without the pressure of endemic malaria selection for the heterozygote was removed, allowing for the purging of the allele from the gene pool.

Nevertheless, the authors do conclude:

Despite our broad-based approach, we found only a few examples of what may be a polygenic response to a single selective pressure. We did use stringent significance criteria which might mean that additional examples can be found among the study groups that did not quite meet our threshold of significance. It may also be that there is something about “GWAS” traits and their underlying genetics that served to undermine our approach.

They have several suggestions for why this didn’t pan out:
- The GWAS variants aren’t the primary source of the variation. It could be copy number variants, rare large effect variants (“synthetic”)

- Epistasis. Gene-gene interaction, which would mask or confound linear associations between variants and traits

- Low impact of selection on GWAS SNPs, or, balancing or negative selection

They finish:

In summary, we have examined 1,336 trait-associated SNPs in the 53 CEPH-HGDP populations looking for individual SNPs and groups of SNPs with unusual allele frequency patterns and elevated iHS scores. We identified 13 different traits with an associated SNP or study group that produced a significantly elevated score for at least one delta, Fst, LLC, or iHS measure, a small percentage of the total number of traits analyzed. We believe that the limited number of positive results could be due to our stringent significance criteria or to features of the genetic architecture of the traits themselves. Specifically, the roles of rare variants, epistasis, and pleiotropy in human complex traits are, although areas of active inquiry, still generally not well understood. Our measures may also not be optimal for detecting all types of selection acting on GWAS traits. It has been speculated that variants underlying complex traits will be influenced primarily by negative or balancing selection, which may not produce extreme values for our measures, particularly if these forces are relatively uniform across populations or are acting on many regions in the genome.

If selective pressures on polygenic traits are so common perhaps genomicists are going to be thumbing through Introduction to Quantitative Genetics. These are traits and evolutionary processes which lack clear distinction. In many ways modeling positive selection and hard sweeps resembles the economics of equilibriums. When it comes to continuous and quantitative traits subject to the effect of many genes a different way of thinking has to come to the fore. The transient no longer becomes a punctuation between the stasis, but the thing in and of itself. There are for example HLA genes in humans which are found in chimpanzees, because the nature of the eternal race between host and pathogen means that all the old tricks are preserved, at least at low frequencies. Human variation in intelligence, height, and all sorts of other liabilities and characteristics, may have always been with us, being buffeted continuously by a swarm of selective pressures. The question is, can our crude statistical methods ever get a grip on this diffuse but all-powerful net?

Citation: Casto AM, & Feldman MW (2011). Genome-Wide Association Study SNPs in the Human Genome Diversity Project Populations: Does Selection Affect Unlinked SNPs with Shared Trait Associations? PLoS Genetics : 10.1371/journal.pgen.1001266

November 10, 2010

The future Indian Yao Ming

Filed under: Genetics,Height,Quantitative Genetics — Razib Khan @ 10:10 pm

In a nation of ~1 billion, even one where a large minority are positively malnourished, you’d expect some really tall people. So not that surprising: NBA Awaits Satnam From India, So Big and Athletic at 14:

In a country of 1.3 billion people, 7-foot, 250-pound Satnam Singh Bhamar has become a beacon for basketball hope.

At age 14.

That potential starts with his size, which is incredible itself. At age 14, he is expected to grow for another couple of years. For now, he wears a size-22 basketball shoe. His hands swallow the ball. His father, Balbir Singh Bhamara, is 7-2. His grandmother on his father’s side is 6-9.

Punjab is one of India’s more prosperous states. Interestingly this kid’s paternal grandmother is as tall in standard deviation units as her son or grandson. In Western developed societies height is 80-90% heritable. That means that there’s very little expected regression back to the population mean for any given child. The article doesn’t mention the mother’s height though. If she is of more normal size then Satnam is either a fluke, or, there are dominant large effect rare alleles being passed down by the father, perhaps from the paternal grandmother.
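The regression-to-the-mean reasoning above can be sketched in a few lines of Python. This is a rough illustration, assuming the standard quantitative-genetics prediction that a child's expected deviation from the population mean is the narrow-sense heritability times the midparent deviation. The heritability of 0.85 and the parental values in standard deviation units are hypothetical, since the article does not report the mother's height.

```python
def expected_offspring_z(father_z, mother_z, h2):
    """Expected offspring height in standard deviation units, given both
    parents' heights in sex-corrected standard deviation units and the
    narrow-sense heritability h2."""
    midparent_z = (father_z + mother_z) / 2
    return h2 * midparent_z


# Hypothetical scenarios: an extremely tall father paired with an equally
# tall mother, versus the same father with a mother of average height.
print(expected_offspring_z(4.0, 4.0, 0.85))
print(expected_offspring_z(4.0, 0.0, 0.85))
```

With two extreme parents the expected child stays close to the parental value. With one average parent the expectation is roughly halved, which is why a mother of normal size would make Satnam look like either a fluke or the carrier of a large effect dominant allele.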

August 17, 2010

Genetics is One: Mendelism and quantitative traits


In the early 20th century there was a rather strange (in hindsight) debate between two groups of biological scientists attempting to understand the basis of inheritance and its relationship to evolutionary processes. The two factions were the biometricians and Mendelians. As indicated by their appellation the Mendelians were partisans of the model of inheritance formulated by Gregor Mendel. Like Mendel many of these individuals were experimentalists, with a rough & ready qualitative understanding of biological processes. William Bateson was arguably the model’s most vociferous promoter. Set against the Mendelians were more mathematically minded thinkers who viewed themselves as the true inheritors of the mantle of Charles Darwin. Though the grand old patron of the biometricians was Francis Galton, the greatest expositor of the school was Karl Pearson.* Pearson, along with the zoologist W. F. R. Weldon, defended Charles Darwin’s conception of evolution by natural selection during the darkest days of what Peter J. Bowler terms “The Eclipse of Darwinism”.** One aspect of Darwin’s theory as laid out in The Origin of Species was gradual change through the operation of natural selection upon extant genetic variation. There was a major problem with the model which Darwin proposed: he could offer no plausible engine in regards to mode of inheritance. Like many of his peers Charles Darwin implicitly assumed a blending model of inheritance, so that the offspring would be an analog constructed about the mean of the parental values. But as any old school boy knows the act of blending diminishes variation! This, along with other concerns, resulted in a general tendency in the late 19th century to accept the brilliance of the idea of evolution as descent with modification, but dismiss the motive engine which Charles Darwin proposed, gradual adaptation via natural selection upon heritable variation.

Mendel’s theory of inheritance rescued Darwinism from the problem of gradual diminution of natural selection’s raw material through the process of sexual reproduction. Yet due to personal and professional rivalries many did not see in Mendelism the salvation of evolutionary theory. Pearson and the biometricians scoffed at Bateson and company’s innumeracy. They also argued that the qualitative distinctions in trait value generated by Mendel’s model could not account for the wide range of continuous traits which were the bread & butter of biometrics, and therefore natural selection itself. Some of the Mendelians also engaged in their own flights of fancy, seeing in large effect mutations which they were generating in the laboratory an opening for the possibility of saltation, and rendering Darwinian gradualism absolutely moot.

There were great passions on both sides. The details are impeccably recounted in Will Provine’s The Origins of Theoretical Population Genetics. Early on in the great debates the statistician G. U. Yule showed how Mendelism could be reconciled with biometrics. But his arguments seem to have fallen on deaf ears. Over time the controversy abated as biometricians gave way to the Mendelians through a process of attrition. Weldon’s death in 1906 was arguably the clearest turning point, but it took a young mathematician to finish the game and fuse Mendelism and biometrics together and lay the seeds for a hybrid theoretical evolutionary genetics.

That young mathematician was R. A. Fisher. Fisher’s magnum opus is The Genetical Theory of Natural Selection, and his debates with the American physiologist and geneticist Sewall Wright laid the groundwork for much of evolutionary biology in the 20th century. Along with J. B. S. Haldane they formed the three-legged population genetic stool upon which the Modern Neo-Darwinian Synthesis would come to rest. Not only was R. A. Fisher a giant within the field of evolutionary biology, but he was also one of the founders of modern statistics. But those accomplishments were of the future; first he had to reconcile Mendelism with the evolutionary biology which came down from Charles Darwin. He did so with such finality that the last embers of the debate were finally doused, and the proponents of Mendelism no longer needed to be doubters of Darwin, and the devotees of Darwin no longer needed to see in the new genetics a threat to their own theory.

One of the major issues at work in the earlier controversies was one of methodological and cognitive incomprehension. William Bateson was a well known mathematical incompetent, and he could not follow the arguments of the biometricians because of their quantitative character. But no matter, he viewed it all as sophistry meant to obscure, not illuminate, and his knowledge of concrete variation in form and the patterns of inheritance suggested that Mendelism was correct. The coterie around Karl Pearson may have slowly been withering, but the powerful tools which the biometricians had pioneered were just waiting to be integrated into a Mendelian framework by the right person. By 1911 R. A. Fisher believed he had done so, though he did not write the paper until 1916, and it was published only in 1918. Titled The Correlation Between Relatives on the Supposition of Mendelian Inheritance, it was dense, and often cryptic in the details. But the title itself is a pointer as to its aim, correlation being a statistical concept pioneered by Francis Galton, and the supposition of Mendelian inheritance being the model he wished to reconcile with classical Darwinism in the biometric tradition. And in this project Fisher had a backer with an unimpeachable pedigree: a son of Charles Darwin himself, Leonard Darwin.

You can find this seminal paper online, at the R. A. Fisher digital archive. Here is the penultimate paragraph:

In general, the hypothesis of cumulative Mendelian factors seems to fit the facts very accurately. The only marked discrepancy from existing published work lies in the correlation for first cousins. Snow, owing apparently to an error, would make this as high as an avuncular correlation; in our opinion it should differ by little from that of the great-grandparent. The values found by Miss Elderton are certainly extremely high, but until we have a record of complete cousinships measured accurately and without selection, it will not be possible to obtain satisfactory numerical evidence on this question. As with cousins, so we may hope that more extensive measurements will gradually lead to values for the other relationship correlations with smaller standard errors. Especially would more accurate determinations of the fraternal correlation make our conclusions more exact.

I have to admit at the best of times that R. A. Fisher can be a difficult prose stylist to follow. One might wish to add from a contemporary vantage point that his language has a quaint and dated feel which compounds the confusion, but the historical record is clear that contemporaries had great difficulty in teasing apart distinct elements in his argument. Much of this was due to the mathematical aspect of his thinking; most biologists were simply not equipped to follow it (as late as the 1950s biologists at Oxford were dismissing Fisher’s work as that of a misguided mathematician according to W. D. Hamilton). In the text of this paper there are the classic jumps and mysterious connections between equations along the chain of derivation which characterize much of mathematics. The problem was particularly acute with Fisher because his thoughts were rather deep and fundamental, and he could hold a great deal of complexity in his mind. Finally, there are extensive tables and computations of correlations of pedigrees from that period drawn from biometric research which seem extraneous to us today, especially if you have Mathematica handy.

But the logic behind The Correlation Between Relatives on the Supposition of Mendelian Inheritance is rather simple: in the patterns of correlations between relatives, and the nature of variance in trait value across those relatives, one could perceive the nature of Mendelian inheritance. It was Mendelian inheritance which could explain most easily the patterns of variation across continuous traits as they were passed down from parent to offspring, and as they manifested across a pedigree. Early on in the paper Fisher observes that a measured correlation between father and son in stature is 0.5. From this one can explain 1/4 of the variance in the height across the set of possible sons. This biological relationship is just a specific instance of the coefficient of determination, how much of the variance in a value, Y (sons’ heights), you can predict from the variance in X (fathers’ heights). Correcting for sex one can do the same for mothers and their sons (and inversely, fathers and their daughters).*** So combining the correlations of the parents to their offspring you can explain about half of the variance in the offspring height in this example (the correlation is higher in contemporary populations, probably because of much better nutrition in the lower orders). But you need not constrain yourself to parent-child correlations. Fisher shows that correlations across many sorts of relationships (e.g., grandparent-grandchild, sibling-sibling, uncle-niece/nephew) have predictive value, though the correlation will be a function of genetic distance.
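The arithmetic in that paragraph can be checked with a short Python sketch. The first function is just the coefficient of determination for one predictor. The second uses the standard formula for the variance explained by two predictors, where the correlation between the parents is a parameter I have introduced for illustration. Setting it to zero (no assortative mating) is my simplifying assumption, not something taken from Fisher's paper.

```python
def r_squared_one_parent(r):
    """Fraction of offspring variance explained by a single parent."""
    return r ** 2


def r_squared_two_parents(r_father, r_mother, r_between_parents=0.0):
    """Fraction of offspring variance explained by both parents jointly,
    using the two-predictor multiple regression formula."""
    numerator = (r_father ** 2 + r_mother ** 2
                 - 2 * r_father * r_mother * r_between_parents)
    return numerator / (1 - r_between_parents ** 2)


print(r_squared_one_parent(0.5))            # one parent, correlation 0.5
print(r_squared_two_parents(0.5, 0.5))      # both parents, assumed uncorrelated
print(r_squared_two_parents(0.5, 0.5, 0.2)) # both parents, hypothetical spousal correlation
```

With uncorrelated parents the two quarters simply add up to a half. If the parents resemble each other, part of what the second parent tells you is already contained in the first, so the joint figure drops below one half.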

What does correlation, a statistical value, have to do with Mendelism? Remember, Fisher argues that it is Mendelism which can explain in the details patterns of correlations on continuous traits. There were peculiarities in the data which biometricians explained with abstruse and ornate models which do not bear repeating, so implausible were the chain of conjectures. It turns out that Mendelism is not only the correct explanation for inheritance, but it is elegant and parsimonious when set next to the alternatives proposed which had equivalent explanatory power. A simple blending model could not explain the complexity of life’s variation, so more complex blending models emerged. But it turned out that a simple Mendelian model explained complexity just as well, and so the epicycles of the biometricians came crashing down. Mendelism was for evolutionary biology what the Copernican model was for planetary astronomy.

To a specific case where Mendelism is handy: in the data Fisher noted that the height of a sibling can explain 54% of the variance of height of other siblings, while the height of parents can explain only 40% of that of their offspring. Why the discrepancy? It is noted in the paper that the difference between identical twins is marginal, and other workers had suggested that the impact of environment could not explain the whole residual (what remains after the genetic component). Though later researchers observe that Fisher’s assumptions here were too strong (or at least the state of the data on human inheritance at the time misled him) the big picture is that siblings have a component of genetic correlation which they share with each other which they do not share with their parents, and that is the fraction accounted for by dominance. When dominance is included in the equation heritability is referred to as the “broad sense,” while when dominance is removed it is termed “narrow sense.”
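A rough Python sketch of why dominance raises the sibling correlation above the parent-offspring correlation follows. It assumes the standard textbook covariances, one half of the additive variance for parent and offspring, and one half of the additive variance plus one quarter of the dominance variance for full siblings, while ignoring shared environment, epistasis, and assortative mating. The variance components are hypothetical and are not meant to reproduce Fisher's figures.

```python
def relative_correlations(v_a, v_d, v_e):
    """Return narrow- and broad-sense heritability plus the expected
    parent-offspring and full-sibling correlations, given additive (v_a),
    dominance (v_d), and environmental (v_e) variance components."""
    v_p = v_a + v_d + v_e                      # total phenotypic variance
    return {
        "narrow_h2": v_a / v_p,                # additive variance only
        "broad_H2": (v_a + v_d) / v_p,         # additive plus dominance
        "parent_offspring": 0.5 * v_a / v_p,
        "full_sibs": (0.5 * v_a + 0.25 * v_d) / v_p,
    }


# Hypothetical variance components for illustration.
for name, value in relative_correlations(0.6, 0.2, 0.2).items():
    print(name, round(value, 3))
```

Whenever the dominance variance is above zero, the sibling correlation exceeds the parent-offspring correlation, and the gap between them is one quarter of the dominance share of the phenotypic variance.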

A concept such as dominance can of course be easily explained by Mendelism, at least formally (the physiological basis of dominance was later a point of contention between Fisher and Sewall Wright). Most of you have seen a Punnett square, whereby heterozygous parents will produce offspring in ratios where 50% are heterozygous, and 25% one homozygote and 25% another. But consider a scenario where one parent is a heterozygote, and the other a homozygote for the dominant trait. Both parents will express the same trait value, as will their offspring. But, there will be a decoupling of the correlation between trait-value and genotype here, as the offspring will be genotypically variant. Parent-offspring correlations along the regression line become distorted by a dominance parameter, and so reduce correlations. In contrast, full siblings share the same dominance effects because they share the same parents and can potentially receive the same identical by descent alleles twice. Consider a rare recessively expressed allele, one for cystic fibrosis. As it is rare in a population in almost all cases where the offspring are homozygotes for the disease causing allele, both parents will be heterozygotes. They will not express the disease because of its recessive character. But 25% of their offspring may because of the nature of Mendelian inheritance. So there’s a major possible disjunction between trait values from the parental to offspring cohorts. On the other hand, each sibling has a 25% chance of expressing the disease, and so the correlation is much higher than that with the parents (who do not express disease). In other words siblings can resemble each other much more than they may resemble either parent! This makes intuitive sense when you consider the inheritance constraints and features of Mendelism in diploid sexual species. But obviously a simple blending model can account for this. What it can not account for is the persistence of variation.
It is through the segregation of independent Mendelian alleles, and their discrete and independent reassortment, that one can see how variation would not only persist from generation to generation, but manifest within families as alleles across loci shake out in different combinations. A simple model of inheritance can then explain two specific phenomena which are very different from each other.
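The crosses described above can be enumerated directly. This is a minimal Python sketch of a single-locus Punnett square. The genotype strings, with "A" for the dominant allele and "a" for the recessive one, and the function names are my own notation for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def punnett(parent1, parent2):
    """Offspring genotype probabilities for one locus.
    Each parent is a two-letter genotype string such as 'Aa'."""
    counts = Counter()
    for allele1, allele2 in product(parent1, parent2):
        # sorted() places uppercase before lowercase, so 'aA' is stored as 'Aa'
        genotype = "".join(sorted(allele1 + allele2))
        counts[genotype] += 1
    total = sum(counts.values())
    return {g: Fraction(n, total) for g, n in counts.items()}


def recessive_fraction(offspring, recessive_allele="a"):
    """Probability that an offspring is homozygous for the recessive allele."""
    return offspring.get(recessive_allele * 2, Fraction(0))


carriers = punnett("Aa", "Aa")      # two unaffected heterozygous parents
mixed = punnett("Aa", "AA")         # heterozygote crossed with dominant homozygote
print(carriers, recessive_fraction(carriers))
print(mixed, recessive_fraction(mixed))
```

The first cross is the cystic fibrosis scenario: neither parent shows the trait, yet a quarter of the offspring are expected to. The second cross shows the decoupling of genotype and trait value, since the offspring differ in genotype while all expressing the dominant phenotype.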

There is much in Fisher’s paper which prefigures later work, and much which is rooted in somewhat shaky pedigrees and biometric research of his day. The take home is that Fisher starts from an a priori Mendelian model, and shows how it could cascade down the chain of inferences and produce the continuous quantitative characteristics we see all around us. From the Hardy-Weinberg principle he drills down through the inexorable layers of logic to generate the formalisms which we associate with heritability, thick with variance terms. The Correlation Between Relatives on the Supposition of Mendelian Inheritance was a marriage between what was biometrics and Mendelism which eventually gave rise to population genetics, and forced the truce between the seeds of that domain and what became quantitative genetics.

As I said, the paper itself is dense, often opaque, and characterized by a prose style that lends itself to exegesis. But I find that it is often useful to see the deep logics behind evolution and genetics laid bare. Some of the issues which we grapple with today in the “post-genomic era” have their intellectual roots in this period, and Fisher’s work which showed that quantitative continuous traits and discrete Mendelian characters were one and the same. The “missing heritability” hinges on the fact that classical statistical techniques tell us that Mendelian inheritance is responsible for the variation of many traits, but modern statistical biology which has recourse to the latest sequencing technology has still not been able to crack that particular nut with satisfaction. Perhaps decades from now biologists will look at the “missing heritability” debate and laugh at the blindness of current researchers, when the answer was right under their noses. Alas, I suspect that we live in the age of Big Science, and a lone genius is unlikely to solve the riddle on his lonesome.

Citation: Fisher, R. A. (1918). On the correlation between relatives on the supposition of Mendelian inheritance Transactions of the Royal Society of Edinburgh

Suggested Reading: The Origins of Theoretical Population Genetics, R.A. Fisher: The Life of a Scientist, and The Genetical Theory of Natural Selection.

* Though I will spare you the details, it may be that the Galtonians were by and large more Galtonian than Galton himself! It seems that Francis Galton was partial to William Bateson’s Mendelian model.

** To be fair, I believe the phrase was originally coined by Julian Huxley.

*** Just use standard deviation units.

Image Credit: Wikimedia

August 11, 2010

The dog’s world of large effect QTLs

A major issue in human genomics over the past few years has been the case of the “missing heritability”. Roughly, we know that for many traits, such as height, most of the variation in the trait within the population is controlled by variation in the genes of the population. The height of your parents is an extremely good predictor of your height in a developed nation. If you’re adopted, the height of your biological parents is an extremely good predictor of your height in a developed nation, not the height of your adoptive parents. Though a new paper claims to have resolved some of the difficulty, one of the major issues in human height genetics has been the lack of large effect quantitative trait loci. In plain English, genes which can each explain a lot of the variation in the trait. Rather, many have posited that continuous quantitative traits like height are controlled by variation in innumerable common genes of small effect size, or, by innumerable rare genes of large effect size. The same may be an issue with personality genetics, or so is claimed by a recent paper unable to find common variants (though an eminent geneticist pointed out in the comments some problems with the paper itself).

One would assume that the same problem would crop up across the tree of life. But a geneticist once told me that he considered biology the science where all rules have exceptions. Many exceptions. A new paper in PLoS Biology paints a fundamentally different picture of the genetic architecture of many morphological traits in the domestic dog, A Simple Genetic Architecture Underlies Morphological Variation in Dogs:

Dogs offer a unique system for the study of genes controlling morphology. DNA from 915 dogs from 80 domestic breeds, as well as a set of feral dogs, was tested at over 60,000 points of variation and the dataset analyzed using novel methods to find loci regulating body size, head shape, leg length, ear position, and a host of other traits. Because each dog breed has undergone strong selection by breeders to have a particular appearance, there is a strong footprint of selection in regions of the genome that are important for controlling traits that define each breed. These analyses identified new regions of the genome, or loci, that are important in controlling body size and shape. Our results, which feature the largest number of domestic dogs studied at such a high level of genetic detail, demonstrate the power of the dog as a model for finding genes that control the body plan of mammals. Further, we show that the remarkable diversity of form in the dog, in contrast to some other species studied to date, appears to have a simple genetic basis dominated by genes of major effect.

The paper uses powerful statistical and computational techniques, but the main results are relatively straightforward (assuming you don’t get stressed out by terms such as “random effect in the linear mixed model”). First, they delved a little into the evolutionary history and the general topography of the genomics of various dog breeds, wolves, as well as stray “village dogs” (I assume these are simply like the “pariah dogs” of India). Though village dogs had domestic ancestors they’ve gone feral, so they’re an interesting contrast with the new breeds created since the 19th century, as well as the wild ancestors of all dogs, wolves.

Three statistics were used to explore demographic history: linkage disequilibrium (LD), runs of homozygosity (ROH), and haplotype diversity. Inbred individuals have many ROH. They may have one individual show up relatively recently in their ancestry over and over, so it makes sense that they’d have many loci where both copies of the gene are identical by descent and state. Obviously purebred dogs have high ROH. They also have low haplotype diversity. Even the average person on the street is familiar with the freakish inbreeding which goes into the production of many purebred canine lineages, and their lower life expectancy vis-à-vis the maligned “mutt.” LD decayed much more quickly in wolves than in the dogs, village and purebred. Remember that LD indicates correlations of alleles across loci. It can be caused by selection at a SNP, which rises in frequency so quickly that huge swaths of the genome adjacent to that particular SNP “hitchhike” along before recombination can break up the association to too great an extent. Admixture between very distinctive populations can also produce LD, which again will decay with time due to recombination. Finally, another way LD can occur is through bottlenecks, which like positive selection can increase particular gene frequencies and their associated genomic regions rather rapidly through stochastic processes. It is the last dynamic which probably applies to all dogs: they went through a major population bottleneck during the domestication process, so the genomic pattern spans village and purebred lineages since it is an echo of their common history. Haplotype diversity, for its part, is simply a measure of the diversity of haplotypes across particular genomic windows. An interesting find in these results is that village dogs actually have lower ROH and higher haplotype diversity than wolves. That suggests that the wolves in this sample went through a major population bottleneck, while village dogs have maintained a larger effective population.
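To make "correlations of alleles across loci" concrete, here is a minimal sketch of the textbook two-locus LD statistics: the disequilibrium coefficient D, its normalized form r², and the standard expectation that D shrinks by a factor of (1 − c) per generation of random mating, where c is the recombination rate between the loci. The haplotype sample and the function names are made up for illustration; this is not the estimator used in the paper.

```python
# Sketch: LD between two biallelic SNPs from phased haplotypes, plus the
# textbook decay of D under recombination. All data below are hypothetical.

def ld_stats(haplotypes):
    """haplotypes: list of (a, b) pairs, alleles coded 0/1 at loci A and B.

    Returns (D, r_squared)."""
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # excess of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0  # 0 if either locus is fixed
    return d, r2

def d_after(d0, recomb_rate, generations):
    """Expected D after t generations of random mating: D_t = D_0 (1 - c)^t."""
    return d0 * (1 - recomb_rate) ** generations

# Hypothetical sample: the two alleles travel together in 8 of 10 haplotypes.
sample = [(1, 1)] * 4 + [(0, 0)] * 4 + [(1, 0), (0, 1)]
d, r2 = ld_stats(sample)
print(f"D = {d:.3f}, r^2 = {r2:.3f}")
print(f"D after 100 generations at c = 0.01: {d_after(d, 0.01, 100):.4f}")
```

A bottleneck or a sweep raises D for many pairs of loci at once; the decay function shows why closely linked pairs (small c) keep the signal far longer than distant ones.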

A general finding from the aforementioned examination is that different breeds tended to be genetically rather distinct. This follows naturally from the origin of modern purebreds as tight and distinct inbred lineages. This genome-wide distinctiveness though is a perfect background condition to test for similarities within the genome which correlate with specific morphological similarities across the breeds. And they did find quite a bit:

We searched for the strongest signals of allelic sharing by scanning for extreme values of Wright’s population differentiation statistic FST…across the breeds. The 11 most extreme FST regions of the dog genome contained SNPs with FST≥0.57 and minor allele frequency (MAF [minor allele frequency -Razib])≥0.15 (Table 1). Six of these regions are strongly linked to genetic variants known to affect canine morphology: the 167 bp insertion in RSPO2 associated with the fur growth and texture…an IGF1 haplotype associated with reduced body size…an inserted retrogene (fgf4) associated with short-leggedness…and three genes known to affect coat color in dogs (ASIP, MC1R, and MITF…Two other high FST regions correspond to CFA10.11465975 and CFA1.97045173, which were associated with body weight and snout proportions, respectively, in previous association studies….Two known coat phenotypes (fur length and fur curl…) also exhibited extreme FST values. Only a limited number of high FST regions were not associated with a known morphological trait (Figure 2, black labels). Here, we focus on illuminating the potential targets of selection for these regions as well as identifying genomic regions that associate with skeletal and skull morphology differences among breeds.
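For readers who want a feel for what an FST of 0.57 means, here is a minimal sketch of the textbook definition for one biallelic SNP: FST = (H_T − H_S) / H_T, where H_T is the expected heterozygosity of the pooled population and H_S is the average expected heterozygosity within subpopulations. It assumes equally weighted breeds and made-up allele frequencies; the paper's actual estimator may well differ.

```python
# Sketch: Wright's FST for a single biallelic SNP from per-breed allele
# frequencies. Breeds are weighted equally; all frequencies are hypothetical.

def fst(freqs):
    """freqs: allele frequency of the same allele in each subpopulation."""
    n = len(freqs)
    p_bar = sum(freqs) / n
    h_total = 2 * p_bar * (1 - p_bar)                 # pooled heterozygosity
    h_within = sum(2 * p * (1 - p) for p in freqs) / n  # mean within-breed
    if h_total == 0:
        return 0.0                                    # allele fixed everywhere
    return (h_total - h_within) / h_total

# A SNP near fixation for opposite alleles in different breeds.
divergent = [0.95, 0.90, 0.05, 0.10]
# A SNP at similar frequency in every breed.
shared = [0.50, 0.55, 0.45, 0.50]

print(f"divergent SNP: FST = {fst(divergent):.3f}")
print(f"shared SNP:    FST = {fst(shared):.3f}")
```

High values arise when breeds are close to fixed for different alleles, which is exactly the footprint that strong selection by breeders would be expected to leave.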

Many of these genes are familiar to you in all likelihood because they have the same functional significance in humans. The key difference is effect size. Since the paper is open access I’ll spare you the alphabet soup of genes and their association with canine morphological traits. There are many of them that pop up by examining differences between breeds in morphology (and similarities) and their allele frequencies. The top line is that a trait can be predicted from just a few genes. They constructed a regression model where a set of independent variables, genes, can predict the value of a given dependent variable, the trait:

Using forward stepwise regression, we combined potential signals into a multi-SNP predictive model for each trait. In the models of body weight, ear type, and the majority of measured traits, most of the variance across breeds could typically be accounted for with three or fewer loci…Correlated traits (e.g., femur length and humerus length) yielded similar SNP associations. For the 55 traits, the mean proportion of variance explained by the top 1-, 2-, and 3-SNP models was R2 = 0.52, 0.63, and 0.67, respectively….After controlling for body size, mean proportion of variance explained by these models was still appreciable—R2 = 0.21, 0.32, and 0.4, respectively.

R2 indicates the proportion of variance in the dependent variable explained by variance in the independent variables. The values for this model are very high. By contrast, a gene for height in humans is a find if it can explain 2% of the trait value variance.
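As a rough illustration of what forward stepwise regression does, here is a minimal sketch in plain Python. At each step it adds the SNP that explains the most remaining variance, then removes that SNP's contribution from the trait and from the other SNPs, and reports the cumulative R2. The SNP names, genotypes, and trait values are invented, and the paper's real analysis (with its mixed model) is considerably more involved.

```python
# Sketch: forward stepwise selection of SNPs by cumulative R^2.
# Genotypes are coded 0/1/2; every name and number below is hypothetical.

def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residualize(v, x):
    """Remove from v its least-squares projection onto x (both centered)."""
    xx = dot(x, x)
    if xx < 1e-12:
        return list(v)
    slope = dot(v, x) / xx
    return [vi - slope * xi for vi, xi in zip(v, x)]

def forward_stepwise(snps, trait, max_snps=3):
    """snps: dict of name -> genotype list. Returns [(name, cumulative R^2)]."""
    resid = center(trait)
    total_ss = dot(resid, resid)
    remaining = {name: center(g) for name, g in snps.items()}
    path = []
    for _ in range(min(max_snps, len(remaining))):
        def gain(name):
            # Sum of squares this SNP adds, given the SNPs already chosen.
            x = remaining[name]
            xx = dot(x, x)
            return dot(resid, x) ** 2 / xx if xx > 1e-12 else 0.0
        best = max(remaining, key=gain)
        x = remaining.pop(best)
        resid = residualize(resid, x)
        # Keep the remaining SNPs orthogonal to everything already chosen.
        remaining = {n: residualize(g, x) for n, g in remaining.items()}
        path.append((best, 1 - dot(resid, resid) / total_ss))
    return path

snps = {
    "snp_a": [0, 0, 1, 1, 2, 2, 0, 2],
    "snp_b": [0, 1, 0, 1, 0, 1, 2, 2],
    "snp_c": [1, 0, 2, 0, 1, 2, 0, 1],
}
# Toy trait built as 10 + 5 * snp_a + 2 * snp_b, so two SNPs fit it exactly.
weight = [10, 12, 15, 17, 20, 22, 14, 24]

for name, r2 in forward_stepwise(snps, weight, max_snps=2):
    print(f"{name}: cumulative R2 = {r2:.3f}")
```

In this toy data one SNP already explains most of the variance and the second finishes the job, which is the qualitative pattern the dog paper reports; in human height data the same procedure would crawl upward in tiny increments.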

The analysis above found SNPs which could explain variation across breeds, which are inbred and highly distinctive in genes and traits. Could the same SNPs explain variance within breeds? Yes:

Most of the variance in body size was explained by the IGF1 locus where we observe a single marker with R2 = 50% and R2 = 17% of variance in breed and village dogs, respectively. The top 3-SNPs explain R2 = 38% of the variance in body weight in village dogs, although the 6-SNP model explains less. The lower R2 in non-breed dogs than breed dogs may be a consequence of lower LD observed in village dogs reducing the strength of association between these markers and the causal body size variants. Alternatively, the lower R2 may also be a consequence of non-genetic factors such as diet or measurement error affecting the observed village dog weights, the smaller range of body sizes observed in the non-breed dog sample, or perhaps to overfitting of the model based on the particular breeds included in the dataset. Nevertheless, R2 = 38% is significantly better than association scans for morphometric traits in humans utilizing denser marker arrays….

Dogs and humans have a long history together. But some of these dogs have a very short history. As noted in the discussion many canine lineages which are purebred are products of Victorian era breeding crazes, and were selected for strange characteristics which were transmitted in a discrete fashion. The recency of the lineages combined with the peculiarities of the breeding programs of this era and dog fanciers generally may explain some of the genetic architecture of canines. The authors note that domestic animals subject to more gradual selection may not, and do not, exhibit the same tendency. Perhaps humans are more like goats or wheat, and less like dogs? The authors note the contrast in loci which exhibit population wide variation:

In humans, high-FST regions are associated with hair and pigmentation phenotypes, disease resistance, and metabolic adaptations…In contrast, the strongest signals of diversifying selection in dogs are all associated with either body size/shape or hair/pigmentation traits, and therefore are unlikely to have been under selection for disease resistance, metabolic adaptations, or behavior. In total, the 11 highest FST regions identified across purebred dogs are all associated with body size/shape or hair phenotypes, including three genomic regions that had not been detected in previous association studies.

The rationale for this study is the utility of dogs as model organisms for humans. They’re taxonomically rather close to us, so their genetics may give us insight into human conditions. The main worry though for me is that the best models here are inbred dogs, where the markers adduced are most valid, but it seems possible they’re the least promising set of models because they have all sorts of genetic peculiarities. But all practicality aside, a fascinating paper.

Image Credit: Jon Radoff and Angela Bull in 2002

Citation: Boyko AR, Quignon P, Li L, Schoenebeck JJ, Degenhardt JD, et al. (2010). A Simple Genetic Architecture Underlies Morphological Variation in Dogs. PLoS Biology: 10.1371/journal.pbio.1000451

August 1, 2010

Was Yao Ming bred?

Filed under: Genetics,Height,Quantitative Genetics,Yao Ming — Razib Khan @ 7:43 am

I knew that Yao Ming’s parents are very tall, though his father, at 6′7, arguably contributed less than his mother, at 6′3, who is farther above the female mean in standard deviation units. But check this out from Superfusion: How China and America Became One Economy and Why the World’s Prosperity Depends on It:

Yao had essentially been bred. Both his parents played basketball. His 6′2 [different height from Wikipedia -Razib] mother, Fang Fengdi, perhaps the tallest woman in China, had been married to an even taller man. She had served as a Red Guard during the height of the Cultural Revolution and had been an ardent Maoist. She enthusiastically participated in the glorious plan of the local government to use her and her husband to produce a sports superstar. The Shanghai authorities who encouraged the match had gone back several generations to ensure that size was embedded in the bloodline. The result was Yao, a baby behemoth who just kept getting bigger.

What’s the chance of Yao? Let’s start with his mother being 6′3, his father being 6′7. Let’s assume that the genetic potentiality of Chinese women leaves a median height of 5′2, and men at 5′8. I suspect I’m low-balling this because there’s likely a fair amount of variability within China, with northerners being taller. Additionally, if Yao’s mother lived through the Cultural Revolution I’m wondering if she and her husband are even at their full height assuming normal nutrition. But let’s go with that. With 2 inches per standard deviation, his mother is 6.5 and his father 5.5 standard deviations above their respective medians, so the parents average 6. With ~85% heritability, you’d expect any of their children to be about 5 standard deviations above the population norm in height (sex corrected). For a male that’s about 6′6 (using the 5′8 figure as the median). Yao’s taller than that. In fact, at 7′6, he’s nearly 6 standard deviations above the expected value. A freak if you will.

I think that indicates that I’m being too conservative about the genetic potential of Yao’s parents, the full median height of the source population from which they derive assuming modern nutrition, and the heritability within Yao’s family. In other words, I assume that the Chinese officials knew that neither of Yao’s parents was a total freak within their lineages, which indicates that there’ll be less regression back to the mean because their height is less likely attributable to non-replicable environmental variables. Still, Yao is freakishly tall in relation to both his parents, so I don’t think he was inevitable. Though of course the odds of someone of Yao’s height being born to his particular set of parents were orders of magnitude higher than for two random Chinese.

Note: To do the back-of-the-envelope calculation I just used the breeder’s equation. So far above the norm there are probably more non-linearities at work, so deviations from the expected values are probably larger. I guess only the Chinese officials who did the genealogical inquiries will know….
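The back-of-the-envelope can be sketched as follows, using only the assumptions stated in the post (medians of 5′2 and 5′8, 2 inches per standard deviation, ~85% heritability). The breeder's equation is R = h² × S, where S is how far the parents sit above the mean and R is the expected deviation of their offspring. The function names are mine. Running it with h² = 1.0 as well shows what the expectation looks like if the shrinkage toward the mean is ignored.

```python
# Sketch: breeder's equation applied to the heights quoted in the post.
# Medians, SD, and heritability are the post's assumptions, not measured data.

def inches(feet, extra):
    return 12 * feet + extra

SD_INCHES = 2.0
MEDIAN = {"female": inches(5, 2), "male": inches(5, 8)}

def sd_units(height, sex):
    """Height expressed as standard deviations above the sex-specific median."""
    return (height - MEDIAN[sex]) / SD_INCHES

def expected_son(mother, father, h2):
    """Return (expected deviation in SD units, expected height in inches)."""
    midparent = (sd_units(mother, "female") + sd_units(father, "male")) / 2
    response = h2 * midparent          # breeder's equation: R = h^2 * S
    return response, MEDIAN["male"] + response * SD_INCHES

yao = inches(7, 6)
for h2 in (0.85, 1.0):
    response, height = expected_son(inches(6, 3), inches(6, 7), h2)
    surplus = (yao - height) / SD_INCHES
    print(f"h2 = {h2}: expected son {response:.1f} SD above the median "
          f"({height:.1f} in); Yao is {surplus:.1f} SD above that")
```

With the shrinkage applied, the expected son comes out a little over 5 standard deviations above the median, or about 6′6; without it, 6 standard deviations, or 6′8. Either way Yao sits five to six standard deviations above what his parents' heights alone would predict.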
