Razib Khan One-stop-shopping for all of my content

November 11, 2012

Reflections on the evolution at ASHG 2012

As most readers know I was at ASHG 2012. I’m going to divide this post in half. First, the generalities of the meeting. And second, specific posters, etc.


- Life Technologies/Ion Torrent apparently hires d-bag bros to represent them at conferences. The poster people were fine, but the guys manning the Ion Torrent Bus were total jackasses if they thought it would be funny/amusing/etc. Human resources acumen is not always a reflection of technological chops, but I sure don’t expect organizational competence if they (HR) thought it was smart to hire guys who thought (the d-bags) it would be amusing to alienate a selection of conference goers at ASHG. Go Affy & Illumina!

- Speaking of sequencing, there were some young companies trying to pitch technologies which will solve the problem of lack of long reads. I’m hopeful, but after the Pacific Biosciences fiasco of the late 2000s, I don’t think there’s a point in putting hopes on any given firm.

- I walked the poster hall, read the titles, and at least skimmed all 3,000+ posters’ abstracts. No surprise that genomics was all over the place. But perhaps a moderate ...

October 21, 2012

Buddy can you spare a selective sweep

The Pith: Natural selection comes in different flavors in its genetic constituents. Some of those constituents are more elusive than others. That makes “reading the label” a non-trivial activity.

As you may know when you look at patterns of variation in the genome of a given organism you can make various inferences from the nature of these patterns. But the power of those inferences is conditional on the details of the real demographic and evolutionary histories, as well as the assumptions made about the models which one is testing. When delving into the domain of population genomics some of the concepts and models may seem abstruse, but the reality is that such details are the stuff of which evolution is built. A new paper in PLoS Genetics may seem excessively esoteric and theoretical, but it speaks to very important processes which shape the evolutionary trajectory of a given population. The paper is titled Distinguishing between Selective Sweeps from Standing Variation and from a De Novo Mutation. Here’s the author summary:

Considerable effort has been devoted to detecting genes that are under natural selection, and hundreds of such genes have been identified in previous studies. Here, we present a method for extending ...

October 25, 2011

The perils of human genomics

A friend pointed me to the heated comment section of this article in Nature, Rebuilding the genome of a hidden ethnicity. The issue is that Nature originally stated that the Taino, the native people of Puerto Rico, were extinct. That resulted in an avalanche of angry comments, which one of the researchers, Carlos Bustamante, felt he had to address. Eventually Nature updated their text:

CORRECTED: This article originally stated that the Taíno were extinct, which is incorrect. Nature apologizes for the offence caused, and has corrected the text to better explain the research project described.

Here’s Wikipedia on the Taino today:

Heritage groups, such as the Jatibonicu Taíno Tribal Nation of Boriken, Puerto Rico (1970), the Taíno Nation of the Antilles (1993), the United Confederation of Taíno People (1998) and El Pueblo Guatu Ma-Cu A Boriken Puerto Rico (2000), have been established to foster Taíno culture. However, it is controversial as to whether these Heritage Groups represent Taíno Culture accurately as some Taino groups are known to ‘adopt’ other native traditions (mainly North American Indian). Many aspects of Taino culture has been lost to time and or blended with Spaniard and African culture on the Caribbean Islands. Peoples who claim to be of native descent in the islands of Puerto Rico, Hispaniola and Eastern Cuba attempt to maintain some form of cultural connection with their historic identities. Antonio de Moya, a Dominican educator, wrote in 1993, “the [Indian] genocide is the big lie of our history… the Dominican Taínos continue to live, 500 years after European contact.”

One of the ways that Taino activists now use to strengthen interest and identity is by the creation of two unique scripts. The scripts are used to write Spanish, not a retained language from pre-Columbian ancestors. The organization Guaka-kú teaches and uses their script among their own members, but the LGTK (Liga Guakía Taína-ké) has promoted their script among elementary and middle school students to strengthen their interest in Taino identity.

It is undeniable that the Amerindian ancestry found in the Caribbean probably derives from that pre-Columbian population. And it may be that there are cultural forms which exhibit unbroken continuity. But it seems that the modern Taino are a re-precipitation out of a cultural milieu whose Amerindian self-identity had gone extinct. By analogy, Argentines have about the same proportion of Amerindian ancestry as Puerto Ricans on a population-wide basis. In fact, over 90% of the Amerindian distinctive ancestry in Argentina is not found in self-identified Amerindians (who do continue to exist as a minority, especially in the South). But to my knowledge for various cultural reasons there has not been a groundswell to shift the Argentine self-conception from being a European settler nation to a mestizo nation, let alone individuals declaring themselves Amerindian.

In comparison to the possibilities which are opened up in this case, the issue of Aboriginal genomics looks rather cut & dried. I suppose we would laugh if some people decided to “reclaim” their Neandertal heritage, but there’s a huge corpus of paleoanthropological scholarship which these individuals could draw upon to reconstruct their identities as Neandertals. It might sound ludicrous, but this is a world where a lot happens that you wouldn’t expect.

July 24, 2011

Why the human X chromosome is less diverse

The Pith: The human X chromosome is subject to more pressure from natural selection, resulting in less genetic diversity. But, the differences in diversity of X chromosomes across human populations seem to be more a function of population history than differences in the power of natural selection across those populations.

In the past few years there has been a finding that the human X chromosome exhibits less genetic diversity than the non-sex regions of the genome, the autosome. Why? On the face of it this might seem inexplicable, but a few basic structural factors derived from the architecture of the human genome present themselves.

First, in males the X chromosome is hemizygous, rendering it more exposed to selection. This is rather straightforward once you move beyond the jargon. Human males have only one copy of genes which express on the X chromosome, because they have only one X chromosome. In contrast, females have two X chromosomes. This is the reason why sex linked traits in humans are disproportionately male. For genes on the X chromosome women can be carriers of many diseases because they have two copies of a gene, and one copy may be functional. In contrast, a male ...

April 16, 2011

Evolution may explain why baby comes early

The Pith: In this post I review a paper which covers the evolutionary dimension of human childbirth. Specifically, the traits and tendencies peculiar to our species, the genes which may underpin those traits and tendencies, and how that may relate to broader public health considerations.

Human babies are special. Unlike the offspring of organisms such as lizards or snakes human babies are exceedingly helpless, and exhibit an incredible amount of neoteny in relation to adults. This is true to some extent for all mammals, but obviously there’s still a difference between a newborn foal and a newborn human. One presumes that the closest analogs to human babies are those of our closest relatives, the “Great Apes.” And certainly the young of chimpanzees exhibit the same element of “cuteness” which is appealing to human adults. Still there is a difference of degree here. As a childophobic friend observed human infants resemble “larvae.” The ultimate and proximate reason for this relative underdevelopment of human newborns is usually attributed to our huge brains, which run up against the limiting factor of the pelvic opening of women. If a ...

February 23, 2011

Sweeping through a fly’s genome

Credit: Karl Magnacca

The Pith: In this post I review some findings of patterns of natural selection within the Drosophila fruit fly genome. I relate them to very similar findings, though in the opposite direction, in human genomics. Different forms of natural selection and their impact on the structure of the genome are also spotlighted over the course of the review. In particular, I explore how specific methods to detect adaptation on the genomic level may be biased by assumptions of classical evolutionary genetic models. Finally, I try to place these details in the broader framework of how best to understand evolutionary process in the “big picture.”

A few days ago I titled a post “The evolution of man is no cartoon”. The reason I titled it such is that as the methods become more refined and our data sets more robust it seems that previously held models of how humans evolved, and evolution’s impact on our genomes, are being refined. Evolutionary genetics at its most elegantly spare can be reduced down to several general parameters. Drift, selection, migration, etc. Exogenous phenomena such as the flux in census size, or ...

January 13, 2011

More bad mutations = greater fitness

Does the chart above strike you as strange? What it shows is that the mean fitness of a population drops as you increase the rate of deleterious mutation (many more mutations are deleterious than favorable)…but at some point the fitness of the population bounces back, despite (or perhaps because of?) the deleterious mutations! This would seem, to me, an illustration of bizarro-world evolution. Worse is better! More is less! Deleterious is favorable? By definition deleterious isn’t favorable, so one would have to back up and check one’s premises.

And yet this seems just what a new paper in PLoS ONE is reporting. Purging Deleterious Mutations under Self Fertilization: Paradoxical Recovery in Fitness with Increasing Mutation Rate in Caenorhabditis elegans:

Compensatory mutations can be more frequent under high mutation rates and may alleviate a portion of the fitness lost due to the accumulation of deleterious mutations through epistatic interactions with deleterious mutations. The prolonged maintenance of tightly linked compensatory and deleterious mutations facilitated by self-fertilization may be responsible for the fitness increase as linkage disequilibrium between the compensatory and deleterious mutations preserves their epistatic interaction.

Got that? OK, you probably need some background first….

The authors used C. elegans as a model organism. This “worm” is ubiquitous in biology. There’s an enormous community of developmental biologists, geneticists, and neuroscientists who work with elegans as a model organism. For the purposes of evolutionary genetics you need to know a few things about elegans though. The vast majority of reproduction of elegans occurs through “selfing.” That is, most elegans are hermaphrodites who fertilize themselves. They’re obviously not asexual, but their habits are straight out of South Park. A small minority of reproductive events among elegans are sexual in a conventional manner, because a few of the worms in any given generation are males. For the purposes of this experiment you need to ignore this aspect; they’re focusing on the selfing. To do this they removed males from the equation, either by introducing a male-killing mutation, xol-1, or by manually removing them.

So now we have just the selfers. If you pick up a standard pop gen text, e.g. Principles of Population Genetics, you’ll find out that selfers tend to have some peculiar and interesting properties when it comes to the long term arc of evolutionary genetics. In particular, they “purge” “genetic load” like crazy. What this means is that deleterious alleles get removed from selfing populations very fast through negative selection. Why? How?

Let’s go back to genetics 101. Imagine a locus where an individual is a heterozygote, and carries an allele which is “wild type” and another which is deleterious, and recessively expressed. Cystic fibrosis is a recessive disease that is common among Europeans. 1 out of 25 Europeans is a heterozygote, and there is a 1 out of 25 chance that these individuals will mate with someone who is also a carrier. Out of these pairings, 50% of the offspring will also be carriers, 25% will be wild type homozygotes, and 25% will express the cystic fibrosis disease because they’re homozygotes for the deleterious allele. With the numbers given that means 1 out of 2,500 births will result in a child with cystic fibrosis.

Cystic fibrosis is a lethal disease which sharply reduces fitness (many individuals are just infertile). This is negative selection against the deleterious allele. But, the selection is relatively weak. Why? Take a look at the ratio between those who carry the allele, but have normal fitness, and those who carry two copies and have reduced fitness. It’s 100 to 1. Most copies of the deleterious allele are “masked” from any negative fitness consequences because they’re paired up with a normal wild type which complements and compensates the function of the mutant variant. This is one reason why we carry so many deleterious alleles; they’re often paired up with a “good” copy which prevents the fitness of the individual from cratering.
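The arithmetic in the last two paragraphs can be checked in a few lines. This is only a sketch using the round numbers quoted above (a 1 in 25 carrier frequency and random mating); the variable names are mine, not anything from the paper.

```python
from fractions import Fraction

# Heterozygote (carrier) frequency quoted in the text.
carrier_freq = Fraction(1, 25)

# Under random mating the two parents are independent draws.
p_both_parents_carriers = carrier_freq * carrier_freq

# Punnett square for a carrier x carrier cross: 1/4 of offspring are
# homozygous for the deleterious allele.
p_affected_given_cross = Fraction(1, 4)

p_affected_birth = p_both_parents_carriers * p_affected_given_cross

# Ratio of masked carriers to affected individuals.
carriers_per_affected = carrier_freq / p_affected_birth

print(p_both_parents_carriers)  # 1/625
print(p_affected_birth)         # 1/2500
print(carriers_per_affected)    # 100
```

Exact fractions are used so that the 1 in 2,500 and 100 to 1 figures fall out without rounding.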

Now let’s bring this back to selfing. In a human population we pair up with others. So you have to multiply independent probabilities, 1/25 × 1/25, to produce a Punnett square where two heterozygotes are crossed. In a scenario of selfing the probabilities are different. There’s perfect assorting of genotype to genotype for selfers, because the genotypes are simply being crossed with themselves. If you’re a fertile hermaphrodite who carries the mutant cystic fibrosis allele there’s a 25% chance that your offspring will be homozygotes for cystic fibrosis, because you know that the cross will be with another heterozygote (yourself). Now imagine that the whole population consists of selfers. Instead of roughly 1 out of 50 copies of the allele (1 out of every 100 or so individuals carrying it) being exposed to selection, 1 out of 2 copies are exposed to selection! This is how selfers purge genetic load so well. When selection only operates on homozygotes, their tendency to produce homozygotes means that deleterious alleles are far more exposed to selection. Why do selfing populations in the aggregate produce so many homozygotes? Heterozygotes mating with heterozygotes produce both heterozygotes and homozygotes. Homozygotes mating with homozygotes produce only homozygotes. The “toy” chart I’ve put together shows what happens when you take a uniform population of heterozygote selfers in generation 1, and allow them to reproduce down the generations. Each generation the proportion of heterozygotes, those individuals where deleterious alleles are masked and so protected from the purging power of natural selection, decreases. Selection becomes more and more efficacious in purging genetic load from the population.
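For anyone who wants to reproduce the numbers behind a toy chart like this, here is a minimal sketch. It assumes pure selfing at a single locus and, to keep things simple, no selection at all, so it only tracks how Mendelian segregation reshuffles genotype frequencies. The function names are hypothetical.

```python
from fractions import Fraction

def self_one_generation(freqs):
    """Advance (AA, Aa, aa) genotype frequencies by one generation of pure selfing.

    Homozygotes breed true; each selfed heterozygote yields 1/4 AA, 1/2 Aa, 1/4 aa.
    Selection is deliberately ignored in this sketch.
    """
    AA, Aa, aa = freqs
    return (AA + Aa / 4, Aa / 2, aa + Aa / 4)

def selfing_trajectory(generations, start=(Fraction(0), Fraction(1), Fraction(0))):
    """Return the list of genotype frequencies, beginning with an all-heterozygote population."""
    trajectory = [start]
    for _ in range(generations):
        trajectory.append(self_one_generation(trajectory[-1]))
    return trajectory

# Generation 1 is the uniform heterozygote population.
for gen, (AA, Aa, aa) in enumerate(selfing_trajectory(5), start=1):
    print(f"generation {gen}: AA={float(AA):.4f} Aa={float(Aa):.4f} aa={float(aa):.4f}")
```

The heterozygote class halves every generation, which is the whole story: the fraction of deleterious copies hiding in heterozygotes shrinks geometrically.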

There are still two other concepts important to understanding the implications of this paper. Epistasis and genetic linkage. But let’s move on to some results first, and then digest them with a further helping of conceptual condiments. Here are figures 3 & 4, which I’ve reedited a bit. On the left you see fitness (fecundity) as a function of the concentration of mutagen. In other words, as you move up the mutagen concentration on the x-axis the mutation rates are increasing. On the right you see a plot which shows the mean fitness after x # of generations, where each set of data points represents a different concentration of the mutagen. I’ve highlighted the lines with no mutagen, and maximal mutagen.

The bizarro aspect is the jump between 80 mM and 100 mM. As mutation rates increase there is a bounce back of fitness. Imagine that you were rolling a boulder up an incline which got progressively steeper in its grade. Common sense and basic physics would tell you that you’d have to use more and more force to move the boulder the same distance. Now imagine that beyond a certain grade of steepness you actually had to use less force! That would make no sense. In some ways that’s what’s going on here. But then, evolutionary processes may not be so linear and predictable as Newtonian mechanics.

Of course there could be some straightforward reasons for this strange behavior. For example, the xol-1 mutant which produced maleless populations may have had pleiotropic effects. To test for this they manually removed males from a population without the mutation, and obtained similar results. Additionally, they also took a divergent elegans line with the xol-1 mutant and performed the same experiments, and again the same pattern recapitulated itself. Finally, there’s always the possibility that resistance to the mutagen had developed above a certain concentration. If resistance to the mutagen had developed, presumably taking the population which had exhibited the increased fitness at ~100 mM, and placing it back into lower concentration environments, would produce a different response curve than we saw before. That is not what occurred, as you can see in figure 6.

Now that we have the core results under our belt, let’s move on to trying to make sense of how water can flow uphill like this. So back to the concepts, genetic linkage, and epistasis. The first is easy. Genes are arrayed along physical DNA strands. The closer the physical position of the genes, the more likely they are inherited together in a straightforward fashion. The kink in the expectation is recombination. In diploid organisms you have two copies of each gene on the two strands. Recombination can shuffle specific gene copies from one strand to the other (or, more accurately, break and recombine strands in a fashion so that both differ from the state before the event). The further the distance between any two gene copies on a physical strand, the greater the likelihood for recombination to separate the two. When two copies are very close there’s only a small physical distance across which recombination might operate to separate them. Therefore, the closer the copies the more “linked” the genes are.

Before explaining why this matters, let’s talk about epistasis. Epistasis can be thought of generally as gene-gene interaction. In the mechanistic molecular sense you’re referring to biophysical processes whereby one gene has some interaction with another gene. But there’s another way to think about it: fitness or trait value. In this sense epistasis as gene-gene interaction introduces non-linearities into the mapping of genotype to phenotype, as well as genotype to fitness. This is what matters for the purposes of this paper. In particular, epistasis manifesting as compensatory deleterious mutations.

So how does this matter for selfers? Recall that above we were talking about how selfers purge deleterious genetic load by cranking up the proportion of homozygotes exposed to negative selection. Implicitly our model was single locus. We were looking at one gene, and one mutant. But how about if you had a large number of mutants? Can selfers produce all those homozygotes simultaneously, and so purge the load efficiently? Purging load through natural selection entails reduced fitness for many members of a population; purge too much and the population crashes and you’re liable to just go extinct through mutational meltdown. This is where linkage and recombination come back to the fore. Recombination is often thought of as a way to create new genetic combinations. But in homozygous selfing lineages recombination doesn’t live up to that promise: there’s not enough heterozygosity within the genomes of these organisms so that the shuffling of the strands across each other produces anything new! Selfing lineages exhibit very strong linkage between sequences of genetic variants across loci because of the inability of recombination to break apart associations. So, if you have two genes, A and B, which are linked, and A is very fit and B is moderately unfit, if they are co-inherited B may sweep up to fixation with A. As you crank up mutation rates then the theory predicts that deleterious alleles will simply swamp out the ability of selfing lineages to purge the load fast enough to prevent ultimate extinction. Even if the genetic background wasn’t homozygous, too many mutations within the genome would be swapping out deleterious copies for other deleterious copies during recombination.

That theory was borne out more or less at concentrations of the mutagen below 100 mM. But then expectation was confounded. Why? This is where epistasis steps into the picture. In the previous model we implicitly assumed an additive model. Imagine the fitness of allele 1 at gene A ~ 3 and the fitness of allele 2 at gene B ~ -2. Summing them together ~ 1. And so on. Epistasis confuses this simple picture because it implies non-linear computations. The fitness value of A and B may be conditional on the state of a third gene, C. In any case, a compensatory mutation is one where more deleterious is in fact less deleterious. Precisely, having two deleterious mutations may actually have less of a fitness hit than having one deleterious mutation! In some ways this becomes a matter of semantics and analytic philosophy. -10 + -10 > -10 is just incoherent.
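To make the contrast concrete, here is a toy sketch of additive versus epistatic fitness. The additive numbers (3 and -2) are the ones used above; the -10 effects and the +15 interaction term are hypothetical values I picked purely for illustration, not estimates from the paper.

```python
def additive_fitness(effects):
    """Additive model: the fitness contribution is just the sum of per-allele effects."""
    return sum(effects)

def epistatic_fitness(has_m1, has_m2, effect_m1=-10, effect_m2=-10, interaction=15):
    """Two deleterious mutations plus an interaction term that applies only
    when both are present together (all values are hypothetical)."""
    total = 0
    if has_m1:
        total += effect_m1
    if has_m2:
        total += effect_m2
    if has_m1 and has_m2:
        total += interaction  # compensatory epistasis
    return total

print(additive_fitness([3, -2]))       # 1
print(epistatic_fitness(True, False))  # -10
print(epistatic_fitness(True, True))   # -5
```

With the interaction term, carrying both mutations (-5) is better than carrying either alone (-10). That is the sense in which "-10 + -10 > -10" stops being incoherent: the plus sign was never the right operator.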

Since this is not a philosophy blog, how does this relate to selfing lineages? It goes back to linkage. Recall that tight linkage may produce situations where recombination cannot break apart unhealthy associations where favorable variants are linked with unfavorable ones, and the latter may hitchhike with the former in selective sweeps (in populations with more heterozygosity recombination would increase the range of combinations across which selection operated; see Muller’s ratchet). This is the bad. But in the case of compensatory mutations the inability of recombination to break apart associations may be a positive. These epistatic interactions are contingent on robust combinations persisting. Recombination would break apart those combinations, preventing the fitness gains from persisting across generations. But in these selfing lineages the homogenized genetic backgrounds are relatively fixed palettes against which these mysterious genetic interactions which turn expectations upside down can perform their magic.

This paper had some moderately weird results. The response to mutagen concentration increases seemed robust within their set of experiments, but who knows how general this phenomenon is? A reliance on compensatory mutation also strikes me as only less weird because the results were so weird. In the last paragraph the authors seem to acknowledge the general strangeness at work:

Regardless of the mechanism driving the fitness increase exhibited by populations exposed to 100 mM EMS, the result is a testament to the resiliency of the genome. Consistent exposure to high mutation rates should wreak havoc on the genome, and repeated exposure to 80 mM EMS (Figure 5) appears to do just that. However, the genome is able to recover a large proportion of the fitness lost at 80 mM EMS when exposed to 100 mM EMS (Figure 3). This result is quite surprising and challenges the long-held beliefs concerning the relationship between mutation rates and fitness.

The long-held belief presumably being that high mutation rates are correlated with decreased mean fitness, and ultimately likely extinction. A great deal of post-apocalyptic fiction from the Cold War period was predicated on just this assumption. And clearly in most cases this seems to be a warranted axiom. On the other hand, sometimes in biology the minor exceptions are more important in explaining the patterns of diversity we see around us. If there was a veil of ignorance over us and we had to predict the nature of replicating organisms on this planet would we predict the incredible diversity we see all around us? Would we predict intelligent life? I suspect that there would be the preference for a simple and elegant model where life on earth was optimized toward extremely simple and highly robust rapid replicators. Prokaryotes. And to a first approximation that logical inference based on Darwinian assumptions would be correct. Prokaryotes are omnipresent. In fact, some estimate that there are 10 times as many bacterial cells within a human body as human cells. But obviously there are creatures on the Earth besides prokaryotes. And we care a great deal about this “residual” from the expected trend line….

Citation: Morran LT, Ohdera AH, & Phillips PC (2010). Purging Deleterious Mutations under Self Fertilization: Paradoxical Recovery in Fitness with Increasing Mutation Rate in Caenorhabditis elegans. PloS one, 5 (12) PMID: 21217820

November 17, 2010

Homozygosity runs in the family (or not)

The number 1 gets a lot more press than -1, and the concept of heterozygosity gets more attention than homozygosity. Concretely the difference between the latter two is rather straightforward. In diploid organisms the genes come in duplicates. If the alleles are the same, then they’re homozygous. If they’re different, then they’re heterozygous. Sex chromosomes can be an exception to this because in the heterogametic sex you generally have only one copy of a gene as one of the chromosomes is sharply truncated. This is why human males are subject to X-linked recessive traits at such a great frequency in comparison to females; recessive expression is irrelevant when you don’t have a compensatory X chromosome to mask the malfunction of one allele.

Of course recessive traits are not simply a function of sex-linked traits. Consider microcephaly, an autosomal recessive disease. To manifest the trait you need two malfunctioning copies of the gene, one from each parent. In other words, you exhibit a homozygous genotype with two mutant copies. I suspect that this particularly common context of homozygosity, recessive autosomal diseases, is one reason why it is less commonly discussed outside of specialist circles: there is a whole cluster of medical and social factors which lead to homozygosity which are already the focus of attention. The genetic architecture of the trait is of less note than the etiology of the disease and the possible reasons in the family’s background which might have increased the risk probability, especially inbreeding. In contrast heterozygosity is generally not so disastrous. Even if functionality is not 100%, it is close enough for “government work.” The deleterious consequences of a malfunctioning allele are masked by the “wild type” good copy. The exceptions are in areas such as breeding for hybrid vigor, when heterozygote advantage may be coming to the fore. The details of complementation of two alleles matter a great deal to the bottom line, and the concept of hybrid vigor has percolated out to the general public, with the more informed being cognizant of heterozygosity.

But homozygosity is of interest beyond the unfortunate instances when it is connected to a recessive disease. Like heterozygosity, homozygosity exists in spades across our genome. My 23andMe sample comes up as 67.6% homozygous on my SNPs (which are biased toward ~500,000 base pairs which tend to have population wide variation), while Dr. Daniel MacArthur’s results show him to be 68.1% homozygous across his SNPs. This is not atypical for outbred individuals. In contrast someone whose parents were first cousins can come up as ~72% homozygous. This is important: zygosity is not telling you simply about the state of two alleles, in this case base pairs, it may also be telling you about the descent of two alleles. Obviously this is not always clear on the base pair level; mutations happen frequently enough that even if you carry two minor alleles it is not necessarily evidence that they’re identical by descent (IBD), or autozygous (just a term which denotes ancestry of the alleles from the same original copy). What you need to look for are genome-wide patterns of homozygosity, in particular “runs of homozygosity” (ROH). These are long sequences biased toward homozygous genotypes.

What ROH can tell you about an individual, and perhaps a population, becomes more clear when you conceptualize in your mind’s eye the basic dynamics which occur in the course of biological replication in diploid sexual organisms. Each individual receives half their autosomal genome from each parent. Though genes are abstractions, individual units at the root of a complex causal sequence which maps to a phenotype, a trait, they’re also physical entities embedded within the structure of DNA. This structure is a physical sequence, whereby you have adjacent base pairs, clusters of which define genes, intergenic regions, exons, introns, promoters, etc. In other words, the whole alphabet soup of molecular genetics. The spatial relationship of genes to each other along the chromosome allowed for linkage mapping decades before the biophysical substrate of DNA was known to be critical to the whole process. Particular sequences of alleles may therefore be inherited together, and form a haplotype. Over the generations the associations of these distinctive alleles in haplotypes dissolve through recombination, a physical process which erodes the structural integrity of chromosomal sequences.

With these basics in mind, let’s move to a specific repulsive example. Imagine a father who impregnates his daughter. Why is this repulsive to us? From a consequential “gene’s eye” perspective the father is suborning the beauty of sexual reproduction whereby genetic variation is mixed & matched across individuals. Colloquially, where the daughter would be 50% of the father genetically, the child of the daughter and her father would be 75% of the father genetically. From a gene-only perspective this may be favorable, as the father is coming closer to cloning himself, but we all know that the rate of breakdown of the “vehicle” in these individuals is high. Why? Inbreeding leads to a relatively massive increase in homozygosity as chromosomal segments identical by descent are paired off against each other. We know that the problem is that a host of nasty recessive diseases are highly likely in inbred individuals.

All humans carry a large load of deleterious alleles. Some of these may be potentially lethal. But like bombs without the trigger a functional copy of the alleles complements and masks the mutant variety and we carry on. Many of these mutants are particular to our family, and some of them are private even to ourselves, the outcome of de novo mutations which make each human a distinctive genetic island (at least until they reproduce and pass on their mutational distinctions). Therefore a man who mixes his own genes together in the act of incest is potentially lighting the fuse whereby these hidden malevolent mutants will explode from being cryptic genetic abnormalities toward full-blown disease monstrosities.

One statistic which would register incest would be ROH; naturally when you have long regions of recently IBD chromosomal segments adjacent to each other you’ll have a lot of homozygosity, since the paired alleles are replica copies. Assuming that an individual with many long ROH can survive and reproduce over time these massive swaths of homogeneity will be wiped away by mutation and recombination as well as outbreeding. Incest is still arguably a health disaster, but one can imagine the motive genetic engines of evolutionary variation healing the damage over time.

And it doesn’t have to be so extreme. Father-daughter or sibling incest is only a boundary condition. First cousin marriages aren’t nearly as disastrous, the fecundity of British Pakistanis despite higher rates of genetic abnormalities being clear evidence of this. They are certainly more evolutionarily fit than non-Pakistani Brits, who do not reproduce at the clip of 4 children per family. These clans will exhibit more modest levels of ROH because the coefficient of relationship between cousins is only 1/8, as opposed to 1/2 between parents and children or full siblings.
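The fractions quoted above come from counting pedigree paths: each path of n parent-child links through a common ancestor contributes (1/2)^n. A minimal sketch, assuming the common ancestors are not themselves inbred; the function name and the path lists are illustrative, not taken from either paper:

```python
from fractions import Fraction

def coefficient_of_relationship(path_lengths):
    """Sum (1/2)**n over every pedigree path linking two relatives.

    Each entry in path_lengths is the number of parent-child links (n)
    on one path running through one common ancestor. Assumes the common
    ancestors are not themselves inbred.
    """
    return sum(Fraction(1, 2) ** n for n in path_lengths)

# Parent and child: one direct path of a single link.
parent_child = coefficient_of_relationship([1])
# Full siblings: two paths (via mother, via father), two links each.
full_siblings = coefficient_of_relationship([2, 2])
# First cousins: two shared grandparents, four links per path.
first_cousins = coefficient_of_relationship([4, 4])

# The inbreeding coefficient of a child is half the relationship
# coefficient of its parents (same non-inbred-ancestor assumption).
child_of_first_cousins = first_cousins / 2

print(parent_child, full_siblings, first_cousins, child_of_first_cousins)
```

When earlier generations also married cousins, extra paths get added to the list, which is why clans with repeated cousin marriage end up more related than the simple 1/8 suggests.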

The figure to the left is from a 2008 paper on ROH in Europeans. Specifically these are Orcadians or part-Orcadians, a population you should be familiar with from the HGDP panel. Orcadians are natives of the Orkney islands just off the north coast of Scotland. Though of somewhat diverse origins, Viking, Scot and Pict, being islanders they’ve developed their own genetic peculiarities because of their isolation. A good rule of thumb is that any body of water is a fearsome barrier to casual gene flow. On the y-axis you see the total number of ROH in the genome of a given individual. I point you to the methods if you are curious as to the exact parameters they specified in their calculation. ROH is assessed over a window of the genome, and naturally one can vary its width, as well as the stringency in registering a particular region as a run or not a run. On the x-axis is the total length of those ROH in base pairs. What you see is a positive correlation between the number of ROH and their total genomic length. Those Orcadians who are genetically more diverse because of non-Orcadian parentage have the least homozygosity in their genomes. Those who are products of recent cousin marriage have the most. But notice a peculiar pattern: there’s a curvilinear trend to the values. In those individuals who presumably have very high inbreeding coefficients the total length of ROH seems to exceed one’s expectation based on just the total number of ROH. Why? Because they have very long runs of homozygosity indeed. This is just what we’d expect from the sort of process I described earlier, where it takes many generations for the long chromosomal sequences to be broken apart by recombination.
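To make the two axes concrete, here is a deliberately naive sketch of how one might count ROH and sum their length from genotype calls. The thresholds `min_snps` and `min_length_bp` and the toy data are hypothetical, not the parameters the authors used, and real callers also tolerate the occasional heterozygous or missing call within a run:

```python
def find_roh(positions, genotypes, min_snps=3, min_length_bp=1000):
    """Return (start_bp, end_bp, n_snps) for each run of homozygous calls.

    positions: sorted base-pair coordinates of SNPs on one chromosome.
    genotypes: matching list of two-character calls such as "AA" or "AG".
    min_snps and min_length_bp are illustrative thresholds only.
    """
    runs = []
    start = None  # index of the first SNP in the current run
    # Append a heterozygous sentinel so a run reaching the end is flushed.
    for i, call in enumerate(list(genotypes) + ["AB"]):
        homozygous = call[0] == call[1]
        if homozygous and start is None:
            start = i
        elif not homozygous and start is not None:
            n_snps = i - start
            length = positions[i - 1] - positions[start]
            if n_snps >= min_snps and length >= min_length_bp:
                runs.append((positions[start], positions[i - 1], n_snps))
            start = None
    return runs

# Toy data: one qualifying run of four SNPs, then a run too short to count.
positions = [100, 900, 2500, 4000, 5200, 6000, 9000, 9500]
genotypes = ["AA", "GG", "CC", "TT", "AG", "CC", "TT", "CT"]
roh = find_roh(positions, genotypes)

number_of_roh = len(roh)                                  # y-axis quantity
total_length = sum(end - begin for begin, end, _ in roh)  # x-axis quantity
```

Raising the thresholds discards short runs, which is the sense in which the width and stringency choices change what gets registered as a run.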

Before I get you too excited about the genetics of European homozygosity, let’s take a wider view. Some of the same researchers who published the paper above have come out with a set of results which survey the world. Genomic Runs of Homozygosity Record Population History and Consanguinity:

The human genome is characterised by many runs of homozygous genotypes, where identical haplotypes were inherited from each parent. The length of each run is determined partly by the number of generations since the common ancestor: offspring of cousin marriages have long runs of homozygosity (ROH), while the numerous shorter tracts relate to shared ancestry tens and hundreds of generations ago. Human populations have experienced a wide range of demographic histories and hold diverse cultural attitudes to consanguinity. In a global population dataset, genome-wide analysis of long and shorter ROH allows categorisation of the mainly indigenous populations sampled here into four major groups in which the majority of the population are inferred to have: (a) recent parental relatedness (south and west Asians); (b) shared parental ancestry arising hundreds to thousands of years ago through long term isolation and restricted effective population size (Ne), but little recent inbreeding (Oceanians); (c) both ancient and recent parental relatedness (Native Americans); and (d) only the background level of shared ancestry relating to continental Ne(predominantly urban Europeans and East Asians; lowest of all in sub-Saharan African agriculturalists), and the occasional cryptically inbred individual. Moreover, individuals can be positioned along axes representing this demographic historic space. Long runs of homozygosity are therefore a globally widespread and under-appreciated characteristic of our genomes, which record past consanguinity and population isolation and provide a distinctive record of the demographic history of an individual’s ancestors. Individual ROH measures will also allow quantification of the disease risk arising from polygenic recessive effects.

Their data set consists of the HGDP sample populations, so you naturally have the broad geographic clusters such as Africa, Europe, West Asia, Central/South Asia, East Asia, Oceania, and the New World. Two big dynamics are superimposed upon each other in the patterns of ROH: “deep history” demographic processes such as bottlenecks and population expansions, and cultural anthropological patterns which we see around us such as cousin marriage within inbred clans. To find the former you need to survey the genome finely. In contrast the latter leaves pretty obvious signs genomically in the form of very long ROH, as well as clusters of recessive diseases.

The first figure shows the distribution of different lengths of ROH by population:


Here’s the take away:

- Oceanians have many short ROH, but as you increase the length of ROH threshold they are not exceptional at all

- The New World samples persist in having a disproportionate number of ROH no matter the length, though the number does drop as you increase the length threshold. This makes sense: the human genome is of finite length and you can only have so many very long ROH

- The West Asian and Central/South Asian populations seem to have more long ROHs than the other Eurasian or African groups, though they’re not exceptional in the lowest category

- The Africans have the least ROH, especially in the category of very short runs

Before I comment on these patterns in detail, let’s quickly check out the next figure. It looks at Africans only, but divides the sample into those which are hunter-gatherers and those which are agriculturalists.


The hunter-gatherers have more, and longer, ROH than the agriculturalists. Why? The answer in large part explains the geographical patterns as well: the agriculturalists have had a larger long term effective population. Effective population just refers to the proportion of the population which contributes genetically to the next generation. Small effective populations mean a lot of genetic drift because of increased sample variance, and tend to converge upon consanguinity. If your tribe is small enough the only people you may find to marry are your cousins. As I noted above, this will produce long ROH as individuals will have descent through multiple lines from the same ancestor, increasing the probability of autozygosity greatly. The same process explains why West Asians and Central/South Asians are enriched for long ROH relative to other groups excepting Amerindians. Here’s a map from Consang.net:


Many Muslim societies practice cousin marriage, and many Muslims even argue that it is the Islamic practice (Muhammad married one of his cousins among his many wives. Strangely somehow these Muslims don’t argue that it is also the Muslim custom to marry old rich widows, though some do argue for the importance of marrying barely pubescent girls). Additionally, in India many Hindu groups in the South practice consanguineous marriages, including uncle-niece marriage. This is all occurring now, and so produces signatures of long ROH in many families. The final figure breaks down the individuals from selected populations, with again the y-axis being the number of ROH and x-axis being total length of the ROH:


The population sets are representative of broader geographic clusters. The Karitiana are from the Amazon, the Mandenka from Senegal, and the Balochi from Pakistan. If you don’t know where the French and Japanese are from, I would ask you never to leave a comment on this weblog. Notice that a few French, Mandenka, and Japanese individuals deviate from their main clusters. These are cryptically inbred; perhaps their parents were cousins, or some of their grandparents were cousins. In contrast the Baloch have a wide range in terms of length of ROH; this is typical of populations where a large proportion of individuals are the products of cousin marriage, but many are not. The fact that individuals would exhibit a large variance of expected relatedness between their parents means that their own inbreeding coefficients and the genomic correlates (in this case ROH) would also vary greatly. The same parameter is operative among the Karitiana, an endangered ethnic group which presumably has a small “mate market” available to each individual.

So what about the Papuans? Their cluster is tight, and they don’t have nearly the total length of ROH as the Amazonian tribe. But remember that in the first figure they had many short ROH. A plausible explanation for this is that the Papuans went through an ancient bottleneck, from which they have expanded. The bottleneck increased genetic drift and so generated highly common haplotype blocks which combined to produce runs of homozygosity. But over time these blocks would have disintegrated through mutation and recombination. ROH in the Papuans then is simply a shadow of demographic events past, while ROH in Baloch is evidence of demographic events present.

These two balancing realities are starkly illustrated in the supplements when you drill down to the South and Central Asian groups. In the figure it is clear that the group with the consistently highest number of ROH are the Kalash. This makes sense. The Kalash are a genetic isolate because they’re traditionally a pagan non-Muslim group isolated in the remote Chitral region of Pakistan. Because Muslims cannot join their tribe, for over a thousand years the gene flow has been unidirectional, as the Kalash convert to Islam and so assimilate into the broader Pakistani society. In contrast the other Pakistani groups have a huge variance in the total amount of ROH. The individuals with the least ROH in both total length and number in the sample are Baloch, Brahui and Makrani, as are some of the individuals with the highest values on these statistics! While the Kalash have been slowly and consistently ground down by the pressure of small population size, the Baloch, Brahui, and Makrani are subject to the hammer-blows of several generations of first cousin marriages in inbred clans. These repeated marriages across the generations rapidly increase the ROH as first cousins may be more closely related to each other genetically than they are anthropologically.

In the pre-genomic era it was simple to calculate inbreeding. Just look at pedigrees. From this you derived the inbreeding coefficient. The key is to remember that the relationships among the sum totality of one’s ancestors were critical in this calculation. In the USA marriages between first cousins occur between individuals whose grandparents are not usually related. But in other societies the generation of the grandparents, and perhaps great-grandparents, may also have been cousins. But pedigrees have limits, and may miss deep ancestry. The figure to the left, from the first paper I referenced, shows the relationship of the proportion of an individual’s ancestry which is identical by descent as calculated by genomic (ROH) methods on the y-axis and conventional ones on the x-axis (pedigree). There’s an obvious correlation, but observe the slight bias toward values above the line of best fit, and the fact that the y values are higher than the x. Genomic estimates capture common ancestry which lay outside the purview of conventional genealogy!
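The comparison between the two estimates can be sketched in a few lines. The genomic estimate is simply the fraction of the surveyed genome lying in ROH; the ROH lengths and the round 3 billion base pair genome length below are hypothetical values chosen for illustration, not data from the paper:

```python
def f_roh(roh_lengths_bp, genome_length_bp):
    """Genomic inbreeding estimate: fraction of the genome inside ROH."""
    return sum(roh_lengths_bp) / genome_length_bp

# Pedigree expectation for the child of first cousins whose own
# ancestors were unrelated: 1/16.
pedigree_f = 1 / 16

# Hypothetical individual: three long ROH found in a genome treated
# as 3 billion base pairs long.
genomic_f = f_roh([50_000_000, 70_000_000, 90_000_000], 3_000_000_000)

# Any excess over the pedigree value reflects shared ancestry that the
# recorded genealogy does not reach (plus chance variation in inheritance).
excess = genomic_f - pedigree_f
print(f"pedigree F = {pedigree_f:.4f}, genomic F = {genomic_f:.4f}")
```

In this made-up case the genomic value sits above the pedigree value, which is the direction of the bias visible in the figure.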

The implications of these patterns are two-fold: first, looking backward toward human history, and second, forward toward biomedical science. Patterns of ROH here are roughly in line with a serial bottleneck model Out of Africa; the further populations are from Africa the more short ROH they have. African populations have the least of these because of their larger long term effective population size, and relative insulation from the bottlenecking process. A shorter term phenomenon is that of consanguineous marriage patterns, whether conscious and culturally normative (as in the Muslim world and parts of South Asia), or due to demographic constraint, as is the case among hunter-gatherers. These two processes together are relevant because of the prominence of recessive diseases within the domain of medical genetics. Clearly very long ROH is a sign of inbreeding, and so a likely higher susceptibility of an individual to a host of ailments. But the authors note that the sum effect of many short ROH may also be problematic, especially due to the fact that these together may form the preponderance of the ROH within the genomes of many populations.

So far I’ve basically alluded to demographic history, and how it shapes the genome through processes which are fundamentally neutral and stochastic. Inbreeding itself can be thought of as a form of super-charged drift, as the long term effective population of a breeding group collapses in on itself. But what about natural selection? I decided to take a closer look at the ROH of Dr. Daniel MacArthur of Genomes Unzipped. One of his longest regions is on Chromosome 2, is about 2 Mb in length, and runs from position 134606441 to position 136593184. In 23andMe there’s a position which I think might explain this: 136325116. That’s the number for rs4988235 in the 23andMe data file. Variation on this SNP tracks lactase persistence in Europeans. Dr. Daniel MacArthur has the genotype for lactase persistence in the homozygote form. Are we seeing in this long ROH the long haplotype associated with lactase persistence, which rose rapidly in frequency in the last 10,000 years because of natural selection? In general the parameters outlined in the paper satisfy the broad sketch of human history, but there may be interesting detail on the margins left out of the picture.

Finally, let’s go back to heterozygosity vs. homozygosity. I recently watched the documentary “Is it Better to be Mixed Race?” Setting aside the obvious reality that this sort of program reflects the Zeitgeist of the era (it is rather obvious that a Victorian scientist could have produced a different documentary, even with the same evidence), near the end there is a comparison of ROH across populations and individuals. The comparison was actually done by the research group which published the paper I just reviewed. If you jump to 38 minutes into the film and just watch they’ll lay out the results, but I’ll tell you what they found. They compared two European men, a South Indian woman, and a man whose father was English and mother Nigerian. The European men had expected levels of homozygosity, on the higher end. The South Indian woman had lower levels of aggregate homozygosity. This should be expected, as India is relatively genetically diverse on a pan-Eurasian scale. Finally, the mixed race male had almost no homozygosity to speak of. The principal investigator admitted that out of 5,000 individuals whom he had tested and analyzed this was the most extreme result, and he had to recheck it. Why? Three factors:

- The mother is Nigerian, which is a population which is relatively genetically diverse

- The genetic distance between the father and mother is rather high

- Finally, because the man is a first generation hybrid on all the loci where Africans and Europeans tend to differ he’ll be much more likely to be heterozygous

I’ll let the authors have the last word:

Long ROH are a neglected feature of our genome, which we have shown here to be universally common in human populations and to correlate well with demographic history. ROH are, however, only partially predictable from an individual’s background (due to the stochastic nature of inheritance). As well as conferring susceptibility to recessive Mendelian diseases, ROH are also potentially an underappreciated risk factor for common complex diseases, given the evidence for a recessive component in many complex disease traits…they will allow quantification of the risk arising from recessive genetic variants in different populations.

Citation: Mirna Kirin, Ruth McQuillan, Christopher S. Franklin, Harry Campbell, Paul M. McKeigue, & James F. Wilson (2010). Genomic Runs of Homozygosity Record Population History and Consanguinity PLoS ONE : 10.1371/journal.pone.0013996

Image Credit: Allison Stillwell

September 16, 2010

A fly’s life: adventures in experimental evolution

Natural selection happens. It was hypothesized in copious detail by Charles Darwin, and has been confirmed in the laboratory, through observation, and also by inference via the methods of modern genomics. But science is more than broad brushes. We need to drill down to a more fine-grained level to understand the dynamics with precision and detail, and so generate novel inferences which may then be tested. For example, there are various flavors of natural selection: stabilizing selection, negative selection, and positive directional selection. In the first case natural selection buffets the phenotype about an ideal mean, in the second case deleterious phenotypes and their associated alleles are purged from the genome, and finally, natural selection can also drive a novel trait toward greater prominence, and concomitantly the allelic variants which are associated with the fitter phenotype.

The last case is of particular interest to many because it is often through positive natural selection that evolution as descent with modification occurs. Over time trait values and the nature of traits themselves shift such that a lineage changes its character beyond recognition. This phyletic gradualism and the scale independence of evolutionary process has been challenged, in particular from the domain of developmental biology (albeit not by all, or even most, developmental biologists). But ultimately no one doubts that a classical understanding of evolution as change in allele frequency, often driven by natural selection, is part of the larger puzzle of how the tree of life came to be.

One of the phenomena associated with positive directional evolution is the selective sweep. How a selective sweep occurs, and its consequences, are rather straightforward. A genome consists of a sequence of base pairs (e.g., we have 3 billion base pairs). If a new mutation emerges at a particular base pair, a novel single nucleotide polymorphism (SNP), and that allelic variant is ~10% fitter than the ancestral variant, natural selection could drive up its frequency (the conditionality is due to the fact that in all likelihood it would still go extinct because of the power of stochastic forces when a mutant is at low frequency). So the variant could in theory shift from ~0% (1 out of N, N being the number of individuals in a population, 2N if diploid, and so forth) to ~100%. This would be the fixation of the novel variant, driven by selective dynamics. So what’s the sweep aspect? The sweep in this case refers to the effect of the very rapid rise in frequency of the SNP in question on the adjacent genomic region. What is termed a genetic hitchhiking dynamic results if the sweep occurs rapidly, so that nearby regions of the genome also move to fixation along with the favored SNP. But in a diploid organism with sexual reproduction genetic recombination persistently breaks apart associations across the physical genome. Therefore the span of the sequence of genetic markers near a favored SNP which form a haplotype is dependent on the rate of recombination as well as the rate of the rise in frequency of the allele, which is contingent on the strength of selection. A powerful selective sweep has the effect of homogenizing wide regions of the genome flanking the favored mutant; in other words the sweep “cleans” the gene pool of variation as one very long haplotype replaces many shorter haplotypes.
As an example, in the genomes of Northern Europeans the locus LCT is characterized by a very long haplotype, which itself seems to correlate well with the trait of lactase persistence. The implication here is that the lactase persistence conferring variant arose relatively recently, and was swept up to near fixation by positive directional natural selection.
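To get a feel for how fast a ~10% advantage moves an allele, here is a minimal deterministic sketch under a simple haploid selection model, which is my simplifying assumption rather than anything from the paper. It ignores drift entirely, so it only describes a mutant that has already escaped early stochastic loss; the classic approximation puts that chance at roughly 2s, about 20% here. The population size `N` is hypothetical:

```python
def sweep_trajectory(p0, s, target=0.99, max_generations=100_000):
    """Deterministic frequency path of a favored allele (haploid model).

    Each generation: p' = p(1 + s) / (1 + p*s). Drift is ignored, so this
    applies only once the mutant has survived its early low-frequency phase.
    """
    freqs = [p0]
    while freqs[-1] < target and len(freqs) <= max_generations:
        p = freqs[-1]
        freqs.append(p * (1 + s) / (1 + p * s))
    return freqs

N = 10_000  # hypothetical diploid population size
path = sweep_trajectory(p0=1 / (2 * N), s=0.10)
generations = len(path) - 1
print(f"1/(2N) to 99% in about {generations} generations")
```

With these assumed numbers the rise takes on the order of 150 generations, short enough that recombination has had limited opportunity to break up the flanking haplotype.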

That’s the broad theory. But as you know, evolution and its subcomponents are more than “just a theory,” they’re a set of models which are amenable to testing, whether through observation, or via controlled laboratory experiments. A new letter to Nature elaborates how exactly selective sweeps play out in Drosophila melanogaster, a classic “model organism.” Interestingly, this is a case of experimental evolution, something we are more familiar with from Richard Lenski’s E. coli. Genome-wide analysis of a long-term evolution experiment with Drosophila:

Experimental evolution systems allow the genomic study of adaptation, and so far this has been done primarily in asexual systems with small genomes, such as bacteria and yeast…Here we present whole-genome resequencing data from Drosophila melanogaster populations that have experienced over 600 generations of laboratory selection for accelerated development. Flies in these selected populations develop from egg to adult ~20% faster than flies of ancestral control populations, and have evolved a number of other correlated phenotypes. On the basis of 688,520 intermediate-frequency, high-quality single nucleotide polymorphisms, we identify several dozen genomic regions that show strong allele frequency differentiation between a pooled sample of five replicate populations selected for accelerated development and pooled controls. On the basis of resequencing data from a single replicate population with accelerated development, as well as single nucleotide polymorphism data from individual flies from each replicate population, we infer little allele frequency differentiation between replicate populations within a selection treatment. Signatures of selection are qualitatively different than what has been observed in asexual species; in our sexual populations, adaptation is not associated with ‘classic’ sweeps whereby newly arising, unconditionally advantageous mutations become fixed. More parsimonious explanations include ‘incomplete’ sweep models, in which mutations have not had enough time to fix, and ‘soft’ sweep models, in which selection acts on pre-existing, common genetic variants. We conclude that, at least for life history characters such as development time, unconditionally advantageous alleles rarely arise, are associated with small net fitness gains or cannot fix because selection coefficients change over time

Critical to understanding what’s going on here is the distinction they make between ‘classic’ ‘hard sweeps’ and ‘soft sweeps.’ Hard sweeps follow the spare description I outlined above:

1) A new mutant arises in the genetic background

2) Selection favors the mutant

3) The mutant rises in frequency and sweeps to fixation, 0% → 100%, replacing the ancestral variants

In contrast, for a soft sweep:

1) Selection favors a set of minor polymorphisms already segregating in the gene pool

2) These polymorphisms rise in frequency

3) But they may not sweep to fixation

In the first case the signature of natural selection will be clear, distinct, and indubitable. A novel haplotype which has replaced the ancestral variants and produced a wide region of genetic homogeneity as all other allele states are expunged by the sweep will have resulted. That isn’t what they saw at the genomic level.

But first, what did they do? The flies used in this experiment derive from a 30-year-old lineage, and they selected them for 600 generations in the case of the treatments which were being driven to new phenotype values. 600 generations for humans would be about 15,000 years assuming 25 years per generation. If a trait is heritable, and you select offspring deviated away from the mean, over time you will see a shift in the trait value. This is classic quantitative genetics, and that’s what they saw. They had five lineages which exhibited accelerated development (ACO), and five which were controls which exhibited the ancestral phenotypes (CO). “Eclosion” refers to the fly’s emergence from the pupal case. The lineages which were subject to selection had very different life histories from the control groups. The cluster of traits here shouldn’t be too surprising; we know from other taxa that short-lived fast-developing species tend to be smaller and metabolically more under-the-gun than the inverse.
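The “classic quantitative genetics” here is usually summarized by the breeder’s equation, R = h²S: the per-generation response equals the heritability times the selection differential. A minimal sketch with made-up numbers (the heritability and selection differential below are hypothetical, not estimates from the fly experiment):

```python
def response_to_selection(heritability, selection_differential):
    """Breeder's equation: expected shift in the trait mean per generation."""
    return heritability * selection_differential

# Hypothetical values: narrow-sense heritability of 0.25, and breeding
# parents that develop on average 8 hours faster than the population mean.
h2 = 0.25
S = -8.0
R = response_to_selection(h2, S)  # expected change in hours per generation

# The human-scale conversion quoted in the text.
years_for_humans = 600 * 25
```

In practice neither h² nor S stays constant over hundreds of generations, so this gives the direction and rough scale of change rather than a forecast.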

But the real interesting aspects of this study are not the phenotypes. Who hasn’t seen weird things among the Drosophila? That’s one of the reasons they were chosen as model organisms in the first place! Rather, they explored the patterns of genomic variation within and across the lineages, and integrated the results into a broader theoretical framework of how evolutionary processes occur, and their implications for the genome-wide structure one should see. Below I’ve stitched together figure 2 & 3, which illustrate particular patterns of genomic variation.


The left figure shows differences in allele frequencies between the ACO and CO pooled lineages. The spikes indicate large differences, with the dotted line representing the threshold where there’s a 0.1% random chance of such a between population frequency difference. The vertical axis is log-scaled. The grey line at the bottom indicates the differences between one particular ACO lineage and the pooled ACO sample. In the right panel you see heterozygosities, with blue denoting the CO lineages, and red the selected ACO lineages which have shortened life histories. The grey again is a particular ACO lineage. Each vertical panel corresponds to a chromosomal arm of the Drosophila melanogaster genome.

First, note the widespread distribution of allele frequency differences between ACO and CO. Additionally, there’s little difference between the specific ACO lineage and the pooled sample. Despite their independent histories they seem to exhibit the same allelic configuration. Second, note that the heterozygosities in the case of the ACO pooled sample are lower than in the CO ancestral phenotype lineages. Why? Remember that selective sweeps should expunge genomic variation. But the sweeps do not seem to have gone to fixation, otherwise we’d see many more inverted peaks converging to heterozygosity of ~0, as the selected variant replaces all others in the population.
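The link between a sweep’s progress and the height of those heterozygosity troughs can be illustrated with the textbook expectation for a biallelic site, 2p(1 − p). This is a sketch with hypothetical allele frequencies, not the pooled-sequencing estimator used in the paper:

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1 - p) at a site with allele frequency p."""
    return 2 * p * (1 - p)

# Hypothetical frequencies of the favored allele at one site.
control = expected_heterozygosity(0.50)        # maximal variation
partial_sweep = expected_heterozygosity(0.90)  # reduced, but well above zero
complete_sweep = expected_heterozygosity(1.0)  # fixation: no variation left

print(control, round(partial_sweep, 2), complete_sweep)
```

A dip that bottoms out above zero is therefore what an incomplete or soft sweep would leave behind, while a completed hard sweep would drive the value to the floor.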

What’s going on in the regions which exhibit differences between the controls and selected lineages? They looked at the ~650 non-synonymous SNPs on ~500 genes which were most differentiated between ACO and CO (L10FET score > 4) and found the following categories of genes enriched: imaginal disc development, smoothened signalling pathway, larval development, wing disc development, larval development (sensu Amphibia), metamorphosis, organ morphogenesis, imaginal disc morphogenesis, organ development and regionalization. Life history is complex. Combine the wide class of genes with the dispersed genomic impact of selection as evident in figures 2 and 3, and you get a good sense of the sort of consequences on the substrate level which quantitative genetic evolutionary dynamics have. Also of interest, they found that the X chromosome seemed enriched for signatures of selection and evolution. Why? They note that this chromosome would be more subject to selection for recessive or partially recessive expressing SNPs.

Clearly this study did not find the clean hard sweeps which theory may have predicted. Rather, the researchers found a lot of partially completed sweeps distributed all across the genome. Sound familiar? Before we move on to broader considerations, here are their explanations:

- The sweeps are hard, but haven’t reached fixation. So the selection coefficients have to be rather small for them to still be in transit

- Selection is operating on “standing variation.” That is, the genetic variation extant naturally within a given population, and which may be operated upon by natural selection to change the population trait value mean through classical breeding techniques

- And finally, selection coefficients (the greater fitness of positively selected variants against the population mean) may not be static parameters, but change over time as a function of allele frequency. This shouldn’t be that surprising. Frequency dependence and epistasis can impact on linear assumptions within a statistical genetic model. The authors refer to deleterious alleles or antagonistic pleiotropy as possible genetic level forces which also prevent fixation

I personally lean against the first option, because it seems like we see a similar pattern in human evolutionary genomics, lots of partial sweeps and incomplete fixation. How much time does a brother need? In the long run we’re dead, and heat death swallows the universe. In the short run evolutionary pressures are always shifting. Fix now, or forget it say I! The wide distribution of allelic differences as well as moderate heterozygosities seems to be an indication that a quantitative trait, life history, is being modified through mass action on genetic variation. Interestingly, there’s also the parallel to humans insofar as the X chromosome seems to have more signatures of selection and variation in this evolutionary experiment. Next question: who’s working on experimental evolution of 600 generations in mice?

Citation: Burke, Molly K., Dunham, Joseph P., Shahrestani, Parvin, Thornton, Kevin R., Rose, Michael R., & Long, Anthony D. (2010). Genome-wide analysis of a long-term evolution experiment with Drosophila Nature : 10.1038/nature09352

Image Credit: Karl Magnacca

September 14, 2010

The silver age of altitude adaptation

With all the justified concern about “missing heritability”, the age of human genomics hasn’t been a total bust. As I have observed before, in 2005’s excellent book Mutants the evolutionary geneticist Armand M. Leroi asserted that we really didn’t have a good understanding of normal variation of human pigmentation. At the time I think it was a defensible claim, but within three years I’d say that most of the mystery had been cleared up. Though there are still some holes to be plugged, and details to be elucidated, the genetic architecture of pigmentation is now understood more or less. By the fall of 2006 Richard Sturm penned a review titled A golden age of human pigmentation genetics, an age I think which in some ways probably was closed with his 2009 review Molecular genetics of human pigmentation diversity. It’s not surprising that many of the traits that 23andMe tells you about have to do with your pigmentation. Of course there’s some limited utility in this; one assumes that most individuals don’t gain much benefit from the knowledge that they have an “85% chance of having brown eyes,” though it may be useful in terms of offspring prediction (I would say I have an 85% chance of having brown eyes, but since I’m not European the genetic background isn’t right to make that probability assertion).

But as the golden age of pigmentation genetics comes to a close and the low hanging fruit is stripped bare, where next? I wonder if it may be altitude adaptations. Like pigmentation, altitude genetics has been around for a while, but it seems there’s a recent cresting of papers in the area, focusing in particular on the three canonical high altitude peoples, the Tibetans, Andeans, and the Ethiopians. Last spring two major groups came out with papers on the genetics of Tibetan altitude adaptation, and its evolutionary history, using somewhat different techniques. A new paper in PLoS Genetics builds upon that work (verifying two of the loci as targets of selection in Tibetans implicated in the previous papers), and adds Andean populations to the mix to assess the possibilities of convergent adaptations. Identifying Signatures of Natural Selection in Tibetan and Andean Populations Using Dense Genome Scan Data:

High-altitude hypoxia is caused by decreased barometric pressure at high altitude, and results in severe physiological stress to the human body. Three human populations have resided at high altitude for millennia including Andeans on the Andean Altiplano, Tibetans on the Himalayan plateau, and Ethiopian highlanders on the Semian Plateau. Each of these populations exhibits a unique suite of physiological changes to the decreased oxygen available at altitude. However, we are just beginning to understand the genetic changes responsible for the observed physiology. The aim of the current study was to identify gene regions that may be involved in adaptation to high altitude in both Andeans and Tibetans. Genomic regions showing evidence of recent positive selection were identified in these two high-altitude human groups separately. We found compelling evidence of positive selection in HIF pathway genes, in the globin cluster located on chromosome 11, and in several chromosomal regions for Andeans and Tibetans. Our results suggest that key HIF regulatory and targeted genes are responsible for adaptation to altitude and implicate several distinct chromosomal regions. The candidate genes and gene regions identified in Andeans and Tibetans are largely distinct from one another. However, one HIF pathway gene, EGLN1, shows evidence of directional selection in both high-altitude populations.

In this paper the authors looked at around 50 Andeans (Quechua and Aymara speakers) and 50 Tibetans, and compared them to various outgroups. In addition to the European and Asian HapMap populations they also looked at some Amerindian populations. The map below shows the geographical scope of their sampling (the right inset are the Amerindian lowland groups):


The ancestral relationships of the two highland groups sampled in relation to the lowlanders were relatively straightforward. Panels A and B show PCA plots for the Andeans and Tibetans, while C and D show frappe bar plots. The only thing notable for me is that the Quechua speakers seem to show residual European ancestry which the Aymara do not, and the Colombian indigenous groups seem to have more affinity with Mesoamerican populations than with the other South American samples. I can give no insight as to the latter, but as for the former, if it is not just a quirk of non-representativeness one may be seeing the effect of the higher number of Spanish men who married into the nobility of the Quechua speaking highlands than further south in the lands of the Aymara (though Potosi was in Bolivia, so this may not be plausible).

We already have some evolutionary expectations of how these groups came to have these adaptations to their high altitude environments. It seems that the physiological processes for the three groups are somewhat different, and this has been a source of curiosity for geneticists for a long time. It stands to reason that if the physiology is somewhat varied, the genetics should be too, and that seems to be a broadly correct assumption. In this paper they took two general approaches, looking at the total genome, and focusing on specific candidate regions. From what I can tell they did not find much that was novel using the first technique, but they did clarify the relationship between Tibetans and Andeans in terms of their genetic adaptations a bit by looking at specific genes. As noted in the author summary it looks as if the two populations do have somewhat different genetic architectures. Many of the genes which seem to have been targets of selection do not overlap, and for those that do there seem to have been different localized selection events, so that the haplotypes being driven by positive selection differ.

They used a combination of techniques to detect possible regions of natural selection:

- locus specific branch length (LSBL)

- the log of ratio of heterozygosities (lnRH)

- a modified Tajima’s D statistic

- whole genome long range haplotype (WGLRH)

LSBL is an elaboration on Fst, so it is finding between population differences in allele frequency. Recall that at any given locus you don’t expect much between population difference, so if there is a great deal of ecological adaptation you may see a lot of variance as a function of geography. Heterozygosity is simply a measure of the fraction of loci where the two gene copies are in different states. It’s just a way to measure genetic variation (though there are others). Tajima’s D is a test for whether a locus deviates from neutral expectations. A significant deviation means that there may have been a bottleneck, a selective sweep, or balancing selection. Finally, the last test looks for sets of correlated markers within the genome. If there is a haplotype, a sequence of markers, at high frequency then it may be that you’re witnessing a genomic region which is in the midst of, or just past, a selective sweep.

Why four different tests? Because one given test is not dispositive of natural selection. As noted with Tajima’s D, there are demographic processes of a stochastic nature which can produce false positives, so it is best not to live or die by one technique alone.
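To make three of those statistics a bit more concrete, here is a minimal sketch of their textbook forms for a single biallelic site. This is not the paper’s pipeline: the authors used modified versions computed over dense SNP data, and the function names, the simple equal-weight Fst estimator, and the example frequencies below are all my own assumptions for illustration.

```python
import math

def heterozygosity(p):
    """Expected heterozygosity at a biallelic site with allele frequency p."""
    return 2.0 * p * (1.0 - p)

def fst(p1, p2):
    """Simple two-population Fst at one site (equal weights; an illustrative estimator)."""
    p_bar = (p1 + p2) / 2.0
    h_total = heterozygosity(p_bar)
    if h_total == 0:
        return 0.0  # site is fixed for the same allele in both populations
    h_within = (heterozygosity(p1) + heterozygosity(p2)) / 2.0
    return (h_total - h_within) / h_total

def lsbl(p_focal, p_b, p_c):
    """Locus specific branch length for the focal population in a three-population tree."""
    d_ab = fst(p_focal, p_b)
    d_ac = fst(p_focal, p_c)
    d_bc = fst(p_b, p_c)
    return (d_ab + d_ac - d_bc) / 2.0

def ln_rh(h_focal, h_reference):
    """Log ratio of heterozygosities; strongly negative values mean reduced diversity in the focal group."""
    return math.log(h_focal / h_reference)

def tajimas_d(n, seg_sites, pi):
    """Classic Tajima's D from sample size n, segregating sites, and mean pairwise differences pi."""
    if seg_sites == 0:
        return 0.0
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)

# Hypothetical frequencies: the allele is common only in the focal (highland) group.
print(lsbl(0.9, 0.1, 0.1))   # long branch leading to the focal population
print(lsbl(0.1, 0.9, 0.1))   # the divergence sits on another branch
print(ln_rh(heterozygosity(0.05), heterozygosity(0.5)))
print(tajimas_d(10, 10, 2.0))  # excess of rare variants gives a negative D
```

The point of the toy numbers is that each statistic reacts to a different footprint (frequency divergence, lost diversity, a skewed frequency spectrum), which is why a region flagged by several of them at once is more convincing than one flagged by a single test.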

Here is figure 4, which shows the differences in allele frequencies on the EGLN1 gene:


We’ve seen EGLN1 before. In the figure above the left panels show the Andean derived SNPs, and the right panels the Tibetan ones. Note the differences in frequency in A and B. The red denotes statistically significant values for a statistic in panels C & D. Both Andeans and Tibetans show indications of selection, but the details in the patterns vary when you zoom in on the gene. The very last panel has an arrow which points to the SNPs in each population where the between population variance is maximized. Interestingly the ancestral allele seems to have risen in frequency here in the high altitude populations, as black denotes ancestral and red derived in the first and last panels.

Let me jump to their conclusion:

In summary, we performed a genome scan on high- and low-altitude human populations to identify selection-nominated candidate genes and gene regions in two long-resident high-altitude populations, Andeans and Tibetans. Several chromosomal regions show evidence of positive directional selection. These regions are unique to either Andeans or Tibetans, suggesting a lack of evolutionary convergence between these two highland populations. However, evidence of convergent evolution between Andeans and Tibetans is suggested based on the signal detected for the HIF regulatory gene EGLN1. In addition to EGLN1, a second HIF regulatory gene, EPAS1, as well as two HIF targeted genes, PRKAA1 and NOS2A, have been identified as selection-nominated candidate genes in Tibetans (EPAS1) or Andeans (PRKAA1, NOS2A). PRKAA1 and NOS2A play major roles in physiological processes essential to human reproductive success…Thus, in addition to demonstrating the likely targets of natural selection and the operation of evolutionary processes, genome studies also have the clear potential for elucidating key pathways responsible for major causes of human morbidity and mortality. Based on the findings of this study, it will be important to confirm the results with genotype-phenotype association studies that link genotype to a specific high-altitude phenotype.

I wanted to show the alphabet soup of genes in case you’re a geneticist with an interest in any of these loci. I’ve seen these before in previous papers, I assume the key that got this published in PLoS Genetics is the deep comparative dimension, as the researchers explored the lack or existence of evolutionary convergence between these two populations. Should the finding be surprising? I don’t think so. High altitudes are extreme environments, and the literature is filled with references to problems which emerge even in these populations because of the nature of their adaptations. There are likely deleterious side effects, especially if one of last spring’s papers on Tibetans is correct and that they’re relatively recent settlers of the highlands. But you never know until you play the game, so it is good to confirm.

A further exploration of the genetic architecture and nature of adaptations, especially when the research is extended to Ethiopians, may give us a further window into contingency in evolutionary history. These three occurrences are basically three independent experiments. In this paper they indicate that some of the variants being subject to natural selection may have been in the ancestral population, so standing variation. Others are new mutations, unique and novel. Though there are different pathways to the final expression of the phenotype, which in the details of implementation (physiology) still differ across the groups, there are also genes which in this comparison seem to be implicated in both Tibetans and Andeans as having been subject to selection. How constrained is the sample space subject to possible selection and the implied G-matrix? How contingent are the evolutionary pathways that different populations take to attain the state of adaptive fitness in similar ecologies? These are the sort of long term questions which I think will be possibly answered as the tentative silver age of altitude adaptation gives way to the golden age.

Citation: Bigham A, Bauchet M, Pinto D, Mao X, & Akey JM (2010). Identifying Signatures of Natural Selection in Tibetan and Andean Populations Using Dense Genome Scan Data PLoS Genetics

Image Credit: Micah MacAllen

Note: I am aware that classically the silver age follows the golden age, instead of precedes it. But we live in Whiggish times indeed!

August 27, 2010

Not the origin of genome complexity

Over the past decade evolutionary geneticist Mike Lynch has been articulating a model of genome complexity which relies on stochastic factors as the primary motive force by which genome size increases. The argument is articulated in a 2003 paper, and further elaborated in his book The Origins of Genome Architecture. There are several moving parts in the thesis, some of which require a rather fine-grained understanding of the biophysical structural complexity of the genome, the nature of Mendelian inheritance as a process, and finally, population genetics. But the core of the model is simple: there is an inverse relationship between long term effective population size and genome complexity. Low individual numbers ~ large values in terms of base pairs and counts of genetic elements such as introns.

A quick reminder: effective population size denotes, roughly, the portion of the population which contributes genes to the next generation. So, in the case of insects with extremely high mortality in the larval stage the effective population size may be orders of magnitude smaller than the census size at any given generation evaluated over all stages of life history. In contrast, with humans a much larger proportion of children end up contributing to the genetic makeup of the subsequent generation. With large organisms I’ve heard you can sometimes use a rule of thumb that effective population size is ~1/3 of census size, though this probably overestimates the effective population size. One reason is reproductive variance, which reduces the effective population size because many individuals contribute far less to the next generation than others. The greater the variance, the more the genetic variation at a given generation is shaped by a few individuals within the population, reducing the effective population which contributes to the next (the reproductive variance is often assumed to be Poisson, but that is likely an underestimate). Additionally, there is the issue of variation over time. Long term effective population size is much more sensitive to low bound values than high bound values, so it is liable to be much smaller than the census size at any given period for a species which goes through cycles. Humans for example have a relatively small long term effective population size evaluated over the past 100,000 years because we seem to have expanded from a small initial population. Mathematically, since long term effective population size is given by the harmonic mean, it stands to reason that low bound values would be critical. If that doesn’t make sense to you, remember the outsized impact which population bottlenecks may have on the long term trajectory of a species, in particular by removing genetic variation.
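A small numerical sketch may help here. The two functions below use standard textbook approximations: the harmonic mean across generations, and the classic formula for a constant-size diploid population with variance in offspring number. The function names and the population sizes are hypothetical, chosen only to show the scale of the effects.

```python
def harmonic_mean_ne(sizes):
    """Long term effective size as the harmonic mean of per-generation sizes."""
    return len(sizes) / sum(1.0 / n for n in sizes)

def ne_from_offspring_variance(n, variance_k):
    """Textbook approximation Ne = (4N - 2) / (Vk + 2) for a stable diploid population."""
    return (4.0 * n - 2.0) / (variance_k + 2.0)

# Nine generations at 10,000 individuals and a single bottleneck generation at 100.
sizes = [10_000] * 9 + [100]
print(sum(sizes) / len(sizes))   # arithmetic mean: 9010.0
print(harmonic_mean_ne(sizes))   # harmonic mean: roughly 917

# Poisson reproduction (variance 2) versus a much more skewed distribution.
print(ne_from_offspring_variance(1000, 2.0))
print(ne_from_offspring_variance(1000, 10.0))
```

One bad generation out of ten drags the long term figure down by about an order of magnitude, and raising the offspring variance from 2 to 10 happens to land near the ~1/3 rule of thumb, though that match is a property of the numbers I picked and nothing more.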

How does this influence genome complexity? Basically Lynch’s thesis is that when you reduce effective population you dampen the power of natural selection, specifically purifying selection, from preventing the addition of non-adaptive complexity through random processes. It isn’t that selection is rendered moot, rather, its signal is overwhelmed by the noise. Here’s the abstract of his 2003 paper:

Complete genomic sequences from diverse phylogenetic lineages reveal notable increases in genome complexity from prokaryotes to multicellular eukaryotes. The changes include gradual increases in gene number, resulting from the retention of duplicate genes, and more abrupt increases in the abundance of spliceosomal introns and mobile genetic elements. We argue that many of these modifications emerged passively in response to the long-term population-size reductions that accompanied increases in organism size. According to this model, much of the restructuring of eukaryotic genomes was initiated by nonadaptive processes, and this in turn provided novel substrates for the secondary evolution of phenotypic complexity by natural selection. The enormous long-term effective population sizes of prokaryotes may impose a substantial barrier to the evolution of complex genomes and morphologies.

The implication here is that prokaryotes with massive population sizes are biased toward smaller genomes by the more efficacious application of natural selection. In contrast, more complex organisms which have smaller population sizes, and so are more impacted by the random fluctuations generation to generation due to sample variance, are less streamlined genomically because selection can do only so much against the swelling sea of noise. One intriguing argument of Lynch is that the genomic complexity is then later useful downstream as the building block of phenotypic complexity, but let’s set that aside for now.
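The signal-versus-noise intuition can be illustrated with Kimura’s diffusion approximation for the fixation probability of a new mutation. This is a standard population genetic result and not anything taken from Lynch’s paper. The sketch assumes genic selection, a diploid population where effective size equals census size, and a made-up selection coefficient for a slightly deleterious insertion.

```python
import math

def fixation_probability(s, n):
    """Kimura's approximation for a new mutation at frequency 1/(2N), assuming Ne = N."""
    if s == 0:
        return 1.0 / (2 * n)  # neutral case: fixation by drift alone
    exponent = -4.0 * n * s
    if exponent > 700:
        return 0.0  # strongly selected against in a large population; avoids float overflow
    return math.expm1(-2.0 * s) / math.expm1(exponent)

s = -0.001  # hypothetical cost of a bit of extra, useless DNA
for n in (100, 10_000):
    relative = fixation_probability(s, n) / fixation_probability(0, n)
    print(n, relative)  # fixation probability relative to a neutral mutation
```

With 100 individuals the deleterious variant fixes at about 80% of the neutral rate, so selection barely notices it. With 10,000 individuals it essentially never fixes. Same mutation, same cost, very different fate, which is the heart of the argument.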

A new paper in PLoS Genetics challenges the statistical analysis of the original data which Lynch et al. used to make their case. Technically the argument was that there was an inverse relationship between Neu and genome size. Ne is effective population size, and u is nucleotide mutation rate. Though the argument is technical, the basic objection should be easy to understand: there are other variables which may actually be responsible for the correlation which Lynch et al. discerned. To the paper, Did Genetic Drift Drive Increases in Genome Complexity?:

Genome size (the amount of nuclear DNA) varies tremendously across organisms but is not necessarily correlated with organismal complexity. For example, genome sizes just within the grasses vary nearly 20-fold, but large-genomed grass species are not obviously more complex in terms of morphology or physiology than are the small-genomed species. Recent explanations for genome size variation have instead been dominated by the idea that population size determines genome size: mutations that increase genome size are expected to drift to fixation in species with small populations, but such mutations would be eliminated in species with large populations where natural selection operates at higher efficiency. However, inferences from previous analyses are limited because they fail to recognize that species share evolutionary histories and thus are not necessarily statistically independent. Our analysis takes a phylogenetic perspective and, contrary to previous studies, finds no evidence that genome size or any of its components (e.g., transposon number, intron number) are related to population size. We suggest that genome size evolution is unlikely to be neatly explained by a single factor such as population size.

In the original analysis by Lynch et al. ~66% of the variation in genome size was explained by Neu! That’s a pretty large effect. Figure 1 illustrates how phylogeny could be a confound in adducing a relationship. Here’s some of the text which explains the figure:

In this hypothetical example, eight species have been measured for two traits, x and y, as indicated by pairs of values at the tips of the phylogenetic tree (A). Ordinary least-squares linear regression (OLS) indicates a statistically significant positive relationship (B; r-squared = 0.62, P = 0.02), potentially leading to an inference of a positive evolutionary association between x and y. However, inspection of the scatterplot (B) in relation to the phylogenetic relationships of the species (A) indicates that the association between x and y is negative for the four species within each of the two major lineages. Regression through the origin with phylogenetically independent contrasts…which is equivalent to phylogenetic generalized least squares (PGLS) analysis, accounts for the nonindependence of species and indicates no overall evolutionary relationship between the traits…The apparent pattern across species was driven by positively correlated trait change only at the basal split of the phylogeny; throughout the rest of the phylogeny, the traits mostly changed in opposite directions (A; basal contrast in red)….

The argument then seems to be that the relationship in the original work by Lynch was an artifact due to the evolutionary history of the species which he surveyed to infer the relationship. Instead of a general principle or law then what you have is an outcome of contingent historical processes. Not very neat and clean. You can see the taxa-clustered nature of the relationship in figure 1 from the 2003 paper in Science:


OK, now let’s look at the visualization of the same data set from this paper, as a tree to illustrate the correlations:


The last figure shows the difference between a scatterplot using conventional OLS regression, and the phylogenetic least squares model (PGLS). You go from an obvious linear relationship, which translated into the high r-squared noted above, to basically nothing (r-squared near zero, no statistical significance).
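If the logic of the confound isn’t obvious, a toy example in the spirit of the paper’s hypothetical figure may help. The eight data points below are invented, and the check is deliberately crude: instead of real independent contrasts or PGLS, which need a tree with branch lengths, it simply fits ordinary least squares within each of two clades and compares that to the pooled fit.

```python
def ols_slope(xs, ys):
    """Slope of an ordinary least-squares regression of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Two hypothetical clades of four species each; the clades differ in both traits.
clade_a_x, clade_a_y = [1, 2, 3, 4], [4, 3, 2, 1]
clade_b_x, clade_b_y = [11, 12, 13, 14], [14, 13, 12, 11]

pooled_slope = ols_slope(clade_a_x + clade_b_x, clade_a_y + clade_b_y)
print(pooled_slope)                     # positive when all eight species are pooled
print(ols_slope(clade_a_x, clade_a_y))  # negative within clade A
print(ols_slope(clade_b_x, clade_b_y))  # negative within clade B
```

The pooled regression sees a strong positive trend, but that trend is produced entirely by the single split between the two clades. Within each lineage the traits move in opposite directions. Treating eight species as eight independent data points is what manufactures the signal.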

The paper itself isn’t that long, and the objection is pretty straightforward. They’re simply claiming that Lynch didn’t correct for an obvious alternative explanation/confound, and that we don’t know what we thought we knew. Additionally, there is the assertion that the idea that effective population size predicts genome size robustly is becoming conventional wisdom within the scientific community. I don’t know about that; this seems like such a young field in flux that I think they oversold how widespread this assumption is to make the force of their rebuttal more critical. Certainly the patterns in genome size can be quite perplexing, but my intuition is that an r-squared on the order of 2/3 of the variation in genome size being explained by one predictor variable is rather astounding. Obviously genome size is pretty easy to get in the “post-genomic era,” but Ne and u are harder to come by for many taxa, or even within a given taxon for a set of species of interest. It looks to me like an opportunity for experimental evolutionists, who can control the confounds, and observe changes within a lineage. And yet even if Neu is predictive as an independent variable all things controlled, what if all things are not usually controlled, and random acts of phylogenetic history are more important? Mike Lynch is credited in the acknowledgements, so I assume we’ll be seeing a response from him in the near future.

Citation: Whitney KD, & Garland T Jr (2010). Did Genetic Drift Drive Increases in Genome Complexity? PLoS Genetics : 10.1371/journal.pgen.1001080

April 12, 2010

When a trait isn’t a trait isn’t a trait

One of the great things about evolutionary theory is that it is a formal abstraction of specific concrete aspects of reality and dynamics. It allows us to squeeze inferential juice from incomplete prior knowledge of the state of nature. In other words, you can make predictions and models instead of having to observe every last detail of the natural world. But abstractions, models and formalisms often leave out extraneous details. Sometimes those details turn out not to be so extraneous. Charles Darwin’s original theory of evolution had no coherent or plausible mechanism of inheritance. R. A. Fisher and others imported the empirical reality of Mendelism into the logic of evolutionary theory, to produce the framework of 20th century population genetics. Though accepting the genetic inheritance process of Mendelism, this original synthesis was not informed by molecular biology, because it pre-dated molecular biology. After James Watson and Francis Crick uncovered the biophysical basis for Mendelism molecular evolution came to the fore, and neutral theory emerged as a response to the particular patterns of genetic variation which new molecular techniques were uncovering. And yet through this much of R. A. Fisher’s image of an abstract genetic variant floating against a statistical soup of background noise variation persisted, sometimes dismissed as “bean bag genetics”.

We’ve come a long way from the initial wave of discussions which were prompted by the molecular genetic revolution. We have epigenetics, evo-devo and variation in gene regulation. None of these processes “overthrow” evolutionary biology, though in some ways they may revolutionize aspects of it. Science is over the long haul after all an eternal revolution, as the boundaries of comprehension keep getting pushed outward. A few days ago I pointed to Sean Carroll’s recent work, which emphasizes that one must think beyond the sequence level, and focus on particular features such as cis-regulatory elements. Here we’ve been tunneling down to the level of the gene, but what about the traits, the phenotypes, which are affected by genetic variation?

It is well known that the sparest abstraction of genotypic-phenotypic relationship can be illustrated like so:

genetic variation → phenetic variation

But each element of this relation has to be examined in greater detail. What type of genetic variation? Sequence level variation? Epigenetic variation? The second component is perhaps the most fraught, with the arrow waving away the myriad details and interactions which no doubt lurk between genotype and phenotype. And finally you have the phenotype itself. Are they all created alike in quality so that we can ascribe to them dichotomous values and quantities?

A new paper in PNAS examines the particulars of morphological phenotypes and physiological phenotypes, and their genetic control, as well as rates of evolution. Contrasting genetic paths to morphological and physiological evolution:

The relative importance of protein function change and gene expression change in phenotypic evolution is a contentious, yet central topic in evolutionary biology. Analyzing 5,199 mouse genes with recorded mutant phenotypes, we find that genes exclusively affecting morphological traits when mutated (dubbed “morphogenes”) are grossly enriched with transcriptional regulators, whereas those exclusively affecting physiological traits (dubbed “physiogenes”) are enriched with channels, transporters, receptors, and enzymes. Compared to physiogenes, morphogenes are more likely to be essential and pleiotropic and less likely to be tissue specific. Morphogenes evolve faster in expression profile, but slower in protein sequence and gene gain/loss than physiogenes. Thus, morphological and physiological changes have a differential molecular basis; separating them helps discern the genetic mechanisms of phenotypic evolution.

Morphology here refers to gross anatomical features. The sort of traits and characteristics which a paleontologist or anatomist might take interest in. Physiology is more about function, and the physical structures which enable that function. It is naturally closer to the scale of molecular biology as physiology melts into biochemistry. Of course at the other end physiology also merges with anatomy as physiology occurs within features of interest to the anatomist. By way of generalization perhaps physiology may be considered more granular, while morphology more gross, in the context of this paper.

They used the mouse because it’s a species which has long served as a model organism, and there are a host of well known and characterized mutations for both physiology and morphology. Utilization of mice in these fields in the context of evolutionary research dates back to the early 20th century. So systems biologists have a lot of research that’s already been done to work with. They found 5199 mouse genes with known phenotypes in the Mouse Genome Informatics database. 821 affected only morphological traits and 912 affected only physiological traits.

Figure 1 shows the breakdown by Gene Ontology:


Going by what little I know about these topics the second to the fourth panels aren’t surprising. Morphological traits are built from molecular structures, while the transporter activity classes are a more cellular scale, and so would seem to be below the threshold of salience for morphological traits. The first panel is not something I’d expected, but it makes sense after the fact. Figure 2 clarifies. The right panels have proportions, the left counts.


The primary point is this: morphogenes seem to affect more traits than physiogenes, and their effect is less tissue specific when it comes to a particular trait. When this pattern is highlighted the enrichment toward transcriptional regulation makes more sense to me: transcriptional regulation might allow for more trait by trait level control of variation. If there is a relationship of many traits to one gene that would probably impose a constraint on the sequence level to a greater extent than if the gene was implicated in variation on one trait. The gap in pleiotropy is closed somewhat when you constrain to essential genes, those whose mutation results in decrease of fitness to zero (through death or lack of ability to reproduce). Pleiotropy presumably is constraining the genetic landscape toward particular fitness peaks. Tissue specificity seems understandable when you consider the localization of many physiological processes, and their biochemical complexities (I’m thinking of the vagaries of gene expression in the liver here).

But they looked at more than how the traits and genes distribute now; they tried to sniff out if there were differences in the rate of evolution of morphogenes and physiogenes contingent upon the class of genetic variants. Remember that you have sequence level changes on exons which can alter proteins. You have cis-acting elements as critical cogs in gene regulation. And you have more gross genomic features such as gene duplication or deletion.

Figure 3 shows the differences between mice and humans on particular genes in relation to sequence level substitutions as well as gene expression profiles. Specifically in the case of the former you want to know the rate of nonsynonymous substitution, those substitutions at base pairs which change the amino acid translated, standardized by the overall mutation rate. So panel C is the one to focus on. Note that physiogenes seem to have evolved more since the last divergence between human and mouse lineages than morphogenes. Why might this be? An immediate thought that comes to mind is that tissue-specific expressing physiological processes are liable to be modulated more often than gross morphology, which might be controlled by genes with a lot of pleiotropic effects and so constrained. Even when you control for tissue-specificity the pattern remains, as evident in panel D. The pattern seems somewhat inverted in relation to rate of evolution when it comes to gene expression profiles, as you can see in the last three panels. Evolution happens, but by somewhat different genetic means in these cases. The authors finger pleiotropy in particular as the problem for sequence level evolution in morphogenes, as changes in proteins are much more likely to be problematic if those proteins are upstream from many more traits.

In a way these results show that evolution has to be a versatile designer. When it comes to physiogenes the illustrator is in charge, creating new traits from the most basic genetic raw material, changes in a base pair here and a base pair there. But for morphogenes evolution has to use the tools and tricks of photoshopping, making recourse to extant elements and rearranging or tweaking things here and there so as not to upset the complex applecart while modulating on the margins.

What about cis-acting regulatory elements? In the paper they allude to the argument of Sean Carroll that cis-acting regulatory elements are critical for the evolution of morphological traits. That would imply that morphogenes should be enriched vis-a-vis physiogenes for changes on these elements. They didn’t find that in figure 4. On the contrary.


But I don’t think they perceive their result as a rock-solid refutation of Carroll because it was somewhat indirect. I’ll quote from the paper:

…Because experimentally confirmed mammalian cis elements are few, are likely to have been confirmed in only one species, and are potentially biased toward certain classes of genes, we tested the above hypothesis by using cis-elements that were predicted exclusively by motif sequence conservation among a set of vertebrate genome sequences and recorded in the cisRED database (20). In cisRED, 8,440 predicted mouse cis-elements and 7,688 predicted human cis-elements were found to be in the proximity of 586 mouse morphogenes and their human orthologs, respectively. Similarly, 7,082 mouse cis-elements and 7,215 human cis-elements were predicted for 621 physiogenes….

I’m inclined to accept this result and its generalizability, but there’s a layer of analysis and modeling in this case which doesn’t exist in the others. Additionally, Carroll’s thesis is about the whole animal kingdom and a mouse-human comparison may be atypical.

Finally they wanted to look at gene duplication. They found:

Together with the Dfam result, our analyses show that, whereas physiogene families expand/contract faster than morphogene families, the rate of expansion/contraction is relatively constant across lineages for a given family.

I wonder if the duplication here might have something to do with modulating dosages of various substrates in biochemical processes. This may have more direct relevance to physiological processes.

It is important to note, as they did, that the categories “morphogene” and “physiogene” are somewhat artificial, as is the distinction between morphology and physiology. Nature is fundamentally one, and we break it apart at particular joints for ease of our own abstractions and categorizations. Additionally all genes presumably have some effect on morphology and physiology, and though this exploration looks under the hood a bit more than some of the older abstractions it too is a simplification. The key is that the argument here seems to be that this breaking apart of categories and processes gives us useful marginal return in comprehension of evolutionary dynamics. A trait is not always just a trait. Different classes of phenotypes may have different evolutionary genetic implications by their very nature. Some of this is common sense: those traits which are less functionally significant will exhibit more genic variation. But distinctions in terms of form and function themselves are at a further level of detail. And, I presume that generalizations that we make from mouse-human comparisons as here have some limitations across the tree of life.

Citation: Liao BY, Weng MP, & Zhang J (2010). Contrasting genetic paths to morphological and physiological evolution. Proceedings of the National Academy of Sciences of the United States of America PMID: 20368429
