Homozygosity runs in the family (or not)

The number 1 gets a lot more press than -1, and the concept of heterozygosity gets more attention than homozygosity. Concretely the difference between the latter two is rather straightforward. In diploid organisms the genes come in duplicates. If the alleles are the same, then they’re homozygous. If they’re different, then they’re heterozygous. Sex chromosomes can be an exception to this because in the heterogametic sex you generally have only one copy of a gene, as one of the chromosomes is sharply truncated. This is why human males are subject to X-linked recessive traits at so much greater a frequency than females; recessive expression is irrelevant when you don’t have a compensatory X chromosome to mask the malfunction of one allele.

Of course recessive traits are not simply a function of sex-linked traits. Consider microcephaly, an autosomal recessive disease. To manifest the trait you need two malfunctioning copies of the gene, one from each parent. In other words, you exhibit a homozygous genotype with two mutant copies. I suspect that this particularly common context of homozygosity, recessive autosomal diseases, is one reason why it is less commonly discussed outside of specialist circles: there is a whole cluster of medical and social factors which lead to homozygosity and which are already the focus of attention. The genetic architecture of the trait is of less note than the etiology of the disease and the possible reasons in the family’s background which might have increased the risk, especially inbreeding. In contrast heterozygosity is generally not so disastrous. Even if functionality is not 100%, it is close enough for “government work.” The deleterious consequences of a malfunctioning allele are masked by the “wild type” good copy. The exceptions are in areas such as breeding for hybrid vigor, when heterozygote advantage may be coming to the fore. The details of complementation of two alleles matter a great deal to the bottom line, and the concept of hybrid vigor has percolated out to the general public, with the more informed being cognizant of heterozygosity.

But homozygosity is of interest beyond the unfortunate instances when it is connected to a recessive disease. Like heterozygosity, homozygosity exists in spades across our genome. My 23andMe sample comes up as 67.6% homozygous on my SNPs (which are biased toward ~500,000 base pairs which tend to have population wide variation), while Dr. Daniel MacArthur’s results show him to be 68.1% homozygous across his SNPs. This is not atypical for outbred individuals. In contrast someone whose parents were first cousins can come up as ~72% homozygous. This is important: zygosity is not telling you simply about the state of two alleles, in this case base pairs, it may also be telling you about the descent of two alleles. Obviously this is not always clear on the base pair level; mutations happen frequently enough that even if you carry two minor alleles it is not necessarily evidence that they’re identical by descent (IBD), or autozygous (just a term which denotes ancestry of the alleles from the same original copy). What you need to look for are genome-wide patterns of homozygosity, in particular “runs of homozygosity” (ROH). These are long sequences biased toward homozygous genotypes.
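To make the percentage above concrete, here is a minimal sketch of how such a figure could be tallied from a list of genotype calls. The two-letter call format and the "--" no-call marker are assumptions for illustration, and the input is toy data, not anyone's actual genotype.

```python
def homozygous_fraction(genotypes):
    """Return the share of called diploid genotypes whose two alleles match."""
    # Skip no-calls and anything that is not a two-allele call.
    called = [g for g in genotypes if len(g) == 2 and "-" not in g]
    if not called:
        raise ValueError("no called diploid genotypes")
    homozygous = sum(1 for g in called if g[0] == g[1])
    return homozygous / len(called)


# Toy input (hypothetical): "--" marks a no-call.
example = ["AA", "AG", "CC", "--", "GT", "TT"]
print(f"{homozygous_fraction(example):.1%}")
```

Note that a number like this says nothing about where along the genome the homozygous calls sit, which is exactly why runs matter.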

What ROH can tell you about an individual, and perhaps a population, becomes clearer when you conceptualize in your mind’s eye the basic dynamics which occur in the course of biological replication in diploid sexual organisms. Each individual receives half their autosomal genome from each parent. Though genes are abstractions, individual units at the root of a complex causal sequence which maps to a phenotype, a trait, they’re also physical entities embedded within the structure of DNA. This structure is a physical sequence, whereby you have adjacent base pairs, clusters of which define genes, intergenic regions, exons, introns, promoters, etc. In other words, the whole alphabet soup of molecular genetics. The spatial relationship of genes to each other along the chromosome allowed for linkage mapping decades before the biophysical substrate of DNA was known to be critical to the whole process. Particular sequences of alleles may therefore be inherited together, and form a haplotype. Over the generations the associations of these distinctive alleles in haplotypes dissolve through recombination, a physical process which erodes the structural integrity of chromosomal sequences.

With these basics in mind, let’s move to a specific repulsive example. Imagine a father who impregnates his daughter. Why is this repulsive to us? From a consequential “gene’s eye” perspective the father is subverting the beauty of sexual reproduction whereby genetic variation is mixed & matched across individuals. Colloquially, where the daughter would be 50% of the father genetically, the child of the daughter and her father would be 75% of the father genetically. From a gene-only perspective this may be favorable, as the father is coming closer to cloning himself, but we all know that the rate of breakdown of the “vehicle” in these individuals is high. Why? Inbreeding leads to a relatively massive increase in homozygosity as chromosomal segments identical by descent are paired off against each other. We know that the problem is that a host of nasty recessive diseases are highly likely in inbred individuals.

All humans carry a large load of deleterious alleles. Some of these may be potentially lethal. But like bombs without the trigger, a functional copy of the allele complements and masks the mutant variety and we carry on. Many of these mutants are particular to our family, and some of them are private even to ourselves, the outcome of de novo mutations which make each human a distinctive genetic island (at least until they reproduce and pass on their mutational distinctions). Therefore a man who mixes his own genes together in the act of incest is potentially lighting the fuse whereby these hidden malevolent mutants will explode from being cryptic genetic abnormalities toward full-blown disease monstrosities.

One statistic which would register incest would be ROH; naturally when you have long regions of recently IBD chromosomal segments adjacent to each other you’ll have a lot of homozygosity, since the paired alleles are replica copies. Assuming that an individual with many long ROH can survive and reproduce, over time these massive swaths of homogeneity will be wiped away by mutation and recombination as well as outbreeding. Incest is still arguably a health disaster, but one can imagine the motive genetic engines of evolutionary variation healing the damage over time.
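How quickly does recombination chop up those swaths? A rough back-of-the-envelope sketch follows. It relies on a standard textbook approximation that is my assumption here, not something taken from the papers discussed: crossovers fall as a Poisson process at about one per Morgan per meiosis with no interference, so a segment that has passed through m meioses has a mean length of roughly 100/m centimorgans. The scenario labels and meiosis counts assume otherwise outbred pedigrees.

```python
def expected_segment_cm(meioses):
    """Approximate mean length (cM) of an IBD segment whose two copies
    are separated by `meioses` meioses (Poisson crossovers, no interference)."""
    if meioses < 1:
        raise ValueError("need at least one meiosis")
    return 100.0 / meioses


# Number of meioses separating the two copies that meet in the child.
scenarios = {
    "father-daughter": 3,
    "first cousins": 6,
    "second cousins": 8,
    "shared ancestor 20 generations back on both sides": 40,
}
for label, m in scenarios.items():
    print(f"{label}: ~{expected_segment_cm(m):.1f} cM")
```

The point is only the trend: the more generations back the shared ancestor sits, the shorter the surviving runs tend to be.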

And it doesn’t have to be so extreme. Father-daughter or sibling incest is only a boundary condition. First cousin marriages aren’t nearly as disastrous; the fecundity of British Pakistanis despite higher rates of genetic abnormalities is clear evidence of this. They are certainly more evolutionarily fit than non-Pakistani Brits, who do not reproduce at the clip of 4 children per family. These clans will exhibit more modest levels of ROH because the coefficient of relationship between cousins is only 1/8, as opposed to 1/2 between parents and children or full siblings.
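The 1/8 and 1/2 figures fall out of simple path counting. As a minimal sketch, assuming none of the shared ancestors are themselves inbred: sum (1/2)^n over every path of n meioses that links the two relatives through a shared ancestor, and halve the result to get the expected inbreeding coefficient of their child. The function names are mine, for illustration only.

```python
from fractions import Fraction


def coefficient_of_relationship(path_lengths):
    """Sum (1/2)**n over each path of n meioses linking two relatives
    through a shared ancestor (ancestors assumed not inbred)."""
    return sum((Fraction(1, 2) ** n for n in path_lengths), Fraction(0))


def offspring_inbreeding(r):
    """Expected inbreeding coefficient of a child of two relatives."""
    return r / 2


pairs = {
    "parent and child": [1],      # one direct link
    "full siblings": [2, 2],      # one path via each parent
    "first cousins": [4, 4],      # one path via each shared grandparent
}
for label, paths in pairs.items():
    r = coefficient_of_relationship(paths)
    print(f"{label}: r = {r}, child's F = {offspring_inbreeding(r)}")
```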

The figure to the left is from a 2008 paper on ROH in Europeans. Specifically these are Orcadians or part-Orcadians, a population you should be familiar with from the HGDP panel. Orcadians are natives of the Orkney islands just off the north coast of Scotland. Though of somewhat diverse origins, Viking, Scot and Pict, being islanders they’ve developed their own genetic peculiarities because of their isolation. A good rule of thumb is that any body of water is a fearsome barrier to casual gene flow. On the y-axis you see the total number of ROH in the genome of a given individual. I point you to the methods if you are curious as to the exact parameters they specified in their calculation. ROH is assessed over a window of the genome, and naturally one can vary its width, as well as the stringency in registering a particular region as a run or not a run. On the x-axis is the total length of ROH in base pairs. What you see is a positive correlation between the number of ROH and their total genomic length. Those Orcadians who are genetically more diverse because of non-Orcadian parentage have the least homozygosity in their genomes. Those who are products of recent cousin marriage have the most. But notice a peculiar pattern: there’s a curvilinear trend to the values. In those individuals who presumably have very high inbreeding coefficients the total length of ROH seems to exceed one’s expectation based on just the total number of ROHs. Why? Because they have very long runs of homozygosity indeed. This is just what we’d expect from the sort of process I described earlier, where it takes many generations for the long chromosomal sequences to be broken apart by recombination.
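To see what the two axes are counting, here is a deliberately simplified sketch of run detection. It is not the method of the paper: it has no sliding window and no tolerance for stray heterozygous calls, and the thresholds `min_snps` and `min_bp` are hypothetical knobs standing in for the "stringency" mentioned above. Input is a position-sorted list of (position, is_homozygous) pairs for one chromosome.

```python
def _keep_if_long(positions, runs, min_snps, min_bp):
    """Record the run if it has enough SNPs and spans enough base pairs."""
    if positions and len(positions) >= min_snps \
            and positions[-1] - positions[0] >= min_bp:
        runs.append((positions[0], positions[-1]))


def find_roh(calls, min_snps=50, min_bp=500_000):
    """Return (start_bp, end_bp) for each unbroken stretch of homozygous calls.
    Any heterozygous call ends the current run."""
    runs, current = [], []
    for pos, is_hom in calls:
        if is_hom:
            current.append(pos)
            continue
        _keep_if_long(current, runs, min_snps, min_bp)
        current = []
    _keep_if_long(current, runs, min_snps, min_bp)
    return runs


# Toy data with tiny thresholds so the example is readable.
calls = [(100, True), (200, True), (300, True), (400, False),
         (500, True), (600, True), (700, True), (800, True)]
runs = find_roh(calls, min_snps=3, min_bp=200)
print("number of ROH:", len(runs))
print("total length:", sum(end - start for start, end in runs))
```

The y-axis of the figure corresponds to the count, the x-axis to the summed length; a few very long runs push the sum up much faster than the count.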

Before I get you too excited about the genetics of European homozygosity, let’s take a wider view. Some of the same researchers who published the paper above have come out with a set of results which survey the world. Genomic Runs of Homozygosity Record Population History and Consanguinity:

The human genome is characterised by many runs of homozygous genotypes, where identical haplotypes were inherited from each parent. The length of each run is determined partly by the number of generations since the common ancestor: offspring of cousin marriages have long runs of homozygosity (ROH), while the numerous shorter tracts relate to shared ancestry tens and hundreds of generations ago. Human populations have experienced a wide range of demographic histories and hold diverse cultural attitudes to consanguinity. In a global population dataset, genome-wide analysis of long and shorter ROH allows categorisation of the mainly indigenous populations sampled here into four major groups in which the majority of the population are inferred to have: (a) recent parental relatedness (south and west Asians); (b) shared parental ancestry arising hundreds to thousands of years ago through long term isolation and restricted effective population size (N_e), but little recent inbreeding (Oceanians); (c) both ancient and recent parental relatedness (Native Americans); and (d) only the background level of shared ancestry relating to continental N_e (predominantly urban Europeans and East Asians; lowest of all in sub-Saharan African agriculturalists), and the occasional cryptically inbred individual. Moreover, individuals can be positioned along axes representing this demographic historic space. Long runs of homozygosity are therefore a globally widespread and under-appreciated characteristic of our genomes, which record past consanguinity and population isolation and provide a distinctive record of the demographic history of an individual’s ancestors. Individual ROH measures will also allow quantification of the disease risk arising from polygenic recessive effects.

Their data set consists of the HGDP sample populations, so you naturally have the broad geographic clusters such as Africa, Europe, West Asia, Central/South Asia, East Asia, Oceania, and the New World. Two big dynamics are superimposed upon each other in the patterns of ROH: “deep history” demographic processes such as bottlenecks and population expansions, and cultural anthropological patterns which we see around us such as cousin marriage within inbred clans. To find the former you need to survey the genome finely. In contrast the latter leaves pretty obvious signs genomically in the form of very long ROH, as well as clusters of recessive diseases.

The first figure shows the distribution of different lengths of ROH by population:

[Figure 1 from Kirin et al. (2010)]

Here’s the take away:

– Oceanians have many short ROH, but as you increase the length of ROH threshold they are not exceptional at all

– The New World samples persist in having a disproportionate number of ROH no matter the length, though the number does drop as you increase the length threshold. This makes sense: the human genome is of finite length and you can only have so many very long ROHs

– The West Asian and Central/South Asian populations seem to have more long ROHs than the other Eurasian or African groups, though they’re not exceptional in the lowest category

– The Africans have the least ROH, especially in the category of very short runs

Before I comment on these patterns in detail, let’s quickly check out the next figure. It looks at Africans only, but divides the sample into those which are hunter-gatherers and those which are agriculturalists.

[Figure 2 from Kirin et al. (2010)]

The hunter-gatherers have more, and longer, ROH than the agriculturalists. Why? The answer in large part explains the geographical patterns as well: the agriculturalists have a larger long term effective population. Effective population just refers to the proportion of the population which contributes genetically to the next generation. A small effective population means a lot of genetic drift because of increased sample variance, and tends to converge upon consanguinity. If your tribe is small enough the only people you may find to marry are your cousins. As I noted above, this will produce long ROH as individuals will have descent through multiple lines from the same ancestor, increasing the probability of autozygosity greatly. The same process explains why West Asians and Central/South Asians are enriched for long ROH relative to other groups excepting Amerindians. Here’s a map from Consang.net:
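The link between small effective population and autozygosity can be put in numbers with the classic idealized-population recursion, which I am bringing in as an assumption rather than taking from the paper: with random mating in a population of constant effective size N_e, the expected probability that two alleles at a locus are identical by descent after t generations is 1 - (1 - 1/(2 N_e))^t. The sizes and generation count below are hypothetical.

```python
def drift_inbreeding(ne, generations):
    """Expected autozygosity after `generations` of random mating in an
    idealized population of constant effective size `ne`."""
    if ne <= 0 or generations < 0:
        raise ValueError("ne must be positive and generations non-negative")
    return 1.0 - (1.0 - 1.0 / (2.0 * ne)) ** generations


# Hypothetical effective sizes, compared over the same 50 generations.
for ne in (100, 1_000, 10_000):
    print(f"Ne={ne}: F after 50 generations = {drift_inbreeding(ne, 50):.4f}")
```

No one has to marry a cousin on purpose for this to accumulate; in a small enough group everyone eventually is one.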

[Map from Consang.net]

Many Muslim societies practice cousin marriage, and many Muslims even argue that it is the Islamic practice (Muhammad married one of his cousins among his many wives. Strangely somehow these Muslims don’t argue that it is also the Muslim custom to marry old rich widows, though some do argue for the importance of marrying barely pubescent girls). Additionally, in India many Hindu groups in the South practice consanguineous marriages, including uncle-niece marriage. This is all occurring now, and so produces signatures of long ROH in many families. The final figure breaks down the individuals from selected populations, with again the y-axis being the number of ROH and the x-axis being the total length of the ROH:

[Figure 3 from Kirin et al. (2010)]

The population sets are representative of broader geographic clusters. The Karitiana are from the Amazon, the Mandenka from Senegal, and the Balochi from Pakistan. If you don’t know where the French and Japanese are from, I would ask you never to leave a comment on this weblog. Notice that a few French, Mandenka, and Japanese individuals deviate from their main clusters. These are cryptically inbred; perhaps their parents were cousins, or some of their grandparents were cousins. In contrast the Baloch have a wide range in terms of length of ROH; this is typical of populations where a large proportion of individuals are the products of cousin marriage, but many are not. The fact that individuals would exhibit a large variance of expected relatedness between their parents means that their own inbreeding coefficients and the genomic correlates (in this case ROH) would also vary greatly. The same parameter is operative among the Karitiana, an endangered ethnic group which presumably has a small “mate market” available to each individual.

So what about the Papuans? Their cluster is tight, and they don’t have nearly the total length of ROH as the Amazonian tribe. But remember that in the first figure they had many short ROH. A plausible explanation for this is that the Papuans went through an ancient bottleneck, from which they have expanded. The bottleneck increased genetic drift and so generated highly common haplotype blocks which combined to produce runs of homozygosity. But over time these blocks would have disintegrated through mutation and recombination. ROH in the Papuans then is simply a shadow of demographic events past, while ROH in Baloch is evidence of demographic events present.

These two balancing realities are starkly illustrated in the supplements when you drill down to the South and Central Asian groups. In the figure it is clear that the group with the consistently highest number of ROH are the Kalash. This makes sense. The Kalash are a genetic isolate because they’re traditionally a pagan non-Muslim group isolated in the remote Chitral region of Pakistan. Because Muslims cannot join their tribe, for over a thousand years the gene flow has been unidirectional, as Kalash convert to Islam and so assimilate into the broader Pakistani society. In contrast the other Pakistani groups have a huge variance in the total amount of ROH. The individuals with the least ROH in both total length and number in the sample are Baloch, Brahui and Makrani, as are some of the individuals with the highest values on these statistics! While the Kalash have been slowly and consistently ground down by the pressure of small population size, the Baloch, Brahui, and Makrani are subject to the hammer-blows of several generations of first cousin marriages in inbred clans. These repeated marriages across the generations rapidly increase the ROH, as first cousins may be more closely related to each other genetically than they are anthropologically.

In the pre-genomic era it was simple to calculate inbreeding. Just look at pedigrees. From this you derived the inbreeding coefficient. The key is to remember that the relationships among the sum totality of one’s ancestors were critical in this calculation. In the USA marriages between first cousins occur between individuals whose grandparents are not usually related. But in other societies the generation of the grandparents, and perhaps great-grandparents, may also have been cousins. But pedigrees have limits, and may miss deep ancestry. The figure to the left, from the first paper I referenced, shows the relationship of the proportion of an individual’s ancestry which is identical by descent as calculated by genomic (ROH) methods on the y-axis and conventional ones on the x-axis (pedigree). There’s an obvious correlation, but observe the slight bias toward values above the line of best fit, and the fact that the y values are higher than the x. Genomic estimates capture common ancestry which lay outside the purview of conventional genealogy!
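One way to turn runs into a number comparable with a pedigree inbreeding coefficient is to divide the summed ROH length by the length of the autosomal genome covered by the SNPs. The sketch below does just that; the denominator is an approximate figure I am assuming for illustration, and the single 180 Mb run is toy data chosen so the answer lands on the 1/16 expected for the child of first cousins.

```python
# Assumed, approximate autosomal length in base pairs (illustrative only).
AUTOSOME_BP = 2_880_000_000


def f_roh(runs, autosome_bp=AUTOSOME_BP):
    """Fraction of the autosomal genome lying in runs of homozygosity.
    `runs` is a list of (start_bp, end_bp) pairs pooled over chromosomes."""
    total = sum(end - start for start, end in runs)
    return total / autosome_bp


# Toy example: 180 Mb of ROH in total.
print(f_roh([(0, 180_000_000)]))
```

A genomic value sitting above the pedigree value is what you would see if the pedigree simply does not reach back far enough to record all the shared ancestors.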

The implications of these patterns are two-fold: first, looking backward toward human history, and second, forward toward biomedical science. Patterns of ROH here are roughly in line with a serial bottleneck Out of Africa model; the further populations are from Africa the more short ROH they have. African populations have the least of these because of their larger long term effective population size, and relative insulation from the bottlenecking process. A shorter term phenomenon is that of consanguineous marriage patterns, whether conscious and culturally normative (as in the Muslim world and parts of South Asia), or due to demographic constraint, as is the case among hunter-gatherers. These two processes together are relevant because of the prominence of recessive diseases within the domain of medical genetics. Clearly very long ROH is a sign of inbreeding, and so a likely higher susceptibility of an individual to a host of ailments. But the authors note that the sum effect of many short ROH may also be problematic, especially due to the fact that these together may form the preponderance of the ROH within the genomes of many populations.

So far I’ve basically alluded to demographic history, and how it shapes the genome through processes which are fundamentally neutral and stochastic. Inbreeding itself can be thought of as a form of super-charged drift, as the long term effective population of a breeding group collapses in on itself. But what about natural selection? I decided to take a closer look at the ROH of Dr. Daniel MacArthur of Genomes Unzipped. One of his longest regions is on chromosome 2, is ~2 Mb in length, and runs from position 134606441 to position 136593184. In 23andMe there’s a position which I think might explain this: 136325116. That’s the number for rs4988235 in the 23andMe data file. Variation on this SNP tracks lactase persistence in Europeans. Dr. Daniel MacArthur has the genotype for lactase persistence in the homozygote form. Are we seeing the long haplotype associated with lactase persistence here in this long ROH which rose rapidly in frequency in the last 10,000 years because of natural selection? In general the parameters outlined in the paper satisfy the broad sketch of human history, but there may be interesting detail on the margins left out of the picture.
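The arithmetic behind that hunch is just an interval check, sketched here with the coordinates quoted above. The helper name is mine; the only claim the code makes is that the SNP position falls inside the run and that the run is a shade under 2 Mb.

```python
def contains(run, position):
    """True if `position` lies within the (start_bp, end_bp) run, inclusive."""
    start, end = run
    return start <= position <= end


roh_chr2 = (134_606_441, 136_593_184)   # run boundaries quoted in the text
rs4988235_pos = 136_325_116             # position quoted in the text

length_bp = roh_chr2[1] - roh_chr2[0]
print(f"run length: {length_bp / 1e6:.2f} Mb")
print("SNP inside run:", contains(roh_chr2, rs4988235_pos))
```

Being inside the run is of course only consistent with the selection story, not proof of it.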

Finally, let’s go back to heterozygosity vs. homozygosity. I recently watched the documentary “Is it Better to be Mixed Race?” Setting aside the obvious reality that this sort of program reflects the Zeitgeist of the era (it is rather obvious that a Victorian scientist could have produced a different documentary, even with the same evidence), near the end there is a comparison of ROH across populations and individuals. The comparison was actually done by the research group which published the paper I just reviewed. If you jump to 38 minutes into the film and just watch they’ll lay out the results, but I’ll tell you what they found. They compared two European men, a South Indian woman, and a man whose father was English and mother Nigerian. The European men had the expected levels of homozygosity, on the higher end. The South Indian woman had lower levels of aggregate homozygosity. This should be expected, as India is relatively genetically diverse on a pan-Eurasian scale. Finally, the mixed race male had almost no homozygosity to speak of. The principal investigator admitted that out of 5,000 individuals whom he had tested and analyzed this was the most extreme result, and he had to recheck it. Why? Three factors:

– The mother is Nigerian, from a population which is relatively genetically diverse

– The genetic distance between the father and mother is rather high

– Finally, because the man is a first generation hybrid on all the loci where Africans and Europeans tend to differ he’ll be much more likely to be heterozygous
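The third factor is easy to quantify for a single locus. As a sketch, assuming random mating within each parental population and made-up allele frequencies: a child of two parents from the same population is heterozygous with probability 2p(1-p), while a first generation child of parents drawn from populations with frequencies p1 and p2 is heterozygous with probability p1(1-p2) + p2(1-p1).

```python
def het_within(p):
    """Chance of a heterozygous genotype when both parents come from
    one randomly mating population with allele frequency p."""
    return 2 * p * (1 - p)


def het_f1(p1, p2):
    """Chance of a heterozygous genotype when one allele is drawn from a
    population with frequency p1 and the other from one with frequency p2."""
    return p1 * (1 - p2) + p2 * (1 - p1)


# Hypothetical frequencies at a locus where the two populations differ sharply.
p1, p2 = 0.9, 0.1
print(f"within population 1: {het_within(p1):.2f}")
print(f"within population 2: {het_within(p2):.2f}")
print(f"first generation mix: {het_f1(p1, p2):.2f}")
```

When the two frequencies are equal the formulas agree, so the excess heterozygosity comes entirely from loci where the populations differ.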

I’ll let the authors have the last word:

Long ROH are a neglected feature of our genome, which we have shown here to be universally common in human populations and to correlate well with demographic history. ROH are, however, only partially predictable from an individual’s background (due to the stochastic nature of inheritance). As well as conferring susceptibility to recessive Mendelian diseases, ROH are also potentially an underappreciated risk factor for common complex diseases, given the evidence for a recessive component in many complex disease traits…they will allow quantification of the risk arising from recessive genetic variants in different populations.

Citation: Mirna Kirin, Ruth McQuillan, Christopher S. Franklin, Harry Campbell, Paul M. McKeigue, & James F. Wilson (2010). Genomic Runs of Homozygosity Record Population History and Consanguinity PLoS ONE : 10.1371/journal.pone.0013996

Image Credit: Allison Stillwell