Razib Khan

May 27, 2019

Population genetics + “deep learning”

Population genetics is many things, but a popular field that gets written up in Wired or the tech press is usually not one of them. It emerged out of Mendelian genetics in the early decades of the 20th century, transforming elegant pedigrees into abstruse algebraic formulae. It was a peculiar hybrid of mathematics and evolutionary biology, both obsessions of late 19th-century Victorian academics. Population genetics was as much a product of a particular history as the topics it studied.

Through the population genetic lens, evolution became simply the “change in allele frequencies over time.” “Allele” was an early term for a genetic variant; different alleles were correlated with different patterns of inheritance.

Whereas some fields of quantitative science are focused on the analysis of collected data, early population genetics was rather more fixated on logical deduction from theoretical models. These models involved algebraic inferences that followed from assumptions about the values of parameters such as mutation or natural selection, in the context of randomly mating populations. On occasion, these models were supplemented with geometric analogies and illustrations, but by and large, this domain of science was inhabited by thinkers who were comfortable with abstract symbols, rather than the mess and fuss of bench biology.
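
To make the flavor of these models concrete, the canonical random-mating result is the Hardy-Weinberg principle: given allele frequencies p and q = 1 − p at a biallelic locus, genotype frequencies fall out as p², 2pq, and q². A minimal Python sketch (my illustration, not anything from the original literature):

```python
# Hardy-Weinberg: under random mating, genotype frequencies at a
# biallelic locus follow directly from the allele frequency p.

def hardy_weinberg(p):
    """Expected genotype frequencies (AA, Aa, aa) given allele frequency p."""
    q = 1.0 - p
    return p * p, 2.0 * p * q, q * q

print(hardy_weinberg(0.3))  # ≈ (0.09, 0.42, 0.49)
```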

This was a matter of necessity as much as preference.

There simply was not much data in early genetics on a population-wide scale.

The structure of DNA was not elucidated until 1953. Molecular evolution did not emerge as a field until the next decade, and what we term genomics is a product of the very end of the 20th century.

But the growth in data since the year 2000 has been exponential. For its first 80 years, population genetics was a field with too little data, fixated on theory. In the last 20 years, as population genomics has bloomed, researchers have had to confront the fact that a theoretical edifice built when one had access to genetic variation at dozens of loci within a species is not adequate in a world where one has access to whole genomes from hundreds of individuals.

Population genetics is now as much data science as theoretical science.

Terms such as “machine learning” and “deep learning” have the characteristic of being both banal and esoteric. Who doesn’t know what a machine is? Or what deep means? And everyone learns! But of course, these terms refer to fields within computer science that have emerged to deal with the mass of data modern society generates. Machines learning deeply seems quite a mysterious feat!

When population genetics was developed in the 1920s and 1930s to model evolutionary processes, it was viewed as something of a mystery by most biologists. Its theorists focused on the implications of models of the change in frequencies of alleles. They dealt in stylized conceptions of a single mutation rising rapidly in frequency due to strong positive selection, or perhaps a new mutation bouncing up and down in the “random walk” of genetic drift. Relatively simple mathematical processes described simple evolutionary dynamics, which one could test against the limited data on hand.
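
That random walk is easy to make concrete. A toy sketch of the standard Wright-Fisher model (my own illustrative code, not a published pipeline): each generation, the allele count in a population of N diploids is a fresh binomial draw from the current frequency, so the frequency wanders until the allele is lost or fixed.

```python
import random

def wright_fisher(n_diploids, p0, generations, seed=1):
    """Toy Wright-Fisher drift at one biallelic locus: each generation the
    2N allele copies are sampled binomially from the current frequency."""
    rng = random.Random(seed)
    two_n = 2 * n_diploids
    p, trajectory = p0, [p0]
    for _ in range(generations):
        copies = sum(rng.random() < p for _ in range(two_n))  # binomial draw
        p = copies / two_n
        trajectory.append(p)
        if p in (0.0, 1.0):  # allele lost or fixed; the walk is over
            break
    return trajectory

traj = wright_fisher(n_diploids=50, p0=0.5, generations=500)
print(f"generations simulated: {len(traj) - 1}, final frequency: {traj[-1]}")
```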

Adaptation to malaria in Africa and the emergence of sickle-cell disease is a case in point. Here the selective advantage to individuals carrying a single copy of the mutant allele, who are protected against malaria, is balanced against the fitness cost to those who carry two copies and so develop sickle-cell disease. A simple algebraic relationship between the cost of sickle-cell disease and the protection conferred on carriers allows one to compute the equilibrium allele frequency at a single locus within a population.
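
The textbook version of that algebra sets heterozygote fitness to 1, with selection coefficients s against malaria-susceptible non-carriers and t against sickle-cell homozygotes; the equilibrium sickle-allele frequency is then s / (s + t). A small sketch, with illustrative rather than empirically estimated coefficients:

```python
def sickle_equilibrium(s, t):
    """Equilibrium frequency of the sickle allele under heterozygote advantage.

    Relative fitnesses: AS = 1 (malaria-protected carriers),
    AA = 1 - s (malaria-susceptible), SS = 1 - t (sickle-cell disease).
    Classic balancing-selection result: q_hat = s / (s + t).
    """
    return s / (s + t)

# Illustrative coefficients only -- not estimates from real data:
print(f"equilibrium sickle-allele frequency: {sickle_equilibrium(0.1, 0.8):.3f}")
# equilibrium sickle-allele frequency: 0.111
```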

But it turns out that much of natural selection is not so amenable to classical population genetic models.

A great deal of natural selection in populations is not easily localized to a specific locus. The human genome has roughly 19,000 protein-coding genes and tens of millions of polymorphisms. Though some selection events fit the classical model of a sweep up from a single mutation, most adaptation may occur through subtle shifts in the frequencies of many alleles across the genome. Early 20th-century population genetic modeling was not designed to detect these subtle processes, because the data to detect them empirically would not exist for decades.
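
A toy numerical sketch of why such polygenic shifts are hard to see (assuming a simple additive model of my own construction): nudge each of a thousand trait-increasing alleles up by one percent, and the population’s mean genetic value moves substantially even though no single locus changes detectably.

```python
import numpy as np

rng = np.random.default_rng(0)
n_loci = 1_000
p = rng.uniform(0.2, 0.8, n_loci)       # starting allele frequencies
beta = rng.normal(0.0, 0.05, n_loci)    # small additive effect sizes

# Subtle polygenic adaptation: each trait-increasing allele shifts by 1%.
p_shifted = np.clip(p + 0.01 * np.sign(beta), 0.0, 1.0)

mean_before = np.sum(2 * p * beta)          # additive mean genetic value
mean_after = np.sum(2 * p_shifted * beta)
print(f"largest per-locus frequency change: {np.abs(p_shifted - p).max():.3f}")
print(f"shift in mean genetic value: {mean_after - mean_before:.2f}")
```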

This is where the buzzwords step in. Deep learning is a method of extracting features, or patterns, from a mass of raw data that is not digestible by humans. This is why it is applied to online marketing: learning from the behavior and individual preferences of tens of millions of people to generate a customized set of choices. This contrasts with earlier methods of marketing, which relied on segmentation into specific demographics defined by analysts. Classical marketing is not useless, but in the context of e-commerce, the newer methods of targeting individuals based on a mass of data are even more effective.

Machine and deep learning do not mean population genetic theory is irrelevant. On the contrary, classical population genetic theory is invaluable as a guide to the broad sweep of evolutionary change. It generates the questions that one can finally test. Data science inference without a firm theoretical basis is directionless. But to test the details of population genetic processes, one needs to lean on futuristic computer science.
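
As one example of what that leaning looks like in practice, recent work in the field trains convolutional neural networks directly on genotype matrices, with coalescent simulations supplying the labeled training data. A minimal, hypothetical sketch in PyTorch, with random 0/1 matrices standing in for simulated data:

```python
import torch
import torch.nn as nn

# A genotype "image": rows are sampled haplotypes, columns are segregating
# sites. Real pipelines label such matrices via simulation (e.g. sweep
# vs. neutral); the random data below is a placeholder.
n_haplotypes, n_sites = 40, 64

class SweepClassifier(nn.Module):
    """Tiny CNN that maps a haplotype matrix to sweep/neutral logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),
            nn.Linear(8 * 4 * 4, 2),  # two classes: neutral, sweep
        )

    def forward(self, x):
        return self.net(x)

model = SweepClassifier()
batch = torch.randint(0, 2, (16, 1, n_haplotypes, n_sites)).float()
print(model(batch).shape)  # torch.Size([16, 2])
```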

Modern sequencing machines generate more data in a week than all of 20th-century genetics did over decades.

Only the interpretative tools developed in this century can absorb the scale of 21st-century genomics.

Population genetics + “deep learning” was originally published in Insitome on Medium.
