Razib Khan One-stop-shopping for all of my content

May 2, 2018

The genetics of forensic identification

Filed under: criminology,dna-sequencing,Forensics,Genetics — Razib Khan @ 4:03 pm

The arrest of a suspect in the infamous “Golden State Killer” case was notable for how DNA evidence was used to identify him. The media attention the case has garnered means that forensic genetics has come to public attention again in 2018. Not that the public has been unaware of the power of genetics in legal and criminal contexts: The Innocence Project famously leveraged DNA results to show that some individuals had been falsely convicted, eliminating them as suspects because their DNA did not match samples from the crime scene.

But the recent illustration of the power of 21st century genomics, with researchers digging through public databases to search for relatives of the potential suspect, was revelatory for much of the world, which had not kept up with the breakneck pace of change in genetics.

The first human genome cost $3,000,000,000 and took more than a decade to complete. Today a good quality human genome sequence can be had for less than $1,000, and generated in around a day in a pinch. The field has been subject to massive changes in the last 10 years, crashing through Moore’s Law and transforming what geneticists are capable of in the present.

The arrest of a suspect in killings that date back four decades has awakened the public to the 21st-century reality that geneticists have already been living in. It’s like Clark Kent transforming into Superman.

Obviously using genetics to resolve legal disputes is not new at all. Blood group inheritance patterns were understood early in the 20th century, and brought to bear in cases such as paternity disputes. But blood groups are only a small number of traits, with a limited amount of variation. In a huge number of cases inheritance patterns wouldn’t resolve anything. If ~25% of the population had blood group A, then finding blood group A in a sample would still leave a quarter of the population as candidates, even if the result would be useful in specific cases.

But even techniques as primitive as blood group inheritance illustrate the power of genetic techniques in the 20th century: they could eliminate a large number of possibilities. ~75% of the population does not have blood group A, so if you are looking at a large number of suspects then removing three out of four possibilities might be worth it.

By the latter decades of the 20th century, forensic genetics took this to the next step. With the molecular revolution in biology, geneticists didn’t have to focus on blood groups — rather, they could look at variation at specific genes that they obtained from various types of biological samples. With the development of new techniques of amplifying DNA from infinitesimally small samples in the 1990s, the amount of genetic material needed declined greatly, making it feasible to revisit cases where DNA analysis was previously deemed impossible.

The combination of molecular biology and genetics in the late 1990s was a forensic “killer app”, but there was still the problem that geneticists needed to target loci that had enough variation that they could differentiate individuals. If, for example, scientists tackled a genetic position where 99% of the population has one variant, and 1% the other, in most cases there wouldn’t be much novel information that one could use.

Because forensic labs could only focus on a limited number of genes, they quickly realized that the biggest “bang-for-the-buck” was in highly variable regions. In particular they looked at “short tandem repeats” (STRs). These are regions of the genome subject to expansion or contraction in the number of repeat units during DNA replication, which generates usable variation in repeat count. Where “single nucleotide polymorphisms” (SNPs) are limited to four different bases (A, C, G, and T, and typically only two of the four possible bases at any one position), STR loci can differ over many different copy number variants. Because STR loci mutate rapidly, they are more polymorphic and vary a great deal even across families.
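To make the contrast concrete, here is a minimal sketch of how much a single locus can tell two unrelated people apart. It assumes Hardy-Weinberg genotype proportions, which is a simplification, and the allele frequencies (including the ten-allele STR) are illustrative assumptions rather than real population data.

```python
from itertools import combinations_with_replacement


def genotype_match_probability(allele_freqs):
    """Chance that two unrelated people carry the same genotype at one locus.

    Assumes Hardy-Weinberg proportions (random mating), a simplification.
    """
    total = 0.0
    indices = range(len(allele_freqs))
    for a, b in combinations_with_replacement(indices, 2):
        p, q = allele_freqs[a], allele_freqs[b]
        # Homozygotes occur at p^2, heterozygotes at 2pq.
        genotype_freq = p * p if a == b else 2 * p * q
        # Both people must independently carry this same genotype.
        total += genotype_freq ** 2
    return total


skewed_snp = [0.99, 0.01]   # the 99% / 1% position described above
even_snp = [0.5, 0.5]       # a SNP with two equally common alleles
str_locus = [0.1] * 10      # hypothetical STR with 10 equally common alleles

for name, freqs in [("skewed SNP", skewed_snp),
                    ("even SNP", even_snp),
                    ("10-allele STR", str_locus)]:
    print(f"{name}: {genotype_match_probability(freqs):.4f}")
```

The lower the number, the more often the locus separates two random people. Under these assumptions the skewed SNP almost never does, while the multi-allele STR does so far more often than even the best-case SNP.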

All this is why they are at the heart of CODIS, the Combined DNA Index System, a governmental database used by law enforcement, and centralized at the federal level since 1998. Originally starting with 13 markers, today CODIS uses 20. Because of the high level of variation in these markers, random matches are rare. Though some geneticists dispute the statistics, the FBI estimates that a random CODIS profile should appear in about 1 in 10 million cases. That means that there should be more than 30 matches to a profile just based on chance in the United States. Obviously not all of these individuals would be suspects. If the DNA match were the only evidence, all but one of them would be false positives.
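The arithmetic behind that estimate can be sketched in a few lines. The 1-in-10-million figure is the one quoted above, the 300 million population is a round number (the true figure is somewhat higher, which is what pushes the count past 30), and the Poisson step is an added approximation rather than anything the FBI publishes.

```python
import math


def expected_chance_matches(population, one_in):
    """Expected number of people who match a profile purely by chance."""
    return population / one_in


def prob_at_least_one_match(expected):
    """Poisson approximation to the chance of one or more coincidental matches."""
    return 1.0 - math.exp(-expected)


us_population = 300_000_000   # round figure, assumed for illustration
one_in = 10_000_000           # random-match rate quoted in the text

expected = expected_chance_matches(us_population, one_in)
print(f"Expected chance matches: {expected:.0f}")
print(f"P(at least one chance match): {prob_at_least_one_match(expected):.6f}")
```

The point of the second function is that, at this match rate and population size, at least one coincidental match somewhere in the country is a near certainty.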

With limited markers, false positives (or, more precisely, the inability to distinguish between individuals) are always going to be an issue. Just by chance some people will match others within a subset of the genome, even at these highly variable positions. In contrast, the lack of a match eliminates someone from the pool of suspects.

This is why CODIS was useful for exonerating people: if one did not match the DNA sample, one knew that this was not a statistical fluke. A negative match gives a certain conclusion: the individuals are different.* A positive match gives a probability: the individuals are likely the same.

But CODIS is 1990s genetics. The apprehension of the suspect in the rapes and killings from the 1970s and 1980s in California was done with state-of-the-art genetics. While CODIS focuses on 20 markers at most, by 2010 tens of thousands of people, and today tens of millions, had large swaths of their genome genotyped, usually at 500,000 to 1,000,000 SNP positions. CODIS relied on STRs because of the expense of genotyping genetic positions in the 1990s.

But today “SNP-chips” cost less than $50 and return nearly a million markers. Data constraints are no longer an issue, and aligning patterns of SNPs across each chromosome allows for highly accurate assessment of relationships between people. Instead of returning the result that two individuals are probably siblings or parent-offspring, one can now conclude that two individuals are siblings who share 46.5% of their genome, and identify which segments of each chromosome they share.
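As a rough illustration of how a shared fraction maps onto a relationship, the sketch below lists the textbook average sharing for a few relationships and finds the closest label for an observed value. The function and table names are hypothetical, and the lookup is a deliberate toy: real tools rely on the lengths and positions of shared segments, not a single percentage.

```python
def cousin_shared_fraction(degree):
    """Average shared fraction for full k-th cousins: (1/2) ** (2k + 1)."""
    if degree < 1:
        raise ValueError("degree must be 1 (first cousins) or higher")
    return 0.5 ** (2 * degree + 1)


# Average expectations only; real pairs scatter around these values.
EXPECTED_SHARING = {
    "parent-offspring": 0.5,
    "full siblings": 0.5,
    "half siblings": 0.25,
    "first cousins": cousin_shared_fraction(1),
    "second cousins": cousin_shared_fraction(2),
    "third cousins": cousin_shared_fraction(3),
}


def closest_relationships(observed_fraction):
    """Return every label whose average sharing is nearest the observed value."""
    best = min(abs(observed_fraction - v) for v in EXPECTED_SHARING.values())
    return [name for name, v in EXPECTED_SHARING.items()
            if abs(abs(observed_fraction - v) - best) < 1e-12]


print(closest_relationships(0.465))
for name, value in EXPECTED_SHARING.items():
    print(f"{name}: {value:.2%}")
```

A value of 46.5% lands nearest both first-degree relationships, which is why the total alone is not enough. Telling siblings from parent and child comes down to the segment-level pattern, since full siblings have stretches where both chromosome copies match.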

With individual DNA data no longer being in short supply, what was needed was a database. CODIS may have about a million profiles, but those are not genotyped on modern DNA technology. Consumer genomics firms such as Ancestry, 23andMe, and Family Tree DNA do have SNP databases of more than a million (Ancestry has more than 10 million), but these are not accessible to law enforcement without a subpoena. However, there are public databases available with SNP genotype profiles. GEDMatch is one of those, with ~1 million entries.

The combination of hundreds of thousands of genetic markers across millions of individuals is powerful. Bringing these together unleashes the ability to look into the pedigrees of thousands of individuals who weren’t tested with just a single sample. There are ~300 million Americans. If GEDMatch has ~1 million samples in its database it is likely that the vast majority of Americans will have matches. Obviously the vast majority of people will not have a perfect match, but because modern methods use hundreds of thousands of variable positions a perfect match is not just a probability anymore, but a surety (barring identical twins there will be only one perfect match in the database per person at most). Matches with 2nd cousins and closer, meaning people who descend from common great-grandparents, can also be made with very high confidence. Even without relatives that close, many people can make matches with people more distantly related; the suspect in the case above shared common great-great-grandparents with people in the GEDMatch database.
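A back-of-the-envelope sketch shows why a database covering well under 1% of the population can still reach most people. It assumes each relative is in the database independently with the same probability, and the relative counts are purely hypothetical round numbers, since the real count depends on family sizes.

```python
def prob_relative_in_database(database_size, population, n_relatives):
    """Chance that at least one of n relatives appears in the database.

    Assumes relatives are sampled independently and uniformly, a simplification.
    """
    coverage = database_size / population
    return 1.0 - (1.0 - coverage) ** n_relatives


database_size = 1_000_000    # approximate GEDMatch size cited above
population = 300_000_000     # approximate US population cited above

# Hypothetical counts of detectable relatives, chosen only for illustration.
for n_relatives in (10, 200, 1000):
    p = prob_relative_in_database(database_size, population, n_relatives)
    print(f"{n_relatives} relatives: P(at least one match) = {p:.2f}")
```

The probability climbs quickly with the number of relatives in range, so widening the search from close family to distant cousins is what makes a match likely for most people.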

Genetic genealogists have become adept at looking at patterns of probabilistic matches that are quite distant, and triangulating them with other pieces of data to establish high confidence genealogical connections. Once those connections are made, obtaining DNA from suspects would yield a result that law enforcement could have near-perfect confidence in.

Law enforcement, the media, and the public are living in the genetic 1990s. The future is actually happening in the present, led by consumer genomics databases and “citizen scientists.” The lesson we can distill from the headlines is that genetic privacy may, in many ways, now be a 20th century novelty in the eyes of the law.



* There are exceptions to this when it comes to genetic mosaicism.


The genetics of forensic identification was originally published in Insitome on Medium, where people are continuing the conversation by highlighting and responding to this story.
