July 4, 2018

American folkways & American pedigrees

St. Augustine Historic district

Over ten thousand years after the first Americans settled the New World, from the Arctic to Patagonia, a new people arrived on these shores. From “deep history” to colonial history. Before Plymouth, before Jamestown, even before Santa Fe, there was St. Augustine, facing the Atlantic on the Florida coast. Occupied continuously since 1565, St. Augustine is the oldest continuously occupied European-founded city in the continental United States.

This small and obscure town has witnessed, from the fringes, the massive cultural, political, and demographic changes wrought across the continent of North America over 450 years. Its existence reflects a tentative and tenuous phase in the exploration of North America by Europeans. Though inauspicious, the fact that St. Augustine sustained itself on the edge of North America for centuries is a testament to something different that was looming on the horizon.

Sir Walter Raleigh

Of course the dominant culture and people in the continental United States by 1800 was not that of the Spaniards of Florida or New Mexico. Rather, it was from the British Isles. The two primary early zones of British settlement were in the Chesapeake and Massachusetts Bay areas. These were the core regions around which the Commonwealth of Massachusetts and Commonwealth of Virginia developed.

By the last quarter of the 18th century the entire eastern seaboard of North America was dominated by settler colonies of diverse European provenance.

In New England, Massachusetts had been joined by colonies which were offshoots in various ways from the founding stock that coalesced around Boston. The 50,000 or so English settlers of the middle of the 17th century were now 750,000 New Englanders, due to natural increase driven by the fecundity of its people. Similarly, to the north in Canada the original small number of French who settled on the margins of the vast coniferous forests in search of furs and other riches had flourished and grown into a fully-fledged society, which even maintained itself in the face of British conquest and domination, strong in their Roman Catholicism and fiercely devoted to their language.

In Virginia the younger sons of English nobility and the descendants of indentured servants had recreated an aristocratic society characterized by hierarchy and inequality typical of southern England on American shores. While the settlers of New England brought their Reformed Calvinist faith and its sectarian offshoots to the New World, seeking freedom to worship as they wished, the grandees of the Tidewater adhered nominally to the Anglican Church, and focused their energies on increasing their wealth and prestige. The kingdom which they wished to inherit was in this world, not the next!

Slaves cultivating tobacco in Virginia

Much of the wealth of the kingdom which the planters were building unfortunately consisted of slaves. The ancestors of black Americans arrived mostly in the 17th and 18th centuries. Though importation of slaves was legal almost into the 19th century, the reality is that the vast majority of the ancestors of black Americans date to a far earlier period. Because slave fertility was above replacement, the American trade in humans quickly became independent of international sources. The first person killed in the Boston Massacre was Crispus Attucks, a man of African and Native American ancestry, who was as American as American can be.

Along with Puritans, and the Southern planters and the slaves, on the eve of the Founding of the United States of America numerous other societies had developed organically from very different seeds. Dutch New Amsterdam had become New York by the Revolution, but the entrepreneurial spirit of the Dutch republic persisted, as the great port remained an ethnic melange driven by commerce, not the whims of aristocrats or the will of God. Further south, the great city of Philadelphia was founded by industrious English Quakers, but these were joined by German Protestants and people from Ulster and the border region between Scotland and England, the “Scots-Irish.”

Wyckoff House in Brooklyn dates to the Dutch period

Though Philadelphia was the destination for the hundreds of thousands of Scots-Irish fleeing deprivation in their homeland, they did not tarry long, pushing deep into the back-country of Pennsylvania, and then migrating south along the ridges of the Appalachians. Where the planter class established an aristocratic society of elites whose privileges were built on the backs of poor whites and black slaves, the rugged uplands of the vast southern stretches of the English colonies were populated by an individualistic and egalitarian people whose wealth was measured in their pride.

President Andrew Jackson, the son of Ulster migrants

These were the people caught up in the American Revolution, on both sides, rebel and loyalist. Despite their Northern European origins, with the exception of the black slaves, they were still a diverse motley. Thirteen separate and distinct colonies with many local subcultures. Not a single nation.

There were numerous waves of people who arrived after the British colonies became the United States of America. Irish and Germans in the decades before the Civil War transformed the culture of the United States of America, with their Catholicism and their beer, not to mention triggering an anti-immigrant populist movement which has resonances down to the present day.

After the traumatic conflict between the North and South, immigrants arrived in large numbers as industry began to demand labor in the cities and the vast open expanses of the West needed to be settled. Scandinavians, Southern and Eastern Europeans in the decades around 1900 congregated in the cities of the East and Midwest and worked in factories and planted farms deep on the Great Plains. From the Far East there even arrived Chinese to build the railroads and operate the mines across the Pacific coast. Racial resentment toward these Chinese led to explicit bans on immigration from Asian countries for decades.

Despite all this immigration, even as late as 1990 50% of the ancestry of the population of the United States of America was derived from those who were present and counted by the Census of 1790.

Though there are numerous interpretations and debates about American history, one thing that is clear is that it is a history of which we have copious records and documentation. The Founding Fathers are not dead names, but people who come alive in their correspondence. By the middle of the 19th century the immigrants who came to these shores are also visible in all the realism of their features through the magic of photography. Documents at Ellis Island mean that genealogical detective work can yield insights which illuminate the understanding of many a family’s past.

Irish immigrants

And yet genetics can shed light on historical patterns. Unlike written text genetics is neutral. It does not present a particular narrative or agenda. Though the tale genetics tells is that of the winners, there is no hiding this truth. In genetics the future belongs to those who procreate, and that is the foundation on which its logic is built.

Modern genetic technology surveying hundreds of thousands of markers in the human genome allows researchers to reconstruct pedigrees, family trees, and mark the history of peoples through their descent. While ancestry tests usually focus on deep history and ancient evolutionary and population events, modern genomic techniques allow for the exploration of events even within the last few hundred years.

Several years ago the genetics arm of Ancestry looked in their database, and selected 770,000 individuals of American heritage to analyze.

By surveying the patterns and clusters of relatedness, the researchers constructed a map which shows that most genetic variation in the United States is between the north and the south. That is, people from New England tend to be more different from people from the Deep South, than they are from those from neighboring states. In fact, the largest component of variation tracks geography very well, from northern New England, down to the Mid-Atlantic, then to the South.

The second component of variation tracks east to west, the direction of the migration of settlers. Very few people left Massachusetts for Mississippi. Many did leave Massachusetts for Michigan.

Those who pulled up stakes were not always the same sort of people who stayed home. So over time Westerners became somewhat distinct from Easterners. Those who left married others who left. Those who stayed continued to marry others who stayed.

And yet the reality is that different regions of the West were settled from different regions of the East.

The genetic clusters present some interesting results which are comprehensible only through the lens of history. Hawaii and Utah are two states which are a bit skewed to the north. But as it happens these are states which were heavily settled by New England Yankees. In contrast, though states such as California might be dominated by cosmopolitan cities such as San Francisco and Los Angeles, in the hinterland are many people whose roots are in the uplands of the South. Scots-Irish who traveled west in search of greener pastures, who brought their music, twang, and kinfolk.

Geographic patterns of genetic clusters

While Indiana was settled mostly from the South, there were far more Yankees who founded towns in Michigan and Wisconsin. Meanwhile, Ohio and Illinois were both divided between a northern portion settled from New England, and a southern expanse dominated by Scots-Irish “Butternuts.”

All this seems clear in the genetic results. Now we can quantify the differences. Illinois is tilted a bit to the northern migrants. Ohio somewhat to the southern ones. Historical debates can be resolved through genetic analyses!

Over the next few years tens of millions of more Americans will obtain direct to consumer genetic tests. The database will grow larger and larger. Many demographic questions related to the history of this country will not need to be explored through reconstruction of texts and laborious perusal of letters and court documents. Rather, scientists will simply scan through the pedigrees they construct from human genomes, and synthesize their results with the rich assortment of resources already available from the fields of genealogy and history.

A nation of immigrants and settlers will become an open book to all who wish to read their incredible stories.

August 28, 2017

When journalists get out of their depth on genetic genealogy

For some reason The New York Times tasked Gina Kolata to cover genetic genealogy and its societal ramifications in “With a Simple DNA Test, Family Histories Are Rewritten.” The problem here is that to my knowledge Kolata doesn’t cover this as part of her beat, and so isn’t well equipped to write an accurate and in-depth piece on the topic in relation to the science.

This is a general problem in journalism. I notice it most often when it comes to genetics (a topic I know a lot about for professional reasons) and the Middle East and Islam (topics I know a lot about because I’m interested in them). It’s unfortunate, but it has also made me a lot more skeptical of journalists whose track record I’m unfamiliar with.* To give a contrasting example, Christine Kenneally is a journalist without a background in genetics who nevertheless is immersed in genetic genealogy, so that she could have written this sort of piece without objection from the likes of me (she did write a book on the topic, The Invisible History of the Human Race: How DNA and History Shape Our Identities and Our Futures, which I had a small role in fact-checking).

What are the problems with the Kolata piece? I think the biggest issue is that she didn’t go in to test any particular proposition, and leaned on the wrong person for the science. She quotes Joe Pickrell, who knows this stuff like the back of his hand. But more space is given to Jonathan Marks, an anthropologist who is quite opinionated and voluble, and so probably a “good source” for any journalist.

Marks seems well respected in anthropology from what I can tell, but he’s also the person who put up a picture of L. L. Cavalli-Sforza juxtaposed with a photo of Josef Mengele in the late 1990s during a presentation at Stanford. Perhaps this is why anthropologists respect him, I don’t know, but I do not like him because of his nasty tactics (I wouldn’t be surprised if Marks had power he would make sure people like me were put in political prison camps, his rhetoric is often so unhinged).

Marks’ quotes wouldn’t be much of an issue if Kolata could figure out when he’s making sense, and when he’s just bullshitting. But she can’t. For example:

…“tells me I’m 95 percent Ashkenazi Jewish and 5 percent Korean, is that really different from 100 percent Ashkenazi Jewish and zero percent Korean?”

The precise numbers offered by some testing services raise eyebrows among genetics researchers. “It’s all privatized science, and the algorithms are not generally available for peer review,” Dr. Marks said.

The part about precise numbers is an issue, though a lot less of an issue with high density SNP-chips (the real issue is sensitivity to reference population and other such parameters). But if a modern test says you are 95 percent Ashkenazi Jewish and 5 percent Korean it really is different from 100% Ashkenazi. Someone who comes up as 5% Korean against an Ashkenazi Jewish background is most definitely of some East Asian heritage. In the early 2000s with ancestrally informative markers and microsatellite based tests you’d get somewhat weird results like this, but with the methods used by the major DTC companies (and in academia) today these sorts of proportions are just not reported as false positives. Marks may not know because this isn’t his area, but Pickrell would have. Kolata probably did not think to double-check with him, but that’s because she isn’t able to smell out tendentious assertions. She has no feel for the science, and is flying blind.

Second, Marks notes that the science is privatized, and it isn’t totally open. But it’s just false that the algorithms are not generally available for peer review. Not all the details of the pipeline are downloadable on GitHub, but the core ancestry estimation methods are well known. Eric Durand, who wrote the original 23andMe ancestry composition methodology, presented on it at ASHG 2013. I know because I was there during his session.

You can find a white paper for 23andMe’s method and Ancestry’s. Not everything is as transparent as open science would dictate (though there are scientific papers and publications which also mask or hide elements which make reproducibility difficult), but most geneticists with domain experience can figure out what’s going on and if it is legitimate. It is. The people who work at the major DTC companies often come out of academia, and are known to academic scientists. This isn’t blackbox voodoo science like “soccer genomics.”

Then Marks says this really weird thing:

“That’s why their ads always specify that this is for recreational purposes only: lawyer-speak for, ‘These results have no scientific standing.’”

Actually, it’s lawyer-speak for “do not sue us, as we aren’t providing you actionable information.” Perhaps I’m ignorant, but lawyers don’t get to define “scientific standing”.

The problem, which is real, is that the public is sometimes not entirely clear on what the science is saying. This is a problem of communication from the companies to the public. I’ve even been in scientific sessions where geneticists who don’t work in population genomics have weak intuition on what the results mean!

Earlier Kolata states:

Scientists simply do not have good data on the genetic characteristics of particular countries in, say, East Africa or East Asia. Even in more developed regions, distinguishing between Polish and, for instance, Russian heritage is inexact at best.

This is not totally true. We have good data now on China and Japan. Korea also has some data. Using haplotype-based methods you can do a lot of interesting things, including distinguish someone who is Polish from Russian. But these methods are computationally expensive and require lots of information on the reference samples (Living DNA does this for British people). The point is that the science is there. Reading this sort of article is just going to confuse people.

On the other hand a lot of Kolata’s piece is more human interest. The standard stuff about finding long lost relatives, or discovering your father isn’t your father. These are fine and not objectionable factually, though they’ve been done extensively before and elsewhere. I actually enjoyed the material in the second half of the piece, which had only a tenuous connection to scientific detail. I just wish these sorts of articles represented the science correctly.

Addendum: Just so you know, three journalists who regularly cover topics I can make strong judgments on, and are always pretty accurate: Carl Zimmer, Antonio Regalado, and Ewen Callaway.

* I don’t follow Kolata very closely, but to be frank I’ve heard from scientist friends long ago that she parachutes into topics, and gets a lot of things wrong. Though I can only speak on this particular piece.

January 23, 2012

Personal genomics and adoption

With DNA Testing, Suddenly They Are Family:

Several companies provide tests that can confirm whether adoptees are related to individuals they already know. Others cast a wider net by plugging DNA results into databases that contain tens of thousands of genetic samples, provided mostly by people searching for their ancestral roots. The tests detect genetic markers that reveal whether people share a common ancestor or relative.

Some experts on adoption and genetics have criticized ancestry and genealogy testing companies, saying they are, at times, connecting people whose genetic links are tenuous — in effect stretching the definition of a relative. Nevertheless, the growing popularity of the tests, combined with social media sites that connect people day to day, has given some adoptees a sense of family that feels tangible, intimate and immediate.


I think that these tenuous connections and slivers of information are better than nothing. This isn’t rocket science. And naturally many adopted people also couldn’t care less. This is a deeply personal issue, and the valence is going to be private. I suspect that those of us who aren’t adopted, and take for granted knowledge of our own family background, have a hard time imagining the value which even a 3rd or 4th cousin could give someone.

Additionally, though finding very close relatives is not that common (first cousins, let alone first order relatives), knowledge of more distant relations can still help you triangulate aspects of family history if you begin with nothing. To give a personal example I know someone whose paternal grandparents were immigrants from Germany. The maternal side is much more mixed, and some of the genealogical records hit dead-ends in the mid 19th century in the USA. It turns out that one of the individuals that this person is closest to on 23andMe is an African American (both maternal and paternal lineages are clearly African). What does this mean? The lead hasn’t been followed up, but combining family histories might be very informative in this case.

January 24, 2011

The genomic heritage of French Canadians

Image Credit: Anirudh Koul

One of the great things about the mass personal genomic revolution is that it allows people to have direct access to their own information. This is important for the more than 90% of the human population which has sketchy genealogical records. But even with genealogical records there are often omissions and biases in transmission of information. This is one reason that HAP, Dodecad, and Eurogenes BGA are so interesting: they combine what people already know with scientific genealogy. This intersection can often be very inferentially fruitful.

But what about if you had a whole population with rich robust conventional genealogical records? Combined with the power of the new genomics you could really crank up the level of insight. Where to find these records? A reason that Jewish genetics is so useful and interesting is that there is often a relative dearth of records when it comes to the lineages of American Ashkenazi Jews. Many American Jews even today are often sketchy about the region of the “Old Country” from which their forebears arrived. Jews have been interesting from a genetic perspective ...

September 12, 2010

The confusions of genetic relatedness

Last spring I posted ‘Beyond visualization of data in genetics’ in the hopes that people wouldn’t take PCA too far in assuming that the method was a reflection of reality in a definite fashion. Remember, PCA visualizations are showing you two, and at most three, dimensions in genetic variation within the data set at any given time. The fine print is important; e.g., “PC 1 15%”, “PC 2 4.5%”, etc., which points to the magnitude of the dimensions within the data. You see the largest, and likely historically most significant on a population wide scale, genetic variances, but there’s still a large remainder left over. But when I look at referrals from message boards people obviously aren’t careful with what PCA is telling them.
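To make the fine print concrete, here is a minimal sketch of where labels like “PC 1 15%” come from. It assumes a plain PCA via singular value decomposition on a made-up genotype matrix; the data, the matrix size, and the function name `explained_variance_ratio` are all hypothetical, and this is not the pipeline behind any particular published plot.

```python
import numpy as np

def explained_variance_ratio(genotypes):
    """Fraction of total variance captured by each principal component."""
    centered = genotypes - genotypes.mean(axis=0)
    # Squared singular values of the centered matrix are proportional
    # to the variance along each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

rng = np.random.default_rng(0)
# Hypothetical toy data: 60 individuals x 200 markers coded 0/1/2.
toy = rng.integers(0, 3, size=(60, 200)).astype(float)
ratios = explained_variance_ratio(toy)
# Everything a two-dimensional plot does not show.
remainder = 1.0 - ratios[:2].sum()
print(f"PC 1 {ratios[0]:.1%}, PC 2 {ratios[1]:.1%}, remainder {remainder:.1%}")
```

Whatever the first two percentages turn out to be, the remainder is the portion of variation that a two-axis plot simply leaves out.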

As an illustration, in the 23andMe user interface you can “compare genes” across people who you “share genes” with. This comparison operates over ~550,000 single nucleotide polymorphisms out of 3 billion base pairs (you can constrain it to traits, but I’m going to talk about the comparison to the whole data set below). For example, a man of European descent shares 83.2% with his daughter, who is Eurasian (the mother is Burmese, with some recent Indian admixture). Another man of European descent shares 84% with his daughter, whose mother is also European (in fact, both parents are western European). The “gene sharing” with other people of European descent of these two men is in the 74-75% range (for reference, a Chinese person is 71%, and Nigerian 68.5%). On the PCA plot the European and his Eurasian daughter are very far apart, while the European man and his European daughter cluster together. What you’re seeing on the PCA chart is population level information, not the genetic uniqueness within families and across parents and offspring.

To further explore this issue, I thought it would be interesting to revisit my own genetic data. If you read my previous post, you will know it is not boring. As an ethnic Bengali my ancestry comes from the northeast of the Indian subcontinent, so in addition to the “Asian” fraction which most South Asians have in the 23andMe “ancestry painting” (around 25% on average, with a range from 10-35% probably the extremes within two standard deviations from what I can tell), I likely have some southeast Asian ancestry from Burma. 23andMe has three “reference” populations it uses from the HapMap:

Asian = Chinese/Japanese
European = Northwest European
African = Yoruba

All of us get an ancestry painting which is a combination of these three. Unfortunately unless you’re a relatively straightforward combination of these three groups it isn’t always too informative. So if you’re African American you should be in luck since the two ancestral populations which you derive from are included as reference populations. On the other hand, unadmixed Native Americans tend to be about 25% European and 75% Asian, while unadmixed South Asians are 75% European and 25% Asian. That’s because the allele frequencies in these two populations have some relationship to both the reference groups, even if there hasn’t been any recent admixture (additionally, the painting presumably misses a lot that is distinctive to these groups, though 23andMe has a feature which allows people to explore possible Native American ancestry specifically).
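A toy calculation shows how an unadmixed group can come out as a “mix” of two references. The sketch below assumes a simple least-squares fit of allele frequencies, which is a stand-in chosen for illustration and not 23andMe’s actual ancestry painting method; the frequencies and the function name `mixture_fraction` are hypothetical.

```python
import numpy as np

def mixture_fraction(target, ref_a, ref_b):
    """Least-squares weight w such that target is close to w*ref_a + (1-w)*ref_b."""
    diff = ref_a - ref_b
    return float(np.dot(target - ref_b, diff) / np.dot(diff, diff))

rng = np.random.default_rng(1)
# Hypothetical allele frequencies at 1,000 markers for two reference panels.
european = rng.uniform(0.05, 0.95, size=1000)
asian = rng.uniform(0.05, 0.95, size=1000)
# A toy population with no recent admixture whose frequencies
# simply happen to sit between the two panels.
intermediate = 0.75 * european + 0.25 * asian
w = mixture_fraction(intermediate, european, asian)
print(f"fitted 'European' fraction: {w:.2f}")
```

The fit reports a fraction because the frequencies are intermediate, not because anyone in the toy population has a recent European or Asian ancestor.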

As I told you before my ancestry is 57% European and 43% Asian. This is a very large Asian fraction for a South Asian, and after comparing notes with other South Asian 23andMe customers I’m pretty sure that my large fraction is due to having admixture from Burmese and/or Tibeto-Burman or Austro-Asiatic “Hill Tribes” to the north, south and east of Bengal. Since my family is from the east of Bengal that is not too surprising.

You know from my previous post that on the PCA plot I am near, but outside of, the main South Asian cluster. But there’s some interesting data from the gene comparison feature too. For reasons of privacy I’m obviously not going to give you names, but I will label people by geographical origin if I know that aspect of the individual’s information. Additionally, below the comparison is mostly to Indians, and so I’m going to substitute names of Indian states for those where I have that level of specificity. I also restandardized the gene sharing value, so that the nearest individual with whom I’m sharing is 0, and the furthest on the plot is 1 (74.5% to 73.04% if you’re curious). To add a wrinkle, I’ve added the % Asian calculated from 23andMe’s ancestry painting on the Y axis. The two images below show the results: the first includes some East Asians and a European, while the second includes only South Asians.
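The restandardization is just a flipped min-max rescaling. A minimal sketch, assuming the input is a list of gene sharing percentages; the function name `rescale_sharing` and the middle value in the example are made up, while 74.5 and 73.04 are the two extremes quoted above.

```python
def rescale_sharing(sharing_percentages):
    """Map the highest sharing value to 0 (nearest) and the lowest to 1 (furthest)."""
    highest = max(sharing_percentages)
    lowest = min(sharing_percentages)
    span = highest - lowest
    if span == 0:
        raise ValueError("all sharing values are identical; nothing to rescale")
    return [(highest - value) / span for value in sharing_percentages]

# 74.5 and 73.04 come from the text; 73.77 is a hypothetical person in between.
rescaled = rescale_sharing([74.5, 73.77, 73.04])
print(rescaled)
```

Note how narrow the raw range is: the whole 0-to-1 axis covers less than one and a half percentage points of gene sharing.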

The first image is of more interest. Two points:

1 – Unlike most South Asians I have greater gene sharing identity with East Asians than with Europeans. The South Asian to whom I am closest does not exhibit my own pattern, as they are closer to some Europeans than they are to some Chinese. In contrast, I not only unequivocally share more genes with East Asians than Europeans, but I share more genes with some East Asians than I do with the individual from Iran, one South Asian from the northwest of the subcontinent, and another from southern India. This last pattern is very peculiar from what I’ve been told (the other Bangladeshi has the same tendency, though not to the same extent).

2 – There is a woman from Burma with whom I am sharing genes. Her father, who died when she was young, had Indian ancestry, and reputedly spoke Tamil. She is ~20% European, which would make her father ~40% European. I have not seen a South Indian who is less than 65% European, so I believe that he had native Burmese admixture. If his mother was Burmese that would make his father ~80% European, which I have seen in a few South Indians, though their usual range seems to be 65-75%. Note that I am closer to her than I am to most South Asians. In contrast, the Bangladeshi with whom I am sharing genes, and who has the second highest percentage of Asian in their ancestry, is about as far from this woman as he is from the Punjabis in terms of distance (in contrast, the Punjabis are about 2.5 times further than she is from my own genetic state).
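The back-of-the-envelope arithmetic in the second point can be written out explicitly. This sketch assumes that a child’s painting fraction is the plain average of the two parents’ fractions and that the Burmese mother and paternal grandmother score 0% “European”; both are simplifying assumptions, and the helper name `other_parent_fraction` is hypothetical.

```python
def other_parent_fraction(child_fraction, known_parent_fraction):
    """Solve child = (known_parent + other_parent) / 2 for the other parent."""
    return 2 * child_fraction - known_parent_fraction

# Assumption: the Burmese mother contributes 0% "European".
father = other_parent_fraction(0.20, 0.0)
# Assumption: the father's own mother was Burmese, also 0% "European".
paternal_grandfather = other_parent_fraction(father, 0.0)
print(f"father: {father:.0%}, paternal grandfather: {paternal_grandfather:.0%}")
```

Real inheritance is only approximately half-and-half for any given ancestral component, so these figures are rough in the same way the tildes in the text are.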

If I did the same plot of % Asian with gene sharing for the European man and his Eurasian daughter I would see a pattern whereby for most of the data there would be a noticeable linear pattern: the more Asian, the less gene sharing. The exception would be his daughter, who would be greatly Asian, but would be the closest by this genetic distance measure. Similarly, the Burmese woman with some Indian admixture is an outlier on my plot. The South Asians follow a southeast-to-northwest range of distance from me, with a rough, but not perfect, correspondence with Asian ancestry. Among the South Asians the individual from Bihar is an exception, just as the Burmese woman is. Why? From previous comments I’ve made I have indicated that there is a high probability of recent Burmese ancestry on my paternal lineage (specifically, my paternal grandfather, whose physical appearance is always described as atypical for a Bengali. My paternal grandmother was from a Hindu family which converted, and she looked stereotypically Bengali). Additionally, I know my mother’s maternal grandfather is from the Indian state of Uttar Pradesh, specifically, the region of Delhi. But I also know that before they were Muslim my maternal grandfather’s family were of the Hindu Kayastha caste. The individual from Bihar is a Kayastha, and for those of you who do not know, Bihar is the state just to the west of Bengal. I do not know if the Kayasthas share any deep genetic affinity or not, but I recall that Reich et al. observed a high degree of genetic evidence of endogamy in South Asia. So, just as I believe that I share Burmese-specific genetic variants with the woman of predominant Burmese origin which are not showing up in the simple ancestry estimates based on the global reference populations, I may also share Kayastha-specific variants which result in my genetic closeness to the Bihari individual. But my confidence in the latter conjecture is far weaker than in the former case.

In reviewing all I’ve said so far I suppose the moral of the story is not to trust too deeply in one set of data visualizations or summary statistics. Granted, some people have axes to grind and can find what they want in the science; my posts on Jewish genetics indicate that very strongly. But if you’re genuinely interested in patterns of variation, and your own place within the broader framework, you need to open different windows on the same data to get a truly fully-fleshed out understanding of the nature of things. If you are of an understudied population, and of somewhat mixed background, as I am, tread lightly and carefully. If you are of a well studied and characterized population, then learning you are 100% European is basically worthless (though some of the more detailed PCAs can tell you some things).

