Razib Khan One-stop-shopping for all of my content

September 19, 2018

David Burbridge’s 10 questions for A. W. F. Edwards In 2006

Filed under: A. W. F. Edwards,Population genetics — Razib Khan @ 9:30 pm

A few years ago I watched a documentary about the rise of American-influenced rock music in Britain in the 1960s. At some point, one of the Beatles, probably Paul McCartney, or otherwise Eric Clapton, was quoted as saying that they wanted to introduce Americans to “their famous people.” Though patronizing and probably wrong, what they were talking about is that particular blues musicians who were very influential in some British circles were lingering in obscurity in the United States of America due to racial prejudice. The bigger picture is that there are brilliant people who for whatever reason are not particularly well known to the general public.

This is why I am now periodically “re-upping” interviews with scientists that we’ve done on this weblog over the past 15 years. These are people who should be more famous. But aren’t necessarily.

In 2006 David Burbridge, a contributor to this weblog and a historian of things Galtonian, interviewed the statistical geneticist A. W. F. Edwards. Edwards was one of R. A. Fisher’s last students, so he has a connection to a period of history that is passing us by.

I do want to say that his book, Foundations of Mathematical Genetics, really gave me a lot of insights when I first read it in 2005 and began to be deeply interested in pop gen. It’s dense. But short. Additionally, I have also noticed that there is now a book out which is a collection of Edwards’ papers, with commentaries, Phylogenetic Inference, Selection Theory, and a History of Science. Presumably, it is like W. D. Hamilton’s Narrow Roads of Gene Land series. I wish more eminent researchers would publish these sorts of compilations near the end of their careers.

There have been no edits below (notice the British spelling). But I did add some links!

David’s interview begins after this point:

A. W. F. (Anthony) Edwards is one of Britain’s most distinguished geneticists. He studied genetics at Cambridge as one of the last students of R. A. Fisher, and like Fisher he has contributed actively to both genetics and statistics. In genetics his work includes several influential papers on the reconstruction of phylogenies, and a widely-read recent article on ‘Lewontin’s Fallacy’. In statistics he is known especially for his development and advocacy of the concept of Likelihood as a criterion for scientific inference. He has also made a notable contribution to combinatorial mathematics by finding a method of constructing Venn diagrams for any number of sets. In addition to many scientific papers, he has written four books: Likelihood (1972; expanded edition 1992); Foundations of Mathematical Genetics (1977; 2nd edition 2000); Pascal’s Arithmetical Triangle: the Story of a Mathematical Idea (1987; expanded edition 2002); and Cogwheels of the Mind: the Story of Venn Diagrams (2004). He has written extensively on the history of genetics, mathematics, and statistics, and has co-edited (with H. A. David) Annotated Readings in the History of Statistics (2001), and (with Milo Keynes and Robert Peel) A Century of Mendelism in Human Genetics (2004). He is also a champion glider pilot.

1. You were among the last students of R. A. Fisher. Can you share with us some reminiscences of him?

I first met Fisher in the summer of 1956 and had much contact with him until his death six years later. I saw him last during the Second Human Genetics Conference in Rome in 1961 and subsequently corresponded with him. I have published quite a few of my reminiscences of those times in a number of different places. I have been fortunate in having had a lot of contact with the older generation who knew him better than I did – people like Barnard, Bartlett, Finney, Yates, Race, Ruth Sanger and Bennett – and with members of his family, especially Rose, Harry and Joan (his biographer) amongst his children. Being a fellow of the same Cambridge college (Caius) as Fisher, though not at the same time, has meant daily contact with people who knew him well. But the most important thing is his science, and there everyone can get to know him through his writings, which reveal a mind of extraordinary power and vigour. That is the Fisher whom succeeding generations should learn about and admire.

2. Like Fisher you have worked in both statistics and genetics. How do you see the relationship between them, both in your own work and more generally?

In a sense I have benefitted from being an amateur in both fields so that I see no boundary between them. Though I qualified in genetics the subject almost immediately changed so radically through advances in molecular biology that most geneticists would not now regard me as one of themselves anyway. My generation thought genetics was the study of inheritance; theirs thinks it is the study of genes. As to statistics, I attended eight lectures by Henry Daniels in Cambridge but am otherwise self-taught, being hugely influenced by Fisher’s book Statistical Methods for Research Workers which he told me to buy (and then signed for me). Genetical statistics has changed fundamentally too: our problem was the paucity of data, especially for man, leading to an emphasis on elucidating correct principles of statistical inference. Modern practitioners have too much data and are engaged in a theory-free reduction of it under the neologism ‘bioinformatics’. We had to navigate by the stars; they have GPSs.

3. Much of your early work (some of it in collaboration with L. L. Cavalli-Sforza) was on methods of inferring phylogenies. How do you assess the progress in this field since the 1960s, and how have your own methods stood up to empirical tests?

All my work was in collaboration with Luca Cavalli-Sforza. It was his idea. He hired me to join his group in Pavia in Italy, not specifically to work on phylogenies but to apply the new-fangled computers to human genetics generally. The late delivery of the Olivetti computer was a blessing in disguise because it left us time to talk about what we would do with it when it came. I was initially sceptical because I knew that linkage was statistically difficult and here was Luca proposing what looked like linkage on a tree whose very shape also required estimating!

I think progress on the theoretical side has been incredibly slow, despite the best efforts of Joe Felsenstein, the leading practitioner. In a few months in 1962 and 1963 Luca and I thought up three ways of tackling the problem: least-squares on an additive tree (his), minimum evolution or parsimony (mine) and maximum-likelihood on a stochastic model (very much a joint effort). Forty-odd years on people are still arguing about the relative merits of the descendants of our methods when all along they should have been concentrating on refining the statistical approach through maximum-likelihood, which was our real contribution. Of course, from a practical point of view the computer packages have taken over in a development parallel to that in human genetics, from shakey inferences based on too little data and doubtful logic to computer algorithms trying to digest too much.

4. Your recent article on ‘Lewontin’s Fallacy’ criticises the claim that human geographical races have no biological meaning. As the article itself points out, it could have been written at any time in the last 30 years. So why did it take so long – and have you had any reactions from Lewontin or his supporters?

I can only speak for myself as to why it took me so long. Others closer to the field will have to explain why the penny did not drop earlier, but the principal cause must be the huge gap in communication that exists between anthropology, especially social anthropology, on the one hand, and the humdrum world of population and statistical genetics on the other. When someone like Lewontin bridges the gap, bearing from genetics a message which the other side wants to hear, it spreads fast – on that side. But there was no feedback. Others might have noticed Lewontin’s 1972 paper but I had stopped working in human and population genetics in 1968 on moving to Cambridge because I could not get any support (so I settled down to writing books instead). In the 1990s I began to pick up the message about only 15% of human genetic variation being between, as opposed to within, populations with its non-sequitur that classification was nigh impossible, and started asking my population-genetics colleagues where it came from. Most had not heard of it, and those that had did not know its source. I regret now that in my paper I did not acknowledge the influence of my brother John, Professor of Genetics in Oxford, because he was independently worrying over the question, inventing the phrase ‘the death of phylogeny’ which spurred me on.

Eventually the argument turned up unchallenged in Nature and the New Scientist and I was able to locate its origin. I only started writing about it after lunch one day in Caius during which I had tried to explain the fallacy across the table to a chemist, a physicist, a physiologist and an experimental psychologist – all Fellows of the Royal Society – and found myself faltering. I like to write to clear my mind. Then I met Adam Wilkins, the editor of BioEssays, and he urged me to work my notes up into a paper.

I have had no adverse reaction to it at all, but plenty of plaudits from geneticists, many of whom told me that they too had been perplexed. Perhaps the communication gap is still too large, or just possibly the point has been taken. After all, Fisher made it in 1925 in Statistical Methods which was written for biologists so it is hardly new.
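As a rough illustration of why the "only 15% between populations" figure does not make classification impossible, here is a minimal sketch in Python. It is not the calculation from the BioEssays paper: the allele frequencies (0.3 and 0.7), the assumption of independent loci, the sampling of one allele per locus, and the majority-count rule are all hypothetical choices made for the example. At any single locus about 16% of the variance lies between the two populations, yet the chance of misassigning an individual shrinks quickly as loci are combined.

```python
from math import comb


def between_population_share(freq_a, freq_b):
    """Share of the allelic variance at one locus that lies between
    two equally sized populations with the given allele frequencies."""
    mean = (freq_a + freq_b) / 2
    between = ((freq_a - mean) ** 2 + (freq_b - mean) ** 2) / 2
    total = mean * (1 - mean)
    return between / total


def misclassification_probability(n_loci, freq_a):
    """Chance that an individual from population A is assigned to B.

    Assumptions (all hypothetical): one allele is sampled at each of
    n_loci independent loci; the marker allele has frequency freq_a in A
    and 1 - freq_a in B, with freq_a < 0.5; an individual is assigned to B
    when more than half of its loci carry the marker allele, and exact
    ties are split evenly.
    """
    error = 0.0
    for k in range(n_loci + 1):
        prob_k = comb(n_loci, k) * freq_a ** k * (1 - freq_a) ** (n_loci - k)
        if 2 * k > n_loci:
            error += prob_k
        elif 2 * k == n_loci:
            error += 0.5 * prob_k
    return error


share = between_population_share(0.3, 0.7)
errors = {n: misclassification_probability(n, 0.3) for n in (1, 11, 101)}

print(f"between-population share at one locus: {share:.2f}")
for n, err in errors.items():
    print(f"{n:>3} loci: misclassification probability {err:.6f}")
```

The single-locus error is just the allele frequency itself, so one locus is a poor classifier; the improvement comes entirely from the correlation structure across many loci, which the single-locus variance partition ignores.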

5. You have written several articles about Fisher’s Fundamental Theorem of Natural Selection. Following a groundbreaking reinterpretation by George Price in the early 1970s, it is now generally accepted that the theorem as intended by Fisher is valid, but some biologists would still question its practical use or importance. Can you explain in non-technical terms the meaning of the theorem, how the correct interpretation differs from earlier misunderstandings of it, and your own view on its biological importance?

Oh, it’s very simple. You must first recall the precise name of Fisher’s book in which it is the centrepiece: The Genetical Theory of Natural Selection. He is studying the mechanisms of natural selection from the point of view of populations regarded as aggregates of genes. Of course he knows, and stresses, that this is not the whole story. But to him selection’s defining effect is to change gene frequencies. He sees that this will only happen if there is variability in the survival rates of different genes.

Animal breeders promote artificial selection by imposing different ‘fitnesses’ on their stock according to desirability, breeding from some and not from others. They thus raise the mean value in the population of the character desired. Fisher saw that this process implicitly relies on a correlation between the character and fitness, so that progress will depend both on the magnitude of this correlation and the extent to which the character is genetically determined. What happens, he then asked, if we designate fitness itself as the character, making the correlation perfect? The answer is that the mean fitness of the genes will increase by an amount that depends on the extent to which fitness is determined by them. This is the fundamental theorem (in a modern paraphrase): ‘The rate of increase in the mean fitness ascribable to natural selection acting through changes in gene frequencies is equal to the additive genetic variance in fitness’.

The theorem does not involve the mean genotypic fitness – that is, the weighted mean of the fitnesses of the genotypes – which is where most interpreters of it went wrong. Fisher’s repeated denials that his theorem referred to the mean genotypic fitness, itself immortalised in Sewall Wright’s ‘adaptive landscapes’, went unheeded. In 1941 Fisher even published an example in which gene frequencies were changed under natural selection but the mean genotypic fitness stayed constant. Nobody noticed.

The brilliance of the fundamental theorem is not merely that it expresses the central dogma of natural selection – the connection between genetic variability and selective change – but that it does so exactly. Fisher discovered what the rate of change was proportional to: not to the total variance in fitness of the genotypes but only to that part of it found by fitting a weighted linear regression to the genotypic fitnesses. This is the part accounted for by the regression itself, the so-called additive genetic variance. Animal breeders know it as the variance of the breeding values of the genotypes. The fundamental theorem disregards the way the genes are distributed through the population, which will depend on the amount of heterosis in fitness, the extent of assortative mating, and similar possibly transient effects. What matters to it are the changes to the mean fitness brought about by changing gene frequencies.
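The weighted regression described above can be sketched for the simplest case. The following Python snippet assumes one locus with two alleles, random mating (Hardy-Weinberg genotype frequencies), and made-up genotypic fitnesses; none of these numbers come from the interview. It regresses genotypic fitness on the number of copies of one allele, weighting by genotype frequency, and reports the part of the variance accounted for by the regression (the additive genetic variance) alongside the total.

```python
def variance_components(p, fitnesses):
    """Return (slope, additive_variance, total_variance) for one locus.

    p          -- frequency of allele A, strictly between 0 and 1
    fitnesses  -- (w_aa, w_Aa, w_AA), indexed by the number of A alleles
    Genotype frequencies are assumed to be in Hardy-Weinberg proportions.
    """
    q = 1.0 - p
    freqs = [q * q, 2 * p * q, p * p]
    counts = [0, 1, 2]  # number of A alleles in each genotype

    mean_w = sum(f * w for f, w in zip(freqs, fitnesses))
    mean_x = sum(f * x for f, x in zip(freqs, counts))

    var_x = sum(f * (x - mean_x) ** 2 for f, x in zip(freqs, counts))
    cov_xw = sum(
        f * (x - mean_x) * (w - mean_w)
        for f, x, w in zip(freqs, counts, fitnesses)
    )

    # Weighted least-squares slope of fitness on allele count.
    slope = cov_xw / var_x
    # Variance accounted for by the regression line.
    additive = slope ** 2 * var_x
    total = sum(f * (w - mean_w) ** 2 for f, w in zip(freqs, fitnesses))
    return slope, additive, total


# Hypothetical fitnesses with partial dominance.
slope_1, additive_1, total_1 = variance_components(0.2, (0.8, 0.95, 1.0))

# Hypothetical symmetric heterosis at the balanced frequency p = 0.5.
slope_2, additive_2, total_2 = variance_components(0.5, (0.8, 1.0, 0.8))

print(f"partial dominance: additive {additive_1:.6f} of total {total_1:.6f}")
print(f"heterosis, p=0.5:  additive {additive_2:.6f} of total {total_2:.6f}")
```

In the heterosis case the genotypes still differ in fitness, so the total variance is positive, but the regression slope is zero and so is the additive variance. That matches the point that only the additive part is tied to changes in gene frequency.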

This, then, is the theorem whose ‘practical use or importance’ ‘some biologists would still question’. Let them ask the animal breeders if it is any ‘use’, and let them ask themselves whether they think Darwin’s theory of evolution by natural selection is of any ‘importance’. If they do, then the fundamental theorem should help them to a deeper, Mendelian, understanding of it. If, however, they hanker after a theory that can make evolutionary predictions, like Wright’s adaptive landscapes were thought to do at one time, they are crying for the moon. Possession of the fundamental theorem will no more enable you to predict the flow of evolution than possession of Newton’s law of gravitation will enable you to predict the time of high tide at London Bridge.

It should not be forgotten, however, that shorn of its genetical complexities the theorem does have predictive power, just as the law of gravitation does when applied to the celestial movements that underlie the tides. ‘In a subdivided population the rate of change of the overall growth-rate is proportional to the variance in growth rates’. The ‘populations’ could be economic sectors, for example, or even one’s own savings accounts.
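The savings-account version of the statement can be checked numerically. This is a minimal Python sketch with hypothetical balances and interest rates, assuming each account grows exponentially at its own fixed rate. The overall growth rate is the balance-weighted mean of the individual rates, and its rate of change (estimated here by a small finite difference) is compared with the balance-weighted variance of the rates. Under these continuous-time assumptions the constant of proportionality is one.

```python
from math import exp


def sizes_at(t, initial_sizes, rates):
    """Size of each subpopulation (or account) at time t."""
    return [n0 * exp(r * t) for n0, r in zip(initial_sizes, rates)]


def overall_growth_rate(sizes, rates):
    """Size-weighted mean of the individual growth rates."""
    total = sum(sizes)
    return sum(n * r for n, r in zip(sizes, rates)) / total


def variance_in_growth_rates(sizes, rates):
    """Size-weighted variance of the individual growth rates."""
    total = sum(sizes)
    mean = overall_growth_rate(sizes, rates)
    return sum(n * (r - mean) ** 2 for n, r in zip(sizes, rates)) / total


# Hypothetical accounts: starting balances and continuous growth rates.
initial_sizes = [1000.0, 500.0, 250.0]
rates = [0.01, 0.03, 0.05]

t, dt = 2.0, 1e-5
before = overall_growth_rate(sizes_at(t - dt, initial_sizes, rates), rates)
after = overall_growth_rate(sizes_at(t + dt, initial_sizes, rates), rates)

# Central finite-difference estimate of d(overall growth rate)/dt.
rate_of_change = (after - before) / (2 * dt)
variance = variance_in_growth_rates(sizes_at(t, initial_sizes, rates), rates)

print(f"rate of change of overall growth rate: {rate_of_change:.8f}")
print(f"variance in growth rates:              {variance:.8f}")
```

The faster-growing accounts take up an ever larger share of the total, which is what drags the overall rate upward; if every account grew at the same rate, the variance and the change would both be zero.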

6. Your career since the 1950s spans the period in which computers, and off-the-shelf programs, have become widely available. Has this been an unmixed blessing, and do you think the development of statistics or genetics would have been very different if computers had been available in, say, 1900?

A mixed blessing of course, because the existence of programs hinders the development of the underlying theory. This is particularly true in statistics where, despite assertions to the contrary by Bayesians, the underlying theory is still a matter for discussion. The phenomenon can be seen in the field of phylogenetic trees, where programs based on different methods proliferate.

1900 is a peculiarly well-chosen date on which to hang the question. Not only was it the year in which Mendel’s results became widely known but it was also the year of the publication of the second edition of Karl Pearson’s The Grammar of Science, which included chapters on biological science for the first time. The Grammar of Science was hugely influential in its day, proclaiming that the function of science was ‘not to explain, but to describe by conceptual shorthand our perceptual experience’. ‘The man who classifies facts of any kind whatever, who sees their mutual relation and describes their sequences, is applying the scientific method’. The computer implementation of this sterile philosophy would have had a devastating effect, particularly on the development of statistical theory and the acceptance of Mendelism. All Pearson’s formidable energy would have been devoted to amassing vast quantities of information to be sifted for correlations. William Bateson’s 1894 six-hundred-page Materials for the Study of Variation treated with especial regard to Discontinuity in the Origin of Species would have been digitally scanned and computer programmers urged to uncover its secrets. It doesn’t bear thinking about!

7. In statistics you are especially known for developing and advocating the concept of Likelihood and its use in scientific inference. Can you explain how Likelihood differs from probability, and why Likelihood methods are useful in evaluating hypotheses?

Likelihood compares statistical hypotheses; it has nothing to say about a hypothesis on its own, like a test of significance does. Imagine two statistical hypotheses, each of which predicts the probabilities of all the possible outcomes of an experiment – which need be no more complex than tossing a biassed coin a number of times and counting the heads. The experiment is performed, the heads counted. Given this count, was the probability of heads p1 (the first hypothesis) or p2 (the second hypothesis)?

Now imagine doing the experiment lots of times assuming the first, and then the second, hypothesis. Would you not prefer the hypothesis that had the shorter expected waiting time until the exact number of heads observed turned up? If so, you have just chosen the one with the greater likelihood. The likelihood of a hypothesis is proportional to the probability of the data given the hypothesis. Meaningless for a hypothesis by itself because of the undefined constant of proportionality, with two hypotheses to be compared on the same data this constant is irrelevant, and the ratio of their likelihoods (or the difference in their log-likelihoods) becomes a measure of the support for one hypothesis versus the other.

Likelihoods therefore derive from probabilities, but unlike the latter are not additive. Whereas you can sum the probabilities of two possible outcomes of an experiment to form the probability of ‘either one or the other’, you cannot do the same for the likelihood of two hypotheses; ‘either one hypothesis or the other’ is not in itself a hypothesis enabling the probabilities of outcomes to be computed, so no likelihood for it is defined. But you can graph the likelihood as a function of p and pay special attention to its maximum, the maximum-likelihood estimate of the probability of heads.
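The coin-tossing comparison above can be made concrete with a short Python sketch. The data (7 heads in 10 tosses) and the two hypotheses (p1 = 0.5, p2 = 0.7) are invented for illustration, and the grid search is just one simple way to locate the maximum. The snippet computes the probability of the observed count under each hypothesis, the expected number of repetitions of the experiment before that exact count turns up (the reciprocal of the probability), the likelihood ratio and its logarithm, and a maximum-likelihood estimate of p.

```python
from math import comb, log


def binomial_probability(heads, tosses, p):
    """Probability of exactly `heads` heads in `tosses` tosses when P(heads) = p."""
    return comb(tosses, heads) * p ** heads * (1 - p) ** (tosses - heads)


def max_likelihood_estimate(heads, tosses, grid_size=1000):
    """Grid point in (0, 1) at which the likelihood of p is largest."""
    grid = [i / grid_size for i in range(1, grid_size)]
    return max(grid, key=lambda p: binomial_probability(heads, tosses, p))


# Hypothetical data and hypotheses.
heads, tosses = 7, 10
p1, p2 = 0.5, 0.7

prob_1 = binomial_probability(heads, tosses, p1)
prob_2 = binomial_probability(heads, tosses, p2)

# Expected number of repetitions until exactly `heads` heads is seen.
expected_wait_1 = 1 / prob_1
expected_wait_2 = 1 / prob_2

# The unknown constant of proportionality cancels in the ratio.
likelihood_ratio = prob_2 / prob_1
support = log(prob_2) - log(prob_1)

mle = max_likelihood_estimate(heads, tosses)

print(f"expected wait under p1: {expected_wait_1:.2f} experiments")
print(f"expected wait under p2: {expected_wait_2:.2f} experiments")
print(f"likelihood ratio (p2 versus p1): {likelihood_ratio:.3f}")
print(f"support (log-likelihood difference): {support:.3f}")
print(f"maximum-likelihood estimate of p: {mle:.3f}")
```

The hypothesis with the shorter expected wait is the one with the greater likelihood, and the grid maximum lands on the observed proportion of heads.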

The concept of the likelihood function is fundamental to all approaches to statistical inference, whether Bayesian, Neyman-Pearson, or Fisherian. Not everyone agrees that it is meaningful standing alone by itself, but I (and others before me) believe it is. Doubters can always fall back on the above ‘how long to wait’ argument, which I think was due to David Sprott.

8. You have written extensively on the history of genetics, statistics, and mathematics. Apart from the intrinsic interest of historical studies, how important do you think a knowledge of the history of science is for practising scientists?

I find it essential, and cannot imagine doing science without it. Much of what counts as science nowadays is rather theory-free. We don’t really have a word for it. Sequencing the human genome, for example, is a marvellous achievement relying on technical advances of great ingenuity but it did not require historical understanding. It differs intellectually from, say, the associated activity of trying to estimate linkage values between gene loci. The history of the latter, on which I have written recently, is an essential part of the study of the problem, and much modern work suffers from its neglect.

Celebrating the centenary of the publication of The Origin of Species in 1959, Fisher said: ‘More attention to the History of Science is needed, as much by scientists as by historians, and especially by biologists, and this should mean a deliberate attempt to understand the thoughts of the great masters of the past, to see in what circumstances or intellectual milieu their ideas were formed, where they took the wrong turning or stopped short on the right track.’

I agree.

9. R. A. Fisher was a keen eugenist. What are your own views on the role (if any) of eugenics in the modern world?

Fisher’s world was so different from ours, in three ways in particular. Then (say the period between the wars) nation-states were much more independent of each other so that it was possible to discuss population matters for Britain in relative isolation; secondly, it was a time of concern about the possibility of a declining home population; and thirdly many scientists were in the first flush of enthusiasm for the application of Mendelian principles – so recently elucidated – to man. None of this is true today.

For myself, though I was once a grateful holder of a Darwin Research Fellowship of the Eugenics Society (now the Galton Institute), since boyhood I have been more concerned about the quantity of people on earth rather than their quality. In the early 1960s I was a founder-member of a body called, I think, the Conservation Society, which does not seem to exist today. Its main platform was that too large a population would be unsustainable. At the time there was much discussion about over-population which was seen as one of the greatest dangers facing mankind. Interestingly, the worse the problem gets, the less it is discussed. Yet the mounting dangers we face, such as the possibility of global warming, are all exacerbated by too high a world population, given its enthusiasm for motor-cars, aeroplanes, and environmentally-damaging activity generally. It seems that people fear the charge of racism if they comment on population growth – they intuitively understand Fisher’s fundamental theorem.

10. Like yourself, your brother, J. H. Edwards, is also a distinguished geneticist. Nature, nurture, or sibling rivalry?

Well, certainly not sibling rivalry. It is true that we have been sufficiently alike at some stages of our lives to have been mistaken for each other. At the Rome Conference of Human Genetics in 1961 we were in a lift with the Swedish geneticist Jan Lindsten when he engagingly introduced us to another participant as ‘the two most confused brothers in genetics’.

In fact I am 7 1/2 years younger than John, and due to mother’s illness, father’s war service, and wartime privations generally, I hardly encountered him until the end of the war when I was ten. We developed boyhood enthusiasms for science quite independently, he for biology, me for astronomy. But there was a common factor in our education from thirteen to eighteen. We both attended Uppingham School, though of course not at the same time, and were exceptionally well-taught in science and mathematics, in some cases by the same teachers. I cannot stress this influence too strongly. Since I only went to Uppingham because my elder brother did, is that nature or nurture?

Subsequently John’s main influence was when I was learning about likelihood (see the preface to my book Likelihood). He, being medically qualified, keeps me straight on medical matters and I try to keep him straight on things statistical. I deliberately stayed off linkage theory so as not to get too close to his interests. John was more influenced by Lancelot Hogben and J. B. S. Haldane than I was. There is a wonderful letter from Fisher to R. R. Race in 1960 in which he refers to me as ‘my Edwards from Cambridge’ and to John as ‘only one of Hogben’s [pupils]’, so at least Fisher got us straight.

And John introduced me to gliding. Though not exactly a ‘champion’, to use your word, I have enjoyed fifty years gliding and hope for a few more yet.

September 5, 2018

My interview of James F. Crow in 2006

Filed under: Jim Crow,Population genetics — Razib Khan @ 3:13 pm

Since the death of L. L. Cavalli-Sforza I’ve been thinking about the great scientists who have passed on. Last fall, I mentioned that Mel Green had died. There was a marginal personal connection there. I had the privilege to talk to Green at length about sundry issues, often nonscientific. He was someone who had been doing science so long he had talked to Charles Davenport in the flesh (he was not complimentary of Davenport’s understanding of Mendelian principles). It was like engaging with a history book!

A few months before I emailed Cavalli-Sforza, I had sent a message on a lark to James F. Crow. It was really a rather random thing, I never thought that Crow would respond. But in fact he emailed me right back! And he answered 10 questions from me, as you can see below the fold. The truth is I probably wouldn’t have thought to try and get in touch with Cavalli-Sforza if it hadn’t been so easy with Crow.

If you are involved in population genetics you know who Crow is. No introduction needed. Some of the people he supervised, such as Joe Felsenstein, have gone on to transform evolutionary biology in their own turn.

Born in 1916, Crow’s scientific career spanned the emergence of population genetics as a mature field, to the discovery of the importance of DNA, to molecular evolution & genomics. He had a long collaboration with Motoo Kimura, the Japanese geneticist instrumental in pushing forward the development of “neutral theory.”

He died in 2012.

Below are the questions I asked 12 years ago. My interests have changed somewhat, so it’s interesting to see what I was curious about back then. And of course fascinating to read Crow’s responses.

Originally posted June 26, 2006

James F. Crow is Professor Emeritus of Genetics at the University of Wisconsin. A collaborator with Motoo Kimura on Neutral Theory, he remains an active member of the evolutionary genetics community.

1) In 2002 in “Perspective: Here’s to Fisher, additive genetic variance, and the fundamental theorem of natural selection,” you conclude, “is there any other quantity that captures so much evolutionary meaning in such a simple way?” in reference to additive genetic variance. And yet, what about other factors like statistical epistasis? Do gene-gene interactions pack enough of an evolutionary punch to be anything more than a footnote in God’s Book? Have you seen Loren Rieseberg’s work at Indiana which points to the importance of loci of large effect?

The remarkable thing about additive genetic variance is that it predicts the effect of selection, even in the presence of dominance and epistasis. Nature seems to follow least-squares principles. The result is that the additive component of variance pulls out of dominance and epistatic variance those components associated with allele frequency change under selection. Of course the theory is not exact, but it is a very good first approximation. Fisher did not ignore epistasis, as some have said; rather he showed how selection can utilize epistatic (and dominance) components of variance.

On a more technical level, Kimura showed that under selection with loose linkage the population rather soon attains a state in which the linkage-disequilibrium variance approximately cancels the epistatic variance. Thus, under this circumstance the effects of selection are better predicted by ignoring additive by additive epistatic variance than by including it. See my book with Kimura (1970, p. 217 ff).

I am aware of Rieseberg’s work on sunflowers. QTL mapping and various other molecular methods are indeed finding alleles with large effect in many species. It is inevitable that the first genes discovered will be those with largest effect, so I expect alleles with smaller effects to follow. How large a part genes with large effect have played in evolution is still up in the air, as far as I know. But they are getting more emphasis now than in the recent past.

2) R.A. Fisher is reputed to have aimed for an “ideal gas law” of evolutionary genetics (The Fundamental Theorem of Natural Selection?). In the paper above you state that you expect “mathematical theory” to become more “general and rigorous.” How near are we to an “ideal gas law” for evolutionary genetics which takes the step beyond a qualitative heuristic, if such a thing is possible?

It is not surprising that Fisher, who was trained in classical physics, would use physical analogies. Various mathematical geneticists, such as Tom Nagylaki of the University of Chicago, have found more general and accurate expressions, and I expect this to continue. I don’t expect evolution to imitate classical physics in such things as an ideal gas law. For example, Fisher’s analogizing fitness with entropy is better regarded as a metaphor than as rigorous science.

3) Computational methods have come to the fore within the past generation as an alternative to analytic modes for attacking theoretical problems. Do you believe this has been wholly a good thing, and if not, can you elaborate?

Yes, I think it is a good thing. Many problems in population genetics cannot be solved by a mathematician, no matter how gifted. Although I expect improvements in the mathematical theory, it is already clear that computer methods are very powerful. This is good. It also permits people with limited mathematical knowledge to work on important problems; but I don’t expect it to entirely replace mathematical theory.

4) The 1966 Lewontin and Hubby allozyme papers reported a great deal more polymorphism than either the followers of Wright or Fisher expected (i.e., Balance School and Classical School). The work with Neutral Theory and its successors stepped into the theoretical breach. In hindsight, does it seem that Neutral Theory was plausible a priori, or did the evolutionary geneticists of the pre-DNA era simply miss the possibility (and ubiquity) of neutral substitutions because they did not have a good mental model of variation on the molecular level?

The amount of variability disclosed by Lewontin and Hubby was more than some expected, although it did not seem particularly surprising to me. It is important to say, as Lewontin was the first to articulate, that the difference between the classical and balance schools does not lie in the amount of variability (variability is an observable and not a theoretical quantity). Rather the difference in the two schools was the way in which variability was thought to be maintained: mainly by mutation-selection balance or mainly by heterosis.

I think neutral variability came as a surprise to almost everybody. Of course, it was an outgrowth of molecular methodology, which made possible the study of DNA itself rather than phenotypic traits. I don’t think it was the absence of a mental model as much as not knowing in advance the enormous number of nucleotides in the genome, and how little of the DNA, especially in mammals, is protein-coding.

5) Do you believe that group selection (i.e., inter-demic selection) might have played a significant role in the evolution of H. sapiens sapiens?

I’m sure it did, for our ancestors for many years had a tribal existence with competition, even wars, between groups. I suspect that group structure may be responsible for much altruistic behavior. In a small group everyone is related, so behaving cooperatively or altruistically toward members of a group is the genetic equivalent of kin-selection. Muller and others emphasized this idea. There is a level of relatedness in a group at which the welfare of the group prevails over the welfare of individuals. Egbert Leigh quantified this as did Aoki and I. 

6) When your commentary on Arthur Jensen’s infamous Harvard Educational Review article on the inheritance of IQ and racial differences was published in 1969, did you have any inkling that the issues raised by Jensen would remain largely unresolved over thirty-five years later? What kind of evidence do you think would decide these issues one way or the other?

I did not expect the issues to be resolved soon, for there were no new methods that promised to be more informative. Of course, the structure of DNA had been discovered, but the powerful methods now available had not yet been developed. I think further identification of individual genes, usually by molecular methods, and a combination of statistical and molecular methods are pointing the way toward a solution. I don’t expect racial differences to be either entirely genetic or entirely environmental, but of course I don’t know the relative amount; it is likely to be different for different traits and different human groups.

7) In your recent review of “Genes in Conflict” you state in reference to Robert Trivers’ papers published in the 1970s that, “They were ignored by most social scientists, who were reluctant to consider natural selection as a cause of human behavioral traits, and they were bitterly attacked by Marxists for reasons of doctrine.” Recently the University of Chicago evolutionary genomicist Bruce Lahn has come under fire (as profiled in The Wall Street Journal, June 16th edition) for his study of ASPM, a locus implicated in brain development, from both geneticists and non-geneticists because of the sensitivity of the possibility of intergroup variation due to differential evolutionary forces within the past 40,000 years. Last year the paper put forward by Gregory Cochran, Henry Harpending and Jason Hardy that argued high Ashkenazi IQ was due to recent natural selection also ignited a firestorm. It seems that we are entering a new era of human genetics as a great deal of data will soon be available for theorists to analyze (e.g. the HapMap and its successors). Are “controversial” questions still going to be off limits, or will the science compel the political and cultural taboos to step aside?

I hope that such questions can be approached with the same objectivity as that when we study inheritance of bristle number in Drosophila, but I don’t expect it soon. There are too many strongly held opinions. I thought Lahn had a clever idea in thinking that the normal alleles of head-reducing mutants might be responsible for evolution of larger heads in human ancestry. Likewise, I think that Cochran et al. are fully entitled to consider the reasons for Jewish intelligence and I found their arguments interesting. In my view it is wrong to say that research in this area — assuming it is well done — is out of order. I feel strongly that we should not discourage a line of research because someone might not like a possible outcome. 

8) If a budding evolutionary thinker had to read one book or paper that excluded Charles Darwin’s body of work, what would you recommend?

I would recommend Fisher’s “Genetical Theory of Natural Selection”. But the reader should be prepared to find it tough going. Fisher’s elegant obscurity has left many of us baffled, but entranced. Your “budding thinker” might want to stop before the last four chapters, which are more dated than the rest of the book. And by all means, read the 1999 variorum edition. Its appendices explain many of the book’s obscurities.

9) You’ve defended “bean bag genetics” (Nature, 2001). Lynn Margulis has complimented you personally, but seems to dismiss the whole endeavor of theoretical evolutionary biology as trivial and irrelevant when set next to the concrete realities of molecular and cell biology. Over the past generation molecular biology has dethroned physics as the “Queen of Sciences” in regards to prestige, and many young biologists seemed to take the work of Fisher, Wright, Haldane, Kimura and yourself for granted and do not concern themselves with the abstract “big picture” when mechanistic details on the DNA scale needed to be elucidated. Do you believe that over the next generation more young people will begin to look once more at evolutionary biology in its grandest abstract reaches as the “low hanging fruit” in molecular biology is exhausted?

Lynn Margulis is a long-time personal friend and has done important work on the origin of cellular organelles, but I disagree with her on this issue. It is true that the elegant theory of Fisher, Wright, Haldane, Kimura, and Malécot was less useful than might have been expected, because of lack of good data to which the theory was applicable. But that is no longer true. Molecular evolution has provided an abundance of data and the theory now has plenty of important applications. In particular, the neutral theory of molecular evolution has had great heuristic and predictive value, and it owes a great deal to Kimura’s earlier theoretical work, which built on the foundations of the pioneers. Lynn might change her mind if she looked at some of the striking results gotten by combining molecular measurements with population genetics theory. Maybe I should ask her!

10) If you had to have one last glass of beer, and your drinking partner was going to be either Fisher, Wright or Haldane, who would you choose, and why?

I would choose Haldane, for his uninhibited willingness to speculate, his enormous erudition, his interest in almost everything, his irreverence, his wit, and his enjoyment of conversation. I am told that much of the good biology in Huxley’s “Brave New World” is the result of his drinking partnership with Haldane.

 

July 26, 2018

Local ancestry deconvolution made simpler (?)

Filed under: Local ancestry,Population genetics — Razib Khan @ 11:37 pm

I’ve been waiting for a local ancestry deconvolution method to come out of Simon Myers’ group for a few years. Well, I think we’re there, Fine-scale Inference of Ancestry Segments without Prior Knowledge of Admixing Groups. Here’s the abstract:

We present an algorithm for inferring ancestry segments and characterizing admixture events, which involve an arbitrary number of genetically differentiated groups coming together. This allows inference of the demographic history of the species, properties of admixing groups, identification of signatures of natural selection, and may aid disease gene mapping. The algorithm employs nested hidden Markov models to obtain local ancestry estimation along the genome for each admixed individual. In a range of simulations, the accuracy of these estimates equals or exceeds leading existing methods that return local ancestry. Moreover, and unlike these approaches, we do not require any prior knowledge of the relationship between sub-groups of donor reference haplotypes and the unseen mixing ancestral populations. Instead, our approach infers these in terms of conditional “copying probabilities”. In application to the Human Genome Diversity Panel we corroborate many previously inferred admixture events (e.g. an ancient admixture event in the Kalash). We further identify novel events such as complex 4-way admixture in San-Khomani individuals, and show that Eastern European populations possess 1-5% ancestry from a group resembling modern-day central Asians. We also identify evidence of recent natural selection favouring sub-Saharan ancestry at the HLA region, across North African individuals. We make available an R and C++ software library, which we term MOSAIC (which stands for MOSAIC Organises Segments of Ancestry In Chromosomes).

The truth is I’ve only done a quick skim of the preprint and not run the method myself to see how it works. But to be honest I can’t see where the part about Eastern Europeans is in the manuscript (I checked the supporting text)? That being said, if you run a PCA many Northern and most Eastern Europeans are clearly shifted toward East Asians compared to Southern Europeans. So I accept it.

In any case, always remember, all models are wrong. But some of them have insight.

July 13, 2018

Tutorial to run supervised admixture analyses

Filed under: Admixture,Data Analysis,Population genetics — Razib Khan @ 11:03 pm
ID Dai Gujrati Lithuanians Sardinian Tamil
razib_23andMe 0.14 0.26 0.02 0.00 0.58
razib_ancestry 0.14 0.26 0.02 0.00 0.58
razib_ftdna 0.14 0.26 0.02 0.00 0.57
razib_daughter 0.05 0.14 0.29 0.18 0.34
razib_son 0.07 0.17 0.28 0.19 0.30
razib_son_2 0.06 0.19 0.29 0.19 0.27
razib_wife 0.00 0.07 0.55 0.38 0.00

This is a follow-up to my earlier post, Tutorial To Run PCA, Admixture, Treemix And Pairwise Fst In One Command. Hopefully you’ll be able to run supervised admixture analysis with less hassle after reading this.

The above results are from a supervised admixture analysis of my family and myself. The fact that there are three replicates of me is due to the fact that I converted my 23andMe, Ancestry, and Family Tree DNA raw data into plink files. Notice that the results are broadly consistent. This emphasizes that discrepancies between DTC companies in their results are due to their analytic pipeline, not because of data quality.

The results for my family are not surprising. I’m ~14% “Dai”, reflecting East Asian admixture into Bengalis. My wife is ~0% “Dai”. My children are somewhere in between. At such a low fraction you expect some variance in the F1.
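As a rough sanity check on “somewhere in between”, the naive expectation for a child is simply the average of the two parents’ fractions. The sketch below uses the razib_23andMe, razib_wife and razib_daughter rows from the table above; the function and variable names are made up for illustration.

```python
# Naive mid-parent expectation: a child's expected ancestry fraction
# is the mean of the two parents' fractions for each component.
COMPONENTS = ["Dai", "Gujrati", "Lithuanians", "Sardinian", "Tamil"]

# Rows copied from the table above.
father = [0.14, 0.26, 0.02, 0.00, 0.58]    # razib_23andMe
mother = [0.00, 0.07, 0.55, 0.38, 0.00]    # razib_wife
daughter = [0.05, 0.14, 0.29, 0.18, 0.34]  # razib_daughter


def midparent(p1, p2):
    """Average two parents' ancestry fractions component by component."""
    return [(a + b) / 2 for a, b in zip(p1, p2)]


expected = midparent(father, mother)
for name, exp, obs in zip(COMPONENTS, expected, daughter):
    print(f"{name:12s} expected {exp:.3f}  observed {obs:.2f}")
```

Real children deviate from the mid-parent value because each one inherits a random half of each parent’s genome, which is the variance mentioned above.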

Now below are results for three Swedes with the same reference panel:

Group ID Dai Gujrati Lithuanians Sardinian Tamil
Sweden Sweden17 0.00 0.09 0.63 0.28 0.00
Sweden Sweden18 0.00 0.08 0.62 0.31 0.00
Sweden Sweden20 0.00 0.05 0.72 0.23 0.00

All these were run on supervised admixture frameworks where I used Dai, Gujrati, Lithuanians, Sardinians, and Tamils, as the reference “ancestral” populations. Another way to think about it is: taking the genetic variation of these input groups, what fractions does a given test focal individual shake out at?

The commands are rather simple. For my family:
bash rawFile_To_Supervised_Results.sh TestScript

For the Swedes:
bash supervisedTest.sh Sweden TestScript

The commands need to be run in a folder: ancestry_supervised/.

You can download the zip file.

Here is what the scripts do in two different situations. Imagine you have raw genotype files downloaded from 23andMe, Ancestry, and Family Tree DNA.

Download the files as usual. Rename them in an intelligible way, because the file names are going to be used for generating IDs. So above, I renamed them “razib_23andMe.txt” and such. Leave the extensions as they are. Obviously, you need to make sure they are not compressed. Then place them all in ancestry_supervised/RAWINPUT.

The script looks for the files in there. You don’t need to specify names, it will find them. In plink the family ID and individual ID will be taken from the text before the extension in the file name. Output files will also have the file name.

Aside from the raw genotype files, you need to determine a reference file. In REFERENCEFILES/ you see the binary pedigree/plink file Est1000HGDP. This is the same file as in the earlier post. It would be crazy to run supervised admixture on the dozens of populations in this file. You need to create a subset.

For the above I did this:
grep -E "Dai|Guj|Lithua|Sardi|Tamil" Est1000HGDP.fam > ../keep.keep

Then:
./plink --bfile REFERENCEFILES/Est1000HGDP --keep keep.keep --make-bed --out REFERENCEFILES/TestScript

When the script runs, it converts the raw genotype files into plink files and puts them in INDIVPLINKFILES/. Then it takes each plink file and uses it as a test against the reference population file. That file has a prefix on group/family IDs of the form AA_Ref_. This is essential for the script to understand that this is a reference population. The .pop files are automatically generated, and the script sets the correct K by counting the unique population numbers.
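To make the role of the prefix concrete, here is a minimal sketch of how labels for a supervised run could be derived from a .fam file. It is a guess at the logic, not the code in the script: ADMIXTURE’s supervised mode reads a .pop file with one line per individual, giving a population label for reference individuals and “-” for individuals to be estimated. The sample IDs and the function name below are placeholders.

```python
REF_PREFIX = "AA_Ref_"  # prefix the post says marks reference populations

# Toy .fam rows; the reference individual IDs are invented placeholders.
FAM_LINES = [
    "AA_Ref_Dai dai_placeholder1 0 0 1 -9",
    "AA_Ref_Dai dai_placeholder2 0 0 2 -9",
    "AA_Ref_Tamil tamil_placeholder1 0 0 1 -9",
    "razib_23andMe razib_23andMe 0 0 1 -9",
]


def pop_labels(fam_lines, ref_prefix=REF_PREFIX):
    """One label per individual: the population name for reference rows,
    '-' for individuals whose ancestry is to be estimated."""
    labels = []
    for line in fam_lines:
        family_id = line.split()[0]
        if family_id.startswith(ref_prefix):
            labels.append(family_id[len(ref_prefix):])
        else:
            labels.append("-")
    return labels


labels = pop_labels(FAM_LINES)
k = len(set(labels) - {"-"})  # K = number of distinct reference populations
print("\n".join(labels))
print("K =", k)
```

This also shows why a test individual that still carries the prefix would be treated as a reference sample rather than estimated.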

The admixture is going to be slow. I recommend you modify runadmixture.pl by adding the number-of-cores parameter (e.g., -j4 for four cores) so it can go multi-threaded.

When the script is done it will put the results in RESULTFILES/. They will be .csv files with strange names (they will have the original filename you provided, but there are timestamps in there so that if you run the files with a different reference and such it won’t overwrite everything). Each individual is run separately and has a separate output file (a .csv).

But this is not always convenient. Sometimes you want to test a larger batch of individuals. Perhaps you want to use the reference file I provided? For the Swedes I did this:
grep "Swede" REFERENCEFILES/Est1000HGDP.fam > keep.keep

Then:
./plink --bfile REFERENCEFILES/Est1000HGDP --keep keep.keep --make-bed --out INDIVPLINKFILES/Sweden

Please note the folder. There are modifications you can make, but the script assumes that the test files are in INDIVPLINKFILES/. The next part is important. The Swedish individuals will have AA_Ref_ prepended on each row since you got them out of Est1000HGDP. You need to remove this. If you don’t remove it, it won’t work. In my case, I modified the file using the vim editor:
vim Sweden.fam

You can do it with a text editor too. It doesn’t matter. Though it has to be the .fam file.
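If you would rather not edit by hand, the same change can be scripted. The following is a minimal sketch, assuming the prefix sits at the very start of each row; the sex and phenotype columns in the toy rows are placeholders, and the function name is made up.

```python
REF_PREFIX = "AA_Ref_"


def strip_ref_prefix(fam_lines, prefix=REF_PREFIX):
    """Remove the reference prefix from the first column of each .fam row."""
    cleaned = []
    for line in fam_lines:
        if line.startswith(prefix):
            line = line[len(prefix):]
        cleaned.append(line)
    return cleaned


# Toy rows using the Swedish IDs from the results table above.
rows = [
    "AA_Ref_Sweden Sweden17 0 0 0 -9",
    "AA_Ref_Sweden Sweden18 0 0 0 -9",
]
print("\n".join(strip_ref_prefix(rows)))
```

To apply it for real, read the lines of INDIVPLINKFILES/Sweden.fam, pass them through the function, and write them back. Only the .fam file carries the sample IDs, so the .bed and .bim files stay as they are.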

After the script is done, it will put the .csv file in RESULTFILES/. It will be a single .csv with multiple rows. Each individual is tested separately though, so what the script does is append each result to the file. If you have 100 individuals, it will take a long time. You may want to look in the .csv file as the individuals are being added to make sure it looks right.

The convenience of these scripts is that they do some merging/flipping/cleaning for you. And they format the output so you don’t have to.

I originally developed these scripts on a Mac, but to get them to work on Ubuntu I made a few small modifications. I don’t know if they still work on a Mac, but you should be able to make the modifications if not. Remember, for a Mac you will need the Mac versions of plink and admixture.

For supervised analysis, the reference populations need to make sense and be coherent. Please check the earlier tutorial and use the PCA functions to remove outliers.

July 12, 2018

Running your own analyses

Filed under: Genetics,Population genetics,Scripts — Razib Khan @ 8:25 am

For the technically inclined people here: Tutorial To Run PCA, Admixture, Treemix And Pairwise Fst In One Command.

July 11, 2018

Tutorial to run PCA, Admixture, Treemix and pairwise Fst in one command

Filed under: Admixture,data,Fst,PCA,PLINK,Population genetics,TreeMix — Razib Khan @ 11:50 pm


Today on Twitter I stated that “if the average person knew how to run PCA with plink and visualize with R they wouldn’t need to ask me anything.” What I meant by this is that the average person often asks me “Razib, is population X closer to population Y than Z?” To answer this sort of question I dig through my datasets and run a few exploratory analyses, and get back to them.

I’ve been meaning to write up and distribute a “quickstart” for a while to help people do their own analyses. So here I go.

The audience of this post is probably two-fold:

  1. “Trainees” who are starting graduate school and want to dig in quickly into empirical data sets while they’re really getting a handle on things. This tutorial will probably suffice for a week. You should quickly move on to three population and four population tests, and Eigensoft and AdmixTools, as well as fineStructure.
  2. The larger audience is technically oriented readers who are not, and never will be, geneticists professionally. 

What do you need? First, you need to be able to work in a Linux or Mac environment. I work both in Ubuntu and on a Mac, but this tutorial and these scripts were tested on Ubuntu. They should work OK on a Mac, but there may need to be some modifications on the bash scripts and such.

Assuming you have a Linux environment, you need to download this zip or tar.xz file. Once you open this file it should decompress into a folder, ancestry/.

There are a bunch of files in there. Some of them are scripts I wrote. Some of them are output files that aren’t cleaned up. Some of them are packages that you’ve heard of. Of the latter:

  • admixture
  • plink
  • treemix

You can find these online too, though these versions should work out of the box on Ubuntu. If you have a Mac, you need the Mac versions. Just put the Mac versions into the folder ancestry/, replacing the ones there. You may need some libraries installed on Ubuntu too if you recompile yourself. Check the errors and make search engines your friends.

You will need to install R (or R Studio). If you are running Mac or Ubuntu on the command line you know how to get R. If not, Google it.

I also put some data in the file. In particular, a plink set of files Est1000HGDP. These are merged from the Estonian Biocentre, HGDP, and 1000 Genomes. There are 4,899 individuals in the data, with 135,000 high quality SNPs (very low missingness).

If you look in the “family” file you will see an important part of the structure. So do:

less Est1000HGDP.fam

You’ll see something like this:
Abhkasians abh154 0 0 1 -9
Abhkasians abh165 0 0 1 -9
Abkhazian abkhazian1_1m 0 0 2 -9
Abkhazian abkhazian5_1m 0 0 1 -9
Abkhazian abkhazian6_1m 0 0 1 -9
AfricanBarbados HG01879 0 0 0 -9
AfricanBarbados HG01880 0 0 0 -9

There are 4,899 rows corresponding to each individual. I have used the first column to label the ethnic/group identity. The second column is the individual ID. You can ignore the last 4 columns.
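Before choosing a subset it helps to see which group labels exist and how many individuals each has. Here is a small sketch that tallies the first column; the function name is made up, and the sample rows are the ones shown above.

```python
from collections import Counter

# A few rows in PLINK .fam layout, copied from the listing above:
# family/group ID, individual ID, father, mother, sex, phenotype.
SAMPLE_FAM = """\
Abhkasians abh154 0 0 1 -9
Abhkasians abh165 0 0 1 -9
Abkhazian abkhazian1_1m 0 0 2 -9
Abkhazian abkhazian5_1m 0 0 1 -9
Abkhazian abkhazian6_1m 0 0 1 -9
AfricanBarbados HG01879 0 0 0 -9
AfricanBarbados HG01880 0 0 0 -9
"""


def group_counts(fam_text):
    """Count individuals per group label (first column of a .fam file)."""
    counts = Counter()
    for line in fam_text.splitlines():
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


for group, n in sorted(group_counts(SAMPLE_FAM).items()):
    print(group, n)
```

To run it on the full dataset, pass in the contents of Est1000HGDP.fam instead of the sample string.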

There is no way you want to analyze all the different ethnic groups. Usually, you want to look at a few. For that, you can use lots of commands, but what you need is a subset of the rows above. The grep command matches and returns rows with particular patterns. It’s handy. Let’s say I want just Yoruba, British (who are in the group GreatBritain), Gujarati, Han Chinese, and Druze. The command below will work (note that Han matches HanBeijing, Han_S, Han_N, etc.). The -E flag is what makes grep treat | as “or”.

grep -E "Yoruba|Great|Guj|Han|Druze" Est1000HGDP.fam > keep.txt

The file keep.txt has the individuals you want. Now you put it through plink to generate a new file:

./plink --bfile Est1000HGDP --keep keep.txt --make-bed --out EstSubset

This new file has only 634 individuals. That’s more manageable. But more important is that there are far fewer groups for visualization and analysis.
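The selection step can be sketched in Python as well, which makes it easier to see exactly what gets matched. The Druze and GreatBritain IDs below appear later in this post; the other IDs are placeholders invented for the example, as is the function name.

```python
import re

# Toy .fam rows. Only the Druze and GreatBritain IDs are real;
# the rest are placeholders.
FAM_LINES = [
    "Druze HGDP00603 0 0 1 -9",
    "GreatBritain HG00096 0 0 1 -9",
    "HanBeijing han_placeholder1 0 0 2 -9",
    "Han_S han_placeholder2 0 0 1 -9",
    "Lithuanians lit_placeholder1 0 0 1 -9",
]

# Same alternation as the grep pattern above.
PATTERN = re.compile(r"Yoruba|Great|Guj|Han|Druze")


def keep_rows(lines, pattern):
    """Return the rows whose text matches the pattern anywhere, like grep."""
    return [line for line in lines if pattern.search(line)]


kept = keep_rows(FAM_LINES, PATTERN)
print("\n".join(kept))
```

Like grep, this matches anywhere on the line, so a pattern such as Han would also pick up a row whose individual ID happens to contain “Han”. Anchoring each alternative to the start of the line (for example ^Han) restricts the match to the group column.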

As for that analysis, I have a Perl script with a bash script within it (and some system commands). Here is what they do:

1) they perform PCA to 10 dimensions
2) then they run admixture on the number of K clusters you want (unsupervised), and generate a .csv file you can look at
3) then I wrote a script to do pairwise Fst between populations, and output the data into a text file
4) finally, I create the input file necessary for the treemix package and then run treemix with the number of migrations you want

There are lots of parameters and specifications for these packages. You don’t get those unless you edit the scripts or make them more extensible (I have versions that are more flexible but I think newbies will just get confused so I’m keeping it simple).

Assuming I create the plink file above, running the following command means that admixture does K = 2 and treemix does 1 migration edge (that is, -m 1). The PCA and pairwise Fst run automatically.

perl pairwise.perl EstSubset 2 1

Just walk away from your box for a while. The admixture will take the longest. If you want to speed it up, figure out how many cores you have, and edit the file makecluster.sh, go to line 16 where you see admixture. If you have 4 cores, then type -j4 as a parameter. It will speed admixture up and hog all your cores.

There is a .csv that has the admixture output, EstSubset.admix.csv. If you open it you see something like this:
Druze HGDP00603 0.550210 0.449790
Druze HGDP00604 0.569070 0.430930
Druze HGDP00605 0.562854 0.437146
Druze HGDP00606 0.555205 0.444795
GreatBritain HG00096 0.598871 0.401129
GreatBritain HG00097 0.590040 0.409960
GreatBritain HG00099 0.592654 0.407346
GreatBritain HG00100 0.590847 0.409153

Column 1 will always be the group, column 2 the individual, and all subsequent columns will be the K’s. Since K = 2, there are two columns. Space separated. You should be able to open the .csv or process it however you want to process it.
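Since the layout is fixed, summarizing the file takes only a few lines. Here is a minimal sketch that averages each K column within each group, using the rows shown above; the function name is made up.

```python
from collections import defaultdict

# Rows copied from the EstSubset.admix.csv excerpt above (space separated).
ADMIX_ROWS = """\
Druze HGDP00603 0.550210 0.449790
Druze HGDP00604 0.569070 0.430930
Druze HGDP00605 0.562854 0.437146
Druze HGDP00606 0.555205 0.444795
GreatBritain HG00096 0.598871 0.401129
GreatBritain HG00097 0.590040 0.409960
GreatBritain HG00099 0.592654 0.407346
GreatBritain HG00100 0.590847 0.409153
"""


def group_means(text):
    """Average each K column within each group (column 1)."""
    sums = {}
    counts = defaultdict(int)
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        group = fields[0]
        values = [float(x) for x in fields[2:]]  # skip group and individual ID
        if group not in sums:
            sums[group] = [0.0] * len(values)
        sums[group] = [s + v for s, v in zip(sums[group], values)]
        counts[group] += 1
    return {g: [s / counts[g] for s in sums[g]] for g in sums}


means = group_means(ADMIX_ROWS)
for group, cols in means.items():
    print(group, " ".join(f"{c:.3f}" for c in cols))
```

Because the columns are read from position 3 onward, the same function works unchanged for larger K.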

You’ll also see two other files: plink.eigenval and plink.eigenvec. These are generic output files for the PCA. The .eigenvec file has the individuals along with the values for each PC. The .eigenval file shows the magnitude of the dimension. It looks like this:
68.7974
38.4125
7.16859
3.3837
2.05858
1.85725
1.73196
1.63946
1.56449
1.53666

Basically, this means that PC 1 explains nearly twice as much of the variance as PC 2 (68.8 vs. 38.4). Beyond PC 4 it looks like they’re really bunched together. You can open up this file as a .csv and visualize it however you like. But I gave you an R script. It’s RPCA.R.
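To put numbers on that, here is a short sketch using the ten eigenvalues listed above. Note the assumption: the shares are relative to the sum of these ten values only, not to the total variance in the data, so they overstate how much each PC explains overall.

```python
# Eigenvalues copied from the plink.eigenval listing above.
EIGENVALUES = [68.7974, 38.4125, 7.16859, 3.3837, 2.05858,
               1.85725, 1.73196, 1.63946, 1.56449, 1.53666]


def shares(eigenvalues):
    """Each eigenvalue as a fraction of the sum of the listed eigenvalues."""
    total = sum(eigenvalues)
    return [v / total for v in eigenvalues]


ratio = EIGENVALUES[0] / EIGENVALUES[1]
print(f"PC1 / PC2 = {ratio:.2f}")
for i, share in enumerate(shares(EIGENVALUES), start=1):
    print(f"PC{i}: {share:.1%} of the top-10 total")
```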

You need to install some packages. First, open R or R studio. If you want to go command line at the terminal, type R. Then type:
install.packages("ggplot2")
install.packages("reshape2")
install.packages("plyr")
install.packages("ape")
install.packages("igraph")

Once those packages are installed you can use the script:
source("RPCA.R")

Then, to generate the plot at the top of this post:
plinkPCA()

There are some useful parameters in this function. The plot to the left adds some shape labels to highlight two populations. A third population I label by individual ID. The latter is important if you want to do outlier pruning, since there are mislabels, or just plain outlier individuals, in a lot of data (including in this). I also zoomed in.

Here’s how I did that:
plinkPCA(subVec = c("Druze","GreatBritain"),labelPlot = c("Lithuanians"),xLim=c(-0.01,0.0125),yLim=c(0.05,0.062))

To look at stuff besides PC 1 and PC 2 you can do plinkPCA(PC=c("PC3","PC6")).

I put the PCA function in the script, but to remove individuals you will want to run the PCA manually:

./plink --bfile EstSubset --pca 10

You can remove individuals manually by creating a remove file. What I like to do though is something like this:
grep "randomID27 " EstSubset.fam >> remove.txt

The double greater-than sign (>>) appends to the remove.txt file, so you can add individuals in the terminal in one window while running PCA and visualizing with R in the other (Eigensoft has an automatic outlier removal feature). Once you have the individuals you want to remove, then:

./plink --bfile EstSubset --remove remove.txt --make-bed --out EstSubset
./plink --bfile EstSubset --pca 10

Then visualize!
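If eyeballing the plot gets tedious, a crude first pass can be scripted. The sketch below is not what my scripts or Eigensoft do; it is one hypothetical heuristic: flag anyone whose PC1 or PC2 value sits more than a chosen number of median absolute deviations (MADs) from their own group’s median. The rows mimic the plink .eigenvec layout (group ID, individual ID, then the PCs), and every value in them is invented.

```python
from collections import defaultdict
from statistics import median

# Toy rows in plink .eigenvec layout; all values are invented and only
# the first two PCs are shown.
EIGENVEC_ROWS = [
    "Druze id1 0.010 0.020",
    "Druze id2 0.011 0.021",
    "Druze id3 0.009 0.019",
    "Druze id4 0.010 0.022",
    "Druze randomID27 0.050 0.020",
]


def flag_outliers(rows, n_pcs=2, threshold=5.0):
    """Return 'group id' strings for individuals whose value on any of the
    first n_pcs PCs lies more than `threshold` MADs from the group median."""
    by_group = defaultdict(list)
    for row in rows:
        fields = row.split()
        pcs = [float(x) for x in fields[2:2 + n_pcs]]
        by_group[fields[0]].append((fields[1], pcs))

    flagged = []
    for group, members in by_group.items():
        for pc in range(n_pcs):
            values = [pcs[pc] for _, pcs in members]
            center = median(values)
            mad = median(abs(v - center) for v in values)
            if mad == 0:
                continue  # no spread to measure against
            for ind, pcs in members:
                entry = f"{group} {ind}"
                if abs(pcs[pc] - center) / mad > threshold and entry not in flagged:
                    flagged.append(entry)
    return flagged


print("\n".join(flag_outliers(EIGENVEC_ROWS)))
```

Each output line has the group ID and individual ID, which is the form a plink remove file expects. Treat the result as a list of candidates to inspect on the plot, not as a final verdict.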

To make use of the pairwise Fst you need the fst.R script. If everything is set up right, all you need to do is type:
source("fst.R")

It will load the file and generate the tree. You can modify the script so you have an unrooted tree too.

The R script is what generates the FstMatrix.csv file, which has the matrix you know and love.

So now you have the PCA, Fst and admixture. What else? Well, there’s treemix.

I set the number of SNPs for the blocks to be 1000. So -k 1000. As well as global rearrangement. You can change the details in the perl script itself. Look to the bottom. I think the main utility of my script is that it generates the input files. The treemix package isn’t hard to run once you have those input files.

Also, as you know treemix comes with R plotting functions. So run treemix with however many migration edges (you can have 0), and then when the script is done, load R.

Then:
>source("src/plotting_funcs.R")
>plot_tree("TreeMix")

But actually, you don’t need to do the above. I added a script to generate a .png file with the treemix plot in pairwise.perl. It’s called TreeMix.TreeMix.Tree.png.

OK, so that’s it.

To review:

Download zip or tar.xz file. Decompress. All the packages and scripts should be in there, along with a pretty big dataset of modern populations. If you are on a non-Mac Linux you are good to go. If you are on a Mac, you need the Mac versions of admixture, plink, and treemix. I’m going to warn you compiling treemix can be kind of a pain. I’ve done it on Linux and Mac machines, and gotten it to work, but sometimes it took time.

You need R and/or R Studio (or something like R Studio). Make sure to install the packages or the scripts for visualizing results from PCA and pairwise Fst won’t work.*

There is already a .csv output from admixture. The PCA also generates expected output files. You may want to sort, so open it in a spreadsheet.

This is potentially just the start. But if you are a layperson with a nagging question and can’t wait for me, this should get you where you need to go!

* I wrote a lot of these things piecemeal and often a long time ago. It may be that not all the packages are even used. Don’t bother to tell me.

June 21, 2018

The fault in our parameters

Filed under: Population genetics — Razib Khan @ 10:49 pm

Of the books I own, Elements of Evolutionary Genetics is one I consult frequently because of its range and comprehensiveness. The authors, Brian Charlesworth and Deborah Charlesworth, have an encyclopedic knowledge of the literature. To truly understand the evolutionary process in all its texture and nuance it is important to absorb a fair amount of theory, and Elements of Evolutionary Genetics does do that (though it’s not as abstruse as something like An Introduction to Population Genetics Theory).

When I see a paper by one of the Charlesworths, I try to read it. Not because I have a love of Drosophila or Daphnia, but because to develop strong population genetics intuitions it always helps to stand on the shoulders of giants. So with that, I pass on this preprint, Mutational load, inbreeding depression and heterosis in subdivided populations:

This paper examines the extent to which empirical estimates of inbreeding depression and inter-population heterosis in subdivided populations, as well as the effects of local population size on mean fitness, can be explained in terms of estimates of mutation rates, and the distribution of selection coefficients against deleterious mutations provided by population genomics data. Using results from population genetics models, numerical predictions of the genetic load, inbreeding depression and heterosis were obtained for a broad range of selection coefficients and mutation rates. The models allowed for the possibility of very high mutation rates per nucleotide site, as is sometimes observed for epiallelic mutations. There was fairly good quantitative agreement between the theoretical predictions and empirical estimates of heterosis and the effects of population size on genetic load, on the assumption that the deleterious mutation rate per individual per generation is approximately one, but there was less good agreement for inbreeding depression. Weak selection, of the order of magnitude suggested by population genomic analyses, is required to explain the observed patterns. Possible caveats concerning the applicability of the models are discussed.

June 19, 2018

Burmese are a bit Bengali

Filed under: Burma,Population genetics — Razib Khan @ 12:48 am

About ten years ago I read the book The River of Lost Footsteps: Histories of Burma. Though I have read books where Burma figures prominently (e.g., Strange Parallels), this is the only history of Burma I have read. The author is Burmese, and provides something much more than a travelogue, as might have been the case if he were of Western background. By chance over the past month or so I’ve been in contact with the author, who made a few inquiries as to the genetics of his own family (he came with genotypes in hand). But this brought us to the issue of the genetics of the Burmese people, and their position in the historical-genetic landscape.

The author of The River of Lost Footsteps reminded me of something that’s curious about Southeast Asia: its Indic influences tend to be from the south of the subcontinent. In particular, the native scripts derive from a South Indian parent. Could genetics confirm this connection as well? Also, could genetics give some insights as to the timing of admixture/gene-flow?

In theory, yes.

I had a lot of Southeast Asian datasets to play with, and did a lot of pruning to remove outliers (e.g., people with obvious recent Chinese ancestry). First, comparing them to Bangladeshis, it seems that even without local ancestry tract analysis Burmese and Malays have more varied, and so likely more recent, exogenous ancestry than Bangladeshis. At least this is evident on the PCA plot, where these two groups exhibit strong admixture clines toward South Asians.

But what about the question of Southeast Asian affinities? This needs deeper analysis. Three-population tests, which measure admixture with outgroups when compared to a dyad of populations which are modeled as a clade, can be informative.

Outgroup Pop1 Pop2 f3 z
Bangladeshi Telugu Cambodians -0.00183999 -46.3322
Bangladeshi Telugu Han -0.00220121 -46.046
Burma Telugu Han -0.00406071 -51.0018
Burma Han Bangladeshi -0.00348186 -49.1398
Burma Han Punjabi_ANI_2 -0.00418193 -47.2351
Cambodians Telugu Viet -0.00126923 -16.91
Cambodians Punjabi_ANI_2 Viet -0.00129881 -15.6039
Cambodians Bangladeshi Viet -0.000970022 -14.5642
Malay Igorot Telugu -0.00249795 -18.758
Malay Igorot Bangladeshi -0.00223454 -18.5212
Malay Igorot Punjabi_ANI_2 -0.00250732 -18.3027
Malay Igorot Cambodians -0.00107817 -16.6214
Viet Han Cambodians -0.000569337 -13.1139
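For readers who have not met the statistic, the f3 column can be sketched as follows: for the test population in the first column (allele frequency c) and the two candidate sources (frequencies a and b), f3 is the average over SNPs of (c - a)(c - b). It goes negative when the test population’s frequencies tend to fall between those of the two sources, which is what admixture produces. The code below is a naive version with invented allele frequencies; real implementations such as AdmixTools add a finite-sample correction and obtain the z score from a block jackknife, both omitted here.

```python
def f3(target, source1, source2):
    """Naive f3: mean over SNPs of (c - a) * (c - b), with c the allele
    frequency in the target and a, b those in the two sources."""
    products = [(c - a) * (c - b) for c, a, b in zip(target, source1, source2)]
    return sum(products) / len(products)


# Invented allele frequencies at four SNPs for two source populations.
pop_a = [0.10, 0.80, 0.30, 0.60]
pop_b = [0.70, 0.20, 0.90, 0.10]
# A target built as an exact 50/50 mixture of the two sources.
mixed = [(a + b) / 2 for a, b in zip(pop_a, pop_b)]

print(f3(mixed, pop_a, pop_b))  # negative: the target lies between the sources
print(f3(pop_a, mixed, pop_b))  # positive: pop_a is not intermediate
```

The more negative the value relative to its standard error, the stronger the signal, which is why the z column matters as much as f3 itself.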

Bangladeshis show strong signatures with both Cambodians and Han. This is in accordance with earlier analysis which suggests Austro-Asiatic and Tibeto-Burman contributions to the “East Asian” element of Bengali ancestry. The Burmese always have Han ancestry, with a South Asian donor as well. This aligns with other PCA analysis which shows the Burmese samples skewed toward Han Chinese. Burma is a compound of different ethnic groups. Some are Austro-Asiatic. The Bamar, the core “Burman” group, have some affinities to Tibetans. And the Shan are a Tai people who are relatively late arrivals.

Cambodians have a weaker admixture signature and are paired with a South Asian group and their geographic neighbors the Vietnamese. The Malays are similar to Cambodians but have the Igorot people from the Philippines as one of their donors. And finally, not surprisingly the Vietnamese show some mixture between Han-like and Cambodian-like ancestors.

Further PCA analysis shows that while Cambodians and Malays tend to skew somewhat neutrally to South Asians (the recent Indian migration to Malaysia is mostly Tamil), the Burmese are shifted toward Bangladeshis:

Click to enlarge

Finally, I ran some admixture analyses.

First, I partitioned the samples with an unsupervised set of runs (K = 4 and K = 5). In this way I obtained reified reference groups as follows:

“Austronesians” (Igorot tribesmen from the Philippines)
“Austro-Asiatic” (a subset of Cambodians with the least exogenous admixture)
“North Indians” (Punjabis)
“South Indians” (A subset of middle-caste Telugus highest on the modal element in South Indians)
“Han” (a proxy for “northern” East Asian)

The results are mostly as you’d expect. In line with three-population tests, the Vietnamese are Han and Austro-Asiatic. More of the former than latter. There is a minor Austronesian component. Notice there is no South Asian ancestry in this group.

In contrast, Cambodians have low levels of both North and South Indian. These out-of-sample Cambodians are still highly modal for Austro-Asiatic though.

Malays are more Austro-Asiatic than Austronesian, which might surprise. But the Igorot samples are highly drifted and distinct. I think these runs are underestimating Austronesian in the Malays. Notice that some of the Malays have South Asian ancestry, but a substantial number do not. This large range in admixture is what you see in PCA as well. I think this strongly points to the fact that Malays have been receiving gene-flow from India recently, as it is not well mixed into the population.

The Bangladeshi outgroup is mostly a mix of North and South Indian, with a slight bias toward the latter. No surprise. As I suggested earlier you can see that the Bangladeshi samples are hard to model as just a mix of Burmese with South Asians. The Austro-Asiatic component is higher in them than the Burmese. This could be because Burma had recent waves of northern migration (true), and, eastern India prior to the Indo-Aryan expansion was mostly inhabited by Austro-Asiatic Munda (probably true). That being said, the earlier analysis suggested that the Munda cannot be the sole source of East Asian ancestry in Bengalis.

Finally, every single Burmese sample has South Asian ancestry. Much higher than Cambodians. And, there is variance.  I think that leads us to the likely conclusion that Burma has been subject to continuous gene-flow as well as recent pulses of admixture from South Asia. The variation in South Asian ancestry in the Burmese is greater than East Asian ancestry in Bengalis. I believe this is due to more recent admixture in Burmese due to British colonial Indian settlement in that country.

The cultural and historical context of this discussion is the nature of South Asian, Indic, influence, on Southeast Asia. One cannot deny that there has been some gene-flow between Southeast Asia and South Asia. In prehistoric times it seems that Austro-Asiatic languages moved from mainland Southeast Asia to India. More recently there are historically attested, and genetically confirmed, instances of colonial Indian migration. But, the evidence from Cambodia suggests that this is likely also ancient, as unlike Malaysia or Burma, Cambodia did not have any major flow of Indian migrants during the colonial period. One could posit that perhaps the Cambodian Indian affinity is a function of “Ancestral South Indian.” But the Cambodians are not skewed toward ASI-enriched groups in particular. And, I know for a fact that appreciable frequencies of R1a1a exist within the male Khmer population (this lineage is common in South Asia, especially the north and upper castes).

As far as Burma goes, I think an older period of South Indian cultural influence, and some gene-flow seems likely. But, with the expansion of Bengali settlement to the east over the past 2,000 years, more recent South Asian ancestry is probably enriched for that ethnolinguistic group.

I’m going to try and follow-up with some ancestry tract analysis….

May 30, 2018

Soft & hard selection vs. soft & hard sweeps

Filed under: Population genetics,Quantitative Genetics — Razib Khan @ 12:09 am


When I was talking to Matt Hahn I made a pretty stupid semantic flub, confusing “soft selection” with “soft sweeps.” Matt pointed out that soft/hard selection were terms more appropriate to quantitative genetics rather than population genomics. His viewpoint is defensible, though going back into the literature on soft/hard selection, e.g., Soft and hard selection revisited, the main thinkers pushing the idea were population geneticists who were also considering ecological questions.*

The strange thing is that I had already known the definitions of hard and soft selection on some level because I had read about them as I was getting confused with hard and soft sweeps! But this was more than ten years ago now, and since then I haven’t given the matter enough thought obviously, as I defaulted back to confusing the two classes of terms, just as I used to.

Matt pointed out that truncation selection is a form of hard selection. All individuals below (or above) a certain phenotype value have a fitness of zero, as they don’t reproduce. In a single locus context, hard selection would involve deleterious lethal alleles, whose impact on the genotype was the same irrespective of ecological context. So hard selection operates by reducing the fitness of individuals/genotypes to zero.

For soft selection, context matters much more, and you would focus more on relative fitness differences across individuals/genotypes. Some definitions of soft vs. hard selection emphasize that in the former case fitness is defined relative to the local ecological patch, while the latter is a universal estimate. Soft selection does not necessarily operate through the zero fitness value for a genotype, but rather differential fitness. Hard selection can crash your population size. Soft selection does not necessarily do that.
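To make the contrast concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not a model from the literature: two hypothetical patches, a single phenotype value standing in for fitness, and made-up function names (hard_selection, soft_selection) and parameter values. Hard selection applies one universal truncation threshold; soft selection lets each patch contribute a fixed number of offspring, with parents drawn by fitness relative to their patch-mates.

```python
import random

def hard_selection(patches, threshold):
    """Truncation selection: anyone below the universal threshold has
    fitness zero, no matter who else lives in the same patch."""
    return [[z for z in patch if z >= threshold] for patch in patches]

def soft_selection(patches, offspring_per_patch, rng):
    """Each patch contributes a fixed number of offspring. Parents are drawn
    (with replacement) in proportion to phenotype relative to patch-mates."""
    next_generation = []
    for patch in patches:
        parents = rng.choices(patch, weights=patch, k=offspring_per_patch)
        next_generation.append(parents)
    return next_generation

rng = random.Random(1)
# Hypothetical phenotypes: patch 0 is a rich environment, patch 1 a poor one.
patches = [
    [rng.uniform(8.0, 12.0) for _ in range(50)],
    [rng.uniform(3.0, 7.0) for _ in range(50)],
]

hard = hard_selection(patches, threshold=9.0)
soft = soft_selection(patches, offspring_per_patch=25, rng=rng)

print("breeders left after hard selection:", [len(p) for p in hard])
print("offspring after soft selection:   ", [len(p) for p in soft])
```

Under the hard threshold the poor patch is wiped out entirely, which is the population-crash scenario. Under soft selection the poor patch still fills its quota, because its members only compete with each other.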

Though I won’t outline the details, one of the originators of the soft/hard selection concept analogized them to density-dependent/independent dynamics in ecology. If you know the ecological models, the correspondence probably is obvious to you.

As for hard and soft sweeps, these are particular terms of relevance to genomics, because genome-wide data has allowed for their detection through the impact they have on the variation in the genome. A “sweep” is a strong selective event that tends to sweep away variation around the focus of selection. A hard sweep begins with a single mutant, and positive selection tends to drive it toward fixation.

A classical example is lactase persistence in Northern Europeans and Northwest South Asians (e.g., Punjabis). The mutation in the LCT gene is the same across a huge swath of Eurasia. And, the region of the genome around it is also the same, because regions of the genome adjacent to that single mutation increased in frequency as well (they “hitchhiked”). This produces a genetic block of highly reduced diversity since the hard selective sweep increases the frequency of so many variants which are associated with the advantageous one, and may drive to extinction most other competitive variants.

Someone is free to correct me in the comments, but it strikes me that many hard selective sweeps are driven by soft selection. Fitness differentials between those with the advantageous alleles and those without it are not so extreme, and obviously context dependent, even in cases of hard sweeps on a single locus.

The key to understanding soft sweeps is that there isn’t a focus on a singular mutation. Rather, selection can target multiple mutations, which may have the same genetic position, but be embedded within different original gene copies. In fact, soft sweeps often operate on standing variation, preexistent alleles which were segregating in the population at low frequencies or were totally neutral. Genetic signatures of these events are less striking than those for hard sweeps because there is far less diminishment of diversity, since it’s not the increase in the frequency of a singular mutation and the hitchhiking of its associated flanking genomic region.
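The difference in leftover diversity can be sketched with a toy haploid Wright-Fisher simulation in Python. Everything here is an assumption for illustration: the function names, the population size, the selection coefficient, and above all the absence of recombination and new mutation. Each individual gets its own haplotype background, and the favoured allele starts on either one background (hard sweep) or several (soft sweep from standing variation).

```python
import random

def sweep(n, founders, s, rng):
    """Haploid Wright-Fisher model, no recombination, no new mutation.

    `founders` copies of the favoured allele start on distinct backgrounds.
    Returns the set of backgrounds present at fixation, or None if the
    allele is lost to drift.
    """
    # Each individual is (background_id, carries_favoured_allele).
    pop = [(i, i < founders) for i in range(n)]
    while True:
        carriers = sum(1 for _, carries in pop if carries)
        if carriers == 0:
            return None
        if carriers == n:
            return {background for background, _ in pop}
        weights = [1.0 + s if carries else 1.0 for _, carries in pop]
        pop = rng.choices(pop, weights=weights, k=n)

def sweep_until_fixed(n, founders, s, rng):
    """Repeat the simulation until the favoured allele actually fixes."""
    while True:
        result = sweep(n, founders, s, rng)
        if result is not None:
            return result

rng = random.Random(42)
hard = sweep_until_fixed(n=200, founders=1, s=0.5, rng=rng)
soft = sweep_until_fixed(n=200, founders=10, s=0.5, rng=rng)

print("backgrounds left after hard sweep:", len(hard))
print("backgrounds left after soft sweep:", len(soft))
```

With a single founder only one background can survive to fixation. With ten founders anywhere from one to ten can, which is why the signature is weaker. In real genomes recombination also erodes the hitchhiking block over time, which this sketch ignores.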

Soft sweeps can clearly occur with soft selection. But truncation selection can occur on polygenic traits, so depending on the architecture of the trait (i.e., effect size distribution across the loci) one can imagine them associated with hard selection as well.

Going back to the conversation I had with Matt the reason semantics is important is that terms in population genetics are informationally rich, and lead you down a rabbit-hole of inferences. If population genetics is a toolkit for decomposing reality, then you need to have your tools well categorized and organized. On occasion it is important to rectify the names.

* There are two somewhat related definitions of soft/hard selection. I’ll follow Wallace’s original line here, though I’m not sure they differ that much.

May 21, 2018

The mutation accumulation controversy continues….

Filed under: Mutation,Population genetics — Razib Khan @ 10:56 pm

Every few years I check to see if the great mutation accumulation controversy has resolved itself. I don’t know if anyone calls it that, but that’s what I think of it as. There are two major issues that matter here: mutation rates are a critical parameter in evolutionary models, and, mutation accumulation over time matters for parental age effects when it comes to disease (speaking as an older father!).

In the latter case, I’m talking about the reasons that people freeze their eggs or sperm. In the former case, I’m talking about whether we can easily extrapolate mutation rates over evolutionary time as semi-fixed, so we can infer dates of last common ancestry and such. To give a concrete example of what I’m talking about, if mutation rates varied a lot over the evolutionary history of our hominin lineage, then we might need to rethink some of the inferred timings.

Today two preprints came out on mutation accumulation. First, Overlooked roles of DNA damage and maternal age in generating human germline mutations. Second, Reproductive longevity predicts mutation rates in primates. What a coincidence in synchronicity!

Additionally, the last author on the second preprint, Matt Hahn, is someone I’ll be doing a podcast with this week. So aside from talking about neutral theory, and his book Molecular Population Genetics, I’m going to have to bring up this mutation business.

The figure above from the first preprint shows that the proportion of mutations derived from the father doesn’t increase over time, contrary to what textbooks generally state. Why would we expect an increase? Sperm keeps replicating after puberty so you should be gaining more mutations. In contrast, the eggs are arrested in meiosis. There are various mechanistic reasons that the authors of the first preprint give for why the ratio does not change between paternal and maternal mutations (e.g., non-replicative mutations seem to be the primary one). The authors are using a very “pedigree” strategy, rather than an “evolutionary” one. They’re looking at sequenced trios, and noticing patterns. I think in the near future they’ll be far more sure of what’s going on because they’ll have bigger sample sizes. They admit the effects are subtle (also, some of the p-values are getting close to 0.05).

Instead of focusing on a human pedigree, the second preprint does some sequencing on owl monkeys (I had no idea there were “owl monkeys” before this paper). They find that the mutation rate is ~32% lower in owl monkeys than in humans. Why is this?

The plot to the left shows that mutations increase with age across species (though the number of data points is pretty small). The authors contend that:

The association between mutation rates and reproductive longevity implies that changes in life history traits rather than changes to the mutational machinery are responsible for the evolution of these rates. Species that have evolved greater reproductive longevity will have a higher mutation rate per generation without any underlying change to the replication, repair, or proofreading proteins.

If I read this right: owl monkeys reproduce fast and don’t have as much reproductive longevity. Ergo, lower mutation rates (less mutational build-up from paternal side).
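The logic can be written down as a back-of-the-envelope linear model. This is my own sketch, not the model fitted in the preprint, and every number in it is invented: a parent hands on some baseline count of mutations accrued by puberty, plus a constant increment for every reproductive year after that. The two hypothetical species share identical "machinery" (same baseline, same yearly increment) and differ only in life history.

```python
def mutations_per_generation(age_at_reproduction, puberty_age,
                             baseline, per_year_after_puberty):
    """De novo mutations passed on by one parent under a linear model:
    a fixed number by puberty plus a constant yearly increment afterwards."""
    reproductive_years = max(0.0, age_at_reproduction - puberty_age)
    return baseline + per_year_after_puberty * reproductive_years

# Hypothetical values only; the same "machinery" for both species.
BASELINE = 10.0
PER_YEAR = 2.0

# Species A reproduces soon after puberty; species B much later.
species_a = mutations_per_generation(6.0, 1.0, BASELINE, PER_YEAR)
species_b = mutations_per_generation(30.0, 13.0, BASELINE, PER_YEAR)

print("species A:", species_a)
print("species B:", species_b)
```

The per-generation rate differs more than twofold here even though nothing about replication or repair has changed, which is the shape of the argument in the quoted passage.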

After all these years I’m still not convinced about anything. I assume that eventually bigger data sets will come online and we’ll resolve this. Someone has to be right!

(not too many people on Twitter get what’s going on either)

May 8, 2018

The peoples of the Maghreb have some Pleistocene roots

Filed under: Human Population Genetics,North Africa,Population genetics — Razib Khan @ 11:58 pm
Moroccan Berber man

The Maghreb is an important and interesting place. In the history of Western civilization, the tension between Carthage, the ancient port city based out of modern-day Tunisia, and Rome, is one of the more dramatic and tragic rivalries that has resonances down through the ages. Read Adrian Goldsworthy’s chapter on the Battle of Cannae in The Punic Wars for what I’m alluding to (and of course there was Cato the Elder’s dramatic remonstrations).

Later Roman Africa, which really encompassed northern Morocco, coastal Algeria, and Tunisia and Tripolitania, became a major social and economic pillar of the Imperium. Not only did men such as the emperor Septimius Severus and St. Augustine have roots in the region, but these provinces were a major economic bulwark for the Western Empire in its last century. The wealthy Senators of the 4th and 5th century were often absentee landlords of vast estates in North Africa. The fall of these provinces to the Vandals and Alans in the 430s began the transformation of the Western Empire based in Rome into a more regional player, rather than a true hegemon (perhaps an analogy here can be made to the loss of Anatolia by the Byzantines in the 11th century).

Another important aspect of North Africa is that it is the westernmost extension of the region possibly settled by Near Eastern farmers in Africa. The native Afro-Asiatic Berber languages seem to have been dominant in the region despite the influence and prestige of Punic and Latin in the cities when Muslim Arabs conquered the region in the late 7th century. The genetic-demographic characteristics of the region are relevant to attempts to understand the origins of the Afro-Asiatic languages more generally since Berber is part of the clade with the Semitic languages.

A preprint and a paper utilizing ancient DNA have shed a great deal of light on these questions recently. The paper is in Science, Pleistocene North African genomes link Near Eastern and sub-Saharan African human populations. The preprint is Ancient genomes from North Africa evidence prehistoric migrations to the Maghreb from both the Levant and Europe. They are in broad agreement, though they cover somewhat different periods.

The figure below is the big finding of the Science paper:

They retrieved some genotypes from a site in northern Morocco, Taforalt, which dates to ~15,000 years before the present. This is a Pleistocene site, before the rise of agriculture. The Taforalt individuals are about 65% Eurasian in affinity, and 35% Sub-Saharan African. This confirms that the Eurasian back-migration to northern Africa predates the Holocene, just as many archaeologists and geneticists have reported earlier.

The samples from the preprint date to a later time. The IAM samples date to 7,200 years before the present, and KEB to ~5,000 years before the present. It seems pretty clear that the IAM samples in the preprint exhibit continuity with the Taforalt samples. Though it is not too emphasized in the preprint, the lower K’s seem to strongly suggest that the IAM samples have Sub-Saharan African ancestry, just like the Taforalt samples which are nearly 8,000 years older. In the KEB samples, the fraction drops, probably diluted in part by ancestry related to what we elsewhere term “Early European Farmer” (EEF), related to the Anatolian farming expansion.

Both the Taforalt and IAM samples, in particular, seem to exhibit strong affinities to Natufian/Levantine peoples. Additionally, many of these samples carry Y chromosome haplogroup E1b, just like some of the Natufians. These results indicate that the Natufian-North African populations were exchanging genes, or formed one cline, rather deep in the Pleistocene.

Though various methods have suggested that there is a lot of recent Sub-Saharan African admixture, dating to the Arab period, in North Africa, these results suggest that much of it is far older. The Mozabites, as an isolated Berber group, reflect this tendency. Though some individuals have inflated African ancestry due to recent admixture, much of it is older and more evenly distributed. And yet the Mozabites seem to have less Sub-Saharan African ancestry on average than the IAM sample.

There aren’t enough data points to make a strong inference about the temporal transect, but these few results imply a decline in Sub-Saharan ancestral component after the Pleistocene with further farming migration, and then a rise again with the trans-Saharan slave trade during the Muslim period. Another issue, highlighted in the preprint, is likely heterogeneity within the Maghreb in ancestry (lowland populations in modern North Africa tend to have more Sub-Saharan ancestry due to where slaves were settled).

In the Science paper the authors make an attempt to deduce the origin of the Sub-Saharan contribution to the Taforalt individuals. The result is that there is no modern or ancient proxy that totally fits the bill. These individuals have affinities to many Sub-Saharan African populations. The Sub-Saharan component is likely heterogeneous, but recall that attempts to model European genetic variation during the Ice Age ran into the trouble that divergence from modern populations was quite great. Until we get more ancient DNA there probably won’t be too much more clarity.

On the issue of the Eurasian ancestry, it’s clearly quite like the Natufians. But curiously the authors find that the Neanderthal ancestry in these samples is greater than that found in early Holocene Iran samples. From this, the authors conclude that they may have had a lower fraction of “Basal Eurasian” (BEu) than those populations further to the east. But already 15,000 years ago BEu populations were mixed with more generic West Eurasians to generate the back-migration to Africa. If BEu diverged from other Eurasians >50,000 years ago, then it may have merged back into the “Out-of-Africa” populations around or before the Last Glacial Maximum, ~20,000 years ago.

Finally, the authors looked at some pigmentation genes. Curiously the Taforalt and IAM individuals did not carry the derived variants for pigmentation found in many West and South Eurasians, but the KEB did. This confirms results from Europe, and population genomic inference in modern samples, that selection for derived pigmentation variants is relatively recent in the Holocene.

I do want to add that one possibility about the Sub-Saharan ancestry in the Taforalt, and probably all modern North Africans to a lesser extent, is that it is ancient and local. We now know proto-modern humans were present in the region >300,000 years ago. Northwest Africa may have been part of the multi-regional metapopulation of H. sapiens, as opposed to the Eurasian biogeographic zone in which it is often placed, before a post-LGM back migration of Eurasians.

February 24, 2018

Are Turks Armenians under the hood?

Filed under: Historical Population Genetics,Population genetics — Razib Khan @ 8:31 pm

Benedict Anderson’s Imagined Communities: Reflections on the Origin and Spread of Nationalism is one of those books I haven’t read, but should. In contrast, I have read Azar Gat’s Nations, which is a book-length counterpoint to Imagined Communities. To take a stylized and extreme caricature, Imagined Communities posits nations to be recent social and historical constructions, while Nations sees them as primordial, and at least originally founded on ties of kinship and blood.

The above doesn’t capture the subtlety of Gat’s book, and I’m pretty sure it doesn’t capture that of Anderson’s either. But, those are the caricatures that people take away and project in public, especially Anderson’s (since Gat’s is not as famous).

When it comes to “imagined communities” I recently have been thinking how well that of modern Turks fits into the framework. Though forms of pan-Turkic nationalism can be found as early as 9th-century Baghdad, the ideology truly emerges in force in the late 19th century, concomitantly with the development of a Turkish identity in Anatolia which is distinct from the Ottoman one.

The curious thing is that though Turkic and Turkish identity is fundamentally one of language and secondarily of religion (the vast majority of Turkic peoples are Muslim, and there are periods, such as the 17th century when the vast majority of Muslims lived in polities ruled by people of Turkic origin*), there are some attempts to engage in biologism. This despite the fact that the physical dissimilarity of Turks from Turkey and groups like the Kirghiz and Yakut is manifestly clear.

Several years ago this was made manifestly clear in the paper The Genetic Legacy of the Expansion of Turkic-Speaking Nomads across Eurasia. This paper clearly shows that Turkic peoples across Eurasia have been impacted by the local genetic substrate. In plainer language, the people of modern-day Turkey mostly resemble the people who lived in Turkey before the battle of Manzikert and the migration of Turkic nomads into the interior of the peninsula in the 11th century A.D. Of course, there is some genetic element which shows that there was a migration of an East Asian people into modern day Anatolia, but this component is the minority one.**

Sometimes the Turkish fascination with the biological comes out in strange ways: Turkish genealogy database fascinates, frightens Turks. Much of the discussion has to do with prejudice against Armenians and Jews. But the reality is that most Turks at some level do understand that they are descended from Greeks, Armenians, Georgians, etc.

To interrogate this further I decided to look at a data set of Greeks, Turks, Armenians, Georgians, and a few other groups, including Yakuts, who are the most northeastern of Turkic peoples. The SNP panel was >200,000, and I did some outlier pruning. Additionally, I didn’t have provenance on a lot of the Greeks, except some labeled as from Thessaly. I therefore just split those up with “1” being closest to the Thessaly sample and “3” the farthest.

First, let’s look at the PCA.

The Turks are shifted toward the Yakuts, but not too much. In contrast, there is much more of a Yakut shift in Tajiks, and especially Turkmens. These are two groups from further east, closer to the heart of the zone of Turkic expansion. Curiously, the Tajiks, who are the dominant non-Turkic Iranian speaking people of Central Asia, actually have more East Asian ancestry than the Turks of Turkey. This goes to show that ethnicity is somewhat fluid, and Turkic people have assimilated into the Tajik identity. That being said, please note that the Turkmen are notably more east-shifted than the Tajik.

Let’s see how this looks on pairwise Fst.

Fst is kind of difficult for fine distinctions when you have outgroups like Yakuts and Dai. So let’s look at Treemix with five migrations:

On this, you can see that the relationship of the Greece clusters on Treemix to Lithuanians matches PCA. Greece1 is the closest, Greece3 the farthest.

The Turks are close to the Georgians and Armenians, but not the Kurds, or Tajiks. And, they receive gene flow from the Turkmen-Yakut region of the graph. So do the Tajiks…but the Tajiks also receive gene flow from the Lithuanians. The admixture plot makes it more clear what’s happening I think.

Yellow ~ modal in Southern Europe, green ~ modal Northern Europe, red ~ Central Asian, while blue and purple are northern and southern East Asian. In comparison to Turks of Anatolia Tajiks have a lot more Northern European affinity, probably because of the common steppe heritage. Not surprisingly, Turks have more Southern European like ancestry.

Curiously the East Asian ancestry in the Turkic people seems to be both Yakut and Dai like, so perhaps it was more cosmopolitan than we might think? The Yakuts after all are from the northern edge of the range, and may have absorbed a lot of indigenous Siberian ancestry.

Georgians have none of the Northern European sort of ancestry, but Armenians do, and Turks even more. One could posit that this is due to Slavic ancestry arriving with the Rumelian Turks who arrived in the 20th century, but just as likely is the possibility that Turks have a lot of ancestry from western Anatolia which was Greek, and Greeks have more of this than Armenians.

It’s hard to tell from these results whether Turks have more of an affinity with Greeks or Armenians as their non-Turkic ancestors. So I ran a three population test.

Outgroup X1 X2 f3 error z
Turkey Armenians Yakut -0.00253688 6.70852e-05 -37.8158
Turkey Greece3 Yakut -0.00246931 6.72384e-05 -36.7247
Turkey Georgian Yakut -0.00256555 7.60158e-05 -33.7502
Turkey Armenians Dai -0.00246779 7.40038e-05 -33.3468
Turkey Greece3 Dai -0.0024101 7.34629e-05 -32.8071
Turkey Georgian Dai -0.00249174 8.11957e-05 -30.688
Turkey Greece2 Yakut -0.00222382 7.62368e-05 -29.1699
Turkey Greece2 Dai -0.00231001 8.39207e-05 -27.5261
Turkmen Turkey Dai -0.00288213 0.000108049 -26.6742
Turkmen Turkey Yakut -0.00254805 0.000102816 -24.7826
Turkey Greece1 Yakut -0.00225638 9.94722e-05 -22.6836
Turkey GreekCentral Dai -0.00235681 0.000104014 -22.6587
Turkey Greece3 Tajik -0.000622671 2.76666e-05 -22.5063
Turkey GreekCentral Yakut -0.00221985 0.000101654 -21.8373
Turkey Greece1 Dai -0.00243254 0.000112011 -21.717
Turkey Greece3 Turkmen -0.000640439 3.33529e-05 -19.2019
Turkey GreekThessaly Yakut -0.00208436 0.00011042 -18.8767
Turkey Dai GreekThessaly -0.00225435 0.00012241 -18.4163
Turkey Greece2 Turkmen -0.000584983 3.29819e-05 -17.7365
Turkey Armenians Turkmen -0.000520887 3.07253e-05 -16.953
Turkey Armenians Tajik -0.000421139 2.55274e-05 -16.4975
Tajik Turkey Dai -0.00140423 8.51697e-05 -16.4875
Tajik Turkey Yakut -0.00124601 7.60725e-05 -16.3793
Turkey Georgian Turkmen -0.000532496 3.80694e-05 -13.9875
Turkey Greece2 Tajik -0.000412419 3.04172e-05 -13.5587
Turkey Armenians Lithuanians -0.000459831 3.75838e-05 -12.2348
Turkey Greece1 Turkmen -0.000570715 4.7753e-05 -11.9514
Turkey Kurds Yakut -0.00146087 0.000124799 -11.7058
Turkey GreekThessaly Turkmen -0.000516877 4.46683e-05 -11.5714
Turkey Georgian Tajik -0.000328859 3.02443e-05 -10.8734
Turkey GreekCentral Turkmen -0.000504962 4.92555e-05 -10.2519

Armenians come out slightly ahead of Greece3, but it’s hard to call this definitive. It’s likely that my Turkish sample has both, and/or the original Turkic nomads had Iranian-like ancestry which was more like Armenian than Greek? Hard to say. Additionally, the fact that Greece3 is better than the other Greek options suggests to me that the source is Anatolian Greeks who were less impacted by migrations from the north than Greeks in Greece proper.
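For readers who want to see what the statistic in the table is doing, here is a naive Python sketch. It assumes the first column is the population being tested for admixture and the other two are candidate sources. The allele frequencies are invented, the function names are mine, and the estimator leaves out the finite-sample bias correction that real implementations apply. Programs that run this test typically get the standard error from a block jackknife, which is not reproduced here; the z column is just f3 divided by that error.

```python
def f3(target, source1, source2):
    """Naive f3(target; source1, source2): the mean over SNPs of
    (c - a) * (c - b), computed from population allele frequencies."""
    products = [(c - a) * (c - b)
                for c, a, b in zip(target, source1, source2)]
    return sum(products) / len(products)

def z_score(estimate, standard_error):
    """The z column is the f3 estimate divided by its standard error."""
    return estimate / standard_error

# Toy allele frequencies at five SNPs (invented for illustration).
pop_a = [0.10, 0.80, 0.35, 0.60, 0.05]
pop_b = [0.70, 0.20, 0.90, 0.10, 0.55]
# A target population that is a 70/30 mixture of the two sources.
mixed = [0.7 * a + 0.3 * b for a, b in zip(pop_a, pop_b)]

admixed_f3 = f3(mixed, pop_a, pop_b)
print("f3 for the mixed target:", admixed_f3)  # negative, as expected

# First row of the table: Turkey; Armenians, Yakut.
row_z = z_score(-0.00253688, 6.70852e-05)
print("z for the first row:", row_z)  # close to the tabulated -37.8158
```

A target sitting between its two sources at every SNP makes each product negative, so a significantly negative f3 is evidence that the target is a mixture of populations related to the two sources. A non-negative value does not rule admixture out.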

 

* The Mughals were Central Asian Turks, while the Safavids were mostly Azeri Turks.

** Since the Turks who arrived in Anatolia had long sojourned in Turan and Iran it is important not to assume that their contribution is limited only to the East Asian component of ancestry.

January 27, 2018

Genetic distances across Eurasia

Filed under: Human Population Genetics,Population genetics — Razib Khan @ 10:00 pm

I feel that, for whatever reason, over the past few years many people on this weblog have started to exhibit weak intuitions about the magnitude of between-population differences. Two suggestions for why this might occur.

* First, the proliferation of PCA plots with individuals can make it hard to discern averages

* Second, model-based admixture plots don’t explicitly quantify the differences between the different clusters

To get a better sense of between-group differences I decided to take a step back and look at Fst. Fst basically looks at all the genetic variance across a set of groups and quantifies the proportion that can be attributed to differences between groups.

The plot at the top of this post is from an Fst matrix I generated with Plink (I wrote a script to do the pairwise comparison). I did some PCA pruning of the populations to be clear (e.g., with both Cambodians and Filipinos I made them more distinct than they would otherwise be). The goal was to give people a sense of genetic distances within regions and between them.
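For intuition, here is a minimal Python sketch of the idea, not the script I actually ran. It uses the simple textbook form Fst = (Ht - Hs) / Ht on allele frequencies, with two equally weighted populations and a ratio of averages across SNPs. Plink’s own estimator differs in detail, and the populations, frequencies, and function names below are invented.

```python
from itertools import combinations

def fst(freqs1, freqs2):
    """Ratio-of-averages Fst = (Ht - Hs) / Ht for two equally weighted
    populations, from per-SNP allele frequencies."""
    hs_sum = 0.0  # expected heterozygosity within populations
    ht_sum = 0.0  # expected heterozygosity in the pooled population
    for p1, p2 in zip(freqs1, freqs2):
        hs_sum += (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
        p_bar = (p1 + p2) / 2
        ht_sum += 2 * p_bar * (1 - p_bar)
    return (ht_sum - hs_sum) / ht_sum if ht_sum > 0 else 0.0

def pairwise_fst(freqs_by_pop):
    """Fst for every unordered pair of populations."""
    return {(a, b): fst(freqs_by_pop[a], freqs_by_pop[b])
            for a, b in combinations(sorted(freqs_by_pop), 2)}

# Toy allele frequencies at four SNPs (invented for illustration).
toy = {
    "pop_x": [0.10, 0.50, 0.90, 0.30],
    "pop_y": [0.12, 0.48, 0.88, 0.33],
    "pop_z": [0.60, 0.10, 0.40, 0.80],
}

matrix = pairwise_fst(toy)
for pair, value in matrix.items():
    print(pair, round(value, 4))
```

Two populations with identical frequencies give 0, and two populations fixed for different alleles give 1. Real data sits at the low end of that range, as the numbers in this post show.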

I also generated a PCA plot and a Treemix plot, for the sake of comparison.

It’s also useful to look at a few group comparisons and judge them in a global context.

Population 1      Population 2             Fst
Tamil             Telugu                   0.0011
Tamil             Tamil Scheduled Caste    0.0016
Tamil             Bangladeshi              0.0024
Tamil             South Indian Brahmin     0.0031
Tamil             Uttar Pradesh Brahmin    0.0041
Tamil             Sindhi                   0.0087
Tamil             Vietnamese               0.0668
Southern Chinese  Northern Chinese         0.0033
Southern Chinese  Vietnamese               0.0034
Southern Chinese  Korea                    0.0045
Southern Chinese  Japanese                 0.0087
Southern Chinese  Tamil                    0.0711
Southern Chinese  Polish                   0.1141
Gujurati_Patel    Telugu                   0.0062
Gujurati_Patel    Uttar Pradesh Brahmin    0.0065
Gujurati_Patel    Bangladeshi              0.0069
Gujurati_Patel    Velama                   0.0094
Gujurati_Patel    Sindhi                   0.0104
Gujurati_Patel    Polish                   0.0405
Gujurati_Patel    Japanese                 0.0781
GreatBritain      Ireland                  0.0015
GreatBritain      Polish                   0.0043
GreatBritain      Sicily                   0.0077
GreatBritain      Uttar Pradesh Brahmin    0.0264
GreatBritain      Tamil                    0.0430
GreatBritain      Korea                    0.1130

The non-Brahmin and non-Dalit samples in the 1000 Genomes are not partitioned much by geography. The Tamil vs. Telugu difference is smaller than that between the British and Irish. Within Tamil Nadu, Brahmins, though, are nearly as different from typical Tamils as Poles are from the English (most of the British sample is English). The biggest differences in Europe are between Sicilians and Northern European groups, which are similar in degree to that between South Indians and Pakistanis. The South Chinese sample is nearly as close to Vietnamese as it is to a North Chinese group, while the difference between Koreans and Chinese is relatively small when compared to the variance you see in South Asia and Europe.

Note: Drift tends to inflate Fst.

January 25, 2018

How South Asian populations relate to each other

Filed under: Population genetics,South Asian Genetics — Razib Khan @ 12:45 am

Since people are asking me about this, and I’m running the South Asian Genotype Project, I thought I would post two non-PCA visualizations of how various South Asian groups relate to each other (along with a few outgroups).

The radial plot above is a neighbor-joining tree visualized from pairwise Fst statistics (basically a proxy for genetic distance).
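If you have never looked inside neighbor-joining, the core of it is short. What follows is a bare-bones Python sketch of the standard algorithm that returns only the order in which nodes get joined (no branch lengths, no plotting). The five taxa and their distances are a toy matrix, not my Fst values, and the function name and data layout are my own choices for illustration.

```python
def neighbor_joining(names, dist):
    """Return the sequence of joins made by neighbor-joining.

    `dist` maps frozenset({a, b}) to the distance between a and b.
    """
    nodes = list(names)
    d = dict(dist)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all the others.
        total = {i: sum(d[frozenset((i, j))] for j in nodes if j != i)
                 for i in nodes}
        # Pick the pair that minimises the Q criterion.
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                i, j = nodes[x], nodes[y]
                q = (n - 2) * d[frozenset((i, j))] - total[i] - total[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        new = f"({i},{j})"
        # Distance from the new internal node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))]
                    - d[frozenset((i, j))])
        nodes = [k for k in nodes if k not in (i, j)] + [new]
        joins.append((i, j))
    joins.append((nodes[0], nodes[1]))
    return joins

# Toy distance matrix for five taxa (not real Fst values).
taxa = ["a", "b", "c", "d", "e"]
raw = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
       ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
       ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
distances = {frozenset(pair): value for pair, value in raw.items()}

joins = neighbor_joining(taxa, distances)
print(joins)
```

Note that the first pair joined is not the pair with the smallest raw distance (d and e), because the Q criterion also rewards pairs that are jointly far from everything else. That is the main thing that separates neighbor-joining from naive clustering on the matrix.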

I also used Treemix to generate a plot. You see similar patterns to the one above, though the two methods are different. Treemix tests a bunch of models and sees how the data fit those models. The visualization of Fst is just a way of representing the summary statistic.

I added 5 migration edges to the plot to the right. Not sure if they add anything, but you can see that some of the nodes move around because they are so mixed.

January 9, 2018

Island demes in an empty world

Filed under: Beringia,Population genetics — Razib Khan @ 9:58 pm

 

As you probably know a new ancient genome paper was published last week in Nature, Terminal Pleistocene Alaskan genome reveals first founding population of Native Americans. There is at least one other involving Willerslev in the works for what it’s worth. Carl Zimmer has a good write-up in The New York Times, while Greg Cochran picked up the fact that the latest results show no evidence of “Australo-Melanesian” affinities that have been found in Amazonians.

The key issue here is that they found 11,500 year old remains from Alaska, one of which they sequenced at 17x coverage, which is rather good (not medical grade good, but really sufficient for a lot of population genomic work). It’s clear that the lineage represented by these remains is “basal” to that of other Native American peoples, whom David Reich’s group labeled “First Americans.” Later, the First Americans diverged into different populations, with the two in modern focus being a northern cluster, including the Aboriginal peoples of Canada and parts of the United States, and a southern one including everyone else. This does not mean that the Beringians were isolated outliers. There may have been many other peoples related to the Beringians who diversified, who went extinct as well. The settlement of Alaska by other peoples suggests to me that extreme conditions in the Arctic made it likely that there would be population turnover there. Also, the fact that these samples were located close to the source of settlement in the New World by modern humans makes their distant relation to all other New World populations unsurprising.

The big thing that the press is highlighting is the confirmation of the Beringian Standstill model, where modern humans percolated into the area between Siberia and Alaska, Beringia, and did not move east for thousands of years. Basically, the conditions were inclement toward human habitation on both sides of Beringia, while a relict modern human group likely occupied a pocket of more moderate climes for thousands of years, with minimal gene flow from the west, and blocked from migration to the east. Genetically the Beringia Standstill made sense for a long time…the divergence between Amerindian lineages and those of eastern Eurasia seemed too old to be accounted for by recent migration a bit more than 10,000 years ago (the old “Clovis first” hypothesis).

How old? This paper suggests that the portion of Native American ancestry which indicates an affinity to East Asians stopped exhibiting gene flow from that source around ~25,000 years ago, after diverging around ~36,000 years ago. This points to the fact that after modern humans came to dominate eastern Eurasia they began to diversify rapidly after 40,000 years ago, but gene flow between different populations did not always allow them to drift apart…at least initially. The ancestors of Native Americans and East Asians may have been in extremely separate locations by ~25,000 years ago, whether it be on the fringes of eastern Siberia, or somewhere in southern China (there is no reason that the modern Chinese have to have had ancestors resident on the North China plain before the Last Glacial Maximum).*

One aspect here I want to emphasize is that our image of a world thickly populated with humans may mislead us in our intuition about how patchy occupation was ~25,000 years ago. Yes, humans may have left artifacts all over the world, but that doesn’t mean that there weren’t centuries or millennia of no occupation, or that meta-population dynamics weren’t such that extinctions were common. For decades in population genetics there has been talk of “clines vs. clusters,” but if human population densities were far lower, or occupation patchier, then clines may have become much more important recently, with high density, than they were in the past.

Finally, back to the Australo-Melanesian issue. Either there is a lot of population structure in ancient Beringia to be explored, with diverse quasi-Asiatic groups, or there was an Australo-Melanesian group already in South America.

* Ancient North Eurasian ancestry came into Beringians ~20,000 years ago. These were two groups which merged during the middle of the Last Glacial Maximum.

December 20, 2017

Natural selection in humans (OK, 375,000 British people)

Filed under: Natural Selection,Population genetics,Population genomics,Selection — Razib Khan @ 10:41 pm

 


The above figure is from Evidence of directional and stabilizing selection in contemporary humans. I’ll be entirely honest with you: I don’t read every UK Biobank paper, but I do read those where Peter Visscher is a co-author. It’s in PNAS, and a draft which is not open access. But it’s a pretty interesting read. Nothing too revolutionary, but confirms some intuitions one might have.

The abstract:

Modern molecular genetic datasets, primarily collected to study the biology of human health and disease, can be used to directly measure the action of natural selection and reveal important features of contemporary human evolution. Here we leverage the UK Biobank data to test for the presence of linear and nonlinear natural selection in a contemporary population of the United Kingdom. We obtain phenotypic and genetic evidence consistent with the action of linear/directional selection. Phenotypic evidence suggests that stabilizing selection, which acts to reduce variance in the population without necessarily modifying the population mean, is widespread and relatively weak in comparison with estimates from other species.

The stabilizing selection part is probably the most interesting part for me. But let’s hold up for a moment, and review some of the major findings. The authors focused on ~375,000 individuals who matched their sample criteria (white British individuals old enough that they are well past their reproductive peak), and the genotyping platforms had 500,000 markers. The dependent variable they’re focusing on is reproductive fitness. In this case specifically, “rLRS”, or relative lifetime reproductive success.

With these huge data sets and the large number of measured phenotypes they first used the classical Lande and Arnold method, which leveraged regression to measure directional and stabilizing selection. Basically, how does change in the phenotype impact reproductive fitness? So, it is notable that shorter women have higher reproductive fitness than taller women (shorter than the median). This seems like a robust result.
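To make the regression idea concrete, here is a minimal single-trait sketch in the spirit of Lande and Arnold, not the paper's actual pipeline (which handles many traits and covariates). It assumes the textbook convention: the trait is standardized, fitness is divided by its mean, β is the slope of a linear-only regression, and γ is twice the quadratic coefficient from a regression that adds a squared term. The function names and the toy numbers are invented for illustration.

```python
from statistics import mean, pstdev


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(columns, y):
    """Least-squares coefficients for y regressed on the given columns."""
    k = len(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j]))
            for j in range(k)] for i in range(k)]
    xty = [sum(p * q for p, q in zip(columns[i], y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def selection_gradients(trait, offspring):
    """Return (beta, gamma) for one trait.

    beta:  directional gradient, slope of relative fitness on the
           standardized trait (linear-only model).
    gamma: quadratic gradient, twice the coefficient on the squared term;
           negative values are consistent with stabilizing selection.
    """
    mu, sd = mean(trait), pstdev(trait)
    z = [(t - mu) / sd for t in trait]           # standardized trait
    w_bar = mean(offspring)
    w = [o / w_bar for o in offspring]           # relative fitness
    ones = [1.0] * len(z)
    _, beta = ols([ones, z], w)
    _, _, quad = ols([ones, z, [v * v for v in z]], w)
    return beta, 2.0 * quad


# Toy data (made up): heights in cm and number of children.
heights = [150, 155, 160, 165, 170, 175, 180]
children = [2, 3, 3, 3, 2, 2, 1]
beta, gamma = selection_gradients(heights, children)
print(f"beta = {beta:.3f}, gamma = {gamma:.3f}")
```

A negative β here would mean shorter individuals leave more offspring on average, and a negative γ would mean individuals near the mean do better than those at either extreme.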

The results using phenotypic correlations for directional (β) and stabilizing (γ) selection are shown below. The abbreviations are the same as above.

 

There are many cases where directional selection seems to operate in females, but not in males. But they note that that is often due to near zero non-significant results in males, not because there were opposing directions in selection. Height was the exception, with regression coefficients in opposite directions. For stabilizing selection there was no antagonistic trait.

A major finding was that compared to other organisms stabilizing selection was very weak in humans. There’s just not that much pressure against extreme phenotypes. This isn’t entirely surprising. First, you have the issue of the weirdness of a lot of studies in animal models, with inbred lines, or wild populations selected for their salience. Second, prior theory suggests that a trait with lots of heritable quantitative variation, like height, shouldn’t be subject to that much selection. If it were, the genetic variation which is the raw material of the trait’s distribution wouldn’t be there.

Using more complex regression methods that take into account confounds, they pruned the list of significant hits. But, it is important to note that even at ~375,000, this sample size might be underpowered to detect really subtle dynamics. Additionally, the beauty of this study is that it added modern genomic analysis to the mix. Detecting selection through phenotypic analysis goes back decades, but interrogating the genetic basis of complex traits and their evolutionary dynamics is new.

To a first approximation, the results were broadly consonant across the two methods. But, there are interesting details where they differ. There is selection on height in females, but not in males. This implies that though empirically you see taller males with higher rLRS, the genetic variance that is affecting height isn’t correlated with rLRS, so selection isn’t occurring.

~375,000 may seem like a lot, but from talking to people who work in polygenic selection there is still statistical power to be gained by going into the millions (perhaps tens of millions?). These sorts of results are very preliminary but show the power of synthesizing classical quantitative genetic models and ways of thinking with modern genomics. And, it does have me wondering about how these methods will align with the sort of stuff I wrote about last year which detects recent selection on time depths of a few thousand years. The SDS method for example seems to be detecting selection for increasing height the world over…which I wonder might be some artifact, because there’s a robust pattern of shorter women having higher fertility in studies going back decades.

December 14, 2017

A genetic map of the world

Filed under: Population genetics — Razib Khan @ 4:46 pm


The above map is from a new preprint on the patterns of genetic variation as a function of geography for humans, Genetic landscapes reveal how human genetic diversity aligns with geography. The authors assemble an incredibly large dataset to generate these figures. The orange zones are “troughs” of gene flow. Basically barriers to gene flow.  It is no great surprise that so many of the barriers correlate with rivers, mountains, and deserts. But the aim of this sort of work seems to be to make precise and quantitative intuitions which are normally expressed verbally.

To me, it is curious how the borders of the People’s Republic of China are evident on this map (an artifact of sampling?). Additionally, one can see Weber’s line in Indonesia. There are the usual important caveats of sampling, and caution about interpreting present variation and dynamics back to the past. But I believe that these sorts of models and visualizations are important nulls against which we can judge perturbations.

As I said, these methods can confirm rigorously what is already clear intuitively. For example:

Several large-scale corridors are inferred that represent long-range genetic similarity, for example: India is connected by two corridors to Europe (a southern one through Anatolia and Persia ‘SC’, and a northern one through the Eurasian Steppe ‘NC’)

We still don’t have enough ancient DNA to be totally sure, but it’s hard to ignore the likelihood that “Ancestral North Indians” (ANI) actually represent two different migrations.

India also illustrates the contingency of these barriers. Before the ANI migration, driven by the rise in agricultural lifestyles, there would likely have been a major trough of gene flow on India’s western border. In fact a deeper one than the one on the eastern border. And if the high genetic structure statistics from ancient DNA are further confirmed then the rate of gene flow was possibly much lower between demes in the past. Perhaps that would simply re-standardize equally so that the map itself would not be changed, but I suspect that we’d see many more “troughs” during the Pleistocene and early Holocene.

Because there are so many geographically distributed samples for humans, and frankly some of the best methods developers work with human data (thank you NIH), it is no surprise that our species would be mapped first. But I think some of the biggest insights may be with understanding the dynamics of gene flow of non-human species, and perhaps the nature and origin of speciation as it relates to isolation (or lack thereof).

December 10, 2017

Visualizing intra-European phylogenetic distances

Filed under: Europe,European genetics,Population genetics,Population genomics — Razib Khan @ 4:53 pm
Neighbor-joining tree of genetic distances between populations

 

In L. L. Cavalli-Sforza’s The History and Geography of Human Genes he used between-population genetic distances, as measured in FST values, to generate a series of visualizations, which then allowed him to infer historical processes. Basically the way it works is that you look at genetic variation, and see how much of it can be allocated to between groups. If none of it can be allocated to between groups, then in a population genetic sense it doesn’t make much sense to speak of distinctive groups; they’re basically one breeding population. The higher the FST statistic is, the more of the variation is partitioned between the groups.
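The partitioning logic can be sketched in a few lines. This assumes the simplest textbook form, Wright's FST = (H_T - H_S) / H_T for two equally weighted populations at biallelic loci, combined across loci as a ratio of sums. It ignores sample-size corrections, so it is not one of the estimators used in the literature cited here, and the function name and frequencies are made up.

```python
def wright_fst(freqs_a, freqs_b):
    """Wright's FST from per-locus allele frequencies in two populations.

    H_S is the mean expected heterozygosity within the populations,
    H_T is the expected heterozygosity of the pooled population.
    FST is the share of H_T that is not found within populations.
    """
    hs_total = 0.0
    ht_total = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        hs_total += (2 * pa * (1 - pa) + 2 * pb * (1 - pb)) / 2
        p_bar = (pa + pb) / 2
        ht_total += 2 * p_bar * (1 - p_bar)
    if ht_total == 0:
        return 0.0  # no variation at all, nothing to partition
    return (ht_total - hs_total) / ht_total


# Identical frequencies: no variation lies between the groups.
print(wright_fst([0.3, 0.6], [0.3, 0.6]))
# Fixed differences: all variation lies between the groups.
print(wright_fst([0.0, 1.0], [1.0, 0.0]))
# Intermediate case, roughly 0.36.
print(wright_fst([0.2], [0.8]))
```

Real data calls for an estimator that corrects for finite sample sizes, which matters a great deal when pairwise values are as low as they are within Europe.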

Roughly this is used to correlate with genetic distance as well as evolutionary divergence. The longer two populations have been separated, the more and more genetic differences they’ll accumulate, inflating the FST value. There are a lot of subtleties that I’m eliding here (see Estimating and interpreting FST: the impact of rare variants for a survey of the recent literature on the topic and pathways forward), but for a long time, FST was the go-to statistic for making phylogenetic inferences on a within-species scale.

Today we have other techniques: Structure, Treemix, fineStructure, and various local ancestry packages.

But FST is still useful to give one a Gestalt sense of population genetic differences. Cavalli-Sforza admits in The History and Geography of Human Genes that European populations had very low pairwise FST, but because of the importance of Europe for sociocultural reasons a detailed analysis of the region was still provided in the text. Additionally, they had lots of European samples (non-European Caucasoids were thrown into one category for macro-group comparisons because there weren’t that many samples).

Using results from the 2015 paper Massive migration from the steppe was a source for Indo-European languages in Europe, I visualized pairwise genetic distances for European populations, ancient and modern (Han Chinese as an outgroup), on a tree. What the results illustrate is that:

  1. Ancient populations were very distinct in Europe from modern ones.
  2. Many modern groups are clustered close together.
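For readers curious how a matrix of pairwise distances becomes a tree, below is a bare-bones sketch of the standard neighbor-joining algorithm. It is not the code used to produce the figure; the population labels and distances are invented toy values (chosen to be perfectly tree-like), and internal nodes get hypothetical names like "node0".

```python
from itertools import combinations


def neighbor_joining(labels, dist):
    """Build an unrooted tree from pairwise distances.

    labels: list of distinct taxon names.
    dist:   dict mapping frozenset({a, b}) -> distance.
    Returns a list of (node, node, branch_length) edges.
    """
    d = dict(dist)
    nodes = list(labels)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all the others.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i)
             for i in nodes}
        # Join the pair that minimizes the Q criterion.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        li = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        lj = dij - li
        u = f"node{next_id}"
        next_id += 1
        edges.append((i, u, li))
        edges.append((j, u, lj))
        # Distances from the new internal node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Toy distances between five hypothetical populations.
toy_labels = ["a", "b", "c", "d", "e"]
toy_pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
toy_dist = {frozenset(pair): float(v) for pair, v in toy_pairs.items()}
for x, y, length in neighbor_joining(toy_labels, toy_dist):
    print(f"{x} -- {y}: {length}")
```

With real FST matrices the distances are not perfectly additive, so branch lengths are approximate and can even come out slightly negative; established phylogenetics packages handle those edge cases.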

The bulk of the population genetic structure in modern Europe seems to have been established in the period between 3000 BCE and 2000 BCE. This is not that much time for a lot of distinctiveness to develop, especially on the geographically open North European plain. I suspect with more and more Mesolithic and early to middle Neolithic DNA we’ll see that some of the modern population structure is a ghost of ancient substrate absorption.

Many of the ethno-national categories that are very significant in recent history, and impact the cultural memories of modern people and their genealogies, have very shallow roots. This does not mean they are not “real” (I don’t know what that’s supposed to mean at all), just that many of the identities which seem so salient to us today may be relatively recent in terms of their significance to large groups of humans….
