July 13, 2018

Tutorial to run supervised admixture analyses

ID Dai Gujrati Lithuanians Sardinian Tamil
razib_23andMe 0.14 0.26 0.02 0.00 0.58
razib_ancestry 0.14 0.26 0.02 0.00 0.58
razib_ftdna 0.14 0.26 0.02 0.00 0.57
razib_daughter 0.05 0.14 0.29 0.18 0.34
razib_son 0.07 0.17 0.28 0.19 0.30
razib_son_2 0.06 0.19 0.29 0.19 0.27
razib_wife 0.00 0.07 0.55 0.38 0.00

This is a follow-up to my earlier post, Tutorial To Run PCA, Admixture, Treemix And Pairwise Fst In One Command. Hopefully you’ll be able to run supervised admixture analysis with less hassle after reading this.

The above results are from a supervised admixture analysis of my family and myself. I appear three times because I converted my 23andMe, Ancestry, and Family Tree DNA raw data into separate plink files. Notice that the results are broadly consistent. This emphasizes that discrepancies between DTC companies in their results are due to their analytic pipelines, not to data quality.

The results for my family are not surprising. I’m ~14% “Dai”, reflecting East Asian admixture into Bengalis. My wife is ~0% “Dai”. My children are somewhere in between. At such a low fraction you expect some variance in the F1.

Now below are results for three Swedes with the same reference panel:

Group ID Dai Gujrati Lithuanians Sardinian Tamil
Sweden Sweden17 0.00 0.09 0.63 0.28 0.00
Sweden Sweden18 0.00 0.08 0.62 0.31 0.00
Sweden Sweden20 0.00 0.05 0.72 0.23 0.00

All these were run on supervised admixture frameworks where I used Dai, Gujrati, Lithuanians, Sardinians, and Tamils, as the reference “ancestral” populations. Another way to think about it is: taking the genetic variation of these input groups, what fractions does a given test focal individual shake out at?

The commands are rather simple. For my family:
bash rawFile_To_Supervised_Results.sh TestScript

For the Swedes:
bash supervisedTest.sh Sweden TestScript

The commands need to be run in a folder: ancestry_supervised/.

You can download the zip file.

Here is what the scripts do in two different situations. Imagine you have raw genotype files downloaded from 23andMe, Ancestry, and Family Tree DNA.

Download the files as usual. Rename them in an intelligible way, because the file names are going to be used for generating IDs. So above, I renamed them “razib_23andMe.txt” and such. Leave the extensions as they are. Obviously, you need to make sure the files are not compressed. Then place them all in ancestry_supervised/RAWINPUT.

The script looks for the files in there. You don’t need to specify names, it will find them. In plink the family ID and individual ID will be taken from the text before the extension in the file name. Output files will also have the file name.
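To preview which IDs your file names will produce, something like the following can help. It is only a sketch: the function name id_from_filename is made up here, and it assumes the ID is simply the file name with its last extension removed, which is my reading of the description above rather than the script's actual code.

```shell
# Sketch only: show the ID that a raw file name would yield.
# Assumption: the ID is the file name minus its last extension.
id_from_filename() {
    name=$(basename "$1")
    printf '%s\n' "${name%.*}"
}

# Demo on a scratch folder standing in for ancestry_supervised/RAWINPUT.
demo_dir=$(mktemp -d)
touch "$demo_dir/razib_23andMe.txt" "$demo_dir/razib_ancestry.txt"
for f in "$demo_dir"/*; do
    id_from_filename "$f"
done
rm -r "$demo_dir"
```

Running the same loop over your real RAWINPUT folder before starting the analysis is a cheap way to catch a badly named file.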

Aside from the raw genotype files, you need to determine a reference file. In REFERENCEFILES/ you see the binary pedigree/plink file Est1000HGDP, the same file from the earlier post. It would be crazy to run supervised admixture on the dozens of populations in this file. You need to create a subset.

For the above I did this:
grep -E "Dai|Guj|Lithua|Sardi|Tamil" Est1000HGDP.fam > ../keep.keep

./plink --bfile REFERENCEFILES/Est1000HGDP --keep keep.keep --make-bed --out REFERENCEFILES/TestScript
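Because a short grep pattern can match more populations than you intended, it may be worth tallying the family IDs in the keep file before running plink. A minimal sketch, with a made-up helper name (count_per_group) and invented demo rows; the real IDs in Est1000HGDP will differ:

```shell
# Sketch only: tally individuals per family ID (column 1) in a keep file.
count_per_group() {
    awk '{ n[$1]++ } END { for (g in n) print g, n[g] }' "$1" | sort
}

# Demo on invented rows in .fam layout; these are not real sample IDs.
demo=$(mktemp)
printf '%s\n' \
    'AA_Ref_Dai d1 0 0 0 -9' \
    'AA_Ref_Dai d2 0 0 0 -9' \
    'AA_Ref_Tamil t1 0 0 0 -9' > "$demo"
count_per_group "$demo"
rm "$demo"
```

On the real data you would run count_per_group keep.keep and check that only the five intended populations show up.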

When the script runs, it converts the raw genotype files into plink files and puts them in INDIVPLINKFILES/. Then it takes each plink file and uses it as a test against the reference population file. That file has a prefix on group/family IDs of the form AA_Ref_. This is essential for the script to understand that this is a reference population. The .pop files are automatically generated, and the script inputs the correct K by counting the unique population numbers.
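For anyone curious what that step amounts to, here is a rough sketch of deriving a .pop file and K from a merged .fam file. It is not the script's own code. It assumes ADMIXTURE's supervised-mode convention of one line per individual, with a population label for references and "-" for individuals to be estimated, and it uses the text after AA_Ref_ as the label, whereas the actual script works with population numbers. The names make_pop and count_k and the demo rows are invented.

```shell
# Sketch only: build a .pop file from a .fam file and count K.
# Assumption: reference rows have AA_Ref_ at the start of the family ID.
make_pop() {
    awk '{ if (index($1, "AA_Ref_") == 1) print substr($1, 8); else print "-" }' "$1"
}

# K is the number of distinct reference labels in the .pop file.
count_k() {
    grep -v '^-$' "$1" | sort -u | wc -l | tr -d ' '
}

# Demo on invented rows: three reference individuals and one test individual.
work=$(mktemp -d)
printf '%s\n' \
    'AA_Ref_Dai d1 0 0 0 -9' \
    'AA_Ref_Dai d2 0 0 0 -9' \
    'AA_Ref_Tamil t1 0 0 0 -9' \
    'razib_23andMe razib_23andMe 0 0 0 -9' > "$work/merged.fam"
make_pop "$work/merged.fam" > "$work/merged.pop"
cat "$work/merged.pop"
echo "K=$(count_k "$work/merged.pop")"
rm -r "$work"
```

The point of the sketch is only that K falls out of the reference labels, so you never have to pass it by hand.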

The admixture run is going to be slow. I recommend you modify runadmixture.pl by adding the number-of-cores parameter (admixture’s -j flag, e.g. -j4) so it can go multi-threaded.

When the script is done it will put the results in RESULTFILES/. They will be .csv files with strange names (they will have the original filename you provided, but there are timestamps in there so that if you run the files with a different reference and such it won’t overwrite everything). Each individual is run separately and has a separate output file (a .csv).

But this is not always convenient. Sometimes you want to test a larger batch of individuals. Perhaps you want to use the reference file I provided? For the Swedes I did this:
grep "Swede" REFERENCEFILES/Est1000HGDP.fam > keep.keep

./plink --bfile REFERENCEFILES/Est1000HGDP --keep keep.keep --make-bed --out INDIVPLINKFILES/Sweden

Please note the folder. There are modifications you can make, but the script assumes that the test files are in INDIVPLINKFILES/. The next part is important. The Swedish individuals will have AA_Ref_ prepended on each row since you got them out of Est1000HGDP. You need to remove this. If you don’t remove it, it won’t work. In my case, I modified the file using the vim editor:
vim Sweden.fam

You can do it with a text editor too. It doesn’t matter. Though it has to be the .fam file.
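If you would rather not edit by hand, the prefix can also be stripped with sed. A minimal sketch, assuming the prefix sits at the very start of each row as described above; the helper name strip_ref_prefix is made up, and the demo row is a stand-in rather than a copy of the real file:

```shell
# Sketch only: remove a leading AA_Ref_ from every row of a .fam file.
# A backup of the original is kept next to it as <file>.bak.
strip_ref_prefix() {
    cp "$1" "$1.bak" &&
    sed 's/^AA_Ref_//' "$1.bak" > "$1"
}

# Demo on a scratch copy with an invented row.
work=$(mktemp -d)
printf '%s\n' 'AA_Ref_Sweden Sweden17 0 0 0 -9' > "$work/Sweden.fam"
strip_ref_prefix "$work/Sweden.fam"
cat "$work/Sweden.fam"
rm -r "$work"
```

For the real file you would call strip_ref_prefix INDIVPLINKFILES/Sweden.fam from ancestry_supervised/ and glance at the result before running the script.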

After the script is done, it will put the .csv file in RESULTFILES/. It will be a single .csv with multiple rows. Each individual is tested separately though, so what the script does is append each result to the file. If you have 100 individuals, it will take a long time. You may want to look in the .csv file as the individuals are being added to make sure it looks right.

The convenience of these scripts is that they do some merging/flipping/cleaning for you. They also format the output so you don’t have to.

I originally developed these scripts on a Mac, but to get them to work on Ubuntu I made a few small modifications. I don’t know if they still work on a Mac, but you should be able to make the modifications if not. Remember that for a Mac you will need the Mac versions of plink and admixture.

For supervised analysis, the reference populations need to make sense and be coherent. Please check the earlier tutorial and use the PCA functions to remove outliers.

November 11, 2012

Religion determines politics for Asian Americans

I was at ASHG this week, so I’ve followed reactions to the election passively. But one thing I’ve seen is repeated commentary on the fact that Asian Americans have swung toward the Democrats over the past generation. The thing that pisses me off is that there is a very obvious low-hanging fruit sort of explanation out there, and I’m frankly sick and tired of reading people ramble on without any awareness of this reality. We spent the past few months talking about the power of polls, and quant data vs. qual (bullshit) analysis, with some of my readers going into full on let’s-see-if-Razib-is-moron-enough-to-swallow-this-crap mode.

In short, it’s religion. Barry Kosmin has documented that between 1990 and 2010 Asian Americans have become far less Christian, on average. Meanwhile, the Republican party has become far more Christian in terms of its identity. Do you really require more than two sentences to infer from this what the outcome will be in terms of how Asian Americans will vote?

Below I took the data from Pew’s Religious Identification Survey in terms of how all Americans lean politically based on religion, and compared it to how Asian Americans lean based on religion.


All ...

October 18, 2012

The general social survey: information is not free

Last week the GSS was down. I was very sad. The SDA team explains the situation:

Part of the popularity of our demonstration archive is that it is free for end users. We are happy to provide this service. It is a valuable resource for the academic community and it also publicizes the value of our SDA software. However, the flip side of providing this free service is that it does not generate any income to offset the cost of providing the infrastructure required. We receive no funding from GSS for hosting their datasets — which is often a surprise to our users. Almost all of our income comes from the fees provided by licensing the SDA software to other data archives (like ICPSR and IPUMS), and virtually all of that income goes to support the programming and technical support that we provide them. We obviously need some additional sources of revenue.

October 7, 2012

The relative decline of New York

Despite the real estate bubble bursting, it looks as if Florida will surpass New York in population by the next Census. I once made some quick money by betting an older gentleman that Texas had a larger population than New York. I suspect there’s even more money to be made by betting people that Florida has a larger population than New York in a few years. The reality is that most people don’t check statistics in their free time, so some “facts” get frozen in their minds. A great number of adults alive today were told in elementary or secondary school that New York was the second largest state in population. They are unlikely to update their views as they age. Unfortunately, I suspect these confusions are going to lead to public policy problems as well. I am not confident that our elected officials are any more aware of statistics than their constituents.

October 2, 2012

Who cares about overpopulation? Smart people and atheists

If you had the sense that Paul Ehrlich and Garrett Hardin are very much figures of the 1970s nexus of environmentalism and population control, it seems you are right. According to Google Ngrams mention of these topics has been declining since peaking during the oil crisis, in the afterglow of the influence of the late 1960s counter-culture. The general social survey has a variable, POPGRWTH, which asks:

And please circle one number for each of these statements to show how much you agree or disagree with it. The earth cannot continue to support population growth at its present rate.

The question was asked in the years 2000 and 2010. Demographic breakdowns below for the pooled responses….

Earth can not sustain population growth

Group                       Strongly agree  Agree  Neither  Disagree  Strongly disagree
Non-Hispanic white          15              41     22       20        3
Non-Hispanic black          6               32     28       29        5
Hispanic                    14              39     22       23        3

Male                        14              42     18       22        4
Female                      13              38     26       21        3

Highest educational attainment
< HS                        14              46     18       19        3
HS                          13              38     24       22        3
Junior College              11              37     28       23        2
Bachelor                    14              41     20       21        4
Graduate                    16              40     17       23        5

Intelligence, measured by vocab score (WORDSUM 0-4 = stupid, 5-8 = average, 9-10 = smart)
Stupid                      14              38     23       21        4
Average                     12              41     23       22        3
Smart                       21              37     15       22        5

Liberal                     20              43     19       17        1
Moderate                    12              41     27       18        1
Conservative                11              36     19       28        6

Protestant                  12              39     22       23        3
Catholic                    13              39     24       21        4
Jewish                      15              33     22       30        0
No Religion                 18              47     20       13        2

Bible is….
Word of God                 11              35     23       27        5
Inspired Word of God        11              41     25       21        3
Book of Fables              23              43     14       18        2

Opinion as to the existence of God
Don’t believe               35              34     13       15        3
No way to find out          24              39     22       14        1
Some higher power           21              51     11       17        1
Believe sometimes           18              43     23       16        0
Believe but with doubts     13              44     22       19        2
Know God exists             11              37     24       24        4

September 15, 2012

Intelligence challenged people and free speech

In the post below I took the time out to link to the GSS, as well as posting my exact queries. As payment for this consideration the first comment was absolute drivel. I understand people have political opinions, but I’m not too interested in your opinions. You may be interested in your opinions, but I’d rather have more data. Most people don’t know enough for me to have interest in their opinions (most != all, many readers do have opinions in their specialties which I seek out).

I was trying to make a point that anger and even violence in reaction to actions which offend are actually comprehensible as the modal human response. The community reacts to punish those who violate taboos. The taboos may differ, but the response to the action of violation is normal and natural. A primary issue that needs to be considered is that taboos differ from society to society, so one is often not conscious of the act of violation (e.g., if you show the bottom of your shoes to people when you sit down, that’s an offensive act in some societies).

An implication here is that American norms of free speech near absolutism, ...

September 1, 2012

The educated and conservative think fatness is a choice

After the post on fatness and homophobia I decided to query the GSS on the extent to which people think that fatness has a strong biological element, similar to homosexuality. There’s a variable, GENENVO1. It asks:

Character, personality, and many types of behavior are influenced both by the genes people inherit from their parents and by what they learn and experience as they grow up. For each of the following descriptions, we would like you to indicate what percent of the person’s behavior you believe is influenced by the genes they inherit, and what percent is influenced by their learning and experience and other aspects of their environment. The boxes on handcard D1 are arranged so that the first box on the LEFT (which is numbered 1) represents 100% genetic influence (and 0% environment). The next box (numbered 2) represents 95% genes (and 5% environment), and so on. The RIGHTMOST box (numbered 21) represents 100% environmental influence (and no genetic influence). After each description, please type the number of the box that comes closest to your answer. Please use the numbered scale on handcard D1 to indicate, FOR EACH OF THE BEHAVIORS DESCRIBED, what percent of the person’s behavior ...

August 29, 2012

The future of the three “Pakistans”

Over at Econlog Bryan Caplan bets that India’s fertility will be sub-replacement within 20 years. My first inclination was to think that this was a totally easy call for Caplan to make. After all, much of southern India, and the northwest, is already sub-replacement. And then I realized that heterogeneity is a major issue. This is a big problem I see with political and social analysis. Large nations are social aggregations that are not always comparable to smaller nations (e.g., “Sweden has such incredible social metrics compared to the United States”; the appropriate analogy is the European Union as a whole).

August 26, 2012

Non-whites consistent on “life” issues

Over at Darwin Catholic a commenter asked whether a pro-choice commenter on this weblog also supported the death penalty. I presume that they were here pointing to the consistent life ethic issue. Many liberals who oppose capital punishment support abortion rights, and many conservatives who support capital punishment oppose abortion rights. These camps both have their viewpoints, which I’m not interested in re-litigating in the comments. But I was curious as to the overall societal support for the combinations of positions.

So I looked at the GSS, using the CAPPUN and ABANY variables (capital punishment, and abortion for any reason). In this post I will show you screenshots of the GSS output. It’s ugly, but it shows you deviation away from the expected proportions. Basically, if two variables are independent you can predict what you’d expect to be the crossed percentages over the four cells. If the results deviate from that you can ascertain particular associations. In the GSS output red means that the cell has a higher value than it should, and blue a lower value. Additionally, the intensity signals the magnitude of the deviation. I limited all results to the year 2000 and later.
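To make the “expected proportions” idea concrete: if the two variables were independent, each cell would simply be the product of its two marginal percentages. The sketch below uses made-up marginals (70% and 40%), not GSS results, and a hypothetical helper name, expected_cells.

```shell
# Sketch only: expected cell percentages for a 2x2 table under independence.
# Arguments: percent "yes" on variable A, percent "yes" on variable B.
# Prints: A-yes/B-yes, A-yes/B-no, A-no/B-yes, A-no/B-no.
expected_cells() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        printf "%.1f %.1f %.1f %.1f\n",
            a * b / 100, a * (100 - b) / 100,
            (100 - a) * b / 100, (100 - a) * (100 - b) / 100
    }'
}

# Illustrative marginals only, not survey numbers.
expected_cells 70 40
# prints: 28.0 42.0 12.0 18.0
```

An observed cell well above its expected value would show up red in the GSS output, and one well below it blue.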

First, the general aggregate ...

August 21, 2012

Who rejects right to abortion in cases of rape?

It’s basically impossible to avoid hearing about Todd Akin right now. My Twitter and Facebook feeds are kind of swamped. But it did make me wonder: what percentage of Americans reject abortion in cases of rape and incest? The GSS has a handy variable, ABRAPE, which asks respondents about the possibility of abortion if a woman gets pregnant as a result of rape (let’s stipulate that it’s possible to get pregnant as a result of rape!). I also limited the sample to the year 2000 and later, and non-Hispanic whites (to clear out confounds). Demographic breakdowns below….

Before people start complaining, the scale below goes from 0% to 50%, NOT 0% to 100%!


August 20, 2012

More GSS, less speculation!

Long time readers know that one of my pet hobby-horses is to try and convince more pundits that they should use the GSS. Opinions based on opinions may be fun, but opinions based on facts may be useful. In general my appeals have fallen on deaf ears. But today I notice that Will Saletan is using GSS data to discuss the Todd Akin case. You may not agree with Saletan’s take on the results, but at least he bothered to generate some results.

August 19, 2012

More racist: white liberals or white conservatives?

Reihan Salam has a post up on the alignment of racism and political orientation. He begins:

Recently, Chris Hayes, host of MSNBC’s UP with Chris Hayes, made the following observation:

It is undeniably the case that racist Americans are almost entirely in one political coalition and not the other.

Chris is a good friend of mine, and we grew up in the same milieu. I can attest to the fact that the view he expressed is very widely held in the circles in which we both travel….

Salam then links to Alex Tabarrok, who uses party identification data to indicate that actually racism is split between the two groups, while John Sides suggests that there is a definite lean toward Republicans being more racist, using a few indicator variables. Overall I think Sides is about right, all things equal conservatives are more racist than liberals. At least in the modern context of the two ideologies.* I say conservative/liberal rather than Republican/Democrat, because my experience with the GSS data set is that ideology is a more powerful predictor of social views among whites. This holds true with the variables which Tabarrok and Sides query from what I can see; the gap between ...

August 15, 2012

Who shall inherit the earth?

There was a question below in regards to the high fertility of some extreme (“ultra”) religious groups, in particular Haredi Jews. The commenter correctly points out that these Jews utilize the Western welfare system to support large families. This is not limited to just Haredi Jews. The reason Somalis and Arabs have fertility ~3.5 in Helsinki, as opposed to ~1.5 as is the norm, is in part due to the combination of pro-natalist subcultural norms and a generous benefits state. Of course we mustn’t overemphasize economics. Israel’s decline in Arab Muslim fertility but rise in Jewish fertility in the 2000s has been hypothesized to be due to different responses to reductions in child subsidies by Muslims and the Haredi Jews. In short, the former reacted much more strongly to economic disincentives in relation to the latter.

A bigger question is whether exponential growth driven by ideology can continue indefinitely. I doubt it. Demographics is inevitable, but subject to a lot of qualifications. Haredi political power in Israel grants some benefits, but at the end of the day basic economics will serve as a check on the growth of the population of this sector. Similarly, barring ...

August 9, 2012

What is the distribution of offspring per individual?

A commenter below notes:

Also, in modern society, doesn’t just about everyone reproduce, such that not only is any particular advantage competing against other countervailing pressures as you note, but also that the “less fit” genomes are not removed from the overall population, but rather are added back to the mix? In other words, the less-preferred short males don’t die and have zero kids, they also get married and their genes get thrown back into the pot.

First, let’s not get caught in the assumption that for genes to be disfavored one has to have zero fitness in individuals carrying those genes. If, for example, in a situation of demographic expansion you had individuals who had eight children vs. those who had one child, there would be selection for the traits which were passed by those with eight children in relation to those who had one child. But, it did make me realize I wasn’t intuitively aware of the distribution of number of offspring in the population. I assumed that the median was around two, but that’s about it.

So, I looked at the GSS CHILDS variable for individuals born in 1950 or earlier from the year 2000 on (COHORT and ...

July 29, 2012

College makes you believe in marriage!

There’s a cliche, which isn’t totally false, that more education tends to lead one toward heterodox viewpoints which challenge conventional norms. But one issue that has been coming to the fore over the last 10 years or so is that college educated Americans tend toward social liberalism, and yet often continue to live very bourgeois lives. In other words, the freedoms which they favor are those freedoms which are ever operative in their own lives. In contrast those Americans without college educations tend to have a less libertarian attitude toward personal mores, but have lives characterized by greater disturbance and disastrous choices.

And yet this does not hold in the case of what articles such as this report, How Divorce Lost Its Groove:

Though she wasn’t entirely surprised. Ever since her divorce three years ago, Ms. Thomas said, she has been antisocial, “nervous about what people would say.”

After all, she had gone from Park Slope matron, complete with involved husband (“We had cracked the code of Gen X peer parenthood”) and gut-renovated brownstone, to “a Red Hook divorcée,” she said, remarried with a new baby and two children-of-divorce barely out of preschool. “All of a sudden, this community I’d lived in ...

July 16, 2012

People wanted more children in 2000s, but had fewer

The readers of this weblog are relatively non-fecund, at least going by reader surveys. But I was curious nonetheless about the attitudes toward number of children, and realized goals of number of children, in the General Social Survey. I decided to look at two variables:



The former asks the respondent how many children they had, the latter how many they’d like to have. I restricted the sample to whites ages 45-65 for every survey year. I then combined all the years of a particular decade, so you have 1970s, 1980s, 1990s, and 2000s. For demographics I looked at highest educational attainment, and household income indexed to 1986 real value dollars (so they are comparable across decades).

Two major takeaways:

1) Education matters more than income in terms of number of children. Having lots of education tends to reduce family size. No great surprise.

2) Ideal number of children increased in the 2000s, but the decline in average number of children continued.

There is often talk in the literature on the disjunction between ideal family size in Third World nations and the realized family size, with a larger number of children than women may want. What is less discussed is the inverse discussion. It seems that ...

July 13, 2012

Who objects to painful tests on animals?

In the year 2008 1,400 people were asked this question in the GSS:

Scientists should be allowed to do research that causes pain and injury to animals like dogs and chimpanzees if it produces new information about human health problems. (Do you strongly agree, agree, disagree, or strongly disagree?)

(variable ANSCITST)

I was curious because I ran into some stuff on pro-life sites today about how animal rights activists don’t oppose abortion, and how hypocritical that is. So naturally I wanted to see how attitudes varied on that issue.

Allow animal testing which might cause pain

Group                       Strongly agree  Agree  Disagree  Strongly Disagree
Yes to abortion on demand   8               31     39        22
No to abortion on demand    7               32     39        23

What the results above show is that there is almost no difference in attitudes toward animal research when you vary attitudes toward abortion on demand. In other words, 22 percent of pro-choice people oppose such research strongly, while 23 percent of pro-life people do. How does this vary by demographic?

Group                       Strongly agree  Agree  Disagree  Strongly Disagree
Male                        8               44     34        14
Female                      4               23     42        31

Liberal                     6               34     38        22
Moderate                    5               32     38        25
Conservative                7               35     38        20

White                       6               31     39        24
Black                       7               41     36        16

No College Degree           6               32     38        24
College Degree              7               36     38        19

Stupid (WORDSUM 0-4)        8               33     38        22
Average (WORDSUM 5-8)       6               34     36        25
Smart (WORDSUM 9-10)        13              29     41        16

Age 18-35                   4               34     40        22
Age 36-65                   7               32     37        24
Age 66 and up               8               33     39        20

Protestant                  6               35     39        20
Catholic                    7               32     38        24
Jewish                      12              29     36        23
No Religion                 7               31     36        26

Atheist & agnostic          9               36     33        21
Theist                      6               33     38        23

Bible Word of God           5               36     39        20
Bible Inspired Word         5               32     39        24
Bible Book of Fables        10              32     35        22

June 24, 2012

Higher vocabulary ~ higher income

Prompted by a comment below I was curious as to the correlation between intelligence and income. To indicate intelligence I used the GSS’s WORDSUM variable, which has a ~0.70 correlation with IQ. For income, I used REALINC, which is indexed to 1986 values (so it is inflation adjusted) and aggregates the household income. Finally, I limited my sample to non-Hispanic whites over the age of 30 (for what it’s worth, this choice also limited the data set to respondents from the year 2000 and later).

The results don’t get at the commenter’s assertions, because 10 out of 10 on WORDSUM does not imply that you’re that smart really. But the trendline is suggestive. Note that I aggregated 0-4 because the sample size at the lower values is small indeed.

June 23, 2012

Attitudes toward genetically modified crops & science

In the further interests of putting quantitative data out there instead of vague impressions, I noticed two GSS variables which might be of interest. One queries the impression of the effect of genetically modified crops on the environment. The second asks about whether science does more harm than good. The latter question exhibited almost no year to year variation of note, so I just threw the years in a pot together. But for the environment and genetically modified crop question I show responses for the years 2000 and 2010. As you can see there is a modest difference in regards to the first, where liberals are more skeptical.

