Monday, March 28, 2016

PCA/nMonte open thread


Below are a few nMonte models of ancient individuals based on 25 principal components (PCs). The relevant datasheet and nMonte R script can be downloaded here and here, respectively.


Many of the outcomes are basically perfect. Others could certainly be better. But they all make sense.

The more complex the ancestry, the more difficult it is to model. Also, deamination, low coverage and missing markers are probably skewing things to some degree for most of these samples. So although time consuming, it might be a good idea to use population averages minus the most obvious outliers.

Are there any other ways to improve the analysis? Is 25 dimensions too much or too little? Let's run plenty of tests and see where this takes us.

I can update the datasheet with many more populations and dimensions later this week. Feel free to post your requests in the comments and I'll run them if I have them. Also, if anyone's wondering, I don't know yet which commercial genotype files I can run in this test, if any. I'll check.

Update 04/04/2016: A modified datasheet with 50 dimensions and many more samples is available here. It should be more useful in modeling South Central Asians, especially the Kalash. However, as far as I can tell, using just 9 dimensions, like in the version here, is faster and produces more accurate results.

75 comments:

Davidski said...

The eigen scores are here.

https://drive.google.com/file/d/0B9o3EYTdM8lQXzhlOHdtTmVLZlU/view?usp=sharing

Nirjhar007 said...

Afanasievo, Andronovo, Sintashta, Poltavka, Yamnaya etc. all have Kotias as the highest ancestry.

Is there something wrong?

Davidski said...

They all have more European hunter-gatherer ancestry than Caucasus hunter-gatherer ancestry. So there's no problem.

Nirjhar007 said...

Have you done the K4 (ANE, WHG, CHG, EEF) on the Steppe samples? I am very interested to see that also.

If you can link, it will be a great help :)

Krefter said...

Can you post the @ D results? How good were the fits?

Davidski said...

Didn't save them. I'm sure you can check them yourself though.

Matt said...

Haven't tried any fits on this, couple early comments:

a) Comparing cumulative % variance between this PCA and the PCA on D-stats, respectively:

By PC1: 36% vs 70%
PC2: 60% vs 92%
PC3: 67% vs 96%
etc.
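The cumulative percentages above can be reproduced from a list of eigenvalues. This is a minimal sketch, assuming the eigenvalues cover every component that should count toward the total; the function name and the example values are made up for illustration and are not taken from either PCA.

```python
from itertools import accumulate

def cumulative_variance_pct(eigenvalues):
    """Cumulative share of total variance explained by PC1..PCk, in percent."""
    total = sum(eigenvalues)
    return [100.0 * running / total for running in accumulate(eigenvalues)]

# Made-up eigenvalues, largest first (not from the datasheet).
example = [5.0, 3.0, 2.0]
print(cumulative_variance_pct(example))
```

Note that the percentages depend on the denominator: truncating the eigenvalue list to the first 25 components inflates every figure compared with using the full spectrum.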

Comparing what each dimension shows, it looks like this is because the D-stats, possibly because they include the Mbuti term, really give a lot of % of the variance to the African vs non-African contrast (and dimensions) and emphasize the degree that Eurasians form a clade relative to Africans, while the PCA itself is not pushed in that direction by the form of the stats, and so the dimension weighting the African-Eurasian difference is lower.

No clue if that makes results more accurate with the "direct" PCA. "Direct" PCA based nMonte (as opposed to D-stats or D-stat based PCA) seems more easily interpretable at the moment though. So possibly best to go with results from that for now?

b) Taking the population averages on the PCA and transforming them themselves through PCA:
http://i.imgur.com/MNRKjmt.png

(This should sort of summarise the scores from the PCA run itself in a lower dimension way. I think.)

Surprisingly (I guess because of the predominantly West Eurasian populations in the panel), the "PCA on PCA" recapitulates a kind of West Eurasian PCA, itself.

The interesting part of that is the loadings on this, which seem like they should give us a clue about what dimensions are contributing to different nMonte fits for West Eurasian populations. These are mainly on dimensions 9, 7, 19, 24, 25 and 2. 9 points at the Middle East and 7 and 19 point at Kalash. 2 is important in splitting out South Asians, not surprisingly as it is the ENA dimension. Quite low dimensions don't seem to be so important, because I guess they distinguish very much between continental populations, not so much within West Eurasia? So it suggests, for the question on number of dimensions, that anything cut down to a lower part of the dimensional range than this might not work so well (1-8 / 1-20)?

The only really surprising thing that jumps out is that the Kalash is really distinguished here. Not sure why that is. Very extreme case of recent drift or divergent ancient ancestry? That population wasn't, IIRC, as distinctive under the D-stats.

The other thing that strikes me is that the EHG and Caucasus HG aren't as divergent on these PCA, relative to the other West Eurasians and each other, as they seem to be on the D-stat based PCA. Motala also seems less divergent than in PCA on D-stats. That probably explains much of why Anatolia_Neolithic seems like it isn't present in the steppe in nMonte on these PCA data, while it is in nMonte on the D-stats (or nMonte on PCA of the D-stats). In PCA based on the D-stats, Karelia and Caucasus_HG are distant from the other West Eurasians in a way that makes populations which here would fit as Karelia+CHG need Anatolia_Neolithic. Seems like these go the opposite way, with Anatolia_Neolithic relatively distant, where it's relatively close in the D-stats.

Davidski said...

The distinctiveness of the Kalash here is probably due to recent drift. I reckon I can get rid of that when I update the datasheet with more pops.

Alberto said...

Thank you, David, great to have this stuff now that things are slow with ancient DNA.

I just did the first test, so can't comment about it in general. But since the result was surprising and then I tested with an alternative version of nMonte that calculates the distance as the sum of the absolute values of the residuals (instead of the sum of the squared values of the residuals) and the result was significantly different, I thought I'd post it here for others who might want to try that too.
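For readers following the comparison, here is a minimal sketch of the two distance measures being discussed. The function names and the coordinates are hypothetical and are not taken from either script. Whether the scripts report the square root or the raw sum of squares is not stated here; both are minimized by the same mixture.

```python
import math

def residuals(target, mix):
    """Per-dimension difference between the target and the modelled mixture."""
    return [t - m for t, m in zip(target, mix)]

def distance_euc(target, mix):
    """Euclidean distance: square root of the sum of squared residuals."""
    return math.sqrt(sum(r * r for r in residuals(target, mix)))

def distance_abs(target, mix):
    """Absolute (Manhattan) distance: sum of the absolute residuals."""
    return sum(abs(r) for r in residuals(target, mix))

# Made-up coordinates on three PCs, for illustration only.
target = [0.010, -0.020, 0.005]
mix = [0.007, -0.016, 0.005]
print(distance_euc(target, mix), distance_abs(target, mix))
```

The absolute distance is never smaller than the Euclidean one for the same residuals, which is consistent with the paired figures quoted in the results below.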

I include the 2 scripts: euc_nMonte is the exact same last version by huijbregts (using Euclidean distance in the algorithm), but it prints the absolute distance in addition to the Euclidean distance. abs_nMonte uses the absolute distance in the algorithm, and also prints both at the end. The usage is exactly the same as huijbregts version:

https://drive.google.com/file/d/0B2ZfdVZaNXDxLTJEdTlDVEM0alk/view?usp=sharing

@huijbregts, I hope you don't mind that I share those changes. They're really trivial changes to your script and just for testing purposes.

The test I did was with Lithuanian as the target pop, and a standard set of 7 pops that I've been using for Europe. With the standard version that minimizes the Euclidean distance I got:

Lithuanian:lithuania3
"Motala_HG:I0012" 28.7
"Kotias:KK1" 26.7
"Loschbour:Loschbour" 20.75
"Anatolia_Neolithic:I1579" 19.2
"Samara_HG:I0124" 4.65
"Nganasan:ADR00504" 0
"Paniya:PNYD9" 0
"Yoruba:HGDP00928" 0
distance_abs=0.025956 / distance_euc=0.006227

A bit surprising, will have to check further combinations. But running the modified version for minimizing absolute distance I got something a bit different:

Lithuanian:lithuania3
"Loschbour:Loschbour" 26.15
"Kotias:KK1" 24.4
"Anatolia_Neolithic:I1579" 19.6
"Motala_HG:I0012" 15.05
"Samara_HG:I0124" 14.8
"Nganasan:ADR00504" 0
"Paniya:PNYD9" 0
"Yoruba:HGDP00928" 0
distance_abs=0.024363 / distance_euc=0.006446

Seems to make some more sense, that's why I post the modified version for others to test it.

huijbregts said...

I do strongly disagree with Alberto's plagiarism.
In the previous thread I explained why this is a bad idea, which is not founded on a solid understanding of statistical principles.
Take advice of some statistician before you confuse other users.

Alberto said...

Ah, a warning. The version that calculates the lowest absolute distance sometimes has a hard time really getting to the lowest possible distance. I really have no idea why. But I'm testing both versions side by side, and while the one using Euclidean distances is consistent in finding the lowest Euclidean distance, it sometimes happens that its absolute distance is also lower than the one found by the alternative version (which shouldn't happen, since the alternative version is looking for the lowest absolute distance). Sometimes running it a few times it gets there, but other times it gets stuck with a worse value.

@huijbregts, any idea of what might be causing this? I tried to check for any reason myself, but I'm not acquainted enough with R scripting to understand what's going on, so I didn't find anything obvious.

Alberto said...

@huijbregts

Oops. It seems you do mind that I share those changes. Sorry about that. I'm deleting the files now, but not sure if it's working.

Really, I didn't want to plagiarize your work, just test an alternative method (that might be garbage, I'm not a mathematician). Apologies.

Alberto said...

It seems I managed to stop sharing those files. If someone got them already, please don't distribute them unless huijbregts gives his permission. My apologies.

Matt said...

@ Davidski, yeah that could make sense with the Kalash somehow dominating a dimension through drift specific to them I guess?

@ All: Had a test with the PCA data. I didn't really want to pick at random from the individual rows, so used a set of population averages from the datasheet. When I ran these through, running a test with a target file for English_Cornwall (population average) and calc file with a range of population averages*:

English_Cornwall - Anatolia_Neolithic 31.6, Hungary_HG 25.65, Kotias2 16.6, Karelia_HG 9.9, Armenia_BA 8.6, Motala_HG 7.2, Bichon 0.45 - distance% = 0.4582 %

Seems quite enriched in HG at 43.2%, not much EHG of that. Quite different results from D-stat (which fit at 34.3% HG, mostly Karelia, at 21.25%).

Further tests for some other populations with same calcs:

Basque_French - Anatolia_Neolithic 49.85, Bichon 27.3, Loschbour 8.4, Kotias2 8.2, Armenia_BA 6.05, Ulchi 0.2 - distance% = 0.6924 %

Rich in WHG, as expected (35.7 vs 25.65 for English Cornwall?), seems like zero EHG or SHG (although maybe 0.6-0.8% EHG would come through Armenia_BA).

Belarusian - Kotias2 26.85, Motala_HG 23.65, Anatolia_Neolithic 21.65, Hungary_HG 20.9, Loschbour 4.5, Karelia_HG 1.8, Han 0.65 - distance% = 0.5899 %

HG richness 46.3%, strong switchover to Motala and Kotias from what English Cornwall gets.

Sardinian - Anatolia_Neolithic 80.55, Bichon 17.25, Kotias2 1.45, Ulchi 0.6, Yoruba 0.15 - distance% = 0.5747 %

Seem pretty much Middle Neolithic, where the D-stats seem to position them more as about like 50:50 Tuscan:Early_Neolithic.

Russian_Kargopol - Hungary_HG 24.45, Kotias2 20.15, Motala_HG 20.15, Anatolia_Neolithic 11.75, Armenia_BA 9.45, Karelia_HG 6.6, Ulchi 4.1, Nganasan 2, England_Roman_outlier 1, Itelmen 0.35 - distance% = 0.484 %

Ukrainian_East - Hungary_HG 29.9, Kotias2 25.25, Anatolia_Neolithic 17, Armenia_BA 9.5, Loschbour 7.8, Karelia_HG 7.35, Motala_HG 3.05, Japanese 0.15 - distance% = 0.5516 %

It seems like there's more of a cline in the type of HG ancestry with these PCA, with Motala and to a lesser extent Hungary_HG picking up what would have been found as composites of Karelia and WHG in the D-stats (and so were more consistent with a model of a single Yamnaya population mixing with a more or less single Middle Neolithic population?).

What are you folks getting when you do results based on individuals in the calc and target files? Esp. in case I've done something when converting to population averages.

I might try the calc with Sweden_NHG and some other populations to see what happens.

* Calc file: Ami, Anatolia_Neolithic, Armenia_BA, Atayal, Australian, BedouinB, Bichon, Chukchi, Dai, Druze, England_Roman_outlier, Eskimo_Naukan, Han, Hungary_HG, Iberia_Mesolithic, Itelmen, Japanese, Karelia_HG, Kostenki14_UP, Kotias, Kotias1, Kotias2, Loschbour, MA1, Masai_Kinyawa, Motala_HG, Nganasan, Samara_HG, Satsurblia, Ulchi, Ust_Ishim, Yoruba.

Population averages were generated in an automated way, so samples like Kotias, Kotias1 and Kotias2 with different pop labels weren't averaged together.
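A minimal sketch of this kind of automated averaging, assuming rows are labelled `Population:Sample` as in the individual results quoted in this thread. The function name and the coordinates are made up. Because grouping is by the exact population label, `Kotias` and `Kotias2` stay separate.

```python
from collections import defaultdict

def population_averages(rows):
    """rows: iterable of (label, coords), with label like 'Pop:Sample'."""
    groups = defaultdict(list)
    for label, coords in rows:
        pop = label.split(":", 1)[0]  # text before the first colon
        groups[pop].append(coords)
    # Average each dimension (column) across the members of a population.
    return {
        pop: [sum(col) / len(members) for col in zip(*members)]
        for pop, members in groups.items()
    }

# Made-up two-dimensional coordinates, for illustration only.
rows = [
    ("Motala_HG:sample1", [0.5, 1.0]),
    ("Motala_HG:sample2", [1.5, 3.0]),
    ("Kotias:sample1", [2.0, 2.0]),
    ("Kotias2:sample1", [4.0, 4.0]),
]
print(population_averages(rows))
```

Merging differently labelled samples would need an explicit mapping from labels to a shared population name before averaging.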

huijbregts said...

@ Alberto
OK. Sorry for the sharp tone. I lost access to my computer.

Davidski said...

Matt,

Bichon is an old genome and low coverage. It's likely to skew results because in PCA it looks like it has some Basal Eurasian.

I think Loschbour is probably the best and most proximate reference we have for unadmixed Western hunter-gatherer ancestry in modern Europeans.

Matt said...

OK, thanks, to test, rerunning without Bichon for the two populations that scored with some Bichon:

Sardinian - Anatolia_Neolithic 81.15, Loschbour 15.2, Iberia_Mesolithic 1.3, Kotias2 0.9, Ulchi 0.8, Yoruba 0.65 - distance% = 0.6542 %

Basque_French - Anatolia_Neolithic 51.2, Loschbour 32.55, Kotias2 8, Armenia_BA 4.9, Hungary_HG 2, Ulchi 0.65, Yoruba 0.65, Satsurblia 0.05 - distance% = 0.7672 %

WHG loses about 1% for both, while Anatolia_Neolithic gains about 1%.

Or just Loschbour:

Sardinian - Anatolia_Neolithic 81.25, Loschbour 16.25, Kotias2 0.95, Ulchi 0.85, Yoruba 0.7 - distance% = 0.6544 %

Basque_French - Anatolia_Neolithic 50.95, Loschbour 34.3, Kotias2 7.45, Armenia_BA 5.8, Ulchi 0.65 - distance% = 0.7673 %

Without Hungary_HG, English Cornwall - Anatolia_Neolithic 31.1, Loschbour 22.5, Kotias2 14.8, Armenia_BA 12.45, Karelia_HG 12.4, Motala_HG 6.5, England_Roman_outlier 0.25 - distance% = 0.7673 %

(total HG 41.5% slight increase to the Karelia there, as Hungary_HG is slightly intermediate the Loschbour type WHG and Karelia/Samara EHG? If all Karelia in English Cornwall comes via Yamnaya then suggests 1/5 Yamnaya ancestry, as approx 60% EHG in Yamnaya, 12.5% in English Cornwall?).
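The back-of-envelope step in that parenthesis can be written out as follows. It is only a sketch under the stated premise that all of the Karelia-like ancestry arrived via Yamnaya, with roughly 60% EHG in Yamnaya as approximated in the comment; the function name is hypothetical.

```python
def implied_source_share(component_in_target, component_in_source):
    """Share of the source population implied if it is the only carrier of the component."""
    return component_in_target / component_in_source

# 12.4% Karelia_HG in the English_Cornwall run above, about 60% EHG assumed in Yamnaya.
share = implied_source_share(12.4, 60.0)
print(f"{100 * share:.1f}% implied Yamnaya ancestry")
```

The result is close to one fifth, and it scales directly with the assumed EHG fraction in Yamnaya, so a different assumption there changes the estimate in proportion.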

Davidski said...

I wouldn't run Armenia_BA in this type of model either. It's a more recent population than Yamnaya, and very likely with Yamnaya admixture.

Davidski said...

Here's a datasheet with 50 dimensions.

https://drive.google.com/file/d/0B9o3EYTdM8lQczB3ei0zNHZTdEk/view?usp=sharing

Seinundzeit said...

David,

A very interesting method.

I've been trying to model the Paniya, and their best fit has them at 100% Ust-Ishim. It's a very poor model (distance = 6.0187). Regardless, it's of interest that they turn out 100% Ust-Ishim despite the inclusion of Atayal, Ami, Han, and Australian samples. This is very similar to what we've seen with the d-stats (Dravidian_India had only minor ENA, even with the Onge included. Rather, they turned out as mostly Ust-Ishim + West Eurasian + ENA).

When I exclude Ust-Ishim, but include East Asians + Australians, the Paniya turn out to be 90% Kostenki14, 5% Australian, and 5% Siberian (distance = 8.7397). And when I exclude the proto-West Eurasian sample, this is what they get:

38.40% MA1
21.50% Ulchi
15.90% Australian
15.80% BedouinB
8.35% Ami
0.05% Yoruba

(distance = 9.4335)

A confusing mix of ANE, Siberian, Australasian, Near Eastern, and East Asian.

I'm pretty sure that this, coupled with the d-stat evidence, shows that ASI wasn't ENA. We won't know what ASI was until we have some South Asian aDNA. But to repeat my own speculative suggestion, I think Mesolithic South Asian hunter gatherers belonged to a stream of populations distinct from both West Eurasia (K14/ANE/EHG/WHG) and ENA (East Asians/Onge/Australasians), but probably closer to West Eurasia. Perhaps, we could be dealing with a lineage that was more closely related to the branch that eventually led to West Eurasians, when compared to ENA.

But that's just speculation on my part. I've really lost confidence in any claims regarding South Asian genetic history.

Davidski said...

Interestingly, using 25 dimensions might be too much.

I just ran some tests using 15 of the 50 dimensions in that last sheet I just uploaded and the results probably look better.

huijbregts said...

@ Davidski
If the dimensions after 15 just add noise, you'd better cut them.
I wonder whether this effect continues when you have added a lot more rows.
So goodbye to Germanic vs. Celtic in dimension 49.

huijbregts said...

@ Davidski
It could also be an artefact of nMonte.
Maybe the algorithm forces nMonte to find some structure, even if you feed it pure noise.
Can somebody else test this, as I am locked out of my computer.

Davidski said...

I think the best solution might be to use the most informative dimensions from as many as 50.

But the problem is that different dimensions will be informative for different sets of test and reference samples.
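One crude way to make "most informative for this set of samples" concrete is to rank dimensions by how much the selected test and reference samples vary on them. This is only a sketch of that proxy, not a method used in the post; `top_dimensions` and the coordinates are hypothetical, and variance alone ignores how much each dimension is worth in the original PCA.

```python
from statistics import pvariance

def top_dimensions(rows, k):
    """Indices (0-based) of the k dimensions with the largest variance across rows."""
    variances = [pvariance(col) for col in zip(*rows)]
    order = sorted(range(len(variances)), key=lambda d: variances[d], reverse=True)
    return sorted(order[:k])

# Made-up coordinates for three selected samples on three dimensions.
selected = [
    [0.0, 1.0, 0.1],
    [0.0, -1.0, 0.2],
    [0.0, 0.5, 0.3],
]
print(top_dimensions(selected, 2))
```

Because the ranking is recomputed per sample set, two tests with different references can end up using different dimensions, which is exactly the difficulty raised above.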

huijbregts said...

@ Davidski
If I write a program that detects three points on a straight line and I feed this program a large dataset of random numbers, it will surely detect some triples. If it does not, this proves that the data were not random.
nMonte starts with a random combination that adds to 100%. It can only drop it if it finds a random combination which better fits the target.
If it is granted enough iterations, it will surely find one. You know the saying "if you have a hammer, everything looks like a nail".
So yes, I am convinced that nMonte will find false positives. The only check you can do is repeat the run. If you find the same structure, it is a signal. If you find a different structure, it is noise. Too many iterations will increase the chance of false positives.
My conclusion: it is certainly possible that nMonte works better with 15 dimensions than with 50. But any surplus fits you find with 50 dimensions should not be reproducible when you repeat the run.
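The general idea described here (start from a random combination summing to 100%, keep a random change only when it fits the target better, then repeat the whole run to see whether the structure is reproducible) can be sketched as below. This is not the nMonte script; `fit_mixture`, `batch` and `iters` are hypothetical names, and the toy references are made up.

```python
import math
import random

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mix_coords(refs, counts, batch):
    """Coordinates of a mixture given how many of `batch` slots each reference holds."""
    dims = len(refs[0])
    return [sum(c * r[d] for c, r in zip(counts, refs)) / batch for d in range(dims)]

def fit_mixture(target, refs, batch=200, iters=5000, seed=None):
    """Random search: accept a random one-slot change only if the distance drops."""
    rng = random.Random(seed)
    n = len(refs)
    # Random starting combination; shares add up to 100% by construction.
    slots = [rng.randrange(n) for _ in range(batch)]
    counts = [slots.count(i) for i in range(n)]
    best = euclid(target, mix_coords(refs, counts, batch))
    for _ in range(iters):
        pos, new = rng.randrange(batch), rng.randrange(n)
        old = slots[pos]
        if new == old:
            continue
        counts[old] -= 1
        counts[new] += 1
        dist = euclid(target, mix_coords(refs, counts, batch))
        if dist < best:
            best, slots[pos] = dist, new  # keep the better combination
        else:
            counts[old] += 1              # undo the change
            counts[new] -= 1
    return [100.0 * c / batch for c in counts], best

# Toy check: the target is an exact 50/30/20 mix of three made-up references.
refs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
target = [0.3, 0.2]
for seed in (1, 2):  # two independent runs, to compare the structure they find
    print(fit_mixture(target, refs, seed=seed))
```

Comparing the percentages from several seeds is the reproducibility check suggested above: shares that move around between runs are candidates for noise.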

Matt said...

Davidski: I wouldn't run Armenia_BA in this type of model either. It's a more recent population than Yamnaya, and very likely with Yamnaya admixture.

OK, included it as it seemed pretty minimally admixed with EHG in D-stats and these runs, and in case it included some ancient Near East ancestry not quite captured by CHG+AN but which might be informative in these dimensions.

For the populations that scored in Armenia_BA, without Armenia_BA, or any WHG except Loschbour:

English_Cornwall - Anatolia_Neolithic 33.75, Loschbour 22.65, Kotias2 21.1, Karelia_HG 12.9, Motala_HG 7.35, England_Roman_outlier 1.8, Satsurblia 0.45 - distance% = 0.5255 % (vs distance% = 0.4582 % originally)

Basque_French - Anatolia_Neolithic 52.55, Loschbour 34.8, Kotias2 9.75, Satsurblia 1.5, Ulchi 0.75, Yoruba 0.65 - distance% = 0.7687 %

Ukrainian_East - Loschbour 34.65, Kotias2 30.8, Anatolia_Neolithic 20.25, Karelia_HG 12, England_Roman_outlier 0.8, Motala_HG 0.8, Japanese 0.7 - distance% = 0.6207 %

Russian_Kargopol - Motala_HG 33.1, Kotias2 23.85, Loschbour 13.6, Anatolia_Neolithic 12.75, Ulchi 6.4, Karelia_HG 5.45, England_Roman_outlier 4.35, Nganasan 0.5 - distance% = 0.5291 %

Some pick up in mainly Kotias, Anatolia_Neolithic and also at the margins England_Roman_outlier and Satsurblia to take up additional %.

huijbregts: an artefact of nMonte

Could cross testing against 4mix, for populations which look well fitted for four sources, be worthwhile as a comparison? (Errors won't be systematic between the two different algorithms?).

Davidski said...

Yeah, essentially what I'm saying is that we need to test whether using 50 of the overall most significant dimensions, like in the sheet below, is better as input than, say, 10 of the overall most significant dimensions, or alternatively, a handful of the most relevant dimensions based on the selected test and reference samples.

https://drive.google.com/file/d/0B9o3EYTdM8lQczB3ei0zNHZTdEk/view?usp=sharing

The eigen value scores for the 50 dimensions are here...

https://drive.google.com/file/d/0B9o3EYTdM8lQamxkbW81ZENkeGc/view?usp=sharing

Maybe there's a way to compute eigen values for the selected test and reference samples in each test, and then use the top 10 only to model the test sample?

I just got a new Windows laptop because my old one crashed, and this one is slower, so it's a bit of a pain in the ass getting used to it, but I'll try and run plenty of tests this weekend.

huijbregts said...

@ Matt
There are more similarities between 4mix and nMonte than you might expect. Both depend on minimizing the Euclidean distance and they will both find false positives. So I expect that they do share false positives. A better idea is to compare two runs of nMonte. I expect that two successive runs of nMonte show different false positives.

Alberto said...

@Seinundzeit

That's interesting. I had never been able to test the relationship of Kostenki14 with this theoretical ASI component. But your findings (that I reproduced) show that indeed Paniya are better modeled using Kostenki14 than MA1, so we're not talking anything specifically related to ANE here, but maybe something that branched off earlier (?).

But yes, without ancient DNA from anywhere near, we're all quite a bit lost with this.

Still something to point out is that probably all the ENA affinity in Paniya can't be explained just by Kostenki14 (by Ust-Ishim yes, but that's a bit different case). So this model:

Paniya:PNYD9
"Kostenki14_UP:Kostenki14" 100
"Loschbour:Loschbour" 0
"MA1:MA1" 0
"Ami:NA13608" 0

Could be an artifact of the method used to calculate the lowest distance. When using the absolute values instead of Euclidean distance:

Paniya:PNYD9
"Kostenki14_UP:Kostenki14" 70.15
"Ami:NA13608" 29.85
"Loschbour:Loschbour" 0
"MA1:MA1" 0

Which probably makes more sense (though it's hard to say in any definitive way).

huijbregts said...

@ Davidski

I can use my computer again, more or less.
First I tried to replicate your observation that 15 dimensions may be good enough.
The data were from your PCA50. I have averaged over the pops to simplify things.
The target population was Yamnaya Samara with the selection you used in the figure of the thread.
The results were:
15 columns: distance% = 0.4515
30 columns: distance% = 0.5654
50 columns: distance% = 1.1418
So the results for 15 and 30 columns were comparable, but the distance for 50 columns was dramatically worse.
This was caused by several large differences in the dimensions above 30.
The largest difference was a whopping -0.005448 in dim 38.
It seems that the model is deficient in the higher dimensions; it would be useful to zoom in.
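Finding the dimension with the largest difference is a simple scan over the per-dimension residuals. A minimal sketch, assuming the residual is taken as target minus fitted mixture (the sign convention of the script is not stated here); the function name and coordinates are made up.

```python
def largest_residual(target, fitted):
    """Return (1-based dimension, residual) for the biggest absolute miss."""
    residuals = [t - f for t, f in zip(target, fitted)]
    dim = max(range(len(residuals)), key=lambda d: abs(residuals[d]))
    return dim + 1, residuals[dim]

# Made-up coordinates on four dimensions, for illustration only.
target = [0.012, -0.004, 0.001, -0.006]
fitted = [0.011, -0.004, 0.002, -0.001]
print(largest_residual(target, fitted))
```

Sorting all residuals by absolute size instead of taking only the maximum gives a quick list of the dimensions where the chosen references fit worst.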

huijbregts said...

@ Davidski

After inspection of dimension 38 I have added the Itelmen to the model, but that did not really help:
Yamnaya_Samara
"Kotias" 40.6
"Karelia_HG" 26.5
"Motala_HG" 18.55
"Samara_HG" 13.7
"Itelmen" 0.65
"Anatolia_Neolithic" 0
"Loschbour" 0
"Nganasan" 0
distance%=1.1357 / distance=0.011357
It only slightly decreased the difference on dim 38 to -0.005174

huijbregts said...

@ Davidski

When you calculated the PCA, did you normalize? I think you shouldn't.

Matt said...

@ Davidski, I can understand the thinking behind extracting particular dimensions. It doesn't quite feel right though, and I'd imagine there is always the question of how much relatedness is hiding in the sum of small fractions that don't seem by themselves to contribute much to differentiation between the populations of interest.

With the 50 dimensions, repeating the exercise of transforming the population averages themselves through Past3's PCA function:

"PCA on PCA" over all populations: http://i.imgur.com/MK8a52r.png

"PCA on PCA" over a subset of samples which are fairly West Eurasian in dimension 1 and 2:
http://i.imgur.com/nLPbXAW.png

Dimension 38 here is important in the loadings, and so does seem like the most significant individual dimension after 25 in the loadings here, in agreement with huijbregts' experiment, particularly for the intra West Eurasian contrasts.

Looking at what populations Dimension 38 distinguishes, the lowest scorers are in order: Eskimo_Naukan, Iberia_EN, Stuttgart, Sardinian, Anatolia_Neolithic, while the highest are Itelmen, Mozabite, Karelia_HG, Samara_Eneolithic, Samara_HG, BedouinB, Motala_HG, MA1.

So Dimension 38 is distinguishing EEF (and Eskimo, and also Nganasan is close) from Arabic/North African and North Asian HG populations.

By the way, with the "PCA on PCA" it seems like the variance on the eigenvectors between populations can themselves be about 88% summarised within 10 dimensions, then 99% summarised within 20 dimensions.

I tested using the first 20 dimensions of the "PCA on PCA" scores to feed in to nMonte themselves - didn't seem to work too well, as it produced English_Cornwall as 42% Motala_HG, 37.1% Anatolia_Neolithic, 18.8% Kotias2 and 1.35% Loschbour. That's similar to what I get when I use the full 50 dimensions in nMonte - English_Cornwall: Motala_HG 40.55, Anatolia_Neolithic 37.05, Kotias2 11.1, Satsurblia 8.2, Loschbour 2.45, Japanese 0.45, Itelmen 0.2. So PCA on PCA compresses the data in the PCA run, but doesn't find much more / less accuracy (looks clearly wrong).

Alberto said...

The distance is a sum, not an average, so it should be expected to increase with more dimensions. And to increase more with the dimensions that have more variance than with the ones that are less relevant. I do agree though that removing the dimensions with less variance might have a negative effect when they add up. But no idea which would be the best strategy regarding the number or how to choose them.
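To illustrate the point: for a fixed mixture, the Euclidean distance over the first k dimensions can only grow or stay the same as k increases, so raw distances from runs with different numbers of dimensions are not directly comparable. A sketch with made-up coordinates; dividing by the square root of k (a root-mean-square residual) is one possible normalization, offered here as an assumption rather than something any of the scripts do.

```python
import math

def euclid_first_k(target, fitted, k):
    """Euclidean distance using only the first k dimensions."""
    return math.sqrt(sum((t - f) ** 2 for t, f in zip(target[:k], fitted[:k])))

def rms_first_k(target, fitted, k):
    """Root-mean-square residual per dimension over the first k dimensions."""
    return euclid_first_k(target, fitted, k) / math.sqrt(k)

# Made-up coordinates on five dimensions, for illustration only.
target = [0.020, -0.010, 0.004, 0.003, -0.002]
fitted = [0.018, -0.009, 0.004, 0.001, -0.003]
for k in (2, 3, 5):
    print(k, euclid_first_k(target, fitted, k), rms_first_k(target, fitted, k))
```

The same monotonic behaviour carries over to the best achievable fit, since any mixture is at least as far from the target in k+1 dimensions as it is in the first k.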

Some of the things I see seem to make sense in that they agree with things we've seen with Admixture, like stronger CHG presence than with D-stats (especially in S-C Asia, where it overwhelmingly dominates over Anatolia_Neolithic, as it does in Yamnaya), the high EHG in Yamnaya but its decrease in modern Europeans, etc. With Motala things are a bit new, since we only really tested with the latest D-stats and there it seemed to be minor, while here it looks important (we have no reference from Admixture). So overall, things that we do know have to be in a certain place seem to be right, but for others it is difficult to say.

I'm running now 50 dimensions without making averages (BTW, if someone who's made the averages can share it, that would be appreciated; if Davidski doesn't mind, of course), so things might be a bit different than with averages. But, for example, taking English_Cornwall, it doesn't look as bad as what Matt got:

English_Cornwall:HG00231
"Anatolia_Neolithic:I1579" 33.95
"Motala_HG:I0012" 28.4
"Kotias:KK1" 26
"Loschbour:Loschbour" 10.05
"Karelia_HG:I0061" 1.6
"Nganasan:ADR00504" 0
"Samara_HG:I0124" 0
"Yoruba:HGDP00928" 0
distance_abs=0.065336 / distance_euc=0.011468

Still, that looks like too much Motala and too little EHG, not sure what D-stats would say about that. With absolute distance instead of Euclidean (seems to deal better with imperfections in the data, at least in the bad cases. Others are about the same):

English_Cornwall:HG00231
"Anatolia_Neolithic:I1579" 34.6
"Kotias:KK1" 24.8
"Loschbour:Loschbour" 18.9
"Motala_HG:I0012" 11.5
"Karelia_HG:I0061" 7.95
"Samara_HG:I0124" 2.25
"Nganasan:ADR00504" 0
"Yoruba:HGDP00928" 0
distance_abs=0.063772 / distance_euc=0.011756

I admit that seeing different results from different methods makes it harder to judge what makes sense and what doesn't. But we'll learn things from all these tests.

huijbregts said...

@ Alberto
I placed the file PCA50_avg.csv in my dropbox
https://www.dropbox.com/sh/1iaggxyc2alafow/AACIjLtnkuaNNsJ5oKME_3XHa?dl=0

As to your methodological experiment, could you please consult a mathematician or science teacher?
Basically you are ignoring the theorem of Pythagoras.

Alberto said...

An example, for Pathan:

Pathan:HGDP00224
"Kotias:KK1" 44.7
"Paniya:PNYD3" 30
"Sintashta:RISE394" 16.45
"MA1:MA1" 7.35
"BedouinB:HGDP00610" 1.5
"Anatolia_Neolithic:I1579" 0
"Dai:HGDP01309" 0
"Yoruba:HGDP00928" 0
distance_abs=0.095015 / distance_euc=0.018296

Which is kind of reminiscent of what we've seen in Admixture for a long time, though it's difficult to say if MA1 would be part of the North European, the "Teal" or even the South Asian. (And BTW, very similar result with absolute distance in this one, which is probably a good sign).

Alberto said...

I put an example in the previous thread of how that works and why in our case it might be preferable to use absolute distance. That was a theoretical example. Then I've given you empirical ones.

Now, I'm not saying using absolute distance is definitely better. But it might be a better alternative at least in some cases. Worth exploring, in any case. That's exactly what I'm doing.

If you don't want to waste your time giving a theoretical explanation of why you think that even in my example that's not preferable, or to give some empirical example to make your point, that's ok. Maybe you can read for alternatives instead (it's probably more productive and you may find something interesting), for example here:

http://numerics.mathdotnet.com/Distance.html

Sum of Absolute Difference (SAD)

The sum of absolute difference is equivalent to the L1-norm of the difference, also known as Manhattan- or Taxicab-norm. The abs function makes this metric a bit complicated to deal with analytically, but it is more robust than SSD.

Sum of Squared Difference (SSD)

The sum of squared difference is equivalent to the squared L2-norm, also known as Euclidean norm. It is therefore also known as Squared Euclidean distance. This is the fundamental metric in least squares problems and linear algebra. The absence of the abs function makes this metric convenient to deal with analytically, but the squares cause it to be very sensitive to large outliers.

Davidski said...

I think that PCA was normalized. I'll look into doing one that isn't.

Alberto said...

Ok, now with population averages that should be more robust:

English_Cornwall
"Motala_HG" 42.4
"Anatolia_Neolithic" 37.65
"Kotias" 19.05
"Loschbour" 0.9
"Dai" 0
"Karelia_HG" 0
"Samara_HG" 0
"Yoruba" 0
distance_abs=0.052277 / distance_euc=0.009375

That definitely looks like way too much Motala. Though using absolute distance things are different:

English_Cornwall
"Anatolia_Neolithic" 36
"Kotias" 20.6
"Motala_HG" 17.9
"Loschbour" 16.4
"Karelia_HG" 7.8
"Samara_HG" 1.3
"Dai" 0
"Yoruba" 0
distance_abs=0.051712 / distance_euc=0.010086

Still that's a lot of Motala and little EHG, but harder to say if that's totally unreasonable (though it's obviously something we had not seen before).

huijbregts said...

@ Alberto

In a multidimensional space a distance is calculated with the Euclidean formula (=Pythagoras). In a one-dimensional space this formula simplifies to the simple one-dimensional distance.
In the taxicab business distance is measured in driven kilometers, which is a one-dimensional measure, so it is logical to use the L1-norm as you call it.
In a multidimensional space the distance is always measured according to the Euclidean formula or the L2-norm as you call it. This is especially true in multivariate statistics, as you can see in any text about the subject.

If you stubbornly insist on using the sum of absolute distances instead of the Euclidean distance,
the effect will be that small distances get more weight than in the Euclidean formula.
This results in a longer list of possible admixtures with a lower reliability. In other words, you have lowered the signal-to-noise ratio. Is that what you want?

Matt said...

@ Davidski, just having a look through the individual dimensions and plotting them in Past3 to see what happens....

Not sure about the inclusion of England Roman Outlier, it seems very distinguished as an outlier in dimensions 10, 14, 15 and 16 -

http://i.imgur.com/WqoaW7q.png

Dimension 31 seems to distinguish Karsdorf_LN out.

Might be worth looking to see if there are any unexpected populations who are huge outliers in any dimension and then removing them? It seems like you get some dimensions that might not be informative overall and load heavily on peculiarities of single samples, and then other dimensions have to compensate. This is possibly due to those issues of "deamination, low coverage and missing markers"? I don't know if it affects the results overall, or just pushes other dimensions lower down the chain or anything like that.
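A rough way to screen for such outliers is a per-dimension z-score across the population averages. This is a sketch only; `flag_outliers`, the threshold and the labels and coordinates are all hypothetical. With the population standard deviation and n rows, no |z| can exceed the square root of n-1, so the threshold has to be chosen with the panel size in mind.

```python
from statistics import mean, pstdev

def flag_outliers(rows, threshold=2.5):
    """rows: dict of label -> coords. Returns (label, 1-based dimension, z) tuples."""
    labels = list(rows)
    flagged = []
    for d, col in enumerate(zip(*(rows[label] for label in labels))):
        mu, sd = mean(col), pstdev(col)
        if sd == 0:
            continue  # no spread on this dimension, nothing to flag
        for label, value in zip(labels, col):
            z = (value - mu) / sd
            if abs(z) > threshold:
                flagged.append((label, d + 1, round(z, 2)))
    return flagged

# Made-up coordinates on two dimensions; 'odd_sample' is extreme on the first.
rows = {
    "pop_a": [0.001, -0.003],
    "pop_b": [-0.001, -0.002],
    "pop_c": [0.002, -0.001],
    "pop_d": [-0.002, 0.0],
    "pop_e": [0.0, 0.001],
    "pop_f": [0.001, 0.002],
    "pop_g": [-0.001, 0.003],
    "odd_sample": [0.050, 0.0],
}
print(flag_outliers(rows))
```

A flagged sample is only a candidate for removal: a genuinely divergent population will also score high, so each hit still needs to be checked against coverage and missingness.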

Also, although I didn't take a note of them specifically there seem like quite a few dimensions, quite early on in the sequence of dimensions, distinguishing Siberian / North Asian / East populations.

Just looking at the populations in Past3's neighbour joining clustering on the PC scores, we seem to have these really huge branch lengths splitting these populations:

http://i.imgur.com/xqWudye.png

I can't quite understand why that is, as they don't normally seem to be that divergent (unless I'm remembering wrongly). The positions make sense, it just seems like they are *really* divergent as individual populations.

Seems like an even stronger effect in classical clustering:
http://i.imgur.com/2y6YgfU.png. Long branches.

Chad said...

Just remove Motala. They're giving you dubious results.

Seinundzeit said...

Alberto,

For whatever it's worth, the Paniya results look more interesting with only 10 dimensions:

71.80% Ust Ishim
25.85% Satsurblia
2.35% MA1

(distance = 4.7259)

Seems more reasonable (again, East Asians and Australians included, along with many distinct West Eurasian reference populations).

I've been trying models with "basic" reference populations (ones which aren't, at least in a shallow sense, mixes between each other, as per the literature). These models make great sense (same reference populations for all of them), and work well. I only used individual samples, not averages:

Chechen
49.30% CHG
29.85% Anatolia Neolithic
14.20% EHG
4.60% BedouinB
1.85% Ami

(distance = 0.2251)

Lezgin
55.90% CHG
29.20% Anatolia Neolithic
12.50% EHG
2.10% Ami
0.30% Australian

(distance = 0.1385)

Georgian
58.35% CHG
32.50% Anatolia Neolithic
8.35% BedouinB
0.80% Atayal

(distance = 0.3919)

Iranian
61.70% CHG
16.90% Anatolia Neolithic
14.15% BedouinB
4.30% ANE
2.65% Atayal
0.30% Australian

(distance = 0.2346)

Jordanian
59.35% BedouinB
20.60% CHG
9.90% Anatolia Neolithic
4.60% Yoruba
2.60% EHG
1.50% ANE
1.40% Australian

(distance = 0.0876)

Sardinian
76.85% Anatolia Neolithic
15.65% WHG
3.05% EHG
2.90% CHG
1.45% Yoruba
0.10% BedouinB

(distance = 0.2582)

Basque (French)
49.60% Anatolia Neolithic
32.70% WHG
13.9% CHG
3.50% EHG
0.30% Yoruba

(distance = 0.3971)

Lithuanian
41.35% WHG
22.55% CHG
18.75% Anatolia Neolithic
17.35% EHG

(distance = 0.3878)

So far, makes good sense. Although, some of the results differ from what we've seen with d-stat/f-4 based methods. But even then, nothing looks implausible, although one wonders which set of results we should take more seriously (either results based on formal stats, or something like this, which differs from those results, yet still makes sense).

For South Central Asians, same reference populations:

Kalash
73.55% CHG
26.45% ANE

(distance = 2.1085)

Pashtun
72.90% CHG
19.95% ANE
5.05% Ust-Ishim
2.10% Ami

(distance = 1.1435)

Pashtun (different sample)
77.7% CHG
17.7% ANE
4.3% Ust-Ishim
0.3% Ami

(distance = 0.9379)

Tajik (Ishkashim)
65.60% CHG
18.85% ANE
7.00% Ami
6.65% EHG
1.30% Australian
0.60% Ust-Ishim

(distance = 0.3649)

Tajik (Shugnan)
70.70% CHG
20.80% EHG
5.50% Ami
1.45% Australian
1.35% ANE
0.20% BedouinB

(distance = 0.1824)

South Asia:

Punjabi (Lahore)
60.85% CHG
21.40% Ust-Ishim
17.75% ANE

(distance = 3.5394)

Chamar (Uttar Pradesh)
58.15% CHG
34.45% Ust-Ishim
7.40% ANE

(distance = 5.1284)

Pulliyar (Tamil Nadu, if I'm not mistaken)
53.85% Ust-Ishim
41.30% CHG
4.85% ANE

(distance = 4.7012)

Interesting results, these also make sense. But here, we're in way more speculative territory, compared to Europe (or even West Asia).

Krefter said...

@Davidski,

What type of PCA are you using? A Global PCA? Can you give a picture of it?

Davidski said...

huijbregts,

Without normalization...

https://drive.google.com/file/d/0B9o3EYTdM8lQWElGb3FwTTRPZUU/view?usp=sharing

I'm seeing very sensible results using just 9 dimensions from this sheet. But it's often not possible to run models with very closely related populations as references, like Karelia_HG and Motala_HG. I don't think this is a major problem.

Matt,

The long branches you're seeing are probably caused by these Siberian populations dominating some of the lesser dimensions due to recent drift. I don't think this has any impact on the models we're testing though, and if it affects the Siberians, then limiting the number of dimensions to less than 10 should help.

Btw, don't worry about the ancient samples in the datasheet that are behaving strangely in some dimensions. They don't affect the outcomes for any of the other samples, because they don't define the dimensions.

Krefter,

I can't plot more than 3 dimensions. If I plot the first 2 or 3 dimensions the result will just look like an average global plot.

Krefter said...

Awesome job David.

@Matt, Alberto.

We need to make assumptions about the ancestry of modern populations to get realistic results. When you use all ancient West Eurasians and lots of moderns, you'll get false results.

Here's an example of using realistic ancestors.

Scottish_Argyll: 44.25 Yamnaya, 36.1 Anatolia_Neolithic, 19.65 Loschbour.

I'm getting very realistic results for Europeans right now, because I use realistic ancestors.

Alberto said...

@Sein

Yes, agreed. With Ust-Ishim you can catch all the kinds of ancestry in Paniya without any East Asian. What was new to me was seeing that the most West Eurasian part of Paniya was better represented by Kostenki14 than by MA1.

I agree about the rest too. Some results are different from using other methods, but it's hard to say that they're necessarily wrong. They're not unreasonable.

---

Re:Motala, yes, removing it seems to give us something close to what we've seen before:

English_Cornwall
"Anatolia_Neolithic" 37.6
"Loschbour" 22.4
"Kotias" 20.2
"Karelia_HG" 18.65
"Samara_HG" 1.15
"Dai" 0
"Yoruba" 0
distance_abs=0.054942 / distance_euc=0.01032

(And results with absolute distance are now in full agreement).

But one thing that was interesting to explore was precisely how much Motala-like there could be in Europe. I guess it's just difficult with these methods to distinguish SHG from just a mix of EHG and WHG, so we can't trust those results too much (though just discarding them doesn't answer our question either). Maybe in the future we can find the right way to measure that with more confidence.

Anonymous said...

Regarding this comment of Davidski:

"The long branches you're seeing are probably caused by these Siberian populations dominating some of the lesser dimensions due to recent drift. I don't think this has any impact on the models we're testing though, and if it affects the Siberians, then limiting the number of dimensions to less than 10 should help."

What's the practical number of clusters Siberians can be divided into, according to this approach? Also, is there a reliable way to attach any of these clusters to populations such as Finns, Northern Russians, Erzyas and Mokshas, for instance?

Davidski said...

No idea, but you can check with Past3 how the different Siberian groups behave in each dimension, and also try a few models with nMonte of Finns, Russians and Mordovians to see which Siberians they prefer.

Matt said...

Davidski: The long branches you're seeing are probably caused by these Siberian populations dominating some of the lesser dimensions due to recent drift. I don't think this has any impact on the models we're testing though, and if it affects the Siberians, then limiting the number of dimensions to less than 10 should help.

Hmmm. OK, clustering does look different (less extreme branches) with less than 10 dimensions and less than 20, although this does hit the distinction of the HGs as well (which would be a problem for using low dimension in the West Eurasian models).

Alberto Re:Motala, yes, removing it seems to give us something close to what we've seen before

But one thing that was interesting to explore was precisely how much Motala-like there could be in Europe. I guess it's just difficult with these methods to distinguish SHG from just a mix of EHG and WHG, so we can't trust those results too much (though just discarding them doesn't answer our question either).

Kind of looks like you can get more intuitive fits without Motala.

Re: Motala in the fits, for me the question is not just about them, but also whether this indicates that the various ancient populations (EHG, WHG, SHG, CHG, Anatolia_Neolithic) are generally compressed towards their closest present-day populations by this method (in these 50 dimensions), in a way that makes it easier for the nMonte algorithm to find high %s of them in modern populations. This would just be most apparent for Motala because they're very central. It would also be why you don't get Yamnaya admixture in Sardinia, etc.

When I put these scores through PCA, the Sardinians and EN seemed more displaced from others than they usually are, relative to the HG and other Europeans - http://i.imgur.com/x0Fbuxo.png

But you could just as easily argue that the D-stats are displacing the ancients further from the modern populations than they really are, I suppose (and not equally between populations, hence the increase in Anatolia_Neolithic in the D-stats models, because it is less displaced than it is here).

Doing a run taking out Motala and including a Middle Neolithic population
English_Cornwall - Sweden_MN 37.15, Yamnaya_Samara 28.1, Anatolia_Neolithic 11.8, Karelia_HG 10.4, Satsurblia 6.1, Loschbour 5.8, Kotias2 0.65 - distance% = 0.7412 %

then themselves

Sweden_MN - Anatolia_Neolithic 72.65, Loschbour 27.35 - distance% = 1.6859 %

Yamnaya_Samara - Karelia_HG 33.95, Kotias2 26.35, Samara_HG 15.1, Kotias1 14.6, Loschbour 9.45, Itelmen 0.55 - distance% = 1.1454 %

which, when substituted into the English_Cornwall model, gives Anatolia_Neolithic 38.8, EHG 24.18 (Karelia_HG 19.9, Samara 4.24), Loschbour 18.6, CHG 18.25 (Satsurblia 10.3, Kotias2 4.9, Kotias1 4.1).

which is fairly consistent.
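
For anyone who wants to repeat that substitution, here is a small R sketch of the arithmetic, using the three fits posted above. The grouping of samples into EHG and CHG labels is my assumption about how the totals were formed, and only group totals are computed.

```r
# Top-level fit for English_Cornwall (percentages as posted above)
cornwall <- c(Sweden_MN = 37.15, Yamnaya_Samara = 28.1, Anatolia_Neolithic = 11.8,
              Karelia_HG = 10.4, Satsurblia = 6.1, Loschbour = 5.8, Kotias2 = 0.65)
# Fits for the two intermediate populations themselves
sweden_mn <- c(Anatolia_Neolithic = 72.65, Loschbour = 27.35)
yamnaya <- c(Karelia_HG = 33.95, Kotias2 = 26.35, Samara_HG = 15.1,
             Kotias1 = 14.6, Loschbour = 9.45, Itelmen = 0.55)

# Replace each intermediate population by its own fit, weighted by its share
direct  <- cornwall[!(names(cornwall) %in% c("Sweden_MN", "Yamnaya_Samara"))]
via_mn  <- sweden_mn * cornwall[["Sweden_MN"]] / 100
via_yam <- yamnaya * cornwall[["Yamnaya_Samara"]] / 100
parts   <- c(direct, via_mn, via_yam)

# Assumed grouping of individual samples into broader labels
group <- c(Anatolia_Neolithic = "Anatolia_Neolithic", Loschbour = "Loschbour",
           Karelia_HG = "EHG", Samara_HG = "EHG",
           Satsurblia = "CHG", Kotias1 = "CHG", Kotias2 = "CHG",
           Itelmen = "Itelmen")
totals <- tapply(parts, group[names(parts)], sum)
print(round(totals, 2))
```

Because each fit sums to 100, the substituted totals also sum to 100, and the group totals should land within rounding of the figures quoted above.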

@ Krefter, dimensions 1 and 2, and 2 and 3 - http://i.imgur.com/yGiqKxM.png

Davidski said...

By the way, call me crazy, but I think limiting the datasheets to 25 or fewer dimensions does help.

Also, I think the normalized datasheet works better than the non-normalized one.

In other words, the sheet I posted in the blog entry should be fine as long as the models are sensible enough.

Anonymous said...

@Davidski: "..also try a few models with nMonte of Finns, Russians and Mordovians to see which Siberians they prefer". Any sophisticated guesses as to what would happen if there is more than one Siberian source population?

Davidski said...

You can stick them all in and see what happens. But I'd say the best strategy would be to keep things simple and follow a mainstream archaeological theory. Maybe the Karasuk outliers might be nice to start with?

Alberto said...

@Matt

Yes, something like that seems to be happening. But as you also note, when we're seeing differences when using different methods it becomes difficult to know reliably what should be more correct and what shouldn't.

For example, here we see very little ENA and SSA in Europe compared to what we were seeing with D-stats. The pervasive Dai/Nganasan signal mostly disappears and with Spanish_Castilla_la_Mancha (which should be about average Spanish) I just get 0.1% Yoruba.

Another interesting thing that can be seen in your PCA is that Anatolia Neolithic is more "eastern" than EN/MN (or quite specially than Iberian EN/MN). This means that when using Anatolia Neolithic the "Yamnaya" (be it EHG or CHG) admixture is significantly decreased as compared to using Iberia_MN. For example, Basque:

Basque_Spanish
"Anatolia_Neolithic" 63.1
"Loschbour" 30.8
"Kotias" 6.1
"Dai" 0
"Karelia_HG" 0
"Samara_HG" 0
"Yoruba" 0
distance=0.0325

(though with absolute distance I do get close to 6% EHG + 4% CHG)

Basque_Spanish
"Iberia_MN" 73.3
"Kotias" 10.05
"Loschbour" 9.4
"Samara_HG" 7.25
"Anatolia_Neolithic" 0
"Dai" 0
"Karelia_HG" 0
"Yoruba" 0
distance=0.018392

And the model greatly improves.

For Asia the biggest difference is the CHG vs. Anatolia_Neolithic figures. With D-stats based model, the Kalash could get some 15% Anatolia_Neolithic (apart from whatever Sintashta had itself), while here they get 0% and instead much more CHG (some 50% vs. 27%). I don't know if there would be any way to check which model is closer to the truth (similar thing with Yamnaya).

Simon_W said...

Regarding Motala ancestry in the table in the blog post (the modeling of the ancients), it makes complete sense to me, and the Motala admixture in Bell_Beaker_Germany:I0112 is truly amazing! Because it's in line with the cranial affinity of German Bell Beakers with the Danish TRB. A possible link is of course Nordic_LN:RISE98, who has even more Motala admixture. The fact that, according to the table, Sintashta:RISE395 also has Motala admixture is in line with the connection between Sintashta and the Nordic_LNBA noted in the cluster analysis in the related D-stats/nMonte open thread. Apparently, this Motala admixture played no role in later cultures, like Unetice and the probably kind of Germanic Halberstadt_LBA.

Another interesting point is the substantial Anatolia_Neolithic ancestry in Armenia_BA. This is in line with previous analyses, just a nice confirmation. Especially since both Yamnaya individuals completely lacked this.

Matt said...

Alberto:
The pervasive Dai/Nganasan signal mostly disappears

Seems like that could go hand in hand with decreasing Anatolia_Neolithic %s, as Anatolia_Neolithic shares the least drift with Dai/Nganasan (compared to WHG, EHG and even CHG)? On the other hand, with the D-stats those Dai/Nganasan %s were also present in the Sardinians, who if anything have increased in Anatolia_Neolithic in these models (and the Sardinians in the D-stat model sort of got Nganasan to a higher degree than most other populations). So probably not that.

Another interesting thing that can be seen in your PCA is that Anatolia Neolithic is more "eastern" than EN/MN (or quite specially than Iberian EN/MN). This means that when using Anatolia Neolithic the "Yamnaya" (be it EHG or CHG) admixture is significantly decreased as compared to using Iberia_MN. For example, Basque

Interesting spot. If that "PCA on PCA" is accurately summarizing the PCA dimensions, yes, it does look like Iberia_MN has some position away from a simple mix of Anatolia_Neolithic and Loschbour / Bichon / Iberia_Mesolithic, which would need the Basque to pick up more EHG related populations to compensate. Maybe there is more substructure between the Cardial and Danubian populations, or more ongoing drift in the West Mediterranean, in these PCA dimensions than is apparent in D-stats (even with Iberia_Chalcolithic as a D-stat).

That would also suggest that with Iberia_EN as their Neolithic ancestors, the Sardinians would pick up a little Karelia_HG, and they do, although not much: Sardinian - Iberia_EN 89.4, Kotias1 7, Karelia_HG 3.25, Yoruba 0.3, Ulchi 0.05 - 1.2976 %.

(OTOH Sweden_MN which is positioned more towards Yamnaya than Iberia_MN on that plot does not fit well as Iberia_MN plus EHG / CHG, and instead got a weird blend of England_Roman_Outlier and Loschbour when I tried it.)

One other thing that may be of interest with that "PCA on PCA" was that the Hungary_BA population average was really displaced towards Northwest Europe, while on the D-stats PCA they were pretty much Basques+a little WHG.

Krefter said...

@Simon_W,

I highly doubt there's significant SHG ancestry in any moderns, except maybe East Baltic. The problem with Matt and Alberto's modelling is that they're not modelling Europeans as Yamnaya+Middle Neolithic+other. There's no point in using Loschbour and Anatolia Neolithic and Kotias and Karelia_HG. Middle Neolithic and Yamnaya carry all that modern Europeans need from each one of them.

Look what happens when I model Northern Europeans as Yamnaya+Middle Neolithic+Motala_HG+Loschbour.

Only some score a few percent Motala_HG. Even for Lithuanians, Loschbour works better than Motala_HG.

Population        Loschbour  Sweden_MN  Motala_HG  Yamnaya_Samara  D statistic
Czech                     4         45          0              51       0.0097
Belarusian                9         33          5              53       0.0172
Lithuanian               19         24          2              55       0.0177
Norwegian                 5         44          3              48       0.0081
English_Cornwall          0         52          1              47       0.0084
English_Kent              3         52          0              45       0.0116
Scottish_Argyll           5         47          0              48       0.0097


Krefter said...

The results I'm getting are 100% consistent with D-stats. I'd say, 30-50% Anatolia_Neolithic and 20-50% Yamnaya, are reasonable ancestry ranges for 90% of Europeans.

Ryukendo K said...
This comment has been removed by the author.
Davidski said...

I don't understand what you're saying. There aren't any outgroups in the PCA datasheet.

Ryukendo K said...
This comment has been removed by the author.
Davidski said...

I don't think that would work, because unless the ancient samples are covered by at least several hundred modern samples, they do weird things on PCA plots. In fact, the same thing happens with highly drifted and/or inbred modern samples.

There's no way such PCA data, with really exaggerated outcomes, would produce accurate ancestry estimates.

Really, the only way to do this is to pack as much variation as possible into a run, and then pick out the most effective dimensions, which are usually the first nine dimensions.

The thing to understand is that it's not practical to model modern populations with anything but the most relevant and proximate ancient samples. Modern Northern Europeans are not a mixture of Karelia_HG, Motala_HG, and certainly not Kostenki14, and this is essentially why these models don't work.

On the other hand, models involving Yamnaya and Middle Neolithic Europeans work very well, and are very similar to what we've seen with other methods. The reason they work is because they reflect reality.

Ryukendo K said...
This comment has been removed by the author.
Shaikorth said...

Paniya is equidistant from Han and GujaratiA according to Kurd's IBS test. With a D-test they're closer to Japanese than to Kets.

Davidski said...

Of course South Asians don't share any special relationship with Kostenki14 or Ust_Ishim. And certainly this type of PCA analysis for such ancient samples is not particularly useful.

But the distances for the models with these genomes are always ridiculously large. So there's no problem.

Ryukendo K said...
This comment has been removed by the author.
huijbregts said...

@ Ryukendo

Given my present state of knowledge, I have no reason to consider Dstats intrinsically inferior to PCA data of raw aDNA.
My statement is that nMonte works by minimizing the Euclidean distance of a set of vectors.
In doing so, it presupposes that the columns of the vectors are orthogonal.
So if you use nMonte, you get a better result if you first orthogonalize your data.
From this point of view, PCA data are perfect for nMonte, and the usefulness of David's datasets has underscored this.

The question whether it is useful to orthogonalize Dstats is beyond my knowledge horizon.
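
As a rough illustration of what "minimizing the Euclidean distance of a set of vectors" can look like, here is a toy random search in R. This is not the nMonte code: the function name fit_mixture, the slot count, the iteration count and the toy coordinates are all assumptions made for the example.

```r
# Toy reference populations in a 2-dimensional PC space (made-up coordinates)
sources <- rbind(A = c(0, 0), B = c(1, 0), C = c(0, 1))
truth   <- c(A = 0.5, B = 0.3, C = 0.2)      # mixture used to build the target
target  <- as.vector(truth %*% sources)      # target coordinates: (0.3, 0.2)

# Hypothetical helper: fill n_slots with source populations, then repeatedly
# swap one slot at random and keep the swap only if the distance shrinks.
fit_mixture <- function(target, sources, n_slots = 100, n_iter = 5000) {
  k <- nrow(sources)
  slots <- sample.int(k, n_slots, replace = TRUE)
  dist_to_target <- function(s) {
    sqrt(sum((colMeans(sources[s, , drop = FALSE]) - target)^2))
  }
  best <- dist_to_target(slots)
  for (i in seq_len(n_iter)) {
    cand <- slots
    cand[sample.int(n_slots, 1)] <- sample.int(k, 1)   # propose one random swap
    d <- dist_to_target(cand)
    if (d < best) {                                    # accept only improvements
      slots <- cand
      best <- d
    }
  }
  pct <- 100 * tabulate(slots, nbins = k) / n_slots
  names(pct) <- rownames(sources)
  list(percent = pct, distance = best)
}

set.seed(1)
fit <- fit_mixture(target, sources)
print(fit)
```

The distance here treats every column equally and independently, which is why orthogonal, comparably scaled columns matter for this kind of fit.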

Seinundzeit said...

RK,

Interestingly, with d-stats, the same results are observed:

GujaratiC
24.10% Caucasus_HG
21.70% Ust_Ishim
16.60% MA1
15.35% Atayal
12.55% Anatolia_Neolithic
7.20% Onge
1.50% Karelia_HG
1% BedouinB

Dravidian_India
35.45% Ust_Ishim
21.35% Atayal
18.35% Caucasus_HG
11.05% Onge
8.75% MA1
5.05% Anatolia_Neolithic

Kharia
39.40% Atayal
39.35% Ust_Ishim
12.15% Onge
5.10% MA1
4% Caucasus_HG

Even with the d-stat sheet, despite the use of Onge/East Asians, Ust-Ishim is the largest component for Dravidian_India (and would be the predominant element for Paniya).

I don't think this means that South Asians have a special relationship to Ust-Ishim or K14. Rather, it just means that tribal South Indians are "their own thing", neither ENA (East Asians/Onge/Australasians) nor West Eurasian (WHG/EHG/ANE), but something distinct from both (but probably with some admixture from both West and East Eurasia, or perhaps just phylogenetically closer to West Eurasia).

But I do agree that the d-stat sheet was more reliable. With this method, South Central Asians like the Kalash basically turn out 70%-75% CHG + 25%-30% ANE, while Pashtuns turn out the same, but with 2%-5% Ust-Ishim.

Which seems somewhat unlikely, as I'd expect some Anatolia Neolithic, BedouinB, and less ANE/EHG.

Matt said...

@ Ryu, full agreement with a lot of the issues you talk about in theory (although I can't speak to whether they are as practically relevant as I'd think they could be, or to any countervailing issues with the outgroup D-stats) and thanks for posting.

I think these do seem to be giving reasonable results with the MN and Yamnaya as proximate mixing populations (results consistent with other methods like D-stats). But even if you model modern populations with the most recent ancients rather than the most ancient ones, it still seems like any potential problem of compression, from dimensions being drawn from variation in modern populations, could still be extant (more so?) if you're using the dimensions to model more recent ancient DNA (Yamnaya, MN, etc.) with more ancient samples (EHG, CHG, WHG, AN).

huijbregts said...

unscaled PCA scores

The sheets of Davidski present scaled PCA scores, which is the default PCA mode. This means that the raw scores are divided by the root of their eigenvalues.
However, if you want to use the scores for successive calculations like nMonte or inspecting the residuals, the unscaled scores should be used.
Therefore, I have compared the nMonte calculations with scaled and unscaled data.

Yamnaya_Samara scaled:

"Kotias" 40.6
"Karelia_HG" 26.55
"Motala_HG" 18.55
"Samara_HG" 13.65
"Itelmen" 0.65
"Anatolia_Neolithic" 0
"Loschbour" 0

distance_scaled = 1.1348


Yamnaya_Samara unscaled:

"Kotias" 42.75
"Motala_HG" 22.8
"Karelia_HG" 21.4
"Samara_HG" 13.05
"Itelmen" 0
"Anatolia_Neolithic" 0
"Loschbour" 0

distance_unscaled = 2.4169


Note that the two distance measures are not comparable. Maybe a statistician/mathematician can help us here.
The most striking difference is the presence of about half a percent of Itelmen in the scaled case.
This might be interpreted in the direction of palaeoSiberians.
However, inspection of the unscaled datasheet reveals that Itelmen has its biggest loadings on the dimensions
1 (non-African), 2 (Asian vs. European), 6, 11 and 35.
Especially the loading on dimension 35 suggests that this might be just noise.

The two sets of results are very similar. Yet I think that the unscaled scores are the ones to use.
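
For reference, the relation between scaled and unscaled scores can be checked on simulated data with R's prcomp. This is only a sketch under the assumption that "scaled" means each PC's scores divided by the square root of its eigenvalue, as described above; the data are random numbers, not the datasheet, and the software behind the datasheet may differ in detail.

```r
set.seed(42)
# Simulated data: 20 samples, 4 variables with very different spreads
X <- matrix(rnorm(20 * 4), nrow = 20) %*% diag(c(5, 2, 1, 0.2))

pca <- prcomp(X)                              # centers the data, no standardization
unscaled <- pca$x                             # column variances equal the eigenvalues
scaled <- sweep(unscaled, 2, pca$sdev, "/")   # divide each PC by sqrt(eigenvalue)

# Unscaled scores (all PCs kept) preserve the pairwise Euclidean distances of the data
same_dist <- isTRUE(all.equal(as.vector(dist(unscaled)), as.vector(dist(X))))
# Scaled scores have unit standard deviation in every dimension
unit_sd <- isTRUE(all.equal(unname(apply(scaled, 2, sd)), rep(1, 4)))
print(c(same_dist = same_dist, unit_sd = unit_sd))
```

This would also explain why the two distance figures above are not directly comparable: they are measured in different units.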

Davidski said...

The new datasheet is ready with lots of extra populations and unscaled PC coordinates. I only used 9 dimensions to speed things up, and I'd say the results are more accurate than with 50 anyway. Here's a decent model for Kotias...

Kotias:KK1
Satsurblia:HL 94
Anatolia_Neolithic:I0709 5.1
Loschbour:Loschbour 0.65
Karelia_HG:I0061 0.25
BedouinB:HGDP00607 0
Dai:HGDP01307 0
Motala_HG:I0017 0
Nganasan:ADR00504 0
Paniya:PNYD1 0
Samara_HG:I0124 0

Also, this sheet should be more useful for modeling South Asians. Here are three Kalash...

Kalash:HGDP00267
Kotias:KK1 63.55
Paniya:PNYD1 23.3
Karelia_HG:I0061 13.15
Anatolia_Neolithic:I0709 0
BedouinB:HGDP00607 0
Dai:HGDP01307 0
Loschbour:Loschbour 0
Motala_HG:I0017 0
Nganasan:ADR00504 0
Samara_HG:I0124 0

Kalash:HGDP00277
Kotias:KK1 63.25
Paniya:PNYD1 23.8
Karelia_HG:I0061 12.95
Anatolia_Neolithic:I0709 0
BedouinB:HGDP00607 0
Dai:HGDP01307 0
Loschbour:Loschbour 0
Motala_HG:I0017 0
Nganasan:ADR00504 0
Samara_HG:I0124 0

Kalash:HGDP00281
Kotias:KK1 62.2
Paniya:PNYD1 25.25
Karelia_HG:I0061 12.55
Anatolia_Neolithic:I0709 0
BedouinB:HGDP00607 0
Dai:HGDP01307 0
Loschbour:Loschbour 0
Motala_HG:I0017 0
Nganasan:ADR00504 0
Samara_HG:I0124 0

Paniya PNYD1 is probably the most ASI sample we have in any dataset. This individual is only ~70% ASI, at best, but that's still way higher than any other samples.

Anyway, these outcomes are definitely not unrealistic, and also very consistent. But I can't guarantee that this will be the case for all individuals in the datasheet, so it's probably better to use population averages.

The new nMonte apparently offers the aggr_pops function that averages the results by population, but I can't get it to work yet.

The datasheet with 9 dimensions...

https://drive.google.com/file/d/0B9o3EYTdM8lQbEhhcW9ZVnJudDA/view?usp=sharing

The datasheet with all 50 dimensions...

https://drive.google.com/file/d/0B9o3EYTdM8lQWmYyWUhtekNJdVE/view?usp=sharing

huijbregts said...

The function aggr_pops() expects that the datasheet has a comma as field separator.

Davidski said...

The files I linked to above are comma separated. The issue I was having was with my arguments, but these work...

source('nMonte.R')

averages <- aggr_pops('final2.txt')

write.csv(averages,"averages.txt")
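
For anyone curious what averaging by population amounts to, here is a stand-alone base R sketch. It is not the aggr_pops code. It assumes row labels of the form Population:SampleID, as in the datasheet, and uses invented labels and coordinates.

```r
# Made-up scores with "Population:SampleID" row labels (hypothetical values)
scores <- rbind(
  "PopA:ind1" = c(0.10, 0.20),
  "PopA:ind2" = c(0.30, 0.40),
  "PopB:ind1" = c(-0.50, 0.60)
)
colnames(scores) <- c("PC1", "PC2")

# Population label = everything before the first colon
pop <- sub(":.*$", "", rownames(scores))

# Split the rows by population and take the column means of each group
averages <- t(sapply(split(as.data.frame(scores), pop), colMeans))
print(averages)
```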

huijbregts said...

@ All who have used the unscale function 'scaled2raw' in nMonte.

This function is meant to undo the effect of scaling (dividing the variables by the root of their variance).
Alas, it was too good to be true. I have to confess that I made an error with this function; I have removed it from nMonte.
This has no consequences for nMonte's main function 'getMonte', which can be used to estimate the composition of a DNA sample as a mixture of related DNAs.

If I correctly understand the mathematics, the situation is even worse: every attempt to unscale the PCA-scores compromises the orthogonality of the principal components.
As a consequence it seems impossible to calculate a real absolute Euclidean distance between two DNA-samples.
What remains is the possibility to calculate the Euclidean distance of scaled data (the correlation matrix instead of the covariance matrix).
In doing so the weight of higher (=smaller) dimensions becomes inflated.
This is one more reason to be sparing with dimensions (another is that PCAs are sensitive to outliers).
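
A toy calculation in R shows the inflation. The eigenvalues and score differences below are invented for illustration, and scaling is taken to mean division by the square root of the eigenvalue.

```r
# Hypothetical eigenvalues for three PCs and the unscaled score difference
# between two samples in each of them (all numbers made up)
eigenvalues   <- c(PC1 = 10, PC2 = 1, PC3 = 0.01)
diff_unscaled <- c(PC1 = 1, PC2 = 0.5, PC3 = 0.1)

# Contribution of each dimension to the squared Euclidean distance
contrib_unscaled <- diff_unscaled^2
contrib_scaled   <- diff_unscaled^2 / eigenvalues   # scores divided by sqrt(eigenvalue)

# Share of the squared distance carried by each dimension
shares <- rbind(
  unscaled = contrib_unscaled / sum(contrib_unscaled),
  scaled   = contrib_scaled / sum(contrib_scaled)
)
print(round(shares, 3))
```

In this example the first dimension dominates the unscaled distance, while after scaling most of the distance comes from the smallest dimension, which is the one most likely to be noise.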