
Thursday, February 15, 2018

Modeling genetic ancestry with Davidski: step by step


There are many different ways to model your genetic ancestry. I prefer the Global25/nMonte method (see here). This is a step by step guide to modeling ancient ancestry proportions with this simple but powerful method using my own genome.


As far as I know, the vast majority of my recent ancestors came from the northern half of Europe. This may or may not be correct, but it gives me somewhere to start, so that I can come up with a coherent model. If you don't have this sort of information, because, perhaps, you were adopted, then just look in the mirror, and work from there. Like I say, it's not imperative that you know anything whatsoever about your ancestry, because your genetic data will do the talking, but you do need a model when modeling.

In scientific literature nowadays, Northern Europeans are often described as a three-way mixture between Yamnaya-related pastoralists, Anatolian-derived early farmers, and Western European Hunter-Gatherers (WHG). So let's see if this model works for me. Obviously, if it does, then it'll confirm the information that I have about my origins, but it might also reveal finer details that I'm not aware of. The datasheet that I'm using for this model is available here.

[1] distance%=6.9025 / distance=0.069025

Davidski

Yamnaya_Samara 53.9
Barcin_N 30.75
Rochedane 15.35
Tepecik_Ciftlik_N 0

Yep, the model does work, with a fairly reasonable distance of almost 7%. The ancestry proportions more or less match those from scientific literature and the plethora of analyses that I've featured at this blog on the topic. Please note that I've kept things very simple, using only four reference populations and individuals as proxies for four distinct streams of ancestry. But I've put my own twist on this Neolithic/Bronze Age model by including two populations from Neolithic Anatolia (Barcin_N and Tepecik_Ciftlik_N), just to see what would happen. The WHG proxy is Rochedane.
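To make the distance figure concrete, here is a minimal sketch of how a fit like the one above can be scored, assuming the distance is the Euclidean distance between the target's coordinates and the weighted average of the reference coordinates. The function name, the population labels, and the three-dimensional coordinates are hypothetical (real Global25 rows have 25 dimensions), and this is not nMonte's own code.

```python
import math


def mixture_distance(target, sources, weights):
    """Euclidean distance between the target and a weighted mix of sources."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    dims = len(target)
    # Weighted average of the source coordinates, one dimension at a time.
    mix = [
        sum(weights[name] * sources[name][d] for name in weights)
        for d in range(dims)
    ]
    return math.sqrt(sum((m - t) ** 2 for m, t in zip(mix, target)))


# Hypothetical coordinates, for illustration only.
sources = {
    "Steppe": [0.12, 0.08, 0.01],
    "Farmer": [0.11, 0.16, -0.03],
    "Forager": [0.13, 0.12, 0.09],
}
target = [0.121, 0.112, 0.012]
weights = {"Steppe": 0.54, "Farmer": 0.31, "Forager": 0.15}

d = mixture_distance(target, sources, weights)
# Same layout as the output lines in this post: distance% is 100 x distance.
print(f"distance%={100 * d:.4f} / distance={d:.6f}")
```

A lower distance means the weighted mix of references lands closer to the target in the coordinate space.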

Admittedly, though, my Yamnaya cut of ancestry appears somewhat bloated at over 53%, and the model's distance is a little higher than what I normally see for really strong models. So let's check if I can get a better fitting and more sensible result by adding a slightly more easterly forager proxy than Rochedane: Narva_Lithuania.

[1] distance%=5.9331 / distance=0.059331

Davidski

Yamnaya_Samara 45.75
Barcin_N 31.45
Narva_Lithuania 22.8
Rochedane 0
Tepecik_Ciftlik_N 0

The statistical fit does improve, and when given a choice between Rochedane and Narva_Lithuania, the algorithm picks the latter as the only source of extra forager input in my genome.

What could this mean? It might mean that a large part of my ancestry derives from the Baltic region. Actually, I know for a fact that this is true. But even if I had no idea about my genealogy, this result would be a very strong hint about my genetic origins. Indeed, let's follow this trail and try to further improve the fit of the model by adding a more relevant Yamnaya-related proxy, such as early Baltic Corded Ware (CWC_Baltic_early).

[1] distance%=5.444 / distance=0.05444

Davidski

CWC_Baltic_early 54.95
Barcin_N 26.7
Narva_Lithuania 18.35
Rochedane 0
Tepecik_Ciftlik_N 0
Yamnaya_Samara 0

Holy shit! To be honest, I wasn't expecting this sort of resolution and accuracy, and I can't promise that everyone using the Global25/nMonte method will see such incredibly nuanced outcomes, but this isn't a fluke. It can't be, because it gels so well with everything that I know about my ancestry. Please note also that I belong to Y-chromosome haplogroup R1a-M417, which is a lineage intimately associated with the Corded Ware expansion across Northern Europe (for instance, see here).

But of course, the Baltic and nearby regions haven't been isolated from migrations and invasions since the Corded Ware times. For instance, at some point, probably during the Bronze Age, Uralic-speaking peoples moved west across the forest zone of Northeastern Europe and into the East Baltic and northern Scandinavia. It's generally accepted that they brought Siberian admixture with them (see here). Moreover, from the Iron Age to the Middle Ages, East Central Europe was under intense pressure from a wide range of nomadic steppe groups with complex ancestry, such as the Sarmatians, Avars, Huns, and Mongolians. Did any of these peoples leave their mark on my genome? At the risk of overfitting the model, let's explore this possibility by adding a few more reference populations.

[1] distance%=5.444 / distance=0.05444

Davidski

CWC_Baltic_early 54.95
Barcin_N 26.7
Narva_Lithuania 18.35
Han 0
Mongolian 0
Nganassan 0
Rochedane 0
Sarmatian_Pokrovka 0
Tepecik_Ciftlik_N 0
Yamnaya_Samara 0

Nothing changes when I add the Han Chinese, Mongolians, Nganassans (a Uralic people from Siberia), and Sarmatians to the model. But what about if I throw in the only ancient Slav in my datasheet?

[1] distance%=2.9904 / distance=0.029904

Davidski

Slav_Bohemia 85.9
CWC_Baltic_early 7.7
Narva_Lithuania 6.4
Barcin_N 0
Rochedane 0
Tepecik_Ciftlik_N 0
Yamnaya_Samara 0

Considering that the vast majority of my recent ancestors were Poles, thus a Slavic-speaking people from near the Baltic, this outcome makes perfect sense. And check out the new distance! But the problem now is that I'm overfitting the model by using two very similar and probably very closely related references, CWC_Baltic_early and Slav_Bohemia. And overfitting should be avoided at all costs. So it might be useful to break up this effort into two models: one focusing on the Neolithic and Bronze Age, and the other on the Iron Age and Middle Ages. I'll do that soon, but not just yet, because there are still too few Iron Age and Medieval samples available from the Baltic region and surrounds for meaningful analyses of this type.

For a more technical guide to running Global25-type data with nMonte, please refer to this post at my other blog by regular commentator Onur: An nMonte and 4mix guide for the participants of the Basal-rich K7 and/or Global 10 tests.

See also...

The powerful Global 25 now available via the Eurogenes genetic ancestry online store

61 comments:

khana said...

Excellent and simple breakdown of the process. Hopefully, it will assist many people in trying a hand at it themselves.

Karl_K said...

Good stuff.

Samuel Andrews said...

Wow, looks really good. I think I'll probably order this.

Davidski said...

It's free if you had the Global 10 done.

MomOfZoha said...

Alright, discovered the subset_data function of nMonte3, which is incredibly useful.

After playing around first with some 6 pops, reducing to 3 -- all hastily but hopefully good START: Found Assyrian, Tuvinian, and Tajik_Shugnan to be good ref pops. Originally tried other Caucasus and Balkan stuff too, but that did cause too small distances. E.g. when adding Adygei, there seems to be an effect on the Tajik_Shugnan percent for exactly the same reason as your example above as they seem to share important ancestry (despite the geographic distance). Therefore, unfortunately, it becomes difficult to extract out my mom's confirmed ancestries from both Northwest Caucasus and Central Asia. But, as I said, this is just the START:

unscaled run for my dad:
"distance%=2.4469"

MoZ_Father

Assyrian,88
Tuvinian,9.8
Tajik_Shugnan,2.2

scaled run for dad:
"distance%=4.5011"

MoZ_Father

Assyrian,84.6
Tajik_Shugnan,12
Tuvinian,3.4

Unscaled run for my mom:

"distance%=1.5222"

MoZ_Mother

Assyrian,79.2
Tuvinian,12.2
Tajik_Shugnan,8.6

scaled run for mom:

"distance%=3.7714"

MoZ_Mother

Assyrian,72.4
Tajik_Shugnan,20.2
Tuvinian,7.4

The scaled runs give larger distances of course. I am tired of running in both modes. Is there a consensus on whether scaled versus unscaled is preferred? I thought Simon_W was saying that the scaled runs didn't make sense for his family ancestry in the previous post comments.

And, I hope you'll add the Armenians soon...

Davidski said...

@MomOfZoha

Most people are using the so called scaled coordinates.

I didn't see Simon's comments, but sometimes people mix up the scaled and original coords in their data and target files, and they get strange results.
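A minimal sketch of why mixing the two coordinate systems gives strange results, assuming that "scaled" coordinates are the original ones multiplied by a fixed factor per dimension. The factors and coordinates below are hypothetical; the real factors come from the PCA behind the datasheets.

```python
import math

# Hypothetical per-dimension scale factors, for illustration only.
SCALE = [10.0, 5.0, 2.0]


def scale(coords):
    """Multiply each dimension by its scale factor."""
    return [c * s for c, s in zip(coords, SCALE)]


def distance(a, b):
    """Plain Euclidean distance between two coordinate lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


source_original = [0.10, 0.05, 0.02]
target_original = [0.11, 0.05, 0.02]

# Both files in the same system: the distance reflects real similarity.
consistent_original = distance(target_original, source_original)
consistent_scaled = distance(scale(target_original), scale(source_original))

# Scaled target against an unscaled source: the distance mostly
# reflects the scale factors rather than ancestry.
mixed_up = distance(scale(target_original), source_original)

print(consistent_original, consistent_scaled, mixed_up)
```

Distances from a scaled run and an unscaled run are also not directly comparable with each other, which is one reason the scaled runs reported in this thread show larger distance values.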

I'll include the Armenians later today, or early tomorrow, depending on your time zone. But yeah, those Assyrians are very similar to Armenians. Basically identical.

Simon_W said...

@ Davidski
Ah I see, the coords of the target files have to be scaled as well, right?

Lion Heart said...

So how do I order it? I didn't do global 10

Simon_W said...

Thinking about it, it's pretty logical, I was comparing apples with pears. I'm gonna delete that post. ;-)

Davidski said...

@Lion Heart

E-mail me...

eurogenesblog [at] gmail [dot] com

Eren said...

Nice tutorial David.

With regards to samples: would it be possible to include Altai_IA:Rise504 (from the imputed Martiniano dataset)?

Davidski said...

@Eren

Probably, I'll have a look tomorrow.

Simon_W said...

As for my approach, I'm mostly interested in my LBA/Iron Age/early Medieval ancestry, I don't care much about the amount of Yamnaya-related or WHG ancestry I've got. For that reason I rather use modern populations as substitutes for missing ancient populations than wait for better times.

MomOfZoha said...

@David:
Thanks for your feedback.

For fun, I modeled my parents with three very distinct populations, as it occurred to me that Armenian-like Assyrians would also share ancestry with Tajik via Iran. Therefore, I tried:

Samaritan, Kalash, and Tuvinian

I got much smaller distances with un-scaled runs again. Here are the scaled runs only:

mom:
"distance%=5.1811"

MoZ_Mother

Samaritan,64.2
Kalash,26.6
Tuvinian,9.2

dad:
"distance%=6.2547"

MoZ_Father

Samaritan,72.8
Kalash,21.8
Tuvinian,5.4

***

Then I added 'Mycenaean' to test my dad:
"distance%=4.2584"

MoZ_Father

Mycenaean,46.4
Samaritan,29.8
Kalash,18.2
Tuvinian,5.6

then had to test my mom too:
"distance%=3.7707"

MoZ_Mother

Mycenaean,35.4
Samaritan,32.8
Kalash,21
Tuvinian,10.8

***

I know there is some "interference" effects going on in adding the Mycenaeans, aka need to clean up some other groups... Will play more later. Damn this is addicting! Can't believe I got in this game so late hah. Cheers.

Davidski said...

@MomOfZoha

Armenians are now in the datasheets.

@Eren

Altai_IA:RISE504 is now in the datasheets.

MomOfZoha said...

@Davidski:
Thank you!

In the meantime, the addiction continues...

Trying a couple ancients plus a Siberian (Tuvinians are my fave):

dad:
"distance%=5.9453"

MoZ_Father

Minoan_Lasithi,65.2
Yamnaya_Samara,26.2
Tuvinian,8.6

mom:
"distance%=5.7239"

MoZ_Mother

Minoan_Lasithi,58
Yamnaya_Samara,28.2
Tuvinian,13.8

Whatcha think, DavidSKI?

After a break, looking forward to try new Armenian+ datasheet!

jv said...

David,
Thanks for your recent help in sorting out my FTDNA results. My British Isles ancestry sure showed up. And yes, out of 16 Great Great Grandparents, 10 lineages originated in the British Isles and 6 in Germany. Guess, all those British Isles ancestors gave me a nearly 60% Western Hunter Gatherer makeup. Thanks again!

Tesmos said...

David, did you use Nmonte2 or Nmonte3 for your plan?

Eren said...

@Davidski: Cool, thanks!

@All:
I wanted to check my understanding of the population structure seen in the first two global PCA dimensions. I've annotated Davidski's Global PCA and uploaded it here: https://abload.de/img/worldplot_speculationlvpz7.png
So, what do you guys think? Totally wrong, or what?

MomOfZoha said...

When modeling my family with Armenians now added to the datasheet, at first it looked like my dad and father-in-law might be of ~80-96% Armenian descent, and my mom might be 65-89% so. Especially in my dad's case, this much surprised me. Then, I thought to check out the "average Turk" and the "average Azeri" to see how things compare...

Average Turk results:
"distance%=2.4603" Armenian,74.2 Turkmen,25.8
"distance%=2.6006" Armenian,83 Altai_IA,17
"distance%=2.7315" Armenian,89.8 Tuvinian,10.2
"distance%=3.0339" Armenian,91 Yakut,9
"distance%=3.5744" Armenian,92.2 Han,7.8

Average Azeri results:
"distance%=1.7003" Armenian,73.4 Turkmen,26.6
"distance%=1.8684" Armenian,82.6 Altai_IA,17.4
"distance%=2.6653" Armenian,90 Tuvinian,10
"distance%=3.0927" Armenian,91.2 Yakut,8.8
"distance%=3.1447" Armenian,92 Han,8

So, I am still getting that my dad and my father-in-law are more "Armenian" than both of their respective populations, especially my father-in-law (not surprising). My mom, on the other hand, as Armenian-shifted as she is, is still apparently less "Armenian" and more "Turkic" than the average Anatolian Turk, especially when taking possible Turkmen admixture into account.

MomOfZoha said...

With Ancients + Altaian, best fit so far excluding Iran in samples (even when CWC_Baltic, Andronovo, and Battle_Axe_Sweden were included):

Mom: "distance%=3.5642"
Armenia_MLBA,76.6 Natufian,13 Altaian,10.4

Dad: "distance%=4.9371"
Armenia_MLBA,79.8 Natufian,15.8 Altaian,4.4

Father-in-law: "distance%=3.5118"
Armenia_MLBA,88 Natufian,9.6 Altaian,2.4

(Somewhat similar results when Amerindian_North replaces Altaian except worse fits and less % assigned to Amerindian_North.)

O.w. when considering all Iran ancients and the Altai Iron Age: Easy fits with simply Iran_IA + Altai_IA:

Mom:"distance%=5.7881" Iran_IA,80.4 Altai_IA,19.6
Dad: "distance%=5.9718" Iran_IA,91 Altai_IA,9
FiL: "distance%=3.8018" Iran_IA,93.8 Altai_IA,6.2

EastPole said...

Western mtDNA on Bronze age and Eneolithic Mongol Steppe:

https://ceas.yale.edu/events/major-mtdna-haplogroups-bronze-age-mongol-steppe-populations-and-their-eurasian-population

Davidski said...

@Tesmos

I used nMonte2 with scaled data. But nMonte3 should produce the same results with pen=0.

@EastPole

We've discussed the ancient data from the Mongolian steppe.

http://eurogenes.blogspot.com/2017/02/european-specific-mtdna-lineages-on-neo.html

MomOfZoha said...

Modeling Azeris and Anatolian Turks with Tepecik_Ciftlik_N, Iran_N, Sarmatian_Pokrovka, Han, and Nganassan:

Average Turkish:
[1] "distance%=3.1932"

Turkish

Tepecik_Ciftlik_N,51.8
Sarmatian_Pokrovka,23.4
Iran_N,19.6
Han,2.8
Nganassan,2.4

Average Azeri:
[1] "distance%=3.6157"

Azeri

Tepecik_Ciftlik_N,43.6
Iran_N,29.8
Sarmatian_Pokrovka,22.6
Han,2.2
Nganassan,1.8

Now, my father-in-law:
[1] "distance%=4.0855"

MoZ_FatherInLaw

Tepecik_Ciftlik_N,47.2
Iran_N,29.6
Sarmatian_Pokrovka,22
Nganassan,1
Han,0.2

Almost line-by-line identical to Average Azeri with exception of slightly more Anatolian Neolithic farmer, and slightly less Han...

My mom:
"distance%=3.9951"

MoZ_Mother

Tepecik_Ciftlik_N,47
Sarmatian_Pokrovka,28.6
Iran_N,17
Han,4.8
Nganassan,2.6

My dad:
"distance%=3.4795"

MoZ_Father

Tepecik_Ciftlik_N,56.2
Sarmatian_Pokrovka,21.2
Iran_N,18.8
Han,2.4
Nganassan,1.4

My dad is very similar line-by-line to the Turkish average, except with quite a bit more Anatolian Neolithic farmer, which also slightly reduces his Sarmatian and Nganassan contributions. This makes sense given his isolated village region: his Konya Cumra county includes Catalhoyuk and is within driving distance of Cappadocia (Tepecik Ciftlik).

My mom, on the other hand, has reduced Anatolian Neolithic farmer and inflated Sarmatian plus Han, which may be due to her maternal ancestry from Hotamis village, known for its Turkmen kilims... I am continually surprised by this given that my maternal grandfather was not likely Turkic at all, which would further inflate the potential Turkic ancestry of my maternal grandmother.

Davidski said...

@All

Just to reiterate, those of you who already have Global 10 coordinates can e-mail me and request your Global 25 coordinates for free.

And you must send me your data again. If you think that I should still have it from the last analysis, then you're mistaken, because I don't.

Rob said...

Boring times
Hopefully the Cheddar man paper comes with other Paleo and Meso samples- help shape up lingering questions

Btw Dave what happened to Sams blog ? (And the money he charged people )

EastPole said...

@Davidski

“We've discussed the ancient data from the Mongolian steppe.

http://eurogenes.blogspot.com/2017/02/european-specific-mtdna-lineages-on-neo.html”

Thank you David, I forgot about it.

It is OT, but I find it interesting that in search of common roots of Indian and Greek culture some mention Mongolia:

https://s14.postimg.org/xyi6gg45d/screenshot_338.png

https://books.google.pl/books?id=PWuVAwAAQBAJ&pg=PA184&lpg=PA184&dq=%22A+Story+Waiting+to+Pierce+You:+Mongolia,+Tibet,+and+the+Destiny+of+the+Western+World%22&source=bl&ots=n4TMyoL86H&sig=NOqkmtNY2Od_htTgZ815bTejes4&hl=pl&sa=X&ved=0ahUKEwisysTR7orZAhXFXiwKHRY8B-w4ChDoAQhNMAU#v=onepage&q=%22A%20Story%20Waiting%20to%20Pierce%20You%3A%20Mongolia%2C%20Tibet%2C%20and%20the%20Destiny%20of%20the%20Western%20World%22&f=false

But Hyperboreans were not Mongolians. They were farmers living north of Scythian steppe. The fact that Vedic Sanskrit as well as some Greek numerals so important in Pythagorean beliefs are very similar to Slavic numerals is not accidental.
It is because the ancestors of Vedic Aryans, Hellenes and Slavs shared common culture, common poetic language, common religion at one time.
The presence of western mtDNA in Mongolia possibly explains some customs in Mongolia:

https://s10.postimg.org/oixn4lfh5/screenshot_340.png

https://books.google.pl/books?id=wksyCgAAQBAJ&pg=PT145&lpg=PT145&dq=%22A+Story+Waiting+to+Pierce+You:+Mongolia,+Tibet,+and+the+Destiny+of+the+Western+World%22&source=bl&ots=oymf8WodTq&sig=NNx0D5u0B7JnbWLuRpIYPHwohZ0&hl=pl&sa=X&ved=0ahUKEwisysTR7orZAhXFXiwKHRY8B-w4ChDoAQhSMAY#v=snippet&q=New%20vodka%20(arza&f=false
Mongolian shamans do not eat hallucinogenic mushrooms, do not use ephedra, do not smoke dope or other stuff to get in touch with the divine, they drink alcohol like IE, like Kalash:

http://eurogenes.blogspot.com/2018/01/the-kho-people-archaic-indo-aryans.html?showComment=1516795236915#c8505964353792957184

MomOfZoha, do you know the etymology of Mongolian ‘arza’ “vodka”? Can it be a borrowing from IE rta : arza : aša : ori : ari, etc.? It looks very similar to Slavic Jari : Jarzi, the god of fertility and fire.

David Rabaez said...

Me (nMonte2, scaled):

[1] "distance%=4.8212 / distance=0.048212"

DavidR:N119885

Iberia_MN 60.95
Yamnaya_Samara 28.15
Barcin_N 9.65
Yoruba 1.25
Rochedane 0.00
Han 0.00
Pima 0.00

Dad

[1] "distance%=4.3597 / distance=0.043597"

ManuelR

Iberia_MN 58.2
Yamnaya_Samara 30.6
Barcin_N 11.2
Rochedane 0.0
Yoruba 0.0
Han 0.0
Pima 0.0

Simon_W said...

@ Rob

"Btw Dave what happened to Sams blog ? (And the money he charged people )"

That money is now his, I guess. :D I had also ordered a report from him, it was worth the few bucks.

Richard Rocca said...

@Rob

As always, your comments would be appreciated... https://anthrogenica.com/showthread.php?13487-Was-I2a-Din-originally-Finno-Ugric-(not-Slavic)&p=349992&viewfull=1#post349992

Rob said...

@ Simon
I hadn’t , as I can do my own analysis, but had simply heard others wondering where the site has disappeared to ?

Rob said...

@ RRocca
An interesting idea, and I’m subjectively drawn to it
However, I2a1 was too far spread and too old to be connected to any singular language
Moreover, I2a-Din is too “Western” to be linked to FU. It was probably some non-IE group which then became IE’ized, perhaps Celtic or Germanic, which then mixed into the Baltic continuum and helped galvanise Slavic
Not sure, too early to tell

Alberto said...

@MomOfZoha

I haven't used nMonte3, but isn't that subset() function the same as selecting a few populations in the source file? This is what most of us do, but you were concerned about the choice being biased or something. Well, I guess you can always choose an objective panel of possible source populations and then let the program choose (more or less what most of us do). Now you're actually restricting yourself to a very small number of sources, so forcing the models into those ones.

Anyway, I did give a try to write a script that would choose from all the sources but restricting the amount of them that can be included in the results. But it turned out to be much more problematic than expected. In any case, I'll answer about this and share a couple of scripts in the previous thread (we have to prevent a withdrawal syndrome).

Regarding scaled vs. original, I ran both models side by side for a long time before settling on the eigenvalue scaled version. In practical terms I can say that with good models using proximate sources the differences are small and it's hard to say which one is better. My impression is that eigenvalue scaled is usually preferable, but there's no objective way of measuring that. However, with more difficult models with more distant sources, the differences become very apparent, and much more clearly favour the scaled version. This, together with the fact that the scaled version is the technically correct one for calculations (not for graphs, where the scaled data would make for rather meaningless plots, which may be the reason why the data comes out that way in the first place), could be objective proof that it should be preferable.

As an example, look at these models with the original data (from Global 25):

Mongolian
Andronovo 37.6%
She 28.2%
Barcin_N 14.1%
Levant_N 12.4%
Iran_N 5.4%
CHG 2.3%
Afanasievo 0%
AfontovaGora3 0%
Armenia_EBA 0%
EHG 0%
Germany_MN 0%
Iran_ChL 0%
MA1 0%
Natufian 0%

Distance 5.6826%


Mixe
AfontovaGora3 60.7%
Barcin_N 16.3%
MA1 8.4%
Levant_N 7.7%
She 6.9%
Afanasievo 0%
Andronovo 0%
Armenia_EBA 0%
CHG 0%
EHG 0%
Germany_MN 0%
Iran_ChL 0%
Iran_N 0%
Natufian 0%

Distance 17.5669%

Trying to pair a "pure" East Asian source with some West Eurasian ancients to model those two populations results in completely strange models, with a very small amount of East Eurasian ancestry and other problems. With the eigenvalue scaled data it looks like this:

Mongolian
She 69.1%
AfontovaGora3 21.6%
Levant_N 7.8%
Barcin_N 1.5%
Afanasievo 0%
Andronovo 0%
Armenia_EBA 0%
CHG 0%
EHG 0%
Germany_MN 0%
Iran_ChL 0%
Iran_N 0%
MA1 0%
Natufian 0%

Distance 15.5159%


Mixe
AfontovaGora3 56.7%
She 43.2%
MA1 0.1%
Afanasievo 0%
Andronovo 0%
Armenia_EBA 0%
Barcin_N 0%
CHG 0%
EHG 0%
Germany_MN 0%
Iran_ChL 0%
Iran_N 0%
Levant_N 0%
Natufian 0%

Distance 42.7229%

Not that those models are great, either, but clearly more balanced.

Still, keep testing and see what works for you.

Simon_W said...

@ Rob

"I hadn’t , as I can do my own analysis, but had simply heard others wondering where the site has disappeared to ?"

I had done my own analysis too, at least of my own mtDNA which is relatively rare. But my father's mtDNA is of a more common haplogroup and by chance I had the copious 23andme raw data, so it was convenient to pass this on to someone who's already familiar with the minutiae of the phylogeny and who - unlike me - has access to a large mtDNA collection. He found a sample with private mutations that are no longer private now because my father has them too.

But yeah, curious the blog is down.

Simon_W said...

Regarding the scaled vs. non-scaled issue, I've tried the scaled coords too, now. (This time properly, with the target file also scaled, lol.) It's really hard to decide which models make more sense, as Alberto said. Just to describe the most striking differences:

- In the scaled version my East Prussian grandmother gets a few percent less Turlojiske3 which are compensated with a few percent Hungary_BA I1502. Historically that's plausible. It would make sense that the Prussian Balts had assimilated some Gothic remainders. But did it happen? We don't know.

- My Alemannic grandparents get much less Hungary_IA and much more Tuscan in the scaled version. My grandfather also gets much less French and French_South. To me these large Tuscan scores look dubious. Of course there were Romans in Switzerland and southern Germany. But 29.2%? I'm not buying this. A mix of French-like Gauls and Hungary_IA-like Thraco-Cimmerians would historically make more sense, I guess. Moreover my grandfather gets lots of Nordic_IA and Tuscan in the scaled version, but hardly anything French-related. This doesn't make sense, because the Romans merged with the Celts into Gallo-Romans.

- My maternal side gets a few percent less Samaritan in the scaled version, which prima facie may seem more plausible. Then again this is partly compensated by a reappearance of 2.35% Mozabite. But it's possible I'd say.

Richard Rocca said...

@Rob

Yeah, I think I2a-Din as a primary Finno-Ugric vector is highly unlikely.

Simon_W said...

Maybe I just didn't use the right populations to model my Alemannic grandparents. Indeed these inflated Tuscan scores might mean that in reality smaller numbers of more exotic migrants arrived, e.g. from Anatolia and the Levant. And the lack of French in my grandfather could be due to his ancestry being from a part of the Roman empire where the provincials were Germanic rather than Gaulish.

Arza said...

@ Simon_W
"In the scaled version my East Prussian grandmother gets a few percent less Turlojiske3 which are compensated with a few percent Hungary_BA I1502. Historically that's plausible. It would make sense that the Prussian Balts had assimilated some Gothic remainders."

How could Prussian Balts have assimilated Gothic remainders from Hungary?

And how does a Hungary_BA sample have anything in common with Goths?

Double test drive - Global 25 scaled, Xmix

Hungary_BA:I1502
Tiszapolgar_ECA:I2793 29.3%
Baltic_BA:Turlojiske3 16.85%
Germany_MN:I0560 13.55%
Baltic_BA:Kivutkalns222 12.3%
Baltic_BA:Kivutkalns215 9%
Narva_Lithuania:Kretuonas1 6.1%
Narva_Lithuania:Donkalnis6 4.5%
Baltic_BA:Kivutkalns209 3.95%
Narva_Lithuania:Donkalnis7 3.1%
Bell_Beaker_Germany:I0111 0.9%
Baltic_BA:Kivutkalns25 0.3%
Narva_Lithuania:Kretuonas4 0.1%
Starcevo_EN:I1880 0.05%

Distance 1.2578%

Hungary_BA:I1504
Starcevo_EN:I1880 20.95%
Baltic_BA:Kivutkalns222 17.65%
Germany_MN:I0172 16.35%
Bell_Beaker_Germany:I1549 14.1%
Baltic_BA:Turlojiske3 12.1%
CWC_Germany:I1532 8.7%
Narva_Lithuania:Spiginas1 5.25%
Baltic_BA:Kivutkalns215 4.6%
Bell_Beaker_Germany:I0111 0.25%
Starcevo_EN:I1878 0.05%

Distance 1.7209%

Both are more Narva-shifted Baltic_BA (Welzin!?!) with two streams of Neolithic ancestry - one from Hungary and one from Germany. On top of that I1504 does have admixture from a standard Yamnaya/Farmer population (BB/CWC).

Arza said...

Interesting thing happens when Polish average is dropped into the same model as above:

Polish
CWC_Germany:I0049 20.55%
Baltic_BA:Kivutkalns19 14.05%
Baltic_BA:Kivutkalns207 13.5%
Baltic_BA:Turlojiske3 9.85%
Tiszapolgar_ECA:I2793 8.35%
Starcevo_EN:I1880 8.3%
Baltic_BA:Kivutkalns209 7.45%
Tisza_LN:I0449 6.1%
Baltic_BA:Kivutkalns215 5.55%
CWC_Germany:I0104 3.55%
CWC_Baltic_early:Gyvakarai1 1.6%
Bell_Beaker_Germany:I1549 0.55%
Bell_Beaker_Germany:I0113 0.35%
CWC_Baltic_early:Plinkaigalis242 0.25%

Distance 0.8594%
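Because this output is per individual sample, it can help to add up the percentages by population label before comparing it with simpler models. A minimal sketch, assuming each line has the form Population:SampleID followed by a percentage, as in the list above; the function name is hypothetical.

```python
from collections import defaultdict

# The per-sample Polish model quoted above.
polish_model = """\
CWC_Germany:I0049 20.55%
Baltic_BA:Kivutkalns19 14.05%
Baltic_BA:Kivutkalns207 13.5%
Baltic_BA:Turlojiske3 9.85%
Tiszapolgar_ECA:I2793 8.35%
Starcevo_EN:I1880 8.3%
Baltic_BA:Kivutkalns209 7.45%
Tisza_LN:I0449 6.1%
Baltic_BA:Kivutkalns215 5.55%
CWC_Germany:I0104 3.55%
CWC_Baltic_early:Gyvakarai1 1.6%
Bell_Beaker_Germany:I1549 0.55%
Bell_Beaker_Germany:I0113 0.35%
CWC_Baltic_early:Plinkaigalis242 0.25%
"""


def totals_by_population(text):
    """Sum the sample percentages under each population prefix."""
    totals = defaultdict(float)
    for line in text.strip().splitlines():
        label, percent = line.rsplit(" ", 1)
        population = label.split(":", 1)[0]
        totals[population] += float(percent.rstrip("%"))
    # Round away floating-point noise from the additions.
    return {pop: round(total, 2) for pop, total in totals.items()}


totals = totals_by_population(polish_model)
for pop, total in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{pop} {total}%")
```

On these figures the Baltic_BA samples add up to 50.4% and the CWC_Germany samples to 24.1%.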

Reduced to a 3-way model:

Polish
Baltic_BA:Kivutkalns19 48.8%
CWC_Germany:I0049 30%
Starcevo_EN:I1880 21.2%

Distance 1.3645%

Simon_W said...

@ Arza

Well, the Gothic remainders weren't from Hungary, but from the mouth of the Vistula. And as the preliminary rumours suggested, they were similar to Hungary_BA I1502, at least in the first two dimensions, and like the latter with a striking WHG/SHG shift. So what I was presupposing wasn't a connection to Bronze Age Hungary; rather I was using Hungary_BA I1502 as a substitute for South Baltic samples for the moment, since we don't have them in the datasheet. It seems to work, according to nMonte, though the Tollense valley samples would be preferable.

Simon_W said...

My final take for the moment, arrived at with nMonte 1.0 and scaled data:

East Prussian grandmother:

"Nordic_IA" 39.4
"Dutch" 32.55
"Baltic_BA:Turlojiske3" 22.85
"Hungary_BA:I1502" 5.2
"England_Anglo-Saxon" 0
"Slav_Bohemia" 0
"Halberstadt_LBA" 0
"Polish" 0

distance%=2.3484

grandfather from southern Baden and northwestern Switzerland:

"Nordic_IA" 76.8
"Levant_BA" 11.45
"French" 7.45
"Anatolia_ChL" 4.3
"Hungary_IA" 0
"French_East" 0
"Italian_Tuscan" 0
"Halberstadt_LBA" 0
"Scythian_Samara" 0
"Hungary_BA:I1504" 0
"Mycenaean" 0
"Anatolia_BA" 0
"England_Roman_outlier" 0

distance%=2.813

grandmother from Swabia and northwestern Switzerland:

"French" 49.55
"Nordic_IA" 19.6
"French_East" 10.55
"Anatolia_ChL" 10.3
"Hungary_BA:I1504" 3.8
"Hungary_IA" 3.35
"Scythian_Samara" 1.95
"Levant_BA" 0.9
"French_South" 0
"Italian_Tuscan" 0
"Halberstadt_LBA" 0
"Mycenaean" 0
"Anatolia_BA" 0
"England_Roman_outlier" 0

distance%=0.9679

maternal half of mine, 50% from Romagna, Italy:

"French" 57.3
"Sicilian_West" 26.05
"Italian_Bergamo" 16.65
"Nordic_IA" 0
"Italian_South" 0
"Italian_Tuscan" 0
"Sicilian_East" 0

distance%=2.8781

or in other terms:

"French" 46.2
"French_East" 16.05
"French_South" 16.05
"Samaritan" 10.05
"Minoan_Lasithi" 5.5
"Mozabite" 2.4
"Remedello_BA" 1.45
"Nordic_IA" 1.35
"Hungary_BA:I1504" 0.95
"Hungary_IA" 0
"Cypriot" 0
"Mycenaean" 0
"Anatolia_BA" 0
"Anatolia_ChL" 0
"Levant_BA" 0
"England_Roman_outlier" 0
"Druze" 0

distance%=2.7191

MomOfZoha said...

@Alberto:
Thank you for coding up right away! Also thank you for your detailed examples. I just DLed and started looking at your XMix now. While doing so, I decided to go ahead and also look at the nMonte code and 4Mix code too. I wonder why a function that simply does the following is not defined anywhere:

Given a particular set of source population vectors s_1, s_2, ... s_k and a given target population average vector t: Find assignment of weights w_i to each s_i such that matrix s times vector w yields minimum distance to t.

This is constrained linear least squares approximation: Given matrix s and vector t, find a weight vector w with non-negative entries that sum to 1 such that ||ws - t|| is minimum.

I have never coded in R previously, (the closest scripting lang I used was perl some years ago), but given that everyone seems to use R for statistical processing: Isn't there just a linear algebra or stats R library that must solve that exactly and efficiently?

O.w. from a quick glance, it seems that everyone is using a kind of heuristic to calculate those weights given populations (e.g. using normalized inverse distances, greedy updated projections, etc.). But, I'd like to know just a simple function that does something like

weight_vector = findLSWeights(source_matrix,target_vector)

which should literally be in some R library.

It will increase the run time of the algorithm, but may yield more accurate results.
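The problem described above can be sketched directly. This is a minimal illustration, assuming the weights must be non-negative and sum to 1 (as the percentages in the models in this thread do). It uses projected gradient descent, which is a swapped-in textbook technique and not what nMonte, 4mix, or XMix do. The function names and the coordinates are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = 1}."""
    cumulative = 0.0
    theta = 0.0
    for i, value in enumerate(sorted(v, reverse=True), start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / i
        if value - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def fit_weights(sources, target, iterations=5000):
    """Minimise ||sum_i w_i * sources[i] - target|| over the simplex."""
    k = len(sources)
    dims = len(target)
    weights = [1.0 / k] * k
    # Conservative step size from an upper bound on the gradient's
    # Lipschitz constant (2 x the sum of squared source coordinates).
    step = 1.0 / (2.0 * sum(x * x for row in sources for x in row))
    for _ in range(iterations):
        mix = [
            sum(w * row[d] for w, row in zip(weights, sources))
            for d in range(dims)
        ]
        residual = [m - t for m, t in zip(mix, target)]
        gradient = [
            2.0 * sum(r * x for r, x in zip(residual, row)) for row in sources
        ]
        weights = project_to_simplex(
            [w - step * g for w, g in zip(weights, gradient)]
        )
    return weights


# Hypothetical two-dimensional coordinates, for illustration only.
source_matrix = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
target_vector = [0.3, 0.2]
weight_vector = fit_weights(source_matrix, target_vector)
print([round(w, 4) for w in weight_vector])
```

An exact quadratic programming solver would reach the same kind of answer in one call; the iterative version is shown here only because it fits in a few lines without any dependencies.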

Regarding the too-many combinations needed in XMix, Alberto: While I lost my train of thought due to excitement over my friend's phone call in the previous blog post comment, after sleeping on it my thought-train returned: I really believe that using an existing clustering to help guide the whole process should both speed up the combinatorial search greatly (especially in your XMix), AND reduce the likelihood of what Davidski terms "overfitting". In his example, overfitting tends to occur due to use of multiple similar population sources. One would expect "similar populations" to be in the same cluster. Therefore, if we restrict source combinations to include only populations from different clusters, then this may be an improvement, though it has the extra overhead of some kind of clustering files as input. One could also attempt to accomplish something similar by checking combinations of sources such that all pairwise distances between the source populations considered obey some lower bound: In fact, you can add this variation directly to your XMix by a single if-statement in the loop considering substitutions. However, it would be faster to not even consider those combinations at all.
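The pairwise lower bound idea can be sketched as a filter over candidate source sets. This is a minimal illustration and not part of XMix; the function names, the population labels, the coordinates, and the threshold are all hypothetical.

```python
import math
from itertools import combinations


def distance(a, b):
    """Plain Euclidean distance between two coordinate lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def well_separated_combinations(sources, size, min_distance):
    """Yield source sets whose members are all at least min_distance apart."""
    for names in combinations(sorted(sources), size):
        if all(
            distance(sources[a], sources[b]) >= min_distance
            for a, b in combinations(names, 2)
        ):
            yield names


# Hypothetical coordinates: pop_a1 and pop_a2 are near-duplicates.
sources = {
    "pop_a1": [0.10, 0.20],
    "pop_a2": [0.11, 0.20],
    "pop_b": [0.60, -0.10],
    "pop_c": [0.05, 0.35],
}

allowed = list(well_separated_combinations(sources, 2, min_distance=0.05))
print(allowed)
```

Only the pair of near-duplicates is dropped, so any fitting step that follows never sees two almost identical references in the same model.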

(continued in next post due to character limit...)

MomOfZoha said...


E.g. if a person is of mixed descent including Northwest Caucasian, Central Asian, and Anatolian descent (which person might that be, hmmm), then it doesn't even make sense to consider both Abkhasian and Kabardin as simultaneous different source populations as the results may yield both overfitting and inefficiency. I don't think that any of these vector methods can immediately separate out something like Kabardin versus Abkhasian ancestry in an individual who is admixed with other distant things too. On the other hand, if a person is purely Northwest Caucasian, then that is a different story, and in that case, maybe a more limited input clustering could be considered (in clustering Northwest Caucasians only, Abkhasian might anyway be in a different cluster than Kabardin).

Combinatorially speaking, if we have, say, 5 clusters with 3 populations per cluster, and we want to pick at most one pop from each cluster, that would yield 3^5 = 243 considerations. Otherwise, without the cluster constraint, choose(18,5) = 8568.
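As a quick check of these counts, here is a small Python sketch with made-up cluster and population names. Picking exactly one population from each of 5 clusters of 3 gives 3^5 combinations. Five clusters of three is 15 populations in total, so the unconstrained comparison for that setup is choose(15,5), while choose(18,5) is the count for a pool of 18 populations.

```python
from itertools import product
from math import comb

# 5 hypothetical clusters with 3 hypothetical populations each.
clusters = {f"cluster{c}": [f"pop{c}_{i}" for i in range(3)] for c in range(5)}

# One population from each cluster.
constrained = list(product(*clusters.values()))
n_pops = sum(len(members) for members in clusters.values())

print(len(constrained))   # 243
print(comb(n_pops, 5))    # 3003  (any 5 of the 15 populations)
print(comb(18, 5))        # 8568  (any 5 of 18 populations)
```

The gap is modest at this size but widens quickly as the source file grows.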

Another advantage of using meta-pop clusters as guides is that one might also use the "cluster average" as a heuristic to help guide the search: Essentially solving the weighted least squares on those before proceeding to finer granularities. And, if various hierarchies of clustering are given, then this may even be more useful...
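A minimal sketch of the "cluster average" step, in Python. The function name, the cluster labels and the coordinates below are hypothetical. The idea is only that each cluster is collapsed to the mean of its members, so that a first coarse fit can run on a handful of centroids before moving to the individual populations.

```python
def cluster_centroids(coords, cluster_of):
    """Average the coordinate vectors of all populations in each cluster."""
    sums, counts = {}, {}
    for pop, vec in coords.items():
        c = cluster_of[pop]
        acc = sums.setdefault(c, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[c] = counts.get(c, 0) + 1
    return {c: [x / counts[c] for x in acc] for c, acc in sums.items()}

# Toy 2-D coordinates and cluster labels (made up for illustration).
coords = {
    "Abkhasian": (0.10, 0.20),
    "Kabardin":  (0.12, 0.18),
    "Turkmen":   (0.40, 0.50),
}
cluster_of = {
    "Abkhasian": "NW_Caucasus",
    "Kabardin":  "NW_Caucasus",
    "Turkmen":   "Central_Asia",
}

centroids = cluster_centroids(coords, cluster_of)
print(centroids)
```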

I'm not the fastest coder and have way too many life-interruptions, especially to dive into R now. So, I'm happy if you or anyone else would like to experiment with any further ideas.

I can help with other visualizations that do not necessarily yield admixture information. Since last time, I haven't generated another graph yet on different parameters and important subsets (ancients only, modern eurasia only, etc.), but can do that again and also share graph files if there is interest. Just keep in mind that I do have a lot of "real life interruptions" so please be patient with me.

Cheers.

MomOfZoha said...

@EastPole:
"MomOfZoha, do you know the etymology of Mongolian ‘arza’ “vodka”, can it be a borrowing from IE rta : arza: aša : ori : ari itp. It looks very similar to Slavic Jari : Jarzi, the god of fertility and fire."

Interesting observation. My guess is that vodka was likely introduced to Mongolians by others in the first place, considering the well-known alcohol intolerance associated with Mongolians. Not everyone can drink like a Slav of course! Unfortunately, the same alcohol intolerance exhibited by Mongolians is also exhibited by Native Americans, for whom alcoholism has been another terrible side effect of colonization...

Linguistically, "azer" also means "fire" in Iranian languages (hence "Azerbaijan" and "Atropatene"), possibly related to "astri-" which, as you know, means "star".

Unfortunately, I am not a linguist, despite a great interest in linguistics way-back-when... 20 years ago, I had all kinds of national (USA level) awards in Latin and Greek (self-taught) derivatives as well as mythology. But, for my immigrant parents, well, what is the "use" of mythology? And, who cares about Latin derivatives except to better understand medical terms? This band name exactly captures most eastern first generation immigrant parents:
https://doctorsandengineers.bandcamp.com/
(My bro became the doctor.)

In retrospect, I cannot blame my folks. At least they cared to educate their daughter, which unfortunately cannot be said of many of their own family from Konya. (And, I selfishly admit it is nice to make more money than the unfortunate PhDs in linguistics -- *their* situation has got to change.)

Davidski said...

@All

Olalde et al. 2018, aka the Bell Beaker Behemoth, will be out in Nature this Wednesday, with over twice as many ancient samples as its preprint.

http://eurogenes.blogspot.com/2017/05/the-bell-beaker-behemoth_10.html

So expect a huge update to the Global 25 datasheets on Thursday, with several hundred new ancient samples being added.

Alberto said...

@MomOfZoha

Thanks for the interest and the comments. No time for a detailed reply now, so just some quick comments.

I had never coded in R before either, so I had to write Xmix (a year ago or so) with the manual open to learn the basic syntax and functions. I did look for something like what you suggest, but from what I remember there's nothing that can really help to solve this problem so easily. Moreover, as you say:

It will increase the run time of the algorithm, but may yield more accurate results.

And my main purpose was to reduce the runtime, since we can already get a high accuracy with this "brute force" kind of approach. When you run hundreds of models you really come to appreciate the speed.

And this brings me to the next part of your comment regarding clustering. As far as I know, nMonte was designed to run with a very large number of source populations. But for me personally (and probably many others here who are more interested in ancient samples) this was not really necessary. We had 4mix before, which could run very fast by checking all the combinations (without decimals) for 4 source populations. This was a bit too limiting, since many times we need more than 4 (maybe just 5 or 6) to make good models, but even going to just 6 increases the combinations into the realm of billions. And then there are cases when you want 10-12 source populations, so that approach didn't scale.
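To get a feel for how fast an exhaustive grid grows, here is a rough Python sketch. It counts the ways to split a fixed number of equal units among the sources (stars and bars). The function `grid_size` and its `steps` parameter are assumptions for illustration. How 4mix actually enumerates its combinations, and at what step size, is not stated in this thread, so the exact figures will differ.

```python
from math import comb

def grid_size(n_sources, steps=100):
    """Number of ways to split `steps` equal units among n_sources,
    zeros allowed: C(steps + n_sources - 1, n_sources - 1)."""
    return comb(steps + n_sources - 1, n_sources - 1)

for k in (4, 5, 6):
    # Whole-percent grid versus a half-percent grid.
    print(k, grid_size(k), grid_size(k, steps=200))
```

Under this counting, each extra source multiplies the grid by a large factor, and a half-percent grid with six sources is already in the billions, which is why a full enumeration stops being practical so quickly.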

And this is why I wrote Xmix. To be able to take any number of source populations but to adjust the number of iterations to the difficulty of the model (which usually implies only a handful of sources). And once you can run that faster, you can go ahead and run it against many target populations in one go, which gives a very nice and informative overview.

So the problem with clusters is much more easily solved by humans, in this case. We just create a source file with the populations that are interesting to test for a given target -or set of targets- (4, 8, 12... but not 200) and let the algorithm find the best solution (which it does quite fast).

Still, for a problem like the one you were having with your own case, I thought it could be useful to test that other approach of restricting the number of populations in the results (rather than in the sources). So ancestorsMix can do that, run with 200 sources and still give you a result in less than a second (not the best possible one, but one that can help you narrow things down if one is totally lost). I can't say that the program quite works as I thought it could. It's rather a failure of the original intent, but there it is in case someone can find any use for it.

In any case, since you looked at the code of Xmix you probably noticed it's written in a rather verbose way, clean and extensively commented. This is so that anyone can get it and test whatever changes might suit their own needs, or simply improve the code for everyone else to use (or add features or whatever).

Re: overfitting, I've never been worried about getting a too good distance. In fact, that's the whole point, to get the lowest possible distance (the truth should be as close to 0 as it gets). I would rather call that problem "overfeeding". That is, it's not a problem of the resulting distance being too low, it's a problem of feeding the program with too many similar sources that will only make the outcome unrealistic, instead of close to the truth.
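For readers who want to see what the distance figure measures, here is a minimal Python sketch. It assumes the distance is the plain Euclidean distance between the target's coordinates and the weighted mix of the source coordinates, with distance% being that value times 100 (consistent with the "distance%=6.9025 / distance=0.069025" output in the post). The function names and the toy coordinates are made up for illustration.

```python
from math import dist

def mix(sources, weights):
    """Weighted sum of source coordinate vectors (weights sum to 1)."""
    dims = len(next(iter(sources.values())))
    out = [0.0] * dims
    for name, w in weights.items():
        for i, x in enumerate(sources[name]):
            out[i] += w * x
    return out

def distance_pct(target, sources, weights):
    """Euclidean distance between target and the mix, times 100."""
    return 100.0 * dist(target, mix(sources, weights))

# Toy 2-D example (hypothetical coordinates).
sources = {"X": (0.0, 0.0), "Y": (1.0, 0.0)}
target = (0.5, 0.03)
d = distance_pct(target, sources, {"X": 0.5, "Y": 0.5})
print(round(d, 4))
```

Adding a near-duplicate of X or Y to `sources` can never make the best achievable distance worse, which is the "overfeeding" point: a lower distance alone does not show that the extra source is real.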

MomOfZoha said...

@Alberto:

Sorry, it seems I did not see ancestorsMix and only see your shared Xmix folder. If it's no problem, could you please share the ancestorsMix in this thread? I just looked at what you shared after Arza's correction.

Also, btw, in the meantime, I caught an error in what I myself wrote above, and I have one more correction. The small error came from dynamically changing my intended example (deleting/rewriting): I intended to compare choose(15,5) to 3^5. Not a huge fold difference at this small stage, but the difference becomes huge as the number of pops in the source file increases.

Bigger issue: My original intuition in my reply to the previous blog post regarding target distance from the convex hull defined by source pops is more appropriate than framing the problem as a constrained least squares minimization (although it is also the latter). In the computational geometry framework, the high dimensionality of the problem and the difficulty of even representing the hull are issues such that an efficient solution to the problem is not at all "well known". I found a 2009 doctoral dissertation on the more general problem of finding the min distance between two convex hulls (the target point can be a degenerate hull), and that's a fairly recent doctoral dissertation. So... I see why y'all went with heuristics for computing the weight vector given particular source-pops.
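To make the geometric framing concrete, here is a small Python sketch that finds the point of the sources' convex hull nearest to a target, that is, non-negative weights summing to 1 that minimise the Euclidean distance. It uses the Frank-Wolfe method with exact line search. That choice is mine for illustration only and is not what nMonte or Xmix do (they use their own heuristics). The function name and the toy coordinates are assumptions.

```python
from math import sqrt

def nearest_mix(target, sources, iters=1000):
    """Frank-Wolfe sketch: minimise |sum_i w_i * s_i - target|^2
    subject to w_i >= 0 and sum(w) == 1. Returns (weights, distance)."""
    n, dims = len(sources), len(target)
    w = [1.0 / n] * n
    x = [sum(w[i] * sources[i][d] for i in range(n)) for d in range(dims)]
    for _ in range(iters):
        r = [x[d] - target[d] for d in range(dims)]  # residual
        # Source whose direction lowers the squared distance fastest.
        j = min(range(n),
                key=lambda i: sum(r[d] * sources[i][d] for d in range(dims)))
        step = [sources[j][d] - x[d] for d in range(dims)]
        denom = sum(s * s for s in step)
        if denom == 0.0:
            break
        gamma = -sum(r[d] * step[d] for d in range(dims)) / denom
        gamma = max(0.0, min(1.0, gamma))  # exact line search, clipped to [0, 1]
        if gamma == 0.0:
            break  # no source improves the fit: current mix is optimal
        w = [(1.0 - gamma) * wi for wi in w]
        w[j] += gamma
        x = [x[d] + gamma * step[d] for d in range(dims)]
    return w, sqrt(sum((x[d] - target[d]) ** 2 for d in range(dims)))

# Toy 2-D sources (hypothetical coordinates).
sources = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
w_in, d_in = nearest_mix((0.25, 0.25), sources)   # target inside the hull
w_out, d_out = nearest_mix((2.0, 0.0), sources)   # target outside the hull
print(w_in, d_in)
print(w_out, d_out)
```

A target inside the hull can be matched with distance close to 0, while a target outside it is matched by the nearest point on the hull's boundary. In 25 dimensions with hundreds of sources the same loop still runs, but convergence can be slow, which fits the point that there is no obvious off-the-shelf fast solution.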

I completely agree with you about overfitting, which is why I also prefer to frame the inherent issue by which "overfitting" can be a side effect instead: Using too-similar source pops. O.w. of course one should attempt to find the source pops and corresponding weights yielding min distance to target.

I will test your code sometime this week. Thanks again, Alberto. Best.

Alberto said...

@MomOfZoha

Here is the ancestorsMix prototype:

https://drive.google.com/file/d/16PUei4HRTeWz317nKrJ8Y4TQvJEETP0K/view?usp=sharing

It's hardly tested (I don't have the right mixed modern sample for it), but to know what kind of thing to expect, here's a quick example. Using a source file with 66 populations (ancients + Han + Yoruba), this is what I get with nMonte:

Spanish_Castilla_La_Mancha
"Iberia_EN" 34.95
"Bell_Beaker_Germany" 28.45
"CWC_Germany" 23.25
"Natufian" 5.7
"Ireland_MN" 3.85
"CHG" 2.8
"Greece_Peloponnese_N" 0.7
"Yoruba" 0.3

distance%=1.0856

(Elapsed time: 79.739 secs)

And this is with ancestorsMix when allowing 5 generations (a minimum of 3.12% per population):

Spanish_Castilla_La_Mancha
Iberia_EN 25%
CWC_Germany 21.88%
Ireland_MN 15.62%
Portugal_MBA 12.5%
Yamnaya_Kalmykia 9.38%
Bell_Beaker_Germany 6.25%
Natufian 6.25%
CHG 3.12%

Distance 1.2553%

(Elapsed time: 0.173 secs)

As you see it's not anywhere near the same. It does pinpoint the right populations, roughly, and gets a decent model, not too far from nMonte's one by distance. Which is not too bad for something that runs in 0.173 seconds and avoids overfitting. But it's not really good enough either. Full output.
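The arithmetic behind the generation setting can be checked with a short Python sketch. It does not reproduce how ancestorsMix searches, which is not shown in this thread. It only assumes, consistent with the output above, that with g generations every share is a whole number of 1/2^g "ancestor slots". The function names are made up for the example.

```python
def min_share_pct(generations):
    """Smallest possible share, in percent, with 2**generations slots."""
    return 100.0 / 2 ** generations

def to_slots(percentages, generations):
    """Convert reported (rounded) percentages back to whole slot counts."""
    slots = 2 ** generations
    return [round(p / 100.0 * slots) for p in percentages]

# Shares reported above for Spanish_Castilla_La_Mancha at 5 generations.
reported = [25, 21.88, 15.62, 12.5, 9.38, 6.25, 6.25, 3.12]
slots = to_slots(reported, 5)

print(min_share_pct(5))    # 3.125
print(slots, sum(slots))   # [8, 7, 5, 4, 3, 2, 2, 1] 32
```

The eight shares add up to exactly 32 slots, which is why no population can appear below about 3.12% at this setting.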

I have no idea how to go on from here, so I guess it's a dead end unless someone else finds some way to make it really useful. Maybe for a quick overview of many target populations? For example, using the same source file with a target file containing 61 modern populations (European and West Asian), you get something like this also for 5 generations (in 15 secs run time):

https://drive.google.com/file/d/1sKN7c9S3481AiNNJg8kR-YIhRMXphoT0/view?usp=sharing

Matt said...

@Davidski, in case you're interested, it looks like a few samples might be outliers to me:

Modern: Ukrainian - 597_R01C01 (looks shifted towards Komi), 597_R01C02 (looks shifted away far from other Ukrainians), Mordovian - 491_R02C02 (looks shifted towards West Asia), Belarusian - 521_R01C02 (looks strongly shifted towards Baltic_BA), Belarusian - 520_R01C01 (looks shifted towards SE Europe)

Ancient: Corded_Ware_Germany - I1540 (looks WHG shifted), Protoboleraz_LCA - I2788 (looks to have steppe ancestry) (But I remember you know about these two?).

(NA17374 also seems like a slight outlier for Greece. Among the Greeks generally it looks like NA17374, NA17376, GreekGralPop13, NA17372 all form a cluster of more Sicilian like Greeks, while the other Greeks are Albanian like Greeks, overlapping the end of Slavic/Eastern European clusters).

This is just based on reprocessing data through PAST3 PCA: https://imgur.com/a/fUMV7

MomOfZoha said...

@Alberto:
I just tried your ancestorsMix, and it is awesome! Not "failed" at all. The greatest usefulness is trying it on totally unpruned source files, getting meaningful results. Running my folks' data using the same source file (with all pop aves) I get a ridiculously long list of admixture proportions AND worse fits than what I get with your ancestorsMix!

So awesome that you just coded that up, and totally NOT useless for me.

You are picking up my mom's recent Caucasus ancestry (that I had previously inferred from her FTDNA matches) in addition to her Central Asian stuff, mixed in with others. There's a big decrease in her best fit distance ancestry proportions via ancestorsMix going from 3 gen to 4 gen, while 2 gen also gives useful info (except that she is definitely not of any recent Italian descent however much she loves her Italian American daughter-in-law):

2 gen:
MoZ_Mother
Armenian 25%
Italian_South 25%
Turkish 25%
Turkmen 25%

Distance 1.907%

--
3 gen:
MoZ_Mother
Turkish 37.5%
Azeri_Dagestan 12.5%
Kabardin 12.5%
Kumyk 12.5%
Levant_N 12.5%
Uzbek 12.5%

Distance 1.7026%

--
4 gen:
MoZ_Mother
Kabardin 18.75%
Azeri 12.5%
Georgian_Jew 12.5%
Turkish 12.5%
Avar 6.25%
Azeri_Dagestan 6.25%
Daur 6.25%
LBKT_MN 6.25%
Levant_N 6.25%
Macedonian 6.25%
Tajik_Yagnobi 6.25%

Distance 1.399%

For my father-in-law, his best-fit distance% across the different gen parameters is basically ~1.5. Throughout the runs, Armenia_MLBA and Assyrian feature prominently, as naturally do all kinds of Iranian stuff and a bit of Turkish.

As for my dad, there's a big decrease in his best-fit distance from 2 gen to 3 gen. At 2 gen, he gets simply half Sephardic Jewish and half Turkish. After that though, he keeps getting more ancient pop affinities than recent pops, with Iran_IA being the most prominent. At 4 gen, here's his ancestorsMix output:

MoZ_Father
Iran_IA 31.25%
Mentese_N 12.5%
Turkish 12.5%
Altaian 6.25%
Georgian_Imer 6.25%
Georgian_Laz 6.25%
Levant_BA 6.25%
Poltavka_outlier 6.25%
Tabasaran 6.25%
Tepecik_Ciftlik_N 6.25%

Distance 1.7277%

My dad matches several Lebanese Christians on FTDNA, but I have not confirmed any FTDNA Jewish matches for him. While "some kind of Jewish" consistently also pops up in both of my parents' 2-way, 3-way, 4-way GEDmatch admixture results across various projects, neither of them register *any* Jewish ancestry -- neither Sephardic nor Ashkenazi -- via 23andme and FTDNA. My mom does have close Ashkenazi Jewish FTDNA matches, all of whom are Northeast European Jews, including an X-match. Although neither FTDNA nor 23andme picks up any "Jewishness" for either of my parents, *I* do register as 3% Sephardic Jewish by FTDNA...

Well, anyway, thank you for your program, Alberto. It is really useful for large and messy source files especially..

Alberto said...

@MomOfZoha

Cheers! I'm quite surprised that you found it so useful for your kind of case. I didn't have anything similar to test it with, so it's nice that it can be helpful in some cases.

I still wouldn't be too confident in the accuracy of the results. They should be indicative, but nothing definitive. As a sanity check, you might want to run Xmix with the same source file and the same targets. It will take quite a bit longer to run, but it should give a better result.

Thanks for testing it and the positive feedback. Glad to know it was not a total failure or waste of time.

Matt said...

Btw, in case anyone is using nMonte3, these may be of interest for use as the calc files

1: https://pastebin.com/51VNhNxs
2: https://pastebin.com/uR9JE6U0

Rather than prune similar populations out, this just puts all individuals from similar populations under the same cluster, e.g. Poltavka, Afanasievo, Yamnaya individuals all under Steppe_EMBA.

So that the nMonte3 result will just show x% Steppe_EMBA, rather than any of these populations individually. This might help produce fits that are easier to interpret, while including the advantage of having populations that cover a lot of the PCA space rather than single-point population averages.

I've kept post-EBA populations and populations that seem like more complex mixes out of the file (barring the Spiginas2 outlier, as it seems important for NE Europe).

MomOfZoha said...

@Alberto:
Yup, definitely useful for me, even with the disclaimer to take actual proportions with a grain of salt. To recheck, I ran nMonte3 again on similar source files (including ancients files etc.), and then your ancestorsMix on same, and still for all of my family I am getting: way too many source pops output via nMonte3 in addition to worse distances than your ancestorsMix even with just a few generations. Also, ancestorsMix is clearly picking out the kinds of ancestry that I know for a fact exist in each family member, whereas that is not at all obvious in an nMonte output with 30+ sources. I know that Xmix and nMonte will both be very useful when the number of sources is severely limited. BUT, ancestorsMix helps guide one in the process of limiting source pops in the first place (I'll run the ancestorsMix output source pops as input ref pops to XMix and nMonte later). So, Kudos, and as we say in Turkish "ellerine saglik", in Farsi "destetun dert ne konid": Health to your hands. :)

@Matt:
Thank you for those ancients files. I've been wanting someone to share exactly such a file.

I know that ancestorsMix isn't yet configured to do the pop-groups, but it's easy to figure out in retrospect. For fun, I ran my dad's info on those files (which at this point I am assuming are "scaled" wlog):

nMonte3 gives a distance% of 3.16 with 21 source pops including several 0.2 contributions. ancestorsMix at just 4 generations gives a distance% of 2.175 with Armenia_EBA most significant:

MoZ_Father
Armenia_EBA:Armenia_EBA_I1635 31.25%
Levant_BA:Levant_BA_I1705 18.75%
Steppe_EMBA:Poltavka_I0374 12.5%
Armenia_EBA:Armenia_EBA_I1658 12.5%
Anatolian_Aegean_N:Mentese_N_I0723 6.25%
Central_Europe_EN:LBK_EN_I0026 6.25%
East_Asian_Paleolithic:Tianyuan_TY 6.25%
Hungary_MNChl:LBKT_MN_I1904 6.25%

Distance 2.1749%

Note that total Armenia_EBA is 43.75%.
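Summing the per-individual rows into their group labels is easy to script. Below is a small Python sketch that assumes the "Group:Individual" label format shown in the output above. The function name `cluster_totals` is made up for the example.

```python
from collections import defaultdict

def cluster_totals(rows):
    """Sum percentages per group, taking the part before ':' as the group."""
    totals = defaultdict(float)
    for label, pct in rows:
        group = label.split(":", 1)[0]
        totals[group] += pct
    return dict(totals)

# Rows copied from the MoZ_Father output above.
rows = [
    ("Armenia_EBA:Armenia_EBA_I1635", 31.25),
    ("Levant_BA:Levant_BA_I1705", 18.75),
    ("Steppe_EMBA:Poltavka_I0374", 12.5),
    ("Armenia_EBA:Armenia_EBA_I1658", 12.5),
    ("Anatolian_Aegean_N:Mentese_N_I0723", 6.25),
    ("Central_Europe_EN:LBK_EN_I0026", 6.25),
    ("East_Asian_Paleolithic:Tianyuan_TY", 6.25),
    ("Hungary_MNChl:LBKT_MN_I1904", 6.25),
]

totals = cluster_totals(rows)
print(totals["Armenia_EBA"])   # 43.75
```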

Pretty good distance for a modern's ancients-only-combo.

MomOfZoha said...

@Alberto's ancestorsMix 3-gen output for Kalash average using the ancients file that @Matt uploaded:

Kalash
Steppe_EMBA:Poltavka_I0371 37.5%
Iran_EN_LN:Iran_N_AH1 37.5%
East_Asian_Paleolithic:Tianyuan_TY 12.5%
Armenia_EBA:Armenia_EBA_I1658 12.5%

Distance 5.4852%

Compare to nMonte3 output with ncycles raised to 5K from the 1K default (and still taking quite long):

"distance%=6.2048"

Kalash

Iran_EN_LN,47
Steppe_EMBA,33.2
Ust_Ishim,5.2
Armenia_Chl,5
Austronesian_Holocene_Ancient,2.8
CWC_Baltic_early,2.4
East_Asian_Paleolithic,2
Native_American_Holocene_Ancient,1
Iran_ChL,0.8
Armenia_EBA,0.4
CWC_Baltic_outlier,0.2

Likewise, I was curious about Kabardin average:

with ancestorsMix at 3 gen get

Kabardin
Armenia_EBA:Armenia_EBA_I1635 50%
Armenia_Chl:Armenia_ChL_I1631 25%
East_Asian_Paleolithic:Tianyuan_TY 12.5%
Steppe_EMBA:Poltavka_I0440 12.5%

Distance 3.2509%

with nMonte3 taking really long again at 5K ncycles, again get larger distance and many more epsilon pops:

Kabardin

Armenia_EBA,45.2
Armenia_Chl,33
Steppe_EMBA,13.2
Austronesian_Holocene_Ancient,2.4
CWC_Baltic_early,2.2
Native_American_Holocene_Ancient,1
Anatolia_ChL,0.6
Ust_Ishim,0.6
East_Asian_Paleolithic,0.4
Africa_Holocene_Ancient,0.2
Anatolian_Aegean_N,0.2
Central_Europe_EN,0.2
Hungary_MNChl,0.2
Iberia_MNChl,0.2
Minoan_Lasithi,0.2
Sweden_MN,0.2

Amazingly, the total ancient Armenia contribution and the total Steppe contribution in both are almost identical for the Kabardin in those totally different algorithm runs!

Alberto said...

@MomOfZoha

I'm really surprised with those results. It does seem that ancestorsMix works better than I expected.

Still I can't understand why nMonte is not able to find a better model (by distance) even when increasing the default cycles (which are quite high already, at a total of 1 million tries). So I took Matt's files to reproduce your results with ancestorsMix and then with Xmix. This is what I got:

> getAncestors('nmonte3_2.txt', 'target.txt', 3)

Kalash
Iran_EN_LN:Iran_N_AH1 37.5%
Steppe_EMBA:Poltavka_I0371 25%
East_Asian_Paleolithic:Tianyuan_TY 12.5%
Steppe_EMBA:Yamnaya_Kalmykia_RISE547 12.5%
Iran_ChL:Iran_ChL_I1662 12.5%

Distance 5.4711%

Not sure why a slightly different result, but quite close. With Xmix:

> getXmix('nmonte3_2.txt', 'target.txt')

Kalash
Iran_EN_LN:Iran_N_AH1 40.55%
Steppe_EMBA:Poltavka_I0371 30.2%
East_Asian_Paleolithic:Tianyuan_TY 16.45%
Armenia_Chl:Armenia_ChL_I1634 10.85%
Steppe_EMBA:Yamnaya_Kalmykia_RISE547 1.8%
Steppe_EMBA:Poltavka_I0440 0.15%
...
Distance 5.0684%

Similar model, but slightly better (as I expected, not having the minimum % constraint). Similar for Kabardin:

> getAncestors('nmonte3_2.txt', 'target.txt', 3)

Kabardin
Armenia_EBA:Armenia_EBA_I1635 50%
Armenia_Chl:Armenia_ChL_I1631 25%
East_Asian_Paleolithic:Tianyuan_TY 12.5%
Steppe_EMBA:Poltavka_I0440 12.5%

Distance 3.2509%

Exactly as yours. And with Xmix:

> getXmix('nmonte3_2.txt', 'target.txt')

Kabardin
Armenia_EBA:Armenia_EBA_I1635 40.8%
Steppe_EMBA:Poltavka_I0440 16.75%
Armenia_Chl:Armenia_ChL_I1631 14.2%
Armenia_EBA:Armenia_EBA_I1658 13.95%
Hungary_MNChl:ALPc_MN_I2377 4.85%
East_Asian_Paleolithic:Tianyuan_TY 3.15%
Austronesian_Holocene_Ancient:Lapita_Vanuatu_I1369 1.8%
Austronesian_Holocene_Ancient:Lapita_Vanuatu_I1368 1.65%
Native_American_Holocene_Ancient:Kennewick_kennewick 1.25%
Austronesian_Holocene_Ancient:Lapita_Tonga_CP30 0.7%
Minoan_Lasithi:Minoan_Lasithi_I0074 0.55%
Hungary_MNChl:ALPc_MN_I2744 0.1%
Iberia_MNChl:Iberia_ChL_I1838 0.1%
Armenia_Chl:Armenia_ChL_I1634 0.1%
Iberia_MNChl:Portugal_MN_LugarCanto41 0.05%
....

Distance 2.5944%

So again similar result, with better distance but more "noisy".

(BTW, Xmix had a known bug that sometimes triggered when using a large number of source pops. It's related to rounding issues, and not worth fixing for me. But I have now uploaded a version that is more resilient to that bug. It might still trigger in extreme conditions, but I hope only very rarely. Same link to get the updated one: https://drive.google.com/file/d/1GnzWqr-hsa_GnwnWqx_as7ky0eHGKg-4/view?usp=sharing )

Alogo said...

Matt,

From a look I took at them too, NA17376 looks like a more outlying islander (e.g. Dodecanese), NA17374 seems Anatolian (and quite likely an eastern one at that), and NA17372 seems like it might be a northern mainland - Anatolian mix. They fill out the dataset nicely since it's otherwise mostly represented by mainlanders and/or close-by islanders. Makes sense from a population-weighted perspective to have a few represent those areas too since the samples are all merged in one population.

GRALPOP13 on the other hand seems like a mainlander but with a weird pull towards Western Europe...I think it might be a Central Greek sample I've come across before. It's the only sample that seems a bit hard to explain at any rate.

MomOfZoha said...

@Alberto:
If you increase the num gen to 7 you get distance 2.6% with Kabardin with way less noise from your ancestorsMix. Already at num gen of 5 ancestorsMix on Kabardin gets distance 2.63 with very similar output too. Not sure if all the "noise" is worth it for the 0.006% reduction in distance via Xmix compared to the sufficiently good and real fast ancestorsMix output.

Likewise with Kalash, I increased num gen to 7, and here's your ancestorsMix output now:

> getAncestors('ancientPops.txt','Kalash.txt',7)

Kalash
Iran_EN_LN:Iran_N_AH1 40.62%
Steppe_EMBA:Poltavka_I0371 31.25%
East_Asian_Paleolithic:Tianyuan_TY 16.41%
Armenia_Chl:Armenia_ChL_I1634 10.94%
Steppe_EMBA:Yamnaya_Kalmykia_RISE547 0.78%

Distance 5.0649%

Easy peasy.

Alberto said...

@MomOfZoha

Wow. That's pretty amazing. With those results surely it's not worth the extra time and noise. I'll keep testing ancestorsMix to see when it's really good and when it falls short.

Thanks for the surprising feedback!

Joshua Lipson said...

I'm a relative R novice. Is nMonte's source code broken? Every time I try getMonte(), I get the error: 'could not find function "getMonte"'. Everything's in the same directory; I've set to that working directory.

Davidski said...

Try this with nMonte2...

- Copy paste into the R window...

source('nMonte2.R')

- HIT ENTER

- Copy paste...

getMonte('data.txt', 'target.txt')

- HIT ENTER

And make sure your input files are labeled exactly as above.

Simon_W said...

My first cousin once removed is a lame tomato. I introduced her to the Global 25 analysis and as a result she ignored me on Facebook. I'm not going to mention her name here publicly, just in order to abide by the rules, but she would deserve it.