ppt - Language Log


Historical inference
from linguistic and genetic data
Potentially “…the best evidence of the
derivation of … the human race” (Thomas
Jefferson)
BUT
Inferences are complex
methods and results from several disciplines
Intellectual stakes are high
Work has often been careless
sometimes spectacularly so
dangers of overinterpretation and “scientism”
2/6/01
General methodological problems
• Not all graphs are trees
– “treeness” tests often left out
– “treeness” hypothesis can often be rejected
• Tree inference may be underdetermined
– Branching structure
– Root choice
• Rates of change may not be constant
– for different markers
– across time
• Gene trees (and language trees) may not be
population trees
• Biology and language are complicated
– simplifying assumptions are sometimes perniciously
mistaken
Trees vs. Clines (etc.)
• A tree structure represents the results of a
sequence of splits in population (or language)
– no further influences among separate branches
– if rates of change are constant, distances should
be quantized
• Within an interbreeding (intercommunicating)
population, distances reflect the amount of
gene flow (transmission of linguistic traits)
– should correlate strongly with accessibility
– e.g. geographical distance in the simplest case
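One way to make the tree-versus-cline contrast testable is the four-point condition on additive distances: for any four taxa, the two largest of the three pairwise sums should be equal if the distances fit a tree. The Python sketch below is an illustration, not a method taken from these slides. The function name `four_point_violation`, the toy retention values, the village coordinates, and the choice of -log(similarity) as the distance are all assumptions made for the example.

```python
from itertools import combinations
from math import dist, log

def four_point_violation(taxa, distance):
    """Largest gap between the two biggest pairwise sums over all quartets.

    A value near zero means the distances are consistent with some tree.
    """
    worst = 0.0
    for a, b, c, d in combinations(taxa, 4):
        sums = sorted([
            distance(a, b) + distance(c, d),
            distance(a, c) + distance(b, d),
            distance(a, d) + distance(b, c),
        ])
        worst = max(worst, sums[2] - sums[1])
    return worst

# Toy tree ((A,B),(C,D)): hypothetical retention on each leaf branch,
# plus one internal branch separating the two pairs.
LEAF_RETENTION = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.6}
SIDE = {"A": 0, "B": 0, "C": 1, "D": 1}
INTERNAL_RETENTION = 0.95

def tree_distance(x, y):
    """Distance as -log of the product of retentions along the path."""
    similarity = LEAF_RETENTION[x] * LEAF_RETENTION[y]
    if SIDE[x] != SIDE[y]:
        similarity *= INTERNAL_RETENTION
    return -log(similarity)

# Toy cline: four villages on the corners of a unit square, with
# distance simply equal to geographical distance (hypothetical layout).
POSITION = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (0, 1)}

def cline_distance(x, y):
    return dist(POSITION[x], POSITION[y])

tree_gap = four_point_violation("ABCD", tree_distance)
cline_gap = four_point_violation("ABCD", cline_distance)
print(f"tree gap:  {tree_gap:.3f}")
print(f"cline gap: {cline_gap:.3f}")
```

The tree-generated distances pass the test even though the branches change at different rates, while the two-dimensional cline does not. Note that villages strung along a single line would also pass, since a line is itself a (degenerate) tree.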
The… procedures outlined here provide a rigorous method for
inferring whether the geographical pattern of variation is consistent
with an historical split (fragmentation) or no split (recurrent gene
flow) using criteria that are completely explicit. For example, in
analyzing the mtDNA of tiger salamanders, a clear split into eastern
and western lineages was detected for mtDNA. Using the same
explicit criteria, there was no split among any human populations.
Quite the contrary, the present analysis documents recurrent and
continual genetic interchange among all Old World human
populations throughout the entire time period marked by mtDNA.
Accordingly, estimating a date for a 'split' of Africans from
non-Africans based on evidence from mtDNA is certainly allowed by
many computer programs, but the results are meaningless because
a date is being assigned to an 'event' that never occurred.
Templeton (1997)
Methods for tree inference
(“phylogeny”)
• Two general approaches
– clustering (easier but cruder)
– generate and evaluate alternative trees
• Distance-based methods
– based on matrix of distances/similarities
• Parsimony
– based on set of partly-shared characters or traits
http://evolution.genetics.washington.edu/phylip/software.html
documents 193 different phylogeny packages
Cognate percentages
for 8 Vanuatu languages
Toga
64 Mosina
64 58 Peterara
57 51 65 Nduindui
29 28 34 32 Sakao
51 45 55 52 40 Malo
39 39 45 41 43 50 Fortsenal
52 48 57 60 31 48 45 Raga
Data from Guy (1994)
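The "clustering" family of methods can be illustrated on this table with plain average-linkage clustering, which repeatedly merges the two most similar groups. The sketch below is a generic textbook procedure, not Guy's reconstruction algorithm; the names `similarity`, `linkage`, and `average_linkage_merges` are made up for the example, and only the percentages come from the table.

```python
from itertools import combinations
from statistics import mean

LANGUAGES = ["Toga", "Mosina", "Peterara", "Nduindui",
             "Sakao", "Malo", "Fortsenal", "Raga"]

# Lower triangle of the cognate-percentage table, row by row.
LOWER = [
    [],
    [64],
    [64, 58],
    [57, 51, 65],
    [29, 28, 34, 32],
    [51, 45, 55, 52, 40],
    [39, 39, 45, 41, 43, 50],
    [52, 48, 57, 60, 31, 48, 45],
]

def similarity(i, j):
    """Cognate percentage between languages i and j (indices into LANGUAGES)."""
    return LOWER[max(i, j)][min(i, j)]

def linkage(c1, c2):
    """Average cognate percentage between two clusters of language indices."""
    return mean(similarity(i, j) for i in c1 for j in c2)

def average_linkage_merges():
    """Repeatedly merge the two most similar clusters; return the merge order."""
    clusters = [(i,) for i in range(len(LANGUAGES))]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        score = linkage(c1, c2)
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2, score))
    return merges

for c1, c2, score in average_linkage_merges():
    left = "+".join(LANGUAGES[i] for i in c1)
    right = "+".join(LANGUAGES[i] for i in c2)
    print(f"{left} | {right}: {score:.1f}")
```

Because each merge is made greedily on raw similarity, a method like this implicitly assumes roughly equal rates of change, which is one reason clustering counts as the cruder approach.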
Reconstruction Algorithm
(Guy 1994)
“A message is input at the root of a tree-shaped transmission
network, whence it is transmitted to the terminal nodes. As they
travel, copies of the original message are affected by errors
consisting in randomly selected segments of the message being
replaced by other segments randomly drawn from a pool of
possible segments (the "alphabet" of the message). The
problem is: from the garbled versions of the original message
collected at the terminal nodes, reconstruct the network and the
history of the transmission of the message.”
“Additive-distance” tree with weights on branches rather
than on nodes -- doesn’t assume constant rate of change…
Explanatory force of the model
• Set of distances grows as $\frac{N^2 - N}{2}$
• Set of binary-tree branch labels
grows as $2(N - 1)$
• For 8 languages: we predict 28 numbers
(the inter-language cognate proportions)
with 14 numbers
(the binary tree branch proportions)
Inferred tree
Toga      -830-----:-919-----:-972-----:-947-----:
Mosina    -770-----'         |         |         |
Peterara  -----829-----------'         |         |
Nduindui  -----795-----------:-949-----'         |
Raga      -----755-----------'                   |
Sakao     -----567-----------:-883-----:-895-----'
Fortsenal -----759-----------'         |
Malo      ----------772----------------'
Mosina/Toga:     .77*.83 = .6391 (really 64%)
Peterara/Mosina: .829*.919*.77 = .5866 (really 58%)
Peterara/Toga:   .829*.919*.830 = .6323 (really 64%)
from Guy (1994)
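The three products above follow one rule: the predicted cognate proportion for a pair is the product of the branch weights on the path joining them. A minimal Python sketch of that rule follows. The internal node names (`n1` to `n6`, `top`) are invented labels, and the branch layout is a reading of the diagram rather than anything stated by Guy.

```python
from math import prod

# child -> (parent, retention on the branch above the child),
# as read off the inferred tree. Internal node names are hypothetical.
BRANCHES = {
    "Toga": ("n1", 0.830), "Mosina": ("n1", 0.770),
    "n1": ("n2", 0.919), "Peterara": ("n2", 0.829),
    "Nduindui": ("n3", 0.795), "Raga": ("n3", 0.755),
    "n2": ("n4", 0.972), "n3": ("n4", 0.949),
    "Sakao": ("n5", 0.567), "Fortsenal": ("n5", 0.759),
    "n5": ("n6", 0.883), "Malo": ("n6", 0.772),
    "n4": ("top", 0.947), "n6": ("top", 0.895),
}

def branches_above(node):
    """Child ends of every branch between node and the top of the drawing."""
    above = set()
    while node in BRANCHES:
        above.add(node)
        node = BRANCHES[node][0]
    return above

def predicted_share(a, b):
    """Product of the weights on the path joining leaves a and b."""
    # Branches shared by both upward paths lie above the common ancestor,
    # so the symmetric difference leaves exactly the branches between a and b.
    on_path = branches_above(a) ^ branches_above(b)
    return prod(BRANCHES[child][1] for child in on_path)

for pair in [("Mosina", "Toga"), ("Peterara", "Mosina"), ("Peterara", "Toga")]:
    print(pair, round(predicted_share(*pair), 4))
```

The table holds 14 branch weights, matching the 2(N - 1) count for eight languages.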
True - predicted
cognate percentages
Toga
 0 Mosina
 1 -1 Peterara
 1 -1  4 Nduindui
-2 -1  0  0 Sakao
 2  0  2  3  1 Malo
-3  0 -1 -2  0 -2 Fortsenal
-1 -1 -1  0  1  1  4 Raga
The model fits very well!
Where’s the root?
Isn’t it obvious?
Toga      -830-----:-919-----:-972-----:-947-----:--Protolanguage
Mosina    -770-----'         |         |         |
Peterara  -----829-----------'         |         |
Nduindui  -----795-----------:-949-----'         |
Raga      -----755-----------'                   |
Sakao     -----567-----------:-883-----:-895-----'
Fortsenal -----759-----------'         |
Malo      ----------772----------------'
Oops: other options
[Figure: the same tree, redrawn with the protolanguage attached at a different point]
And some more…
protolanguage
Toga     -830-:-919-:-972-:-947-:-895-:-883-:-567- Sakao
Mosina   -770-'     |     |           |     `-759- Fortsenal
Peterara -----829---'     |           `---772----- Malo
Nduindui -----795---:-949-'
Raga     -----755---'
In the absence of other constraints, the root can be placed anywhere
in the tree without changing the model’s fit!
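The reason the root is free to move can be checked numerically: the predicted proportion for a pair depends only on the product of weights along the path between them, and placing a root on a branch merely splits that branch's weight into two factors with the same product. The sketch below uses a fragment of the tree, with `x`, `y`, and `rest` as invented labels for two internal nodes and for everything beyond the .972 branch. The helper names and the `share` parameter are assumptions for illustration.

```python
from itertools import combinations
from math import isclose

def path_product(edges, start, goal):
    """Product of branch retentions along the unique path from start to goal."""
    neighbours = {}
    for u, v, w in edges:
        neighbours.setdefault(u, []).append((v, w))
        neighbours.setdefault(v, []).append((u, w))
    stack = [(start, None, 1.0)]
    while stack:
        node, came_from, product = stack.pop()
        if node == goal:
            return product
        for nxt, w in neighbours[node]:
            if nxt != came_from:
                stack.append((nxt, node, product * w))
    raise ValueError(f"no path from {start} to {goal}")

def place_root(edges, u, v, share):
    """Split branch u-v at a new 'root' node; the two parts multiply back to w."""
    rooted = []
    for a, b, w in edges:
        if {a, b} == {u, v}:
            rooted += [(a, "root", w ** share), ("root", b, w ** (1 - share))]
        else:
            rooted.append((a, b, w))
    return rooted

# Unrooted fragment of the inferred tree (node labels x, y, rest are hypothetical).
UNROOTED = [
    ("Toga", "x", 0.830),
    ("Mosina", "x", 0.770),
    ("x", "y", 0.919),
    ("Peterara", "y", 0.829),
    ("y", "rest", 0.972),
]
LEAVES = ["Toga", "Mosina", "Peterara", "rest"]

def same_fit(rooted):
    """True if every leaf-to-leaf prediction matches the unrooted tree."""
    return all(
        isclose(path_product(rooted, a, b), path_product(UNROOTED, a, b))
        for a, b in combinations(LEAVES, 2)
    )

print(same_fit(place_root(UNROOTED, "Toga", "x", 0.5)))
print(same_fit(place_root(UNROOTED, "x", "y", 0.3)))
print(same_fit(place_root(UNROOTED, "y", "rest", 0.9)))
```

Any additional constraint, such as an outgroup or a rate assumption, works by ruling out some of these otherwise equivalent placements.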
Possible “other constraints”
• Historical evidence
– about earlier forms
– about structure of relationships among
contemporary forms
• “outgroup”
• Constraints on rate of change
– linguistic (or genetic) “clock”
A universal constant
for glottochronology?
Thirteen sets of data, presented in partial justification of
these assumptions, serve as a basis for calculating a
universal constant to express the average rate of
retention k of the basic-root morphemes:
k = 0.8048 ± 0.0176 per millennium,
with a confidence limit of 90%.
Lees (1953)
Some of Lees’ data:
Language                 Years  Words  Cognates  Rate (per millennium)
English                   1000    209       160  .766
Latin/Spanish             1800    200       131  .790
Latin/French              1850    200       125  .776
German                    1100    214       180  .854
Middle Egyptian/Coptic    2200    200       106  .760
Greek                     2070    213       147  .836
Chinese                   1000    210       167  .795
Swedish                   1050    207       176  .853
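Under the constant-rate assumption, a per-millennium rate can be recovered from each row by rescaling the retained fraction to a 1000-year span. The sketch below assumes that retention over t years equals k raised to the power t/1000; the function name `retention_per_millennium` is made up for the example. It reproduces several rows of the Rate column (English, German, and Greek, for instance) to three decimals, while a few others come out slightly different in the second or third decimal.

```python
def retention_per_millennium(years, words, cognates):
    """Fraction of the word list retained, rescaled to a 1000-year span.

    Assumes a constant rate, so retention over t years is k ** (t / 1000).
    """
    return (cognates / words) ** (1000 / years)

# (language, years, words, cognates) as given in the table.
LEES_ROWS = [
    ("English", 1000, 209, 160),
    ("Latin/Spanish", 1800, 200, 131),
    ("Latin/French", 1850, 200, 125),
    ("German", 1100, 214, 180),
    ("Middle Egyptian/Coptic", 2200, 200, 106),
    ("Greek", 2070, 213, 147),
    ("Chinese", 1000, 210, 167),
    ("Swedish", 1050, 207, 176),
]

for name, years, words, cognates in LEES_ROWS:
    rate = retention_per_millennium(years, words, cognates)
    print(f"{name:24s}{rate:.3f}")
```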
Some more retentive languages
(rates per 1000 years)
Language           100-word list  200-word list
Icelandic (rural)  99%            97.6%
Icelandic (urban)  98%            96.2%
Georgian           96.5%          89.9%
Armenian           97.8%          94%
Bergsland & Vogt (1962)
Some less retentive ones
Bergsland & Vogt estimate of vocabulary retention in East
Greenlandic as .722 in 600 years, or .34 per millennium.
David Lithgow (pers. com. circa 1970) has observed a
replacement of some 20% of the basic vocabulary in
Muyuw (Woodlark island) in one generation. Raise 0.8
to the 33rd power, and that gives you the retention rate
of Muyuw per 1000 years should it continue to evolve
at that rate: 0.06%.
Jacques Guy (1994)
“Language chains”
A
.77 B
.65 .76 C
Configurations like this are taken as prima facie evidence of
“non-treeness”, to be attributed to borrowing/mixing/cline
types of situations. But in fact they can also easily be generated
by variable rates of change:
A ----------- 90% -----------.
                             |____ protolanguage
B ---- 95% ----.             |
               |---- 90% ----'
C ---- 80% ----'
Note that the required difference in mean rate of change
is only (.9-.9*.8)/.9 = .2 , or 20%
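The chain can be reproduced directly from the diagram by multiplying the retention figures along each path. A short Python check follows; the variable names are invented for the example and the percentages are the ones in the diagram.

```python
from math import prod

# Retention on each branch of the diagram.
A_BRANCH = 0.90   # protolanguage to A
BC_STEM = 0.90    # protolanguage to the common ancestor of B and C
B_BRANCH = 0.95   # common ancestor to B
C_BRANCH = 0.80   # common ancestor to C

def shared_fraction(*branch_retentions):
    """Expected shared proportion: product of retentions along the path."""
    return prod(branch_retentions)

chain = {
    ("A", "B"): shared_fraction(A_BRANCH, BC_STEM, B_BRANCH),
    ("A", "C"): shared_fraction(A_BRANCH, BC_STEM, C_BRANCH),
    ("B", "C"): shared_fraction(B_BRANCH, C_BRANCH),
}

for pair, value in chain.items():
    print(pair, round(value, 2))
```

A strict tree with unequal branch rates is therefore enough to yield the .77 / .65 / .76 pattern, with no borrowing involved.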
Mitochondrial Genome
Mitochondrial family tree
Mitochondrial phylogeny
Three fascinating “results”
• Mitochondrial Eve
• Mitochondrial Clans
• The three-wave theory: converging
linguistic and genetic evidence
Mitochondrial Eve
Cann, Stoneking, and Wilson (1987):
mtDNA comparisons of 147 people from
Europe, Africa, Asia, Australia, and New
Guinea show that all present human
mtDNA is descended from a single
African woman who lived about 200,000
years ago.
First problem
• Computer program was used to find a tree
consistent with the mtDNA data
• But so were many other (unreported)
trees!
– order of answers depended on order of data
– root could be effectively anywhere in the
dataset
• e.g. Melanesian Eve, Asian Eve, European Eve…
Other problems
• mtDNA may not change at a constant rate
• mtDNA changes may be adaptive
• Gene trees may not be population trees
– DNA (including mtDNA) can spread by
gradual flow or by range expansion
– spread can be influenced by other factors
Early results: Native Americans come from four genetic lineages,
labeled A through D.
Amerinds have all four lineages,
NaDene only A, and Eskaleuts A and D.
Current results:
The four mtDNA lineages divide into nine distinct genetic subtypes.
All four lineages are in all three language groups.
Many local populations have all four lineages and a number even have
all the subtypes.
All subtypes can be found in North, Central and South America.
“It isn't realistic to believe that the same lineages ended up in all these
populations across two continents by separate migrations.”
http://www.oxfordancestors.com/:
Oxford Ancestors
We put the Genes in Genealogy
Oxford Ancestors is the World's first organization to harness
the power and precision of modern DNA- based genetics in
the service of genealogy.
MatriLine™ interprets your deep maternal ancestry, linking
you - if your roots are in Europe - to one of seven women:
Ursula, Tara, Helena, Katrine, Velda, Xenia or Jasmine.
And mtDNA inheritance
may not even be entirely clonal!
• Mice
– demonstration of “paternal leakage”
• Hagelberg
– rare mtDNA mutation in Vanuatu
• Eyre-Walker
– statistics of mtDNA “homoplasies”
Island evidence
• Erika Hagelberg (Proc. R. Soc. 1999)
– Island of Nguna (Vanuatu, Melanesia)
– 3 main mtDNA population groups
• as expected for the region
– In all three groups, the same mutation is
sometimes found
• previously known only from one Northern European
– Repeated chance mutation is unlikely
• local spread by recombination seems more probable
Statistics of mtDNA “homoplasies”
• Mutations that occur in different mtDNA
haplogroups around the world
• Assuming purely maternal inheritance, these
were thought to represent chance recurrence of
mutations in “hypervariable” regions
• Eyre-Walker et al. (Proc. R. Soc. 1999):
– regions are not statistically more variable than others
– mutations cluster geographically
• Macaulay (1999) counters
– much of the result comes from a dataset that may
contain errors
– “no need to panic”
Reaction of another mtDNA aficionado:
…I am reminded of a comment by a bishop’s wife
in Victorian England, also concerning human origins:
“Let us hope that it isn’t true, and if it is, that it will
not become generally known.”