Slide 1

Michel Veuille
Ecole pratique des Hautes Etudes
Director of the Systematics and Evolution dept
Muséum National d’Histoire Naturelle
Paris
Scientific Advisory Board of the CBOL
Data Analysis Working Group
What is the molecular signature of speciation events?
There is no molecular signature of speciation events
What are the other signatures of speciation events?
There is no universal signature of speciation events
But there are local signatures of speciation events,
and one kind of signature (e.g. morphological) can be
present when the other (e.g. genetical) is absent
Two examples: 1 of 2
A case of two mtDNA species
with no morphological difference
In 1998, the common European earwig was shown to consist of two
sympatric and reproductively isolated species differing only in the
number of annual broods (one or two broods per year).
The two species differ strikingly in COII sequence
This is because the GC% of these species evolves at a very high rate
But since they present no apparent morphological difference, the two
species remain unnamed
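A minimal sketch of how a GC% value can be computed from a sequence, assuming the sequence is given as a plain string of bases. The function name `gc_percent` and the toy sequences are hypothetical and are not data from the cited studies.

```python
def gc_percent(sequence: str) -> float:
    """Return the percentage of G and C among the unambiguous bases of a sequence."""
    seq = sequence.upper()
    # Keep only A/C/G/T so that gaps and ambiguity codes do not bias the ratio
    counted = [base for base in seq if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A/C/G/T bases")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


if __name__ == "__main__":
    # Toy sequences for illustration only (not real COII data)
    for name, seq in [("AT-rich toy", "ATGCATTA"), ("GC-rich toy", "GGCCATGC")]:
        print(name, gc_percent(seq))
```

Comparing this value across aligned COII sequences is one simple way to see the kind of compositional shift the slide describes.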
[Figure: GC% at COII in Hexapoda, showing the European earwig (Forficula auricularia), other earwigs and other Hexapoda]
Wirth, Le Guellec, Vancassel & Veuille. 1998. Evolution 52: 260-265.
Wirth, Le Guellec & Veuille. 1999. MBE 16: 1645-1653.
Two examples: 2 of 2
A case of two morphological species
with no mtDNA difference
[Figure: São Tomé, with Drosophila santomea and Drosophila yakuba]
Drosophila santomea lives in the highlands of São Tomé, above 1100 m.
Drosophila yakuba lives in the lowlands, below 1100 m.
They hybridize at 1100 m, and nevertheless remain genetically distinct.
They share the same mitochondria, but can be easily identified through the colour pattern of the abdomen.
After Lachaise et al. Proc. Roy Soc. London, 2000
They belong to the Drosophila melanogaster ("black abdomen") subgroup
[Figure: phylogeny of the subgroup, with year of description and distribution of each species]
- D. orena (1978): Cameroon
- D. erecta (1974): Tropical Africa
- D. teissieri (1971): Tropical Africa
- D. yakuba (1954): Tropical Africa
- D. santomea (2000): São Tomé island
- D. melanogaster (1830): Tropical Africa + worldwide
- D. simulans (1919): Tropical Africa + worldwide
- D. mauritiana (1974): Mauritius island
- D. sechellia (1981): Seychelles islands
D. santomea and D. yakuba share the same mitochondrion through common descent.
The condition of the barcoder is challenging
The species concept is hotly debated
There are many definitions of species
« Species » make sense to everybody.
For example, 12% of the nouns in the French vocabulary* correspond to taxa
that make sense to a taxonomist (species, families, varieties)
A solution is to let people use whatever species concept they prefer
and limit the barcoder’s activity to the domain where he/she can be helpful
* From Le Robert, a classic French dictionary
What data analysis is about
[Diagram: barcoder (?0,000,000 species), data & tools, taxonomist; labels: "Black box", « This is species A or B », « This is a new species »]
Data analysis consists in providing data to taxonomists so that they can make decisions about the status of specimens and taxa.
Barcoding and taxonomic decisions are logically distinct, even though they can be performed by the same person.
What data analysis is about (contd)
[Figure: two views of the tree of life locating the query sequence and its sister group within a local barcode, relative to the closest COI validated node, the closest validated node, and the closest validated node using additional information]
To be confident in the assignment of a taxon, we must look at the nodes below the closest node that excludes a sister group with probability p < 0.01.
Below this point, a series of statistical and classificatory approaches allow us to estimate the probability that the query sequence does or does not belong to an already described species, based on the available information.
Alternatively, additional information from other genes or from an enlarged dataset can increase our understanding of the taxonomic status of the query.
The population genetics background behind data analysis
Principle
Two sequences from the same population find their last common ancestor with some constant probability p = 1/N per generation.
It is a « death process », very different from a normal distribution.
[Figure: distribution of the coalescence time, going back into the past (generations)]
The most probable coalescence time: t = 1
The expectation: t = N
P = 0.05 of not having coalesced by: t = 3N
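The three values quoted here are consistent with a geometric waiting time with probability 1/N per generation. The sketch below assumes that reading, and in particular assumes that "P = 0.05" refers to the probability that two sequences have not yet coalesced after 3N generations. The function names and N = 1000 are illustrative choices.

```python
import math


def prob_coalescence_at(t: int, n_genes: int) -> float:
    """P(T = t): no coalescence for t - 1 generations, then coalescence."""
    p = 1.0 / n_genes
    return (1.0 - p) ** (t - 1) * p


def prob_not_yet_coalesced(t: int, n_genes: int) -> float:
    """P(T > t): the two lineages are still distinct after t generations."""
    return (1.0 - 1.0 / n_genes) ** t


def expected_coalescence_time(n_genes: int) -> float:
    """Mean of a geometric distribution with success probability 1/N."""
    return float(n_genes)


if __name__ == "__main__":
    N = 1000  # illustrative effective number of genes
    # The probability mass decreases with t, so t = 1 is the most probable time
    print(prob_coalescence_at(1, N), prob_coalescence_at(2, N))
    print(expected_coalescence_time(N))
    # Tail probability at t = 3N, compared with its continuous approximation exp(-3)
    print(prob_not_yet_coalesced(3 * N, N), math.exp(-3))
```

The strongly skewed shape (mode at 1, mean at N, long tail beyond 3N) is what makes this distribution so different from a normal one.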
[Figure: probability p (from 0 to 1) against sample size n (2, 9, 19, 39), for the MRCA of sample n1]
Probability p that the MRCA of a sample of size n is also the MRCA of the species, assuming a standard Wright-Fisher model.
In a very large population, p = (n-1)/(n+1).
p increases very rapidly.
The probability is p = 0.6667 for n = 5, and p = 0.8 for n = 9.
Increasing the sample size beyond this brings little additional information.
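The formula p = (n-1)/(n+1) can be tabulated directly. This is a small sketch using exact fractions; the function name is hypothetical, and the sample sizes are the ones quoted in the text plus the tick values of the figure.

```python
from fractions import Fraction


def prob_sample_mrca_is_species_mrca(n: int) -> Fraction:
    """p = (n - 1) / (n + 1) for a sample of n sequences from a very large population."""
    if n < 2:
        raise ValueError("a sample needs at least two sequences")
    return Fraction(n - 1, n + 1)


if __name__ == "__main__":
    for n in (2, 5, 9, 19, 39):
        p = prob_sample_mrca_is_species_mrca(n)
        print(f"n = {n:2d}  p = {p} ~ {float(p):.4f}")
```

Going from n = 9 to n = 39 only moves p from 4/5 to 19/20, which is the diminishing return the slide points to.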
[Figure: genealogy of sample n1; two sequences coalesce after N generations, the whole sample reaches its MRCA after 2N (1-1/n) generations]
Typically, under a standard equilibrium Wright-Fisher model(*), the expected time to the last common ancestor of the tree (MRCA) is at most twice the time to the common ancestor of two randomly sampled sequences.
(*) assuming:
- neutrality
- constant population size
- no structuring
- mutation-drift equilibrium
- N = effective number of genes
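The value 2N (1-1/n) can be recovered by summing the expected waiting times between successive coalescences. The sketch below assumes the standard coalescent approximation, in which k lineages wait on average N / C(k,2) generations before the next coalescence (C(k,2) being the number of pairs). Function names and the values of n and N are illustrative.

```python
from math import comb


def expected_tmrca_by_summation(n: int, n_genes: float) -> float:
    """Sum the expected waiting times while k = n, n-1, ..., 2 lineages remain."""
    return sum(n_genes / comb(k, 2) for k in range(2, n + 1))


def expected_tmrca_closed_form(n: int, n_genes: float) -> float:
    """Closed form 2N (1 - 1/n), in generations."""
    return 2.0 * n_genes * (1.0 - 1.0 / n)


if __name__ == "__main__":
    N = 1000.0  # illustrative effective number of genes
    for n in (2, 5, 10, 50):
        print(n, expected_tmrca_by_summation(n, N), expected_tmrca_closed_form(n, N))
```

For n = 2 both expressions give N, and as n grows they approach 2N without exceeding it, which is why a larger sample adds little depth to the tree.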
Using a larger dataset does not increase the information very much at this level
[Figure: genealogies and MRCAs of sample n1 and of a larger sample n2 > n1; time scale: N generations and 2N (1-1/n) generations]
« The older nodes of a genealogy tend to be revealed in a small sample, whereas more recent
portions are, on average, only revealed as the sample size per locus grows large. »
Kliman et al. 2000.
Polymorphisms can go very far back in the past of a species and enter the ancestral population shared with a sister species.
After AG Clark 1997
A long time after they have split, two species still share some neutral polymorphisms.
Exploring shallow nodes
1. Matz and Nielsen’s MCMC method
Derived from Nielsen and Hey’s (2001) IM method, based on MCMC (Markov chain Monte Carlo).
The original method estimates 5 parameters and therefore requires very long computation times.
Matz and Nielsen (2005) reduce it to two parameters:
- the population size
- the time to speciation.
They estimate the probability that the query sequence does or does not belong to the same species as the reference sample.
2. Evaluating classification and phylogenetic methods: Austerlitz et al.
They compare two classification methods:
- CART
- random forest
and two phylogenetic methods:
- Neighbour-joining
- PhyML
The classification methods partition the dataset using a few characters.
The distance methods work well with a small dataset, provided there are enough mutations.
They simulate n + 1 individuals in each species:
- n individuals form a reference sample
- the last individual is the query.
Repeated simulations allow them to record the rate of correct assignment of the query to its species.
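The evaluation loop described above (n reference individuals plus one query per species, repeated many times) can be sketched as follows. This is a toy illustration and not the authors' simulation: it replaces the coalescent model with a hypothetical two-species model (a consensus haplotype per species, a chosen number of fixed differences, and independent site flips standing in for within-species variation), and it assigns the query by nearest Hamming distance as a crude stand-in for a distance-based method. All names and parameter values are assumptions.

```python
import random


def simulate_individual(consensus, flip_prob, rng):
    """Copy a consensus haplotype, flipping each binary site with probability flip_prob."""
    return [site ^ 1 if rng.random() < flip_prob else site for site in consensus]


def hamming(a, b):
    """Number of sites at which two haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def assign(query, references):
    """Return the species holding the reference closest to the query.

    Ties go to the first species examined.
    """
    best_species, best_dist = None, None
    for species, seqs in references.items():
        dist = min(hamming(query, seq) for seq in seqs)
        if best_dist is None or dist < best_dist:
            best_species, best_dist = species, dist
    return best_species


def success_rate(n_ref, n_fixed_diffs, length=200, flip_prob=0.02,
                 replicates=200, seed=1):
    """Fraction of replicates in which the query is assigned to its true species."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(replicates):
        consensus_a = [rng.randint(0, 1) for _ in range(length)]
        consensus_b = list(consensus_a)
        for site in rng.sample(range(length), n_fixed_diffs):
            consensus_b[site] ^= 1  # fixed difference between the two species
        consensus = {"A": consensus_a, "B": consensus_b}
        # n + 1 individuals per species: n references plus one candidate query
        individuals = {sp: [simulate_individual(c, flip_prob, rng)
                            for _ in range(n_ref + 1)]
                       for sp, c in consensus.items()}
        true_species = rng.choice(["A", "B"])
        query = individuals[true_species][-1]
        references = {sp: seqs[:-1] for sp, seqs in individuals.items()}
        if assign(query, references) == true_species:
            correct += 1
    return correct / replicates


if __name__ == "__main__":
    for diffs in (0, 2, 20):
        print(diffs, success_rate(n_ref=10, n_fixed_diffs=diffs))
```

Swapping `assign` for another classifier while keeping the same loop is how different methods could be compared on identical simulated data.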
Comparison of the methods for a low θ
(2 populations, reference sample size = 10, θ = 3)
[Figure: success rate (50% to 100%) against separation time (100 to 10000) for ML, CART and RF]
Classification methods perform better for a low variation
Comparison of the methods for a high θ
(2 populations, reference sample size = 10, θ = 30)
[Figure: success rate (50% to 100%) against separation time (100 to 10000) for ML, CART and RF]
Phylogenetic methods perform better for a highly variable population
Conclusion :
the appropriate method varies with the properties of the dataset
Comparing methods using realistic datasets
1. Litoria nannotis
4 species; average sample size: 43.7; average θ = 1.54
[Figure: success rate (80% to 100%) against sample size (0 to 30) for ML, CART and Random Forest]
2. Astraptes fulgerator
12 species; average sample size: 38.8; average θ = 23.5
[Figure: good assignment rate (90% to 100%) against reference sample size (3 to 10) for phylo and CART]
3. Cowries
[Figure: good assignment rate (80% to 100%) against sample size (0 to 30) for ML, CART and Random Forest]
Other solutions:
Can we replace CO1?
Can we complement it with other genes?
Properties of bilaterian mtDNA, and other systems that share them:
- Large number of copies per cell: rDNA has a high copy number
- High mutation rate: microsatellites also
- Low variation / divergence ratio: centromeres, telomeres (documented in Drosophila)
- No recombination: centromeres, telomeres (documented in Drosophila)
- Haploid: X chromosome, Y chromosome
- Maternally inherited
- Asexual: the Y is asexual; the other chromosomes recombine
Variation in mtDNA is lowered by selective sweeps, according to Bazin et al. (2006).
Variation is also lowered in some nuclear regions by background selection.
The main disadvantage of maternal inheritance is that mitochondria can be transferred horizontally along with Wolbachia endosymbiotic bacteria.
Examples: Protocalliphora and Drosophila
The main disadvantage of asexuality is that mitochondria do not follow Mendel's 2nd law: mtDNA carries no information on genetic barriers.
Maternally transmitted endosymbiotic bacteria : hitchhiking by Wolbachia
Phylogeny of the fly Protocalliphora based on AFLP (nuclear markers), according to Whitworth et al. (2007).
Symbols represent different Wolbachia strains.
[Figure panels: nuclear, mtDNA]
Phylogeny of Protocalliphora based on COI+COII.
The authors claim that the assignment of unknown
individuals to species is impossible in 60% of the species
After Whitworth et al. Proc Roy. Soc. B, in press
[Figure: phylogenetic tree of mtDNA, with its MRCA, beside a phylogram of nuclear DNA]
A phylogenetic tree of mtDNA represents true phyletic relationships.
Its mutations are in linkage disequilibrium because mtDNA does not recombine.
Having two divergent clades is therefore unremarkable under a standard Wright-Fisher model.
By contrast, the phylogram of a recombining gene represents distances between haplotypes, where mutations can seem to « appear » repeatedly on several terminal branches.
Recombining genes thus inform us about the existence of barriers to gene flow.
Conclusions
1. There is no mitochondrial signature of speciation. There is no room for a barcode species concept, nor for anything like a « barcodon ».
2. Even a moderate sample can provide a wealth of information on the history of a species.
3. Additional information can be obtained in difficult cases, either by increasing the population sample or by using additional markers.
The END