STRUCTURE
http://pritch.bsd.uchicago.edu
Riccardo Negrini
[email protected]
A model-based clustering method that uses molecular
markers to:
Infer the properties of populations starting from single individuals
Demonstrate the presence of population structure
• Detect “cryptic” population structure
Classify individuals of unknown origin
Identify immigrants
Identify admixed individuals
Distance-based methods
Easy to apply and visually appealing
but
The clusters identified depend heavily on the distance measure and on
the graphical representation chosen
Difficult to assess the level of confidence in the clusters obtained
Difficult to incorporate additional information
More suited to exploratory data analysis than to fine statistical
inference
Dice similarity and multivariate analysis
[Figure: multivariate plot of individuals from three breeds (Italian Limousine, Romagnola, Marchigiana); axes span roughly -0.5 to 0.4]
Distribution of Dice similarity between (dotted line) and within breeds
[Figure: distributions for the pairs ROM/FRI, ROM/LMI, ROM/ROM, ROM/MCG and ROM/CHI; x-axis: Dice similarity, 0.64 to 0.82; y-axis: 0 to 20]
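For dominant markers such as AFLP, the Dice similarity between two band profiles is commonly computed as 2a / (2a + b + c), where a counts the bands present in both individuals and b and c count the bands present in only one of them. A minimal Python sketch follows; the function name and the two band profiles are made up for illustration and are not taken from the cattle dataset.

```python
def dice_similarity(x, y):
    """Dice similarity between two 0/1 band profiles of equal length."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same number of markers")
    shared = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    only_x = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    only_y = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    denom = 2 * shared + only_x + only_y
    # Two profiles with no bands at all are given similarity 0 by convention here
    return 2 * shared / denom if denom else 0.0

# Hypothetical profiles: 1 = band present, 0 = band absent
ind1 = [1, 1, 0, 1, 0, 1]
ind2 = [1, 0, 0, 1, 1, 1]
print(dice_similarity(ind1, ind2))
```

Shared absences (0/0) do not enter the formula, which is one reason this index is often preferred for dominant data.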
STRUCTURE main assumptions:
H-W equilibrium within populations
Linkage equilibrium between loci within
populations
STRUCTURE does not assume a particular mutation process, so it can
be used with the most common molecular markers (STR, RFLP, SNP, AFLP).
Sequence data, Y chromosome or mtDNA haplotypes have to be recoded
as a single locus with many alleles
STRUCTURE accounts for the presence of H-W disequilibrium and LD by
introducing population structure, and attempts to find population
groupings that (as far as possible) are not in disequilibrium
STRUCTURE adopts a BAYESIAN
approach:
Let X denote the genotypes of the sampled individuals
Let Z denote the unknown populations of origin of the individuals
Let P denote the unknown allele frequencies in all populations
Under H-W and LE, each allele at each locus in each genotype is an
independent draw from the appropriate frequency distribution
Having observed X, the knowledge about Z and P is given by the posterior
probability from Bayes' theorem:
Pr(Z, P | X) ∝ Pr(Z) Pr(P) Pr(X | Z, P)
It is not possible to compute this distribution exactly, but it is possible to obtain
approximate samples of Z and P using MCMC and then make inferences based on
summary statistics of these samples
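To make the likelihood term Pr(X | Z, P) concrete, the sketch below computes the posterior probability of origin for one individual when the allele frequencies P are treated as known. This is only one half of what the MCMC does (the real sampler alternates between updating P and Z); the function names and all frequencies are hypothetical, and a uniform prior 1/K over populations is assumed.

```python
import math

def log_likelihood(genotype, freqs):
    """ln Pr(x | z = k, P): under H-W and LE each allele copy is an
    independent draw, so log allele frequencies are summed."""
    total = 0.0
    for locus, alleles in enumerate(genotype):
        for allele in alleles:
            if allele == -9:          # missing data contributes nothing
                continue
            total += math.log(freqs[locus][allele])
    return total

def posterior_origin(genotype, pop_freqs):
    """Pr(z = k | x, P) for each population k, assuming a uniform prior 1/K."""
    logs = [log_likelihood(genotype, f) for f in pop_freqs]
    top = max(logs)                    # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logs]
    norm = sum(weights)
    return [w / norm for w in weights]

# Hypothetical allele frequencies, indexed as pop_freqs[k][locus][allele]
pop_freqs = [
    [{145: 0.8, 143: 0.2}, {90: 0.7, 92: 0.3}],
    [{145: 0.2, 143: 0.8}, {90: 0.3, 92: 0.7}],
]
individual = [(145, 145), (90, 92)]    # two allele copies per locus
print(posterior_origin(individual, pop_freqs))
```

The factor of 2 for heterozygotes is omitted because it is the same for every population and cancels in the normalisation.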
Bayesian inference: basic principles
• No logical distinction between parameters and data.
Both are random variables: data are “observed” and
parameters are “unobserved”
The PRIOR encapsulates information about the values of the
parameters before observing the data
The LIKELIHOOD is a conditional distribution that specifies the
probability of the data at any particular values of the
parameters
The aim of Bayesian inference is to calculate the POSTERIOR
distribution of the parameters (the conditional distribution of
the parameters given the data)
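The prior, likelihood and posterior can be illustrated with the simplest conjugate case: the frequency of one allele at a biallelic locus with a Beta prior, which is the two-allele special case of a Dirichlet prior. The sketch below is a toy example with invented counts, not part of STRUCTURE itself.

```python
def beta_posterior(prior_a, prior_b, count_a, count_b):
    """Beta(prior_a, prior_b) prior plus binomial allele counts
    gives a Beta(prior_a + count_a, prior_b + count_b) posterior."""
    return prior_a + count_a, prior_b + count_b

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

prior = (1.0, 1.0)     # flat prior: no information before seeing data
# Hypothetical sample: 14 copies of allele A and 6 copies of allele a
post = beta_posterior(*prior, count_a=14, count_b=6)
print(beta_mean(*prior), beta_mean(*post))
```

With a flat prior the posterior mean (15/22) sits close to the observed frequency (14/20); a more informative prior would pull it further towards the prior mean.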
FORMAT OF THE DATA FILE:
Alleles in rows (two rows per individual); Flag indicates learning samples; -9 marks missing data

Label  Pop  Flag  Locus 1  Locus 2  Locus 3  Locus n
Chi1   1    1     145      92       113      Size
Chi1   1    1     145      98       115      Size
Chi2   1    1     143      90       115      …
Chi2   1    1     -9       90       119      …
Chi3   1    0     151      155      117      …
Chi3   1    0     145      92       119
Rom1   2    0     145      98       121
Rom1   2    0     143      90       125
Rom2   2    0     -9       90       125
Rom2   2    0     -9       94       123
File in txt format, tab-delimited
Dominant data: code band presence (AA or Aa) as 1 and absence (aa) as 2,
and enter the second allele as missing data (-9)
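As a rough illustration of this layout, the sketch below reads a tab-delimited file with two rows per individual into per-locus genotypes. It assumes there is no header row and that the first three columns are Label, Pop and Flag, as in the example above; the function name `read_structure` is hypothetical and is not part of the STRUCTURE software.

```python
import io
from collections import defaultdict

MISSING = -9   # missing-data code used in the example file

def read_structure(handle, n_loci):
    """Group the rows of a two-rows-per-individual file by individual label."""
    individuals = defaultdict(lambda: {"pop": None, "flag": None, "rows": []})
    for line in handle:
        fields = line.split()
        if not fields:
            continue                              # skip blank lines
        label, pop, flag = fields[0], int(fields[1]), int(fields[2])
        alleles = [int(v) for v in fields[3:3 + n_loci]]
        record = individuals[label]
        record["pop"], record["flag"] = pop, flag
        record["rows"].append(alleles)
    return dict(individuals)

# Four rows copied from the example table (first three loci only)
sample = io.StringIO(
    "Chi2\t1\t1\t143\t90\t115\n"
    "Chi2\t1\t1\t-9\t90\t119\n"
    "Rom1\t2\t0\t145\t98\t121\n"
    "Rom1\t2\t0\t143\t90\t125\n"
)
data = read_structure(sample, n_loci=3)
# Pair up the two rows so that each locus holds both allele copies
genotype = list(zip(*data["Chi2"]["rows"]))
n_missing = sum(a == MISSING for pair in genotype for a in pair)
print(genotype, n_missing)
```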
BUILDING A PROJECT:
[Screenshots: Step 1, Step 2, Step 3, Step 4]
... if everything goes well:
MODELLING DECISIONS:
Ancestry model:
No admixture model: each individual comes purely from one of the K populations. The output is the
posterior probability that individual i comes from population k. The prior probability for each population is 1/K.
Appropriate for discrete populations and for dominant data
Admixture model: individuals may have mixed ancestry, i.e. individual i has inherited some fraction of its genome
from ancestors in population k. The output is the posterior mean estimate of these proportions
Linkage model: if, t generations in the past, there was an admixture event that mixed the K
populations, any individual chromosome is composed of “chunks” inherited as discrete units from
ancestors at the time of admixture.
Using prior population information: this is the default option in structure. Not
recommended in the exploratory preliminary analysis of the data.
Popflag allows you to specify which
samples are to be used as learning samples to assist clustering
Frequency model:
Allele frequencies independent: assumes that the allele frequencies in each population are
independent draws from a distribution specified by a parameter λ. The prior says that we expect the
allele frequencies in each population to be reasonably different from each other.
Allele frequencies correlated: assumes that allele frequencies in the different populations
are likely to be correlated, probably due to migration or shared ancestry. The K populations represented
in the dataset have each undergone independent drift away from the ancestral allele frequencies
How long to run the
program?
Length of burn-in period: number of MCMC iterations necessary to reach a
“stationary distribution”: the states the chain visits will tend to the probability distribution of interest
(e.g. Pr(Z, P|X)), which no longer depends on the number of iterations or the initial state of
the variables.
Number of MCMC iterations after burn-in: number of iterations after burn-in needed to get accurate
parameter estimates
Loosely speaking: a burn-in of 10,000 to 100,000 iterations is usually adequate.
Good estimates of the parameters P and Q can be obtained with fairly short runs (100,000).
Accurate estimation of Pr(X|K) needs quite long runs (10^6)
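The role of the burn-in can be shown with a toy trace: early iterations still reflect the starting point, so they are discarded before averaging. The values below are invented for illustration and the helper name is hypothetical.

```python
def posterior_mean(trace, burn_in):
    """Average the sampled values after discarding the first burn_in iterations."""
    kept = trace[burn_in:]
    if not kept:
        raise ValueError("no iterations left after burn-in")
    return sum(kept) / len(kept)

# Hypothetical trace of one parameter: drifts away from its start value, then stabilises
trace = [0.10, 0.25, 0.40, 0.52, 0.50, 0.48, 0.51, 0.49]
print(posterior_mean(trace, burn_in=0))   # biased by the starting point
print(posterior_mean(trace, burn_in=4))   # uses only the stabilised part
```

In practice the trace would hold many thousands of iterations, and plotting it is a simple way to judge whether the burn-in was long enough.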
How to choose K (number
of populations)?
No rules, only an iterative method: i.e. try different values of K and different
lengths of burn-in period and numbers of MCMC iterations after burn-in.
• Fully resolving all the groups in your dataset: test all the values of K until
the highest likelihood values are reached
• Determining the rough relationships (low K)
Be careful to:
Run several independent runs for each K in order to verify the consistency of the
estimates across runs
Population structure leads to LD among unlinked loci and departures from H-W. These
are the signals used by STRUCTURE. But inbreeding, genotyping errors or null alleles
can also lead to the same effect.
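One common way to compare values of K is to turn the estimated ln Pr(X|K) from each run into approximate posterior probabilities, Pr(K|X) ∝ Pr(X|K), assuming a uniform prior over the values of K tried. The sketch below does this with invented estimates; it should be read as a rough guide only, since the estimate of ln Pr(X|K) is itself an approximation.

```python
import math

def posterior_k(log_prob_by_k):
    """Approximate Pr(K | X) from ln Pr(X | K), assuming a uniform prior on K."""
    top = max(log_prob_by_k.values())      # subtract the maximum to avoid underflow
    weights = {k: math.exp(v - top) for k, v in log_prob_by_k.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}

# Hypothetical ln Pr(X|K) estimates from runs at K = 1..4
estimates = {1: -4356.0, 2: -3983.0, 3: -3982.0, 4: -3983.0}
post_k = posterior_k(estimates)
best_k = max(post_k, key=post_k.get)
print(best_k, post_k)
```

Small differences in ln Pr(X|K) translate into large differences in probability, which is why several independent runs per K are worth comparing first.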
INTERPRETING THE OUTPUT:
The screen during a run
Number of MCMC iterations
Divergence between
populations calculated as Fst
Log-likelihood of the data given the
current values of P and Q
Current estimate of ln Pr(X|K),
averaged over all the iterations since the end of
the burn-in period
The output file
Current estimate of ln Pr(X|K), averaged over all
the iterations since the end
of the burn-in period
Q output without using prior information
Estimated membership in the clusters
(k=3) and 90% probability interval
(ANCESTDIST turned on)
Q output using prior information
Posterior probability of belonging
to the presumed population
Estimated probability of belonging
to the second population or of having a
parent or grandparent that
belongs to the second population
PLOT THE RESULTS
• one vertical line/individual
• color = cluster
• more colors/line:
genetic components of individual
INFERRING POPULATION STRUCTURE
RESGEN PROJECT: Towards a strategy for
the conservation of the genetic diversity of
European cattle
THE DATASET
More than 60 cattle breeds from Europe
5 African Bos indicus breeds
20 individuals per breed
30 microsatellites
Structure parameters:
Admixture model
Allele frequencies correlated
No prior information
Model-based clustering
European cattle
[Figure: the sampled breeds clustered at k=2, with groups labelled EUR and AFR]
Breed labels in the figure: Germ. Simmental, Swiss HF, Podolica, Bretonne BP, Betizu B, Swiss Brown, Sayaguesa, Swedish, Simmental, British HF, Romagnola, Charolais, Pirenaica, Germ. Br. Württemberg, Alistano, Red Polled, Hinterwaelder, Jutland 1950, Chianina, Ayrshire, Blonde d'Aquitaine, Germ. Br. Bavaria, Rubia Gallega, Bohemian Red, German Yellow, Dutch Belted, N'Dama, Highland, Limousin, Germ. Br. Orig, Asturiana Valles, Polish Red, Evolene, German BP-W, Somba, Hereford, Bazadais, Bruna Pirineds, Asturiana Montana, Red Danish, Eringer, Friesian-Holland, Lagunaire, Dexter, Gasconne, Menorquina, Tudanca, Angeln, Piemontese, Belgian Blue, Borgou, Aberdeen Angus, Aubrac, Mallorquina, Tora de Lidia, MRY, Grigio Alpina, Germ. Shorthorn, Zebu Peul, Jersey, Salers, Retinta, Casta Navarra, Red HF dual, Rendena, Maine-Anjou, Guernsey, Montbéliard, Morucha, Hungarian Grey, Red HF dairy, Normande, Betizu A, Pezzata Rossa Ital., Cabannina, Avilena, Istrian, Groningen WH
k=2
• Europe – Africa
• Zebu influence in Podolian breeds
Model-based clustering
European cattle
[Figure: the same breeds clustered at k=2, k=5, k=7 and k=9]
Clusters at k=9: Nordic, Lowland Pied, British, Baltic Red, French Brown, Alpine Spotted, Alpine Brown, Iberian, Podolian
Intermediate zones: North-West Intermediates, Alpine Intermediates
• 9 homogeneous clusters + 2 intermediate zones.
Courtesy of dr. J. A. Lenstra, dr I. Nijman and Resgen Consortium
INTRABIODIV: Tracking surrogates for intraspecific biodiversity: towards
efficient selection strategies for the conservation of natural genetic
resources using comparative mapping & modelling approaches
Phylogeography of Geum reptans
• 59 localities
• 177 samples
• ≈80 polymorphic AFLP markers
[Figures: two panels for Geum reptans; legend: High diversity, Low diversity]
Phylogeography of Ligusticum mutellinoides
• 127 localities
• 381 samples
• 123 polymorphic AFLP markers
[Figure: Ligusticum mutellinoides; legend: High diversity, Low diversity]
Courtesy of dr. P.Taberlet and Intrabiodiv Consortium
PERFORMING AN ASSIGNMENT TEST
THE REFERENCE DATASET
[Map of the breeds: Grigio Alpina, Valdostana Pezzata Rossa, Piemontese, Cabannina, Bruna, Pezzata Rossa It., Romagnola, Limousine, Marchigiana, Chianina, Calvana, Podolica, Mucca Pisana, Maremmana, Rendena, Frisona]
16 breeds reared in Italy
416 individuals
3 AFLP primer combinations
132 polymorphisms
Information on origins
Checking the reference dataset
[Figure: per-individual assignment probability (y-axis: Probability, 0 to 1) for the breeds LMI, MCG, FRI, MMA, CHI, ROM, MUP, CAL, BRU, GAL, VPR, PRI, PIM, POD, CAB and REN; the 90% threshold is marked]
20,000 burn-in + 50,000 MCMC iterations; 8 independent runs
98% of individuals correctly assigned with p>90% (91% with p>99%)
100% of Romagnola individuals from the genetic center assigned with p>99%
THE BLIND TEST
44 Romagnola individuals randomly selected
3 AFLP primer combinations; 132 polymorphisms
No prior information
THE RESULTS
Assignment probability to the different breeds of the
reference dataset
[Figure: assignment probability (0 to 1) of each ROMAGNOLA individual to the breeds REN, CAB, POD, PIM, PRI, GAL, VPR, CAL, MUP, CHI, MMA, FRI, BRU, MCG, LIM]
36 Romagnola cattle assigned with p>99%
4 Romagnola cattle assigned with 90%<p<99%
4 Romagnola cattle not assigned
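The three outcome classes reported above can be reproduced from the assignment probabilities with a simple threshold rule. The sketch below is one possible reading of that rule; the function name, the breed subset and the probability rows are invented for illustration and are not the real results.

```python
def classify(q_row, breeds, high=0.99, low=0.90):
    """Assign an individual to its most probable breed if the probability
    exceeds the chosen thresholds; otherwise leave it unassigned."""
    best = max(range(len(q_row)), key=lambda i: q_row[i])
    p = q_row[best]
    if p > high:
        return breeds[best], "assigned (p>99%)"
    if p > low:
        return breeds[best], "assigned (90%<p<99%)"
    return None, "not assigned"

breeds = ["ROM", "CHI", "MCG"]        # subset of breed codes, for illustration
# Hypothetical assignment probabilities for three individuals
q_rows = [
    [0.995, 0.003, 0.002],
    [0.930, 0.050, 0.020],
    [0.550, 0.400, 0.050],
]
results = [classify(row, breeds) for row in q_rows]
for result in results:
    print(result)
```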
For those who are very interested
•Yang BZ, Zhao H, Kranzler HR, Gelernter J. Practical population group assignment with selected
informative markers: characteristics and properties of Bayesian clustering via STRUCTURE. Genet
Epidemiol. 2005 May;28(4):302-12.
•Sullivan PF, Walsh D, O'Neill FA, Kendler KS. Evaluation of genetic substructure in the Irish Study of
High-Density Schizophrenia Families. Psychiatr Genet. 2004 Dec;14(4):187-9.
•Lucchini V, Galov A, Randi E. Evidence of genetic distinction and long-term population decline in
wolves (Canis lupus) in the Italian Apennines. Mol Ecol. 2004 Mar;13(3):523-36
•Peever TL, Salimath SS, Su G, Kaiser WJ, Muehlbauer FJ. Historical and contemporary multilocus
population structure of Ascochyta rabiei (teleomorph: Didymella rabiei) in the Pacific Northwest of the
United States. Mol Ecol. 2004 Feb;13(2):291-309.
•Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype
data: linked loci and correlated allele frequencies. Genetics. 2003 Aug;164(4):1567-87.
•Bamshad MJ, Wooding S, Watkins WS, Ostler CT, Batzer MA, Jorde LB. Human population genetic
structure and inference of group membership. Am J Hum Genet. 2003 Mar;72(3):578-89. Epub 2003
Jan 28.
•Koskinen MT. Individual assignment using microsatellite DNA reveals unambiguous breed
identification in the domestic dog. Anim Genet. 2003 Aug;34(4):297-301.
•Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW. Genetic
structure of human populations. Science. 2002 Dec 20;298(5602):2381-5.
•Rosenberg NA, Burke T, Elo K, Feldman MW, Freidlin PJ, Groenen MA, Hillel J, Maki-Tanila A, Tixier-Boichard M, Vignal A, Wimmers K, Weigend S. Empirical evaluation of genetic clustering methods
using multilocus genotypes from 20 chicken breeds. Genetics. 2001 Oct;159(2):699-713
•Randi E, Pierpaoli M, Beaumont M, Ragni B, Sforzi A. Genetic identification of wild and domestic cats
(Felis silvestris) and their hybrids using Bayesian clustering methods. Mol Biol Evol. 2001
Sep;18(9):1679-93