Transcript Slide 1

Improving accuracy of DNA-based
methods for population estimation:
Incorporating genotype uncertainty
into Mark-Recapture models
Janine Wright, Richard Barker, Matthew Schofield
Department of Mathematics and Statistics,
University of Otago, NZ
Andrea Byrom
Landcare Research, Lincoln, NZ
Dianne Gleeson
Ecological Genetics Laboratory
Landcare Research, Auckland, NZ
Why use DNA to monitor wildlife?
• DNA can be collected from
non-invasive samples in the
field (e.g. hair, faeces)
• Relatively inexpensive
methods for developing
molecular markers
• Enables ID of individuals in
a wildlife population
• Feasible to generate count
data for cryptic, low-density,
or hard-to-trap species
• Free from bias
Avoiding bias by using DNA
• Variation in trappability of individual
animals
• Can create bias in estimates of trap-catch post-control
• Aims:
– Develop an unbiased method of measuring
absolute numbers of animals present
– Identify potential biases in trap-catch
estimates
– To develop a set of reliable microsatellite
markers for pests in New Zealand – start
with stoats and possums and extend to other
species (e.g. cats, rats, pigs, goats)
Molecular Markers –
What is a microsatellite?
• A repeated sequence of 2-5 nucleotides
e.g. ACACACACACACACAC = (AC)8
• Usable repeat lengths are 8-40 copies
• Occur in many locations in genome, usually in
non-coding regions
• Mutation prone (slippage replication)
(High mutation rate – 10^-2 to 10^-5)
• Thus any given population may contain
variants of differing sizes
• Size variants = ‘alleles’
• Typical vertebrate populations have 5-15 alleles at
each locus
(locus = position in genome)
• Each individual possesses two alleles at each locus
(maternally and paternally inherited)
• Can see if an individual is homozygous or
heterozygous at each locus
homozygous = both alleles identical
heterozygous = different alleles
Gives a ‘genotype’ (tag) for
each individual
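A minimal sketch of how such a genotype tag might be represented in code, assuming each locus is stored as a pair of allele sizes. The function names and the example individual are illustrative only and are not taken from the study's software.

```python
# Illustrative representation of a microsatellite genotype (names are hypothetical).
# Each locus holds two allele sizes; order within a locus does not matter.

def make_genotype(locus_pairs):
    """Return a hashable tag: one sorted (allele, allele) tuple per locus."""
    return tuple(tuple(sorted(pair)) for pair in locus_pairs)

def is_homozygous(locus):
    """A locus is homozygous when both alleles are identical."""
    return locus[0] == locus[1]

# Made-up individual typed at three loci.
tag = make_genotype([(113, 113), (138, 136), (174, 180)])
status = ["homozygous" if is_homozygous(locus) else "heterozygous" for locus in tag]
print(tag)
print(status)
```

Sorting the alleles within each locus means two samples from the same animal give the same tag regardless of the order in which the alleles were recorded.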
Laboratory Procedure (Part 1)
• Extract DNA from
possum faecal pellet or
ear tissue
• Run DNA sample
through Real Time PCR
machine with known
DNA standards to
quantify (twice)
• Calculate amount
(>200pg) to add to PCR
reaction
Laboratory Procedure (Part 2)
• Use PCR to amplify microsatellite products at 7 loci
(repeated twice)
• Run on agarose gel to confirm success
of amplification and to determine
amount required for sequencing
• Run on sequencer
• Analyse using GeneMapper
software and by eye
How many loci to study?
• Between 4 and 12 depending on species being
studied (differing amounts of variability)
• More loci => less chance of genotypes matching by chance
=> but more chance of error
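The trade-off in the number of loci can be illustrated with a rough calculation of the chance that two unrelated animals share a genotype. This is a sketch under assumptions that are not in the slides: Hardy-Weinberg proportions, independent loci, and a hypothetical locus with five equally common alleles.

```python
from itertools import combinations

def locus_match_probability(freqs):
    """P(two unrelated individuals share a genotype at one locus) under Hardy-Weinberg."""
    homozygous = sum(p ** 4 for p in freqs)
    heterozygous = sum((2 * p * q) ** 2 for p, q in combinations(freqs, 2))
    return homozygous + heterozygous

def multilocus_match_probability(freqs, n_loci):
    """Assumes independent loci that all share the same allele frequencies."""
    return locus_match_probability(freqs) ** n_loci

# Hypothetical locus with 5 equally common alleles.
freqs = [0.2] * 5
for n_loci in (4, 7, 12):
    print(n_loci, multilocus_match_probability(freqs, n_loci))
```

The match probability falls geometrically as loci are added, while every extra locus is another place where a genotyping error can occur.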
Challenges for DNA methods
Four types of error can occur:
1. Laboratory/recording error –
thought to be negligible
2. Sample contamination – also
unlikely
3. ‘Shadow effect’ – not enough
loci/alleles used results in
several individuals sharing the
same genetic tag
4. ‘Allelic drop-out’
Allelic drop-out
What is it?
• Failure of DNA amplification
at one or more loci
• More likely with a lower
concentration of DNA
• Only one allele detected in a
heterozygous individual
(observed # of homozygous
individuals is higher)
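The effect described above can be mimicked with a simple simulation. This is a simplified sketch, assuming drop-out acts independently at each heterozygous locus with a single probability; the function name, the probability values, and the example genotype are hypothetical.

```python
import random

def apply_dropout(genotype, dropout_prob, rng):
    """Return the observed genotype after possible allelic drop-out.

    At each heterozygous locus, with probability dropout_prob one allele
    (chosen at random) fails to amplify and the locus appears homozygous.
    Homozygous loci are left unchanged in this simplified model.
    """
    observed = []
    for a, b in genotype:
        if a != b and rng.random() < dropout_prob:
            kept = rng.choice((a, b))
            observed.append((kept, kept))
        else:
            observed.append((a, b))
    return tuple(observed)

rng = random.Random(1)
true_genotype = ((113, 113), (136, 138), (174, 180))
print(apply_dropout(true_genotype, 0.0, rng))  # no drop-out: genotype is unchanged
print(apply_dropout(true_genotype, 1.0, rng))  # every heterozygous locus loses an allele
```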
Allelic drop-out
Why is it a problem?
Leads to overestimation
of population size (may
be >5-fold):
(Creel et al., 2003)
1. Incorrect genotypes
lead to encounters of
‘new’ individuals
2. False decrease in
probability of recapture
(recaptured but thought
to be new individual)
Overcoming allelic drop-out
• Quantitative PCR approach designed to measure
amount of amplifiable nuclear DNA present in faecal
samples of possums
• Possum-specific piece of DNA used
as target sequence with specific
TaqMan assay primers and probe
• Duplicate standards of known DNA
amounts included in each set of
samples to produce a standard
curve
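A standard curve of this kind is commonly fitted as a straight line of cycle threshold (Ct) against log10 of the known DNA amount, then inverted to estimate the amount in an unknown sample. The sketch below assumes that form; the standards, Ct values, and function names are made up for illustration and are not the study's data.

```python
import math

def fit_standard_curve(quantities_pg, ct_values):
    """Least-squares fit of Ct = intercept + slope * log10(quantity)."""
    xs = [math.log10(q) for q in quantities_pg]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def estimate_quantity(ct, intercept, slope):
    """Invert the fitted line to get the DNA amount (pg) for an observed Ct."""
    return 10 ** ((ct - intercept) / slope)

# Hypothetical duplicate standards (pg) and their Ct values.
std_pg = [10, 10, 100, 100, 1000, 1000, 10000, 10000]
std_ct = [36.1, 35.9, 32.7, 32.6, 29.3, 29.4, 26.0, 25.9]
intercept, slope = fit_standard_curve(std_pg, std_ct)

sample_pg = estimate_quantity(31.0, intercept, slope)
print(round(sample_pg), "pg; passes 200 pg threshold:", sample_pg >= 200)
```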
How does this relate to mark-recapture?
• Rejection rate of samples may be high when
screened through quantitative PCR (samples
rejected if <200pg of DNA)
- 53% for possum faecal samples
• These low concentration samples still contain
‘usable’ information in the form of partial genotypes
despite <99% accuracy of tag ID (genotype)
• Error rate (allelic drop-out) can be linked to DNA
quantity & quality
• How to build this into MR models?
Data – Start with Data!
Sample #  ID      Tv54     Tv16     Tv58     Tv27     Tv12     Tv53     Tv19
1         PP1     113 113  136 138  137 137  174 180  234 234  244 272  274 274
2         PP2     113 113  128 136  137 137  174 192  232 232  244 266  261 275
3         PP3      99  99  136 138  145 149  174 190  232 232  266 266  261 263
4         PP4      99  99  136 138  145 149  174 190  232 232  266 266  261 263
92        PP105    99 125  126 138  159 161  174 198  232 234  260 266  263 273
113       PP135    99 125  126 138  159 161  174 198  232 234  260 266  263 273
205       PP439   113 113  132 136  143 153  192 196  232 232  240 268  255 261
206       PP440   113 113  132 136  143 153  192 196  232 232  240 268  255 261
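As a rough illustration of how the eight sample rows listed above turn into capture data, the sketch below groups samples with identical seven-locus genotypes. It assumes error-free genotypes, so that an exact match means the same animal; the variable and function names are illustrative.

```python
# Seven-locus genotypes (Tv54, Tv16, Tv58, Tv27, Tv12, Tv53, Tv19) for the rows above.
pp3 = ((99, 99), (136, 138), (145, 149), (174, 190), (232, 232), (266, 266), (261, 263))
pp105 = ((99, 125), (126, 138), (159, 161), (174, 198), (232, 234), (260, 266), (263, 273))
pp439 = ((113, 113), (132, 136), (143, 153), (192, 196), (232, 232), (240, 268), (255, 261))
samples = {
    "PP1": ((113, 113), (136, 138), (137, 137), (174, 180), (234, 234), (244, 272), (274, 274)),
    "PP2": ((113, 113), (128, 136), (137, 137), (174, 192), (232, 232), (244, 266), (261, 275)),
    "PP3": pp3, "PP4": pp3,
    "PP105": pp105, "PP135": pp105,
    "PP439": pp439, "PP440": pp439,
}

def group_by_genotype(samples):
    """Map each distinct genotype to the list of sample IDs that carry it."""
    groups = {}
    for sample_id, genotype in samples.items():
        groups.setdefault(genotype, []).append(sample_id)
    return groups

groups = group_by_genotype(samples)
u = len(groups)                                   # number of distinct tags (u.)
counts = [len(ids) for ids in groups.values()]    # capture counts c_i
sample_ids = list(samples)
# x_obs[i][j] = 1 if sample j carries tag i (u. rows, S columns).
x_obs = [[1 if s in ids else 0 for s in sample_ids] for ids in groups.values()]

print(u, counts)
for row in x_obs:
    print(row)
```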
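What We Would Like to Do
$$O = \left( \begin{pmatrix} g_{11} & \cdots & g_{17} \\ \vdots & & \vdots \\ g_{S1} & \cdots & g_{S7} \end{pmatrix},\; X^{obs} \right) = (G^{obs}, X^{obs})$$
Example – 5 samples containing 3 individuals:
$$X^{obs} = \begin{pmatrix} 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}$$
• Gobs contains the genotypes of the sampled animals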
Completing the Data
• What about the animals never sampled?
$$X = \begin{pmatrix} X^{obs} \\ 0 \end{pmatrix}$$
• Xobs is (u. × S), where 0 is the sample history for the N – u. animals never sampled.
• Completing the data introduces N as an unknown (i.e. a parameter)
Likelihood
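$$L(N, \theta; X^{obs}, z) = [X^{obs} \mid N, \theta, z] = \binom{N}{u_{\cdot}} \frac{S!}{c_1! \cdots c_{u_{\cdot}}!} \prod_j \prod_i \pi_{ij}(\theta, z)^{X_{ij}}$$
• If we assume $\pi_{ij} = 1/N \;\; \forall\, i, j$:
$$[X^{obs} \mid N, \theta, z] = \binom{N}{u_{\cdot}} \frac{S!}{c_1! \cdots c_{u_{\cdot}}!} \prod_j \prod_i \left( \frac{1}{N} \right)^{X_{ij}}$$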
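Under the equal-probability assumption each of the S samples contributes exactly one factor of 1/N, so the likelihood reduces to C(N, u.) S!/(c1! ... cu.!) N^(-S). The sketch below evaluates that expression on a grid of N and picks the maximiser; the capture counts and the grid limit of 100 are hypothetical choices for illustration.

```python
import math

def log_likelihood(n, counts):
    """log of C(N, u.) * S!/(c_1! ... c_u.!) * (1/N)^S for equal capture probabilities."""
    u = len(counts)          # number of distinct individuals sampled (u.)
    s = sum(counts)          # total number of samples (S)
    if n < u:
        return float("-inf")  # N cannot be smaller than the number of animals seen
    log_choose = math.lgamma(n + 1) - math.lgamma(u + 1) - math.lgamma(n - u + 1)
    log_multinomial = math.lgamma(s + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_choose + log_multinomial - s * math.log(n)

# Hypothetical capture counts: S = 8 samples from u. = 5 individuals.
counts = [1, 1, 2, 2, 2]
candidates = range(len(counts), 101)
n_hat = max(candidates, key=lambda n: log_likelihood(n, counts))
print(n_hat)
```

Because the multinomial coefficient does not depend on N, only the terms C(N, u.) and N^(-S) influence where the maximum falls.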
Missing and Corrupted Data
• With no genotyping error:
$$(G, X) \xrightarrow{\text{Sampling}} O = (G^{obs}, X^{obs})$$
– Only one way to map Gobs and Xobs to G and X
– G irrelevant for estimation (ancillary for N and θ)
Missing and Corrupted Data
• With allelic dropout (corruption)
$$(G, X) \xrightarrow{\text{Sampling}} \xrightarrow{\text{Corruption}} O = (G^{obs}, X^{obs})$$
– Now many ways to map Gobs and Xobs to G and X
– Each has a different set of capture statistics ({ci} and u.)
– With dropout, the observed u. is usually too high; this leads to overestimation of N
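To make the "many ways to map" point concrete, the sketch below checks whether an observed genotype could have arisen from a given true genotype when drop-out is the only error allowed. The rule used (an apparent homozygote is compatible with any heterozygote carrying that allele) and all names and genotypes are illustrative assumptions.

```python
def locus_compatible(observed, true):
    """Could `observed` arise from `true` at one locus if at most one allele drops out?"""
    if sorted(observed) == sorted(true):
        return True  # locus was read correctly
    a, b = observed
    # Drop-out turns a heterozygote into an apparent homozygote for the surviving allele.
    return a == b and true[0] != true[1] and a in true

def genotype_compatible(observed, true):
    """All loci must be individually compatible."""
    return all(locus_compatible(o, t) for o, t in zip(observed, true))

# Two hypothetical animals and one observed (possibly corrupted) sample.
true_tags = {
    "animal_A": ((113, 113), (136, 138), (174, 180)),
    "animal_B": ((113, 113), (136, 140), (174, 192)),
}
observed = ((113, 113), (136, 136), (174, 174))
matches = [name for name, tag in true_tags.items() if genotype_compatible(observed, tag)]
print(matches)
```

Here the observed sample is compatible with both animals, and it could also be a genuinely homozygous third animal, which is why the capture statistics are no longer uniquely determined.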
Knowns and Unknowns
• Known (observed): O, z
• Unknown: G, X, N, θ and γ
• Bayesian inference: find [Unknowns | Knowns]
$$[G, X, N, \theta, \gamma \mid O, z] \propto [O, X, G \mid N, \theta, \gamma, z]\,[N, \theta, \gamma]$$
$$\propto [O \mid X, G, \gamma]\,[G \mid N, \gamma]\,[X \mid N, \theta, z]\,[N]\,[\theta]\,[\gamma]$$
– [O, X, G | N, θ, γ, z]: complete data likelihood
– [N][θ][γ]: independent priors
– [X | N, θ, z]: field sampling
– [G | N, γ]: allocation of genotypes
– [O | X, G, γ]: data corruption
Posterior Sampling
• Use McMC to draw a sample from the joint
posterior
– Calculate importance ratios using probability model
• E.g. updating genotypes – propose G* using
current G and a proposal distribution J(G* | G)
– Terms that don’t involve G cancel
$$ir = \frac{[O \mid X, G^*, \gamma]\,[G^* \mid N, \gamma]\,J(G \mid G^*)}{[O \mid X, G, \gamma]\,[G \mid N, \gamma]\,J(G^* \mid G)}$$
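The importance ratio above has the form of a standard Metropolis-Hastings acceptance ratio. The sketch below shows one generic update of that kind on a toy two-state target; it is not the authors' sampler, and the target weights, proposal, and function names are all hypothetical.

```python
import math
import random

def mh_update(current, log_target, propose, log_proposal, rng):
    """One Metropolis-Hastings update.

    log_target(g)      : log of the terms that involve g (up to a constant)
    propose(g, rng)    : draws a candidate g* from J(g* | g)
    log_proposal(a, b) : log J(a | b)
    """
    candidate = propose(current, rng)
    log_ir = (log_target(candidate) + log_proposal(current, candidate)
              - log_target(current) - log_proposal(candidate, current))
    # Accept with probability min(1, ir).
    if log_ir >= 0 or rng.random() < math.exp(log_ir):
        return candidate
    return current

# Toy target: a single locus that is heterozygous with weight 0.8 (made-up values).
weights = {"het": 0.8, "hom": 0.2}

def log_target(g):
    return math.log(weights[g])

def propose(g, rng):
    return "hom" if g == "het" else "het"   # always propose the other state

def log_proposal(new, old):
    return 0.0                              # deterministic flip: J = 1 in both directions

rng = random.Random(42)
state, draws = "hom", []
for _ in range(20000):
    state = mh_update(state, log_target, propose, log_proposal, rng)
    draws.append(state)
share_het = draws.count("het") / len(draws)
print(share_het)
```

Only the terms that change with the proposed value enter `log_target`, which mirrors the cancellation of terms that do not involve G.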
One Small Problem
• γ not identifiable from O and z.
– γ includes dropout rate and genotype frequencies
– Either need some different data or some strong
assumptions
One Small Problem
• Andrea also has data from ear samples
– So much DNA that Pr(dropout) = 0(.000000001)
– If the ear sample is a random sample from the
same population, provides information on genotype
frequencies and dropout rate
– Jointly model ear and pellet data ⇒ all unknowns identifiable
– Can also test for H-W equilibrium
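As a rough illustration of what error-free ear samples provide, the sketch below estimates allele frequencies at one locus by counting and compares observed heterozygosity with its Hardy-Weinberg expectation. The genotypes are made up, the function names are illustrative, and a simple comparison like this is not a formal test of equilibrium.

```python
from collections import Counter

def allele_frequencies(locus_genotypes):
    """Estimate allele frequencies at one locus by counting allele copies."""
    counts = Counter(allele for pair in locus_genotypes for allele in pair)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

def expected_heterozygosity(freqs):
    """Hardy-Weinberg expectation: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs.values())

def observed_heterozygosity(locus_genotypes):
    """Share of individuals carrying two different alleles at the locus."""
    return sum(a != b for a, b in locus_genotypes) / len(locus_genotypes)

# Made-up ear-sample genotypes at a single locus.
ear = [(113, 113), (99, 113), (99, 125), (113, 125), (99, 99), (113, 113)]
freqs = allele_frequencies(ear)
h_exp = expected_heterozygosity(freqs)
h_obs = observed_heterozygosity(ear)
print(freqs)
print(h_exp, h_obs)
```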
To Do
• Come up with more realistic models for [X | N, θ, z]
– Good progress on this
• Combine models for ear data and pellet data
• Incorporate information on amount of DNA
– Covariate for dropout rate
• Consider other applications
– Being able to impute G has implications for
genetic models and for open population MR
models
Acknowledgements
Funding:
• Animal Health Board
• Foundation for Research, Science & Technology
Landcare Research staff:
• Robyn Howitt
• Denise Jones
• Dave Morgan
• Graham Nugent
• Nick Poutu
• Casey Sole
• Caroline Thomson