Lecture - University of Cincinnati

DNA Forensics
• DNA Forensics deals with the use of recombinant DNA
technology on one or more biological specimens for
forensic investigation
• Common uses of DNA Forensics include: Human
Identification, Kinship Analysis for Missing Person
Identification, Parentage Testing, etc.
• Probability and Statistics play important roles in assessing
the strength of DNA evidence in all such applications
• Events in DNA forensics are generally low probability
events, and statistical assessment of DNA forensic data
requires estimation based on sparse multi-dimensional data
Brief Introduction of the DNA Forensics
Session of the Symposium
• Four talks will address some of the major Statistical/Probabilistic issues
of DNA Forensics
• The current paradigm of the topic will be the focus of the first talk (R.
Chakraborty)
• B. Budowle will address challenges to this paradigm when DNA quantity
is low, and for identification of the source of microbial agents in forensic
samples
• T. Wang will introduce the need for pedigree-based probabilistic
calculations in missing person identification
• A. Eisenberg will discuss possible statistical formulations applicable to
newer technologies that are being (or are about to be) implemented in the field
• All four speakers are major players in DNA Forensics in the country, have
contributed significantly to the development of DNA Forensics, and
together have over 75 years of experience working in the subject
Statistical and Probabilistic Issues in
DNA Forensics: Current Paradigms
Ranajit Chakraborty, PhD
Robert A. Kehoe Professor and Director
Center for Genome Information
Department of Environmental Health
University of Cincinnati College of Medicine
Cincinnati, OH 45267, USA
Tel. (513) 558-4925/3757; Fax (513) 558-4505
(Presentation at the University of Cincinnati Symposium on Probability Theory and
Applications on March 21, 2009)
Overview of the Talk
• Brief History of DNA Forensics
• Currently used DNA Markers in Forensics
• Three Generic Forensic Scenarios
• Examples of DNA Evidence Data
• Frequency, Likelihood, and Bayesian Logic of DNA Statistics
• Population Substructure and Its Effect on DNA Statistics
• Lineage Markers (mtDNA and Y-STR haplotypes)
• Match and Partial Match in Databases
Brief History of DNA Forensics
• 1980 – Ray White described the first hypervariable RFLP marker
• 1985 – Alec Jeffreys discovered multilocus VNTR probes (the term “DNA
Fingerprinting” coined)
• 1985 – First paper on PCR published
• 1988 – In US, FBI started DNA forensic casework
• 1991 – First STR paper published
• 1992 – NRC-I Report Issued
• 1994 – CODIS STR Loci Characterized
• 1995 – FSS started UK DNA Database
• 1996 – NRC-II Report Issued; mtDNA introduced in Forensics
• 1997 – 13 CODIS STR Loci Validated for Forensic Use; Y-STRs described for
forensic investigation purposes
• 1998 – FBI launched CODIS Database
• 2000 – RFLP Technology replaced by Multiplex STR Technology
• 2002 – FBI mtDNA Population Database published; Y-STR 20plex published
• 2002 – SNPs proposed as supplementary markers
• 2004 – Large sizes of “offenders’ databases” opened issues of coincidental
full/partial matches
• 2007 – Familial search through partial match occurrences in databases
Advantages of Use of STR Loci
in DNA Forensics
• PCR Based
• Low quantity DNA
• Degraded DNA
• Amenable to automation
• Non-isotopic
• Rapid typing
• Discrete alleles
• Abundant in genome
• Highly informative
(satisfied by the CODIS STRs)
15 CODIS STR Loci with
Chromosomal Positions
[Figure: chromosome map showing the loci CSF1PO, D3S1358, D5S818, D7S820, D8S1179, D13S317, D16S539, D18S51, D21S11, FGA, TH01, TPOX, VWA, Penta D, and Penta E, together with AMEL]
Three Types of DNA Forensic Issues
Transfer Evidence: The DNA profile of the evidence
sample indicates that it is of single-source
origin
Mixture of DNA: The evidence sample’s DNA profile
suggests that it is a mixture of DNA from multiple (more than
one) individuals
Kinship Determination: The evidence sample’s DNA profile,
compared with that of one or more reference profiles, is
used to determine the validity of a stated biological
relatedness among individuals
Transfer Evidence – An Example
DNA Mixture Analysis
(amelogenin, D8S1179, D21S11, D18S51)
Inclusion
mtDNA
Lineage Marker
Y-Chromosomal Genes
Lahn, Pearson & Jegalian 2001
Y STR Loci
Three Types of Conclusions
Exclusion
Match, or Inclusion
Inconclusive
Statistical Assessment of DNA Evidence
Needed most frequently in the inclusionary events
(Apparent) exclusionary cases may also be sometimes
subjected to statistical assessment, particularly for kinship
determination because of genetic events such as mutation,
recombination, etc.
Loci providing inconclusive results are often excluded
from statistical considerations
Even if one or more loci show inconclusive results,
inclusionary observations of the other typed loci can be
subjected to statistical assessment
Approaches for Statistical Assessment of
DNA Evidence
Frequentist Approach: indicating the coincidental chance
of the event observed
Likelihood Approach: indicating relative support of the
event observed under two contrasting (mutually exclusive)
stipulations regarding the source of the evidence sample
Bayesian Approach: providing a posterior probability
regarding the source, when the data in hand are considered with
a prior probability of the knowledge of the source (the latter is
not generally provided by the DNA profiles being considered
for statistical assessment)
Frequentist Approach of Statistical
Assessment for Transfer Evidence
When the evidence sample DNA profile matches that of the reference
sample, one or more of the following questions are answered:
• How often would a random person provide such a DNA match?
Equivalently, what is the expected frequency of the profile observed in
the evidence sample? This is called the Random Match Probability,
the complement of which is the Exclusion Probability
• What is the expected frequency of the profile seen in the evidence
sample, given that it is observed in another person (namely in the
reference sample)? This is called the Conditional Match Probability
• What would be the expected frequency of the profile seen in the
evidence sample in a relative (of specified kinship) of the reference
individual, given the DNA match of the reference and evidence samples?
This is called the Match Probability in Relatives
Frequentist Approach of Statistical
Assessment for DNA Mixture
When the evidence mixture DNA profile fails to exclude a reference sample as a part
contributor, and more commonly a set of reference samples together explains all
alleles seen in the mixture, one or more of the following questions are answered:
• How often would a random person be excluded as a part contributor of
the mixture sample? This is called the Exclusion Probability, the
complement of which is the inclusion probability, giving the expected
chance of Coincidental Inclusion
(Note: This answer is based on the data on the evidence sample alone, without any
consideration of the profiles of the reference samples)
• With a stipulation on the number of contributors, how often would a random
person’s DNA, mixed with that of one or more of the reference persons,
provide a mixture profile as seen in the evidence sample, given
that the reference persons are also part contributors of the DNA mixture?
(Note: This answer considers data on the profiles of evidence sample as well as those
of the reference samples stipulated to be part contributors)
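The first question above can be sketched in a few lines. This is a minimal illustration, assuming Hardy-Weinberg proportions within a locus and independence across loci (neither assumption is stated on the slide); the function names and allele frequencies are hypothetical.

```python
from math import prod

def inclusion_probability(mixture_allele_freqs):
    """Chance that both alleles of a random person fall among the alleles
    seen in the mixture at one locus (Hardy-Weinberg proportions assumed)."""
    total = sum(mixture_allele_freqs)
    return total * total

def combined_exclusion_probability(loci):
    """Exclusion probability over several loci: the complement of the product
    of per-locus inclusion probabilities (loci assumed independent)."""
    return 1.0 - prod(inclusion_probability(freqs) for freqs in loci)

# Hypothetical frequencies of the alleles observed in the mixture at two loci.
mixture_loci = [[0.10, 0.20, 0.15], [0.05, 0.25]]
pe_mixture = combined_exclusion_probability(mixture_loci)
print(f"Exclusion probability: {pe_mixture:.6f}")
```

Only the alleles seen in the evidence sample enter the calculation, which matches the note that this answer ignores the reference profiles.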
Kinship Assessment – Frequentist
Approach
When comparisons of evidence and reference samples fail
to exclude a stated relationship of the evidence sample with
the reference individual(s), the frequency-based question is
of the form:
• What is the chance of excluding the stated relationship?
This is called the Exclusion Probability (PE); it is generally
answered conditioned on the profiles of the reference
samples and the stated relationship
Note: Average exclusion probability can also be computed
disregarding the profiles examined, which rationalizes the
choice of loci to be typed for validating the stated
relationship
Concept of Likelihood
A Likelihood represents the support of a given
hypothesis (or value of a parameter) provided by the
observations in the data, written as
Likelihood = Prob. (Data | Hypothesis).
Technically, likelihood is mathematically identical to
the probability of the data given the hypothesis, but
interpreted as a function of the hypothesis (or,
parameter values specified by the hypothesis) for
the observations in the data.
Likelihood Ratio
With two (mutually exclusive) hypotheses, say H1
and H2, the likelihood ratio (LR) is the ratio of
probabilities of observing the same data under H1
and H2 , giving
LR = Prob. (Data | H1) / Prob. (Data | H2).
Meaning of LR:
LR < 1: Data less well supported by H1, compared with H2
LR = 1: Data equally well supported by H1 and H2
LR > 1: Data better supported by H1, compared with H2
LR in Transfer Evidence
Background
Data:
DNA profile of evidence sample (E) matches that of
the suspect (S); i.e., E = S
Contrasting Scenarios of Source (Hypotheses):
Hp: DNA in the evidence sample came from the
suspect
Hd: DNA in the evidence came from someone other
than the suspect, but it coincidentally matches the
DNA profile of the suspect.
LR in Transfer Evidence
Computation
LR = Pr. (Data | Hp) / Pr. (Data | Hd)
= Pr. (E = S | Hp) / Pr. (E = S | Hd)
= 1 / Pr. (coincidental match)
Thus, LR in this case is simply the inverse
(reciprocal) of the relative frequency of the DNA
profile of the evidence sample in the population,
given that it is the same as of the suspect
LR in Transfer Evidence
Variation
Since LR can be defined for any two mutually exclusive
hypotheses, one may also consider the alternative
hypothesis as:
Hr: A relative of the suspect is the source of evidence DNA
In this case, the likelihood ratio, LR(r), will be
LR(r) = Prob. (E=S | Hp) / Prob. (E =S | Hr)
= 1/ Pr. (DNA match in the relative),
which equals the reciprocal of the probability of the DNA
profile found in the evidence sample in the relative of the
suspect, given that the suspect has the same DNA profile
LR in DNA Mixture
Background
Data: The DNA evidence profile, E (a DNA mixture) has alleles
which are all explained by alleles present in the suspect’s
DNA profile (S) and that of a victim’s DNA profile (V)
Contrasting Hypotheses:
Hp: DNA in the evidence sample is the mixture of DNA of the
suspect and that of the victim; (i.e., Hp: E = V + S)
Hd1: Evidence DNA is a mixture of DNA from the victim and
that of an unknown person (i.e., Hd1: E = V + UN)
Hd2: Evidence DNA is a mixture of DNA from two unknown
persons (i.e., Hd2: E = UN + UN)
LR in DNA Mixture
Computation
Pr. (Data | Hp: E = V + S) = 1, since data represents all alleles in
the mixture are explained by alleles present in V and S, and
no extra alleles are present in V and/or S.
Hence under Hp: E = V + S, data observed is the only
possible outcome, but
Pr. (Data | Hd1: E = V + UN) = relative frequency of a random
person, whose DNA, mixed with the DNA of the victim,
would yield a mixture that matched the evidence sample,
Pr. (Data | Hd2: E = UN + UN) = relative frequency of a pair of
random persons, whose DNA mixture would match the
profile seen in the evidence sample
LR in DNA Mixture
Interpretation
LR for Hp vs. Hd1 = 1 / Pr. (Data | Hd1: E = V + UN),
which becomes the reciprocal of the relative
frequency of a random person, whose DNA, mixed
with the DNA of the victim, would yield a mixture
that matched the evidence sample
Likewise,
LR for Hp vs. Hd2 = 1 / Pr. (Data | Hd2: E = UN + UN),
which is the inverse of the relative frequency of a
pair of random persons, whose DNA mixture would
match the profile seen in the evidence sample
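The two denominators can be illustrated at a single locus by brute-force enumeration. This is a sketch under stated assumptions: unknown contributors are unrelated, genotypes follow Hardy-Weinberg proportions, only the presence or absence of alleles is used (no peak heights), and the allele labels, frequencies, and profiles are hypothetical.

```python
from itertools import combinations_with_replacement

def genotype_freq(genotype, p):
    """Hardy-Weinberg frequency of an unordered genotype (a, b)."""
    a, b = genotype
    return p[a] ** 2 if a == b else 2 * p[a] * p[b]

def prob_mixture(known, n_unknown, mixture, p):
    """Pr(the alleles of the known contributors plus n_unknown random
    persons form exactly the allele set seen in the mixture)."""
    genotypes = list(combinations_with_replacement(sorted(p), 2))

    def recurse(alleles, remaining):
        if remaining == 0:
            return 1.0 if alleles == mixture else 0.0
        return sum(genotype_freq(g, p) * recurse(alleles | set(g), remaining - 1)
                   for g in genotypes)

    start = frozenset(a for g in known for a in g)
    return recurse(start, n_unknown)

# Hypothetical single-locus example.
p = {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.4}
mixture = frozenset({"A", "B", "C"})
victim, suspect = ("A", "B"), ("C", "C")

pr_hp = prob_mixture([victim, suspect], 0, mixture, p)   # E = V + S
pr_hd1 = prob_mixture([victim], 1, mixture, p)           # E = V + UN
pr_hd2 = prob_mixture([], 2, mixture, p)                 # E = UN + UN
print("LR (Hp vs. Hd1):", pr_hp / pr_hd1)
print("LR (Hp vs. Hd2):", pr_hp / pr_hd2)
```

With several loci, the per-locus ratios would be multiplied only if the loci are treated as independent.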
Other Considerations of Computing
LR in DNA Mixture
Computations of numerator and denominator
of LR in mixture interpretation depend on:
Precise knowledge of the number of
contributors in the DNA mixture
Assumptions regarding the biological
relatedness of the unknown contributors
(between themselves, or with the reference
individuals)
Population origin of the contributors
Likelihood Ratio in Kinship
Assessment
Although the logic is similar, principles of LR
formulation in kinship analysis can be simply
illustrated with:
Standard paternity analysis (with DNA of
mother, child, and alleged father typed for
several loci), and
Kinship assessment for a pair of individuals
(with genotype data from one or more loci)
Interpretation of LR in Paternity Testing
• LR in paternity testing, also called PI, is the ratio of
two conditional probabilities
• It contrasts the chance of observing the specific trio
of genotypes (GC, GM, and GAF) given that AF = BF,
as opposed to AF ≠ BF
• PI (or LR) can be computed even when M and AF, or
AF and BF, are biologically related
• PI can be computed for apparent exclusion events
as well, invoking mutation and/or recombination
(generally leading to drastically reduced PI or LR for
the loci where such events are observed)
LR in Standard Paternity Testing
Data:
Mother’s DNA profile (GM), and that of the child (GC)
suggests that all obligatory alleles (i.e., the alleles that the
child must have received from its biological father, BF) are
present in the DNA profile of AF (GAF)
Hypotheses contrasted:
Hp: Alleged father (AF) is the biological father (BF) of the
child (M is assumed to be the true mother); i.e., Hp: AF = BF
Hd: Alleged father is not the biological father, but he is not
excluded from paternity (i.e., Hd: AF ≠ BF)
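A single-locus paternity index for the two hypotheses above can be sketched as follows. The sketch assumes Mendelian transmission without mutation, Hardy-Weinberg proportions, and an alternative father who is a random man unrelated to M and AF; the genotypes and allele frequencies are hypothetical.

```python
def transmit(genotype):
    """Mendelian transmission: each of the two alleles is passed on with probability 1/2."""
    probs = {}
    for allele in genotype:
        probs[allele] = probs.get(allele, 0.0) + 0.5
    return probs

def prob_child(child, maternal, paternal):
    """Pr(child genotype) given the allele distributions of the maternal and paternal gametes."""
    target = tuple(sorted(child))
    return sum(pm * pf
               for m, pm in maternal.items()
               for f, pf in paternal.items()
               if tuple(sorted((m, f))) == target)

def paternity_index(child, mother, alleged_father, freqs):
    """PI = Pr(GC | GM, GAF, AF = BF) / Pr(GC | GM, AF != BF)."""
    numerator = prob_child(child, transmit(mother), transmit(alleged_father))
    # Under Hd the paternal allele is a random draw from the population.
    denominator = prob_child(child, transmit(mother), freqs)
    return numerator / denominator

# Hypothetical allele frequencies and trio.
freqs = {"A1": 0.5, "A2": 0.1, "A3": 0.4}
pi_value = paternity_index(child=("A1", "A2"), mother=("A1", "A1"),
                           alleged_father=("A2", "A3"), freqs=freqs)
print("PI:", pi_value)
```

Handling apparent exclusions would require a mutation model, which this sketch deliberately leaves out.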
SAMPLING THEORY OF ALLELE
FREQUENCIES
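Under the mutation-drift balance, the probability of a
sample in which $n_i$ copies of the allele $A_i$ are observed,
for any set of $n_i$, is given by
$$\Pr(n_1, n_2, \ldots, n_k) = \frac{\Gamma(\gamma)}{\Gamma(\gamma + n)} \prod_{i=1}^{k} \frac{\Gamma(\gamma_i + n_i)}{\Gamma(\gamma_i)}, \qquad n = \sum_i n_i, \quad \gamma = \sum_i \gamma_i,$$
where $\gamma_i = (1-\theta)p_i/\theta$, $p_i$ is the freq. of allele $A_i$ in the population,
and $\Gamma(\cdot)$ is the Gamma function, in which $\theta$ is the
coefficient of coancestry (equivalent to Fst or Gst, the
coefficient of gene differentiation between
subpopulations within the population)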
Match Probability - Formulae
Genotype | Under HWE | With substructure adjustment (unconditional) | With substructure adjustment (conditional)
Homozygote (AiAi) | pi² | pi² + θpi(1-pi) | [pi(1-θ)+2θ] [pi(1-θ)+3θ] / [(1+θ) (1+2θ)]
Heterozygote (AiAj) | 2pipj | 2pipj(1-θ) | 2[pi(1-θ)+θ] [pj(1-θ)+θ] / [(1+θ) (1+2θ)]
CONDITIONAL MATCH PROBABILITY
$$\Pr(A_iA_i \mid A_iA_i) = \frac{[2\theta + (1-\theta)p_i]\,[3\theta + (1-\theta)p_i]}{(1+\theta)(1+2\theta)}$$
$$\Pr(A_iA_j \mid A_iA_j) = \frac{2\,[\theta + (1-\theta)p_i]\,[\theta + (1-\theta)p_j]}{(1+\theta)(1+2\theta)}$$
where $p_i$, $p_j$ are frequencies of alleles $A_i$ and $A_j$, and
$\theta$ = coefficient of co-ancestry (Fst/Gst), representing
the extent of the population substructure effect
(Balding and Nichols, 1994)
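The two conditional match probabilities translate directly into code. This is a minimal sketch; the function names and the example allele frequencies are hypothetical, and θ = 0.01 is used only as an illustrative value.

```python
def match_hwe(p_i, p_j=None):
    """Genotype frequency under Hardy-Weinberg equilibrium
    (homozygote if p_j is omitted, heterozygote otherwise)."""
    return p_i ** 2 if p_j is None else 2 * p_i * p_j

def match_conditional(theta, p_i, p_j=None):
    """Conditional match probability with co-ancestry coefficient theta."""
    denom = (1 + theta) * (1 + 2 * theta)
    if p_j is None:  # Pr(AiAi | AiAi)
        return (2 * theta + (1 - theta) * p_i) * (3 * theta + (1 - theta) * p_i) / denom
    # Pr(AiAj | AiAj)
    return 2 * (theta + (1 - theta) * p_i) * (theta + (1 - theta) * p_j) / denom

# Hypothetical allele frequencies.
theta = 0.01
print("homozygote:  ", match_hwe(0.2), match_conditional(theta, 0.2))
print("heterozygote:", match_hwe(0.1, 0.2), match_conditional(theta, 0.1, 0.2))
```

Setting θ = 0 recovers the Hardy-Weinberg frequencies, and any θ > 0 makes the conditional probability larger than the unadjusted one.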
Match Probability - examples
Locus (genotype) | Under HWE | With substructure adjustment (θ = .01), unconditional | With substructure adjustment (θ = .01), conditional
D3S1358 (14, 18) | 0.0457 | 0.0457 | 0.0495
vWA (14, 16) | 0.0411 | 0.0411 | 0.0451
FGA (23, 25) | 0.0218 | 0.0218 | 0.0253
D8S1179 (12, 14) | 0.0586 | 0.0586 | 0.0626
D21S11 (29, 30) | 0.0840 | 0.0840 | 0.0881
D18S51 (13, 17) | 0.0381 | 0.0381 | 0.0418
D5S818 (12, 12) | 0.1252 | 0.1275 | 0.1367
D13S317 (9, 11) | 0.0488 | 0.0488 | 0.0542
D7S820 (10, 10) | 0.0844 | 0.0865 | 0.0949
Cumulative | 3.96 × 10^-12 | 4.13 × 10^-12 | 9.15 × 10^-12
Upper bound of 95% C.I. | 1.02 × 10^-11 | 1.05 × 10^-11 | 2.17 × 10^-11
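The cumulative row is consistent with multiplying the per-locus values, which can be checked with a short sketch. It assumes the loci are treated as independent (the product rule); the per-locus figures are copied from the table above, and the variable names are illustrative.

```python
from math import prod

# Per-locus match probabilities from the table (nine loci).
hwe = [0.0457, 0.0411, 0.0218, 0.0586, 0.0840, 0.0381, 0.1252, 0.0488, 0.0844]
unconditional = [0.0457, 0.0411, 0.0218, 0.0586, 0.0840, 0.0381, 0.1275, 0.0488, 0.0865]
conditional = [0.0495, 0.0451, 0.0253, 0.0626, 0.0881, 0.0418, 0.1367, 0.0542, 0.0949]

# Product rule: multiply across loci, assuming independence between loci.
cumulative_hwe = prod(hwe)
cumulative_unconditional = prod(unconditional)
cumulative_conditional = prod(conditional)

for label, value in [("HWE", cumulative_hwe),
                     ("unconditional", cumulative_unconditional),
                     ("conditional", cumulative_conditional)]:
    print(f"{label}: {value:.3e}")
```

Because the inputs are rounded to four decimals, the products agree with the tabulated cumulative values only up to rounding.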
Paternity Testing – Frequentist
Approach Example
In a standard paternity testing case, with mother’s genotype being A1A1, and
the child’s A1A2, an alleged father whose genotype does not contain the A2
allele would be excluded, giving
PE = 1 – Freq.(A2A2) – Freq.(A2Ā2),
where Ā2 is any allele other than the allele A2. This computation assumes
that no mutation occurred during the transmission of alleles across
generations.
Note: Average exclusion probability can also be computed disregarding the
profiles examined, which rationalizes the choice of loci to be typed for
validating the stated relationship
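The exclusion probability in this example reduces to a simple expression once genotype frequencies are written in terms of the frequency of A2. The sketch below assumes Hardy-Weinberg proportions (an assumption added here) and no mutation (as on the slide); the allele frequencies are hypothetical, and combining loci by multiplication further assumes independent loci.

```python
from math import prod

def exclusion_probability(p_obligate):
    """PE = 1 - Freq(A2A2) - Freq(A2 with any other allele) for the
    obligate paternal allele A2, under Hardy-Weinberg proportions."""
    hom = p_obligate ** 2
    het = 2 * p_obligate * (1 - p_obligate)
    return 1 - hom - het          # equals (1 - p_obligate) ** 2

def combined_exclusion(per_locus_pe):
    """Chance of exclusion at one or more loci (loci assumed independent)."""
    return 1 - prod(1 - pe for pe in per_locus_pe)

# Hypothetical obligate-allele frequencies at three loci.
pes = [exclusion_probability(p) for p in (0.10, 0.25, 0.05)]
print("per-locus PE:", pes)
print("combined PE:", combined_exclusion(pes))
```

A rarer obligate allele gives a higher PE, which is one way to see why highly polymorphic loci are preferred for parentage testing.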
LR for Kinship of a Pair of Individuals
Data:
DNA profile (GX) of one individual X, compared with that (GY)
of another individual Y is considered to assess the accuracy
of a specified stated biological relationship between X and Y
Hypotheses contrasted:
Hp: X and Y are biologically related (i.e., the stated
relationship is correct)
Hd: X and Y are biologically not related
Note: Comparison between two stated relationships may also be tested
IBD Probabilities – ITO Method
Two individuals of genotypes GX and GY can
share:
Both alleles IBD (called scenario I),
Only one allele from each is IBD (scenario T),
None of their alleles are IBD (scenario O).
Their probabilities are denoted by Φ2, Φ1, and Φ0,
respectively, and for any biological relatedness
0 ≤ Φ2, Φ1, Φ0 ≤ 1, Φ2 + Φ1 + Φ0 = 1, and 4Φ0Φ2 ≤ Φ1²
Kinship Analysis of a pair of Individuals :
IBD Coefficients In Relatives
Relationship Type | Symbol | Φ0 | Φ1 | Φ2
Monozygotic twins | MZ | 0 | 0 | 1
Parent-Offspring | PO | 0 | 1 | 0
Full Sib | S | 1/4 | 1/2 | 1/4
First Cousin | 1C | 3/4 | 1/4 | 0
Unrelated | U | 1 | 0 | 0
Conditional Probability of Gy given Gx for
specific kinship of x and y
• Stipulated kinship between x and y specifies the IBD
probabilities Φ0, Φ1, Φ2 for x and y
• For observed Gx and Gy :
Pr (Gy | Gx for the specified relationship)
= Φ0•Pr(Gy | Gx under O) + Φ1•Pr(Gy | Gx under T)
+ Φ2•Pr(Gy | Gx under I)
Rule: Conditional probability of Gy given Gx for a stated kinship is the
weighted average of conditional probabilities of the same event under
specified IBD described by the kinship
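The weighted-average rule can be sketched for one locus as follows. The IBD probabilities are those tabulated for each relationship; the sketch assumes Hardy-Weinberg proportions, no mutation, and no population substructure, and the allele labels, frequencies, and function names are hypothetical.

```python
# IBD probabilities (phi0, phi1, phi2) for each relationship.
IBD = {
    "MZ": (0.0, 0.0, 1.0),
    "PO": (0.0, 1.0, 0.0),
    "S": (0.25, 0.5, 0.25),
    "1C": (0.75, 0.25, 0.0),
    "U": (1.0, 0.0, 0.0),
}

def norm(genotype):
    """Unordered genotype as a sorted tuple."""
    return tuple(sorted(genotype))

def pr_none_ibd(gy, p):
    """Scenario O: Gy is an independent draw from the population."""
    a, b = gy
    return p[a] ** 2 if a == b else 2 * p[a] * p[b]

def pr_one_ibd(gy, gx, p):
    """Scenario T: one allele of y is a copy of either allele of x
    (probability 1/2 each); the other allele is a random draw."""
    total = 0.0
    for shared in gx:
        for other, freq in p.items():
            if norm((shared, other)) == norm(gy):
                total += 0.5 * freq
    return total

def pr_both_ibd(gy, gx):
    """Scenario I: both alleles IBD, so the genotypes must be identical."""
    return 1.0 if norm(gy) == norm(gx) else 0.0

def conditional_probability(gy, gx, relationship, p):
    phi0, phi1, phi2 = IBD[relationship]
    return (phi0 * pr_none_ibd(gy, p)
            + phi1 * pr_one_ibd(gy, gx, p)
            + phi2 * pr_both_ibd(gy, gx))

def kinship_lr(gy, gx, relationship, p):
    """LR of the stated relationship against 'unrelated'."""
    return (conditional_probability(gy, gx, relationship, p)
            / conditional_probability(gy, gx, "U", p))

p = {"A": 0.1, "B": 0.2, "C": 0.7}   # hypothetical allele frequencies
print("PO, AB vs AB:", kinship_lr(("A", "B"), ("A", "B"), "PO", p))
print("S,  AA vs AA:", kinship_lr(("A", "A"), ("A", "A"), "S", p))
```

Comparing two stated relationships directly amounts to taking the ratio of their two conditional probabilities instead of dividing by the unrelated case.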
GENOTYPE PROBABILITIES FOR A PAIR OF
INDIVIDUALS CONDITIONED BY
IBD PROBABILITIES OF ALLELES
Bayes Formula (Odds form)
$$\frac{P(H_1 \mid E)}{P(H_2 \mid E)} = \frac{P(E \mid H_1)}{P(E \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$
posterior odds = likelihood ratio x prior odds
E = DNA evidence
H1 = alleged father is biological father
H2 = alleged father is not biological father
Note: While the first factor of the RHS is computed from DNA evidence,
the second factor, P(H1)/P(H2), is not necessarily a DNA-based information
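The odds form can be turned into a posterior probability with a few lines of code. This is a minimal sketch; the likelihood ratio and prior probabilities below are hypothetical, and the choice of prior is exactly the non-DNA input that the note above refers to.

```python
def posterior_probability(likelihood_ratio, prior_probability):
    """Posterior Pr(H1 | E) from LR = P(E | H1) / P(E | H2) and a prior Pr(H1),
    with H1 and H2 taken as the only two hypotheses."""
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1 + posterior_odds)

# The same hypothetical LR gives different posteriors under different priors.
for prior in (0.5, 0.1, 0.001):
    print(prior, posterior_probability(99.0, prior))
```

The loop shows why an LR should not be read as a posterior: identical DNA evidence leads to different posterior probabilities once the prior changes.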
Synthesis of Three Approaches of Statistical
Assessment
Frequency-Approach provides the probability of the
observed DNA evidence (unconditional as well as
conditional) under a given stipulated hypothesis
Likelihood Ratio (LR) contrasts such probabilities for
two mutually exclusive hypotheses
In Bayesian approach, with the use of prior probability,
LR is transformed to obtain the relative odds of one
hypothesis against another given the DNA data of the
evidence (and that from known persons tested)
Synthesis of Three Approaches (Contd.)
The three approaches are built on one another, and
hence, it is inaccurate to say one is wrong and the
others are correct
LR, without the transformation with the use of the prior
probability, may be incorrectly interpreted as the
answer of the Bayesian computation, but the numerator
and denominator of LR can be stated with frequentist’s
interpretation to avoid the error of reverse conditioning
The prior probability of the Bayesian approach
generally comes from non-DNA evidence, and hence,
their assumptions are untestable from DNA data
Important Fact with An Example
LR, by itself, is not a Bayesian Approach, and the
prosecutor’s fallacy can be avoided by explaining the two
conditional probabilities separately
Example: Consider a mixture case, where victim’s profile (V) together with the defendant’s
profile (S) explains all alleles in the mixture profile (E).
Under Hp: E = V + S, the conditional probability of E given Hp is 1.0, but under Hd: E = V +
UN, say the conditional probability of E given that the other contributor is unknown (UN) is
1 in 100,000.
Instead of stating LR = 100,000, it is less confusing to say that if we were to assume that the
mixture DNA came from the victim and this defendant, this is the only observation possible
(certain), but if the other contributor is unknown, we would expect to sample about 100,000 unrelated
persons to find one whose DNA, mixed with that of the victim, would produce a
profile matching the profile seen in the mixture DNA evidence sample.
Is the Extent of Population
Substructure Uncertain
for the Forensic Loci?
Inbreeding Coefficient (FST)
Locus | Caucasian | African American | Hispanic | Asian | Native American
CSF1PO | -0.0007 | -0.0009 | -0.0003 | -0.0012 | 0.0244
D13S317 | -0.0008 | 0.0029 | 0.0047 | 0.0071 | 0.0157
D18S51 | 0.0001 | 0.0012 | 0.0011 | 0.0046 | 0.0268
D21S11 | 0.0008 | 0.0005 | 0.0013 | 0.0056 | 0.0371
D3S1358 | -0.0009 | -0.0009 | 0.0010 | 0.0035 | 0.0764
D5S818 | -0.0001 | 0.0010 | 0.0010 | 0.0028 | 0.0656
D7S820 | -0.0005 | 0.0000 | 0.0010 | 0.0039 | 0.0201
D8S1179 | 0.0000 | -0.0001 | 0.0005 | 0.0025 | 0.0125
FGA | -0.0004 | 0.0004 | 0.0008 | 0.0029 | 0.0168
TH01 | -0.0012 | 0.0015 | 0.0041 | 0.0058 | 0.0356
TPOX | -0.0015 | 0.0021 | 0.0024 | 0.0100 | 0.0164
VWA | -0.0011 | 0.0011 | 0.0029 | 0.0027 | 0.0172
Average | -0.0005 | 0.0006 | 0.0021 | 0.0039 | 0.0282
The NRC-II recommendation
θ = 0.01 for large cosmopolitan populations
and
θ = 0.03 for small isolated populations
is well-validated by empirical as well as
theoretical foundations
Are the DNA Forensic
Population Databases Random
and are their Sample Sizes
Sufficient?
Features of Genetic Databases
• Population Genetics has historically employed
‘convenient’ sampling, instead of strict random sampling
• ‘Convenient sampling’ defined as sampling of individuals
without any prior knowledge of their DNA type is
operationally random, in particular, when variations at
DNA loci do not affect fertility, viability, cognitive or life
achievement abilities
• Allele frequency estimates from convenient samples have
been shown to well-approximate those estimates from
structured strict random sampling
• Strict random samples collected at one point of time from a
natural population may not remain random at another time
point because of birth, death, immigration, and emigration
events
Features of Genetic Databases - 2
• Allele frequencies from subjects of
convenient samples described by ‘self-identified’ ethnicity have been shown to
represent genetic affinities comparable with
similar inferences drawn from
anthropologically well-defined populations
• Occasional presence of biological relatives
in convenient samples does not affect allele
frequency estimates, but may produce
excess allele/genotype sharing at some loci
Phylogenetic Tree (UPGMA) for some World
Populations with allele frequency data of
the CODIS STR Loci
SW Hispanic (TX)
SW Hispanic (CA)
US Caucasian
Swiss
Italian
SE Hispanic (FL)
Chinese
Japanese
African American (TX)
African American (CA)
Apache
Navajo
Athabaskan
Inupiat
Yupik
Sample Size Limitation Issue
• Strictly speaking, no sample size is universally
sufficient unless all individuals are continually
genotyped over time
• Sample sizes of 100 to 150 individuals per
population have been shown to produce stable
estimates of allele frequencies above a prescribed
minimum threshold allele frequency
• Current forensic DNA statistics employ the
concepts of a minimum threshold allele frequency
and the upper bound of the 95% confidence interval to account for
sampling variation
Concerns Related to Databases
Used for Lineage Markers
(e.g., mtDNA and Y-STRs)
Inheritance of Lineage Markers
[Figure: pedigree showing the inheritance of lineage markers. Colors denote mtDNA type; letters (X, A, B) indicate Y-linked information, where X denotes no Y chromosome, and A and B are Y-linked alleles or haplotypes]
Introductory Comments on Lineage Markers
• mtDNA is maternally inherited, and Y-STRs are
transmitted only from fathers to sons
• Barring mutations, all maternally related persons
(males as well as females) will have the same
mtDNA profile, and all paternally related males
will have the same Y-STR profile
• Different markers on mtDNA are genetically
linked (with virtually no recombination) and so
are the Y-STRs (residing on the non-recombining
region of the Y chromosome)
Comments on Lineage Markers (Contd.)
• Consequently, mtDNA sequence data has to be treated like
a haploid haplotype, frequency of which is NOT
multiplicative across markers, and so is the case of Y-STR
based profile
• Counting method is the one that captures the genetic
information
• Stated ethnicity of individuals does not necessarily reflect
patrilineal or matrilineal ancestry (e.g., mtDNA of
Hispanics may be almost entirely of Native American
descent, while for the autosomal STRs, only 30-50% of
their genes are of Native American descent)
• Thus, grouping of populations used for autosomal nuclear
STR loci does not necessarily provide accurate frequency
estimates of Y-linked STR haplotype, nor that of specific
mtDNA sequence
Fundamental Difference between the Frequency of a CODIS
STR DNA Profile and That of a Profile Based on mtDNA
or Y-STRs
• For CODIS STR loci, profile frequency provides
information regarding the rarity of the profile in
the population, or conditional probability given
that the profile is found in someone else
• For mtDNA, it is the frequency among individuals
who are NOT maternally related
• For Y-STRs, likewise, it is the frequency among
individuals NOT paternally related
Computation of Frequency of Lineage-based
Marker Profile
Using the general theory, the unconditional
frequency of a haplotype (say Ai), which is
count divided by sample size, can be
modified to get the conditional probability
Pr. (Ai|Ai) = [pi² + θpi(1-pi)]/pi
= pi + θ(1-pi)
= θ + pi(1 - θ)
Hence, the conditional probability always
exceeds θ, the adjustment factor for possible
population substructure in the database used
Computation of Frequency of Lineage-based
Marker Profile (Contd.)
Some advocates suggest that the quantity pi
in
Pr. (Ai|Ai) = θ + pi(1 - θ)
can be substituted by
(Count of Ai + 2)/(N + 3),
where N is the sample size.
When N is large, this has little effect, but can be of
help when the count of Ai in the database is zero
(i.e., profile in evidence not seen in the database)
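The counting method and the suggested substitution can be compared with a short sketch. The function names, database size, counts, and θ value are hypothetical; the formulas are the ones given above.

```python
def counting_frequency(count, n):
    """Counting method: haplotype count divided by database size."""
    return count / n

def adjusted_frequency(count, n):
    """Suggested substitute for pi: (count + 2) / (N + 3)."""
    return (count + 2) / (n + 3)

def conditional_haplotype_probability(p, theta):
    """Pr(Ai | Ai) = theta + pi * (1 - theta)."""
    return theta + p * (1 - theta)

theta = 0.01   # hypothetical value
n = 997        # hypothetical database size
for count in (0, 5):
    p_count = counting_frequency(count, n)
    p_adj = adjusted_frequency(count, n)
    print(count,
          conditional_haplotype_probability(p_count, theta),
          conditional_haplotype_probability(p_adj, theta))
```

For a haplotype never seen in the database, the counting estimate is zero while the substitute stays positive; in both cases the conditional probability stays above θ.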
mtDNA and Y-STR θ-Value
Since, in terms of match versus non-match,
how different the haplotypes are is not an
issue, the θ values for mtDNA and Y-linked
haplotypes are to be computed not with
mismatch-based approaches (such as
AMOVA), but by treating all haplotypes as
different alleles, generally leading to a much
smaller θ value
Issues Related to DNA-Match
Statistics when Suspects are
Identified by Database Search
Three Approaches – Three Types of
Questions!
• The NRC-I recommendation to use only the additional
loci, not used in database search, is counter-productive
• The chance of coincidental finding of a profile in a
database depends on the expected rarity of the profile and
database size
• NRC-II’s Np rule answers the question of expected number
of profiles matching a target profile (of rarity p) in a
database (random with respect to crime) of size N
• Bayesian approach makes additional assumptions
regarding the prior odds of each individual in the database
being the contributor of the DNA of the target profile
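The Np rule is simple enough to compute directly. The sketch below also reports the chance of at least one coincidental match, which is an added illustration that assumes the N profiles are independent of the target and of one another; the database size and profile rarity are hypothetical.

```python
from math import expm1, log1p

def expected_matches(n, p):
    """Np rule: expected number of database profiles matching a target of rarity p."""
    return n * p

def prob_at_least_one_match(n, p):
    """1 - (1 - p)^n, written with log1p/expm1 to stay accurate for very small p."""
    return -expm1(n * log1p(-p))

n_profiles = 1_000_000   # hypothetical database size
rarity = 1e-9            # hypothetical profile frequency
print("expected matches:", expected_matches(n_profiles, rarity))
print("Pr(at least one):", prob_at_least_one_match(n_profiles, rarity))
```

When Np is much smaller than 1, the two numbers are nearly equal, which is why Np is often quoted for both questions.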
DOES SOMEONE HAVE YOUR BIRTHDAY?
Prob. that in a sample of n persons, all
birthdays are different is given by
$$\Pr(\text{all different}) = \prod_{i=1}^{n-1}\left(1 - \frac{i}{365}\right)$$
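The birthday calculation can be sketched as follows, assuming 365 equally likely birthdays; the `days` parameter is an illustrative generalization to any number of equally likely types.

```python
def prob_all_different(n, days=365):
    """Probability that n persons all have different birthdays."""
    prob = 1.0
    for i in range(n):
        prob *= (days - i) / days
    return prob

def smallest_sample_for_duplicate(confidence, days=365):
    """Smallest n for which Pr(at least one shared birthday) >= confidence."""
    n = 1
    while 1 - prob_all_different(n, days) < confidence:
        n += 1
    return n

n_half = smallest_sample_for_duplicate(0.5)
print(n_half, 1 - prob_all_different(n_half))
```

The same reasoning applies to pairwise comparisons of profiles within a database: the number of pairs grows roughly with the square of the sample size, so duplicates appear far sooner than the rarity of any single type suggests.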
SAMPLE SIZE NEEDED FOR AT LEAST ONE
DUPLICATE FOR GIVEN VALUES OF EVENT
PROBABILITY AND DEGREE OF
CONFIDENCE
OBSERVED AND EXPECTED
MATCH PROBABILITY
[Figure: four panels (Caucasian, African-American, Caribbean, Hispanic) showing observed and expected frequency (0 to 0.45) against number of loci (0 to 13)]
EXPECTED NUMBER OF MATCHES IN
DATABASE SEARCH (CARIBBEAN)
OBSERVED AND EXPECTED NUMBER OF
MATCHES IN PAIRWISE COMPARISON OF
PROFILE IN DATABASE (CARIBBEAN)
EFFECT OF PRESENCE OF RELATIVES
(Caucasian data on CODIS loci, θ = 0, N = 1000)
[Figure: number of pairs (log scale, 1E-10 to 1,000,000) against number of loci (0 to 13), with series for Unrelated, 1 Full sib, 10 Full sibs, and 100 Full sibs]
Conclusions
• With the larger amount of data collected since 1996, and with
experience of statistical results from casework, the NRC-II
recommendations remain appropriate for the
statistical evaluation of Forensic DNA evidence
• Statistical answers to different questions are necessarily
different; they do not constitute a lack of general acceptance
• mtDNA and Y-STR database groupings are necessarily
different from those of autosomal STRs because of the uniparental ancestry of lineage markers
• Convenient sampling effects and sample size limitations
are embedded in current protocols of DNA statistics
• A suspect identified through a database search raises multiple types of
questions, the answers to which are different
Acknowledgements
• Dr. Bruce Budowle - from FBI Academy
• Hee S. Lee, Xiaohua Sheng, Jianye Ge - Graduate Students at CGI, Univ. Cincinnati
• SWGDAM members – for providing
databases
• US Granting Agencies NIH and NIJ – for
partial support of the research
Thank You!