DNA Analysis in Forensics
Download
Report
Transcript DNA Analysis in Forensics
Forensics and CS
Philip Chan
CSI: Crime Scene Investigation
www.cbs.com/shows/csi/
high tech forensics tools
DNA profiling
Use as evidence in court cases
DNA
Deoxyribonucleic Acid
Each person is unique in DNA (except for
twins)
DNA samples can be collected at crime
scenes
About .1% of human DNA varies from person
to person
Forensics Analysis
Focus on loci (locations) of the DNA
Values at the those loci (DNA profile) are
recorded for comparing DNA samples.
Two DNA profiles from the same person have
matching values at all loci.
More or fewer loci are more accurate in
identification?
Tradeoffs?
FBI uses 13 core loci
http://www.cstl.nist.gov/biotech/strbase/fbicore.htm
We do not want to wrongly accuse
someone
How can we find out how likely another
person has the same DNA profile?
How many people are in the world?
How low the probability needs to be so that a
DNA profile is unique in the world?
Low probability doesn’t mean impossible
Just very unlikely
Review of basic probability
Joint probability of two independent events
P(A,B) = ?
Review of basic probability
Joint probability of two independent events
P(A,B) = P(A) * P(B)
Independent events mean knowing one event
does not provide information about the other
events
P(Die1=1, Die2=1)
= P(Die1=1) * P(Die2=1)
= 1/6 * 1/6 = 1/36.
Enumerating the events
1
1
2
3
1,1
1,2
…
4
2
3
4
5
6
36 events, each is equally likely, so 1/36
5
6
Joint probability
P(Die1=even, Die2=6) = ?
Joint probability
P(Die1=even, Die2=6)
= 1/2 * 1/6 = 1/12
P(Die1=1, Die2=5, Die3=4) = ?
Joint probability
P(Die1=even, Die2=6)
= 1/2 * 1/6 = 1/12
P(Die1=1, Die2=5, Die3=4)
= (1/6)3 = 1/216
DNA profile probability
How to estimate?
DNA profile probability
How to estimate?
Assuming loci are independent
P(Locus1=value1, Locus2=value2, ...)
= P(Locus1=value1) * P(Locus2=value2) * ...
DNA profile probability
How to estimate?
Assuming loci are independent
P(Locus1=value1, Locus2=value2, ...)
= P(Locus1=value1) * P(Locus2=value2) * ...
How to estimate P(Locus1=value1)?
DNA profile probability
How to estimate?
Assuming loci are independent
P(Locus1=value1, Locus2=value2, ...)
= P(Locus1=value1) * P(Locus2=value2) * ...
How to estimate P(Locus1=value1)?
a random sample of size N from the
population and
find out how many people out of N have
value1 at Locus1
Database of DNA profiles
Id
A5212
A6921
…
Locus1
Locus2
Locus3
…
Locus13
Problem Formulation
Given
A sample profile (e.g. collected from the crime
scene)
A database of known profiles
Find
The probability of the sample profile if it
matches a known profile in the database
Breaking Down the Problem
Find
The probability of the sample profile if it matches a
known profile in the database
What are the subproblems?
Breaking Down the Problem
Find
The probability of the sample profile if it matches a
known profile in the database
What are the subproblems?
Subproblem 1
Find whether the sample profile matches
1a: ?
1b: ?
Subproblem 2
Calculate the probability of the profile
Breaking Down the Problem
Find
The probability of the sample profile if it matches a
known profile in the database
What are the subproblems?
Subproblem 1
Find whether the sample profile matches
1a: check entries in the database
1b: check if all 13 loci match in each entry
Subproblem 2
Calculate the probability of the profile
Simpler Problem for 1a (very
common)
Given
an array of integers (e.g. student IDs)
an integer (e.g. an ID)
Find
whether the integer is in the array
int[]
directory; // student id’s
int
id;
// to be found
boolean found;
// true if id is in directory
Linear/Sequential Search
Check one by one
Stop if you find it
Stop if you run out of items to check
Not found
Number of Checks (speed of
algorithm)
Consider N items in the array
Best-case scenario
When does it occur? How many checks?
Number of Checks (speed of
algorithm)
Consider N items in the array
Best-case scenario
When does it occur? How many checks?
First item;1 check
Worst-case scenario
When does it occur? How many checks?
Number of Checks (speed of
algorithm)
Consider N items in the array
Best-case scenario
Worst-case scenario
When does it occur? How many checks?
First item;1 check
When does it occur? How many checks?
Last item or not there; N checks
Average-case scenario
Average of all cases
(1 + 2 + … + N) / N =
Can we do better? Faster
algorithm?
What if the array is sorted, items are in an
order
E.g. a phone book
Binary Search
Check the item at midpoint
If found, done
Otherwise, eliminate half and repeat
Breaking down the problem
While more items and not found
What are the two subproblems?
Breaking down the problem
While more items and not found
Eliminate half of the items
Find the mid point
Number of checks (Speed of
algorithm)
Best-case scenario
When does it occur? How many checks?
Number of checks (Speed of
algorithm)
Best-case scenario
When does it occur? How many checks?
In the middle; 1 check
Number of checks (Speed of
algorithm)
Best-case scenario
When does it occur? How many checks?
In the middle; 1 check
Worst-case scenario
When does it occur? How many checks?
Number of checks (Speed of
algorithm)
Best-case scenario
When does it occur? How many checks?
In the middle; 1 check
Worst-case scenario
When does it occur? How many checks?
Dividing into two halves, half has only one
item
? checks
Number of checks (Speed of
algorithm)
Best-case scenario
When does it occur? How many checks?
In the middle; 1 check
Worst-case scenario
When does it occur? How many checks?
Dividing into two halves, half has only one
item
? checks
Number of checks (Speed of
algorithm)
T(1) = 1
T(N) = T(N/2) + 1
Number of checks (Speed of
algorithm)
T(1) = 1
T(N) = T(N/2) + 1
Number of checks (Speed of
algorithm)
T(1) = 1
T(N) = T(N/2) + 1
= [ T(N/4) + 1 ] + 1
Number of checks (Speed of
algorithm)
T(1) = 1
T(N) = T(N/2) + 1
= [ T(N/4) + 1 ] + 1
Number of checks (Speed of
algorithm)
T(1) = 1
T(N) = T(N/2) + 1
= [ T(N/4) + 1 ] + 1
= [ [ T(N/8) + 1] + 1] + 1
Number of checks (Speed of
algorithm)
T(1) = 1
T(N) = T(N/2) + 1
= [ T(N/4) + 1 ] + 1
= [ [ T(N/8) + 1] + 1] + 1
= … any pattern?
Number of checks (Speed of
algorithm)
T(1) = 1
T(N) = T(N/2) + 1
= [ T(N/4) + 1 ] + 1
= [ [ T(N/8) + 1] + 1] + 1
= …
= T(N/2k) + k
Number of checks (Speed of
algorithm)
T(1) = 1
T(N) = T(N/2) + 1
= [ T(N/4) + 1 ] + 1
= [ [ T(N/8) + 1] + 1] + 1
= …
= T(N/2k) + k
N/2k gets smaller and eventually becomes 1
Number of checks (Speed of
algorithm)
T(1) = 1
T(N) = T(N/2) + 1
= [ T(N/4) + 1 ] + 1
= [ [ T(N/8) + 1] + 1] + 1
= …
= T(N/2k) + k
N/2k gets smaller and eventually becomes 1
solve for k
Number of Checks (Speed of
Algorithm)
N/2k = 1
N = 2k
k=?
Number of Checks (Speed of
Algorithm)
N/2k = 1
N = 2k
k = log2N
Number of Checks (Speed of
Algorithm)
N/2k = 1
N = 2k
k = log2N
T(N) = T(N/2k) + k
= T(1) + log2N
= ? + log2N
Number of Checks (Speed of
Algorithm)
N/2k = 1
N = 2k
k = log2N
T(N) = T(N/2k) + k
= T(1) + log2N
= 1 + log2N
Sorting (arranging the items in a
desired order)
How is the phone book arranged?
Why?
Why not arranged by numbers?
Sorting (arranging the items in a
desired order)
How is the phone book arranged?
Why?
Why not arranged by numbers?
Order
Alphabetical
Low to high numbers
DNA profile with 13 loci?
Sorting
Imagine you have a thousand numbers in an
array
How would you systemically sort them?
Selection Sort (ascending)
Find/select the smallest item
Swap the smallest item with the first item
Selection Sort (ascending)
Find the smallest item
Swap the smallest item with the first item
Find/select the second smallest item
Swap the second smallest item with the
second item
…
Breaking down the problem
Get all the items in ascending order
Get one item at the wanted position/index
What are the two subproblems?
Breaking down the problem
Get all the items in ascending order
Get one item at the wanted position/index
Find the smallest item
Breaking down the problem
Get all the items in ascending order
Get one item at the wanted position/index
Find the smallest item
Swap the smallest item with the item at the
wanted position
Number of comparisons (Speed of
Algorithm)
Consider counting
Number of comparisons between array items
Number of comparisons (Speed of
Algorithm)
Consider counting
Number of comparisons between array items
Best-case scenario (least # of comparisons)
When does it occur? How many
comparisons?
Number of comparisons (Speed of
Algorithm)
Consider counting
Number of comparisons between array items
Best-case scenario (least # of comparisons)
When does it occur? How many
comparisons?
Worst-case scenario (most # of comparisons)
When does it occur? How many
comparisons?
Number of comparisons (Speed of
Algorithm)
Consider counting
Number of comparisons between array items
Best-case scenario (least # of comparisons)
When does it occur? How many
comparisons?
Worst-case scenario (most # of comparisons)
When does it occur? How many
comparisons?
Same number of comparisons
For all cases (ie best case = worst case)
Number of comparisons (Speed of
Algorithm)
To find the smallest item
How many comparisons?
Number of comparisons (Speed of
Algorithm)
To find the smallest item
How many comparisons?
N-1
To find the second smallest item
How many comparisons?
Number of comparisons (Speed of
Algorithm)
To find the smallest item
How many comparisons?
N-1
To find the second smallest item
How many comparisons?
N-2
…
Total # of comparisons?
Number of comparisons (Speed of
Algorithm)
To find the smallest item
How many comparisons?
N-1
To find the second smallest item
How many comparisons?
N-2
…
Total # of comparisons
(N-1) + (N-2) + … + 1
Number of comparisons (Speed of
Algorithm)
To find the smallest item
How many comparisons?
N-1
To find the second smallest item
How many comparisons?
N-2
…
Total # of comparisons
(N-1) + (N-2) + … + 1
N(N-1)/2 = (N2 – N)/2
Summary
DNA samples from crime scene
Identify people using known DNA profiles
If there is a match
estimate probability of DNA profile
Matching a sample to known DNA profiles
Linear/sequential search [N checks]
Binary search [log2N + 1 checks]
Faster but needs sorted data/profiles
Selection Sort [(N2 – N)/2 comparisons]