Principal Component Analysis and Other Methods

Download Report

Transcript Principal Component Analysis and Other Methods

Computational Strategies for Tool Marks:
Principal Component Analysis and Other Met
hods
3
2
1
0
1
2
3
Outline
• Introduction
• Details of Our Approach
•
•
•
•
•
Data acquisition
Surface/Profile Pre-processing
Methods of Statistical Discrimination
Error Rate Estimates
Suggested Operating Guidelines
Background Information
• All impressions made by tools and firearms can
be viewed as numerical patterns
– Machine learning trains a computer to recognize
patterns
• Can give “…the quantitative difference between an
identification and non-identification”Moran
• Can yield identification error rate estimates
• May be even confidence for I.D.s, posterior error
probability estimates…??
Data Acquisition
• Obtain striation/impression
patterns from 3D confocal
microscopy
• Store files in ever expanding
database
• Data acquisition is labor
intensive and time consuming
• Data files are available to
practitioner community
through web interface
Screwdriver Striation Patterns in Lead
2D profiles
3D surfaces
(interactive)
Glock Fired Cartridges
Bottom of
Firing pin imp.
Glock 19 fired cartridge cases
Shear marks on primer of two
different Glock 19s
Mean total profile:
Mean “waviness”
profile:
Mean “roughness”
profile:
Profile Simulator
• Nice for data sets to be large but real data acquisition is SLOW
• Need SOMETHING in the interim
• Our simulator is based on DWT MRA
• May shed light on processed generating surfaces
• Should be extendable to 2D striations/impressions…
What Other Statistics Can Be Used?
• Multivariate Statistical pattern
comparison!
• Modern algorithms are called
machine learning
• Idea is to measure
features of the
physical evidence
that characterize it
• Train algorithm to
recognize “major”
differences between
groups of features while taking into account
natural variation and measurement error.
Setup for Multivariate Analysis
• Need a data matrix to do machine learning
é X11 .. X1 j
ê
:
ê :
ê
X = Xi1 .. Xij
ê
ê :
:
ê X
êë n1 .. X nj
.. X1 p ù
ú
: ú
.. Xip ú
ú
: ú
.. X np úú
û
Represent as a
vector of values
{-4.62, -4.60, -4.58, ...} • Each profile vector is a row in the data matrix
• Typical length is ~4000 points/profile
• 2D surfaces are far longer
• PCA can:
• HIGHLY REDUNDANT representation
• Remove much of the redundancy of surface data
• Make discrimination computations
far more tractable
• 3D PCA 24 Glocks, 720 simulated and real primer shear
profiles:
• How many PCs should we
use to represent the data??
•
•
•
~47% variance retained
No unique answer
FIRST we need an algorithm
to I.D. a toolmark to a tool
Support Vector Machines
• Support Vector Machines (SVM) determine
efficient association rules
• In the absence of any knowledge of probability
densities
• For small to medium sized data sets
Gaussian LDA decision boundary
SVM decision boundary
PCA-SVM
• How many PCs should we use?
• Gather training set
• Use systematically increase PC dimension and
check Hold-one-out Cross-Validation with SVM
With 7 PCs, expect ~3% error rate
Error Rate Estimation
• Cross-Validation: systematically hold-out
chunks of data set for testing
• Most common: Hold-one-out
• Bootstrap: Make up data sets with randomly
selected observation vectors (with replacement)
• Repeat thousands of times
• Can yield confidence intervals around error rate
estimate
• The Best: Small training set, BIG test set
18D PCA-SVM Primer Shear I.D. Model, 2000 Bootstrap Resamples
Refined bootstrapped I.D. error rate for primer shear striation patterns= 0.35%
95% C.I. = [0%, 0.83%]
(sample size = 720 real and simulated profiles)
How good of a “match” is it?
Conformal Prediction
Cumulative # of Errors
• Can give a judge or jury an easy to understand measure of
reliability of classification result
• Confidence on a scale of 0%-100%
• This is an orthodox “frequentist”
approach
• You choose the confidence level
• A Set of possible I.D. obtained
• Varying confidence changes
the size of the sets
80% confidence
20% error
Slope = 0.2
95% confidence
5% error
Slope = 0.05
99% confidence
1% error
Slope = 0.01
Sequence of Unk Obs Vects
Empirical Bayes’
• Bayes’ Rule: can we realistically estimate posterior
error probabilities empirically/falsifiably??
• Given algorithm indicates a toolmark is associated with a
tool, what is the probability it is truly not a “match”?
• Probability that there is another tool in the pop. that can
generate a toolmark which the algorithm will match to
the wrong tool
Probability of no actual association
given a test/algorithm indicates a
positive ID
(
Pr S | t
-
+
)=
(
Pr t + | S-
( )
Pr t +
) Pr (S )
-
Empirical Bayes’
• Erfon’s machinery for “empirical Bayes’ two-groups
model”Efron 2007
• Surprisingly simple!
• S-, truly no association, Null hypothesis
• S+, truly an association, Non-null hypothesis
• z, a Gaussian random variate derived from a machine learning task
to ID an unknown pattern with a group
• Scheme yields estimate of Pr(S-|z) along with it’s standard
error
• Called the local false discovery rate (fdr) or posterior error
probability
• Given a similarity score, fdr is an estimate of the probability that
the computer is wrong in calling a “match”
• Catch: you need A LOT of z and they should be fairly
independent and Pr(S-) > 0.9
Empirical Bayes’
• Try Erfon’s machinery for “empirical Bayes’ two-groups
model”
• Gives a calibrated “no-match posterior probability” model
The SVM alg got these
Primer shear IDs wrong
Empirical Bayes’
• Model’s use with crime scene “unknowns”:
This is the est. post.
prob. of no association
= 0.00027 = 0.027%
This is an uncertainty
in the estimate
Computer outputs “match” for:
unknown crime scene toolmarks-with knowns from “Bob the burglar” tools
Future Directions
• All feature vectors must be aligned and of the same length,
• Use CCF to align within and between guns.
• Padded any overhang with uniform random noise
Glock 1
Glock 23
A real pain for 1D, not to mention 2D!!
Future Directions
• Try invariant feature extraction from digital imaging:
• Legendre moments:
• Translationally invariant
• Trying for 2D striation patterns
• Zernike moments:
• Rotationally and translationally invariant but more expensive.
• Trying for 2D impression patterns
Future Directions
• Extend ImageJ Surface Metrology functionality
• DigitalSurf Mountains map is EXPEN$IVE!!
• DWT and MODWT modules for
• De-noising
• Form-removal
• Surface Filtering
• GP-GPU implementation of computationally intensive routines
• This WILL work in Java with JOCL bindings
• Need big surfaces to make copying overhead worthwhile
Acknowledgements
•
Research Team:
• Mr. Peter Diaczuk
• Ms. Carol Gambino
• Dr. James Hamby
• Dr. Brooke Kammrath
• Dr. Thomas Kubic
• Mr. Chris Lucky
• Off. Patrick McLaughlin
• Dr. Linton Mohammed
• Mr. Jerry Petillo
• Mr. Nicholas Petraco
• Dr. Graham Rankin
• Dr. Jacqueline Speir
• Dr. Peter Shenkin
• Mr. Peter Tytell
• Helen Chan
• Julie Cohen
• Aurora Dimitrova
• Frani Kammerman
•
•
•
•
•
Loretta Kuo
Dale Purcel
Stephanie Pollut
Chris Singh
Melodie Yu
Website Information and Reprints/Preprints:
toolmarkstatistics.no-ip.org/
[email protected]
[email protected]