Introduction


Data Analysis and Statistics for Forensic Applications:
Everything you always wanted to know but were afraid to ask
FOS/MAT 705, Spring 2016
“Legal” Science
• Daubert is a benchmark!!!
• Daubert (1993): Judges are the “gatekeepers” of scientific evidence.
• They must determine whether the science is reliable:
• Has empirical testing been done? (Falsifiability)
• Has the science been subject to peer review?
• Are there known error rates?
• Is there general acceptance?
• The Federal Government and 26(-ish) states are “Daubert states”
Measurement and Randomness
• Any time an observation is made, one is making a “measurement”
• As all scientists know, almost no two measurements of the same quantity under the same conditions will agree exactly
1. Experimental error is inherent in every measurement
• Refers to the variation in observations between repetitions of the same experiment.
• It is unavoidable, and many sources contribute
2. “Error” in a statistical context is a technical term (Box, Hunter & Hunter)
Measurement and Randomness
• Experimental error is a form of randomness
• Randomness: inherent unpredictability in a process
• The outcomes of the process follow a probability distribution
• Statistical tools are used to both:
• Describe the randomness
• Make inferences taking into account the randomness
Probability
• Frequency: ratio of the number of observations of interest (n_i) to the total number of observations (N):

  frequency of observation i = n_i / N

• Probability (frequentist): the frequency of observation i in the limit of a very large number of observations
• This definition is falsifiable (i.e., testable)
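A quick way to see the frequentist definition in action is to simulate it. This is a minimal R sketch (the coin flips and sample sizes are made up for illustration): the frequency of heads approaches the probability 0.5 as N grows.

  set.seed(1)
  flips <- sample(c("H", "T"), size = 100000, replace = TRUE)  # fair coin flips
  for (N in c(10, 100, 1000, 100000)) {
    cat(N, ":", sum(flips[1:N] == "H") / N, "\n")  # frequency n_H / N
  }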
Probability
• Belief: a “Bayesian’s” interpretation of probability.
• An observation (outcome, event) is a “measure of the state of knowledge” (Jaynes).
• Bayesian probabilities reflect degree of belief and can be assigned to any statement
• Beliefs (probabilities) can be updated in light of new evidence (data) via Bayes’ theorem.
• This definition is not always falsifiable
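Bayes’ theorem itself is a one-line computation. Below is a minimal R sketch with made-up numbers (the prior and both likelihoods are assumptions chosen only for illustration): a prior belief P(A) is updated to a posterior P(A | E) after seeing evidence E.

  prior          <- 0.50  # P(A): assumed prior belief in proposition A
  p_E_given_A    <- 0.90  # P(E | A): assumed likelihood of the evidence if A holds
  p_E_given_notA <- 0.20  # P(E | not A): assumed likelihood otherwise
  posterior <- prior * p_E_given_A /
    (prior * p_E_given_A + (1 - prior) * p_E_given_notA)
  posterior  # P(A | E) is about 0.82: the belief has been updated by the evidence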
Probability
• Algorithmic Probability: How do you rigorously assign a probability to an observation on which you have no data?
• Solomonoff, Kolmogorov, Levin, and Chaitin came up with a way.
• The basic idea:
• Patterns which result from “computation” are relatively likely
• Patterns that cannot be produced by any computational process are relatively unlikely.
Probability
• Formally, a computable process produces an observation:
• A program executed on a theoretical computer (universal Turing machine) produces the observation as output.
• Algorithmic probability, P(observation):
• The probability that the output of a Turing machine is the observation when programs of “fair coin flips” are run
• A binary program of randomly drawn 1s and 0s, each with a probability of ½.

  P(obs.) = Σ_i 2^(−length(prog_i)),  summed over all programs prog_i such that U(prog_i) = obs.
Probability
• The definition of algorithmic probability can be made mathematically rigorous, but it does not yield exactly computable results.
• Approximation schemes exist, but so far they have been difficult to put into practice.
• The length of the shortest program producing obs. is called the observation’s Kolmogorov complexity.
What is Statistics??
• Study of relationships in data
• Descriptive Statistics – techniques to summarize data
• E.g. mean, median, mode, range, standard deviation, stem-and-leaf plots, histograms, box-and-whiskers plots, etc.
• Inferential Statistics – techniques to draw conclusions from a given data set taking into account inherent randomness
• E.g. confidence intervals, hypothesis testing, Bayes’ theorem, forecasting, etc.
Why do we use statistical tools?
• For the sciences, we ask:
• Are the differences in measurements characterizing two (or more) objects real, or just due to (the characteristic) randomness?
• Furthermore, for the forensic sciences we ask:
• Do two pieces of evidence originate from a common source?
• For this, we must at least answer the question above.
Population and Sample
• Almost all of statistics is based on a sample drawn from a population.
• Population: the totality of observations that might occur as a result of repeatedly performing an experiment
• Why not measure the whole population?
• Usually impossible
• Likely wasteful
• The population should be relevant.
• Part logic
• Part guess
• Part philosophy….
Population and Sample
• Sampling:
• Sample: a few observations made from a population
• Draw members out of a population with some given probability
• Random sample: all observations have an equal chance of being made, and no observation affects any other
• We want a random sample to be representative of the population
• The sample is biased if this is not the case
Data and Sampling
• Sample Representations:
[Diagram: a representative sample drawn from a population vs. biased samples that fail to cover the population]
Data and Sampling
• Types of sampling (a code sketch of all three follows this list):
• (Simple) Random Sampling
• Every data item is selected independently of every other.
• Every member of a population has an equal chance of being selected
• Systematic Sampling
• Pick every k-th data item to be in the sample
• Easier to conduct, but risks yielding a biased sample
• Stratified Sampling
• Partition the population into disjoint groups (strata) sharing specific attributes of a particular category
• Randomly sample from within the groups
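The sketch referred to above, in R, with a hypothetical population of 1000 numbered members (the strata and sample sizes here are made up for illustration):

  population <- 1:1000
  n <- 50
  # Simple random sampling: every member equally likely, drawn independently
  srs <- sample(population, n)
  # Systematic sampling: every k-th member, k = 1000 / 50 = 20
  systematic <- population[seq(1, length(population), by = 20)]
  # Stratified sampling: two hypothetical strata, sampled separately
  strata <- rep(c("A", "B"), c(600, 400))
  stratified <- c(sample(population[strata == "A"], 30),
                  sample(population[strata == "B"], 20))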
Data and Sampling
• Types of sampling (con’t):
• The Bootstrap (sketched in code below)
• Draw a sample from the population, preferably as large as possible; say the sample size is n
• Bootstrap sample: sample with replacement out of the original sample to build a new sample of size n
• Do this hundreds or thousands of times
• Use the statistics from each bootstrap sample to get an idea of the population variation
• Computationally intensive, but often works well compared to traditional methods
• Free of many traditional assumptions
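A minimal bootstrap sketch in R (the “original sample” here is simulated data, used only as a stand-in):

  set.seed(1)
  x <- rnorm(30, mean = 10, sd = 2)   # stand-in original sample, n = 30
  # Resample with replacement, same size n, many times; collect a statistic
  boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))
  sd(boot_means)                          # bootstrap standard error of the mean
  quantile(boot_means, c(0.025, 0.975))   # rough 95% interval for the mean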
Parameters and Statistics
• Parameter: any function of the population
• Statistic: any function of a sample from the population
• Statistics are used to estimate population parameters
• Statistics can be biased or unbiased
• The sample average is an unbiased estimator of the population mean (see the simulation sketch below)
• We may construct distributions for statistics
• Populations have distributions for observations
• Samples have distributions for observations and statistics
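The unbiasedness of the sample average can be checked by simulation. A sketch (the population here is a made-up normal with mean 50):

  set.seed(1)
  # Draw many small samples; the average of the sample means sits on the
  # population mean, even though any single sample mean scatters around it.
  sample_means <- replicate(10000, mean(rnorm(5, mean = 50, sd = 10)))
  mean(sample_means)  # close to the population mean of 50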
Univariate vs. Multivariate Statistics
• Univariate Statistics: statistical tools used to analyze one random variable
• The random variable could be a raw observation or a statistic
• Common tools: (univariate) hypothesis testing, ANOVA, linear regression
• Multivariate Statistics: statistical tools used to analyze many random variables
• The random variables can also be raw observations (often encountered in chemometrics) or statistics (currently popular in marketing, finance, surface metrology)
Why Use Multivariate Statistics?
• Don’t, if you can see clear differences/similarities in your data and can clearly articulate how in court!
• If you can’t differentiate, or want to study/search for differences within a well-defined population, AND univariate methods don’t do the trick:
• A linear (or non-linear) combination of many experimental variables (multivariate) may do the trick!
Some Important Terms
• Vector – a list of numbers or attributes characterizing an observation or experiment
• Vectors can be pictures!
[Figure: normalized intensities of mixture components represented as arrows]
Multivariate Feature Vectors
• Random variables – all measurements have an associated “randomness” component
• Randomness – patternless, unstructured, typical, total ignorance (Chaitin, Calude)
• For an experiment/observation, put many measurements together into a list
• A collection of random variables in a list is called a random vector
• Also called: observation vectors, feature vectors
[Figure label: Pick Features]
Example Feature Vector
o GC-MS instrument output for a gasoline:
[Figure: example gasoline chromatogram]
Brief Description of Data Sets
• Gasoline (gas) GC-MS for 20 casework samples
• Collected by Mark Gil, NYPD Crime Lab at the time
• 15 normalized peak areas characterize each chromatogram
• 3 to 7 replicates per sample
• 92 chromatograms
• ¼-in screwdriver striation marks for 9 screwdrivers (tool)
• Collected by Nicholas Petraco, Petraco Forensic Consulting and NYPD Crime Lab
Brief Description of Data Sets
• ¼-in screwdriver striation mark data (con’t)
• 140-D binary vectors characterize each striation pattern
• 6 to 9 replicate patterns per screwdriver
• 75 patterns total
• Also ~750 simulated patterns (toolsim)
Brief Description of Data Sets
• Neel and Wells Consecutive Matching Striae (CMS) studies
• AFTE J 39(3):176–198, 2007 (Part I)
• Enumerated CMS runs for a large set of known match (KM) and known non-match (KNM) comparisons
• 4188 comparisons
• Various toolmark sources (cf. p. 179, second column)
Brief Description of Data Sets
• LAM 2011 study
• Mohammed et al., “The dynamic character of disguise behaviour for text-based, mixed, and stylized signatures”
• J Forensic Sci 56(1), S136–S141 (2011)
• Variation of dynamic signature parameters with signing style and conditions
• Params: duration, size, velocity, jerk, and pen pressure
• Style: text-based, stylized, and mixed
• Conditions: genuine, disguised, and auto-simulation
• 90 writers: 10 genuine, 5 disguised, and 5 auto-simulated signatures each
• 1800 signatures total, collected using a digitizing tablet
Brief Description of Data Sets
• Glass data (glass) from 6 glass types
• Collected by the Forensic Science Service, UK
• “Famous” (infamous…) data set
• 9 variables: RI, Na, Mg, Al, Si, K, Ca, Ba, Fe
• 9–76 replicates per sample…
• 242 glass specimens total
• James Curran’s collected data sets in his R package dafs; see CRAN (a loading sketch follows below).
• Lots of stuff. Explore it.
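The loading sketch mentioned above, assuming the mlbench and dafs packages are installed from CRAN:

  library(mlbench)
  data(Glass)    # glass data: RI, Na, Mg, Al, Si, K, Ca, Ba, Fe, plus Type
  str(Glass)
  # James Curran's forensic data sets:
  # library(dafs)
  # data(package = "dafs")  # list the data sets the package provides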
Brief Description of Data Sets
• Dust data (dust) from 10 locations
• Collected by Nicholas Petraco, Petraco Forensic
Consulting and NYPD Crime Lab
• 342 variables to characterize each sample
• 1/0 = present/absent in sample
• 3 replicates per sample
• 30 dust specimens total
Brief Description of Data Sets
• Typewritten letters in different fonts data (lett)
• Benchmark machine learning test set
• 17 variables: all easy to obtain
• Lots of replicates per letter
• 20,000 examples total
• Handwritten digits (0–9) from US ZIP codes (zip)
• Benchmark set from USPS
• 256 variables: digitized/normalized grey levels
• Lots of replicates per number
• 9298 examples total
First Thing: Look at your Data
• Data frame (data matrix): n observation vectors (rows) by p variables (columns)

          | x_{1,1}  x_{1,2}  x_{1,3}  …  x_{1,p} |
          | x_{2,1}  x_{2,2}                      |
    X  =  | x_{3,1}            ⋱                  |
          |    ⋮                                  |
          | x_{n,1}      …             x_{n,p}   |
Part of data frame for composition of gasoline:

      ID  Ethylbenzene  m.p.Xylene  o.Xylene   Propylbenzene…
  1    1    0.3738972    1.5189473  0.509374    0.17058569
  2    1    0.3821145    1.4975333  0.4869311   0.15266414
  3    1    0.3910006    1.5735967  0.5140853   0.17097528
  4    1    0.3592879    1.0931521  0.469633    0.14356119
  5    1    0.379583     1.4976838  0.5004265   0.16732263
  6    1    0.3824838    1.5347461  0.5003289   0.15989651
  7    1    0.3932254    1.5370547  0.5191838   0.1693152
  8    2    0.1697284    0.7243938  0.2739452   0.07111785
  9    2    0.1730064    0.7494535  0.2791126   0.07370284
  10   2    0.1587106    0.684664   0.2484791   0.06270977
  11   2    0.1668295    0.6983527  0.2586032   0.06568599
  12   2    0.1655228    0.7036451  0.2689125   0.07029099
  13   2    0.1645599    0.6938837  0.2546212   0.0656616
  14   2    0.1544826    0.6472038  0.2439379   0.06100052
  15   3    0.1575096    0.5890765  0.2220248   0.0556264
  16   3    0.1610904    0.6069997  0.2319318   0.05969297
  17   3    0.1535362    0.5693021  0.2268586   0.06187798
  18   3    0.1532304    0.5735848  0.2147113   0.0535368
  19   3    0.1664476    0.6248774  0.2396432   0.06084041
First Thing: Look at your Data
o Explore the Glass dataset of the mlbench package
• Scatter plots: plot any two variables against each other
[Scatter plot: Na vs. RI for the Glass data]
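A sketch of this scatter plot in base R, using the Glass data loaded from mlbench as above:

  plot(Na ~ RI, data = Glass)  # any two variables plotted against each other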
First Thing: Look at your Data
• Pairs plots: do many scatter plots at once
[Pairs plot of the Glass variables Si, K, and Ca]
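A sketch of a pairs plot for a few of the Glass variables:

  pairs(Glass[, c("Si", "K", "Ca")])  # every pairwise scatter plot at once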
First Thing: Look at your Data
• Histograms: “bin” a variable and plot frequencies
[Histogram of RI; y-axis: percent of total]
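A sketch with the lattice package (whose histogram() usually plots percent of total on the y-axis by default):

  library(lattice)
  histogram(~ RI, data = Glass)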
First Thing: Look at your Data
• Histograms conditioned on other variables: use the lattice package
[Histograms of RI conditioned on glass group membership, one panel per group]
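A sketch of the conditioned histograms, one panel per glass type:

  histogram(~ RI | Type, data = Glass)  # condition on group membership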
First Thing: Look at your Data
• Probability density plots: also need lattice
[Density plot of RI]
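A sketch of the density plot with lattice:

  densityplot(~ RI, data = Glass, plot.points = FALSE)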
First Thing: Look at your Data
• Empirical Probability Distribution plots: also called the empirical cumulative distribution function
[Empirical CDF of RI]
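A sketch of the empirical CDF in base R:

  plot(ecdf(Glass$RI), xlab = "RI", ylab = "Empirical CDF", main = "")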
First Thing: Look at your Data
• Box and Whiskers plots:
[Annotated box-and-whiskers plot of RI: the box spans the 1st quartile (25th percentile) to the 3rd quartile (75th percentile), with a line at the median (50th percentile); the whiskers show the range; points beyond the whiskers are possible outliers]
Visualizing Data
• Note the relationship:
[Figure relating the box plot to the underlying distribution]
First Thing: Look at your Data
• Box and Whiskers plots:
[Side-by-side panels: box-whiskers plots of the actual variable values and of the scaled variable values for Al, Ba, Ca, Fe, K, Mg, Na, RI, and Si]
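A sketch of the two panels in base R; scale() standardizes the columns for the second plot:

  boxplot(Glass[, 1:9], main = "Actual variable values")
  boxplot(as.data.frame(scale(Glass[, 1:9])), main = "Scaled variable values")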
“Variability”
• Variance: a measure of the variability of an experimental quantity
• “Spread” of measurements about the average, s²; for roughly normal data:
• ~68% of measurements fall within ±1s
• 95% within ±2s
• ~99% within ±3s
Measures of Data Spread
• Sample variance:
• (Almost) the average of the squared deviations from the sample mean, over n data points x_i with sample mean x̄:

  s² = (1 / (n − 1)) Σ_{i=1}^{n} (x_i − x̄)²

• The standard deviation is s = √s²
• The sample mean and standard dev. are the most common measures of central tendency and spread
• The sample mean and standard dev. have the same units
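A sketch checking the formula in R on a handful of made-up RI readings:

  x <- c(1.5188, 1.5190, 1.5189, 1.5192, 1.5191)  # hypothetical measurements
  var(x)                                  # built-in sample variance
  sum((x - mean(x))^2) / (length(x) - 1)  # the same value, from the formula
  sd(x)                                   # standard deviation, s = sqrt(s^2)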
“Variability”
o Covariance: variability of two measured quantities with each other, where x_{k,i} is observation #k of variable #i and x̄_i is the average of variable #i:

  s_{i,j} = (1 / (n − 1)) Σ_{k=1}^{n} (x_{k,i} − x̄_i)(x_{k,j} − x̄_j)

• As one quantity increases, the other increases: s_{i,j} positive
• As one quantity increases, the other decreases: s_{i,j} negative
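A sketch of the sample covariance on made-up paired measurements:

  x <- c(0.37, 0.38, 0.39, 0.36, 0.38)  # hypothetical variable i
  y <- c(1.52, 1.50, 1.57, 1.09, 1.50)  # hypothetical variable j
  cov(x, y)                                             # built-in
  sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)  # same, from the formula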
Basic Data Pre-processing
• Mean Centering: subtract the mean of each column in X
• Puts the origin at the data’s “center of mass”
• Variance Scaling: divide each column by its standard deviation
• Weights each column (variable) to have the same importance
• All variables will have the same variance = 1
• Autoscaling: mean center and variance scale X
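A sketch of the three pre-processing steps using scale(), with the numeric Glass columns as stand-in data (note that scale() with center = FALSE divides by the root mean square rather than the standard deviation, hence the explicit apply()):

  X <- Glass[, 1:9]
  X_centered   <- scale(X, center = TRUE, scale = FALSE)            # mean centering
  X_varscaled  <- scale(X, center = FALSE, scale = apply(X, 2, sd)) # variance scaling
  X_autoscaled <- scale(X, center = TRUE, scale = TRUE)             # autoscaling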
Basic Data Pre-processing
• Some Gasoline Data:
[Two scatter plots of C4.alkylbenzene.unid.2 vs. o.Xylene for the 20 gasoline samples (lbl.gas 1–20): raw peak areas (left) and mean-centered data (right)]
Basic Data Pre-processing
• Some Gasoline Data:
[Two scatter plots of C4.alkylbenzene.unid.2 vs. o.Xylene for the 20 gasoline samples: raw peak areas (o.Xylene Var = 0.002, C4.alkylbenzene.unid.2 Var = 0.014) and variance-scaled data (both Var = 1)]
Basic Data Pre-processing
• Some Gasoline Data:
[Scatter plot of the autoscaled data for the 20 gasoline samples: C4.alkylbenzene.unid.2 (Var = 1) vs. o.Xylene (Var = 1), centered at the origin]