Martin & Zamar - Robust Statistics

ROBUST STATISTICS
R. Douglas Martin* and Ruben H. Zamar**
*Professor of Statistics, Univ. of Washington
**Professor of Statistics, Univ. of British Columbia
Key Reference Books
• Huber, P.J. (1981). Robust Statistics, Wiley
• Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J.,
and Stahel, W.A. (1986). Robust Statistics, The
Approach Based on Influence Functions, Wiley.
• Rousseeuw, P.J. and Leroy, A.M. (1987). Robust
Regression and Outlier Detection, Wiley.
J. W. Tukey (1979)
“… just which robust/resistant methods
you use is not important – what is
important is that you use some. It is
perfectly proper to use both classical and
robust/resistant methods routinely, and
only worry when they differ enough to
matter. But when they differ, you should
think hard.”
J. W. Tukey
“Statistics is a science in my opinion, and it is
no more a branch of mathematics than are
physics, chemistry and economics; for if its
methods fail the test of experience – not the
test of logic – they will be discarded”
Recommended reading:
Annals of Statistics Tukey Memorial Volume (Fall, 2002)
“John Tukey’s Contributions to Robust Statistics” (P. J. Huber)
“The Life and Professional Contributions of J. W. Tukey” (D. R. Brillinger)
OUTLINE
1. DATA-ORIENTED INTRODUCTION
2. LOCATION AND SCALE ESTIMATES
3. BASIC ROBUSTNESS CONCEPTS
4. ROBUST REGRESSION
5. ROBUST MULTIVARIATE LOCATION AND SCATTER
INTRODUCTION
1. Outlier Examples
2. Classical Parameter Estimates are Not Robust
3. Classical Statistical Inference is Not Robust
4. Data-Oriented Robustness and Examples
5. Simple Robust Location and Scale Estimates
6. Simple Robust Estimates Have Bounded EIF’s
7. Outlier Mining One Dimension at a Time
OUTLIERS
– Outliers are atypical observations that are
“well” separated from the bulk of the data
• In isolation or in small clusters
Dimensionality context
• 1-D (relatively easy to detect)
• 2-D (harder to detect)
• Higher-D (very hard to detect)
• Time Series (special challenges)
Classical Statistics
• PARAMETER ESTIMATES (“Point” Estimates)
– Sample mean and sample standard deviation
– Sample correlation and covariance estimates
– Linear least squares model fits
– Gaussian maximum likelihood
• STATISTICAL INFERENCE
– t-statistic and t-interval for an unknown mean
– Standard errors and t-values for regression coefficients
– F-tests for regression model hypotheses
– AIC, BIC, Cp model selection statistics
CLASSICAL STATS ARE NOT ROBUST
Outliers have “unbounded influence” on classical
statistics, resulting in:
• Inaccurate parameter estimates and predictions
• Inaccurate statistical inference
– Standard errors are too large
– Confidence intervals are too wide
– t-statistics lack power
– AIC, BIC, Cp result in wrong models
• Unreliable outlier detection
EMPIRICAL INFLUENCE FUNCTION
$\mathbf{x} = (x_1, x_2, \ldots, x_n)$
$x$ = an additional data point
$EIF(x; T, \mathbf{x}) = (n + 1)\,\bigl[\, T(\mathbf{x}, x) - T(\mathbf{x}) \,\bigr]$
• The factor $(n + 1)$ is a normalization across sample size
• Measures the influence of an additional point $x$ on $T$
CLASSICAL ESTIMATES HAVE UNBOUNDED EIF
Sample Mean: $EIF(x; \text{mean}, \mathbf{x}) = x - \bar{x}$
[Figure: EIF of the sample mean; x-axis: x (-4 to 4), y-axis: eif (-4 to 4)]
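The definition of the empirical influence function can be tried out directly. The following is a minimal sketch, not code from the slides: the helper name `eif` and the five-point sample are assumptions made for illustration. It evaluates the influence of one added point on the sample mean and on the sample median.

```python
from statistics import mean, median

def eif(estimator, sample, x):
    """Empirical influence of one additional point x on estimator(sample)."""
    n = len(sample)
    return (n + 1) * (estimator(sample + [x]) - estimator(sample))

# Hypothetical sample, chosen only for illustration.
sample = [1.0, 2.0, 3.0, 4.0, 5.0]

# Move the additional point further and further away from the bulk.
for x in (10.0, 100.0, 1e6):
    print(x, eif(mean, sample, x), eif(median, sample, x))
```

For the mean the printed influence grows with x (it equals x minus the sample mean, up to rounding), while for the median it stays at the same value for all three choices of x, which is what a bounded EIF looks like on this sample.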
RESISTANCE (J.W. Tukey’s term)
• A Fundamental Continuity Concept
- Small changes in the data result in only small changes in the estimate
- “Change a few, so what” J.W. Tukey (Seattle, 1977)
• “Small Changes” Generalization
- Small changes in all the data (e.g., rounding errors)
- Large changes in a small fraction of the data (a few outliers)
• Valuable Consequence
- A good fit to the bulk of the data
- Reliable, automatic outlier detection
1-D Outliers: Stock Returns
Outliers represent locally large losses/gains
Sometimes you must process thousands of such series
You need to detect the outliers automatically!
[Figure: density plot; x-axis: nobeled (-1 to 3), y-axis: Density (0.0 to 1.2)]
1-D Outliers: Density of Earth
Density of Earth Relative to Density of Water
Cavendish, 1798, measurements.
Because of the low outlier, the median 5.46 is a better estimate of Earth density than the mean 5.42
[Figure: histogram; x-axis: Density (4.0 to 6.0), y-axis: 0 to 8; the low value is marked "Outlier"]
2-D Outliers: Predicting EPS
INVENSYS ANNUAL EPS VERSUS TIME
You have to predict 2001 EPS!
You have many of these, e.g., hundreds!
[Figure: x-axis: YEAR (1985 to 2000), y-axis: EARNINGS PER SHARE (-0.05 to 0.15)]
2-D Outliers: Main Gain Data
TELEPHONE GAIN VS. DIFFERENCE IN NEW HOUSING STARTS
[Figure: scatterplot; x-axis: diff.hstarts (-0.85 to 0.65), y-axis: tel.gain (1.0 to 2.0)]
5-D Outliers: Woodmod Data
A group of 4 outliers shows up in the plots of V1 vs V2 and V4 vs V5
Corr(V1,V2) = -0.15
RobCorr(V1,V2) = 0.75
[Figure: scatterplot matrix of the variables V1, V2, V3, V4 and V5]
LUNATICS IN MASSACHUSETTS
Population densities in Suffolk and Essex are much larger than those in the other counties
Correlation = -0.64
Robust Correlation = -0.97
[Figure: scatterplot; x-axis: Population Density (0 to 3000), y-axis: Percentage Treated at Home (20 to 80); SUFFOLK and ESSEX are labeled]
LUNATICS IN MASSACHUSETTS
(Continued)
Plot with Suffolk and Essex removed
Now Nantucket shows up as an outlier
Correlation = -0.84
Robust Correlation = -0.93
[Figure: scatterplot; x-axis: Population Density (50 to 200), y-axis: Percentage Treated at Home (30 to 80); NANTUCKET is labeled]
LUNATICS IN MASSACHUSETTS
(Continued)
Plot with Suffolk, Essex and Nantucket removed
Now the data show a clear decreasing trend, with smaller percentages in more populated counties
Correlation = -0.97
Robust Correlation = -0.97
[Figure: scatterplot; x-axis: Population Density (50 to 200), y-axis: Percentage Treated at Home (50 to 80)]
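The sensitivity of the classical correlation to a couple of extreme points can be reproduced on a toy example. The sketch below uses hypothetical data, not the Massachusetts data, and the helper `pearson` is written out here for illustration. The slides do not say which robust correlation estimate they use, so the sketch only compares the classical coefficient with and without the two extreme points.

```python
from math import sqrt

def pearson(xs, ys):
    """Classical sample correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical bulk: 16 points lying exactly on a decreasing line.
x_bulk = [50.0 + 10.0 * k for k in range(16)]
y_bulk = [75.0 - 1.0 * k for k in range(16)]

# Two hypothetical points with much larger x that do not follow the line.
x_all = x_bulk + [2000.0, 3000.0]
y_all = y_bulk + [50.0, 60.0]

r_bulk = pearson(x_bulk, y_bulk)
r_all = pearson(x_all, y_all)
print("all points:", round(r_all, 2), " bulk only:", round(r_bulk, 2))
```

Two points out of eighteen are enough to pull the classical coefficient well away from the value that describes the bulk of the data.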
Time Series with Outliers and Level Shifts
TOBACCO AND RELATED SALES IN THE UK
Key aspects of consumer behavior
Need to detect outliers and level shifts as important, distinct events
Automate the detection of key changes in a few out of many thousands of customers.
[Figure: time series; x-axis: TIME (1955 to 1960), y-axis: TOBACCO SALES (700 to 1000); an "Outlier" and "Level Shifts" are marked]
Gene Expression Data
Microarray experiments are typically used to identify differentially expressed genes.
DNA probes printed on a glass slide are hybridized to two RNA samples separately labeled with two fluorescent dyes.
The hybridization intensity values obtained after slide scanning are calculated using image analysis and then used to identify differentially expressed genes.
Three Principal Stages of the Technology
• Array fabrication (PCR amplification and clone preparation, reaction clean-up, array printing)
• Probe preparation (mRNA extraction, mRNA labeling, probe labeling and purification) and hybridization
• Slide scanning and image processing (gridding, segmentation, intensity extraction)
Gene Expression Data
(continued)
Each of the above-mentioned stages may generate several sources of random variation and of systematic error.
For example:
• The first stage involves variation in the quantity of probe at a spot and in the hybridization efficiency of the probes with their counterparts (mRNA targets).
• The second stage includes variation in the quantity of mRNA in a sample applied to the slide and variation in the amount of target hybridized to the probe.
• The third stage is subject to variation in optical measurements and in the fluorescent intensities computed from the scanned image.
Gene Expression Data
(continued)
Different substances can be used to increase or dampen the level of expression of a gene.
Hughes et al. (2000), “Functional Discovery via a Compendium of Expression Profiles”, Cell 102: 109-126,
considered 6068 genes and ten different substances abbreviated as:
cin, cup, spf, vma, fre, yap, mac, yer, sod and ymr
Gene Expression Data
(continued)
The sample exposed to the substance (treatment sample) was labeled “green”.
The other sample (control sample) was labeled “red”.
The normalized green intensity of gene “i” in sample “j” is denoted by
$X_{ij}$, $i = 1, \ldots, 6068$, $j = 1, \ldots, 10$
The normalized red intensity of gene “i” in sample “j” is denoted by
$Y_{ij}$, $i = 1, \ldots, 6068$, $j = 1, \ldots, 10$
Gene Expression Data
(continued)
We will examine the differences between normalized gene expression intensities
$Z_{ij} = Y_{ij} - X_{ij}$, $i = 1, \ldots, 6068$, $j = 1, \ldots, 10$
The expression levels for most genes are similar. Those will appear as “normal data” in the boxplots.
There are some genes for which the difference in intensity is large.
Those are the genes that are likely to be over- or under-expressed in the “treatment” samples.
Gene Expression Data
GENE EXPRESSION DIFFERENCES FOR TEN SAMPLES (LOG-SCALE)
Red - Green intensity levels for ten samples
Similar intensity levels for most genes
Outliers may correspond to over / under expressed genes
[Figure: boxplots for the ten samples cin, cup, fre, mac, sod, spf, vma, yap, yer, ymr; y-axis from -6 to 6]
NORMALIZED MEAN-MEDIAN DIFFERENCE
Sample   Median    Mean    Difference (Normalized)
CIN       0.007    0.001    0.34
CUP       0.013   -0.028    2.61
FRE       0.003    0.012   -0.53
MAC       0.000   -0.007    0.45
SOD       0.003    0.002    0.08
SPF       0.013   -0.012    1.60
VMA       0.003   -0.026    1.83
YAP       0.010   -0.010    1.29
YER       0.003    0.002    0.09
YMR       0.000   -0.003    0.20
Diff = (Med-Mean)/SE(Med)
In several cases (red rows in the table) the mean and median have different signs.
Differences are relatively small.
The positive and negative outliers balance each other, limiting their overall effect on the mean.
NORMALIZED SD - MAD DIFFERENCE
Sample    MAD     S.D.
CIN      0.113   0.181
CUP      0.207   0.367
FRE      0.163   0.212
MAC      0.128   0.223
SOD      0.197   0.280
SPF      0.207   0.275
VMA      0.202   0.332
YAP      0.148   0.310
YER      0.069   0.086
YMR      0.113   0.224
Normalized Difference values, in the order extracted (their row assignment is not preserved in this transcript): 4.28, 10.08, 3.13, 5.96, 10.19, 1.05, 6.98, 5.22, 4.27, 8.22
Diff = (SD-MAD)/SE(MAD)
The outliers have a bigger impact on the standard deviations.
Flagging outliers by using means and SD’s becomes more difficult.
Standard Deviation vs. MAD
SD = 1.45 x MAD
SD is approximately 50% larger than MAD across samples.
[Figure: scatterplot with fitted line; x-axis: MAD (0.08 to 0.20), y-axis: SD (0.10 to 0.35); the points cup, vma and yap are labeled]
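The gap between SD and MAD under contamination can be illustrated with simulated data. The following is a sketch under stated assumptions, not the slides' computation: the data are hypothetical, the helper `mad` is defined here, and the consistency factor 1.4826 (which makes the MAD estimate sigma for normal data) is an assumption, since the slides do not say whether their MAD is rescaled.

```python
import random
from statistics import median, stdev

def mad(values, scale=1.4826):
    """Median absolute deviation about the median, multiplied by `scale`."""
    center = median(values)
    return scale * median([abs(v - center) for v in values])

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data: 1000 standard normal values plus 50 gross outliers.
bulk = [rng.gauss(0.0, 1.0) for _ in range(1000)]
outliers = [8.0] * 25 + [-8.0] * 25
data = bulk + outliers

sd_clean, mad_clean = stdev(bulk), mad(bulk)
sd_all, mad_all = stdev(data), mad(data)
print("clean data:        SD = %.2f  MAD = %.2f" % (sd_clean, mad_clean))
print("with 50 outliers:  SD = %.2f  MAD = %.2f" % (sd_all, mad_all))
```

On the clean data both estimates are close to the true sigma of 1. After adding the outliers the SD is inflated substantially while the MAD changes only a little.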
Flagging Outliers
Suppose we have a set of numbers $Z_i$ ($i = 1, 2, \ldots, n$) such that most of them are independent normal random variables with mean $m$ and variance $\sigma^2$.
Suppose that a relatively small fraction of these numbers are expected to be different from the majority.
Flagging Outliers
(continued)
We need reliable and automatic ways for flagging outliers.
We may use the popular $3\sigma$ rule ($c = 3$): flag $Z_i$ when $|Z_i - m| > c\,\sigma$.
But a better approach (especially for large datasets) is to use the “c” determined by the equation
$P\left( \max_{1 \le i \le n} |Z_i - m| \le c\,\sigma \right) = 0.999$
to reduce the probability of flagging “wrong outliers”.
Flagging Outliers
(continued)
It is easy to verify that:
$c = \Phi^{-1}\left( \frac{\sqrt[n]{0.999} + 1}{2} \right)$
where $\Phi$ is the standard normal distribution function.
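One way to verify the expression for c is sketched below. It assumes that all n values are independent normal with mean $m$ and variance $\sigma^2$ and that the flagging threshold is $c\,\sigma$, with $\Phi$ denoting the standard normal distribution function.

```latex
\begin{aligned}
P\Bigl( \max_{1 \le i \le n} |Z_i - m| \le c\,\sigma \Bigr)
  &= \prod_{i=1}^{n} P\bigl( |Z_i - m| \le c\,\sigma \bigr)
   = \bigl[\, 2\,\Phi(c) - 1 \,\bigr]^{n} = 0.999 \\
\Longrightarrow \quad 2\,\Phi(c) - 1 &= 0.999^{1/n} \\
\Longrightarrow \quad c &= \Phi^{-1}\!\left( \frac{0.999^{1/n} + 1}{2} \right)
\end{aligned}
```

The first equality uses independence, and the second uses the symmetry of the normal distribution, $P(|Z_i - m| \le c\,\sigma) = \Phi(c) - \Phi(-c) = 2\,\Phi(c) - 1$.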
Flagging Outliers
(continued)
For the Gene-Expression data n = 6068 and so:
$c = \Phi^{-1}\left( \frac{\sqrt[6068]{0.999} + 1}{2} \right) = 5.24$
For such large datasets it is better to use $c = 5.24$ (a $5.24\sigma$ rule)
to reduce the probability of flagging “wrong genes”.
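The threshold can be computed numerically from the formula for c. This is a minimal sketch using only the Python standard library; the function name `flagging_threshold` and its `coverage` parameter are assumptions introduced here for illustration.

```python
from math import expm1, log
from statistics import NormalDist

def flagging_threshold(n, coverage=0.999):
    """Return c such that P(max_i |Z_i - m| <= c*sigma) = coverage
    for n independent normal values."""
    # Upper-tail probability 1 - Phi(c) = (1 - coverage**(1/n)) / 2.
    # expm1 avoids cancellation because coverage**(1/n) is very close to 1.
    tail = -expm1(log(coverage) / n) / 2.0
    # By symmetry, c is minus the lower quantile at that tail probability.
    return -NormalDist().inv_cdf(tail)

c = flagging_threshold(6068)
print("n = 6068: c = %.2f" % c)
```

For n = 6068 the result should agree, up to rounding, with the value 5.24 quoted above. Smaller n gives a smaller c, which is why the fixed choice c = 3 flags too many "wrong outliers" in large datasets.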
Flagging Outliers
(continued)
We can assume that, for each sample, $X_i = \text{Red}_i - \text{Green}_i$ are (approximately) independent normal with mean $m = 0$ and unknown variance $\sigma^2$.
Flagging Outliers
(continued)
Since sigma is unknown it must be estimated from the data
Robust estimate: MAD
Classical estimate: SD
Because of the outliers, the SD will systematically overestimate sigma
SAMPLE    SD     MAD
cin      0.18   0.11
cup      0.37   0.21
fre      0.21   0.16
mac      0.22   0.13
sod      0.28   0.20
spf      0.27   0.21
vma      0.33   0.20
yap      0.31   0.15
yer      0.09   0.07
ymr      0.22   0.11
Flagging Outliers
SAMPLE    SD     MAD    OUT(SD)   OUT(MAD)
cin      0.18   0.11       9        61
cup      0.37   0.21      22       102
fre      0.21   0.16       7        16
mac      0.22   0.13      23        73
sod      0.28   0.20      15        60
spf      0.27   0.21      20        50
vma      0.33   0.20      91        27
yap      0.31   0.15      28       114
yer      0.09   0.07       7        18
ymr      0.22   0.11      12        32
ymr has relatively few very large outliers, which drastically inflate the SD.
cup and yap have a large number of moderate outliers, which inflate the SD.
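The effect of the scale estimate on the number of flagged points can be sketched on simulated data. Everything here is an assumption made for illustration (the hypothetical data, the helper names, and the 1.4826 rescaling of the MAD). The rule applied is the one described above: flag a value when $|Z_i - m| > c\,\hat{\sigma}$ with $m = 0$.

```python
import random
from math import expm1, log
from statistics import NormalDist, median, stdev

def mad(values, scale=1.4826):
    """Median absolute deviation about the median, multiplied by `scale`."""
    center = median(values)
    return scale * median([abs(v - center) for v in values])

def flagging_threshold(n, coverage=0.999):
    """c such that P(max_i |Z_i - m| <= c*sigma) = coverage for n normals."""
    tail = -expm1(log(coverage) / n) / 2.0
    return -NormalDist().inv_cdf(tail)

def count_flagged(values, sigma_hat, c, m=0.0):
    """Number of values with |z - m| > c * sigma_hat."""
    return sum(1 for z in values if abs(z - m) > c * sigma_hat)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data: 1000 standard normal values plus 50 planted outliers.
data = [rng.gauss(0.0, 1.0) for _ in range(1000)] + [8.0] * 25 + [-8.0] * 25

c = flagging_threshold(len(data))
out_sd = count_flagged(data, stdev(data), c)   # classical scale estimate
out_mad = count_flagged(data, mad(data), c)    # robust scale estimate
print("OUT(SD) =", out_sd, " OUT(MAD) =", out_mad)
```

Because the planted outliers inflate the SD, the SD-based threshold ends up beyond them and they are masked, while the MAD-based threshold stays close to the scale of the bulk and flags them.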
“MAD – SD Outliers” vs. “R = SD/MAD”
ymr (right-bottom corner) appears as an outlier in this plot.
In this case there are relatively few large outliers which drastically inflate the Standard Deviation.
Robust Fit: Diff = -95 + 91 x R
LS Fit: Diff = -51 + 60 x R
[Figure: scatterplot with the ROBUST and LS fitted lines; x-axis: SD/MAD (1.4 to 2.0), y-axis: OUTLIERS(MAD)-OUTLIERS(SD) (20 to 80); the point ymr is labeled]
BEEF SALES IN USA (1925-1941)
Beef sales dropped sharply around 1930 and showed a steady increase from 1933 to 1941.
High levels of beef consumption in 1925-27 show up as outliers in the plot.
[Figure: time series; x-axis: YEAR (1925 to 1940), y-axis: CBE (46 to 60); three points are marked OUTLIER]