Chap10: SUMMARIZING DATA
10.1: INTRO: Univariate/multivariate data (random
samples or batches) can be described using
procedures that reveal their structure via graphical
displays (Empirical CDFs, Histograms,…), which are to
Data what PMFs and PDFs are to Random Variables.
Numerical summaries (location and spread measures),
the effects of outliers on these measures, and
graphical summaries (Boxplots) will be investigated.
10.2: CDF based methods
10.2.1: The Empirical CDF (ecdf)
The ECDF is the data analogue of the CDF of a
random variable.
$$F_n(x) = \frac{\#\{x_i \le x\}}{n}, \quad\text{where } x_1, \dots, x_n \text{ is a batch of numbers}$$
The ECDF is a graphical display that conveniently
summarizes data sets.
$$F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I_{(-\infty,x]}(X_i), \quad\text{where } X_1, \dots, X_n \text{ is a random sample}$$
and
$$I_A(t) = \begin{cases} 1, & t \in A \\ 0, & t \notin A \end{cases}$$
is an indicator function.
The Empirical CDF (cont’d)
The random variables $I_{(-\infty,x]}(X_i)$ are independent
Bernoulli random variables:
$$I_{(-\infty,x]}(X_i) = \begin{cases} 1 & \text{with probability } F(x) \\ 0 & \text{with probability } 1 - F(x) \end{cases}$$
$$nF_n(x) = \sum_{i=1}^{n} I_{(-\infty,x]}(X_i) \sim \mathrm{Bin}\big(n, F(x)\big)$$
$$E[F_n(x)] = F(x) \quad\text{and}\quad \mathrm{Var}[F_n(x)] = \frac{F(x)[1 - F(x)]}{n}$$
$F_n$ is an unbiased estimate of F.
$\lim_{n\to\infty} \mathrm{Var}[F_n(x)] = 0$, and $F_n$ has maximum variance at the median of F.
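The definition of the ECDF and its variance can be sketched in a few lines of Python. This is only an illustration: the names `ecdf` and `ecdf_variance` and the small batch of numbers are made up for the example, and the variance helper needs the true value F(x), which is normally unknown.

```python
from bisect import bisect_right

def ecdf(data):
    """Return F_n, the empirical CDF of a batch of numbers."""
    xs = sorted(data)
    n = len(xs)

    def F_n(x):
        # bisect_right counts how many observations satisfy x_i <= x
        return bisect_right(xs, x) / n

    return F_n

def ecdf_variance(F_x, n):
    """Var[F_n(x)] = F(x)[1 - F(x)] / n, given the true value F(x)."""
    return F_x * (1.0 - F_x) / n

# Hypothetical batch of numbers, for illustration only
batch = [3.1, 0.4, 2.2, 5.0, 2.2]
F_n = ecdf(batch)
print(F_n(2.2))                 # proportion of observations <= 2.2
print(ecdf_variance(0.5, 10))   # largest variance occurs where F(x) = 1/2
```

Since F(x)[1 - F(x)] is largest when F(x) = 1/2, the helper also makes the "maximum variance at the median" remark easy to check numerically.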
10.2.2: The Survival Function
In medical or reliability studies, data sometimes
consist of times of failure or death; it is then
more convenient to use the survival function
$$S(t) = 1 - F(t), \quad\text{where } T \text{ is a random variable with cdf } F,$$
rather than the CDF. The sample survival function
(ESF) gives the proportion of the data greater
than t and is given by:
$$S_n(t) = 1 - F_n(t)$$
Survival plots (plots of ESF) may be used to
provide information about the hazard function, which
may be thought of as the instantaneous rate of
mortality for an individual alive at time t and is
defined to be:
$$h(t) = \frac{f(t)}{1 - F(t)} = -\frac{d}{dt}\log[1 - F(t)] = -\frac{d}{dt}\log S(t)$$
The Survival Function (cont’d)
From page 149, the first-order method gives:
$$\mathrm{Var}[g(X)] \approx \mathrm{Var}(X)\,[g'(\mu_X)]^2$$
$$\mathrm{Var}\{\log[1 - F_n(t)]\} \approx \mathrm{Var}[F_n(t)] \cdot \left[\frac{1}{1 - F(t)}\right]^2 = \frac{1}{n}\,\frac{F(t)}{1 - F(t)}$$
which shows how extremely unreliable (huge
variance for large values of t) the empirical log-survival function is.
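A small Python sketch may help make the last point concrete. It computes the empirical survival function and the first-order variance approximation F(t) / (n[1 - F(t)]). The function names and the toy numbers are assumptions made for this example, and the variance helper takes the true F(t) as an input.

```python
def empirical_survival(data):
    """Return S_n, the proportion of the data greater than t."""
    n = len(data)

    def S_n(t):
        return sum(1 for x in data if x > t) / n

    return S_n

def log_survival_variance(F_t, n):
    """First-order approximation of Var{log[1 - F_n(t)]}: F(t) / (n [1 - F(t)])."""
    return F_t / (n * (1.0 - F_t))

# Hypothetical failure times, for illustration only
S_n = empirical_survival([1.0, 2.0, 3.0, 4.0])
print(S_n(2.0))

# The approximate variance grows without bound as F(t) approaches 1 (large t)
for F_t in (0.5, 0.9, 0.99):
    print(F_t, log_survival_variance(F_t, n=100))
```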
10.2.3: Q-Q Plots (quantile-quantile plots)
Useful for comparing CDFs by plotting quantiles of
one dist’n versus quantiles of another dist’n.
Additive treatment effect: X ~ F vs Y ~ G
$$y_p = x_p + h \;\Rightarrow\; G(y) = F(y - h)$$
Q-Q plot is a line with slope 1 and y-intercept h.
Multiplicative treatment effect: X ~ F vs Y ~ G
$$y_p = c\,x_p \;\Rightarrow\; G(y) = F\left(\frac{y}{c}\right) \text{ for } c > 0$$
Q-Q plot is a line with slope c and y-intercept 0.
10.2.3: Q-Q plots
Q-Q plot is useful in comparing CDFs as it plots the
quantiles of one dist’n versus the quantiles of the
other dist’n.
Additive model:
X ~ F and Y ~ G; when $Y = X + h$, $G(y) = F(y - h)$.
Q-Q plot is a straight line with slope 1 and y-intercept h.
Multiplicative model:
X ~ F and Y ~ G; when $Y = cX$ (c > 0), $G(y) = F\left(\frac{y}{c}\right)$.
Q-Q plot is a straight line with slope c and y-intercept 0.
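The two models can be checked numerically with a short Python sketch. It pairs the order statistics of two equal-sized batches (a simple way to form Q-Q points, assumed here for convenience) and fits a least-squares line through them. The helper names and the toy data are hypothetical.

```python
def qq_points(xs, ys):
    """Pair the order statistics of two equal-sized batches."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return list(zip(sorted(xs), sorted(ys)))

def fit_line(points):
    """Least-squares slope and intercept of the line through the Q-Q points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Toy data: an additive shift h = 3 and a multiplicative effect c = 2.5
x = [2.0, 9.0, 4.0, 7.0, 1.0]
additive = [xi + 3.0 for xi in x]
multiplicative = [2.5 * xi for xi in x]

print(fit_line(qq_points(x, additive)))        # slope near 1, intercept near h
print(fit_line(qq_points(x, multiplicative)))  # slope near c, intercept near 0
```

With real samples the points only scatter around such a line, so the fitted slope and intercept are rough estimates of c and h rather than exact values.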
10.3: Histograms, Density curves
& Stem-and-Leaf Plots
Kernel PDF estimate:
$X_1, \dots, X_n$ iid ~ f (PDF)
The estimating function of f is $f_h$ (kernel PDF estimate).
Let $w_h(x) = \frac{1}{h}\,w\left(\frac{x}{h}\right)$ be a smooth weight function.
Then
$$f_h(x) = \frac{1}{n}\sum_{i=1}^{n} w_h(x - X_i) = \frac{1}{nh}\sum_{i=1}^{n} w\left(\frac{x - X_i}{h}\right)$$
h = bandwidth parameter that controls the smoothness of $f_h$
h corresponds to the bin width of the histogram.
Choose a 'reasonable' h: not too big, not too small!
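The kernel estimate can be written almost directly from the formula. The sketch below assumes a standard normal density as the weight function w (the slides do not fix a particular w), and the function names and data are illustrative only.

```python
import math

def gaussian_weight(u):
    """Standard normal density, used here as an assumed smooth weight function w."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_density(data, h, w=gaussian_weight):
    """Return f_h(x) = (1 / (n h)) * sum_i w((x - X_i) / h)."""
    n = len(data)

    def f_h(x):
        return sum(w((x - xi) / h) for xi in data) / (n * h)

    return f_h

# Hypothetical sample; try several bandwidths to see the smoothing effect
sample = [1.0, 2.0, 4.0]
for h in (0.2, 0.5, 2.0):
    f_h = kernel_density(sample, h)
    print(h, f_h(2.0))
```

A small h gives a spiky estimate that follows each observation, while a large h smooths away the structure, which is the trade-off behind the "not too big, not too small" advice.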
10.4: Location Measures
10.4.1: The Arithmetic Mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is sensitive to
outliers (not robust).
10.4.2: The Median $\tilde{x}$ is a robust measure of location.
10.4.3: The Trimmed Mean is another robust location
measure ($0.1 \le \alpha \le 0.2$ is highly recommended).
Step 1: order the data set
Step 2: discard the lowest $\alpha \cdot 100\%$ and the highest $\alpha \cdot 100\%$
Step 3: take the arithmetic mean of the remaining data
Step 4: $$\bar{x}_\alpha = \frac{1}{n - 2[n\alpha]}\sum_{i=[n\alpha]+1}^{n-[n\alpha]} x_{(i)}$$ is the $\alpha \cdot 100\%$ trimmed mean
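The four steps above can be sketched in Python as follows. The function name `trimmed_mean` and the toy data are assumptions for the example, and `math.floor` plays the role of the bracket [nα].

```python
import math

def trimmed_mean(data, alpha):
    """100*alpha% trimmed mean: drop [n*alpha] points from each end, then average."""
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must lie in [0, 0.5)")
    xs = sorted(data)                  # Step 1: order the data
    n = len(xs)
    k = math.floor(n * alpha)          # [n*alpha]; floating-point products can land just below an integer
    kept = xs[k:n - k]                 # Step 2: discard the k lowest and k highest
    return sum(kept) / len(kept)       # Steps 3-4: mean of the n - 2k remaining values

# Toy batch with one outlier
batch = [1, 2, 3, 4, 100]
print(trimmed_mean(batch, 0.0))   # ordinary mean, pulled up by the outlier
print(trimmed_mean(batch, 0.2))   # drops 1 and 100 before averaging
```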
Location Measures (cont’d)
The trimmed mean (discard only a certain number of
the observations) is introduced as a natural
compromise between the mean (discard no
observations) and the median (discard all but 1 or 2
observations)
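Another compromise between $\bar{x}$ and $\tilde{x}$ was proposed
by Huber (1981), who suggested minimizing:
$$\sum_{i=1}^{n}\rho\left(\frac{X_i - \mu}{\sigma}\right)$$
with respect to $\mu$, where $\rho(x)$ is to be given,
or solving
$$\sum_{i=1}^{n}\psi\left(\frac{X_i - \mu}{\sigma}\right) = 0$$
for $\mu$, where $\psi = \rho'$
(its solution is called an M-estimate).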
10.4.4: M-Estimates (Huber, 1964)
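$$\sum_{i=1}^{n}\rho\left(\frac{X_i - \mu}{\sigma}\right), \quad\text{where}\quad \rho(x) = \begin{cases} \frac{1}{2}x^2 & \text{if } |x| \le k \\ k|x| - \frac{1}{2}k^2 & \text{if } |x| > k \end{cases}$$
$\rho(x)$ is proportional to $x^2$ inside $[-k, k]$ and
replaces the parabolic arcs by straight lines outside $[-k, k]$.
Big k: $\hat{\mu}$ comes closer to the mean $\bar{x}$; $\rho(x) = \frac{1}{2}x^2$ or $\psi(x) = x$
Small k: $\hat{\mu}$ comes closer to the median $\tilde{x}$; $\rho(x) = k|x|$ or $\psi(x) = k\,\mathrm{sgn}(x)$
$k \to \infty$ corresponds to the mean $\bar{x}$ and $k \to 0$ corresponds to the median $\tilde{x}$
k protects against outliers: observations more than $k\sigma$ away from the center.
$k = \frac{3}{2}$ is suggested as a "moderate" compromise.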
10.4.4: M-Estimates (cont’d)
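M-estimates coincide with MLEs when $\rho(x) = -\log f(x)$, where $X_1, X_2, \dots, X_n$ are iid ~ f, because:
$$\text{Minimize } \sum_{i=1}^{n}\rho\left(\frac{X_i - \mu}{\sigma}\right) \text{ wrt } \mu \iff \text{Maximize } \prod_{i=1}^{n} f\left(\frac{X_i - \mu}{\sigma}\right) \text{ wrt } \mu$$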
The computation of an M-estimate is a nonlinear
minimization problem that must be solved using an
iterative method (such as Newton-Raphson,…)
Such a minimizer is unique for convex functions. Here,
we assume that $\sigma$ is known; but in practice, a robust
estimate of $\sigma$ (to be seen in Section 10.5) should be
used instead.
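As a rough illustration of the iterative computation, the sketch below finds a Huber M-estimate with σ treated as known. Note that it uses a simple reweighting fixed-point iteration (weights ψ(r)/r) rather than Newton-Raphson, starts from the median, and uses hypothetical names (`huber_psi`, `huber_m_estimate`) and toy data.

```python
import statistics

def huber_psi(x, k):
    """psi = rho': equal to x inside [-k, k], clipped to -k or k outside."""
    return max(-k, min(k, x))

def huber_m_estimate(data, sigma, k=1.5, tol=1e-10, max_iter=200):
    """Solve sum_i psi((X_i - mu) / sigma) = 0 for mu by iterative reweighting."""
    mu = statistics.median(data)              # robust starting value
    for _ in range(max_iter):
        weights = []
        for x in data:
            r = (x - mu) / sigma
            # weight psi(r) / r: 1 inside [-k, k], k / |r| outside
            weights.append(1.0 if abs(r) <= k else k / abs(r))
        new_mu = sum(w * x for w, x in zip(weights, data)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# Toy batch with one outlier; sigma = 1 is an assumption for the example
batch = [1, 2, 3, 4, 100]
print(huber_m_estimate(batch, sigma=1.0, k=1.5))   # stays near the bulk of the data
print(huber_m_estimate(batch, sigma=1.0, k=1e9))   # huge k reproduces the mean
```

Each step is a weighted mean in which observations more than kσ from the current center are down-weighted, which is one way to see why a large k recovers the ordinary mean.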
10.4.5: Comparison
of Location Estimates
Among the location estimates introduced in this
section, which one is the best? Not easy to say!
For symmetric underlying dist’n, all 4 statistics
(sample mean, sample median, alpha-trimmed
mean, and M-estimate) estimate the center of
symmetry.
For non-symmetric underlying dist’n, these 4 statistics
estimate 4 different pop’n parameters, namely the pop’n
mean, pop’n median, pop’n trimmed mean, and a
functional of the CDF by way of the weight function
$\psi$.
Idea: Run some simulations; compute more than one
estimate of location and pick the winner.
10.4.6: Estimating Variability of
Location Estimates
by the Bootstrap
Using a computer, we can generate (simulate) many
samples (a large number B) of size n from a common known
dist’n F. From each sample, we compute the value of
the location estimate $\hat{\theta}$.
The empirical dist’n of the resulting values $\theta^*_1, \theta^*_2, \dots, \theta^*_B$
is a good approximation (for large B) to the dist’n
function of $\hat{\theta}$. Unfortunately, F is NOT known in
general. Just plug in the empirical cdf $F_n$ for F and
bootstrap (= resample from $F_n$).
$F_n$ is a discrete PMF with the same probability $\frac{1}{n}$
for each observed value $x_1, x_2, \dots, x_n$.
10.4.6: Bootstrap (cont’d)
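A sample of size n from $F_n$ is a sample of size n drawn
with replacement from the observed data $x_1, \dots, x_n$
that produces $\theta^*_b$ (b = 1,...,B).
Thus,
$$s_{\hat{\theta}} = \sqrt{\frac{1}{B}\sum_{b=1}^{B}\left(\theta^*_b - \bar{\theta}^*\right)^2}$$
where $\bar{\theta}^* = \frac{1}{B}\sum_{b=1}^{B}\theta^*_b$ is the mean of the $\theta^*_b$, b = 1,2,...,B.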
Read example A on page 368.
Bootstrap dist’n can be used to form an approximate CI
and to test for hypotheses.
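The resampling recipe can be sketched with Python's standard library. The function name `bootstrap_se`, the seed, and the toy data are assumptions for this example, and any location estimate can be passed in as `estimator`.

```python
import math
import random
import statistics

def bootstrap_se(data, estimator, B=1000, seed=0):
    """Bootstrap standard error of an estimator, resampling from F_n."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    n = len(data)
    thetas = []
    for _ in range(B):
        # a sample of size n from F_n = n draws with replacement from the data
        resample = rng.choices(data, k=n)
        thetas.append(estimator(resample))
    mean_theta = sum(thetas) / B
    se = math.sqrt(sum((t - mean_theta) ** 2 for t in thetas) / B)
    return se, thetas

# Toy batch, for illustration only
batch = list(range(1, 11))
se_mean, _ = bootstrap_se(batch, statistics.mean, B=2000, seed=1)
se_median, _ = bootstrap_se(batch, statistics.median, B=2000, seed=1)
print(se_mean, se_median)
```

The returned list of bootstrap values can also be sorted to read off percentiles when an approximate CI is wanted.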
10.5:Measures of Dispersion
A measure of dispersion (scale) gives a numerical
indication of the “scatteredness” of a batch of
numbers. The most common measure of dispersion
is the sample standard deviation
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}$$
Like the sample mean, the sample standard deviation
is NOT robust (sensitive to outliers).
Two simple robust measures of dispersion are the
IQR (interquartile range) and the MAD (median
absolute deviation from the median).
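A short Python sketch can compare the three measures on a batch with and without an outlier. The quartile convention below (linear interpolation between order statistics) is only one of several in use and is an assumption of this example, as are the helper names.

```python
import statistics

def quantile(data, p):
    """p-th sample quantile by linear interpolation between order statistics."""
    xs = sorted(data)
    pos = p * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def iqr(data):
    """Interquartile range: upper quartile minus lower quartile."""
    return quantile(data, 0.75) - quantile(data, 0.25)

def mad(data):
    """Median absolute deviation from the median."""
    med = statistics.median(data)
    return statistics.median([abs(x - med) for x in data])

clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]
for batch in (clean, with_outlier):
    print(statistics.stdev(batch), iqr(batch), mad(batch))
```

In this toy comparison the sample standard deviation changes drastically when 5 is replaced by 100, while the IQR and MAD stay the same.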
10.6: Box Plots
Tukey invented a graphical display (boxplot)
that indicates the center of a data set
(median), the spread of the data (IQR) and
the presence of possible outliers.
The boxplot also gives an indication of the
symmetry / asymmetry (skewness) of the
dist’n of data values.
Later, we will see how boxplots can be
effectively used to compare batches of
numbers.
10.7: Conclusion
Several graphical tools were introduced in
this chapter as methods of presenting and
summarizing data. Some aspects of the
sampling dist’ns of these summaries (assuming a
stochastic model for the data)
were discussed.
Bootstrap methods (approximating a
sampling dist’n and functionals) were also
revisited.
Parametric Bootstrap:
Example: Estimating a population mean
It is known that explosives used in mining leave a crater
that is circular in shape with a diameter that follows an
exponential dist’n $F(x) = 1 - e^{-x/\theta}$, $x \ge 0$. Suppose a
new form of explosive is tested. The sample crater
diameters (cm) are as follows:
121 847 591 510 440 205 3110 142 65 1062
211 269 115 586 983 115 162 70 565 114
$\bar{x} = 514.15$ (sample mean) and $s = 685.60$ (sample SD)
It would be inappropriate to use $\bar{x} \pm t_{0.95}\frac{s}{\sqrt{n}} = (249.07, 779.23)$
as a 90% CI for the pop’n mean via the t-curve (df=19)
Parametric Bootstrap: (cont’d)
because such a CI is based on the normality
assumption for the parent pop’n.
The parametric bootstrap replaces the exponential
pop’n dist’n F with unknown mean $\theta$ by the known
exponential dist’n F* with mean $\theta^* = \bar{x} = 514.15$.
Then resamples of size n=20 are drawn from this
surrogate pop’n. Using Minitab, we can generate
B=1000 such samples of size n=20 and compute
the sample mean of each of these B samples. A
bootstrap CI can be obtained by trimming off 5%
from each tail. Thus, a parametric bootstrap 90% CI
is given by:
(50th smallest = 332.51, 951st smallest = 726.45)
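The same procedure can be sketched in Python instead of Minitab. The function name and seed are assumptions for this example; because the resamples are random, the endpoints will differ somewhat from the values quoted above.

```python
import random

# Sample crater diameters (cm) from the example
crater = [121, 847, 591, 510, 440, 205, 3110, 142, 65, 1062,
          211, 269, 115, 586, 983, 115, 162, 70, 565, 114]

def parametric_bootstrap_ci(data, B=1000, seed=0):
    """90% parametric bootstrap CI for the mean of an exponential pop'n."""
    rng = random.Random(seed)
    n = len(data)
    theta_star = sum(data) / n             # surrogate pop'n mean = sample mean
    means = []
    for _ in range(B):
        # draw n values from the exponential dist'n with mean theta_star
        sample = [rng.expovariate(1.0 / theta_star) for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    tail = B // 20                         # 5% of the B bootstrap means
    # for B = 1000: the 50th smallest and the 951st smallest
    return means[tail - 1], means[B - tail]

print(parametric_bootstrap_ci(crater))
```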
Non-Parametric Bootstrap:
If we do not assume that we are sampling from a
normal pop’n or some other specified shape pop’n,
then we must extract all the information about the
pop’n from the sample itself.
Nonparametric bootstrapping builds a
sampling dist’n for our estimate by drawing samples
with replacement from our original (raw) data.
Thus, a nonparametric bootstrap 90% CI of $\theta$ is
obtained by taking the 5th and 95th percentiles of $\hat{\theta}$
among these resamples.
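A minimal Python sketch of this percentile interval, applied to the crater data, might look as follows. The function name, the seed, and the choice of B = 1000 resamples are assumptions for the example, and the percentiles are read off as the 50th and 951st ordered values.

```python
import random
import statistics

# Sample crater diameters (cm) from the parametric bootstrap example
crater = [121, 847, 591, 510, 440, 205, 3110, 142, 65, 1062,
          211, 269, 115, 586, 983, 115, 162, 70, 565, 114]

def percentile_bootstrap_ci(data, estimator, B=1000, seed=0):
    """Nonparametric bootstrap 90% CI from the 5th and 95th percentiles."""
    rng = random.Random(seed)
    n = len(data)
    # each resample is n draws with replacement from the raw data
    estimates = sorted(estimator(rng.choices(data, k=n)) for _ in range(B))
    tail = B // 20                          # 5% in each tail
    return estimates[tail - 1], estimates[B - tail]

print(percentile_bootstrap_ci(crater, statistics.mean))
print(percentile_bootstrap_ci(crater, statistics.median))
```

No distributional shape is assumed here, so the same function works for the mean, the median, or any other estimate computed from the resample.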