Exploratory Data Analysis Overview of TBW Data


Quantifying Uncertainty
Two approaches
• Use statistical theory
• Bootstrapping
Statistical Theory
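Uncorrelated data:

$$SE_{\bar{x}}^2 = \frac{\sigma_x^2}{N}$$

Correlated data:

$$SE_{\bar{x}}^2 = \frac{\sigma_x^2}{N}\left[1 + 2\sum_{k=1}^{N-1}\left(1 - \frac{k}{N}\right)\rho(k)\right]$$

Equivalently, in terms of an effective sample size $N_e$:

$$SE_{\bar{x}}^2 = \frac{\sigma_x^2}{N_e}, \qquad N_e = \frac{N}{1 + 2\sum_{k=1}^{N-1}\left(1 - \frac{k}{N}\right)\rho(k)}$$

where $\rho(k)$ is the lag-$k$ autocorrelation.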


Significance Statistics
• Significance statistics use the standard error:
  t-statistic: $t = \hat{\theta} / SE_{\hat{\theta}}$
• Confidence intervals:
  $\hat{\theta} \pm z_{\alpha/2} \, SE_{\hat{\theta}}$
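
As a rough illustration of how the standard error feeds both quantities, the sketch below computes a t-statistic and a normal-theory confidence interval. The function names and the input numbers are hypothetical; only the two formulas come from the slide.

```python
from statistics import NormalDist

def t_statistic(estimate, std_error):
    """Ratio of an estimate to its standard error."""
    return estimate / std_error

def confidence_interval(estimate, std_error, alpha=0.05):
    """Two-sided interval: estimate +/- z_{alpha/2} * SE."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    half_width = z * std_error
    return estimate - half_width, estimate + half_width

# Hypothetical estimate and standard error, for illustration only.
estimate, std_error = 10.0, 2.0
print("t-statistic:", t_statistic(estimate, std_error))
print("95% CI:", confidence_interval(estimate, std_error))
```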
Bootstrapping
• Motivated by the absence of equations for
other accuracy measures (bias, prediction
error, confidence intervals) for statistics of
interest (correlation, regressions, ACF)
• Definition: “The bootstrap is a data-based
simulation method for statistical inference.”
• Principle: resample with replacement from
data.
After Efron and Tibshirani, An Introduction to the Bootstrap, 1993
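
The resampling principle can be sketched as follows. This is a minimal example under stated assumptions: the function names, the toy data, and the choice of the median as the statistic are all illustrative and do not come from the slides.

```python
import random
import statistics

def bootstrap_replications(data, statistic, n_boot=500, seed=0):
    """Evaluate `statistic` on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    reps = []
    for _ in range(n_boot):
        resample = rng.choices(data, k=n)  # same length as data, with replacement
        reps.append(statistic(resample))
    return reps

def bootstrap_standard_error(data, statistic, n_boot=500, seed=0):
    """Standard deviation of the bootstrap replications."""
    return statistics.stdev(bootstrap_replications(data, statistic, n_boot, seed))

# Toy data (hypothetical). The median has no simple standard-error formula,
# which is the kind of case the bootstrap is meant for.
toy = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0]
print("bootstrap SE of the median:",
      bootstrap_standard_error(toy, statistics.median))
```

Any function that maps a sample to a number (a correlation, a regression slope, an ACF value) can be passed in place of the median.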
Bootstrapping
[Diagram: the real world and the bootstrap world, side by side]
REAL WORLD: unknown probability distribution F → observed random sample x = {x1, x2, …, xn} → statistic of interest $\hat{\theta} = s(x)$
BOOTSTRAP WORLD: empirical distribution F* → sampling with replacement → bootstrap sample x* = {x*1, x*2, …, x*n} → bootstrap replication $\hat{\theta}^* = s(x^*)$
After Efron and Tibshirani, An Introduction to the Bootstrap, 1993
Hillsborough River at Zephyr Hills, September flows
[Figure: histogram of September flows. x-axis: 0 to 35000; y-axis: Frequency.]
Mean = 8621 mgal
S = 8194 mgal
N = 31
Uncertainty on estimates of the mean
[Figure: histogram. x-axis: Millions of gallons, 0 to 35000; y-axis: Frequency.]
• One and two standard errors, with $SE_{\bar{x}}^2 = \sigma_x^2 / N$
• 95% CI and interquartile range from 500 bootstrap samples
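
The comparison between the theoretical standard error and the bootstrap intervals can be sketched as below. The flow record itself is not part of this transcript, so the example draws a synthetic, skewed sample of the same length (N = 31). The lognormal parameters, the function names, and the linear-interpolation percentile are all assumptions.

```python
import math
import random
import statistics

def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def compare_mean_uncertainty(data, n_boot=500, seed=0):
    """Theory-based SE bands and bootstrap percentile intervals for the mean."""
    n = len(data)
    mean = statistics.fmean(data)
    se = statistics.stdev(data) / math.sqrt(n)  # assumes uncorrelated data
    rng = random.Random(seed)
    boot_means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_boot)
    )
    return {
        "mean": mean,
        "one_se": (mean - se, mean + se),
        "two_se": (mean - 2 * se, mean + 2 * se),
        "boot_95ci": (percentile(boot_means, 2.5), percentile(boot_means, 97.5)),
        "boot_iqr": (percentile(boot_means, 25), percentile(boot_means, 75)),
    }

# Synthetic stand-in for 31 years of September flows (NOT the real record).
rng = random.Random(42)
synthetic_flows = [rng.lognormvariate(8.5, 0.9) for _ in range(31)]
for name, value in compare_mean_uncertainty(synthetic_flows).items():
    print(name, value)
```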
Box-Cox Normality Plot for Monthly September Flows on Alafia R.
[Figure: Box-Cox normality plot for Alafia R., using the Kolmogorov-Smirnov (KS) statistic. x-axis: Box-Cox Lambda Value, -2 to 2; y-axis: KS p-value, 0.0 to 1.0.]
Peak at λ = -0.39 (optimal lambda = -0.39)
What is the range of uncertainty on this?
Example for the ks.test
• How?
• Produce 500 new datasets (x*) of the same
length as x by sampling with replacement
from x
• Find the optimal λ value for each
• Determine the 10th and 90th percentiles to
cover 80% of the λ values calculated.
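
The three steps above can be sketched in Python. The slides refer to R's ks.test; this stand-in instead minimizes the KS distance to a normal distribution fitted to the transformed data. For a fixed sample size the KS p-value should decrease as that distance grows, so this is expected to select the same λ as maximizing the p-value. The function names, the λ grid (-2 to 2 in steps of 0.05), and the synthetic data are assumptions.

```python
import math
import random
from statistics import NormalDist

def box_cox(x, lam):
    """Box-Cox transform of positive values; the log transform at lam = 0."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in x]
    return [(v ** lam - 1.0) / lam for v in x]

def ks_distance_to_normal(y):
    """KS distance between y and a normal with y's sample mean and SD."""
    n = len(y)
    mean = sum(y) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in y) / (n - 1))
    if sd == 0.0:
        return 1.0  # degenerate resample: treat as the worst possible fit
    dist = NormalDist(mean, sd)
    d = 0.0
    for i, v in enumerate(sorted(y)):
        cdf = dist.cdf(v)
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

def optimal_lambda(x, grid):
    """Grid value whose transform is closest to normal in KS distance."""
    return min(grid, key=lambda lam: ks_distance_to_normal(box_cox(x, lam)))

def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def bootstrap_lambda_interval(x, grid, n_boot=500, seed=0):
    """10th, 50th and 90th percentiles of the bootstrapped optimal lambda."""
    rng = random.Random(seed)
    lams = sorted(
        optimal_lambda(rng.choices(x, k=len(x)), grid) for _ in range(n_boot)
    )
    return percentile(lams, 10), percentile(lams, 50), percentile(lams, 90)

if __name__ == "__main__":
    # Synthetic positive data standing in for the flow record (NOT real data).
    rng = random.Random(1)
    flows = [rng.lognormvariate(8.0, 1.0) for _ in range(31)]
    grid = [round(-2.0 + 0.05 * i, 2) for i in range(81)]
    print("optimal lambda:", optimal_lambda(flows, grid))
    print("10%, 50%, 90%:", bootstrap_lambda_interval(flows, grid))
```

The 10th and 90th percentiles returned by `bootstrap_lambda_interval` bound the 80% interval described in the last bullet.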
[Figure: histogram of bootstrapped optimal lambda values for Alafia September flows, with the 80% confidence interval marked. x-axis: lambda, -0.6 to 0.2; y-axis: Frequency.]

Percentile   10%      50%      90%
λ            -0.425   -0.250   -0.068

Look back at the original plot and verify that the original “optimal” value was at the far left of the broad top, which is reflected in this confidence interval.