Bootstrapping - University of Notre Dame


Alternative Forecasting
Methods: Bootstrapping
Bryce Bucknell
Jim Burke
Ken Flores
Tim Metts
Agenda
Scenario
Obstacles
Regression Model
Bootstrapping
Applications and Uses
Results
Scenario
You have been recently hired as the statistician for the University of
Notre Dame football team. You are tasked with performing a statistical
analysis for the first year of the Charlie Weis era. Specifically, you have
been asked to develop a regression model that explains the relationship
between key statistical categories and the number of points scored by the
offense. You have a limited number of data points, so you must also find
a way to ensure that the regression results generated by the model are
reliable and significant.
Problems/Obstacles:




Central Limit Theorem
Replication of data
Sampling
Variance of error terms
Constrained by the Central Limit Theorem
In selecting simple random samples of size n from a population, the
sampling distribution of the sample mean x̄ can be approximated by a
normal probability distribution as the sample size becomes large. It
is generally accepted that the sample size must be 30 or greater to
satisfy the large-sample condition of the theorem.
[Figure: sampling distributions of the sample mean for sample sizes N = 1, 2, 3 and 4]
1. http://www.statisticalengineering.com/central_limit_theorem_(summary).htm
Central Limit Theorem
The Central Limit Theorem is the foundation for many statistical
procedures: the distribution of the phenomenon under study does NOT
have to be normal, because its average WILL tend to be normally
distributed as the sample size grows.
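The effect can be checked with a small simulation. The sketch below is illustrative only: it assumes a right-skewed exponential population (not from the slides) and measures how the skewness of the sample means shrinks toward 0, the value for a symmetric normal curve, as the sample size n grows. The helper names `skewness` and `sample_means` are made up for this example.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric distribution)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

def sample_means(n, reps, rng):
    """Means of `reps` samples of size n drawn from an exponential population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

rng = random.Random(2005)  # fixed seed so the run is repeatable
skew_by_n = {n: skewness(sample_means(n, 5000, rng)) for n in (1, 2, 4, 30)}

for n, skew in skew_by_n.items():
    print(f"n={n:>2}: skewness of sample means = {skew:.2f}")
```

The skewness for n = 30 should come out much closer to 0 than for n = 1, which is the behavior the rule of thumb relies on.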
Why is the assumption of a normal distribution important?



A normal distribution allows for the application of the empirical rule: about
68%, 95% and 99.7% of values fall within 1, 2 and 3 standard deviations of the mean
Chebyshev’s Theorem holds for any distribution: no more than 1/4 of the values are more than 2
standard deviations away from the mean, no more than 1/9 are more than
3 standard deviations away, no more than 1/25 are more than 5 standard
deviations away, and so on.
The assumption of normally distributed data allows descriptive statistics
to be used to explain the nature of the population
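To see how much the normality assumption buys, the two rules can be compared numerically. This is a minimal sketch using only Python's standard library; the function names are chosen for this example.

```python
from statistics import NormalDist

def normal_within(k: float) -> float:
    """Share of a normal population within k standard deviations of the mean."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.cdf(k) - std_normal.cdf(-k)

def chebyshev_within(k: float) -> float:
    """Chebyshev's guaranteed minimum share within k standard deviations (k > 1)."""
    return 1.0 - 1.0 / k**2

for k in (2, 3, 5):
    print(
        f"k={k}: normal {normal_within(k):.4f}, "
        f"Chebyshev at least {chebyshev_within(k):.4f}"
    )
```

For k = 2 the normal curve covers roughly 95% of values, while Chebyshev's bound only guarantees 75%. Without normality the statements that can be made about the population are much weaker.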
Not enough data available?
Monte Carlo simulation, a type of spreadsheet simulation, is used to
randomly generate values for uncertain variables over and over to
simulate a model.





Monte Carlo methods randomly select values to create scenarios
The random selection process is repeated many times to create multiple
scenarios
Through the random selection process, the scenarios give a range of
possible solutions, some of which are more probable and some less
probable
As the process is repeated multiple times, 10,000 or more, the average
solution will give an approximate answer to the problem
The accuracy can be improved by increasing the number of scenarios
selected
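As a rough illustration of the steps above, the sketch below runs 10,000 scenarios of a deliberately simple, hypothetical model (the total of two dice, which is not part of the original presentation) and averages them. The exact answer for this toy model is 7, so the quality of the approximation is easy to judge.

```python
import random

def simulate_scenario(rng: random.Random) -> int:
    """One scenario: the total of two hypothetical uncertain inputs (two dice)."""
    return rng.randint(1, 6) + rng.randint(1, 6)

def monte_carlo_mean(n_scenarios: int, seed: int = 0) -> float:
    """Average outcome over many randomly generated scenarios."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = sum(simulate_scenario(rng) for _ in range(n_scenarios))
    return total / n_scenarios

estimate = monte_carlo_mean(10_000)
print(f"estimated average outcome: {estimate:.3f} (exact value is 7)")
```

Raising `n_scenarios` narrows the gap between the estimate and the exact value, which is the accuracy improvement described above.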
Sampling without Replacement
Simple Random Sampling




A simple random sample from a population is a sample chosen
randomly, so that each possible sample has the same probability
of being chosen.
In small populations such sampling is typically done "without
replacement"
Sampling without replacement deliberately avoids
choosing any member of the population more than once
This process should be used when outcomes are mutually
exclusive, e.g. poker hands
Sampling with Replacement






Initial data set is not large enough to use simple
random sampling without replacement
Through Monte Carlo simulation we have been able to replicate
the original population
Units are sampled from the population one at a time, with each
unit being replaced before the next is sampled.
One outcome does not affect the other outcomes
Allows a greater number of potential outcomes than sampling
without replacement
If observations were not replaced there would not be enough
independent observations to create a sample size of n ≥ 30
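The difference between the two schemes can be sketched in a few lines. The 12 labels below are hypothetical stand-ins for a small data set. `random.sample` draws without replacement and `random.choices` draws with replacement.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
games = [f"game_{i}" for i in range(1, 13)]  # hypothetical 12-observation data set

without_repl = rng.sample(games, k=12)  # each observation appears at most once
with_repl = rng.choices(games, k=30)    # observations may repeat

# Without replacement, more than len(games) draws are impossible.
try:
    rng.sample(games, k=30)
    oversample_ok = True
except ValueError:
    oversample_ok = False

print("distinct values drawn without replacement:", len(set(without_repl)))
print("draws made with replacement:", len(with_repl))
print("can draw 30 without replacement:", oversample_ok)
```

With only 12 observations, a sample of 30 is only reachable when units are replaced after each draw.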
Heteroscedasticity vs. Homoscedasticity
[Figure: two plots of residuals against X, one for each case]
Homoscedasticity – constant variance
All random variables have the same finite variance
Simplifies mathematical and computational treatment
Leads to good estimation results in data mining and regression
Heteroscedasticity – nonconstant variance
Random variables may have different variances
Standard errors of regression coefficients may be understated
T-ratios may be larger than actual
More common with cross sectional data
Regression Model For ND Points Scored
ND Points = 38.54 + 0.079*b1 - 0.170*b2 - 0.662*b3 - 3.16*b4
b1 = Total Yards Gained
b2 = Penalty Yards
b3 = Total Plays
b4 = Turnovers
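The fitted equation can be applied directly to a game's statistics. In the sketch below the coefficients come from the model above, while the function name and the input values (450 yards, 50 penalty yards, 70 plays, 1 turnover) are hypothetical and chosen only to show the arithmetic.

```python
def predict_nd_points(total_yards, penalty_yards, total_plays, turnovers):
    """Points predicted by the fitted regression equation."""
    return (
        38.54
        + 0.079 * total_yards
        - 0.170 * penalty_yards
        - 0.662 * total_plays
        - 3.16 * turnovers
    )

# Hypothetical game statistics, not taken from the 2005 season data.
example = predict_nd_points(
    total_yards=450, penalty_yards=50, total_plays=70, turnovers=1
)
print(f"predicted points: {example:.2f}")
```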
Audit Trail -- Coefficient Table (Multiple Regression Selected)
Series Description   Included in Model   Coefficient   Standard Error   T-test   F-test
ND Points            Dependent                 38.54            14.26     2.70     7.31
Total YDS            Yes                        0.08             0.02     5.29    27.97
Penalty YDS          Yes                       -0.17             0.06    -2.64     6.99
Total Plays          Yes                       -0.66             0.23    -2.84     8.05
Turnovers            Yes                       -3.16             2.50    -1.26     1.59
Overall F-test: 8.92
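One quick consistency check on the table: for a single coefficient, the F statistic is the square of the t statistic. The sketch below uses the rounded values as reported and confirms they agree within rounding error. The variable names are chosen for this example.

```python
# (t statistic, F statistic) for each row, as reported in the coefficient table.
rows = {
    "ND Points (intercept)": (2.70, 7.31),
    "Total YDS": (5.29, 27.97),
    "Penalty YDS": (-2.64, 6.99),
    "Total Plays": (-2.84, 8.05),
    "Turnovers": (-1.26, 1.59),
}

for name, (t_stat, f_stat) in rows.items():
    print(f"{name:<22} t^2 = {t_stat**2:6.2f}   reported F = {f_stat:6.2f}")

# Largest disagreement between t^2 and the reported F across all rows.
max_gap = max(abs(t_stat**2 - f_stat) for t_stat, f_stat in rows.values())
print(f"largest gap: {max_gap:.3f}")
```

The small gaps come from the table reporting t and F to two decimal places.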
4 Checks of a Regression Model
1. Do the coefficients have the correct sign?
2. Are the slope terms statistically
significant?
3. How well does the model fit the data?
4. Is there any serial correlation?
4 Checks of a Regression Model
1. Do the coefficients have the correct sign?
Audit Trail -- Coefficient Table
Series Description   Included in Model   Coefficient
ND Points            Dependent                 38.54
Total YDS            Yes                        0.08
Penalty YDS          Yes                       -0.17
Total Plays          Yes                       -0.66
Turnovers            Yes                       -3.16
Could this represent a big play factor?
4 Checks of a Regression Model
2. Are the slope terms statistically significant?
Audit Trail -- Coefficient Table (Multiple Regression Selected)
Series Description   Included in Model   Coefficient   Standard Error   T-test   F-test
ND Points            Dependent                 38.54            14.26     2.70     7.31
Total YDS            Yes                        0.08             0.02     5.29    27.97
Penalty YDS          Yes                       -0.17             0.06    -2.64     6.99
Total Plays          Yes                       -0.66             0.23    -2.84     8.05
Turnovers            Yes                       -3.16             2.50    -1.26     1.59
Overall F-test: 8.92
4 Checks of a Regression Model
3. How well does the model fit the data?
[Figure: ND Points, actual values plotted against fitted values and the forecast of ND Points. Vertical axis: 0 to 60. Horizontal axis labels: May-05 through Apr-07.]
Adjusted R2 = 74.22%
4 Checks of a Regression Model
4. Is there any serial correlation?
Data is cross sectional, so serial correlation is not a concern
With limited data points, how useful is this
regression in describing how well the model fits the
actual data? Is there a way to test its reliability?
How to test the significance of the analysis
What happens when the sample size is not large enough (n < 30)?
Bootstrapping is a method for estimating the sampling distribution of an
estimator by resampling with replacement from the original sample.




Commonly used statistical significance tests determine
the likelihood of a result given a random sample and a sample size
of n.
If the sample is not random, or the population does not allow a large enough
sample to be drawn, the central limit theorem cannot be relied upon
Thus, the statistical significance of the results would not hold
Bootstrapping uses replication of the original data to simulate a
larger population, thus allowing many samples to be drawn and
statistical tests to be calculated
How It Works
Bootstrapping is a method for estimating the sampling distribution of an
estimator by resampling with replacement from the original sample.





The bootstrap procedure is a means of estimating the statistical accuracy . .
. from the data in a single sample.
Bootstrapping is used to mimic the process of selecting many samples when
the population is too small to do otherwise
The samples are generated from the data in the original sample by copying
it many times (Monte Carlo Simulation)
Samples can then be selected at random and descriptive statistics calculated or
regressions run for each sample
The results generated from the bootstrap samples can be treated as if they
were the result of actual sampling from the original population
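The procedure described above can be sketched as follows. This is a minimal illustration, not the presenters' spreadsheet: the 12 point totals are hypothetical, the helper name `bootstrap_statistic` is made up for this example, and the statistic being bootstrapped is simply the sample mean.

```python
import random
import statistics

def bootstrap_statistic(sample, statistic, n_resamples=2000, seed=0):
    """Recompute `statistic` on many resamples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n = len(sample)
    return [statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)]

# Hypothetical points scored in 12 games (NOT the actual Notre Dame data).
points = [21, 35, 14, 42, 28, 31, 17, 38, 24, 45, 27, 33]

boot_means = bootstrap_statistic(points, statistics.fmean)
boot_se = statistics.stdev(boot_means)  # bootstrap estimate of the standard error

print(f"sample mean: {statistics.fmean(points):.2f}")
print(f"bootstrap standard error of the mean: {boot_se:.2f}")
```

Each resample has the same size as the original sample. The spread of the recomputed statistic across resamples stands in for the sampling distribution that cannot be observed directly.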
Characteristics of Bootstrapping
Sampling with
Replacement
Full Sample
Bootstrapping Example
Original Data Set (limited number of observations):
Pittsburgh, Michigan, Michigan State, Washington, Purdue, USC, BYU, Tennessee, Navy, Syracuse, Stanford, Ohio State
109 copies of each observation, creating a much larger sample with which to work
Random sampling with replacement can be employed to create multiple independent samples for analysis
1st Random Sample:
Navy, Ohio State, USC, Washington, Ohio State, BYU, USC, Stanford, Pittsburgh, Ohio State, Stanford, Michigan
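The replication step in this example can be sketched as below. The opponent names and the figure of 109 copies come from the slide. The seed, the variable names, and the choice to draw 12 games per sample are assumptions made for illustration.

```python
import random

# The 12 opponents named on the slide, one observation per game.
opponents = [
    "Pittsburgh", "Michigan", "Michigan State", "Washington", "Purdue", "USC",
    "BYU", "Tennessee", "Navy", "Syracuse", "Stanford", "Ohio State",
]

# Replicate every observation 109 times to form the enlarged pool.
pool = [team for team in opponents for _ in range(109)]

rng = random.Random(1)  # fixed seed so the run is repeatable
# Because each game appears 109 times in the pool, the same game can be
# drawn more than once, which approximates sampling with replacement.
first_sample = rng.sample(pool, k=len(opponents))

print("pool size:", len(pool))
print("first random sample:", first_sample)
```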
When it should be used
Bootstrapping is especially useful in situations when no analytic formula
for the sampling distribution is available.



Traditional forecasting methods, like exponential
smoothing, work well when demand is constant,
since patterns are easily recognized by software
In contrast, when demand is irregular, patterns
may be difficult to recognize.
Therefore, when faced with irregular demand,
bootstrapping may be used to provide more
accurate forecasts, making some important
assumptions…
Assumptions and Methodology

Bootstrapping makes no assumption regarding the population

No normality of error terms

No equal variance

Allows for accurate forecasts of intermittent demand


If the sample is a good approximation of the population, the
sampling distribution may be estimated by generating a large
number of new samples
For small data sets, taking a small representative sample of the data
and replicating it may yield better results
Applications and Uses
Criminology



Statistical significance testing is important in
criminology and criminal justice
Six of the most popular journals in
criminology and criminal justice are
dominated by quantitative methods that rely
on statistical significance testing
However, it poses two potential problems:
tautology and violations of assumptions
Applications and Uses
Criminology



Tautology: the null hypothesis is always false
because virtually all null hypotheses may be
rejected at some sample size
Violation of assumptions of regression: that errors
are homoscedastic (constant variance) and
normally distributed
Bootstrapping provides a user-friendly
alternative to cross-validation and jackknife to
augment statistical significance testing
Applications and Uses
Actuarial Practice



Process of developing an actuarial model
begins with the creation of probability
distributions of input variables
Input variables are generally asset-side
generated cash flows (financial) or cash flows
generated from the liabilities side
(underwriting)
Traditional actuarial methodologies are rooted
in parametric approaches, which fit a prescribed
distribution of losses to the data
Applications and Uses
Actuarial Practice




However, experience from the last two decades has
shown greater interdependence of loss variables with
asset variables
Increased complexity has been accompanied by
increased competitive pressures and more frequent
insolvencies
There is a need to use nonparametric methods in
modeling loss distributions
Bootstrap standard errors and confidence intervals
are used to derive the distribution
Applications and Uses
Classifications Used by Ecologists




Ecologists often use cluster analysis as a tool in the
classification and mapping of entities such as
communities or landscapes
However, the researcher has to choose an adequate
group partition level and in addition, cluster analysis
techniques will always reveal groups
Use bootstrap to test statistically for fuzziness of the
partitions in cluster analysis
Partitions found in bootstrap samples are compared to
the observed partition by the similarity of the
sampling units that form the groups.
Applications and Uses
Human Nutrition



Inverse regression used to estimate vitamin
B-6 requirement of young women
Standard statistical methods were used to
estimate the mean vitamin B-6 requirement
Used bootstrap procedure as a further check
for the mean vitamin B-6 requirement by
looking at the standard error estimates and
confidence intervals
Applications and Uses
Outsourcing




Agilent Technologies determined it was time to
transfer manufacturing of its 3070 in-circuit test
systems from Colorado to Singapore
Major concern was the change in environmental test
conditions (dry vs humid)
Because Agilent tests to tighter factory limits (“guard
banding”), they needed to adjust the guard band for
Singapore
Bootstrap was used to determine the appropriate
guard band for Singapore facility
An Alternative to the bootstrap
Jackknife


A statistical method for estimating and
removing bias* and for deriving robust
estimates of standard errors and
confidence intervals
Created by systematically dropping out
subsets of data one at a time and
assessing the resulting variation
Bias: A statistical sampling or testing error caused by systematically favoring some outcomes
over others
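A minimal sketch of the leave-one-out form of the jackknife follows, again with hypothetical data. It drops one observation at a time, recomputes the statistic, and turns the variation among those recomputed values into a standard error. The function name `jackknife_se` is made up for this example.

```python
import math
import statistics

def jackknife_se(sample, statistic):
    """Leave-one-out jackknife estimate of the standard error of `statistic`."""
    n = len(sample)
    # Recompute the statistic n times, each time omitting one observation.
    leave_one_out = [statistic(sample[:i] + sample[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    return math.sqrt((n - 1) / n * sum((v - center) ** 2 for v in leave_one_out))

# Hypothetical points scored in 12 games (NOT the actual Notre Dame data).
points = [21, 35, 14, 42, 28, 31, 17, 38, 24, 45, 27, 33]

se_jackknife = jackknife_se(points, statistics.fmean)
se_formula = statistics.stdev(points) / math.sqrt(len(points))

print(f"jackknife SE of the mean: {se_jackknife:.4f}")
print(f"textbook SE of the mean:  {se_formula:.4f}")
```

For the sample mean the jackknife reproduces the textbook standard error exactly. Because no random draws are involved, repeating the calculation gives the same answer every time.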
A comparison of the Bootstrap & Jackknife

Bootstrap


Yields slightly different
results when repeated
on the same data
(when estimating the
standard error)
Not bound to
theoretical
distributions

Jackknife




Less general technique
Explores sample
variation differently
Yields the same result
each time
Similar data
requirements
Another alternative method
Cross-Validation

The practice of partitioning a sample of data
into sub-samples such that
the initial analysis is conducted on a single
sub-sample (training data), while further
sub-samples (test or validation data) are
retained “blind” for subsequent
use in confirming and validating the initial
analysis
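One common way to organize such partitions is k-fold cross-validation, where each sub-sample takes a turn as the held-out validation data. The sketch below only produces the index splits. The function name, the 12 observations and the choice of 4 folds are assumptions for illustration.

```python
import random

def k_fold_splits(n_obs, k, seed=0):
    """Yield (training indices, validation indices) for each of k folds."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)  # randomize which rows share a fold
    folds = [indices[i::k] for i in range(k)]
    for i, validation_idx in enumerate(folds):
        # Training data is everything outside the held-out fold.
        training_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield training_idx, validation_idx

splits = list(k_fold_splits(n_obs=12, k=4))

for training_idx, validation_idx in splits:
    print("train on", sorted(training_idx), "validate on", sorted(validation_idx))
```

With 12 observations and 4 folds, each model is fitted on only 9 rows and checked on 3, which shows why the approach is more comfortable with larger data sets.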
Bootstrap vs. Cross-Validation

Bootstrap


Requires a small amount of
data
More complex
technique – time
consuming

Cross-Validation



Not a resampling
technique
Requires large
amounts of data
Extremely useful in
data mining and
artificial intelligence
Methodology for ND Points Model




Use bootstrapping on ND points scored
regression model
Goal: determine the reliability of the model
Replication, random sampling, and numerous
independent regressions
Calculation of a confidence interval for adjusted
R2
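The methodology can be sketched end to end under some simplifying assumptions. To stay within Python's standard library, the example fits a regression with a single predictor rather than the four-predictor model, and both data columns are hypothetical. It draws 24 bootstrap samples of whole games, refits the regression on each, and summarizes the adjusted R2 values with a normal-based 95% interval.

```python
import random
import statistics
from math import sqrt

# Hypothetical 12-game data set (NOT the actual Notre Dame figures).
total_yards = [320, 410, 280, 505, 390, 450, 300, 480, 350, 520, 370, 430]
points = [21, 35, 14, 42, 28, 31, 17, 38, 24, 45, 27, 33]

def adjusted_r2(xs, ys):
    """Adjusted R^2 of a one-predictor least-squares fit of ys on xs."""
    n = len(xs)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return None  # degenerate resample: no variation to explain
    r2 = sxy ** 2 / (sxx * syy)
    return 1 - (1 - r2) * (n - 1) / (n - 2)  # one predictor: n - k - 1 = n - 2

def bootstrap_adjusted_r2(xs, ys, n_samples=24, seed=0):
    """Refit the regression on games resampled with replacement."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n = len(xs)
    results = []
    while len(results) < n_samples:
        idx = rng.choices(range(n), k=n)  # whole games (x, y pairs) move together
        value = adjusted_r2([xs[i] for i in idx], [ys[i] for i in idx])
        if value is not None:
            results.append(value)
    return results

boot_r2 = bootstrap_adjusted_r2(total_yards, points)
mean_r2 = statistics.fmean(boot_r2)
# Normal-based 95% margin around the mean of the bootstrap values.
half_width = 1.96 * statistics.stdev(boot_r2) / sqrt(len(boot_r2))

print(f"mean adjusted R^2: {mean_r2:.4f}")
print(f"95% interval: {mean_r2 - half_width:.4f} to {mean_r2 + half_width:.4f}")
```

Resampling whole games keeps each game's predictor and outcome paired, so every bootstrap regression is fitted to observations that actually occurred together.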
Bootstrapping Results
R2 Data
Sample #   Adjusted R^2   Sample #   Adjusted R^2
1          0.7351         13         0.7482
2          0.7545         14         0.8719
3          0.7438         15         0.7391
4          0.7968         16         0.9025
5          0.5164         17         0.8634
6          0.6449         18         0.7927
7          0.9951         19         0.6797
8          0.9253         20         0.6765
9          0.8144         21         0.8226
10         0.7631         22         0.9902
11         0.8257         23         0.8812
12         0.9099         24         0.9169
The mean, standard deviation, and 95% and 99%
confidence intervals are then calculated in
Excel from the 24 observations
Bootstrapping Results
R2 Data
Mean: 0.8046
STDEV: 0.1131
Conf 95%: 0.0453 or 75.93 - 84.98%
Conf 99%: 0.0595 or 74.51 - 86.41%
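These summary figures can be reproduced from the 24 adjusted R2 values listed in the results. The sketch below assumes the margins were computed as z * s / sqrt(n) using the normal distribution, which is consistent with the reported numbers. The variable and function names are chosen for this example.

```python
import statistics
from math import sqrt

# The 24 adjusted R^2 values reported for the bootstrap samples.
adjusted_r2 = [
    0.7351, 0.7545, 0.7438, 0.7968, 0.5164, 0.6449, 0.9951, 0.9253,
    0.8144, 0.7631, 0.8257, 0.9099, 0.7482, 0.8719, 0.7391, 0.9025,
    0.8634, 0.7927, 0.6797, 0.6765, 0.8226, 0.9902, 0.8812, 0.9169,
]

n = len(adjusted_r2)
mean = statistics.fmean(adjusted_r2)
stdev = statistics.stdev(adjusted_r2)  # sample standard deviation

def margin(confidence):
    """Normal-based margin of error for the mean at the given confidence level."""
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * stdev / sqrt(n)

print(f"mean = {mean:.4f}, stdev = {stdev:.4f}")
for confidence in (0.95, 0.99):
    m = margin(confidence)
    print(
        f"{confidence:.0%} margin = {m:.4f}, "
        f"interval = {100 * (mean - m):.2f}% to {100 * (mean + m):.2f}%"
    )
```

The intervals describe the uncertainty in the average adjusted R2 across the 24 bootstrap regressions, not the range of the individual values, which run from 0.5164 to 0.9951.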
So what does this mean for the results of the
regression?
Can we rely on this model to help predict the
number of points per game that will be scored by
the 2006 team?
Questions?