Nonparametric Methods Featuring the Bootstrap
Jon Atwood
November 12, 2013
Laboratory for Interdisciplinary Statistical Analysis

LISA helps VT researchers benefit from the use of statistics:
Experimental Design • Data Analysis • Interpreting Results
Grant Proposals • Software (R, SAS, JMP, SPSS...)

Collaboration: from our website, request a meeting for personalized statistical advice. Coauthor not required, but encouraged.

Walk-In Consulting: Monday–Friday, 1–3 PM, GLC (Monday–Thursday during the summer).

Great advice right now: meet with LISA before collecting your data.

Short Courses: designed to help graduate students apply statistics in their research.

All services are FREE for VT researchers. We assist with research, not class projects or homework.
www.lisa.stat.vt.edu
Short Course Goals
By the end of this course, you will know…
• The fundamental differences between nonparametric and parametric methods
• In what general situations nonparametric methods would be advantageous
• How to implement nonparametric alternatives to t-tests, ANOVAs, and simple linear regression
• What nonparametric bootstrapping is and when you can use it effectively
What we will do today…
• Give a brief overview of nonparametric statistical methods
• Take a look at real-world data sets! (and some non-real-world data sets)
• Implement the following methods in R…
  - Wilcoxon rank-sum and signed-rank tests (alternatives to t-tests)
  - Kruskal-Wallis test (alternative to one-way ANOVA)
  - Spearman correlation (alternative to Pearson correlation)
  - Nonparametric bootstrapping
• ?Bonus Topic?
What does nonparametric mean?
Well, first of all, what does parametric mean?
Parametric is a word that can have several meanings.
In statistics, parametric usually means some specific probability distribution is assumed:
• Normal (regression, ANOVA, etc.)
• Exponential (survival)
It may also involve assumptions about parameters (such as the variance).
And now a closer look…

Regression Analysis
We want to see the relationship between a bunch of predictor variables (x1, x2, …, xp) and a response variable.
For example, suppose we wanted to see the relationship between weight and blood pressure.
A usual simple linear regression model might look something like this:
Yi = β0 + β1xi + εi, with error term εi ~ N(0, σ²).
[Figure: Simple Linear Regression Plot]
The error terms are assumed to come from a normal distribution with mean 0 and variance σ².
The usual methods for testing the significance of our coefficient estimates are based directly on this assumption of normality.
If the assumption of normality is false, these methods are invalid.
So onto nonparametric…
A statistical method is nonparametric if it does not require most of these assumptions:
• Data are not assumed to follow some underlying distribution.
• In some cases, it means that variables do not take predetermined forms (nonparametric regression, for example).
Assumptions nonparametric methods do make:
• Randomness
• Independence
• In multi-sample tests, distributions are of the same shape
Nonparametric Methods

Advantages:
• Free from distributional assumptions about the data
• Easy to interpret
• Usually computationally straightforward

Disadvantages:
• Loss of power when data does follow the usual assumptions
• Reduces the data's information
• Larger sample sizes needed due to less efficiency
With that said…Rank Tests
Rank tests are a simple group of nonparametric tests.
Instead of using the actual numerical value of an observation, we use its rank, its position on the number line relative to the other observations.
As an example…
Here's some data. Basically, these are just numbers I picked out of the sky. Any ideas on what we should call it? Wigs in a wig shop?
First, ranking when there are no ties:

Y Value   Ascending Value   Rank
32        1                 1
45        2                 2
54        3                 3
64        4                 4
311       5                 5
2000      6                 6

What about ties? Their ascending values are added together and divided by the total number of ties:

Y Value   Ascending Value   Rank
32        1                 1
64        2                 (2+3+4)/3 = 3
64        3                 (2+3+4)/3 = 3
64        4                 (2+3+4)/3 = 3
311       5                 5
2000      6                 6
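As a quick check in R (my own illustration, not from the original slides), the built-in rank() function handles ties exactly as in the table above:

```r
# rank() assigns tied values the average of the ascending positions they occupy,
# matching the (2+3+4)/3 = 3 calculation above.
y <- c(32, 64, 64, 64, 311, 2000)
rank(y)   # 1 3 3 3 5 6
```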
Wilcoxon rank tests
These provide alternatives to the standard t-tests, which test mean differences between two groups.
T-tests assume that the data are normally distributed.
They can be highly sensitive to outliers (we'll see that in an example soon), which may reduce power (the ability to detect significant differences).
Wilcoxon tests alleviate these problems by using ranks, not actual values.
First, the Wilcoxon Rank-Sum Test
Alternative to the independent sample t-test (recall
that the independent sample t-test compares two
independent samples, testing if the means are
equal)
The t-test assumes normality of the data, and equal
variances (though adjustments can be made for
unequal variances)
The Wilcoxon rank-sum test assumes that the two
samples follow continuous distributions, as well as
the usual randomness and independence
Let's try it out with some data!
The source of this data is…me, Jon Atwood. Data randomly generated from R:
• First group, 15 observations distributed normally, mean = 30000, variance = 2500, with one extra observation added.
• Second group, 16 observations distributed normally, mean = 40000, variance = 2500.
Group 1       Group 2
32897.08      40273
29383.8       35610.5
30539.39      42547.4
27448.5       41214.4
33712.17      40183
28166.07      36460.8
26613.3       45040.8
30105.7       39248.6
25786.81      33901.7
30761.04      40985.1
27427.12      42412
29060.02      38247.2
25593.89      36384.2
29815.82      41849.5
32453.97      38222.8
1000000       39739.4
So what is this test…testing?
Remember, for the t-test, we are testing
H0: μ1 = μ2
vs.
Ha: μ1 ≠ μ2,
where μ1 is the mean of group 1 and μ2 is the mean of group 2.

In the Wilcoxon rank-sum test, we are testing
H0: F(G1) = F(G2)
vs.
Ha: F(G1) = F(G2 − α),
where F(G1) is the distribution of group 1, F(G2) is the distribution of group 2, and α is the "location shift".
What this really means is that F(G1) and F(G2) are basically the same, except that one is shifted to the right of the other by α.
In our problem…

Group 1
Value       Rank
25593.89    1
25786.81    2
26613.3     3
27427.12    4
27448.5     5
28166.07    6
29060.02    7
29383.8     8
29815.82    9
30105.7     10
30539.39    11
30761.04    12
32453.97    13
32897.08    14
33712.17    15
1000000     32
Rank sum: 152

Group 2
Value    Rank
33902    16
35611    17
36384    18
36461    19
38223    20
38247    21
39249    22
39739    23
40183    24
40273    25
40985    26
41214    27
41850    28
42412    29
42547    30
45041    31
Rank sum: 376
So, group 2 has a rank sum 224 higher than group 1.
The question is: what is the probability of observing a combination of rank sums in the two groups where the difference is greater than 224?
R will compute p-values using a normal approximation for sample sizes greater than 50.
So let’s do some R-Coding!
We will view graphs and results in
the R program we run (this applies to
all examples)
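A minimal sketch of that R session (not the original course script): the group sizes, means, and outlier follow the description above, and a standard deviation of 2500 for the simulated draws is my assumption.

```r
# Simulate the two groups described above, then compare a t-test with the
# Wilcoxon rank-sum test. The seed and sd = 2500 are assumptions for illustration.
set.seed(42)
group1 <- c(rnorm(15, mean = 30000, sd = 2500), 1000000)  # 15 draws plus the outlier
group2 <- rnorm(16, mean = 40000, sd = 2500)

boxplot(group1, group2, names = c("Group 1", "Group 2"))
t.test(group1, group2)        # heavily distorted by the 1,000,000 outlier
wilcox.test(group1, group2)   # rank-sum test: the outlier is just the largest rank
```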
Moving to a paired situation…
Suppose we have two samples, but they are paired
in some way
Suppose pairs of people are couples, or plants are
paired off into different plots, dogs are paired by
breeds, or whatever
Then, the rank-sum test will not be the optimal test.
Instead, we use the signed rank test
Signed Rank Test
The null hypothesis is that the medians of the two distributions are equal.
Procedure:
1. Calculate all differences between the two groups.
2. Take the absolute value of those differences.
3. Convert those to ranks (like before).
4. Attach a + or −, depending on if the difference is positive or negative.
5. Add these signed ranks together to get the test statistic W.
Example
This data comes from Laureysens et al. (2004), who,
in August and November, took 13 poplar tree
clones and measured aluminum levels
Below is a table with the raw data, ranks, and
differences
Clone            Aug    Nov    Abs Value Diff   Rank   Signed Rank
Balsam Spire     8.1    11.2   3.1              6       6
Beaupre          10     16.3   6.3              9       9
Hazendans        16.5   15.3   1.2              3      -3
Hoogvorst        13.6   15.6   2                4       4
Raspalje         9.5    10.5   1                2       2
Unal             8.3    15.5   7.2              10      10
Columbia River   18.3   12.7   5.6              8      -8
Fritzi Pauley    13.3   11.1   2.2              5      -5
Trichobel        7.9    19.9   12               11      11
Gaver            8.1    20.4   12.3             12      12
Gibecq           8.9    14.2   5.3              7       7
Primo            12.6   12.7   0.1              1       1
Wolterson        13.4   36.8   23.4             13      13

The sign attached to each rank is the sign of the November − August difference.
W = 6 + 9 − 3 + 4 + 2 + 10 − 8 − 5 + 11 + 12 + 7 + 1 + 13 = 59
How many more combinations could create a more extreme W?
R will give an exact p-value for sample sizes less than 50; otherwise, a normal approximation is used.
R Time!
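A minimal sketch of that R step, typing in the aluminum values from the table above (the variable names are my own):

```r
# Laureysens et al. (2004) aluminum concentrations for the 13 poplar clones,
# in the same clone order as the table above.
aug <- c(8.1, 10, 16.5, 13.6, 9.5, 8.3, 18.3, 13.3, 7.9, 8.1, 8.9, 12.6, 13.4)
nov <- c(11.2, 16.3, 15.3, 15.6, 10.5, 15.5, 12.7, 11.1, 19.9, 20.4, 14.2, 12.7, 36.8)

# Paired signed-rank test (nonparametric alternative to the paired t-test).
# Note that R reports V, the sum of the positive ranks, rather than the
# signed-rank sum W shown on the slide.
wilcox.test(nov, aug, paired = TRUE)
```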
More than 2 groups
Suppose we are comparing more than two groups
Normally (no pun intended), we would use an
analysis of variance, or ANOVA
But, of course, we need to assume normality for this
as well
Kruskal-Wallis
In this situation, we use the Kruskal-Wallis test.
Again, we are converting the actual numeric values to ranks, regardless of groups.
Ultimately, we compute the test statistic
H = [12 / (N(N + 1))] Σ (Ri² / ni) − 3(N + 1),
where N is the total number of observations, ni is the number of observations in group i, and Ri is the sum of the ranks in group i.
An example
We'll use the built-in R data set airquality.
These are daily air quality measurements taken in New York from May to September 1973 (Chambers et al., 1983).
The question is: does the air quality differ from month to month?
Break out the R!
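A minimal sketch of that analysis; the slides do not say which measurement is compared, so using Ozone is my assumption.

```r
# Kruskal-Wallis test on the built-in airquality data.
# Which variable the course actually used is not stated; Ozone is assumed here.
data(airquality)
boxplot(Ozone ~ Month, data = airquality, xlab = "Month", ylab = "Ozone (ppb)")
kruskal.test(Ozone ~ Month, data = airquality)
```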
Onto Part 2…
In this part of the course, we will do the following things:
1. Look at the Spearman correlation (alternative to the Pearson correlation) between an x variable and a y variable.
2. Examine nonparametric bootstrapping, and how it can help us when our data does not approximate normality.
Spearman correlation
Suppose you want to discover the association
between infant mortality and GDP by country
Here’s a 2003 scatterplot of the situation
Pearson correlation
This data comes from www.indexmundi.com
In this example, the Pearson correlation is about .63
Still significant, but perhaps underestimates the
monotone nature of the relationship between GDP
and infant mortality rate
In addition, the Pearson correlation assumes
linearity, which is clearly not present here
We can use the Spearman correlation instead
This is a correlation coefficient based on ranks, which are computed for the y variable and the x variable independently, with sample size n.
To calculate the coefficient, we do the following:
1. Take each xi and each yi and convert them into ranks (the ranks of x and y are computed independently of each other).
2. Subtract rxi from ryi to get di, the difference in ranks.
3. The formula is then
   rs = 1 − 6 Σ di² / [n(n² − 1)]
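A quick check of that formula in R on a small made-up sample (my own illustration); it agrees with the built-in Spearman option:

```r
# Verify the rank-difference formula against R's built-in Spearman correlation.
x <- c(2.1, 3.5, 4.8, 8.0, 15.2)   # hypothetical x values
y <- c(1.0, 2.2, 2.1, 5.5, 9.0)    # hypothetical y values
d <- rank(x) - rank(y)
n <- length(x)
1 - 6 * sum(d^2) / (n * (n^2 - 1))   # hand formula: 0.9
cor(x, y, method = "spearman")       # built-in Spearman: 0.9
```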
In this case, the hypotheses are
H0: Rs=0
vs.
Ha: Rs≠0
Basically, we are attempting to see whether or not
the two variables are independent, or if there is
evidence of an association
Let's return to the GDP data
We will now plot the data in R, and see how to get the Spearman correlation.
R tests this by using an exact p-value for small sample sizes, and an approximate t-distribution for larger ones. The test statistic in that case would follow a t-distribution with n − 2 degrees of freedom.
Turn to R now!!!
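A minimal sketch of those R calls; the indexmundi country data set is not reproduced here, so simulated data with a monotone but nonlinear relationship stand in for GDP and infant mortality.

```r
# Simulated stand-in data: y is a monotone but nonlinear function of x, so the
# Spearman correlation is typically closer to 1 than the Pearson correlation.
set.seed(1)
x <- rexp(150, rate = 1)             # skewed "GDP-like" variable
y <- log(x) + rnorm(150, sd = 0.3)   # monotone, nonlinear relationship

plot(x, y)
cor(x, y)                             # Pearson correlation
cor.test(x, y, method = "spearman")   # Spearman correlation and its test
```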
Nonparametric Bootstrapping
Suppose you are fitting a multiple linear regression
model
Yi=β0+β1x1i+β2x2i+…+βkxki+εi
Recall that εi ~ N(0, σ²).
But what if we have reason to suppose this
assumption is not met?
Then, the regular way of testing the significance of
our coefficients is invalid
So what do we do?
Depending on the situation, there are several options.
The one we will talk about today is called
nonparametric bootstrapping
Bootstrapping is a resampling method, where we
take the data and randomly sample from it to draw
inferences
There are several types of bootstrapping. We will
focus on the simpler, nonparametric type
How do we do nonparametric bootstrapping?
• We assign each of the n observations a 1/n probability of being selected.
• We then take a random sample with replacement, usually of size n.
• We compute estimated coefficients based on this sample.
• We repeat the process many times, say 10,000, or 100,000, or 100,000,000,000,000…
For example, in regression
Suppose we want to test whether or not β1 = 0, using our newly generated sample of (10,000, or however many) bootstrap β1 coefficients.
For a 95 percent confidence interval, we would look at the 250th observation (the 250th lowest) and the 9,750th observation (the 250th highest).
If this interval does not contain 0, we conclude that there is evidence that β1 is not equal to 0.
Possible issue?
Theoretically, this method comes out of the idea that the distribution of a population is approximated by the distribution of its sample.
This assumption becomes less and less valid the smaller your sample size.
Example
This data was taken from The Practice of
Econometrics by ER Berndt (1991)
We will be regressing wage on education and
experience
We will look for whether graphs tell us the
residuals are approximately normal or not
R-Code Now!
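A minimal sketch of the case-resampling (nonparametric) bootstrap for regression coefficients. The Berndt (1991) wage data are not reproduced here, so a small simulated data frame stands in, with assumed column names wage, education, and experience.

```r
# Simulated stand-in for the wage data, with deliberately non-normal errors.
set.seed(123)
n <- 200
wages <- data.frame(education  = sample(8:18, n, replace = TRUE),
                    experience = sample(0:40, n, replace = TRUE))
wages$wage <- 2 + 0.8 * wages$education + 0.1 * wages$experience + rexp(n, rate = 1/3)

# Graphical check of the residual-normality assumption for the ordinary fit.
fit <- lm(wage ~ education + experience, data = wages)
hist(resid(fit))
qqnorm(resid(fit)); qqline(resid(fit))

# Nonparametric bootstrap: resample rows with replacement, refit, keep coefficients.
B <- 10000   # reduce for a quicker run
boot_coefs <- replicate(B, {
  idx <- sample(nrow(wages), replace = TRUE)
  coef(lm(wage ~ education + experience, data = wages[idx, ]))
})

# 95% percentile intervals (the 2.5% and 97.5% quantiles of each coefficient);
# a coefficient is "significant" if its interval excludes 0.
apply(boot_coefs, 1, quantile, probs = c(0.025, 0.975))
```

The boot package's boot() and boot.ci() functions wrap this same resampling loop and offer additional interval types.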
In Summary
We now understand that certain parametric
methods, like t-tests and regression, depend on
assumptions that may or may not be met
We know that nonparametric methods are methods
that do not make distributional assumptions, and
therefore are applicable when data do not meet
these assumptions
We can implement these methods in R, in case our data does not meet these assumptions.
Bonus Topic!
Indicator Variables in Regression
Used when we have categorical variables as predictors in regression.
As a start, let's re-examine the wage data. Here, we will drop education and just look at experience and sex.
The full model in a case like this would look like
Wi = β0 + β1Ei + β2Si + β3SiEi + εi,
where…
W = wage
E = experience
S = 1 if sex = "male", 0 otherwise
ε is the error term
Separate regressions for different sexes
For women, the reference group,
Wia=β0a+β1aEi+εia
For men, the non-reference group,
Wib=β0b+β1bEi+εib
With that in mind…
Let's return to the full model: Wi = β0 + β1Ei + β2Si + β3SiEi + εi.
Suppose we want to see how women do in this model. Well, we can set S = 0.
We are left with Wi = β0 + β1Ei + εi.
But, since this is for women, this is equivalent to the women-only model, Wia = β0a + β1aEi + εia.
Thus…
β0 = β0a
β1 = β1a
Now, look at "male". We set S = 1:
Wi = β0 + β1Ei + β2(1) + β3(1)Ei + εi
   = β0 + β1Ei + β2 + β3Ei + εi
   = β0a + β1aEi + β2 + β3Ei + εi
   = (β0a + β2) + (β1a + β3)Ei + εi
   = β0b + β1bEi + εib
So…
β2=β0b- β0a
β3=β1b- β1a
Example in R!
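A minimal sketch of that model in R, reusing the simulated wages data frame from the bootstrap sketch above and adding an assumed sex column (the real wage data are not reproduced here):

```r
# Add a sex variable to the simulated stand-in data.
wages$sex <- factor(sample(c("female", "male"), nrow(wages), replace = TRUE))

# lm() builds the 0/1 indicator and the interaction from the factor automatically;
# "female" (the first level alphabetically) is the reference group.
fit <- lm(wage ~ experience * sex, data = wages)
summary(fit)

# In the output, the "sexmale" coefficient is beta2 (the intercept shift) and
# "experience:sexmale" is beta3 (the slope shift) relative to the female group.
```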
Thank you!
References
Berndt, E. R. (1991). The Practice of Econometrics. New York: Addison-Wesley.
Chambers, J. M., Cleveland, W. S., Kleiner, B., and Tukey, P. A. (1983). Graphical Methods for Data Analysis. Belmont, CA: Wadsworth.
Laureysens, I., Blust, R., De Temmerman, L., Lemmens, C., and Ceulemans, R. (2004). Clonal variation in heavy metal accumulation and biomass production in a poplar coppice culture. I. Seasonal variation in leaf, wood and bark concentrations. Environmental Pollution 131: 485-494.