Introduction to Hypothesis Testing
Regression and Correlation
GTECH 201
Lecture 18
ANOVA
Analysis of Variance
Continuation of the matched-pairs difference-of-means tests, but now for 3+ groups
We still check whether samples come from
one or more distinct populations
Variance is a descriptive parameter
ANOVA compares group means and tests whether they differ sufficiently to reject H0
ANOVA H0 and HA
$H_0: \mu_1 = \mu_2 = \dots = \mu_k$ (all group means are equal)
$H_A$: at least one group mean differs
ANOVA Test Statistic
$$F = \frac{MSB}{MSW}$$
MSB = between-group mean squares
MSW = within-group mean squares
Between-group variability is calculated
in three steps:
1. Calculate overall mean as weighted average of
sample means
2. Calculate between-group sum of squares
3. Calculate between-group mean squares (MSB)
Between-group Variability
1. Total or overall mean
$$\bar{X}_T = \frac{\sum_{i=1}^{k} n_i \bar{X}_i}{N}$$
2. Between-group sum of squares
$$SSB = \sum_{i=1}^{k} n_i \left(\bar{X}_i - \bar{X}_T\right)^2 = \sum_{i=1}^{k} n_i \bar{X}_i^2 - N \bar{X}_T^2$$
3. Between-group mean squares
$$MSB = \frac{SSB}{df_B} = \frac{SSB}{k - 1}$$
Within-group Variability
1. Within-group sum of squares
$$SSW = \sum_{i=1}^{k} (n_i - 1)\, s_i^2$$
2. Within-group mean squares
$$MSW = \frac{SSW}{df_W} = \frac{SSW}{N - k}$$
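To make these formulas concrete, here is a minimal Python sketch that computes F from per-group summary statistics; the function name anova_from_summary and the use of scipy.stats.f for the p-value are illustrative choices, not part of the lecture.

```python
import numpy as np
from scipy import stats

def anova_from_summary(ns, means, sds):
    """One-way ANOVA F test from per-group sizes, means, and standard deviations."""
    ns, means, sds = map(np.asarray, (ns, means, sds))
    k = len(ns)                                   # number of groups
    N = ns.sum()                                  # total number of observations
    grand_mean = (ns * means).sum() / N           # overall mean (weighted average)
    ssb = (ns * (means - grand_mean) ** 2).sum()  # between-group sum of squares
    ssw = ((ns - 1) * sds ** 2).sum()             # within-group sum of squares
    msb = ssb / (k - 1)                           # between-group mean squares
    msw = ssw / (N - k)                           # within-group mean squares
    f = msb / msw
    p = stats.f.sf(f, k - 1, N - k)               # upper-tail p-value of the F statistic
    return f, p
```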
Kruskal-Wallis Test
Nonparametric equivalent of ANOVA
Extension of Wilcoxon rank sum W test to
3+ cases
The average rank in sample i is $\bar{R}_i = R_i / n_i$
Then the Kruskal-Wallis H test statistic is
$$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N + 1)$$
with $N = n_1 + n_2 + \dots + n_k$ = total number of observations, and
$R_i$ = sum of ranks in sample i
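A short Python sketch of this statistic, assuming no tied observations; the helper name kruskal_wallis_h is made up for illustration, and scipy.stats.kruskal (which applies a tie correction) can give slightly different values when ties are present.

```python
import numpy as np
from scipy import stats

def kruskal_wallis_h(*samples):
    """Kruskal-Wallis H computed directly from the formula above (no tie correction)."""
    pooled = np.concatenate([np.asarray(s, float) for s in samples])
    ranks = stats.rankdata(pooled)                # rank all N observations together
    N = len(pooled)
    total, start = 0.0, 0
    for sample in samples:
        n_i = len(sample)
        r_i = ranks[start:start + n_i].sum()      # R_i = sum of ranks in sample i
        total += r_i ** 2 / n_i
        start += n_i
    return 12.0 / (N * (N + 1)) * total - 3 * (N + 1)
```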
ANOVA Example
House prices by neighborhood, in $1,000s:
A: 175  147  138  156  184  148
B: 151  183  174  181  193  205  196
C: 127  142  124  150  180
D: 174  182  210  191
ANOVA Example, continued
Sample statistics:
        A        B        C        D        Total
n       6        7        5        4        22
X̄      158.00   183.29   144.60   189.25   168.68
s       17.83    17.61    22.49    15.48    24.85
Now fill in the six steps of the ANOVA calculation
The Six Steps
1. Overall mean:
$$\bar{X}_T = \frac{\sum_{i=1}^{k} n_i \bar{X}_i}{N} = \frac{6(158.00) + 7(183.29) + 5(144.60) + 4(189.25)}{22} = 168.68$$
2. Between-group sum of squares:
$$SSB = \sum_{i=1}^{k} n_i \bar{X}_i^2 - N \bar{X}_T^2 = 6(158.00)^2 + 7(183.29)^2 + 5(144.60)^2 + 4(189.25)^2 - 22(168.68)^2 = 6769.394$$
3. Between-group mean squares:
$$MSB = \frac{SSB}{df_B} = \frac{SSB}{k - 1} = \frac{6769.394}{3} = 2256.465$$
4. Within-group sum of squares:
$$SSW = \sum_{i=1}^{k} (n_i - 1)\, s_i^2 = 5(17.83)^2 + 6(17.61)^2 + 4(22.49)^2 + 3(15.48)^2 = 6193.379$$
5. Within-group mean squares:
$$MSW = \frac{SSW}{df_W} = \frac{SSW}{N - k} = \frac{6193.379}{22 - 4} = 344.077$$
6. Test statistic:
$$F = \frac{MSB}{MSW} = \frac{2256.465}{344.077} = 6.558, \qquad p = .003$$
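The same result can be checked against the raw prices (reading the table above column by column); scipy.stats.f_oneway is one standard way to do this, though it was not part of the lecture.

```python
from scipy import stats

a = [175, 147, 138, 156, 184, 148]          # neighborhood A
b = [151, 183, 174, 181, 193, 205, 196]     # neighborhood B
c = [127, 142, 124, 150, 180]               # neighborhood C
d = [174, 182, 210, 191]                    # neighborhood D

f, p = stats.f_oneway(a, b, c, d)           # one-way ANOVA across the four samples
print(round(f, 3), round(p, 3))             # should reproduce F = 6.558, p = .003
```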
Correlation
Co-relatedness between 2+ variables
As the values of one variable go up,
those of the other change proportionally
Two step approach:
1. Graphically - scatterplot
2. Numerically – correlation coefficients
Is There a Correlation?
Scatterplots
Exploratory analysis
Pearson’s Correlation Index
Based on concept of covariance
$$CV_{XY} = \sum (X - \bar{X})(Y - \bar{Y})$$
$CV_{XY}$ = covariation between X and Y
$(X - \bar{X})$ = deviation of X from its mean
$(Y - \bar{Y})$ = deviation of Y from its mean
Pearson’s correlation coefficient
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y}) \,/\, N}{S_X S_Y}$$
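As a sketch, the same computation in Python; the helper name pearson_r is illustrative, and in practice np.corrcoef or scipy.stats.pearsonr returns the same value.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r from the covariation formula above (population, N-based, std devs)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    cv_xy = ((x - x.mean()) * (y - y.mean())).sum()   # covariation of X and Y
    return (cv_xy / n) / (x.std() * y.std())          # divide by S_X * S_Y
```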
Sample and Population
r is the sample correlation coefficient
Applying the t distribution, we can infer
the correlation for the whole population
Test statistic for Pearson’s r
$$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$$
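A minimal sketch of that inference step in Python, assuming the two-tailed p-value is taken from the t distribution with n - 2 degrees of freedom; the function name is made up here.

```python
import numpy as np
from scipy import stats

def pearson_t_test(r, n):
    """t statistic and two-tailed p-value for H0: no correlation in the population."""
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)     # two tails of the t distribution, df = n - 2
    return t, p
```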
Correlation Example
Lake effect snow
Spearman’s Rank Correlation
Non-parametric alternative to Pearson
Logic similar to the Kruskal-Wallis and Wilcoxon tests
Spearman’s rank correlation coefficient
$$r_s = 1 - \frac{6 \sum d^2}{N^3 - N}$$
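A minimal Python sketch, where d is the difference between the paired ranks of X and Y; this assumes no tied ranks (scipy.stats.spearmanr handles ties), and the function name is illustrative.

```python
import numpy as np
from scipy import stats

def spearman_rs(x, y):
    """Spearman's r_s from squared rank differences (assumes no tied ranks)."""
    d = stats.rankdata(x) - stats.rankdata(y)   # d = difference in paired ranks
    n = len(x)
    return 1 - 6 * (d ** 2).sum() / (n ** 3 - n)
```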
Regression
In correlation we observe degrees of
association but no causal or functional
relationship
In regression analysis, we distinguish an
independent from a dependent variable
Many forms of functional relationships
bivariate
linear
multivariate
non-linear (curvi-linear)
Graphical Representation
In correlation analysis either variable could
be depicted on either axis
In regression analysis, the independent
variable is always on the X axis
Bivariate relationship is described by a
best-fitting line through the scatterplot
Least-Squares Regression
Objective: minimize
$$\sum d_i^2$$
the squared deviations of the observed points from the line $\hat{Y} = a + bX$
Regression Equation
Y = a + bX
$$b = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - \left(\sum X\right)^2}$$
$$a = \frac{\sum Y - b \sum X}{n}$$
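These two formulas translate directly into a small Python sketch; least_squares_line is an illustrative name, not from the lecture.

```python
import numpy as np

def least_squares_line(x, y):
    """Slope b and intercept a of the best-fitting line Y = a + bX."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)
    a = (y.sum() - b * x.sum()) / n
    return a, b
```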
Strength of Relationship
How much is explained by the regression
equation?
Coefficient of Determination
Total variation of Y (all the bucket water):
$$\sum y^2 = \sum (Y - \bar{Y})^2$$
Large 'Y' = dependent variable
Small 'y' = deviation of each value of Y from its mean
The total variation splits into explained and unexplained parts:
$$\sum y^2 = \sum y_e^2 + \sum y_u^2$$
e = explained; u = unexplained
Explained Variation
Ratio of square of covariation between X
and Y to the variation in X
$$\sum y_e^2 = \frac{S_{xy}^2}{S_x^2}$$
where $S_{xy}$ = covariation between X and Y
$S_x^2$ = total variation of X
Coefficient of determination:
$$r^2 = \frac{\sum y_e^2}{\sum y^2}$$
Error Analysis
$r^2$ tells us what percentage of the variation in Y
is accounted for by the independent
variable
This then allows us to infer the standard
error of our estimate
$$SE = \sqrt{\frac{\sum y_u^2}{n - 2}} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}}$$
which tells us, on average, how far off our
prediction would be in measurement units
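Putting the last two ideas together, here is a minimal sketch that computes both $r^2$ and the standard error of the estimate from a fitted line; the function and variable names are illustrative assumptions, not from the lecture.

```python
import numpy as np

def r_squared_and_se(x, y, a, b):
    """Coefficient of determination and standard error of the estimate for Y = a + bX."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    y_hat = a + b * x                               # predicted Y values
    explained = ((y_hat - y.mean()) ** 2).sum()     # sum of y_e^2
    unexplained = ((y - y_hat) ** 2).sum()          # sum of y_u^2 (residuals)
    total = ((y - y.mean()) ** 2).sum()             # sum of y^2
    r2 = explained / total                          # share of variation explained
    se = np.sqrt(unexplained / (len(x) - 2))        # average prediction error, in Y's units
    return r2, se
```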