Introduction to Hypothesis Testing

Regression and Correlation
GTECH 201
Lecture 18
ANOVA

Analysis of Variance
Continuation of the matched-pairs difference-of-means tests, but now for 3+ groups
We still test whether the samples come from one population or from several distinct populations
Variance is a descriptive parameter
ANOVA compares group means and tests whether they differ enough to reject H0
ANOVA H0 and HA
$H_0: \mu_1 = \mu_2 = \dots = \mu_k$ (all group means are equal)
$H_A:$ at least one group mean differs from the others
ANOVA Test Statistic

$F = \frac{MSB}{MSW}$

MSB = between-group mean squares
MSW = within-group mean squares
Between-group variability is calculated in three steps:
1. Calculate the overall mean as a weighted average of the sample means
2. Calculate the between-group sum of squares
3. Calculate the between-group mean squares (MSB)
Between-group Variability
1. Total or overall mean

$\bar{X}_T = \frac{\sum_{i=1}^{k} n_i \bar{X}_i}{N}$

2. Between-group sum of squares

$SSB = \sum_{i=1}^{k} n_i \left( \bar{X}_i - \bar{X}_T \right)^2 = \left( \sum_{i=1}^{k} n_i \bar{X}_i^2 \right) - N \bar{X}_T^2$

3. Between-group mean squares

$MSB = \frac{SSB}{df_B} = \frac{SSB}{k - 1}$
Within-group Variability
1. Within-group sum of squares

$SSW = \sum_{i=1}^{k} (n_i - 1) s_i^2$

2. Within-group mean squares

$MSW = \frac{SSW}{df_W} = \frac{SSW}{N - k}$
Kruskal-Wallis Test
Nonparametric equivalent of ANOVA
Extension of the Wilcoxon rank sum W test to 3+ cases
Average rank is $R_i / n_i$
Then the Kruskal-Wallis H test statistic is

$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1)$

with $N = n_1 + n_2 + \dots + n_k$ = total number of observations, and
$R_i$ = sum of ranks in sample i
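
A minimal Python sketch of H, cross-checked against scipy.stats.kruskal (SciPy assumed; the samples are invented and tie-free):

    import numpy as np
    from scipy import stats

    samples = [np.array([27, 2, 4, 18, 7]),
               np.array([20, 8, 14, 36, 21]),
               np.array([34, 31, 3, 23, 30])]

    pooled = np.concatenate(samples)
    N = len(pooled)
    ranks = stats.rankdata(pooled)           # ranks 1..N over the pooled data

    H = 0.0
    start = 0
    for s in samples:
        R_i = ranks[start:start + len(s)].sum()   # rank sum for sample i
        H += R_i ** 2 / len(s)
        start += len(s)
    H = 12.0 / (N * (N + 1)) * H - 3 * (N + 1)

    print(H)
    print(stats.kruskal(*samples))           # same H when there are no ties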
ANOVA Example
House prices by neighborhood, in $1,000s:

      A     B     C     D
     175   151   127   174
     147   183   142   182
     138   174   124   210
     156   181   150   191
     184   193   180
     148   205
           196
ANOVA Example, continued
Sample statistics:

           A       B       C       D    Total
    n      6       7       5       4      22
    X̄   158.00  183.29  144.60  189.25  168.68
    s    17.83   17.61   22.49   15.48   24.85
Now fill in the six steps of the ANOVA calculation
The Six Steps
1. $\bar{X}_T = \frac{\sum_{i=1}^{k} n_i \bar{X}_i}{N} = \frac{6(158.00) + 7(183.29) + 5(144.60) + 4(189.25)}{22} = 168.68$

2. $SSB = \left( \sum_{i=1}^{k} n_i \bar{X}_i^2 \right) - N \bar{X}_T^2 = 6(158.00)^2 + 7(183.29)^2 + 5(144.60)^2 + 4(189.25)^2 - 22(168.68)^2 = 6769.394$

3. $MSB = \frac{SSB}{df_B} = \frac{SSB}{k - 1} = \frac{6769.394}{3} = 2256.465$

4. $SSW = \sum_{i=1}^{k} (n_i - 1) s_i^2 = 5(17.83)^2 + 6(17.61)^2 + 4(22.49)^2 + 3(15.48)^2 = 6193.379$

5. $MSW = \frac{SSW}{df_W} = \frac{SSW}{N - k} = \frac{6193.379}{22 - 4} = 344.077$

6. $F = \frac{MSB}{MSW} = \frac{2256.465}{344.077} = 6.558 \quad (p = .003)$
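
The same six steps in a minimal Python sketch, with scipy.stats.f_oneway as a cross-check (NumPy and SciPy assumed; the data are the house prices above):

    import numpy as np
    from scipy import stats

    A = [175, 147, 138, 156, 184, 148]
    B = [151, 183, 174, 181, 193, 205, 196]
    C = [127, 142, 124, 150, 180]
    D = [174, 182, 210, 191]
    groups = [np.array(g, dtype=float) for g in (A, B, C, D)]

    n = np.array([len(g) for g in groups])
    N, k = n.sum(), len(groups)
    means = np.array([g.mean() for g in groups])

    x_bar_T = (n * means).sum() / N                          # step 1: 168.68
    SSB = (n * means ** 2).sum() - N * x_bar_T ** 2          # step 2: ~6769.4
    MSB = SSB / (k - 1)                                      # step 3: ~2256.5
    SSW = sum((len(g) - 1) * g.var(ddof=1) for g in groups)  # step 4: ~6193.4
    MSW = SSW / (N - k)                                      # step 5: ~344.1
    F = MSB / MSW                                            # step 6: ~6.558

    print(F)
    print(stats.f_oneway(*groups))   # same F, with p of about .003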
Correlation
Co-relatedness between 2+ variables
As the values of one variable go up, those of the other change proportionally
Two-step approach:
1. Graphically – scatterplot
2. Numerically – correlation coefficients
Is There a Correlation?
Scatterplots as exploratory analysis
[scatterplot examples]
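
A minimal matplotlib sketch of the graphical step (the x and y values are invented):

    import matplotlib.pyplot as plt

    x = [1, 2, 3, 4, 5, 6]
    y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

    plt.scatter(x, y)                     # one point per (x, y) pair
    plt.xlabel("X")
    plt.ylabel("Y")
    plt.title("Is there a correlation?")
    plt.show()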
Pearson’s Correlation Index
Based on the concept of covariance

$CV_{XY} = \sum (X - \bar{X})(Y - \bar{Y})$

$CV_{XY}$ = covariation between X and Y
$(X - \bar{X})$ = deviation of X from its mean
$(Y - \bar{Y})$ = deviation of Y from its mean

Pearson’s correlation coefficient

$r = \frac{\sum (X - \bar{X})(Y - \bar{Y}) / N}{S_X S_Y}$
Sample and Population
r is the sample correlation coefficient
Applying the t distribution, we can infer the correlation for the whole population
Test statistic for Pearson’s r

$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$
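
A minimal Python sketch of r and its t test (SciPy assumed; the data are invented; the division by N in the covariation cancels against the standard deviations, so raw sums work as well):

    import math
    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
    n = len(x)

    cv_xy = ((x - x.mean()) * (y - y.mean())).sum()   # covariation CV_XY
    r = cv_xy / math.sqrt(((x - x.mean()) ** 2).sum()
                          * ((y - y.mean()) ** 2).sum())

    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # test statistic for r
    p = 2 * stats.t.sf(abs(t), df=n - 2)              # two-tailed p-value

    print(r, t, p)
    print(stats.pearsonr(x, y))                       # cross-check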
Correlation Example

Lake effect snow
Spearman’s Rank Correlation
Non-parametric alternative to Pearson’s r
Logic similar to the Kruskal-Wallis and Wilcoxon tests
Spearman’s rank correlation coefficient

$r_s = 1 - \frac{6 \sum d^2}{N^3 - N}$
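
A minimal Python sketch of r_s from rank differences, checked against scipy.stats.spearmanr (SciPy assumed; the data are invented and tie-free, as the simple formula requires):

    import numpy as np
    from scipy import stats

    x = np.array([3.0, 1.0, 4.0, 1.5, 5.0, 9.0])
    y = np.array([2.0, 0.5, 5.0, 1.0, 4.5, 8.0])
    N = len(x)

    d = stats.rankdata(x) - stats.rankdata(y)   # rank differences d
    r_s = 1 - 6 * (d ** 2).sum() / (N ** 3 - N)

    print(r_s)
    print(stats.spearmanr(x, y))                # should agree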
Regression
In correlation we observe degrees of association, but no causal or functional relationship
In regression analysis, we distinguish an independent from a dependent variable
Many forms of functional relationships:
bivariate vs. multivariate
linear vs. non-linear (curvilinear)
Graphical Representation
In correlation analysis, either variable could be depicted on either axis
In regression analysis, the independent variable is always on the X axis
The bivariate relationship is described by a best-fitting line through the scatterplot
Least-Squares Regression
Objective: minimize $\sum d_i^2$, the sum of squared deviations from the line

$Y = a + bX$
Regression Equation

$Y = a + bX$

$b = \frac{n \sum XY - (\sum X)(\sum Y)}{n \sum X^2 - (\sum X)^2}$

$a = \frac{\sum Y - b \sum X}{n}$
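
A minimal Python sketch of b and a from these formulas, checked against numpy.polyfit (NumPy assumed; the data are invented):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
    n = len(x)

    b = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)
    a = (y.sum() - b * x.sum()) / n       # equivalently y.mean() - b * x.mean()

    print(b, a)
    print(np.polyfit(x, y, 1))            # [slope, intercept]; should agree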
Strength of Relationship
How much is explained by the regression equation?
Coefficient of Determination
Total variation of Y (all the bucket water):

$\sum y^2 = \sum (Y - \bar{Y})^2$

Large ‘Y’ = dependent variable
Small ‘y’ = deviation of each value of Y from its mean

The total variation splits into an explained and an unexplained part:

$\sum y^2 = \sum y_e^2 + \sum y_u^2$

e = explained; u = unexplained
Explained Variation
Ratio of the square of the covariation between X and Y to the variation in X:

$\sum y_e^2 = \frac{S_{xy}^2}{S_x^2}$

where $S_{xy}$ = covariation between X and Y
$S_x^2$ = total variation of X

Coefficient of determination:

$r^2 = \frac{\sum y_e^2}{\sum y^2}$

Error Analysis
$r^2$ tells us what percentage of the variation is accounted for by the independent variable
This then allows us to infer the standard error of our estimate:

$SE = \sqrt{\frac{\sum y_u^2}{n - 2}}$

which tells us, on average, how far off our prediction would be in measurement units
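
A minimal Python sketch of the variation partition, $r^2$, and SE (NumPy assumed; the data are invented):

    import math
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
    n = len(x)

    b, a = np.polyfit(x, y, 1)                # least-squares line Y = a + bX
    y_hat = a + b * x                         # predicted values

    total = ((y - y.mean()) ** 2).sum()       # sum y^2   (total variation)
    unexplained = ((y - y_hat) ** 2).sum()    # sum y_u^2 (residual variation)
    explained = total - unexplained           # sum y_e^2

    r2 = explained / total                    # coefficient of determination
    SE = math.sqrt(unexplained / (n - 2))     # standard error of the estimate

    print(r2, SE)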