Slides for Session #24


Statistics for Social
and Behavioral Sciences
Part IV: Causality
Multivariate Regression
R squared, F test, Chapter 11
Prof. Amine Ouazad
Data: Variables
• y: Box = First-run U.S. box office ($)
• x1: MPRating = 1 if the movie is PG-13 or R, 0 if the movie is G or PG
• x2: Budget = Production budget ($Mil)
• x3: Starpowr = Index of star power
• x4: Sequel = 1 if the movie is a sequel, 0 if not
• x5: Action = 1 if action film, 0 if not
• x6: Comedy = 1 if comedy film, 0 if not
• x7: Animated = 1 if animated film, 0 if not
• x8: Horror = 1 if horror film, 0 if not
• x9: Addict = Trailer views at traileraddict.com
• x10: Cmngsoon = Message board comments at comingsoon.net
• x11: Fandango = Attention at fandango.com
• x12: Cntwait3 = Percentage of Fandango votes that can't wait to see the movie
Statistics Course Outline
PART I. INTRODUCTION AND RESEARCH DESIGN
Week 1
Four Steps of “Thinking Like a Statistician”
Study Design: Simple Random Sampling, Cluster Sampling, Stratified Sampling
Biases: Nonresponse bias, Response bias, Sampling bias
PART II. DESCRIBING DATA
Weeks 2-4
Sample statistics: Mean, Median, SD, Variance, Percentiles, IQR, Empirical Rule
Bivariate sample statistics: Correlation, Slope
PART III. DRAWING CONCLUSIONS FROM DATA:
INFERENTIAL STATISTICS
Weeks 5-9
Estimating a parameter using sample statistics. Confidence Interval at 90%, 95%, 99%
Testing a hypothesis using the CI method and the t method.
Weeks 10-14
PART IV: CORRELATION AND CAUSATION:
TWO GROUPS, REGRESSION ANALYSIS
Now: Multivariate regression, R squared, F stat
Coming up:
• “Comparison of Two Groups”
Last week.
• “Univariate Regression Analysis”
Last Saturday, Section 9.5.
• “Association and Causality: Multivariate Regression”
Last Saturday, Chapter 10.
Yesterday, Today, Chapter 11 – R Squared, F test.
• “Randomized Experiments and ANOVA”.
Wednesday. Chapter 12.
• “Robustness Checks and Wrap Up”.
This Thursday.
Multivariate Regression
• Instead, we estimate b1, b2, …, bK on the sample:
– by minimizing the sum of the squared prediction errors, Σi (yi − ŷi)².
• With these estimates we can predict the success of a movie:
ŷ = b0 + b1 x1 + b2 x2 + … + bK xK
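As an illustration, here is a minimal sketch of how such a regression could be estimated in Python with statsmodels. The file name movies.csv and the lowercase column names are assumptions based on the variable list above, not part of the original slides.

import pandas as pd
import statsmodels.api as sm

movies = pd.read_csv("movies.csv")   # hypothetical file: one row per movie, 13 columns as above
y = movies["box_mil"]                # y: first-run U.S. box office
X = movies[["mprating", "budget", "starpowr", "sequel", "action", "comedy",
            "animated", "horror", "addict", "cmngsoon", "fandango", "cntwait3"]]
X = sm.add_constant(X)               # adds the intercept b0

fit = sm.OLS(y, X).fit()             # chooses b0, ..., b12 to minimize the sum of squared errors
print(fit.summary())                 # coefficients, t stats, R squared, F stat
y_hat = fit.predict(X)               # predicted box office for each movie in the sample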
Outline
1. Multiple Correlation and R Squared
2. F test
3. Partial correlation
Next time: Multivariate regression: the F test (Continued)
R Squared
• How good are we at predicting the success of a movie?
• The R squared is 1 if our predictions are exactly right: ei = 0 for every movie.
• The R squared is 0 if we do no better than predicting the average: ei = yi − ȳ for every movie.
• R squared = ESS/TSS = 13356/18665 = 0.7156
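To make the computation concrete, here is a minimal sketch that rebuilds the R squared as ESS/TSS from the residuals, reusing the hypothetical fit and y from the earlier sketch.

import numpy as np

resid = fit.resid                        # prediction errors e_i = y_i - y_hat_i
tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
sse = np.sum(resid ** 2)                 # sum of squared prediction errors
ess = tss - sse                          # explained sum of squares
r_squared = ess / tss                    # in the slides: 13356 / 18665 = 0.7156
print(r_squared, fit.rsquared)           # should match statsmodels' own R squared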
Graphically …
What would the graph look like if R² = 1?
Each point on this graph is a movie
Properties of the R Squared
• The larger the value of the R squared, the better the explanatory variables (x1, …, x12) collectively predict y.
• Adding a variable on the right-hand side never lowers the R squared; it typically raises it.
Warning
A high R squared is not a sign that your linear regression measures causal effects.
It merely says that, within the sample, your predictions are close to the actual values yi.
Adding a large number of variables will mechanically lead to a high R squared.
Ask yourself: is there a reason to think that x1, …, x12 cause y?
Compare this….
(without web popularity variables)
with this…
(with web popularity variables)
Adjusted R Squared
• What about an R squared that increases only if the variable we add has a high enough t stat?
• Such an adjusted R squared increases only if the absolute value of the t statistic is greater than 1.
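A minimal sketch of the usual adjustment formula, assuming N observations and K = 12 explanatory variables and reusing r_squared and fit from the earlier sketches:

N = len(y)    # number of movies in the sample
K = 12        # number of explanatory variables (excluding the constant)
adj_r_squared = 1 - (1 - r_squared) * (N - 1) / (N - K - 1)
print(adj_r_squared, fit.rsquared_adj)   # statsmodels reports the same adjusted R squared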
Compare this … (without Star Power)
… with this (Star Power Added)
What happened to the R squared when we added Star power?
What happened to the adjusted R squared?
Outline
1. Multiple Correlation and R Squared
2. F test
3. Partial correlation
Next time: Multivariate regression: the F test (Continued)
F test
• The t test checks whether one particular coefficient, say b3, is statistically significant. But what about all coefficients collectively?
• H0: "b1 = b2 = b3 = … = b12 = 0".
The alternative hypothesis is that at least one bk is nonzero.
• Ha: "For at least one k, bk ≠ 0".
F test
• F statistic: F = (R²/K) / ((1 − R²)/(N − (K+1))).
• Under the null hypothesis, F follows an F distribution with df1 = K and df2 = N − (K+1) degrees of freedom.
• The F is always positive.
• Intuition: for N = ∞, notice how the F stat is a way of comparing the R² to a threshold.
Notice the degrees of freedom of the F stat: df1 = ?
df2 = ?
Can we reject the null hypothesis? H0: “b1 = b2 = b3 = … = b12 = 0”.
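A minimal sketch of the F statistic and its p value, written directly from the formula above; scipy's F distribution gives the tail probability, and r_squared, N, K and fit are the hypothetical objects from the earlier sketches.

from scipy import stats

df1 = K              # numerator degrees of freedom
df2 = N - (K + 1)    # denominator degrees of freedom
F = (r_squared / df1) / ((1 - r_squared) / df2)
p_value = stats.f.sf(F, df1, df2)             # P(F > observed F) under H0: b1 = ... = b12 = 0
print(F, p_value, fit.fvalue, fit.f_pvalue)   # compare with statsmodels' reported F test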
F test – Intuitions
• We could instead check whether at least one t statistic is above the t score with df = N − (K+1).
– This is time consuming.
• It may happen that the p value of the F stat is marginally above 0.05 while the p value of one t stat is marginally below 0.05.
– Both the F test and the t tests are subject to Type I and Type II errors.
– Be conservative: trust the least favorable result.
Outline
1. Multiple Correlation and R Squared
2. F test
3. Partial correlation
Next time: Multivariate regression: the F test (Continued)
Partial Correlation between y and x1
• Measures the association between y and x1,
controlling for all other variables x2, x3, …, x12.
• The partial correlation thus measures the association "all other things equal" or "ceteris paribus" (see previous slides).
• The partial correlation r(y, x1 · x2, …, x12) is between −1 and +1.
• It has the same sign as b1.
• What if the correlation between x1 and x2 is 0?
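A minimal sketch of this definition, computed by the residual recipe: regress y on all the other explanatory variables, regress the variable of interest (here cntwait3, to match the numbers on the next slide) on the same controls, and correlate the two sets of residuals. Column names are the hypothetical ones from the earlier sketches; Stata's pcorr command reports the same quantity.

import numpy as np
import statsmodels.api as sm

# Control variables: every explanatory variable except cntwait3
controls = sm.add_constant(movies[["mprating", "budget", "starpowr", "sequel", "action",
                                   "comedy", "animated", "horror", "addict",
                                   "cmngsoon", "fandango"]])
res_y = sm.OLS(movies["box_mil"], controls).fit().resid    # part of y not explained by the controls
res_x = sm.OLS(movies["cntwait3"], controls).fit().resid   # part of cntwait3 not explained by the controls
partial_corr = np.corrcoef(res_y, res_x)[0, 1]             # r(y, cntwait3 | all other x's)
print(partial_corr)                                        # the slides report 0.3083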
Correlation vs Partial Correlation
Correlations (corr) with box_mil
Correlation between y (box_mil) and
cntwait3 (percentage that can’t wait) is
0.6511.
Partial Correlations (pcorr) with box_mil
Partial correlation between y (box_mil)
and cntwait3 (percentage that can’t wait)
is 0.3083.
• Last time we saw that the budget affects website popularity, which helps explain why the partial correlation is smaller than the simple correlation.
Congratulations: You now fully
understand regression output!
• You can use a number of variables
to explain a dependent variable.
☞ Multiple regression accounts for multiple causes.
• The coefficients minimize the sum of the squared residuals.
• You understand the t test and the p value.
☞ The F test tests the null hypothesis that all coefficients (except the
constant) are zero.
• The coefficients should be understood “all other things equal”
or “ceteris paribus”.
• The standardized coefficients express effects in terms of
standard deviations.
• The R squared, between 0 and 100%, measures how accurate our predictions are.
☞ The adjusted R squared corrects the R squared for the addition of variables with small t statistics.
Coming up:
• Coverage for the final ends right after the F test.
• Chapter on “Association and Causality”, and “Multivariate Regression”.
• Make sure you come to sessions and the last recitation.
• Sunday: Recitation
• Monday: Multivariate Regression. Evening session, 7.30pm, West Administration 002
• Tuesday: Multivariate Regression, The F test. Usual class, 12.45pm, usual room
• Wednesday: Randomized Experiments and ANOVA. Evening session, 7.30pm, West Administration 001
• Thursday: Wrap up. Usual class, 12.45pm, usual room