Regression: Topics - Stanford University
Topics: Regression
• Simple Linear Regression: one dependent
variable and one independent variable
• Multiple Regression: one dependent
variable and two or more independent
variables.
Correlation
• A correlation describes a relationship between two
variables
• Correlation tries to answer the following questions:
– What is the relationship between variable X and variable Y?
– How are the scores on one measure associated with scores on
another measure?
– To what extent do the high scores on one variable go with the high
scores on the second variable?
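As a minimal sketch of how these questions are usually answered numerically, the Pearson correlation can be computed from deviations around the two means. The function name `pearson_r` and the five score pairs are hypothetical, invented only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores, not data from the slides
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 2))  # high X scores tend to go with high Y scores when r is near +1
```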
Simple Linear Regression
• Understanding relationships between
variables:
– Prediction
– Explanation
Design Requirements and Assumptions
• Two continuous variables
• Variables are linearly related
• Random Sampling
• Independence
• Bivariate Normality
• N >= 30
Example
• You are the admissions committee in the Sociology
department of a large west coast University. You are trying
to make decisions about who to admit to the Master’s
program. You would like to be able to predict how well the
applicants you are deciding about will do at your school.
• Your department has been analyzing the performance of
its graduate students over the years. One thing it has been
looking at is the relationship between undergraduate GPA
and graduate GPA.
• From regression analyses done over the years, you are able
to make some educated guesses about how applicants will
perform once admitted.
How Used in Making Predictions
The Regression Coefficient? What
Slope? What Altitude?
Fitting the Regression Line: The Best
Fit (Least Squares)
• Y'= a + byX
• The predicted value of Y(Y') for a value of X is
computed by:
– Multiplying a score (X) by the regression
coefficient (by)
– Adding the regression constant (a) to this
product
• The prediction of Y from X is based on the linear
relationship between X and Y, with the line chosen
so that errors of prediction are minimized
Least Squares Fit: Visual
Where the average squared
distance of the points from the
regression line is minimized
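The least squares idea can be checked numerically. The sketch below fits a line to a small hypothetical data set (the numbers are invented, not from the slides) and confirms that nudging the constant or the slope only increases the sum of squared errors. The helper names are assumptions, and the slope is computed from sums of deviations, which is algebraically the same as the correlation-based formula.

```python
def sum_squared_errors(xs, ys, a, b):
    """Sum of squared vertical distances between the points and the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Hypothetical scores, not data from the slides
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares_line(x, y)
best = sum_squared_errors(x, y, a, b)

# Squared error for lines with a slightly different constant or slope
nudged = [sum_squared_errors(x, y, a + da, b + db)
          for da, db in [(0.2, 0), (-0.2, 0), (0, 0.1), (0, -0.1)]]
print(best, nudged)
```

Every value in `nudged` is larger than `best`, which is what "best fit" means under this criterion.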
Minimizing Prediction Error: What
that Means (For Math Types)
The Regression Coefficient: Close
Your Eyes if You Don’t Want the
Derivation
• by = rxy (sy/sx)
– by = regression coefficient
– rxy = correlation between X and Y
– sy = standard deviation of Y
– sx = standard deviation of X
• Compute by: divide the standard deviation
of Y (sy) by the standard deviation of X (sx)
then multiply by the Pearson correlation
(rxy)between X and Y
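A minimal sketch of this computation, assuming sample standard deviations (`statistics.stdev`) and a small hypothetical data set invented for illustration. The function names are not from the slides.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def regression_coefficient(xs, ys):
    """by = rxy * (sy / sx), following the formula on the slide."""
    r_xy = pearson_r(xs, ys)
    s_y = statistics.stdev(ys)  # sample standard deviation of Y
    s_x = statistics.stdev(xs)  # sample standard deviation of X
    return r_xy * (s_y / s_x)

# Hypothetical scores, not data from the slides
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b_y = regression_coefficient(x, y)
print(round(b_y, 2))
```

Because the ratio sy/sx is the same whether both standard deviations use N or N - 1 in the denominator, the choice does not change by as long as it is applied to both.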
The Constant (a): More Math
• Regression Constant (a): the altitude of the
regression line; the value where the regression line
intercepts Y where X = 0 (the Y intercept)
• $a = \bar{Y} - b_y \bar{X}$
– $a$ = the regression constant
– $\bar{Y}$ = mean of Y
– $b_y$ = regression coefficient
– $\bar{X}$ = mean of X
• Compute a: multiply X (mean of X) by the
regression coefficient (by) and then subtract that
product from Y (mean of Y)
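The same step as a short sketch. The data and the slope value are hypothetical (the slope is the least squares slope for these invented points), and `regression_constant` is an illustrative name.

```python
def regression_constant(xs, ys, b_y):
    """a = mean(Y) - by * mean(X)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return mean_y - b_y * mean_x

# Hypothetical scores, not data from the slides
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b_y = 0.6  # regression coefficient for this toy data, taken as given here

a = regression_constant(x, y, b_y)

# Predicting at the mean of X returns the mean of Y
predicted_at_mean_x = a + b_y * (sum(x) / len(x))
print(round(a, 2), round(predicted_at_mean_x, 2))
```

One consequence of defining the constant this way is that the regression line always passes through the point (mean of X, mean of Y).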
Plotting Regression Line
• Need to compute two predicted scores:
– For X (undergrad GPA) = 2.75
• Y’ = a + byX = 2.93+.24(2.75) = 3.59
– For X (undergrad GPA) = 3.60
• Y’ = a + byX = 2.93+.24(3.60) = 3.79
• Draw regression line through scatter plot
using these two points
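The two plotting points can be reproduced with a few lines of code. The constant (2.93) and coefficient (.24) are the values from the example. The function and variable names are illustrative.

```python
A = 2.93    # regression constant (a) from the example
B_Y = 0.24  # regression coefficient (by) from the example

def predict_grad_gpa(undergrad_gpa):
    """Predicted graduate GPA: Y' = a + by * X."""
    return A + B_Y * undergrad_gpa

# The two undergraduate GPAs used on the slide
points = [(x, round(predict_grad_gpa(x), 2)) for x in (2.75, 3.60)]
print(points)  # the regression line is drawn through these two (X, Y') points
```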
Plotting the Regression Line: Visual
Errors of Prediction
Standard Error of Estimate
• The magnitude of the error made in
estimating Y from X: a measure of
dispersion around the regression line
• The average error of prediction
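A sketch of one common way to compute this quantity: the square root of the summed squared residuals divided by N - 2. The N - 2 denominator is an assumption here, since the slides do not state the formula and some texts divide by N. The data and line are hypothetical.

```python
import math

def standard_error_of_estimate(xs, ys, a, b):
    """sqrt( sum((Y - Y')^2) / (N - 2) ), assuming the N - 2 convention."""
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)
    return math.sqrt(sse / (len(xs) - 2))

# Hypothetical scores and their least squares line (a = 2.2, b = 0.6)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
se = standard_error_of_estimate(x, y, a=2.2, b=0.6)
print(round(se, 3))
```

A smaller value means the points sit closer to the regression line, so predictions of Y from X are, on average, less far off.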
The Standard Error of Estimate: A
Visual Representation
[Figure: scatter plot of Graduate GPA (vertical axis) against Undergraduate GPA (horizontal axis)]
Standard Error of Estimate: Another
Visual Representation
Is the prediction worth pursuing?
• Standard error
• Amount of variance explained by X
• Testing the regression coefficient (b) for
significance
Explaining Variance: How much?
• Predicted variance: $\sum (Y' - \bar{Y})^2$
• Total variance: $\sum (Y - \bar{Y})^2$
• Unpredicted variance: $\sum (Y - Y')^2$
Assessing Prediction Accuracy:
Explaining Variance
• Total variance = Predicted variance + Residual
(unexplained) variance
• Coefficient of Determination ($r^2$): proportion of total
variance in Y that has been predicted by variable X
($r^2 = s^2_{Y'} / s^2_Y$)
– Our example: r = .56, so $r^2$ = .3136
• Coefficient of Non-Determination ($1 - r^2$): proportion
of total variance in Y that is not predicted by X
– Our example: $1 - r^2$ = 1 - .31 = .69
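The arithmetic for the example can be sketched as follows, using the correlation of .56 given above. Variable names are illustrative.

```python
r = 0.56  # correlation between undergraduate and graduate GPA in the example

r_squared = r ** 2           # coefficient of determination
non_determination = 1 - r_squared  # coefficient of non-determination

print(round(r_squared, 4))          # proportion of variance in Y predicted by X
print(round(non_determination, 2))  # proportion of variance in Y not predicted by X
```

Without intermediate rounding, 1 - .3136 = .6864, which still rounds to .69, matching the figure obtained on the slide by rounding the coefficient of determination to .31 first.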
Proportion of Explained (Predicted)
and Unexplained (Residual) Variance
[Figure: overlap between X and Y with rxy = .56. Explained variance: r2 = .31 (31%). Unexplained variance: (1 - r2) = .69 (69%).]
t-Test for Individual Regression
Coefficients (by)
• H0: $\beta$ = 0 (where $\beta$ is the population
regression coefficient)
• H1: $\beta \neq$ 0
• Compute a t statistic:
• t = (b - $\beta$)/sb = b/sb (how many standard
errors b is from the hypothesized
population parameter under the null
hypothesis, $\beta$ = 0)
t-Test of b: Our Example
• t = .24/.12 = 2.00
• Set alpha at .05 (two-tailed)
• Figure out df (N - 2): 8
• t critical (.05/2, 8) = 2.306
• Decision: t observed (2.00) < t critical (2.306), so
do not reject the null hypothesis
• Conclusion: cannot conclude that the slope
is significantly different from 0 in the
population.
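The decision rule in this example can be sketched in code. The coefficient, its standard error, the degrees of freedom, and the critical value are the ones given above. The critical value is entered as a constant from a t table rather than computed, and the variable names are illustrative.

```python
b = 0.24   # sample regression coefficient from the example
s_b = 0.12 # standard error of b from the example
n = 10     # sample size implied by df = N - 2 = 8

df = n - 2
t_observed = b / s_b  # (b - 0) / s_b under the null hypothesis
t_critical = 2.306    # two-tailed, alpha = .05, df = 8 (table value from the example)

# Reject the null hypothesis only if |t| exceeds the critical value
reject_null = abs(t_observed) > t_critical
print(df, round(t_observed, 2), reject_null)
```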
Our Conclusion: Do not reject the null
hypothesis
Warnings
• Simple regression assumes a straight line
relationship
• Outliers can control regression results
• Assumes random samples for making proper
generalizations
• Regression is correlational and does not show
that X causes Y