
Lecture 19: Tues., Nov. 11th
• R-squared (8.6.1)
• Review
• Midterm II on Thursday in class: a calculator
and two double-sided pages of notes are allowed
• Office hours: Today after class; Wednesday,
1:30-2:30; by appointment (I will be around
Wed. morning and Thurs. morning before
10:30).
R-Squared
• The R-squared statistic, also called the coefficient
of determination, is the percentage of response
variation explained by the explanatory variable.
$R^2 = 100\left(\dfrac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}}\right)\%$
• Total sum of squares $= \sum_{i=1}^{n}(Y_i - \bar{Y})^2$. Best sum of
squared prediction error without using x.
• Residual sum of squares $= \sum_{i=1}^{n}\mathrm{res}_i^2 = \sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$.
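As a quick illustration (not from the lecture), here is a minimal Python sketch that computes R² straight from these two sums of squares; the data and the use of np.polyfit for the least squares fit are my own choices.

```python
# Minimal sketch: R-squared from its definition, with made-up data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least squares fit; np.polyfit returns (slope, intercept) for deg=1
beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)

total_ss = np.sum((y - y.mean()) ** 2)                      # sum of (Y_i - Ybar)^2
residual_ss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)  # sum of res_i^2

r_squared = 100 * (total_ss - residual_ss) / total_ss
print(f"R-squared = {r_squared:.2f}%")
```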
R-Squared example
[Figure: Bivariate Fit of Neuron activity index By Years playing; scatterplot with linear fit, x-axis Years playing (0-20), y-axis Neuron activity index (0-30)]

Linear Fit: Neuron activity index = 7.9715909 + 1.0268308 × Years playing

Summary of Fit:
RSquare                      0.866986
RSquare Adj                  0.855902
Root Mean Square Error       3.025101
Mean of Response             15.89286
Observations (or Sum Wgts)   14
• R² = 86.69%. Read as “86.69 percent of the
variation in neuron activity was explained
by linear regression on years played.”
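The reported RSquare and Root Mean Square Error pin down the underlying sums of squares. Assuming JMP computes RMSE with $n-2$ degrees of freedom (as in the $\hat{\sigma}$ formula later in this lecture), a quick consistency check:

```python
# Sketch: sums of squares implied by the JMP output above.
rmse, r_sq, n = 3.025101, 0.866986, 14

residual_ss = (n - 2) * rmse ** 2     # since RMSE = sqrt(RSS / (n - 2))
total_ss = residual_ss / (1 - r_sq)   # since R^2 = 1 - RSS / TSS

print(f"Residual SS ~ {residual_ss:.1f}, Total SS ~ {total_ss:.1f}")
# Residual SS ~ 109.8, Total SS ~ 825.6
```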
Interpreting R²
• R² takes on values between 0 and 1 (i.e., 0% and
100%), with higher R² indicating a stronger linear
association.
• If the residuals are all zero (a perfect fit),
then R² is 100%. If the least squares line
has slope 0, R² will be 0% (both extremes are
sketched below).
• R² is useful as a unitless summary of the
strength of linear association.
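Both extremes are easy to check numerically. A sketch with toy data of my own (x from 0 to 9): the first series lies exactly on a line, while the second is a parabola centered at the mean of x, which forces the least squares slope to be exactly 0.

```python
# Sketch of the two extremes of R^2 with toy data.
import numpy as np

def r_squared(x, y):
    b1, b0 = np.polyfit(x, y, 1)
    rss = np.sum((y - b0 - b1 * x) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

x = np.arange(10, dtype=float)
print(r_squared(x, 2 + 3 * x))       # residuals all zero: R^2 = 1
print(r_squared(x, (x - 4.5) ** 2))  # slope exactly 0 by symmetry: R^2 = 0
```

The parabola case also previews the caveat on the next slide: R² is 0 even though x and y are perfectly (nonlinearly) associated.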
Caveats about R²
– R² is not useful for assessing model adequacy
(e.g., linearity) or whether or not there is an
association (see the sketch below).
– What counts as a good R² depends on the context. In
precise laboratory work, R² values under 90% might be
too low, but in social science contexts, where a
single variable rarely explains a great deal of
variation in the response, an R² of 50% may be
considered remarkably good.
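Conversely, a high R² does not certify linearity. In this hypothetical sketch, the relationship is exactly quadratic with no noise, yet a straight-line fit still earns an R² near 0.96:

```python
# Sketch: high R^2 despite obvious curvature (made-up, noiseless data).
import numpy as np

x = np.linspace(1, 10, 50)
y = x ** 2                    # exactly quadratic

b1, b0 = np.polyfit(x, y, 1)
rss = np.sum((y - b0 - b1 * x) ** 2)
tss = np.sum((y - y.mean()) ** 2)
print(f"R^2 = {1 - rss / tss:.3f}")   # about 0.96; a residual plot would
                                      # reveal the curvature immediately
```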
Coverage of Second Midterm
• Transformations of the data for two group problem
(Ch. 3.5)
• Welch t-test (Ch. 4.3.2)
• Comparisons Among Several Samples (5.1-5.3,
5.5.1)
• Multiple Comparisons (6.3-6.4)
• Simple Linear Regression (Ch. 7.1-7.4, 7.5.3)
• Assumptions for Simple Linear Regression and
Diagnostics (Ch. 8.1-8.4, 8.6.1, 8.6.3)
Transformations for two-group problem
• Goal: Find transformation so that the two distributions have
approximately equal spread.
• Log transformation might work when distributions are skewed and
spread is greater in the distribution with the larger median.
• Interpretation of log transformation:
– For causal inference: let $\delta$ be the additive treatment effect on the
log scale ($\log Y^* = \log Y + \delta$). Then the effect of the treatment is
to multiply the control outcome by $e^{\delta}$ ($Y^* = Y e^{\delta}$).
– For population inference: let $\mu_1$ and $\mu_2$ be the means of the
logged values of populations 1 and 2 respectively. If the logged
values of the populations are symmetric, then $e^{\mu_2 - \mu_1}$ equals the
ratio of the median of population 2 to the median of population 1.
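A one-line numerical check of the causal-inference reading, with an assumed effect $\delta = \log 1.5$ and an assumed control outcome of 8:

```python
# Sketch: additive on the log scale = multiplicative on the original scale.
import numpy as np

delta = np.log(1.5)   # assumed additive treatment effect on the log scale
y_control = 8.0       # assumed control outcome

y_treated = np.exp(np.log(y_control) + delta)
print(y_treated, y_control * np.exp(delta))   # both 12.0, i.e. Y* = Y e^delta
```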
Review of One-way layout
• Assumptions of ideal model
– All populations have the same standard deviation.
– Each population is normal.
– Observations are independent.
• Planned comparisons: Usual t-test but use all groups to
estimate $\sigma$. If many planned comparisons, use Bonferroni
to adjust for multiple comparisons.
• Test of $H_0: \mu_1 = \mu_2 = \cdots = \mu_I$ vs. the alternative that at least
two means differ: one-way ANOVA F-test
• Unplanned comparisons: Use Tukey-Kramer procedure to
adjust for multiple comparisons.
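Here is a rough sketch of both procedures on made-up data, using the scipy and statsmodels implementations (the group values are invented for illustration):

```python
# Sketch: one-way ANOVA F-test and Tukey-Kramer comparisons on toy data.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

g1 = np.array([4.1, 5.0, 4.6, 5.2])
g2 = np.array([5.9, 6.3, 5.7, 6.1])
g3 = np.array([4.8, 5.1, 5.0, 4.7])

# H0: mu_1 = mu_2 = mu_3 vs. at least two means differ
f_stat, p_value = f_oneway(g1, g2, g3)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# Unplanned pairwise comparisons, Tukey-Kramer adjusted
values = np.concatenate([g1, g2, g3])
groups = ["g1"] * 4 + ["g2"] * 4 + ["g3"] * 4
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```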
Regression
• Goal of regression: Estimate the mean response of Y
for subpopulations X = x, $\mu\{Y \mid X\}$
• Applications: (i) Description of association
between X and Y; (ii) Passive prediction of Y
given X ; (iii) Control – predict what y will be if x
is changed. Application (iii) requires the x’s to be
randomly assigned.
• Simple linear regression model: $\mu\{Y \mid X\} = \beta_0 + \beta_1 X$
• Estimate $\beta_0$ and $\beta_1$ by least squares: choose
$\hat{\beta}_0, \hat{\beta}_1$ to minimize the sum of squared residuals
(prediction errors)
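The least squares estimates have a closed form, sketched below on made-up data (these are the familiar textbook formulas, not anything specific to this lecture's example):

```python
# Sketch: closed-form least squares estimates on toy data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2])

# beta1_hat = Sxy / Sxx, beta0_hat = ybar - beta1_hat * xbar
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
print(beta0_hat, beta1_hat)   # the pair minimizing the sum of squared residuals
```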
Ideal Model
• Assumptions of ideal simple linear regression
model
– There is a normally distributed subpopulation of
responses for each value of the explanatory variable
– The means of the subpopulations fall on a straight-line
function of the explanatory variable.
– The subpopulation standard deviations are all equal (to
$\sigma$)
– The selection of an observation from any of the
subpopulations is independent of the selection of any
other observation.
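One way to internalize the four assumptions is to simulate data that satisfies them exactly. A sketch with assumed parameter values $\beta_0 = 2$, $\beta_1 = 0.5$, $\sigma = 1$:

```python
# Sketch: simulating data from the ideal simple linear regression model.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 2.0, 0.5, 1.0   # assumed values for illustration

x = rng.uniform(0, 10, size=100)
# Normal subpopulations whose means fall on a line, with equal SD sigma
# and independent draws: all four assumptions at once.
y = beta0 + beta1 * x + rng.normal(0, sigma, size=100)
```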
The standard deviation $\sigma$
• $\sigma$ is the standard deviation in each
subpopulation.
• $\sigma$ measures the accuracy of predictions from the
regression: $\hat{\sigma} = \sqrt{\dfrac{\text{sum of all squared residuals}}{n-2}}$
• If the simple linear regression model holds, then
approximately
– 68% of the observations will fall within $\hat{\sigma}$ of the least
squares line
– 95% of the observations will fall within $2\hat{\sigma}$ of the least
squares line
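Both the estimate $\hat{\sigma}$ and the 68%/95% rules are easy to verify on simulated data (a sketch under the same toy setup as above; the seed and parameter values are arbitrary):

```python
# Sketch: sigma-hat and the 68%/95% coverage rules on simulated data.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=200)

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)
sigma_hat = np.sqrt(np.sum(residuals ** 2) / (len(x) - 2))   # RSS / (n - 2)

print(np.mean(np.abs(residuals) <= sigma_hat))       # roughly 0.68
print(np.mean(np.abs(residuals) <= 2 * sigma_hat))   # roughly 0.95
```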
Inference for Simple Linear Regression
• Inference based on the ideal simple linear
regression model holding.
• Inference based on taking repeated random
samples $(y_1, \ldots, y_n)$ from the same subpopulations
$(x_1, \ldots, x_n)$ as in the observed data.
• Types of inference:
– Hypothesis tests for intercept and slope
– Confidence intervals for intercept and slope
– Confidence interval for mean of Y at $X = X_0$
– Prediction interval for future Y for which $X = X_0$
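All four kinds of inference come out of a standard OLS fit; here is a sketch using statsmodels on simulated data (the data, $X_0 = 4$, and $\alpha = 0.05$ are arbitrary choices of mine):

```python
# Sketch: the four types of inference via statsmodels, on toy data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=50)

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.summary())      # includes t-tests for intercept and slope
print(fit.conf_int())     # confidence intervals for intercept and slope

# CI for the mean of Y and prediction interval for a future Y at X = X0
X0 = sm.add_constant(np.array([4.0]), has_constant="add")
pred = fit.get_prediction(X0)
print(pred.summary_frame(alpha=0.05))  # mean_ci_* and obs_ci_* columns
```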
Tools for model checking
1. Scatterplot of Y vs. X (see Display 8.6)
2. Scatterplot of residuals vs. fits (see
Display 8.12)
• Look for nonlinearity, non-constant variance
and outliers
3. Normal probability plot (Section 8.6.3) –
for checking normality assumption
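A sketch of all three tools on simulated data, using matplotlib and scipy (any plotting setup would do; the data here are invented):

```python
# Sketch: Y vs. X, residuals vs. fits, and a normal probability plot.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=60)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=60)

b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x
residuals = y - fitted

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].scatter(x, y)
axes[0].set_title("Y vs. X")
axes[1].scatter(fitted, residuals)
axes[1].axhline(0)
axes[1].set_title("Residuals vs. fits")
stats.probplot(residuals, plot=axes[2])   # normal probability plot
plt.tight_layout()
plt.show()
```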
Outliers and Influential
Observations
• An outlier is an observation that lies outside the overall
pattern of the other observations. A point can be an outlier
in the x direction, the y direction or in the direction of the
scatterplot. For regression, the outliers of concern are
those in the x direction and the direction of the scatterplot.
A point that is an outlier in the direction of the scatterplot
will have a large residual.
• An observation is influential if removing it markedly
changes the least squares regression line. A point that is an
outlier in the x direction will often be influential.
• The least squares method is not resistant to outliers.
Follow the outlier examination strategy in Display 3.6 for
dealing with outliers in x direction and outliers in the
direction of scatterplot.
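A crude but instructive influence check is to refit with each observation left out and watch the slope; sketched below with made-up data in which the last point is an outlier in the x direction:

```python
# Sketch: leave-one-out slopes to spot influential observations.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 12.0])   # last point: x-outlier
y = np.array([2.1, 2.8, 4.2, 4.9, 6.1, 3.0])

full_slope = np.polyfit(x, y, 1)[0]
for i in range(len(x)):
    loo_slope = np.polyfit(np.delete(x, i), np.delete(y, i), 1)[0]
    print(f"drop point {i}: slope {loo_slope:.3f} (full fit: {full_slope:.3f})")
# Dropping the x-outlier moves the slope far more than dropping any other point.
```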
Transformations
• Goal: Find transformations f(y) and g(x)
such that the simple linear regression model
approximately describes the relationship
between f(y) and g(x).
• Tukey’s Bulging Rule can be used to find
candidate transformations.
• Prediction after transformation
• Interpreting log transformations
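To close, a sketch of prediction after transformation, using an invented log-log example (one candidate pairing of f(y) = log y and g(x) = log x under Tukey's Bulging Rule):

```python
# Sketch: fit on the log-log scale, then back-transform a prediction.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = np.array([2.1, 3.0, 4.1, 5.9, 8.2])   # made-up, roughly y = 2 * sqrt(x)

b1, b0 = np.polyfit(np.log(x), np.log(y), 1)

x_new = 10.0
log_pred = b0 + b1 * np.log(x_new)
# If the logged values are symmetric, exp(log_pred) estimates the
# median (not the mean) of Y at x_new; cf. the log interpretation slide.
print(np.exp(log_pred))
```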