Transcript: Regression 2

Single Variable Regression
Which Approach Is Appropriate
When?
• Choosing the right method for the data is
the key statistical expertise that you need to
have.
Do I Need to Know the
Formulas?
• You do not need to know exact formulas.
• You do need to understand the concept behind
them and the general statistical concepts embedded
in the use of the formulas.
• You do not need to be able to do correlation and
regression by hand.
• You must be able to do it on a computer using
Excel.
Table of Contents
• Objectives
• Purpose of Regression
• Correlation or Regression?
• First Order Linear Model
• Probabilistic Linear Relationship
• Estimating Regression Parameters
• Assumptions
• Sum of Squares
• Tests
• Percent of Variation Explained
• Example
• Regression Analysis in Excel
• Normal Probability Plot
• Residual Plot
• Goodness of Fit
• ANOVA For Regression
Objectives
• To learn the assumptions behind and the
interpretation of single and multiple
variable regression.
• To use Excel to calculate regressions and
test hypotheses.
Purpose of Regression
• To determine whether values of one or more
variables are related to the response variable.
• To predict the value of one variable based
on the value of one or more variables.
• To test hypotheses.
Correlation or Regression?
• Use correlation if you are interested only in
whether a relationship exists.
• Use regression if you are interested in
building a mathematical model that can
predict the response variable.
• Use regression if you are interested in the
relative effectiveness of several variables in
predicting the response variable.
First Order Linear Model
• A deterministic mathematical model between y and x:
  y = β0 + β1 * x
• β0 is the intercept with the y axis, the value of y when x = 0.
• β1 is the slope of the line, the ratio of the rise divided by the run in the figure to the right. It measures the change in y for one unit of change in x.
[Figure: the dependent variable y plotted against the independent variable x, with a straight line and its rise and run marked.]
Probabilistic Linear Relationship
• But the relationship between x and y is not always exact. Observations do not always fall on a straight line.
• To accommodate this, we introduce a random error term referred to as epsilon (ε):
  y = β0 + β1 * x + ε
• The task of regression analysis then is to estimate the parameters b0 and b1 in the equation:
  ŷ = b0 + b1 * x
so that the difference between y and ŷ is minimized.
Estimating Regression Parameters
• Red dots show the observations.
• The solid line shows the estimated regression line.
• The distance between each observation and the solid line is called the residual.
• Minimize the sum of the squared residuals (differences between the line and the observations); see the sketch below.
[Figure: scatter plot of Y (about 20 to 50) against X (1 to 5) with the fitted regression line and one residual marked.]
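To make the least-squares calculation concrete, here is a minimal Python sketch (assuming numpy is available; the x and y values are made-up illustration numbers, not data from this transcript). np.polyfit(x, y, 1) would give the same slope and intercept in one call.

```python
import numpy as np

# Made-up illustration data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([24.0, 28.0, 31.0, 38.0, 42.0])

# Least-squares estimates: b1 = Sxy / Sxx, b0 = y_bar - b1 * x_bar
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

y_hat = b0 + b1 * x       # points on the estimated regression line
residuals = y - y_hat     # distances between observations and the line
print(b0, b1, np.sum(residuals ** 2))  # intercept, slope, minimized sum of squared residuals
```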
Assumptions
• The dependent (response) variable is measured on
an interval scale
• The probability distribution of the error is Normal
with mean zero
• The standard deviation of error is constant and
does not depend on values of x
• The error term associated with any particular value of Y is independent of the error terms associated with other values of Y
Sum of Squares
• Variation in y = SSR + SSE
• MSR divided by MSE is the test statistic for the ability of the regression to explain the data (see the sketch below).

Source             Sum of squared differences between              Degrees of freedom
Regression (SSR)   predicted values and the mean of observations   1
Error (SSE)        predicted values and observations               n - 2
Variation in Y     observations and the mean of observations       n - 1

• The mean sum of squares (MS) is obtained by dividing each SS by its degrees of freedom.
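Continuing the earlier Python sketch (again with made-up illustration numbers), the three sums of squares and the mean squares can be computed directly from their definitions:

```python
import numpy as np

# Made-up illustration data, fitted as in the earlier sketch
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([24.0, 28.0, 31.0, 38.0, 42.0])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

n = len(y)
SSR = np.sum((y_hat - y.mean()) ** 2)  # predicted values vs. mean of observations
SSE = np.sum((y - y_hat) ** 2)         # predicted values vs. observations
SST = np.sum((y - y.mean()) ** 2)      # observations vs. their mean (= SSR + SSE)

MSR = SSR / 1        # regression has 1 degree of freedom
MSE = SSE / (n - 2)  # error has n - 2 degrees of freedom
print(SSR, SSE, SST, MSR, MSE)
```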
Tests
• The hypothesis that the regression equation does not explain variation in Y can be tested using the F test.
• The hypothesis that the coefficient for x is zero can be tested using the t statistic.
• The hypothesis that the intercept is zero can be tested using the t statistic (see the sketch below).
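A minimal sketch of both tests, assuming scipy is available (the F and t distributions supply the p-values; the data are the same made-up illustration numbers as above):

```python
import numpy as np
from scipy import stats

# Made-up illustration data, fitted as in the earlier sketches
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([24.0, 28.0, 31.0, 38.0, 42.0])
n = len(y)
Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

MSR = np.sum((y_hat - y.mean()) ** 2) / 1
MSE = np.sum((y - y_hat) ** 2) / (n - 2)

# F test: does the regression explain variation in y?
F = MSR / MSE
p_F = stats.f.sf(F, 1, n - 2)            # upper-tail probability

# t test: is the slope coefficient zero?
se_b1 = np.sqrt(MSE / Sxx)               # standard error of the slope
t_b1 = b1 / se_b1
p_t = 2 * stats.t.sf(abs(t_b1), n - 2)   # two-sided p-value

print(F, p_F, t_b1, p_t)
```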
Percent of Variation Explained
• R2 is the coefficient of determination.
• The minimum R2 is zero. The maximum is 1.
• 1 - R2 is the proportion of variation left unexplained.
• If Y is not related to X, or is related in a non-linear fashion, then R2 will be small.
• Adjusted R2 shows the value of R2 after adjustment for degrees of freedom. It protects against an artificially high R2 obtained simply by increasing the number of variables in the model (see the sketch below).
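In the running Python sketch, R2 and adjusted R2 follow directly from the sums of squares (k below is the number of explanatory variables; the data are still the made-up illustration numbers):

```python
import numpy as np

# Made-up illustration data, fitted as in the earlier sketches
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([24.0, 28.0, 31.0, 38.0, 42.0])
n, k = len(y), 1                                  # k explanatory variables
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SSE = np.sum((y - y_hat) ** 2)
SST = np.sum((y - y.mean()) ** 2)

r2 = 1 - SSE / SST                                # proportion of variation explained
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)     # penalizes adding variables
print(r2, adj_r2)
```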
Example
• Is waiting time related to satisfaction ratings?
• Predict what will happen to satisfaction ratings if waiting time reaches 15 minutes (see the sketch below the table).

Patient   Waiting time   Satisfaction ratings
   1           9                  80
   2           7                  90
   3           5                  90
   4           6                 100
   5           8                  85
   6           5                 100
   7           7                  85
   8           8                  75
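The same analysis can also be run outside Excel. Here is a minimal Python sketch that fits the table above with numpy's polyfit and makes the 15-minute prediction; the coefficients match the Excel output shown on the later slides.

```python
import numpy as np

# Waiting time (x) and satisfaction rating (y) for the eight patients above
x = np.array([9, 7, 5, 6, 8, 5, 7, 8], dtype=float)
y = np.array([80, 90, 90, 100, 85, 100, 85, 75], dtype=float)

# Fit the first order linear model: Satisfaction = b0 + b1 * Waiting time
b1, b0 = np.polyfit(x, y, 1)   # returns slope, then intercept
print(b0, b1)                  # about 121.3 and -4.8

# Predicted satisfaction at a 15-minute wait
print(b0 + b1 * 15)            # about 48.9
```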
Regression Analysis in Excel
• Select Tools
• Select Data Analysis
• Select Regression
• Identify the x and y data ranges of equal length
• Ask for residual plots to test the assumptions
• Ask for a normal probability plot to test the Normality assumption
(A Python alternative is sketched below.)
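If Excel is not at hand, a rough Python equivalent of the same workflow uses the statsmodels package (an assumption on my part; the transcript itself only covers Excel):

```python
import numpy as np
import statsmodels.api as sm

# Waiting time (x) and satisfaction rating (y) from the example above
x = np.array([9, 7, 5, 6, 8, 5, 7, 8], dtype=float)
y = np.array([80, 90, 90, 100, 85, 100, 85, 75], dtype=float)

X = sm.add_constant(x)        # adds the intercept column
model = sm.OLS(y, X).fit()    # ordinary least squares fit
print(model.summary())        # coefficients, t stats, F test, R Square, ANOVA
print(model.resid)            # residuals, used for the plots below
```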
Normal Probability Plot
• The normal probability plot compares the percent of errors falling in particular bins to the percentage expected from the Normal distribution.
• If the assumption is met, then the plot should look like a straight line (see the sketch below).
[Figure: normal probability plot for the example, with sample percentile (0 to 100) on the horizontal axis and satisfaction ratings (60 to 110) on the vertical axis.]
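One way to draw a comparable plot in Python is scipy's probplot applied to the residuals (a sketch, assuming matplotlib and scipy are available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Residuals from the waiting-time regression fitted earlier
x = np.array([9, 7, 5, 6, 8, 5, 7, 8], dtype=float)
y = np.array([80, 90, 90, 100, 85, 100, 85, 75], dtype=float)
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# Points lying close to the reference line support the Normality assumption
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Normal probability plot of residuals")
plt.show()
```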
Residual Plot
The difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e). Each data point has one residual.
  Residual = Observed value - Predicted value
• The residual plot tests that the residuals have a mean of zero and a constant standard deviation.
• It tests that the residuals are not dependent on the values of x (see the sketch below).
[Figure: waiting time residual plot for the example, with residuals (-10 to 10) plotted against waiting time (4 to 10).]
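A matplotlib sketch of this residual plot for the waiting-time example (assuming the fit from the earlier sketches):

```python
import numpy as np
import matplotlib.pyplot as plt

# Residuals from the waiting-time regression fitted earlier
x = np.array([9, 7, 5, 6, 8, 5, 7, 8], dtype=float)
y = np.array([80, 90, 90, 100, 85, 100, 85, 75], dtype=float)
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# A random scatter around zero with roughly constant spread supports the linear model
plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Waiting time")
plt.ylabel("Residuals")
plt.title("Waiting time residual plot")
plt.show()
```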
Residual Plot
• A residual plot is a graph that shows the residuals on the vertical
axis and the independent variable on the horizontal axis.
• If the points in a residual plot are randomly dispersed around the
horizontal axis, a linear regression model is appropriate for the
data; otherwise, a non-linear model is more appropriate.
• The chart below displays the residual (e) plotted against the independent
variable (X) as a residual plot.
• This random pattern indicates that a linear model provides a
decent fit to the data.
Residual Plot
• Below, the residual plots show three typical patterns.
• The first plot shows a random pattern, indicating a good fit for a
linear model.
• The other plot patterns are non-random (U-shaped and inverted
U), suggesting a better fit for a non-linear model.
[Figures: Random pattern; Non-random: U-shaped; Non-random: Inverted U]
Linear Equation
• Satisfaction = 121.3 - 4.8 * Waiting time
• At a 15-minute waiting time, satisfaction is predicted to be 121.34 - 4.83 * 15 ≈ 48.9 (48.87 using the unrounded coefficients).
• The t statistics for both the intercept and the waiting time coefficient are statistically significant.
• The hypotheses that the coefficients are zero are rejected.

               Coefficients   Standard Error   t Stat   P-value
Intercept           121.34            10.48     11.58      0.00
Waiting time         -4.83             1.50     -3.23      0.02
Goodness of Fit
• About 64% of the variation in satisfaction ratings is explained by the equation (R Square = 0.635).
• After adjusting for degrees of freedom, this drops to about 57% (Adjusted R Square = 0.574); the remaining variation is left unexplained.

Regression Statistics
Multiple R           0.796902768
R Square             0.635054022
Adjusted R Square    0.574229692
Standard Error       5.7674349
Observations         8
ANOVA For Regression
• The regression model has a mean sum of squares of 347.
• The mean square error is 33. Note that the error term is called "Residual" in Excel.
• The F statistic is 10.44; the probability of observing this statistic under the null hypothesis is 0.02.
• The hypothesis that MSR and MSE are equal is rejected. Significant variation is explained by the regression.

ANOVA
             df        SS        MS        F    Significance F
Regression    1    347.30    347.30    10.44              0.02
Residual      6    199.58     33.26
Total         7    546.88
Null Hypothesis
• The null hypothesis corresponds to a general or default position.
• For example, the null hypothesis might be that there is no
relationship between two measured phenomena or that a potential
treatment has no effect.
• It is important to understand that the null hypothesis can never
be proven.
• A set of data can only reject a null hypothesis or fail to reject
it.
• For example, if comparison of two groups (e.g.: treatment,
no treatment) reveals no statistically significant difference
between the two, it does not mean that there is no difference
in reality.
• It only means that there is not enough evidence to reject the null hypothesis (in other words, the experiment fails to reject the null hypothesis).
What is a P value?
• ‘P’ stands for probability
• Measures the strength of the evidence against the null
hypothesis (that our regression has no significance)
• Smaller P values indicate stronger evidence against the
null hypothesis
• By convention, p-values of <.05 are often accepted as
“statistically significant”; but this is an arbitrary cut-off.