Correlation and Regression Analysis

Dr. Mohammed Alahmed
GOALS
• Understand and interpret the terms dependent and
independent variable.
• Calculate and interpret the coefficient of correlation, the
coefficient of determination, and the standard error of
estimate.
• Conduct a test of hypothesis to determine whether the
coefficient of correlation in the population is zero.
• Calculate the least squares regression line.
• Predict the value of a dependent variable based on the
value of at least one independent variable.
• Explain the impact of changes in an independent
variable on the dependent variable.
Introduction
• Correlation and regression analysis are
related in the sense that both deal with
relationships among variables.
• For example, we may be interested in
studying the relationship between blood
pressure and age, or between height and weight.
• The nature and strength of the
relationship between variables may be
examined by Correlation and Regression
analysis.
Correlation Analysis
• The term “correlation” refers to a
measure of the strength of association
between two variables.
• It describes the relationship between two
quantitative variables, but it does not allow
causal relationships to be inferred.
• Correlation is a statistical technique used
to determine the degree to which two
variables are related.
• If the two variables increase or decrease together, they
have a positive correlation.
• If increases in one variable are associated with
decreases in the other, they have a negative correlation.
Visualizing Correlation
• A scatter plot (or scatter diagram) is used to show the relationship
between two variables.
• Linear relationships, which imply a straight-line association, can be
seen directly in a scatter plot.
Linear Correlation Only!
[Figure: scatter plots of Y against X, contrasting linear relationships with curvilinear relationships.]
Correlation Coefficient
• The population correlation coefficient ρ
(rho) measures the strength of the
association between the variables.
• The sample (Pearson) correlation
coefficient r is an estimate of ρ and
is used to measure the strength of the
linear relationship in the sample
observations.
• r is a statistic that quantifies a relation
between two variables.
• Can be either positive or negative
• Falls between -1.00 and 1.00
• The value of the number (not the sign)
indicates the strength of the relation.
• The purpose is to measure the strength of
a linear relationship between 2 variables.
• A correlation coefficient does not imply
“causation” (i.e. that a change in X causes a
change in Y).
Calculating the Correlation Coefficient
• The sample (Pearson) correlation coefficient (r) is
defined by
$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}} $$
where:
r = Sample correlation coefficient
n = Sample size
x = Value of the independent variable
y = Value of the dependent variable
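The definition of r can be turned into a few lines of Python. This is a minimal sketch using only the standard library; the function name `pearson_r` and the small data set are made up for illustration and are not the infant data used in the lecture.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of cross-products of deviations from the means
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the two sums of squares
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustrative values (not from the lecture)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 3))
```

The function fails with a division by zero if either variable is constant, since r is undefined in that case.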
Statistical Inference for Correlation Coefficients
• Significance Test for Correlation
– Hypotheses
H0: ρ = 0 (no correlation)
H1: ρ ≠ 0 (correlation exists)
• Test statistic
$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} $$
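As a rough sketch, the test statistic can be computed as below. Under H0 it follows a t distribution with n − 2 degrees of freedom. The function name `correlation_t_statistic` is hypothetical. The usage lines assume n = 17 (the infant example) and take r as the positive square root of the r² = 0.668 reported later in this lecture, since the fitted slope there is positive.

```python
import math

def correlation_t_statistic(r, n):
    """Return (t, degrees of freedom) for testing H0: rho = 0."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    df = n - 2
    t = r / math.sqrt((1 - r ** 2) / df)
    return t, df

# Assumption: r is recovered from r^2 = 0.668, with n = 17 infants
r = math.sqrt(0.668)
t, df = correlation_t_statistic(r, 17)
print(round(t, 2), df)
```

The resulting t (about 5.49 on 15 degrees of freedom) is compared with the tabulated critical value, which is 2.131 for a two-sided test at the 5% level with 15 degrees of freedom.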
Example
• A small study is conducted involving 17 infants to
investigate the association between gestational age at
birth, measured in weeks, and birth weight, measured in
grams.
H0 : ρ = 0
H1 : ρ ≠ 0
Cautions about Correlation
• Correlation is only a good statistic
to use if the relationship is roughly
linear.
• Correlation cannot be used to
measure non-linear relationships.
• Always plot your data to make sure
that the relationship is roughly
linear!
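A small made-up example illustrates the caution: below, y is determined exactly by x through a quadratic relationship, yet the Pearson coefficient is zero. The data and the helper name `pearson_r` are illustrative assumptions.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a perfect curvilinear (quadratic) relationship
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r = pearson_r(x, y)
print(r)  # the linear correlation is zero despite the exact relationship
```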
Regression Analysis
Regression analysis is used to:
– Predict the value of a dependent variable based
on the value of at least one independent
variable.
– Explain the impact of changes in an independent
variable on the dependent variable.
Dependent variable: the variable we wish to explain.
Independent variable: the variable used to explain the dependent variable.
Simple Linear Regression Model
• Only one independent variable, X
• Relationship between X and Y is
described by a linear function.
• Changes in Y are assumed to be
caused by changes in X.
• The formula for a simple linear regression is
$$ y = \beta_0 + \beta_1 x + \varepsilon $$
where y is the dependent variable, x is the independent variable, β0 is the population y intercept, β1 is the population slope coefficient, and ε is the random error component. The term β0 + β1x is the linear component.
The regression coefficients β0 and β1 are unknown and have to be
estimated from the observed data (sample).
[Figure: the line y = β0 + β1x plotted against x, with intercept β0 and slope β1. For a given xi, the random error term (or residual) εi is the vertical distance between the observed value of y and the predicted value of y on the line.]
Linear Regression Assumptions
• The assumption of linearity
– The relationship between the dependent and independent
variables is linear.
• The assumption of homoscedasticity
– The errors have the same variance
• The assumption of independence
– The errors are independent of each other
• The assumption of normality
– The errors are normally distributed
Estimated Regression Model
The sample regression line provides an estimate of
the population regression line:
$$ \hat{y}_i = b_0 + b_1 x_i $$
where ŷi is the estimated (or predicted) y value, b0 is the estimate of the regression intercept, b1 is the estimate of the regression slope, and xi is the independent variable.
The individual random error terms ei have a mean of zero.
Least Squares Method
• b0 and b1 are called the regression
coefficients and are obtained by finding the
values of b0 and b1 that minimize the
sum of the squared residuals:
$$ \sum e^2 = \sum (y - \hat{y})^2 = \sum \left(y - (b_0 + b_1 x)\right)^2 $$
A residual e is defined as the difference between the observed y and the
fitted ŷ, that is, e = y − ŷ. Residuals are interpreted as estimates of the random
errors ε.
The Least Squares Equation
• The formulas for b1 and b0 are:
$$ b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} $$
$$ b_0 = \bar{y} - b_1 \bar{x} $$
• b0 is the estimated average value of y when the value of x is
zero
• b1 is the estimated change in the average value of y as a
result of a one-unit change in x
• The coefficients b0 and b1 will usually be found using
computer software, such as SPSS.
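For readers without SPSS, the two formulas are simple enough to code directly. The following is a minimal sketch in plain Python; the function name `least_squares_fit` and the data are hypothetical and chosen only to show the arithmetic.

```python
def least_squares_fit(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1

# Hypothetical illustrative values (not from the lecture)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares_fit(x, y)
print(round(b0, 3), round(b1, 3))
```

Because b0 = ȳ − b1·x̄, the fitted line always passes through the point (x̄, ȳ).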
Relationship between the Regression Coefficient
(b1) and the Correlation Coefficient (r)
• What is the relationship between the sample
regression coefficient (b1) and the sample
correlation coefficient (r)?
$$ r = b_1 \frac{s_x}{s_y} $$
where sx is the standard deviation of X and sy is the standard deviation of Y.
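The identity can be checked numerically. This sketch computes r and b1 separately from their definitions and then compares r with b1·sx/sy, using the sample standard deviation from Python's `statistics` module. The data are made up for illustration.

```python
import math
from statistics import stdev

# Hypothetical illustrative values (not from the lecture)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)   # correlation coefficient
b1 = sxy / sxx                   # regression slope
r_from_slope = b1 * stdev(x) / stdev(y)

print(round(r, 6), round(r_from_slope, 6))
```

Since sx and sy are both positive, the identity also shows that r and b1 always have the same sign.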
Example
• Use the previous example, taking
birth weight as the dependent
variable and gestational age as the
independent variable.
• Fit a linear regression line relating
birth weight to gestational age using
these data.
• Predict the birth weight of a baby
born at a gestational age of
40.5 weeks.
The estimated coefficients are b0 = -4020.054 and b1 = 180.455, so the fitted line is
birth weight = -4020.054 + 180.455 × (gestational age)
Coefficient of Determination, R²
• The coefficient of determination is the
portion of the total variation in the
dependent variable that is explained by
variation in the independent variable.
• The coefficient of determination is also
called R-squared and is denoted as R².
$$ R^2 = r^2 $$
• R² = Explained variation / Total variation
• R² is usually expressed as a percentage and always lies between 0% and 100%:
• 0% indicates that the model explains none of the
variability of the response data around its mean.
• 100% indicates that the model explains all the
variability of the response data around its mean.
• In general, the higher the R-squared, the better the
model fits your data.
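The ratio of explained to total variation can be computed from a fitted line as sketched below. The function name `r_squared` and the data are hypothetical. Here the explained variation is taken as the sum of squared deviations of the fitted values from the mean of y, and the total variation as the sum of squared deviations of the observed values from that mean.

```python
def r_squared(x, y):
    """Explained variation / total variation for a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    explained = sum((f - y_bar) ** 2 for f in fitted)   # regression sum of squares
    total = sum((yi - y_bar) ** 2 for yi in y)          # total sum of squares
    return explained / total

# Hypothetical illustrative values (not from the lecture)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r2 = r_squared(x, y)
print(f"{100 * r2:.1f}% of the variation in y is explained by x")
```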
r² = 0.668
66.8% of the variation in birth
weight is explained by variation
in gestational age (in weeks).
F-test for Simple Linear Regression
• The criterion for goodness of fit is the ratio
of the regression sum of squares to the
residual sum of squares.
• A large ratio indicates a good fit, whereas a
small ratio indicates a poor fit.
• In hypothesis-testing terms we want to test
the hypothesis:
H0: β1 = 0 vs. H1: β1 ≠ 0
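In practice the ratio is formed from mean squares: each sum of squares is divided by its degrees of freedom (1 for the regression, n − 2 for the residual) before taking the ratio. The sketch below assumes this standard form of the F statistic; the function name `f_statistic` and the data are hypothetical.

```python
def f_statistic(x, y):
    """F = (regression SS / 1) / (residual SS / (n - 2)) for simple linear regression."""
    n = len(x)
    if n < 3:
        raise ValueError("need at least 3 observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    reg_ss = sum((f - y_bar) ** 2 for f in fitted)
    res_ss = sum((yi - f) ** 2 for yi, f in zip(y, fitted))
    return (reg_ss / 1) / (res_ss / (n - 2))

# Hypothetical illustrative values (not from the lecture)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
F = f_statistic(x, y)
print(round(F, 3))
```

Under H0 the statistic follows an F distribution with 1 and n − 2 degrees of freedom. In simple linear regression it equals the square of the t statistic for testing ρ = 0.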
The P-value is less than 0.05.
Therefore H0 is rejected, implying a significant linear
relationship between birth weight and gestational age.
The predicted birth weight at a gestational age of 40.5 weeks is
birth weight = -4020.054 + 180.455 × (40.5)
= 3288.3735 g
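The prediction is a direct substitution into the fitted line. A minimal sketch, using the coefficients reported in the lecture; the names `B0`, `B1` and `predict_birth_weight` are illustrative.

```python
# Coefficients of the fitted line reported in the lecture
B0 = -4020.054   # intercept, in grams
B1 = 180.455     # slope, in grams per week of gestational age

def predict_birth_weight(gestational_age_weeks):
    """Predicted birth weight in grams from the fitted regression line."""
    return B0 + B1 * gestational_age_weeks

prediction = predict_birth_weight(40.5)
print(f"{prediction:.4f} g")
```

Predictions of this kind are best restricted to gestational ages within the range observed in the sample, since the fitted line says nothing reliable outside it.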
Checking the Regression Assumptions
There are two strategies for checking the
regression assumptions:
1. Examining the degree to which the variables
satisfy the criteria, e.g. normality and linearity,
before the regression is computed, by plotting
relationships and computing diagnostic
statistics.
2. Studying plots of the residuals ei = Yi − Ŷi and
computing diagnostic statistics after the
regression has been computed.
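The second strategy starts from the residuals themselves. The sketch below fits a line to made-up data and collects (fitted value, residual) pairs, which are the points one would put on a residual plot. The function name `residual_pairs` is hypothetical.

```python
def residual_pairs(x, y):
    """Return a list of (fitted value, residual) pairs from a least squares fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual = observed value minus fitted value
    return [(b0 + b1 * xi, yi - (b0 + b1 * xi)) for xi, yi in zip(x, y)]

# Hypothetical illustrative values (not from the lecture)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
pairs = residual_pairs(x, y)
residuals = [e for _, e in pairs]

for fitted, e in pairs:
    print(f"fitted = {fitted:.2f}, residual = {e:+.2f}")
print("sum of residuals:", round(sum(residuals), 10))
```

For a least squares line with an intercept, the residuals always sum to zero, so a residual plot is read for patterns and changes in spread rather than for its overall level.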
Check Linearity assumption:
A scatter plot (or scatter diagram) is used to show the relationship
between two variables.
Check Independence assumption:
Error terms associated with individual observations should be
independent of each other. Rule of thumb: Random samples ensure
independence.
A scatterplot of residuals against predicted values should show no trends.
Check Equal Variance Assumption (Homoscedasticity):
Variability of error terms should be the same (constant) for all values of
each predictor.
Check 1: Scatterplot of residuals against the predicted value shows
consistent spread.
Check 2: Boxplot of y against each predictor x should show consistent
spread.
Check Normality Assumption:
Check normality of residuals and individual variables, and
identify outliers, using a normal probability plot.
• Run normality tests. All or almost all of them should
have P-value > 0.05.
• Plot a histogram of residuals. A bell-shaped curve centered around
zero should be displayed.
• Construct a normal probability plot (Q-Q plot) of residuals.
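The coordinates of a normal probability plot can be computed without a plotting package. This is a sketch under stated assumptions: residuals are standardized by their sample mean and standard deviation, and the plotting positions (i − 0.5)/n are one common convention among several. The function name `normal_qq_points` and the residual values are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(residuals):
    """Return (theoretical quantile, standardized residual) pairs for a Q-Q plot."""
    n = len(residuals)
    ordered = sorted(residuals)
    m = mean(ordered)
    s = stdev(ordered)
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    points = []
    for i, e in enumerate(ordered, start=1):
        p = (i - 0.5) / n                    # assumed plotting position
        theoretical = std_normal.inv_cdf(p)  # standard normal quantile at p
        points.append((theoretical, (e - m) / s))
    return points

# Hypothetical residuals (not from the lecture)
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]
points = normal_qq_points(residuals)
for q, z in points:
    print(f"theoretical = {q:+.3f}, observed = {z:+.3f}")
```

If the residuals are roughly normal, the points fall close to the line observed = theoretical; systematic curvature or isolated points far from the line suggest non-normality or outliers.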