Correlations and Linear Regression
Correlation and Linear Regression
Microbiology 3053
Microbiological Procedures
Correlation
Correlation analysis is used when you have measured two continuous variables and want to quantify how consistently they vary together
The stronger the correlation, the more accurately the value of one variable can be estimated from the other
The direction and magnitude of a correlation are quantified by Pearson's correlation coefficient, r
Ranges from perfectly negative (-1.00) through no relationship (0.00) to perfectly positive (1.00)
Correlation
The closer |r| is to 1, the stronger the relationship
r = 0 means that knowing the value of one variable tells us nothing about the value of the other
Correlation analysis uses data that have already been collected (archival data, not produced by experimentation)
Correlation does not show cause and effect but may suggest such a relationship
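As a quick illustration, here is a minimal sketch of computing Pearson's r in Python with SciPy; the paired measurements below are invented for the example.

```python
# Minimal sketch of Pearson's r; the paired measurements are invented.
import numpy as np
from scipy import stats

var1 = np.array([36.1, 36.5, 36.8, 37.0, 37.4, 37.9])  # hypothetical variable 1
var2 = np.array([68.0, 71.0, 73.0, 74.0, 78.0, 83.0])  # hypothetical variable 2

r, p = stats.pearsonr(var1, var2)
print(f"r = {r:.3f}, p = {p:.4f}")  # r close to 1: strong positive correlation
```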
Correlation ≠ Causation
There is a strong, positive correlation
between
the number of churches and bars in a town
smoking and alcoholism (consider the
relationship between smoking and lung
cancer)
students who eat breakfast and school
performance
marijuana usage and heroin addiction (vs
heroin addiction and marijuana usage)
Visualizing Correlation
Scatterplots are used to illustrate correlation analysis
Assignment of axes does not matter (no independent and dependent variables)
Order in which data pairs are plotted does not matter
In strict usage, lines are not drawn through correlation scatterplots (see the sketch below)
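A correlation scatterplot can be sketched as follows, using invented data; note that no line is drawn through the points and the axis assignment is arbitrary.

```python
# Sketch of a correlation scatterplot: points only, no fitted line.
import matplotlib.pyplot as plt

x = [36.1, 36.5, 36.8, 37.0, 37.4, 37.9]  # hypothetical variable 1
y = [68, 71, 73, 74, 78, 83]              # hypothetical variable 2

plt.scatter(x, y)           # which variable goes on which axis does not matter
plt.xlabel("Variable 1")
plt.ylabel("Variable 2")
plt.show()
```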
Correlations
[Three scatterplots: strong negative correlation (r = -0.9960), weak positive correlation (r = 0.266), and no correlation (r = 0.00)]
Linear Regression
Used to measure the relationship between two variables when prediction and a cause-and-effect relationship are of interest
Does one variable change in a consistent manner with another variable?
x = independent variable (cause)
y = dependent variable (effect)
If it is not clear which variable is the cause and which is the effect, linear regression is probably an inappropriate test
Linear Regression
Calculated from experimental data
Independent variable is under the control of the investigator (exact value)
Dependent variable is normally distributed
Differs from correlation, where both variables are normally distributed and selected at random by the investigator
Regression analysis with more than one independent variable is termed multiple (linear) regression
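A simple linear regression of this kind can be sketched in Python with scipy.stats.linregress; the x values below stand in for levels chosen by the investigator, and the y values are invented responses.

```python
# Minimal sketch of simple linear regression with SciPy (invented data).
import numpy as np
from scipy import stats

x = np.array([0, 10, 20, 30, 40, 50])               # independent variable (chosen)
y = np.array([9.2, 18.1, 31.5, 37.8, 47.0, 60.3])   # dependent variable (measured)

fit = stats.linregress(x, y)
print(f"y = {fit.slope:.3f}x + {fit.intercept:.3f}")
print(f"r = {fit.rvalue:.3f}, r² = {fit.rvalue**2:.3f}, p = {fit.pvalue:.4f}")
```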
Linear Regression
The best-fit line minimizes the sum of the squares of the distances of the data points from the predicted values (on the line)
[Scatterplot of dependent variable vs. independent variable with best-fit line: y = 1.0092x + 8.6509, R² = 0.8863]
Linear Regression
y = a + bx where
a = y-intercept (the value of y where x = 0 and the line passes through the y-axis)
b = slope of the line ((y2 - y1)/(x2 - x1))
The slope indicates the nature of the correlation
Positive = y increases as x increases
Negative = y decreases as x increases
0 = no correlation (no relationship between the variables), as with Pearson's correlation
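For reference, here is a minimal sketch of the least-squares formulas that produce a and b, applied to invented data.

```python
# Least-squares slope and intercept for y = a + bx:
# b = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²),  a = ȳ - b·x̄
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # invented data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(f"y = {a:.3f} + {b:.3f}x")   # the sign of b shows the direction
```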
Correlation Coefficient (r)
Shows the strength of the linear relationship between two variables, symbolized by r
The closer the data points are to the line, the closer r is to 1 or -1
r varies from -1 (perfect negative correlation) to 1 (perfect positive correlation)
Using |r|:
0.0-0.2 no or very weak association
0.2-0.4 weak association
0.4-0.6 moderate association
0.6-0.8 strong association
0.8-1.0 very strong to perfect association
The null hypothesis is no association (r = 0)
Salkind, N. J. 2000. Statistics for people who think they hate statistics. Thousand Oaks, CA: Sage.
Coefficient of Determination (r²)
Used to estimate the extent to which the dependent variable (y) is under the influence of the independent variable (x)
r² is the square of the correlation coefficient
Varies from 0 to 1
r² = 1 means that the value of y is completely dependent on x (no error or other contributing factors)
r² < 1 indicates that the value of y is influenced by more than the value of x
Coefficient of Determination
A measurement of the proportion of the variance of y explained by its dependence on x
The remainder (1 - r²) is the variance of y that is not explained by x (i.e., error or other factors)
e.g., if r² = 0.84, there is a strong relationship between the variables, and the value of x predicts 84% of the variability of y (the other 16% is due to other factors)
r² can also be calculated for correlation analysis by squaring r, but there it is not a measure of the variation of y explained by variation in x; it measures only how strongly variation in y is associated with variation in x (and vice versa)
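The arithmetic of the r² = 0.84 example above, as a short sketch (note that r² alone does not give the sign of r):

```python
# r² = 0.84 leaves 1 - r² = 0.16 (16%) of the variance of y unexplained by x.
r = 0.917                # |r| consistent with r² ≈ 0.84; sign not implied by r²
r2 = r ** 2
print(f"explained: {r2:.2f}, unexplained: {1 - r2:.2f}")   # 0.84 and 0.16
```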
Assumptions of Linear Regression
Independent variable (x) is selected by the investigator (not random) and has no associated variance
For every value of x, values of y have a normal distribution
Observed values of y differ from the mean value of y by an amount called a residual (residuals are normally distributed)
The variances of y for all values of x are equal (homoscedasticity)
Observations are independent (each individual in the sample is measured only once)
Linear Regression Data
The numbers alone do not guarantee that the data have been fitted well!
Anscombe, F. J. 1973. Graphs in Statistical Analysis. The American Statistician 27(1):17-21.
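Anscombe's point can be checked directly: the sketch below hardcodes his four published datasets and fits each one, and all four return nearly identical regression statistics despite looking completely different when plotted.

```python
# Anscombe's quartet: four datasets with (nearly) the same regression line.
from scipy import stats

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    fit = stats.linregress(x, y)
    print(f"set {name}: y = {fit.slope:.2f}x + {fit.intercept:.2f}, r = {fit.rvalue:.3f}")
# All four print roughly y = 0.50x + 3.00 with r ≈ 0.816 -- plot them to see why.
```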
Linear Regression Data
[Four scatterplots, Figures 1-4, of the Anscombe quartet datasets]
Figure 1: An acceptable regression model, with observations distributed evenly around the regression line
Figure 2: Strong curvature suggests that linear regression may not be appropriate (an additional variable may be required)
Linear Regression Data
Figure 3: A single outlier alters the slope of the line. The point may be erroneous, but if it is not, a different test may be necessary
Figure 4: Actually a regression line connecting only two points. If the rightmost point were different, the regression line would shift
What if we're not sure if linear regression is appropriate?
Residuals
Homoscedastic
• Variance appears random
• Good regression model
Heteroscedastic
• "Funnel" shaped and may be bowed
• Suggests that a transformation and inclusion of additional variables may be warranted
Helsel, D.R., and R.M. Hirsch. 2002. Statistical Methods in Water Resources. USGS (http://water.usgs.gov/pubs/twri/twri4a3/)
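A residual plot of this kind can be sketched as follows, with invented data: a random band around zero suggests homoscedasticity, while a funnel or bowed shape suggests heteroscedasticity.

```python
# Sketch of a residual plot for judging homoscedasticity (invented data).
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([5, 10, 15, 20, 25, 30, 35, 40])
y = np.array([12.1, 18.9, 26.3, 31.8, 40.2, 44.9, 53.1, 58.7])

fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)   # observed minus predicted

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")    # look for a random band around this line
plt.xlabel("x")
plt.ylabel("Residual")
plt.show()
```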
[Residual plots for Data Sets 1-4: residuals vs. X Variable 1]
Outliers
Values that appear very different from others in the data set
Rule of thumb: an outlier is more than three standard deviations from the mean
Three causes:
Measurement or recording error
Observation from a different population
A rare event from within the population
Outliers need to be considered and not simply dismissed
May indicate an important phenomenon
e.g., ozone hole data (outliers were removed automatically by the analysis program, delaying the observation by about 10 years)
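A sketch of the three-standard-deviation rule of thumb, on invented data; in keeping with the ozone example, flagged values are only candidates to inspect, not to delete.

```python
# Flag values more than three standard deviations from the mean (invented data).
import numpy as np

values = np.array([9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1,
                   9.9, 10.0, 10.2, 9.8, 10.1, 10.0, 9.9, 14.9])
z = (values - values.mean()) / values.std(ddof=1)   # z-scores

flagged = values[np.abs(z) > 3]        # candidates only: inspect, don't dismiss
print("possible outliers:", flagged)   # -> [14.9]
```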
Outliers
Helsel, D.R., and R.M. Hirsch. 2002. Statistical Methods in Water Resources. USGS (http://water.usgs.gov/pubs/twri/twri4a3/)
When is Linear Regression Appropriate?
Data should be interval or ratio
The dependent and independent variables should be identifiable
The relationship between the variables should be linear (if not, a transformation might be appropriate)
Have you chosen the values of the independent variable?
Does the residual plot show a random spread (homoscedastic), and does the normal probability plot display a straight line (or does a histogram of residuals show a normal distribution)?
(Normal Probability Plot of Residuals)
The normal probability plot indicates whether the residuals follow a normal distribution, in which case the points will follow a straight line. Expect some moderate scatter even with normal data. Look only for definite patterns like an "S-shaped" curve, which indicates that a transformation of the response may provide a better analysis. (from Design Expert 7.0 from Stat-Ease)
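Such a plot can be drawn with scipy.stats.probplot; in this sketch the residuals are simulated for illustration rather than taken from a real fit.

```python
# Sketch of a normal probability plot of residuals. Roughly straight points
# suggest normally distributed residuals; an S-shape suggests otherwise.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(0, 1, size=50)   # stand-in for residuals from a fit

stats.probplot(residuals, plot=plt)     # draws the plot on the current axes
plt.show()
```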
(Histogram of Residuals Distribution)
Lineweaver-Burk Plot
The Michaelis-Menten equation describes enzyme activity:
v0 = Vmax[S] / (Km + [S])
It is linearized by taking its reciprocal:
1/v0 = (Km/Vmax)(1/[S]) + 1/Vmax
where: y = 1/v0
x = 1/[S]
a = 1/Vmax
b = Km/Vmax
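A sketch of recovering Vmax and Km from a Lineweaver-Burk fit: regress 1/v on 1/[S], then invert the intercept and scale the slope. The substrate and velocity values below are invented (they follow Vmax = 84, Km = 60).

```python
# Lineweaver-Burk fit: 1/v0 = (Km/Vmax)(1/[S]) + 1/Vmax  (invented data).
import numpy as np
from scipy import stats

S = np.array([10, 20, 40, 60, 80, 120])              # substrate levels
v = np.array([12.0, 21.0, 34.0, 42.0, 48.0, 56.0])   # initial velocities

fit = stats.linregress(1 / S, 1 / v)   # x = 1/[S], y = 1/v0
a, b = fit.intercept, fit.slope        # a = 1/Vmax, b = Km/Vmax
Vmax = 1 / a
Km = b * Vmax
print(f"Vmax = {Vmax:.1f}, Km = {Km:.1f}  (r² = {fit.rvalue**2:.3f})")
```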
Mock Enzyme Experiment
[Michaelis-Menten plot: v (pennies/min) vs. S (pennies/m^2)]
Mock Enzyme Experiment
[Lineweaver-Burk plot: 1/v (pennies/min)^-1 vs. 1/S (pennies/m^2)^-1; y = 0.7053x + 0.0076, R² = 0.9785]
Mock Enzyme Experiment
[Eadie-Hofstee plot: v (pennies/min) vs. v/S (m^2/min); y = -85.671x + 124.48, R² = 0.8543]
Mock Enzyme Experiment
[Residual plot: residuals vs. X Variable]
Mock Enzyme Experiment
[Normal probability plot: Y vs. sample percentile]