Correlation and Linear Regression

Correlation and Covariance
Overview
Outcome, Dependent Variable (Y-Axis): Height, Continuous (Histogram)
Predictor Variable (X-Axis): Continuous (Scatter) or Categorical (Boxplot)
Correlation
Covariance is High: r ~1
Covariance is Low: r ~0
Things to Know about the Correlation
• It varies between -1 and +1
• 0 = no relationship
• It is an effect size
• ±.1 = small effect
• ±.3 = medium effect
• ±.5 = large effect
• Coefficient of determination, r²
• By squaring the value of r you get the proportion of variance in one variable shared by the other.
Variables
Dependent Variable (Y): Height
Independent Variables (X's): X1, X2, X3, X4
Little Correlation
Correlation is For Linear Relationships
Outliers Can Skew Correlation Values
Correlation and Regression Are Related
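The two cautions above (linear relationships only, sensitivity to outliers) can be checked numerically. The following is a minimal sketch in base R; all of the vectors are made up for illustration and do not come from the slides.

```r
# Illustrative data only; none of these values come from the slides.

# 1. A perfect but non-linear (quadratic) relationship.
x <- -5:5
y <- x^2
r_curve <- cor(x, y)    # Pearson r only measures linear association

# 2. Six points with no linear relationship, then one extreme point added.
x2 <- c(1, 2, 3, 4, 5, 6)
y2 <- c(2, 3, 1, 1, 3, 2)
r_before <- cor(x2, y2)
r_after  <- cor(c(x2, 30), c(y2, 30))   # a single outlier dominates r

print(round(c(curve = r_curve, before = r_before, after = r_after), 3))
```

In this sketch y is completely determined by x in the first case, yet r is essentially zero, and in the second case one added point moves r from about zero to close to +1. Plotting the data before trusting r is the usual safeguard.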
Covariance
$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$
[Figure: X and Y plotted for each person]
Persons 2, 3, and 5 look to have similar magnitudes from their means.
Covariance
$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$
$$= \frac{(-0.4)(-3) + (-1.4)(-2) + (-1.4)(-1) + (0.6)(2) + (2.6)(4)}{4}$$
$$= \frac{1.2 + 2.8 + 1.4 + 1.2 + 10.4}{4} = \frac{17}{4} = 4.25$$
• Calculate the error [deviation] between the mean and each subject's score for the first variable (x).
• Calculate the error [deviation] between the mean and their score for the second variable (y).
• Multiply these error values.
• Add these values and you get the sum of cross-product deviations.
• The covariance is the average of the cross-product deviations (the sum divided by N - 1).
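The steps above can be sketched in base R. The raw scores below are an assumption: they are not given on the slides and were chosen only so that their deviations from the mean match the worked example (-0.4, -1.4, -1.4, 0.6, 2.6 and -3, -2, -1, 2, 4).

```r
# Assumed raw scores (not from the slides), chosen so the deviations
# from the mean match the worked example.
x <- c(5, 4, 4, 6, 8)
y <- c(8, 9, 10, 13, 15)

dev_x <- x - mean(x)               # deviation of each x score from its mean
dev_y <- y - mean(y)               # deviation of each y score from its mean
cross_products <- dev_x * dev_y    # multiply the paired deviations
sum_cp <- sum(cross_products)      # sum of cross-product deviations
cov_manual <- sum_cp / (length(x) - 1)   # divide by N - 1

print(cov_manual)
print(cov(x, y))                   # built-in function, for comparison
```

The manual result and the built-in cov() should agree, which is a quick way to confirm that cov() uses the N - 1 denominator.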
Covariance
Do they VARY the same way relative to their own means?
Age: 7, 4, 6, 8, 8, 7, 5, 9, 7, 8, 9, 8, 9, 8, 3, 3, 8, 1, 3, 6
Income, Education: 4, 3, 1, 8, 3, 5, 6, 1, 5, 7, 2, 9, 3, 3, 5, 8, 4, 5, 2, 2, 5, 2, 4, 2, 2, 3, 4, 7, 1, 4, 1, 3, 2, 6, 2, 5, 1, 7, 3, 3
2.47
[Figure: scatter plot of Age vs. Income]
[Figure: Delta A and Delta I for cases 1 to 20]
$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$
Limitations of Covariance
• It depends upon the units of measurement.
• E.g. the covariance of two variables measured in miles might be 4.25, but if the same scores are converted to kilometres, the covariance is 11.
• One solution: standardize it (normalize the data).
• Divide by the standard deviations of both variables.
• The standardized version of covariance is known as the correlation coefficient.
• It is relatively unaffected by units of measurement.
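The unit dependence can be illustrated with a short base R sketch. The scores are assumed values (not given on the slides) whose covariance happens to be 4.25; they are treated as distances in miles and then converted to kilometres.

```r
# Assumed scores (not from the slides), treated here as miles.
miles_x <- c(5, 4, 4, 6, 8)
miles_y <- c(8, 9, 10, 13, 15)

km_per_mile <- 1.609344
km_x <- miles_x * km_per_mile
km_y <- miles_y * km_per_mile

cov_miles <- cov(miles_x, miles_y)   # covariance in squared miles
cov_km    <- cov(km_x, km_y)         # same data in squared kilometres

cor_miles <- cor(miles_x, miles_y)   # correlation is unit-free
cor_km    <- cor(km_x, km_y)

print(round(c(cov_miles = cov_miles, cov_km = cov_km), 2))
print(round(c(cor_miles = cor_miles, cor_km = cor_km), 2))
```

Rescaling both variables by a constant multiplies the covariance by that constant squared, while the correlation stays the same because the standard deviations are rescaled by the same factor.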
The Correlation Coefficient
$$r = \frac{\mathrm{cov}_{xy}}{s_x s_y} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{(N - 1)\, s_x s_y}$$
$$r = \frac{\mathrm{cov}_{xy}}{s_x s_y} = \frac{4.25}{1.67 \times 2.92} = .87$$
Correlation
Need inter-item/variable correlations > .30
Data Structures
Numeric Vector: a <- c(1,2,5.3,6,-2,4)
Character Vector: b <- c("one","two","three")
Matrix: y <- matrix(1:20, nrow=5, ncol=4)
Dataframe:
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed")
List: w <- list(name="Fred", age=5.3)
Framework Source: Hadley Wickham
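A short sketch of how these structures can be inspected once created. It rebuilds the objects from the slide so that it runs on its own; the stringsAsFactors = FALSE argument is an addition here, so that the Color column stays a character vector on older R versions as well.

```r
# Rebuild the objects from the slide so this block is self-contained.
y <- matrix(1:20, nrow = 5, ncol = 4)
w <- list(name = "Fred", age = 5.3)
mydata <- data.frame(
  d = c(1, 2, 3, 4),
  e = c("red", "white", "red", NA),
  f = c(TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE   # added here; keeps 'e' as character
)
names(mydata) <- c("ID", "Color", "Passed")

print(dim(y))        # number of rows and columns in the matrix
print(y[2, 3])       # row 2, column 3 (matrices are filled column by column)
print(w$name)        # list elements are accessed by name
print(mydata$Color)  # a data frame column is itself a vector

col_types <- sapply(mydata, class)   # each column can have its own type
print(col_types)
```

The point of the data frame is visible in the last line: unlike a matrix, it can hold numeric, character and logical columns side by side.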
Correlation Matrix
Correlation and Covariance
$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$
Revisiting the Height Dataset
Galton: Height Dataset
cor(heights)
Error in cor(heights) : 'x' must be numeric
The cor() function does not handle Factors; Excel's correl() does not either.
Initial workaround: Create data.frame without the Factors
h2 <- data.frame(h$father,h$mother,h$avgp,h$childNum,h$kids)
Later we will recode the Factor variable as 0/1.
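One way to generalize the workaround is to select the numeric columns programmatically instead of listing them by hand. The sketch below uses a small made-up data frame, heights_demo, as a stand-in for the Galton data; its column names and values are hypothetical.

```r
# Hypothetical stand-in for the Galton data; all values are made up.
heights_demo <- data.frame(
  father = c(78.5, 75.5, 75.0, 69.0, 70.0, 68.0),
  mother = c(67.0, 66.5, 64.0, 66.0, 62.5, 65.0),
  gender = factor(c("M", "F", "M", "F", "M", "F")),
  height = c(73.2, 69.0, 71.0, 64.5, 70.0, 63.0)
)

# Calling cor() on the full data frame fails because 'gender' is a factor.
attempt <- try(cor(heights_demo), silent = TRUE)
print(inherits(attempt, "try-error"))

# Keep only the numeric columns, then compute the correlation matrix.
is_num     <- sapply(heights_demo, is.numeric)   # TRUE for numeric columns
num_only   <- heights_demo[, is_num]
cor_matrix <- cor(num_only)
print(round(cor_matrix, 2))
```

This avoids retyping column names when the data frame changes, at the cost of silently dropping every non-numeric column, so it is worth checking which columns were excluded.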
Histogram of Correlation Coefficients (range -1 to +1)
Correlations Matrix: Both Types
library(car)
scatterplotMatrix(heights)
Zoom in on Gender
Correlation Matrix for Continuous Variables
PerformanceAnalytics package
chart.Correlation(num2)
Categorical: Revisit Box Plot
Correlation will depend on spread of distributions.
Note there is an equation here: Y = mx + b
Factors/Categorical work with Boxplots; however, some functions are not set up to handle Factors.
Manual Calculation: Note Stdev is Lower
Note that with 0 and 1 the deltas from the mean are small, so the standard deviation is lower, whereas the continuous variable has a lot of variation (spread).
Categorical: Recode!
Gender recoded as 0 = Female, 1 = Male
@correl does not work with Factor Variables
Formula now works!
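The recoding step can be sketched in base R as follows. The gender and height values are hypothetical (they are not the Galton data), and ifelse() is only one of several ways to build the 0/1 variable.

```r
# Hypothetical data; not the Galton values.
gender <- factor(c("F", "M", "F", "M", "F", "M", "F", "M"))
height <- c(63.0, 70.5, 64.5, 69.0, 62.0, 72.0, 65.5, 71.0)

# Recode the factor: 0 = Female, 1 = Male
male <- ifelse(gender == "M", 1, 0)

r <- cor(male, height)             # works now that both inputs are numeric
result <- cor.test(male, height)   # Pearson test by default

print(r)
print(result$estimate)             # same r, reported by cor.test()
print(result$p.value)
```

A Pearson correlation between a 0/1 variable and a continuous one is often called a point-biserial correlation; its sign simply reflects which group was coded 1.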
Correlation: Continuous & Discrete
More examples of cor.test()
Correlation  Regression
Summary
Outcome, Dependent Variable (Y-Axis) | Continuous | Categorical (Yes/No, 1/0)
Summary statistics | Mean, Median, Standard Deviation | Proportions, Frequency
Plot | Histogram | Pie, Bar
Predictor Variable (X-Axis), Continuous (Parents Height) | Scatter | Cross Table
Predictor Variable (X-Axis), Categorical (Gender) | Boxplot | Cross Table, Mosaic
Regression Model | Linear Regression | Logistic Regression