Chapter 7 Scatterplots, Association, and Correlation
Examining Relationships
Relationship between two variables
Examples:
• Height and Weight
• Alcohol and Body Temperature
• SAT Verbal Score and SAT Math Score
• High School GPA and College GPA
Two Types of Variables
Response Variable (Dependent)
• Measures an outcome of the study.
Explanatory Variable (Independent)
• Used to explain the response variable.
Example: Alcohol and Body Temp
Explanatory Variable: Alcohol
Response Variable: Body Temperature
Two Types of Variables
Calling a variable explanatory does not mean that it causes the response variable.
It only helps explain the response.
Sometimes there are no true response or explanatory variables.
Ex. Height and Weight; SAT Verbal and SAT Math Scores
Graphing Two Variables
Plot of explanatory variable vs. response
variable
Explanatory variable goes on horizontal axis (x)
Response variable goes on vertical axis (y)
If response and explanatory variables do not exist,
you can plot the variables on either axis.
This plot is called a scatterplot
This plot can only be used if explanatory and
response variables are both quantitative.
Scatterplots
Scatterplots show patterns, trends, and
relationships.
When interpreting a scatterplot (i.e.,
describing the relationship between two
variables) always look at the following:
Overall Pattern
• Form
• Direction
• Strength
Deviations from the Pattern
• Outliers
Interpreting Scatterplots
Form
Is the plot linear or is it curved?
Strength
Does the plot follow the form very closely or
is there a lot of scatter (variation)?
Interpreting Scatterplots
Direction
Is the plot increasing or is it decreasing?
Positively Associated
• Above (below) average values of one variable tend to be associated with above (below) average values of the other variable.
Negatively Associated
• Above (below) average values of one variable tend to be associated with below (above) average values of the other variable.
Example – Scatterplot
The following survey was conducted in
the U.S. and in 10 countries of Western
Europe to determine the percentage of
teenagers who had used marijuana and
other drugs.
Example – Scatterplot
Percent who have used

Country            Marijuana   Other Drugs
Czech Republic     22          4
Denmark            17          3
England            40          21
Finland            5           1
Ireland            37          16
Italy              19          8
Northern Ireland   23          14
Norway             6           3
Portugal           7           3
Scotland           53          31
United States      34          24
Example – Scatterplot
[Figure: scatterplot titled "Percent who have used Marijuana vs Other Drugs", with Marijuana on the x-axis (0 to 60) and Other Drugs on the y-axis (0 to 35)]
Example – Scatterplot
The variables are interchangeable in this
example.
In this example, Percent of Marijuana is
being used as the explanatory variable (since
it is on the x-axis).
Percent of Other Drugs is being used as the
response since it is on the y-axis.
Example - Scatterplot
The form is linear
The strength is fairly strong
The direction is positive since larger
values on the x-axis tend to go with larger values on
the y-axis
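One rough way to check the positive direction without a plot is to count how many countries fall on the same side of both averages. The sketch below uses the survey table from this example; the counting rule is an illustration of the "above (below) average" definition, not a method stated on the slides.

```python
# Survey data from the example: (country, % marijuana, % other drugs)
survey = [
    ("Czech Republic", 22, 4), ("Denmark", 17, 3), ("England", 40, 21),
    ("Finland", 5, 1), ("Ireland", 37, 16), ("Italy", 19, 8),
    ("Northern Ireland", 23, 14), ("Norway", 6, 3), ("Portugal", 7, 3),
    ("Scotland", 53, 31), ("United States", 34, 24),
]

xs = [marijuana for _, marijuana, _ in survey]
ys = [other for _, _, other in survey]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

same_side = 0      # above average on both, or below average on both
opposite_side = 0  # above average on one, below average on the other
for x, y in zip(xs, ys):
    product = (x - x_bar) * (y - y_bar)
    if product > 0:
        same_side += 1
    elif product < 0:
        opposite_side += 1

print(same_side, opposite_side)  # 10 1
```

Ten of the eleven countries sit on the same side of both averages, which agrees with a positive association.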
Example - Scatterplot
Negative association: outside temperature and amount of natural gas used
[Figure: scatterplot with Temp on the x-axis (-5.0 to 15.0) and Gas on the y-axis (0 to 10)]
Correlation
The strength of the linear relationship
between two quantitative variables can be
described numerically
This numerical method is called
correlation
Correlation is denoted by r
Correlation
A way to measure the strength of the
linear relationship between two
quantitative variables.
$$r = \frac{1}{n-1} \sum \frac{(x - \bar{x})(y - \bar{y})}{s_x s_y}$$
Correlation
Steps to calculate correlation:
Calculate the mean of x and y
Calculate the standard deviation for x and y
Calculate $\sum (x - \bar{x})(y - \bar{y})$
Plug all numbers into formula
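The steps above can be sketched as a small Python function. The function name `correlation` and the error handling are illustrative choices, not part of the slides.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r, following the four steps on the slide."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")

    # Step 1: means of x and y
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Step 2: standard deviations of x and y (dividing by n - 1)
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))

    # Step 3: sum of the products of deviations
    sum_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    # Step 4: plug everything into the formula
    # (fails with ZeroDivisionError if x or y has no variation)
    return sum_products / ((n - 1) * s_x * s_y)

# Points that lie exactly on an increasing line should give r = 1
print(correlation([1, 2, 3, 4], [3, 5, 7, 9]))
```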
Correlation
[Figure: scatterplot titled "Femur vs. Humerus", with Femur on the x-axis (0 to 80) and Humerus on the y-axis (0 to 100)]
Calculating r.
Femur (x):    38  56  59  63  74
Humerus (y):  41  63  70  72  84
Set up a table with columns for $x$, $y$, $x - \bar{x}$, $y - \bar{y}$, $(x - \bar{x})^2$, $(y - \bar{y})^2$, and $(x - \bar{x})(y - \bar{y})$.
Calculating r.
x     y     x - x̄   y - ȳ   (x - x̄)²   (y - ȳ)²   (x - x̄)(y - ȳ)
38    41    -20     -25     400        625        500
56    63    -2      -3      4          9          6
59    70    1       4       1          16         4
63    72    5       6       25         36         30
74    84    16      18      256        324        288
Sum:
290   330   0       0       686        1010       828
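The table and its column sums can be reproduced with a short script. This is a minimal sketch using the femur and humerus values given above; the variable names are illustrative.

```python
femur = [38, 56, 59, 63, 74]    # x
humerus = [41, 63, 70, 72, 84]  # y

n = len(femur)
x_bar = sum(femur) / n    # 58.0
y_bar = sum(humerus) / n  # 66.0

# One row per observation:
# x, y, x - x_bar, y - y_bar, (x - x_bar)^2, (y - y_bar)^2, product of deviations
rows = []
for x, y in zip(femur, humerus):
    dx = x - x_bar
    dy = y - y_bar
    rows.append((x, y, dx, dy, dx ** 2, dy ** 2, dx * dy))

# Column sums (the last row of the table)
totals = [sum(column) for column in zip(*rows)]

for row in rows:
    print(row)
print("Sum:", totals)  # Sum: [290, 330, 0.0, 0.0, 686.0, 1010.0, 828.0]
```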
Calculating r
Recall: $\bar{y} = \frac{\sum y}{n}$
So,
$\bar{x} = \frac{290}{5} = 58$
$\bar{y} = \frac{330}{5} = 66$
Calculating r
Recall: $s_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n-1}}$
So,
$s_x = \sqrt{\frac{686}{4}} = 13.1$
$s_y = \sqrt{\frac{1010}{4}} = 15.9$
Calculating r.
Put everything into the formula:
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n-1)\, s_x s_y} = \frac{828}{(5-1)(13.1)(15.9)} = 0.994$$
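The arithmetic can be checked in a few lines of Python. The sketch below starts from the column sums in the table (686, 1010, 828) and computes r twice: once with the standard deviations rounded to one decimal, as in the hand calculation, and once without rounding. The unrounded value comes out slightly higher, at about 0.9947, so the small gap is just rounding.

```python
from math import sqrt

n = 5
sum_sq_x = 686      # sum of (x - x_bar)^2
sum_sq_y = 1010     # sum of (y - y_bar)^2
sum_products = 828  # sum of (x - x_bar)(y - y_bar)

s_x = sqrt(sum_sq_x / (n - 1))  # about 13.1
s_y = sqrt(sum_sq_y / (n - 1))  # about 15.9

# Hand calculation: standard deviations rounded to one decimal place
r_rounded = sum_products / ((n - 1) * round(s_x, 1) * round(s_y, 1))

# Same formula without intermediate rounding
r_exact = sum_products / ((n - 1) * s_x * s_y)

print(round(r_rounded, 3))  # 0.994
print(round(r_exact, 4))    # 0.9947
```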
Properties of r
r has no units (i.e., just a number)
Measures the strength of a LINEAR
association between two quantitative
variables
If the data have a curvilinear relationship, the
correlation may not be strong even if the data
follow the curve very closely.
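A small made-up example illustrates the point about curved data. The points below lie exactly on the curve y = x², yet r is 0 because there is no linear trend. The data are hypothetical, chosen only for illustration, and the helper uses a form that is algebraically the same as the correlation formula.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r, written as Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: every point lies exactly on the parabola y = x^2
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = correlation(xs, ys)
print(abs(r) < 1e-12)  # True: no linear relationship despite a perfect curve
```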
Properties of r
r always ranges in values from –1 to 1
r = 1 indicates a straight increasing line
r = -1 indicates a straight decreasing line
r = 0 indicates no LINEAR relationship
As r moves away from 0, the linear
relationship between variables is stronger
Properties of r
Changing the units (scale) of x or y will not
change the value of r
r is not resistant to outliers
Strong correlation ≠ Causation
Strong linear relationship between two
variables is NOT proof of a causal
relationship!
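Two of these properties can be checked numerically with the femur and humerus data. In the sketch below, the rescaling factors and the extra point (80, 20) are hypothetical, made up only to illustrate a change of units and an outlier.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation r, written as Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

femur = [38, 56, 59, 63, 74]
humerus = [41, 63, 70, 72, 84]
r_original = correlation(femur, humerus)

# Change of scale (hypothetical conversions): multiply x by 10,
# and transform y to 0.5 * y + 3. r should not change.
femur_rescaled = [10 * x for x in femur]
humerus_rescaled = [0.5 * y + 3 for y in humerus]
r_rescaled = correlation(femur_rescaled, humerus_rescaled)

# One hypothetical outlier far from the linear pattern: (80, 20)
r_with_outlier = correlation(femur + [80], humerus + [20])

print(round(r_original, 4))      # 0.9947
print(round(r_rescaled, 4))      # 0.9947
print(r_with_outlier < 0.1)      # True: a single point destroys the correlation
```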
Reading JMP Output
The following is some output from JMP
where I considered Blood Alcohol Content
and Number of Beers. The explanatory
variable is the number of beers. Blood
alcohol content is the response variable.
Reading JMP Output
[Figure: JMP plot "Bivariate Fit of BAC By Beers", with Beers on the x-axis (0 to 10) and BAC on the y-axis (0 to 0.2)]
Reading JMP Output
Summary of Fit
RSquare                       0.803536
RSquare Adj                   0.788424
Root Mean Square Error        0.02092
Mean of Response              0.076
Observations (or Sum Wgts)    15
Reading JMP Output
RSquare = $r^2$
This means $r = \sqrt{\text{RSquare}} = \sqrt{0.803536} = 0.896$
I know this is positive because the
scatterplot has a positive direction.
The Mean of the Response is the mean of
the y's, or $\bar{y}$
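Recovering r from RSquare takes one line of arithmetic plus a sign decision. In the sketch below the RSquare value is taken from the output above, and the flag `direction_is_positive` is a hypothetical name for what is read off the scatterplot.

```python
from math import sqrt

r_square = 0.803536           # RSquare from the Summary of Fit
direction_is_positive = True  # judged from the scatterplot, not from RSquare

# RSquare alone cannot tell the sign of r, so the plot's direction supplies it
r = sqrt(r_square) if direction_is_positive else -sqrt(r_square)

print(round(r, 3))  # 0.896
```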