Chapter 14 - Wayne State University

Chapter 14
Describing Relationships:
Scatterplots and Correlation
Thought Question 1
For all cars manufactured in the U.S., there
is a positive correlation between the size
of the engine and horsepower. There is a
negative correlation between the size of
the engine and gas mileage. What does it
mean for two variables to have a positive
correlation or a negative correlation?
Scatterplot
A Scatterplot shows the relationship between two
quantitative variables measured on the same
individuals. The values of one variable appear on
the horizontal axis, and the values of the other
variable appear on the vertical axis. Each
individual in the data appears as the point in the
plot fixed by the values of both variables for that
individual.
Figure 14.9 Scatterplot of average SAT Mathematics score for each state against
the proportion of the state’s high school seniors who took the SAT. The light-colored
point corresponds to two states. (This figure was created using the Minitab software
package.)
Figure 14.3 Scatterplot of the life expectancy of people in many nations against each
nation’s gross domestic product per person. (This figure was created using the Minitab
software package.)
Examining a Scatterplot
In any graph of data, look for the overall
pattern and for striking deviations from that
pattern.
You can describe the overall pattern of a
scatterplot by the direction, form, and
strength of the relationship.
An important kind of deviation is an outlier,
an individual value that falls outside the
overall pattern of the relationship.
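Positive association, Negative association
Two variables are positively associated when above-average
values of one tend to accompany above-average values of the
other. They are negatively associated when above-average
values of one tend to accompany below-average values of the other.
The scatterplot slopes upward (positive association) or downward
(negative association) as we move from left to right.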
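The definition above can be checked numerically. The following is a minimal sketch, not a standard routine: the function name `association_direction` and the engine data are hypothetical, chosen only to echo Thought Question 1. A point contributes a positive product when both values sit on the same side of their means, and a negative product when they sit on opposite sides.

```python
from statistics import mean

def association_direction(xs, ys):
    """Return 'positive', 'negative', or 'none' from the summed deviation products."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Same side of both means -> positive product; opposite sides -> negative product.
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "none"

# Hypothetical values for illustration only (not from the chapter).
engine_liters = [1.6, 2.0, 2.5, 3.5, 5.0]
horsepower = [120, 150, 180, 280, 400]
miles_per_gallon = [35, 32, 28, 22, 16]

print(association_direction(engine_liters, horsepower))
print(association_direction(engine_liters, miles_per_gallon))
```

With these made-up numbers, engine size and horsepower come out positively associated, while engine size and gas mileage come out negatively associated.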
• Our scatterplot regarding the SAT scores shows
two clusters of states.
• The one with the GDP shows a curved
relationship.
• The strength of a relationship in a scatterplot is
determined by how closely the points follow a
clear form.
• The relationships in both of our plots are not strong.
– States with similar percentages show quite a bit of
scatter in their average scores.
– Nations with similar GDPs can have quite different life
expectancies.
A scatterplot with strong relationship
Figure 14.5 Scatterplot of the lengths of two bones in 5
fossil specimens of the extinct beast Archaeopteryx.
Statistical versus Deterministic
Relationships
• Distance versus Speed (when travel time is
constant).
• Income (in millions of dollars) versus total
assets of banks (in billions of dollars).
Distance versus Speed
• Distance = Speed × Time
• Suppose time = 1.5 hours
• Each subject drives a fixed
speed for the 1.5 hrs.
– speed chosen for each subject
varies from 10 mph to 50 mph
• Distance does not vary for
those who drive the same
fixed speed
• Deterministic relationship
[Figure: scatterplot of distance (vertical axis, 0 to 80) against speed (horizontal axis, 0 to 60).]
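A short sketch of why this relationship is deterministic: once the travel time is fixed at 1.5 hours, the distance is fully determined by the speed. The list of subjects below is hypothetical, and the names `TIME_HOURS` and `distance` are introduced only for illustration.

```python
TIME_HOURS = 1.5  # fixed travel time from the slide

def distance(speed_mph):
    """Distance in miles covered at a constant speed for the fixed time."""
    return speed_mph * TIME_HOURS

# Hypothetical subjects; two of them drive at 30 mph.
speeds = [10, 20, 30, 40, 50, 30]
distances = [distance(s) for s in speeds]

for s, d in zip(speeds, distances):
    print(f"{s} mph -> {d} miles")
```

The two subjects driving 30 mph get exactly the same distance, so there is no scatter around the line.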
Income versus Assets
• Income = a + b × Assets?
• Assets vary from 3.4
billion to 49 billion
• Income varies from
bank to bank, even
among those with
similar assets
• Statistical relationship
[Figure: scatterplot of income (millions, 0 to 300) against assets (billions, 0 to 60).]
Linear Relationship
Some relationships are such that the points of
a scatterplot tend to fall along a straight line. This is a linear relationship.
Measuring Strength & Direction
of a Linear Relationship
• How closely does a non-horizontal straight line
fit the points of a scatterplot?
• The correlation coefficient (often referred to as
just correlation) r is a measure of:
– the strength of the relationship: the stronger the
relationship, the larger the magnitude of r.
– the direction of the relationship: positive r indicates
a positive relationship, negative r indicates a
negative relationship.
Correlation Coefficient
• Special values for r:
– A perfect positive linear relationship would have r = +1.
– A perfect negative linear relationship would have r = -1.
– If there is no linear relationship, or if the scatterplot points
are best fit by a horizontal line, then r = 0.
– Note: r must be between -1 and +1, inclusive.
• r > 0: as one variable changes, the other variable
tends to change in the same direction.
• r < 0: as one variable changes, the other variable
tends to change in the opposite direction.
Figure 14.7 How correlation measures the strength of a straight-line relationship.
Patterns closer to a straight line have correlations closer to 1 or −1.
Correlation Calculation
• Suppose we have data on variables X and Y
for n individuals:
$x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$
• Each variable has a mean and standard deviation:
$(\bar{x}, s_x)$ and $(\bar{y}, s_y)$
(see ch. 12 for $s$)
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$
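The calculation can be sketched in a few lines of Python. This is one possible implementation, assuming the sample standard deviation (divisor n - 1) for both variables; the function name `correlation` is our own choice. Each value is standardized, the standardized pairs are multiplied, and the products are summed and divided by n - 1.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r as the sum of products of standardized values, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sample standard deviations (n - 1 divisor); a constant variable gives
    # s = 0 and a ZeroDivisionError below, since r is undefined in that case.
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

print(correlation([1, 2, 3], [2, 4, 6]))  # points on a rising line
print(correlation([1, 2, 3], [6, 4, 2]))  # points on a falling line
```

The two calls illustrate the special values: points exactly on a rising line give r = +1, and points exactly on a falling line give r = -1, up to floating-point rounding.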
Case Study
Per Capita Gross Domestic Product
and Average Life Expectancy for
Countries in Western Europe
Case Study
Country           Per Capita GDP (x)   Life Expectancy (y)
Austria           21.4                 77.48
Belgium           23.2                 77.53
Finland           20.0                 77.32
France            22.7                 78.63
Germany           20.8                 77.17
Ireland           18.6                 76.39
Italy             21.5                 78.51
Netherlands       22.0                 78.15
Switzerland       23.8                 78.99
United Kingdom    21.2                 77.37
Case Study
x       y        $(x_i - \bar{x})/s_x$   $(y_i - \bar{y})/s_y$   product
21.4    77.48    -0.078                  -0.345                   0.027
23.2    77.53     1.097                  -0.282                  -0.309
20.0    77.32    -0.992                  -0.546                   0.542
22.7    78.63     0.770                   1.102                   0.849
20.8    77.17    -0.470                  -0.735                   0.345
18.6    76.39    -1.906                  -1.716                   3.271
21.5    78.51    -0.013                   0.951                  -0.012
22.0    78.15     0.313                   0.498                   0.156
23.8    78.99     1.489                   1.555                   2.315
21.2    77.37    -0.209                  -0.483                   0.101
$\bar{x}$ = 21.52, $s_x$ = 1.532
$\bar{y}$ = 77.754, $s_y$ = 0.795
sum of products = 7.285
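The table's arithmetic can be reproduced with a short script. This is a sketch using the ten (x, y) pairs from the case study; the variable names are our own. Dividing the sum of products by n - 1 = 9 gives the correlation.

```python
from statistics import mean, stdev

# Case study data: per capita GDP (x) and life expectancy (y) for ten countries.
gdp = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
life = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]

x_bar, s_x = mean(gdp), stdev(gdp)    # sample standard deviation (n - 1 divisor)
y_bar, s_y = mean(life), stdev(life)

# Product of the two standardized values for each country (last table column).
products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(gdp, life)]

r = sum(products) / (len(gdp) - 1)
print(f"x_bar = {x_bar:.2f}, s_x = {s_x:.3f}")
print(f"y_bar = {y_bar:.3f}, s_y = {s_y:.3f}")
print(f"sum of products = {sum(products):.3f}, r = {r:.3f}")
```

The sum of products matches the table's 7.285, so r = 7.285 / 9, which is about 0.81.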
Case Study
There is a strong, positive linear relationship between
Per Capita GDP (x) and Life Expectancy (y).
Problems with Correlations
• Outliers can inflate or deflate correlations.
• Groups combined inappropriately may mask
relationships (a third variable).
– groups may have different relationships when
separated.
Figure 14.8 Moving one point reduces the correlation from r = 0.994 to r = 0.640.
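The effect of a single outlier is easy to see numerically. The sketch below uses hypothetical data, not the points behind Figure 14.8, so the values of r differ from those in the caption. Only the last point is moved.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r from standardized values (sample standard deviations)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Hypothetical data lying close to a straight line.
xs = [1, 2, 3, 4, 5, 6]
ys_original = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

# Same data with the last point moved far below the line (an outlier).
ys_moved = ys_original[:-1] + [1.0]

r_before = correlation(xs, ys_original)
r_after = correlation(xs, ys_moved)
print(f"r before moving the point: {r_before:.3f}")
print(f"r after moving the point:  {r_after:.3f}")
```

With these made-up values, r drops from nearly 1 to well under 0.5 even though five of the six points are unchanged.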
Not all Relationships are Linear
Miles per Gallon versus Speed
• Linear relationship?
MPG = a + bSpeed?
• Speed chosen for each
subject varies from 20
mph to 60 mph.
• MPG varies from trial to
trial, even at the same
speed.
• Statistical relationship
Not all Relationships are Linear
Miles per Gallon versus Speed
• Speed chosen for each
subject varies from 20
mph to 60 mph.
• MPG varies from trial to
trial, even at the same
speed.
• Curved relationship
(r is misleading)
• Statistical relationship
[Figure: scatterplot of miles per gallon (vertical axis, 0 to 35) against speed (horizontal axis, 0 to 100).]
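A small sketch of why r is misleading for a curved relationship. The mileage figures below are hypothetical and deliberately symmetric around 40 mph; they are not the data from the plot. MPG depends strongly on speed here, yet the correlation is essentially zero because r only measures straight-line association.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r from standardized values (sample standard deviations)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Hypothetical trials: mileage rises up to 40 mph, then falls symmetrically.
speeds = [20, 30, 40, 50, 60]
mpg = [22, 28, 30, 28, 22]

r_curved = correlation(speeds, mpg)
print(f"r for the curved data: {r_curved:.3f}")
```

A value of r near 0 here does not mean "no relationship"; it means no linear relationship, which is why the scatterplot should be examined first.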
Price of Books versus Size
• Relationship between
price of books and the
number of pages?
• Positive?
• Look at paperbacks:
• Look at hardcovers:
• All books together:
• Overall correlation is
negative!
[Figure: scatterplot of price (dollars, vertical axis, 0 to 140) against # of pages (horizontal axis, 0 to 400).]
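The book example can be imitated with made-up numbers. In this sketch all prices and page counts are hypothetical: within each group, longer books cost more, but the hardcovers are short and expensive while the paperbacks are long and cheap, so combining the groups reverses the sign of the correlation.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r from standardized values (sample standard deviations)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Hypothetical data: (pages, price in dollars) for each group.
paperback_pages = [250, 300, 350, 400]
paperback_price = [20, 25, 30, 35]
hardcover_pages = [50, 100, 150, 200]
hardcover_price = [80, 90, 100, 110]

r_paperback = correlation(paperback_pages, paperback_price)
r_hardcover = correlation(hardcover_pages, hardcover_price)
r_all = correlation(paperback_pages + hardcover_pages,
                    paperback_price + hardcover_price)

print(f"paperbacks: r = {r_paperback:.3f}")
print(f"hardcovers: r = {r_hardcover:.3f}")
print(f"all books:  r = {r_all:.3f}")
```

Each group alone shows a positive correlation, while the combined data show a negative one. The cover type is the third variable that the pooled correlation hides.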
Key Concepts
• Statistical vs. Deterministic Relationships
• Statistically Significant Relationship
• Strength of Linear Relationship
• Direction of Linear Relationship
• Correlation Coefficient
• Problems with Correlations