Transcript Chapter 3

Chapter 3
Association: Contingency,
Correlation, and Regression
Section 3.1
How Can We Explore the Association
between Two Categorical Variables?
Learning Objectives
1. Identify variable type: Response or
Explanatory
2. Define Association
3. Contingency tables
4. Calculate proportions and conditional
proportions
Learning Objective 1:
Response and Explanatory variables
 Response variable (Dependent Variable)
the outcome variable on which comparisons are
made
 Explanatory variable (Independent variable)
defines the groups to be compared with respect
to values on the response variable
 Examples (Response / Explanatory):
 Blood alcohol level / Number of beers consumed
 Grade on test / Amount of study time
 Yield of corn per bushel / Amount of rainfall

Learning Objective 2:
Association
 The main purpose of data analysis with two
variables is to investigate whether there is an
association and to describe that association
 An association exists between two variables if
a particular value for one variable is more
likely to occur with certain values of the other
variable
Learning Objective 3:
Contingency Table
 A contingency table:
 Displays two categorical variables
 The rows list the categories of one variable
 The columns list the categories of the
other variable
 Entries in the table are frequencies
Learning Objective 3:
Contingency Table
What is the response variable?
What is the explanatory variable?
Learning Objective 4:
Calculate proportions and conditional proportions
 What proportion of organic foods contain
pesticides?
 What proportion of conventionally grown foods
contain pesticides?
 What proportion of all sampled items contain
pesticide residues?
Learning Objective 4:
Calculate proportions and conditional proportions
Use side-by-side bar charts to show conditional proportions
This allows easy comparison of the explanatory variable's categories with
respect to the response variable
Learning Objective 4:
Calculate proportions and conditional proportions
 If there were no association between food type and
pesticide status, then the proportions for the
response variable categories would be the same for
each food type
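The conditional proportions asked about above can be sketched in a few lines of Python. This is only an illustration: the counts below are hypothetical (the slide's table is not reproduced in this transcript), and the names `counts`, `conditional_proportions`, and `overall_proportion` are made up for this sketch.

```python
# Hypothetical counts for illustration only; NOT the values from the slide's table.
# Rows = explanatory variable (food type), columns = response variable (pesticide status).
counts = {
    "organic": {"pesticide present": 30, "pesticide not present": 70},
    "conventional": {"pesticide present": 150, "pesticide not present": 50},
}


def conditional_proportions(table):
    """For each row (explanatory category), divide each cell by the row total."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {col: n / row_total for col, n in cells.items()}
    return result


def overall_proportion(table, response):
    """Proportion of all sampled items that fall in one response category."""
    total = sum(sum(cells.values()) for cells in table.values())
    in_category = sum(cells[response] for cells in table.values())
    return in_category / total


cond = conditional_proportions(counts)
for food_type, props in cond.items():
    print(food_type, {k: round(v, 3) for k, v in props.items()})
print("all items:", round(overall_proportion(counts, "pesticide present"), 3))
```

With these made-up counts the conditional proportions differ between the two food types, which is what an association looks like. If the two rows had matching proportions, there would be no association.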
Section 3.2
How Can We Explore the Association
between Two Quantitative Variables?
Learning Objectives:
1. Constructing scatterplots
2. Interpreting a scatterplot
3. Correlation
4. Calculating correlation
Learning Objective 1:
Scatterplot
 Graphical display of relationship between two
quantitative variables:
 Horizontal Axis: Explanatory variable, x
 Vertical Axis: Response variable, y
Learning Objective 1:
Internet Usage and Gross Domestic Product (GDP)
Data Set
Country            INTERNET     GDP
Algeria                0.65     6.09
Argentina             10.08    11.32
Australia             37.14    25.37
Austria               38.7     26.73
Belgium               31.04    25.52
Brazil                 4.66     7.36
Canada                46.66    27.13
Chile                 20.14     9.19
China                  2.57     4.02
Denmark               42.95    29
Egypt                  0.93     3.52
Finland               43.03    24.43
France                26.38    23.99
Germany               37.36    25.35
Greece                13.21    17.44
India                  0.68     2.84
Iran                   1.56     6
Ireland               23.31    32.41
Israel                27.66    19.79
Japan                 38.42    25.13
Malaysia              27.31     8.75
Mexico                 3.62     8.43
Netherlands           49.05    27.19
New Zealand           46.12    19.16
Nigeria                0.1      0.85
Norway                46.38    29.62
Pakistan               0.34     1.89
Philippines            2.56     3.84
Russia                 2.93     7.1
Saudi Arabia           1.34    13.33
South Africa           6.49    11.29
Spain                 18.27    20.15
Sweden                51.63    24.18
Switzerland           30.7     28.1
Turkey                 6.04     5.89
United Kingdom        32.96    24.16
United States         50.15    34.32
Vietnam                1.24     2.07
Yemen                  0.09     0.79
Learning Objective 1:
Internet Usage and Gross Domestic Product (GDP)
 Enter values of explanatory variable
(x) in L1
 Enter values of response variable
(y) in L2
 STAT PLOT
 Plot 1 on
 Type: scatter plot
 X list: L1
 Y list: L2
 ZOOM
 9:ZoomStat
 Graph
Learning Objective 1:
Baseball Average and Team Scoring
 Enter values of explanatory variable
(x) in L1
 Enter values of response variable
(y) in L2
 STAT PLOT
 Plot 1 on
 Type: scatter plot
 X list: L1
 Y list: L2
 ZOOM
 9:ZoomStat
 Graph
Note: for this example, use L3 for x and L4 for y instead,
so that the data from the prior example stay in L1 and L2.
You will use them again later in the PowerPoint.
Learning Objective 2:
Interpreting Scatterplots
 You can describe the overall pattern of a
scatterplot by the trend, direction, and
strength of the relationship between the two
variables
 Trend: linear, curved, clusters, no pattern
 Direction: positive, negative, no direction
 Strength: how closely the points fit the trend
 Also look for outliers from the overall trend
Learning Objective 2:
Interpreting Scatterplots: Direction/Association
 Two quantitative variables x and y are
 Positively associated when
 High values of x tend to occur with high values of y
 Low values of x tend to occur with low values of y
 Negatively associated when high values of one
variable tend to pair with low values of the other
variable
Learning Objective 2:
Example: 100 cars on the lot of a used-car
dealership
Would you expect a positive association, a
negative association or no association between
the age of the car and the mileage on the
odometer?
a) Positive association
b) Negative association
c) No association
Learning Objective 2:
Example: Did the Butterfly Ballot Cost Al
Gore the 2000 Presidential Election?
Learning Objective 3:
Linear Correlation, r
 Measures the strength and direction of the linear
association between x and y
 A positive r value indicates a positive association
 A negative r value indicates a negative association
 An r value close to +1 or -1 indicates a strong linear
association
 An r value close to 0 indicates a weak linear association
$r = \frac{1}{n-1} \sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right)$
Learning Objective 3:
Correlation coefficient: Measuring Strength &
Direction of a Linear Relationship
Learning Objective 3:
Properties of Correlation
 Always falls between -1 and +1
 Sign of correlation denotes direction
 (-) indicates negative linear association
 (+) indicates positive linear association
 Correlation is unitless: it does not
depend on the variables’ units
 Two variables have the same correlation no matter
which is treated as the response variable
 Correlation is not resistant to outliers
 Correlation only measures strength of linear
relationship
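The properties above can be checked numerically. The sketch below is a minimal illustration in plain Python: the `correlation` helper follows the z-score formula for r given on the earlier slide, and the height/weight values are hypothetical numbers chosen only for this demonstration.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation r as the sum of z-score products divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    products = (((x - mean_x) / s_x) * ((y - mean_y) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


# Hypothetical data, for illustration only.
height_cm = [150, 160, 165, 172, 180, 191]
weight_kg = [52, 58, 63, 70, 74, 88]

r = correlation(height_cm, weight_kg)

# Unitless: converting centimeters to inches leaves r unchanged.
r_inches = correlation([h / 2.54 for h in height_cm], weight_kg)

# Symmetric: swapping the roles of x and y leaves r unchanged.
r_swapped = correlation(weight_kg, height_cm)

# Not resistant: one unusual point (very tall, very light) changes r.
r_with_outlier = correlation(height_cm + [200], weight_kg + [40])

# Linear only: a perfect curved (quadratic) pattern can still give r = 0.
r_curved = correlation([-3, -2, -1, 0, 1, 2, 3], [9, 4, 1, 0, 1, 4, 9])

print(round(r, 3), round(r_inches, 3), round(r_swapped, 3))
print(round(r_with_outlier, 3), round(r_curved, 3))
```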
Learning Objective 4:
Calculating the Correlation Coefficient
Per Capita Gross Domestic Product and Average Life Expectancy for
Countries in Western Europe
Country            Per Capita GDP (x)    Life Expectancy (y)
Austria                  21.4                  77.48
Belgium                  23.2                  77.53
Finland                  20.0                  77.32
France                   22.7                  78.63
Germany                  20.8                  77.17
Ireland                  18.6                  76.39
Italy                    21.5                  78.51
Netherlands              22.0                  78.15
Switzerland              23.8                  78.99
United Kingdom           21.2                  77.37
Learning Objective 4:
Calculating the Correlation Coefficient
Here $z_x = (x_i - \bar{x})/s_x$ and $z_y = (y_i - \bar{y})/s_y$.
x        y        z_x       z_y       z_x z_y
21.4     77.48    -0.078    -0.345     0.027
23.2     77.53     1.097    -0.282    -0.309
20.0     77.32    -0.992    -0.546     0.542
22.7     78.63     0.770     1.102     0.849
20.8     77.17    -0.470    -0.735     0.345
18.6     76.39    -1.906    -1.716     3.271
21.5     78.51    -0.013     0.951    -0.012
22.0     78.15     0.313     0.498     0.156
23.8     78.99     1.489     1.555     2.315
21.2     77.37    -0.209    -0.483     0.101
$\bar{x} = 21.52$, $s_x = 1.532$
$\bar{y} = 77.754$, $s_y = 0.795$
sum = 7.285
$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) = \left( \frac{1}{10-1} \right)(7.285) = 0.809$
Learning Objective 4:
Internet Usage and Gross Domestic Product (GDP)
1. STAT CALC menu
2. Choose 8:
LinReg(a+bx)
3. 1st number = x variable
4. 2nd number = y variable
5. Enter
Correlation = .889
Learning Objective 4:
Baseball Average and Team Scoring
1. Enter x data into L1
2. Enter y data into L2
3. STAT CALC menu
4. Choose 8:
LinReg(a+bx)
5. 1st number = x variable
6. 2nd number = y variable
7. Enter
Correlation = .874
Learning Objective 4:
Cereal: Sodium and Sugar
Section 3.3
How Can We Predict the Outcome of a
Variable?
Learning Objectives
1. Definition of a regression line
2. Use a regression equation for prediction
3. Interpret the slope and y-intercept of a
regression line
4. Identify the least-squares regression line
as the one that minimizes the sum of
squared residuals
5. Calculate the least-squares regression
line
6. Compare roles of explanatory and response
variables in correlation and regression
7. Calculate $r^2$ and interpret it
Learning Objective 1:
Regression Analysis
 The first step of a regression analysis is to
identify the response and explanatory
variables
 We use y to denote the response variable
 We use x to denote the explanatory
variable
Learning Objective 1:
Regression Line
 A regression line is a straight line that describes how
the response variable (y) changes as the explanatory
variable (x) changes
 A regression line predicts the value of the response
variable (y) for a given level of the explanatory
variable (x)
$\hat{y} = a + bx$
 The y-intercept of the regression line is denoted by a
 The slope of the regression line is denoted by b
Learning Objective 2:
Example: How Can Anthropologists Predict
Height Using Human Remains?
 Regression Equation:
$\hat{y} = 61.4 + 2.4x$
 $\hat{y}$ is the predicted height and $x$ is the length of a
femur (thighbone), measured in centimeters
 Use the regression equation to predict the
height of a person whose femur length was 50
centimeters
$\hat{y} = 61.4 + 2.4(50) = 181.4$
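The prediction above can be written as a one-line function. This is a minimal sketch: the intercept 61.4 and slope 2.4 come from the slide, while the function name `predict_height` is chosen for this illustration only.

```python
def predict_height(femur_length_cm, intercept=61.4, slope=2.4):
    """Predicted height in cm from femur length in cm, using y-hat = a + b*x."""
    return intercept + slope * femur_length_cm


# Femur length of 50 cm, as in the example on the slide.
print(round(predict_height(50), 1))
```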
Learning Objective 3:
Interpreting the y-Intercept
 y-Intercept:
 The predicted value for y when x = 0
 Helps in plotting the line
 May not have any interpretative value if no
observations had x values near 0
Learning Objective 3:
Interpreting the Slope
 Slope: measures the change in the predicted
value of the response variable (y) for a 1 unit increase in the
explanatory variable (x)
 Example: A 1 cm increase in femur length
results in a 2.4 cm increase in predicted height
Learning Objective 3:
Slope Values: Positive, Negative, Equal to 0
Learning Objective 3:
Regression Line
 At a given value of x, the equation:
$\hat{y} = a + bx$
 Predicts a single value of the response variable
 But we should not expect all subjects at that value of
x to have the same value of y
 Variability occurs in the y values!
Learning Objective 3:
The Regression Line
 The regression line connects the estimated means
of y at the various x values
 In summary,
$\hat{y} = a + bx$
Describes the relationship between x and the estimated
means of y at the various values of x
Learning Objective 4:
Residuals
 Measures the size of the prediction errors, the vertical
distance between the point and the regression line
 Each observation has a residual
 Calculation for each residual:
$y - \hat{y}$
 A large residual indicates an unusual observation
Learning Objective 4:
“Least Squares Method” Yields the
Regression Line
 Residual sum of squares:
$\sum (\text{residuals})^2 = \sum (y - \hat{y})^2$
 The least squares regression line is the line that
minimizes the vertical distance between the points and
their predictions, i.e., it minimizes the residual sum of
squares
 Note: the sum of the residuals about the regression
line will always be zero
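Both claims above (the residuals sum to zero, and the least squares line has the smallest residual sum of squares) can be checked numerically. The sketch below reuses the Western Europe GDP and life expectancy values from Section 3.2. It fits the line with the equivalent textbook form b = Sxy / Sxx, which is an assumption of this sketch and not a formula shown on this slide. The helper names are made up for the illustration.

```python
# Per capita GDP (x) and life expectancy (y) from the Western Europe table.
gdp = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
life = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]


def least_squares(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def residual_sum_of_squares(xs, ys, a, b):
    """Sum of squared residuals (y - y-hat)^2 for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


a, b = least_squares(gdp, life)
residuals = [y - (a + b * x) for x, y in zip(gdp, life)]
rss = residual_sum_of_squares(gdp, life, a, b)

print("sum of residuals:", round(sum(residuals), 10))
print("RSS of least squares line:", round(rss, 4))

# Any other line gives a larger residual sum of squares.
print("RSS, intercept shifted:", round(residual_sum_of_squares(gdp, life, a + 0.1, b), 4))
print("RSS, slope shifted:", round(residual_sum_of_squares(gdp, life, a, b + 0.01), 4))
```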
Learning Objective 5:
Regression Formulas for y-Intercept and Slope
 Slope:
$b = r\left(\frac{s_y}{s_x}\right)$
 Y-Intercept:
$a = \bar{y} - b\bar{x}$
Regression line always passes through $(\bar{x}, \bar{y})$
Learning Objective 5:
Calculating the slope and y intercept for the regression line
Given: $\bar{x} = 0.275$, $s_x = 0.0091$, $\bar{y} = 4.979$, $s_y = 0.368$, $r = 0.653$
$b = r\left(\frac{s_y}{s_x}\right) = 0.653\left(\frac{0.368}{0.0091}\right) = 26.4$
$a = \bar{y} - b\bar{x} = 4.979 - 26.4(0.275) = -2.28$
Slope = 26.4
y intercept = -2.28
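The same slope and intercept can be computed from the summary statistics in Python. This is a minimal sketch using the slide's formulas b = r(s_y / s_x) and a = ȳ - b·x̄. The function name `regression_from_summary` is hypothetical.

```python
def regression_from_summary(mean_x, mean_y, s_x, s_y, r):
    """Return (a, b) for y-hat = a + b*x from means, standard deviations, and r."""
    b = r * (s_y / s_x)
    a = mean_y - b * mean_x
    return a, b


# Summary statistics from the slide.
a, b = regression_from_summary(mean_x=0.275, mean_y=4.979, s_x=0.0091, s_y=0.368, r=0.653)
print("slope =", round(b, 1))
print("y intercept =", round(a, 2))

# The fitted line passes through the point (mean_x, mean_y).
print("prediction at mean x =", round(a + b * 0.275, 3))
```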
Learning Objective 5:
Internet Usage and Gross Domestic Product (GDP)
 Enter x data into L1
 Enter y data into L2
1. STAT CALC menu
2. Choose 8:
LinReg(a+bx)
3. 1st number = x variable
4. 2nd number = y variable
5. Enter
$\hat{y} = 1.548x - 3.63$
Learning Objective 5:
Baseball Average and Team Scoring
1. Enter x data into L1
2. Enter y data into L2
3. STAT CALC
4. Choose 8:
LinReg(a+bx)
5. 1st number = x variable
6. 2nd number = y variable
7. Enter
$\hat{y} = 41.58x - 6.25$
Learning Objective 5:
Cereal: Sodium and Sugar
$\hat{y} = 7.08x + 243.5$
Learning Objective 6:
The Slope and the Correlation
 Correlation:
 Describes the strength of the linear association
between 2 variables
 Does not change when the units of measurement
change
 Does not depend upon which variable is the
response and which is the explanatory
Learning Objective 6:
The Slope and the Correlation
 Slope:
 Numerical value depends on the units used to
measure the variables
 Does not tell us whether the association is strong
or weak
 The two variables must be identified as response
and explanatory variables
 The regression equation can be used to predict
values of the response variable for given values
of the explanatory variable
Learning Objective 7:
The Squared Correlation
 When a strong linear association exists, the
regression equation predictions tend to be
much better than the predictions using only $\bar{y}$
 We measure the proportional reduction in
error and call it $r^2$
Learning Objective 7:
The Squared Correlation

 $r^2$ measures the proportion of the variation
in the y-values that is accounted for by the
linear relationship of y with x
 A correlation of .9 means that
$0.9^2 = 0.81 = 81\%$
81% of the variation in the y-values can be
explained by the explanatory variable, x
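The idea of "proportional reduction in error" can be checked on the Western Europe data from Section 3.2. The sketch below compares the squared error from predicting every y with the mean of y against the squared error from the least squares line. It computes r with the equivalent form Sxy / sqrt(Sxx · Syy), which is an assumption of this sketch and not the z-score formula shown on the slides. All variable names are illustrative.

```python
# Per capita GDP (x) and life expectancy (y) from the Western Europe table.
gdp = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
life = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]

n = len(gdp)
mean_x = sum(gdp) / n
mean_y = sum(life) / n
sxx = sum((x - mean_x) ** 2 for x in gdp)
syy = sum((y - mean_y) ** 2 for y in life)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gdp, life))

# Least squares line and correlation.
b = sxy / sxx
a = mean_y - b * mean_x
r = sxy / (sxx * syy) ** 0.5

# Squared prediction error using only the mean of y, and using the regression line.
error_using_mean = syy
error_using_line = sum((y - (a + b * x)) ** 2 for x, y in zip(gdp, life))

proportional_reduction = (error_using_mean - error_using_line) / error_using_mean

print("r squared:", round(r ** 2, 3))
print("proportional reduction in error:", round(proportional_reduction, 3))
```

The two printed values agree, which is why the proportional reduction in error is called r squared.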
Section 3.4
What Are Some Cautions in Analyzing
Association?
Learning Objectives:
1. Extrapolation
2. Outliers and Influential Observations
3. Correlation does not imply causation
4. Lurking variables and confounding
5. Simpson’s Paradox
Learning Objective 1:
Extrapolation
 Extrapolation: Using a regression line to
predict y-values for x-values outside the
observed range of the data
 Riskier the farther we move from the range
of the given x-values
 There is no guarantee that the relationship
given by the regression equation holds
outside the range of sampled x-values
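One way to guard against accidental extrapolation is to check each x against the observed range before trusting a prediction. The sketch below uses the femur regression equation from Section 3.3. The observed range of 40 to 55 cm is a hypothetical assumption made for this illustration (the slides do not give the range), and `predict_with_range_check` is a made-up helper name.

```python
def predict_with_range_check(x, a, b, x_min, x_max):
    """Return (prediction, is_extrapolation) for the line y-hat = a + b*x."""
    prediction = a + b * x
    is_extrapolation = not (x_min <= x <= x_max)
    return prediction, is_extrapolation


# Femur equation from Section 3.3: y-hat = 61.4 + 2.4x.
A, B = 61.4, 2.4
# Hypothetical observed range of femur lengths (cm); an assumption for this sketch.
X_MIN, X_MAX = 40.0, 55.0

for femur_cm in (50, 5):
    height, extrapolated = predict_with_range_check(femur_cm, A, B, X_MIN, X_MAX)
    note = "extrapolation, treat with caution" if extrapolated else "within observed range"
    print(femur_cm, "cm ->", round(height, 1), "cm", "(" + note + ")")
```

The prediction for a 5 cm femur is still a number, but nothing in the data supports the line that far from the sampled x-values.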
Learning Objective 2:
Outliers and Influential Points
 Construct a scatterplot
 Search for data points that are well
outside of the trend that the remainder of
the data points follow
Learning Objective 2:
Outliers and Influential Points
 A regression outlier is an observation that lies far
away from the trend that the rest of the data follows
 An observation is influential if
 Its x value is relatively low or high compared to the
remainder of the data
 The observation is a regression outlier
 Influential observations tend to pull the regression line
toward that data point and away from the rest of the data

Learning Objective 2:
Outliers and Influential Points
 Impact of removing an Influential data point
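The impact of removing an influential point can be sketched with a tiny data set. The five points below are hypothetical and chosen only for this illustration: four lie exactly on the line y = x, and the fifth has a much higher x value and falls far from that trend. The helper `least_squares` is a made-up name.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Hypothetical data: the last point (10, 0) is the influential observation.
xs = [1, 2, 3, 4, 10]
ys = [1, 2, 3, 4, 0]

a_with, b_with = least_squares(xs, ys)
a_without, b_without = least_squares(xs[:-1], ys[:-1])

print("with the point:    y-hat =", round(a_with, 2), "+", round(b_with, 2), "x")
print("without the point: y-hat =", round(a_without, 2), "+", round(b_without, 2), "x")
```

In this made-up example a single point is enough to change the sign of the slope, which is why it helps to look at a scatterplot before fitting a line.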
Learning Objective 3:
Correlation does not Imply Causation
 A strong correlation between x and y means
that there is a strong linear association that
exists between the two variables
 A strong correlation between x and y does
not mean that x causes y
Learning Objective 3:
Association does not imply causation
Data are available for all fires in Chicago last year on x =
number of firefighters at the fires and y = cost of damages
due to fire
1. Would you expect the correlation to be negative, zero, or
positive?
2. If the correlation is positive, does this mean that having more
firefighters at a fire causes the damages to be worse? Yes or
No
3. Identify a third variable that could be considered a common
cause of x and y:
a. Distance from the fire station
b. Intensity of the fire
c. Size of the fire
Learning Objective 4:
Lurking Variables & Confounding
 A lurking variable is a variable, usually unobserved,
that influences the association between the
variables of primary interest
 Ice cream sales and drowning: lurking variable = temperature
 Reading level and shoe size: lurking variable = age
 Childhood obesity rate and GDP: lurking variable = time
 When two explanatory variables are both associated
with a response variable but are also associated with
each other, there is said to be confounding
 Lurking variables are not measured in the study but
have the potential for confounding
Learning Objective 5:
Simpson’s Paradox
Simpson’s Paradox:
 When the direction of an association between
two variables changes after we include a third
variable and analyze the data at separate
levels of that variable
Learning Objective 5:
Simpson’s Paradox Example
Is Smoking Actually Beneficial to Your Health?
Probability of Death of Smoker = 139/582 = 24%
Probability of Death of Nonsmoker = 230/732 = 31%
It can’t be true that smoking improves your chances of living!
What’s going on?
Learning Objective 5:
Simpson’s Paradox Example
Break out Data by Age
Learning Objective 5:
Simpson’s Paradox Example
An association can look quite different after adjusting for the effect of
a third variable by grouping the data according to the values of the third
variable
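The reversal can be illustrated with a short calculation. The overall rates below use the counts from the slide (139 of 582 smokers and 230 of 732 nonsmokers died). The age breakdown is NOT the slide's table, which is not reproduced in this transcript: the two age groups and their counts are hypothetical numbers chosen only to show how a reversal can arise.

```python
# Overall death rates, using the counts given on the slide.
smoker_rate = 139 / 582
nonsmoker_rate = 230 / 732
print("overall:", round(smoker_rate * 100), "% smokers,", round(nonsmoker_rate * 100), "% nonsmokers")

# Hypothetical age breakdown (deaths, group size); for illustration only.
hypothetical = {
    "younger": {"smoker": (10, 400), "nonsmoker": (4, 200)},
    "older": {"smoker": (60, 100), "nonsmoker": (200, 400)},
}


def rate(deaths, total):
    """Proportion of a group that died."""
    return deaths / total


# Within each age group, smokers have the higher death rate.
for age_group, groups in hypothetical.items():
    print(age_group, {status: round(rate(*counts), 3) for status, counts in groups.items()})

# Combining the age groups reverses the direction of the association.
combined = {}
for status in ("smoker", "nonsmoker"):
    deaths = sum(groups[status][0] for groups in hypothetical.values())
    total = sum(groups[status][1] for groups in hypothetical.values())
    combined[status] = rate(deaths, total)
print("combined:", {status: round(value, 3) for status, value in combined.items()})
```

In the hypothetical data most smokers are in the younger group, where death rates are low for everyone, so the combined smoker rate looks better even though smokers do worse within each age group.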