Transcript Chapter 3

Chapter 3
Association: Contingency and Correlation
Looking Back

In Chapter 2 we learned how to:




Distinguish between categorical and numerical variables
Summarize a single variable using descriptive statistics
In practice, research investigations usually require
analyzing more than one variable.
In this chapter we will focus on the relationships
between two variables. There are three possible
situations:



Categorical vs. Numerical
Categorical vs. Categorical
Numerical vs. Numerical
Relationships Between Two Variables

The first step is to determine which is the
explanatory variable and which is the response
variable.

Explanatory variable

Response variable

The explanatory variable influences the response variable.
Example 3.1

Determine which variable is the explanatory and
which is the response. Also identify the type of each
variable.



Smoking (yes or no) & Survival after 20 years (yes or no)
 Smoking:
 Survival:
Sucrose level (in grams) & Fruit type (A,B or C)
 Sucrose level:
 Fruit type:
Daily amount of gasoline used by automobiles (in gallons)
& Amount of air pollution (in ppm)
 Gasoline use:
 Air pollution:
Example taken from Statistics: The Art and
Science of Learning from Data
The Main Purpose of Two Variable Analysis

Determining whether an association exists.



An association exists between two variables if a
particular value for one variable is more likely to
occur with certain values of the other variable.
In other words, are the variables _______ or are
they ____________ of each other?
If an association exists, then we need to
know how to
Categorical vs. Numerical

In this case,



Categorical variable:
Numerical variable:
Use the following to determine if there is an
association:


Graphical Comparison
 Side-by-side boxplots (one boxplot per group)
Numerical Comparison
 Summary statistics like the ______ and
___________ ___________ of each group
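As a rough sketch of the numerical comparison, the snippet below computes the mean and standard deviation of a numerical variable separately for each group of a categorical variable. The values and the group labels A, B, and C are made up for illustration and are not data from any example in this chapter; `summarize_by_group` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean, stdev

def summarize_by_group(pairs):
    """Return {group: (mean, sample standard deviation)} for (group, value) pairs."""
    groups = defaultdict(list)
    for group, value in pairs:
        groups[group].append(value)
    return {g: (mean(v), stdev(v)) for g, v in groups.items()}

# Made-up illustrative data: (group label, numerical measurement)
data = [
    ("A", 80), ("A", 84), ("A", 88),
    ("B", 90), ("B", 92), ("B", 94),
    ("C", 70), ("C", 74), ("C", 78),
]

summary = summarize_by_group(data)
for group, (m, s) in summary.items():
    print(f"{group}: mean = {m:.1f}, sd = {s:.1f}")
```

If the group means or standard deviations printed here differ a lot, that is a hint (in the sample only) that the two variables may be associated.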
Example 3.2

To determine the effect of pets on stress, researchers recruited 45
females who owned dogs. The subjects were randomly assigned to
three groups of fifteen. Each group was asked to complete a
stressful task alone, with a good friend, or with their dog present.
The heart rate of each subject was measured.

Using the charts below, which group performed best?
Example taken from The Practice of Statistics
by Yates, Moore and Starnes
Determining Association Between Categorical and Numerical

Side-by-side boxplots


If one boxplot sits much farther to the right or left than the
others, then there is probably an association.
Means and standard deviations

If the mean or standard deviation of one group is very
different from the others, then there may be an association.
Determining Association Between Categorical and Numerical

When there is no association between the
two variables, the boxplot and descriptive
statistics will look something like this:
Example 3.3

A teacher was interested in knowing “On average, is
the fastest speed ever driven by TAMU men greater
than that of TAMU women?” To study this, she took
a random sample of 242 STAT 302 students in the
Fall of 2004.

Which variable is the explanatory and which is the
response?

Gender:
 What are the values of the variable?

Fastest Speed:
Used with permission from Dr. Ellen Toby
Example 3.3

Do you think there is an association between
gender and fastest speed?
Categorical vs. Categorical


In this case, determining which variable is the
explanatory and which is the response is not
always straightforward.
Use the following to determine if there is an
association:
 Numerical Comparison


Contingency tables
Graphical Comparison


Stacked bar charts
Side-by-Side pie charts
Example 3.4

Many people purchase organic foods because they believe
they are pesticide free and therefore healthier than
conventionally grown foods. Is it really worth the extra cost?
The Consumers Union led a study which sampled both
organic and conventional foods, recording the presence or
absence of pesticide residue on the food.

Which variable is the explanatory and which is the response?
What are the values of the variables?

Explanatory:


Values:
Response:

Values:
Example taken from Statistics: The Art and
Science of Learning from Data
Contingency Tables



The rows list the categories of the explanatory
variable.
The columns list the categories of the response
variable.
Each row and column combination is called a cell.


There are four cells in the table below.
The number in each cell is the frequency of that
particular combination.
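A minimal sketch of how the cell frequencies could be tallied from raw observations. The eight observations below are made up for illustration and are not the study's data; the category labels are assumptions chosen to match the organic food example.

```python
from collections import Counter

# Made-up illustrative observations: (explanatory category, response category)
observations = [
    ("organic", "present"), ("organic", "absent"), ("organic", "absent"),
    ("organic", "absent"), ("conventional", "present"),
    ("conventional", "present"), ("conventional", "present"),
    ("conventional", "absent"),
]

# Frequency of each (row, column) combination, i.e. each cell
cell_counts = Counter(observations)

rows = ["organic", "conventional"]  # categories of the explanatory variable
cols = ["present", "absent"]        # categories of the response variable

# Print the table with rows = explanatory, columns = response
print(f"{'':>14}" + "".join(f"{c:>10}" for c in cols))
for r in rows:
    print(f"{r:>14}" + "".join(f"{cell_counts[(r, c)]:>10}" for c in cols))
```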
Contingency Tables

Conditional Probability




The probability of an observation falling into one of the response
categories given that it is in a particular explanatory category
The conditional probabilities in each row should sum to ______.
What is the conditional probability that there is pesticide on a
randomly selected organic food in this sample?
What is the conditional probability that there is pesticide on a
randomly selected conventional food in this sample?
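One way to sketch the calculation: divide each cell frequency by its row total. The counts below are made up for illustration only and are not the counts from the study; `row_conditional_proportions` is a hypothetical helper name.

```python
def row_conditional_proportions(table):
    """Divide each cell by its row total: P(response category | explanatory category)."""
    result = {}
    for row_label, counts in table.items():
        total = sum(counts.values())
        result[row_label] = {col: n / total for col, n in counts.items()}
    return result

# Made-up counts for illustration only (not the study's data)
table = {
    "organic": {"present": 20, "absent": 80},
    "conventional": {"present": 60, "absent": 40},
}

props = row_conditional_proportions(table)
for row_label, row in props.items():
    print(row_label, {col: round(p, 2) for col, p in row.items()})
```

Comparing the proportions across the rows is the numerical check for association between two categorical variables.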
Stacked Bar Charts & Side-by-Side Pie Charts


Stacked Bar Charts
 There is one bar for each category of the explanatory variable.
 Sections within each bar represent conditional probabilities for
each category of the response variable.
Side-by-Side Pie Charts
 There is one pie for each category of the explanatory variable.
 Slices within each pie represent conditional probabilities for each
category of the response variable.
Determining Association Between Categorical and Categorical

Contingency Tables



Compare the percentages in the first row to the
percentages in the second row.
If the percentages are very different, then there is probably
an association.
Stacked Bar Charts & Side-by-Side Pie Charts


Compare the colored areas in the first bar to those in the
second bar.
If they are very different, then there is probably an
association.
Determining Association Between Categorical and Categorical

When there is no
association between
the two variables, the
charts and contingency
table will look
something like this:
Example 3.5

A study was done to determine
if there is any association
between gender and ability to
differentiate between Coca-Cola and C2.

Are gender and soda
preference associated?
Used with permission from Dr. Ellen Toby
Numerical vs. Numerical


Once again, determining which variable is the
explanatory and which is the response is not
always straightforward.
Use the following to determine if there is an
association:
 Graphical Comparison


Scatterplots
Numerical Comparison

Correlation
Example 3.6

What is the relationship between the weight of a
vehicle and the number of miles it travels per gallon
of gasoline? To answer this question, a random
sample of 25 vehicles was taken. The weight (in
pounds) and the miles per gallon (MPG) were
recorded for each car.

Which variable is the explanatory and which is the
response?
 Explanatory:
 Response:
Example taken from Statistics: The Art and
Science of Learning from Data
Scatterplots

Treat each pair as an (x, y) coordinate:
Determining Association Between Numerical and Numerical


Look for a consistent change or pattern in the
response variable as the explanatory variable
increases.
The particular kind of association seen in Example
3.6 is called _______ __________.


We describe linear association numerically by a measure
called correlation.
Correlation


Correlation summarizes the ____________ and
____________ of the linear relationship between two
numerical variables.
It is denoted by the symbol R.
Correlation R

Properties of the Correlation R:






Takes values between -1 and 1
R = 1 or R = -1 implies
R = 0 implies there is
R < 0 implies there is a ___________ association
R > 0 implies there is a ___________ association
Even if the X and Y variables are switched, the
correlation will ________________.
Correlation R

The formula for correlation is
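R = \frac{1}{N-1} \sum_{i=1}^{N} \frac{(x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}
where \bar{x} and \bar{y} are the sample means and s_x and s_y are the sample standard deviations.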


StatTools will calculate this for us
The association in Example 3.6 is _______ and
___________.

The correlation is R =
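For readers without StatTools, here is a minimal sketch of the same calculation in Python. It assumes s_x and s_y are sample standard deviations computed with N - 1 in the denominator, so the sum of standardized products is also divided by N - 1. The weights and MPG values are made up for illustration and are not the 25 vehicles from Example 3.6; `correlation` is a hypothetical helper name.

```python
from math import sqrt

def correlation(x, y):
    """Sample correlation R, assuming the N - 1 convention for s_x and s_y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    # Sum of products of deviations from the means
    total = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return total / ((n - 1) * s_x * s_y)

# Made-up illustrative values (not the data from Example 3.6)
weight = [2000, 2500, 3000, 3500, 4000]  # pounds
mpg = [38, 33, 29, 24, 20]               # miles per gallon

r = correlation(weight, mpg)
r_swapped = correlation(mpg, weight)     # same data with x and y switched
print(f"R = {r:.3f}")
print(f"R with x and y switched = {r_swapped:.3f}")
```

Running the function with the two variables switched is a quick way to check the last property listed for R.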
Example 3.7
Used with permission from Dr. Ellen Toby
Correlation R

Correlation does not imply causation

Lurking variables


An unobserved variable influences the association
between the explanatory variable and response
variable.
Confounding variables

Two explanatory variables are both associated with a
response variable. It is impossible to determine which
variable causes the response.
Two Cases When R Is Not a Good Measure of Linear Association


Case 1: Outliers are
present.
Case 2: Relationship
between x and y is not
linear.
[Scatterplots with correlation labels: R ≈ 0, R ≈ 0.89, R ≈ 0.48]
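Both cases can be illustrated with a small sketch. All values below are made up for illustration and are not the data behind the slide's plots; `correlation` is a hypothetical helper that computes the usual sample correlation.

```python
from math import sqrt

def correlation(x, y):
    """Sample correlation: sum of cross-deviations over the root of the squared deviations."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)

# Case 1: ten made-up points with no linear trend, then one extreme point added
x = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
y = [1, 2, 3, 2, 1, 3, 2, 1, 2, 3]
r_without_outlier = correlation(x, y)
r_with_outlier = correlation(x + [20], y + [20])

# Case 2: a perfect but non-linear relationship, y = x^2 on symmetric x values
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [xi ** 2 for xi in x_curve]
r_curve = correlation(x_curve, y_curve)

print(f"Case 1 without outlier: R = {r_without_outlier:.2f}")
print(f"Case 1 with outlier:    R = {r_with_outlier:.2f}")
print(f"Case 2 (y = x^2):       R = {r_curve:.2f}")
```

In this sketch a single extreme point moves R from zero to a large positive value, and a relationship that is exact but curved still gives R of zero, which is why a scatterplot should be checked before relying on R.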
Important Points

Categorical Explanatory & Numerical Response

Numeric Summary
 Measures of center and spread for each group
Graphical Summary
 Histograms and QQ plots for each group, side-by-side boxplots
Categorical Explanatory & Response

Numeric Summary


Contingency tables with relative frequencies
Graphical Summary

Stacked bar charts and side-by-side pie charts
Important Points

Numerical Explanatory & Response




Numeric Summary
 Correlation R
Graphical Summary
 Scatterplot
At this point in the class, we can only report what we
see in the sample data.
Later in the semester, we will learn about statistical
methods that allow us to apply our findings to the
overall population.