MAT 1000 - Wayne State University

MAT 1000
Mathematics in Today's World
Last Time
We saw how to use the mean and standard deviation
of a normal distribution to determine the percentile of
a data value from that distribution.
The Pth percentile of a distribution is the value below which
P percent of the data falls. For instance: 80% of a
distribution is less than the 80th percentile.
For a normal distribution, we can find percentiles by
computing a standard score, and then using a table to
look up the percentile.
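As a rough illustration of this recap, the sketch below lets Python's standard library play the role of the printed table. The mean, standard deviation, and data value are made-up numbers chosen for illustration; they do not come from the lecture.

```python
from statistics import NormalDist

# Hypothetical normal distribution (illustrative values only)
mean = 100.0
std_dev = 15.0
value = 115.0

# Step 1: the standard score says how many standard deviations
# the value lies above (or below) the mean
z = (value - mean) / std_dev

# Step 2: instead of looking z up in a table, ask the standard
# normal distribution what fraction of the data lies below z
percentile = NormalDist().cdf(z) * 100

print(f"standard score: {z:.2f}")
print(f"percentile: about {percentile:.0f}")
```

With these assumed numbers the value sits one standard deviation above the mean, which puts it at roughly the 84th percentile.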
Today
Recall that “variables” are characteristics or attributes of
individuals. We will consider pairs of variables. In other words,
we will look at a pair of characteristics of an individual. The key
question will be: are these variables related?
We will discuss scatterplots, which are a way to visualize data
that consists of pairs of variables.
We will talk about the key features of scatterplots: form,
direction, and strength.
Today
We will also talk about correlation.
For a data set consisting of pairs of numbers, correlation is a
number between -1 and 1.
If the data has a linear form, then correlation tells us about the
strength and direction of the relationship between the
variables.
Pairs of variables
What are some examples of pairs of variables?
Height and weight of people. This gives us two
numbers for each individual.
The time it takes me to run a mile and my heart rate
afterwards. Each time I run a mile I get another pair of
numbers.
Pairs of variables
We collect data on pairs of variables in order to study
relationships between those variables. Sometimes we are
interested in cause and effect relationships.
Example
Will you live longer if you increase your intake of vitamin A?
Cause: Amount of vitamin A taken (in IUs)
Effect: Lifespan (in years)
For each individual we get two numbers: average daily intake of
vitamin A, and lifespan.
Pairs of variables
When we talk about a pair of variables that we know, or at
least believe or hope, have a cause and effect relationship, we
use the following terms:
The explanatory variable is what we believe to be the cause.
The response variable is what we believe to be the effect.
Statistics give us evidence for a cause and effect relationship,
but statistics will not prove this.
Scatterplots
Scatterplots are visual representations of pairs of
data.
There is a horizontal scale and a vertical scale. Each
direction corresponds to one of the variables.
Each individual is represented by one dot. The
horizontal and vertical location of the dot corresponds
to that individual's values of the two variables.
Scatterplots
Scatterplot of the life expectancy of people in many nations against each
nation’s gross domestic product per person.
Interpreting scatterplots
To interpret a scatterplot, look for three things:
1. Form
2. Direction
3. Strength
The form of a scatterplot is its overall shape. This may be a
straight line, a curved line, or some other shape altogether.
The strength is how close the scatterplot is to its form.
Interpreting scatterplots
We distinguish between two directions: positive and negative.
This is especially useful when the form of a scatterplot is a
straight line. (In this case, the direction corresponds to the sign
of the slope of the line: positive slope = positive direction.)
The rule for a positive direction: larger values of the
explanatory variable correspond to larger values of the
response variable.
The rule for a negative direction: larger values of the
explanatory variable correspond to smaller values of the
response variable.
Interpreting scatterplots
Form: curved line
Strength: fairly strong
Direction: positive
Interpreting scatterplots
Form: straight line
Strength: moderate
Direction: positive
Interpreting scatterplots
If height and weight have a positive association, what
does that tell us?
It means that taller people tend to weigh more.
This is a statement about a general tendency. We
don’t worry about the exceptions.
Interpreting scatterplots
What about the time it takes me to run a mile and my heart
rate afterwards?
If I run faster, my time is less, and I’m working harder so my
heart rate will go up.
If I run slower, I will have a longer time, and I won’t be working
as hard, so my heart rate won’t go up as much.
What direction is this association?
Negative: larger values of the explanatory variable (time)
correspond to lower values of the response variable (heart
rate).
Interpreting scatterplots
In addition to form, direction, and strength, which are
general features of a scatterplot, you should also note
any outliers.
On a scatterplot the outliers are dots that don’t fit
into the overall pattern.
Interpreting scatterplots
Sierra Leone is a clear outlier on this scatterplot.
Linear form
To find the form of an association, look at a scatterplot.
If one straight line gives a reasonable approximation to the
scatterplot, the form is said to be “linear.”
Let’s consider some examples.
Linear form
Non-linear form
Not every relationship is linear
Example
Consider the relationship between the speed you drive and the
gas mileage you get.
As your speed increases, your mileage increases, up to a certain
speed (usually around 55 or 60 mph). This will look roughly like
a straight line.
But around 55 or 60 mph (the exact speed depends on the type
of car), your mileage begins to decrease.
Let’s look at a scatterplot.
Non-linear form
This is not a linear scatterplot
Correlation
When a scatterplot has a linear form, we can measure the
strength of the association using a number called the
“correlation.”
Here are some facts about correlation:
• Abbreviated by the letter 𝑟
• 𝑟 is a number between -1 and 1
• The sign of 𝑟 (positive or negative) is the same as the direction of
the association
• Stronger associations have 𝑟 closer to either 1, or to -1.
• Correlation has no units.
Interpreting correlation
Here are some guidelines on using the value of 𝑟 to interpret
the strength of a relationship
Value of correlation            Strength of relationship
0.8 to 1.0 or -1.0 to -0.8      Very strong
0.6 to 0.8 or -0.8 to -0.6      Strong
0.4 to 0.6 or -0.6 to -0.4      Moderate
0.2 to 0.4 or -0.4 to -0.2      Weak
-0.2 to 0.2                     Either very weak, or not a linear relationship
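The guidelines can be written as a small lookup, sketched here in Python. The function name `describe_strength` is made up for illustration. The ranges in the guidelines share their endpoints, so this sketch assumes a boundary value such as 0.8 belongs to the stronger category; that choice is an assumption, not something the guidelines state.

```python
def describe_strength(r):
    """Return the guideline label for a correlation r (illustrative helper)."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation must be between -1 and 1")

    # Strength depends only on how far r is from 0, not on its sign
    size = abs(r)

    # Assumption: a shared endpoint (0.8, 0.6, ...) counts as the stronger label
    if size >= 0.8:
        return "very strong"
    if size >= 0.6:
        return "strong"
    if size >= 0.4:
        return "moderate"
    if size >= 0.2:
        return "weak"
    return "very weak, or not a linear relationship"


for r in (0.9, -0.5, 0.0):
    print(r, "->", describe_strength(r))
```

The sign of 𝑟 is ignored on purpose: it describes the direction of the association, while the label describes only its strength.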
Interpreting correlation
Here are some concrete examples to give you a better feel for
correlations:
• The correlation between SAT score and college GPA is about
0.6.
• The correlation between height and weight for American
males is about 0.4.
• The correlation between income and education level in the
United States is about 0.4.
• The correlation between a person’s income and the last 4
digits of their phone number is 0.
Interpreting correlation
Here are examples of scatterplots for various values of 𝑟.
Notice the relationship between direction and sign, and also
that the closer 𝑟 is to 1 or -1, the stronger the association.
Calculating correlation
Calculating correlations by hand takes some work.
Example
Find the correlation between the height and weight of the
following five men:
Height (inches)     67    72    77    74    69
Weight (pounds)    155   220   240   195   175
Notice that our data set has five individuals and two variables.
Calculating correlation
Example
We start by finding four numbers:
1. The mean of the five heights
2. The mean of the five weights
3. The standard deviation of the five heights
4. The standard deviation of the five weights
Remember our notation for the mean: $\bar{x}$. With two different
means, it would be confusing to call them both $\bar{x}$.
To keep them separate, call the mean of the heights $\bar{x}$ and the
mean of the weights $\bar{y}$.
Calculating correlation
Example
We have the same issue for the standard deviations: we don’t
want to call both of them 𝑠.
So let’s call the standard deviation of the heights $s_x$ and the
standard deviation of the weights $s_y$ (this is the usual notation).
Using this notation we can find that:
$\bar{x} = 71.8$ inches
$s_x = 3.96$ inches
$\bar{y} = 197$ pounds
$s_y = 34.02$ pounds
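These four numbers can be checked with a few lines of Python. This is a minimal sketch using the standard library; it assumes the sample standard deviation (the version that divides by n − 1), which is what `statistics.stdev` computes and which reproduces the values stated here. The variable names are chosen for illustration.

```python
from statistics import mean, stdev

heights = [67, 72, 77, 74, 69]       # inches
weights = [155, 220, 240, 195, 175]  # pounds

x_bar = mean(heights)   # mean of the heights
y_bar = mean(weights)   # mean of the weights
s_x = stdev(heights)    # sample standard deviation of the heights
s_y = stdev(weights)    # sample standard deviation of the weights

print(f"mean height: {x_bar:.1f} inches, standard deviation: {s_x:.2f} inches")
print(f"mean weight: {y_bar:.0f} pounds, standard deviation: {s_y:.2f} pounds")
```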
Calculating correlation
Example
Next we find standard scores for each height and weight.
Remember the formula for standard scores:
\[ z = \frac{x_i - \bar{x}}{s} \]
For each height we subtract the average of the heights $\bar{x}$, and
divide by $s_x$, the standard deviation of the heights.
Likewise, for each weight we subtract the average of the
weights $\bar{y}$, and divide by $s_y$, the standard deviation of the
weights.
To keep organized, let’s make a table
Calculating correlation
$x_i$    $(x_i - \bar{x})/s_x$    $y_i$    $(y_i - \bar{y})/s_y$    Product of standard scores
67       -1.21                    155      -1.23                    1.50
72        0.05                    220       0.68                    0.03
77        1.31                    240       1.26                    1.66
74        0.56                    195      -0.06                   -0.03
69       -0.71                    175      -0.65                    0.46
                                                            Sum:    3.61
Multiply the standard score of a person’s weight by the
standard score of their height.
Then we add up this last column.
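The table can be reproduced with a short Python sketch. The helper name `standard_scores` is introduced here for illustration, and the sample standard deviation (dividing by n − 1) is assumed, as in the values computed earlier.

```python
from statistics import mean, stdev

heights = [67, 72, 77, 74, 69]       # inches
weights = [155, 220, 240, 195, 175]  # pounds


def standard_scores(values):
    """Subtract the mean from each value, then divide by the standard deviation."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


z_heights = standard_scores(heights)
z_weights = standard_scores(weights)

# Multiply each person's two standard scores, then add up the products
products = [zx * zy for zx, zy in zip(z_heights, z_weights)]
total = sum(products)

for h, zx, w, zy, p in zip(heights, z_heights, weights, z_weights, products):
    print(f"{h:>3} {zx:>6.2f} {w:>4} {zy:>6.2f} {p:>6.2f}")
print(f"sum of products: {total:.2f}")
```

Because the code keeps full precision until the final printout, an individual product can differ in the last digit from one computed by hand with already-rounded standard scores.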
Calculating correlation
Example
Finally, we take this number 3.61 and divide by 𝑛 − 1. Here 𝑛 is
the number of individuals in the data set. Don’t forget there
are 2 numbers per individual, so we have 𝑛 = 5, not 10.
The correlation is
\[ r = \frac{3.61}{4} \approx 0.9 \]
This means there is a very strong positive correlation between
height and weight for these five men.
Calculating correlation
In review, the steps for finding correlation are:
1. Find standard scores for each variable
2. Multiply corresponding pairs of standard scores
3. Add up these products
4. Divide by 𝑛 − 1
There is a formula that encapsulates all the steps we’ve taken:
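\[ r = \frac{1}{n-1}\left[\left(\frac{x_1 - \bar{x}}{s_x}\right)\left(\frac{y_1 - \bar{y}}{s_y}\right) + \left(\frac{x_2 - \bar{x}}{s_x}\right)\left(\frac{y_2 - \bar{y}}{s_y}\right) + \cdots + \left(\frac{x_n - \bar{x}}{s_x}\right)\left(\frac{y_n - \bar{y}}{s_y}\right)\right] \]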
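The four steps, and the formula that summarizes them, can be sketched as one Python function. The name `correlation` is chosen for illustration, and the sample standard deviation (dividing by n − 1) is assumed, as in the worked example.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r of paired data, following the four steps in order."""
    if len(xs) != len(ys):
        raise ValueError("each individual needs one x value and one y value")
    n = len(xs)  # number of individuals, not number of numbers
    if n < 2:
        raise ValueError("need at least two individuals")

    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)

    # Steps 1 to 3: standard scores, multiplied pairwise, then added up
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))

    # Step 4: divide by n - 1
    return total / (n - 1)


heights = [67, 72, 77, 74, 69]       # inches
weights = [155, 220, 240, 195, 175]  # pounds
r = correlation(heights, weights)
print(f"r = {r:.2f}")
```

For the five men in the example this gives a value that rounds to 0.9, matching the hand calculation.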
Calculating correlation
One more fact about correlation worth noting: the correlation
between two variables does not depend on the units we use to
measure them.
Height (inches)     67    72    77    74    69
Weight (pounds)    155   220   240   195   175
For this data, we found 𝑟 = 0.9
If we had measured the heights and weights of these five men
in centimeters and kilograms, our data would look like this:
Height (cm)    170   183   196   188   175
Weight (kg)     70   100   109    88    79
Calculating correlation
It turns out that for this data the correlation is also 𝑟 = 0.9.
Even though the numbers are different, the correlation is
the same (apart from tiny differences caused by rounding the
converted values). This is not a coincidence:
When we find correlation, it does not matter what units we
use.
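A quick way to see this is to convert the data and recompute. The sketch below is illustrative: `correlation` is a helper name chosen here, and the conversion factors (2.54 cm per inch, 0.45359237 kg per pound) are the standard exact definitions rather than anything stated in the lecture.

```python
from math import isclose
from statistics import mean, stdev


def correlation(xs, ys):
    """Sum of products of standard scores, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


heights_in = [67, 72, 77, 74, 69]
weights_lb = [155, 220, 240, 195, 175]

# Unrounded conversion to metric units
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]

r_original = correlation(heights_in, weights_lb)
r_metric = correlation(heights_cm, weights_kg)

# The rounded metric values from the table above
r_rounded = correlation([170, 183, 196, 188, 175], [70, 100, 109, 88, 79])

print(f"inches and pounds:   r = {r_original:.4f}")
print(f"cm and kg (exact):   r = {r_metric:.4f}")
print(f"cm and kg (rounded): r = {r_rounded:.4f}")
print("same up to floating-point error:", isclose(r_original, r_metric))
```

Changing units multiplies every value, the mean, and the standard deviation by the same factor, so each standard score is unchanged, and therefore so is the correlation.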