Chapter 4 - peacock

AP Stats
Chapter 4
Part 3
Displaying and
Summarizing
Quantitative Data
Learning Goals
1. Know how to display the distribution of a
quantitative variable with a histogram, a
stem-and-leaf display, or a dotplot.
2. Know how to display the relative position of a
quantitative variable with a Cumulative
Frequency Curve and how to analyze the
Cumulative Frequency Curve.
3. Be able to describe the distribution of a
quantitative variable in terms of its shape.
4. Be able to describe any anomalies or
extraordinary features revealed by the
display of a variable.
Learning Goals
5. Be able to determine the shape of the
distribution of a variable by knowing
something about the data.
6. Know the basic properties and how to
compute the mean and median of a set of
data.
7. Understand the properties of a skewed
distribution.
8. Know the basic properties and how to
compute the standard deviation and IQR of
a set of data.
Learning Goals
9. Understand which measures of center and
spread are resistant and which are not.
10. Be able to select a suitable measure of
center and a suitable measure of spread for
a variable based on information about its
distribution.
11. Be able to describe the distribution of a
quantitative variable in terms of its shape,
center, and spread.
Learning Goal 6
Know the basic
properties and how to
compute the mean and
median of a set of data.
Learning Goal 6:
Measures of Central Tendency
• A measure of central tendency for a
collection of data values is a number
that is meant to convey the idea of
centralness or center of the data set.
• The most commonly used measures of
central tendency for sample data are
the mean, median, and mode.
Learning Goal 6:
Measures of Central Tendency
Overview: Central Tendency
• Mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
• Median: midpoint of ranked values
• Mode: most frequently observed value
Learning Goal 6:
The Mean
• Mean: The mean of a set of numerical
(data) values is the (arithmetic)
average for the set of values.
• When computing the value of the
mean, the data values can be
population values or sample values.
• Hence we can compute either the
population mean or the sample mean
Learning Goal 6:
Mean Notation
• NOTATION: The population mean
is denoted by the Greek letter µ
(read as “mu”).
• NOTATION: The sample mean is
denoted by $\bar{x}$ (read as “x-bar”).
• Normally the population mean is
unknown.
Learning Goal 6:
The Mean
• The mean is the most common measure of
central tendency.
• The mean is also the preferred measure of
center, because it uses all the data in
calculating the center.
• For a sample of size n:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
where n is the sample size and the $X_i$ are the observed values.
Learning Goal 6:
The Mean - Example
• What is the mean of the following 11
sample values?
3, 8, 6, 14, 0, 0, 12, -7, 0, -10, -4
Learning Goal 6:
The Mean - Example (Continued)
• Solution:
$$\bar{x} = \frac{3 + 8 + 6 + 14 + 0 + (-4) + 0 + 12 + (-7) + 0 + (-10)}{11} = \frac{22}{11} = 2$$
Learning Goal 6:
Mean – Frequency Table
• When a data set has a large number of
values, we summarize it as a
frequency table.
• The frequencies represent the number
of times each value occurs.
• When the mean is calculated from a
frequency table it is an approximation,
because the raw data is not known.
Learning Goal 6:
Mean – Frequency Table Example
 What is the mean of the following 11 sample
values (the same data as before)?
Class          Frequency
-10 to < -4    2
-4 to < 2      4
2 to < 8       2
8 to < 14      2
14 to < 20     1
Learning Goal 6:
Mean – Frequency Table Example
 Solution:
Class          Midpoint   Frequency
-10 to < -4    -7         2
-4 to < 2      -1         4
2 to < 8       5          2
8 to < 14      11         2
14 to < 20     17         1
$$\bar{x} \approx \frac{2(-7) + 4(-1) + 2(5) + 2(11) + 1(17)}{11} = \frac{31}{11} \approx 2.82$$
Learning Goal 6:
Calculate Mean on TI-84 Raw Data
1. Enter the raw data into a list, STAT/Edit.
2. Calculate the mean, STAT/CALC/1-Var
Stats
List: L1
FreqList: (leave blank)
Calculate
Learning Goal 6:
Calculate Mean on TI-84 Frequency Table Data
Same Data
Class      Mark   Freq
0-50       25     1
50-100     75     1
100-150    125    3
150-200    175    4
200-250    225    7
250-300    275    4
1. Enter the frequency table data into two
lists (L1: class midpoint, L2: frequency),
STAT/Edit.
2. Calculate the mean,
STAT/CALC/1-Var Stats
List: L1
FreqList: L2
Calculate
Learning Goal 6:
Calculate Mean on TI-84 – Your Turn
Raw Data: 548, 405, 375, 400, 475, 450, 412
375, 364, 492, 482, 384, 490, 492
490, 435, 390, 500, 400, 491, 945
435, 848, 792, 700, 572, 739, 572
Learning Goal 6:
Calculate Mean on TI-84 – Your Turn
 Frequency Table Data (same):
Class Limits    Frequency
350 to < 450    11
450 to < 550    10
550 to < 650    2
650 to < 750    2
750 to < 850    2
850 to < 950    1
Learning Goal 6:
Median
• The median is the midpoint of the
observations when they are ordered
from the smallest to the largest (or
from the largest to the smallest).
• If the number of observations is:
  - Odd, then the median is the middle
    observation.
  - Even, then the median is the average of
    the two middle observations.
Center of a Distribution -- Median
 The median is the value with exactly half the
data values below it and half above it.
 It is the middle data value (once the data
values have been ordered) that divides
the histogram into two equal areas.
 It has the same units as the data.
Learning Goal 6:
Finding the Median
• The location of the median:
$$\text{Median position} = \frac{n + 1}{2} \text{ position in the ordered data}$$
• If the number of values is odd, the median is the
middle number.
• If the number of values is even, the median is the
average of the two middle numbers.
• Note that $\frac{n+1}{2}$ is not the value of the median,
only the position of the median in the ranked
data.
Learning Goal 6:
Finding the Median – Example (n odd)
• What is the median for the following
sample values?
3, 8, 6, 2, 12, -7, 14, 0, -1, -10, -4
Learning Goal 6:
Finding the Median – Example (n odd)
• First, arrange the data set in order (STAT/SortA).
• The ordered set is:
-10 -7 -4 -1 0 2 3 6 8 12 14
(the 6th value is 2)
• Since the number of values is odd, the
median is in the 6th position of the
ordered set (to find the position, divide the number
of values by 2 and round up: 11/2 = 5.5 ⇒ 6).
• Thus, the value of the median is 2.
Learning Goal 6:
Finding the Median – Example (n even)
• Find the median age for the following
eight college students.
23 19 32 25 26 22 24 20
Learning Goal 6:
Finding the Median – Example (n even)
• First, order the values as shown below.
19 20 22 23 24 25 26 32
(the middle two values, 23 and 24, are averaged)
• Since there is an even number of ages, the
median is the average of the two middle
values (to find them, divide the number of values
by 2; that position and the next one hold the two
middle numbers: 8/2 = 4 ⇒ the 4th and 5th values).
• Thus, median = (23 + 24)/2 = 23.5.
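The odd and even cases can be sketched in Python as follows. The function name `median_by_position` is hypothetical, chosen for this illustration; the two data sets are the ones from the examples.

```python
def median_by_position(values):
    """Median via the 'sort, then pick the middle' rule from the slides."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2  # 0-based index of the upper middle value
    if n % 2 == 1:
        # Odd n: a single middle observation.
        return ordered[mid]
    # Even n: average the two middle observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [3, 8, 6, 2, 12, -7, 14, 0, -1, -10, -4]  # n = 11
ages = [23, 19, 32, 25, 26, 22, 24, 20]                # n = 8

print(median_by_position(odd_sample))  # 2
print(median_by_position(ages))        # 23.5
```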
Learning Goal 6:
The Median - Summary
The median is the midpoint of a distribution—the number such that half
of the observations are smaller and half are larger.
[Figure: list of n = 25 observations: 0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]
1. Sort observations from smallest to
largest. n = number of observations.
2. If n is odd, the median is the observation at
position n/2 (rounded up) in the list.
n = 25: n/2 = 25/2 = 12.5, which rounds up to 13
Median = 3.4
3. If n is even, the median is the mean
of the two center observations.
n = 24: n/2 = 12, so the center observations are the 12th and 13th
Median = (3.3 + 3.4)/2 = 3.35
[Figure: the same list with n = 24 observations (0.6 through 5.6, without 6.1).]
Learning Goal 6:
Finding the Median on the TI-84
1. Enter data into L1
2. STAT; CALC; 1:1-Var Stats
Learning Goal 6:
Find the Mean and Median – Your Turn
CO2 pollution levels in the 8 largest nations,
measured in metric tons per person:
2.3 1.1 19.7 9.8 1.8 1.2 0.7 0.2
a. Mean = 4.6, Median = 1.5
b. Mean = 4.6, Median = 5.8
c. Mean = 1.5, Median = 4.6
Learning Goal 6:
Mode





• A measure of central tendency.
• The value that occurs most often.
• Used for either numerical or categorical data.
• There may be no mode or several modes.
• Not used as a measure of center.
[Figure: two dotplots. On a 0 to 14 axis, Mode = 9; on a 0 to 6 axis, No Mode.]
Learning Goal 6:
Mode - Example
• The mode is the measurement which
occurs most frequently.
• The set: 2, 4, 9, 8, 8, 5, 3
  - The mode is 8, which occurs twice.
• The set: 2, 2, 9, 8, 8, 5, 3
  - There are two modes, 8 and 2
    (bimodal).
• The set: 2, 4, 9, 8, 5, 3
  - There is no mode (each value is
    unique).
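The three cases above can be reproduced with a small sketch. The helper `modes` is hypothetical and follows the slide's convention that a set in which every value is unique has no mode.

```python
from collections import Counter


def modes(values):
    """Return the list of modes, or an empty list when every value is unique."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # slide convention: all values unique means no mode
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([2, 4, 9, 8, 8, 5, 3]))  # [8]
print(modes([2, 2, 9, 8, 8, 5, 3]))  # [2, 8]
print(modes([2, 4, 9, 8, 5, 3]))     # []
```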
Learning Goal 6:
Summary Measures of Center
Learning Goal 7
Understand the
properties of a
skewed distribution.
Learning Goal 7:
Where is the Center of the Distribution?
• If you had to pick a single number to
describe all the data, what would you
pick?
• It’s easy to find the center when a
histogram is unimodal and
symmetric: it’s right in the middle.
• On the other hand, it’s not so easy to
find the center of a skewed histogram
or a histogram with outliers.
Learning Goal 7:
Meaningful measure of Center
Your measure of center must be meaningful.
The distribution of women’s heights appears coherent and
symmetrical. The mean is a good measure of center.
Height of 25 women in a class
$\bar{x} = 69.3$
Is the mean always a good measure of center?
Learning Goal 7:
Impact of Skewed Data
Mean and median of a symmetric distribution
Disease X: $\bar{x} = 3.4$, $M = 3.4$
Mean and median are the same.
Mean and median of a skewed distribution
Multiple myeloma: $\bar{x} = 3.4$, $M = 2.5$
The mean is pulled toward the skew.
Learning Goal 7:
The Mean
• Nonresistant: the mean is sensitive to the
influence of extreme values and/or outliers.
Skewed distributions pull the mean away
from the center towards the longer tail.
• The mean is located at the balancing point of
the histogram. For a skewed distribution, it is
not a good measure of center.
Learning Goal 7:
Mean – Nonresistant Example
• The most common measure of central tendency.
• Affected by extreme values (skewed distributions or outliers).
[Figure: two dotplots on a 0 to 10 axis.]
Data 1, 2, 3, 4, 5: Mean = $\frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$
Data 1, 2, 3, 4, 10: Mean = $\frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$
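A quick sketch using Python's standard `statistics` module shows the same effect on the two data sets above, and also computes the median for comparison. The variable names are illustrative.

```python
from statistics import mean, median

original = [1, 2, 3, 4, 5]
with_extreme = [1, 2, 3, 4, 10]  # the largest value is replaced by 10

for data in (original, with_extreme):
    print(data, "mean =", mean(data), "median =", median(data))

# The mean moves from 3 to 4, while the median stays at 3.
mean_shift = mean(with_extreme) - mean(original)
median_shift = median(with_extreme) - median(original)
```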
Learning Goal 7:
The Median
 Resistant – The median is said to be
resistant, because extreme values
and/or outliers have little effect on the
median.
 In an ordered array, the median is the
“middle” number (50% above, 50%
below).
Learning Goal 7:
Median – Resistant Example
• Not affected by extreme values (skewed
distributions or outliers).
[Figure: two dotplots on a 0 to 10 axis; Median = 3 in both.]
Learning Goal 7:
Mean vs. Median with Outliers
Percent of people dying
Without the outliers: $\bar{x} = 3.4$
With the outliers: $\bar{x} = 4.2$
The mean (non-resistant) is pulled to the right a lot by the
outliers (from 3.4 to 4.2).
The median (resistant), on the other hand, is only slightly
pulled to the right by the outliers (from 3.4 to 3.6).
Learning Goal 7:
Effect of Skewed Distributions
• The figure below shows the relative positions of the
mean and median for right-skewed, symmetric, and
left-skewed distributions.
• Note that the mean is pulled in the direction of
skewness, that is, in the direction of the extreme
observations.
• For a right-skewed distribution, the mean is greater
than the median; for a symmetric distribution, the mean
and the median are equal; and, for a left-skewed
distribution, the mean is less than the median.
Learning Goal 7:
Comparing the mean and the median
The mean and the median are the same only if the distribution is symmetrical. The
median is a measure of center that is resistant to skew and outliers. The mean is not.
[Figure: positions of the mean and median for a symmetric distribution and for skewed distributions (left skew and right skew).]
Learning Goal 7:
Which measure of location is the “best”?
• Because the median considers only the order
of values, it is resistant to values that are
extraordinarily large or small; it simply notes
that they are among the “big ones” or “small
ones” and ignores their distance from the center.
• To choose between the mean and median,
start by looking at the distribution.
• The mean is used for unimodal symmetric
distributions, unless extreme values (outliers)
exist.
• The median is used for skewed distributions or
when outliers are present, since the
median is not sensitive to extreme values.
Learning Goal 7:
Class Problem
 Observed mean =2.28, median=3,
mode=3.1
 What is the shape of the
distribution and why?
Learning Goal 7:
Example
• Five houses on a hill by the beach.
House Prices:
$2,000,000
$500,000
$300,000
$100,000
$100,000
Learning Goal 7:
Example – Measures of Center
House Prices:
$2,000,000
$500,000
$300,000
$100,000
$100,000
Sum: $3,000,000
• Mean: $3,000,000/5 = $600,000
• Median: middle value of ranked data = $300,000
• Mode: most frequent value = $100,000
Which is the best measure of center? Median
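The three measures for the house prices can be reproduced with Python's standard `statistics` module. This is a minimal sketch with illustrative variable names; the count of houses priced below the mean is added here only to show why the mean is a poor summary for this data.

```python
from statistics import mean, median, mode

prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

price_mean = mean(prices)      # 600000
price_median = median(prices)  # 300000
price_mode = mode(prices)      # 100000

# Four of the five houses cost less than the mean.
below_mean = sum(p < price_mean for p in prices)

print(price_mean, price_median, price_mode, below_mean)  # 600000 300000 100000 4
```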
Conclusion – Mean or Median?
 Mean – use with symmetrical
distributions (no outliers),
because it is nonresistant.
 Median – use with skewed
distribution or distribution with
outliers, because it is resistant.
Learning Goal 8
Know the basic
properties and how to
compute the standard
deviation and IQR of a
set of data.
Learning Goal 8:
How Spread Out is the Distribution?
 Variation matters, and Statistics is
about variation.
 Are the values of the distribution tightly
clustered around the center or more
spread out?
 Always report a measure of spread
along with a measure of center when
describing a distribution numerically.
Learning Goal 8:
Measures of Spread
 A measure of variability for a collection
of data values is a number that is
meant to convey the idea of spread for
the data set.
 The most commonly used measures of
variability for sample data are the:
 range
 interquartile range
 variance and standard deviation
Learning Goal 8:
Measures of Variation
Variation
• Range
• Interquartile Range
• Variance
• Standard Deviation
Measures of variation give information on the
spread or variability of the data values.
[Figure: two distributions with the same center but different variation.]
Learning Goal 8:
The Interquartile Range
• One way to describe the spread of a
set of data might be to ignore the
extremes and concentrate on the
middle of the data.
• The interquartile range (IQR) lets us
ignore extreme data values and
concentrate on the middle of the data.
• To find the IQR, we first need to know
what quartiles are…
Learning Goal 8:
The Interquartile Range
• Quartiles divide the data into four equal
sections.
  - One quarter of the data lies below the lower
    quartile, Q1.
  - One quarter of the data lies above the upper
    quartile, Q3.
  - The quartiles border the middle half of the data.
• The difference between the quartiles is the
interquartile range (IQR), so
IQR = upper quartile (Q3) – lower quartile (Q1)
Learning Goal 8:
Interquartile Range
 Eliminate some outlier or extreme
value problems by using the
interquartile range.
 Eliminate some high- and low-valued
observations and calculate the range
from the remaining values.
 IQR = 3rd quartile – 1st quartile
IQR = Q3 – Q1
Learning Goal 8:
Finding Quartiles
1. Order the data.
2. Find the median; this divides the data into a
lower and an upper half (the median itself is in
neither half).
3. Q1 is then the median of the lower half.
4. Q3 is the median of the upper half.
Example
Even data: Q1 = 27, M = 39, Q3 = 50.5; IQR = 50.5 – 27 = 23.5
Odd data: Q1 = 35, M = 46, Q3 = 54; IQR = 54 – 35 = 19
Learning Goal 8:
Quartiles
Example:
X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70
Each of the four segments holds 25% of the data; the two segments between Q1 and Q3 are the middle fifty percent.
Interquartile range = 57 – 30 = 27
Not influenced by extreme values (Resistant).
Learning Goal 8:
Quartiles
• Quartiles split the ranked data into 4
segments with an equal number of values per
segment.
25% | Q1 | 25% | Q2 | 25% | Q3 | 25%
• The first quartile, Q1, is the value for which
25% of the observations are smaller and 75%
are larger.
• Q2 is the same as the median (50% are
smaller, 50% are larger).
• Only 25% of the observations are greater
than the third quartile.
Learning Goal 8:
The Interquartile Range - Histogram
 The lower and upper quartiles are the 25th
and 75th percentiles of the data, so…
 The IQR contains the middle 50% of the
values of the distribution, as shown in figure:
Learning Goal 8:
Find and Interpret IQR
Travel times to work (in minutes) for 20 randomly selected New Yorkers
10 30 5 25 40 20 10 15 30 20
15 20 85 15 65 15 60 60 40 45
Sorted:
5 10 10 15 15 15 15 20 20 20
25 30 30 40 40 45 60 60 65 85
Q1 = 15
M = 22.5
Q3 = 42.5
IQR = Q3 – Q1
= 42.5 – 15
= 27.5 minutes
Interpretation: The range of the middle half of travel times
for the New Yorkers in the sample is 27.5 minutes.
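The quartile rule from the "Finding Quartiles" slide (median of the lower half, median of the upper half) can be sketched as below. The function names are hypothetical. Statistical software and calculators may use slightly different quartile conventions, so their results will not always match this rule.

```python
def median_sorted(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the median itself belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median_sorted(lower), median_sorted(ordered), median_sorted(upper)


travel_times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
                15, 20, 85, 15, 65, 15, 60, 60, 40, 45]
q1, m, q3 = quartiles(travel_times)
iqr = q3 - q1
print(q1, m, q3, iqr)  # 15.0 22.5 42.5 27.5
```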
Learning Goal 8:
Interquartile Range on the TI-84
• Use STAT/CALC/1-Var Stats to find
Q1 and Q3.
• Then calculate IQR = Q3 – Q1.
Interquartile range = Q3 – Q1 = 9 – 6 = 3.
Learning Goal 8:
Calculate IQR - Your Turn
The following scores for a statistics 10-point
quiz were reported. What is the
value of the interquartile range?
7 8 9 6 8 0 9 9 9
0 0 7 10 9 8 5 7 9
Learning Goal 8:
5-Number Summary
Definition:
The five-number summary of a distribution consists
of the smallest observation, the first quartile, the
median, the third quartile, and the largest observation,
written in order from smallest to largest.
Minimum
Q1
M
Q3
Maximum
Learning Goal 8:
5-Number Summary
 The 5-number summary of a distribution
reports its minimum, 1st quartile Q1, median,
3rd quartile Q3, and maximum in that order.
 Obtain 5-number summary from 1-Var Stats.
Min.   Q1    Med.   Q3    Max.
3.7    6.6   7      7.6   9
Learning Goal 8:
Calculate 5 Number Summary
1. Enter data into L1.
2. STAT; CALC; 1:1-Var Stats; Enter.
3. List: L1.
4. Calculate.
5. Scroll down to the 5-number summary.
Learning Goal 8:
Calculate 5 Number Summary – Your Turn
The grades of 25 students are given
below :
42, 63, 47, 77, 46, 71, 68, 83, 91, 55,
67, 66, 63, 57, 50, 69, 73, 82, 77, 58,
66, 79, 88, 97, 86.
Calculate the 5-number summary for the
students' grades.
Learning Goal 8:
Calculate 5 Number Summary – Your Turn
 A group of University students took part in a
sponsored race. The number of laps completed is
given in the table.
number of laps   frequency (x)
1 – 5            2
6 – 10           9
11 – 15          15
16 – 20          20
21 – 25          17
26 – 30          25
31 – 35          2
36 – 40          1
 Calculate the 5 number summary.
Learning Goal 8:
Standard Deviation
• A more powerful measure of spread
than the IQR is the standard deviation,
which takes into account how far each
data value is from the mean.
• A deviation is the distance that a data
value is from the mean.
  - Since adding all deviations together would
    total zero, we square each deviation and
    find an average of sorts for the deviations.
• But to calculate the standard deviation
you must first calculate the variance.
Learning Goal 8:
Variance
• The variance is a measure of variability
that uses all the data.
• It measures the average squared deviation of
the measurements about their mean.
Learning Goal 8:
Variance
• The variance, denoted by $s^2$, is found by
summing the squared deviations and
(almost) averaging them:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
• Used to calculate the standard deviation.
• The variance will play a role later in our
study, but it is problematic as a measure of
spread: it is measured in squared units, not
the same units as the data, a serious
disadvantage!
Learning Goal 8:
Variance
• The variance of a population of N
measurements is the average of the squared
deviations of the measurements about their
mean $\mu$.
Sigma squared:
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$
• The variance of a sample of n measurements
is the sum of the squared deviations of the
measurements about their mean, divided by
(n – 1).
S squared:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
Learning Goal 8:
Standard Deviation
• The standard deviation, s, is just the
square root of the variance.
• It is measured in the same units as the
original data, which is why it is preferred over
the variance.
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
Learning Goal 8:
Standard Deviation
 In calculating the variance, we squared
all of the deviations, and in doing so
changed the scale of the
measurements.
 To return this measure of variability to
the original units of measure, we
calculate the standard deviation, the
positive square root of the variance.
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$
Learning Goal 8:
Finding Standard Deviation
 The most common measure of spread looks at how far
each observation is from the mean. This measure is
called the standard deviation. Let’s explore it!
• Consider the following data on the number of pets
owned by a group of 9 children.
1) Calculate the mean. $\bar{x} = 5$
2) Calculate each deviation.
deviation = observation – mean
deviation: 1 - 5 = -4
deviation: 8 - 5 = 3
Learning Goal 8:
Finding Standard Deviation
xi    (xi - mean)    (xi - mean)^2
1     1 - 5 = -4     (-4)^2 = 16
3     3 - 5 = -2     (-2)^2 = 4
4     4 - 5 = -1     (-1)^2 = 1
4     4 - 5 = -1     (-1)^2 = 1
4     4 - 5 = -1     (-1)^2 = 1
5     5 - 5 = 0      (0)^2 = 0
7     7 - 5 = 2      (2)^2 = 4
8     8 - 5 = 3      (3)^2 = 9
9     9 - 5 = 4      (4)^2 = 16
      Sum = ?        Sum = ?
3) Square each deviation.
4) Find the “average” squared deviation. Calculate the sum of
the squared deviations divided by (n - 1); this is called the
variance.
“average” squared deviation = 52/(9 - 1) = 6.5. This is the variance.
5) Calculate the square root of the variance; this is the standard
deviation.
Standard deviation = square root of variance = $\sqrt{6.5} \approx 2.55$
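The five steps can be traced in a short Python sketch. The variable names are illustrative; the data are the nine pet counts from the table.

```python
import math

pets = [1, 3, 4, 4, 4, 5, 7, 8, 9]
n = len(pets)

x_bar = sum(pets) / n                        # 1) mean
deviations = [x - x_bar for x in pets]       # 2) observation - mean
squared = [d ** 2 for d in deviations]       # 3) squared deviations
variance = sum(squared) / (n - 1)            # 4) "average" squared deviation
std_dev = math.sqrt(variance)                # 5) square root of the variance

print(x_bar, sum(deviations), sum(squared))  # 5.0 0.0 52.0
print(variance, round(std_dev, 2))           # 6.5 2.55
```

The deviations sum to zero, which is why they are squared before averaging.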
Learning Goal 8:
Standard Deviation - Example
The standard deviation is used to describe the variation around the mean.
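1) First calculate the variance $s^2$:
$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
2) Then take the square root to get
the standard deviation s:
$$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
[Figure: a distribution with the mean $\bar{x}$ and the interval mean ± 1 s.d. marked.]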
Learning Goal 8:
Standard Deviation - Procedure
1. Compute the mean $\bar{x}$.
2. Subtract the mean from each individual
value to get a list of the deviations from the
mean, $(x - \bar{x})$.
3. Square each of the differences to produce
the squares of the deviations from the mean,
$(x - \bar{x})^2$.
4. Add all of the squares of the deviations from
the mean to get $\sum (x - \bar{x})^2$.
5. Divide the sum $\sum (x - \bar{x})^2$
by $(n - 1)$. [variance]
6. Find the square root of the result.
Learning Goal 8:
Standard Deviation - Example
 Find the standard deviation of the Mulberry
Bank customer waiting times. Those times (in
minutes) are 1, 3, 14. Use a Table.
We will not normally calculate standard deviation by hand.
Learning Goal 8:
Calculate Standard Deviation
1. Enter data into L1.
2. STAT; CALC; 1:1-Var Stats; Enter.
3. List: L1; Calculate.
4. Sx is the sample standard deviation.
5. σx is the population standard
deviation.
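The difference between Sx and σx can be illustrated with Python's standard `statistics` module, where `stdev` divides by n - 1 and `pstdev` divides by n. This sketch uses the Mulberry Bank waiting times (1, 3, 14 minutes) from the earlier example; the variable names are assumptions for illustration.

```python
from statistics import pstdev, stdev

waiting_times = [1, 3, 14]  # minutes

sample_sd = stdev(waiting_times)       # divides by n - 1, like Sx
population_sd = pstdev(waiting_times)  # divides by n, like sigma-x

# Squared deviations from the mean of 6 are 25, 9 and 64, which sum to 98.
print(sample_sd)                 # sqrt(98 / 2) = 7.0
print(round(population_sd, 2))   # sqrt(98 / 3), about 5.72
```

For the same data the sample value is always at least as large as the population value, because it divides by the smaller number n - 1.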
Learning Goal 8:
Calculate Standard Deviation – Your Turn
 The prices ($) of 18 brands of walking
shoes:
90 70 70 70 75 70
65 68 60 74 70 95
75 70 68 65 40 65
 Calculate the standard deviation.
Learning Goal 8:
Calculate Standard Deviation – Your Turn
 During 3 hours at Heathrow airport 55 aircraft
arrived late. The number of minutes they
were late is shown in the grouped frequency
table.
minutes late   frequency
0 - 9          27
10 - 19        10
20 - 29        7
30 - 39        5
40 - 49        4
50 - 59        2
 Calculate the standard deviation for the
number of minutes late.
Learning Goal 8:
Standard Deviation - Properties
• The value of s is never negative.
• s is zero only when all of the data values are the
same number.
• Larger values of s indicate greater amounts of
variation.
• The units of s are the same as the units of the
original data. This is one reason s is preferred to $s^2$.
• s measures spread about the mean and should
only be used to describe the spread of a
distribution when the mean is used to describe
the center (i.e., symmetrical distributions).
• Nonresistant (like the mean): s can increase
dramatically due to extreme values or outliers.
Learning Goal 8:
Standard Deviation - Example
Larger values of standard deviation indicate greater amounts
of variation.
Small standard deviation
Large standard deviation
Learning Goal 8:
Standard Deviation - Example
Standard Deviation: the more variation, the
larger the standard deviation. Data set II has
greater variation.
Learning Goal 8:
Standard Deviation - Example
Data Set I
Data Set II
Data set II has greater variation and the visual clearly
shows that it is more spread out.
Learning Goal 8:
Comparing Standard Deviations
The more variation, the larger the standard deviation.
[Figure: three dotplots on an 11 to 21 axis, each with Mean = 15.5.]
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.567
Values far from the mean are given extra weight
(because deviations from the mean are squared).
Learning Goal 8:
Spread: Range
 The range of the data is the difference
between the maximum and minimum
values:
Range = max – min
 A disadvantage of the range is that a
single extreme value can make it very
large and, thus, not representative of
the data overall.
Learning Goal 8:
Range
 Simplest measure of variation.
 Difference between the largest and the
smallest values in a set of data.
Range = Xlargest – Xsmallest
Example:
[Figure: dotplot on a 0 to 14 axis.]
Range = 14 - 1 = 13
Learning Goal 8:
Disadvantages of the Range
• Ignores the way in which data are distributed
[Figure: two dotplots on a 7 to 12 axis, each with Range = 12 - 7 = 5.]
• Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
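The outlier example can be reproduced with a few lines of Python. The helper `value_range` is a hypothetical name; the two lists are built to match the data sets on the slide.

```python
def value_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)


# Eleven 1s, eight 2s, four 3s, then 4 and 5 (25 values in total).
base = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
# Same data with the largest value replaced by the outlier 120.
with_outlier = base[:-1] + [120]

print(len(base), value_range(base))                  # 25 4
print(len(with_outlier), value_range(with_outlier))  # 25 119
```

Changing a single value out of 25 moves the range from 4 to 119, which is what makes it a nonresistant measure of spread.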
Learning Goal 8:
Range
• The range is affected by outliers (large
or small values relative to the rest of
the data set).
• The range does not utilize all the
information in the data set, only the
largest and smallest values.
• Thus, range is not a very useful
measure of spread or variation.
Learning Goal 8:
Summary Measures
Describing Data Numerically
• Central Tendency: Mean, Median, Mode
• Quartiles
• Variation: Range, Interquartile Range, Variance, Standard Deviation
• Shape: Skewness
Learning Goal 9
Understand which
measures of center and
spread are resistant
and which are not.
Learning Goal 9:
Resistant or Non-Resistant
 Which measures of center and spread
are resistant?
1. Median – Extreme values and outliers
have little effect.
2. IQR – Measures the spread of the middle
50% of the data, therefore extreme
values and outliers have little effect.
3. When using Median to measure the
center of a distribution, use IQR to
measure the spread of the distribution.
Learning Goal 9:
Resistant or Non-Resistant
 Which measures of center and spread
are Non-Resistant?
1. Mean – Extreme values and outliers pull
the mean towards those values.
2. Standard Deviation – Measures the
spread relative to the mean. Extreme
values or outliers will increase the
standard deviation of the distribution.
3. When using Mean to measure the center
of a distribution, use Standard Deviation
to measure the spread of the distribution.
Learning Goal 9:
Resistant or Non-Resistant
 Measures of Center:
 Mean (not resistant)
 Median (resistant)
 Measures of Spread:
 Standard deviation (not resistant)
 IQR (resistant)
 Range (not resistant)
 Most often and preferred, use the mean and the
standard deviation, because they are calculated
based on all the data values, so use all the
available information.
Learning Goal 9:
Resistant or Non-Resistant
Center and Spread (animated slide; three sets of summary values are shown)
Mean: 63.33, 68.82, 72.5
Median: 70, 70, 72.5
S: 16.84, 12.56, 10.16
IQR: 30, 20, 15
What is the difference between
the center and spread of a
distribution?
Which measure of center
(mean or median) was affected
more by adding data points
that skewed the distribution?
Explain your answer.
[Figure: plot of Quiz Scores on a 20 to 100 axis.]
In a symmetric distribution:
• The mean, non-resistant, is used to represent the center.
• The standard deviation (S), non-resistant, is used to represent the spread.
In a skewed distribution:
• The median, resistant, is used to represent the center.
• The interquartile range (IQR), resistant, is used to represent the spread.
©2013 All rights reserved.
For each distribution below,
which measure of center and
spread would you use?
How do you know?
[Figure: two distributions, labeled A and B.]
Answer choices: Mean & S, or Median & IQR
CCSS 6th Grade Statistics and Probability 2.0
Describe the distribution of a data set.
Lesson to be used by EDI-trained teachers only.
Learning Goal 9:
Resistant or Non-Resistant
 Median and IQR are paired together –
Resistant.
 Mean and Standard Deviation are
paired together – Non-Resistant.
Learning Goal 10
Be able to select a
suitable measure of
center and a suitable
measure of spread for
a variable based on
information about its
distribution.
Learning Goal 10:
Choosing Measures of Center and Spread
 We now have a choice between two descriptions for center and
spread
 Mean and Standard Deviation
 Median and Interquartile Range
Choosing Measures of Center and Spread
•The median and IQR are usually better than the mean and
standard deviation for describing a skewed distribution or a
distribution with outliers.
•Use mean and standard deviation only for reasonably
symmetric distributions that don’t have outliers.
•NOTE: Numerical summaries do not fully describe the
shape of a distribution. ALWAYS PLOT YOUR DATA!
Learning Goal 10:
Choosing Measures of Center and Spread
Plot your data
Dotplot, Stemplot, Histogram
Interpret what you see:
Shape, Outliers, Center, Spread
Choose numerical summary:
$\bar{x}$ and s, or
Median and IQR
Learning Goal 10:
Choosing Center and Spread - Practice
The distribution of a data set shows the arrangement of values in the data set.
The center of a distribution is a number that represents all the values in the data set.
The spread of a distribution is a number that describes the variability in the data set.
The dot plots below show the ratings given to a new movie by two different audiences.
Audience #1
[Figure: dot plot of Audience Rating on a 1 to 10 axis.]
Mean: 7, Median: 7, S: 1.43, IQR: 2
Shape: The shape of the distribution is mostly symmetric.
Center: Because the distribution is symmetric, the mean of 7 can be used as the measure of center.
Spread: The S of the distribution is 1.43.
Audience #2
[Figure: dot plot of Audience Rating on a 1 to 10 axis.]
Mean: 5.71, Median: 6, S: 1.67, IQR: 3
Shape: The shape of the distribution is mostly symmetric.
Center: Because the distribution is symmetric, the mean of 5.71 can be used as the measure of center.
Spread: The S of the distribution is 1.67.
Answer choices shown on the slide: Symmetric (Center: Mean, Spread: S) or Skewed (Center: Median, Spread: IQR).
Learning Goal 10:
Choosing Center and Spread - Practice
The histograms below show the number of hours studied in a week for students in two math classes.
Class #1
[Figure: histogram of Students (0 to 10) by Hours Studied (0-2, 3-5, 6-8, 9-11, 12-14, 15-17).]
Mean: 9.69, Median: 10.5, S: 3.6, IQR: 6.5
Shape: The shape of the distribution is skewed to the left.
Center: Because the distribution is skewed, the median of 10.5 can be used as the measure of center.
Spread: The IQR of the distribution is 6.5.
Class #2
[Figure: histogram of Students (0 to 10) by Hours Studied (0-2, 3-5, 6-8, 9-11, 12-14, 15-17).]
Mean: 7.75, Median: 7, S: 2.93, IQR: 4.5
Shape: The shape of the distribution is skewed to the right.
Center: Because the distribution is skewed, the median of 7 can be used as the measure of center.
Spread: The IQR of the distribution is 4.5.
Answer choices shown on the slide: Symmetric (Center: Mean, Spread: S) or Skewed (Center: Median, Spread: IQR).
Learning Goal 10:
Choosing Center and Spread - Practice
Symmetric: Center: Mean; Spread: S
Skewed: Center: Median; Spread: IQR
1. The dot plot below shows the number of hours of sleep per night for 33 students in a 6th-grade class.
[Dot plot; x-axis: Hours of Sleep (4 to 11)]
Mean: 8.4
Median: 9
S: 1.53
IQR: 3
Shape: The shape of the distribution is skewed left.
Center: Because the distribution is skewed, the median of 9 can be used as the measure of center.
Spread: The IQR of the distribution is 3.
2. The histogram below shows the number of hours of sleep per night for 33 adults selected at random.
[Histogram; x-axis: Hours Slept (0-1, 2-3, 4-5, 6-7, 8-9, 10+); y-axis: Adults (ticks 2 to 12)]
Mean: 6.8
Median: 7
S: 1.54
IQR: 2.5
Shape: The shape of the distribution is fairly symmetric, with a slight skew to the left.
Center: Because the distribution is mostly symmetric, the mean of 6.8 can be used as the measure of center.
Spread: The S of the distribution is 1.54.
Learning Goal 10:
Choosing Center and Spread - Practice
The histograms below show the scores of 31 students on a pretest and posttest.
1. Pretest
[Histogram; x-axis: Score (41-50, 51-60, 61-70, 71-80, 81-90, 91-100); y-axis: Students (ticks 2 to 12)]
Mean: 57.67
Median: 54
S: 9.07
IQR: 14
Shape: The shape of the distribution is skewed right.
Center: Because the distribution is skewed, the median of 54 can be used as the measure of center.
Spread: The IQR of the distribution is 14.
2. Posttest
[Histogram; x-axis: Score (41-50, 51-60, 61-70, 71-80, 81-90, 91-100); y-axis: Students (ticks 2 to 12)]
Mean: 76
Median: 76
S: 9.81
IQR: 24
Shape: The shape of the distribution is mostly symmetric.
Center: Because the distribution is mostly symmetric, the mean of 76 can be used as the measure of center.
Spread: The S of the distribution is 9.81.
Did scores on the test improve from the pretest
to the posttest? Explain your answer.
Yes, test scores improved from the pretest to the posttest. It
can be seen by the noticeably higher center in the distribution
of scores for the posttest.
Learning Goal 10:
Choosing Center and Spread - Practice
1. The dot plot below shows the number of pets in each household of 28 students in a 6th-grade class.
[Dot plot; x-axis: Number of Pets (0 to 9)]
Mean: 1.82
Median: 2
S: 1.13
IQR: 1.5
Shape: The shape of the distribution is skewed right.
Center: Because the distribution is skewed, the median of 2 can be used as the measure of center.
Spread: The IQR of the distribution is 1.5.
Learning Goal 10:
Choosing Center and Spread - Questions
Choose Yes or No to indicate whether each statement is true about these distributions.
A. Both distributions are symmetric. O Yes O No
B. The median is the best measure of center for Distribution A. O Yes O No
C. Overall, scores were higher in Distribution A than in Distribution B. O Yes O No
D. There is more variability in scores for Distribution A than for Distribution B. O Yes O No
E. Distribution A is skewed to the right. O Yes O No
F. The standard deviation can be used to describe the spread for Distribution B. O Yes O No
Learning Goal 11
Be able to describe the
distribution of a
quantitative variable in
terms of its shape,
center, and spread.
Learning Goal 11:
How to Analyze Quantitative Data
Examine each variable by itself.
Then study relationships among the variables.
2009 Fuel Economy Guide
No.  Model                 MPG
1    Acura RL              22
2    Audi A6 Quattro       23
3    Bentley Arnage        14
4    BMW 528i              28
5    Buick Lacrosse        28
6    Cadillac CTS          25
7    Chevrolet Malibu      33
8    Chrysler Sebring      30
9    Dodge Avenger         30
10   Hyundai Elantra       33
11   Jaguar XF             25
12   Kia Optima            32
13   Lexus GS 350          26
14   Lincoln MKZ           28
15   Mazda 6               29
16   Mercedes-Benz E350    24
17   Mercury Milan         29
18   Mitsubishi Galant     27
19   Nissan Maxima         26
20   Rolls Royce Phantom   18
21   Saturn Aura           33
22   Toyota Camry          31
23   Volkswagen Passat     29
24   Volvo S80             25
Start with a graph or graphs.
Add numerical summaries.
Learning Goal 11:
How to Describe a Quantitative Distribution
The purpose of a graph is to help us understand the data. After you
make a graph, always ask, “What do I see?”
How to Describe the Distribution of a Quantitative Variable
In any graph, look for the overall pattern and for striking
departures from that pattern.
Describe the overall pattern of a distribution by its:
• Shape
• Outliers
• Center
• Spread
Don’t forget your SOCS!
Note individual values that fall outside the overall pattern.
These departures are called outliers.
Learning Goal 11:
Describing a Quantitative Distribution
• We describe a distribution (the values the variable
takes on and how often it takes these values) using
the acronym SOCS.
• Shape: We describe the shape of a distribution in one of
two ways: Symmetric/Approx. Symmetric or Skewed right/Skewed left.
• Approx. Symmetric (with extreme values)
[Dot plot: Babe Ruth’s Single Season Home Runs; x-axis: Number of Home Runs in a Single Season, 20 to 65]
Learning Goal 11:
Describing a Quantitative Distribution
• Outliers: Observations that we would consider
“unusual”, that is, data that don’t “fit” the overall pattern of
the distribution.
• Babe Ruth had two seasons that appear to be
somewhat different from the rest of his career.
These may be “outliers”. (We’ll learn a numerical way
to determine whether observations are truly “unusual” later.)
• Possible outliers: 22, 25
[Dot plot: Babe Ruth’s Single Season Home Runs; x-axis: Number of Home Runs in a Single Season, 20 to 65; the two lowest seasons are marked “Possible Outliers” and “Unusual observation???”]
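The slide defers the numerical test for outliers to a later lesson. As a hedged preview, here is a minimal sketch of one common convention, the 1.5 × IQR rule. The season totals below are an assumption: they are Ruth's home run counts as a Yankee (1920 to 1934) from the public record, not values listed in this transcript. The helper name `iqr_outliers` is hypothetical, and quartile conventions differ between tools, so other software may give slightly different fences.

```python
from statistics import quantiles

# Assumed data: Ruth's season home run totals, 1920-1934 (not in the transcript).
ruth_hr = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]


def iqr_outliers(values, k=1.5):
    """Return the (low, high) fences and the values lying outside them."""
    q1, _median, q3 = quantiles(values, n=4)  # default "exclusive" method
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    flagged = [x for x in values if x < low_fence or x > high_fence]
    return (low_fence, high_fence), flagged


fences, flagged = iqr_outliers(ruth_hr)
print("fences:", fences)
print("flagged:", flagged)
```

Under these assumptions the fences are 6.5 and 82.5, so neither the 22 nor the 25 season is flagged. That is consistent with the slide calling them only "possible" outliers.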
Learning Goal 11:
Describing a Quantitative Distribution
• Center: A single value that describes the entire
distribution. Symmetric distributions use the mean and
skewed distributions use the median.
• Median is 46
[Dot plot: Babe Ruth’s Single Season Home Runs; x-axis: Number of Home Runs in a Single Season, 20 to 65]
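The median quoted on this slide can be checked with a short Python sketch. The 15 season totals are an assumption taken from the public record of Ruth's Yankee seasons (1920 to 1934); the transcript shows only the dot plot, not the raw values.

```python
from statistics import mean, median

# Assumed data: Ruth's season home run totals, 1920-1934 (not in the transcript).
ruth_hr = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

ordered = sorted(ruth_hr)
# With 15 values, the median is the 8th ordered value (index 7).
middle = ordered[len(ordered) // 2]

print("median by position:", middle)
print("median:", median(ruth_hr))
print("mean:", round(mean(ruth_hr), 2))
```

With these values the median is 46, matching the slide, and the mean is a little lower because the two low seasons pull it down.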
Learning Goal 11:
Describing a Quantitative Distribution
• Spread: Describe the variation of a distribution.
Symmetric distributions use the standard deviation and
skewed distributions use the IQR.
• IQR is 19
[Dot plot: Babe Ruth’s Single Season Home Runs; x-axis: Number of Home Runs in a Single Season, 20 to 65; Q1 and Q3 are marked]
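A minimal sketch of how the IQR on this slide can be computed, taking Q1 and Q3 as the medians of the lower and upper halves of the ordered data. The season totals are an assumption (Ruth's Yankee seasons, 1920 to 1934, from the public record); the function name `quartiles` is illustrative.

```python
from statistics import median, stdev

# Assumed data: Ruth's season home run totals, 1920-1934 (not in the transcript).
ruth_hr = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]


def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves.

    The overall median is left out of both halves when the count is odd.
    Needs at least two values.
    """
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[-half:])


q1, q3 = quartiles(ruth_hr)
iqr = q3 - q1
s = stdev(ruth_hr)  # sample standard deviation, for comparison

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("S:", round(s, 2))
```

Under these assumptions Q1 is 35 and Q3 is 54, which gives the IQR of 19 shown on the slide.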
Learning Goal 11:
Distribution Description using SOCS
• The distribution of Babe Ruth’s
number of home runs in a single
season is approximately symmetric [1],
with two possible outliers
at 22 and 25 home runs [2]. He typically
hit about 46 home runs in a season [3].
Over his career, the number of home
runs in a season normally varied
between 35 and 54 [4].
[1] Shape
[2] Outliers
[3] Center
[4] Spread
Learning Goal 11:
Describe the Distribution – Your Turn
• The table and dotplot below display the
Environmental Protection Agency’s estimates
of highway gas mileage in miles per gallon
(MPG) for a sample of 24 model year 2009
midsize cars.
• Describe the shape, center, and spread of
the distribution. Are there any outliers?
[Table: 2009 Fuel Economy Guide, highway MPG for the same 24 midsize cars tabulated earlier under Learning Goal 11; dotplot not reproduced]
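One way to start on this exercise is to compute the numerical summaries in Python. This is a sketch under stated assumptions. The (model, MPG) pairs are reconstructed from the flattened 2009 Fuel Economy Guide table in this transcript, and "BMW 528i" is an assumed reading of the garbled "BMW 5281". The helper `five_number_summary` is a hypothetical name, with quartiles taken as medians of each half.

```python
from statistics import mean, median

# (model, highway MPG) pairs reconstructed from the flattened table.
cars = [
    ("Acura RL", 22), ("Audi A6 Quattro", 23), ("Bentley Arnage", 14),
    ("BMW 528i", 28), ("Buick Lacrosse", 28), ("Cadillac CTS", 25),
    ("Chevrolet Malibu", 33), ("Chrysler Sebring", 30), ("Dodge Avenger", 30),
    ("Hyundai Elantra", 33), ("Jaguar XF", 25), ("Kia Optima", 32),
    ("Lexus GS 350", 26), ("Lincoln MKZ", 28), ("Mazda 6", 29),
    ("Mercedes-Benz E350", 24), ("Mercury Milan", 29), ("Mitsubishi Galant", 27),
    ("Nissan Maxima", 26), ("Rolls Royce Phantom", 18), ("Saturn Aura", 33),
    ("Toyota Camry", 31), ("Volkswagen Passat", 29), ("Volvo S80", 25),
]


def five_number_summary(values):
    """Min, Q1, median, Q3, max, with quartiles as medians of each half."""
    data = sorted(values)
    half = len(data) // 2
    return data[0], median(data[:half]), median(data), median(data[-half:]), data[-1]


mpg = [value for _model, value in cars]
summary = five_number_summary(mpg)
lowest_model = min(cars, key=lambda car: car[1])[0]

print("mean:", mean(mpg))
print("five-number summary:", summary)
print("lowest MPG:", lowest_model)
```

The mean (27) sits just below the median (28), and the lowest value, 14 MPG for the Bentley Arnage, stands well apart from the rest. Both observations are worth weighing when describing shape and possible outliers.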
Learning Goal 11:
Describe the Distribution – Your Turn
Smart Phone Battery Life: Here is the estimated battery life for each of 9 different smart phones in minutes. Describe the distribution.
Smart Phone         Battery Life (minutes)
Apple iPhone        300
Motorola Droid      385
Palm Pre            300
Blackberry Bold     360
Blackberry Storm    330
Motorola Cliq       360
Samsung Moment      330
Blackberry Tour     300
HTC Droid           460
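Since the advice in this chapter is to start with a graph, here is a minimal Python sketch of a text dot plot for the nine battery lives. The function name `text_dotplot` is hypothetical. The pairing of values with phones follows the order in which they appear in the flattened table, which is an assumption about the original layout.

```python
from collections import Counter

# Battery life in minutes, paired with phones in the order they appear in the table.
battery_minutes = {
    "Apple iPhone": 300, "Motorola Droid": 385, "Palm Pre": 300,
    "Blackberry Bold": 360, "Blackberry Storm": 330, "Motorola Cliq": 360,
    "Samsung Moment": 330, "Blackberry Tour": 300, "HTC Droid": 460,
}


def text_dotplot(values):
    """Return one line per distinct value, with one dot per observation."""
    counts = Counter(values)
    return [f"{value:>4} | {'.' * counts[value]}" for value in sorted(counts)]


for line in text_dotplot(battery_minutes.values()):
    print(line)
```

This rough plot lists only the values that occur, so it does not show the gap between 385 and 460 to scale; a real dot plot on a number line would make that gap visible.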
Cartoon Time