Transcript Section 1-3

5-Minute Check on Lesson 1-2
1. What 4 terms are used to describe data sets or distributions?
Shape, Outliers, Center, Spread (SOCS)
2. Which type of graph can our calculators do (bar or
histogram)?
histogram
3. How many classes should a histogram have?
classes = square root (number of observations)
4. What needs to be looked for in time-series graphs?
seasonal trends
5. What is the major difference between a histogram and a stemplot?
histogram summarizes the data
stem-plot maintains the data
6. Name a possible graphical error in a histogram
overlapping categories
Lesson 1 - 3
Describing Quantitative Data
with Numbers
adapted from Mr. Molesky’s TPS 4E slides
Objectives
• Calculate and interpret measures of center (mean,
median, mode)
• Calculate and interpret measures of spread (IQR,
standard deviation, range)
• Identify outliers using the 1.5 x IQR rule
• Make a boxplot
• Select appropriate measures of center and spread
• Use appropriate graphs and numerical summaries to
compare distributions of quantitative variables
Vocabulary
• Boxplot – graphs the five number summary and any outliers
• Degrees of freedom – the number of independent pieces of
information that are included in your measurement
• Five-number summary – the minimum, Q1, Median, Q3,
maximum
• Interquartile range – the range of the middle 50% of the data;
(IQR) – IQR = Q3 – Q1
• Mean – the average value (balance point); x-bar
• Median – the middle value (in an ordered list); M
• Mode – the most frequent data value
Vocabulary cont
• Outlier – a data value that lies outside the interval [Q1 – 1.5 × IQR, Q3 + 1.5 × IQR]
• Pth percentile – p percent of the observations (in an ordered
list) fall at or below this number
• Quartile – multiples of 25th percentile (Q1 – 25th; Q2 –50th or
median; Q3 – 75th)
• Range – difference between the largest and smallest
observations
• Resistant measure – a measure (statistic or parameter) that is
not sensitive to the influence of extreme observations
• Standard Deviation– the square root of the variance
• Variance – the average of the squares of the deviations from
the mean
Measures of Center
Numerical descriptions of distributions begin with a
measure of its “center”
If you could summarize the data with one number,
what would it be?
Mean: The “average” value of a dataset
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$
Median: The “middle” value of an ordered dataset
1. Arrange observations in order min to max
2. Locate the middle observation (average the two middle
observations if n is even)
Mean vs Median
The mean and the median are the most common
measures of center
If a distribution is perfectly symmetric,
the mean and the median are the same
The mean is not resistant to outliers
The mode, the data value that occurs the most often,
is a common measure of center for categorical data
You must decide which number is the most
appropriate description of the center...
MeanMedian Applet
Use the mean on symmetric data and
the median on skewed data or data with outliers
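A quick numerical check of the resistance claim, as a minimal sketch. The five values below are made up for illustration (they are not from the slides). Appending one extreme value moves the mean a lot and the median very little.

```python
from statistics import mean, median

# Hypothetical data, chosen only to illustrate resistance
data = [10, 12, 13, 15, 20]
with_outlier = data + [200]  # add one extreme observation

mean_before, median_before = mean(data), median(data)
mean_after, median_after = mean(with_outlier), median(with_outlier)

# The mean jumps from 14 to 45; the median only moves from 13 to 14
print("mean:", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```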
Distributions Parameters
Skewed Left: (tail to the left)
Mean < Median < Mode
Mean substantially smaller than median
(tail pulls mean toward it)
[Figure: left-skewed curve with mean, median, and mode marked]
Distributions Parameters
Symmetric:
Mean ≈ Median ≈ Mode
Mean roughly equal to median
[Figure: symmetric curve with mean, median, and mode marked]
Distributions Parameters
Skewed Right: (tail to the right)
Mean > Median > Mode
Mean substantially greater than median
(tail pulls mean toward it)
[Figure: right-skewed curve with mean, median, and mode marked]
Central Measures Comparisons
• Mean
– Computation: μ = (∑xi) / N (population); x‾ = (∑xi) / n (sample)
– Interpretation: Center of gravity
– When to use: Data are quantitative and frequency distribution is roughly symmetric
• Median
– Computation: Arrange data in ascending order and divide the data set into half
– Interpretation: Divides into bottom 50% and top 50%
– When to use: Data are quantitative and frequency distribution is skewed
• Mode
– Computation: Tally data to determine most frequent observation
– Interpretation: Most frequent observation
– When to use: Data are categorical or the most frequent observation is the desired measure of central tendency
Measuring Center: Example 1
• Use the data below to calculate the mean and median of the
commuting times (in minutes) of 20 randomly selected New
York workers.
Example, page 53
10 30 5 25 40 20 10 15 30 20
15 20 85 15 65 15 60 60 40 45
$\bar{x} = \frac{10 + 30 + 5 + 25 + \cdots + 40 + 45}{20} = 31.25$ minutes
Stemplot:
0 | 5
1 | 005555
2 | 0005
3 | 00
4 | 005
5 |
6 | 005
7 |
8 | 5
Key: 4|5 represents a New York worker who reported a 45-minute travel time to work.
$M = \frac{20 + 25}{2} = 22.5$ minutes
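The two calculations in this example can be reproduced by hand in a few lines. This is a minimal sketch; the helper names `mean_of` and `median_of` are illustrative and do not come from the slides.

```python
def mean_of(values):
    """Arithmetic mean: sum of the observations divided by n."""
    return sum(values) / len(values)

def median_of(values):
    """Middle value of the ordered list; average the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Commuting times (minutes) of the 20 New York workers
times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
         15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

print(mean_of(times))    # 31.25
print(median_of(times))  # 22.5
```

The mean (31.25) sits well above the median (22.5) because the long right tail, including the 85-minute commute, pulls the mean toward it.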
Example 2
Which of the following measures of central
tendency are resistant?
1. Mean
Not resistant
2. Median
Resistant
3. Mode
Resistant
Example 3
Given the following set of data:
70, 28, 56, 63, 56, 35, 51, 50,
48, 58, 46, 46, 48, 62, 39, 69,
53, 45, 56, 53, 52, 60, 32, 70,
66, 38, 44, 33, 48, 73, 60, 54,
36, 45, 51, 55, 49, 51, 44, 52
What is the mean? 51.125
What is the median? 51
What is the mode? 48, 51, 56
What is the shape of the distribution? Symmetric (tri-modal)
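These answers can be checked with Python's standard `statistics` module. A sketch, assuming Python 3.8 or later (needed for `multimode`); the variable names are illustrative.

```python
from statistics import mean, median, multimode

data = [70, 28, 56, 63, 56, 35, 51, 50,
        48, 58, 46, 46, 48, 62, 39, 69,
        53, 45, 56, 53, 52, 60, 32, 70,
        66, 38, 44, 33, 48, 73, 60, 54,
        36, 45, 51, 55, 49, 51, 44, 52]

data_mean = mean(data)
data_median = median(data)
# multimode returns every value tied for most frequent; sort for readability
data_modes = sorted(multimode(data))

print(data_mean, data_median, data_modes)
```

Three values tie for most frequent (each appears three times), which is why the distribution is described as tri-modal.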
Example 4
Given the following types of data and sample sizes, list
the measure of central tendency you would use and
explain why.
Variable              Sample of 50    Sample of 200
Hair color            mode            mode
Height                mean            mean
Weight                mean            mean
Parent’s Income       median          median
Number of Siblings    mean            mean
Age                   mean            mean
Does sample size affect your decision?
Not in this case, but a larger sample size
might allow us to use the mean instead of the median
Day 1 Summary and Homework
• Summary
– Four characteristics must be used to describe
distributions (from histograms or similar charts)
• Shape (uniform, symmetric, bi-modal, etc)
• Outliers (rule next lesson)
• Center (mean, median, mode measures)
• Spread (IQR, variance – next lesson)
– Median is resistant to outliers; mean is not!
– Use Mean for symmetric data
– Use Median for skewed data (or data with outliers)
– Use Mode for categorical data
• Homework
– pg 70-74; prob 79, 81, 83, 87, 89
5-Minute Check on Lesson 1-3a
1. What are the two quantitative measures of center?
Mean and median
2. When do we use one versus the other?
Mean for symmetric data and median for skewed
3. Which one is resistant to outliers?
Median
4. Which measure of center is used for qualitative data?
Mode
5. Find the mean, median and mode of the following data
set:
7, 15, 4, 8, 16, 17, 2, 5, 11, 8, 12, 6
Mean: 9.25
Median: 8
Mode: 8
Measures of Spread
Variability is the key to Statistics. Without
variability, there would be no need for the subject.
When describing data, never rely on center alone.
Measures of Spread:
Range - {rarely used ... why?}
Quartiles - InterQuartile Range {IQR=Q3-Q1}
Variance and Standard Deviation {var and sx}
Like Measures of Center, you must choose the most
appropriate measure of spread.
Standard Deviation
Another common measure of spread is the Standard
Deviation: a measure of the “average” deviation of
all observations from the mean.
To calculate Standard Deviation:
1. Calculate the mean.
2. Determine each observation’s deviation (x - xbar).
3. Square each deviation.
4. “Average” the squared deviations by dividing the total
squared deviation by (n-1). This quantity is the Variance.
5. Take the square root of the result to determine the Standard
Deviation.
Standard Deviation Properties
s measures spread about the mean and should be
used only when the mean is used as the measure of
center
s = 0 only when there is no spread/variability. This
happens only when all observations have the same
value. Otherwise, s > 0. As the observations
become more spread out about their mean, s gets
larger
s, like the mean x-bar, is not resistant. A few
outliers can make s very large
Standard Deviation
Variance:
$\text{var} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$
Standard Deviation:
$s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Example 1.16 (p.85 of TPS 3E): Metabolic Rates
1792 1666 1362 1614 1460 1867 1439
Standard Deviation
Metabolic Rates: mean=1600
x         (x - x‾)    (x - x‾)²
1792      192         36864
1666      66          4356
1362      -238        56644
1614      14          196
1460      -140        19600
1867      267         71289
1439      -161        25921
Totals:   0           214870
Total Squared Deviation: 214870
Variance: var = 214870/6 = 35811.67
Standard Deviation: s = √35811.67 = 189.24 cal
What does this value, s, mean?
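The metabolic-rate calculation can be retraced step by step in code. This is a sketch that mirrors the hand calculation; the variable names are illustrative.

```python
import math

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]  # metabolic rates (cal)

n = len(rates)
x_bar = sum(rates) / n                    # the mean
deviations = [x - x_bar for x in rates]   # each observation's deviation from the mean
squared = [d ** 2 for d in deviations]    # squared deviations
variance = sum(squared) / (n - 1)         # total squared deviation divided by n - 1
s = math.sqrt(variance)                   # standard deviation

print(x_bar, sum(squared))                # mean and total squared deviation
print(round(variance, 2), round(s, 2))    # variance and standard deviation
```

One common reading of s: the metabolic rates in this sample typically differ from the mean of 1600 cal by about 189 cal. Note that the deviations themselves always sum to 0, which is why they are squared before averaging.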
The Interquartile Range (IQR)
– A measure of center alone can be misleading.
– A useful numerical description of a distribution requires
both a measure of center and a measure of spread.
How to Calculate the Quartiles and the Interquartile Range
To calculate the quartiles:
1)Arrange the observations in increasing order and locate the
median M.
2)The first quartile Q1 is the median of the observations
located to the left of the median in the ordered list.
3)The third quartile Q3 is the median of the observations
located to the right of the median in the ordered list.
The interquartile range (IQR) is defined as:
IQR = Q3 – Q1
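The quartile procedure above (median of each half) can be sketched as a small function. The name `quartiles` is illustrative; the data are the 20 New York commuting times from the earlier example. Library routines such as `statistics.quantiles` use interpolation conventions that can give slightly different quartiles, so this sketch implements the median-of-halves method directly.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, M, Q3) by the median-of-halves method.

    Needs at least two observations. When n is odd, the median itself
    is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # observations to the left of the median
    upper = ordered[-half:]   # observations to the right of the median
    return median(lower), median(ordered), median(upper)

times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
         15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

q1, m, q3 = quartiles(times)
iqr = q3 - q1
print(q1, m, q3, iqr)
```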
Quartiles
Quartiles Q1 and Q3 represent the 25th and 75th
percentiles.
To find them, order data from min to max.
Determine the median - average if necessary.
The first quartile is the middle of the ‘bottom half’.
The third quartile is the middle of the ‘top half’.
19 22 23 23 23 26 26 27 28 29 30 31 32
Q1=23, med=26, Q3=29.5
45 68 74 75 76 82 82 91 93 98
Q1=74, med=79, Q3=91
Example 1
Which of the following measures of spread are
resistant?
1. Range
Not Resistant
2. Variance
Not Resistant
3. Standard Deviation
Not Resistant
4. Interquartile Range (IQR)
Resistant
Example 2
• Travel times to work for 20 randomly selected New
Yorkers
Example, page 57
10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
Sorted:
5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85
Q1 = 15
M = 22.5
Q3= 42.5
IQR = Q3 – Q1
= 42.5 – 15
= 27.5 minutes
Interpretation: The range of the middle half of travel times for
the New Yorkers in the sample is 27.5 minutes.
Determining Outliers
“1.5 × IQR Rule”
InterQuartile Range “IQR”: Distance between Q1 and
Q3. Resistant measure of spread...only measures
middle 50% of data.
IQR = Q3 - Q1 {width of the “box” in a boxplot}
1.5 IQR Rule: If an observation falls more than 1.5
IQRs above Q3 or below Q1, it is an outlier.
Why 1.5? According to John Tukey, 1 IQR seemed like too little and 2 IQRs
seemed like too much...
Outliers: 1.5 × IQR Rule
To determine outliers:
1. Find 5 Number Summary
2. Determine IQR
3. Multiply 1.5 × IQR
4. Set up “fences”
A. Lower Fence: Q1 - (1.5 × IQR)
B. Upper Fence: Q3 + (1.5 × IQR)
5. Observations “outside” the fences are outliers.
Example 2 part 2
• In addition to serving as a measure of spread, the interquartile
range (IQR) is used as part of a rule of thumb for identifying
outliers.
Definition:
The 1.5 x IQR Rule for Outliers
Call an observation an outlier if it falls more than 1.5 x IQR above the
third quartile or below the first quartile.
Example, page 57
In the New York travel time data, we found Q1=15
minutes, Q3=42.5 minutes, and IQR=27.5 minutes.
For these data, 1.5 x IQR = 1.5(27.5) = 41.25
Q1 - 1.5 x IQR = 15 – 41.25 = -26.25
Q3 + 1.5 x IQR = 42.5 + 41.25 = 83.75
Any travel time shorter than -26.25 minutes or longer than
83.75 minutes is considered an outlier.
0 | 5
1 | 005555
2 | 0005
3 | 00
4 | 005
5 |
6 | 005
7 |
8 | 5
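The fence arithmetic for the travel-time data can be checked in code. A minimal sketch: the function name `fences` is illustrative, and the quartiles are taken as given from the example (Q1 = 15, Q3 = 42.5) rather than recomputed.

```python
def fences(q1, q3):
    """Lower and upper fences from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
         15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

lower, upper = fences(15, 42.5)
# An observation is an outlier if it falls outside the fences
outliers = [t for t in times if t < lower or t > upper]

print(lower, upper)  # -26.25 83.75
print(outliers)      # [85]
```

Only the 85-minute commute falls outside the fences. The negative lower fence simply means no travel time can be a low outlier in this sample.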
5-Number Summary, Boxplots
The 5 Number Summary provides a reasonably
complete description of the center and spread of
a distribution
MIN, Q1, MED, Q3, MAX
We can visualize the 5 Number Summary with a
boxplot.
[Boxplot of Quiz Scores on an axis from 45 to 100: min=45 (Outlier?), Q1=74, med=79, Q3=91, max=98]
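A sketch of computing the five-number summary and settling the "Outlier?" question for the lowest quiz score. It assumes the boxplot is built from the ten quiz scores 45, 68, 74, 75, 76, 82, 82, 91, 93, 98, whose summary values match the labels; the function name is illustrative.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves quartiles."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[-half:])
    return ordered[0], q1, median(ordered), q3, ordered[-1]

quiz = [75, 76, 82, 93, 45, 68, 74, 82, 91, 98]  # assumed quiz scores

minimum, q1, med, q3, maximum = five_number_summary(quiz)
lower_fence = q1 - 1.5 * (q3 - q1)
min_is_outlier = minimum < lower_fence

print(minimum, q1, med, q3, maximum)
print(lower_fence, min_is_outlier)
```

With Q1 = 74 and IQR = 17, the lower fence is 48.5, so the score of 45 is flagged as an outlier by the 1.5 x IQR rule.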
Drawing a Boxplot
The five-number summary divides the distribution
roughly into quarters. This leads to a new way to
display quantitative data, the boxplot.
• Draw and label a number line that includes the range
of the distribution.
• Draw a central box from Q1 to Q3.
• Note the median M inside the box.
• Extend lines (whiskers) from the box out to the
minimum and maximum values that are not outliers
Example 2 part 3
• Boxplot
10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
Sorted:
5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85
Min=5
Q1 = 15
M = 22.5
Q3= 42.5
Max=85
Recall, the maximum (85) is an outlier by the
1.5 x IQR rule
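Because the whiskers stop at the most extreme values that are not outliers, the whisker ends for this boxplot can be found as follows. A sketch, with Q1 and Q3 taken as given from the example and illustrative variable names.

```python
times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
         15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

q1, q3 = 15, 42.5                 # quartiles from the example
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers reach the smallest and largest observations inside the fences
inside = [t for t in times if lower_fence <= t <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything outside the fences is plotted as an individual point
outliers = sorted(t for t in times if t < lower_fence or t > upper_fence)

print(whisker_low, whisker_high)  # 5 65
print(outliers)                   # [85]
```

So the box runs from 15 to 42.5, the whiskers extend to 5 and 65, and 85 is drawn as a separate point.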
Example 3
Consumer Reports did a study of ice cream bars (sigh, only
vanilla flavored) in their August 1989 issue. Twenty-seven bars
having a taste-test rating of at least “fair” were listed, and
calories per bar was included. Calories vary quite a bit partly
because bars are not of uniform size. Just how many calories
should an ice cream bar contain?
342 377 319 353 295 234 294 286 377 182 310
439 111 201 182 197 209 147 190 151 131 151
Construct a boxplot for the data above.
Example 3 - Answer
Min = 111, Q1 = 182, Q2 = 221.5, Q3 = 319, Max = 439
IQR = 137, Range = 328
LF = -23.5, UF = 524.5
[Boxplot of Calories on an axis from 100 to 500]
Example 4
The weights of 20 randomly selected juniors at MSHS are
recorded below:
121 126 130 132 143 137 141 144 148 205
125 128 131 133 135 139 141 147 153 213
a) Construct a boxplot of the data
b) Determine if there are any mild or extreme outliers
c) Comment on the distribution
Example 4 - Answer
Min = 121, Q1 = 130.5, Q2 = 138, Q3 = 145.5, Max = 213
IQR = 15, Range = 92
LF = 108, UF = 168
Mean = 143.6, StDev = 23.91
[Boxplot of Weight (lbs) on an axis from 100 to 220, with two points marked * as Extreme Outliers ( > 3 IQR from Q3)]
Shape: somewhat symmetric
Center: Median = 138
Outliers: 2 extreme outliers
Spread: IQR = 15
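A sketch of the mild versus extreme classification for the weights. The slide defines extreme outliers as more than 3 IQR beyond a quartile. Treating "mild" as more than 1.5 IQR but at most 3 IQR beyond a quartile is an assumption here (a common convention), and the function name `classify` is illustrative.

```python
weights = [121, 126, 130, 132, 143, 137, 141, 144, 148, 205,
           125, 128, 131, 133, 135, 139, 141, 147, 153, 213]

q1, q3 = 130.5, 145.5   # quartiles from the answer slide
iqr = q3 - q1

def classify(x):
    """Label x as 'extreme', 'mild', or 'not an outlier'."""
    # Distance beyond the box; zero or negative means x lies inside the box
    distance = max(q1 - x, x - q3)
    if distance > 3 * iqr:
        return "extreme"
    if distance > 1.5 * iqr:   # assumed definition of a mild outlier
        return "mild"
    return "not an outlier"

flagged = {w: classify(w) for w in weights if classify(w) != "not an outlier"}
print(flagged)  # {205: 'extreme', 213: 'extreme'}
```

Both 205 and 213 lie more than 45 lbs (3 x IQR) above Q3, so both are extreme; no weight falls in the mild range.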
Example 5
Consider the following test scores for a small class:
75 76 82 93 45 68 74 82 91 98
Plot the data and describe the SOCS:
Shape? skewed left
Outliers? maybe 45
Center? M = 79
Spread? IQR = 91-74=17
Why use the median to describe the “center”? data skewed
Why use the IQR to describe the “spread”? data skewed
Choosing Measures of Center & Spread
• We now have a choice between two descriptions for
center and spread
– Mean and Standard Deviation
– Median and Interquartile Range
Choosing Measures of Center and Spread
•The median and IQR are usually better than the mean and
standard deviation for describing a skewed distribution or a
distribution with outliers.
•Use mean and standard deviation only for reasonably
symmetric distributions that don’t have outliers.
•NOTE: Numerical summaries do not fully describe the
shape of a distribution. ALWAYS PLOT YOUR DATA!
Using the TI-83
• Enter the test data into List, L1
– STAT, EDIT enter data into L1
• Calculate 5 Number Summary
– Hit STAT, go over to CALC, select 1-Var Stats, and hit 2nd 1 (L1)
• Use 2nd Y= (STAT PLOT) to graph the box plot
– Turn plot1 ON
– Select BOX PLOT (4th option, first in second row)
– Xlist: L1
– Freq: 1
– Hit ZOOM 9:ZoomStat to graph the box plot
• Copy graph with appropriate labels and titles
Day 2 Summary and Homework
• Summary
– Sample variance is found by dividing by (n – 1) so that it is an
unbiased estimator of the population variance (needed because we
estimate the population mean, μ, by using the sample mean, x-bar)
– The larger the standard deviation, the more dispersion the
distribution has
– Boxplots can be used to check outliers and distributions
– Use comparative boxplots for two datasets
– Identifying a distribution from boxplots or histograms is
subjective!
– Use standard deviation with mean and IQR with median
• Homework
– pg 82: prob 33; pg 89 probs 40, 41;
pg 97 probs 45, 46