Transcript Chapter 4
Displaying and Summarizing
Quantitative Data
90 min
Example
Here are the scores for the first test.
A histogram is a graphical display of the frequency
distribution of a quantitative variable, shown as
adjacent rectangles (like a bar graph).
Here is a histogram of
earthquake magnitudes
First, slice up the entire span of values covered by
the quantitative variable into equal-width piles
called bins.
Count the number of values in each bin; the
counts in each bin will be the heights of the bars.
The bins and the corresponding counts give the
distribution of the quantitative variable.
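The binning and counting steps above can be sketched in a few lines of Python. This is only an illustration: the function name `histogram_counts`, the choice of half-open bins, and the scores are assumptions, not the data from the slides.

```python
import math

def histogram_counts(values, bin_width, start=None):
    """Count how many values fall in each equal-width bin.

    Each bin is [left, left + bin_width); `start` is the left edge
    of the first bin (assumed to be a multiple of bin_width by default).
    """
    if start is None:
        start = math.floor(min(values) / bin_width) * bin_width
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    return [(start + i * bin_width, start + (i + 1) * bin_width, c)
            for i, c in enumerate(counts)]

# Hypothetical test scores (not the scores from the slides)
scores = [52, 67, 71, 74, 78, 81, 83, 85, 88, 94]
for left, right, count in histogram_counts(scores, bin_width=10):
    # One '#' per value, so the bar length equals the bin count.
    print(f"{left:>3}-{right:<3} {'#' * count}")
```

The printed bars are a rough text version of a histogram: the bins give the horizontal axis and the counts give the bar heights.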
A relative frequency histogram displays the
percentage of cases in each bin instead of the count.
◦ In this way, relative
frequency histograms
are faithful to the
area principle.
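As a small sketch of the idea, bin counts can be converted to percentages by dividing each count by the total number of cases. The counts below are made up for illustration.

```python
def relative_frequencies(counts):
    """Convert bin counts to percentages of all cases."""
    total = sum(counts)
    return [100 * c / total for c in counts]

counts = [1, 1, 3, 4, 1]  # hypothetical bin counts
percents = relative_frequencies(counts)
print(percents)       # [10.0, 10.0, 30.0, 40.0, 10.0]
print(sum(percents))  # 100.0
```

The shape of the histogram is unchanged; only the vertical scale switches from counts to percentages.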
Here is a relative
frequency histogram of
earthquake magnitudes:
Stem-and-leaf displays or stemplots show
the distribution of a quantitative variable, like
histograms do, while preserving the
individual values.
Stem-and-leaf displays contain all the
information found in a histogram and, when
carefully drawn, satisfy the area principle and
show the distribution.
Solution
Compare the histogram and stem-and-leaf display
for the pulse rates of 24 women at a health clinic.
Which graphical display do you prefer?
First, cut each data value into leading digits
(“stems”) and trailing digits (“leaves”).
Use the stems to label the bins.
Use only one digit for each leaf—either round
or truncate the data values to one decimal
place after the stem.
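The steps above can be sketched in Python for two-digit or three-digit whole numbers. The function name `stem_and_leaf` and the pulse rates are hypothetical; the sketch assumes non-negative integers, with the last digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Build a stem-and-leaf display for non-negative integers.

    The stem is everything but the last digit; the leaf is the last digit.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    for stem in range(min(leaves), max(leaves) + 1):
        # Keep empty stems so that gaps in the data stay visible.
        leaf_text = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem} | {leaf_text}")
    return lines

# Hypothetical pulse rates in beats per minute (not the clinic data)
pulses = [56, 60, 64, 64, 68, 72, 72, 76, 80, 88]
print("\n".join(stem_and_leaf(pulses)))
```

Because each leaf takes the same width, the length of each row plays the role of a histogram bar while the individual values stay readable.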
A dotplot is a simple
display. It just places a
dot along an axis for
each case in the data.
The dotplot to the right
shows Kentucky Derby
winning times, plotting
each race as its own dot.
You might see a dotplot
displayed horizontally or
vertically.
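A rough text version of a dotplot can be produced by printing one dot per case next to each distinct value. This is a minimal sketch with made-up prices; the helper name `text_dotplot` is an assumption, and a real dotplot would place the dots along a scaled axis.

```python
from collections import Counter

def text_dotplot(values):
    """Return one row per distinct value, with one dot per case."""
    counts = Counter(values)
    return [f"{value:>4} | {'.' * counts[value]}" for value in sorted(counts)]

# Hypothetical prices in dollars (not the quotes from the example)
prices = [59, 65, 65, 70, 70, 70, 85]
print("\n".join(text_dotplot(prices)))
```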
Example
John wanted to add a new DVD player to his home
theater system. He used the Internet to shop and went
to pricewatch.com. There he found 16 quotes on
different brands and styles of DVD players. Construct a
dotplot for these data.
Solution
1. Humps
2. Symmetry
3. Unusual features
Does the histogram have a single, central
hump or several separated humps?
◦ Humps in a histogram are called modes.
◦ A histogram with one main peak is called unimodal.
◦ Histograms with two peaks are bimodal.
◦ Histograms with three or more peaks are called
multimodal.
A bimodal histogram has two apparent peaks:
[Figure: histogram of Diastolic Blood Pressure]
A histogram that doesn’t appear to have
any mode and in which all the bars are
approximately the same height is called
uniform:
[Figure: histogram of Proportion of Wins]
Is the histogram symmetric?
◦ If you can fold the histogram along a vertical line
through the middle and have the edges match
pretty closely, the histogram is symmetric.
◦ The (usually) thinner ends of a distribution are
called the tails. If one tail stretches out farther
than the other, the histogram is said to be
skewed to the side of the longer tail.
[Figures: a histogram that is skewed left and a histogram that is skewed right]
Do any unusual features stick out?
◦ Sometimes it is the unusual features that tell
us something interesting or exciting about the
data.
◦ You should always mention any stragglers, or
outliers, that stand off away from the body of
the distribution.
◦ Are there any gaps in the distribution? If so, we
might have data from more than one group.
The following histogram has outliers—there
are three cities in the leftmost bar:
The figure on the next slide displays a relative-frequency
histogram for the heights of the 3264
female students who attend a midwestern college.
Also included is a smooth curve that approximates
the overall shape of the distribution. Both the
histogram and the smooth curve show that this
distribution of heights is bell shaped, but the
smooth curve makes seeing the shape a little
easier.
Common distribution shapes
Example
The relative-frequency histogram for household size in
the U.S. shown in the figure is based on data contained
in Current Population Reports, a publication of the U.S.
Census Bureau. Identify the distribution shape for sizes
of U.S. households.
Solution: Skewed right
The mode of a data set is the data value with the
greatest frequency.
Find the frequency of each value in the data set.
If no value occurs more than once, then the data
set has no mode.
Otherwise, any value that occurs with the greatest
frequency is a mode of the data set.
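The procedure above can be sketched as follows. The function name `modes` is an assumption, and the sketch expects a non-empty data set.

```python
from collections import Counter

def modes(data):
    """Return all modes, or an empty list if no value occurs more than once."""
    counts = Counter(data)            # frequency of each value
    top = max(counts.values())        # greatest frequency
    if top == 1:
        return []                     # no value repeats, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3, 3]))  # [2, 3]
print(modes([1, 2, 3]))        # []
```

Note that a data set can have more than one mode when several values tie for the greatest frequency.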
The median is the value with exactly half the
data values below it and half above it.
◦ It is the middle data
value (once the data
values have been
ordered) that divides
the histogram into
two equal areas.
◦ It has the same
units as the data.
Arrange the data in increasing order.
If the number of observations is odd, then the median is
the observation exactly in the middle of the ordered list.
If the number of observations is even, then the median is
the average of the two middle observations in the ordered list.
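The odd and even cases above can be written as a short function. This is a minimal sketch assuming a non-empty list of numbers; the name `median` is chosen for illustration.

```python
def median(data):
    """Median by the ordered-list rule: middle value, or mean of the two middle values."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd: the exact middle
    return (ordered[mid - 1] + ordered[mid]) / 2   # even: average the two middle values

print(median([3, 1, 2]))     # 2
print(median([4, 1, 3, 2]))  # 2.5
```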
Mean = average of a data set
$$\bar{y} = \frac{\text{Total}}{n} = \frac{\sum y}{n}$$
The formula says that to find the mean,
we add up the numbers and divide by n.
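As a quick check of the formula, here is the same calculation in Python on made-up heights (the values are hypothetical).

```python
def mean(values):
    """Add up the numbers and divide by how many there are."""
    return sum(values) / len(values)

heights = [60, 62, 65, 69]  # hypothetical heights in inches
print(mean(heights))        # 256 / 4 = 64.0
```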
The mean feels like the center because it is
the point where the histogram balances:
Because the median considers only the order of
values, it’s resistant to values that are extraordinarily
large or small; it simply notes that they are one of
the “big ones” or “small ones” and ignores their
distance from center.
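A small experiment illustrates this resistance. The numbers are invented: the largest value is replaced by an extreme one, and the mean and median are compared before and after.

```python
import statistics

values = [30, 32, 35, 38, 40]        # hypothetical data
with_outlier = values[:-1] + [400]   # replace the largest value with an extreme one

print(statistics.mean(values), statistics.median(values))              # 35 35
print(statistics.mean(with_outlier), statistics.median(with_outlier))  # 107 35
```

The median stays at 35 because the extreme value is still just one of the "big ones", while the mean is pulled far toward it.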
To choose between the mean and median, start by
looking at the data.
◦ If the histogram is symmetric and there are no outliers,
use the mean.
◦ However, if the histogram is skewed or has outliers,
you are better off with the median.
Example
Find the mean, median, and mode for each team’s heights.
Always report a measure of spread along
with a measure of center when describing a
distribution numerically.
The range of the data is the difference
between the maximum and minimum values:
Range = max – min
A disadvantage of the range is that a single
extreme value can make it very large and,
thus, not representative of the data overall.
Team I has range 6 inches, Team II has range 17 inches.
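The sensitivity of the range to a single extreme value can be seen in a short sketch. The heights below are made up and are not the team data from the example.

```python
def data_range(values):
    """Range = max - min."""
    return max(values) - min(values)

heights = [61, 63, 64, 66, 68]       # hypothetical heights in inches
with_extreme = heights[:-1] + [85]   # one unusually tall value

print(data_range(heights))       # 7
print(data_range(with_extreme))  # 24
```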
Quartiles divide the data into four equal sections.
◦ One quarter of the data lies below the lower quartile, Q1.
◦ One quarter of the data lies above the upper quartile, Q3.
The difference between the quartiles is the
interquartile range (IQR), so
IQR = Q3 – Q1
The interquartile range (IQR) lets us ignore
extreme data values and concentrate on the
middle of the data.
The lower and upper
quartiles are the 25th
and 75th percentiles of
the data, so the IQR
contains the middle 50%
of the values of the
distribution.
Arrange the data in increasing order and determine the
median.
The second quartile is the median of the entire data set.
The first quartile is the median of the part of the entire data
set that lies at or below the median of the entire data set.
The third quartile is the median of the part of the entire data
set that lies at or above the median of the entire data set.
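One way to sketch this procedure in Python is shown below. It interprets "at or below" and "at or above" the median by position in the ordered list, so with an odd number of observations the middle value is included in both halves. That interpretation and the function name `quartiles` are assumptions; statistical software may use other quartile conventions and give slightly different answers.

```python
import statistics

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    ordered = sorted(data)
    n = len(ordered)
    lower = ordered[: (n + 1) // 2]   # values at or below the median position
    upper = ordered[n // 2 :]         # values at or above the median position
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

q1, q2, q3 = quartiles([1, 3, 5, 7, 9, 11, 13])  # hypothetical data
print(q1, q2, q3)   # 4.0 7 10.0
print(q3 - q1)      # IQR = 6.0
```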
The 5-number summary of a distribution
reports its median, quartiles, and extremes
(maximum and minimum).
The 5-number summary for the recent tsunami
earthquake magnitudes looks like this:
A more powerful measure of spread than the
IQR is the standard deviation, which takes
into account how far each data value is from
the mean.
A deviation is the distance that a data value is
from the mean.
◦ Since adding all deviations together would total
zero, we square each deviation and find an average
of sorts for the deviations.
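A short numerical check shows why the deviations are squared. The data values are hypothetical.

```python
data = [2, 4, 4, 6, 9]                    # hypothetical values
mean = sum(data) / len(data)              # 25 / 5 = 5.0
deviations = [y - mean for y in data]     # [-3.0, -1.0, -1.0, 1.0, 4.0]
squared = [d ** 2 for d in deviations]    # [9.0, 1.0, 1.0, 1.0, 16.0]

print(sum(deviations))  # 0.0: positive and negative deviations cancel
print(sum(squared))     # 28.0: squaring keeps every deviation positive
```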
The variance, notated by $s^2$, is found by
summing the squared deviations and (almost)
averaging them:
$$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$$
The variance will play a role later in our study,
but it is problematic as a measure of spread—
it is measured in squared units!
The standard deviation, s, is just the square
root of the variance and is measured in the
same units as the original data.
$$s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$$
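The variance and standard deviation can be computed directly from their definitions, as in the sketch below. The function names and data are assumptions for illustration; the result is compared with Python's `statistics.stdev`, which also divides by n - 1.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - 1)

def sample_std(values):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(sample_variance(values))

data = [2, 4, 4, 6, 9]          # hypothetical values
print(sample_variance(data))    # 28 / 4 = 7.0
print(sample_std(data))         # square root of 7, about 2.65
print(math.isclose(sample_std(data), statistics.stdev(data)))  # True
```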
Example
Data set II has greater variation, and hence a
greater standard deviation; the visual clearly
shows that it is more spread out.
Data Set I
Data Set II
Since Statistics is about variation, spread is an
important fundamental concept of Statistics.
Measures of spread help us talk about what we
don’t know.
When the data values are tightly clustered
around the center of the distribution, the IQR
and standard deviation will be small.
When the data values are scattered far from
the center, the IQR and standard deviation will
be large.
Choose a bin width appropriate to the data.
◦ Changing the bin width changes the appearance of
the histogram:
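The effect can be seen even without a plot by counting the same values with two different bin widths. The data and the helper name `bin_counts` are hypothetical; empty bins are simply omitted from the result.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Map each bin's left edge to the number of values in [left, left + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

values = [1, 2, 2, 3, 11, 12, 13, 13, 14, 22]  # hypothetical data
print(bin_counts(values, 5))    # {0: 4, 10: 5, 20: 1}
print(bin_counts(values, 20))   # {0: 9, 20: 1}
```

With a width of 5 the two clusters and the gaps between them are visible; with a width of 20 nearly everything lands in one bin and that structure disappears.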
We’ve learned how to make a picture for quantitative data
to help us see the story the data have to tell.
We can display the distribution of quantitative data with a
histogram, stem-and-leaf display, or dotplot.
We’ve learned how to summarize distributions of
quantitative variables numerically.
◦ Measures of center for a distribution include the
median and mean.
◦ Measures of spread include the range, IQR, and
standard deviation.
◦ Use the median and IQR if the distribution is skewed.
◦ Use the mean and standard deviation if the distribution
is symmetric.
Example
A pediatrician tested the cholesterol levels of several
young patients and was alarmed to find that many had
levels higher than 200 mg per 100 mL. The following table
presents the readings of 20 patients with high levels.
Construct a stem-and-leaf diagram for these data by using
a. one line per stem.
b. two lines per stem.
Solution
Example
A firm employed a few senior consultants, who made
between $800 and $1050 per week; a few junior
consultants, who made between $400 and $450 per
week; and several clerical workers, who made $300 per
week. The firm required more employees during the first
half of the summer than the second half. The tables list
typical weekly earnings for the two halves of the summer.
Data Set I
Data Set II
Solution
Interpretation: The employees who worked in the first
half of the summer earned more, on average (a mean
salary of $483.85), than those who worked in the second
half (a mean salary of $474.00).
Comparing Distribution Ch.5
Pages 78-85:
Problems 7, 9, 13, 15, 17, 19, 23, 25, 29,
31, 33, 37, 39, 41, 43, 47.