Chapter 4: Quantitative Data


AP Statistics
Chapter 1:
Quantitative Data
Displaying and Describing Numerical Data
Quantitative Data

When dealing with a lot of numbers, we want to use summaries to describe the data. Without summaries of the data, it’s hard to grasp what the data tell us.
The best thing to do is to make a picture…
We can’t use bar charts, relative frequency bar charts, segmented bar charts, or pie charts for quantitative data, since those displays are for categorical variables.
Histograms and Your Calculator

Use the following data to create a histogram of your own utilizing
your graphing calculator. Then describe the distribution.
Ages of the US Presidents at the time of their inauguration:
President      Age    President      Age    President      Age
Washington     57     Buchanan       65     Harding        55
J. Adams       61     Lincoln        52     Coolidge       51
Jefferson      57     A. Johnson     56     Hoover         54
Madison        57     Grant          46     FD Roosevelt   51
Monroe         58     Hayes          54     Truman         60
JQ Adams       57     Garfield       49     Eisenhower     61
Jackson        61     Arthur         51     Kennedy        43
Van Buren      54     Cleveland      47     L. Johnson     55
WH Harrison    68     B. Harrison    55     Nixon          56
Tyler          51     Cleveland      55     Ford           61
Polk           49     McKinley       54     Carter         52
Taylor         64     T Roosevelt    42     Reagan         69
Fillmore       50     Taft           51     G. Bush        64
Pierce         48     Wilson         56     Clinton        46
Histograms: Displaying the Distribution


To get a better understanding of the data, we first slice up the entire span of values covered by the quantitative variable into equal-width piles called bins.
The bins and the counts in each bin give the distribution of the quantitative variable.
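As a rough sketch of the binning idea, the snippet below slices the 42 inauguration ages listed earlier into equal-width bins and counts each bin. The bin width of 5 and the starting value of 40 are assumptions chosen for illustration, and the helper name `bin_counts` is hypothetical.

```python
from collections import Counter

# Inauguration ages as listed in the presidents table (42 values).
AGES = [
    57, 61, 57, 57, 58, 57, 61, 54, 68, 51, 49, 64, 50, 48,
    65, 52, 56, 46, 54, 49, 51, 47, 55, 55, 54, 42, 51, 56,
    55, 51, 54, 51, 60, 61, 43, 55, 56, 61, 52, 69, 64, 46,
]


def bin_counts(values, start, width):
    """Count values in equal-width bins [start, start + width), and so on.

    Assumes every value is at least `start` (hypothetical helper).
    """
    counts = Counter((v - start) // width for v in values)
    n_bins = max(counts) + 1
    return [
        (start + i * width, start + (i + 1) * width, counts.get(i, 0))
        for i in range(n_bins)
    ]


# Print a text histogram: one '#' per president in each bin.
for low, high, count in bin_counts(AGES, start=40, width=5):
    print(f"{low}-{high - 1}: {'#' * count} ({count})")
```

With these 42 ages the 50-54 bin holds 12 values; the frequency table later in the chapter is based on 43 values.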
Histograms: Displaying the Distribution of
Earthquake Magnitudes




A histogram plots the
bin counts as the
heights of bars (like a
bar chart). It displays
the distribution at a
glance.
When constructing
histograms for discrete
data, center each bar
above its corresponding
value.
Make sure you label axes and scales.
The bars in a histogram should touch (unlike bar charts
for categorical data).
Example: Hair Length of Students
Consider the hair length of students in this class. Would this be continuous or discrete? Why?
Since this data is continuous, choose the most “natural” categories to place the data; for example, we could use 1”, 2”, 3”, etc. (note: the “break” points are called boundaries). In this case, we will define our own categories, called classes. There is no exact way to create classes, but classes should always be the same length and never overlap.
Example: Hair Length of Students
What if an observation falls exactly on a boundary? By convention, we will put boundary values into the upper class. For example, suppose you decided on class lengths of 3” starting at 0. Then, the boundaries would be 0”, 3”, 6”, etc. The class from 0-3 traditionally is 0 ≤ x < 3, and an observation of 3” would fall into the 3 ≤ x < 6 category.
If a frequency distribution has non-bounded classes, such as “12 or more,” a histogram cannot be made.
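The boundary convention can be sketched in a few lines. This is only an illustration, using the 3” classes starting at 0 from the example; the function name `class_of` is hypothetical.

```python
def class_of(value, start=0, width=3):
    """Return the (lower, upper) boundaries of the class containing value.

    Classes are half-open, lower <= x < upper, so a value that sits
    exactly on a boundary is placed in the upper class.
    """
    if value < start:
        raise ValueError("value lies below the first class")
    index = int((value - start) // width)
    lower = start + index * width
    return lower, lower + width


print(class_of(2.9))  # (0, 3)
print(class_of(3))    # (3, 6): the boundary value goes to the upper class
```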
Stem-and-Leaf Displays


Stem-and-leaf displays show the distribution of
a quantitative variable, like histograms do, while
preserving the individual values.
Stem-and-leaf displays contain all the
information found in a histogram and, when
carefully drawn, satisfy the area principle and
show the distribution.
Constructing a Stem-and-Leaf Display





First, cut each data value into leading digits
(“stems”) and trailing digits (“leaves”).
Use the stems to label the bins.
Use only one digit for each leaf—either round or
truncate the data values to one decimal place
after the stem.
You MUST provide a key with units.
Do not use commas between the leaves.
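One possible way to carry out these steps in code, assuming whole-number data where the tens digit is the stem and the ones digit is the leaf. The presidents' inauguration ages from earlier in the chapter serve as data; `stem_and_leaf` is a hypothetical helper name.

```python
from collections import defaultdict

# Inauguration ages as listed in the presidents table (42 values).
AGES = [
    57, 61, 57, 57, 58, 57, 61, 54, 68, 51, 49, 64, 50, 48,
    65, 52, 56, 46, 54, 49, 51, 47, 55, 55, 54, 42, 51, 56,
    55, 51, 54, 51, 60, 61, 43, 55, 56, 61, 52, 69, 64, 46,
]


def stem_and_leaf(values):
    """Return display lines: tens digit as stem, ones digit as leaf."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    # Walk every stem from smallest to largest so that gaps stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem} | {digits}")
    return lines


print("\n".join(stem_and_leaf(AGES)))
print("Key: 4 | 2 represents 42 years")
```

Each leaf is a single digit with no commas, and the key states the units, as the rules above require.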
Stem-and-Leaf Example

Compare the histogram and stem-and-leaf display for
the pulse rates of 24 women at a health clinic. Which
graphical display do you prefer?
Key: 8|8 represents 88 bpm

In addition to the standard stemplot, there is also a back-to-back stemplot, which will allow you to compare two distributions.
Test scores for females and males:
Key: 1|9|0 represents a score of 91 for females and a score of 90 for males on this particular test.

Stem-and-leaf plots and dotplots are very good for
displaying small data sets. However, when there are a
large number of observations, frequency distributions
and histograms are a better choice.
Dotplots

A dotplot is a simple display of numerical data. It just places a dot along an axis for each case in the data.
Be sure to label and scale clearly.
The dotplot to the right shows Kentucky Derby winning times, plotting each race as its own dot.
You might see a dotplot displayed vertically or horizontally.
[Figure: Kentucky Derby Winning Times]
[Figure: Debt after Graduating College (in thousands of dollars)]
Describing the Data – Put on Your SOCS

Whenever you describe a distribution, you should always include the following:
• Shape: 1) unimodal, bimodal, multimodal, or uniform; 2) skewed or symmetric
   If the shape is skewed, report the median and IQR.
   If the shape is symmetric, report the mean and standard deviation, and possibly the median and IQR as well.
• Outliers: any stragglers, clusters, gaps, or groups
• Center: mean or median
• Spread: standard deviation or the five-number summary
Humps (Shape)
Does the histogram have a single, central hump or
several separated bumps?


Humps in a histogram are called modes.
A histogram with one main peak is dubbed
unimodal; histograms with two peaks are bimodal;
histograms with three or more peaks are called
multimodal.
[Figures: one histogram labeled “This is unimodal” and one labeled “This is bimodal”]
Humps (Shape)

A histogram that doesn’t appear to have any mode and
in which ALL the bars are approximately the same
height is called uniform:
Symmetry (Shape)
Is the histogram symmetric?

If you can fold the histogram along a vertical line
through the middle and have the edges match pretty
closely, the histogram is symmetric.
Skew…(Shape)


The (usually) thinner ends of a distribution are called the tails. If
one tail stretches out farther than the other, the histogram is said
to be skewed to the side of the longer tail.
In the figure below, the histogram on the left is said to be skewed
left, while the histogram on the right is said to be skewed right.
[Figures: a “Skewed Left” histogram with its tail on the left, and a “Skewed Right” histogram with its tail on the right]
Outliers: Anything Unusual?
Do any unusual features stick out?



Sometimes it’s the unusual features that tell us
something interesting or exciting about the data.
You should always mention any stragglers, or
outliers, that stand off away from the body of the
distribution.
Are there any gaps in the distribution? If so, we
might have data from more than one group.
Outliers-Anything Unusual?

The following histogram has outliers—there are
three cities in the leftmost bar:
Center of a Skewed Distribution -- Median

The median is the value with exactly half the data values
below it and half above it.


It is the middle data value (once the data values have been ordered) that divides the histogram into two equal areas.
It has the same units as the data.
Example

1. Try to find the median for the following data:
1, 2, 4, 6, 7, 8, 9, 5, 6, 8, 10, 12, 13, 11, 24, 29, 31, 32, 35, 40, 45

2. Use your calculator to find the median for the following:
55, 59, 62, 65, 71, 73, 74, 75, 75, 75, 76, 78, 78, 79, 81, 81, 81, 90, 91, 93, 95, 96, 97, 99
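To check your answers, here is a minimal sketch of the median calculation: sort the values, then take the middle one (or average the two middle ones when the count is even). The function is written out by hand for illustration; the variable names are assumptions.

```python
def median(values):
    """Middle value of the sorted data; average the two middle values if n is even."""
    ordered = sorted(values)  # the data must be ordered first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


data1 = [1, 2, 4, 6, 7, 8, 9, 5, 6, 8, 10, 12, 13, 11, 24, 29,
         31, 32, 35, 40, 45]
data2 = [55, 59, 62, 65, 71, 73, 74, 75, 75, 75, 76, 78,
         78, 79, 81, 81, 81, 90, 91, 93, 95, 96, 97, 99]

print(median(data1))  # 10   (21 values, so the 11th sorted value)
print(median(data2))  # 78.0 (24 values, so the average of the 12th and 13th)
```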
The Other Measure of Center -- The Mean



When we have symmetric data, there is an alternative to the median.
If we want to calculate a number, we can average the data.
We use the Greek letter sigma to mean “sum” and write:
$$\bar{y} = \frac{\text{Total}}{n} = \frac{\sum y}{n}$$
The formula says that to find the mean, we add up all the values of the variable and divide by the number of data values, n.
The Other Measure of Center-- The Mean

The mean feels like the center because it is the
point where the histogram balances:
Center: Mean or Median?



Because the MEDIAN considers only the order of values, it is RESISTANT to values that are extraordinarily large or small; it simply notes that they are one of the “big ones” or “small ones” and ignores their distance from the center. In other words, once the values are sorted, the median is always the middle number, no matter how extreme the smallest and largest values are.
To choose between the mean and median, start by looking at the data. If the histogram is symmetric and there are no outliers, use the mean.
However, if the histogram is skewed or has outliers, you are better off with the median.
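A small illustration of resistance, using a short made-up data set and Python's standard `statistics` module: replacing the largest value with an extreme one moves the mean a lot but leaves the median where it was.

```python
from statistics import mean, median

# Illustrative data: identical except for the largest value.
typical = [5, 7, 7, 8, 9, 9, 10]
with_outlier = [5, 7, 7, 8, 9, 9, 100]

for label, data in [("no outlier", typical), ("one outlier", with_outlier)]:
    print(f"{label}: mean = {mean(data):.3f}, median = {median(data)}")
```

The median is 8 in both cases, while the mean is pulled toward the outlier.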
How Spread Out is the Distribution?



Variation matters, and Statistics is about
variation.
Are the values of the distribution tightly
clustered around the center or more spread out?
Always report a measure of spread along with a
measure of center when describing a
distribution numerically.
Spread: Home on the Range



The range of the data is the difference between
the maximum and minimum values:
Range = max – min
A disadvantage of the range is that a single
extreme value can make it very large and, thus,
not be representative of the data overall.
Consider the following examples:
Find the range for the data: 5, 7, 7, 8, 9, 9, 10
The range is 5, since 10 – 5 = 5.
Now find the range for the data: 5, 7, 7, 8, 9, 9, 100
Although the values are otherwise identical, one value, 100, changes the range to 95. A range of 95 is NOT indicative of how spread out the numbers are.
Spread: The Interquartile Range


The interquartile range (IQR) lets us ignore
extreme data values and concentrate on the
middle of the data.
To find the IQR, we use quartiles. Quartiles divide the data into four equal sections.
• One quarter of the data lies below the lower quartile, Q1.
• One quarter of the data lies above the upper quartile, Q3.
• The quartiles border the middle half of the data.
The difference between the quartiles is the interquartile range (IQR), so
IQR = upper quartile – lower quartile
Example

1. Try to find the IQR for the following data:
1, 2, 4, 6, 7, 8, 9, 5, 6, 8, 10, 12, 13, 11, 24, 29, 31, 32, 35, 40, 45

2. Use your calculator to find the IQR for the following:
55, 59, 62, 65, 71, 73, 74, 75, 75, 75, 76, 78, 78, 79, 81, 81, 81, 90, 91, 93, 95, 96, 97, 99
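A sketch for checking the IQR, assuming one common quartile rule: Q1 is the median of the lower half of the sorted data and Q3 is the median of the upper half, leaving out the middle value when the count is odd. Other rules (and other software) can give slightly different quartiles. The function names are hypothetical.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumes at least two values; the middle value is excluded when n is odd.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the median position
    upper = ordered[-half:]   # values above the median position
    return median(lower), median(upper)


def iqr(values):
    """Interquartile range: upper quartile minus lower quartile."""
    q1, q3 = quartiles(values)
    return q3 - q1


data1 = [1, 2, 4, 6, 7, 8, 9, 5, 6, 8, 10, 12, 13, 11, 24, 29,
         31, 32, 35, 40, 45]
data2 = [55, 59, 62, 65, 71, 73, 74, 75, 75, 75, 76, 78,
         78, 79, 81, 81, 81, 90, 91, 93, 95, 96, 97, 99]

print(quartiles(data1), iqr(data1))  # (6.0, 30.0) 24.0
print(quartiles(data2), iqr(data2))  # (73.5, 90.5) 17.0
```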
5-Number Summary


The 5-number summary of a distribution reports its median, quartiles, and extremes (maximum and minimum).
The 5-number summary for the recent tsunami earthquake magnitudes looks like this:
What About Spread? The Standard
Deviation (The GF of Mean)



A more powerful measure of spread than the
IQR is the standard deviation, which takes into
account how far each data value is from the
mean.
When we describe the center with the mean, we use the standard deviation. When we describe the center with the median, we use the 5-number summary.
A deviation is the distance that a data value is from the mean.
Since adding all deviations together would total zero, we square each deviation and find an average of sorts for the deviations.
What About Spread? The Standard
Deviation (The GF of the Mean)

The variance, notated by $s^2$, is found by summing the squared deviations and (almost) averaging them:
$$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$$
The variance will play a role later in our study, but it is problematic as a measure of spread: it is measured in squared units!
What About Spread? The Standard
Deviation

The standard deviation, s, is just the square root of the variance and is measured in the same units as the original data.
$$s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$$
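The variance and standard deviation formulas can be sketched step by step as follows. The data set is a short illustrative one, and the function names are assumptions; Python's built-in `statistics.variance` and `statistics.stdev` use the same n – 1 divisor and can serve as a cross-check.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    y_bar = sum(values) / n
    squared_deviations = [(y - y_bar) ** 2 for y in values]
    return sum(squared_deviations) / (n - 1)


def sample_sd(values):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(sample_variance(values))


data = [5, 7, 7, 8, 9, 9, 10]  # illustrative values
y_bar = sum(data) / len(data)

# The raw deviations cancel out (up to floating-point rounding),
# which is why they are squared before averaging.
print(sum(y - y_bar for y in data))

print(sample_variance(data), statistics.variance(data))
print(sample_sd(data), statistics.stdev(data))
```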
Thinking About Variation




Since Statistics is about variation, spread is an
important fundamental concept of Statistics.
Measures of spread help us talk about what we
don’t know.
When the data values are tightly clustered
around the center of the distribution, the IQR
and standard deviation will be small.
When the data values are scattered far from the
center, the IQR and standard deviation will be
large.
What Can Go Wrong?









Don’t forget to do a reality check – don’t let the calculator
do the thinking for you.
Don’t forget to sort the values before finding the median
or percentiles.
Don’t worry about small differences when using
different methods.
Don’t compute numerical summaries of a categorical
variable.
Don’t report too many decimal places (3 is what we’ll
use).
Don’t round in the middle of a calculation.
Watch out for multiple modes.
Beware of outliers.
Make a picture … make a picture … make a picture!!!
Relative Frequency Histograms: Displaying
the Distribution

A relative frequency histogram displays the percentage of cases in each bin instead of the count.
In this way, relative frequency histograms are faithful to the area principle.
Relative Frequency





When constructing a histogram we can use the
“relative frequency” (given in percent) instead of
“count.”
Using relative frequency allows us to do better
comparisons.
Histograms using relative frequency have the
same shape as those using count.
For each count in a class, divide by the total
number of data points in the data set.
Convert to a percentage.
Finding Relative Frequency
Class    Frequency    Relative Frequency
40-44    2
45-49    6
50-54    13
55-59    12
60-64    7
65-69    3
Total    43
Finding Relative Frequency
Class    Frequency    Relative Frequency
40-44    2            2/43 = 4.7%
45-49    6            6/43 = 14.0%
50-54    13           13/43 = 30.2%
55-59    12           12/43 = 27.9%
60-64    7            7/43 = 16.3%
65-69    3            3/43 = 7.0%
Total    43
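The relative frequency column can be reproduced with a short sketch: divide each class count by the total and convert to a percentage. The counts are the ones given in the table; the names `CLASS_COUNTS` and `relative_frequencies` are hypothetical.

```python
# Class counts as given in the frequency table (total of 43).
CLASS_COUNTS = {
    "40-44": 2, "45-49": 6, "50-54": 13,
    "55-59": 12, "60-64": 7, "65-69": 3,
}


def relative_frequencies(counts):
    """Return each class count as a percentage of the total (unrounded)."""
    total = sum(counts.values())
    return {label: 100 * count / total for label, count in counts.items()}


rel = relative_frequencies(CLASS_COUNTS)
for label, pct in rel.items():
    # Round only when displaying, not in the middle of the calculation.
    print(f"{label}: {CLASS_COUNTS[label]}/43 = {pct:.1f}%")
```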
Finding Cumulative Frequency
Class    Frequency    Relative Frequency    Cumulative Frequency
40-44    2            2/43 = 4.7%           2
45-49    6            6/43 = 14.0%          8
50-54    13           13/43 = 30.2%         21
55-59    12           12/43 = 27.9%         33
60-64    7            7/43 = 16.3%          40
65-69    3            3/43 = 7.0%           43
Total    43
Finding
Relative Cumulative Frequency
Class    Frequency    Relative Frequency    Cumulative Frequency    Relative Cumulative Frequency
40-44    2            2/43 = 4.7%           2                       2/43 = 4.7%
45-49    6            6/43 = 14.0%          8                       8/43 = 18.6%
50-54    13           13/43 = 30.2%         21                      21/43 = 48.8%
55-59    12           12/43 = 27.9%         33                      33/43 = 76.7%
60-64    7            7/43 = 16.3%          40                      40/43 = 93.0%
65-69    3            3/43 = 7.0%           43                      43/43 = 100%
Total    43
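The two cumulative columns follow from a running total of the class counts. A minimal sketch using the counts from the table (variable names are assumptions):

```python
from itertools import accumulate

CLASSES = ["40-44", "45-49", "50-54", "55-59", "60-64", "65-69"]
COUNTS = [2, 6, 13, 12, 7, 3]  # frequencies from the table

# Running total of the counts gives the cumulative frequency.
cumulative = list(accumulate(COUNTS))
total = cumulative[-1]

# Dividing each running total by the overall total gives the
# relative cumulative frequency, expressed as a percentage.
relative_cumulative = [100 * c / total for c in cumulative]

for label, cum, pct in zip(CLASSES, cumulative, relative_cumulative):
    print(f"{label}: {cum}/{total} = {pct:.1f}%")
```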
Percentiles


“The p-th percentile of a distribution is the value
such that p percent of the observations fall at or
below it.”
If you scored in the 80th percentile on the SAT,
then 80% of all test takers are at or below your
score.
Percentiles
Class    Relative Cumulative Frequency
40-44    2/43 = 4.7%
45-49    8/43 = 18.6%
50-54    21/43 = 48.8%
55-59    33/43 = 76.7%
60-64    40/43 = 93.0%
65-69    43/43 = 100%

It is easy to see the percentiles at the breaks.
“A 64 year old would be at the 93rd percentile.”
What percentile is a 59 year old?
Ogives (o-JIVEs) or
“Relative Cumulative Frequency Graph”
Class    Relative Cumulative Frequency
40-44    2/43 = 4.7%
45-49    8/43 = 18.6%
50-54    21/43 = 48.8%
55-59    33/43 = 76.7%
60-64    40/43 = 93.0%
65-69    43/43 = 100%
Ogives (o-JIVEs) or
“Relative Cumulative Frequency Graph”

Approximately, what
age would be at the
25th percentile?

What about the 50th
percentile?

The 75th percentile?
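Reading a percentile off an ogive amounts to interpolating along the line segments of the graph. The sketch below does this numerically. It assumes the ogive is plotted at the upper boundary of each class (45, 50, ..., 70), starts at 0% at age 40, and is connected by straight lines; a graph drawn with different plotting points would give slightly different answers. The function names are hypothetical.

```python
from itertools import accumulate

# Assumed plotting points: class boundaries, with 0% at the first one.
BOUNDARIES = [40, 45, 50, 55, 60, 65, 70]
COUNTS = [2, 6, 13, 12, 7, 3]  # class frequencies from the table


def ogive_points(boundaries, counts):
    """Return (boundary, cumulative percent) pairs for the ogive."""
    total = sum(counts)
    cumulative = [0] + list(accumulate(counts))
    return [(b, 100 * c / total) for b, c in zip(boundaries, cumulative)]


def value_at_percentile(points, p):
    """Linearly interpolate the value at which the ogive reaches p percent."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= p <= y1:
            if y1 == y0:
                return x0
            return x0 + (x1 - x0) * (p - y0) / (y1 - y0)
    raise ValueError("p must be between 0 and 100")


points = ogive_points(BOUNDARIES, COUNTS)
for p in (25, 50, 75):
    print(f"{p}th percentile: about {value_at_percentile(points, p):.1f} years")
```

Under these assumptions the three answers come out near 51, 55, and 60 years, which is the kind of approximate reading the graph is meant to give.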