Chapter 1: Examining Distributions
1.1 Displaying Distributions with Graphs
Many public health efforts are directed toward
increasing levels of physical activity. “Physical
Activity in Urban White, African American, and
Mexican American Women” (Medicine and
Science in Sports and Exercise [1997]) reported
on physical activity patterns in urban women.
The accompanying data set gives the preferred
leisure-time physical activity for each of 30
Mexican American women. The following
coding is used: W=walking, T=weight training,
C=cycling, G=gardening, A=aerobics.
W T T W A W A A A W
T W G T W T W G W W
T W W W C W W C W T
Construct what you think is an appropriate graph to
display this information.
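A bar graph of a categorical variable starts from a tally of each category. The sketch below is one possible way to do that tally in Python; the variable names are illustrative and the text bars only stand in for a drawn graph.

```python
from collections import Counter

codes = {"W": "walking", "T": "weight training", "C": "cycling",
         "G": "gardening", "A": "aerobics"}

# The 30 coded responses from the data set (order does not matter for a tally).
responses = ("W T T W A W A A A W "
             "T W G T W T W G W W "
             "T W W W C W W C W T").split()

counts = Counter(responses)

# Text bar graph: one bar per category, most common first.
for code, count in counts.most_common():
    percent = 100 * count / len(responses)
    print(f"{codes[code]:<16}{'#' * count:<16}{count:>3} ({percent:.0f}%)")
```

The bar heights can show either the counts or the percents; both come from the same tally.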
The Chronicle of Higher Education (August 31,
2001) reported graduation rates for NCAA Div. 1
schools. The rates reported are the % of full-time
freshmen in fall 1993 who had earned a bachelor’s
degree by August 1999.
California: 64 35 66 41 37 66 44 81 70 31 90 63 37 82 73 74 72 79 68 67
Texas: 21 63 24 32 12 22 88 46 35 35 71 39 39 28 35 65 67 71 25
Individual
Definition: an object described by a set of data
Variable
Definition: a characteristic of an individual
Categorical
Definition: places an individual into a group or category
Examples: gender, race, smoker, marital status
Types of graphs used: bar graph, pie chart
Quantitative
Definition: takes numerical values that result from a measurement
Examples: age, blood pressure, salary
Types of graphs used: histogram, stemplot, time plot
Categorical Variable
Bar Graph (pictograph)
What does the height show? count or %
Does graph need to include all categories? no
Pg 8 #1.3 & 1.4
Pie Chart
Shows? Visual for comparison with whole group
Does graph need to include all categories? yes
Histogram
Rating     Frequency
0 - 2      20
3 - 5      14
6 - 8      15
9 - 11     2
12 - 14    1
Histogram
Has a horizontal axis that often represents groups of data rather than individual data
Method:
Divide data into classes of equal width (5-15)
Count number in each class
Draw bar graph with no space between bars
Example: NCAA
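The method above can be sketched in Python for the California graduation rates. The class width of 10 and the starting point of 30 are choices made for this illustration, not values given in the slides.

```python
# California graduation rates (%) from the NCAA example.
california = [64, 35, 66, 41, 37, 66, 44, 81, 70, 31,
              90, 63, 37, 82, 73, 74, 72, 79, 68, 67]

def class_counts(data, start, width):
    """Count the values in each class [lo, lo + width); assumes min(data) >= start."""
    n_classes = (max(data) - start) // width + 1
    counts = {start + i * width: 0 for i in range(n_classes)}
    for x in data:
        lo = start + (x - start) // width * width
        counts[lo] += 1
    return counts

counts = class_counts(california, start=30, width=10)

# Text histogram: one row per class, empty classes are kept.
for lo, count in counts.items():
    print(f"{lo}-{lo + 9}: {'*' * count}")
```

Keeping the empty 50-59 class in the output matters: it shows the gap between the two clusters of schools.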
Histograms
These six histograms each describe the same set of data
from Table 1.2 on page 11 of your book.
Figure: six histograms labeled A, B, C, D, E, F
Which one is most useful? Least useful? Why?
Interpreting histograms
Look for overall pattern & striking deviations
Describe shape, center, and spread
Symmetric
Skewed to the right – right side extends much farther out than the left side
Quantitative variable cont.
Stemplot
For small data sets
Quicker to make and presents more detailed info
Stem consists of all but the final, rightmost digit, and leaf is the final digit
Example: NCAA
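As a rough illustration of the stem and leaf rule, the following Python sketch builds a stemplot for the California graduation rates. The function name `stemplot` is made up for this example, and it assumes two-digit whole-number data.

```python
from collections import defaultdict

# California graduation rates (%) from the NCAA example.
california = [64, 35, 66, 41, 37, 66, 44, 81, 70, 31,
              90, 63, 37, 82, 73, 74, 72, 79, 68, 67]

def stemplot(data):
    """Return {stem: [leaves]}: stem = all but the last digit, leaf = last digit."""
    leaves = defaultdict(list)
    for x in sorted(data):
        leaves[x // 10].append(x % 10)
    # Keep stems that have no leaves so gaps in the data stay visible.
    return {stem: leaves.get(stem, [])
            for stem in range(min(leaves), max(leaves) + 1)}

plot = stemplot(california)
for stem, leaf_list in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaf_list)}")
```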
Time plot
To show a change over time
Example: pg 19 #1.10
What kind of graph would be appropriate?
Whether a spun penny lands “heads” or “tails”
The number of calories in a fast food sandwich
The life expectancy of a nation
The occupational background of a Civil War general
The weight of an automobile
For whom an American voted in the 1992 Presidential election
The age of a bride on her wedding day
The average low temperature in January for Appleton
Misleading graphs
In trying to make the graph more visually interesting by
replacing the bars of a bar chart with milk buckets, areas
are distorted.
Another common distortion occurs when a third dimension is added to
bar charts or pie charts. The 3-D version distorts the areas, and as a
consequence, is much more difficult to interpret correctly.
It is common to see scatterplots with broken axes, but be cautious of
time plots, bar graphs, or histograms with broken axes. Broken axes in
time plots can exaggerate the magnitude of change over time.
In bar graphs and histograms, the vertical axis should never be broken.
For example, starting the vertical axis at 50 exaggerates the gain.
The area of the rectangle representing 68 is more than three times the
area of the rectangle representing 55.
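The arithmetic behind that claim can be checked with a short Python sketch. The helper `drawn_height` is a name invented here; it simply measures a bar from wherever the axis starts.

```python
def drawn_height(value, axis_start=0):
    """Height of a bar as drawn when the vertical axis starts at axis_start."""
    return value - axis_start

# Bars for 68 and 55 with an honest axis (from 0) and a broken axis (from 50).
true_ratio = drawn_height(68) / drawn_height(55)
broken_ratio = drawn_height(68, axis_start=50) / drawn_height(55, axis_start=50)

print(f"axis from 0:  taller bar is {true_ratio:.2f} times the shorter one")
print(f"axis from 50: taller bar is {broken_ratio:.2f} times the shorter one")
```

With equal bar widths, the ratio of heights is also the ratio of areas, so the broken axis turns a gain of roughly 24% into a bar that looks 3.6 times as large.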
Watch out for unequal time spacing in time plots.
Information from research studies is sometimes taken out of context.
Think critically!
What might be wrong with the following?
Only 3% of the men surveyed read Cosmopolitan magazine.
Since most automobile accidents occur within 15 miles of a person’s
residence, it is safer to make long trips.
A television commercial claims that “our razor blades are
manufactured to such high standards that they will give you a shave
that is 50% closer”.
A national health food magazine claims that “95% of its subscribers
who follow the magazine’s recommendation and take megadoses of
vitamin C are healthy and vigorous”.
During 1990 there were 234 accidents involving drunken drivers and
15,897 accidents involving drunken pedestrians reported in Danville.
Can we conclude that it is safer in Danville to be a drunken driver
than a drunken pedestrian?
Review of graphs:
Pg 14 #1.7 & 1.8
Pg 20 #1.11, 1.18, 1.19
SAT scores: Make a histogram to better
understand data given and interpret the
histogram.
1.2 Describing Distributions with Numbers
Population – the entire group of individuals
that we want information about
Sample – part of the population that we
actually examine in order to gather
information and make conclusions
Mean
Measure of center, or average
$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$
or
$\bar{x} = \frac{\sum x_i}{n}$
µ is used for the population mean
Median
Midpoint of distribution
To find median:
Symmetrical distribution – mean and
median are close together
Skewed distribution – the mean is farther
out in the long tail than is the median
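To see the mean pulled toward the long tail, here is a small Python sketch using the Texas graduation rates from Section 1.1. It relies on the standard library `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

# Texas graduation rates (%) from the NCAA example.
texas = [21, 63, 24, 32, 12, 22, 88, 46, 35, 35,
         71, 39, 39, 28, 35, 65, 67, 71, 25]

texas_mean = mean(texas)      # sum of the values divided by n
texas_median = median(texas)  # middle value of the ordered list

print(f"mean = {texas_mean:.2f}, median = {texas_median}")
```

The mean comes out several points above the median, which is what the slide describes for a distribution with a longer right tail.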
http://www.rossmanchance.com/applets/DotPlotAppletAug11/DotPlotApplet.html
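Mode
Data value that is repeated most often
Figure: typical positions of the mean, median, and mode
SYMMETRIC: Mode = Mean = Median
SKEWED RIGHT (positively): Mode, Median, Mean (left to right)
SKEWED LEFT (negatively): Mean, Median, Mode (left to right)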
Quartiles
Spread of the middle half of data
To calculate:
arrange data in ascending order and locate the median
lower quartile (Q1) is the median of the lower half of data
upper quartile (Q3) is the median of the upper half of data
Q1 is larger than 25% of data
Q2 is larger than 50% of data
Q3 is larger than 75% of data
Find the Quartiles for the following data.
12 35 23 9 5 21 45 56 24 6 28 31
Ordered data: 5 6 9 12 21 23 24 28 31 35 45 56
Q1 = 10.5 (between 9 and 12)
Q2 (Median) = 23.5 (between 23 and 24)
Q3 = 33 (between 31 and 35)
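The same steps can be sketched in Python. The function `quartiles` below is an illustrative helper that uses the median-of-halves method from the slides; for an odd number of values it assumes the median itself is left out of both halves, and it expects at least two values. Other software may use a slightly different quartile rule.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the median position
    upper = ordered[-half:]   # values above the median position
    return median(lower), median(ordered), median(upper)

data = [12, 35, 23, 9, 5, 21, 45, 56, 24, 6, 28, 31]
q1, q2, q3 = quartiles(data)
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
```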
http://www.rossmanchance.com/applets/DotPlotAppletAug11/DotPlotApplet.html
5 number summary and boxplot
5 number summary – minimum, Q1, Q2, Q3, maximum
Boxplot – graph of 5 number summary
Best used for side-by-side comparison of more than one set of data
Include numerical scale in the graph
Min   Q1     Q2     Q3   Max
5     10.5   23.5   33   56
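For a side-by-side comparison, the five number summaries of the California and Texas graduation rates from Section 1.1 can be computed as in the sketch below. The helper `five_number_summary` is a hypothetical name, and it uses the same median-of-halves quartile rule as the slides.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, Q2, Q3, max) with quartiles as medians of the halves."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[-half:])
    return ordered[0], q1, median(ordered), q3, ordered[-1]

california = [64, 35, 66, 41, 37, 66, 44, 81, 70, 31,
              90, 63, 37, 82, 73, 74, 72, 79, 68, 67]
texas = [21, 63, 24, 32, 12, 22, 88, 46, 35, 35,
         71, 39, 39, 28, 35, 65, 67, 71, 25]

ca_summary = five_number_summary(california)
tx_summary = five_number_summary(texas)
print("California:", ca_summary)
print("Texas:     ", tx_summary)
```

Each tuple gives the five positions needed to draw one box and its whiskers on a shared numerical scale.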
Outliers
An unusually small or large data value
Calculate interquartile range (IQR = Q3 – Q1)
An observation is an outlier if it falls more than 1.5 times the IQR above Q3 or below Q1
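A minimal Python sketch of the 1.5 × IQR rule follows, applied to the twelve values from the quartile example (Q1 = 10.5, Q3 = 33). The function name `outlier_fences` is an assumption made for this illustration.

```python
def outlier_fences(q1, q3):
    """Return the (low, high) cutoffs of the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [5, 6, 9, 12, 21, 23, 24, 28, 31, 35, 45, 56]
low, high = outlier_fences(10.5, 33)

# Any value outside the fences is flagged as an outlier.
outliers = [x for x in data if x < low or x > high]
print(f"fences: {low} to {high}; outliers: {outliers}")
```

For this data set no value falls outside the fences, so the rule flags no outliers.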
Standard Deviation
Measures spread by looking at how far the
observations are from their mean
Variance formula: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard deviation formula: $s = \sqrt{s^2}$
s is used for sample data; σ is used for a population (the equation is slightly different)
Waiting Times of Bank Customers at Different Banks (in minutes)
Jefferson Valley Bank: 6.5 6.6 6.7 6.8 7.1 7.3 7.4 7.7 7.7 7.7
Bank of Providence: 4.2 5.4 5.8 6.2 6.7 7.7 7.7 8.5 9.3 10.0

          Jefferson Valley Bank   Bank of Providence
Mean      7.15                    7.15
Median    7.20                    7.20
Mode      7.7                     7.7
What is the Standard Deviation of the data from JV Bank? from BofP?
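One way to work this out is to code the sample formula directly: sum the squared deviations from the mean, divide by n - 1, then take the square root. The Python sketch below does that for both banks; `sample_sd` is an illustrative name.

```python
from math import sqrt

def sample_sd(data):
    """Sample standard deviation: sqrt of sum((x - mean)^2) / (n - 1)."""
    n = len(data)
    x_bar = sum(data) / n
    variance = sum((x - x_bar) ** 2 for x in data) / (n - 1)
    return sqrt(variance)

jefferson_valley = [6.5, 6.6, 6.7, 6.8, 7.1, 7.3, 7.4, 7.7, 7.7, 7.7]
providence = [4.2, 5.4, 5.8, 6.2, 6.7, 7.7, 7.7, 8.5, 9.3, 10.0]

sd_jv = sample_sd(jefferson_valley)
sd_bp = sample_sd(providence)
print(f"Jefferson Valley: s = {sd_jv:.2f} min")
print(f"Providence:       s = {sd_bp:.2f} min")
```

The two banks share the same mean, median, and mode, yet the standard deviations differ by almost a factor of four, which is the point of the comparison.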
Dotplots of Waiting Times
Visually, which one has the greater spread?
Calculate the mean & standard deviation for each set of test scores (each column is one set)
95   99   100  92   90
92   94   95   87   90
90   83   89   90   89
85   75   65   90   89
87   93   95   89   90
Mean: 89.8 88.8 88.8 89.6 89.6
Standard deviation: 3.96 9.65 13.86 1.82 0.55
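These answers can be reproduced with Python's `statistics` module, reading the scores as five sets of five with each set running down a column. That column layout is an inference from the answers above, so treat it as an assumption of this sketch.

```python
from statistics import mean, stdev

# Scores in reading order, five per row; each column is one set of test scores.
rows = [
    [95, 99, 100, 92, 90],
    [92, 94, 95, 87, 90],
    [90, 83, 89, 90, 89],
    [85, 75, 65, 90, 89],
    [87, 93, 95, 89, 90],
]

results = []
for scores in zip(*rows):  # zip(*rows) walks the table column by column
    results.append((round(mean(scores), 1), round(stdev(scores), 2)))

for number, (avg, sd) in enumerate(results, start=1):
    print(f"set {number}: mean = {avg}, s = {sd}")
```

Sets with nearly identical means can have very different spreads, which is why both numbers are reported.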
Choosing a summary
The five number summary is used for describing a skewed distribution or a distribution with outliers
Use the mean and standard deviation for reasonably symmetric distributions that are free of outliers
Section 1.2 practice
Pg 41 - 45 #1.35, 1.38, 1.47, 1.48, 1.49
1.3 Normal Distributions
Density curve
pg 46 & 47
Compact picture of the overall pattern of the data
Scores on national tests often have a regular distribution
Make total area under curve equal one
Symmetrical
Partial area represents % of total “students” (observations)
Normal Distributions
pg 51-52
What are they?
Density curves that are symmetrical, single-peaked, and bell-shaped
Curve is described by its . . .
mean µ and standard deviation σ
Where is the mean located?
at the center of the curve
What controls how spread out the curve is?
Standard deviation controls the spread; the larger the σ, the more spread out the data
Where is the σ on the curve?
at the points of change of curvature, one σ on either side of the mean
Why are normal curves important?
Good descriptions for some distributions of real data (scores on tests, measurements of same quantity, characteristics of biological populations)
Good approximations to the results of many kinds of chance outcomes (tossing coin, rolling die)
68-95-99.7 rule
In a normal distribution:
68% of the observations fall within 1σ of the mean
95% of the observations fall within 2σ of the mean
99.7% of the observations fall within 3σ of the mean
http://www.rossmanchance.com/applets/DotPlotAppletAug11/DotPlotApplet.html
68-95-99.7 rule
99.7% of data are within 3 standard deviations of the mean
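95% within 2 standard deviations
68% within 1 standard deviation
Figure: normal curve with the horizontal axis marked at x̄ - 3s, x̄ - 2s, x̄ - s, x̄, x̄ + s, x̄ + 2s, x̄ + 3s. Areas from left to right: 0.1%, 2.4%, 13.5%, 34%, 34%, 13.5%, 2.4%, 0.1%.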
Example: Light bulbs: x̄ = 1600 hr, s = 100 hr
68% of light bulbs last:
95% of light bulbs last:
99.7% of light bulbs last:
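The intervals asked for here are the mean plus or minus 1, 2, and 3 standard deviations. The Python sketch below computes them and, as a check, compares the rule's percentages with the exact normal areas from `statistics.NormalDist`; it assumes bulb lifetimes are normally distributed.

```python
from statistics import NormalDist

mean_hours, sd_hours = 1600, 100
bulbs = NormalDist(mean_hours, sd_hours)  # assumes lifetimes are normal

intervals = {}
for k, rule_percent in [(1, 68), (2, 95), (3, 99.7)]:
    low = mean_hours - k * sd_hours
    high = mean_hours + k * sd_hours
    intervals[k] = (low, high)
    exact = bulbs.cdf(high) - bulbs.cdf(low)  # exact area between low and high
    print(f"about {rule_percent}% last between {low} and {high} hr "
          f"(exact area {exact:.4f})")
```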
Practice problems: pg 54 #1.53, 1.54, 1.55
Standard normal curve
standardizing a normal curve is making all
normal distributions the same
normal distribution with mean = 0 and standard
deviation = 1
z-score (# of standard deviations a value is away
from the mean)
Formula: $z = \frac{x - \mu}{\sigma}$
Practice problem pg 56 #1.56
Any question about what proportion of observations lie in some range of values can be answered by finding the area under the curve (percentage)
What % of the population has a z-score. . .
Less than -1.76: shaded area = .0392 or 3.92%
Less than 0.58: shaded area = .7190 or 71.90%
Greater than 1.96: lower area = .9750, so shaded area = .0250 or 2.50%
Between -1.76 and .58: .7190 - .0392 = .6798 or 67.98%
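The table lookups above can be checked in Python, where `statistics.NormalDist().cdf(z)` gives the area to the left of z under the standard normal curve. This is a sketch for verification, not a replacement for learning to read the table.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

area_below_neg = standard_normal.cdf(-1.76)    # area left of z = -1.76
area_below_pos = standard_normal.cdf(0.58)     # area left of z = 0.58
area_above = 1 - standard_normal.cdf(1.96)     # area right of z = 1.96
area_between = area_below_pos - area_below_neg # area between -1.76 and 0.58

print(f"less than -1.76:        {area_below_neg:.4f}")
print(f"less than 0.58:         {area_below_pos:.4f}")
print(f"greater than 1.96:      {area_above:.4f}")
print(f"between -1.76 and 0.58: {area_between:.4f}")
```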
In a standard normal distribution, find the z-score that cuts off the
bottom 10%: area to the left is .10; the closest table value, .1003, is z = -1.28
top 15%: area to the left is .85; the closest table value, .8508, is z = 1.04
If the probability of getting less than a certain z-value is .1190, what is the z-value?
z = -1.18
If the probability of getting larger than a certain z-value is .0129, what is the z-value?
1 - .0129 = .9871
z = 2.23
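Working backward from an area to a z-score is the inverse of the table lookup. In Python this is `NormalDist().inv_cdf(area)`, which takes the area to the left; the sketch below repeats the four backward problems above and rounds to two decimals as the table does.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# inv_cdf expects the area to the LEFT of z, so "top" areas are subtracted from 1.
z_bottom_10 = standard_normal.inv_cdf(0.10)
z_top_15 = standard_normal.inv_cdf(1 - 0.15)
z_left_1190 = standard_normal.inv_cdf(0.1190)
z_right_0129 = standard_normal.inv_cdf(1 - 0.0129)

for label, z in [("bottom 10%", z_bottom_10), ("top 15%", z_top_15),
                 ("area .1190 below", z_left_1190),
                 ("area .0129 above", z_right_0129)]:
    print(f"{label}: z = {z:.2f}")
```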
In a normal distribution µ=25 and σ=5. What is the probability of obtaining a value
greater than 30?
z = (30-25)/5 = 1
1-.8413 = .1587 or 15.87%
less than 15?
z = (15-25)/5 = -2
.0228 or 2.28%
between 20 and 30?
z = -1 and z = 1
.8413-.1587 = .6826 or 68.26%
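The standardize-then-look-up steps can be sketched in Python as follows. The helper `z_score` is an illustrative name. Because the code does not round intermediate areas to four decimals, the last answer comes out as .6827 rather than the table-based .6826.

```python
from statistics import NormalDist

mu, sigma = 25, 5
standard_normal = NormalDist()  # mean 0, standard deviation 1

def z_score(x):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

p_greater_30 = 1 - standard_normal.cdf(z_score(30))
p_less_15 = standard_normal.cdf(z_score(15))
p_between = standard_normal.cdf(z_score(30)) - standard_normal.cdf(z_score(20))

print(f"P(X > 30)      = {p_greater_30:.4f}")
print(f"P(X < 15)      = {p_less_15:.4f}")
print(f"P(20 < X < 30) = {p_between:.4f}")
```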
The Flatt Tire Corporation claims that the useful life of its tires is normally distributed with a mean life of 28,000 miles and a standard deviation of 4000 miles. What percentage of the tires are expected to last more than 35,000 miles?
z = (35000-28000) / 4000 = 1.75
1 - .9599 = .0401 or 4.01%
Suppose it takes you an average of 20 minutes to drive to school, with a standard deviation of 2 minutes.
• How often will you arrive at school in less than 22 minutes?
• How often will it take you more than 24 minutes?
• 75% of the time you will arrive in x minutes or less. Solve for x.
• 43% of the time you will arrive in y minutes or more. Solve for y.
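One possible way to check answers to these four questions is sketched below in Python. It assumes driving times are normally distributed, which the problem implies but does not state, and the variable names are illustrative.

```python
from statistics import NormalDist

drive = NormalDist(mu=20, sigma=2)  # assumes driving times are normal

p_under_22 = drive.cdf(22)        # proportion of trips shorter than 22 minutes
p_over_24 = 1 - drive.cdf(24)     # proportion of trips longer than 24 minutes
x = drive.inv_cdf(0.75)           # 75% of trips take x minutes or less
y = drive.inv_cdf(1 - 0.43)       # 43% of trips take y minutes or more

print(f"less than 22 min: {p_under_22:.4f}")
print(f"more than 24 min: {p_over_24:.4f}")
print(f"x = {x:.2f} min, y = {y:.2f} min")
```

For the last question, "43% or more" leaves 57% of the area to the left of y, which is why the code passes 1 - 0.43 to `inv_cdf`.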
Section 1.3 practice problems
Pg. 61 #1.57, 1.58
Pg. 64 #1.61, 62, 63, 65, 66, 68, 70