Chapter 1: Statistics

Chapter 2: Descriptive Analysis
and Presentation of Single-Variable Data
Statistic             Value
-------------------------------
Mean                  26.86667
Standard Error        2.816392
Median                25
Mode                  20
Standard Deviation    10.90784
Sample Variance       118.981
Kurtosis              -0.61717
Skewness              0.11344
Range                 38
Minimum               7
Maximum               45
Sum                   403
Count                 15
Largest(1)            45
Smallest(1)           7
[Figure: chart with one axis scaled 0 to 9 and the other 5 to 20]
Chapter Goals
• Learn how to present and describe sets of
data.
• Learn measures of central tendency,
measures of dispersion (spread), measures
of position, and types of distributions.
• Learn how to interpret findings so that we
know what the data is telling us about the
sampled population.
2.1: Graphic Presentation of Data
• Use initial exploratory data-analysis
techniques to produce a pictorial
representation of the data.
• Resulting displays reveal patterns of
behavior of the variable being studied.
• The method used is determined by the type
of data and the idea to be presented.
• No single correct answer when constructing
a graphic display.
Circle Graphs and Bar Graphs: Graphs that are
used to summarize attribute data.
Circle graphs (pie diagrams) show the amount of
data that belongs to each category as a proportional
part of a circle.
Bar graphs show the amount of data that belongs to
each category as proportionally sized rectangular
areas.
Example: The table below lists the number of automobiles
sold last week by day for a local dealership.
Day          Number Sold
------------------------
Monday       15
Tuesday      23
Wednesday    35
Thursday     11
Friday       12
Saturday     42
Describe the data using a circle graph and a bar graph.
[Figure: circle graph, "Automobiles Sold Last Week": Monday 11%, Tuesday 17%, Wednesday 25%, Thursday 8%, Friday 9%, Saturday 30%]
[Figure: bar graph, "Automobiles Sold Last Week": one bar per day, Monday through Saturday, on a scale from 0 to 45]
Pareto Diagram: A bar graph with the bars arranged
from the most numerous category to the least
numerous category. It includes a line graph
displaying the cumulative percentages and counts for
the bars.
Note:
• The Pareto diagram is often used in quality
control applications.
• Used to identify the number and type of defects
that happen within a product or service.
Example: The final daily inspection defect report for a cabinet
manufacturer is given in the table below.
Defect       Number
-------------------
Dent         5
Stain        12
Blemish      43
Chip         25
Scratch      40
Others       10
Construct a Pareto diagram for this defect report. Management
has given the cabinet production line the goal of reducing their
defects by 50%. What two defects should they give special
attention to in working toward this goal?
Solution:
[Figure: Pareto diagram, "Daily Defect Inspection Report": bars in decreasing order of count with a cumulative-percent line; scales: Count and Percent]
Defect      Count    Percent    Cum %
-------------------------------------
Blemish     43       31.9       31.9
Scratch     40       29.6       61.5
Chip        25       18.5       80.0
Stain       12       8.9        88.9
Others      10       7.4        96.3
Dent        5        3.7        100.0
The production line should try to eliminate blemishes and scratches. This
would cut defects by more than 50%.
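The Pareto table for this example can be reproduced with a short sketch. The function name `pareto_table` and the dictionary layout are assumptions made for illustration; only the defect counts come from the report above.

```python
# Defect counts from the daily inspection report.
defects = {"Dent": 5, "Stain": 12, "Blemish": 43,
           "Chip": 25, "Scratch": 40, "Others": 10}

def pareto_table(counts):
    """Return rows (category, count, percent, cumulative percent), largest count first."""
    total = sum(counts.values())
    rows = []
    running = 0
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    for name, count in ordered:
        running += count
        rows.append((name, count,
                     round(100 * count / total, 1),
                     round(100 * running / total, 1)))
    return rows

for name, count, percent, cum_percent in pareto_table(defects):
    print(f"{name:<8} {count:>5} {percent:>6.1f} {cum_percent:>6.1f}")
```

The cumulative percent after the second row is 61.5, which is why the two largest categories are the ones to target for a 50% reduction.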
Quantitative Data: One reason for constructing a graph of
quantitative data is to examine the distribution - is the data
compact, spread out, skewed, symmetric, etc.
Distribution: The pattern of variability displayed by the data
of a variable. The distribution displays the frequency of each
value of the variable.
Dotplot Display: Displays the data of a sample by
representing each piece of data with a dot positioned along a
scale. This scale can be either horizontal or vertical. The
frequency of the values is represented along the other scale.
Example: A random sample of the lifetime (in years) of 50
home washing machines is given below.
2.5   16.9  4.5   0.9   1.5   17.8  8.5   8.9   2.5   6.4
14.5  0.7   7.3   1.4   12.2  3.5   2.9   4.0   3.7   6.8
7.4   4.1   0.4   3.3   0.9   4.2   3.3   4.7   18.1  2.6
4.4   7.2   6.9   7.0   0.7   1.6   2.2   9.2   5.2   15.3
4.0   10.4  12.2  4.0   4.1   1.8   21.8  18.3  3.6
The figure below is a dotplot for the 50 lifetimes.
[Figure: dotplot of the lifetimes; horizontal scale labeled 0.0, 4.0, 8.0, 12.0, 16.0, 20.0]
Notice how the data is “bunched” near the lower extreme and
more “spread out” near the higher extreme.
Background:
• The stem-and-leaf display has become very popular for
summarizing numerical data.
• It is a combination of graphing and sorting.
• The actual data is part of the graph.
• Well-suited for computers.
Stem-and-Leaf Display: Pictures the data of a sample using
the actual digits that make up the data values. Each numerical
value is divided into two parts: the leading digit(s) become
the stem, and the trailing digit(s) become the leaf. The stems
are located along the main axis, and a leaf for each piece of
data is located so as to display the distribution of the data.
Example: A city police officer, using radar, checked the speed
of cars as they were traveling down the main street in town:
41  31  33  35  36  37  39  49  33  19
26  27  24  32  40  39  16  55  38  36
Construct a stem-and-leaf plot for this data.
Solution: All the speeds are in the 10s, 20s, 30s, 40s, and 50s.
Use the first digit of each speed as the stem and the second
digit as the leaf. Draw a vertical line and list the stems, in
order, to the left of the line. Place each leaf on its stem: place
the trailing digit on the right side of the vertical line opposite
its corresponding leading digit.
20 Speeds
---------------------------------------
1 | 6 9
2 | 4 6 7
3 | 1 2 3 3 5 6 6 7 8 9 9
4 | 0 1 9
5 | 5
---------------------------------------
The speeds are centered around the 30s.
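The construction steps for this display can be sketched in a few lines. The function name `stem_and_leaf` is an assumption for illustration, and the sketch assumes two-digit whole-number data, as in the radar example.

```python
from collections import defaultdict

# The 20 radar speeds from the example.
speeds = [41, 31, 33, 35, 36, 37, 39, 49, 33, 19,
          26, 27, 24, 32, 40, 39, 16, 55, 38, 36]

def stem_and_leaf(values):
    """Map each stem (tens digit) to its leaves (units digits) in increasing order."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # leading digit, trailing digit
        stems[stem].append(leaf)
    return dict(stems)

display = stem_and_leaf(speeds)
for stem in sorted(display):
    print(stem, "|", " ".join(str(leaf) for leaf in display[stem]))
```

Sorting the values first means each row of leaves comes out already ranked, so the display doubles as a sorted listing of the data.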
Note: The display could be constructed so that only five
possible values (instead of ten) could fall in each stem. What
would the stems look like? Would there be a difference in
appearance?
Note:
1. It is fairly typical of many variables to display a distribution
that is concentrated (mounded) about a central value and
then in some manner be dispersed in both directions.
(Why?)
2. A display that indicates two “mounds” may really be two
overlapping distributions.
3. A back-to-back stem-and-leaf display makes it possible to
compare two distributions graphically.
4. A side-by-side dotplot is also useful for comparing two
distributions.
2.2: Frequency Distributions and
Histograms
• Stem-and-leaf plots often present adequate
summaries, but they can get very big, very
fast.
• Need other techniques for summarizing
data.
• Frequency distributions and histograms are
used to summarize large data sets.
Frequency Distribution: A listing, often expressed in chart
form, that pairs each value of a variable with its frequency.
Ungrouped Frequency Distribution: Each value of x in the
distribution stands alone.
Grouped Frequency Distribution: Group the values into a
set of classes.
1. A table that summarizes data by classes, or class intervals.
2. In a typical grouped frequency distribution, there are
usually 5-12 classes of equal width.
3. The table may contain columns for class number, class
interval, tally (if constructing by hand), frequency, relative
frequency, cumulative relative frequency, and class mark.
4. In an ungrouped frequency distribution each class consists
of a single value.
Guidelines for constructing a frequency distribution:
1. Each class should be of the same width.
2. Classes should be set up so that they do not overlap and so
that each piece of data belongs to exactly one class.
3. For problems in the text, 5-12 classes are most desirable.
The square root of n is a reasonable guideline for the
number of classes if n is less than 150.
4. Use a system that takes advantage of a number pattern, to
guarantee accuracy.
5. If possible, an even class width is often advantageous.
Procedure for constructing a frequency distribution:
1. Identify the high (H) and low (L) scores. Find the range.
Range = H - L.
2. Select a number of classes and a class width so that the
product is a bit larger than the range.
3. Pick a starting point a little smaller than L. Count from
the starting point by the width to obtain the class boundaries.
Observations that fall on class boundaries are placed into the
class interval to the right.
Note:
1. The class width is the difference between the upper- and
lower-class boundaries.
2. There is no best choice for class widths, number of classes,
and starting points.
Example: The hemoglobin test, a blood test given to diabetics
during their periodic checkups, indicates the level of control
of blood sugar during the past two to three months. The data
in the table below was obtained for 40 different diabetics at a
university clinic that treats diabetic patients. Construct a
grouped frequency distribution using the classes 3.7 - <4.7,
4.7 - <5.7, 5.7 - <6.7, etc. Which class has the highest
frequency?
6.5  6.4  5.0  7.9  5.0  6.0  8.0  6.0  5.6  5.6
6.5  5.6  7.6  6.0  6.1  6.0  4.8  5.7  6.4  6.2
8.0  9.2  6.6  7.7  7.5  8.1  7.2  6.7  7.9  8.0
5.9  7.7  8.0  6.5  4.0  8.2  9.2  6.6  5.7  9.0
Solution:
Class          Frequency   Relative    Cumulative       Class
Boundaries     f           Frequency   Rel. Frequency   Mark, x
----------------------------------------------------------------
3.7 - <4.7     1           .025        .025             4.2
4.7 - <5.7     6           .150        .175             5.2
5.7 - <6.7     16          .400        .575             6.2
6.7 - <7.7     4           .100        .675             7.2
7.7 - <8.7     10          .250        .925             8.2
8.7 - <9.7     3           .075        1.000            9.2
The class 5.7 - <6.7 has the highest frequency. The frequency
is 16 and the relative frequency is .40.
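The tally for this example can be checked with a small sketch. The function name `grouped_frequencies` and its parameters are assumptions for illustration. Working in tenths (whole numbers) is a design choice here, so that a value such as 5.7 lands exactly on its class boundary instead of being affected by floating-point rounding.

```python
# The 40 hemoglobin test results from the example.
hemoglobin = [6.5, 6.4, 5.0, 7.9, 5.0, 6.0, 8.0, 6.0, 5.6, 5.6,
              6.5, 5.6, 7.6, 6.0, 6.1, 6.0, 4.8, 5.7, 6.4, 6.2,
              8.0, 9.2, 6.6, 7.7, 7.5, 8.1, 7.2, 6.7, 7.9, 8.0,
              5.9, 7.7, 8.0, 6.5, 4.0, 8.2, 9.2, 6.6, 5.7, 9.0]

def grouped_frequencies(values, start_tenths=37, width_tenths=10, classes=6):
    """Count values per class [start, start + width), measured in tenths.

    Assumes every value lies between the first lower boundary and the
    last upper boundary; a boundary value goes to the class on its right.
    """
    counts = [0] * classes
    for value in values:
        index = (round(value * 10) - start_tenths) // width_tenths
        counts[index] += 1
    return counts

counts = grouped_frequencies(hemoglobin)
n = len(hemoglobin)
cumulative = 0
for i, f in enumerate(counts):
    cumulative += f
    lower = 3.7 + i
    print(f"{lower:.1f} - <{lower + 1:.1f}  f={f:2d}  "
          f"rel={f / n:.3f}  cum rel={cumulative / n:.3f}")
```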
Histogram: A bar graph representing a frequency distribution
of a quantitative variable. A histogram is made up of the
following components:
1. A title, which identifies the population of interest.
2. A vertical scale, which identifies the frequencies in the
various classes.
3. A horizontal scale, which identifies the variable x. Values
for the class boundaries or class marks may be labeled
along the x-axis. Use whichever method of labeling the
axis best presents the variable.
Note:
1. The relative frequency is sometimes used on the vertical
scale.
2. It is possible to create a histogram based on class marks.
Example: Construct a histogram for the blood test results
given in the previous example.
Solution:
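[Figure: histogram of the blood test results; vertical scale: Frequency, 0 to 15; horizontal scale: Blood Test, labeled with the class marks 4.2, 5.2, 6.2, 7.2, 8.2, 9.2]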
Example: A recent survey of Roman Catholic nuns
summarized their ages in the table below.
Age            Frequency   Class Mark
-------------------------------------
20 up to 30    34          25
30 up to 40    58          35
40 up to 50    76          45
50 up to 60    187         55
60 up to 70    254         65
70 up to 80    241         75
80 up to 90    147         85
Construct a histogram for this age data.
Solution:
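[Figure: histogram of the age data; vertical scale: Frequency, labeled 0, 100, 200; horizontal scale: Age, labeled with the class marks 25, 35, 45, 55, 65, 75, 85]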
Terms used to describe histograms:
Symmetrical: Both sides of the distribution are identical.
There is a line of symmetry.
Uniform (rectangular): Every value appears with equal
frequency.
Skewed: One tail is stretched out longer than the other. The
direction of skewness is on the side of the longer tail.
(Positively skewed vs. negatively skewed)
J-shaped: There is no tail on the side of the class with the
highest frequency.
Bimodal: The two largest classes are separated by one or
more classes. Often implies two populations are sampled.
Normal: A symmetrical distribution that is mounded about the
mean and becomes sparse at the extremes.
Note:
1. The mode is the value that occurs with greatest frequency
(discussed in Section 2.3).
2. The modal class is the class with the greatest frequency.
3. A bimodal distribution has two high-frequency classes
separated by classes with lower frequencies.
4. Graphical representations of data should include a
descriptive, meaningful title and proper identification of
the vertical and horizontal scales.
Cumulative Frequency Distribution: A frequency
distribution that pairs cumulative frequencies with values of
the variable.
The cumulative frequency for any given class is the sum of
the frequency for that class and the frequencies of all classes
of smaller values.
The cumulative relative frequency for any given class is the
sum of the relative frequency for that class and the relative
frequencies of all classes of smaller values.
Example: A computer science aptitude test was given to 50
students. The table below summarizes the data.
Class                      Relative    Cumulative   Cumulative
Boundaries     Frequency   Frequency   Frequency    Rel. Frequency
------------------------------------------------------------------
0 up to 4      4           .08         4            .08
4 up to 8      8           .16         12           .24
8 up to 12     8           .16         20           .40
12 up to 16    20          .40         40           .80
16 up to 20    6           .12         46           .92
20 up to 24    3           .06         49           .98
24 up to 28    1           .02         50           1.00
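The cumulative columns are running totals of the frequency column. A minimal sketch using the standard library's `itertools.accumulate` (the variable names are illustrative assumptions):

```python
from itertools import accumulate

# Class frequencies for the computer science aptitude test (50 students).
upper_boundaries = [4, 8, 12, 16, 20, 24, 28]
frequencies = [4, 8, 8, 20, 6, 3, 1]

n = sum(frequencies)
cumulative = list(accumulate(frequencies))          # running totals
cumulative_relative = [c / n for c in cumulative]   # running totals as proportions

for upper, c, cr in zip(upper_boundaries, cumulative, cumulative_relative):
    print(f"up to {upper:2d}: cumulative frequency {c:2d}, cumulative rel. frequency {cr:.2f}")
```

Pairing each cumulative value with the upper boundary of its class gives the points that would be plotted if these totals were graphed.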
Ogive: A line graph of a cumulative frequency or cumulative
relative frequency distribution. An ogive has the following
components:
1. A title, which identifies the population or sample.
2. A vertical scale, which identifies either the cumulative
frequencies or the cumulative relative frequencies.
3. A horizontal scale, which identifies the upper class
boundaries. Until the upper boundary of a class has been
reached, you cannot be sure you have accumulated all the
data in the class. Therefore, the horizontal scale for an
ogive is always based on the upper class boundaries.
Note: Every ogive starts on the left with a cumulative relative
frequency of zero at the lower class boundary of the first class
and ends on the right with a cumulative relative frequency of
100% at the upper class boundary of the last class.
Example: The graph below is an ogive using cumulative
relative frequencies for the computer science aptitude data.
[Figure: ogive; vertical scale: Cumulative Relative Frequency, 0.00 to 1.00 in steps of 0.10; horizontal scale: Test Score, 0 to 28 in steps of 4]
2.3: Measures of Central
Tendency
• Numerical values used to locate the middle
of a set of data, or where the data is
clustered.
• The term average is often associated with
all measures of central tendency.
Mean: The type of average with which you are probably most
familiar. The mean is the sum of all the values divided by the
total number of values, n.
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{n}(x_1 + x_2 + \cdots + x_n) \]
Note:
1. The population mean, $\mu$ (lowercase mu, Greek alphabet),
is the mean of all x values for the entire population.
2. We usually cannot measure $\mu$ but would like to estimate its
value.
3. A physical representation: the mean is the value that
balances the weights on the number line.
Example: The data below represents the number of accidents
in each of the last 8 years at a dangerous intersection.
8, 9, 3, 5, 2, 6, 4, 5
Find the mean number of accidents.
Solution:
\[ \bar{x} = \frac{1}{8}(8 + 9 + 3 + 5 + 2 + 6 + 4 + 5) = 5.25 \]
Note: In the data above, change 6 to 26.
\[ \bar{x} = \frac{1}{8}(8 + 9 + 3 + 5 + 2 + 26 + 4 + 5) = 7.75 \]
The mean can be greatly influenced by outliers.
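The effect of the single changed value can be checked directly. This sketch uses `statistics.mean` from the Python standard library; the variable names are illustrative assumptions.

```python
from statistics import mean

# Accident counts from the example.
accidents = [8, 9, 3, 5, 2, 6, 4, 5]

# Same data with the value 6 replaced by the outlier 26.
with_outlier = [26 if x == 6 else x for x in accidents]

print(mean(accidents))      # 5.25
print(mean(with_outlier))   # 7.75
```

One value out of eight moves the mean by 2.5, which is the sense in which the mean is sensitive to outliers.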
Median: The value of the data that occupies the middle
position when the data are ranked in order according to size.
Note:
1. Denoted by “x tilde”: $\tilde{x}$
2. The population median, M (uppercase mu, Greek
alphabet), is the data value in the middle position of the
entire population.
To find the median:
1. Rank the data.
2. Determine the depth of the median: $d(\tilde{x}) = \frac{n + 1}{2}$
3. Determine the value of the median.
Example: Find the median for the set of data
{4, 8, 3, 8, 2, 9, 2, 11, 3}.
Solution:
1. Rank the data: 2, 2, 3, 3, 4, 8, 8, 9, 11
2. Find the depth: $d(\tilde{x}) = (9 + 1)/2 = 5$
3. The median is the fifth number from either end in the
ranked data: $\tilde{x} = 4$
Suppose the data set is {4, 8, 3, 8, 2, 9, 2, 11, 3, 15}.
1. Rank the data: 2, 2, 3, 3, 4, 8, 8, 9, 11, 15
2. Find the depth: $d(\tilde{x}) = (10 + 1)/2 = 5.5$
3. The median is halfway between the fifth and sixth
observations: $\tilde{x} = (4 + 8)/2 = 6$
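The rank, depth, value procedure can be written as a short sketch. The function name `median_by_depth` is an assumption for illustration; the logic follows the three steps in the examples.

```python
def median_by_depth(values):
    """Find the median by ranking the data and locating the depth (n + 1) / 2."""
    ranked = sorted(values)                 # step 1: rank the data
    depth = (len(ranked) + 1) / 2           # step 2: depth of the median
    lower = ranked[int(depth) - 1]          # value at position floor(depth), counting from 1
    if depth == int(depth):
        return lower                        # odd n: the depth is a whole position
    upper = ranked[int(depth)]              # even n: the depth ends in .5
    return (lower + upper) / 2              # step 3: halfway between the two middle values

print(median_by_depth([4, 8, 3, 8, 2, 9, 2, 11, 3]))       # 4
print(median_by_depth([4, 8, 3, 8, 2, 9, 2, 11, 3, 15]))   # 6.0
```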
Mode: The mode is the value of x that occurs most frequently.
Note: If two or more values in a sample are tied for the
highest frequency (number of occurrences), there is no mode.
Midrange: The number exactly midway between the lowest
data value L and the highest data value H. It is found by
averaging the low and the high values.
\[ \text{midrange} = \frac{L + H}{2} \]
Example: Consider the data set {12.7, 27.1, 35.6, 44.2, 18.0}.
The midrange is
\[ \text{midrange} = \frac{L + H}{2} = \frac{12.7 + 44.2}{2} = 28.45 \]
Note:
1. When rounding off an answer, a common rule-of-thumb is
to keep one more decimal place in the answer than was
present in the original data.
2. To avoid round-off buildup, round off only the final
answer, not intermediate steps.
2.4: Measures of Dispersion
• Measures of central tendency alone cannot
completely characterize a set of data. Two
very different data sets may have similar
measures of central tendency.
• Measures of dispersion are used to describe
the spread, or variability, of a distribution.
• Common measures of dispersion: range,
variance, and standard deviation.
Range: The difference in value between the highest-valued
(H) and the lowest-valued (L) pieces of data:
range H  L
Other measures of dispersion are based on the following
quantity.
Deviation from the Mean: A deviation from the mean,
$x - \bar{x}$, is the difference between the value of x and the mean $\bar{x}$.
Example: Consider the sample {12, 23, 17, 15, 18}.
Find the range and each deviation from the mean.
Solution:
\[ \bar{x} = \frac{1}{5}(12 + 23 + 17 + 15 + 18) = 17 \]
Data, x    Deviation, $x - \bar{x}$
-----------------------------------
12         -5
23         6
17         0
15         -2
18         1
\[ \text{range} = H - L = 23 - 12 = 11 \]
Note: $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$ (Always!)
Mean Absolute Deviation: The mean of the absolute values
of the deviations from the mean:
\[ \text{Mean absolute deviation} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| \]
For the previous example:
\[ \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| = \frac{1}{5}(5 + 6 + 0 + 2 + 1) = \frac{14}{5} = 2.8 \]
Sample Variance: The sample variance, $s^2$, is the mean of the
squared deviations, calculated using $n - 1$ as the divisor.
\[ s^2 = \frac{1}{n - 1} \sum (x - \bar{x})^2 \]
where n is the sample size.
Note: The numerator for the sample variance is called the sum
of squares for x, denoted SS(x).
\[ s^2 = \frac{SS(x)}{n - 1} \]
where
\[ SS(x) = \sum (x - \bar{x})^2 = \sum x^2 - \frac{1}{n} \left( \sum x \right)^2 \]
Standard Deviation: The standard deviation of a sample, s,
is the positive square root of the variance:
\[ s = \sqrt{s^2} \]
Example: Find the variance and standard deviation for the
data {5, 7, 1, 3, 8}.
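\[ \bar{x} = \frac{1}{5}(5 + 7 + 1 + 3 + 8) = 4.8 \]
       x      x^2     x - \bar{x}    (x - \bar{x})^2
-----------------------------------------------------
       5      25      0.2            0.04
       7      49      2.2            4.84
       1      1       -3.8           14.44
       3      9       -1.8           3.24
       8      64      3.2            10.24
Sum    24     148     0              32.80
\[ s^2 = \frac{1}{4}(32.8) = 8.2 \]
\[ s = \sqrt{8.2} = 2.86 \]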
Note:
1. The shortcut formula for the sample variance:
\[ s^2 = \frac{\sum x^2 - \frac{\left( \sum x \right)^2}{n}}{n - 1} \]
2. The unit of measure for the standard deviation is the same
as the unit of measure for the data.
The unit of measure for the variance might then be thought
of as units squared.
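The definition formula and the shortcut formula give the same sum of squares. This sketch checks that for the data {5, 7, 1, 3, 8}; the variable names are illustrative assumptions.

```python
from math import sqrt

data = [5, 7, 1, 3, 8]
n = len(data)
x_bar = sum(data) / n                                          # 4.8

# Sum of squares from the definition: sum of squared deviations.
ss_definition = sum((x - x_bar) ** 2 for x in data)

# Sum of squares from the shortcut: sum of x^2 minus (sum of x)^2 / n.
ss_shortcut = sum(x * x for x in data) - sum(data) ** 2 / n    # 148 - 576/5

variance = ss_shortcut / (n - 1)
std_dev = sqrt(variance)

print(round(ss_definition, 2), round(ss_shortcut, 2))   # both 32.8
print(round(variance, 2), round(std_dev, 2))            # 8.2 and 2.86
```

The shortcut needs only the two column totals (24 and 148), so it avoids computing each deviation separately.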
2.5: Mean and Standard
Deviation of Frequency
Distribution
• If the data is given in the form of a
frequency distribution, we need to make a
few changes to the formulas for the mean,
variance, and standard deviation.
• Complete the extension table in order to
find these summary statistics.
In order to calculate the mean, variance, and standard
deviation for data:
1. In an ungrouped frequency distribution, use the frequency
of occurrence, f, of each observation.
2. In a grouped frequency distribution, we use the frequency
of occurrence associated with each class mark.
\[ \bar{x} = \frac{\sum xf}{\sum f} \]
\[ s^2 = \frac{\sum x^2 f - \frac{\left( \sum xf \right)^2}{\sum f}}{\sum f - 1} \]
Example: A survey of students in the first grade at a local
school asked for the number of brothers and/or sisters for
each child. The results are summarized in the table below.
Find the mean, variance, and standard deviation.
       x      f      xf     x^2 f
---------------------------------
       0      15     0      0
       1      17     17     17
       2      23     46     92
       4      5      20     80
       5      2      10     50
Sum           62     93     239
\[ \bar{x} = 93/62 = 1.5 \]
\[ s^2 = \frac{239 - \frac{(93)^2}{62}}{62 - 1} = 1.63 \]
\[ s = \sqrt{1.63} = 1.28 \]
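The extension table and the resulting statistics can be reproduced with a short sketch. The dictionary `siblings` (value mapped to frequency) and the other names are assumptions for illustration; the counts come from the survey table.

```python
from math import sqrt

# Number of brothers and/or sisters (x) mapped to its frequency (f).
siblings = {0: 15, 1: 17, 2: 23, 4: 5, 5: 2}

sum_f = sum(siblings.values())                              # 62
sum_xf = sum(x * f for x, f in siblings.items())            # 93
sum_x2f = sum(x * x * f for x, f in siblings.items())       # 239

mean = sum_xf / sum_f
variance = (sum_x2f - sum_xf ** 2 / sum_f) / (sum_f - 1)
std_dev = sqrt(variance)

print(mean, round(variance, 2), round(std_dev, 2))   # 1.5 1.63 1.28
```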
2.6: Measures of Position
• Measures of position are used to describe
the relative location of an observation.
• Quartiles and percentiles are two of the
most popular measures of position.
• An additional measure of central tendency,
the midquartile, is defined using quartiles.
• Quartiles are part of the 5-number summary.
Quartiles: Values of the variable that divide the ranked data
into quarters; each set of data has three quartiles.
1. The first quartile, Q1, is a number such that at most 25% of
the data are smaller in value than Q1 and at most 75% are
larger.
2. The second quartile is the median.
3. The third quartile, Q3, is a number such that at most 75%
of the data are smaller in value than Q3 and at most 25% are
larger.
Ranked data, increasing order:
L --- 25% --- Q1 --- 25% --- Q2 --- 25% --- Q3 --- 25% --- H
Percentiles: Values of the variable that divide a set of ranked
data into 100 equal subsets; each set of data has 99
percentiles. The kth percentile, Pk, is a value such that at
most k% of the data is smaller in value than Pk and at most
(100 - k)% of the data is larger.
L --- at most k% --- Pk --- at most (100 - k)% --- H
Note:
1. The 1st quartile and the 25th percentile are the same:
Q1 = P25.
2. The median, the 2nd quartile, and the 50th percentile are
all the same: $\tilde{x} = Q_2 = P_{50}$
Procedure for finding Pk (and quartiles):
1. Rank the n observations, lowest to highest.
2. Compute A = (nk)/100.
3. If A is an integer:
d(Pk) = A + 0.5 (depth)
Pk is halfway between the value of the data in the Ath
position and the value of the data in the next position.
If A is a fraction:
d(Pk) = B, the next larger integer.
Pk is the value of the data in the Bth position.
Example: The following data represents the pH levels of a
random sample of swimming pools in a California town.
5.6   6.0   6.7   7.0
5.6   6.1   6.8   7.3
5.8   6.2   6.8   7.4
5.9   6.3   6.8   7.4
6.0   6.4   6.9   7.5
Find the first and third quartile, and the 35th percentile.
k = 25: (20)(25)/100 = 5,  depth = 5.5,  Q1 = 6
k = 75: (20)(75)/100 = 15, depth = 15.5, Q3 = 6.95
k = 35: (20)(35)/100 = 7,  depth = 7.5,  P35 = 6.15
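The procedure for finding Pk can be sketched as a small function. The name `percentile` is an assumption for illustration, and the sketch assumes k is a whole number from 1 to 99. It follows the chapter's rule, which can differ from the interpolation rules used by statistical software.

```python
def percentile(values, k):
    """Return the kth percentile using the A = nk/100 depth rule (k a whole number, 1 to 99)."""
    ranked = sorted(values)                   # step 1: rank the observations
    n = len(ranked)
    a, remainder = divmod(n * k, 100)         # step 2: A = nk/100, split into whole and fractional parts
    if remainder == 0:
        # A is an integer: halfway between positions A and A + 1 (counting from 1).
        return (ranked[a - 1] + ranked[a]) / 2
    # A is a fraction: take position B, the next larger integer (index a, counting from 0).
    return ranked[a]

# pH levels from the swimming pool example.
ph = [5.6, 6.0, 6.7, 7.0, 5.6, 6.1, 6.8, 7.3, 5.8, 6.2,
      6.8, 7.4, 5.9, 6.3, 6.8, 7.4, 6.0, 6.4, 6.9, 7.5]

for k in (25, 75, 35):
    print(k, round(percentile(ph, k), 2))
```

Using `divmod` on whole numbers avoids testing a floating-point value of A for "is it an integer", which can fail for products such as 20 × 35 / 100.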
Midquartile: The numerical value midway between the first
and the third quartile.
\[ \text{midquartile} = \frac{Q_1 + Q_3}{2} \]
Example: Find the midquartile for the 20 pH values in the
previous example:
\[ \text{midquartile} = \frac{Q_1 + Q_3}{2} = \frac{6 + 6.95}{2} = \frac{12.95}{2} = 6.475 \]
Note: The mean, median, midrange, and midquartile are all
measures of central tendency. They are not necessarily equal.
Can you think of an example when they would be the same
value?
5-Number Summary: The 5-number summary is composed
of:
1. L, the smallest value in the data set.
2. Q1, the first quartile (also P25).
3. $\tilde{x}$, the median.
4. Q3, the third quartile (also P75).
5. H, the largest value in the data set.
Note:
1. The 5-number summary indicates how much the data is
spread out in each quarter.
2. The interquartile range is the difference between the first
and third quartiles. It is the range of the middle 50% of the
data.
Box-and-Whisker Display: A graphic representation of the
5-number summary.
• The five numerical values (smallest, first quartile, median,
third quartile, and largest) are located on a scale, either
vertical or horizontal.
• The box is used to depict the middle half of the data that
lies between the two quartiles.
• The whiskers are line segments used to depict the other
half of the data.
• One line segment represents the quarter of the data that is
smaller in value than the first quartile.
• The second line segment represents the quarter of the data
that is larger in value than the third quartile.
Example: A random sample of students in a sixth grade class
was selected. Their weights are given in the table below.
Find the 5-number summary for this data and construct a
boxplot.
63    86    93    101
64    88    93    108
76    89    94    109
76    90    97    112
81    91    99
83    92    99
85    93    99
5-number summary:
63    85    92             99    112
L     Q1    $\tilde{x}$    Q3    H
Boxplot for weight data:
[Figure: boxplot, "Weights from Sixth Grade Class"; horizontal scale: Weight, 60 to 110; marked points L, Q1, $\tilde{x}$, Q3, H]
z-Score: The position a particular value of x has relative to the
mean, measured in standard deviations. The z-score is found
by the formula
\[ z = \frac{\text{value} - \text{mean}}{\text{st. dev.}} = \frac{x - \bar{x}}{s} \]
Note:
1. Typically, the calculated value of z is rounded to the
nearest hundredth.
2. The z-score measures the number of standard deviations
above/below, or away from, the mean.
3. z-scores typically range from -3.00 to +3.00.
4. z-scores may be used to make comparisons of raw scores.
Example: A certain data set has mean 35.6 and standard
deviation 7.1. Find the z-scores for 46 and 33.
Solution:
\[ z = \frac{x - \bar{x}}{s} = \frac{46 - 35.6}{7.1} = 1.46 \]
46 is 1.46 standard deviations above the mean.
\[ z = \frac{x - \bar{x}}{s} = \frac{33 - 35.6}{7.1} = -0.37 \]
33 is 0.37 standard deviations below the mean.
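The two z-scores can be verified with a one-line function. The name `z_score` and its parameter names are assumptions for illustration.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

# Data set with mean 35.6 and standard deviation 7.1.
print(round(z_score(46, 35.6, 7.1), 2))   # 1.46
print(round(z_score(33, 35.6, 7.1), 2))   # -0.37
```

The sign carries the direction, so a negative z-score is read as "below the mean" with the magnitude stated as a positive number of standard deviations.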
2.7: Interpreting and
Understanding Standard
Deviation
• Standard deviation is a measure of
variability, or spread.
• Two rules for describing data rely on the
standard deviation.
• Chebyshev’s theorem: applies to any
distribution.
• Empirical rule: applies to a variable that is
normally distributed.
Chebyshev’s Theorem: The proportion of any distribution
that lies within k standard deviations of the mean is at least
$1 - (1/k^2)$, where k is any positive number larger than 1. This
theorem applies to all distributions of data.
Illustration:
[Figure: a distribution with at least $1 - \frac{1}{k^2}$ of the data lying between $\bar{x} - ks$ and $\bar{x} + ks$]
Note:
1. Chebyshev’s theorem is very conservative. It holds for
any distribution of data.
2. Chebyshev’s theorem also applies to any population.
3. The two most common values used to describe a
distribution of data are k = 2, 3.
4. The table below lists some values for k and $1 - (1/k^2)$.
k       $1 - (1/k^2)$
---------------------
1.7     0.65
2       0.75
2.5     0.84
3       0.89
Example: At the close of trading, a random sample of 35
technology stocks was selected. The mean selling price was
67.75 and the standard deviation was 12.3. Use Chebyshev’s
theorem (with k = 2, 3) to describe the distribution.
Solution:
At least 75% of the observations lie within 2 standard
deviations of the mean:
\[ (\bar{x} - 2s, \bar{x} + 2s) = (67.75 - 2(12.3), 67.75 + 2(12.3)) = (43.15, 92.35) \]
At least 89% of the observations lie within 3 standard
deviations of the mean:
\[ (\bar{x} - 3s, \bar{x} + 3s) = (67.75 - 3(12.3), 67.75 + 3(12.3)) = (30.85, 104.65) \]
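The bound and the intervals in this example can be computed with a brief sketch. The function names `chebyshev_bound` and `interval` are assumptions for illustration.

```python
def chebyshev_bound(k):
    """Minimum proportion of any distribution within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2

def interval(mean, std_dev, k):
    """Endpoints of the interval (mean - k*s, mean + k*s)."""
    return (mean - k * std_dev, mean + k * std_dev)

# Technology stock example: mean 67.75, standard deviation 12.3.
for k in (2, 3):
    low, high = interval(67.75, 12.3, k)
    print(f"k = {k}: at least {chebyshev_bound(k):.0%} within ({low:.2f}, {high:.2f})")
```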
Empirical Rule: If a variable is normally distributed:
1. Approximately 68% of the observations lie within 1
standard deviation of the mean.
2. Approximately 95% of the observations lie within 2
standard deviations of the mean.
3. Approximately 99.7% of the observations lie within 3
standard deviations of the mean.
Note:
1. The empirical rule is more accurate than Chebyshev’s
theorem since we know more about the distribution
(normally distributed).
2. Also applies to populations.
3. Can be used to determine if a distribution is normally
distributed.
Illustration of the empirical rule:
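[Figure: normal curve with the horizontal scale marked at $\bar{x} - 3s$, $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$, $\bar{x} + 3s$; 68% of the area lies within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3]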
Example: A random sample of plum tomatoes was selected
from a local grocery store and their weights recorded. The
mean weight was 6.5 ounces with a standard deviation of .4
ounces. If the weights are normally distributed:
1. What percentage of weights fall between 5.7 and 7.3?
2. What percentage of weights fall above 7.7?
Solution:
\[ (\bar{x} - 2s, \bar{x} + 2s) = (6.5 - 2(0.4), 6.5 + 2(0.4)) = (5.7, 7.3) \]
Approximately 95% of the weights fall between 5.7 and 7.3.
\[ (\bar{x} - 3s, \bar{x} + 3s) = (6.5 - 3(0.4), 6.5 + 3(0.4)) = (5.3, 7.7) \]
Approximately 99.7% of the weights fall between 5.3 and 7.7.
Approximately 0.3% of the weights fall outside (5.3, 7.7).
Approximately (0.3/2)% = 0.15% of the weights fall above 7.7.
Note: The empirical rule may be used to determine whether or
not a set of data is approximately normally distributed.
1. Find the mean and standard deviation for the data.
2. Compute the actual proportion of data within 1, 2, and 3
standard deviations from the mean.
3. Compare these actual proportions with those given by the
empirical rule.
4. If the proportions found are reasonably close to those of the
empirical rule, then the data is approximately normally
distributed.
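The four steps above can be sketched as a function that reports the actual proportion of data within 1, 2, and 3 standard deviations of the mean. The function name `proportions_within` is an assumption for illustration, and the sixth-grade weights from Section 2.6 are reused here only as sample input. Deciding whether the proportions are "reasonably close" to 68%, 95%, and 99.7% remains a judgment call.

```python
from statistics import mean, stdev

def proportions_within(values, ks=(1, 2, 3)):
    """Proportion of values within k sample standard deviations of the mean, for each k."""
    x_bar = mean(values)
    s = stdev(values)              # sample standard deviation (divisor n - 1)
    n = len(values)
    return {k: sum(1 for x in values if abs(x - x_bar) <= k * s) / n
            for k in ks}

# Sample input: the 25 sixth-grade weights from Section 2.6.
weights = [63, 64, 76, 76, 81, 83, 85, 86, 88, 89, 90, 91, 92,
           93, 93, 93, 94, 97, 99, 99, 99, 101, 108, 109, 112]

for k, proportion in proportions_within(weights).items():
    print(f"within {k} standard deviation(s): {proportion:.2f}")
```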
Note:
1. Graphic method to test for normality: Draw a relative
frequency ogive of grouped data on probability paper.
a. Draw a straight line from the lower-left corner to the
upper-right corner of the graph connecting the next-to-end points of the ogive.
b. If the ogive lies close to this straight line, the
distribution is said to be approximately normal.
2. The ogive may be used to find percentiles.
a. Draw a horizontal line through the graph at k% on the vertical scale.
b. At the point where the line intersects the ogive, draw a
vertical line to the bottom of the graph.
c. Read the value of x from the horizontal scale.
d. This value of x is the kth percentile.
2.8: The Art of Statistical
Deception
• Good arithmetic, bad statistics
• Misleading graphs
• Insufficient information
Good Arithmetic, Bad Statistics:
The mean can be greatly influenced by outliers.
Example: The mean salary for all NBA players is $15.5
million.
Misleading graphs:
1. The frequency scale should start at zero to present a
complete picture. Graphs that do not start at zero are used
to save space.
2. Graphs that start at zero emphasize the size of the numbers
involved.
3. Graphs that are chopped off emphasize variation.
This graph presents the total picture.
[Figure: graph of Sum of Delays by Year (1990 to 1996); vertical scale starts at zero and runs from 0 to 35]
This graph emphasizes the variation.
[Figure: graph of Sum of Delays by Year (1990 to 1996); vertical scale is chopped off and runs from 27 to 35]
Insufficient Information:
Example: An admissions officer from a state school explains
that the average tuition at a nearby private university is
$13,000 and only $4500 at his school. This makes the state
school look more attractive.
If most students pay the full tuition, then the state school
appears to be a better choice.
However, if most students at the private university receive
substantial financial aid, then the actual tuition cost could be
considerably lower!