Chapter 1: Statistics
Chapter 2 ~ Descriptive Analysis &
Presentation of Single-Variable Data
Black Bears
[Figure: histogram of black bear lengths; horizontal axis: Length in Inches (30 to 80); vertical axis: Frequency (0 to 20)]
Mean: 60.07 inches
Median: 62.50 inches
Range: 42 inches
Variance: 117.681
Standard deviation: 10.85 inches
Minimum: 36 inches
Maximum: 78 inches
First quartile: 51.63 inches
Third quartile: 67.38 inches
Count: 58 bears
Sum: 3438.1 inches
1
Chapter Goals
• Learn how to present and describe sets of data
• Learn measures of central tendency, measures of
dispersion (spread), measures of position, and types of
distributions
• Learn how to interpret findings so that we know what
the data is telling us about the sampled population
2
2.1 ~ Graphic Presentation of Data
• Use initial exploratory data-analysis techniques to
produce a pictorial representation of the data
• Resulting displays reveal patterns of behavior of the
variable being studied
• The method used is determined by the type of data and
the idea to be presented
• No single correct answer when constructing a graphic
display
3
Circle Graphs & Bar Graphs
Circle Graphs and Bar Graphs: Graphs that are used to
summarize attribute data
• Circle graphs (pie diagrams) show the amount of
data that belongs to each category as a proportional
part of a circle
• Bar graphs show the amount of data that belongs to
each category as proportionally sized rectangular
areas
4
Example
Example: The table below lists the number of automobiles
sold last week by day for a local dealership.
Describe the data using a circle graph and a bar
graph:
Day          Number Sold
Monday       15
Tuesday      23
Wednesday    35
Thursday     11
Friday       12
Saturday     42
5
Circle Graph Solution
[Figure: circle graph titled “Automobiles Sold Last Week”]
6
Bar Graph Solution
[Figure: bar graph titled “Automobiles Sold Last Week”; vertical axis: Frequency]
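The proportions behind a circle graph can be worked out directly from the sales table. The sketch below is only illustrative: the variable names are assumptions, and it computes each day's share of the total and the matching slice angle rather than drawing the graph.

```python
# Automobiles sold last week, taken from the example table.
sales = {
    "Monday": 15, "Tuesday": 23, "Wednesday": 35,
    "Thursday": 11, "Friday": 12, "Saturday": 42,
}

total = sum(sales.values())

# Each category's share of the whole, and the slice angle in degrees.
shares = {day: count / total for day, count in sales.items()}
angles = {day: 360 * share for day, share in shares.items()}

for day in sales:
    print(f"{day:<10} {sales[day]:>3}  {100 * shares[day]:5.1f}%  {angles[day]:6.1f} deg")
```

The same counts would serve as the bar heights in the bar graph.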
7
Pareto Diagram
• Pareto Diagram: A bar graph with the bars arranged
from the most numerous category to the least numerous
category. It includes a line graph displaying the
cumulative percentages and counts for the bars.
Notes:
Used to identify the number and type of defects that
happen within a product or service
Separates the “vital few” from the “trivial many”
The Pareto diagram is often used in quality control
applications
8
Example
Example: The final daily inspection defect report for a cabinet
manufacturer is given in the table below:
Defect     Number
Dent       5
Stain      12
Blemish    43
Chip       25
Scratch    40
Others     10
1) Construct a Pareto diagram for this defect report. Management has
given the cabinet production line the goal of reducing their defects by
50%.
2) What two defects should they give special attention to in working
toward this goal?
9
Solutions
1) Daily Defect Inspection Report
[Figure: Pareto diagram; bars show Count (scale 0 to 140), line shows cumulative Percent (scale 0 to 100)]
Defect     Count   Percent   Cum%
Blemish    43      31.9      31.9
Scratch    40      29.6      61.5
Chip       25      18.5      80.0
Stain      12      8.9       88.9
Others     10      7.4       96.3
Dent       5       3.7       100.0
2) The production line should try to eliminate blemishes and
scratches. This would cut defects by more than 50%.
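The numbers under a Pareto diagram can be rebuilt from the defect report by sorting the categories and accumulating percentages. This is a minimal sketch with assumed variable names; it produces the table only, not the chart.

```python
# Defect counts from the daily inspection report.
defects = {"Dent": 5, "Stain": 12, "Blemish": 43, "Chip": 25, "Scratch": 40, "Others": 10}

total = sum(defects.values())

# Order categories from most to least numerous, as a Pareto diagram requires.
ranked = sorted(defects.items(), key=lambda item: item[1], reverse=True)

rows = []
running = 0
for name, count in ranked:
    running += count
    rows.append((name, count, 100 * count / total, 100 * running / total))

for name, count, pct, cum_pct in rows:
    print(f"{name:<8} {count:>3}  {pct:5.1f}  {cum_pct:5.1f}")
```

The cumulative column shows that the two largest categories together exceed half of all defects.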
10
Key Definitions
Quantitative Data: One reason for constructing a graph of
quantitative data is to examine the distribution - is the data compact,
spread out, skewed, symmetric, etc.
Distribution: The pattern of variability displayed by the data of a
variable. The distribution displays the frequency of each value of
the variable.
Dotplot Display: Displays the data of a sample by representing each
piece of data with a dot positioned along a scale. This scale can be
either horizontal or vertical. The frequency of the values is
represented along the other scale.
11
Example
Example: A random sample of the lifetime (in years) of 50
home washing machines is given below:
2.5
16.9
4.5
0.9
1.5
17.8
8.5
8.9
2.5
6.4
14.5
0.7
7.3
1.4
12.2
3.5
2.9
4.0
3.7
6.8
7.4
4.1
0.4
3.3
0.9
4.2
3.3
4.7
18.1
2.6
4.4
7.2
6.9
7.0
0.7
1.6
2.2
9.2
5.2
15.3
4.0
10.4
12.2
4.0
4.1
1.8
21.8
18.3
3.6
The figure below is a dotplot for the 50 lifetimes:
[Figure: dotplot of the washing machine lifetimes; horizontal axis from 0.0 to 20.0 years]
Note: Notice how the data is “bunched” near the lower extreme and more
“spread out” near the higher extreme
12
Stem & Leaf Display
• Background:
– The stem-and-leaf display has become very popular for
summarizing numerical data
– It is a combination of graphing and sorting
– The actual data is part of the graph
– Well-suited for computers
Stem-and-Leaf Display: Pictures the data of a sample using the
actual digits that make up the data values. Each numerical value is
divided into two parts: The leading digit(s) becomes the stem, and the
trailing digit(s) becomes the leaf. The stems are located along the
main axis, and a leaf for each piece of data is located so as to display
the distribution of the data.
13
Example
Example: A city police officer, using radar, checked the speed
of cars as they were traveling down the main street in
town. Construct a stem-and-leaf plot for this data:
41 31 33 35 36 37 39 49
33 19 26 27 24 32 40
39 16 55 38 36
Solution:
All the speeds are in the 10s, 20s, 30s, 40s, and 50s. Use the first
digit of each speed as the stem and the second digit as the leaf.
Draw a vertical line and list the stems, in order to the left of the line.
Place each leaf on its stem: place the trailing digit on the right side
of the vertical line opposite its corresponding leading digit.
14
Example
20 Speeds
---------------------------------------
1 | 6 9
2 | 4 6 7
3 | 1 2 3 3 5 6 6 7 8 9 9
4 | 0 1 9
5 | 5
---------------------------------------
• The speeds are centered around the 30s
Note: The display could be constructed so that only five possible
values (instead of ten) could fall in each stem. What would
the stems look like? Would there be a difference in
appearance?
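The construction described in the solution can be sketched in a few lines. The function name below is an assumption made for illustration; it splits each two-digit speed into a tens digit (stem) and a units digit (leaf).

```python
from collections import defaultdict

# The 20 radar speeds from the example.
speeds = [41, 31, 33, 35, 36, 37, 39, 49,
          33, 19, 26, 27, 24, 32, 40,
          39, 16, 55, 38, 36]

def stem_and_leaf(values):
    """Group sorted two-digit values by leading digit; leaves are trailing digits."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    return dict(stems)

display = stem_and_leaf(speeds)
for stem in sorted(display):
    leaves = " ".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves}")
```

Splitting each stem into two rows (leaves 0 to 4 and 5 to 9) would give the five-values-per-stem variant mentioned in the note.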
15
Remember!
1. It is fairly typical of many variables to display a distribution
that is concentrated (mounded) about a central value and
then in some manner be dispersed in both directions.
(Why?)
2. A display that indicates two “mounds” may really be two
overlapping distributions
3. A back-to-back stem-and-leaf display makes it possible to
compare two distributions graphically
4. A side-by-side dotplot is also useful for comparing two
distributions
16
2.2 ~ Frequency Distributions & Histograms
• Stem-and-leaf plots often present adequate
summaries, but they can get very big, very fast
• Need other techniques for summarizing data
• Frequency distributions and histograms are used to
summarize large data sets
17
Frequency Distributions
Frequency Distribution: A listing, often expressed in chart form,
that pairs each value of a variable with its frequency
Ungrouped Frequency Distribution: Each value of x in the
distribution stands alone
Grouped Frequency Distribution: Group the values into a set of
classes
1. A table that summarizes data by classes, or class intervals
2. In a typical grouped frequency distribution, there are usually 5-12 classes of
equal width
3. The table may contain columns for class number, class interval, tally (if
constructing by hand), frequency, relative frequency, cumulative relative
frequency, and class midpoint
4. In an ungrouped frequency distribution each class consists of a single value
18
Frequency Distribution
Guidelines for constructing a frequency distribution:
1. All classes should be of the same width
2. Classes should be set up so that they do not overlap and so that
each piece of data belongs to exactly one class
3. For problems in the text, 5-12 classes are most desirable. The
square root of n is a reasonable guideline for the number of
classes if n is less than 150.
4. Use a system that takes advantage of a number pattern, to
guarantee accuracy
5. If possible, an even class width is often advantageous
19
Frequency Distributions
Procedure for constructing a frequency distribution:
1. Identify the high (H) and low (L) scores. Find the range.
Range = H - L
2. Select a number of classes and a class width so that the
product is a bit larger than the range
3. Pick a starting point a little smaller than L. Count from the
starting point by the width to obtain the class boundaries.
Observations that fall on class boundaries are placed into the
class interval to the right.
20
Example
Example: The hemoglobin test, a blood test given to diabetics
during their periodic checkups, indicates the level of
control of blood sugar during the past two to three
months. The data in the table below was obtained for
40 different diabetics at a university clinic that treats
diabetic patients:
6.5
6.4
5.0
7.9
5.0
6.0
8.0
6.0
5.6
5.6
6.5
5.6
7.6
6.0
6.1
6.0
4.8
5.7
6.4
6.2
8.0
9.2
6.6
7.7
7.5
8.1
7.2
6.7
7.9
8.0
5.9
7.7
8.0
6.5
4.0
8.2
9.2
6.6
5.7
9.0
1) Construct a grouped frequency distribution using the classes
3.7 - <4.7, 4.7 - <5.7, 5.7 - <6.7, etc.
2) Which class has the highest frequency?
21
Solutions
1)
Class          Frequency   Relative    Cumulative       Class
Boundaries     f           Frequency   Rel. Frequency   Midpoint, x
----------------------------------------------------------------------
3.7 - <4.7     1           0.025       0.025            4.2
4.7 - <5.7     6           0.150       0.175            5.2
5.7 - <6.7     16          0.400       0.575            6.2
6.7 - <7.7     4           0.100       0.675            7.2
7.7 - <8.7     10          0.250       0.925            8.2
8.7 - <9.7     3           0.075       1.000            9.2
2) The class 5.7 - <6.7 has the highest frequency. The frequency
is 16 and the relative frequency is 0.40
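The tally behind the grouped frequency distribution can be checked with a short script. This is a sketch under one stated assumption: values are converted to whole tenths before classing, so that boundary values such as 5.7 land in the class to the right without floating-point surprises. The function and parameter names are hypothetical.

```python
# The 40 hemoglobin test results from the example.
hemoglobin = [6.5, 6.4, 5.0, 7.9, 5.0, 6.0, 8.0, 6.0, 5.6, 5.6,
              6.5, 5.6, 7.6, 6.0, 6.1, 6.0, 4.8, 5.7, 6.4, 6.2,
              8.0, 9.2, 6.6, 7.7, 7.5, 8.1, 7.2, 6.7, 7.9, 8.0,
              5.9, 7.7, 8.0, 6.5, 4.0, 8.2, 9.2, 6.6, 5.7, 9.0]

def grouped_frequencies(values, start_tenths=37, width_tenths=10, classes=6):
    """Count values per class; classes are [start, start + width), measured in tenths."""
    counts = [0] * classes
    for value in values:
        index = (round(value * 10) - start_tenths) // width_tenths
        counts[index] += 1
    return counts

counts = grouped_frequencies(hemoglobin)
n = len(hemoglobin)

cumulative = 0.0
for i, f in enumerate(counts):
    lower = 3.7 + i          # class width is 1.0
    cumulative += f / n
    print(f"{lower:.1f} - <{lower + 1:.1f}  f={f:<3} rel={f / n:.3f}  cum={cumulative:.3f}")
```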
22
Histogram
Histogram: A bar graph representing a frequency distribution of a
quantitative variable. A histogram is made up of the following
components:
1. A title, which identifies the population of interest
2. A vertical scale, which identifies the frequencies in the various
classes
3. A horizontal scale, which identifies the variable x. Values for the
class boundaries or class midpoints may be labeled along the x-axis.
Use whichever method of labeling the axis best presents the variable.
Notes:
The relative frequency is sometimes used on the vertical scale
It is possible to create a histogram based on class midpoints
23
Example
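Example: Construct a histogram for the blood test results given
in the previous example
Solution:
[Figure: histogram titled “The Hemoglobin Test”; horizontal axis: Blood Test, labeled at the class midpoints 4.2, 5.2, 6.2, 7.2, 8.2, 9.2; vertical axis: Frequency (0 to 15)]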
24
Example
Example: A recent survey of Roman Catholic nuns summarized
their ages in the table below. Construct a histogram for
this age data:
Age            Frequency   Class Midpoint
------------------------------------------
20 up to 30    34          25
30 up to 40    58          35
40 up to 50    76          45
50 up to 60    187         55
60 up to 70    254         65
70 up to 80    241         75
80 up to 90    147         85
25
Solution
[Figure: histogram titled “Roman Catholic Nuns”; horizontal axis: Age, labeled at the class midpoints 25 to 85; vertical axis: Frequency]
26
Terms Used to Describe Histograms
Symmetrical: Both sides of the distribution are identical mirror
images. There is a line of symmetry.
Uniform (Rectangular): Every value appears with equal frequency
Skewed: One tail is stretched out longer than the other. The
direction of skewness is on the side of the longer tail. (Positively
skewed vs. negatively skewed)
J-Shaped: There is no tail on the side of the class with the highest
frequency
Bimodal: The two largest classes are separated by one or more
classes. Often implies two populations are sampled.
Normal: A symmetrical distribution that is mounded about the mean and
becomes sparse at the extremes
27
Important Reminders
The mode is the value that occurs with greatest frequency
(discussed in Section 2.3)
The modal class is the class with the greatest frequency
A bimodal distribution has two high-frequency classes
separated by classes with lower frequencies
Graphical representations of data should include a
descriptive, meaningful title and proper identification of
the vertical and horizontal scales
28
Cumulative Frequency Distribution
Cumulative Frequency Distribution: A frequency
distribution that pairs cumulative frequencies with
values of the variable
• The cumulative frequency for any given class is the sum
of the frequency for that class and the frequencies of all
classes of smaller values
• The cumulative relative frequency for any given class is
the sum of the relative frequency for that class and the
relative frequencies of all classes of smaller values
29
Example
Example: A computer science aptitude test was given to 50
students. The table below summarizes the data:
Class                      Relative    Cumulative   Cumulative
Boundaries     Frequency   Frequency   Frequency    Rel. Frequency
-------------------------------------------------------------------
0 up to 4      4           0.08        4            0.08
4 up to 8      8           0.16        12           0.24
8 up to 12     8           0.16        20           0.40
12 up to 16    20          0.40        40           0.80
16 up to 20    6           0.12        46           0.92
20 up to 24    3           0.06        49           0.98
24 up to 28    1           0.02        50           1.00
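The cumulative columns follow mechanically from the class frequencies. A minimal sketch, with variable names chosen only for illustration:

```python
from itertools import accumulate

# Class frequencies for the aptitude test, in order of increasing class.
frequencies = [4, 8, 8, 20, 6, 3, 1]
n = sum(frequencies)

# Running totals give the cumulative frequencies; dividing by n gives
# the cumulative relative frequencies.
cumulative = list(accumulate(frequencies))
cumulative_relative = [c / n for c in cumulative]

for f, c, cr in zip(frequencies, cumulative, cumulative_relative):
    print(f"{f:>3} {f / n:5.2f} {c:>3} {cr:5.2f}")
```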
30
Ogive
Ogive: A line graph of a cumulative frequency or cumulative relative
frequency distribution. An ogive has the following components:
1. A title, which identifies the population or sample
2. A vertical scale, which identifies either the cumulative frequencies
or the cumulative relative frequencies
3. A horizontal scale, which identifies the upper class boundaries.
Until the upper boundary of a class has been reached, you cannot
be sure you have accumulated all the data in the class. Therefore,
the horizontal scale for an ogive is always based on the upper class
boundaries.
Note: Every ogive starts on the left with a relative frequency of zero at the lower
class boundary of the first class and ends on the right with a relative frequency
of 100% at the upper class boundary of the last class.
31
Example
Example: The graph below is an ogive using cumulative relative
frequencies for the computer science aptitude data:
[Figure: ogive titled “Computer Science Aptitude Test”; horizontal axis: Test Score (0 to 28, marked at the class boundaries); vertical axis: Cumulative Relative Frequency (0.0 to 1.0)]
32
2.3 ~ Measures of Central Tendency
• Numerical values used to locate the middle of a
set of data, or where the data is clustered
• The term average is often associated with all
measures of central tendency
33
Mean
Mean: The type of average with which you are probably most familiar. The mean
is the sum of all the values divided by the total number of values, n:
$\bar{x} = \frac{1}{n}\sum x_i = \frac{1}{n}(x_1 + x_2 + \cdots + x_n)$
Notes:
The population mean, μ (lowercase mu, Greek alphabet), is the
mean of all x values for the entire population
We usually cannot measure μ but would like to estimate its value
A physical representation: the mean is the value that balances the
weights on the number line
34
Example
Example: The following data represents the number of accidents
in each of the last 8 years at a dangerous intersection.
Find the mean number of accidents: 8, 9, 3, 5, 2, 6, 4, 5:
Solution:
$\bar{x} = \frac{1}{8}(8 + 9 + 3 + 5 + 2 + 6 + 4 + 5) = 5.25$
In the data above, change 6 to 26:
Solution:
$\bar{x} = \frac{1}{8}(8 + 9 + 3 + 5 + 2 + 26 + 4 + 5) = 7.75$
Note: The mean can be greatly influenced by outliers
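The effect of the single changed value can be reproduced with the standard library. This is only a sketch; the list names are assumptions.

```python
from statistics import mean

# Accidents in each of the last 8 years.
accidents = [8, 9, 3, 5, 2, 6, 4, 5]

# Same data with the value 6 replaced by the outlier 26.
with_outlier = [26 if x == 6 else x for x in accidents]

original_mean = mean(accidents)
outlier_mean = mean(with_outlier)

print(original_mean, outlier_mean)
```

One changed observation out of eight moves the mean by 2.5 accidents per year.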
35
Median
Median: The value of the data that occupies the middle position when
the data are ranked in order according to size
Notes:
Denoted by “x tilde”: $\tilde{x}$
The population median, M (uppercase mu, Greek alphabet), is
the data value in the middle position of the entire population
To find the median:
1. Rank the data
2. Determine the depth of the median: $d(\tilde{x}) = \frac{n + 1}{2}$
3. Determine the value of the median
36
Example
Example: Find the median for the set of data:
{4, 8, 3, 8, 2, 9, 2, 11, 3}
Solution:
1. Rank the data: 2, 2, 3, 3, 4, 8, 8, 9, 11
2. Find the depth: $d(\tilde{x}) = (9 + 1)/2 = 5$
3. The median is the fifth number from either end in the ranked
data: $\tilde{x} = 4$
Suppose the data set is {4, 8, 3, 8, 2, 9, 2, 11, 3, 15}:
1. Rank the data: 2, 2, 3, 3, 4, 8, 8, 9, 11, 15
2. Find the depth: $d(\tilde{x}) = (10 + 1)/2 = 5.5$
3. The median is halfway between the fifth and sixth
observations: $\tilde{x} = (4 + 8)/2 = 6$
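The rank, depth, value procedure translates almost line for line into code. The function name below is hypothetical; it follows the three steps above rather than calling a library median.

```python
def median_by_depth(values):
    """Find the median using the depth d = (n + 1) / 2 on the ranked data."""
    ranked = sorted(values)            # step 1: rank the data
    n = len(ranked)
    depth = (n + 1) / 2                # step 2: depth of the median
    if depth.is_integer():
        return ranked[int(depth) - 1]  # step 3: value at that position
    # A depth ending in .5 means the median is halfway between two neighbors.
    lower = ranked[int(depth) - 1]
    upper = ranked[int(depth)]
    return (lower + upper) / 2

odd_case = median_by_depth([4, 8, 3, 8, 2, 9, 2, 11, 3])
even_case = median_by_depth([4, 8, 3, 8, 2, 9, 2, 11, 3, 15])
print(odd_case, even_case)
```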
37
Mode & Midrange
Mode: The mode is the value of x that occurs most frequently
Note: If two or more values in a sample are tied for the
highest frequency (number of occurrences), there
is no mode
Midrange: The number exactly midway between the lowest data
value L and the highest data value H. It is found by averaging the
low and the high values:
$\text{midrange} = \frac{L + H}{2}$
38
Example
Example: Consider the data set {12.7, 27.1, 35.6, 44.2, 18.0}
$\text{Midrange} = \frac{L + H}{2} = \frac{12.7 + 44.2}{2} = 28.45$
Notes:
When rounding off an answer, a common rule-of-thumb is
to keep one more decimal place in the answer than was
present in the original data
To avoid round-off buildup, round off only the final
answer, not intermediate steps
39
2.4 ~ Measures of Dispersion
• Measures of central tendency alone cannot
completely characterize a set of data. Two very
different data sets may have similar measures of
central tendency.
• Measures of dispersion are used to describe the
spread, or variability, of a distribution
• Common measures of dispersion: range, variance,
and standard deviation
40
Range
Range: The difference in value between the highest-valued (H) and
the lowest-valued (L) pieces of data:
$\text{range} = H - L$
• Other measures of dispersion are based on the following quantity
Deviation from the Mean: A deviation from the mean, $x - \bar{x}$,
is the difference between the value of x and the mean $\bar{x}$
41
Example
Example: Consider the sample {12, 23, 17, 15, 18}.
Find 1) the range and 2) each deviation from the mean.
Solutions:
1) $\text{range} = H - L = 23 - 12 = 11$
2) $\bar{x} = \frac{1}{5}(12 + 23 + 17 + 15 + 18) = 17$
Data, x    Deviation from Mean, $x - \bar{x}$
_________________________
12         -5
23         6
17         0
15         -2
18         1
42
Mean Absolute Deviation
Note: $\sum (x - \bar{x}) = 0$ (Always!)
Mean Absolute Deviation: The mean of the absolute values of
the deviations from the mean:
$\text{Mean absolute deviation} = \frac{1}{n}\sum |x - \bar{x}|$
For the previous example:
$\frac{1}{n}\sum |x - \bar{x}| = \frac{1}{5}(5 + 6 + 0 + 2 + 1) = \frac{14}{5} = 2.8$
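Both facts on this slide, that the deviations sum to zero and that the mean absolute deviation is 2.8 for the sample {12, 23, 17, 15, 18}, can be checked numerically. A small sketch with assumed variable names:

```python
# Sample from the range and deviation example.
sample = [12, 23, 17, 15, 18]

n = len(sample)
x_bar = sum(sample) / n

# Deviation of each value from the mean.
deviations = [x - x_bar for x in sample]

# The signed deviations cancel; the absolute deviations do not.
deviation_sum = sum(deviations)
mean_absolute_deviation = sum(abs(d) for d in deviations) / n

print(deviations, deviation_sum, mean_absolute_deviation)
```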
43
Sample Variance & Standard Deviation
Sample Variance: The sample variance, $s^2$, is the mean of the
squared deviations, calculated using n − 1 as the divisor:
$s^2 = \frac{1}{n - 1}\sum (x - \bar{x})^2$
where n is the sample size
Note: The numerator for the sample variance is called the sum of
squares for x, denoted SS(x):
$s^2 = \frac{SS(x)}{n - 1}$
where
$SS(x) = \sum (x - \bar{x})^2 = \sum x^2 - \frac{1}{n}\left(\sum x\right)^2$
Standard Deviation: The standard deviation of a sample, s, is the
positive square root of the variance:
$s = \sqrt{s^2}$
44
Example
Example: Find the 1) variance and 2) standard deviation for the
data {5, 7, 1, 3, 8}:
Solutions:
First: $\bar{x} = \frac{1}{5}(5 + 7 + 1 + 3 + 8) = 4.8$
        x      $x - \bar{x}$    $(x - \bar{x})^2$
        5      0.2              0.04
        7      2.2              4.84
        1      -3.8             14.44
        3      -1.8             3.24
        8      3.2              10.24
Sum:    24     0                32.80
1) $s^2 = \frac{1}{4}(32.8) = 8.2$
2) $s = \sqrt{8.2} = 2.86$
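The sum of squares can be computed either from the definition or from the shortcut form $\sum x^2 - (\sum x)^2/n$; both should agree. The sketch below, with illustrative variable names, confirms this for the data {5, 7, 1, 3, 8}.

```python
from math import sqrt

data = [5, 7, 1, 3, 8]
n = len(data)
x_bar = sum(data) / n

# Sum of squares from the definition: squared deviations from the mean.
ss_definition = sum((x - x_bar) ** 2 for x in data)

# Sum of squares from the shortcut: sum of x^2 minus (sum of x)^2 / n.
ss_shortcut = sum(x * x for x in data) - sum(data) ** 2 / n

variance = ss_definition / (n - 1)   # divisor n - 1 for a sample
standard_deviation = sqrt(variance)

print(round(variance, 2), round(standard_deviation, 2))
```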
45
Notes
The shortcut formula for the sample variance:
$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$
The unit of measure for the standard deviation is the
same as the unit of measure for the data
46
2.5 ~ Mean & Standard
Deviation of Frequency Distribution
• If the data is given in the form of a frequency
distribution, we need to make a few changes to
the formulas for the mean, variance, and standard
deviation
• Complete the extension table in order to find
these summary statistics
47
To Calculate
• In order to calculate the mean, variance, and standard
deviation for data:
1. In an ungrouped frequency distribution, use the
frequency of occurrence, f, of each observation
2. In a grouped frequency distribution, we use the frequency
of occurrence associated with each class midpoint:
$\bar{x} = \frac{\sum xf}{\sum f}$
$s^2 = \frac{\sum x^2 f - \frac{(\sum xf)^2}{\sum f}}{\sum f - 1}$
48
Example
Example: A survey of students in the first grade at a local school
asked for the number of brothers and/or sisters for
each child. The results are summarized in the table
below. Find 1) the mean, 2) the variance, and
3) the standard deviation:
Solutions:
First:
        x      f      xf     $x^2 f$
        0      15     0      0
        1      17     17     17
        2      23     46     92
        4      5      20     80
        5      2      10     50
Sum:           62     93     239
1) $\bar{x} = 93/62 = 1.5$
2) $s^2 = \frac{239 - \frac{(93)^2}{62}}{62 - 1} = 1.63$
3) $s = \sqrt{1.63} = 1.28$
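The extension table and the three results can be reproduced with a small helper. The function name `grouped_stats` is an assumption made for this sketch; it applies the frequency-distribution formulas for the mean and sample variance.

```python
from math import sqrt

def grouped_stats(values, freqs):
    """Mean, sample variance, and standard deviation from an x/f table."""
    n = sum(freqs)                                       # sum of f
    sum_xf = sum(x * f for x, f in zip(values, freqs))   # sum of xf
    sum_x2f = sum(x * x * f for x, f in zip(values, freqs))  # sum of x^2 f
    mean = sum_xf / n
    variance = (sum_x2f - sum_xf ** 2 / n) / (n - 1)
    return mean, variance, sqrt(variance)

# Number of siblings (x) and number of first graders reporting it (f).
siblings = [0, 1, 2, 4, 5]
counts = [15, 17, 23, 5, 2]

mean, variance, sd = grouped_stats(siblings, counts)
print(round(mean, 2), round(variance, 2), round(sd, 2))
```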
49
TI-83 Calculations
• When dealing with a grouped frequency
distribution, use the following technique:
Input the class midpoints or data values into L1
and the frequencies into L2; then continue with
Highlight: L3
Enter: L3 = L1*L2
Highlight: L4
Enter: L4 = L1*L3
Highlight: L5(1) (first position in L5 column)
50
TI-83 Calculations
Enter:
L5(1) = sum(L2) (Σf)
To find sum use 2nd “List”>Math>5:sum(
L5(2) = sum(L3) (Σxf)
L5(3) = sum(L4) (Σx²f)
L5(4) = L5(2)/L5(1) to find the mean
L5(5) = (L5(3)-(L5(2))²/L5(1))/(L5(1)-1) to find the variance
L5(6) = 2nd √(L5(5)) to find the standard deviation
Let’s work problem 2.108 as an example!
51
Problem 2.108
• Find the mean and the variance for this grouped
frequency distribution:
Class Boundaries   f
2 – 6              7
6 – 10             15
10 – 14            22
14 – 18            14
18 – 22            2
Step 1: Enter the midpoints into L1
Step 2: Enter the frequencies into L2
52
Problem 2.108
• Highlight L3 and enter L1 * L2
• Highlight L4 and enter L1 * L3
• Highlight L5(1) and enter Sum(L2)
53
Problem 2.108
• Highlight L5(2) and enter Sum(L3)
• Highlight L5(3) and enter Sum(L4)
• Highlight L5(4) and enter L5(2)/L5(1)
• Highlight L5(5) and enter (L5(3)-(L5(2))²/L5(1))/(L5(1)-1)
• Finally highlight L5(6) and enter 2nd √(L5(5))
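For readers without the calculator, the same list-by-list steps can be mirrored in Python. This is a sketch, not the calculator's own procedure: the names `l1` through `l5` are chosen only to match the TI-83 lists, and the class midpoints are computed from the class boundaries given in the problem.

```python
from math import sqrt

boundaries = [(2, 6), (6, 10), (10, 14), (14, 18), (18, 22)]

l1 = [(low + high) / 2 for low, high in boundaries]  # class midpoints
l2 = [7, 15, 22, 14, 2]                              # frequencies
l3 = [x * f for x, f in zip(l1, l2)]                 # L3 = L1*L2, the xf column
l4 = [x * xf for x, xf in zip(l1, l3)]               # L4 = L1*L3, the x^2 f column

l5 = [sum(l2), sum(l3), sum(l4)]                     # sums of f, xf, x^2 f
l5.append(l5[1] / l5[0])                             # mean
l5.append((l5[2] - l5[1] ** 2 / l5[0]) / (l5[0] - 1))  # variance
l5.append(sqrt(l5[4]))                               # standard deviation

print(l5)
```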
54
Problem 2.73
            Runs At Home   Runs Away   Difference
Mean        9.77           9.80        -0.03
Median      9.65           9.78        -0.06
Maximum     13.65          11.06       4.89
Minimum     7.64           8.67        -1.74
Midrange    10.65          9.87        1.58
55
Problem 2.73 cont.
• Teams playing the Rockies at Coors Field generated the
maximum number of runs scored (13.65). On the other
hand, while playing their opponents away, the Rockies
and their opponents were able to generate only 8.76
runs, which ranked second from the bottom.
Collectively, these two performances produced the
greatest spread (4.89) by a considerable margin between
runs scored at home and runs scored away by any
stadium/team combination in the major leagues. This
unusually large value inflates the midrange difference. It
appears the playing conditions at Coors Field, therefore,
are more responsible for producing the higher combined
scores than the strength of either the Rockies’ or their
opponents’ bats or any weakness of the pitching staffs.
56
Problem 2.75
a. ∑x needs to be 500, therefore need any three numbers that total 330.
b. Need two numbers smaller than 70 and one larger
c. Need multiple 87’s
d. Need any two numbers that total 140 for the extreme values where one is
100 or larger
e. Need two numbers smaller than 70 and one larger than 70 so their total is
330
f. Need two numbers of 87 and a third number large enough so that the total of
all five is 500.
g. Mean equal to 100 requires the five data to total 500 and the midrange of 70
requires the total of L and H to be 140; 40, __, 70, __, 100; that is a sum
of 210, meaning the other two data must total 290. One of the last two
numbers must be larger than 145, which would then become H and change
the midrange. Impossible.
h. There must be two 87’s in order to have a mode, and there can only be two
data larger than 70 in order for 70 to be the median. __, 70, 87, 87, 100;
Impossible
57
Problem 2.83
a. Range = H – L = 9 – 2 = 7
b. 1st: find the mean: 6
        x      x – xbar   (x – xbar)²
        2      -4         16
        4      -2         4
        7      1          1
        8      2          4
        9      3          9
∑       30     0          34
s² = ∑(x – xbar)²/(n – 1) = 34/4 = 8.5
s = √s² = √8.5 = 2.9
58
2.6 ~ Measures of Position
• Measures of position are used to describe the
relative location of an observation
• Quartiles and percentiles are two of the most
popular measures of position
• An additional measure of central tendency, the
midquartile, is defined using quartiles
• Quartiles are part of the 5-number summary
59
Quartiles
Quartiles: Values of the variable that divide the ranked data into
quarters; each set of data has three quartiles
1. The first quartile, Q1, is a number such that at most 25% of
the data are smaller in value than Q1 and at most 75% are
larger
2. The second quartile, Q2, is the median
3. The third quartile, Q3, is a number such that at most 75%
of the data are smaller in value than Q3 and at most 25%
are larger
Ranked data, increasing order:
L --25%-- Q1 --25%-- Q2 --25%-- Q3 --25%-- H
60
Percentiles
Percentiles: Values of the variable that divide a set of ranked
data into 100 equal subsets; each set of data has 99 percentiles.
The kth percentile, Pk, is a value such that at most k% of the data
is smaller in value than Pk and at most (100 − k)% of the data is
larger.
L --at most k%-- Pk --at most (100 − k)%-- H
Notes:
The 1st quartile and the 25th percentile are the same: Q1 = P25
The median, the 2nd quartile, and the 50th percentile are
all the same: $\tilde{x}$ = Q2 = P50
61
Finding Pk (and Quartiles)
• Procedure for finding Pk (and quartiles):
1. Rank the n observations, lowest to highest
2. Compute A = (nk)/100
3. If A is an integer:
– d(Pk) = A.5 (depth)
– Pk is halfway between the value of the data in the Ath
position and the value of the next data
If A is a fraction:
– d(Pk) = B, the next larger integer
– Pk is the value of the data in the Bth position
62
Example
Example: The following data represents the pH levels of a
random sample of swimming pools in a California
town. Find: 1) the first quartile, 2) the third quartile,
and 3) the 37th percentile:
5.6    6.0    6.7    7.0
5.6    6.1    6.8    7.3
5.8    6.2    6.8    7.4
5.9    6.3    6.8    7.4
6.0    6.4    6.9    7.5
Solutions:
1) k = 25: (20)(25)/100 = 5, depth = 5.5, Q1 = 6
2) k = 75: (20)(75)/100 = 15, depth = 15.5, Q3 = 6.95
3) k = 37: (20)(37)/100 = 7.4, depth = 8, P37 = 6.2
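The procedure for finding Pk can be sketched as a short function. The name `percentile` is an assumption, and the sketch expects an integer k between 1 and 99; it uses integer arithmetic on n·k to decide whether A = nk/100 is a whole number. Note that statistical packages often use other interpolation rules and may give slightly different answers.

```python
def percentile(values, k):
    """k-th percentile using the depth rule: A = n*k/100."""
    ranked = sorted(values)
    n = len(ranked)
    whole, remainder = divmod(n * k, 100)
    if remainder == 0:
        # A is an integer: depth is A.5, halfway between positions A and A + 1.
        return (ranked[whole - 1] + ranked[whole]) / 2
    # A is a fraction: depth is B, the next larger integer (1-based position).
    return ranked[whole]

# The 20 swimming pool pH values from the example.
ph = [5.6, 6.0, 6.7, 7.0, 5.6, 6.1, 6.8, 7.3, 5.8, 6.2,
      6.8, 7.4, 5.9, 6.3, 6.8, 7.4, 6.0, 6.4, 6.9, 7.5]

q1 = percentile(ph, 25)
q3 = percentile(ph, 75)
p37 = percentile(ph, 37)
print(q1, round(q3, 2), p37)
```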
63
Midquartile
Midquartile: The numerical value midway between the first and
third quartile:
$\text{midquartile} = \frac{Q_1 + Q_3}{2}$
Example: Find the midquartile for the 20 pH values in the
previous example:
$\text{midquartile} = \frac{Q_1 + Q_3}{2} = \frac{6 + 6.95}{2} = \frac{12.95}{2} = 6.475$
Note: The mean, median, midrange, and midquartile are all measures
of central tendency. They are not necessarily equal. Can you
think of an example when they would be the same value?
64
5-Number Summary
5-Number Summary: The 5-number summary is composed of:
1. L, the smallest value in the data set
2. Q1, the first quartile (also P25)
3. $\tilde{x}$, the median (also P50 and 2nd quartile)
4. Q3, the third quartile (also P75)
5. H, the largest value in the data set
Notes:
The 5-number summary indicates how much the data is
spread out in each quarter
The interquartile range is the difference between the first
and third quartiles. It is the range of the middle 50% of the
data
65
Box-and-Whisker Display
Box-and-Whisker Display: A graphic representation of the
5-number summary:
• The five numerical values (smallest, first quartile, median, third
quartile, and largest) are located on a scale, either vertical or
horizontal
• The box is used to depict the middle half of the data that lies
between the two quartiles
• The whiskers are line segments used to depict the other half of the
data
• One line segment represents the quarter of the data that is smaller
in value than the first quartile
• The second line segment represents the quarter of the data that is
larger in value than the third quartile
66
Example
Example: A random sample of students in a sixth grade class
was selected. Their weights are given in the table
below. Find the 5-number summary for this data and
construct a boxplot:
Solution:
63    85    92    99     112
64    86    93    99
76    88    93    99
76    89    93    101
81    90    94    108
83    91    97    109
5-number summary:
63    85    92    99     112
L     Q1    $\tilde{x}$    Q3     H
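The five values can be recovered from the 25 weights with the same depth rule used for percentiles. The helper names below are hypothetical, and the sketch computes the interquartile range as well, since it follows directly from Q1 and Q3.

```python
def depth_percentile(ranked, k):
    """k-th percentile of already ranked data using A = n*k/100."""
    whole, remainder = divmod(len(ranked) * k, 100)
    if remainder == 0:
        return (ranked[whole - 1] + ranked[whole]) / 2
    return ranked[whole]

def five_number_summary(values):
    ranked = sorted(values)
    return (ranked[0],
            depth_percentile(ranked, 25),
            depth_percentile(ranked, 50),
            depth_percentile(ranked, 75),
            ranked[-1])

# Weights of the 25 sixth graders from the example.
weights = [63, 85, 92, 99, 112, 64, 86, 93, 99, 76, 88, 93, 99,
           76, 89, 93, 101, 81, 90, 94, 108, 83, 91, 97, 109]

low, q1, median, q3, high = five_number_summary(weights)
interquartile_range = q3 - q1
print(low, q1, median, q3, high, interquartile_range)
```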
67
Boxplot for Weight Data
[Figure: boxplot titled “Weights from Sixth Grade Class”; horizontal axis: Weight (60 to 110); L, Q1, $\tilde{x}$, Q3, and H are marked]
68
z-Score
z-Score: The position a particular value of x has relative to the mean,
measured in standard deviations. The z-score is found by the
formula:
$z = \frac{\text{value} - \text{mean}}{\text{st.dev.}} = \frac{x - \bar{x}}{s}$
Notes:
Typically, the calculated value of z is rounded to the nearest
hundredth
The z-score measures the number of standard deviations
above/below, or away from, the mean
z-scores typically range from -3.00 to +3.00
z-scores may be used to make comparisons of raw scores
69
Example
Example: A certain data set has mean 35.6 and standard
deviation 7.1. Find the z-scores for 46 and 33:
Solutions:
$z = \frac{x - \bar{x}}{s} = \frac{46 - 35.6}{7.1} = 1.46$
46 is 1.46 standard deviations above the mean
$z = \frac{x - \bar{x}}{s} = \frac{33 - 35.6}{7.1} = -0.37$
33 is 0.37 standard deviations below the mean.
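The two z-scores can be verified in a couple of lines. The function name is an assumption for this sketch; rounding to the nearest hundredth follows the note on the z-score slide.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_46 = round(z_score(46, 35.6, 7.1), 2)
z_33 = round(z_score(33, 35.6, 7.1), 2)
print(z_46, z_33)
```

The sign carries the direction: a negative z-score means the value is below the mean.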
70
2.7 ~ Interpreting & Understanding
Standard Deviation
• Standard deviation is a measure of variability, or
spread
• Two rules for describing data rely on the standard
deviation:
– Empirical rule: applies to a variable that is
normally distributed
– Chebyshev’s theorem: applies to any distribution
71
Empirical Rule
Empirical Rule: If a variable is normally distributed, then:
1. Approximately 68% of the observations lie within 1 standard
deviation of the mean
2. Approximately 95% of the observations lie within 2 standard
deviations of the mean
3. Approximately 99.7% of the observations lie within 3 standard
deviations of the mean
Notes:
The empirical rule is more informative than Chebyshev’s theorem since
we know more about the distribution (normally distributed)
Also applies to populations
Can be used to determine if a distribution is normally distributed
72
Illustration of the Empirical Rule
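[Figure: normal curve with the horizontal axis marked at $\bar{x} - 3s$, $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$, $\bar{x} + 3s$; 68% of the area lies within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3]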
73
Example
Example: A random sample of plum tomatoes was selected
from a local grocery store and their weights recorded.
The mean weight was 6.5 ounces with a standard
deviation of 0.4 ounces. If the weights are normally
distributed:
1) What percentage of weights fall between 5.7 and 7.3?
2) What percentage of weights fall above 7.7?
Solutions:
1) $(\bar{x} - 2s, \bar{x} + 2s) = (6.5 - 2(0.4), 6.5 + 2(0.4)) = (5.7, 7.3)$
Approximately 95% of the weights fall between 5.7 and 7.3
2) $(\bar{x} - 3s, \bar{x} + 3s) = (6.5 - 3(0.4), 6.5 + 3(0.4)) = (5.3, 7.7)$
Approximately 99.7% of the weights fall between 5.3 and 7.7
Approximately 0.3% of the weights fall outside (5.3, 7.7)
Approximately (0.3/2)=0.15% of the weights fall above 7.7
74
A Note about the Empirical Rule
Note: The empirical rule may be used to determine whether or
not a set of data is approximately normally distributed
1. Find the mean and standard deviation for the data
2. Compute the actual proportion of data within 1, 2, and 3
standard deviations from the mean
3. Compare these actual proportions with those given by the
empirical rule
4. If the proportions found are reasonably close to those of the
empirical rule, then the data is approximately normally
distributed
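The four steps above can be sketched as follows. Two assumptions are made purely for illustration: the data set is the sixth-grade weights from the box-and-whisker example, and the function name is hypothetical. Judging whether the proportions are "reasonably close" to 68%, 95%, and 99.7% is left to the reader.

```python
from statistics import mean, stdev

def proportions_within(values, ks=(1, 2, 3)):
    """Proportion of values within k sample standard deviations of the mean."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (divisor n - 1)
    return {k: sum(1 for v in values if abs(v - m) <= k * s) / len(values)
            for k in ks}

# Sixth-grade weights, reused here only as sample data.
weights = [63, 85, 92, 99, 112, 64, 86, 93, 99, 76, 88, 93, 99,
           76, 89, 93, 101, 81, 90, 94, 108, 83, 91, 97, 109]

observed = proportions_within(weights)
for k, expected in zip((1, 2, 3), (0.68, 0.95, 0.997)):
    print(f"within {k}s: observed {observed[k]:.3f}, empirical rule {expected}")
```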
75
Chebyshev’s Theorem
Chebyshev’s Theorem: The proportion of any distribution that lies
within k standard deviations of the mean is at least $1 - \frac{1}{k^2}$, where
k is any positive number larger than 1. This theorem applies to all
distributions of data.
Illustration:
[Figure: at least $1 - \frac{1}{k^2}$ of the distribution lies between $\bar{x} - ks$ and $\bar{x} + ks$]
76
Important Reminders!
Chebyshev’s theorem is very conservative and holds for
any distribution of data
Chebyshev’s theorem also applies to any population
The two most common values used to describe a
distribution of data are k = 2, 3
The table below lists some values for k and $1 - \frac{1}{k^2}$:
k      $1 - \frac{1}{k^2}$
1.7    0.65
2      0.75
2.5    0.84
3      0.89
77
Example
Example: At the close of trading, a random sample of 35
technology stocks was selected. The mean selling
price was 67.75 and the standard deviation was 12.3.
Use Chebyshev’s theorem (with k = 2, 3) to describe
the distribution.
Solutions:
Using k=2: At least 75% of the observations lie within 2 standard
deviations of the mean:
$(\bar{x} - 2s, \bar{x} + 2s) = (67.75 - 2(12.3), 67.75 + 2(12.3)) = (43.15, 92.35)$
Using k=3: At least 89% of the observations lie within 3
standard deviations of the mean:
$(\bar{x} - 3s, \bar{x} + 3s) = (67.75 - 3(12.3), 67.75 + 3(12.3)) = (30.85, 104.65)$
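Both parts of the Chebyshev calculation, the guaranteed proportion and the interval it applies to, can be sketched with two small helpers. The function names are assumptions made for illustration.

```python
def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations, for k > 1."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2

def interval(mean, sd, k):
    """Endpoints of the interval mean - k*sd to mean + k*sd."""
    return mean - k * sd, mean + k * sd

# Technology stock example: mean 67.75, standard deviation 12.3.
for k in (2, 3):
    low, high = interval(67.75, 12.3, k)
    print(f"k={k}: at least {chebyshev_bound(k):.2f} within ({low:.2f}, {high:.2f})")
```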
78
2.8 ~ The Art of Statistical Deception
Good Arithmetic, Bad Statistics
Misleading Graphs
Insufficient Information
79
Good Arithmetic, Bad Statistics
• The mean can be greatly influenced by outliers
– Example: The mean salary for all NBA players is $15.5 million
Misleading graphs:
1. The frequency scale should start at zero to present a
complete picture. Graphs that do not start at zero are
used to save space.
2. Graphs that start at zero emphasize the size of the
numbers involved
3. Graphs that are chopped off emphasize variation
80
Flight Cancellations
[Figure: Number of Cancellations by Year (1996 to 2002); vertical scale starts at zero and runs from 0 to 35]
81
Flight Cancellations
[Figure: Number of Cancellations by Year (1996 to 2002); vertical scale is chopped off and runs from 27 to 35]
82
Insufficient Information
• Example: An admissions officer from a state school
explains that the average tuition at a nearby private
university is $13,000 and only $4500 at his school. This
makes the state school look more attractive.
– If most students pay the full tuition, then the state
school appears to be a better choice
– However, if most students at the private university
receive substantial financial aid, then the actual
tuition cost could be much lower!
83