Essential Statistics 1/e
4B-1
Chapter
4B
Descriptive Statistics (Part 2)
Standardized Data
Percentiles and Quartiles
Box Plots
McGraw-Hill/Irwin
© 2008 The McGraw-Hill Companies, Inc. All rights reserved.
4B-3
Standardized Data
Chebyshev’s Theorem
• Developed by mathematicians Jules Bienaymé
(1796-1878) and Pafnuty Chebyshev (1821-1894).
• For any population with mean μ and standard
deviation σ, the percentage of observations that lie
within k standard deviations of the mean must be at
least 100[1 – 1/k²].
4B-4
Standardized Data
Chebyshev’s Theorem
• For k = 2 standard deviations,
100[1 – 1/2²] = 75%
• So, at least 75.0% will lie within μ ± 2σ
• For k = 3 standard deviations,
100[1 – 1/3²] = 88.9%
• So, at least 88.9% will lie within μ ± 3σ
• Although applicable to any data set, these limits
tend to be too wide to be useful.
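The bound is simple to evaluate for any k. The following is a minimal sketch; the function name `chebyshev_min_percent` is made up for illustration, and the restriction to k > 1 is an added assumption (for k ≤ 1 the formula gives a bound of zero or less, which says nothing).

```python
def chebyshev_min_percent(k: float) -> float:
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        # Assumption for this sketch: reject k <= 1, where the bound is uninformative.
        raise ValueError("the bound is only informative for k > 1")
    return 100 * (1 - 1 / k**2)


for k in (2, 3):
    print(f"k = {k}: at least {chebyshev_min_percent(k):.1f}%")
```

For k = 2 and k = 3 this reproduces the 75.0% and 88.9% figures above.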
4B-5
Standardized Data
The Empirical Rule
• The normal or Gaussian distribution was named for
Carl Friedrich Gauss (1777-1855).
• The normal distribution is symmetric and is also
known as the bell-shaped curve.
• The Empirical Rule states that for data from a
normal distribution, we expect that for
k = 1: about 68.26% will lie within μ ± 1σ
k = 2: about 95.44% will lie within μ ± 2σ
k = 3: about 99.73% will lie within μ ± 3σ
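These percentages can be checked against the normal distribution itself. The sketch below uses the identity P(|Z| < k) = erf(k/√2) for a standard normal Z; the function name `normal_within` is made up for illustration. The computed values can differ from the slide's figures in the last decimal place, most likely because the slide values come from a rounded printed table.

```python
import math


def normal_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"k = {k}: about {100 * normal_within(k):.2f}%")
```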
4B-6
Standardized Data
The Empirical Rule
• Distance from the mean is measured in terms of
the number of standard deviations.
Note: no upper bound is given. Data values outside μ ± 3σ are rare.
4B-7
Standardized Data
Example: Exam Scores
• If 80 students take an exam, how many will score
within 2 standard deviations of the mean?
• Assuming exam scores follow a normal distribution,
the empirical rule states
about 95.44% will lie within μ ± 2σ
so 95.44% × 80 ≈ 76 students will score
within ± 2σ of μ.
• How many students will score more than 2
standard deviations from the mean?
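The arithmetic for this example, including the complementary count, can be sketched as follows. It assumes, as the slide does, that the scores are normally distributed; the variable names are illustrative only.

```python
n_students = 80
share_within_2sd = 0.9544  # Empirical Rule figure quoted on the slide

# 80 * 0.9544 = 76.352, which rounds to 76 students
within = round(n_students * share_within_2sd)
# Everyone else is more than 2 standard deviations from the mean
beyond = n_students - within

print(f"within 2 standard deviations: about {within}")
print(f"beyond 2 standard deviations: about {beyond}")
```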
4B-8
Standardized Data
Unusual Observations
• Unusual observations are those that lie beyond
μ ± 2σ.
• Outliers are observations that lie beyond
μ ± 3σ.
4B-9
Standardized Data
Unusual Observations
• For example, the P/E ratio data contains several
large data values. Are they unusual or outliers?
7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
4B-10
Standardized Data
The Empirical Rule
• If the sample came from a normal distribution, then
the Empirical rule states
x̄ ± 1s = 22.72 ± 1(14.08) = (8.6, 36.8)
x̄ ± 2s = 22.72 ± 2(14.08) = (-5.4, 50.9)
x̄ ± 3s = 22.72 ± 3(14.08) = (-19.5, 65.0)
4B-11
Standardized Data
The Empirical Rule
• Are there any unusual values or outliers?
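[Figure: number line for the sorted P/E data (7, 8, ..., 48, 55, 68, 91), marked at x̄ = 22.72 and at -19.5, -5.4, 8.6, 36.8, 50.9, and 65.0 (x̄ ± 1s, ± 2s, ± 3s). The regions beyond ± 2s are labeled Unusual and the regions beyond ± 3s are labeled Outliers.]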
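The intervals and the classification can be recomputed directly from the 68 P/E ratios. This is a minimal sketch using Python's `statistics` module with the sample standard deviation (n - 1 divisor); the variable names are illustrative. Computed from the data, the one-standard-deviation interval is 22.72 ± 14.08 = (8.6, 36.8).

```python
import statistics

# The 68 P/E ratios from the example
pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

mean = statistics.mean(pe)
s = statistics.stdev(pe)  # sample standard deviation

for k in (1, 2, 3):
    print(f"mean ± {k}s = ({mean - k * s:.1f}, {mean + k * s:.1f})")

# Unusual: more than 2s but at most 3s from the mean; outliers: more than 3s
unusual = [x for x in pe if 2 * s < abs(x - mean) <= 3 * s]
outliers = [x for x in pe if abs(x - mean) > 3 * s]
print("unusual:", unusual)
print("outliers:", outliers)
```

Under these cutoffs, 55 is unusual and 68 and 91 are outliers. Note that 48 falls just inside the two-standard-deviation limit of 50.9.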
4B-12
Standardized Data
Defining a Standardized Variable
• A standardized variable (Z) redefines each
observation in terms of the number of standard
deviations from the mean.
Standardization formula for a population:
$z_i = \frac{x_i - \mu}{\sigma}$
Standardization formula for a sample:
$z_i = \frac{x_i - \bar{x}}{s}$
4B-13
Standardized Data
Defining a Standardized Variable
• zi tells how far away the observation is from the
mean.
• For example, for the P/E data, the first value x1 = 7.
The associated z value is
$z_1 = \frac{x_1 - \bar{x}}{s} = \frac{7 - 22.72}{14.08} = -1.12$
4B-14
Standardized Data
Defining a Standardized Variable
• A negative z value means the observation is below
the mean.
• Positive z means the observation is above the
mean. For x68 = 91,
$z_{68} = \frac{x_{68} - \bar{x}}{s} = \frac{91 - 22.72}{14.08} = 4.85$
4B-15
Standardized Data
Defining a Standardized Variable
• Here are the standardized z values for the P/E
data:
• What do you conclude for these four values?
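The table of z values is not reproduced in this transcript, but the values can be regenerated from the data. The sketch below applies the sample standardization formula to all 68 P/E ratios; the helper name `standardize` is made up for illustration, and listing the four largest values is an assumption about which four observations the question refers to.

```python
import statistics


def standardize(values):
    """Return the sample z value (x - mean) / s for each x."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation
    return [(x - mean) / s for x in values]


pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

z = standardize(pe)

print(f"x = {pe[0]}: z = {z[0]:.2f}")    # smallest observation
# Assumption: the four values in question are the four largest observations
for x, zi in list(zip(pe, z))[-4:]:
    print(f"x = {x}: z = {zi:.2f}")
```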
4B-16
Standardized Data
Defining a Standardized Variable
• MegaStat calculates standardized values as well
as checks for outliers.
• In Excel, use =STANDARDIZE(x, Mean, StDev)
to calculate the standardized z value
for an observation x.
4B-17
Standardized Data
Outliers
• What do we do with outliers in a data set?
• If due to erroneous data, then discard.
• An outrageous observation (one completely outside
of an expected range) is certainly invalid.
• Recognize unusual data points and outliers and
their potential impact on your study.
• Research books and articles on how to handle
outliers.
4B-18
Standardized Data
Estimating Sigma
• For a normal distribution, nearly all values fall within a
range of 6σ (from μ – 3σ to μ + 3σ).
• If you know the range R (high – low), you can
estimate the standard deviation as σ ≈ R/6.
• Useful for approximating the standard deviation
when only R is known.
• This estimate depends on the assumption of
normality.
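As a rough illustration, the R/6 estimate can be compared with the sample standard deviation of the 68 P/E ratios used in this chapter. This is only a sketch: the P/E data are right-skewed rather than normal, so the close agreement here should not be expected in general.

```python
import statistics

pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

data_range = max(pe) - min(pe)       # R = 91 - 7 = 84
sigma_estimate = data_range / 6      # R / 6
sample_sd = statistics.stdev(pe)     # sample standard deviation for comparison

print(f"R/6 estimate: {sigma_estimate:.2f}")
print(f"sample s:     {sample_sd:.2f}")
```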
4B-19
Percentiles and Quartiles
Percentiles
• Percentiles are cut points that divide the sorted
data into 100 groups.
• For example, you score in the 83rd percentile on a
standardized test. That means that 83% of the
test-takers scored below you.
• Deciles divide the data into 10 groups.
• Quintiles divide the data into 5 groups.
• Quartiles divide the data into 4 groups.
4B-20
Percentiles and Quartiles
Percentiles
• Percentiles are used to establish benchmarks for
comparison purposes (e.g., health care,
manufacturing and banking industries use 5, 25,
50, 75 and 90 percentiles).
• Quartiles (25, 50, and 75 percent) are commonly
used to assess financial performance and stock
portfolios.
• Percentiles are used in employee merit evaluation
and salary benchmarking.
4B-21
Percentiles and Quartiles
Quartiles
• Quartiles are scale points that divide the sorted
data into four groups of approximately equal size.
Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%
• The three values that separate the four groups are
called Q1, Q2, and Q3, respectively.
4B-22
Percentiles and Quartiles
Quartiles
• The second quartile Q2 is the median, an important
indicator of central tendency.
Lower 50% | Q2 | Upper 50%
• Q1 and Q3 measure dispersion since the
interquartile range Q3 – Q1 measures the degree of
spread in the middle 50 percent of data values.
Lower 25% | Q1 | Middle 50% | Q3 | Upper 25%
4B-23
Percentiles and Quartiles
Quartiles
• The first quartile Q1 is the median of the data
values below Q2, and the third quartile Q3 is the
median of the data values above Q2.
Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%
For the first half of the data, 50% lie above and 50% below Q1.
For the second half of the data, 50% lie above and 50% below Q3.
4B-24
Percentiles and Quartiles
Quartiles
• Depending on n, the quartiles Q1, Q2, and Q3 may
be members of the data set or may lie between
two of the sorted data values.
4B-25
Percentiles and Quartiles
Method of Medians
• For small data sets, find quartiles using method of
medians:
Step 1. Sort the observations.
Step 2. Find the median Q2.
Step 3. Find the median of the data values that lie
below Q2.
Step 4. Find the median of the data values that lie
above Q2.
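The four steps can be sketched in a few lines. The function name `quartiles_by_medians` is made up for illustration. One assumption is labeled in the code: "below Q2" and "above Q2" are taken by position in the sorted list, so when n is odd the middle observation is left out of both halves.

```python
import statistics


def quartiles_by_medians(values):
    """Return (Q1, Q2, Q3) using the method of medians."""
    data = sorted(values)               # Step 1: sort the observations
    n = len(data)
    half = n // 2
    q2 = statistics.median(data)        # Step 2: median of all the data
    # Assumption: halves are split by position; the middle value is
    # excluded from both halves when n is odd.
    lower = data[:half]
    upper = data[n - half:]
    q1 = statistics.median(lower)       # Step 3: median of the lower half
    q3 = statistics.median(upper)       # Step 4: median of the upper half
    return q1, q2, q3


pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

print(quartiles_by_medians(pe))
```

For the 68 P/E ratios used in this chapter, this gives Q1 = 14, Q2 = 19, and Q3 = 26.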
4B-26
Percentiles and Quartiles
Excel Quartiles
• Use Excel function =QUARTILE(Array, k) to return
the kth quartile.
• Excel treats quartiles as a special case of
percentiles. For example, to calculate Q3:
=QUARTILE(Array, 3)
=PERCENTILE(Array, 0.75)
• Excel calculates the quartile positions as:
Position of Q1: 0.25n + 0.75
Position of Q2: 0.50n + 0.50
Position of Q3: 0.75n + 0.25
4B-27
Percentiles and Quartiles
Example: P/E Ratios and Quartiles
• Consider the following P/E ratios for 68 stocks in a
portfolio.
7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
• Use quartiles to define benchmarks for stocks that
are low-priced (bottom quartile) or high-priced (top
quartile).
4B-28
Percentiles and Quartiles
Example: P/E Ratios and Quartiles
• Using Excel’s method of interpolation, the quartile
positions are:
Quartile   Position formula             Interpolate between
Q1         0.25(68) + 0.75 = 17.75      X17 and X18
Q2         0.50(68) + 0.50 = 34.50      X34 and X35
Q3         0.75(68) + 0.25 = 51.25      X51 and X52
4B-29
Percentiles and Quartiles
Example: P/E Ratios and Quartiles
• The quartiles are:
Quartile      Formula
First (Q1)    Q1 = X17 + 0.75 (X18 – X17) = 14 + 0.75 (14 – 14) = 14
Second (Q2)   Q2 = X34 + 0.50 (X35 – X34) = 19 + 0.50 (19 – 19) = 19
Third (Q3)    Q3 = X51 + 0.25 (X52 – X51) = 26 + 0.25 (26 – 26) = 26
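The position-and-interpolate procedure can be sketched as follows. The function name `interpolated_quartile` is made up for illustration, and this is a sketch of the position formulas quoted in these slides, not Excel's own code. The positions 0.25n + 0.75, 0.50n + 0.50, and 0.75n + 0.25 are written as the single expression (k/4)(n – 1) + 1.

```python
import math


def interpolated_quartile(values, k):
    """Quartile k (1, 2, or 3) by linear interpolation at position (k/4)(n - 1) + 1."""
    if k not in (1, 2, 3):
        raise ValueError("k must be 1, 2, or 3")
    data = sorted(values)
    n = len(data)
    position = (k / 4) * (n - 1) + 1      # 1-based position, e.g. 17.75 for Q1 when n = 68
    lower_index = math.floor(position)    # 1-based index of the value just below
    fraction = position - lower_index     # how far to move toward the next value
    lower_value = data[lower_index - 1]
    if fraction == 0:
        return lower_value                # position falls exactly on a data value
    upper_value = data[lower_index]
    return lower_value + fraction * (upper_value - lower_value)


pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

for k in (1, 2, 3):
    print(f"Q{k} = {interpolated_quartile(pe, k)}")
```

Because the neighboring values are tied at each quartile position in the P/E data, the interpolation weight has no effect here and the results are 14, 19, and 26.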
4B-30
Percentiles and Quartiles
Example: P/E Ratios and Quartiles
• So, to summarize:
Lower 25% of P/E ratios | Q1 = 14 | Second 25% of P/E ratios | Q2 = 19 | Third 25% of P/E ratios | Q3 = 26 | Upper 25% of P/E ratios
• These quartiles express central tendency and
dispersion. What is the interquartile range?
• Because of clustering of identical data values,
these quartiles do not provide clean cut points
between groups of observations.
4B-31
Percentiles and Quartiles
Tip
Whether you use the method of
medians or Excel, your quartiles will be
about the same. Small differences in
calculation techniques typically do not
lead to different conclusions in
business applications.
4B-32
Percentiles and Quartiles
Caution
• Quartiles generally resist outliers.
• However, quartiles do not provide clean cut points
in the sorted data, especially in small samples with
repeating data values.
Data set A:
1, 2, 4, 4, 8, 8, 8, 8
Q1 = 3, Q2 = 6, Q3 = 8
Data set B:
0, 3, 3, 6, 6, 6, 10, 15
Q1 = 3, Q2 = 6, Q3 = 8
• Although they have identical quartiles, these two
data sets are not similar. The quartiles do not
represent either data set well.
4B-33
Percentiles and Quartiles
Dispersion Using Quartiles
• Some robust measures of central tendency and
dispersion using quartiles are:
Statistic: Midhinge
Formula: (Q1 + Q3) / 2
Excel: =0.5*(QUARTILE(Data,1)+QUARTILE(Data,3))
Pro: Robust to presence of extreme data values.
Con: Less familiar to most people.
4B-34
Percentiles and Quartiles
Dispersion Using Quartiles
Statistic: Midspread
Formula: Q3 – Q1
Excel: =QUARTILE(Data,3)-QUARTILE(Data,1)
Pro: Stable when extreme data values exist.
Con: Ignores magnitude of extreme data values.

Statistic: Coefficient of quartile variation (CQV)
Formula: 100 × (Q3 – Q1) / (Q3 + Q1)
Excel: None
Pro: Relative variation in percent so we can compare data sets.
Con: Less familiar to nonstatisticians.
4B-35
Percentiles and Quartiles
Midhinge
• The mean of the first and third quartiles.
$\text{Midhinge} = \frac{Q_1 + Q_3}{2}$
• For the 68 P/E ratios,
$\text{Midhinge} = \frac{Q_1 + Q_3}{2} = \frac{14 + 26}{2} = 20$
• A robust measure of central tendency since
quartiles ignore extreme values.
4B-36
Percentiles and Quartiles
Midspread (Interquartile Range)
• A robust measure of dispersion
Midspread = Q3 – Q1
• For the 68 P/E ratios,
Midspread = Q3 – Q1 = 26 – 14 = 12
4B-37
Percentiles and Quartiles
Coefficient of Quartile Variation (CQV)
• Measures relative dispersion by expressing half the
midspread as a percent of the midhinge.
$\text{CQV} = 100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1}$
• For the 68 P/E ratios,
$\text{CQV} = 100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1} = 100 \times \frac{26 - 14}{26 + 14} = 30.0\%$
• Similar to the CV, CQV can be used to compare
data sets measured in different units or with
different means.
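The three quartile-based statistics from the last few slides can be computed together. This is a minimal sketch that takes Q1 = 14 and Q3 = 26 for the P/E ratios as given; the variable names are illustrative.

```python
q1, q3 = 14, 26  # first and third quartiles of the 68 P/E ratios

midhinge = (q1 + q3) / 2            # robust measure of central tendency
midspread = q3 - q1                 # interquartile range
cqv = 100 * (q3 - q1) / (q3 + q1)   # relative dispersion, in percent

print(f"midhinge  = {midhinge}")
print(f"midspread = {midspread}")
print(f"CQV       = {cqv:.1f}%")
```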
4B-38
Box Plots
• A useful tool of exploratory data analysis (EDA).
• Also called a box-and-whisker plot.
• Based on a five-number summary:
Xmin, Q1, Q2, Q3, Xmax
• Consider the five-number summary for the
68 P/E ratios:
Xmin = 7, Q1 = 14, Q2 = 19, Q3 = 26, Xmax = 91
4B-39
Box Plots
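[Figure: box plot of the P/E ratios. The box spans Q1 to Q3 with the median (Q2) marked inside it; the center of the box is the midhinge. Whiskers extend to the minimum and the maximum. The plot is right-skewed.]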
4B-40
Box Plots
Fences and Unusual Data Values
• Use quartiles to detect unusual data points.
• The cutoff points are called fences and can be found
using the following formulas:
                Lower fence            Upper fence
Inner fences:   Q1 – 1.5 (Q3 – Q1)     Q3 + 1.5 (Q3 – Q1)
Outer fences:   Q1 – 3.0 (Q3 – Q1)     Q3 + 3.0 (Q3 – Q1)
• Values outside the inner fences are unusual while
those outside the outer fences are outliers.
4B-41
Box Plots
Fences and Unusual Data Values
• For example, consider the P/E ratio data:
                Inner fences                 Outer fences
Lower fence:    14 – 1.5 (26 – 14) = –4      14 – 3.0 (26 – 14) = –22
Upper fence:    26 + 1.5 (26 – 14) = +44     26 + 3.0 (26 – 14) = +62
• Ignore the lower fence since it is negative and P/E
ratios are only positive.
4B-42
Box Plots
Fences and Unusual Data Values
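• Truncate the whiskers at the fences and display
unusual values and outliers as dots.
[Figure: box plot of the P/E ratios with the inner fence and outer fence marked. Points beyond the inner fence are labeled Unusual; points beyond the outer fence are labeled Outliers.]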
• Based on these fences, there are three unusual
P/E values and two outliers.
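The fence calculation and the resulting classification can be checked with a short sketch. The helper name `fences` is made up for illustration, and Q1 = 14 and Q3 = 26 are taken as given from the earlier slides.

```python
def fences(q1, q3):
    """Return the inner and outer (lower, upper) fences for given quartiles."""
    iqr = q3 - q1
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }


pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

limits = fences(14, 26)
inner_low, inner_high = limits["inner"]
outer_low, outer_high = limits["outer"]

# Outliers lie outside the outer fences; unusual values lie outside the
# inner fences but not outside the outer fences.
outliers = [x for x in pe if x < outer_low or x > outer_high]
unusual = [x for x in pe
           if (x < inner_low or x > inner_high) and outer_low <= x <= outer_high]

print("inner fences:", limits["inner"])
print("outer fences:", limits["outer"])
print("unusual:", unusual)
print("outliers:", outliers)
```

The unusual values are 45, 48, and 55, and the outliers are 68 and 91, matching the counts stated above.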
4B-43
Grouped Data
Nature of Grouped Data
• Although some information is lost, grouped data
are easier to display than raw data.
• When bin limits are given, the mean and standard
deviation can be estimated.
• Accuracy of grouped estimates depends on
- the number of bins
- distribution of data within bins
- bin frequencies