Transcript Slide 1

Chapter 4
Descriptive Statistics (Part 2)
Standardized Data
Percentiles, Quartiles and Box Plots
Grouped Data
Skewness and Kurtosis
McGraw-Hill/Irwin
Copyright © 2009 by The McGraw-Hill Companies, Inc.
Standardized Data
Chebyshev’s Theorem
• Developed by mathematicians Jules Bienaymé
(1796-1878) and Pafnuty Chebyshev (1821-1894).
• For any population with mean μ and standard
deviation σ, the percentage of observations
that lie within k standard deviations of the
mean must be at least 100[1 – 1/k²] (for k > 1).
4B-2
Standardized Data
Chebyshev’s Theorem
• For k = 2 standard deviations,
100[1 – 1/2²] = 75%
• So, at least 75.0% will lie within μ ± 2σ
• For k = 3 standard deviations,
100[1 – 1/3²] = 88.9%
• So, at least 88.9% will lie within μ ± 3σ
• Although applicable to any data set, these
limits tend to be too wide to be useful.
4B-3
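The bound can be checked with a few lines of Python. This is a minimal sketch; the function name chebyshev_lower_bound is an illustrative choice, not something from the slides.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the formula gives 0% or less, so it says nothing useful.
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 100 * (1 - 1 / k**2)


for k in (2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1f}%")
```

Because the bound holds for any distribution, it is much looser than the percentages that apply when the data are known to be normal.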
Standardized Data
The Empirical Rule
• The normal or Gaussian distribution was
named for Karl Gauss (1777-1855).
• The normal distribution is symmetric and is
also known as the bell-shaped curve.
• The Empirical Rule states that for data from a
normal distribution, we expect that for
k = 1 about 68.26% will lie within μ ± 1σ
k = 2 about 95.44% will lie within μ ± 2σ
k = 3 about 99.73% will lie within μ ± 3σ
4B-4
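These percentages follow from the normal distribution itself. The sketch below, with an illustrative function name, computes them from the error function in Python's standard library. The exact values round to 68.27% and 95.45%; the slide's 68.26% and 95.44% presumably come from doubling four-decimal table entries.

```python
import math


def normal_pct_within(k: float) -> float:
    """Percentage of a normal distribution within k standard deviations of the mean."""
    # P(|Z| <= k) = erf(k / sqrt(2)) for a standard normal Z.
    return 100 * math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"k = {k}: about {normal_pct_within(k):.2f}%")
```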
Standardized Data
The Empirical Rule
• Distance from the mean is measured in terms
of the number of standard deviations.
Note: no upper bound is given.
Data values outside μ ± 3σ are rare.
4B-5
Standardized Data
Example: Exam Scores
• If 80 students take an exam, how many will score
within 2 standard deviations of the mean?
• Assuming exam scores follow a normal
distribution, the empirical rule states
about 95.44% will lie within μ ± 2σ
so 95.44% × 80 ≈ 76 students will score
within ± 2σ of μ.
• How many students will score more than 2
standard deviations from the mean?
4B-6
Standardized Data
Unusual Observations
• Unusual observations are those that lie beyond
μ ± 2σ.
• Outliers are observations that lie beyond
μ ± 3σ.
4B-7
Standardized Data
Unusual Observations
• For example, the P/E ratio data contains
several large data values. Are they unusual or
outliers?
7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
4B-8
Standardized Data
The Empirical Rule
• If the sample came from a normal distribution,
then the Empirical Rule states
x̄ ± 1s = 22.72 ± 1(14.08) = (8.6, 36.8)
x̄ ± 2s = 22.72 ± 2(14.08) = (-5.4, 50.9)
x̄ ± 3s = 22.72 ± 3(14.08) = (-19.5, 65.0)
4B-9
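The sample mean, standard deviation and the three intervals can be reproduced from the 68 P/E ratios listed earlier. This is a sketch using Python's statistics module; the helper name interval is illustrative.

```python
import statistics

# The 68 P/E ratios from the earlier slide, already sorted.
pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

mean = statistics.mean(pe)   # sample mean
s = statistics.stdev(pe)     # sample standard deviation (n - 1 denominator)


def interval(k):
    """Return the interval (mean - k*s, mean + k*s)."""
    return mean - k * s, mean + k * s


print(f"mean = {mean:.2f}, s = {s:.2f}")
for k in (1, 2, 3):
    low, high = interval(k)
    print(f"k = {k}: ({low:.1f}, {high:.1f})")
```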
Standardized Data
The Empirical Rule
• Are there any unusual values or outliers?
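[Figure: number line for the sorted P/E data (7, 8, . . . , 48, 55, 68, 91) with tick marks at -19.5, -5.4, 8.6, 22.72, 36.8, 50.9 and 65.0. Values beyond x̄ ± 2s are labeled Unusual; values beyond x̄ ± 3s are labeled Outliers.]
4B-10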
Standardized Data
Defining a Standardized Variable
• A standardized variable (Z) redefines each
observation in terms of the number of standard
deviations from the mean.
Standardization formula for a population:
$z_i = \frac{x_i - \mu}{\sigma}$
Standardization formula for a sample:
$z_i = \frac{x_i - \bar{x}}{s}$
4B-11
Standardized Data
Defining a Standardized Variable
• zi tells how far away the observation is from
the mean.
• For example, for the P/E data, the first value is
x1 = 7. The associated z value is
$z_1 = \frac{x_1 - \bar{x}}{s} = \frac{7 - 22.72}{14.08} = -1.12$
4B-12
Standardized Data
Defining a Standardized Variable
• A negative z value means the observation is
below the mean.
• Positive z means the observation is above the
mean. For x68 = 91,
$z_{68} = \frac{x_{68} - \bar{x}}{s} = \frac{91 - 22.72}{14.08} = 4.85$
4B-13
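The same calculation can be applied to any observation. The sketch below standardizes a few P/E values and labels them with the cutoffs given earlier (beyond 2 standard deviations is unusual, beyond 3 is an outlier). The function names and the label "typical" are illustrative assumptions, not terms from the slides.

```python
import statistics

# The 68 P/E ratios from the earlier slide.
pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

mean = statistics.mean(pe)
s = statistics.stdev(pe)


def z_score(x):
    """Number of sample standard deviations that x lies from the sample mean."""
    return (x - mean) / s


def classify(z):
    """Label a z value using the 2- and 3-standard-deviation cutoffs."""
    if abs(z) > 3:
        return "outlier"
    if abs(z) > 2:
        return "unusual"
    return "typical"  # illustrative label for everything else


for x in (7, 55, 68, 91):
    z = z_score(x)
    print(f"x = {x}: z = {z:.2f} ({classify(z)})")
```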
Standardized Data
Defining a Standardized Variable
• Here are the standardized z values for the P/E
data:
• What do you conclude for these three values?
4B-14
Standardized Data
Defining a Standardized Variable
• MegaStat calculates standardized values as
well as checks for outliers.
• In Excel, use =STANDARDIZE(Array, Mean,
STDev) to calculate a
standardized z value.
4B-15
Standardized Data
Outliers
• What do we do with outliers in a data set?
• If due to erroneous data, then discard.
• An outrageous observation (one completely
outside of an expected range) is certainly
invalid.
• Recognize unusual data points and outliers
and their potential impact on your study.
• Research books and articles on how to handle
outliers.
4B-16
Standardized Data
Estimating Sigma
• For a normal distribution, the range of values
is 6σ (from μ – 3σ to μ + 3σ).
• If you know the range R (high – low), you can
estimate the standard deviation as σ = R/6.
• Useful for approximating the standard
deviation when only R is known.
• This estimate depends on the assumption of
normality.
4B-17
Percentiles and Quartiles
Percentiles
• Percentiles are data that have been divided into 100
groups.
• For example, you score in the 83rd percentile on a
standardized test. That means that 83% of the test-takers scored below you.
• Deciles are data that have been divided into
10 groups.
• Quintiles are data that have been divided into
5 groups.
• Quartiles are data that have been divided into
4 groups.
4B-18
Percentiles and Quartiles
Percentiles
• Percentiles are used to establish benchmarks
for comparison purposes (e.g., health care,
manufacturing and banking industries use the 5th,
25th, 50th, 75th and 90th percentiles).
• Quartiles (25, 50, and 75 percent) are
commonly used to assess financial
performance and stock portfolios.
• Percentiles are used in employee merit
evaluation and salary benchmarking.
4B-19
Percentiles and Quartiles
Quartiles
• Quartiles are scale points that divide the
sorted data into four groups of approximately
equal size.
Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%
• The three values that separate the four groups are
called Q1, Q2, and Q3, respectively.
4B-20
Percentiles and Quartiles
Quartiles
• The second quartile Q2 is the median, an
important indicator of central tendency.
Lower 50% | Q2 | Upper 50%
• Q1 and Q3 measure dispersion since the
interquartile range Q3 – Q1 measures the
degree of spread in the middle 50 percent of
data values.
Lower 25% | Q1 | Middle 50% | Q3 | Upper 25%
4B-21
Percentiles and Quartiles
Quartiles
• The first quartile Q1 is the median of the data
values below Q2, and the third quartile Q3 is
the median of the data values above Q2.
Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%
For first half of data, 50% above, 50% below Q1.
For second half of data, 50% above, 50% below Q3.
4B-22
Percentiles and Quartiles
Quartiles
• Depending on n, the quartiles Q1,Q2, and Q3
may be members of the data set or may lie
between two of the sorted data values.
4B-23
Percentiles and Quartiles
Method of Medians
• For small data sets, find quartiles using
method of medians:
Step 1. Sort the observations.
Step 2. Find the median Q2.
Step 3. Find the median of the data values that
lie below Q2.
Step 4. Find the median of the data values that
lie above Q2.
4B-24
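The four steps can be sketched in Python as follows. One detail is an assumption here: the halves are split by position in the sorted list, and when n is odd the middle value is left out of both halves. The slides do not say how ties or an odd n should be handled, and the function name is illustrative.

```python
import statistics


def quartiles_by_medians(data):
    """Return (Q1, Q2, Q3) using the method of medians."""
    xs = sorted(data)                 # Step 1: sort the observations
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    q2 = statistics.median(xs)        # Step 2: median of all values
    half = n // 2
    lower = xs[:half]                 # values positioned below the median
    upper = xs[half + n % 2:]         # values positioned above the median
    q1 = statistics.median(lower)     # Step 3: median of the lower half
    q3 = statistics.median(upper)     # Step 4: median of the upper half
    return q1, q2, q3


# The 68 P/E ratios from the earlier slide.
pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

print(quartiles_by_medians(pe))
```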
Percentiles and Quartiles
Excel Quartiles
• Use Excel function =QUARTILE(Array, k) to
return the kth quartile.
• Excel treats quartiles as a special case of
percentiles. For example, to calculate Q3
=QUARTILE(Array, 3)
=PERCENTILE(Array, 0.75)
• Excel calculates the quartile positions as:
Position of Q1: 0.25n + 0.75
Position of Q2: 0.50n + 0.50
Position of Q3: 0.75n + 0.25
4B-25
Percentiles and Quartiles
Example: P/E Ratios and Quartiles
• Consider the following P/E ratios for 68 stocks
in a portfolio.
7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
• Use quartiles to define benchmarks for stocks
that are low-priced (bottom quartile) or high-priced (top quartile).
4B-26
Percentiles and Quartiles
Example: P/E Ratios and Quartiles
• Using Excel’s method of interpolation, the
quartile positions are:
Quartile Position   Formula                      Interpolate Between
Q1                  0.25(68) + 0.75 = 17.75      X17 and X18
Q2                  0.50(68) + 0.50 = 34.50      X34 and X35
Q3                  0.75(68) + 0.25 = 51.25      X51 and X52
4B-27
Percentiles and Quartiles
Example: P/E Ratios and Quartiles
• The quartiles are:
Quartile      Formula
First (Q1)    Q1 = X17 + 0.75 (X18 – X17) = 14 + 0.75 (14 – 14) = 14
Second (Q2)   Q2 = X34 + 0.50 (X35 – X34) = 19 + 0.50 (19 – 19) = 19
Third (Q3)    Q3 = X51 + 0.25 (X52 – X51) = 26 + 0.25 (26 – 26) = 26
4B-28
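The position-and-interpolate calculation can be sketched in Python. The function below follows the position formulas given on the slides (0.25n + 0.75, 0.50n + 0.50, 0.75n + 0.25). Its name is illustrative, and it is not a claim about how Excel is implemented internally.

```python
def interpolated_quartile(sorted_data, k):
    """k-th quartile (k = 1, 2, 3) of already-sorted data, by linear interpolation."""
    if k not in (1, 2, 3):
        raise ValueError("k must be 1, 2 or 3")
    n = len(sorted_data)
    p = k / 4
    position = p * n + (1 - p)        # e.g. 0.25n + 0.75 for Q1
    lower_rank = int(position)        # 1-based rank of the lower neighbor
    fraction = position - lower_rank  # how far to move toward the upper neighbor
    lower = sorted_data[lower_rank - 1]
    upper = sorted_data[min(lower_rank, n - 1)]
    return lower + fraction * (upper - lower)


# The 68 P/E ratios from the earlier slide, already sorted.
pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

for k in (1, 2, 3):
    print(f"Q{k} = {interpolated_quartile(pe, k)}")
```

With tied neighbors, as in the P/E data, the interpolation fraction has no effect. On data without ties, such as 1 through 8, the first quartile falls between two values.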
Percentiles and Quartiles
Example: P/E Ratios and Quartiles
• So, to summarize:
Lower 25% of P/E Ratios | Q1 = 14 | Second 25% of P/E Ratios | Q2 = 19 | Third 25% of P/E Ratios | Q3 = 26 | Upper 25% of P/E Ratios
• These quartiles express central tendency and
dispersion. What is the interquartile range?
• Because of clustering of identical data values,
these quartiles do not provide clean cut points
between groups of observations.
4B-29
Percentiles and Quartiles
Tip
Whether you use the method of
medians or Excel, your quartiles will
be about the same. Small
differences in calculation techniques
typically do not lead to different
conclusions in business applications.
4B-30
Percentiles and Quartiles
Caution
• Quartiles generally resist outliers.
• However, quartiles do not provide clean cut
points in the sorted data, especially in small
samples with repeating data values.
Data set A:
1, 2, 4, 4, 8, 8, 8, 8
Q1 = 3, Q2 = 6, Q3 = 8
Data set B:
0, 3, 3, 6, 6, 6, 10, 15
Q1 = 3, Q2 = 6, Q3 = 8
• Although they have identical quartiles, these
two data sets are not similar. The quartiles do
not represent either data set well.
4B-31
Box Plots
• A useful tool of exploratory data analysis
(EDA).
• Also called a box-and-whisker plot.
• Based on a five-number summary:
Xmin, Q1, Q2, Q3, Xmax
• Consider the five-number summary for the
68 P/E ratios:
Xmin = 7, Q1 = 14, Q2 = 19, Q3 = 26, Xmax = 91
4B-32
Box Plots
• The box plot is displayed visually, like this.
• A box plot shows central tendency, dispersion,
and shape.
4B-33
Box Plots
Fences and Unusual Data Values
• Use quartiles to detect unusual data points.
• The cutoff points are called fences and can be
found using the following formulas:
              Inner fences         Outer fences
Lower fence   Q1 – 1.5 (Q3–Q1)     Q1 – 3.0 (Q3–Q1)
Upper fence   Q3 + 1.5 (Q3–Q1)     Q3 + 3.0 (Q3–Q1)
• Values outside the inner fences are unusual while
those outside the outer fences are outliers.
4B-34
Box Plots
Fences and Unusual Data Values
• For example, consider the P/E ratio data:
              Inner fences               Outer fences
Lower fence   14 – 1.5 (26–14) = -4      14 – 3.0 (26–14) = -22
Upper fence   26 + 1.5 (26–14) = +44     26 + 3.0 (26–14) = +62
• Ignore the lower fence since it is negative and
P/E ratios are only positive.
4B-35
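The fence formulas are easy to sketch in code. The function name and the dictionary layout below are illustrative choices. The list holds only the six largest P/E ratios from the data set, since only the upper fences matter for this example.

```python
def fences(q1, q3):
    """Return the inner and outer fences as (lower, upper) pairs."""
    iqr = q3 - q1  # interquartile range
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }


f = fences(14, 26)
inner_upper = f["inner"][1]
outer_upper = f["outer"][1]

largest = [41, 45, 48, 55, 68, 91]  # six largest P/E ratios
unusual = [x for x in largest if inner_upper < x <= outer_upper]
outliers = [x for x in largest if x > outer_upper]

print(f)
print("unusual:", unusual)
print("outliers:", outliers)
```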
Box Plots
Fences and Unusual Data Values
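• Truncate the whisker at the fences and display
unusual values and outliers as dots.
[Figure: box plot of the P/E data with the inner fence and outer fence marked; points beyond the inner fence are labeled Unusual and points beyond the outer fence are labeled Outliers.]
• Based on these fences, there are three
unusual P/E values and two outliers.
4B-36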
Percentiles and Quartiles
Midhinge
• The average of the first and third quartiles.
$\text{Midhinge} = \frac{Q_1 + Q_3}{2}$
• The name “midhinge” derives from the idea
that, if the “box” were folded in half, it would
resemble a “hinge”.
4B-37
Box Plots
[Figure: box plot of right-skewed data with labels for the Minimum, Q1, Median (Q2), Q3 and Maximum, the box and the whiskers. The center of the box is the midhinge.]
4B-38
Correlation
Correlation Coefficient
• The sample correlation coefficient is a statistic
that describes the degree of linearity between
paired observations on two quantitative
variables X and Y.
4B-39
Correlation
Correlation Coefficient
• Its range is -1 ≤ r ≤ +1.
• Excel’s formula =CORREL(Xdata, Ydata)
4B-40
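The transcript does not reproduce the formula, so the sketch below uses the standard Pearson definition: the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations. The function name and the small data set are hypothetical and purely illustrative.

```python
import math


def sample_correlation(xs, ys):
    """Pearson sample correlation coefficient r for paired observations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired data (not from the slides).
square_feet = [10, 20, 30, 40, 50]
sales = [12, 25, 29, 41, 48]
print(f"r = {sample_correlation(square_feet, sales):.3f}")
```

A value near +1 indicates a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little linear relationship.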
Correlation
Correlation Coefficient
• Illustration of Correlation Coefficients
4B-41
Correlation
• What is the nature of the relationship between
square feet of shopping area and sales that is
implied by the following correlation?
4B-42
Grouped Data
Nature of Grouped Data
• Although some information is lost, grouped
data are easier to display than raw data.
• When bin limits are given, the mean and
standard deviation can be estimated.
• Accuracy of grouped estimates depends on
- the number of bins
- distribution of data within bins
- bin frequencies
4B-43
Grouped Data
Mean and Standard Deviation
• Consider the frequency distribution for prices
of Lipitor® for three cities:
• Where
m_j = class midpoint
k = number of classes
f_j = class frequency
n = sample size
4B-44
Grouped Data
Nature of Grouped Data
• Estimate the mean and standard deviation by
$\bar{x} = \frac{\sum_{j=1}^{k} f_j m_j}{n} = \frac{3427.5}{47} = 72.92552$
$s = \sqrt{\frac{\sum_{j=1}^{k} f_j (m_j - \bar{x})^2}{n - 1}} = \sqrt{\frac{2091.48936}{47 - 1}} = 6.74293$
• Note: don’t round off too soon.
4B-45
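The grouped formulas can be sketched as follows. The Lipitor frequency table itself is not reproduced in this transcript, so the bins below are hypothetical and serve only to show the mechanics. The function name is also illustrative.

```python
import math


def grouped_mean_sd(bins):
    """Estimate the mean and standard deviation from (lower, upper, frequency) bins."""
    n = sum(freq for _, _, freq in bins)
    # Each class is represented by its midpoint m_j.
    mids = [((lower + upper) / 2, freq) for lower, upper, freq in bins]
    mean = sum(freq * m for m, freq in mids) / n
    variance = sum(freq * (m - mean) ** 2 for m, freq in mids) / (n - 1)
    return mean, math.sqrt(variance)


# Hypothetical bins (not the Lipitor data): (lower limit, upper limit, frequency)
bins = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]
mean, sd = grouped_mean_sd(bins)
cv = 100 * sd / mean  # coefficient of variation, in percent
print(f"mean = {mean:.2f}, s = {sd:.4f}, CV = {cv:.1f}%")
```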
Grouped Data
Nature of Grouped Data
• Now estimate the coefficient of variation
CV = 100 (s / x̄) = 100 (6.74293 / 72.92552) = 9.2%
Accuracy Issues
• How accurate are grouped estimates
compared to ungrouped estimates?
• For the previous example, we can compare the
grouped data statistics to the ungrouped data
statistics.
4B-46
Grouped Data
Accuracy Issues
• Accuracy tends to improve as the number of bins
increases.
• If the first or last class is open-ended, there will be
no class midpoint (no mean can be estimated).
• Assume a lower limit of zero for the first class when
the data are nonnegative.
• You may be able to assume an upper limit for some
variables (e.g., age).
• Median and quartiles may be estimated even with
open-ended classes.
4B-47
Skewness and Kurtosis
Skewness
• Generally, skewness may be indicated by
looking at the sample histogram or by
comparing the mean and median.
• This visual indicator is imprecise and does not
take into consideration sample size n.
4B-48
Skewness and Kurtosis
Skewness
• Skewness is a unit-free statistic.
• The coefficient compares two samples
measured in different units or one sample with
a known reference distribution (e.g.,
symmetric normal distribution).
• Calculate the sample’s skewness coefficient
as:
$\text{Skewness} = \frac{n}{(n - 1)(n - 2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$
4B-49
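A direct translation of the skewness coefficient into Python might look like the sketch below. The function name and the small sample are illustrative and not taken from the slides.

```python
import statistics


def sample_skewness(data):
    """Sample skewness coefficient with the n / ((n - 1)(n - 2)) adjustment."""
    n = len(data)
    if n < 3:
        raise ValueError("need at least 3 observations")
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    cubed = sum(((x - x_bar) / s) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * cubed


sample = [3, 4, 5, 2, 3, 4, 5, 6, 4, 7]  # illustrative data
print(f"skewness = {sample_skewness(sample):.4f}")
```

A positive coefficient points to a longer right tail and a negative one to a longer left tail. Perfectly symmetric data give zero.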
Skewness and Kurtosis
Skewness
• In Excel, go to Tools | Data Analysis |
Descriptive Statistics or use the function
=SKEW(array)
4B-50
Skewness and Kurtosis
Skewness
• Consider the following table showing the 90%
range for the sample skewness coefficient.
4B-51
Skewness and Kurtosis
Skewness
• Coefficients within the 90% range may be
attributed to random variation.
4B-52
Skewness and Kurtosis
Skewness
(Figure 4.36)
• Coefficients outside the range suggest the
sample came from a nonnormal population.
4B-53
Skewness and Kurtosis
Skewness
• As n increases, the range of chance variation
narrows.
4B-54
Skewness and Kurtosis
Kurtosis
• Kurtosis is the relative length of the tails and
the degree of concentration in the center.
• Consider three kurtosis prototype shapes.
[Figure: three prototype kurtosis shapes; the one surviving label reads "Heavier tails".]
4B-55
Skewness and Kurtosis
Kurtosis
• A histogram is an unreliable guide to kurtosis
since scale and axis proportions may differ.
• Excel and MINITAB calculate kurtosis as:
$\text{Kurtosis} = \frac{n(n + 1)}{(n - 1)(n - 2)(n - 3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n - 1)^2}{(n - 2)(n - 3)}$
4B-56
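The kurtosis coefficient can be sketched the same way. The function name and the sample are illustrative. The second term subtracts the value expected for a normal population, so results near zero are consistent with normal tails.

```python
import statistics


def sample_kurtosis(data):
    """Sample kurtosis coefficient (excess kurtosis with small-sample adjustment)."""
    n = len(data)
    if n < 4:
        raise ValueError("need at least 4 observations")
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    fourth = sum(((x - x_bar) / s) ** 4 for x in data)
    first_term = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return first_term - correction


sample = [3, 4, 5, 2, 3, 4, 5, 6, 4, 7]  # illustrative data
print(f"kurtosis = {sample_kurtosis(sample):.4f}")
```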
Skewness and Kurtosis
Kurtosis
• Consider the following table of expected 90%
range for sample kurtosis coefficient.
4B-57
Skewness and Kurtosis
Kurtosis
• A sample coefficient within the ranges may be
attributed to chance variation.
4B-58
Skewness and Kurtosis
Kurtosis
• Coefficients outside the range would suggest
the sample differs from a normal population.
4B-59
Skewness and Kurtosis
Kurtosis
• As sample size increases, the chance range
narrows.
Inferences about kurtosis are risky for n < 50.
4B-60
Applied Statistics in
Business & Economics
End of Chapter 4B
4B-61