Transcript Chapter 4
A PowerPoint Presentation Package to Accompany
Applied Statistics in Business & Economics, 4th edition
David P. Doane and Lori E. Seward
Prepared by Lloyd R. Jaisingh
McGraw-Hill/Irwin
Copyright © 2013 by The McGraw-Hill Companies, Inc. All rights reserved.
Chapter 4
Descriptive Statistics
Chapter Contents
4.1 Numerical Description
4.2 Measures of Center
4.3 Measures of Variability
4.4 Standardized Data
4.5 Percentiles, Quartiles, and Box Plots
4.6 Correlation and Covariance
4.7 Grouped Data
4.8 Skewness and Kurtosis
Chapter Learning Objectives
LO4-1: Explain the concepts of center, variability, and shape.
LO4-2: Use Excel to obtain descriptive statistics and visual displays.
LO4-3: Calculate and interpret common measures of center.
LO4-4: Calculate and interpret common measures of variability.
LO4-5: Transform a data set into standardized values.
LO4-6: Apply the Empirical Rule and recognize outliers.
LO4-7: Calculate quartiles and other percentiles.
LO4-8: Make and interpret box plots.
LO4-9: Calculate and interpret a correlation coefficient and covariance.
LO4-10: Calculate the mean and standard deviation from grouped data.
LO4-11: Assess skewness and kurtosis in a sample.
4.1 Numerical Description
LO4-1: Explain the concepts of center, variability, and shape.
Three key characteristics of numerical data: center, variability, and shape.
LO4-2: Use Excel to obtain descriptive statistics and visual displays.
Excel histogram display for Table 4.3
4.2 Measures of Center
LO4-3: Calculate and interpret common measures of center.
Mean
• A familiar measure of center.
Population mean: μ = (x1 + x2 + ... + xN) / N
Sample mean: x̄ = (x1 + x2 + ... + xn) / n
• In Excel, use function =AVERAGE(Data) where Data is an array of data values.
Median
• The median (M) is the 50th percentile or midpoint of the sorted sample data.
• M separates the upper and lower halves of the sorted observations.
• If n is odd, the median is the middle observation in the data array.
• If n is even, the median is the average of the middle two observations in the data array.
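The odd/even rule can be sketched in a few lines of Python. This is only an illustration; the function name and both samples are hypothetical and do not come from the slides.

```python
def median(values):
    """Return the median of a non-empty sequence of numbers."""
    data = sorted(values)              # the rule applies to sorted data
    n = len(data)
    mid = n // 2
    if n % 2 == 1:                     # odd n: the middle observation
        return data[mid]
    # even n: average of the two middle observations
    return (data[mid - 1] + data[mid]) / 2


odd_sample = [9, 3, 7, 1, 5]           # hypothetical data, n = 5
even_sample = [9, 3, 7, 1, 5, 11]      # hypothetical data, n = 6

print(median(odd_sample))              # 5
print(median(even_sample))             # 6.0
```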
Mode
• The most frequently occurring data value.
• May have multiple modes or no mode.
• The mode is most useful for discrete or categorical data with only a few distinct data values. For continuous data or data with a wide range, the mode is rarely useful.
Shape
• Compare mean and median or look at the histogram to determine
degree of skewness.
• Figure 4.10 shows prototype population shapes showing varying
degrees of skewness.
Geometric Mean
• The geometric mean (G) is a multiplicative average.
Growth Rates
• A variation on the geometric mean used to find the average growth rate for a time series.
• For example, from 2006 to 2010, JetBlue Airlines revenues are:
Year    Revenue (mil)
2006    2,361
2007    2,843
2008    3,392
2009    3,292
2010    3,779
The average growth rate: GR = (3,779 / 2,361)^(1/4) – 1 = 0.125, or 12.5% per year.
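The 12.5 percent figure can be checked with a short sketch. It assumes the usual growth-rate form of the geometric mean, GR = (x_n / x_1)^(1/(n-1)) - 1, which uses only the first and last values of the series; the variable names are illustrative.

```python
# JetBlue revenues (millions) from the table above
revenue = {2006: 2361, 2007: 2843, 2008: 3392, 2009: 3292, 2010: 3779}

years = sorted(revenue)
n = len(years)                          # 5 observations, so 4 growth periods
first, last = revenue[years[0]], revenue[years[-1]]

# Average growth rate per period (assumed formula, see lead-in)
growth_rate = (last / first) ** (1 / (n - 1)) - 1

print(f"{growth_rate:.1%}")             # 12.5%
```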
Midrange
• The midrange is the point halfway between the lowest and highest values of X.
• Easy to use but sensitive to extreme data values.
• For the J.D. Power quality data, the midrange (126.5) is higher than the mean (114.70) or median (113).
Trimmed Mean
• To calculate the trimmed mean, first remove the highest and lowest k percent of the observations.
• For example, for the n = 33 P/E ratios, we want a 5 percent trimmed mean (i.e., k = .05).
• To determine how many observations to trim, multiply k by n, which is 0.05 × 33 = 1.65, rounded to 2 observations.
• So, we would remove the two smallest and two largest observations before averaging the remaining values.
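The trimming steps can be sketched as follows. The rounding rule (k × n rounded to the nearest whole number per tail) is an assumption read off the 1.65 → 2 example above, and the function names and sample values are hypothetical.

```python
def trim_count(k, n):
    """Observations to drop from EACH tail: k * n rounded to the nearest integer."""
    return int(k * n + 0.5)


def trimmed_mean(values, k):
    """Mean after dropping the lowest and highest k fraction of the sorted data."""
    data = sorted(values)
    t = trim_count(k, len(data))
    if 2 * t >= len(data):
        raise ValueError("trimming would remove every observation")
    kept = data[t:len(data) - t]
    return sum(kept) / len(kept)


print(trim_count(0.05, 33))             # 2, as in the n = 33 example

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # hypothetical data with one extreme value
print(sum(sample) / len(sample))        # 14.5 (ordinary mean)
print(trimmed_mean(sample, 0.10))       # 5.5  (one value trimmed per tail)
```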
Trimmed Mean
• Here is a summary of all the measures of central tendency for the J.D. Power data.
Measure           Value     Excel formula
Mean              114.70    =AVERAGE(Data)
Median            113       =MEDIAN(Data)
Mode              111       =MODE.SNGL(Data)
Geometric Mean    113.35    =GEOMEAN(Data)
Midrange          126.5     =(MIN(Data)+MAX(Data))/2
5% Trim Mean      113.94    =TRIMMEAN(Data, 0.1)
• The trimmed mean mitigates the effects of very high values, but still exceeds the median.
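For readers working outside Excel, the same measures can be sketched with Python's standard `statistics` module. The data below are hypothetical (the J.D. Power values are not reproduced in this transcript), so the results will not match the summary above.

```python
import statistics

data = [4, 8, 8, 16]                    # hypothetical sample, not the J.D. Power data

mean = statistics.mean(data)                        # like =AVERAGE(Data)
median = statistics.median(data)                    # like =MEDIAN(Data)
mode = statistics.mode(data)                        # like =MODE.SNGL(Data)
geometric_mean = statistics.geometric_mean(data)    # like =GEOMEAN(Data)
midrange = (min(data) + max(data)) / 2              # halfway between min and max

print(mean, median, mode, round(geometric_mean, 2), midrange)
```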
4.3 Measures of Variability
LO4-4: Calculate and interpret common measures of variability.
• Variation is the “spread” of data points about the center of the distribution in a sample. Consider the following measures of variability:
Measures of Variability
Statistic: Range
Formula: xmax – xmin
Excel: =MAX(Data)-MIN(Data)
Pro: Easy to calculate.
Con: Sensitive to extreme data values.
Statistic: Sample variance (s²)
Formula: s² = Σ(xi – x̄)² / (n – 1)
Excel: =VAR.S(Data)
Pro: Plays a key role in mathematical statistics.
Con: Nonintuitive meaning.
Measures of Variability
Statistic: Sample standard deviation (s)
Formula: s = √[ Σ(xi – x̄)² / (n – 1) ]
Excel: =STDEV.S(Data)
Pro: Most common measure. Uses same units as the raw data ($, £, ¥, grams, etc.).
Con: Nonintuitive meaning.
Statistic: Sample coefficient of variation (CV)
Formula: CV = 100 × s / x̄
Excel: None
Pro: Measures relative variation in percent so can compare data sets.
Con: Requires nonnegative data.
Measures of Variability
Statistic: Mean absolute deviation (MAD)
Formula: MAD = Σ|xi – x̄| / n
Excel: =AVEDEV(Data)
Pro: Easy to understand.
Con: Lacks “nice” theoretical properties.
Population variance: σ² = Σ(xi – μ)² / N
Population standard deviation: σ = √[ Σ(xi – μ)² / N ]
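The sample measures of variability can be sketched in Python as a cross-check on the Excel functions. The data set is hypothetical, and the formulas are the standard sample definitions (divisor n – 1 for the variance, divisor n for the MAD).

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]         # hypothetical sample
xbar = statistics.mean(data)

data_range = max(data) - min(data)      # like =MAX(Data)-MIN(Data)
variance = statistics.variance(data)    # sample variance, like =VAR.S(Data)
std_dev = statistics.stdev(data)        # sample std. deviation, like =STDEV.S(Data)
cv = 100 * std_dev / xbar               # coefficient of variation, in percent
mad = sum(abs(x - xbar) for x in data) / len(data)   # like =AVEDEV(Data)

print(data_range, round(variance, 3), round(std_dev, 3), round(cv, 2), mad)
```

Note that the CV is computed only because the hypothetical mean is positive; with a zero or negative mean it would not be meaningful.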
Coefficient of Variation
• Useful for comparing variables measured in different units or with different means.
• A unit-free measure of dispersion.
• Expressed as a percent of the mean.
• Only appropriate for nonnegative data. It is undefined if the mean is zero or negative.
Mean Absolute Deviation
• This statistic reveals the average distance from the center.
• Absolute values must be used since otherwise the deviations around the mean would sum to zero. It is stated in the unit of measurement.
• The MAD is appealing because of its simple interpretation.
Central Tendency vs. Dispersion: Manufacturing
• Take frequent samples to monitor quality.
4.4 Standardized Data
Chebyshev’s Theorem
• For any population with mean μ and standard deviation σ, the percentage of observations that lie within k standard deviations of the mean must be at least 100[1 – 1/k²].
• For k = 2 standard deviations, 100[1 – 1/2²] = 75%. So, at least 75.0% will lie within μ ± 2σ.
• For k = 3 standard deviations, 100[1 – 1/3²] = 88.9%. So, at least 88.9% will lie within μ ± 3σ.
• Although applicable to any data set, these limits tend to be rather wide.
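The bound is simple enough to tabulate directly. The sketch below evaluates 100[1 – 1/k²] for a few values of k; the function name is illustrative, and the k > 1 check reflects that the bound is uninformative otherwise.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 100 * (1 - 1 / k ** 2)


for k in (2, 3, 4):
    print(k, round(chebyshev_lower_bound(k), 1))
# 2 75.0
# 3 88.9
# 4 93.8
```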
The Empirical Rule
• The normal distribution is symmetric and is also known as the bell-shaped curve.
• The Empirical Rule states that for data from a normal distribution, we expect the interval μ ± kσ to contain a known percentage of data. For
k = 1, 68.26% will lie within μ ± 1σ
k = 2, 95.44% will lie within μ ± 2σ
k = 3, 99.73% will lie within μ ± 3σ
Note: No upper bound is given. Data values outside μ ± 3σ are rare.
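The Empirical Rule percentages can be reproduced from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name is illustrative. The exact values (68.27%, 95.45%, 99.73%) differ from the 68.26% and 95.44% quoted above only in the last digit, which is what doubling four-decimal table areas (0.3413 and 0.4772) produces.

```python
from statistics import NormalDist


def normal_coverage(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return 2 * NormalDist().cdf(k) - 1


for k in (1, 2, 3):
    print(k, f"{normal_coverage(k):.2%}")
# 1 68.27%
# 2 95.45%
# 3 99.73%
```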
LO4-5: Transform a data set into standardized values.
• A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean.
Standardization formula for a population: zi = (xi – μ) / σ
Standardization formula for a sample (for n > 30): zi = (xi – x̄) / s
• A negative z value means the observation is to the left of the mean.
• Positive z means the observation is to the right of the mean.
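Standardizing a sample can be sketched as below, assuming the usual sample form z = (x – x̄)/s. The function name and data are hypothetical, and the sample is kept small for readability even though the slide suggests n > 30 for the sample formula.

```python
from statistics import mean, stdev


def standardize(values):
    """Return the z-score of each observation using the sample mean and std. deviation."""
    xbar = mean(values)
    s = stdev(values)
    return [(x - xbar) / s for x in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]         # hypothetical sample
z = standardize(data)

for x, zi in zip(data, z):
    side = "left of" if zi < 0 else "right of" if zi > 0 else "at"
    print(x, round(zi, 2), side, "the mean")
```

After standardization the z-scores have mean 0 and sample standard deviation 1, which is what makes values from different data sets comparable.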
LO4-6: Apply the Empirical Rule and recognize outliers.
Estimating Sigma
• For a normal distribution, the range of values is almost 6σ (from μ – 3σ to μ + 3σ).
• If you know the range R (high – low), you can estimate the standard deviation as σ = R/6.
• Useful for approximating the standard deviation when only R is known.
• This estimate depends on the assumption of normality.
4.5 Percentiles, Quartiles, and Box Plots
LO4-7: Calculate quartiles and other percentiles
Percentiles
• Percentiles are data that have been divided into 100 groups.
• For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.
• Deciles are data that have been divided into 10 groups.
• Quintiles are data that have been divided into 5 groups.
• Quartiles are data that have been divided into 4 groups.
Percentiles
• Percentiles may be used to establish benchmarks for comparison purposes (e.g., health care, manufacturing, and banking industries use 5th, 25th, 50th, 75th and 90th percentiles).
• Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios.
• Percentiles can be used in employee merit evaluation and salary benchmarking.
Quartiles
• Quartiles are scale points that divide the sorted data into four groups of approximately equal size.
Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%
• The three values that separate the four groups are called Q1, Q2, and Q3, respectively.
Quartiles
• The second quartile Q2 is the median, a measure of central tendency.
Lower 50% | Q2 | Upper 50%
• Q1 and Q3 measure dispersion since the interquartile range Q3 – Q1 measures the degree of spread in the middle 50 percent of data values.
Lower 25% | Q1 | Middle 50% | Q3 | Upper 25%
Quartiles – The method of medians
• The first quartile Q1 is the median of the data values below Q2, and the third quartile Q3 is the median of the data values above Q2.
Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%
• For first half of data, 50% above, 50% below Q1.
• For second half of data, 50% above, 50% below Q3.
Method of Medians
• For small data sets, find quartiles using the method of medians:
Step 1: Sort the observations.
Step 2: Find the median Q2.
Step 3: Find the median of the data values that lie below Q2.
Step 4: Find the median of the data values that lie above Q2.
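The four steps can be sketched in Python as follows. One assumption is made loudly here: the halves are split by position in the sorted data, and when n is odd the middle observation is left out of both halves. The function name and both samples are hypothetical.

```python
from statistics import median


def quartiles_by_medians(values):
    """Return (Q1, Q2, Q3) using the method of medians; needs at least 2 observations."""
    data = sorted(values)                        # Step 1: sort the observations
    n = len(data)
    half = n // 2
    q2 = median(data)                            # Step 2: the median Q2
    lower = data[:half]                          # values below Q2 (by position)
    # For odd n, skip the middle observation (assumption stated in the lead-in)
    upper = data[half + 1:] if n % 2 else data[half:]
    q1 = median(lower)                           # Step 3: median of the lower half
    q3 = median(upper)                           # Step 4: median of the upper half
    return q1, q2, q3


even_sample = [22, 7, 35, 12, 40, 15, 30, 20]    # hypothetical, n = 8
odd_sample = [1, 2, 3, 4, 5, 6, 7]               # hypothetical, n = 7

print(quartiles_by_medians(even_sample))         # (13.5, 21.0, 32.5)
print(quartiles_by_medians(odd_sample))          # (2, 4, 6)
```

Other quartile conventions (for example, interpolation-based ones) can give slightly different values on the same data, so results may not match every software package.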
Example: P/E Ratios and Quartiles
• So, to summarize:
Lower 25% of P/E Ratios | Q1 = 27 | Second 25% of P/E Ratios | Q2 = 35.5 | Third 25% of P/E Ratios | Q3 = 40.5 | Upper 25% of P/E Ratios
• These quartiles express central tendency and dispersion. What is the interquartile range?
LO4-8: Make and interpret box plots.
• A useful tool of exploratory data analysis (EDA).
• Also called a box-and-whisker plot.
• Based on a five-number summary: Xmin, Q1, Q2, Q3, Xmax
• Consider the five-number summary for the previous P/E ratios example:
Xmin    Q1    Q2      Q3      Xmax
7       27    35.5    40.5    49
Box Plots
• The box plot is displayed visually, like this.
• A box plot shows variability and shape.
Box Plots: Fences and Unusual Data Values
• Use quartiles to detect unusual data points.
• The cutoff points are called fences and can be found using the following formulas:
               Inner fences          Outer fences
Lower fence    Q1 – 1.5 (Q3 – Q1)    Q1 – 3.0 (Q3 – Q1)
Upper fence    Q3 + 1.5 (Q3 – Q1)    Q3 + 3.0 (Q3 – Q1)
• Values outside the inner fences are unusual while those outside the outer fences are outliers.
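The fence formulas can be sketched as a small helper. The quartiles Q1 = 107 and Q3 = 126 and the value 170 are taken from the worked example in this section; the function names and the labels returned by `classify` are illustrative and follow the wording of the bullet above.

```python
def fences(q1, q3):
    """Return the inner and outer fences as (lower, upper) pairs."""
    iqr = q3 - q1                                # interquartile range
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }


def classify(x, q1, q3):
    """Label a value using the terminology of the slide (unusual vs. outlier)."""
    f = fences(q1, q3)
    if x < f["outer"][0] or x > f["outer"][1]:
        return "outlier"                         # outside the outer fences
    if x < f["inner"][0] or x > f["inner"][1]:
        return "unusual"                         # outside inner, inside outer
    return "typical"


print(fences(107, 126))
# {'inner': (78.5, 154.5), 'outer': (50.0, 183.0)}
print(classify(170, 107, 126))                   # unusual
```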
Box Plots: Fences and Unusual Data Values
• For example, consider the J.D. Power data (Q1 = 107, Q3 = 126):
               Inner fences                     Outer fences
Lower fence    107 – 1.5 (126 – 107) = 78.5     107 – 3.0 (126 – 107) = 50
Upper fence    126 + 1.5 (126 – 107) = 154.5    126 + 3.0 (126 – 107) = 183
• There is one outlier (170) that lies above the inner fence. There are no extreme outliers that exceed the outer fence.
Box Plots: Fences and Unusual Data Values
• Truncate the whisker at the fences and display unusual values and outliers as dots.
• Based on these fences, there is only one outlier.
Box Plots: Midhinge
• The average of the first and third quartiles.
• The name midhinge derives from the idea that, if the “box” were folded in half, it would resemble a “hinge”.
4.6 Correlation and Covariance
LO4-9: Calculate and interpret a correlation coefficient and covariance.
Correlation Coefficient
• The sample correlation coefficient (r) is a statistic that describes the degree of linearity between paired observations on two quantitative variables X and Y.
• Note: –1 ≤ r ≤ +1.
• Illustration of Correlation Coefficients
Covariance
The covariance of two random variables X and Y (denoted σXY )
measures the degree to which the values of X and Y change together.
The correlation coefficient is the covariance divided by the product of the standard deviations of X and Y.
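That relationship can be sketched with sample statistics. The code below assumes the sample covariance with divisor n – 1 (the population version divides by N instead); the function names and the paired data are hypothetical.

```python
import math


def sample_covariance(x, y):
    """Sample covariance of paired observations (divisor n - 1)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)


def sample_correlation(x, y):
    """Covariance divided by the product of the two sample standard deviations."""
    sx = math.sqrt(sample_covariance(x, x))      # covariance of x with itself = variance
    sy = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (sx * sy)


x = [1, 2, 3, 4, 5]                              # hypothetical paired data
y = [2, 4, 5, 4, 5]

print(sample_covariance(x, y))                   # 1.5
print(round(sample_correlation(x, y), 4))        # 0.7746
```

Unlike the covariance, which carries the units of X times the units of Y, the resulting r is unit-free and always falls between –1 and +1.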
4.7 Grouped Data
LO4-10: Calculate the mean and standard deviation from grouped data.
Weighted Mean
Group Mean and Standard Deviation
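The grouped-data formulas are not reproduced in this transcript, so the sketch below assumes the common textbook approach: treat every observation in a class as sitting at the class midpoint m_j, weight by the class frequency f_j, and use divisor n – 1 for the standard deviation. The function name, classes, and frequencies are hypothetical.

```python
import math


def grouped_mean_std(midpoints, freqs):
    """Approximate mean and sample std. deviation from class midpoints and frequencies."""
    n = sum(freqs)
    mean = sum(f * m for m, f in zip(midpoints, freqs)) / n      # weighted mean
    var = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs)) / (n - 1)
    return mean, math.sqrt(var)


# Hypothetical classes 0-10, 10-20, 20-30 with frequencies 2, 5, 3
midpoints = [5, 15, 25]
freqs = [2, 5, 3]

mean, std = grouped_mean_std(midpoints, freqs)
print(mean, round(std, 3))                       # 16.0 7.379
```

Because the midpoints stand in for the raw values, the results are approximations and will generally differ somewhat from statistics computed on ungrouped data.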
4.8 Skewness and Kurtosis
LO4-11: Assess skewness and kurtosis in a sample.
Skewness
Kurtosis