How spread out are the data values?
McGraw-Hill/Irwin
Chapter 4
Descriptive Statistics
Numerical Description
Central Tendency
Dispersion
Standardized Data
Percentiles, Quartiles, and Box Plots
Correlation
Copyright © 2010 by The McGraw-Hill Companies, Inc. All rights reserved.
Numerical Description
• Three key characteristics of numerical data:
Characteristic: Interpretation
Central Tendency: Where are the data values concentrated? What seem to be typical or middle data values?
Dispersion: How much variation is there in the data? How spread out are the data values? Are there unusual values?
Shape: Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?
Central Tendency
Six Measures of Central Tendency
Statistic: Mean
Formula: \( \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \)
Excel Formula: =AVERAGE(Data)
Pro: Familiar and uses all the sample information.
Con: Influenced by extreme values.

Statistic: Median
Formula: Middle value in sorted array
Excel Formula: =MEDIAN(Data)
Pro: Robust when extreme data values exist.
Con: Ignores extremes and can be affected by gaps in data values.
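As a rough illustration of how an extreme value affects the mean but not the median, the following Python sketch uses the standard `statistics` module. The sample values are hypothetical and chosen only to include one extreme observation.

```python
import statistics

# Hypothetical sample; the last value is a deliberately extreme observation.
data = [12, 15, 15, 18, 20, 22, 95]

mean_value = statistics.mean(data)      # counterpart of =AVERAGE(Data)
median_value = statistics.median(data)  # counterpart of =MEDIAN(Data)

# The single extreme value pulls the mean well above the median.
print("mean:", mean_value)
print("median:", median_value)
```

Dropping the value 95 from the list moves the mean much more than the median, which is the sense in which the median is robust.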
Central Tendency
Six Measures of Central Tendency
Statistic: Mode
Formula: Most frequently occurring data value
Excel Formula: =MODE(Data)
Pro: Useful for attribute data or discrete data with a small range.
Con: May not be unique, and is not helpful for continuous data.

Statistic: Midrange
Formula: \( \frac{x_{\min} + x_{\max}}{2} \)
Excel Formula: =0.5*(MIN(Data)+MAX(Data))
Pro: Easy to understand and calculate.
Con: Influenced by extreme values and ignores most data values.
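A minimal Python sketch of the mode and midrange, assuming a small hypothetical set of discrete values. `statistics.multimode` is used here because it returns every most-frequent value, which makes the "may not be unique" caveat visible.

```python
import statistics

# Hypothetical discrete data with a small range.
data = [2, 3, 3, 4, 4, 4, 5, 9]

# multimode returns a list of all most-frequent values.
modes = statistics.multimode(data)

# Midrange: halfway between the smallest and largest values,
# mirroring =0.5*(MIN(Data)+MAX(Data)).
midrange = 0.5 * (min(data) + max(data))

# A data set with a tie has more than one mode.
tied_modes = statistics.multimode([1, 1, 2, 2])

print(modes, midrange, tied_modes)
```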
Central Tendency
Six Measures of Central Tendency
Statistic: Geometric mean (G)
Formula: \( G = \sqrt[n]{x_1 x_2 \cdots x_n} \)
Excel Formula: =GEOMEAN(Data)
Pro: Useful for growth rates and mitigates high extremes.
Con: Less familiar and requires positive data.

Statistic: Trimmed mean
Formula: Same as the mean except omit highest and lowest k% of data values (e.g., 5%)
Excel Formula: =TRIMMEAN(Data, Percent)
Pro: Mitigates effects of extreme values.
Con: Excludes some data values that could be relevant.
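The two measures above can be sketched in Python as follows. The helper `trimmed_mean` is hypothetical (not a library function) and assumes that k% is removed from each end, with the count rounded down. Excel's TRIMMEAN treats its Percent argument as the total fraction removed across both ends, so the two are not interchangeable. All data values are made up for illustration.

```python
import math
import statistics


def trimmed_mean(values, k_percent):
    """Mean after dropping the lowest and highest k_percent of the sorted values.

    Assumption of this sketch: k_percent is cut from EACH end, rounded down.
    """
    ordered = sorted(values)
    cut = math.floor(len(ordered) * k_percent / 100)
    kept = ordered[cut:len(ordered) - cut]
    if not kept:
        raise ValueError("k_percent removes every data value")
    return statistics.mean(kept)


# Hypothetical growth factors (all positive, as the geometric mean requires).
growth_factors = [1.10, 1.20, 0.90, 1.05]
g = statistics.geometric_mean(growth_factors)

# Hypothetical data with one high extreme.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
tm = trimmed_mean(data, 10)  # drops the value 1 and the value 100

print("geometric mean:", g)
print("10% trimmed mean:", tm, "vs. ordinary mean:", statistics.mean(data))
```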
Central Tendency
Skewness
• Compare the mean and median, or look at a histogram, to determine the degree of skewness.
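The mean-versus-median comparison can be sketched as a simple rule of thumb in Python. The function name `skew_direction` and its `tolerance` parameter are hypothetical. This is only a rough check, not a formal skewness statistic, and a histogram remains the better guide.

```python
import statistics


def skew_direction(values, tolerance=0.0):
    """Rule of thumb: a mean above the median suggests right skew, below suggests left skew."""
    mean_value = statistics.mean(values)
    median_value = statistics.median(values)
    if mean_value - median_value > tolerance:
        return "skewed right"
    if median_value - mean_value > tolerance:
        return "skewed left"
    return "roughly symmetric"


# Hypothetical samples.
print(skew_direction([1, 2, 3, 4, 50]))   # high extreme pulls the mean up
print(skew_direction([1, 2, 3, 4, 5]))    # mean equals median
print(skew_direction([-50, 1, 2, 3, 4]))  # low extreme pulls the mean down
```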
Dispersion
• Variation is the “spread” of data points about
the center of the distribution in a sample.
Consider the following measures of dispersion:
Measures of Variation
Statistic: Range
Formula: \( x_{\max} - x_{\min} \)
Excel: =MAX(Data)-MIN(Data)
Pro: Easy to calculate.
Con: Sensitive to extreme data values.

Statistic: Variance (\( s^2 \))
Formula: \( s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \)
Excel: =VAR(Data)
Pro: Plays a key role in mathematical statistics.
Con: Non-intuitive meaning.
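To make the sample variance formula concrete, here is a short Python sketch that computes it by hand (sum of squared deviations divided by n - 1) and checks the result against `statistics.variance`. The data values are hypothetical.

```python
import statistics

# Hypothetical sample.
data = [4, 8, 6, 5, 7]

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# Sample variance by hand: squared deviations from the mean, divided by n - 1.
n = len(data)
x_bar = sum(data) / n
s2_manual = sum((x - x_bar) ** 2 for x in data) / (n - 1)

# The library function uses the same n - 1 denominator.
s2_library = statistics.variance(data)

print(data_range, s2_manual, s2_library)
```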
Dispersion
Measures of Variation
Statistic: Standard deviation (\( s \))
Formula: \( s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \)
Excel: =STDEV(Data)
Pro: Most common measure. Uses same units as the raw data ($, £, ¥, etc.).
Con: Non-intuitive meaning.

Statistic: Coefficient of variation (CV)
Formula: \( CV = 100 \times \frac{s}{\bar{x}} \)
Excel: None
Pro: Measures relative variation in percent, so data sets can be compared.
Con: Requires nonnegative data.
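Since no Excel function is listed for the coefficient of variation, a small Python sketch may help. The function name `coefficient_of_variation` is hypothetical, the data are made up, and the check for a positive mean is an added assumption of this sketch to avoid a meaningless or undefined ratio.

```python
import statistics


def coefficient_of_variation(values):
    """CV in percent: 100 * s / x-bar, using the sample standard deviation."""
    x_bar = statistics.mean(values)
    if x_bar <= 0:
        raise ValueError("CV is only meaningful here for a positive mean")
    return 100 * statistics.stdev(values) / x_bar


# Hypothetical measurements, and the same measurements in units ten times smaller.
small_units = [4, 8, 6, 5, 7]
large_units = [40, 80, 60, 50, 70]

s_small = statistics.stdev(small_units)
s_large = statistics.stdev(large_units)
cv_small = coefficient_of_variation(small_units)
cv_large = coefficient_of_variation(large_units)

# The standard deviations differ by a factor of ten, but the CVs agree.
print(s_small, s_large)
print(cv_small, cv_large)
```

Because the CV is unit-free, it stays the same when the data are rescaled, which is why it can be used to compare data sets measured in different units.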
Dispersion
Measures of Variation
Statistic: Mean absolute deviation (MAD)
Formula: \( MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} \)
Excel: =AVEDEV(Data)
Pro: Easy to understand.
Con: Lacks “nice” theoretical properties.
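The mean absolute deviation can be sketched in a few lines of Python. The function name is hypothetical and the data are made up. Note that the divisor is n, not n - 1.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of each value from the mean (divides by n)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum(abs(x - x_bar) for x in values) / n


# Hypothetical sample: deviations from the mean of 6 are 2, 2, 0, 1, 1.
data = [4, 8, 6, 5, 7]
mad = mean_absolute_deviation(data)
print(mad)
```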
Standardized Data
Chebyshev’s Theorem
Standardized Data
The Empirical Rule
• Are there any unusual values or outliers?
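[Figure: the sorted data values (7, 8, . . . , 48, 55, 68, 91) plotted on an axis with tick marks at -19.5, -5.4, 8.6, 22.72, 36.8, 50.9, and 65.0. Regions on both sides of the axis are labeled "Unusual" and "Outliers".]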
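One way to read the axis values is as the mean (22.72) plus or minus one, two, and three standard deviations. Under that reading, a hedged Python sketch of the classification looks like this. The function name `classify` is hypothetical. The standard deviation of 14.08 is an assumption inferred from the tick spacing (36.8 - 22.72). The cutoffs of 2 and 3 standard deviations for "unusual" and "outlier" are the common convention, assumed here rather than stated in the text.

```python
def classify(x, mean, sd):
    """Label a value by its distance from the mean in standard deviations."""
    z = abs(x - mean) / sd
    if z > 3:
        return "outlier"
    if z > 2:
        return "unusual"
    return "typical"


MEAN = 22.72  # center tick on the axis
SD = 14.08    # assumption: inferred from the tick spacing, not given directly

for value in (7, 48, 55, 68, 91):
    print(value, classify(value, MEAN, SD))
```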
Standardized Data
Defining a Standardized Variable
• A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean.
Standardization formula for a population: \( z_i = \frac{x_i - \mu}{\sigma} \)
Standardization formula for a sample: \( z_i = \frac{x_i - \bar{x}}{s} \)
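A minimal Python sketch of the sample standardization formula, using hypothetical data. The function name `standardize` is an illustration, not a library call.

```python
import statistics


def standardize(values):
    """Return z-scores: each value's distance from the sample mean in units of s."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - x_bar) / s for x in values]


# Hypothetical sample.
data = [4, 8, 6, 5, 7]
z_scores = standardize(data)
print(z_scores)
```

By construction, the standardized values have a mean of 0 and a sample standard deviation of 1, up to floating-point rounding.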
Percentiles and Quartiles
Percentiles
• Percentiles are data that have been divided into 100
groups.
• For example, suppose you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.
• Deciles are data that have been divided into
10 groups.
• Quintiles are data that have been divided into
5 groups.
• Quartiles are data that have been divided into
4 groups.
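The percentile example can be sketched in Python as the share of scores strictly below a given score. The helper `percent_below` and the list of 100 scores are hypothetical. `statistics.quantiles` is shown only to illustrate that dividing data into n groups needs n - 1 cut points. Its default interpolation method is one of several conventions and may not match other software.

```python
import statistics


def percent_below(scores, your_score):
    """Percentage of scores strictly below your_score."""
    below = sum(1 for s in scores if s < your_score)
    return 100 * below / len(scores)


# Hypothetical test scores: 1, 2, ..., 100.
scores = list(range(1, 101))
rank = percent_below(scores, 84)  # 83 of the 100 scores are below 84

# Cut points that divide the data into 4, 5, and 10 groups.
quartile_cuts = statistics.quantiles(scores, n=4)
quintile_cuts = statistics.quantiles(scores, n=5)
decile_cuts = statistics.quantiles(scores, n=10)

print(rank, len(quartile_cuts), len(quintile_cuts), len(decile_cuts))
```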
Box Plots
• A useful tool of exploratory data analysis
(EDA).
• Also called a box-and-whisker plot.
• Based on a five-number summary:
Xmin, Q1, Q2, Q3, Xmax
• Consider the following five-number summary:
Xmin = 7, Q1 = 14, Q2 = 19, Q3 = 26, Xmax = 91
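A hedged Python sketch of computing a five-number summary follows. The nine data values are hypothetical, chosen so that the result is 7, 14, 19, 26, 91. The "inclusive" quartile method is an assumption of this sketch, since quartile conventions differ between textbooks and software.

```python
import statistics


def five_number_summary(values):
    """Return (Xmin, Q1, Q2, Q3, Xmax) using the 'inclusive' quartile method."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


# Hypothetical data; not the data set behind the summary in the text.
data = [7, 10, 14, 16, 19, 22, 26, 40, 91]
summary = five_number_summary(data)
print(summary)
```

The long distance from Q3 to Xmax, compared with the distance from Xmin to Q1, is what a box plot would show as a long right whisker.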
Box Plots
• The box plot is displayed visually, like this.
• A box plot shows central tendency, dispersion, and shape.
Correlation
Correlation Coefficient
• The sample correlation coefficient is a statistic
that describes the degree of linearity between
paired observations on two quantitative
variables X and Y.
Its range is -1 ≤ r ≤ +1.
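As an illustration of the correlation coefficient and its range, here is a Python sketch that computes r from paired observations using the usual sums of squares and cross-products. The function name `sample_correlation` is hypothetical and the data are made up.

```python
import math


def sample_correlation(xs, ys):
    """Sample correlation r between paired observations xs and ys."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    # Undefined (division by zero) if either variable is constant.
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired data.
x = [1, 2, 3, 4, 5]
r_up = sample_correlation(x, [2, 4, 6, 8, 10])    # perfectly linear, increasing
r_down = sample_correlation(x, [10, 8, 6, 4, 2])  # perfectly linear, decreasing
r_mixed = sample_correlation(x, [2, 1, 4, 3, 5])  # positive but not perfect

print(r_up, r_down, r_mixed)
```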