Transcript LO3-1

Chapter 3
Descriptive Statistics: Numerical
Methods
McGraw-Hill/Irwin
Copyright © 2014 by The McGraw-Hill Companies, Inc. All rights reserved.
Descriptive Statistics
3.1 Describing Central Tendency
3.2 Measures of Variation
3.3 Percentiles, Quartiles, and Box-and-Whiskers Displays
3.4 Covariance, Correlation, and the Least
Squares Line (Optional)
3.5 Weighted Means and Grouped Data
(Optional)
3.6 The Geometric Mean (Optional)
LO3-1: Compute and
interpret the mean,
median, and mode.
3.1 Describing Central Tendency

In addition to describing the shape of a distribution,
we want to describe the data set’s central tendency
◦ A measure of central tendency represents the center or
middle of the data
◦ The population mean (μ) is the average of the population
measurements


Population parameter: a number calculated from all
the population measurements that describes some
aspect of the population
Sample statistic: a number calculated using the
sample measurements that describes some aspect of
the sample
Measures of Central Tendency
Mean: The average or expected value
Median, Md: The value of the middle point of the ordered measurements
Mode, Mo: The most frequent value
The Mean
Population: X1, X2, …, XN
Sample: x1, x2, …, xn
Population Mean
$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
Sample Mean
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
The Sample Mean
For a sample of size n, the sample mean ($\bar{x}$) is defined as
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
and is a point estimate of the population mean μ
• It is the value to expect, on average and in the long run
Example 3.1 Car Mileage Case: Estimating
Mileage
Sample mean for first five car mileages from
Table 3.1
30.8, 31.7, 30.1, 31.6, 32.1
$\bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{5}$
$\bar{x} = \frac{30.8 + 31.7 + 30.1 + 31.6 + 32.1}{5} = \frac{156.3}{5} = 31.26$
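The arithmetic in this example can be checked with a short Python sketch. The variable names are illustrative only; the data are the five mileages listed in the example.

```python
from statistics import mean

# First five car mileages from Table 3.1, as listed in the example
mileages = [30.8, 31.7, 30.1, 31.6, 32.1]

# Sample mean: sum of the measurements divided by the sample size n
x_bar = sum(mileages) / len(mileages)

print(round(x_bar, 2))            # 31.26
print(round(mean(mileages), 2))   # 31.26, same result from the standard library
```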
The Median
The median Md is a value such that 50% of
all measurements, after having been arranged
in numerical order, lie above (or below) it
◦ If the number of measurements is odd, the median
is the middlemost measurement in the ordering
◦ If the number of measurements is even, the
median is the average of the two middlemost
measurements in the ordering
Example 3.1 The Car Mileage Case

First five observations from Table 3.1:
30.8, 31.7, 30.1, 31.6, 32.1

In order: 30.1, 30.8, 31.6, 31.7, 32.1

There is an odd number of measurements, so the
median is the middle one, 31.6
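A minimal Python sketch of the odd/even rule for the median follows. The helper name median_by_hand is hypothetical; the four-value case simply drops the last mileage to illustrate the even rule and is not taken from the text.

```python
from statistics import median

def median_by_hand(values):
    """Median via the odd/even rule: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the middlemost measurement
        return ordered[mid]
    # Even count: average of the two middlemost measurements
    return (ordered[mid - 1] + ordered[mid]) / 2

mileages = [30.8, 31.7, 30.1, 31.6, 32.1]
print(median_by_hand(mileages))        # 31.6

# Illustration only: with four values the two middle ones (30.8 and 31.6) are averaged
even_case = median_by_hand(mileages[:4])
print(round(even_case, 2))
```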
The Mode
The mode Mo of a population or sample of
measurements is the measurement that occurs most
frequently
◦ Modes are the values that are observed “most typically”
◦ Sometimes the highest frequency occurs at two or more values
▪ If there are two modes, the data are bimodal
▪ If there are more than two modes, the data are multimodal
◦ When data are in classes, the class with the highest
frequency is the modal class
▪ The tallest box in the histogram
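As a sketch of how modes can be found in practice, the Python standard library offers statistics.multimode (Python 3.8 or later). The helper describe_modes and the data below are hypothetical, chosen only to show a bimodal and a unimodal case.

```python
from statistics import multimode

def describe_modes(values):
    """Return the list of modes and a label for how many modes there are."""
    modes = multimode(values)   # all values tied for the highest frequency
    if len(modes) == 1:
        label = "unimodal"
    elif len(modes) == 2:
        label = "bimodal"
    else:
        label = "multimodal"
    return modes, label

# Hypothetical measurements, not from the text
print(describe_modes([3, 4, 4, 5, 5, 6]))   # ([4, 5], 'bimodal')
print(describe_modes([1, 2, 2, 3]))         # ([2], 'unimodal')
```

Note that when every value occurs exactly once, multimode returns all of them, so the label is only meaningful when some value actually repeats.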
Relationships Among Mean, Median
and Mode
Figure 3.3
LO3-2: Compute and
interpret the range,
variance, and standard
deviation.


3.2 Measures of Variation
Knowing the measures of central tendency is
not enough
Both of the distributions in Figure 3.13 have
identical measures of central tendency
Measures of Variation
Range: Largest minus the smallest measurement
Variance: The average of the squared deviations of all the population measurements from the population mean
Standard Deviation: The square root of the population variance
The Range

Largest minus smallest

Measures the interval spanned by all the data

For the left side of Figure 3.13, largest is 5
and smallest is 3

Range is 5 – 3 = 2 days
Population Variance and Standard
Deviation

The population variance (σ²) is the average
of the squared deviations of the individual
population measurements from the
population mean (µ)

The population standard deviation (σ) is the
positive square root of the population
variance
Variance

For a population of size N, the population
variance σ² is:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}$
For a sample of size n, the sample variance s²
is:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$
Standard Deviation

Population standard deviation (σ):
$\sigma = \sqrt{\sigma^2}$
Sample standard deviation (s):
$s = \sqrt{s^2}$
Example: Chris’s Class Sizes This
Semester
Data points are: 60, 41, 15, 30, 34
Mean is 36 (180/5)
Variance is:
$\sigma^2 = \frac{(60 - 36)^2 + (41 - 36)^2 + (15 - 36)^2 + (30 - 36)^2 + (34 - 36)^2}{5}$
$= \frac{576 + 25 + 441 + 36 + 4}{5} = \frac{1082}{5} = 216.4$
Standard deviation is:
$\sigma = \sqrt{216.4} = 14.71$
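The class-size example treats the five values as a whole population, so the sum of squared deviations is divided by N. A short Python sketch of that calculation, with illustrative variable names, is:

```python
from math import sqrt
from statistics import pvariance, pstdev

# Chris's class sizes, treated as a complete population
class_sizes = [60, 41, 15, 30, 34]

N = len(class_sizes)
mu = sum(class_sizes) / N                                  # population mean
squared_deviations = [(x - mu) ** 2 for x in class_sizes]  # 576, 25, 441, 36, 4
sigma_squared = sum(squared_deviations) / N                # divide by N for a population
sigma = sqrt(sigma_squared)

print(mu, sigma_squared, round(sigma, 2))                  # 36.0 216.4 14.71
```

The standard library functions pvariance and pstdev use the same divide-by-N definition and can serve as a cross-check.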
Example: Sample Variance and
Standard Deviation



Example 3.6: data for first five car mileages
from Table 3.1: 30.8, 31.7, 30.1, 31.6, 32.1
The sample mean is 31.26
The variance and standard deviation are:
$s^2 = \frac{\sum_{i=1}^{5} (x_i - \bar{x})^2}{5 - 1}$
$= \frac{(30.8 - 31.26)^2 + (31.7 - 31.26)^2 + (30.1 - 31.26)^2 + (31.6 - 31.26)^2 + (32.1 - 31.26)^2}{4}$
$= \frac{2.572}{4} = 0.643$
$s = \sqrt{s^2} = \sqrt{0.643} = 0.8019$
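For a sample, the divisor is n - 1 rather than n. The sketch below repeats the mileage calculation in Python; the variable names are illustrative.

```python
from statistics import variance, stdev

# First five car mileages from Table 3.1
mileages = [30.8, 31.7, 30.1, 31.6, 32.1]

n = len(mileages)
x_bar = sum(mileages) / n
# Sample variance: squared deviations from the sample mean, divided by n - 1
s_squared = sum((x - x_bar) ** 2 for x in mileages) / (n - 1)
s = s_squared ** 0.5

print(round(s_squared, 3), round(s, 4))   # 0.643 0.8019
```

The standard library functions variance and stdev also divide by n - 1, so they should agree with the hand calculation.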
LO3-3: Use the
Empirical
Rule and Chebyshev’s
Theorem to describe
variation.
The Empirical Rule for Normal
Populations
If a population has mean µ and standard
deviation σ and is described by a normal
curve, then
• 68.26% of the population measurements lie
within one standard deviation of the mean:
[µ-σ, µ+σ]
• 95.44% lie within two standard deviations of
the mean: [µ-2σ, µ+2σ]
• 99.73% lie within three standard deviations
of the mean: [µ-3σ, µ+3σ]
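As a rough check of these percentages, the share of a normal population within k standard deviations of the mean can be computed with statistics.NormalDist (Python 3.8 or later). The computed values differ slightly in the second decimal from the figures quoted above, which presumably come from rounded normal-table areas.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1

def percent_within(k):
    """Percent of a normal population within k standard deviations of the mean."""
    return 100 * (standard_normal.cdf(k) - standard_normal.cdf(-k))

for k in (1, 2, 3):
    print(k, round(percent_within(k), 2))
# 1 68.27
# 2 95.45
# 3 99.73
```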

Chebyshev’s Theorem



Let µ and σ be a population’s mean and
standard deviation. Then, for any value k > 1,
at least 100(1 - 1/k²)% of the population
measurements lie in the interval [µ-kσ,
µ+kσ]
Only practical for a non-mound-shaped
population distribution that is not very
skewed
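The Chebyshev bound is simple to evaluate for a given k. The following Python sketch uses a hypothetical helper name, chebyshev_min_percent, and shows the bound for k = 2 and k = 3.

```python
def chebyshev_min_percent(k):
    """Minimum percent of measurements within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Theorem requires k > 1")
    return 100 * (1 - 1 / k ** 2)

print(chebyshev_min_percent(2))             # 75.0
print(round(chebyshev_min_percent(3), 2))   # 88.89
```

Unlike the Empirical Rule, this is a lower bound that holds for any shape of distribution, which is why the percentages are smaller.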
z Scores

For any x in a population or sample, the associated z
score is
$z = \frac{x - \text{mean}}{\text{standard deviation}}$

The z score is the number of standard deviations
that x is from the mean
◦ A positive z score is for x above (greater than) the mean
◦ A negative z score is for x below (less than) the mean
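A small Python sketch of the z score follows. It reuses the class-size population from the earlier example (mean 36, variance 216.4) purely as an illustration; the function name z_score is hypothetical.

```python
def z_score(x, mean, standard_deviation):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / standard_deviation

# Illustration using the class-size population: mean 36, variance 216.4
mu = 36
sigma = 216.4 ** 0.5

print(round(z_score(60, mu, sigma), 2))   # 1.63, above the mean
print(round(z_score(15, mu, sigma), 2))   # -1.43, below the mean
print(z_score(36, mu, sigma))             # 0.0, exactly at the mean
```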
Coefficient of Variation

Measures the size of the standard deviation relative
to the size of the mean
$\text{Coefficient of variation} = \frac{\text{Standard deviation}}{\text{Mean}} \times 100\%$

Used to:
◦ Compare the relative variabilities of values about the mean
◦ Compare the relative variability of populations or samples
with different means and different standard deviations
◦ Measure risk
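To illustrate comparing relative variability, the sketch below computes the coefficient of variation for the two data sets used earlier in the chapter. Pairing them is only an illustration, not a comparison made in the text; the function name is hypothetical.

```python
def coefficient_of_variation(standard_deviation, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return standard_deviation / mean * 100

# Class sizes: population mean 36, population variance 216.4
cv_class_sizes = coefficient_of_variation(216.4 ** 0.5, 36)
# Car mileages: sample mean 31.26, sample variance 0.643
cv_mileages = coefficient_of_variation(0.643 ** 0.5, 31.26)

print(round(cv_class_sizes, 2))   # 40.86
print(round(cv_mileages, 2))      # 2.57
```

The class sizes vary far more relative to their mean than the mileages do, even though the two data sets are measured in different units.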
LO3-4: Compute and
interpret percentiles,
quartiles, and box-and-whiskers displays.
3.3 Percentiles, Quartiles, and Box-and-Whiskers Displays
For a set of measurements arranged in increasing
order, the pth percentile is a value such that p
percent of the measurements fall at or below the
value and (100-p) percent of the measurements fall
at or above the value




The first quartile Q1 is the 25th percentile
The second quartile (median) is the 50th percentile
The third quartile Q3 is the 75th percentile
The interquartile range IQR is Q3 - Q1
Calculating Percentiles
1. Arrange the measurements in increasing
order
2. Calculate the index i = (p/100)n, where p is
the percentile to find
3. (a) If i is not an integer, round up: the
next integer greater than i denotes the position
of the pth percentile
(b) If i is an integer, the pth percentile is the
average of the measurements in positions i and
i+1
Percentile Example






i = (10/100)(12) = 1.2
Not an integer, so round up to 2
The 10th percentile is the measurement in the second position: 11,070
i = (25/100)(12) = 3
An integer, so average the values in positions 3 and 4
The 25th percentile is (18,211 + 26,817)/2 = 22,514
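The percentile procedure can be sketched in Python as follows. The function name and the twelve data values are hypothetical placeholders (the full data set behind the example is not reproduced here); they are chosen so that the index arithmetic matches the example, with i = 1.2 for the 10th percentile and i = 3 for the 25th.

```python
def percentile(values, p):
    """pth percentile using the round-up / average-two-positions rule."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)                  # step 1: increasing order
    n = len(ordered)
    # Step 2: i = (p/100)n, split into its whole part and a remainder
    whole, remainder = divmod(p * n, 100)
    whole = int(whole)
    if remainder != 0:
        # Step 3(a): i is not an integer, so use position whole + 1
        # (1-based position whole + 1 is 0-based index whole)
        return ordered[whole]
    # Step 3(b): i is an integer, so average positions i and i + 1
    return (ordered[whole - 1] + ordered[whole]) / 2

# Hypothetical data with n = 12, not the values from the example
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]

print(percentile(data, 10))   # 20: i = 1.2 rounds up to position 2
print(percentile(data, 25))   # 35.0: i = 3, average of positions 3 and 4
```

Statistical software often uses interpolation rules that differ from this one, so library percentile functions may return slightly different values.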
Five Number Summary
1. The smallest measurement
2. The first quartile, Q1
3. The median, Md
4. The third quartile, Q3
5. The largest measurement

Displayed visually using a box-and-whiskers plot
Box-and-Whiskers Plots

The box plots the:
◦ First quartile, Q1
◦ Median, Md
◦ Third quartile, Q3
◦ Inner fences
◦ Outer fences

Inner fences
◦ Located 1.5 × IQR away from the quartiles:
▪ Q1 – (1.5 × IQR)
▪ Q3 + (1.5 × IQR)

Outer fences
◦ Located 3 × IQR away from the quartiles:
▪ Q1 – (3 × IQR)
▪ Q3 + (3 × IQR)
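The fence locations follow directly from the quartiles. A minimal Python sketch, using a hypothetical function name and hypothetical quartile values (Q1 = 20, Q3 = 40), is:

```python
def fences(q1, q3):
    """Inner and outer fences as (lower, upper) pairs, from the quartiles."""
    iqr = q3 - q1                     # interquartile range
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3 * iqr, q3 + 3 * iqr),
    }

# Hypothetical quartiles, not from the text
result = fences(20, 40)
print(result["inner"])   # (-10.0, 70.0)
print(result["outer"])   # (-40, 100)
```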
Box-and-Whiskers Plots (Continued)

The “whiskers” are dashed lines that plot the
range of the data
A dashed line drawn from the box below Q1
down to the smallest measurement
Another dashed line drawn from the box
above Q3 up to the largest measurement
Figures 3.17 and 3.18
Outliers

Outliers are measurements that are very
different from other measurements
◦ They are either much larger or much smaller than
most of the other measurements

Outliers lie beyond the limits of the box-and-whiskers plot
◦ Measurements less than the lower limit or greater
than the upper limit
LO3-5: Compute and
interpret covariance,
correlation, and the
least squares line
(Optional).
3.4 Covariance, Correlation, and the
Least Squares Line (Optional)


When points on a scatter plot seem to
fluctuate around a straight line, there is a
linear relationship between x and y
A measure of the strength of a linear
relationship is the covariance $s_{xy}$
$s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
Covariance

A positive covariance indicates a positive
linear relationship between x and y
◦ As x increases, y increases

A negative covariance indicates a negative
linear relationship between x and y
◦ As x increases, y decreases
Correlation Coefficient

Magnitude of covariance does not indicate
the strength of the relationship
◦ Magnitude depends on the unit of measurement
used for the data

Correlation coefficient (r) is a measure of the
strength of the relationship that does not
depend on the magnitude of the data
$r = \frac{s_{xy}}{s_x s_y}$
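The covariance and correlation formulas can be sketched in Python as below. The function names and the five (x, y) pairs are hypothetical and serve only to show the calculation.

```python
from statistics import mean, stdev

def sample_covariance(x, y):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    x_bar, y_bar = mean(x), mean(y)
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / (len(x) - 1)

def sample_correlation(x, y):
    """Sample correlation: covariance divided by the product of the sample standard deviations."""
    return sample_covariance(x, y) / (stdev(x) * stdev(y))

# Hypothetical paired measurements, not from the text
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_xy = sample_covariance(x, y)
r = sample_correlation(x, y)
print(s_xy)          # 1.5
print(round(r, 4))   # 0.7746
```

Rescaling x (for example, multiplying every value by 100) changes the covariance but leaves r unchanged, which is the point made above about units of measurement.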
Correlation Coefficient (Continued)

Sample correlation coefficient r is always
between -1 and +1
◦ Values near -1 show strong negative correlation
◦ Values near 0 show little or no linear correlation
◦ Values near +1 show strong positive correlation

Sample correlation coefficient is the point
estimate for the population correlation
coefficient ρ
Least Squares Line



If there is a linear relationship between x and
y, we might wish to predict y on the basis of x
This requires the equation of a line
describing the linear relationship
The line is calculated using the least squares method
◦ Discussed in detail in a later chapter

Need to find the slope (b1) and y-intercept (b0)
$b_1 = \frac{s_{xy}}{s_x^2}$
$b_0 = \bar{y} - b_1 \bar{x}$
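A short Python sketch of the slope and intercept formulas follows. The function name least_squares_line and the data are hypothetical; the sketch only applies the two formulas above and is not a full regression routine.

```python
from statistics import mean, variance

def least_squares_line(x, y):
    """Return (b0, b1): intercept and slope from the covariance and variance formulas."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    b1 = s_xy / variance(x)        # slope: covariance over the sample variance of x
    b0 = y_bar - b1 * x_bar        # intercept: line passes through (x_bar, y_bar)
    return b0, b1

# Hypothetical paired measurements, not from the text
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(x, y)
print(round(b0, 2), round(b1, 2))   # 2.2 0.6

# Predicted y for a new x value of 6
print(round(b0 + b1 * 6, 2))        # 5.8
```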
LO3-6: Compute and
interpret weighted
means and the mean
and standard deviation
of grouped data
(Optional).

3.5 Weighted Means and Grouped
Data (Optional)
Sometimes, some measurements are more important
than others
◦ Assign numerical “weights” to the data
▪ Weights measure the relative importance of each value

Calculate the weighted mean as
$\frac{\sum w_i x_i}{\sum w_i}$
where wi is the weight assigned to the ith
measurement xi
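The weighted mean can be sketched in Python as follows. The function name, scores, and weights are hypothetical and stand in for any measurements with unequal importance.

```python
def weighted_mean(values, weights):
    """Sum of weight times value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Hypothetical scores and weights, not from the text
scores = [80, 90, 70]
weights = [2, 3, 5]

print(weighted_mean(scores, weights))     # 78.0
print(weighted_mean(scores, [1, 1, 1]))   # 80.0, equal weights give the ordinary mean
```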
Descriptive Statistics for Grouped Data



Data already categorized into a frequency
distribution or a histogram is called grouped
data
Can calculate the mean and variance even
when the raw data is not available
Calculations are slightly different for data
from a sample and data from a population
Descriptive Statistics for Grouped Data
(Sample)

Sample mean for grouped data:
$\bar{x} = \frac{\sum f_i M_i}{\sum f_i} = \frac{\sum f_i M_i}{n}$
Sample variance for grouped data:
$s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1}$
• fi is the frequency for class i
• Mi is the midpoint of class i
• n = Σfi = sample size
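The grouped-data formulas for a sample can be sketched in Python as below. The class midpoints and frequencies are hypothetical; the function name is illustrative. For a population, the same calculation would divide by N instead of n - 1.

```python
def grouped_sample_stats(midpoints, frequencies):
    """Return (mean, variance) of sample data summarized by class midpoints and frequencies."""
    n = sum(frequencies)                                   # sample size
    x_bar = sum(f * m for f, m in zip(frequencies, midpoints)) / n
    s_squared = sum(f * (m - x_bar) ** 2
                    for f, m in zip(frequencies, midpoints)) / (n - 1)
    return x_bar, s_squared

# Hypothetical frequency distribution, not from the text
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]

x_bar, s_squared = grouped_sample_stats(midpoints, frequencies)
print(x_bar)                  # 16.0
print(round(s_squared, 2))    # 54.44
```

Because every measurement in a class is represented by the class midpoint, these results approximate the values that would be obtained from the raw data.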
Descriptive Statistics for Grouped Data
(Population)

Population mean for grouped data:
$\mu = \frac{\sum f_i M_i}{\sum f_i} = \frac{\sum f_i M_i}{N}$

Population variance for grouped data:
$\sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N}$
• fi is the frequency for class i
• Mi is the midpoint of class i
• N = Σfi = population size
LO3-7: Compute and
interpret the geometric
mean (Optional).
3.6 The Geometric Mean (Optional)



For rates of return of an investment, use the
geometric mean to give the correct wealth at
the end of the investment
Suppose the rates of return (expressed as
decimal fractions) are R1, R2, …, Rn for
periods 1, 2, …, n
The mean of all these returns is
calculated as the geometric mean:
$R_g = \sqrt[n]{(1 + R_1)(1 + R_2) \cdots (1 + R_n)} - 1$