Transcript Ch3 - YSU
Chapter 3 – Descriptive Statistics
Numerical Measures
Chapter Outline
Measures of Central Location
Mean
Median
Mode
Percentile (Quartile, Quintile, etc.)
Measures of Variability
Range
Variance (Standard Deviation, Coefficient of Variation)
Recall
A sample is a subset of a population.
Numerical measures calculated for sample data are called sample statistics.
Numerical measures calculated for population data are called population parameters.
A sample statistic is referred to as the point estimator of the corresponding population parameter.
Mean
As a measure of central location, the mean is simply the arithmetic average of all the data values.
The sample mean x̄ is the point estimator of the population mean μ.
Sample Mean x̄
x̄ = (Σ xi) / n
The symbol Σ (called sigma) means ‘sum up’.
xi is the value of the ith observation in the sample.
n is the number of observations in the sample.
Population Mean μ
μ = (Σ xi) / N
The symbol Σ (called sigma) means ‘sum up’.
xi is the value of the ith observation in the population.
N is the number of observations in the population.
μ is pronounced ‘mew’.
Sample Mean
Example: Sales of Starbucks Stores
50 Starbucks stores are randomly chosen in New York City. The table below shows the sales of those stores in December 2012.

 95  108   67   99   93   77   86   93  119  118
 77  120   97  103   97   88   89  105  104  105
 97   78   97  106   95  100   97   87   93   82
 99   79   79   82   61  109   93   82  104   73
 89   88   93   93  109   90   88   98  101  101
Sample Mean
Example: Sales of Starbucks Stores
x̄ = (Σ xi) / n = 4,685 / 50 = 93.7
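As a minimal Python sketch of the sample-mean formula (using a small hypothetical sample rather than the 50 Starbucks values):

```python
import statistics

# Hypothetical sample of 8 observations
sample = [12, 14, 18, 19, 26, 27, 29, 30]

# Sample mean: x̄ = (Σ xi) / n
xbar = sum(sample) / len(sample)
print(xbar)                      # 21.875

# The standard library computes the same thing
print(statistics.mean(sample))   # 21.875
```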
Median
The median of a data set is the value in the middle when the data items are arranged in ascending order.
Whenever a data set has extreme values, the median is the preferred measure of central location.
The median is the measure of location most often reported for annual income and property value data.
A few extremely large incomes or property values can inflate the mean, since the calculation of the mean uses all the data items.
Median
For an odd number of observations (7 observations):
26 18 27 12 14 27 19
In ascending order: 12 14 18 19 26 27 27
The median is the middle value: Median = 19
Median
For an even number of observations (8 observations):
26 18 27 12 14 27 19 30
In ascending order: 12 14 18 19 26 27 27 30
The median is the average of the middle two values: Median = (19 + 26)/2 = 22.5
Mean vs. Median
As noted, extreme values can change the mean remarkably, while the median may be affected much less by extreme values. In that regard, the median is a better representative of central location.
12 14 18 19 26 27 27 30 280
For the previous example, the median is 22.5 and the mean is 21.6. If we add one large number (280) to the data, the median becomes 26 (the value in the middle), but the mean becomes 50.3. In this case we prefer the median to the mean as a measure of central location.
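The median examples above, and the mean’s sensitivity to the added 280, can be checked with Python’s statistics module:

```python
import statistics

data = [26, 18, 27, 12, 14, 27, 19]       # 7 observations
print(statistics.median(data))            # middle value after sorting: 19

data8 = data + [30]                       # 8 observations
print(statistics.median(data8))           # (19 + 26) / 2 = 22.5

# One extreme value barely moves the median but drags the mean upward
data9 = data8 + [280]
print(statistics.median(data9))           # 26
print(round(statistics.mean(data8), 1))   # 21.6
print(round(statistics.mean(data9), 1))   # 50.3
```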
Mode
The mode of a data set is the value that occurs most frequently.
The greatest frequency can occur at two or more different values.
If the data have exactly two modes, the data are bimodal.
If the data have more than two modes, the data are multimodal.
Caution: If the data are bimodal or multimodal, Excel’s MODE function will identify only a single mode.
Mode
12 14 18 19 26 27 27 30
For the example above, 27 shows up twice while all the other data values show up once. So the mode is 27.
Percentiles
A percentile provides information about how the data are spread over the interval from the smallest value to the largest value.
Admission test scores for colleges and universities are frequently reported in terms of percentiles.
The pth percentile of a data set is a value such that at least p percent of the items are less than or equal to this value and at least (100 − p) percent of the items are greater than or equal to this value.
The 50th percentile is simply the median.
Percentiles
Arrange the data in ascending order.
Compute the index i, the position of the pth percentile: i = (p/100)n
If i is not an integer, round up. The pth percentile is the value in the ith position.
If i is an integer, the pth percentile is the average of the values in positions i and i + 1.
Percentiles
Find the 75th percentile of the following data:
12 14 18 19 26 27 29 30
Note: the data are already in ascending order.
i = (p/100)n = (75/100)(8) = 6
Since i is an integer, average the 6th and 7th data values:
75th percentile = (27 + 29)/2 = 28
Percentiles
Find the 20th percentile of the following data:
12 14 18 19 26 27 29 30
Note: the data are already in ascending order.
i = (p/100)n = (20/100)(8) = 1.6, which is rounded up to 2.
So the 20th percentile is simply the 2nd data value, i.e. 14.
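The two-case rule above can be written as a short helper. This is a sketch (`pth_percentile` is my name, not a standard function), and note that spreadsheet and NumPy percentile functions use interpolation rules that can give slightly different answers:

```python
import math

def pth_percentile(data, p):
    """pth percentile using the i = (p/100)*n rule from the slides."""
    xs = sorted(data)
    i = (p / 100) * len(xs)
    if i != int(i):                   # i is not an integer: round up,
        return xs[math.ceil(i) - 1]   # take the value in the ith position
    i = int(i)
    return (xs[i - 1] + xs[i]) / 2    # i is an integer: average positions i, i+1

data = [12, 14, 18, 19, 26, 27, 29, 30]
print(pth_percentile(data, 75))  # i = 6, so (27 + 29) / 2 = 28.0
print(pth_percentile(data, 20))  # i = 1.6 -> position 2, so 14
print(pth_percentile(data, 50))  # the median: 22.5
```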
Quartiles
Quartiles are specific percentiles.
First Quartile = 25th percentile
Second Quartile = 50th percentile = Median
Third Quartile = 75th percentile
Measures of Variability
It is often desirable to consider measures of variability (dispersion) as well as measures of central location.
For example, suppose two stocks provide the same average return of 5% a year, but stock A’s return is very stable – close to 5% – while stock B’s return is volatile (it could be as low as –10%). Are you indifferent with regard to which stock to invest in?
For another example, in choosing supplier A or supplier B we might consider not only the average delivery time for each, but also the variability in delivery time for each.
Measures of Variability
Range
Interquartile Range
Variance/Standard Deviation
Coefficient of Variation
Range
The range of a data set is the difference between the largest and smallest data values.
It is the simplest measure of variability.
It is very sensitive to the smallest and largest data values.
Range
Example:
12 14 18 19 26 27 29 30
Range = largest value − smallest value = 30 − 12 = 18
Interquartile Range
The interquartile range of a data set is the difference between the 3rd quartile and the 1st quartile.
It is the range of the middle 50% of the data.
It overcomes the sensitivity to extreme data values.
Interquartile Range
Example:
12 14 18 19 26 27 29 30
3rd Quartile (Q3) = 75th percentile = 28
1st Quartile (Q1) = 25th percentile = 16
Interquartile Range = Q3 − Q1 = 28 − 16 = 12
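Continuing the example, here is a minimal computation. The quartile positions are hard-coded for this 8-value data set, since both (25/100)·8 and (75/100)·8 are integers under the percentile rule above:

```python
data = sorted([12, 14, 18, 19, 26, 27, 29, 30])

# With n = 8, both (25/100)*8 = 2 and (75/100)*8 = 6 are integers,
# so each quartile averages two adjacent positions.
q1 = (data[1] + data[2]) / 2   # (14 + 18) / 2 = 16.0
q3 = (data[5] + data[6]) / 2   # (27 + 29) / 2 = 28.0
print(q3 - q1)                 # interquartile range: 12.0
```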
Variance
The variance is a measure of variability that utilizes all the data.
It is based on the difference between the value of each observation (xi) and the mean (x̄ for a sample, μ for a population).
The variance is useful in comparing the variability of two or more variables.
Variance
The variance is the average of the squared differences between each data value and the mean.
The variance is calculated as follows:
s² = Σ(xi − x̄)² / (n − 1)   for a sample
σ² = Σ(xi − μ)² / N          for a population
Standard Deviation
The standard deviation of a data set is the positive square root of the variance.
It is measured in the same units as the data, making it more easily interpreted than the variance.
Standard Deviation
The standard deviation is computed as follows:
s = √s²   for a sample
σ = √σ²   for a population
Variance and Standard Deviation
Example:
12 14 18 19 26 27 29 30
Variance: s² = Σ(xi − x̄)² / (n − 1) = 48.98
Standard Deviation: s = √s² = √48.98 ≈ 7
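The computation behind s² = 48.98 and s ≈ 7 can be sketched directly from the formula:

```python
data = [12, 14, 18, 19, 26, 27, 29, 30]
n = len(data)
xbar = sum(data) / n                               # sample mean: 21.875

# Sample variance: sum of squared deviations divided by n - 1
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)
s = s2 ** 0.5                                      # sample standard deviation

print(round(s2, 2))  # 48.98
print(round(s, 2))   # 7.0
```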
Coefficient of Variation
The coefficient of variation indicates how large the standard deviation is in relation to the mean.
When comparing two data sets with different units, or with the same units but a significant difference in magnitude, the coefficient of variation should be used instead of the variance.
Coefficient of Variation
The coefficient of variation is computed as follows:
(s / x̄) × 100%   for a sample
(σ / μ) × 100%   for a population
Coefficient of Variation
Example:
12 14 18 19 26 27 29 30
(s / x̄) × 100% = (7 / 21.875) × 100% = 32%
Coefficient of Variation
Example: Height vs. Weight
In a class of 30 students, the average height is 5’5’’ with a standard deviation of 3’’, and the average weight is 120 lbs with a standard deviation of 20 lbs. Question: in which measure (height or weight) do students differ more?
Since height and weight don’t have the same units, we have to use the coefficient of variation to remove the units before comparing the variation in height and weight.
As shown below, students’ weight is more variable than their height.
(s_height / x̄_height) × 100% = (3’’ / 65’’) × 100% ≈ 4.6%   (5’5’’ = 65’’)
(s_weight / x̄_weight) × 100% = (20 / 120) × 100% ≈ 16.7%
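Both coefficient-of-variation examples reduce to one line each (with the heights converted to inches, since 5’5’’ = 65’’):

```python
# First example: s = 7, x̄ = 21.875
cv_data = 7 / 21.875 * 100
print(round(cv_data))         # 32

# Height vs. weight: the units cancel, so the two CVs are comparable
cv_height = 3 / 65 * 100      # s = 3'', mean = 65''
cv_weight = 20 / 120 * 100    # s = 20 lbs, mean = 120 lbs
print(round(cv_height, 1))    # 4.6
print(round(cv_weight, 1))    # 16.7
```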
Measures of Distribution Shape, Relative Location, and Detecting Outliers
Distribution Shape
z-Scores
Chebyshev’s Theorem
Empirical Rule
Detecting Outliers
Distribution Shape: Skewness
An important measure of the shape of a distribution is called skewness.
The formula for the skewness of sample data is
Skewness = [n / ((n − 1)(n − 2))] Σ[(xi − x̄)/s]³
Skewness can be easily computed using statistical software.
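A direct transcription of the formula as a sketch (statistical software such as Excel’s SKEW function implements, to my understanding, this same adjusted formula):

```python
def sample_skewness(data):
    """Skewness = [n / ((n-1)(n-2))] * sum(((xi - xbar)/s)**3)."""
    n = len(data)
    xbar = sum(data) / n
    s = (sum((x - xbar) ** 2 for x in data) / (n - 1)) ** 0.5
    return n / ((n - 1) * (n - 2)) * sum(((x - xbar) / s) ** 3 for x in data)

print(sample_skewness([1, 2, 3, 4, 5]))       # symmetric data: ~0
print(sample_skewness([1, 1, 1, 1, 10]) > 0)  # long right tail: True
```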
Distribution Shape: Skewness
Symmetric (not skewed)
• Skewness is zero.
• Mean and median are equal.
[Histogram of relative frequencies for a symmetric distribution, Skewness = 0]
Distribution Shape: Skewness
Skewed to the left
• Skewness is negative.
• Mean is usually less than the median.
[Histogram of relative frequencies for a left-skewed distribution, Skewness = −.33]
Distribution Shape: Skewness
Skewed to the right
• Skewness is positive.
• Mean is usually greater than the median.
[Histogram of relative frequencies for a right-skewed distribution, Skewness = .31]
Z-Scores
The z-score is often called the standardized value.
It denotes the number of standard deviations a data value xi is from the mean:
zi = (xi − x̄) / s
Excel’s STANDARDIZE function can be used to compute the z-score.
Z-Scores
An observation’s z-score is a measure of the relative location of the observation in a data set.
A data value less than the sample mean has a negative z-score.
A data value greater than the sample mean has a positive z-score.
A data value equal to the sample mean has a z-score of zero.
Z-Scores
Example:
12 14 18 19 26 27 29 30
z1 = (x1 − x̄)/s = (12 − 21.875)/7 ≈ −1.41
z8 = (x8 − x̄)/s = (30 − 21.875)/7 ≈ 1.16
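The z-score example can be reproduced end to end, computing the mean and standard deviation first:

```python
data = [12, 14, 18, 19, 26, 27, 29, 30]
n = len(data)
xbar = sum(data) / n                                        # 21.875
s = (sum((x - xbar) ** 2 for x in data) / (n - 1)) ** 0.5   # ~7

# z-score of each observation: number of std devs from the mean
z = [(x - xbar) / s for x in data]
print(round(z[0], 2))   # z1 = (12 - 21.875)/s = -1.41
print(round(z[-1], 2))  # z8 = (30 - 21.875)/s = 1.16
```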
Chebyshev’s Theorem
At least (1 − 1/z²) of the items in any data set will be within z standard deviations of the mean, i.e. between (x̄ − zs) and (x̄ + zs), where z is any value greater than 1.
Chebyshev’s theorem requires z > 1, but z need not be an integer.
Chebyshev’s Theorem
At least 55.6% of the data values must be within z = 1.5 standard deviations of the mean.
At least 89% of the data values must be within z = 3 standard deviations of the mean.
At least 94% of the data values must be within z = 4 standard deviations of the mean.
Chebyshev’s Theorem
Example: Given that x̄ = 10 and s = 2, at least what percentage of all the data values falls within 2 standard deviations of the mean?
At least (1 − 1/2²) = 1 − 1/4 = 75% of all the data values must be between 6 and 14:
x̄ − zs = 10 − 2(2) = 6
x̄ + zs = 10 + 2(2) = 14
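The bound is a one-line function, shown here reproducing the percentages from the previous two slides:

```python
def chebyshev_bound(z):
    """At least this fraction of any data set lies within z std devs (z > 1)."""
    return 1 - 1 / z ** 2

print(chebyshev_bound(2))              # 0.75 -> between 6 and 14 when x̄ = 10, s = 2
print(round(chebyshev_bound(1.5), 3))  # 0.556
print(round(chebyshev_bound(3), 3))    # 0.889
```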
Empirical Rule
When the data are believed to approximate a bell-shaped distribution, the empirical rule can be used to determine the percentage of data values that must be within a specified number of standard deviations of the mean.
The empirical rule is based on the normal distribution, which is covered in Chapter 6.
Empirical Rule
For data having a bell-shaped distribution:
About 68% of the values of a normal random variable are between μ − σ and μ + σ.
About 95% of the values of a normal random variable are between μ − 2σ and μ + 2σ.
About 99% of the values of a normal random variable are between μ − 3σ and μ + 3σ.
Empirical Rule
[Bell curve: about 68% of values within μ ± σ, about 95% within μ ± 2σ, about 99% within μ ± 3σ]
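For a bell-shaped (normal) distribution, the empirical-rule percentages can be verified with the standard library (Python 3.8+):

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal: mean 0, std dev 1

def within(k):
    """Probability that a normal value falls within k std devs of the mean."""
    return nd.cdf(k) - nd.cdf(-k)

print(round(within(1), 3))  # 0.683 -> "about 68%"
print(round(within(2), 3))  # 0.954 -> "about 95%"
print(round(within(3), 3))  # 0.997 -> "about 99%"
```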
Detecting Outliers
An outlier is an unusually small or unusually large value in a data set.
A data value with a z-score less than −3 or greater than +3 might be considered an outlier.
It might be:
• an incorrectly recorded data value,
• a data value that was incorrectly included in the data set, or
• a correctly recorded data value that belongs in the data set.
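A sketch of the |z| > 3 rule as a filter (`zscore_outliers` is my name for it, not a library function):

```python
def zscore_outliers(data, cutoff=3.0):
    """Values whose |z-score| exceeds the cutoff (3 is the usual rule)."""
    n = len(data)
    xbar = sum(data) / n
    s = (sum((x - xbar) ** 2 for x in data) / (n - 1)) ** 0.5
    return [x for x in data if abs((x - xbar) / s) > cutoff]

data = [12, 14, 18, 19, 26, 27, 29, 30]
print(zscore_outliers(data))               # []  (max |z| is about 1.41)

# Caveat: in a small sample one extreme value inflates s itself, so
# 280's z-score here is only about 2.7 and the cutoff of 3 misses it.
print(zscore_outliers(data + [280], 2.5))  # [280]
```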
Measures of Association Between Two Variables
So far, we have examined numerical methods used to summarize the data for one variable at a time.
Often a manager or decision maker is interested in the relationship between two variables.
Two numerical measures of the relationship between two variables are the covariance and the correlation coefficient.
Covariance
The covariance is a measure of the linear association between two variables.
Positive values indicate a positive relationship.
Negative values indicate a negative relationship.
Covariance
The covariance is computed as follows:
sxy = Σ(xi − x̄)(yi − ȳ) / (n − 1)    for samples
σxy = Σ(xi − μx)(yi − μy) / N         for populations
Correlation Coefficient
Correlation is a measure of linear association and not necessarily causation.
Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.
Correlation Coefficient
The correlation coefficient is computed as follows:
rxy = sxy / (sx sy)      for samples
ρxy = σxy / (σx σy)      for populations
Correlation Coefficient
The correlation coefficient can take on values between −1 and +1.
Values near −1 indicate a strong negative linear relationship.
Values near +1 indicate a strong positive linear relationship.
The closer the correlation is to zero, the weaker the relationship.
Covariance and Correlation Coefficient
Example: Stock Returns
The table below presents the monthly returns (in percent) of the market index S&P 500 (SPY) and Apple stock (AAPL) from December 2012 to May 2013.

Date     SPY    AAPL
Dec-12   2.17   -6.76
Jan-13   0.55   -1.11
Feb-13   1.62    0.12
Mar-13   0.83    0.01
Apr-13   1.01    0.96
May-13   0.24    0.10
Covariance and Correlation Coefficient
Example: Stock Returns

Date       x (SPY)   y (AAPL)   xi − x̄   yi − ȳ   (xi − x̄)(yi − ȳ)
Dec-12       2.17     -6.76       1.10    -5.65        -6.21
Jan-13       0.55     -1.11      -0.52     0.00         0.00
Feb-13       1.62      0.12       0.55     1.24         0.68
Mar-13       0.83      0.01      -0.24     1.12        -0.27
Apr-13       1.01      0.96      -0.06     2.08        -0.12
May-13       0.24      0.10      -0.83     1.21        -1.00
Average      1.07     -1.11                Total:      -6.92
Std. Dev.    0.71      2.84
Covariance and Correlation Coefficient
Example: Stock Returns
Sample Covariance:
sxy = Σ(xi − x̄)(yi − ȳ) / (n − 1) = −6.92 / (6 − 1) = −1.38
Sample Correlation Coefficient:
rxy = sxy / (sx sy) = −1.38 / (0.71 × 2.84) = −0.68
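The whole worked example fits in a few lines. Note that computing with unrounded intermediates gives approximately −1.39 and −0.69, whereas the slides round each column first and report −1.38 and −0.68; both describe the same moderate negative association:

```python
spy  = [2.17, 0.55, 1.62, 0.83, 1.01, 0.24]    # x: SPY monthly returns (%)
aapl = [-6.76, -1.11, 0.12, 0.01, 0.96, 0.10]  # y: AAPL monthly returns (%)

n = len(spy)
mx, my = sum(spy) / n, sum(aapl) / n           # means: 1.07 and about -1.11

# Sample covariance: sum of cross-deviation products over n - 1
sxy = sum((x - mx) * (y - my) for x, y in zip(spy, aapl)) / (n - 1)

# Sample standard deviations of each series
sx = (sum((x - mx) ** 2 for x in spy) / (n - 1)) ** 0.5   # about 0.71
sy = (sum((y - my) ** 2 for y in aapl) / (n - 1)) ** 0.5  # about 2.84

# Correlation coefficient: covariance scaled to [-1, +1]
rxy = sxy / (sx * sy)
print(round(sxy, 2))  # about -1.39 (slides: -1.38, from rounded columns)
print(round(rxy, 2))  # about -0.69 (slides: -0.68)
```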