Why Standard Deviation?


Chapter III
Descriptive Statistics:
Numerical Methods
Key Learning Objectives and Topics in this Chapter
• Measures of Location (Mean, Median, Mode, Percentiles, Quartiles)
• Measures of Dispersion/Spread/Variability (Range, Variance, Standard Deviation, Coefficient of Variation)
• Measures of distribution shape, and association between two variables
Important Note
In all cases:
• Know the formulas, learn the computation procedures (i.e., apply the formulas), and understand the meaning (interpretation) of the measures computed.
• Use Excel; Practice! Practice! and Practice!
3.1. Introduction
• When describing data, we usually focus our attention on two types of measures:
  • Central location (e.g., average or mean)
  • Variability or spread (e.g., variance, standard deviation)
• Both measures can be computed for:
  • a population
  • a sample
3.2 Measures of Central Location
• A center is a reference point. Thus a good measure of central location is expected to reflect the locations of all the other actual points in the data.
• How?
  • With one data point, clearly the central location is at the point itself.
  • With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them).
  • If a third data point appears on the left-hand side of the center, it should “pull” the central location to the left.
Measures of Location
• Mean
• Median
• Mode
• Percentiles
• Quartiles
If the measures are computed
for data from a sample,
they are called sample statistics.
If the measures are computed
for data from a population,
they are called population parameters.
A sample statistic is referred to
as the point estimator of the
corresponding population parameter.
i) The Arithmetic Mean (µ)
• The mean is the most popular and useful measure of central location.
$$\text{Mean} = \frac{\text{Sum of the observations}}{\text{Number of observations}}$$
Sample mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
where the numerator is the sum of the values of the observations in the data and $n$ is the number of observations in the sample (sample size).
Population mean:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
where $N$ is the number of observations in the population (population size).
i) The Arithmetic Mean
• Example 1
The times (in hours) spent on the Internet by 10 students are as follows:
0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours.
Based on this data, compute the mean (average) amount of time spent (per day) on the Internet.
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{0 + 7 + 12 + 5 + 33 + 14 + 8 + 0 + 9 + 22}{10} = \frac{110}{10} = 11 \text{ hours}$$
Based on this data, the average amount of time spent on the Internet by a typical student is 11 hours.
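The computation in Example 1 can be sketched in a few lines of Python. This is only an illustration; the function name `arithmetic_mean` and the variable names are assumptions made for this sketch, not part of the course material.

```python
# Hours spent on the Internet by the 10 students in Example 1.
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]


def arithmetic_mean(values):
    """Return the sum of the observations divided by the number of observations."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


mean_hours = arithmetic_mean(hours)
print(mean_hours)  # 11.0
```

The same function applies to a sample or to a population; only the interpretation (statistic versus parameter) differs.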
ii) The Median
• The median of a set of observations is the value that falls in the middle of the data when it is arranged in order (ascending or descending).
• It is the value that divides the observations into two equal halves.
To find the median:
• Put the data in an array (in increasing or decreasing order) and then count the total number of observations in the data.
• If the total is an ODD number, the median is the middle value.
• If the total is an EVEN number, the median is the AVERAGE of the middle two values.
ii) The Median
Example 2a
Find the median for the following observations.
0, 7, 12, 5, 14, 8, 0, 9, 22
Step-1: Arrange the data in increasing/decreasing order:
0, 0, 5, 7, 8, 9, 12, 14, 22
Step-2: Count the total number of observations in the data (9). This is an odd number of observations, so the median is the middle (5th) value.
Median = 8
ii) The Median
Example 2b
Find the median for the following observations.
0, 7, 12, 5, 33, 14, 8, 0, 9, 22
Step-1: Arrange the data in increasing/decreasing order:
0, 0, 5, 7, 8, 9, 12, 14, 22, 33
Step-2: Count the total number of observations in the data (10). This is an even number of observations, so the median is the average of the middle two (5th and 6th) values.
Median = (8 + 9)/2 = 8.5
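The two-step procedure (sort, then pick the middle value or average the middle two) can be sketched as follows. The function name `median` and the variable names are chosen here for illustration only.

```python
def median(values):
    """Median via the sort-and-count procedure described above."""
    if not values:
        raise ValueError("the median of an empty data set is undefined")
    ordered = sorted(values)          # Step 1: arrange in increasing order
    n = len(ordered)                  # Step 2: count the observations
    middle = n // 2
    if n % 2 == 1:                    # odd count: the single middle value
        return ordered[middle]
    # even count: average of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


example_2a = [0, 7, 12, 5, 14, 8, 0, 9, 22]
example_2b = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
print(median(example_2a))  # 8
print(median(example_2b))  # 8.5
```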
ii) The Median
• Note:
  • The median of a data set with an odd number of observations (8 in Example 2a) is a member of the data values.
  • However, the median of a data set with an even number of observations (8.5 in Example 2b) is not necessarily a member of the set of values.
• What is special about the median?
  • Unlike the mean, the median is not affected by how extreme the other observations are; it depends only on the value(s) in the middle of the ordered data. For instance, replacing 33 with 330 in Example 2b raises the mean from 11 to 40.7 but leaves the median at 8.5.
iii) The Center: Mode
• The mode is the value that occurs most frequently in the data. It is the value with the highest frequency.
• In any data set there is only one value for the mean or the median. However, a data set may have more than one value for the mode.
[Figure: histograms of income distribution, one with one modal class and one with two modal classes]
iii) The Center: Mode
Example 3: What is the mode for the following data?
0, 7, 12, 5, 33, 14, 8, 0, 9, 22
Solution
• All observations except “0” occur once. There are two “0” values. Thus, the mode is zero.
• But is this value a good indicator of the central location of this data?
• The value “0” does not reside at the center of this set (compare with the mean = 11.0 and the median = 8.5).
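A minimal sketch of finding the mode by counting frequencies is given below. Because a data set may have more than one mode, the hypothetical helper `modes` returns a list; the second data set is made up purely to show the two-mode case.

```python
from collections import Counter


def modes(values):
    """Return every value that attains the highest frequency, in sorted order."""
    counts = Counter(values)                 # frequency of each distinct value
    highest = max(counts.values())           # the highest frequency observed
    return sorted(v for v, c in counts.items() if c == highest)


hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
print(modes(hours))            # [0]
print(modes([1, 1, 2, 2, 3]))  # [1, 2]  (illustrative data with two modes)
```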
After Comparing Measures of Central Tendency: Mean, Median, Mode:
• If mean = median = mode, the shape of the distribution is symmetric.
• If mode < median < mean, the distribution trails to the right and is positively skewed (“skewed to the right”).
• If mode > median > mean, the distribution trails to the left and is negatively skewed (“skewed to the left”).
[Figure: a positively skewed distribution and a negatively skewed distribution, each with the mode, median, and mean marked]
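The comparison rule above can be sketched as a small check. This is a rough rule of thumb in code form, not a formal skewness measure; the function name `describe_shape` and its return strings are assumptions of this sketch, and it gives up when the data has no single mode.

```python
from statistics import mean, median, multimode


def describe_shape(values):
    """Classify shape by comparing mode, median, and mean (rule of thumb only)."""
    candidates = multimode(values)
    if len(candidates) != 1:
        return "undetermined (no single mode)"
    mode, mid, avg = candidates[0], median(values), mean(values)
    if mode == mid == avg:
        return "symmetric"
    if mode < mid < avg:
        return "positively skewed"
    if mode > mid > avg:
        return "negatively skewed"
    return "undetermined"


hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]   # mode 0 < median 8.5 < mean 11
print(describe_shape(hours))  # positively skewed
```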
Percentiles
• A percentile provides information about the relative location and spread of the data between the smallest and the largest value.
• A percentile tells us the proportion of observations that lie below or above a certain value in the data.
Example: Admission test scores for colleges and universities are frequently reported in terms of percentiles.
• The pth percentile of a data set is a value such that at least p percent of the items take on this value or less and at least (100 - p) percent of the items take on this value or more.
Computing Percentiles
• Arrange the data in ascending order.
• Compute the position i of the pth percentile, where n is the number of observations:
$$i = \left(\frac{p}{100}\right) n$$
• If i is not an integer, round up. The pth percentile is the value in the ith position.
• If i is an integer, the pth percentile is the average of the values in positions i and i + 1.
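The position rule above can be sketched in Python as follows. The function name `percentile` is an assumption of this sketch, and it assumes p is a whole number strictly between 0 and 100 so that the position can be computed with exact integer arithmetic. Statistical packages and spreadsheets often use other interpolation rules, so their results may differ slightly. The data used is the Internet-hours data from Example 1.

```python
def percentile(values, p):
    """Return the p-th percentile using the position rule i = (p / 100) * n.

    Assumes p is a whole number with 0 < p < 100.
    """
    if not values:
        raise ValueError("the percentile of an empty data set is undefined")
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)                 # arrange in ascending order
    n = len(ordered)
    # i = p * n / 100, split into its whole part and remainder
    whole, remainder = divmod(p * n, 100)
    if remainder != 0:
        # i is not an integer: round up to position whole + 1,
        # which is zero-based index `whole`
        return ordered[whole]
    # i is an integer: average the values in positions i and i + 1
    return (ordered[whole - 1] + ordered[whole]) / 2


hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
print(percentile(hours, 50))  # 8.5 (same as the median in Example 2b)
print(percentile(hours, 75))  # 14
```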
Compute the 75th percentile of the following data
425 440 450 465 480 510 575
430 440 450 470 485 515 575
430 440 450 470 490 525 580
435 445 450 472 490 525 590
435 445 450 475 490 525 600
435 445 460 475 500 535 600
435 445 460 475 500 549 600
435 445 460 480 500 550 600
440 450 465 480 500 570 615
440 450 465 480 510 570 615
i = (p/100)n = (75/100) × 10 = 7.5
Since 7.5 is not an integer, we round up to 8 and note that the 8th data value is 435.
The 75th Percentile = 435
(The computation uses n = 10, which corresponds to the ten values in the first column: 425, 430, 430, 435, 435, 435, 435, 435, 440, 440.)
Compute the 50th percentile of the following data
425 440 450 465 480 510 575
430 440 450 470 485 515 575
430 440 450 470 490 525 580
435 445 450 472 490 525 590
435 445 450 475 490 525 600
435 445 460 475 500 535 600
435 445 460 475 500 549 600
435 445 460 480 500 550 600
440 450 465 480 500 570 615
440 450 465 480 510 570 615
i = (p/100)n = (50/100) × 10 = 5
Since 5 is an integer, we average the 5th and 6th data values:
The 50th Percentile = (435 + 435)/2 = 435
(The computation uses n = 10, which corresponds to the ten values in the first column: 425, 430, 430, 435, 435, 435, 435, 435, 440, 440.)
Quartiles
Quartiles are specific percentiles.
• First Quartile = 25th Percentile
• Second Quartile = 50th Percentile = the Median
• Third Quartile = 75th Percentile
• Quartiles divide a data set into four equal parts.
$$Q_1 = \frac{N + 1}{4}; \quad Q_2 = \frac{2(N + 1)}{4}; \quad Q_3 = \frac{3(N + 1)}{4}$$
where $Q_i$ is the location of the ith quartile.
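As a quick check of the location formula, here is a worked sketch using the nine ordered observations from Example 2a (0, 0, 5, 7, 8, 9, 12, 14, 22), so that N = 9:

```latex
\[
Q_2 = \frac{2(N + 1)}{4} = \frac{2(9 + 1)}{4} = 5
\]
```

The 5th ordered value is 8, which agrees with the median found in Example 2a. The same formula gives locations 2.5 and 7.5 for the first and third quartiles; the slides do not state how such fractional locations should be resolved.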
3.3 Measures of Variability
• Measures of central location fail to tell the whole story about the distribution.
• A question of interest that remains unanswered even after obtaining measures of central location is: how spread out are the observations around the central (say, mean) value?
• Variability is important in business decisions, as it indicates the level of risk.
• For example, in choosing between two suppliers A and B, we might consider not only the average delivery time for each, but also the variability in delivery time for each.
Measures of Variability
• Range
• Inter-Quartile Range
• Variance
• Standard Deviation
• Coefficient of Variation
i) The Range
• The range of a set of observations is the difference between the largest and smallest observations, i.e., the distance between the smallest and the largest data value in the set.
• Range = largest value – smallest value
• Its major advantage is the ease with which it can be computed.
• Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points.
• It is also very sensitive to the smallest and largest data values.
ii) Inter-Quartile Range
• This is a measure of the spread of the middle 50% of the observations.
• Inter-quartile range = Q3 – Q1
• A large value indicates a large spread of the observations.
• It is not sensitive to extreme data values.
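The contrast between the range and the inter-quartile range can be sketched on the Internet-hours data from Example 1. This sketch assumes Q1 and Q3 are taken as the 25th and 75th percentiles using the position rule i = (p/100)n; the helper `percentile` and the modified data set `stretched` are illustrative assumptions, not part of the course material.

```python
def percentile(values, p):
    """p-th percentile (whole-number p, 0 < p < 100) via i = (p / 100) * n."""
    ordered = sorted(values)
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder:                     # i is not an integer: round up
        return ordered[whole]         # position whole + 1 is index `whole`
    return (ordered[whole - 1] + ordered[whole]) / 2


def data_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


def interquartile_range(values):
    """Q3 - Q1, with the quartiles taken as the 75th and 25th percentiles."""
    return percentile(values, 75) - percentile(values, 25)


hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
print(data_range(hours), interquartile_range(hours))          # 33 9

# Replace the largest value with an extreme one (made-up value).
stretched = [330 if x == 33 else x for x in hours]
print(data_range(stretched), interquartile_range(stretched))  # 330 9
```

The extreme value multiplies the range by ten but leaves the inter-quartile range unchanged, which is the sensitivity difference described above.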
iii) The Variance
• The variance is the average of the squared differences between each data value and the measure of central location (the mean).
• It is a measure of variability that utilizes all the data.
• It is calculated differently for a population and for a sample.
iii) The Variance
Variance of a population:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
Variance of a sample:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Example: Computing the Variance Based on Sample Data
Variance of a sample:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Find the variance of the following sample observations:
9, 11, 8, 12
Computing the Variance of a Sample
Step-1: Find the mean
$$\bar{x} = \frac{9 + 11 + 8 + 12}{4} = \frac{40}{4} = 10$$
Step-2: Compute deviations from the mean
9 - 10 = -1
11 - 10 = +1
8 - 10 = -2
12 - 10 = +2
Step-3: Square the deviations, add them together, and divide the sum of the squared deviations by n - 1
$$s^2 = \frac{\sum_{i=1}^{4} (x_i - \bar{x})^2}{n - 1} = \frac{(-1)^2 + 1^2 + (-2)^2 + 2^2}{4 - 1} = \frac{10}{3} \approx 3.33$$
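The three steps can be sketched in Python as shown below, with the population formula alongside for comparison. The function names are chosen for this illustration only.

```python
def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("the sample variance needs at least two observations")
    mean = sum(values) / n                          # Step 1: the mean
    deviations = [x - mean for x in values]         # Step 2: deviations
    return sum(d ** 2 for d in deviations) / (n - 1)  # Step 3


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    if n == 0:
        raise ValueError("the variance of an empty data set is undefined")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


sample = [9, 11, 8, 12]
print(round(sample_variance(sample), 2))  # 3.33
print(population_variance(sample))        # 2.5
```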
iii) The Variance
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
• Why square the differences? Because the sum of the deviations from the mean is zero.
• Why divide by n - 1 instead of n? Because it gives a better approximation of the population variance.
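The first answer follows directly from the definition of the sample mean. A short derivation, using the same symbols as the sample formulas in this chapter:

```latex
\[
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
\]
```

For the sample 9, 11, 8, 12 the deviations are -1, +1, -2, +2, which sum to 0, whereas their squares sum to 10. Squaring keeps positive and negative deviations from cancelling.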
iv) Standard Deviation
• The standard deviation of a set of observations is the square root of the variance.
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Why Standard Deviation?
The standard deviation
• Is expressed in the same unit of measure in which the data is recorded (unlike the variance, which is in squared units).
• Thus it can be used to compare the variability of several distributions that are measured in the same units.
• It can also be used to make a statement about the general shape of a distribution (Kurtosis).
Computing the standard deviation
Step-1: Find the mean
$$\bar{x} = \frac{9 + 11 + 8 + 12}{4} = \frac{40}{4} = 10$$
Step-2: Compute deviations from the mean
9 - 10 = -1
11 - 10 = +1
8 - 10 = -2
12 - 10 = +2
Step-3: Square the deviations, add them together, and divide the sum of the squared deviations by n - 1
$$s^2 = \frac{\sum_{i=1}^{4} (x_i - \bar{x})^2}{n - 1} = \frac{(-1)^2 + 1^2 + (-2)^2 + 2^2}{4 - 1} = \frac{10}{3} \approx 3.33$$
Step-4: Take the square root of the variance
$$s = \sqrt{s^2} = \sqrt{10/3} \approx 1.826$$
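The result can be checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample (n - 1) formulas and whose `pstdev` uses the population formula. This is only a verification sketch; the variable names are illustrative.

```python
import math
from statistics import pstdev, stdev, variance

sample = [9, 11, 8, 12]

s_squared = variance(sample)      # sample variance (divides by n - 1)
s = math.sqrt(s_squared)          # Step 4: square root of the variance

print(round(s_squared, 2))              # 3.33
print(round(s, 3))                      # 1.826
print(math.isclose(s, stdev(sample)))   # True
print(round(pstdev(sample), 3))         # 1.581 (population formula, divides by N)
```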
v) Coefficient of Variation
The coefficient of variation is a measure of how large the standard deviation is relative to the mean.
The coefficient of variation is computed as follows:
$$CV = \left(\frac{s}{\bar{x}} \times 100\right)\% \quad \text{for a sample}$$
$$CV = \left(\frac{\sigma}{\mu} \times 100\right)\% \quad \text{for a population}$$
Why Coefficient of Variation?
Example: Is a standard deviation of 10 large?
A standard deviation of 10 may be perceived as large when the mean value is 100 (CV = 10%), but it is only moderately large if the mean value is 500 (CV = 2%).
The coefficient of variation can be used to compare variability in data sets that are measured in different units.
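A small sketch of the sample coefficient of variation follows. It also illustrates why the measure suits comparisons across units: rescaling the data (here, a hypothetical conversion from hours to minutes) changes the mean and standard deviation by the same factor, so their ratio is unchanged. The function name and the unit conversion are assumptions of this sketch.

```python
import math
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample CV: standard deviation as a percentage of the mean."""
    average = mean(values)
    if average == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    return stdev(values) / average * 100


sample_hours = [9, 11, 8, 12]
sample_minutes = [x * 60 for x in sample_hours]   # same data, different unit

cv_hours = coefficient_of_variation(sample_hours)
cv_minutes = coefficient_of_variation(sample_minutes)
print(round(cv_hours, 2))                 # 18.26
print(math.isclose(cv_hours, cv_minutes)) # True
```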
Variance, Standard Deviation, and Coefficient of Variation
• Variance
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 2{,}996.16$$
• Standard Deviation
$$s = \sqrt{s^2} = \sqrt{2996.16} \approx 54.74$$
• Coefficient of Variation
$$\left(\frac{s}{\bar{x}} \times 100\right)\% = \left(\frac{54.74}{490.80} \times 100\right)\% \approx 11.15\%$$
The standard deviation is about 11% of the mean.
Compute the Mean, Median, Mode, Range, Variance, Standard Deviation and Coefficient of Variation for income (in $1000) data from the following cities
City               Income
Akron, OH          74.1
Atlanta, GA        82.4
Birmingham, AL     71.2
Cleveland, OH      62.3
Columbia, SC       79.9
Danbury, CT        66.8
Denver, CO         132.3
Detroit, MI        83.4
Lancaster, PA      100.0
Madison, WI        77.0
Minneapolis, MN    67.8
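One possible way to check hand or Excel computations for this exercise is sketched below using Python's standard `statistics` module. It treats the eleven cities as a sample (an assumption, since the exercise does not say), so the variance and standard deviation divide by n - 1. All variable names are illustrative.

```python
from statistics import mean, median, multimode, stdev, variance

# Income (in $1000) for the eleven cities listed above.
incomes = {
    "Akron, OH": 74.1, "Atlanta, GA": 82.4, "Birmingham, AL": 71.2,
    "Cleveland, OH": 62.3, "Columbia, SC": 79.9, "Danbury, CT": 66.8,
    "Denver, CO": 132.3, "Detroit, MI": 83.4, "Lancaster, PA": 100.0,
    "Madison, WI": 77.0, "Minneapolis, MN": 67.8,
}
values = list(incomes.values())

avg = mean(values)
mid = median(values)
most_frequent = multimode(values)      # every value, if none repeats
spread = max(values) - min(values)     # range
s_squared = variance(values)           # sample variance (n - 1)
s = stdev(values)                      # sample standard deviation
cv = s / avg * 100                     # coefficient of variation, in percent

for label, result in [("mean", avg), ("median", mid), ("range", spread),
                      ("variance", s_squared), ("std dev", s), ("CV %", cv)]:
    print(f"{label:>8}: {result:.2f}")
if len(most_frequent) == len(values):
    print("    mode: no value repeats, so there is no meaningful mode")
else:
    print("    mode:", most_frequent)
```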
Compute every single measure of central location and variability you have learned in this chapter for the following sample rent data on 70 efficiency apartments
425 440 450 465 480 510 575
430 440 450 470 485 515 575
430 440 450 470 490 525 580
435 445 450 472 490 525 590
435 445 450 475 490 525 600
435 445 460 475 500 535 600
435 445 460 475 500 549 600
435 445 460 480 500 550 600
440 450 465 480 500 570 615
440 450 465 480 510 570 615