The Variance and Standard Deviation
The most important measure of variability is
based on deviations of individual
observations about the central value. For
this purpose the mean usually serves as
the center.
MEASURES OF VARIABILITY
• Variance
– Population variance
– Sample variance
• Standard Deviation
– Population standard deviation
– Sample standard deviation
• Coefficient of Variation (CV)
– Sample CV
– Population CV
MEASURES OF VARIABILITY
POPULATION VARIANCE
• The population variance is the mean squared
deviation from the population mean:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
• Where $\sigma^2$ stands for the population variance
• $\mu$ is the population mean
• $N$ is the total number of values in the population
• $x_i$ is the value of the i-th observation
• $\sum$ represents a summation
An example related to deviation
about the central value
• There are five SAT scores as below:
584, 613, 622, 693, 755.
• The mean is
(584+613+622+693+755)/5 = 653.4
• The deviation for each score can be
computed by subtracting mean from each
score:
755-653.4 = 101.6
An example related to deviation
about the central value (cont..)
693-653.4 = 39.6
622-653.4 = -31.4
613-653.4 = -40.4
584-653.4 = -69.4
These deviations can be summarized by a single
collective measure that takes every deviation
into account: the mean of the squared deviations.
An example related to deviation
about the central value (cont..)
With the previous data, this procedure results in
$$\frac{(101.6)^2 + (-40.4)^2 + (-69.4)^2 + (39.6)^2 + (-31.4)^2}{5} = \frac{19325.2}{5} = 3865.04$$
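The arithmetic above can be checked with a short Python sketch. The variable names are illustrative and not part of the original slides; the five scores are treated as a complete population, so the sum of squared deviations is divided by N.

```python
# SAT scores from the example, treated here as a whole population
scores = [584, 613, 622, 693, 755]

N = len(scores)
mu = sum(scores) / N                      # population mean
deviations = [x - mu for x in scores]     # deviation of each score from the mean
sum_sq = sum(d ** 2 for d in deviations)  # sum of squared deviations
pop_variance = sum_sq / N                 # mean squared deviation

print(round(mu, 1), round(sum_sq, 1), round(pop_variance, 2))
```

The three printed values correspond to the mean, the numerator, and the variance worked out on the slides.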
Population Variance
• In practice population variance cannot be
computed directly because the entire
population is not ordinarily observed.
• An analogous measure of variability may
be determined with sample data.
• This is referred to as the sample variance.
MEASURES OF VARIABILITY
SAMPLE VARIANCE
• The sample variance is defined as follows:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
• Where $s^2$ stands for the sample variance
• $\bar{x}$ is the sample mean
• $n$ is the total number of values in the sample
• $x_i$ is the value of the i-th observation
• $\sum$ represents a summation
MEASURES OF VARIABILITY
SAMPLE VARIANCE
• Notice that the sample variance is defined as the sum
of the squared deviations divided by n-1.
• Sample variance is computed to estimate the
population variance.
• An unbiased estimate of the population variance may
be obtained by defining the sample variance as the
sum of the squared deviations divided by n-1 rather
than by n.
• Defining sample variance as the mean squared
deviation from the sample mean tends to
underestimate the population variance.
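One way to see why n-1 is used is to enumerate every possible sample from a small population and average the two candidate definitions. The sketch below is an illustration under stated assumptions: it reuses the five SAT scores as the "population", draws samples of size 2 with replacement (both are illustrative choices, not from the slides), and uses hypothetical helper names.

```python
from itertools import product

population = [584, 613, 622, 693, 755]   # SAT scores, treated as the population
N = len(population)
mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N   # population variance

n = 2  # sample size (illustrative choice)
# Every equally likely sample of size n drawn with replacement
samples = list(product(population, repeat=n))

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample's own mean."""
    xbar = sum(sample) / len(sample)
    return sum((x - xbar) ** 2 for x in sample)

# Average of each definition over all possible samples
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(sigma2, avg_div_n, avg_div_n_minus_1)
```

Averaged over all samples, the n-1 version matches the population variance, while the divide-by-n version comes out smaller by the factor (n-1)/n.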
MEASURES OF VARIABILITY
SAMPLE VARIANCE
• A shortcut formula for the sample variance:
$$s^2 = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$$
• Where $s^2$ is the sample variance
• $n$ is the total number of values in the sample
• $x_i$ is the value of the i-th observation
• $\sum$ represents a summation
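As a quick check that the shortcut agrees with the definition, the sketch below evaluates both forms on the advertising data used later in these slides. The variable names are illustrative.

```python
advertising = [2.5, 1.3, 1.4, 1.0, 2.0]
n = len(advertising)

# Definition: sum of squared deviations from the sample mean, divided by n - 1
xbar = sum(advertising) / n
s2_definition = sum((x - xbar) ** 2 for x in advertising) / (n - 1)

# Shortcut: only the sum of the values and the sum of their squares are needed
sum_x = sum(advertising)
sum_x2 = sum(x ** 2 for x in advertising)
s2_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)

print(s2_definition, s2_shortcut)
```

Both expressions give the same value up to floating-point rounding; the shortcut avoids computing the mean and the individual deviations first.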
MEASURES OF VARIABILITY
POPULATION/SAMPLE STANDARD DEVIATION
• The standard deviation is the positive square root of
the variance:
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$
• Compute the standard deviations of advertising and
sales.
MEASURES OF VARIABILITY
POPULATION/SAMPLE STANDARD DEVIATION
• Compute the sample standard deviation of
advertising data: 2.5, 1.3, 1.4, 1.0 and 2.0
• Compute the sample standard deviation of sales
data: 264, 116, 165, 101 and 209
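A minimal sketch of this exercise using Python's standard library, assuming the data are a sample (so `statistics.stdev`, which divides by n-1, is the appropriate call; `statistics.pstdev` would give the population version).

```python
import statistics

advertising = [2.5, 1.3, 1.4, 1.0, 2.0]
sales = [264, 116, 165, 101, 209]

# stdev() is the sample standard deviation: square root of the n-1 variance
s_advertising = statistics.stdev(advertising)
s_sales = statistics.stdev(sales)

print(round(s_advertising, 6), round(s_sales, 6))
```

The results can be compared with the SD row in the covariance table later in the slides (0.602495 and 67.18258703).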
MEASURES OF VARIABILITY
POPULATION/SAMPLE CV
• The coefficient of variation is the standard deviation
divided by the mean:
Population coefficient of variation: $CV = \sigma / \mu$
Sample coefficient of variation: $cv = s / \bar{x}$
MEASURES OF VARIABILITY
POPULATION/SAMPLE CV
• Compute the sample coefficient of variation of
advertising data: 2.5, 1.3, 1.4, 1.0 and 2.0
• Compute the sample coefficient of variation of sales
data: 264, 116, 165, 101 and 209
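The exercise can be sketched as follows. The helper name `sample_cv` is hypothetical; it simply divides the sample standard deviation by the sample mean.

```python
import statistics

advertising = [2.5, 1.3, 1.4, 1.0, 2.0]
sales = [264, 116, 165, 101, 209]

def sample_cv(values):
    """Sample coefficient of variation: s divided by the sample mean."""
    return statistics.stdev(values) / statistics.mean(values)

cv_advertising = sample_cv(advertising)
cv_sales = sample_cv(sales)

print(round(cv_advertising, 4), round(cv_sales, 4))
```

Because the coefficient of variation is unit-free, the two series can be compared directly even though one is measured in dollars and the other in units.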
MEASURES OF ASSOCIATION
• A scatter diagram provides a graphical description
of a positive/negative, linear/non-linear relationship
• Numerical descriptions of a positive/negative,
linear/non-linear relationship are obtained from:
– Covariance
• Population covariance
• Sample covariance
– Coefficient of correlation
• Population coefficient of correlation
• Sample coefficient of correlation
MEASURES OF ASSOCIATION: EXAMPLE
• A sample of monthly advertising and sales data are
collected and shown below:
Month   Sales (000 units)   Advertising (000 $)
1       264                 2.5
2       116                 1.3
3       165                 1.4
4       101                 1.0
5       209                 2.0
• What is the relationship between sales and
advertising? Is it linear or non-linear,
positive or negative?
POPULATION COVARIANCE
• The population covariance is the mean of the products of
deviations from the population means:
$$COV(X,Y) = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$
• Where COV(X,Y) is the population covariance
• $\mu_x$, $\mu_y$ are the population means of X and Y
respectively
• $N$ is the total number of values in the population
• $x_i$, $y_i$ are the values of the i-th observations of X and Y
respectively
SAMPLE COVARIANCE
• The sample covariance is the sum of the products of
deviations from the sample means, divided by n-1:
$$cov(X,Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
• Where cov(X,Y) is the sample covariance
• $\bar{x}$, $\bar{y}$ are the sample means of X and Y respectively
• $n$ is the total number of values in the sample
• $x_i$, $y_i$ are the values of the i-th observations of X and
Y respectively
• $\sum$ represents a summation
SAMPLE COVARIANCE
Month   Sales (000 units)   Advertising (000 $)
1       264                 2.5
2       116                 1.3
3       165                 1.4
4       101                 1.0
5       209                 2.0
Mean    171                 1.64
SD      67.18258703         0.602495
Total =
cov =
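The blank entries in the table can be worked out with a short sketch. Here `total` is assumed to mean the sum of the products of deviations (the numerator of the sample covariance formula); the variable names are illustrative.

```python
sales = [264, 116, 165, 101, 209]          # 000 units
advertising = [2.5, 1.3, 1.4, 1.0, 2.0]    # 000 $

n = len(sales)
mean_sales = sum(sales) / n
mean_adv = sum(advertising) / n

# Product of the two deviations for each month
products = [(x - mean_sales) * (y - mean_adv)
            for x, y in zip(sales, advertising)]

total = sum(products)          # sum of products of deviations
cov = total / (n - 1)          # sample covariance

print(round(total, 2), round(cov, 2))
```

The positive result reflects that months with above-average advertising also tend to have above-average sales.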
POPULATION/SAMPLE COVARIANCE
• If two variables increase/decrease together,
covariance is a large positive number and the
relationship is called positive.
• If the relationship is such that when one variable
increases, the other decreases and vice versa, then
covariance is a large negative number and the
relationship is called negative.
• If two variables are unrelated, the covariance may be
a small number.
• How large is large? How small is small?
POPULATION/SAMPLE COVARIANCE
• How large is large? How small is small? A drawback
of covariance is that its magnitude depends on the
units of X and Y, so it is usually difficult to give any
guideline for how large a covariance indicates a strong
relationship or how small a covariance indicates no
relationship.
• The coefficient of correlation overcomes this drawback
to a certain extent.
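The unit dependence can be illustrated by re-expressing advertising in dollars instead of thousands of dollars. This is a sketch with hypothetical helper names (`sample_cov`, `sample_corr`); the correlation is computed as the sample covariance divided by the product of the sample standard deviations.

```python
import statistics

sales = [264, 116, 165, 101, 209]                 # 000 units
adv_thousands = [2.5, 1.3, 1.4, 1.0, 2.0]         # 000 $
adv_dollars = [a * 1000 for a in adv_thousands]   # same data in $

def sample_cov(x, y):
    """Sample covariance with the n - 1 divisor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)

def sample_corr(x, y):
    """Sample covariance divided by the two sample standard deviations."""
    return sample_cov(x, y) / (statistics.stdev(x) * statistics.stdev(y))

cov_thousands = sample_cov(sales, adv_thousands)
cov_dollars = sample_cov(sales, adv_dollars)
r_thousands = sample_corr(sales, adv_thousands)
r_dollars = sample_corr(sales, adv_dollars)

print(cov_thousands, cov_dollars)   # covariance changes with the unit
print(r_thousands, r_dollars)       # correlation does not
```

Changing the unit multiplies the covariance by 1000 while leaving the coefficient of correlation unchanged, which is why the latter is easier to interpret.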
POPULATION COEFFICIENT OF CORRELATION
• The population coefficient of correlation is the
population covariance divided by the population
standard deviations of X and Y:
$$\rho = \frac{COV(X,Y)}{\sigma_x \sigma_y}$$
• Where $\rho$ is the population coefficient of correlation
• COV(X,Y) is the population covariance
• $\sigma_x$, $\sigma_y$ are the population standard deviations of X and Y
respectively
SAMPLE COEFFICIENT OF CORRELATION
• The sample coefficient of correlation is the sample
covariance divided by the sample standard deviations
of X and Y:
$$r = \frac{cov(X,Y)}{s_x s_y}$$
• Where $r$ is the sample coefficient of correlation
• cov(X,Y) is the sample covariance
• $s_x$, $s_y$ are the sample standard deviations of X and Y respectively
SAMPLE COEFFICIENT OF CORRELATION
Month   Sales (000 units)   Advertising (000 $)
1       264                 2.5
2       116                 1.3
3       165                 1.4
4       101                 1.0
5       209                 2.0
Mean    171                 1.64
SD      67.18258703         0.602495
Total =
cov =
r =
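A minimal sketch of the correlation calculation for the table above, built directly from the sample covariance and the two sample standard deviations. Variable names are illustrative.

```python
import statistics

sales = [264, 116, 165, 101, 209]          # 000 units
advertising = [2.5, 1.3, 1.4, 1.0, 2.0]    # 000 $

n = len(sales)
mean_sales = statistics.mean(sales)
mean_adv = statistics.mean(advertising)

# Sample covariance: sum of products of deviations divided by n - 1
cov = sum((x - mean_sales) * (y - mean_adv)
          for x, y in zip(sales, advertising)) / (n - 1)

# Sample standard deviations (n - 1 divisor)
s_sales = statistics.stdev(sales)
s_adv = statistics.stdev(advertising)

r = cov / (s_sales * s_adv)   # sample coefficient of correlation

print(round(r, 4))
```

A value of r close to +1 indicates a strong positive linear relationship between advertising and sales in this sample.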
RELATIVE STANDING
BOX PLOTS
• When the data set contains a small number of values, a
box plot is used to graphically represent the data set.
These plots involve five values:
– the minimum value (S)
– the lower quartile (Q1)
– the median (Q2)
– the upper quartile (Q3)
– and the maximum value (L)
RELATIVE STANDING: BOX PLOTS
EXAMPLE
• Example: Construct a box plot with the following data which
shows the assets of the 15 largest North American banks,
rounded off to the nearest hundred million dollars: 111,
135, 217, 108, 51, 98, 65, 85, 75, 75, 93, 64, 57, 56, 98
RELATIVE STANDING: BOX PLOTS
RANKING AND SUMMARIZING
Rank   Data
1      217
2      135
3      111
4      108
5      98
6      98
7      93
8      85
9      75
10     75
11     65
12     64
13     57
14     56
15     51
Smallest = 51
Q1 = 64
Median = 85
Q3 = 108
Largest = 217
IQR = 44
Outliers = 217
Box Plot
[Figure: box plot of the bank data; horizontal axis: Assets (in 100 million dollars), 0 to 250]
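The summary values can be reproduced with a short sketch. Two assumptions are made here that the slides do not state: quartiles are taken at the (n+1)p positions (the default "exclusive" method of `statistics.quantiles`), and outliers are flagged with the common 1.5 × IQR rule.

```python
import statistics

assets = [111, 135, 217, 108, 51, 98, 65, 85, 75, 75, 93, 64, 57, 56, 98]

smallest, largest = min(assets), max(assets)

# Quartiles at the (n + 1) * p positions (default method="exclusive")
q1, median, q3 = statistics.quantiles(assets, n=4)
iqr = q3 - q1

# Assumed outlier rule: more than 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in assets if x < lower_fence or x > upper_fence]

print(smallest, q1, median, q3, largest, iqr, outliers)
```

Other quartile conventions can give slightly different Q1 and Q3 values for the same data, so the method should be stated when reporting a box plot.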
RELATIVE STANDING: BOX PLOTS
INTERPRETATION
• If the median is near the center of the box, the
distribution is approximately symmetric.
• If the median falls to the left of the center of the box, the
distribution is positively skewed.
• If the median falls to the right of the center of the box, the
distribution is negatively skewed.
• If the lines are about the same length, the distribution is
approximately symmetric.
• If the line segment to the right of the box is larger than
the one to the left, the distribution is positively skewed.
• If the line segment to the left of the box is larger than the
one to the right, the distribution is negatively skewed.
SYMMETRIC BOX PLOT
[Figure: box plot; horizontal axis: Number of units sold, 0 to 300]
POSITIVELY SKEWED BOX PLOT
[Figure: box plot; horizontal axis: Number of units sold, 0 to 300]
Summary Statistical Measure: The
Proportion
EXAMPLE
• Salary, expenses for cultural activities, and expenses for
sports related activities are collected from 100 households.
Data for only 5 households are shown below:
Salary and expenses data for 100 households
Salary     Culture   Sports
$54,600    $1,020    $990
$57,500    $1,100    $460
$53,300    $900      $780
$43,500    $570      $860
$57,200    $900      $1,390
What are the relationships (linear/non-linear, positive/negative)
between (i) salary and culture, (ii) salary and sports, and
(iii) sports and culture?
SALARY-CULTURE
[Figure: scatter plot; vertical axis: Expenses for cultural activities, $0 to $1,600; horizontal axis: Salary, $35,000 to $95,000]
cov = 1094787, r = 0.5065 (positive, linear)
SPORTS-CULTURE
[Figure: scatter plot; vertical axis: Expenses for cultural activities, $0 to $1,600; horizontal axis: Expenses for sports related activities, $500 to $2,000]
cov = -33608, r = -0.5201 (negative, linear)
SALARY-SPORTS
[Figure: scatter plot; vertical axis: Expenses for sports related activities, $400 to $1,900; horizontal axis: Salary, $35,000 to $95,000]
cov = -219026, r = -0.08122 (no linear relationship)