Transcript Handout 2

Lecture 2
Describing Data II
Summarizing and Describing
Data
• Frequency distribution and the shape of the distribution
• Measures of variability
1. Frequency distribution and
the shape of the distribution
In the previous lecture, we saw that the
mean of household savings gives an
inflated picture of the savings of a “normal
household”.
This was because the shape of the
histogram was not symmetric.
It is important to look at how the
observations are distributed.
[Figure: Histogram of Japanese Household Savings. Horizontal axis: savings in thousand yen, in bins of 2,000 from "below 2,000" to "above 40,000". Vertical axis: percentage (0 to 16). Marked on the chart: median = 10,520,000; sample average = 17,280,000.]
1-1 Frequency Distribution
The frequency table that we used in the previous
lecture is also called the frequency distribution. The
term frequency distribution usually refers to how the
observations are distributed. A plot of the
frequency table is called a histogram.
A histogram usually shows the number of
observations in each range. Sometimes,
however, it shows the percentage of
observations in each range.
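As a rough illustration of the difference between counts and percentages, the sketch below builds a small frequency table in Python. The function name `frequency_table`, the bin width, and the savings figures are all made up for this example; they are not the lecture's data.

```python
from collections import Counter

def frequency_table(values, bin_width):
    """Return (lower, upper, count, percent) for each non-empty bin.

    Bin k covers the range [k * bin_width, (k + 1) * bin_width).
    Empty bins are simply left out of the table.
    """
    counts = Counter(int(v // bin_width) for v in values)
    n = len(values)
    table = []
    for k in sorted(counts):
        lower, upper = k * bin_width, (k + 1) * bin_width
        table.append((lower, upper, counts[k], 100.0 * counts[k] / n))
    return table

# Hypothetical savings in thousand yen (illustrative only).
savings = [500, 1500, 2500, 3000, 3500, 5000, 7000, 12000, 21000, 41000]

for lower, upper, count, percent in frequency_table(savings, 2000):
    print(f"{lower:>6} to {upper:<6} count={count}  percent={percent:.1f}")
```

Plotting the `count` column gives a histogram of counts; plotting the `percent` column gives the same shape on a percentage scale.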
1-2 Shape of the Distribution
The shape of the distribution refers to
the shape of the Histogram.
1-3 Symmetric Distribution
The shape of the distribution is said to be
symmetric if the observations are
balanced, or evenly distributed, about the
mean. Equivalently, the distribution is
symmetric if the shape of the histogram is
symmetric.
Symmetric Distribution
[Figure: Symmetric Distribution. Bar chart of frequency (0 to 10) against the values 1 to 9.]
Note: For a symmetric distribution, the mean
and median are equal.
Symmetric Distribution: An
example
The age distribution
of the clients (from
the previous lecture
note) is nearly
symmetric.
1-4 Skewed Distribution
A distribution is skewed if the observations are
not symmetrically distributed above and below
the mean. A positively skewed (or skewed to
the right) distribution has a tail that extends to
the right in the direction of positive values. A
negatively skewed (or skewed to the left)
distribution has a tail that extends to the left in
the direction of negative values.
Positively skewed distribution
[Figure: Positively Skewed Distribution. Bar chart of frequency (0 to 12) against the values 1 to 9.]
Positively skewed distribution:
An example
The household saving histogram (from the
previous lecture) is an example of a positively
skewed distribution.
[Figure: Histogram of Japanese Household Savings. Horizontal axis: savings in thousand yen, in bins of 2,000 from "below 2,000" to "above 40,000". Vertical axis: percentage (0 to 16). Marked on the chart: median = 10,520,000; sample average = 17,280,000.]
Positively skewed distribution:
A note
For a positively skewed distribution, the
mean is typically greater than the median.
Negatively skewed distribution
[Figure: Negatively Skewed Distribution. Bar chart of frequency (0 to 12) against the values 1 to 9.]
Note: For a negatively skewed distribution,
the mean is typically less than the median.
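The notes on the mean and the median can be checked on small data sets. The following is a minimal sketch using Python's standard `statistics` module; the three lists are invented for illustration, with the left-skewed list built as the mirror image of the right-skewed one.

```python
import statistics

# Hypothetical data: one large value creates a long right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
# Mirror image of the list above, so the tail extends to the left.
left_skewed = [21 - x for x in right_skewed]
# Balanced about 3.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]

for name, data in [("right skewed", right_skewed),
                   ("left skewed", left_skewed),
                   ("symmetric", symmetric)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{name:>12}: mean={mean:.2f}  median={median:.2f}")
```

In these examples the mean is pulled toward the tail, while the median stays near the bulk of the observations.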
2. Measures of Variability
Variance
Standard deviation
Example
The data set “Sales at two different stores”
contains daily sales for two different
stores, collected over 60 days.
Store A’s average daily sales are 231,800
yen. Store B’s average daily sales are
230,500 yen.
Can we say that they are similar stores?
Look at the following graphs.
Daily sales of the two stores
[Figure: two line charts, "Store A: Daily Sales" (average = 231,800 yen) and "Store B: Daily Sales" (average = 230,500 yen). Horizontal axis: day (0 to 70). Vertical axis: daily sales in 1000 yen (0 to 450).]
Daily sales of the two stores
The difference between the two stores is
that Store A’s sales vary much more
than Store B’s sales.
We need a measure of variability in data.
2-1 How to measure the variability (1)
• Take Store A’s data as an example. The variability of each observation can be seen from the difference between the observation and the mean.
• But how do we measure the overall variability of the data?
[Figure: Store A: Daily Sales. Horizontal axis: day (0 to 70). Vertical axis: daily sales in 1000 yen (0 to 450). Average = 231,800 yen. Annotation: for each observation, you can compute the difference from the average.]
How to measure the variability (2)
Overall variability
• How about taking the average of all the differences? This is not a good idea: the differences can be positive or negative, and they always sum to zero, so their average is always zero.
• Therefore, we take the square of each difference. This is the first step in computing the “Variance”, a measure of overall variability.
[Figure: Store A: Daily Sales. Horizontal axis: day (0 to 70). Vertical axis: daily sales in 1000 yen (0 to 450). Average = 231,800 yen. Annotation: for each observation, you can compute the difference from the average.]
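The point that differences from the mean cancel out can be seen on a handful of numbers. This is a small sketch with invented daily sales (in thousand yen), not the Store A data.

```python
# Hypothetical daily sales in thousand yen (illustrative only).
sales = [180, 250, 210, 300, 260]

mean = sum(sales) / len(sales)

# Difference of each observation from the mean; some are negative.
deviations = [x - mean for x in sales]

# The positive and negative differences cancel, so the sum is zero.
sum_of_deviations = sum(deviations)

# Squaring removes the sign, so the squared differences do not cancel.
squared = [d ** 2 for d in deviations]
sum_of_squares = sum(squared)

print("mean:", mean)
print("deviations:", deviations)
print("sum of deviations:", sum_of_deviations)
print("sum of squared deviations:", sum_of_squares)
```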
2-2 Variance
A measure of variability
Variance is computed in the following way.
1. Subtract the mean from each observation (that is, compute the difference between each observation and the mean; note that the difference can be negative).
2. Square each difference.
3. Sum all the squared differences.
4. Divide the sum of squared differences by n-1 (the number of observations minus 1).
We will learn the reason why we divide the sum of
squares by n-1 after we learn the concept of the
expectation.
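The four steps above can be sketched in Python as follows. The function name `sample_variance` and the data are assumptions made for this example; the result is compared with `statistics.variance` from the standard library, which also divides by n-1.

```python
import statistics

def sample_variance(data):
    """Sample variance following the four steps in the text."""
    n = len(data)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(data) / n
    differences = [x - mean for x in data]   # step 1: subtract the mean
    squared = [d ** 2 for d in differences]  # step 2: square each difference
    total = sum(squared)                     # step 3: sum the squares
    return total / (n - 1)                   # step 4: divide by n - 1

# Hypothetical daily sales in thousand yen (illustrative only).
sales = [180, 250, 210, 300, 260]

print("step-by-step:", sample_variance(sales))
print("statistics.variance:", statistics.variance(sales))
```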
Computation of the variance:
Exercise
Open the data “Computation of
Variance”, and compute the variance of
Store A’s daily sales
Compute the variance of Store B’s daily
sales
Computation of the variance:
Exercise
Store A:
Average daily sales =231.8 thousand yen
Variance =4979.9
Store B:
Average daily sales=230.5 thousand yen
Variance =335.9
Notice that variance for Store A is higher than
that for Store B. This is because the variation in
the daily sales is higher for Store A.
Variance: note
In the previous slide, we did not use any unit of
measurement for variance. (For example, we do
not say that the variance for Store A is 4979.9
thousand yen.)
• This is because, when we compute the variance, we square the data. Therefore, the unit of measurement for the variance is “thousand yen squared”, which is not a meaningful unit.
• Therefore, we use the Standard Deviation, another measure of variation.
2-3 A measure of variability:
Standard deviation
Standard deviation is the square
root of the variance.
$$\text{Standard Deviation} = \sqrt{\text{Variance}}$$
Exercise:
Compute the standard deviation
of the daily sales for Store A and
Store B.
Standard Deviation: Store sales
data example
Standard deviation of Store A’s daily
sales=70.57 thousand yen.
Standard deviation for Store B’s daily sales=
18.33 thousand yen.
Roughly speaking, this means that Store A’s daily
sales typically deviate from their mean by about
70.6 thousand yen, and Store B’s daily sales
by about 18.3 thousand yen.
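The standard deviations quoted on this slide follow directly from the variances reported earlier (4979.9 for Store A and 335.9 for Store B). A short check in Python, with variable names chosen for this example:

```python
import math

# Variances reported in the lecture (units: thousand yen squared).
variance_a = 4979.9
variance_b = 335.9

# The standard deviation is the square root of the variance,
# which brings the unit back to thousand yen.
sd_a = math.sqrt(variance_a)
sd_b = math.sqrt(variance_b)

print(f"Store A: {sd_a:.2f} thousand yen")
print(f"Store B: {sd_b:.2f} thousand yen")
```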
Standard deviation and variance as
measures of risk (or uncertainty)
Often standard deviation and variance are
used as measures of uncertainty or risk.
If you would like to work as a store
manager, then Store B may be a better
store to work for: although its average
sales are almost the same as Store A’s, the
uncertainty is lower (lower standard
deviation).
Standard deviation and variance as
measures of risk (or uncertainty)
• In the store sales data, the average sales of the two stores are similar.
• However, on many other occasions, a higher return (higher average sales) comes with higher risk (higher standard deviation).
• One makes a decision by choosing a good combination of return and risk. For example, if you invest in a stock, you would choose a stock with a combination of return and risk that suits your preference.
• Therefore, standard deviation and variance are important numerical summaries of data for decision making.
2-4. Understanding the
mathematical notation of the
variance
Most of the time, we only have sample data
(not population data).
Variance computed from a sample is called
sample variance. We denote sample variance by
$s^2$.
When we have population data (which does not
happen often), we can compute the population
variance. We denote the population variance by
$\sigma^2$.
Understanding the mathematical
notation of sample variance
Observation id    Variable X
1                 $x_1$
2                 $x_2$
3                 $x_3$
:                 :
n                 $x_n$

The typical data we use comes in this format. Using this format, we would like to represent variance in a mathematical form.
Understanding the mathematical
notation of sample variance
Obs id     Variable X    Observation - mean    (Observation - mean)^2
1          $x_1$         $x_1 - \bar{X}$       $(x_1 - \bar{X})^2$
2          $x_2$         $x_2 - \bar{X}$       $(x_2 - \bar{X})^2$
3          $x_3$         $x_3 - \bar{X}$       $(x_3 - \bar{X})^2$
:          :             :                     :
n          $x_n$         $x_n - \bar{X}$       $(x_n - \bar{X})^2$
Average    $\bar{X}$

The first steps of computing the variance are written in the table. The variance can be computed by summing the last column and dividing the sum by (n-1). Therefore, the sample variance, $s^2$, can be written mathematically as follows.
Understanding the mathematical
notation for sample variance
Mathematically, sample variance, denoted as $s^2$, can be
written as
$$s^2 = \frac{(x_1 - \bar{X})^2 + (x_2 - \bar{X})^2 + (x_3 - \bar{X})^2 + \cdots + (x_n - \bar{X})^2}{n-1} = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n-1}$$
Mathematical notation for
population variance
Occasionally, we may have population data.
Then we can compute the population variance.
We use the notation $\sigma^2$ to denote the population
variance and $\mu$ to denote the population mean. We also
use upper case N to denote the
number of observations. The mathematical
notation for the population variance is
$$\sigma^2 = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2 + \cdots + (x_N - \mu)^2}{N} = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
Unlike the case for sample variance, we do not have to
divide the sum of squares by N-1. We simply divide it by
N.
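The only difference between the two formulas is the divisor. As a brief sketch, Python's standard `statistics` module offers both versions: `variance` divides by n-1 and `pvariance` divides by N. The data below are invented, and treating the same five numbers first as a sample and then as a whole population is purely for comparison.

```python
import statistics

# Hypothetical observations (illustrative only).
data = [180, 250, 210, 300, 260]
n = len(data)

# Treating the data as a sample: divide the sum of squares by n - 1.
sample_var = statistics.variance(data)

# Treating the data as the whole population: divide by N.
population_var = statistics.pvariance(data)

print("sample variance:", sample_var)
print("population variance:", population_var)
# The two differ only by the factor (n - 1) / n.
print("sample variance * (n - 1) / n:", sample_var * (n - 1) / n)
```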
2-5. Mathematical notation for the
sample standard deviation
The sample standard deviation, s, is written as
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n-1}}$$
Mathematical Notation for
population standard deviation
The population standard deviation, $\sigma$, is written as
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
2-6. Short-cut formula for sample
variance
The short-cut formula for the sample variance is:
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{X}^2}{n-1}$$
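One way to see where the short-cut formula comes from is to expand the square in the numerator of the sample variance. The sketch below uses only the definition of the sample mean, $\sum_{i=1}^{n} x_i = n\bar{X}$.

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{X})^2
  &= \sum_{i=1}^{n} \left( x_i^2 - 2\bar{X}x_i + \bar{X}^2 \right) \\
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{X}\sum_{i=1}^{n} x_i + n\bar{X}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - 2n\bar{X}^2 + n\bar{X}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - n\bar{X}^2
\end{align*}
```

Dividing both sides by n-1 gives the short-cut formula.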
Exercise
Compute the variance for the sales of
Store A by applying the short-cut formula
for sample variance, and show that this
indeed coincides with our previous
calculation.
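For readers who want to check their approach before working on the Store A data, here is a minimal sketch that compares the two formulas on a small invented data set. The function names `variance_definition` and `variance_shortcut` are assumptions made for this example.

```python
import math

def variance_definition(data):
    """Sample variance from the squared differences from the mean."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def variance_shortcut(data):
    """Sample variance from the short-cut formula."""
    n = len(data)
    mean = sum(data) / n
    return (sum(x ** 2 for x in data) - n * mean ** 2) / (n - 1)

# Hypothetical daily sales in thousand yen (not the Store A data).
sales = [180, 250, 210, 300, 260]

by_definition = variance_definition(sales)
by_shortcut = variance_shortcut(sales)

print("definition:", by_definition)
print("short-cut:", by_shortcut)
print("agree:", math.isclose(by_definition, by_shortcut))
```

With real data the two results may differ in the last decimal places because of rounding in floating-point arithmetic, which is why the comparison uses `math.isclose` rather than exact equality.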
Other Measures of Variability
1. The Range
The range in a set of data is the
difference between the largest and
smallest observations
Other Measures of Central Tendency
2. Mode
The mode, if one exists, is the most
frequently occurring observation in
the sample or population.
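Both the range and the mode are easy to compute directly. The following is a small sketch in Python; the helper name `data_range` and the list of ages are made up for illustration. `statistics.multimode` (Python 3.8 or later) returns every value that ties for most frequent, which is one way to notice when no single mode stands out.

```python
import statistics

def data_range(data):
    """Range: largest observation minus smallest observation."""
    return max(data) - min(data)

# Hypothetical client ages (illustrative only).
ages = [23, 35, 35, 41, 35, 52, 41]

print("range:", data_range(ages))

# 35 occurs three times, more often than any other value.
modes = statistics.multimode(ages)
print("mode(s):", modes)

# When every value occurs equally often, all values are returned,
# which can be read as "no mode exists".
no_clear_mode = statistics.multimode([1, 2, 3])
print("all values tie:", no_clear_mode)
```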
This lecture note covers:
Textbook P23~P28: Frequency
distribution
Textbook 3.1, 3.2: Measures of central
tendency and variability