Transcript Handout 7

Lecture note 7
Sampling Distribution
Estimation
Hypothesis testing
1
Sampling distribution
Let X1, X2 , …, Xn be independent random
variables. Further assume that all the
random variables have the normal
distribution with mean μ and variance σ²,
i.e., N(μ, σ²).
The sample mean of the above random
variables is defined as
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
2
Thus, the sample mean is a linear
combination of n random variables. Since
it is a linear combination of random
variables, the sample mean itself is a
random variable as well
Then, what is the distribution of the
sample mean?
I will follow 3 steps to show you the
distribution of the sample mean.
3
Step 1: Notice that the expectation of the
sample mean is μ. This is because:
$$
\begin{aligned}
E[\bar{X}] &= E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] && (1) \\
&= \frac{1}{n}\sum_{i=1}^{n} E[X_i] && (2) \\
&= \frac{1}{n}\sum_{i=1}^{n} \mu && (3) \\
&= \mu
\end{aligned}
$$
4
To understand the second line (2) of the
previous slide, consider the simplest case
where there are only two random
variables. Then
$$\bar{X} = \frac{1}{2}X_1 + \frac{1}{2}X_2$$
Thus
$$E[\bar{X}] = \frac{1}{2}E[X_1] + \frac{1}{2}E[X_2] = \frac{1}{2}\sum_{i=1}^{2} E[X_i]$$
Equation (2) is a simple extension of this
exercise to the case where there are n
random variables.
5
The third line (3) follows from the
assumption that all n random
variables follow N(μ, σ²) and thus have an
identical mean: μ.
Therefore, the expectation of the sample
mean is equal to μ. This is the first
step.
6
Step 2: Note that the variance of the sample
mean is given by
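$$
\begin{aligned}
\operatorname{Var}[\bar{X}] &= \operatorname{Var}\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] && (1) \\
&= \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var}[X_i] && (2) \\
&= \frac{1}{n^2}\sum_{i=1}^{n} \sigma^2 && (3) \\
&= \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}
\end{aligned}
$$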
7
To understand the second line (2) of the
previous slide, consider the simplest case
where there are only two random
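variables. Then
$$\bar{X} = \frac{1}{2}X_1 + \frac{1}{2}X_2$$
Thus
$$\operatorname{Var}[\bar{X}] = \frac{1}{2^2}\operatorname{Var}[X_1] + \frac{1}{2^2}\operatorname{Var}[X_2] + 2\cdot\frac{1}{2}\cdot\frac{1}{2}\operatorname{Cov}(X_1, X_2)$$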
Since X1 and X2 are independent by
assumption, Cov(X1, X2)=0. This means
that
$$\operatorname{Var}[\bar{X}] = \frac{1}{2^2}\operatorname{Var}[X_1] + \frac{1}{2^2}\operatorname{Var}[X_2] = \frac{1}{2^2}\sum_{i=1}^{2}\operatorname{Var}[X_i]$$
8
Line (2) is a simple extension of this to
the case of n random variables.
Line (3) follows from the assumption
that all n random variables follow N(μ,
σ²) and thus have an identical variance: σ².
Thus, the variance of the sample mean is
σ²/n. This is the second step.
9
Step 3: In this step, we use the fact that a
linear combination of normal random
variables is also a normal random
variable.
Since
$$\bar{X} = \frac{1}{n}X_1 + \frac{1}{n}X_2 + \dots + \frac{1}{n}X_n$$
the sample mean is a linear combination
of normal random variables. Therefore,
the sample mean $\bar{X}$ is a normal random
variable.
10
Combining the results from step 1 and
step 2, the distribution of the sample
mean is
$$\bar{X} \sim N(\mu, \sigma^2/n)$$
The next slide summarizes this finding.
11
The distribution of the sample mean
Let X1, X2 , …, Xn be independent random
variables with identical distribution, N(μ, σ²).
Define the sample mean of the above
random variables as
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
Then the sample mean follows
$$\bar{X} \sim N(\mu, \sigma^2/n)$$
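The result can also be checked by simulation. The sketch below is not part of the original handout: the values μ = 5, σ = 2, n = 16, the number of replications, and the seed are arbitrary choices made only for illustration. It draws many samples, computes the sample mean of each, and compares the average and variance of those sample means with μ and σ²/n.

```python
import random
import statistics

def simulate_sample_means(mu, sigma, n, reps, seed=0):
    """Draw `reps` samples of size n from N(mu, sigma^2); return the sample mean of each."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]

# Illustrative (assumed) values; they do not come from the handout.
mu, sigma, n = 5.0, 2.0, 16
means = simulate_sample_means(mu, sigma, n, reps=20000)

print("average of the sample means :", round(statistics.fmean(means), 3))
print("variance of the sample means:", round(statistics.variance(means), 3))
print("theoretical mean and variance:", mu, sigma**2 / n)
```

With a large number of replications, the first two printed values should be close to μ and σ²/n respectively.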
12
Exercise 1
Let X1, X2 , …, X64 be independent random
variables with identical distribution, N(0, 1)
What is the distribution of the sample mean
defined as
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
where n is the number of observations (i.e., the
number of variables)?
13
Estimation
14
Estimation
Suppose that the monthly sales of a shop
in the past 9 months are given by

Month        Revenue in 1000 yen
April        400
May          200
June         150
July         400
August       100
September    80
October      160
November     150
December     200
15
In estimation, we consider a data set as
random draws from an unknown
distribution.
For example, we consider that there is an
unknown distribution function which
characterizes the monthly sales of the
store.
Then, we consider the data in the
previous page as realized values of 9
independent draws from this unknown
distribution.
16
In other words, we consider the 9 data
points (from April to December) as the
realized value of 9 random variables, X1,
X2, …, X9, which are independently
distributed, and which have identical
distribution.
They are independent because they are
‘random draws’ from the population.
‘Identical distribution’ is an assumption.
17
The purpose of statistics is to estimate a
parameter of the unknown distribution,
such as the population mean.
In this section, I focus on the estimation of
the population mean μ.
18
Point and Interval Estimates
There are two types of estimates.
A point estimate is a single number.
A confidence interval provides additional
information about variability.
[Figure: a confidence interval running from the lower confidence limit to the upper confidence limit, with the point estimate inside it.]
19
The point estimator of the
population mean μ
Let X1, X2,…, Xn be the data (n random
draws from an unknown distribution).
Then the point estimator of the
population mean μ is given by
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
20
The Point Estimate of the
population mean.
We can estimate the unknown
population parameter (the population mean μ)
by using the sample mean $\bar{x}$
(a point estimate).
21
Distribution unknown but ‘normal’
The type of distribution that characterizes
a data set can be anything.
However, in this handout, we consider
the case where the distribution is normal
with unknown mean and variance.
This normal assumption simplifies the
analysis.
22
It is known that, even if the distribution
that characterizes the data is not normal, the
normal distribution is often a good
approximation for the distribution of the sample
mean when the sample size is large (the central
limit theorem). Thus, we focus on the
case where the population distribution is
normal.
23
Exercise 2
Suppose the monthly sales of a store follow a
normal distribution with unknown mean
and variance. Compute the point estimate of
the population mean.
This data is stored in the file ‘Monthly sales
data’.

Month        Sales in 1000 yen
April        400
May          200
June         150
July         400
August       100
September    80
October      160
November     150
December     200
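As a minimal sketch of the computation (the handout itself works from the spreadsheet file, so the Python list below is just a retyped copy of the table), the point estimate is the sum of the observations divided by their number:

```python
# Monthly sales in 1000 yen, April to December, retyped from the table in Exercise 2.
sales = [400, 200, 150, 400, 100, 80, 160, 150, 200]

def sample_mean(data):
    """Point estimate of the population mean: (1/n) times the sum of the observations."""
    return sum(data) / len(data)

x_bar = sample_mean(sales)
print(f"n = {len(sales)}, point estimate of the mean = {x_bar:.2f} thousand yen")
```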
24
Confidence interval
Confidence interval: An interval that
contains the population mean μ with a
given probability.
An interval estimate provides more
information about the population than
does a point estimate.
In particular, it can show the uncertainty
associated with the point estimate.
25
An example of a confidence
interval
Suppose that X1, X2, …, X20 are a random
sample taken from the population having
the normal distribution with variance 4,
but unknown mean μ. If the sample mean
is 0.5, then the population mean μ is in the
interval [-0.3765, 1.3765] with probability
95%.
Proof (important) See the front board
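The endpoints of this interval can be reproduced with a short calculation. This is only an arithmetic check, not the board proof; it assumes the standard fact that a standard normal variable lies within ±1.96 with probability 0.95, and uses σ = √4 = 2 and n = 20 from the example.

```latex
\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}}
  = 0.5 \pm 1.96 \times \frac{2}{\sqrt{20}}
  \approx 0.5 \pm 0.8765
  \quad\Longrightarrow\quad [-0.3765,\ 1.3765]
```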
26
The interval [-0.3765, 1.3765] is an
example of an interval estimate.
In particular, the interval in this example
is called the 95% confidence interval, since
the population mean would fall in this
interval with probability equal to 95%.
I am 95% confident
that μ is between
-0.3765 & 1.3765.
27
Confidence interval estimator
Confidence level: The probability you
choose in order to estimate the confidence
interval. This is usually set at 95%.
Significance level: A small number α such
that the confidence level = 100*(1-α)%.
For example, if you set the confidence level at
95%, then the significance level is α = 0.05.
28
I explain the construction of the confidence
interval for two cases in order for you to
understand the concept more easily.
Confidence interval for the population mean μ:
Case 1: The population variance σ² is known
Case 2: The population variance σ² is unknown
29
Case 1: Confidence interval of the
population mean when the
population variance is known
Let the confidence level be 100*(1-α)%.
Then, define a number Z α/2 as the number
satisfying the following
$$P(Z > Z_{\alpha/2}) = \alpha/2$$
where Z is a standard normal random variable.
The definition of Z α/2 is illustrated in the following
slide.
30
Definition of Z α/2
[Figure: density of the standard normal distribution N(0,1); the shaded right tail beyond Z α/2 has area α/2.]
Thus, Z α/2 is the cutoff point at which the right-tail probability is
equal to α/2.
31
Exercise 3
Q1. Find Z α/2 when α=0.05
Q2. Find Z α/2 when α=0.10
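In class these cutoffs are usually read from a normal table. As an alternative sketch, Python's standard library can compute them, using the fact that a right-tail probability of α/2 corresponds to a cumulative probability of 1 - α/2. The helper name `z_cutoff` is made up for this illustration.

```python
from statistics import NormalDist

def z_cutoff(alpha):
    """Return Z_{alpha/2}: the point whose right-tail probability under N(0, 1) is alpha/2."""
    return NormalDist(mu=0.0, sigma=1.0).inv_cdf(1.0 - alpha / 2.0)

for alpha in (0.05, 0.10):
    print(f"alpha = {alpha:.2f}: Z_alpha/2 = {z_cutoff(alpha):.3f}")
```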
32
The confidence interval estimator for the
population mean when the population variance is
known
Let X1, X2,…, Xn be random draws from
the normal distribution with unknown
population mean μ, but known
population variance σ². Then the 100*(1-α)%
confidence interval for the unknown
population mean μ is given by
$$\bar{x} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
Proof (Important): See the front board
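The board proof is not reproduced in this transcript. A standard argument, offered here as a sketch that may differ in detail from the one given in class, starts from the distribution of the sample mean derived earlier and rearranges the inequalities so that μ is in the middle:

```latex
\begin{aligned}
\bar{X} \sim N(\mu, \sigma^2/n)
  &\;\Longrightarrow\; Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1) \\
1-\alpha
  &= P\left(-Z_{\alpha/2} \le \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le Z_{\alpha/2}\right) \\
  &= P\left(\bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)
\end{aligned}
```

The second line uses the symmetry of N(0,1): each tail beyond ±Z α/2 has probability α/2, so the middle has probability 1-α.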
33
Exercise 4
Suppose that the monthly sales of a store follow a
normal distribution. Suppose that the population mean is
unknown, but the population variance is known to be 10000.
Q1. Find the 95% confidence interval of the population
mean.
Q2. Find the 90% confidence interval of the population
mean.

Month        Sales in 1000 yen
April        400
May          200
June         150
July         400
August       100
September    80
October      160
November     150
December     200
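The Case 1 formula can be sketched in code as follows. The function name `z_interval` is hypothetical, and the normal cutoff is computed with the standard library rather than read from a table; the data and the known variance of 10000 come from the exercise.

```python
from math import sqrt
from statistics import NormalDist

# Monthly sales in 1000 yen (from the exercise) and the known population variance.
sales = [400, 200, 150, 400, 100, 80, 160, 150, 200]
SIGMA2 = 10000

def z_interval(data, sigma2, confidence):
    """Confidence interval for the mean when the population variance sigma2 is known."""
    n = len(data)
    x_bar = sum(data) / n
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # Z_{alpha/2}
    half_width = z * sqrt(sigma2 / n)            # Z_{alpha/2} * sigma / sqrt(n)
    return x_bar - half_width, x_bar + half_width

for level in (0.95, 0.90):
    low, high = z_interval(sales, SIGMA2, level)
    print(f"{level:.0%} confidence interval: [{low:.2f}, {high:.2f}]")
```

Note that the 90% interval is narrower than the 95% interval: a lower confidence level uses a smaller cutoff.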
34
Case 2: The confidence interval of the
population mean when the population
variance is unknown
In case 1, we assumed that the population
variance was known. In reality, we rarely
know the population variance.
Now, in case 2, we consider the situation
where the population variance is
unknown. Thus it is a more realistic case.
35
In case 1, the confidence interval was
derived from the fact that
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
follows N(0,1) when σ is known. (see the
proof for case 1)
For case 2, we replace σ with the sample
standard deviation s.
36
When we replace σ with the sample standard
deviation s, we have the following.
$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}} \qquad (A)$$
where the sample standard deviation s is defined
as
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}$$
It is known that (A) has the t-distribution with
n-1 degrees of freedom.
37
t-distribution
t-distributions are bell-shaped
and symmetric, but have
‘fatter’ tails than the normal distribution.
[Figure: densities of the standard normal (t with df = ∞), t (df = 13), and t (df = 5), all centered at 0.]
38
A notation
Let tv be a random variable having the t-distribution with v degrees of freedom. Let
100*(1-α)% be the confidence level. Then
the number tv,α/2 is defined as the
number satisfying the following.
$$P(t_v > t_{v,\alpha/2}) = \alpha/2$$
39
Definition of t v,α/2
[Figure: density of the t-distribution with v degrees of freedom; the shaded right tail beyond tv,α/2 has area α/2.]
Thus, tv,α/2 is the cutoff point at which the right-tail probability is
equal to α/2.
40
How to find t v,α/2
Cutoff points for Student's t-distribution (see textbook 866, Table 8)

df    Right-tail probability
      .10      .05      .025
1     3.078    6.314    12.706
2     1.886    2.920    4.303
3     1.638    2.353    3.182

The body of the table contains the cutoff points, not probabilities.
For example, if v = 2 and α = 0.1, then α/2 = .05 and t2,0.05 = 2.920.
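The table values can also be reproduced numerically. The sketch below is one possible approach and not the method used in the handout: it integrates the t density with Simpson's rule and then searches for the cutoff by bisection, using only the standard library. The function names and the number of integration steps are choices made for this illustration, and the gamma-function formula for the t density is a standard result that the handout does not state.

```python
import math

def t_pdf(x, v):
    """Density of the t-distribution with v degrees of freedom."""
    c = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return c * (1 + x * x / v) ** (-(v + 1) / 2)

def t_cdf(x, v, steps=2000):
    """P(t_v <= x): Simpson's rule on [0, |x|], then symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, v) + t_pdf(a, v)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, v)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_cutoff(v, tail_prob):
    """Find c with P(t_v > c) = tail_prob, for 0 < tail_prob < 0.5, by bisection."""
    lo, hi = 0.0, 1.0
    while 1 - t_cdf(hi, v) > tail_prob:  # widen until the cutoff is bracketed
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if 1 - t_cdf(mid, v) > tail_prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(f"t_(2, 0.05)  = {t_cutoff(2, 0.05):.3f}")   # table above: 2.920
print(f"t_(3, 0.025) = {t_cutoff(3, 0.025):.3f}")  # table above: 3.182
```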
41
Exercise 5
Find tv,α/2 when v=40 and α=0.05.
42
Case 2: The confidence interval for the population
mean when the population variance is unknown
Let X1, X2,…, Xn be random draws from
the normal distribution with unknown
population mean and unknown
population variance. Then the 100*(1-α)%
confidence interval for the unknown
population mean μ is given by
$$\bar{x} - t_{n-1,\alpha/2}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}$$
where s is the sample standard deviation.
Proof (Important): See the front board
43
Exercise 6
Suppose that the monthly sales of a store follow a
normal distribution with unknown mean and unknown
variance.
Q1. Find the 95% confidence interval of the population
mean.

Month        Sales in 1000 yen
April        400
May          200
June         150
July         400
August       100
September    80
October      160
November     150
December     200
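A minimal sketch of the Case 2 computation for this data follows. The cutoff t8,0.025 = 2.306 is an assumption in the sense that it is taken from a standard t-table (n - 1 = 8 degrees of freedom); it is not printed in the excerpt of the table shown in this handout.

```python
from math import sqrt
from statistics import fmean, stdev

# Monthly sales in 1000 yen, retyped from the exercise.
sales = [400, 200, 150, 400, 100, 80, 160, 150, 200]

# Assumption: t_{8, 0.025} = 2.306, read from a standard t-table (df = n - 1 = 8).
T_CUTOFF = 2.306

n = len(sales)
x_bar = fmean(sales)
s = stdev(sales)                     # sample standard deviation (divisor n - 1)
half_width = T_CUTOFF * s / sqrt(n)  # t_{n-1, alpha/2} * s / sqrt(n)
low, high = x_bar - half_width, x_bar + half_width

print(f"x_bar = {x_bar:.2f}, s = {s:.2f}")
print(f"95% confidence interval: [{low:.2f}, {high:.2f}]")
```

Compared with Case 1, the interval is built from the estimated s and a t cutoff instead of the known σ and a normal cutoff.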
44
Hypothesis testing
45
Hypothesis testing
Statistical theory provides you with a
scientific way to test a hypothesis.
Hypothesis testing is an important part of
decision making. For example you can
test the following:
1. Whether or not the population mean of
the stock return in a particular industry is
8%.
2. Whether a weight reduction program has any
real effect.
46
Null hypothesis and
alternative hypothesis
The first step in conducting a hypothesis
test is to develop an appropriate (i) null
hypothesis and (ii) alternative hypothesis.
In the following, I will provide two
examples.
47
[Example 1]
If you want to test whether the population mean of the
stock return in a particular industry is 8%, then
we have two hypotheses:
The null hypothesis: H0: μ=8%
The alternative hypothesis: H1: μ≠8%
In hypothesis testing, you test the null
hypothesis against the alternative hypothesis.
You should develop an appropriate set of
two competing hypotheses.
48
[Example 2]
Suppose you run a weight reduction program.
You have the weight reduction data for each
client. Then you may want to test if your weight
reduction program has any real effect. Let μ be
the unknown population mean of the weight
reduction. Then the appropriate null and
alternative hypotheses would be
The null hypothesis: H0: μ=0
The alternative hypothesis: H1: μ>0
49
In this case, your null hypothesis means
“the weight reduction program has no
effect”, and the alternative hypothesis
means “there is some real effect”.
50
Two sided test
A two sided test has the following null and
alternative hypotheses.
H0: μ=μ0
H1: μ≠μ0
Thus, the stock return example is an
example of a two sided test where μ0 = 8%.
51
Test procedure for two sided test
1. First, set the significance level α. This
number should be reasonably small. It is
usually set at 0.05.
2. Second, compute the “t-statistic”, which is
defined as
$$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$$
where μ0 is the value from the null hypothesis.
3. Reject H0 if the absolute value of the t-statistic is greater than tn-1,α/2.
Otherwise, do not reject (i.e., accept) H0.
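The three steps above can be sketched as a small function. Everything specific in the example is an assumption: the scores are invented for illustration (they are not the contents of the ‘Test Score’ file), the function name is hypothetical, and the cutoff t8,0.025 = 2.306 is taken from a standard t-table for n - 1 = 8 degrees of freedom and α = 0.05.

```python
from math import sqrt
from statistics import fmean, stdev

def two_sided_t_test(data, mu0, t_cutoff):
    """Return (t-statistic, reject?) for H0: mu = mu0 against H1: mu != mu0."""
    n = len(data)
    t_stat = (fmean(data) - mu0) / (stdev(data) / sqrt(n))  # stdev divides by n - 1
    return t_stat, abs(t_stat) > t_cutoff

# Hypothetical exam scores, made up for this sketch.
scores = [55, 62, 70, 48, 66, 59, 73, 51, 64]

# Assumption: t_{8, 0.025} = 2.306 from a standard t-table (n - 1 = 8, alpha = 0.05).
t_stat, reject = two_sided_t_test(scores, mu0=60, t_cutoff=2.306)
print(f"t-statistic = {t_stat:.3f}, reject H0: {reject}")
```

With these made-up scores the t-statistic lies well inside [-2.306, 2.306], so H0: μ=60 is not rejected.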
52
Two sided test decision
[Figure: density of the t-distribution with n-1 degrees of freedom; the region between -tn-1,α/2 and tn-1,α/2 has probability 1-α, and each shaded tail has probability α/2.]
Reject H0 if the t-statistic falls in the shaded region. This
region is called the rejection region. You reject H0 because,
if H0 is true, then the t-statistic falls in this region only with a
small probability: 100*α%.
Do not reject (i.e., accept) H0 if the t-statistic falls in the interval
[-tn-1,α/2, tn-1,α/2].
53
If the null hypothesis is rejected, you say
that the null hypothesis is rejected at the
100*α% significance level.
If you rejected H0, the alternative
hypothesis (H1) is “accepted”.
If you did not reject H0, the null
hypothesis (H0) is “accepted”.
54
Note
Strictly speaking, this test is valid only for
the case where the population distribution
is normal with unknown mean and
unknown variance. If the population
distribution is not normal, this does not
apply.
However, even if the population
distribution is not normal, it is known that
this test is a good approximation for a wide
range of distributions, provided the sample size
is reasonably large.
55
Exercise 7
The Excel file ‘Test Score’ shows the final
exam scores for a particular class.
The professor wanted the mean score for
the final exam to be about 60.
Test if the population mean of the test
scores is equal to 60 at the significance
level equal to 5%.
56
One sided test (upper tail test)
A one sided test (upper tail test) has the following null and
alternative hypotheses.
H0: μ=μ0
H1: μ>μ0
Thus, the example of the weight reduction
program is an example of a one sided
test, where μ0 = 0.
57
Test procedure for one sided test
1. First, set the significance level α. This is
usually set at 0.05.
2. Second, compute the “t-statistic”, which is
defined as
$$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$$
where μ0 is the value from the null hypothesis.
3. Reject H0 if the t-statistic is greater than
the critical value tn-1,α. Otherwise, do
not reject (i.e., accept) H0.
58
One sided test decision
[Figure: density of the t-distribution with n-1 degrees of freedom; the region to the left of tn-1,α has probability 1-α, and the shaded right tail has probability α.]
Reject H0 if the t-statistic falls in the shaded region.
This region is called the rejection region. You reject
H0 since, if H0 is true, the t-statistic falls in this region
only with a small probability: 100*α%.
Do not reject (i.e., accept) H0 if the t-statistic is
smaller than tn-1,α.
59
Exercise 8
You run a weight reduction program. This
data shows the weight reduction for
each of your clients in kilograms. A positive
number means a reduction in weight.
A negative number means an increase in weight.
Test, at the 5% significance level, whether this
program has indeed reduced the weights of
your clients.
The data is stored in the file ‘Weight
Reduction’.

Weight reduction in kg
5
0
4
3
-0.5
1
0
3
5
5
0
0
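The one sided procedure can be sketched for this data as follows. The function name is hypothetical, and the critical value t11,0.05 = 1.796 is an assumption taken from a standard t-table (n - 1 = 11 degrees of freedom, α = 0.05); the data are retyped from the exercise.

```python
from math import sqrt
from statistics import fmean, stdev

# Weight reduction in kg for the 12 clients, retyped from the exercise.
weight_loss = [5, 0, 4, 3, -0.5, 1, 0, 3, 5, 5, 0, 0]

def upper_tail_t_test(data, mu0, t_cutoff):
    """Return (t-statistic, reject?) for H0: mu = mu0 against H1: mu > mu0."""
    n = len(data)
    t_stat = (fmean(data) - mu0) / (stdev(data) / sqrt(n))  # stdev divides by n - 1
    return t_stat, t_stat > t_cutoff

# Assumption: t_{11, 0.05} = 1.796 from a standard t-table (n - 1 = 11, alpha = 0.05).
t_stat, reject = upper_tail_t_test(weight_loss, mu0=0, t_cutoff=1.796)
print(f"mean reduction = {fmean(weight_loss):.3f} kg")
print(f"t-statistic = {t_stat:.3f}, reject H0: {reject}")
```

Note that the one sided test compares the t-statistic itself, not its absolute value, with tn-1,α.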
60
Testing the difference in the
population means between two
different samples
We are often interested in seeing whether there are
any differences between two groups
(female vs. male, etc.).
61
In this section, I will show how to
examine if there is any difference in the
population means of two different groups.
Assumptions
(i) The distributions for both groups are
normal.
(ii) The population means may be different,
but the population variances are the same
between the two groups.
62
Suppose that you have two sets of data,
one for group X and the other for group Y.
Then you may be interested in the
following tests
Example: female students (X) and male students (Y)
H0: Female students and male students have the same average
test score.
H1: Female students have a higher average test score.

One sided test
H0: μx-μy=0
H1: μx-μy>0
Two sided test
H0: μx-μy=0
H1: μx-μy≠0
63
Testing procedure
I’m going to describe only the following
one-sided test.
H0: μx-μy=0
H1: μx-μy>0
Next, let nx and ny be the number of
observations for each group.
Let Sx and Sy be the sample standard
deviations for group X and group Y.
64
1. Construct the ‘pooled sample variance’
as
$$S_p^2 = \frac{(n_x - 1)S_x^2 + (n_y - 1)S_y^2}{n_x + n_y - 2}$$
2. Then construct the t-statistic as
$$t = \frac{(\bar{X} - \bar{Y}) - 0}{\sqrt{\dfrac{S_p^2}{n_x} + \dfrac{S_p^2}{n_y}}}$$
The 0 in the numerator is written out just to emphasize that
it comes from H0: μx-μy=0.
Then, under H0, the t-statistic follows the t-distribution with
(nx+ny-2) degrees of freedom.
65
3. Reject the null hypothesis H0 if the t-statistic is greater than tnx+ny-2,α.
Otherwise, do not reject (i.e., accept) H0.
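The pooled-variance procedure can be sketched as follows. The two groups of scores are invented for illustration and are not taken from the ‘Test scores’ file; the function name is hypothetical; and the critical value t8,0.05 = 1.860 is an assumption taken from a standard t-table (nx + ny - 2 = 8 degrees of freedom, α = 0.05).

```python
from math import sqrt
from statistics import fmean, variance

def pooled_t_statistic(x, y):
    """Return (t-statistic, degrees of freedom) for H0: mu_x - mu_y = 0."""
    nx, ny = len(x), len(y)
    # Pooled sample variance; variance() uses the divisor n - 1.
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    t_stat = ((fmean(x) - fmean(y)) - 0) / sqrt(sp2 / nx + sp2 / ny)
    return t_stat, nx + ny - 2

# Hypothetical test scores for two groups, made up for this sketch.
group_x = [72, 65, 80, 58, 75]
group_y = [60, 55, 70, 62, 53]

t_stat, df = pooled_t_statistic(group_x, group_y)

# Assumption: t_{8, 0.05} = 1.860 from a standard t-table (df = 8, alpha = 0.05).
reject = t_stat > 1.860
print(f"t-statistic = {t_stat:.3f}, df = {df}, reject H0: {reject}")
```

For a two sided version of the same test, one would instead compare the absolute value of the t-statistic with the α/2 cutoff.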
66
Exercise 9
The file ‘Test scores’ shows the test scores for a
final exam for a particular class.
This file also contains information about
the students, such as gender
or the use of office hours. Answer the
following questions.
67
Q1. Do students who make good use of office
hours perform better than those who have never
used office hours? Test this at significance level
5%.
Q2. Is there any difference in test scores between
male students and female students? Test this at
significance level 5%.
68