Transcript Slide 1

An Introduction to Estimation
Estimation I: 1
Biostatistics serves two purposes (among others):
To use information from a sample of data to
1. Describe our best guess of the characteristics of
the population.
• Best guess → estimation
2. Gauge the plausibility of alternative explanations
for what is observed in the sample.
• Hypothesis testing
Estimation I: 2
Estimates vs Parameters:
Sample Statistics
Mean: X̄
Variance: S²
Standard Deviation: S
Population Parameters
Mean: µ
Variance: σ²
Standard Deviation: σ
What does it mean to say we know X̄ from a sample,
but we don't know µ?
• We can observe a sample mean and then
use it as an estimate of the true, unknown
population mean.
Estimation I: 3
Some Notation / Definitions
1. Estimation: The computation of a statistic from
sample data for the purposes of obtaining a guess
of the unknown population parameter value.
2. Estimator: The label given to the statistic that is to
be calculated in estimation.
e.g., sample mean, X̄
sample standard deviation, S
3. Estimate: The value the estimator takes when
calculated using an actual sample of data.
e.g., x̄ = 10 minutes
s = 3 minutes
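The distinction between an estimator (the rule) and an estimate (the number it returns) can be sketched in a few lines of Python. The data values below are hypothetical, chosen only for illustration; they are not from the slides.

```python
import statistics

# Hypothetical sample of durations in minutes (illustrative data only).
sample = [7, 8, 9, 10, 10, 11, 12, 13]

# The estimators are the rules: "take the sample mean", "take the sample SD".
# The estimates are the numbers those rules return for this particular sample.
x_bar = statistics.mean(sample)   # point estimate of the population mean
s = statistics.stdev(sample)      # point estimate of the population SD (divisor n - 1)

print(f"x-bar = {x_bar} minutes, s = {s:.2f} minutes")
```

A different sample would give different estimates, while the estimators stay the same.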
Estimation I: 4
What criteria should we use to define a good
estimate?
1) In the long run, “correct”
if we imagine sampling over and over, the
average of repeated sampling should
result in the correct answer:
UNBIASEDNESS
2) In the short run, “in error by as little as possible”
(most of the time, it should be “close” to the true
value)
This is the concept of precision.
It is also called the statistical concept of
minimum variance.
Estimation I: 5
Example: Is the sample mean or the sample median a
better choice as an estimate of µ, the true population
mean, for the normal distribution?
1. Unbiasedness:
Both the sample mean and the median are
unbiased estimates of µ.
(Note this is true for the Normal Distribution, but does
not hold for all distributions).
2. Precision:
For the sampling distributions of sample means
and sample medians, it can be shown that
Variance(sample means) < Variance(sample medians)
For X~ Normal, the sample mean is said to be a
minimum variance unbiased estimator (mvue) of µ.
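Both criteria can be checked informally by simulation. The sketch below draws repeated samples from a Normal population and compares the sample mean with the sample median. The population values, sample size, and number of repetitions are assumptions made for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

MU, SIGMA = 50.0, 10.0   # assumed "true" population values for the simulation
N, REPS = 25, 5000       # sample size and number of repeated samples

means, medians = [], []
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    means.append(statistics.mean(sample))
    medians.append(statistics.median(sample))

# Unbiasedness: the long-run average of each estimator should be close to MU.
avg_of_means = statistics.mean(means)
avg_of_medians = statistics.mean(medians)

# Precision: the estimator with the smaller variance is "closer" more often.
var_of_means = statistics.variance(means)
var_of_medians = statistics.variance(medians)

print(f"average of sample means:   {avg_of_means:.2f}")
print(f"average of sample medians: {avg_of_medians:.2f}")
print(f"variance of sample means:   {var_of_means:.2f}")
print(f"variance of sample medians: {var_of_medians:.2f}")
```

Under these assumptions both averages land near 50, and the sample means vary less than the sample medians.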
Estimation I: 6
If the data are normally distributed:
X ~ N(µ, σ²)
⇒ X̄ ~ N(µ, σ²/n)
That is, we know that the sampling distribution of
sample means from this population will follow
• a Normal distribution with
• the same mean as the underlying
population
• a decreased variance relative to the
underlying population: σ²/n
Estimation I: 7
We can then make statements about the
probability of observing a value of X̄ within some
interval around µ:
• X̄ is within 1 standard error of µ 68% of the time.
• X̄ is within 1.96 standard errors of µ 95% of the time.
• X̄ is within 2.576 standard errors of µ 99% of the
time.
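[Figure: Normal sampling distribution of X̄ centered at µ. The central 68% of the area lies within µ ± σ/√n; the central 95% lies within µ ± 1.96 σ/√n.]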
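The 68%, 95%, and 99% figures can be verified from the standard Normal distribution. This is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `prob_within` is hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal, N(0, 1)

def prob_within(k):
    """Probability that a Normal variable lies within k standard errors of its mean."""
    return Z.cdf(k) - Z.cdf(-k)

for k in (1, 1.96, 2.576):
    print(f"within {k} SE: {prob_within(k):.4f}")
```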
Estimation I: 8
2 types of Estimators
Point Estimators
Single best guess
Form of estimate:
a value
e.g., x̄ = 10 ml
Interval Estimators
Range of values
Form of estimate:
(lower limit, upper limit)
e.g., (5, 15) ml
We’ve been working with point estimators;
Our next step is to define interval estimators.
Estimation I: 9
There are 3 ingredients to a confidence interval:
1. a point estimator (e.g., x̄)
2. the SE of the point estimator (e.g., σ/√n)
3. a confidence coefficient with an associated
probability (e.g., a percentile of a Normal
distribution)
Estimation I: 10
The form of the confidence interval is then:
Lower limit = (Point Estimator) − [(conf. coeff) × (SE)]
Upper limit = (Point Estimator) + [(conf. coeff) × (SE)]
For example, for a mean:
LL = X̄ − c (σ/√n)
UL = X̄ + c (σ/√n)
Next Step: where does ‘c’ come from?
Estimation I: 11
Interpretation of a 95% Confidence Interval
[Figure: a population with mean µ, and the interval estimate from each of many samples drawn around its own x̄.]
In repeated sampling, each sample gives rise to a point
and interval estimate of the mean.
Estimation I: 12
Each sample gives rise to its own
• point estimate and
• confidence interval estimate built around the
point estimate.
We will construct our intervals so that:
• If all possible samples of a given sample size
were drawn from the underlying distribution and
• each sample gave rise to its own interval estimate
• then 95% of all such intervals would include the
unknown µ while 5% would not
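The repeated-sampling interpretation can be illustrated by simulation. The sketch below cannot draw all possible samples, so it draws a few thousand. The population values and sample size are assumptions made for illustration, and σ is treated as known.

```python
import math
import random
from statistics import NormalDist, mean

random.seed(2)  # fixed seed so the run is repeatable

MU, SIGMA, N = 100.0, 15.0, 30   # assumed population and sample size
REPS = 4000                      # number of repeated samples

z = NormalDist().inv_cdf(0.975)  # 97.5th percentile of N(0, 1), about 1.96
se = SIGMA / math.sqrt(N)        # standard error of the mean (sigma known)

covered = 0
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = mean(sample)
    lower, upper = x_bar - z * se, x_bar + z * se
    if lower <= MU <= upper:     # does this interval include the true mean?
        covered += 1

coverage = covered / REPS
print(f"fraction of intervals that include mu: {coverage:.3f}")
```

Under these assumptions the printed fraction should be close to 0.95. Any single interval either includes µ or it does not.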
Estimation I: 13
Interpreting Confidence Intervals:
Example: Take a sample, compute x̄, and compute
an interval estimate for the mean: (1.3, 9.5)
Correct:
“95% of intervals constructed in this manner will
include the population mean µ.”
Incorrect:
“The probability that the interval (1.3, 9.5) contains
µ is 0.95.”
The 2nd statement is incorrect because once an
interval is constructed it either contains the mean
or it doesn’t. Since we don’t know µ, we just
don’t know which it is.
Estimation I: 14
Computing a Confidence Interval for µ
Case 1: σ² known
Example: The weight in micrograms of a drug inside
of capsules is Normally distributed, with σ² = 0.25.
We are given a sample of n = 30 capsules with a mean
weight of 0.51 micrograms, and asked to construct a
95% confidence interval estimate of the population
mean weight.
1. The point estimate is x̄ = 0.51 µg
2. The standard error of x̄ is:
\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{\sqrt{0.25}}{\sqrt{30}} = 0.091\ \mu\text{g} \]
Estimation I: 15
3. We set the confidence coefficient to equal
z.975 = 1.96 for a 95% confidence interval, so that z_l = −1.96 and z_u = 1.96.
We can then compute the lower and upper limits as:
\[ LL = \bar{x} + z_l \frac{\sigma}{\sqrt{n}} = 0.51 - (1.96)(0.091) = 0.332 \]
\[ UL = \bar{x} + z_u \frac{\sigma}{\sqrt{n}} = 0.51 + (1.96)(0.091) = 0.688 \]
The 95% Confidence interval for the mean weight of
drug per capsule is (0.33, 0.69) micrograms.
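The capsule calculation can be reproduced with a short function. This is a sketch under the slide's assumptions (Normal data, σ² known). The function name `z_interval` is hypothetical, and the percentile comes from Python's standard-library `statistics.NormalDist` rather than a printed table.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for a mean when the population SD (sigma) is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # confidence coefficient
    se = sigma / math.sqrt(n)                 # standard error of the mean
    return x_bar - z * se, x_bar + z * se

# Capsule example: variance 0.25 (so sigma = 0.5), n = 30, sample mean 0.51.
lower, upper = z_interval(x_bar=0.51, sigma=math.sqrt(0.25), n=30)
print(f"95% CI: ({lower:.2f}, {upper:.2f}) micrograms")
```

Because the code carries full precision instead of rounding the standard error to 0.091, the third decimal place of each limit can differ slightly from the hand calculation.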
Estimation I: 16
Deriving the expression for a Confidence Interval
We want: Pr(z_l ≤ Z ≤ z_u) = 1 − α = 0.95
z_l = 2.5th percentile of Normal
z_u = 97.5th percentile of Normal
[Figure: standard Normal curve centered at 0. The area between z_l and z_u is (1 − α) = .95; the shaded area in each tail is α/2 = .025.]
We split the excluded area (α = .05) symmetrically around the mean.
Estimation I: 17
Recall that we want the Inverse Cumulative Distribution
to find a percentile of the Normal.
In Minitab: Calc → Probability Distributions → Normal
percentiles: .025, .975
Estimation I: 18
We can find the percentiles for:
α/2 = .05/2 = .025 ⇒ z.025 = −1.96
and
1 − α/2 = 1 − (.05/2) = .975 ⇒ z.975 = 1.96
Pr[−1.96 ≤ Z ≤ 1.96] = .95
We now have:
[Figure: standard Normal curve centered at 0. The area between z.025 and z.975 is (1 − α) = .95; the shaded area in each tail is α/2 = .025.]
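The same percentiles can be obtained outside Minitab. The sketch below uses the inverse cumulative distribution function of `statistics.NormalDist` from Python's standard library, offered here as an alternative tool and not as part of the original lecture.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal, N(0, 1)

# Inverse cumulative distribution: the value below which a given area lies.
z_lower = Z.inv_cdf(0.025)   # 2.5th percentile
z_upper = Z.inv_cdf(0.975)   # 97.5th percentile

# Check that the area between the two percentiles is 1 - alpha = 0.95.
area = Z.cdf(z_upper) - Z.cdf(z_lower)

print(f"z_lower = {z_lower:.3f}, z_upper = {z_upper:.3f}, area between = {area:.3f}")
```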
Estimation I: 19
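Derivation of a confidence interval for µ based on X̄:
We know (simple standardization):
\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \]
Then:
\[
\begin{aligned}
\Pr(z_l \le Z \le z_u)
&= \Pr\left( z_l \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_u \right) && \text{Substitute in for } Z \\
&= \Pr\left( z_l \frac{\sigma}{\sqrt{n}} \le \bar{X} - \mu \le z_u \frac{\sigma}{\sqrt{n}} \right) && \text{Multiply by SE} \\
&= \Pr\left( -\bar{X} + z_l \frac{\sigma}{\sqrt{n}} \le -\mu \le -\bar{X} + z_u \frac{\sigma}{\sqrt{n}} \right) && \text{Subtract } \bar{X} \\
&= \Pr\left( \bar{X} - z_u \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} - z_l \frac{\sigma}{\sqrt{n}} \right) && \text{Multiply by } -1 \text{ (reverses the inequalities)} \\
&= \Pr\left( \bar{X} + z_l \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_u \frac{\sigma}{\sqrt{n}} \right) && \text{Since } z_l = -z_u
\end{aligned}
\]
This is of form: (pt. estimate) ± (conf. coeff)(std error)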
Estimation I: 20
When z_l = −1.96 and z_u = 1.96,
\[ \Pr\left( \bar{X} + z_l \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_u \frac{\sigma}{\sqrt{n}} \right) = 0.95 \]
We can then compute the lower and upper limits as:
\[ LL = \bar{x} + z_l \frac{\sigma}{\sqrt{n}} = 0.51 + (-1.96)(0.091) = 0.332 \]
\[ UL = \bar{x} + z_u \frac{\sigma}{\sqrt{n}} = 0.51 + (1.96)(0.091) = 0.688 \]
The 95% Confidence interval for the mean weight of
drug per capsule is (0.33, 0.69) micrograms.
Estimation I: 21
Bottom Line:
Confidence Interval Estimate = Point Estimate ± (Confidence Coefficient)(Std Error)
= X̄ ± (Percentile from N(0,1)) × (σ/√n)
Commonly used Confidence Coefficients from N(0,1):
• For a 90% confidence interval z.95 = 1.645
• For a 95% confidence interval z.975 = 1.96
• For a 99% confidence interval z.995 = 2.576
Estimation I: 22
Example
In the 1990 U.S. census, the mean height of men in the
US was µ = 69 inches, and σ = 3 inches.
By the year 2004 heights may have changed but we
will assume the standard deviation is the same, and
known: σ = 3 inches.
Since we can’t afford to measure the whole
population we’ll take a sample of n = 100 men.
We observe a mean height of x̄ = 70 inches.
a. What is the 95% confidence interval estimate
of the year 2004 population mean height?
b. the 99% confidence interval estimate?
Estimation I: 23
Solution
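We want (1 − α) = 0.95 for a 95% confidence interval:
[Figure: standard Normal curve with central area .95 and area .025 in the upper tail beyond z.975 = 1.96.]
\[ LL = \bar{x} + z_l \frac{\sigma}{\sqrt{n}} = 70 - (1.96)\left(\frac{3}{\sqrt{100}}\right) = 69.4 \]
\[ UL = \bar{x} + z_u \frac{\sigma}{\sqrt{n}} = 70 + (1.96)\left(\frac{3}{\sqrt{100}}\right) = 70.6 \]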
Estimation I: 24
The 95% C.I. estimate of the population mean height
is (69.4, 70.6) inches.
Do we have evidence that heights of men have
changed since 1990?
That is, is the 1990 mean height within this interval?
Since 69 inches is outside of the 95% confidence
interval, and most samples would result in a
confidence interval that includes the population mean,
it seems reasonable to conclude that heights of men in
the U.S. may have changed.
We suspect that the mean height is greater in 2004
than in 1990.
Estimation I: 25
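b. For desired confidence = 0.99
[Figure: standard Normal curve with central area .99 and area 0.005 in the upper tail beyond z.995 = 2.576.]
\[ 99\%\ \text{CI} = \bar{x} \pm z_{.995} \frac{\sigma}{\sqrt{n}} = 70 \pm (2.576)\left(\frac{3}{\sqrt{100}}\right) \]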
The 99% confidence interval estimate is (69.2, 70.8) inches.
Since 69 inches is outside of the 99% confidence interval,
it seems reasonable to conclude that heights of men in the
U.S. might have changed. The average height in 2004
appears greater than in 1990.
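Both parts of the height example, and the check of whether the 1990 mean falls inside each interval, can be sketched as follows. The variable names are hypothetical; the inputs (x̄ = 70, σ = 3, n = 100, 1990 mean of 69) are those given in the example.

```python
import math
from statistics import NormalDist

x_bar, sigma, n = 70.0, 3.0, 100   # sample mean, known SD, sample size
mu_1990 = 69.0                     # 1990 population mean to compare against
se = sigma / math.sqrt(n)          # standard error = 0.3

results = {}
for confidence in (0.95, 0.99):
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lower, upper = x_bar - z * se, x_bar + z * se
    contains_1990 = lower <= mu_1990 <= upper
    results[confidence] = (lower, upper, contains_1990)
    print(f"{confidence:.0%} CI: ({lower:.1f}, {upper:.1f}); "
          f"contains 69? {contains_1990}")
```

The 99% interval is wider than the 95% interval, and 69 inches lies below the lower limit of both.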
Estimation I: 26
Note that the 99% confidence interval is wider than
the 95% CI.
• To have greater confidence that we know the
mean, based upon the same sample, we have
a wider interval of values for µ.
[Figure: number line centered at x̄ = 70, showing the 95% CI (69.4, 70.6) nested inside the 99% CI (69.2, 70.8).]
Estimation I: 27
Hint on the confidence coefficient:
For a (1 − α) C.I.:
use the (1 − α/2) percentile of the N(0,1).
[Figure: standard Normal curve with central area (1 − α) between the α/2 and 1 − (α/2) percentiles, and area α/2 in each tail.]
For example, for a 95% confidence interval, (1 − α) = .95
Thus, α = .05, so that
α/2 = .025
1 − (α/2) = 1 − .025 = .975
Thus, we want the .975 or 97.5th percentile for a 95%
confidence interval.
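The bookkeeping from confidence level to α, to percentile, to coefficient can be written out as a small function. This is a sketch; the function name `confidence_coefficient` is hypothetical, and the percentile lookup uses Python's standard-library `statistics.NormalDist`.

```python
from statistics import NormalDist

def confidence_coefficient(confidence):
    """Return the N(0,1) percentile used as the multiplier for a given confidence level."""
    alpha = 1 - confidence        # excluded area, split between the two tails
    percentile = 1 - alpha / 2    # e.g. 0.975 for a 95% interval
    return NormalDist().inv_cdf(percentile)

for level in (0.90, 0.95, 0.99):
    alpha = 1 - level
    print(f"confidence {level:.2f}: alpha = {alpha:.2f}, "
          f"percentile = {1 - alpha / 2:.3f}, "
          f"coefficient = {confidence_coefficient(level):.3f}")
```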
Estimation I: 28
Another Example:
A random sample of 25 women has a mean systolic
blood pressure = 120 mmHg. Assuming the
underlying distribution of SBP across women is
Normal with σ = 10 mmHg, find the 99% confidence
interval estimate of the unknown true mean, µ.
Solution:
1. Point Estimate: x̄ = 120
2. σ = 10 is known
⇒ σ_x̄ = σ/√n = 10/√25 = 10/5 = 2
Estimation I: 29
3. To get the confidence coefficient:
(1 − α) = 0.99 ⇒ α = .01
⇒ α/2 = .005 ⇒ 1 − (α/2) = .995
⇒ z.995 = 2.576
[Figure: standard Normal curve with central area 0.99 and area .005 in each tail; the upper cutoff is z.995 = 2.576.]
4. Confidence Interval: Pt. Est. ± (Conf. Coeff)(SE)
\[ 99\%\ \text{CI} = \bar{x} \pm z_{.995} \frac{\sigma}{\sqrt{n}} = 120 \pm (2.576)(2) = (114.8, 125.2) \]
With 99% confidence, the mean systolic blood pressure
of the population of women that this sample represents
is between 114.8 and 125.2 mmHg.
Estimation I: 30
Recap:
A confidence interval for a mean has the form:
Confidence Interval Estimate = Point Estimate ± (Confidence Coefficient)(Std Error)
= X̄ ± (Percentile from N(0,1)) × (σ/√n)
This holds when:
1. The data are normally distributed, with known
variance, σ²
2. The data are not normally distributed, but the
sample size n is large, so that the sampling
distribution of the mean is approx. normal
Estimation I: 31
Up to this point we have been looking at
• estimation of an unknown population mean (µ)
• using data from a sample (x̄ and C.I.)
• assuming that we “know” the population
variance, σ².
In reality, we typically have to
• estimate the population variance, using s²
• along with estimating the mean
• from the same sample.
How does this affect confidence interval estimation for
a mean?
Estimation I: 32
1. We know how to calculate a confidence interval
for the mean when σ² is known:
\[ \bar{X} \pm z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \]
What do we do if σ is UNKNOWN?
2. It seems like a reasonable idea to replace σ with “S”
Recall: S is the sample standard deviation
\[ S = \sqrt{S^2}, \quad \text{where } S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
Note that estimation of σ depends upon our
estimate of µ.
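The sample variance formula can be computed directly and compared against a library routine. The data values below are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import math
import statistics

data = [4.0, 7.0, 9.0, 10.0, 15.0]   # hypothetical observations
n = len(data)

# Step 1: estimate the mean, since the deviations are measured from x-bar.
x_bar = sum(data) / n

# Step 2: sum of squared deviations, divided by n - 1 (not n).
s_squared = sum((x - x_bar) ** 2 for x in data) / (n - 1)
s = math.sqrt(s_squared)

print(f"x-bar = {x_bar}, S^2 = {s_squared}, S = {s:.3f}")
print("matches statistics.variance:",
      math.isclose(s_squared, statistics.variance(data)))
```

The deviations are taken from x̄ and not from µ, which is why the estimate of σ depends on the estimate of the mean.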
Estimation I: 33
3. The snag is that we can no longer use the multiplier
z from the Normal distribution. In particular,
\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1) \]
\[ t = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim\ ? \quad (\text{not Normal}) \]
When we replace the true (but unknown) value of the
standard error with an estimate of the standard error:
• Instead of a Z-score, we now have
• a t-statistic:
\[ t = \frac{\bar{X} - \mu}{S/\sqrt{n}} \]
Estimation I: 34
This random variable, t, is said to follow
• a Student’s t-distribution
• with degrees of freedom = n − 1
• IF the underlying data come from a Normal
Distribution!!
That is:
\[ \text{If } X \sim N(\mu, \sigma^2), \text{ then } t = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1} \]
Estimation I: 35
Features of the Student’s t-Distribution
\[ t = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1} \]
[Figure: density curves of N(0,1) and t_{n−1}, both centered at 0.]
The Student’s t distribution is
• Bell-shaped
• Symmetric about zero
• Flatter than the Normal (0,1). This means
  • Variability is greater
  • More area under the tails, less at center
  • Resulting confidence intervals will be wider.
Estimation I: 36
This greater variability or spread of the t-distribution
should make intuitive sense:
• we are using an estimate of the standard error
rather than the true value
⇒ we have added uncertainty in our
confidence interval
Each degree of freedom (df) defines a separate t-distribution
• The greater the df, the closer to the normal
distribution
  • df = n − 1
  • As n gets large, t_{n−1} → N(0,1)
Estimation I: 37
As n gets large, t_{n−1} → N(0,1)
[Figure: density curves of t with df = 5, t with df = 25, and Normal (0,1).]
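The "flatter, with heavier tails" shape and the convergence to N(0,1) can be checked numerically from the density formulas. The sketch below writes out the standard Student's t density by hand, because Python's standard library has no t-distribution; the function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom (standard formula)."""
    # lgamma avoids overflow of the gamma function for large df.
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def normal_pdf(x):
    """Density of the standard Normal, N(0, 1)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

for x in (0.0, 3.0):   # the center of the curve and a point out in the tail
    print(f"x = {x}: t(df=5) = {t_pdf(x, 5):.4f}, "
          f"t(df=25) = {t_pdf(x, 25):.4f}, N(0,1) = {normal_pdf(x):.4f}")
```

At the center the t densities are lower than the Normal density, and in the tail they are higher. The df = 25 curve sits between the df = 5 curve and the Normal curve at both points.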
Estimation I: 38
How to Use the Table of Percentiles of the
Student’s t-distribution
Table 5 (p. 757) in Rosner

d.f.    t.90     t.995
1       3.078    63.657
2       1.886    9.925
...     …

Each row gives information for a separate
t-distribution defined by the df = n − 1.
The column heading tells you which percentile
will be given to you in the body of the table.
The body of the table comprises the values
of the percentile.
Estimation I: 39
[Figure: t-distribution with 1 df. The area to the left of 3.078 is 0.90; 3.078 is the percentile found in the body of the table.]
• From the first row (df = 1), under the column t.90
• Read t_{1; .90} = 3.078
That is,
• Pr(t_{1} ≤ 3.078) = .90
Estimation I: 40
Using Minitab to Get Percentiles of the t-distribution
Calc → Probability Distributions → t…
• Select Inverse Cumulative Prob to get percentile
• Enter df = n − 1
• Enter desired percentile
Estimation I: 41
Using Minitab to Get Percentiles of the t-distribution
Inverse Cumulative Distribution Function
Student's t distribution with 1 DF
P(t ≤ t_{1; 0.90})    t_{1; 0.90}
0.9000                3.0777
[Figure: t-distribution with 1 df, with area .90 to the left of t_{1; .90} = 3.078.]
Estimation I: 42
Computation of a confidence interval for the mean
when the population variance is unknown:
• Replace the confidence coefficient from the
N(0,1)
• with one from t_{n−1}
• use an estimate of the standard error in place of
the true standard error
When σ² known:
\[ \bar{X} \pm z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \]
When σ² UNknown:
\[ \bar{X} \pm t_{n-1;\,1-\alpha/2} \frac{S}{\sqrt{n}} \]
Estimation I: 43
Example
A random sample of n = 20 recent cardiac bypass
surgeries has a mean duration of x̄ = 267 minutes,
and sample variance s² = 36,700 min².
Assuming the underlying distribution of surgery
duration is normal with unknown variance, find the
90% CI estimate of the true mean duration of
surgery, µ.
1. x̄ = 267
2. s = √36700 = 191.6
3. se(x̄) = s/√n = 191.6/√20 = 42.8
Estimation I: 44
4. Find the confidence coefficient from the t-distribution
a. n = 20 ⇒ df = 19
b. For (1 − α) = .90 ⇒ α = .10 ⇒ α/2 = .05
⇒ 1 − α/2 = .95
c. t_{df; 1−α/2} = t_{19; .95} = 1.729
[Figure: t-distribution with 19 df. The central 90% lies between −1.73 and t_{19; .95} = 1.729, with area .05 in each tail; 95% of the area lies below 1.729.]
Look up the value of the 95th percentile.
Estimation I: 45
5. 90% CI = Pt. Est. ± (Conf Coeff)(Std Error Est.)
= x̄ ± (t_{19; .95})(s/√n)
= 267 ± (1.729)(42.8)
= (193.0, 341.0)
A 90% confidence interval for the true mean
duration of surgery is (193.0, 341.0) minutes.
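The steps of the surgery example can be sketched as below. Python's standard library has no t-distribution, so the coefficient 1.729 is entered by hand as the table value t(19; .95) quoted in the example; the variable names are hypothetical.

```python
import math

n = 20                # sample size
x_bar = 267.0         # sample mean duration, minutes
s_squared = 36700.0   # sample variance, minutes squared
t_coeff = 1.729       # t(19; .95), taken from a t table (df = n - 1 = 19)

s = math.sqrt(s_squared)       # sample standard deviation
se = s / math.sqrt(n)          # estimated standard error of the mean
lower = x_bar - t_coeff * se
upper = x_bar + t_coeff * se

print(f"s = {s:.1f}, se = {se:.1f}")
print(f"90% CI: ({lower:.1f}, {upper:.1f}) minutes")
```

The code carries full precision, while the hand calculation rounds the standard error to 42.8, so the last digit of each limit can differ by about 0.1.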
Estimation I: 46
Estimation Highlights / Main Points
1. Estimation provides “guesses” of unknown
population parameter values using information
from a sample
2. While there may be many criteria for the selection
of a “good” estimate, we’ll use two criteria:
1) UNbiasedness (in the long run, correct)
2) Minimum variance (in the short run, the
smallest error possible)
Estimation I: 47
1. We can calculate both point and interval
estimates
2. Confidence interval estimates have the
advantage of providing a sense of the
precision in our data.
• Wide intervals ⇒ poor precision
• Narrow intervals ⇒ high precision
Estimation I: 48
3. The width of the interval is a function of the
confidence coefficient (percentile of a
distribution) and the standard error
• greater confidence ⇒ wider intervals
• larger samples ⇒ more narrow intervals
(as n gets large, σ/√n or s/√n gets smaller)
Estimation I: 49
Computing a Confidence Interval for µ

                               | σ² known                              | σ² NOT known
Point Estimate                 | x̄                                     | x̄
What to use for variance       | σ                                     | s
SE of x̄                        | σ/√n                                  | s/√n
Standardization                | z = (x̄ − µ)/(σ/√n)                    | t = (x̄ − µ)/(s/√n)
Distribution of Standard Score | Normal(0,1)                           | Student’s t, with df = n − 1
Confidence Interval            | x̄ ± z[σ/√n]                           | x̄ ± t[s/√n]
Assumptions                    | Random Sample from Normal, or n large | Random Sample from Normal distribution
Estimation I: 50