Transcript Document

Introduction to Biostatistics (ZJU 2008)
Wenjiang Fu, Ph.D.
Associate Professor
Division of Biostatistics, Department of Epidemiology
Michigan State University
East Lansing, Michigan 48824, USA
Email: [email protected]
www: http://www.msu.edu/~fuw
Homework 1


Correction:
2. Referring to Table 3.4
The three people are:
a 77-year-old man, a 76-year-old woman, and an 82-year-old woman.
Parameter estimation





What we have learned so far:
Random variables.
Distributions of random variables (Bin, Pois, Gaussian).
Calculation of probability based on distributions, including approximation methods.
Application of probability theory (small-probability events).
All the above are based on a known distribution: known type of distribution and
known parameters of the distribution.
Parameter estimation







Distributions of random variables (Bin, Pois, Gaussian).
Calculation of probability based on distributions, including
approximation methods.
Examples: D.B.P. ~ N(80, 12.5²)
# cases of cancer ~ Pois(μ), μ = 6
# lymphocytes ~ B(100, .34)
Calculate probability based on assumptions about the
distribution and the parameters of the distribution.
Application of probability theory (small-probability events).
All the above are based on a known distribution: known type of
distribution and known parameters of the distribution.
Where do we find the info for the parameters?
The only answer is from the data (or samples)!
Parameter estimation



The only answer is from the data (or samples)!
Data set → estimation of parameters and hypothesis testing:
statistical inference.
Estimation:
Point estimation
Interval estimation: confidence interval (C.I.)
Relation between population and sample

Random sample --- a selection of some members of the population such that each
member is independently chosen and has a known non-zero probability.
Example 1: 10 birth weights x1, …, x10 form a sample from the entire
population of birth weights.
Example 2: WBC of 30 students independently selected from MSU, x1, …,
x30, form a sample from the population of WBC of all MSU students.

Simple random sample --- a random sample in which each member has the
same probability of being selected. "Random sample" usually refers to a simple
random sample.
Some non-simple random samples: cluster sampling:
within a state, choose clusters (geographic locations, regions, sparse populations),
then draw random samples within each selected cluster.
The reference, target, or study population is the group we wish to study (to
make inference about). The random sample is selected from the study population
(and is hoped to be a good representation of the study population to draw
conclusions from).
Estimation of the Mean





Estimation of the mean of a distribution: μ = E(X).
A random sample x1, …, xn from the distribution of X.
The natural estimate of μ is the sample mean x̄.
Let x1, …, xn be a random sample drawn from the
same population with mean μ. Then the sample mean satisfies
E(x̄) = μ: it is an unbiased estimator.
An estimator e of a parameter θ is unbiased if E(e) = θ.
Then we know x̄ is an unbiased estimator of μ:
E(x̄) = E( (1/n) Σ xi ) = (1/n) Σ E(xi) = (1/n)(nμ) = μ.
Estimation of the Mean


For normal distribution N (, 2), x is the “best” unbiased
estimator – having the smallest variance and no bias.
Standard Error of the Mean
x1, …, xn a random sample from a underlying distribution with
mean  and variance 2. Then
1 n
1
var( x )  var(  xi )  2
n i 1
n
n
1 2
var(
x
)



i
n
i 1

Standard error of the mean (SEM) is the standard deviation of
the sample mean x, which is equal to  / n Standard error of
the mean is estimated by S / n . S2 sample variance.

I.I.D. (or i.i.d.) – independently identically distributed
x1, …, xn iid r.v.’s – x1, …, xn are indep r.v. with the same
distribution (same mean, variance, quantiles, etc.).
A random sample from a population is iid.

 ( xi  x ) 2
n 1
Estimation of the standard error


Estimation of the standard error (s.e.):
se(x̄) = σ/√n.
n -- sample size, usually known.
σ -- may be known or unknown.
If σ² is unknown, use the sample variance
S² = Σ (xi − x̄)² / (n−1).
Use S to estimate σ and use S/√n to estimate se(x̄).
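
A minimal sketch of this estimate in R (the sample here is hypothetical):

## Estimate se(x-bar) = S/sqrt(n) from a single sample
x <- rnorm(30, mean = 80, sd = 12.5)   ## hypothetical sample, n = 30
S <- sd(x)                             ## sample standard deviation S
S / sqrt(length(x))                    ## estimated standard error of the mean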
Central Limit Theorem



Notation: ^ denotes an estimate of a parameter:
μ̂ -- the estimate of the mean μ; σ̂² -- the estimate of the variance σ².
Central limit theorem for normal r.v.'s:
If x1, …, xn are iid N(μ, σ²), then x̄ ~ N(μ, σ²/n).
In fact, for large n, even for iid r.v.'s x1, …, xn not normally
distributed, the central limit theorem still holds.
Central limit theorem:
If x1, …, xn are indep. r.v.'s with the same mean μ and
variance σ², then for large n
x̄ ~ N(μ, σ²/n) approximately,
i.e. the sample mean is approximately normally distributed
with mean μ and variance σ²/n.
Point estimation: μ̂, with se(x̄) as an estimate of its precision.
Interval estimation: confidence interval (C.I.).
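
A simulation sketch (not from the slides) of the CLT for non-normal data:

## Means of skewed exponential(1) samples (mean 1, variance 1)
## are approximately N(1, 1/n) for large n
B <- 5000; n <- 50
means <- replicate(B, mean(rexp(n, rate = 1)))
hist(means, breaks = 40, freq = FALSE)
curve(dnorm(x, mean = 1, sd = 1/sqrt(n)), add = TRUE, col = 2)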
Interval Estimation


Interval estimation -- known variance.
Assume the population follows a normal distribution N(μ, σ²).
A random sample x1, …, xn has mean x̄ ~ N(μ, σ²/n).
If μ and σ² are known, then we have
Pr( −1.96 ≤ (x̄ − μ)/(σ/√n) ≤ 1.96 ) = .95,
or equivalently Pr( μ − 1.96 σ/√n < x̄ < μ + 1.96 σ/√n ) = .95,
or equivalently
Pr( x̄ − 1.96 σ/√n < μ < x̄ + 1.96 σ/√n ) = .95.
Definition (Confidence Interval)
A 95% confidence interval (C.I.) for μ when σ² is known is
defined by the interval
( x̄ − 1.96 σ/√n , x̄ + 1.96 σ/√n ).
Interpretation: we are 95% confident that the population mean
μ is in the CI ( x̄ − 1.96 σ/√n , x̄ + 1.96 σ/√n ).
Confidence Interval

Note CI (x-1.96/ n , x+1.96/ n ) is random
since it depends on x , which is random and
depends on the random sample. If many
samples are drawn from the population, and
the CI is calculated for each sample, then
over the collection of all 95% CIs that could
be constructed from the repeated random
samples of size n, 95% will contain the
parameter  of population.
95% Confidence Intervals

## plot the standard normal N(0,1) density and construct 95% CIs
## for random samples of size 20 from N(0,1)

#### Plot of density N(0,1)
a <- c(-100:100)/25
plot(a, dnorm(a), type = 'l', ylim = c(-1, .5))
abline(v = 0, col = 2)

## Construct a 95% CI for a random sample of size 20
## and repeat 1000 times
B <- 1000
size <- 20
CImat <- matrix(NA, B, 3)
for (i in 1:B) {
  samp <- rnorm(size, mean = 0, sd = 1)
  samp.mean <- mean(samp)
  ## normal N(0,1) distribution with known variance 1
  CImat[i, 1:2] <- c(samp.mean - 1.96*1/sqrt(size),
                     samp.mean + 1.96*1/sqrt(size))
  ## normal distribution with unknown variance (t-based CI):
  ## CImat[i,1:2] <- c(samp.mean - qt(p=.975, df=size-1)*sd(samp)/sqrt(size),
  ##                   samp.mean + qt(p=.975, df=size-1)*sd(samp)/sqrt(size))
  ## indicator that the CI covers the true mean 0
  CImat[i, 3] <- 1*(CImat[i, 1]*CImat[i, 2] <= 0)
  ## plot a segment for the CI at a random height, cycling colors
  lines(CImat[i, 1:2], rep(runif(1, min = -1, max = 0), 2),
        col = i %% 5 + 1)
}
sum(CImat[, 3]) / B   ## observed coverage proportion
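
Running this sketch, the final proportion sum(CImat[,3])/B should come out
close to .95: about 95% of the simulated intervals cover the true mean 0,
matching the interpretation above.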
Confidence Intervals of Mean



Length of CI: the wider the CI, the less precise the estimate.
CI -- a safeguard: not to make mistakes in estimation.
A very wide CI -- rarely wrong, but useless.
Example. S.B.P.: x̄1 = x̄2 = 150. If σ1² = 100, σ2² = 400, n = 9, then
95% CI = ?
Sample 1: x̄1 ± 1.96 σ1/√9 = 150 ± 1.96 × 10/3 = 150 ± 6.53 =
(143.47, 156.53).
Sample 2: x̄2 ± 1.96 σ2/√9 = 150 ± 1.96 × 20/3 = 150 ± 13.07 =
(136.93, 163.07).
CI at any α-level: factors affecting the length (width) of the CI:
1). n -- sample size: as n increases, the length of the CI decreases (narrower);
2). σ -- standard deviation of the population: as σ increases, the length of the
CI increases (wider);
3). (1−α) -- level of confidence: as (1−α) increases, the length of the CI
increases (wider).
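
A sketch reproducing the two intervals above in R:

## Known-sigma 95% CIs for the two S.B.P. samples
xbar <- 150; n <- 9
for (sigma in c(10, 20))
  print(xbar + c(-1, 1) * 1.96 * sigma / sqrt(n))
## prints (143.47, 156.53) and (136.93, 163.07)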
Confidence Intervals of Mean


CI at any  - level : Using percentile Zu, Pr (X  Zu ) = u for
X  N (0, 1)
Pr (X  -Z1-/2) = Pr (X Z1-/2)=/2 left tail and right tail prob.
Tail probability = Pr (|X|  Z1-/2 ) = 
(1-) x 100% CI for  is (x - Z1-/2  / n , x + Z1-/2  / n )
 can be any level. the most frequently used are .01, .05, .1
Interval estimation -  2 unknown. Using estimate S2 to
estimate  2 for CI.
x
For  2 known:  / n  N (0, 1)
x
For  2 unknown:
 / n  tn-1
t- distribution (Student's t-distribution) (W. Gossett)
 tn-1, a student's t- distribution with (n-1) degrees of freedom (df)
Percentile of t- distribution: td,u of (100x u) %
or Pr (td  td, u ) = u t-distribution table.
Confidence Intervals of Mean


Note that t d  N (0, 1) for very large d :
When d < 30, we see the difference between td, u
and Zu.
When d > 30, the difference is small.
CI of  with  2 unknown:
Estimate  2 by S 2 and change Zu to tn-1,u to
follow similar procedure for  2 known.
(1- )x 100% C.I. for  when  2 is unknown is
( x - tn-1,1-/2 S / n , x + tn-1,1-/2 S / n )
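
A minimal sketch of this t-based interval as an R function (the function
name ci.t is ours):

## (1-alpha) x 100% CI for mu when sigma^2 is unknown
ci.t <- function(x, alpha = .05) {
  n <- length(x)
  mean(x) + c(-1, 1) * qt(1 - alpha/2, df = n - 1) * sd(x) / sqrt(n)
}
## agrees with t.test(x, conf.level = 1 - alpha)$conf.int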
Confidence Interval of Mean



Example. Table 6.9: 27 rats with LVEF measured.
It is known that Σ xi = 6.05 and Σ xi² = 1.522.
Assume a normal distribution. Calculate the mean, S², s.e., and 95% CI.
x̄ = Σ xi / n = 6.05/27 = .224
S² = (Σ xi² − n x̄²)/(n−1) = .0064, S = .08
s.e.(x̄) = S/√27 = .0154
95% CI: x̄ ± t(26,.975) S/√27
= .224 ± 2.056 × .0154
= .224 ± .0317 = (.1923, .2557)
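
The same numbers in R, as a quick check:

## Reproduce the LVEF example from the summary statistics
n <- 27; sx <- 6.05; sx2 <- 1.522
xbar <- sx / n                          ## .224
S2 <- (sx2 - n * xbar^2) / (n - 1)      ## .0064
se <- sqrt(S2 / n)                      ## .0154
xbar + c(-1, 1) * qt(.975, df = n - 1) * se   ## about (.192, .256)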
Estimation of Variance

Point estimation:
the natural estimate is the sample variance
S² = (1/(n−1)) Σ (xi − x̄)².
Is E(S²) = σ² ?
Theorem. If x1, …, xn is a random sample from a population with
mean μ and variance σ², then E(S²) = σ²,
i.e. S² is an unbiased estimator of σ².
If we use the denominator n rather than (n−1) in S² to estimate σ²,
the resulting estimator is σ̃² = [(n−1)/n] S², and
E(σ̃²) = E{ [(n−1)/n] S² } = [(n−1)/n] E(S²)
= [(n−1)/n] σ² < σ²,
i.e. the average of the squared distances from the sample mean is
a biased estimator of σ².
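
A simulation sketch (not from the slides) of this bias:

## Compare the unbiased (n-1) and biased (n) denominators
B <- 10000; n <- 5
v.unbiased <- replicate(B, var(rnorm(n)))   ## var() divides by n-1
v.biased <- v.unbiased * (n - 1) / n        ## same sums divided by n
c(mean(v.unbiased), mean(v.biased))  ## near 1 and (n-1)/n = .8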
Interval estimation of variance




Chi-square distribution:
if G = X1² + … + Xn², where X1, …, Xn are iid N(0, 1),
then G is said to follow a chi-square distribution with
n degrees of freedom.
Denoted G ~ χ²(n); it takes only positive values,
with mean E(χ²(n)) = n.
The u-th percentile of χ²(n), denoted χ²(n,u), satisfies
Pr(χ²(n) < χ²(n,u)) = u and can be obtained from the χ² table.
Distribution of S²:
let x1, …, xn be an iid random sample from N(μ, σ²). Then
(n−1)S²/σ² = (1/σ²) Σ (xi − x̄)² ~ χ²(n−1),
or equivalently
S² ~ [σ²/(n−1)] χ²(n−1).
Confidence Interval of Variance

2 2
  2 2


n

1,

/2
n

1,1


/2
  1
Pr 
 S2 

n 1
n 1 


Similar to the derivation of the CI for , we have

2
2 
(
n

1)
S
(
n

1)
S
2
  1
Pr 
 
2
 2


n1, /2 
 n1,1 /2


(1-) x 100% CI for  2 is

2
(
n

1)
S

,
  2
 n1,1 /2
(n 1)S 2 
 n21, /2
Example. S.B.P. ~ N(μ, σ²).
Sample 1: x̄1 = 150, S1² = 250, n = 5.
Sample 2: x̄2 = 150, S2² = 1700, n = 5.
95% CI for σ²: (1−α) = .95, α = .05.
χ²(n−1,1−α/2) = χ²(4,.975) = 11.14
χ²(n−1,α/2) = χ²(4,.025) = .484
95% CI = [ (n−1)S²/χ²(4,.975) , (n−1)S²/χ²(4,.025) ]
Sample 1: [4 × 250/11.14, 4 × 250/.484]
= [89.77, 2066.12]
Sample 2: [4 × 1700/11.14, 4 × 1700/.484]
= [610.4, 14049.6]
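
Sample 1, reproduced in R with qchisq():

## 95% CI for sigma^2 from sample 1
n <- 5; S2 <- 250
(n - 1) * S2 / qchisq(c(.975, .025), df = n - 1)
## approximately (89.8, 2066)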
Estimation for Bin(n, p)






Example. A random sample of 1000 adults; among them, 30 have had
a heart attack. How do we estimate p?
p̂ = Pr(having had a heart attack), estimated by the relative frequency
= 30/1000 = .03.
Q: Is this a good estimator?
A: X ~ B(n, p). Let X1, …, Xn be indep. Bernoulli trials with
Pr(Xi = 1) = p and Pr(Xi = 0) = 1 − p, 1 ≤ i ≤ n.
Then X = Σ Xi, and X/n = (Σ Xi)/n = X̄,
the sample mean of the trials, with expected value E(Xi) = p.
E(X̄) = E(Σ Xi /n) = p.
So p̂ = X/n is an unbiased estimator of p.
And s.e.(p̂)? It is √(pq/n), derived on the next slide.
Estimation for Bin(n, p)





var(p̂) = var(X/n) = var(X)/n² = npq/n² = pq/n
s.e.(p̂) = √(pq/n), with q = 1 − p.
To estimate s.e.(p̂), replace p with p̂:
s.e.(p̂) ≈ √(p̂q̂/n).
Example: n = 1000, X = 30.
p̂ = X/n = .03
s.e.(p̂) = .00539
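
The same computation in R:

## Point estimate and estimated s.e. for the heart-attack example
n <- 1000; X <- 30
phat <- X / n                    ## .03
sqrt(phat * (1 - phat) / n)      ## .00539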
Interval estimation of Binomial p






Normal theory method:
X ~ B(n, p), so X = Σ Xi with indep.
Bernoulli trials X1, …, Xn.
p̂ = X/n, the sample mean of the Bernoulli trials.
By the central limit theorem (CLT),
p̂ ≈ N(p, pq/n), so the normal distribution can be used.
Condition: npq ≥ 5.
Pr( −1.96 ≤ (p̂ − p)/√(pq/n) ≤ 1.96 ) = .95
Confidence Interval for p





The 95% CI for p with the normal theory method (np̂q̂ ≥ 5) is
( p̂ − 1.96 √(p̂q̂/n) , p̂ + 1.96 √(p̂q̂/n) ).
The (1−α) × 100% CI for p with the normal theory method (np̂q̂ ≥ 5) is
( p̂ − Z(1−α/2) √(p̂q̂/n) , p̂ + Z(1−α/2) √(p̂q̂/n) ).
Example. Eosinophils: p̂ = 2/100 = .02,
np̂(1 − p̂) = 100 × .02 × .98 = 1.96 < 5.
The normal approximation does not work!
Exact method: use Table 7 for the 95% CI.
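
In R, binom.test() gives the exact (Clopper-Pearson) interval, which is what
such tables tabulate; a sketch for this example:

## Exact 95% CI for p when the normal approximation fails
binom.test(x = 2, n = 100)$conf.int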
One-sided CI




Example: hypertensive treatment to lower BP,
comparing a standard drug vs. a new drug.
Suppose that out of 100 hypertensives, the new drug brings 40 subjects'
BP down to normal, while the standard drug has 30% efficacy.
Q: 1). Is the new drug different from the standard?
2). Is the new drug better than the standard?
A: 1). Two-sided: if different, it can be better or worse.
2). One-sided: it can be better or no better.
Upper one-sided (1−α) × 100% CI for p of B(n, p), for np̂q̂ ≥ 5:
p > p̂ − Z(1−α) √(p̂q̂/n).
Lower one-sided (1−α) × 100% CI for p of B(n, p), for np̂q̂ ≥ 5:
p < p̂ + Z(1−α) √(p̂q̂/n).
CI: one-sided vs two-sided

Example. Hypertension study:
100 people receive a drug to treat high BP, and 20 of
them had their BP lowered by the drug. By reference, BP is
also lowered by placebo in 10% of people. Q1: any drug
effect? Q2: is the drug better than placebo?

A: Pr(lowering BP in drug group): p̂ = 20/100 = .2 > .1 of placebo.
Using p̂ = .2, check np̂(1 − p̂) = 100 × .2 × .8 = 16 > 5.
The normal approximation is valid:
p̂ ≈ N(p, pq/n). The 95% CI for p is
p̂ ± 1.96 √(p̂q̂/n) = .2 ± 1.96 × .04 = .2 ± .0784 = (.1216, .2784).
Since p = .1 for placebo and .1 is not in the 95% CI for p,
we are 95% confident that the drug effect is different
from the placebo effect (p = .1).

CI: one-sided vs two-sided

Example. Hypertension study (continued):
100 people receive a drug to treat high BP, and 20 of them
had their BP lowered by the drug. By reference, BP is also lowered
by placebo in 10% of people. Q1: any drug effect? Q2: is the drug
better than placebo?

Q2. Is the drug better than placebo: p > .1?
A: The one-sided 95% CI for p is given by
p > p̂ − Z(1−.05) √(p̂q̂/n) = .2 − 1.645 × .04 = .2 − .0658 = .1342.
One-sided 95% CI: (0.1342, +∞).
Compare with the two-sided CI (.1216, .2784): the one-sided lower
bound .1342 lies between the two-sided limits .1216 and .2784.
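
The one-sided bound in R:

## Lower 95% one-sided confidence bound for the drug example
n <- 100; phat <- 20 / n
phat - qnorm(.95) * sqrt(phat * (1 - phat) / n)   ## about .1342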
Estimation for Poisson Distribution



Pois () .  = t,
 -- intensity
Estimator  =  / t,
estimated by X/t
Instead of estimating the mean, estimate the intensity.
where t can be area, time duration, etc.
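
A tiny sketch with hypothetical numbers (X and t are made up for illustration):

## X = 6 cases observed over t = 2 years of follow-up
X <- 6; t <- 2
lambda.hat <- X / t   ## estimated intensity: 3 cases per year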