Thu Sep 18 - Wharton Statistics Department

Lecture 5 Outline: Thu, Sept 18
• Announcement: No office hours on Tuesday, Sept.
23rd after class. Extra office hour: Tuesday, Sept.
23rd from 12-1 p.m.
• Chapter 1.5.4 (additional material on sampling
units), 2.1.2, 2.2
– Sampling frame and sampling units
– Paired t-test
• Sampling distribution of sample average
• t-ratio and t-test
• Confidence intervals
Notes on Box Plots
• Dotted lines extend to the largest (and smallest)
points in data that are within 1.5 IQRs of the third
(first) quartile. All other points are marked by
dots.
• The red bracket on the side of the box plot shows
the shortest half of data (shortest interval
containing half the data). The shortest half is at
the center for symmetric distributions, but off-center for non-symmetric ones.
Simple Random Sample
• A simple random sample (of size n) is a subset of
a population obtained by a procedure giving all
sets of n distinct items in the population an equal
chance of being chosen.
• Need a frame: a numbered list of all subjects.
• Simple random sample: Generate random number
for each subject. Choose subjects with n smallest
numbers.
• Simple random sample in JMP:
– Click on Tables, Subset, then put the number n in the
box “Sampling Rate or Sample Size.”
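The random-number procedure above can be sketched in Python. This is only an illustration, not what JMP does internally; the function name `simple_random_sample` and the frame of 100 numbered subjects are assumptions made for the example.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw a simple random sample of size n from a frame (a list of subjects)."""
    rng = random.Random(seed)
    # Attach an independent uniform random number to every subject in the frame.
    keyed = [(rng.random(), subject) for subject in frame]
    # Keep the subjects with the n smallest random numbers.
    keyed.sort(key=lambda pair: pair[0])
    return [subject for _, subject in keyed[:n]]

frame = list(range(1, 101))  # hypothetical frame: subjects numbered 1..100
sample = simple_random_sample(frame, n=10, seed=0)
print(sample)
```

Because the random numbers are independent and identically distributed, every set of n distinct subjects is equally likely to end up with the n smallest values.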
Sampling units
• In conducting a random sample, it is important that we are
randomly sampling the units of interest. Otherwise we
may create a selection bias.
• Sampling families
– If we want mean number of children per family, we
should either
• Sample by family
• Sample by person but downweight kids from large families.
– Suppose we want to know mean level of radiation in
community and have available a frame of housing lots
in the community. We need to use variable probability
sampling, giving a larger probability of being sampled
to larger lots.
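A small worked example of the family case may help. The family sizes below are made up for illustration, and the sketch assumes every family has at least one child (a childless family can never be reached by sampling children).

```python
# Hypothetical population: number of children in each of 6 families.
children_per_family = [1, 1, 2, 2, 3, 9]

# Sampling by family: every family is equally likely, so the target is the plain mean.
mean_by_family = sum(children_per_family) / len(children_per_family)

# Sampling by child: a family with k children appears k times in the frame of children,
# so each child reports a family size of k.
reports = [k for k in children_per_family for _ in range(k)]
naive_mean_by_child = sum(reports) / len(reports)

# Downweight each child by 1/k so that every family carries total weight 1.
weights = [1 / k for k in reports]
weighted_mean_by_child = sum(w * k for w, k in zip(weights, reports)) / sum(weights)

print(mean_by_family, round(naive_mean_by_child, 2), round(weighted_mean_by_child, 2))
```

The unweighted mean over children overstates the mean family size because large families are over-represented; the 1/k weights remove that selection bias.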
The clinician’s illusion
• For several diseases such as schizophrenia,
alcoholism and opiate addiction, clinicians think
that the long-term prognosis is much worse than
do researchers.
• Part of disagreement may arise from differences in
the population they sample
– Clinicians: “Prevalence” sample – sample from
population currently suffering disease which contains a
disproportionate number of people suffering disease for
long time
– Researchers: “Incidence” sample – sample from
population who has ever contracted the disease.
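The difference between the two samples can be illustrated with made-up numbers. The cohort below is hypothetical, and the sketch assumes a steady state in which the chance of being ill on any given day is proportional to the duration of illness.

```python
# Hypothetical cohort of 100 people who ever contracted a disease:
# 80 recover after 1 year, 20 remain ill for 20 years.
durations = [1] * 80 + [20] * 20

# Incidence sample: everyone who ever contracted the disease is equally likely.
incidence_share_long = sum(d == 20 for d in durations) / len(durations)

# Prevalence sample: the chance of being currently ill is proportional to duration,
# so long-term cases are weighted by their 20 years of illness.
prevalence_share_long = sum(d for d in durations if d == 20) / sum(durations)

print(incidence_share_long, round(prevalence_share_long, 3))
```

Under these assumptions only 20% of all cases are long-term, yet about 83% of the patients a clinician sees at a given time are long-term cases.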
Case Study 2.1.2
• Broad Question: Are any physiological indicators
associated with schizophrenia? Early studies
suggested certain areas of brain may be different
in persons with schizophrenia than in others but
confounding factors clouded the issue.
• Specific Question: Is the left hippocampus region
of brain smaller in people with schizophrenia?
• Research design: Sample pairs of monozygotic
twins, where one twin was schizophrenic and the
other was not. Comparing monozygotic twins controls
for genetic and socioeconomic differences.
Case Study 2.1.2 Cont.
• The mean difference (unaffected-affected) in volume of
left hippocampus region between 15 pairs is 0.199. Is this
larger than could be explained by “chance”?
• Probability (chance) model: Random sampling (fictitious)
from a single population.
• Scope of inference
– Goal is to make inference about population mean but
inference to larger population is questionable because
we did not take a random sample.
– No causal inference can be made. In fact researchers
had no theories about whether abnormalities preceded
the disease or resulted from it.
Probability Model
• Goal is to compare two groups (affecteds and
unaffecteds) but we have taken a paired sample.
We can think of having one population (pairs of
twins) and looking at the mean of one variable, the
difference in hippocampus volumes in each pair.
• Probability model: Simple random sample with
replacement from population. For a large
population, this is essentially equivalent to a
simple random sample without replacement.
Parameters and Statistics
• Population parameters ($\mu$, $\sigma$)
– $\mu$ = population mean
– $\sigma^2$ = population variance = average size of $(Y - \mu)^2$ in population
• Hypotheses: $H_0: \mu = 0$, $H_1: \mu \neq 0$
• Sample statistics ($\bar{Y}$, $s$)
– Sample: $Y_1, \ldots, Y_n$
– $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$ = sample mean
– $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$ = sample variance
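The sample mean and sample variance can be computed directly from their definitions. The sketch below uses a made-up sample of four differences (not the study data) and checks the hand-written variance against Python's `statistics.variance`, which also divides by n - 1.

```python
import statistics

def sample_mean(y):
    """Sample mean: sum of the observations divided by n."""
    return sum(y) / len(y)

def sample_variance(y):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    ybar = sample_mean(y)
    return sum((yi - ybar) ** 2 for yi in y) / (len(y) - 1)

y = [0.1, 0.3, -0.2, 0.4]  # hypothetical sample, for illustration only
print(sample_mean(y), sample_variance(y), statistics.variance(y))
```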
Sampling distribution of sample mean
• See Displays 2.3 and 2.4
• Standard deviation of $\bar{Y}$:
$S.D.(\bar{Y}) = \sigma / \sqrt{n}$
• Standard error of $\bar{Y}$:
– $S.E.(\bar{Y}) = s / \sqrt{n}$
– Estimated standard deviation of the sampling
distribution of $\bar{Y}$
– For schizophrenia study (n = 15),
$\bar{Y} = .199$, $s = .238$
$S.E.(\bar{Y}) = .238 / \sqrt{15} \approx .062$
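As a quick check of the arithmetic, the standard error can be computed from the summary statistics reported for the study (s = .238, with the sample size of 15 pairs). The helper name `standard_error` is an assumption made for this sketch.

```python
import math

def standard_error(s, n):
    """Standard error of the sample mean: s divided by the square root of n."""
    return s / math.sqrt(n)

se = standard_error(0.238, 15)
print(round(se, 4))  # 0.0615
```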
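Test Statistics
• Z-ratio
– For a general parameter: $\frac{\text{Est.} - \text{true param.}}{S.D.(\text{Est.})}$
– For 1-group: $\frac{\bar{Y} - \mu}{S.D.(\bar{Y})} = \frac{\bar{Y} - \mu}{\sigma / \sqrt{n}}$
• t-ratio
– For a general parameter: $\frac{\text{Est.} - \text{true param.}}{S.E.(\text{Est.})}$
– For 1-group: $\frac{\bar{Y} - \mu}{S.E.(\bar{Y})} = \frac{\bar{Y} - \mu}{s / \sqrt{n}}$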
Distribution of test statistics
• Facts from statistical theory: If* the population
distribution of Y is normal, then the sampling
distribution of
– (i) the z-ratio is standard normal
– (ii) the t-ratio is Student’s t on n-1 degrees of freedom
– * = We will study the “if” part later; for now we will
assume it is true
• See Display 2.5
Testing a hypothesis about $\mu$
• $H_0: \mu = 0$, $H_1: \mu \neq 0$
• Could the difference of $\bar{Y}$ from $\mu^*$ (the
hypothesized value for $\mu$, = 0 here) be due
to chance (in random sampling)?
• Test statistic: $|t| = \frac{|\bar{Y} - \mu^*|}{SE(\bar{Y})}$
• If H0 is true, then t equals the t-ratio and has
the Student’s t-distribution with n-1 degrees
of freedom
P-value
• The (2-sided) p-value is the proportion of
random samples with absolute value of t
ratios >= observed test statistic (|t|)
• Schizophrenia example: t = 3.23
[Figure: JMP output plot; horizontal axis X from -0.4 to 0.4, vertical axis Y from 0 to 8; Sample Size = 15]
Estim Mean 0.1986666667
Hypoth Mean 0
T Ratio 3.2289280811
P Value 0.0060615436
Schizophrenia Example
• p-value (2-sided, paired t-test) = .006
• So either,
– (i) the null hypothesis is incorrect OR
– (ii) the null hypothesis is correct and we happened to
get a particularly unusual sample (only 6 out of 1000
are as unusual)
• Strong evidence against $H_0: \mu = 0$
• One-sided test: $H_0: \mu = 0$, $H_1: \mu > 0$
– Test statistic: $t = \frac{\bar{Y} - 0}{s / \sqrt{n}}$
– For schizophrenia example, t = 3.23, p-value (1-sided)
= .003
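The definition of the p-value as a proportion of random samples can be checked by simulation. The sketch below is an illustration under stated assumptions, not the JMP calculation: it draws many samples of size 15 from a normal population in which the null hypothesis (mean 0) is true, and counts how often the t-ratio is at least as extreme as the observed one. The observed value is recomputed from the rounded summary statistics (.199 and .238), so it comes out slightly above the 3.23 reported by JMP. The seed and the number of repetitions are arbitrary choices.

```python
import math
import random

def t_ratio(sample, mu0=0.0):
    """One-sample t-ratio: (sample mean - mu0) / (s / sqrt(n))."""
    n = len(sample)
    ybar = sum(sample) / n
    s2 = sum((y - ybar) ** 2 for y in sample) / (n - 1)
    return (ybar - mu0) / math.sqrt(s2 / n)

n = 15
t_obs = 0.199 / (0.238 / math.sqrt(n))  # observed t-ratio from rounded summaries

rng = random.Random(1)  # fixed seed so the run is reproducible
reps = 20000
two_sided = 0
one_sided = 0
for _ in range(reps):
    # Sample from a normal population with mean 0; the population S.D. (1 here)
    # does not matter because the t-ratio is unchanged by rescaling the data.
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    t = t_ratio(sample)
    two_sided += abs(t) >= t_obs
    one_sided += t >= t_obs

p_two = two_sided / reps
p_one = one_sided / reps
print(round(t_obs, 2), p_two, p_one)
```

The simulated proportions should land near the reported p-values of .006 (2-sided) and .003 (1-sided), up to Monte Carlo error.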
Matched pairs t-test in JMP
• Click Analyze, Matched Pairs, put two
columns (e.g., affected and unaffected) into
Y, Paired Response.
• Can also use a one-sample t-test on the differences.
Click Analyze, Distribution, put the difference column
into Y, Columns. Then click the red triangle under
the difference and click Test Mean.
Confidence Interval for $\mu$
• A confidence interval is a range of “plausible
values” for a statistical parameter (e.g., the
population mean) based on the data. It conveys
the precision of the sample mean as an estimate of
the population mean.
• A confidence interval typically takes the form:
point estimate ± margin of error
• The margin of error depends on two factors:
– Standard error of the estimate
– Degree of “confidence” we want.
CI for population mean
• If the population distribution of Y is normal
(* we will study the “if” part later), 95% CI for
mean of single population:
$\bar{Y} \pm t_{n-1}(.975) \times SE(\bar{Y}) = \bar{Y} \pm t_{n-1}(.975) \times \frac{s}{\sqrt{n}}$
• For schizophrenia data:
0.199 cm³ ± 2.145 × 0.0615 cm³ =
0.067 cm³ to 0.331 cm³
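The interval for the schizophrenia data can be reproduced from the summary statistics. In this sketch the t quantile 2.145 (the .975 quantile on 14 degrees of freedom) is taken from the slide rather than computed, and the function name `t_confidence_interval` is an assumption made for the example.

```python
import math

def t_confidence_interval(ybar, s, n, t_quantile):
    """CI for a population mean: ybar +/- t_quantile * s / sqrt(n)."""
    margin = t_quantile * s / math.sqrt(n)
    return ybar - margin, ybar + margin

# Summary statistics from the study: mean .199, s .238, n 15.
lower, upper = t_confidence_interval(0.199, 0.238, 15, 2.145)
print(f"{lower:.3f} to {upper:.3f}")  # 0.067 to 0.331
```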
Interpretation of CIs
• A 95% confidence interval will contain the true
parameter (e.g., the population mean) 95% of the
time if repeated random samples are taken.
• It is impossible to say whether it is successful or
not in any particular case, i.e., we know that the CI
will usually contain the true mean under random
sampling but we do not know for the
schizophrenia data if the CI (0.067 cm³, 0.331 cm³)
contains the true mean difference.
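The "95% of the time" statement can be illustrated by simulation. The population mean and standard deviation below are hypothetical values chosen for the sketch (in a real study they are unknown), and the t quantile 2.145 for n = 15 is the one used in the interval for the schizophrenia data.

```python
import math
import random

def covers(sample, true_mean, t_quantile):
    """Return True if the t-based CI from this sample contains true_mean."""
    n = len(sample)
    ybar = sum(sample) / n
    s = math.sqrt(sum((y - ybar) ** 2 for y in sample) / (n - 1))
    margin = t_quantile * s / math.sqrt(n)
    return ybar - margin <= true_mean <= ybar + margin

rng = random.Random(2)                   # fixed seed so the run is reproducible
true_mean, true_sd, n = 0.2, 0.25, 15    # hypothetical normal population
reps = 5000
hits = sum(
    covers([rng.gauss(true_mean, true_sd) for _ in range(n)], true_mean, 2.145)
    for _ in range(reps)
)
coverage = hits / reps
print(coverage)
```

Across repeated samples the proportion of intervals containing the true mean should be close to 0.95, even though no single interval reveals whether it is one of the successes.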
Confidence Intervals in JMP
• For both methods of doing paired t-test
(Analyze, Matched Pairs or Analyze,
Distribution), the 95% confidence intervals
for the mean are shown on the output.