Tue Sep 16 - Wharton Statistics Department
Lecture 4 Outline: Tue, Sept 16
• Chapter 1.4.2, Chapter 1.5, additional material on
sampling units and meaningful comparisons
– Review of probability models for randomized
experiments and random samples, probability models
for observational studies
– Graphical methods (histograms, stem-and-leaf
diagrams, boxplots)
– Random assignment and random sampling in JMP
– Sampling units
– Meaningful comparisons (use of control group and use
of rates)
Review
• Hypothesis Testing Examples:
– (i) H_0: There is no causal effect of a treatment on
outcome vs. H_1: there is a causal effect;
– (ii) H_0: Two populations have the same mean vs. H_1:
two populations do not have the same mean
• Statistical inference about hypotheses is based on a
probability model for how the sample (observed data) was
taken.
– P-value: Probability of observing a value of the test
statistic at least as large as the one observed if the null
hypothesis is true; a measure of evidence against H_0
(<.05: moderate evidence against H_0,
<.01: strong evidence against H_0)
P-value for Randomized Experiment
• Additive Potential Outcomes Model: For each
unit, $Y$ = outcome if assigned to group I (control
group), $Y^* = Y + \delta$ = outcome if assigned to group
II (treatment group).
• $\delta$ = causal effect of treatment.
• P-value for testing H_0: $\delta = 0$ vs. H_1: $\delta \neq 0$
– Test statistic: $T = |\bar{Y}_2 - \bar{Y}_1|$
– Calculate T for every possible grouping of the observed
outcomes into groups of size n_1 (size of control group)
and n_2 (size of treatment group).
– The p-value is the proportion of regroupings with T >=
the observed value $T_O = |\bar{Y}_2 - \bar{Y}_1|$ computed
from the actual groups.
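The regrouping procedure above can be sketched in Python. This is a minimal illustration, not the course's JMP workflow: the function name and the six outcome values are made up for the example, and full enumeration is only practical for small groups.

```python
from itertools import combinations

def randomization_p_value(control, treatment):
    """Exact randomization p-value for T = |mean(group II) - mean(group I)|."""
    pooled = list(control) + list(treatment)
    n = len(pooled)
    n1 = len(control)
    n2 = n - n1
    total = sum(pooled)

    def t_stat(sum_group1):
        # |mean of group II - mean of group I| given the sum of group I
        return abs((total - sum_group1) / n2 - sum_group1 / n1)

    observed = t_stat(sum(control))
    at_least_as_large = 0
    regroupings = 0
    # Every way of choosing which n1 outcomes are labelled "control"
    for idx in combinations(range(n), n1):
        regroupings += 1
        if t_stat(sum(pooled[i] for i in idx)) >= observed - 1e-12:
            at_least_as_large += 1
    return at_least_as_large / regroupings

# Hypothetical outcomes, for illustration only
control = [1, 2, 3]
treatment = [4, 5, 6]
p = randomization_p_value(control, treatment)
print(p)
```

With these made-up values there are 20 regroupings, and only the two most extreme ones reach the observed T, so the proportion is 2/20.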
Probability Models
• Probability model for randomized experiment:
Random assignment of units to groups
• Probability model for random sample: Random
sample from each population
• The probability model for an observational study
or nonrandom sample is unknown. We can
assume random assignment or a random sample
but any inference is substantially weaker because
we do not know the real probability model by
which the data was obtained.
Relative Frequency Histograms
• A histogram is a graph that shows the relative frequency
per unit of measurement.
• The areas of blocks represent the percentage of
observations in the blocks.
• The heights of the blocks represent relative frequency per
unit of measurement, i.e., crowding – percentage per unit
of measurement
• Histograms show broad features – particularly the center,
spread and shape of the distribution (symmetric or
skewed, light tailed or heavy tailed).
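The distinction between area and height can be checked with a short Python sketch. The data values and bin edges below are hypothetical, and the helper name is an assumption made for this example.

```python
def density_histogram(values, edges):
    """Return (heights, areas) for bins [edges[i], edges[i+1]).

    The last bin also includes its right edge.
    """
    n = len(values)
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (last and v == edges[-1]):
                counts[i] += 1
                break
    # Area of a block = fraction of observations in that block
    areas = [c / n for c in counts]
    # Height of a block = fraction per unit of measurement (crowding)
    heights = [a / (edges[i + 1] - edges[i]) for i, a in enumerate(areas)]
    return heights, areas

# Hypothetical data and unequal-width bins
values = [1, 2, 2, 3, 4, 7, 8, 12, 15, 19]
edges = [0, 5, 10, 20]
heights, areas = density_histogram(values, edges)
print(areas, heights)
```

In this example the last bin holds more observations than the middle bin (3 versus 2) but has the lower height, because it is twice as wide.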
Histograms in JMP
• Click Analyze, then Distribution
• Click red triangle next to Distributions, stack to
see horizontal layout
• Click tools, hand and click on histogram, drag to
change position of bars.
• To make histograms by group (e.g., sex
discrimination), put Salaries in Y and Sex in By
box.
Stem and leaf diagrams
• Cross between graph and table
• Gives quick idea of distribution
• Shows center, spread and shape as a histogram
does, but also shows exact values, is easy to
construct by hand, and lets the median be computed.
• Stem and leaf plots in JMP
– Click Analyze, Distribution
– Put variable of interest in Y and click OK
– Click red triangle next to variable of interest (e.g.,
salaries) and click Stem and Leaf
– Back to back stem and leaf plots are not available in
JMP but are useful (see page 17)
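For readers working outside JMP, a stem and leaf diagram is simple to build by hand or in code. The sketch below is an assumption-laden illustration: it handles non-negative integers only, uses the tens digit as the stem, and the data are invented.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display lines with tens as stems and units as leaves."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Include empty stems so gaps in the distribution stay visible
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(d) for d in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines

# Hypothetical data
lines = stem_and_leaf([12, 15, 21, 23, 23, 38, 40])
print("\n".join(lines))
```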
Box plots
• Middle 50% of a group of measurements is
represented by a box.
– Line in middle of box is the median
• Various features of the upper and lower 25% are shown
by other symbols
– Each whisker extends to the farthest point that is within
1.5 interquartile ranges of the nearest quartile.
(IQR=third quartile – first quartile)
– Points farther away are shown individually as outliers.
– Width of a box plot is chosen to make the box look
nice; it does not represent any aspect of data.
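The whisker and outlier rule can be sketched as follows. Quartile conventions differ between packages, so the quartiles here (Python's `statistics.quantiles` with the "inclusive" method) may not match JMP's exactly; the data are hypothetical.

```python
from statistics import quantiles

def boxplot_summary(values):
    """Quartiles, whisker ends and outliers under the 1.5 * IQR rule."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers stop at the farthest data points inside the fences
        "whiskers": (min(inside), max(inside)),
        # Points beyond the fences are drawn individually
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }

# Hypothetical data with one extreme value
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 30])
print(summary)
```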
Box plots in JMP
• To draw one box plot
– Click Analyze, Distribution.
• To draw side by side box plots
– Click Analyze, Fit Y by X, putting outcome in
Y and group variable in X
– Click red triangle next to One Way Analysis,
click Display Options and then click Box Plot.
Random Assignment in JMP
• To randomly assign units to two groups of
size n_1 and n_2 in JMP:
– Right click on the top of the random column,
click on formula, click on the random function
and then click on Random Uniform.
– Click on Tables, Sort and then sort by random.
– Create a column named group. Label the first n_1
units in the sorted table as Group I and the rest of the
units as Group II
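The same sort-by-random-number idea can be written in Python. This is a sketch of the procedure rather than of JMP itself; the function name, unit labels and seed are assumptions for the example.

```python
import random

def random_assignment(units, n1, seed=None):
    """Assign n1 units to Group I and the rest to Group II at random."""
    rng = random.Random(seed)
    # Attach a Uniform(0, 1) number to every unit, as in the "random" column
    keyed = [(rng.random(), unit) for unit in units]
    # Sort by the random number
    keyed.sort(key=lambda pair: pair[0])
    # The first n1 units after sorting form Group I
    return {
        unit: ("I" if rank < n1 else "II")
        for rank, (_, unit) in enumerate(keyed)
    }

# Hypothetical unit labels
units = ["u1", "u2", "u3", "u4", "u5", "u6", "u7"]
groups = random_assignment(units, n1=3, seed=1)
print(groups)
```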
Simple Random Sample
• A simple random sample (of size n) is a subset of
a population obtained by a procedure giving all
sets of n distinct items in the population an equal
chance of being chosen.
• Need a frame: a numbered list of all subjects.
• Simple random sample: Generate a random number
for each subject. Choose the subjects with the n smallest
numbers.
• Simple random sample in JMP:
– Click on Tables, Subset, then put the number n in the
box “Sampling Rate or Sample Size.”
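A short Python sketch of the "n smallest random numbers" procedure, with a rough check that every set of n subjects is about equally likely. The four-subject frame, the seed and the number of repetitions are assumptions chosen for illustration.

```python
import random
from collections import Counter

def simple_random_sample(frame, n, rng):
    """Draw n distinct subjects by keeping the n smallest random numbers."""
    keys = {subject: rng.random() for subject in frame}
    return sorted(frame, key=keys.get)[:n]

rng = random.Random(0)
frame = ["A", "B", "C", "D"]  # hypothetical frame of four subjects

# Repeat the draw many times and count how often each pair is chosen
counts = Counter(
    frozenset(simple_random_sample(frame, 2, rng)) for _ in range(6000)
)
for pair, count in sorted(counts.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pair), count)
```

There are 6 possible pairs, so each should appear roughly 1000 times. In practice `random.sample(frame, n)` from the standard library draws a simple random sample directly.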
Sampling units
• In conducting a random sample, it is important
that we are randomly sampling the units of
interest. Otherwise we may create a selection
bias.
• Sampling families
– If we want mean number of children per family, we
should sample by family and need to make correction if
sampling by person
– If we want to know mean opinion about building new
school in a community, have available a frame of
families and plan to sample one person per family, we
need to use variable probability sampling, giving a
larger probability of being sampled to larger families.
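The selection bias from sampling the wrong unit can be shown with exact arithmetic. The sketch below makes simplifying assumptions that are not in the lecture: the "persons" sampled are children only, every family has at least one child, and the family sizes are invented. The correction shown (weighting each response by 1 over the family size) is one standard choice, not necessarily the one used in the course.

```python
from fractions import Fraction

# Hypothetical number of children in each of five families
children_per_family = [1, 1, 2, 3, 5]

# Sampling by family: each family counts once
family_mean = Fraction(sum(children_per_family), len(children_per_family))

# Sampling by child: a family with k children is reported k times
responses = [k for k in children_per_family for _ in range(k)]
naive_mean = Fraction(sum(responses), len(responses))

# Correction: weight each child's response by 1/k to undo the over-counting
weights = [Fraction(1, k) for k in responses]
corrected_mean = sum(w * k for w, k in zip(weights, responses)) / sum(weights)

print(family_mean, naive_mean, corrected_mean)
```

Here the per-family mean is 12/5, the uncorrected per-child mean is 10/3, and the weighted mean recovers 12/5.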
The clinician’s illusion
• For several diseases such as schizophrenia,
alcoholism and opiate addiction, clinicians think
that the long-term prognosis is much worse than
do researchers.
• Part of the disagreement may arise from differences in
the populations they sample
– Clinicians: “Prevalence” sample – sample from the
population currently suffering from the disease, which
contains a disproportionate number of people who have
suffered from the disease for a long time
– Researchers: “Incidence” sample – sample from the
population of people who have ever contracted the disease.
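A small calculation shows how large the gap between the two samples can be. It assumes a steady state with a constant number of new cases each year and fixed durations; all numbers are hypothetical.

```python
def long_case_shares(new_short, new_long, short_years, long_years):
    """Share of long-duration cases in an incidence vs. a prevalence sample.

    Assumes a steady state: the same number of new cases every year,
    each lasting a fixed number of years.
    """
    # Incidence sample: everyone who ever contracts the disease
    incidence_share = new_long / (new_short + new_long)
    # Prevalence sample: cases currently ill = new cases per year * duration
    prevalent_short = new_short * short_years
    prevalent_long = new_long * long_years
    prevalence_share = prevalent_long / (prevalent_short + prevalent_long)
    return incidence_share, prevalence_share

# Hypothetical: each year 9 cases last 1 year and 1 case lasts 20 years
incidence_share, prevalence_share = long_case_shares(9, 1, 1, 20)
print(incidence_share, prevalence_share)
```

Under these assumptions only 10% of people who ever contract the disease have the long form, but about 69% (20 of 29) of those currently ill do.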
Meaningful Comparisons
• Main lesson of chapter: The best way to compare
two (or more) groups is to do a randomized
experiment or take a random sample. This avoids
systematic bias due to confounding variables and
selection bias.
• But if this is not possible, we should generally try
to make the groups as “comparable” as possible by
adjusting for known confounding variables and
selection biases. Often, important first steps
are to use an appropriate control group and to
compare the appropriate rates rather than absolute
numbers.
Control Group
• In a randomized experiment, we want the treatment and
control group to be similar in every way except that one
takes the treatment and the other doesn’t, i.e., we use
placebo and double blinding.
• Similarly in an observational study, we want to compare
the treatment group to a control group that is as similar as
possible.
• Explain the need for a control group by criticizing the
statement “A study on the benefits of vitamin C showed
that 90% of the people suffering from a cold who take
vitamin C get over their cold within a week”
Use of Rates
• An article in This Week magazine says that if you
went “hurtling down the highway at 70 miles an
hour, careening from side to side,” you would
have four times as good a chance of staying alive
if the time were seven in the morning than seven
at night.
• The evidence: “Four times more fatalities occur on
the highways at 7 p.m. than 7 a.m.”
• Does the conclusion follow from the evidence?
• More accidents occur in clear weather than foggy
weather. Is clear weather safer to drive in?
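One way to approach these questions is to divide counts by a measure of exposure. The sketch below uses vehicle-miles as the exposure measure and entirely invented numbers; it shows only that a fourfold difference in counts is compatible with identical rates.

```python
def fatality_rate(fatalities, vehicle_miles):
    """Fatalities per vehicle-mile driven (a rate, not a count)."""
    return fatalities / vehicle_miles

# Hypothetical figures: four times the fatalities AND four times the traffic
am_rate = fatality_rate(fatalities=10, vehicle_miles=1_000_000)
pm_rate = fatality_rate(fatalities=40, vehicle_miles=4_000_000)

print(am_rate * 1_000_000, pm_rate * 1_000_000)  # per million vehicle-miles
```

With these made-up values the count at 7 p.m. is four times the count at 7 a.m., yet the risk per mile driven is the same.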
Polio Example
• Using figure 1 as an example, explain why a
contemporaneous control group is needed in experiments
where the effectiveness of a drug or vaccine is being
tested.
• Comment on the use of the number of cases. What would
be a more appropriate indicator of whether polio incidence
was increasing?