HSRP 734:
Advanced Statistical Methods
May 22, 2008
Course Website
• Course site in Public Health
Sciences (PHS) website:
http://www.phs.wfubmc.edu/public/edu_statMeth.cfm
Course Syllabus
HSRP 734:
Advanced Statistical Methods
• Categorical Data Analysis
• Logistic Regression
• Survival analysis
• Cox PH regression
What is Categorical Data Analysis?
• Statistical analysis of data that are noncontinuous
• Includes dichotomous, ordinal, nominal and
count outcomes
• Examples: Disease incidence, Tumor
response
What is Logistic Regression?
A statistical method used to model
outcomes, most often dichotomous (binary)
ones, using predictor variables.
What is Logistic Regression?
• Used when the research question is
whether or not an event occurred,
rather than when it occurred
• Time course information is not used
Logistic Regression quantifies
“effects” using Odds Ratios
• Does not model the outcome directly; doing so
would lead to effect estimates quantified by means
(i.e., differences in means)
• Estimates of effect are instead quantified by
“Odds Ratios”
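A minimal Python sketch of how an effect on the log-odds scale becomes an odds ratio. The coefficient values `beta0` and `beta1` are made up for illustration and do not come from the course; the helper names `logit` and `inv_logit` are also hypothetical.

```python
import math

def logit(p):
    """Log-odds of a probability p (0 < p < 1)."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Probability corresponding to a log-odds value x."""
    return 1 / (1 + math.exp(-x))

# Hypothetical coefficients for a model with one binary predictor X
beta0 = -2.0   # log-odds of the outcome when X = 0
beta1 = 0.7    # change in log-odds when X goes from 0 to 1

p_unexposed = inv_logit(beta0)
p_exposed = inv_logit(beta0 + beta1)

odds_unexposed = p_unexposed / (1 - p_unexposed)
odds_exposed = p_exposed / (1 - p_exposed)

# The ratio of the two odds equals exp(beta1)
odds_ratio = odds_exposed / odds_unexposed
print(round(odds_ratio, 3), round(math.exp(beta1), 3))
```

The two printed numbers agree, which is why a coefficient on the log-odds scale is usually reported as exp(coefficient).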
The Logistic Regression Model
$$\ln\left(\frac{P(Y)}{1-P(Y)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_K X_K$$
Y: dichotomous outcome
X_1, ..., X_K: predictor variables
$\ln\left(\frac{P(Y)}{1-P(Y)}\right)$ is the log(odds) of the outcome.
The Logistic Regression Model
$$\ln\left(\frac{P(Y)}{1-P(Y)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_K X_K$$
\beta_0: intercept
\beta_1, ..., \beta_K: model coefficients
$\ln\left(\frac{P(Y)}{1-P(Y)}\right)$ is the log(odds) of the outcome.
A Short Review
Philosophy of Science
• Idea: We posit a paradigm and attempt to
falsify that paradigm.
• Science progresses faster via attempting to
falsify a paradigm than attempting to
corroborate a paradigm.
(Thomas S. Kuhn. 1970. The Structure of
Scientific Revolutions. University of Chicago
Press.)
Philosophy of Science
• The fastest way to progress in science under
this paradigm of falsification is through
perturbation experiments.
• In epidemiology,
– often unable to do perturbation experiments
– it becomes a process of accumulating
evidence
• Statistical testing provides a rigorous data-driven
framework for falsifying hypotheses
The P-Value
• What is the probability of having gotten a sample
mean as extreme as 4.8 if the null hypothesis were
true (H0: μ = 0)?
• P-value = probability of obtaining a result as or
more “extreme” than observed if H0 were true.
• Consider for the above example, if p = 0.0089
(less than a 9 out of 1,000 chance)
• What if p = 0.0501 (5 out of 100 chance)?
Hypothesis Testing
1. Set up a null and alternative hypothesis
2. Calculate test statistic
3. Calculate the p-value for the test
statistic
4. Based on p-value make a decision to
reject or fail to reject the null hypothesis
5. Make your conclusion
Hypothesis Testing
Your decision vs. Truth

                              Truth: H0 True        Truth: H0 False
Decision: Fail to reject H0   Correct Decision      Incorrect Decision
                                                    Type II Error (β)
Decision: Reject H0           Incorrect Decision    Correct Decision
                              Type I Error (α)      (Power)
Hypothesis Testing
• Type I error (α) = the probability of rejecting the null
hypothesis given that H0 is true (the significance level of a
test).
• Type II error (β) = the probability of not rejecting the null
hypothesis given that H0 is false (not rejecting when you
should have).
• Power = 1 - β
Power
• The power of a test is:
The probability of rejecting a false null
hypothesis under certain assumed
differences between the populations.
• We like a study that has “high” power
(usually at least 80%).
• Any difference can become
significant if N is large enough
• Even if there is statistical significance
is there clinical significance?
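The point that any difference can become significant with a large enough N can be illustrated with a small Python sketch. It assumes a one-sample z-test with known standard deviation and treats the observed difference as exactly equal to a fixed, hypothetical value of 0.1 SD; none of these numbers come from the lecture.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical setting: observed mean difference of 0.1 with known SD of 1
difference, sd = 0.1, 1.0

p_values = {}
for n in (100, 1_000, 10_000):
    z = difference / (sd / math.sqrt(n))   # one-sample z statistic
    p_values[n] = two_sided_p_from_z(z)
    print(n, round(z, 2), p_values[n])
```

The difference stays the same size throughout; only the sample size changes, yet the p-value shrinks from clearly non-significant to extremely small.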
Controversy around HT and p-value
“A methodological culprit responsible for spurious
theoretical conclusions”
(Meehl, 1967; see Greenwald et al, 1996)
“The p-value is a measure of the credibility of the null
hypothesis. The smaller the p-value is, the less
likely one feels the null hypothesis can be true.”
HT and p-value
• “It cannot be denied that many journal editors
and investigators use p-value < 0.05 as a
yardstick for the publishability of a result.”
• “This is unfortunate because not only p-value, but
also the sample size and magnitude of a
physically important difference determine the
quality of an experimental finding.”
HT and p-value
• Consider a new cancer drug that possibly
shows significant improvements.
• Should we consider a p = 0.01 the same as a
p = 0.00001 ?
HT and p-value
• “[We] endorse the reporting of estimation
statistics (such as effect sizes, variabilities,
and confidence intervals) for all important
hypothesis tests.”
– Greenwald et al (1996)
Reporting Statistics
• Reporting I. Statistical Methods
The changes in blood pressure after oral
contraceptive use were calculated for 10
women. A paired t-test was used to
determine if there was a significant change
in blood pressure and a 95% confidence
interval was calculated for the mean blood pressure
change (after-before).
Reporting Statistics
• Reporting II. Results
Blood pressure measurements increased on
average 4.8 mmHg with standard deviation of
4.57. The 95% confidence interval for the
mean change was (1.53, 8.07).
There was evidence that blood pressure
measurements after oral contraceptive use
were significantly higher than before oral
contraceptive use (p = 0.009).
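The reported numbers can be checked from the summary statistics alone. The sketch below is one way to do it in plain Python, assuming n = 10, mean change 4.8 and SD 4.57 as stated; the critical value 2.262 is the tabulated 97.5th percentile of the t distribution with 9 degrees of freedom, and the p-value is obtained by numerically integrating the t density (the helper names are hypothetical).

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|]; steps must be even."""
    b = abs(t_stat)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3          # Pr(0 < T < |t|)
    return 2 * (0.5 - central)

n, mean_change, sd_change = 10, 4.8, 4.57
se = sd_change / math.sqrt(n)        # standard error of the mean change
t_stat = mean_change / se            # paired t statistic for H0: mean change = 0

T_CRIT = 2.262                       # tabulated t quantile (0.975, df = 9)
ci = (mean_change - T_CRIT * se, mean_change + T_CRIT * se)
p_value = t_two_sided_p(t_stat, n - 1)

print(round(t_stat, 2), tuple(round(x, 2) for x in ci), round(p_value, 4))
```

With the raw before and after measurements available, a statistics package's paired t-test would normally be used instead of this hand calculation.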
HSRP 734
Lecture 1:
Measures of Disease
Occurrence and Association
Objectives:
1.Define and compute the measures of
disease occurrence and association
2.Discuss differences in study design and their
implications for inference
Example
CT images rated by radiologist (Rosner p.65)

            Rated as   Rated as       Rated as
            normal     questionable   abnormal
Normal      39         6              13
Abnormal    5          2              44

Each cell lists: count, (Cell %), Row %, Col %

            Rated as   Rated as       Rated as
            normal     questionable   abnormal   Total
Normal      39         6              13         58
            (35.8%)    (5.5%)         (11.9%)
            67%        10.3%          22.4%
            88.6%      75%            22.8%
Abnormal    5          2              44         51
            (4.6%)     (1.8%)         (40.4%)
            9.8%       3.9%           86.3%
            11.4%      25%            77.2%
Total       44         8              57         109
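As a rough check, the cell, row, and column percentages for the CT rating example can be recomputed from the six counts. This is a minimal Python sketch; the dictionary layout and the helper `pct` are illustrative choices, not part of the course material.

```python
counts = {
    "Normal":   {"normal": 39, "questionable": 6, "abnormal": 13},
    "Abnormal": {"normal": 5,  "questionable": 2, "abnormal": 44},
}
ratings = ["normal", "questionable", "abnormal"]

grand_total = sum(sum(row.values()) for row in counts.values())
row_totals = {status: sum(row.values()) for status, row in counts.items()}
col_totals = {r: sum(counts[s][r] for s in counts) for r in ratings}

def pct(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)

for status, row in counts.items():
    for rating in ratings:
        n = row[rating]
        print(status, rating, n,
              pct(n, grand_total),          # cell %
              pct(n, row_totals[status]),   # row %
              pct(n, col_totals[rating]))   # column %
```

The three percentages answer different questions because each uses a different denominator: all 109 images, the images in that row, or the images in that column.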
Basic Probability
• Conditional probability
– Restrict yourself to a “subspace” of the sample
space
         Male   Female
Young    20%    10%
Old      35%    35%
Conditional probabilities
• Probability that something occurs (event B),
given that event A has occurred (conditioning
on A)
• Pr(B given that A is true) = Pr(B | A)
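A short Python sketch of conditioning, using the Young/Old by Male/Female percentages from the earlier table. It assumes those four percentages are joint probabilities (they sum to 100%); the function name `conditional` is hypothetical.

```python
joint = {
    ("Young", "Male"): 0.20, ("Young", "Female"): 0.10,
    ("Old", "Male"): 0.35,   ("Old", "Female"): 0.35,
}

def conditional(joint, age, sex):
    """Pr(age | sex) = Pr(age and sex) / Pr(sex)."""
    pr_sex = sum(p for (a, s), p in joint.items() if s == sex)
    return joint[(age, sex)] / pr_sex

pr_young_given_male = conditional(joint, "Young", "Male")      # 0.20 / 0.55
pr_young_given_female = conditional(joint, "Young", "Female")  # 0.10 / 0.45
print(round(pr_young_given_male, 3), round(pr_young_given_female, 3))
```

Restricting to males or to females is the "subspace" idea: the denominator becomes the probability of the conditioning event rather than 1.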
Conditional probabilities
• Categorical data analysis
• odds ratio = ratio of odds of two
conditional probabilities
• Conditional probabilities in survival analysis
of the form :
Pr(live till time t1+t2 | survive up till time t1)
Basic probability
• Example: automatic blood-pressure machine
• 84% hypertensive and 23% normotensives are
classified as hypertensive
• Given 20% of adult population is hypertensive
• We now know:
Pr(machine says hypertensive | truly hypertensive)
• What is Pr(truly hypertensive| machine says
hypertensive)?
Basic probability
                                            Hypertension (H)
                                            Yes    No
Machine diagnosed as hypertensive (D)  Yes
                                       No
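One way to fill in this table and answer the question Pr(truly hypertensive | machine says hypertensive) is Bayes' rule. The Python sketch below uses only the three figures given for the blood-pressure machine (84%, 23%, 20%); the variable names are illustrative.

```python
p_d_given_h = 0.84       # Pr(machine says hypertensive | truly hypertensive)
p_d_given_not_h = 0.23   # Pr(machine says hypertensive | normotensive)
p_h = 0.20               # Pr(truly hypertensive) in the adult population

# Joint probabilities for the four cells of the 2x2 table
d_and_h = p_d_given_h * p_h
d_and_not_h = p_d_given_not_h * (1 - p_h)
not_d_and_h = (1 - p_d_given_h) * p_h
not_d_and_not_h = (1 - p_d_given_not_h) * (1 - p_h)

# Pr(H | D): share of machine-positive adults who are truly hypertensive
p_h_given_d = d_and_h / (d_and_h + d_and_not_h)
# Pr(not H | not D): share of machine-negative adults who are normotensive
p_not_h_given_not_d = not_d_and_not_h / (not_d_and_not_h + not_d_and_h)

print(round(p_h_given_d, 3), round(p_not_h_given_not_d, 3))
```

Under these inputs fewer than half of the machine-positive adults are truly hypertensive, because normotensive adults make up 80% of the population.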
Basic probability
• Positive predictive value — Probability that a
randomly selected subject from the population actually
has the disease given that the screening test is positive
• Negative predictive value — Probability that a
randomly selected subject from the population is
actually disease free given that the screening test is
negative
Basic probability
• Sensitivity — Probability that the procedure is positive
given that the person has the disease
• Specificity — Probability that the procedure is negative
given that the person does not have the disease
Review examples 3.26, 3.27, and 3.28 in Rosner
• Measures of Occurrence
– Measure using proportions (e.g.,
prevalence, odds)
– Rates (e.g., incidence, cumulative
incidence)
• Measure of Association
– Based on odds (e.g., odds ratio)
– Based on probabilities (e.g., risk ratio)
Absolute Measures of
Disease Occurrence
• Point prevalence = proportion of cases at a
given point in time
– cross-sectional measure
• Incidence = number of new cases within a
specified time interval
– prospective measure
Absolute Measures of
Disease Occurrence
• Example:
Consider four individuals diagnosed with lung cancer
Person   Years of Follow-up   Status
1        3                    Dead
2        5                    Alive
3        2                    Alive
4        1                    Dead
• Proportion of death = 2/4 = 0.5
• Rate of death = 2/(3+5+2+1) = 0.18 deaths per person-year
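The difference between the proportion and the rate in this example can be sketched in a few lines of Python, using the four follow-up records exactly as listed; the variable names are illustrative.

```python
# (years of follow-up, status) for the four individuals
follow_up = [(3, "Dead"), (5, "Alive"), (2, "Alive"), (1, "Dead")]

deaths = sum(1 for _, status in follow_up if status == "Dead")
person_years = sum(years for years, _ in follow_up)

proportion_dead = deaths / len(follow_up)   # denominator: people
death_rate = deaths / person_years          # denominator: person-years

print(proportion_dead, round(death_rate, 2))
```

The proportion ignores how long each person was observed, while the rate divides by the total time at risk.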
Absolute Measures of
Disease Occurrence
• Two kinds of quantities used in measurement:
– Proportion: the numerator is a subset of the
denominator, e.g., prevalence
– Rate: # events which occur during a time interval
divided by the total amount of time, e.g., incidence
rate
Absolute Measures of
Disease Occurrence
Remarks:
1) Diseases of long duration tend to have a higher
prevalence
2) Incidence tends to be more informative than
prevalence for causal understanding of the disease
etiology
3) Incidence is more difficult to measure & more
expensive
Absolute Measures of
Disease Occurrence
4) Prevalence & incidence can be influenced by the
evolution of screening procedures and diagnostic
tests
5) Both incidence and prevalence rates may be age
dependent
Absolute Measures of
Disease Occurrence
• Odds = ratio of P(event occurs) to the
P(event does not occur).
$$\text{odds} = \frac{p}{1-p}$$
Example:
The probability of a disease is 0.20.
Thus, the odds are 0.20/(1-0.20) = 0.20/0.80 =0.25 = 1:4
That is, for every one person with an event, there are 4
people without the event.
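A small Python sketch of the conversion between probability and odds, reproducing the 0.20 example; the function names are hypothetical.

```python
def odds_from_probability(p):
    """Odds = p / (1 - p) for a probability 0 <= p < 1."""
    return p / (1 - p)

def probability_from_odds(odds):
    """Inverse conversion: p = odds / (1 + odds)."""
    return odds / (1 + odds)

odds = odds_from_probability(0.20)
print(round(odds, 2), round(probability_from_odds(odds), 2))
```

Odds and probability are close when the probability is small and diverge as it grows: a probability of 0.5 corresponds to odds of 1.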
Absolute Measures of
Disease Occurrence
• Risk of disease in time interval [t0, t1)
P(t) = Pr(developing disease in interval of length
t = t1 - t0 given disease free at the start
of the interval)
• Average Prevalence = Incidence x Duration
duration = average duration of disease after onset
Measures of Disease Association
• So far we have discussed
– Prevalence
– Incidence rate
– Cumulative incidence rate
– Risk of disease within an interval t
• All absolute measures
• Next, relative measures and associations
– Exposed (E) versus Unexposed (Ē)
Measures of Disease Association
• Population versus sample
– Probabilities (population) are denoted by symbols
such as
p1 = P(disease within the exposed population)
– Sample estimates are denoted by p̂1
Measures of Disease Association
                   Exposed (E)   Not Exposed (Ē)   Total
Disease (D)        a             b                 n1
No Disease (D̄)     c             d                 n0
Total              m1            m0                n
Conditional distribution

                   Exposed (E)   Not Exposed (Ē)   Margin
Disease (D)        p1
No Disease (D̄)     1 − p1
Margin             1
Conditional distribution

                   Exposed (E)   Not Exposed (Ē)   Margin
Disease (D)                      p0
No Disease (D̄)                   1 − p0
Margin                           1
Measures of Association
• Odds ratio: Odds of disease among
exposed divided by odds of disease
among unexposed
$$OR = \frac{p_1/(1-p_1)}{p_0/(1-p_0)}$$
Measures of Association
OR > 1 implies a positive association between
disease and exposure
OR < 1 implies a negative association between
disease and exposure
OR for disease = OR for exposure
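The equality of the two odds ratios can be seen from the cell counts of a 2×2 table. In the sketch below, a, b, c, d are assumed to be the counts of diseased-exposed, diseased-unexposed, disease-free-exposed, and disease-free-unexposed subjects, respectively.

```latex
\begin{align*}
\text{OR for disease}  &= \frac{a/c}{b/d} = \frac{ad}{bc} \\
\text{OR for exposure} &= \frac{a/b}{c/d} = \frac{ad}{bc}
\end{align*}
```

Both routes reduce to the same cross-product ratio, so it does not matter whether the odds are taken of disease given exposure or of exposure given disease.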
Measures of Association
• Risk ratio = ratio between P(disease for
exposed) and P(disease for unexposed) , both
P(.) measured within the same duration of time
$$RR = \frac{p_1}{p_0}$$
Measures of Association?
• Risk Difference (Excess Risk):
RD = p1 − p0
RD not scale free
e.g., What is the meaning of these two equal differences
(RD = 0.009 in both)?
RD = 0.010 − 0.001 vs. RD = 0.210 − 0.201
• Attributable Risk for Exposed Persons:
AR = (p1 − p0) / p1 = 1 − 1 / RR
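To see why equal risk differences can mean very different things, the two pairs of values above can be run through a short Python sketch. It assumes the two numbers in each difference are the risks among exposed and unexposed (written `p1` and `p0` here); the function name `risk_measures` is hypothetical.

```python
def risk_measures(p1, p0):
    """Return (risk difference, risk ratio, attributable risk among exposed)."""
    rd = p1 - p0
    rr = p1 / p0
    ar = 1 - 1 / rr          # equivalently (p1 - p0) / p1
    return rd, rr, ar

low = risk_measures(0.010, 0.001)    # rare outcome
high = risk_measures(0.210, 0.201)   # common outcome

for label, (rd, rr, ar) in (("low", low), ("high", high)):
    print(label, round(rd, 3), round(rr, 3), round(ar, 3))
```

Both pairs give a difference of 0.009, but the risk ratio is about 10 in the first case and only about 1.04 in the second.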
• Measurements of risk and relative risk
in different sampling designs
• Cross-sectional
• Cohort
• Case-control
Measures of Disease Association
                   Exposed (E)   Not Exposed (Ē)   Total
Disease (D)        a             b                 n1
No Disease (D̄)     c             d                 n0
Total              m1            m0                n
• Cross-Sectional Sampling
Randomly sample n subjects from population at time t
and determine disease and exposure status.
Important: n is fixed for this design.
1) a/m1 estimates prevalence of disease at t among
exposed
2) b/m0 estimates prevalence of disease at t among
unexposed
3) ad/bc estimates the OR for disease and exposure
Odds Ratio
$$OR = \frac{p_1/(1-p_1)}{p_0/(1-p_0)}$$
p1 = a/m1 = disease risk among exposed
p0 = b/m0 = disease risk among unexposed
If p1 and p0 are small (rare disease) and the time
interval is relatively short, it can be shown that OR
≈ RR
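A brief sketch of why the approximation holds, using only the definitions of OR and RR in terms of p1 and p0:

```latex
\begin{align*}
OR &= \frac{p_1/(1-p_1)}{p_0/(1-p_0)}
    = \frac{p_1}{p_0}\cdot\frac{1-p_0}{1-p_1}
    = RR\cdot\frac{1-p_0}{1-p_1}
\end{align*}
```

When p1 and p0 are both small, the factor (1 − p0)/(1 − p1) is close to 1, so OR and RR nearly coincide.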
Cross-sectional Sampling
• Cross-sectional design not prospective
• Can only test for association between
exposure and prevalence and not incidence
• Cannot test hypotheses about causality
• Cohort Sampling
Sample n disease-free individuals from the
population at time t0 and follow them until time
t1.
Measure exposure history for each subject
and observe which subjects develop disease
in interval [t0, t1)
Important: m1, m0, and n are fixed
Cohort study: Estimates of risk
1) p1 = a/m1 estimates risk of developing disease in interval
among exposed
2) p0 = b/m0 estimates risk of developing disease in interval
among unexposed
3) RR ≈ p1 / p0
4) OR = ad / bc
5) IR (incidence rate): IR_i ≈ p_i / t for i = 0, 1 (and small t)
6) RD (risk difference): RD ≈ IR_1 – IR_0 ≈ (p1 – p0) / t
• Case-Control Sampling
Sample n1 cases and n0 disease free
controls from target population during
interval [t0, t1)
Important: n1, n0, and n are fixed
1) a/m1 and b/m0 do not estimate population disease risks
2) a/n1 estimates Pr(prior exposure | disease incidence in
[t0, t1))
3) c/n0 estimates Pr(prior exposure | no disease incidence
in [t0, t1))
4) OR = ad / bc
5) RR ≈ OR for rare disease or short time intervals
6) IR (incidence rate) or disease risks cannot be
estimated; RD (risk difference) cannot be estimated
• Hypothetical example
Frequency of disease and exposure in a target
population
             Exposure   Not Exposure   Total
Disease      8          32             40
No Disease   92         868            960
Total        100        900            1000
p1 = ?
p0 = ?
RR = p1 / p0 = ?
OR = ?
• Hypothetical example
Frequency of disease and exposure in a target
population
             Exposure   Not Exposure   Total
Disease      8          32             40
No Disease   92         868            960
Total        100        900            1000
p1 = 8 / 100 = 0.08;
p0 = 32 / 900 = 0.036
RR = p1 / p0 = 0.08 / 0.036 = 2.25
OR = (8 x 868) / (92 x 32) = 2.36
• Cohort Study
50% of exposed individuals sampled
25% of unexposed individuals sampled
             Exposure   Not Exposure   Total
Disease      4          8              12
No Disease   46         217            263
Total        50         225            275
p1 = 4 / 50 = 0.08;
p0 = 8 / 225 = 0.036
RR = p1 / p0 = 0.08 / 0.036 = 2.25
OR = (4 x 217) / (46 x 8) = 2.36
• Case-Control Study
100% of diseased individuals sampled
25% of disease-free individuals sampled
             Exposure   Not Exposure   Total
Disease      8          32             40
No Disease   23         217            240
Total        31         249            280
p1 = 8 / 31 = 0.26 ≠ 0.08;
p0 = 32 / 249 = 0.13 ≠ 0.036
RR = p1 / p0 = (8/31) / (32/249) = 2.01 ≠ 2.25
OR = (8 x 217) / (23 x 32) = 2.36
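The three hypothetical tables (target population, cohort sample, case-control sample) can be compared side by side with a short Python sketch. The cell counts are taken from the examples above; the function name `measures` and the dictionary layout are illustrative.

```python
def measures(a, b, c, d):
    """a, b: diseased exposed / unexposed; c, d: disease-free exposed / unexposed."""
    p1 = a / (a + c)                 # apparent risk among exposed
    p0 = b / (b + d)                 # apparent risk among unexposed
    return {"p1": p1, "p0": p0, "RR": p1 / p0, "OR": (a * d) / (b * c)}

tables = {
    "population":   (8, 32, 92, 868),
    "cohort":       (4, 8, 46, 217),
    "case-control": (8, 32, 23, 217),
}

results = {name: measures(*cells) for name, cells in tables.items()}
for name, r in results.items():
    print(name, {k: round(v, 3) for k, v in r.items()})
```

The cross-product ratio ad/bc is the same in all three tables, while the apparent risks and their ratio shift under case-control sampling because the disease-free group is sampled at a different fraction than the diseased group.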
Odds ratio
• The odds ratio is equally valid for
retrospective, prospective, or cross-sectional
sampling designs
• That is, regardless of the design it estimates
the same population parameter
Take home messages
– Occurrence of disease measured by
prevalence, or proportion
– Incidence measured by incidence rates, or
proportion per unit time
– Risk is probability of developing disease over
a specified period of time
Take home messages
– Association of disease with exposure
measured by odds ratios and risk ratios
– Odds ratios are valid for cross-sectional,
cohort, and case-control designs, risk ratios
are not
HW #1
• Due May 29
• Can talk to others but turn in own
work