Transcript Statistics

Statistics
April 23, 2009
Tim Bunnell, Ph.D. & Jobayer Hossain, Ph.D.
Nemours Bioinformatics Core Facility
Nemours Biomedical Research
Odds
• The odds in favor of an event (e.g., occurrence of a disease) is the ratio of the probability of the event occurring to the probability of it not occurring, i.e., p/(1-p)
  – where p is the probability of the event occurring
• If the probability of an event is 0.6 (i.e., it is observed in 60% of the cases), then the probability of it not occurring is (1 - 0.6) = 0.4
• The odds in favor of the event occurring are thus 0.6/0.4 = 1.5
• The greater the odds of an event, the greater its probability
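The arithmetic on this slide can be checked in R. This is a minimal sketch; the helper name odds is illustrative and not part of the original slides.

```r
# Hypothetical helper: convert a probability into the odds in favor of the event.
odds <- function(p) {
  stopifnot(all(p >= 0 & p < 1))  # odds are undefined at p = 1
  p / (1 - p)
}

odds(0.6)               # the slide's example: 0.6 / 0.4
odds(c(0.2, 0.5, 0.8))  # odds increase as the probability increases
```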
Calculating odds for a 2x2 table
                     Response Variable
Predictor Variable   Cancer   No Cancer   Total
Smokers              a        b           a+b
Non-Smokers          c        d           c+d
Total                a+c      b+d         N=a+b+c+d
The proportion of smokers with cancer: p = a/(a+b)
(this is the likelihood of smokers having cancer)
The proportion of smokers without cancer: 1-p = b/(a+b)
(this is the likelihood of smokers not having cancer)
The odds of smokers having cancer: p / (1 - p) = (a/(a+b))/(b/(a+b)) = a/b
Calculating odds for a 2x2 table
                     Response Variable
Predictor Variable   Cancer   No Cancer   Total
Smokers              a        b           a+b
Non-Smokers          c        d           c+d
Total                a+c      b+d         N=a+b+c+d
The proportion of non-smokers with cancer: p = c/(c+d)
The proportion of non-smokers without cancer: 1-p = d/(c+d)
The odds of cancer among non-smokers: (c/(c+d)) / (d/(c+d)) = c/d
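The row-wise odds for a 2x2 table can be sketched in R as follows. The counts below are hypothetical and chosen only for illustration; they do not come from the slides.

```r
# Hypothetical counts for illustration only (not from the slides).
tab <- matrix(c(30, 70,
                10, 90),
              nrow = 2, byrow = TRUE,
              dimnames = list(Predictor = c("Smokers", "Non-Smokers"),
                              Response  = c("Cancer", "No Cancer")))

p_cancer    <- tab[, "Cancer"] / rowSums(tab)  # a/(a+b) and c/(c+d)
odds_cancer <- p_cancer / (1 - p_cancer)       # reduces to a/b and c/d

odds_cancer
tab[, "Cancer"] / tab[, "No Cancer"]           # same values, computed directly
```

The row totals cancel, which is why the odds can be read straight from the two cells in each row.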
Odds Ratio
• The odds ratio of an event is the ratio of the odds of the event occurring in one group to the odds of it occurring in another group.
• Let p1 be the probability of an event in group 1 and p2 be the probability of the same event in group 2. Then the odds ratio (OR) of the event in these two groups is:
  OR = [p1/(1 - p1)] / [p2/(1 - p2)]
• The odds ratio is a measure of effect size
Odds Ratio
• In the previous example,

               Cancer   No Cancer   Total
Smokers        a        b           a+b
Non-Smokers    c        d           c+d
Total          a+c      b+d         N=a+b+c+d

The odds of cancer among smokers is a/b
The odds of cancer among non-smokers is c/d
So, the odds ratio of cancer among smokers vs non-smokers = (a/b) / (c/d) = ad/bc
Odds Ratio
Estimating odds ratio in a 2x2 table
               Cancer   No Cancer   Total
Smokers        120      80          200
Non-Smokers    60       140         200
Total          180      220         N=400

The odds of cancer in smokers is 120/80 = 1.5 and the odds of cancer in non-smokers is 60/140 = 0.4286, and the ratio of the two odds is
OR = (120/80) / (60/140) = (120 x 140) / (60 x 80) = 3.5
Interpretation: The odds of cancer are 3.5 times as high for smokers as for non-smokers (in this sample)
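The same calculation can be sketched in base R using the counts from this slide. The variable names are illustrative.

```r
# Counts from the slide: rows = smokers / non-smokers, columns = cancer / no cancer.
tab <- matrix(c(120, 80,
                 60, 140),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Smokers", "Non-Smokers"),
                              c("Cancer", "No Cancer")))

odds_smokers     <- tab["Smokers", "Cancer"] / tab["Smokers", "No Cancer"]
odds_non_smokers <- tab["Non-Smokers", "Cancer"] / tab["Non-Smokers", "No Cancer"]

or_ratio <- odds_smokers / odds_non_smokers                    # ratio of the two odds
or_cross <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])  # cross-product ad / bc

c(odds_smokers, odds_non_smokers, or_ratio, or_cross)
```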
Odds Ratio
• The odds ratio must be greater than or equal to zero.
• As the odds of the first group approach zero, the odds ratio approaches zero.
• As the odds of the second group approach zero, the odds ratio approaches positive infinity.
• An odds ratio of 1 indicates that the condition or event under study is equally likely in both groups. In our example, that would mean no association between cancer and smoking was observed.
Odds Ratio
• An odds ratio greater than 1 indicates that the condition or event is more likely in the first group. In the previous example, an odds ratio of 2 means that the odds of cancer are 2 times as high in smokers as in non-smokers.
• An odds ratio less than 1 indicates that the condition or event is less likely in the first group. In our example, an odds ratio of 0.8 would mean that the odds of cancer are 20% (i.e., 1 - 0.8) lower in smokers than in non-smokers.
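An odds ratio compares odds rather than probabilities, so when the event is common the two ratios can differ noticeably. The sketch below uses the counts from the smoking example (120 of 200 smokers and 60 of 200 non-smokers with cancer); the variable names are illustrative.

```r
p_smokers     <- 120 / 200  # proportion of smokers with cancer
p_non_smokers <- 60 / 200   # proportion of non-smokers with cancer

ratio_of_proportions <- p_smokers / p_non_smokers
ratio_of_odds <- (p_smokers / (1 - p_smokers)) /
                 (p_non_smokers / (1 - p_non_smokers))

c(ratio_of_proportions = ratio_of_proportions, ratio_of_odds = ratio_of_odds)
```

Here the proportion is 2 times as high among smokers, while the odds are 3.5 times as high.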
R-demo: Odds ratio (from 2 x 2 contingency table)
• Install the epitools package for odds ratios: > install.packages("epitools")
• Load epitools: > library("epitools")
• Input the data from the previous example: > x <- matrix(c(120, 60, 80, 140), 2, 2)
• Compute the odds ratio: > oddsratio(x, method="wald")
R-demo: Odds ratio - Output
> oddsratio(x, method="wald")
$data
          Outcome
Predictor  Disease1 Disease2 Total
  Exposed1      120       80   200
  Exposed2       60      140   200
  Total         180      220   400

$measure
          odds ratio with 95% C.I.
Predictor  estimate    lower    upper
  Exposed1      1.0       NA       NA
  Exposed2      3.5 2.313230 5.295625

$p.value
          two-sided
Predictor    midp.exact fisher.exact   chi.square
  Exposed1           NA           NA           NA
  Exposed2 1.467097e-09 2.298090e-09 1.637296e-09

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
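The estimate and the Wald interval in this output can be reproduced by hand in base R. This is a sketch that assumes the usual Wald construction, exp(log(OR) ± z * SE) with SE = sqrt(1/a + 1/b + 1/c + 1/d); it does not call epitools.

```r
x <- matrix(c(120, 60, 80, 140), 2, 2)  # same matrix as in the demo

or_hat <- (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1])  # cross-product ratio ad / bc
se_log <- sqrt(sum(1 / x))                           # standard error of log(OR)
z      <- qnorm(0.975)                               # about 1.96 for a 95% interval

ci <- exp(log(or_hat) + c(-1, 1) * z * se_log)

or_hat
ci
```

The resulting interval agrees with the lower and upper limits printed under $measure.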
R-demo: Odds ratio (from raw data)
• Categorize the outcome variable PLUC.post into two groups (high and low PLUC groups).
  – Data management -> Manage variable in active data set -> Bin numeric variable -> From this window: pick the variable to bin (e.g. PLUC.post), name the new variable (e.g. Newvar), select the number of bins (e.g. 2), select the binning method (e.g. kmeans clustering)
• oddsratio(data$Ped, data$Newvar, method="wald"), where Newvar is the categorized outcome variable and Ped is also a categorical variable, identifying pediatrician 1 or 2.
R-demo: Odds ratio - Output
> oddsratio(data$Ped, data$Newvar, method="wald")
$data
         Outcome
Predictor  0  1 Total
    0     12 18    30
    1     16 14    30
    Total 28 32    60

$measure
         odds ratio with 95% C.I.
Predictor  estimate     lower    upper
        0 1.0000000        NA       NA
        1 0.5833333 0.2095645 1.623737

$p.value
         two-sided
Predictor midp.exact fisher.exact chi.square
        0         NA           NA         NA
        1  0.3166258    0.4378954   0.300623

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
Logit of p
• The logit of a number P between 0 and 1 is log(P/(1-P)). It is denoted by logit(P).
• If P is the probability of an event, then P/(1-P) is the odds of that event and logit(P) is the log odds: logit(P) = log(P/(1-P)).
• The difference between the logits of two probabilities is the log of the odds ratio (OR):
  log(OR) = log{[p1/(1 - p1)] / [p2/(1 - p2)]} = log(p1/(1 - p1)) - log(p2/(1 - p2)) = logit(p1) - logit(p2)
• The logit scale is linear and functions much like a z-score scale.
Logit of p
• Logit is a continuous score in the range -∞ to +∞
• p = 0.50, then logit = 0.0
  p = 0.70, then logit = 0.85
  p = 0.30, then logit = -0.85
• The standard error of the log odds ratio (a difference of two logits) from a 2x2 table is sqrt(1/a + 1/b + 1/c + 1/d).
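The logit values listed on this slide can be checked in R. This is a minimal sketch; the helper name logit is illustrative, and base R's qlogis() computes the same function.

```r
logit <- function(p) log(p / (1 - p))  # same as the built-in qlogis(p)

round(logit(c(0.50, 0.70, 0.30)), 2)   # the three values on the slide

# The difference of two logits equals the log odds ratio
# (proportions from the smoking example: 120/200 and 60/200).
p1 <- 120 / 200
p2 <- 60 / 200
logit(p1) - logit(p2)
log(3.5)
```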
Logistic Regression
• Recall: For a categorical variable, we focus on the number or proportion in each category.
• The proportion in a category simply tells us how likely that category is to occur.
• Suppose y is a variable that represents the occurrence or non-occurrence of cancer (two categories only),
• and y = 1 indicates occurrence and y = 0 indicates non-occurrence.
• Let p = likelihood of the event (y = 1), so 1 - p = likelihood of the event (y = 0).
• We want to relate p or (1 - p), i.e., the likelihood of the event happening or not happening (instead of the response y itself), to an independent variable.
Logistic Regression
Plot of two variables: An outcome variable (say cancer happening/not happening) and an independent variable (say age)
[Figure: binary outcome "Signs of coronary disease" (Yes/No) plotted against AGE (years), from 0 to 100. Annotation: Could we use a linear regression?]
Logistic Regression
Plot of proportions for different ages
[Figure: Likelihood of cancer (0.0 to 1.0) plotted against Age (in years)]
Likelihood of cancer increases with increasing age
Relationship: not linear
Linear vs. Logistic Function
Logistic Regression
• Recall: Simple linear regression:
  y = b0 + b1x, where y is a continuous quantitative outcome variable and x is a quantitative/categorical variable.
• Like y, logit is a quantitative variable, and we can replace y by the logit of p, where p is the likelihood of an event.
• That is, log(p/(1-p)) = log(odds) = b0 + b1x, which is a simple linear regression between log(p/(1-p)) = log(odds) and the independent variable x (say age).
• Association patterns with log(odds) are the same as the patterns with odds itself.
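The relationship between the linear log-odds scale and the probability scale can be sketched in R. The coefficients below are hypothetical and chosen only to illustrate the shape of the model; they are not estimated from any data in these slides.

```r
# Hypothetical coefficients, for illustration only.
b0  <- -5
b1  <- 0.1
age <- c(20, 40, 60, 80)

log_odds <- b0 + b1 * age  # linear in age
p <- plogis(log_odds)      # inverse logit: exp(lo) / (1 + exp(lo))

data.frame(age, log_odds, p = round(p, 3))
```

Equal steps in age give equal steps in log(odds), while the corresponding probabilities stay between 0 and 1 and change by unequal amounts.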
Logistic Regression
• log(p/(1-p)) = log(odds of cancer) = b0 + b1*age
• Interpretation of b1: the change in the log odds of cancer for a 1-year change in age.
• Consider two persons of ages 55 and 56; then
• log(odds of cancer at age 55) = b0 + b1*55
• log(odds of cancer at age 56) = b0 + b1*56
• The difference: log(odds at 56) - log(odds at 55) = b1
  b1 = log(odds at 56 / odds at 55) = log(odds ratio)
  Odds ratio = exp(b1)
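The identity between the one-year odds ratio and exp(b1) can be verified numerically. This sketch uses hypothetical coefficients and an illustrative helper odds_at; neither comes from the slides.

```r
# Hypothetical coefficients for illustration (not estimated from any data).
b0 <- -5
b1 <- 0.1

odds_at <- function(age) exp(b0 + b1 * age)  # odds implied by the model at a given age

odds_at(56) / odds_at(55)  # odds ratio for a 1-year increase in age
exp(b1)                    # the same quantity
odds_at(31) / odds_at(30)  # does not depend on the starting age
```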
Logistic Regression
• b1 = 0 (or equivalently, odds ratio exp(b1) = 1) indicates no association of log(odds of cancer) with the variable age (X).
• b1 > 0 indicates a positive association of log(odds of cancer) with the variable age (X).
• b1 < 0 indicates a negative association of log(odds of cancer) with the variable age (X).
• If the 95% confidence interval (CI) of b1 does not contain 0 (the null hypothesis value), the independent variable has a significant influence on the response variable at the 5% level of significance.
• b0 is the intercept
Logistic Regression
• exp(b1) = 1: no association of the response with the predictor. For a categorical predictor, the event is equally likely in the reference and comparison groups.
• exp(b1) > 1 indicates that the event is more likely in the comparison group than in the reference group.
• exp(b1) < 1 indicates that the event is less likely in the comparison group than in the reference group.
• If the 95% CI of the odds ratio does not contain 1, the likelihood of the event in the two groups is significantly different at the 5% level of significance.
Rcmdr demo: Logistic Regression
• Statistics -> Fit models -> Generalized linear model -> from this window: under the model formula, select dependent variable (e.g. Newvar) ~ independent variable (e.g. age), family: binomial, link: logit.
• Newvar is the categorical form of the variable PLUC.post (details in the example for odds ratio)
Simple Logistic Regression: Rcmdr output
Call:
glm(formula = Newvar ~ age, family = binomial(logit), data = data)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.246  -1.233   1.109   1.123   1.131

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0818711  0.8222250   0.100    0.921
age         0.0005715  0.0086349   0.066    0.947

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 82.911  on 59  degrees of freedom
Residual deviance: 82.906  on 58  degrees of freedom
AIC: 86.906

95% CI for the coefficient of age is 0.0005715 ± 2*0.0086349 = (-0.0166983, 0.0178413)
In terms of the odds ratio: Odds ratio = exp(0.0005715) = 1.000572
95% CI for the odds ratio is (exp(-0.0166983), exp(0.0178413)) = (0.983440, 1.018001)
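The interval arithmetic can be sketched in R from the estimate and standard error printed in the output. The multiplier 2 follows the slide; qnorm(0.975) gives the slightly narrower exact normal interval. Variable names are illustrative.

```r
b1 <- 0.0005715  # estimate for age, from the output
se <- 0.0086349  # its standard error, from the output

ci_b1 <- b1 + c(-1, 1) * 2 * se  # approximate 95% CI, using 2 as the multiplier
ci_or <- exp(ci_b1)              # the same interval on the odds ratio scale

ci_b1
exp(b1)
ci_or

# With the exact normal quantile instead of 2:
exp(b1 + c(-1, 1) * qnorm(0.975) * se)
```

The interval for b1 contains 0 and the interval for the odds ratio contains 1, consistent with the large p-value for age.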
Simple logistic regression: Rcmdr output
Call:
glm(formula = Newvar ~ Ped, family = binomial(logit), data = data)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.354  -1.121   1.011   1.011   1.235

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.4055     0.3727   1.088    0.277
Ped[T.1]     -0.5390     0.5223  -1.032    0.302

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 82.911  on 59  degrees of freedom
Residual deviance: 81.836  on 58  degrees of freedom
AIC: 85.836
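The link between this fit and the earlier odds ratio for Ped can be sketched in base R. The original data set is not available here, so the sketch rebuilds the 60 observations from the 2x2 table in the earlier oddsratio output (Ped = 0: 12 and 18; Ped = 1: 16 and 14); the variable names ped and newvar are stand-ins.

```r
# Rebuild individual records from the 2x2 counts shown in the odds ratio output.
ped    <- factor(rep(c(0, 1), each = 30))
newvar <- c(rep(c(0, 1), times = c(12, 18)),  # Ped = 0: 12 zeros, 18 ones
            rep(c(0, 1), times = c(16, 14)))  # Ped = 1: 16 zeros, 14 ones

fit <- glm(newvar ~ ped, family = binomial(link = "logit"))

coef(fit)                           # intercept and the coefficient for ped = 1
or_ped <- exp(coef(fit)[["ped1"]])  # odds ratio for Ped 1 vs Ped 0
or_ped
(14 / 16) / (18 / 12)               # the same odds ratio from the table
```

Base R labels the coefficient ped1 where Rcmdr prints Ped[T.1]; the exponentiated coefficient agrees with the estimate 0.5833333 reported by oddsratio().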
Logistic Regression
• Outcome (response) variable is binary
• Independent variable (predictor) can be either categorical or
quantitative
• Relationship between the likelihood of the outcome and the predictor(s) is not linear; it is linear on the log(odds) scale
Multiple Logistic Regression
• More than one independent variable in the model, i.e.,
• log(p/(1-p)) = b0 + b1*x1 + b2*x2
• Interpretation: the same as for simple logistic regression, with each coefficient adjusted for the other variables in the model
• Response variable is binary
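A model with two predictors can be fitted with the same glm() call. The sketch below uses simulated data, since no suitable data set accompanies this slide; the predictors age and sex and the coefficients used to generate the data are hypothetical.

```r
set.seed(1)                                 # simulated data, for illustration only
n   <- 500
age <- runif(n, 20, 80)
sex <- rbinom(n, 1, 0.5)                    # hypothetical binary predictor
p   <- plogis(-4 + 0.05 * age + 0.7 * sex)  # hypothetical true coefficients
y   <- rbinom(n, 1, p)

fit <- glm(y ~ age + sex, family = binomial(link = "logit"))

coef(fit)       # b0, b1, b2 on the log(odds) scale
exp(coef(fit))  # odds ratios (the intercept term is the baseline odds)
```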
Multinomial and Ordinal Logistic Regressions
• Multinomial: The response (outcome) variable is multicategorical (e.g. race: Caucasian, African American, Hispanic, Asian, etc.).
• Ordinal: Categories of the response (outcome) variable can be ranked or ordered (e.g. disease condition: mild, moderate, and severe)
Thank you