Transcript Document

Experimental design and analyses
of experimental data
Lesson 6
Logistic regression
Generalized Linear Models (GENMOD)
1
Logistic regression
• Used when data are dichotomous.
• Used when data are fractions between 0 and 1
2
Example:
Does predation of eggs in nests of Oyster catcher depend on
• the distance from the nest to the nearest nest of Herring gull?
• the vegetation surrounding the nest?
• the number of eggs in the nest?
3
Data:

OBS   DIST   EGGS   VEG   KILLED
1     0.5    3      B     3
2     1.0    7      C     5
3     5.7    5      B     1
4     3.8    9      A     6
5     3.0    7      C     5
6     6.1    8      A     3
........
57    3.3    3      A     3
4
Analysis of dichotomous data:
• Nests are categorized according to whether
predation has occurred or not.
• No predation is scored as 0
• Predation is scored as 1
5
[Figure: Plus/minus predator visit to Oyster catcher nest. y-axis: Visit to nest (0 or 1); x-axis: Distance (m) from nearest Herring gull nest (0 to 10).]
6
The purpose is to fit a model to
the data – a model that predicts
the probability of a nest being
predated
7
The logistic regression model:

$$\pi_i = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p}} = \frac{e^{y}}{1 + e^{y}}$$

where

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$

and the random variation of the observed response around π is binomial, ε ~ BIN(0, π(1-π)).

The logit-transformation:

$$\ln\left(\frac{\pi}{1 - \pi}\right) = y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$

Here π/(1-π) is the odds
(the ratio between the probability of a positive and a negative event)
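The relation between probability, odds and logit can be checked numerically. The following is a minimal sketch in Python (chosen here only for illustration; the course itself uses SAS). The function names and the probability 0.8 are hypothetical and not part of the lecture material.

```python
import math

def odds(p):
    """Odds of an event with probability p (0 < p < 1)."""
    return p / (1.0 - p)

def logit(p):
    """Logit transformation: natural logarithm of the odds."""
    return math.log(odds(p))

def inverse_logit(y):
    """Map a value y on the logit scale back to a probability."""
    return math.exp(y) / (1.0 + math.exp(y))

p = 0.8                          # hypothetical probability of predation
print(odds(p))                   # about 4: predation is four times as likely as no predation
print(logit(p))                  # ln(4)
print(inverse_logit(logit(p)))   # recovers 0.8 up to rounding
```

The last line illustrates that the logistic formula for π and the logit transformation are each other's inverses.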
8
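For y = 0:

$$\pi = \frac{e^{y}}{1 + e^{y}} = \frac{e^{0}}{1 + e^{0}} = \frac{1}{1 + 1} = \frac{1}{2}$$

For y → -∞:

$$\pi = \frac{e^{-\infty}}{1 + e^{-\infty}} = \frac{0}{1 + 0} = 0$$

For y → ∞:

$$\pi = \frac{e^{\infty}}{1 + e^{\infty}} = \frac{1}{e^{-\infty} + 1} = \frac{1}{0 + 1} = 1$$

So that

$$-\infty < y < \infty \;\Rightarrow\; 0 < \pi < 1$$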
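These limits can be illustrated numerically. The sketch below is in Python (used only for illustration, not part of the SAS material) and evaluates π for a few values of y. The function name `logistic` and the chosen y values are assumptions for this example; the two-branch form is only there to avoid overflow for large |y|.

```python
import math

def logistic(y):
    """pi = e^y / (1 + e^y), written so that large |y| does not overflow."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    z = math.exp(y)
    return z / (1.0 + z)

# pi approaches 0 for very negative y, equals 1/2 at y = 0, approaches 1 for large y.
for y in (-50.0, -5.0, 0.0, 5.0, 50.0):
    print(y, logistic(y))
```

Mathematically π stays strictly between 0 and 1. In floating-point arithmetic the value for a large positive y may be rounded to 1.0.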
9
How to do it in SAS
10
DATA logist;
OPTIONS LINESIZE = 90;
/* Example on logistic regression */
/* The example is inspired by Dorthe Lahrmann's investigations of
   Oyster catchers (strandskader) on Langli in Ho Bugt */
INFILE 'h:\lin-mod\logist.prn' FIRSTOBS=2;
INPUT dist eggs veg $ killed;
/* dist = distance to the nearest nest of Herring gull (sølvmåge) */
/* eggs = number of Oyster catcher eggs in a nest */
/* veg = vegetation type surrounding an Oyster catcher nest */
/* killed = number of eggs in the nest that were killed */
IF killed > 0 THEN visit = 1;
IF killed = 0 THEN visit = 0;
/* If killed > 0 then the nest has been visited by a predator at least
   once */
RUN;

/* Example A: Analysis of whether a nest has been visited or not visited
   by predators, i.e. visit = 1 or 0 */
PROC GENMOD;
/* The procedure is Generalized Linear Models */
TITLE 'Eksempel A';
CLASS veg;
/* veg is a class variable */
MODEL visit = dist veg / DIST=binomial LINK=logit TYPE3 DSCALE OBSTATS;
/* DIST = distribution function (here chosen as binomial) */
/* LINK = the model uses a logit-transformation of data */
/* TYPE3 = type 3 statistics are used to evaluate the relative
   contribution of the different factors to the dependent variable */
/* DSCALE = an option which tells SAS to scale the error in order to
   meet the demands of the model. The scale parameter is estimated as
   the square root of Deviance/DF. If Deviance/DF is approximately 1,
   scaling is not needed. */
/* OBSTATS = gives the predicted values as well as their confidence
   limits */
RUN;
12
Eksempel A                        10:19 Thursday, November 22, 2001   87

The GENMOD Procedure

Model Information

Description           Value
Data Set              WORK.LOGIST
Distribution          BINOMIAL
Link Function         LOGIT
Dependent Variable    VISIT
Observations Used     57
Number Of Events      52
Number Of Trials      57

Class Level Information

Class    Levels    Values
VEG      3         A B C
13
Criteria For Assessing Goodness Of Fit

Criterion             DF    Value       Value/DF
Deviance              53    20.2819     0.3827
Scaled Deviance       53    53.0000     1.0000
Pearson Chi-Square    53    22.2740     0.4203
Scaled Pearson X2     53    58.2057     1.0982
Log Likelihood        .     -26.5000    .

Value column: These values indicate the fit of the model. Low values (for a given DF) indicate a good fit.
Value/DF column: These values should be close to unity if the model's assumptions are met.
Values greater than unity indicate overdispersion (variance greater than expected).
Values less than unity indicate underdispersion (variance less than expected).
Scaled Deviance and Scaled Pearson X2: Values after scaling with DSCALE.
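The relation between the unscaled and scaled rows of the goodness-of-fit table can be retraced by hand. The sketch below (Python, for illustration only) assumes that DSCALE divides each statistic by Deviance/DF, which is consistent with the numbers printed in the table. The variable names are hypothetical.

```python
import math

# Values copied from the "Criteria For Assessing Goodness Of Fit" table for Example A.
deviance, pearson, df = 20.2819, 22.2740, 53

dispersion = deviance / df               # the Value/DF entry for the deviance
scale = math.sqrt(dispersion)            # square root of DEVIANCE/DOF
scaled_deviance = deviance / dispersion  # equals DF by construction
scaled_pearson = pearson / dispersion

print(round(dispersion, 4), round(scale, 4))
print(round(scaled_deviance, 4), round(scaled_pearson, 4))
```

Because Deviance/DF is well below 1 here, the data show less variation than the binomial model expects (underdispersion).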
14
Analysis Of Parameter Estimates

Parameter      DF    Estimate    Std Err    ChiSquare    Pr>Chi
INTERCEPT      1     8.5639      2.1271     16.2093      0.0001
DIST           1     -1.0032     0.2651     14.3173      0.0002
VEG     A      1     0.2489      0.9555     0.0678       0.7945
VEG     B      1     0.4370      0.9250     0.2232       0.6366
VEG     C      0     0.0000      0.0000     .            .
SCALE          0     0.6186      0.0000     .            .

NOTE: The scale parameter was estimated by the square root of DEVIANCE/DOF.

LR Statistics For Type 3 Analysis

Source    NDF    DDF    F          Pr>F      ChiSquare    Pr>Chi
DIST      1      53     34.8596    0.0001    34.8596      0.0001
VEG       2      53     0.1118     0.8944    0.2237       0.8942
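The ChiSquare column in the parameter table agrees, up to rounding, with the squared ratio of estimate to standard error (a Wald statistic with 1 DF). The Python sketch below checks this for the printed values and also converts the DIST estimate to an odds ratio per metre. The conversion exp(estimate) is standard for a logit link but is not shown on the slides, and all names here are hypothetical.

```python
import math

# (estimate, standard error) copied from "Analysis Of Parameter Estimates" for Example A.
estimates = {
    "INTERCEPT": (8.5639, 2.1271),
    "DIST": (-1.0032, 0.2651),
    "VEG A": (0.2489, 0.9555),
    "VEG B": (0.4370, 0.9250),
}

def wald_chisq(estimate, std_err):
    """Wald chi-square with 1 DF: (estimate / standard error) squared."""
    return (estimate / std_err) ** 2

for name, (est, se) in estimates.items():
    print(name, round(wald_chisq(est, se), 4))

# Multiplicative change in the odds of a predator visit per extra metre of distance.
odds_ratio_dist = math.exp(estimates["DIST"][0])
print(round(odds_ratio_dist, 3))
```

An odds ratio of roughly 0.37 means that, under this model, each additional metre from the nearest Herring gull nest reduces the odds of a visit to a little over a third.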
15
Criteria For Assessing Goodness Of Fit

Criterion             DF    Value       Value/DF
Deviance              55    20.3675     0.3703
Scaled Deviance       55    55.0000     1.0000
Pearson Chi-Square    55    21.6364     0.3934
Scaled Pearson X2     55    58.4265     1.0623
Log Likelihood        .     -27.5000    .

Analysis Of Parameter Estimates

Parameter      DF    Estimate    Std Err    ChiSquare    Pr>Chi
INTERCEPT      1     8.8288      2.0182     19.1363      0.0001
DIST           1     -1.0012     0.2587     14.9777      0.0001
SCALE          0     0.6085      0.0000     .            .

NOTE: The scale parameter was estimated by the square root of DEVIANCE/DOF.

LR Statistics For Type 3 Analysis

Source    NDF    DDF    F          Pr>F      ChiSquare    Pr>Chi
DIST      1      55     36.4999    0.0001    36.4999      0.0001
16
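With only DIST left in the model, the fitted equation can be used directly to predict the probability of a predator visit. The Python sketch below plugs the printed INTERCEPT and DIST estimates into the logistic formula. The function name and the chosen distances are hypothetical, and the distance at which the probability equals 0.5 is simple arithmetic on the estimates rather than something reported by SAS.

```python
import math

B0, B1 = 8.8288, -1.0012   # INTERCEPT and DIST estimates from the dist-only model

def predicted_visit_probability(dist_m):
    """pi = e^y / (1 + e^y) with y = B0 + B1 * distance."""
    y = B0 + B1 * dist_m
    return 1.0 / (1.0 + math.exp(-y))

for d in (0.5, 5.7, 10.0):
    print(d, round(predicted_visit_probability(d), 4))

# y = 0 gives pi = 1/2, so the 50 % distance is -B0 / B1.
dist_at_half = -B0 / B1
print(round(dist_at_half, 2))
```

Under these estimates the predicted probability stays close to 1 for nests within a few metres of a Herring gull nest and falls to one half somewhat below 9 m.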
Observation Statistics

VISIT    Pred      Xbeta      Std       HessWgt     Lower     Upper     Resraw
1        0.9998    8.3283     1.8909    0.000652    0.9903    1.0000    0.000242
1        0.9996    7.8277     1.7639    0.001075    0.9875    1.0000    0.000398
1        0.9578    3.1222     0.6185    0.1091      0.8710    0.9871    0.0422
1        0.9935    5.0244     1.0628    0.0175      0.9498    0.9992    0.006533
1        0.9971    5.8253     1.2605    0.007924    0.9663    0.9998    0.002943
1        0.9383    2.7217     0.5356    0.1563      0.8418    0.9775    0.0617
1        0.9971    5.8253     1.2605    0.007924    0.9663    0.9998    0.002943
1        0.9973    5.9255     1.2854    0.007173    0.9679    0.9998    0.002663
0        0.3358    -0.6822    0.5813    0.6023      0.1392    0.6123    -0.3358
1        0.9764    3.7229     0.7525    0.0622      0.9045    0.9945    0.0236
0        0.7150    ..........................................
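The Pred, Lower and Upper columns can be retraced from Xbeta and Std. The Python sketch below assumes that the limits are formed on the logit scale as Xbeta ± 1.96 × Std and then back-transformed, which reproduces the printed values to the displayed precision. The function names and the multiplier 1.96 are assumptions of this example.

```python
import math

def logistic(y):
    """Back-transform from the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-y))

def obstats_row(xbeta, std, z=1.96):
    """Return (Pred, Lower, Upper) on the probability scale."""
    return (logistic(xbeta),
            logistic(xbeta - z * std),
            logistic(xbeta + z * std))

# Xbeta and Std copied from the first and third rows of the Observation Statistics.
for xbeta, std in ((8.3283, 1.8909), (3.1222, 0.6185)):
    pred, lower, upper = obstats_row(xbeta, std)
    print(round(pred, 4), round(lower, 4), round(upper, 4))
```

Because the limits are symmetric on the logit scale, they are asymmetric around Pred on the probability scale and can never fall outside the interval from 0 to 1.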
17
[Figure: Predicted values and 95% confidence limits. y-axis: Visit to nest (0.00 to 1.00); x-axis: Distance (m) from nearest Herring gull nest (0 to 10).]
18
/* Example B: Analysis of the fraction of eggs in a nest that are lost */
PROC GENMOD;
/* The procedure is Generalized Linear Models */
TITLE 'Eksempel B';
CLASS veg;
/* veg is a class variable */
MODEL killed/eggs = dist veg eggs / DIST=binomial LINK=logit TYPE3 DSCALE OBSTATS;
/* DIST = distribution function (here chosen as binomial) */
/* LINK = the model uses a logit-transformation of data */
/* TYPE3 = SS3 is used to determine the contribution of the individual
   factors to the dependent variable */
/* DSCALE = option that can be used if Deviance/DF is different from 1.
   It reduces the risk of Type I errors if the scale parameter is > 1
   and the risk of Type II errors if the scale parameter is < 1 */
/* OBSTATS = gives the predicted values and the confidence limits */
RUN;

Note that this procedure takes the absolute number of eggs killed out of the total number of eggs into consideration, and not merely the proportion of killed eggs.
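Why the absolute numbers matter can be seen from the binomial log-likelihood: two nests with the same proportion of killed eggs but different clutch sizes do not carry the same weight. The Python sketch below compares two hypothetical nests (1 of 2 eggs and 5 of 10 eggs killed). The function names, the nests and the alternative probability 0.3 are made up for illustration.

```python
import math

def binom_loglik(r, n, p):
    """Log of the binomial probability of r events in n trials."""
    return (math.log(math.comb(n, r))
            + r * math.log(p)
            + (n - r) * math.log(1.0 - p))

def support_drop(r, n, p_alt):
    """How far the log-likelihood falls when p moves from r/n to p_alt."""
    return binom_loglik(r, n, r / n) - binom_loglik(r, n, p_alt)

small = support_drop(1, 2, 0.3)    # hypothetical nest: 1 of 2 eggs killed
large = support_drop(5, 10, 0.3)   # hypothetical nest: 5 of 10 eggs killed
print(small, large)
```

Both nests have an observed fraction of 0.5, but the larger nest penalizes a poorly fitting probability five times as strongly, so it has more influence on the fitted model.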
19
Eksempel B                        12:26 Thursday, November 22, 2001   7

The GENMOD Procedure

Model Information

Description           Value
Data Set              WORK.LOGIST
Distribution          BINOMIAL
Link Function         LOGIT
Dependent Variable    KILLED
Dependent Variable    EGGS
Observations Used     57
Number Of Events      183
Number Of Trials      336

Class Level Information

Class    Levels    Values
VEG      3         A B C
20
Criteria For Assessing Goodness Of Fit

Criterion             DF    Value        Value/DF
Deviance              52    53.9491      1.0375
Scaled Deviance       52    52.0000      1.0000
Pearson Chi-Square    52    44.1413      0.8489
Scaled Pearson X2     52    42.5465      0.8182
Log Likelihood        .     -171.3777    .
21
Analysis Of Parameter Estimates

Parameter      DF    Estimate    Std Err    ChiSquare    Pr>Chi
INTERCEPT      1     2.6437      0.5644     21.9369      0.0001
DIST           1     -0.5284     0.0623     71.9060      0.0001
VEG     A      1     0.1425      0.3629     0.1541       0.6946
VEG     B      1     0.1623      0.3602     0.2029       0.6524
VEG     C      0     0.0000      0.0000     .            .
EGGS           1     -0.0314     0.0637     0.2433       0.6219
SCALE          0     1.0186      0.0000     .            .

NOTE: The scale parameter was estimated by the square root of DEVIANCE/DOF.

LR Statistics For Type 3 Analysis

Source    NDF    DDF    F          Pr>F      ChiSquare    Pr>Chi
DIST      1      52     97.2164    0.0001    97.2164      0.0001
VEG       2      52     0.1135     0.8929    0.2271       0.8927
EGGS      1      52     0.2443     0.6232    0.2443       0.6211
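In the Type 3 table the F and ChiSquare columns are linked: the printed values are consistent with F = ChiSquare / NDF, which is why the two columns coincide for sources with a single degree of freedom. The Python sketch below checks this for the three sources. The dictionary layout and function name are hypothetical.

```python
# (NDF, ChiSquare, F) copied from "LR Statistics For Type 3 Analysis" for Example B.
type3 = {
    "DIST": (1, 97.2164, 97.2164),
    "VEG": (2, 0.2271, 0.1135),
    "EGGS": (1, 0.2443, 0.2443),
}

def f_from_chisq(chisq, ndf):
    """F statistic implied by a scaled likelihood-ratio chi-square."""
    return chisq / ndf

for source, (ndf, chisq, f_reported) in type3.items():
    print(source, round(f_from_chisq(chisq, ndf), 4), f_reported)
```

The p-values differ slightly between the two columns because the F test also accounts for the denominator degrees of freedom (DDF) used to estimate the scale.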
22
Criteria For Assessing Goodness Of Fit

Criterion             DF    Value        Value/DF
Deviance              55    54.5182      0.9912
Scaled Deviance       55    55.0000      1.0000
Pearson Chi-Square    55    45.0882      0.8198
Scaled Pearson X2     55    45.4867      0.8270
Log Likelihood        .     -179.6600    .

Analysis Of Parameter Estimates

Parameter      DF    Estimate    Std Err    ChiSquare    Pr>Chi
INTERCEPT      1     2.5156      0.2950     72.7128      0.0001
DIST           1     -0.5212     0.0589     78.3656      0.0001
SCALE          0     0.9956      0.0000     .            .

NOTE: The scale parameter was estimated by the square root of DEVIANCE/DOF.

LR Statistics For Type 3 Analysis

Source    NDF    DDF    F           Pr>F      ChiSquare    Pr>Chi
DIST      1      55     107.8859    0.0001    107.8859     0.0001
23
[Figure: Predicted values and 95% confidence limits. y-axis: Fraction of eggs removed (0.0 to 1.0); x-axis: Distance (m) from nearest Herring gull nest (0 to 10).]
24
Criteria For Assessing Goodness Of Fit

Criterion             DF    Value        Value/DF
Deviance              52    53.9491      1.0375
Scaled Deviance       52    52.0000      1.0000
Pearson Chi-Square    52    44.1413      0.8489
Scaled Pearson X2     52    42.5465      0.8182
Log Likelihood        .     -171.3777    .

Log Likelihood: What is this?
25
The likelihood function
26
The binomial distribution
A nest contains n eggs of which r are eaten by predators.
The probability that a given egg is eaten is denoted π.
The probability that exactly r of the eggs are killed is

$$P(r) = \binom{n}{r} \pi^{r} (1 - \pi)^{n - r}$$

where

$$\pi = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p}}$$
27
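The binomial probability P(r) for a single nest can be evaluated directly. The Python sketch below does so for a hypothetical nest with n = 3 eggs and an arbitrarily chosen π = 0.6. Neither number comes from the data set, and the function name is an assumption.

```python
import math

def binomial_probability(r, n, pi):
    """P(r) = C(n, r) * pi**r * (1 - pi)**(n - r)."""
    return math.comb(n, r) * pi ** r * (1.0 - pi) ** (n - r)

n, pi = 3, 0.6   # hypothetical nest with 3 eggs; pi chosen only for illustration
probs = [binomial_probability(r, n, pi) for r in range(n + 1)]
print(probs)       # probabilities of 0, 1, 2 and 3 killed eggs
print(sum(probs))  # the probabilities sum to 1 (up to rounding)
```

In the regression setting π is not fixed but is computed for each nest from its explanatory variables through the logistic formula.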
r1 = number of killed eggs out of n1 eggs in the first nest
r2 = number of killed eggs out of n2 eggs in the second nest
ri = number of killed eggs out of ni eggs in the ith nest
The probability of observing exactly r1, r2, ..., ri events is

$$P(r_1) = \binom{n_1}{r_1} \pi_1^{r_1} (1 - \pi_1)^{n_1 - r_1}$$

times

$$P(r_2) = \binom{n_2}{r_2} \pi_2^{r_2} (1 - \pi_2)^{n_2 - r_2}$$

times

$$P(r_3) = \binom{n_3}{r_3} \pi_3^{r_3} (1 - \pi_3)^{n_3 - r_3}$$

times

$$P(r_i) = \binom{n_i}{r_i} \pi_i^{r_i} (1 - \pi_i)^{n_i - r_i}$$

Log-likelihood function

$$L = P(r_1)\, P(r_2)\, P(r_3) \cdots P(r_i) \cdots P(r_k) = \prod_{i=1}^{k} P(r_i)$$

$$\ln L = \ln P(r_1) + \ln P(r_2) + \ln P(r_3) + \dots + \ln P(r_i) + \dots + \ln P(r_k) = \sum_{i=1}^{k} \ln P(r_i)$$
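The log-likelihood can be computed by summing ln P(ri) over nests. The Python sketch below uses only the six nests that are listed explicitly in the data table (not the full set of 57), and compares the dist-only Example B estimates with a flat model in which π = 0.5 everywhere. The function name is hypothetical, and the result is not expected to match the Log Likelihood printed by SAS, which is based on all nests.

```python
import math

# (dist, eggs, killed) for the six nests listed in the data table.
nests = [(0.5, 3, 3), (1.0, 7, 5), (5.7, 5, 1),
         (3.8, 9, 6), (3.0, 7, 5), (6.1, 8, 3)]

def log_likelihood(b0, b1, data):
    """ln L = sum over nests of ln P(r_i), with pi_i from the logistic model."""
    total = 0.0
    for dist, n, r in data:
        pi = 1.0 / (1.0 + math.exp(-(b0 + b1 * dist)))
        total += (math.log(math.comb(n, r))
                  + r * math.log(pi)
                  + (n - r) * math.log(1.0 - pi))
    return total

ll_fitted = log_likelihood(2.5156, -0.5212, nests)  # dist-only Example B estimates
ll_flat = log_likelihood(0.0, 0.0, nests)           # pi = 0.5 for every nest
print(ll_fitted, ll_flat)
```

The fitted parameters give a higher (less negative) log-likelihood than the flat model, which is the sense in which they describe these nests better.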
28
Maximum likelihood
The parameters of

$$\pi_i = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p}}$$
are found as the values that maximize the likelihood of observing
exactly r1, r2, ....,ri.... positive events out of n1, n2, ....,ni.... events
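The maximum value of L can be found by differentiating L with
respect to β0, β1, ..., βp, and setting the derivatives equal to 0.
This gives the same solution as differentiating ln L, because ln L is largest where L is largest:

$$\frac{\partial \ln L}{\partial \beta_0} = 0, \quad \frac{\partial \ln L}{\partial \beta_1} = 0, \quad \frac{\partial \ln L}{\partial \beta_2} = 0, \quad \dots, \quad \frac{\partial \ln L}{\partial \beta_p} = 0$$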
29
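The equations obtained by setting the derivatives of ln L to zero have no closed-form solution, so they are solved iteratively. The Python sketch below uses Newton-Raphson for a model with an intercept and DIST only, applied to the six nests listed in the data table. This is an illustration of the principle, not a description of how PROC GENMOD is implemented, and because only six nests are used the estimates will differ from the SAS output. All names are hypothetical.

```python
import math

# (dist, eggs, killed) for the six nests listed in the data table.
nests = [(0.5, 3, 3), (1.0, 7, 5), (5.7, 5, 1),
         (3.8, 9, 6), (3.0, 7, 5), (6.1, 8, 3)]

def score_and_information(b0, b1, data):
    """Derivatives of ln L and the 2x2 information matrix (minus the Hessian)."""
    g0 = g1 = i00 = i01 = i11 = 0.0
    for dist, n, r in data:
        pi = 1.0 / (1.0 + math.exp(-(b0 + b1 * dist)))
        resid = r - n * pi            # observed minus expected killed eggs
        w = n * pi * (1.0 - pi)       # binomial variance of r
        g0 += resid                   # d ln L / d b0
        g1 += resid * dist            # d ln L / d b1
        i00 += w
        i01 += w * dist
        i11 += w * dist * dist
    return (g0, g1), (i00, i01, i11)

def fit_newton(data, steps=25, tol=1e-10):
    """Iterate until both derivatives of ln L are (numerically) zero."""
    b0 = b1 = 0.0
    for _ in range(steps):
        (g0, g1), (i00, i01, i11) = score_and_information(b0, b1, data)
        if abs(g0) < tol and abs(g1) < tol:
            break
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2x2 system  information * delta = gradient.
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (i00 * g1 - i01 * g0) / det
    return b0, b1

b0_hat, b1_hat = fit_newton(nests)
print(b0_hat, b1_hat)
```

At the returned values both derivatives of ln L are zero to within rounding, which is exactly the condition stated on the slide.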