Quantifying Opinion about a
Logistic Regression using
Interactive Graphics
Paul Garthwaite
The Open University
Joint work with Shafeeqah Al-Awadhi
Introduction/Plan
• This work arose from a practical problem in
logistic regression.
• The theory extends easily to elicit opinion about
the link function of any glm.
• I will outline the method for glms in general.
• The motivating problem has some additional
(commonly occurring) structure that the elicitation
method exploits.
• Interactive computing is used to elicit opinion.
• Prior models can be formed that aim to allow a
small amount of data to correct some potential
systematic biases in assessments.
• Results for the practical problem will be given.
Motivating Example
The task is to model the habitat distribution of fauna
in south-east Queensland - bats, birds, mammals etc.
Available information:
• Environmental attributes on a GIS database.
• Sample information of presence/absence at 300-400 sites.
• Background knowledge of ecologists.
The ecologists have seen the bat (say) in various
locations but this information is difficult to use
in a traditional statistical analysis because it has
not been obtained from any sampling scheme.
Prob(presence) = f (environmental attributes)
Continuous variables: elevation; quarterly rainfall and
temperatures; canopy cover; slope; aspect.
Factors: land type; vegetation; forest structure;
logging; grazing; etc.
A workshop with 15 ecologists indicated
• unimodal or monotonic relationships
• independence between attributes in their effect on
the probability of presence.
[Figure: prob(presence) plotted against an attribute.]
Generalised Linear Model (glm)
The model has the form $Y = g[\mu(r)]$, where $g[\cdot]$ is
the link function.
For logistic regression, $g[\mu] = \ln(\mu/(1-\mu))$ and
$\mu$ is the probability of presence.
$r$ is the vector of predictor variables.
From the $i$th predictor variable, $r_i$, a vector of
explanatory variables
$X_i = (X_{i,1}, \dots, X_{i,\delta(i)})'$ is constructed such that we have the linear
equation
$$Y = \alpha + \beta_1' X_1 + \dots + \beta_{m+n}' X_{m+n}.$$
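A minimal sketch of the logistic link and its inverse, for readers who want to move between the probability scale and the Y scale. The function names `logit` and `inv_logit` are illustrative and not taken from the slides.

```python
import math

def logit(mu):
    """Logistic link g[mu] = ln(mu / (1 - mu)), defined for 0 < mu < 1."""
    return math.log(mu / (1.0 - mu))

def inv_logit(y):
    """Inverse link: maps a value of Y back to a probability of presence."""
    return 1.0 / (1.0 + math.exp(-y))

# A probability of presence of 0.2 corresponds to a negative value of Y,
# and applying the inverse link recovers the probability.
y = logit(0.2)
mu = inv_logit(y)
```

Working on the Y scale is what lets the expert's probability assessments be turned into parameters of a normal prior.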
Define:
$$X_{i,j} = \begin{cases}
0 & \text{if } R_i \le r_{i,j-1} \\
R_i - r_{i,j-1} & \text{if } r_{i,j-1} < R_i \le r_{i,j} \\
r_{i,j} - r_{i,j-1} & \text{if } r_{i,j} < R_i
\end{cases}$$
and then $Y$ is a linear function of
$X_i = (X_{i,1}, \dots, X_{i,\delta(i)})'$.
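The definition above can be sketched in a few lines. This is an illustration only: `piecewise_basis` is a hypothetical helper name, and the knots and slopes in the example are made-up values, not assessments from the study.

```python
def piecewise_basis(R, knots):
    """Return [X_1, ..., X_delta] for a predictor value R.

    knots is the increasing list r_0 < r_1 < ... < r_delta. Each X_j is the
    length of the part of the interval (r_{j-1}, r_j] that lies below R.
    """
    return [min(max(R - lo, 0.0), hi - lo)
            for lo, hi in zip(knots[:-1], knots[1:])]

# Hypothetical knots and slopes for one continuous attribute.
knots = [0.0, 4.0, 8.0, 12.0]
slopes = [0.5, -0.25, -0.5]          # one coefficient per line segment
X = piecewise_basis(6.0, knots)      # [4.0, 2.0, 0.0]

# Contribution of this attribute to Y: a piecewise-linear function of R.
contribution = sum(b * x for b, x in zip(slopes, X))
```

Because each coefficient is the slope of one line segment, a unimodal or monotonic relationship drawn by the expert maps directly onto the coefficients.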
Factors: One factor level (the best one, say) is chosen
as the reference level. Each other level is given a
dummy 0/1 variable $X_{i,j}$ that equals 1 for that level
and 0 for all other levels:
$$X_{i,j} = \begin{cases}
1 & \text{if } R_i = r_{i,j} \\
0 & \text{otherwise}
\end{cases}$$
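A small sketch of the dummy coding for a factor. The helper name `factor_dummies` and the grazing levels in the example are hypothetical and serve only to illustrate the definition.

```python
def factor_dummies(level, levels, reference):
    """Return the 0/1 dummies for every non-reference level, in order."""
    if level not in levels or reference not in levels:
        raise ValueError("unknown factor level")
    return [1 if level == lv else 0 for lv in levels if lv != reference]

# Hypothetical grazing factor with "none" taken as the reference level.
levels = ["none", "light", "heavy"]
print(factor_dummies("heavy", levels, reference="none"))  # [0, 1]
print(factor_dummies("none", levels, reference="none"))   # [0, 0]
```

At the reference level every dummy is zero, so the factor contributes nothing to Y there.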
The sampling model is
$$Y = \alpha + \beta_1' X_1 + \dots + \beta_{m+n}' X_{m+n}.$$
Let $\beta = (\beta_1', \dots, \beta_{m+n}')'$.
For the prior distribution we put
$$\begin{pmatrix} \alpha \\ \beta \end{pmatrix} \sim \mathrm{MVN}\left( \begin{pmatrix} b_0 \\ b \end{pmatrix}, \begin{pmatrix} \sigma_{00} & \sigma_1' \\ \sigma_1 & \Sigma \end{pmatrix} \right).$$
The values of the parameters $b_0$, $b$, $\sigma_{00}$, $\sigma_1$ and $\Sigma$
(shown in red on the slide) must be chosen
by the expert to represent his or her opinions.
Assessing medians and quartiles.
These are fundamental assessment tasks the expert
performs.
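How far is it from Aberdeen to Southampton?
[Figure: a distance scale split into four intervals of 25% probability each by the assessed lower quartile (470 miles), median (525 miles) and upper quartile (600 miles).]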
The median (blue) is assessed first and then the
lower and upper quartiles (red).
Ecologists were given practice at performing these
tasks in preparatory training and explanation.
Eliciting $b_0$ and $\sigma_{00}$
$b_0 = E(\alpha)$ and $\sigma_{00} = \mathrm{Var}(\alpha)$. Also,
$Y = \alpha$ at the reference point. The expert
assesses $m_{0.50}$, the median of $\mu$ at this point.
(For logistic regression $m_{0.50}$ is the probability
of presence.)
We put $b_0 = g[m_{0.50}]$.
The expert also assesses the lower and upper
quartiles $m_{0.25}$ and $m_{0.75}$.
We put
$$\sigma_{00} = \left\{ \frac{g(m_{0.75}) - g(m_{0.25})}{1.349} \right\}^2,$$
where 1.349 is the interquartile range of a standard normal distribution.
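A minimal sketch of this step, assuming the logistic link and a normal prior for the intercept on the Y scale. The function name `elicit_intercept_prior` and the three assessed values in the example are hypothetical. The divisor is computed from the standard normal quantiles (about 1.349) rather than typed in.

```python
from math import log
from statistics import NormalDist

def logit(mu):
    """Logistic link g[mu] = ln(mu / (1 - mu))."""
    return log(mu / (1.0 - mu))

def elicit_intercept_prior(m25, m50, m75, link=logit):
    """Turn an assessed median and quartiles of mu at the reference point
    into the prior mean b0 and prior variance sigma00 of the intercept."""
    std = NormalDist()
    iqr_std_normal = std.inv_cdf(0.75) - std.inv_cdf(0.25)  # about 1.349
    b0 = link(m50)
    sigma00 = ((link(m75) - link(m25)) / iqr_std_normal) ** 2
    return b0, sigma00

# Hypothetical assessments of the probability of presence at the reference point.
b0, sigma00 = elicit_intercept_prior(m25=0.10, m50=0.20, m75=0.35)
```

Only the two quartiles enter the variance, so an expert whose quartiles are too narrow will give a prior that is too tight, which is one of the biases the alternative priors try to absorb.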
Eliciting $b$ and $\sigma_1$
• $b$ is determined from the unconditional
assessments.
• $\sigma_1$ is determined from assessments conditional on
$\mu$ equalling $m_{0.75}$.
Eliciting $b$ and $\sigma_1$ for factors.
Put $y_{0.75} = g[m_{0.75}]$. Then
$$E[\beta \mid \alpha = y_{0.75}] = b + \sigma_1 \sigma_{00}^{-1} (y_{0.75} - b_0),$$
enabling $\sigma_1$ to be estimated.
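Rearranging the conditional-mean formula gives $\sigma_1 = (E[\beta \mid \alpha = y_{0.75}] - b)\,\sigma_{00} / (y_{0.75} - b_0)$. A minimal sketch of that rearrangement follows; the function name and all numbers are hypothetical, and it assumes the conditional means have already been derived from the expert's conditional assessments.

```python
def sigma1_from_conditional_means(cond_means, b, b0, sigma00, y75):
    """Solve E[beta | alpha = y75] = b + sigma1 * (y75 - b0) / sigma00
    for the covariance vector sigma1, element by element."""
    scale = sigma00 / (y75 - b0)
    return [(c - bj) * scale for c, bj in zip(cond_means, b)]

# Hypothetical values: unconditional means b and means conditional on
# the intercept equalling its upper quartile y75.
b = [0.5, -0.2]
cond_means = [0.725, -0.275]
sigma1 = sigma1_from_conditional_means(cond_means, b, b0=-1.0,
                                       sigma00=0.8, y75=-0.4)
```

If the conditional assessments equal the unconditional ones, every element of the result is zero, which corresponds to the intercept and the coefficients being independent a priori.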
[Go to program]
Assessments to obtain $\Sigma$
Conditional on the first three line segments being
correct, the dashed lines are quartiles of where the
line might continue.
Conditional Assessments for Factors
• The circles indicate conditions.
• Dotted horizontal bars are previous assessments.
• Solid bars are current assessments and must be
within the dotted bars if $\Sigma$ is to be positive-definite.
[Go to program]
Calculating $\Sigma$
Iterative calculations determine $\Sigma$.
Start by estimating the lower-right scalar
element of $\Sigma$, and call it $A_p$. Then estimate the
lower-right $2 \times 2$ submatrix of $\Sigma$ and call it $A_{p-1}$, etc.
If
$$A_i = \begin{pmatrix} a_{ii} & a_i' \\ a_i & A_{i+1} \end{pmatrix}$$
and $A_{i+1}$ is positive-definite, then so is $A_i$
provided
$$a_{ii} > a_i' A_{i+1}^{-1} a_i.$$
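The condition above is a Schur-complement check, and it can be sketched without any matrix library. The helper names and the small example matrix are hypothetical; the linear solve uses plain Gaussian elimination, which is adequate for an illustration but is not claimed to be what the elicitation program does.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def variance_lower_bound(a_i, A_next):
    """Return a_i' A_{i+1}^{-1} a_i, the value that a_ii must exceed."""
    x = solve(A_next, a_i)
    return sum(u * v for u, v in zip(a_i, x))

def extension_is_positive_definite(a_ii, a_i, A_next):
    """True if bordering the positive-definite A_next with (a_ii, a_i)
    gives a positive-definite A_i."""
    return a_ii > variance_lower_bound(a_i, A_next)

# Hypothetical 2 x 2 block already accepted, and a candidate new row.
A_next = [[2.0, 0.5], [0.5, 1.0]]
a_i = [0.6, 0.3]
bound = variance_lower_bound(a_i, A_next)   # about 0.206
```

A bound of this kind is what allows a program to show the expert a range that the next variance assessment must respect.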
Alternative Prior Models
Individuals can show systematic bias in their
subjective assessments. The aim is to form prior
models that allow a small amount of data to
largely correct some potential biases.
Prior 2
The marginal distribution of $\alpha$ is diffuse, rather
than $N(b_0, \sigma_{00})$.
The conditional distribution
of $\beta \mid \alpha$ is assumed to be unchanged:
$$\beta \mid \alpha \sim \mathrm{MVN}(b, \Sigma).$$
This allows for error in specifying the origin of the
Y-axis.
Prior 3
Prior 3 replaces the scale for Y with some other
linear scale. $\alpha$ is again given a diffuse
distribution and the conditional distribution of $\beta \mid \alpha$
is taken to be
$$\beta \mid \alpha \sim \mathrm{MVN}(\lambda b, \lambda^2 \Sigma).$$
$\lambda$ is also given a diffuse distribution.
Prior 4
This is the same as Prior 3, except it allows for
systematic bias in quartile assessments by putting
$$\beta \mid \alpha \sim \mathrm{MVN}(\lambda b, \omega \Sigma).$$
$\alpha$, $\lambda$ and $\omega$ are given diffuse distributions.
Cross-validation and scoring
• The usefulness of a prior distribution can be objectively
examined by using cross-validation and a scoring rule.
• For the cross-validation the data for a species were
divided into four sets. Each set in turn was omitted and
the remaining sets used to form prediction equations.
• Prediction equations were applied to the omitted set
and squared error loss determined:
$$\text{Squared error loss} = \sum_k (\mu_k - w_k)^2$$
where the summation is over all sites in the omitted
(validation) set, $\mu_k$ is the probability of presence
given by the prediction equation, and $w_k$ is a 0/1
dummy variable indicating absence/presence.
• This defines a proper scoring rule.
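A minimal sketch of the scoring procedure described above. It assumes each site is a `(features, w)` pair; `fit` and `predict` are hypothetical placeholders for whatever produces the prediction equation, and the assignment of sites to the four sets (every fourth site) is also an assumption, since the slides do not say how the sets were formed.

```python
def squared_error_loss(pred_probs, outcomes):
    """Sum over validation sites of (predicted probability - 0/1 outcome)^2."""
    return sum((mu - w) ** 2 for mu, w in zip(pred_probs, outcomes))

def cross_validated_loss(sites, fit, predict, n_sets=4):
    """Omit each set in turn, fit on the rest, and score the omitted set.

    Returns the per-set losses and their total.
    """
    sets = [sites[i::n_sets] for i in range(n_sets)]
    losses = []
    for i, held_out in enumerate(sets):
        training = [s for j, group in enumerate(sets) if j != i for s in group]
        model = fit(training)
        probs = [predict(model, x) for x, _ in held_out]
        losses.append(squared_error_loss(probs, [w for _, w in held_out]))
    return losses, sum(losses)

# Toy illustration: the "model" is just the observed prevalence in the
# training sites, ignoring the environmental attributes.
def fit_prevalence(training):
    return sum(w for _, w in training) / len(training)

def predict_prevalence(model, x):
    return model
```

The totals reported for each method are sums of four such per-set losses, so lower values indicate better predictions on sites the method did not see.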
Results for little bent-wing bat
Method        Set 1    Set 2    Set 3    Set 4    Total
__________________________________________________________
Prior 1        9.57     8.93     8.94     9.30    36.74
Prior 2        9.62     9.03     8.98     9.24    36.87
Prior 3        9.52     8.86     8.92     8.81    36.11
Prior 4        9.73     8.87     8.90     8.62    36.13
Frequentist   11.03     9.72     9.55    10.78    41.07
No data       10.83     9.81     9.92    10.56    41.12
__________________________________________________________
Sample
results       11/94    10/94    10/93    11/94    42 in 375
[Figure: four panels (Prior 1, Prior 2, Prior 3, Prior 4), each plotting "posterior value using Prior k" against "prior value", with both axes running from -5.0 to 2.0.]
              Little      Common
              bent-wing   bent-wing   Frog-       Powerful    Greater
Method        bat         bat         mouth       owl         glider
______________________________________________________________________
Prior 1       36.74       12.75       28.76       13.61       43.90
Prior 2       36.87       12.73       28.91       13.60       43.94
Prior 3       36.11       12.41       25.99       13.17       42.35
Prior 4       36.13       12.75       28.61       13.61       43.90
Frequentist   41.07       13.70       30.91       14.38       44.15
No data       41.12       13.66       29.54       15.07       48.81
______________________________________________________________________
Sample
results       42 in 375   13 in 375   31 in 324   14 in 324   53 in 343
Concluding Comments
• The elicitation method described here is able to
handle large problems by:
(a) using interactive graphics
(b) suggesting values to the expert that might
represent his or her opinions.
• It is believed that the use of graphs can improve
the quality of the assessed distributions.
• Cross-validation can demonstrate clearly the gain
from using prior knowledge, when there is such
gain.
• Additional parameters in the prior model can
allow limited data to be used more effectively.