SMART Experimental Designs for Developing Adaptive


Sequential, Multiple Assignment, Randomized Trials and Treatment Policies
S.A. Murphy
UAlberta, 09/28/12
Outline
• Treatment Policies
• Data Sources
• Q-Learning
• Confidence Intervals
Treatment Policies are individually tailored treatments, with treatment type and dosage changing according to patient outcomes. They operationalize sequential decision making in clinical practice.
k stages for each individual:
• Xj: observations available at the jth stage
• Aj: action at the jth stage (usually a treatment)
Example of a Treatment Policy
• Adaptive Drug Court Program for drug-abusing offenders.
• Goal is to minimize recidivism and drug use.
• Marlowe et al. (2008, 2009, 2011)
Adaptive Drug Court Program
[Flow diagram: Offenders are classified as low risk or high risk, and treatment is adapted if they become non-responsive or non-compliant. Low risk: as-needed court hearings + standard counseling, moving to as-needed court hearings + ICM for non-response. High risk: bi-weekly court hearings + standard counseling, moving to bi-weekly court hearings + ICM for non-response and to a court-determined disposition for non-compliance.]
k = 2 Stages
The treatment policy is a sequence of two decision rules: d1(X1) and d2(X1, A1, X2).
Goal: Use a data set of n trajectories, each of the form (X1, A1, X2, A2, Y) (one trajectory per subject), to construct a treatment policy. The treatment policy should produce maximal expected reward Y.
Why should a Machine Learning Researcher be interested in Treatment Policies?
• The dimensionality of the data available for constructing decision rules grows exponentially with the stage.
• Need both feature construction and feature selection.
Outline
• Treatment Policies
• Data Sources
• Q-Learning
• Confidence Intervals
Experimental Data
Data from sequential, multiple assignment, randomized trials: n subjects, each yielding a trajectory. For 2 stages, the trajectory for each subject is of the form (X1, A1, X2, A2, Y).
(Exploration, no exploitation.)
Aj is a randomized treatment action with known randomization probability. Here the actions are binary, with P[Aj = 1] = P[Aj = -1] = 0.5.
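To make this data structure concrete, here is a minimal Python sketch simulating such two-stage trajectories with actions randomized at probability 0.5; the outcome model, effect sizes, and variable names are illustrative assumptions, not part of the talk.

import numpy as np

# Simulate n two-stage SMART trajectories (X1, A1, X2, A2, Y) with
# binary actions coded +1/-1 and randomization probability 0.5.
rng = np.random.default_rng(0)
n = 200
X1 = rng.normal(size=n)                        # baseline observations
A1 = rng.choice([-1, 1], size=n)               # P[A1 = 1] = P[A1 = -1] = 0.5
X2 = 0.5 * X1 + 0.3 * A1 + rng.normal(size=n)  # intermediate observations
A2 = rng.choice([-1, 1], size=n)               # P[A2 = 1] = P[A2 = -1] = 0.5
Y = X2 + A2 * (0.4 + 0.6 * X2) + rng.normal(size=n)  # terminal reward (hypothetical model)
data = np.column_stack([X1, A1, X2, A2, Y])    # one row per subject trajectory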
Pelham's ADHD Study
[SMART diagram: Children are randomized to (A) begin low-intensity behavior modification or (B) begin low-dose medication. After 8 weeks, adequacy of response is assessed. Responders continue the initial treatment and are reassessed monthly (re-randomized if they deteriorate). Non-responders are re-randomized to either augmenting with the other treatment or increasing the intensity of the present treatment.]
Oslin's ExTENd Study
[SMART diagram: Participants begin naltrexone and are randomized to an early or late trigger for nonresponse. After 8 weeks, responders are re-randomized to naltrexone or TDM + naltrexone; nonresponders are re-randomized to CBI or CBI + naltrexone.]
Jones' Study for Drug-Addicted Pregnant Women
[SMART diagram: Women are randomized to rRBT or tRBT. Response is assessed after 2 weeks, and responders and nonresponders are re-randomized among RBT variants (tRBT, rRBT, eRBT, aRBT).]
Kasari Autism Study
[SMART diagram: Children are randomized to (A) JAE + EMT or (B) JAE + AAC. Adequacy of response is assessed after 12 weeks. In arm A, responders continue JAE + EMT; non-responders are re-randomized to intensified JAE + EMT (JAE + EMT+++) or JAE + AAC. In arm B, responders continue JAE + AAC (B1); non-responders receive intensified JAE + AAC (B2, JAE + AAC++).]
Newer Experimental Designs
• Using smartphones to collect data, the Xi's, in real time and to provide treatments, the Ai's, in real time to n subjects. The treatments, the Ai's, are randomized among a feasible set of treatment options.
– The number of treatment stages is very large, so we want a Markovian property
– Feature construction of states in the Markov process
Observational data
• Longitudinal Studies
• Patient Registries
• Electronic Medical Record Data
Outline
• Treatment Policies
• Data Sources
• Q-Learning/ Fitted Q-Iteration
• Confidence Intervals
Secondary Data Analysis: Q-Learning
• Q-Learning, Fitted Q-Iteration, Approximate Dynamic Programming (Watkins, 1989; Ernst et al., 2005; Murphy, 2003; Robins, 2004)
• This results in a proposal for an optimal treatment policy.
• A subsequent randomized trial would evaluate the proposed treatment policy.
2 Stages—Terminal Reward Y
Goal: Use data to construct decision rules $d_1(X_1)$, $d_2(X_1, A_1, X_2)$ for which the average value, $E_{d_1, d_2}[Y]$, is maximal.
The maximal average value is
$$V^{\text{opt}} = \max_{d_1, d_2} E_{d_1, d_2}[Y]$$
Idea behind Q-Learning/Fitted Q
$$V^{\text{opt}} = E\Big[\max_{a_1} E\big[\max_{a_2} E[Y \mid X_1, A_1, X_2, A_2 = a_2] \,\big|\, X_1, A_1 = a_1\big]\Big]$$
• Stage 2 Q-function: $Q_2(X_1, A_1, X_2, A_2) = E[Y \mid X_1, A_1, X_2, A_2]$, so
$$V^{\text{opt}} = E\Big[\max_{a_1} E\big[\max_{a_2} Q_2(X_1, A_1, X_2, a_2) \,\big|\, X_1, A_1 = a_1\big]\Big]$$
• Stage 1 Q-function: $Q_1(X_1, A_1) = E\big[\max_{a_2} Q_2(X_1, A_1, X_2, a_2) \,\big|\, X_1, A_1\big]$, so
$$V^{\text{opt}} = E\big[\max_{a_1} Q_1(X_1, a_1)\big]$$
Optimal Treatment Policy
The optimal treatment policy is $(d_1, d_2)$, where
$$d_2(X_1, A_1, X_2) = \arg\max_{a_2} Q_2(X_1, A_1, X_2, a_2)$$
$$d_1(X_1) = \arg\max_{a_1} Q_1(X_1, a_1)$$
Simple Version of Fitted Q-iteration
Use regression at each stage to approximate the Q-function.
• Stage 2 regression: Regress Y on $(S_2^0,\, S_2 a_2)$ to obtain
$$\hat{Q}_2 = \hat{\alpha}_2^T S_2^0 + \hat{\beta}_2^T S_2 a_2$$
(here $S_2^0$ and $S_2$ are summary feature vectors constructed from $(X_1, A_1, X_2)$).
• Arg-max over $a_2$ yields
$$\hat{d}_2(X_1, A_1, X_2) = \arg\max_{a_2} \hat{Q}_2 = \text{sign}(\hat{\beta}_2^T S_2)$$
Value for subjects entering stage 2:
• $\hat{V}_2 = \hat{\alpha}_2^T S_2^0 + |\hat{\beta}_2^T S_2|$ is a predictor of $\max_{a_2} Q_2(X_1, A_1, X_2, a_2)$
• $\hat{V}_2$ is the dependent variable in the stage 1 regression for patients who moved to stage 2
Simple Version of Fitted Q-iteration
• Stage 1 regression: Regress $\hat{V}_2$ on $(S_1^0,\, S_1 a_1)$ to obtain
$$\hat{Q}_1 = \hat{\alpha}_1^T S_1^0 + \hat{\beta}_1^T S_1 a_1$$
• Arg-max over $a_1$ yields
$$\hat{d}_1(X_1) = \arg\max_{a_1} \hat{Q}_1 = \text{sign}(\hat{\beta}_1^T S_1)$$
Decision Rules:
$$\hat{d}_2(X_1, A_1, X_2) = \text{sign}(\hat{\beta}_2^T S_2), \qquad \hat{d}_1(X_1) = \text{sign}(\hat{\beta}_1^T S_1)$$
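The following Python sketch carries out the two regressions above on simulated data; the choice of feature vectors (S2_0, S2, S1_0, S1 as simple functions of X1, A1, X2) and the data-generating model are illustrative assumptions, not the analysis from the talk.

import numpy as np

def ols(Z, y):
    # Least-squares coefficients for the regression of y on the columns of Z.
    return np.linalg.lstsq(Z, y, rcond=None)[0]

# Toy two-stage SMART data, generated as in the earlier simulation sketch.
rng = np.random.default_rng(0)
n = 200
X1 = rng.normal(size=n)
A1 = rng.choice([-1, 1], size=n)
X2 = 0.5 * X1 + 0.3 * A1 + rng.normal(size=n)
A2 = rng.choice([-1, 1], size=n)
Y = X2 + A2 * (0.4 + 0.6 * X2) + rng.normal(size=n)

# Stage 2 regression: regress Y on (S2_0, S2 * a2), with S2_0 = (1, X1, A1, X2) and S2 = (1, X2).
S2_0 = np.column_stack([np.ones(n), X1, A1, X2])
S2 = np.column_stack([np.ones(n), X2])
theta2 = ols(np.column_stack([S2_0, A2[:, None] * S2]), Y)
alpha2, beta2 = theta2[:4], theta2[4:]

# Plug-in stage 2 value Vhat2 = alpha2'S2_0 + |beta2'S2|; arg-max over a2 is sign(beta2'S2).
V2 = S2_0 @ alpha2 + np.abs(S2 @ beta2)
d2 = np.sign(S2 @ beta2)

# Stage 1 regression: regress Vhat2 on (S1_0, S1 * a1), with S1_0 = S1 = (1, X1).
S1_0 = np.column_stack([np.ones(n), X1])
S1 = S1_0
theta1 = ols(np.column_stack([S1_0, A1[:, None] * S1]), V2)
alpha1, beta1 = theta1[:2], theta1[2:]
d1 = np.sign(S1 @ beta1)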
Pelham's ADHD Study
[SMART diagram repeated from above: children are randomized to low-intensity behavior modification or low-dose medication; after 8 weeks, responders continue with monthly reassessment, and non-responders are re-randomized to augment with the other treatment or intensify the present treatment.]
ADHD
138 trajectories of the form (X1, A1, R1, X2, A2, Y)
• X1 includes baseline school performance (Y0), whether medicated in the prior year (S1), and ODD (O1)
– S1 = 1 if medicated in prior year; = 0 otherwise
• R1 = 1 if responder; = 0 if non-responder
• X2 includes the month of non-response (M2) and a measure of adherence in stage 1 (S2)
– S2 = 1 if adherent in stage 1; = 0 if non-adherent
• Y = end-of-year school performance
Q-Learning using data on children with ADHD
• Stage 2 regression model for Y:
$$(1, Y_0, S_1, O_1, A_1, M_2, S_2)\,\alpha_2 + A_2(\beta_{21} + A_1\beta_{22} + S_2\beta_{23})$$
• Estimated decision rule is: if the child is non-responding, then intensify the initial treatment if $-0.72 + 0.05 A_1 + 0.97 S_2 > 0$; otherwise augment.
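A small illustrative check of this rule (the +1/-1 coding of A1 is an assumption; it does not affect the conclusion because the A1 coefficient is small):

# Evaluate -0.72 + 0.05*A1 + 0.97*S2 at the four adherence-by-initial-treatment cells;
# S2 = 1 if adherent in stage 1, 0 otherwise.
for a1 in (1, -1):
    for s2 in (1, 0):
        effect = -0.72 + 0.05 * a1 + 0.97 * s2
        print(a1, s2, round(effect, 2), "intensify" if effect > 0 else "augment")
# Adherent (S2 = 1): 0.30 and 0.20 > 0 -> intensify; non-adherent (S2 = 0): -0.67 and -0.77 < 0 -> augment,
# matching the table that follows.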
Q-Learning using data on children with ADHD
• Decision rule is: if the child is non-responding, then intensify the initial treatment if $-0.72 + 0.05 A_1 + 0.97 S_2 > 0$; otherwise augment.

Decision Rule for Non-responding Children:
                 Initial Treatment = BMOD    Initial Treatment = MED
  Adherent       Intensify                   Intensify
  Not Adherent   Augment                     Augment
ADHD Example
• Stage 1 regression model:
$$(1, Y_0, S_1, O_1)\,\alpha_1 + A_1(\beta_{11} + S_1\beta_{12})$$
• Decision rule is: begin with BMOD if $0.17 - 0.32 S_1 > 0$; otherwise begin with MED.
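As a quick check of the table that follows: with S1 = 1 (medication in the prior year), 0.17 - 0.32(1) = -0.15 < 0, so the rule selects MED; with S1 = 0 (no prior medication), 0.17 - 0.32(0) = 0.17 > 0, so the rule selects BMOD.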
Q-Learning using data on children with ADHD
• Decision rule is: begin with BMOD if $0.17 - 0.32 S_1 > 0$; otherwise begin with MED.

Initial Decision Rule:
                  Initial Treatment
  Prior MEDS      MED
  No Prior MEDS   BMOD
ADHD Example
• The treatment policy is quite decisive. We developed this treatment policy using a trial on only 138 children. Is there sufficient evidence in the data to warrant this level of decisiveness?
• Would a similar trial obtain similar results?
• There are strong opinions regarding how to treat ADHD.
• One solution: use confidence intervals.
Outline
• Treatment Policies
• Data Sources
• Q-Learning
• Confidence Intervals
ADHD Example
Treatment decision for non-responders. Positive treatment effect → Intensify.

                           90% Confidence Interval
  Adherent to BMOD         (-0.08, 0.69)
  Adherent to MED          (-0.18, 0.62)
  Non-adherent to BMOD     (-1.10, -0.28)
  Non-adherent to MED      (-1.25, -0.29)
ADHD Example
Initial treatment decision. Positive treatment effect → BMOD.

                  90% Confidence Interval
  Prior MEDS      (-0.48, 0.16)
  No Prior MEDS   (-0.05, 0.39)
Proposal for Treatment Policy
IF medication was not used in the prior year
THEN begin with BMOD;
ELSE select either BMOD or MED.
IF the child is nonresponsive THEN
IF child was non-adherent, THEN augment
present treatment;
ELSE IF child was adherent, THEN select
either intensification or augmentation of
current treatment.
Confidence Intervals
Constructing confidence intervals concerning the treatment effects at stage 2 and stage 1:
The stage 2 analysis is a classical regression (at least if $S_2^0$ and $S_2$ are low dimensional); constructing confidence intervals is standard.
Constructing confidence intervals for the treatment effects at stage 1 is challenging.
Confidence Intervals for Stage 1 Treatment Effects
Challenge: the stage 2 estimated value, $\hat{V}_2$, is non-smooth in the estimators from the stage 2 regression, due to the non-differentiability of the maximization:
$$\hat{V}_2 = \hat{\alpha}_2^T S_2^0 + \max_{a_2} \hat{\beta}_2^T S_2 a_2 = \hat{\alpha}_2^T S_2^0 + |\hat{\beta}_2^T S_2|$$
Non-regularity
• The estimated policy can change abruptly from training set to training set. Standard approximations used to construct confidence intervals perform poorly (Shao, 1994; Andrews, 2000).
• The problematic area in parameter space is around $\beta_2$ for which $P[\beta_2^T S_2 \approx 0] > 0$.
• We combined a local generalization-type error bound with a standard statistical confidence interval to produce a valid confidence interval.
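A minimal simulation sketch of this phenomenon (a toy one-parameter problem, not the talk's setting): when the true stage 2 treatment effect is exactly zero, the absolute-value (max) step makes the plug-in quantity biased upward and its distribution non-normal, which is why standard interval approximations misbehave near the kink.

import numpy as np

# Sampling distribution of |betahat| when the true effect is 0.
rng = np.random.default_rng(0)
n, reps = 100, 10000
betahat = rng.normal(loc=0.0, scale=1.0 / np.sqrt(n), size=reps)  # approx. distribution of the stage 2 estimate
plugin = np.abs(betahat)                                           # the |beta2'S2| step appearing in Vhat2
print("true value:", 0.0)
print("mean of plug-in estimate:", plugin.mean())                  # clearly positive: upward bias at the kink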
Why is this non-smoothness, and the resulting inferential problems, relevant to high-dimensional machine learning research?
• Sparsity assumptions in high-dimensional data analysis → thresholding → non-smoothness at important parameter values
Where are we going?
• Increasing use of wearable computers (e.g., smartphones) to both collect real-time data and provide real-time treatment.
• We are working on clinical trial designs involving randomization (soft-max or epsilon-greedy choice of actions) so as to develop and continually improve treatment policies.
• Need confidence measures for infinite-horizon problems.
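For illustration, a minimal sketch of the two action-randomization schemes mentioned above (the estimated action values and function names are hypothetical, not from the talk):

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q, eps=0.1):
    # With probability eps explore uniformly; otherwise take the currently best action.
    if rng.random() < eps:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def softmax_action(q, temperature=1.0):
    # Sample an action with probability proportional to exp(q / temperature).
    p = np.exp((np.asarray(q) - np.max(q)) / temperature)
    p = p / p.sum()
    return int(rng.choice(len(q), p=p))

q = [0.2, 0.5, 0.1]  # hypothetical estimated values for a feasible set of actions
print(epsilon_greedy(q), softmax_action(q))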
This seminar can be found at:
http://www.stat.lsa.umich.edu/~samurphy/seminars/UAlberta.09.28.12.pdf
This seminar is based on work with many collaborators, some of whom are: L. Collins, E. Laber, M. Qian, D. Almirall, K. Lynch, J. McKay, D. Oslin, T. Ten Have, I. Nahum-Shani & B. Pelham. Email with questions or if you would like a copy: [email protected]