DMC-Jan08 - Carnegie Mellon School of Computer Science


Betty Cheng, Jaime Carbonell
Language Technologies Institute, School of Computer Science
Carnegie Mellon University





HIV & Drug Resistance
Phenotype Prediction Models
Machine Learning
Language of Proteins
Document Classification of HIV Genotypes



Comparison to state-of-the-art & human experts
Other Area of Application: GPCR
Conclusions




Drug resistance is an obstacle in the treatment
and control of many infectious diseases
33.2 million living with AIDS in 2007
2.1 million died from AIDS in 2007
High mutation rate of HIV leads to quasispecies of virus strains inside each patient
4%–25% diversity


Currently ~25 drugs in 4 main drug classes
Treatments with 3+ drugs (HAART) are used to
cover as many virus strains in the quasispecies
as possible → personalized medicine
Trial-and-error is not an option due to cross-resistance


Goal: optimize treatment so that the virus
population takes as long as possible to develop resistance
Current: phenotype predicted from genotype
test results to identify resistance present now

Problem: Predict resistance (high/low/none) to
each drug given patient’s HIV genotype

Example: Rega and ANRS systems


If at least Z of <list of mutations> are present, then
predict resistance level Y to drug X.
Example: HIVdb

Sum the penalty scores from each mutation.

Advantage: easy to understand reason for
prediction
Disadvantage: impossible to maintain as more data
and drugs become available
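As a sketch in plain Python, the two rule styles above look like this; the mutation names, threshold, and penalty values are invented for illustration and are not actual Rega/ANRS or HIVdb rules:

```python
# Toy sketch of the two rule-based styles described above.
# Mutation names, thresholds, and penalty scores are invented.

def rega_style(mutations, rule_mutations, z, level):
    """If at least z of rule_mutations are present, predict resistance `level`."""
    hits = len(set(mutations) & set(rule_mutations))
    return level if hits >= z else "none"

def hivdb_style(mutations, penalties):
    """Sum the penalty scores contributed by each observed mutation."""
    return sum(penalties.get(m, 0) for m in mutations)

genotype = {"M41L", "K70R", "T215Y"}  # hypothetical patient mutation list
print(rega_style(genotype, {"M41L", "T215Y", "L210W"}, z=2, level="high"))  # high
print(hivdb_style(genotype, {"M41L": 15, "K70R": 30, "T215Y": 40}))  # 85
```

Both styles are transparent, which is exactly the maintainability problem: every new drug or dataset means hand-editing rule lists or penalty tables.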


Find the database sequence most similar to the test
sequence at all selected mutation positions
Does not interpolate between partial matches

Example: VirtualPhenotype™ [from Virco]

Advantage: no rules to maintain
Disadvantages:
Human experts still needed to identify mutation positions
Large amount of data needed to ensure a database match




Systems can “learn” by detecting patterns in training data
Enables knowledge discovery
Approaches vary in the type of features and learning algorithm
Features:
Mutation-based: presence of mutation (used by the majority of studies; sufficient for protease inhibitors)
Structure-based
Maintenance is just re-running the learning algorithm
on new data
Takes minutes to hours to train, seconds to
minutes to test a sequence
[Figure: models mapping mutations to resistance levels: decision tree for EFV (Beerenwinkel, ’02); neural network over 27 mutations]



Glass-box algorithms allow knowledge discovery
Black-box algorithms are more tolerant of extra features
Existing systems trade off between black-box algorithms
and expert-selected mutations
Protein | Language
PPI | Meaning
3D structure | Sentences
Secondary structure | Phrases
Motifs | Words
Amino acids | Letters
[Figure: bags of words from sports documents (e.g. touchdown, tackle; hoop, basket; glove, bat, ball) illustrating classification by topic words]
Classify a document by topic based on its words
Trade-off between using all English words or
selected keywords
Chi-square feature selection found to be best at
selecting keywords in text [Yang et al. ’97]

View target virus proteins as documents
Alphabet size: 20 amino acids
No word/motif boundaries (as in Thai or Japanese text)
Features: position-independent n-grams,
position-dependent n-grams (mutations)
Extract n-grams from every reading frame
Represent as a vector of n-gram counts
G S V E R D S V E E V L K A F R L F D D G N S G T…
G S G M R M S R E Q L L N A W R L F C K D N S H T…
G S G E R D S R E E I L K A F R L F D D D N S G T…
Count thresholds per n-gram:
A … (unigrams) | ≥ 5 | ≥ 10 | … | ≥ 100
AA … (bigrams) | ≥ 1 | ≥ 2 | … | ≥ 20
AAA … (trigrams) | ≥ 1 | …
Mutation frequency thresholds: 188G | ≥ 0.05 | ≥ 0.1
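The extraction and counting step can be sketched in a few lines of Python; `ngram_counts` is an illustrative helper (not code from this work), applied to the first example sequence above:

```python
from collections import Counter

def ngram_counts(seq, n_max=3):
    """Count position-independent n-grams (n = 1..n_max), extracted at
    every offset ("reading frame") of a protein sequence."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts

counts = ngram_counts("GSVERDSVEEVLKAFRLFDDGNSGT")
print(counts["G"], counts["DS"], counts["RLF"])  # 3 1 1
```

The resulting count vector is what the thresholds above binarize into features.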
Chi-square feature selection is the best for
document classification. (Yang & Pedersen, 1997)
 ( x)  
2
Expected # of seqs
with feature x and
resistance level c
Observed # of
sequences with
feature x
e(c, x)  o(c, x)
cC
2
e(c, x)
tx
e(c, x)  nc 
N
# of sequences
with resistance
level c
# of sequences
with feature x
Total # of
sequences
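As a sketch, the score of one boolean feature follows directly from this formula; `chi_square` below is an illustrative helper that implements exactly the cells shown (feature-present counts per resistance level), not the full four-cell χ² statistic:

```python
from collections import Counter

def chi_square(feature_present, labels):
    """Chi-square score of one boolean feature x against resistance levels,
    per the slide formula: sum over c of (o(c,x) - e(c,x))^2 / e(c,x),
    where e(c,x) = n_c * t_x / N."""
    N = len(labels)
    t_x = sum(feature_present)                      # sequences with feature x
    n_c = Counter(labels)                           # sequences per level c
    o = Counter(l for f, l in zip(feature_present, labels) if f)
    return sum((o[c] - n_c[c] * t_x / N) ** 2 / (n_c[c] * t_x / N)
               for c in n_c if t_x)

# Feature present in exactly the two "high" sequences and no others:
print(chi_square([1, 1, 0, 0, 0], ["high", "high", "low", "low", "none"]))
```

Features are ranked by this score and the top-scoring (n-gram, threshold) pairs are kept.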
Example χ² scores (per n-gram, per count threshold):
A … (unigrams) | ≥ 5: 23.5 | ≥ 10: 30.2 | … | ≥ 100: 12.1
AA … (bigrams) | ≥ 1: 23.1 | ≥ 2: 19.9 | … | ≥ 20: 29.9
AAA … (trigrams) | ≥ 1: 45.1 | ≥ 2: 15.1 | … | ≥ 20: 10.2
Selected features: AAA (≥ 1), …, A (≥ 10), …, AA (≥ 20)
[Pipeline: n-grams extracted at every reading frame of the protein sequences → counts of all n-grams → chi-square feature selection → true/false features marking selected n-grams occurring more frequently than their most discriminative thresholds → classifier]
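The thresholding step, turning raw counts into the true/false features fed to the classifier, can be sketched as follows; the thresholds are invented stand-ins for the χ²-chosen ones:

```python
def binarize(counts, thresholds):
    """True/False feature vector: a selected n-gram fires only when its
    count meets or exceeds its most discriminative threshold."""
    return {ng: counts.get(ng, 0) >= thr for ng, thr in thresholds.items()}

# Hypothetical thresholds standing in for the chi-square-selected ones
thresholds = {"A": 10, "AA": 20, "AAA": 1}
print(binarize({"A": 12, "AA": 5, "AAA": 2}, thresholds))
# {'A': True, 'AA': False, 'AAA': True}
```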

Previous study (Rhee et al., 2006) compared the
performance of 3 feature sets:
Expert-selected mutations
Treatment-selected mutations (TSM)
Mutations occurring more than twice in the dataset

TSM are learned from an additional database of
patients treated with a given drug class but no
drugs targeting the same protein
Not possible to be specific to each drug
Found human experts or TSM to perform best
Accuracy by drug (our method: χ2-selected features, 3Indep and PosDep; Rhee et al., 2006: TSM and Expert):

Nucleoside RT Inhibitors (NRTI)
Drug | 3Indep | PosDep | TSM | Expert
3TC | 0.897 | 0.934 | 0.92 | 0.92
ABC | 0.680 | 0.713 | 0.74 | 0.75
AZT | 0.733 | 0.752 | 0.71 | 0.72
D4T | 0.778 | 0.781 | 0.76 | 0.77
DDI | 0.723 | 0.745 | 0.75 | 0.75
TDF | 0.705 | 0.705 | 0.69 | 0.67
Avg. | 0.753 | 0.772 | 0.76 | 0.76

Protease Inhibitors (PI)
Drug | 3Indep | PosDep | TSM | Expert
APV | 0.788 | 0.786 | 0.78 | 0.77
ATV | 0.660 | 0.678 | 0.65 | 0.65
IDV | 0.753 | 0.732 | 0.75 | 0.75
LPV | 0.764 | 0.797 | 0.74 | 0.73
NFV | 0.761 | 0.774 | 0.80 | 0.80
RTV | 0.840 | 0.837 | 0.85 | 0.84
SQV | 0.793 | 0.812 | 0.80 | 0.77
Avg. | 0.765 | 0.774 | 0.77 | 0.76

Non-Nucleoside RT Inhibitors (NNRTI)
Drug | 3Indep | PosDep | TSM | Expert
DLV | 0.823 | 0.842 | 0.84 | 0.82
EFV | 0.864 | 0.855 | 0.85 | 0.80
NVP | 0.912 | 0.910 | 0.91 | 0.89
Avg. | 0.866 | 0.869 | 0.87 | 0.84

Overall Avg. | 0.780 | 0.791 | 0.78 | 0.78

Using the same dataset and classifier
(decision tree), our χ2-selected
features performed comparably to
TSM and expert-selected mutations

Evaluated on several learning algorithms
Glass-box: decision tree, naïve Bayes, random forest
Black-box: SVM
On average 100–120 χ2-selected features
Choice of classifier did not make much difference
Learning Alg. | PI | NRTI | NNRTI | Avg.
Decision Tree | 0.774 | 0.772 | 0.869 | 0.791
Naïve Bayes | 0.781 | 0.767 | 0.858 | 0.790
Random Forest | 0.800 | 0.785 | 0.875 | 0.808
SVM | 0.809 | 0.807 | 0.880 | 0.822
Regression results (our method vs. Rhee et al., 2006):

Protease Inhibitors (PI)
Drug | Our Regression | Our r² | Rhee r² | Rhee Regression
APV | SVM | 0.821 | 0.82 | SVM
ATV | SVM | 0.775 | 0.76 | LSR
IDV | SVM | 0.826 | 0.83 | LSR
LPV | SVM | 0.865 | 0.87 | SVM
NFV | Linear | 0.854 | 0.84 | LSR
RTV | SVM | 0.900 | 0.89 | LSR
SQV | SVM | 0.838 | 0.84 | LSR

Nucleoside RT Inhibitors (NRTI)
Drug | Our Regression | Our r² | Rhee r² | Rhee Regression
3TC | SVM | 0.935 | 0.95 | SVM
ABC | Linear | 0.788 | 0.79 | LARS
AZT | Linear | 0.767 | 0.74 | SVM
D4T | SVM | 0.747 | 0.79 | SVM
DDI | SVM | 0.729 | 0.75 | SVM
TDF | SVM | 0.527 | 0.59 | SVM

Non-Nucleoside RT Inhibitors (NNRTI)
Drug | Our Regression | Our r² | Rhee r² | Rhee Regression
DLV | SVM | 0.815 | 0.79 | LARS
EFV | SVM | 0.864 | 0.85 | LARS
NVP | SVM | 0.811 | 0.79 | LARS
Used regression algorithms to predict the resistance factor (IC50 ratio)
Comparing the best model from each study for each drug, our
model matched or outperformed Rhee et al. on 12 of 16 drugs
Average difference < 0.01
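For reference, the r² used to score these regressions is the coefficient of determination; a plain-Python sketch with made-up resistance-factor values:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical log resistance-factor values, for illustration only
print(r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))
```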
53 of 54 expert-selected mutations for PI ranked 108th or higher by χ2
20 of 21 expert-selected mutations for NRTI ranked 120th or higher by χ2
All 15 expert-selected mutations for NNRTI ranked 107th or higher by χ2
[Figure: top χ2-ranked mutation features for RTV, 3TC and EFV (including V184, N103, C181, I54, M90), overlapping strongly with expert-selected resistance mutations]


Phenotype systems predict only the resistance that the
detected genotype currently confers
Resistance to a drug combination is not a summation of resistance to individual drugs
Mutations can cause resistance to one drug while
increasing sensitivity to another
Minor strains not detected by genotype testing →
treatment history matters
Variation in the human host affects response
Adherence [Ying et al., 2007]
Haplotype? Gender? State of health?
Lifestyle habits?

[Diagram: patient and virus genotype, drug potency/resistance, and treatment history jointly determine future drug options]



Model the impact of interactions between all these factors
using a feature for each combination
χ2 reduces these to a manageable number of important
features before applying a glass-box model
Amortized optimization of HAART requires short-term and long-term response models

Given a new protein sequence, classify it into the
correct category at each level of the hierarchy
Subfamily classification is based on function
G-protein coupled receptors (GPCRs) are the target of 60%
of current drugs
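A top-down pass over such a hierarchy can be sketched as follows; the hierarchy, the marker n-grams, and the scoring rule are invented placeholders, not the actual GPCR families or classifiers:

```python
# Toy top-down hierarchical classification: at each level, pick the
# child category whose (placeholder) classifier scores the sequence highest.

HIERARCHY = {
    "root": ["ClassA", "ClassB"],
    "ClassA": ["Amine", "Peptide"],
    "ClassB": [],
    "Amine": [], "Peptide": [],
}

def score(category, seq):
    """Placeholder per-category scorer: counts category-specific n-grams."""
    markers = {"ClassA": "DRY", "ClassB": "CWC", "Amine": "WD", "Peptide": "NN"}
    return seq.count(markers.get(category, ""))

def classify(seq, node="root"):
    """Descend the hierarchy, choosing the best-scoring child at each level."""
    path = []
    while HIERARCHY.get(node):
        node = max(HIERARCHY[node], key=lambda c: score(c, seq))
        path.append(node)
    return path
```

In the real system each node's scorer would be a trained classifier (e.g. naïve Bayes over χ2-selected n-grams) rather than a marker count.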



Previous classification studies
rely on alignment-based
features
Karchin et al. (2002) evaluated
classifiers at varying levels of
complexity and concluded SVMs
were necessary to attain 85%+
accuracy
Our document classification
approach uses χ2 features with
naïve Bayes or decision trees
Classifier complexity, from complex to simple:
SVM, neural nets, clustering
Hidden Markov models (HMM)
K-nearest neighbours
Decision trees, naïve Bayes
Naïve Bayes with chi-square attained a 39.7% reduction in residual error.

Classifier | # of Features | Type of Features | Accuracy
Naïve Bayes | 7400 | Chi-square n-gram features | 93.2 %
SVM | 9 per match state | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 88.4 %
BLAST | – | Local sequence alignment | 83.3 %
Decision Tree | 2700 | Chi-square n-gram features | 78.0 %
SAM-T2K HMM | – | An HMM model built for each protein subfamily | 69.9 %
kernNN | 9 per match state | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 64.0 %
Position-independent n-grams outperformed position-specific ones
because diversity of GPCR seqs made sequence alignment difficult.
Naïve Bayes with chi-square attained a 44.5% reduction in residual error.

Classifier | # of Features | Type of Features | Accuracy
Naïve Bayes | 8100 | Chi-square n-gram features | 92.4 %
SVM | 9 per match state | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 86.3 %
BLAST | – | Local sequence alignment | 74.5 %
Decision Tree | 2300 | Chi-square n-gram features | 70.2 %
SAM-T2K HMM | – | An HMM model built for each protein subfamily | 70.0 %
kernNN | 9 per match state | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 51.0 %
N-grams selected by chi-square join together to form motifs found in the literature.





Current phenotype prediction systems require
human experts to maintain either rules or
resistance-associated mutations
A text document classification approach led to a fully
automatic prediction model with results comparable
to the state of the art, yet requiring no human
expertise
χ2-identified mutations overlap strongly with
human experts’ selections
A similar approach found success in previous
work on GPCR proteins
Aim: an automatic prediction model for short-term
and long-term viral load response to HAART, so
that amortized treatment optimization is possible
Betty Cheng ([email protected])
Jaime Carbonell ([email protected])