Transcript Talk5

Boosting
Shai Raffaeli
Seminar in mathematical biology
http://www1.cs.columbia.edu/~freund/
Toy Example
• Computer receives telephone call
• Measures Pitch of voice
• Decides gender of caller
[Diagram: a human voice signal is measured for pitch and the caller is classified as Male or Female]
Generative modeling
[Plot: probability density vs. voice pitch; two fitted distributions with parameters mean1/var1 and mean2/var2, one per gender; their overlap determines the number of mistakes]
Discriminative approach
[Plot: number of mistakes as a function of the voice-pitch decision threshold]
Ill-behaved data
[Plot: an ill-behaved pitch distribution for which the fitted means mean1 and mean2 give a poor decision rule]
Traditional Statistics vs.
Machine Learning
[Diagram: the traditional statistics route goes Data, then (Statistics) Estimated world state, then (Decision Theory) Predictions and Actions; the machine-learning route maps Data directly to Predictions and Actions]
A weighted training set
(x1,y1,w1), (x2,y2,w2), …, (xm,ym,wm)
A weak learner
[Diagram: a weighted training set (x1,y1,w1), (x2,y2,w2), …, (xn,yn,wn), i.e. instances x1, x2, x3, …, xn with labels y1, y2, y3, …, yn, is fed to the weak learner, which outputs a weak rule h]
The weak requirement: h must do slightly better than random guessing on the weighted training set, i.e. its weighted error must be below 1/2 (equivalently, Si wi yi h(xi) / Si wi >= g for some g > 0).
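As a concrete illustration (a minimal sketch, not code from the talk), a weighted decision stump is a typical weak learner: it exhaustively searches one-feature threshold rules for the smallest weighted training error.

```python
import numpy as np

def stump_weak_learner(X, y, w):
    """Weighted decision-stump weak learner (minimal illustrative sketch).
    X: (n, d) feature array, y: labels in {-1, +1}, w: example weights.
    Returns the chosen (feature, threshold, sign) and a predict function."""
    n, d = X.shape
    best, best_err = None, np.inf
    for j in range(d):                         # try every feature
        for thr in np.unique(X[:, j]):         # and every observed threshold
            for s in (+1, -1):                 # and both orientations
                pred = np.where(X[:, j] > thr, s, -s)
                err = np.sum(w * (pred != y))  # weighted training error
                if err < best_err:
                    best_err, best = err, (j, thr, s)
    j, thr, s = best
    predict = lambda X_: np.where(X_[:, j] > thr, s, -s)
    return best, predict
```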
The boosting process
(x1,y1,1/n), … (xn,yn,1/n)  ->  weak learner  ->  h1
(x1,y1,w1), … (xn,yn,wn)  ->  weak learner  ->  h2
… the example weights are updated and the process repeats, producing h3, h4, h5, h6, h7, h8, h9, …, hT …
Final rule:  Sign[ a1 h1 + a2 h2 + … + aT hT ]
Adaboost
• Binary labels y = -1, +1
• margin(x,y) = y [ St at ht(x) ]
• P(x,y) = (1/Z) exp( -margin(x,y) )
• Given ht, we choose at to minimize S(x,y) exp( -margin(x,y) )
Adaboost
Freund, Schapire 1997
F0 x  0
for t 1..T
w  expyi Ft1(xi )
t
i
Get ht from weak  learner

t
t 
a t  ln i:h x 1,y 1 wi i:h x 1,y 1 wi 
 t  i  i

t i
i
Ft+1  Ft + at ht
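A minimal sketch of this loop (using the standard binary AdaBoost choice at = 1/2 ln((1-e_t)/e_t) rather than the specialist-rule form on the slide; the weak_learner interface matches the stump sketch above):

```python
import numpy as np

def adaboost(X, y, weak_learner, T=50):
    """Minimal AdaBoost sketch. y must be in {-1, +1}; weak_learner(X, y, w)
    must return (params, h) with h(X) in {-1, +1}, e.g. the stump above.
    Returns the combined rule x -> sign(sum_t a_t h_t(x))."""
    n = X.shape[0]
    F = np.zeros(n)                          # F_t(x_i) on the training set
    rules, alphas = [], []
    for t in range(T):
        w = np.exp(-y * F)                   # w_i = exp(-y_i F_{t-1}(x_i))
        w /= w.sum()
        _, h = weak_learner(X, y, w)         # get h_t from the weak learner
        pred = h(X)
        eps = np.sum(w * (pred != y))        # weighted error of h_t
        if eps >= 0.5:                       # no edge over random guessing
            break
        eps = max(eps, 1e-12)                # avoid division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)
        F += alpha * pred                    # F_{t+1} = F_t + a_t h_t
        rules.append(h)
        alphas.append(alpha)
    return lambda X_: np.sign(sum(a * h(X_) for a, h in zip(alphas, rules)))
```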
Main property of adaboost
• If the advantages of the weak rules over random guessing are g1, g2, …, gT, then the in-sample error of the final rule is at most Pt sqrt(1 - 4 gt^2) <= exp( -2 St gt^2 )
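For example (illustrative numbers, not from the talk): if every weak rule has advantage gt = 0.1 over random guessing, then after T = 200 rounds the bound gives exp(-2 · 200 · 0.1^2) = exp(-4) ≈ 0.018, i.e. at most about 2% training error.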
Adaboost as gradient descent
• Discriminator class: a linear discriminator in the
space of “weak hypotheses”
• Original goal: find the hyperplane with the smallest
number of mistakes
– Known to be an NP-hard problem (no algorithm that
runs in time polynomial in d, where d is the dimension
of the space)
• Computational method: Use exponential loss as
a surrogate, perform gradient descent.
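To spell out the connection (standard derivation, not verbatim from the talk): the exponential loss of F on the sample is L(F) = Si exp(-yi F(xi)). Adding a candidate rule h with a small coefficient a changes the loss by approximately -a Si yi h(xi) exp(-yi F(xi)) = -a Si wi yi h(xi), with wi = exp(-yi F(xi)). So the example weights encode the gradient direction, and the steepest-descent coordinate is the weak rule with the largest weighted correlation Si wi yi h(xi), i.e. the smallest weighted error.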
Margins view
x, w in R^n ;  y in {-1, +1}
Prediction = sign( w · x )
Margin = y ( w · x ) / ( ||w|| ||x|| )
[Figure: + and - examples separated by the direction w; projecting the examples onto w gives a plot of cumulative # of examples vs. margin, with mistakes at negative margins and correct predictions at positive margins]
Adaboost et al.
[Plot: loss as a function of the margin for the 0-1 loss and the surrogate losses used by Adaboost (exponential loss e^(-y(w·x))), Logitboost, and Brownboost; mistakes lie at negative margins, correct predictions at positive margins]
One coordinate at a time
• Adaboost performs gradient descent on exponential loss
• Adds one coordinate (“weak learner”) at each iteration.
• Weak learning in binary classification = slightly better
than random guessing.
• Weak learning in regression – unclear.
• Uses example-weights to communicate the gradient
direction to the weak learner
• Solves a computational problem
What is a good weak learner?
• The set of weak rules (features) should be flexible
enough to be (weakly) correlated with most conceivable
relations between feature vector and label.
• Small enough to allow exhaustive search for the minimal
weighted training error.
• Small enough to avoid over-fitting.
• Should be able to calculate predicted label very
efficiently.
• Rules can be “specialists” – predict only on a small
subset of the input space and abstain from predicting on
the rest (output 0).
Decision Trees
[Figure: a decision tree over features X and Y, with a root split X > 3 followed by Y > 5, partitioning the plane (thresholds X = 3 and Y = 5) into +1 and -1 regions]
Decision tree as a sum
[Figure: the same decision tree written as a sum, with prediction values such as -0.2, +0.2, +0.1, -0.1, -0.3 attached to the root and to the branches of the X > 3 and Y > 5 splits; the predicted label is the sign of the sum]
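In formula form (generic, with the node values being illustrative): score(x) = root value + the sum, over the splitter nodes on x's path, of the value attached to the branch x follows, and the prediction is sign(score(x)). For instance, an example with X > 3 and Y > 5 accumulates the root value plus the values of the X > 3 branch and the Y > 5 branch.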
An alternating decision tree
[Figure: an alternating decision tree over X and Y, with a root prediction node followed by splitter nodes Y < 1, X > 3 and Y > 5, each carrying prediction values (e.g. +0.7, +0.2, +0.1, -0.1, -0.3) at its children; the output is the sign of the sum of the prediction values along all paths consistent with the example]
Example: Medical Diagnostics
• Cleve dataset from UC Irvine database.
• Heart disease diagnostics (+1 = healthy, -1 = sick)
• 13 features from tests (real valued and discrete).
• 303 instances.
Adtree for Cleveland heart-disease diagnostics problem
Cross-validated accuracy

Learning algorithm   Number of splits   Average test error   Test error variance
ADtree               6                  17.0%                0.6%
C5.0                 27                 27.2%                0.5%
C5.0 + boosting      446                20.2%                0.5%
Boost Stumps         16                 16.5%                0.8%
Curious phenomenon
Boosting decision trees
Using <10,000 training examples we fit >2,000,000 parameters
Explanation using margins
[Plot: 0-1 loss as a function of the margin]
Explanation using margins
[Plot: margin distribution of the boosted classifier; no examples with small margins!!]
Experimental Evidence
Theorem
Schapire, Freund, Bartlett & Lee
Annals of stat. 98
For any convex combination and any threshold q:
    Probability of mistake  <=  Fraction of training examples with small margin (below q)  +  O~( (1/q) sqrt(d/m) )
where m is the size of the training sample and d is the VC dimension of the weak rules.
No dependence on the number of weak rules that are combined!!!
Suggested optimization problem
[Figure: suggested optimization problem; choose the margin threshold q to trade off the fraction of training examples with margin below q against sqrt(d/m) / q]
Idea of Proof
Applications of Boosting
• Academic research
• Applied research
• Commercial deployment
Academic research
% test error rates

Database    Other              Boosting   Error reduction
Cleveland   27.2 (DT)          16.5       39%
Promoters   22.0 (DT)          11.8       46%
Letter      13.8 (DT)          3.5        74%
Reuters 4   5.8, 6.0, 9.8      2.95       ~60%
Reuters 8   11.3, 12.1, 13.4   7.4        ~40%
Schapire, Singer, Gorin 98
Applied research
• "AT&T, How may I help you?"
• Classify voice requests
• Voice -> text -> category
• Fourteen categories: area code, AT&T service, billing credit, calling card, collect, competitor, dial assistance, directory, how to dial, person to person, rate, third party, time charge, time
Examples
• Yes I’d like to place a collect call long
distance please  collect
• Operator I need to make a call but I need to
bill it to my office  third party
• Yes I’d like to place a call on my master card
please  calling card
• I just called a number in Sioux city and I
musta rang the wrong number because I got
the wrong party and I would like to have that
taken off my bill  billing credit
Weak rules generated by
“boostexter”
[Table: for each category (calling card, collect, third party call), the weak rule tests whether a particular word occurs or does not occur in the request]
Results
• 7844 training examples
– hand transcribed
• 1000 test examples
– hand / machine transcribed
• Accuracy with 20% rejected
– Machine transcribed: 75%
– Hand transcribed: 90%
Freund, Mason, Rogers, Pregibon, Cortes 2000
Commercial deployment
• Distinguish business/residence
customers
• Using statistics from call-detail records
• Alternating decision trees
– Similar to boosting decision trees, more
flexible
– Combines very simple rules
– Can over-fit, cross validation used to stop
Summary
• Boosting is a computational method for learning
accurate classifiers
• Resistance to over-fit explained by margins
• Underlying explanation –
large “neighborhoods” of good classifiers
• Boosting has been applied successfully to a
variety of classification problems
Gene Regulation
• Regulatory proteins bind to non-coding regulatory
sequence of a gene to control rate of transcription
[Diagram: regulator proteins bind to binding sites in the DNA upstream of a gene; the resulting mRNA transcript is the measurable quantity]
From mRNA to Protein
[Diagram: the ribosome translates the mRNA transcript into a protein sequence, which folds into a protein; some proteins are transcription factors that act as regulators]
Genome-wide Expression
Data
• Microarrays measure mRNA
transcript expression levels for all
of the ~6000 yeast genes at once.
• Very noisy data
• Rough time slice over all
compartments of many cells.
• Protein expression not observed
Partial “Parts List” for Yeast
Many known and putative:
– Transcription factors (TF)
– Signaling molecules (SM) that activate transcription factors
– Known and putative binding site "motifs" (MTF)
– In yeast, regulatory sequence = 500 bp upstream region
GeneClass: Problem Formulation
M. Middendorf, A. Kundaje, C. Wiggins, Y. Freund, C. Leslie.
Predicting Genetic Regulatory Response Using Classification. ISMB 2004.
• Predict target gene regulatory response from regulator activity and binding site data
[Diagram: a microarray image yields "parent" (regulator) gene expression R1, R2, R3, R4, …, Rp and target gene expression G1, G2, G3, G4, …, Gt; binding sites (motifs) in the upstream region of each target gene provide additional input features]
Role of quantization
• By quantizing expression into three classes {-1, 0, +1} we reduce noise but maintain most of the signal.
• Weighting the +1 / -1 examples linearly with expression level performs slightly better.
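A minimal sketch of this kind of three-way quantization (the threshold value here is illustrative, not the one used in GeneClass):

```python
import numpy as np

def quantize_expression(log_ratio, threshold=0.5):
    """Quantize log expression ratios into {-1, 0, +1} (down / baseline / up).
    The threshold is an illustrative assumption, not the paper's value."""
    q = np.zeros_like(log_ratio, dtype=int)
    q[log_ratio >= threshold] = +1
    q[log_ratio <= -threshold] = -1
    return q
```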
Problem setup
• Data point = Target gene X Microarray
• Input features:
– Parent state {-1,0,+1}
– Motif Presence {0,1}
• Predict output:
– Target Gene {-1,+1}
Boosting with Alternating
Decision Trees (ADTs)
• Use boosting to build a single ADT, a margin-based generalization of a decision tree
[Figure: an ADT with splitter nodes (e.g. "Is Motif MIG1 present AND Parent XBP1 up?") and prediction nodes; F(x) is given by the sum of the prediction nodes along all paths consistent with x]
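A minimal sketch of how such an ADT is evaluated (the node representation is hypothetical, not GeneClass's data structure):

```python
def adt_score(x, node):
    """Evaluate an alternating decision tree: sum the prediction values along
    all paths consistent with example x.  Each prediction node is a dict
    {'value': float, 'splits': [...]}, where each split is a tuple
    (condition, yes_child, no_child) and condition(x) returns True/False."""
    total = node['value']
    for condition, yes_child, no_child in node['splits']:
        child = yes_child if condition(x) else no_child
        total += adt_score(x, child)      # recurse into the consistent branch
    return total

# The predicted label is the sign of the accumulated score:
# label = +1 if adt_score(x, root) > 0 else -1
```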
Statistical Validation
• 10-fold cross-validation experiments, ~50,000
(gene/microarray) training examples
• Significant correlation between prediction score and true
log expression ratio on held-out data.
• Prediction accuracy on +1/-1 labels: 88.5%
Biological Interpretation
From correlation to causation
• Good prediction only implies Correlation.
• To infer causation we need to integrate additional
knowledge.
• Comparative case studies: train on similar conditions
(stresses), test on related experiments
• Extract significant features from learned model
– Iteration score (IS): Boosting iteration at which feature
first appears
Identifies significant motifs, motif-parent pairs
– Abundance score (AS): Number of nodes in ADT
containing feature
Identifies important regulators
• In silico knock-outs: remove significant regulator and
retrain.
Case Study: Heat Shock and
Osmolarity
 Training set: Heat shock, osmolarity, amino acid
starvation
 Test set: Stationary phase, simultaneous heat
shock+osmolarity
 Results:
 Test error = 9.3%
 Supports Gasch hypothesis: heat shock and osmolarity
pathways independent, additive
– High scoring parents (AS): USV1 (stationary phase and
heat shock), PPT1 (osmolarity response), GAC1
(response to heat)
Case Study: Heat Shock and
Osmolarity
 Results:
 High scoring binding sites (IS):
MSN2/MSN4 STRE element
Heat shock related: HSF1 and RAP1 binding sites
Osmolarity/glycerol pathways: CAT8, MIG1, GCN4
Amino acid starvation: GCN4, CHA4, MET31
– High scoring motif-parent pair (IS):
TPK1~STRE pair (kinase that regulates MSN2 via
cellular localization) – indirect effect
[Diagram: three mechanisms behind a motif-parent pair: direct binding (parent P binds its own motif Mp), indirect effect (parent P activates a transcription factor TF that binds motif MTF), and co-occurrence (motif M merely co-occurs with the parent's motif Mp)]
Case Study: In silico knockout
• Training and test sets: Same as heat shock and
osmolarity case study
• Knockout: Remove USV1 from regulator list and
retrain
• Results:
– Test error: 12% (increase from 9%)
– Identify putative downstream targets of USV1: target
genes that change from correct to incorrect label
– GO annotation analysis reveals putative functions:
Nucleoside transport, cell-wall organization and
biogenesis, heat-shock protein activity
– Putative functions match those identified in wet lab
USV1 knockout (Segal et al., 2003)
Conclusions: Gene Regulation
• New predictive model for study of gene regulation
– First gene regulation model to make quantitative
predictions.
– Using actual expression levels - no clustering.
– Strong prediction accuracy on held-out
experiments
– Interpretable hypotheses: significant regulators,
binding motifs, regulator-motif pairs
• New methodology for biological analysis:
comparative training/test studies, in silico
knockouts
Summary
• Boosting is an efficient and flexible method for
constructing complex and accurate classifiers.
• Correlation -> Causation : still a hard problem,
requires domain specific expertise and
integration of data sources.
Improvement suggestions...
• Use of binary labels simplifies the algorithm,
but doesn't reflect reality.
• “Confusion table”.
The End.
Large margins
a h x

F x
(x,y) y
y
Ý
a
 a
T
margin
FT
t1 t t
T
t1
margin
FT
(x,y)  0 
T
t
fT (x)  y
Thesis:
large margins => reliable predictions
Very similar to SVM.
1
Experimental Evidence
Theorem
Schapire, Freund, Bartlett & Lee / Annals of statistics 1998
H: set of binary functions with VC-dimension d
C = { Si ai hi  |  hi in H,  ai >= 0,  Si ai = 1 }
T = (x1,y1), (x2,y2), …, (xm,ym) ;   T ~ D^m
For every c in C and every q > 0, with probability 1-d w.r.t. T ~ D^m:
P{(x,y)~D}[ sign(c(x)) != y ]  <=  P{(x,y)~T}[ margin_c(x,y) <= q ]  +  O~( (1/q) sqrt(d/m) )  +  O( sqrt( log(1/d) / m ) )
No dependence on the number of combined functions!!!
Idea of Proof