Boosting and Predictive Modeling

Yoav Freund
Columbia University
What is “data mining”?
Lots of data, complex models:
• Classifying customers using transaction logs.
• Classifying events in high-energy physics experiments.
• Object detection in computer vision.
• Predicting gene regulatory networks.
• Predicting stock prices and portfolio management.
Leo Breiman, “Statistical Modeling: The Two Cultures,” Statistical Science, 2001
• The data modeling culture (generative modeling)
  – Assume a stochastic model (5-50 parameters).
  – Estimate model parameters.
  – Interpret the model and make predictions.
  – Estimated population: 98% of statisticians.
• The algorithmic modeling culture (predictive modeling)
  – Assume the relationship between the predictor variables and the response variables has a functional form (10^2 to 10^6 parameters).
  – Search (efficiently) for the best prediction function.
  – Make predictions.
  – Interpretation / causation: mostly an afterthought.
  – Estimated population: 2% of statisticians (many in other fields).
Toy Example
• Computer receives a telephone call.
• Measures the pitch of the voice.
• Decides the gender of the caller.

[Diagram: human voice → classifier → male / female]
Generative modeling

[Plot: probability vs. voice pitch; two Gaussians with parameters (mean1, var1) and (mean2, var2), one per class.]
Discriminative approach

[Plot: number of mistakes as a function of the voice-pitch threshold.]
Ill-behaved data

[Plot: probability and number of mistakes vs. voice pitch, with the fitted means mean1 and mean2 marked, for data that is poorly modeled by two Gaussians.]
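The toy example can be made concrete in a few lines of Python. This is an illustrative sketch only: the pitch distributions, sample sizes, and candidate thresholds below are invented for the illustration, not taken from the talk.

```python
import numpy as np

# Synthetic pitch data for the toy example (numbers are made up for illustration).
rng = np.random.default_rng(0)
pitch = np.concatenate([rng.normal(120, 20, 500),   # one class of callers
                        rng.normal(210, 30, 500)])  # the other class
label = np.concatenate([-np.ones(500), np.ones(500)])

# Generative approach: fit one Gaussian per class (mean1/var1, mean2/var2),
# then classify each call by the higher class log-likelihood.
m1, s1 = pitch[label == -1].mean(), pitch[label == -1].std()
m2, s2 = pitch[label == +1].mean(), pitch[label == +1].std()

def log_lik(x, m, s):
    return -0.5 * ((x - m) / s) ** 2 - np.log(s)

gen_pred = np.where(log_lik(pitch, m2, s2) > log_lik(pitch, m1, s1), 1, -1)

# Discriminative approach: pick the single pitch threshold that minimizes
# the number of mistakes, with no distributional assumptions.
candidates = np.sort(pitch)
mistakes = [np.sum(np.where(pitch > t, 1, -1) != label) for t in candidates]
disc_pred = np.where(pitch > candidates[int(np.argmin(mistakes))], 1, -1)

print("generative error:", (gen_pred != label).mean())
print("discriminative error:", (disc_pred != label).mean())
```

On well-behaved data the two approaches land on nearly the same threshold; the point of the “ill-behaved data” slide is that when the Gaussian assumption fails, the generative fit can place its threshold far from the one that actually minimizes mistakes.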
Plan of talk
• Boosting
• Alternating Decision Trees
• Data-mining AT&T transaction logs
• The I/O bottleneck in data-mining
• Resistance of boosting to over-fitting
• Confidence-rated prediction
• Confidence-rating for object recognition
• Gene regulation modeling
• Summary
Plan of talk (next topic: Boosting, combining weak classifiers)
Batch learning for binary classification

Data distribution: $(x, y) \sim D$; $y \in \{-1, +1\}$

Training set: $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$; $T \sim D^m$

GOAL: find $h : X \to \{-1, +1\}$ that minimizes the generalization error
$$\varepsilon(h) \doteq P_{(x,y) \sim D}\big[h(x) \neq y\big]$$

Training error:
$$\hat\varepsilon(h) \doteq \frac{1}{m} \sum_{(x,y) \in T} \mathbf{1}\big[h(x) \neq y\big] = P_{(x,y) \sim T}\big[h(x) \neq y\big]$$
A weighted training set

$(x_1, y_1, w_1), (x_2, y_2, w_2), \ldots, (x_m, y_m, w_m)$
A weak learner

The weak learner receives a weighted training set $(x_1, y_1, w_1), (x_2, y_2, w_2), \ldots, (x_m, y_m, w_m)$ together with the instances $x_1, x_2, \ldots, x_m$, and returns a weak rule $h$ with predictions $\hat y_1, \hat y_2, \ldots, \hat y_m$; $\hat y_i \in \{0, 1\}$.

The weak requirement:
$$\frac{\sum_{i=1}^{m} y_i \hat y_i w_i}{\sum_{i=1}^{m} w_i} > \gamma > 0$$
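As a concrete (if naive) weak learner, here is a sketch of a weighted decision stump. It is not from the talk, and it returns ±1-valued predictions, whereas the slide writes the weak rule's outputs as {0, 1}.

```python
import numpy as np

def stump_weak_learner(X, y, w):
    """Naive weighted decision stump: exhaustively search over (feature,
    threshold, polarity) for the rule with the smallest weighted error,
    i.e. the largest weighted correlation with the labels.
    X: (m, d) array, y: labels in {-1, +1}, w: non-negative weights."""
    m, d = X.shape
    best, best_err = (0, 0.0, 1), np.inf
    w_sum = np.sum(w)
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for pol in (+1, -1):
                pred = np.where(X[:, j] > thr, pol, -pol)
                err = np.sum(w[pred != y]) / w_sum
                if err < best_err:
                    best_err, best = err, (j, thr, pol)
    return best, best_err
```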
The boosting process

Round 1: $(x_1, y_1, 1), (x_2, y_2, 1), \ldots, (x_n, y_n, 1) \;\rightarrow\;$ weak learner $\;\rightarrow\; h_1$
Round 2: $(x_1, y_1, w_1^2), (x_2, y_2, w_2^2), \ldots, (x_n, y_n, w_n^2) \;\rightarrow\;$ weak learner $\;\rightarrow\; h_2$
$\vdots$
Round T: $(x_1, y_1, w_1^T), (x_2, y_2, w_2^T), \ldots, (x_n, y_n, w_n^T) \;\rightarrow\;$ weak learner $\;\rightarrow\; h_T$

Final rule:
$$F_T(x) = \alpha_1 h_1(x) + \alpha_2 h_2(x) + \cdots + \alpha_T h_T(x), \qquad f_T(x) = \mathrm{sign}\big(F_T(x)\big)$$
AdaBoost
Freund, Schapire 1997

$F_0(x) \equiv 0$
for $t = 1 \ldots T$:
  $w_i^t = \exp\!\big(-y_i F_{t-1}(x_i)\big)$
  Get $h_t$ from the weak learner
  $\alpha_t = \frac{1}{2} \ln\!\left( \frac{\sum_{i:\, h_t(x_i)=1,\ y_i=+1} w_i^t}{\sum_{i:\, h_t(x_i)=1,\ y_i=-1} w_i^t} \right)$
  $F_{t+1} = F_t + \alpha_t h_t$
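A minimal sketch of the loop above, reusing the stump learner from the earlier sketch. For ±1-valued weak rules the usual $\alpha_t = \frac{1}{2}\ln(W_{\text{correct}}/W_{\text{wrong}})$ is used, which plays the same role as the slide's sums over the region $h_t(x)=1$; the small constant only guards against division by zero and is not part of the algorithm.

```python
import numpy as np

def adaboost(X, y, rounds=50):
    """AdaBoost with +/-1 decision stumps (sketch).
    Returns a list of (alpha, feature, threshold, polarity) tuples."""
    m = len(y)
    F = np.zeros(m)                     # F_0(x_i) = 0 on the training set
    model = []
    for _ in range(rounds):
        w = np.exp(-y * F)              # w_i^t = exp(-y_i F_{t-1}(x_i))
        (j, thr, pol), _ = stump_weak_learner(X, y, w)   # get h_t
        pred = np.where(X[:, j] > thr, pol, -pol)
        w_correct = np.sum(w[pred == y])
        w_wrong = np.sum(w[pred != y]) + 1e-12
        alpha = 0.5 * np.log(w_correct / w_wrong)
        F = F + alpha * pred            # F_{t+1} = F_t + alpha_t h_t
        model.append((alpha, j, thr, pol))
    return model

def predict(model, X):
    """f_T(x) = sign(F_T(x)) = sign(sum_t alpha_t h_t(x))."""
    F = np.zeros(len(X))
    for alpha, j, thr, pol in model:
        F += alpha * np.where(X[:, j] > thr, pol, -pol)
    return np.sign(F)
```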
Main property of AdaBoost

If the advantages of the weak rules over random guessing are $\gamma_1, \gamma_2, \ldots, \gamma_T$, then the training error of the final rule is at most
$$\hat\varepsilon(f_T) \;\le\; \exp\!\left(-\sum_{t=1}^{T} \gamma_t^2\right)$$
Boosting block diagram

[Diagram: the Booster sends example weights to the Weak Learner, which returns a weak rule; the accumulated weak rules form the Strong Learner's accurate rule.]
AdaBoost as gradient descent

Each variant performs gradient descent on a different loss, shown as a function of the margin $y F_t(x)$:

Adaboost: $e^{-y F_t(x)}$
Logitboost: $\ln\!\big(1 + e^{-y F_t(x)}\big)$
Brownboost: $\frac{1}{2}\left(1 - \mathrm{erf}\!\left(\frac{y F_t(x) + c - t}{\sqrt{c - t}}\right)\right)$

[Plot: loss vs. $y F_t(x)$, together with the 0-1 loss; mistakes are the region $y F_t(x) < 0$, correct predictions the region $y F_t(x) > 0$.]
Plan of talk (next topic: Alternating Decision Trees, a hybrid of boosting and decision trees)
Decision Trees

[Diagram: a decision tree with root split X>3 (predict -1 if no) and a second split Y>5 (predict -1 or +1), shown next to the corresponding axis-parallel partition of the (X, Y) plane.]
A decision tree as a sum of weak rules

[Diagram: the same tree rewritten as a sum of weak rules with real-valued prediction nodes (e.g. -0.2, +0.1, -0.3, +0.2); the classification is the sign of the sum of the predictions along the path taken by the instance.]
An alternating decision tree
Freund, Mason 1997

[Diagram: an alternating decision tree with a root prediction node (0.0) and splitter nodes Y<1, X>3, Y>5, each carrying prediction nodes on its branches (e.g. +0.7, -0.1, +0.1, -0.3, +0.2); the classification is the sign of the sum of all prediction nodes along every path consistent with the instance.]
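To make the structure concrete, here is a sketch of how an ADT scores an instance: every path consistent with x contributes its prediction nodes to the sum. The node values and the Y<1 condition below are just the toy numbers from the figure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Splitter:
    condition: Callable[[Dict[str, float]], bool]  # e.g. lambda x: x["Y"] < 1
    yes_value: float                               # prediction node on the "yes" branch
    no_value: float                                # prediction node on the "no" branch
    yes_children: List["Splitter"] = field(default_factory=list)
    no_children: List["Splitter"] = field(default_factory=list)

def adt_score(x: Dict[str, float], root_value: float, splitters: List["Splitter"]) -> float:
    """Sum the prediction nodes along every path consistent with x;
    the classification is the sign of this sum."""
    total, stack = root_value, list(splitters)
    while stack:
        node = stack.pop()
        if node.condition(x):
            total += node.yes_value
            stack.extend(node.yes_children)
        else:
            total += node.no_value
            stack.extend(node.no_children)
    return total

# Toy tree loosely following the figure: root prediction 0.0 and one splitter Y < 1.
tree = [Splitter(lambda x: x["Y"] < 1, yes_value=+0.7, no_value=-0.1)]
print(adt_score({"X": 4.0, "Y": 0.5}, 0.0, tree))   # 0.7, so the sign is +1
```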
Example: Medical Diagnostics
• Cleve dataset from the UC Irvine database.
• Heart disease diagnostics (+1 = healthy, -1 = sick).
• 13 features from tests (real-valued and discrete).
• 303 instances.
AD-tree for heart-disease diagnostics

[Figure: the learned alternating decision tree; total score > 0 means healthy, < 0 means sick.]
Plan of talk (next topic: Data-mining AT&T transaction logs)
AT&T “buisosity” problem
Freund, Mason, Rogers, Pregibon, Cortes 2000
• Distinguish business from residence customers using call-detail information (time of day, length of call, …).
• 230M telephone numbers, label unknown for ~30%.
• 260M calls / day.
• Required computer resources:
  – Huge: counting log entries to produce statistics (uses specialized I/O-efficient sorting algorithms, Hancock).
  – Significant: calculating the classification for ~70M customers.
  – Negligible: learning (2 hours on 10K training examples on an off-line computer).
AD-tree for “buisosity”

[Figure: the learned alternating decision tree for the buisosity problem.]

AD-tree (detail)

[Figure: a magnified portion of the same tree.]
Quantifiable results

Define, for a score threshold $\theta$:
$$\mathrm{Precision}(\theta) = \frac{|\mathrm{TPos}(\theta)|}{|\mathrm{FPos}(\theta)| + |\mathrm{TPos}(\theta)|}, \qquad \mathrm{Recall}(\theta) = \frac{|\mathrm{TPos}(\theta)|}{|\mathrm{AllPos}|}$$

[Plot: precision vs. recall, traced out by varying the score threshold $\theta$.]

• For accuracy 94%, increased coverage from 44% to 56%.
• Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.
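A small sketch of how the precision/recall curve above is traced out by sweeping the score threshold $\theta$ over the classifier's outputs (the scores and labels here are toy values, not AT&T data).

```python
import numpy as np

def precision_recall_curve(scores, labels):
    """Precision and recall at every threshold, for labels in {-1, +1}:
    sort by decreasing score; each prefix is the set predicted positive
    at the corresponding threshold."""
    order = np.argsort(-scores)
    sorted_labels = labels[order]
    tpos = np.cumsum(sorted_labels == 1)        # true positives above threshold
    fpos = np.cumsum(sorted_labels == -1)       # false positives above threshold
    allpos = np.sum(labels == 1)
    precision = tpos / (tpos + fpos)
    recall = tpos / allpos
    return precision, recall

scores = np.array([2.1, 1.5, 0.3, -0.2, -1.0])   # classifier scores (toy values)
labels = np.array([1, 1, -1, 1, -1])
p, r = precision_recall_curve(scores, labels)
```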
Plan of talk (next topic: The I/O bottleneck in data-mining)
The database bottleneck
• Physical limit: a disk “seek” takes 0.01 sec.
  – Same time to read/write 10^5 bytes.
  – Same time to perform 10^7 CPU operations.
• Commercial DBMSs are optimized for varying queries and transactions.
• Statistical analysis requires evaluation of fixed queries on massive data streams.
• Keeping disk I/O sequential is key.
• Data compression improves I/O speed but restricts random access.
CS theory regarding very large data-sets
• Massive datasets: “You pay 1 per disk block you read/write and 0 per CPU operation. Internal memory can store N disk blocks.”
  – Example problem: given a stream of line segments (in the plane), identify all segment pairs that intersect.
  – Vitter, Motwani, Indyk, …
• Property testing: “You can only look at a small fraction of the data.”
  – Example problem: decide whether a given graph is bipartite by testing only a small fraction of the edges.
  – Rubinfeld, Ron, Sudan, Goldreich, Goldwasser, …
Plan of talk (next topic: Resistance of boosting to over-fitting)
A very curious phenomenon

Boosting decision trees: using <10,000 training examples we fit >2,000,000 parameters, yet the combined classifier does not over-fit.

Large margins

$$\mathrm{margin}_{F_T}(x,y) \doteq \frac{y \sum_{t=1}^{T} \alpha_t h_t(x)}{\sum_{t=1}^{T} \alpha_t} = \frac{y\, F_T(x)}{\sum_{t=1}^{T} \alpha_t}$$

$$\mathrm{margin}_{F_T}(x,y) > 0 \;\iff\; f_T(x) = y$$

Thesis: large margins => reliable predictions.
Very similar to SVM.
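Given the stump model from the AdaBoost sketch earlier, the normalized margins above can be computed directly (assuming the $\alpha_t$ are non-negative, as they are whenever the weak requirement holds):

```python
import numpy as np

def margins(model, X, y):
    """Normalized margins y * F_T(x) / sum_t alpha_t for the
    (alpha, feature, threshold, polarity) stump model sketched earlier."""
    F = np.zeros(len(X))
    total_alpha = 0.0
    for alpha, j, thr, pol in model:
        F += alpha * np.where(X[:, j] > thr, pol, -pol)
        total_alpha += alpha
    return (y * F) / total_alpha     # in [-1, +1]; positive iff f_T(x) = y
```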
Experimental Evidence
Theorem
Schapire, Freund, Bartlett & Lee, Annals of Statistics 1998

$H$: set of binary functions with VC-dimension $d$
$$C = \left\{ \sum_i \alpha_i h_i \;\middle|\; h_i \in H,\ \alpha_i \ge 0,\ \sum_i \alpha_i = 1 \right\}$$
$$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}; \quad T \sim D^m$$

For all $c \in C$ and all $\theta > 0$, with probability $1-\delta$ w.r.t. $T \sim D^m$:
$$P_{(x,y)\sim D}\big[\mathrm{sign}(c(x)) \neq y\big] \;\le\; P_{(x,y)\sim T}\big[\mathrm{margin}_c(x,y) \le \theta\big] \;+\; \tilde O\!\left(\frac{\sqrt{d/m}}{\theta}\right) + O\!\left(\log\frac{1}{\delta}\right)$$

No dependence on the number of combined functions!
Idea of Proof
Plan of talk (next topic: Confidence-rated prediction)
A motivating example

[Scatter plot: + and - training points in the plane, with query points marked “?”; points that fall well inside one class are predicted confidently, while those near the class boundary, or far from the data, are marked “Unsure”.]
The algorithm
Freund, Mansour, Schapire, Annals of Statistics, August 2004

Parameters: $\eta > 0$, $\Delta \ge 0$

Hypothesis weight: $\hat w(h) \doteq e^{-\eta\, \hat\varepsilon(h)}$

Empirical log ratio:
$$\hat l(x) \doteq \frac{1}{\eta} \ln\!\left( \frac{\sum_{h:\, h(x)=+1} \hat w(h)}{\sum_{h:\, h(x)=-1} \hat w(h)} \right)$$

Prediction rule:
$$\hat p_{\eta,\Delta}(x) = \begin{cases} +1 & \text{if } \hat l(x) > \Delta \\ \{-1,+1\} & \text{if } |\hat l(x)| \le \Delta \\ -1 & \text{if } \hat l(x) < -\Delta \end{cases}$$
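A sketch of the rule above for a small finite hypothesis class. The η and Δ values are arbitrary placeholders here (the suggested tuning is on the next slide), and the tiny constant only guards against empty sums.

```python
import numpy as np

def confidence_rated_predict(x, hypotheses, train_errors, eta=1.0, delta=0.5):
    """hypotheses: list of functions h(x) -> {-1, +1};
    train_errors: their empirical errors on the training set.
    Returns +1, -1, or the set {-1, +1} (the "unsure" case)."""
    w = np.exp(-eta * np.asarray(train_errors))          # hypothesis weights
    plus = sum(wi for wi, h in zip(w, hypotheses) if h(x) == +1) + 1e-12
    minus = sum(wi for wi, h in zip(w, hypotheses) if h(x) == -1) + 1e-12
    l_hat = np.log(plus / minus) / eta                   # empirical log ratio
    if l_hat > delta:
        return +1
    if l_hat < -delta:
        return -1
    return {-1, +1}                                      # abstain / unsure
```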
Suggested tuning

$\delta$ = probability of failure; $m$ = size of the training set; $H$ is a finite set. Setting: $\eta$ and $\Delta$ are chosen as functions of $m$, $\ln|H|$ and $\ln(1/\delta)$.

Yields:
1) $P(\text{mistake}) \doteq P_{(x,y)\sim D}\big[y \notin \hat p(x)\big] \;\le\; 2\,\varepsilon(h^*) + O\!\left(\left(\frac{\ln|H|\,\ln m}{m}\right)^{1/2}\right)$
2) for $m = \Omega\!\left(\frac{\ln(1/\delta)\,\ln|H|}{\varepsilon(h^*)}\right)$:
   $P(\text{abstain}) \doteq P_{(x,y)\sim D}\big[\hat p(x) = \{-1,+1\}\big] \;\le\; 5\,\varepsilon(h^*) + O\!\left(\left(\frac{\ln|H|\,\ln m}{m}\right)^{1/2}\right)$

where $h^*$ is the best hypothesis in $H$.
Confidence Rating block diagram

[Diagram: the training examples $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$ and a set of candidate rules feed a Rater/Combiner, which outputs a confidence-rated rule.]
Summary of Confidence-Rated Classifiers
• A frequentist explanation for the benefits of model averaging.
• Separates inherent uncertainty from uncertainty due to the finite training set.
• Computational hardness: unknown other than in a few special cases.
• Margins from boosting or SVM can be used as an approximation.
• Many practical applications!
Plan of talk (next topic: Confidence-rating for object recognition)
Face Detection: Using confidence to save time
Viola & Jones 1999
• Paul Viola and Mike Jones developed a face detector that can work in real time (15 frames per second).

[Video demo omitted.]
Image Features

“Rectangle filters”, similar to Haar wavelets (Papageorgiou et al.)

$$h_t(x_i) = \begin{cases} \alpha_t & \text{if } f_t(x_i) > \theta_t \\ \beta_t & \text{otherwise} \end{cases}$$

$60{,}000 \times 100 = 6{,}000{,}000$ unique binary features
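A sketch of what a rectangle filter and the thresholded weak rule look like in code. The integral-image trick below is a standard way such filters are evaluated in constant time (it is not described on the slide), and the two-rectangle layout is just one illustrative example.

```python
import numpy as np

def integral_image(img):
    """Integral image with a leading row/column of zeros, so that
    rect_sum can look up any rectangle sum in O(1)."""
    return np.pad(img, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of pixels in rows [r0, r1) and columns [c0, c1)."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

def two_rect_feature(ii, r0, c0, r1, c1):
    """A Haar-like two-rectangle filter: left half minus right half."""
    cm = (c0 + c1) // 2
    return rect_sum(ii, r0, c0, r1, cm) - rect_sum(ii, r0, cm, r1, c1)

def weak_rule(f_value, theta, alpha, beta):
    """h_t(x) = alpha_t if f_t(x) > theta_t, else beta_t (as on the slide)."""
    return alpha if f_value > theta else beta
```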
Example Classifier for Face Detection
• A classifier with 200 rectangle features was learned using AdaBoost.
• 95% correct detection on the test set with 1 in 14,084 false positives.
• Not quite competitive...

[Figure: ROC curve for the 200-feature classifier.]
Employing a cascade to minimize average detection time

The accurate detector combines 6000 simple features using AdaBoost. In most boxes, only 8-9 features are calculated.

[Diagram: all boxes are first scored with features 1-3; boxes that are definitely not a face are rejected immediately, while boxes that might be a face go on to features 4-10, and so on.]
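A sketch of the cascade logic: the stage structure and feature functions are placeholders, but the control flow, where most windows are rejected after the first cheap stage, is the point of the slide.

```python
def cascade_detect(window, stages):
    """stages: list of (features, threshold) pairs, ordered cheap to
    expensive; each feature is a function of the window returning a score.
    A window is rejected as soon as one stage's total falls below its
    threshold, so most windows only pay for the first few features."""
    for features, threshold in stages:
        if sum(f(window) for f in features) < threshold:
            return False            # definitely not a face
    return True                     # passed every stage: might be a face
```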
Using confidence to avoid labeling
Levin, Viola, Freund 2003

[Figures: Image 1; Image 1, difference from time average; Image 2; Image 2, difference from time average.]
Co-training
Blum and Mitchell 98

[Diagram: highway images are fed to two partially trained classifiers, one based on the grey-scale (B/W) image and one based on the difference image; each classifier's confident predictions are used to further train the other.]
Subtract-average detection score

[Scatter plot: subtract-average detection score vs. grey-scale detection score, with cars and non-cars marked.]
Co-Training Results

[Figures: detection results for the raw-image detector and the difference-image detector, before and after co-training.]
Plan of talk (next topic: Gene regulation modeling)
Gene Regulation
• Regulatory proteins bind to the non-coding regulatory sequence of a gene to control its rate of transcription.

[Diagram: regulators bind to binding sites on the DNA; the gene is transcribed into an mRNA transcript, the measurable quantity.]
From mRNA to Protein

[Diagram: the ribosome translates the mRNA transcript into a protein sequence, which folds into a protein.]
Protein Transcription Factors

[Diagram: some proteins act as regulators (transcription factors) of other genes.]
Genome-wide Expression Data
• Microarrays measure mRNA transcript expression levels for all of the ~6000 yeast genes at once.
• Very noisy data.
• Rough time slice over all compartments of many cells.
• Protein expression is not observed.
Partial “Parts List” for Yeast

Many known and putative:
– Transcription factors (TF)
– Signaling molecules that activate transcription factors (SM)
– Known and putative binding site “motifs” (MTF)
– In yeast, regulatory sequence = 500 bp upstream region
GeneClass: Problem Formulation
M. Middendorf, A. Kundaje, C. Wiggins, Y. Freund, C. Leslie.
Predicting Genetic Regulatory Response Using Classification. ISMB 2004.

Predict target gene regulatory response from regulator activity and binding site data.

[Diagram: binding-site motifs in the upstream region of each target gene G1 ... Gt, “parent” (regulator) gene expression R1 ... Rp, and target gene expression measured from the microarray image.]
Role of quantization

By quantizing expression into three classes {-1, 0, +1} we reduce noise but keep most of the signal. Weighting the +1/-1 examples linearly with expression level performs slightly better.
Problem setup
• Data point = target gene × microarray experiment
• Input features:
  – Parent state {-1, 0, +1}
  – Motif presence {0, 1}
• Predicted output:
  – Target gene {-1, +1}
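A sketch of how the training examples can be assembled from the two input matrices; the cutoff value and array shapes are illustrative assumptions, not the values used in the GeneClass paper.

```python
import numpy as np

def quantize(log_ratio, cutoff=0.5):
    """Quantize log expression ratios into {-1, 0, +1} (down / baseline / up).
    The cutoff here is an illustrative placeholder."""
    return np.where(log_ratio > cutoff, 1, np.where(log_ratio < -cutoff, -1, 0))

def build_examples(parent_expr, motif_presence, target_expr, cutoff=0.5):
    """One example per (target gene g, microarray experiment e) pair:
    inputs are the quantized parent states for e plus gene g's
    motif-presence indicators; the label is g's +1/-1 response in e
    (unresponsive, i.e. 0, pairs are dropped).
    parent_expr: (n_parents, n_experiments), motif_presence: (n_genes, n_motifs),
    target_expr: (n_genes, n_experiments)."""
    parent_state = quantize(parent_expr, cutoff)
    target_state = quantize(target_expr, cutoff)
    X, y = [], []
    n_genes, n_experiments = target_expr.shape
    for g in range(n_genes):
        for e in range(n_experiments):
            if target_state[g, e] == 0:
                continue
            X.append(np.concatenate([parent_state[:, e], motif_presence[g]]))
            y.append(target_state[g, e])
    return np.array(X), np.array(y)
```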
Boosting with Alternating Decision Trees (ADTs)
• Use boosting to build a single ADT, a margin-based generalization of a decision tree.

[Diagram: splitter nodes (e.g. “Is Motif MIG1 present AND Parent XBP1 up?”) and prediction nodes; F(x) is given by the sum of the prediction nodes along all paths consistent with x.]
Statistical Validation
• 10-fold cross-validation experiments, ~50,000 (gene, microarray) training examples.
• Significant correlation between prediction score and true log expression ratio on held-out data.
• Prediction accuracy on +1/-1 labels: 88.5%.
Biological Interpretation
From correlation to causation
• Good prediction only implies correlation.
• To infer causation we need to integrate additional knowledge.
• Comparative case studies: train on similar conditions (stresses), test on related experiments.
• Extract significant features from the learned model:
  – Iteration score (IS): the boosting iteration at which a feature first appears; identifies significant motifs and motif-parent pairs.
  – Abundance score (AS): the number of nodes in the ADT containing the feature; identifies important regulators.
• In silico knock-outs: remove a significant regulator and retrain.
Case Study: Heat Shock and Osmolarity
• Training set: heat shock, osmolarity, amino acid starvation.
• Test set: stationary phase, simultaneous heat shock + osmolarity.
• Results:
  – Test error = 9.3%.
  – Supports the Gasch hypothesis: heat shock and osmolarity pathways are independent and additive.
  – High-scoring parents (AS): USV1 (stationary phase and heat shock), PPT1 (osmolarity response), GAC1 (response to heat).
Case Study: Heat Shock and Osmolarity (continued)
• Results:
  – High-scoring binding sites (IS):
    MSN2/MSN4 STRE element
    Heat shock related: HSF1 and RAP1 binding sites
    Osmolarity/glycerol pathways: CAT8, MIG1, GCN4
    Amino acid starvation: GCN4, CHA4, MET31
  – High-scoring motif-parent pair (IS):
    TPK1~STRE pair (a kinase that regulates MSN2 via cellular localization): an indirect effect.

[Diagram: three ways a parent P can relate to a motif: direct binding (P binds Mp), indirect effect (P acts on a TF whose motif MTF is bound), and co-occurrence.]
Case Study: In silico knockout
• Training and test sets: same as the heat shock and osmolarity case study.
• Knockout: remove USV1 from the regulator list and retrain.
• Results:
  – Test error: 12% (up from 9%).
  – Identify putative downstream targets of USV1: target genes that change from a correct to an incorrect label.
  – GO annotation analysis reveals putative functions: nucleoside transport, cell-wall organization and biogenesis, heat-shock protein activity.
  – Putative functions match those identified in the wet-lab USV1 knockout (Segal et al., 2003).
Conclusions: Gene Regulation
• New predictive model for the study of gene regulation:
  – First gene regulation model to make quantitative predictions.
  – Uses actual expression levels, no clustering.
  – Strong prediction accuracy on held-out experiments.
  – Interpretable hypotheses: significant regulators, binding motifs, regulator-motif pairs.
• New methodology for biological analysis: comparative training/test studies, in silico knockouts.
Plan of talk (next topic: Summary)
Summary
• Moving from density estimation to classification can make hard problems tractable.
• Boosting is an efficient and flexible method for constructing complex and accurate classifiers.
• I/O is the main bottleneck to data-mining.
  – Sampling, data localization and parallelization help.
• Correlation -> causation is still a hard problem; it requires domain-specific expertise and integration of data sources.
Future work
• New applications:
  – Bio-informatics.
  – Vision, speech and signal processing.
  – Information retrieval and information extraction.
• Theory:
  – Improving the robustness of learning algorithms.
  – Utilization of unlabeled examples in confidence-rated classification.
  – Sequential experimental design.
  – Relationships between learning algorithms and stochastic differential equations.
Extra
Plan of talk (extra topic: High-energy physics)
Analysis for the MiniBooNE experiment
• Goal: test for neutrino mass by searching for neutrino oscillations.
• Important because it may lead us to physics beyond the Standard Model.
• The BooNE project began in 1997.
• The first beam-induced neutrino events were detected in September 2002.

[Photo: the MiniBooNE detector (Fermi Lab).]
MiniBooNE Classification Task
Ion Stancu, UC Riverside

[Diagram: a simulation of the MiniBooNE detector feeds event reconstruction, producing a feature vector of 52 reals; a neural network with 52 inputs and 26 hidden units classifies each event as ν_e or other.]

Results

[Figure omitted.]
Using confidence to reduce labeling

[Diagram: unlabeled data is scored by a partially trained classifier; a sample of the unconfident examples is labeled and added to the labeled examples used for further training.]

Query-by-committee: Seung, Opper & Sompolinsky; Freund, Seung, Shamir & Tishby.
Discriminative approach

[Plot repeated from earlier: number of mistakes as a function of the voice-pitch threshold.]
Results from Yotam Abramson.