Introduction to Boosting


Slides adapted from Che Wanxiang (车万翔) at HIT and Robin Dhamankar. Many thanks!
Ideas

Boosting is considered to be one of the most significant developments in machine learning.
Finding many weak rules of thumb is easier than finding a single, highly accurate prediction rule.
The key is in how to combine the weak rules.
Boosting (Algorithm)

W(x) is the distribution of weights over the N training points, with ∑ W(xi) = 1.
Initially assign uniform weights W0(x) = 1/N for all x; set step k = 0.
At each iteration k:
  Find the best weak classifier Ck(x) using the weights Wk(x), with error rate εk measured by a loss function.
  Compute αk, the weight of classifier Ck in the final hypothesis.
  For each xi, update the weights based on εk to get Wk+1(xi).
CFINAL(x) = sign [ ∑ αi Ci(x) ]
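The final weighted vote CFINAL(x) = sign [ ∑ αi Ci(x) ] can be sketched in plain Python. The stump classifiers and the weights below are made-up values purely for illustration; in a real run the αk come from the error rates:

```python
# Hypothetical weak classifiers: decision stumps on a 1-D input.
weak_classifiers = [
    lambda x: 1 if x > 2 else -1,
    lambda x: 1 if x > 5 else -1,
    lambda x: -1 if x > 8 else 1,
]
alphas = [0.9, 0.6, 0.3]  # illustrative weights; derived from eps_k in practice

def c_final(x):
    """C_FINAL(x) = sign( sum_k alpha_k * C_k(x) )."""
    s = sum(a * c(x) for a, c in zip(alphas, weak_classifiers))
    return 1 if s > 0 else -1

print(c_final(6), c_final(1))  # -> 1 -1
```

Note that c_final(6) is +1 even though one stump votes −1: the weighted majority overrules the weakest rule.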
Boosting As Additive Model

The final prediction in boosting f(x) can be expressed as an additive expansion of individual classifiers:

f(x) = ∑m=1..M βm b(x; γm)

The process is iterative and can be expressed as follows:

fm(x) = fm−1(x) + βm b(x; γm)

Typically we would try to minimize a loss function on the training examples:

min{βm, γm}1..M ∑i=1..N L( yi , ∑m=1..M βm b(xi; γm) )
Boosting As Additive Model

Simple case: squared-error loss

L(y, f(x)) = ½ (y − f(x))²

Forward stage-wise modeling amounts to just fitting the residuals from the previous iteration:

L( yi , fm−1(xi) + β b(xi; γ) )
= ½ ( yi − fm−1(xi) − β b(xi; γ) )²
= ½ ( rim − β b(xi; γ) )²

where rim = yi − fm−1(xi) is the residual of the current model on the i-th example.

Squared-error loss is not robust for classification.
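The residual-fitting view can be sketched in plain Python, using least-squares threshold stumps as the base learner b(x; γ). The toy data and the 30-round budget are illustrative choices, not part of the original slides:

```python
def fit_stump(xs, r):
    """Least-squares stump b(x; gamma): split at a threshold,
    predict the mean residual on each side."""
    best = None
    for t in xs:  # candidate thresholds at the data points
        left = [ri for xi, ri in zip(xs, r) if xi <= t]
        right = [ri for xi, ri in zip(xs, r) if xi > t]
        lmean = sum(left) / len(left) if left else 0.0
        rmean = sum(right) / len(right) if right else 0.0
        sse = sum((ri - (lmean if xi <= t else rmean)) ** 2
                  for xi, ri in zip(xs, r))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

# Toy 1-D regression data (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]

f = [0.0] * len(xs)  # f_0 = 0
for m in range(30):
    r = [yi - fi for yi, fi in zip(ys, f)]      # r_im = y_i - f_{m-1}(x_i)
    b = fit_stump(xs, r)                        # fit the residuals
    f = [fi + b(xi) for fi, xi in zip(f, xs)]   # f_m = f_{m-1} + b  (beta_m = 1)

mse = sum((yi - fi) ** 2 for yi, fi in zip(ys, f)) / len(ys)
```

Each round fits a stump to what the current model still gets wrong, so the training MSE shrinks monotonically as stages are added.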
Boosting As Additive Model

AdaBoost for classification:

L(y, f(x)) = exp(−y · f(x)) — the exponential loss function

arg minf ∑i=1..N L( yi , f(xi) )
= arg minβ,Gm ∑i=1..N exp( −yi [ fm−1(xi) + β Gm(xi) ] )
= arg minβ,Gm ∑i=1..N exp( −yi fm−1(xi) ) · exp( −yi β Gm(xi) )
Boosting As Additive Model

First assume that β is constant, and minimize w.r.t. G:

arg minβ,Gm ∑i=1..N exp( −yi fm−1(xi) ) · exp( −yi β Gm(xi) )
= arg minβ,Gm ∑i=1..N wi(m) exp( −yi β Gm(xi) ),  where wi(m) = exp( −yi fm−1(xi) )
= arg minG [ e−β ∑yi=G(xi) wi(m) + eβ ∑yi≠G(xi) wi(m) ]
= arg minG [ (eβ − e−β) ∑i=1..N wi(m) I( yi ≠ G(xi) ) + e−β ∑i=1..N wi(m) ]
Boosting As Additive Model

arg minG [ (eβ − e−β) ∑i=1..N wi(m) I( yi ≠ G(xi) ) + e−β ∑i=1..N wi(m) ]
= arg minG [ (eβ − e−β) · errm + e−β ] ≡ arg minG H(β)

errm is the training error on the weighted samples: errm = ∑i wi(m) I( yi ≠ G(xi) ) / ∑i wi(m) (dividing through by the constant ∑i wi(m) does not change the minimizer).

The last equation tells us that in each iteration we must find a classifier that minimizes the training error on the weighted samples.
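That conclusion — pick the weak classifier minimizing the weighted training error — can be sketched for threshold stumps. The helper name and toy data are illustrative, not from the slides:

```python
def best_stump(xs, ys, w):
    """Return (err, t, sign) for the stump G(x) = sign if x > t else -sign
    minimizing sum_i w_i * I(y_i != G(x_i)) / sum_i w_i."""
    total = sum(w)
    best = None
    for t in xs:                    # candidate thresholds
        for sign in (1, -1):        # stump orientation
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if (sign if xi > t else -sign) != yi) / total
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

xs = [1, 2, 3, 4, 5, 6]
ys = [-1, -1, -1, 1, 1, 1]
w = [1 / 6] * 6                     # uniform weights
err, t, sign = best_stump(xs, ys, w)
```

On this separable toy set the search finds the perfect split (threshold 3, positive orientation) with zero weighted error; with non-uniform weights the same search trades errors on light points for accuracy on heavy ones.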
Boosting As Additive Model

Now that we have found G, we minimize w.r.t. β:

H(β) = (eβ − e−β) · errm + e−β
∂H/∂β = (eβ + e−β) · errm − e−β = 0

Multiplying through by eβ:

(e2β + 1) · errm − 1 = 0
e2β = (1 − errm) / errm
β = ½ log( (1 − errm) / errm )
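The closed form for β can be checked numerically; the error rates below are made-up examples. A weak classifier barely better than chance gets a weight near zero, while a more accurate one gets a large weight:

```python
import math

def beta(err_m):
    """beta = 1/2 * log((1 - err_m) / err_m), valid for 0 < err_m < 1."""
    return 0.5 * math.log((1 - err_m) / err_m)

print(round(beta(0.45), 4))   # 0.1003 -- barely better than chance
print(round(beta(0.10), 4))   # 1.0986 -- accurate classifier (= ln(9) / 2)
```

Note β → 0 as errm → ½ and β → ∞ as errm → 0, matching the intuition that useless classifiers get no vote and perfect ones dominate.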
AdaBoost (Algorithm)

W(x) is the distribution of weights over the N training points, with ∑ W(xi) = 1.
Initially assign uniform weights W0(x) = 1/N for all x.
At each iteration k:
  Find the best weak classifier Ck(x) using the weights Wk(x).
  Compute εk, the error rate, as εk = [ ∑ W(xi) · I(yi ≠ Ck(xi)) ] / [ ∑ W(xi) ]
  Set αk, the weight of classifier Ck in the final hypothesis, to αk = log( (1 − εk) / εk )
  For each xi, update Wk+1(xi) = Wk(xi) · exp[ αk · I(yi ≠ Ck(xi)) ]
CFINAL(x) = sign [ ∑ αi Ci(x) ]
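Putting the steps together, here is a self-contained sketch of the algorithm above with decision stumps as the weak learners. The toy data is made up so that no single stump can classify it perfectly, forcing AdaBoost to combine several:

```python
import math

def stump(t, sign):
    """Decision stump C(x): predict `sign` for x > t, `-sign` otherwise."""
    return lambda x: sign if x > t else -sign

def adaboost(xs, ys, rounds):
    n = len(xs)
    w = [1.0 / n] * n                                  # W_0(x) = 1/N
    classifiers, alphas = [], []
    for _ in range(rounds):
        # Find the best weak classifier C_k under the current weights W_k.
        best = None
        for t in xs:
            for sign in (1, -1):
                c = stump(t, sign)
                eps = sum(wi for xi, yi, wi in zip(xs, ys, w)
                          if c(xi) != yi) / sum(w)
                if best is None or eps < best[0]:
                    best = (eps, c)
        eps, c = best
        if eps == 0.0:              # a perfect stump: keep it and stop early
            classifiers.append(c)
            alphas.append(1.0)
            break
        alpha = math.log((1 - eps) / eps)              # alpha_k
        classifiers.append(c)
        alphas.append(alpha)
        # Up-weight the points C_k got wrong, then renormalize.
        w = [wi * math.exp(alpha) if c(xi) != yi else wi
             for wi, xi, yi in zip(w, xs, ys)]
        z = sum(w)
        w = [wi / z for wi in w]

    def c_final(x):                 # C_FINAL(x) = sign( sum_k alpha_k C_k(x) )
        s = sum(a * c(x) for a, c in zip(alphas, classifiers))
        return 1 if s > 0 else -1
    return c_final

# -1s on both ends: an interval target no single threshold can separate.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [-1, -1, 1, 1, 1, 1, -1, -1]
clf = adaboost(xs, ys, rounds=10)
train_err = sum(clf(x) != y for x, y in zip(xs, ys)) / len(xs)
```

Every individual stump misclassifies at least a quarter of these points, yet the weighted vote reaches zero training error within a few rounds.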
AdaBoost (Example)
Original Training set : Equal Weights to all training samples
Taken from “A Tutorial on Boosting” by Yoav Freund and Rob Schapire
AdaBoost (Example)
ROUND 1
AdaBoost (Example)
ROUND 2
AdaBoost (Example)
ROUND 3
AdaBoost (Example)
AdaBoost (Characteristics)

Why the exponential loss function?

Computational:
  Simple modular re-weighting.
  The derivative is easy, so determining the optimal parameters is relatively easy.
Statistical:
  In the two-label case, f(x) estimates one half the log odds of P(Y=1|x), so we can use its sign as the classification rule.

Accuracy depends upon the number of iterations (how sensitive — we will see soon).
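The statistical point — that f(x) estimates one half the log odds of P(Y=1|x) — can be checked directly: the sign of ½ log(p / (1 − p)) is positive exactly when p > ½, recovering the Bayes rule. The probabilities below are made up for the check:

```python
import math

def half_log_odds(p):
    """f(x) = 1/2 * log( P(Y=1|x) / P(Y=-1|x) ), with p = P(Y=1|x)."""
    return 0.5 * math.log(p / (1 - p))

# sign(f) recovers "predict 1 iff P(Y=1|x) > 1/2".
for p in (0.1, 0.4, 0.6, 0.9):
    assert (half_log_odds(p) > 0) == (p > 0.5)
```

Inverting the relation gives p = 1 / (1 + exp(−2 f)), which is why the magnitude of f can also be read as a confidence.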
Boosting performance

Decision stumps are very simple rules of thumb that test a condition on a single attribute.
Decision stumps formed the individual classifiers whose predictions were combined to generate the final prediction.
The misclassification rate of the boosting algorithm was plotted against the number of iterations performed.
Boosting performance

Steep decrease in error.

Boosting performance

Pondering over how many iterations would be sufficient…
Observations

The first few (about 50) iterations increase the accuracy substantially, as seen in the steep decrease in the misclassification rate.
As iterations increase, does the training error keep decreasing? Does the generalization error keep decreasing?
Can Boosting do well if?

Limited training data?
  Probably not…
Many missing values?
Noise in the data?
Individual classifiers not very accurate?
  It could, if the individual classifiers have considerable mutual disagreement.
Application: Data mining

Challenges in real-world data mining problems:
  Data has a large number of observations and a large number of variables on each observation.
  Inputs are a mixture of various different kinds of variables.
  Missing values, outliers, and variables with skewed distributions.
  Results are to be obtained fast, and they should be interpretable.
So off-the-shelf techniques are difficult to come up with.
Boosted decision trees (AdaBoost or MART) come close to an off-the-shelf technique for data mining.
AT&T “May I help you?”