Transcript Slides

Machine Learning in Practice
Lecture 24
Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute
Plan for the day

- Announcements
- Questions?
- This lecture: Finish Chapter 7
- Next Lecture: Cover Chapter 8
- Following Lecture: Mid-term review
- Final 3 Lectures: More Applications of Machine Learning
- Optimization Shortcut!
- Ensemble Methods
- Semi-Supervised Learning

Optimization Shortcut!
Using CVParameterSelection
- You have to know what the command line options look like.
- You can find out online or in the Experimenter.
- Don’t forget to click Add!
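For reference, a parameter specification added in the CVParameters field generally looks like an option letter followed by a lower bound, an upper bound, and a number of steps; the exact syntax is worth double-checking online or in the Experimenter, as the slide says. A hypothetical example for tuning J48:

    C 0.1 0.5 5    (try the pruning confidence -C from 0.1 to 0.5 in 5 steps)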
Using CVParameterSelection
- Best setting over the whole set
- * Tuned performance
Ensemble Methods
Is noise always bad?
http://www.city.vancouver.bc.ca/ctyclerk/cclerk/970513/citynoisereport/noise2.gif
- Note: if the testing data will be noisy, it is helpful to add similar noise to the training data – the learner then learns which features to “trust”
- Noise also plays a key role in ensemble methods…

Simulated Annealing
http://biology.st-andrews.ac.uk/vannesmithlab/simanneal.png

Key idea: combine multiple views on the same data in order to increase reliability
Ensemble Methods: Combining Multiple Models
- Current area of active research in machine learning
- Bagging and Boosting are both ways of training a “committee” of classifiers and then combining their predictions
  - For classification, they vote
  - For regression, they average
- Stacking is where you train multiple classifiers on the same data and combine predictions with a trained model
Ensemble Methods: Combining Multiple Models
- In Bagging, all models have equal weight
- In Boosting, more successful models are given more weight
- In Stacking, a trained classifier assigns “weights”
- Weka has several meta classifiers for forms of boosting, one for bagging, and one for stacking
Multiple Models from the Same Data
- Random selection with replacement
  - You can create as many data sets as you want, of the size you want (sort of!)
- Bagging = “Bootstrap Aggregating”: create new datasets by resampling the data with replacement
  - From n data points, create t datasets of size n
  - Trained models will differ in the places where the models depend on quirks of the data
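As a rough illustration (a minimal Python sketch, not Weka’s implementation), the resampling step looks like this: from n data points, draw t bootstrap samples of size n with replacement, then train one model per sample.

    import random

    def bootstrap_samples(data, t, seed=0):
        """Draw t bootstrap samples, each the same size as the original data."""
        rng = random.Random(seed)
        n = len(data)
        return [[data[rng.randrange(n)] for _ in range(n)] for _ in range(t)]

    # Example: 5 bootstrap replicates of a toy dataset of 10 labeled points.
    data = [(i, i % 2) for i in range(10)]        # (feature, label) pairs
    samples = bootstrap_samples(data, t=5)
    # Train one model per sample; at prediction time the models vote.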

Multiple Models from the Same Data
- Reduces the effects of noise and avoids overfitting
- Bagging helps most with unstable learning algorithms
- Sometimes bagging has a better effect if you increase the level of instability in the learner (by adding noise to the data, turning off pruning, or reducing pruning)
Bagging and Probabilities
- Bagging works well when the output of the classifier is a probability estimate and the decision can be made probabilistically rather than by voting
  - Voting approach: each model votes on one class
  - Probability approach: each model contributes a distribution of predictions
[Figure: five bootstrap samples, Set 1 through Set 5, each contributing a prediction]
Bagging and Probabilities
- Bagging also produces good probability estimates as output, so it works well with cost sensitive classification
  - Even if the models only contribute one vote, you can compute a probability from the proportion of models that voted the same way
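A small Python sketch of the two combination rules, using made-up outputs from five bagged models: hard voting counts one vote per model (and the vote share gives a rough probability), while the probability approach averages each model’s class distribution.

    from collections import Counter

    # Hypothetical outputs from five bagged models for one test instance.
    votes = ["yes", "yes", "no", "yes", "no"]                  # hard votes
    dists = [{"yes": 0.9, "no": 0.1}, {"yes": 0.6, "no": 0.4},
             {"yes": 0.4, "no": 0.6}, {"yes": 0.7, "no": 0.3},
             {"yes": 0.2, "no": 0.8}]                          # per-model distributions

    # Voting: majority class, with the vote share as a crude probability.
    vote_class, vote_count = Counter(votes).most_common(1)[0]
    print(vote_class, vote_count / len(votes))                 # yes 0.6

    # Probability approach: average the distributions, then take the argmax.
    avg = {c: sum(d[c] for d in dists) / len(dists) for c in dists[0]}
    print(max(avg, key=avg.get), avg)                          # yes, with averaged scores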
Bagging and Probabilities
- MetaCost is a similar idea – and is easier to analyze
  - Does a cost sensitive version of bagging and uses this to relabel the data
  - Trains a model on the relabeled data
  - New model inherits cost sensitivity from the labels it is trained on
  - Tends to work better than standard cost sensitive classification
Bagging and Probabilities
- A slightly different option is to train Option Trees that build a “packed shared forest” of decision trees that explicitly represent the choice points
  - A packed tree is really the same as a set of trees.
Randomness and Greedy Algorithms
- Randomization of greedy algorithms: rather than always selecting the best looking next action, select one of the top N
  - More randomness means models are based less on the data, which means each individual model is less accurate
  - But if you do it several times, it will be different each time, so you can use this to do something like Bagging
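A minimal Python sketch of the trick, assuming a hypothetical scoring function over candidate actions: instead of taking the single best-scoring candidate, pick randomly among the top N, so repeated runs build different models.

    import random

    def randomized_greedy_choice(candidates, score, n=3, rng=random):
        """Pick one of the top-n candidates by score instead of the single best."""
        ranked = sorted(candidates, key=score, reverse=True)
        return rng.choice(ranked[:n])

    # Example with a hypothetical feature-scoring function: each call may pick
    # a different one of the three best features, giving a diverse ensemble.
    features = ["f1", "f2", "f3", "f4", "f5"]
    info_gain = {"f1": 0.9, "f2": 0.85, "f3": 0.8, "f4": 0.2, "f5": 0.1}.get
    print(randomized_greedy_choice(features, info_gain, n=3))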
Randomization and Nearest Neighbor Methods
- Standard bagging does not help much with nearest neighbor classifiers because they are not unstable in the right way
  - Since predictions are based on k neighbors, small perturbations in the data don’t have a big effect on decision making
Randomization and Nearest Neighbor Methods
- The trick is to randomize in a way that makes the classifier diverse without sacrificing accuracy
  - With nearest neighbor methods, it works well to randomize the selection of a subset of features used to compute the distance between instances
  - Each selection gives you a very different view of your data
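A sketch of this random-subspace idea, using scikit-learn’s k-nearest-neighbor classifier as a stand-in base learner (an assumption of the example, not part of the lecture): each committee member is trained on a different random subset of the feature columns, and the committee votes.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def fit_random_subspace_knn(X, y, n_models=10, subset_frac=0.5, k=3, seed=0):
        """Train kNN models, each on a random subset of the feature columns."""
        rng = np.random.default_rng(seed)
        n_features = X.shape[1]
        size = max(1, int(subset_frac * n_features))
        models = []
        for _ in range(n_models):
            cols = rng.choice(n_features, size=size, replace=False)
            knn = KNeighborsClassifier(n_neighbors=k).fit(X[:, cols], y)
            models.append((cols, knn))
        return models

    def predict_vote(models, X):
        """Majority vote over the per-subset kNN predictions.
        Class labels are assumed to be small non-negative integers."""
        preds = np.array([m.predict(X[:, cols]) for cols, m in models])
        return np.array([np.bincount(col).argmax() for col in preds.T])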
Boosting
- Boosting is similar to bagging in that it trains multiple models and then combines the predictions
- It specifically seeks to train multiple models that complement each other
- In boosting, a series of models is trained, and each trained model is influenced by the strengths and weaknesses of the previous model
  - New models should be experts in classifying examples that the previous model got wrong
- In the final vote, model predictions are weighted based on each model’s performance
AdaBoost
- Assigning weights to instances is a way to get a classifier to pay more attention to some instances than others
- Remember that in boosting, models are trained in a sequence
  - Reweighting: Model x+1 weights examples that Model x and previous classifiers got wrong higher than the ones that were treated correctly more often
  - Resampling: errors affect the probability of selecting an example, but the classifier treats each instance in the selected sample with the same importance
AdaBoost
- The amount of reweighting depends on the extent of the errors
- With reweighting, you use each example once, but the examples are weighted differently
- With resampling, you do selection with replacement like in Bagging, but the probability is affected by the “weight” assigned to an example
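A compact sketch of the reweighting loop in the style of AdaBoost.M1, using scikit-learn decision stumps as the weak learner (the exact formulas follow the standard textbook description and are an assumption of this sketch; labels are coded as -1/+1).

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, y, n_rounds=10):
        """Very small AdaBoost.M1-style sketch; y must be coded as -1/+1."""
        n = len(y)
        w = np.full(n, 1.0 / n)                 # instance weights
        models, alphas = [], []
        for _ in range(n_rounds):
            stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = np.sum(w[pred != y]) / np.sum(w)
            if err >= 0.5:                      # no better than chance: stop
                break
            alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
            w *= np.exp(-alpha * y * pred)      # upweight mistakes, downweight hits
            w /= w.sum()
            models.append(stump)
            alphas.append(alpha)
        return models, alphas

    def predict(models, alphas, X):
        """Weighted vote: more accurate rounds get a larger say."""
        scores = sum(a * m.predict(X) for m, a in zip(models, alphas))
        return np.sign(scores)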
More about Boosting
- The more iterations, the more confident the trained classifier will be in its predictions (since it will have more experts voting)
  - This is true even beyond the point where the error on the training data goes down to 0
  - Because of that, it might be helpful to have a validation set for tuning
- On the other hand, sometimes Boosting overfits
  - That’s another reason why it is helpful to have a validation set
- Boosting can turn a weak classifier into a strong classifier
Why does Boosting work?
- You can learn a very complex model all at once
- Or you can learn a sequence of simpler models
  - When you combine the simple models, you get a more complex model
  - The advantage is that at each stage, the search is more constrained
  - Sort of like a “divide-and-conquer” approach
Boosting and Additive Regression
- Boosting is a form of forward, stagewise, additive modeling
- LogitBoost is like AdaBoost except that it uses a regression model as the base classifier, whereas AdaBoost uses a classification model
Boosting and Additive Regression
- Additive regression is when you:
  1. train a regression equation
  2. then train another to predict the residuals
  3. then another, and so on
  4. and then add the predictions together
- With additive regression, the more iterations, the better you do on the training data
  - but you might overfit
  - You can get around this with cross validation
  - You can also reduce the chance of overfitting by decreasing the size of the increment each time – but the run time is slower
  - Same idea as the momentum and learning rate parameters in multi-layer perceptrons
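A minimal sketch of additive regression, with scikit-learn regression trees as an assumed base learner; the shrinkage factor plays the role of the “smaller increment” mentioned above.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def additive_regression(X, y, n_stages=20, shrinkage=0.5):
        """Fit a chain of regressors, each trained on the previous residuals."""
        residual = y.astype(float).copy()
        stages = []
        for _ in range(n_stages):
            model = DecisionTreeRegressor(max_depth=2).fit(X, residual)
            residual -= shrinkage * model.predict(X)   # what is still unexplained
            stages.append(model)
        return stages

    def predict(stages, X, shrinkage=0.5):
        """Add up the (shrunken) predictions of every stage."""
        return sum(shrinkage * m.predict(X) for m in stages)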
Stacking
- Stacking combines the predictions of multiple learning methods over the same data
  - Rather than manipulating the training data as in bagging and boosting
- Use several different learners to add labels to your data using cross validation
- Then train a meta-learner to make an “intelligent guess” based on the pattern of predictions it sees
  - The meta-learner can usually be a simple algorithm
Stacking
- A more careful option is to train the level 0 classifiers on the training data, and train the meta-learner on validation data
- The trained model will make predictions about novel examples by first applying the level 0 classifiers to the test data and then applying the meta-learner to those labels
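A sketch of stacking using scikit-learn (the particular level-0 learners and the logistic-regression meta-learner are arbitrary illustrative choices): cross-validated predictions provide the meta-features, so the meta-learner never sees a prediction that a level-0 model made for an instance it was trained on.

    import numpy as np
    from sklearn.model_selection import cross_val_predict
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression

    def fit_stacking(X, y, level0=None):
        """Train level-0 learners plus a simple meta-learner on their CV predictions."""
        level0 = level0 or [DecisionTreeClassifier(max_depth=3), GaussianNB()]
        # Cross-validated predictions keep the meta-learner honest.
        meta_features = np.column_stack(
            [cross_val_predict(m, X, y, cv=5) for m in level0])
        meta = LogisticRegression().fit(meta_features, y)
        for m in level0:                       # refit level-0 models on all the data
            m.fit(X, y)
        return level0, meta

    def predict_stacking(level0, meta, X):
        meta_features = np.column_stack([m.predict(X) for m in level0])
        return meta.predict(meta_features)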
Error Correcting Output Codes

  Class   One-vs-all code   Error-correcting code
  A       1000              1111111
  B       0100              0000111
  C       0010              0011001
  D       0001              0101010

- Instead of training 4 classifiers, you train 7
- Look at the pattern of results and pick the class with the most similar pattern (avoids ad hoc tie breakers)
- So if one classifier makes a mistake, you can usually compensate for it with the others
Error Correcting Output Codes
- Because the classifiers are making different comparisons, they will make errors in different places
- It’s like training subclassifiers to make individual pairwise comparisons to resolve conflicts
  - But it always trains models on all of the data rather than part
- A good error correcting code has good row separation and column separation (so you need at least 4 class distinctions before you can achieve this)
  - Separation is computed using Hamming distance
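A minimal Python sketch of the decoding step, using the code table above: each of the 7 classifiers outputs one bit, and the predicted class is the row whose code word has the smallest Hamming distance to the observed bit pattern.

    # Error-correcting output code from the table above: one row per class,
    # one column per binary classifier.
    CODE = {"A": "1111111", "B": "0000111", "C": "0011001", "D": "0101010"}

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def decode(bits):
        """Pick the class whose code word is closest to the observed bit pattern."""
        return min(CODE, key=lambda cls: hamming(CODE[cls], bits))

    # Example: the classifiers output 1011111 -- one bit away from A's code word,
    # so the single classifier error is corrected.
    print(decode("1011111"))   # A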
Using Error Correcting Codes
Semi-Supervised Learning

Key idea: avoid overfitting to a small amount of labeled data by leveraging a lot of unlabeled data
Using Unlabeled Data
- If you have a small amount of labeled data and a large amount of unlabeled data:
  - you can use a type of bootstrapping to learn a model that exploits regularities in the larger set of data
  - The stable regularities might be easier to spot in the larger set than the smaller set
  - Less likely to overfit your labeled data
- Draws on concepts from Clustering!
  - Clustering shows you where the natural breaks are in your data
Expectation maximization approach
- Train a model on the labeled data
- Apply the model to the unlabeled data
- Train a model on the newly labeled data
  - You can use a cross-validation approach to reassign labels to the same data from this newly trained model
  - You can keep doing this iteratively until the model converges
  - Probabilities on labels assign a weight to each training example
    - If you consider hand labeled data to have a score of 100%, then as your amount of hand labeled data increases, your unlabeled data will have less and less influence over the target model
  - This maximizes the expectation of correct classification
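A rough Python sketch of this loop, with scikit-learn’s Gaussian Naive Bayes as an assumed base learner: the model’s predicted probabilities act as weights on the newly labeled examples, while hand labels keep full weight.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def self_train(X_lab, y_lab, X_unlab, n_iters=5):
        """Iteratively relabel the unlabeled pool with the current model."""
        model = GaussianNB().fit(X_lab, y_lab)
        for _ in range(n_iters):
            proba = model.predict_proba(X_unlab)
            y_guess = model.classes_[proba.argmax(axis=1)]
            conf = proba.max(axis=1)            # label probability acts as a weight
            X_all = np.vstack([X_lab, X_unlab])
            y_all = np.concatenate([y_lab, y_guess])
            weights = np.concatenate([np.ones(len(y_lab)), conf])  # hand labels count fully
            model = GaussianNB().fit(X_all, y_all, sample_weight=weights)
        return model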
Doing Semi-Supervised Learning
- Not built in to Weka!
  - Set up labeled data as usual
  - In the unlabeled data, the class value is always ?
  - Create one whole labeled set of data
    - Set up the Explorer to output predictions
    - Run the classifier in the Explorer with “Use supplied test set”
    - You can then add the predictions to the unlabeled data
    - Make one large dataset with the original labeled data and the newly labeled data
- Then, create train/test pairs so you can re-estimate the labels
Built into TagHelper Tools!
- Unlabeled examples have class ?
- Turn on Self-training
Co-training
- Train two different models based on a few labeled examples
  - Each model is learning the same labels but using different features
- Use each of these to label the unlabeled data
- For each approach, take the example most confidently labeled negative and the example most confidently labeled positive and add them to the labeled data
- Now repeat the process until all of the data is labeled
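A rough Python sketch of the loop, with Gaussian Naive Bayes as an assumed base learner and two lists of column indices as the two feature views: in each round, each model hands over its most confidently labeled example of each class.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def co_train(X_lab, y_lab, X_unlab, view_a, view_b, n_rounds=10):
        """Two models, one per feature view, teach each other their most
        confidently labeled example of each class every round.
        Inputs are numpy arrays; view_a/view_b are lists of column indices."""
        X, y, pool = X_lab.copy(), y_lab.copy(), X_unlab.copy()
        model_a = model_b = None
        for _ in range(n_rounds):
            if len(pool) == 0:
                break
            model_a = GaussianNB().fit(X[:, view_a], y)
            model_b = GaussianNB().fit(X[:, view_b], y)
            new_idx, new_lab = [], []
            for model, view in ((model_a, view_a), (model_b, view_b)):
                proba = model.predict_proba(pool[:, view])
                for col, cls in enumerate(model.classes_):  # most confident per class
                    i = int(proba[:, col].argmax())
                    if i not in new_idx:
                        new_idx.append(i)
                        new_lab.append(cls)
            X = np.vstack([X, pool[new_idx]])
            y = np.concatenate([y, new_lab])
            pool = np.delete(pool, new_idx, axis=0)
        return model_a, model_b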
Co-training
- Co-training is better than EM for data that truly has two independent feature sets (like content versus links for web pages)
- Co-EM combines the two approaches: use labeled data to train a model with approach A, then use approach B to learn those labels and assign them to the data, then use A again, and pass back and forth until convergence
  - Probabilistically re-estimates labels on all data on each iteration
What Makes Good Applied Machine Learning Work Based on Bootstrapping and Co-training?
- Determining what are good “alternative views” on your data
- Involves all of the same issues as simply applying classifiers:
  - What features do you have available?
  - How will you select subsets of these?
  - Where will you get your labeled data from? What is the quality of this labeling?
Take Home Message
- Noise and instability are not always bad!
- Increase stability in classification using “multiple views”
- Ensemble methods use noise to get a “broader” view of your data
- Semi-supervised learning gets a “broader view” of your data by leveraging regularities found in a larger, unlabeled set of data