
Machine Learning in Practice
Lecture 21
Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute
Plan for the Day

Announcements
 No quiz
 Last assignment Thursday
 2nd midterm goes out next Thursday after class
Finish Instance Based Learning
 Weka helpful hints
 Clustering
 Advanced Statistical Models
 More on Optimization and Tuning
Finish Instance Based Learning
Instance Based Learning

 Rote learning is at the extreme end of instance based representations
 A more general form of instance based representation is where membership is computed based on a similarity measure between a centroid vector and the vector of the example instance
 Advantage: possible to learn incrementally

Why “Lazy”?
[Figure: http://www.cs.mcgill.ca/~cs644/Godfried/2005/Fall/mbouch/rubrique_fichiers/image003-3.png]
Finding Nearest Neighbors Efficiently

 Brute force method: compute the distance between the new vector and every vector in the training set, and pick the one with the smallest distance
 Better method: divide the search space and strategically select relevant regions, so you only have to compare to a subset of instances
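As a rough illustration, here is a minimal Python sketch of the brute-force method (the data and the name brute_force_nn are purely illustrative); note that every prediction scans the entire training set, which is exactly what the kD-trees on the next slide avoid.

```python
# Minimal brute-force nearest neighbor: compare the new vector to every
# stored training vector and return the label of the closest one.
import numpy as np

def brute_force_nn(train_X, train_y, new_x):
    dists = np.linalg.norm(train_X - new_x, axis=1)  # distance to every exemplar
    return train_y[np.argmin(dists)]                 # label of the closest one

# Illustrative data: three stored exemplars, one query point
train_X = np.array([[1.0, 1.0], [4.0, 4.0], [8.0, 1.0]])
train_y = np.array(["a", "b", "c"])
print(brute_force_nn(train_X, train_y, np.array([3.5, 3.0])))  # -> "b"
```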

kD-trees

 kD-trees partition the space so that nearest neighbors can be found more efficiently
 Each split takes place along one attribute and splits the examples at the parent node roughly in half
 Split points are chosen in such a way as to keep the tree as balanced as possible (*not* to optimize for accuracy or information gain)
 Can you guess why you would want the tree to be as balanced as possible?
 Hint: think about the computational complexity of search algorithms
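For comparison, a short sketch of the same kind of query routed through a k-d tree, here using SciPy's KDTree as an assumed stand-in for whatever Weka does internally (the random data is illustrative); the tree is built once, and each query then only visits a subset of the stored instances.

```python
# k-d tree lookup: build once, then each query descends the tree instead
# of scanning every training instance.
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
train_X = rng.random((1000, 3))      # 1000 instances with 3 attributes

tree = KDTree(train_X)               # splits alternate across the attributes
dist, idx = tree.query([0.5, 0.5, 0.5], k=1)
print(idx, dist)                     # index of and distance to the nearest stored instance
```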

Algorithm

[Figure: kD-tree regions labeled A–G, a new point, and the approximate nearest neighbor found by descending the tree]
Algorithm

[Figure: the same kD-tree search, regions A–G, showing the approximate nearest neighbor of the new point]
 Tweak: sometimes you average over the k nearest neighbors rather than taking the single nearest neighbor
 Tweak: use ball shaped regions rather than rectangles to keep the number of overlapping regions down
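A sketch of both tweaks together, assuming scikit-learn is available: a ball tree instead of rectangular regions, and a prediction averaged over the k nearest neighbors rather than trusting the single closest one (the data and the choice of k are illustrative).

```python
# Ball-tree query plus k-neighbor averaging for a numeric prediction.
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
train_X = rng.random((500, 4))          # 500 instances, 4 attributes
train_y = rng.random(500)               # numeric target to average over

tree = BallTree(train_X)                # ball-shaped regions, fewer overlaps
dist, ind = tree.query([[0.3, 0.3, 0.3, 0.3]], k=5)
prediction = train_y[ind[0]].mean()     # average the 5 nearest neighbors
print(prediction)
```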
Locally Weighted Learning

 Base predictions on models trained specifically for regions within the vector space
 Caveat: this is an over-simplification of what’s happening
 Weighting of examples is accomplished as in cost-sensitive classification
 Similar idea to M5P (learning separate regressions for different regions of the vector space, determined by a path through a decision tree)

Locally Weighted Learning

 LBR: Bayesian classification that relaxes independence assumptions using similarity between training and test instances
 Only assumes independence within a neighborhood
 LWL is a general locally weighted learning approach
 Note that Bayesian Networks are another way of taking non-independence into account with probabilistic models, by explicitly modeling interactions (see the last section of Chapter 6)

Problems with Nearest-Neighbor Classification

 Slow for large numbers of exemplars
 Performs poorly with noisy data if only the single closest exemplar is used for classification
 All attributes contribute equally to the distance comparison
 If no normalization is done, then attributes with the biggest range have the biggest effect, regardless of their importance for classification
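A tiny illustration of the range problem, using made-up income and GPA attributes as the assumed example: without rescaling, the large-range attribute dominates the distance no matter how informative the other one is.

```python
# Hypothetical example: income has a huge range, GPA a tiny one.
import numpy as np

a = np.array([50000.0, 3.9])    # [income, GPA]
b = np.array([52000.0, 1.1])
print(np.linalg.norm(a - b))    # ~2000: driven almost entirely by income

def min_max_normalize(X):
    # rescale every attribute to [0, 1] so each contributes comparably
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)
```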
Problems with Nearest-Neighbor Classification

 Even if you normalize, you still have the problem that attributes are not weighted by importance
 Normally does not do any sort of explicit generalization

Reducing the Number of Exemplars

 Normally unnecessary to retain all examples ever seen
 Ideally only one important example per section of instance space is needed
 One strategy that works reasonably well is to keep only the exemplars that were initially classified wrong (see the sketch below)
 Over time the number of exemplars kept increases, and the error rate goes down
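A minimal sketch of that strategy (the helper names are illustrative): an example is stored only if the exemplars kept so far misclassify it.

```python
# Keep an example only when the current exemplar set gets it wrong.
import numpy as np

def nearest_label(kept_X, kept_y, x):
    dists = np.linalg.norm(np.array(kept_X) - x, axis=1)
    return kept_y[int(np.argmin(dists))]

def incremental_filter(stream):
    kept_X, kept_y = [], []
    for x, y in stream:
        if not kept_X or nearest_label(kept_X, kept_y, x) != y:
            kept_X.append(x)        # misclassified (or first) example: keep it
            kept_y.append(y)
    return np.array(kept_X), np.array(kept_y)
```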
Reducing the Number of Exemplars

 One problem is that sometimes it is not clear that an exemplar is important until some time after it has been thrown away
 Also, this strategy of keeping just those exemplars that are classified wrong is bad for noisy data, because it will tend to keep the noisy examples
Tuning K for K-Nearest Neighbors Compensates for Noise
* Tune for the optimal value of K
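A sketch of tuning K by cross-validation, with scikit-learn as an assumed stand-in (Weka's IBk exposes the same idea through its crossValidate option); larger K averages over more neighbors and so smooths over noise.

```python
# Try several values of K and keep the one with the best cross-validated accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = {}
for k in (1, 3, 5, 7, 9):
    scores[k] = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```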
Pruning Noisy Examples

 Using success ratios, it is possible to reduce the number of examples you are paying attention to, based on their observed reliability
 You can compute a success ratio for every instance within range K of the new instance
 Based on the accuracy of their predictions
 Computed over the examples seen since they were added to the space
Pruning Noisy Examples

 Keep an upper and a lower threshold
 Throw out examples that fall below the lower threshold
 Only use exemplars that are above the upper threshold
 But keep updating the success ratio of all exemplars
Don’t do anything rash!

 We can compute confidence intervals on the success ratios, based on the number of observations we have made
 You won’t pay attention to an exemplar that just happens to look good at first
 You won’t throw instances away carelessly
[Figure callout: “Eventually, these will be thrown out.”]
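As a rough sketch of these two slides together (the thresholds and the use of a Wilson interval are my own illustrative choices, not the exact bookkeeping in the original algorithm): an exemplar is only promoted or discarded once its success ratio is confidently above or below the thresholds.

```python
# Success-ratio pruning with a confidence interval, so decisions are not
# made on too few observations.
import math

def wilson_interval(successes, n, z=1.96):
    # 95% confidence interval on a success ratio observed over n predictions
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

LOWER, UPPER = 0.4, 0.6          # illustrative pruning thresholds

def status(successes, n):
    lo, hi = wilson_interval(successes, n)
    if hi < LOWER:
        return "discard"          # confidently unreliable
    if lo > UPPER:
        return "use"              # confidently reliable
    return "keep updating"        # not enough evidence either way

print(status(2, 3))               # looks good early, but too few observations
print(status(28, 30))             # consistently right: safe to rely on
```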
What do we do about irrelevant attributes?

 You can compensate for irrelevant attributes by scaling attribute values based on importance
 Attribute weights are modified after a new example is added to the space
 Use the most similar exemplar to the new training instance
What do we do about irrelevant attributes?

 Adjust the weights so that the new instance comes closer to the most similar exemplar if that exemplar classified it correctly, or farther away if it was wrong (see the sketch below)
 Weights are usually renormalized after this adjustment
 Weights will be trained to emphasize attributes that lead to useful generalizations
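A sketch of that update rule (the learning rate, the use of absolute differences, and the renormalization step are illustrative simplifications): shrink the weights of attributes where the new instance and its most similar exemplar differ when the exemplar was right, and grow them when it was wrong.

```python
# Adjust per-attribute weights after seeing a new training instance.
import numpy as np

def update_weights(weights, new_x, exemplar_x, exemplar_was_correct, lr=0.1):
    diff = np.abs(new_x - exemplar_x)        # per-attribute disagreement
    if exemplar_was_correct:
        weights = weights - lr * diff        # pull the instance closer on these attributes
    else:
        weights = weights + lr * diff        # push it farther away on these attributes
    weights = np.clip(weights, 1e-6, None)   # keep weights positive
    return weights / weights.sum()           # renormalize

w = np.ones(3) / 3
w = update_weights(w, np.array([1.0, 0.2, 5.0]), np.array([1.0, 0.3, 2.0]), True)
print(w)   # the attribute with the big disagreement is down-weighted
```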
Instance Based Learning with Generalization

• Instances are generalized to regions. This allows instance based learning algorithms to behave like other machine learning algorithms (just another complex decision boundary)
• Key idea is determining how far to generalize from each instance
IB1: Plain Vanilla Nearest Neighbor Algorithm

 Keeps all training instances, doesn’t normalize
 Uses Euclidean distance
 Bases its prediction on the first instance found with the shortest distance
 Nothing to optimize
 Published in 1991 by my AI programming professor from UCI!
IBK: More general than IB1

 kNN: how many neighbors to pay attention to
 crossValidate: use leave-one-out cross-validation to select the optimal K
 distanceWeighting: allows you to select the method for weighting based on distance
 meanSquared: if it’s true, use mean squared error rather than absolute error for regression problems
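A sketch of what the distanceWeighting option is getting at, with inverse-distance weighting as the assumed scheme (data and names are illustrative): closer neighbors get bigger votes.

```python
# k-nearest-neighbor vote where each neighbor's vote is weighted by 1/distance.
import numpy as np
from collections import defaultdict

def weighted_knn_predict(train_X, train_y, x, k=3):
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[train_y[i]] += 1.0 / (dists[i] + 1e-9)   # closer => bigger vote
    return max(votes, key=votes.get)
```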
IBK: More general than IB1

 noNormalization: turns off normalization
 windowSize: sets the maximum number of instances to keep. Prunes off older instances when necessary. 0 means no limit.
K*

 Uses an entropy based distance metric rather than Euclidean distance
 Much slower than IBK!
 Optimizations related to concepts we aren’t learning in this course
 Allows you to choose what to do with missing values

What is special about K*?

 Distance is computed based on how many transformation operations it would take to map one vector onto another
 There may be multiple transformation paths, and all of them are taken into account
 So the distance is an average over all possible transformation paths (randomly generated, so branching factor matters!)
 That’s why it’s slow!!!
 Allows for a more natural way of handling distance when your attribute space has many different types of attributes
What is special about K*?

 Also allows a natural way of handling unknown values (probabilistically imputing values)
 K* is likely to do better than other approaches if you have lots of unknown values or a very heterogeneous feature space (in terms of types of features)
Locally Weighted Numeric Prediction

 Two main types of trees used for numeric prediction
 Regression trees: average values computed at leaf nodes
 Model trees: regression functions trained at leaf nodes
 Rather than maximize information gain, these algorithms minimize variation within the subsets at leaf nodes
Locally Weighted Numeric Prediction

 Locally weighted regression is an alternative to regression trees where the regression is computed at testing time rather than training time
 Compute a regression for the instances that are close to the testing instance
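A short sketch of locally weighted regression (the Gaussian kernel and bandwidth tau are illustrative choices): at prediction time, fit a weighted least-squares line that pays most attention to the training points near the query.

```python
# Locally weighted regression: one small weighted linear fit per query point.
import numpy as np

def locally_weighted_predict(train_X, train_y, x_query, tau=1.0):
    X = np.hstack([np.ones((len(train_X), 1)), train_X])   # add intercept column
    xq = np.concatenate([[1.0], x_query])
    d2 = np.sum((train_X - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2 * tau ** 2))                        # closeness to the query
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ train_y) # weighted least squares
    return xq @ theta

train_X = np.linspace(0, 6, 50).reshape(-1, 1)
train_y = np.sin(train_X).ravel()
print(locally_weighted_predict(train_X, train_y, np.array([2.0]), tau=0.5))  # close to sin(2)
```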
Summary of Locally Weighted Learning

 Use Instance Based Learning together with a base classifier, almost like a wrapper
 Learn a model within a neighborhood
 Basic idea: approximate non-linear function learning with simple linear algorithms
Summary of Locally Weighted Learning

 Big advantage: allows for incremental learning, whereas things like SVM do not
 If you don’t need the incrementality, then it is probably better not to go with instance based learning

Take Home Message

 There are many ways of evaluating the similarity of instances, and they lead to different results
 Instance based learning and clustering both make use of these approaches
 Locally weighted learning is another way (besides the “kernel trick”) to get nonlinearity into otherwise linear approaches

Weka Helpful Hints
Remember SMOreg vs SMO…
Setting the Exponent in SMO
* Note that an exponent larger than 1.0 means you are using a non-linear kernel.
Clustering
What is clustering?

 Finding natural groupings of your data
 Not supervised! No class attribute.
 Usually only works well if you have a huge amount of data!

InfoMagnets: Interactive Text Clustering
What does clustering do?

 Finds natural breaks in your data
 If there are obvious clusters, you can do this with a small amount of data
 If you have lots of weak predictors, you need a huge amount of data to make it work

Clustering in Weka
* You can pick which clustering algorithm you want to use and how many clusters you want.
Clustering in Weka
* Clustering is unsupervised, so you want it to ignore your class attribute!
[Screenshot callouts: “Click here”; “Select the class attribute”]
Clustering in Weka
* You can evaluate the clustering in comparison with class attribute assignments
Adding a Cluster Feature
* You should set it explicitly to ignore the class attribute
* Set the pulldown menu to No Class
Why add cluster features?

[Figure: example data with two classes, Class 1 and Class 2]
Clustering with Weka

 K-means and FarthestFirst: disjoint flat clusters
 EM: statistical approach
 Cobweb: hierarchical clustering
K-Means

 You choose the number of clusters you want
 You might need to play with this by looking at what kind of clusters you get out
 K initial points are chosen randomly as cluster centroids
 All points are assigned to the centroid they are closest to
 Once the data is clustered, a new centroid is picked based on relationships within the cluster
K-Means

 Then clustering occurs again using the new centroids
 This continues until no changes in clustering take place
 Clusters are flat and disjoint
K-Means

[Figure sequence: k-means iterations on example data — points re-assigned and centroids re-computed until no changes occur]
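A minimal k-means sketch following the steps above (Weka's SimpleKMeans clusterer implements the same loop; the random data, seed, and iteration cap here are illustrative).

```python
# k-means: assign points to the closest centroid, recompute centroids,
# repeat until the clustering stops changing.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    assignment = None
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)        # closest centroid per point
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break                                    # no changes in clustering: done
        assignment = new_assignment
        for j in range(k):
            members = X[assignment == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)  # recompute the centroid
    return centroids, assignment

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)    # roughly one centroid near (0, 0) and one near (5, 5)
```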
EM: Expectation Maximization

 Does not base clustering on distance from a centroid
 Instead clusters based on probability of class assignment
 Overlapping clusters rather than disjoint clusters
 Every instance belongs to every cluster with some probability
EM: Expectation Maximization

 Two important kinds of probability distributions
 Each cluster has an associated distribution of attribute values for each attribute
 Based on the extent to which instances are in the cluster
 Each instance has a certain probability of being in each cluster
 Based on how close its attribute values are to the typical attribute values for the cluster
[Figure sequence: EM on two example instances]
 Probabilities of cluster membership initialized: instance 1 is 65% B / 35% A; instance 2 is 25% B / 75% A
 Central tendencies (A and B) computed based on cluster membership
 Cluster membership re-assigned probabilistically: instance 1 becomes 75% B / 25% A; instance 2 becomes 35% B / 65% A
 Central tendencies re-assigned based on membership
 Cluster membership reassigned again: instance 1 is 60% B / 40% A; instance 2 is 45% B / 55% A
EM: Expectation Maximization

 Iterative like k-means, but guided by a different computation
 Considered more principled than k-means, but much more computationally expensive
 Like k-means, you pick the number of clusters you want
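A sketch of the soft-membership view using scikit-learn's GaussianMixture, which is fit by EM, as an assumed stand-in for Weka's EM clusterer (the two-blob data is illustrative); predict_proba gives every instance a probability of belonging to every cluster rather than a single hard assignment.

```python
# EM-style clustering: every instance gets a probability for every cluster.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),    # one well-separated blob
               rng.normal(5, 1, (100, 2))])   # and another

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
memberships = gm.predict_proba(X)             # rows: instances, columns: clusters
print(memberships[0].round(3))                # close to [1.0, 0.0] (cluster order may vary)
```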

Advanced Statistical Models
Quick View of Bayesian Networks

[Diagram: nodes Windy, Humidity, Outlook, Temperature, and Play]
 Normally with Naïve Bayes you have simple conditional probabilities
 P[Play = yes | Humidity = high]
Quick View of Bayesian Networks

[Diagram: nodes Windy, Humidity, Outlook, Temperature, and Play]
 With Bayes Nets, there are interactions between attributes
 P[play = yes & temp = hot | Humidity = high]
 Similar likelihood computation for an instance
 You will still have one conditional probability per attribute to multiply together
 But they won’t all be simple
 Humidity is related jointly to temperature and play
Quick View of Bayesian Networks

[Diagram: nodes Windy, Humidity, Outlook, Temperature, and Play]
 Learning algorithm needs to find the shape of the network
 Probabilities come from counts
 Two stages – similar idea to “kernel methods”
Doing Optimization in Weka
Optimizing Parameter Settings

[Diagram: folds 1–5, each split into Train, Validation, and Test portions]
 Use a modified form of cross-validation: iterate over settings, compare performance over the validation set, and pick the optimal setting; then test on the test set
 Still N folds, but each fold has less training data than with standard cross-validation
 Or you can have a hold-out validation set that you use for all folds
Remember!

 Cross-validation is for estimating your performance
 If you want the model that achieves that estimated performance, train over the whole set
 Same principle for optimization
 Estimate your tuned performance using cross-validation with an inner loop for optimization
 When you build the model over the whole set, use the settings that work best in cross-validation over the whole set
Optimization in Weka

 Divide your data into 10 train/test pairs
 Tune parameters using cross-validation on the training set (this is the inner loop)
 Use those optimized settings on the corresponding test set
 Note that you may have a different set of parameter settings for each of the 10 train/test pairs
 You can do the optimization in the Experimenter
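For concreteness, a sketch of the same nested scheme in scikit-learn rather than the Weka Experimenter (the decision-tree parameter min_samples_leaf is an illustrative stand-in for J48's confidence factor): an inner cross-validation on each training fold picks the setting, and the outer loop scores it on the corresponding held-out fold.

```python
# Nested cross-validation: tune on each training fold, test on the held-out fold.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                     param_grid={"min_samples_leaf": [1, 2, 5, 10]},
                     cv=5)                       # inner loop: pick the setting
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = cross_val_score(inner, X, y, cv=outer)  # outer loop: 10 train/test pairs
print(scores.mean())                             # the chosen setting may differ per fold
```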
Train/Test Pairs
* Use the StratifiedRemoveFolds filter
Setting Up for Optimization
* Prepare to save the results
• Load in training sets for all folds
• We’ll use cross-validation within the training folds to do the optimization
What are we optimizing?
Let’s optimize the confidence factor.
Let’s try .1, .25, .5, and .75
Add Each Algorithm to the Experimenter Interface
Look at the Results
* Note that the optimal setting varies across folds.
Apply the optimized settings on each fold
* Performance on Test1 using optimized settings from Train1
What if the optimization requires work by hand?

 Do you see a problem with the following?
 Do feature selection over the whole set to see which words are highly ranked
 Create user defined features with subsets of these to see which ones look good
 Add those to your feature space and do the classification
What if the optimization requires work by hand?

 The problem is that this is just like doing feature selection over your whole data set
 You will overestimate your performance
 So what’s a better way of doing that?
What if the optimization requires work by hand?

 You could set aside a small subset of the data
 Using that small subset, do the same process
 Then use those user defined features with the other part of the data
Take Home Message

 Instance based learning and clustering both make use of similarity metrics
 Clustering can be used to help you understand your data or to add new features to your data
 Weka provides opportunities to tune all of its algorithms through the object editor
 You can use the Experimenter to tune parameter settings when you are estimating your performance using cross-validation