Machine Learning
in Practice
Lecture 21
Carolyn Penstein Rosé
Language Technologies Institute/
Human-Computer Interaction
Institute
Plan for the Day
Announcements
No quiz
Last assignment Thursday
2nd midterm goes out next Thursday after class
Finish Instance Based Learning
Weka helpful hints
Clustering
Advanced Statistical Models
More on Optimization and Tuning
Finish Instance Based Learning
Instance Based Learning
Rote learning is at the extreme end of
instance based representations
A more general form of instance based
representation is where membership is
computed based on a similarity measure
between a centroid vector and the vector of
the example instance
Advantage: Possible to learn incrementally
Why “Lazy”?
Instance based learners are "lazy" because they do no work at training time; all of the generalization is deferred until a new instance has to be classified.
[Figure: http://www.cs.mcgill.ca/~cs644/Godfried/2005/Fall/mbouch/rubrique_fichiers/image003-3.png]
Finding Nearest Neighbors
Efficiently
Brute force method – compute distance
between new vector and every vector in the
training set, and pick the one with the
smallest distance
Better method: divide the search space,
and strategically select relevant regions so
you only have to compare to a subset of
instances
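To make the brute force method concrete, here is a minimal sketch in Python with NumPy (my illustration, not from the lecture); the toy data and names are assumptions.

```python
import numpy as np

def brute_force_nearest(train_X, train_y, query):
    """Compare the query vector to every training vector and
    return the label of the one with the smallest Euclidean distance."""
    distances = np.linalg.norm(train_X - query, axis=1)  # one distance per training instance
    return train_y[np.argmin(distances)]

# Toy data (hypothetical): four 2-D instances with binary labels
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 0.8]])
train_y = np.array(["no", "no", "yes", "yes"])
print(brute_force_nearest(train_X, train_y, np.array([0.95, 0.9])))  # -> "yes"
```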
kD-trees
kD-trees partition the space so that nearest
neighbors can be found more efficiently
Each split takes place along one attribute and splits the examples at the parent node roughly in half
Split points chosen in such a way as to keep
the tree as balanced as possible (*not* to
optimize for accuracy or information gain)
Can you guess why you would want the tree to be
as balanced as possible?
Hint – think about the computational complexity of search algorithms (a balanced tree keeps the search path to roughly log n comparisons)
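For a sense of what the partitioned search buys you, here is a small sketch using SciPy's KDTree as an analogue (not Weka's implementation); the data is randomly generated.

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
train_X = rng.random((10000, 3))      # 10,000 training instances in 3-D
tree = KDTree(train_X)                # builds a kD-tree over the training instances

query = np.array([0.5, 0.5, 0.5])
dist, idx = tree.query(query, k=1)    # descends the tree rather than scanning all 10,000 points
print(idx, dist)
```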
Algorithm
[Figure: a kD-tree partitions the instance space into regions (A–G); the new instance is compared only against instances in nearby regions to find an approximate nearest neighbor]
Algorithm
Tweak: sometimes you average over the k nearest neighbors rather than taking the absolute nearest neighbor
Tweak: use ball-shaped regions rather than rectangles to keep the number of overlapping regions down
[Figure: the kD-tree regions (A–G) searched for the approximate nearest neighbor of the new instance]
Locally Weighted Learning
Base predictions on models trained
specifically for regions within the vector
space
Caveat! This is an over-simplification of what’s happening
Weighting of examples accomplished as in
cost sensitive classification
Similar idea to M5P (learning separate
regressions for different regions of the vector
space determined by a path through a
decision tree)
Locally Weighted Learning
LBR: Bayesian classification that relaxes independence assumptions using similarity between training and test instances
Only assumes independence within a neighborhood
LWL is a general locally weighted learning
approach
Note that Bayesian Networks are another way
of taking non-independence into account with
probabilistic models by explicitly modeling
interactions (see last section of Chapter 6)
Problems with Nearest-Neighbor
Classification
Slow for large numbers of exemplars
Performs poorly with noisy data if only the single closest exemplar is used for classification
All attributes contribute equally to the
distance comparison
If no normalization is done, then attributes with the biggest range have the biggest effect, regardless of their importance for classification
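A quick illustrative sketch (not from the slides) of how an attribute with a large range dominates unnormalized Euclidean distance, and how dividing by the attribute ranges evens things out; the attribute names and ranges are made up.

```python
import numpy as np

# Hypothetical instances: (income in dollars, age in years)
a = np.array([50000.0, 25.0])
b = np.array([51000.0, 60.0])

print(np.linalg.norm(a - b))             # ~1000: income's range swamps the 35-year age gap

# Min-max style normalization: divide each attribute by its (assumed) observed range
ranges = np.array([100000.0, 80.0])      # assumed max - min per attribute
print(np.linalg.norm((a - b) / ranges))  # income no longer dominates just because its range is larger
```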
Problems with Nearest-Neighbor
Classification
Even if you normalize, you still have the
problem that attributes are not weighted by
importance
Normally does not do any sort of explicit
generalization
Reducing the Number of Exemplars
Normally unnecessary to retain all examples ever
seen
Ideally only one important example per section of
instance space is needed
One strategy that works reasonably well is to only
keep exemplars that were initially classified
wrong
Over time the number of exemplars kept
increases, and the error rate goes down
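Here is a rough sketch of the keep-only-the-misclassified-exemplars strategy (in the spirit of IB2); it is an illustration under assumed data shapes, not Weka code.

```python
import numpy as np

def incremental_filter(stream_X, stream_y):
    """Classify each incoming instance with the exemplars kept so far;
    keep the instance only if the prediction was wrong (or nothing is kept yet)."""
    kept_X, kept_y = [], []
    for x, y in zip(stream_X, stream_y):
        if kept_X:
            dists = np.linalg.norm(np.array(kept_X) - x, axis=1)
            predicted = kept_y[int(np.argmin(dists))]
        else:
            predicted = None
        if predicted != y:            # misclassified: this exemplar carries new information
            kept_X.append(x)
            kept_y.append(y)
    return np.array(kept_X), np.array(kept_y)

# Hypothetical usage: X is an (n, d) array of instances, y an array of labels
# kept_X, kept_y = incremental_filter(X, y)
```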
Reducing the Number of Exemplars
One problem is that sometimes it is not clear that
an exemplar is important until sometime after it
has been thrown away
Also, this strategy of keeping just those
exemplars that are classified wrong is bad for
noisy data, because it will tend to keep the noisy
examples
Tuning K for K-Nearest Neighbors Compensates for Noise
* Tune for optimal value of K
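One way to tune K is an inner cross-validation over candidate values. Below is a scikit-learn sketch using GridSearchCV over n_neighbors as an analogue to IBk's crossValidate option; the synthetic dataset is a stand-in for your own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)  # stand-in for your training data

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},  # candidate values of K
    cv=10,                                        # 10-fold cross-validation on the training data
)
search.fit(X, y)
print(search.best_params_)                        # the K that compensates best for noise
```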
Pruning Noisy Examples
Using success ratios, it is possible to reduce the
number of examples you are paying attention to
based on their observed reliability
You can compute a success ratio for every
instance within range K of the new instance
based on the accuracy of their prediction
Computed over examples seen since they were added
to the space
Pruning Noisy Examples
Keep an upper and lower threshold
Throw out examples that fall below the lower
threshold
Only use exemplars that are above the upper
threshold
But keep updating the success ratio of all exemplars
Don’t do anything rash!
We can compute confidence intervals on the
success ratios we compute based on the
number of observations we have made
You won’t pay attention to an exemplar that just happens to look good at first
You won’t throw instances away carelessly
Eventually, these will be thrown out.
What do we do about irrelevant
attributes?
You can compensate for irrelevant attributes by
scaling attribute values based on importance
Attribute weights modified after a new example is
added to the space
Use the most similar exemplar to the new training instance
What do we do about irrelevant
attributes?
Adjust the weights so that the new instance
comes closer to the most similar exemplar if it
classified it correctly or farther away if it was
wrong
Weights are usually renormalized after this
adjustment
Weights will be trained to emphasize attributes
that lead to useful generalizations
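A minimal sketch of one plausible version of this weight update (my own illustration, not the exact algorithm from the slides): shrink the weights on attributes where the new instance and its nearest exemplar differ when the prediction was right, grow them when it was wrong, then renormalize.

```python
import numpy as np

def update_weights(weights, new_x, exemplar_x, correct, lr=0.1):
    """Adjust attribute weights so the new instance moves closer to its most
    similar exemplar after a correct prediction, and farther after a wrong one."""
    diff = np.abs(new_x - exemplar_x)
    if correct:
        weights = weights - lr * diff   # attributes where they differ mattered less
    else:
        weights = weights + lr * diff   # attributes where they differ should push them apart
    weights = np.clip(weights, 1e-6, None)
    return weights / weights.sum()      # renormalize after the adjustment
```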
Instance Based Learning with Generalization
•Instances generalized to
regions. Allows instance based
learning algorithms to behave
like other machine learning
algorithms (just another complex
decision boundary)
•Key idea is determining how far to
generalize from each instance
IB1: Plain Vanilla Nearest Neighbor
Algorithm
Keeps all training instances,
doesn’t normalize
Uses Euclidean distance
Bases prediction on the first
instance found with the
shortest distance
Nothing to optimize
Published in 1991 by my AI
programming professor from
UCI!
IBK: More general than IB1
kNN: how many neighbors to pay attention to
crossValidate: use leave one
out cross-validation to select
optimal K
distanceWeighting: allows
you to select the method for
weighting based on distance
meanSquared: if it’s true, use
mean squared error rather than
absolute error for regression
problems
IBK: More general than IB1
noNormalization: turns off
normalization
windowSize: sets the
maximum number of instances
to keep. Prunes off older
instances when necessary. 0
means no limit.
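For reference, a scikit-learn analogue (not Weka's IBk itself) showing the kNN and distanceWeighting ideas; the dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)  # stand-in for your training data

# k = 5 neighbors, with closer neighbors weighted more heavily
# (roughly what IBk's kNN and distanceWeighting options control)
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X, y)
print(knn.predict(X[:3]))
```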
K*
Uses an entropy-based distance metric rather than Euclidean distance
Much slower than IBK!
Optimizations related to
concepts we aren’t
learning in this course
Allows you to choose
what to do with missing
values
What is special about K*?
Distance is computed based on a computation of
how many transformation operations it would
take to map one vector onto another
There may be multiple transformation paths, and all of them are taken into account
So the distance is an average over all possible
transformation paths (randomly generated – so
branching factor matters!)
That’s why it’s slow!!!
Allows for a more natural way of handling
distance when your attribute space has many
different types of attributes
What is special about K*?
Also allows a natural way of handling unknown
values (probabilistically imputing values)
K* is likely to do better than other approaches if
you have lots of unknown values or a very
heterogeneous feature space (in terms of types
of features)
Locally Weighted Numeric
Prediction
Two main types of trees used for numeric prediction
Regression trees: average values computed at leaf nodes
Model trees: regression functions trained at leaf
nodes
Rather than maximize information gain,
these algorithms minimize variation within
subsets at leaf nodes
Locally Weighted Numeric
Prediction
Locally weighted regression is an
alternative to regression trees where the
regression is computed at testing time
rather than training time
Compute a regression for instances that are close to the testing instance
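A compact sketch of locally weighted linear regression, fit at prediction time around each query as described above; the Gaussian kernel, bandwidth, and toy data are my assumptions.

```python
import numpy as np

def locally_weighted_predict(train_X, train_y, query, bandwidth=0.3):
    """Fit a weighted linear regression around the query point at test time:
    nearby training instances get large weights, distant ones near zero."""
    dists = np.linalg.norm(train_X - query, axis=1)
    w = np.exp(-(dists ** 2) / (2 * bandwidth ** 2))     # Gaussian kernel weights
    Xb = np.hstack([np.ones((len(train_X), 1)), train_X])  # add intercept column
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Xb * sw[:, None], train_y * sw, rcond=None)  # weighted least squares
    return np.concatenate([[1.0], query]) @ beta

# Toy 1-D example (hypothetical): noisy sine curve
rng = np.random.default_rng(0)
train_X = rng.uniform(0, 6, size=(200, 1))
train_y = np.sin(train_X[:, 0]) + rng.normal(0, 0.1, size=200)
print(locally_weighted_predict(train_X, train_y, np.array([3.0])))  # close to sin(3) ≈ 0.14
```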
Summary of Locally Weighted
Learning
Use Instance Based Learning together
with a base classifier – almost like a
wrapper
Learn a model within a neighborhood
Basic idea: approximate non-linear
function learning with simple linear
algorithms
Summary of Locally Weighted
Learning
Big advantage: allows for incremental
learning, whereas things like SVM do not
If you don’t need the incrementality, then it
is probably better not to go with instance
based learning
Take Home Message
Many ways of evaluating similarity of
instances, which lead to different results
Instance based learning and clustering both
make use of these approaches
Locally weighted learning is another way
(besides the “kernel trick”) to get
nonlinearity into otherwise linear
approaches
Weka Helpful Hints
Remember SMOreg vs SMO…
Setting the Exponent in SMO
* Note that an exponent larger than 1.0 means you are using a non-linear kernel.
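As a hedged analogue of the exponent setting (this is scikit-learn, not Weka's SMO), the sketch below contrasts a degree-1 polynomial kernel, which is effectively linear, with a degree-2 one, which is non-linear; the dataset is made up.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)  # not linearly separable

linear = SVC(kernel="poly", degree=1).fit(X, y)  # analogous to exponent = 1.0
quad = SVC(kernel="poly", degree=2).fit(X, y)    # analogous to exponent = 2.0 (non-linear)

print(linear.score(X, y), quad.score(X, y))      # the degree-2 kernel separates the circles far better
```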
Clustering
What is clustering?
Finding natural groupings of your data
Not supervised! No class attribute.
Usually only works well if you have a huge
amount of data!
InfoMagnets: Interactive Text Clustering
What does clustering do?
Finds natural
breaks in your data
If there are obvious
clusters, you can
do this with a small
amount of data
If you have lots of
weak predictors,
you need a huge
amount of data to
make it work
Clustering in Weka
* You can pick which clustering algorithm you want to use and how many clusters
you want.
Clustering in Weka
* Clustering is unsupervised, so you want it to ignore your class attribute!
[Screenshot callouts: Click here / Select the class attribute]
Clustering in Weka
* You can evaluate
the clustering in
comparison with class
attribute assignments
Adding a Cluster Feature
* You should set it explicitly to ignore the class attribute
* Set the pulldown menu to No Class
Why add cluster features?
[Figure: example data points labeled Class 1 and Class 2]
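Here is a sketch of adding cluster membership as a new feature using scikit-learn's KMeans (an analogue of the Weka workflow on these slides); the synthetic data and the choice of three clusters are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=5, random_state=0)  # stand-in data

# Cluster WITHOUT the class attribute, then append cluster membership as a new feature
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
X_augmented = np.column_stack([X, clusters])

print(X.shape, X_augmented.shape)  # (300, 5) -> (300, 6)
```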
Clustering with Weka
K-means and
FarthestFirst: disjoint flat
clusters
EM: statistical approach
Cobweb: hierarchical
clustering
K-Means
You choose the number of clusters you want
You might need to play with this by looking at what kind of clusters you get out
K initial points chosen randomly as cluster centroids
All points assigned to the centroid they are closest
to
Once data is clustered, a new centroid is computed for each cluster from the points assigned to it (their mean)
K-Means
Then clustering occurs again using the new
centroids
This continues until no changes in clustering take
place
Clusters are flat and disjoint
K-Means
[Figure: worked example of the k-means iterations across several slides]
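The loop just described, sketched from scratch in Python (an illustration, not Weka's implementation): pick K random centroids, assign every point to its closest centroid, recompute each centroid from its cluster, and repeat until the assignments stop changing.

```python
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # K initial points chosen randomly
    assignments = np.zeros(len(X), dtype=int)
    while True:
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)            # each point goes to its closest centroid
        if np.array_equal(new_assignments, assignments):  # stop when no changes in clustering
            return centroids, assignments
        assignments = new_assignments
        # recompute each centroid as the mean of its cluster (assumes no cluster goes empty)
        centroids = np.array([X[assignments == j].mean(axis=0) for j in range(k)])

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])  # two obvious blobs
centroids, labels = kmeans(X, k=2)
print(centroids)
```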
EM: Expectation Maximization
Does not base clustering on distance from a
centroid
Instead clusters based on probability of cluster membership
Overlapping clusters rather than disjoint clusters
Every instance belongs to every cluster with some probability
EM: Expectation Maximization
Two important kinds of probability
distributions
Each cluster has an associated distribution of attribute values for each attribute
Based on the extent to which instances are in the cluster
Each instance has a certain probability of being in each cluster
Based on how close its attribute values are to typical attribute values for the cluster
Worked example (two instances, clusters A and B):
Probabilities of cluster membership initialized: instance 1 – 35% A / 65% B; instance 2 – 75% A / 25% B
Central tendencies for A and B computed based on cluster membership
Cluster membership re-assigned probabilistically: instance 1 – 25% A / 75% B; instance 2 – 65% A / 35% B
Central tendencies for A and B re-computed based on the new membership
Cluster membership reassigned again: instance 1 – 40% A / 60% B; instance 2 – 55% A / 45% B
EM: Expectation Maximization
Iterative like k-means – but guided by a
different computation
Considered more principled than k-means, but
much more computationally expensive
Like k-means, you pick the number of clusters
you want
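For comparison with k-means, here is a sketch of soft cluster membership using scikit-learn's GaussianMixture, which is fit by EM (an analogue, not Weka's EM clusterer); the data is synthetic.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (100, 2)), rng.normal(4, 1.0, (100, 2))])  # two overlapping blobs

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM alternates membership and distribution updates
memberships = gmm.predict_proba(X)    # every instance belongs to every cluster with some probability
print(memberships[:3].round(2))       # e.g. rows like [0.98, 0.02]
```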
Advanced Statistical
Models
Quick View of Bayesian Networks
[Figure: nodes Play, Windy, Humidity, Outlook, Temperature]
Normally with Naïve Bayes you have simple conditional probabilities
P[Play = yes | Humidity = high]
Quick View of Bayesian Networks
[Figure: Bayes net over Play, Windy, Humidity, Outlook, Temperature with edges between attributes]
With Bayes Nets, there are interactions between attributes
P[play = yes & temp = hot | Humidity = high]
Similar likelihood computation for an instance
You will still have one conditional probability per attribute to multiply together
But they won’t all be simple
Humidity is related jointly to temperature and play
Quick View of Bayesian Networks
[Figure: Bayes net nodes Play, Windy, Humidity, Outlook, Temperature]
Learning algorithm needs to find the shape of the network
Probabilities come from counts
Two stages – similar idea to “kernel methods”
Doing Optimization in Weka
Optimizing Parameter Settings
Use a modified form of cross-validation:
[Figure: the data divided into folds 1–5, with Train, Validation, and Test portions marked]
Iterate over settings
Compare performance over the validation set; pick the optimal setting
Test on the test set
Still N folds, but each fold has less training data than with standard cross-validation
Or you can have a hold-out validation set you use for all folds
Remember!
Cross-validation is for estimating your
performance
If you want the model that achieves that
estimated performance, train over the
whole set
Same principle for optimization
Estimate your tuned performance using cross-validation with an inner loop for optimization
When you build the model over the whole set, use the settings that work best in cross-validation over the whole set
Optimization in Weka
Divide your data into 10 train/test pairs
Tune parameters using cross-validation on the training set (this is the inner loop)
Use those optimized settings on the
corresponding test set
Note that you may have a different set of parameter settings for each of the 10 train/test pairs
You can do the optimization in the
Experimenter
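The same inner/outer structure can be sketched outside Weka; below is a scikit-learn version in which the outer loop produces the 10 train/test pairs and GridSearchCV does the inner tuning. The tuned parameter here (a decision tree's ccp_alpha) is only a stand-in for J48's confidence factor, and the dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)           # stand-in data

outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # the 10 train/test pairs
for train_idx, test_idx in outer.split(X, y):
    # Inner loop: tune the parameter by cross-validation on the training fold only
    inner = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid={"ccp_alpha": [0.0, 0.001, 0.01, 0.05]},  # stand-in for the candidate settings
        cv=5,
    )
    inner.fit(X[train_idx], y[train_idx])
    # Apply the optimized settings to the corresponding test fold
    print(inner.best_params_, round(inner.score(X[test_idx], y[test_idx]), 3))
```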
Train/Test Pairs
* Use the StratifiedRemoveFolds filter
Setting Up for Optimization
* Prepare to save the results
•Load in training sets for
all folds
•We’ll use cross-validation within training folds to do the optimization
What are we optimizing?
Let’s optimize the confidence factor.
Let’s try .1, .25, .5, and .75
Add Each Algorithm to Experimenter
Interface
Look at the Results
* Note that optimal setting varies
across folds.
Apply the optimized settings on each fold
* Performance on Test1 using
optimized settings from Train1
What if the optimization requires
work by hand?
Do you see a problem
with the following?
Do feature selection over the whole set to see which words are highly ranked
Create user defined
features with subsets of
these to see which ones
look good
Add those to your feature
space and do the
classification
What if the optimization requires
work by hand?
The problem is that this is just like doing feature selection over your whole data set
You will overestimate your
performance
So what’s a better
way of doing that?
What if the optimization requires
work by hand?
You could set aside a
small subset of data
Using that small
subset, do the same
process
Then use those user
defined features with
the other part of the
data
Take Home Message
Instance based learning and clustering both
make use of similarity metrics
Clustering can be used to help you
understand your data or to add new
features to your data
Weka provides opportunities to tune all of
its algorithms through the object editor
You can use the Experimenter to tune the parameter settings when you are estimating your performance using cross-validation