
Data Science
Algorithms: The Basic Methods
Clustering
WFH: Data Mining, Chapter 4.8
Rodney Nielsen
Many of these slides were adapted from: I. H. Witten, E. Frank and M. A. Hall
Algorithms: The Basic Methods
• Inferring rudimentary rules
• Naïve Bayes, probabilistic model
• Constructing decision trees
• Constructing rules
• Association rule learning
• Linear models
• Instance-based learning
• Clustering
Clustering
• Clustering techniques apply when there is no class to be predicted
• Aim: divide instances into “natural” groups
• Clusters can be:
  • Disjoint OR Overlapping
  • Deterministic OR Probabilistic
  • Flat OR Hierarchical
• Classic clustering algorithm: k-means
• k-means clusters are disjoint, deterministic, and flat
Unsupervised Learning
[Figure: six unlabeled instances, a–f, scattered in feature space]
Hierarchical Agglomerative Clustering
[Figure: dendrogram over instances a–f. b and c merge into bc; d and e merge into de; de and f merge into def; bc and def merge into bcdef; finally a joins to give abcdef]
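The merge order in the figure can be reproduced mechanically. Below is a minimal sketch using SciPy's agglomerative clustering; the 2-D coordinates standing in for instances a–f are invented for illustration.

```python
# Minimal hierarchical agglomerative clustering sketch (SciPy).
# The six 2-D points standing in for instances a-f are invented.
import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.array([
    [0.0, 4.0],   # a (far from the rest, so it merges last)
    [1.0, 1.0],   # b
    [1.2, 0.9],   # c
    [3.0, 1.0],   # d
    [3.1, 1.1],   # e
    [3.6, 1.5],   # f
])

# Single linkage repeatedly merges the two closest clusters, building the
# tree bottom-up: each row of Z records one merge as
# (cluster_i, cluster_j, merge_distance, size_of_new_cluster).
Z = linkage(points, method="single")
print(Z)
# scipy.cluster.hierarchy.dendrogram(Z) would plot the merge tree.
```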
k-means Clustering
1. Choose the number of clusters
  • e.g., k=3
2. Select k random centroids
  • often examples from the data
3. Until convergence:
  a. Iterate over all examples and assign each to the cluster whose centroid is closest
  b. Re-compute each cluster centroid
[Figure: example points with three centroids, a, b, and c]
Student Q: In k-means clustering, what do the initial k points represent (what is their function)?
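To make the loop above concrete, here is a minimal NumPy sketch (an illustration, not the course's reference implementation); the data, k, and convergence test are invented for the example.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array; returns (centroids, labels).
    Assumes no cluster ever ends up empty (cf. the later student question
    about empty clusters)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct examples as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3a: assign every example to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3b: re-compute each centroid as the mean of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return centroids, labels

# Toy usage with invented data: three blobs in the plane.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.7, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
centroids, labels = kmeans(X, k=3)
```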
k-means Clustering
[Figure: four panels (1–4) tracing k-means on example points a, b, c: initial centroids, first assignment, centroid re-computation, and the final clusters]
Expectation Maximization
[Figure: four panels (1–4) illustrating successive EM iterations]
Student Q: How do we avoid overfitting with clustering?
Student Q: In probability-based clustering, can regions for different classifications overlap?
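The slides give no code for EM; as a hedged illustration, scikit-learn's GaussianMixture fits a Gaussian mixture by EM, and its soft responsibilities show how probabilistic cluster regions can overlap, which speaks to the second question above. The toy data are invented.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Invented toy data: two overlapping Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.5, 1.0, size=(100, 2))])

# EM for a 2-component mixture. covariance_type="diag" fits fewer parameters
# than a full covariance matrix, one way to restrain overfitting (cf. the
# question above about letting all attributes covary).
gmm = GaussianMixture(n_components=2, covariance_type="diag",
                      random_state=0).fit(X)

resp = gmm.predict_proba(X)  # soft responsibilities: each row sums to 1
print(resp[:5])  # points between the blobs get appreciable weight from both components
```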
Discussion
• k-means minimizes squared distance to cluster centers
• Result can vary significantly, based on the initial choice of seeds
• Can get trapped in a local minimum
  • Example: [Figure: the same instances clustered two different ways, depending on the initial cluster centers]
• To increase the chance of finding the global optimum: restart with different random seeds (see the sketch after this list)
• For hierarchical clustering, can be applied recursively with k=2
Student Q: Why are the final clusters sensitive to the initial cluster centers?
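One way to see the restart advice in practice (a sketch, not from the slides): scikit-learn's KMeans already implements it through its n_init parameter, keeping the run with the lowest within-cluster sum of squares. The data are invented.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in ([0, 0], [3, 0], [1.5, 3])])

# n_init=10 runs k-means from 10 different random seedings and keeps the
# solution with the lowest inertia (sum of squared distances to centroids),
# reducing the risk of stopping in a poor local minimum.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)
```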
Clustering: How Many Clusters?
• How to choose k in k-means? Possibilities:
  • Choose the k that minimizes the cross-validated squared distance to cluster centers
  • Use penalized squared distance on the training data (e.g., using an MDL criterion)
  • Apply k-means recursively with k = 2 and use a stopping criterion (e.g., based on MDL)
    • Seeds for the subclusters can be chosen by seeding along the direction of greatest variance in the cluster (one standard deviation away in each direction from the center of the parent cluster)
    • Implemented in an algorithm called X-means (using the Bayesian Information Criterion instead of MDL; see the sketch below)
Student Q: How can we determine the best attribute to use with k-nearest neighbor?
Student Q: In k-means clustering, what is a simple way to determine the number of cluster points you will need if you do not already know?
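X-means itself is not shown in the slides; as a hedged sketch of the same penalized-score idea, one can fit models for several candidate values of k and keep the one with the lowest Bayesian Information Criterion. The data are invented; a Gaussian mixture is used here because scikit-learn exposes BIC for it directly.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.6, size=(60, 2)) for c in ([0, 0], [4, 0], [2, 3])])

# BIC penalizes the number of parameters, so raising k (which always reduces
# training error) stops paying off once the extra clusters only fit noise.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 7)}
best_k = min(bics, key=bics.get)
print(bics, "-> chosen k:", best_k)
```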
Student Questions
Student Q: Why couldn't a training set labeled via the clustering method be used to train for a rule?
Student Q: Why exactly does saying all of the attributes are covariant contribute to overfitting?
Student Q: Is k-means restricted to data for which there is a notion of a center (centroid)?
Student Q: I am wondering how k-means handles outliers, or how it deals with empty clusters, which can occur if no points are allocated to a cluster during the assignment step. This can potentially lead to a larger squared error than necessary.
Student Q: Is there a concrete benefit to visualizing the clusters like in the book, or does it just look nice?
Student Q: In what sense is clustering different from unsupervised classification?
Student Q: If clusters can be formed from the data set, couldn't it be said that there are classifications that could be learned as well, or that those points were classified as that cluster? And if that is the case, is clustering just instance-based learning?
Faster Distance Calculations
• Can we use kD-trees or ball trees to speed up the process? Yes:
  • First, build the tree, which remains static, over all the data points
  • At each node, store the number of instances and the sum of all instances
  • In each iteration, descend the tree and find out which cluster each node belongs to
    • Can stop descending as soon as we find that a node belongs entirely to a particular cluster (see the sketch below)
    • Use the statistics stored at the nodes to compute the new cluster centers
Student Q: It seems like, when constructing a kD-tree, it's possible to process multiple data points in one go. Is this a realistic assumption, or a best-case scenario?
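The step "stop descending as soon as a node belongs entirely to one cluster" needs a geometric test. Below is a minimal sketch of one standard sufficient condition (my formulation; the slides do not spell it out): a centroid owns a node's axis-aligned bounding box if even the box's farthest point from that centroid is at least as close to it as the box's nearest point is to any other centroid.

```python
import numpy as np

def min_dist(box_lo, box_hi, c):
    """Distance from centroid c to the nearest point of the box."""
    nearest = np.clip(c, box_lo, box_hi)
    return np.linalg.norm(c - nearest)

def max_dist(box_lo, box_hi, c):
    """Distance from centroid c to the farthest point of the box."""
    farthest = np.where(np.abs(c - box_lo) > np.abs(c - box_hi), box_lo, box_hi)
    return np.linalg.norm(c - farthest)

def sole_owner(box_lo, box_hi, centroids):
    """Return the index of the centroid that owns every point in the box,
    or None if ownership can't be decided (descend further in that case)."""
    i = min(range(len(centroids)),
            key=lambda j: min_dist(box_lo, box_hi, centroids[j]))
    worst = max_dist(box_lo, box_hi, centroids[i])
    for j, c in enumerate(centroids):
        if j != i and min_dist(box_lo, box_hi, c) < worst:
            return None  # some point in the box might be closer to centroid j
    return i
```

When sole_owner returns an index, the node's stored count and instance sum can be added to that centroid's accumulator without visiting the individual points.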
Dimensionality Reduction
• Principal Components Analysis
• Singular Value Decomposition
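The slides only name the two techniques; as a minimal hedged sketch of the connection between them, PCA can be computed from the SVD of the mean-centered data matrix. The data here are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))   # invented data: 200 instances, 5 attributes

Xc = X - X.mean(axis=0)         # center each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal components (directions of greatest variance);
# projecting onto the first two reduces the data to 2 dimensions.
X_reduced = Xc @ Vt[:2].T
explained_variance = (S ** 2) / (len(X) - 1)
print(explained_variance[:2])
```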