Intro_to_classification_clustering
Fair Use Agreement
This agreement covers the use of all slides on this CD-ROM; please read carefully.
• You may freely use these slides for teaching, if
• You send me an email telling me the class number/ university in advance.
• My name and email address appear on the first slide (if you are using all or most of the slides), or on each slide (if you are just taking a few slides).
• You may freely use these slides for a conference presentation, if
• You send me an email telling me the conference name in advance.
• My name appears on each slide you use.
• You may not use these slides for tutorials, or in a published work (tech report/conference paper/thesis/journal etc). If you wish to do this, email me first; it is highly likely I will grant you permission.
(c) Eamonn Keogh, [email protected]
A Gentle Introduction to Machine Learning
Dr Eamonn Keogh
University of California - Riverside
[email protected]
The Goal and Style of this Tutorial
• The tutorial is directed at advanced undergrads, grad students and young faculty who have some background in databases, and want a gentle introduction to data mining and machine learning.
• This tutorial is presented very math-lite! Instead the focus
is on communicating (visually, where possible) the intuitions
behind the important issues in data mining, and the tools and
techniques that have been developed to deal with them.
• If you want more rigorous details about some issues, I have
included pointers to useful papers and books.
• Please feel free to interrupt me with questions.
Data Mining Definition
• Finding hidden information in a database
• Data Mining has been defined as
“The nontrivial extraction of implicit, previously
unknown, and potentially useful information from
data*”.
• Similar terms
– Exploratory data analysis
– Data driven discovery
– Deductive learning
– Discovery Science
– Knowledge Discovery
*G. Piatetsky-Shapiro and W. J. Frawley, Knowledge Discovery in Databases, AAAI/MIT Press, 1991.
Database vs. Data Mining

Database
• Query
– Well defined
– SQL
• Output
– Subset of database
• Field
– Mature
– Hard to publish a bad SIGMOD/VLDB paper

Data Mining
• Query
– Poorly defined
– No precise query language
• Output
– Not a subset of database
• Field
– Still in infancy
– Easy to publish a bad SIGKDD paper!
Query Examples
Database
• Find all customers that live in Boa Vista
• Find all customers that use Mastercard
• Find all customers that missed one payment
Data mining
• Find all customers that are likely to miss one payment
(Classification)
• Group all customers with similar buying habits (Clustering)
• List all items that are frequently purchased with bicycles
(Association rules)
• Find any “unusual” customers (Outlier detection, anomaly
discovery)
The Major Data Mining Tasks
• Classification
• Clustering
• Associations
Most of the other tasks (for example, outlier discovery or anomaly
detection ) make heavy use of one or more of the above.
So in this tutorial we will focus most of our energy on the above,
starting with…
The Classification Problem
(informal definition)

Given a collection of annotated data (in this case five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.

Katydid or Grasshopper?

(Slide shows images of the labeled Katydids and Grasshoppers, plus the unlabeled insect.)
For any domain of interest, we can measure features:
• Color {Green, Brown, Gray, Other}
• Has Wings?
• Abdomen Length
• Thorax Length
• Antennae Length
• Mandible Size
• Spiracle Diameter
• Leg Length
We can store features
in a database.
The classification
problem can now be
expressed as:
• Given a training database
(My_Collection), predict
the class label of a
previously unseen instance
My_Collection

Insect ID | Abdomen Length | Antennae Length | Insect Class
1 | 2.7 | 5.5 | Grasshopper
2 | 8.0 | 9.1 | Katydid
3 | 0.9 | 4.7 | Grasshopper
4 | 1.1 | 3.1 | Grasshopper
5 | 5.4 | 8.5 | Katydid
6 | 2.9 | 1.9 | Grasshopper
7 | 6.1 | 6.6 | Katydid
8 | 0.5 | 1.0 | Grasshopper
9 | 8.3 | 6.6 | Katydid
10 | 8.1 | 4.7 | Katydid

previously unseen instance = 11 | 5.1 | 7.0 | ???????
(Scatter plot: Antenna Length (y-axis, 1 to 10) vs. Abdomen Length (x-axis, 1 to 10), with the Grasshoppers and Katydids plotted as labeled points.)
We will also use this larger dataset as a motivating example…

(Scatter plot: Antenna Length vs. Abdomen Length for many Grasshoppers and Katydids.)
Each of these data objects is called…
• exemplars
• (training) examples
• instances
• tuples
We will return to the
previous slide in two minutes.
In the meantime, we are
going to play a quick game.
I am going to show you some
classification problems which
were shown to pigeons!
Let us see if you are as
smart as a pigeon!
Pigeon Problem 1

Examples of class A
Examples of class B
(Each example is a pair of bars of various heights; several examples of each class are shown.)
Pigeon Problem 1

Examples of class A
Examples of class B
(Pairs of bars, as before.)

What class is this object?
What about this one, A or B?
Pigeon Problem 1

Examples of class A
Examples of class B
(Pairs of bars, as before.)

This is a B!
Here is the rule.
If the left bar is
smaller than the
right bar, it is an A,
otherwise it is a B.
Pigeon Problem 2

Examples of class A
Examples of class B
(Pairs of bars.)

Oh! This one's hard!
Even I know this one
Pigeon Problem 2

Examples of class A
Examples of class B
(Pairs of bars.)

The rule is as follows: if the two bars are of equal size, it is an A. Otherwise it is a B.
So this one is an A.
Pigeon Problem 3

Examples of class A
Examples of class B
(Pairs of bars.)

This one is really hard!
What is this, A or B?
Pigeon Problem 3

Examples of class A
Examples of class B
(Pairs of bars.)

It is a B!

The rule is as follows: if the square of the sum of the two bars is less than or equal to 100, it is an A. Otherwise it is a B.
Why did we spend so much
time with this game?
Because we wanted to
show that almost all
classification problems
have a geometric
interpretation, check out
the next 3 slides…
Pigeon Problem 1

(Scatter plot: each example is plotted as a point, with Right Bar on the x-axis (1 to 10) and Left Bar on the y-axis (1 to 10); the class A and class B examples form two regions.)
Here is the rule again.
If the left bar is smaller
than the right bar, it is
an A, otherwise it is a B.
Pigeon Problem 2

(Scatter plot: Right Bar vs. Left Bar, as before; the class A points all lie on the diagonal.)

Let me look it up… here it is: the rule is, if the two bars are of equal size, it is an A. Otherwise it is a B.
Pigeon Problem 3

(Scatter plot: Right Bar vs. Left Bar, with both axes running from 10 to 100.)
The rule again:
if the square of the sum of the
two bars is less than or equal
to 100, it is an A. Otherwise it
is a B.
(Scatter plot: Antenna Length vs. Abdomen Length for the Grasshoppers and Katydids again.)
previously unseen instance = 11 | 5.1 | 7.0 | ???????
• We can “project” the previously unseen instance into the same space as the database.
• We have now abstracted away the details of our particular problem. It will be much easier to talk about points in space.

(Scatter plot: the unseen instance plotted among the Katydids and Grasshoppers.)
Simple Linear Classifier
R. A. Fisher (1890-1962)

If the previously unseen instance is above the line
then class is Katydid
else class is Grasshopper

(Scatter plot: Katydids and Grasshoppers separated by a straight line.)
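The decision rule on this slide can be sketched in a few lines. The slope and intercept below are hypothetical values, not from the slides; a line like antenna = -abdomen + 10 happens to separate the ten My_Collection insects, but in practice the parameters would be fitted to the training data.

```python
# A sketch of the simple linear classifier's decision rule.
# SLOPE and INTERCEPT are hypothetical values chosen so that the line
# antenna = -1.0 * abdomen + 10.0 separates the ten training insects;
# a real implementation would fit them to the data.
SLOPE, INTERCEPT = -1.0, 10.0

def classify(abdomen_length, antenna_length):
    # "If previously unseen instance above the line then class is Katydid,
    #  else class is Grasshopper"
    line_height = SLOPE * abdomen_length + INTERCEPT
    return "Katydid" if antenna_length > line_height else "Grasshopper"

print(classify(5.1, 7.0))  # the unseen instance 11 -> Katydid
```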
The simple linear classifier is defined for higher dimensional spaces…
… we can visualize it as
being an n-dimensional
hyperplane
It is interesting to think about what would happen in this example if
we did not have the 3rd dimension…
We can no longer get perfect
accuracy with the simple linear
classifier…
We could try to solve this problem by using a simple quadratic classifier or a simple cubic classifier…
However, as we will later see,
this is probably a bad idea…
Which of the “Pigeon Problems” can be
solved by the Simple Linear Classifier?
1) Perfect
2) Useless
3) Pretty Good
Problems that can be solved by a linear classifier are called linearly separable.
A Famous Problem
R. A. Fisher’s Iris Dataset.
• 3 classes: Iris Setosa, Iris Versicolor, Iris Virginica
• 50 of each class
The task is to classify Iris plants into one of 3 varieties using the Petal Length and Petal Width.
We can generalize the piecewise linear classifier to N classes, by
fitting N-1 lines. In this case we first learned the line to (perfectly)
discriminate between Setosa and Virginica/Versicolor, then we
learned to approximately discriminate between Virginica and
Versicolor.
If petal width > 3.272 – (0.325 * petal length) then class = Virginica
Elseif petal width…
We have now seen one classification
algorithm, and we are about to see more.
How should we compare them?
• Predictive accuracy
• Speed and scalability
– time to construct the model
– time to use the model
– efficiency in disk-resident databases
• Robustness
– handling noise, missing values and irrelevant features,
streaming data
• Interpretability:
– understanding and insight provided by the model
Predictive Accuracy I
• How do we estimate the accuracy of our classifier?
We can use K-fold cross validation
We divide the dataset into K equal sized sections. The algorithm is tested K times, each time leaving out one of the K sections from building the classifier, but using it to test the classifier instead.

Accuracy = Number of correct classifications / Number of instances in our database

(K = 5 in the illustration.)
Insect ID | Abdomen Length | Antennae Length | Insect Class
1 | 2.7 | 5.5 | Grasshopper
2 | 8.0 | 9.1 | Katydid
3 | 0.9 | 4.7 | Grasshopper
4 | 1.1 | 3.1 | Grasshopper
5 | 5.4 | 8.5 | Katydid
6 | 2.9 | 1.9 | Grasshopper
7 | 6.1 | 6.6 | Katydid
8 | 0.5 | 1.0 | Grasshopper
9 | 8.3 | 6.6 | Katydid
10 | 8.1 | 4.7 | Katydid
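The K-fold procedure above can be sketched as follows. To keep the sketch short the "classifier" is a fixed threshold rule (abdomen length > 5 means Katydid) rather than one re-fitted on each training fold; in real use the model would be rebuilt from the K-1 remaining sections on every iteration.

```python
# K-fold cross validation sketch (K = 5) on the My_Collection data.
# Each tuple: (abdomen length, antennae length, class).
data = [(2.7, 5.5, "Grasshopper"), (8.0, 9.1, "Katydid"),
        (0.9, 4.7, "Grasshopper"), (1.1, 3.1, "Grasshopper"),
        (5.4, 8.5, "Katydid"),     (2.9, 1.9, "Grasshopper"),
        (6.1, 6.6, "Katydid"),     (0.5, 1.0, "Grasshopper"),
        (8.3, 6.6, "Katydid"),     (8.1, 4.7, "Katydid")]

def classify(abdomen, antenna):
    # Stand-in for a learned model: a fixed threshold rule.
    return "Katydid" if abdomen > 5 else "Grasshopper"

def k_fold_accuracy(data, k=5):
    correct = 0
    for fold in range(k):
        test_section = data[fold::k]   # hold out one of the K sections
        # (the other K-1 sections would be used to build the classifier)
        for abdomen, antenna, label in test_section:
            correct += (classify(abdomen, antenna) == label)
    return correct / len(data)  # number correct / number of instances

print(k_fold_accuracy(data, k=5))  # 1.0 -- this rule is perfect here
```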
Predictive Accuracy II
• Using K-fold cross validation is a good way to set any parameters we may need to
adjust in (any) classifier.
• We can do K-fold cross validation for each possible setting, and choose the model
with the highest accuracy. Where there is a tie, we choose the simpler model.
• Actually, we should probably penalize the more complex models, even if they are more
accurate, since more complex models are more likely to overfit (discussed later).
(Three models compared on the same data: Accuracy = 94%, Accuracy = 100%, Accuracy = 100%.)
Predictive Accuracy III

Accuracy = Number of correct classifications / Number of instances in our database

Accuracy is a single number; we may be better off looking at a confusion matrix. This gives us additional useful information…

True label is… | Cat | Dog | Pig
Classified as a Cat | 100 | 0 | 0
Classified as a Dog | 9 | 90 | 1
Classified as a Pig | 45 | 45 | 10
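The overall accuracy hiding inside this confusion matrix is easy to compute: correct predictions lie on the diagonal. A minimal sketch:

```python
# Overall accuracy from the confusion matrix above: correct predictions
# lie on the diagonal, so accuracy = (100 + 90 + 10) / 300.
labels = ["Cat", "Dog", "Pig"]
# rows: classified as; columns: true label (same order as `labels`)
confusion = [[100,  0,  0],   # classified as Cat
             [  9, 90,  1],   # classified as Dog
             [ 45, 45, 10]]   # classified as Pig

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / total
print(correct, total, round(accuracy, 3))  # 200 300 0.667
```

Note what the single number hides: the Cat class is classified perfectly, while most Pigs are misclassified.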
Speed and Scalability I
We need to consider the time and space requirements for the two distinct phases of classification:
• Time to construct the classifier
• In the case of the simple linear classifier, the time taken to fit the line; this is linear in the number of instances.
• Time to use the model
• In the case of the simple linear classifier, the time taken to test which side of the line the unlabeled instance is on. This can be done in constant time.
As we shall see, some classification algorithms are very efficient in one aspect, and very poor in the other.
Speed and Scalability II
For learning with small datasets, this is the whole picture.
However, for data mining with massive datasets, it is not so much the (main memory) time complexity that matters; rather, it is how many times we have to scan the database.
This is because for most data mining operations, disk access times completely dominate the CPU times.
For data mining, researchers often report the number of times you must scan the database.
Robustness I
We need to consider what happens when we have:
• Noise
• For example, a person's age could have been mistyped as 650 instead of 65; how does this affect our classifier? (This is important only for building the classifier; if the instance to be classified is noisy we can do nothing.)
• Missing values
• For example, suppose we want to classify an insect, but we only know the abdomen length (X-axis), and not the antennae length (Y-axis). Can we still classify the instance?
Robustness II
We need to consider what happens when we have:
• Irrelevant features
For example, suppose we want to classify people as either
• Suitable_Grad_Student
• Unsuitable_Grad_Student
And it happens that scoring more than 5 on a particular test is a perfect indicator for this problem…
If we also use “hair_length” as a feature, how will this affect our classifier?
Robustness III
We need to consider what happens when we have:
• Streaming data
For many real world problems, we don’t have a single fixed
dataset. Instead, the data continuously arrives, potentially
forever… (stock market, weather data, sensor data etc)
Can our classifier handle streaming data?
Interpretability
Some classifiers offer a bonus feature. The structure of the learned classifier tells us something about the domain.
As a trivial example, if we try to classify people's health risks based on just their height and weight, we could gain the following insight (based on the observation that a single linear classifier does not work well, but two linear classifiers do):
There are two ways to be unhealthy, being obese and being too skinny.
(Plot: Weight vs. Height, with two linear boundaries.)
Nearest Neighbor Classifier
Evelyn Fix (1904-1965), Joe Hodges (1922-2000)

If the nearest instance to the previously unseen instance is a Katydid
then class is Katydid
else class is Grasshopper

(Scatter plot: Antenna Length vs. Abdomen Length, Katydids and Grasshoppers.)
We can visualize the nearest neighbor algorithm in terms of
a decision surface…
Note that we don’t actually have to construct these surfaces; they are simply the implicit boundaries that divide the space into regions “belonging” to each instance.
This division of space is called the Dirichlet tessellation (or Voronoi diagram, or Thiessen regions).
The nearest neighbor algorithm is sensitive to outliers…
The solution is to…
We can generalize the nearest neighbor algorithm to
the K- nearest neighbor (KNN) algorithm.
We measure the distance to the nearest K instances, and let
them vote. K is typically chosen to be an odd number.
(Illustrations of the vote for K = 1 and K = 3.)
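The voting scheme above can be sketched directly (Euclidean distance is assumed here; later slides relax this):

```python
import math
from collections import Counter

# A sketch of the K-nearest-neighbor classifier on the My_Collection
# data: (abdomen length, antennae length, class).
data = [(2.7, 5.5, "Grasshopper"), (8.0, 9.1, "Katydid"),
        (0.9, 4.7, "Grasshopper"), (1.1, 3.1, "Grasshopper"),
        (5.4, 8.5, "Katydid"),     (2.9, 1.9, "Grasshopper"),
        (6.1, 6.6, "Katydid"),     (0.5, 1.0, "Grasshopper"),
        (8.3, 6.6, "Katydid"),     (8.1, 4.7, "Katydid")]

def knn_classify(query, data, k=1):
    # Sort the training instances by Euclidean distance to the query...
    neighbors = sorted(data, key=lambda row: math.dist(query, row[:2]))[:k]
    # ...and let the k nearest of them vote.
    votes = Counter(label for _, _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify((5.1, 7.0), data, k=1))  # Katydid
print(knn_classify((5.1, 7.0), data, k=3))  # Katydid
```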
The nearest neighbor algorithm is sensitive to irrelevant features…

Suppose the following is true: if an insect's antennae are longer than 5.5 it is a Katydid, otherwise it is a Grasshopper.
Using just the antenna length we get perfect classification!

Suppose however, we add in an irrelevant feature, for example the insect's mass.
Using both the antenna length and the insect's mass with the 1-NN algorithm we get the wrong classification!

(Plots: the training data on the antenna-length axis alone, and in the antenna-length/mass plane.)
How do we mitigate the nearest neighbor
algorithms sensitivity to irrelevant features?
• Use more training instances
• Ask an expert what features are relevant to the task
• Use statistical tests to try to determine which
features are useful
• Search over feature subsets (in the next slide we
will see why this is hard)
Why searching over feature subsets is hard
Suppose you have the following classification problem, with 100 features, where it happens that Features 1 and 2 (the X and Y below) give perfect classification, but all 98 of the other features are irrelevant…
(Plots: the data projected onto Feature 1 only, and onto Feature 2 only.)
Using all 100 features will give poor results, but so will using only Feature 1, and so will using only Feature 2! Of the 2^100 − 1 possible subsets of the features, only one really works.
(Diagram: the lattice of feature subsets, from the singletons {1}, {2}, {3}, {4} up to {1,2,3,4}.)
• Forward Selection
• Backward Elimination
• Bi-directional Search
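A sketch of greedy forward selection, with a hypothetical scoring function built to show exactly the failure described above: one feature looks best on its own, so the greedy search picks it and then stops, never reaching the only subset that really works (features 0 and 1 in zero-based code).

```python
# Hypothetical scoring function: only the subset {0, 1} classifies
# perfectly, but feature 2 looks (misleadingly) best on its own.
def score(subset):
    if {0, 1} <= set(subset):
        return 1.0          # only features 0 and 1 together work
    if set(subset) == {2}:
        return 0.6          # feature 2 alone looks slightly better...
    return 0.5              # ...than any other single feature

def forward_selection(n_features, score):
    selected, best = set(), score(set())
    while True:
        candidates = [(score(selected | {f}), f)
                      for f in range(n_features) if f not in selected]
        gain, f = max(candidates)
        if gain <= best:
            return selected, best        # no single addition helps: stop
        selected, best = selected | {f}, gain

print(forward_selection(4, score))  # ({2}, 0.6) -- misses {0, 1}
```

The greedy search returns {2} with score 0.6, never discovering that {0, 1} scores 1.0; this is the sense in which the search over subsets is hard.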
The nearest neighbor algorithm is sensitive to the units of measurement

X axis measured in centimeters, Y axis measured in dollars: the nearest neighbor to the pink unknown instance is red.
X axis measured in millimeters, Y axis measured in dollars: the nearest neighbor to the pink unknown instance is blue.
One solution is to normalize the units to pure numbers. Typically the features are Z-normalized to have a mean of zero and a standard deviation of one: x' = (x − mean(x)) / std(x)
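Z-normalization is a one-liner per feature. In the sketch below the same four lengths are expressed in centimeters and in millimeters; after normalization the unit difference disappears (statistics.pstdev is the population standard deviation).

```python
import statistics

# Z-normalize one feature: subtract the mean, divide by the standard
# deviation, so the values become unitless.
def z_normalize(values):
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

cm = [10.0, 20.0, 30.0, 40.0]      # lengths in centimeters...
mm = [100.0, 200.0, 300.0, 400.0]  # ...the same lengths in millimeters
print([round(v, 3) for v in z_normalize(cm)])  # [-1.342, -0.447, 0.447, 1.342]
print([round(v, 3) for v in z_normalize(mm)])  # [-1.342, -0.447, 0.447, 1.342]
```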
We can speed up nearest neighbor algorithm by “throwing
away” some data. This is called data editing.
Note that this can sometimes improve accuracy!
We can also speed up classification with indexing
One possible approach.
Delete all instances that are
surrounded by members of
their own class.
Up to now we have assumed that the nearest neighbor algorithm uses the Euclidean Distance, however this need not be the case…

D(Q,C) = sqrt( Σ_{i=1..n} (q_i − c_i)^2 )        (Euclidean)
D(Q,C) = ( Σ_{i=1..n} |q_i − c_i|^p )^(1/p)      (the general L_p norm)

• Max (p = inf)
• Manhattan (p = 1)
• Weighted Euclidean
• Mahalanobis
…In fact, we can use the nearest neighbor algorithm with any distance/similarity function

For example, is “Faloutsos” Greek or Irish? We could compare the name “Faloutsos” to a database of names using string edit distance…

edit_distance(Faloutsos, Keogh) = 8
edit_distance(Faloutsos, Gunopulos) = 6

Hopefully, the similarity of the name (particularly the suffix) to other Greek names would mean the nearest neighbor is also a Greek name.

ID | Name | Class
1 | Gunopulos | Greek
2 | Papadopoulos | Greek
3 | Kollios | Greek
4 | Dardanos | Greek
5 | Keogh | Irish
6 | Gough | Irish
7 | Greenhaugh | Irish
8 | Hadleigh | Irish
Specialized distance measures exist for DNA strings, time series,
images, graphs, videos, sets, fingerprints etc…
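The edit distances quoted above can be reproduced with the standard dynamic-programming recurrence (a sketch; real systems would use an optimized implementation):

```python
# Levenshtein (string edit) distance via the classic row-by-row DP.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

names = [("Gunopulos", "Greek"), ("Papadopoulos", "Greek"),
         ("Kollios", "Greek"), ("Dardanos", "Greek"),
         ("Keogh", "Irish"), ("Gough", "Irish"),
         ("Greenhaugh", "Irish"), ("Hadleigh", "Irish")]

print(edit_distance("Faloutsos", "Keogh"))      # 8
print(edit_distance("Faloutsos", "Gunopulos"))  # 6
nearest = min(names, key=lambda r: edit_distance("Faloutsos", r[0]))
print(nearest[1])  # the nearest name is indeed Greek
```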
Advantages/Disadvantages of Nearest Neighbor
• Advantages:
– Simple to implement
– Handles correlated features (arbitrary class shapes)
– Defined for any distance measure
– Handles streaming data trivially
• Disadvantages:
– Very sensitive to irrelevant features.
– Slow classification time for large datasets
– Works best for real valued datasets
Decision Tree Classifier
Ross Quinlan

Abdomen Length > 7.1?
  yes → Katydid
  no → Antenna Length > 6.0?
         yes → Katydid
         no → Grasshopper

(Scatter plot: Antenna Length vs. Abdomen Length, partitioned into rectangles by the tree.)
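The tree above can be written directly as nested tests; on the ten insects in My_Collection it classifies every instance correctly.

```python
# The decision tree above as nested tests (thresholds from the slide).
def tree_classify(abdomen, antenna):
    if abdomen > 7.1:
        return "Katydid"
    if antenna > 6.0:
        return "Katydid"
    return "Grasshopper"

# (abdomen length, antennae length, class) from My_Collection:
data = [(2.7, 5.5, "Grasshopper"), (8.0, 9.1, "Katydid"),
        (0.9, 4.7, "Grasshopper"), (1.1, 3.1, "Grasshopper"),
        (5.4, 8.5, "Katydid"),     (2.9, 1.9, "Grasshopper"),
        (6.1, 6.6, "Katydid"),     (0.5, 1.0, "Grasshopper"),
        (8.3, 6.6, "Katydid"),     (8.1, 4.7, "Katydid")]

print(all(tree_classify(a, b) == c for a, b, c in data))  # True
```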
Decision trees predate computers

Antennae shorter than body?
  Yes → 3 Tarsi?
          Yes → Foretibia has ears?
                  Yes → Katydids
                  No → Camel Cricket
          No → Cricket
  No → Grasshopper
Decision Tree Classification
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
How do we construct the decision tree?
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they can be discretized
in advance)
– Examples are partitioned recursively based on selected attributes.
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left
Information Gain as a Splitting Criterion
• Select the attribute with the highest information gain (information gain is the expected reduction in entropy).
• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n elements of class N
– The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as

E(S) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))

(0 log(0) is defined as 0)
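The definition above, as a small function (with the 0 log 0 = 0 convention):

```python
import math

# Two-class entropy E(S) for p elements of class P and n of class N.
def entropy(p, n):
    total = p + n
    e = 0.0
    for count in (p, n):
        if count:  # 0 log(0) is defined as 0
            frac = count / total
            e -= frac * math.log2(frac)
    return e

print(round(entropy(4, 5), 4))  # 0.9911 -- the 4F/5M example used later
```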
Information Gain in Decision Tree Induction
• Assume that using attribute A, a current set will be partitioned into some number of child sets
• The encoding information that would be gained by branching on A is

Gain(A) = E(Current set) − E(all child sets)

Note: entropy is at its minimum if the collection of objects is completely uniform (i.e., all objects belong to the same class)
Person | Hair Length | Weight | Age | Class
Homer | 0" | 250 | 36 | M
Marge | 10" | 150 | 34 | F
Bart | 2" | 90 | 10 | M
Lisa | 6" | 78 | 8 | F
Maggie | 4" | 20 | 1 | F
Abe | 1" | 170 | 70 | M
Selma | 8" | 160 | 41 | F
Otto | 10" | 180 | 38 | M
Krusty | 6" | 200 | 45 | M
Comic | 8" | 290 | 38 | ?
Entropy(S) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))
Entropy(4F,5M) = −(4/9)log2(4/9) − (5/9)log2(5/9) = 0.9911

Let us try splitting on Hair length:
Hair Length <= 5? (yes/no)

Gain(A) = E(Current set) − E(all child sets)
Gain(Hair Length <= 5) = 0.9911 − (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911
Entropy(4F,5M) = −(4/9)log2(4/9) − (5/9)log2(5/9) = 0.9911

Let us try splitting on Weight:
Weight <= 160? (yes/no)

Gain(A) = E(Current set) − E(all child sets)
Gain(Weight <= 160) = 0.9911 − (5/9 * 0.7219 + 4/9 * 0) = 0.5900
Entropy(4F,5M) = −(4/9)log2(4/9) − (5/9)log2(5/9) = 0.9911

Let us try splitting on Age:
age <= 40? (yes/no)

Gain(A) = E(Current set) − E(all child sets)
Gain(Age <= 40) = 0.9911 − (6/9 * 1 + 3/9 * 0.9183) = 0.0183
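The three gains just computed can be checked mechanically from the table of nine labeled characters (the numbers below reproduce 0.0911, 0.5900 and 0.0183 up to rounding):

```python
import math

# (name, hair length, weight, age, sex) for the nine labeled people.
people = [("Homer", 0, 250, 36, "M"), ("Marge", 10, 150, 34, "F"),
          ("Bart", 2, 90, 10, "M"), ("Lisa", 6, 78, 8, "F"),
          ("Maggie", 4, 20, 1, "F"), ("Abe", 1, 170, 70, "M"),
          ("Selma", 8, 160, 41, "F"), ("Otto", 10, 180, 38, "M"),
          ("Krusty", 6, 200, 45, "M")]

def entropy(labels):
    e = 0.0
    for c in set(labels):
        f = labels.count(c) / len(labels)
        e -= f * math.log2(f)
    return e

def gain(rows, attr, threshold):
    labels = [r[4] for r in rows]
    left = [r[4] for r in rows if r[attr] <= threshold]
    right = [r[4] for r in rows if r[attr] > threshold]
    children = (len(left) / len(rows)) * entropy(left) \
             + (len(right) / len(rows)) * entropy(right)
    return entropy(labels) - children

print(round(gain(people, 1, 5), 4))    # Hair Length <= 5 : 0.0911
print(round(gain(people, 2, 160), 4))  # Weight <= 160    : 0.59
print(round(gain(people, 3, 40), 4))   # Age <= 40        : 0.0183
```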
Of the 3 features we had, Weight
was best. But while people who
weigh over 160 are perfectly
classified (as males), the under 160
people are not perfectly
classified… So we simply recurse!
This time we find that we
can split on Hair length, and
we are done!
Weight <= 160?
  yes → Hair Length <= 2?
  no → Male

We don’t need to keep the data around, just the test conditions.
Weight <= 160?
  yes → Hair Length <= 2?
           yes → Male
           no → Female
  no → Male

How would these people be classified?
It is trivial to convert Decision Trees to rules…

Weight <= 160?
  yes → Hair Length <= 2?
           yes → Male
           no → Female
  no → Male

Rules to Classify Males/Females
If Weight greater than 160, classify as Male
Elseif Hair Length less than or equal to 2, classify as Male
Else classify as Female
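The three rules, as code; applied to the nine labeled characters in the table they classify everyone correctly:

```python
# The rules above: (hair length, weight, class) for the labeled people.
people = [(0, 250, "Male"), (10, 150, "Female"), (2, 90, "Male"),
          (6, 78, "Female"), (4, 20, "Female"), (1, 170, "Male"),
          (8, 160, "Female"), (10, 180, "Male"), (6, 200, "Male")]

def classify(hair_length, weight):
    if weight > 160:
        return "Male"
    if hair_length <= 2:
        return "Male"
    return "Female"

print(all(classify(h, w) == sex for h, w, sex in people))  # True
```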
Once we have learned the decision tree, we don’t even need a computer!
This decision tree is attached to a medical machine, and is designed to help
nurses make decisions about what type of doctor to call.
Decision tree for a typical shared-care setting applying
the system for the diagnosis of prostatic obstructions.
The worked examples we have
seen were performed on small
datasets. However with small
datasets there is a great danger of
overfitting the data…
When you have few datapoints,
there are many possible splitting
rules that perfectly classify the
data, but will not generalize to
future datasets.
Wears green?
  Yes → Female
  No → Male

For example, the rule “Wears green?” perfectly classifies the data, so does “Mother's name is Jacqueline?”, so does “Has blue shoes”…
Avoid Overfitting in Classification
• The generated tree may overfit the training data
– Too many branches, some may reflect anomalies due to
noise or outliers
– Result is in poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure falling
below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
• Use a set of data different from the training data to
decide which is the “best pruned tree”
Which of the “Pigeon Problems” can be
solved by a Decision Tree?
1) Deep Bushy Tree
2) Useless
3) Deep Bushy Tree
The Decision Tree
has a hard time with
correlated attributes
Advantages/Disadvantages of Decision Trees
• Advantages:
– Easy to understand (Doctors love them!)
– Easy to generate rules
• Disadvantages:
– May suffer from overfitting.
– Classifies by rectangular partitioning (so does
not handle correlated features very well).
– Can be quite large – pruning is necessary.
– Does not handle streaming data easily
Naïve Bayes Classifier
Thomas Bayes
1702 - 1761
We will start off with a visual intuition, before looking at the math…
(Scatter plot: Antenna Length vs. Abdomen Length for Grasshoppers and Katydids.)
Remember this example?
Let’s get lots more data…
With a lot of data, we can build a histogram. Let us
just build one for “Antenna Length” for now…
(Histograms of Antenna Length for Katydids and Grasshoppers.)
We can leave the histograms as they are, or we can summarize them with two normal distributions.
Let us use two normal distributions for ease of visualization in the following slides…
• We want to classify an insect we have found. Its antennae are 3 units long.
How can we classify it?
• We can just ask ourselves: given the distributions of antennae lengths we have seen, is it more probable that our insect is a Grasshopper or a Katydid?
• There is a formal way to discuss the most probable classification…
p(cj | d) = probability of class cj, given that we have observed d

Antennae length is 3
P(Grasshopper | 3) = 10 / (10 + 2) = 0.833
P(Katydid | 3) = 2 / (10 + 2) = 0.166
p(cj | d) = probability of class cj, given that we have observed d

Antennae length is 7
P(Grasshopper | 7) = 3 / (3 + 9) = 0.250
P(Katydid | 7) = 9 / (3 + 9) = 0.750
p(cj | d) = probability of class cj, given that we have observed d

Antennae length is 5
P(Grasshopper | 5) = 6 / (6 + 6) = 0.500
P(Katydid | 5) = 6 / (6 + 6) = 0.500
Bayes Classifiers
That was a visual intuition for a simple case of the Bayes classifier,
also called:
• Idiot Bayes
• Naïve Bayes
• Simple Bayes
We are about to see some of the mathematical formalisms, and
more examples, but keep in mind the basic idea.
Find out the probability of the previously unseen instance
belonging to each class, then simply pick the most probable class.
Bayes Classifiers
• Bayesian classifiers use Bayes theorem, which says

p(cj | d) = p(d | cj) p(cj) / p(d)

• p(cj | d) = probability of instance d being in class cj. This is what we are trying to compute.
• p(d | cj) = probability of generating instance d given class cj. We can imagine that being in class cj causes you to have feature d with some probability.
• p(cj) = probability of occurrence of class cj. This is just how frequent the class cj is in our database.
• p(d) = probability of instance d occurring. This can actually be ignored, since it is the same for all classes.
Assume that we have two classes: c1 = male, and c2 = female.
We have a person whose sex we do not know, say “drew” or d. (Note: “Drew” can be a male or female name; think of Drew Barrymore and Drew Carey.)
Classifying drew as male or female is equivalent to asking which is more probable, i.e. which is greater, p(male | drew) or p(female | drew).

p(male | drew) = p(drew | male) p(male) / p(drew)

• p(drew | male): the probability of being called “drew” given that you are a male
• p(male): the probability of being a male
• p(drew): the probability of being named “drew” (actually irrelevant, since it is the same for all classes)
This is Officer Drew (who arrested me in 1997). Is Officer Drew a Male or Female?
Luckily, we have a small database with names and sex. We can use it to apply Bayes rule…

p(cj | d) = p(d | cj) p(cj) / p(d)

Name | Sex
Drew | Male
Claudia | Female
Drew | Female
Drew | Female
Alberto | Male
Karin | Female
Nina | Female
Sergio | Male
p(male | drew) = (1/3 * 3/8) / (3/8) = 0.125 / (3/8)
p(female | drew) = (2/5 * 5/8) / (3/8) = 0.250 / (3/8)

Officer Drew is more likely to be a Female.
Officer Drew IS a female!
Officer Drew
p(male | drew) = (1/3 * 3/8) / (3/8) = 0.125 / (3/8)
p(female | drew) = (2/5 * 5/8) / (3/8) = 0.250 / (3/8)
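The computation above, from the eight-row table. Since p(drew) is the same for both classes, comparing the numerators p(drew | sex) p(sex) is enough:

```python
# Bayes rule on the small name/sex database.
names = [("Drew", "Male"), ("Claudia", "Female"), ("Drew", "Female"),
         ("Drew", "Female"), ("Alberto", "Male"), ("Karin", "Female"),
         ("Nina", "Female"), ("Sergio", "Male")]

def posterior_numerator(name, sex):
    in_class = [n for n, s in names if s == sex]
    p_name_given_sex = in_class.count(name) / len(in_class)  # p(d | cj)
    p_sex = len(in_class) / len(names)                       # p(cj)
    return p_name_given_sex * p_sex

print(posterior_numerator("Drew", "Male"))    # 1/3 * 3/8 = 0.125
print(posterior_numerator("Drew", "Female"))  # 2/5 * 5/8 = 0.25
```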
So far we have only considered Bayes
Classification when we have one
attribute (the “antennae length”, or the
“name”). But we may have many
features.
How do we use all the features?
p(cj | d) = p(d | cj) p(cj) / p(d)

Name | Over 170CM | Eye | Hair length | Sex
Drew | No | Blue | Short | Male
Claudia | Yes | Brown | Long | Female
Drew | No | Blue | Long | Female
Drew | No | Blue | Long | Female
Alberto | Yes | Brown | Short | Male
Karin | No | Blue | Long | Female
Nina | Yes | Brown | Short | Female
Sergio | Yes | Blue | Long | Male
• To simplify the task, naïve Bayesian classifiers assume
attributes have independent distributions, and thereby estimate
p(d|cj) = p(d1|cj) * p(d2|cj) * ….* p(dn|cj)
The probability of
class cj generating
instance d, equals….
The probability of class cj
generating the observed
value for feature 1,
multiplied by..
The probability of class cj
generating the observed
value for feature 2,
multiplied by..
• To simplify the task, naïve Bayesian classifiers
assume attributes have independent distributions, and
thereby estimate
p(d|cj) = p(d1|cj) * p(d2|cj) * ….* p(dn|cj)
p(officer drew|cj) = p(over_170cm = yes|cj) * p(eye =blue|cj) * ….
Officer Drew
is blue-eyed,
over 170cm
tall, and has
long hair
p(officer drew| Female) = 2/5 * 3/5 * ….
p(officer drew| Male) = 2/3 * 2/3 * ….
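Carrying this computation through all three features (the slide stops after the first two factors) and multiplying in the priors gives a sketch of the full classifier; the hair-length factors (4/5 for Female, 1/3 for Male) are read off the same table.

```python
# Naive Bayes with all three features, using the eight-person table above.
# Each row: (name, over_170cm, eye, hair_length, sex).
people = [("Drew", "No", "Blue", "Short", "Male"),
          ("Claudia", "Yes", "Brown", "Long", "Female"),
          ("Drew", "No", "Blue", "Long", "Female"),
          ("Drew", "No", "Blue", "Long", "Female"),
          ("Alberto", "Yes", "Brown", "Short", "Male"),
          ("Karin", "No", "Blue", "Long", "Female"),
          ("Nina", "Yes", "Brown", "Short", "Female"),
          ("Sergio", "Yes", "Blue", "Long", "Male")]

def score(observed, sex):
    rows = [p for p in people if p[4] == sex]
    s = len(rows) / len(people)          # the prior p(cj)
    for i, value in observed.items():    # independence assumption:
        s *= sum(p[i] == value for p in rows) / len(rows)  # * p(di | cj)
    return s

# Officer Drew is over 170cm, blue-eyed, and has long hair.
drew = {1: "Yes", 2: "Blue", 3: "Long"}
m, f = score(drew, "Male"), score(drew, "Female")
print("Female" if f > m else "Male")  # Female
```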
The Naive Bayes classifier is often represented as this type of graph…
cj
Note the direction of the
arrows, which state that
each class causes certain
features, with a certain
probability
p(d1|cj)
p(d2|cj)
…
p(dn|cj)
Naïve Bayes is fast and space efficient
We can look up all the probabilities with a single scan of the database and store them in a (small) table…

p(d1|cj): Sex | Over190cm
Male | Yes 0.15 | No 0.85
Female | Yes 0.01 | No 0.99

p(d2|cj): Sex | Long Hair
Male | Yes 0.05 | No 0.95
Female | Yes 0.70 | No 0.30

… p(dn|cj): Sex | …
Naïve Bayes is NOT sensitive to irrelevant features...
Suppose we are trying to classify a person's sex based on several features, including eye color. (Of course, eye color is completely irrelevant to a person's gender.)

p(Jessica | cj) = p(eye = brown|cj) * p(wears_dress = yes|cj) * ….
p(Jessica | Female) = 9,000/10,000 * 9,975/10,000 * ….
p(Jessica | Male) = 9,001/10,000 * 2/10,000 * ….

Almost the same!
However, this assumes that we have good enough estimates of the probabilities, so the more data the better.
An obvious point. I have used a simple two class problem, and two possible values for each example, for my previous examples. However we can have an arbitrary number of classes, or feature values.

p(d1|cj): Animal | Mass >10kg
Cat | Yes 0.15 | No 0.85
Dog | Yes 0.91 | No 0.09
Pig | Yes 0.99 | No 0.01

p(d2|cj): Animal | Color
Cat | Black 0.33 | White 0.23 | Brown 0.44
Dog | Black 0.97 | White 0.03
Pig | Brown 0.90 | Black 0.04 | White 0.01

… p(dn|cj): Animal | …
Problem!
Naïve Bayes assumes independence of features…

p(d1|cj): Sex | Over 6 foot
Male | Yes 0.15 | No 0.85
Female | Yes 0.01 | No 0.99

p(d2|cj): Sex | Over 200 pounds
Male | Yes 0.11 | No 0.80
Female | Yes 0.05 | No 0.95
Solution
Consider the relationships between attributes…

p(d1|cj):
  Sex     Over 6 foot   p
  Male    Yes           0.15
  Male    No            0.85
  Female  Yes           0.01
  Female  No            0.99

p(d2|cj):
  Sex    Over 200 pounds          p
  Male   Yes and Over 6 foot      0.11
  Male   No and Over 6 foot       0.59
  Male   Yes and NOT Over 6 foot  0.05
  Male   No and NOT Over 6 foot   0.35
Solution
Consider the relationships between attributes…
But how do we find the set of connecting arcs??
The Naïve Bayesian Classifier has a quadratic decision boundary.
[Figure: two classes on a 10 x 10 grid, separated by a curved (quadratic) boundary]
Dear SIR,
I am Mr. John Coleman and my sister is Miss Rose
Colemen, we are the children of late Chief Paul
Colemen from Sierra Leone. I am writing you in
absolute confidence primarily to seek your assistance
to transfer our cash of twenty one Million Dollars
($21,000.000.00) now in the custody of a private
Security trust firm in Europe the money is in trunk
boxes deposited and declared as family valuables by my
late father as a matter of fact the company does not
know the content as money, although my father made
them to under stand that the boxes belongs to his
foreign partner.
…
This mail is probably spam. The original message has been
attached along with this report, so you can recognize or block
similar unwanted mail in future. See
http://spamassassin.org/tag/ for more details.
Content analysis details: (12.20 points, 5 required)
NIGERIAN_SUBJECT2    (1.4 points) Subject is indicative of a Nigerian spam
FROM_ENDS_IN_NUMS    (0.7 points) From: ends in numbers
MIME_BOUND_MANY_HEX  (2.9 points) Spam tool pattern in MIME boundary
URGENT_BIZ           (2.7 points) BODY: Contains urgent matter
US_DOLLARS_3         (1.5 points) BODY: Nigerian scam key phrase ($NN,NNN,NNN.NN)
DEAR_SOMETHING       (1.8 points) BODY: Contains 'Dear (something)'
BAYES_30             (1.6 points) BODY: Bayesian classifier says spam probability is 30 to 40% [score: 0.3728]
Advantages/Disadvantages of Naïve Bayes
• Advantages:
  – Fast to train (single scan). Fast to classify.
  – Not sensitive to irrelevant features
  – Handles real and discrete data
  – Handles streaming data well
• Disadvantages:
  – Assumes independence of features
Summary of Classification
We have seen 4 major classification techniques:
• Simple linear classifier, Nearest neighbor, Naïve Bayes, Decision tree.
There are other techniques:
• Neural Networks, Support Vector Machines, Genetic algorithms..
In general, there is no one best classifier for all problems. You have to
consider what you hope to achieve, and the data itself…
Let us now move on to the other classic problem of data mining
and machine learning, Clustering…
What is Clustering?
Also called unsupervised learning; it is sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing.
• Organizing data into classes such that there is
• high intra-class similarity
• low inter-class similarity
• Finding the class labels and the number of classes directly
from the data (in contrast to classification).
• More informally, finding natural groupings among objects.
What is a natural grouping among these objects?
Clustering is subjective
Simpson's Family
School Employees
Females
Males
What is Similarity?
The quality or state of being similar; likeness; resemblance; as, a similarity of features.
Webster's Dictionary
Similarity is hard
to define, but…
“We know it when
we see it”
The real meaning
of similarity is a
philosophical
question. We will
take a more
pragmatic
approach.
Defining Distance Measures
Definition: Let O1 and O2 be two objects from the
universe of possible objects. The distance (dissimilarity)
between O1 and O2 is a real number denoted by D(O1,O2)
[Illustration: a black-box distance function returns a number, e.g. 0.23, 3 or 342.7, for a pair of objects such as Peter and Piotr]
Peter vs. Piotr:
d('', '') = 0
d(s, '') = d('', s) = |s|   -- i.e. the length of s
d(s1+ch1, s2+ch2) = min( d(s1, s2) + (if ch1 = ch2 then 0 else 1),
                         d(s1+ch1, s2) + 1,
                         d(s1, s2+ch2) + 1 )
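The edit-distance recurrence can be implemented directly with dynamic programming. A minimal sketch, using unit cost for each operation as the slides assume:

```python
def edit_distance(s1, s2):
    """Dynamic-programming form of the recurrence: unit cost for
    substitution, insertion and deletion."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # d(s, '') = |s|
    for j in range(n + 1):
        d[0][j] = j                      # d('', s) = |s|
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitute (or match)
                          d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1)         # insert
    return d[m][n]

print(edit_distance("Peter", "Piotr"))  # 3, matching the slides
```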
When we peek inside one of these black boxes, we see some function on two variables. These functions might be very simple or very complex.
In either case it is natural to ask, what properties should these functions have?
What properties should a distance measure have?
• D(A,B) = D(B,A)                 Symmetry
• D(A,A) = 0                      Constancy of Self-Similarity
• D(A,B) = 0 iff A = B            Positivity (Separation)
• D(A,B) ≤ D(A,C) + D(B,C)        Triangular Inequality
Intuitions behind desirable
distance measure properties
D(A,B) = D(B,A)
Symmetry
Otherwise you could claim “Alex looks like Bob, but Bob looks nothing like Alex.”
D(A,A) = 0
Constancy of Self-Similarity
Otherwise you could claim “Alex looks more like Bob, than Bob does.”
D(A,B) = 0 iff A = B
Positivity (Separation)
Otherwise there are objects in your world that are different, but you cannot tell apart.
D(A,B) ≤ D(A,C) + D(B,C)
Triangular Inequality
Otherwise you could claim “Alex is very like Bob, and Alex is very like Carl, but Bob
is very unlike Carl.”
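These four properties can be checked mechanically on any finite set of objects. A brute-force sketch (the example objects and distance functions are hypothetical):

```python
import itertools

def is_metric(objects, D, tol=1e-9):
    """Brute-force check of the four distance properties on a finite set."""
    for a, b in itertools.product(objects, repeat=2):
        if abs(D(a, b) - D(b, a)) > tol:         # Symmetry
            return False
        if (D(a, b) <= tol) != (a == b):         # Self-similarity / Positivity
            return False
    for a, b, c in itertools.product(objects, repeat=3):
        if D(a, b) > D(a, c) + D(b, c) + tol:    # Triangular inequality
            return False
    return True

print(is_metric([1.0, 2.5, 7.0], lambda x, y: abs(x - y)))    # True
print(is_metric([0.0, 1.0, 2.0], lambda x, y: (x - y) ** 2))  # False
```

The second call fails because squared difference violates the triangular inequality: D(0,2) = 4 exceeds D(0,1) + D(1,2) = 2.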
Two Types of Clustering
• Partitional algorithms: Construct various partitions and then
evaluate them by some criterion (we will see an example called BIRCH)
• Hierarchical algorithms: Create a hierarchical decomposition of
the set of objects using some criterion
Hierarchical
Partitional
Desirable Properties of a Clustering Algorithm
• Scalability (in terms of both time and space)
• Ability to deal with different data types
• Minimal requirements for domain knowledge to
determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• Incorporation of user-specified constraints
• Interpretability and usability
A Useful Tool for Summarizing Similarity Measurements
In order to better appreciate and evaluate the examples given in the
early part of this talk, we will now introduce the dendrogram.
Terminal Branch
Root
Internal Branch
Internal Node
Leaf
The similarity between two objects in a
dendrogram is represented as the height of
the lowest internal node they share.
There is only one dataset that can be
perfectly clustered using a hierarchy…
(Bovine:0.69395, (Spider Monkey 0.390, (Gibbon:0.36079,(Orang:0.33636,(Gorilla:0.17147,(Chimp:0.19268,
Human:0.11927):0.08386):0.06124):0.15057):0.54939);
Note that hierarchies are commonly used to organize information, for example in a web portal. Yahoo's hierarchy is manually created; we will focus on automatic creation of hierarchies in data mining.

Business & Economy
  B2B: Aerospace, Agriculture…
  Finance: Banking, Bonds…
  Shopping: Animals, Apparel…
  Jobs: Career, Workspace…
A Demonstration of Hierarchical Clustering using String Edit Distance
Pedro (Portuguese)
Petros (Greek), Peter (English), Piotr (Polish), Peadar
(Irish), Pierre (French), Peder (Danish), Peka
(Hawaiian), Pietro (Italian), Piero (Italian Alternative),
Petr (Czech), Pyotr (Russian)
Cristovao (Portuguese)
Christoph (German), Christophe (French), Cristobal
(Spanish), Cristoforo (Italian), Kristoffer
(Scandinavian), Krystof (Czech), Christopher (English)
Miguel (Portuguese)
Michalis (Greek), Michael (English), Mick (Irish!)
Hierarchical clustering can sometimes show patterns that are meaningless or spurious.
• For example, in this clustering, the tight grouping of Australia, Anguilla, St. Helena etc. is meaningful, since all these countries are former UK colonies.
• However the tight grouping of Niger and India is completely spurious; there is no connection between the two.
AUSTRALIA
St. Helena &
Dependencies
ANGUILLA
South Georgia &
South Sandwich
Islands
U.K.
Serbia &
Montenegro
(Yugoslavia)
FRANCE
NIGER
INDIA
IRELAND
BRAZIL
• The flag of Niger is orange over white over green, with an orange disc on the central white stripe, symbolizing the sun. The orange stands for the Sahara desert, which borders Niger to the north. Green stands for the grassy plains of the south and west and for the River Niger which sustains them. It also stands for fraternity and hope. White generally symbolizes purity and hope.
• The Indian flag is a horizontal tricolor in equal proportion of deep saffron on the top, white in the middle and dark green at the bottom. In the center of the white band, there is a wheel in navy blue to indicate the Dharma Chakra, the wheel of law in the Sarnath Lion Capital. This center symbol or 'CHAKRA' dates back to the 2nd century BC. The saffron stands for courage and sacrifice; the white, for purity and truth; the green for growth and auspiciousness.
We can look at the dendrogram to determine the “correct” number of
clusters. In this case, the two highly separated subtrees are highly
suggestive of two clusters. (Things are rarely this clear cut, unfortunately)
One potential use of a dendrogram is to detect outliers
The single isolated branch is suggestive of a
data point that is very different to all others
Outlier
(How-to) Hierarchical Clustering
The number of dendrograms with n leaves = (2n − 3)! / [2^(n−2) (n−2)!]

Number of Leaves   Number of Possible Dendrograms
2                  1
3                  3
4                  15
5                  105
…                  …
10                 34,459,425
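The counts in the table follow directly from the formula; a short sketch using exact integer arithmetic:

```python
from math import factorial

def num_dendrograms(n):
    """(2n - 3)! / (2^(n-2) * (n-2)!): the number of rooted binary
    dendrograms on n labeled leaves, for n >= 2."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (2, 3, 4, 5, 10):
    print(n, num_dendrograms(n))
```

The growth is super-exponential, which is why the slides turn to heuristic search next.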
Since we cannot test all possible trees, we will have to do a heuristic search over the space of possible trees. We could do this:
Bottom-Up (agglomerative): Starting
with each item in its own cluster, find
the best pair to merge into a new
cluster. Repeat until all clusters are
fused together.
Top-Down (divisive): Starting with all
the data in a single cluster, consider
every possible way to divide the cluster
into two. Choose the best division and
recursively operate on both sides.
We begin with a distance
matrix which contains the
distances between every pair
of objects in our database.
D( , ) = 8
D( , ) = 1

  0  8  8  7  7
     0  2  4  4
        0  3  3
           0  1
              0
Bottom-Up (agglomerative):
Starting with each item in its own
cluster, find the best pair to merge into
a new cluster. Repeat until all clusters
are fused together.
Consider all
possible
merges…
…
Choose
the best
We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or defining the distance between two clusters, is non-obvious.
• Single linkage (nearest neighbor): In this method the distance
between two clusters is determined by the distance of the two closest objects
(nearest neighbors) in the different clusters.
• Complete linkage (furthest neighbor): In this method, the
distances between clusters are determined by the greatest distance between any
two objects in the different clusters (i.e., by the "furthest neighbors").
• Group average linkage: In this method, the distance between two
clusters is calculated as the average distance between all pairs of objects in the
two different clusters.
• Ward's Linkage: In this method, we try to minimize the variance of the merged clusters.
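The first three linkage rules can be sketched directly for clusters of 1-D points, using absolute difference as the object-level distance. The data here is made up, and Ward's variance-based rule is omitted for brevity:

```python
def single_linkage(A, B):
    """Distance of the two closest objects across the clusters."""
    return min(abs(a - b) for a in A for b in B)

def complete_linkage(A, B):
    """Greatest distance between any two objects across the clusters."""
    return max(abs(a - b) for a in A for b in B)

def group_average_linkage(A, B):
    """Average distance over all cross-cluster pairs."""
    return sum(abs(a - b) for a in A for b in B) / (len(A) * len(B))

A, B = [1.0, 2.0], [5.0, 9.0]
print(single_linkage(A, B), complete_linkage(A, B), group_average_linkage(A, B))
```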
[Figure: dendrograms of the same 30 objects under Single linkage, Average linkage, and Wards linkage; the leaf orderings and merge heights differ with the choice of linkage]
Summary of Hierarchical Clustering Methods
• No need to specify the number of clusters in advance.
• The hierarchical nature maps nicely onto human intuition for some domains.
• They do not scale well: time complexity of at least O(n^2), where n is the number of total objects.
• As with any heuristic search algorithm, local optima are a problem.
• Interpretation of results is (very) subjective.
Up to this point we have simply assumed that we can measure
similarity, but
How do we measure similarity?
A generic technique for measuring similarity
To measure the similarity between two objects,
transform one of the objects into the other, and
measure how much effort it took. The measure
of effort becomes the distance measure.
The distance between Patty and Selma:
  Change dress color,    1 point
  Change earring shape,  1 point
  Change hair part,      1 point
D(Patty, Selma) = 3

The distance between Marge and Selma:
  Change dress color,  1 point
  Add earrings,        1 point
  Decrease height,     1 point
  Take up smoking,     1 point
  Lose weight,         1 point
D(Marge, Selma) = 5

This is called the "edit distance" or the "transformation distance".
Edit Distance Example
It is possible to transform any string Q into string C, using only Substitution, Insertion and Deletion. Assume that each of these operators has a cost associated with it.

How similar are the names "Peter" and "Piotr"? Assume the following cost function:
  Substitution  1 Unit
  Insertion     1 Unit
  Deletion      1 Unit
Then D(Peter, Piotr) is 3.

The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C. (Note that for now we have ignored the issue of how we can find this cheapest transformation.)

Peter
  Substitution (i for e)
Piter
  Insertion (o)
Pioter
  Deletion (e)
Piotr
Partitional Clustering
• Nonhierarchical, each instance is placed in
exactly one of K nonoverlapping clusters.
• Since only one set of clusters is output, the user
normally has to input the desired number of
clusters K.
Squared Error
[Figure: points on a 10 x 10 grid; the Objective Function is the sum of squared distances from each point to its assigned cluster center]
Algorithm k-means
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if
necessary).
3. Decide the class memberships of the N objects by
assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the
memberships found above are correct.
5. If none of the N objects changed membership in
the last iteration, exit. Otherwise goto 3.
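The five steps above can be sketched for 1-D data as follows. The dataset is hypothetical; a real implementation would work on vectors and use smarter initialization:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Sketch of the five k-means steps, for 1-D points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # step 2: initialize randomly
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                         # step 3: assign to nearest center
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centers[i]   # step 4: re-estimate centers
               for i, c in enumerate(clusters)]
        if new == centers:                       # step 5: nothing changed -> exit
            break
        centers = new
    return sorted(centers)

print(kmeans([1.0, 1.2, 0.8, 10.0, 10.4, 9.6], k=2))
```

On this well-separated toy data the centers converge to roughly 1.0 and 10.0 regardless of which two points are drawn as the initial centers.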
K-means Clustering: Steps 1–5
Algorithm: k-means, Distance Metric: Euclidean Distance
[Five figures: points plotted by expression in condition 1 (x-axis) against expression in condition 2 (y-axis); the three cluster centers k1, k2, k3 move at each iteration as points are reassigned, until membership stops changing]
Comments on the K-Means Method
• Strength
– Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
– Often terminates at a local optimum. The global optimum may
be found using techniques such as: deterministic annealing and
genetic algorithms
• Weakness
– Applicable only when mean is defined, then what about
categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids, 1987)
– starts from an initial set of medoids and iteratively replaces
one of the medoids by one of the non-medoids if it improves
the total distance of the resulting clustering
– PAM works effectively for small data sets, but does not scale
well for large data sets
EM Algorithm
• Initialize K cluster centers
• Iterate between two steps
– Expectation step: assign points to clusters
    P(di ∈ ck) = wk Pr(di | ck) / Σj wj Pr(di | cj)
    wk = Σi P(di ∈ ck) / N
– Maximization step: estimate model parameters
    μk = (1/m) Σi=1..m [ di P(di ∈ ck) / Σj P(di ∈ cj) ]
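The two EM steps can be sketched for a 1-D mixture of two Gaussians. This is a deliberately simplified setting with fixed, equal variances and equal weights, and the data and starting means are made up:

```python
import math

def em_gmm_1d(data, means, sigma=1.0, iters=25):
    """Sketch of the two EM steps for a two-component 1-D Gaussian
    mixture with fixed variance and equal weights (wk = 0.5)."""
    mu = list(means)
    for _ in range(iters):
        # Expectation step: responsibilities P(di in ck)
        resp = []
        for d in data:
            p = [0.5 * math.exp(-(d - m) ** 2 / (2 * sigma ** 2)) for m in mu]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # Maximization step: re-estimate the two cluster means
        for k in range(2):
            num = sum(d * r[k] for d, r in zip(data, resp))
            den = sum(r[k] for r in resp)
            mu[k] = num / den
    return sorted(mu)

print(em_gmm_1d([0.9, 1.1, 1.0, 4.9, 5.1, 5.0], means=(0.0, 6.0)))
```

Unlike k-means, each point is assigned to every cluster with some probability; the hard assignment is the special case where responsibilities are 0 or 1.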
Iteration 1: the cluster means are randomly assigned
Iteration 2 … Iteration 5 … Iteration 25
[Figures showing the clusters at each of these iterations]
What happens if the data is streaming…
Nearest Neighbor Clustering
Not to be confused with Nearest Neighbor Classification
• Items are iteratively merged into the
existing clusters that are closest.
• Incremental
• Threshold, t, used to determine if items are
added to existing clusters or a new cluster is
created.
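The incremental scheme above can be sketched for 1-D data. The stream and threshold below are made up; note how the output would change if the same points arrived in a different order:

```python
def nn_cluster(stream, t):
    """Assign each arriving point to the nearest existing cluster if it is
    within threshold t, else start a new cluster. Cluster centers are
    maintained as the mean of the members."""
    clusters = []                      # each entry: [center, members]
    for x in stream:
        if clusters:
            best = min(clusters, key=lambda c: abs(x - c[0]))
            if abs(x - best[0]) <= t:
                best[1].append(x)
                best[0] = sum(best[1]) / len(best[1])   # update the center
                continue
        clusters.append([x, [x]])
    return [members for _, members in clusters]

print(nn_cluster([1.0, 1.2, 8.0, 0.8, 8.4], t=2.0))
```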
[Figure: 10 x 10 grid with two clusters, labeled 1 and 2; a circle of radius t around cluster 1 shows the threshold]
New data point arrives…
It is within the threshold for cluster 1, so add it to the cluster, and update the cluster center.
[Figure: point 3 is absorbed into cluster 1]
New data point arrives…
It is not within the threshold for cluster 1, so create a new cluster, and so on..
[Figure: point 4 starts a new cluster]
The algorithm is highly order dependent…
It is difficult to determine t in advance…
Partitional Clustering Algorithms
• Clustering algorithms have been designed to handle
very large datasets
• E.g. the Birch algorithm
• Main idea: use an in-memory R-tree to store points that are
being clustered
• Insert points one at a time into the R-tree, merging a new point with an existing cluster if it is less than some distance away
• If there are more leaf nodes than fit in memory, merge
existing clusters that are close to each other
• At the end of the first pass we get a large number of clusters at the leaves of the R-tree
• Merge clusters to reduce the number of clusters
Partitional Clustering Algorithms
We need to specify the number of clusters in advance; I have chosen 2.
• The Birch algorithm
R10
R11
R10 R11 R12
R1 R2 R3
R12
R4 R5 R6
R7 R8 R9
Data nodes containing points
Partitional Clustering Algorithms
• The Birch algorithm
R10
R11
R10 R11 R12
{R1,R2} R3 R4 R5 R6
R12
R7 R8 R9
Data nodes containing points
Partitional Clustering Algorithms
• The Birch algorithm
R10
R11
R12
How can we tell the right number of clusters?
In general, this is an unsolved problem. However there are many approximate methods. In the next few slides we will see an example.
[Figure: unlabeled points on a 10 x 10 grid]
For our example, we will use the familiar katydid/grasshopper dataset. However, in this case we are imagining that we do NOT know the class labels. We are only clustering on the X and Y axis values.
When k = 1, the objective function is 873.0
When k = 2, the objective function is 173.1
When k = 3, the objective function is 133.6
We can plot the objective function values for k equals 1 to 6…
The abrupt change at k = 2, is highly suggestive of two clusters
in the data. This technique for determining the number of
clusters is known as “knee finding” or “elbow finding”.
[Plot: Objective Function (0.00E+00 to 1.00E+03) against k = 1 to 6, with an abrupt flattening after k = 2]
Note that the results are not always as clear cut as in this toy example
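Knee/elbow finding can be sketched on a toy 1-D dataset. To keep the sketch self-contained, the best centers for each k are found by brute force over the data points rather than by actually running k-means (an assumption; in practice you would run k-means for each k):

```python
from itertools import combinations

def objective(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)

def best_objective(points, k):
    """Best objective when the k centers are chosen among the points."""
    return min(objective(points, cs) for cs in combinations(points, k))

data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
curve = [best_objective(data, k) for k in (1, 2, 3)]
drops = [curve[i] - curve[i + 1] for i in range(len(curve) - 1)]
print(curve)  # the drop from k = 1 to k = 2 dwarfs the drop from k = 2 to k = 3
```

The "elbow" is the k after which further increases buy only marginal reductions in the objective.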
Conclusions
• We have learned about the 3 major data
mining/machine learning algorithms.
• Almost all data mining research is in these 3 areas, or is
a minor extension of one or more of them.
• For further study, I recommend:
• Proceedings of SIGKDD, IEEE ICDM, SIAM SDM
• Data Mining: Concepts and Techniques (Jiawei Han and Micheline Kamber)
• Data Mining: Introductory and Advanced Topics (Margaret Dunham)
Obrigado!
Thanks to the conference organizers and
many Brazilians who helped me, especially
Altigran S. da Silva, Sergio Lifschitz, Nina
Edelweis, Karin Becker, Carlos A. Heuser,
and Mirella Moura Moro.