COMP 290-090 Data Mining: Concepts, Algorithms, and Applications 2

Download Report

Transcript COMP 290-090 Data Mining: Concepts, Algorithms, and Applications 2

K-means
Arbitrarily choose k objects as the initial
cluster centers
Until no change, do
(Re)assign each object to the cluster to which
the object is the most similar, based on the
mean value of the objects in the cluster
Update the cluster means, i.e., calculate the
mean value of the objects for each cluster
1
COMP 290-090 Data Mining: Concepts, Algorithms, and Applications
K-Means: Example
10
10
9
9
8
8
7
7
6
6
5
5
10
9
8
7
6
5
4
4
3
2
1
0
0
1
2
3
4
5
6
7
8
K=2
Arbitrarily choose K
object as initial
cluster center
9
10
Assign
each
objects
to most
similar
center
3
2
1
0
0
1
2
3
4
5
6
7
8
9
10
4
3
2
1
0
0
1
2
3
4
5
6
reassign
10
10
9
9
8
8
7
7
6
6
5
5
4
2
1
0
0
1
2
3
4
5
6
7
8
7
8
9
10
reassign
3
2
Update
the
cluster
means
9
10
Update
the
cluster
means
4
3
2
1
0
0
1
2
3
4
5
6
7
8
9
10
COMP 290-090 Data Mining: Concepts, Algorithms, and Applications
Pros and Cons of K-means
Relatively efficient: O(tkn)
n: # objects, k: # clusters, t: # iterations; k, t << n.
Often terminate at a local optimum
Applicable only when mean is defined
What about categorical data?
Need to specify the number of clusters
Unable to handle noisy data and outliers
unsuitable to discover non-convex clusters
3
COMP 290-090 Data Mining: Concepts, Algorithms, and Applications
Variations of the K-means
Aspects of variations
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Handling categorical data: k-modes
Use mode instead of mean
Mode: the most frequent item(s)
A mixture of categorical and numerical data: kprototype method
4
COMP 290-090 Data Mining: Concepts, Algorithms, and Applications
A Problem of K-means
+
+
Sensitive to outliers
Outlier: objects with extremely large values
May substantially distort the distribution of the data
K-medoids: the most centrally located
object in a cluster
10
10
9
9
8
8
7
7
6
6
5
5
4
4
3
3
2
2
1
1
0
0
0
5
1
2
3
4
5
6
7
8
9
10
0
1
2
3
4
5
6
7
8
9
10
COMP 290-090 Data Mining: Concepts, Algorithms, and Applications