Clustering revision (Falguni Negandhi)
CLUSTERING
Revision
Spring 2007
SJSU
Falguni Negandhi
Overview
Definition
Main Features
Types of Clustering
Some Clustering Approaches
Why we use Clustering
Some methods of Clustering
Conclusion
References
Definition
Clustering is “the process of organizing objects into
groups whose members are similar in some way”.
A cluster is therefore a collection of objects that are
“similar” to one another and “dissimilar” to the
objects belonging to other clusters.
Clustering Main Features
Clustering – a data mining technique
Usage:
Statistical Data Analysis
Machine Learning
Data Mining
Pattern Recognition
Image Analysis
Bioinformatics
Types of Clustering
Hierarchical – finding new clusters using previously found ones
Partitional – finding all clusters at once
Self-Organizing Maps
Hybrids (incremental)
Some Clustering Approaches
Clustering
Hierarchical
Agglomerative
Partitional
Divisive
Categorical
Sampling
Large DB
Compression
Why clustering?
A few good reasons ...
Simplifications
Pattern detection
Useful in data concept construction
Unsupervised learning process
Where to use clustering?
Data mining
Information retrieval
Text mining
Web analysis
Marketing
Medical diagnostics
Major Existing clustering methods
Distance-based
Hierarchical
Partitioning
Probabilistic
Distance based method
Here the similarity criterion is distance: two or more objects belong
to the same cluster if they are “close” according to a given distance.
This is called distance-based clustering.
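As a minimal sketch of the distance criterion, the function below greedily groups points whose pairwise Euclidean distance stays under a threshold; the `max_dist` parameter and the sample points are illustrative, not from the slides.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def threshold_clusters(points, max_dist):
    """Greedy distance-based grouping (illustrative): a point joins
    the first cluster whose members are all within max_dist of it,
    otherwise it starts a new cluster."""
    clusters = []
    for p in points:
        for cluster in clusters:
            if all(euclidean(p, q) <= max_dist for q in cluster):
                cluster.append(p)
                break
        else:
            clusters.append([p])
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(threshold_clusters(points, max_dist=2.0))
# → [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
```

The two well-separated pairs end up in two clusters because only points “close” under the chosen distance share a cluster.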
Hierarchical clustering
Agglomerative (bottom up)
1. Start with each point as a singleton cluster.
2. Recursively merge the two (or more) most appropriate clusters.
3. Stop when k clusters are reached.
Divisive (top down)
1. Start with one big cluster.
2. Recursively divide it into smaller clusters.
3. Stop when k clusters are reached.
Partitioning clustering
1. Divide the data into k subsets.
2. Recursively go through each subset and relocate points between
clusters (in contrast to the visit-once approach of hierarchical
clustering).
This recursive relocation yields higher-quality clusters.
Steps of hierarchical clustering
Given a set of N items to be clustered, and an N*N
distance (or similarity) matrix, the basic process of
hierarchical clustering (defined by S.C. Johnson in 1967)
is this:
Start by assigning each item to its own cluster, so that if you
have N items, you now have N clusters, each containing
just one item. Let the distances (similarities) between
the clusters be the same as the distances (similarities)
between the items they contain.
Find the closest (most similar) pair of clusters and merge
them into a single cluster, so that you now have one
cluster less.
Compute distances (similarities) between the new cluster
and each of the old clusters.
Repeat steps 2 and 3 until all items are clustered into K
clusters.
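The procedure above can be sketched directly from a distance matrix. This is a minimal single-linkage version (the slides do not fix a linkage rule, so single linkage is an assumption): merge the closest pair of clusters until k remain.

```python
def agglomerative(dist, k):
    """Agglomerative hierarchical clustering sketch (single linkage
    assumed): start with one cluster per item, repeatedly merge the
    closest pair of clusters, stop at k clusters.
    dist is an N x N distance matrix."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # single linkage: distance between the closest members
        return min(dist[i][j] for i in a for j in b)

    while len(clusters) > k:
        # find the closest pair of clusters
        pairs = [(linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        clusters[i] = clusters[i] + clusters[j]   # merge the pair
        del clusters[j]
    return clusters

# items 0,1 are close to each other, as are items 2,3
dist = [[0, 1, 9, 8],
        [1, 0, 8, 9],
        [9, 8, 0, 2],
        [8, 9, 2, 0]]
print(agglomerative(dist, 2))  # → [[0, 1], [2, 3]]
```

Step 1 of the slide is the `clusters` initialization; steps 2–3 are the merge loop; the `len(clusters) > k` test is the stopping rule.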
Hierarchical algorithms
Agglomerative algorithms begin with each
element as a separate cluster and merge
them into successively larger clusters.
Divisive algorithms begin with the whole
set and proceed to divide it into
successively smaller clusters.
Hierarchical agglomerative general
algorithm
1. Find the two closest objects and merge them into a cluster.
2. Find and merge the next two closest points, where a point is
either an individual object or a cluster of objects.
3. If more than one cluster remains, return to step 2.
Probabilistic clustering
1. Data are drawn from a mixture of probability distributions.
2. Use the mean and variance of each distribution as the
cluster's parameters.
3. Each point has a single cluster membership.
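A minimal sketch of the single-membership rule: given already-estimated mixture parameters (the two components below are assumed for illustration), assign each point to the component whose density at that point is highest.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def assign(points, params):
    """Hard (single-membership) assignment: each point goes to the
    mixture component whose density at that point is highest.
    params is a list of (mean, variance) pairs, assumed already
    estimated for each cluster."""
    labels = []
    for x in points:
        densities = [gaussian_pdf(x, m, v) for m, v in params]
        labels.append(densities.index(max(densities)))
    return labels

points = [0.1, -0.2, 4.9, 5.3]
params = [(0.0, 1.0), (5.0, 1.0)]   # two illustrative components
print(assign(points, params))  # → [0, 0, 1, 1]
```

Estimating the means and variances themselves would normally be done with an EM procedure, which the slides do not detail.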
K-Clustering
K-clustering algorithm
Result: Given the input set S and a fixed integer k, a
partition of S into k subsets must be returned.
K-means clustering is the most common partitioning
algorithm.
K-Means re-assigns each record in the dataset to
only one of the new clusters formed. A record or data
point is assigned to the nearest cluster (the cluster
it is most similar to) using a measure of
distance or similarity.
K-clustering algorithm (cont'd)
1. Select k initial cluster centroids, c1, c2,
c3, ..., ck.
2. Assign each instance x in S to the
cluster whose centroid is nearest to x.
3. For each cluster, re-compute its centroid
based on the elements it contains.
4. Go to (2) until convergence is achieved.
K-Means Clustering
Separate the objects (data points) into K clusters.
Cluster center (centroid) = the average of all the data
points in the cluster.
Assign each data point to the cluster whose centroid is
nearest (using a distance function).
K-Means Algorithm
1. Place K points into the space of the objects
being clustered. They represent the initial
group centroids.
2. Assign each object to the group that has the
closest centroid.
3. Recalculate the positions of the K centroids.
4. Repeat Steps 2 & 3 until the group centroids
no longer move.
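The four steps above can be sketched in plain Python; the sample points and the choice of squared Euclidean distance are illustrative.

```python
import random

def kmeans(points, k, iters=100):
    """Plain K-means sketch: pick k initial centroids from the data,
    then alternate assignment (step 2) and centroid recomputation
    (step 3) until the centroids stop moving (step 4)."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        # step 2: assign each point to its nearest centroid
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            groups[nearest].append(p)
        # step 3: recompute each centroid as the mean of its group
        new_centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)]
        if new_centroids == centroids:   # step 4: converged
            break
        centroids = new_centroids
    return centroids, groups

random.seed(0)
pts = [(0.0, 0.0), (0.0, 1.0), (9.0, 9.0), (9.0, 10.0)]
centroids, groups = kmeans(pts, 2)
print(sorted(centroids))  # → [(0.0, 0.5), (9.0, 9.5)]
```

Note that the result of K-means depends on the initial centroids; practical implementations typically restart from several random initializations and keep the best result.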
K-Means Algorithm: Example Output (figure omitted)
Conclusion
Very useful in data mining
Applicable to both text and graphical data
Helps simplify data complexity
Classification
Detects hidden patterns in data
References
Dr. M.H. Dunham - http://engr.smu.edu/~mhd/dmbook/part2.ppt.
Dr. Lee, Sin-Min – San Jose State University
Mu-Yu Lu, SJSU
Database System Concepts, Silberschatz, Korth, Sudarshan
Wikipedia: http://en.wikipedia.org/wiki/Data_clustering#Types_of_clustering
Enrique Blanco Garcia: http://genome.imim.es/~eblanco/seminars/docs/clustering/index_types.html#hierarchy
Data Clustering - Wikipedia
http://en.wikipedia.org/wiki/Data_clustering
A Tutorial on Clustering Algorithms
http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/
The K-Means Data Clustering Problem
http://people.scs.fsu.edu/~burkardt/f_src/kmeans/kmeans.html