Unsupervised Learning: Clustering
Download
Report
Transcript Unsupervised Learning: Clustering
Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined
classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
General Applications of Clustering
Pattern Recognition
Spatial Data Analysis
create thematic maps in GIS by clustering feature spaces
detect spatial clusters and explain them in spatial data
mining
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Weblog data to discover groups of similar access
patterns
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
Land use: Identification of areas of similar land use in an
earth observation database
Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
City-planning: Identifying groups of houses according to
their house type, value, and geographical location
What Is Good Clustering?
A good clustering method will produce high quality clusters
with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
Requirements of Clustering in Data Mining
Scalability : work good on small sets only
Ability to deal with different types of attributes
Minimal requirements for domain knowledge to
determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Interpretability and usability
Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Major Clustering Approaches
Partitioning algorithms: Construct various partitions and
then evaluate them by some criterion
Hierarchy algorithms: Create a hierarchical decomposition of
the set of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Model-based: A model is hypothesized for each of the
clusters and the idea is to find the best fit of that model to
each other
Data Mining: Concepts and Techniques
Data Mining: Concepts and Techniques
Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Partitioning Algorithms: Basic Concept
Partitioning method: Construct a partition of a database D
of n objects into a set of k clusters
Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
k-means : Each cluster is represented by the center of the
cluster.
The K-Means Clustering Method
k-means algorithm is implemented in 5 steps:
Step 1: Ask the user how many clusters k the data set should be
partitioned into.
Step 2: Randomly assign k records to be the initial cluster center
locations.
Step 3: For each record, find the nearest cluster center. Thus, in a sense,
each cluster center “owns” a subset of the records, thereby representing a
partition of the data set. We therefore have k clusters, C1,C2, . . . ,Ck .
Step 4: For each of the k clusters, find the cluster centroid, and update the
location of each cluster center to the new value of the centroid.
Step 5: Repeat steps 3 to 5 until convergence or termination.
The K-Means Clustering Method
Example
10
10
9
9
8
8
7
7
6
6
5
5
4
4
3
3
2
2
1
1
0
0
0
1
2
3
4
5
6
7
8
9
10
0
10
10
9
9
8
8
7
7
6
6
5
5
4
4
3
3
2
2
1
1
0
1
2
3
4
5
6
7
8
9
10
0
0
1
2
3
4
5
6
7
8
9
10
0
1
2
3
4
5
6
7
8
9
10
Equations required
Euclidean : to calculate the nearest value to the center of
cluster.
Sum of squared errors
Data Mining: Concepts and Techniques
K Mean Steps
Step 1: Ask the user how many clusters k the data set should be
partitioned into. We have already indicated that we are interested in k =
2 clusters.
Step 2: Randomly assign k records to be the initial cluster center
locations. For this example, we assign the cluster centers to be m1 = (1,1)
and m2 = (2,1).
Step 3: For each record, find the nearest cluster center.
Step 4 : For each of the k clusters find the cluster centroid and update the
location of each cluster center to the new value of the centroid.
Step 5: Repeat steps 3 and 4 until convergence or termination. The
centroids have moved, so we go back to step 3 for our second pass
through the algorithm.
Data Mining: Concepts and Techniques
Example
Suppose that we have the eight data points in two-
dimensional space shown in the following table:
lets say k = 2 clusters.
1 22 3 12
5 2.23
1-Take c1=(1,1) and c2=(2,1) as initial
center points for the 2 clusters
2- calculate the distance between each point and the 2 centers
for example :Point a(1,3):
Distance (a,c1)=
Distance (a,c2)=
1 1 3 1
2
2
1 22 3 12
Data Mining: Concepts and Techniques
4 2
5 2.24
Example
Step 3 results:
Data Mining: Concepts and Techniques
Example
Step 4 (first pass): For each of the k clusters find the cluster centroid
and update the location of each cluster center to the new value of
the centroid.
Cluster1 points= {a,e,g} , Cluster 2 Points ={b,c,d,f,h}
centroid for cluster 1 is [(1 + 1 + 1) /3, (3 + 2 + 1) /3] = (1,2).
The centroid for cluster 2 is [(3 + 4 + 5 + 4 + 2) /5, (3 + 3 + 3 + 2 + 1)
/5] = (3.6, 2.4).
Step 5: Repeat steps 3 and 4 until convergence or termination. The
centroids have moved, so we go back to step 3 for our second pass
through the algorithm.
Data Mining: Concepts and Techniques
Example
Since there is no change in the cluster
points , we stop here
Data Mining: Concepts and Techniques