Unsupervised Learning: Clustering

Download Report

Transcript Unsupervised Learning: Clustering

Cluster Analysis
 What is Cluster Analysis?
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Grouping a set of data objects into clusters
 Clustering is unsupervised classification: no predefined
classes
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
General Applications of Clustering
 Pattern Recognition
 Spatial Data Analysis
 create thematic maps in GIS by clustering feature spaces
 detect spatial clusters and explain them in spatial data
mining
 Image Processing
 Economic Science (especially market research)
 WWW
 Document classification
 Cluster Weblog data to discover groups of similar access
patterns
Examples of Clustering Applications
 Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
 Land use: Identification of areas of similar land use in an
earth observation database
 Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
 City-planning: Identifying groups of houses according to
their house type, value, and geographical location
What Is Good Clustering?
 A good clustering method will produce high quality clusters
with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
Requirements of Clustering in Data Mining
 Scalability : work good on small sets only
 Ability to deal with different types of attributes
 Minimal requirements for domain knowledge to
determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Interpretability and usability
Cluster Analysis
 What is Cluster Analysis?
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
Major Clustering Approaches
 Partitioning algorithms: Construct various partitions and
then evaluate them by some criterion
 Hierarchy algorithms: Create a hierarchical decomposition of
the set of data (or objects) using some criterion
 Density-based: based on connectivity and density functions
 Model-based: A model is hypothesized for each of the
clusters and the idea is to find the best fit of that model to
each other
Data Mining: Concepts and Techniques
Data Mining: Concepts and Techniques
Cluster Analysis
 What is Cluster Analysis?
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
Partitioning Algorithms: Basic Concept
 Partitioning method: Construct a partition of a database D
of n objects into a set of k clusters
 Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
 k-means : Each cluster is represented by the center of the
cluster.
The K-Means Clustering Method
k-means algorithm is implemented in 5 steps:
 Step 1: Ask the user how many clusters k the data set should be
partitioned into.
 Step 2: Randomly assign k records to be the initial cluster center
locations.
 Step 3: For each record, find the nearest cluster center. Thus, in a sense,
each cluster center “owns” a subset of the records, thereby representing a
partition of the data set. We therefore have k clusters, C1,C2, . . . ,Ck .
 Step 4: For each of the k clusters, find the cluster centroid, and update the
location of each cluster center to the new value of the centroid.
 Step 5: Repeat steps 3 to 5 until convergence or termination.
The K-Means Clustering Method
 Example
10
10
9
9
8
8
7
7
6
6
5
5
4
4
3
3
2
2
1
1
0
0
0
1
2
3
4
5
6
7
8
9
10
0
10
10
9
9
8
8
7
7
6
6
5
5
4
4
3
3
2
2
1
1
0
1
2
3
4
5
6
7
8
9
10
0
0
1
2
3
4
5
6
7
8
9
10
0
1
2
3
4
5
6
7
8
9
10
Equations required
Euclidean : to calculate the nearest value to the center of
cluster.
Sum of squared errors
Data Mining: Concepts and Techniques
K Mean Steps
 Step 1: Ask the user how many clusters k the data set should be
partitioned into. We have already indicated that we are interested in k =
2 clusters.
 Step 2: Randomly assign k records to be the initial cluster center
locations. For this example, we assign the cluster centers to be m1 = (1,1)
and m2 = (2,1).
 Step 3: For each record, find the nearest cluster center.
 Step 4 : For each of the k clusters find the cluster centroid and update the
location of each cluster center to the new value of the centroid.
 Step 5: Repeat steps 3 and 4 until convergence or termination. The
centroids have moved, so we go back to step 3 for our second pass
through the algorithm.
Data Mining: Concepts and Techniques
Example
 Suppose that we have the eight data points in two-
dimensional space shown in the following table:
 lets say k = 2 clusters.
1  22  3  12
 5  2.23
1-Take c1=(1,1) and c2=(2,1) as initial
center points for the 2 clusters
2- calculate the distance between each point and the 2 centers
for example :Point a(1,3):
Distance (a,c1)=
Distance (a,c2)=
1  1  3  1
2
2
1  22  3  12
Data Mining: Concepts and Techniques
 4 2
 5  2.24
Example
Step 3 results:
Data Mining: Concepts and Techniques
Example
 Step 4 (first pass): For each of the k clusters find the cluster centroid
and update the location of each cluster center to the new value of
the centroid.
 Cluster1 points= {a,e,g} , Cluster 2 Points ={b,c,d,f,h}
 centroid for cluster 1 is [(1 + 1 + 1) /3, (3 + 2 + 1) /3] = (1,2).
 The centroid for cluster 2 is [(3 + 4 + 5 + 4 + 2) /5, (3 + 3 + 3 + 2 + 1)
/5] = (3.6, 2.4).
 Step 5: Repeat steps 3 and 4 until convergence or termination. The
centroids have moved, so we go back to step 3 for our second pass
through the algorithm.
Data Mining: Concepts and Techniques
Example
Since there is no change in the cluster
points , we stop here
Data Mining: Concepts and Techniques