Clustering Theory

John Nicholas Owen
Sarah Smith
What is clustering?
 The activity of grouping similar objects.
 Clustering methods are useful for data
reduction, for developing classification
schemes and for suggesting or supporting
hypotheses about the structure of the data
Steps to Clustering
1. Pattern representation. The analyst identifies
the number, type, and scale of the features
available to the clustering algorithm.
2. Pattern proximity. Define a proximity measure
suited to the data domain; this is usually the
Euclidean distance.
3. Grouping, or clustering, of the data.
4. Data abstraction.
5. Assessment of output.
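
The proximity step can be illustrated with a minimal sketch of the Euclidean distance between two patterns. The function name `euclidean_distance` and the sample values are assumptions made for illustration, not part of the original slides.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two patterns described by two features each (hypothetical values).
print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # prints 5.0
```

Because each feature contributes its squared difference, features measured on larger scales dominate the result, which is one reason the scale of the features matters in the pattern representation step.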
Creating Clusters
 There are two basic approaches for
creating the clusters:
 Partitional
 Hierarchical
Partitional Theory
 The analyst evaluates and groups the data
using statistical algorithms
 The most popular methods of partitioning
include
   k-means
   Hierarchical agglomerative clustering (strictly a
  hierarchical method, covered under the agglomerative approach)
   Unsupervised Bayes
   Mode finding, or density-based
k-means Clustering
 Clusters are defined by measuring the
Euclidean distances between data points
 Requires the analyst to know something
about the underlying data
 The analyst needs to provide the number
of clusters to be formed. The software
then performs a four-step iterative
process to cluster the data.
Step 1
 Randomly choose an initial position for
each cluster center.
Step 2
 Assign each data point to its nearest
“center point”
Step 3
 Find the actual center (the mean of the
member points) of each of the new clusters
Step 4
 Move each centroid to its newly computed position
End State
 Repeat steps 2 to 4 until the cluster
assignments stop changing
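
The iterative process above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the routine any particular software package uses: the function name `k_means`, the `seed` and `max_iterations` parameters, and the sample points are hypothetical, and the initial centers are drawn from the data points themselves, which is only one common way of "randomly" placing them.

```python
import math
import random

def k_means(points, k, max_iterations=100, seed=0):
    """Cluster `points` (tuples of floats) into k groups; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 1 (assumption): pick k distinct data points as the initial centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest center (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # End state: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Steps 3 and 4: move each center to the mean of its member points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, labels

# Hypothetical data: two visually separate groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = k_means(points, k=2)
print(labels)
```

The first three points should share one label and the last three the other; which group is numbered 0 depends on the random starting centers.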
Hierarchical Theory
 Does not generate a set of disjoint
clusters
 Top-down (divisive) or bottom-up
(agglomerative) approach
 The bottom-up approach is more
common
Divisive Approach
 Generates a hierarchy of nested clusters that
can be represented by a tree, called a
dendrogram
 A dendrogram consists of many upside-down U-shaped
lines connecting data points in a hierarchical tree
 This method is favored by biologists because it
may give more insights into the structure of the
clusters than other methods
Dendrogram
Agglomerative Approach
 Each individual data point starts by being
alone in its own group
 The two groups closest to each other are
merged with one another
 This continues until all individual data
points are in one single group
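
The bottom-up procedure described above can be sketched as follows. The slides do not say how the distance between two groups is measured, so this sketch assumes single linkage (the distance between the closest pair of members); the function name `agglomerate` and the sample points are also hypothetical.

```python
import math

def agglomerate(points):
    """Bottom-up merging; returns the merges as (group_a, group_b, distance)."""
    # Every data point starts alone in its own group (groups hold point indices).
    groups = [[i] for i in range(len(points))]
    merges = []

    def group_distance(a, b):
        # Single linkage (an assumption): distance between the closest members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(groups) > 1:
        # Find the two groups that are closest to each other.
        a, b = min(
            ((x, y) for x in range(len(groups)) for y in range(x + 1, len(groups))),
            key=lambda pair: group_distance(groups[pair[0]], groups[pair[1]]),
        )
        merges.append((groups[a], groups[b], group_distance(groups[a], groups[b])))
        # Merge them; b > a, so removing b does not disturb index a.
        groups[a] = groups[a] + groups.pop(b)
    return merges

# Hypothetical data: two nearby pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]
for group_a, group_b, distance in agglomerate(points):
    print(group_a, group_b, round(distance, 2))
```

The recorded merges, read in order, are the information a dendrogram draws: each merge becomes one upside-down U whose height is the distance at which the two groups were joined.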
Agglomerative Clustering
 Step 1
 Step 2
Questions?