Data Mining / Web Data Mining
Data Mining Techniques:
Clustering
Today
Clustering
Distance Measures
Graph-based Techniques
K-Means Clustering
Tools and Software for Clustering
Prediction, Clustering, Classification
What is Prediction?
The goal of prediction is to forecast or deduce the value of an attribute
based on values of other attributes
A model is first created based on the data distribution
The model is then used to predict future or unknown values
Supervised vs. Unsupervised Classification
Supervised Classification = Classification
We know the class labels and the number of classes
Unsupervised Classification = Clustering
We do not know the class labels and may not know the number of
classes
What is Clustering in Data Mining?
Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters
Helps users understand the natural grouping or structure in a data set
Cluster:
a collection of data objects that are "similar" to one another and thus can be treated collectively as one group
but as a collection, they are sufficiently different from other groups
Clustering
unsupervised classification
no predefined classes
Requirements of Clustering Methods
Scalability
Dealing with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain
knowledge to determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
Ability to handle high-dimensional data (the curse of dimensionality)
Interpretability and usability
Applications of Clustering
Clustering has wide applications in Pattern Recognition
Spatial Data Analysis:
create thematic maps in GIS by clustering feature spaces
detect spatial clusters and explain them in spatial data mining
Image Processing
Market Research
Information Retrieval
Document or term categorization
Information visualization and IR interfaces
Web Mining
Cluster Web usage data to discover groups of similar access patterns
Web Personalization
Clustering Methodologies
Two general methodologies
Partitioning Based Algorithms
Hierarchical Algorithms
Partitioning Based
divide a set of N items into K clusters (top-down)
Hierarchical
agglomerative: pairs of items or clusters are successively linked to produce
larger clusters
divisive: start with the whole set as a cluster and successively divide sets into
smaller partitions
Distance or Similarity Measures
Measuring Distance
In order to group similar items, we need a way to measure the distance
between objects (e.g., records)
Note: distance = inverse of similarity
Often based on the representation of objects as “feature vectors”
An Employee DB
ID  Gender  Age  Salary
1   F       27   19,000
2   M       51   64,000
3   M       52   100,000
4   F       33   55,000
5   M       45   45,000
Term Frequencies for Documents
      T1  T2  T3  T4  T5  T6
Doc1   0   4   0   0   0   2
Doc2   3   1   4   3   1   2
Doc3   3   0   0   0   3   0
Doc4   0   1   0   3   0   0
Doc5   2   2   2   3   1   4
Which objects are more similar?
Distance or Similarity Measures
Properties of Distance Measures:
for all objects A and B, dist(A, B) >= 0, and dist(A, B) = dist(B, A)
for any object A, dist(A, A) = 0
dist(A, C) <= dist(A, B) + dist(B, C)
Common Distance Measures (for vectors X = <x1, x2, ..., xn> and Y = <y1, y2, ..., yn>):
Manhattan distance: dist(X, Y) = |x1 - y1| + |x2 - y2| + ... + |xn - yn|
Euclidean distance: dist(X, Y) = sqrt((x1 - y1)^2 + ... + (xn - yn)^2)
Cosine similarity: sim(X, Y) = (sum_i xi*yi) / sqrt((sum_i xi^2)(sum_i yi^2)), with dist(X, Y) = 1 - sim(X, Y)
The Manhattan and Euclidean distances can be normalized to make values fall between 0 and 1.
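These measures are easy to sketch in code. A minimal Python/NumPy illustration (the slides do not prescribe a language or library) using two of the document vectors from the table above:

```python
import numpy as np

def manhattan(x, y):
    # dist(X, Y) = |x1 - y1| + ... + |xn - yn|
    return np.sum(np.abs(x - y))

def euclidean(x, y):
    # dist(X, Y) = sqrt((x1 - y1)^2 + ... + (xn - yn)^2)
    return np.sqrt(np.sum((x - y) ** 2))

def cosine_distance(x, y):
    # dist(X, Y) = 1 - sim(X, Y), where sim is the cosine of the angle between X and Y
    sim = np.dot(x, y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))
    return 1.0 - sim

# Term-frequency vectors for Doc1 and Doc5 (terms T1..T6 from the table above)
doc1 = np.array([0, 4, 0, 0, 0, 2], dtype=float)
doc5 = np.array([2, 2, 2, 3, 1, 4], dtype=float)
print(manhattan(doc1, doc5), euclidean(doc1, doc5), cosine_distance(doc1, doc5))
```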
Distance or Similarity Measures
Weighting Attributes
in some cases we want some attributes to count more than others
associate a weight with each of the attributes when calculating distance, e.g.,
dist(X, Y) = sqrt(w1*(x1 - y1)^2 + ... + wn*(xn - yn)^2)
Nominal (categorical) Attributes
can use simple matching: distance = 0 if values match, 1 otherwise
or convert each nominal attribute to a set of binary attributes, then use the usual distance measure
if all attributes are nominal, we can normalize by dividing the number of
matches by the total number of attributes
Normalization:
want values to fall between 0 and 1: x'i = (xi - min xi) / (max xi - min xi)
other variations are possible
Distance or Similarity Measures
Example
max distance for salary: 100,000 - 19,000 = 81,000
max distance for age: 52-27 = 25
Original values:

ID  Gender  Age  Salary
1   F       27   19,000
2   M       51   64,000
3   M       52   100,000
4   F       33   55,000
5   M       45   45,000

Normalized values, using x'i = (xi - min xi) / (max xi - min xi) and Gender coded as F = 1, M = 0:

ID  Gender  Age   Salary
1   1       0.00  0.00
2   0       0.96  0.56
3   0       1.00  1.00
4   1       0.24  0.44
5   0       0.72  0.32
dist(ID2, ID3) = SQRT( 0 + (0.04)2 + (0.44)2 ) = 0.44
dist(ID2, ID4) = SQRT( 1 + (0.72)2 + (0.12)2 ) = 1.24
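The same normalization and distance computation can be sketched in Python (NumPy and the F = 1 / M = 0 coding carried over from the table above are assumptions):

```python
import numpy as np

# Employee records as rows: Gender (F = 1, M = 0), Age, Salary
data = np.array([
    [1, 27,  19000],
    [0, 51,  64000],
    [0, 52, 100000],
    [1, 33,  55000],
    [0, 45,  45000],
], dtype=float)

# Min-max normalization per column: x' = (x - min) / (max - min)
mins, maxs = data.min(axis=0), data.max(axis=0)
norm = (data - mins) / (maxs - mins)

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

# Close to the slide's dist(ID2, ID3) = 0.44 and dist(ID2, ID4) = 1.24
# (tiny differences arise because the slide rounds the normalized values first)
print(round(euclidean(norm[1], norm[2]), 2))   # ~0.45
print(round(euclidean(norm[1], norm[3]), 2))   # 1.24
```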
Domain Specific Distance Functions
For some data sets, we may need to use specialized functions
we may want a single or a selected group of attributes to be used in the
computation of distance - same problem as “feature selection”
may want to use special properties of one or more attribute in the data
Example: Zip Codes
dist_zip(A, B) = 0, if zip codes are identical
dist_zip(A, B) = 0.1, if the first 3 digits are identical
dist_zip(A, B) = 0.5, if only the first digit is identical
dist_zip(A, B) = 1, if the first digits are different
natural distance functions may exist in the data
Example: Customer Solicitation
dist_solicit(A, B) = 0, if both A and B responded
dist_solicit(A, B) = 0.1, if both A and B were chosen but did not respond
dist_solicit(A, B) = 0.5, if both A and B were chosen, but only one responded
dist_solicit(A, B) = 1, if one was chosen, but the other was not
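As a small illustration, the zip-code distance above could be written as follows (the function name and the assumption of 5-digit string zip codes are mine, not the slides'):

```python
def dist_zip(zip_a: str, zip_b: str) -> float:
    """Domain-specific distance between two 5-digit zip code strings."""
    if zip_a == zip_b:
        return 0.0   # zip codes are identical
    if zip_a[:3] == zip_b[:3]:
        return 0.1   # first 3 digits are identical
    if zip_a[0] == zip_b[0]:
        return 0.5   # only the first digit is identical
    return 1.0       # first digits are different

print(dist_zip("60604", "60604"))   # 0.0
print(dist_zip("60604", "60614"))   # 0.1
print(dist_zip("60604", "62901"))   # 0.5
print(dist_zip("60604", "53202"))   # 1.0
```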
Distance (Similarity) Matrix
Similarity (Distance) Matrix
based on the distance or similarity measure, we can construct a symmetric matrix of distance (or similarity) values
the (i, j) entry in the matrix is the distance (similarity) between items i and j

        I1    I2    ...   In
  I1          d12   ...   d1n
  I2    d21         ...   d2n
  ...
  In    dn1   dn2   ...

where dij = similarity (or distance) of item i to item j
Note that dij = dji (i.e., the matrix is symmetric), so we only need the lower triangle part of the matrix.
The diagonal is all 1's (similarity) or all 0's (distance).
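For example, such a matrix can be built with SciPy's pairwise-distance utilities (the library choice is an assumption): pdist returns the condensed lower triangle, and squareform expands it into the full symmetric matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Items as rows (here, the normalized employee records), attributes as columns
items = np.array([
    [1, 0.00, 0.00],
    [0, 0.96, 0.56],
    [0, 1.00, 1.00],
    [1, 0.24, 0.44],
    [0, 0.72, 0.32],
])

# pdist computes the condensed lower triangle; squareform expands it into the
# full symmetric distance matrix, whose diagonal is all 0's
dist_matrix = squareform(pdist(items, metric="euclidean"))
print(np.round(dist_matrix, 2))
```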
Example: Term Similarities in Documents
      T1  T2  T3  T4  T5  T6  T7  T8
Doc1   0   4   0   0   0   2   1   3
Doc2   3   1   4   3   1   2   0   1
Doc3   3   0   0   0   3   0   3   0
Doc4   0   1   0   3   0   0   2   0
Doc5   2   2   2   3   1   4   0   2

sim(Ti, Tj) = sum over k = 1..N of (wik * wjk), where N is the number of documents
Term-Term Similarity Matrix (lower triangle):

      T1  T2  T3  T4  T5  T6  T7
T2     7
T3    16   8
T4    15  12  18
T5    14   3   6   6
T6    14  18  16  18   6
T7     9   6   0   6   9   2
T8     7  17   8   9   3  16   3
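Since sim(Ti, Tj) is just the dot product of the two term columns, the whole term-term matrix can be reproduced as D-transpose times D. A short NumPy sketch (library choice is an assumption):

```python
import numpy as np

# Document-term matrix: rows are Doc1..Doc5, columns are T1..T8
D = np.array([
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
])

# sim(Ti, Tj) = sum over documents k of w_ik * w_jk
S = D.T @ D
print(S[0, 2])   # sim(T1, T3) = 16
print(S[1, 5])   # sim(T2, T6) = 18
```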
Similarity (Distance) Thresholds
A similarity (distance) threshold may be used to mark pairs that are
“sufficiently” similar
Using a threshold value of 10 on the term-term similarity matrix from the previous example, pairs whose similarity exceeds the threshold are marked with 1:

      T1  T2  T3  T4  T5  T6  T7
T2     0
T3     1   0
T4     1   1   1
T5     1   0   0   0
T6     1   1   1   1   0
T7     0   0   0   0   0   0
T8     0   1   0   0   0   1   0
Graph Representation
The similarity matrix can be visualized as an undirected graph
each item is represented by a node, and edges represent the fact that two items
are similar (a one in the similarity threshold matrix)
the thresholded term-term matrix from the previous slide serves as the adjacency matrix of this graph
If no threshold is used, then the matrix can be represented as a weighted graph, with edge weights given by the similarity values.
[Figure: similarity graph over nodes T1-T8, with edges T1-T3, T1-T4, T1-T5, T1-T6, T2-T4, T2-T6, T2-T8, T3-T4, T3-T6, T4-T6, T6-T8; T7 is isolated]
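A sketch of building this thresholded similarity graph in Python, assuming NumPy and NetworkX (neither is prescribed by the slides):

```python
import numpy as np
import networkx as nx

# Document-term matrix from the earlier example (rows Doc1..Doc5, columns T1..T8)
D = np.array([
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
])
S = D.T @ D                                   # term-term similarity matrix
terms = [f"T{i}" for i in range(1, 9)]

# One node per term; an edge joins two terms whose similarity exceeds the threshold
G = nx.Graph()
G.add_nodes_from(terms)
for i in range(len(terms)):
    for j in range(i + 1, len(terms)):
        if S[i, j] > 10:
            G.add_edge(terms[i], terms[j], weight=int(S[i, j]))

print(sorted(G.edges()))
```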
Simple Clustering Algorithms
If we are interested only in the threshold (and not in the degree of similarity or distance), we can use the graph directly for clustering
Clique Method (complete link)
all items within a cluster must be within the similarity threshold of all other
items in that cluster
clusters may overlap
generally produces small but very tight clusters
Single Link Method
any item in a cluster must be within the similarity threshold of at least one
other item in that cluster
produces larger but weaker clusters
Other methods
star method - start with an item and place all related items in that cluster
string method - start with an item; place one related item in that cluster; then
place another item related to the last item entered, and so on
Simple Clustering Algorithms
Clique Method
a clique is a completely connected subgraph of a graph
in the clique method, each maximal clique in the graph becomes a cluster
Maximal cliques (and therefore the clusters) in the previous example are:
{T1, T3, T4, T6}
{T2, T4, T6}
{T2, T6, T8}
{T1, T5}
{T7}
Note that, for example, {T1, T3, T4} is also a clique, but it is not maximal.
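For illustration, NetworkX can enumerate the maximal cliques of the thresholded graph directly; using it here is an assumption, and the edge list below is the one derived from the threshold matrix:

```python
import networkx as nx

# Edges of the thresholded similarity graph from the previous slides
edges = [("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
         ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
         ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8")]
G = nx.Graph(edges)
G.add_node("T7")                  # T7 has no edges above the threshold

# Each maximal clique becomes a cluster; note that clusters may overlap
for clique in nx.find_cliques(G):
    print(sorted(clique))
```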
Simple Clustering Algorithms
Single Link Method
1. select an item not in a cluster and place it in a new cluster
2. place all other items similar to it in that cluster
3. repeat step 2 for each item in the cluster until nothing more can be added
4. repeat steps 1-3 for each item that remains unclustered
In this case the single link method produces only two clusters:
{T1, T3, T4, T5, T6, T2, T8}
{T7}
Note that the single link method does not allow overlapping clusters, thus partitioning the set of items.
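On the thresholded graph, the single link clusters are exactly the connected components, which can be sketched as follows (NetworkX again assumed):

```python
import networkx as nx

# Same thresholded similarity graph as in the clique example
edges = [("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
         ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
         ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8")]
G = nx.Graph(edges)
G.add_node("T7")

# Single link clusters are the connected components of the graph
for component in nx.connected_components(G):
    print(sorted(component))      # {T1, T2, T3, T4, T5, T6, T8} and {T7}
```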
Clustering with Existing Clusters
The notion of comparing item similarities can be extended to clusters
themselves, by focusing on a representative vector for each cluster
cluster representatives can be actual items in the cluster or other “virtual”
representatives such as the centroid
this methodology reduces the number of similarity computations in clustering
clusters are revised successively until a stopping condition is satisfied, or until no
more changes to clusters can be made
Partitioning Methods
reallocation method - start with an initial assignment of items to clusters and then
move items from cluster to cluster to obtain an improved partitioning
Single pass method - simple and efficient, but produces large clusters, and depends
on order in which items are processed
Hierarchical Agglomerative Methods
starts with individual items and combines into clusters
then successively combine smaller clusters to form larger ones
grouping of individual items can be based on any of the methods discussed earlier
K-Means Algorithm
The basic algorithm (based on reallocation method):
1. select K data points as the initial representatives
2. for i = 1 to N, assign item xi to the most similar centroid (this gives K clusters)
3. for j = 1 to K, recalculate the cluster centroid Cj
4. repeat steps 2 and 3 until there is (little or) no change in clusters
Example: Clustering Terms
Initial (arbitrary) assignment:
C1 = {T1,T2}, C2 = {T3,T4}, C3 = {T5,T6}
Document-term matrix and the resulting cluster centroids (each centroid is the mean of the term vectors assigned to it, indexed by document):

      T1  T2  T3  T4  T5  T6  T7  T8 |  C1   C2   C3
Doc1   0   4   0   0   0   2   1   3 | 4/2  0/2  2/2
Doc2   3   1   4   3   1   2   0   1 | 4/2  7/2  3/2
Doc3   3   0   0   0   3   0   3   0 | 3/2  0/2  3/2
Doc4   0   1   0   3   0   0   2   0 | 1/2  3/2  0/2
Doc5   2   2   2   3   1   4   0   2 | 4/2  5/2  5/2
Example: K-Means
Example (continued)
Now, using the simple (dot product) similarity measure, compute the new cluster-term similarity matrix:
           T1      T2      T3      T4      T5      T6      T7      T8
Class1    29/2    29/2    24/2    27/2    17/2    32/2    15/2    24/2
Class2    31/2    20/2    38/2    45/2    12/2    34/2     6/2    17/2
Class3    28/2    21/2    22/2    24/2    17/2    30/2    11/2    19/2
Assign to Class2  Class1  Class2  Class2  Class3  Class2  Class1  Class1
Now compute new cluster centroids using the original document-term matrix
      T1  T2  T3  T4  T5  T6  T7  T8 |  C1    C2    C3
Doc1   0   4   0   0   0   2   1   3 | 8/3   2/4   0/1
Doc2   3   1   4   3   1   2   0   1 | 2/3  12/4   1/1
Doc3   3   0   0   0   3   0   3   0 | 3/3   3/4   3/1
Doc4   0   1   0   3   0   0   2   0 | 3/3   3/4   0/1
Doc5   2   2   2   3   1   4   0   2 | 4/3  11/4   1/1
The process is repeated until no further changes are made to the clusters
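A compact Python sketch of this reallocation-style k-means on the term vectors, using dot product similarity as in the example (NumPy, the helper names, and the seeding of T7 and T8 in the initial assignment are assumptions):

```python
import numpy as np

# Document-term matrix (rows Doc1..Doc5, columns T1..T8); the terms are the items to cluster
D = np.array([
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
], dtype=float)
terms = D.T                                   # one row per term, indexed by document

def kmeans(items, k, assignment, max_iter=100):
    """Reallocation-style k-means using dot product similarity."""
    assignment = np.array(assignment)
    for _ in range(max_iter):
        # recompute each cluster centroid as the mean of its current members
        centroids = np.array([items[assignment == j].mean(axis=0) for j in range(k)])
        # reassign every item to its most similar centroid
        sims = items @ centroids.T
        new_assignment = sims.argmax(axis=1)
        if np.array_equal(new_assignment, assignment):
            break                             # no change in clusters: stop
        assignment = new_assignment
    return assignment, centroids

# Initial assignment: C1 = {T1, T2}, C2 = {T3, T4}, C3 = {T5, T6};
# T7 and T8 are seeded arbitrarily here, since the sketch assigns every item up front
init = [0, 0, 1, 1, 2, 2, 0, 1]
labels, _ = kmeans(terms, 3, init)
print(labels)
```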
K-Means Algorithm
Strengths of k-means:
Relatively efficient: O(tkn), where n is # of objects, k is # of
clusters, and t is # of iterations. Normally, k, t << n
Often terminates at a local optimum
Weaknesses of k-means:
Applicable only when mean is defined; what about categorical
data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Variations of K-Means usually differ in:
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Hierarchical Algorithms
Use the distance matrix as the clustering criterion
does not require the number of clusters k as an input, but needs
a termination condition
Hierarchical Agglomerative Clustering
HAC starts with unclustered data and performs successive pairwise
joins among items (or previous clusters) to form larger ones
this results in a hierarchy of clusters which can be viewed as a dendrogram
useful in pruning search in a clustered item set, or in browsing clustering results
Some commonly used HACM methods
Single Link: at each step join most similar pairs of objects that are not yet
in the same cluster
Complete Link: use least similar pair between each cluster pair to
determine inter-cluster similarity - all items within one cluster are linked
to each other within a similarity threshold
Ward’s method: at each step join cluster pair whose merger minimizes the
increase in total within-group error sum of squares (based on distance
between centroids) - also called the minimum variance method
Group Average (Mean): use average value of pairwise links within a
cluster to determine inter-cluster similarity (i.e., all objects contribute to
inter-cluster similarity)
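For reference, a brief sketch of HAC and a dendrogram using SciPy (the library, the choice of group-average linkage, and Euclidean distance are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Term vectors (rows T1..T8, indexed by Doc1..Doc5), as in the earlier examples
terms = np.array([
    [0, 3, 3, 0, 2], [4, 1, 0, 1, 2], [0, 4, 0, 0, 2], [0, 3, 0, 3, 3],
    [0, 1, 3, 0, 1], [2, 2, 0, 0, 4], [1, 0, 3, 2, 0], [3, 1, 0, 0, 2],
], dtype=float)

# method can be "single", "complete", "average", or "ward" (the HAC variants above)
Z = linkage(terms, method="average", metric="euclidean")
dendrogram(Z, labels=[f"T{i}" for i in range(1, 9)])
plt.show()
```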
Hierarchical Agglomerative Clustering
Dendrogram for a hierarchy of clusters
[Figure: dendrogram over items A through I, showing the order in which items and clusters are successively merged]