Data Mining / Web Data Mining


Data Mining Techniques:
Classification and Clustering
Chein-Shung Hwang
Today
 Classification
 Basic Concepts in Classification
 Decision Trees
 ID3/C4.5 Algorithm
 Bayesian Classification
 Tools and Software for Classification
 Clustering
 Distance Measures
 Graph-based Techniques
 K-Means Clustering
 Tools and Software for Clustering
What Is Classification?
 The goal of data classification is to organize and
categorize data in distinct classes
 A model is first created based on the data distribution
 The model is then used to classify new data
 Given the model, a class can be predicted for new data
 Classification = prediction for discrete and nominal
values
 With classification, I can predict in which bucket to put the ball, but I
can’t predict the weight of the ball
Prediction, Clustering, Classification
 What is Prediction?
 The goal of prediction is to forecast or deduce the value of an attribute
based on values of other attributes
 A model is first created based on the data distribution
 The model is then used to predict future or unknown values
 Supervised vs. Unsupervised Classification
 Supervised Classification = Classification
We know the class labels and the number of classes
 Unsupervised Classification = Clustering
We do not know the class labels and may not know the number of
classes
Classification: 3 Step Process
 1. Model construction (Learning):
 Each record (instance) is assumed to belong to a predefined class, as
determined by one of the attributes, called the class label
 The set of all records used for construction of the model is called training
set
 The model is usually represented in the form of classification rules (IF-THEN statements) or decision trees
 2. Model Evaluation (Accuracy):
 Estimate accuracy rate of the model based on a test set
 The known label of test sample is compared with the classified result from
model
 Accuracy rate: percentage of test set samples correctly classified by the
model
 Test set is independent of training set; otherwise over-fitting will occur
 3. Model Use (Classification):
 The model is used to classify unseen instances (assigning class labels)
 Predict the value of an actual attribute
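To make the three steps concrete, here is a minimal sketch in Python using scikit-learn (the library, parameter choices and variable names are our illustration, not part of the slides):

# Minimal sketch of the 3-step process: construction, evaluation, use.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def classify(X, y, X_new):
    # 1. Model construction: learn a model from a labeled training set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    model = DecisionTreeClassifier(criterion="entropy")
    model.fit(X_train, y_train)

    # 2. Model evaluation: accuracy rate on an independent test set
    accuracy = model.score(X_test, y_test)

    # 3. Model use: assign class labels to unseen instances
    return model.predict(X_new), accuracy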
Model Construction

Model Evaluation

Model Use: Classification
Classification Methods
 Decision Tree Induction
 Neural Networks
 Bayesian Classification
 Association-Based
Classification
 K-Nearest Neighbor
 Case-Based Reasoning
 Genetic Algorithms
 Fuzzy Sets
 Many More
Decision Trees
 A decision tree is a flow-chart-like tree structure
 Internal node denotes a test on an attribute (feature)
 Branch represents an outcome of the test
 All records in a branch have the same value for the tested attribute
 Leaf node represents class label or class label distribution
outlook = sunny:
    humidity = high: N
    humidity = normal: P
outlook = overcast: P
outlook = rain:
    windy = true: N
    windy = false: P
Decision Trees
 Example: “is it a good day to play golf?”
 a set of attributes and their possible values:
    outlook: sunny, overcast, rain
    temperature: cool, mild, hot
    humidity: high, normal
    windy: true, false

A particular instance in the training set might be:
    <overcast, hot, normal, false>: play

In this case, the target class is a binary attribute, so each instance represents a positive or a negative example.
Using Decision Trees for Classification
 Examples can be classified as follows
 1. look at the example's value for the feature specified
 2. move along the edge labeled with this value
 3. if you reach a leaf, return the label of the leaf
 4. otherwise, repeat from step 1
 Example (a decision tree to decide whether to go on a picnic):

outlook = sunny:
    humidity = high: N
    humidity = normal: P
outlook = overcast: P
outlook = rain:
    windy = true: N
    windy = false: P

So a new instance <rainy, hot, normal, true>: ? will be classified as “no play” (outlook = rain and windy = true lead to the leaf N).
Decision Trees and Decision Rules
If attributes are continuous, internal nodes may test against a threshold.

outlook = sunny:
    humidity > 75%: no
    humidity <= 75%: yes
outlook = overcast: yes
outlook = rain:
    windy > 20: no
    windy <= 20: yes

Each path in the tree represents a decision rule:

Rule1: If (outlook=“sunny”) AND (humidity<=0.75) Then (play=“yes”)
Rule2: If (outlook=“rainy”) AND (wind>20) Then (play=“no”)
Rule3: If (outlook=“overcast”) Then (play=“yes”)
...
Top-Down Decision Tree Generation
 The basic approach usually consists of two phases:
 Tree construction
At the start, all the training examples are at the root
Examples are partitioned recursively based on selected attributes
 Tree pruning
remove tree branches that may reflect noise in the training data and
lead to errors when classifying test data
improve classification accuracy
 Basic Steps in Decision Tree Construction
 Tree starts as a single node representing all the data
 If the samples are all in the same class, then the node becomes a leaf labeled with
that class label
 Otherwise, select the feature that best separates the samples into individual
classes.
 Recursion stops when:
Samples in node belong to the same class (majority)
There are no remaining attributes on which to split
Trees Construction Algorithm (ID3)
 Decision Tree Learning Method (ID3)
 Input: a set of examples S, a set of features F, and a target set T (target class T
represents the type of instance we want to classify, e.g., whether “to play golf”)
 1. If every element of S is already in T, return “yes”; if no element of S is in T
return “no”
 2. Otherwise, choose the best feature f from F (if there are no features
remaining, then return failure);
 3. Extend tree from f by adding a new branch for each attribute value
 4. Distribute training examples to leaf nodes (so each leaf node S is now the set of
examples at that node, and F is the remaining set of features not yet selected)
 5. Repeat steps 1-5 for each leaf node
 Main Question:
 how do we choose the best feature at each step?
Note: ID3 algorithm only deals with categorical attributes, but can be extended
(as in C4.5) to handle continuous attributes
Choosing the “Best” Feature
 Using Information Gain to find the “best” (most discriminating) feature
 Entropy, E(I), of a set of instances I containing p positive and n negative examples:

    E(I) = - (p / (p + n)) · log2( p / (p + n) ) - (n / (p + n)) · log2( n / (p + n) )

 Gain(A, I) is the expected reduction in entropy due to feature (attribute) A:

    Gain(A, I) = E(I) - Σ_j ( (pj + nj) / (p + n) ) · E(Ij)

 the jth descendant Ij of I is the set of instances with value vj for A
Example: S: [9+,5-],  E(S) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) = 0.940

Splitting on Outlook?:
    overcast: [4+,0-]  → “yes”, since all positive examples
    sunny:    [2+,3-]
    rainy:    [3+,2-]
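A quick numeric check of these two formulas (our Python sketch, not course code); it reproduces E([9+,5-]) ≈ 0.940 and the gain of the outlook split:

from math import log2

def entropy(p, n):
    """E(I) for a set with p positive and n negative examples."""
    total = p + n
    e = 0.0
    for c in (p, n):
        if c:                       # 0 * log(0) is taken as 0
            e -= (c / total) * log2(c / total)
    return e

def gain(p, n, splits):
    """Gain(A, I): splits is a list of (p_j, n_j) counts, one per value of A."""
    return entropy(p, n) - sum((pj + nj) / (p + n) * entropy(pj, nj)
                               for pj, nj in splits)

print(entropy(9, 5))                            # ~0.940
print(gain(9, 5, [(4, 0), (2, 3), (3, 2)]))     # Gain(S, outlook) ~0.246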
Decision Tree Learning - Example
Day   outlook    temp   humidity   wind     play
D1    sunny      hot    high       weak     No
D2    sunny      hot    high       strong   No
D3    overcast   hot    high       weak     Yes
D4    rain       mild   high       weak     Yes
D5    rain       cool   normal     weak     Yes
D6    rain       cool   normal     strong   No
D7    overcast   cool   normal     strong   Yes
D8    sunny      mild   high       weak     No
D9    sunny      cool   normal     weak     Yes
D10   rain       mild   normal     weak     Yes
D11   sunny      mild   normal     strong   Yes
D12   overcast   mild   high       strong   Yes
D13   overcast   hot    normal     weak     Yes
D14   rain       mild   high       strong   No
S: [9+,5-]  (E = 0.940)

Splitting on humidity?:
    high:   [3+,4-]  (E = 0.985)
    normal: [6+,1-]  (E = 0.592)
Gain(S, humidity) = .940 - (7/14)*.985 - (7/14)*.592 = .151

Splitting on wind?:
    weak:   [6+,2-]  (E = 0.811)
    strong: [3+,3-]  (E = 1.00)
Gain(S, wind) = .940 - (8/14)*.811 - (8/14)*1.0 = .048

So, classifying examples by humidity provides more information gain than by wind. In this case, however, you can verify that outlook has the largest information gain, so it will be selected as the root.
Decision Tree Learning - Example
 Partially learned decision tree

S: [9+,5-]  {D1, D2, …, D14}
Outlook
    sunny:    [2+,3-]  {D1, D2, D8, D9, D11}   → ?
    overcast: [4+,0-]  {D3, D7, D12, D13}      → yes
    rainy:    [3+,2-]  {D4, D5, D6, D10, D14}  → ?

 which attribute should be tested here?
Ssunny = {D1, D2, D8, D9, D11}
Gain(Ssunny, humidity) = .970 - (3/5)*0.0 - (2/5)*0.0 = .970
Gain(Ssunny, temp) = .970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = .570
Gain(Ssunny, wind) = .970 - (2/5)*1.0 - (3/5)*.918 = .019
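The whole selection process can be put together in a compact ID3-style sketch (our illustration in Python, not the course's code), run on the 14-example table above; it picks outlook at the root, humidity under “sunny” and wind under “rain”:

from math import log2
from collections import Counter

DATA = [  # (outlook, temp, humidity, wind, play)
    ("sunny", "hot", "high", "weak", "No"), ("sunny", "hot", "high", "strong", "No"),
    ("overcast", "hot", "high", "weak", "Yes"), ("rain", "mild", "high", "weak", "Yes"),
    ("rain", "cool", "normal", "weak", "Yes"), ("rain", "cool", "normal", "strong", "No"),
    ("overcast", "cool", "normal", "strong", "Yes"), ("sunny", "mild", "high", "weak", "No"),
    ("sunny", "cool", "normal", "weak", "Yes"), ("rain", "mild", "normal", "weak", "Yes"),
    ("sunny", "mild", "normal", "strong", "Yes"), ("overcast", "mild", "high", "strong", "Yes"),
    ("overcast", "hot", "normal", "weak", "Yes"), ("rain", "mild", "high", "strong", "No"),
]
FEATURES = {"outlook": 0, "temp": 1, "humidity": 2, "wind": 3}

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def gain(rows, f):
    i, rem = FEATURES[f], 0.0
    for v in set(r[i] for r in rows):
        sub = [r for r in rows if r[i] == v]
        rem += len(sub) / len(rows) * entropy(sub)
    return entropy(rows) - rem

def id3(rows, features):
    labels = set(r[-1] for r in rows)
    if len(labels) == 1:                 # all examples in the same class -> leaf
        return labels.pop()
    if not features:                     # no features left -> majority class
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    best = max(features, key=lambda f: gain(rows, f))
    i = FEATURES[best]
    rest = [f for f in features if f != best]
    return {best: {v: id3([r for r in rows if r[i] == v], rest)
                   for v in set(r[i] for r in rows)}}

print(id3(DATA, list(FEATURES)))
# root splits on outlook; the 'sunny' branch splits on humidity, 'rain' on wind,
# and 'overcast' is a 'Yes' leaf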
Dealing With Continuous Variables
 Partition continuous attribute into a discrete set of intervals
 sort the examples according to the continuous attribute A
 identify adjacent examples that differ in their target classification
 generate a set of candidate thresholds midway
 problem: may generate too many intervals
 Another Solution:
 take a minimum threshold M of the examples of the majority class in each
adjacent partition; then merge adjacent partitions with the same majority class
Example: M = 3

Temperature:  64   65   68   69   70   71   72   72   75   75   80   81   83   85
Play?:        yes  no   yes  yes  yes  no   no   yes  yes  yes  no   yes  yes  no

Partition boundaries shown: 70.5 (between 70 and 71) and 77.5 (between 75 and 80).
Same majority, so they are merged.
Final mapping: temperature <= 77.5 ==> “yes”; temperature > 77.5 ==> “no”
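A small sketch (ours) of the candidate-threshold step on the temperature data above: sort on the attribute and propose a midpoint wherever adjacent examples with different values change class. It illustrates the “too many intervals” problem:

temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ["yes", "no", "yes", "yes", "yes", "no", "no", "yes",
          "yes", "yes", "no", "yes", "yes", "no"]

pairs = sorted(zip(temps, labels))
candidates = [(a[0] + b[0]) / 2                 # midpoint between adjacent values
              for a, b in zip(pairs, pairs[1:])
              if a[1] != b[1] and a[0] != b[0]] # class changes, values differ

print(sorted(set(candidates)))
# -> [64.5, 66.5, 70.5, 77.5, 80.5, 84.0]; 70.5 and 77.5 are the boundaries shown above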
Improving on Information Gain
 Info. Gain tends to favor attributes with a large number of values
 larger distribution ==> lower entropy ==> larger Gain
 Quinlan suggests using Gain Ratio
 penalize for large number of values
    SplitInfo(A, S) = - Σ_i ( |Si| / |S| ) · log2( |Si| / |S| )

    GainRatio(A, S) = Gain(A, S) / SplitInfo(A, S)

 Example: “outlook” splits S: [9+,5-] into S1 (overcast): [4+,0-], S2 (sunny): [2+,3-], S3 (rainy): [3+,2-]

SplitInfo(outlook, S)
= -(4/14)·log2(4/14) - (5/14)·log2(5/14) - (5/14)·log2(5/14)
= 1.577

GainRatio(outlook, S)
= Gain(outlook, S) / SplitInfo(outlook, S) = 0.246 / 1.577 = 0.156
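A quick check of the “outlook” numbers (our sketch); SplitInfo uses the sizes of the three branches (4, 5 and 5 of the 14 examples):

from math import log2

def split_info(sizes):
    total = sum(sizes)
    return -sum(s / total * log2(s / total) for s in sizes)

si = split_info([4, 5, 5])
print(si)            # ~1.577
print(0.246 / si)    # GainRatio(outlook, S) ~0.156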
Over-fitting in Classification
 A tree generated may over-fit the training examples due to
noise or too small a set of training data
 Two approaches to avoid over-fitting:
 (Stop earlier): Stop growing the tree earlier
 (Post-prune): Allow over-fit and then post-prune the tree
 Approaches to determine the correct final tree size:
 Separate training and testing sets or use cross-validation
 Use all the data for training, but apply a statistical test (e.g., chi-square) to
estimate whether expanding or pruning a node may improve over entire
distribution
 Use Minimum Description Length (MDL) principle: halting growth of the
tree when the encoding is minimized.
 Rule post-pruning (C4.5): converting to rules before
pruning
Pruning the Decision Tree
 A decision tree constructed using the training data may
need to be pruned
 over-fitting may result in branches or leaves based on too few examples
 pruning is the process of removing branches and subtrees that are
generated due to noise; this improves classification accuracy
 Subtree Replacement: merge a subtree into a leaf node
 Using a set of data different from the training data
 At a tree node, if the accuracy without splitting is higher than the
accuracy with splitting, replace the subtree with a leaf node; label it
using the majority class
Example: a subtree splits on color, with leaf 1 (color = red → “yes”) and leaf 2 (color = blue → “no”).

Suppose with the test set we find 3 red “no” examples, and 1 blue “yes” example. We can replace the subtree with a single “no” node. After replacement there will be only 2 errors instead of 5.
Bayesian Classification
 It is a statistical classifier based on Bayes theorem
 It uses probabilistic learning by calculating explicit probabilities for
hypothesis
 A naïve Bayesian classifier, that assumes total independence between
attributes, is commonly used and performs well with large data sets
 The model is incremental in the sense that each training example can
incrementally increase or decrease the probability that a hypothesis is
correct. Prior knowledge can be combined with observed data
 Given a data sample X with an unknown class label, H is
the hypothesis that X belongs to a specific class C
 The conditional probability of hypothesis H given X, Pr(H|X), follows the
Bayes theorem:
    Pr(H | X) = Pr(X | H) · Pr(H) / Pr(X)
 Practical difficulty: requires initial knowledge of many
probabilities, significant computational cost
Naïve Bayesian Classifier
 Suppose we have n classes C1, C2, …, Cn. Given an unknown sample X = (x1, x2, …, xn), the classifier will predict that X belongs to the class with the highest conditional probability:

    X ∈ Ci  if  Pr(Ci | X) > Pr(Cj | X)  for 1 ≤ j ≤ n, j ≠ i

 Maximize Pr(X | Ci)·Pr(Ci) / Pr(X)  =>  maximize Pr(X | Ci)·Pr(Ci)

 Note: Pr(Ci) = si / s, and Pr(X | Ci) = Π(k=1..n) Pr(xk | Ci), where Pr(xk | Ci) = sik / si

 Greatly reduces the computation cost: only count the class distribution

 Naïve: class conditional independence
Naïve Bayesian Classifier - Example
 Given a training set, we can compute the probabilities
X = <sunny, mild, high, true>
Pr(X | “no”).Pr(“no”) = (3/5 . 2/5 . 4/5 . 3/5) . 5/14 = 0.04
Pr(X | “yes”).Pr(“yes”) = (2/9 . 4/9 . 3/9 . 3/9) . 9/14 = 0.007
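A sketch (ours) of this computation, with the conditional probabilities read off the 14-example golf table (5 “no” days, 9 “yes” days):

# X = <sunny, mild, high, true>; counts come from the training table above
cond_no  = {"outlook=sunny": 3/5, "temp=mild": 2/5, "humidity=high": 4/5, "windy=true": 3/5}
cond_yes = {"outlook=sunny": 2/9, "temp=mild": 4/9, "humidity=high": 3/9, "windy=true": 3/9}
prior = {"no": 5/14, "yes": 9/14}

def score(cond, prior_p):
    p = prior_p
    for pr in cond.values():        # class-conditional independence assumption
        p *= pr
    return p

print("no :", round(score(cond_no,  prior["no"]),  3))   # ~0.041
print("yes:", round(score(cond_yes, prior["yes"]), 3))   # ~0.007 -> predict "no"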
Classification Example - Bank Data
 Want to determine likely responders to a direct mail campaign
 a new product, a "Personal Equity Plan" (PEP)
 training data include records kept about how previous customers responded and
bought the product
 in this case the target class is “pep” with binary value
 want to build a model and apply it to new data (a customer list) in which the
value of the class attribute is not available
id       age  sex     region      income   married  children  car  save_act  current_act  mortgage  pep
ID12101  48   FEMALE  INNER_CITY  17546    NO       1         NO   NO        NO           NO        YES
ID12102  40   MALE    TOWN        30085.1  YES      3         YES  NO        YES          YES       NO
ID12103  51   FEMALE  INNER_CITY  16575.4  YES      0         YES  YES       YES          NO        NO
ID12104  23   FEMALE  TOWN        20375.4  YES      3         NO   NO        YES          NO        NO
ID12105  57   FEMALE  RURAL       50576.3  YES      0         NO   YES       NO           NO        NO
ID12106  57   FEMALE  TOWN        37869.6  YES      2         NO   YES       YES          NO        YES
ID12107  22   MALE    RURAL       8877.07  NO       0         NO   NO        YES          NO        YES
ID12108  58   MALE    TOWN        24946.6  YES      0         YES  YES       YES          NO        NO
ID12109  37   FEMALE  SUBURBAN    25304.3  YES      2         YES  NO        NO           NO        NO
ID12110  54   MALE    TOWN        24212.1  YES      2         YES  YES       YES          NO        NO
ID12111  66   FEMALE  TOWN        59803.9  YES      0         NO   YES       YES          NO        NO
…        …    …       …           …        …        …         …    …         …            …         …
Data Preparation
 Several steps to prepare data for Weka and for See5
 open the training data in Excel, remove the “id” column, and save the results as a comma-delimited file (e.g., “bank.csv”)
 do the same with new customer data, but also add a new column called “pep”
as the last column; the value of this column for each record should be “?”
 Weka
 must convert the data to ARFF format
 attribute specification and data are in the same file
 the data portion is just the comma delimited data file without the label row
 See5/C5
 create a “name” file and a “data” file
 “name” file contains attribute specification; “data” file is same as above
 first line of “name” file must be the name(s) of the target class(es) - in this case
“pep”
Data File Format for Weka
Training Data:

@relation 'train-bank-data'
@attribute 'age' real
@attribute 'sex' {'MALE','FEMALE'}
@attribute 'region' {'INNER_CITY','RURAL','TOWN','SUBURBAN'}
@attribute 'income' real
@attribute 'married' {'YES','NO'}
@attribute 'children' real
@attribute 'car' {'YES','NO'}
@attribute 'save_act' {'YES','NO'}
@attribute 'current_act' {'YES','NO'}
@attribute 'mortgage' {'YES','NO'}
@attribute 'pep' {'YES','NO'}
@data
48,FEMALE,INNER_CITY,17546,NO,1,NO,NO,NO,NO,YES
40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO
...
New Cases:

@relation 'new-bank-data'
@attribute 'age' real
@attribute 'region' {'INNER_CITY','RURAL','TOWN','SUBURBAN'}
...
@attribute 'pep' {'YES','NO'}
@data
23,MALE,INNER_CITY,18766.9,YES,0,YES,YES,NO,YES,?
30,MALE,RURAL,9915.67,NO,1,NO,YES,NO,YES,?
Data File Format for See5/C5
pep.
age: continuous.
sex: MALE,FEMALE.
region: INNER_CITY,RURAL,TOWN,SUBURBAN.
income: continuous.
married: YES,NO.
children: continuous.
car: YES,NO.
save_act: YES,NO.
current_act: YES,NO.
mortgage: YES,NO.
pep: YES,NO.
48,FEMALE,INNER_CITY,17546,NO,1,NO,NO,NO,NO,YES
40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO
51,FEMALE,INNER_CITY,16575.4,YES,0,YES,YES,YES,NO,NO
...
23,MALE,INNER_CITY,18766.9,YES,0,YES,YES,NO,YES,?
30,MALE,RURAL,9915.67,NO,1,NO,YES,NO,YES,?
45,FEMALE,RURAL,21881.6,NO,0,YES,YES,YES,NO,?
...
The first block above (“pep.” through “pep: YES,NO.”) is the name file for the training data; the comma-delimited records that follow are the data file for the training data.

Note: no “name file” is necessary for new cases, but the file name must use the same stem, and a suffix “.cases”.
C4.5 Implementation in Weka
 To build a model (decision tree)
using the classifiers.j48.J48 class
 from command line or using the
simple java-based command line
interface
Decision Tree Output (pruned), produced by the command for building the model shown after the tree (additional parameters can be specified for pruning, cross validation, etc.):
children <= 2
| children <= 0
| | married = YES
| | | mortgage = YES
| | | | save_act = YES: NO (16.0/2.0)
| | | | save_act = NO: YES (9.0/1.0)
| | | mortgage = NO: NO (59.0/6.0)
| | married = NO
| | | mortgage = YES
| | | | save_act = YES: NO (12.0)
| | | | save_act = NO: YES (3.0)
| | | mortgage = NO: YES (29.0/2.0)
| children > 0
| | income <= 29622
| | | children <= 1
| | | | income <= 12640.3: NO (5.0)
| | | | income > 12640.3
| | | | | current_act = YES: YES (28.0/1.0)
| | | | | current_act = NO
| | | | | | income <= 17390.1: NO (3.0)
| | | | | | income > 17390.1: YES (6.0)
| | | children > 1: NO (47.0/3.0)
| | income > 29622: YES (48.0/2.0)
children > 2
| income <= 43228.2: NO (30.0/2.0)
| income > 43228.2: YES (5.0)
> java weka.classifiers.j48.J48 -t <file-path-name>.arff -d <file-path-name>.model
C4.5 Implementation in Weka
The rest of the output contains statistical information about the model, including the confusion matrix, error rates, etc.:

=== Error on training data ===

Correctly Classified Instances         281               93.6667 %
Incorrectly Classified Instances        19                6.3333 %
Mean absolute error                      0.1163
Root mean squared error                  0.2412
Relative absolute error                 23.496  %
Root relative squared error             48.4742 %
Total Number of Instances              300

=== Confusion Matrix ===

   a   b   <-- classified as
 122  13 |   a = YES
   6 159 |   b = NO

=== Stratified cross-validation ===

Correctly Classified Instances         274               91.3333 %
Incorrectly Classified Instances        26                8.6667 %
Mean absolute error                      0.1434
Root mean squared error                  0.291
Relative absolute error                 28.9615 %
Root relative squared error             58.4922 %
Total Number of Instances              300

=== Confusion Matrix ===

   a   b   <-- classified as
 118  17 |   a = YES
   9 156 |   b = NO

The model itself is now contained in the (binary) file <file-path-name>.model.
C4.5 Implementation in Weka
 Applying the model to new cases:

java weka.classifiers.j48.J48 -p -l <file-path-name>.model -T <new-data-file-path-name>.arff

The output gives the predicted class for each new instance along with its predicted accuracy. Since we removed the “id” field, we now need to map these predictions to the original “new case” records (e.g., using Excel).
0 NO 0.875 ?
1 NO 1.0 ?
2 YES 0.9310344827586207 ?
3 YES 0.9583333333333334 ?
4 NO 0.8983050847457628 ?
5 YES 0.9642857142857143 ?
6 NO 0.875 ?
7 YES 1.0 ?
8 NO 0.9333333333333333 ?
9 YES 0.9642857142857143 ?
10 NO 0.875 ?
. . .
195 YES 0.9583333333333334 ?
196 NO 0.9361702127659575 ?
197 YES 1.0 ?
198 NO 0.8983050847457628 ?
199 NO 0.9361702127659575 ?
Classification Using See5/C5
Note: See5/C5 also
has options for
creating the model as
a set of decision rules
(as opposed to
a decision tree)
Either model can be
used to classify new
unclassified cases.
Classification Using See5/C5
Class specified by attribute `pep'

** This demonstration version cannot process **
** more than 200 training or test cases.     **

Read 200 cases (11 attributes) from bank-train.data

Decision tree:

income > 30085.1:
:...children > 0: YES (43/5)
:   children <= 0:
:   :...married = YES: NO (19/2)
:       married = NO:
:       :...mortgage = YES: NO (3)
:           mortgage = NO: YES (5)
income <= 30085.1:
:...children > 1: NO (50/4)
    children <= 1:
    :...children <= 0:
    :   :...save_act = YES: NO (27
    :       save_act = NO:
    :       :...married = NO: YES
    :           married = YES:
    :           :...mortgage = YES
    :               mortgage = NO:
        children > 0:
        :...income <= 12681.9: NO
            income > 12681.9:
            :...current_act = YES:
                current_act = NO:
                :...car = YES: NO
                    car = NO: YES

(the right-hand ends of several leaves are cut off in the original screenshot)

Evaluation on training data (200 cases):

        Decision Tree
      ----------------
      Size      Errors

        13    19( 9.5%)   <<

       (a)   (b)    <-classified as
      ----  ----
        76    13    (a): class YES
         6   105    (b): class NO

Time: 0.1 secs
See5/C5: Applying Model to New Cases
 Need to use the executable file “sample.exe” (note that source code is available to allow you to build the classifier into your applications)
 From the Command Line:
prompt> sample -f bank-train > bank-new.out
Output classification
file for new cases (along
with predicted accuracy
for each new case).
Case No    Given Class    Predicted Class
    1           ?             NO   [0.79]
    2           ?             NO   [0.86]
    3           ?             NO   [0.79]
    4           ?             YES  [0.87]
    5           ?             NO   [0.79]
  . . .
  197           ?             NO   [0.90]
  198           ?             YES  [0.88]
  199           ?             NO   [0.79]
  200           ?             NO   [0.90]
Classification Using See5/C5
 Building the model based on decision rules:
Rule 1: (31, lift 2.2)
income > 12681.9
children > 0
children <= 1
current_act = YES
-> class YES [0.970]
Rule 2: (20, lift 2.1)
income > 12681.9
children > 0
children <= 1
car = NO
-> class YES [0.955]
Rule 3: (17, lift 2.1)
income > 30085.1
married = NO
mortgage = NO
-> class YES [0.947]
Rule 4: (7, lift 2.0)
married = NO
children <= 0
save_act = NO
-> class YES [0.889]
Rule 5: (43/5, lift 1.9)
income > 30085.1
children > 0
-> class YES [0.867]
Rule 6: (7/1, lift 1.7)
children <= 0
save_act = NO
mortgage = YES
-> class YES [0.778]
. . .
What is Clustering in Data Mining?
Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters
Helps users understand the natural grouping or structure in a data set
 Cluster:
 a collection of data objects that are
“similar” to one another and thus
can be treated collectively as one
group
 but as a collection, they are
sufficiently different from other
groups
 Clustering
 unsupervised classification
 no predefined classes
Requirements of Clustering Methods
 Scalability
 Dealing with different types of attributes
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain
knowledge to determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 The curse of dimensionality
 Interpretability and usability
Applications of Clustering
 Clustering has wide applications in Pattern Recognition
 Spatial Data Analysis:
 create thematic maps in GIS by clustering feature spaces
 detect spatial clusters and explain them in spatial data mining
 Image Processing
 Market Research
 Information Retrieval
 Document or term categorization
 Information visualization and IR interfaces
 Web Mining
 Cluster Web usage data to discover groups of similar access patterns
 Web Personalization
Clustering Methodologies
 Two general methodologies
 Partitioning Based Algorithms
 Hierarchical Algorithms
 Partitioning Based
 divide a set of N items into K clusters (top-down)
 Hierarchical
 agglomerative: pairs of items or clusters are successively linked to produce
larger clusters
 divisive: start with the whole set as a cluster and successively divide sets into
smaller partitions
Distance or Similarity Measures
 Measuring Distance
 In order to group similar items, we need a way to measure the distance
between objects (e.g., records)
 Note: distance = inverse of similarity
 Often based on the representation of objects as “feature vectors”
An Employee DB:
ID   Gender   Age   Salary
1    F        27     19,000
2    M        51     64,000
3    M        52    100,000
4    F        33     55,000
5    M        45     45,000

Term Frequencies for Documents:
        T1   T2   T3   T4   T5   T6
Doc1     0    4    0    0    0    2
Doc2     3    1    4    3    1    2
Doc3     3    0    0    0    3    0
Doc4     0    1    0    3    0    0
Doc5     2    2    2    3    1    4

Which objects are more similar?
Distance or Similarity Measures
 Properties of Distance Measures:
 for all objects A and B, dist(A, B) ≥ 0, and dist(A, B) = dist(B, A)
 for any object A, dist(A, A) = 0
 dist(A, C) ≤ dist(A, B) + dist(B, C)
 Common Distance Measures:
Let X = <x1, x2, …, xn> and Y = <y1, y2, …, yn>.

 Manhattan distance:
    dist(X, Y) = |x1 - y1| + |x2 - y2| + … + |xn - yn|

 Euclidean distance:
    dist(X, Y) = sqrt( (x1 - y1)^2 + … + (xn - yn)^2 )

 Cosine similarity:
    sim(X, Y) = ( Σi xi·yi ) / sqrt( Σi xi^2 · Σi yi^2 ),   dist(X, Y) = 1 - sim(X, Y)

These can be normalized to make values fall between 0 and 1.
Distance or Similarity Measures
 Weighting Attributes
 in some cases we want some attributes to count more than others
 associate a weight with each of the attributes in calculating distance, e.g.,
    dist(X, Y) = sqrt( w1·(x1 - y1)^2 + … + wn·(xn - yn)^2 )
 Nominal (categorical) Attributes
 can use simple matching: distance=1 if values match, 0 otherwise
 or convert each nominal attribute to a set of binary attribute, then use the usual
distance measure
 if all attributes are nominal, we can normalize by dividing the number of
matches by the total number of attributes
 Normalization:
 want values to fall between 0 and 1:

    x'i = (xi - min xi) / (max xi - min xi)

 other variations possible
Distance or Similarity Measures
 Example
 max distance for salary: 100,000 - 19,000 = 81,000
 max distance for age: 52 - 27 = 25
 normalize each attribute with x'i = (xi - min xi) / (max xi - min xi); gender is coded F = 1, M = 0

Original data:
ID   Gender   Age   Salary
1    F        27     19,000
2    M        51     64,000
3    M        52    100,000
4    F        33     55,000
5    M        45     45,000

Normalized data:
ID   Gender   Age    Salary
1    1        0.00   0.00
2    0        0.96   0.56
3    0        1.00   1.00
4    1        0.24   0.44
5    0        0.72   0.32

 dist(ID2, ID3) = SQRT( 0 + (0.04)^2 + (0.44)^2 ) = 0.44
 dist(ID2, ID4) = SQRT( 1 + (0.72)^2 + (0.12)^2 ) = 1.24
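A sketch (ours) of the min-max normalization and the two distances above; values are rounded to two decimals, as in the table:

from math import sqrt

ages     = {1: 27, 2: 51, 3: 52, 4: 33, 5: 45}
salaries = {1: 19000, 2: 64000, 3: 100000, 4: 55000, 5: 45000}
genders  = {1: 1, 2: 0, 3: 0, 4: 1, 5: 0}        # F = 1, M = 0 (simple matching)

def minmax(values):
    lo, hi = min(values.values()), max(values.values())
    # round to two decimals, as in the table above
    return {k: round((v - lo) / (hi - lo), 2) for k, v in values.items()}

age_n, sal_n = minmax(ages), minmax(salaries)

def dist(i, j):
    return sqrt((genders[i] - genders[j]) ** 2 +
                (age_n[i] - age_n[j]) ** 2 +
                (sal_n[i] - sal_n[j]) ** 2)

print(round(dist(2, 3), 2), round(dist(2, 4), 2))   # 0.44  1.24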
Domain Specific Distance Functions
 For some data sets, we may need to use specialized functions
 we may want a single or a selected group of attributes to be used in the
computation of distance - same problem as “feature selection”
 may want to use special properties of one or more attribute in the data
Example: Zip Codes
distzip(A, B) = 0, if zip codes are identical
distzip(A, B) = 0.1, if first 3 digits are identical
distzip(A, B) = 0.5, if first digits are identical
distzip(A, B) = 1, if first digits are different
 natural distance functions may exist in the data
Example: Customer Solicitation
distsolicit(A, B) = 0, if both A and B responded
distsolicit(A, B) = 0.1, both A and B were chosen but did not respond
distsolicit(A, B) = 0.5, both A and B were chosen, but only one responded
distsolicit(A, B) = 1, one was chosen, but the other was not
Distance (Similarity) Matrix
 Similarity (Distance) Matrix
 based on the distance or similarity measure we can construct a symmetric
matrix of distance (or similarity values)
 (i, j) entry in the matrix is the distance (similarity) between items i and j
        I1    I2    …    In
  I1     -    d12   …    d1n
  I2    d21    -    …    d2n
  …
  In    dn1   dn2   …     -

dij = similarity (or distance) of Di to Dj

Note that dij = dji, i.e., the matrix is symmetric, so we only need the lower triangle part of the matrix. The diagonal is all 1's (similarity) or all 0's (distance).
Example: Term Similarities in Documents
        T1   T2   T3   T4   T5   T6   T7   T8
Doc1     0    4    0    0    0    2    1    3
Doc2     3    1    4    3    1    2    0    1
Doc3     3    0    0    0    3    0    3    0
Doc4     0    1    0    3    0    0    2    0
Doc5     2    2    2    3    1    4    0    2

    sim(Ti, Tj) = Σ(k=1..N) wik · wjk

Term-Term Similarity Matrix:
       T1   T2   T3   T4   T5   T6   T7
T2      7
T3     16    8
T4     15   12   18
T5     14    3    6    6
T6     14   18   16   18    6
T7      9    6    0    6    9    2
T8      7   17    8    9    3   16    3
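A sketch (ours) that recomputes these term-term similarities from the document-term matrix (rows are Doc1..Doc5, columns T1..T8):

D = [
    [0, 4, 0, 0, 0, 2, 1, 3],   # Doc1
    [3, 1, 4, 3, 1, 2, 0, 1],   # Doc2
    [3, 0, 0, 0, 3, 0, 3, 0],   # Doc3
    [0, 1, 0, 3, 0, 0, 2, 0],   # Doc4
    [2, 2, 2, 3, 1, 4, 0, 2],   # Doc5
]

def sim(i, j):
    """sim(Ti, Tj) = sum over documents of w_ik * w_jk."""
    return sum(row[i] * row[j] for row in D)

n = len(D[0])
for i in range(1, n):                        # lower triangle only (symmetric)
    print("T%d" % (i + 1), [sim(i, j) for j in range(i)])
# T2 [7], T3 [16, 8], T4 [15, 12, 18], ... T8 [7, 17, 8, 9, 3, 16, 3]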
Similarity (Distance) Thresholds
 A similarity (distance) threshold may be used to mark pairs that are
“sufficiently” similar
Term-Term Similarity Matrix:
       T1   T2   T3   T4   T5   T6   T7
T2      7
T3     16    8
T4     15   12   18
T5     14    3    6    6
T6     14   18   16   18    6
T7      9    6    0    6    9    2
T8      7   17    8    9    3   16    3

Using a threshold value of 10 in the previous example:
       T1   T2   T3   T4   T5   T6   T7
T2      0
T3      1    0
T4      1    1    1
T5      1    0    0    0
T6      1    1    1    1    0
T7      0    0    0    0    0    0
T8      0    1    0    0    0    1    0
Graph Representation
 The similarity matrix can be visualized as an undirected graph
 each item is represented by a node, and edges represent the fact that two items
are similar (a one in the similarity threshold matrix)
       T1   T2   T3   T4   T5   T6   T7
T2      0
T3      1    0
T4      1    1    1
T5      1    0    0    0
T6      1    1    1    1    0
T7      0    0    0    0    0    0
T8      0    1    0    0    0    1    0

(graph figure: nodes T1-T8, with an edge between each pair of terms marked 1 above)

If no threshold is used, then the matrix can be represented as a weighted graph.
Simple Clustering Algorithms
 If we are interested only in threshold (and not the degree of similarity
or distance), we can use the graph directly for clustering
 Clique Method (complete link)
 all items within a cluster must be within the similarity threshold of all other
items in that cluster
 clusters may overlap
 generally produces small but very tight clusters
 Single Link Method
 any item in a cluster must be within the similarity threshold of at least one
other item in that cluster
 produces larger but weaker clusters
 Other methods
 star method - start with an item and place all related items in that cluster
 string method - start with an item; place one related item in that cluster; then
place another item related to the last item entered, and so on
Simple Clustering Algorithms
 Clique Method
 a clique is a completely connected subgraph of a graph
 in the clique method, each maximal clique in the graph becomes a cluster
Maximal cliques (and therefore the clusters) in the previous example are:
{T1, T3, T4, T6}
{T2, T4, T6}
{T2, T6, T8}
{T1, T5}
{T7}

Note that, for example, {T1, T3, T4} is also a clique, but is not maximal.
Simple Clustering Algorithms
 Single Link Method
 1. select an item not yet in a cluster and place it in a new cluster
 2. place all other items similar to it in that cluster
 3. repeat step 2 for each item in the cluster until nothing more can be added
 4. repeat steps 1-3 for each item that remains unclustered

In this case the single link method produces only two clusters:
{T1, T3, T4, T5, T6, T2, T8}
{T7}

Note that the single link method does not allow overlapping clusters, thus partitioning the set of items.
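A sketch (ours) of single-link clustering on the thresholded similarity graph: with a fixed threshold, the single-link clusters are simply the connected components of the graph:

EDGES = {  # pairs with similarity >= 10 in the example above
    ("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
    ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
    ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8"),
}
NODES = ["T%d" % i for i in range(1, 9)]

adj = {n: set() for n in NODES}
for a, b in EDGES:
    adj[a].add(b)
    adj[b].add(a)

def single_link_clusters(nodes, adj):
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        cluster, stack = set(), [start]      # grow the cluster transitively
        while stack:
            n = stack.pop()
            if n not in cluster:
                cluster.add(n)
                stack.extend(adj[n] - cluster)
        seen |= cluster
        clusters.append(cluster)
    return clusters

print(single_link_clusters(NODES, adj))
# -> [{T1, T2, T3, T4, T5, T6, T8}, {T7}]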
Clustering with Existing Clusters
 The notion of comparing item similarities can be extended to clusters
themselves, by focusing on a representative vector for each cluster
 cluster representatives can be actual items in the cluster or other “virtual”
representatives such as the centroid
 this methodology reduces the number of similarity computations in clustering
 clusters are revised successively until a stopping condition is satisfied, or until no
more changes to clusters can be made
 Partitioning Methods
 reallocation method - start with an initial assignment of items to clusters and then
move items from cluster to cluster to obtain an improved partitioning
 Single pass method - simple and efficient, but produces large clusters, and depends
on order in which items are processed
 Hierarchical Agglomerative Methods
 starts with individual items and combines into clusters
 then successively combine smaller clusters to form larger ones
 grouping of individual items can be based on any of the methods discussed earlier
K-Means Algorithm
 The basic algorithm (based on the reallocation method):
1. select K data points as the initial cluster representatives (centroids)
2. for i = 1 to N, assign item xi to the most similar centroid (this gives K clusters)
3. for j = 1 to K, recalculate the cluster centroid Cj
4. repeat steps 2 and 3 until there is (little or) no change in clusters
 Example: Clustering Terms

Initial (arbitrary) assignment: C1 = {T1,T2}, C2 = {T3,T4}, C3 = {T5,T6}

        T1   T2   T3   T4   T5   T6   T7   T8
Doc1     0    4    0    0    0    2    1    3
Doc2     3    1    4    3    1    2    0    1
Doc3     3    0    0    0    3    0    3    0
Doc4     0    1    0    3    0    0    2    0
Doc5     2    2    2    3    1    4    0    2

Cluster Centroids (over Doc1..Doc5):
C1 = (4/2, 4/2, 3/2, 1/2, 4/2)
C2 = (0/2, 7/2, 0/2, 3/2, 5/2)
C3 = (2/2, 3/2, 3/2, 0/2, 5/2)
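A sketch (ours) of the reallocation loop on this example: terms are the items, each term is its column of document weights, and similarity to a centroid is the dot product:

TERMS = {   # term -> weights over Doc1..Doc5 (from the table above)
    "T1": [0, 3, 3, 0, 2], "T2": [4, 1, 0, 1, 2], "T3": [0, 4, 0, 0, 2],
    "T4": [0, 3, 0, 3, 3], "T5": [0, 1, 3, 0, 1], "T6": [2, 2, 0, 0, 4],
    "T7": [1, 0, 3, 2, 0], "T8": [3, 1, 0, 0, 2],
}

def centroid(members):
    vecs = [TERMS[t] for t in members]
    if not vecs:                                     # guard: empty cluster
        return [0.0] * 5
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

clusters = [{"T1", "T2"}, {"T3", "T4"}, {"T5", "T6"}]   # initial assignment
for _ in range(20):                                     # reallocation loop
    centroids = [centroid(c) for c in clusters]
    new = [set() for _ in clusters]
    for t, vec in TERMS.items():                        # reassign each term
        best = max(range(len(centroids)), key=lambda i: dot(vec, centroids[i]))
        new[best].add(t)
    if new == clusters:                                 # no change -> done
        break
    clusters = new

print(clusters)
# the first pass reproduces the assignments in the table that follows
# ({T2,T7,T8}, {T1,T3,T4,T6}, {T5}); a few more passes and it converges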
Example: K-Means
 Example (continued)
Now, using the simple (dot product) similarity measure, compute the new cluster-term similarity matrix:

            T1      T2      T3      T4      T5      T6      T7      T8
Class1     29/2    29/2    24/2    27/2    17/2    32/2    15/2    24/2
Class2     31/2    20/2    38/2    45/2    12/2    34/2     6/2    17/2
Class3     28/2    21/2    22/2    24/2    17/2    30/2    11/2    19/2
Assign to  Class2  Class1  Class2  Class2  Class3  Class2  Class1  Class1

Now compute new cluster centroids using the original document-term matrix:

        T1   T2   T3   T4   T5   T6   T7   T8
Doc1     0    4    0    0    0    2    1    3
Doc2     3    1    4    3    1    2    0    1
Doc3     3    0    0    0    3    0    3    0
Doc4     0    1    0    3    0    0    2    0
Doc5     2    2    2    3    1    4    0    2

C1 = (8/3, 2/3, 3/3, 3/3, 4/3)    (centroid of {T2, T7, T8})
C2 = (2/4, 12/4, 3/4, 3/4, 11/4)  (centroid of {T1, T3, T4, T6})
C3 = (0/1, 1/1, 3/1, 0/1, 1/1)    (centroid of {T5})

The process is repeated until no further changes are made to the clusters
K-Means Algorithm
 Strength of the k-means:
 Relatively efficient: O(tkn), where n is # of objects, k is # of
clusters, and t is # of iterations. Normally, k, t << n
 Often terminates at a local optimum
 Weakness of the k-means:
 Applicable only when mean is defined; what about categorical
data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Variations of K-Means usually differ in:
 Selection of the initial k means
 Dissimilarity calculations
 Strategies to calculate cluster means
Hierarchical Algorithms
 Use distance matrix as clustering criteria
 does not require the number of clusters k as an input, but needs
a termination condition
Hierarchical Agglomerative Clustering
 HAC starts with unclustered data and performs successive pairwise
joins among items (or previous clusters) to form larger ones
 this results in a hierarchy of clusters which can be viewed as a dendrogram
 useful in pruning search in a clustered item set, or in browsing clustering results
 Some commonly used HACM methods
 Single Link: at each step join most similar pairs of objects that are not yet
in the same cluster
 Complete Link: use least similar pair between each cluster pair to
determine inter-cluster similarity - all items within one cluster are linked
to each other within a similarity threshold
 Ward’s method: at each step join cluster pair whose merger minimizes the
increase in total within-group error sum of squares (based on distance
between centroids) - also called the minimum variance method
 Group Average (Mean): use average value of pairwise links within a
cluster to determine inter-cluster similarity (i.e., all objects contribute to
inter-cluster similarity)
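A sketch (ours) of these HACM variants using SciPy (the library choice and the toy data are our assumptions, not the slides'); it builds the merge history for each linkage method on the five normalized employee records from earlier:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# five [age, salary] records, min-max normalized as in the earlier example
X = np.array([[0.00, 0.00],
              [0.96, 0.56],
              [1.00, 1.00],
              [0.24, 0.44],
              [0.72, 0.32]])

for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                      # merge history (dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut it into 2 clusters
    print(method, labels)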
Hierarchical Agglomerative Clustering
Dendrogram for a hierarchy of clusters (figure: items A through I joined pairwise into successively larger clusters)