Transcript: Lecture 4

Evaluating Performance
for Data Mining Techniques
Evaluating Numeric Output
• Mean absolute error (MAE)
• Mean square error (MSE)
• Root mean square error (RMSE)
Mean Absolute Error (MAE)
The average absolute difference between
classifier predicted output and actual output.
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\,\mathrm{Desired}_i - \mathrm{Actual}_i\,\right|$$
Mean Square Error (MSE)
The average of the squared differences between
classifier predicted output and actual output.

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\mathrm{Desired}_i - \mathrm{Actual}_i\right)^2$$
Root Mean Square Error (RMSE)
The square root of the mean square error.
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\mathrm{Desired}_i - \mathrm{Actual}_i\right)^2}$$
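A minimal Python sketch of the three measures above; the function names and the sample desired/actual values are illustrative, not from the lecture:

```python
import math

def mae(desired, actual):
    # Mean absolute error: average of |Desired_i - Actual_i|.
    return sum(abs(d - a) for d, a in zip(desired, actual)) / len(desired)

def mse(desired, actual):
    # Mean square error: average of (Desired_i - Actual_i)^2.
    return sum((d - a) ** 2 for d, a in zip(desired, actual)) / len(desired)

def rmse(desired, actual):
    # Root mean square error: square root of the MSE.
    return math.sqrt(mse(desired, actual))

# Hypothetical outputs, for illustration only:
print(mae([1.0, 2.0, 3.0], [1.5, 1.5, 2.0]))   # 0.666...
print(rmse([1.0, 2.0, 3.0], [1.5, 1.5, 2.0]))  # ~0.707
```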
Clustering Techniques
Clustering Techniques
• Clustering Techniques apply some measure
of similarity to divide instances of the data
to be analyzed into disjoint partitions
• The partitions are generalized by computing
a group mean for each cluster or by listing a
subset of the most typical instances from each
cluster
Clustering Techniques
• 1st approach: unsupervised clustering
• 2nd approach: partitioning the data in a
hierarchical fashion, where each level of the
hierarchy is a generalization of the data at
some level of abstraction.
The K-Means Algorithm
• The K-means algorithm is a simple (but
widely used) statistical clustering technique,
which is used for unsupervised clustering
• The K-means algorithm divides the instances
of the data to be analyzed into K disjoint
partitions (clusters).
• Proposed by S.P. Lloyd in 1957, first
published in 1982.
The K-Means Algorithm
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest
cluster center (for example, using Euclidean distance
as a criterion).
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3 and 4 until the cluster centers do not
change.
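A minimal Python sketch of steps 1–5, assuming the instances are rows of a NumPy array; the function name and parameters are illustrative, and the sketch assumes no cluster ever becomes empty:

```python
import numpy as np

def k_means(points, k, max_iter=100, rng=None):
    """A plain K-means sketch: returns (cluster centers, assignments)."""
    rng = np.random.default_rng(rng)
    # Step 2: randomly choose K instances as the initial cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every instance to its closest center,
        # using Euclidean distance as the criterion.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # Step 4: the new center of each cluster is the mean of its
        # members (assumes no cluster ends up empty).
        new_centers = np.array([points[assignments == j].mean(axis=0)
                                for j in range(k)])
        # Step 5: stop once an iteration shows no change in the centers.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assignments
```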
The K-Means Algorithm: Analysis
• Choose a value for K, the total number of
clusters – this step requires an initial
decision about how many clusters can be
distinguished within the data set
The K-Means Algorithm: Analysis
• Randomly choose K points as cluster
centers – the initial cluster centers are
selected randomly, but this choice is not
essential if K was chosen properly; in that
case the resulting clustering should not
depend on the selection of the initial
cluster centers
The K-Means Algorithm: Analysis
• Calculate a new cluster center for each
cluster – the new cluster centers are the
means of the cluster members that were
assigned to their clusters in the previous step
The K-Means Algorithm: Analysis
• Repeat steps 3 and 4 until the cluster centers
do not change – the process of instance
classification and cluster center computation
continues until an iteration of the algorithm
shows no change in the cluster centers.
• The algorithm terminates after j iterations if
for each cluster Ci all instances found in Ci
after iteration j-1 remain in cluster Ci upon
the completion of iteration j
Euclidean Distance

The Euclidean distance between two n-dimensional vectors

$$X = (x_1, \ldots, x_n), \quad Y = (y_1, \ldots, y_n)$$

is determined as

$$D(X, Y) = \sqrt{(x_1 - y_1)^2 + \cdots + (x_n - y_n)^2}$$
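A direct translation of the formula into Python (the function name is illustrative); the sample call uses instances 1 and 2 from Table 3.6 below:

```python
import math

def euclidean_distance(x, y):
    # D(X, Y) = sqrt((x1 - y1)^2 + ... + (xn - yn)^2)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance((1.0, 1.5), (1.0, 4.5)))  # 3.0
```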
Cluster Quality
• How can we evaluate the quality and
reliability of the clusters?
• One evaluation method, which is more
suitable for clusters of about equal size,
is to calculate the sum of squared
differences between the instances of each
cluster and their cluster center. Smaller
values indicate clusters of higher quality.
Cluster Quality
• Another evaluation method is to calculate
the mean squared difference between the
instances of each cluster and their cluster
center. Smaller values indicate clusters of
higher quality.
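Both measures can be sketched in a few lines of Python, reusing the output of the k_means sketch above (the function names are illustrative):

```python
def cluster_sse(points, centers, assignments):
    # Sum of squared differences between each instance
    # and the center of its cluster.
    diffs = points - centers[assignments]
    return float((diffs ** 2).sum())

def cluster_mse(points, centers, assignments):
    # The same squared differences, averaged over all instances.
    return cluster_sse(points, centers, assignments) / len(points)
```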
Optimal Clustering Criterion
Clustering is considered optimal when the average (taken over
all clusters) mean square deviation of the cluster members from
their center is either:

minimal over several (s) experiments

$$\min_{s\ \mathrm{experiments}}\left\{\frac{1}{K}\sum_{j=1}^{K}\frac{1}{N_j}\sum_{i=1}^{N_j} D^2\!\left(\mathrm{Center}_j^{\,s},\, X_{ij}^{\,s}\right)\right\}$$
or less than some predetermined acceptable value

$$\frac{1}{K}\sum_{j=1}^{K}\frac{1}{N_j}\sum_{i=1}^{N_j} D^2\!\left(\mathrm{Center}_j,\, X_{ij}\right)$$
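A sketch of this criterion in Python, reusing the k_means sketch above: criterion computes the average over clusters of the mean squared distance of members from their center, and best_of_s keeps the minimum over s random-restart experiments (both names are illustrative):

```python
def criterion(points, centers, assignments, k):
    # (1/K) * sum over clusters j of (1/N_j) * sum over members i
    # of D^2(Center_j, X_ij).
    total = 0.0
    for j in range(k):
        members = points[assignments == j]
        total += ((members - centers[j]) ** 2).sum() / len(members)
    return total / k

def best_of_s(points, k, s=10):
    # Minimal criterion value over s experiments with random initial centers.
    return min(criterion(points, *k_means(points, k, rng=seed), k)
               for seed in range(s))
```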
An Example Using
the K-Means Algorithm
Table 3.6 • K-Means Input Values

Instance    X      Y
1           1.0    1.5
2           1.0    4.5
3           2.0    1.5
4           2.0    3.5
5           3.0    2.5
6           5.0    6.0
[Figure: plot of the Table 3.6 instances, x on the horizontal axis (0–6) and f(x) on the vertical axis (0–7)]
Table 3.7 • Several Applications of the K-Means Algorithm (K = 2)

Outcome    Cluster Centers    Cluster Points    Square Error
1          (2.67, 4.67)       2, 4, 6           14.50
           (2.00, 1.83)       1, 3, 5
2          (1.5, 1.5)         1, 3              15.94
           (2.75, 4.125)      2, 4, 5, 6
3          (1.8, 2.7)         1, 2, 3, 4, 5      9.60
           (5, 6)             6
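Outcomes like those in Table 3.7 can be explored by running the earlier k_means and cluster_sse sketches on the Table 3.6 data; which outcome appears depends on the random initial centers, so the printed centers and errors are not guaranteed to match the table row for row:

```python
import numpy as np

points = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
                   [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])

for seed in range(3):  # three random restarts, as in Table 3.7
    centers, assignments = k_means(points, k=2, rng=seed)
    sse = cluster_sse(points, centers, assignments)
    print(f"run {seed}: centers={centers.round(2).tolist()}, SSE={sse:.2f}")
```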
[Figure: the same instances plotted again, x on the horizontal axis (0–6) and f(x) on the vertical axis (0–7)]
Unsupervised Model Evaluation
The K-Means Algorithm:
General Considerations
• Requires real-valued data.
• We must select the number of clusters present in the
data.
• Works best when the clusters that exist in the data are
of approximately equal size. If an optimal solution is
represented by clusters of unequal size, the K-Means
algorithm is not likely to find it.
• Attribute significance cannot be determined.
• A supervised data mining tool must be used to gain
insight into the nature of the clusters formed by a
clustering tool.
Supervised Learning for
Unsupervised Model Evaluation
• Designate each formed cluster as a class and
assign each class an arbitrary name.
• Choose a random sample of instances from
each class for supervised learning.
• Build a supervised model from the chosen
instances. Employ the remaining instances
to test the correctness of the model.
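A minimal sketch of this evaluation scheme, assuming scikit-learn is available (the lecture does not prescribe a library, and the data here is synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic unlabeled data: two blobs of 100 two-attribute instances.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# Designate each formed cluster as a class with an arbitrary (numeric) name.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Choose a random sample of instances from each class for supervised learning.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0)

# Build a supervised model and test it on the remaining instances;
# high test accuracy suggests the clusters are well separated.
model = DecisionTreeClassifier().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```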