DATA MINING AND CLUSTERING


DATA MINING WITH
CLUSTERING AND
CLASSIFICATION
Spring 2007, SJSU
Benjamin Lam
Overview
• Definition of clustering
• Existing clustering methods
• Clustering examples
• Classification
• Classification examples
• Conclusion
Definition
• Clustering can be considered the most important unsupervised learning technique; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data.
• Clustering is “the process of organizing objects into groups whose members are similar in some way”.
• A cluster is therefore a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
Mu-Yu Lu, SJSU
Why clustering?
A few good reasons ...
• Simplification
• Pattern detection
• Useful in data concept construction
• Unsupervised learning process
Where to use clustering?
• Data mining
• Information retrieval
• Text mining
• Web analysis
• Marketing
• Medical diagnostics
Which method should I use?
• Type of attributes in the data
• Scalability to large datasets
• Ability to work with irregular data
• Time cost
• Complexity
• Data order dependency
• Result presentation
Major Existing clustering methods
• Distance-based
• Hierarchical
• Partitioning
• Probabilistic
Measuring Similarity
• Dissimilarity/similarity metric: similarity is expressed in terms of a distance function d(i, j), which is typically a metric.
• A separate “quality” function measures the “goodness” of a cluster.
• The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, and ratio variables.
• Weights should be associated with different variables based on the application and data semantics.
• It is hard to define “similar enough” or “good enough”; the answer is typically highly subjective.
Professor Lee, Sin-Min
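
A small illustration of the points above: the following Python sketch computes a weighted Euclidean distance d(i, j) with per-variable weights, as discussed. The function name and the choice of Euclidean distance are illustrative assumptions, not part of the original slides.

    import math

    def weighted_euclidean(x, y, weights=None):
        # Weighted Euclidean distance between two numeric vectors.
        # With no weights, every attribute contributes equally.
        if weights is None:
            weights = [1.0] * len(x)
        return math.sqrt(sum(w * (a - b) ** 2
                             for w, a, b in zip(weights, x, y)))

    print(weighted_euclidean([1.0, 2.0], [4.0, 6.0]))              # 5.0
    print(weighted_euclidean([1.0, 2.0], [4.0, 6.0], [1.0, 2.0]))  # ~6.4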
Distance based method
• In examples like the scatter plot originally shown on this slide, we can easily identify the four clusters into which the data divide; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance. This is called distance-based clustering.
Hierarchical clustering
Agglomerative (bottom up)
1. Start with each point as its own cluster (a singleton).
2. Recursively merge the most appropriate pairs of clusters.
3. Stop when k clusters remain.
Divisive (top down)
1. Start with one big cluster containing all the points.
2. Recursively divide it into smaller clusters.
3. Stop when k clusters remain.
General steps of hierarchical clustering
Given a set of N items to be clustered and an N×N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:
• Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
• Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer.
• Compute the distances (similarities) between the new cluster and each of the old clusters.
• Repeat steps 2 and 3 until all items are clustered into K clusters.
Mu-Yu Lu, SJSU
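
Johnson’s procedure maps directly onto standard library routines. Below is a minimal sketch assuming SciPy is available; the toy data, the single-linkage choice, and K = 2 are illustrative assumptions.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy data: two well-separated groups in the plane (illustrative only).
    X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

    # Agglomerative clustering: linkage() repeatedly merges the closest
    # pair of clusters, as in steps 2 and 3 above.
    Z = linkage(X, method='single')

    # Cut the resulting tree so that K = 2 clusters remain.
    labels = fcluster(Z, t=2, criterion='maxclust')
    print(labels)   # e.g. [1 1 1 2 2 2]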
Exclusive vs. non-exclusive clustering
• In the first case, data are grouped in an exclusive way, so that if a certain datum belongs to one cluster it cannot be included in another. A simple example is a set of points on a two-dimensional plane separated into two clusters by a straight line.
• The second type, overlapping clustering, instead uses fuzzy sets to cluster the data, so that each point may belong to two or more clusters with different degrees of membership.
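
To make the contrast concrete, here is a small sketch of a hard (exclusive) assignment next to a soft (overlapping) membership computed as normalized inverse distances; the two fixed cluster centers and the weighting scheme are assumptions for illustration.

    import numpy as np

    centers = np.array([[0.0, 0.0], [4.0, 0.0]])   # two assumed cluster centers
    point = np.array([1.0, 0.0])

    d = np.linalg.norm(centers - point, axis=1)    # distances: [1.0, 3.0]

    # Exclusive clustering: the point belongs to exactly one cluster.
    hard = int(np.argmin(d))                       # cluster 0

    # Overlapping (fuzzy) clustering: degrees of membership summing to 1.
    soft = (1.0 / d) / (1.0 / d).sum()             # [0.75, 0.25]
    print(hard, soft)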
Partitioning clustering
1. Divide the data into an initial set of subsets (clusters).
2. Recursively visit each subset and relocate points between clusters (as opposed to the visit-once approach of hierarchical clustering); see the k-means sketch below.
This iterative relocation yields higher-quality clusters.
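
K-means is the classic instance of this iterative relocation. Here is a minimal sketch in Python with NumPy; the toy data, K = 2, and the convergence test are illustrative assumptions.

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        # Alternate between assigning points and relocating centers.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            # Assignment: each point joins its nearest center.
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Relocation: move each center to the mean of its points.
            new_centers = np.array([X[labels == j].mean(axis=0)
                                    for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers

    X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
    labels, centers = kmeans(X, k=2)
    print(labels)   # e.g. [0 0 1 1]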
Probabilistic clustering
1. Assume the data are drawn from a mixture of probability distributions.
2. Use the mean and variance of each distribution as the parameters of a cluster.
3. Each point receives a single most-likely cluster membership (see the sketch below).
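
A Gaussian mixture fitted by expectation-maximization is the standard realization of these steps. Below is a minimal sketch assuming scikit-learn is available; the sample data and n_components=2 are illustrative.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.array([[0.0], [0.2], [0.1], [5.0], [5.2], [5.1]])

    # Each fitted component's mean and variance are the parameters of one cluster.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

    print(gmm.means_)        # per-cluster means
    print(gmm.covariances_)  # per-cluster variances
    print(gmm.predict(X))    # single most-likely membership per point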
Classification Examples
• Teachers classify students’ grades as A, B, C, D, or F.
• Identify mushrooms as poisonous or edible.
• Predict when a river will flood.
• Identify individuals who are credit risks.
• Speech recognition
• Pattern recognition
Classification
Goal: Provide an overview of the classification problem and introduce some of the basic algorithms.
• Classification Problem Overview
• Classification Techniques
– Regression
– Distance
– Decision Trees
– Rules
– Neural Networks
Classification Ex: Grading
• If x >= 90 then grade = A.
• If 80 <= x < 90 then grade = B.
• If 70 <= x < 80 then grade = C.
• If 60 <= x < 70 then grade = D.
• If x < 60 then grade = F.
[Decision tree: split on x >= 90 → A; else x >= 80 → B; else x >= 70 → C; else x >= 60 → D; else F.]
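
These threshold rules translate directly into code. A minimal sketch follows (the function name grade is an illustrative choice):

    def grade(x):
        # Map a numeric score to a letter grade using the rules above.
        if x >= 90:
            return 'A'
        elif x >= 80:
            return 'B'
        elif x >= 70:
            return 'C'
        elif x >= 60:
            return 'D'
        else:
            return 'F'

    print([grade(s) for s in (95, 85, 75, 65, 55)])   # ['A', 'B', 'C', 'D', 'F']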
Classification Techniques
• Approach (see the sketch below):
1. Create a specific model by evaluating training data (or using domain experts’ knowledge).
2. Apply the developed model to new data.
• Classes must be predefined.
• The most common techniques use decision trees (DTs), neural networks (NNs), or are based on distances or statistical methods.
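
The two-step approach (build a model from labeled training data, then apply it to new data) looks like this with a decision tree. This is a minimal sketch assuming scikit-learn; the tiny training set is an illustrative assumption.

    from sklearn.tree import DecisionTreeClassifier

    # Step 1: create a model by evaluating labeled training data.
    # Features: exam scores; labels: pass/fail (illustrative data).
    X_train = [[95], [85], [75], [65], [55]]
    y_train = ['pass', 'pass', 'pass', 'fail', 'fail']
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    # Step 2: apply the developed model to new data.
    print(model.predict([[80], [50]]))   # e.g. ['pass' 'fail']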
Defining Classes
• Distance based
• Partitioning based