COMP 790-090 Data Mining: Concepts, Algorithms, and Applications 2

Download Report

Transcript COMP 790-090 Data Mining: Concepts, Algorithms, and Applications 2

Clustering
COMP 790-90 Research Seminar
BCB 713 Module
Spring 2011
Wei Wang
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Drawback of Distance-based
Methods
Hard to find clusters with irregular shapes
Hard to specify the number of clusters
Heuristic: a cluster must be dense
2
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Directly Density Reachable
Parameters
MinPts = 3
Eps = 1 cm
p
q
Eps: Maximum radius of the neighborhood
MinPts: Minimum number of points in an Epsneighborhood of that point
NEps(p): {q | dist(p,q) Eps}
Core object p: |Neps(p)|MinPts
Point q directly density-reachable from p iff
q Neps(p) and p is a core object
3
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Density-Based Clustering:
Background (II)
p1
q
Density-reachable
p
Directly density reachable p1p2, p2p3, …,
pn-1 pn  pn density-reachable from p1
Density-connected
Points p, q are density-reachable from o  p
and q are density-connected
p
q
o
4
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
DBSCAN
A cluster: a maximal set of densityconnected points
Discover clusters of arbitrary shape in spatial
databases with noise
Outlier
Border
Eps = 1cm
Core
5
MinPts = 5
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
DBSCAN: the Algorithm
Arbitrary select a point p
Retrieve all points density-reachable from p wrt
Eps and MinPts
If p is a core point, a cluster is formed
If p is a border point, no points are densityreachable from p and DBSCAN visits the next
point of the database
Continue the process until all of the points have
been processed
6
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Problems of DBSCAN
Different clusters may have very different
densities
Clusters may be in hierarchies
7
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
OPTICS: A Cluster-ordering
Method
OPTICS: ordering points to identify the
clustering structure
“Group” points by density connectivity
Hierarchies of clusters
Visualize clusters and the hierarchy
8
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
DENCLUE: Using Density
Functions
DENsity-based CLUstEring
Major features
Solid mathematical foundation
Good for data sets with large amounts of noise
Allow a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
Significantly faster than existing algorithms (faster than
DBSCAN by a factor of up to 45)
But need a large number of parameters
9
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Grid-based Clustering
Methods
Ideas
Using multi-resolution grid data structures
Use dense grid cells to form clusters
Several interesting methods
STING
WaveCluster
CLIQUE
10
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
STING: A Statistical
Information Grid Approach
The spatial area area is divided into rectangular cells
There are several levels of cells corresponding to
different levels of resolution
11
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
STING: A Statistical
Information Grid Approach (2)
Each cell at a high level is partitioned into a number of smaller
cells in the next lower level
Statistical information of each cell is calculated and stored
beforehand and is used to answer queries
Parameters of higher level cells can be easily calculated from
parameters of lower level cell
count, mean, s, min, max
type of distribution—normal, uniform, etc.
Use a top-down approach to answer spatial data queries
Start from a pre-selected layer—typically with a small number of
cells
For each cell in the current level compute the confidence interval
12
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
STING: A Statistical
Information Grid Approach (3)
Remove the irrelevant cells from further consideration
When finish examining the current layer, proceed to the
next lower level
Repeat this process until the bottom layer is reached
13
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
STING: A Statistical
Information Grid Approach (4)
Advantages:
Query-independent, easy to parallelize, incremental
update
O(K), where K is the number of grid cells at the
lowest level
Disadvantages:
All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected
14
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
WaveCluster
A multi-resolution clustering approach which
applies wavelet transform to the feature space
A wavelet transform is a signal processing technique
that decomposes a signal into different frequency subband.
Both grid-based and density-based
Input parameters:
# of grid cells for each dimension
the wavelet, and the # of applications of wavelet
transform.
15
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
WaveCluster
How to apply wavelet transform to find
clusters
Summaries the data by imposing a
multidimensional grid structure onto data space
These multidimensional spatial data objects are
represented in an n-dimensional feature space
Apply wavelet transform on feature space to find
the dense regions in the feature space
Apply wavelet transform multiple times which
result in clusters at different scales from fine to
coarse
16
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Wavelet Transform
Decomposes a signal into different
frequency subbands. (can be applied to ndimensional signals)
Data are transformed to preserve relative
distance between objects at different levels
of resolution.
Allows natural clusters to become more
distinguishable
17
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
What Is Wavelet (2)?
18
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Quantization
19
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Transformation
20
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
WaveCluster
Why is wavelet transformation useful for
clustering
Unsupervised clustering
It uses hat-shape filters to emphasize region
where points cluster, but simultaneously to
suppress weaker information in their boundary
21
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
WaveCluster
Effective removal of outliers
22
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
WaveCluster
Multi-resolution
Cost efficiency
23
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
WaveCluster
24
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
WaveCluster
Major features:
Complexity O(N)
Detect arbitrary shaped clusters at different
scales
Not sensitive to noise, not sensitive to input
order
Only applicable to low dimensional data
25
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
CLIQUE (Clustering In QUEst)
Automatically identifying subspaces of a high dimensional data
space that allow better clustering than original space
CLIQUE can be considered as both density-based and grid-based
It partitions each dimension into the same number of equal length interval
It partitions an m-dimensional data space into non-overlapping rectangular
units
A unit is dense if the fraction of total data points contained in the unit
exceeds the input model parameter
A cluster is a maximal set of connected dense units within a subspace
26
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
CLIQUE: The Major Steps
Partition the data space and find the number of points
that lie inside each cell of the partition.
Identify the subspaces that contain clusters using the
Apriori principle
Identify clusters:
Determine dense units in all subspaces of interests
Determine connected dense units in all subspaces of interests.
Generate minimal description for the clusters
Determine maximal regions that cover a cluster of connected
dense units for each cluster
Determination of minimal cover for each cluster
27
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
20
28
30
40
50
age
60
Vacation
(week)
0 1 2 3 4 5 6 7
Salary
(10,000)
0 1 2 3 4 5 6 7
CLIQUE
20
30
40
50
age
60
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
=3
Vacation
CLIQUE
30
50
age
29
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Strength and Weakness of
CLIQUE
Strength
It automatically finds subspaces of the highest
dimensionality such that high density clusters exist in
those subspaces
It is insensitive to the order of records in input and
does not presume some canonical data distribution
It scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
Weakness
The accuracy of the clustering result may be degraded
at the expense of simplicity of the method
30
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Constrained Clustering
Constraints exist in
data space or in user
queries
Example: ATM
allocation with bridges
and highways
People can cross a
highway by a bridge
31
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Clustering With Obstacle
Objects
Not Taking obstacles into account Taking obstacles into account
32
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Outlier Analysis
“One person’s noise is another person’s
signal”
Outliers: the objects considerably dissimilar
from the remainder of the data
Examples: credit card fraud, Michael Jordon,
etc
Applications: credit card fraud detection,
telecom fraud detection, customer
segmentation, medical analysis, etc
33
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Statistical Outlier Analysis
Discordancy/outlier tests
100+ tests proposed
Data distribution
Distribution parameters
The number of outliers
The types of expected outliers
Example: upper or lower outliers in an ordered
sample
34
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Drawbacks of Statistical
Approaches
Most tests are univariate
Unsuitable for multidimensional datasets
All are distribution-based
Unknown distributions in many applications
35
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Depth-based Methods
Organize data objects in layers with various
depths
The shallow layers are more likely to contain
outliers
Example: Peeling, Depth contours
Complexity O(Nk/2) for k-d datasets
Unacceptable for k>2
36
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Distance-based Outliers
A DB(p, D)-outlier is an object O in a
dataset T s.t. at least fraction p of the
objects in T lies at a distance greater than
distance D from O
Algorithms for mining distance-based
outliers
The index-based algorithm, the nested-loop
algorithm, the cell-based algorithm
37
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Index-based Algorithms
Find DB(p, D) outliers in T with n objects
Find an objects having at most n(1-p)
neighbors with radius D
Algorithm
Build a standard multidimensional index
Search every object O with radius D
If there are at least n(1-p) neighbors, O is not an
outlier
Else, output O
38
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Pros and Cons of Index-based
Algorithms
Complexity of search O(kN2)
More scalable with dimensionality than depthbased approaches
Building a right index is very costly
Index building cost renders the index-based
algorithms non-competitive
39
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
A Naïve Nested-loop
Algorithm
For j=1 to n do
Set countj=0;
For k=1 to n do if (dist(j,k)<D) then countj++;
If countj <= n(1-p) then output j as an outlier;
No explicit index construction
O(N2)
Many database scans
40
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Optimizations of Nested-loop
Algorithm
Once an object has at least n(1-p)
neighbors with radius D, no need to count
further
Use the data in main memory as much as
possible
Reduce the number of database scans
41
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
References (1)
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering
of high dimensional data for data mining applications. SIGMOD'98
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to
identify the clustering structure, SIGMOD’99.
P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World
Scientific, 1996
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering
clusters in large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases:
Focusing techniques for efficient class identification. SSD'95.
D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine
Learning, 2:139-172, 1987.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach
based on dynamic systems. In Proc. VLDB’98.
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large
databases. SIGMOD'98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Printice Hall, 1988.
42
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
References (2)
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster
Analysis. John Wiley & Sons, 1990.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets.
VLDB’98.
G. J. McLachlan and K.E. Bkasford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
P. Michaud. Clustering techniques. Future Generation Computer systems, 13, 1997.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large
data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
W. Wang, J. Yang, R. Muntz, STING: A Statistical Information Grid Approach to
Spatial Data Mining, VLDB’97.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method
for very large databases. SIGMOD'96.
43
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications