DB Seminar Series:
HARP: A Hierarchical Algorithm with Automatic
Relevant Attribute Selection for Projected Clustering
Presented by:
Kevin Yip
20 September 2002
Short Summary
Our own work (unpublished), supervised by Dr.
Cheung and Dr. Ng
Problem: to cluster datasets of very high
dimensionality
Assumption: clusters are formed in subspaces
1
Short Summary
Previous approaches: either have special
restrictions on the dataset or target clusters, or
cannot determine the dimensionality of the
clusters automatically
Our approach: not restricted by these limitations
2
Presentation Outline
Clustering
Projected clustering
Previous approaches to projected clustering
Our approach: HARP
– Concepts
– Implementation: HARP.1
– Experiments
Future work and conclusions
3
Clustering
Goal: given a dataset D with N records and d
attributes (dimensions), partition the records into
k disjoint clusters such that
– Intra-cluster similarity is maximized
– Inter-cluster similarity is minimized
4
Clustering
How to measure similarity?
– Distance-based: Manhattan distance, Euclidean
distance, etc.
– Correlation-based: cosine correlation, Pearson
correlation, etc.
– Link-based (common neighbors)
– Pattern-based
5
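A minimal Python sketch of the distance- and correlation-based measures listed above (not from the talk; the link- and pattern-based measures need neighborhood and pattern structures and are omitted):

```python
import math

def manhattan(x, y):
    # L1 distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # L2 distance: square root of the sum of squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    # Cosine correlation: cosine of the angle between the two vectors
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson(x, y):
    # Pearson correlation: cosine correlation after mean-centering each vector
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])
```

Note that distances are dissimilarities (smaller = more similar) while correlations are similarities (larger = more similar).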
Clustering
Two common types of clustering algorithms:
– Partitional: selects some representative points for
each cluster, assigns all other points to their closest
clusters, and then re-determines the new
representative points
– Hierarchical (agglomerative): repeatedly determines
the two most similar clusters, and merges them
6
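The agglomerative loop described above can be sketched as follows (a naive O(N^3) version for illustration; `similarity` can be any of the measures on the previous slide, lifted to compare clusters):

```python
def agglomerate(points, similarity, k):
    """Merge singleton clusters until only k clusters remain.
    `similarity(c1, c2)` compares two clusters (lists of points);
    higher means more similar."""
    clusters = [[p] for p in points]  # start: one cluster per record
    while len(clusters) > k:
        # determine the two most similar clusters
        i, j = max(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: similarity(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge them
        del clusters[j]
    return clusters
```

For example, with 1-D points and similarity defined as the negative distance between cluster means, `agglomerate([1, 2, 10, 11], sim, 2)` groups {1, 2} and {10, 11}.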
Clustering
Partitional clustering:
[Figure: partitional clustering cycle on a dataset: select representatives, assign all other points, replace the representatives, repeat]
7
Clustering
Hierarchical clustering:
[Figure: hierarchical clustering on a dataset: similarity calculation, best merge determination, merging, repeat]
8
Projected Clustering
Assumption (general case): each
cluster is formed in a subspace
Assumption (special case): each
cluster has a set of relevant
attributes
Goal: determine the records and
relevant attributes of each cluster
(i.e., to “select” the “relevant attributes”;
but how should “relevance” be defined?)
Source of figures:
ORCLUS (SIGMOD 2000)
9
Projected Clustering
A 3-D view:
Source of figure:
DOC (SIGMOD 2002)
10
Projected Clustering
An example dataset:
Person | Age | Virus Level | Blood Type | Disease A
1      | 35  |  0.6        | AB         | Uninfected
2      | 64  |  0.9        | AB         | Uninfected
3      | 27  |  1.1        | AB         | Uninfected
4      | 18  |  9.8        | O          | Infected
5      | 42  |  8.6        | AB         | Infected
6      | 53  | 11.3        | B          | Infected
7      | 37  |  0.7        | O          | Recovered
8      | 28  |  0.4        | A          | Recovered
9      | 65  |  0.9        | B          | Recovered
11
Projected Clustering
Projected clustering vs. feature selection:
– Feature selection selects a feature set for all
records, but projected clustering selects attribute
sets individually for each cluster
– Feature selection is a preprocessing task, but
projected clustering selects attributes during the
clustering process
12
Projected Clustering
Why is projected clustering important?
– At high dimensionality, the data points are sparse,
and the distance between any two points is almost
the same
– There are many noise attributes that we are not
interested in
– High dimensionality implies high computational
complexity
13
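The first point, that distances lose contrast at high dimensionality, is easy to demonstrate empirically. This small experiment (not from the talk) measures the relative contrast of pairwise distances between uniformly random points:

```python
import math
import random

def relative_contrast(n_points, dim, seed=0):
    """(d_max - d_min) / d_min over all pairwise Euclidean distances
    between uniformly random points. A small contrast means every
    point is almost equally far from every other point."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(pts[i], pts[j])
        for i in range(n_points) for j in range(i + 1, n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

# Contrast shrinks as dimensionality grows: distances concentrate.
low_d = relative_contrast(50, 2)
high_d = relative_contrast(50, 1000)
```

At d = 2 the nearest and farthest pairs differ by a large factor; at d = 1000 all pairwise distances cluster tightly around the same value.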
Previous Approaches
(Refer to my previous DB seminar on 17 May 2002
titled “The Subspace Clustering Problem”)
Grid-based dimension selection (CLIQUE,
ENCLUS, MAFIA)
Association rule hypergraph partitioning
Context-specific Bayesian clustering
Monte Carlo algorithm (DOC)
Projective clustering (PROCLUS, ORCLUS)
14
Previous Approaches
PROCLUS:
1. Draw medoids
2. Determine neighbors
3. Select attributes
4. Assign records
5. Replace medoids
6. Go to 2
ORCLUS:
1. Draw medoids
2. Assign records
3. Select vectors
4. Merge (reselect vectors and determine centroid)
5. Go to 2
15
Previous Approaches
Summary of the limitations of previous approaches
(each approach has one or more of the following):
– Produce non-disjoint clusters
– Have exponential time complexity w.r.t. cluster dimensionality
– Allow each attribute value to be selected by only one cluster
– * Unable to determine the dimensionality of each cluster
automatically
– Produce clusters all of the same dimensionality
– * Consider only local statistical values in attribute selection
– Unable to handle datasets with mixed attribute types
– * Assign records to clusters regardless of their distances
– Require datasets to have a lot more records than attributes
16
Our Approach: HARP
Motivations:
– From datasets: we want to study gene expression
profile datasets (usually with thousands of genes
and fewer than a hundred samples)
– From previous algorithms: we want to develop a new
algorithm that does not have any of the above
limitations
17
Our Approach: HARP
HARP: a Hierarchical algorithm with Automatic
Relevant attribute selection for Projected
clustering
Special features:
– Automatic attribute selection
– Customizable procedures
– Mutual disagreement prevention
18
Our Approach: HARP
Special implementation based on attribute value
density, HARP.1:
– Use of global statistics in attribute selection
– Generic similarity calculations that can handle both
categorical and numeric attributes
– Implementing all mutual disagreement mechanisms
defined by HARP
– Reduced time complexity by pre-clustering
19
Our Approach: HARP
Basic idea:
– In the partitional approaches:
• At the beginning, each record is assigned to a cluster by
calculating distances/similarities using all attributes
• Very likely that some assignments are incorrect
• No clue to find the dimensionality of the clusters
– Our approach:
• Allow only the “best merges” at any time
20
Our Approach: HARP
Basic idea:
– “Best”: a merge is permitted only if
• Each selected attribute of the resulting cluster has a
relevance of at least dt
• The resulting cluster has more than mapc selected
attributes
• The two participating clusters have a mutual
disagreement not larger than md
– mapc, dt, md: threshold variables
21
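The three conditions above can be summarized in a short sketch (illustrative only; `relevance` and `mutual_disagreement` stand in for the measure-specific definitions, e.g. the density-based ones in HARP.1):

```python
def merge_allowed(c1, c2, attributes, relevance, mutual_disagreement,
                  dt, mapc, md):
    """Return True iff merging clusters c1 and c2 (sets of record ids)
    satisfies HARP's three threshold tests."""
    merged = c1 | c2  # records of the resulting cluster
    # 1. each selected attribute must have relevance at least dt
    selected = [a for a in attributes if relevance(merged, a) >= dt]
    # 2. the resulting cluster needs more than mapc selected attributes
    if len(selected) <= mapc:
        return False
    # 3. the participants' mutual disagreement must not exceed md
    return mutual_disagreement(c1, c2, selected) <= md
```

A merge failing any test is simply deferred; it may become permissible later, after the thresholds are loosened.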
Our Approach: HARP
Multi-step clustering:
[Figure: the multi-step clustering loop. Initial thresholds (dt, mapc, md)
feed merge score calculations; all possible merges are performed; the
thresholds are then loosened and the loop repeats. Example merge score list:
  Cluster 1 | Cluster 2 | Merge Score
  2         | 6         | 27.6
  3         | 8         | 24.3
  12        | 13        | 24.1
  1         | 5         | 18.5
  ...]
22
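The loop in the figure can be outlined as follows (a sketch, not the full algorithm; `best_allowed_merge` stands in for the score-list lookup plus the threshold tests, and `loosen` relaxes the thresholds one step):

```python
def harp_outline(clusters, k, thresholds, best_allowed_merge, loosen, steps):
    """Multi-step clustering: perform every merge that passes the current
    thresholds, then loosen the thresholds and repeat until the target
    number of clusters k is reached or the steps are exhausted."""
    for _ in range(steps):
        while len(clusters) > k:
            pair = best_allowed_merge(clusters, thresholds)
            if pair is None:  # no remaining merge passes the thresholds
                break
            i, j = pair
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        if len(clusters) <= k:
            break
        thresholds = loosen(thresholds)  # permit less ideal merges next step
    return clusters
```

Because only merges passing the current thresholds are executed, the earliest merges are the most reliable ones, and looser merges happen only when nothing better remains.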
Our Approach: HARP
Expected resulting clusters:
– Have all relevant attributes selected (due to mapc)
– Selected attributes have high relevance to the
cluster (due to dt)
– Not biased by the participating clusters (due to md
and some other mechanisms)
23
Our Approach: HARP
More details: attribute relevance
– Depending on the definition of the similarity measure
– E.g. the density-based measure defines the
relevance of an attribute to a cluster by the
“compactness” of its values in the cluster.
Compactness can be reflected by the variance value
24
Our Approach: HARP
More details: attribute relevance
Cluster (mean/variance per attribute):
     A1        A2        A3        A4
C1   4.9/0.1   5.1/0.1   2.8/0.5   3.1/0.5
C2   5.0/0.1   4.9/0.1   7.3/0.5   7.2/0.5
Which attributes are relevant to the clusters?
– A1, A2: local statistics
– A3, A4: global statistics
25
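To see why the distinction matters, apply HARP.1's numeric relevance score (defined later in the talk) to values like those in the table. The global variances below are assumed for illustration: an attribute like A1 takes similar values in every cluster, so its global variance matches its local one, while an attribute like A3 has different cluster means, so its global variance is much larger:

```python
def numeric_score(var_local, var_global):
    # HARP.1 numeric relevance: 1 - Var_local / Var_global,
    # defined as 0 when the global variance is 0
    return 1 - var_local / var_global if var_global > 0 else 0.0

# A1: compact locally (0.1), but no more compact than globally (assumed 0.1)
score_a1 = numeric_score(0.1, 0.1)
# A3: variance drops 10x inside the cluster (assumed global variance 5.0)
score_a3 = numeric_score(0.5, 5.0)
```

Purely local statistics would favor A1 and A2 (smallest variances), but relative to the global picture they carry no cluster-specific information, while A3 and A4 do.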
Our Approach: HARP
More details: mutual disagreement
– The two clusters participating in a merge do not
agree with each other
26
Our Approach: HARP
More details: mutual disagreement
– Case 1:
100 rec.
{A1, A2}
105 rec.
{A1, A2}
5 rec.
{A3, A4}
– One cluster dominates the selection of attributes
27
Our Approach: HARP
More details: mutual disagreement
– Case 2:
50 rec.
{A1, A2}
50 rec.
{A1, A2,
…,A6}
100 rec.
{A1, A2}
– The clusters lose some information due to the merge
28
Our Approach: HARP
More details: mutual disagreement
– Mutual disagreement prevention:
• Set up the md threshold to limit the maximum
disagreement on the new set of attributes
• Get the statistics of the loss of information in all possible
merges, and discard those with extraordinarily high loss
• Add a punishment factor to the similarity score
29
Our Approach: HARP.1
HARP.1: an implementation of HARP that defines the
relevance of an attribute to a cluster by its density
improvement from the global density
Relevance score of an attribute to a cluster:
– Categorical: 1 − (1 − ModeRatio_local) / (1 − ModeRatio_global)
– Numeric: 1 − Var_local / Var_global
– *When ModeRatio_global = 1 or Var_global = 0, the score = 0
– If C1 and C2 merge into Cnew, we can use the records of C1
and C2 to evaluate their “agreement” on the selected
attributes of Cnew in a similar way.
30
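A direct transcription of the two relevance scores (helper names are mine, not from the talk):

```python
from collections import Counter

def mode_ratio(values):
    # Fraction of values equal to the most frequent value (the mode)
    return Counter(values).most_common(1)[0][1] / len(values)

def relevance_categorical(local_values, global_values):
    """1 - (1 - ModeRatio_local) / (1 - ModeRatio_global);
    0 when the global mode ratio is already 1."""
    mr_global = mode_ratio(global_values)
    if mr_global == 1:
        return 0.0
    return 1 - (1 - mode_ratio(local_values)) / (1 - mr_global)

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def relevance_numeric(local_values, global_values):
    """1 - Var_local / Var_global; 0 when the global variance is 0."""
    var_g = variance(global_values)
    if var_g == 0:
        return 0.0
    return 1 - variance(local_values) / var_g
```

Both scores measure how much more concentrated the attribute's values are inside the cluster than in the whole dataset; a score near 1 means a large density improvement.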
Our Approach: HARP.1
Mutual disagreement calculations:
– Den(Ci, a): how good is attribute a in Ci
– Den(Ci, Cnew, a): how good is the attribute a in Ci, evaluated
by using the properties of a in Cnew
– Both values are in the range [0, 1]

Min Den (C1 , Cnew , a), Den (C2 , Cnew , a)  

MD (C1 , C2 )   1 
/ AC
 aA



Max
Den
(
C
,
C
,
a
),
Den
(
C
,
C
,
a
)
1
new
2
new


new
Cnew
 Den (C , C
ILoss (Ci , Cnew )  1 
a ACnew
i
new
, a)
 Den (Ci , a)
(truncated to 0 if - ve)
a ACi
31
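The two formulas transcribed into Python (illustrative: the `Den` values are passed in as plain dicts, and densities are assumed nonzero):

```python
def mutual_disagreement(attrs_new, den_1, den_2):
    """MD(C1, C2): average, over the selected attributes of Cnew, of
    1 - min/max of the two clusters' densities evaluated in Cnew.
    den_1[a] = Den(C1, Cnew, a), den_2[a] = Den(C2, Cnew, a)."""
    total = sum(
        1 - min(den_1[a], den_2[a]) / max(den_1[a], den_2[a])
        for a in attrs_new
    )
    return total / len(attrs_new)

def info_loss(attrs_new, attrs_i, den_in_new, den_own):
    """ILoss(Ci, Cnew): relative drop in total density when Ci's records
    are evaluated on Cnew's attributes; truncated to 0 if negative."""
    loss = 1 - (sum(den_in_new[a] for a in attrs_new)
                / sum(den_own[a] for a in attrs_i))
    return max(loss, 0.0)
```

MD is 0 when the two clusters agree perfectly on every selected attribute, and approaches 1 as one cluster's densities dwarf the other's.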
Our Approach: HARP.1
Similarity score:
\[
Sim(C_1, C_2) = \frac{\sum_{a \in A_{C_{new}}} Den(C_{new}, a)}{|A_{C_{new}}|} \times \big(1 - ILoss(C_1, C_{new})\big) \times \big(1 - ILoss(C_2, C_{new})\big)
\]
32
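In code, the similarity score is the average density of Cnew's selected attributes, discounted by the information loss of both participating clusters (a transcription of the formula above, with the ILoss values precomputed):

```python
def similarity(attrs_new, den_new, iloss_1, iloss_2):
    """Sim(C1, C2): average Den(Cnew, a) over Cnew's selected attributes,
    scaled by (1 - ILoss) of each participant. The ILoss factors act as
    the punishment for mutual disagreement."""
    avg_density = sum(den_new[a] for a in attrs_new) / len(attrs_new)
    return avg_density * (1 - iloss_1) * (1 - iloss_2)
```

A merge that is dense but costs either cluster a lot of information is thus scored lower than a slightly less dense merge that both clusters agree on.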
Our Approach: HARP.1
Multi-step clustering (HARP.1 details):
– Initial and baseline values for the md variable: user
parameters, default 10 and 50
– Baseline value for each dt variable: the global
statistical value
– Each cluster keeps a local score list (binary tree)
containing merges with all other clusters; the best
scores are propagated to a global score list
– With mutual disagreement prevention:
1. MD(C1, C2) <= md
2. Sum of and difference between ILoss(C1, Cnew) and
ILoss(C2, Cnew) not more than a certain s.d. from the mean
3. Punishment factor in the similarity score
[Figure: the same threshold-loosening loop as before, with the
merge score list]
33
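The score-list idea can be sketched with heaps (illustrative; the talk uses binary trees, and Python's `heapq` is a min-heap, so scores are negated to pop the best merge first):

```python
import heapq

def build_score_lists(scores):
    """`scores` maps (i, j) cluster-id pairs to merge scores. Each
    cluster i keeps a local heap of its candidate merges; the best
    entry of each local heap is propagated to a global heap, so the
    next merge to try is found without rescanning every pair."""
    local = {}
    for (i, j), s in scores.items():
        local.setdefault(i, []).append((-s, i, j))
    global_list = []
    for heap in local.values():
        heapq.heapify(heap)
        heapq.heappush(global_list, heap[0])  # propagate each best score
    return local, global_list

def best_merge(global_list):
    # Peek at the overall best-scoring candidate merge
    neg_s, i, j = global_list[0]
    return (i, j), -neg_s
```

After a merge, only the entries involving the two merged clusters need to be recomputed and re-propagated, which is what keeps the per-merge cost logarithmic rather than quadratic.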
Our Approach: HARP.1
Time complexity:

\[ O\big(N^2 \,(mls + \log N)\, d \max_a dom(a)\big) \]
– Speeding up: use a fast projected clustering
algorithm to pre-cluster the data
Space complexity:

\[ O(N^2 + Nd) \]

34
Our Approach: HARP.1
Accuracy experiments (datasets):
Name     | Type      | Rec. | Class | Cat./Num. Attr. | Avg. Rel. Attr. | Outlier (%)
Soybean  | Real-life |   47 |     4 | 35 / 0          | 26              | ?
Voting   | Real-life |  435 |     2 | 16 / 0          | 11              | ?
Mushroom | Real-life | 8124 |     2 | 22 / 0          | 15              | ?
SynCat1  | Synthetic |  500 |     5 | 20 / 0          | 12              | 5
SynMix1  | Synthetic |  500 |     5 | 10 / 10         | 12              | 5
SynNum1  | Synthetic |  500 |     5 |  0 / 20         | 12              | 5
SynCat2  | Synthetic |  500 |     5 | 20 / 0          |  7              | 5
SynMix2  | Synthetic |  500 |     5 | 10 / 10         |  7              | 5
SynNum2  | Synthetic |  500 |     5 |  0 / 20         |  7              | 5
35
Our Approach: HARP.1
Accuracy experiments (results1):
Each entry is error% / outlier%; randomized algorithms report best and average runs:

Dataset  | HARP.1     | PROCLUS (best | avg)     | Traditional (best | avg)  | ROCK
Soybean  | 0.0 / 0.0  | 0.0 / 0.0  | 17.3 / 0.0  | 2.1 / 0.0   | 9.2 / 0.0   | No published result
Voting   | 6.4 / 13.6 | 2.1 / 55.6 | 13.8 / 7.9  | 13.1 / 11.3 | 13.1 / 1.9  | 6.2 / 14.5
Mushroom | 1.4 / 0.0  | 3.2 / 0.0  | 9.0 / 0.0   | 6.0 / 0.0   | 5.2 / 0.0   | 0.4 / 0.0
36
Our Approach: HARP.1
Accuracy experiments (results2):
Dataset | HARP.1     | PROCLUS (best | avg)      | Traditional (best | avg)  | ORCLUS (best | avg)
SynCat1 | 0.0 / 5.0  | 3.6 / 1.4   | 6.7 / 3.7   | 2.6 / 26.4  | 5.8 / 5.3   | N/A
SynMix1 | 0.4 / 6.8  | 2.2 / 17.0  | 6.8 / 10.1  | 11.6 / 11.2 | 7.9 / 4.6   | N/A
SynNum1 | 0.8 / 5.0  | 1.8 / 21.4  | 7.2 / 8.3   | 4.4 / 32.0  | 5.9 / 9.2   | 0.4 / 23.8 | 2.31 / 8.15
SynCat2 | 4.0 / 8.4  | 11.0 / 31.0 | 25.0 / 14.4 | 17.8 / 23.8 | 28.5 / 5.4  | N/A
SynMix2 | 11.4 / 4.4 | 16.6 / 62.2 | 25.4 / 32.8 | 17.6 / 38.6 | 24.1 / 11.6 | N/A
SynNum2 | 18.8 / 4.4 | 11.6 / 50.8 | 18.7 / 20.7 | 11.6 / 28.0 | 23.3 / 10.9 | 50.8 / 0.0 | 57.2 / 0.0
37
Our Approach: HARP.1
Accuracy experiments (results3):
– Dataset: 500 records, 200 attributes, on average 13 relevant,
5 classes
– Pre-clustering: form 50 clusters
[Figure: percentage of error cases (0 to 50) against input l (0 to 100),
comparing PROCLUS (best score), HARP.1 with pre-clustering, and
pre-clustering alone (best score)]
38
Our Approach: HARP.1
Scalability experiments (scaling N):
[Figure: total execution time (s) against number of records, both axes
in log scale, 1,000 to 100,000 records]
39
Our Approach: HARP.1
Scalability experiments (scaling d):
40
Our Approach: HARP.1
Scalability experiments (scaling average number
of relevant attributes):
[Figure: total execution time (s), 0 to 60, against average
dimensionality of cluster, 0 to 120]
41
Our Approach: HARP.1
Scalability experiments (scaling N with pre-clustering):
[Figure: total execution time (s) against number of records, both axes
in log scale, 1,000 to 100,000 records]
42
Our Approach: HARP.1
Application: gene expression datasets
– Lymphoma: Nature 403 (2000)
– 96 samples, 4026 genes, 9 classes
– The 9 classes: DLBCL, Activated blood B, FL, Germinal centre B,
Resting/activated T, Resting blood B, NI. lymph node/tonsil,
Transformed cell lines, CLL
43
Our Approach: HARP.1
Application: gene expression datasets
– Can also use genes as records and samples as
attributes:
• E.g. use the dendrogram to produce an ordering of all
genes
• Based on some domain knowledge, validate the ordering
• If the ordering is valid, the position of other genes of
unknown functions can be analyzed
44
Future Work
Produce more implementations based on other
similarity measures
Study the definition of “relevance” in gene
expression datasets
Consider very large datasets that cannot fit into
main memory
Extend the approach to solve other problems,
e.g. k-NN in high dimensional space
45
Conclusions
A hierarchical projected clustering algorithm, HARP, is
developed with
– Dynamic selection of relevant attributes
– Mutual disagreement prevention
– Generic similarity calculation
A density-based implementation called HARP.1 is
developed with
– Good accuracy
– Reasonable time complexity
– Real applications on gene expression datasets
46
References
C. C. Aggarwal, C. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast algorithms for
projected clustering. In ACM SIGMOD International Conference on Management of Data,
1999.
C. C. Aggarwal and P. S. Yu. Finding generalized projected clusters in high dimensional
spaces. In ACM SIGMOD International Conference on Management of Data, pages 70–81, 2000.
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of
high dimensional data for data mining applications. In ACM SIGMOD International
Conference on Management of Data, 1998.
A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, J. C. Boldrick, H.
Sabet, T. Tran, X. Yu, J. I. Powell, L. Yang, G. E. Marti, T. Moore, J. Hudson, L. Lu, D. B.
Lewis, R. Tibshirani, G. Sherlock, W. C. Chan, T. C. Greiner, D. D.Weisenburger, J. O.
Armitage, R. Warnke, R. Levy, W. Wilson, M. R. Grever, J. C. Byrd, D. Botstein, P. O. Brown,
and L. M. Staudt. Distinct types of diffuse large B-cell lymphoma identified by gene
expression profiling. Nature, 403(6769):503–511, 2000.
Y. Barash and N. Friedman. Context-specific Bayesian clustering for gene expression data.
In Annual Conference on Research in Computational Molecular Biology, 2001.
47
References
C. H. Cheng, A. W.-C. Fu, and Y. Zhang. Entropy-based subspace clustering for mining
numerical data. In Knowledge Discovery and Data Mining, pages 84–93, 1999.
S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical
attributes. In 15th International Conference on Data Engineering, 1999.
E.-H. Han, G. Karypis, V. Kumar, and B. Mobasher. Clustering based on association rule
hypergraphs. In 1997 SIGMOD Workshop on Research Issues on Data Mining and
Knowledge Discovery, 1997.
G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar. Multilevel hypergraph partitioning:
Applications in VLSI domain. In ACM/IEEE Design Automation Conference, 1997.
H. Nagesh, S. Goil, and A. Choudhary. MAFIA: Efficient and scalable subspace clustering
for very large data sets, 1999.
C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali. A Monte Carlo algorithm for fast
projective clustering. In ACM SIGMOD International Conference on Management of Data,
2002.
H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets.
In ACM SIGMOD International Conference on Management of Data, 2002.
48