Cluster Description and Related Problems


Cluster Description and Related Problems
DDM Lab Seminar
2005 Spring
Byron Gao
Road Map
1. Motivations
2. Cluster Description Problems
3. Related Problems in Machine Learning
4. Related Problems in Theory
5. Related Problems in Computational Geometry
6. Conclusions and Past / Future Work
1. Motivations
Clustering
• Group objects into clusters such that objects within a cluster are similar and objects from different clusters are dissimilar
– one of the major data mining tasks
– k-means, hierarchical, density-based, …
1. Motivations
Cluster Descriptions
• The literature lacks a systematic study of cluster descriptions
– Most clustering algorithms give only membership assignments
– Hidden knowledge in data needs to be represented for inference
– Integration of database and data mining
• Purposes of cluster descriptions
– Summarize data to gain initial knowledge
– Compress data to support further investigation
• One way of describing a cluster: a DNF formula
– A set of isothetic hyper-rectangles (see the sketch below)
  • Interpretable
  • Usable as a search condition to retrieve objects
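To make the DNF format concrete, here is a minimal Python sketch (data layout, names, and numbers invented for illustration; not the algorithms of this talk): a description is a set of isothetic rectangles, and an object satisfies it iff it lies inside at least one of them.

```python
# A DNF description as a union of axis-parallel hyper-rectangles.
Rect = list[tuple[float, float]]  # one (low, high) interval per dimension

def satisfies(desc: list[Rect], obj: list[float]) -> bool:
    """True iff obj lies inside at least one rectangle of the description."""
    return any(
        all(lo <= x <= hi for x, (lo, hi) in zip(obj, rect))
        for rect in desc
    )

desc = [[(0, 2), (0, 1)], [(3, 5), (0, 4)]]  # ([0,2] x [0,1]) OR ([3,5] x [0,4])
print(satisfies(desc, [1.0, 0.5]))   # True: inside the first rectangle
print(satisfies(desc, [2.5, 0.5]))   # False: in neither rectangle
```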
1. Motivations
Summarize Data
• Gain an initial, basic understanding of the clusters
– Capture shape and location in the multi-dimensional space
• Requirement: interpretability
– Simple and clear structure of the description format
  • e.g., DNF, B-DNF, …
  • B-DNF: the set difference between the bounding box B and a DNF formula describing the objects in B but not in cluster C (see the sketch below)
– Short in length
• Another requirement: accuracy
– Interpreting the correct target
– May trade accuracy for shorter length
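A companion sketch for the B-DNF format (again with invented names and data): membership means lying inside the bounding box B while not being covered by the DNF formula that describes B's non-cluster objects.

```python
def in_rect(rect, obj):
    """Point-in-axis-parallel-rectangle test."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(obj, rect))

def satisfies_bdnf(bbox, neg_desc, obj):
    """Inside the bounding box B and NOT covered by the negative DNF formula."""
    return in_rect(bbox, obj) and not any(in_rect(r, obj) for r in neg_desc)

bbox = [(0, 5), (0, 4)]                      # bounding box B of cluster C
neg = [[(2, 3), (0, 4)]]                     # DNF covering B's non-members
print(satisfies_bdnf(bbox, neg, [1, 1]))     # True
print(satisfies_bdnf(bbox, neg, [2.5, 1]))   # False: in the excluded strip
```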
1. Motivations
Compress Data (1)
• Data compression has important applications in data transmission and data storage
– Transmission: distribute clustering results to branches
  • Clustering performed interactively by an expert at the head office
– Storage: support further investigation
  • Partial results, or results (in some sense, always partial results)
• Description process ≈ encoding (compression)
• Retrieval process ≈ decoding (decompression)
– Description used as the search condition in a SELECT query statement (see the sketch below)
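As a hedged illustration of retrieval-as-decompression (the table and column names `points`, `x1`, `x2` are hypothetical), a DNF description can be rendered directly as the search condition of a SELECT statement:

```python
def where_clause(desc, cols):
    """Render a DNF description as a SQL search condition."""
    disjuncts = []
    for rect in desc:
        conj = " AND ".join(
            f"{c} BETWEEN {lo} AND {hi}" for c, (lo, hi) in zip(cols, rect)
        )
        disjuncts.append(f"({conj})")
    return " OR ".join(disjuncts)

desc = [[(0, 2), (0, 1)], [(3, 5), (0, 4)]]
print(f"SELECT * FROM points WHERE {where_clause(desc, ['x1', 'x2'])}")
# SELECT * FROM points WHERE (x1 BETWEEN 0 AND 2 AND x2 BETWEEN 0 AND 1)
#                         OR (x1 BETWEEN 3 AND 5 AND x2 BETWEEN 0 AND 4)
```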
1. Motivations
Compress Data (2)
• Requirement: compression ratio & retrieval efficiency
– Short in length
  • comp. ratio = (|DESC| × 2) / |C|, with each rectangle stored as two corner points
  • Typically < 0.01 (worked example below)
• Another requirement: accuracy
– Retrieve the desired objects
– May trade accuracy for shorter length: lossy compression / faster retrieval
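A worked example with invented numbers: a description of 50 rectangles, stored as 50 × 2 = 100 corner points, for a cluster of 100,000 objects gives comp. ratio = 100 / 100,000 = 0.001, well below the typical 0.01 bound.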
1. Motivations
Compress Data (3)
• Support further investigation
– Exploratory queries without retrieval of the original objects
  • e.g., …
  • Incremental clustering
– Exploratory queries requiring retrieval of the original objects
  • Information not associated with the description
  • Information not revealed by the description
• Interactive and iterative mining environment
– Inductive database framework
2. Cluster Description Problems
SDL Problem
• Shortest Description Length (SDL) problem: Given a cluster C, its bounding box B, and a description format, obtain a logical expression DESC in the given format with minimum length such that for any object o, (o ∈ C ⇒ o satisfies DESC) ∧ (o ∈ B − C ⇒ o does not satisfy DESC)
• Lossless
• NP-hard
– a variant of the Minimum Set Cover problem
2. Cluster Description Problems
MDQ Problem
• Maximum Description Quality (MDQ) problem: Given a cluster C, its bounding box B, an integer l, a description format, and a quality measure, obtain a logical expression DESC in the given format with length at most l such that the quality measure is maximized
• Lossy
• NP-hard
– by reduction from the Set Cover problem
2. Cluster Description Problems
MDQ Problem: Quality Measure
• F-measure
– Harmonic mean of P and R: F = 2PR / (P + R)
– R = |C_described| / |C|
– P = |C_described| / |B_described|
– where C_described (B_described) is the set of objects in C (B) described by the description (see the sketch below)
• RP=1 (Recall at fixed Precision of 1)
• PR=1 (Precision at fixed Recall of 1)
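A minimal sketch of this quality measure (object ids and sets invented for illustration), assuming objects are hashable ids:

```python
def f_measure(C: set, B_described: set) -> float:
    """F-measure of a description covering the objects in B_described."""
    C_described = C & B_described             # described objects that are in C
    R = len(C_described) / len(C)             # recall
    P = len(C_described) / len(B_described)   # precision
    return 2 * P * R / (P + R) if P + R else 0.0

C = {1, 2, 3, 4, 5}
described = {2, 3, 4, 6}        # covers 3 objects of C plus 1 outside object
print(f_measure(C, described))  # P = 0.75, R = 0.6 -> F = 2/3
```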
3. Related Problems in Machine Learning
Concept Learning
• Concept learning: find a classifier for a concept
– Given a labeled training sequence, generate a hypothesis that is a good approximation of the target concept and can be used as a predictor for other query points
• A typical machine learning problem
– The most extensively studied problem in computational learning theory
• The representation of the target concept is assumed known
– DNF formulas, Boolean circuits, geometric objects, pattern languages, deterministic finite automata, etc.
3. Related Problems in Machine Learning
Geometric Concept Learning
• Target is a geometric object
– Half-spaces, separated by linear hyperplanes
  • perceptron learning
– Intersections of half-spaces: convex polyhedra
  • in particular, axis-parallel boxes
– Unions of axis-parallel boxes
  • More accurate, yet simple enough to offer intuitive solutions
  • e.g., 2 boxes; less studied
• A popular topic in computational learning theory
3. Related Problems in Machine Learning
Learning Models
• PAC learning model: Probably Approximately Correct [15]
– Given a reasonable number of labeled instances, generate with high probability a hypothesis that is a good approximation of the target concept, within a reasonable amount of time
• Exact learning model [2]
– The learner is allowed to pose queries in order to exactly identify the target concept within a reasonable amount of time
– Common queries, answered by an oracle (see the toy sketch below):
  • Membership query: feed an instance to the oracle, which returns its classification
  • Equivalence query: present a hypothesis to the oracle, which returns "correct" or a counterexample
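A toy sketch of this query protocol (the hidden concept, the finite domain, and all names are invented for illustration; this is not the formal model of [2]):

```python
import random

TARGET = (3, 7)  # hidden concept: the interval [3, 7] over domain {0, ..., 9}

def membership(x: int) -> bool:
    """Membership query: the oracle classifies a single instance."""
    return TARGET[0] <= x <= TARGET[1]

def equivalence(h: tuple[int, int]):
    """Equivalence query: return 'correct' or some counterexample instance."""
    diff = [x for x in range(10) if (h[0] <= x <= h[1]) != membership(x)]
    return "correct" if not diff else random.choice(diff)

print(membership(5))         # True
print(equivalence((3, 7)))   # 'correct'
print(equivalence((0, 7)))   # some counterexample in {0, 1, 2}
```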
3. Related Problems in Machine Learning
Similarities
• Learn unions of axis-parallel boxes
– DNF formula
• The learner is fed positive and negative examples, which need to be classified accurately
3. Related Problems in Machine Learning
Dissimilarities (1): objective
• Geometric concept learning
– exact or approximate identification of a geometric object
  • labeled instances are a small part of the instance space
– Ultimate goal is classification accuracy over the instance space
  • Optimization: the least disagreement problem
  • fixed number of boxes, usually 1 or 2
– "+" and "–" examples are treated equally
• Cluster description problems
– The entire instance space is given
– Describe the "+" examples instead of classifying "+" and "–" examples
– Optimization problem: the SDL or MDQ problems
  • related to geometric object construction in computational geometry
  • For the quality measure, "+" and "–" examples are not treated equally
  • Examples could be labeled "1", "5", "9", … for clustering description
3. Related Problems in Machine Learning
Dissimilarities (2): research focus
• Geometric concept learning
– Learnability
  • under certain learning models
– Learning from queries (membership, equivalence) rather than from random examples, to significantly improve performance
– Low-dimensional
• Cluster description problems
– Heuristics
– Large data sets, possibly high-dimensional
– Other concerns: compression tool, retrieval efficiency, etc.
4. Related Problems in Theory
Two Traditional Problems
• Minimum Set Cover problem [10]
– Given a collection S of subsets of a finite ground set X, find S′ ⊆ S with minimum cardinality such that ⋃_{Si ∈ S′} Si = X
– Approximable within ln |X| [8]: greedy set cover (see the sketch below)
  • Iteratively pick the subset that covers the maximum number of uncovered elements
• Maximum Coverage problem (max k-cover): a variant
– Maximize the number of covered elements in X using at most k subsets from S
– Approximable within e / (e − 1) by the greedy algorithm [8]
• The ratios are optimal [6]
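A minimal sketch of the greedy heuristic described above (the example instance is invented): repeatedly pick the subset covering the most still-uncovered elements.

```python
def greedy_set_cover(X: set, S: list[set]) -> list[set]:
    """Greedy set cover: achieves the ~ln|X| approximation ratio cited above."""
    uncovered, cover = set(X), []
    while uncovered:
        best = max(S, key=lambda s: len(s & uncovered))
        if not best & uncovered:            # remaining elements not coverable
            raise ValueError("S does not cover X")
        cover.append(best)
        uncovered -= best
    return cover

X = {1, 2, 3, 4, 5}
S = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
print(greedy_set_cover(X, S))               # [{1, 2, 3}, {4, 5}]
```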
4. Related Problems in Theory
Red-Blue Set Cover Problem
• Given a finite set of "red" elements R, a finite set of "blue" elements B, and a family S ⊆ 2^(R∪B), find a subfamily S′ ⊆ S that covers all blue elements and a minimum number of red elements
• Raised recently (2000) [4]
• Related to the Group Steiner tree, minimum monotone satisfying assignment, and minimum color path problems
• The Set Cover problem is reducible to it, so it is at least as hard [4]
4. Related Problems in Theory
SDL Problem is NP-hard
• A variant of the Minimum Set Cover problem
– Ground set X = the set of objects in C, the cluster being described
– Collection S of rectangles (subsets):
  • Each Sj ∈ S is a rectangle minimally covering some objects in C and no objects outside C
  • S ⊆ 2^C
– Find S′ ⊆ S with minimum cardinality such that ⋃_{Si ∈ S′} Si = C
4. Related Problems in Theory
MDQ Problem is NP-hard
• RP=1: a variant of the Maximum Coverage problem
– in the same fashion as on the previous slide
– Given k, find S′ ⊆ S with |S′| at most k such that the number of covered elements in C is maximized
• PR=1: a variant of the Red-Blue Set Cover problem
• F-measure: reducible to either of the two
4. Related Problems in Theory
Similarities and Dissimilarities
• The Set Cover problem and its variants arise in a variety of settings
• Useful for arguing the NP-hardness of the SDL and MDQ problems
• Not useful from an algorithm design point of view because of an essential difference: in the SDL and MDQ problems, the collection of subsets S is not explicitly given
– e.g., the greedy set cover algorithm cannot be applied directly
5. Related Problems in Computational Geometry
Maximum Box Problem
• Given two finite sets of points X⁺ and X⁻ in ℝ^d, find an axis-parallel box B such that B ∩ X⁻ = ∅ and the number of points from X⁺ covered is maximized
• Raised recently (2002) [5]
• NP-hard when d is part of the input
• Studied in [12] and [14] for d = 2, finding exact solutions
– techniques for low-dimensional, small datasets
• Corresponds to the MDQ problem with k = 1 and RP=1 (see the sketch below)
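A brute-force sketch of the 2-d case (invented example; nothing like the specialized algorithms of [12] and [14]): since an optimal box can be shrunk until each side touches a point of X⁺, it suffices to try boxes whose bounds come from X⁺ coordinates.

```python
from itertools import combinations_with_replacement

def max_box(pos, neg):
    """Exhaustively try boxes bounded by X+ coordinates; avoid all of X-."""
    xs, ys = sorted({p[0] for p in pos}), sorted({p[1] for p in pos})
    best, best_box = -1, None
    for x1, x2 in combinations_with_replacement(xs, 2):
        for y1, y2 in combinations_with_replacement(ys, 2):
            inside = lambda p: x1 <= p[0] <= x2 and y1 <= p[1] <= y2
            if any(inside(q) for q in neg):
                continue                      # box must avoid all negatives
            count = sum(inside(p) for p in pos)
            if count > best:
                best, best_box = count, (x1, y1, x2, y2)
    return best_box, best

pos = [(0, 0), (1, 1), (2, 0), (5, 5)]
neg = [(1, 0)]
print(max_box(pos, neg))   # ((0, 1, 5, 5), 2): avoids (1, 0), covers 2 positives
```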
5. Related Problems in Computational Geometry
Rectilinear Polygon Covering Problem
• Find a collection of axis-parallel rectangles with minimum cardinality whose union is exactly a given rectilinear polygon
– Covering rectilinear polygons, possibly with holes, with axis-parallel rectangles [7] (1979)
– Different from partitioning: overlapping is allowed
• 2-d
– NP-complete [13]
– No polynomial-time approximation scheme is possible [3]
• A special case of the general set covering problem
– performance guarantee O(log n)
– recent result O(√log n) [9]
5. Related Problems in Computational Geometry
Similarities and Dissimilarities
• If extended to higher-dimensional space, exactly the same formulation as in Greedy Growth [1]
• If "don't care" cells are further allowed, a formulation similar to that of Algorithm BP [11]
• If not restricted to grid data, the same as the SDL problem
• [1] and [11] are related studies in the DB literature
• Logic minimization is also related to the grid case
– most research is on Boolean variables
6. Conclusions and Past / Future Work
Conclusions
• Cluster (clustering) description is of fundamental importance to KDD, but has not been systematically studied
• The formulated problems differ significantly from related problems in the machine learning, theory, computational geometry, and DB literatures
6. Conclusions and Past / Future Work
Past Work
• Formulated and provided heuristic algorithms for the SDL and MDQ problems
– for both the DNF and B-DNF formats
– working for both vector and grid data
• Studied the efficient retrieval issue
– cost model
– efficiency estimation methodology
– optimal ordering of description terms
6. Conclusions and Past / Future Work
Future Work
• Optimization of algorithms: so far the focus has been on introducing concepts and ideas
• Real-data experiments
• Experimental validation of the retrieval efficiency issue: DBMS
• Description alphabet: always rectangles?
– if not, we move farther away from some of the related problems
• Clustering description: gaining quality and efficiency
– even more distinguishable from related problems
• Incremental description: the description process could take long
• Interaction with clustering algorithms: online description?
References
1. R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD 1998.
2. D. Angluin. Queries and concept learning. Machine Learning, 2(4): 319-342, 1988.
3. P. Berman and B. Dasgupta. Approximating rectilinear polygon cover problems. Algorithmica, 17: 331-356, 1997.
4. R.D. Carr, S. Doddi, G. Konjevod, and M. Marathe. On the red-blue set cover problem. SODA 2000.
5. J. Eckstein, P. Hammer, Y. Liu, M. Nediak, and B. Simeone. The maximum box problem and its application to data analysis. Computational Optimization and Applications, 23(3): 285-298, 2002.
6. U. Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4): 634-652, 1998.
7. M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, New York, 1979.
8. D.S. Hochbaum. Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems. In Approximation Algorithms for NP-hard Problems. PWS, New York, 1997.
9. V.S.A. Kumar and H. Ramesh. Covering rectilinear polygons with axis-parallel rectangles. SIAM J. Comput., 32(6): 1509-1541, 2003.
10. D.S. Johnson. Approximation algorithms for combinatorial problems. J. Comput. System Sci., 9: 256-278, 1974.
11. L.V.S. Lakshmanan, R.T. Ng, C.X. Wang, X. Zhou, and T.J. Johnson. The generalized MDL approach for summarization. VLDB 1999.
12. Y. Liu and M. Nediak. Planar case of the maximum box and related problems. CCCG 2003.
13. W.J. Masek. Some NP-complete set covering problems. Manuscript, MIT, Cambridge, MA, 1979.
14. M. Segal. Planar maximum box problem. Journal of Mathematical Modelling and Algorithms, 3: 31-38, 2004.
15. L.G. Valiant. A theory of the learnable. Commun. ACM, 27(11): 1134-1142, 1984.
Questions?