On Data Mining, Compression, and Kolmogorov Complexity

On Data Mining, Compression, and Kolmogorov Complexity
C. Faloutsos and V. Megalooikonomou
Data Mining and Knowledge Discovery, 2007
Problems:

- Can data mining be formalized into a logical system like the relational algebra?
- Can data mining be automated?
  - That is, can we remove the need for a human to decide upon a data mining technique?
- Why do so many different approaches to clustering exist?
  - (Implicitly, are these approaches all necessary?)
  - Partitioning, hierarchical, density-based, grid-based, model-based, …
Motivation:

- Can data mining be formalized?
  - Implications of an affirmative answer:
    - An SQL-like language for data mining could be created.
    - Logical axioms and set-theoretic concepts could be used to reason about
      data mining techniques.
- Can data mining be automated?
  - Implications:
    - A universally optimal approach for data mining could be determined
      without human intervention.
- Why do so many clustering approaches exist?
  - Are some redundant?
  - Can we eliminate some of them?
“Data Mining”?

- What exactly is “data mining”, anyway?
  - Lecture notes: extraction of interesting (non-trivial, implicit, previously
    unknown, and potentially useful) information or patterns from data in large
    databases.
  - Paper: supervised or unsupervised concept learning.
    - Supervised: classification, regression.
    - Unsupervised: clustering, pattern discovery.
  - Other tasks: forecasting, outlier detection, …?
    - Related to or dependent upon concept learning.
  - Intuitions?
Can concept learning be automated?

- In general, can we always find an optimal model?
- Some ideas:
  - MDL principle: the best model results in the best compression.
  - Kolmogorov complexity: the shortest description of a string in a fixed
    description language (alternatively, the shortest program from which a
    Universal Turing Machine can generate the string).
    - E.g., “1 2 4 8 16 32” is less complex than “1 4 2 7 6 3”.
- So data mining is essentially equivalent to data compression!
  - Specifically, all we need to do is find the Kolmogorov complexity.
  - There’s only one problem…
Kolmogorov complexity is undecidable!
A Simple Proof:

Let K(x) denote the Kolmogorov complexity of string x.
To compute K(x), we would have to execute every program, select those that
halt and compute x, and return the size of the smallest one.
The negative solution to the halting problem shows that the step “select
those that halt” is undecidable.
Thus Kolmogorov complexity is also undecidable.
Answers:

- Can data mining be formalized into a logical system like the relational
  algebra? No.
- Can data mining be automated? No.
- Why do so many different approaches to clustering exist?
  - Because we can’t automatically choose the best one every time!
- All of these results are due to the undecidability of Kolmogorov complexity.
Formal Definition of Kolmogorov Complexity

- Conditional Kolmogorov complexity:
  - K_U(x|y) = min{ |p| : p ∈ {0,1}* and U(p, y) = x },
    such that p is a description of x (such as a program) and a Universal
    Turing Machine U outputs x when given p and some additional information y.
  - In other words, it is the shortest program that, when input to a specific
    UTM, produces the desired output.
- Unconditional Kolmogorov complexity:
  - K(x) = K(x|ε), where ε is the empty string.
  - In other words, no additional information is required.
Kolmogorov Incompressibility:

- Defined as the condition K(x) ≥ |x|.
  - Recall from the SLIQ presentation: Cost(M, D) = Cost(D|M) + Cost(M).
    Cost(M) is not always bounded by |x|, so K(x) > |x| is possible.
- Incompressible strings are random, but the converse usually (not always) holds.
  - Pi, for example, is random in the traditional sense, but not incompressible;
    in fact, it is an example of an infinite-length string with finite (and low)
    complexity.
- This can be used as a proof technique, per “Kolmogorov Incompressibility
  Method in Formal Proofs: A Critical Survey” by V. Megalooikonomou.
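For context, the standard counting argument behind that proof technique (a
reconstruction; it is not spelled out on the slide) shows that incompressible
strings of every length must exist:

    % There are 2^n binary strings of length n, but only
    %   \sum_{i=0}^{n-1} 2^i = 2^n - 1
    % descriptions of length < n, so by pigeonhole at least one string
    % of each length n is incompressible.
    \[
      \bigl|\{0,1\}^n\bigr| = 2^n \;>\; 2^n - 1 \;=\; \bigl|\{\,p : |p| < n\,\}\bigr|
      \;\Longrightarrow\; \exists\, x \in \{0,1\}^n : K(x) \ge |x| = n .
    \]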
Some theorems

- Lempel-Ziv encoding length may be used as an upper bound on Kolmogorov
  complexity:
  - K(x|n) ≤ l_LZ(x) + c, where n is the length of string x and l_LZ(x) is the
    length of the Lempel-Ziv encoding of x.
  - lim_{n→∞} (1/n)·l_LZ(x) ≥ lim_{n→∞} (1/n)·K(x|n)
- We may also approximate the expected complexity of a random sequence using
  its entropy:
  - (1/n)·E[K(Xⁿ|n)] → H(X) as n → ∞
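A minimal sketch of these bounds in practice, assuming zlib (whose DEFLATE
algorithm is LZ77-based) as a practical stand-in for the idealized Lempel-Ziv
encoder, and ignoring the additive constant c. The first pair of strings echoes
the earlier “1 2 4 8 16 32” example:

    import math
    import random
    import zlib

    def compressed_bits(s: bytes) -> int:
        # zlib's DEFLATE is LZ77-based, so its output length (plus an
        # additive constant) upper-bounds K(s | len(s)).
        return 8 * len(zlib.compress(s, 9))

    random.seed(0)
    powers = ("1 2 4 8 16 32 " * 500).encode()   # the slide's regular string, repeated
    noise = " ".join(str(random.randrange(10)) for _ in range(3500)).encode()
    print(compressed_bits(powers) / len(powers))   # tiny: highly regular
    print(compressed_bits(noise) / len(noise))     # far higher: little structure

    # Entropy theorem: per-symbol compressed length of a Bernoulli(p) sequence
    # loosely approaches H(X) for long sequences (the match is rough at this scale).
    p = 0.1
    bits = bytes(int(random.random() < p) for _ in range(100_000))
    H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    print(compressed_bits(bits) / len(bits), "vs H(X) =", round(H, 3))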
Why is complexity important?

- Classification
  - Usually uses a measure of homogeneity:
    - Entropy gain.
    - Gini index.
      - A special case of generalized (Rényi) entropies.
  - We already demonstrated that entropy is related to Kolmogorov complexity.
- Clustering
  - The optimal number of clusters (k) is difficult to determine.
  - Parameter-free approaches penalize complexity using:
    - Akaike Information Criterion (AIC)
    - Bayesian Information Criterion (BIC)
    - Minimum Description Length (MDL)
  - All three relate to Kolmogorov complexity (see the BIC sketch below).
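As an illustration of the BIC route (not from the paper; scikit-learn and the
synthetic blob data are assumptions), the criterion trades log-likelihood
against a log(n)-weighted parameter count and can pick k with no human in the
loop:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Three well-separated Gaussian blobs in two dimensions.
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

    # BIC = -2 log-likelihood + (number of parameters) * log(n):
    # an MDL-flavored two-part cost (fit quality plus model complexity).
    bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
           for k in range(1, 7)}
    print(min(bic, key=bic.get))   # expected: 3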
- Distance functions
  - Many common distance functions utilize similar principles, but do not
    explicitly use Kolmogorov complexity (due to its undecidability).
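For distance functions, the best-known compression-based workaround is the
normalized compression distance of Cilibrasi and Vitányi, which substitutes a
real compressor for the uncomputable K. A minimal sketch (zlib is an arbitrary
choice of compressor, and the test strings are illustrative):

    import zlib

    def ncd(x: bytes, y: bytes) -> float:
        # Normalized compression distance: approximates the (uncomputable)
        # normalized information distance by replacing K with zlib lengths.
        cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
        cxy = len(zlib.compress(x + y))
        return (cxy - min(cx, cy)) / max(cx, cy)

    a = b"the quick brown fox jumps over the lazy dog " * 50
    b = b"the quick brown fox leaps over the lazy cat " * 50
    c = bytes(range(256)) * 9
    print(ncd(a, b))   # small: the two texts share most of their structure
    print(ncd(a, c))   # close to 1: essentially unrelated inputs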
Relevant and Nontrivial Arguments:

- Do these arguments work for lossy compression?
  - Yes: lossy schemes can be made lossless by encoding the difference between
    the model and reality (see the sketch after this list).
- What if the simplest model isn’t the best?
  - The simplicity of the model could be deceptive: the model may be simple,
    but the number of parameters required to fit the data to the model (or the
    number of digits in a single parameter) may result in high complexity.
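A minimal sketch of the lossy-to-lossless trick above (the signal and the
quantizer are illustrative assumptions): keep the coarse model plus the
residuals, and reconstruction is exact.

    import zlib
    import numpy as np

    rng = np.random.default_rng(0)
    t = np.linspace(0, 20, 2000)
    signal = np.round(1000 * np.sin(t) + rng.normal(scale=5, size=t.size)).astype(np.int16)

    model = (signal // 50) * 50          # coarse quantization: the lossy "model"
    residual = signal - model            # integer arithmetic, so exactly invertible
    assert np.array_equal(model + residual, signal)   # lossless overall

    # Two-part view: Cost(M, D) = Cost(D|M) + Cost(M). When the model fits,
    # the residuals are small and typically compress far better than raw data.
    print(len(zlib.compress(signal.tobytes())), len(zlib.compress(residual.tobytes())))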
Positive aspects

- Thinking about data mining as compression will aid in the discovery of new
  parameter-free methods for classification, clustering, and distance-function
  design.
- Data mining will remain an art: even if we do stumble upon the optimal
  domain-independent model, we cannot prove it.
  - Comparing models, however, is trivial: simply compare a model’s size in
    bytes with that of a competing model.
Experiments: Comparing models

Figure: metabolic rate vs. mass represented using (a) a piecewise-flat
approximation, such as CART; (b) a piecewise-linear approximation; and (c) a
linear approximation on a log-log scale.

Kleiber’s law: R = M^(3/4). With better models, such as (c), we may achieve
better compression. However, this may require domain knowledge: a data miner
without knowledge of Kleiber’s law may be tempted to use (b).
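A sketch of how such a comparison could be run on synthetic data obeying
Kleiber’s law. The byte-costing scheme (8 bytes per parameter plus the
zlib-compressed quantized residuals) and the 20-bin piecewise-flat model are
illustrative assumptions, not the paper’s protocol:

    import zlib
    import numpy as np

    rng = np.random.default_rng(0)
    mass = 10 ** rng.uniform(-2, 3, 500)                            # masses over 5 decades
    rate = mass ** 0.75 * np.exp(rng.normal(scale=0.05, size=500))  # Kleiber + noise

    def byte_cost(residuals: np.ndarray, n_params: int) -> int:
        # Two-part cost: Cost(M) as 8 bytes per parameter,
        # Cost(D|M) as the compressed size of the quantized residuals.
        q = np.round(residuals * 1e4).astype(np.int32)
        return 8 * n_params + len(zlib.compress(q.tobytes()))

    x, y = np.log(mass), np.log(rate)

    # Model (c): a single line in log-log space (2 parameters).
    slope, intercept = np.polyfit(x, y, 1)
    cost_linear = byte_cost(y - (slope * x + intercept), 2)

    # Model (a): piecewise-flat, CART-style (mean of y in each of 20 bins).
    edges = np.quantile(x, np.linspace(0, 1, 21))
    idx = np.clip(np.searchsorted(edges, x) - 1, 0, 19)
    means = np.array([y[idx == i].mean() for i in range(20)])
    cost_flat = byte_cost(y - means[idx], 20)

    print(cost_linear, cost_flat)   # the log-log line should need fewer bytes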
Conclusion

- Data mining as compression.
- Optimal compression is undecidable.
  - It is equivalent to finding Kolmogorov complexity, which is also undecidable.
  - Consequently, data mining will always remain an art.
- The compression-based view leads to many parameter-free data mining techniques.
- Unprecedented amounts of data and computing power will continue to challenge
  and inspire data miners.
References

1. C. Faloutsos and V. Megalooikonomou, “On Data Mining, Compression, and
   Kolmogorov Complexity,” Data Mining and Knowledge Discovery, Tenth
   Anniversary Issue, 2007.
2. V. Megalooikonomou, “Kolmogorov Incompressibility Method in Formal Proofs:
   A Critical Survey,” Technical Report TR CS-97-01, Department of Computer
   Science and Electrical Engineering, University of Maryland, Baltimore
   County, Jan. 1997.
Thanks!
Any questions?