On Data Mining, Compression, and Kolmogorov Complexity.


Parameter-Free Spatial Data Mining Using MDL.
S. Papadimitriou, A. Gionis, P. Tsaparas, R. A. Väisänen, H. Mannila, and C. Faloutsos.
International Conference on Data Mining (ICDM) 2005.
Problems:

Finding patterns of spatial correlation and feature co-occurrence:
 Automatically; that is, parameter-free.
 Simultaneously.

For example:





Spatial locations lie on a grid.
Features correspond to species present in specific cells.
Each (cell, species) pair is 0 or 1, depending on whether the species is present in that cell.
Feature co-occurrence: cohabitation of species.
Spatial correlation: natural habitats for species.
Motivation:

Many applications:

Biodiversity data: as we just demonstrated.
Geographical data: presence of facilities on city blocks.
Environmental data: occurrence of events (storms, drought, fire, etc.) in various locations.
Historical and linguistic data: occurrence of words in different languages/countries, historical events in a set of locations.
Existing methods either:
 Detect one pattern but not both, or
 Require user-input parameters.
Background

Minimum Description Length (MDL):

Let L(D|M) denote the code length required to represent data D given (using) model M. Let L(M) be the code length required to describe the model itself.

The total code length is then:

L(D, M) = L(D|M) + L(M)

This was used in SLIQ and is the intuitive notion behind the connection between data mining and data compression.



The best model minimizes L(D, M), resulting in optimal compression.
Choosing the best model is a problem in its own right.
This will be explored further in the next paper I present.
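To make the two-part cost concrete, here is a minimal sketch (not from the paper) that scores a binary sequence under a Bernoulli model; the 0.5 * log2(n) parameter-precision term for L(M) is a standard MDL convention, assumed here for illustration:

```python
import math

def entropy(p):
    """Binary entropy in bits; H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def two_part_codelength(data):
    """L(D, M) = L(D|M) + L(M) for a fitted Bernoulli model.

    L(D|M) = n * H(p_hat) bits (data given the model);
    L(M) ~ 0.5 * log2(n) bits to state p_hat to the precision
    the data supports (an assumption, not the paper's model cost).
    """
    n = len(data)
    p_hat = sum(data) / n
    return n * entropy(p_hat) + 0.5 * math.log2(n)

# A highly regular sequence compresses well; a balanced one does not.
regular = [1] * 60 + [0] * 4
balanced = [0, 1] * 32
assert two_part_codelength(regular) < two_part_codelength(balanced)
```

The best model is the one minimizing this total, exactly as the slide states.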
Background

Quadtree Compression

Quadtrees: used to index and reason about contiguous variable-size grid regions (among other applications, mostly spatial).
Used for 2D data; the k-dimensional analogue is the k-d tree.
"Full quadtree": every node has either 0 or 4 children.
 Thus, every internal node corresponds to a partitioning of a rectangular region into 4 subregions.
Each quadtree's structure corresponds to a unique partitioning.




Transmission:


If we only care about the structure (spatial partitioning), we can
transmit a 0 for internal nodes and a 1 for leaves in depth-first order.
If we transmit the values as well, the cost is the number of leaves
times the entropy of the leaf value distribution.
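The depth-first structure encoding can be sketched in a few lines; the list-based tree representation here is an assumption for illustration, not the paper's data structure:

```python
# A full quadtree node is either a leaf (a value) or an internal node
# with exactly 4 children, represented here as a 4-element list.

def encode_structure(node, bits=None):
    """Emit one bit per node in depth-first order:
    0 for an internal node, 1 for a leaf."""
    if bits is None:
        bits = []
    if isinstance(node, list):   # internal: recurse into the 4 quadrants
        bits.append(0)
        for child in node:
            encode_structure(child, bits)
    else:                        # leaf
        bits.append(1)
    return bits

# Root split into 4 quadrants; the first quadrant is split again.
tree = [["a", "b", "a", "a"], "b", "b", "a"]
print(encode_structure(tree))  # [0, 0, 1, 1, 1, 1, 1, 1, 1]
```

Note the bit count: the tree has m = 7 leaves, and the encoding uses (4*7 - 1)/3 = 9 bits, one per node of the full quadtree.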
Example
Quadtree Encoding

Let T be a quadtree with m leaf nodes, of which m_p have value p. The total codelength is:

L(T) = m·H(m1/m, m2/m, ..., mk/m) + (4m - 1)/3

(The second term counts one structure bit for each of the (4m - 1)/3 nodes of a full quadtree with m leaves.)

If we know the distribution of the leaf values, we can calculate this in constant time.
Updating the tree requires O(log n) time in the worst case, as part of the tree may require pruning.
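A small sketch of the formula, assuming only that we are given the multiset of leaf values:

```python
import math
from collections import Counter

def quadtree_codelength(leaf_values):
    """L(T) = m * H(m1/m, ..., mk/m) + (4m - 1)/3 bits:
    value bits for the m leaves plus one structure bit per node
    of a full quadtree with m leaves."""
    m = len(leaf_values)
    counts = Counter(leaf_values)
    h = -sum((c / m) * math.log2(c / m) for c in counts.values())
    return m * h + (4 * m - 1) / 3

# A tree with m = 7 leaves over two values pays 7 * H(5/7, 2/7)
# value bits plus (4*7 - 1)/3 = 9 structure bits.
leaves = ["a"] * 5 + ["b"] * 2
cost = quadtree_codelength(leaves)
assert cost > 9  # never cheaper than the structure bits alone
```

With the leaf-value counts maintained incrementally, this is the constant-time evaluation the slide mentions.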
Binary Matrices / Bi-groupings:

Bi-grouping: simultaneous grouping of m rows and n columns into k and l disjoint row and column groups, respectively.
 Let D denote an m x n binary matrix.
 The cost of transmitting D is given as follows:

L(D) = L(D | QX, QY, k, l) + L(QX, QY, k, l)

Recall the MDL principle: L(D) = L(D|M) + L(M).
Let (QX, QY) be a bi-grouping.
Lemma (we will skip the proof):

The codelength for transmitting an m-to-k mapping QX, where m_p symbols are mapped to the value p, is approximately:

L(QX | k) ≈ m·H(m1/m, m2/m, ..., mk/m)
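To see the pieces together, here is a toy version of the total cost; the per-block data cost below charges each (row group, column group) block its size times the binary entropy of its density of 1s, a common approximation that differs from the paper's exact code in lower-order terms:

```python
import math
from collections import Counter

def H(ps):
    """Entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

def mapping_cost(q, k):
    """Lemma: L(Q | k) ~ m * H(m1/m, ..., mk/m) for an m-to-k mapping."""
    m = len(q)
    return m * H([c / m for c in Counter(q).values()])

def bigrouping_cost(D, qx, qy, k, l):
    """Approximate L(D | QX, QY, k, l) + L(QX | k) + L(QY | l)."""
    data_cost = 0.0
    for rg in range(k):
        rows = [i for i, g in enumerate(qx) if g == rg]
        for cg in range(l):
            cols = [j for j, g in enumerate(qy) if g == cg]
            size = len(rows) * len(cols)
            if size == 0:
                continue
            ones = sum(D[i][j] for i in rows for j in cols)
            d = ones / size
            data_cost += size * H([d, 1 - d])  # block is ~Bernoulli(d)
    return data_cost + mapping_cost(qx, k) + mapping_cost(qy, l)

# Toy 4x4 matrix with a clean 2x2 block structure.
D = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
aligned = bigrouping_cost(D, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
shuffled = bigrouping_cost(D, [0, 1, 0, 1], [0, 0, 1, 1], 2, 2)
assert aligned < shuffled  # the block-aligned grouping compresses better
```

The aligned grouping makes every block pure, so the data cost vanishes and only the two mapping costs remain.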
Methodology

Exploiting spatial locality:
 Bi-grouping as presented is nonspatial!
 To make it spatial, assign a non-uniform prior to possible groupings.
  That is, adjacent cells are more likely to belong to the same group.

Row groups correspond to spatial groupings:
 "Neighborhoods"
 "Habitats"
 Row groupings should demonstrate spatial coherence.
Column groups correspond to "families":
 "Mountain birds"
 "Sea birds"

Intuition:
 Alternately group rows and columns iteratively until the total cost L(D) stops decreasing.
 Finding the global optimum is very expensive, so our approach will use a greedy search for local optima.
Algorithms

INNER:

Groups rows and columns, given the number of row and column groups.

Start with an arbitrary bi-grouping (QX^0, QY^0) of matrix D into k row groups and l column groups, and set t = 0.
do {
    Let QX^(t+1) = QX^t
    for each row i from 1 to m:
        Set QX^(t+1)(i) = p, 1 <= p <= k, such that the "cost gain"
            (L(D | QX^t, QY^t, k, l) + L(QX^t | k)) - (L(D | QX^(t+1), QY^t, k, l) + L(QX^(t+1) | k))
        is maximized.
    Repeat for the columns, producing the bi-grouping (QX^(t+1), QY^(t+2)).
    t += 2
} while (L(D) is decreasing)
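A self-contained sketch of this alternating search. Assumptions worth flagging: the data cost is a per-block Bernoulli code, there is no quadtree/spatial prior, and the L(QX|k), L(QY|l) terms are omitted from the per-row gain, so this captures the flavor of INNER rather than the paper's exact cost:

```python
import math

def cell_bits(x, d, eps=1e-9):
    """Bits to encode one binary cell x under block density d."""
    d = min(max(d, eps), 1 - eps)
    return -math.log2(d if x else 1 - d)

def group_densities(D, qx, qy, k, l):
    """Density of 1s in each (row group, column group) block."""
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for i, row in enumerate(D):
        for j, x in enumerate(row):
            ones[qx[i]][qy[j]] += x
            size[qx[i]][qy[j]] += 1
    return [[ones[p][q] / size[p][q] if size[p][q] else 0.5
             for q in range(l)] for p in range(k)]

def total_cost(D, qx, qy, k, l):
    dens = group_densities(D, qx, qy, k, l)
    return sum(cell_bits(x, dens[qx[i]][qy[j]])
               for i, row in enumerate(D) for j, x in enumerate(row))

def inner(D, qx, qy, k, l, max_iters=20):
    """Greedy alternation: move each row (then each column) to the group
    that encodes it in the fewest bits, until the cost stops decreasing."""
    cost = total_cost(D, qx, qy, k, l)
    for _ in range(max_iters):
        dens = group_densities(D, qx, qy, k, l)
        for i, row in enumerate(D):       # reassign rows
            qx[i] = min(range(k), key=lambda p: sum(
                cell_bits(x, dens[p][qy[j]]) for j, x in enumerate(row)))
        dens = group_densities(D, qx, qy, k, l)
        for j in range(len(D[0])):        # reassign columns
            qy[j] = min(range(l), key=lambda q: sum(
                cell_bits(D[i][j], dens[qx[i]][q]) for i in range(len(D))))
        new_cost = total_cost(D, qx, qy, k, l)
        if new_cost >= cost:              # no improvement: local optimum
            break
        cost = new_cost
    return qx, qy, cost

# Toy matrix with two clear row blocks and two clear column blocks;
# one row starts in the wrong group and gets moved.
D = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
qx, qy, _ = inner(D, [0, 0, 1, 0], [0, 0, 1, 1], 2, 2)
assert qx == [0, 0, 1, 1] and qy == [0, 0, 1, 1]
```

As in the slide, each sweep can only decrease the total cost, so the loop terminates at a local optimum.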
Algorithms

OUTER:

Finds the number of row and column groups.

Start with k^0 = l^0 = 1 and t = 0.
Split the row group p* with the maximum per-row entropy, holding the columns fixed:
 Move each row in p* to a new group k^t + 1 iff doing so would decrease the per-row entropy of p*, resulting in a grouping QX^(t+1).
Assign (QX^(t+1), QY^(t+1)) to the result of INNER(k^t + 1, l^t, QX^(t+1), QY^t).
If the cost does not decrease, return (k^t, l^t, QX^t, QY^t).
Otherwise, increment t and repeat.
Finally, perform this again for the columns.
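The split criterion can be illustrated on its own. Here per-row entropy is computed from the group's per-column densities rather than over column-group blocks as in the paper, so treat this as an approximation of OUTER's choice of p*:

```python
import math

def binary_entropy(d):
    """Entropy in bits of a Bernoulli(d) cell."""
    if d in (0.0, 1.0):
        return 0.0
    return -d * math.log2(d) - (1 - d) * math.log2(1 - d)

def per_row_entropy(D, qx, p):
    """Bits per row of row group p under its per-column densities:
    high when the group's rows disagree with each other."""
    rows = [D[i] for i in range(len(D)) if qx[i] == p]
    if not rows:
        return 0.0
    total = 0.0
    for j in range(len(rows[0])):
        d = sum(row[j] for row in rows) / len(rows)
        total += binary_entropy(d)
    return total

def group_to_split(D, qx, k):
    """OUTER's split choice: the row group with maximum per-row entropy."""
    return max(range(k), key=lambda p: per_row_entropy(D, qx, p))

# Group 0 mixes two opposite row patterns; group 1 is homogeneous.
D = [[1, 1, 1, 0],
     [0, 0, 0, 1],
     [1, 0, 1, 0],
     [1, 0, 1, 0]]
qx = [0, 0, 1, 1]
assert group_to_split(D, qx, 2) == 0
```

Splitting the most internally disagreeing group is what lets OUTER grow k (and l) only while the total MDL cost keeps dropping.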
Complexity

INNER is linear with respect to the nonzero elements of D:
 Let nnz denote the number of nonzero elements.
 Let k be the number of row groups and l be the number of column groups.
 Row swaps are performed in the quadtree and take O(log m) time each, where m is the number of cells.
 Let T be the number of iterations required to minimize the cost.
 Total: O(nnz * (k + l + log m) * T)

OUTER, though quadratic with respect to (k + l), is linear with respect to the dominating term nnz:
 Let n be the number of row splits.
 Total: O((k + l)^2 * nnz + (k + l) * n * log m)
Experiments

NoisyRegions

Three features ("species") on a 32x32 grid:
 So D has 32x32 = 1024 rows and 3 columns.
3% of the cells, chosen at random, are assigned a wrong species, also randomly chosen.
The spatial and non-spatial groupings are shown to the right.
 Recall: bi-grouping is not spatial by default.
Spatial grouping reduces the total codelength.
The approach is not quite perfect due to the heuristic nature of the algorithm.
Experiments

Birds


219 Finnish bird species over 3813 10x10km habitats.
Species are the features, habitats are cells.



The spatial grouping is clearly more coherent.
Spatial grouping reveals Boreal zones:




So our matrix is 3813x219.
South Boreal: Light Blue and Green.
Mid Boreal:Yellow.
North Boreal: Red.
Outliers are (correctly) grouped alone.


Species with specialized habitats.
Or those reintroduced into the wild.
Other approaches

Clustering:
 k-means
  Variants using different estimates of central tendency: k-medoids, k-harmonic means, spherical k-means, ...
  Variants determining k based on some criterion: X-means, G-means, ...
 BIRCH
 CURE
 DENCLUE
 LIMBO
  Also information-theoretic.

These approaches are either lossy, parametric, or not easily adaptable to spatial data.
Room for improvement:

Complexity:
 O(n * log m) cost for reevaluating the quadtree codelength:
  O(log m) worst-case time for each reevaluation/row swap, times n swaps.
  However, the average-case complexity is probably much better.
  If we know something about the data distribution, we might be able to reduce this.
 Faster convergence:
  Fewer iterations, reducing the scaling factor T.
  Rather than stopping only when there is no decrease in cost, perhaps stop when the decrease falls below a threshold? (This introduces a parameter.)

Accuracy:
 The search will only find local optima, leading to errors.
 We can employ approaches used in simulated annealing or genetic algorithms to attempt to find the global optimum:
  Randomly restarting in the search space, for example.
  Stochastic gradient descent, which is similar to what we are already doing, actually.
Conclusion




Simultaneous and automatic grouping by spatial correlation and feature cohabitation.
Easy to exploit spatial locality.
Parameter-free.
Utilizes MDL:
 Minimizes the sum of the model cost and the data cost given the model.
Efficient:
 Almost linear in the number of entries in the matrix.
References
1. S. Papadimitriou, A. Gionis, P. Tsaparas, R. A. Väisänen, H. Mannila, and C. Faloutsos, "Parameter-Free Spatial Data Mining Using MDL", ICDM, Houston, TX, USA, November 27-30, 2005.
2. M. Mehta, R. Agrawal, and J. Rissanen, "SLIQ: A Fast Scalable Classifier for Data Mining", in Proceedings of the 5th International Conference on Extending Database Technology, Avignon, France, March 1996.
Thanks!
Any questions?