Trajectory Sampling for Direct Traffic Observation

Partitioning – A Uniform Model for Data Mining
Anne Denton, Qin Ding, William Jockheck, Qiang Ding and William Perrizo

Motivation
Databases and data warehouses are currently separate systems
Why?
- Standard answer:
  - Details, details, details …
- Our answer:
  - Fundamental issue of representation

Relations Revisited
R(A1, A2, …, AN)
- Set of tuples
- Any choices at a fundamental level? Yes!
- Duality between
  - Element-based representation
  - Space-based representation

Duality
Element-based representation:
- Standard representation of tuples with all their attributes
Space-based representation:
- The existence (count?) of a tuple is represented in its attribute space

Similar Dualities in Physics
Particle:
- Particles can be represented by the coordinates of their position
Field (more fundamental level):
- Particles can be 1 values in a grid of locations

Space-Based Representation
- Consider standard tuples as vectors in the space of attribute domains
- Represent each possible attribute combination as one bit:
  - 1 if data item is present
  - 0 if it isn’t
- Allowing counts could be useful for projections (?)
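
The following is a minimal sketch of the space-based representation, not code from the slides. The domain sizes, the sample relation, and the helper names `to_space` and `to_tuples` are assumptions made for illustration. Each possible attribute combination gets exactly one bit, and the element-based form can be recovered by listing the combinations whose bit is 1.

```python
from itertools import product

# Hypothetical domains: A1 takes values 0..3, A2 takes values 0..2.
DOMAIN_SIZES = (4, 3)

def to_space(tuples, domain_sizes=DOMAIN_SIZES):
    """Space-based form: one bit per possible attribute combination."""
    present = set(tuples)
    return {point: int(point in present)
            for point in product(*(range(n) for n in domain_sizes))}

def to_tuples(space):
    """Element-based form: list only the combinations whose bit is 1."""
    return sorted(point for point, bit in space.items() if bit)

relation = [(0, 2), (3, 1), (1, 1)]
space = to_space(relation)
```

The dictionary holds all 12 points of the attribute space even though only 3 tuples exist, which is why the later slides turn to compression.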

Space-Based Representation as a Partition
- Partitions are mutually exclusive and collectively exhaustive sets of elements
- The Space-Based Representation partitions attribute space into two sets:
  - Data item present in database (1)
  - Data item not present (0)

Usefulness of Space-Based Representation
- No indexes needed: instant value-based access
- Index locking becomes dimensional locking
- Aggregation very easy due to value-based ordering
- Selections become “and”s
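
To illustrate how a selection can reduce to bitwise ANDs, here is a small sketch under stated assumptions: a two-attribute space with hypothetical 4-value domains, packed into a single Python integer used as a bit vector. The names `cell` and `bitmap` are illustrative and do not come from the slides.

```python
# Hypothetical 2-attribute relation; domains A1 in 0..3, A2 in 0..3.
SIZE_A1, SIZE_A2 = 4, 4

def cell(a1, a2):
    """Position of the combination (a1, a2) in a flat bit vector."""
    return a1 * SIZE_A2 + a2

def bitmap(points):
    """Pack a set of attribute combinations into one int (1 bit per cell)."""
    bits = 0
    for a1, a2 in points:
        bits |= 1 << cell(a1, a2)
    return bits

relation = bitmap([(0, 1), (2, 0), (2, 3), (3, 3)])

# Each predicate is itself a bitmap over the same space.
a1_is_2 = bitmap((2, a2) for a2 in range(SIZE_A2))
a2_at_least_2 = bitmap((a1, a2) for a1 in range(SIZE_A1) for a2 in (2, 3))

# Selection A1 = 2 AND A2 >= 2 is a bitwise AND of the three bitmaps.
selected = relation & a1_is_2 & a2_at_least_2
result = [(a1, a2) for a1 in range(SIZE_A1) for a2 in range(SIZE_A2)
          if (selected >> cell(a1, a2)) & 1]
match_count = bin(selected).count("1")
```

Counting the 1 bits of `selected` gives the aggregate directly, without consulting a separate index.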

What experience do we have with space-based representations?

Data Cube Representation
- One value (e.g., sales) given in the space of the key attributes
- Space-based with respect to key attributes
- Element-based with respect to non-key attributes

Properties of the Domain Space
- Ideally space should have distance, norm, etc.
  - Especially important for data mining
- Does that make sense for all domains?
- Can any domain be mapped to integer?

Can all Domains be Mapped to Integer?
- Simplistic answer: yes!
  - All information in a computer is saved as bits
  - Any sequence of bits can be interpreted as an integer

Problems
- Order may be irrelevant, e.g., hair-color
- Order may be wrong, e.g., sign bit for int
- Spacing may vary, e.g., float (solution in paper: intervalization)
- Domains may be very large, e.g., movies
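
The "order may be wrong" problem can be made concrete with 32-bit two's-complement integers. The sketch below is an illustration, not taken from the slides: reading the raw bit pattern as an unsigned number places negative values after positive ones. Flipping the sign bit is one common remedy and is shown here as an assumption; the slides do not say which fix the authors use.

```python
import struct

def raw_bits(x):
    """Reinterpret a 32-bit two's-complement int as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">i", x))[0]

def order_preserving(x):
    """Flip the sign bit so unsigned order matches signed order."""
    return raw_bits(x) ^ 0x80000000

values = [-2, -1, 0, 1, 2]
raw_sorted = sorted(values, key=raw_bits)            # negatives end up last
fixed_sorted = sorted(values, key=order_preserving)  # numeric order restored
```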

Categorical Attributes (Irrelevant Order)
- We need more than one attribute for an appropriate representation
- Data mining solution:
  - 1 attribute per domain value
- Our solution:
  - 1 attribute per bit slice
- Values are corners of a hypercube in log(Domain Size) dimensions
- Distances are given through MAX metric
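
A short sketch of the bit-slice encoding for a categorical attribute follows. The domain (four hair colors), the assignment of codes, and the function names are assumptions for illustration. It shows why the MAX metric suits an unordered domain: any two distinct corners of the hypercube differ in at least one coordinate, so their distance is always 1.

```python
from math import ceil, log2

def encode(domain):
    """Assign each categorical value a corner of a hypercube:
    a tuple of ceil(log2(len(domain))) bits (at least 1)."""
    n_bits = max(1, ceil(log2(len(domain))))
    return {value: tuple((code >> (n_bits - 1 - i)) & 1 for i in range(n_bits))
            for code, value in enumerate(domain)}

def max_metric(u, v):
    """MAX (Chebyshev) distance: largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(u, v))

hair = encode(["black", "brown", "blond", "red"])
```

Four values need only 2 bit-slice attributes here, compared with 4 attributes under the one-attribute-per-domain-value scheme.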

Fundamental Partition (Space-Based Representation)
- d-dimensional representation
  - d = Number of attributes
- # of represented points = product of all d domain sizes
- Exponential in number of dimensions!
- We badly need compression!
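
As a worked piece of arithmetic (the domain sizes below are hypothetical), the number of represented points is the product of the domain sizes, and with a fixed domain size per attribute it grows as a power of d:

```python
from math import prod

# Hypothetical domain sizes for d = 4 attributes.
domain_sizes = [256, 256, 100, 12]
n_points = prod(domain_sizes)          # size of the attribute space

# With a fixed domain size s per attribute the space holds s**d points.
growth = [16 ** d for d in range(1, 5)]
```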


How Do We Handle Exponential Growth with d?
How can we reduce # of attributes, d?
Review normalization:
- We can decompose a relation into a set of relations each of which contains the entire key and one other attribute
- This decomposition is
  - lossless
  - dependency preserving (BCNF relations only)
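
The decomposition described above can be sketched as follows. The sample relation, its key, and the helper names `decompose` and `rejoin` are assumptions for illustration only. Each non-key attribute gets its own relation containing the entire key, and joining those relations on the key restores the original rows, which is what "lossless" means here.

```python
# Hypothetical relation: key (student, course), non-key grade and term.
rows = [
    {"student": 1, "course": "db", "grade": "A", "term": "fall"},
    {"student": 2, "course": "db", "grade": "B", "term": "fall"},
    {"student": 1, "course": "ml", "grade": "B", "term": "spring"},
]
KEY = ("student", "course")

def decompose(rows, key):
    """One relation per non-key attribute: {key tuple: attribute value}."""
    non_key = [a for a in rows[0] if a not in key]
    return {a: {tuple(r[k] for k in key): r[a] for r in rows} for a in non_key}

def rejoin(parts, key):
    """Natural join on the key restores the original rows."""
    keys = next(iter(parts.values())).keys()
    return [{**dict(zip(key, k)), **{a: rel[k] for a, rel in parts.items()}}
            for k in keys]

parts = decompose(rows, KEY)
restored = rejoin(parts, KEY)
```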

Compression for Non-Key Attributes
Fundamental partition contains only one nonzero data point in any non-key dimension
- Represent number by bit slices
Note:
- This works for numerical and categorical attributes
Original values can be regained by ANDing (’ marks a complemented bit)
- Example: 5 (binary 101): bit 0 & bit 1’ & bit 2
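
The ANDing step can be sketched as below, assuming a 3-bit attribute with bit 0 as the most significant bit; the function names are hypothetical. A value is stored as its bit slices, and a candidate value matches exactly when the AND of the slices, complemented wherever the candidate has a 0 bit, evaluates to 1.

```python
N_BITS = 3  # hypothetical attribute width; bit 0 = most significant

def to_slices(value, n_bits=N_BITS):
    """Bits of one value, bit 0 being the most significant."""
    return [(value >> (n_bits - 1 - i)) & 1 for i in range(n_bits)]

def matches(slices, target):
    """AND each bit, or its complement where the target bit is 0."""
    n_bits = len(slices)
    result = 1
    for i, bit in enumerate(slices):
        want = (target >> (n_bits - 1 - i)) & 1
        result &= bit if want else 1 - bit
    return result

def regain(slices):
    """The single target whose AND expression evaluates to 1."""
    return next(t for t in range(2 ** len(slices)) if matches(slices, t))

five = to_slices(5)   # bit 0 = 1, bit 1 = 0, bit 2 = 1
```

For the value 5, `matches(five, 5)` evaluates bit 0 & bit 1’ & bit 2, the expression given in the example.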

Concept Hierarchies
Bit-sliced representations have significant benefits beyond compression:
- Bit slices can be combined into concept hierarchies:
  - Highest level: bit 0
  - Next level: bit 0 & bit 1
  - Next level: bit 0 & bit 1 & bit 2
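
One way to read the hierarchy above is that each level groups values by a longer prefix of their bits. The sketch below assumes a 3-bit attribute with bit 0 as the most significant bit; the sample values and the names `level` and `rollup` are illustrative.

```python
N_BITS = 3  # hypothetical attribute width; bit 0 = most significant

def level(value, depth, n_bits=N_BITS):
    """Group of `value` at hierarchy level `depth`: its first `depth` bits."""
    return value >> (n_bits - depth)

def rollup(values, depth, n_bits=N_BITS):
    """Count values per group at the given level."""
    counts = {}
    for v in values:
        g = level(v, depth, n_bits)
        counts[g] = counts.get(g, 0) + 1
    return counts

values = [0, 1, 2, 5, 6, 7, 7]
coarse = rollup(values, 1)   # bit 0 only: two groups
medium = rollup(values, 2)   # bit 0 & bit 1: up to four groups
fine = rollup(values, 3)     # all three bits: individual values
```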

Compression for Key Attributes
- Database state-independent compression could lead to information loss (counts > 1)
- Database state-dependent compression:
  - Tree structure that eliminates pure subtrees => P-trees
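
The idea of eliminating pure subtrees can be sketched with a simplified binary tree over a one-dimensional bit sequence. This is an assumption-laden illustration, not the authors' P-tree definition: a run that is entirely 0 or entirely 1 collapses to a single leaf, and only mixed runs are split further.

```python
def build(bits):
    """Collapse a bit sequence (length a power of 2) into a tree.
    Pure runs become a single leaf 0 or 1; mixed runs are split in half."""
    if all(b == bits[0] for b in bits):
        return bits[0]
    mid = len(bits) // 2
    return (build(bits[:mid]), build(bits[mid:]))

def expand(node, length):
    """Inverse of build: recover the original bit sequence."""
    if node in (0, 1):
        return [node] * length
    left, right = node
    return expand(left, length // 2) + expand(right, length // 2)

def size(node):
    """Number of nodes kept after pure subtrees are eliminated."""
    return 1 if node in (0, 1) else 1 + size(node[0]) + size(node[1])

bits = [1, 1, 1, 1, 0, 0, 1, 0]
tree = build(bits)
```

The compression depends on the database state: the same tree shape would not fit a different bit sequence, and the more uniform the data, the fewer nodes remain.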

Other Ideas
Compression is better if attribute values are dense within their domain
- We could use extent domain
  - Compression good
  - Problems with insertion
    - Reorganization of storage
    - Index locking has to be reintroduced
    - …

How Good is Compression?
- If all domains are “dense”, i.e. all values occur
  - Size can easily be smaller than original relation
- If non-key attributes are “sparse”
  - Not usually a problem: good compression
  - Problems only in extreme cases
    - E.g., movies as attribute values!
- If key-attributes are “sparse”
  - Larger potential for problems, but also large potential for benefit (see data cubes)

Are Key-Attributes Usually Sparse?
- Many key attributes are dense (“structure” attributes as keys)
  - Automatically generated IDs are usually sequential
  - x and y in spatial data mining
  - Time in data streams
- Keys in tables that represent relationships tend to be sparse (feature attributes as keys)
  - Student / course offering / grade
  - Data cubes!

What Have We Gained? (Database Aspects)
- Data simultaneously acts as index
- No separate index locking
  - (unless extent domain is used)
- All information saved as bit patterns
  - Easy “select”
- Other database operations discussed in class

Data Mining Benefits (Feature Attribute Keys)
- Direct mining possible on relations with feature attribute keys
  - E.g., student / course offering / grade
  - Rollup can be defined, etc.
  - Clustering, classification, ARM can make use of proximity inherent in representation
  - Bit-wise representation provides concept hierarchy for non-key attribute
  - Tree structure provides concept hierarchy for key attributes

Data Mining Benefits (Structure Attribute Keys)
- For relations with structure attribute keys data mining requires “and”ing
  - produces counts for feature attributes
- Bit-wise representation provides concept hierarchy for non-key attribute
Duality:
- Concept hierarchies in this representation map exactly to tree structure when the attribute is a key

Mapping Concept Hierarchies: Bit Slices <-> Tree
P-tree:
- Take key attributes, e.g. x and y, and bit interleave them:
    x           = 1 0 0 1
    y           = 1 1 0 1
    interleaved = 1 1 0 1 0 0 1 1
Two consecutive digits form a level in the P-tree, or a level in a concept hierarchy
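
The interleaving can be reproduced with a few lines of Python. The function names are hypothetical; the bit values are the ones from the example, with the most significant bit first.

```python
def interleave(x_bits, y_bits):
    """Alternate the bits of x and y, most significant first."""
    out = []
    for xb, yb in zip(x_bits, y_bits):
        out += [xb, yb]
    return out

def levels(interleaved):
    """Each consecutive (x, y) bit pair selects one quadrant per tree level."""
    return [tuple(interleaved[i:i + 2]) for i in range(0, len(interleaved), 2)]

x = [1, 0, 0, 1]
y = [1, 1, 0, 1]
z = interleave(x, y)
path = levels(z)
```

Here `z` equals the interleaved sequence 1 1 0 1 0 0 1 1, and `path` lists one bit pair per level, from the coarsest quadrant down to the individual cell.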

How Could We Use That Duality?
- Join with other relations and project off key attributes
- Duality allows moving to space of non-key attributes (Meta P-trees)
- Can we do that?
  - We lose uniqueness
  - We can use 1 to represent 1 or more tuples (equivalent to relational algebra)
  - Or we can introduce counts
    - Can be useful for data mining
    - Need for non-duplicate eliminating counts exists also in other applications
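
The two options after projecting off the key can be compared in a small sketch. The relation below is hypothetical. With a presence bit, several tuples that share a value collapse into a single 1, as in relational algebra; with counts, the number of contributing tuples is kept.

```python
from collections import Counter

# Hypothetical relation (student, course, grade); key is (student, course).
rows = [(1, "db", "A"), (2, "db", "A"), (1, "ml", "B"), (3, "ml", "A")]

# Project off the key attributes, keeping only grade.
grades = [grade for _student, _course, grade in rows]

presence = {g: 1 for g in grades}   # 1 = one or more tuples (set semantics)
counts = Counter(grades)            # duplicates kept as counts
```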

How Do Hierarchies Benefit Us in Databases?
- Multi-granularity locking
- Subtrees form suitable units for storage in a block
- Fast value-based access! (Data represented as multilevel index)
Access speed proportional to
- # of levels in tree
- # of bits for bit slices

Summary
- Space-based representation has many benefits
  - Value-based access and storage
  - No separate index needed
  - Rollups easy
- P-trees
  - Follow from systematic compression
  - Benefit from concept hierarchies