Trajectory Sampling for Direct Traffic Observation
Partitioning – A Uniform Model
for Data Mining
Anne Denton, Qin Ding, William Jockheck, Qiang Ding and William Perrizo
Motivation
Databases and data warehouses are
currently separate systems
Why?
Standard answer:
Details, details, details …
Our answer:
Fundamental issue of representation
Relations Revisited
R(A1, A2, …, AN)
Set of tuples
Any choices at a fundamental level?
Yes!
Duality between
Element-based representation
Space-based representation
Duality
Element-based
representation:
Standard
representation of
tuples with all their
attributes
Space-based
representation:
The existence
(count?) of a tuple is
represented in its
attribute space
Similar Dualities in Physics
Particles can be
represented by the
coordinates of their
position
More fundamental
level:
Particle
Particles can be 1
values in a grid of
locations
Field
Space-Based Representation
Consider standard tuples as vectors in
the space of attribute domains
Represent all possible attribute
combinations as one bit:
1 if data item is present
0 if it isn’t
Allowing counts could be useful for
projections (?)
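A minimal sketch of the two representations, assuming a toy relation R(color, size) whose domains and tuples are invented for illustration. The element-based form is the usual set of tuples; the space-based form assigns one bit to every possible attribute combination.

```python
from itertools import product

# Hypothetical attribute domains for a toy relation R(color, size).
domains = {"color": ["red", "green", "blue"], "size": [1, 2]}

# Element-based representation: the usual set of tuples.
tuples = {("red", 1), ("blue", 2)}

# Space-based representation: one bit per possible attribute combination,
# 1 if the data item is present and 0 if it is not.
space = {point: int(point in tuples) for point in product(*domains.values())}

for point, bit in space.items():
    print(point, bit)
```

Replacing the bit with an integer count would be one way to keep duplicates after a projection, as the slide suggests.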
Space-Based Representation
as a Partition
Partitions are mutually exclusive and
collectively exhaustive sets of elements
The Space-Based Representation
partitions attribute space into two sets:
Data item present in database (1)
Data item not present (0)
Usefulness of Space-Based
Representation
No indexes needed: instant value-based
access
Index locking becomes dimensional locking
Aggregation very easy due to value-based
ordering
Selections become “and”s
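The claim that selections become "and"s can be sketched as follows. This is an illustration only: the domain sizes, the row-major bit layout, and the helper names are assumptions, and a Python integer stands in for the bit grid.

```python
NX, NY = 4, 4  # hypothetical domain sizes for attributes x and y

def bit_index(x, y):
    # Assumed row-major position of the point (x, y) in the bit grid.
    return x * NY + y

def to_bitmap(points):
    # Set one bit for every point that is present.
    bitmap = 0
    for x, y in points:
        bitmap |= 1 << bit_index(x, y)
    return bitmap

def from_bitmap(bitmap):
    # List the points whose bit is set.
    return [(x, y) for x in range(NX) for y in range(NY)
            if (bitmap >> bit_index(x, y)) & 1]

data = to_bitmap([(0, 1), (2, 1), (2, 3), (3, 0)])
mask_x2 = to_bitmap((2, y) for y in range(NY))  # all points with x = 2
mask_y1 = to_bitmap((x, 1) for x in range(NX))  # all points with y = 1

# SELECT ... WHERE x = 2 AND y = 1 is a bitwise AND of data and masks.
selected = data & mask_x2 & mask_y1
print(from_bitmap(selected))
```

No index lookup is involved: the position of a bit is determined by the attribute values themselves.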
What experience do we have with space-based
representations?
Data Cube Representation
One value (e.g., sales) given in the
space of the key attributes
Space-based with respect to key
attributes
Element-based with respect to non-key
attributes
Properties of the Domain
Space
Ideally space should have distance,
norm, etc.
Especially important for data mining
Does that make sense for all domains?
Can any domain be mapped to integers?
Can All Domains Be Mapped to
Integers?
Simplistic answer: yes!
All information in a computer is saved as bits
Any sequence of bits can be interpreted as an
integer
Problems
Order may be irrelevant, e.g., hair-color
Order may be wrong, e.g., sign bit for int
Spacing may vary, e.g., float (solution in paper:
intervalization)
Domains may be very large, e.g., movies
Categorical attributes
(irrelevant order)
We need more than one attribute for an
appropriate representation
Data mining solution:
1 attribute per domain value
Our solution:
1 attribute per bit slice
Values are corners of a Hypercube in
log(Domain Size) dimensions
Distances are given through the MAX metric
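A small sketch of the bit-slice encoding for a categorical attribute, assuming a base-2 logarithm and an invented hair-color domain. Each value becomes a corner of a hypercube, and under the MAX metric every pair of distinct values is at distance 1, so the arbitrary integer order no longer matters.

```python
from math import ceil, log2

hair_colors = ["black", "brown", "blond", "red"]  # hypothetical domain
n_bits = max(1, ceil(log2(len(hair_colors))))     # hypercube dimensions

def corner(value):
    # Hypercube corner: one coordinate (0 or 1) per bit slice.
    code = hair_colors.index(value)
    return tuple((code >> i) & 1 for i in reversed(range(n_bits)))

def max_distance(a, b):
    # MAX metric: largest coordinate-wise difference.
    return max(abs(p - q) for p, q in zip(corner(a), corner(b)))

print(corner("black"), corner("red"))
print(max_distance("black", "red"), max_distance("black", "brown"))
```

With plain integer codes, "black" and "red" would be 3 apart and "black" and "brown" only 1 apart, which is the irrelevant order the slide warns about.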
Fundamental Partition
(Space-Based Representation)
d-dimensional representation
d = Number of attributes
# of represented points
= product of all d domain sizes
Exponential in number of dimensions!
We badly need compression!
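To make the growth concrete, here is a short calculation with invented domain sizes. The number of represented points is the product of the domain sizes, so each added attribute multiplies the size of the space.

```python
from math import prod

# Hypothetical domain sizes for d = 3 attributes.
domain_sizes = [100, 365, 1000]

# Number of points in the fundamental partition.
n_points = prod(domain_sizes)
print(n_points)

# With equal domain sizes the growth is exponential in d.
for d in range(1, 5):
    print(d, 256 ** d)
```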
How Do We Handle
Exponential Growth with d?
How can we reduce # of attributes, d?
Review normalization:
We can decompose a relation into a set of
relations each of which contains the entire
key and one other attribute
This decomposition is
lossless
dependency preserving (BCNF relations only)
Compression for Non-Key
Attributes
Fundamental partition contains only one nonzero data-point in any non-key dimension
Represent number by bit-slices
Note:
This works for numerical and categorical
attributes
Original values can be regained by anding
Example: 5 (binary 101)
bit 0 & bit 1’ & bit 2 (where ’ denotes the complement)
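A minimal sketch of bit slicing for a non-key attribute, assuming one 3-bit value per key and Python integers as bit vectors over the rows. The values and helper names are illustrative. Note that the code numbers bits from the least significant end, which may differ from the numbering on the slides.

```python
N_BITS = 3
values = [5, 3, 5, 0, 7, 4]  # hypothetical attribute, one value per key

all_rows = (1 << len(values)) - 1  # bit vector with every row set

# slices[j] has bit r set when bit j of values[r] is 1.
slices = [sum(((v >> j) & 1) << r for r, v in enumerate(values))
          for j in range(N_BITS)]

def rows_with_value(target):
    # AND each slice, or its complement, according to the target's bits.
    result = all_rows
    for j in range(N_BITS):
        result &= slices[j] if (target >> j) & 1 else ~slices[j] & all_rows
    return [r for r in range(len(values)) if (result >> r) & 1]

print(rows_with_value(5))  # rows where slice 2 & slice 1' & slice 0 is set
```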
Concept Hierarchies
Bit-sliced representations have significant
benefits beyond compression:
Bit slices can be combined into concept
hierarchies:
Highest level: bit 0
Next level: bit 0 & bit 1
Next level: bit 0 & bit 1 & bit 2
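The hierarchy can be sketched by grouping values on their leading bits, assuming 3-bit values and taking bit 0 to be the most significant bit, as the level ordering above suggests. The function names and data are illustrative.

```python
N_BITS = 3

def concept(value, level):
    # The concept at a given level is the value's first `level` bits.
    return format(value, f"0{N_BITS}b")[:level]

def rollup(values, level):
    # Count how many values fall under each concept at this level.
    counts = {}
    for v in values:
        key = concept(v, level)
        counts[key] = counts.get(key, 0) + 1
    return counts

values = [5, 3, 5, 0, 7, 4]  # hypothetical attribute values
for level in range(1, N_BITS + 1):
    print(level, rollup(values, level))
```

Each additional bit slice splits every concept of the previous level into two finer ones.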
Compression for Key
Attributes
Database state-independent
compression could lead to information
loss (counts > 1)
Database state-dependent compression:
Tree structure that eliminates pure
subtrees => P-trees
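The idea of eliminating pure subtrees can be illustrated with a simplified binary tree over a one-dimensional bit vector. This is not the authors' P-tree definition, only a sketch of the compression principle. It assumes the vector length is a power of two.

```python
def build(bits):
    # A pure run (all 0s or all 1s) collapses to a single leaf.
    if all(b == bits[0] for b in bits):
        return bits[0]
    mid = len(bits) // 2
    return (build(bits[:mid]), build(bits[mid:]))

def count_ones(node, size):
    # Count the 1 bits without expanding pure subtrees.
    if isinstance(node, tuple):
        left, right = node
        return count_ones(left, size // 2) + count_ones(right, size // 2)
    return size if node == 1 else 0

bits = [1, 1, 1, 1, 0, 0, 1, 0]  # hypothetical bit vector
tree = build(bits)
print(tree)
print(count_ones(tree, len(bits)))
```

The compression depends on the database state: the more long pure runs the data contains, the smaller the tree.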
Other Ideas
Compression is better if attribute values
are dense within their domain
We could use extent domain
Compression good
Problems with insertion
Reorganization of storage
Index locking has to be reintroduced
…
How Good is Compression?
If all domains are “dense”, i.e. all values
occur
Size can easily be smaller than original relation
If non-key attributes are “sparse”
Not usually a problem: good compression
Problems only in extreme cases
E.g., movies as attribute values!
If key-attributes are “sparse”
Larger potential for problems, but also large
potential for benefit (see data cubes)
Are Key-Attributes Usually
Sparse?
Many key attributes are dense (“structure”
attributes as keys)
Automatically generated IDs are usually sequential
x and y in spatial data mining
Time in data streams
Keys in tables that represent relationships
tend to be sparse (feature attributes as keys)
Student / course offering / grade
Data cubes!
What Have We Gained?
(Database Aspects)
Data simultaneously acts as index
No separate index locking
(unless extent domain is used)
All information saved as bit patterns
Easy “select”
Other database operations discussed in
class
Data Mining Benefits
(Feature Attribute Keys)
Direct mining possible on relations with
feature attribute keys
E.g., student / course offering / grade
Rollup can be defined, etc.
Clustering, classification, ARM can make use
of proximity inherent in representation
Bit-wise representation provides concept
hierarchy for non-key attribute
Tree structure provides concept hierarchy for
key attributes
Data Mining Benefits
(Structure Attribute Keys)
For relations with structure attribute keys
data mining requires “and”ing,
which produces counts for feature attributes
Bit-wise representation provides concept
hierarchy for non-key attribute
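A sketch of how "and"ing can produce counts, assuming a relation keyed by a sequential ID with two feature attributes stored as bit slices. The attribute names, values, and helpers are invented for illustration.

```python
def make_slices(values, n_bits):
    # slices[j] has bit r set when bit j of values[r] is 1.
    return [sum(((v >> j) & 1) << r for r, v in enumerate(values))
            for j in range(n_bits)]

def value_mask(slices, target, n_rows):
    # Rows whose attribute equals target: AND of slices or complements.
    all_rows = (1 << n_rows) - 1
    mask = all_rows
    for j, s in enumerate(slices):
        mask &= s if (target >> j) & 1 else ~s & all_rows
    return mask

grade = [3, 2, 3, 1, 3, 0]   # hypothetical 2-bit feature attribute
passed = [1, 1, 0, 0, 1, 0]  # hypothetical 1-bit feature attribute
n_rows = len(grade)

grade_3 = value_mask(make_slices(grade, 2), 3, n_rows)
passed_1 = value_mask(make_slices(passed, 1), 1, n_rows)

# Count of tuples with grade = 3 AND passed = 1: AND the masks, count 1s.
count = bin(grade_3 & passed_1).count("1")
print(count)
```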
Duality:
Concept hierarchies in this representation
map exactly to tree structure when the
attribute is a key
Mapping Concept Hierarchies
Bit Slices <-> Tree
P-tree:
Take key attributes, e.g. x and y, and bit
interleave them:
x = 1 0 0 1
y = 1 1 0 1
interleaved: 1 1 0 1 0 0 1 1
Two consecutive digits form a level in the P-tree, or a level in a concept hierarchy
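The interleaving can be sketched in a few lines, using the slide's values x = 1001 and y = 1101. The function name is illustrative.

```python
def interleave(x_bits, y_bits):
    # Alternate the bits of x and y, most significant bits first.
    out = []
    for xb, yb in zip(x_bits, y_bits):
        out += [xb, yb]
    return out

x = [1, 0, 0, 1]
y = [1, 1, 0, 1]

z = interleave(x, y)
# Each pair of consecutive digits is one level of the tree.
levels = [tuple(z[i:i + 2]) for i in range(0, len(z), 2)]

print(z)
print(levels)
```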
How Could We Use That
Duality?
Join with other relations and project off key
attributes
Duality allows moving to space of non-key
attributes (Meta P-trees)
Can we do that?
We lose uniqueness
We can use 1 to represent 1 or more tuples
(equivalent to relational algebra)
Or we can introduce counts
Can be useful for data mining
The need for non-duplicate-eliminating counts also exists in
other applications
How Do Hierarchies Benefit us
in Databases?
Multi-granularity Locking
Subtrees form suitable units for storage
in a block
Fast value-based access!
(Data represented as multilevel index)
Access speed proportional to
# of levels in tree
# of bits for bit slices
Summary
Space-based representation has many
benefits
Value-based access and storage
No separate index needed
Rollups easy
P-Trees
Follow from systematic compression
Benefits from concept hierarchies