ppt - Intelligent Data Systems Laboratory
Download
Report
Transcript ppt - Intelligent Data Systems Laboratory
Categorical Skylines for Streaming Data
Nikos Sarkas, Gautam Das, Nick Koudas and Anthony K. H. Tunge
Univ. of Toronto, Univ. of Texas at Arlington,
and National Univ. of Singapore
SIGMOD 2008
Nam, Kwang-hyun
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea
Center for E-Business Technology
Seoul National University
Seoul, Korea
Contents
Introduction
Motivation
Background
Efficient skyline maintenance for categorical tuples
Experimental Evaluation
Conclusions
Copyright 2008 by CEBT
IDS Lab Seminar (User Log Analysis) - 2
Introduction
Current situation
It has become increasingly difficult to process abundant data in
order to isolate useful and relevant information
To facilitate the exploration of a data space
One such successful tool is the skyline query
The skyline of a data set
–
The subset of tuples which are not dominated on all of their
attributes by any other tuple
–
The tuples that have a uniquely interesting combination of attribute
values that no other tuple can match.
Copyright 2008 by CEBT
IDS Lab Seminar (User Log Analysis) - 3
Motivation
Recent work
Considered the on-demand evaluation of skyline queries on
tuples with partially ordered categorical attributes
But in an offline environment
–
Inappropriate for a highly dynamic data streaming environment
Objective
To identify and study the problem of maintaining the skyline of
streaming data with partially ordered, categorical attributes
To realize two novel techniques
–
Constitute the building blocks of STARS (STreaming ARrangement
Skyline) to solve the problem.
Copyright 2008 by CEBT
Database Tuning - 4
Background
A limited-capacity, sliding-window buffer B in memory
To only store the n most recent tuples from the stream
The contents of the buffer constitute a data set denoted by D
The data set is comprised of n tuples with d categorical attributes
The domain Domi of each the attributes is partially ordered
Partial order (an antisymmetric preorder)
A binary relation"≤" over a set P which is reflexive, antisymmetric, and
transitive, i.e., for all a, b, and c in P, we have that:
–
a ≤ a (reflexivity)
–
if a ≤ b and b ≤ a then a = b (antisymmetry)
–
if a ≤ b and b ≤ c then a ≤ c (transitivity)
Poset (Partially ordered set)
–
A set with a partial order
–
Commonly represented as DAG (directed acyclic graph)
Copyright 2008 by CEBT
Database Tuning - 5
Background
• Vertex : Domain value
• Directed edge : For each pair of comparable values
Example 1
Value a and b ->
‘Comparable and a dominates b’
Value b and c ->
‘Incomparable’
Assume tuples comprised of d categorical attributes
Tuple t1 dominates tuple t2 ( t2 < t1 or t1 > t2 )
–
If for every attribute Xi, t2.Xi ≤i t1.Xi and there is at least one attribute
Xj such that t2.Xj < t1.Xj
Tuples t1 and t2 do not dominate one another
–
Call this situation ‘tied’ ( t1 ~ t2)
–
Then, the skyline of data set D is the subset of all tuples that are not
dominated by any other tuple
Copyright 2008 by CEBT
Database Tuning - 6
Background
Example 2
i) t2 < t1 ∵ b < a and d < b
ii) t2 < t3 ∵ b = b and d < c
iii) t1 ~ t3 ∵ b < a, but b ~ c
∴ The skyline of this data set consists of tuples t1 and t3.
Consider three tuples t1, t2 and t3 with two categorical attributes
The domain of both attributes is the poset of Figure 1
Assume t1 = (a,b), t2 = (b,d), and t3 = (b,c)
Skyline maintenance solution
Checking whether a tuple a dominated by the current skyline
Retrieving the tuples in the buffer that are dominated by the
outgoing skyline tuple
–
Since only these tuples are candidates for entering the skyline
Lemma 1
Let t1,t2∈B be two tuples so that t1 < t2. Then, if t2 arrived after t1
in the stream, t1 will never be in the skyline of B
Copyright 2008 by CEBT
Database Tuning - 7
Efficient skyline maintenance for categorical tuples
Two novel techniques for realizing the building blocks of the
skyline maintenance frame work
Indexing the skybuffer (the relevant part of the buffer for skyline)
–
To enable efficiently identify the skybuffer tuples dominated by a
query tuple
Organizing the skyline
–
To enable rapidly answer whether a query tuple is dominated by the
skyline
Copyright 2008 by CEBT
Database Tuning - 8
Topological sorting
Topological sort
A numbering of the vertices of a DAG
–
Every edge from a vertex numbered i to a vertex numbered j satisfies
i<j
If for two values a, b, a > b, then a will appear before b.
–
For a given value x, all values dominated by x will appear after it in
the linear order, while x can never dominate any value that precedes
it in the ordering
Definition 1
Let v be a value of a partially-ordered domain. We denote by r(v)
the integer corresponding to v’s position in a ceratin topological
sort of the domain
Copyright 2008 by CEBT
Database Tuning - 9
Topological sorting
Lemma 2
Let v1, … , vm be the m values of a partially-ordered domain Dom.
Then vi > vj only if r(vi) < (vj)
Lemma 3
Let t1, t2 be two tuples with d partially-ordered categorical
attributes X1, …, Xd. Then t1 > t2 only if r(t1.Xi) < r(t2.Xi), 1≤i≤d
Lemma 4
Let t1, t2 be two tuples with d partially-ordered categorical
attributes X1, … Xd. If ∃i, j such that r(t1.Xi) < r(t2.Xi) and r(t1.Xj) >
r(t2.Xj), then t1 ~ t2
Copyright 2008 by CEBT
Database Tuning - 10
Organizing the skybuffer tuples in a grid
One of goals is indexing the skybuffer
To efficiently insert and delete tuples, as well as identify the
tuples dominated by a query tuple
Build skybuffer indexing solution around a simple grid
Example 3
Consider a set of tuples with two categorical attributes, the
domain of both attributes being the poset of Figures 2. Suppose
that the poset values have been mapped to integers according to
topological sort (i) in Figure 2. Then, we create the grid by using
this topological sort as the grid scales and place the skybuffer
tuples in its cell.
Each grid cell corresponds to a unique combination of attribute
values and only contains tuples with these exact values
–
The cell corresponding to tuples with attribute values (d, e) has been
Database Tuning - 11
marked with an “x”. Copyright 2008 by CEBT
Improving the skybuffer organization
1. Visiting only relevant cells
One of advantage of the grid-based index
Directly identify and process precisely the cells that contain
dominated tuples (directly visit the cells marked with •)
Lemma 5 (Focused Search)
Let t be a tuple with d partially-ordered categorical attributes X1
∈ Domi, …, Xd ∈ Domd. Let dom(t.Xi) be the values in Domi such
that t.Xi ≥ v, v ∈ Domi. Then, t dominates a tuple s if and only if
s ∈ dom(t.X1) x … x dom(t.Xd)
Example
–
The query tuple is (d,e) in figure 3. The domain values dominated by
d are {d,g}. The domain values dominated by e are {e,g,h}. Then only
tuples with values in {d,g} x {e,g,h} are dominated by {d,e}.
Copyright 2008 by CEBT
Database Tuning - 12
Improving the skybuffer organization
2. Controlling the grid granularity
A problem with the grid-based index
The lack of control over granularity of the grid
Example
–
If the tuples have 4 attributes and each domain size is about 500,
then the grid would be comprised of 62.5 billion cells
Definition 2
Consider a DAG and its vertices. A vertex is a source if it has no
incoming edges. Then, the depth of a vertex in a DAG is the
length of the longest path from a source to the vertex.
Lemma 6
Let v1, …, vm be the m values of a partially-ordered domain Dom.
Then vi > vj only if depth(vi) < depth(vj)
Copyright 2008 by CEBT
Database Tuning - 13
Improving the skybuffer organization
Example 4
Consider a data set consisting of tuples with 2 partially-ordered
categorical attributes, the domain of both of them being the
poset of Figure 4(a). Then, we can group domain values that lie
on the same depth and create the grid of Figure 4(b). Every cell
would contain tuples with the same attribute values. Now, tuples
with corresponding attributes that lie at the same depth level are
placed in the same cell.
–
tuple (d,e) lies in the cell marked with “x” in Figure 4(b), along with
other tuples with values in {c,d,e} x {c,d,e}.
Problem
The cells can contain tuples both dominated and not dominated
by the query tuple
–
A dominance check is required to identify the dominated tuples
Copyright 2008 by CEBT
Database Tuning - 14
Improving the skybuffer organization
3. Poset partitioning
Example 5
Suppose that our data set contains tuples with two partiallyAll cells that can potentially contain tuples
ordered categorical attributes, the domain
both being
the
that areof
dominated
by (d,e)
are located
in the
rectangular
poset of Figure 5(a). The values of the
domain
havearea.
been
grouped. If we order the groups in ascending order of their
depth value, breaking ties arbitrarily, we propose a valid
topological sort for the poset
Figure 5(b) presents the resulting grid by using as scales the
grouping of Figure 5(a)
The cells that contain candidate are marked with •
–
This allows us to use focused search in order to directly access
relevant cells.
Copyright 2008 by CEBT
Database Tuning - 15
Experimental Evaluation
SDC (Stratification by Dominance Classification)
Introduced in Stratified computation of skylines with partiallyordered domains written by Chan et al. in 2005.
Improves over BBS+ in terms of both progressiveness and speed
Suggested in the numerical skyline maintenance solution in
offline environment
Performance evaluation of STARS and SDC
Goal
–
To identify the impact of the buffer size, data dimensionality and
domain structure on performance
Measure
–
The average time required to process a buffer update
–
i.e., A combined tuple arrival and expiration
Copyright 2008 by CEBT
Database Tuning - 23
Experimental Evaluation
STARS
SDC
Effect of buffer size and dimensionality on performance
Performance on synthetic data
STARS outperforms SDC by an order of magnitude
The time required by STARS is in the order of a millisecond or
less
Rendering its use in real life applications entirely realistic
Copyright 2008 by CEBT
Database Tuning - 24
Experimental Evaluation
STARS
SDC
3 dimensional tuples
Performance on skewed real data
Performance on a stream of real, skewed and correlated data
Used DMV data set and three categorical attributes of the
“cars” table: Maker/Model, Color, and Year
The result can be attributed to the presence of skew and
correlation in the data that leads considerably larger skylines
as the buffer size increases
Copyright 2008 by CEBT
Database Tuning - 25
Experimental Evaluation
Tree
Further evlauation of STARS
Every depth level has twice
the number of values from
the level above
wall
All depth levels have the
same number of values
Pruning efficiency of arrangement skyline organization
Pruning efficency
Measured as the fraction of skyline tuples that need to be
examined on average in order to answer a dominance query
Found to be resilient to increases in data dimensionality
Copyright 2008 by CEBT
Database Tuning - 26
Experimental Evaluation
3 dimensional tuples
Tree
Every depth level has twice
the number of values from
the level above
wall
All depth levels have the
same number of values
Effect of grid granularity on performance
Demonstrates the potential perfomance benefits by utilizing
the techniques of ‘improving the skybuffer organization’
The benefit of increasing the grid granularity is eventually
offset(상쇄) by the overhead of visiting many sparse cells
Knee in the curves
Copyright 2008 by CEBT
Database Tuning - 27
Conclusions
Identified and motivated the problem of maintaining the
skyline of streaming data with partially ordered, categorical
attributes
Realized two novel techniques
To constitute the building blocks of an efficient solution to the
problem.
Introduce a lightweight data structure for indexing the tuples
in the streaming buffer
Copyright 2008 by CEBT
IDS Lab Seminar (User Log Analysis) - 28
Skyline
Nam, Kwang-hyun
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea
Center for E-Business Technology
Seoul National University
Seoul, Korea
Introduction of Skyline Query
Motivation
The amount of information is
explosively increasing
–
Users often face information
overload
How can different attributes be
compared?
Low
Solution
B
Return objects which are not
Price
dominated by any other object in
all attributes
Intuitive querying
–
A
High
No overhead of formulation
Copyright 2008 by CEBT
Low
Speed
High
IDS Lab Seminar (User Log Analysis) - 30
Categorical Skylines for Streaming Data
Motivation
Existing work is not proper to maintain skylines for streaming
data
Skyline maintenance of streaming tuples
Incoming tuple
–
Outgoing tuple
–
Consider whether incoming tuple can become part of skyline
Consider whether the tuples dominated by outgoing tuple can be
candidates for entering skyline
Skyline maintenance solution
–
Checking whether a tuple is dominated by the current line
–
Retrieving the tuples in the buffer that are dominated by the
outgoing sky line tuple
Copyright 2008 by CEBT
Database Tuning - 31
Categorical Skylines for Streaming Data
Proposal
Indexing the skybuffer
–
To insert and delete tuples, as well as identify the skybuffer tuples
dominated by a query tuple efficiently
–
1. Topological sorting
–
2. Organizaing the skybuffer tuples in a grid
–
3. Improving the skybuffer
Visiting only relevant cells, Controlling the grid granularity, Poset
partitioning
Advantages
Optimize query evaluation
Applicable to a wide range of dimensionality
Adapt any size of poset
Copyright 2008 by CEBT
Database Tuning - 32