ppt - Intelligent Data Systems Laboratory

Download Report

Transcript ppt - Intelligent Data Systems Laboratory

Categorical Skylines for Streaming Data
Nikos Sarkas, Gautam Das, Nick Koudas and Anthony K. H. Tunge
Univ. of Toronto, Univ. of Texas at Arlington,
and National Univ. of Singapore
SIGMOD 2008
Nam, Kwang-hyun
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea
Center for E-Business Technology
Seoul National University
Seoul, Korea
Contents
 Introduction
 Motivation
 Background
 Efficient skyline maintenance for categorical tuples
 Experimental Evaluation
 Conclusions
Copyright 2008 by CEBT
IDS Lab Seminar (User Log Analysis) - 2
Introduction
 Current situation

It has become increasingly difficult to process abundant data in
order to isolate useful and relevant information
 To facilitate the exploration of a data space

One such successful tool is the skyline query

The skyline of a data set
–
The subset of tuples which are not dominated on all of their
attributes by any other tuple
–
The tuples that have a uniquely interesting combination of attribute
values that no other tuple can match.
Copyright 2008 by CEBT
IDS Lab Seminar (User Log Analysis) - 3
Motivation
 Recent work

Considered the on-demand evaluation of skyline queries on
tuples with partially ordered categorical attributes

But in an offline environment
–
Inappropriate for a highly dynamic data streaming environment
 Objective

To identify and study the problem of maintaining the skyline of
streaming data with partially ordered, categorical attributes

To realize two novel techniques
–
Constitute the building blocks of STARS (STreaming ARrangement
Skyline) to solve the problem.
Copyright 2008 by CEBT
Database Tuning - 4
Background

A limited-capacity, sliding-window buffer B in memory

To only store the n most recent tuples from the stream

The contents of the buffer constitute a data set denoted by D

The data set is comprised of n tuples with d categorical attributes

The domain Domi of each the attributes is partially ordered

Partial order (an antisymmetric preorder)


A binary relation"≤" over a set P which is reflexive, antisymmetric, and
transitive, i.e., for all a, b, and c in P, we have that:
–
a ≤ a (reflexivity)
–
if a ≤ b and b ≤ a then a = b (antisymmetry)
–
if a ≤ b and b ≤ c then a ≤ c (transitivity)
Poset (Partially ordered set)
–
A set with a partial order
–
Commonly represented as DAG (directed acyclic graph)
Copyright 2008 by CEBT
Database Tuning - 5
Background
• Vertex : Domain value
• Directed edge : For each pair of comparable values
 Example 1

Value a and b ->
‘Comparable and a dominates b’

Value b and c ->
‘Incomparable’
 Assume tuples comprised of d categorical attributes

Tuple t1 dominates tuple t2 ( t2 < t1 or t1 > t2 )
–

If for every attribute Xi, t2.Xi ≤i t1.Xi and there is at least one attribute
Xj such that t2.Xj < t1.Xj
Tuples t1 and t2 do not dominate one another
–
Call this situation ‘tied’ ( t1 ~ t2)
–
Then, the skyline of data set D is the subset of all tuples that are not
dominated by any other tuple
Copyright 2008 by CEBT
Database Tuning - 6
Background
 Example 2
i) t2 < t1 ∵ b < a and d < b
ii) t2 < t3 ∵ b = b and d < c
iii) t1 ~ t3 ∵ b < a, but b ~ c
∴ The skyline of this data set consists of tuples t1 and t3.

Consider three tuples t1, t2 and t3 with two categorical attributes

The domain of both attributes is the poset of Figure 1

Assume t1 = (a,b), t2 = (b,d), and t3 = (b,c)
 Skyline maintenance solution

Checking whether a tuple a dominated by the current skyline

Retrieving the tuples in the buffer that are dominated by the
outgoing skyline tuple
–
Since only these tuples are candidates for entering the skyline
 Lemma 1

Let t1,t2∈B be two tuples so that t1 < t2. Then, if t2 arrived after t1
in the stream, t1 will never be in the skyline of B
Copyright 2008 by CEBT
Database Tuning - 7
Efficient skyline maintenance for categorical tuples
 Two novel techniques for realizing the building blocks of the
skyline maintenance frame work

Indexing the skybuffer (the relevant part of the buffer for skyline)
–

To enable efficiently identify the skybuffer tuples dominated by a
query tuple
Organizing the skyline
–
To enable rapidly answer whether a query tuple is dominated by the
skyline
Copyright 2008 by CEBT
Database Tuning - 8
Topological sorting
 Topological sort

A numbering of the vertices of a DAG
–

Every edge from a vertex numbered i to a vertex numbered j satisfies
i<j
If for two values a, b, a > b, then a will appear before b.
–
For a given value x, all values dominated by x will appear after it in
the linear order, while x can never dominate any value that precedes
it in the ordering
 Definition 1

Let v be a value of a partially-ordered domain. We denote by r(v)
the integer corresponding to v’s position in a ceratin topological
sort of the domain
Copyright 2008 by CEBT
Database Tuning - 9
Topological sorting
 Lemma 2

Let v1, … , vm be the m values of a partially-ordered domain Dom.
Then vi > vj only if r(vi) < (vj)
 Lemma 3

Let t1, t2 be two tuples with d partially-ordered categorical
attributes X1, …, Xd. Then t1 > t2 only if r(t1.Xi) < r(t2.Xi), 1≤i≤d
 Lemma 4

Let t1, t2 be two tuples with d partially-ordered categorical
attributes X1, … Xd. If ∃i, j such that r(t1.Xi) < r(t2.Xi) and r(t1.Xj) >
r(t2.Xj), then t1 ~ t2
Copyright 2008 by CEBT
Database Tuning - 10
Organizing the skybuffer tuples in a grid
 One of goals is indexing the skybuffer

To efficiently insert and delete tuples, as well as identify the
tuples dominated by a query tuple

Build skybuffer indexing solution around a simple grid
 Example 3

Consider a set of tuples with two categorical attributes, the
domain of both attributes being the poset of Figures 2. Suppose
that the poset values have been mapped to integers according to
topological sort (i) in Figure 2. Then, we create the grid by using
this topological sort as the grid scales and place the skybuffer
tuples in its cell.

Each grid cell corresponds to a unique combination of attribute
values and only contains tuples with these exact values
–
The cell corresponding to tuples with attribute values (d, e) has been
Database Tuning - 11
marked with an “x”. Copyright 2008 by CEBT
Improving the skybuffer organization
 1. Visiting only relevant cells
 One of advantage of the grid-based index

Directly identify and process precisely the cells that contain
dominated tuples (directly visit the cells marked with •)
 Lemma 5 (Focused Search)

Let t be a tuple with d partially-ordered categorical attributes X1
∈ Domi, …, Xd ∈ Domd. Let dom(t.Xi) be the values in Domi such
that t.Xi ≥ v, v ∈ Domi. Then, t dominates a tuple s if and only if
s ∈ dom(t.X1) x … x dom(t.Xd)

Example
–
The query tuple is (d,e) in figure 3. The domain values dominated by
d are {d,g}. The domain values dominated by e are {e,g,h}. Then only
tuples with values in {d,g} x {e,g,h} are dominated by {d,e}.
Copyright 2008 by CEBT
Database Tuning - 12
Improving the skybuffer organization
 2. Controlling the grid granularity
 A problem with the grid-based index

The lack of control over granularity of the grid

Example
–
If the tuples have 4 attributes and each domain size is about 500,
then the grid would be comprised of 62.5 billion cells
 Definition 2

Consider a DAG and its vertices. A vertex is a source if it has no
incoming edges. Then, the depth of a vertex in a DAG is the
length of the longest path from a source to the vertex.
 Lemma 6

Let v1, …, vm be the m values of a partially-ordered domain Dom.
Then vi > vj only if depth(vi) < depth(vj)
Copyright 2008 by CEBT
Database Tuning - 13
Improving the skybuffer organization
 Example 4

Consider a data set consisting of tuples with 2 partially-ordered
categorical attributes, the domain of both of them being the
poset of Figure 4(a). Then, we can group domain values that lie
on the same depth and create the grid of Figure 4(b). Every cell
would contain tuples with the same attribute values. Now, tuples
with corresponding attributes that lie at the same depth level are
placed in the same cell.
–
tuple (d,e) lies in the cell marked with “x” in Figure 4(b), along with
other tuples with values in {c,d,e} x {c,d,e}.
 Problem

The cells can contain tuples both dominated and not dominated
by the query tuple
–
A dominance check is required to identify the dominated tuples
Copyright 2008 by CEBT
Database Tuning - 14
Improving the skybuffer organization
 3. Poset partitioning
 Example 5

Suppose that our data set contains tuples with two partiallyAll cells that can potentially contain tuples
ordered categorical attributes, the domain
both being
the
that areof
dominated
by (d,e)
are located
in the
rectangular
poset of Figure 5(a). The values of the
domain
havearea.
been
grouped. If we order the groups in ascending order of their
depth value, breaking ties arbitrarily, we propose a valid
topological sort for the poset

Figure 5(b) presents the resulting grid by using as scales the
grouping of Figure 5(a)

The cells that contain candidate are marked with •
–
This allows us to use focused search in order to directly access
relevant cells.
Copyright 2008 by CEBT
Database Tuning - 15
Experimental Evaluation
 SDC (Stratification by Dominance Classification)

Introduced in Stratified computation of skylines with partiallyordered domains written by Chan et al. in 2005.

Improves over BBS+ in terms of both progressiveness and speed

Suggested in the numerical skyline maintenance solution in
offline environment
 Performance evaluation of STARS and SDC

Goal
–

To identify the impact of the buffer size, data dimensionality and
domain structure on performance
Measure
–
The average time required to process a buffer update
–
i.e., A combined tuple arrival and expiration
Copyright 2008 by CEBT
Database Tuning - 23
Experimental Evaluation
STARS
SDC
Effect of buffer size and dimensionality on performance
 Performance on synthetic data
 STARS outperforms SDC by an order of magnitude
 The time required by STARS is in the order of a millisecond or
less

Rendering its use in real life applications entirely realistic
Copyright 2008 by CEBT
Database Tuning - 24
Experimental Evaluation
STARS
SDC
3 dimensional tuples
Performance on skewed real data
 Performance on a stream of real, skewed and correlated data
 Used DMV data set and three categorical attributes of the
“cars” table: Maker/Model, Color, and Year
 The result can be attributed to the presence of skew and
correlation in the data that leads considerably larger skylines
as the buffer size increases
Copyright 2008 by CEBT
Database Tuning - 25
Experimental Evaluation
Tree
 Further evlauation of STARS
Every depth level has twice
the number of values from
the level above
wall
All depth levels have the
same number of values
Pruning efficiency of arrangement skyline organization
 Pruning efficency

Measured as the fraction of skyline tuples that need to be
examined on average in order to answer a dominance query
 Found to be resilient to increases in data dimensionality
Copyright 2008 by CEBT
Database Tuning - 26
Experimental Evaluation
3 dimensional tuples
Tree
Every depth level has twice
the number of values from
the level above
wall
All depth levels have the
same number of values
Effect of grid granularity on performance
 Demonstrates the potential perfomance benefits by utilizing
the techniques of ‘improving the skybuffer organization’
 The benefit of increasing the grid granularity is eventually
offset(상쇄) by the overhead of visiting many sparse cells

Knee in the curves
Copyright 2008 by CEBT
Database Tuning - 27
Conclusions
 Identified and motivated the problem of maintaining the
skyline of streaming data with partially ordered, categorical
attributes
 Realized two novel techniques

To constitute the building blocks of an efficient solution to the
problem.
 Introduce a lightweight data structure for indexing the tuples
in the streaming buffer
Copyright 2008 by CEBT
IDS Lab Seminar (User Log Analysis) - 28
Skyline
Nam, Kwang-hyun
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea
Center for E-Business Technology
Seoul National University
Seoul, Korea
Introduction of Skyline Query
 Motivation

The amount of information is
explosively increasing
–

Users often face information
overload
How can different attributes be
compared?
Low
 Solution
B

Return objects which are not
Price
dominated by any other object in
all attributes

Intuitive querying
–
A
High
No overhead of formulation
Copyright 2008 by CEBT
Low
Speed
High
IDS Lab Seminar (User Log Analysis) - 30
Categorical Skylines for Streaming Data
 Motivation

Existing work is not proper to maintain skylines for streaming
data
 Skyline maintenance of streaming tuples

Incoming tuple
–

Outgoing tuple
–

Consider whether incoming tuple can become part of skyline
Consider whether the tuples dominated by outgoing tuple can be
candidates for entering skyline
Skyline maintenance solution
–
Checking whether a tuple is dominated by the current line
–
Retrieving the tuples in the buffer that are dominated by the
outgoing sky line tuple
Copyright 2008 by CEBT
Database Tuning - 31
Categorical Skylines for Streaming Data
 Proposal

Indexing the skybuffer
–
To insert and delete tuples, as well as identify the skybuffer tuples
dominated by a query tuple efficiently
–
1. Topological sorting
–
2. Organizaing the skybuffer tuples in a grid
–
3. Improving the skybuffer

Visiting only relevant cells, Controlling the grid granularity, Poset
partitioning
 Advantages

Optimize query evaluation

Applicable to a wide range of dimensionality

Adapt any size of poset
Copyright 2008 by CEBT
Database Tuning - 32