slides - UCLA Computer Science
Download
Report
Transcript slides - UCLA Computer Science
Continuous Query Languages for DSMS
CS240B Notes
by
Carlo Zaniolo
1
Blocking Operators
A blocking query operator is ‘one that is unable to produce
the first tuple of the output until it has seen the entire
input’ [Babcock et al. PODS02]
But continuous queries cannot wait for the end of the
stream: must return results while the data is streaming in.
Blocking operators cannot be used.
Only non-blocking (nb) queries and operators can be used on
data streams (i.e. those that return their results before
they have detected the end of the input).
Current DBMSs make heavy usage of blocking computations:
1. For operators that are intrinsically blocking: e.g., SQL
aggregates,
2. And for those that are not: e.g., sort-based implementation of
joins and group by
3. We only need to be concerned with 1: find a characterization
for blocking & nonblocking independent of implementation.
3
Partial Ordering
Let S = [ t1, , tn] be a sequence and 0 k n.
Then [t1, , tk ] is said to be the presequence of S,
of length k, denoted by Sk.
We write L S to denote that L is a presequence of
S,
Defines a Partial Order: reflexive, antisymmetric
and transitive.
generalizes to the subset notion when order and
duplicates are immaterial
The empty sequence, [ ], is a subsequence of every
other sequence.
4
employees(E#, Dept)
select dept, count(E#)
from employees
group by dept
select dept, count(E#)
over (partition by dept
range unbounded preceding)
from employees
Traditional SQL-2
aggregates: Blocking
SQL:2003 OLAP functions:
Non-Blocking
Continuous count returns, for each new tuple, the count so
far. Consider a sequence of length n. At each step j<n, j is
returned cumulative return up to j:
sumj (S)= [1,2, …, j] independent on whether j=n or j<n.
Traditional count:
For each j<n --nothing: sumj (S)=[]
Final: sumn (S)=[n]
5
Examples
• Traditional SQL-2 aggregates are blocking—
SQL:2003 OLAP functions are not.
• Selection is nonblocking.
• Continuous count (i.e., the unlimited preceding
count of OLAP functions) is non-blocking
• Also window aggregates are non-blocking
• In between cases: e.g., traditional aggregates on
input that is already sorted on group-by values.
7
Characterization of NonBlocking (nb)
Many functions expressible by nb-computations can also be
expressed by blocking ones. E.g., joins can be implemented using
sorting. Ditto for projections with duplicate elimination.
But many functions implemented using blocking computation cannot
be given an nb-implementation.
We must distinguish between the two kinds of functions, since one
can be supported in our DSMS (via suitable nb-implementation) and
the other cannot.
Theorem: Queries can be expressed via nb
computations iff they are monotonic w.r.t. the
presequence ordering.
8
NB-completeness
A query language L can express a given set of functions F on its
input (DB, sequences, data streams)---the larger F, the greater
the expressive power of L.
Non-monotonic functions are intrinsically blocking and they
cannot be used on data streams. Thus, if we use L in a DSMS,
we give up the non-monotonic subset of F with no regret.
However, let us make sure that we do not give up anything more!
More? Yes, because for continuous queries of streams, we will
normally disallow L’s blocking (i.e. nonmonotonic) operators &
constructs, and only allow nb (i.e., monotonic ) operators.
But are ALL the monotonic functions expressible by L using the
nb-operators of L ? Or by disallowing blocking operators did we
also lose the ability of expressing some monotonic queries?
Definition: L is said to be nb-complete when it can express all
the monotonic queries expressible by L using only its nboperators.
9
Expressive Power and NB-Completeness
NB-completeness is a test that a language is as
suitable for continuous queries on data streams as
it is on stored database.
In a language L lacking nb-completeness, there are
monotonic functions that L cannot express as
continuous queries, that L can express if the
stream had been stored in a database.
For instance, Relational Algebra and SQL are not
nb-complete (in addition to the shortcomings they
might have on DBs).
10
Sets versus Sequences
Sets are sequences where duplicates are allowed and
order is immaterial.
Thus S1 is a subset of S2 iff S1 can be reordered in
a presequence of S2.
Theorem [Lifted from sequences to sets]. A function
is is nb iff it is monotonic.
NB=monotonic: selection, projection, and OLAP
functions
Blocking=Non-Monotonic: e.g. Traditional aggregates.
Operators of more than one argument:
Join are monotonic (i.e., NB) in both arguments.
R-S is monotonic on R and antimonotonic on S: i.e., will block
on S but not on R (after it has seen the whole S, though)
11
Relational Algebra (RA)
Set difference can produce monotonic queries: Intersection: R1
R2 = R1 - (R1 - R2)
Are these still expressible without set diff?
Intersection can be expressed as a joins: product+select
But interval coalescing and Until queries are monotonic queries that can be
expressed in RA but not in nb-RA.
Example: Temporal domain isomorfic to nonnegative
integers.Intervals closed to the left but open to the right:
p(0, 3).
p(2, 4).
p(4, 5).
p(6, 8).
% 0,1, and 2 are in p but 3 is not
% 3 is not a hole because is covered by this
% 5 is a hole because not covered by any other interval
12
Coalesce p (cp) & p Until q
p(0, 3). p(2, 4). p(4, 5).
cp(0, 3). cp(2, 4). cp(4, 5).
p(6, 8).
cp(6, 8). cp(0, 4).
cp(2, 5). cp(0,5).
cp contains intervals from the start point of any p interval to the endpoint of
any p interval unless the endpoint of an interval in between is a hole.
cp(I1, J2) p(I1, J1), p(I2, J2), J1 < J2, hole(I1, J2).
hole(I1, J2) p(I1, J1), p(I2, J2), p(_,K), J1 K, K < I2, cep(K).
cep(K) p(_, K), p(I, J), I K, K < J.
q(5,_) holds if cp has an interval that starts at 0 & contains 5
pUntil q(yes) q(0, J).
pUntil q(yes) cp(0, I), q(J, _), I J .
13
Relational Algebra
NonMonotonic (i.e., blocking) RA operators: set
difference and division
We are left with: select, project, join, and union. Can
these express all FO monotonic queries?
Some interesting temporal queries: coalesce and until
They are expressible in RA (by double negation)
They are monotonic
They cannot be expressed in nb-RA.
Theorem: RA and SQL are not nb-complete.
SQL faces two problems: (i) the exclusion of EXCEPT/NOT
EXISTS, and
(ii) the exclusion of aggregates.
14
E-Bay Example
Auctions: a stream of bids on an item.
bidStream(Item#, BidValue, Time)
Items for which sum of bids is > 100K
SELECT Item#
FROM bidStream
GROUP BY Item#
HAVING SUM(BidValue) > 100000;
This is a monotonic query. Thus it can be expressed in a
language containing suitable query operators, but not in SQL-2.
SQL-2 is not nb-complete; thus it is ill-suited for continuous
queries on data streams.
So SQL-2 is not nb-complete because of its blocking
aggregates. What about relational algebra?
15
Incompleteness of Relational QL
The coalesce and until queries
can be expressed in safe nonrecursive Datalog,
thus
They are expressible in RA,
They are monotonic
They cannot be expressed in nb-RA
Theorem: RA and SQL are not nb-complete.
A new limitation for DB query languages (which
were already severely challenged in terms of
expressive power)
16
Embedding SQL Queries in a PL
In DB applications, SQL can be embedded in a PL (Java,
C++…) where the PL accesses the tuples returned by SQL
using a Get Next of Cursor statement.
Operations that could not be expressed in SQL can then be
expressed in the PL:
an effective remedy for the lack of expressive power of SQL
But cursors is a ‘pull-based’ mechanism and cannot be used
on data streams: the DSMS cannot hold tuples until the PL
request them.
The DSMS can only deliver its output to the PL as a
stream—This is OK to drive a GUI. But if most of the work
has not been done yet, who is the DSMS?
Contrast this to DBMS who are useful even with a weak QL.
17
Reviewing the Situation
SQL’s lack of expressive power is a major problem
for database-centric applications.
These problems are significantly more serious for
data streams since:
Only monotonic queries can be used,
Actually, not even all the monotonic ones since SQL is not
nb-complete,
These problems cannot be really by using PLs with
embedded SQL statements on streams
DSMS will be impaired--unless significant
improvements can be made.
18
UDAs to the Rescue
Full support for UDAs with all window combinations—
effective on UDAs written in SQL, PLs, and even builtins
Support for continuous queries and ad hoc queries, under
a simple and unified semantics
Turing completeness --all possible queries
nb-completeness all monotonic queries using only nonblocking operators (e.g., window UDAs & those without
TERMINATE)
Effective on a broad range of data-intensive applications:
data/stream mining, approximate queries, sequential
patters (XML not there)
Making a strong case for the DB-oriented approach to
data streams.
19
Conclusion
Language Technology:
ESL a very powerful language for data stream and DB
applications
Simple semantics and unified syntax conforming to
SQL:2003 standards
Strong case for the DB-oriented approach to data
streams
System Technology:
Some performance-oriented techniques well-developed—
e.g., buffer management for windows
For others: work is still in progress—stay tuned for
latest news
Stream Mill is up and running: http://wis.cs.ucla.edu/stream-mill
20
References
[1]ATLaS user manual. http://wis.cs.ucla.edu/atlas.
[2]SQL/LPP: A Time Series Extension of SQL Based on Limited Patience Patterns, volume 1677 of
Lecture Notes in Computer Science. Springer, 1999.
[4]A. Arasu, S. Babu, and J. Widom. An abstract semantics and concrete language for continuous queries
over streams and relations. Technical report, Stanford University, 2002.
[5]B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream
systems. In PODS, 2002.
[9]D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul,
and S. Zdonik. Monitoring streams - a new class of data management applications. In VLDB, Hong
Kong, China, 2002.
[10]J. Celko. SQL for Smarties, chapter Advanced SQL Programming. Morgan Kaufmann, 1995.
[11]S. Chandrasekaran and M. Franklin. Streaming queries over streaming data. In VLDB, 2002.
[12]J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for
internet databases. In SIGMOD, pages 379-390, May 2000.
[13]C. Cranor, Y. Gao, T. Johnson, V. Shkapenyuk, and O. Spatscheck. Gigascope: A stream database
for network applications. In SIGMOD Conference, pages 647-651. ACM Press, 2003.
[14]Lukasz Golab and M. Tamer Özsu. Issues in data stream management. ACM SIGMOD Record,
32(2):5-14, 2003.
[15]J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD, 1997.
[16] Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, Carlo Zaniolo: A Data Stream Language and
System Designed for Power and Extensibility. Proc. of the ACM 15th Conference on Information and
Knowledge Management (CIKM'06), 2006
[17] Yijian Bai, Hetal Thakkar, Haixun Wang and Carlo Zaniolo: Optimizing Timestamp Management in
Data Stream Management Systems. ICDE 2007.
21
References (Cont.)
[18] Yan-Nei Law, Haixun Wang, Carlo Zaniolo: Query Languages and Data Models for Database
Sequences and Data Streams. VLDB 2004: 492-503
[19] Sam Madden, Mehul A. Shah, Joseph M. Hellerstein, and Vijayshankar Raman. Continuously adaptive
continuous queries over streams. In SIGMOD, pages 49-61, 2002.
[20]R. Motwani, J. Widom, A. Arasu, B. Babcock, M. Datar S. Babu, G. Manku, C. Olston, J.
Rosenstein, and R. Varma. Query processing, approximation, and resource management in a data
stream management system. In First CIDR 2003 Conference, Asilomar, CA, 2003.
[21]R. Ramakrishnan, D. Donjerkovic, A. Ranganathan, K. Beyer, and M. Krishnaprasad. SRQL: Sorted
relational query language, 1998.
[23]Reza Sadri, Carlo Zaniolo, and Amir M. Zarkesh andJafar Adibi. A sequential pattern query language
for supporting instant data minining for e-services. In VLDB, pages 653-656, 2001.
[24]Reza Sadri, Carlo Zaniolo, Amir Zarkesh, and Jafar Adibi. Optimization of sequence queries in
database systems. In PODS, Santa Barbara, CA, May 2001.
[25]P. Seshadri. Predator: A resource for database research. SIGMOD Record, 27(1):16-20, 1998.
[26]P. Seshadri, M. Livny, and R. Ramakrishnan. SEQ: A model for sequence databases. In ICDE, pages
232-239, Taipei, Taiwan, March 1995.
[27]Praveen Seshadri, Miron Livny, and Raghu Ramakrishnan. Sequence query processing. In ACM
SIGMOD 1994, pages 430-441. ACM Press, 1994.
[28]M. Sullivan. Tribeca: A stream database manager for network traffic analysis. In VLDB, 1996.
[29]D. Terry, D. Goldberg, D. Nichols, and B. Oki. Continuous queries over append-only databases. In
SIGMOD, pages 321-330, 6 1992.
[30]Peter A. Tucker, David Maier, Tim Sheard, and Leonidas Fegaras. Exploiting punctuation semantics in
continuous data streams. IEEE Trans. Knowl. Data Eng, 15(3):555-568, 2003.
[31]Haixun Wang and Carlo Zaniolo. ATLaS: a native extension of SQL for data minining. In Proceedings
of Third SIAM Int. Conference on Data MIning, pages 130-141, 2003.
22
DSMS Research Projects
Aurora (Brandeis/Brown/MIT)
http://www.cs.brown.edu/research/aurora/
Cougar (Cornell) http://www.cs.cornell.edu/database/cougar/
Telegraph (Berkeley)- http://telegraph.cs.berkeley.edu
STREAM (Stanford) –http://www-db.stanford.edu/stream
Niagara (OGI/Wisconsin)-http://www.cs.wisc.edu/niagara/
OpenCQ (Georgia Tech) – http://disl.cc.gatech.edu/CQ
Tapestry (Xerox) – electronic documents stream filtering
Hancock (AT&T)
http://www.research.att.com/~kfisher/hancock/
Cape (WPI) http://davis.wpi.edu/dsrg/CAPE/home.html
Tribeca (Bellcore) – network monitoring
Stream Mill (UCLA) – http://wis.cs.ucla.edu/stream-mill
Gigascope …
23
CQLs for DSMS
Most of DSMS projects use SQL for continuous
queries—for good reasons, since
Many applications span data streams and DB tables
A CQL based on SQL will be easier to learn & use
Moreover: the fewer the differences the better!
But DSMS were designed for persistent data and
transient queries---not for persistent queries on
transient data
Adaptation of SQL and its enabling technology
presents difficult research challenges
These combine with traditional SQL problem, such
as inability to deal with sequences, DM tasks, and
other complex query tasks---i.e., lack of
expressive power
24
Language Problems
Most DSMS projects use SQL — queries spanning both data
streams and DBs will be easier. But …
Even for persistent data, SQL is far from perfect.
Important application areas poorly supported include:
Data Mining, and we need to mine data streams,
Sequence queries, and data streams are infinite time series!
Major new problems for SQL on data stream applications.
(After all, it was designed for persistent data on secondary store, not
for streaming data)
Only NonBlocking operators in DSMS: blocking forbidden
Distinction not clear in DBMS which often use blocking
implementations for nonblocking operators
The distinction needs to formally characterized
and so is the loss of query power of the QL.
25