slides - UCLA Computer Science

Download Report

Transcript slides - UCLA Computer Science

Continuous Query Languages for DSMS
CS240B Notes
by
Carlo Zaniolo
1
Blocking Operators
 A blocking query operator is ‘one that is unable to produce
the first tuple of the output until it has seen the entire
input’ [Babcock et al. PODS02]
 But continuous queries cannot wait for the end of the
stream: must return results while the data is streaming in.
Blocking operators cannot be used.
 Only non-blocking (nb) queries and operators can be used on
data streams (i.e. those that return their results before
they have detected the end of the input).
 Current DBMSs make heavy usage of blocking computations:
1. For operators that are intrinsically blocking: e.g., SQL
aggregates,
2. And for those that are not: e.g., sort-based implementation of
joins and group by
3. We only need to be concerned with 1: find a characterization
for blocking & nonblocking independent of implementation.
3
Partial Ordering
Let S = [ t1, , tn] be a sequence and 0  k  n.
Then [t1, , tk ] is said to be the presequence of S,
of length k, denoted by Sk.
We write L  S to denote that L is a presequence of
S,
 Defines a Partial Order: reflexive, antisymmetric
and transitive.
 generalizes to the subset notion when order and
duplicates are immaterial
The empty sequence, [ ], is a subsequence of every
other sequence.
4
employees(E#, Dept)
select dept, count(E#)
from employees
group by dept
select dept, count(E#)
over (partition by dept
range unbounded preceding)
from employees
 Traditional SQL-2
aggregates: Blocking
 SQL:2003 OLAP functions:
Non-Blocking
Continuous count returns, for each new tuple, the count so
far. Consider a sequence of length n. At each step j<n, j is
returned  cumulative return up to j:
sumj (S)= [1,2, …, j] independent on whether j=n or j<n.
Traditional count:
For each j<n --nothing: sumj (S)=[]
Final: sumn (S)=[n]
5
Examples
• Traditional SQL-2 aggregates are blocking—
SQL:2003 OLAP functions are not.
• Selection is nonblocking.
• Continuous count (i.e., the unlimited preceding
count of OLAP functions) is non-blocking
• Also window aggregates are non-blocking
• In between cases: e.g., traditional aggregates on
input that is already sorted on group-by values.
7
Characterization of NonBlocking (nb)
 Many functions expressible by nb-computations can also be
expressed by blocking ones. E.g., joins can be implemented using
sorting. Ditto for projections with duplicate elimination.
 But many functions implemented using blocking computation cannot
be given an nb-implementation.
 We must distinguish between the two kinds of functions, since one
can be supported in our DSMS (via suitable nb-implementation) and
the other cannot.
Theorem: Queries can be expressed via nb
computations iff they are monotonic w.r.t. the
presequence ordering.
8
NB-completeness
 A query language L can express a given set of functions F on its
input (DB, sequences, data streams)---the larger F, the greater
the expressive power of L.
 Non-monotonic functions are intrinsically blocking and they
cannot be used on data streams. Thus, if we use L in a DSMS,
we give up the non-monotonic subset of F with no regret.
However, let us make sure that we do not give up anything more!
 More? Yes, because for continuous queries of streams, we will
normally disallow L’s blocking (i.e. nonmonotonic) operators &
constructs, and only allow nb (i.e., monotonic ) operators.
 But are ALL the monotonic functions expressible by L using the
nb-operators of L ? Or by disallowing blocking operators did we
also lose the ability of expressing some monotonic queries?
Definition: L is said to be nb-complete when it can express all
the monotonic queries expressible by L using only its nboperators.
9
Expressive Power and NB-Completeness
 NB-completeness is a test that a language is as
suitable for continuous queries on data streams as
it is on stored database.
 In a language L lacking nb-completeness, there are
monotonic functions that L cannot express as
continuous queries, that L can express if the
stream had been stored in a database.
 For instance, Relational Algebra and SQL are not
nb-complete (in addition to the shortcomings they
might have on DBs).
10
Sets versus Sequences
 Sets are sequences where duplicates are allowed and
order is immaterial.
 Thus S1 is a subset of S2 iff S1 can be reordered in
a presequence of S2.
 Theorem [Lifted from sequences to sets]. A function
is is nb iff it is monotonic.
 NB=monotonic: selection, projection, and OLAP
functions
 Blocking=Non-Monotonic: e.g. Traditional aggregates.
 Operators of more than one argument:
 Join are monotonic (i.e., NB) in both arguments.
R-S is monotonic on R and antimonotonic on S: i.e., will block
on S but not on R (after it has seen the whole S, though)
11
Relational Algebra (RA)
 Set difference can produce monotonic queries: Intersection: R1
R2 = R1 - (R1 - R2)
 Are these still expressible without set diff?
 Intersection can be expressed as a joins: product+select
 But interval coalescing and Until queries are monotonic queries that can be
expressed in RA but not in nb-RA.
 Example: Temporal domain isomorfic to nonnegative
integers.Intervals closed to the left but open to the right:
p(0, 3).
p(2, 4).
p(4, 5).
p(6, 8).
% 0,1, and 2 are in p but 3 is not
% 3 is not a hole because is covered by this
% 5 is a hole because not covered by any other interval
12
Coalesce p (cp) & p Until q
p(0, 3). p(2, 4). p(4, 5).
cp(0, 3). cp(2, 4). cp(4, 5).
p(6, 8).
cp(6, 8). cp(0, 4).
cp(2, 5). cp(0,5).
cp contains intervals from the start point of any p interval to the endpoint of
any p interval unless the endpoint of an interval in between is a hole.
cp(I1, J2)  p(I1, J1), p(I2, J2), J1 < J2, hole(I1, J2).
hole(I1, J2)  p(I1, J1), p(I2, J2), p(_,K), J1  K, K < I2, cep(K).
cep(K)  p(_, K), p(I, J), I  K, K < J.
q(5,_) holds if cp has an interval that starts at 0 & contains 5
pUntil q(yes)  q(0, J).
pUntil q(yes)  cp(0, I), q(J, _), I  J .
13
Relational Algebra
 NonMonotonic (i.e., blocking) RA operators: set
difference and division
 We are left with: select, project, join, and union. Can
these express all FO monotonic queries?
 Some interesting temporal queries: coalesce and until
They are expressible in RA (by double negation)
They are monotonic
They cannot be expressed in nb-RA.
Theorem: RA and SQL are not nb-complete.
SQL faces two problems: (i) the exclusion of EXCEPT/NOT
EXISTS, and
(ii) the exclusion of aggregates.
14
E-Bay Example
 Auctions: a stream of bids on an item.
bidStream(Item#, BidValue, Time)
 Items for which sum of bids is > 100K
SELECT Item#
FROM bidStream
GROUP BY Item#
HAVING SUM(BidValue) > 100000;
 This is a monotonic query. Thus it can be expressed in a
language containing suitable query operators, but not in SQL-2.
SQL-2 is not nb-complete; thus it is ill-suited for continuous
queries on data streams.
 So SQL-2 is not nb-complete because of its blocking
aggregates. What about relational algebra?
15
Incompleteness of Relational QL
The coalesce and until queries
can be expressed in safe nonrecursive Datalog,
thus
They are expressible in RA,
They are monotonic
They cannot be expressed in nb-RA
Theorem: RA and SQL are not nb-complete.
 A new limitation for DB query languages (which
were already severely challenged in terms of
expressive power)
16
Embedding SQL Queries in a PL
 In DB applications, SQL can be embedded in a PL (Java,
C++…) where the PL accesses the tuples returned by SQL
using a Get Next of Cursor statement.
 Operations that could not be expressed in SQL can then be
expressed in the PL:
 an effective remedy for the lack of expressive power of SQL
 But cursors is a ‘pull-based’ mechanism and cannot be used
on data streams: the DSMS cannot hold tuples until the PL
request them.
 The DSMS can only deliver its output to the PL as a
stream—This is OK to drive a GUI. But if most of the work
has not been done yet, who is the DSMS?
 Contrast this to DBMS who are useful even with a weak QL.
17
Reviewing the Situation
 SQL’s lack of expressive power is a major problem
for database-centric applications.
 These problems are significantly more serious for
data streams since:
Only monotonic queries can be used,
Actually, not even all the monotonic ones since SQL is not
nb-complete,
These problems cannot be really by using PLs with
embedded SQL statements on streams
 DSMS will be impaired--unless significant
improvements can be made.
18
UDAs to the Rescue
Full support for UDAs with all window combinations—
effective on UDAs written in SQL, PLs, and even builtins
Support for continuous queries and ad hoc queries, under
a simple and unified semantics
Turing completeness --all possible queries
nb-completeness all monotonic queries using only nonblocking operators (e.g., window UDAs & those without
TERMINATE)
Effective on a broad range of data-intensive applications:
data/stream mining, approximate queries, sequential
patters (XML not there)
Making a strong case for the DB-oriented approach to
data streams.
19
Conclusion
 Language Technology:
ESL a very powerful language for data stream and DB
applications
Simple semantics and unified syntax conforming to
SQL:2003 standards
Strong case for the DB-oriented approach to data
streams
 System Technology:
Some performance-oriented techniques well-developed—
e.g., buffer management for windows
For others: work is still in progress—stay tuned for
latest news
 Stream Mill is up and running: http://wis.cs.ucla.edu/stream-mill
20
References
[1]ATLaS user manual. http://wis.cs.ucla.edu/atlas.
[2]SQL/LPP: A Time Series Extension of SQL Based on Limited Patience Patterns, volume 1677 of
Lecture Notes in Computer Science. Springer, 1999.
[4]A. Arasu, S. Babu, and J. Widom. An abstract semantics and concrete language for continuous queries
over streams and relations. Technical report, Stanford University, 2002.
[5]B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream
systems. In PODS, 2002.
[9]D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul,
and S. Zdonik. Monitoring streams - a new class of data management applications. In VLDB, Hong
Kong, China, 2002.
[10]J. Celko. SQL for Smarties, chapter Advanced SQL Programming. Morgan Kaufmann, 1995.
[11]S. Chandrasekaran and M. Franklin. Streaming queries over streaming data. In VLDB, 2002.
[12]J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for
internet databases. In SIGMOD, pages 379-390, May 2000.
[13]C. Cranor, Y. Gao, T. Johnson, V. Shkapenyuk, and O. Spatscheck. Gigascope: A stream database
for network applications. In SIGMOD Conference, pages 647-651. ACM Press, 2003.
[14]Lukasz Golab and M. Tamer Özsu. Issues in data stream management. ACM SIGMOD Record,
32(2):5-14, 2003.
[15]J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD, 1997.
[16] Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, Carlo Zaniolo: A Data Stream Language and
System Designed for Power and Extensibility. Proc. of the ACM 15th Conference on Information and
Knowledge Management (CIKM'06), 2006
[17] Yijian Bai, Hetal Thakkar, Haixun Wang and Carlo Zaniolo: Optimizing Timestamp Management in
Data Stream Management Systems. ICDE 2007.
21
References (Cont.)
[18] Yan-Nei Law, Haixun Wang, Carlo Zaniolo: Query Languages and Data Models for Database
Sequences and Data Streams. VLDB 2004: 492-503
[19] Sam Madden, Mehul A. Shah, Joseph M. Hellerstein, and Vijayshankar Raman. Continuously adaptive
continuous queries over streams. In SIGMOD, pages 49-61, 2002.
[20]R. Motwani, J. Widom, A. Arasu, B. Babcock, M. Datar S. Babu, G. Manku, C. Olston, J.
Rosenstein, and R. Varma. Query processing, approximation, and resource management in a data
stream management system. In First CIDR 2003 Conference, Asilomar, CA, 2003.
[21]R. Ramakrishnan, D. Donjerkovic, A. Ranganathan, K. Beyer, and M. Krishnaprasad. SRQL: Sorted
relational query language, 1998.
[23]Reza Sadri, Carlo Zaniolo, and Amir M. Zarkesh andJafar Adibi. A sequential pattern query language
for supporting instant data minining for e-services. In VLDB, pages 653-656, 2001.
[24]Reza Sadri, Carlo Zaniolo, Amir Zarkesh, and Jafar Adibi. Optimization of sequence queries in
database systems. In PODS, Santa Barbara, CA, May 2001.
[25]P. Seshadri. Predator: A resource for database research. SIGMOD Record, 27(1):16-20, 1998.
[26]P. Seshadri, M. Livny, and R. Ramakrishnan. SEQ: A model for sequence databases. In ICDE, pages
232-239, Taipei, Taiwan, March 1995.
[27]Praveen Seshadri, Miron Livny, and Raghu Ramakrishnan. Sequence query processing. In ACM
SIGMOD 1994, pages 430-441. ACM Press, 1994.
[28]M. Sullivan. Tribeca: A stream database manager for network traffic analysis. In VLDB, 1996.
[29]D. Terry, D. Goldberg, D. Nichols, and B. Oki. Continuous queries over append-only databases. In
SIGMOD, pages 321-330, 6 1992.
[30]Peter A. Tucker, David Maier, Tim Sheard, and Leonidas Fegaras. Exploiting punctuation semantics in
continuous data streams. IEEE Trans. Knowl. Data Eng, 15(3):555-568, 2003.
[31]Haixun Wang and Carlo Zaniolo. ATLaS: a native extension of SQL for data minining. In Proceedings
of Third SIAM Int. Conference on Data MIning, pages 130-141, 2003.
22
DSMS Research Projects
 Aurora (Brandeis/Brown/MIT)
http://www.cs.brown.edu/research/aurora/
 Cougar (Cornell) http://www.cs.cornell.edu/database/cougar/
 Telegraph (Berkeley)- http://telegraph.cs.berkeley.edu
 STREAM (Stanford) –http://www-db.stanford.edu/stream
 Niagara (OGI/Wisconsin)-http://www.cs.wisc.edu/niagara/
 OpenCQ (Georgia Tech) – http://disl.cc.gatech.edu/CQ
 Tapestry (Xerox) – electronic documents stream filtering
 Hancock (AT&T)
http://www.research.att.com/~kfisher/hancock/
 Cape (WPI) http://davis.wpi.edu/dsrg/CAPE/home.html
 Tribeca (Bellcore) – network monitoring
 Stream Mill (UCLA) – http://wis.cs.ucla.edu/stream-mill
 Gigascope …
23
CQLs for DSMS
 Most of DSMS projects use SQL for continuous
queries—for good reasons, since
Many applications span data streams and DB tables
A CQL based on SQL will be easier to learn & use
Moreover: the fewer the differences the better!
 But DSMS were designed for persistent data and
transient queries---not for persistent queries on
transient data
 Adaptation of SQL and its enabling technology
presents difficult research challenges
 These combine with traditional SQL problem, such
as inability to deal with sequences, DM tasks, and
other complex query tasks---i.e., lack of
expressive power
24
Language Problems

Most DSMS projects use SQL — queries spanning both data
streams and DBs will be easier. But …
 Even for persistent data, SQL is far from perfect.
Important application areas poorly supported include:
 Data Mining, and we need to mine data streams,
 Sequence queries, and data streams are infinite time series!
 Major new problems for SQL on data stream applications.
(After all, it was designed for persistent data on secondary store, not
for streaming data)
 Only NonBlocking operators in DSMS: blocking forbidden
 Distinction not clear in DBMS which often use blocking
implementations for nonblocking operators
The distinction needs to formally characterized
 and so is the loss of query power of the QL.
25