Formato Base dei Dati - UCLA Computer Science

Download Report

Transcript Formato Base dei Dati - UCLA Computer Science

Continuous Query Languages for DSMS
CS240B Notes
by
Carlo Zaniolo
1
CQLs for DSMS
 Most of DSMS projects use SQL for continuous
queries—for good reasons, since
Many applications span data streams and DB tables
A CQL based on SQL will be easier to learn & use
Moreover: the fewer the differences the better!
 But DSMS were designed for persistent data and
transient queries---not for persistent queries on
transient data
 Adaptation of SQL and its enabling technology
presents many research challenges
 Lack of expressive power—even worse now since
only nonblocking operators are allowed.
3
Continuous Query Graph:
many components—arbitrary DAGs
Source
σ
∑1
Sink
∑2
O2
Sink
O3
Sink
O1
Source

Source1
U
Source2
Source1
Sink
σ

∑1
Sink
∑2
Sink
U
Source2
σ
4
Relational Algebra Operators
Stored data
 Selection, Projection
 Union
Data Streams
 ... same
 Union by Sort-Merging on
 Join (including X) on tables
 Join of Stream with table
 Window joins on streams
(timestamps merged into 1 column)
 Set Difference
 Aggregates:
 Traditional Blocking aggregates
 OLAP functions on windows or
unlimited preceding
timestamps
 No stream difference (blocking—diff of
stream with table OK).
 Aggregates:
 No blocking aggregate
 OLAP functions on windows or
unlimited preceding
 Slides, and tumbles.
5
Bolts and Nuts
create stream
bids(bid#, item, offer, Time)
create stream mybids as (select bid#, offer, Time
from bids
where item=bolt
union
select bid#, offer, Time
from bids
where item=nut)
Result same as: select bid#, offer, Time where
item= bolt or item=nut
6
\
Joins
We could create a stream called interesting bids by say joining
bids with the ‘interesting_items’ table.
We next find the bolt bids for which there was a nut bid offered
in the last 5 minutes for the same price.
create stream selfjoinbids as
(select S1.bid#, S1.offer, S2.bid#, S2.Time
from bids as S1, bids as S2 [window of 5 minutes]
where S1.item=bolt and S2.item=nut and
S1.offer=S2.offer)
The window condition implies that
S1.Time >= S2.Time and S2.Time >= S1.Time-5 minutes.
Windows on both streams are used very often.
7
Processing Unions
Source1
σ
U
Source2
Sink
σ
Union: When tuples are present at all inputs, select
one with minimal timestamp and
Production: add this tuple to the output, and
Consumption: remove it from the input.
8
Window Joins
SourceA
σ
A
join
SourceB
σ
Sink
B
Window Join of Stream A and Stream B:
When tuples are present at both inputs, and the timestamp of
A is less or equal than that of B, then perform the following
operations (symmetric operations are performed if timestamp
of B is less or equal than that of A):
Production: compute the join of the tuple in A with the tuples in
W(B) and add the resulting tuples to output buffer (these tuple
have the same timestamp a the tuple in A)
Consumption: the current tuple in A is removed from the input and
added to the window buffer W(A) (from which the expired
tuples are also removed)
9
Relational Algebra Operators
Stored data
Data Streams
 Selection, Projection
 Union
 ... same
 Union by Sort-Merging on timestamps
 Join of Stream with table
 Window joins on streams (timestamps
 Join (including X) on tables

Set Difference
 Aggregates:

merged into 1 column)
No stream difference (blocking—diff of
stream with table OK).
 Aggregates:
 Traditional Blocking aggregates
 No blocking aggregate
 OLAP functions on windows or
unlimited preceding
 OLAP functions on windows or
unlimited preceding
 Slides, and tumbles.
 Including UDAs
10
User-Defined Aggregates:
Max Power via Min SQL Extensions
 Windows (logical, physical, slides, tumbles,…):
flexible synopses that solve the blocking problem for
aggregates
 DSMS only support these constructs on built-in aggregates
ESL is the first to support the complete integration of these
two
 User Defined Aggregates (UDAs) —the key to power
and extensibility, and
And thus can support data mining,
XML,
sequences not supported by other DSMS
 One framework for aggregates and windows, whether
they are built-ins or user-defined, and
independent on the language used to
define them.
11
Defining Traditional Aggregates
 Specification consists of 3 blocks of code--- Written in an
external PL (as DBMS and other DSMS do), or
 In SQL itself (SQL becomesTuring Complete!)
 INITIALIZE
Executed upon the arrival of the first tuple
 ITERATE
Executed upon the arrival of each subsequent tuples (an incremental
computation suitable for streams)
 TERMINATE
Executed after the end of the relation/stream has been reached
 Invocation:
SELECT myavg(start_price) FROM OpenAuction
12
The UDA AVG in SQL
AGGREGATE avg(Next Int) : Real
{ TABLE state(tsum Int, cnt Int);
INITIALIZE : {
INSERT INTO state VALUES (Next, 1);
}
ITERATE : {
UPDATE state
SET tsum=tsum+Next, cnt=cnt+1;
}
TERMINATE : {
INSERT INTO RETURN
SELECT tsum/cnt FROM state;
}
}

“INSERT INTO RETURN” in TERMINATE  a blocking UDA
13
NonBlocking UDA: AVG of last 200 Values
AGGREGATE myavg(Next Int) : Real
{TABLE state(tsum Int, cnt Int);
INITIALIZE : {
INSERT INTO state VALUES (Next, 1);
}
ITERATE : {
UPDATE state SET tsum=tsum+Next, cnt=cnt+1;
INSERT INTO RETURN
SELECT tsum/cnt FROM state
WHERE cnt %200 =0;
UPDATE state SET tsum=Next, cnt=1
WHERE cnt %200 =1
}
TERMINATE : { }
}
Empty
TERMINATE Denotes a non-blocking UDA
14
UDAs in ESL
In ESL user-defined Aggregates (UDAs)
can be defined directly in SQL, rather
than in a PL
Native extensibility in SQL via UDAs (which can
also be defined in a PL for better performance)
No impedance mismatch
Access to DB tables from UDAs
Data Independence and optimization
Good ease of use and performance
Turing completeness & nb-completeness.
15
Data Intensive Applications & UDAs
 Complex Applications can expressed concisely, with
good performance
 ATLAS: a single-user DBMS developed at UCLA.
Support for SQL with UDAs
On top of Berkeley-DB record manager.
 Data Mining Algorithms in ATLAS
Decision Tree Classifiers: 18 lines of codes
APriori: 40 lines of codes
Modest overhead: <50% w.r.t procedural UDA
 Data Stream Applications in ESL
Data Stream Mining, approximate aggregates, sketches,
histograms, …
16
SQL:2003 OLAP Functions
Aggregates on Windows
CREATE STREAM ClosedAuction (/*auction closings */
itemID, /*id of the item in this auction.*/
Auctions
buyerID /*buyer of this item.*/)
Final price real /*final price of the item */,
Current_time) order by … source …
For
each seller, show the average selling price over the
last 10 items sold (physical window)
CREATE STREAM LastTenAvg
SELECT sellerID,
AVG(price) OVER(PARTITION BY sellerID
ROWS 9 PRECEDING),
Current_time
FROM ClosedPrice;
17
Optimizing Window AVG in ESL
•For each expired tuple decrease the count by one and the sum
by the expired value—works for logical & physical windows
WINDOW AGGREGATE avg(Next Real) : Real
{
TABLE state(tsum Int, cnt Real);
TABLE inwindow(wnext Real);
INITIALIZE : {
INSERT INTO state VALUES (Next, 1)}
ITERATE : {
UPDATE state SET tsum=tsum+Next, cnt=cnt+1;
INSERT INTO RETURN
SELECT tsum/cnt FROM state}
EXPIRE: { /*if there are expired tuples, take the oldest */
UPDATE state
SET cnt= cnt-1,
tsum = tsum – (select wnext
FROM inwindow
WHERE oldest(inwindow)) }
}
18
MAX
System maintains inwindow
Remove dominated (less & older) values
The oldest is always the max.
WINDOW AGGREGATE max (Next Real) : Real
{
TABLE inwindow(wnext real);
INITIALIZE : { etc.} /*system adds new tuples to inwindow*/
ITERATE : { DELETE FROM inwindow
WHERE wnext <Next;
INSERT INTO RETURN
SELECT wnext FROM inwindow
WHERE oldest(inwindow) }
EXPIRE: { } /*expired tuples removed automatically*/
}
19
For Each Aggregate two versions
The traditional Base aggregate with
terminate
The Window aggregate with inwindow and
expire.
These definitions will take care of both
logical and physical windows.
But there are more complications: slides and
tumbles.
20
Slides and Tumbles
Every two minutes, show the average selling price over the last
10 minutes (logical window)

CREATE STREAM LastTenAvg
SELECT sellerID,
max(price) OVER(RANGE 10 MINUTE PRECEDING
SLIDE 2 MINUTE),
Current_time
FROM ClosedPrice;
Here the window is W=10 and the slide is S=2.
Tumble: When S ≥ W
21
SLIDEs
Summary Tuples
slide/pane
window
window
 The slide constructs divides a window into panes,
results only returned at the end of each pane
 Slide is conducive to optimization.
Combine summaries into the desired aggregation
E.g.: MAX(1, 2, 3, 4)= MAX(MAX(1,2), MAX(3,4)) = 4
I.e., for MAX, we can perform MAX on subsets of numbers as
local summaries, then combine them together to get the true
MAX
Proposed before: but what constructs should be used to
integrate these concepts into the language?
22
Slides &Tumbles--Examples
 Tumble – where the SLIDE size is equal or larger than
the window size
E.g. Once every 50 tuples, compute and return average over the
last 10 tuples
Easy to optimize
Skip the first 40 tuples of every 50 tuples, and compute the
blocking base version of the aggregate on the last 10
 Slide – where slide size is smaller than the window size
E.g. Once every 10 tuples, compute and return average over the
last 50 tuples
Naïve implementation--not optimized
Perform incremental maintenance on every incoming tuple
Ignore RETURN statements for most incoming tuples
Only invoke RETURN once every 10 tuples
23
Pane-based SLIDE Optimization
 Two-level cascading aggregates using two existing aggregates
 Perform sub-aggregation inside each pane using the base aggregate
No need for incremental maintenance here
Computed with a blocking aggregate once for each pane
 Combine the summary tuples using the window aggregate that returns
on every incoming tuple (non-blocking)
With incremental maintenance here
At any time, only the last un-finished pane needs to store data tuples
all finished panes are reduced to one reusable summary tuple
Agg1 (base)
window
Agg2 (window)
window
24
Pane-based SLIDE optimization
Example: SUM with window size 50 tuples, and slide size 10
tuples
 First create a stream of summary tuples using base
aggregate
CREATE STREAM temp AS (
SELECT itemID, base_max(sale_price) as s
OVER(PARTITION BY itemID ROWS 9 PRECEDING
SLIDE 10)
FROM Auction);
 Then apply the window version of the aggregate
SELECT itemID, window_max(s)
OVER(PARTITION BY itemID ROWS 4 PRECEDING)
FROM temp;
• This simple approach can be used to implement very complex
aggregations (e.g. ensemble classifiers)
• Applies uniformly to logical/physical windows defined in SQL or in an
external language
25
Summary
{ Logical, Physical} x
{tumble, slide, unlimited_preceding}
Six different types of calls, supported by two
definitions
Both SQL or procedural languages can be
used in the definition.
26
Window UDAs vs. Base UDAs
 Base UDAs:
called as traditional SQL-2 aggregates, with
optional GROUP BY
 Window UDAs:
called with SQL:2003 OVER clause
logical or physical windows
optional PARTITION BY and SLIDE clauses in ESL
 Clear semantics and optimization rules unify:
UDAs—SQL or PL-defined, algebraic or not …
 window (logical & physical), slice, tumbles, etc.
System and user roles in optimization.
27
Window UDAs: Physical Optimization
 The Stream Mill System provides efficient
support for:
 Management of new & expiring tuples in buffer
 Main memory & intelligent paging into disk
 Events caused by tuple expiration
 Users can access the buffer as the table called inwindow
28
Conclusion
 Language Technology:
ESL a very powerful language for data stream and DB
applications
Simple semantics and unified syntax conforming to
SQL:2003 standards
Strong case for the DB-oriented approach to data
streams
 System Technology:
Some performance-oriented techniques well-developed—
e.g., buffer management for windows
For others: work is still in progress—stay tuned for
latest news
 Stream Mill is up and running: http://wis.cs.ucla.edu/stream-mill
29
*********
The End
THANK YOU !
*****
30
References
[1]ATLaS user manual. http://wis.cs.ucla.edu/atlas.
[2]SQL/LPP: A Time Series Extension of SQL Based on Limited Patience Patterns, volume 1677 of
Lecture Notes in Computer Science. Springer, 1999.
[4]A. Arasu, S. Babu, and J. Widom. An abstract semantics and concrete language for continuous queries
over streams and relations. Technical report, Stanford University, 2002.
[5]B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream
systems. In PODS, 2002.
[9]D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul,
and S. Zdonik. Monitoring streams - a new class of data management applications. In VLDB, Hong
Kong, China, 2002.
[10]J. Celko. SQL for Smarties, chapter Advanced SQL Programming. Morgan Kaufmann, 1995.
[11]S. Chandrasekaran and M. Franklin. Streaming queries over streaming data. In VLDB, 2002.
[12]J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for
internet databases. In SIGMOD, pages 379-390, May 2000.
[13]C. Cranor, Y. Gao, T. Johnson, V. Shkapenyuk, and O. Spatscheck. Gigascope: A stream database
for network applications. In SIGMOD Conference, pages 647-651. ACM Press, 2003.
[14]Lukasz Golab and M. Tamer Özsu. Issues in data stream management. ACM SIGMOD Record,
32(2):5-14, 2003.
[15]J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD, 1997.
[16] Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, Carlo Zaniolo: A Data Stream Language and
System Designed for Power and Extensibility. Proc. of the ACM 15th Conference on Information and
Knowledge Management (CIKM'06), 2006
[17] Yijian Bai, Hetal Thakkar, Haixun Wang and Carlo Zaniolo: Optimizing Timestamp Management in
Data Stream Management Systems. ICDE 2007.
31
References (Cont.)
[18] Yan-Nei Law, Haixun Wang, Carlo Zaniolo: Query Languages and Data Models for Database
Sequences and Data Streams. VLDB 2004: 492-503
[19] Sam Madden, Mehul A. Shah, Joseph M. Hellerstein, and Vijayshankar Raman. Continuously adaptive
continuous queries over streams. In SIGMOD, pages 49-61, 2002.
[20]R. Motwani, J. Widom, A. Arasu, B. Babcock, M. Datar S. Babu, G. Manku, C. Olston, J.
Rosenstein, and R. Varma. Query processing, approximation, and resource management in a data
stream management system. In First CIDR 2003 Conference, Asilomar, CA, 2003.
[21]R. Ramakrishnan, D. Donjerkovic, A. Ranganathan, K. Beyer, and M. Krishnaprasad. SRQL: Sorted
relational query language, 1998.
[23]Reza Sadri, Carlo Zaniolo, and Amir M. Zarkesh andJafar Adibi. A sequential pattern query language
for supporting instant data minining for e-services. In VLDB, pages 653-656, 2001.
[24]Reza Sadri, Carlo Zaniolo, Amir Zarkesh, and Jafar Adibi. Optimization of sequence queries in
database systems. In PODS, Santa Barbara, CA, May 2001.
[25]P. Seshadri. Predator: A resource for database research. SIGMOD Record, 27(1):16-20, 1998.
[26]P. Seshadri, M. Livny, and R. Ramakrishnan. SEQ: A model for sequence databases. In ICDE, pages
232-239, Taipei, Taiwan, March 1995.
[27]Praveen Seshadri, Miron Livny, and Raghu Ramakrishnan. Sequence query processing. In ACM
SIGMOD 1994, pages 430-441. ACM Press, 1994.
[28]M. Sullivan. Tribeca: A stream database manager for network traffic analysis. In VLDB, 1996.
[29]D. Terry, D. Goldberg, D. Nichols, and B. Oki. Continuous queries over append-only databases. In
SIGMOD, pages 321-330, 6 1992.
[30]Peter A. Tucker, David Maier, Tim Sheard, and Leonidas Fegaras. Exploiting punctuation semantics in
continuous data streams. IEEE Trans. Knowl. Data Eng, 15(3):555-568, 2003.
[31]Haixun Wang and Carlo Zaniolo. ATLaS: a native extension of SQL for data minining. In Proceedings
of Third SIAM Int. Conference on Data MIning, pages 130-141, 2003.
32