slide - UCLA Computer Science
Download
Report
Transcript slide - UCLA Computer Science
Data Stream Management Systems-Supporting Stream Mining Applications
Carlo Zaniolo
CS240B
2017
1
Three Main Topics
Data Stream Management Systems (DSMS):
efficient support for continuous queries
With Quality of Service (QoS) Guarantees
Data Stream Mining Algorithms
fast & light algorithms to support classical
mining tasks (e.g. classification, association,
clustering) with online response
Combining these into a Data Stream Mining
workbench
E.g. an online version of Weka.
Data Stream Management Systems--Supporting
Data Stream Mining Applications
In the age of the Internet, massive amounts of information are continuously
exchanged as data streams that are then processed by on-line applications
of increasing complexity. For such advanced applications, a store-now and
process-later approach cannot be used because of real time (or quasi realtime) requirements and excessive data rates. Therefore, current research
seeks to develop a new generation of information management systems,
called Data Stream Management Systems (DSMS), that can support complex
applications on massive data streams with Quality of Service (QoS)
guarantees. This work has produced novel techniques, research prototypes,
startup companies, and the successful deployment of DSMS in many
applications, including network traffic analysis, transaction log analysis,
intrusion detection, credit-card fraud detection, click stream analysis, and
algorithmic trading.
3
DSMS (cont.)
In particular, only monotonic queries and non-blocking
operators can be used. Also, the unbounded streams must be
represented by synopses, such as windows containing the
most recent tuples in the streams. Thus the semantics of basic
operators such as joins and aggregates must be revised for
windows. At the implementation level, we have new query
optimization techniques that seek to minimize response time
and memory utilization. Load shedding techniques based on
samples and sketches are used to achieve QoS under
overload conditions. The first part of the course, will cover
these techniques and the architectures of the main DSMS
systems.
4
Data Streams
The first part of the course, will cover these techniques and
the architectures of the main DSMS systems.
The second part of the course will focus on the data stream
mining problem that represents a vibrant area of new
research. Past work concentrated on devising data mining
algorithms that (i) are fast and light enough for on line
applications, and (ii) can cope with the concept shifts and
drifts that are often present in data streams
5
Topic 1: The Big Picture DBMS from 9000 meters
Data Stream Management Systems (DSMS)
Overview of Applications, Systems, and commercial
ventures
Main Research Challenges:
Data models and timestamps
Query constructs and execution models
Optimization and Scheduling
Approximation and Load Shedding
The bigger Picture
XML
CEP
6
Data Streams
Unbounded, rapid, time-varying streams of data elements,
continuous flowing on the internet and broad-band
communication channels
Data Stream Management Systems (DSMS) are designed to
process them continuously, since a store now and process later
approach will not work due to:
Response requirements: real time or quasi real-time
Streams are too massive, and also bursty. Online filtering of
interesting data for in-depth analysis later.
Many applications similar to those of DBMS. Most DSMS use
some form of SQL
Computing environments quite different. E.g., persistent
queries on transient data, vs. transient queries on persistent
data
7
Systems and Technology
Several Research prototypes (next slide)
High-tech ventures startups+ DB vnedors
Apama (Software AG)
Truviso: network analytics (CISCO)
SQLstream (Amazon)
AURORA,StreamBase Systems, Inc.
PIPES, webMethods Business Events
StreamInsight (Microsoft)
InfoSphere Streams (IBM)
SAS Event Stream Processing Engine
Pipeline DB
Complex Event Processing (CEP) systems:
More general functionality than continuous queries
Building on middleware/Java rather than DBs and SQL
8
Many Research Projects …
Amazon/Cougar (Cornell) – sensors
Aurora (Brown/MIT) – sensor monitoring, dataflow
Hancock (AT&T) – Telecom streams
Niagara (OGI/Wisconsin) – Internet DBs & XML
OpenCQ (Georgia) – triggers, view maintenance
Stream (Stanford) – general-purpose DSMS
Tapestry (Xerox) – pubish/subscribe filtering
Telegraph (Berkeley) – adaptive engine for sensors
Tribeca (Bellcore) – network monitoring
Stream Mill (UCLA) - power & extensibility
Gigascope: AT&T Labs – Network Monitoring
... All SQL-based, except Hancock
9
Client-Server Architecture
Clients
Register
Query
Streamed
Result
DSMS
Server
Input streams
Scratch Store
Stored
Relations
10
Databases vs. DSMS
DBMS
DSMS
Model: persistent data
Model: transient data
Table: set|bag of tuples
Updates: All
Query: transient
Query Answer: exact
Query Eval. multi-pass
Operator: blocking OK
Query Plan: fixed
Infinite sequence of tuples
Updates: append only
Query: persistent
Query Answer: Often approx
Query Eval. one-pass
Operators: unblocking only
Query Plan: adaptive
11
Research Challenges: Data Models
Relational Data Streams
Each data stream consists of relational tuples
The stream can be modelled as an append-only
relation
But repetitions are allowed and order is very
important!
Order based on timestamps—or arrival order
12
Timestamps
Data streams are (basically) ordered according to their
timestamps
The meaning of RA operators such as unions an joins is based on
timestamps
External timestamps
Injected by data source
Model real-world event represented by tuple
Tuples may be out-of-order--but if near-ordered the DSMS can
reorder them using small buffers
Internal timestamps
Introduced as special field by the DSMS
According to the approximate time of arrival
Missing (will call them latent in Stream Mill)
The system assigns no timestamp to arriving tuples,
But tuples are still processed as ordered sequences
Operators whose semantics requires timestamps can generate them
as/when needed
13
Data Stream Query Languages
(SQL-like)
Continuous queries and
Blocking Operators
14
Cascading of Streams
CREATE STREAM
OpenAuction (itemID int, sellerID char(10), price real, start
time timestamp)
ORDER BY start time; /*external timestamps*/
SOURCE ’port4445’;
CREATE STREAM expensiveItems AS
SELECT itemID, sellerID, price, start time
FROM OpenAuction WHERE price > 1000
SELECT itemID, price, start time
FROM expensiveItems WHERE sellerID= `ABCwarehouse`
Source
port4445
OpenAuction
σ
ExpensiveItems
σπ
Sink
15
A More General Query Graph
Q1
Q2
∑
U
⋈
Stream1
Stream2
Stream3
16
SQL: DSMS versus DBMS
Cascading of streams similar to composition of views:
optimizer can combine them as virtual views or pipeline results
through queues
Syntax/semantic uniformity (between DSMS and DBMS)
highly desirable, since it
Minimizes confusion/learning curve
Queries for DSMS can be tested on DBMS: actually might be a
practical solution for modest arrival rates
Vice-versa might not work because of:
Three Major Complications:Order is critical
1. Order is critical
2. Memory is limited
3. Blocking queries disallowed
New SQL standards overcame 1: e.g. OLAP Functions with
Windows and Sequence Patterns, but 2 and 3 remain.
17
Surviving the 3 Complications
1. ORDER:
Union of two or more streams becomes a merge that
Preserves the order of the incoming streams
the order of their timestamps (if these are present)
2. Limited Memory:
Only a synopsis of each stream can be memorized---e.g., a window
containing the last 10000 tuples or those that have arrived in the
last 6 minutes. The window defines a time changing table
Join (of each arriving tuple in) the data stream with a table is
simple.A widow on a data stream can be viewed as a table.
Symmetric joins of data streams are thus supported by creating wi
windows on each streams and computing the join on such window.
3. Blocking Queries Disallowed:
…next slide
18
Surviving the 3 Complications (cont.)
1. ORDER
2. Limited Memory:
3. Blocking Queries Disallowed:
If any operator part of the query is blocking the query is
also blocking
Blocking operators are NonMonotonic operators—which
are thus disallowed (e.g., set diff and SQL-2 aggregagtes)
Significant loss of expressive power
19
Join of Stream S with a Table T
(where T is a DB relation or a Window on a Stream)
When a new tuple z with timestamp ts(z)
arrives in S, join it with all the tuples in T.
-
ts(z) is the timestamp of tuples so produced
If T is a window on a stream S’
-
E.g., 30 minutes preceding current tuple in S’ or
499 tuples preceding current one
T must contain all the tuples up to ts(z) included.
Thus a window is a synopsis of the past.
20
Query Operators
Selection and projection—same as in relational
algebra (no duplicate elimination in projection)
Union of data streams becomes a merge: idle
waiting problem!
Join of data streams with tables—no problem
Joins of two or more data streams: must use
windows. Idle waiting here two!
Set difference and SQL-2 aggregates are
blocking and cannot be used.
Aggregates can be salvaged by using
continuous aggregates or windows. More later
21
Blocking Query Operators
No output until the entire input has been seen—i.e., the
last tuple in the input, … often detected after we hit the EOF.
Streams – input never ends: thus blocking operators
cannot be used as such.
Traditional SQL aggregates are blocking.
Several SQL operators have DBMS implementations
that are blocking but are not intrinsically blocking
group by, join can be implemented in blocking and
nonblocking ways
but others operators are intrinsically blocking
We will need to formally characterize which is which! (and
we’ll see that nonblocking operators are the monotonic ones)
22
Problematic Operators for Data Streams
Blocking query operators—i.e., those that must
see everything in the input before they can return
anything in the output
NonBlocking query operators are those that can
return results now, without seeing the rest of the
stream
Selection and projection are non-blocking
Set Difference, and Traditional aggregates are
blocking
Continuous aggregates are not
23
Expressive Power Problem
SQL: a language of limited expressive power
Non-blocking operator requirement further limit its expressive
power when expressing continuous queries
Pull-Based computation of PL and DBMS makes it possible to
embed SQL in a procedural language program
But DSMS have a push-based computation model: embedding
continuous queries a PL is a problem not a general solution
OR-DBMS extenders that use BLOBs not working either
User Defined Aggregate functions remain a viable extensibility
mechanism for continuous queries
However, only non blocking aggregates can be used
Blocking versus non-blocking aggregates.
24
Aggregates: two Forms
Traditional SQL2 blocking aggregate
select X, avg(X)
from S
where P
group by X
NB SQL:2003 OLAP Functions—aggregates on windows
traffic (sourceIP, sourcePort, destIP , destPort, length, Time)
select sourceIP, Time, avg(lenght) over(order by Time,
partition by sourceIP 50 rows preceding)
Cumulative (running) window:
... over(order by Time,
partition by sourceIP unlimited preceding)
25
Avoiding Blocking Behavior
Windows: aggregates on a limited size window are
approximate and nonblocking
DSMS make extensive use of windows of all kinds:
Sliding windows (same as SQL:2003 OLAP functions)
Tumbles: restart every new window (we can use the
traditional definition)
Panes: the window is broken up into panes
Punctuation:
Assertion about future stream contents not being needed to
answer correctly.
Unblocks operators, reduces state
Construct used for avoiding blocking are also useful
for avoiding infinite memory
26
Query Plans:
Optimization and Scheduling
The DSMS optimization problem is quite different
from the DBMS query optimization problem.
27
Query Optimization in DBMS
In DBMS: execution time savings by
selecting operator implementation indexes
and join reordering--- all measured by page
swap counts
Scheduling of various query tasks might be
left up to the OS.
DSMS Optimization
of Queries and Schedules
In DSMS: data is in memory and execution time is
mostly determined by the query graphs and the costs
of tuples being processed.
But many queries competing for resources: thus
schedules must be optimized to minimize latency and
memory (similarities with tasks scheduling in OS)
Simple DBMS-like query optimization opportunities
remain: e.g. pushing selection, composing views
Sharing of operators and buffers also important!
29
Scheduler based on Heuristics
Round Robin (RR) is perhaps the most basic
operators in a circular queue are given a fixed
time slice.
Starvation is avoided, but little adaptivity
FIFO: takes the first tuple in input and
moves it through the chain
Minimal latency, poor memory
Greedy Algorithms:
Buffers with most tuples first
Tuples that waited longest first
Operators that release more memory first
30
Chain: a Formal Result on Memory Optimization
[Babcock, Babu, Datar, Motwani 2003]
Output
selectivity = 0.0
σ3
selectivity = 0.6
σ2
selectivity = 0.2
Net Selectivity
σ1
best slope
σ2
σ3
σ1
Time
Input
31
DSMS
Approximation and Load Schedding
DSMS: online response on boundless and
bursty data streams—How?
by using Approximation and Synopses, or even
Shedding load when arrival rates become
impossible.
32
QOS via Synopses and Approximation
Synopsis: bounded-memory history-approximation
Succinct summary of old stream tuples
Like indexes/materialized-views, but base data is unavailable
Examples
Sliding Windows
Sampling and Bootstrapping
Sketching techniques
Histograms
Wavelet representation
Approximate Algorithms: e.g., median, quantiles,…
Fast and light Data Mining algorithms
33
Quality of Service (QoS)
and Load Schedding
Data streams can be bursty. When the input
stream rate exceeds system capacity the QoS will
become unacceptable—even with optimal schedules
Then more drastic measures must be taken:
something must be cut: Load Shedding
Introducing load shedding in a data stream
manager is a challenging problem
Queries can be cut alltogether, or if possible
Some tuples might be dropped from certain
streams in ways that only reduce the accuracy
of the queries, by
1. Random shedding, or
2. Semantic shedding
34
The Bigger Picture…
XML Data Streams
Complex Event Processing (CEP)
35
XML Data Streams: Applications
• An XML data stream is a sequence of tokens
• Typical applications using XML streams involve:
• integration o heterogneous data and information
• Distributed monitoring of computing systems
• Message-based web services
• Purchase orders, retail transactions
• Personalized content delivery
36
XML Streams: Data Model
XML data: tree structure
Data stream: ~ SAX events
<Purchase_Doc>
<PR_Number val = “50”/>
<Supp_Name>ABC</Supp_Na
me>
<Address>
<City>Florham Park</City>
<State>New Jersey</State>
</Address>
<Line_Items>
<Item>
<Part_Number val= “1050”/>
<Quantity val=“20”/>
</Item>
[element Purchase_Doc
anyType]
[element PR_Number anyType]
[attribute val anySimpleType]
[chardata 50]
[end-attribute]
[end-element]
[element Supp_Name anyType]
[text ABC]
[end-element]
…
37
XML Query Languages
XML query languages
Xquery, XSLT, Xpath
Declarative matching of structured data and text
Easy restructuring to meet needs of data
consumers
38
XML Streams: research Issue
Efficient Processing of single/multiple queries (e.g.,
Xfilters/Yfilters)
Blocking operators/constructs in XQuery—e.g.,
XQuery new function definition mechanisms are
blocking
Optimization, synopses and other QoS techniques for
XML data streams
Integration of relational and XML DSMS—just like
relational and XML DBMS are now being integrated.
39
The Even Bigger Picture…
XML Data Streams
Complex Event Processing (CEP)
40
Research & Commercial Systems
Progress Apama: they claim to be the industry leading platform
for Complex Event Processing (CEP) [a UK-trained Chicago team]
Aurora (Brandeis, Brown, MIT)
Borealis: multiprocessor processor version
StreamBase: commercial startup.
Gigascope (AT&T): used with success for network traffic
management, continuing projects
STREAM (Stanford Univ) influential but short lived
Coral8: R. Motwani among startup founders
Telegraph (UCB): Quickly completed by now
The Truvisio startup
Stream Mill (UCLA): extending the power and generality of DSMS
integrates XML streams and relational streams
Support for Data Mining applications
41
More Tutorial Talks
Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani,Jennifer
Widomhttp://theory.stanford.edu/~rajeev/pods-full-talk.ppt
Nick Koudas and Divesh Srivastava. Data stream query processing.
Tutorial presented at International Conference on Very Large
Databases (VLDB), 1149, 2003. [ PDF | talk slides (PDF)
Nick Koudas et al. Matching XML Documents Approximately (with S.
Yahia and D. Srivastava) Tutorial delivered at ICDE 2003
Nick Koudas et al. Stream Data Management: Research Directions and
Opportunities. Invited Talk at IDEAS 2002.
Nick Koudas et al. Mining Data Streams (with S. Guha) Invited
Tutorial delivered at PAKDD 2003
42
Books
Nauman A. Chaudhry, Kevin Shaw, Mahdi Abdelguerfi: Stream
Data Management. Advances in Database Systems 30, Springer
2005.
Charu C. Aggarwal: Data Streams - Models and Algorithms.
Advances in Database Systems 31, Springer 2007.
Lukasz Golab M. Tamer Ozsu, Data Stream Management, 2010,
Morgan & Claypool Publishers.
Charu C. Aggarwal: Data Mining - The Textbook. Springer 2015.
Minos N. Garofalakis, Johannes Gehrke, Rajeev Rastogi: Data
Stream Management - Processing High-Speed Data Streams. DataCentric Systems and Applications, Springer 2016.
43
References
Overview Papers
B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom: Models and Issues in
Data Stream
Systems. PODS 2002: 1-16
Lukasz Golab and M. Tamer ¨Ozsu. Issues in data stream management. ACM
SIGMOD Record, 32(2):5–14, 2003.
Application Papers
1. Cranor, Johnson, Spatscheck & Shkapenyuk. Gigascope: A Stream Database
for Network Applications. SIGMOD 2003
2. Joseph M. Hellerstein. From Database to Dataflow: New Directions in IT.
Medical Records Institute Health IT Advisory Report 3(6) (2002).
3.
Lerner & Shasha. The Virtues and Challenges of Ad Hoc + Streams Querying
in Finance. IEEE Data Engineering Bulletin, March 2003.
4. Sistal, Wolfson, Chamberlain, Dao. Modeling and Querying Moving Objects.
ICDE 1997.
5. Yao & Gehrke. Query Processing for Sensor Networks. CIDR 2003
CEP Papers:
Apama; CEP applications www.eventstreamprocessing.com/cepapplications.htm
Hans-Martin Brandl and David Guschakowski, Complex Event Processing in the
context of Business Activity Monitoring, Thesis, University of Applied Sciences
Regensburg, 2007.
Roy Schulte, Event Processing in Business Applications, Workshop on Event
Processing, March 14,15 2007www.complexevents.com/slides/Gartner_Schulte.ppt
Other papers at the same Workshop on Event Processing - Presentations
More about Systems
Sankar Subramanian, Srikanth Bellamkonda, Hua-Gang Li, Vince Liang, Lei
Sheng, Wayne Smith, James Terry, Tsae-Feng Yu, Andrew Witkowski:
Continuous Queries in Oracle. VLDB 2007: 1173-1184
Arvind, Shivnath Babu, Jennifer Widom: The CQL continuous query
language: semantic foundations anquery execution. VLDB J. 15(2): 121-142
(2006)
Hari Balakrishnan, Magdalena Balazinska, Donald Carney, Ugur Çetintemel,
Mitch Cherniack, Christian Convey, Eduardo F. Galvez, Jon Salz, Michael
Stonebraker, Nesime Tatbul, Richard Tibbetts, Stanley B. Zdonik:
Retrospective on Aurora. VLDB J. 13(4): 370-383 (2004).
Jeong-Hyon Hwang, Ugur Çetintemel, Stanley B. Zdonik: Fast and Reliable
Stream Processing over Wide Area Networks. ICDE Workshops 2007: 604613.
Jeong-Hyon Hwang, Magdalena Balazinska, Alex Rasin, Ugur Çetintemel,
Michael Stonebraker, Stanley B. Zdonik: High-Availability Algorithms for
Distributed Stream Processing. ICDE 2005: 779-790.
Charles D. Cranor, Theodore Johnson, Oliver Spatscheck, Vladislav
Shkapenyuk: Gigascope: A Stream Database for Network Applications.
SIGMOD Conference 2003: 647-651
Arvind Arasu, Mitch Cherniack, Eduardo F. Galvez, David Maier, Anurag
Maskey, Esther Ryvkina, Michael Stonebraker, Richard Tibbetts: Linear
Road: A Stream Data Management Benchmark. VLDB 2004.
More about Systems (cont.)
J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable
continuous query system for internet databases. In Proc. of the 2000
ACMSIGMOD Intl. Conf. on Management of Data, pages 379-390.
Madden, Vijayshankar Raman, Fred Reiss, and Mehul A. Shah. Sailesh
Krishnamurthy et al.: TelegraphCQ: An Architectural Status Report. IEEE
Data Engineering Bulletin, Vol 26(1), March 2003.
Sam Madden, Mehul A. Shah, Joseph M. Hellerstein, Vijayshankar Raman:
Continuously Adaptive Continuous Queries over Streams. SIGMOD 2002,
49-61.
S. Chandrasekaran and M. Franklin. Streaming queries over streaming data.
In VLDB, 2002.
J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable
continuous query system for internet databases. In SIGMOD, pages 379390, May 2000.
M. Sullivan. Tribeca: A stream database manager for network traffic
analysis. In VLDB, 1996.