slide - UCLA Computer Science

Download Report

Transcript slide - UCLA Computer Science

Data Stream Management Systems-Supporting Stream Mining Applications
Carlo Zaniolo
CS240B
2017
1
Three Main Topics
Data Stream Management Systems (DSMS):
efficient support for continuous queries
With Quality of Service (QoS) Guarantees
Data Stream Mining Algorithms
 fast & light algorithms to support classical
mining tasks (e.g. classification, association,
clustering) with online response
Combining these into a Data Stream Mining
workbench
E.g. an online version of Weka.
Data Stream Management Systems--Supporting
Data Stream Mining Applications
 In the age of the Internet, massive amounts of information are continuously
exchanged as data streams that are then processed by on-line applications
of increasing complexity. For such advanced applications, a store-now and
process-later approach cannot be used because of real time (or quasi realtime) requirements and excessive data rates. Therefore, current research
seeks to develop a new generation of information management systems,
called Data Stream Management Systems (DSMS), that can support complex
applications on massive data streams with Quality of Service (QoS)
guarantees. This work has produced novel techniques, research prototypes,
startup companies, and the successful deployment of DSMS in many
applications, including network traffic analysis, transaction log analysis,
intrusion detection, credit-card fraud detection, click stream analysis, and
algorithmic trading.
3
DSMS (cont.)
In particular, only monotonic queries and non-blocking
operators can be used. Also, the unbounded streams must be
represented by synopses, such as windows containing the
most recent tuples in the streams. Thus the semantics of basic
operators such as joins and aggregates must be revised for
windows. At the implementation level, we have new query
optimization techniques that seek to minimize response time
and memory utilization. Load shedding techniques based on
samples and sketches are used to achieve QoS under
overload conditions. The first part of the course, will cover
these techniques and the architectures of the main DSMS
systems.
4
Data Streams
The first part of the course, will cover these techniques and
the architectures of the main DSMS systems.
The second part of the course will focus on the data stream
mining problem that represents a vibrant area of new
research. Past work concentrated on devising data mining
algorithms that (i) are fast and light enough for on line
applications, and (ii) can cope with the concept shifts and
drifts that are often present in data streams
5
Topic 1: The Big Picture DBMS from 9000 meters
 Data Stream Management Systems (DSMS)
Overview of Applications, Systems, and commercial
ventures
 Main Research Challenges:




Data models and timestamps
Query constructs and execution models
Optimization and Scheduling
Approximation and Load Shedding
 The bigger Picture
XML
CEP
6
Data Streams
 Unbounded, rapid, time-varying streams of data elements,
continuous flowing on the internet and broad-band
communication channels
 Data Stream Management Systems (DSMS) are designed to
process them continuously, since a store now and process later
approach will not work due to:
 Response requirements: real time or quasi real-time
 Streams are too massive, and also bursty. Online filtering of
interesting data for in-depth analysis later.
 Many applications similar to those of DBMS. Most DSMS use
some form of SQL
 Computing environments quite different. E.g., persistent
queries on transient data, vs. transient queries on persistent
data
7
Systems and Technology
 Several Research prototypes (next slide)
 High-tech ventures startups+ DB vnedors
Apama (Software AG)
Truviso: network analytics (CISCO)
SQLstream (Amazon)
AURORA,StreamBase Systems, Inc.
PIPES, webMethods Business Events
StreamInsight (Microsoft)
InfoSphere Streams (IBM)
SAS Event Stream Processing Engine
Pipeline DB
 Complex Event Processing (CEP) systems:
More general functionality than continuous queries
Building on middleware/Java rather than DBs and SQL
8
Many Research Projects …
 Amazon/Cougar (Cornell) – sensors
 Aurora (Brown/MIT) – sensor monitoring, dataflow
 Hancock (AT&T) – Telecom streams
 Niagara (OGI/Wisconsin) – Internet DBs & XML
 OpenCQ (Georgia) – triggers, view maintenance
 Stream (Stanford) – general-purpose DSMS
 Tapestry (Xerox) – pubish/subscribe filtering
 Telegraph (Berkeley) – adaptive engine for sensors
 Tribeca (Bellcore) – network monitoring
 Stream Mill (UCLA) - power & extensibility
 Gigascope: AT&T Labs – Network Monitoring
... All SQL-based, except Hancock
9
Client-Server Architecture
Clients
Register
Query
Streamed
Result
DSMS
Server
Input streams
Scratch Store
Stored
Relations
10
Databases vs. DSMS
DBMS
DSMS
 Model: persistent data
 Model: transient data
 Table: set|bag of tuples
 Updates: All
 Query: transient
 Query Answer: exact
 Query Eval. multi-pass
 Operator: blocking OK
 Query Plan: fixed
 Infinite sequence of tuples
 Updates: append only
 Query: persistent
 Query Answer: Often approx
 Query Eval. one-pass
 Operators: unblocking only
 Query Plan: adaptive
11
Research Challenges: Data Models
Relational Data Streams
Each data stream consists of relational tuples
 The stream can be modelled as an append-only
relation
But repetitions are allowed and order is very
important!
 Order based on timestamps—or arrival order
12
Timestamps
 Data streams are (basically) ordered according to their
timestamps
 The meaning of RA operators such as unions an joins is based on
timestamps
 External timestamps
 Injected by data source
 Model real-world event represented by tuple
 Tuples may be out-of-order--but if near-ordered the DSMS can
reorder them using small buffers
 Internal timestamps
 Introduced as special field by the DSMS
 According to the approximate time of arrival
 Missing (will call them latent in Stream Mill)
 The system assigns no timestamp to arriving tuples,
 But tuples are still processed as ordered sequences
 Operators whose semantics requires timestamps can generate them
as/when needed
13
Data Stream Query Languages
(SQL-like)
Continuous queries and
Blocking Operators
14
Cascading of Streams
CREATE STREAM
OpenAuction (itemID int, sellerID char(10), price real, start
time timestamp)
ORDER BY start time; /*external timestamps*/
SOURCE ’port4445’;
CREATE STREAM expensiveItems AS
SELECT itemID, sellerID, price, start time
FROM OpenAuction WHERE price > 1000
SELECT itemID, price, start time
FROM expensiveItems WHERE sellerID= `ABCwarehouse`
Source
port4445
OpenAuction
σ
ExpensiveItems
σπ
Sink
15
A More General Query Graph
Q1
Q2
∑
U
⋈
Stream1
Stream2
Stream3
16
SQL: DSMS versus DBMS
 Cascading of streams similar to composition of views:
optimizer can combine them as virtual views or pipeline results
through queues
 Syntax/semantic uniformity (between DSMS and DBMS)
highly desirable, since it
 Minimizes confusion/learning curve
 Queries for DSMS can be tested on DBMS: actually might be a
practical solution for modest arrival rates
 Vice-versa might not work because of:
 Three Major Complications:Order is critical
1. Order is critical
2. Memory is limited
3. Blocking queries disallowed
 New SQL standards overcame 1: e.g. OLAP Functions with
Windows and Sequence Patterns, but 2 and 3 remain.
17
Surviving the 3 Complications
1. ORDER:
 Union of two or more streams becomes a merge that
Preserves the order of the incoming streams
 the order of their timestamps (if these are present)
2. Limited Memory:
 Only a synopsis of each stream can be memorized---e.g., a window
containing the last 10000 tuples or those that have arrived in the
last 6 minutes. The window defines a time changing table
 Join (of each arriving tuple in) the data stream with a table is
simple.A widow on a data stream can be viewed as a table.
 Symmetric joins of data streams are thus supported by creating wi
windows on each streams and computing the join on such window.
3. Blocking Queries Disallowed:
 …next slide
18
Surviving the 3 Complications (cont.)
1. ORDER
2. Limited Memory:
3. Blocking Queries Disallowed:
If any operator part of the query is blocking the query is
also blocking
 Blocking operators are NonMonotonic operators—which
are thus disallowed (e.g., set diff and SQL-2 aggregagtes)
 Significant loss of expressive power
19
Join of Stream S with a Table T
(where T is a DB relation or a Window on a Stream)
When a new tuple z with timestamp ts(z)
arrives in S, join it with all the tuples in T.
-
ts(z) is the timestamp of tuples so produced
If T is a window on a stream S’
-
E.g., 30 minutes preceding current tuple in S’ or
499 tuples preceding current one
 T must contain all the tuples up to ts(z) included.
 Thus a window is a synopsis of the past.
20
Query Operators
Selection and projection—same as in relational
algebra (no duplicate elimination in projection)
Union of data streams becomes a merge: idle
waiting problem!
 Join of data streams with tables—no problem
Joins of two or more data streams: must use
windows. Idle waiting here two!
Set difference and SQL-2 aggregates are
blocking and cannot be used.
Aggregates can be salvaged by using
continuous aggregates or windows. More later
21
Blocking Query Operators
 No output until the entire input has been seen—i.e., the
last tuple in the input, … often detected after we hit the EOF.
 Streams – input never ends: thus blocking operators
cannot be used as such.
 Traditional SQL aggregates are blocking.
 Several SQL operators have DBMS implementations
that are blocking but are not intrinsically blocking
 group by, join can be implemented in blocking and
nonblocking ways
 but others operators are intrinsically blocking
 We will need to formally characterize which is which! (and
we’ll see that nonblocking operators are the monotonic ones)
22
Problematic Operators for Data Streams
 Blocking query operators—i.e., those that must
see everything in the input before they can return
anything in the output
 NonBlocking query operators are those that can
return results now, without seeing the rest of the
stream
 Selection and projection are non-blocking
 Set Difference, and Traditional aggregates are
blocking
 Continuous aggregates are not
23
Expressive Power Problem
SQL: a language of limited expressive power
 Non-blocking operator requirement further limit its expressive
power when expressing continuous queries
 Pull-Based computation of PL and DBMS makes it possible to
embed SQL in a procedural language program
 But DSMS have a push-based computation model: embedding
continuous queries a PL is a problem not a general solution
 OR-DBMS extenders that use BLOBs not working either
 User Defined Aggregate functions remain a viable extensibility
mechanism for continuous queries
 However, only non blocking aggregates can be used
 Blocking versus non-blocking aggregates.
24
Aggregates: two Forms
 Traditional SQL2 blocking aggregate
select X, avg(X)
from S
where P
group by X
 NB SQL:2003 OLAP Functions—aggregates on windows
traffic (sourceIP, sourcePort, destIP , destPort, length, Time)
select sourceIP, Time, avg(lenght) over(order by Time,
partition by sourceIP 50 rows preceding)
Cumulative (running) window:
... over(order by Time,
partition by sourceIP unlimited preceding)
25
Avoiding Blocking Behavior
 Windows: aggregates on a limited size window are
approximate and nonblocking
 DSMS make extensive use of windows of all kinds:
 Sliding windows (same as SQL:2003 OLAP functions)
 Tumbles: restart every new window (we can use the
traditional definition)
 Panes: the window is broken up into panes
 Punctuation:
Assertion about future stream contents not being needed to
answer correctly.
Unblocks operators, reduces state
 Construct used for avoiding blocking are also useful
for avoiding infinite memory
26
Query Plans:
Optimization and Scheduling
The DSMS optimization problem is quite different
from the DBMS query optimization problem.
27
Query Optimization in DBMS
In DBMS: execution time savings by
selecting operator implementation indexes
and join reordering--- all measured by page
swap counts
Scheduling of various query tasks might be
left up to the OS.
DSMS Optimization
of Queries and Schedules
In DSMS: data is in memory and execution time is
mostly determined by the query graphs and the costs
of tuples being processed.
But many queries competing for resources: thus
schedules must be optimized to minimize latency and
memory (similarities with tasks scheduling in OS)
Simple DBMS-like query optimization opportunities
remain: e.g. pushing selection, composing views
Sharing of operators and buffers also important!
29
Scheduler based on Heuristics
Round Robin (RR) is perhaps the most basic
 operators in a circular queue are given a fixed
time slice.
Starvation is avoided, but little adaptivity
FIFO: takes the first tuple in input and
moves it through the chain
Minimal latency, poor memory
Greedy Algorithms:
Buffers with most tuples first
Tuples that waited longest first
Operators that release more memory first
30
Chain: a Formal Result on Memory Optimization
[Babcock, Babu, Datar, Motwani 2003]
Output
selectivity = 0.0
σ3
selectivity = 0.6
σ2
selectivity = 0.2
Net Selectivity
σ1
best slope
σ2
σ3
σ1
Time
Input
31
DSMS
Approximation and Load Schedding
 DSMS: online response on boundless and
bursty data streams—How?
 by using Approximation and Synopses, or even
 Shedding load when arrival rates become
impossible.
32
QOS via Synopses and Approximation
 Synopsis: bounded-memory history-approximation
Succinct summary of old stream tuples
Like indexes/materialized-views, but base data is unavailable
Examples
Sliding Windows
Sampling and Bootstrapping
Sketching techniques
Histograms
Wavelet representation
 Approximate Algorithms: e.g., median, quantiles,…
 Fast and light Data Mining algorithms
33
Quality of Service (QoS)
and Load Schedding
 Data streams can be bursty. When the input
stream rate exceeds system capacity the QoS will
become unacceptable—even with optimal schedules
 Then more drastic measures must be taken:
something must be cut: Load Shedding
 Introducing load shedding in a data stream
manager is a challenging problem
 Queries can be cut alltogether, or if possible
 Some tuples might be dropped from certain
streams in ways that only reduce the accuracy
of the queries, by
1. Random shedding, or
2. Semantic shedding
34
The Bigger Picture…
XML Data Streams
Complex Event Processing (CEP)
35
XML Data Streams: Applications
• An XML data stream is a sequence of tokens
• Typical applications using XML streams involve:
• integration o heterogneous data and information
• Distributed monitoring of computing systems
• Message-based web services
• Purchase orders, retail transactions
• Personalized content delivery
36
XML Streams: Data Model
XML data: tree structure
Data stream: ~ SAX events
<Purchase_Doc>
<PR_Number val = “50”/>
<Supp_Name>ABC</Supp_Na
me>
<Address>
<City>Florham Park</City>
<State>New Jersey</State>
</Address>
<Line_Items>
<Item>
<Part_Number val= “1050”/>
<Quantity val=“20”/>
</Item>
[element Purchase_Doc
anyType]
[element PR_Number anyType]
[attribute val anySimpleType]
[chardata 50]
[end-attribute]
[end-element]
[element Supp_Name anyType]
[text ABC]
[end-element]
…
37
XML Query Languages
XML query languages
 Xquery, XSLT, Xpath
 Declarative matching of structured data and text
 Easy restructuring to meet needs of data
consumers
38
XML Streams: research Issue
 Efficient Processing of single/multiple queries (e.g.,
Xfilters/Yfilters)
 Blocking operators/constructs in XQuery—e.g.,
XQuery new function definition mechanisms are
blocking
 Optimization, synopses and other QoS techniques for
XML data streams
 Integration of relational and XML DSMS—just like
relational and XML DBMS are now being integrated.
39
The Even Bigger Picture…
XML Data Streams
Complex Event Processing (CEP)
40
Research & Commercial Systems
 Progress Apama: they claim to be the industry leading platform
for Complex Event Processing (CEP) [a UK-trained Chicago team]
 Aurora (Brandeis, Brown, MIT)
Borealis: multiprocessor processor version
StreamBase: commercial startup.
 Gigascope (AT&T): used with success for network traffic
management, continuing projects
 STREAM (Stanford Univ) influential but short lived
 Coral8: R. Motwani among startup founders
 Telegraph (UCB): Quickly completed by now
The Truvisio startup
 Stream Mill (UCLA): extending the power and generality of DSMS
integrates XML streams and relational streams
Support for Data Mining applications
41
More Tutorial Talks
Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani,Jennifer
Widomhttp://theory.stanford.edu/~rajeev/pods-full-talk.ppt
Nick Koudas and Divesh Srivastava. Data stream query processing.
Tutorial presented at International Conference on Very Large
Databases (VLDB), 1149, 2003. [ PDF | talk slides (PDF)
Nick Koudas et al. Matching XML Documents Approximately (with S.
Yahia and D. Srivastava) Tutorial delivered at ICDE 2003
Nick Koudas et al. Stream Data Management: Research Directions and
Opportunities. Invited Talk at IDEAS 2002.
Nick Koudas et al. Mining Data Streams (with S. Guha) Invited
Tutorial delivered at PAKDD 2003
42
Books
 Nauman A. Chaudhry, Kevin Shaw, Mahdi Abdelguerfi: Stream
Data Management. Advances in Database Systems 30, Springer
2005.
 Charu C. Aggarwal: Data Streams - Models and Algorithms.
Advances in Database Systems 31, Springer 2007.
 Lukasz Golab M. Tamer Ozsu, Data Stream Management, 2010,
Morgan & Claypool Publishers.
 Charu C. Aggarwal: Data Mining - The Textbook. Springer 2015.
 Minos N. Garofalakis, Johannes Gehrke, Rajeev Rastogi: Data
Stream Management - Processing High-Speed Data Streams. DataCentric Systems and Applications, Springer 2016.
43
References
Overview Papers
 B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom: Models and Issues in
Data Stream
Systems. PODS 2002: 1-16
 Lukasz Golab and M. Tamer ¨Ozsu. Issues in data stream management. ACM
SIGMOD Record, 32(2):5–14, 2003.
Application Papers
1. Cranor, Johnson, Spatscheck & Shkapenyuk. Gigascope: A Stream Database
for Network Applications. SIGMOD 2003
2. Joseph M. Hellerstein. From Database to Dataflow: New Directions in IT.
Medical Records Institute Health IT Advisory Report 3(6) (2002).
3.
Lerner & Shasha. The Virtues and Challenges of Ad Hoc + Streams Querying
in Finance. IEEE Data Engineering Bulletin, March 2003.
4. Sistal, Wolfson, Chamberlain, Dao. Modeling and Querying Moving Objects.
ICDE 1997.
5. Yao & Gehrke. Query Processing for Sensor Networks. CIDR 2003
CEP Papers:
 Apama; CEP applications www.eventstreamprocessing.com/cepapplications.htm
 Hans-Martin Brandl and David Guschakowski, Complex Event Processing in the
context of Business Activity Monitoring, Thesis, University of Applied Sciences
Regensburg, 2007.
 Roy Schulte, Event Processing in Business Applications, Workshop on Event
Processing, March 14,15 2007www.complexevents.com/slides/Gartner_Schulte.ppt
 Other papers at the same Workshop on Event Processing - Presentations
More about Systems
 Sankar Subramanian, Srikanth Bellamkonda, Hua-Gang Li, Vince Liang, Lei
Sheng, Wayne Smith, James Terry, Tsae-Feng Yu, Andrew Witkowski:
Continuous Queries in Oracle. VLDB 2007: 1173-1184
 Arvind, Shivnath Babu, Jennifer Widom: The CQL continuous query
language: semantic foundations anquery execution. VLDB J. 15(2): 121-142
(2006)
 Hari Balakrishnan, Magdalena Balazinska, Donald Carney, Ugur Çetintemel,
Mitch Cherniack, Christian Convey, Eduardo F. Galvez, Jon Salz, Michael
Stonebraker, Nesime Tatbul, Richard Tibbetts, Stanley B. Zdonik:
Retrospective on Aurora. VLDB J. 13(4): 370-383 (2004).
 Jeong-Hyon Hwang, Ugur Çetintemel, Stanley B. Zdonik: Fast and Reliable
Stream Processing over Wide Area Networks. ICDE Workshops 2007: 604613.
 Jeong-Hyon Hwang, Magdalena Balazinska, Alex Rasin, Ugur Çetintemel,
Michael Stonebraker, Stanley B. Zdonik: High-Availability Algorithms for
Distributed Stream Processing. ICDE 2005: 779-790.
 Charles D. Cranor, Theodore Johnson, Oliver Spatscheck, Vladislav
Shkapenyuk: Gigascope: A Stream Database for Network Applications.
SIGMOD Conference 2003: 647-651
 Arvind Arasu, Mitch Cherniack, Eduardo F. Galvez, David Maier, Anurag
Maskey, Esther Ryvkina, Michael Stonebraker, Richard Tibbetts: Linear
Road: A Stream Data Management Benchmark. VLDB 2004.
More about Systems (cont.)
 J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable
continuous query system for internet databases. In Proc. of the 2000
ACMSIGMOD Intl. Conf. on Management of Data, pages 379-390.
 Madden, Vijayshankar Raman, Fred Reiss, and Mehul A. Shah. Sailesh
Krishnamurthy et al.: TelegraphCQ: An Architectural Status Report. IEEE
Data Engineering Bulletin, Vol 26(1), March 2003.
 Sam Madden, Mehul A. Shah, Joseph M. Hellerstein, Vijayshankar Raman:
Continuously Adaptive Continuous Queries over Streams. SIGMOD 2002,
49-61.
 S. Chandrasekaran and M. Franklin. Streaming queries over streaming data.
In VLDB, 2002.
 J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable
continuous query system for internet databases. In SIGMOD, pages 379390, May 2000.
 M. Sullivan. Tribeca: A stream database manager for network traffic
analysis. In VLDB, 1996.