A Data Stream Management System for Network Traffic Management


Shivnath Babu
Stanford University
Lakshminarayanan Subramanian
Univ. California, Berkeley
Jennifer Widom
Stanford University
NRDM, Santa Barbara, CA, May 25, 2001
Network Traffic Management
• Large networks are growing complex and difficult to manage
– Increasing demands, overprovisioning, hardware changes, manual configuration
– Lack of information to configure network for effective usage
• Network traffic management is becoming an important part of the Internet infrastructure
– Collect data
• E.g., packet traces, network-flow data, SNMP data
– Process data
• E.g., compute link utilization, per-hop delays, traffic demands
– Deploy mechanisms to control traffic
• E.g., change routing parameters
• Data management forms a core part of traffic management
Traffic Management: Data Collection
• Many data sources
– Packet and flow traces
– Router forwarding tables and configuration data
– SNMP data
– Active measurements of packet delay, link utilization
• Data is collected continuously
– Networks need to operate 24x7
– Huge and fast-growing databases
• Many current traffic management systems store collected data
in file systems or data warehouses
Traffic Management: Data Processing
• Sophisticated data processing is required
• Measuring link utilization
– Aggregate packet traces
• Maintaining network topology
– Join SNMP data from different network elements
• Deriving traffic demands
– Join network flow traces, router forwarding tables and
configuration data, and SNMP data
• Anomaly detection, traffic modeling, traffic prediction, and
many others
• Most current traffic management systems process data
using ad-hoc scripts or software toolkits
Challenge in Data Management: Online Data Processing
• Most current traffic management applications process data offline
– Huge volume of data
– Complex processing involved
• Offline processing is indeed appropriate for some applications
– E.g., capacity planning, determining pricing plans
• Many traffic management applications need online processing
– E.g., congestion cause detection, resource allocation for guaranteed QoS,
detecting denial-of-service attacks, detecting Service-Level Agreement
violations, admission control and traffic policing
Online Processing
• What’s wrong with using a file system and procedural processing?
– Difficult to maintain and reuse (not a long term solution)
• What’s wrong with using a Database Management System
(DBMS)?
– DBMS expects all data to be managed as persistent data sets
– DBMS assumes “one-time” queries against stored and finite data
A Data Stream Management System (DSMS) for Online Processing
• Data Streams are the appropriate model for online processing
– Data is changing frequently (often exclusively through insertions)
– It is impractical to operate on the same data multiple times
• Continuous queries -- issued once and run “forever”
• Performance
– Need continuous-query optimization
– Need adaptive query-optimization
• A Data Stream Management System for traffic management
– Idea: Support online processing with continuous queries over data streams
A Data Stream Management System for Online Processing (cont’d)
[Figure: DSMS architecture. Streams of SNMP data, packet traces, flow traces, router forwarding tables, and active measurements enter the Data Stream Management System; applications based on online processing pose continuous queries to it.]
Continuous Query over a Single Data Stream
[Figure: a data stream of tuples <A,B>, <B,C>, <A,D> feeds continuous query Q; what is the answer A?]
• Many options with different ramifications
• Stream is infinite, append-only (e.g., packet traces)
– size of A is unbounded for a filter query -- cannot store A
– Stream out A -- but self-join query requires unbounded intermediate
state to compute A
– Updates to tuples in A -- e.g., aggregation query
• Stream has updates, deletions (e.g., SNMP data)
– often require more intermediate state to compute A
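The filter and aggregation cases above can be sketched as follows. This is a minimal illustration, not the system's actual interface: the `(src, dst, length)` packet schema and the function names `filter_query` and `sum_query` are assumptions made for the example.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical packet schema for illustration: (src, dst, length_bytes)
Packet = Tuple[str, str, int]


def filter_query(stream: Iterable[Packet], dst: str) -> Iterator[Packet]:
    """Filter query: every match is final, so it is streamed out, never stored."""
    for pkt in stream:
        if pkt[1] == dst:
            yield pkt


def sum_query(stream: Iterable[Packet]) -> Iterator[int]:
    """Aggregation query: the answer is a single value updated per tuple."""
    total = 0
    for _, _, length in stream:
        total += length
        yield total  # each yielded value replaces the previous answer


packets = [("A", "B", 100), ("B", "C", 40), ("A", "D", 60), ("C", "B", 25)]
matches = list(filter_query(packets, "B"))
totals = list(sum_query(packets))
```

Both functions keep constant state no matter how long the stream runs. The filter's answer grows without bound, which is why it is emitted rather than stored; the aggregate's answer stays one value that is overwritten.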
Operator Architecture in a DSMS
• Stream
– Append-only semantics: Result tuples that won’t change later
– Update semantics: Updates to current result
• Store: Result tuples that could change later
• Scratch: Intermediate state to compute future results
• Throw: Unneeded data
Example Queries from Traffic Management
• Single packet trace input data stream (IP headers over a link)
• Continuous query 1: Link utilization (total #bytes sent over the
link)
– Store -- sum of packet lengths
– Stream -- empty
– Scratch -- empty
• Continuous query 2: Number of flows per protocol
[Figure: the Packet Trace feeds a Flow Identifier, which feeds a Per-Protocol #flows counter; diagram labels: Stream, Scratch, Store.]
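Queries 1 and 2 can be sketched in terms of the Store and Scratch areas. The sketch below rests on assumptions that the slides do not state: a six-field packet tuple, a flow identified by its 5-tuple, and the class and attribute names, all of which are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

# Hypothetical packet schema:
# (src_ip, dst_ip, src_port, dst_port, protocol, length_bytes)
Packet = Tuple[str, str, int, int, str, int]


class LinkUtilization:
    """Query 1: Store holds one running sum; Stream and Scratch stay empty."""

    def __init__(self) -> None:
        self.store_total_bytes = 0

    def on_packet(self, pkt: Packet) -> None:
        self.store_total_bytes += pkt[5]


class FlowsPerProtocol:
    """Query 2: Scratch remembers flows seen; Store holds per-protocol counts."""

    def __init__(self) -> None:
        self.scratch_seen: Set[tuple] = set()
        self.store_counts: Dict[str, int] = defaultdict(int)

    def on_packet(self, pkt: Packet) -> None:
        flow_id = pkt[:5]  # assumption: a flow is identified by its 5-tuple
        if flow_id not in self.scratch_seen:
            self.scratch_seen.add(flow_id)
            self.store_counts[pkt[4]] += 1


trace = [
    ("10.0.0.1", "10.0.0.2", 1234, 80, "tcp", 1500),
    ("10.0.0.1", "10.0.0.2", 1234, 80, "tcp", 500),
    ("10.0.0.3", "10.0.0.2", 5353, 53, "udp", 100),
    ("10.0.0.4", "10.0.0.2", 4321, 80, "tcp", 400),
]
q1, q2 = LinkUtilization(), FlowsPerProtocol()
for pkt in trace:
    q1.on_packet(pkt)
    q2.on_packet(pkt)
```

Query 1 needs constant state. In this sketch the Scratch set of query 2 grows with the number of distinct flows, so a real deployment would need some way to expire finished flows.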
Example Queries from Traffic Management (cont’d)
• Continuous query 3: Join packet traces collected from different
points in the network to measure packet delays (or identify routes)
[Figure: Packet trace 1 and Packet trace 2 feed a Symmetric Hash-Join that keeps hash tables HT 1 and HT 2 in Scratch; join results go to Stream.]
• Efficient intermediate state management
• Intermediate state is unbounded theoretically
• Use of constraints can reduce intermediate state
• Can reclaim memory after each match
• Approximate answers can further reduce intermediate state
• Can you trade precision for state?
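A symmetric hash join that reclaims memory after each match might look like the sketch below. It assumes a constraint the slides only hint at: each packet is observed at most once per trace, so a matched entry can never match again. The `packet_id` key, the timestamps, and the class name are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


class SymmetricHashJoin:
    """Joins two packet traces on packet_id, keeping one hash table per input."""

    def __init__(self) -> None:
        # Scratch: HT 1 and HT 2, each mapping packet_id -> timestamp
        self.scratch: Tuple[Dict[str, float], Dict[str, float]] = ({}, {})

    def on_observation(
        self, trace_no: int, packet_id: str, ts: float
    ) -> Optional[Tuple[str, float]]:
        own = self.scratch[trace_no - 1]
        other = self.scratch[2 - trace_no]
        if packet_id in other:
            # Assumed constraint: one observation per trace, so the matched
            # entry is removed and its memory reclaimed.
            delay = abs(ts - other.pop(packet_id))
            return packet_id, delay
        own[packet_id] = ts  # no partner yet: remember it for a later match
        return None


join = SymmetricHashJoin()
events = [(1, "p1", 1.0), (1, "p2", 2.0), (2, "p1", 1.5),
          (2, "p3", 3.0), (2, "p2", 2.25)]
results: List[Tuple[str, float]] = []
for event in events:
    out = join.on_observation(*event)
    if out is not None:
        results.append(out)
```

Packets seen at only one observation point, such as `p3` here, stay in Scratch forever. Bounding that residue, for example by dropping entries older than a maximum plausible delay, is one place where precision could be traded for state.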
Example Queries from Traffic Management (cont’d)
• Continuous query 4: Identify top 5% (source IP address,
destination IP address) Pairs with maximum bandwidth
consumption over a link
• Non-trivial query over a stream
– Number of distinct Pairs can vary
– Bandwidth consumption of each Pair can vary
– How much intermediate state is needed?
[Figure: the Packet trace (Stream) feeds two operators, Count Distinct Pairs (Scratch) and Bandwidth Consumption of Pairs (Scratch), which together produce the Top 5% Pairs (Store).]
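An exact version of query 4 shows how much intermediate state it needs. This is a sketch under assumptions, not the approach taken in the talk: the class name, the `percent` parameter, and the rule of rounding the 5% cutoff up are all choices made for the example.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (source IP address, destination IP address)


class TopPairs:
    """Exact top-percent Pairs by bytes; state grows with distinct Pairs."""

    def __init__(self, percent: int = 5) -> None:
        self.percent = percent
        # Scratch: bytes per Pair; its size is also the distinct-Pair count
        self.scratch_bytes: Dict[Pair, int] = defaultdict(int)

    def on_packet(self, src: str, dst: str, length: int) -> None:
        self.scratch_bytes[(src, dst)] += length

    def current_answer(self) -> List[Tuple[Pair, int]]:
        n = len(self.scratch_bytes)
        k = (n * self.percent + 99) // 100  # ceil(n * percent / 100)
        return heapq.nlargest(k, self.scratch_bytes.items(), key=lambda kv: kv[1])


q4 = TopPairs(percent=5)
for i in range(1, 41):
    q4.on_packet(f"10.0.0.{i}", "10.0.1.1", 10 * i)
q4.on_packet("10.0.0.1", "10.0.1.1", 1000)  # one Pair overtakes the rest
top = q4.current_answer()
```

Every Pair must be tracked because any of them can later enter the top 5%, and the cutoff itself moves as new Pairs appear. The exact answer therefore costs memory proportional to the number of distinct Pairs.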
Further Challenges in Data Management: Distributed Stream Processing
• Data is collected from different points in a network
• Structure of an Internet Service Provider imposes restrictions
– Core routers are sensitive (so are the network operators)
• Sending collected data to a central processing site is harmful
– Additional load on the network
– Hinders real-time processing
– Won’t scale with the network and traffic
• Truly distributed processing is infeasible for many queries
– Goal: minimize communication traffic
– Trade communication traffic for precision
Example Queries from Traffic Management (cont’d)
• Continuous query 5: Identify top 5% of destination IP addresses
with maximum bandwidth consumption (to detect denial-of-service attacks)
[Figure: three local CQ 5 instances each send a Stream to one global CQ 5.]
• Hierarchical processing structure could also be useful
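The local/global split for query 5 can be sketched as below. To keep the example small it uses a fixed `k` instead of a 5% cutoff, and the function names are hypothetical. Merging local top-k summaries is approximate: a destination that receives moderate traffic at many sites can miss every local list, which is the precision given up in exchange for less communication.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def local_summary(packets: Iterable[Tuple[str, int]], k: int) -> Dict[str, int]:
    """Local CQ 5: aggregate bytes per destination, ship only the top k."""
    counts: Counter = Counter()
    for dst, length in packets:
        counts[dst] += length
    return dict(counts.most_common(k))


def global_top(summaries: Iterable[Dict[str, int]], k: int) -> List[Tuple[str, int]]:
    """Global CQ 5: merge the local summaries and rank the candidates."""
    merged: Counter = Counter()
    for summary in summaries:
        merged.update(summary)  # adds byte counts per destination
    return merged.most_common(k)


site1 = [("10.0.0.1", 500), ("10.0.0.2", 100), ("10.0.0.1", 300), ("10.0.0.3", 50)]
site2 = [("10.0.0.1", 200), ("10.0.0.4", 900), ("10.0.0.3", 60)]
s1, s2 = local_summary(site1, k=2), local_summary(site2, k=2)
answer = global_top([s1, s2], k=2)
```

Each site sends at most `k` entries per reporting round rather than its raw packet trace. The same two functions compose into a hierarchy if intermediate nodes merge summaries and forward their own top `k` upward.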
Summary of Basic Problems and Techniques
• Continuous queries over data streams is a unique combination of:
– Online processing
– Storage constraints -- amount of memory available is bounded
• Query result size may be unbounded
• Intermediate state may be unbounded
• Relevant techniques
– Online data structures (not build-and-throw)
– Summarization: samples, histograms, wavelets, fractals
– Adaptivity
• Data characteristics
• Flow rates
• Amount of memory
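As one example of summarization by sampling under a fixed memory budget, the sketch below uses reservoir sampling (Algorithm R). The slides list samples among the relevant techniques but do not name a method, so this particular algorithm is an assumption chosen for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item  # item i survives with probability k / (i + 1)
    return sample


sample = reservoir_sample(range(10_000), k=100, rng=random.Random(0))
short = reservoir_sample(range(3), k=5)
```

Memory stays at `k` items regardless of stream length, and the sample is maintained online, one tuple at a time, instead of being rebuilt from stored data.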
Some Simplifying Assumptions
• In talk, but not necessarily in work
• Traffic management data is clean
– Data is dirty: incomplete, inconsistent
– Temporal uncertainties
– Could be reduced as the importance of traffic management is realized
• Traffic management data is tuple-oriented
– Often true
– Implications for query language
Conclusions
• Traffic management requires efficient data management
• Many traffic management applications benefit from online data
processing
• Case for a Data Stream Management System (DSMS)
– Provides continuous queries over data streams for online processing
– Many interesting research issues
– Work is in progress
• Additional references
– S. Babu and J. Widom. Continuous queries over data streams
http://dbpubs.stanford.edu/pub/2001-9
– STREAM project homepage
http://www-db.stanford.edu/stream