PPT - Frontiers in Distributed Information Systems

Telegraph
Continuously Adaptive Dataflow
Joe Hellerstein
Scenarios
• Ubiquitous computing: more than clients
– sensors and their data feeds are key
• smart dust, biomedical (MEMS sensors)
• each consumer good records (mis)use
– disposable computing
• video from surveillance cameras, broadcasts, etc.
• Global Data Federation
– all the data is online – what are we waiting for?
– The plumbing is coming
• XML/HTTP, etc. give lowest-common-denominator (LCD) communication
• but how do you flow, summarize, query and analyze data
robustly over many sources in the wide area?
Dataflow in Volatile Environments
• Federated query processors a reality
– Cohera, IBM DataJoiner
– No control over stats, performance, administration
• Large Cluster Systems “Scaling Out”
– No control over “system balance”
• User “CONTROL” of running dataflows
– Long-running dataflow apps are interactive
– No control over user interaction
• Sensor Nets: the next killer app
– E.g. “Smart Dust”
– No control over anything!
• Telegraph
– Dataflow Engine for these environments
Data Flood: Main Features
• What does it look like?
– Never ends: interactivity required
• Online, controllable algorithms for all tasks!
– Big: data reduction/aggregation is key
– Volatile: this scale of devices and nets will not
behave nicely
The Telegraph Dataflow Engine
• Key technologies
– Interactive Control
• interactivity with early answers and examples
• online aggregation for data reduction
– Dataflow programming via paths/iterators
• Elevate query processing frameworks out of DBMSs
• Long tradition of static optimization here
– Suggestive, but not sufficient for volatile environments
– Continuously adaptive flow optimization
• massively parallel, adaptive dataflow via Rivers and
Eddies
CONTROL
Continuous Output and Navigation Technology with Refinement On Line
• Data-intensive jobs are long-running. How to give early
answers and interactivity?
– online interactivity over feeds
• pipelining “online” operators, data “juggle”
– online data correlation algorithms: ripple joins, online mining and
aggregation
– statistical estimators, and their performance
implications
• Deliver data to satisfy statistical goals
• Appreciate interplay of massive data processing, stats,
and HCI
“Of all men's miseries, the bitterest is this: to
know so much and have control over nothing”
–Herodotus
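The online aggregation idea above can be illustrated with a running estimate that tightens as more data streams in. This is a minimal sketch, not the CONTROL implementation: the function names (`online_mean`, `run_until`), the `min_n` guard, and the normal-approximation confidence interval are all assumptions made for illustration, and the interval is only meaningful if records arrive in random order.

```python
import math
import random

def online_mean(values, z=1.96):
    """Yield (n, running_mean, half_width) after each value is consumed.

    Running variance uses Welford's method; the interval is a
    normal-approximation confidence interval (an assumption of this sketch).
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # sum of squared deviations from the running mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            std_err = math.sqrt(m2 / (n - 1) / n)
            half_width = z * std_err
        else:
            half_width = float("inf")
        yield n, mean, half_width

def run_until(values, target_half_width, min_n=30):
    """Stop early once the interval is tight enough for the user."""
    result = None
    for result in online_mean(values):
        if result[0] >= min_n and result[2] <= target_half_width:
            break
    return result

# Hypothetical feed: 100,000 readings, consumed only until the goal is met.
rng = random.Random(0)
data = [rng.gauss(50.0, 10.0) for _ in range(100_000)]
n_used, estimate, half_width = run_until(data, target_half_width=0.5)
print(f"used {n_used} of {len(data)} values: {estimate:.2f} +/- {half_width:.2f}")
```

The point of the sketch is the control loop: the consumer sees an answer after every record and decides when it is good enough, rather than waiting for the full scan.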
Performance Regime for CONTROL
• New “Greedy” Performance Regime
– Maximize 1st derivative of the user-happiness
function
[Figure: user happiness (up to 100%) versus time, CONTROL vs. Traditional]
Potter’s Wheel Anomaly Detection
River
• We built the world’s fastest sorting machine
– On the “NOW”: 100 Sun workstations + SAN
– But it only beat the record under ideal
conditions!
• River: performance adaptivity for data flows on
clusters
– simplifies management and programming
– perfect for sensor-based streams
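Why ideal conditions matter can be seen in a toy simulation of one slow node in a cluster. This is a sketch under stated assumptions, not River's actual mechanism: the function names, the pull-based routing rule, and the node speeds are all hypothetical. It compares an even static split of records with routing each record to whichever consumer becomes idle first.

```python
import heapq

def static_partition_time(num_records, speeds):
    """Finish time when records are split evenly, regardless of node speed."""
    share = num_records / len(speeds)
    return max(share / s for s in speeds)

def adaptive_time(num_records, speeds):
    """Finish time when each record goes to the consumer that is idle first.

    speeds[i] is the processing rate of consumer i in records per second.
    Returns (finish_time, records_handled_per_consumer).
    """
    # Min-heap of (time at which the consumer becomes idle, consumer index).
    idle_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(idle_at)
    counts = [0] * len(speeds)
    finish = 0.0
    for _ in range(num_records):
        t, i = heapq.heappop(idle_at)
        done = t + 1.0 / speeds[i]
        counts[i] += 1
        finish = max(finish, done)
        heapq.heappush(idle_at, (done, i))
    return finish, counts

# Hypothetical cluster: three healthy nodes and one running 10x slower.
speeds = [10.0, 10.0, 10.0, 1.0]
static_finish = static_partition_time(1000, speeds)
adaptive_finish, counts = adaptive_time(1000, speeds)
print(f"static split:     {static_finish:.1f} s")
print(f"adaptive routing: {adaptive_finish:.1f} s, per-node counts {counts}")
```

With a static split the whole job waits on the slowest node; with adaptive routing the slow node simply ends up handling fewer records.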
Declarative Dataflow: NOT new
• Database Systems have been doing this for years
– Translate declarative queries into an efficient dataflow plan
– “query optimization” considers:
• Alternate data sources (“access methods”)
• Alternate implementations of operators
• Multiple orders of operators
• A space of alternatives defined by transformation rules
• Estimate costs and “data rates”, then search space
• But in a very static way!
– Gather statistics once a week
– Optimize query at submission time
– Run a fixed plan for the life of the query
• And these ideas are ripe to elevate out of DBMSs
– And outside of DBMSs, the world is very volatile
– There are surely going to be lessons “outside the box”
Static Query Plans
• Volatile environments like sensors need to adapt
at a much finer grain
Continuous Adaptivity: Eddies
Eddy
• How to order and reorder operators over time
– based on performance, economic/admin feedback
• Vs. River:
– River optimizes each operator “horizontally”
– Eddies optimize a pipeline “vertically”
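The reordering idea can be sketched for the simplest case, a pipeline of commutative filters. This is an illustrative toy, not the Telegraph Eddy: the `Eddy` class, the exponentially decayed pass-rate estimate, and the greedy "most selective first" routing rule are assumptions of this sketch, and it handles only filters, not joins.

```python
class Eddy:
    """Route each tuple through filters, trying the most selective one first.

    Selectivity is estimated per operator with an exponentially decayed
    pass rate, so the ordering can change as the data changes.
    """

    def __init__(self, operators, decay=0.9):
        self.operators = operators  # name -> predicate function
        self.pass_rate = {name: 0.5 for name in operators}  # neutral prior
        self.decay = decay
        self.calls = 0  # total operator invocations (the cost we track)

    def process(self, tup):
        remaining = set(self.operators)
        while remaining:
            # Pick the operator currently believed most likely to drop the tuple.
            name = min(remaining, key=lambda n: (self.pass_rate[n], n))
            remaining.remove(name)
            self.calls += 1
            passed = self.operators[name](tup)
            observed = 1.0 if passed else 0.0
            self.pass_rate[name] = (
                self.decay * self.pass_rate[name] + (1 - self.decay) * observed
            )
            if not passed:
                return False
        return True

def fixed_plan(operators, order, stream):
    """Baseline: apply filters in one fixed order; return (calls, output)."""
    calls, output = 0, []
    for tup in stream:
        for name in order:
            calls += 1
            if not operators[name](tup):
                break
        else:
            output.append(tup)
    return calls, output

# Hypothetical workload whose selectivities flip halfway through the stream.
operators = {
    "a_small": lambda t: t[0] < 10,
    "b_small": lambda t: t[1] < 10,
}
stream = [(i % 100, 0) for i in range(1000)] + [(0, i % 100) for i in range(1000)]

eddy = Eddy(operators)
eddy_output = [t for t in stream if eddy.process(t)]
calls_ab, output_ab = fixed_plan(operators, ["a_small", "b_small"], stream)
calls_ba, output_ba = fixed_plan(operators, ["b_small", "a_small"], stream)
print(f"fixed a,b: {calls_ab} calls; fixed b,a: {calls_ba} calls; eddy: {eddy.calls} calls")
```

Either fixed order is good for one half of the stream and poor for the other; the adaptive router switches order shortly after the data changes and produces the same output with fewer operator calls.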
Competitive Eddies
[Figure: an Eddy with inputs R1, R2, R3 and S1, S2, S3; labels: index1, index2, block, hash]
Telegraph: Putting it Together
• Scalable, adaptive dataflow infrastructure. Apps include…
– sensor nets
– massively parallel and wide-area query engines
– net appliances: chaining transformation/aggregation/compression/
etc. in proxies
– any volatile dataflow scenario
• Technology: a marriage of…
– CONTROL, Rivers & Eddies
• Many research questions here
• E.g. how to combine River and Eddy adaptivity
• E.g. how to tune Eddies for statistical performance goals
– Combinations of browse/query/mine at UI
– Storage management to handle new hardware realities
• Look for a live service this summer!
Integration with Endeavour
• Give
– Be data-intensive backbone to diverse clients
– Be replication/delivery dataflow engine for OceanStore
– Telegraph Storage Manager provides storage
(xactional/otherwise) for OceanStore
– Provide platform for data-intensive “tacit info mining”
• Take
– Leverage OceanStore to manage distributed metadata,
security
– Leverage protocols out of TinyOS for sensors
Connectivity & Heterogeneity
• Lots of folks working on data format translation, parsing
– we will borrow, not build
– currently using JDBC & Cohera Net Query
• commercial tool, donated by Cohera Corp.
• gateways XML/HTML (via http) to ODBC/JDBC
– we may write “Teletalk” gateways from sensors
• Heterogeneity
– never a simple problem
– CONTROL project developed an interactive, online data
transformation tool: ABC
More Info
• Collaborators:
– Mike Franklin, Eric Brewer, Christos
Papadimitriou
– Sirish Chandrasekaran, Amol Deshpande, Kris
Hildrum, Sam Madden, Vijayshankar Raman,
Mehul Shah
• Me: [email protected]
• Web:
– http://db.cs.berkeley.edu/telegraph
– http://control.cs.berkeley.edu
Extra slides for backup