Information Capture and Re-Use


Telegraph
Endeavour Retreat 2000
Joe Hellerstein
Roadmap
• Motivation & Goals
• Application Scenarios
• Quickie core technology overview
– Adaptive dataflow
– Event-based storage manager
– Come hear more about these tonight/tomorrow!
• Status and Plans
– Dataflow infrastructure & apps
– Storage manager?
Motivations
• Global Data Federation
– All the data is online – what are we waiting for?
– The plumbing is coming
• XML/HTTP, XML/WAP, etc. give LCD communication
• but how do you flow, summarize, query and analyze data
robustly over many sources in the wide area?
• Ubiquitous computing: more than clients
– sensors and their data feeds are key
• smart dust, biomedical (MEMS sensors)
• each consumer good records (mis)use
– disposable computing
• video from surveillance cameras, broadcasts, etc.
• Huge Data flood a’comin’!
– will it capsize the good ship Endeavour?
Initial Telegraph Goals
• Unify data access & dataflow apps
– Commercial wrappers for most infosources
– Most info-centric apps can be cast as dataflow
– The data flood needs a big dataflow manager!
– Goal: a robust, adaptive dataflow engine
• Unify storage
– Currently lots of disparate data stores
• Databases, Files, Email servers (and http access on these)
– Goal: A single, clean storage manager that can serve:
• DB records & semantics
• Files and “semantics”
• Email folders, calendars, etc. and semantics
Challenge for Dataflow: Volatility!
• Federated query processors
– A la Cohera, IBM DataJoiner
– No control over stats, performance, administration
• Large Cluster Systems “Scaling Out”
– No control over “system balance”
• User “CONTROL” of running dataflows
– Long-running dataflow apps are interactive
– No control over user interaction
• Sensor Nets
– No control over anything!
• Telegraph
– Dataflow Engine for these environments
The Data Flood: Main Features
• What does it look like?
– Never ends: interactivity required
• Online, controllable algorithms for all tasks!
– Big: data reduction/aggregation is key
– Volatile: this scale of devices and nets will not
behave nicely
The Telegraph Dataflow Engine
• Key technologies
– Interactive Control
• interactivity with early answers and examples
• online aggregation for data reduction
– Dataflow programming via paths/iterators (iterator sketch below)
• Elevate query processing frameworks out of DBMSs
• Long tradition of static optimization here
– Suggestive, but not sufficient for volatile environments
– Continuously adaptive flow optimization
• massively parallel, adaptive dataflow
• Rivers and Eddies
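To make the iterator idea concrete, here is a minimal sketch of a Volcano-style pull iterator, the classic dataflow-programming form the slide alludes to when it talks about elevating query processing frameworks out of DBMSs. The interface and names are illustrative assumptions, not Telegraph's actual API.

// Hypothetical Volcano-style iterator: every dataflow operator
// exposes open/next/close and pulls tuples from its children.
interface TupleIterator {
    void open();       // initialize state, open children
    Object[] next();   // produce the next tuple, or null at end-of-stream
    void close();      // release resources
}

// Example operator: a selection that composes over any child iterator.
class Select implements TupleIterator {
    private final TupleIterator child;
    private final java.util.function.Predicate<Object[]> pred;

    Select(TupleIterator child, java.util.function.Predicate<Object[]> pred) {
        this.child = child;
        this.pred = pred;
    }
    public void open() { child.open(); }
    public Object[] next() {
        for (Object[] t = child.next(); t != null; t = child.next()) {
            if (pred.test(t)) return t;   // pass qualifying tuples upstream
        }
        return null;                      // child exhausted
    }
    public void close() { child.close(); }
}

Because every operator speaks the same interface, plans compose like pipes, which is what lets an adaptive layer reorder them at run time.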
Static Query Plans
• Volatile environments like sensors need to adapt
at a much finer grain
Continuous Adaptivity: Eddies
[Diagram: an eddy continuously routing tuples among the operators of a running query]
• How to order and reorder operators over time
– based on performance, economic/admin feedback
• Vs. River:
– River optimizes each operator “horizontally”
– Eddies optimize a pipeline “vertically”
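As a concrete illustration, here is a toy version of the lottery-style routing idea from the Eddies paper (Avnur & Hellerstein, SIGMOD 2000), restricted to selection operators. Class and method names are ours, for illustration only; the real eddy also handles joins and more refined reward policies.

import java.util.*;

class Eddy {
    interface Op { boolean apply(Object[] tuple); }  // true = tuple survives

    private final List<Op> ops;
    private final int[] tickets;   // routing weights, adapted at run time
    private final Random rng = new Random();

    Eddy(List<Op> ops) {
        this.ops = ops;
        this.tickets = new int[ops.size()];
        Arrays.fill(tickets, 1);   // start uniform; no plan is fixed up front
    }

    // Route one tuple through the remaining operators in an adaptive order.
    boolean route(Object[] tuple) {
        BitSet done = new BitSet(ops.size());
        while (done.cardinality() < ops.size()) {
            int i = lottery(done);
            if (!ops.get(i).apply(tuple)) {
                tickets[i]++;      // reward operators that eliminate tuples,
                return false;      // so future tuples visit them earlier
            }
            done.set(i);
        }
        return true;               // tuple survived every operator
    }

    // Weighted random choice among operators not yet applied to this tuple.
    private int lottery(BitSet done) {
        int total = 0;
        for (int i = 0; i < ops.size(); i++) if (!done.get(i)) total += tickets[i];
        int pick = rng.nextInt(total);
        for (int i = 0; i < ops.size(); i++) {
            if (done.get(i)) continue;
            pick -= tickets[i];
            if (pick < 0) return i;
        }
        throw new AssertionError("unreachable");
    }
}

Because the "plan" is just the ticket vector, it re-optimizes continuously as selectivities and costs drift, tuple by tuple rather than once at submission time.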
Unifying Storage
• Storage management buried inside specific systems
• Elevate and expose the core services & semantic options
– Layout/indexing
– Concurrent access/modification
– Recovery
• Design for clustered environments
– Replicate for reliability (tie-ins with Ninja)
– Cluster options: your RAM vs. my disk
– Events & State Machines for scalability
• Unify eventflow and dataflow?
• Share optimization lessons?
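A sketch of what such an "elevated" storage interface might look like, with concurrency and recovery semantics chosen per store rather than buried inside one system. This is an assumption for illustration, not the actual Telegraph storage manager API.

// Hypothetical unified storage interface: one manager, per-store
// semantic options, so DB records, files, and email folders can
// each pick the guarantees they need.
interface StorageManager {
    enum Concurrency { NONE, LOCKING }   // plain files vs. DB records
    enum Durability  { NONE, LOGGED }    // recoverable or not

    // DB records want LOCKING + LOGGED; file access may want NONE + NONE.
    Store open(String name, Concurrency c, Durability d);

    interface Store {
        long   insert(byte[] record);    // returns a record id
        byte[] read(long recordId);
        void   update(long recordId, byte[] record);
        void   delete(long recordId);
    }

    // Transaction boundaries matter only for LOCKING/LOGGED stores.
    void begin();
    void commit();
    void abort();
}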
Status: Adaptive Dataflow
• Initial Eddy results promising, well received (SIGMOD 2K)
• Finishing Telegraph v0 in Java/Jaguar
– Prototype now running
• Demo service to go live on web this summer
– Analysis queries over web sites
• We’ve picked a provocative app to go live with (stay tuned!)
• Incorporates Ninja “path” project for caching
– Goal: Telegraph is to “facts and figures” as search
engines are to “documents”
• Longer-term goals:
– Formalize & optimize Eddy/River scheduling policies
– Study HCI/systems/stats issues for interaction
– Crawl “Dark Matter” on the web
– Attack streams from sensors
• Sequence queries and mining, data reduction, browsing, etc.
Status: Unified Storage Manager
• Prototype implementation in Java/Jaguar
– ACID transactions + (non-ACID) Java file access
– Robust enough to get TPC-W numbers
– Events/states vs. threads
• Echoes Gribble/Welsh results: better than threaded under load, but Java complicates detailed measurement (toy event loop sketched below)
• Time to re-evaluate importance of this part
– Interest? More mindshare in dataflow infrastructure.
– Vs. tuning an off-the-shelf solution (e.g. Berkeley DB)?
– Goal? unified lessons about dataflow/eventflow
optimization on clusters.
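For flavor, a minimal skeleton of the events-and-state-machines style being compared against threads here: one thread multiplexes many requests, each tracked as an explicit state machine, with no per-request stacks. Entirely illustrative; it does not mirror the prototype's code.

import java.util.ArrayDeque;
import java.util.Queue;

class EventLoopSketch {
    enum State { READ, PROCESS, WRITE, DONE }
    static class Request { State state = State.READ; }

    public static void main(String[] args) {
        Queue<Request> ready = new ArrayDeque<>();
        for (int i = 0; i < 3; i++) ready.add(new Request());
        // One thread drives all requests; no per-request stacks or
        // context switches, which is what helps under heavy load.
        while (!ready.isEmpty()) {
            Request r = ready.poll();
            switch (r.state) {
                case READ:    r.state = State.PROCESS; ready.add(r); break;
                case PROCESS: r.state = State.WRITE;   ready.add(r); break;
                case WRITE:   r.state = State.DONE;    break;
                default:      break;   // DONE: drop from the loop
            }
        }
    }
}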
Integration with Rest of Endeavour
• Give
– Be dataflow backbone for diverse “clients”
• Our own Telegraph apps (federated dataflow, sensors)
• Replication/delivery dataflow engine for OceanStore
• Scalable infrastructure for tacit info mining algorithms?
• Pipes for next version of Iceberg?
– Telegraph Storage Manager provides storage
(xactional/otherwise) for OceanStore? Ninja?
• Take
– OceanStore to manage distributed metadata, security
– Leverage protocols out of TinyOS for sensors
– Partner with Ninja to manage local metadata?
– Work with GUIR on interacting with streams?
More Info
• People:
– Joe Hellerstein, Mike Franklin, Eric Brewer, Christos
Papadimitriou
– Sirish Chandrasekaran, Amol Deshpande, Kris Hildrum,
Sam Madden, Vijayshankar Raman, Mehul Shah
• Software
– http://telegraph.cs.berkeley.edu coming soon
– ABC interactive data analysis/cleansing at
http://control.cs.berkeley.edu
• Papers:
– See http://db.cs.berkeley.edu/telegraph
Extra slides for backup
Connectivity & Heterogeneity
• Lots of folks working on data format translation, parsing
– we will borrow, not build
– currently using JDBC & Cohera Net Query
• commercial tool, donated by Cohera Corp.
• gateways XML/HTML (via http) to ODBC/JDBC
– we may write “Teletalk” gateways from sensors
• Heterogeneity
– never a simple problem
– Control project developed interactive, online data
transformation tool: ABC
CONTROL
Continuous Output and Navigation Technology with Refinement On Line
• Data-intensive jobs are long-running. How to give early
answers and interactivity?
– online interactivity over feeds
• pipelining “online” operators, data “juggle”
– online data correlation algs: ripple joins, online mining and aggregation (running-estimate sketch below)
– statistical estimators, and their performance
implications
• Deliver data to satisfy statistical goals
• Appreciate interplay of massive data processing, stats,
and HCI
“Of all men's miseries, the bitterest is this: to
know so much and have control over nothing”
–Herodotus
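One way to see the CONTROL idea in code: a running average over a randomly ordered stream, with a CLT-style confidence interval that tightens as tuples arrive, so the user can stop as soon as the answer is good enough. A toy sketch, not the project's actual estimators.

// Online aggregation sketch: Welford's running mean/variance plus an
// approximate 95% confidence interval that shrinks as tuples stream in.
class OnlineAvg {
    private long n = 0;
    private double mean = 0, m2 = 0;

    void observe(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    double estimate() { return mean; }

    // Half-width of an approximate 95% confidence interval (CLT).
    double halfWidth95() {
        if (n < 2) return Double.POSITIVE_INFINITY;
        double variance = m2 / (n - 1);
        return 1.96 * Math.sqrt(variance / n);
    }
}
// Usage: feed tuples in random order; continuously display
// estimate() +/- halfWidth95() and let the user stop or re-prioritize.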
Performance Regime for CONTROL
• New “Greedy” Performance Regime
– Maximize 1st derivative of the user-happiness
function
[Graph: user happiness (toward 100%) vs. time. The CONTROL curve climbs steeply almost immediately; the traditional curve delivers nothing until the job completes.]
River
• We built the world’s fastest sorting machine
– On the “NOW”: 100 Sun workstations + SAN
– But it only beat the record under ideal
conditions!
• River: performance adaptivity for data flows on
clusters
– simplifies management and programming
– perfect for sensor-based streams
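A toy rendering of River's distributed-queue idea, assuming nothing beyond the published description: producers hand each batch to whichever consumer currently has the shortest backlog, so slow or overloaded nodes automatically receive less work and the flow stays balanced.

import java.util.List;
import java.util.concurrent.BlockingQueue;

// Sketch of a River-style distributed queue; names are illustrative.
class DistributedQueue {
    private final List<BlockingQueue<byte[]>> consumers;

    DistributedQueue(List<BlockingQueue<byte[]>> consumers) {
        this.consumers = consumers;
    }

    // Route a batch to the least-backlogged consumer queue.
    void put(byte[] batch) throws InterruptedException {
        BlockingQueue<byte[]> best = consumers.get(0);
        for (BlockingQueue<byte[]> q : consumers) {
            if (q.size() < best.size()) best = q;
        }
        best.put(batch);   // blocks only if even the best queue is full
    }
}

The adaptivity is implicit: no node is assigned a fixed share of the data, so a machine running at half speed simply drains its queue half as fast and gets half the work.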
Declarative Dataflow: NOT new
• Database Systems have been doing this for years
– Translate declarative queries into an efficient dataflow plan
– “query optimization” considers:
• Alternate data sources (“access methods”)
• Alternate implementations of operators
• Multiple orders of operators
• A space of alternatives defined by transformation rules
• Estimate costs and “data rates”, then search space
• But in a very static way!
– Gather statistics once a week
– Optimize query at submission time
– Run a fixed plan for the life of the query
• And these ideas are ripe to elevate out of DBMSs
– And outside of DBMSs, the world is very volatile
– There are surely going to be lessons “outside the box”
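The static regime the slide criticizes boils down to a one-shot search, sketched below with hypothetical names: score each candidate plan against stored (possibly week-old) statistics, pick the cheapest, and never revisit the choice while the query runs.

import java.util.Comparator;
import java.util.List;

// Sketch of classical one-shot, cost-based plan selection.
class StaticOptimizer {
    interface Plan { double estimatedCost(); }  // from stale statistics

    Plan choose(List<Plan> alternatives) {
        // The winner runs unchanged for the life of the query,
        // however wrong the estimates turn out to be.
        return alternatives.stream()
                .min(Comparator.comparingDouble(Plan::estimatedCost))
                .orElseThrow();
    }
}

Contrast with the eddy sketch earlier, which makes a routing decision per tuple instead of per query.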
Competitive Eddies
[Diagram: a competitive eddy routing tuples from sources R1–R3 and S1–S3 among alternative operators — selections via index1 and index2, plus block and hash joins — letting the best implementation win at run time.]
Potter’s Wheel Anomaly Detection
The Data Flood is Real
[Chart: worldwide disk sales in petabytes vs. year, 1988–2000 (axis to 3,500 PB), with the sales curve far outpacing a Moore's Law growth curve. Source: J. Porter, Disk/Trend, Inc., http://www.disktrend.com/pdf/portrpkg.pdf]
Disk Appetite, cont.
• Greg Papadopoulos, CTO Sun:
– Disk sales doubling every 9 months
• Note: only counts the data we’re saving!
• Translate:
– Time to process all your data doubles every 18 months
– MOORE’S LAW INVERTED!
• (and Moore’s Law may run out in the next couple decades?)
• Big challenge (opportunity?) for SW systems research
– Traditional scalability research won’t help
• “Ideal” linear scaleup is NOT NEARLY ENOUGH!
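Behind the "translate" step, a one-line sanity check of the slide's arithmetic, writing $T = D/R$ for the time to process data volume $D$ at processing rate $R$: over any 18-month window, $D$ doubles twice (every 9 months) while $R$ doubles once (Moore's Law), so

$$T' = \frac{4D}{2R} = 2\,\frac{D}{R} = 2T$$

i.e., the time to process everything you have stored doubles every 18 months — Moore's Law inverted.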
Data Volume: Prognostications
• Today
– SwipeStream
• E.g. Wal-Mart 24 TB Data Warehouse
– ClickStream
– Web
• Internet Archive: ?? TB
– Replicated OS/Apps
• Tomorrow
– Sensors Galore
– DARPA/Berkeley “Smart Dust”
• Note: the privacy issues only
get more complex!
– Both technically and ethically
[Image: Smart Dust mote, sensing temperature, light, humidity, pressure, acceleration, magnetics]
Explaining Disk Appetite
• Areal density increases 60%/yr
• Yet MB/$ rises much faster!
[Chart: disk MB/$ vs. year, 1988–2000 (axis to 100 MB/$), rising far faster than the Moore's Law curve. Source: J. Porter, Disk/Trend, Inc., http://www.disktrend.com/pdf/portrpkg.pdf]