(PHI) - KAIST
Download
Report
Transcript (PHI) - KAIST
(Phi)
Timothy Roscoe, Joseph M. Hellerstein, Brent Chun, Nina Taft, Petros
Maniatis, Ryan Huebsch, Tyson Condie, Boon Thau Long
Intel Research at Berkeley and U.C. Berkeley
(much input from Tom Anderson, Vern Paxson, Larry Peterson, Scott
Shenker, Ion Stoica, and David Wetherall)
Intel Research
1
Lessons from PlanetLab…
What have we learned from PlanetLab?
We understand how to build robust large-scale systems
E.g. structured overlay networks that scale to the planet
What have we learned about the state of the
Internet?
It’s brittle.
It’s unpredictable.
It’s doesn’t know what’s happening.
It’s afraid.
Intel Research
2
The brittleness of the
Internet
Security systems are afraid of the unknown.
Everything new is unknown.
Everything new is a threat.
Better to shut it down now.
Users really don’t comprehend the problems (and
why should they?)
Not exactly easy to understand
Very little information available
Intel Research
3
The brittleness of the
Internet
Performance is unpredictable
Failures, bottlenecks, congestion, misconfigurations, etc.
It appears overlays can do better
Provided they can measure the network.
This shouldn’t work, but it does.
IP protocols assume no information is available
about current network state.
Intel Research
4
Research trends and
directions
Extensive network measurement / modelling
Distributed security solutions
Distributed performance diagnosis
Machine learning across networks
Measurement-based overlays
Better Internet protocols
Network visualizations
Intel Research
5
What’s missing?
Applications and services
User awareness
?
Measurement, monitoring, logging, etc.
of the real Internet
Intel Research
6
An “Information Plane” for the
Internet
Continuous queries over distributed network state,
available to all end systems
Integrate data from:
Backbone monitoring
Router configuration
Network state databases (e.g. RouteViews)
Security systems (Firewalls, DShield, Autograph, etc.)
End-system monitoring
Intel Research
7
The big picture
sensor
INTERNET
Types of Clients
End users
End applications
Overlays
Types of sensors
sensor
sensor
End-systems
Backbone monitors
Routers
Network databases
Firewall logs
Intel Research
8
The big picture
query
plan
query
plan
sensor
disseminate
query
sensor
sensor
sensor
sensor
Intel Research
9
The big picture
query
execution
query
execution
Answer
sensor
Query
results
sensor
sensor
sensor
sensor
Intel Research
10
Implications of success
Short-term: Enable & Connect measurement & security
researchers
E.g. “Live DShield”, top 10 IP address result from Barford et.al
Promote user-awareness through downloadable tools
Medium-term: Provide global network knowledge for
planetary-scale applications & overlays
E.g. Resource discovery on PlanetLab, OpenDHT could exploit NW
link information
Long-term: Kick off a new generation of Network-Aware
Internet Protocols
E.g. Host-based source routing solutions
Intel Research
11
Phi goals
Create the missing piece of the information plane by building
a scalable, distributed dataflow engine for processing
continuous queries in-network
Data tuples are
Routed between nodes along a dataflow graph
Processed at nodes (filtering, aggregation, data reduction, correlation,
result dissemination)
Declarative
Queries
Abstract
Dataflow
(Query Plan)
Physical Dataflow
in Overlay Network
Physical Network
Intel Research
12
The hard problems
Scale: Millions of sources, sinks, queries
Linear scaling on a n3 problem : need to factor out n2 redundant
communication & computation
Fidelity & Security
Bad inputs: data poisoning, perturbed computations
Bad outputs: launchpads, vulnerability detection
Efficiently embedding analysis algorithms in network
topologies
Data must be combined (hence moved around the network) according
to the distributed analysis algorithm
Intel Research
13
The rest of the talk
Where we are today
PIER: distributed relational query processor
Single query, many sources, many sinks
Deployed on PlanetLab for the last 12 months
Where we intend to go
P2: full dataflow engine with multiquery scaling
Topological Fault Tolerance
Develop embeddings of distributed analysis algorithms
Intel Research
14
Key technology: Structured
overlay networks (DHTs)
• E.g. Chord, Pastry,
Tapestry, CAN,
Kademlia...
• Flat, sparse ID space
(e.g. 160-bit identifiers)
• Routing in log(n) hops
routing to the owner of
any key
• Based on “interesting”
routing graphs
Intel Research
15
What can DHTs do?
• Content-Based Routing
– i.e. send a message to a
key
– Equivalent to hashing a
key to a node
• Storage
– Storing values in the
network under a key
• Tree construction
– Formed by routing to a
key from all nodes
0
15
1
14
2
13
3
12
4
11
5
10
6
9
8
7
Intel Research
16
What can DHTs do?
• Content-Based Routing
– i.e. send a message to a
key
– Equivalent to hashing a
key to a node
• Storage
– Storing values in the
network under a key
• Tree construction
– Formed by routing to a
key from all nodes
0
15
1
14
2
13
3
12
4
11
5
10
6
9
8
7
Intel Research
17
Using DHTs in Phi
Query Dissemination (trees)
Hierarchical Aggregation (trees and storage)
Indexing (routing and storage)
Range Indexing Substrate (routing and storage)
Hash-partitioned parallelism (routing)
Hash tables for group-by, join (storage)
Intel Research
18
Bamboo: our DHT
(Sean Rhea)
Pastry-style routing
Epidemic propagation of leaf sets, routing tables
Recursive routing
Adaptive timeouts based on continuous
measurement
Highly robust under churn
Tested to ~1000 nodes
PlanetLab, ModelNet
Intel Research
19
PIER: a relational query
engine
Data is tuples in named tables
Tables exist on nodes
Relational operators:
Selection
Projection
Join (correlate, intersect, match)
Aggregation (summarize, compress, group by)
Also has recursive queries
Can query topological structures
Intel Research
20
PIER architecture
Declarative
Queries
Query Plan
Overlay Network
Physical Network
Network
Monitoring
Other User
Apps
Applications
Query
Optimizer
Catalog
Manager
Core
Relational
Execution
Engine
PIER
DHT
Wrapper
Storage
Manager
IP
Network
Overlay
Routing
DHT
Network
Intel Research
21
Experience so far
• PIER has run on
PlanetLab for about
a year
• Querying PlanetLab
sensors, in particular
Snort events
Intel Research
22
Experience so far
Use of DHT for query processing by-and-large
works
Need story for NATs, non-transitive connectivity
Node heterogeneity
Multiresolution emulation is essential
Simulation, emulation (ModelNet), deployment
(PlanetLab)
Simple results are quite compelling
E.g. top 10 attackers demo for IDF
Intel Research
23
P2
Build on PIER techniques
Reimplementation in C++
Extend beyond relational operators
Synopses/sketches, junction trees, Bayes nets, PCA,..
Address multiquery optimization (2 factors of n)
Investigate using the overlay for data fidelity
Codify communication and computation for a variety
of algorithms
Intel Research
24
Multiquery optimization
Granular lineage for data inputs and intermediate data products
Telegraph: Tuple lineage bitmaps (operators & queries)
Scaling via cluster analysis: bits name sets of queries/operators
0
15
14
Embedded in the network
Multi-operator query mesh of multiple trees is formed
1
2
13
3
12
4
Optimizations in routing & replicating intermediate results
11
5
Scaling result dissemination
10
Multicast from within the MQO mesh
6
9
8
7
A many-source/many-sink multicast problem
Tie-ins with MQO: Can choose the multicast source points as part of query optimization
Intel Research
25
Ephemeral overlay
topologies
DHTs emulate InterConnect Networks
These have deep algebraic structure
Based on group-theoretic graph constructs
Rich families of such graphs with different properties
We can exploit the structure (i.e. constraints) of the overlay
To embed complex computations with efficient communication
To reason about the “influence” of malicious nodes in the network
We could choose ephemeral topologies to suit specific analysis algorithms
Intel Research
26
Topological Fault Tolerance
Fidelity and Security
influence: 8 nodes
Diversifying Influence
Reundant computation (a la process pairs)
applied in an adversarial environment
Structured overlay topologies admit analysis
of spheres of influence
Two Dimensions to Diversity
In Space: Multiple trees, different roots
In Time: Reassign node IDs to change spheres
of influence
influence: 1 node
Intel Research
27
Design Patterns for NetworkEmbedded Data Analysis
Taxonomize and abstract comm patterns for in-network analyses
binomial tree
We already understand how some of these map to DHTs
0
15
Up-tree aggregation (AVG, SUM, etc.)
14
Up-a-special-tree aggregation (Haar Wavelets)
Arbitrary dissemination (e.g. MIN, MAX, Gibbons-style
duplicate-insensitive sketching)
1
2
13
3
12
Lessons from sensornet arena (topology-oblivious)
First cut taxonomy of aggrs (TAG)
4
11
5
10
Junction Trees for distributed inference: Up-then-down a special tree
6
9
8
7
What all this means
Algebraic properties of comm patterns dictate an extensibility API
Expose only enough of alg logic to achieve optimization, code reuse, resource control
Efficient comm patterns can drive research into new analysis techniques
Intel Research
28
Collaborations
• Network Measurement
Community
– End-host packet traces (KAIST VMS,
NETI@home, DIMES)
– Firewall log repositories (Snort, Bro,
DShield, Domino)
– Backbone monitors and repositories
(CAIDA, CoMo)
– Network tomography (RouteViews,
NetTelescope)
• Distributed Algorithms
Community
– Summarization / data reduction
(IRP, Bell Labs)
– Inference / anomaly detection
(CMU, UCB, IR)
– Signature detection (IRP,
EarlyBird)
– Joins / correlations (UCB, ICSI)
Value proposition: reusable backplane for
–
–
–
–
Real-time data summarization & transport
Data validation (against other sources)
Correlation with other data
Algorithm design and deployment
Intel Research
29
The Plan (1)
Build a distributed peer-to-peer dataflow engine
Define protocols:
Tuple transfer protocol
Dataflow signalling protocol
Instantiate the “right” overlay
Address multiquery optimization
Rich aggregations/summarization and joins/correlations
Explore topological diversity in time and space
Identify efficient, realizable families
Establish feasible timescales for topology construction
Apply to both topological FT and net-embedded computations
Intel Research
30
The Plan (2)
Deploy an initial information plane, starting on PlanetLab and building out
Multiple classes of data sources:
End-system monitoring (e.g. Neti@home)
Link monitors (e.g. CoMo)
Network Telescopes (dark address space)
Databases / archives (e.g. Routeviews)
Build example applications ourselves
Implement example analysis operators: wavelet, PCA, etc.
Enable others to more easily build applications
Client libraries
Query handholding
Intel Research
31
Many thanks!
[email protected]
Intel Research
32