In Search for Simplicity:
A Self-Organizing, Multi-Source Multicast Overlay
Matei Ripeanu
The University of Chicago
7/21/2015
Long term trend …

 Increasingly rewarding to aggregate the capabilities of small sites/machines
  A virtual machine aggregating the last 10 in the Top500 would rank 63rd in ’94 but 17th in ’04
 Grids and P2P are consequences of this trend
 Increasingly more ‘power’ in the tail of these distributions (sanity-check sketch below):

  Aggregated tail | Rank in 1994 | Rank in 2004
  Last 5          | 106th        | 38th
  Last 10         | 63rd         | 17th
  Last 20         | 18th        | 7th

[Figure: Top500 supercomputer performance. LinPack performance (GFLOPS, log scale) vs. rank (log scale) follows a Zipf distribution, Perf(rank) ~ rank^-k. Inset: evolution of the parameter k over 1993–2004, within roughly -0.68 to -0.93.]
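The tail-aggregation claim can be sanity-checked against the Zipf model in the plot. Below is a minimal Python sketch; the two k values are assumptions (approximate endpoints of the plotted range, with the steeper end assigned to 1994 per the slide's thesis of a fattening tail), and a perfectly Zipf list is assumed, which the real Top500 data is not, so exact ranks differ.

```python
# Back-of-the-envelope check of the tail-aggregation claim, assuming
# Perf(rank) ~ rank**(-k) as in the plot above. The k values are
# illustrative endpoints of the plotted range, not measured data.

def tail_aggregate_rank(k, list_size=500, tail=10):
    """Rank a 'virtual machine' aggregating the last `tail` entries would hold."""
    # Performance relative to the #1 machine: perf(r) = r**(-k); the
    # proportionality constant cancels when we invert the model.
    aggregate = sum(r ** (-k) for r in range(list_size - tail + 1, list_size + 1))
    # Invert the model: aggregate = rank**(-k)  =>  rank = aggregate**(-1/k)
    return aggregate ** (-1.0 / k)

# Steeper (1994-like) vs. flatter (2004-like) ends of the plotted k range:
print(round(tail_aggregate_rank(0.93)))  # ~42nd; the slide's real-data figure is 63rd
print(round(tail_aggregate_rank(0.68)))  # ~17th; matches the slide's 2004 figure
```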
Grids: infrastructure to support federated resource sharing
 Early Grid work:
  Focus on defining service interfaces and individual behaviors
  Implementations often based on centralized components
 Today’s challenge: self-organizing, adaptive services
 Self-organization – system behavior emerges as a result of:
  independent decisions made by system components,
  using incomplete information about the entire system.
Group communication (multicast) functionality
 Applications
  Conferencing
  White-board type applications
  Resource monitoring and discovery
  Data distribution
 Setting
  Multiple senders and receivers
  Medium scale
 Challenge
  Build/maintain support structures (overlays) …
  … that map well on a heterogeneous set of end nodes and network paths …
  … when facing node and link volatility …
  … using decentralized, adaptive algorithms.
Roadmap
 Introduction
 UMM: An Unstructured Multi-source Multicast
 Application study: Replica Location Service
 Conclusions
Existing approaches
 Shared tree (NICE, ALMI, Overcast)
  No need for explicit routing protocol
  But: fragile to failures, does not exploit all available capacity, larger delays (for multi-source)
 Unstructured mesh: random overlay + flooding (Gnutella)
  Simple, resilient.
  But: inefficient resource usage (duplicate traffic).
 ‘Measurement-based’ overlays (Narada, Scattercast)
  Unstructured overlay mesh, initially random
  Routing protocol to extract distribution trees for each source.
  But: (1) overhead to build/maintain routing tables, (2) tree extraction and overlay optimization are coupled.
 Structured overlay (M-CAN, Bayeux, Pastry)
  Scalable. Promise: structure reusability.
  But: (1) complex protocols to maintain structure, (2) difficult to adapt to heterogeneous node/link capacities (Bharambe ’05)
UMM design guidelines
 Use unstructured overlays for their:
  Low construction/maintenance costs
  Ability to map well to heterogeneous node and link characteristics
  Adaptiveness
 Prefer a simple design over a highly optimized, complex one.
  We attempt to preserve flooding’s simplicity while reducing its overheads.
  Soft-state protocols and passive state collection are preferable to active state maintenance
  Deployability, reliability.
 Keep design layers independent
  Ability to optimize and evolve layers individually
UMM in one slide
 Build a reasonably ‘good’ unstructured mesh
  Short-long heuristic (similar to Saxons [NSDI’04], M-CAN):
   ½ of the links ‘short’, optimized for delay;
   ½ of the links ‘long’, optimized for throughput
  Any other mesh-building technique would work
 Extract distribution trees
  Start with flooding
  Use the implicit information from duplicate messages to extract source-specific distribution trees (sketch below)
 Deal with failures, adaptation
  Acquired routing info is soft-state
  Link failure info flooded with low TTL
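A minimal Python sketch of the tree-extraction idea. This is not the UMM implementation, only an illustration of the mechanism the slide describes: the mesh starts out flooding, a duplicate received from a neighbor triggers a prune of that link for that source, and what remains is a source-specific distribution tree. The class layout is invented; the timeouts that expire the soft state and the low-TTL failure flooding are omitted.

```python
# Illustrative sketch (not the actual UMM code): source-specific trees
# emerge from flooding by pruning the links that deliver duplicates.

class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.neighbors = []        # overlay mesh links
        self.seen = set()          # (source, msg_id) pairs already delivered
        self.downstream = {}       # source -> neighbors still forwarded to

    def link(self, other):
        self.neighbors.append(other)
        other.neighbors.append(self)

    def publish(self, msg_id):
        self.seen.add((self.id, msg_id))
        for n in list(self._tree(self.id)):   # copy: callbacks may prune
            n.receive(self.id, msg_id, self)

    def receive(self, source, msg_id, sender):
        if (source, msg_id) in self.seen:
            # Duplicate: this link is not on source's tree, so tell the
            # sender to stop forwarding to us for this source. In UMM this
            # routing state is soft state: it expires and is re-learned
            # from the message flow if the overlay changes.
            sender._tree(source).discard(self)
            return
        self.seen.add((source, msg_id))
        for n in self._tree(source) - {sender}:
            n.receive(source, msg_id, self)

    def _tree(self, source):
        # Before any pruning, the 'tree' is all neighbors, i.e. flooding.
        return self.downstream.setdefault(source, set(self.neighbors))
```

On the first message from a source this degenerates to flooding; by the second message the duplicates have pruned the redundant links, so delivery already follows a tree.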
Success metrics
 Overheads compared to IP multicast (sketch after this slide):
  Relative Delay Penalty (RDP) = overlay delay vs. IP delay
  Stress: number of duplicate packets on a physical link

[Figure: the same end hosts (Berkeley, UCSD, MIT1, MIT2, CMU1, CMU2) connected via IP multicast vs. via an overlay.]

 Flexibility in adapting to changes:
  set of participating nodes, underlying network characteristics
 Reliable delivery:
  number of lost messages
 Protocol complexity:
  number of explicit messages to build state, state size
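Both overhead metrics have crisp definitions, so a short sketch may help. This is a hedged illustration with invented input shapes (an `ip_path` function returning the physical links under each overlay hop), not code from the evaluation harness.

```python
# Illustrative definitions of the two overhead metrics:
#   RDP(s, d) = delay along the overlay path / direct IP unicast delay
#   stress(l) = copies of one packet crossing physical link l
from collections import Counter

def relative_delay_penalty(overlay_delay_ms, ip_delay_ms):
    # >= 1 in practice; IP multicast has RDP = 1 by definition.
    return overlay_delay_ms / ip_delay_ms

def max_link_stress(overlay_hops, ip_path):
    # overlay_hops: the (u, v) overlay edges one packet traverses while
    #   reaching all receivers; ip_path(u, v) -> physical links beneath.
    stress = Counter()
    for u, v in overlay_hops:
        for link in ip_path(u, v):
            stress[link] += 1
    return max(stress.values())  # IP multicast keeps this at 1
```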
Evaluation Methodology
Design and implementation evaluated in two contexts:
 ModelNet – emulated wide-area network testbed
  Application runs unmodified
  Controlled network characteristics
  IP multicast used as a common base for comparison with published results
  Topologies: Inet-generated, 4040 routers with end nodes randomly attached
 PlanetLab deployment
  Evaluate performance when facing real network dynamics
Evaluation: ModelNet
Estimate:
 relative delay penalty
 stress
Graphs: min, max, average values over 2 topologies, 10 runs, 20% of all trees

[Figure: two plots vs. number of nodes (64, 128, 256, 512, 1024). Left: 90th-percentile Relative Delay Penalty (0–8); right: maximum link stress (0–80). Systems compared: UMM, DHT (real and idealized), Narada, flooding. Data source for systems other than UMM: Jain et al., USITS ’03.]
Evaluation: PlanetLab – Adaptation
Bootstrap, handling node failures:
 RDP and stress evolution: 100 nodes start, then 10 nodes crash and rejoin 900 s later

[Figure: MaxRDP, 95% RDP, and 90% RDP (0–10) and maximum link stress (0–12) over time (0–3840 s), annotated where the 10 nodes fail and then rejoin 900 s later.]

 Note: few messages lost
Visualization tool (with Dinoj Surendran): OverViz
Why is this solution appealing?
 Low-overhead, reliable, simple
  Low overhead: uses passive data collection to build distribution trees (simply inspecting the message flow)
  Reliable distributed structure to support routing: all routing state is soft-state
  Preserves the simplicity of flooding-based solutions
  Self-organizing: nodes make independent decisions based only on local information
  Scales with the number of sources, not with the number of passive participants
Roadmap
 Introduction
 UMM: An Unstructured Multi-source Multicast
 Application study: Replica Location Service
 Conclusions
Application -- Replica Location Service (RLS)
 Replication is often used to improve reliability, access latency, or availability.
 Need an efficient mechanism to locate replicas:
  Map a logical ID to replica location(s).
  Common to cooperative proxy caches, distributed object systems.
 Data Grid operations:
  A client presents an identifier (LFN = logical file name) and asks for one, many, or ‘all’ replica locations (PFN = physical file name).
  A client presents a set of LFNs and asks for a site where (most) replicas of these files are already present (sketch below).
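The second operation is essentially a "which site already covers most of this set?" query. A minimal sketch, assuming each site's holdings are available as a set of LFNs; `site_holdings` and the site names are invented stand-ins for whatever index the RLS exposes.

```python
# Hedged sketch of the collocation query: given a set of LFNs, pick the
# site that already holds the most replicas of them.

def best_site(lfns, site_holdings):
    wanted = set(lfns)
    # Score each site by how many of the wanted LFNs it already holds.
    return max(site_holdings, key=lambda site: len(site_holdings[site] & wanted))

sites = {"siteA": {"f1", "f2", "f3"}, "siteB": {"f2", "f4"}}
print(best_site(["f2", "f3"], sites))  # -> siteA (holds both)
```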
RLS: Application Requirements
Data-intensive, scientific applications’ requirements:
 Target scale:
  up to 500 million replicas by 2008,
  100s of sites.
 Decoupling: sites able to operate independently
 Support efficient location of collocated sets of replicas
From requirements and trace analysis:
 Lookup rates order(s) of magnitude higher than update rates,
 Flat popularity distribution for files (non-Zipf),
 Most add/update operations are done in batches.
RLS: The pieces
 Replica Location Nodes (RLNs)
  Local state: set of (LFN, PFN) mappings
  Compressed representation: Bloom filters (sketch below)
 State dissemination
  UMM overlay
  Soft-state messages

[Figure: Site A – a client and storage elements attached to the local node RLN_A, which exchanges state with remote nodes (RLN_1, …) over the overlay.]
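A minimal Bloom filter sketch showing the kind of compressed state an RLN could disseminate. The sizes, the salted-SHA1 hashing, and the API are illustrative assumptions; the actual RLS design choices, including compressed Bloom filters, are in the cited papers.

```python
# Illustrative Bloom filter over a node's LFNs (parameters invented):
# false positives are tunable via m and k; false negatives arise only
# when a remote filter is stale, which soft-state refresh bounds.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 20, k_hashes=5):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, lfn):
        # k independent positions derived from salted SHA-1 digests.
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{lfn}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, lfn):
        for p in self._positions(lfn):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, lfn):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(lfn))

# A remote site's filter answers "might that site hold this LFN?";
# positive hits are then verified at the authoritative node.
f = BloomFilter()
f.add("lfn://experiment/run42/event7")
print("lfn://experiment/run42/event7" in f)  # True (definitely added)
```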
Discussion
Tradeoffs:
 Accuracy (false-positive rate) vs. memory and bandwidth cost
 Accuracy (false-negative rate) vs. bandwidth cost
 Extra CPU and memory vs. bandwidth cost with compressed Bloom filters.
 Workload and deployment characteristics reduce the appeal of DHT-based solutions
 Existing users prefer low-latency queries over low memory usage
 Current RLS implementation offers a deployment choice

[Figure: number of nodes (10–10,000, log scale) vs. update/lookup rate (0.001–1, log scale) for five configurations: 1-hop DHT, 2-hop DHT, 2-hop DHT + replication, DHT (base 16), and DHT (base 16) + replication.]
RLS: Prototype implementation and evaluation
 Bloom filters
  Fast lookup (7–10 μs); add, delete operations (14 μs)
 (Isolated) Replica Location Node
  Lookup rates: up to 24,000 lookups/sec.
  Add, delete: about half of lookup performance
 Overlay performance
  Deployed and tested on PlanetLab
  Meets user-defined accuracy
  Alternatively: caps on generated traffic.
 Congestion management
  Reduced accuracy for faraway nodes

[Figure: Site A diagram as before – client and storage elements attached to RLN_A, connected over the overlay to RLN_1.]
Contributions: UMM, RLS
 (UMM) Unstructured multi-source multicast overlay:
  As efficient as solutions based on structured overlays or on running full routing protocols at the overlay level
  Simple protocol based on flooding and passive data collection.
  Self-organizing
 (RLS) Replica Location Service
  Application layered on top of UMM
  Proposed design:
   Meets requirements of data-intensive, scientific applications.
   Is simple, decentralized, adaptive.

 M. Ripeanu and I. Foster, “A Decentralized, Adaptive, Replica Location Service,” HPDC 2002.
 A. Chervenak, E. Deelman, I. Foster, A. Iamnitchi, C. Kesselman, W. Hoschek, P. Kunszt, M. Ripeanu, B. Schwartzkopf, H. Stockinger, K. Stockinger, and B. Tierney, “Giggle: A Framework for Constructing Scalable Replica Location Services,” SC2002.
 M. Ripeanu, I. Foster, A. Iamnitchi, and A. Rogers, “UMM: An Unstructured Multisource Overlay,” 2005 (submitted).
Open questions
 ‘Geometries’ to organize interactions between resources
  Random graphs:
   Power-law graphs present in a number of instances (Internet, power grid, Gnutella).
   Low cost to build and maintain
  Structured geometries:
   Structure can be exploited by applications
   High maintenance cost when nodes are volatile, heterogeneous
  Where is the tipping point? (application requirements?)
 Large systems and self-organization
  Are there generic building blocks for self-organization?
Thank you
 More information, links to papers:
  http://www.cs.uchicago.edu/~matei
 Questions?