Using topology aware communications services in grid environments


Using Topology-Aware
Communication Services
in Grid Environments
Craig A. Lee, Eric Coe1, B. Scott Michel2, James Stepanek2,
Ignacio Solis3, Matt Clark, Brooks Davis
{ lee | ecoe | scottm | stepanek | mclark | brooks }@aero.org
The Aerospace Corporation
1Also at the University of Southern California
2Also at the University of California, Los Angeles
[email protected], University of California, Santa Cruz
Supported in part by a subcontract in the DARPA Active Networks Program
Grids and Advanced Networks Workshop/CCGrid 2003
Tokyo, Japan, May 15, 2003
Introduction: Why Topology-Aware Communication Services?
• Grids promise an unprecedented degree of
distributed computing
– A fabric of network-connected sites and resources
• The network topology connecting these sites
and resources can be exploited
– Improve performance
– Enable new functionality
• As processors and networks get faster, grid
computations will become increasingly
latency-sensitive
– Topology-awareness will become essential
May 15, 2003
GAN/CCGrid 2003
Tokyo, Japan
Slide 2
Many Types of Services
Improved or Enabled
• Augmented Semantics
– Caching (web caching), filtering, compression,
encryption, quality of service, data-transcoding, etc.
• Collective Operations
– Accomplished “in the network” rather than using
point-to-point msgs across the diameter of the grid
• Communication Scope
– Named topologies can denote a communication
scope to limit problem size and improve performance
• Content and Policy-based networking
– Publish/subscribe, interest management, event
services, tuple spaces, quality of service
Topo-Aware Comm Services
Can Be Similar to an Overlay
Service Organization & Mgmt
• Organization
– Network of Servers
– Middleware Layer – peer-to-peer virtual overlay
– Active Networks – real overlays
• Management
– Creation, Configuration, Discovery, Termination
– Natural application of grid tools!
• This slide glosses over a multitude of issues!
A Case Study: Time Mgmt
• Time Management enables temporal causality to be
enforced in Distributed Simulations
• Maintaining agreements about time within a “simulated
world” is essential to its apparent reality and to the
validity of observed results
– Strict temporal synchronization is not necessary
– Time must be synchronized only to the point
where temporal causality is preserved
• Topology-Aware Communication is a natural fit
– Eliminates point-to-point communication
– Increases performance for LBTS, the key TM algorithm
A Simple Case of
Violated Causality
[Figure: a Shooter's Fire event destroys a Target; plotted against wallclock
time, the Observer sees the Fire and Destroy events out of order.]
Explicit references to simulated time, e.g.,
“the attack begins at 0300”, require LBTS.
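The violated-causality scenario on this slide can be sketched in a few lines (a toy Python illustration, not from the talk; the event names and timestamps are invented):

```python
# Sketch (not from the slides): delivering events in arrival order can
# violate causality; ordering by simulated timestamp restores it.
from dataclasses import dataclass

@dataclass
class Event:
    name: str
    sim_time: float   # simulated time at which the event occurred
    arrival: float    # wallclock time the observer received it

# Fire precedes Destroy in simulated time, but network delay makes the
# Destroy event arrive at the observer first.
events = [
    Event("Destroy", sim_time=2.0, arrival=0.5),
    Event("Fire",    sim_time=1.0, arrival=0.9),
]

arrival_order = [e.name for e in sorted(events, key=lambda e: e.arrival)]
causal_order  = [e.name for e in sorted(events, key=lambda e: e.sim_time)]
print(arrival_order)  # ['Destroy', 'Fire'] -- causality violated
print(causal_order)   # ['Fire', 'Destroy'] -- causality preserved
```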
LBTS: the Heart of TM
• Lower Bound Time Stamp algorithm
– The lower time bound of all simulated entities
(hosts) and all in-transit messages
• Including in-transit messages means
reduction of a variable number of values
– Computing the reduction is the easy part
– Knowing when you’re done is the hard part
• LBTS is an instance of the Distributed
Termination Detection (DTD) problem
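A toy sketch of the LBTS value itself, and of why termination detection is the hard part (the numbers and the sent/received-counter check are illustrative assumptions, not the talk's protocol):

```python
# Hypothetical sketch: LBTS is the minimum over every host's local
# simulation time and the timestamp of every in-transit message.
host_times = {"H1": 12.0, "H2": 15.5, "H3": 11.0}
in_transit = [10.5, 13.0]   # timestamps of messages sent but not yet received

lbts = min(list(host_times.values()) + in_transit)
print(lbts)  # 10.5

# The hard part (distributed termination detection) is knowing the
# in-transit set is complete; one common trick is comparing aggregate
# sent/received counters across hosts.
sent, received = 7, 5
assert sent - received == len(in_transit)  # 2 messages still in flight
```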
LBTS Algorithm Phases
• Initiation
– Tags or “colors” used to distinguish multiple,
simultaneous LBTSi computations
• Merging
– Multiple LBTSi initiations must be merged
• Reduction
– Min time stamp of hosts and in-transit messages
• Announcement
– All hosts eventually need to know final LBTS value
Distinguished Root Node
[Figure: hosts (H) in a grid cloud, connected by a single spanning tree
rooted at the Distinguished Root Node (DRN).]
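The reduction phase over the DRN-rooted spanning tree can be sketched as follows (a toy Python illustration; the tree shape, host names, and local times are made up):

```python
# Sketch of a min-reduction flowing up a spanning tree rooted at the DRN,
# instead of point-to-point messages across the diameter of the grid.
tree = {            # parent -> children (illustrative topology)
    "DRN": ["A", "B"],
    "A": ["H1", "H2"],
    "B": ["H3"],
}
local_time = {"H1": 12.0, "H2": 9.5, "H3": 14.0,
              "A": 11.0, "B": 13.0, "DRN": 10.0}

def reduce_min(node):
    """Each node combines its own value with its children's results,
    so only one message per tree edge is needed."""
    return min([local_time[node]] + [reduce_min(c) for c in tree.get(node, [])])

print(reduce_min("DRN"))  # 9.5
```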
Integrating DRN/ATMD w/ GaTech RTI
• Active Time Mgmt Daemons implement DRN
• TM-Kit
– Provides modular place to implement TM
– Application requests TM-Kit to compute LBTS
(through the HLA/RTI API)
• Several end-to-end comm topologies available
– Completely connected, Star, Butterfly
– All implemented using the same select() loop
• DRN was added as just another option
– Opens a single socket to the first-hop ATMD
• ATMD must sniff timestamps in data packets
sent by the GaTech RTI
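The single select() loop the slide alludes to might look like this in miniature (a hypothetical Python sketch; socket pairs stand in for the peer links of whichever topology is in use):

```python
# Sketch: one select() loop can service any topology (fully connected,
# star, butterfly, or DRN) just by varying the registered sockets.
import select
import socket

# two hypothetical peer links, simulated with connected socket pairs
a_local, a_remote = socket.socketpair()
b_local, b_remote = socket.socketpair()
a_remote.send(b"ts:12.5")          # peer A sends a timestamped message

readable, _, _ = select.select([a_local, b_local], [], [], 0.5)
messages = [sock.recv(64) for sock in readable]
print(messages)                    # only peer A's link had data

for sock in (a_local, a_remote, b_local, b_remote):
    sock.close()
```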
Managing the Ensemble w/ Grid Tools
[Figure: APSimScript discovers resources via MDS-2 and starts the APSim
instances via Globus; an overlay request goes to the XBone Manager, which
configures the overlay; the APSim/HLA/RTI end-hosts register with MDS-2
and communicate through ATMDs on the overlay.]
An HLA Application: Airport Sim
• m Airplanes fly between n Airports
• Airplanes randomly fly from Airport i to Airport j
• Airplanes wait until runway available for take-off
• Flights last 1.0 hr + some random time + circling time
• Airplanes wait until runway available for landing
• Small random time for passenger handling, etc.
• Airports assigned to processors
• Airports process take-off, arrival, landing events
• Look-ahead of 1.0 hr.
Airport Sim Testbed Configuration
• Part of six processor
Aerospace Active
Network Testbed
• 100 planes and 10
airports on 3 procs
• Run about 1000
simulated minutes
ATMD Instrumented with NetLogger
Groups of Events for Each LBTS
Computation Connected into ‘Lifelines’
Metropolitan Testbed
[Figure: metropolitan-area testbed spanning USC, ISI, and Aerospace.]
Metro Testbed NetLogger
Perf. Results: LBTS Makespan (ms)
Lab Testbed

Algo.      All-Host   Host 1     Host 2     Host 3
           Makespan   Makespan   Makespan   Makespan
Non-act.   72.77      50.78      54.27      56.48
DRN        350.90     260.30     258.41     259.59

Metropolitan Testbed

Algo.      All-Host   Host 1     Host 2     Host 3
           Makespan   Makespan   Makespan   Makespan
Non-act.   210.60     192.91     192.91     193.04
DRN        357.65     330.00     329.47     329.54

• Performance gap dramatically smaller!
– DRN performance only slightly affected by the additional latency
– Non-active performance 3x-4x slower
So, What To Do?
• This configuration is too small to
demonstrate real advantage
– But going from the lab to across town dramatically
narrowed the performance gap
– Graph properties promise better performance
but per-node overhead is significant
• Compared to “native” network hardware
– Need to test larger configurations
• Emulation
– EmuLab – University of Utah
• Simulation
– Parsec – UCLA
EmuLab
• Specialized cluster
– 168 PCs (at the time)
– Each with 5 x 100Mb ethernet interfaces connected
to a programmable switch
• Each experiment:
– Has exclusive access to node subset
– Can run NS scripts that program the switch to
customize physical topology among the node subset
– Can boot a custom OS
• University of Utah, Network Emulation Testbed
– http://www.emulab.net
EmuLab Experiments
• Set of quasi-random, tree topologies generated
– 4, 8, 16, 32, and 64 end-hosts
– 9, 15, 22, 29, 34 interior service hosts
• Tree topologies constrained to have an
average degree of 2.5
• 550 LBTS calculations done for each topology
– Just running the DRN algorithm
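One plausible way to generate quasi-random trees of this kind (the attachment rule, degree cap, and seed are my assumptions, not the generator the experiments actually used):

```python
# Hypothetical sketch: grow a random tree by attaching each new node to
# a random existing node, rejecting parents already at a degree cap.
import random

def random_tree(n, max_degree=4, seed=0):
    rng = random.Random(seed)
    degree = {0: 0}
    edges = []
    for v in range(1, n):
        parent = rng.choice([u for u in degree if degree[u] < max_degree])
        edges.append((parent, v))
        degree[parent] += 1
        degree[v] = 1
    return edges

# e.g. the smallest configuration: 4 end-hosts + 9 interior service hosts
edges = random_tree(4 + 9)
assert len(edges) == 12    # a tree on n nodes always has n-1 edges
```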
Example Graph: 32 end-hosts
LBTS Makespan on EmuLab (ms)
Parsec Experiments
• Under EmuLab, both algos used same topology
– Rebooting is time-consuming
– NTP must settle down before experiment can begin
• In real world, algos would use different paths
• Simulations done with Parsec 1.1
– Using same tool, 50 random topologies generated
– Shortest path routing determined for TM-Kit
– Minimum latency spanning tree determined for DRN
– 50 LBTS calculations done for each case
– 100 Mb/sec w/ 1 ms per-hop latency
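Deriving a minimum-latency spanning tree for the DRN from a latency-weighted topology can be sketched with Prim's algorithm (the example graph and latencies are invented; the slides don't say which MST algorithm was used):

```python
# Sketch: Prim's algorithm over a latency-weighted graph yields a
# minimum-latency spanning tree suitable for rooting the DRN reduction.
import heapq

def min_latency_tree(graph, root):
    """graph: node -> list of (neighbor, latency_ms). Returns tree edges."""
    visited, tree = {root}, []
    heap = [(w, root, v) for v, w in graph[root]]
    heapq.heapify(heap)
    while heap:
        w, u, v = heapq.heappop(heap)
        if v in visited:
            continue                 # would create a cycle; skip
        visited.add(v)
        tree.append((u, v, w))
        for nxt, w2 in graph[v]:
            if nxt not in visited:
                heapq.heappush(heap, (w2, v, nxt))
    return tree

graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("A", 1), ("C", 2), ("D", 5)],
    "C": [("A", 4), ("B", 2), ("D", 1)],
    "D": [("B", 5), ("C", 1)],
}
print(min_latency_tree(graph, "A"))
# [('A', 'B', 1), ('B', 'C', 2), ('C', 'D', 1)]
```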
LBTS Makespan from Parsec (ms)
Questions, Questions, Questions
• Clearly topology-aware communication wins
– For the right application under the right conditions
• What are the design issues?
• What are the deployment issues?
• What other capabilities are possible?
• How many grid apps will really be able to use it
for a significant advantage?
– Not many right now
– But later?
Collective Operations
for Message-Passing
• We’ve just seen an example of min reduction
– Over a variable number of elements!
• Broadcasts
– Typically requires multicast support
– Topology-aware middleware could provide similar capability
• Scatter/Gather
– Scatter is similar to bcast
• One-to-many communication but with different data
– Aggregate messages get partitioned en route to destination
• Barrier and Reductions
– Split phase or fuzzy barrier/reductions could also be used
with topology-aware middleware
• Scans
– Similar to a progressive reduction
– Similar performance benefits possible
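The relationship between a reduction and a scan can be shown in two lines (illustrative Python; the per-host values are invented):

```python
# Sketch: a reduction yields one result; a scan (prefix reduction)
# delivers the running result at every rank, which is why it can reuse
# the same tree-based machinery a reduction uses.
from itertools import accumulate

values = [5, 3, 8, 1]                 # one value per (hypothetical) host
reduction = min(values)               # single result, e.g. an LBTS-style min
scan = list(accumulate(values, min))  # host i sees min of values[0..i]
print(reduction)  # 1
print(scan)       # [5, 3, 3, 1]
```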
Communication Scope
• Providing service on a per instance or per
application basis greatly reduces problem size
• Provides isolation for late composition of
components with private communication needs
– Such as MPI communicators
• Issues
– Creation, termination overhead & latency
– Splitting, merging
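A toy sketch of why scoping reduces problem size: a collective restricted to a named scope touches only that scope's members (the names are illustrative, loosely analogous to MPI communicators):

```python
# Sketch: per-instance communication scopes shrink a collective from
# all hosts down to the scope's members.
hosts = {"H1": 5.0, "H2": 3.0, "H3": 9.0, "H4": 1.0}
scopes = {"simA": ["H1", "H2"], "simB": ["H3", "H4"]}

def scoped_min(scope):
    """A min-reduction limited to one scope's member hosts."""
    return min(hosts[h] for h in scopes[scope])

print(scoped_min("simA"))  # 3.0 -- only simA's two hosts participate
print(scoped_min("simB"))  # 1.0
```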
Content-Based Networking
• Content-Based Routing
– Message-Passing with Associative Addressing
– Requires an associative matching operation
• A fundamental and powerful capability
– Enables a number of very useful capabilities and services
– Event services, resource discovery, coordination programming
models
• But notoriously expensive to implement
– How can matching be done efficiently in a wide-area grid env?
• Can users and apps find a “sweet-spot” where
content-based routing is constrained enough to be
practical and provide capabilities that can’t be
accomplished any other way?
– Scale of deployability
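A minimal sketch of the associative matching at the heart of content-based routing (tuple-space-style templates with wildcard fields; all names and message contents are invented):

```python
# Sketch: subscribers register templates; None fields act as wildcards.
# A message is routed to every subscriber whose template matches.
def matches(template, msg):
    return len(template) == len(msg) and all(
        t is None or t == m for t, m in zip(template, msg))

subscriptions = {
    "radar":  ("track", None, None),       # any track message
    "logger": ("track", "hostile", None),  # only hostile tracks
}
msg = ("track", "hostile", 42)
receivers = [s for s, tmpl in subscriptions.items() if matches(tmpl, msg)]
print(sorted(receivers))  # ['logger', 'radar']
```

The naive version above scans every subscription per message, which hints at why matching is "notoriously expensive" at wide-area scale.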
Implementation Challenges
• Constrain/aggregate application comm space
– Enhance scalability/deployability
– Per application basis? Per instance basis?
• Use of soft-state to enhance fault tolerance?
• Effective use of network topology will often be an
instance of the Steiner Tree Problem
– For a graph with weighted edges, find a minimum weighted
subgraph through a subset of the vertices
• Practical topology construction
– Current multicast group construction and routing techniques?
• PIM Sparse and Dense Modes, i.e., single tree vs. tree per source
– Network overlays?
• IP tunnels among “active” sites managed via X-Bone
– Peer-to-Peer group construction?
• Hierarchical name space
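The classic 2-approximation for the Steiner tree problem stated above can be sketched as follows: take the metric closure over the terminals (all-pairs shortest paths), then build an MST on that closure (the example graph is invented, and a full implementation would also expand closure edges back into network paths):

```python
# Sketch of the metric-closure 2-approximation for Steiner trees.
INF = float("inf")

def steiner_approx_weight(n, edges, terminals):
    # all-pairs shortest paths (Floyd-Warshall)
    d = [[INF] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0
    for u, v, w in edges:
        d[u][v] = min(d[u][v], w)
        d[v][u] = min(d[v][u], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    # Prim's MST over the metric closure of the terminal set
    terms = list(terminals)
    in_tree, total = {terms[0]}, 0
    while len(in_tree) < len(terms):
        w, v = min((d[u][t], t) for u in in_tree
                   for t in terms if t not in in_tree)
        in_tree.add(v)
        total += w
    return total

# star graph: hub 0 connects terminals 1, 2, 3 at cost 1 each
edges = [(0, 1, 1), (0, 2, 1), (0, 3, 1)]
print(steiner_approx_weight(4, edges, [1, 2, 3]))  # 4 (optimum is 3 via the hub)
```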
Needed: A General Interface to
Isolate Applications from the Details
[Figure: layered architecture: Applications sit on a common API; beneath
it, Grid Services and TACS Middleware run over the "hard-wired"
infrastructure, allowing any implementation and any deployment.]
Conclusion
• Topology-aware communication useful for
many capabilities
– Collective operations, associative operations, scope
• Emulation and simulation results confirm
expectations
– Superior performance with as few as 8 end-hosts
• Middleware layer seems to be most promising
implementation approach at this time
– Many outstanding issues…
• Future Work
– Develop prototype APIs
– Evaluate with real apps on real grids