Zero-Cost Reliability for Tree-based Overlay Networks


The MRNet Tree-based Overlay Network:
Where We’ve Been, Where We’re Going!

Dorian Arnold, Paradyn Project, University of Wisconsin
Philip C. Roth, Future Technologies Group, Oak Ridge National Laboratory
Abstract
• Large scale systems are here
• Tree-based Overlay Networks (TBŌNs)
– Intuitive, seemingly restrictive
– Effective model for tool scalability
– Prototype: www.paradyn.org/mrnet
• Where we’ve been
– Tool scalability
– Programming model for a large class of applications
• Where we’re going
– Topology studies
– TBŌNs on high-performance networks
– Filters on hardware accelerators
– Reliability
HPC Trends from the Top500

[Chart: number of Top500 systems by processor count (bins from 0–63 up to >8192), June 1999 through November 2005; panels show the November ’05 processor count distribution and the growth in 1024-processor systems]

Easier than ever to deploy thousands of processors (one BG/L rack!)
An Example: ORNL National Center for Computational Sciences
Hierarchical Distributed Systems
• Hierarchical topologies
– Application control
– Data collection
– Data reduction/analysis
• As scale increases, the front-end becomes a bottleneck

[Diagram: a front-end (FE) connected directly to many back-ends (BE)]
TBŌNs for Scalable Systems
TBŌNs for scalability:
– Scalable multicast
– Scalable gather
– Scalable data aggregation

[Diagram: the FE connected to the BEs through a tree of communication processes (CPs)]
TBŌN Model
– Application front-end (FE)
– Tree of communication processes (CPs)
– Application back-ends (BEs)

[Diagram: FE at the root, CPs at the interior nodes, BEs at the leaves]
TBŌN Model
Reliable FIFO channels:
– Non-lossy
– Duplicate suppressing
– Non-corrupting

[Diagram: reliable channels connect the FE, CPs, and BEs]
TBŌN Model
– Application-level packets
– Packet filters
– Filter state
– Channel state

[Diagram: packets flow through the tree; each CP runs a packet filter and holds filter state and channel state]
TBŌN Model
Filter function:
– Inputs a packet from each child
– Outputs a single packet
– Updates filter state

{output, new_state} ≡ f(inputs, cur_state)
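To make that signature concrete, here is a minimal standalone sketch of such a filter, using the duplicate-removing integer union that reappears in the state composition example later. This is illustrative C++ only, not the MRNet filter API (an MRNet filter example appears in the extra slides); FilterResult and distinct_filter are names invented for this sketch.

// Minimal sketch of a stateful TBŌN filter obeying
// {output, new_state} = f(inputs, cur_state).
#include <set>
#include <vector>

struct FilterResult {
    std::set<int> output;     // single packet forwarded to the parent
    std::set<int> new_state;  // updated filter state kept at this process
};

FilterResult distinct_filter(const std::vector< std::set<int> >& inputs,
                             const std::set<int>& cur_state)
{
    std::set<int> merged = cur_state;            // start from the current state
    for (const std::set<int>& pkt : inputs)      // one input packet per child
        merged.insert(pkt.begin(), pkt.end());   // union in the child's values
    return { merged, merged };                   // output is a copy of new_state
}

Because the output is a copy of the state, a filter like this one can later double as the composition operator used for state regeneration.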
TBŌNs at Work
• Multicast
– ALMI [Pendarakis, Shi, Verma and Waldvogel ’01]
– End System Multicast [Chu, Rao, Seshan and Zhang ’02]
– Overcast [Jannotti, Gifford, Johnson, Kaashoek and O’Toole ’00]
– RMX [Chawathe, McCanne and Brewer ’00]
• Multicast/gather (reduction)
– Bistro (no reduction) [Bhattacharjee et al ’00]
– Gathercast [Badrinath and Sudame ’00]
– Lilith [Evensky, Gentile, Camp, and Armstrong ’97]
– MRNet [Roth, Arnold and Miller ’03]
– Ygdrasil [Balle, Brett, Chen, LaFrance-Linden ’02]
• Distributed monitoring/sensing
– Ganglia [Sacerdoti, Katz, Massie, Culler ’03]
– Supermon (reduction) [Sottile and Minnich ’02]
– TAG (reduction) [Madden, Franklin, Hellerstein and Hong ’02]
Example TBŌN Reductions
• Simple
– Min, max, sum, count, average
– Concatenate
• Complex
– Clock synchronization [Roth, Arnold, Miller ’03]
– Time-aligned aggregation [Roth, Arnold, Miller ’03]
– Graph merging [Roth, Miller ’05]
– Equivalence relations [Roth, Arnold, Miller ’03]
– Mean-shift image segmentation [Arnold, Pack, Miller ’06]
TBŌNs for Tool Scalability
MRNet integrated into Paradyn
• Efficient tool startup
• Performance data analysis
• Scalable visualization
TBŌNs for Scalable Applications
• Many algorithms reduce to equivalence computation
– Equivalence/non-equivalence used to summarize/analyze input data

Application        Input         Filter                                  Output
Trace analysis     Trace files   Trace equivalence / anomaly detector    Compressed traces, anomalous traces
Graph merging      Sub-graphs    Sub-graph equivalence                   Merged graphs
Data clustering    Data files    Object classifiers                      Partitioned data

• Streaming programming models
• Possibly even for Bulk Synchronous Parallel programs
TBŌNs for Scalable Applications: Mean-Shift Algorithm
• Clustering points in feature spaces
• Useful for image segmentation
• Prohibitively expensive as feature space complexity increases

[Figure: a window over a point cloud being shifted toward its centroid]
TBŌNs for Scalable Applications: Mean-Shift Algorithm
1. Partition data into windows and calculate window densities
2. Keep windows above chosen density threshold
3. Run mean-shift on remaining windows
4. Keep local maxima as peaks
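As a concrete sketch of step 3, here is a minimal, generic mean-shift window update (flat kernel, Euclidean distance). This is illustrative C++ only, not the talk's MRNet filter implementation; mean_shift, radius, and eps are names and parameters invented for this sketch.

// One mean-shift window update over d-dimensional points, repeated
// until convergence: shift the window center to the centroid of the
// points that fall inside it.
#include <vector>

using Point = std::vector<double>;

Point mean_shift(const std::vector<Point>& pts, Point center,
                 double radius, double eps = 1e-6)
{
    const size_t d = center.size();
    for (;;) {
        Point centroid(d, 0.0);
        size_t count = 0;
        for (const Point& p : pts) {
            double dist2 = 0.0;
            for (size_t i = 0; i < d; ++i)
                dist2 += (p[i] - center[i]) * (p[i] - center[i]);
            if (dist2 <= radius * radius) {            // point falls inside the window
                for (size_t i = 0; i < d; ++i) centroid[i] += p[i];
                ++count;
            }
        }
        if (count == 0) return center;                 // empty window: stop
        double shift2 = 0.0;
        for (size_t i = 0; i < d; ++i) {
            centroid[i] /= count;
            shift2 += (centroid[i] - center[i]) * (centroid[i] - center[i]);
            center[i] = centroid[i];
        }
        if (shift2 <= eps * eps) return center;        // converged: window sits at a density peak
    }
}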
TBŌNs for Scalable Applications: Mean-Shift Algorithm
• Uses MRNet as a general-purpose programming paradigm
• Implements mean-shift in custom MRNet filters

~6x speedup with only 6% more nodes
TBŌN Computational Model
• At large scales, suitable for algorithms with:
– Complexity ≥ O(n), where n is the input size
– Output size ≤ total input size*
• *Sometimes the algorithm just runs faster on the output (better-behaved input)
– Output in the same form as the inputs
• E.g., if inputs are sets of elements, the output should be a set of elements
Research and Development Directions
• TBŌN topology studies
• TBŌNs and high-performance networks
• Use of emerging technologies for TBŌN filters
• TBŌN Reliability
TBŌN Topology
We expect many factors to influence the “best” topology:
– Physical network topology and capabilities
– Expected traffic (type and volume)
– Desired reliability guarantees
– Cost of “extra” nodes
TBŌN Topology Investigation
• Previous studies used reasonable topologies
• How these factors influence performance remains an open question
• Beginning a rigorous effort to investigate this issue
– Performance modeling
– Empirical studies on a variety of systems
High-end Network Support
• Current MRNet implementation uses TCP/IP sockets
• Many high-end networks provide TCP/IP support
– E.g., IP over Quadrics QsNet
– Flexible, but undesirable for performance reasons
• Effort underway to support alternative data transports
– One-sided, OS/application bypass
– Complements topology investigations
– Initially targeting Portals on Cray XT3
High-Performance Filters on Hardware Accelerators
• Multi-paradigm computing (MPC) systems are here
– MPC systems include several types of processors, such as FPGAs, multi-core processors, GPUs, PPUs, and MTA processors
– E.g., Cray Adaptive Supercomputing strategy, SRC Computers, Linux Networx, DRC FPGA co-processors
• Streaming approach expected to work well for some of these processor types
• Running filters on accelerators is a natural fit for some applications, e.g., Sloan Digital Sky Survey and Large Synoptic Survey Telescope
TBŌN Reliability

MTTF ∝ 1 / (system size)

Given the emergence of TBŌNs for scalable computing, low-cost reliability for TBŌN environments becomes critical!
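As a back-of-the-envelope illustration of that scaling (the node MTTF and system size below are assumed for illustration, not figures from the talk):

\[
\mathrm{MTTF}_{\mathrm{system}} \approx \frac{\mathrm{MTTF}_{\mathrm{node}}}{N}
\qquad \Rightarrow \qquad
\frac{5~\text{years} \times 8760~\text{h/year}}{10{,}000~\text{nodes}} \approx 4.4~\text{hours}
\]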
TBŌN Reliability
• Goal
– Tolerate process failures
– Avoid checkpoint overhead
• General concept: leverage TBŌN properties
– Natural information redundancies
– Computational semantics
• Lost state may be replaced by non-identical state
• Computational equivalence: relaxed consistency model
• Zero-cost: no additional computation, storage, or network overhead during normal operation
– Define operations that compensate for lost state
– Maintain computational equivalence
TBŌN Information Redundancies
Fundamental to the TBŌN model:
1. Input streams propagate toward the root
2. Persistent state summarizes input history
3. Therefore, the summary is replicated naturally as input propagates upstream
Recovery Strategy
if failure is detected then
  1. Reconstruct the tree
  2. Regenerate compensatory state
  3. Reintegrate state into the tree
  4. Resume normal operation
end if
State Regeneration: Composition

[Diagram: parent CPi with filter state fs(CPi); children CPj and CPk with filter states fs(CPj) and fs(CPk)]

The parent's state is the composition of its children's states.
State Regeneration: Composition
State composition:
– Input: filter state from the children
– Output: computationally-equivalent state for the parent

fs(CPi) ≡ fs(CPj) ⊕ fs(CPk)

where ⊕ is the composition operator, fs(CPi) is the parent's state, and fs(CPj), fs(CPk) are the children's states.
State Regeneration: Composition
Where does this mysterious composition operation come from?

Recall the filter definition:
{output, new_state} ≡ f(inputs, cur_state)

When the filter's new_state is a copy of its output, f itself can serve as the composition operator.
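Stated as a formula (with ∅ standing for an empty initial filter state; how the regenerating process seeds its state is an assumption of this sketch, not spelled out on the slide):

\[
fs(CP_i) \;\equiv\; fs(CP_j) \oplus fs(CP_k)
\;=\; \mathrm{new\_state}\bigl( f(\{\, fs(CP_j),\ fs(CP_k) \,\},\ \varnothing) \bigr)
\]

That is, the ordinary filter function, fed the children's saved states as if they were input packets, regenerates the parent's state; the proof outline that follows sketches why.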
State Regeneration: Composition
Proof outline:
– State is the history of processed inputs
– Children's output becomes the parent's input
– Updated state is a copy of the output
• so it can be used as input to the filter function
– Running the filter on the children's states therefore produces a computationally equivalent state for the parent
State Regeneration: Composition
Composition can also work when the output is not a copy of the state!
– Requires a mapping operation from the filter state to the output form
State Composition Example

[Figure sequence: a duplicate-removing integer union runs over a tree in which CP0 is the root, CP1 and CP2 are its children, CP3 and CP4 feed CP1, and CP5 and CP6 feed CP2. Streams of integers (values 1, 3, 4, 5, 8, 9, with duplicates) enter at the leaves. Every filter state starts as {}; as packets propagate upstream, each process accumulates the set of distinct values it has seen (e.g., CP1 eventually holds {1,3,4,5} and CP2 holds {1,5,8,9}), and the root CP0 ends with the complete set {1,3,4,5,8,9}.]
State Composition Example

[Figure: CP0 crashes! At the moment of failure, CP0's filter state is {1,3}, CP1 holds {1,3,4,5}, CP2 holds {1,5,8}, and a few packets are still in flight below them.]

• Use f on the children's states to regenerate a computationally-consistent version of the lost state:

fs(CP0) ≡ fs(CP1) ⊕ fs(CP2)
State Composition Example
Non-identical, but computationally-consistent!

[Figure: side-by-side comparison. In the failure-free run, CP0's state at this point was {1,3}; in the recovered run, the regenerated state is fs(CP1) ⊕ fs(CP2) = {1,3,4,5} ∪ {1,5,8} = {1,3,4,5,8}. The states differ, yet both lead to the same final result.]

fs(CP0) ≡ fs(CP1) ⊕ fs(CP2)
State Composition Example

[Figure sequence: processing resumes after recovery, shown side by side with the failure-free run. The remaining packets propagate upstream, and in both runs the root CP0 converges to the same final set, {1,3,4,5,8,9}.]
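For concreteness, here is a small runnable sketch of exactly this regeneration step, reusing the duplicate-removing union filter sketched earlier. It is illustrative code, not MRNet itself; the values match the example above.

// Regenerating CP0's state from its children's saved filter states,
// using the same union filter that normally processes packets.
#include <cassert>
#include <set>
#include <vector>

using State = std::set<int>;

// {output, new_state} = f(inputs, cur_state), with output == new_state.
State union_filter(const std::vector<State>& inputs, const State& cur_state)
{
    State merged = cur_state;
    for (const State& pkt : inputs)
        merged.insert(pkt.begin(), pkt.end());
    return merged;
}

int main()
{
    State fs_cp1 = {1, 3, 4, 5};   // CP1's filter state when CP0 fails
    State fs_cp2 = {1, 5, 8};      // CP2's filter state when CP0 fails

    // Compose: run the ordinary filter over the children's states,
    // starting from an empty state, to regenerate CP0's state.
    State fs_cp0 = union_filter({fs_cp1, fs_cp2}, State{});

    assert(fs_cp0 == State({1, 3, 4, 5, 8}));  // non-identical to the lost {1,3},
                                               // but computationally consistent
    return 0;
}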
Reliability Highlights
• Zero-cost TBŌN reliability requirements:
1. Associative/commutative filter function
2. Filter state and output have the same representation, or
3. Known mapping from the filter state representation to the output form
• Filter function used for regeneration
• Many computations meet the requirements
Other Issues
• Compensating for lost messages
– Use computational state to compensate
– Idempotent/non-idempotent computations
• Other state regeneration mechanisms
– Decomposition
• Failure detection
• Tree reconstruction
• Evaluation of the recovery process
MRNet References
• Arnold, Pack, and Miller, “Tree-based Overlay Networks for Scalable Applications”, Workshop on High-Level Parallel Programming Models and Supportive Environments, April 2006.
• Roth and Miller, “The Distributed Performance Consultant and the Sub-Graph Folding Algorithm: On-line Automated Performance Diagnosis on Thousands of Processes”, Principles and Practice of Parallel Programming, March 2006.
• Schulz et al., “Scalable Dynamic Binary Instrumentation for Blue Gene/L”, Workshop on Binary Instrumentation and Applications, September 2005.
• Roth, Arnold, and Miller, “Benchmarking the MRNet Distributed Tool Infrastructure: Lessons Learned”, 2004 High-Performance Grid Computing Workshop, April 2004.
• Roth, Arnold, and Miller, “MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools”, SC 2003, November 2003.
Summary
• TBŌN model suitable for many types of tools, applications, and algorithms
• Future work:
– Evaluation of reliability mechanisms
• Coming real soon!
– Performance modeling to support topology decisions
– TBŌNs on emerging HPC networks and technologies
– Other application areas like GIS, bioinformatics, data mining, …
Funding Acknowledgements
• This research is sponsored in part by The National Science
Foundation under Grant EIA-0320708
• This research is also sponsored in part by the Office of
Mathematical, Information, and Computational Sciences,
Office of Science, U.S. Department of Energy under
Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC.
• Accordingly, the U.S. Government retains a non-exclusive,
royalty-free license to publish or reproduce the published
form of this contribution, or allow others to do so, for U.S.
Government purposes.
EXTRA SLIDES
MRNet Front-end Interface

int front_end_main() {
    // Instantiate the process tree described by a topology configuration.
    Network* net = new Network(topology_conf);

    // Broadcast communicator spanning all back-ends.
    Communicator* comm = net->get_BroadcastCommunicator();

    // Stream that applies the integer-max filter, waiting for all children.
    Stream* stream = new Stream(comm, IMAX_FILT, WAITFORALL);

    stream->send("%s", "go");      // tell every back-end to start
    int result;
    stream->recv("%d", &result);   // receive the single aggregated (max) value
    return 0;
}
MRNet Back-end Interface

int back_end_main() {
    Stream* stream;
    char* s;

    // Join the network; the back-end learns its place in the tree.
    Network* net = new Network();

    // Receive the broadcast command and the stream it arrived on.
    net->recv("%s", &s, &stream);

    if (strcmp(s, "go") == 0) {        // compare string contents, not pointers
        int rand_int = rand();          // this back-end's contribution
        stream->send("%d", rand_int);   // send it upstream through the filter
    }
    return 0;
}
MRNet Filter Interface

void imax_filter(vector<Packet>& packets_in,
                 vector<Packet>& packets_out)
{
    // Integer-max reduction: one packet in from each child, one packet out.
    int result = INT_MIN;   // from <climits>
    for (unsigned i = 0; i < packets_in.size(); i++) {
        result = max(result, packets_in[i].get_int());
    }
    Packet p("%d", result);
    packets_out.push_back(p);
}