
The status of “SAKURA” project
and
proposals for “Instant Grid” in NEGST project
Mitsuhisa Sato
University of Tsukuba, Japan
1st NEGST WS
(JST-CNRS)
2006/6/24
“SAKURA” project
• Title: “Research on software and network technologies for an international large scale P2P distributed computing infrastructure”
• Period: 2005 ~ 2006 (2 years)
• Research topics and teams:
  – Distributed computing – parallel programming, storage
    • U of Tsukuba (Sato, Boku, Nakajima)
    • INRIA (Fedak, Cappello)
  – Network measurement
    • AIST (Kudoh, Kodama)
    • ENS Lyon (P. Primet, Gluck, Otal)
• Status and outcome
  – Integration of OmniRPC and XtremWeb
    • “Grid RPC System Integrating Computing Resources on Multiple Grid-enabled Job Scheduling Systems”
  – Data communication layer for OmniRPC
    • OmniStorage (and P2P data sharing)
Integrating Computing Resources
on Multiple Grid-enabled Job Scheduling Systems
Through a Grid RPC System
Yoshihiro Nakajima†, Mitsuhisa Sato†,
Yoshiaki Aida†, Taisuke Boku†,
Franck Cappello††
† University of Tsukuba, Japan
†† INRIA, France
Presentation outline
• Grid RPC System Integrating Computing
Resources on Multiple Grid-enabled Job
Scheduling Systems
– “SAKURA”
• Experimental results
• Summary
Motivation
• Need for high-throughput computing
  (cf. simulations for drug design, circuit design, …)
  – Many kinds of Grid-enabled Job Scheduling Systems (GJSSs) have been developed (XtremWeb, Condor, Grid Engine, CyberGRIP, GridMP)
• Users want to use massive computing resources on different sites easily
  – Different management policies and middleware at each site
  – Users must write extra code to adapt to each environment
  – Users do not want computations to be stopped by faults (need for a fault-tolerance mechanism)
⇒ Provide an RPC-style programming model on GJSSs
Objectives of Grid RPC system integrating
computing resources on multiple GJSSs
• Provide a unified parallel programming model by RPC on GJSSs
• Provide fault-tolerant features for the Grid RPC system on the worker programs
• Exploit massive computing resources on different sites simultaneously

[Figure: the OmniRPC client program connects through OmniRPC agents to a PC cluster running remote executables, an XtremWeb dispatcher with volunteer PCs, a Condor pool, and CyberGRIP (GRM/SRMD)]
Target Grid-enabled Job Scheduling Systems
• Grid-enabled Job Scheduling System (GJSS) or workflow manager
  – Used as a batch job system
  – The basic work unit is an independent job
  – A typical job reads input data from files, computes, and writes output data to files
[Figures: CyberGRIP architecture – a client (OJC) works through GMWs, a GRM, and SRMD servers/clients (GJM, JTX) that front batch system servers and a central server at sites A and B; XtremWeb architecture – a client registers applications and submits jobs to / retrieves results from XtremWeb dispatchers, which replicate data (tasks) to XtremWeb workers on Windows and UNIX computing nodes]
Design of Grid RPC system for integrating
computing resources on multiple GJSSs
• Decouple computation and data transmission from the RPC mechanism
• Design an agent mechanism to bridge between Grid RPC and the GJSS
• Use document-based communication rather than connection-based communication

[Figure: the Grid RPC client program submits work to the GJSS and gets results back through an agent bridge; the GJSS runs the Grid RPC worker program]

The proposed system can
• Submit an RPC computation as a job to the GJSS
• Guarantee fault-tolerant execution on the worker program side
An example of the proposed Grid RPC
system as an extension of OmniRPC
• Provides seamless parallel programming from a local cluster to multiple clusters in a grid environment
• Makes use of remote PC clusters as grid computing resources
• OmniRPC consists of three kinds of components: client, remote executable (rex), and agent
• The OmniRPC agent works as a proxy for the communication between the client program and the remote executables

[Figure: the client invokes an agent over the Internet/network; the agent multiplexes I/O between the client and the remote executables (rex)]
Extensions of OmniRPC for
Proposed System and Implementations
• We have designed class libraries and an interface to adapt to different GJSSs
  – The agent handles conversion between the OmniRPC protocol and the GJSS protocol
  – The agent keeps track of submitted jobs to provide fault tolerance for the worker program
  – The remote executable performs its I/O through files (see the sketch after this list)
• Implementations
  – XtremWeb (by INRIA) → OmniRPC/XW
  – CyberGRIP (by Fujitsu Lab) → OmniRPC/CG
  – Condor (by U of Wisconsin–Madison) → OmniRPC/C
  – Open Source Grid Engine (by Sun) → OmniRPC/GE
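The slides describe this bridging only at the level above, so the following C fragment is a hypothetical sketch of how an agent might wrap one RPC invocation as a Grid Engine job: marshalled arguments are staged to a file, a job script running the remote executable is generated, and the script is submitted with qsub. The file names, the rex_worker command line, and submit_rpc_as_job are invented for the illustration; this is not OmniRPC's actual code.

#include <stdio.h>
#include <stdlib.h>

void submit_rpc_as_job(int call_id, const void *args, size_t len)
{
    char in_path[64], script[64], cmd[128];

    /* 1. Stage the marshalled RPC arguments to a file: document-based
          communication, no live connection between client and worker. */
    snprintf(in_path, sizeof(in_path), "call%d.in", call_id);
    FILE *f = fopen(in_path, "wb");
    fwrite(args, 1, len, f);
    fclose(f);

    /* 2. Generate a job script; the remote executable reads its input
          file and writes its result file. */
    snprintf(script, sizeof(script), "call%d.sh", call_id);
    f = fopen(script, "w");
    fprintf(f, "#!/bin/sh\n./rex_worker call%d.in call%d.out\n",
            call_id, call_id);
    fclose(f);

    /* 3. Submit to the GJSS. If the job is lost to a fault, the agent
          can simply resubmit it, which is the basis of the fault
          tolerance on the worker side. */
    snprintf(cmd, sizeof(cmd), "qsub %s", script);
    system(cmd);
}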
Experiment
• How much performance improvement do we get by using two different GJSSs concurrently?
• Parallelized version of codeml in the PAML package [Yang97]
  – It analyzes the phylogeny of DNA or protein sequences using maximum likelihood
  – 1000 DNA sequences processed using 200 asynchronous RPC calls
• Experimental setting
  – Open Source Grid Engine (Dennis) + Condor (Alice cluster)

  Name    Configuration                Network   # nodes
  Dennis  Dual Xeon 2.4GHz, 1GB Mem    1GbE      10
  Dennis  Dual Xeon 3.06GHz, 1GB Mem   1GbE      6
  Alice   Dual Athlon 1800+, 1GB Mem   100MbE    9
Result
[Bar chart: elapsed time (s), 0 to 80000, for the original program on Dennis (1 node, 2.4GHz) and for Alice(9), Dennis(16), and Alice(9)+Dennis(16); sample data: 200 asynchronous RPCs, data per RPC: IN 100KB, OUT 30KB]
• 20.6 times speedup by using the two clusters together
  – The performance improvement was limited compared with using a single cluster
  – Load imbalance among RPC executions limits the performance improvement
OmniStorage:
Performance Improvement by Data
Management Layer in a Grid RPC System
Parameter Search Application
• Parameter search applications often need a large amount of common data
  – Master-worker parallel eigenvalue solver
    • Solves large-scale eigenvalue problems with the RPC model
    • Common data = a large-scale sparse matrix
    • If the matrix is very large, it takes a long time to send it to every worker

[Figure: the master sends parameters to each worker, and the large initial data must also reach every worker]
Problems in Grid RPC
• In the RPC model, the master communicates with the workers one by one
  – Only direct communication between master and worker is supported
  – This does not correspond to the actual network topology
• The network bandwidth between master and workers may become a performance bottleneck

[Figure: the master communicates directly with every worker]
Our Proposal
• We propose a programming model that decouples the data transfer layer from the RPC layer
  – This enables optimizing data transfers among the master and the workers
• We developed OmniStorage as a prototype to investigate this model
OmniStorage Overview
[Figure: in the OmniRPC layer the master performs worker invocation and argument passing one by one, while the large data transfer goes through the data transfer layer “OmniStorage”: the master registers data with OmstPutData() and each worker retrieves it with OmstGetData()]
Programming example using OmniRPC only
Master program

int main(){
    double initialdata[1000*1000], output[100][1000];
    OmniRpcRequest req[100];  /* handles for the asynchronous calls */
    int i;
    ...
    for(i = 0; i < 100; i++){
        /* initialdata is an IN argument of every call */
        req[i] = OmniRpcCallAsync("MyProcedure", i, initialdata, output[i]);
    }
    OmniRpcWaitAll(100, req);
    ...
}

Worker program (worker’s IDL)

Define MyProcedure(int IN i, double IN initialdata[1000*1000], double OUT output[1000]){
    ... /* program in C */
}
Programming example using OmniRPC
with OmniStorage
Master program

int main(){
    double initialdata[1000*1000], output[100][1000];
    OmniRpcRequest req[100];
    int i;
    ...
    /* Register the common data once: identifier, pointer, data size */
    OmstPutData("MyInitialData", initialdata, sizeof(double)*1000*1000);
    for(i = 0; i < 100; i++){
        /* initialdata is no longer an argument of the call */
        req[i] = OmniRpcCallAsync("MyProcedure", i, output[i]);
    }
    OmniRpcWaitAll(100, req);
    ...
}

Worker program (worker’s IDL)

Define MyProcedure(int IN i, double OUT output[1000]){
    double initialdata[1000*1000];
    /* Retrieve the common data by its identifier */
    OmstGetData("MyInitialData", initialdata, sizeof(double)*1000*1000);
    ... /* program in C */
}
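A rough comparison of the two examples, using only the sizes visible in the code: initialdata is 1000*1000 doubles, i.e. 8 bytes × 10^6 = 8 MB. With OmniRPC alone it is an IN argument of every one of the 100 asynchronous calls, so roughly 800 MB must leave the master; with OmniStorage it is registered once under the identifier "MyInitialData", and each worker pulls it through the data transfer layer, which (as the next slides show) can cache it at relay nodes.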
OmniStorage prototype
– First objective: to speed up the distribution of (large) initial data
– Relay nodes and workers can cache received data
  • Data transfer of the same content occurs only once

[Figure: the work management layer – OmniRPC agents invoking workers – sits above the data management layer of OmniStorage servers, which carry data from Omnst_put_data() to Omnst_get_data()]
Prototype Implementation of OmniStorage
[Figure: at site A, the OmniRPC client host calls OmstPutData(); the data crosses the WAN to a relay node of the cluster at site B only once; inside the cluster, each OmniRPC worker sends a request to the relay node, receives the data, calls OmstGetData(), and begins processing]
Performance
• We used the master-worker parallel eigenvalue solver (Sakurai et al.)
• It has 80 jobs, and each job takes 30 seconds
• Each worker requires 50MB of common data

[Chart: execution time (s) and speedup relative to 1 node, for 1, 2, 4, 8, 16, 32, and 64 nodes, comparing “OmniRPC only” with “OmniRPC + OmniStorage”]
Coarse-grain Eigenvalue Solver:
Moment-based Method
• Find all of the eigenvalues that lie inside a given domain
• A small matrix pencil that has only the desired eigenvalues is derived by solving large sparse systems of linear equations constructed from A and B
• Since these equations can be solved independently, we solve them on remote hosts in parallel
• This approach is suitable for master-worker programming models

[Figure: singular points of eigenvalues inside a circular region of the complex plane]
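The slide states the method without formulas; the following is a sketch of the moment-based formulation as published by Sakurai and Sugiura, so the symbols u, v, m, N, Γ and the quadrature weights w_j come from that formulation rather than from the slide:

\[
\mu_k \;=\; \frac{1}{2\pi i}\oint_{\Gamma} z^{k}\, u^{H}(zB - A)^{-1} v \, dz ,
\qquad k = 0, 1, \dots, 2m-1 ,
\]

where \(\Gamma\) is the circle bounding the region of interest and u, v are fixed vectors. Approximating the contour integral by an N-point quadrature rule on \(\Gamma\) with nodes \(z_j\) and weights \(w_j\) gives

\[
\mu_k \;\approx\; \sum_{j=0}^{N-1} w_j\, z_j^{k}\, u^{H} y_j ,
\qquad (z_j B - A)\, y_j = v .
\]

The N large sparse linear systems \((z_j B - A) y_j = v\) are mutually independent, which is exactly what makes each one a natural RPC job for a remote host; the small \(m \times m\) Hankel matrix pencil built from \(\mu_0, \dots, \mu_{2m-1}\) then yields the eigenvalues inside \(\Gamma\).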
Future Work
• Implement a function for collecting result data from the workers to the master
  – The current version of OmniStorage can only send data from the master to the workers
• Examine other systems for the data management layer, since the API does not depend on the implementation
  – Distributed hash tables
  – Gfarm – distributed file system
  – BitTorrent – P2P file sharing
Proposals for “Instant Grid”
in NEGST project
1. Instant Grid
2. Network measurement
3. Interoperability
Background Project
“Study on P2P grid infrastructure for ‘large-capacity’ distributed computing”
– Supported by JSPS; project period: 2005 ~ 2007 (3 years)
– P2P grid = grid + P2P distributed computing
  • A “grid” infrastructure that exploits computing power using P2P (peer-to-peer) technologies
– “Large-capacity” distributed computing
  • Large-scale computing
    – parameter searches
    – not large-scale parallel programs with MPI
  • Large amounts of data
    – We want to handle large-scale data that cannot be stored at a single site
Our research topics for “Instant Grid”
• P2P infrastructure
  – P2P network layer, traversing firewalls/NAT
    • Overlay network and UDP hole punching
  – Interoperability with other grid/P2P middleware (starting from OmniRPC/XW)
• Virtualization of resources (VM technology)
  – BEE: Linux binary execution environment on Windows
  – checkpoint/migration …
• Data management
  – P2P data storage system for large-scale persistent data in a volatile P2P grid environment
    • Gfarm for P2P environments (by Tatebe)
  – OmniStorage: data layer for Grid RPC
• Numerical algorithms for the P2P grid
  – Collaboration with Serge Petiton’s group
Overlay network for P2P
• Computing resources in a P2P grid will be PCs in offices and homes
• P2P ad-hoc networking through NAT
  – Most home or office PCs run with private IP addresses behind a NAT (or a personal firewall)
  – For P2P computing with them, we need to solve the problem of NAT traversal
  – TCP is useful, but only outbound connections from inside to outside the NAT can be established
• Overlay network for the P2P grid
  – A logical network to aggregate computing resources in the P2P grid
Several ways for NAT traversal
• UDP hole punching
  – When the ACK packet for an outbound UDP communication arrives, the NAT allows it to pass
  – Only available for UDP (unreliable, connection-less communication)
  – The NAT judges whether a packet is such a reply by the combination of IP address and port number and by the interval since the outbound packet
  – It is therefore easy to pass an ordinary packet disguised as a “fake” ACK packet
• TCP hole punching
  – Same as UDP hole punching, but also available for TCP [Bryan06]
  – The NAT uses more detailed information from the TCP packet header than for UDP, so a “faked” packet is usually rejected
  – With packet header modification, it is possible to get a TCP packet through the NAT
  – Packet modification requires modified network drivers, which is not suitable for P2P communication
  – Not many NATs allow it
• NAT with UPnP
  – Microsoft and other companies push it, and most home NATs accept it
  – Not available on high-end routers (UPnP targets home use)
UDP hole punching
• An arbitration server is required at the beginning of communication
• After that, direct client-to-client P2P communication is possible
• Limited to UDP only

[Figure: client1 in domain1 behind NAT1 and client2 in domain2 behind NAT2 contact the arbitration server to open a direct UDP connection, then communicate over UDP]
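As a concrete illustration, here is a minimal C sketch of the client side of this procedure (POSIX sockets). The arbitration server address (198.51.100.1:9000) and its reply format – a plain "IP:port" string naming the peer's public endpoint – are invented for the example; error handling and the server itself are omitted.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);

    /* 1. Register with the arbitration server. This outbound packet also
          creates a (public IP, port) mapping for us in our own NAT. */
    struct sockaddr_in server = { .sin_family = AF_INET,
                                  .sin_port   = htons(9000) };
    inet_pton(AF_INET, "198.51.100.1", &server.sin_addr); /* hypothetical */
    sendto(sock, "HELLO", 5, 0, (struct sockaddr *)&server, sizeof(server));

    /* 2. The server saw our public endpoint and tells us the peer's,
          e.g. "203.0.113.7:31245". */
    char buf[64];
    ssize_t n = recvfrom(sock, buf, sizeof(buf) - 1, 0, NULL, NULL);
    buf[n] = '\0';
    char ip[32];
    int port;
    sscanf(buf, "%31[^:]:%d", ip, &port);

    struct sockaddr_in peer = { .sin_family = AF_INET,
                                .sin_port   = htons(port) };
    inet_pton(AF_INET, ip, &peer.sin_addr);

    /* 3. Punch the hole: both clients send to each other's public
          endpoint. The first packets may be dropped by the remote NAT,
          but they open our own NAT so the peer's packets can come in. */
    for (int i = 0; i < 5; i++) {
        sendto(sock, "PUNCH", 5, 0, (struct sockaddr *)&peer, sizeof(peer));
        sleep(1);
    }

    /* 4. From here on, direct client-to-client UDP traffic flows. */
    n = recvfrom(sock, buf, sizeof(buf) - 1, 0, NULL, NULL);
    if (n > 0) { buf[n] = '\0'; printf("peer says: %s\n", buf); }

    close(sock);
    return 0;
}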
RI2N/UDP for UDP hole punching
RI2N: Redundant Interconnection with Inexpensive Network on UDP
– Originally designed and implemented for reliable communication on PC clusters with multi-link Ethernet
– User-level implementation as a communication library whose API is similar to the TCP socket interface
– To avoid locked-in-kernel states (no response to the user), it is implemented on UDP with flow-control and retransmission features
– Available in any UDP-compatible environment; no kernel modification is required
• Although the low-level protocol is UDP, RI2N/UDP provides reliable stream communication like TCP (a generic sketch of this idea follows below)
• We use RI2N/UDP as the low-level layer for traversing NATs in our P2P overlay network
• Status and plan:
  – Experiments and evaluation of RI2N/UDP for hole punching are done!
  – Design of high-level algorithms for routing/finding
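RI2N/UDP's internal protocol is not given in the slides, so the following C fragment is only a generic stop-and-wait sketch of the idea named above: sequence numbers, ACKs, and timeout-driven retransmission implemented in user space on top of UDP. The packet layout, the 200 ms timeout, and the function names are invented for the example (byte-order handling is omitted).

#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>

#define TIMEOUT_MS 200

struct packet {
    uint32_t seq;          /* sequence number of this datagram */
    char payload[1024];
};

/* Wait up to TIMEOUT_MS for an ACK carrying the expected sequence number. */
static int wait_for_ack(int sock, uint32_t seq)
{
    struct timeval tv = { 0, TIMEOUT_MS * 1000 };
    setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
    uint32_t ack;
    /* recv() returns -1 on timeout, which triggers a retransmission */
    return recv(sock, &ack, sizeof(ack), 0) == (ssize_t)sizeof(ack)
           && ack == seq;
}

/* Send one datagram reliably over a connected UDP socket:
   retransmit until the matching ACK arrives. len must be <= 1024. */
int reliable_send(int sock, uint32_t seq, const void *data, size_t len)
{
    struct packet p = { .seq = seq };
    memcpy(p.payload, data, len);
    for (;;) {
        send(sock, &p, sizeof(p.seq) + len, 0);
        if (wait_for_ack(sock, seq))
            return 0;
    }
}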
BEE: Linux binary execution environment on Windows for Grid RPC
• Objective
  – Exploit computing resources running under Windows as OmniRPC workers
• Approach
  – Direct execution of the Linux binary of the RPC worker, without recompilation (cf. Wine, Line)
  – Not a VM
• BEE
  – Loader for Linux binaries on Windows
  – Emulation of system calls
    • Limited support for network system calls
• Status and plan
  – The first prototype is finished (Linux 2.4)
  – Next: Linux 2.6 support and process migration

[Figure: a Linux host runs the RPC client; the OmniRPC agent uploads the Linux binary of the RPC worker to a Windows remote host, where BEE loads and invokes it, and RPC results are returned to the client]
Q&A?
End …