Developing a Scalable Coherent Interface (SCI) device for MPJ Express
Guillermo López Taboada
14th October, 2005
Dept. of Electronics and Systems
University of A Coruña (Spain)
http://www.des.udc.es
Visitor at Distributed Systems Group
http://dsg.port.ac.uk
Outline
• Introduction
• Design of scidev
• Implementation issues
• Benchmarking
• Future work
• Conclusions
July 17, 2015
Introduction
• The interconnection network and its associated
software libraries play a key role in High
Performance Clustering Technology
• Cluster interconnection technologies:
• Gb & 10Gb Ethernet
• Myrinet
• SCI
• Infiniband
• Qsnet
• Quadrics
• GSN - HIPPI
• Giganet
– Latencies are small (usually under 10us)
– Bandwidths are high (usually above 1Gbps)
Introduction
• SCI (Scalable Coherent Interface)
• Latency 1.42 us (theoretical)
• Bandwidth 5333 Mbps (bi-directional)
–Usually without switch (small clusters)
–Topologies 1D (ring) / 2D (torus 2D)
Introduction
• Figure: example of a 2D torus SCI cluster with an FE (Fast Ethernet) network for administration
Introduction
• Software available from Dolphinics:
• Software available from Scali:
–ScaIP: IP emulation
–ScaSISCI: SISCI (Sw Infrastructure for SCI)
–ScaMPI: proprietary MPI implementation
Introduction
• For portability, the JDK supports only the ubiquitous TCP/IP for networking
• Previously, IP emulations were used (ScaIP & SCIP), but their performance is similar to FE
• Now there is a high-performance socket implementation, SCI SOCKETS
• This mirrors other interconnect technologies, e.g. Myrinet (IPoGM -> GMSockets)
Introduction
• Several research projects have been trying
to get support in Java for these System
Area Networks, mainly in Myrinet:
–KaRMI/GM (JavaParty, Univ. Karlsruhe)
–Manta/LFC/Panda/Ibis (Vrije Universiteit Amsterdam, the Netherlands)
–Java GM Sockets
–RMIX Myrinet
–mpiJava/MPICH-GM or MPICH-MX
–…
• But nothing for SCI
Introduction
• My PhD Project:
“Designing Efficient Mechanisms for Java communications
on SCI systems”
• The motivation is to fill the gap between Java and this high-speed interconnect, which lacks software support for Java
– SCI Java Fast Sockets
– An SCI communication device, base of a messaging
system
– SCI Channel for Java NIO
– Wrappers for some libraries
– Optimized RMI for High Speed Networks
– Low level Java buffering and communication system
Introduction
• MPJ Express, a reference
implementation of the MPI bindings for the
Java language, has been released.
–The MPI bindings for C, C++, and Fortran are already mature, while work on the Java binding is ongoing at DSG
• A good opportunity to provide SCI support
to a messaging system
Design of scidev
• Use of the Java Native Interface (JNI) is unavoidable
–To provide support and good performance we have to rely on specific low-level libraries
–When SCI hardware is present, the device should use it
–Loss of portability in exchange for higher performance
–Differences between mpiJava and scidev:
• mpiJava- thin wrapper providing a large number of
Java MPI primitives
• scidev- thicker layer providing a small API
Design of scidev
• Implementing the xdev API:
–init()
–finish()
–id()
–iprobe(ProcessID srcID, int tag, int context)
–irecv(Buffer buf, ProcessID srcID, int tag, int context, Status status)
–isend(Buffer buf, ProcessID destID, int tag, int context)
–and the blocking counterparts of these functions: probe, recv, send + issend & ssend
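The method list above can be read as a small Java interface. The sketch below is illustrative only: the return types, the `Request` handle with its `iwait()` method, the `LoopbackDevice` class, and the stub `Buffer`, `ProcessID`, and `Status` classes are assumptions made for this example, not the actual MPJ Express declarations. It shows one way the blocking calls can be layered on the non-blocking ones.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Stub types standing in for the real MPJ Express classes (assumed shapes).
final class ProcessID {
    final int rank;
    ProcessID(int rank) { this.rank = rank; }
}

final class Buffer {
    final byte[] data;
    Buffer(byte[] data) { this.data = data; }
}

final class Status {
    int tag;
    int length;
}

// Hypothetical handle for a pending non-blocking operation.
interface Request {
    void iwait(); // returns once the operation has completed
}

// Reduced view of the method list; return types are assumptions.
interface Device {
    ProcessID id();
    Request isend(Buffer buf, ProcessID destID, int tag, int context);
    Request irecv(Buffer buf, ProcessID srcID, int tag, int context, Status status);

    // Blocking counterparts expressed through the non-blocking calls.
    default void send(Buffer buf, ProcessID destID, int tag, int context) {
        isend(buf, destID, tag, context).iwait();
    }

    default void recv(Buffer buf, ProcessID srcID, int tag, int context, Status status) {
        irecv(buf, srcID, tag, context, status).iwait();
    }
}

// Toy single-process loopback device, used only to exercise the interface.
// Messages are delivered in FIFO order; tags and contexts are not matched.
final class LoopbackDevice implements Device {
    private static final class Message {
        final int tag;
        final byte[] payload;
        Message(int tag, byte[] payload) { this.tag = tag; this.payload = payload; }
    }

    private final Queue<Message> wire = new ArrayDeque<>();
    private final ProcessID self = new ProcessID(0);

    public ProcessID id() { return self; }

    public Request isend(Buffer buf, ProcessID destID, int tag, int context) {
        wire.add(new Message(tag, buf.data.clone())); // completes immediately
        return () -> { };
    }

    public Request irecv(Buffer buf, ProcessID srcID, int tag, int context, Status status) {
        return () -> {
            Message m = wire.remove(); // throws NoSuchElementException if nothing was sent
            System.arraycopy(m.payload, 0, buf.data, 0, m.payload.length);
            status.tag = m.tag;
            status.length = m.payload.length;
        };
    }
}
```

A real device would complete requests asynchronously; the loopback class only demonstrates the call pattern.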
Design of scidev
MPJ Express layered design (top to bottom):
MPJ API
MPJ collective communications (high level)
MPJ point-to-point communications (base level)
mpjdev (MPJ device level)
– xdev: smpdev (Threads API), niodev (Java NIO), gmdev (JNI)
– Native MPI (JNI)
Java Virtual Machine (JVM)
Hardware (NIC, memory, etc.)
Design of scidev
Figure: position of scidev in the stack. Labels: mpjdev, xdev, mxdev, scidev, JNI, Native Libraries, JVM, O.S
Design of scidev
• Native libraries: SCILib and SISCI
Implementation Issues
• Optimizations / initialization process:
–JNI: caching field identifiers and references to objects
–Sending 2 messages in the Long protocol
• The first is sent from a 4-byte-aligned address; the second runs from a 128-byte-aligned address up to a 128-byte-aligned address (this may run past the end of the message; the raw Buffer has a 2^n length)
–Algorithm to initialize the message queues of SCILib
• Connect (to nodes with lower rank)
• Create (for all nodes, beginning with the following rank)
• Connect (the remaining nodes)
• The complexity is O(n)
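One possible reading of the three-phase queue initialization can be sketched as follows. The class name `QueueInitOrder`, the `plan` method, and the wrap-around order of the create phase are assumptions for illustration; the slide does not give the exact SCILib calls. Each node performs 2(n-1) steps, which matches the stated O(n) complexity.

```java
import java.util.ArrayList;
import java.util.List;

final class QueueInitOrder {
    /** Ordered initialization steps for node `rank` in a cluster of `size` nodes. */
    static List<String> plan(int rank, int size) {
        List<String> steps = new ArrayList<>();
        // Phase 1: connect to the queues of all lower-ranked nodes.
        for (int peer = 0; peer < rank; peer++) {
            steps.add("connect " + peer);
        }
        // Phase 2: create local queues for every other node,
        // starting with the next rank and wrapping around (assumed order).
        for (int i = 1; i < size; i++) {
            steps.add("create " + ((rank + i) % size));
        }
        // Phase 3: connect to the remaining (higher-ranked) nodes.
        for (int peer = rank + 1; peer < size; peer++) {
            steps.add("connect " + peer);
        }
        return steps;
    }

    public static void main(String[] args) {
        for (int rank = 0; rank < 4; rank++) {
            System.out.println("rank " + rank + ": " + plan(rank, 4));
        }
    }
}
```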
Implementation Issues
• Transport protocols:
–3 native protocols:
• Inline: 1-113 B
• Short: 114 B-64 KB
• Long: 64 KB-1 MB
–scidev fragments messages larger than 1 MB and uses:
• Inline for control messages and small messages (< 113 B)
• Short with PIO (Programmed Input-Output) for messages < 8 KB
• Short with DMA (Direct Memory Access) for messages of 8-64 KB
• Its own Long protocol with DMA transfers, because the Long protocol of the user-level library does not use DMA
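The size thresholds above amount to a simple selection rule. The following is a minimal sketch, assuming sizes are in bytes and that each boundary value belongs to the smaller protocol except at 8 KB; the slide is not precise about the boundaries. `ProtocolChooser`, `choose`, and `fragmentCount` are hypothetical names, not part of scidev.

```java
final class ProtocolChooser {
    enum Protocol { INLINE, SHORT_PIO, SHORT_DMA, LONG_DMA }

    static final int INLINE_MAX = 113;           // bytes
    static final int PIO_LIMIT = 8 * 1024;       // Short uses PIO below this size
    static final int SHORT_MAX = 64 * 1024;      // Short uses DMA up to this size
    static final int FRAGMENT_MAX = 1024 * 1024; // larger messages are fragmented

    /** Protocol for one fragment of at most FRAGMENT_MAX bytes. */
    static Protocol choose(int bytes) {
        if (bytes < 0 || bytes > FRAGMENT_MAX) {
            throw new IllegalArgumentException("fragment size out of range: " + bytes);
        }
        if (bytes <= INLINE_MAX) return Protocol.INLINE;
        if (bytes < PIO_LIMIT) return Protocol.SHORT_PIO;
        if (bytes <= SHORT_MAX) return Protocol.SHORT_DMA;
        return Protocol.LONG_DMA;
    }

    /** Number of fragments needed for a message of the given size. */
    static int fragmentCount(long bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException("negative size: " + bytes);
        }
        // Ceiling division; an empty message still needs one (control) fragment.
        return (int) Math.max(1L, (bytes + FRAGMENT_MAX - 1) / FRAGMENT_MAX);
    }

    public static void main(String[] args) {
        int[] sizes = {64, 4096, 32768, 500000};
        for (int size : sizes) {
            System.out.println(size + " bytes -> " + choose(size));
        }
        System.out.println("fragments for 5 MB: " + fragmentCount(5L * 1024 * 1024));
    }
}
```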
Implementation Issues
• Communications:
–scidev is based on non-blocking communications
–It is coded using niodev as a template
–Asynchronous sends for message sizes > 1 MB
–Notification strategy:
• Following the approach of SCI SOCKET, using the mbox interrupt library
• Created without transferring the references (SCI interrupt handlers)
• Each interrupt (both user_interruptions and dma_interruptions) registers a callback method
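The callback-per-interrupt idea can be illustrated in plain Java. This is only a sketch of the registration pattern: the real notification path goes through the native mbox library, and `InterruptCallbacks`, `Kind`, `register`, and `raise` are hypothetical names introduced here.

```java
import java.util.EnumMap;
import java.util.Map;

final class InterruptCallbacks {
    // The two interrupt kinds named on the slide.
    enum Kind { USER, DMA }

    private final Map<Kind, Runnable> handlers = new EnumMap<>(Kind.class);

    /** Associates a callback with an interrupt kind, replacing any previous one. */
    synchronized void register(Kind kind, Runnable callback) {
        handlers.put(kind, callback);
    }

    /** Runs the callback for this kind; returns false if none is registered. */
    boolean raise(Kind kind) {
        Runnable callback;
        synchronized (this) {
            callback = handlers.get(kind);
        }
        if (callback == null) {
            return false;
        }
        callback.run(); // executed outside the lock so callbacks may re-register
        return true;
    }
}
```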
Implementation Issues
• Sending/Receiving:
–2 threads: user and selector thread, synchronized to reduce latency
–1 message queue in which the control messages of pending communications are kept
–Sending directly from the “Buffer” direct ByteBuffer
–If the selector thread receives a message that has not been posted, it creates an intermediate buffer for temporary storage
–If the message has been posted, it copies the message directly to the “Buffer” direct ByteBuffer
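The posted/not-posted distinction described above can be sketched with two maps. This is a simplified illustration under stated assumptions: at most one message per (source, tag, context) key, and `ReceiveMatcher`, `post`, and `arrive` are hypothetical names rather than scidev code. Only standard `java.nio.ByteBuffer` calls are used.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

final class ReceiveMatcher {
    // User buffers of receives that were posted before the data arrived.
    private final Map<String, ByteBuffer> posted = new HashMap<>();
    // Intermediate copies of messages that arrived before a receive was posted.
    private final Map<String, ByteBuffer> unexpected = new HashMap<>();

    private static String key(int src, int tag, int context) {
        return src + ":" + tag + ":" + context;
    }

    /** User thread: posts a receive. Returns true if the data was already waiting. */
    synchronized boolean post(int src, int tag, int context, ByteBuffer userBuf) {
        ByteBuffer early = unexpected.remove(key(src, tag, context));
        if (early != null) {
            userBuf.put(early); // second copy: intermediate buffer to user buffer
            return true;
        }
        posted.put(key(src, tag, context), userBuf);
        return false;
    }

    /** Selector thread: a message arrived. Returns true if it was delivered directly. */
    synchronized boolean arrive(int src, int tag, int context, ByteBuffer incoming) {
        ByteBuffer userBuf = posted.remove(key(src, tag, context));
        if (userBuf != null) {
            userBuf.put(incoming); // single copy straight into the user's buffer
            return true;
        }
        // Not posted yet: keep the data in an intermediate direct buffer.
        ByteBuffer copy = ByteBuffer.allocateDirect(incoming.remaining());
        copy.put(incoming);
        copy.flip();
        unexpected.put(key(src, tag, context), copy);
        return false;
    }
}
```

The direct path saves one copy, which is why posting receives early helps latency in this kind of design.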
Implementation Issues
This schema is used for each pair of nodes
Figure labels: user thread, selector thread, SBUFFER, RBUFFER, ULL, LONG, SHORT, Inline, Queue, Intermediate, SCI
Benchmarking
• JDK 1.5 on holly. Latency (us).
         MPJE   mpiJava   C sockets   Java S.
SCI        51        12           5        11
FE        161       145          83       109
GbE       131       101          65        86
• scidev latency is 33us!
Benchmarking
• JDK 1.5 on holly. Asymptotic Bandwidths (Mbps).
         MPJE   mpiJava   C sockets   Java S.
SCI      1200      1480         400       360
FE         90        92          93        92
GbE       680       587         900      600*
• scidev throughput is 1280 Mbps!
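To relate the latency and bandwidth figures, a simple linear cost model can be applied: time = latency + size / bandwidth. This model is an assumption brought in for illustration (it is not stated in the slides), as is reading 1 Mbps as 10^6 bits per second. The class and method names are hypothetical. With the reported scidev figures (33 us, 1280 Mbps), the message size at which half of the asymptotic bandwidth is reached comes out at 5280 bytes.

```java
final class LinearCostModel {
    /** Estimated one-way transfer time in microseconds. */
    static double transferMicros(long bytes, double latencyUs, double bandwidthMbps) {
        // 1 Mbps equals 1 bit per microsecond, so bits / Mbps gives microseconds.
        return latencyUs + (bytes * 8.0) / bandwidthMbps;
    }

    /** Message size (bytes) at which half the asymptotic bandwidth is achieved. */
    static double halfPerformanceBytes(double latencyUs, double bandwidthMbps) {
        return latencyUs * bandwidthMbps / 8.0;
    }

    /** Effective bandwidth in Mbps for a message of the given size. */
    static double effectiveMbps(long bytes, double latencyUs, double bandwidthMbps) {
        return (bytes * 8.0) / transferMicros(bytes, latencyUs, bandwidthMbps);
    }

    public static void main(String[] args) {
        double latencyUs = 33.0;        // scidev latency from the slide
        double bandwidthMbps = 1280.0;  // scidev throughput from the slide
        System.out.println("half-performance size (bytes): "
                + halfPerformanceBytes(latencyUs, bandwidthMbps));
        System.out.println("effective Mbps at 64 KB: "
                + effectiveMbps(64 * 1024, latencyUs, bandwidthMbps));
    }
}
```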
Future work
• Immediately:
–Testing of collective communications (so far only point-to-point has been tested)
• A design with lower interdependence
between xdev and mpjbuf
• Get information from different formats of
configuration files in SCI
• Benchmarking with MPJ applications and
developing MPJ and xdev applications.
• New buffering implementation
Future work
Buffering system with SBUFFER and RBUFFER in the ULL (still intermediate)
Figure labels: SBUFFER, RBUFFER, ULL, LONG, SHORT, Inline, Queue, Intermediate, SCI
Conclusions
• Performance is still a problem
–Try to avoid the control message, perhaps by integrating its data into the ul library
–Aim: latency 30us & Bw 1350 Mbps
• Current development phase: testing
–Hard to do multiple initializations in a single thread (restart the device)
• The design is somewhat coupled with MPJ (strong interdependence)
• Needs evaluation and implementation using a kernel-level library (threads and spawning processes natively)
Questions
?
Appendix
• Visitor at the DSG during summer 2005
–Pursuing PhD at Univ. of A Coruña (Spain)
Appendix
• BS in Computing Tech. in 2002 at A Coruña
Univ.
• Member of the Computer Architecture
Group.
–Areas of interest of the group:
– High Performance compilers (automatic detection of
parallelism)
– Cluster computing
– Grid applications
– Management of Parallel/Distributed systems
– Fault tolerance in MPI
– Computer graphics (rendering, radiosity)
– Geographical Information Systems
–12 staff members, 8 PhD students
Appendix
• Computer Architecture Group.
–CrossGrid (EU project within GridStart)
Appendix
• The Computer Architecture Group is young, with an average age of 32 years
• Some achievements (2000-2004):
–Papers in international conferences: 102
–Papers in Journals: 53 (41 in JCR/SCI list)
–Regional, national, and European funded projects (+/- 1M € in 5 years)
Acknowledgements
• DSG for providing full support for my work
–Especially Aamir and Raz for late, smoky and caffeinated DSG office hours
–Mark for hosting the visit and his valuable support
• ICG and UoP for the facilities and services
• Bryan Carpenter for his rare but valuable comments, and his help with some JNI problems
• DXIDI – Xunta de Galicia, for funding the
visit
A Coruña
• You will always be welcome in A Coruña!