EFFICIENT PARALLEL PROCESSING,
PROGRAM DEVELOPMENT AND
COMMUNICATION
IN LOW-COST HIGH PERFORMANCE
PLATFORMS
Anguita, M.; Cañas, A.; Díaz, A.F.; Fernández, F.J.; Ortega, J.; Prieto, A.
Department of Computer
Architecture and Technology
University of Granada
(Spain)
Workshop on Distributed Processing, Transfer, Retrieval, Fusion and Display of Images
and Signals: High Resolution and Low Resolution in Data and Information Grids
(21-22 February, Granada, Spain)
Outline of the talk
1. Introduction: Grid computing
2. Communication performance improvement
1. CLIC on Fast Ethernet
2. CLIC on Gigabit Ethernet
3. Grid- and cluster-aware program
development environments
1. PVMTB and MPITB performance
2. Application examples: wavelets, pH control, nanoelectronics
Efficient parallel processing, program development and communication in low-cost high performance platforms
Workshop on Distributed Processing, Transfer, Retrieval, Fusion and Display of Images.. (21-22 February 2003, Granada, Spain)
Introduction: Grid Computing
As the available bandwidths of the networks increase, the location of the
computing power becomes less relevant
It would be possible to use networks of computers as a single computing
resource for large-scale applications
The goal (from the platform research point of view):
To provide transparent access to the available computing resources (including supercomputers, storage systems, ...) and to other geographically distributed devices and scientific instruments via a networked environment
Introduction: our goals in this context
Efficient exploitation of parallelism (at different
levels) in low cost platforms based on clusters of
computers:
• Improvement of communication bandwidths
available to applications
• High-level programming environments for parallel
program development
Outline of the talk
1. Introduction: Grid computing
2. Communication performance improvement
1. CLIC on Fast Ethernet
2. CLIC on Gigabit Ethernet
3. Grid- and cluster-aware program
development environments
1. Performance
2. Application examples: wavelets, pH control, nanoelectronics
Improving communication in Clusters (II)
CLIC (Communication in Linux Clusters) protocol
• Reliable Transport system suited for Cluster Computing
• Developed on Linux (kernel module)
• Optimizes OS support for communication:
(scheduler, NIC drivers, kernel functions)
• Upper layer systems (PVM, MPI,…) can be
efficiently used on top of CLIC
CLIC improves the performance of the
communications so that user-level applications
can take advantage of network features
(better latency & bandwidth, Broadcast, Channel Bonding).
Improving communication in Clusters (III)
CLIC avoids the TCP/IP stack
[Figure: LAM-MPI/TCP and LAM-MPI/CLIC protocol stacks, from the user processes down to the network interface (NIC), with the user/kernel boundary marked. Labels on the TCP side: upper MPI layer, RPI functions, lower MPI layer (sockets), TCP, IP, driver. Labels on the CLIC side: upper MPI layer, RPI functions, CLIC, driver.]
LAM-MPI has been efficiently implemented on CLIC
Improving communication in Clusters (V)
[Figure: speedup with respect to PVM-TCP (0 to 4) versus message size (10 to 100000 bytes) for CLIC, LAM-MPI/CLIC and LAM-MPI/TCP.]
CLIC has lower software overhead
TCP/IP works while packets are crossing the network
• High improvement w.r.t. MPI/TCP and PVM/TCP
• MPI/CLIC provides a performance similar to CLIC
Network technology trends: Gigabit networks
Datacenters and networks are moving towards 1-10 Gigabit technologies
[Figure: clusters, servers and hard disks connected through Ethernet switches over 10 Gigabit/s links, together with a Fibre Channel switch and array and an InfiniBand switch and array.]
CLIC on Gigabit Ethernet (I)
Techniques implemented on CLIC to take advantage of gigabit network bandwidths:
• Jumbo frames: use MTUs longer (up to MTU=9000 bytes) than the Ethernet standard (MTU=1500 bytes).
  Reduces the number of interrupts and the overhead related to communication protocol processing.
• Coalesced interrupts: the NIC only interrupts the processor after a given time interval or a given number of packet arrivals.
  Reduces the number of generated interrupts (at the cost of a delay in reception).
• 0-copy: data to be sent are copied directly from user memory space to the NIC (to receive data, only one copy is needed).
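The effect of the first two techniques on the interrupt count can be estimated with a minimal sketch. It assumes the whole MTU carries payload (headers ignored) and models only count-based coalescing, not the time interval; the function names, the 1 MB message and the group size of 8 frames are hypothetical, not taken from CLIC.

```python
import math

def frames_needed(message_bytes: int, mtu: int) -> int:
    """Frames needed for a message, assuming the whole MTU is payload."""
    if message_bytes <= 0:
        return 1  # even an empty message occupies one frame
    return math.ceil(message_bytes / mtu)

def interrupts_needed(frames: int, frames_per_interrupt: int = 1) -> int:
    """Receive interrupts if the NIC raises one per group of frames
    (count-based coalescing only; the timer is not modelled)."""
    return math.ceil(frames / frames_per_interrupt)

message = 1_000_000  # bytes, hypothetical message size
for mtu in (1500, 9000):
    frames = frames_needed(message, mtu)
    print(f"MTU={mtu}: {frames} frames, "
          f"{interrupts_needed(frames)} interrupts without coalescing, "
          f"{interrupts_needed(frames, 8)} with 8 frames per interrupt")
```

Under these assumptions, moving from MTU=1500 to MTU=9000 cuts the frame count for the same message by roughly a factor of six, and coalescing divides the remaining interrupts by the group size.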
CLIC on Gigabit Ethernet (II)
[Figure: data path from user memory through the O.S. (driver) to the NIC in three cases.]
• TCP (2 copies): data to be sent go from user memory to system memory, then another copy is made to build the packets (2 copies), and then they go to the NIC.
• CLIC (Fast Ethernet) (1-copy): data to be sent go from user memory to system memory (1 copy) in order to build the packets, and then to the NIC.
• CLIC (Gigabit Ethernet) (0-copy): CLIC takes advantage of the new drivers for Gigabit network cards: data to be sent can go directly from user memory to the NIC.
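The difference between building packets with a copy and without one can be illustrated at user level. This is only an analogy in Python, not the CLIC kernel code: `send_with_copy` and `send_zero_copy` are hypothetical names, and `memoryview` stands in for handing the NIC a reference to user memory.

```python
def send_with_copy(user_buffer: bytearray, mtu: int) -> list:
    """Build packets by copying each slice out of the user buffer (1-copy style)."""
    return [user_buffer[i:i + mtu] for i in range(0, len(user_buffer), mtu)]

def send_zero_copy(user_buffer: bytearray, mtu: int) -> list:
    """Build packets as views onto the user buffer; no payload bytes are copied."""
    view = memoryview(user_buffer)
    return [view[i:i + mtu] for i in range(0, len(user_buffer), mtu)]

data = bytearray(b"a" * 4000)
copied = send_with_copy(data, 1500)
views = send_zero_copy(data, 1500)

# Changing the user buffer afterwards shows which packets share its memory.
data[0] = ord("z")
print(chr(copied[0][0]), chr(views[0][0]))
```

The copied packet still holds the old byte, while the view reflects the change, which is also why a zero-copy sender must not reuse the buffer until the data have left the host.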
CLIC on Gigabit Ethernet (III)
Latency = 36 μs (messages of 0 bytes)
50% of maximum bandwidth: 4 KBytes
[Figure: bandwidth (Mbps, 0 to 750) versus message size (bytes, 1E+01 to 1E+07) for four configurations: 0-copy MTU 9000, 1-copy MTU 9000, 0-copy MTU 1500, 1-copy MTU 1500.]
Comparison of bandwidths provided by CLIC on Gigabit Ethernet
with 0-copy/1-copy and MTU=9000/1500:
Using MTU=9000 bytes has more impact than using 0-copy
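Latency and the message size at which half of the maximum bandwidth is reached are commonly related through a simple linear cost model, T(n) = t0 + 8n/B. The sketch below assumes that model, which is not stated in the talk, and uses hypothetical values for t0 and B rather than the measured ones.

```python
def transfer_time(n_bytes: float, latency_s: float, peak_bps: float) -> float:
    """Time to send n_bytes under the linear model T(n) = t0 + 8n / B."""
    return latency_s + 8.0 * n_bytes / peak_bps

def effective_bandwidth(n_bytes: float, latency_s: float, peak_bps: float) -> float:
    """Bits per second actually achieved for a message of n_bytes."""
    return 8.0 * n_bytes / transfer_time(n_bytes, latency_s, peak_bps)

def half_bandwidth_size(latency_s: float, peak_bps: float) -> float:
    """Message size in bytes at which effective bandwidth is half the peak."""
    return latency_s * peak_bps / 8.0

t0 = 50e-6     # seconds, hypothetical start-up latency
peak = 600e6   # bits per second, hypothetical asymptotic bandwidth
n_half = half_bandwidth_size(t0, peak)
print(n_half, effective_bandwidth(n_half, t0, peak) / peak)
```

In this model a lower start-up latency moves the half-bandwidth point towards smaller messages, so short messages benefit most from reduced software overhead.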
Further Improvements: Intelligent NIC
The emergence of fast, cheap embedded processors allows the use of Intelligent Network Interface Cards (INICs), which include one or more processors, to assist communication by offloading protocol processing: the entire communication protocol is configured and moved to the INIC
Consequences:
• The load on the CPU (from the communication process) is reduced.
  Applications can take advantage of overlapping communication and computation.
• The card becomes protocol-aware and can interact with the network without CPU intervention.
  The overall protocol latency is reduced, as short messages do not have to cross the peripheral (PCI) bus and the CPU does not have to service an interrupt and perform a context switch for each one.
• The INIC can transfer data more efficiently to/from the CPU (small messages can be reassembled in the INIC and block-transferred to main memory rather than sent as a sequence of short DMA transfers).
• Implementing the communication protocols in the INIC helps reduce the effect of the I/O (PCI) bus bottleneck.
Outline of the talk
1. Introduction: Grid computing
2. Communication performance improvement
1. CLIC on Fast Ethernet
2. CLIC on Gigabit Ethernet
3. Grid- and cluster-aware program
development environments
1. PVMTB and MPITB performance
2. Application examples: wavelets, pH control, nanoelectronics
MATLAB interpreted environment
• M-files (M-file → P-code)
– interpreted
– fast-prototyping
– save & run
– interactive trial-and-error
– integrated debugger
• MEX-files (C source → MEX)
– compiled
– lower-level
– intermediate compile/link step
– involved configuration
– normal debugger, involved breakpoints
– suited to: computing-intensive code, access to libraries, data export/import, hardware control
Parallel Toolboxes
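• Toolbox of MEX files: each PVM/MPI routine has its own MEX file
[Figure: software layers. A MATLAB application runs on MATLAB with PVMTB/MPITB, which sit on top of PVM/MPI next to native PVM and MPI applications, above the operating system and the network.]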
PVMTB: 93 cmds (interfaces 86 PVM calls)
MPITB: 153 cmds (interfaces 135 MPI calls)
• Have been used for:
– signal processing: wavelet transform (UGR, Spain)
– automatic control: real-time pH control (UNED, Spain)
– chemical engineering: chemical manufacturing simulation (Carnegie Mellon, USA)
– nanoelectronics: nanoscale device simulation (CELAB, Purdue, USA)
Performance results
“Performance of Parallel MATLAB Toolboxes”, VecPar’02, Porto, Portugal
[Figure annotations: overhead 20% @ 1500 B; latency 1.8x]
Application: pH control
S. Dormido (UNED, Spain) ASCC’02 Suntec, Singapore
“Dynamic Programming on clusters for solving Control problems”
[Figure: pH control loop. Labels: acid, base, cA, cB, q, u, V, pH meter (PHm), flow meter (Flm), predictive controller, cluster of PCs.]
Cluster “smaug”:
16 Athlon K7 500 MHz, 128 MB, 7 GB HD
server with 20 GB HD, 2 NICs, KVM
pH control: results
[Figure, two panels, both against the number of slave processors M (1 to 15). One panel: parallel time tp (seconds, 0 to 750), broken down into start-up, stage 1 and stage 2. Other panel: speedup factor S(M) (0 to 15) compared with linear speedup. Series: PD_SIN/E/ME with constraints and PD_SIN/E/ME without constraints.]
Application: nanoelectronics
S. Goasguen (Purdue, USA) IEEE-NANO’02 Arlington-VA, USA
“Parallelization of nanoMOS2.0 using a 100-nodes Linux cluster”
nanoMOS: cluster “superman”
nanoMOS: results
efficiency = 98.0%
e ~ 95.3%
e ~ 88.8%
e ~ 84.2%
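The speedup and efficiency figures quoted for the applications follow the usual definitions, S(M) = T1/TM and e = S(M)/M. A minimal sketch, with hypothetical timings that are not taken from the pH control or nanoMOS measurements:

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """S(M) = T1 / TM: how many times faster the parallel run is."""
    return t_serial / t_parallel

def efficiency(t_serial: float, t_parallel: float, processors: int) -> float:
    """e = S(M) / M: fraction of the ideal (linear) speedup achieved."""
    return speedup(t_serial, t_parallel) / processors

# Hypothetical timings, for illustration only.
t1 = 720.0   # seconds on one processor
tp = 60.0    # seconds on M processors
m = 15       # number of processors

print(f"S({m}) = {speedup(t1, tp):.1f}, e = {efficiency(t1, tp, m):.1%}")
```

Linear speedup corresponds to e = 100%; values below that reflect communication and other parallel overheads.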
Conclusions
Lightweight protocol CLIC:
• Portable to any Linux machine, benefits from OS resources/Gbps driver features
• Eases tracking of technology advances: it is not NIC/CPU dependent
• Exposes latency/bandwidth improvements at application level (MPI-PVM/CLIC)
• Foreseen: INICs with embedded processors to offload protocol load from the CPU and avoid the I/O bus bottleneck (PCI)
Parallel Toolboxes PVMTB-MPITB:
• Fast learning / fast prototyping of parallel applications on clusters
• Useful for research: small overhead, acceptable efficiency even with 120 CPUs
• Foreseen: efficiency improvement by compiling application M-files (MATLAB compiler) and by linking them against MPI/CLIC and PVM/CLIC
Grids are made of clusters / other resources. Inside them, we want/need:
• Communication performance, promptly tracking technology advances (Gbps, INICs)
• Parallel application development environments, benefiting from those improvements