High-Performance Networking (HPN) Group*
* The HPN Group was formerly known as the SAN group.

Distributed Shared-Memory Parallel Computing with UPC on SAN-based Clusters
Appendix for Q3 Status Report
DOD Project MDA904-03-R-0507
February 5, 2004
Outline
Objectives and Motivations
Background
Related Research
Approach
Results
Conclusions and Future Plans
Objectives and Motivations
Objectives
  Support advancements for HPC with Unified Parallel C (UPC) on cluster systems exploiting high-throughput, low-latency system-area networks (SANs) and LANs
  Design and analysis of tools to support UPC on SAN-based systems
  Benchmarking and case studies with key UPC applications
  Analysis of tradeoffs in application, network, service, and system design
Motivations
  Increasing demand in the sponsor and scientific computing communities for shared-memory parallel computing with UPC
  New and emerging technologies in system-area networking and cluster computing
    Scalable Coherent Interface (SCI)
    Myrinet (GM)
    InfiniBand
    QsNet (Quadrics Elan)
    Gigabit Ethernet and 10 Gigabit Ethernet
    PCI Express (3GIO)
  Clusters offer excellent cost-performance potential
[Figure: UPC running over intermediate layers atop a network layer of SCI, Myrinet, InfiniBand, QsNet, 1/10 Gigabit Ethernet, and PCI Express]
Background
Key sponsor applications and developments toward shared-memory parallel computing with UPC
  More details from sponsor are requested
UPC
  UPC extends the C language to exploit parallelism
  Currently runs best on shared-memory multiprocessors (notably HP/Compaq’s UPC compiler)
  First-generation UPC runtime systems becoming available for clusters (MuPC, Berkeley UPC)
Significant potential advantage in cost-performance ratio with COTS-based cluster configurations
  Leverage economy of scale
  Clusters exhibit low cost relative to tightly-coupled SMP, CC-NUMA, and MPP systems
  Scalable performance with commercial off-the-shelf (COTS) technologies
[Figure: conceptual UPC cluster built from COTS nodes and switches]
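To make the UPC programming model above concrete, the sketch below is a minimal UPC kernel, not drawn from the sponsor applications: C extended with a shared array, the MYTHREAD/THREADS execution model, and the upc_forall work-sharing loop. Array names and sizes are illustrative, and the static shared declaration assumes compilation for a fixed thread count.

    /* Minimal UPC sketch: vector addition over shared arrays.
       Illustrative only; assumes a static THREADS compilation environment. */
    #include <upc_relaxed.h>
    #include <stdio.h>

    #define N 1024

    shared int a[N], b[N], c[N];   /* default block size 1: elements cycle across threads */

    int main(void) {
        int i;

        /* Each thread executes only the iterations whose element it owns. */
        upc_forall (i = 0; i < N; i++; &c[i])
            c[i] = a[i] + b[i];

        upc_barrier;               /* wait for all threads before reporting */

        if (MYTHREAD == 0)
            printf("summed %d elements on %d threads\n", N, THREADS);
        return 0;
    }

On a cluster, fine-grain accesses to shared elements owned by other threads are exactly what runtime systems such as MuPC and Berkeley UPC over GASNet turn into network traffic.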
Related Research
University of California at Berkeley
  UPC runtime system
  UPC-to-C translator
  Global-Address-Space Networking (GASNet) design and development
  Application benchmarks
George Washington University
  UPC specification
  UPC documentation
  UPC testing strategies, test suites
  UPC benchmarking
  UPC collective communications
  Parallel I/O
Michigan Tech University
  Michigan Tech UPC (MuPC) design and development
  UPC collective communications
  Memory model research
  Programmability studies
  Test suite development
Ohio State University
  UPC benchmarking
HP/Compaq
  UPC compiler
Intrepid
  GCC UPC compiler
Approach
Exploiting SAN strengths for UPC
  Design and develop a new SCI conduit for GASNet in collaboration with UCB/LBNL
  Evaluate DSM for SCI as an option for executing UPC
Benchmarking and performance analysis
  Use and design of applications in UPC to grasp key concepts and understand performance issues
  Network communication experiments
  UPC computing experiments
  Emphasis on SAN options and tradeoffs
Field test of newest compiler and system
  HP/Compaq UPC Compiler V2.1 running in the lab on the new ES80 AlphaServer (Marvel)
  Support of testing by OSU, MTU, UCB/LBNL, UF, et al. with leading UPC tools and systems for function and performance evaluation
[Figure: collaboration diagram. Upper layers: Michigan Tech (benchmarks, modeling, specification), UC Berkeley (benchmarks, UPC-to-C translator, specification), GWU (benchmarks, documents, specification), Ohio State (benchmarks), UF HCS Lab (applications, translators, documentation). Middle layers (runtime systems, interfaces): Michigan Tech (UPC-to-MPI translation and runtime system), UC Berkeley (C runtime system, upper levels of GASNet), HP (UPC runtime system on AlphaServer), with UF collaboration on GASNet and beta testing. Lower layers (API, networks): UC Berkeley (GASNet) over SCI, Myrinet, InfiniBand, Quadrics, GigE, 10GigE, etc., with UF collaboration on GASNet and network performance analysis.]
GASNet - Experimental Setup & Analysis
Experimental Setup
  Testbed
    Elan, MPI, and SCI conduits: dual 2.4 GHz Intel Xeon, 1 GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
      Dolphin SCI D337 (2D/3D) NICs using PCI 64/66, 4x2 torus; specs: 667 MB/s (300 MB/s sustained)
      Elan3 using PCI-X in two nodes with QM-S16 16-port switch; specs: 528 MB/s (340 MB/s sustained)
      RedHat 9.0 with gcc compiler V3.3.2
    GM (Myrinet) conduit (c/o access to cluster at MTU*): dual 2.0 GHz Intel Xeon, 2 GB DDR PC2100 (DDR266) RAM
      Myrinet 2000 using PCI-X on 8 nodes connected with a 16-port M3F-SW16 switch; specs: 250 MB/s
      RedHat 7.3 with Intel C compiler V7.1
  GASNet conduit experiments
    Elan and GM conduits executed with the extended API implemented
    SCI and MPI conduits executed with the reference extended API (based on AM in the core API)
    Berkeley GASNet test suite; average of 1000 iterations
    Each test uses bulk transfers to take advantage of implemented extended APIs
    Latency results use testsmall; throughput results use testlarge
Experimental Results
  Throughput
    Elan shows best performance, with approx. 300 MB/s in both put and get operations
    Myrinet and SCI very close, with 200 MB/s on put operations
    Myrinet obtains nearly the same performance with get operations
    SCI suffers from the reference extended API in get operations (approx. 7 MB/s) due to greatly increased latency
      get operations will benefit the most from an extended API implementation; currently being addressed in UF’s design of the extended API for SCI
    MPI suffers from high latency but still performs well on GigE, with almost 50 MB/s
  Latency
    Elan again performs best: put/get ~8 µs
    Myrinet: put ~20 µs, get ~33 µs
    SCI: both put and get ~25 µs, better than Myrinet get for small messages; larger messages suffer from the AM RPC protocol
    MPI latency too high to show (~250 µs)
  Elan is the best performer in low-level API tests
* Testbed made available by Michigan Tech
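For reference, the testsmall-style latency numbers above boil down to timing a blocking extended-API put in a tight loop. The sketch below follows that pattern using standard GASNet-1 calls; the iteration count, transfer size, and two-node assumption are illustrative rather than the exact Berkeley test parameters.

    /* Sketch of a testsmall-style put-latency loop over GASNet's extended API.
       Assumes exactly two nodes; constants are illustrative. */
    #include <gasnet.h>
    #include <gasnet_tools.h>
    #include <stdio.h>

    #define ITERS  1000
    #define NBYTES 8

    int main(int argc, char **argv) {
        gasnet_seginfo_t seg[2];
        gasnet_node_t peer;
        char src[NBYTES];
        int i;

        gasnet_init(&argc, &argv);
        gasnet_attach(NULL, 0, GASNET_PAGESIZE, GASNET_PAGESIZE);  /* no AM handlers needed here */
        gasnet_getSegmentInfo(seg, 2);
        peer = (gasnet_mynode() + 1) % gasnet_nodes();

        if (gasnet_mynode() == 0) {
            gasnett_tick_t start = gasnett_ticks_now();
            for (i = 0; i < ITERS; i++)
                gasnet_put(peer, seg[peer].addr, src, NBYTES);  /* blocking put into peer's segment */
            gasnett_tick_t stop = gasnett_ticks_now();
            printf("avg put latency: %.2f us\n",
                   gasnett_ticks_to_us(stop - start) / (double)ITERS);
        }

        gasnet_barrier_notify(0, GASNET_BARRIERFLAG_ANONYMOUS);
        gasnet_barrier_wait(0, GASNET_BARRIERFLAG_ANONYMOUS);
        gasnet_exit(0);
        return 0;
    }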
GASNet Throughput on Conduits
[Figure: GASNet throughput (MB/s) vs. message size (16 bytes to 64 KB) for the Elan, SCI, Myrinet, and MPI/GigE conduits, put and get operations]
Note: for get operations, the conduit must wait for the RPC to be executed before data can be pushed back.
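The note above is the crux of the reference-API get results: a get built from core-API Active Messages pays a request/reply round trip plus handler execution before any data moves. Below is a heavily simplified sketch of that path; the handler indices, names, and request structure are hypothetical, not the Berkeley reference code, and handler registration through gasnet_attach is omitted.

    /* Why AM-based gets lag: each get is a full request/reply round trip.
       Hypothetical handler sketch; not the Berkeley reference implementation. */
    #include <gasnet.h>
    #include <string.h>

    #define GET_REQUEST_HANDLER 201     /* client handler indices (>= 128), arbitrary */
    #define GET_REPLY_HANDLER   202

    typedef struct { void *addr; size_t nbytes; } get_req_t;

    static volatile int get_done = 0;
    static char get_buffer[4096];

    /* Runs on the remote node: read the requested region and ship it back. */
    static void get_request_handler(gasnet_token_t token, void *buf, size_t nbytes) {
        get_req_t *req = (get_req_t *)buf;
        gasnet_AMReplyMedium0(token, GET_REPLY_HANDLER, req->addr, req->nbytes);
    }

    /* Runs back on the requester: copy the payload and signal completion. */
    static void get_reply_handler(gasnet_token_t token, void *buf, size_t nbytes) {
        memcpy(get_buffer, buf, nbytes);
        get_done = 1;
    }

    /* Requester side: one network round trip (plus handler time) per get.
       Assumes nbytes <= gasnet_AMMaxMedium(). */
    static void am_based_get(gasnet_node_t node, void *remote_addr, size_t nbytes) {
        get_req_t req = { remote_addr, nbytes };
        get_done = 0;
        gasnet_AMRequestMedium0(node, GET_REQUEST_HANDLER, &req, sizeof req);
        GASNET_BLOCKUNTIL(get_done);    /* poll the network until the reply arrives */
    }

A native extended-API get, by contrast, can let the NIC read remote memory directly, which is the gap UF’s SCI extended API is intended to close.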
GASNet Latency on Conduits
[Figure: GASNet latency (µs) vs. message size (1 to 1024 bytes) for the SCI, Elan, and Myrinet conduits, put and get operations]
Despite not yet having a native extended API, which would allow better hardware exploitation, the SCI conduit still manages to keep pace with the GM conduit for throughput and most small-message latencies. The Q1 report shows a target possibility of ~10 µs latencies.
Note: SCI results are based on the generic GASNet version of the extended API, which limits performance.
UPC Benchmarks – IS from NAS benchmarks*
Class A executed with Berkeley UPC runtime system V1.1, with gcc V3.3.2 for Elan and MPI and Intel V7.1 for GM
IS (Integer Sort): lots of fine-grain communication, low computation
  Communication layer should have the greatest effect on performance
  Single thread shows performance without use of the communication layer
Poor performance in the GASNet communication system does not necessarily indicate poor performance in a UPC application
  MPI results are poor for GASNet but decent for UPC applications
  Application may need to be larger to confirm this assertion
GM conduit shows the greatest gain from parallelization (could be partly due to a better compiler)
[Figure: IS execution time (sec) for 1, 2, 4, and 8 threads on the GM, Elan, and MPI conduits]
Notes: only two nodes were available with Elan, so scalability cannot be determined at this point; for MPI, TCP/IP overhead outweighs the benefit of parallelization.
* Code developed at GWU
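The reason IS leans so hard on the conduit is visible in its access pattern: in UPC, every read of a shared element with affinity to another thread becomes a small GASNet get, whereas bulk library calls map onto the extended API's large transfers. The hypothetical fragment below contrasts the two styles; array names and sizes are illustrative, not the GWU IS code, and the static shared declaration assumes a fixed thread count.

    /* Illustrative contrast: fine-grain vs. bulk access to shared data in UPC.
       Not the GWU IS benchmark code; assumes a static THREADS environment. */
    #include <upc_relaxed.h>

    #define KEYS 4096
    shared int keys[KEYS];      /* default block size 1: keys cycle across threads */

    /* Fine-grain: each keys[i] owned by another thread becomes a small network get,
       the traffic pattern that makes IS sensitive to conduit latency. */
    void count_fine_grain(int *bucket, int nbuckets) {
        int i;
        for (i = 0; i < KEYS; i++)
            bucket[keys[i] % nbuckets]++;
    }

    /* Bulk: fetch the n keys with affinity to one thread in a single large
       transfer (n must not exceed KEYS/THREADS), which the extended API can
       service at near-hardware bandwidth. */
    void fetch_bulk(int *dst, int thread, int n) {
        upc_memget(dst, &keys[thread], n * sizeof(int));
    }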
Network Performance Tests
Detailed understanding of high-performance cluster interconnects
  Identifies suitable networks for UPC over clusters
  Aids in smooth integration of interconnects with upper-layer UPC components
  Enables optimization of network communication, unicast and collective
Various levels of network performance analysis
  Low-level tests
    SCI based on Dolphin SISCI and SCALI SCI
    Myrinet based on Myricom GM
    InfiniBand based on Virtual Interface Provider Library (VIPL)
    QsNet based on Quadrics Elan Communication Library
    Host architecture issues (e.g., CPU, I/O, etc.)
  Mid-level tests
    Sockets: Dolphin SCI Sockets on SCI; BSD Sockets on Gigabit and 10 Gigabit Ethernet; GM Sockets on Myrinet; SOVIA on InfiniBand
    MPI: InfiniBand and Myrinet based on MPI/PRO; SCI based on ScaMPI and SCI-MPICH (a minimal ping-pong sketch follows this slide)
[Figure: intermediate layers over a network layer of SCI, Myrinet, InfiniBand, QsNet, 1/10 Gigabit Ethernet, and PCI Express]
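The MPI entries in the mid-level tests follow the standard ping-pong pattern: two ranks bounce a message back and forth, and one-way latency is half the averaged round-trip time. A minimal sketch, with illustrative message size and iteration count, is shown below.

    /* Minimal MPI ping-pong latency sketch (mid-level test pattern); runs over
       ScaMPI, SCI-MPICH, MPI/Pro, or any other MPI implementation. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, i, iters = 1000;
        char buf[8];                 /* small message for a latency measurement */
        double t0, t1;
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {
                MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)   /* one-way latency is half the round-trip time */
            printf("latency: %.2f us\n", (t1 - t0) / iters / 2.0 * 1e6);

        MPI_Finalize();
        return 0;
    }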
Network Performance Tests
Raw vs. Elan conduit
  Tests run on two Elan3 cards connected by a QM-S16 16-port switch
    Quadrics dping used for raw tests
    GASNet testsmall used for latency, testlarge for throughput
      Utilizes extended API
      Results obtained from put operations
  Latency: the Elan conduit for GASNet more than doubles hardware latency, but still maintains sub-10 µs for small messages
  Throughput: conduit throughput matches hardware
  The Elan conduit does not add appreciably to performance overhead
[Figure: raw vs. Elan-conduit latency (µs) and throughput (MB/s) vs. message size]
Low Level vs. GASNet Conduit
Tests run on two Myrinet 2000 cards connected by an M3F-SW16 switch
  Myricom gm_allsize used for raw tests
  GASNet testsmall used for latency, testlarge for throughput
    Utilizes extended API
    Results obtained from put operations
Latency: the GM conduit almost doubles the hardware latency, with latencies of ~19 µs for small messages
Throughput: conduit throughput follows the trend of the hardware but differs by an average of 60 MB/s for messages ≥ 1024 bytes
  Conduit peaks at 204 MB/s compared to 238 MB/s for hardware
The GM conduit adds a small amount of performance overhead
[Figure: GM raw vs. GM-conduit latency (µs) and throughput (MB/s) vs. message size]
Architectural Performance Tests
Pentium 4 Xeon
  Features
    32-bit processor
    Intel NetBurst microarchitecture
    Hyper-Threading technology for increased CPU utilization
    RISC processor core
    4.3 GB/s I/O bandwidth
  Future plans
    UPC and other parallel application benchmarks
Opteron
  Features
    64-bit processor
    Real-time support of 32-bit OS
    On-chip memory controllers
    Eliminates the 4 GB memory barrier imposed by 32-bit systems
    19.2 GB/s I/O bandwidth per processor
  Future plans
    UPC and other parallel application benchmarks
CPU Performance Results
DOD Seminumeric Benchmark #2
  Radix sort; measures setup time, sort time, and time to verify the sort
  Sorting is the dominant component of execution time
  Results analysis
    Opteron architecture outperforms Xeon in all tests performed, for all iterations
    Setup and verify times are around half those of the Xeon architecture
NAS Benchmarks: EP, FT, and IS
  Class A problem set size; computationally intensive
  Opteron and Xeon are comparable for floating-point operations (FT)
  For integer operations, Opteron performs better than Xeon (EP & IS)
[Figure: radix-sort time (sec) for Xeon and Opteron across parameter sets (bit size, radix size, bits in radix sort); execution time (sec) for the EP, FT, and IS benchmarks on Xeon and Opteron]
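For context on what the sort-time chart measures: an LSD radix sort processes the keys in passes of "bits in radix sort" bits each, and each pass is a histogram, a prefix sum, and a stable scatter. The sketch below is a generic version of that kernel, not the DOD benchmark code; the key width and radix width are illustrative.

    /* Generic LSD radix sort over 32-bit keys; not the DOD benchmark code.
       RADIX_BITS plays the role of the "bits in radix sort" parameter. */
    #include <stdlib.h>
    #include <string.h>

    #define RADIX_BITS 8
    #define BUCKETS    (1u << RADIX_BITS)

    void radix_sort(unsigned *keys, unsigned *tmp, size_t n) {
        size_t count[BUCKETS], i;
        int shift;

        for (shift = 0; shift < 32; shift += RADIX_BITS) {
            memset(count, 0, sizeof count);

            /* Histogram pass: count keys per bucket for this digit. */
            for (i = 0; i < n; i++)
                count[(keys[i] >> shift) & (BUCKETS - 1)]++;

            /* Prefix sum turns counts into bucket end offsets. */
            for (i = 1; i < BUCKETS; i++)
                count[i] += count[i - 1];

            /* Stable scatter, walking backwards to preserve order. */
            for (i = n; i-- > 0; )
                tmp[--count[(keys[i] >> shift) & (BUCKETS - 1)]] = keys[i];

            memcpy(keys, tmp, n * sizeof *keys);
        }
    }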
Memory Performance Results
Lmbench-3.0-a3
  Opteron latency/throughput worsen as expected at size 64 KB (L1 cache size) and 1 MB (L2 cache size)
  Xeon latency/throughput show the same trend for L1 (8 KB) but start earlier for L2 (256 KB instead of 512 KB); cause under investigation
  Between CPU / L1 / L2, Opteron outperforms Xeon, but Xeon outperforms Opteron when loading data from disk into main memory
  Write throughput for Xeon stays relatively constant for sizes below the L2 cache size, suggesting use of a write-through policy between L1 and L2
  Xeon read > Opteron write > Opteron read > Xeon write
[Figure: read latency (ns), read throughput (GB/s), and write throughput (GB/s) vs. data size for Xeon and Opteron]
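The lmbench read-latency curves come from a pointer-chasing measurement: each load depends on the previous one, so the average time per load tracks the latency of whichever level of the hierarchy the working set fits in. The sketch below reproduces that idea in simplified form; the stride, sweep range, and timing plumbing are illustrative rather than lmbench's own code.

    /* Simplified pointer-chase read-latency sweep in the style of lmbench's lat_mem_rd. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double chase_ns(size_t bytes, size_t stride, size_t accesses) {
        size_t n = bytes / sizeof(void *), i;
        void **buf = malloc(n * sizeof(void *));
        volatile void **p;
        clock_t t0, t1;

        /* Build a cyclic chain; an odd stride is coprime to the power-of-two
           element count, so the chain visits every element of the buffer. */
        for (i = 0; i < n; i++)
            buf[i] = (void *)&buf[(i + stride) % n];

        p = (volatile void **)&buf[0];
        t0 = clock();
        for (i = 0; i < accesses; i++)
            p = (volatile void **)*p;       /* each load depends on the previous one */
        t1 = clock();

        free(buf);
        return (double)(t1 - t0) / CLOCKS_PER_SEC / accesses * 1e9;  /* ns per load */
    }

    int main(void) {
        size_t kb;
        for (kb = 4; kb <= 32 * 1024; kb *= 2)   /* sweep 4 KB .. 32 MB working sets */
            printf("%6zu KB: %5.1f ns/load\n", kb, chase_ns(kb * 1024, 17, 10000000));
        return 0;
    }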
File I/O Results
Bonnie / Bonnie++
  10 iterations of writing and reading a 2 GB file using per-character functions and efficient block functions
  stdio overhead is large for the per-character functions
  Efficient block reads and writes greatly reduce CPU utilization
  Throughput results were directly proportional to CPU utilization
  Shows the same trend as observed in the memory performance testing
    Xeon read > Opteron write > Opteron read > Xeon write
    Suggests memory access and I/O access might use the same mechanism
AIM 9
  10 iterations using 5 MB files, testing sequential and random reads, writes, and copies
  Opteron consistently outperforms Xeon by a wide margin
  Large increase in performance for disk reads as compared to writes
    Xeon read speeds are very high for all results, with much lower write performance
    Opteron read speeds are also very high and greatly outperform the Xeon write performance in all cases
    Xeon sequential read is actually worse than Opteron, but still comparable
[Figure: sequential I/O average throughput (MB/s) and average CPU utilization (%) for per-character and per-block reads and writes; AIM9 disk/filesystem I/O throughput (MB/s) for sequential reads, sequential writes, random reads, random writes, and disk copies]
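The per-character vs. block gap above comes down to call count: the per-character tests pay a stdio call for every byte of the file, while the block tests move the same data in far fewer, larger calls. A condensed sketch of the two write paths, with illustrative file name, file size, and block size (and error handling omitted), follows.

    /* Per-character vs. per-block writes, in the spirit of Bonnie's sequential tests.
       File names and sizes are illustrative; error handling omitted. */
    #include <stdio.h>
    #include <stdlib.h>

    #define FILE_SIZE (64L * 1024 * 1024)   /* 64 MB here; the tests above used 2 GB */
    #define BLOCK     (64 * 1024)

    static void write_per_char(const char *path) {
        FILE *f = fopen(path, "wb");
        long i;
        for (i = 0; i < FILE_SIZE; i++)
            putc((int)(i & 0xff), f);       /* one library call per byte */
        fclose(f);
    }

    static void write_per_block(const char *path) {
        FILE *f = fopen(path, "wb");
        char *buf = calloc(1, BLOCK);
        long i;
        for (i = 0; i < FILE_SIZE; i += BLOCK)
            fwrite(buf, 1, BLOCK, f);       /* one library call per 64 KB block */
        fclose(f);
        free(buf);
    }

    int main(void) {
        write_per_char("perchar.dat");      /* high CPU utilization per MB written */
        write_per_block("perblock.dat");    /* far fewer calls for the same data */
        return 0;
    }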
Conclusions and Future Plans
Accomplishments to date
Leverage and extend communication and UPC layers
Conceptual design of new tools
Preliminary network and system performance analyses
Completed V1.0 of the GASNet Core API SCI conduit for UPC
Key insights
Inefficient communication system does not necessarily translate to poor UPC application performance
Xeon cluster suitable for applications with a high read/write ratio
Opteron cluster suitable for generic applications due to comparable read/write capability
Baselining of UPC on shared-memory multiprocessors
Evaluation of promising tools for UPC on clusters
Future Plans
Comprehensive performance analysis of new SANs and SAN-based clusters
Evaluation of UPC methods and tools on various architectures and systems
UPC benchmarking on cluster architectures, networks, and conduits
Continuing effort in stabilizing/optimizing GASNet SCI Conduit
Cost/Performance analysis for all options