
High-Performance Networking (HPN) Group*
Distributed Shared-Memory Parallel Computing with UPC on SAN-based Clusters

Appendix for Q3 Status Report
DOD Project MDA904-03-R-0507
February 5, 2004

* The HPN Group was formerly known as the SAN group.

Outline

- Objectives and Motivations
- Background
- Related Research
- Approach
- Results
- Conclusions and Future Plans

Objectives and Motivations

- Objectives
  - Support advancements for HPC with Unified Parallel C (UPC) on cluster systems exploiting high-throughput, low-latency system-area networks (SANs) and LANs
  - Design and analysis of tools to support UPC on SAN-based systems
  - Benchmarking and case studies with key UPC applications
  - Analysis of tradeoffs in application, network, service, and system design
- Motivations
  - Increasing demand in the sponsor and scientific computing community for shared-memory parallel computing with UPC
  - New and emerging technologies in system-area networking and cluster computing:
    - Scalable Coherent Interface (SCI)
    - Myrinet (GM)
    - InfiniBand
    - QsNet (Quadrics Elan)
    - Gigabit Ethernet and 10 Gigabit Ethernet
    - PCI Express (3GIO)
  - Clusters offer excellent cost-performance potential

[Figure: layered software stack - UPC, intermediate layers, and network layer over SCI, Myrinet, InfiniBand, QsNet, 1/10 Gigabit Ethernet, and PCI Express]

Background

- Key sponsor applications and developments toward shared-memory parallel computing with UPC
  - More details from sponsor are requested
- UPC
  - UPC extends the C language to exploit parallelism (see the sketch after this slide)
  - Currently runs best on shared-memory multiprocessors (notably HP/Compaq's UPC compiler)
  - First-generation UPC runtime systems becoming available for clusters (MuPC, Berkeley UPC)
- Significant potential advantage in cost-performance ratio with COTS-based cluster configurations
  - Leverage economy of scale
  - Clusters exhibit low cost relative to tightly coupled SMP, CC-NUMA, and MPP systems
  - Scalable performance with commercial off-the-shelf (COTS) technologies

[Figure: UPC on a COTS cluster built with 3Com switches]

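For readers new to UPC, the minimal sketch below (hypothetical example, not from the report) shows how UPC extends C: a shared array is distributed across threads, and upc_forall runs each iteration on the thread that owns the referenced element. It would be built with a UPC compiler such as Berkeley UPC or the HP/Compaq compiler mentioned above.

    /* vecadd.upc -- hypothetical sketch of UPC's extensions to C */
    #include <upc_relaxed.h>
    #include <stdio.h>

    #define N 1024

    shared int a[N], b[N], c[N];   /* elements distributed round-robin across THREADS */

    int main(void) {
        int i;
        /* Each thread executes only the iterations whose element c[i] it owns. */
        upc_forall (i = 0; i < N; i++; &c[i]) {
            a[i] = i;
            b[i] = 2 * i;
            c[i] = a[i] + b[i];
        }
        upc_barrier;               /* wait for all threads to finish */
        if (MYTHREAD == 0)
            printf("c[10] = %d on %d threads\n", (int)c[10], THREADS);
        return 0;
    }

On a shared-memory multiprocessor the shared accesses are ordinary loads and stores; on a cluster they are mapped by the runtime system onto network operations, which is why the conduit layers studied below matter.
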
Related Research

- University of California at Berkeley
  - UPC runtime system
  - UPC-to-C translator
  - Global-Address-Space Networking (GASNet) design and development
  - Application benchmarks
- George Washington University
  - UPC specification
  - UPC documentation
  - UPC testing strategies, test suites
  - UPC benchmarking
  - UPC collective communications
  - Parallel I/O
- Michigan Tech University
  - Michigan Tech UPC (MuPC) design and development
  - UPC collective communications
  - Memory model research
  - Programmability studies
  - Test suite development
- Ohio State University
  - UPC benchmarking
- HP/Compaq
  - UPC compiler
- Intrepid
  - GCC UPC compiler


Approach

- Benchmarking
  - HP/Compaq UPC Compiler V2.1 running in the lab on a new ES80 AlphaServer (Marvel)
  - Support of testing by OSU, MTU, UCB/LBNL, UF, et al. with leading UPC tools and systems for function and performance evaluation
  - Field test of the newest compiler and system
- Exploiting SAN strengths for UPC
  - Design and develop a new SCI conduit for GASNet in collaboration with UCB/LBNL
  - Evaluate DSM for SCI as an option for executing UPC
- Performance analysis
  - Use and design of applications in UPC to grasp key concepts and understand performance issues
  - Network communication experiments
  - UPC computing experiments
  - Emphasis on SAN options and tradeoffs

[Figure: collaboration diagram]
- Upper layers (applications, translators, documentation): Michigan Tech (benchmarks, modeling, specification); UC Berkeley (benchmarks, UPC-to-C translator, specification); GWU (benchmarks, documents, specification); Ohio State (benchmarks)
- Middle layers (runtime systems, interfaces): Michigan Tech (UPC-to-MPI translation and runtime system); UC Berkeley (C runtime system, upper levels of GASNet); HP (UPC runtime system on AlphaServer)
- Lower layers (APIs, networks): UC Berkeley (GASNet); UF HCS Lab (GASNet collaboration, beta testing, network performance analysis) over SCI, Myrinet, InfiniBand, Quadrics, GigE, 10GigE, etc.

GASNet - Experimental Setup and Analysis

- Testbed
  - Elan, MPI, and SCI conduits: dual 2.4 GHz Intel Xeon, 1 GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
    - SCI: 667 MB/s (300 MB/s sustained) Dolphin SCI D337 (2D/3D) NICs, using PCI 64/66, 4x2 torus
    - Elan: 528 MB/s (340 MB/s sustained) Elan3, using PCI-X in two nodes with a QM-S16 16-port switch
    - RedHat 9.0 with gcc compiler V3.3.2
  - GM (Myrinet) conduit (c/o access to cluster at MTU*): dual 2.0 GHz Intel Xeon, 2 GB DDR PC2100 (DDR266) RAM
    - 250 MB/s Myrinet 2000, using PCI-X, on 8 nodes connected with a 16-port M3F-SW16 switch
    - RedHat 7.3 with Intel C compiler V7.1
- Experimental setup
  - Elan and GM conduits executed with the extended API implemented
  - SCI and MPI executed with the reference extended API (based on AM in the core API)
  - GASNet conduit experiments use the Berkeley GASNet test suite
    - Average of 1000 iterations
    - Bulk transfers used to take advantage of the implemented extended APIs
    - Latency results use testsmall; throughput results use testlarge (measurement approach sketched below)
- Experimental results
  - Throughput
    - Elan shows the best performance, with approx. 300 MB/s in both put and get operations
    - Myrinet and SCI are very close, with 200 MB/s on put operations
    - Myrinet obtains nearly the same performance with get operations
    - SCI suffers from the reference extended API in get operations (approx. 7 MB/s) due to greatly increased latency; get operations will benefit the most from an extended-API implementation, which is currently being addressed in UF's design of the extended API for SCI
    - MPI suffers from high latency but still performs well on GigE, with almost 50 MB/s
  - Latency
    - Elan again performs best: put and get ~8 µs
    - Myrinet: put ~20 µs, get ~33 µs
    - SCI: put and get both ~25 µs, better than Myrinet get for small messages; larger messages suffer from the AM RPC protocol
    - MPI latency too high to show (~250 µs)
  - Elan is the best performer in low-level API tests

* Testbed made available by Michigan Tech

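For context, the sketch below shows roughly what a testsmall-style latency measurement does with GASNet's extended API: node 0 times a loop of blocking bulk puts into its peer's segment and reports the average. This is hypothetical illustration code, not the Berkeley test suite itself; it assumes a GASNet 1.x installation, a conduit job of at least two nodes running the same binary, and only standard calls (gasnet_init, gasnet_attach, gasnet_getSegmentInfo, gasnet_put_bulk).

    /* put_latency.c -- hypothetical GASNet extended-API latency sketch */
    #include <gasnet.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    #define ITERS 1000

    int main(int argc, char **argv) {
        gasnet_init(&argc, &argv);
        /* Attach with the largest segment this node can offer; no AM handlers. */
        gasnet_attach(NULL, 0, gasnet_getMaxLocalSegmentSize(), 0);

        gasnet_node_t me = gasnet_mynode();
        gasnet_node_t peer = (me + 1) % gasnet_nodes();
        gasnet_seginfo_t *segs = calloc(gasnet_nodes(), sizeof(gasnet_seginfo_t));
        gasnet_getSegmentInfo(segs, gasnet_nodes());

        size_t len = 8;                      /* small-message payload */
        void *src = segs[me].addr;           /* local segment */
        void *dst = segs[peer].addr;         /* peer's segment */

        gasnet_barrier_notify(0, GASNET_BARRIERFLAG_ANONYMOUS);
        gasnet_barrier_wait(0, GASNET_BARRIERFLAG_ANONYMOUS);

        if (me == 0) {
            struct timeval t0, t1;
            gettimeofday(&t0, NULL);
            for (int i = 0; i < ITERS; i++)
                gasnet_put_bulk(peer, dst, src, len);   /* blocking one-sided put */
            gettimeofday(&t1, NULL);
            double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
            printf("avg put latency: %.2f us for %zu-byte messages\n", us / ITERS, len);
        }
        free(segs);
        gasnet_exit(0);
        return 0;
    }

testlarge follows the same pattern with large payloads and reports MB/s instead of µs; gets substitute gasnet_get_bulk, which on the reference SCI implementation is where the AM round trip shows up.
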
[Figure: GASNet throughput on conduits (MB/s) vs. message size (16 bytes to 64 KB) for Elan, SCI, Myrinet, and MPI/GigE put and get operations. Annotation: for get operations, the conduit must wait for the RPC to be executed before data can be pushed back.]

[Figure: GASNet latency on conduits (µs) vs. message size (1 byte to 1 KB) for SCI, Elan, and Myrinet put and get operations. SCI results are based on the generic GASNet version of the extended API, which limits performance.]

Despite not having yet constructed the extended API, which allows better hardware exploitation, the SCI conduit still manages to keep pace with the GM conduit for throughput and most small-message latencies. The Q1 report shows a target possibility of ~10 µs latencies.

UPC Benchmarks - IS from NAS Benchmarks*

- Class A executed with Berkeley UPC runtime system V1.1; gcc V3.3.2 for Elan and MPI, Intel V7.1 for GM
- IS (Integer Sort): lots of fine-grain communication, low computation (see the sketch after this slide)
  - The communication layer should have the greatest effect on performance
  - A single thread shows performance without use of the communication layer
- Poor performance in the GASNet communication system does NOT necessarily indicate poor performance in the UPC application
  - MPI results are poor for GASNet but decent for UPC applications
  - The application may need to be larger to confirm this assertion
- The GM conduit shows the greatest gain from parallelization (could be partly due to the better compiler)

[Figure: IS Class A execution time (sec) for 1, 2, 4, and 8 threads on the GM, Elan, and MPI conduits. Only two nodes were available with Elan, so scalability cannot be determined at this point. For MPI, TCP/IP overhead outweighs the benefit of parallelization.]

* Code developed at GWU

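To make "fine-grain communication, low computation" concrete, the hypothetical UPC fragment below (not the GWU IS code) touches a distributed shared array one element at a time. On a cluster, most of those accesses become small remote gets through the GASNet conduit, and almost no computation hides their latency, which is why conduit latency dominates this class of benchmark.

    /* finegrain.upc -- hypothetical illustration of fine-grain remote access */
    #include <upc_relaxed.h>
    #include <stdio.h>

    #define NKEYS (1 << 16)

    shared int key[NKEYS];          /* distributed round-robin across THREADS */

    int main(void) {
        int i;
        upc_forall (i = 0; i < NKEYS; i++; &key[i])
            key[i] = i & 0x3FF;     /* local initialization, no communication */
        upc_barrier;

        if (MYTHREAD == 0) {        /* thread 0 reads every element... */
            long sum = 0;
            for (i = 0; i < NKEYS; i++)
                sum += key[i];      /* ...one int at a time: roughly
                                       (THREADS-1)/THREADS of these reads are
                                       small remote gets */
            printf("sum = %ld\n", sum);
        }
        upc_barrier;
        return 0;
    }
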
Network Performance Tests

- Detailed understanding of high-performance cluster interconnects
  - Identifies suitable networks for UPC over clusters
  - Aids in smooth integration of interconnects with upper-layer UPC components
  - Enables optimization of network communication, unicast and collective
- Various levels of network performance analysis
  - Low-level tests
    - InfiniBand based on the Virtual Interface Provider Library (VIPL)
    - SCI based on Dolphin SISCI and SCALI SCI
    - Myrinet based on Myricom GM
    - QsNet based on the Quadrics Elan communication library
    - Host architecture issues (e.g., CPU, I/O)
  - Mid-level tests
    - Sockets (ping-pong pattern sketched below)
      - Dolphin SCI Sockets on SCI
      - BSD Sockets on Gigabit and 10 Gigabit Ethernet
      - GM Sockets on Myrinet
      - SOVIA on InfiniBand
    - MPI
      - InfiniBand and Myrinet based on MPI/PRO
      - SCI based on ScaMPI and SCI-MPICH

[Figure: layered software stack - intermediate layers over the network layer (SCI, Myrinet, InfiniBand, QsNet, 1/10 Gigabit Ethernet, PCI Express)]

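As an illustration of what a sockets-level (mid-level) latency test looks like, the hypothetical sketch below ping-pongs an 8-byte message and reports the average one-way latency. For brevity it uses a local socketpair and fork; the actual tests run the same pattern between hosts over Gigabit Ethernet, Dolphin SCI Sockets, GM Sockets, or SOVIA.

    /* pingpong.c -- hypothetical sockets-level round-trip latency sketch */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define ITERS 10000
    #define MSG   8                          /* small-message payload in bytes */

    int main(void) {
        int sv[2];
        char buf[MSG] = {0};
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);

        if (fork() == 0) {                   /* echo side */
            for (int i = 0; i < ITERS; i++) {
                read(sv[1], buf, MSG);
                write(sv[1], buf, MSG);
            }
            _exit(0);
        }

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (int i = 0; i < ITERS; i++) {    /* ping side */
            write(sv[0], buf, MSG);
            read(sv[0], buf, MSG);
        }
        gettimeofday(&t1, NULL);
        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("avg one-way latency: %.2f us\n", us / ITERS / 2);
        return 0;
    }
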
Network Performance Tests - Raw vs. Elan Conduit

- Tests run on two Elan3 cards connected by a QM-S16 16-port switch
- Quadrics dping used for raw tests
- GASNet testsmall used for latency, testlarge for throughput
  - Utilizes the extended API
  - Results obtained from put operations
- The Elan conduit for GASNet more than doubles the hardware latency, but still maintains sub-10 µs latency for small messages
- Conduit throughput matches the hardware
  - The Elan conduit does not add appreciably to performance overhead

[Figures: raw vs. Elan-conduit latency (µs) and throughput (MB/s) vs. message size]

Low Level vs. GASNet Conduit (GM)

- Tests run on two Myrinet 2000 cards connected by an M3F-SW16 switch
- Myricom gm_allsize used for raw tests
- GASNet testsmall used for latency, testlarge for throughput
  - Utilizes the extended API
  - Results obtained from put operations
- The GM conduit almost doubles the hardware latency, with latencies of ~19 µs for small messages
- Conduit throughput follows the trend of the hardware but differs by an average of 60 MB/s for messages ≥ 1024 bytes
  - Conduit peaks at 204 MB/s compared to 238 MB/s for the hardware
  - The GM conduit adds a small amount of performance overhead

[Figures: GM raw vs. GM-conduit latency (µs) and throughput (MB/s) vs. message size]

Architectural Performance Tests

- Pentium 4 Xeon
  - Features
    - Intel NetBurst microarchitecture
    - 32-bit processor
    - Hyper-Threading technology (increased CPU utilization)
    - 4.3 GB/s I/O bandwidth
  - Future plans
    - UPC and other parallel application benchmarks
- Opteron
  - Features
    - 64-bit processor
    - Real-time support of 32-bit OS
    - RISC processor core
    - On-chip memory controllers (eliminates the 4 GB memory barrier imposed by 32-bit systems)
    - 19.2 GB/s I/O bandwidth per processor
  - Future plans
    - UPC and other parallel application benchmarks

CPU Performance Results

- DOD Seminumeric Benchmark #2
  - Radix sort (structure sketched below)
  - Measures setup time, sort time, and time to verify the sort
  - Sorting is the dominant component of execution time
- NAS benchmarks: EP, FT, and IS
  - Computationally intensive
  - Class A problem set size
- Results analysis
  - The Opteron architecture outperforms the Xeons in all tests performed, for all iterations
    - Setup and verify times are around half those of the Xeon architecture
  - Opteron and Xeon are comparable on floating-point operations (FT)
  - For integer operations, Opteron performs better than Xeon (EP and IS)

[Figures: radix-sort time (sec) for Xeon vs. Opteron across parameter sets (bit size, radix size, bits in radix sort); NAS EP, FT, and IS execution time (sec) for Xeon vs. Opteron]

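For reference, the hypothetical sketch below (not the DOD Seminumeric Benchmark #2 itself) shows the structure being timed: a setup phase that generates keys, an LSD radix sort whose pass count is governed by the "bits in radix sort" parameter, and a verification pass.

    /* radix_timed.c -- hypothetical setup/sort/verify timing sketch */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define N (1 << 20)            /* number of keys */
    #define RADIX_BITS 8           /* bits sorted per pass */
    #define RADIX (1 << RADIX_BITS)

    static void radix_sort(unsigned *a, unsigned *tmp, int n) {
        for (int shift = 0; shift < 32; shift += RADIX_BITS) {
            int count[RADIX] = {0};
            for (int i = 0; i < n; i++) count[(a[i] >> shift) & (RADIX - 1)]++;
            int pos[RADIX], p = 0;
            for (int d = 0; d < RADIX; d++) { pos[d] = p; p += count[d]; }
            for (int i = 0; i < n; i++) tmp[pos[(a[i] >> shift) & (RADIX - 1)]++] = a[i];
            memcpy(a, tmp, n * sizeof(unsigned));
        }
    }

    int main(void) {
        unsigned *a = malloc(N * sizeof(unsigned));
        unsigned *tmp = malloc(N * sizeof(unsigned));

        clock_t t0 = clock();                       /* setup: generate keys */
        for (int i = 0; i < N; i++) a[i] = (unsigned)rand();
        clock_t t1 = clock();                       /* sort */
        radix_sort(a, tmp, N);
        clock_t t2 = clock();                       /* verify: non-decreasing */
        int ok = 1;
        for (int i = 1; i < N; i++) if (a[i - 1] > a[i]) { ok = 0; break; }
        clock_t t3 = clock();

        printf("setup %.3fs  sort %.3fs  verify %.3fs  (%s)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC,
               (double)(t3 - t2) / CLOCKS_PER_SEC, ok ? "sorted" : "NOT sorted");
        free(a); free(tmp);
        return 0;
    }
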
Memory Performance Results

- Lmbench-3.0-a3 (read-latency measurement approach sketched below)
- Opteron latency/throughput worsen as expected at 64 KB (L1 cache size) and 1 MB (L2 cache size)
- Xeon latency/throughput show the same trend for L1 (8 KB), but the L2 falloff starts earlier (256 KB instead of 512 KB)
  - Cause under investigation
- Between CPU, L1, and L2, Opteron outperforms Xeon, but Xeon outperforms Opteron when loading data from disk into main memory
- Write throughput for Xeon stays relatively constant for sizes below the L2 cache size, suggesting a write-through policy between L1 and L2
- Xeon read > Opteron write > Opteron read > Xeon write

[Figures: read latency (ns) and read/write throughput (GB/s) vs. data size for Xeon and Opteron]

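The lmbench read-latency numbers above come from a dependent-load ("pointer chasing") measurement; the hypothetical sketch below shows the idea. lmbench randomizes the chain to defeat hardware prefetching, which this simplified sequential version does not, so treat it as an illustration of the method rather than a replacement: average time per load rises as the working set crosses the L1 and L2 cache sizes.

    /* memlat.c -- hypothetical pointer-chasing read-latency sketch */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define STRIDE   64            /* bytes, roughly one cache line */
    #define ACCESSES 10000000L

    static double ns_per_load(size_t bytes) {
        size_t n = bytes / STRIDE;
        void **buf = malloc(n * STRIDE);
        /* Link element i to element i+1 (wrapping), STRIDE bytes apart. */
        for (size_t i = 0; i < n; i++)
            buf[i * (STRIDE / sizeof(void *))] =
                &buf[((i + 1) % n) * (STRIDE / sizeof(void *))];

        void **p = buf;
        clock_t t0 = clock();
        for (long i = 0; i < ACCESSES; i++)
            p = (void **)*p;        /* each load depends on the previous one */
        clock_t t1 = clock();
        if (p == buf + 1) putchar(' ');   /* keep p live */
        free(buf);
        return (double)(t1 - t0) / CLOCKS_PER_SEC / ACCESSES * 1e9;
    }

    int main(void) {
        size_t kb[] = { 8, 64, 256, 1024, 4096, 16384 };
        for (int i = 0; i < 6; i++)
            printf("%6zu KB: %5.2f ns per load\n", kb[i], ns_per_load(kb[i] << 10));
        return 0;
    }
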
File I/O Results

- Bonnie / Bonnie++ (per-character vs. block access sketched below)
  - 10 iterations of writing and reading a 2 GB file using per-character functions and efficient block functions
  - stdio overhead is great for per-character functions
  - Efficient block reads and writes greatly reduce CPU utilization
  - Throughput results were directly proportional to CPU utilization
  - Shows the same trend observed in the memory performance testing
    - Xeon read > Opteron write > Opteron read > Xeon write
    - Suggests memory access and I/O access might utilize the same mechanism
- AIM 9
  - 10 iterations using 5 MB files, testing sequential and random reads, writes, and copies
  - Opteron consistently outperforms Xeon by a wide margin
  - Large increase in performance for disk reads compared to writes
  - Xeon read speeds are very high for all results, with much lower write performance
  - Opteron read speeds are also very high, and Opteron greatly outperforms the Xeons in write performance in all cases
  - Xeon sequential read is actually worse than Opteron, but still comparable

[Figures: Bonnie sequential I/O average throughput (MB/s) and average CPU utilization (%) for per-character and per-block reads and writes; AIM9 disk/filesystem I/O throughput (MB/s) for sequential reads/writes, random reads/writes, and disk copies]

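The per-character vs. block distinction measured by Bonnie can be seen with a few lines of stdio; the hypothetical sketch below (not Bonnie itself, and using a much smaller file) writes the same amount of data with putc() and with 64 KB fwrite() calls and compares throughput. The per-call overhead of putc() is what drives the high CPU utilization shown above.

    /* fileio.c -- hypothetical per-character vs. block write comparison */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    #define FILE_SIZE (64L << 20)     /* 64 MB, kept small for illustration */
    #define BLOCK     (64 * 1024)

    static double seconds(struct timeval a, struct timeval b) {
        return (b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
    }

    int main(void) {
        char *buf = calloc(1, BLOCK);
        struct timeval t0, t1;

        /* Per-character writes: one stdio call per byte, so call overhead
           dominates and CPU utilization is high. */
        FILE *f = fopen("bench.tmp", "wb");
        gettimeofday(&t0, NULL);
        for (long i = 0; i < FILE_SIZE; i++)
            putc('x', f);
        fflush(f);
        gettimeofday(&t1, NULL);
        fclose(f);
        printf("per-char write: %6.1f MB/s\n", FILE_SIZE / 1e6 / seconds(t0, t1));

        /* Block writes: one call per 64 KB amortizes the stdio overhead. */
        f = fopen("bench.tmp", "wb");
        gettimeofday(&t0, NULL);
        for (long i = 0; i < FILE_SIZE; i += BLOCK)
            fwrite(buf, 1, BLOCK, f);
        fflush(f);
        gettimeofday(&t1, NULL);
        fclose(f);
        printf("block write:    %6.1f MB/s\n", FILE_SIZE / 1e6 / seconds(t0, t1));

        remove("bench.tmp");
        free(buf);
        return 0;
    }
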
Conclusions and Future Plans

- Accomplishments to date
  - Baselining of UPC on shared-memory multiprocessors
  - Evaluation of promising tools for UPC on clusters
  - Leveraging and extension of communication and UPC layers
  - Conceptual design of new tools
  - Preliminary network and system performance analyses
  - Completed V1.0 of the GASNet core-API SCI conduit for UPC
- Key insights
  - An inefficient communication system does not necessarily translate to poor UPC application performance
  - A Xeon cluster is suitable for applications with a high read/write ratio
  - An Opteron cluster is suitable for generic applications due to comparable read and write capability
- Future plans
  - Comprehensive performance analysis of new SANs and SAN-based clusters
  - Evaluation of UPC methods and tools on various architectures and systems
    - UPC benchmarking on cluster architectures, networks, and conduits
    - Continuing effort in stabilizing/optimizing the GASNet SCI conduit
  - Cost/performance analysis for all options