2007_SC_Nov_OFA-Reno_HPCS2008 Overview
A brief overview with emphasis on cluster performance
Eric Lantz ([email protected])
Lead Program Manager, HPC Team
Microsoft Corp.
Fab Tillier ([email protected])
Developer, HPC Team
Microsoft Corp.
A Brief Overview of this second release from Microsoft’s HPC team.
Some Applicable Market Data
IDC Cluster Study (113 sites, 303 clusters, 29/35/36 GIA split)
Industry self-reports average of 85 nodes per cluster
When needing more computing power:
~50% buy a new cluster, ~50% add nodes to an existing cluster
When purchasing:
61% buy direct from the vendor, 67% have integration done by the vendor
51% use a standard benchmark in the purchase decision
A premium is paid for lower network latency as well as for power and cooling solutions
Applications Study (IDC Cluster Study; IDC App Study: 250 codes, 112 vendors, 11 countries; visits)
Application usage
Apps use 4-128 CPUs and are mostly developed in-house
Majority are multi-threaded
Only 15% use the whole cluster
In practice, 82% are run at 32 processors or below
Excel running in parallel is an application of broad interest
Top challenges for implementing clusters:
Facility issues with power and cooling
System management capability
Complexity implementing parallel algorithms
Interconnect latency
Complexity of system purchase and deployment
[Chart: Application Source by Sector. Breakdown of in-house, ISV, and open-source application codes across academic, government, and industry sites. Sources: 2006 IDC Cluster Study, HECMS, 2006 Microsoft HEWS Study.]
Markets Addressed by HPCS2008
Personal
Cluster size: 4-8
Location: Office
BDM (business decision maker): End-user
BDM requirements: Decreased time to solution; low maintenance overhead; minimal learning curve; availability of apps/codes used; support for when the box goes down; priced within user purchase limits; personal ergonomics
Departmental
Cluster size: 8-256
Location: On-site or remote (datacenter or closet)
BDM: Engineering/research department manager
BDM requirements: Productivity for staff (ease-of-use, perf enhancement, control); ability to run end-user applications; availability of apps/codes used; ability to leverage existing investment/training; price; interop with existing infrastructure (existing cluster or other hardware)
Enterprise
Cluster size: 64-HUGE
Location: Centralized datacenter
BDM: CIO; enterprise architect
BDM requirements: Price/performance; interoperability with existing processes/infrastructure; the "ilities" (reliability, manageability, serviceability, availability); utilization/efficiency
Key HPC Server 2008 Features
Systems Management
New admin console based on the System Center UI framework integrates every aspect of cluster management
Monitoring heat map allows viewing cluster status at a glance
High availability for multiple head nodes
Improved compute node provisioning using Windows Deployment Services
Built-in system diagnostics and cluster reporting
Job Scheduling
Integration with the Windows Communication Foundation, allowing SOA application developers to harness the power of parallel computing offered by HPC solutions
Job scheduling granularity at processor core, processor socket, and compute node levels
Support for the Open Grid Forum's HPC Basic Profile interface
Networking and MPI
Network Direct, providing dramatic RDMA network performance improvements for MPI applications
Improved Network Configuration Wizard
New shared memory MS-MPI implementation for multicore servers
MS-MPI integrated with Event Tracing for Windows and Open Trace Format translation
Storage
Improved iSCSI SAN and Server Message Block (SMB) v2 support in Windows Server 2008
New parallel file system support and vendor partnerships for clusters with high-performance storage needs
New memory cache vendor partnerships
HPCS2008 Is Part Of A Bigger Picture
End-To-End Approach To Performance
Multi-Core is Key
Big improvements in MS-MPI shared memory communications
NetworkDirect
A new RDMA networking interface built for speed and stability
Devs can't tune what they can't see
MS-MPI integrated with Event Tracing for Windows
Perf takes a village
Partnering for perf
Regular Top500 runs
Performed by the HPCS2008 product team on a permanent, scale-testing cluster
Multi-Core is Key
Big improvements in MS-MPI shared memory communications
MS-MPI automatically routes between:
Shared memory: between processes on a single [multi-proc] node
Network: TCP, RDMA (Winsock Direct, NetworkDirect)
MS-MPIv1 monitored incoming shmem traffic by aggressively polling [for low latency], which caused:
Erratic latency measurements
High CPU utilization
MS-MPIv2 uses an entirely new shmem approach:
Direct process-to-process copy to increase shmem throughput
Advanced algorithms to get the best shmem latency while keeping CPU utilization low
Preliminary shmem results:
            Latency (µsec) @ 0-128 B message   Bandwidth (MB/sec) @ 256 kB message
V1 shmem    1.8                                800
V2 shmem    0.7                                3000
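For context, shmem latency and bandwidth figures like these are typically measured with a ping-pong micro-benchmark. The sketch below is a generic example in that spirit, not the benchmark used for the numbers above; running it with two ranks on one node (for example, mpiexec -n 2 pingpong.exe) exercises the shared-memory path.

// Generic MPI ping-pong sketch: rank 0 and rank 1 bounce a message back and
// forth and report one-way latency and two-way bandwidth per message size.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    const int sizes[] = { 0, 128, 256 * 1024 };   // 0 B, 128 B, 256 kB

    for (int size : sizes) {
        std::vector<char> buf(size > 0 ? size : 1);
        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();
        for (int i = 0; i < iters; ++i) {
            if (rank == 0) {
                MPI_Send(buf.data(), size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf.data(), size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf.data(), size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf.data(), size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - start;
        if (rank == 0) {
            double one_way_usec = elapsed / iters / 2.0 * 1e6;           // half a round trip
            double mb_per_sec   = (2.0 * iters * size) / elapsed / 1e6;  // both directions; 0 for 0 B
            printf("%8d bytes: latency %.2f usec, bandwidth %.0f MB/s\n",
                   size, one_way_usec, mb_per_sec);
        }
    }
    MPI_Finalize();
    return 0;
}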
NetworkDirect
A new RDMA networking interface built for speed and stability
Priorities
Equal to hardware-optimized stacks for MPI micro-benchmarks
Focus on an MPI-only solution for CCSv2
Verbs-based design for a close fit with native, high-perf networking interfaces
Coordinated with the Windows Networking team's long-term plans
Implementation
MS-MPIv2 capable of 4 networking paths:
Shared memory: between processors on a motherboard
TCP/IP stack ("normal" Ethernet)
Winsock Direct: for sockets-based RDMA
New NetworkDirect interface
HPC team partnering with networking IHVs to develop/distribute drivers for this new interface
[Diagram: socket-based apps and MPI apps (via MS-MPI) sit above Windows Sockets (Winsock + WSD) and the TCP/IP stack. In user mode, the NetworkDirect and Winsock Direct providers give kernel-by-pass access to RDMA networking hardware; the TCP/Ethernet path runs through TCP, IP, NDIS, and the mini-port driver in kernel mode, down to the networking hardware and its driver. Legend: (ISV) app, CCP component, OS component, IHV component.]
Devs can't tune what they can't see
MS-MPI integrated with Event Tracing for Windows
Single, time-correlated log of OS, driver, MPI, and app events, with CCS-specific additions
High-precision CPU clock correction
Log consolidation from multiple compute nodes into a single record of parallel app execution
Dual purpose:
Performance analysis
Application trouble-shooting
Trace data display:
Visual Studio & Windows ETW tools
Coming soon: Vampir Viewer for Windows
[Diagram: mpiexec.exe (-trace args) and the trace settings (mpitrace.mof) configure MS-MPI; MS-MPI and logman.exe emit events into the Windows ETW infrastructure, which writes a trace log file per compute node. Trace files are consolidated at the end of the job into a single log that can be converted to text or consumed as a live feed.]
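Because the consolidated output is a standard ETW trace (.etl), it can also be post-processed with the regular ETW consumer APIs in addition to the tools above. A minimal sketch of a generic consumer that just counts the events in a log file; this is ordinary Windows tracing code, not the MS-MPI-specific tooling.

// Build with Visual C++, e.g.: cl /EHsc etlcount.cpp advapi32.lib
#define _WIN32_WINNT 0x0600
#include <windows.h>
#include <evntrace.h>
#include <evntcons.h>
#include <cstdio>

static ULONGLONG g_events = 0;

// Called by ProcessTrace once per event in the .etl file.
static VOID WINAPI OnEventRecord(PEVENT_RECORD record)
{
    (void)record;   // a real tool would decode provider IDs and timestamps here
    ++g_events;
}

int wmain(int argc, wchar_t** argv)
{
    if (argc < 2) { wprintf(L"usage: %s <trace.etl>\n", argv[0]); return 1; }

    EVENT_TRACE_LOGFILEW logfile = {};
    logfile.LogFileName = argv[1];
    logfile.ProcessTraceMode = PROCESS_TRACE_MODE_EVENT_RECORD;
    logfile.EventRecordCallback = OnEventRecord;

    TRACEHANDLE trace = OpenTraceW(&logfile);
    if (trace == INVALID_PROCESSTRACE_HANDLE) return 1;

    ProcessTrace(&trace, 1, nullptr, nullptr);   // replays the log through the callback
    CloseTrace(trace);

    wprintf(L"%llu events in %s\n", g_events, argv[1]);
    return 0;
}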
Perf takes a village
(Partnering for perf)
Networking Hardware vendors
NetworkDirect design review
NetworkDirect & WinsockDirect provider development
Windows Core Networking Team
Commercial Software Vendors
Win64 best practices
MPI usage patterns
Collaborative performance tuning
3 ISVs and counting
4 benchmarking centers online
IBM, HP, Dell, SGI
Regular Top500 runs
MS HPC team just completed a 3rd entry to the Top500 list
Using our dev/test scale cluster (Rainier)
Currently #116 on Top500
Best efficiency of any Clovertown with SDR IB (77.1%)
Learnings incorporated into white papers & CCS product
Configuration:
260 Dell blade servers: 1 head node, 256 compute nodes, 1 IIS server, 1 file server
Each compute node has two quad-core Intel 5320 Clovertown, 1.86 GHz, 8 GB RAM
Total: 2080 cores, 2+ TB RAM
Networks: App/MPI: InfiniBand; Private: Gb-E; Public: Gb-E
Location: Microsoft Tukwila data center (22 miles from the Redmond campus)
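As a rough cross-check of the 77.1% figure (my arithmetic, not from the slides, assuming the 2048 compute cores used for Linpack and 4 double-precision flops per cycle per Clovertown core):

Rpeak ≈ 2048 cores × 1.86 GHz × 4 flop/cycle ≈ 15.2 TFLOPS
Rmax ≈ 0.771 × 15.2 TFLOPS ≈ 11.8 TFLOPS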
What is Network Direct?
What Verbs should look like for Windows:
Service Provider Interface (SPI)
Verbs Specifications are not APIs!
Aligned with industry-standard Verbs
Some changes for simplicity
Some changes for convergence of IB and iWARP
Windows-centric design
Leverage Windows asynchronous I/O capabilities
ND Resources
Provider
Adapter
Memory Registration
Memory Window
Completion Queue
Endpoint
Listen
Resources Explained
Resource                    Description
Provider                    Represents the IHV driver
Adapter                     Represents an RDMA NIC; container for all other resources
Completion Queue (CQ)       Used to get I/O results
Endpoint (EP)               Used to initiate I/O; used to establish and manage connections
Memory Registration (MR)    Makes buffers accessible to HW for local access
Memory Window (MW)          Makes buffers accessible for remote access
ND to Verbs Resource Mapping
Network Direct              Verbs
Provider                    N/A
Adapter                     HCA/RNIC
Completion Queue (CQ)       Completion Queue (CQ)
Endpoint (EP)               Queue Pair (QP)
Memory Registration (MR)    Memory Region (MR)
Memory Window (MW)          Memory Window (MW)
ND SPI Traits
Explicit resource management
Application manages memory registrations
Application manages CQ-to-Endpoint bindings
Only asynchronous data transfers
Initiate requests on an Endpoint
Get request results from the associated CQ
Application can use event driven and/or polling I/O model
Leverage Win32 asynchronous I/O for event driven operation
No kernel transitions for polling mode
“Simple” Memory Management Model
Memory Registrations are used for local access
Memory Windows are used for remote access
IP Addressing
No proprietary address management required
ND SPI Model
Collection of COM interfaces
No COM runtime dependency
Use the interface model only
Follows model adopted by the UMDF
Thread-less providers
No callbacks
Aligned with industry standard Verbs
Facilitates IHV adoption
Why COM Interfaces?
Well understood programming model
Easily extensible via IUnknown::QueryInterface
Allows retrieving any interface supported by an object
Object oriented
C/C++ language independent
Callers and providers can be independently
implemented in C or C++ without impact on one
another
Interfaces support native code syntax - no wrappers
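To make the "COM interfaces without the COM runtime" model concrete, here is a minimal sketch; IExampleAdapter, ExampleAdapter, and OpenExampleAdapter are hypothetical names used only for illustration and are not part of the actual ND SPI.

#include <windows.h>
#include <unknwn.h>    // IUnknown
#include <cstdio>

// Hypothetical interface in the style described above: pure virtual methods on top of
// IUnknown, handed out by a plain exported factory function rather than CoCreateInstance,
// so no COM runtime or registration is involved.
struct __declspec(uuid("6f9b5a2e-1c4d-4e8a-9b3f-2a7c8d5e1f00"))
IExampleAdapter : public IUnknown
{
    virtual HRESULT STDMETHODCALLTYPE GetNodeCount(ULONG* pCount) = 0;
};

// Provider-side implementation: an ordinary C++ class with IUnknown reference counting.
class ExampleAdapter : public IExampleAdapter
{
    LONG m_ref = 1;
public:
    STDMETHODIMP QueryInterface(REFIID iid, void** ppv) override
    {
        if (iid == __uuidof(IUnknown) || iid == __uuidof(IExampleAdapter)) {
            *ppv = static_cast<IExampleAdapter*>(this);
            AddRef();
            return S_OK;
        }
        *ppv = nullptr;
        return E_NOINTERFACE;
    }
    STDMETHODIMP_(ULONG) AddRef() override { return InterlockedIncrement(&m_ref); }
    STDMETHODIMP_(ULONG) Release() override
    {
        ULONG ref = InterlockedDecrement(&m_ref);
        if (ref == 0) delete this;
        return ref;
    }
    STDMETHODIMP GetNodeCount(ULONG* pCount) override { *pCount = 256; return S_OK; }
};

// Hypothetical factory a provider DLL might export; the caller never sees the class.
extern "C" HRESULT OpenExampleAdapter(REFIID iid, void** ppv)
{
    ExampleAdapter* p = new ExampleAdapter();   // ref count = 1
    HRESULT hr = p->QueryInterface(iid, ppv);   // +1 on success
    p->Release();                               // drop the factory's reference
    return hr;
}

int main()
{
    IExampleAdapter* adapter = nullptr;
    if (SUCCEEDED(OpenExampleAdapter(__uuidof(IExampleAdapter),
                                     reinterpret_cast<void**>(&adapter)))) {
        ULONG nodes = 0;
        adapter->GetNodeCount(&nodes);
        printf("nodes: %lu\n", nodes);
        adapter->Release();   // standard IUnknown reference counting
    }
    return 0;
}

The thread-less, no-callback trait mentioned above simply means the provider never calls back into the application on its own threads; completion is reported through the asynchronous mechanisms described in the next section.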
Asynchronous Operations
Win32 Overlapped operations used for:
Memory Registration
CQ Notification
Connection Management
Client controls threading and completion mechanism
I/O Completion Port or GetOverlappedResult
Simpler for kernel drivers to support
IoCompleteRequest – I/O manager handles the rest.
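The overlapped pattern itself is stock Win32. A minimal sketch with an ordinary file handle, using an event plus GetOverlappedResult (a server-style design would bind the handle to an I/O completion port instead), just to illustrate the completion model the SPI plugs into:

#define _WIN32_WINNT 0x0600
#include <windows.h>
#include <cstdio>

int wmain(int argc, wchar_t** argv)
{
    if (argc < 2) { wprintf(L"usage: %s <file>\n", argv[0]); return 1; }

    // FILE_FLAG_OVERLAPPED requests asynchronous I/O on this handle.
    HANDLE file = CreateFileW(argv[1], GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    char buffer[64 * 1024];
    OVERLAPPED ov = {};
    ov.hEvent = CreateEventW(nullptr, TRUE, FALSE, nullptr);  // event-driven completion

    // Initiate the request; ERROR_IO_PENDING means it will complete asynchronously.
    if (!ReadFile(file, buffer, sizeof(buffer), nullptr, &ov) &&
        GetLastError() != ERROR_IO_PENDING) {
        CloseHandle(ov.hEvent);
        CloseHandle(file);
        return 1;
    }

    // The caller chooses how to wait for completion: here, GetOverlappedResult blocks
    // on the event; an I/O completion port would be the alternative for servers.
    DWORD bytesRead = 0;
    if (GetOverlappedResult(file, &ov, &bytesRead, TRUE /* wait */))
        wprintf(L"read %lu bytes asynchronously\n", bytesRead);

    CloseHandle(ov.hEvent);
    CloseHandle(file);
    return 0;
}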
References
Microsoft HPC web site - HPC Server 2008 (beta) Available Now!!
http://www.microsoft.com/hpc
Network Direct SPI documentation, header and test executables
In the HPC Server 2008 (beta) SDK
http://www.microsoft.com/hpc
Microsoft HPC Community Site
http://windowshpc.net/default.aspx
Argonne National Lab’s MPI website
http://www-unix.mcs.anl.gov/mpi/
CCS 2003 Performance Tuning Whitepaper
http://www.microsoft.com/downloads/details.aspx?FamilyID=40cd8152-f89d-4abf-ab1c-a467e180cce4&DisplayLang=en
(or go to http://www.microsoft.com/downloads and search for "CCS Performance")
Socrates software boosts performance by 30% on Microsoft cluster to achieve 77.1% overall cluster efficiency
Performance improvement was demonstrated with exactly the same hardware and is attributed to:
Improved networking performance of MS-MPI's NetworkDirect interface
Entirely new MS-MPI implementation for shared memory communications
Tools and scripts to optimize process placement and tune the Linpack parameters for this 256-node, 2048-processor cluster
Windows Server 2008 improvements in querying completion port status
Use of Visual Studio's Profile Guided Optimization (POGO) on the Linpack, MS-MPI, and ND provider binaries