2007_SC_Nov_OFA-Reno_HPCS2008 Overview
A brief overview with emphasis on cluster performance
Eric Lantz ([email protected])
Lead Program Manager, HPC Team
Microsoft Corp.
Fab Tillier ([email protected])
Developer, HPC Team
Microsoft Corp.
A Brief Overview of this second release from Microsoft’s HPC team.
Some Applicable Market Data
IDC Cluster Study (113 sites, 303 clusters, 29/35/36 GIA split)
Industry self-reports average of 85 nodes per cluster
When needing more computing power:
~50% buy a new cluster, ~50% add nodes to an existing cluster
When purchasing:
61% buy direct from the vendor, 67% have integration done by the vendor
51% use a standard benchmark in the purchase decision
A premium is paid for lower network latency as well as for power and cooling solutions
Applications Study (IDC Cluster Study; IDC App Study: 250 codes, 112 vendors, 11 countries; visits)
Application usage
Apps use 4-128 CPUs and are mostly developed in-house
Majority are multi-threaded
Only 15% use the whole cluster
In practice, 82% are run at 32 processors or below
Excel running in parallel is an application of broad interest
Top challenges for implementing clusters:
Facility issues with power and cooling
System management capability
Complexity implementing parallel algorithms
Interconnect latency
Complexity of system purchase and deployment
[Chart: Application Source by Sector. Breakdown of in-house, ISV, and open-source application codes across academic, government, and industry sites. Sources: 2006 IDC Cluster Study, HECMS, 2006 Microsoft HEWS Study.]
Markets Addressed by HPCS2008
Personal
Cluster size: 4-8
Location: Office
BDM (business decision maker): End-user
BDM requirements: Decreased time to solution; low maintenance overhead; minimal learning curve; availability of apps/codes used; support for when the box goes down; priced within user purchase limits; personal ergonomics
Departmental
Cluster size: 8-256
Location: On-site or remote (datacenter or closet)
BDM: Engineering/research department manager
BDM requirements: Productivity for staff (ease-of-use, perf enhancement, control); ability to run end-user applications; availability of apps/codes used; ability to leverage existing investment/training; price; interop with existing infrastructure (existing cluster or other hardware)
Enterprise
Cluster size: 64-HUGE
Location: Centralized datacenter
BDM: CIO; enterprise architect
BDM requirements: Price/performance; interoperability with existing processes/infrastructure; the "ilities" (reliability, manageability, serviceability, availability); utilization/efficiency
Key HPC Server 2008 Features
Systems Management
New admin console based on the System Center UI framework integrates every aspect of cluster management
Monitoring heat map allows viewing cluster status at a glance
High availability for multiple head nodes
Improved compute node provisioning using Windows Deployment Services
Built-in system diagnostics and cluster reporting
Job Scheduling
Integration with the Windows Communication Foundation, allowing SOA application developers to harness the power of parallel computing offered by HPC solutions
Job scheduling granularity at processor core, processor socket, and compute node levels
Support for the Open Grid Forum's HPC Basic Profile interface
Networking and MPI
Network Direct, providing dramatic RDMA network performance improvements for MPI applications
Improved Network Configuration Wizard
New shared memory MS-MPI implementation for multicore servers
MS-MPI integrated with Event Tracing for Windows and Open Trace Format translation
Storage
Improved iSCSI SAN and Server Message Block (SMB) v2 support in Windows Server 2008
New parallel file system support and vendor partnerships for clusters with high-performance storage needs
New memory cache vendor partnerships
HPCS2008 Is Part Of A Bigger Picture
End-To-End Approach To Performance
Multi-Core is Key
Big improvements in MS-MPI shared memory communications
NetworkDirect
A new RDMA networking interface built for speed and stability
Devs can't tune what they can't see
MS-MPI integrated with Event Tracing for Windows
Perf takes a village
Partnering for perf
Regular Top500 runs
Performed by the HPCS2008 product team on a permanent, scale-testing cluster
Multi-Core is Key
Big improvements in MS-MPI shared memory communications
MS-MPI automatically routes between:
Shared memory: between processes on a single [multi-proc] node
Network: TCP, RDMA (Winsock Direct, NetworkDirect)
MS-MPIv1 monitored incoming shmem traffic by aggressively polling [for low latency], which caused:
Erratic latency measurements
High CPU utilization
MS-MPIv2 uses an entirely new shmem approach:
Direct process-to-process copy to increase shmem throughput
Advanced algorithms to get the best shmem latency while keeping CPU utilization low
Preliminary shmem results:
            Latency (µsec) @ 0-128 B message   Bandwidth (MB/sec) @ 256 kB message
V1 shmem    1.8                                800
V2 shmem    0.7                                3000
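For context, shmem latency and bandwidth figures like these are typically measured with a ping-pong micro-benchmark. The sketch below is a generic example in that spirit, not the benchmark used for the numbers above; running it with two ranks on one node (for example, mpiexec -n 2 pingpong.exe) exercises the shared-memory path.

// Generic MPI ping-pong sketch: rank 0 and rank 1 bounce a message back and
// forth and report one-way latency and two-way bandwidth per message size.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    const int sizes[] = { 0, 128, 256 * 1024 };   // 0 B, 128 B, 256 kB

    for (int size : sizes) {
        std::vector<char> buf(size > 0 ? size : 1);
        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();
        for (int i = 0; i < iters; ++i) {
            if (rank == 0) {
                MPI_Send(buf.data(), size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf.data(), size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf.data(), size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf.data(), size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - start;
        if (rank == 0) {
            double one_way_usec = elapsed / iters / 2.0 * 1e6;           // half a round trip
            double mb_per_sec   = (2.0 * iters * size) / elapsed / 1e6;  // both directions; 0 for 0 B
            printf("%8d bytes: latency %.2f usec, bandwidth %.0f MB/s\n",
                   size, one_way_usec, mb_per_sec);
        }
    }
    MPI_Finalize();
    return 0;
}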
NetworkDirect
A new RDMA networking interface built for speed and stability
Priorities
Equal to hardware-optimized stacks for MPI micro-benchmarks
Focus on an MPI-only solution for CCSv2
Verbs-based design for a close fit with native, high-perf networking interfaces
Coordinated with the Windows Networking team's long-term plans
Implementation
MS-MPIv2 capable of 4 networking paths:
Shared memory: between processors on a motherboard
TCP/IP stack ("normal" Ethernet)
Winsock Direct: for sockets-based RDMA
New NetworkDirect interface
HPC team partnering with networking IHVs to develop/distribute drivers for this new interface
[Diagram: socket-based apps and MPI apps (via MS-MPI) sit above Windows Sockets (Winsock + WSD) and the TCP/IP stack. In user mode, the NetworkDirect and Winsock Direct providers give kernel-by-pass access to RDMA networking hardware; the TCP/Ethernet path runs through TCP, IP, NDIS, and the mini-port driver in kernel mode, down to the networking hardware and its driver. Legend: (ISV) app, CCP component, OS component, IHV component.]
Devs can't tune what they can't see
MS-MPI integrated with Event Tracing for Windows
Single, time-correlated log of OS, driver, MPI, and app events, with CCS-specific additions
High-precision CPU clock correction
Log consolidation from multiple compute nodes into a single record of parallel app execution
Dual purpose:
Performance analysis
Application trouble-shooting
Trace data display:
Visual Studio & Windows ETW tools
Coming soon: Vampir Viewer for Windows
[Diagram: mpiexec.exe (-trace args) and the trace settings (mpitrace.mof) configure MS-MPI; MS-MPI and logman.exe emit events into the Windows ETW infrastructure, which writes a trace log file per compute node. Trace files are consolidated at the end of the job into a single log that can be converted to text or consumed as a live feed.]
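Because the consolidated output is a standard ETW trace (.etl), it can also be post-processed with the regular ETW consumer APIs in addition to the tools above. A minimal sketch of a generic consumer that just counts the events in a log file; this is ordinary Windows tracing code, not the MS-MPI-specific tooling.

// Build with Visual C++, e.g.: cl /EHsc etlcount.cpp advapi32.lib
#define _WIN32_WINNT 0x0600
#include <windows.h>
#include <evntrace.h>
#include <evntcons.h>
#include <cstdio>

static ULONGLONG g_events = 0;

// Called by ProcessTrace once per event in the .etl file.
static VOID WINAPI OnEventRecord(PEVENT_RECORD record)
{
    (void)record;   // a real tool would decode provider IDs and timestamps here
    ++g_events;
}

int wmain(int argc, wchar_t** argv)
{
    if (argc < 2) { wprintf(L"usage: %s <trace.etl>\n", argv[0]); return 1; }

    EVENT_TRACE_LOGFILEW logfile = {};
    logfile.LogFileName = argv[1];
    logfile.ProcessTraceMode = PROCESS_TRACE_MODE_EVENT_RECORD;
    logfile.EventRecordCallback = OnEventRecord;

    TRACEHANDLE trace = OpenTraceW(&logfile);
    if (trace == INVALID_PROCESSTRACE_HANDLE) return 1;

    ProcessTrace(&trace, 1, nullptr, nullptr);   // replays the log through the callback
    CloseTrace(trace);

    wprintf(L"%llu events in %s\n", g_events, argv[1]);
    return 0;
}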
Perf takes a village
(Partnering for perf)
Networking Hardware vendors
NetworkDirect design review
NetworkDirect & WinsockDirect provider development
Windows Core Networking Team
Commercial Software Vendors
Win64 best practices
MPI usage patterns
Collaborative performance tuning
3 ISVs and counting
4 benchmarking centers online
IBM, HP, Dell, SGI
Regular Top500 runs
MS HPC team just completed a 3rd entry to the Top500 list
Using our dev/test scale cluster (Rainier)
Currently #116 on Top500
Best efficiency of any Clovertown with SDR IB (77.1%)
Learnings incorporated into white papers & CCS product
Configuration:
260 Dell blade servers: 1 head node, 256 compute nodes, 1 IIS server, 1 file server
Each compute node has two quad-core Intel 5320 Clovertown, 1.86 GHz, 8 GB RAM
Total: 2080 cores, 2+ TB RAM
Networks: App/MPI: InfiniBand; Private: Gb-E; Public: Gb-E
Location: Microsoft Tukwila data center (22 miles from the Redmond campus)
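As a rough cross-check of the 77.1% figure (my arithmetic, not from the slides, assuming the 2048 compute cores used for Linpack and 4 double-precision flops per cycle per Clovertown core):

Rpeak ≈ 2048 cores × 1.86 GHz × 4 flop/cycle ≈ 15.2 TFLOPS
Rmax ≈ 0.771 × 15.2 TFLOPS ≈ 11.8 TFLOPS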
What is Network Direct?
What Verbs should look like for Windows:
Service Provider Interface (SPI)
Verbs Specifications are not APIs!
Aligned with industry-standard Verbs
Some changes for simplicity
Some changes for convergence of IB and iWARP
Windows-centric design
Leverage Windows asynchronous I/O capabilities
ND Resources
Provider
Adapter
Memory Registration
Memory Window
Completion Queue
Endpoint
Listen
Resources Explained
Resource                    Description
Provider                    Represents the IHV driver
Adapter                     Represents an RDMA NIC; container for all other resources
Completion Queue (CQ)       Used to get I/O results
Endpoint (EP)               Used to initiate I/O; used to establish and manage connections
Memory Registration (MR)    Makes buffers accessible to HW for local access
Memory Window (MW)          Makes buffers accessible for remote access
ND to Verbs Resource Mapping
Network Direct              Verbs
Provider                    N/A
Adapter                     HCA/RNIC
Completion Queue (CQ)       Completion Queue (CQ)
Endpoint (EP)               Queue Pair (QP)
Memory Registration (MR)    Memory Region (MR)
Memory Window (MW)          Memory Window (MW)
ND SPI Traits
Explicit resource management
Application manages memory registrations
Application manages CQ-to-Endpoint bindings
Only asynchronous data transfers
Initiate requests on an Endpoint
Get request results from the associated CQ
Application can use event driven and/or polling I/O model
Leverage Win32 asynchronous I/O for event driven operation
No kernel transitions for polling mode
“Simple” Memory Management Model
Memory Registrations are used for local access
Memory Windows are used for remote access
IP Addressing
No proprietary address management required
ND SPI Model
Collection of COM interfaces
No COM runtime dependency
Use the interface model only
Follows model adopted by the UMDF
Thread-less providers
No callbacks
Aligned with industry standard Verbs
Facilitates IHV adoption
Why COM Interfaces?
Well understood programming model
Easily extensible via IUnknown::QueryInterface
Allows retrieving any interface supported by an object
Object oriented
C/C++ language independent
Callers and providers can be independently
implemented in C or C++ without impact on one
another
Interfaces support native code syntax - no wrappers
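To make the "COM interfaces without the COM runtime" model concrete, here is a minimal sketch; IExampleAdapter, ExampleAdapter, and OpenExampleAdapter are hypothetical names used only for illustration and are not part of the actual ND SPI.

#include <windows.h>
#include <unknwn.h>    // IUnknown
#include <cstdio>

// Hypothetical interface in the style described above: pure virtual methods on top of
// IUnknown, handed out by a plain exported factory function rather than CoCreateInstance,
// so no COM runtime or registration is involved.
struct __declspec(uuid("6f9b5a2e-1c4d-4e8a-9b3f-2a7c8d5e1f00"))
IExampleAdapter : public IUnknown
{
    virtual HRESULT STDMETHODCALLTYPE GetNodeCount(ULONG* pCount) = 0;
};

// Provider-side implementation: an ordinary C++ class with IUnknown reference counting.
class ExampleAdapter : public IExampleAdapter
{
    LONG m_ref = 1;
public:
    STDMETHODIMP QueryInterface(REFIID iid, void** ppv) override
    {
        if (iid == __uuidof(IUnknown) || iid == __uuidof(IExampleAdapter)) {
            *ppv = static_cast<IExampleAdapter*>(this);
            AddRef();
            return S_OK;
        }
        *ppv = nullptr;
        return E_NOINTERFACE;
    }
    STDMETHODIMP_(ULONG) AddRef() override { return InterlockedIncrement(&m_ref); }
    STDMETHODIMP_(ULONG) Release() override
    {
        ULONG ref = InterlockedDecrement(&m_ref);
        if (ref == 0) delete this;
        return ref;
    }
    STDMETHODIMP GetNodeCount(ULONG* pCount) override { *pCount = 256; return S_OK; }
};

// Hypothetical factory a provider DLL might export; the caller never sees the class.
extern "C" HRESULT OpenExampleAdapter(REFIID iid, void** ppv)
{
    ExampleAdapter* p = new ExampleAdapter();   // ref count = 1
    HRESULT hr = p->QueryInterface(iid, ppv);   // +1 on success
    p->Release();                               // drop the factory's reference
    return hr;
}

int main()
{
    IExampleAdapter* adapter = nullptr;
    if (SUCCEEDED(OpenExampleAdapter(__uuidof(IExampleAdapter),
                                     reinterpret_cast<void**>(&adapter)))) {
        ULONG nodes = 0;
        adapter->GetNodeCount(&nodes);
        printf("nodes: %lu\n", nodes);
        adapter->Release();   // standard IUnknown reference counting
    }
    return 0;
}

The thread-less, no-callback trait mentioned above simply means the provider never calls back into the application on its own threads; completion is reported through the asynchronous mechanisms described in the next section.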
Asynchronous Operations
Win32 Overlapped operations used for:
Memory Registration
CQ Notification
Connection Management
Client controls threading and completion mechanism
I/O Completion Port or GetOverlappedResult
Simpler for kernel drivers to support
IoCompleteRequest – I/O manager handles the rest.
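The overlapped pattern itself is stock Win32. A minimal sketch with an ordinary file handle, using an event plus GetOverlappedResult (a server-style design would bind the handle to an I/O completion port instead), just to illustrate the completion model the SPI plugs into:

#define _WIN32_WINNT 0x0600
#include <windows.h>
#include <cstdio>

int wmain(int argc, wchar_t** argv)
{
    if (argc < 2) { wprintf(L"usage: %s <file>\n", argv[0]); return 1; }

    // FILE_FLAG_OVERLAPPED requests asynchronous I/O on this handle.
    HANDLE file = CreateFileW(argv[1], GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    char buffer[64 * 1024];
    OVERLAPPED ov = {};
    ov.hEvent = CreateEventW(nullptr, TRUE, FALSE, nullptr);  // event-driven completion

    // Initiate the request; ERROR_IO_PENDING means it will complete asynchronously.
    if (!ReadFile(file, buffer, sizeof(buffer), nullptr, &ov) &&
        GetLastError() != ERROR_IO_PENDING) {
        CloseHandle(ov.hEvent);
        CloseHandle(file);
        return 1;
    }

    // The caller chooses how to wait for completion: here, GetOverlappedResult blocks
    // on the event; an I/O completion port would be the alternative for servers.
    DWORD bytesRead = 0;
    if (GetOverlappedResult(file, &ov, &bytesRead, TRUE /* wait */))
        wprintf(L"read %lu bytes asynchronously\n", bytesRead);

    CloseHandle(ov.hEvent);
    CloseHandle(file);
    return 0;
}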
References
Microsoft HPC web site - HPC Server 2008 (beta) Available Now!!
http://www.microsoft.com/hpc
Network Direct SPI documentation, header and test executables
In the HPC Server 2008 (beta) SDK
http://www.microsoft.com/hpc
Microsoft HPC Community Site
http://windowshpc.net/default.aspx
Argonne National Lab’s MPI website
http://www-unix.mcs.anl.gov/mpi/
CCS 2003 Performance Tuning Whitepaper
http://www.microsoft.com/downloads/details.aspx?FamilyID=40cd8152-f89d-4abf-ab1c-a467e180cce4&DisplayLang=en
(or go to http://www.microsoft.com/downloads and search for "CCS Performance")
Socrates software boosts performance by 30% on Microsoft cluster to achieve 77.1% overall cluster efficiency
Performance improvement was demonstrated with exactly the same hardware and is attributed to:
Improved networking performance of MS-MPI's NetworkDirect interface
Entirely new MS-MPI implementation for shared memory communications
Tools and scripts to optimize process placement and tune the Linpack parameters for this 256-node, 2048-processor cluster
Windows Server 2008 improvements in querying completion port status
Use of Visual Studio's Profile Guided Optimization (POGO) on the Linpack, MS-MPI, and ND provider binaries