2006_IBTA_OFA_1515_windows_devel


OpenFabrics Windows Development
and Microsoft Windows CCS 2003
Part 2
Top500 with InfiniBand
Ranked 130 on the Top500 list – April 2006
Compute Cluster Server V1
- on 896 processors (Dell PowerEdge 1855)
- OpenIB + OpenSM
Rmax = 4.1 TFlops
- 72% efficiency (Rmax/Rpeak)
- Rpeak = Nproc x CPU clock x flops/clock
= 896 x 3.2 GHz x 2 = 5.734 TFlops (worked out below)
>> But the goal is not the Top500; it is to create a product that
responds to industry needs
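A quick restatement of the slide's Rpeak and efficiency arithmetic (same figures as above, just written out):

R_{peak} = N_{proc} \times f_{clk} \times \mathrm{flops/clk} = 896 \times 3.2\,\mathrm{GHz} \times 2 = 5734.4\,\mathrm{GFlops} \approx 5.73\,\mathrm{TFlops}

\mathrm{Efficiency} = R_{max} / R_{peak} = 4.1 / 5.734 \approx 72\%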
Partners
Market Perspective
1991 - Cray Y-MP C916
- Architecture: 16 x Vector, 4 GB, Bus
- OS: UNICOS
- Performance: ~10 GFlops
- Top500 #: 1
- Price: $40,000,000
- Customers: Government Labs
- Applications: Classified, Climate, Physics Research

1998 - Sun HPC10000
- Architecture: 24 x 333 MHz UltraSPARC II, 24 GB, SBus
- OS: Solaris 2.5.1
- Performance: ~10 GFlops
- Top500 #: 500
- Price: $1,000,000 (40x drop)
- Customers: Large Enterprises
- Applications: Manufacturing, Energy, Finance, Telecom

2005 - Small Form Factor PCs
- Architecture: 4 x 2.2 GHz Athlon64, 4 GB, GigE
- OS: Windows Server 2003 SP1
- Performance: ~10 GFlops
- Top500 #: N/A
- Price: < $4,000 (250x drop)
- Customers: Every Engineer & Scientist
- Applications: Bioinformatics, Materials Sciences, Digital Media
Customer Needs Today
Customers:
An integrated supported solution stack
Simplified job submission, status and progress monitoring
Maximum compute performance and scalability
Simplified environment from desktops to HPC clusters
Administrators:
Better cluster monitoring and management for maximum resource
utilization
Flexible, extensible, policy-driven job scheduling and resource
allocation
Maximum node uptime
Secure process startup and complete cleanup
Developers:
Programming environment that enables maximum productivity
Availability of optimized compilers (Fortran) and math libraries
Parallel debugger, profiler, and visualization tools
Parallel programming models (MPI)
CCS Key Features
Integration with existing Windows and management infrastructure
Integrates with AD, Windows security and existing systems management and
deployment tools
Node Deployment and Administration
Compute nodes automatically imaged and added to cluster
Node Management through UI and command line
To-do list to configure the head node
Extensible job scheduler
3rd party extensibility at job submission and/or job assignment
Examples: admission policies and license verification
Submit jobs from command line, UI, or directly from applications
Simple management, similar to print queue management
Secure MPI
User credentials secured in job scheduler and compute nodes
Standardized MPI stack
Microsoft provided stack reduces application/MPI incompatibility issues
Integration with new application development: Excel 12, MATLAB
Excel 12 and CCS integration
Excel calculation on Compute Cluster Server 2003, spanning desktops, servers, and clusters (workbook sent to Excel Computation Services (ECS), or a formula wrapped in an exe)
- Scenario 1: entire workbooks are sent to nodes, where an ECS instance does the computation and posts the results back to the site
- Scenario 2: an exe containing the Excel formula is submitted as a job to a node; more granular, but more development work
High-Performance Networking Options
(Stack: Application -> MS-MPI -> Winsock Direct (WSD) -> Interconnect)
MS-MPI: based on Argonne's MPICH2 implementation of the MPI spec (open source)
- hardened for security (which is why it is not 100% compatible)
- MS-MPI supports any interconnect that plugs into Windows, all at the same time, thanks to the Windows networking stack; use the mpiexec command with a different network mask to switch fabrics (a minimal MPI example follows this slide)
- any MPI works on CCS; you are not restricted to MS-MPI
Winsock Direct
- exploits the RDMA capability in the provider
- different interconnects at the same time with no change to WSD, the application, or the MPI
- upgrading an interconnect (e.g., the IB driver) requires no change
- better throughput/latency
- all applications benefit, not just MPI applications
- custom fabrics you develop can be plugged in
Interconnects
- IB for low latency, GigE, Myrinet, etc.
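To make the "application code is unchanged" point concrete, here is a minimal sketch of an MPI program as it might be compiled against MS-MPI (mpi.h) and launched with mpiexec; the choice of fabric (IB over WSD, GigE, etc.) is made underneath by the networking stack, not in the source. The process count in the usage note is a placeholder.

/* hello_mpi.c - minimal MPI sketch; nothing here is fabric-specific. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                 /* start the MPI runtime       */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank         */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes   */
    MPI_Get_processor_name(host, &len);     /* node this rank landed on    */

    printf("rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();                         /* clean shutdown              */
    return 0;
}

Launched, for example, as "mpiexec -n 8 hello_mpi.exe"; moving the same job from GigE to InfiniBand is a launch-configuration (network mask) change, not a source change.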
MS MPI Leverages Winsock Direct
[Architecture diagram] User mode: the HPC application calls MPI (MS-MPI), which sits on the WinSock Switch; the switch routes traffic based on sub-net. RDMA-capable paths go through a fabric-specific WinSock Provider DLL (IB with RDMA via a verbs-based user API and the user-mode Host Channel Adapter driver, or GigE with RDMA), while the ordinary IP path goes through TCP/IP. Kernel mode: the IP path uses NDIS miniports (GigE and IPoIB); the Virtual Bus Driver manages hardware resources in user space (e.g., send and receive queues); a verbs-based kernel API feeds the Host Channel Adapter driver, which drives the networking hardware. The diagram distinguishes OS components from IHV-provided components.
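Because the switch sits underneath the standard Winsock API, an ordinary sockets application gets the accelerated path without source changes. A minimal sketch of such a client follows; the server address and port are placeholders, and whether the connection actually rides the WSD/RDMA provider depends on the sub-net and the installed provider.

/* plain_winsock_client.c - ordinary Winsock code; the WinSock Switch decides
 * underneath whether this connection uses the WSD/RDMA provider or TCP/IP. */
#include <winsock2.h>
#include <ws2tcpip.h>
#include <stdio.h>
#include <string.h>
#pragma comment(lib, "ws2_32.lib")

int main(void)
{
    WSADATA wsa;
    SOCKET s;
    struct sockaddr_in addr;
    const char *msg = "hello over whichever fabric the switch picks";

    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
        return 1;

    s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);    /* standard TCP stream socket  */
    if (s == INVALID_SOCKET) { WSACleanup(); return 1; }

    addr.sin_family = AF_INET;
    addr.sin_port = htons(5000);                      /* placeholder port            */
    addr.sin_addr.s_addr = inet_addr("10.0.0.1");     /* placeholder cluster address */

    if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
        send(s, msg, (int)strlen(msg), 0);            /* same call on either path    */
    }

    closesocket(s);
    WSACleanup();
    return 0;
}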
Performance Tuning
Each application is different (parallel, small-message passing, large-message passing) and is sensitive to different critical factors. A unique set of mechanisms and parameters needs to be applied to each application for optimal performance.
Critical factors:
- Network bandwidth, network latency, physical RAM, CPU speed, number of CPUs per node, file system speed, job scheduler
Mechanisms:
- RDMA zero copy for better throughput (on long transfers)
- Offloads (TCP, checksum, ...)
Driver parameters:
- MPICH_SOCKET_SBUFFER_SIZE set to zero for better throughput
- IBWSD_POLL around 500 for low latency
- GigE: CPU interrupt moderation off for low CPU usage
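Whether a given parameter change helps is best confirmed by measurement. Below is a minimal sketch (not from the talk) of an MPI ping-pong timing loop that can be run before and after changing a setting such as MPICH_SOCKET_SBUFFER_SIZE or IBWSD_POLL; the message size and iteration count are arbitrary placeholders.

/* pingpong.c - rough latency/bandwidth probe between rank 0 and rank 1. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (1 << 20)   /* 1 MB: large enough to exercise the zero-copy path */
#define ITERS     100

int main(int argc, char *argv[])
{
    int rank, i;
    char *buf = malloc(MSG_BYTES);
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        double rtt = (t1 - t0) / ITERS;   /* seconds per round trip */
        printf("avg round trip: %.3f ms, ~%.1f MB/s each way\n",
               rtt * 1e3, (2.0 * MSG_BYTES / rtt) / 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}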
InfiniBand in Microsoft’s Labs
HPC team:
- 6 clusters [6-20 nodes]
with IB equipment for testing
Other teams with smaller scale clusters:
- Windows Networking
- Windows Serviceability
- Windows Performance
- SQL Server
WHQL for InfiniBand
Background
WHQL = Windows
Hardware Quality Labs
Why WHQL? High quality
set of drivers for Windows
Driven by Windows
Networking Team
http://www.microsoft.com/whdc/whql/default.mspx
Details
A test suite for WSD providers and IP-over-IB miniport drivers
Includes functional tests only (no code coverage)
Signature covers networking only (no storage)
High Speed Networking: Next Steps
Contribute back to the open-source project (MPICH2, Argonne National Laboratory) with MPI perf enhancements. HPC is the first team at Microsoft contributing to an open-source project. [0-6 months]
Performance tuning whitepaper coming soon. [0-3 months]
Windows Networking releases a QFE with perf enhancements. [0-3 months]
More perf work in general for WSD, MS-MPI, OpenIB
More network diagnosis tools
External Resources
Microsoft HPC Web Site http://www.microsoft.com/hpc
Microsoft HPC Community Site http://windowshpc.net/default.aspx
Partner Information
http://www.microsoft.com/windowsserver2003/ccs/partners/partnerlist.mspx
Develop Turbocharged Apps for Windows Compute Cluster Server
http://msdn.microsoft.com/msdnmag/issues/06/04/ClusterComputing/default.aspx
Argonne National Lab’s MPI Web Site
http://www-unix.mcs.anl.gov/mpi/
Tuning MPI Programs for Peak Performance
http://mpc.uci.edu/wget/wwwunix.mcs.anl.gov/mpi/tutorial/perf/mpiperf/index.htm
Winsock Direct Stability Patch
http://support.microsoft.com/kb/910481
http://www.microsoft.com/downloads/details.aspx?familyid=A747A23D-CF52-493C-943B-B95051F42D68&displaylang=en
Q & A
© 2005 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.