Parallel Method for Numerical Weather Prediction


Planning and building a Linux-based cluster for NWP
Climatological Research Institute (CRI Cluster)
Dr. Jamali
Chezgi
Outline
- Introduction
- Our problem
- Our solution
- Building the CRI Cluster
- Monitoring and controlling
- Benchmarking
- Future plans
- References

Introduction
Environment / Climate / Weather
- Aeronautics and space exploration
- Energy research
- Virtual reality
- Scientific visualization
- Health sciences

Make observations -> Collect and process data -> Run forecast model -> Create products -> Provide to end users
Main issues
- Very large data sets
- Distributed data
- High processing requirements
- Need for real-time processing
- Coupled models

Our problems
- Data management
- Lisa
- NWP models: ARPS, MM5, HRM
- Climatological models: NCM
ARPS
Advanced Regional Prediction System
- Open source
- Parallel code
- Runs on all Unix systems:
  1. IBM RS/6000
  2. Cray C-90
  3. Cray T3D
  4. Cray J90
  5. CM-5
  6. Linux PC / workstation
ARPS Model Process Flow chart
[Flow chart of the ARPS processing chain:]
- ARPSTERN (terrain data preprocessor): indexed terrain elevation files (1 deg, 5 min, or 30 sec); arpstern.input
- EXT2ARPS (gridded data interpolator): user-supplied gridded data (e.g. OLAPS, NMC analysis); arps40.input
- ARPSSFC (surface characteristics data preprocessor): soil, vegetation type and other land-use data
- ARPSRETRV (Doppler radar data retrieval system): Doppler radar data
- ARPS Data Assimilation System / ARPS Analysis System: radar data; single-level data; rawinsondes, VAD and wind profilers
- ARPS (main model driver): arps40.input
- ARPSPLT (vector graphics post-processor): arpsplt.input
- ARPSCVT (history data format converter): arpscvt.input
- Other post-processing tools and visualization packages (Savi3D, AVS, etc.)
Climatological Models
Our solution
Memory: use bigger memory?
CPU: use a better CPU?
Cluster: scale both memory and CPU power
Building the Cluster
CRI Cluster
Prebuilt clusters? Why we build our own instead:
- Direct relation between the technology and the end user
- We can customize it for our users
- We obtain the technology ourselves
- Better use of it
- We can upgrade it
- Lower costs
- Sample clusters around the world

OU Cluster
Breakdown of nodes:
- 132 compute nodes (computing jobs)
- 8 storage nodes (Parallel Virtual File System)
- 2 head nodes (login, compile, debug, test)
- 1 management node (PVFS control, batch queue)
Each node:
- 2 Pentium 4 Xeon DP CPUs (2 GHz, 512 KB L2 cache)
- 2 GB RDRAM (400 MHz, 3.2 GB/sec)
- Myrinet-2000 adapter
Cluster Architecture
Cluster room
- Space
- Packing
- Power
- Air conditioning
- Easy repair
- Security
- Cabling

Linux
- True multitasking
- Virtual memory
- Shared libraries
- Demand loading
- Shared copy-on-write executables
- Proper memory management
- TCP/IP networking
- Up to 64 GB memory support on i386
- IP Virtual Server support:
  - virtual server via NAT
  - virtual server via tunneling
  - virtual server via direct routing
- VLAN
- Fast switching
- Bonding driver (see the sketch after this list)
- EQL
- Runs on 386/486-based PCs, ARM, DEC Alpha, Sun SPARC, Motorola 68000, MIPS, PowerPC, ...
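
The bonding driver above can be illustrated with a minimal sketch for a 2.4-era Red Hat-style node that aggregates two Fast Ethernet NICs into one logical link. The interface names, address and module options are illustrative assumptions, not the actual CRI configuration:

  # /etc/modules.conf (illustrative values)
  alias bond0 bonding
  options bond0 mode=0 miimon=100

  # bring up the bonded interface and enslave the two physical NICs
  /sbin/ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
  /sbin/ifenslave bond0 eth0 eth1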
Communication protocols
- Internet protocols
- Low-latency protocols:
  - Active Messages
  - Fast Messages
  - VMMC
  - U-net
  - BIP
TCP/IP problems for clustering
1. Latency, for small packets (a quick way to measure it is sketched below)
2. Bandwidth, for big packets
Protocol overhead - the data path from user process to NIC:
1) The user process prepares the data in user memory
2) It sends an interrupt (system call) to the OS
3) The data is copied into internal OS buffers
4) An interrupt triggers sending the data out
5) The data is sent to the NIC
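
The small-packet latency above is easy to observe even with ping between two nodes (the node name is a placeholder); dedicated tools such as netperf or NetPIPE give more precise numbers:

  # average round-trip time of 100 small ICMP packets across the cluster network
  ping -c 100 -s 64 node01 | tail -1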
Cluster computing standards
VIA (Virtual Interface Architecture)
- Combines ideas from many of these protocols
- Like U-net, it uses a virtual network interface
- Native and emulated implementations exist
- Even an emulated VIA implementation performs better than TCP/IP
- MPICH is available over VIA
InfiniBand
- Backed by Compaq, Dell, HP, IBM, Intel, Microsoft and Sun
- Replaces shared I/O with a high-speed, serial, channel-based, message-passing, scalable, switched fabric
- Uses HCAs and TCAs to connect to the channel
- Six transfer types: reliable and unreliable connections and datagrams, multicast connections, raw packets
- DMA support
- IPv6
Hardware products
- Ethernet, Fast Ethernet and Gigabit Ethernet
- Giganet (cLAN)
- Myrinet
- QsNet
- ServerNet
- SCI (Scalable Coherent Interface)
- ATM
- Fibre Channel
- HIPPI
- Reflective Memory
Installing and configuring
- Installing the server
- Building services
- Auto-installing clients (one common approach is sketched below)
- Auto-configuring clients
- Management of the nodes

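Automatic client installation can be done several ways; one common approach at the time was DHCP/PXE network boot plus a Red Hat kickstart file served over NFS. The addresses, file names and paths below are illustrative assumptions, not the actual CRI setup:

  # /etc/dhcpd.conf on the server (fragment)
  subnet 192.168.1.0 netmask 255.255.255.0 {
    range 192.168.1.100 192.168.1.200;
    next-server 192.168.1.1;        # TFTP/PXE boot server
    filename "pxelinux.0";          # network boot loader
  }

  # PXE kernel append line pointing the installer at a kickstart file on NFS
  # append ks=nfs:192.168.1.1:/export/ks/node.cfg initrd=initrd.img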
NIS configuration
On the server:
1) Specify the domain name:
# domainname <DOMAIN_NAME>
2) Put it in /etc/sysconfig/network:
NISDOMAIN=<DOMAIN_NAME>
3) Specify the server name in /etc/yp.conf:
domain <DOMAIN_NAME> server <SERVER_NAME>
4) Restart the daemons:
# /etc/rc.d/init.d/ypserv restart
# /etc/rc.d/init.d/ypbind restart
5) Enable them at boot (add them to the init scripts)
6) Edit /var/yp/Makefile:
MERGE_PASSWD=FALSE -> TRUE
MERGE_GROUP=FALSE -> TRUE
delete netgrp from the "all" target
7) Build the NIS database:
# /usr/lib/yp/ypinit -m
8) If you make any changes in the future, only run:
# cd /var/yp; make
NIS configuration
On the client:
1) Specify the domain name:
# domainname <DOMAIN_NAME>
2) Put it in /etc/sysconfig/network:
NISDOMAIN=<DOMAIN_NAME>
3) Specify the server name in /etc/yp.conf:
domain <DOMAIN_NAME> server <SERVER_NAME>
4) Restart the daemon:
# /etc/rc.d/init.d/ypbind restart
5) Enable it at boot
6) Test it by logging in with the server's users (or check the binding as below)
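
Before trying a real login, the client binding can be checked with the standard NIS tools:

  # which NIS server did this client bind to?
  ypwhich
  # are the shared password entries visible?
  ypcat passwd | head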
Monitoring and controlling
1) Scripts: Perl, Python, Bash (a minimal example follows this list)
2) Prebuilt tools: Webmin, Scyld, SCD
3) Hardware monitoring and control (ICE Box)
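
A script-based monitor (item 1) can be as small as a loop over the node list. The sketch below assumes password-less ssh and hypothetical host names node01..node16, and just collects load and memory figures:

  #!/bin/bash
  # poll every compute node for uptime/load and free memory
  for n in $(seq -w 1 16); do
      host="node$n"                # hypothetical node names
      echo "== $host =="
      ssh "$host" 'uptime; free -m | head -2'
  done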
ICE Box management hardware
- Monitors temperatures within nodes and remotely resets motherboards through internally placed probes
- SNMP compliant
- DHCP or static network configuration
- NIMP (Network ICE Management Protocol)
- SIMP (Serial ICE Management Protocol)
- Out-of-band serial data buffering
- Accessible with several protocols (NIMP, SIMP, null modem, Telnet, SNMP, ClusterWorX)
- Remote monitoring of CPU temperatures
- Remote power management
- Power sequencing to start up nodes
- Optional cabinet temperature monitoring (eight sensors per ICE Box)
- Node reset
- Multiple ICE Boxes scale to support large clusters
- Embedded CPU powered by Linux for a stable run-time environment
- Ability to easily and safely update the ICE Box operating system without cluster downtime
Security
- SSH (a minimal setup sketch follows)
- PAM
- Xinetd

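For SSH, the usual cluster setup is password-less logins for the users who launch parallel jobs; with home directories shared over NFS (an assumption here), it boils down to:

  # generate a key pair without a passphrase (run once per user)
  ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
  # authorize it; with shared home directories this covers every node at once
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  chmod 600 ~/.ssh/authorized_keys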
Running ARPS
- Fortran 77 compiler (GNU g77)
- Preprocessing the data
- BC and IC data from other models
- Post-processing tools (NCAR Graphics)
Running flowchart
- Preprocessing (always done once)
- Splitting
- Initializing
- Boundary conditions
- Running
- Joining
- Post-processing (on other computers)
(A sketch of this sequence on the cluster follows.)
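
Put together, a run of the message-passing ARPS on the cluster follows the flowchart above roughly as below. The tool and executable names (splitfiles, arps_mpi, joinfiles) differ between ARPS releases, so treat this only as a sketch of the split/run/join idea, not the exact CRI procedure:

  # 1. preprocessing (terrain, surface, initial/boundary data) - done once
  # 2. split the single-domain input/history files into per-processor pieces
  ./splitfiles < arps.input
  # 3. run the MPI version of the model, one process per subdomain (4 x 4 here)
  mpirun -np 16 ./arps_mpi < arps.input > arps.output
  # 4. join the per-processor history files back into a single domain
  ./joinfiles < arps.input
  # 5. post-processing and plotting (arpsplt, arpscvt) on another machine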
Parallel architecture of ARPS
Transform Tool
- Domain sizes: 800x400, 200x200, 800x800
Grid computing?
- Resolutions: 10 km, 3 km, 1 km (the cost of refining the grid is worked out below)
- 1. Big domain at low resolution -> coarse outer domain with better-resolution nests
- 2. In data assimilation, the code moves close to where the data is
AUI
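
Why those resolutions matter for cost: over a fixed area the number of horizontal grid points grows as 1/dx^2, and the stable time step shrinks roughly in proportion to dx, so the total work grows roughly as 1/dx^3. Going from 10 km to 1 km grid spacing over the same domain therefore needs about 100 times more points and on the order of 1000 times more computation, which is what pushes a single coarse domain toward nesting and a parallel cluster.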
Benchmarking
- ARPS results
- GMandel
- BPS
Performance Utilities
1. AIMS - instrumentors, monitoring library, and analysis tools
2. MPE logging library and Nupshot performance visualization tool
3. Pablo - monitoring library and analysis tools
4. Paradyn - dynamic instrumentation and run-time analysis tool
5. SvPablo - integrated instrumentor, monitoring library, and analysis tool
6. VAMPIRtrace monitoring library and VAMPIR performance visualization tool
7. VT - monitoring library and performance analysis and visualization tool for the IBM SP
ARPS performance
- Performance per CPU is better for larger domains per CPU, because the cluster network is the limiting factor: we need more computation per data transfer.
- A weak-scaling summary of the runs is given after the timing listings below.

Model situation
- 200 x 200 grid points per processor
- Prediction time = 60 s
- Output = NONE
- dtbig = 6 s
- 1 km x 1 km x 500 m grid cells
-- 200 x 200 per domain {200 x 200} - 1 CPU --
ARPS stopped normally in the main program.
The ending time was 60.000 seconds.
Thanks for using ARPS.

Process            CPU time used    Percentage
-----------------------------------------------
Initialization   : 0.760000E+01s      1.40%
Data output      : 0.829005E+01s      1.53%
Wind advection   : 0.190701E+02s      3.52%
Scalar advection : 0.397800E+02s      7.34%
Coriolis force   : 0.000000E+00s      0.00%
Buoyancy term    : 0.618995E+01s      1.14%
Small time steps : 0.241000E+03s     44.48%
Radiation        : 0.000000E+00s      0.00%
Soil model       : 0.000000E+00s      0.00%
Surface physics  : 0.000000E+00s      0.00%
Turbulence       : 0.874099E+02s     16.13%
Comput. mixing   : 0.352601E+02s      6.51%
Rayleigh damping : 0.271003E+01s      0.50%
TKE src terms    : 0.287300E+02s      5.30%
Bound.conditions : 0.220026E+00s      0.04%
Gridscale precp. : 0.000000E+00s      0.00%
Kuo cumulus      : 0.000000E+00s      0.00%
Kain-Fritsch     : 0.000000E+00s      0.00%
Warmrain microph : 0.452400E+02s      8.35%
Lin ice microph  : 0.000000E+00s      0.00%
NEM ice microph  : 0.000000E+00s      0.00%
Hydrometero fall : 0.000000E+00s      0.00%
Miscellaneous    : 0.169800E+02s      3.13%
Entire model     : 0.541820E+03s    100.00%
-- 200 x 200 per domain {400 x 200} - 2 CPUs --
ARPS stopped normally in the main program.
The ending time was 60.000 seconds.
Thanks for using ARPS.

Process            CPU time used    Percentage
-----------------------------------------------
Initialization   : 0.763000E+01s      1.41%
Data output      : 0.822997E+01s      1.52%
Wind advection   : 0.190600E+02s      3.52%
Scalar advection : 0.402001E+02s      7.42%
Coriolis force   : 0.000000E+00s      0.00%
Buoyancy term    : 0.615997E+01s      1.14%
Small time steps : 0.241520E+03s     44.56%
Radiation        : 0.000000E+00s      0.00%
Soil model       : 0.000000E+00s      0.00%
Surface physics  : 0.000000E+00s      0.00%
Turbulence       : 0.872100E+02s     16.09%
Comput. mixing   : 0.351900E+02s      6.49%
Rayleigh damping : 0.276001E+01s      0.51%
TKE src terms    : 0.285300E+02s      5.26%
Bound.conditions : 0.240047E+00s      0.04%
Gridscale precp. : 0.000000E+00s      0.00%
Kuo cumulus      : 0.000000E+00s      0.00%
Kain-Fritsch     : 0.000000E+00s      0.00%
Warmrain microph : 0.451199E+02s      8.32%
Lin ice microph  : 0.000000E+00s      0.00%
NEM ice microph  : 0.000000E+00s      0.00%
Hydrometero fall : 0.000000E+00s      0.00%
Miscellaneous    : 0.168399E+02s      3.11%
Entire model     : 0.542000E+03s    100.00%
-- 200 x 200 per domain {400 x 400} - 4 CPUs --
ARPS stopped normally in the main program.
The ending time was 60.000 seconds.
Thanks for using ARPS.

Process            CPU time used    Percentage
-----------------------------------------------
Initialization   : 0.762000E+01s      1.40%
Data output      : 0.827001E+01s      1.52%
Wind advection   : 0.191300E+02s      3.52%
Scalar advection : 0.404000E+02s      7.44%
Coriolis force   : 0.000000E+00s      0.00%
Buoyancy term    : 0.614000E+01s      1.13%
Small time steps : 0.241750E+03s     44.53%
Radiation        : 0.000000E+00s      0.00%
Soil model       : 0.000000E+00s      0.00%
Surface physics  : 0.000000E+00s      0.00%
Turbulence       : 0.874600E+02s     16.11%
Comput. mixing   : 0.351000E+02s      6.47%
Rayleigh damping : 0.273998E+01s      0.50%
TKE src terms    : 0.285099E+02s      5.25%
Bound.conditions : 0.249939E+00s      0.05%
Gridscale precp. : 0.000000E+00s      0.00%
Kuo cumulus      : 0.000000E+00s      0.00%
Kain-Fritsch     : 0.000000E+00s      0.00%
Warmrain microph : 0.451600E+02s      8.32%
Lin ice microph  : 0.000000E+00s      0.00%
NEM ice microph  : 0.000000E+00s      0.00%
Hydrometero fall : 0.000000E+00s      0.00%
Miscellaneous    : 0.169001E+02s      3.11%
Entire model     : 0.542850E+03s    100.00%
-- 200 x 200 per domain {800 x 400} - 8 CPUs --
ARPS stopped normally in the main program.
The ending time was 60.000 seconds.
Thanks for using ARPS.

Process            CPU time used    Percentage
-----------------------------------------------
Initialization   : 0.758000E+01s      1.39%
Data output      : 0.827006E+01s      1.52%
Wind advection   : 0.190499E+02s      3.50%
Scalar advection : 0.404402E+02s      7.44%
Coriolis force   : 0.000000E+00s      0.00%
Buoyancy term    : 0.619997E+01s      1.14%
Small time steps : 0.242260E+03s     44.57%
Radiation        : 0.000000E+00s      0.00%
Soil model       : 0.000000E+00s      0.00%
Surface physics  : 0.000000E+00s      0.00%
Turbulence       : 0.873999E+02s     16.08%
Comput. mixing   : 0.352699E+02s      6.49%
Rayleigh damping : 0.271999E+01s      0.50%
TKE src terms    : 0.286100E+02s      5.26%
Bound.conditions : 0.290039E+00s      0.05%
Gridscale precp. : 0.000000E+00s      0.00%
Kuo cumulus      : 0.000000E+00s      0.00%
Kain-Fritsch     : 0.000000E+00s      0.00%
Warmrain microph : 0.451000E+02s      8.30%
Lin ice microph  : 0.000000E+00s      0.00%
NEM ice microph  : 0.000000E+00s      0.00%
Hydrometero fall : 0.000000E+00s      0.00%
Miscellaneous    : 0.169199E+02s      3.11%
Entire model     : 0.543510E+03s    100.00%
-- 200 x 200 per domain {(200-3)*4+3 = 791, i.e. ~800 x 800 in total} - 16 CPUs --
(Adjacent subdomains share 3 overlap points, so the global dimension is 791 rather than a full 800.)
ARPS stopped normally in the main program.
The ending time was 60.000 seconds.
Thanks for using ARPS.

Process            CPU time used    Percentage
-----------------------------------------------
Initialization   : 0.762000E+01s      1.40%
Data output      : 0.820012E+01s      1.50%
Wind advection   : 0.191300E+02s      3.50%
Scalar advection : 0.403599E+02s      7.39%
Coriolis force   : 0.000000E+00s      0.00%
Buoyancy term    : 0.615000E+01s      1.13%
Small time steps : 0.243190E+03s     44.55%
Radiation        : 0.000000E+00s      0.00%
Soil model       : 0.000000E+00s      0.00%
Surface physics  : 0.000000E+00s      0.00%
Turbulence       : 0.880600E+02s     16.13%
Comput. mixing   : 0.354600E+02s      6.50%
Rayleigh damping : 0.276005E+01s      0.51%
TKE src terms    : 0.287300E+02s      5.26%
Bound.conditions : 0.309933E+00s      0.06%
Gridscale precp. : 0.000000E+00s      0.00%
Kuo cumulus      : 0.000000E+00s      0.00%
Kain-Fritsch     : 0.000000E+00s      0.00%
Warmrain microph : 0.455600E+02s      8.35%
Lin ice microph  : 0.000000E+00s      0.00%
NEM ice microph  : 0.000000E+00s      0.00%
Hydrometero fall : 0.000000E+00s      0.00%
Miscellaneous    : 0.169700E+02s      3.11%
Entire model     : 0.545870E+03s    100.00%
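
Weak-scaling summary of the five listings above (the per-CPU domain is fixed at 200 x 200, so the total problem size grows with the CPU count while the 60 s forecast stays the same):

  1 CPU : 541.82 s    2 CPUs: 542.00 s    4 CPUs: 542.85 s
  8 CPUs: 543.51 s   16 CPUs: 545.87 s

The time for the entire model is nearly constant, so the weak-scaling efficiency at 16 CPUs is about 541.82 / 545.87 = 99.3%; the communication overhead at this domain size per CPU is under one percent.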
GMandel-PVM benchmark
Calculating with: x1=-0.760416667, y1=-0.354166667, x2=-0.614583333, y2=-0.208333333, limit=1000000
  wall time = 97 secs, MFLOPS = 19556.6
Calculating with: x1=-2.000000000, y1=-2.000000000, x2=2.000000000, y2=2.000000000, limit=1000000
  wall time = 17 secs, MFLOPS = 19461.0
Future plans
- Add a VRML-based monitoring and controlling system
- Add scheduling for better use of the resources
- Build one packaged solution
- Extend it
- Grid computing
References
- ARPS documents
- High-Speed Networking, James P. G. Sterbenz and Joseph D. Touch (Wiley)
- Cluster Computing White Paper, Mark Baker (ed.), University of Portsmouth, UK
- Beowulf HOWTO
- www.beowulf.com
- www.myricom.com
- www.intel.com
- www.infinibandfd.com
- www.clustercomputing.com
Thank you
Hardware products
Fast Ethernet
- 100 Mbps
- CSMA/CD (Carrier Sense Multiple Access with Collision Detection)
HiPPI (High Performance Parallel Interface)
- copper-based, 800/1600 Mbps over 32/64-bit lines
- point-to-point channel
ATM (Asynchronous Transfer Mode)
- connection-oriented packet switching
- fixed-length cells (53 bytes)
- suitable for WANs
Hardware products
ServerNet
- 1 Gbps
- originally an interconnect for high-bandwidth I/O
Myrinet
- programmable microcontroller
- 1.28 Gbps
Memory Channel
- 800 Mbps
- virtual shared memory
- strict message ordering
Synfinity
- 12.8 Gbps
Link Parameters
Comparing products
Prices - Myricom
Low Cost, One Port
- 64-bit PCI-X and PCI
- Low-profile PCI short card
- 225 MHz RISC & 2 MB memory
- $795
- For applications requiring up to ~490 MB/s user-level bidirectional data rate
High End, Two Port
- 64-bit PCI-X and PCI
- Low-profile PCI short card
- 333 MHz RISC & 2 MB memory
- $1,195
- For applications requiring up to ~950 MB/s user-level bidirectional data rate
Prices - Myricom
Myrinet-2000 Switch Enclosures

Description                                                     Product Code   Price
2U high, 3-slot enclosure for switches up to 16 ports           M3-E16         $1,600
3U high, 5-slot enclosure for switch networks up to 32 ports    M3-E32         $3,200
5U high, 9-slot enclosure for switch networks up to 64 hosts    M3-E64         $6,400
9U high, 17-slot enclosure for switch networks up to 128 hosts
  (Clos network for 128 hosts with all fiber ports and
  monitoring capability)                                        M3-E128        $12,800
Prices - Dolphin
SCI adapter
- PMC-SCI Adapter Card (64-bit, 66 MHz PCI): $1,480
SCI switches
- 8 Port Expandable Modular BxBAR SCI Switch: $4,980
Active Messages (zero copy)
- Berkeley NOW project
- Short messages are asynchronous (based on request-reply)
- No buffering in the system buffers
- Messages transfer directly from user memory to user memory
- GAM (Generic Active Messages): one copy at the receiver side
Fast Messages
- University of Illinois
- Similar to AM
- Adds a transfer control mechanism
- A credit system is required to manage pinned memory
- Good for heterogeneous nodes
VMMC (Virtual Memory-Mapped Communication)
- Princeton SHRIMP project
- Sending a message = reads and writes on user memory
- Memory pages are mapped on both sides
- Uses hardware that lets the NIC snoop on memory writes and send them to the other side
- A type of DSM
U-net
- Cornell University
- Zero copy where possible
- A virtual interface for each connection
- Acts on demand
BIP (Basic Interface for Parallelism)
- University of Lyon
- Low-level message layer
- Higher-level message-passing layers (like MPICH) can be built on top of it
- BIP-SMP
- Uses different protocols for different message sizes (zero or more copies)
- Flow control