Designing Cluster Computers & High Performance Storage Systems


A Tutorial
Designing Cluster Computers
and
High Performance Storage Architectures
At
HPC ASIA 2002, Bangalore INDIA
December 16, 2002
By
Dheeraj Bhardwaj
N. Seetharama Krishna
Department of Computer Science &
Engineering
Indian Institute of Technology, Delhi INDIA
e-mail: [email protected]
http://www.cse.iitd.ac.in/~dheerajb
Centre for Development of Advanced
Computing
Pune University Campus, Pune INDIA
e-mail: [email protected]
http://www.cdacindia.com
Acknowledgments
• All the contributors of LINUX
• All the contributors of Cluster Technology
• All the contributors in the art and science of
parallel computing
• Department of Computer Science & Engineering,
IIT Delhi
• Centre for Development of Advanced Computing,
(C-DAC) and collaborators
Disclaimer
• The information and examples provided are based on a
  Red Hat Linux 7.2 installation on Intel PC platforms
  (our specific hardware specifications)
• Much of it should be applicable to other versions of
Linux
• There is no warranty that the materials are error free
• Authors will not be held responsible for any direct,
indirect, special, incidental or consequential damages
related to any use of these materials
Part – I
Designing Cluster Computers
Outline
• Introduction
• Classification of Parallel Computers
• Introduction to Clusters
• Classification of Clusters
• Cluster Components Issues
  – Hardware
  – Interconnection Network
  – System Software
• Design and Build a Cluster Computer
  – Principles of Cluster Design
  – Cluster Building Blocks
  – Networking Under Linux
  – PBS
  – PVFS
  – Single System Image
Outline
• Tools for Installation and Management
  – Issues related to installation, configuration, monitoring
    and management
  – NPACI Rocks
  – OSCAR
  – EU DataGrid WP4 Project
  – Other Tools – Scyld Beowulf, OpenMosix, Cplant, SCore
• HPC Applications and Parallel Programming
  – HPC Applications
  – Issues related to parallel programming
  – Parallel Algorithms
  – Parallel Programming Paradigms
  – Parallel Programming Models
  – Message Passing
  – Applications I/O and Parallel File Systems
  – Performance metrics of Parallel Systems
• Conclusion
Introduction
What do We Want to Achieve ?
• Develop High Performance Computing (HPC)
  Infrastructure which is
  • Scalable (Parallel → MPP → Grid)
  • User Friendly
  • Based on Open Source
  • Efficient in Problem Solving
  • Able to Achieve High Performance
  • Able to Handle Large Data Volumes
  • Cost Effective
• Develop HPC Applications which are
  • Portable (Desktop → Supercomputers → Grid)
  • Future Proof
  • Grid Ready
Who Uses HPC ?
• Scientific & Engineering Applications
• Simulation of physical phenomena
• Virtual Prototyping (Modeling)
• Data analysis
• Business/ Industry Applications
• Data warehousing for financial sectors
• E-governance
• Medical Imaging
• Web servers, Search Engines, Digital libraries
• … etc.
• All face similar problems
• Not enough computational resources
• Remote facilities – Network becomes the bottleneck
• Heterogeneous and fast changing systems
HPC Applications
• Three Types
• High-Capacity – Grand Challenge Applications
• Throughput – Running hundreds/thousands of jobs, doing
  parameter studies, statistical analysis, etc.
• Data – Genome analysis, Particle Physics, Astronomical
observations, Seismic data processing etc
• We are seeing a Fundamental Change in HPC Applications
• They have become multidisciplinary
• Require an incredible mix of various technologies and expertise
Why Parallel Computing ?
• What if your application requires more computing power
  than a sequential computer can provide?
  – You might suggest improving the operating speed of the processor
    and other components
  – We do not disagree with your suggestion, BUT how far can you go?
• We always have the desire and the prospect of greater
  performance
Parallel Computing is the right answer
Serial and Parallel Computing
A parallel computer is a “Collection of processing
elements that communicate and co-operate to solve
large problems fast”.
SERIAL COMPUTING:    Fetch/Store, Compute
PARALLEL COMPUTING:  Fetch/Store, Compute/Communicate, Cooperative game
Classification of Parallel Computers
Classification of Parallel Computers
Flynn Classification: Number of Instructions & Data Streams
SISD – Conventional
SIMD – Data Parallel, Vector Computing
MISD – Systolic Arrays
MIMD – very general, multiple approaches
MIMD Architecture: Classification
Current focus is on MIMD model, using general
purpose processors or multicomputers.
MIMD
  Non-shared memory: MPP, Clusters
  Shared memory
    Uniform memory access: PVP, SMP
    Non-uniform memory access: CC-NUMA, NUMA, COMA
MIMD: Shared Memory Architecture
[Figure: Processors 1-3 connected over memory buses to a Global Memory]
Source PE writes data to Global Memory & destination retrieves it
 Easy to build
 Limitation: reliability & expandability. A memory component or
  any processor failure affects the whole system.
 Increasing the number of processors leads to memory contention.
Ex.: Silicon Graphics supercomputers
MIMD: Distributed Memory Architecture
[Figure: Processors 1-3, each with its local Memory 1-3, connected by a
High Speed Interconnection Network]
 Inter Process Communication using High Speed Network.
 Network can be configured to various topologies e.g. Tree, Mesh, Cube..
 Unlike Shared MIMD
 easily/ readily expandable
 Highly reliable (any CPU failure does not affect the whole system)
MIMD Features
 MIMD architecture is more general purpose
 MIMD needs careful use of synchronization, via
  message passing, to prevent race conditions
 Designing efficient message passing algorithms is
  hard because the data must be distributed in a way
  that minimizes communication traffic
 Cost of message passing is very high
Shared Memory (Address-Space) Architecture

Non-Uniform memory access (NUMA) shared address
space computer with local and global memories

Time to access a remote memory bank is longer
than the time to access a local word

Shared address space computers have a local cache at
each processor to increase their effective processor bandwidth

The cache can also be used to provide fast access to
remotely-located shared data

Mechanisms developed for handling cache coherence
problem
Shared Memory (Address-Space) Architecture
[Figure: processors (P) with local memories (M) connected through an
Interconnection Network to global memories (M)]

Non-uniform memory access (NUMA) shared-address-space computer
with local and global memories
Shared Memory (Address-Space) Architecture
[Figure: processors (P) with local memories (M) connected through an
Interconnection Network]

Non-uniform memory access (NUMA) shared-address-space computer
with local memory only
Shared Memory (Address-Space) Architecture

Provides hardware support for read and write access
by all processors to a shared address space.

Processors interact by modifying data objects stored
in a shared address space.

MIMD shared-address-space computers are referred to as
multiprocessors

Uniform memory access (UMA) shared address
space computer with local and global memories

Time taken by processor to access any memory
word in the system is identical
Shared Memory (Address-Space) Architecture
[Figure: processors (P) connected through an Interconnection Network to
shared memories (M)]

Uniform Memory Access (UMA) shared-address-space computer
Uniform Memory Access (UMA)
• Parallel Vector Processors (PVPs)
• Symmetric Multiple Processors (SMPs)
Parallel Vector Processor
VP : Vector Processor
SM : Shared memory
Parallel Vector Processor
 Works well only for vector codes
 Scalar codes may not perform well
 Need to completely rethink and re-express algorithms
  so that vector instructions are performed almost
  exclusively
 Special purpose hardware is necessary
 Fastest systems are no longer vector uniprocessors
Parallel Vector Processor

Small number of powerful custom-designed vector
processors used

Each processor is capable of at least 1 Giga flop/s
performance

A custom-designed, high bandwidth crossbar switch
networks these vector processors.

Most machines do not use caches, rather they use a
large number of vector registers and an instruction
buffer
Examples : Cray C-90, Cray T-90, Cray T-3D …
Symmetric Multiprocessors (SMPs)
P/C : Microprocessor and cache
SM : Shared memory
Symmetric Multiprocessors (SMPs)
Symmetric Multiprocessors (SMPs) characteristics

Uses commodity microprocessors with on-chip and
off-chip caches.

Processors are connected to a shared memory through
a high-speed snoopy bus

On Some SMPs, a crossbar switch is used in addition
to the bus.

 Scalable up to:
  4-8 processors (non-backplane based)
  a few tens of processors (backplane based)
Symmetric Multiprocessors (SMPs)
Symmetric Multiprocessors (SMPs) characteristics

All processors see same image of all system
resources

Equal priority for all processors (except for master
or boot CPU)

Memory coherency maintained by HW

Multiple I/O Buses for greater Input / Output
Symmetric Multiprocessors (SMPs)
[Figure: four processors, each with an L1 cache, connected through a DIR
controller to Memory and through an I/O bridge to the I/O bus]
Symmetric Multiprocessors (SMPs)
Issues

Bus based architecture:
   Inadequate beyond 8-16 processors
Crossbar based architecture:
   Multistage approach considering the I/Os required
    in hardware
   Clock distribution and HF design issues for
    backplanes

Limitation is mainly caused by using a centralized
shared memory and a bus or crossbar interconnect,
which are both difficult to scale once built.
Commercial Symmetric Multiprocessors (SMPs)

Sun Ultra Enterprise 10000 (high end,
expandable up to 64 processors), Sun Fire

DEC Alpha server 8400

HP 9000

SGI Origin

IBM RS 6000

IBM P690, P630

Intel Xeon, Itanium, IA-64(McKinley)
Symmetric Multiprocessors (SMPs)

Heavily used in commercial applications (data
bases, on-line transaction systems)

System is symmetric (every processor has equal
access to the shared memory, the I/O
devices, and the operating system).

Being symmetric, a higher degree of parallelism
can be achieved.
Massively Parallel Processors (MPPs)
P/C : Microprocessor and cache; LM : Local memory;
NIC : Network interface circuitry; MB : Memory bus
Massively Parallel Processors (MPPs)

Commodity microprocessors in processing nodes

Physically distributed memory over processing
nodes

High-bandwidth, low-latency interconnect
(a high-speed, proprietary
communication network)

Tightly coupled network interface which is
connected to the memory bus of a processing node
Massively Parallel Processors (MPPs)

Provide proprietary communication software to
realize the high performance

Processors are interconnected by a high-speed
memory bus to a local memory and a
network interface circuitry (NIC)

Scaled up to hundreds or even thousands of
processors

Each process has its private address space;
processes interact by passing messages
Massively Parallel Processors (MPPs)

MPPs support asynchronous MIMD modes
 MPPs support single system image at different
levels
 Microkernel operating system on compute nodes
 Provide high-speed I/O system
 Example : Cray – T3D, T3E, Intel Paragon,
IBM SP2
Cluster ?
clus·ter n.
1. A group of the same or similar elements
gathered or occurring closely together; a bunch:
“She held out her hand, a small tight cluster of
fingers” (Anne Tyler).
2. Linguistics. Two or more successive consonants
in a word, as cl and st in the word cluster.
A Cluster is a type of parallel or distributed processing
system, which consists of a collection of interconnected
stand alone/complete computers cooperatively working
together as a single, integrated computing resource.
Cluster System Architecture
[Figure: layered cluster architecture – Programming Environment (Java, C,
Fortran, MPI, PVM); Web, Windows and other subsystem (Database, OLTP)
user interfaces; Single System Image Infrastructure; Availability
Infrastructure; an OS on each Node; Interconnect]
Clusters ?
A set of
 Nodes physically connected over commodity/
proprietary network
 Gluing Software
 Other than this definition no official standard exists
 Depends on the user requirements
   Commercial
   Academic
   Good way to sell old wine in a new bottle
   Budget
   Etc.

Designing clusters is not obvious but a critical issue.
Why Clusters NOW?
• Clusters gained momentum when three technologies
converged:
– Very high performance microprocessors
• workstation performance = yesterday's supercomputers
– High speed communication
– Standard tools for parallel/ distributed computing
& their growing popularity
• Time to market => performance
• Internet services: huge demands for scalable,
available, dedicated internet servers
– big I/O, big computing power
How should we Design them ?
 Components
• Should they be off-the-shelf and low cost?
• Should they be specially built?
• Is a mixture a possibility?
 Structure
• Should each node be in a different box
(workstation)?
• Should everything be in a box?
• Should everything be in a chip?
 Kind of nodes
• Should it be homogeneous?
• Can it be heterogeneous?
What Should it offer ?
 Identity
• Should each node maintain its identity
(and owner)?
• Should it be a pool of nodes?
 Availability
• How far should it go?
 Single-system Image
• How far should it go?
Place for Clusters in HPC world ?
[Figure: distance between nodes – SM parallel computing: a chip, a box;
cluster computing: a room, a building; Grid computing: the world]
Source: Toni Cortes ([email protected])
Where Do Clusters Fit?
1 TF/s delivered … 15 TF/s delivered

Distributed systems
• Gather (unused) resources
• System SW manages resources
• System SW adds value
• 10% - 20% overhead is OK
• Resources drive applications
• Time to completion is not critical
• Time-shared
• Commercial: PopularPower, United Devices, Centrata, ProcessTree,
  Applied Meta, etc.

MP systems
• Bounded set of resources
• Apps grow to consume all cycles
• Application manages resources
• System SW gets in the way
• 5% overhead is maximum
• Apps drive purchase of equipment
• Real-time constraints
• Space-shared

Src: B. Maccabe, UNM; R. Pennington, NCSA
Top 500 Supercomputers
Rank  Computer / Procs                                   Peak performance   Country / Year
1     Earth Simulator (NEC) / 5120                       40960 GF           Japan / 2002
2     ASCI Q (HP) AlphaServer SC ES45 1.25 GHz / 4096    10240 GF           LANL, USA / 2002
3     ASCI Q (HP) AlphaServer SC ES45 1.25 GHz / 4096    10240 GF           LANL, USA / 2002
4     ASCI White (IBM) SP Power3 375 MHz / 8192          12288 GF           LLNL, USA / 2000
5     MCR Linux Cluster Xeon 2.4 GHz – Quadrics / 2304   11060 GF           LLNL, USA / 2002

• From www.top500.org
What makes the Clusters ?
• The same hardware used for
– Distributed computing
– Cluster computing
– Grid computing
• Software converts hardware into a cluster
  – Ties everything together
Task Distribution
• The hardware is responsible for
– High-performance
– High-availability
– Scalability (network)
• The software is responsible for
– Gluing the hardware
– Single-system image
– Scalability
– High-availability
– High-performance
Classification of
Cluster Computers
Clusters Classification 1
• Based on Focus (in Market)
– High performance (HP) clusters
• Grand challenging applications
– High availability (HA) clusters
• Mission critical applications
• Web/e-mail
• Search engines
HA Clusters
Clusters Classification 2
• Based on Workstation/PC Ownership
– Dedicated clusters
– Non-dedicated clusters
• Adaptive parallel computing
• Can be used for CPU cycle stealing
Clusters Classification 3
• Based on Node Architecture
– Clusters of PCs (CoPs)
– Clusters of Workstations (COWs)
– Clusters of SMPs (CLUMPs)
Clusters Classification 4
• Based on Node Components Architecture
& Configuration:
– Homogeneous clusters
• All nodes have similar configuration
– Heterogeneous clusters
• Nodes based on different processors and running
different OS
Clusters Classification 5
• Based on Node OS Type..
– Linux Clusters (Beowulf)
– Solaris Clusters (Berkeley NOW)
– NT Clusters (HPVM)
– AIX Clusters (IBM SP2)
– SCO/Compaq Clusters (Unixware)
– … Digital VMS Clusters, HP clusters, …
Clusters Classification 6
• Based on Levels of Clustering:
– Group clusters (# nodes: 2-99)
• A set of dedicated/non-dedicated computers -- mainly connected by a SAN like Myrinet
– Departmental clusters (# nodes: 99-999)
– Organizational clusters (# nodes: many 100s)
– Internet-wide clusters = Global clusters
(# nodes: 1000s to many millions)
• Computational Grid
Clustering Evolution

[Figure: complexity and cost vs. time, 1990-2005 – 1st Gen.: MPP
Supercomputers; 2nd Gen.: Beowulf Clusters; 3rd Gen.: Commercial Grade
Clusters; 4th Gen.: Network Transparent Clusters]
Cluster Components
– Hardware
– System Software
Hardware
Nodes
• The idea is to use standard off-the-shelf
processors
– Pentium-class Intel, AMD
– Sun
– HP
– IBM
– SGI
• No special development for clusters
Interconnection Network
Interconnection Network
• One of the key points in clusters
• Technical objectives
– High bandwidth
– Low latency
– Reliability
– Scalability
Network Design Issues
• Plenty of work has been done to improve
networks for clusters
• Main design issues
– Physical layer
– Routing
– Switching
– Error detection and correction
– Collective operations
Physical Layer
• Trade-off between
– Raw data transfer rate and cable cost
• Bit width
– Serial mediums (*Ethernet, Fiber Channel)
• Moderate bandwidth
– 64-bit wide cable (HIPPI)
• Pin count limits the implementation of
switches
– 8-bit wide cable (Myrinet, ServerNet)
• Good compromise
Routing
• Source-path
– The entire path is attached to the message at its
source location
– Each switch deletes the current head of the
path
• Table-based routing
– The header only contains the destination node
– Each switch has a table to help in the decision
Switching
• Packet switching
– Packets are buffered in the switch before being resent
• Implies an upper-bound packet size
• Needs buffers in the switch
– Used by traditional LAN/WAN networks
• Wormhole switching
– Data is immediately forwarded to the next stage
• Low latency
• No buffers are needed
• Error correction is more difficult
– Used by SANs such as Myrinet, PARAMNet
Error Detection
• It has to be done at hardware level
– Performance reasons
– i.e. CRC checking is done by the network
interface
• Networks are very reliable
– Only erroneous messages should see overhead
Collective Operations
• These operations are mainly
– Barrier
– Multicast
• Few interconnects offer this characteristic
– Synfinity is a good example
• Normally offered by software
– Easy to achieve in bus-based like Ethernet
– Difficult to achieve in point-to-point like Myrinet
Examples of Network
• The most common networks used are
– *-Ethernet
– SCI
– Myrinet
– PARAMNet
– HIPPI
– ATM
– Fiber Channel
– AmpNet
– Etc.
*-Ethernet
• Most widely used for LAN
– Affordable
– Serial transmission
– Packet switching and table-based routing
• Types of Ethernet
– Ethernet and Fast Ethernet
• Based on collision domain (Buses)
• Switched hubs can make different collision
domains
– Gigabit Ethernet
• Based on high-speed point-to-point switches
• Each node is in its own collision domain
ATM
• Standard designed for telecommunication
industry
– Relatively expensive
– Serial
– Packet switching and table-based routing
– Designed around the concept of fixed-size
  packets
• Special characteristics
– Well designed for real-time systems
Scalable Coherent Interface (SCI)
• First standard specially designed for CC
– Low layer
• Point-to-point architecture but maintains bus functionality
• Packet switching and table-based routing
• Split transactions
• Dolphin Interconnect Solutions, Sun SPARC Sbus
– High Layer
• Defines a distributed cache-coherent scheme
• Allows transparent shared memory programming
• Sequent NUMA-Q, Data general AViiON NUMA
PARAMNet & Myrinet
• Low-latency and High-bandwidth network
– Characteristics
• Byte-wise links
• Wormhole switching and source-path routing
– Low-latency cut-through routing switches
• Automatic mapping, which favors fault tolerance
• Zero-copying is not possible
– Programmable on-board processor
• Allows experimentation with new protocols
Communication Protocols
• Traditional protocols
– TCP and UDP
• Specially designed
– Active messages
– VMMC
– BIP
– VIA
– Etc.
Data Transfer
• User-level lightweight communication
– Avoid OS calls
– Avoid data copying
– Examples
• Fast messages, BIP, ...
• Kernel-level lightweight
communication
– Simplified protocols
– Avoid data copying
– Examples
• GAMMA, PM, ...
TCP and UDP
• First messaging libraries used
– TCP is reliable
– UDP is not reliable
• Advantages
– Standard and well known
• Disadvantages
– Too much overhead (especially for fast
  networks)
• Plenty OS interaction
• Many copies
Active Messages
• Low-latency communication library
• Main issues
– Zero-copying protocol
• Messages copied directly
– to/from the network
– to/from the user-address space
• Receiver memory has to be pinned
– There is no need of a receive operation
VMMC
• Virtual-Memory Mapped Communication
– View messages as read and writes to memory
• Similar to distributed shared memory
– Makes a correspondence between
• A virtual page at the receiving side
• A virtual page at the sending side
BIP
• Basic Interface for Parallelism
– Low-level message-layer for Myrinet
– Uses various protocols for various message
sizes
– Tries to achieve zero copies (one at most)
• Used via MPI by programmers
– 7.6us latency
– 107 Mbytes/s bandwidth
VIA
• Virtual Interface Architecture
– First standard promoted by the industry
– Combines the best features of academic projects
• Interface
– Designed to be used by programmers directly
– Many programmers believe it to be too low level
• Higher-level APIs are expected
• NICs with VIA implemented in hardware
– This is the proposed path
Potential and Limitations
• High bandwidth
– Can be achieved at low cost
• Low latency
– Can be achieved, but at high cost
– The lower the latency is the closer to a
traditional supercomputer we get
• Reliability
– Can be achieved at low cost
• Scalability
– Easy to achieve for the size of clusters
System Software
–Operating system vs. middleware
–Processor management
–Memory management
–I/O management
–Single-system image
–Monitoring clusters
–High Availability
–Potential and limitations
Operating system vs. Middleware
• Operating system
– Hardware-control layer
• Middleware
– Gluing layer
• The barrier is not always
clear
• Similar
  – User level
  – Kernel level

[Figure: the operating system and middleware layers]
System Software
• We will not distinguish between
– Operating system
– Middleware
• The middleware related to the operating system
• Objectives
– Performance/Scalability
– Robustness
– Single-system image
– Extendibility
– Scalability
– Heterogeneity
Processor Management
• Schedule jobs onto the nodes
– Scheduling policies should take into account
• Needed vs. available resources
– Processors
– Memory
– I/O requirements
• Execution-time limits
• Priorities
– Different kind of jobs
• Sequential and parallel jobs
Load Balancing
• Problem
– A perfect static balance is not possible
• Execution time of jobs is unknown
– Unbalanced systems may not be efficient
• Solution
– Process migration
• Prior to execution
– Granularity must be small
• During execution
– Cost must be evaluated
Fault Tolerance
• Large cluster must be fault tolerant
– The probability of a fault is quite high
• Solution
– Re-execution of applications in the failed node
• Not always possible or acceptable
– Checkpointing and migration
• It may have a high overhead
• Difficult with some kind of applications
– Applications that modify the environment
• Transactional behavior may be a solution
Managing Heterogeneous Systems
• Compatible nodes but different characteristics
– It becomes a load balancing problem
• Non compatible nodes
– Binaries for each kind of node are needed
– Shared data has to be in a compatible format
– Migration becomes nearly impossible
Scheduling Systems
• Kernel level
– Very few take care of cluster scheduling
• High-level applications do the scheduling
– Distribute the work
– Migrate processes
– Balance the load
– Interact with the users
– Examples
  • CODINE, CONDOR, NQS, etc.
Memory Management
• Objective
– Use all the memory available in the cluster
• Basic approaches
– Software distributed-shared memory
• General purpose
– Specific usage of idle remote memory
• Specific purpose
– Remote memory paging
– File-system caches or RAMdisks (described later)
Software Distributed Shared Memory
• Software layer
– Allows applications running on different nodes to share memory
regions
– Relatively transparent to the programmer
• Address-space structure
– Single address space
• Completely transparent to the programmer
– Shared areas
• Applications have to mark a given region as shared
• Not completely transparent
• Approach mostly used due to its simplicity
Main Data Problems to be Solved
• Data consistency vs. Performance
– A strict semantic is very inefficient
• Current systems offer relaxed semantics
• Data location (finding the data)
– The most common solution is the owner node
• This node may be fixed or vary dynamically
• Granularity
– Usually a fixed block size is implemented
• Hardware MMU restrictions
• Leads to “false sharing”
• Variable granularity being studied
Other Problems to be Solved
• Synchronization
– Test-and-set-like mechanisms cannot be used
– SDSM systems have to offer new mechanisms
• i.e. semaphores (message passing implementation)
• Fault tolerance
– Very important and very seldom implemented
– Multiple copies
• Heterogeneity
– Different page sizes
– Different data-type implementations
• Use tags
Remote-Memory Paging
• Keep swapped-out pages in idle memory
– Assumptions
• Many workstations are idle
• Disks are much slower than Remote memory
– Idea
• Send swapped-out pages to idle workstations
• When no remote memory space then use disks
• Replicate copies to increase fault tolerance
• Examples
– The global memory service (GMS)
– Remote memory pager
I/O Management
• Advances very closely to parallel I/O
– There are two major differences
• Network latency
• Heterogeneity
• Interesting issues
– Network configurations
– Data distribution
– Name resolution
– Memory to increase I/O performance
Network Configurations
• Device location
– Attached to nodes
• Very easy to have (use the disks in the nodes)
– Network attached devices
• I/O bandwidth is not limited by memory bandwidth
• Number of networks
– Only one network for everything
– One special network for I/O traffic (SAN)
• Becoming very popular
Data Distribution
• Distribution per files
– Each node has its own independent file system
• Like in distributed file systems (NFS, Andrew, CODA, ...)
– Each node keeps a set of files locally
• It allows remote access to its files
– Performance
• Maximum performance = device performance
• Parallel access only to different files
• Remote files depends on the network
– Caches help but increase complexity (coherence)
– Tolerance
• File replication in different nodes
Data Distribution
• Distribution per blocks
– Also known as Software/Parallel RAIDs
• xFS, Zebra, RAMA, ...
– Blocks are interleaved among all disks
– Performance
• Parallel access to blocks in the same file
• Parallel access to different files
• Requires a fast network
– Usually solved with a SAN
• Especially good for large requests (multimedia)
– Fault tolerance
• RAID levels (3, 4 and 5)
Name Resolution
• Same as in distributed systems
– Mounting remote file systems
• Useful when the distribution is per files
– Distributed name resolution
• Useful when the distribution is per files
– Returns the node where the file resides
• Useful when the distribution is per blocks
– Returns the node where the file’s meta-data is located
Caching
• Caching can be done at multiple levels
– Disk controller
– Disk servers
– Client nodes
– I/O libraries
– etc.
• Good to have several levels of cache
• High levels decrease hit ratio of low levels
– Higher level caches absorb most of the locality
Cooperative Caching
• Problem of traditional caches
– Each node caches the data it needs
• Plenty of replication
• Memory space not well used
• Increase the coordination of the caches
– Clients know what other clients are caching
• Clients can access cached data in remote nodes
– Replication in the cache is reduced
• Better use of the memory
RAMdisks
• Assumptions
– Disks are slow and memory+network is fast
– Disks are persistent and memory is not
• Build “disk” unifying idle remote RAM
– Only used for non-persistent data
• Temporary data
– Useful in many applications
• Compilations
• Web proxies
• ...
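As a purely local illustration of a RAM-backed file system for temporary data (not the remote-RAM research systems described above), Linux offers tmpfs; the mount point and size below are hypothetical:

    # Hypothetical sketch: RAM-backed scratch space (contents are lost on reboot)
    mkdir -p /mnt/ramdisk
    mount -t tmpfs -o size=256m tmpfs /mnt/ramdisk   # 256 MB held in RAM
    df -h /mnt/ramdisk                               # verify the mount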
Single-System Image
• SSI offers the idea that the cluster is a single machine
• It can be done a several levels
– Hardware
• Hardware DSM
– System software
• It can offer a unified view to applications
– Application
• It can offer a unified view to the user
• All SSI have a boundary
Key Services of SSI
• Main services offered by SSI
– Single point of entry
– Single file hierarchy
– Single I/O space
– Single point of management
– Single virtual networking
– Single job/resource management system
– Single process space
– Single user interface
• Not all of them are always available
Monitoring Clusters
• Clusters need tools to be monitored
– Administrators have many things to check
– The cluster must be visible from a single point
• Subjects of monitoring
– Physical environment
• Temperature, power, ..
– Logical services
• RPCs, NFS, ...
– Performance meters
• Paging, CPU load, ...
Monitoring Heterogeneous Clusters
• Monitoring is especially necessary in heterogeneous
clusters
– Several node types
– Several operating systems
• The tool should hide the differences
– The real characteristics are only needed to solve some
problems
• Very related to Single-System Image
Auto-Administration
• Monitors know how to make self diagnosis
• Next step is to run corrective procedures
– Some systems start to do so (NetSaint)
– Difficult because tools do not have common sense
• This step is necessary
– Many nodes
– Many devices
– Many possible problems
– High probability of error
High Availability
• One of the key points for clusters
– Especially needed for commercial applications
• 7 days a week and 24 hours a day
– Not necessarily very scalable (32 nodes)
• Based on many issues already described
– Single-system image
• Hide any possible change in the configuration
– Monitoring tools
• Detect the errors to be able to correct them
– Process migration
• Restart/continue applications in running nodes
Design and Build a Cluster Computer
Cluster Design
• Clusters are good as personal supercomputers
• Clusters are not often good as general purpose
multi-user production machines
• Building such a cluster requires planning and
understanding design tradeoffs
Scalable Cluster Design Principles
• Principle of Independence
• Principle of Balanced Design
• Principle of Design for Scalability
• Principle of Latency Hiding
Principle of independence
• Components (hardware & Software) of the system
should be independent of one another
• Incremental scaling - Scaling up a system along one
dimension by improving one component,
independent of others
• For example – upgrade the processor to the next
  generation; the system should operate at higher
  performance without upgrading other components.
• Should enable heterogeneity scalability
Principle of independence
• The components' independence can result in cost
  cutting
• The component becomes a commodity, with the following
features
– Open architecture with standard interfaces to the rest of the
system
– Off-the-shelf product; Public domain
– Multiple vendor in the open market with large volume
– Relatively mature
– For all these reasons – the commodity component has low cost,
high availability and reliability
Principle of independence
• Independence principle and application examples
– The algorithm should be independent of the architecture
– The application should be independent of platform
– The programming language should be independent of the
machine
– The language should be modular and have orthogonal features
– The node should be independent of the network, and the
network interface should be independent of the network
topology
• Caveat
– In any parallel system, there is usually some key
component/technique that is novel
– We cannot build an efficient system by simply scaling up one or
few components
• Design should be balanced
Principle of Balanced Design
• Minimize any performance bottleneck
• Should avoid an unbalanced system design, where a slow
  component degrades the performance of the entire system
• Should avoid single point of failure
• Example –
  – The PetaFLOP project – the memory requirement for a wide range
    of scientific/engineering applications
    • Memory (GB) ≈ Speed^(3/4), with Speed in Gflop/s
    • For 1 Pflop/s = 10^6 Gflop/s this gives about 31,600 GB, so roughly
      30 TB of memory is appropriate for a Pflop/s machine.
Principle of Design for Scalability
• Provision must be made so that System can either scale
up to provide higher performance
• Or scale down to allow affordability or greater cost-effectiveness
• Two approaches
– Overdesign
– Example – Modern processors support 64-bit address space.
This huge address may not be fully utilized by Unix supporting
32-bit address space. This overdesign will create much easier
transition of OS from 32-bit to 64-bit
– Backward compatibility
– Example – A parallel program designed to run on n nodes
should be able to run on a single node, may be with a reduced
input data.
Principle of Latency Hiding
• Future scalable systems are most likely to use a
  distributed shared-memory architecture.
• Access to remote memory may experience long
  latencies
– Example – GRID
• Scalable multiprocessors clusters must rely on use
of
– Latency hiding
– Latency avoiding
– Latency reduction
Cluster Building
• Conventional wisdom: Building a cluster is easy
– Recipe:
• Buy hardware from Computer Shop
• Install Linux, Connect them via network
• Configure NFS, NIS
• Install your application, run and be happy
• Building it right is a little more difficult
– Multi user cluster, security, performance tools
– Basic question - what works reliably?
• Building it to be compatible with Grid
– Compilers, libraries
– Accounts, file storage, reproducibility
• Hardware configuration may be an issue
How do people think of parallel programming
and using clusters …..
Panther Cluster
• Picked 8 PCs and named
  them after the panther family:
  Cheeta, Tiger, Cat, Kitten,
  Jaguar, Leopard, Lion, Panther
• Connected them by a
  network and set up this
  cluster in a small lab.
• Using the Panther Cluster
  – Select a PC & log in
  – Edit and compile the code
  – Execute the program
  – Analyze the results
Panther Cluster - Programming
• Explicit parallel programming
  isn't easy. You really have to do
  it yourself
• Network bandwidth and
  latency matter
• There are good reasons for
  security patches
• Oops… Lion does not have
  floating point
Panther Cluster Attacks Users
• Grad Students wanted to
use the cool cluster. They
each need only half (a half
other than Lion)
• Grad Students discover that
using the same PC at the
same time is incredibly bad
• A solution would be to use
parts of the cluster
exclusively for one job at a
time.
• And so….
We Discover Scheduling
• We tried
– A sign-up sheet
– Yelling across the yard
– A mailing list
– 'finger schedule'
– …
– A scheduler
[Figure: a queue of jobs (Job 1, Job 2, Job 3) feeding the eight cluster nodes]
Panther Expands
• Panther expands, adding
more users and more
systems
• Use the Panther node for
  – Login
  – File services
  – Scheduling services
  – …
• All the other PCs are compute nodes

[Figure: Panther as the head node in front of the original eight machines
plus new compute nodes PC1-PC9]
Evolution of Cluster Services
The cluster grows

[Figure: as the cluster grows, the Login, File service, Scheduling,
Management and I/O services are replicated across dedicated nodes]

Basic goal: Improve computing performance
            Improve system reliability and manageability
Compute @ Panther
• Usage Model
– Login to the "login" node
– Compile and test code
– Schedule a test run
– Schedule a serious run
– Carry out I/O through the I/O node
• Management Model
  – The compute nodes are identical
  – Users use the Login, I/O and
    compute nodes
  – All I/O requests are managed by
    the metadata server

[Figure: dedicated Login, File, Scheduling, Management and I/O nodes
in front of the compute nodes]
Cluster Building Block
Building Blocks - Hardware
• Processor
– Complex Instruction Set Computer (CISC)
• x86, Pentium Pro, Pentium II, III, IV
– Reduced Instruction Set Computer (RISC)
• SPARC, RS6000, PA-RISC, PPC, Power PC
– Explicitly Parallel Instruction Computer (EPIC)
• IA-64 (McKinley), Itanium
Building Blocks - Hardware
• Memory
– Extended Data Out (EDO)
• pipelining by loading next call to or from memory
• 50 - 60 ns
– DRAM and SDRAM
• Dynamic Access and Synchronous (no pairs)
• 13 ns
– PC100 and PC133
• 7ns and less
Building Blocks - Hardware
• Cache
– L1 - 4 ns
– L2 - 5 ns
– L3 (off the chip) - 30 ns
• Celeron
– 0 – 512KB
• Intel Xeon chips
– 512 KB - 2MB L2 Cache
• Intel Itanium
– 512KB -
• Most processors have at least 256 KB
Building Blocks - Hardware
• Disks and I/O
– IDE and EIDE
• IBM 75 GB 7200 rpm disk w/ 2MB onboard cache
– SCSI I, II, III and SCA
• 5400, 7400, and 10000 rpm
• 20 MB/s, 40 MB/s, 80 MB/s, 160 MB/s
• Can chain from 6-15 disks
– RAID Sets
• software and hardware
• best for dealing with parallel I/O
• reserved cache for before flushing to disks
Building Blocks - Hardware
• System Bus
– ISA
  • 5 MHz - 13 MHz
– 32-bit PCI
  • 33 MHz
  • 133 MB/s
– 64-bit PCI
  • 66 MHz
  • 266 MB/s
Building Blocks - Hardware
• Network Interface Cards (NICs)
– Ethernet - 10 Mbps, 100 Mbps, 1 Gbps
– ATM - 155 Mbps and higher
• Quality of Service (QoS)
– Scalable Coherent Interface (SCI)
• 12 microseconds latency
– Myrinet - 1.28 Gbps
• 120 MB/s
• 5 microseconds latency
– PARAMNet – 2.5 Gbps
Building Blocks – Operating System
• Solaris - Sun
• AIX - IBM
• HPUX - HP
• IRIX - SGI
• Linux - everyone!
– Is architecture independent
• Windows NT/2000
Building Blocks - Compilers
• Commercial
• Portland Group Incorporated (PGI)
– C, C++, F77, F90
– Not as expensive as vendor specific and
compile most applications
• GNU
– gcc, g++, g77, vast f90
– free!
Building Blocks - Scheduler
• Cron, at (NT/2000)
• Condor
– IBM Loadleveler
– LSF
• Portable Batch System (PBS) – see the sample script below
• Maui Scheduler
• GLOBUS
• All free, run on more than one OS!
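A minimal PBS batch script, as a sketch only; the job name, resource request and program name (a.out) are assumptions, not values from this tutorial:

    #!/bin/sh
    # sample.pbs - minimal PBS job script (hypothetical resources and paths)
    #PBS -N sample_job            # job name
    #PBS -l nodes=4:ppn=1         # request 4 nodes, 1 processor per node
    #PBS -l walltime=00:30:00     # 30 minutes of wall-clock time
    #PBS -j oe                    # merge stdout and stderr into one file

    cd $PBS_O_WORKDIR             # run from the submission directory
    echo "Running on nodes:"; cat $PBS_NODEFILE
    ./a.out                       # the application to run (assumed name)

The script would be submitted with qsub sample.pbs and monitored with qstat.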
Building Blocks – Message Passing
• Commercial and free
• Naturally Parallel, Highly Parallel
• Condor
– High Throughput Computing (HTC)
• Parallel Virtual Machine PVM
– Oak Ridge National Laboratory
• Message Passing Interface (MPI) – see the example below
  – MPICH from ANL
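A sketch of compiling and launching an MPI program with MPICH; the source file, machine file and process count are hypothetical:

    # Compile an MPI program with the MPICH wrapper compiler
    mpicc -O2 -o hello hello.c            # hello.c: a hypothetical MPI source file

    # Run 4 processes across the hosts listed in a machine file
    # (machines.txt would contain one hostname per line, e.g. star1 ... star4)
    mpirun -np 4 -machinefile machines.txt ./hello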
Building Blocks – Debugging and Analysis
• Parallel Debuggers
– TotalView
• GUI based
• Performance Analysis Tools
– Monitoring library calls and runtime analysis
– AIMS, MPE, Pablo
– Paradyn - from Wisconsin
– SvPablo, Vampir, Dimemas, Paraver
Building Block – Other
• Cluster Administration Tools
• Cluster Monitoring Tools
These tools are part of the Single System Image aspects
Scalability of Parallel Processors
[Figure: Performance vs. number of Processors for an SMP, a Cluster of
Uniprocessors, and a Cluster of SMPs]
Installing the Operating System
• Which package ?
• Which Services ?
• Do I need a graphical environment ?
Identifying the hardware bottlenecks
• Is my hardware optimal ?
• Can I improve my hardware choices ?
• How can I identify where the problem is?
• Common hardware bottlenecks !!
Benchmarks
• Synthetic Benchmarks
– Bonnie
– Stream
– NetPerf
– NetPipe
• Applications Benchmarks
– High Performance Linpack
– NAS
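A sketch of how two of these benchmarks are typically invoked; the host names, process counts and binary locations below are assumptions:

    # Network throughput between two nodes with NetPerf:
    # start the daemon once on the remote node (e.g. star1), then from a client:
    netserver                      # on star1
    netperf -H star1               # measure TCP stream throughput to star1

    # High Performance Linpack (HPL), built against MPI and a BLAS library:
    # run from the directory containing a tuned HPL.dat input file
    mpirun -np 4 ./xhpl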
Networking under Linux
Network Terminology Overview
•
IP address: the unique machine address on the net (e.g.,
128.169.92.195)
•
netmask: determines which portion of the IP address
specifies the subnetwork number, and which portion
specifies the host on that subnet (e.g., 255.255.255.0)
• network address: IP address masked bitwise-ANDed
with the netmask (e.g.,128.169.92.0)
•
broadcast address: network address ORed with the
negation of the netmask (128.169.92.255)
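A small worked example of these definitions, using the IP and netmask quoted above; the shell arithmetic is just an illustration, not a tool used in the tutorial:

    # network   = IP AND netmask          -> 128.169.92.0
    # broadcast = network OR (NOT netmask) -> 128.169.92.255
    ip=128.169.92.195 mask=255.255.255.0
    IFS=. read -r i1 i2 i3 i4 <<EOF
    $ip
    EOF
    IFS=. read -r m1 m2 m3 m4 <<EOF
    $mask
    EOF
    echo "network:   $((i1 & m1)).$((i2 & m2)).$((i3 & m3)).$((i4 & m4))"
    echo "broadcast: $((i1 | (255 - m1))).$((i2 | (255 - m2))).$((i3 | (255 - m3))).$((i4 | (255 - m4)))"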
Network Terminology Overview
• gateway address: the address of the gateway
machine that lives on two different networks and
routes packets between them
•
name server address: the address of the name server
that translates host names into IP addresses
A Cluster Network
[Figure: server "galaxy" (192.168.1.100), NIS domain name = workshop,
reachable on an outside IP (get in using SSH only), connected through a
switch to clients on private IPs (RSH allowed): star1 (192.168.1.1),
star2 (192.168.1.2), star3 (192.168.1.3)]
Network Configuration
IP Address
• Three private IP address ranges –
  – 10.0.0.0 to 10.255.255.255; 172.16.0.0 to 172.31.255.255; 192.168.0.0
    to 192.168.255.255
  – Information on private intranets is available in RFC 1918
• Warning: Should not use the network address 10.0.0.0 or 172.16.0.0
  or 192.168.0.0 for the server
• Netmask – 255.255.255.0 should be sufficient for most
clusters
Network Configuration
DHCP: Dynamic Host Configuration Protocol
• Advantages
– You can simplify network setup
• Disadvantages
– It is a centralized solution (is it scalable?)
– IP addresses are linked to ethernet address, and that can
be a problem if you change the NIC or want to change
the hostname routinely
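If DHCP is used anyway, a minimal /etc/dhcpd.conf sketch for a private cluster subnet might look as follows, assuming the ISC DHCP server; the MAC address shown is hypothetical:

    # /etc/dhcpd.conf (sketch) - fixed addresses for cluster nodes
    subnet 192.168.1.0 netmask 255.255.255.0 {
        option routers 192.168.1.100;              # the server/gateway node
        option domain-name-servers 192.168.1.100;

        host star1 {
            hardware ethernet 00:A0:24:AA:BB:01;   # hypothetical NIC MAC address
            fixed-address 192.168.1.1;             # star1 always gets this IP
        }
    }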
Network Configuration Files
• /etc/resolv.conf -- configures the name resolver
specifying the following fields:
• search (a list of alternate domain names to search for
a hostname)
• nameserver (IP addresses of DNS used for name
resolutions)
search cse.iitd.ac.in
nameserver 128.169.93.2
nameserver 128.169.201.2
Network Configuration Files
• /etc/hosts -- contains a list of IP addresses and their
corresponding hostnames. Used for faster name
resolution process (no need to query the domain name
server to get the IP address)
127.0.0.1        localhost   localhost.localdomain
128.169.92.195   galaxy      galaxy.cse.iitd.ac.in
192.168.1.100    galaxy      galaxy
192.168.1.1      star1       star1
• /etc/host.conf -- specifies the order of queries to resolve
host names Example:
order hosts, bind   # check /etc/hosts first and then the DNS
multi on            # allow a host to have multiple IP addresses
Host-specific Configuration Files
• /etc/conf.modules -- specifies the list of modules
(drivers) that have to be loaded by the kerneld (see
/lib/modules for a full list)
– alias eth0 tulip
• /etc/HOSTNAME - specifies your system hostname:
galaxy1.cse.iitd.ac.in
• /etc/sysconfig/network -- specifies a gateway host,
gateway device
– NETWORKING=yes
HOSTNAME=galaxy.cse.iitd.ac.in
GATEWAY=128.169.92.1
GATEWAYDEV=eth0
NISDOMAIN=workshop
Configure Ethernet Interface
• Loadable ethernet drivers – Loadable modules are pieces of object codes that can be loaded
into a running kernel. It allows Linux to add device drivers to a
running Linux system in real time. The loadable Ethernet
drivers are found in the /lib/modules/release/net directory
– /sbin/insmod tulip
– GET A COMPATIBLE NIC!!!!!!
• ifconfig command assigns TCP/IP configuration values
to network interfaces
– ifconfig eth0 128.169.95.112 netmask 255.255.0.0 broadcast
  128.169.255.255
– ifconfig eth0 down
• To set the default gateway
– route add default gw 128.169.92.1 eth0
• GUI ;
– system -> control panel
Troubleshooting
• Some useful commands:
– lsmod -- shows information about all loaded modules
– insmod -- installs a loadable module in the running kernel (e.g.,
insmod tulip)
– rmmod -- unloads loadable modules from the running kernel
(e.g., rmmod tulip)
– ifconfig -- sets up and maintains the kernel-resident network
interfaces (e.g., ifconfig eth0, ifconfig eth0 up)
– route -- shows / manipulates the IP routing table
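A typical check sequence using these commands; the driver module name and addresses simply follow the tulip/galaxy examples above:

    lsmod | grep tulip          # is the Ethernet driver module loaded?
    insmod tulip                # if not, load it
    ifconfig eth0               # is the interface up with the expected address?
    route -n                    # is there a default route via the gateway?
    ping -c 3 192.168.1.100     # can we reach the server (galaxy)?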
Network File System (NFS)
NFS
• NFS has a number of characteristics:
– makes sharing of files over a network possible ( and transparent to
the users)
– allows keeping data files consuming large amounts of disk space on a
single server (/usr/local, etc....)
– allows keeping all user accounts on one host (/home)
– causes security risks
– slows down performance
• How NFS works:
– a server specifies which directory can be mounted from which host
(in the /etc/exports file)
– a client mounts a directory from a remote host on a local directory
(the mountd mount daemon on a server verifies the permissions)
– when someone accesses a file over NFS, the kernel places an RPC
(Remote Procedure Call) to the nfsd NFS daemon on the server
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
159
NFS Configuration
• On both client and server:
• Make sure your kernel has NFS support compiled in by running
– cat /proc/filesystems
• The following daemons must be running on each host (check by
running ps aux or run setup -> System Services, switch on the
daemons, and reboot):
– /etc/rc.d/init.d/portmap (or rpc.portmap)
– /usr/sbin/rpc.mountd
– /usr/sbin/rpc.nfsd
• Check that mountd and nfsd are running properly by running
/usr/sbin/rpcinfo -p
• If you experience problems, try restarting the portmap, mountd, and nfsd
daemons in sequence
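As a sketch, on Red Hat 7.2 this check-and-restart can typically be done with the init scripts (script names may differ on other distributions):
  /usr/sbin/rpcinfo -p                  # check which RPC services are registered
  /etc/rc.d/init.d/portmap restart      # portmapper first
  /etc/rc.d/init.d/nfs restart          # restarts rpc.mountd and rpc.nfsd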
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
160
NFS Server Setup
• On an NFS server:
– Edit the /etc/exports file to specify a directory to be mounted, host(s)
allowed to mount, and the permissions (e.g., /home
client.cs.utk.edu(rw) gives client.cs.utk.edu host read/write access to
/home)
– Change the permissions of this file to world readable :
• chmod a+r /etc/exports
– Run /usr/sbin/exportfs so that mountd and nfsd re-read the /etc/exports
file (or run /etc/rc.d/init.d/nfs restart)
  /home        star1(rw)
  /usr/export  star2(ro)
  /home        galaxy10.xx.iitd.ac.in(rw)
  /mnt/cdrom   galaxy13.xx.iitd.ac.in(ro)
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
161
NFS Client Setup
• On an NFS client:
– To mount a remote directory use the command:
mount -t nfs remote_host:remote_dir local_dir
where it is assumed that local_dir exists
Example:
mount -t nfs galaxy1:/usr/local /opt
– To make the system mount an nfs file system upon boot, edit the
/etc/fstab file like this:
server_host:dir_on_server local_dir nfs rw,auto 0 0
Example:
galaxy1:/home /home nfs defaults 0 0
– To unmount the file system: umount /local_dir
• Note: Check with the df command (or cat /etc/mtab) to see
mounted/unmounted directories.
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
162
NFS: Security Issues
• To forbid suid programs to work off the NFS file system,
specify nosuid option in /etc/fstab
• To deny the root user on the client to access and change
files that only root on the server can access or change,
specify root_squash option in /etc/exports; to grant client
root access to a filesystem use no_root_squash
• To prevent the access to NFS files on a server without
privileges specify portmap: ALL in /etc/hosts.deny
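Putting these options together, a minimal sketch (the host names are the examples used earlier in this tutorial):
  # /etc/exports on the server
  /home   star1(rw,root_squash)
  # /etc/fstab on the client
  galaxy1:/home   /home   nfs   rw,nosuid   0 0
  # /etc/hosts.deny on the server
  portmap: ALL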
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
163
Network Information Service (NIS)
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
164
NIS
• NIS, Network Information Service, is a service that provides
information (e.g., login names/passwords, home directories, group
information), that has to be known throughout the network. This
information can be authentication, such as passwd and group files, or it
can be informational, such as hosts files.
• NIS was formerly called Yellow Pages (YP), thus yp is used as a prefix
on most NIS-related commands
• NIS is based on RPC (remote procedure call)
• NIS keeps database information in maps (e.g., hosts.byname,
hosts.byaddr) located in /var/yp/nis.domain/
• The NIS domain is a collection of all hosts that share part of their
system configuration data through NIS
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
165
How NIS Works
• Within an NIS domain there must be at least one machine acting as a
NIS server. The NIS server maintains the NIS database - a collection of
DBM files generated from the /etc/passwd, /etc/group, /etc/hosts and
other files. If a user has his/her password entry in the NIS password
database, he/she will be able to login on all the machines on the
network which have the NIS client programs running.
• When there is a NIS request, the NIS client makes an RPC call across
the network to the NIS server it was configured for. To make such a call,
the portmap daemon should be running on both client and server.
• You may have several so-called slave NIS servers that have copies of the
NIS database from the master NIS server. Having NIS Slave servers is
useful when there are a lot of users on the server or in case the master
server goes down. In this case, a NIS client tries to connect to the fastest
responding active server.
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
166
Authentication Order
• /etc/nsswitch.conf (client):
  passwd: nis files
  group:  nis files
  shadow: nis files
  hosts:  nis files dns
• /etc/nsswitch.conf (server):
  passwd: files nis
  group:  files nis
  shadow: files nis
  hosts:  files nis dns
• The /etc/nsswitch.conf file is used by the C library (glibc) to determine in what order to query for information.
• The syntax is: service: <method> <method> …
• Each method is queried in order until one returns the requested information. In the client example above, the NIS server is queried first for password information and, if that fails, the local /etc/passwd file is checked. If no answer is found once all methods have been queried, an error is returned.
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
167
NIS Client Setup
• Set up the name of your NIS domain: /sbin/domainname nis.domain
• The ypbind daemon (under /etc/rc.d/init.d; switch it on from the setup command) handles the client's requests for NIS information.
• ypbind needs a configuration file, /etc/yp.conf, that contains information on how the client is supposed to behave (broadcast the request or send it directly to a server).
• The /etc/yp.conf file can use three different options:
  – domain NISDOMAIN server HOSTNAME
  – domain NISDOMAIN broadcast (default)
  – ypserver HOSTNAME
• Create a directory /var/yp (if one does not exist)
• Start up /etc/rc.d/init.d/portmap
• Run /usr/sbin/rpcinfo -p localhost and rpcinfo -u localhost ypbind to check if ypbind was able to register its service with the portmapper
• Specify the NISDOMAIN in /etc/sysconfig/network:
  – NISDOMAIN=workshop
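As a sketch, binding the clients of this tutorial's cluster directly to its server could look like this in /etc/yp.conf (the domain and host names are the examples used earlier):
  domain workshop server galaxy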
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
168
NIS Server Setup
• /var/yp contains one directory for each domain that the server can respond to. It also has the Makefile used to update the database maps.
• Generate the NIS database: /usr/lib/yp/ypinit -m
• To update a map (e.g., after creating a new user account), run make in the /var/yp directory
• The Makefile controls which database maps are built for distribution and how they are built.
• Make sure that the following daemons are running:
  – portmap, ypserv, yppasswdd
• You may switch these daemons on from setup -> System Services. They are located under /etc/rc.d/init.d.
• Run /usr/sbin/rpcinfo -p localhost and rpcinfo -u localhost ypserv to check if ypserv was able to register its service with the portmapper.
• To verify that the NIS services work properly, run ypcat passwd or ypmatch userid passwd
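A compact sketch of the same steps on the command line (the domain name is the example used in this tutorial):
  /sbin/domainname workshop
  /etc/rc.d/init.d/portmap start
  /etc/rc.d/init.d/ypserv start
  /usr/lib/yp/ypinit -m          # build the initial maps
  cd /var/yp && make             # rebuild maps after changing /etc/passwd, /etc/group, ...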
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
169
Working Cluster
[Diagram: a working cluster. The server (galaxy, 192.168.1.100) has an outside IP reachable via SSH only; NIS domain name = workshop. A switch connects it to the clients star1 (192.168.1.1), star2 (192.168.1.2) and star3 (192.168.1.3), which use private IPs and allow RSH.]
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
170
Kernel ??
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
171
Kernel Concepts
• The kernel acts as a mediator between your programs and the hardware:
  – Manages the memory for all running processes
  – Makes sure that the processes share CPU cycles appropriately
  – Provides an interface for the programs to the hardware
• Check the config.in file (normally under /usr/src/linux/arch/i386) to find out what kind of hardware the kernel supports.
• You can enable kernel support for your hardware or programs (network, NFS, modem, sound, video, etc.) in two ways:
  – by linking the pieces of kernel code directly into the kernel (in this case they are "burnt into" the kernel, which increases its size): monolithic kernel
  – by loading those pieces as modules (drivers) upon request, which is a more flexible and preferable way: modular kernel
• Why upgrade?
  – To support more/newer types of hardware
  – Improved process management
  – Faster, more stable, bug fixes
  – More secure
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
172
Why Use Kernel Modules?
• Easier to test: no need to reboot the system to load and unload a driver
• Less memory usage: modular kernels are smaller than monolithic kernels. Memory used by the kernel is never swapped out, so unused drivers compiled into your kernel waste RAM.
• One single boot image: no need to build different boot images for different hardware
• If you have a diskless machine you cannot use a modular kernel, since the modules (drivers) are normally stored on the hard disk.
• The following drivers cannot be modularized:
  – the driver for the hard disk where your root filesystem resides
  – the root filesystem driver
  – the binary format loader for init, kerneld and other programs
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
173
Which module to load?
• The kerneld daemon allows kernel modules (device drivers, network
drivers, filesystems) to be loaded automatically when they are
needed, rather than doing it manually via insmod or modprobe.
• When there is a request for a certain device, service, or protocol, the
kernel sends this request to the kerneld daemon. The daemon
determines what module should be loaded by scanning the
configuration file /etc/conf.modules.
• You may create the /etc/conf.modules file by running:
– /sbin/modprobe -c | grep -v '^path' > /etc/conf.modules
• /sbin/modprobe -c : get a listing of the modules that kerneld knows about
• /sbin/lsmod : lists all currently loaded modules
• /sbin/insmod : installs a loadable module in the running kernel
• /sbin/rmmod : unloads loadable modules
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
174
Configuration and Installation Steps
• 1. Check the /usr/src/linux directory on the system. If it is empty, you need to get the kernel package (e.g., ftp.kernel.org/pub/linux/kernel) and unpack it:
  – $ cd kernel_src_tree
  – $ tar zpxf linux-version.tar.gz
• 2. Configure the kernel: 'make config' or 'make xconfig'
• 3. Build the correct dependencies: 'make dep'
• 4. Clean stale object files: 'make clean' or 'make mrproper'
• 5. Compile the kernel image: 'make bzImage' or 'make zImage'
• 6. Install the kernel: 'make bzlilo', or re-install LILO by:
  – making a backup of the old kernel /vmlinuz
  – copying /usr/src/linux/arch/i386/boot/bzImage to /vmlinuz
  – running lilo: $ /sbin/lilo
• 7. Compile the kernel modules: 'make modules'
• 8. Install the kernel modules: 'make modules_install'
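As a single sketch of the whole sequence for a modular 2.4-series kernel (the install path and kernel name are illustrative, not prescriptive):
  cd /usr/src/linux
  make xconfig                          # select drivers as modules where possible
  make dep clean bzImage modules
  make modules_install
  cp arch/i386/boot/bzImage /boot/vmlinuz-custom
  # add an image entry for /boot/vmlinuz-custom to /etc/lilo.conf, then:
  /sbin/lilo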
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
175
PBS: Portable Batch System
(www.OpenPBS.org)
Why PBS?
PBS Architecture
Installation and Configuration
Server Configuration
Configuration Files
PBS Daemons
PBS User Commands
PBS System Commands
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
176
PBS
• Developed by Veridian MRJ for NASA
• POSIX 1003.2d Batch Environment Standard Compliant
• Supports nearly all UNIX-like platforms (Linux, SP2, Origin2000)
This figure has been taken from http://www.extremelinux.org/activities/usenix99/docs/pbs/sld007.htm
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
177
PBS Components
• PBS consists of four major components:
– commands -used to submit, monitor, modify, and delete jobs
– job server (pbs_server) - provides the basic batch services
(receiving/creating a batch job, modifying the job, protecting the job
against system crashes, and running the job). All commands and the
other daemons communicate with the server.
– job executor, or MOM, mother of all the processes (pbs_mom) - the
daemon that places the job into execution, and later returns the job’s
output to the user
– job scheduler (pbs_sched) controls which job is to be run and where
and when it is run based on the implemented policies. It also
communicates with MOMs to get information about system
resources and with the server daemon to learn about the availability
of jobs to execute.
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
178
PBS Installation and Configuration
• Package Configuration: run the configuration script
• Package Installation:
  – Compile the PBS modules: "make"
  – Install the PBS modules: "make install"
• Server Configuration:
  – Create a node description file: $PBS_HOME/server_priv/nodes
  – One time only: "pbs_server -t create"
  – Configure the server by running "qmgr"
• Scheduler Configuration:
  – Edit the $PBS_HOME/sched_priv/sched_config file to define scheduling policies (first-in first-out / round-robin / load balancing)
  – Start the scheduler program: "pbs_sched"
• Client Configuration:
  – Edit the $PBS_HOME/mom_priv/config file: MOM's config file
  – Start the execution server: "pbs_mom"
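A minimal qmgr sketch for the server configuration step (the queue name workq is just an example, not a required name):
  qmgr -c "create queue workq queue_type=execution"
  qmgr -c "set queue workq enabled=true"
  qmgr -c "set queue workq started=true"
  qmgr -c "set server default_queue=workq"
  qmgr -c "set server scheduling=true"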
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
179
Server & Client Configuration Files
• To configure the server, edit the file $PBS_HOME/server_priv/nodes:
  # nodes that this server will run jobs on
  #node      properties
  galaxy10   even
  galaxy11   odd
  galaxy12   even
  galaxy13   odd
• To configure the MOM, edit the file $PBS_HOME/mom_priv/config:
  $clienthost galaxy10
  $clienthost galaxy11
  $clienthost galaxy12
  $clienthost galaxy13
  $restricted *.cs.iitd.ac.in
• See the PBS Administrator's Guide for details.
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
180
PBS Daemons
batch server (pbs_server)    MOM (pbs_mom)    scheduler (pbs_sched)
• To autostart the PBS daemons at boot time, add the appropriate lines to the file /etc/rc.d/rc.local:
  – /usr/local/pbs/sbin/pbs_server   # on the PBS server
  – /usr/local/pbs/sbin/pbs_sched    # on the PBS server
  – /usr/local/pbs/sbin/pbs_mom      # on each PBS MOM
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
181
PBS User Commands
• qsub job_script : submit the job_script file to PBS
• qsub -I : submit an interactive-batch job
• qstat -a : list all jobs on the system. It provides the following job information:
  – the job ID of each job
  – the name of the submission script
  – the number of nodes required/used by each job
  – the queue that the job was submitted to
• qdel job_id : delete a PBS job
• qhold job_id : put a job on hold
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
182
PBS System Commands
• Note: only the manager can use these commands:
• qmgr : PBS batch system manager
• qrun : used to force a batch server to initiate the execution of a batch job. The job is run regardless of scheduling position, resource requirements, or state
• pbsnodes : PBS node manipulation
• pbsnodes -a : lists all nodes with their attributes
• xpbs : graphical user interface to the PBS commands
• xpbs -admin : run xpbs in administrative mode
• xpbsmon : GUI for displaying and monitoring the nodes/execution hosts under PBS
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
183
Typical Usage
$ qsub -lnodes=4 -lwalltime=1:00:00 script.sh
where script.sh may look like this:
  #!/bin/sh
  mpirun -np 4 matmul
• Sample script that can be used for MPICH jobs:
  #!/bin/bash
  #PBS -j oe
  #PBS -l nodes=2,walltime=01:00:00
  cd $PBS_O_WORKDIR
  /usr/local/mpich/bin/mpirun \
      -np `cat $PBS_NODEFILE | wc -l` \
      -leave_pg \
      -machinefile $PBS_NODEFILE \
      $PBS_O_WORKDIR/matmul
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
184
Installation of PVFS: Parallel Virtual File System
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
185
PVFS - Objectives
• To meet the need for a parallel file system for Linux clusters
• Provide high bandwidth for concurrent read/write operations from multiple processes or threads to a common file
• No change in the working of common UNIX shell commands like ls, cp, rm
• Support for multiple APIs: a native PVFS API, the UNIX/POSIX API, the MPI-IO API
• Scalable
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
186
PVFS - Features
• Clusterwide consistent name space
• User-controlled striping of data across disks on different I/O nodes
• Existing binaries operate on PVFS files without the
need for recompiling
• User level implementation : no kernel modifications
are necessary for it to function properly
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
187
Design Features
• Manager Daemon with
Metadata
• I/O Daemon
• Client Library
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
188
Metadata
• Metadata: information describing the characteristics of a file, e.g., owner and group, permissions, and the physical distribution of the file
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
189
Installing Metadata Server
[root@head /]# mkdir /pvfs-meta
[root@head /]# cd /pvfs-meta
[root@head /pvfs-meta]# /usr/local/bin/mkmgrconf
This script will make the .iodtab and .pvfsdir files
in the metadata directory of a PVFS file system.
Enter the root directory:
/pvfs-meta
Enter the user id of directory:
root
Enter the group id of directory:
root
Enter the mode of the root directory:
777
Enter the hostname that will run the manager:
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
190
Installing Metadata Server
…Continue
head
Searching for host...success
Enter the port number on the host for manager:
(Port number 3000 is the default)
3000
Enter the I/O nodes: (can use form node1, node2,
... or
nodename#-#,#,#)
node0-7
Searching for hosts...success
I/O nodes: node0 node1 node2 node3 node4 node5
node6 node7
Enter the port number for the iods:
(Port number 7000 is the default)
7000
Done!
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
191
The Manager
• Applications communicate directly with the PVFS manager (via TCP) when opening/creating/closing/removing files
• The manager returns the location of the I/O nodes on which the file data is stored
• Issue: presentation of the directory hierarchy of PVFS files to application processes
  1. NFS. Drawbacks: NFS has to be mounted on all nodes in the cluster, and the default caching of NFS caused problems with metadata operations
  2. System calls related to directory access: a mapping routine detects directory accesses and redirects the operations to the PVFS manager
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
192
I/O Daemon
• Multiple servers in a client-server system, running on separate I/O nodes in the cluster
• PVFS files are striped across the disks on the I/O nodes
• Handles all file I/O
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
193
Configuring the I/O servers
• On each of the machines that will act as I/O servers:
• Create a directory that will be used to store PVFS file data and set the appropriate permissions on that directory:
  [root@node0 /]# mkdir /pvfs-data
  [root@node0 /]# chmod 700 /pvfs-data
  [root@node0 /]# chown nobody.nobody /pvfs-data
• Create a configuration file for the I/O daemon on each of your I/O servers:
  [root@node0 /]# cp /usr/src/pvfs/system/iod.conf /etc/iod.conf
  [root@node0 /]# ls -al /etc/iod.conf
  -rwxr-xr-x   1 root   root   57 Dec 17 11:22 /etc/iod.conf
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
194
Starting the PVFS daemons
PVFS manager handles access control to files on PVFS file
systems.
[root@head /root]# /usr/local/sbin/mgr
I/O daemon handles actual file reads and writes for PVFS.
[root@node0 /root]# /usr/local/sbin/iod
To verify that one of the iods or the mgr is running correctly on
your system, you may check it by running the iod-ping or mgr-ping utilities, respectively.
e.g.
[root@head /root]# /usr/local/bin/iod-ping -h node0
node0:7000 is responding.
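To actually use the file system from client nodes, PVFS 1.x also ships a kernel module, a client-side daemon and a mount helper; a rough sketch, assuming they were built and installed under the default paths (the mount point /mnt/pvfs is just an example):
  [root@node1 /root]# /sbin/insmod pvfs               # load the PVFS kernel module
  [root@node1 /root]# /usr/local/sbin/pvfsd           # start the client-side daemon
  [root@node1 /root]# /sbin/mount.pvfs head:/pvfs-meta /mnt/pvfs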
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
195
Client Library
• Libpvfs : Links clients to PVFS servers
• Hides details of PVFS access from application tasks
• Provides “partitioned-file interface” : noncontiguous
file regions can be accessed with a single function call
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
196
Client Library (contd.)
• User specification of the file partition using a special ioctl call
• Parameters:
  – offset : how far into the file the partition begins, relative to the first byte of the file
  – gsize : size of the simple strided region to be accessed
  – stride : distance between the start of two consecutive regions
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
197
Trapping Unix I/O Calls
• Applications call functions
through the C library
• PVFS wrappers determine the
type of file on which the operation
is to be performed
• If the file is a PVFS file, the PVFS
I/O library is used to handle the
function, else parameters are
passed to the actual kernel
• Limitation : Restricted portability
to new architectures and operating
system – a new module that can be
mounted like NFS enables
traversal and accessibility of the
existing binaries
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
198
We have built a Standard Beowulf
[Diagram: the standard Beowulf built so far – compute nodes with PVFS connected by an interconnection network; the middleware layer for SSI is still missing.]
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
199
Linux Clustering Today
Design criteria
• Superior scalability
• High availability
• Single system manageability
• Best price/performance
[Chart, availability vs. scalability: clusters running multiple OSs for distributed computing (LifeKeeper, Sun Clusters, ServiceGuard) have scalability and manageability issues; single-OS, single-system-image machines – commodity SMP or UP, OPS/XPS (for DB only), large SMP, NUMA, HPC – have availability issues; SSI clusters target both.]
Source - Bruce Walker: www.opensource.compaq.com
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
200
What is Single System Image (SSI)?
Co-operating OS Kernels providing transparent
access to all OS resources cluster-wide, using a
single namespace
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
201
Benefits of Single System Image
• Usage of system resources transparently
• Improved reliability and higher availability
• Simplified system management
• Reduction in the risk of operator errors
• User need not be aware of the underlying system
architecture to use these machines effectively
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
202
SSI vs. Scalability
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
203
Components of SSI Services
• Single Entry Point
• Single File Hierarchy: xFS, AFS, Solaris MC Proxy
• Single I/O Space: PFS, PVFS, GPFS
• Single Control Point: management from a single GUI
• Single Virtual Networking
• Single Memory Space: DSM
• Single Process Management: GLUnix, Condor, LSF
• Single User Interface: like a workstation/PC windowing environment (CDE in Solaris/NT); it may also use Web technology
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
204
Key Challenges
• Technical and Applications
  – Development Environment
    • Compilers, Debuggers
    • Performance Tools
  – Storage Performance
    • Scalable Storage
    • Common Filesystem
  – Admin Tools
    • Scalable Monitoring Tools
    • Parallel Process Control
  – Node Size
    • Resource Contention
    • Shared Memory Apps
  – Few Users => Many Users
    • 100 Users/month
  – Heterogeneous Systems
    • New generations of systems
  – Integration with the Grid
• Organizational
  – Integration with Existing Infrastructure
    • Accounts, Accounting
    • Mass Storage
  – Acceptance by Community
    • Increasing Quickly
    • Software environments
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
205
Tools for Installation and Management
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
206
What have we learned so far ?
• Installing and maintaining a Linux cluster is a very difficult and time consuming task
• We certainly need tools for installation and management
• The Open Source community has developed different software tools.
• Let us evaluate them on the basis of installation, configuration, monitoring and management aspects by answering the following questions ---
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
207
Installation
• General questions:
  – Can we use a standard Linux distribution, or is one included?
  – Can we use our favorite Linux distribution?
  – Are multiple configurations or profiles in the same cluster supported?
  – Can we use heterogeneous hardware in the same cluster?
    • e.g. nodes with different NICs
• Frontend Installation:
  – How do we install the cluster software?
  – Do we need to install additional software? e.g. a web server
  – Which services must be running and configured on the frontend?
    • DHCP, NFS, DNS etc.
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
208
Installation
• Nodes Installation:
– Can we install the nodes in an automatic way?
– Is it necessary to enter information about the nodes manually?
  • MAC addresses, node names, hardware configuration etc.
– Or is it collected automatically?
– Can we boot the nodes from a diskette or from the NIC?
– Do we need a mouse, keyboard and monitor physically attached to the nodes?
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
209
Configuration
• Frontend Configuration:
  – How do we configure the services on the frontend machine?
    • e.g. how do we configure DHCP, or how do we create a PBS node file or the /etc/hosts file?
• Node Configuration:
  – How do we configure the OS of the nodes?
    • e.g. keyboard layout (English, Spanish, etc.), disk partitioning, timezone etc.
  – How do we create and configure the initial users?
    • e.g. how do we generate ssh keys?
  – Can we modify the configuration of individual software packages?
  – Can we modify the configuration of the nodes once they are installed?
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
210
Monitoring
• Nodes Monitoring:
– Can we monitor the status of the individual cluster nodes?
  • e.g. load, amount of free memory, swap in use etc.
– Can we monitor the status of the cluster as a whole?
– Do we have a graphical tool to display this information?
• Detecting Problems:
  – How do we know if the cluster is working properly?
  – How do we detect that there is a problem with one node?
    • e.g. the node has crashed, or its disk is full
  – Can we have alarms?
  – Can we have defined actions for certain events?
    • e.g. mail to the system manager if a network interface is down
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
211
Maintenance
• General Questions:
– Is it easy to add or remove a node?
– Is it easy to add or remove a software package?
– In case of installing a new package, must it be in RPM format?
• Reinstallations
  – How do we reinstall a node?
• Upgrading
  – Can we upgrade the OS?
  – Can we upgrade individual software packages?
  – Can we upgrade the node hardware?
• Solving problems
  – Do we have tools to solve problems in a remote way, or is it necessary to telnet to the nodes?
  – Can we start a node remotely?
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
212
NPACI Rocks
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
213
Who is NPACI Rocks?
• Cluster Computing Group at SDSC
• UC Berkeley Millennium Project
– Provide Ganglia Support
• Linux Competency Centre in SCS Enterprise Systems
Pte Ltd in Singapore
– Provide PVFS Support
– Working on user documentation for Rocks 2.2
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
214
NPACI Rocks
• NPACI Rocks is a full Linux distribution based on Red Hat Linux
• Designed and created for the installation, administration and use of Linux clusters
• NPACI Rocks is the result of a collaboration between several research groups and industry, led by the Cluster Development Group at the San Diego Supercomputer Center
• The NPACI Rocks distribution comes on three CD-ROMs (ISO images can be downloaded from the web)
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
215
NPACI Rocks – included software
Software packages for cluster management
• ClustAdmin: tools for cluster administration. It includes programs like shoot-node to reinstall a node
• ClustConfig: tools for cluster configuration. It includes software like an improved DHCP server
• eKV: keyboard and video emulation through Ethernet cards
• rocks-dist: a utility to create the Linux distribution used for node installation. This includes a modified version of the Red Hat kickstart with support for eKV
• Ganglia: a real time cluster monitoring tool (with a web based interface) and remote execution environment
• Cluster SQL: the SQL database schema that contains all node configuration information
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
216
NPACI Rocks – included software
Software packages for cluster Programming and Use
• PBS (Portable Batch System) and Maui Scheduler
• A modified version of the secure network connectivity suite OpenSSH
• The parallel programming libraries MPI and PVM, with modifications to work with OpenSSH
• Support for Myrinet GM network interface cards
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
217
Basic Architecture
[Diagram: basic Rocks architecture – the compute nodes connect to a Fast-Ethernet switching complex and, optionally, a Gigabit network switching complex; power distribution (net-addressable units as an option); the front-end node(s) also attach to the public Ethernet.]
Source : Mason Katz et al., SDSC
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
218
Major Components
Standard Beowulf
Source : Mason Katz et .al , SDSC
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
219
Major Components
NPACI Rocks
Source : Mason Katz et .al , SDSC
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
220
NPACI Rocks - Installation
• The installation procedure is based on the Red Hat kickstart installation program and a DHCP server
• Frontend Installation –
  – Create a configuration file for kickstart
  – The config file contains information about characteristics and configuration parameters
    • such as domain name, language, disk partitioning, root password etc.
  – Automatic installation – insert the Rocks CD and a floppy with the kickstart file, and reset the machine
  – When the installation completes, the CD is ejected and the machine reboots
  – At this point the frontend machine is completely installed and configured
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
221
NPACI Rocks - Installation
• Node Installation
  – Execute the insert-ethers program on the frontend machine
  – This program gathers information about the nodes by capturing DHCP requests
  – It configures the necessary services on the frontend
    • such as the NIS maps, MySQL database, DHCP config file etc.
  – Insert the Rocks CD into the first node of the cluster and switch it on
  – If there is no CD drive, a floppy can be used
  – The node will be installed using kickstart without user intervention
Note: Once the installation starts we can follow its evolution from the frontend just by telneting to port 8080 on the node.
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
222
NPACI Rocks - Configuration
• Rocks uses a MySQL database to store cluster-global and node-specific configuration parameters
  – e.g. node names, Ethernet addresses, IP addresses etc.
• The information for this database is collected automatically when the cluster nodes are installed for the first time.
• The node configuration is done at installation time using Red Hat kickstart
• The kickstart configuration file for each node is created automatically by a CGI script on the frontend machine
• At installation time, the nodes request their kickstart files via HTTP.
• The CGI script uses a set of XML-based files to construct a general kickstart file, then applies node-specific modifications by querying the MySQL database
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
223
NPACI Rocks – Monitoring
• Monitoring using Unix tools or the Ganglia cluster toolkit
Ganglia – real time monitoring and remote execution
  – Monitoring is performed by a daemon – gmond
  – This daemon must be running on each node of the cluster
  – It collects information about more than twenty metrics
    • e.g. CPU load, free memory, free swap etc.
  – This information is collected and stored by the frontend machine
  – A web browser can be used to view the output graphically
SNMP based monitoring
  – The process status on any machine can be queried using the snmp-status script
  – Rocks can forward all syslog messages to the frontend machine
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
224
Ganglia
• Scalable cluster monitoring system
  – Based on IP multicast
  – Matt Massie, et al. from UCB
  – http://ganglia.sourceforge.net
• gmond daemon on every node
  – Multicasts system state
  – Listens to other daemons
  – All data is represented in XML
• Ganglia command line
  – Python code to parse the XML into English
• gmetric
  – Extends Ganglia
  – Command line tool to multicast single metrics
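As a sketch, a single custom metric can be announced from any node with gmetric (the metric name, value and units below are invented for illustration):
  gmetric --name disk_temp --value 41 --type int16 --units Celsius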
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
225
Ganglia
Source : Mason Katz et .al , SDSC
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
226
NPACI Rocks - Maintenance
• The principal cluster maintenance operation is reinstallation
• To install, upgrade or remove software, or to change the OS, Rocks performs a complete reinstallation of all the cluster nodes to propagate the changes
• Example – to install a new piece of software:
  – copy the package (must be in RPM format) into the Rocks software repository
  – modify the XML configuration file
  – execute the program rocks-dist to make a new Linux distribution
  – reinstall all the nodes to apply the changes – use shoot-node
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
227
NPACI Rocks
• Rocks uses a NIS map to define the users of the cluster
• Rocks uses NFS to mount the user home directories from the frontend
• If we want to add or remove a node of the cluster we can use the installation tool insert-ethers
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
228
NPACI Rocks - Discussion
• Easy to install a cluster under Linux
• Does not require deep knowledge of cluster architecture or system administration to install a cluster
• Rocks provides a good solution for monitoring with Ganglia
• Rocks philosophy for cluster configuration and maintenance:
  – It is faster to reinstall all nodes to a known configuration than it is to determine whether nodes were out of synchronization in the first place
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
229
NPACI Rocks - Discussion
Other good points
• The kickstart installation program has an advantage over hard-disk image cloning – it results in good support for clusters made of heterogeneous hardware
• eKV makes it almost unnecessary to have a monitor and keyboard physically attached to the nodes
• The node configuration parameters are stored in a MySQL database. This database can be queried to make reports about the cluster, or to program new monitoring and management scripts
• It is very easy to add or remove a user; we only have to modify the NIS map.
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
230
NPACI Rocks - Discussion
Some Bad Points
• Rocks is an extended Red Hat Linux 7.2 distribution. This means it cannot work with other vendors' distributions, or even other versions of Red Hat Linux
• All software installed under Rocks must be in RPM format. If we have a tar.gz we have to create a corresponding RPM
• The monitoring program does not issue any alarm when something goes wrong in the cluster
• User homes are mounted by NFS, which is not a scalable solution
• PXE is not supported to boot nodes from the network.
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
231
OSCAR
• Open Source Cluster Application Resources (OSCAR)
• Tools for installation, administration and use of Linux
clusters
• Developed by the Open Cluster Group – an informal group of people with the objective of making cluster computing practical for high performance computing.
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
232
Where is OSCAR?
• Open Cluster Group site
http://www.OpenClusterGroup.org
• SourceForge Development Home
http://sourceforge.net/projects/oscar
• OSCAR site
http://www.csm.ornl.gov/oscar
• OSCAR email lists
– Users: [email protected]
– Announcements: [email protected]
– Development: [email protected]
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
233
OSCAR – included software
• SIS (System Installation Suite) – a collection of software packages designed to automate the installation and configuration of networked workstations
• LUI (Linux Utility for cluster Installation) – a utility for installing Linux workstations remotely over an Ethernet network
• C3 (Cluster Command & Control tools) – a set of command line tools for cluster management
• Ganglia – a real-time cluster monitoring tool and remote execution environment
• OPIUM – a password installer & user management tool
• Switcher – a tool for user environment configuration
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
234
OSCAR – included software
Software packages for cluster Programming and Use
• PBS (Portable Batch System) and Maui Scheduler
• A secure network connectivity suite OpenSSH
• Parallel programming libraries MPI (LAM/MPI and
MPICH implementations) and PVM
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
235
OSCAR - Installation
• The installation of OSCAR must be done on an already working Linux machine.
• It is based on SIS and a DHCP server
• Frontend Installation
  – The NIC must be configured
  – The hard disk must have at least 2 GB of free space on the partition holding the /tftpboot directory
  – Another 2 GB for the /var directory
  – Copy all the RPMs that make up a standard Linux distribution into /tftpboot
  – These RPMs will be used to create the OS image that will be needed during the installation of the cluster nodes
  – Execute the program install_cluster
  – This updates the server and configures it by modifying files like /etc/hosts or /etc/exports
  – It shows a graphical wizard that guides the installation procedure
N. Seetharama Krishna <[email protected]>
236
OSCAR - Installation
• Node Installation
  – It is done in three phases – cluster definition, node installation and cluster configuration
  – Cluster Definition – build the image with the OS to install:
    • Provide the list of software packages to install
    • Description of the disk partitioning
    • Node name, IP address, netmask, subnet mask, default gateway etc.
    • Collect all the Ethernet addresses and match them to their corresponding IP addresses – this is the most time consuming part and the GUI helps us during this phase
  – Node Installation – the nodes are booted from the network and are automatically installed and configured
  – Cluster Configuration – the frontend and node machines are configured to work together as a cluster
237
OSCAR - Monitoring
• OSCAR also uses the Ganglia cluster toolkit
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
238
OSCAR - Maintenance
• C3 (Cluster Command & Control tool suite)
• C3 is a set of text oriented programs that are executed from the command line on the frontend and take effect on the cluster nodes.
• The C3 tools include the following:
  – cexec – remote execution of commands
  – cget – file transfer to the frontend
  – cpush – file transfer from the frontend
  – ckill – similar to Unix kill but using process names instead of PIDs
  – cps – similar to Unix ps
  – cpushimage – reinstall a node
  – crm – similar to Unix rm
  – cshutdown – shut down an individual node or the whole cluster
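A rough sketch of typical use from the frontend (exact syntax varies between C3 versions, so treat these lines as illustrative):
  cexec uptime          # run a command on every node
  cpush /etc/hosts      # copy a file from the frontend to all nodes
  ckill pbs_mom         # kill a process by name on all nodes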
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
239
OSCAR - Maintenance
• We can use the OSCAR installation wizard to add or remove a cluster node
• With SIS we can add or remove software packages
• With the OPIUM and switcher packages, we can manage users and user environments.
• OSCAR has a program, start_over, that can be used to reinstall the complete cluster
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
240
OSCAR - Discussion
• OSCAR provides all aspects of cluster management:
  – Installation
  – Configuration
  – Monitoring
  – Maintenance
• We can say OSCAR is a rather complete solution
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
241
OSCAR - Discussion
Other good points
• Good documentation
• The installation wizard is very helpful for installing the cluster
• OSCAR can be installed on any Linux distribution, although only Red Hat and Mandrake are supported at present
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
242
OSCAR - Discussion
Some bad points
• Though OSCAR has an installation wizard, it requires an experienced system administrator
• Because the installation of OSCAR is based on images of the OS residing on the hard disk of the frontend, we have to create a different image for each kind of hardware
  – It is necessary to create one image for nodes with IDE disks and a separate one for nodes with SCSI disks
• It does not provide a solution for following the node installation without a physical keyboard and monitor connected
• The management tool C3 is rather basic
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
243
OSCAR - References
• LUI
http://oss.software.ibm.com/lui
• SIS
http://www.sisuite.org
• MPICH
http://www-unix.mcs.anl.gov/mpi/mpich
• LAM/MPI
http://www.lam-mpi.org
• PVM
http://www.csm.ornl.gov/pvm
• OpenPBS
http://www.openpbs.org
• OpenSSL
http://www.openssl.org
• OpenSSH
http://www.openssh.com
• C3
http://www.csm.ornl.gov/torc/C3
• SystemImager
http://systemimager.sourceforge.net
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
244
EU DataGrid Fabric Management Project
WP4 Project
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
245
EU DataGrid Fabric Management Project
• Work Package Four (WP4) of the DataGrid project is responsible for fabric management for the next generation of computing infrastructure
• The goal of WP4 is to develop a new set of tools and techniques for the development of very large computing fabrics (up to tens of thousands of processors), with reduced system administration and operating costs
• WP4 has developed its software for DataGrid, but it can be used with general purpose clusters as well
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
246
WP4 project – Included Software
• Installation – automatic installation and maintenance
• Configuration – tools for configuration information
gathering, database to store configuration information,
and protocols for configuration management and
distribution
• Monitoring – Software for gathering, storing and
analyzing information (performance, status,
environment) about nodes
• Fault tolerance – Tools for problem identification and
solution
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
247
WP4 project - Installation
• Software installation and configuration is done with the
aid of LCFG (Local ConFiGuration system)
• LCFG is a set of tools to
– Install, configure, upgrade and uninstall software for OS and
applications
– Monitor the status of the software
– perform version management
– Manage application installation dependencies
– Support policy-based upgrades and automated scheduled upgrades
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
248
WP4 Project - Installation
• Frontend Installation
  – Install the LCFG installation server software
  – Create a software repository directory (with at least 6 GB of free space) that will contain all the RPM packages necessary to install the cluster nodes
  – In addition to Red Hat Linux and the LCFG server software, the frontend machine must run the following services:
    • A DHCP server – containing the node Ethernet addresses
    • An NFS server – to export the installation root directory
    • A web server – which the nodes use to fetch their configuration profiles
    • A DNS server – to resolve node names given their IP addresses
    • A TFTP server – if we want to use PXE to boot the nodes
  – Finally we need a list of packages and an LCFG profile file, which will be used to create the individual XML profiles containing the configuration parameters.
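As a sketch, the DHCP side of this amounts to one entry per node in the server's /etc/dhcpd.conf (the MAC and IP addresses below are invented examples):
  host node1 {
      hardware ethernet 00:50:56:00:00:01;
      fixed-address 192.168.1.1;
  }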
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
249
WP4 Project - Installation
• Node Installation
  – For each node, we need to add the following information on the frontend:
    • The node's Ethernet address in the DHCP configuration file
    • Update the DNS server with the IP and host name
    • Modify the LCFG server configuration file with the profile of the node
    • Invoke mkxprof to update the XML profile
  – Boot the node with a boot disk. The root file system will be mounted by NFS from the frontend
  – Finally the node will be installed automatically
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
250
WP4 project - Configuration
• Configuration of the cluster is managed by LCFG
• The configuration of the cluster is described in a set of source files, held on the frontend machine and created using a high-level description language
• The source files describe global aspects of the cluster configuration, which are used as profiles for the cluster nodes
• These source files must be validated and translated into node profile files (with the mkxprof program).
• The profile files use XML and are published using the frontend web server
• The cluster nodes fetch their XML profile files via HTTP and reconfigure themselves using a set of component scripts
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
251
WP4 Project - Monitoring
• Monitoring is done with the aid of a set of Monitoring Sensors (MS) that run on the cluster nodes
• The information generated by the MS is collected at each node by a Monitoring Sensor Agent
• The Sensor Agents send the information to a central monitoring repository on the frontend
• A process called the Monitoring Repository Collector (MRC) on the frontend gathers the information
• With a web-based interface we can query any metric for any node
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
252
WP4 Project - Maintenance
• With the LCFG tools we can install, remove or upgrade software by modifying the RPM package list
  – Remove the entry from the package list and call updaterpms to update the nodes
• With the LCFG tools we can add, remove or modify users and groups on the nodes
• Modify the node's LCFG profile file and propagate the change to the nodes with the rdxprof program
• A Fault Tolerance Engine is responsible for detecting critical states on the nodes and deciding whether it is necessary to dispatch any action
• An Actuator Dispatcher receives orders to start Fault Tolerance Actuators.
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
253
WP4 Project - Discussion
• The LCFG software provides a good solution for the installation and configuration of a cluster
– With the package files we can control the software to install on
the cluster nodes
– With the source files we can have full control over the
configuration of nodes
• The design of the monitoring part is very flexible
– We can implement our own monitoring sensor agents, and we
can monitor remote machines ( for example, network switches)
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
254
WP 4 - Discussion
Other good points
• LCFG supports heterogeneous hardware clusters
• We can have multiple configuration profiles defined in the same cluster
• We can define alarms when something goes wrong, and trigger actions to solve these problems
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
255
WP4 - Discussion
Bad points
• It is under development – the components are not very mature
• You need a DNS server or, alternatively, to modify the /etc/hosts file on the frontend by hand
• There is no automatic way to configure the frontend DHCP server
• The web based monitoring interface is rather basic
• The web based monitoring interface is rather basic
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
256
Other tools
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
257
Scyld Beowulf
• Scyld founded by Donald Becker
• Distribution is released as Free / Open Source
(GPL’d)
• Built upon Red Hat 6.2, with the difficult work of customizing it for use as a Beowulf already accomplished
• Represents “second generation beowulf
software”
– Instead of having full Linux installs on each machine,
only one machine, the master node, contains the full
install
– Each slave node has a very small boot image with just
enough resources to download the real “slave”
operating system from the master
– Vastly simplifies software maintenance and upgrades
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
258
Other Tools
• OpenMosix – Multicomputer Operating System for Unix
  – Clustering software that allows a set of Linux computers to work like a single system
  – Applications do not need MPI or PVM
  – No need for a batch system
  – It is still under development
• SCore – a high performance parallel programming environment for workstations and PC clusters
• Cplant – Computational Plant, developed at Sandia National Laboratories with the aim of building scalable clusters out of COTS components
N. Seetharama Krishna <[email protected]>
259
Comparison of Tools
• NPACI Rocks is the easiest solution for the installation and management of a cluster under Linux
• OSCAR's solutions are more flexible and complete than those provided by Rocks, but its installation is more difficult
• The WP4 project offers the most flexible, complete and powerful solution of the analyzed packages
• WP4 is also the most difficult solution to understand, install and use
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
260
HPC Applications and Parallel Programming
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
261
HPC Applications Requirements
• Scientific and Engineering Components
  – Physics, Biology, CFD, Astrophysics, Computer Science and Engineering, etc.
• Numerical Algorithm Components
  – Finite Difference / Finite Volume / Finite Elements, etc.
  – Dense Matrix Algorithms
  – Solving linear systems of equations
  – Solving sparse systems of equations
  – Fast Fourier Transformations
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
262
HPC Applications Requirements
• Non-Numerical Algorithm Components
  – Graph Algorithms
  – Sorting algorithms
  – Search algorithms for discrete optimization
  – Dynamic Programming
• Different Computational Components
  – Parallelism (MPI, PVM, OpenMP, Pthreads, F90, etc.)
  – Architecture Efficiency (SMP, Clusters, Vector, DSM, etc.)
  – I/O Bottlenecks (generate gigabytes per simulation)
  – High Performance Storage (high I/O throughput from disks)
  – Visualization of all that comes out
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
263
Future of Scientific Computing
• Require large scale simulations, beyond the reach of any single machine
• Require large geo-distributed, cross-disciplinary collaborations
• Systems are getting larger by 2x to 4x per year
  – Increasing parallelism: add more and more processors
• New kind of parallelism: the GRID
  – Harness the power of computing resources which keep growing
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
264
HPC Applications Issues
• Architectures and Programming Models
  – Distributed Memory Systems (MPP, Clusters) – Message Passing
  – Shared Memory Systems (SMP) – Shared Memory Programming
  – Specialized Architectures – Vector Processing, Data Parallel Programming
  – The Computational Grid – Grid Programming
• Applications I/O
  – Parallel I/O
  – Need for high performance I/O systems and techniques, scientific data libraries, and standard data representation
• Checkpointing and Recovery
• Monitoring and Steering
• Visualization (Remote Visualization)
• Programming Frameworks
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
265
Important Issues in Parallel Programming
• Partitioning of data
• Mapping of data onto the processors
• Reproducibility of results
• Synchronization
• Scalability and predictability of performance
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
266
Designing Parallel Algorithms
• Detect and exploit any inherent parallelism in an existing sequential algorithm
• Invent a new parallel algorithm
• Adapt another parallel algorithm that solves a similar problem
N. Seetharama Krishna <[email protected]>
267
Principles of Parallel Algorithms and Design
Questions to be answered:
• How to partition the data?
• Which data is going to be partitioned?
• How many types of concurrency are there?
• What are the key principles of designing parallel algorithms?
• What are the overheads in the algorithm design?
• How is the mapping for balancing the load done effectively?
N. Seetharama Krishna <[email protected]>
268
Serial and Parallel Algorithms - Evaluation
• Serial Algorithm
  – Execution time as a function of the size of the input
• Parallel Algorithm
  – Execution time as a function of the input size, the parallel architecture and the number of processors used
Parallel System
A parallel system is the combination of an algorithm and the parallel architecture on which it is implemented
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
269
Success depends on the combination of
• The right architecture, compiler, choice of algorithm, and programming language
• Good design of software, principles of algorithm design, portability, maintainability, performance analysis measures, and efficient implementation
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
270
Parallel Programming Paradigm
• Phase parallel
• Divide and conquer
• Pipeline
• Process farm
• Work pool
Remark:
The parallel program consists of a number of supersteps, and each superstep has two phases: a computation phase and an interaction phase
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
271
Phase Parallel Model
[Diagram: supersteps of independent computations C separated by synchronous interaction steps]
• The phase-parallel model offers a paradigm that is widely used in parallel programming.
• The parallel program consists of a number of supersteps, and each has two phases.
• In a computation phase, multiple processes each perform an independent computation C.
• In the subsequent interaction phase, the processes perform one or more synchronous interaction operations, such as a barrier or a blocking communication.
• Then the next superstep is executed.
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
272
Divide and Conquer
• A parent process divides its workload into several smaller pieces and assigns them to a number of child processes.
• The child processes then compute their workload in parallel and the results are merged by the parent.
• The dividing and the merging procedures are done recursively.
• This paradigm is very natural for computations such as quick sort. Its disadvantage is the difficulty in achieving good load balance.
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
273
Pipeline
[Diagram: a data stream flowing through pipeline stages P, Q, R]
• In the pipeline paradigm, a number of processes form a virtual pipeline.
• A continuous data stream is fed into the pipeline, and the processes execute at different pipeline stages simultaneously in an overlapped fashion.
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
274
Process Farm
[Diagram: a master process feeding a data stream to several slave processes]
• This paradigm is also known as the master-slave paradigm.
• A master process executes the essentially sequential part of the parallel program and spawns a number of slave processes to execute the parallel workload.
• When a slave finishes its workload, it informs the master, which assigns a new workload to the slave.
• This is a very simple paradigm, where the coordination is done by the master.
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
275
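A hypothetical master-slave sketch in MPI, with NTASKS, TAG_WORK and TAG_STOP as illustrative names: the master hands out work items one at a time, a slave returns its result and immediately receives the next item, and an empty TAG_STOP message tells a slave to finish.

#include <mpi.h>
#include <stdio.h>

#define NTASKS    20
#define TAG_WORK  1
#define TAG_STOP  2

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                          /* master */
        int next = 0, active = 0, w, dummy = 0;
        MPI_Status st;

        /* give every slave one task to start with */
        for (w = 1; w < size && next < NTASKS; w++, next++, active++)
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);

        /* collect results and keep the slaves busy */
        while (active > 0) {
            double result;
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_WORK,
                     MPI_COMM_WORLD, &st);
            active--;
            if (next < NTASKS) {              /* hand out the next work item */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
                active++;
            } else {                          /* no work left: stop this slave */
                MPI_Send(&dummy, 0, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
            }
        }
        /* stop slaves that never received work (more slaves than tasks) */
        for (; w < size; w++)
            MPI_Send(&dummy, 0, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
        printf("master: all %d tasks completed\n", NTASKS);
    } else {                                  /* slave */
        MPI_Status st;
        int task;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP)
                break;
            double result = (double)task * task;   /* the actual work */
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}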
Work Pool
• This paradigm is often used in a shared-variable model.
• A pool of work items is maintained in a global data structure.
• A number of processes are created. Initially, there may be just one piece of work in the pool.
• Any free process fetches a piece of work from the pool and executes it, producing zero, one, or more new work pieces that are put back into the pool.
• The parallel program ends when the work pool becomes empty.
• This paradigm facilitates load balancing, as the workload is dynamically allocated to free processes. (A shared-variable code sketch follows this slide.)
[Figure: several processes P fetching work items from a shared work pool]
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
276
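Since the slide ties this paradigm to the shared-variable model, the following sketch uses POSIX threads rather than MPI; POOL_CAP, NTHREADS and the item-spawning rule are illustrative assumptions. A mutex-protected array acts as the pool; any idle thread fetches an item, executes it, may push new items back, and the program ends when the pool is drained.

#include <pthread.h>
#include <stdio.h>

#define POOL_CAP  64
#define NTHREADS  4

static int pool[POOL_CAP];        /* the global work pool */
static int pool_count = 0;        /* pending work items */
static int in_flight = 0;         /* items currently being executed */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        if (pool_count == 0 && in_flight == 0) {   /* pool drained: finish */
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        if (pool_count == 0) {        /* others may still add work: retry */
            pthread_mutex_unlock(&lock);
            continue;                 /* a condition variable would avoid this spin */
        }
        int item = pool[--pool_count];             /* fetch a work item */
        in_flight++;
        pthread_mutex_unlock(&lock);

        /* "execute" the item; small items produce two new work pieces */
        pthread_mutex_lock(&lock);
        if (item < 3 && pool_count + 2 <= POOL_CAP) {
            pool[pool_count++] = item + 10;
            pool[pool_count++] = item + 20;
        }
        in_flight--;
        pthread_mutex_unlock(&lock);
    }
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int i;

    pool[pool_count++] = 0;           /* initially just one piece of work */

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    printf("work pool is empty, program ends\n");
    return 0;
}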
Parallel Programming Models
Implicit parallelism
• The programmer does not explicitly specify parallelism; the compiler and the run-time support system automatically exploit it.
Explicit parallelism
• Parallelism is explicitly specified in the source code by the programmer, using special language constructs, compiler directives, or library calls. (A small directive-based example follows this slide.)
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
277
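A small example of explicit parallelism expressed through a compiler directive, using OpenMP in C; the array size N and the compile command are illustrative assumptions (e.g., gcc -fopenmp). The directive tells the compiler to split the loop iterations among the available threads.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static float x[N], y[N];
    float a = 2.0f;
    int i;

    for (i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* explicit parallelism via a compiler directive:
       the iterations are divided among the threads */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f, threads available: %d\n", y[0], omp_get_max_threads());
    return 0;
}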
Explicit Parallel Programming Models
Three dominant parallel programming models are:
• Data-parallel model
• Message-passing model
• Shared-variable model
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
278
Explicit Parallel Programming Models
Message Passing
Message passing has the following characteristics:
– Multithreading
– Explicit interaction
– Asynchronous parallelism (MPI reduce)
– Separate address spaces (interaction by MPI/PVM)
– Explicit allocation by the user
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
279
Explicit Parallel Programming Models
Message Passing
• Programs are multithreaded and asynchronous, requiring explicit synchronization
• More flexible than the data-parallel model, but it still lacks support for the work-pool paradigm
• PVM and MPI can be used
• Message-passing programs exploit large-grain parallelism (a minimal example follows this slide)
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
280
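A minimal MPI program, offered as an illustration (not from the tutorial) of the characteristics listed above: each process owns a separate address space, and all interaction is explicit through send and receive calls.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value, dest;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        value = 42;                                 /* exists only in rank 0 */
        for (dest = 1; dest < size; dest++)         /* explicit interaction */
            MPI_Send(&value, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d received %d into its own address space\n", rank, value);
    }

    MPI_Finalize();
    return 0;
}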
Basic Communication Operations
• One-to-all broadcast
• One-to-all personalized communication
• All-to-all broadcast
• All-to-all personalized communication
• Circular shift
• Reduction
• Prefix sum
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
281
Basic Communication Operations
[Figure: One-to-all broadcast on an eight-processor tree] (A minimal MPI_Bcast sketch follows this slide.)
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
282
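In an MPI program the one-to-all broadcast is a single collective call; many MPI implementations realize it internally with a tree pattern like the one pictured above. The following minimal sketch (the value 123 is arbitrary) broadcasts an integer from rank 0 to every process.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        data = 123;                         /* only the root owns the value */

    /* one-to-all broadcast from rank 0 to every process */
    MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d now has data = %d\n", rank, data);
    MPI_Finalize();
    return 0;
}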
Applications I/O and Parallel File Systems
• Large-scale parallel applications generate data in the terabyte range, which cannot be efficiently managed by traditional serial I/O schemes (the I/O bottleneck)
Design your applications with parallel I/O in mind!
• Data files should be interchangeable in a heterogeneous computing environment
• Make use of existing tools for data postprocessing and visualization
• Efficient support for checkpointing/recovery
There is a need for high-performance I/O systems and techniques, scientific data libraries, and standard data representations. (A short MPI-IO sketch follows this slide.)
[Figure: the applications I/O software stack, from top to bottom: Applications, Data Models and Formats, Scientific Data Libraries, Parallel I/O Libraries, Parallel Filesystems, Hardware/Storage Subsystem]
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
283
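A hypothetical sketch of parallel application I/O using MPI-IO: every process writes its own block of a shared file at a distinct offset with a collective call. The file name "output.dat" and the block size N_LOCAL are illustrative.

#include <mpi.h>
#include <stdio.h>

#define N_LOCAL 1024                      /* doubles written per process */

int main(int argc, char **argv)
{
    int rank, i;
    double buf[N_LOCAL];
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < N_LOCAL; i++)
        buf[i] = rank + i * 0.001;        /* each process's local data */

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* every process writes to its own region of the shared file */
    offset = (MPI_Offset)rank * N_LOCAL * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, N_LOCAL, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}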
Performance Metrics of Parallel Systems
Speedup metrics
Three performance models, based on three speedup metrics, are commonly used (the standard formulas are sketched after this slide):
• Amdahl's law: fixed problem size (fixed-size speedup)
• Gustafson's law: fixed-time speedup
• Sun-Ni's law: memory-bounded speedup
Three approaches to scalability analysis are based on maintaining
• a constant efficiency,
• a constant speed, and
• a constant utilization.
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
284
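For reference (not part of the original slides), the three speedup metrics are commonly stated as follows, with f the sequential fraction of the work, p the number of processors, and G(p) a function describing how the workload grows with the available memory:

S_{\text{fixed-size}}(p) = \frac{1}{f + (1 - f)/p}
S_{\text{fixed-time}}(p) = f + (1 - f)\,p
S_{\text{memory-bounded}}(p) = \frac{f + (1 - f)\,G(p)}{f + (1 - f)\,G(p)/p}

With G(p) = 1 the memory-bounded form reduces to Amdahl's law, and with G(p) = p it reduces to Gustafson's law.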
Conclusions
Clusters are promising
• They solve the parallel processing paradox
• They offer incremental growth and match funding patterns
• New trends in hardware and software technologies are likely to make clusters more promising
Success depends on the combination of
• The right choice of architecture, compiler, algorithm, and programming language
• Design of the software, principles of algorithm design, portability, maintainability, performance analysis measures, and efficient implementation
Dheeraj Bhardwaj <[email protected]>
N. Seetharama Krishna <[email protected]>
285