Parallel and Distributed Computing

References
• Introduction to Parallel Computing, Second Edition
  Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar
  Publisher: Addison-Wesley, ISBN: 0-201-64865-2
  Publication Date: January 16, 2003
• Tutorials about cluster, grid, and cloud computing (Internet sites)
Chapter 1
Introduction to Parallel Computing
The main objective of this chapter is to provide a survey of innovations in parallel computing with respect to architectures.
von Neumann Architecture
• Common machine model for over 40 years
• Stored-program concept
• CPU executes a stored program
• A sequence of read and write operations on the memory (RAM)
• Order of operations is sequential
A More Detailed Architecture Based on the von Neumann Model
Old von Neumann Computer
CISC von Neumann Computer
• CISC (Complex Instruction Set Computer)
  – single bus system
• Harvard (RISC) architecture utilizes two buses
  – separate instruction and data buses
• RISC (Reduced Instruction Set Computer)
• Both are SISD machines
  – Single Instruction stream, Single Data stream
John von Neumann
• December 28, 1903 – February 8, 1957
• Hungarian mathematician
• Got his Ph.D. at 23
Motivations for Parallel
Computing
• Fundamental limits on single processor
speed
• Disparity between CPU & memory speed
– Performance Mismatch Problem
• Distributed data communications
• Need for very large scale computing
platforms
Fundamental Limits – Cycle Speed
• Cray 1: 12 ns (1975)
• Cray 2: 6 ns (1986)
• Cray T-90: 2 ns (1997)
• Intel PC: 1 ns (2000)
• Today’s PC: 0.3 ns (2006, P4)
• Speed of light: 30 cm in 1 ns
• Signals travel about 10 times slower
Moore’s Law
• Moore’s observation in 1965:
– number of transistors per square inch on
integrated circuits had doubled every year
• Moore’s revised observation in 1975:
  – the pace slowed down a bit, but data density had doubled approximately every 18 months
• How about the future?
– (price of computing power falls by a half every
18 months?)
Moore’s Law – Held for Now
CPU and Memory Speeds
• In 20 years, CPU speed (clock rate) has increased by a factor of 1000
• DRAM speed has increased only by a factor of less than 4
• How can data be fed fast enough to keep the CPU busy?
• CPU speed: 1-2 ns
• DRAM speed: 50-60 ns
• Cache: 10 ns
Memory Access and CPU Speed
CPU, Memory, and Disk Speed
Possible Solutions
• A hierarchy of successively faster memory devices (multilevel caches)
• Locality of data references (data locality), as illustrated below
• Efficient programming can be an issue
• Parallel systems may provide
  1. larger aggregate cache
  2. higher aggregate bandwidth to the memory system
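As a small illustration of data locality (a sketch added here, not part of the original slides): in C, two-dimensional arrays are stored row by row, so the order in which a loop walks through a matrix decides how well the caches are used.

#include <stddef.h>

#define N 1024

/* Row-major traversal: consecutive iterations touch neighboring
   addresses, so every cache line fetched from DRAM is fully used. */
double sum_row_major(const double a[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal of the same data: successive accesses are
   N*sizeof(double) bytes apart, so most of them miss the cache and
   wait on the much slower DRAM discussed above. */
double sum_col_major(const double a[N][N]) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

Both functions compute the same sum; the first simply keeps the memory hierarchy filled with useful data.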
Distributed Data Communications
• Data may be collected and stored at
different locations
• It is expensive to bring them to a central
location for processing
• Many computing assignments may be inherently parallel
• Privacy issues in data mining and other
large scale commercial database
manipulations
Why Use Parallel Computing
• Save time – wall clock time – many
processors work together
• Solve larger problems – larger than one
processor’s CPU and memory can handle
• Provide concurrency – do multiple things
at the same time: online access to
databases, search engine
• Google’s 4,000-PC server clusters are among the largest in the world
Other Reasons for Parallel Computing
• Taking advantage of non-local resources – using computing resources on a wide area network, or even the Internet (grid & cloud computing)
– Remote Access Resources
• Cost savings – using multiple “cheap” computing
resources instead of a high-end CPU
• Overcoming memory constraints – for large
problems, using memories of multiple computers
may overcome the memory constraint obstacle
Need for Large Scale Modeling
• Weather forecasting
• Ocean modeling
• Oil reservoir simulations
• Car and airplane manufacturing
• Semiconductor simulation
• Pollution tracking
• Large commercial databases
• Aerospace (NASA microgravity modeling)
Issues in Parallel Computing
(main issues of the course)
• Design of parallel computers
• Design of efficient parallel algorithms
• Methods for evaluating parallel algorithms
• Parallel computer languages
• Parallel programming tools
• Portable parallel programs
• Automatic programming of parallel computers
• Education of parallel computing philosophy
PARALLEL ARCHITECTURES
• A parallel computer can be characterized as a system
where multiple processing elements cooperate in executing
one or more tasks.
• The numerous existing parallel architectures and their
different approaches require some kind of classification.
Flynn’s Taxonomy
• the design of a computer is characterized by:
– flow (or stream) of instructions, and
– flow (or stream) of data.
• Flynn’s taxonomy classifies according to the multiplicity of the
instruction and the data flows.
SISD (Single Instruction Single Data)
architecture
• Single instruction: only one instruction stream
is being acted on by the CPU during any one
clock cycle
• Single data: only one data stream is being
used as input during any one clock cycle
• Deterministic execution
• Corresponds to the conventional sequential
computer.
– oldest and even today, the most common
type of computer
• Examples: older generation mainframes,
minicomputers and workstations; most
modern day PCs.
MISD (Multiple Instruction Single Data)
 A single data stream is fed into multiple processing units.
 Each processing unit operates on the data independently via independent instruction streams.
 Few machines of this class of parallel computer have ever existed.
  – The experimental Carnegie-Mellon C.mmp computer (1971).
 Some uses might be:
  – multiple frequency filters operating on a single signal stream
  – multiple cryptography algorithms attempting to crack a single coded message.
 Although MISD does not seem to be meaningful,
  – pipeline architectures, as found in all modern processors, can be considered MISD.
Parallel & Distributed Computing
HARDWARE PLATFORMS
Hardware Platforms
• Three principal classes of parallel computers used today:
  – SIMD machines
  – MIMD machines
    • Shared-memory multiprocessors
    • Distributed-memory multicomputers
  – Networked Workstations
Taxonomy of Parallel and Distributed Computers: distributed computers divide into parallel computers (SIMD, shared-memory multiprocessors, distributed-memory multi-computers) and networked workstations.
SIMD Machines
• The main characteristic of SIMD machines is that:
  – all processors must execute the same instruction (on different data elements) at any instant in the program's execution.
  – These machines execute in "lockstep", synchronized to a global clock.
• All processors must complete execution of the current instruction before any is allowed to proceed to the next instruction.
SIMD Machines
• SIMD machines typically contain a larger number of simpler processors (Processing Elements (PEs))
 Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing.
 Two varieties: Processor Arrays and Vector Pipelines:
  – Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV
  – Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10
• Most modern computers, particularly those with graphics processing units (GPUs), employ SIMD instructions and execution units (see the sketch below).
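The SIMD idea survives in ordinary CPUs as vector instruction sets. As a rough sketch (the OpenMP simd directive is one assumed way to request this; a vectorizing compiler may do it on its own), one instruction is applied to several data elements in lockstep:

#include <stddef.h>

/* One operation (an add) applied to many data elements: the compiler
   maps the loop body onto SIMD registers so several elements of c are
   produced per instruction, echoing the lockstep model above. */
void vector_add(const float *restrict a, const float *restrict b,
                float *restrict c, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}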
Processor Arrays – ILLIAC IV
VECTOR PROCESSOR vs. ARRAY PROCESSOR
• A vector processor is a computer with built-in instructions that
  – perform multiple calculations on a vector of data (one-dimensional array)
• A vector processor is used to solve the same or similar problems as an array processor.
• A vector processor passes a vector of data to its functional units.
• An array processor passes each element of a vector to a different arithmetic unit.
MIMD Machines
• The most common type of parallel computer; most modern computers fall into this category
• Every Processing Element (PE) has its own Control Unit (CU).
 Examples
  – most current supercomputers, networked parallel computer clusters and "grids", multiprocessor SMP computers, multi-core PCs.
MIMD Machines
• PEs operate independently of each other and execute independent instructions on different data streams
 Execution can be deterministic or non-deterministic
  – non-deterministic: multiple ways of processing the same input, without any specification of which one will be taken to arrive at the outcome
MIMD Machines
• A parallel execution of a global task (i.e., the collaboration of PEs) is achieved through synchronization and data exchange (synchronous or asynchronous) between the PEs via the interconnection network
• An MIMD architecture can simulate an SIMD architecture
by executing the same program on all the processors,
which is called SPMD (Single Program Multiple Data)
mode.
– deterministic or non-deterministic execution?!
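A minimal SPMD sketch (assuming an MPI installation; compile with mpicc and launch with mpirun): every processor executes the same program, and the rank reported by MPI decides which part of the work each one performs.

#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many of us?  */

    /* Same program everywhere; the rank selects this process's slice. */
    long lo = (long)rank * N / size;
    long hi = (long)(rank + 1) * N / size;
    double partial = 0.0;
    for (long i = lo; i < hi; i++)
        partial += 1.0 / (i + 1);

    /* Combine the partial results on rank 0. */
    double total = 0.0;
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("harmonic sum of %d terms ~= %f\n", N, total);

    MPI_Finalize();
    return 0;
}

With floating-point data the order of the reduction can change the last bits of the result, which is one concrete face of the determinism question raised above.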
Memory Architectures
• For both the design and the programming model of a parallel
system, the memory organization is a very important issue
– not considered by Flynn’s taxonomy
• The memory organization of a parallel system can be divided into two aspects:
– location and
– access policy of the memory.
• Regarding the location, memory is either centralized or
distributed with the processors.
• Systems with a common memory, distributed or not, are called shared-memory machines
  – all processors have full access to the memory.
• Systems where there is no such shared memory are called distributed-memory machines
  – processors have to use explicit means of communication, like message passing.
Memory Architectures
• With these two aspects of memory organization, two common memory organizations:
  – Centralized Memory Multiprocessor
    • a common memory
  – Distributed Memory Multiprocessor
    • Shared Memory (Distributed Shared Memory)
    • Message Passing
      » Multi-computer
Distributed Shared Memory (DSM)
Symmetric Multiprocessor (SMP)
Shared Memory Multi-Processor Systems
• Two different types of shared-memory multi-processor systems:
  – Symmetric Multi-Processors (SMPs)
  – Distributed Shared Memory (DSM)
• In both cases every processor can read and write to any portion of the system's memory (shared variables), as in the sketch below.
• Cache coherence protocols are needed in both cases.
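To make the shared-variable idea concrete, here is a small sketch using POSIX threads (compile with -pthread; the names are illustrative, not from the slides): several threads in one address space update a single counter, and a lock is needed precisely because every processor can write to the same memory.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER    100000

static long counter = 0;                     /* one variable shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITER; i++) {
        pthread_mutex_lock(&lock);           /* serialize writes to the shared variable */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);      /* NTHREADS * NITER on an SMP or DSM system */
    return 0;
}

On an SMP or DSM machine the hardware's cache coherence protocol keeps every core's view of counter consistent; the mutex only orders the updates.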
Shared Memory Multi-Processor
Systems
• SMP system: a single pool of memory, which is equally fast for all processors
  – referred to as Uniform Memory Access (UMA).
• DSM system: multiple pools of memory, where the latency to access memory depends on the relative position of the processor and the memory
  – referred to as cache-coherent Non-Uniform Memory Access (NUMA).
– each processor has local memory (the lowest
latency) while everything else is classified as
remote memory and is slower to access.
Symmetric Multi-Processors (SMPs)
• Shared bus provides a single point of arbitration
• Cache coherency is relatively straightforward
– When a processor needs to broadcast a cache
coherency message, it simply sends that
information over the bus, and all other processors
can receive it
 This is the model that Intel has pursued since the
advent of the Pentium Pro, and continues today with
the entire Xeon line.
Distributed Shared Memory (DSM)
 Scales more effectively because local memory can be accessed rapidly
 Point-to-point interconnects are a natural fit, because the bandwidth grows proportionally to the number of processors and the amount of memory in the system.
 The biggest downside of distributed shared memory is that it only works well if the Operating System is “NUMA-aware” and can efficiently place memory and processes (see the first-touch sketch below).
  – an intelligent OS would place data and schedule processes such that each processor only accesses local memory and never needs remote data.
 Within DSM, a variety of different topologies can be used: crossbar, fat tree, torus, ring, hypercube, etc. Moreover, these topologies can be combined into hybrid topologies: a tree of rings, a crossbar of meshes, etc.
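On NUMA systems a common programming trick, sketched here under the assumption of a Linux-style first-touch page placement policy and an OpenMP-capable compiler (not something stated on the slides), is to initialize data with the same threads that will later use it, so each page lands in the memory local to its thread:

#include <stddef.h>
#include <stdlib.h>

/* First-touch initialization: with a first-touch policy, the page that
   holds a[i] is placed on the NUMA node of the thread that first writes
   it, so later parallel loops over the same index ranges mostly see
   local, low-latency memory instead of remote memory. */
double *alloc_and_init(size_t n) {
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;
    return a;
}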
Memory Hierarchies of Shared Memory
Multi-Processor Systems
Distributed-Memory Multicomputers
• Multi-computers do not support shared variables. Rather, all communications between processors must occur via message passing.
• Message-passing libraries are provided to send and receive messages between processors (see the sketch after this list).
  – Examples
    • the Cray T3D, NCube/Ten, and Intel Paragon.
    • Large multicomputers may contain hundreds of processors
  – no need for a cache coherence protocol
• Network of Workstations (NOW) {Cluster}:
  – Each processor is a stand-alone computer
  – A LAN is used to connect the computers
  – Homogeneous & heterogeneous
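A minimal message-passing sketch (again assuming MPI; the buffer size and tag are illustrative): with no shared variables, data moves between processors only through explicit send and receive calls.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double buf[4] = {1.0, 2.0, 3.0, 4.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* No shared memory: rank 0 must explicitly ship its data ...  */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ... and rank 1 must explicitly copy it into its own memory. */
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}

Run with at least two processes (e.g. mpirun -np 2). This is the explicit communication style a NOW or cluster forces on the programmer, and it is why no cache coherence protocol is needed.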
Distinction between Shared-Memory
Multiprocessors and Multicomputers
• The distinction matters because
  – the common address space provided by shared-memory machines allows global data structures referenced by more than one processor to be used
  – the distributed address space provided by distributed-memory machines allows local data structures
  – many memory management techniques may be used in shared-memory machines.
“Parallel” Computing
• Traditional supercomputers
– SIMD, MIMD, pipelines
– Tightly coupled shared memory (SMP, DSM)
– Loosely Coupled Distributed Memory
(Multicomputer)
– Bus level connections
– Expensive to buy and to maintain
• Very high starting cost
– Expensive hardware
– Expensive software
• High maintenance
• Expensive to upgrade
• Cooperating networks of computers