An Overview of Parallel Computing
Hardware
• There are many varieties of parallel computing hardware and many different architectures.
• The original classification of parallel computers is popularly known as Flynn’s taxonomy.
• Flynn classified systems according to the number of instruction streams and the number of data streams:
1. SISD: single instruction, single data (the von Neumann machine)
2. MIMD: multiple instruction, multiple data (the most general: a collection of autonomous processors operate on their own data streams)
3. SIMD: single instruction, multiple data
4. MISD: multiple instruction, single data
The Classical von Neumann Machine
• Divided into a CPU and main memory
• The CPU is divided into a control unit and an ALU
• The memory stores both instructions and data
• The control unit directs the execution of programs
• The ALU carries out the calculations. While in use by a program, instructions and data are stored in very fast memory locations called registers. Both data and instructions are moved between memory and the registers in the CPU via a bus.
• A bus is a collection of parallel wires; faster buses have more wires
The Classical von Neumann Machine
• To be useful, some additional devices are needed, including input devices, output devices, and extended storage devices (disks)
• The bottleneck is the transfer of data and instructions between memory and the CPU
• Few computers today use the pure classical von Neumann design
• Most machines now have a hierarchical memory; a cache is used to achieve faster access
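As a rough illustration of why the memory hierarchy matters, the following sketch (the array size and loop orders are our own illustrative choices, not from the slides) sums the same 2-D array twice; the first, cache-friendly loop order reuses every cache line it fetches, while the second keeps missing in the cache and paying the main-memory latency, so it typically runs much slower.

#include <stdio.h>

#define N 1024

static float a[N][N];

int main(void)
{
    float sum = 0.0f;

    /* Cache-friendly: consecutive memory locations are visited in order,
       so each cache line brought in from main memory is fully reused. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Cache-hostile: each access jumps N floats ahead, so most accesses
       miss in the cache and go back to the slower main memory. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}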
Pipeline and Vector Architecture
The first widely used extension to the von Neumann model was pipelining.
Suppose we have the following program:

float x[100], y[100], z[100];
for (i = 0; i < 100; i++)
    z[i] = x[i] + y[i];

Suppose further that a single floating-point addition consists of the following operations:
1. Fetch the operands from memory
2. Compare exponents
3. Shift one operand
4. Add
5. Normalize the result
6. Store the result in memory
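A back-of-the-envelope sketch of the gain from pipelining these stages (assuming, purely for illustration, one cycle per stage and the 100 additions from the loop above):

#include <stdio.h>

int main(void)
{
    int n = 100;    /* number of additions, as in the loop above           */
    int stages = 6; /* fetch, compare, shift, add, normalize, store        */

    /* Without pipelining, each addition passes through all of its stages
       before the next addition can start.                                 */
    int serial_cycles = n * stages;

    /* With pipelining, a new addition enters the pipe every cycle once it
       is full, so the last result emerges after stages + (n - 1) cycles.  */
    int pipelined_cycles = stages + (n - 1);

    printf("serial: %d cycles, pipelined: %d cycles\n",
           serial_cycles, pipelined_cycles);
    return 0;
}

For these numbers the pipelined version finishes in 105 cycles instead of 600, close to the ideal factor-of-6 speedup.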
Pipeline and Vector Architecture
• A further improvement: add vector instructions
• With vector instructions, each basic instruction only needs to be issued once; one short instruction encodes N operations (see the sketch after this list)
• Another improvement is the use of multiple memory banks
• Different authors place vector processors in different categories (MISD, a variant of SIMD, or even not really parallel machines at all)
• Examples: the CRAY C90 and the NEC SX-4
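The slides do not show the CRAY vector instructions themselves; as a loose modern analogue (assuming an x86 CPU with AVX and the immintrin.h intrinsics), the sketch below performs the earlier z[i] = x[i] + y[i] loop so that a single add instruction covers eight floats at a time:

#include <immintrin.h>

#define N 100

void vector_add(float *z, const float *x, const float *y)
{
    int i;

    /* One _mm256_add_ps instruction adds 8 floats at once: a single
       issued instruction encodes 8 scalar additions.                 */
    for (i = 0; i + 8 <= N; i += 8) {
        __m256 vx = _mm256_loadu_ps(&x[i]);
        __m256 vy = _mm256_loadu_ps(&y[i]);
        _mm256_storeu_ps(&z[i], _mm256_add_ps(vx, vy));
    }

    /* Scalar cleanup for the elements left over at the end. */
    for (; i < N; i++)
        z[i] = x[i] + y[i];
}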
Pipeline and Vector Architecture
• Advantages: it is relatively easy to write programs that obtain very high performance, so these machines became very popular for high-performance scientific computing
• Disadvantages: they don’t work well for programs that use irregular structures or many branches
SIMD Systems
• A pure SIMD system, as opposed to a vector processor, has a single CPU devoted to control and a collection of subordinate processors
• During each instruction cycle, the control processor broadcasts an instruction to all of the subordinate processors. Each of them either executes the instruction or idles.
Example:
for (i = 0; i < 1000; i++)
    if (y[i] != 0.0)
        z[i] = x[i] / y[i];
    else
        z[i] = x[i];
SIMD Systems
• Each subordinate processor would execute:
Time step 1: test local_y == 0.0
Time step 2:
  a. If local_y was nonzero, z[i] = x[i] / y[i]
  b. If local_y was zero, do nothing
Time step 3:
  a. If local_y was nonzero, do nothing
  b. If local_y was zero, z[i] = x[i]
• Execution is completely synchronous: a given subordinate processor is either active or idle at any given instant of time
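One way to picture this lockstep behaviour is as masked execution: every processor steps through every instruction, and a per-processor flag decides whether the instruction takes effect. The sketch below simulates that idea in ordinary C (the explicit active[] array is an illustrative device, not how real SIMD hardware is programmed):

#define N 1000

/* Simulates lockstep SIMD execution of:
     if (y[i] != 0.0) z[i] = x[i] / y[i]; else z[i] = x[i];          */
void simd_conditional(float *z, const float *x, const float *y)
{
    int active[N];

    /* Time step 1: every processor tests its local y.               */
    for (int i = 0; i < N; i++)
        active[i] = (y[i] != 0.0f);

    /* Time step 2: processors with nonzero y divide; the rest idle. */
    for (int i = 0; i < N; i++)
        if (active[i])
            z[i] = x[i] / y[i];

    /* Time step 3: processors with zero y copy; the rest idle.      */
    for (int i = 0; i < N; i++)
        if (!active[i])
            z[i] = x[i];
}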
SIMD Systems
• The disadvantage is clear: in a program with many conditional branches, or with long segments of code whose execution depends on conditionals, many processors may be idle for long periods of time
• Easy to program if the underlying problem has a regular structure
• The most famous examples of SIMD machines are the
CM-1 and CM-2 Connection Machines produced by
Thinking Machines.
General MIMD Systems
• The key difference between SIMD and MIMD: in MIMD the processors are autonomous
• MIMD systems are asynchronous. There is often no global clock, and there may be no correspondence between what different processors are doing at a given instant, even if they execute the same program
• MIMD systems are divided into shared-memory and distributed-memory systems, also sometimes called multiprocessors and multicomputers, respectively
Shared-Memory MIMD
• The generic shared-memory architecture
Bus-based Architecture
• The bus is the simplest interconnection network
• If multiple processors access memory at the same time, the bus becomes saturated, causing long delays
• Each processor is therefore given a fairly large cache to reduce bus traffic
• Due to the limited bandwidth of a bus, bus-based systems do not scale to large numbers of processors
Switched-based Architecture
• Most others rely on some type of switch-based network
• A crossbar can be viewed as a rectangular mesh of wires with switches at the points of intersection, and terminals on its left and top edges
Switched-based Architecture
• Processors or memory modules can be connected to the
terminals
• The switches can either allow a signal to pass through in
both directions simultaneously, or they can redirect a
signal from vertical to horizontal or vice versa.
• Any processor can access any memory module at the same time as another processor accesses a different module, so crossbars do not suffer from saturation
• However, they are very expensive: an m×n crossbar needs mn hardware switches
Cache Coherence
• Cache coherence is a problem for any shared-memory
architecture
• When a processor accesses a shared variable in its cache, how does it know whether the value stored there is current?
• Example: assume x = 2 initially

Time 0:   P1: y0 = x;    P2: y1 = 3*x;
Time 1:   P1: x = 7;     P2: z = 6;
Time 2:   P1: y = 5;     P2: z1 = 4*x;

y0 ends up as 2 and y1 ends up as 6. What about z1? If P2’s cache still holds the stale value x = 2, z1 will be 8 rather than 28.
Cache Coherence
• The simplest solution is probably the snoopy protocol
• Each CPU has a cache controller
• The cache controller monitors the bus traffic. When a processor updates a shared variable, it also updates the corresponding main memory location. The cache controllers on the other processors detect the write to main memory and mark their copies of the variable as invalid.
• This approach is only suitable for bus-based shared-memory systems, because any traffic on the bus can be monitored by all the controllers
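A minimal sketch of the snooping idea (the data structures and the snoop_write routine are invented for illustration; real controllers implement this in hardware, usually with a richer protocol such as MESI):

#include <stdbool.h>

#define CACHE_LINES 1024

struct cache_line {
    unsigned long addr;  /* main-memory address held in this line      */
    bool valid;          /* cleared when another CPU writes that addr  */
};

struct cache {
    struct cache_line line[CACHE_LINES];
};

/* Called by a cache controller whenever it observes, on the bus, a
   write to main memory issued by some other processor. Any local copy
   of that address is marked invalid, so the next read fetches the
   up-to-date value.                                                    */
void snoop_write(struct cache *c, unsigned long addr)
{
    for (int i = 0; i < CACHE_LINES; i++)
        if (c->line[i].valid && c->line[i].addr == addr)
            c->line[i].valid = false;
}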
Distributed-Memory MIMD
• Each processor has its own private memory
• Generic distributed-memory MIMD
• If we view such a system as a graph, the edges are communication wires. Either each vertex corresponds to a processor/memory pair (a node), or some vertices correspond to nodes and others correspond to switches.
• Networks are divided into static networks and dynamic networks
Distributed-Memory MIMD
• Different types of distributed-memory systems: (a) a static network (mesh), (b) a dynamic network (crossbar)
Dynamic Interconnection Networks
• Dynamic interconnection networks
Example: An omega network
Dynamic Interconnection Networks
• A less expensive solution is to use a multistage switching network, such as the omega network
• For p nodes, (p/2)·log2(p) switches are needed, far fewer than the p² switches a crossbar uses
• The delay in transmitting a message is increased, since log2(p) stages of switches must be set
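A quick check of these switch counts for a few machine sizes (plain arithmetic from the formulas above):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Compare the (p/2)*log2(p) switches of an omega network with the
       p*p switches a full crossbar would need.                         */
    for (int p = 8; p <= 1024; p *= 4) {
        int omega = (p / 2) * (int)log2((double)p);
        long crossbar = (long)p * p;
        printf("p = %4d   omega = %6d   crossbar = %8ld\n",
               p, omega, crossbar);
    }
    return 0;
}

For p = 512, for example, that is 2,304 switches versus 262,144.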
Static Interconnection Networks
• Fully connected interconnection network
• The ideal case from both the performance and the programming point of view
• Communication has no delay
• Costs are huge, since every pair of nodes needs its own wire
Static Interconnection Networks
• A linear array or a ring
• Relatively inexpensive (p-1 wires for a linear array, p for a ring)
• Easy to increase the size of the network
• The number of wires available at each node is extremely limited
• The longest path is p-1 (linear array) or about p/2 (ring)
Static Interconnection Networks
• Hypercube: practically the closest to a fully connected network
• A d-dimensional hypercube has 2^d nodes
• A message between any two nodes traverses at most d wires
• Drawback: relative lack of scalability (each node needs d wires, so growing the machine requires adding wires to every existing node)
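Node labels in a d-dimensional hypercube are d-bit numbers, and two nodes are wired together exactly when their labels differ in a single bit, so the number of wires a message must traverse is the number of differing bits. A small sketch (the function name is ours):

/* Number of hops between two nodes of a hypercube: nodes are adjacent
   iff their labels differ in exactly one bit, so the distance is the
   count of differing bits, which is at most d for a 2^d-node cube.    */
int hypercube_hops(unsigned a, unsigned b)
{
    unsigned diff = a ^ b;
    int hops = 0;

    while (diff) {
        hops += diff & 1u;
        diff >>= 1;
    }
    return hops;
}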
Static Interconnection Networks
• Mesh or torus
Static Interconnection Networks
• In cost and performance, a mesh or torus falls between the hypercube and the linear array
• Scales better than a hypercube
• Quite popular
Communication and Routing
• If two nodes are not directly connected or if a processor
is not directly connected to a memory module, how is
data transmitted between the two?
• If there are multiple routes, how to decide? Is the route
always the shortest path? Most systems use a
deterministic shortest-path algorithm
• How do intermediate nodes forward communications?
Two basic approaches are store-and-forward routing and
cut-through routing
• Store-and-forward routing uses considerably more
memory
• Most systems use some variant of cut-through routing
Store-and-Forward Routing
• The intermediate node reads in the entire message, and then sends it on to the next node (C in the figure)
Cut-Through Routing
• The intermediate node immediately forwards each identifiable piece of the message
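The slides give no cost model, but a commonly used first-order model (an assumption here: per-hop latency t_h, per-word transfer time t_w, a message of m words, and l hops) makes the contrast concrete: store-and-forward pays the full message time at every hop, while cut-through pays it only once.

#include <stdio.h>

int main(void)
{
    double t_h = 1.0;    /* per-hop latency, arbitrary units            */
    double t_w = 0.1;    /* time to push one word across one link       */
    double m   = 1000.0; /* message length in words                     */
    double l   = 8.0;    /* number of hops from source to destination   */

    /* Store-and-forward: each intermediate node receives the whole
       message before re-sending it, so the message time is paid per hop. */
    double store_forward = l * (t_h + m * t_w);

    /* Cut-through: pieces are forwarded as soon as they are identified,
       so only the small per-hop latency grows with distance.             */
    double cut_through = l * t_h + m * t_w;

    printf("store-and-forward: %.1f   cut-through: %.1f\n",
           store_forward, cut_through);
    return 0;
}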