Transcript note2

Lecture 2 Parallel
Programming Platforms
Parallel Computing
Fall 2014
1
Implicit Parallelism: Trends in
Microprocessor Architectures





Microprocessor clock speeds have posted impressive gains over the
past two decades (two to three orders of magnitude).
Higher levels of device integration have made available a large
number of transistors.
The question of how best to utilize these resources is an important
one.
Current processors use these resources in multiple functional units
and execute multiple instructions in the same cycle.
The precise manner in which these instructions are selected and
executed provides impressive diversity in architectures.
2
Implicit Parallelism


Pipelining and Superscalar execution
Very Long Instruction Word Processors
3
Pipelining and Superscalar Execution



Pipelining overlaps various stages of
instruction execution to achieve performance.
At a high level of abstraction, an instruction
can be executed while the next one is being
decoded and the next one is being fetched.
This is akin to an assembly line for
manufacture of cars.
4
Pipelining and Superscalar Execution

Limitations:




The speed of a pipeline is eventually limited by the
slowest stage.
For this reason, conventional processors rely on
very deep pipelines (20 stage pipelines in state-ofthe-art Pentium processors).
However, in typical program traces, every 5-6th
instruction is a conditional jump! This requires
very accurate branch prediction.
The penalty of a misprediction grows with the
depth of the pipeline, since a larger number of
instructions will have to be flushed.
5
Pipelining and Superscalar Execution


One simple way of alleviating these
bottlenecks is to use multiple pipelines.
The question then becomes one of selecting
these instructions.
6
Superscalar Execution: An Example
Example of a two-way superscalar execution of instructions.
7
Superscalar Execution

Scheduling of instructions is determined by a number of
factors:





True Data Dependency: The result of one operation is an input
to the next.
Resource Dependency: Two operations require the same
resource.
Branch Dependency: Scheduling instructions across conditional
branch statements cannot be done deterministically a-priori.
The scheduler, a piece of hardware looks at a large number of
instructions in an instruction queue and selects appropriate
number of instructions to execute concurrently based on these
factors.
The complexity of this hardware is an important constraint on
superscalar processors.
8
Very Long Instruction Word (VLIW)
Processors





The hardware cost and complexity of the superscalar scheduler
is a major consideration in processor design.
To address this issues, VLIW processors rely on compile time
analysis to identify and bundle together instructions that can be
executed concurrently.
These instructions are packed and dispatched together, and
thus the name very long instruction word.
This concept was used with some commercial success in the
Multiflow Trace machine (circa 1984).
Variants of this concept are employed in the Intel IA64
processors.
9
Very Long Instruction Word (VLIW)
Processors: Considerations






Issue hardware is simpler.
Compiler has a bigger context from which to select
co-scheduled instructions.
Compilers, however, do not have runtime information
such as cache misses. Scheduling is, therefore,
inherently conservative.
Branch and memory prediction is more difficult.
VLIW performance is highly dependent on the
compiler. A number of techniques such as loop
unrolling, speculative execution, branch prediction
are critical.
Typical VLIW processors are limited to 4-way to 8way parallelism.
10
Limitations of
Memory System Performance




Memory system, and not processor speed, is often
the bottleneck for many applications.
Memory system performance is largely captured by
two parameters, latency and bandwidth.
Latency is the time from the issue of a memory
request to the time the data is available at the
processor.
Bandwidth is the rate at which data can be pumped
to the processor by the memory system.
11
Improving Effective Memory
Latency Using Caches





Caches are small and fast memory elements between the
processor and DRAM.
This memory acts as a low-latency high-bandwidth storage.
If a piece of data is repeatedly used, the effective latency of this
memory system can be reduced by the cache.
The fraction of data references satisfied by the cache is called
the cache hit ratio of the computation on the system.
Cache hit ratio achieved by a code on a memory system often
determines its performance.
12
Memory System Performance:
Summary



Exploiting spatial and temporal locality in applications is
critical for amortizing memory latency and increasing
effective memory bandwidth.
The ratio of the number of operations to number of
memory accesses is a good indicator of anticipated
tolerance to memory bandwidth.
Memory layouts and organizing computation
appropriately can make a significant impact on the
spatial and temporal locality.
13
Alternate Approaches for
Hiding Memory Latency



Prefetching
Multithreading
Spatial locality in accessing memory
words.
14
Tradeoffs of Multithreading and
Prefetching




Bandwidth requirements of a multithreaded system
may increase very significantly because of the
smaller cache residency of each thread.
Multithreaded systems become bandwidth bound
instead of latency bound.
Multithreading and prefetching only address the
latency problem and may often exacerbate the
bandwidth problem.
Multithreading and prefetching also require
significantly more hardware resources in the form of
storage.
15
Explicitly Parallel Platforms
16
Flynn’s taxonomy of computer
architectures (control mechanism)

Depending on the execution and data streams computer architectures
can be distinguished into the following groups.


(1) SISD (Single Instruction Single Data) : This is a sequential computer.
(2) SIMD (Single Instruction Multiple Data) : This is a parallel machine like
the TM CM-200. SIMD machines are suited for data-parallel programs
where the same set of instructions are executed on a large data set.



Some of the earliest parallel computers such as the Illiac IV, MPP, DAP, CM-2,
and MasPar MP-1 belonged to this class of machines
(3) MISD (Multiple Instructions Single Data) : Some consider a systolic array
a member of this group.
(4) MIMD (Multiple Instructions Multiple Data) : All other parallel machines.
A MIMD architecture can be an MPMD or an SPMD. In a Multiple Program
Multiple Data organization, each processor executes its own program as
opposed to a single program that is executed by all processors on a Single
Program Multiple Data architecture.

Examples of such platforms include current generation Sun Ultra Servers, SGI
Origin Servers, multiprocessor PCs, workstation clusters, and the IBM SP
Note: Some consider CM-5 as a combination of a MIMD and SIMD as it
contains control hardware that allows it to operate in a SIMD mode.
17
SIMD and MIMD Processors
A typical SIMD architecture (a) and a typical MIMD architecture (b).
18
Taxonomy based on Address-Space
Organization (memory distribution)

Message-Passing Architecture



In a distributed memory machine each processor has its own exclusive address space.
On such platforms, interactions between processes running on different nodes must be
accomplished using messages, hence the name message-passing. Such machines
are commonly referred to as multicomputers.
Examples: Cray T3D/T3E, IBM SP1/SP2, workstation clusters.
Shared-Address-Space Architecture


Provides hardware support for read/write to a shared address space. Machines built
this way are often called multiprocessors.
(1) A shared memory machine has a single address space shared by all processors
(UMA, for Uniform Memory Access).



The time taken by a processor to access any memory word in the system is identical.
Examples: SGI Power Challenge, SMP machines.
(2) A distributed shared memory system is a hybrid between the two previous
ones. A global address space is shared among the processors but is distributed
among them. Each processor can access its own memory faster than it can access the
memory of a remote processor (NUMA for Non-Uniform Memory Access).
Example: SGI Origin 2000
Note: The existence of a cache in shared-memory parallel machines cause cache
coherence problems when a cached variable is modified by a processor and the
shared-variable is requested by another processor. cc-NUMA for cache-coherent
NUMA architectures (Origin 2000).
19
NUMA and UMA Shared-AddressSpace Platforms
Typical shared-address-space architectures: (a) Uniform-memory
access shared-address-space computer; (b) Uniform-memoryaccess shared-address-space computer with caches and
memories; (c) Non-uniform-memory-access shared-address-space
computer with local memory only.
20
Message Passing
vs.
Shared Address Space Platforms


Message passing requires little hardware support, other than a
network.
Shared address space platforms can easily emulate message
passing. The reverse is more difficult to do (in an efficient
manner).
21
Taxonomy based on processor
granularity

The granularity sometimes refers to the power of individual
processors. Sometimes is also used to denote the degree of
parallelism.




(1) A coarse-grained architecture consists of (usually few)
powerful processors (eg old Cray machines).
(2) a fine-grained architecture consists of (usually many
inexpensive) processors (eg TM CM-200, CM-2).
(3) a medium-grained architecture is between the two (eg CM-5).
Process Granularity refers to the amount of computation
assigned to a particular processor of a parallel machine for a
given parallel program. It also refers, within a single program,
to the amount of computation performed before communication
is issued. If the amount of computation is small (low degree of
concurrency) a process is fine-grained. Otherwise granularity
is coarse.
22
Taxonomy based on processor
synchronization



(1) In a fully synchronous system a global clock is
used to synchronize all operations performed by the
processors.
(2) An asynchronous system lacks any
synchronization facilities. Processor synchronization
needs to be explicit in a user’s program.
(3) A bulk-synchronous system comes in between a
fully synchronous and an asynchronous system.
Synchronization of processors is required only at
certain parts of the execution of a parallel program.
23
Physical Organization of Parallel
Platforms – ideal architecture(PRAM)



The Parallel Random Access Machine (PRAM) is one of the simplest ways to model a
parallel computer.
A PRAM consists of a collection of (sequential) processors that can synchronously
access a global shared memory in unit time. Each processor can thus access its
shared memory as fast (and efficiently) as it can access its own local memory.
The main advantages of the PRAM is its simplicity in capturing parallelism and
abstracting away communication and synchronization issues related to parallel computing.



Processors are considered to be in abundance and unlimited in number. The resulting PRAM
algorithms thus exhibit unlimited parallelism (number of processors used is a function of
problem size).
The abstraction thus offered by the PRAM is a fully synchronous collection of processors and a
shared memory which makes it popular for parallel algorithm design.
It is, however, this abstraction that also makes the PRAM unrealistic from a practical point
of view.



Full synchronization offered by the PRAM is too expensive and time demanding in parallel
machines currently in use.
Remote memory (i.e. shared memory) access is considerably more expensive in real machines
than local memory access
UMA machines with unlimited parallelism are difficult to build.
24
Four Subclasses of PRAM

Depending on how concurrent access to a single memory cell (of the shared memory)
is resolved, there are various PRAM variants.



ER (Exclusive Read) or EW (Exclusive Write) PRAMs do not allow concurrent access of the
shared memory.
It is allowed, however, for CR (Concurrent Read) or CW (Concurrent Write) PRAMs.
Combining the rules for read and write access there are four PRAM variants:

EREW:



CREW


Multiple read accesses to a memory location are allowed. Multiple write accesses to a memory
location are serialized.
ERCW



access to a memory location is exclusive. No concurrent read or write operations are allowed.
Weakest PRAM model
Multiple write accesses to a memory location are allowed. Multiple read accesses to a memory
location are serialized.
Can simulate an EREW PRAM
CRCW



Allows multiple read and write accesses to a common memory location.
Most powerful PRAM model
Can simulate both EREW PRAM and CREW PRAM
25
Resolve concurrent write access




(1) in the arbitrary PRAM, if multiple processors write into a single
shared memory cell, then an arbitrary processor succeeds in writing
into this cell.
(2) in the common PRAM, processors must write the same value into
the shared memory cell.
(3) in the priority PRAM the processor with the highest priority
(smallest or largest indexed processor) succeeds in writing.
(4) in the combining PRAM if more than one processors write into the
same memory cell, the result written into it depends on the combining
operator. If it is the sum operator, the sum of the values is written, if it
is the maximum operator the maximum is written.
Note: An algorithm designed for the common PRAM can be executed on a priority or
arbitrary PRAM and exhibit similar complexity. The same holds for an arbitrary PRAM
algorithm when run on a priority PRAM.
26