EECC756 - Shaaban (Lecture # 1, Spring 2002)


Parallel Computer Architecture
• A parallel computer is a collection of processing elements
that cooperate to solve large problems fast
• Broad issues involved:
– Resource Allocation:
• Number of processing elements (PEs).
• Computing power of each element.
• Amount of physical memory used.
– Data access, Communication and Synchronization
• How the elements cooperate and communicate.
• How data is transmitted between processors.
• Abstractions and primitives for cooperation.
– Performance and Scalability
• Performance enhancement of parallelism: Speedup.
• Scalability of performance to larger systems/problems.
The Need And Feasibility of Parallel Computing
• Application demands: More computing cycles:
– Scientific computing: CFD, Biology, Chemistry, Physics, ...
– General-purpose computing: Video, Graphics, CAD, Databases,
Transaction Processing, Gaming…
– Mainstream multithreaded programs are similar to parallel programs
• Technology Trends
– Number of transistors on chip growing rapidly
– Clock rates expected to go up but only slowly
• Architecture Trends
– Instruction-level parallelism is valuable but limited
– Coarser-level parallelism, as in MPs, is the most viable approach
• Economics:
– Today’s microprocessors have multiprocessor support, eliminating the
need for designing expensive custom PEs.
– Lower parallel system cost.
– Multiprocessor systems offer a cost-effective replacement for
uniprocessor systems in mainstream computing.
Scientific Computing Demand
Scientific Supercomputing Trends
• Proving ground and driver for innovative architecture
and advanced techniques:
– Market is much smaller relative to commercial segment
– Dominated by vector machines starting in 70s
– Meanwhile, microprocessors have made huge gains in
floating-point performance
• High clock rates.
• Pipelined floating point units.
• Instruction-level parallelism.
• Effective use of caches.
• Large-scale multiprocessors replace vector supercomputers
– Well under way already
Raw Uniprocessor Performance:
LINPACK
[Figure: LINPACK uniprocessor performance (MFLOPS, log scale from 1 to 10,000) vs. year, 1975-2000, for n = 100 and n = 1,000. CRAY vector machines (CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94) compared with microprocessors (Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, DEC Alpha, DEC Alpha AXP, HP 9000/750, HP9000/735, DEC 8200, IBM Power2/990, MIPS R4400).]
Raw Parallel Performance:
LINPACK
[Figure: LINPACK parallel performance (GFLOPS, log scale from 0.1 to 10,000) vs. year, 1985-1996, showing MPP peak and CRAY peak: Xmp/416(4), Ymp/832(8), iPSC/860, nCUBE/2(1024), Delta, CM-2, CM-200, C90(16), CM-5, Paragon XP/S, T3D, T932(32), Paragon XP/S MP (1024), Paragon XP/S MP (6768), ASCI Red.]
General Technology Trends
• Microprocessor performance increases 50% - 100% per year
• Transistor count doubles every 3 years
• DRAM size quadruples every 3 years
[Figure: integer and floating-point performance ratings (0-180) vs. year, 1987-1992, for Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, and DEC Alpha.]
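As a hedged arithmetic aside (my own conversion, not a figure from the lecture), an annual growth rate g and a doubling time T are related by (1 + g)^T = 2, i.e.

$$T = \frac{\ln 2}{\ln(1+g)}: \qquad g \approx 26\% \Rightarrow T \approx 3\ \text{years}, \quad g = 50\% \Rightarrow T \approx 1.7\ \text{years}, \quad g = 100\% \Rightarrow T = 1\ \text{year}.$$

So "doubles every 3 years" corresponds to roughly 26% per year, and the 50%-100% per year performance growth above doubles performance roughly every 12 to 21 months.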
Clock Frequency Growth Rate
[Figure: clock rate (MHz, log scale from 0.1 to 1,000) vs. year, 1970-2005, for i4004, i8008, i8080, i8086, i80286, i80386, Pentium 100, and R10000.]
• Currently increasing 30% per year
Transistor Count Growth Rate
[Figure: transistors per chip (log scale from 1,000 to 100,000,000) vs. year, 1970-2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000.]
• 100 million transistors on chip by the early 2000s
• Transistor count grows much faster than clock rate
– Currently 40% per year
System Attributes to Performance
• Performance benchmarking is program-mix dependent.
• Ideal performance requires a perfect machine/program match.
• Performance measures (a worked numeric example follows below):
– Cycles per instruction (CPI)
– Total CPU time = T = C x t = C / f = Ic x CPI x t = Ic x (p + m x k) x t
where Ic = instruction count, t = CPU cycle time, p = instruction decode cycles,
m = memory cycles, k = ratio between memory/processor cycles,
C = total program clock cycles, f = clock rate
– MIPS Rate = Ic / (T x 10^6) = f / (CPI x 10^6) = f x Ic / (C x 10^6)
– Throughput Rate: Wp = f / (Ic x CPI) = (MIPS) x 10^6 / Ic
• Performance factors (Ic, p, m, k, t) are influenced by: instruction-set
architecture, compiler design, CPU implementation and control, and cache
and memory hierarchy.
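A hedged worked example, with numbers invented for illustration (not from the lecture): take Ic = 2 x 10^9 instructions, CPI = 1.5, and f = 500 MHz (so t = 2 ns). Then

$$T = I_c \times CPI \times t = (2\times 10^{9}) \times 1.5 \times (2\ \text{ns}) = 6\ \text{s}, \qquad \text{MIPS rate} = \frac{f}{CPI \times 10^{6}} = \frac{500\times 10^{6}}{1.5\times 10^{6}} \approx 333,$$

and the same MIPS figure follows from Ic / (T x 10^6) = 2 x 10^9 / (6 x 10^6) ≈ 333, as expected.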
CPU Performance Trends
The microprocessor is currently the most natural
building block for multiprocessor systems in
terms of cost and performance.
[Figure: relative performance (log scale from 0.1 to 100) vs. year, 1965-1995, for supercomputers, mainframes, minicomputers, and microprocessors.]
Parallelism in Microprocessor VLSI Generations
[Figure: transistors per chip (log scale from 1,000 to 100,000,000) vs. year, 1970-2005, from the i4004 through i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000, annotated with the form of parallelism exploited in each VLSI generation: bit-level parallelism, then instruction-level parallelism, then thread-level parallelism (?).]
The Goal of Parallel Computing
• Goal of applications in using parallel machines: Speedup

Speedup (p processors) = Performance (p processors) / Performance (1 processor)

• For a fixed problem size (input data set), performance = 1/time

Speedup fixed problem (p processors) = Time (1 processor) / Time (p processors)
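A hedged numeric illustration with invented times (not from the lecture): if a fixed-size problem takes 100 s on one processor and 8 s on 16 processors, then

$$\text{Speedup}_{\text{fixed problem}}(16) = \frac{\text{Time}(1\ \text{processor})}{\text{Time}(16\ \text{processors})} = \frac{100\ \text{s}}{8\ \text{s}} = 12.5,$$

short of the ideal 16 because of serial sections, communication, and synchronization overhead.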
Elements of Modern Computers
[Diagram: the interacting elements of modern computers: computing problems; algorithms and data structures; applications software; programming in high-level languages; binding (compile, load); operating system; mapping onto the hardware architecture; performance evaluation.]
Elements of Modern Computers
1 Computing Problems:
– Numerical Computing: Science and technology numerical
problems demand intensive integer and floating point
computations.
– Logical Reasoning: Artificial intelligence (AI) problems demand logic
inference, symbolic manipulation, and large space searches.
2 Algorithms and Data Structures
– Special algorithms and data structures are needed to specify the
computations and communication present in computing
problems.
– Most numerical algorithms are deterministic using regular data
structures.
– Symbolic processing may use heuristics or non-deterministic
searches.
– Parallel algorithm development requires interdisciplinary
interaction.
Elements of Modern Computers
3 Hardware Resources
– Processors, memory, and peripheral devices form the
hardware core of a computer system.
– Processor instruction set, processor connectivity, and memory
organization influence the system architecture.
4 Operating Systems
– Manages the allocation of resources to running processes.
– Mapping to match algorithmic structures with hardware
architecture and vice versa: processor scheduling, memory
mapping, interprocessor communication.
– Parallelism exploitation at: algorithm design, program
writing, compilation, and run time.
Elements of Modern Computers
5 System Software Support
– Needed for the development of efficient programs in high-level languages (HLLs).
– Assemblers, loaders.
– Portable parallel programming languages
– User interfaces and tools.
6 Compiler Support
– Preprocessor compiler: Sequential compiler and low-level
library of the target parallel computer.
– Precompiler: Some program flow analysis, dependence
checking, limited optimizations for parallelism detection.
– Parallelizing compiler: Can automatically detect
parallelism in source code and transform sequential code
into parallel constructs.
Approaches to Parallel Programming
(a) Implicit parallelism: the programmer writes source code in sequential
languages (C, C++, FORTRAN, LISP, ...); a parallelizing compiler produces
parallel object code, which is executed by the runtime system.
(b) Explicit parallelism: the programmer writes source code in concurrent
dialects of C, C++, FORTRAN, LISP, ...; a concurrency-preserving compiler
produces concurrent object code, which is executed by the runtime system.
Evolution of Computer
Architecture
Scalar sequential execution evolved through lookahead, functional
parallelism (I/E overlap, multiple functional units; I/E: instruction
fetch and execute), and pipelining into vector and parallel machines:
– Implicit vector and explicit vector architectures (memory-to-memory
and register-to-register organizations)
– SIMD (single instruction stream over multiple data streams):
associative processors and processor arrays
– MIMD (multiple instruction streams over multiple data streams):
multicomputers and multiprocessors
– Massively parallel processors (MPPs)
Parallel Architectures History
Historically, parallel architectures tied to programming models
• Divergent architectures, with no predictable pattern of growth.
[Diagram: application software and system software built on top of divergent architecture types: systolic arrays, SIMD, message passing, dataflow, and shared memory.]
Programming Models
• Programming methodology used in coding applications
• Specifies communication and synchronization
• Examples:
– Multiprogramming:
No communication or synchronization at program level
– Shared memory address space:
– Message passing:
Explicit point to point communication
– Data parallel:
More regimented, global actions on data
• Implemented with shared address space or message passing
Flynn’s 1972 Classification of
Computer Architecture
• Single Instruction stream over a Single Data stream
(SISD): Conventional sequential machines.
• Single Instruction stream over Multiple Data streams
(SIMD): Vector computers, array of synchronized
processing elements.
• Multiple Instruction streams and a Single Data stream
(MISD): Systolic arrays for pipelined execution.
• Multiple Instruction streams over Multiple Data streams
(MIMD): Parallel computers:
• Shared memory multiprocessors.
• Multicomputers: Unshared distributed memory,
message-passing used instead.
Flynn’s Classification of Computer Architecture
Fig. 1.3 page 12 in
Advanced Computer Architecture: Parallelism,
Scalability, Programmability, Hwang, 1993.
Current Trends In Parallel Architectures
• The extension of “computer architecture” to support
communication and cooperation:
– OLD: Instruction Set Architecture
– NEW: Communication Architecture
• Defines:
– Critical abstractions, boundaries, and primitives
(interfaces)
– Organizational structures that implement interfaces
(hardware or software)
• Compilers, libraries and OS are important bridges today
Modern Parallel Architecture
Layered Framework
[Layered framework, top to bottom:
– Parallel applications: CAD, database, scientific modeling
– Programming models: multiprogramming, shared address, message passing, data parallel
– Communication abstraction (user/system boundary)
– Compilation or library, and operating systems support
– Communication hardware (hardware/software boundary)
– Physical communication medium]
Shared Address Space Parallel Architectures
• Any processor can directly reference any memory location
– Communication occurs implicitly as result of loads and stores
• Convenient:
– Location transparency
– Similar programming model to time-sharing in uniprocessors
• Except processes run on different processors
• Good throughput on multiprogrammed workloads
• Naturally provided on a wide range of platforms
– Wide range of scale: few to hundreds of processors
• Popularly known as shared memory machines or model
– Ambiguous: Memory may be physically distributed among
processors
Shared Address Space (SAS) Model
• Process: virtual address space plus one or more threads of control
• Portions of address spaces of processes are shared
[Figure: virtual address spaces for a collection of processes (P0 ... Pn) communicating via shared addresses: the shared portion of each address space maps to common physical addresses in the machine physical address space, while each private portion (P0 private ... Pn private) remains local; loads and stores to the shared portion by different processes reference the same memory.]
• Writes to shared address visible to other threads (in other processes too)
• Natural extension of the uniprocessor model:
– Conventional memory operations used for communication
– Special atomic operations needed for synchronization
– OS uses shared memory to coordinate processes
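As a minimal sketch of this model (my own illustration, not code from the lecture; names such as NTHREADS and worker are invented), the POSIX-threads program below communicates through ordinary loads and stores to a shared variable and uses a mutex as the special atomic synchronization operation:

```c
/* Sketch of the SAS model with POSIX threads: ordinary loads and
   stores to shared_sum communicate; the mutex provides the atomic
   synchronization operation.  Illustrative only.                  */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

static long shared_sum = 0;                       /* shared portion  */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    long local = 0;                               /* private portion */
    for (int i = 0; i < NITERS; i++)
        local++;
    pthread_mutex_lock(&lock);                    /* atomic update   */
    shared_sum += local;                          /* plain store     */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("shared_sum = %ld\n", shared_sum);     /* expect 400000   */
    return 0;
}
```

Compile with something like cc -pthread; without the mutex, the concurrent updates to shared_sum would race.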
Models of Shared-Memory Multiprocessors
• The Uniform Memory Access (UMA) Model:
– The physical memory is shared by all processors.
– All processors have equal access to all memory addresses.
• Distributed memory or Nonuniform Memory Access
(NUMA) Model:
– Shared memory is physically distributed locally among
processors.
• The Cache-Only Memory Architecture (COMA) Model:
– A special case of a NUMA machine where all distributed
main memory is converted to caches.
– No memory hierarchy at each processor.
Models of Shared-Memory Multiprocessors
Uniform Memory Access (UMA) Model
[Diagrams of the three models. UMA: processors connect through a shared interconnect (bus, crossbar, or multistage network) to shared memory modules (Mem) and I/O controllers and devices. Distributed memory or NUMA: nodes of processor (P), cache ($), and local memory (M) connected by a network. COMA: nodes of processor (P), cache (C), and cache directory (D) connected by a network, with all distributed memory acting as cache. Legend: P = processor, M = memory, C = cache, D = cache directory.]
Uniform Memory Access Example:
Intel Pentium Pro Quad
[Diagram: four P-Pro modules, each a CPU with a 256-KB L2 cache, interrupt controller, and bus interface, on the P-Pro bus (64-bit data, 36-bit address, 66 MHz); a memory controller and MIU connect to 1-, 2-, or 4-way interleaved DRAM, and PCI bridges connect to PCI buses and PCI I/O cards.]
• All coherence and multiprocessing glue in processor module
• Highly integrated, targeted at high volume
• Low latency and bandwidth
Uniform Memory Access Example:
SUN Enterprise
[Diagram: Sun Enterprise with CPU/memory cards (two processors with caches ($) and external $2 caches, plus a memory controller behind a bus interface/switch) and I/O cards (bus interface with SBUS slots, 2 FiberChannel, 100bT, and SCSI) on the Gigaplane bus (256-bit data, 41-bit address, 83 MHz).]
– 16 cards of either type: processors + memory, or I/O
– All memory accessed over bus, so symmetric
– Higher bandwidth, higher latency bus
Distributed Shared-Memory
Multiprocessor System Example:
Cray T3E
[Diagram: a Cray T3E node: processor (P) with cache ($), local memory (Mem), and a combined memory controller and network interface (Mem ctrl and NI) attached to the X/Y/Z switch of the 3D torus network; external I/O attaches to the network.]
• Scale up to 1024 processors, 480MB/s links
• Memory controller generates communication requests for nonlocal
references
• No hardware mechanism for coherence (SGI Origin etc. provide this)
Message-Passing Multicomputers
• Comprised of multiple autonomous computers (nodes).
• Each node consists of a processor, local memory, attached
storage and I/O peripherals.
• Programming model more removed from basic hardware
operations
• Local memory is only accessible by local processors.
• A message passing network provides point-to-point static
connections among the nodes.
• Inter-node communication is carried out by message passing
through the static connection network
• Process communication achieved using a message-passing
programming environment.
Message-Passing Abstraction
[Figure: process P executes Send X, Q, t (local address X, destination process Q, tag t) while process Q executes Receive Y, P, t (local address Y, source process P, tag t); the matching send/receive pair copies data from P's local process address space to Q's.]
• Send specifies buffer to be transmitted and receiving process
• Recv specifies sending process and application storage to receive into
• Memory-to-memory copy, but need to name processes
• Optional tag on send and matching rule on receive
• User process names local data and entities in process/tag space too
• In simplest form, the send/recv match achieves pairwise synch event
• Many overheads: copying, buffer management, protection
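As a minimal sketch of this abstraction in MPI (my own illustration, not code from the lecture; the variable names X and Y and the tag value are invented), rank 0 plays process P executing Send X, Q, t and rank 1 plays process Q executing Receive Y, P, t; the tag plus the source rank implement the matching rule:

```c
/* Minimal MPI sketch of the send/receive abstraction above.
   Run with 2 processes, e.g.:  mpirun -np 2 ./a.out             */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, X = 42, Y = 0, t = 99;      /* t plays the message tag   */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                      /* process P: Send X, Q, t   */
        MPI_Send(&X, 1, MPI_INT, 1, t, MPI_COMM_WORLD);
    } else if (rank == 1) {               /* process Q: Receive Y, P, t */
        MPI_Recv(&Y, 1, MPI_INT, 0, t, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received Y = %d\n", Y);
    }

    MPI_Finalize();
    return 0;
}
```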
Message-Passing Example: IBM SP-2
• Made out of essentially complete RS6000 workstations
• Network interface integrated in I/O bus (bandwidth limited by I/O bus)
[Diagram: an IBM SP-2 node: Power2 CPU with L2 cache on the memory bus, a memory controller to 4-way interleaved DRAM, and a MicroChannel I/O bus carrying I/O, DMA, and a network interface card (NIC) with its own i860 processor, NI, and DRAM; nodes are connected by a general interconnection network formed from 8-port switches.]
Message-Passing Example:
Intel Paragon
[Diagram: an Intel Paragon node: two i860 processors with L1 caches on a 64-bit, 50 MHz memory bus, a memory controller to 4-way interleaved DRAM, and a DMA/driver/NI block connecting to the 2D grid network (8-bit, 175 MHz, bidirectional links) with a processing node attached to every switch; Sandia’s Intel Paragon XP/S-based supercomputer.]
Message-Passing Programming Tools
• Message-passing programming libraries include:
– Message Passing Interface (MPI):
• Provides a standard for writing concurrent message-passing
programs.
• MPI implementations include parallel libraries used by
existing programming languages.
– Parallel Virtual Machine (PVM):
• Enables a collection of heterogeneous computers to be used as a
coherent and flexible concurrent computational resource.
• PVM support software executes on each machine in a user-configurable
pool, and provides a computational environment for concurrent
applications.
• User programs written in, for example, C, Fortran, or Java are
given access to PVM through calls to PVM library routines.
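As a hedged sketch of the PVM style just described (my own illustration, not from the lecture; the program name hello_pvm, the tag value 1, and the assumption that the executable is installed where the PVM daemon can find it are all mine), a task with no parent spawns one copy of itself and receives an integer from it:

```c
/* Hedged PVM 3 sketch: master spawns one worker, worker packs and
   sends an int back with message tag 1.  Link with -lpvm3 and run
   under a started PVM daemon (pvmd).                               */
#include <stdio.h>
#include "pvm3.h"

int main(void)
{
    int mytid  = pvm_mytid();            /* enroll in PVM            */
    int parent = pvm_parent();

    if (parent == PvmNoParent) {         /* master task              */
        int child, n = 0;
        pvm_spawn("hello_pvm", NULL, PvmTaskDefault, "", 1, &child);
        pvm_recv(child, 1);              /* blocking receive, tag 1  */
        pvm_upkint(&n, 1, 1);            /* unpack one int           */
        printf("master %x got %d from worker %x\n", mytid, n, child);
    } else {                             /* spawned worker task      */
        int n = 42;
        pvm_initsend(PvmDataDefault);    /* new send buffer          */
        pvm_pkint(&n, 1, 1);             /* pack one int             */
        pvm_send(parent, 1);             /* send to master, tag 1    */
    }

    pvm_exit();
    return 0;
}
```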
Data Parallel Systems (SIMD in Flynn’s taxonomy)
• Programming model:
– Operations performed in parallel on each element of a data structure
– Logically single thread of control, performs sequential or parallel steps
– Conceptually, a processor is associated with each data element (see the sketch below)
• Architectural model:
– Array of many simple, cheap processors, each with little memory
• Processors don’t sequence through instructions
– Attached to a control processor that issues instructions
– Specialized and general communication, cheap global synchronization
[Diagram: a control processor issuing instructions to a 2D grid of PEs.]
• Example machines:
– Thinking Machines CM-1, CM-2 (and CM-5)
– Maspar MP-1 and MP-2
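The sketch below is my own illustration of the data-parallel style, written with OpenMP on a conventional multiprocessor rather than on the SIMD array hardware described above; the array names and sizes are invented. Each loop iteration plays the logical "processor per data element", and the runtime maps the iterations onto the physical processors:

```c
/* Data-parallel style in C with OpenMP: one logical operation per
   array element, applied to all elements in parallel.
   Compile with something like:  cc -fopenmp saxpy.c               */
#include <stdio.h>

#define N 16

int main(void)
{
    double x[N], y[N];

    for (int i = 0; i < N; i++) {         /* initialize the data    */
        x[i] = i;
        y[i] = 2.0 * i;
    }

    #pragma omp parallel for              /* global action on data  */
    for (int i = 0; i < N; i++)
        y[i] = 3.0 * x[i] + y[i];

    printf("y[0] = %g, y[%d] = %g\n", y[0], N - 1, y[N - 1]);
    return 0;
}
```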
Dataflow Architectures
• Represent computation as a graph of essential dependences
– Logical processor at each node, activated by availability of operands
– Message (tokens) carrying tag of next instruction sent to next processor
– Tag compared with others in matching store; match fires execution
Example:
a = (b + 1) × (b − c)
d = c × e
f = a × d
[Figure: the dataflow graph for these expressions, and a dataflow processor pipeline: a token store and program store feed a waiting-matching stage, then instruction fetch, execute, and form-token stages, with a token queue and the network connecting processors.]
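As a small illustration (my own, with invented input values) of the graph above written as ordinary C: the three leaf operations are mutually independent, so in a dataflow machine their tokens could fire concurrently, while f must wait for both a and d.

```c
/* The dependence structure of the dataflow example:
   (b + 1), (b - c) and (c * e) have no mutual dependences,
   so they can execute in parallel; a and then f follow.      */
#include <stdio.h>

int main(void)
{
    double b = 2.0, c = 3.0, e = 4.0;   /* invented inputs      */

    double t1 = b + 1.0;                /* independent          */
    double t2 = b - c;                  /* independent          */
    double d  = c * e;                  /* independent          */

    double a  = t1 * t2;                /* needs t1 and t2      */
    double f  = a * d;                  /* needs a and d        */

    printf("a = %g, d = %g, f = %g\n", a, d, f);
    return 0;
}
```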
Systolic Architectures
• Replace single processor with an array of regular processing elements
• Orchestrate data flow for high throughput with less memory access
[Figure: memory (M) feeding a linear array of processing elements (PE → PE → PE → PE) and collecting the results, in place of a single processor between memories.]
• Different from pipelining
– Nonlinear array structure, multidirection data flow, each PE
may have (small) local instruction and data memory
• Different from SIMD: each PE may do something different
• Initial motivation: VLSI enables inexpensive special-purpose chips
• Represent algorithms directly by chips connected in regular pattern