
CS 258
Parallel Computer Architecture
Lecture 1
Introduction to Parallel Architecture
January 23, 2002
Prof John D. Kubiatowicz
Computer Architecture Is …
the attributes of a [computing] system as seen
by the programmer, i.e., the conceptual
structure and functional behavior, as distinct
from the organization of the data flow and
controls, the logic design, and the physical
implementation.
Amdahl, Blaauw, and Brooks, 1964
Instruction Set Architecture (ISA)
[Diagram: software layered above hardware, with the instruction set as the interface between them]
• Great picture for uniprocessors….
– Rapidly crumbling, however!
• Can this be true for multiprocessors???
– Much harder to say.
What is Parallel Architecture?
• A parallel computer is a collection of processing
elements that cooperate to solve large problems fast
• Some broad issues:
– Models of computation: PRAM? BSP? Sequential Consistency?
– Resource Allocation:
» how large a collection?
» how powerful are the elements?
» how much memory?
– Data access, Communication and Synchronization
» how do the elements cooperate and communicate?
» how are data transmitted between processors?
» what are the abstractions and primitives for cooperation?
– Performance and Scalability
» how does it all translate into performance?
» how does it scale?
Computer Architecture Topics (252+)
[Diagram: layered view of CS 252 topics]
• Input/Output and Storage: Disks, WORM, Tape; RAID; Emerging Technologies; Interleaving; Bus protocols; DRAM
• Memory Hierarchy: L1 Cache, L2 Cache; VLSI; Coherence, Bandwidth, Latency; Network Communication, Other Processors
• Instruction Set Architecture: Addressing, Protection, Exception Handling
• Pipelining and Instruction Level Parallelism: Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, Dynamic Compilation
Computer Architecture Topics (258)
[Diagram: processor-memory (P, M) nodes connected through switches (S) to an interconnection network]
• Multiprocessors: Shared Memory, Message Passing, Data Parallelism
• Networks and Interconnections: Network Interfaces; Topologies, Routing, Bandwidth, Latency, Reliability
• Processor-Memory-Switch organization
• Everything in the previous slide, but more so!
What will you get out of CS258?
• In-depth understanding of the design and engineering
of modern parallel computers
– technology forces
– Programming models
– fundamental architectural issues
» naming, replication, communication, synchronization
– basic design techniques
» cache coherence, protocols, networks, pipelining, …
– methods of evaluation
• from moderate to very large scale
• across the hardware/software boundary
• Study of REAL parallel processors
– Research papers, white papers
• Natural consequences??
– Massive Parallelism → Reconfigurable computing?
– Message Passing Machines → NOW → Peer-to-peer systems?
Will it be worthwhile?
• Absolutely!
– even though few of you will become PP designers
• The fundamental issues and solutions translate
across a wide spectrum of systems.
– Crisp solutions in the context of parallel machines.
• Pioneered at the thin-end of the platform
pyramid on the most-demanding applications
– migrate downward with time
• Understand implications for software
• Network attached storage, MEMs, etc?
[Diagram: platform pyramid with SuperServers at the thin top, above Departmental Servers, Workstations, and Personal Computers at the broad base]
Why Study Parallel Architecture?
Role of a computer architect:
To design and engineer the various levels of a computer system
to maximize performance and programmability within limits of
technology and cost.
Parallelism:
• Provides alternative to faster clock for performance
• Applies at all levels of system design
• Is a fascinating perspective from which to view architecture
• Is increasingly central in information processing
• How is instruction-level parallelism related to coarse-grained
parallelism??
Is Parallel Computing Inevitable?
This was certainly not clear just a few years ago
Today, however:
• Application demands: Our insatiable need for
computing cycles
• Technology Trends: Easier to build
• Architecture Trends: Better abstractions
• Economics: Cost of pushing uniprocessor
• Current trends:
– Today’s microprocessors have multiprocessor support
– Servers and workstations becoming MP: Sun, SGI, DEC,
COMPAQ!...
– Tomorrow’s microprocessors are multiprocessors
Can programmers handle parallelism?
• Humans not as good at parallel programming as
they would like to think!
– Need good model to think of machine
– Architects pushed on instruction-level parallelism
really hard, because it is “transparent”
• Can compiler extract parallelism?
– Sometimes
• How do programmers manage parallelism??
– Language to express parallelism?
– How to schedule varying number of processors?
• Is communication Explicit (message-passing) or Implicit (shared memory)? (see the sketch below)
– Are there any ordering constraints on
communication?
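A minimal sketch of the explicit/implicit distinction (my own illustration, not from the lecture; Python chosen only for brevity). The message-passing worker communicates through explicit send/receive calls over a pipe; the shared-memory threads communicate simply by writing one shared variable, with a lock as the only explicit ordering constraint.

# Minimal sketch of explicit vs. implicit communication (illustrative only).
from multiprocessing import Process, Pipe
from threading import Thread, Lock

# Explicit: message passing. Every transfer is a visible send/receive.
def mp_worker(conn):
    data = conn.recv()              # explicit receive
    conn.send(sum(data))            # explicit send
    conn.close()

def message_passing_demo():
    parent, child = Pipe()
    p = Process(target=mp_worker, args=(child,))
    p.start()
    parent.send(list(range(1000)))
    print("message passing:", parent.recv())
    p.join()

# Implicit: shared memory. Communication is just a memory write;
# the lock imposes the only explicit ordering constraint.
total = 0
lock = Lock()

def sm_worker(chunk):
    global total
    partial = sum(chunk)
    with lock:
        total += partial            # other threads see this implicitly

def shared_memory_demo():
    data = list(range(1000))
    threads = [Thread(target=sm_worker, args=(data[i::4],)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("shared memory:", total)

if __name__ == "__main__":
    message_passing_demo()          # both print 499500
    shared_memory_demo()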
Granularity:
• Is communication fine or coarse grained?
– Small messages vs big messages
• Is parallelism fine or coarse grained
– Small tasks (frequent synchronization) vs big tasks
• If hardware handles fine-grained parallelism, then
easier to get incremental scalability
• Fine-grained communication and parallelism harder
than coarse-grained (see the model below):
– Harder to build with low overhead
– Custom communication architectures often needed
• Ultimate coarse-grained communication:
– GIMPS (Great Internet Mersenne Prime Search)
– Communication once a month
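A back-of-the-envelope model of why fine grain demands low overhead (my own sketch, not from the lecture): total time is roughly work/p plus a fixed cost per task, so shrinking the task size multiplies the overhead term.

# Illustrative model: time = work/p + (#tasks x per-task overhead).
def parallel_time(work, p, task_size, overhead_per_task):
    n_tasks = work / task_size
    return work / p + n_tasks * overhead_per_task

WORK = 1_000_000        # abstract units of work
P = 64                  # processors
OVERHEAD = 100          # units per task created/synchronized

for task_size in (10, 1_000, 100_000):
    t = parallel_time(WORK, P, task_size, OVERHEAD)
    print(f"task size {task_size:>7}: speedup {WORK / t:6.2f}")

# Prints roughly 0.10, 8.65, 60.15: fine-grained tasks are swamped by
# overhead unless the per-task cost is driven down (e.g., in hardware).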
CS258: Staff
Instructor: Prof. John D. Kubiatowicz
Office: 673 Soda Hall, 643-6817 kubitron@cs
Office Hours: Thursday 1:30 - 3:00 or by appt.
Class: Wed, Fri, 1:00 - 2:30pm
310 Soda Hall
Administrative: Veronique Richard,
Office: 676 Soda Hall, 642-4334 nicou@cs
Web page: http://www.cs/~kubitron/courses/cs258-S02/
Lectures available online by 11:30AM the day of lecture
Email: [email protected]
Clip signup link on web page (as soon as it is up)
Why Me? The Alewife Multiprocessor
• Cache-coherent Shared Memory
– Partially in Software!
• User-level Message-Passing
• Rapid Context-Switching
• Asynchronous network
• One node/board
TextBook: Two leaders in field
Text: Parallel Computer Architecture:
A Hardware/Software Approach,
By: David Culler & Jaswinder Pal Singh
Covers a range of topics
We will not necessarily cover
them in order.
Lecture style
• 1-Minute Review
• 20-Minute Lecture/Discussion
• 5-Minute Administrative Matters
• 25-Minute Lecture/Discussion
• 5-Minute Break (water, stretch)
• 25-Minute Lecture/Discussion
• Instructor will come to class early & stay after to answer questions
[Chart: attention vs. time, with markers at about 20 min., the break, and “In Conclusion, ...”]
Research Paper Reading
• As graduate students, you are now researchers.
• Most information of importance to you will be in
research papers.
• Ability to rapidly scan and understand research
papers is key to your success.
• So: you will read lots of papers in this course!
– Quick 1 paragraph summaries will be due in class
– Students will take turns discussing papers
• Papers will be scanned and on web page.
How will grading work?
• No TA This term!
• Tentative breakdown:
– 20% homeworks / paper presentations
– 30% exam
– 40% project (teams of 2)
– 10% participation
Application Trends
• Application demand for performance fuels advances in
hardware, which enables new appl’ns, which...
– Cycle drives exponential increase in microprocessor performance
– Drives parallel architecture harder
» most demanding applications
[Diagram: feedback cycle between New Applications and More Performance]
• Programmers willing to work really hard to improve
high-end applications
• Need incremental scalability:
– Need range of system performance with progressively increasing cost
Speedup
• Speedup (p processors) = Time (1 processor) / Time (p processors)
• Common mistake:
– Compare parallel program on 1 processor to parallel program
on p processors
• Wrong!:
– Should compare uniprocessor program on 1 processor to
parallel program on p processors
• Why? Keeps you honest
– It is easy to parallelize overhead.
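A tiny numeric illustration of the pitfall (all timings hypothetical):

# Hypothetical timings, in seconds:
t_best_sequential = 100.0    # best uniprocessor program on 1 processor
t_parallel_on_1   = 150.0    # parallel program run on 1 processor
t_parallel_on_32  = 10.0     # parallel program on 32 processors

print("honest speedup:  ", t_best_sequential / t_parallel_on_32)  # 10.0
print("inflated speedup:", t_parallel_on_1 / t_parallel_on_32)    # 15.0
# The 15x figure "credits" the parallelization for speeding up
# overhead that the sequential program never had.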
Amdahl's Law
Speedup due to enhancement E:

Speedup(E) = ExTime(without E) / ExTime(with E)
           = Performance(with E) / Performance(without E)
Suppose that enhancement E accelerates a
fraction F of the task by a factor S, and
the remainder of the task is unaffected
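Writing that out gives the standard Amdahl form; the worked example is added here for concreteness:

\[
\mathrm{Speedup}(E) = \frac{1}{(1 - F) + F/S}
\qquad \text{e.g., } F = 0.9,\ S = 10:\quad
\frac{1}{0.1 + 0.09} \approx 5.3
\]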
Amdahl’s Law for parallel programs?
ExTime_par = ExTime_ser × ((1 - Fraction_par) + Fraction_par / p) + ExTime_overhead(p, stuff)

Speedup = 1 / ((1 - Fraction_par) + Fraction_par / p + ExTime_overhead(p, stuff) / ExTime_ser)

Best you could ever hope to do:

Speedup_maximum = 1 / (1 - Fraction_par)
Worse: Overhead may kill your performance!
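A small sketch of that effect (my numbers; the assumption that overhead grows linearly with p is illustrative, not from the slide):

# Speedup = 1 / ((1 - F) + F/p + overhead(p)/ExTime_ser), with
# overhead(p) modeled as a fixed fraction of ExTime_ser per processor.
def speedup(frac_par, p, overhead_per_proc):
    return 1.0 / ((1.0 - frac_par) + frac_par / p + overhead_per_proc * p)

for p in (8, 64, 512):
    print(f"p = {p:3d}: speedup = {speedup(0.99, p, 1e-4):5.1f}")

# Prints about 7.4, 31.4, 15.8: past some p, the overhead term grows
# faster than the parallel term shrinks, and added processors hurt.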
Metrics of Performance
[Diagram: levels of the system stack, each with the performance metric natural at that level]
• Application: Answers per month, Operations per second
• Programming Language / Compiler: (millions of) Instructions per second: MIPS
• ISA: (millions of) (FP) operations per second: MFLOP/s
• Datapath / Control: Megabytes per second
• Function Units: Cycles per second (clock rate)
• Transistors, Wires, Pins
Commercial Computing
• Relies on parallelism for high end
– Computational power determines scale of business that can be
handled
• Databases, online-transaction processing, decision
support, data mining, data warehousing ...
• TPC benchmarks (TPC-C order entry, TPC-D decision
support)
– Explicit scaling criteria provided
– Size of enterprise scales with size of system
– Problem size not fixed as p increases
– Throughput is performance measure (transactions per minute or tpm)
TPC-C Results for March 1996
[Chart: throughput (tpmC) vs. number of processors (0 to 120) for Tandem Himalaya, DEC Alpha, SGI PowerChallenge, HP PA, IBM PowerPC, and other systems, March 1996; throughput reaches roughly 20,000 tpmC at the high end]
• Parallelism is pervasive
• Small to moderate scale parallelism very important
• Difficult to obtain snapshot to compare across vendor
platforms
Scientific Computing Demand
Engineering Computing Demand
• Large parallel machines a mainstay in many
industries
– Petroleum (reservoir analysis)
– Automotive (crash simulation, drag analysis, combustion
efficiency),
– Aeronautics (airflow analysis, engine efficiency, structural
mechanics, electromagnetism),
– Computer-aided design
– Pharmaceuticals (molecular modeling)
– Visualization
» in all of the above
» entertainment (films like Toy Story)
» architecture (walk-throughs and rendering)
– Financial modeling (yield and derivative analysis)
– etc.
Applications: Speech and Image Processing
[Chart: processing demand, 1 MIPS to 10 GIPS, vs. year (1980 to 1995): Sub-Band Speech Coding, 200-word Isolated Speech Recognition, Speaker Verification, CELP Speech Coding, ISDN-CD Stereo Receiver, 1,000-word Continuous Speech Recognition, CIF Video, 5,000-word Continuous Speech Recognition, HDTV Receiver, Telephone Number Recognition]
• Also CAD, Databases, . . .
• 100 processors gets you 10 years, 1000 gets you 20!
Is better parallel arch enough?
• AMBER molecular dynamics simulation program
• Starting point was vector code for Cray-1
• 145 MFLOPS on Cray C90; 406 for final version on 128-processor Paragon; 891 on 128-processor Cray T3D
Summary of Application Trends
• Transition to parallel computing has occurred
for scientific and engineering computing
• Rapid progress in commercial computing
– Database and transactions as well as financial
– Usually smaller-scale, but large-scale systems also used
• Desktop also uses multithreaded programs,
which are a lot like parallel programs
• Demand for improving throughput on sequential
workloads
– Greatest use of small-scale multiprocessors
• Solid application demand exists and will
increase
Technology Trends
[Chart: performance (log scale, 0.1 to 100) vs. year (1965 to 1995) for Supercomputers, Mainframes, Minicomputers, and Microprocessors; microprocessors improve fastest]
• Today the natural building-block is also fastest!
Can’t we just wait for it to get faster?
• Microprocessor performance increases 50% - 100% per year
• Transistor count doubles every 3 years
• DRAM size quadruples every 3 years
• Huge investment per generation is carried by huge commodity market
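Compounding those quoted rates over a decade (simple arithmetic, added here for perspective):

# Per-year / per-3-year rates quoted above, compounded over 10 years:
perf_low    = 1.50 ** 10     # 50%/yr  -> ~58x per decade
perf_high   = 2.00 ** 10     # 100%/yr -> ~1024x per decade
transistors = 2 ** (10 / 3)  # doubles every 3 years    -> ~10x
dram_size   = 4 ** (10 / 3)  # quadruples every 3 years -> ~102x

print(f"performance: {perf_low:.0f}x to {perf_high:.0f}x per decade")
print(f"transistors: {transistors:.0f}x, DRAM size: {dram_size:.0f}x per decade")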
[Chart: integer and FP performance, 1987 to 1992, rising from roughly 10 to 180: Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC Alpha]
Technology: A Closer Look
• Basic advance is decreasing feature size (λ)
– Circuits become either faster or lower in power
• Die size is growing too
– Clock rate improves roughly proportional to improvement in λ
– Number of transistors improves like λ² (or faster); see the relations spelled out at the end of this slide
• Performance > 100x per decade
– clock rate < 10x, rest is transistor count
• How to use more transistors?
– Parallelism in processing
» multiple operations per cycle reduces CPI
– Locality in data access
» avoids latency and reduces CPI
» also improves processor utilization
– Both need resources, so tradeoff
[Diagram: processor (Proc) with cache ($) attached to an interconnect]
• Fundamental issue is resource distribution, as in
uniprocessors
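The first-order scaling relations behind the λ bullets, spelled out (standard VLSI scaling; the halving example is added here):

\[
f_{\mathrm{clock}} \propto \frac{1}{\lambda},
\qquad
N_{\mathrm{transistors}} \propto \frac{A_{\mathrm{die}}}{\lambda^{2}}
\qquad \Longrightarrow \qquad
\lambda \to \lambda/2 :\ f \times 2,\ N \times 4\ (\text{more if } A_{\mathrm{die}} \text{ grows})
\]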
Growth Rates
[Charts: clock rate (MHz, log scale to 1,000) and transistor count (log scale to 100,000,000) vs. year, 1970 to 2005, for i4004, i8008, i8080, i8086, i80286, i80386, Pentium, R2000, R3000, and R10000]
• Clock rate: ~30% per year
• Transistor count: ~40% per year
Architectural Trends
• Architecture translates technology’s gifts into
performance and capability
• Resolves the tradeoff between parallelism and
locality
– Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
– Tradeoffs may change with scale and technology advances
• Understanding microprocessor architectural
trends
=> Helps build intuition about design issues of parallel machines
=> Shows fundamental role of parallelism even in “sequential”
computers
Phases in “VLSI” Generation
[Chart: transistor count (log scale, 1,000 to 100,000,000) vs. year, 1970 to 2005, from i4004 through i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000, annotated with phases: bit-level parallelism (to ~1985), instruction-level parallelism (mid-80s to mid-90s), thread-level parallelism (?)]
Architectural Trends
• Greatest trend in VLSI generation is increase in
parallelism
– Up to 1985: bit-level parallelism: 4-bit → 8-bit → 16-bit
» slows after 32 bit
» adoption of 64-bit now under way, 128-bit far (not
performance issue)
» great inflection point when 32-bit micro and cache fit on a
chip
– Mid 80s to mid 90s: instruction level parallelism
» pipelining and simple instruction sets, + compiler advances
(RISC)
» on-chip caches and functional units => superscalar
execution
» greater sophistication: out of order execution, speculation,
prediction
• to deal with control transfer and latency problems
– Next step: thread level parallelism? Bit-level parallelism?
How far will ILP go?
[Charts: (left) fraction of total cycles (%) vs. number of instructions issued per cycle (0 to 6+); (right) speedup (0 to 3) vs. instructions issued per cycle (0 to 15)]
• Infinite resources and fetch bandwidth, perfect
branch prediction and renaming
– real caches and non-zero miss latencies
Thread-Level Parallelism “on board”
[Diagram: four processors (Proc) sharing a memory (MEM) over one bus]
• Micro on a chip makes it natural to connect many to
shared memory
– dominates server and enterprise market, moving down to desktop
• Alternative: many PCs sharing one complicated pipe
• Faster processors began to saturate bus, then bus
technology advanced
– today, range of sizes for bus-based systems, desktop to large servers
No. of processors in fully configured commercial shared-memory systems
What about Multiprocessor Trends?
[Chart: number of processors (0 to 70) in fully configured commercial bus-based shared-memory systems, 1984 to 1998: from Sequent B8000/B2100 and Symmetry 21/81 through SGI PowerSeries and PowerChallenge/XL, Sun SC2000/SS1000/E6000, SS690MP, SE10 through SE70, AS2100/AS8400, HP K400, SS10/SS20, and P-Pro, up to CRAY CS6400 and Sun E10000 at roughly 60 to 64 processors]
Bus Bandwidth
[Chart: shared bus bandwidth (MB/s, log scale 10 to 100,000) vs. year, 1984 to 1998: from Sequent B8000 near 10 MB/s through Symmetry, Power, SGI PowerSeries, SS690MP, SGI Challenge, SC2000, SS1000/SS1000E, AS2100/AS8400, HP K400, P-Pro, and Sun E6000, up to Sun E10000 above 10,000 MB/s]
What about Storage Trends?
• Divergence between memory capacity and speed even more
pronounced
– Capacity increased by 1000x from 1980-95, speed only 2x
– Gigabit DRAM by c. 2000, but gap with processor speed much greater
• Larger memories are slower, while processors get faster
– Need to transfer more data in parallel
– Need deeper cache hierarchies
– How to organize caches?
• Parallelism increases effective size of each level of hierarchy,
without increasing access time
• Parallelism and locality within memory systems too
– New designs fetch many bits within memory chip; follow with fast pipelined
transfer across narrower interface
– Buffer caches most recently accessed data
– Processor in memory?
• Disks too: Parallel disks plus caching
Economics
• Commodity microprocessors not only fast but CHEAP
– Development costs tens of millions of dollars
– BUT, many more are sold compared to supercomputers
– Crucial to take advantage of the investment, and use the commodity
building block
• Multiprocessors being pushed by software vendors (e.g.
database) as well as hardware vendors
• Standardization makes small, bus-based SMPs
commodity
• Desktop: few smaller processors versus one larger one?
• Multiprocessor on a chip?
Can anyone afford high-end MPPs???
• ASCI (Accelerated Strategic Computing Initiative)
• ASCI White: Built by IBM
– 12.3 TeraOps, 8192 processors (RS/6000)
– 6TB of RAM, 160TB Disk
– 2 basketball courts in size
– Program it??? Message passing
Consider Scientific Supercomputing
• Proving ground and driver for innovative architecture and
techniques
– Market smaller relative to commercial as MPs become mainstream
– Dominated by vector machines starting in 70s
– Microprocessors have made huge gains in floating-point
performance
» high clock rates
» pipelined floating point units (e.g., multiply-add every cycle)
» instruction-level parallelism
» effective use of caches (e.g., automatic blocking)
– Plus economics
• Large-scale multiprocessors replace vector
supercomputers
Raw Uniprocessor Performance: LINPACK
[Chart: LINPACK MFLOPS (log scale, 1 to 10,000) vs. year, 1975 to 2000, for n = 100 and n = 1,000, comparing CRAY vector machines (CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94) with microprocessors (Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, HP 9000/750, HP 9000/735, DEC Alpha AXP, DEC Alpha, DEC 8200, MIPS R4400, IBM Power2/990); micros close most of the gap by the mid-1990s]
Raw Parallel Performance: LINPACK
[Chart: LINPACK GFLOPS (log scale, 0.1 to 10,000) vs. year, 1985 to 1996: MPP peak (nCUBE/2 (1024), iPSC/860, CM-2, CM-200, Delta, Paragon XP/S, CM-5, T3D, Paragon XP/S MP with 1024 and 6768 processors, ASCI Red) vs. CRAY peak (Xmp/416 (4), Ymp/832 (8), C90 (16), T932 (32))]
• Even vector Crays became parallel
– X-MP (2-4), Y-MP (8), C-90 (16), T94 (32)
• Since 1993, Cray produces MPPs too (T3D, T3E)
500 Fastest Computers
[Chart: number of systems by type (MPP, PVP, SMP) among the 500 fastest computers, 11/93 to 11/96; MPPs grow from 187 to about 319 over the period]
Summary: Why Parallel Architecture?
• Increasingly attractive
– Economics, technology, architecture, application demand
• Increasingly central and mainstream
• Parallelism exploited at many levels
– Instruction-level parallelism
– Multiprocessor servers
– Large-scale multiprocessors (“MPPs”)
• Focus of this class: multiprocessor level of parallelism
• Same story from memory system perspective
– Increase bandwidth, reduce average latency with many local
memories
• Spectrum of parallel architectures make sense
– Different cost, performance and scalability
Where is Parallel Arch Going?
Old view: Divergent architectures, no predictable pattern of growth.
[Diagram: divergent architectures (Systolic Arrays, SIMD, Dataflow, Message Passing, Shared Memory), each with its own application software, system software, and architecture]
• Uncertainty of direction paralyzed parallel software development!
Today
• Extension of “computer architecture” to support communication
and cooperation
– Instruction Set Architecture plus Communication Architecture
• Defines
– Critical abstractions, boundaries, and primitives (interfaces)
– Organizational structures that implement interfaces (hw or sw)
• Compilers, libraries and OS are important bridges today