CS433 Introduction - Parallel Programming Laboratory


CS433
Introduction
Laxmikant Kale
Course objectives and outline
• See the course outline document for details
• You will learn about:
– Parallel architectures
• Cache-coherent Shared memory, distributed memory, networks
– Parallel programming models
• Emphasis on 3: message passing, shared memory, and shared objects (a small message-passing sketch follows this outline)
– Performance analysis of parallel applications
– Commonly needed parallel algorithms/operations
– Parallel application case studies
• Significant (in effort and grade percentage) course project
– groups of 5 students
• Homeworks/machine problems:
– biweekly (sometimes weekly)
• Parallel machines:
– NCSA Origin 2000, PC/SUN clusters
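
As a first taste of the message-passing model emphasized above, here is a minimal sketch in C using standard MPI calls (the scenario, the value 42, and the tag 0 are arbitrary illustration, not a course assignment): rank 0 sends an integer to rank 1, which increments it and sends it back.

/* Minimal MPI ping-pong sketch. Build with mpicc, run with mpirun -np 2. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
        printf("rank 0 received %d back\n", value);   /* prints 43 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        value++;                                      /* do some "work" */
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}

The same exchange looks quite different under shared memory (a plain load/store plus synchronization), which is exactly the contrast among programming models the course will draw.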
Resources
• Much of the course will be run via the web
– Lecture slides and assignments will be available on the course web page
• http://www-courses.cs.uiuc.edu/~cs433
– Project groups will coordinate and submit information via the web
• Web pages for individual projects will be linked from the course web page
– Newsgroup: uiuc.class.cs433
• You are expected to read the newsgroup and web pages regularly
Advent of parallel computing
• “Parallel computing is necessary to increase speeds”
– cry of the ‘70s
– processors kept pace with Moore’s law:
• Doubling speeds every 18 months (worked out below)
• Now, finally, the time is ripe
– uniprocessors are commodities (and processor speeds show signs of slowing down)
– Highly economical to build parallel machines
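
For concreteness, the 18-month doubling claim compounds as follows (a quick calculation, not from the slide):

\[
\text{speedup over } t \text{ years} = 2^{\,t/1.5},
\qquad
2^{\,10/1.5} = 2^{\,6.67} \approx 100\times \text{ per decade.}
\]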
Technology Trends
[Figure: performance (log scale, 0.1-100) of supercomputers, mainframes, minicomputers, and microprocessors, 1965-1995]
The natural building block for multiprocessors is now also about the fastest!
General Technology Trends
• Microprocessor performance increases 50% - 100% per year
• Transistor count doubles every 3 years
• DRAM size quadruples every 3 years
• Huge investment per generation is carried by huge commodity market
[Figure: integer and floating-point (FP) performance of microprocessors, 1987-1992, on a 0-180 scale: Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC Alpha]
• The point is not that single-processor performance is plateauing, but that parallelism is a natural way to improve it.
Technology: A Closer Look
• Basic advance is decreasing feature size (λ)
– Circuits become either faster or lower in power
• Die size is growing too
– Clock rate improves roughly in proportion to the improvement in λ
– Number of transistors improves like λ² (or faster)
• Performance > 100x per decade; clock rate ~10x, the rest from transistor count (see the arithmetic below)
• How to use more transistors?
– Parallelism in processing
• multiple operations per cycle reduces CPI
– Locality in data access
• avoids latency and reduces CPI
• also improves processor utilization
[Diagram: processor (Proc) with cache ($) attached to the interconnect]
– Both need resources, so tradeoff
• Fundamental issue is resource distribution, as in uniprocessors
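
Putting this slide's numbers together (attributing the non-clock remainder of the decade's gain to transistor-fed parallelism and locality, as the bullets above suggest):

\[
\underbrace{\sim 10\times}_{\text{clock rate}} \;\times\; \underbrace{\sim 10\times}_{\text{more transistors (parallelism, caches)}} \;\approx\; 100\times \text{ per decade.}
\]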
Clock Frequency Growth Rate
[Figure: clock rate (MHz, log scale 0.1-1,000) of microprocessors, 1970-2005: i4004, i8008, i8080, i8086, i80286, i80386, Pentium100, R10000]
• Clock rate grows about 30% per year
Transistor Count Growth Rate
[Figure: transistor count (log scale, 1,000-100,000,000) of microprocessors, 1970-2005: i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
• 100 million transistors on a chip by the early 2000s
• Transistor count grows much faster than clock rate
– 40% per year, an order of magnitude more contribution over two decades
Similar Story for Storage
• Divergence between memory capacity and speed more
pronounced
– Capacity increased by 1000x from 1980-95, speed only 2x
– Gigabit DRAM by c. 2000, but gap with processor speed much
greater
• Larger memories are slower, while processors get faster
– Need to transfer more data in parallel
– Need deeper cache hierarchies
– How to organize caches?
• Parallelism increases effective size of each level of hierarchy,
without increasing access time
• Parallelism and locality within memory systems too
– New designs fetch many bits within memory chip; follow with fast
pipelined transfer across narrower interface
Architectural Trends
• Architecture translates technology’s gifts to performance and
capability
• Resolves the tradeoff between parallelism and locality
– Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip
connect
– Tradeoffs may change with scale and technology advances
• Understanding microprocessor architectural trends
– Helps build intuition about design issues for parallel machines
– Shows fundamental role of parallelism even in “sequential”
computers
• Four generations of architectural history: tube, transistor, IC, VLSI
– Here focus only on VLSI generation
• Greatest delineation in VLSI has been in type of parallelism
exploited
Architectural Trends
• Greatest trend in VLSI generation is increase in parallelism
– Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
• slows after 32-bit
• adoption of 64-bit now under way, 128-bit far off (not a performance issue)
• great inflection point when 32-bit micro and cache fit on a chip
– Mid 80s to mid 90s: instruction-level parallelism (illustrated in the sketch below)
• pipelining and simple instruction sets, + compiler advances (RISC)
• on-chip caches and functional units => superscalar execution
• greater sophistication: out of order execution, speculation, prediction
– to deal with control transfer and latency problems
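
To see what pipelined, superscalar, out-of-order hardware is exploiting, consider this small C sketch (illustrative only; the function names are mine, and n is assumed divisible by 4): a single accumulator forms one long dependence chain, while independent accumulators expose instruction-level parallelism.

/* sum_chain: every add depends on the previous one, so the loop runs
 * at one add per floating-point latency.
 * sum_ilp: four independent chains can be in flight at once on a
 * pipelined, superscalar FPU. Assumes n % 4 == 0. */
double sum_chain(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

double sum_ilp(const double *a, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];      /* the four chains are independent, */
        s1 += a[i + 1];  /* so their adds can issue in parallel */
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}

Compilers often perform this unrolling themselves; the point is that the hardware trend rewards code with independent operations.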
Architectural Trends: Bus-based MPs
• Micro on a chip makes it natural to connect many to shared memory
– dominates server and enterprise market, moving down to desktop
• Faster processors began to saturate bus, then bus technology advanced
– today, range of sizes for bus-based systems, desktop to large servers
[Figure: number of processors (0-70) in bus-based shared-memory systems, 1984-1998: Sequent B8000, Sequent B2100, Symmetry21, Symmetry81, Power, SGI PowerSeries, SGI PowerChallenge/XL, SS690MP 120/140, SS10, SS20, Sun SC2000, SC2000E, SS1000, SS1000E, SE10-SE70, Sun E6000, Sun E10000, CRAY CS6400, AS2100, AS8400, HP K400, P-Pro]
No. of processors in fully configured commercial shared-memory systems
Bus Bandwidth
[Figure: shared-bus bandwidth (MB/s, log scale 10-100,000) for the same systems, 1984-1998, rising from roughly 10 MB/s (Sequent B8000) to over 10,000 MB/s (Sun E10000)]
Economics
• Commodity microprocessors not only fast but CHEAP
• Development cost is tens of millions of dollars ($5-100 million typical)
• BUT, many more are sold compared to supercomputers
– Crucial to take advantage of the investment, and use the
commodity building block
– Exotic parallel architectures amount to no more than special-purpose machines
• Multiprocessors being pushed by software vendors (e.g.
database) as well as hardware vendors
• Standardization by Intel makes small, bus-based SMPs
commodity
• Desktop: few smaller processors versus one larger one?
– Multiprocessor on a chip
What to Expect?
• Parallel Machine classes:
– Cost and usage define a class! Architecture of a class may change.
– Desktops, engineering workstations, database/web servers, supercomputers
• Commodity (home/office) desktop:
– less than $10,000
– possible to provide 10-50 processors for that price!
– Driver applications:
• games, video/signal processing,
• possibly “peripheral” AI: speech recognition, natural language
understanding (?), smart spaces and agents
• New applications?
Engineering workstations
• Price: less than $100,000 (used to be):
– the new acceptable price level may be $50,000
– 100+ processors, large memory
– Driver applications:
• CAD (computer-aided design) of various sorts
• VLSI
• Structural and mechanical simulations...
• Etc. (many specialized applications)
Commercial Servers
• Price range: variable ($10,000 - several hundreds of thousands)
– defining characteristic: usage
– Database servers, decision support (MIS), web servers, e-commerce
• High availability, fault tolerance are main criteria
• Trends to watch out for:
– Likely emergence of specialized architectures/systems
• E.g. Oracle’s “No Native OS” approach
• Currently dominated by database servers, and TPC benchmarks
– TPC: transactions per second
– But this may change to data mining and application servers, with corresponding impact on architecture.
Supercomputers
• “Definition”: expensive system?!
– Used to be defined by architecture (vector processors, ..)
– More than a million US dollars?
– Thousands of processors
• Driving applications
– Grand challenges in science and engineering:
– Global weather modeling and forecast
– Rational drug design / molecular simulations
– Processing of genetic (genome) information
– Rocket simulation
– Airplane design (wings and fluid flow...)
– Operations research?? Not recognized yet
– Other non-traditional applications?
Consider Scientific Supercomputing
• Proving ground and driver for innovative architecture and
techniques
– Market smaller relative to commercial as MPs become mainstream
– Dominated by vector machines starting in 70s
– Microprocessors have made huge gains in floating-point
performance
• high clock rates
• pipelined floating-point units (e.g., multiply-add every cycle)
• instruction-level parallelism
• effective use of caches (e.g., automatic blocking; see the sketch below)
– Plus economics
• Large-scale multiprocessors replace vector supercomputers
– Well under way already
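
As a concrete instance of the automatic blocking mentioned above, here is a hedged C sketch of a cache-blocked matrix multiply (the function name and the tile size 32 are arbitrary illustrative choices, and n is assumed to be a multiple of the tile size):

/* Blocked matrix multiply: C += A * B, all n x n, row-major.
 * Each trio of TILE x TILE blocks fits in cache, so inner-loop
 * references are mostly cache hits. Assumes n % TILE == 0. */
#define TILE 32
void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                /* multiply one TILE x TILE tile pair */
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        double a = A[i * n + k];   /* reused across j */
                        for (int j = jj; j < jj + TILE; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}

Compilers that block "automatically" perform essentially this transformation; it is also the kind of locality optimization that matters even more on parallel machines.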
Scientific Computing Demand
Engineering Computing Demand
• Large parallel machines a mainstay in many industries
– Petroleum (reservoir analysis)
– Automotive (crash simulation, drag analysis, combustion efficiency),
– Aeronautics (airflow analysis, engine efficiency, structural
mechanics, electromagnetism),
– Computer-aided design
– Pharmaceuticals (molecular modeling)
– Visualization
• in all of the above
• entertainment (films like Toy Story)
• architecture (walk-throughs and rendering)
– Financial modeling (yield and derivative analysis)
– etc.
Applications: Speech and Image Processing
[Figure: computing demand of speech and image processing applications, 1980-1995, on a scale from 1 MIPS to 10 GIPS: sub-band speech coding, 200-word isolated speech recognition, telephone number recognition, CELP speech coding, speaker verification, 1,000-word continuous speech recognition, ISDN-CD stereo receiver, 5,000-word continuous speech recognition, CIF video, HDTV receiver]
• Also CAD, databases, ...
• 100 processors gets you 10 years; 1,000 gets you 20!
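
The years-of-growth claim can be checked: if uniprocessor performance grows at a fraction g per year, a p-fold parallel speedup equals t = ln p / ln(1+g) years of waiting. The slide's two figures correspond to assumed growth rates of roughly 58% and 41% per year (my inference; both are within the range of rates quoted earlier in the deck):

\[
g = 0.58:\quad t = \frac{\ln 100}{\ln 1.58} \approx 10 \text{ years};
\qquad
g = 0.41:\quad t = \frac{\ln 1000}{\ln 1.41} \approx 20 \text{ years}.
\]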
Learning Curve for Parallel Applications
• AMBER molecular dynamics simulation program
• Starting point was vector code for Cray-1
• 145 MFLOPS on the Cray C90; 406 MFLOPS for the final version on a 128-processor Intel Paragon; 891 MFLOPS on a 128-processor Cray T3D
Raw Uniprocessor Performance: LINPACK
[Figure: LINPACK performance (MFLOPS, log scale 1-10,000), 1975-2000, for CRAY vector machines (CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94) and microprocessors (Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, HP 9000/750, DEC Alpha AXP, HP 9000/735, MIPS R4400, DEC Alpha, IBM Power2/990, DEC 8200), each at n = 100 and n = 1,000; microprocessors close most of the gap with the vector machines by the mid-1990s]
500 Fastest Computers
[Figure: number of systems by type among the 500 fastest computers, 11/93-11/96: MPPs grow 187 -> 198 -> 284 -> 319; PVPs (parallel vector processors) decline 313 -> 239 -> 106 -> 73; SMPs go 63 -> 110 -> 106]