CS433 Introduction - Parallel Programming Laboratory
CS433
Introduction
Laxmikant Kale
Course objectives and outline
• See the course outline document for details
• You will learn about:
– Parallel architectures
• Cache-coherent Shared memory, distributed memory, networks
– Parallel programming models
• Emphasis on 3: message passing, shared memory, and shared objects
– Performance analysis of parallel applications
– Commonly needed parallel algorithms/operations
– Parallel application case studies
• Significant (effort and grade percentage) course project
– groups of 5 students
• Homeworks/machine problems:
– biweekly (sometimes weekly)
• Parallel machines:
– NCSA Origin 2000, PC/SUN clusters
Resources
• Much of the course will be run via the web
– Lecture slides and assignments will be available on the course web page
• http://www-courses.cs.uiuc.edu/~cs433
– Projects will coordinate and submit information on the web
• Web pages for individual projects will be linked to the course web page
– Newsgroup: uiuc.class.cs433
• You are expected to read the newsgroup and web pages regularly
Advent of parallel computing
• “Parallel computing is necessary to increase speeds”
– cry of the ‘70s
– processors kept pace with Moore’s law:
• Doubling speeds every 18 months
• Now, finally, the time is ripe
– uniprocessors are commodities (and processor speeds show signs of
slowing down)
– Highly economical to build parallel machines
Technology Trends
[Figure: performance (log scale, 0.1 to 100) versus year, 1965 to 1995, for supercomputers, mainframes, minicomputers, and microprocessors]
The natural building block for multiprocessors is now also about the fastest!
General Technology Trends
• Microprocessor performance increases 50% - 100% per year
• Transistor count doubles every 3 years
• DRAM size quadruples every 3 years
• Huge investment per generation is carried by huge commodity market
[Figure: integer and FP performance (scale 0 to 180), 1987 to 1992, for Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, and DEC Alpha]
• Not that single-processor performance is plateauing, but that
parallelism is a natural way to improve it.
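The growth figures above are quoted in different units (percent per year, doubling every 3 years, quadrupling every 3 years). A minimal sketch, assuming steady exponential growth, converts them to a common annual rate so they can be compared. The function name `annual_growth` is illustrative and not part of the course material.

```python
def annual_growth(factor: float, years: float) -> float:
    """Annual growth rate implied by growing `factor`-fold every `years` years.

    Assumes steady exponential growth, which is a simplification.
    """
    return factor ** (1.0 / years) - 1.0


# Figures quoted on the slides.
rates = {
    "transistor count (2x every 3 years)": annual_growth(2, 3),
    "DRAM size (4x every 3 years)": annual_growth(4, 3),
    "speed doubling every 18 months": annual_growth(2, 1.5),
}

for label, rate in rates.items():
    print(f"{label}: about {rate:.0%} per year")
```

Under this assumption, quadrupling every 3 years and doubling every 18 months are the same rate, which is why the two entries print the same percentage.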
Technology: A Closer Look
• Basic advance is decreasing feature size (λ)
– Circuits become either faster or lower in power
• Die size is growing too
– Clock rate improves roughly proportional to improvement in λ
– Number of transistors improves like λ² (or faster)
• Performance > 100x per decade; clock rate 10x, rest transistor count
• How to use more transistors?
– Parallelism in processing
• multiple operations per cycle reduces CPI
– Locality in data access
• avoids latency and reduces CPI
• also improves processor utilization
– Both need resources, so tradeoff
• Fundamental issue is resource distribution, as in uniprocessors
[Figure: processor (Proc), cache ($), and interconnect]
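To see why a lower CPI matters, the sketch below uses the standard textbook relation (execution time = instruction count × CPI / clock rate), which the slide does not state explicitly. The instruction count, clock rate, and CPI values are hypothetical and chosen only for illustration.

```python
def execution_time(instructions: float, cpi: float, clock_hz: float) -> float:
    """Seconds to run a program: instruction count x cycles per instruction / clock rate."""
    return instructions * cpi / clock_hz


# Hypothetical workload and machine, not taken from the slides.
INSTRUCTIONS = 1e9
CLOCK_HZ = 200e6

baseline = execution_time(INSTRUCTIONS, cpi=1.5, clock_hz=CLOCK_HZ)
# Issuing several operations per cycle, or avoiding memory stalls through
# better locality, both show up as a lower effective CPI.
improved = execution_time(INSTRUCTIONS, cpi=0.75, clock_hz=CLOCK_HZ)

print(f"CPI 1.50: {baseline:.2f} s")
print(f"CPI 0.75: {improved:.2f} s")
print(f"speedup:  {baseline / improved:.1f}x")
```

With the clock rate held fixed, halving CPI halves the run time, so parallelism and locality both act on the same term of the relation.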
Clock Frequency Growth Rate
[Figure: clock rate (MHz, log scale, 0.1 to 1,000) versus year, 1970 to 2005, for i4004, i8008, i8080, i8086, i80286, i80386, Pentium100, and R10000]
• 30% per year
Transistor Count Growth Rate
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year, 1970 to 2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000]
• 100 million transistors on chip by the early 2000s
• Transistor count grows much faster than clock rate
– 40% per year, order of magnitude more contribution in 2 decades
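As a rough check on these rates, the sketch below compounds 30% per year (clock rate) and 40% per year (transistor count) over one and two decades. The two rates come from the slides. The function name is illustrative, and because the slide figures are rounded, the results are order-of-magnitude guides only.

```python
def compounded(rate_per_year: float, years: int) -> float:
    """Total growth factor after `years` years at a fixed annual rate."""
    return (1.0 + rate_per_year) ** years


CLOCK_RATE_GROWTH = 0.30   # from the clock frequency slide
TRANSISTOR_GROWTH = 0.40   # from the transistor count slide

for years in (10, 20):
    clock = compounded(CLOCK_RATE_GROWTH, years)
    transistors = compounded(TRANSISTOR_GROWTH, years)
    print(f"{years} years: clock rate x{clock:.0f}, transistor count x{transistors:.0f}")
```

A gap of ten percentage points in the annual rate looks small, but compounding widens it every year, which is the point of comparing the two curves over decades.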
Similar Story for Storage
• Divergence between memory capacity and speed more
pronounced
– Capacity increased by 1000x from 1980-95, speed only 2x
– Gigabit DRAM by c. 2000, but gap with processor speed much
greater
• Larger memories are slower, while processors get faster
– Need to transfer more data in parallel
– Need deeper cache hierarchies
– How to organize caches?
• Parallelism increases effective size of each level of hierarchy,
without increasing access time
• Parallelism and locality within memory systems too
– New designs fetch many bits within memory chip; follow with fast
pipelined transfer across narrower interface
Architectural Trends
• Architecture translates technology’s gifts to performance and
capability
• Resolves the tradeoff between parallelism and locality
– Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip
connect
– Tradeoffs may change with scale and technology advances
• Understanding microprocessor architectural trends
– Helps build intuition about design issues of parallel machines
– Shows fundamental role of parallelism even in “sequential”
computers
• Four generations of architectural history: tube, transistor, IC, VLSI
– Here focus only on VLSI generation
• Greatest delineation in VLSI has been in type of parallelism
exploited
Architectural Trends
• Greatest trend in VLSI generation is increase in parallelism
– Up to 1985: bit level parallelism: 4-bit -> 8 bit -> 16-bit
• slows after 32 bit
• adoption of 64-bit now under way, 128-bit far (not performance issue)
• great inflection point when 32-bit micro and cache fit on a chip
– Mid 80s to mid 90s: instruction level parallelism
• pipelining and simple instruction sets, + compiler advances (RISC)
• on-chip caches and functional units => superscalar execution
• greater sophistication: out of order execution, speculation, prediction
– to deal with control transfer and latency problems
Architectural Trends: Bus-based MPs
• Micro on a chip makes it natural to connect many to shared memory
– dominates server and enterprise market, moving down to desktop
• Faster processors began to saturate bus, then bus technology advanced
– today, range of sizes for bus-based systems, desktop to large servers
[Figure: No. of processors in fully configured commercial shared-memory systems (scale 0 to 70), 1984 to 1998. Systems plotted include Sequent B8000 and B2100, Symmetry21 and Symmetry81, SGI PowerSeries, SGI Challenge, SGI PowerChallenge/XL, SE10, SE30, SE60, SE70, SS10, SS20, SS690MP 120, SS690MP 140, SS1000, SS1000E, Sun SC2000, SC2000E, Sun E6000, Sun E10000, CRAY CS6400, AS2100, AS8400, HP K400, and P-Pro]
Bus Bandwidth
[Figure: shared bus bandwidth (MB/s, log scale, 10 to 100,000) versus year, 1984 to 1998. Systems plotted include Sequent B8000 and B2100, Symmetry81/21, SGI PowerSeries, SGI Challenge, SGI PowerCh XL, SE10, SE30, SE60, SE70, SS10, SS20, SS690MP 120, SS690MP 140, SS1000, SS1000E, SC2000, SC2000E, CS6400, AS2100, AS8400, HPK400, P-Pro, Sun E6000, and Sun E10000]
Economics
• Commodity microprocessors not only fast but CHEAP
• Development cost is tens of millions of dollars (5-100 typical)
• BUT, many more are sold compared to supercomputers
– Crucial to take advantage of the investment, and use the
commodity building block
– Exotic parallel architectures no more than special-purpose
• Multiprocessors being pushed by software vendors (e.g.
database) as well as hardware vendors
• Standardization by Intel makes small, bus-based SMPs
commodity
• Desktop: few smaller processors versus one larger one?
– Multiprocessor on a chip
What to Expect?
• Parallel Machine classes:
– Cost and usage defines a class! Architecture of a class may change.
– Desktops, engineering workstations, database/web servers,
supercomputers
• Commodity (home/office) desktop:
– less than $10,000
– possible to provide 10-50 processors for that price!
– Driver applications:
• games, video/signal processing,
• possibly “peripheral” AI: speech recognition, natural language
understanding (?), smart spaces and agents
• New applications?
Engineering workstations
• Price: less than $100,000 (used to be)
– new acceptable price level may be $50,000
– 100+ processors, large memory
– Driver applications:
• CAD (computer-aided design) of various sorts
• VLSI
• Structural and mechanical simulations…
• Etc. (many specialized applications)
Commercial Servers
• Price range: variable ($10,000 - several hundreds of thousands)
– defining characteristic: usage
– Database servers, decision support (MIS), web servers, e-commerce
• High availability, fault tolerance are main criteria
• Trends to watch out for:
– Likely emergence of specialized architectures/systems
• E.g. Oracle’s “No Native OS” approach
• Currently dominated by database servers, and TPC benchmarks
– TPC: Transaction Processing Performance Council, whose benchmarks report transaction throughput
– But this may change to data mining and application servers, with
corresponding impact on architecture.
Supercomputers
• “Definition”: expensive system?!
– Used to be defined by architecture (vector processors, ..)
– More than a million US dollars?
– Thousands of processors
• Driving applications
– Grand challenges in science and engineering:
– Global weather modeling and forecast
– Rational drug design / molecular simulations
– Processing of genetic (genome) information
– Rocket simulation
– Airplane design (wings and fluid flow..)
– Operations research?? Not recognized yet
– Other non-traditional applications?
Consider Scientific Supercomputing
• Proving ground and driver for innovative architecture and
techniques
– Market smaller relative to commercial as MPs become mainstream
– Dominated by vector machines starting in 70s
– Microprocessors have made huge gains in floating-point
performance
• high clock rates
• pipelined floating point units (e.g., multiply-add every cycle)
• instruction-level parallelism
• effective use of caches (e.g., automatic blocking)
– Plus economics
• Large-scale multiprocessors replace vector supercomputers
– Well under way already
Scientific Computing Demand
Engineering Computing Demand
• Large parallel machines a mainstay in many industries
– Petroleum (reservoir analysis)
– Automotive (crash simulation, drag analysis, combustion efficiency),
– Aeronautics (airflow analysis, engine efficiency, structural
mechanics, electromagnetism),
– Computer-aided design
– Pharmaceuticals (molecular modeling)
– Visualization
• in all of the above
• entertainment (films like Toy Story)
• architecture (walk-throughs and rendering)
– Financial modeling (yield and derivative analysis)
– etc.
Applications: Speech and Image Processing
[Figure: processing requirements (log scale, 1 MIPS to 10 GIPS) versus year, 1980 to 1995, for sub-band speech coding, 200-word isolated speech recognition, CELP speech coding, speaker verification, ISDN-CD stereo receiver, 1,000-word continuous speech recognition, 5,000-word continuous speech recognition, telephone number recognition, CIF video, and HDTV receiver]
• Also CAD, Databases, . . .
• 100 processors get you 10 years ahead of uniprocessor performance growth, 1000 get you 20!
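The rule of thumb above can be sketched as a calculation: how many years of uniprocessor improvement does a parallel machine of P processors match? The sketch assumes ideal speedup (efficiency of 1.0 by default) and steady annual growth in uniprocessor performance. The function name and the `efficiency` parameter are illustrative, and the 50% rate is only the low end of the 50%-100% range quoted earlier in the deck.

```python
import math


def years_ahead(processors: int, annual_growth: float, efficiency: float = 1.0) -> float:
    """Years of uniprocessor improvement matched by a parallel machine.

    Assumes uniprocessor performance grows by `annual_growth` per year
    (0.5 means 50%) and that the parallel machine achieves a speedup of
    `efficiency * processors`.
    """
    speedup = processors * efficiency
    return math.log(speedup) / math.log(1.0 + annual_growth)


for p in (100, 1000):
    print(f"{p} processors at 50%/year growth: about {years_ahead(p, 0.5):.1f} years")
```

The exact number of years depends strongly on the assumed growth rate and on how close the application gets to ideal speedup, so the slide's round figures are best read as an approximation.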
Learning Curve for Parallel Applications
• AMBER molecular dynamics simulation program
• Starting point was vector code for Cray-1
• 145 MFLOPS on Cray90, 406 for final version on 128-processor Paragon,
891 on 128-processor Cray T3D
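The AMBER figures can be put side by side with simple arithmetic. The sketch below uses only the three rates quoted above and the 128-processor counts. The helper name `summarize` is illustrative, and no processor count is assumed for the Cray90 run because the slide does not give one.

```python
CRAY90_MFLOPS = 145  # vector-machine rate quoted on the slide

# (MFLOPS, processors) for the two parallel runs quoted on the slide.
parallel_runs = {
    "Paragon": (406, 128),
    "Cray T3D": (891, 128),
}


def summarize(mflops: float, processors: int) -> tuple[float, float]:
    """Return (rate relative to the Cray90 figure, MFLOPS per processor)."""
    return mflops / CRAY90_MFLOPS, mflops / processors


for machine, (mflops, processors) in parallel_runs.items():
    ratio, per_proc = summarize(mflops, processors)
    print(f"{machine}: {ratio:.1f}x the Cray90 rate, {per_proc:.1f} MFLOPS per processor")
```

Each of the 128 processors delivers only a few MFLOPS on this code, yet the aggregate rate is several times the vector machine's figure.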
Raw Uniprocessor Performance: LINPACK
[Figure: LINPACK (MFLOPS, log scale, 1 to 10,000) versus year, 1975 to 2000, for CRAY and microprocessor systems at n = 100 and n = 1,000. CRAY: 1s, Xmp/14se, Xmp/416, Ymp, C90, T94. Micro: Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, HP 9000/750, DEC Alpha, DEC Alpha AXP, HP9000/735, MIPS R4400, IBM Power2/990, DEC 8200]
500 Fastest Computers
[Figure: number of systems (scale 0 to 350) among the 500 fastest computers by type (MPP, PVP, SMP), 11/93 to 11/96]