Today
• About the class
• Introductions
• Any new people?
• Start of first module: Parallel Computing
1
Goals for This Module
• Overview of Parallel Architecture and Programming Models
  – Drivers of Parallel Computing (Appl, Tech trends, Arch., Economics)
  – Trends in “Supercomputers” for Scientific Computing
  – Evolution and Convergence of Parallel Architectures
  – Fundamental Issues in Programming Models and Architecture
• Parallel programs
  – Process of parallelization
  – What parallel programs look like in major programming models
• Programming for performance
  – Key performance issues and architectural interactions
2
Overview of Parallel Architecture and Programming Models
What is a Parallel Computer?
A collection of processing elements that cooperate to solve large problems fast
Some broad issues that distinguish parallel computers:
• Resource Allocation:
  – how large a collection?
  – how powerful are the elements?
  – how much memory?
• Data access, Communication and Synchronization
  – how do the elements cooperate and communicate?
  – how are data transmitted between processors?
  – what are the abstractions and primitives for cooperation?
• Performance and Scalability
  – how does it all translate into performance?
  – how does it scale?
4
Why Parallelism?
• Provides an alternative to a faster clock for improving performance
• Assuming effective per-node performance doubles every 2 years, a 1024-CPU system delivers the performance it would take a single-CPU system 20 years to reach (quick arithmetic check below)
• Applies at all levels of system design
• Is increasingly central in information processing
  – Scientific computing: simulation, data analysis, data storage and management, etc.
  – Commercial computing: transaction processing, databases
  – Internet applications: search; Google operates at least 50,000 CPUs, many as part of large parallel systems
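A quick check of the doubling arithmetic claimed above (assuming, as the slide does, one performance doubling every two years and ideal scaling across CPUs):

$$\frac{20 \text{ years}}{2 \text{ years per doubling}} = 10 \text{ doublings}, \qquad 2^{10} = 1024.$$

So a 1024-CPU machine matches today the single-node performance expected two decades out, provided the application scales perfectly.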
5
How to Study Parallel Systems
History: diverse and innovative organizational structures, often tied to novel programming models
Rapidly matured under strong technological constraints
• The microprocessor is ubiquitous
• Laptops and supercomputers are fundamentally similar!
• Technological trends cause diverse approaches to converge
Technological trends make parallel computing inevitable
• In the mainstream
Need to understand fundamental principles and design tradeoffs, not just taxonomies
• Naming, Ordering, Replication, Communication performance
6
Outline
• Drivers of Parallel Computing
• Trends in “Supercomputers” for Scientific Computing
• Evolution and Convergence of Parallel Architectures
• Fundamental Issues in Programming Models and Architecture
7
Drivers of Parallel Computing
Application Needs: Our insatiable need for computing cycles
• Scientific computing: CFD, Biology, Chemistry, Physics, ...
• General-purpose computing: Video, Graphics, CAD, Databases, TP...
• Internet applications: Search, e-Commerce, Clustering ...
Technology Trends
Architecture Trends
Economics
Current trends:
• All microprocessors have multiprocessor support
• Servers and workstations are often MP: Sun, SGI, Dell, COMPAQ...
• Microprocessors are multiprocessors: SMP on a chip
8
Application Trends
Demand for cycles fuels advances in hardware, and vice-versa
• Cycle drives exponential increase in microprocessor performance
• Drives parallel architecture harder: most demanding applications
Range of performance demands
• Need range of system performance with progressively increasing cost
• Platform pyramid
Goal of applications in using parallel machines: Speedup

Speedup(p processors) = Performance(p processors) / Performance(1 processor)

For a fixed problem size (input data set), performance = 1/time:

Speedup_fixed-problem(p processors) = Time(1 processor) / Time(p processors)
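As a concrete illustration of the fixed-problem-size definition above, here is a minimal C sketch (the function names and the measured times are hypothetical, not from the slides) that derives speedup and parallel efficiency from wall-clock times:

```c
#include <stdio.h>

/* Speedup for a fixed problem size: Time(1 processor) / Time(p processors). */
static double speedup(double t1, double tp) { return t1 / tp; }

/* Efficiency: how close we come to the ideal speedup of p. */
static double efficiency(double t1, double tp, int p) { return speedup(t1, tp) / p; }

int main(void) {
    /* Hypothetical measured wall-clock times for the same input data set. */
    double t1  = 120.0;  /* seconds on 1 processor   */
    double t32 = 4.8;    /* seconds on 32 processors */
    int    p   = 32;

    printf("speedup    = %.1fx\n", speedup(t1, t32));                 /* 25.0x */
    printf("efficiency = %.0f%%\n", 100.0 * efficiency(t1, t32, p));  /* 78%   */
    return 0;
}
```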
9
Scientific Computing Demand
10
Engineering Computing Demand
Large parallel machines a mainstay in many industries:
• Petroleum (reservoir analysis)
• Automotive (crash simulation, drag analysis, combustion efficiency)
• Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)
• Computer-aided design
• Pharmaceuticals (molecular modeling)
• Visualization
  – in all of the above
  – entertainment (movies), architecture (walk-throughs, rendering)
• Financial modeling (yield and derivative analysis)
• etc.
11
Learning Curve for Parallel Applications
AMBER molecular dynamics simulation program
• Starting point was vector code for the Cray-1
• 145 MFLOPS on the Cray C90; 406 MFLOPS for the final version on a 128-processor Paragon; 891 MFLOPS on a 128-processor Cray T3D
12
Commercial Computing
Also relies on parallelism for the high end
• Scale not so large, but use much more widespread
• Computational power determines scale of business that can be handled
Databases, online transaction processing, decision support, data mining, data warehousing, ...
TPC benchmarks (TPC-C order entry, TPC-D decision support)
• Explicit scaling criteria provided
• Size of enterprise scales with size of system
• Problem size no longer fixed as p increases, so throughput is used as a performance measure (transactions per minute, or tpm); see the note below
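Since the data set grows with p, a TPC-style comparison is naturally expressed in throughput rather than time on a fixed input. One common way to restate the measure described in the last bullet (this formulation is an illustration, not taken verbatim from the slides):

$$\text{throughput} = \frac{\text{transactions completed}}{\text{elapsed time}} \;(\text{tpm}), \qquad \text{speedup}(p) = \frac{\text{throughput}(p \text{ processors})}{\text{throughput}(1 \text{ processor})}.$$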
13
TPC-C Results for Wintel Systems
[Chart: performance (tpmC) and price-performance ($/tpmC) of representative Wintel TPC-C results, 1996-2002]

Year  Config  System           Processor             tpmC     $/tpmC  Avail.    Benchmark
1996  4-way   Cpq PL 5000      Pentium Pro 200 MHz     6,751  $89.62  12-1-96   TPC-C v3.2 (withdrawn)
1997  6-way   Unisys AQ HS6    Pentium Pro 200 MHz    12,026  $39.38  11-30-97  TPC-C v3.3 (withdrawn)
1998  4-way   IBM NF 7000      PII Xeon 400 MHz       18,893  $29.09  12-29-98  TPC-C v3.3 (withdrawn)
1999  8-way   Cpq PL 8500      PIII Xeon 550 MHz      40,369  $18.46  12-31-99  TPC-C v3.5 (withdrawn)
2000  8-way   Dell PE 8450     PIII Xeon 700 MHz      57,015  $14.99  1-15-01   TPC-C v3.5 (withdrawn)
2001  32-way  Unisys ES7000    PIII Xeon 900 MHz     165,218  $21.33  3-10-02   TPC-C v5.0
2002  32-way  Unisys ES7000    Xeon MP 2 GHz         234,325  $11.59  3-31-03   TPC-C v5.0
2002  32-way  NEC Express5800  Itanium2 1 GHz        342,746  $12.86  3-31-03   TPC-C v5.0

Parallelism is pervasive
• Small to moderate scale parallelism very important
• Difficult to obtain snapshot to compare across vendor platforms
14
Summary of Application Trends
Transition to parallel computing has occurred for scientific and engineering computing
Rapid progress is underway in commercial computing
• Databases and transactions, as well as financial applications
• Usually smaller in scale, but large-scale systems are also used
Desktop also uses multithreaded programs, which are a lot like parallel programs
Demand for improving throughput on sequential workloads
• Greatest use of small-scale multiprocessors
Solid application demand that keeps increasing with time
15
Drivers of Parallel Computing
Application Needs
Technology Trends
Architecture Trends
Economics
16
Technology Trends: Rise of the Micro
[Chart: relative performance (log scale) of supercomputers, mainframes, minicomputers, and microprocessors, 1965-1995]
The natural building block for multiprocessors is now also about the fastest!
17
General Technology Trends
• Microprocessor performance increases 50% - 100% per year
• Transistor count doubles every 3 years
• DRAM size quadruples every 3 years
• Huge investment per generation is carried by huge commodity market
• The point is not that single-processor performance is plateauing, but that parallelism is a natural way to improve it
18
Clock Frequency Growth Rate (Intel family)
• 30% per year
19
Transistor Count Growth Rate (Intel family)
• 100 million transistors on a chip by the early 2000s
• Transistor count grows much faster than clock rate
  – 40% per year, an order of magnitude more contribution over two decades
20
Technology: A Closer Look
Basic advance is decreasing feature size (λ)
• Circuits become either faster or lower in power
• Die size is growing too
• Clock rate improves roughly in proportion to the improvement in λ
• Number of transistors improves like λ² (or faster)
• Performance > 100x per decade; clock rate accounts for ~10x, the rest comes from transistor count (rough arithmetic below)
How to use more transistors?
• Parallelism in processing
  – multiple operations per cycle reduces CPI
• Locality in data access
  – avoids latency and reduces CPI
  – also improves processor utilization
• Both need resources, so there is a tradeoff
[Diagram: chip resources split among processor (Proc), cache ($), and interconnect]
Fundamental issue is resource distribution, as in uniprocessors
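A rough back-of-the-envelope reading of the numbers on this slide (treating the proportionalities as rules of thumb, not exact scaling laws):

$$f_{\text{clock}} \propto \frac{1}{\lambda}, \qquad N_{\text{transistors}} \propto \frac{\text{die area}}{\lambda^{2}},$$

and so, per decade,

$$\text{performance gain} \;\approx\; \underbrace{\sim 10\times}_{\text{clock rate}} \;\times\; \underbrace{>10\times}_{\text{more transistors (ILP, caches)}} \;>\; 100\times .$$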
21
Similar Story for Storage
Divergence between memory capacity and speed more pronounced
• Capacity increased by 1000x from 1980-95, and increases 50% per year
• Latency reduces only 3% per year (only 2x from 1980-95)
• Bandwidth per memory chip increases 2x as fast as latency reduces
Larger memories are slower, while processors get faster
• Need to transfer more data in parallel
• Need deeper cache hierarchies
• How to organize caches?
22
Similar Story for Storage
Parallelism increases effective size of each level of hierarchy, without increasing access time
Parallelism and locality within memory systems too
• New designs fetch many bits within the memory chip, then follow with fast pipelined transfer across a narrower interface
• Buffer caches hold the most recently accessed data
Disks too: parallel disks plus caching
Overall, the dramatic growth of processor speed, storage capacity, and bandwidth relative to latency (especially) and clock speed points toward parallelism as the desirable architectural direction
23
Drivers of Parallel Computing
Application Needs
Technology Trends
Architecture Trends
Economics
24
Architectural Trends
Architecture translates technology’s gifts to performance and capability
Resolves the tradeoff between parallelism and locality
• Recent microprocessors: 1/3 compute, 1/3 cache, 1/3 off-chip connect
• Tradeoffs may change with scale and technology advances
Four generations of architectural history: tube, transistor, IC, VLSI
• Here focus only on VLSI generation
Greatest delineation in VLSI has been in type of parallelism exploited
25
Architectural Trends in Parallelism
Greatest trend in VLSI generation is increase in parallelism
• Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
  – slows after 32 bit
  – adoption of 64-bit well under way, 128-bit is far (not performance issue)
  – great inflection point when 32-bit micro and cache fit on a chip
• Mid 80s to mid 90s: instruction-level parallelism
  – pipelining and simple instruction sets, + compiler advances (RISC)
  – on-chip caches and functional units => superscalar execution
  – greater sophistication: out-of-order execution, speculation, prediction
    • to deal with control transfer and latency problems
• Next step: thread-level parallelism
26
Phases in VLSI Generation
[Chart: transistors per chip, 1970-2005 (log scale, 1,000 to 100,000,000), spanning the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, R10000, and Pentium; three phases marked: bit-level parallelism, instruction-level parallelism, and thread-level parallelism (?)]
How good is instruction-level parallelism (ILP)?
• Thread-level needed in microprocessors?
• SMT, Intel Hyperthreading
27
Can ILP get us there?
• Reported speedups for superscalar processors:
  – Horst, Harris, and Jardine [1990]: 1.37
  – Wang and Wu [1988]: 1.70
  – Smith, Johnson, and Horowitz [1989]: 2.30
  – Murakami et al. [1989]: 2.55
  – Chang et al. [1991]: 2.90
  – Jouppi and Wall [1989]: 3.20
  – Lee, Kwok, and Briggs [1991]: 3.50
  – Wall [1991]: 5
  – Melvin and Patt [1991]: 8
  – Butler et al. [1991]: 17+
• Large variance due to differences in
  – application domain investigated (numerical versus non-numerical)
  – capabilities of the processor modeled
28
ILP Ideal Potential
[Charts: (left) fraction of total cycles (%) vs. number of instructions issued per cycle (0 to 6+); (right) speedup vs. instructions issued per cycle (0 to 15)]
• Infinite resources and fetch bandwidth, perfect branch prediction and renaming
  – real caches and non-zero miss latencies
29
Results of ILP Studies
• Concentrate on parallelism for 4-issue machines
• Realistic studies show only 2-fold speedup
• More recent work examines ILP that looks across threads for parallelism
30
Architectural Trends: Bus-based MPs
• Micro on a chip makes it natural to connect many to shared memory
  – dominates server and enterprise market, moving down to desktop
• Faster processors began to saturate bus, then bus technology advanced
  – today, range of sizes for bus-based systems, desktop to large servers
[Chart: number of processors in fully configured commercial shared-memory systems, 1984-1998; from the Sequent B8000 and Symmetry, through SGI PowerSeries, Challenge and PowerChallenge/XL, Sun SC2000, SS1000 and E6000, AS8400 and HP K400, up to the CRAY CS6400 and Sun E10000 at roughly 60-64 processors]
31
Bus Bandwidth
[Chart: shared bus bandwidth (MB/s, log scale from 10 to 100,000) of commercial bus-based systems, 1984-1998; from the Sequent B8000 and B2100 at the low end to the SGI PowerChallenge XL, Sun E6000, and Sun E10000 at the high end]
32
Do Buses Scale?
Buses are a convenient way to extend architecture to parallelism, but they do not scale
• bandwidth doesn’t grow as CPUs are added (see the note below the diagram)
Scalable systems use physically distributed memory
[Diagram: scalable node with processor (P), cache ($), memory (Mem), and a memory controller with network interface (NI), attached to an X-Y-Z switch; external I/O also connects to the node]
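A simple way to see why the bus is the bottleneck (B here is a hypothetical peak bus bandwidth, used only for illustration): with a single shared bus of bandwidth B and p processors,

$$\text{bus bandwidth available per processor} \;=\; \frac{B}{p},$$

which shrinks as p grows while each processor's demand stays roughly constant. With physically distributed memory, each added node brings its own memory bandwidth, so aggregate bandwidth grows roughly in proportion to p.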
33
Drivers of Parallel Computing
Application Needs
Technology Trends
Architecture Trends
Economics
34
Finally, Economics
Commodity microprocessors are not only fast but CHEAP
• Development cost is tens of millions of dollars ($5M to $100M typical)
• BUT, many more are sold compared to supercomputers
• Crucial to take advantage of the investment, and use the commodity building block
• Exotic parallel architectures amount to no more than special-purpose machines
Multiprocessors are being pushed by software vendors (e.g. databases) as well as hardware vendors
Standardization by Intel makes small, bus-based SMPs a commodity
Desktop: a few smaller processors versus one larger one?
• Multiprocessor on a chip
35
Summary: Why Parallel Architecture?
Increasingly attractive
• Economics, technology, architecture, application demand
Increasingly central and mainstream
Parallelism exploited at many levels
• Instruction-level parallelism
• Multiprocessor servers
• Large-scale multiprocessors (“MPPs”)
Focus of this class: the multiprocessor level of parallelism
Same story from the memory (and storage) system perspective
• Increase bandwidth, reduce average latency with many local memories
A wide range of parallel architectures makes sense
• Different cost, performance, and scalability
36
Outline
• Drivers of Parallel Computing
• Trends in “Supercomputers” for Scientific Computing
• Evolution and Convergence of Parallel Architectures
• Fundamental Issues in Programming Models and Architecture
37
Scientific Supercomputing
Proving ground and driver for innovative architecture and techniques
• Market smaller relative to commercial as MPs become mainstream
• Dominated by vector machines starting in the 70s
• Microprocessors have made huge gains in floating-point performance
  – high clock rates
  – pipelined floating point units (e.g. mult-add)
  – instruction-level parallelism
  – effective use of caches
• Plus economics
Large-scale multiprocessors replace vector supercomputers
38
Raw Uniprocessor Performance: LINPACK
[Chart: LINPACK MFLOPS (log scale, 1 to 10,000), 1975-2000, for CRAY and microprocessor systems at n = 100 and n = 1,000; the CRAY line runs from the CRAY 1s through Xmp/14se, Xmp/416, Ymp, C90, and T94, while the micro line runs from the Sun 4/260 through MIPS M/120, MIPS M/2000, IBM RS6000/540, HP 9000/750, HP9000/735, DEC Alpha, DEC Alpha AXP, MIPS R4400, IBM Power2/990, and DEC 8200]
39
Raw Parallel Performance: LINPACK
[Chart: LINPACK GFLOPS (log scale, 0.1 to 10,000), 1985-1996, comparing MPP peak with CRAY peak; systems include the Xmp/416(4), iPSC/860, nCUBE/2(1024), CM-2, CM-200, Ymp/832(8), Delta, C90(16), CM-5, Paragon XP/S, T3D, T932(32), Paragon XP/S MP (1024 and 6768 processors), and ASCI Red]
• Even vector Crays became parallel: X-MP (2-4), Y-MP (8), C-90 (16), T94 (32)
• Since 1993, Cray produces MPPs too (T3D, T3E)
40
500 Fastest Computers
41
Top 500 as of 2003
• Earth Simulator, built by NEC, remains the unchallenged #1 at 38 TFlop/s
• ASCI Q at Los Alamos is #2 at 13.88 TFlop/s
• The third system ever to exceed the 10 TFlop/s mark is Virginia Tech’s X: a cluster built from Apple G5 nodes with the new InfiniBand interconnect
• #4 is also a cluster, at NCSA: a Dell PowerEdge system with a Myrinet interconnect
• #5 is also a cluster: an upgraded Itanium2-based HP system at DOE’s Pacific Northwest National Lab, with a Quadrics interconnect
• #6 is based on AMD’s Opteron chip; it was installed by Linux Networx at Los Alamos National Laboratory and also uses a Myrinet interconnect
• The number of clusters in the TOP10 has grown to seven; the Earth Simulator and two IBM SP systems at Livermore and LBL are the non-clusters
• The performance of the #10 system is 6.6 TFlop/s
42
43
Another View of Performance Growth
44
Another View of Performance Growth
45
Another View of Performance Growth
46
Another View of Performance Growth
47
Outline
• Drivers of Parallel Computing
• Trends in “Supercomputers” for Scientific Computing
• Evolution and Convergence of Parallel Architectures
• Fundamental Issues in Programming Models and Architecture
48
History
Historically, parallel architectures tied to programming models
• Divergent architectures, with no predictable pattern of growth
[Diagram: application software and system software layered over divergent architectures: systolic arrays, dataflow, SIMD, message passing, shared memory]
• Uncertainty of direction paralyzed parallel software development!
49
Today
Extension of “computer architecture” to support communication and cooperation
• OLD: Instruction Set Architecture
• NEW: Communication Architecture
Defines
• Critical abstractions, boundaries, and primitives (interfaces)
• Organizational structures that implement interfaces (hw or sw)
Compilers, libraries and OS are important bridges between application and architecture today
50
Modern Layered Framework
[Diagram: layered framework, top to bottom]
• Parallel applications: CAD, Database, Scientific modeling
• Programming models: Multiprogramming, Shared address, Message passing, Data parallel
• Communication abstraction (the user/system boundary), reached via compilation or a library, with operating systems support
• Communication hardware (the hardware/software boundary)
• Physical communication medium
51
Parallel Programming Model
What the programmer uses in writing applications
Specifies communication and synchronization
Examples:
• Multiprogramming: no communication or synchronization at the program level
• Shared address space: like a bulletin board (see the sketch after this list)
• Message passing: like letters or phone calls; explicit point-to-point
• Data parallel: more regimented, global actions on data
  – Implemented with shared address space or message passing
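To make the “bulletin board” analogy concrete, here is a minimal shared-address-space sketch in C with POSIX threads; the program, variable names, and the 4-thread array-sum workload are illustrative assumptions, not taken from the slides.

```c
/* Shared-address-space model: threads communicate by reading and writing
 * shared variables ("the bulletin board") and synchronize with a mutex. */
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4

static double data[N];            /* shared input array                */
static double global_sum = 0.0;   /* shared result ("bulletin board")  */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *partial_sum(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS);
    long hi = (id == NTHREADS - 1) ? N : lo + (N / NTHREADS);

    double local = 0.0;           /* private to this thread */
    for (long i = lo; i < hi; i++)
        local += data[i];

    pthread_mutex_lock(&lock);    /* synchronization: one writer at a time */
    global_sum += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) data[i] = 1.0;

    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, partial_sum, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("sum = %.0f\n", global_sum);   /* expect 1000000 */
    return 0;
}
```

A message-passing version of the same computation would give each process a private copy of its slice and exchange the partial sums with explicit sends and receives; a data-parallel version would express the reduction as a single global operation on the array.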
52
Communication Abstraction
User-level communication primitives provided by the system
• Realizes the programming model
• Mapping exists between language primitives of the programming model and these primitives
Supported directly by hw, or via OS, or via user sw
Lot of debate about what to support in sw, and about the gap between layers
Today:
• Hw/sw interface tends to be flat, i.e. complexity roughly uniform
• Compilers and software play important roles as bridges
• Technology trends exert strong influence
Result is convergence in organizational structure
• Relatively simple, general-purpose communication primitives
53
Communication Architecture
= User/System Interface + Implementation
User/System Interface:
• Communication primitives exposed to user level by hw and system-level sw
• (May be additional user-level software between this and the programming model)
Implementation:
• Organizational structures that implement the primitives: hw or OS
• How optimized are they? How integrated into the processing node?
• Structure of the network
Goals:
• Performance
• Broad applicability
• Programmability
• Scalability
• Low cost
54