ENGIN112 - lecture 2


Technology ---> Limitations & Opportunities
• Wires
- Area
- Propagation speed
• Clock
• Power
• VLSI
- I/O pin limitations
- Chip area
- Chip crossing delay
- Power
• Cannot make light go any faster
• KISS rule (Keep It Simple, Stupid)
INEL6067
Major theme
Application requirements
Technological constraints
ARCHITECTURE
• Look at typical applications
• Understand physical limitations
• Make tradeoffs
Unfortunately
° Requirements and constraints are often at odds
with each other!
Full connectivity!
° Architecture ---> making tradeoffs
Gasp!!!
Putting it all together
° The systems approach
• Lesson from RISCs
• Hardware software tradeoffs
• Functionality implemented at the right level
- Hardware
- Runtime system
- Compiler
- Language, Programmer
- Algorithm
Commercial Computing
° Relies on parallelism for high end
• Computational power determines scale of business that can be
handled
° Databases, online-transaction processing, decision
support, data mining, data warehousing ...
Scientific Computing Demand
Applications: Speech and Image Processing
[Figure: processing requirement (log scale, 1 MIPS to 10 GIPS) versus year (1980-1995) for speech and image applications: sub-band speech coding, CELP speech coding, speaker verification, isolated speech recognition (200 words), continuous speech recognition (1,000 and 5,000 words), telephone number recognition, ISDN-CD stereo receiver, CIF video, HDTV receiver]
• Also CAD, Databases, . . .
• 100 processors gets you 10 years, 1000 gets you 20 !
Is better parallel arch enough?
° AMBER molecular dynamics simulation program
° Starting point was vector code for Cray-1
° 145 MFLOPS on Cray90, 406 for final version on 128-processor Paragon, 891 on 128-processor Cray T3D
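The three rates quoted above can be compared with a little arithmetic. The sketch below assumes that all three numbers are sustained MFLOPS for the same AMBER workload; the names `amber_mflops` and `relative_gain` are illustrative and do not come from the lecture.

```python
# Sustained MFLOPS quoted on the slide (assumed to describe the same workload).
amber_mflops = {
    "Cray90": 145,
    "128-processor Paragon (final version)": 406,
    "128-processor Cray T3D": 891,
}


def relative_gain(mflops, baseline):
    """Return how many times faster `mflops` is than `baseline`."""
    return mflops / baseline


baseline = amber_mflops["Cray90"]
for machine, rate in amber_mflops.items():
    print(f"{machine}: {rate} MFLOPS, {relative_gain(rate, baseline):.1f}x the Cray90 rate")
```

Under these assumptions, 128 processors give roughly 3x to 6x the vector rate, far less than 128x. That is one way to read the slide's question: the machine alone does not deliver the gain, and the program has to be restructured as well.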
Summary of Application Trends
° Transition to parallel computing has occurred for
scientific and engineering computing
° Transition is in rapid progress in commercial computing
• Database and transactions as well as financial
• Usually smaller-scale, but large-scale systems also used
° Desktop also uses multithreaded programs, which
are a lot like parallel programs
° Demand for improving throughput on sequential
workloads
• Greatest use of small-scale multiprocessors
° Solid application demand exists and will increase
Technology Trends
[Figure: performance (log scale, 0.1 to 100) versus year (1965-1995) for supercomputers, mainframes, minicomputers, and microprocessors]
° Today the natural building-block is also fastest!
Technology: A Closer Look
° Basic advance is decreasing feature size (λ)
• Circuits become either faster or lower in power
° Die size is growing too
• Clock rate improves roughly proportional to improvement in λ
• Number of transistors improves like λ² (or faster)
° Performance > 100x per decade
• clock rate < 10x, rest is transistor count
° How to use more transistors?
• Parallelism in processing
- multiple operations per cycle reduces CPI
• Locality in data access
- avoids latency and reduces CPI
- also improves processor utilization
• Both need resources, so tradeoff
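The two CPI bullets above can be made concrete with the standard processor performance equation, time = instructions x CPI / clock rate, with CPI split into a base part and a memory-stall part. This is a sketch with made-up numbers: every parameter value, and names such as `effective_cpi`, are assumptions for illustration and are not taken from the lecture.

```python
def effective_cpi(base_cpi, mem_refs_per_instr, miss_rate, miss_penalty_cycles):
    """Base CPI plus average memory-stall cycles per instruction."""
    return base_cpi + mem_refs_per_instr * miss_rate * miss_penalty_cycles


def exec_time_seconds(instructions, cpi, clock_hz):
    """Performance equation: instructions x CPI / clock rate."""
    return instructions * cpi / clock_hz


# Hypothetical workload and machine: 1e9 instructions at 100 MHz.
INSTRUCTIONS = 1_000_000_000
CLOCK_HZ = 100_000_000

configs = {
    # single issue, 5% of memory references miss, 40-cycle miss penalty
    "baseline": effective_cpi(1.0, 0.3, 0.05, 40),
    # parallelism: issuing two operations per cycle halves the base CPI
    "more parallelism": effective_cpi(0.5, 0.3, 0.05, 40),
    # locality: a larger cache lowers the miss rate and so the stall CPI
    "more locality": effective_cpi(1.0, 0.3, 0.02, 40),
}

for name, cpi in configs.items():
    seconds = exec_time_seconds(INSTRUCTIONS, cpi, CLOCK_HZ)
    print(f"{name}: CPI = {cpi:.2f}, time = {seconds:.1f} s")
```

Both changes lower CPI, but each spends transistors (extra functional units in one case, cache in the other), which is the tradeoff the last bullet refers to.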
[Figure: processor (Proc) and cache ($) attached to an interconnect]
° Fundamental issue is resource distribution, as in
uniprocessors
Growth Rates
[Figure: clock rate (MHz, log scale, 0.1 to 1,000) versus year (1970-2005) for i4004, i8008, i8080, i8086, i80286, i80386, Pentium100, R10000]
• Clock rate: 30% per year
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year (1970-2005) for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]
• Transistors: 40% per year
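To see what these annual rates mean over longer periods, the sketch below compounds them. It assumes the 30% and 40% figures are constant year-over-year growth rates; the function names are illustrative.

```python
import math


def decade_factor(annual_rate):
    """Total growth over 10 years at a constant annual rate."""
    return (1.0 + annual_rate) ** 10


def doubling_time_years(annual_rate):
    """Years needed to double at a constant annual rate."""
    return math.log(2.0) / math.log(1.0 + annual_rate)


for name, rate in [("clock rate", 0.30), ("transistors", 0.40)]:
    print(
        f"{name}: {decade_factor(rate):.1f}x per decade, "
        f"doubles every {doubling_time_years(rate):.2f} years"
    )
```

At these rates the clock improves by about 14x and the transistor count by about 29x per decade.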
Architectural Trends
° Architecture translates technology’s gifts into
performance and capability
° Resolves the tradeoff between parallelism and
locality
• Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip
connect
• Tradeoffs may change with scale and technology advances
° Understanding microprocessor architectural
trends
=> Helps build intuition about design issues for parallel
machines
=> Shows fundamental role of parallelism even in “sequential”
computers
Phases in “VLSI” Generation
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) versus year (1970-2005), from i4004 to R10000, divided into three phases: bit-level parallelism, instruction-level parallelism, thread-level parallelism (?)]
Architectural Trends
° Greatest trend in VLSI generation is increase in
parallelism
• Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
- slows after 32 bit
- adoption of 64-bit now under way, 128-bit far (not
performance issue)
- great inflection point when 32-bit micro and cache fit on a
chip
• Mid 80s to mid 90s: instruction level parallelism
- pipelining and simple instruction sets, + compiler
advances (RISC)
- on-chip caches and functional units => superscalar
execution
- greater sophistication: out of order execution,
speculation, prediction
– to deal with control transfer and latency problems
• Next step: thread level parallelism
How far will ILP go?
[Figure, two panels: fraction of total cycles (%) versus number of instructions issued (0 to 6+); speedup versus instructions issued per cycle (0 to 15)]
° Infinite resources and fetch bandwidth, perfect
branch prediction and renaming
– real caches and non-zero miss latencies
Thread-Level Parallelism “on board”
[Figure: four processors (Proc) attached to one shared memory (MEM)]
° Micro on a chip makes it natural to connect many to shared
memory
– dominates server and enterprise market, moving down to desktop
° Faster processors began to saturate bus, then bus
technology advanced
– today, range of sizes for bus-based systems, desktop to large servers
What about Multiprocessor Trends?
[Figure: number of processors (0 to 70) versus year (1984-1998) for shared-memory multiprocessors, including Sequent B8000 and B2100, Symmetry21 and Symmetry81, SGI PowerSeries, SGI Challenge and PowerChallenge/XL, Sun SS690MP 120 and 140, SS10, SS20, SS1000, SS1000E, SC2000, SC2000E, E6000 and E10000, CRAY CS6400, AS2100, AS8400, HP K400, SE10, SE30, SE60, SE70, and P-Pro]
What about Storage Trends?
° Divergence between memory capacity and speed even more
pronounced
• Capacity increased by 1000x from 1980-95, speed only 2x
• Gigabit DRAM by c. 2000, but gap with processor speed much greater
° Larger memories are slower, while processors get faster
• Need to transfer more data in parallel
• Need deeper cache hierarchies
• How to organize caches?
° Parallelism increases effective size of each level of hierarchy,
without increasing access time
° Parallelism and locality within memory systems too
• New designs fetch many bits within memory chip; follow with fast
pipelined transfer across narrower interface
• Buffer caches most recently accessed data
° Disks too: Parallel disks plus caching
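The capacity and speed figures above can be restated as annual rates to show how far the two diverge. The sketch assumes the 1000x and 2x factors apply to the 15 years from 1980 to 1995 and that growth was steady; `annual_growth_rate` is an illustrative name.

```python
def annual_growth_rate(total_factor, years):
    """Constant annual rate that compounds to `total_factor` over `years`."""
    return total_factor ** (1.0 / years) - 1.0


YEARS = 1995 - 1980

capacity_rate = annual_growth_rate(1000, YEARS)  # DRAM capacity, 1000x
speed_rate = annual_growth_rate(2, YEARS)        # DRAM speed, 2x

print(f"capacity: {capacity_rate:.1%} per year")
print(f"speed:    {speed_rate:.1%} per year")
```

Under these assumptions capacity grew by close to 60% per year while speed grew by under 5% per year.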
Economics
° Commodity microprocessors not only fast but CHEAP
• Development costs tens of millions of dollars
• BUT, many more are sold compared to supercomputers
• Crucial to take advantage of the investment, and use the commodity
building block
° Multiprocessors being pushed by software vendors (e.g.
database) as well as hardware vendors
° Standardization makes small, bus-based SMPs commodity
° Desktop: few smaller processors versus one larger one?
° Multiprocessor on a chip?
Consider Scientific Supercomputing
° Proving ground and driver for innovative architecture and
techniques
• Market smaller relative to commercial as MPs become
mainstream
• Dominated by vector machines starting in 70s
• Microprocessors have made huge gains in floating-point
performance
- high clock rates
- pipelined floating point units (e.g., multiply-add every
cycle)
- instruction-level parallelism
- effective use of caches (e.g., automatic blocking)
• Plus economics
° Large-scale multiprocessors replace vector supercomputers
Raw Parallel Performance: LINPACK
[Figure: LINPACK performance (GFLOPS, log scale, 0.1 to 10,000) versus year (1985-1996), MPP peak and CRAY peak. MPP points: iPSC/860, nCUBE/2(1024), CM-2, CM-200, Delta, Paragon XP/S, CM-5, T3D, Paragon XP/S MP (1024), Paragon XP/S MP (6768), ASCI Red. CRAY points: Xmp/416(4), Ymp/832(8), C90(16), T932(32)]
° Even vector Crays became parallel
• X-MP (2-4), Y-MP (8), C-90 (16), T94 (32)
° Since 1993, Cray produces MPPs too (T3D, T3E)
Where is Parallel Arch Going?
Old view: Divergent architectures, no predictable pattern of growth.
[Figure: application software and system software layered over the architecture, with divergent designs around them: systolic arrays, SIMD, message passing, dataflow, shared memory]
• Uncertainty of direction paralyzed parallel software development!
Modern Layered Framework
[Figure: layers of abstraction, top to bottom]
• Parallel applications: CAD, Database, Scientific modeling
• Programming models: Multiprogramming, Shared address, Message passing, Data parallel
• Communication abstraction (user/system boundary): Compilation or library, Operating systems support
• Hardware/software boundary: Communication hardware
• Physical communication medium
Summary: Why Parallel Architecture?
° Increasingly attractive
• Economics, technology, architecture, application demand
° Increasingly central and mainstream
° Parallelism exploited at many levels
• Instruction-level parallelism
• Multiprocessor servers
• Large-scale multiprocessors (“MPPs”)
° Focus of this class: multiprocessor level of
parallelism
° Same story from memory system perspective
• Increase bandwidth, reduce average latency with many local
memories
° Spectrum of parallel architectures makes sense
• Different cost, performance and scalability
History
° Parallel architectures tied closely to programming
models
• Divergent architectures, with no predictable pattern of growth.
• Mid 80s revival
[Figure: application software and system software layered over the architecture, with divergent designs around them: systolic arrays, SIMD, message passing, dataflow, shared memory]
Programming Model
° Look at major programming models
• Where did they come from?
• What do they provide?
• How have they converged?
° Extract general structure and fundamental issues
° Reexamine traditional camps from new perspective
[Figure: systolic arrays, SIMD, message passing, dataflow, and shared memory converging on a generic architecture]
Programming Model
° Conceptualization of the machine that programmer
uses in coding applications
• How parts cooperate and coordinate their activities
• Specifies communication and synchronization operations
° Multiprogramming
• no communication or synch. at program level
° Shared address space
• like bulletin board
° Message passing
• like letters or phone calls, explicit point to point
° Data parallel:
• more regimented, global actions on data
• Implemented with shared address space or message passing
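The difference between the shared address space and message passing models can be sketched with Python threads. This is an illustration only: Python threads always share memory, so the message passing version is a matter of discipline (workers touch nothing but their own chunk and a mailbox), and all names below are hypothetical.

```python
import queue
import threading


def sum_shared_address_space(values, workers=4):
    """'Bulletin board': workers post results with ordinary stores."""
    partial = [0] * workers  # shared data, visible to every thread

    def work(rank):
        partial[rank] = sum(values[rank::workers])  # implicit communication

    threads = [threading.Thread(target=work, args=(rank,)) for rank in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # synchronization: wait until every worker has posted
    return sum(partial)


def sum_message_passing(values, workers=4):
    """'Letters': each worker gets a private chunk and sends one message."""
    mailbox = queue.Queue()

    def work(chunk):
        mailbox.put(sum(chunk))  # explicit send

    threads = [
        threading.Thread(target=work, args=(values[rank::workers],))
        for rank in range(workers)
    ]
    for t in threads:
        t.start()
    total = sum(mailbox.get() for _ in range(workers))  # explicit receives
    for t in threads:
        t.join()
    return total


data = list(range(1, 101))
print(sum_shared_address_space(data), sum_message_passing(data))
```

Both versions compute the same sum. What changes is where communication and synchronization appear in the program: in plain loads and stores plus a join, or in explicit send and receive operations.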
Adding Processing Capacity
[Figure: processors, memory modules (Mem), and I/O controllers (I/O ctrl) with their I/O devices, all attached to an interconnect]
° Memory capacity increased by adding modules
° I/O by controllers and devices
° Add processors for processing!
• For higher-throughput multiprogramming, or parallel programs
Historical Development
° “Mainframe” approach
• Motivated by multiprogramming
• Extends crossbar used for Mem and I/O
• Processor cost-limited => crossbar
• Bandwidth scales with p
• High incremental cost
- use multistage instead
[Figure: processors (P) and I/O controllers (I/O C) connected to memory modules (M) through a crossbar]
° “Minicomputer” approach
• Almost all microprocessor systems have bus
• Motivated by multiprogramming, TP
• Used heavily for parallel computing
• Called symmetric multiprocessor (SMP)
• Latency larger than for uniprocessor
• Bus is bandwidth bottleneck
- caching is key: coherence problem
• Low incremental cost
[Figure: processors (P) with caches ($), I/O controllers (I/O C), and memory modules (M) on a shared bus]
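The cost and bandwidth contrast between the two approaches can be illustrated with a deliberately idealized model. It assumes a crossbar needs one switch per processor-memory pair and can keep min(p, m) transfers in flight, while a bus needs one tap per processor and carries one transfer at a time. These assumptions, and the function names, are illustrative and not taken from the lecture.

```python
def crossbar(p, m, link_bw):
    """p processors by m memory modules, one switch per crosspoint."""
    return {
        "switches": p * m,                       # cost grows with p * m
        "aggregate_bw": min(p, m) * link_bw,     # bandwidth scales with p
    }


def shared_bus(p, bus_bw):
    """p processors on one bus, one tap per processor."""
    return {
        "taps": p,                               # low incremental cost
        "aggregate_bw": bus_bw,                  # fixed: the bottleneck
        "bw_per_processor": bus_bw / p,
    }


for p in (2, 8, 32):
    xbar = crossbar(p, p, link_bw=100)
    bus = shared_bus(p, bus_bw=100)
    print(
        f"p={p}: crossbar {xbar['switches']} switches, {xbar['aggregate_bw']} MB/s; "
        f"bus {bus['taps']} taps, {bus['bw_per_processor']:.1f} MB/s per processor"
    )
```

In this model the per-processor share of the bus shrinks as processors are added, which is why caching, and with it the coherence problem, becomes central for bus-based designs.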
Shared Physical Memory
° Any processor can directly reference any memory
location
° Any I/O controller - any memory
° Operating system can run on any processor, or all.
• OS uses shared memory to coordinate
° Communication occurs implicitly as result of loads
and stores
° What about application processes?
Shared Virtual Address Space
° Process = address space plus thread of control
° Virtual-to-physical mapping can be established so
that processes share portions of address space.
• User-kernel or multiple processes
° Multiple threads of control on one address space.
• Popular approach to structuring OS’s
• Now standard application capability
° Writes to shared address visible to other threads
• Natural extension of uniprocessor model
• conventional memory operations for communication
• special atomic operations for synchronization
- also load/stores
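The split between ordinary memory operations for communication and special operations for synchronization can be sketched with threads updating one shared counter. Python does not expose hardware atomic instructions, so a `threading.Lock` stands in for them here; that substitution and the name `count_with_lock` are assumptions made for illustration.

```python
import threading


def count_with_lock(workers=4, increments=10_000):
    """Each thread adds to one shared counter; the lock makes each update atomic."""
    counter = 0
    lock = threading.Lock()

    def work():
        nonlocal counter
        for _ in range(increments):
            with lock:        # synchronization operation
                counter += 1  # ordinary load and store on shared data

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


print(count_with_lock())
```

Without the lock, the load, add, and store that make up `counter += 1` can interleave across threads and updates may be lost. With it, the final value is always `workers * increments`.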
Structured Shared Address Space
[Figure: virtual address spaces for a collection of processes (P0, P1, P2, ..., Pn) communicating via shared addresses. The shared portion of each address space maps to common physical addresses in the machine physical address space, reached by loads and stores from any process; each private portion (P0 private, P1 private, P2 private, ..., Pn private) maps to its own region of physical memory]
° Ad hoc parallelism used in system code
° Most parallel applications have structured SAS
° Same program on each processor
• shared variable X means the same thing to each thread