Deep Pipelining
Is There Anything More to Learn about High Performance Processors?
J. E. Smith
The State of the Art
Multiple instructions per cycle
Out-of-order issue
Register renaming
Deep pipelining
Branch prediction
Speculative execution
Cache memories
Multi-threading
June 2003
copyright J. E. Smith, 2003
3
History Quiz
Superscalar processing was invented by
a) Intel in 1993
b) RISC designers in the late ’80s, early ’90s
c) IBM ACS in late ’60s; Tjaden and Flynn 1970
History Quiz
Out-of-order issue was invented by
a) Intel in 1993
b) RISC designers in the late ’80s, early ’90s
c) Thornton/Cray in the 6600, 1963
History Quiz
Register renaming was invented by
a) Intel in 1995
b) RISC designers in the late ’80s, early ’90s
c) Tomasulo in late ’60s; also Tjaden and Flynn 1970
[Figure: what Keller said in 1975]
History Quiz
Deep pipelining was invented by
a) Intel in 2001
b) RISC designers in the late ’80s, early ’90s
c) Seymour Cray in 1976
1969: 7600    12 gates/stage (?)
1976: Cray-1   8 gates/stage
1985: Cray-2   4 gates/stage
1991: Cray-3   6 gates/stage (?)
History Quiz
Branch prediction was invented by
a) Intel in 1995
b) RISC designers in the late ’80s, early ’90s
c) Stretch 1959 (static); Livermore S-1 (?) 1979, or earlier at IBM (?)
History Quiz
Speculative Execution was invented by
a) Intel in 1995
b) RISC designers in the late ’80s, early ’90s
c) CDC 180/990 (?) in 1983
History Quiz
Cache memories were invented by
a) Intel in 1985
b) RISC designers in the late ’80s, early ’90s
c) Maurice Wilkes in 1965
History Quiz
Multi-threading was invented by
a) Intel in 2001
b) RISC designers in the ’80s
c) Seymour Cray in 1964
Summary
Multiple instructions per cycle -- 1969
Out-of-order issue -- 1964
Register renaming -- 1967
Deep pipelining -- 1975
Branch prediction -- 1979
Speculative Execution -- 1983
Cache memories -- 1965
Multi-threading -- 1964
All were done as part of a development project and immediately put into practice.
After introduction, only a few remained in common use.
The 1970s & 80s – Less Complexity
Level of integration wouldn’t support it
• Not because of transistor counts, but because of small replaceable units
Cray went toward simple issue, deep pipelining
Microprocessor development first used high complexity, then drove pipelines deeper
Limits to Wide Issue
Limits to Deep Pipelining
Typical Superscalar Performance
Your basic superscalar processor:
• 4-way issue, 32-entry window
• 16K I-cache and D-cache
• 8K gshare branch predictor
[Chart: IPC per SPEC CPU2000 integer benchmark (bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr); y-axis 0 to 4]
Wide performance range
Performance typically much less than peak (4)
Superscalar Processor Performance
Compare:
• 4-way issue, 32-entry window
• Ideal I-cache, D-cache, branch predictor
• Non-ideal I-cache, D-cache, branch predictor
[Charts: IPC per SPEC CPU2000 integer benchmark under ideal and non-ideal conditions; y-axis 0 to 4]
Peak performance would be achievable IF it weren’t for “bad” events:
• I-cache misses
• D-cache misses
• Branch mispredictions
Performance Model
Consider profile of dynamic instructions issued per cycle:
[Figure: IPC over time, with dips for i-cache misses, branch mispredicts, and long d-cache misses]
Background “issue-width” near-peak IPC
• with never-ending series of transient events
Determine performance with ideal caches & predictors, then account for “bad” transient events
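The accounting described above can be sketched as a first-order model: start from the ideal ("background") IPC, then add the cycles lost to each class of transient event. The event counts and penalties below are made-up illustrative numbers, not measurements from the talk.

```python
# Hedged sketch of the slide's performance model: ideal cycles plus
# penalty cycles from each class of "bad" transient event.

def effective_ipc(ideal_ipc, insts, events):
    """events: list of (count, penalty_cycles) per event class."""
    ideal_cycles = insts / ideal_ipc
    penalty_cycles = sum(count * penalty for count, penalty in events)
    return insts / (ideal_cycles + penalty_cycles)

# Example: 1M instructions on a 4-wide machine with assumed event rates.
ipc = effective_ipc(
    ideal_ipc=4.0,
    insts=1_000_000,
    events=[
        (2_000, 50),    # i-cache misses
        (5_000, 100),   # long d-cache misses
        (10_000, 10),   # branch mispredictions
    ],
)
print(round(ipc, 2))  # about 1.05 for these assumed rates
```

The point of the sketch matches the chart two slides back: a machine with a peak of 4 can easily land near 1 IPC once the transient events are charged.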
Backend: Ideal Conditions
Key Result (Michaud, Seznec, Jourdan):
• Square-root relationship between issue rate R and window size W: R ∝ √W
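The square-root law implies sharply diminishing returns: quadrupling the window only doubles the sustainable issue rate. A quick illustration (the proportionality constant is an arbitrary assumption for the example):

```python
# R ∝ sqrt(W): issue rate grows only as the square root of window size.
import math

def issue_rate(window_size, k=1.0):
    return k * math.sqrt(window_size)

print(issue_rate(16))  # window of 16
print(issue_rate(64))  # 4x the window -> only 2x the issue rate
```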
Branch Misprediction Penalty
1) Lost opportunity
• performance lost by issuing soon-to-be-flushed instructions
2) Pipeline re-fill penalty
• obvious penalty; most people equate this with the penalty
3) Window fill penalty
• performance lost due to window startup
[Figure: timeline showing the lost-opportunity, pipeline re-fill, and window-fill phases of a misprediction]
Calculate Mispredict Penalty
[Chart: instructions issued per clock cycle over cycles 1–31 around a misprediction, issue width 4]
19.75 insts / 4 = 4.9 cp
8.5 insts / 4 = 2.1 cp
9 insts / 4 = 2.2 cp
Total Penalty = 9.2 cp
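The slide's arithmetic can be rechecked directly: each penalty component is a count of lost issue slots (instructions) divided by the issue width to get clock cycles of penalty ("cp").

```python
# Recomputing the misprediction-penalty arithmetic from the slide.
ISSUE_WIDTH = 4
components = [19.75, 8.5, 9.0]  # instruction counts from the slide

penalties = [insts / ISSUE_WIDTH for insts in components]
rounded = [round(p, 1) for p in penalties]

print(rounded)                 # [4.9, 2.1, 2.2] cycles per component
print(round(sum(rounded), 1))  # 9.2 cycles total, matching the slide
```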
Importance of Branch Prediction
[Chart: instructions needed between mispredictions (y-axis 0 to 1800) vs. percent of time at 3.5, 7, and 14 issues per cycle (x-axis 10 to 50), for issue widths 4, 8, and 16]
Importance of Branch Prediction
Doubling issue width means the predictor has to be four times better for a similar performance profile
Assumes everything else is ideal
• I-caches & D-caches
Research state of the art: about 5 percent mispredicts on average (perceptron predictor)
=> one misprediction per 100 instructions
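The step from "5 percent mispredicts" to "one misprediction per 100 instructions" implicitly assumes roughly one branch in every five instructions, a common rule of thumb for integer code; that ratio is our assumption, not stated on the slide.

```python
# Converting a per-branch misprediction rate to a per-instruction rate.
branch_fraction = 0.20   # assumed branches per instruction (rule of thumb)
mispredict_rate = 0.05   # mispredictions per branch, from the slide

insts_per_mispredict = round(1 / (branch_fraction * mispredict_rate))
print(insts_per_mispredict)  # 100
```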
Next Generation Branch Prediction
Classic memory/computation tradeoff:
• Conventional branch predictors: heavy on memory, light on computation
• Perceptron predictor: adds heavier computation; also adds latency to prediction
• Future predictors should balance memory, computation, and prediction latency
[Diagrams: PC and global history indexing predictor structures, ranging from memory-only lookup to memory followed by computation]
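To make the tradeoff concrete, here is a toy perceptron predictor in the spirit of the slide: the "memory" is a table of weight vectors indexed by PC, and the "computation" is a dot product with the global branch history. Table size, history length, and the training threshold are illustrative choices, not the published parameters.

```python
# Minimal perceptron branch predictor sketch.
HIST_LEN = 8
TABLE_SIZE = 64
THRESHOLD = 12  # train until |output| exceeds this confidence

weights = [[0] * (HIST_LEN + 1) for _ in range(TABLE_SIZE)]  # +1 bias weight
history = [-1] * HIST_LEN  # global history: +1 = taken, -1 = not taken

def predict(pc):
    """Return (predicted-taken, raw perceptron output y)."""
    w = weights[pc % TABLE_SIZE]
    y = w[0] + sum(wi * hi for wi, hi in zip(w[1:], history))
    return y >= 0, y

def update(pc, taken, y):
    """Train on a misprediction or a low-confidence output, then shift history."""
    t = 1 if taken else -1
    w = weights[pc % TABLE_SIZE]
    if (y >= 0) != taken or abs(y) <= THRESHOLD:
        w[0] += t
        for i in range(HIST_LEN):
            w[i + 1] += t * history[i]
    history.insert(0, t)
    history.pop()

# Train on a branch at a hypothetical PC that is always taken.
for _ in range(20):
    _, y = predict(0x400)
    update(0x400, True, y)

print(predict(0x400)[0])  # True: the predictor has learned "taken"
```

The dot product is exactly the "heavier computation" the slide refers to, and summing HIST_LEN products is what adds latency relative to a single table lookup.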
Implication of Deeper Pipelines
Assume 1 misprediction per 96 instructions
Vary fetch/decode/rename section of pipe
[Graph: IPC vs. fetch/decode pipe length (0–16) for issue widths 2, 4, and 8; y-axis 0 to 5]
Advantage of wide issue diminishes as pipe deepens
This ignores implementation complexity
Graph also ignores longer execution latencies
Deep Pipelining: the Optimality of Eight
Hrishikesh et al.: 8 FO4s
Kunkel et me: 8 gates
Cray-1: 8 4/5-input NANDs
We’re getting there!
Deep Pipelining
Consider time per instruction (TPI) versus pipeline depth (Hartstein and Puzak)
The curve is very flat near the optimum
[Figure: TPI vs. pipeline depth, annotated “Good engineering” and “Good sales” on either side of the optimum]
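A toy TPI model of the kind Hartstein and Puzak analyze shows why the curve is flat: one term shrinks with depth (more overlap), one grows (hazard flushes cost more stages). The constants below are arbitrary assumptions; only the shape of the curve matters.

```python
# Toy time-per-instruction model: work/depth + hazard_cost*depth.
def tpi(depth, work=16.0, hazard_cost=0.25):
    return work / depth + hazard_cost * depth

best = min(range(1, 33), key=tpi)
print(best)  # optimum depth for these constants: 8

# Flatness near the optimum: a 50% deeper pipe costs only ~8% in TPI.
print(round(tpi(12) / tpi(best), 3))
```

That flatness is exactly the sales opportunity the slide hints at: a much deeper pipe (higher MHz) barely hurts TPI near the optimum.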
Transistor Radios and High MHz
A lesson from transistor radios…
Wonderful new technology in the late ’50s
Clearly, the more transistors, the better the radio!
=> Easy way to improve sales
6 transistors, 8 transistors, 14 transistors…
Use transistors as diodes…
Lesson: Eventually, people caught on
The Optimality of Eight
[Photo: Japanese transistor radio, “8 Transistors!”]
So, Processors are Dead for Research?
Of course not
BUT IPC-oriented research may be on life support
Consider Car Engine Development
[Chart: number of cylinders in car engines vs. year, 1890–1935; y-axis 0 to 18]
Conclusion: We should be driving cars with 48 cylinders!
Don’t focus (obsess) on one aspect of performance
And don’t focus only on performance:
• Power efficiency
• Reliability
• Security
• Design complexity
Co-Designed VMs
[Diagram: visible memory holding the operating system and application programs; hardware-concealed memory holding the VMM and data tables; profiling HW and configuration HW alongside]
Move hardware/software boundary
Give “hardware” designer some software in concealed memory
• Hardware does what it does best: speed
• Software does what it does best: manage complexity
Co-Designed VMs: Micro-OS
Manage processor with micro-OS VMM software
• Manage processor resources in an integrated way
• Identify program phase changes
• Save/restore implementation contexts
• A microprocessor-controlled microprocessor
[Diagram: pipeline surrounded by managed resources: variable branch predictor global history, configurable I-cache size, configurable D-cache size, configurable instruction window, configurable reorder buffer, variable D-cache prefetch algorithm, simultaneous multithreading]
Co-Designed VMs
Other applications:
• Binary translation (e.g. Transmeta): enables new ISAs
• Security (Dynamo/RIO)
[Diagram: a conventional (traditional-ISA) program is translated by the VMM, with dynamic profiling, onto a clustered pipeline: IF, rename & steer, integer units 0..N, D-cache units 0..N]
Speculative Multi-threading
Reasons for skepticism:
• Complex
• Incompatible w/ deep pipelining
• The devil will be in the details
  - researcher: 4 instruction types
  - designer: 100(s) of instruction types
• High power consumption
• Performance advantages tend to be focused on specific programs (benchmarks)
Better to push ahead with the real thread
The Memory Wall: D-Cache Misses
Divide into:
• Short misses: handle like a long-latency functional unit
• Long misses: need special treatment
Things that can reduce performance:
1) Structural hazards
• ROB fills up behind load and dispatch stalls
• Window fills with instructions dependent on load and issue stops
2) Control dependences
• Mispredicted branch dependent on load data
• Instructions beyond branch wasted
Structural and Data Blockages
Experiment:
• Window size 32, issue width 4
• Ideal branch prediction
• Cache miss delay 1000 cycles
• Separate window and ROB, 4K entries each
• Simulate single cache miss and see what happens
Results
Benchmark   Avg. # insts issued after miss   Avg. # insts in window dep. on load
Bzip2       3950                             17.8
Crafty      3747                             20.1
Eon         3923                             22.4
Gap         3293                             31.6
Gcc         3678                             17.2
Mcf         3502                             96.2
Gzip        3853                             11.5
Parser      3648                             32.6
Perl        3519                             30.3
Twolf       3673                             44.7
Vortex      3606                              7.8
Vpr         2371                             34.0

Observations:
• Issue continues at full speed
• Typical dependent instructions: about 30
• Usually dependent instructions follow load closely
Control Dependences
Non-ideal branch prediction:
• How many cache misses lead to a branch mispredict, and when?
• Use 8K gshare
Results
Benchmark   Fract. loads driving mispredict   # insts before mispredict
Bzip2       .01                               33.5
Crafty      .30                               20.3
Eon         .18                               30.6
Gap         .33                               27.0
Gcc         .35                               32.4
Mcf         .01                               27.7
Gzip        .44                               32.4
Parser      .08                               35.9
Perl        .40                               30.2
Twolf       .37                               65.6
Vortex      .16                               41.2
Vpr         .47                               31.3

Observations:
• Bimodal behavior; for some programs, branch mispredictions are crucial
• In many cases 30-40% of cache miss data leads to a mispredicted branch
• Inhibits ability to overlap data cache misses
• One more reason to worry about branch prediction
Dealing with the Memory Wall
Don’t speculate about it
Run through it
ROB grows as nD
• issue width is n; miss delay is D cycles
• Example: miss delay of 200 cycles, four-issue machine => ROB of about 800 entries
Window grows as dm
• m outstanding misses; d dependent instructions each
• Example: 6 outstanding misses and 30 dependent instructions each => window should be enlarged by 180 slots
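The two sizing rules above can be written out directly; the numbers below are the slide's own examples.

```python
# Sizing sketch for "running through" long misses, per the slide's
# rules of thumb: ROB ~ n*D, extra window ~ d*m.
def rob_entries(issue_width, miss_delay):
    return issue_width * miss_delay

def extra_window(dep_insts_per_miss, outstanding_misses):
    return dep_insts_per_miss * outstanding_misses

print(rob_entries(4, 200))   # slide's example: about 800 entries
print(extra_window(30, 6))   # slide's example: 180 extra slots
```

Both rules scale linearly with miss latency and miss-level parallelism, which is why the memory wall drives structures so large.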
Future High Performance Processors
Fast clock cycle: 8 gates per stage
Less speculation
• Return to simplicity
• Deciding what to take out is more important than new things to put in
• Leave the baroque era behind
ILP less important
Research in the deep pipeline domain
[Diagram: a pipeline stage, latch to latch, with several levels of logic and a “Neat Gadget” inserted]
When there are 40 gate levels, we can be sloppy about adding gadgets
When there are 8 gate levels, a gadget requiring even one more level slows the clock by 12.5%
To really evaluate the performance impact of adding a gadget, we need a detailed logic design
Future research should be focused on jettisoning gadgets, not adding them
Conclusion: Important Research Areas
Processor simplicity
Power efficiency
Security
Reliability
Reduced design times
Systems (on a chip) balancing threads and on-chip RAM
Many very simple processors on a chip
• Look at architecture of Denelcor HEP…
Attack of Killer Game Chips
OR: The most important thing I learned at Cray Research
OR: What happened to SSTs?
• It isn’t enough that we can build them
• It isn’t enough that there are interested customers
• Volume rules!
Researchers have made a supercomputer - which is powerful enough to
rival the top systems in the world - out of PlayStation 2 components
A US research centre has clustered 70 Sony PlayStation 2 game consoles into a
Linux supercomputer that ranks among the 500 most powerful in the world.
According to the New York Times, the National Centre for Supercomputing
Applications (NCSA) at the University of Illinois assembled the $50,000
(£30,000) machine out of components bought in retail shops. In all, 100
PlayStation 2 consoles were bought but only 70 have been used for this project.
Acknowledgements
Performance model: Tejas Karkhanis
Funding: NSF, SRC, IBM, Intel
Japanese transistor radio: Radiophile.com