EENG 449bG/CPSC 439bG
Computer Systems
Lecture 17
Instruction Level Parallelism III
(Multiple Issue Processors and Speculation)
March 29, 2005
Prof. Andreas Savvides
Spring 2005
http://www.eng.yale.edu/courses/2005s/eeng449b
Why can Tomasulo overlap iterations of loops?
• Register renaming
– Multiple iterations use different physical destinations for their registers (dynamic loop unrolling)
• Reservation stations
– Permit instruction issue to advance past integer control flow operations
– Also buffer old values of registers, totally avoiding the WAR stalls that we saw in the scoreboard
• Other perspective: Tomasulo builds the data flow dependency graph on the fly (register renaming is sketched below)
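As a rough illustration of the renaming idea above, here is a minimal sketch (not from the lecture; the register names and free-list discipline are assumed) of how each loop iteration's write to the same architectural register gets a fresh physical destination:

```python
# Minimal sketch (assumed, not from the lecture): dynamic register renaming.
# Each write to an architectural register gets a fresh physical register,
# so two loop iterations writing F0 no longer conflict (no WAW/WAR hazards).
free_regs = ["P1", "P2", "P3", "P4"]    # hypothetical pool of physical regs
rename_map = {}                          # architectural name -> physical name

def rename_dest(arch_reg):
    phys = free_regs.pop(0)              # allocate a fresh physical register
    rename_map[arch_reg] = phys          # later readers of arch_reg see phys
    return phys

print(rename_dest("F0"))   # iteration 1: L.D F0, 0(R1)  -> writes P1
print(rename_dest("F0"))   # iteration 2: L.D F0, -8(R1) -> writes P2
```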
Tomasulo’s scheme offers 2 major advantages
(1) The distribution of the hazard detection logic
– Distributed reservation stations and the CDB (Common Data Bus)
– If multiple instructions are waiting on a single result, and each already has its other operand, then the instructions can be released simultaneously by the broadcast on the CDB
– If a centralized register file were used, the units would have to read their results from the registers when register buses become available
(2) The elimination of stalls for WAW and WAR hazards
Multiple Issue Processors
• Two main types:
– Superscalar Processors
» Issue a variable number of instructions per clock cycle
» Can be statically or dynamically scheduled
– VLIW (Very Long Instruction Word) Processors
» Issue a fixed number of instructions formatted as one packet of smaller instructions
» Parallelism across instructions is explicitly indicated
» Statically scheduled by the compiler
Multiple Issue Issues
• issue packet: group of instructions from the fetch unit that could potentially issue in 1 clock
– If an instruction causes a structural hazard or a data hazard, either due to an earlier instruction in execution or to an earlier instruction in the issue packet, then the instruction does not issue
– 0 to N instruction issues per clock cycle for an N-issue machine
• Performing the issue checks in 1 cycle could limit clock cycle time: O(n² - n) comparisons (see the sketch below)
– => issue stage usually split and pipelined
– 1st stage decides how many instructions from within this packet can issue, 2nd stage examines hazards between the selected instructions and those already issued
– => splitting leads to higher branch penalties => prediction accuracy important
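A minimal sketch of the intra-packet check (assumed, not from the lecture; the tuple encoding is hypothetical): each later instruction in the packet is compared against every earlier one, which is where the O(n² - n) comparison count comes from:

```python
# Minimal sketch (assumed): hazard checks within one issue packet.
# Each instruction is (dest, [sources]); the packet is in program order.
def packet_can_issue_together(packet):
    for i, (dest_i, _) in enumerate(packet):
        for j in range(i + 1, len(packet)):
            dest_j, srcs_j = packet[j]
            if dest_i in srcs_j:     # RAW: later instr reads earlier result
                return False
            if dest_i == dest_j:     # WAW: both write the same register
                return False
    return True

# ADD R1,R2,R3 followed by SUB R4,R1,R5 cannot issue together (RAW on R1):
print(packet_can_issue_together([("R1", ["R2", "R3"]),
                                 ("R4", ["R1", "R5"])]))   # False
```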
Getting CPI < 1: Issuing Multiple Instructions/Cycle
• Superscalar MIPS: 2 instructions, 1 FP & 1 anything
– Fetch 64 bits/clock cycle; Int on left, FP on right
– Can only issue the 2nd instruction if the 1st instruction issues (rule sketched after the table below)
– More ports for FP registers to do FP load & FP op in a pair
Type               1   2   3   4    5    6    7
Int. instruction   IF  ID  EX  MEM  WB
FP instruction     IF  ID  EX  MEM  WB
Int. instruction       IF  ID  EX   MEM  WB
FP instruction         IF  ID  EX   MEM  WB
Int. instruction           IF  ID   EX   MEM  WB
FP instruction             IF  ID   EX   MEM  WB
• 1 cycle load delay expands to 3 instructions in the superscalar pipeline
– the instruction in the right half of the pair can’t use the result, nor can the instructions in the next issue slot
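A minimal sketch of the dual-issue rule above (the slot encoding is assumed; `hazard_free` stands in for the full hazard checks):

```python
# Minimal sketch (assumed): the superscalar MIPS pairing rule -- one integer
# plus one FP instruction per clock, and in-order issue means the 2nd slot
# issues only if the 1st does.
def issue_count(slot1, slot2):
    """Each slot is (kind, hazard_free) with kind in {'int', 'fp'}."""
    kind1, ok1 = slot1
    kind2, ok2 = slot2
    if not ok1:
        return 0                 # if the 1st instruction stalls, so does the 2nd
    if ok2 and {kind1, kind2} == {"int", "fp"}:
        return 2                 # an int/FP pair may issue together
    return 1

print(issue_count(("int", True), ("fp", True)))    # 2
print(issue_count(("int", False), ("fp", True)))   # 0 (in-order issue)
```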
Dynamic Scheduling in Superscalar
The easy way
• How to issue two instructions and keep in-order instruction issue for Tomasulo?
– Assume 1 integer + 1 floating point
– 1 Tomasulo control for integer, 1 for floating point
• Issue at 2X the clock rate, so that issue remains in order
• Only loads/stores might cause a dependency between integer and FP issue:
– Replace the load reservation station with a load queue; operands must be read in the order they are fetched
– A load checks addresses in the Store Queue to avoid a RAW violation
– A store checks addresses in the Load Queue to avoid WAR, WAW (both checks sketched below)
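A minimal sketch of those two address checks (assumed, not from the lecture; the queues hold only addresses here):

```python
# Minimal sketch (assumed): memory disambiguation with load/store queues.
store_queue = []   # addresses of stores issued but not yet completed
load_queue = []    # addresses of loads issued but not yet completed

def load_may_proceed(addr):
    # RAW through memory: an earlier store to this address must finish first
    return addr not in store_queue

def store_may_proceed(addr):
    # WAR/WAW through memory: earlier loads/stores to this address must finish
    return addr not in load_queue and addr not in store_queue

store_queue.append(0x1000)
print(load_may_proceed(0x1000))   # False: would read before the store writes
print(load_may_proceed(0x2000))   # True: different address, safe to reorder
```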
Superscalar Processors Require More Ambitious Scheduling
• Need to deal with preserving exception order
• Pipelining the issue stage will result in additional overheads
– E.g., in a superscalar pipeline the outcome of a load instruction cannot be used in the next 3 instructions, because the issue stage is pipelined
• Without more ambitious scheduling, superscalar processors do not have an advantage
• How can we extend Tomasulo’s algorithm to schedule a superscalar pipeline?
Hardware Speculation
• Tomasulo had: in-order issue, out-of-order execution, and out-of-order completion
• Need to “fix” the out-of-order completion aspect so that we can find a precise breakpoint in the instruction stream
Relationship between precise interrupts and speculation:
• Speculation is a form of guessing.
• Important for branch prediction:
– Need to “take our best shot” at predicting branch direction.
– Go further than dynamic branch prediction: start executing instructions before the branch outcome is known
• If we speculate and are wrong, need to back up and restart execution at the point where we predicted incorrectly:
– This is exactly the same problem as precise exceptions!
• Technique for both precise interrupts/exceptions and speculation: in-order completion or commit
• Tomasulo’s algorithm can be extended to support speculation
Tomasulo Recap
Improvements to Tomasulo Algorithm
• Separate the bypassing of results among instructions from the completion of an instruction
– Allow an instruction to execute and bypass its results to other instructions
– Do not allow the instruction to make changes that cannot be undone until we know the instruction is no longer speculative
– When an instruction is no longer speculative, we allow it to update the register file: instruction commit
• Speculation key idea
– Execute instructions out of order but commit them in order
HW support for precise interrupts
• Need a HW buffer for results of uncommitted instructions: reorder buffer (ROB)
– 3 fields: instr, destination, value
– Use the reorder buffer number instead of the reservation station when execution completes
– Supplies operands between execution complete & commit
– (Reorder buffer can be operand source => more registers, like RS)
– Instructions commit in order
– Once an instruction commits, its result is put into the register
– As a result, easy to undo speculated instructions on mispredicted branches or exceptions
(Figure: FP Op Queue feeding reservation stations and the FP adder, with the reorder buffer sitting between execution and the FP registers.)
Four Steps of Speculative Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue
If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called “dispatch”)
2. Execution—operate on operands (EX)
When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called “issue”)
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available
4. Commit—update register with reorder result
When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer; a mispredicted branch flushes the reorder buffer (sometimes called “graduation”; commit is sketched below)
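A minimal sketch of step 4 (assumed, not from the lecture): results may arrive out of order, but only the head of the reorder buffer may update architectural state:

```python
# Minimal sketch (assumed): in-order commit from a reorder buffer.
from collections import deque

regs = {"F0": 0.0, "F2": 0.0}       # architectural register file
rob = deque()                        # each entry: [dest, value, ready]

def rob_dispatch(dest):              # step 1: allocate a ROB entry at issue
    entry = [dest, None, False]
    rob.append(entry)
    return entry

def rob_write_result(entry, value):  # step 3: result broadcast on the CDB
    entry[1], entry[2] = value, True

def rob_commit():                    # step 4: only the head may commit
    while rob and rob[0][2]:
        dest, value, _ = rob.popleft()
        regs[dest] = value           # architectural state updated in order

older = rob_dispatch("F0"); younger = rob_dispatch("F2")
rob_write_result(younger, 3.0)       # the younger instruction finishes first...
rob_commit(); print(regs)            # ...but nothing commits past the head
rob_write_result(older, 1.0)
rob_commit(); print(regs)            # now both commit, in program order
```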
What are the hardware complexities with the reorder buffer (ROB)?
(Figure: the datapath of the previous slide, with a reorder table holding Dest Reg, Result, Exceptions?, Valid, and Program Counter fields per entry, and a compare network connecting the FP Op Queue, reservation stations, FP adder, reorder buffer, and FP registers.)
• Need as many ports on the ROB as on the register file
Summary
• Reservation stations: implicit register renaming to a larger set of registers + buffering of source operands
– Prevents registers from becoming the bottleneck
– Avoids the WAR, WAW hazards of the scoreboard
– Allows loop unrolling in HW
• Not limited to basic blocks (integer unit gets ahead, beyond branches)
• Today, helps cache misses as well
– Don’t stall for an L1 data cache miss (insufficient ILP for an L2 miss?)
• Lasting contributions
– Dynamic scheduling
– Register renaming
– Load/store disambiguation
Register renaming, virtual registers versus Reorder Buffers
• Alternative to the Reorder Buffer is a larger virtual set of registers and register renaming
• Virtual registers hold both architecturally visible registers + temporary values
– replace the functions of the reorder buffer and the reservation stations
• The renaming process maps names of architectural registers to registers in the virtual register set
– A changing subset of the virtual registers contains the architecturally visible registers
• Simplifies instruction commit: mark the register as no longer speculative, free the register holding the old value (sketched below)
• Adds 40-80 extra registers: Alpha, Pentium, …
– Size limits the no. of instructions in execution (registers are in use until commit)
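A minimal sketch of that commit step (assumed, not from the lecture; the register names and free list are hypothetical): with a physical register file, commit just makes the speculative mapping architectural and frees the register holding the old value:

```python
# Minimal sketch (assumed): commit with renaming to a physical register file.
free_list = ["P2", "P3", "P4"]
arch_map = {"F0": "P0"}    # committed, architecturally visible mapping
spec_map = {"F0": "P0"}    # speculative mapping used during renaming

def rename_write(arch_reg):
    spec_map[arch_reg] = free_list.pop(0)    # fresh physical destination
    return spec_map[arch_reg]

def commit_write(arch_reg):
    old_phys = arch_map[arch_reg]
    arch_map[arch_reg] = spec_map[arch_reg]  # mapping no longer speculative
    free_list.append(old_phys)               # old value's register is freed

rename_write("F0")          # speculative: F0 now maps to P2
commit_write("F0")          # commit: P2 becomes architectural, P0 is freed
print(arch_map, free_list)  # {'F0': 'P2'} ['P3', 'P4', 'P0']
```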
How much to speculate?
• Speculation pro: can uncover events that would otherwise stall the pipeline (cache misses)
• Speculation con: speculation is costly if an exceptional event occurs when the speculation was incorrect
• Typical solution: speculation allows only low-cost exceptional events (1st-level cache miss)
• When an expensive exceptional event occurs (2nd-level cache miss or TLB miss), the processor waits until the instruction causing the event is no longer speculative before handling the event
• Assuming a single branch per cycle: the future may speculate across multiple branches!
Limits to ILP
• Conflicting studies of the amount
– Benchmarks (vectorized Fortran FP vs. integer C programs)
– Hardware sophistication
– Compiler sophistication
• How much ILP is available using existing mechanisms with increasing HW budgets?
• Do we need to invent new HW/SW mechanisms to keep on the processor performance curve?
– Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints
– Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock
– Motorola AltiVec: 128 bit ints and FPs
– SuperSPARC multimedia ops, etc.
Limits to ILP
Initial HW model here; MIPS compilers.
Assumptions for an ideal/perfect machine to start:
1. Register renaming – infinite virtual registers => all register WAW & WAR hazards are avoided
2. Branch prediction – perfect; no mispredictions
3. Jump prediction – all jumps perfectly predicted
(2 & 3 => a machine with perfect speculation & an unbounded buffer of instructions available)
4. Memory-address alias analysis – addresses are known & a store can be moved before a load provided the addresses are not equal
Also: an unlimited number of instructions issued per clock cycle; perfect caches; 1 cycle latency for all instructions (including FP *, /)
Upper Limit to ILP: Ideal Machine
(Figure 3.34, page 294)
(Chart: instruction issues per cycle (IPC) on the ideal machine. Integer: 18 - 60; FP: 75 - 150. Values: gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doduc 118.7, tomcatv 150.1.)
More Realistic HW: Branch Impact
(Figure 3.38, Page 300)
Change from an infinite window to a 2000-entry window and a maximum issue of 64 instructions per clock cycle.
(Chart: IPC for gcc, espresso, li, fpppp, doduc, tomcatv as the branch predictor varies over Perfect, Tournament (selective) predictor, Standard 2-bit BHT (512 entries), Static (profile-based), and None (no prediction). Integer: 6 - 12; FP: 15 - 45.)
More Realistic HW: Renaming Register Impact
(Figure 3.41, Page 304)
Change to a 2000-instruction window, 64-instruction issue, 8K 2-level prediction.
(Chart: IPC for gcc, espresso, li, fpppp, doduc, tomcatv with Infinite, 256, 128, 64, 32, and no renaming registers. Integer: 5 - 15; FP: 11 - 45.)
More Realistic HW: Memory Address Alias Impact
(Figure 3.43, Page 306)
Change to a 2000-instruction window, 64-instruction issue, 8K 2-level prediction, 256 renaming registers.
(Chart: IPC for gcc, espresso, li, fpppp, doduc, tomcatv as alias analysis varies over Perfect, Global/stack perfect (heap conflicts), Inspection, and None. Integer: 4 - 9; FP: 4 - 45 (Fortran, no heap).)
Realistic HW for ‘00: Window Impact
(Figure 3.45, Page 309)
Perfect disambiguation (HW), 1K selective prediction, 16-entry return stack, 64 registers, issue as many instructions as the window allows.
(Chart: IPC for gcc, espresso, li, fpppp, doduc, tomcatv with window sizes Infinite, 256, 128, 64, 32, 16, 8, and 4. Integer: 6 - 12; FP: 8 - 45.)
How to Exceed ILP Limits of this study?
• WAR and WAW hazards through memory: the study eliminated register WAW and WAR hazards through renaming, but not those through memory
• Unnecessary dependences (e.g., the compiler not unrolling loops, so iterations depend on the induction variable)
• Overcoming the data flow limit: value prediction, i.e., predicting values and speculating on the prediction
– Address value prediction and speculation predicts addresses and speculates by reordering loads and stores; could provide better aliasing analysis; only need to predict whether addresses are equal
Workstation Microprocessors 3/2001
• Max issue: 4 instructions (many CPUs)
• Max rename registers: 128 (Pentium 4)
• Max BHT: 4K x 9 (Alpha 21264B), 16K x 2 (Ultra III)
• Max window size (OOO): 126 instructions (Pentium 4)
• Max pipeline: 22/24 stages (Pentium 4)
Source: Microprocessor Report, www.MPRonline.com
Conclusion
• 1985-2000: 1000X performance
– Moore’s Law transistors/chip => Moore’s Law for performance/MPU
• Hennessy: industry has been following a roadmap of ideas known in 1985 to exploit Instruction Level Parallelism and (real) Moore’s Law to get 1.55X/year
– Caches, pipelining, superscalar, branch prediction, out-of-order execution, …
• ILP limits: to make performance progress in the future, need explicit parallelism from the programmer vs. the implicit parallelism of ILP exploited by compiler and HW?
– Otherwise drop to the old rate of 1.3X per year?
– Less than 1.3X because of the processor-memory performance gap?
• Impact on you: if you care about performance, better to think about explicitly parallel algorithms vs. relying on ILP?
Review: Dynamic Branch Prediction
• Prediction is becoming an important part of scalar execution
• Branch History Table: 2 bits for loop accuracy (see the sketch below)
• Correlation: recently executed branches are correlated with the next branch
– Either different branches
– Or different executions of the same branch
• Tournament predictor: give more resources to competing solutions and pick between them
• Branch Target Buffer: include branch address & prediction
• Predicated execution can reduce the number of branches and the number of mispredicted branches
• Return address stack for prediction of indirect jumps
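A minimal sketch of the 2-bit BHT entry mentioned above (assumed, not from the lecture): two wrong predictions are needed to flip the prediction, so a loop branch is mispredicted only once per loop exit:

```python
# Minimal sketch (assumed): one 2-bit saturating counter, as used per BHT
# entry.  States 0-1 predict not taken, states 2-3 predict taken.
class TwoBitPredictor:
    def __init__(self):
        self.state = 2                 # start at "weakly taken"

    def predict(self):
        return self.state >= 2         # True = predict taken

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
correct = 0
for taken in [True] * 9 + [False]:     # a loop branch: taken 9x, then exits
    correct += (p.predict() == taken)
    p.update(taken)
print(correct, "of 10 correct")        # 9: only the loop exit mispredicts
```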
Review: Limits of ILP
• 1985-2000: 1000X performance
– Moore’s Law transistors/chip => Moore’s Law for performance/MPU
• Hennessy: industry has been following a roadmap of ideas known in 1985 to exploit Instruction Level Parallelism to get 1.55X/year
– Caches, pipelining, superscalar, branch prediction, out-of-order execution, …
• ILP limits: to make performance progress in the future, need explicit parallelism from the programmer vs. the implicit parallelism of ILP exploited by compiler and HW?
– Otherwise drop to the old rate of 1.3X per year?
– Less because of the processor-memory performance gap?
• Impact on you: if you care about performance, better to think about explicitly parallel algorithms vs. relying on ILP?
Dynamic Scheduling in P6
(Pentium Pro, II, III)
• Q: How to pipeline 1- to 17-byte 80x86 instructions?
• P6 doesn’t pipeline 80x86 instructions directly
• The P6 decode unit translates the Intel instructions into 72-bit micro-operations (~MIPS-like)
• Sends micro-operations to the reorder buffer & reservation stations
• Many instructions translate to 1 to 4 micro-operations
• Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations
• 14 clocks in the total pipeline (~3 state machines)
Dynamic Scheduling in P6

Parameter                            80x86   microops
Max. instructions issued/clock         3        6
Max. instr. complete exec./clock                5
Max. instr. committed/clock                     3
Window (instrs in reorder buffer)              40
Number of reservation stations                 20
Number of rename registers                     40
No. integer functional units (FUs)              2
No. floating point FUs                          1
No. SIMD Fl. Pt. FUs                            1
No. memory FUs                        1 load + 1 store
P6 Pipeline
• 14 clocks in total (~3 state machines)
• 8 stages are used for in-order instruction fetch, decode, and issue
– Takes 1 clock cycle to determine the length of the 80x86 instructions + 2 more to create the micro-operations (uops)
• 3 stages are used for out-of-order execution in one of 5 separate functional units
• 3 stages are used for instruction commit
(Diagram: Instr Fetch (16B/clk) → Instr Decode (3 instr/clk, up to 6 uops) → Renaming (3 uops/clk) → Reservation Station / Reorder Buffer → Execution units (5) → Graduation (3 uops/clk).)
P6 Block Diagram
• IP = PC
From: http://www.digitlife.com/articles/pentium4/
Pentium III Die Photo
1st Pentium III, Katmai: 9.5M transistors, 12.3 x 10.4 mm in 0.25-micron with 5 layers of aluminum
• EBL/BBL - Bus logic, Front, Back
• MOB - Memory Order Buffer
• Packed FPU - MMX Fl. Pt. (SSE)
• IEU - Integer Execution Unit
• FAU - Fl. Pt. Arithmetic Unit
• MIU - Memory Interface Unit
• DCU - Data Cache Unit
• PMH - Page Miss Handler
• DTLB - Data TLB
• BAC - Branch Address Calculator
• RAT - Register Alias Table
• SIMD - Packed Fl. Pt.
• RS - Reservation Station
• BTB - Branch Target Buffer
• IFU - Instruction Fetch Unit (+I$)
• ID - Instruction Decode
• ROB - Reorder Buffer
• MS - Micro-instruction Sequencer
P6 Performance: Stalls at decode stage
I$ misses or lack of RS/reorder buffer entries
(Chart: decode-stage stall cycles per instruction for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5, split into instruction-stream stalls and resource-capacity stalls.)
0.5 to 2.5 stall cycles per instruction: 0.98 avg. (0.36 integer)
P6 Performance: uops/x86 instr
200 MHz, 8K I$ / 8K D$ / 256K L2$, 66 MHz bus
(Chart: uops per IA-32 instruction for the same SPEC95 benchmarks.)
1.2 to 1.6 uops per IA-32 instruction: 1.36 avg. (1.37 integer)
P6 Performance: Branch Mispredict Rate
(Chart: BTB miss frequency and mispredict frequency for the same SPEC95 benchmarks.)
10% to 40% miss/mispredict ratio: 20% avg. (29% integer)
P6 Performance: Speculation rate
(% instructions issued that do not commit)
(Chart: speculation rate for the same SPEC95 benchmarks.)
1% to 60% of instructions do not commit: 20% avg. (30% integer)
P6 Performance: Cache Misses/1k instr
(Chart: L1 instruction, L1 data, and L2 misses per 1000 instructions for the same SPEC95 benchmarks.)
10 to 160 misses per thousand instructions: 49 avg. (30 integer)
P6 Performance: uops commit/clock
(Chart: fraction of cycles in which 0, 1, 2, or 3 uops commit, for the same SPEC95 benchmarks.)
Average: 0 uops 55%, 1 uop 13%, 2 uops 8%, 3 uops 23%
Integer: 0 uops 40%, 1 uop 21%, 2 uops 12%, 3 uops 27%
P6 Dynamic Benefit?
Sum of parts CPI vs. actual CPI
(Chart: per-benchmark CPI broken into uops, instruction cache stalls, resource capacity stalls, branch mispredict penalty, and data cache stalls, compared with the actual CPI.)
Ratio of sum of parts vs. actual CPI: 1.38X avg. (1.29X integer)
0.8 to 3.8 clock cycles per instruction: 1.68 avg. (1.16 integer)
AMD Athlon
• Similar to the P6 microarchitecture (Pentium III), but more resources
• Transistors: PIII 24M v. Athlon 37M
• Die size: 106 mm² v. 117 mm²
• Power: 30W v. 76W
• Cache: 16K/16K/256K v. 64K/64K/256K
• Window size: 40 v. 72 uops
• Rename registers: 40 v. 36 int + 36 Fl. Pt.
• BTB: 512 x 2 v. 4096 x 2
• Pipeline: 10-12 stages v. 9-11 stages
• Clock rate: 1.0 GHz v. 1.2 GHz
• Memory bandwidth: 1.06 GB/s v. 2.12 GB/s
Pentium 4
• Still translate from 80x86 to micro-ops
• P4 has better branch predictor, more FUs
• Instruction Cache holds micro-operations vs. 80x86 instructions
– no decode stages of 80x86 on cache hit
– called “trace cache” (TC)
• Faster memory bus: 400 MHz v. 133 MHz
• Caches
– Pentium III: L1I 16KB, L1D 16KB, L2 256 KB
– Pentium 4: L1I 12K uops, L1D 8 KB, L2 256 KB
– Block size: PIII 32B v. P4 128B; 128 v. 256 bits/clock
• Clock rates:
– Pentium III 1 GHz v. Pentium IV 1.5 GHz
– 14 stage pipeline vs. 24 stage pipeline
Pentium 4 features
• Multimedia instructions 128 bits wide vs. 64 bits wide => 144 new instructions
– When will they be used by programs??
– Faster floating point: execute 2 64-bit Fl. Pt. ops per clock
– Memory FU: 1 128-bit load, 1 128-bit store /clock to MMX regs
• Using RAMBUS DRAM
– Bandwidth faster, latency same as SDRAM
– Cost 2X-3X vs. SDRAM
• ALUs operate at 2X the clock rate for many ops
• Pipeline doesn’t stall at this clock rate: uops replay
• Rename registers: 40 vs. 128; Window: 40 v. 126
• BTB: 512 vs. 4096 entries (Intel: 1/3 improvement)
Pentium, Pentium Pro, Pentium 4 Pipeline
• Pentium (P5) = 5 stages
• Pentium Pro, II, III (P6) = 10 stages (1 cycle ex)
• Pentium 4 (NetBurst) = 20 stages (no decode)
From “Pentium 4 (Partially) Previewed,” Microprocessor Report, 8/28/00
Block Diagram of Pentium 4 Microarchitecture
• BTB = Branch Target Buffer (branch predictor)
• I-TLB = Instruction TLB; Trace Cache = Instruction cache
• RF = Register File; AGU = Address Generation Unit
• "Double pumped ALU" means the ALU clock rate is 2X => 2X ALU F.U.s
From “Pentium 4 (Partially) Previewed,” Microprocessor Report, 8/28/00
Pentium 4 Die Photo
• 42M Xtors
– PIII: 26M
• 217 mm²
– PIII: 106 mm²
• L1 Execution Cache
– Buffers 12,000 Micro-Ops
• 8KB data cache
• 256KB L2$
Benchmarks: Pentium 4 v. PIII v. Athlon
• SPECbase2000
– Int: P4@1.5 GHz: 524, PIII@1 GHz: 454, AMD Athlon@1.2 GHz: ?
– FP: P4@1.5 GHz: 549, PIII@1 GHz: 329, AMD Athlon@1.2 GHz: 304
• WorldBench 2000 benchmark (business), PC World magazine, Nov. 20, 2000 (bigger is better)
– P4: 164, PIII: 167, AMD Athlon: 180
• Quake 3 Arena: P4 172, Athlon 151
• SYSmark 2000 composite: P4 209, Athlon 221
• Office productivity: P4 197, Athlon 209
• S.F. Chronicle 11/20/00: "… the challenge for AMD now will be to argue that frequency is not the most important thing -- precisely the position Intel has argued while its Pentium III lagged behind the Athlon in clock speed."
Why?
• Instruction count is the same for x86
• Clock rates: P4 > Athlon > PIII
• How can the P4 be slower?
• Time = Instruction count x CPI x 1/Clock rate (worked example below)
• The average Clocks Per Instruction (CPI) of the P4 must be worse than that of the Athlon and PIII
• Will CPI ever get < 1.0 for real programs?
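A small worked example of the equation above, with assumed (not measured) CPIs chosen only to illustrate the point:

```python
# Worked example (CPIs are assumed for illustration): with equal instruction
# counts, Time = IC x CPI / clock rate, so a higher clock can still lose.
IC = 1_000_000_000
time_p4 = IC * 2.0 / 1.5e9       # hypothetical P4: CPI 2.0 at 1.5 GHz
time_athlon = IC * 1.4 / 1.2e9   # hypothetical Athlon: CPI 1.4 at 1.2 GHz
print(f"P4:     {time_p4:.3f} s")      # 1.333 s
print(f"Athlon: {time_athlon:.3f} s")  # 1.167 s: faster despite lower clock
```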
Simultaneous Multithreading (SMT)
• Simultaneous multithreading (SMT): the insight that a dynamically scheduled processor already has many of the HW mechanisms needed to support multithreading
– a large set of virtual registers that can be used to hold the register sets of independent threads (assuming separate renaming tables are kept for each thread)
– out-of-order completion allows the threads to execute out of order and get better utilization of the HW
Source: Microprocessor Report, December 6, 1999, “Compaq Chooses SMT for Alpha”
SMT is coming
• Just add a per-thread renaming table and keep separate PCs
– Independent commitment can be supported by logically keeping a separate reorder buffer for each thread
• Compaq has announced it for a future Alpha microprocessor: the 21464 in 2003; others likely
• On a multiprogramming workload comprising a mixture of SPECint95 and SPECfp95 benchmarks, Compaq claims the SMT it simulated achieves a 2.25X higher throughput with 4 simultaneous threads than with just 1 thread. For parallel programs, 4 threads give 1.75X v. 1.
Source: Microprocessor Report, December 6, 1999, “Compaq Chooses SMT for Alpha”