P6 and NetBurst Microarchitecture
ECE4100/6100
H-H. S. Lee
ECE4100/6100 Guest Lecture:
P6 & NetBurst Microarchitecture
Prof. Hsien-Hsin Sean Lee
School of ECE
Georgia Institute of Technology
February 11, 2003
Why study P6 from the last millennium?
A paradigm shift from Pentium
  A RISC core disguised as a CISC
Huge market success: microarchitecture (and stock price)
Architected by former VLIW and RISC folks
  Multiflow (pioneer in VLIW architecture for superminicomputers)
  Intel i960 (Intel’s RISC for graphics and embedded controllers)
NetBurst (P4’s microarchitecture) is based on P6
P6 Basics
One implementation of IA32 architecture
Super-pipelined processor
3-way superscalar
In-order front-end and back-end
Dynamic execution engine (restricted dataflow)
Speculative execution
P6 microarchitecture family processors include
Pentium Pro
Pentium II (PPro + MMX + 2x caches—16KB I/16KB D)
Pentium III (P-II + SSE + enhanced MMX, e.g. PSAD)
Celeron (without MP support)
Later P-II/P-III/Celeron all have on-die L2 cache
x86 Platform Architecture
(Block diagram: the host processor contains the P6 core with its L1 cache (SRAM) and a back-side bus to the L2 cache (SRAM, on-die or on-package). The front-side bus connects the processor to the chipset: the MCH links the AGP graphics processor (GPU, with local frame buffer) and system memory (DRAM); the ICH below it handles I/O, including PCI and USB.)
Pentium III Die Map
(Die photo with labeled units:)
EBL/BBL – External/Backside Bus Logic
MOB – Memory Order Buffer
Packed FPU – Floating-Point Unit for SSE
IEU – Integer Execution Unit
FAU – Floating-Point Arithmetic Unit
MIU – Memory Interface Unit
DCU – Data Cache Unit (L1)
PMH – Page Miss Handler
DTLB – Data TLB
BAC – Branch Address Calculator
RAT – Register Alias Table
SIMD – Packed Floating-Point Unit
RS – Reservation Station
BTB – Branch Target Buffer
TAP – Test Access Port
IFU – Instruction Fetch Unit and L1 I-Cache
ID – Instruction Decode
ROB – Reorder Buffer
MS – Micro-instruction Sequencer
ISA Enhancements (on top of Pentium)
CMOVcc / FCMOVcc r, r/m
Conditional moves (predicated move) instructions
Based on conditional code (cc)
FCOMI/P : compare FP stack and set integer flags
RDPMC/RDTSC instructions
Uncacheable Speculative Write-Combining (USWC) —
weakly ordered memory type for graphics memory
MMX in Pentium II
SIMD integer operations
SSE in Pentium III
Prefetches (non-temporal nta + temporal t0, t1, t2), sfence
SIMD single-precision FP operations
P6 Pipelining
(Pipeline stage diagram) In-order front end, up to the FE in-order boundary: Next IP (IFU1), I-cache fetch (IFU2), ILD and rotate (IFU3), decode (DEC1, DEC2/branch decode) into the IDQ, then RAT rename and RS write. Out-of-order core: after an RS scheduling delay, μops dispatch and execute; single-cycle and multi-cycle instruction pipelines write back on buses 81 (Mem/FP WB), 82 (Int WB), and 83 (Data WB). Memory μops go through MOB write and MOB block, then, after the ROB/MOB scheduling delay and MOB wakeup/dispatch, through AGU, DCache1, and DCache2 (non-blocking vs. blocking memory pipelines). At the retirement in-order boundary: retirement pointer write, Ret ROB read, RRF write (RET1, RET2).
P6 Microarchitecture
(Block diagram of the clusters at the chip boundary:)
Instruction Fetch Cluster: Instruction Fetch Unit with BTB/BAC (control flow)
Issue Cluster: two Instruction Decoders, Microcode Sequencer, Register Alias Table, Allocator
Out-of-order Cluster ((restricted) data flow): Reservation Station, ROB & Retirement RF
Execution units: two IEU/JEU, FEU, MMX
Memory Cluster: AGU, Memory Order Buffer, MIU, Data Cache Unit (L1)
Bus Cluster: bus interface unit, external bus
Instruction Fetching Unit
(Block diagram: the Next PC mux selects the linear address, fed by the Branch Target Buffer and other fetch requests; the instruction cache (with victim cache and streaming buffer) is accessed via the instruction TLB for the physical address; the ILD produces length marks, the BTB produces prediction marks, and the instruction rotator fills the instruction buffer, with the number of bytes consumed fed back by ID.)
IFU1: Initiate fetch, requesting 16 bytes at a time
IFU2: Instruction length decoder marks instruction boundaries; BTB makes prediction
IFU3: Align instructions to the 3 decoders in 4-1-1 format
Dynamic Branch Prediction
(Diagram: each entry of the 512-entry BTB keeps a Branch History Register (BHR), speculatively updated with the new history; the BHR indexes a per-way Pattern History Table (PHT, ways W0–W3) of 2-bit saturating counters, which supply the prediction and are updated with the branch result Rc.)
Similar to a 2-level PAs design
  Associated with each BTB entry
With a 16-entry Return Stack Buffer
4 branch predictions per cycle (due to 16-byte fetch per cycle)
Static prediction provided by the Branch Address Calculator when the BTB misses (see the next slide)
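The per-entry two-level scheme above can be sketched in a few lines. This is a hedged illustration of the mechanism, not Intel's disclosed design: the 4-bit history length matches the 16-entry PHT shown, but the class name, initialization, and the alternating-branch demo are ours.

```python
# Sketch of a per-branch two-level (PAs-style) predictor: a 4-bit Branch
# History Register (BHR) indexes a 16-entry Pattern History Table (PHT)
# of 2-bit saturating counters. Illustrative only.

class TwoLevelPredictor:
    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.bhr = 0                          # speculative history
        self.pht = [2] * (1 << history_bits)  # init weakly taken

    def predict(self):
        return self.pht[self.bhr] >= 2        # counter MSB = direction

    def update(self, taken):
        ctr = self.pht[self.bhr]
        self.pht[self.bhr] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        mask = (1 << self.history_bits) - 1
        self.bhr = ((self.bhr << 1) | int(taken)) & mask

p = TwoLevelPredictor()
# Train on an alternating T,N,T,N branch; the per-pattern counters learn it.
outcome = True
for _ in range(32):
    p.predict()
    p.update(outcome)
    outcome = not outcome
# After warm-up the predictor follows the alternation perfectly.
hits = 0
for _ in range(16):
    if p.predict() == outcome:
        hits += 1
    p.update(outcome)
    outcome = not outcome
print(hits)  # 16
```

Note that a simple 2-bit counter alone would mispredict this alternating branch about half the time; the history-indexed PHT is what captures the pattern.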
Static Branch Prediction
(Flowchart, applied on a BTB miss; on a BTB hit, the BTB's decision is used:)
  Unconditional PC-relative? Yes → Taken
  Return? Yes → Taken
  Indirect jump? Yes → Taken
  Conditional? Backwards → Taken; otherwise → Not Taken
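The flowchart above reduces to a small decision function. This is a hedged reconstruction: the `kind`, `pc_relative`, and `backward` descriptor fields are invented for illustration.

```python
# Sketch of the BTB-miss static prediction rules in the flowchart above
# (illustrative branch-descriptor encoding).

def static_predict(btb_hit, btb_taken=False, kind="cond", backward=False):
    """kind: 'uncond' (PC-relative unconditional), 'ret', 'ind', or 'cond'."""
    if btb_hit:
        return btb_taken        # dynamic BTB decision wins on a hit
    if kind in ("uncond", "ret", "ind"):
        return True             # unconditional / return / indirect: taken
    return backward             # conditional: backward taken, forward not

print(static_predict(False, kind="cond", backward=True))   # True
print(static_predict(False, kind="cond", backward=False))  # False
```

The backward-taken heuristic pays off because backward conditional branches are overwhelmingly loop back-edges.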
X86 Instruction Decode
(Diagram: IFU3 feeds the 4-1-1 decoder: one complex decoder (1–4 μops, backed by the Microinstruction Sequencer, MS) plus two simple decoders (1 μop each), writing into the 6-μop instruction decoder queue.)
Decode rate depends on instruction alignment
DEC1: translate x86 instructions into micro-operations (μops)
DEC2: move decoded μops to the ID queue
MS performs translations either way:
  Generate the entire μop sequence from the microcode ROM, or
  Receive 4 μops from the complex decoder and the rest from the microcode ROM

Next 3 inst    #Inst to dec
S,S,S          3
S,S,C          First 2
S,C,S          First 1
S,C,C          First 1
C,S,S          3
C,S,C          First 2
C,C,S          First 1
C,C,C          First 1
(S: Simple, C: Complex)
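The alignment table above can be reproduced by a tiny model of the 4-1-1 decoder; `decode_group` is an illustrative name, and the model covers only slot eligibility, not μop counts.

```python
# Sketch of 4-1-1 decode alignment: per cycle, decoder 0 takes a complex
# (C) or simple (S) instruction, decoders 1 and 2 take only simple ones,
# and decoding stops at the first instruction that doesn't fit its slot.

def decode_group(window):
    """window: list of 'S'/'C' for the next 3 instructions.
    Returns how many are decoded this cycle."""
    decoded = 1                 # decoder 0 accepts S or C
    for inst in window[1:3]:
        if inst != 'S':         # decoders 1 and 2 are simple-only
            break
        decoded += 1
    return decoded

table = {w: decode_group(list(w)) for w in
         ["SSS", "SSC", "SCS", "SCC", "CSS", "CSC", "CCS", "CCC"]}
print(table)
# {'SSS': 3, 'SSC': 2, 'SCS': 1, 'SCC': 1, 'CSS': 3, 'CSC': 2, 'CCS': 1, 'CCC': 1}
```

This is why compilers scheduling for P6 try to emit instruction streams in a 4-1-1 pattern: a complex instruction landing on a simple-only decoder stalls the group.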
Allocator
The interface between the in-order and out-of-order pipelines
Allocates
  “3-or-none” μops per cycle into the RS and ROB
  “all-or-none” in the MOB (LB and SB)
Generates physical destinations (Pdst) from the ROB and passes them to the Register Alias Table (RAT)
Stalls upon shortage of resources
Register Alias Table (RAT)
(Diagram: logical sources from the in-order queue index the integer RAT array and the FP RAT array (after FP TOS adjust) to yield physical sources (PSrc); integer and FP overrides handle dependences within a rename group; the Allocator supplies the physical ROB pointers. Renaming example: each RAT entry holds an RRF bit and a PSrc, e.g. EAX → RRF=0, PSrc=25 and ECX → RRF=0, PSrc=15.)
Register renaming for the 8 integer registers, 8 floating-point (stack) registers, and flags: 3 μops per cycle
40 80-bit physical registers embedded in the ROB (thereby, 6 bits to specify a PSrc)
RAT looks up physical ROB locations for renamed sources based on the RRF bit
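The renaming flow above can be sketched as follows, assuming one destination per μop and ignoring the FP stack and same-cycle overrides; the class and field names are illustrative.

```python
# Minimal sketch of P6-style renaming through the RAT: each architectural
# register maps either to the RRF (committed copy, RRF bit set) or to a
# ROB entry (RRF bit clear, 6-bit PSrc), as described above.

ROB_SIZE = 40

class RAT:
    def __init__(self):
        # entry: (in_rrf, psrc) -- True means read the committed RRF copy
        self.map = {r: (True, r) for r in
                    ["EAX", "EBX", "ECX", "EDX", "ESI", "EDI", "ESP", "EBP"]}
        self.next_rob = 0

    def rename(self, dst, srcs):
        psrcs = [self.map[s] for s in srcs]   # look up renamed sources
        pdst = self.next_rob                  # allocator's ROB entry
        self.next_rob = (self.next_rob + 1) % ROB_SIZE
        self.map[dst] = (False, pdst)         # dst now lives in the ROB
        return pdst, psrcs

rat = RAT()
pdst0, srcs0 = rat.rename("EAX", ["EBX"])   # EAX <- f(EBX)
pdst1, srcs1 = rat.rename("ECX", ["EAX"])   # ECX <- f(EAX)
print(pdst0, srcs0)   # 0 [(True, 'EBX')]
print(pdst1, srcs1)   # 1 [(False, 0)]  (EAX read from ROB entry 0)
```

The second μop's source resolves to ROB entry 0 rather than the RRF, which is exactly how the RAT breaks the false dependence on the architectural name.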
Partial Register Width Renaming
(Diagram: the integer RAT array is banked into an INT low bank (32b/16b/L: 8 entries) and an INT high bank (H: 4 entries); each entry holds Size (2 bits), RRF (1 bit), and PSrc (6 bits). Example μop sequence: μop0: MOV AL, …; μop1: MOV AH, …; μop2: ADD …, AL; μop3: ADD …, AH, where AL and AH rename through different banks.)
32/16-bit accesses:
  Read from the low bank
  Write to both banks
8-bit RAT accesses: depend on which bank is being written
Partial Stalls due to RAT
(Diagram: a write to AX followed by a read of EAX)
Partial register stalls
  MOVB AL, m8
  ADD EAX, m32   ; stall
Idiom Fix (1)
  XOR EAX, EAX
  MOVB AL, m8
  ADD EAX, m32   ; no stall
Idiom Fix (2)
  SUB EAX, EAX
  MOVB AL, m8
  ADD EAX, m32   ; no stall
Partial flag stalls (1)
  CMP EAX, EBX
  INC ECX
  JBE XX         ; stall
  (JBE reads both ZF and CF, while INC affects ZF, OF, SF, AF, PF)
Partial flag stalls (2)
  TEST EBX, EBX
  LAHF           ; stall
  (LAHF loads the low byte of EFLAGS)
Partial register stalls: occur when a write to a smaller (e.g. 8/16-bit) register is followed by a read of a larger (e.g. 32-bit) one
Partial flag stalls: occur when a subsequent instruction reads more flags than a prior unretired instruction touches
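The stall rule and the zeroing-idiom fixes above can be modeled roughly as below. This is an illustrative model only: the `zeroed` flag stands in for the hardware's recognition of XOR/SUB reg, reg idioms, and the encoding of in-flight writes is invented for the example.

```python
# Illustrative model of the partial-register stall rule above. in_flight
# lists unretired writes oldest-first as (reg, width_bits, zeroed), where
# 'zeroed' marks a recognized XOR/SUB reg, reg zeroing idiom.

def has_partial_stall(in_flight, read_reg, read_width):
    stall = False
    upper_known_zero = False
    for reg, width, zeroed in in_flight:
        if reg != read_reg:
            continue
        if width >= read_width:
            stall = False               # full-width write covers the read
            upper_known_zero = zeroed   # idiom makes upper bits known zero
        elif not upper_known_zero:
            stall = True                # narrow write, wide read: stall
    return stall

# MOVB AL, m8 ; ADD EAX, m32 -> stall
print(has_partial_stall([("EAX", 8, False)], "EAX", 32))                     # True
# XOR EAX, EAX ; MOVB AL, m8 ; ADD EAX, m32 -> no stall
print(has_partial_stall([("EAX", 32, True), ("EAX", 8, False)], "EAX", 32))  # False
```

The point of the idioms is that the renamer can prove the upper bits are zero, so the wide read needn't wait for the narrow write to retire and merge.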
Reservation Stations
(Diagram of the RS dispatch ports: Port 0: IEU0, FAdd, FMul, IMul, Div (WB bus 0); Port 1: IEU1, JEU, PFAdd, PFMul, PFShuf (WB bus 1); Port 2: AGU0 load address (LDA) to the MOB; Port 3: AGU1 store address (STA); Port 4: store data (STD) to the DCU. Loaded data returns from the DCU; retired data goes to the RRF via the ROB.)
Gateway to execution: binds up to 5 μops (one per port) per cycle
20-μop-entry buffer bridging the in-order and out-of-order engines
RS fields include μop opcode, data valid bits, Pdst, Psrc, source data, BrPred, etc.
Oldest-first FIFO scheduling when multiple μops are ready in the same cycle
ReOrder Buffer
A 40-entry circular buffer
  Similar to that described in [SmithPleszkun85]
  157 bits wide
  Provides 40 alias physical registers
Out-of-order completion
  Deposits an exception code in each entry
Retirement (or de-allocation)
  After resolving prior speculation
  Handles exceptions through the MS ((exp) code assist)
  Clears OOO state when a mis-predicted branch or exception is detected
  3 μops per cycle in program order
  For multi-μop x86 instructions: none or all (atomic)
(Diagram: ALLOC and the RAT write into the ROB; the RS completes results out of order; retirement moves results to the RRF, with the MS providing exception code assist.)
Memory Execution Cluster
(Diagram: LD, STA, and STD μops flow from the RS/ROB into the load buffer and store buffer; the DTLB, fill buffers (FB), and DCU serve the accesses, with the EBL connecting to the external bus.)
Memory cluster blocks
  Manage data memory accesses
  Address translation
  Detect violations of access ordering
Fill buffers in the DCU (similar to MSHRs [Kroft’81]) handle cache misses (non-blocking)
Memory Order Buffer (MOB)
Allocated by ALLOC
A second-order RS for memory operations
  1 μop for a load; 2 μops for a store: Store Address (STA) and Store Data (STD)
MOB
  16-entry load buffer (LB)
  12-entry store address buffer (SAB)
  SAB works in unison with
    the Store Data Buffer (SDB) in the MIU
    the Physical Address Buffer (PAB) in the DCU
  Store Buffer (SB): SAB + SDB + PAB
Senior stores
  Upon STD/STA retiring from the ROB, the SB marks the store “senior”
  Senior stores are committed back to memory in program order when the bus is idle or the SB is full
Prefetch instructions in P-III follow senior-load behavior, since they have no explicit architectural destination
Store Coloring
x86 Instruction        μops                      store color
mov (0x1220), ebx      std (ebx); sta 0x1220     2
mov (0x1110), eax      std (eax); sta 0x1110     3
mov ecx, (0x1220)      ld                        3
mov edx, (0x1280)      ld                        3
mov (0x1400), edx      std (edx); sta 0x1400     4
mov edx, (0x1380)      ld                        4

ALLOC assigns Store Buffer IDs (SBIDs) in program order
ALLOC tags loads with the most recent SBID (the load’s color)
Check loads against stores with equal or younger SBIDs for potential address conflicts
SDB forwards data if a conflict is detected
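Store coloring and the conflict check above can be sketched as follows; the SBID numbering starts at 2 to match the example, and all names are illustrative (the real MOB matches on physical addresses and sizes, which this sketch ignores).

```python
# Sketch of store coloring: the allocator hands out store-buffer IDs
# (SBIDs) in program order and tags each load with the most recent SBID,
# so the MOB only checks a load against stores older in program order.

sbid_counter = 1
stores = []   # (sbid, addr) in allocation order

def alloc_store(addr):
    global sbid_counter
    sbid_counter += 1
    stores.append((sbid_counter, addr))
    return sbid_counter

def alloc_load(addr):
    return (sbid_counter, addr)   # tag load with most recent SBID (color)

def conflicting_store(load):
    color, addr = load
    # check only stores at or before the load's color, youngest first
    for sbid, st_addr in reversed(stores):
        if sbid <= color and st_addr == addr:
            return sbid           # SDB would forward this store's data
    return None

alloc_store(0x1220)               # color 2
alloc_store(0x1110)               # color 3
ld1 = alloc_load(0x1220)          # tagged with color 3
alloc_store(0x1400)               # color 4
ld2 = alloc_load(0x1380)          # tagged with color 4
print(conflicting_store(ld1))     # 2  (forward from the store to 0x1220)
print(conflicting_store(ld2))     # None
```

Note how the color lets `ld1` skip the later store to 0x1400 entirely: it cannot conflict with a store that is younger in program order.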
Memory Type Range Registers (MTRR)
Control registers written by the system (OS)
Supported memory types
  UnCacheable (UC)
  Uncacheable Speculative Write-Combining (USWC or WC): weakly ordered, uses a fill buffer entry as the WC buffer
  WriteBack (WB)
  Write-Through (WT)
  Write-Protected (WP)
    E.g. supports copy-on-write in UNIX: saves memory by letting child processes share pages with their parents; new pages are created only when a child process attempts to write.
Page Miss Handler (PMH)
  Looks up the MTRRs while supplying physical addresses
  Returns memory types and physical addresses to the DTLB
Intel NetBurst Microarchitecture
Pentium 4’s microarchitecture, a post-P6 new generation
Original target market: Graphics workstations, but … the
major competitor screwed up themselves…
Design Goals:
Performance, performance, performance, …
Unprecedented multimedia/floating-point performance
Streaming SIMD Extensions 2 (SSE2)
Reduced CPI
Low latency instructions
High bandwidth instruction fetching
Rapid Execution of Arithmetic & Logic operations
Reduced clock period
New pipeline designed for scalability
Innovations Beyond P6
Hyperpipelined technology
Streaming SIMD Extension 2
Enhanced branch predictor
Execution trace cache
Rapid execution engine
Advanced Transfer Cache
Hyper-threading Technology (in Xeon and Xeon MP)
Pentium 4 Fact Sheet
IA-32 fully backward compatible
Available at speeds ranging from 1.3 to ~3 GHz
Hyperpipelined (20+ stages)
42+ million transistors
0.18 μm for 1.7 to 1.9 GHz; 0.13 μm for 1.8 to 2.8 GHz
Die size of 217 mm²
Consumes 55 watts of power at 1.5 GHz
400 MHz (850) and 533 MHz (850E) system bus
512KB or 256KB 8-way full-speed on-die L2 Advanced Transfer Cache (up to 89.6 GB/s to L1 at 2.8 GHz)
1MB or 512KB L3 cache (in Xeon MP)
144 new 128-bit SIMD instructions (SSE2)
HyperThreading Technology (only enabled in Xeon and Xeon MP)
Recent Intel IA-32 Processors
Building Blocks of Netburst
(Block diagram: the front end (fetch/decode, execution trace cache (ETC), μROM, and BTB/branch predictor with branch-history update) feeds the out-of-order engine (OOO logic and retire), which dispatches to the INT and FP execution units; the memory subsystem comprises the L1 data cache and the Level 2 cache behind the bus unit on the system bus.)
Pentium 4 Microarchitecture
(Block diagram:)
Front end: I-TLB/prefetcher and IA32 decoder (fed 64 bits at a time from the L2), BTB (4K entries), code ROM, and the Execution Trace Cache with its own trace-cache BTB (512 entries), feeding a μop queue.
Out-of-order engine: allocator/register renamer into a memory μop queue (memory scheduler) and an INT/FP μop queue (fast, slow/general FP, and simple FP schedulers); INT register file/bypass network with 2x double-pumped simple ALUs, a slow (complex) ALU, and load-address/store-address AGUs; FP register file/bypass network with FP move and FP MMX/SSE/SSE2 units.
Memory: L1 data cache (8KB 4-way, 64-byte line, WT, 1 rd + 1 wr port); unified L2 cache (256KB 8-way, 128-byte line, WB) at 48 GB/s over 256 bits at 1.5 GHz; BIU on the quad-pumped 400M/533MHz 64-bit system bus (3.2/4.3 GB/s).
Pipeline Depth Evolution
P5 microarchitecture: PREF | DEC | DEC | EXEC | WB
P6 microarchitecture: IFU1 | IFU2 | IFU3 | DEC1 | DEC2 | RAT | ROB | DIS | EX | RET1 | RET2
NetBurst microarchitecture: TC NextIP | TC Fetch | Drive | Alloc | Rename | Queue | Schedule | Dispatch | Reg File | Exec | Flags | Br Ck | Drive
Execution Trace Cache
Primary first-level I-cache, replacing the conventional L1
  Decoding several x86 instructions at high frequency is difficult and takes several pipeline stages
  Branch misprediction penalty is horrible: ~20 pipeline stages lost vs. 10 stages in P6
Advantages
  Caches post-decode μops
  High-bandwidth instruction fetching
  Eliminates x86 decoding overheads
  Reduces branch recovery time on a TC hit
Holds up to 12,000 μops
  6 μops per trace line
  Many (?) trace lines in a single trace
Execution Trace Cache
Delivers 3 μops per cycle to the OOO engine
x86 instructions are read from the L2 when the TC misses (7+ cycle latency)
TC hit rate comparable to an 8K–16KB conventional I-cache
Simplified x86 decoder
  Only one complex instruction per cycle
  Instructions of more than 4 μops are executed from the microcode ROM (P6’s MS)
Branch prediction performed in the TC
  512-entry BTB + 16-entry RAS
  Together with the BP in the x86 IFU, reduces mispredictions by 1/3 compared to P6
  Intel did not disclose the details of the BP algorithms used in the TC and the x86 IFU (dynamic + static)
Out-Of-Order Engine
Similar design philosophy to P6; uses
  Allocator
  Register Alias Table
128 physical registers
126-entry ReOrder Buffer
48-entry load buffer
24-entry store buffer
Register Renaming Schemes
(Diagram:)
P6 register renaming: the RAT maps the 8 architectural registers (EAX…EBP) into the 40-entry ROB, which holds both data and status and is allocated sequentially; retired values move to the RRF.
NetBurst register renaming: a front-end RAT and a retirement RAT (each mapping EAX…EBP) point into a separate 128-entry register file that holds the data; the 126-entry ROB, allocated sequentially, holds status only.
Micro-op Scheduling
μop FIFO queues
  Memory queue for loads and stores
  Non-memory queue
μop schedulers
  Several schedulers fire instructions to execution (P6’s RS)
  4 distinct dispatch ports
  Maximum dispatch: 6 μops per cycle (2 from each double-pumped fast ALU on ports 0 and 1, plus 1 each from the load and store ports)
(Port diagram:)
Exec Port 0: Fast ALU (2x pumped): add/sub, logic, store data, branches; FP Move: FP/SSE move, FP/SSE store, FXCH
Exec Port 1: Fast ALU (2x pumped): add/sub; INT Exec: shift, rotate; FP Exec: FP/SSE add, FP/SSE mul, FP/SSE div, MMX
Load Port: Memory load: loads, LEA, prefetch
Store Port: Memory store: stores
Data Memory Accesses
8KB 4-way L1 + 256KB 8-way L2 (with a HW prefetcher)
Load-to-use speculation
  Dependent instructions are dispatched before the load finishes
    Due to the high frequency and deep pipeline depth
  Scheduler assumes loads always hit the L1
  On an L1 miss, dependent instructions that have left the scheduler temporarily receive incorrect data: a mis-speculation
  Replay logic re-executes the load when mis-speculated
  Independent instructions are allowed to proceed
Up to 4 outstanding load misses (= 4 fill buffers, as in the original P6)
Store-to-load forwarding buffer
  24 entries
  Forwards only when the load has the same starting physical address and load data size <= store data size
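The load-hit speculation and the forwarding restriction above can be sketched as two small checks; the latencies and function names are illustrative, and only the stated rules come from the slide.

```python
# Two small checks sketching the rules above (illustrative timings).

L1_HIT_LATENCY = 2      # assumed (illustrative) L1 load-use latency
MISS_PENALTY = 7        # illustrative effective latency after replay

def schedule_load_and_dependent(l1_hit):
    """Scheduler speculates that the load hits L1; on a miss the
    dependent μop picked up wrong data and must be replayed."""
    if l1_hit:
        return L1_HIT_LATENCY, False      # (latency, replayed?)
    return MISS_PENALTY, True

def can_forward(store_addr, store_size, load_addr, load_size):
    """Store-to-load forwarding: same starting physical address and
    load data size <= store data size."""
    return load_addr == store_addr and load_size <= store_size

print(schedule_load_and_dependent(True))    # (2, False)
print(schedule_load_and_dependent(False))   # (7, True)
print(can_forward(0x1000, 4, 0x1000, 4))    # True
print(can_forward(0x1000, 2, 0x1000, 4))    # False
```

Speculating on a hit keeps the common case fast: waiting for the hit/miss signal before waking dependents would add the tag-check latency to every load-use chain.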
Streaming SIMD Extension 2
P-III SSE (Katmai New Instructions: KNI)
  Eight 128-bit wide xmm registers (new architectural state)
  Single-precision 128-bit SIMD FP
    Four 32-bit FP operations in one instruction
    Broken down into 2 μops for execution (only 80-bit data in the ROB)
  64-bit SIMD MMX (uses the 8 mm registers, mapped onto the FP stack)
  Prefetch (non-temporal nta + temporal t0, t1, t2) and sfence
P4 SSE2 (Willamette New Instructions: WNI)
  Supports double-precision 128-bit SIMD FP
    Two 64-bit FP operations in one instruction
    Throughput: 2 cycles for most SSE2 operations (exceptions: DIVPD and SQRTPD, 69 cycles, non-pipelined)
  Enhanced 128-bit SIMD MMX using the xmm registers
Examples of Using SSE
(Diagram, xmm lanes shown high to low as [X3, X2, X1, X0] op [Y3, Y2, Y1, Y0]:)
Packed SP FP operation (e.g. ADDPS xmm1, xmm2): xmm1 = [X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0]
Scalar SP FP operation (e.g. ADDSS xmm1, xmm2): xmm1 = [X3, X2, X1, X0 op Y0]
Shuffle FP operation with 8-bit immediate (e.g. SHUFPS xmm1, xmm2, 0xf1): xmm1 = [Y3, Y3, X0, X1]
Examples of Using SSE and SSE2
SSE (as on the previous slide):
Packed SP FP operation (e.g. ADDPS xmm1, xmm2): xmm1 = [X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0]
Scalar SP FP operation (e.g. ADDSS xmm1, xmm2): xmm1 = [X3, X2, X1, X0 op Y0]
Shuffle FP operation with 8-bit immediate (e.g. SHUFPS xmm1, xmm2, imm8)
SSE2 (two 64-bit lanes [X1, X0] op [Y1, Y0]):
Packed DP FP operation (e.g. ADDPD xmm1, xmm2): xmm1 = [X1 op Y1, X0 op Y0]
Scalar DP FP operation (e.g. ADDSD xmm1, xmm2): xmm1 = [X1, X0 op Y0]
Shuffle DP operation with 2-bit immediate (e.g. SHUFPD xmm1, xmm2, imm2): xmm1 = [Y1 or Y0, X1 or X0]
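The lane-wise semantics illustrated above can be written out directly, modeling an xmm register as a list of lanes (element 0 = lowest lane). The SHUFPS selector encoding follows the ISA; the helper names are ours.

```python
# Sketch of the SSE packed/scalar/shuffle semantics shown above.
import operator

def packed(op, x, y):            # e.g. ADDPS / ADDPD: all lanes
    return [op(a, b) for a, b in zip(x, y)]

def scalar(op, x, y):            # e.g. ADDSS / ADDSD: lane 0 only
    return [op(x[0], y[0])] + x[1:]

def shufps(x, y, imm8):          # SHUFPS: low 2 lanes pick from x, high 2 from y
    sel = [(imm8 >> (2 * i)) & 3 for i in range(4)]
    return [x[sel[0]], x[sel[1]], y[sel[2]], y[sel[3]]]

x = [1.0, 2.0, 3.0, 4.0]         # lanes X0..X3
y = [10.0, 20.0, 30.0, 40.0]     # lanes Y0..Y3
print(packed(operator.add, x, y))   # [11.0, 22.0, 33.0, 44.0]
print(scalar(operator.add, x, y))   # [11.0, 2.0, 3.0, 4.0]
print(shufps(x, y, 0xf1))           # [2.0, 1.0, 40.0, 40.0]
```

The 0xf1 shuffle result, read high lane to low, is [Y3, Y3, X0, X1], matching the slide's example.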
HyperThreading
In the Intel Xeon and Intel Xeon MP processors
Enables Simultaneous Multi-Threading (SMT)
  Exploits ILP through TLP (Thread-Level Parallelism)
  Issues and executes multiple threads in the same snapshot
A single P4 Xeon appears to be 2 logical processors
  Sharing the same execution resources
  Architectural state is duplicated in hardware
Multithreading (MT) Paradigms
(Diagram: execution time vs. functional units FU1–FU4, with each issue slot colored by thread (Thread 1–5) or left unused, comparing: conventional superscalar (single-threaded), fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), chip multiprocessor (CMP), and simultaneous multithreading.)
More SMT commercial processors
Intel Xeon HyperThreading
  Supports 2 replicated hardware contexts: the PC (or IP) and the architectural registers
  New directions of usage
    Helper (or assisted) threads (e.g. speculative precomputation)
    Speculative multithreading
Clearwater (once called Xtream Logic): an 8-context SMT “network processor” designed by the DISC architect (company no longer exists)
SUN 4-SMT-processor CMP?
Speculative Multithreading
SMT can justify a wider-than-ILP datapath
  But the datapath is only fully utilized by multiple threads
How to speed up a single-threaded program by utilizing multiple threads? What to do with spare resources?
  Execute both sides of hard-to-predict branches
    Eager execution or polypath execution
    Dynamic predication
  Send another thread to scout ahead to warm up the caches and BTB
    Speculative precomputation
    Early branch resolution
  Speculatively execute future work
    Multiscalar or dynamic multithreading
    e.g. start several loop iterations concurrently as different threads; if a data dependence is detected, redo the work
  Run a dynamic compiler/optimizer on the side
  Dynamic verification
    DIVA or Slipstream Processor