
ECE 4100/6100
Advanced Computer Architecture
Lecture 12 P6 and NetBurst Microarchitecture
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology
P6 System Architecture
[Block diagram: the P6 core and its L1 cache (SRAM) connect over the back-side bus to the L2 cache (SRAM); the front-side bus links the host processor to the chipset (MCH and ICH), which connects the graphics processor (AGP / PCIExpress) with its local frame buffer, system memory (DRAM), and PCI/USB I/O.]
2
P6 Microarchitecture
[Block diagram of the P6 core, chip boundary to the external bus: the Bus Cluster (bus interface unit) serves the Memory Cluster (Data Cache Unit (L1), Memory Order Buffer, AGU, MIU); the Instruction Fetch Cluster (Instruction Fetch Unit, BTB/BAC) feeds the Issue Cluster (two Instruction Decoders, Microcode Sequencer, Register Alias Table, Allocator); the Out-of-order Cluster (Reservation Station, execution units MMX, IEU/JEU, FEU, and the ROB & Retire RF) implements restricted data flow under control flow from the front end.]
3
Pentium III Die Map
[Die photo of the Pentium III with the labeled units listed below:]
EBL/BBL – External/Backside Bus Logic
MOB – Memory Order Buffer
Packed FPU – Floating Point Unit for SSE
IEU – Integer Execution Unit
FAU – Floating Point Arithmetic Unit
MIU – Memory Interface Unit
DCU – Data Cache Unit (L1)
PMH – Page Miss Handler
DTLB – Data TLB
BAC – Branch Address Calculator
RAT – Register Alias Table
SIMD – Packed Floating Point unit
RS – Reservation Station
BTB – Branch Target Buffer
TAP – Test Access Port
IFU – Instruction Fetch Unit and L1 I-Cache
ID – Instruction Decode
ROB – Reorder Buffer
MS – Micro-instruction Sequencer
4
P6 Basics
• One implementation of IA32 architecture
• Deeply pipelined processor
• In-order front-end and back-end
• Dynamic execution engine (restricted dataflow)
• Speculative execution
• P6 microarchitecture family processors include
– Pentium Pro
– Pentium II (PPro + MMX + 2x caches)
– Pentium III (P-II + SSE + enhanced MMX, e.g. PSAD)
– Pentium 4 (Not P6, will be discussed separately)
– Pentium M (+SSE2, SSE3, μop fusion)
– Core (PM + SSSE3, SSE4, Intel 64 (EM64T), MacroOp fusion, 4 μop retire rate vs. 3 in earlier P6 proliferations)
5
P6 Pipelining
[Pipeline diagram: an in-order front end (IFU1: Next IP; IFU2: I-Cache, ILD, Rotate; IFU3; DEC1; DEC2/Br Dec; RAT; RS write; stages numbered 11-17 and 20-22, up to the FE in-order boundary); out-of-order execution with a single-cycle pipeline (RS dispatch, Exec/WB), multi-cycle pipelines (AGU, Exec2 ... Exec n), a non-blocking memory pipeline (AGU, DCache1, DCache2, with MOB write, MOB block, MOB wakeup and MOB dispatch) and a blocking memory pipeline (81: Mem/FP WB, 82: Int WB schedule, 83: Data WB), separated from the front end by RS, ROB and MOB scheduling delays (stage groups 31-33, 40-43 and 81-83); and in-order retirement (RET1/RET2: ROB read, retirement pointer write, RRF write) at the retirement in-order boundary (stages 91-93).]
6
Instruction Fetching Unit
[Block diagram: the Next PC mux (fed by the Branch Target Buffer and other fetch requests) supplies a linear address to the Instruction Cache, backed by a streaming buffer and a victim cache; the Instruction TLB supplies the physical address (P.Addr); fetched bytes pass through the ILD (length marks), the instruction rotator and prediction marks into the instruction buffer, which tracks the number of bytes consumed by ID.]
• IFU1: Initiate fetch, requesting 16 bytes at a time
• IFU2: Instruction length decoder, mark instruction boundaries, BTB makes prediction (2 cycles)
• IFU3: Align instructions to 3 decoders in 4-1-1 format
7
Static Branch Prediction (stage 17 Br. Dec of pg. 6)
[Decision flowchart used by the Branch Address Calculator: on a BTB hit, the BTB dynamic predictor's decision is used. On a BTB miss: an unconditional PC-relative branch is predicted taken; a return is predicted taken; an indirect jump is predicted taken; a conditional PC-relative branch is predicted taken if it points backwards and not taken if it points forwards. A small sketch of these rules follows.]
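A minimal C sketch of the static-prediction rules in this flowchart, under the assumption that branches have already been classified; the branch_t type and its fields are hypothetical, not Intel's hardware interface.

/* Hypothetical sketch of the BAC static-prediction rules above.
   Branch classification fields are illustrative, not Intel's actual encoding. */
typedef enum { BR_UNCOND_PCREL, BR_RETURN, BR_INDIRECT, BR_COND_PCREL } br_kind_t;

typedef struct {
    br_kind_t kind;
    long      displacement;   /* sign of the PC-relative displacement */
    int       btb_hit;        /* 1 if the BTB produced a dynamic prediction */
    int       btb_taken;      /* BTB's dynamic decision, valid only on a hit */
} branch_t;

/* Returns 1 for predicted taken, 0 for predicted not-taken. */
static int predict_static(const branch_t *br)
{
    if (br->btb_hit)
        return br->btb_taken;            /* defer to the dynamic predictor */

    switch (br->kind) {
    case BR_UNCOND_PCREL:
    case BR_RETURN:
    case BR_INDIRECT:
        return 1;                        /* statically taken */
    case BR_COND_PCREL:
        return br->displacement < 0;     /* backwards: taken; forwards: not taken */
    }
    return 0;
}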
8
Dynamic Branch Prediction
[Diagram: a 512-entry BTB; each entry carries a 4-bit Branch History Register (BHR, shown as 1 1 1 0 1 with its new speculative bit) that indexes a per-way Pattern History Table (PHT, ways W0-W3) of 2-bit saturating counters (entries 0000 through 1111); the selected counter supplies the prediction and is updated with the branch result Rc, while the BHR is updated speculatively.]
• Similar to a 2-level PAs design
• Associated with each BTB entry
• With a 16-entry Return Stack Buffer
• 4 branch predictions per cycle (due to 16-byte fetch per cycle)
• Speculative update (2 copies of the BHR)
• Static prediction provided by the Branch Address Calculator when the BTB misses (see prior slide)
A compact sketch of this scheme follows.
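A compact C sketch of a PAs-style two-level predictor like the one above: a per-branch history register selects a 2-bit saturating counter. Table sizes, the PC indexing, and the omission of the second (speculative) BHR copy are simplifications for illustration.

/* Minimal sketch of a PAs-style two-level predictor as described above. */
#include <stdint.h>

#define BTB_ENTRIES 512
#define HIST_BITS   4

typedef struct {
    uint8_t bhr;                      /* per-branch history, HIST_BITS wide */
    uint8_t pht[1 << HIST_BITS];      /* 2-bit saturating counters */
} btb_entry_t;

static btb_entry_t btb[BTB_ENTRIES];

static int predict(uint32_t pc)
{
    btb_entry_t *e = &btb[pc % BTB_ENTRIES];
    return e->pht[e->bhr] >= 2;       /* MSB of the 2-bit counter */
}

static void update(uint32_t pc, int taken)
{
    btb_entry_t *e = &btb[pc % BTB_ENTRIES];
    uint8_t *ctr = &e->pht[e->bhr];
    if (taken  && *ctr < 3) (*ctr)++;
    if (!taken && *ctr > 0) (*ctr)--;
    e->bhr = (uint8_t)(((e->bhr << 1) | (taken ? 1 : 0)) & ((1 << HIST_BITS) - 1));
}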
9
X86 Instruction Decode
[Diagram: the 16-byte instruction buffer feeds a 4-1-1 decoder: one complex decoder (1-4 μops, assisted by the Microinstruction Sequencer (MS)) and two simple decoders (1 μop each); decoded μops enter the 6-μop instruction decoder queue on their way to RAT/ALLOC.]
• DEC1: translate x86 instructions into micro-operations (μops)
• DEC2: move decoded μops to the ID queue
• MS performs translations either by
  – generating the entire μop sequence from the microcode ROM, or
  – receiving 4 μops from the complex decoder and the rest from the microcode ROM
• Instructions following an instruction that needs the MS are flushed
• Decode rate depends on instruction alignment (S: Simple, C: Complex):

  Next 3 inst.   # inst. decoded
  S,S,S          3
  S,S,C          First 2
  S,C,S          First 1
  S,C,C          First 1
  C,S,S          3
  C,S,C          First 2
  C,C,S          First 1
  C,C,C          First 1

A sketch of this grouping rule follows the table.
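A minimal C sketch of the 4-1-1 grouping rule in the table above, assuming only the first decoder accepts complex instructions; the enum and function are purely illustrative.

/* Sketch of the 4-1-1 grouping rule: only decoder 0 handles complex
   instructions, so a complex instruction anywhere but the first slot
   ends the decode group. */
typedef enum { SIMPLE, COMPLEX } inst_class_t;

/* Returns how many of the next three instructions decode this cycle. */
static int insts_decoded(const inst_class_t next3[3])
{
    int n = 1;                           /* decoder 0 takes simple or complex */
    if (next3[1] == COMPLEX) return n;   /* decoder 1 is simple-only */
    n++;
    if (next3[2] == COMPLEX) return n;   /* decoder 2 is simple-only */
    return n + 1;
}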
10
Register Alias Table (RAT)
[Diagram: logical sources from the in-order queue index the integer RAT array and the FP RAT array (with FP TOS adjust); integer and FP override logic produces the physical sources (PSrc), and the Allocator supplies physical ROB pointers for the destinations. Renaming example: the RAT maps EAX (RRF bit 0) to PSrc 25 and ECX (RRF bit 0) to PSrc 15, i.e. to ROB entries rather than the RRF.]
• Register renaming for 8 integer registers, 8 floating-point (stack) registers and the flags: 3 μops per cycle
• 40 80-bit physical registers embedded in the ROB (thereby, 6 bits to specify a PSrc)
• RAT looks up physical ROB locations for renamed sources based on the RRF bit
• Override logic is for dependent μops decoded in the same cycle
• A misprediction reverts all pointers to point to the Retirement Register File (RRF)
A small sketch of this lookup-and-rename step follows.
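The sketch below illustrates, in C, the RAT lookup-and-rename step and the misprediction recovery described above; the structures, widths, and the single-source μop format are simplifications, not Intel's actual design.

/* Hypothetical sketch of RAT-based renaming: each architectural register
   maps either to the retirement register file (RRF) or to one of 40 ROB
   entries. Field widths follow the slide (6-bit PSrc). */
#include <stdint.h>

enum { NUM_ARCH_REGS = 8, NUM_ROB_ENTRIES = 40 };

typedef struct {
    uint8_t in_rrf;   /* 1: value lives in the RRF; 0: value lives in the ROB */
    uint8_t psrc;     /* ROB index (0..39) when in_rrf == 0 */
} rat_entry_t;

static rat_entry_t rat[NUM_ARCH_REGS];

/* Rename one μop: look up its source, then claim a new ROB entry
   (handed out by the allocator) for its destination. */
static void rename_uop(int src_reg, int dst_reg, uint8_t new_rob_entry,
                       rat_entry_t *src_out)
{
    *src_out = rat[src_reg];                 /* where to read the source from */
    rat[dst_reg].in_rrf = 0;                 /* destination now lives in the ROB */
    rat[dst_reg].psrc   = new_rob_entry;
}

/* On a branch misprediction, all mappings revert to the RRF. */
static void recover_rat(void)
{
    for (int r = 0; r < NUM_ARCH_REGS; r++)
        rat[r].in_rrf = 1;
}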
11
Partial Stalls due to RAT
[Diagram: a write to AX followed by a read of EAX.]

Partial flag stalls (1):
  CMP  EAX, EBX
  INC  ECX
  JBE  XX          ; stall

Partial register stalls:
  MOV  AX, m8
  ADD  EAX, m32    ; stall

Partial flag stalls (2):
  TEST EBX, EBX
  LAHF             ; stall

Idiom Fix (1):
  XOR  EAX, EAX
  MOV  AL, m8
  ADD  EAX, m32    ; no stall

Idiom Fix (2):
  SUB  EAX, EAX
  MOV  AL, m8
  ADD  EAX, m32    ; no stall

 JBE reads both ZF and CF, while INC writes only (ZF,OF,SF,AF,PF); of the flags JBE needs, INC supplies only ZF, not CF
 LAHF loads the low byte of EFLAGS, while TEST writes only part of those flags
• Partial register stalls: occur when a write to a smaller (e.g. 8/16-bit) register is followed by a read of a larger (e.g. 32-bit) register
  – because different partial pieces must be read from multiple physical registers!
• Partial flag stalls: occur when a subsequent instruction reads more flags than a prior unretired instruction touches
12
Partial Register Width Renaming
[Diagram: the integer RAT array now carries Size (2 bits) and RRF (1 bit) fields alongside the 6-bit PSrc; the integer physical state is split into an INT low bank (32b/16b/low byte: 8 entries) and an INT high bank (H: 4 entries); logical sources from the in-order queue are renamed through the integer and FP RAT arrays (with FP TOS adjust and Int/FP overrides), and physical ROB pointers come from the Allocator. Example: μop0: MOV AL = (a), μop1: MOV AH = (b), μop2: ADD AL = (c), μop3: ADD AH = (d).]
• 32/16-bit accesses:
  – Read from the low bank (AL/BL/CL/DL; AX/BX/CX/DX; EAX/EBX/ECX/EDX/EDI/ESI/EBP/ESP)
  – Write to both banks (AH/BH/CH/DH live in the high bank)
• 8-bit RAT accesses: depending on which bank is being written, only that particular bank is updated
13
Allocator (ALLOC)
• The interface between in-order and out-of-order
pipelines
• Allocates into the ROB, MOB and RS
  – "3-or-none" μops per cycle into the ROB and RS
    • Must have 3 free ROB entries or no allocation
  – "all-or-none" policy for the MOB
    • Stall allocation when not all the valid MOB μops can be allocated
• Generates the physical destination token (Pdst) from the ROB and passes it to the Register Alias Table (RAT) and RS
• Stalls upon shortage of resources (a small sketch of the policy checks follows)
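A minimal C sketch of the "3-or-none" and "all-or-none" allocation checks named above; the resource-counter struct is hypothetical.

/* Sketch of the allocation policies: "3-or-none" for ROB/RS entries and
   "all-or-none" for MOB entries. The counts are illustrative. */
typedef struct {
    int free_rob, free_rs, free_mob;
} resources_t;

/* Returns 1 if this cycle's group of 3 μops (of which mob_uops need MOB
   entries) can be allocated, 0 if allocation must stall. */
static int can_allocate(const resources_t *r, int mob_uops)
{
    if (r->free_rob < 3 || r->free_rs < 3) return 0;  /* 3-or-none */
    if (r->free_mob < mob_uops)            return 0;  /* all-or-none for MOB */
    return 1;
}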
14
Reservation Stations (RS)
[Diagram: the RS dispatches over five ports with two writeback buses. Port 0: IEU0, FAdd, FMul, IMul, Div, PFMul (WB bus 0); Port 1: IEU1, JEU, PFAdd, PFShuf (WB bus 1); Port 2: AGU0 load address (LDA) to the MOB; Port 3: AGU1 store address (STA); Port 4: store data (STD) to the DCU. Loaded data and retired RRF data are written back to the RS and ROB.]
• Gateway to execution: binds up to 5 μops (one per port) each cycle
• Port binding at dispatch time (certain μops can only be bound to one port)
• 20-entry μop buffer bridging the in-order and out-of-order engines (32 entries in Core)
• RS fields include μop opcode, data valid bits, Pdst, Psrc, source data, BrPred, etc.
• Oldest-first FIFO scheduling when multiple μops are ready in the same cycle
15
ReOrder Buffer (ROB)
• A 40-entry circular buffer (96 entries in Core)
  – 157 bits wide
  – Provides 40 aliased physical registers
• Out-of-order completion
• Deposits the exception status in each entry
• Retirement (or de-allocation)
  – After resolving prior speculation
  – Handles exceptions through the MS (microcode assist)
  – Clears OOO state when a mis-predicted branch or exception is detected
  – 3 μops per cycle, in program order
  – For multi-μop x86 instructions: none or all (atomic)
[Diagram: ALLOC and the RAT feed the RS and the ROB; retired results drain from the ROB into the RRF, and exception code assists are invoked through the MS. A sketch of the in-order retirement rule follows.]
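Below is an illustrative C sketch of the retirement rule stated above (up to 3 μops per cycle, in order, and atomically per x86 instruction); the ROB entry fields and the flat, non-circular indexing are simplifications.

/* Simplified ROB retirement model. */
typedef struct {
    int completed;      /* execution finished, no exception pending */
    int last_of_inst;   /* marks the final μop of an x86 instruction */
} rob_entry_t;

/* Retire from the head of the ROB (wraparound ignored for brevity);
   returns how many μops retired this cycle. */
static int retire(const rob_entry_t *rob, int head, int valid_entries)
{
    int retired = 0, pending = 0;
    for (int i = 0; i < valid_entries && retired + pending < 3; i++) {
        const rob_entry_t *e = &rob[head + i];
        if (!e->completed) break;          /* strict program order */
        pending++;
        if (e->last_of_inst) {             /* whole x86 instruction done */
            retired += pending;
            pending = 0;
        }
    }
    return retired;                        /* partial instructions wait */
}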
16
Memory Execution Cluster
[Diagram: the RS/ROB issue LD, STA and STD μops; loads go to the load buffer and stores to the store buffer in the memory cluster; addresses are translated by the DTLB, and accesses go to the DCU, whose fill buffers (FB) talk to the EBL.]

Example of an undetected address dependence:
  movl ecx, edi
  addl ecx, 8
  movl -4(edi), ebx    ; store
  movl eax, 4(ecx)     ; load: the RS cannot detect the potential conflict
                       ; and could dispatch the two at the same time

• Manages data memory accesses
• Address translation
• Detects violations of access ordering
• Fill buffers (FB) in the DCU, similar to MSHRs, for non-blocking cache support
17
Memory Order Buffer (MOB)
• Allocated by ALLOC
• A second-order RS for memory operations
• 1 μop for a load; 2 μops for a store: Store Address (STA) and Store Data (STD)
• MOB
   16-entry load buffer (LB) (32-entry in Core, 64 in SandyBridge)
   12-entry store address buffer (SAB) (20-entry in Core, 36 in SandyBridge)
   SAB works in unison with
    • the Store Data Buffer (SDB) in the MIU
    • the Physical Address Buffer (PAB) in the DCU
   Store Buffer (SB): SAB + SDB + PAB
• Senior stores
   Upon STD/STA retirement from the ROB,
   the SB marks the store "senior"
   Senior stores are committed to memory in program order when the bus is idle or the SB is full
• Prefetch instructions in P-III
   Senior-load behavior
   Due to no explicit architectural destination
   A new memory dependence predictor in Core predicts store-to-load dependences
18
Store Coloring
x86 instruction        μops                     store color
mov (0x1220), ebx      std ebx; sta 0x1220      2, 2
mov (0x1100), eax      std eax; sta 0x1100      3, 3
mov ecx, (0x1220)      ld 0x1220                3
mov edx, (0x1280)      ld 0x1280                3
mov (0x1400), edx      std edx; sta 0x1400      4, 4
mov edx, (0x1380)      ld 0x1380                4

• ALLOC assigns a Store Buffer ID (SBID) to each store in program order
• ALLOC tags each load with the most recent SBID (its store color)
• A load is checked against stores with an equal or older SBID for potential address conflicts
• The SDB forwards the data if a conflict is detected
A small sketch of this check follows.
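An illustrative C sketch of the store-coloring check: the load scans only stores at or before its color, youngest first, and a match means the SDB can forward. Buffer layout and sizes are made up for the example, and the circular SBID numbering of the real SAB is ignored.

/* Toy store buffer keyed by SBID. */
#include <stdint.h>

#define SB_ENTRIES 12

typedef struct {
    uint32_t addr;
    uint32_t data;
    int      addr_valid;    /* STA has executed */
} sb_entry_t;

static sb_entry_t store_buffer[SB_ENTRIES];

/* Check a load (with its store color) for a conflict with an older store.
   Returns the index of the youngest conflicting store, or -1 if none. */
static int check_load(uint32_t load_addr, int store_color)
{
    for (int sbid = store_color; sbid >= 0; sbid--) {
        if (store_buffer[sbid].addr_valid &&
            store_buffer[sbid].addr == load_addr)
            return sbid;    /* SDB can forward store_buffer[sbid].data */
    }
    return -1;
}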
19
Memory Type Range Registers (MTRR)
• Control registers written by the system (OS)
• Supported memory types
– UnCacheable (UC)
– Uncacheable Speculative Write-combining (USWC or WC)
• Use a fill buffer entry as WC buffer
– WriteBack (WB)
– Write-Through (WT)
– Write-Protected (WP)
• E.g. supports copy-on-write in UNIX: save memory space by letting child processes share pages with their parents, creating new memory pages only when a child process attempts to write
• Page Miss Handler (PMH)
– Look up MTRR while supplying physical addresses
– Return memory types and physical address to DTLB
20
Intel NetBurst Microarchitecture
• Pentium 4’s microarchitecture
• Original target market: Graphics workstations,
but …
• Design Goals:
– Performance, performance, performance, …
– Unprecedented multimedia/floating-point performance
• Streaming SIMD Extensions 2 (SSE2)
• SSE3 introduced in Prescott Pentium 4 (90nm)
– Reduced CPI
• Low latency instructions
• High bandwidth instruction fetching
• Rapid Execution of Arithmetic & Logic operations
– Reduced clock period
• New pipeline designed for scalability
21
Innovations Beyond P6
• Hyperpipelined technology
• Streaming SIMD Extension 2
• Hyper-threading Technology (HT)
• Execution trace cache
• Rapid execution engine
• Staggered adder unit
• Enhanced branch predictor
• Indirect branch predictor (also in Banias Pentium M)
• Load speculation and replay
22
Pentium 4 Fact Sheet
• IA-32 fully backward compatible
• Available at speeds ranging from 1.3 to ~3.8 GHz
• Hyperpipelined (20+ stages)
• 125 million transistors in Prescott (1.328 billion in the 16MB on-die L3 Tulsa, 65nm)
• 0.18μ for 1.3 to 2GHz; 0.13μ for 1.8 to 3.4GHz; 90nm for 2.8GHz to 3.6GHz
• Die size of 122mm2 (Prescott, 90nm), 435mm2 (Tulsa, 65nm)
• Consumes 115 watts of power at 3.6GHz
• 1066MHz system bus
• Prescott L1: 16KB 8-way vs. previous P4's 8KB 4-way
• 1MB, 512KB or 256KB 8-way full-speed on-die L2 (B/W example: 89.6 GB/s to L1 at 2.8GHz)
• 2MB L3 cache (in the P4 HT Extreme Edition, 0.13μ only), 16MB in Tulsa
• 144 new 128-bit SIMD instructions (SSE2), plus the SSE3 instructions added in Prescott
• HyperThreading Technology (not in all versions)
23
Building Blocks of Netburst
[Block diagram: the front end (fetch/decode, execution trace cache (ETC), μROM, BTB/branch prediction) feeds the out-of-order engine (OOO logic and retirement); the memory subsystem (L1 data cache, L2 cache, bus unit) connects to the system bus; the INT and FP execution units sit between them, and branch history updates flow from retirement back to the front end.]
24
Pentium 4 Microarchitecture (Prescott)
[Block diagram: the BIU sits on a quad-pumped 800MHz, 6.4 GB/sec, 64-bit system bus. The front end (4K-entry BTB, I-TLB/prefetcher, IA32 decoder, code ROM) fills the 12K-μop Execution Trace Cache, which has its own 2K-entry trace cache BTB and a μop queue. The allocator/register renamer feeds a memory μop queue (memory scheduler) and an INT/FP μop queue (fast, slow/general FP and simple FP schedulers). Execution resources over the INT register file / bypass network: two AGUs (load address, store address), two double-pumped ALUs for simple instructions and a slow ALU for complex instructions; over the FP register file / bypass network: FP/MMX/SSE/2/3 units and an FP move unit. The L1 data cache is 16KB 8-way, 64-byte lines, write-through, 1 read + 1 write port; a 256-bit path connects it to the unified 1MB 8-way L2 (128B lines, write-back, 108 GB/s).]
25
Pipeline Depth Evolution
P5 microarchitecture: PREF, DEC, DEC, EXEC, WB
P6 microarchitecture: IFU1, IFU2, IFU3, DEC1, DEC2, RAT, ROB, DIS, EX, RET1, RET2
NetBurst microarchitecture (Willamette), 20 stages: TC NextIP, TC Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, Reg File, Exec, Flags, Br Ck, Drive
NetBurst microarchitecture (Prescott): > 30 stages
26
Execution Trace Cache
• Primary first-level I-cache, replacing a conventional L1 I-cache
  – Decoding several x86 instructions at high frequency is difficult and takes several pipeline stages
  – Branch misprediction penalty is considerable
• Advantages
  – Caches post-decode μops (think of a fill unit)
  – High-bandwidth instruction fetching
  – Eliminates x86 decoding overheads
  – Reduces branch recovery time if the TC hits
• Holds up to 12,000 μops
  – 6 μops per trace line
  – Many (?) trace lines in a single trace
27
Execution Trace Cache
• Delivers 3 μops per cycle to the OOO engine if branch prediction is good
• x86 instructions are read from L2 when the TC misses (7+ cycle latency)
• TC hit rate is comparable to an 8KB to 16KB conventional I-cache
• Simplified x86 decoder
  – Only one complex instruction per cycle
  – Instructions of more than 4 μops are executed from the micro-code ROM (P6's MS)
• Branch prediction is performed in the TC
  – 512-entry BTB + 16-entry RAS
  – Together with the BP in the x86 IFU, reduces mispredictions by 33% compared to P6
  – Intel did not disclose the details of the BP algorithms used in the TC and the x86 IFU (dynamic + static)
28
Out-Of-Order Engine
• Similar design philosophy to P6; uses
– Allocator
– Register Alias Table
– 128 physical registers
– 126-entry ReOrder Buffer
– 48-entry load buffer
– 24-entry store buffer
29
Register Renaming Schemes
[Diagram. P6 register renaming: a single RAT maps EAX-EBP into the 40-entry ROB, whose entries (allocated sequentially) hold both data and status and retire into the RRF. NetBurst register renaming: a front-end RAT and a retirement RAT map EAX-EBP into a separate 128-entry physical register file holding the data, while the 126-entry ROB (allocated sequentially) holds only status.]
30
Micro-op Scheduling
• μop FIFO queues
  – Memory queue for loads and stores
  – Non-memory queue
• μop schedulers
  – Several schedulers fire instructions from the 2 μop queues to execution (P6's RS)
  – 4 distinct dispatch ports
  – Maximum dispatch: 6 μops per cycle (2 fast-ALU μops from each of Ports 0 and 1 per cycle; 1 each from the load and store ports)

Exec Port 0
  Fast ALU (2x pumped): add/sub, logic, store data, branches
  FP Move: FP/SSE move, FP/SSE store, FXCH
Exec Port 1
  Fast ALU (2x pumped): add/sub
  INT Exec: shift, rotate
  FP Exec: FP/SSE add, FP/SSE mul, FP/SSE div, MMX
Load Port
  Memory load: loads, LEA, prefetch
Store Port
  Memory store: stores
31
Data Memory Accesses
• Prescott: 16KB 8-way L1 + 1MB 8-way L2 (with a HW prefetcher), 128B line
• Load-to-use speculation
  – Dependent instructions are dispatched before the load finishes
    • Due to the high frequency and deep pipeline depth
    • The path from the load scheduler to execution is longer than the execution itself
  – The scheduler assumes loads always hit L1
  – If L1 misses, dependent instructions that have already left the scheduler temporarily receive incorrect data: a mis-speculation
  – Replay logic
    • Re-executes the load when mis-speculated
    • Mis-speculated operations are placed into a replay queue to be redispatched
  – All trailing independent instructions are allowed to proceed
  – Tornado breaker
• Up to 4 outstanding load misses (= 4 fill buffers in the original P6)
• Store-to-load forwarding buffer
  – 24 entries
  – The load must have the same starting physical address as the store
  – Load data size <= store data size (a small sketch of this condition follows)
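A trivial C sketch of the forwarding condition just listed (same starting physical address, load no wider than the store); the mem_op_t type is hypothetical.

/* Store-to-load forwarding eligibility check. */
#include <stdint.h>

typedef struct { uint64_t paddr; unsigned size; } mem_op_t;

static int can_forward(const mem_op_t *store, const mem_op_t *load)
{
    return store->paddr == load->paddr && load->size <= store->size;
}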
32
Fast Staggered ALU
[Diagram: the 32-bit datapath is split into bits [15:0], bits [31:16], and flag generation.]
• For frequent ALU instructions (no multiply, shift, rotate, or branch processing)
• Double-pumped clocks
• Each operation finishes in 3 fast cycles (a toy model follows):
  – lower-order 16 bits and bypass
  – higher-order 16 bits and bypass
  – ALU flags generation
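A toy C model of the staggered add described above: the low 16 bits complete one fast cycle before the high 16 bits, with a carry passed between halves; flag generation (the third fast cycle) is omitted. Purely illustrative.

/* Staggered 32-bit add split into two 16-bit halves. */
#include <stdint.h>

typedef struct { uint16_t lo, hi; uint8_t carry_mid, carry_out; } staggered_t;

/* Fast cycle 1: low half (available early for dependent low-half uses). */
static void add_low(uint32_t a, uint32_t b, staggered_t *r)
{
    uint32_t sum = (a & 0xFFFF) + (b & 0xFFFF);
    r->lo = (uint16_t)sum;
    r->carry_mid = (uint8_t)(sum >> 16);
}

/* Fast cycle 2: high half, consuming the mid carry. */
static void add_high(uint32_t a, uint32_t b, staggered_t *r)
{
    uint32_t sum = (a >> 16) + (b >> 16) + r->carry_mid;
    r->hi = (uint16_t)sum;
    r->carry_out = (uint8_t)(sum >> 16);
}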
33
Branch Predictor
• P4 uses the same hybrid predictor as the Pentium M (see the selection sketch below)
[Diagram: a local predictor, a global predictor and a bimodal predictor; a first MUX selects between the bimodal prediction (Pred_B) and the local prediction (Pred_L) based on L_hit, and a second MUX selects between that result and the global prediction (Pred_G) based on G_hit.]
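One plausible reading of the selection logic in this diagram, sketched in C; the component predictors are stubs, and the priority order (global, then local, then bimodal) is an assumption based on the MUX arrangement.

/* Hierarchical selection among the three component predictions. */
typedef struct { int hit; int taken; } component_pred_t;

static int hybrid_predict(component_pred_t global, component_pred_t local,
                          int bimodal_taken)
{
    if (global.hit) return global.taken;   /* Pred_G selected by G_hit */
    if (local.hit)  return local.taken;    /* Pred_L selected by L_hit */
    return bimodal_taken;                  /* Pred_B as the fallback */
}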
34
Indirect Branch Predictor
• In Pentium M and Prescott Pentium 4
• Prediction based on global history
35
New Instructions over Pentium
• CMOVcc / FCMOVcc r, r/m
– Conditional moves (predicated move) instructions
– Based on conditional code (cc)
• FCOMI/P : compare FP stack and set integer flags
• RDPMC/RDTSC instructions
– PMC: P6 has 2, Netburst (P4) has 18
• Uncacheable Speculative Write-Combining (USWC) —weakly
ordered memory type for graphics memory
36
New Instructions
• SSE2 in Pentium 4 (not the P6 microarchitecture)
  – Double-precision SIMD FP
• SSSE3 in Core 2
  – Supplemental instructions for shuffle, align, add, subtract
• Intel 64 (EM64T)
  – 64-bit support, new registers (8 more on top of the existing 8)
  – In Celeron D, Core 2 (and P4 Prescott, Pentium D)
  – Almost compatible with AMD64
  – AMD's NX bit or Intel's XD bit for preventing buffer overflow attacks
37
Streaming SIMD Extension 2
• P-III SSE (Katmai New Instructions: KNI)
  – Eight 128-bit wide xmm registers (new architectural state)
  – Single-precision 128-bit SIMD FP
    • Four 32-bit FP operations in one instruction
    • Broken down into 2 μops for execution (only 80-bit data in the ROB)
  – 64-bit SIMD MMX (uses the 8 mm registers, which map to the FP stack)
  – Prefetch (nta, t0, t1, t2) and sfence
• P4 SSE2 (Willamette New Instructions: WNI)
  – Supports double-precision 128-bit SIMD FP
    • Two 64-bit FP operations in one instruction
    • Throughput: 2 cycles for most SSE2 operations (exceptions: DIVPD and SQRTPD, 69 cycles, non-pipelined)
  – Enhanced 128-bit SIMD MMX using the xmm registers
38
Examples of Using SSE
[Diagram, with xmm1 = (X3, X2, X1, X0) and xmm2 = (Y3, Y2, Y1, Y0):
  Packed SP FP operation (e.g. ADDPS xmm1, xmm2): xmm1 = (X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0)
  Scalar SP FP operation (e.g. ADDSS xmm1, xmm2): xmm1 = (X3, X2, X1, X0 op Y0)
  Shuffle FP operation with an 8-bit immediate (e.g. SHUFPS xmm1, xmm2, 0xf1): xmm1 = (Y3, Y3, X0, X1)]
A short intrinsics example follows.
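For concreteness, a small C program using the SSE intrinsics that correspond to these three forms (ADDPS, ADDSS, SHUFPS); it assumes an SSE-capable compiler, and the values are arbitrary.

/* Packed, scalar, and shuffle single-precision operations via intrinsics. */
#include <stdio.h>
#include <xmmintrin.h>

int main(void)
{
    __m128 x = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);   /* X0..X3 */
    __m128 y = _mm_setr_ps(10.f, 20.f, 30.f, 40.f);   /* Y0..Y3 */
    float out[4];

    _mm_storeu_ps(out, _mm_add_ps(x, y));             /* ADDPS: all four lanes */
    printf("packed:  %g %g %g %g\n", out[0], out[1], out[2], out[3]);

    _mm_storeu_ps(out, _mm_add_ss(x, y));             /* ADDSS: lane 0 only */
    printf("scalar:  %g %g %g %g\n", out[0], out[1], out[2], out[3]);

    _mm_storeu_ps(out, _mm_shuffle_ps(x, y, 0xF1));   /* SHUFPS, imm8 = 0xf1 */
    printf("shuffle: %g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}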
39
Examples of Using SSE and SSE2
SSE (as on the previous slide), with xmm1 = (X3, X2, X1, X0) and xmm2 = (Y3, Y2, Y1, Y0):
  Packed SP FP operation (e.g. ADDPS xmm1, xmm2): xmm1 = (X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0)
  Scalar SP FP operation (e.g. ADDSS xmm1, xmm2): xmm1 = (X3, X2, X1, X0 op Y0)
  Shuffle FP operation with an 8-bit immediate (e.g. SHUFPS xmm1, xmm2, imm8): xmm1 = (Y3, Y3, X0, X1) for imm8 = 0xf1
SSE2, with xmm1 = (X1, X0) and xmm2 = (Y1, Y0):
  Packed DP FP operation (e.g. ADDPD xmm1, xmm2): xmm1 = (X1 op Y1, X0 op Y0)
  Scalar DP FP operation (e.g. ADDSD xmm1, xmm2): xmm1 = (X1, X0 op Y0)
  Shuffle DP operation with a 2-bit immediate (e.g. SHUFPD xmm1, xmm2, imm2): xmm1 = (Y1 or Y0, X1 or X0)
A companion SSE2 intrinsics example follows.
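A companion C sketch for the double-precision SSE2 forms (ADDPD, ADDSD, SHUFPD), assuming an SSE2-capable compiler; values are arbitrary.

/* Packed, scalar, and shuffle double-precision operations via intrinsics. */
#include <stdio.h>
#include <emmintrin.h>

int main(void)
{
    __m128d x = _mm_setr_pd(1.0, 2.0);     /* X0, X1 */
    __m128d y = _mm_setr_pd(10.0, 20.0);   /* Y0, Y1 */
    double out[2];

    _mm_storeu_pd(out, _mm_add_pd(x, y));            /* ADDPD: both lanes */
    printf("packed:  %g %g\n", out[0], out[1]);

    _mm_storeu_pd(out, _mm_add_sd(x, y));            /* ADDSD: lane 0 only */
    printf("scalar:  %g %g\n", out[0], out[1]);

    _mm_storeu_pd(out, _mm_shuffle_pd(x, y, 0x1));   /* SHUFPD, 2-bit imm */
    printf("shuffle: %g %g\n", out[0], out[1]);
    return 0;
}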
40
HyperThreading
• Intel Xeon Processor and Intel Xeon MP Processor
• Enable Simultaneous Multi-Threading (SMT)
– Exploit TLP (Thread-Level Parallelism) on top of ILP
– Issue and execute μops from multiple threads in the same cycle
• Single P4 w/ HT appears to be 2 logical processors
• Share the same execution resources
– dTLB shared with logical processor ID
– Some other shared resources are partitioned (next slide)
• Architectural states and some microarchitectural states are duplicated
– IPs, iTLB, streaming buffer
– Architectural register file
– Return stack buffer
– Branch history buffer
– Register Alias Table
41
Multithreading (MT) Paradigms
[Figure: execution-time versus functional-unit (FU1-FU4) diagrams for five threads (plus unused slots), contrasting a conventional single-threaded superscalar, fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), simultaneous multithreading (or Intel's HT), and a chip multiprocessor (CMP), i.e. today's multi-core processors.]
42
HyperThreading Resource Partitioning
• The TC (or UROM) is accessed on alternate cycles for each logical processor, unless one is stalled due to a TC miss
• μop queue (split in half) after fetch from the TC
• ROB (126/2)
• LB (48/2)
• SB (24/2) (32/2 for Prescott)
• General μop queue and memory μop queue (1/2)
• TLB (1/2?) as there is no PID
• Retirement: alternating between the 2 logical processors
43