Computer Structure
Advanced Topics
Lihu Rappoport and Adi Yoaz
1
Computer Structure 2014 – Advanced Topics
Intel® Core™ μArch
2
Computer Structure 2014 – Advanced Topics
Tick/Tock Development Model
Process technology and microarchitecture cadence:
– 45nm: Nehalem – TOCK – 1st Generation Intel® Core™, new Intel® μarch (Nehalem)
– 32nm: Westmere – TICK – 1st Generation Intel® Core™, Intel® μarch (Nehalem)
– 32nm: Sandy Bridge – TOCK – 2nd Generation Intel® Core™, new Intel® μarch (Sandy Bridge)
– 22nm: Ivy Bridge – TICK – 3rd Generation Intel® Core™, Intel® μarch (Sandy Bridge)
– 22nm: Haswell – TOCK – 4th Generation Intel® Core™, new Intel® μarch (Haswell)
Haswell CPU
22nm Process Technology
Haswell builds upon innovations in the 2nd and 3rd Generation Intel® Core™ i3/i5/i7 Processors (Sandy Bridge and Ivy Bridge)
Foil taken from IDF 2012
Computer Structure 2014 – Advanced Topics
5th Generation Intel Core™ Processor
14nm Process Technology
Broadwell micro-architecture
4
Computer Structure 2014 – Advanced Topics
Haswell Core at a Glance
Next generation branch prediction
• Improves performance and saves wasted work
Improved front-end
• Initiate TLB and cache misses speculatively
• Handle cache misses in parallel to hide latency
• Leverages improved branch prediction
Deeper buffers
• Extract more instruction parallelism
• More resources when running a single thread
More execution units, shorter latencies
• Power down when not in use
More load/store bandwidth
• Better prefetching, better cache line split latency & throughput, double L2 bandwidth
• New modes save power without losing performance
No pipeline growth
• Same branch misprediction latency
• Same L1/L2 cache latency
[Pipeline diagram: Branch Prediction, ITLB, I-cache (Tag/Data), µop-Cache (Tag/Data), Decode, µop Queue, µop Allocation, Out-of-Order Execution, stages 0–7]
Foil taken from IDF 2012
Computer Structure 2014 – Advanced Topics
Branch Prediction Unit
Predict branch targets
– Direct Calls and Jumps – target provided by a Target Array
– Indirect Calls and Jumps – predicted either as having a fixed target or as having targets that vary based on execution path
– Returns – predicted by a 16-entry Return Stack Buffer (RSB)
For conditional branches
– Predict if taken or not
The BPU makes predictions for 32 bytes at a time
– Twice the width of the fetch engine
– Enables taken branches to be predicted with no penalty
[Pipeline diagram as on the previous foil]
From the Optimization Manual / IDF 2012
Computer Structure 2014 – Advanced Topics
Instruction TLB
Initiate TLB and cache misses speculatively
– Handle cache misses in parallel to hide latency
– Leverages improved branch prediction
[Pipeline diagram as on the previous foils]
From the Optimization Manual / IDF 2012
Computer Structure 2014 – Advanced Topics
Instruction Fetch and Pre-decode
[Front-end diagram: 32KB L1 I-Cache → Pre-decode → Instruction Queue → 4 Decoders (+ MSROM) → μop Queue]
• 32KB 8-way I-Cache
– Fetches aligned 16 bytes per cycle
  Typical programs average ~4 bytes per instruction
– A misaligned target into the line or a taken branch out of the line reduces the effective number of instruction bytes fetched
  In typical integer code there is a taken branch every ~10 instructions, which translates into a partial fetch every 3–4 cycles
• The Pre-Decode Unit
– Determines the length of each instruction
– Decodes all prefixes associated with instructions
– Marks various properties of instructions for the decoders, for example, "is branch"
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Instruction Pre-decode and IQ
[Front-end diagram as on the previous foil]
The pre-decode unit writes 6 instructions/cycle into the IQ
– If a fetch line contains >6 instructions, it continues to pre-decode 6 instructions/cycle until all instructions in the fetch line are written into the IQ
  The subsequent fetch line starts pre-decode only after the current fetch completes
– A fetch line of 7 instructions takes 2 cycles to pre-decode
  Average of 3.5 inst/cycle (still higher than the IPC of most apps)
Length changing prefixes (LCP) change the instruction length
– Each LCP within the fetch line takes 3 cycles to pre-decode
The Instruction Queue is 18 instructions deep
9
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Instruction Decode and μop Queue
[Front-end diagram as on the previous foils]
• 4 Decoders, which decode instructions into μops
– 1st decoder can decode all instructions of up to 4 μops
– The other 3 decoders handle common single μop instructions
• Instructions with >4 μops generate μops from the MSROM
• Up to 4 μops/cycle are delivered into the μop Queue
– Buffers 56 μops (28 μops / thread when running 2 threads)
– Decouples the front end and the out-of-order engine
– Helps hide bubbles introduced between the various sources of μops in the front end
10
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Macro-Fusion
[Front-end diagram as on the previous foils]
The IQ sends up to 5 inst. / cycle to the decoders
Merge two adjacent instructions into a single μop
– A macro-fused instruction executes with a single dispatch
Reduces latency and frees execution resources
– Increased decode, rename and retire bandwidth
– Power savings from representing more work in fewer bits
The 1st instruction of the pair modifies the flags
– CMP, TEST, ADD, SUB, AND, INC, DEC
The 2nd instruction of the pair is a conditional branch
These pairs are common in many types of applications
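As an illustration (my own sketch, not from the foil; the function is hypothetical): compilers typically end loop bodies with exactly such a flag-setting instruction followed by a conditional jump, making the pair a macro-fusion candidate.

  #include <stddef.h>

  /* The loop-control compare and the element test each compile to a
     "CMP ... ; Jcc ..." pair, which the decoders may macro-fuse into one μop. */
  long sum_until_zero(const long *a, size_t n)
  {
      long s = 0;
      for (size_t i = 0; i != n; ++i) {   /* cmp i, n ; jne loop_top   */
          if (a[i] == 0)                  /* cmp a[i], 0 ; je loop_end */
              break;
          s += a[i];
      }
      return s;
  }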
11
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Stack Pointer Tracker
PUSH, POP, CALL, RET implicitly update ESP
– Add or sub an offset, which would require a dedicated μop
– The Stack Pointer Tracker performs implicit ESP updates
Example – the tracker keeps a delta Δ instead of issuing explicit ESP-update μops (Δ=0 initially):
  PUSH EAX:  ESP = ESP - 4 ; STORE [ESP], EAX   becomes   STORE [ESP-4], EAX   (Δ = Δ - 4 = -4)
  PUSH EBX:  ESP = ESP - 4 ; STORE [ESP], EBX   becomes   STORE [ESP-8], EBX   (Δ = Δ - 4 = -8)
  INC ESP:   ESP = ADD ESP, 1 – need to sync!   becomes   ESP = SUB ESP, 8 ; ESP = ADD ESP, 1   (Δ = 0)
Provides the following benefits
– Improves decode BW: PUSH, POP and RET become single μop instructions
– Conserves rename, execution and retire resources
– Removes dependencies on ESP – can execute stack operations in parallel
– Saves power
From the Optimization Manual and ITJ
Computer Structure 2014 – Advanced Topics
Micro-Fusion
Fuse multiple μops from same instruction into a
single μop
– Instructions which decode into a single micro-fused μop can be handled by all decoders
– Improves instruction bandwidth delivered from decode to
retirement and saves power
– A micro-fused μop is dispatched multiple times in the OOO
As it would if it were not micro-fused
Micro-fused instructions
– Stores are comprised of 2 μops: store-address and store-data
  Fused into a single μop
– Load + op instruction
  e.g., FADD DOUBLE PTR [RDI+RSI*8]
– Load + jump
  e.g., JMP [RDI+200]
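As a sketch (my own example, not from the foil; the function is hypothetical): a source-level add with a memory operand usually compiles to a single load+op instruction, which decodes into one micro-fused μop.

  /* Typically lowered to something like "addsd xmm0, [rdi+rsi*8]" –
     a load + add in one instruction, decoded as a single micro-fused μop. */
  double acc_step(const double *a, long i, double acc)
  {
      return acc + a[i];
  }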
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Decoded μop-Cache
[Front-end diagram: Branch Prediction Unit, 32KB L1 I-Cache → Pre-decode → Instruction Queue → 4 Decoders → μop Queue, with the μop-Cache feeding the μop Queue directly]
• Caches the μops coming out of the decoders
– Up to 1.5K μops (32 sets × 8 ways × 6 μops/way)
– Next time μops are taken from the μop Cache
– ~80% hit rate for most applications
– Included in the IC and iTLB, flushed on a context switch
• Higher Bandwidth and Lower Latency
– More cycles sustaining 4 instructions/cycle
– In each cycle, provides μops for instructions mapped to 32 bytes
– Able to 'stitch' across taken branches in the control flow
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Decoded μop-Cache
[Front-end diagram as on the previous foil: the legacy pipeline (I-Cache, pre-decode, IQ, decoders) sleeps while μops are delivered from the μop-Cache]
• Decoded μop Cache lets the normal front end sleep
– Decode one time instead of many times
• Branch misprediction penalty reduced
– The correct path is also the most efficient path
Save Power while Increasing Performance
15
Foil taken from IDF 2011
Computer Structure 2014 – Advanced Topics
Loop Stream Detector (LSD)
[Front-end diagram as on the previous foils, with the LSD at the μop Queue; the I-Cache, pre-decode, decoders and μop-Cache sleep while the loop streams]
LSD detects small loops that fit in the μop queue
– The μop queue streams the loop, allowing the front-end to sleep
– Until a branch mis-prediction inevitably ends it
Loops qualify for LSD replay if all following conditions are met
– Up to 28 μops, with 8 taken branches, 8 32-byte chunks
– All μops are also resident in the μop-Cache
– No CALL or RET
– No mismatched stack operations (e.g., more PUSH than POP)
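As an illustration (my own sketch, not from the foil): a loop this small easily meets the conditions above, so once detected it can be replayed from the μop queue while the rest of the front end sleeps.

  /* A tiny reduction loop: a handful of μops, one taken branch,
     no CALL/RET, no unbalanced PUSH/POP – an LSD replay candidate. */
  float sum(const float *x, int n)
  {
      float s = 0.0f;
      for (int i = 0; i < n; ++i)
          s += x[i];
      return s;
  }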
16
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
The Renamer
[OOO cluster diagram: μop Queue → Allocate/Rename/Retire (in order) → Scheduler (out-of-order), with Load Buffers, Store Buffers and Reorder Buffers]
Moves ≤4 μops/cycle from the μop-queue to the OOO
– Renames architectural sources and destinations of the μops to
micro-architectural sources and destinations
– Allocates resources to the μops, e.g., load or store buffers
– Binds the μop to an appropriate dispatch port
– Up to 2 branches each cycle
Some μops are executed to completion during rename,
effectively costing no execution bandwidth
– A subset of register-to-register MOV and FXCHG
– Zero-Idioms
– NOP
17
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Dependency Breaking Idioms
Zero-Idiom – an instruction that zeroes a register
– Regardless of the input data, the output data is always 0
– E.g.: XOR REG, REG and SUB REG, REG
– No μop dependency on its sources
Zero-Idioms are detected and removed by the Renamer
– Do not consume execution resource, have zero exe latency
Zero-Idioms remove partial register dependencies
– Improve instruction parallelism
Ones-Idiom – an instruction that sets a register to “all 1s”
– Regardless of the input data the output is always "all 1s"
– E.g., CMPEQ XMM1, XMM1;
No μop dependency on its sources, as with the zero idiom
Can execute as soon as it finds a free execution port
– As opposed to Zero-Idiom, the Ones-Idiom μop must execute
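As an illustration (my own sketch, not from the foil): compilers normally zero a register with the XOR zero idiom rather than a MOV of an immediate, precisely because the renamer then eliminates it.

  /* Usually lowered to "xor eax, eax ; ret".  The XOR form is a zero idiom:
     it breaks any dependency on the register's previous value and is
     removed by the renamer, consuming no execution slot. */
  int zero(void)
  {
      return 0;
  }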
18
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Out-of-Order Cluster
[OOO cluster diagram: Allocate/Rename/Retire (in order) → Scheduler (out-of-order) → FP/INT Vector PRF and Int PRF, with Load Buffers, Store Buffers and Reorder Buffers]
• The Scheduler queues μops until all source operands are ready
– Schedules and dispatches ready μops to the available execution
units in as close to a first in first out (FIFO) order as possible
• Physical Register File (PRF)
– Instead of a centralized Retirement Register File
  A single copy of every value, with no data movement after calculation
– Allows a significant increase in buffer sizes
  Dataflow window ~33% larger
• Retirement
– Retires μops in order and handles faults and exceptions
19
Foil taken from IDF 2011
Computer Structure 2014 – Advanced Topics
Intel® Advanced Vector Extensions
Vectors are a natural data-type for many apps
– Extend the SSE FP instruction set to a 256-bit operand size
– Extend all 16 XMM registers to 256 bits
  XMM0 – 128 bits; YMM0 – 256 bits (AVX)
• New, non-destructive source syntax
– VADDPS ymm1, ymm2, ymm3
• New Operations to enhance vectorization
– Broadcasts, masked load & store
Wide vectors + non-destructive source: more work with fewer instructions
Extending the existing state is area and power efficient
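As a sketch (my own example, not from the foil; compile with AVX enabled, e.g. -mavx): the same non-destructive three-operand add written with compiler intrinsics.

  #include <immintrin.h>

  /* VADDPS ymm1, ymm2, ymm3 corresponds to _mm256_add_ps: the destination
     is separate from both sources, and one instruction processes 8 floats. */
  void add8(float *dst, const float *a, const float *b)
  {
      __m256 va = _mm256_loadu_ps(a);
      __m256 vb = _mm256_loadu_ps(b);
      _mm256_storeu_ps(dst, _mm256_add_ps(va, vb));
  }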
20
Foil taken from IDF 2011
Computer Structure 2014 – Advanced Topics
Execution Cluster
The scheduler sees a matrix:
• 3 "ports" (Port 0, Port 1, Port 5) to 3 "stacks" of execution units
  – General Purpose Integer (GPR)
  – SIMD (Vector) Integer
  – SIMD Floating Point
• Challenge: double the output of one of these stacks in a manner that is invisible to the others
[Port/stack diagram – units include ALU, DIV, Blend, JMP, VI ADD, VI MUL, VI Shuffle, FP MUL, FP ADD, FP Shuffle, FP Boolean]
Foil taken from IDF 2011
Computer Structure 2014 – Advanced Topics
Execution Cluster – 256 bit Vectors
Solution:
• Repurpose existing data paths to dual-use
• SIMD integer and legacy SIMD FP use the legacy 128-bit stack style
• Intel® AVX utilizes both 128-bit execution stacks
• Double FLOPs
  – 256-bit Multiply + 256-bit ADD + 256-bit Load per clock
[Port diagram: Port 0 – ALU, DIV, Blend, VI MUL, VI Shuffle, 256-bit FP Multiply; Port 1 – ALU, VI ADD, VI Shuffle, 256-bit FP Add; Port 5 – ALU, JMP, FP Shuffle, FP Boolean, Blend]
"Cool" Implementation of Intel AVX
256-bit Multiply + 256-bit ADD + 256-bit Load per clock…
Double your FLOPs with great energy efficiency
Foil taken from IDF 2011
Computer Structure 2014 – Advanced Topics
Haswell AVX2: FMA & Peak FLOPS
2 new FMA units provide 2× peak FLOPs/cycle
– Fused Multiply-Add

  Instruction Set              | SP/cyc | DP/cyc
  Nehalem      SSE (128-bit)   |   8    |   4
  SandyBridge  AVX (256-bit)   |  16    |   8
  Haswell      AVX2 & FMA      |  32    |  16

5-cycle FMA latency, same as an FP multiply

  Latency (clks)  | Prior Gen | Haswell | Ratio
  MulPS, PD       |     5     |    5    |
  AddPS, PD       |     3     |    3    |
  Mul+Add / FMA   |     8     |    5    |  1.6

2× cache bandwidth to feed the wide vector units
– 32-byte load/store for L1
– 2× L2 bandwidth
[Chart: peak cache bytes/clock vs. peak FLOPS/clock for Banias, Merom, Sandy Bridge and Haswell]
Foil taken from IDF 2012
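As a sketch (my own example, not from the foil; requires FMA support, e.g. compile with -mfma): one fused multiply-add per vector element via intrinsics.

  #include <immintrin.h>

  /* _mm256_fmadd_ps computes a*b + c in a single instruction (one of the
     VFMADD forms), giving the 5-cycle fused multiply-add described above;
     two such units per core double peak FLOPs. */
  void fma8(float *acc, const float *a, const float *b)
  {
      __m256 va = _mm256_loadu_ps(a);
      __m256 vb = _mm256_loadu_ps(b);
      __m256 vc = _mm256_loadu_ps(acc);
      _mm256_storeu_ps(acc, _mm256_fmadd_ps(va, vb, vc));
  }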
Computer Structure 2014 – Advanced Topics
Haswell Execution Units
Unified Reservation Station dispatching to 8 ports
[Port diagram: Port 0 – Integer ALU & Shift, FMA / FP Multiply, Vector Int Multiply, Vector Logicals, Branch, Divide, Vector Shifts; Port 1 – Integer ALU & LEA, FMA / FP Mult / FP Add, Vector Int ALU, Vector Logicals; Port 2 and Port 3 – Load & Store Address; Port 4 – Store Data; Port 5 – Integer ALU & LEA, Vector Shuffle, Vector Int ALU, Vector Logicals; Port 6 – Integer ALU & Shift, Branch; Port 7 – Store Address]
2×FMA
• Doubles peak FLOPs
• Two FP multiplies benefits legacy code
4th ALU
• Great for integer workloads
• Frees Port 0 & 1 for vector
New Branch Unit
• Reduces Port 0 conflicts
• 2nd EU for high-branch code
New AGU for Stores
• Leaves Port 2 & 3 open for Loads
Intel® Microarchitecture (Haswell)
24
Foil taken from IDF 2012
Computer Structure 2014 – Advanced Topics
Memory Cluster
[Memory cluster diagram: Load/Store Address and Store Data ports → Memory Control → 32KB L1 D$ (32×3 bytes/cycle) with Fill Buffers → 256KB ML$]
The L1 D$ handles 2× 256-bit loads + 1× 256-bit store per cycle
The ML$ can service one cache line (64 bytes) each cycle
72 load buffers keep load μops from allocation till retirement
– Re-dispatch blocked loads
42 store buffers keep store μops from allocation till the store
value is written to L1-D$
– or written to the line fill buffers – for non-temporal stores
25
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
L1 Data Cache
Handles 2× 256-bit loads + 1× 256-bit store per cycle
Maintains requests which cannot be completed
– Cache misses
– Unaligned access that splits across cache lines
– Data not ready to be forwarded from a preceding store
– Loads blocked due to cache line replacement
Handles up to 10 outstanding cache misses in the LFBs
– Continues to service incoming stores and loads
The L1 D$ is a write-back write-allocate cache
– Stores that hit in the L1-D$ do not update the L2/LLC/Mem
– Stores that miss the L1-D$ allocate a cache line
26
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Stores
Stores to memory are executed in two phases
At EXE
– Fill store buffers with linear + physical address and with data
– Once store address and data are known
Forward Store data to load operations that need it
After the store retires – completion phase
– First, the line must be in L1 D$, in E or M MESI state
Otherwise, fetch it using a Read for Ownership request: L1 D$ → L2$ → LLC → L2$ and L1 D$ in other cores → Memory
– Read the data from the store buffers and write it to L1 D$ in M state
Done at retirement to preserve the order of memory writes
– Release the store buffer entry taken by the store
27
Affects performance only if the store buffer becomes full (allocation stalls)
Loads needing the store data get it when the store is executed
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Store Forwarding
Forward data directly from a store to a load that needs it
– Instead of having the store write the data to the DCU
and then the load read the data from the DCU
Store to load forwarding conditions
– The store is the last store to that address, prior to the load
Requires addresses of all stores older than the load to be known
– The store contains all data being loaded
– The load is from a write-back memory type
– Neither the load nor the store are non-temporal accesses
– The load is not a 4- or 8-byte load that crosses an 8-byte boundary, relative to the preceding 16- or 32-byte store
– The load does not cross a 16-byte boundary of a 32-byte store
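As an illustration (my own sketch, not from the foil): a store immediately followed by a same-size load of the same address meets all of the conditions above, so the load can be satisfied from the store buffer.

  /* 'volatile' forces a real 8-byte store followed by a real 8-byte load of
     the same address; the load is fully contained in the store and is a
     store-to-load forwarding candidate rather than waiting for the L1 D$. */
  long long bump(volatile long long *x)
  {
      *x = *x + 1;
      return *x;
  }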
28
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Memory Disambiguation
A load may depend on a preceding store
– A load is blocked until all preceding store addresses are known
Predict which loads do not depend on any previous stores
– Forward data from a store or L1 D$ even when not all previous store
addresses are known
– The prediction is verified, and if a conflict is detected
The load and all succeeding instructions are re-fetched
– Always assumes dependency between loads and earlier stores that
have the same address bits 0:11
The following loads are not disambiguated
– Loads that cross the 16-byte boundary
– 32-byte Intel AVX loads that are not 32-byte aligned
29
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Data Prefetching
Two hardware prefetchers load data to the L1 D$
DCU prefetcher (the streaming prefetcher)
– Triggered by an ascending access to very recently loaded
data
– Fetches the next line, assuming a streaming load
Instruction pointer (IP)-based stride prefetcher
– Tracks individual load instructions, detecting a regular stride
  Prefetch address = current address + stride
  Detects strides of up to 2K bytes, both forward and backward
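As an illustration (my own sketch, not from the foil): a loop whose loads advance by a fixed stride is exactly the pattern the IP-based prefetcher tracks.

  /* Each iteration's load moves forward by a constant 256-byte stride
     (32 longs × 8 bytes), well within the 2KB stride-detection range. */
  long stride_sum(const long *a, long n)
  {
      long s = 0;
      for (long i = 0; i < n; i += 32)
          s += a[i];
      return s;
  }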
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Core Cache Size/Latency/Bandwidth
Metric                 | Nehalem                          | Sandy Bridge                     | Haswell
L1 Instruction Cache   | 32K, 4-way                       | 32K, 8-way                       | 32K, 8-way
L1 Data Cache          | 32K, 8-way                       | 32K, 8-way                       | 32K, 8-way
  Fastest load-to-use  | 4 cycles                         | 4 cycles                         | 4 cycles
  Load bandwidth       | 16 Bytes/cycle                   | 32 Bytes/cycle (banked)          | 64 Bytes/cycle
  Store bandwidth      | 16 Bytes/cycle                   | 16 Bytes/cycle                   | 32 Bytes/cycle
L2 Unified Cache       | 256K, 8-way                      | 256K, 8-way                      | 256K, 8-way
  Fastest load-to-use  | 10 cycles                        | 11 cycles                        | 11 cycles
  Bandwidth to L1      | 32 Bytes/cycle                   | 32 Bytes/cycle                   | 64 Bytes/cycle
L1 Instruction TLB     | 4K: 128, 4-way; 2M/4M: 7/thread  | 4K: 128, 4-way; 2M/4M: 8/thread  | 4K: 128, 4-way; 2M/4M: 8/thread
L1 Data TLB            | 4K: 64, 4-way; 2M/4M: 32, 4-way; 1G: fractured | 4K: 64, 4-way; 2M/4M: 32, 4-way; 1G: 4, 4-way | 4K: 64, 4-way; 2M/4M: 32, 4-way; 1G: 4, 4-way
L2 Unified TLB         | 4K: 512, 4-way                   | 4K: 512, 4-way                   | 4K+2M shared: 1024, 8-way
All caches use 64-byte lines
31
Foil taken from IDF 2012
Computer Structure 2014 – Advanced Topics
Buffer Sizes
Extract more parallelism in every generation
                       | Nehalem    | Sandy Bridge | Haswell
Out-of-order Window    | 128        | 168          | 192
In-flight Loads        | 48         | 64           | 72
In-flight Stores       | 32         | 36           | 42
Scheduler Entries      | 36         | 54           | 60
Integer Register File  | N/A        | 160          | 168
FP Register File       | N/A        | 144          | 168
Allocation Queue       | 28/thread  | 28/thread    | 56
32
Foil taken from IDF 2012
Computer Structure 2014 – Advanced Topics
Haswell Core μArch
[Full-core block diagram: Branch Prediction Unit; 32KB L1 I-Cache → Pre-decode → Instruction Queue → 4 Decoders, with μop Cache and LSD → μop Queue → Allocate/Rename/Retire (in order, with Load/Store/Reorder Buffers) → OOO Scheduler → Ports 0/1/5/6 (ALU, Shift, LEA, MUL, DIV, JMP, FMA, vector INT/shuffle units) and Ports 2/3/4/7 (Load, Store Address, Store Data) → Memory Control with Fill Buffers → 32KB L1 D$ → 256KB ML$]
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Hyper Threading Technology
34
Computer Structure 2014 – Advanced Topics
Thread-Level Parallelism
Multiprocessor systems have been used for many years
– There are known techniques to exploit multiprocessors
Software trends
– Applications consist of multiple threads or processes that can be
executed in parallel on multiple processors
Thread-level parallelism (TLP) – threads can be from
– the same application
– different applications running simultaneously
– operating system services
Increasing single thread performance becomes harder
– and is less and less power efficient
Chip Multi-Processing (CMP)
– Two (or more) processors are put on a single die
35
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Multi-Threading Schemes
Multi-threading: a single processor executes multiple threads
Time-slice multithreading
– The processor switches between software threads after a fixed period
– Can effectively minimize the effects of long latencies to memory
Switch-on-event multithreading
– Switch threads on long latency events such as cache misses
– Works well for server applications that have many cache misses
A deficiency of both time-slice MT and switch-on-event MT
– They do not cover for branch mis-predictions and long dependencies
Simultaneous multi-threading (SMT)
– Multiple threads execute on a single processor simultaneously w/o switching
– Makes the most effective use of processor resources
  Maximizes performance vs. transistor count and power
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Hyper-Threading Technology
Two logical processors in each physical processor
– Sharing most execution/cache resources using SMT
– Look like two processors to SW (OS and apps)
Each logical processor executes a software thread
– Threads execute simultaneously on one physical processor
Each LP maintains its own architectural state
– Complete set of architectural registers
general-purpose registers, control registers,
machine state registers, debug registers
– Instruction pointers
Each LP has its own interrupt controller
– Interrupts sent to a specific LP are handled only by it
37
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Two Important Goals
When one thread is stalled the other thread can continue to
make progress
– Independent progress ensured by either
Partitioning buffering queues and limiting the number of entries each
thread can use
Duplicating buffering queues
A single active thread running on a processor with HT runs at
the same speed as without HT
– Partitioned resources are recombined when only one thread is active
38
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Single-task And Multi-task Modes
MT-mode (Multi-task mode)
– Two active threads, with some resources partitioned as described earlier
ST-mode (Single-task mode)
– ST0 / ST1 – only thread 0 / 1 is active
– Resources that are partitioned in MT-mode are re-combined to give the
single active logical processor use of all of the resources
Moving the processor between modes:
[Mode-transition diagram: in MT mode, a thread executing HALT moves the processor to ST0 or ST1; in ST0/ST1, the remaining thread executing HALT moves it to Low Power; an interrupt to a halted thread moves it back toward ST/MT mode]
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Thread Optimization
The OS should implement two optimizations:
Use HALT if only one logical processor is active
– Allows the processor to transition to either the ST0 or ST1 mode
– Otherwise the OS would execute on the idle logical processor a sequence of
instructions that repeatedly checks for work to do
– This so-called “idle loop” can consume significant execution resources that
could otherwise be used by the other active logical processor
On a multi-processor system
– OS views logical processors similar to physical processors
But can still differentiate and prefer to schedule a thread on a new physical processor rather than on the 2nd logical processor of the same physical processor
– Schedule threads to logical processors on different physical processors
before scheduling multiple threads to the same physical processor
– Allows SW threads to use different physical resources when possible
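As an illustration (my own sketch, not from the deck; Linux-specific, requires _GNU_SOURCE): pinning software threads to chosen CPUs. Which CPU IDs correspond to distinct physical cores versus HT siblings is system-specific (e.g. check lscpu); the policy above prefers spreading threads across physical cores first.

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>

  /* Pin a thread to a single logical CPU chosen by the caller. */
  static int pin_to_cpu(pthread_t t, int cpu)
  {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(cpu, &set);
      return pthread_setaffinity_np(t, sizeof(set), &set);
  }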
40
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Physical Resource Sharing Schemes
Replicated Resources
– Each thread has its own resource
Partitioned Resources
– Resource is partitioned in MT mode, and combined in ST mode
Competitively-shared resource
– Both threads compete for the entire resource
HT unaware resources
– Both threads use the resource
Alternating Resources
– Alternate between active threads
– If one thread is idle, the active thread uses the resource continuously
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Front End Resource Sharing
Instruction Pointer is replicated
Branch prediction resources are either duplicated or shared
– E.g., the return stack buffer is duplicated
iTLB
– The large page iTLB is replicated
– The small page iTLB is partitioned
The I$ is Competitively-shared
The decoder logic is shared and its use is alternated
– If only one thread needs the decode logic,
it gets the full decode bandwidth
The μop-Cache is partitioned
The μop Queue is partitioned
56 μops in ST mode, 28 μops/thread in MT mode
42
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Back End Resource Sharing
Out-Of-Order Execution
– Register renaming tables are replicated
– The reservation stations are competitively-shared
– Execution units are HT unaware
– The re-order buffers are partitioned
– Retirement logic alternates between threads
Memory
– The Load buffers and store buffers are partitioned
– The DTLB and STLB are competitively-shared
– The cache hierarchy and fill buffers are competitively-shared
43
From the Optimization Manual
Computer Structure 2014 – Advanced Topics
Power Management
44
Computer Structure 2014 – Advanced Topics
Processor Power Components
The power consumed by a processor consists of
– Dynamic power: power for toggling transistors and lines from 0→1 or 1→0
  αCV²f : α – activity, C – capacitance, V – voltage, f – frequency
– Leakage power: leakage of transistors under voltage
function of: Z – total size of all transistors, V – voltage, t – temperature
Peak power must not exceed the thermal constraints
– Power generates heat
Heat must be dissipated to keep transistors within allowed temperature
– Peak power determines peak frequency (and thus peak performance)
– Also affects form factor, cooling solution cost, and acoustic noise
Average power
– Determines battery life (for mobile devices), electricity bill, air-condition bill
– Average power = Total Energy / Total time
Including low-activity and idle-time (~90% idle time for client)
45
Computer Structure 2014 – Advanced Topics
Performance per Watt
In small form-factor devices thermal budget limits performance
– Old target: get max performance
– New target: get max performance at a given power envelope
Performance per Watt
Increasing f also requires increasing V (~linearly)
– Dynamic Power = αCV²f ≈ Kf³ ⇒ X% more performance costs ~3X% more power
– A power efficient feature – better than 1:3 performance : power
Otherwise it is better to just increase frequency (and voltage)
Vmin is the minimal operation voltage
– Once at Vmin, reducing frequency no longer reduces voltage
– At this point a feature is power efficient only if it is 1:1 performance : power
Active energy efficiency tradeoff
– Energy_active = Power_active × Time_active ∝ Power_active / Perf_active
– Energy efficient feature: 1:1 performance : power
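Restating the rule of thumb above (my own arithmetic, not from the deck):

  P_{dyn} = \alpha C V^{2} f,\qquad V \approx k f
  \;\Rightarrow\; P_{dyn} \approx \alpha C k^{2} f^{3} = K f^{3}
  \;\Rightarrow\; \frac{\Delta P}{P} \approx 3\,\frac{\Delta f}{f}

so a feature that buys X% performance must cost less than ~3X% power to be worthwhile above Vmin.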
46
Computer Structure 2014 – Advanced Topics
Platform Power
Processor average power is <10% of the platform
[Mobile platform power breakdown (pie chart): Display (panel + inverter) 33%, CPU 10%, Power Supply 10%, MCH 9%, GFX 8%, HDD 8%, Misc. 8%, CLK 5%, ICH 3%, LAN 2%, Fan 2%, DVD 2%]
Computer Structure 2014 – Advanced Topics
Managing Power
Typical CPU usage varies over time
– Bursts of high utilization & long idle periods (~90% of time in client)
Optimize power and energy consumption
– High power when high performance is needed
– Low power at low activity or idle
Enhanced Intel SpeedStep® Technology
– Multi voltage/frequency operating points
– OS changes frequency to meet performance needs and minimize power
– Referred to as processor Performance states = P-States
OS notifies CPU when no tasks are ready for execution
– CPU enters sleep state, called C-state
– Using MWAIT instruction, with C-state level as an argument
– Tradeoff between power and latency
Deeper sleep ⇒ more power savings, but longer to wake
48
Computer Structure 2014 – Advanced Topics
P-states
Operation frequencies are called P-states = Performance states
– P0 is the highest frequency
– P1, P2, P3… are lower frequencies
– Pn is the min Vcc point = energy efficient point
DVFS = Dynamic Voltage and Frequency Scaling
– Power = CV²f ; f = KV ⇒ Power ~ f³
– Program execution time ~ 1/f
– E = P×t ⇒ E ~ f²
⇒ Pn is the most energy efficient point
Going up/down the cubic curve of power
– High cost to achieve the top frequencies, large power savings for some small frequency reduction
[Chart: power vs. frequency along the cubic curve, with operating points P0, P1, P2, …, Pn]
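A one-line restatement of the slide's E ~ f² claim (my own arithmetic, not from the deck):

  E = P \cdot t,\qquad P \propto f^{3},\qquad t \propto \frac{1}{f}
  \;\Longrightarrow\; E \propto f^{2}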
49
Computer Structure 2014 – Advanced Topics
C-States: C0
C0: CPU active state
[Bar chart – active core power components: local clocks and logic, clock distribution, leakage]
50
Computer Structure 2014 – Advanced Topics
C-States: C1
C0: CPU active state
C1: Halt state:
• Stop core pipeline
• Stop most core clocks
• No instructions are executed
• Caches respond to external snoops
[Bar chart: active core power reduced to clock distribution + leakage]
51
Computer Structure 2014 – Advanced Topics
C-States: C3
C0: CPU active state
C1: Halt state:
• Stop core pipeline
• Stop most core clocks
• No instructions are executed
• Caches respond to external snoops
C3 state:
• Stop remaining core clocks
• Flush internal core caches
[Bar chart: active core power reduced to leakage only]
52
Computer Structure 2014 – Advanced Topics
C-States: C6
C0: CPU active state
C1: Halt state:
• Stop core pipeline
• Stop most core clocks
• No instructions are executed
• Caches respond to external snoops
C3 state:
• Stop remaining core clocks
• Flush internal core caches
C6 state:
• Processor saves architectural state
• Turn off power gate, eliminating leakage
[Bar chart: core power goes to ~0]
53
Computer Structure 2014 – Advanced Topics
Putting it all together
CPU running at max power and frequency
Periodically enters C1
[Power (W) vs. time chart: the CPU runs in C0 at P0, with brief dips when it periodically enters C1]
54
Computer Structure 2014 – Advanced Topics
Putting it all together
Going into idle period
– Gradually enters deeper C states
– Controlled by OS
[Power (W) vs. time chart: from C0/P0 the CPU steps down through C1, C2, C3 and C4 as the idle period deepens]
55
Computer Structure 2014 – Advanced Topics
Putting it all together
Tracking CPU utilization history
– OS identifies low activity
– Switches CPU to lower P state
[Power (W) vs. time chart: active periods now run at C0/P1 instead of C0/P0; idle periods still step down through C1–C4]
56
Computer Structure 2014 – Advanced Topics
Putting it all together
CPU enters Idle state again
[Power (W) vs. time chart: another active burst at C0/P1 followed by stepping down through C1–C4 again]
57
Computer Structure 2014 – Advanced Topics
Putting it all together
Further lowering the P state
DVD play runs at lowest P state
[Power (W) vs. time chart: subsequent active periods run at C0/P2, the lowest P-state, interleaved with C1–C4 idle states]
58
Computer Structure 2014 – Advanced Topics
Voltage and Frequency Domains
Two independent variable power planes
– Shared frequency for all IA32 cores and the ring
  Embedded power gates for CPU cores, ring and LLC – each core can be turned off individually
  Cache power gating – turn off portions or all of the cache at deeper sleep states
– Independent frequency for the graphics processor (PG)
  Can be varied or turned off when not active
Fixed/programmable power plane for the System Agent (VCC SA)
– Optimize SA power consumption
– System On Chip functionality and PCU logic
– Periphery: DDR, PCIe, Display
[Die power-plane diagram: VCC Core (gated, one per core), VCC Graphics, VCC SA, VCC Periphery]
59
Computer Structure 2014 – Advanced Topics
Turbo Mode
P1 is the guaranteed frequency
– CPU and GFX simultaneous heavy load at worst-case conditions
– Actual power has a high dynamic range
P0 is the max possible frequency – the Turbo frequency
– P1 to P0 spans a significant frequency range (GHz)
  Single-threaded or lightly loaded applications; GFX <> CPU balancing
– OS treats P0 as any other P-state, requesting it when it needs more performance
– The P1 to P0 range is fully H/W controlled
  Frequency transitions handled completely in HW
  PCU keeps silicon within existing operating limits
– Systems designed to the same specs, with or without Turbo Mode
Pn is the energy efficient state
– Frequencies lower than Pn are controlled by the Thermal-State (T-state & throttle)
[Frequency ladder diagram: P0 (1-core "Turbo", H/W control) … P1 … Pn (OS-visible states, OS control) … LFM, T-state & throttle]
60
Computer Structure 2014 – Advanced Topics
Turbo Mode
Power Gating
– Zero power for inactive cores
[Bar chart: lightly threaded workload – the unused cores are power gated; compared with the "No Turbo" case where all four cores sit at the nominal frequency F]
Computer Structure 2014 – Advanced Topics
Turbo Mode
Power Gating
– Zero power for inactive cores
Turbo Mode
– Use the thermal budget of the inactive cores to increase the frequency of the active cores
[Bar chart: lightly threaded workload – the active cores run above the nominal frequency F; "No Turbo" shown for comparison]
Computer Structure 2014 – Advanced Topics
Turbo Mode
Power Gating
– Zero power for inactive cores
Turbo Mode
– Use the thermal budget of the inactive cores to increase the frequency of the active cores
[Bar chart: same scenario as the previous foil]
Computer Structure 2014 – Advanced Topics
Turbo Mode
Turbo Mode
– Increase frequency within the thermal headroom
[Bar chart: all four active cores run workloads below TDP, so all are raised above the nominal frequency F; "No Turbo" shown for comparison]
Computer Structure 2014 – Advanced Topics
Turbo Mode
Power Gating
– Zero power for inactive cores
Turbo Mode
– Increase frequency within the thermal headroom
[Bar chart: lightly threaded workload with active cores below TDP – the remaining active cores gain the largest frequency boost; "No Turbo" shown for comparison]
Computer Structure 2014 – Advanced Topics
Thermal Capacitance
Classic Model
– Steady-State Thermal Resistance
– Design guide for steady state
New Model
– Steady-State Thermal Resistance AND Dynamic Thermal Capacitance
– More realistic response to power changes
[Charts: temperature vs. time – classic model response vs. the new model's response]
Temperature rises as energy is delivered to the thermal solution
Thermal solution response is calculated in real time
66
Foil taken from IDF 2011
Computer Structure 2014 – Advanced Topics
Intel® Turbo Boost Technology 2.0
After idle periods, the system accumulates an "energy budget" and can accommodate high power/performance for a few seconds
– Build up thermal budget during idle periods (sleep or low power)
– Use the accumulated energy budget to enhance user experience (C0/P0 Turbo)
– In steady-state conditions the power stabilizes on TDP
[Chart: power vs. time – with Turbo Boost 2.0 the power briefly exceeds "TDP" after idle, then settles at TDP]
Foil taken from IDF 2011
Computer Structure 2014 – Advanced Topics
Core and Graphic Power Budgeting
• Cores and Graphics integrated on the same die with
separate voltage/frequency controls; tight HW control
• Full package power specifications available for sharing
• Power budget can shift between Cores and Graphics
[Chart: Core Power (W) vs. Graphics Power (W) – total package power for Sandy Bridge next-gen Turbo; sum of max power for short periods vs. realistic concurrent max power; core and graphics power specifications; application operating points range from heavy CPU workloads to heavy graphics workloads]
Foil taken from IDF 2011
Computer Structure 2014 – Advanced Topics