Computer Architecture
Advanced Topics
Computer Architecture 2008 – Advanced Topics

Pentium® M Processor
Intel’s 1st processor designed for mobility
– Achieve best performance at given power and thermal constraints
– Achieve longest battery life
              Banias              Dothan
Transistors   77M                 140M
Process       130nm               90nm
Die size      84 mm²              85 mm²
Peak power    24.5 W              21 W
Frequency     1.7 GHz             2.1 GHz
L1 cache      32KB I$ + 32KB D$   32KB I$ + 32KB D$
L2 cache      1MB                 2MB
Performance per Watt
Mobile’s smaller form-factor decreases power budget
– Power generates heat, which must be dissipated to keep transistors
within allowed temperature
– Limits the processor’s peak power consumption
Change the target
– Old target: get max performance
– New target: get max performance at a given power envelope
Performance per Watt
Performance via frequency increase
– Power = C·V²·f, but increasing f also requires increasing V
– X% more performance costs ~3X% more power (with V scaling alongside f, power grows roughly as f³)
Assume performance linear with frequency
A power-efficient feature – one with better than a 1:3 performance:power ratio
– Otherwise it is better to just increase frequency
– All Banias u-arch features (aimed at performance) are power efficient
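The 1:3 rule above can be sketched as a simple check (a minimal sketch; the function name and percentage interface are illustrative, not from the source):

```python
def is_power_efficient(perf_gain_pct, power_cost_pct):
    """A feature is power efficient if it beats the ~1:3 performance:power
    ratio of naive frequency scaling: X% performance via frequency costs
    about 3X% power, since raising f also requires raising V in P = C*V^2*f."""
    return power_cost_pct < 3 * perf_gain_pct

# A feature giving +10% performance for +20% power beats frequency scaling;
# +10% performance for +30% power does not (it only matches it).
```

By this yardstick, frequency scaling itself sits exactly on the 1:3 line, which is why any micro-architectural feature that does better is worth its transistors.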
Higher Performance vs.
Longer Battery Life
Processor average power is <10% of
the platform
– The processor reduces power in periods of
low processor activity
– The processor enters lower power states in
idle periods
Average power includes low-activity
periods and idle-time
– Typical: 1W – 3W
Max power limited by heat dissipation
– Typical: 20W – 100W
[Chart: average power breakdown of a mobile platform]
Display (panel + inverter) 33%, CPU 10%, Intel® MCH 10%, power supply 9%,
HDD 8%, GFX 8%, misc. 8%, CLK 5%, ICH 3%, Intel® LAN 2%, fan 2%, DVD 2%
Decision
– Optimize for performance when Active
– Optimize for battery life when idle
Static Power
The power consumed by a processor consists of
– Active power: used to switch transistors
– Static power: leakage of transistors under voltage
Static power is a function of
– Number of transistors and their type
– Operating voltage
– Die temperature
Leakage is growing dramatically in new process technologies
Pentium® M reduces static power consumption
– The L2 cache is built with low-leakage transistors (2/3 of the die transistors)
Low-leakage transistors are slower, increasing cache access latency
The significant power saved justifies the small performance loss
– Enhanced SpeedStep® technology
Reduces voltage and temperature during periods of low processor activity
Less is More
Fewer instructions per task
– Advanced branch prediction reduces #wrong instructions executed
– SSE instructions reduce the number of instructions architecturally
Fewer uops per instruction
– Uops fusion
– Dedicated stack engine
Fewer transistor switches per micro-op
– efficient bus
– various lower-level optimizations
Less energy per transistor switch
– Enhanced SpeedStep® technology
Power-awareness top to bottom
Improved Branch Predictor
Pentium® M employs best-in-class branch prediction
– Bimodal predictor, Global predictor, Loop detector
– Indirect branch predictor
Reduces number of wrong instructions executed
– Saves energy spent executing wrong instructions
Loop predictor – count/limit prediction
– Analyzes branches for loop behavior
Moving in one direction (taken or NT) a fixed number of times
Ending with a single movement in the opposite direction
– Detects the exact loop count: a per-branch counter is incremented and
compared against the learned limit
– The loop branch is predicted accurately, including its final iteration
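The count/limit mechanism above can be sketched as follows (a simplified model assuming taken-direction loops only; class and attribute names are illustrative):

```python
class LoopPredictor:
    """Minimal count/limit loop-detector sketch. It learns that a branch
    goes one direction `limit` times and then once the other way; after
    one training pass it predicts the flip exactly."""
    def __init__(self):
        self.limit = None   # learned trip count (None until trained)
        self.count = 0      # position within the current run of iterations

    def predict(self):
        if self.limit is None:
            return True               # untrained: assume taken (loop-back)
        return self.count < self.limit  # predict the flip on the last iteration

    def update(self, taken):
        if taken:
            self.count += 1
        else:
            if self.limit is None:
                self.limit = self.count  # learn the trip count at loop exit
            self.count = 0               # loop ended, start counting again
```

After seeing one full loop of, say, three taken iterations and one not-taken exit, the predictor reproduces the whole pattern, including the final not-taken that a bimodal predictor would miss.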
Indirect Branch Predictor
Indirect jumps are widely used in object-oriented code (C++, Java)
Targets are data dependent
– Resolved only at execution → high misprediction penalty
Initially, allocate indirect branch only in target array (TA)
– If TA mispredicts allocate in iTA according to global history
Multiple targets allocated for a given branch
– Indirects with a single target predicted by TA, saving iTA space
Use iTA if TA indicates indirect branch + iTA hits
[Figure: the branch IP indexes the Target Array (TA); on a TA hit that marks an
indirect branch, the branch IP combined with the global history indexes the
iTA; an iTA hit supplies the predicted target, otherwise the TA target is used]
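The TA/iTA selection logic can be sketched like this (a sketch only: plain dicts stand in for the hardware arrays, and the function name is illustrative):

```python
def predict_indirect(branch_ip, global_history, ta, ita):
    """TA/iTA selection sketch. ta maps branch_ip -> (is_indirect, target);
    ita is indexed by the branch IP together with the global history."""
    ta_entry = ta.get(branch_ip)
    if ta_entry is None:
        return None                       # TA miss: no prediction
    is_indirect, ta_target = ta_entry
    if is_indirect:
        ita_target = ita.get((branch_ip, global_history))
        if ita_target is not None:
            return ita_target             # iTA hit wins for indirect branches
    return ta_target                      # monomorphic indirects stay in the TA
```

Note how an indirect branch with a single target never allocates iTA entries: the TA alone predicts it, which is the space saving the slide mentions.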
Dedicated Stack Engine
PUSH, POP, CALL, RET update ESP (add or sub an offset)
– Use a dedicated add uop
Track the ESP offset at the front-end
– ID maintains offset in ESP_delta (+/- Osize)
– Eliminates need for uops updating ESP
– Patch displacements of stack operations
In some cases, ESP actual value is needed
– For example: add eax, esp, 3
– A sync uop is inserted before the instruction
if ESP_delta != 0
ESP = ESP + ESP_delta
– Reset ESP_delta
ESP_delta recovered on jump misprediction
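The ESP_delta tracking can be sketched as a tiny decoder model (a simplified sketch: 32-bit PUSH only, uops rendered as strings, and `INC esp` as the one example of an instruction that reads the architectural ESP):

```python
def decode_stack_ops(instructions, osize=4):
    """Front-end ESP_delta tracking sketch. PUSH's ESP update is folded
    into the tracked delta and the store displacement is patched; an
    instruction that uses ESP as a plain source forces a sync uop first."""
    uops, delta = [], 0
    for inst in instructions:
        if inst.startswith("PUSH "):
            reg = inst.split()[1]
            delta -= osize                        # eliminate ESP = ESP - 4
            uops.append(f"STORE [ESP{delta:+d}], {reg}")  # patched displacement
        elif inst == "INC esp":                   # needs the real ESP value
            if delta != 0:
                uops.append(f"ESP = ADD ESP, {delta}")    # sync uop
                delta = 0
            uops.append("ESP = ADD ESP, 1")
        else:
            uops.append(inst)                     # no ESP involvement
    return uops, delta
```

Running it on the slide's example sequence (PUSH eax; PUSH ebx; INC eax; INC esp) produces two displacement-patched stores with no ESP-update uops, then a sync uop only when INC esp actually needs the architectural ESP.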
ESP Tracking Example
            Conventional uops     With the stack engine     ESP_delta
(start)                                                     Δ = 0
PUSH eax    ESP = ESP - 4         (folded into Δ)           Δ = -4
            STORE [ESP], EAX      STORE [ESP-4], EAX
PUSH ebx    ESP = ESP - 4         (folded into Δ)           Δ = -8
            STORE [ESP], EBX      STORE [ESP-8], EBX
INC eax     EAX = ADD EAX, 1      EAX = ADD EAX, 1          Δ = -8
INC esp     ESP = ADD ESP, 1      ESP = SUB ESP, 8  (sync)  Δ = 0
                                  ESP = ADD ESP, 1          Δ = 0
Uop Fusion
The Instruction Decoder breaks an instruction into uops
– A conventional uop consists of a single operation operating on two sources
An instruction requires multiple uops when
– the instruction operates on more than two sources, or
– the nature of the operation requires a sequence of operations
Uop fusion: in some cases the decoder fuses 2 uops into one uop
– A short field added to the uop to support fusing of specific uop pairs
Uop fusion reduces the number of uops by 10%
– Increases performance by effectively widening the rename and retire bandwidth
– More instructions can be decoded by all the decoders
The same task is accomplished by processing fewer uops
– Decreases the energy required to complete a given task
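The accounting behind the ~10% figure can be illustrated with a toy uop counter (the instruction mix below is invented for the example; only load-op fusion is modeled):

```python
def uop_count(instructions, fusion=True):
    """Toy uop accounting: a load-op instruction is 2 uops unfused
    (LD + OP) but a single fused uop when fusion is enabled; every
    other instruction kind counts as 1 uop here for simplicity."""
    total = 0
    for kind in instructions:
        if kind == "load-op":
            total += 1 if fusion else 2
        else:
            total += 1
    return total

mix = ["alu", "load-op", "alu", "load-op", "store"]
# Fusion removes one uop per load-op: 7 uops become 5 for this mix.
```

The same instructions flow through rename, the schedulers, and retirement as fewer uops, which is where both the bandwidth gain and the energy saving come from.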
A 2-uop Load-Op
add eax,[ebp+4*esi+8]
Load-op with 3 register operands, decoded into 2 uops:
LD: tmp = load [ebp+4*esi+8]   (read data from memory)
OP: eax = eax + tmp            (reg ← reg op data)
The LD and OP are inherently serial – the scheduler dispatches the OP only
when the LD completes (the LD executes in the MEU, the OP in the ALU)
A 1-uop Load-Op
add eax,[ebp+4*esi+8]
Decoded into 1 fused uop: eax = eax + load[ebp+4*esi+8]
– The fused uop has a 3rd source – a new field in the uop holds the index register
– Increases decode bandwidth
– Increases allocation bandwidth and the effective ROB/RS size
– Dispatched twice: the OP is dispatched only after the LD completes
– The fused uop retires after both LD and OP complete – increases retire bandwidth
Enhanced SpeedStep® Technology
The “Basic” SpeedStep® Technology had
– 2 operating points
– Non-transparent switch
The “Enhanced” version provides
– Multi voltage/frequency operating points. The Pentium M processor 1.6GHz
operation ranges:
From 600MHz @ 0.956V
To 1.6GHz @ 1.484V
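Those two operating points can be checked against the dynamic power model from the Performance per Watt slide (a sketch under the simple P = C·V²·f model; measured power on the chart is slightly lower, 6.1X, since real power is not purely dynamic):

```python
def dynamic_power_ratio(v_hi, f_hi, v_lo, f_lo):
    """P = C * V^2 * f, so the ratio between two operating points is
    (V_hi/V_lo)^2 * (f_hi/f_lo); the capacitance C cancels out."""
    return (v_hi / v_lo) ** 2 * (f_hi / f_lo)

# Pentium M 1.6GHz operating range from the slide:
ratio_f = 1.6 / 0.6                                    # ~2.7X frequency span
ratio_p = dynamic_power_ratio(1.484, 1.6, 0.956, 0.6)  # ~6.4X dynamic power
```

Power shrinking ~6X while frequency shrinks only ~2.7X is exactly the efficiency headroom that makes running slower at low voltage worthwhile.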
[Chart: frequency (GHz) and typical power (Watts) vs. voltage (0.8V–1.6V);
over the operating range frequency spans 2.7X while power spans 6.1X –
an efficiency ratio of 2.3]
Benefits
– Higher power efficiency
2.7X lower frequency costs ~2X performance but yields >2X energy gain
– Transparent switch
– Frequent switches
– Outstanding battery life
– Excellent thermal mgmt.
Trace Cache
(Pentium® 4 Processor)
Trace Cache
Decoding several IA-32 inst/clock at high frequency is difficult
– Instructions have a variable length and have many different options
– Takes several pipe-stages
Adds to the branch mis-prediction penalty
Trace-cache: cache uops of previously decoded instructions
– Decoding is only needed for instructions that miss the TC
The TC is the primary (L1) instruction cache
– Holds 12K uops
– 8-way set associative with LRU replacement
The TC has its own branch predictor (Trace BTB)
– Predicts branches that hit in the TC
– Directs where instruction fetching needs to go next in the TC
Traces
An instruction cache’s fetch bandwidth is limited to a basic block
– Cannot provide instructions across a taken branch in the same cycle
[Figure: a cache line with a jump into the middle of the line and a jump out of
the line – the fetch slots before and after the jumps are wasted]
The TC builds traces: program-ordered sequences of uops
– Allows the target of a branch to be included in the same TC line as the branch
itself
Traces have variable length
– Broken into trace lines, six uops per trace line
– There can be many trace lines in a single trace
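The packing of a trace into trace lines is simple to sketch (a sketch only: uops are represented as strings, and the six-uop line size comes from the slide):

```python
def build_trace_lines(uops, line_size=6):
    """Pack a program-ordered uop trace into fixed-size trace lines.
    Taken branches do not break a line: the uop stream handed in is
    already in trace (program) order, targets included."""
    return [uops[i:i + line_size] for i in range(0, len(uops), line_size)]

trace = [f"uop{i}" for i in range(14)]   # a 14-uop trace
lines = build_trace_lines(trace)
# -> three trace lines holding 6, 6, and 2 uops
```

The interesting work in hardware is constructing `trace` itself across taken branches; once the trace exists, line packing is just this chunking.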
Hyper Threading Technology
(Pentium® 4 Processor )
Based on
“Hyper-Threading Technology Architecture and Micro-architecture”
Intel Technology Journal
Thread-Level Parallelism
Multiprocessor systems have been used for many years
– There are known techniques to exploit multiprocessors
Software trends
– Applications consist of multiple threads or processes that can be
executed in parallel on multiple processors
Thread-level parallelism (TLP) – threads can be from
– the same application
– different applications running simultaneously
– operating system services
Increasing single thread performance becomes harder
– and is less and less power efficient
Chip Multi-Processing (CMP)
– Two (or more) processors are put on a single die
Multi-Threading
Multi-threading: a single processor executes multiple threads
Time-slice multithreading
– The processor switches between software threads after a fixed period
– Can effectively minimize the effects of long latencies to memory
Switch-on-event multithreading
– Switch threads on long latency events such as cache misses
– Works well for server applications that have many cache misses
A deficiency of both time-slice MT and switch-on-event MT
– They do not cover for branch mis-predictions and long dependencies
Simultaneous multi-threading (SMT)
– Multiple threads execute on a single processor simultaneously w/o switching
– Makes the most effective use of processor resources
Maximizes performance vs. transistor count and power
Hyper-threading (HT) Technology
HT is SMT
– Makes a single processor appear as 2 logical processors = threads
Each thread keeps its own architectural state
– General-purpose registers
– Control and machine state registers
Each thread has its own interrupt controller
– Interrupts sent to a specific logical processor are handled only by it
OS views logical processors (threads) as physical processors
– Schedule threads to logical processors as in a multiprocessor system
From a micro-architecture perspective
– Threads share a single set of physical resources
caches, execution units, branch predictors, control logic, and buses
Two Important Goals
When one thread is stalled the other thread can continue to
make progress
– Independent progress ensured by either
Partitioning buffering queues and limiting the number of entries each
thread can use
Duplicating buffering queues
A single active thread running on a processor with HT runs at
the same speed as without HT
– Partitioned resources are recombined when only one thread is active
Front End
Each thread manages its own next-instruction-pointer
Threads arbitrate TC access every cycle (Ping-Pong)
– If both want to access the TC – access granted in alternating cycles
– If one thread is stalled, the other thread gets the full TC bandwidth
TC entries are tagged with thread-ID
– Dynamically allocated as needed
– Allows one logical processor to have more entries than the other
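The cycle-by-cycle ping-pong arbitration can be sketched as follows (a sketch only: thread ids, the parity rule for breaking ties, and the function name are illustrative):

```python
def arbitrate(cycle, wants_access):
    """Ping-pong TC arbitration sketch for two threads (ids 0 and 1).
    wants_access[t] is True if thread t wants the TC this cycle.
    If both want it, cycle parity alternates between them; if one is
    stalled, the other gets the full TC bandwidth."""
    if wants_access[0] and wants_access[1]:
        return cycle % 2              # grant access in alternating cycles
    if wants_access[0]:
        return 0
    if wants_access[1]:
        return 1
    return None                       # neither thread fetches this cycle
```

With both threads active each gets half the fetch bandwidth; the moment one stalls, the arbitration degenerates to single-thread behavior with no bandwidth lost.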
[Figure: front-end instruction flow on a TC hit vs. a TC miss]
Front End (cont.)
Branch prediction structures are either duplicated or shared
– The return stack buffer is duplicated
– Global history is tracked for each thread
– The large global history array is shared, with entries tagged with a logical
processor ID
Each thread has its own ITLB
Both threads share the same decoder logic
– if only one needs the decode logic, it gets the full decode bandwidth
– The state needed by the decoders is duplicated
Uop queue is hard partitioned
– Allows both logical processors to make independent forward progress
regardless of FE stalls (e.g., TC miss) or EXE stalls
Out-of-order Execution
ROB and MOB are hard partitioned
– Enforce fairness and prevent deadlocks
Allocator ping-pongs between the threads
– A thread is selected for allocation if
Its uop-queue is not empty
Its buffers (ROB, RS) are not full
It is the thread’s turn, or the other thread cannot be selected
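Those selection conditions can be sketched directly (a sketch only: the function and parameter names are illustrative, not from the source):

```python
def select_thread(turn, uopq_empty, buffers_full):
    """Allocator thread selection per the conditions above. turn is the
    thread whose turn it is (0 or 1); uopq_empty[t] and buffers_full[t]
    describe thread t's uop queue and its ROB/RS buffers."""
    def eligible(t):
        return not uopq_empty[t] and not buffers_full[t]
    if eligible(turn):
        return turn                   # the thread whose turn it is goes first
    other = 1 - turn
    if eligible(other):
        return other                  # the turn passes if its owner cannot go
    return None                       # neither thread can allocate this cycle
```

Letting the turn pass whenever its owner is blocked is what keeps one stalled thread from wasting allocation cycles the other thread could use.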
Out-of-order Execution (cont)
Registers renamed to a shared physical register pool
– Store results until retirement
After allocation and renaming uops are placed in one of 2 Qs
– Memory instruction queue and general instruction queue
The two queues are hard partitioned
– Uops are read from the queues and sent to the schedulers using ping-pong
The schedulers are oblivious to threads
– Schedule uops based on dependencies and exe. resources availability
Regardless of their thread
– Uops from the two threads can be dispatched in the same cycle
– To avoid deadlock and ensure fairness
Limit the number of active entries a thread can have in each
scheduler’s queue
Forwarding logic compares physical register numbers
– Forward results to other uops without thread knowledge
Out-of-order Execution (cont)
The memory subsystem is largely thread-oblivious
– The L1 Data Cache, L2 Cache, and L3 Cache are all thread oblivious
All use physical addresses
– DTLB is shared
Each DTLB entry includes a thread ID as part of the tag
Retirement ping-pongs between threads
– If one thread is not ready to retire uops all retirement bandwidth is
dedicated to the other thread
Single-task And Multi-task Modes
MT-mode (Multi-task mode)
– Two active threads, with some resources partitioned as described earlier
ST-mode (Single-task mode)
– There are two flavors of ST-mode
single-task thread 0 (ST0) – only thread 0 is active
single-task thread 1 (ST1) – only thread 1 is active
– Resources that were partitioned in MT-mode are re-combined to give the
single active logical processor use of all of the resources
Moving the processor between modes:
– In MT-mode, a thread executing HALT moves the processor to ST0 or ST1
(the mode of the thread that remains active)
– In ST-mode, the remaining active thread executing HALT moves the processor
to a low-power state
– An interrupt to a halted thread moves the processor back up
(Low Power → ST0/ST1, ST0/ST1 → MT)
Computer Architecture 2008 – Advanced Topics
Operating System And Applications
An HT processor appears to the OS and application SW as 2
processors
– The OS manages logical processors as it does physical processors
The OS should implement two optimizations:
Use HALT if only one logical processor is active
– Allows the processor to transition to either the ST0 or ST1 mode
– Otherwise the OS would execute on the idle logical processor a sequence of
instructions that repeatedly checks for work to do
– This so-called “idle loop” can consume significant execution resources that
could otherwise be used by the other active logical processor
On a multi-processor system,
– Schedule threads to logical processors on different physical processors
before scheduling multiple threads to the same physical processor
– Allows SW threads to use different physical resources when possible