ca-2011-lec14-by-Ohadx


Computer Architecture
Advanced Topics
Computer Architecture 2010 – Advanced Topics
Pentium® M Processor
From Pentium® M Processor
 Intel’s 1st processor designed for mobility
– Achieve best performance at given power and thermal constraints
– Achieve longest battery life
             Banias             Dothan             Sandy Bridge
Transistors  77M                140M               995M / 506M (55M / core)
Process      130nm              90nm               32nm
Die size     84 mm2             85 mm2             216 mm2 (4C+GT2), 131 mm2 (2C+GT1)
Peak power   24.5 W             21 W               17 – 90 W
Frequency    1.7 GHz            2.1 GHz            2.8 – 3.8 – 4.4 GHz
L1 cache     32KB I$ + 32KB D$  32KB I$ + 32KB D$  32KB I$ + 32KB D$
L2 cache     1MB                2MB                256KB (per core) + 3-8MB L3
src: http://www.anandtech.com
Example: Sandy Bridge
 Use Moore's Law and process improvements to:
– Improve power/performance
– Increase integration
– Reduce communication
– Reduce latencies
– (at a cost in complexity)
 More performance and efficiency via:
– SpeedStep
– Memory hierarchy
– Multi-core
– Multi-threading
– Out-of-order execution
– Predictors
– Multi-operand (vector) instructions
– Custom processing
src: http://www.anandtech.com
Performance per Watt
 Mobile’s smaller form-factor decreases power budget
– Power generates heat, which must be dissipated to keep transistors
within allowed temperature
– Limits the processor’s peak power consumption
 Change the target
– Old target: get max performance
– New target: get max performance at a given power envelope
Performance per Watt
 Performance via frequency increase
– Power = C·V²·f, but increasing f also requires increasing V
– X% more performance costs ~3X% more power
 Assumes performance is linear with frequency
 A power-efficient feature: better than a 1:3 performance-to-power ratio
– Otherwise it is better to just increase frequency
– All Banias u-arch features (aimed at performance) are power efficient
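The 1:3 performance-to-power rule above follows directly from the dynamic power equation. A minimal sketch, assuming voltage must scale roughly linearly with frequency (a common first-order approximation; the constants here are illustrative, not Banias data):

```python
def dynamic_power(freq_ghz, volts, capacitance=1.0):
    """First-order dynamic power model: P = C * V^2 * f."""
    return capacitance * volts ** 2 * freq_ghz

# Baseline operating point (illustrative numbers).
p0 = dynamic_power(1.0, 1.0)

# Raise frequency by 10%; assume voltage scales linearly with it.
p1 = dynamic_power(1.1, 1.1)

# Power grows roughly cubically with frequency:
# +10% performance costs about +33% power -- the 1:3 ratio.
print(round(p1 / p0, 3))  # → 1.331
```

Any micro-architectural feature whose performance gain costs less than 3x that gain in power beats this baseline, which is the bar the slide sets.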
Higher Performance vs.
Longer Battery Life
 Processor average power is <10% of
the platform
– The processor reduces power in periods of
low processor activity
– The processor enters lower power states in
idle periods
 Average power includes low-activity
periods and idle-time
– Typical: 1W – 3W
 Max power limited by heat dissipation
– Typical: 20W – 100W
[Figure: mobile platform average power breakdown – Display (panel + inverter) 33%, CPU 10%, Intel® MCH 10%, power supply 9%, HDD 8%, GFX 8%, Misc. 8%, CLK 5%, ICH 3%, LAN 2%, Fan 2%, DVD 2%]
 Decision
– Optimize for performance when Active
– Optimize for battery life when idle
src: http://www.anandtech.com
Higher Performance vs.
Longer Battery Life
 High dynamic range
– Long periods of idle with peaks of activity
– Minimize power when idle
– Adequate performance when active
– Quick transitions
 Max power limited by heat dissipation
– Typical: 3W (cell phone) – 6W (tablet), 15W (small PC), 60W (mainstream PC), 150W+ (desktop)
– How can one design fit all?
 Decision
– Optimize for user experience when Active (adequate
performance)
– Optimize for battery life when idle
src: http://www.anandtech.com
Static Power
 The power consumed by a processor consists of
– Active power: used to switch transistors
– Static power: leakage of transistors under voltage
 Static power is a function of
– Number of transistors and their type
– Operating voltage
– Die temperature
 Leakage is growing dramatically in new process technologies
 Pentium® M reduces static power consumption
– The L2 cache is built with low-leakage transistors (2/3 of the die transistors)
 Low-leakage transistors are slower, increasing cache access latency
 The significant power saved justifies the small performance loss
– Enhanced SpeedStep® technology
 Reduces voltage and temperature during low processor activity
Less is More
 Fewer instructions per task
– Advanced branch prediction reduces #wrong instructions executed
– SSE instructions reduce the number of instructions architecturally
 Fewer uops per instruction
– Uops fusion
– Dedicated stack engine
 Fewer transistor switches per micro-op
– efficient bus
– various lower-level optimizations
 Less energy per transistor switch
– Enhanced SpeedStep® technology
Power-awareness top to bottom
Improved Branch Predictor
 Pentium® M employs best-in-class branch prediction
– Bimodal predictor, Global predictor, Loop detector
– Indirect branch predictor
 Reduces number of wrong instructions executed
– Saves energy spent executing wrong instructions
 Loop predictor
– Analyzes branches for loop behavior
 Moving in one direction (taken or NT) a fixed number of times
 Ending with a single movement in the opposite direction
– Detects the exact loop count
– Loop predicted accurately
[Figure: per-branch counter incremented (+1) and compared (=) against a learned count limit to produce the prediction]
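The loop-detector behavior above can be sketched in a few lines. This is a hedged toy model, not the Pentium M hardware: class and field names are invented, and real loop detectors train over multiple passes with confidence counters.

```python
class LoopDetector:
    """Toy loop detector: learn the trip count, then predict it exactly."""

    def __init__(self):
        self.limit = None   # learned trip count (None until trained)
        self.count = 0      # taken streak observed during training
        self.position = 0   # progress through the loop when predicting

    def train(self, taken):
        """Observe one outcome of the branch while it trains."""
        if taken:
            self.count += 1
        else:
            self.limit = self.count  # loop ended: remember the exact count
            self.count = 0

    def predict(self):
        """Predict taken until the learned limit, then one not-taken."""
        if self.limit is None:
            return True  # default prediction before training completes
        self.position += 1
        if self.position > self.limit:
            self.position = 0
            return False  # the single movement in the opposite direction
        return True

det = LoopDetector()
for outcome in [True] * 4 + [False]:   # loop body runs 4 times, then exits
    det.train(outcome)
print(det.limit)                             # → 4
print([det.predict() for _ in range(5)])     # → [True, True, True, True, False]
```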
Indirect Branch Predictor
 Indirect jumps are widely used in object-oriented code (C++, Java)
 Targets are data dependent
– Resolved only at execution → high misprediction penalty
 Initially, allocate an indirect branch only in the target array (TA)
– If the TA mispredicts → allocate in the iTA, indexed by global history
 Multiple targets can be allocated for a given branch
– Indirect branches with a single target are predicted by the TA, saving iTA space
 Use the iTA if the TA indicates an indirect branch and the iTA hits
[Figure: the branch IP indexes the Target Array; the branch IP combined with global history indexes the iTA; on a TA hit that marks an indirect branch, an iTA hit selects the iTA target, otherwise the TA target is the predicted target]
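The TA/iTA interplay described above can be sketched as a two-level lookup. This is a hedged illustration with invented names and dictionary-based tables; real hardware uses set-associative arrays and a hashed history index.

```python
class IndirectPredictor:
    """Toy two-level indirect predictor: last-target TA plus history-indexed iTA."""

    def __init__(self):
        self.ta = {}    # branch IP -> last observed target
        self.ita = {}   # (branch IP, global history) -> target

    def predict(self, ip, history):
        if ip in self.ta and (ip, history) in self.ita:
            return self.ita[(ip, history)]   # TA says indirect + iTA hit
        return self.ta.get(ip)               # single-target case, or cold miss

    def update(self, ip, history, target):
        if ip in self.ta and self.ta[ip] != target:
            # The TA mispredicted: allocate a per-history entry in the iTA
            self.ita[(ip, history)] = target
        self.ta[ip] = target

p = IndirectPredictor()
p.update(0x40, history=0b01, target=0x100)  # first target seen: TA only
p.update(0x40, history=0b10, target=0x200)  # new target -> iTA entry allocated
p.update(0x40, history=0b01, target=0x100)  # and again for the other history
print(hex(p.predict(0x40, 0b10)))           # → 0x200 (iTA hit)
print(hex(p.predict(0x40, 0b01)))           # → 0x100 (iTA hit)
```

A branch that always jumps to one place never allocates an iTA entry, which is the space saving the slide mentions.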
Dedicated Stack Engine
 PUSH, POP, CALL, RET update ESP (add or sub an offset)
– Use a dedicated add uop
 Track the ESP offset at the front-end
– The instruction decoder (ID) maintains the offset in ESP_delta (+/- operand size)
– Eliminates need for uops updating ESP
– Patch displacements of stack operations
 In some cases, ESP actual value is needed
– For example: add eax, esp, 3
– A sync uop is inserted before the instruction
 if ESP_delta != 0: ESP = ESP + ESP_delta
– Reset ESP_delta
 ESP_delta recovered on jump misprediction
ESP Tracking Example
Initially Δ = 0.

Instruction   Without stack engine    With stack engine         ESP_delta
PUSH eax      ESP = ESP - 4           STORE [ESP-4], EAX        Δ = Δ - 4 → -4
              STORE [ESP], EAX
PUSH ebx      ESP = ESP - 4           STORE [ESP-8], EBX        Δ = Δ - 4 → -8
              STORE [ESP], EBX
INC eax       EAX = ADD EAX, 1        EAX = ADD EAX, 1          Δ = -8
INC esp       ESP = ADD ESP, 1        Sync: ESP = SUB ESP, 8    Δ = 0
                                      ESP = ADD ESP, 1          Δ = 0
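The example above can be reproduced with a small front-end model. This is a hedged sketch: the instruction patterns and uop strings are illustrative stand-ins, not actual Pentium M uop encodings.

```python
def decode(instructions):
    """Toy decoder: fold PUSH offsets into ESP_delta, sync only when needed."""
    uops, delta = [], 0
    for inst in instructions:
        if inst.startswith("PUSH"):
            delta -= 4  # track the offset instead of emitting an ESP-update uop
            uops.append(f"STORE [ESP{delta:+d}], {inst.split()[1]}")
        elif inst == "INC eax":
            uops.append("EAX = ADD EAX, 1")
        elif inst == "INC esp":  # needs ESP's real value: insert a sync uop
            if delta != 0:
                uops.append(f"SYNC: ESP = ESP {delta:+d}")
                delta = 0
            uops.append("ESP = ADD ESP, 1")
    return uops, delta

uops, delta = decode(["PUSH eax", "PUSH ebx", "INC eax", "INC esp"])
for u in uops:
    print(u)
# Note: no dedicated ESP-update uop was emitted for either PUSH.
```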
Uop Fusion
 The Instruction Decoder breaks an instruction into uops
– A conventional uop consists of a single operation operating on two sources
 An instruction requires multiple uops when
– the instruction operates on more than two sources, or
– the nature of the operation requires a sequence of operations
 Uop fusion: in some cases the decoder fuses 2 uops into one uop
– A short field added to the uop to support fusing of specific uop pairs
 Uop fusion reduces the number of uops by 10%
– Increases performance by effectively widening rename and retire bandwidth
– More instructions can be decoded by all decoders
 The same task is accomplished by processing fewer uops
– Decreases the energy required to complete a given task
A 2-uop Load-Op
add eax, [ebp+4*esi+8]
 A load-op with 3 register operands
 Decoded into 2 uops
– LD: tmp = load [ebp+4*esi+8]  (read data from memory)
– OP: eax = eax + tmp  (reg ← reg op data)
 The LD and OP are inherently serial
– The scheduler dispatches the OP to the ALU only after the LD completes in the MEU
A 1-uop Load-Op
add eax, [ebp+4*esi+8]
 Decoded into 1 fused uop: eax = eax + load [ebp+4*esi+8]
– The fused uop has a 3rd source – a new field in the uop holds the index register
– Increases decode bandwidth
 Held as a single entry in the scheduler
– Increases allocation bandwidth and effective ROB/RS size
 Dispatched twice: the OP is dispatched only after the LD completes
 The fused uop retires only after both the LD and the OP complete
– Increases retire bandwidth
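The bookkeeping benefit of fusion can be illustrated with a toy uop count. This is a hedged sketch with made-up instruction categories; the real decoder's uop breakdown is more varied, and the ~10% figure comes from the slide, not from this model.

```python
def uop_count(stream, fusion):
    """Count uops for a stream of instruction kinds, with/without fusion."""
    total = 0
    for kind in stream:
        if kind == "load-op":
            total += 1 if fusion else 2  # fused: one entry, dispatched twice
        else:
            total += 1
    return total

# Illustrative 10-instruction mix with one load-op:
stream = ["alu"] * 8 + ["load-op"] + ["store"]
print(uop_count(stream, fusion=False))  # → 11
print(uop_count(stream, fusion=True))   # → 10  (~10% fewer uops)
```

Fewer uops for the same task means wider effective allocation/retire bandwidth and less energy per task, which is the "less is more" theme of this section.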
Enhanced SpeedStep™ Technology
 The “Basic” SpeedStep™ Technology had
– 2 operating points
– Non-transparent switch
 The “Enhanced” version provides
– Multiple voltage/frequency operating points. The Pentium M processor 1.6GHz operating range:
 From 600MHz @ 0.956V
 To 1.6GHz @ 1.484V
– Transparent switch
– Frequent switches
 Benefits
– Higher power efficiency: 2.7X lower frequency → 2X performance loss → >2X energy gain
– Outstanding battery life
– Excellent thermal management
[Figure: frequency (GHz, 0–4.0) and typical power (Watts, 0–20) vs. voltage (0.8V–1.6V) – the 2.7X frequency range spans a 6.1X power range; efficiency ratio = 2.3]
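The ">2X energy gain" claim can be checked with the same first-order model used earlier. A hedged sketch, assuming voltage scales roughly linearly with frequency and runtime scales as 1/f (real performance loss is smaller than linear, so the real gain is better than this bound):

```python
def energy_per_task(freq_ratio):
    """Relative energy per task: power ~ f^3, runtime ~ 1/f => energy ~ f^2."""
    power = freq_ratio ** 3
    runtime = 1.0 / freq_ratio
    return power * runtime

# Drop frequency 2.7X (the 1.6GHz -> 600MHz Pentium M range above):
print(round(energy_per_task(1 / 2.7), 3))  # → 0.137
# Energy per task falls ~7X in this model even though the task takes longer,
# comfortably consistent with ">2X energy gain" for "2X performance loss".
```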
Trace Cache
(Pentium® 4 Processor)
Trace Cache
 Decoding several IA-32 inst/clock at high frequency is difficult
– Instructions have a variable length and have many different options
– Takes several pipe stages
 Adds to the branch misprediction penalty
 Trace-cache: cache uops of previously decoded instructions
– Decoding is only needed for instructions that miss the TC
 The TC is the primary (L1) instruction cache
– Holds 12K uops
– 8-way set associative with LRU replacement
 The TC has its own branch predictor (Trace BTB)
– Predicts branches that hit in the TC
– Directs where instruction fetching needs to go next in the TC
Traces
 Instruction cache fetch bandwidth is limited to a basic block
– Cannot provide instructions across a taken branch in the same cycle
[Figure: a cache line with a jump into the line and a jump out of the line at a jmp]
 The TC builds traces: program-ordered sequences of uops
– Allows the target of a branch to be included in the same TC line as the branch
itself
 Traces have variable length
– Broken into trace lines, six uops per trace line
– There can be many trace lines in a single trace
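The packing of a trace into fixed-size trace lines can be sketched as below. This is a hedged illustration: it only models the line-splitting step, not the path-following and branch-prediction logic that decides which uops enter the trace.

```python
LINE_SIZE = 6  # six uops per trace line, per the slide

def build_trace(uops):
    """Split a program-ordered uop sequence into 6-uop trace lines."""
    return [uops[i:i + LINE_SIZE] for i in range(0, len(uops), LINE_SIZE)]

# A 14-uop trace, built by following the predicted path across taken branches:
trace = build_trace([f"uop{i}" for i in range(14)])
print(len(trace))       # → 3 trace lines
print(len(trace[0]))    # → 6
print(len(trace[-1]))   # → 2 (the last line is partially filled)
```

Because uops are stored in predicted program order, the target of a taken branch can sit in the same line as the branch itself, which is what lifts the basic-block fetch limit.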
Hyper Threading Technology
(Pentium® 4 Processor)
Based on
Hyper-Threading Technology Architecture and Micro-architecture
Intel Technology Journal
Thread-Level Parallelism
 Multiprocessor systems have been used for many years
– There are known techniques to exploit multiprocessors
 Software trends
– Applications consist of multiple threads or processes that can be
executed in parallel on multiple processors
 Thread-level parallelism (TLP) – threads can be from
– the same application
– different applications running simultaneously
– operating system services
 Increasing single thread performance becomes harder
– and is less and less power efficient
 Chip Multi-Processing (CMP)
– Two (or more) processors are put on a single die
Multi-Threading
 Multi-threading: a single processor executes multiple threads
 Time-slice multithreading
– The processor switches between software threads after a fixed period
– Can effectively minimize the effects of long latencies to memory
 Switch-on-event multithreading
– Switch threads on long latency events such as cache misses
– Works well for server applications that have many cache misses
 A deficiency of both time-slice MT and switch-on-event MT
– They do not cover for branch mis-predictions and long dependencies
 Simultaneous multi-threading (SMT)
– Multiple threads execute on a single processor simultaneously w/o switching
– Makes the most effective use of processor resources
 Maximizes performance vs. transistor count and power
Hyper-threading (HT) Technology
 HT is SMT
– Makes a single processor appear as 2 logical processors = threads
 Each thread keeps its own architectural state
– General-purpose registers
– Control and machine state registers
 Each thread has its own interrupt controller
– Interrupts sent to a specific logical processor are handled only by it
 OS views logical processors (threads) as physical processors
– Schedule threads to logical processors as in a multiprocessor system
 From a micro-architecture perspective
– Threads share a single set of physical resources
 Caches, execution units, branch predictors, control logic, and buses
Two Important Goals
 When one thread is stalled the other thread can continue to
make progress
– Independent progress ensured by either
 Partitioning buffering queues and limiting the number of entries each thread can use
 Duplicating buffering queues
 A single active thread running on a processor with HT runs at
the same speed as without HT
– Partitioned resources are recombined when only one thread is active
Front End
 Each thread manages its own next-instruction-pointer
 Threads arbitrate TC access every cycle (Ping-Pong)
– If both want to access the TC – access granted in alternating cycles
– If one thread is stalled, the other thread gets the full TC bandwidth
 TC entries are tagged with thread-ID
– Dynamically allocated as needed
– Allows one logical processor to have more entries than the other
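The per-cycle ping-pong arbitration described above can be sketched as a tiny arbiter. A hedged model: function names and the request encoding are invented, and real hardware folds in stall and priority conditions this omits.

```python
def arbitrate(requests, start=0):
    """requests: per-cycle (thread0_wants, thread1_wants) pairs.

    Returns the winning thread id per cycle (None if neither fetches).
    """
    last, winners = 1 - start, []
    for t0, t1 in requests:
        if t0 and t1:
            last = 1 - last                  # ping-pong when both want the TC
            winners.append(last)
        elif t0 or t1:
            winners.append(0 if t0 else 1)   # a lone thread gets full bandwidth
        else:
            winners.append(None)             # no fetch this cycle
    return winners

# Both threads active for 4 cycles, then thread 1 stalls:
print(arbitrate([(1, 1)] * 4 + [(1, 0)] * 2))
# → [0, 1, 0, 1, 0, 0]
```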
Front End (cont.)
 Branch prediction structures are either duplicated or shared
– The return stack buffer is duplicated
– Global history is tracked for each thread
– The large global history array is shared
 Entries are tagged with a logical-processor ID
 Each thread has its own ITLB
 Both threads share the same decoder logic
– if only one needs the decode logic, it gets the full decode bandwidth
– The state needed by the decoder is duplicated
 Uop queue is hard partitioned
– Allows both logical processors to make independent forward progress
regardless of FE stalls (e.g., TC miss) or EXE stalls
Out-of-order Execution
 ROB and MOB are hard partitioned
– Enforce fairness and prevent deadlocks
 The allocator ping-pongs between the threads
– A thread is selected for allocation if
 Its uop queue is not empty
 Its buffers (ROB, RS) are not full
 It is the thread's turn, or the other thread cannot be selected
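The selection rule above can be written down directly. A hedged sketch with invented names; the real allocator evaluates these conditions in hardware every cycle, per resource.

```python
def select(turn, threads):
    """Pick a thread for allocation this cycle.

    turn: the thread whose turn it is (0 or 1).
    threads: dict id -> {'uops': queued uop count, 'full': buffers full?}.
    """
    def eligible(tid):
        t = threads[tid]
        return t["uops"] > 0 and not t["full"]  # queue non-empty, buffers free

    other = 1 - turn
    if eligible(turn):
        return turn    # it is this thread's turn and it can allocate
    if eligible(other):
        return other   # the preferred thread cannot be selected
    return None        # neither thread can allocate this cycle

state = {0: {"uops": 3, "full": False}, 1: {"uops": 0, "full": False}}
print(select(turn=1, threads=state))  # → 0 (thread 1's uop queue is empty)
state[1]["uops"] = 2
print(select(turn=1, threads=state))  # → 1
```

Skipping an ineligible thread rather than stalling is what lets one thread keep making progress while the other is blocked.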
Out-of-order Execution (cont)
 Registers renamed to a shared physical register pool
– Store results until retirement
 After allocation and renaming, uops are placed in one of 2 queues
– A memory instruction queue and a general instruction queue
 The two queues are hard partitioned
– Uops are read from the queues and sent to the schedulers using ping-pong
 The schedulers are oblivious to threads
– Schedule uops based on dependencies and execution-resource availability
 Regardless of their thread
– Uops from the two threads can be dispatched in the same cycle
– To avoid deadlock and ensure fairness
 Limit the number of active entries a thread can have in each scheduler's queue
 Forwarding logic compares physical register numbers
– Forward results to other uops without thread knowledge
Out-of-order Execution (cont)
 The memory subsystem is largely thread-oblivious
– The L1 data cache, L2 cache, and L3 cache are thread-oblivious
 All use physical addresses
– The DTLB is shared
 Each DTLB entry includes a thread ID as part of the tag
 Retirement ping-pongs between threads
– If one thread is not ready to retire uops all retirement bandwidth is
dedicated to the other thread
Single-task And Multi-task Modes
 MT-mode (Multi-task mode)
– Two active threads, with some resources partitioned as described earlier
 ST-mode (Single-task mode)
– There are two flavors of ST-mode
 single-task thread 0 (ST0) – only thread 0 is active
 single-task thread 1 (ST1) – only thread 1 is active
– Resources that were partitioned in MT-mode are re-combined to give the single active logical processor use of all of the resources
 Moving the processor between modes
[State diagram: from MT, thread 1 executing HALT moves to ST0 and thread 0 executing HALT moves to ST1; from ST0/ST1, the remaining thread executing HALT moves to Low Power; an interrupt wakes the processor back out of Low Power]
Operating System And Applications
 An HT processor appears to the OS and application SW as 2
processors
– The OS manages logical processors as it does physical processors
The OS should implement two optimizations:
 Use HALT if only one logical processor is active
– Allows the processor to transition to either the ST0 or ST1 mode
– Otherwise the OS would execute on the idle logical processor a sequence of
instructions that repeatedly checks for work to do
– This so-called “idle loop” can consume significant execution resources that
could otherwise be used by the other active logical processor
 On a multi-processor system,
– Schedule threads to logical processors on different physical processors
before scheduling multiple threads to the same physical processor
– Allows SW threads to use different physical resources when possible