COMPUTER ARCHITECTURE
CS 6354
Multi-Cores
Samira Khan
University of Virginia
Feb 18, 2016
The content and concept of this course are adapted from CMU ECE 740
AGENDA
• Logistics
• Review from last lecture
• Issues in out of order execution
• Latency Tolerance in OoO
• Multi-Core
2
LOGISTICS
• Course feedback form: Feb 18
• Project proposal: Feb 18, will have 4 late days
• Use the LaTeX template
• Follow the format of the sample proposals
• Each group will prepare one document, but upload the file
individually via Collab
3
REVIEW: MEMORY DEPENDENCE
HANDLING
• When do you schedule a load instruction in an OOO
engine?
– Problem: A younger load can have its address ready before an
older store’s address is known
– Known as the memory disambiguation problem or the
unknown address problem
• Approaches
– Conservative: Stall the load until all previous stores have
computed their addresses (or even retired from the machine)
– Aggressive: Assume load is independent of unknown-address
stores and schedule the load right away
– Intelligent: Predict (with a more sophisticated predictor) if the
load is dependent on the/any unknown-address store
4
REVIEW: MEMORY DISAMBIGUATION
• Option 1: Assume load dependent on all previous stores
+ No need for recovery
-- Too conservative: delays independent loads unnecessarily
• Option 2: Assume load independent of all previous stores
+ Simple and can be common case: no delay for independent loads
-- Requires recovery and re-execution of load and dependents on misprediction
• Option 3: Predict the dependence of a load on an outstanding
store
+ More accurate: load-store dependencies tend to persist over time
-- Still requires recovery/re-execution on misprediction
– Alpha 21264: Initially assume load independent, delay loads found to be
dependent
– Moshovos et al., “Dynamic speculation and synchronization of data
dependences,” ISCA 1997.
– Chrysos and Emer, “Memory Dependence Prediction Using Store Sets,” ISCA
1998.
5
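To make Option 3 concrete, here is a minimal store-set-style dependence predictor in the spirit of Chrysos and Emer, written as a Python sketch. It is not from the slides or the paper: the single-table organization, table size, and PC hashing are simplifying assumptions (the real scheme uses an SSIT plus a last-fetched-store table). Loads and stores that have conflicted in the past are assigned the same store-set ID, and a load waits only for unresolved older stores in its set.

# Illustrative store-set-style memory dependence predictor (simplified sketch).
# Loads/stores that conflicted before share a "store set"; a load is delayed
# only behind unresolved older stores that belong to its set.
class StoreSetPredictor:
    def __init__(self, num_entries=4096):
        self.num_entries = num_entries
        self.ssit = [None] * num_entries   # store-set ID table, indexed by PC hash
        self.next_id = 0

    def _index(self, pc):
        return pc % self.num_entries

    def should_wait(self, load_pc, unresolved_store_pcs):
        """True if the load should wait for some unresolved older store."""
        load_set = self.ssit[self._index(load_pc)]
        if load_set is None:
            return False                   # no conflict history: speculate independent
        return any(self.ssit[self._index(s)] == load_set
                   for s in unresolved_store_pcs)

    def train_on_violation(self, load_pc, store_pc):
        """Called when a load wrongly executed before a conflicting older store."""
        set_id = self.ssit[self._index(store_pc)]
        if set_id is None:
            set_id = self.next_id
            self.next_id += 1
            self.ssit[self._index(store_pc)] = set_id
        self.ssit[self._index(load_pc)] = set_id

On a misprediction the load and its dependents are squashed and re-executed, and train_on_violation updates the table so the same load/store pair synchronizes in the future.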
REVIEW: SCHEDULING OF LOAD
DEPENDENTS
• Assume load will hit
+ No delay for dependents (load hit is the common case)
-- Need to squash and re-schedule if load actually misses
• Assume load will miss (i.e., schedule dependents when load data is ready)
+ No need to re-schedule (simpler logic)
-- Significant delay for load dependents if load hits
• Predict load hit/miss
+ No delay for dependents on accurate prediction
-- Need to predict and re-schedule on misprediction
• Yoaz et al., “Speculation Techniques for Improving Load Related Instruction Scheduling,” ISCA 1999.
6
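As a rough illustration of the “predict load hit/miss” option, the predictor can be as simple as a table of saturating counters indexed by the load PC. The Python sketch below is illustrative only (table size and counter width are assumptions, not the Yoaz et al. design): dependents are scheduled early only when a hit is predicted, and must be re-scheduled if the prediction was wrong.

# Illustrative per-PC saturating-counter load hit/miss predictor.
class HitMissPredictor:
    def __init__(self, entries=1024, bits=2):
        self.entries = entries
        self.max_val = (1 << bits) - 1
        self.table = [self.max_val] * entries   # start biased toward "hit" (the common case)

    def predict_hit(self, load_pc):
        return self.table[load_pc % self.entries] > self.max_val // 2

    def update(self, load_pc, did_hit):
        i = load_pc % self.entries
        if did_hit:
            self.table[i] = min(self.table[i] + 1, self.max_val)
        else:
            self.table[i] = max(self.table[i] - 1, 0)

# Dependents of a load are woken up speculatively only if predict_hit() is True;
# a predicted hit that actually misses forces them to be squashed and re-scheduled.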
WHAT TO DO WITH DEPENDENTS
ON A LOAD MISS? (I)
• A load miss can take hundreds of cycles
• If there are many dependent instructions on a
load miss, these can clog the scheduling window
• Independent instructions cannot be allocated
reservation stations and scheduling can stall
7
SMALL WINDOWS: FULL-WINDOW STALLS
• When a long-latency instruction is not complete,
it blocks retirement.
• Incoming instructions fill the instruction window.
• Once the window is full, processor cannot place
new instructions into the window.
– This is called a full-window stall.
• A full-window stall prevents the processor from
making progress in the execution of the program.
8
SMALL WINDOWS: FULL-WINDOW STALLS
8-entry instruction window:

Oldest:   LOAD R1 ← mem[R5]     L2 Miss! Takes 100s of cycles.
          BEQ  R1, R0, target
          ADD  R2 ← R2, 8
          LOAD R3 ← mem[R2]     Independent of the L2 miss,
          MUL  R4 ← R4, R3      executed out of program order,
          ADD  R4 ← R4, R5      but cannot be retired.
          STOR mem[R2] ← R4
          ADD  R2 ← R2, 64
Younger:  LOAD R3 ← mem[R2]     Younger instructions cannot be executed
                                because there is no space in the instruction window.

The processor stalls until the L2 miss is serviced.
• L2 cache misses are responsible for most full-window stalls.
9
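The magnitude of the problem is easy to estimate: the window fills within a few tens of cycles and then dispatch stalls for the rest of the miss. A small back-of-envelope sketch in Python (the 4-wide dispatch width is an assumption for illustration):

# Back-of-envelope estimate of full-window stall cycles caused by one L2 miss
# sitting at the head of the instruction window.
def full_window_stall_cycles(window_size, miss_latency, dispatch_width=4):
    # Cycles to fill the window behind the missing load at its head:
    fill_cycles = (window_size - 1 + dispatch_width - 1) // dispatch_width
    # After the window is full, dispatch stalls until the miss is serviced:
    return max(0, miss_latency - fill_cycles)

# A 128-entry window and a 500-cycle miss on a 4-wide machine:
print(full_window_stall_cycles(128, 500))   # ~468 of the 500 cycles are stalled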
IMPACT OF L2 CACHE MISSES
[Figure: Normalized execution time, split into non-stall (compute) time and full-window stall time, for a 128-entry instruction window; L2 misses account for most of the full-window stall time.]
512KB L2 cache, 500-cycle DRAM latency, aggressive stream-based prefetcher
Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model
10
IMPACT OF L2 CACHE MISSES
[Figure: Normalized execution time (non-stall vs. full-window stall time) for a 128-entry window compared with a 2048-entry window; the larger window spends far less time in full-window stalls.]
500-cycle DRAM latency, aggressive stream-based prefetcher
Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model
11
THE PROBLEM
• Out-of-order execution requires large instruction
windows to tolerate today’s main memory latencies.
• As main memory latency increases, instruction window
size should also increase to fully tolerate the memory
latency.
• Building a large instruction window is a challenging task
if we would like to achieve
– Low power/energy consumption (tag matching logic, ld/st
buffers)
– Short cycle time (access, wakeup/select latencies)
– Low design and verification complexity
12
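A back-of-envelope calculation shows why the window must scale with latency: to keep a W-wide machine supplied with independent work across an L-cycle miss, the window must hold roughly W × L instructions. With the 500-cycle DRAM latency assumed in the earlier measurements and a 4-wide machine, that is on the order of 2,000 entries, roughly matching the 2048-entry window in the comparison above and far beyond what conventional tag-matching wakeup/select logic and register files can support at a reasonable cycle time.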
EFFICIENT SCALING OF INSTRUCTION WINDOW SIZE
• One of the major research issues in out-of-order execution
• How to achieve the benefits of a large window with a
small one (or in a simpler way)?
– Runahead execution?
• Upon L2 miss, checkpoint architectural state, speculatively execute
only for prefetching, re-execute when data ready
– Continual flow pipelines?
• Upon L2 miss, deallocate everything belonging to an L2 miss
dependent, reallocate/re-rename and re-execute upon data ready
– Dual-core execution?
• One core runs ahead and does not stall on L2 misses, feeds another
core that commits instructions
13
RUNAHEAD EXECUTION (I)
• A technique to obtain the memory-level parallelism
benefits of a large instruction window
• When the oldest instruction is a long-latency cache miss:
– Checkpoint architectural state and enter runahead mode
• In runahead mode:
– Speculatively pre-execute instructions
– The purpose of pre-execution is to generate prefetches
– L2-miss dependent instructions are marked INV and dropped
• Runahead mode ends when the original miss returns
– Checkpoint is restored and normal execution resumes
• Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,” HPCA 2003.
14
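A minimal control-flow sketch of the mechanism just described, written as Python-style pseudocode in runnable form. The core and memory objects and their methods are hypothetical interfaces invented for illustration; only the overall structure (checkpoint, pre-execute, mark INV, restore) follows the slide.

# Simplified runahead control loop (illustrative; core/memory are assumed interfaces).
def run_processor(core, memory):
    while not core.done():
        head = core.oldest_instruction()
        if head.is_load() and memory.is_l2_miss(head):
            checkpoint = core.checkpoint_architectural_state()
            while not memory.miss_returned(head):        # runahead mode
                inst = core.fetch_next()                  # stays on the predicted path
                if inst.depends_on_l2_miss():
                    inst.mark_invalid()                   # INV: result dropped
                else:
                    core.pre_execute(inst)                # misses here become prefetches
            core.restore(checkpoint)                      # discard all runahead results
        else:
            core.execute_and_retire(head)                 # normal mode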
RUNAHEAD EXAMPLE
[Figure: Execution timelines.
Perfect caches: Load 1 hit → compute → Load 2 hit → compute.
Small window: Load 1 miss → compute → stall for Miss 1 → Load 2 miss → compute → stall for Miss 2 → compute.
Runahead: Load 1 miss → compute → runahead during Miss 1 (Load 2 miss issued, so Miss 2 overlaps with Miss 1) → Load 1 hit → compute → Load 2 hit → compute. The saved cycles come from overlapping the two misses.]
15
BENEFITS OF RUNAHEAD EXECUTION
Instead of stalling during an L2 cache miss:
• Pre-executed loads and stores independent of L2-miss
instructions generate very accurate data prefetches:
– For both regular and irregular access patterns
• Instructions on the predicted program path are
prefetched into the instruction/trace cache and L2.
• Hardware prefetcher and branch predictor tables are
trained using future access information.
16
RUNAHEAD EXECUTION (III)
• Advantages:
+ Very accurate prefetches for data/instructions (all cache levels)
+ Follows the program path
+ Simple to implement, most of the hardware is already built in
+ Uses the same thread context as main thread, no waste of context
+ No need to construct a pre-execution thread
• Disadvantages/Limitations:
-- Extra executed instructions
-- Limited by branch prediction accuracy
-- Cannot prefetch dependent cache misses.
-- Effectiveness limited by available “memory-level parallelism” (MLP)
-- Prefetch distance limited by memory latency
• Implemented in IBM POWER6, Sun “Rock”
17
WHAT TO DO WITH DEPENDENTS
ON A LOAD MISS? (II)
• Idea: Move miss-dependent instructions into a separate buffer
– Example: Pentium 4’s “scheduling loops”
– Lebeck et al., “A Large, Fast Instruction Window for Tolerating
Cache Misses,” ISCA 2002.
• But, dependents still hold on to the physical registers
• Cannot scale the size of the register file indefinitely since it is
on the critical path
• Possible solution: Deallocate physical registers of dependents
– Difficult to re-allocate. See Srinivasan et al, “Continual Flow
Pipelines,” ASPLOS 2004.
Can you think of any other solution?
18
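A rough sketch of the “separate buffer” idea (Python; the instruction interface is hypothetical): miss-dependent instructions drain out of the small scheduler into a waiting buffer so independent work can keep issuing, and re-enter when the miss data returns. Note that this sketch ignores the physical-register problem raised above, which is exactly what continual flow pipelines address.

from collections import deque

# Illustrative scheduler cycle: miss-dependent instructions wait in a side buffer
# instead of occupying scheduler entries.
def schedule_cycle(scheduler, waiting_buffer, returned_misses):
    # Re-insert instructions whose source miss has now been serviced.
    for _ in range(len(waiting_buffer)):
        inst = waiting_buffer.popleft()
        if inst.waiting_on in returned_misses:
            scheduler.append(inst)
        else:
            waiting_buffer.append(inst)
    # Issue ready instructions; move newly identified miss-dependents aside.
    for inst in list(scheduler):
        if inst.depends_on_outstanding_miss():
            scheduler.remove(inst)
            waiting_buffer.append(inst)
        elif inst.is_ready():
            inst.issue()
            scheduler.remove(inst)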
GENERAL ORGANIZATION OF AN OOO PROCESSOR

Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proc. IEEE, Dec. 1995.
19
A MODERN OOO DESIGN: INTEL
PENTIUM 4
Boggs et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2001.
20
INTEL PENTIUM 4 SIMPLIFIED
Mutlu+, “Runahead Execution,”
HPCA 2003.
21
ALPHA 21264
Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro, March-April 1999.
22
MIPS R10000
Yeager, “The MIPS R10000 Superscalar Microprocessor,” IEEE Micro, April 1996
23
IBM POWER4
• Tendler et al.,
“POWER4 system
microarchitecture”,
IBM J R&D, 2002.
24
IBM POWER4
• 2 cores, out-of-order execution
• 100-entry instruction window in each core
• 8-wide instruction fetch, issue, execute
• Large, local+global hybrid branch predictor
• 1.5MB, 8-way L2 cache
• Aggressive stream based prefetching
25

IBM POWER5
Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro 2004.
26
RECOMMENDED READINGS
• Out-of-order execution processor designs
• Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro,
March-April 1999.
• Boggs et al., “The Microarchitecture of the Pentium 4
Processor,” Intel Technology Journal, 2001.
• Yeager, “The MIPS R10000 Superscalar Microprocessor,”
IEEE Micro, April 1996
• Tendler et al., “POWER4 system microarchitecture,” IBM
Journal of Research and Development, January 2002.
27
AND MORE READINGS…
• Stark et al., “On Pipelining Dynamic Scheduling
Logic,” MICRO 2000.
• Brown et al., “Select-free Instruction Scheduling
Logic,” MICRO 2001.
• Palacharla et al., “Complexity-effective
Superscalar Processors,” ISCA 1997.
28
MULTIPLE CORES ON CHIP
• Simpler and lower power than a single large core
• Large scale parallelism on chip
AMD Barcelona: 4 cores
Intel Core i7: 8 cores
IBM Cell BE: 8+1 cores
IBM POWER7: 8 cores
Nvidia Fermi: 448 “cores”
Intel SCC: 48 cores, networked
Tilera TILE Gx: 100 cores, networked
Sun Niagara II: 8 cores
29
MOORE’S LAW
Moore, “Cramming more components onto integrated circuits,”
Electronics, 1965.
30
31
MULTI-CORE
• Idea: Put multiple processors on the same die.
• Technology scaling (Moore’s Law) enables more
transistors to be placed on the same die area
• What else could you do with the die area you
dedicate to multiple processors?
– Have a bigger, more powerful core
– Have larger caches in the memory hierarchy
– Simultaneous multithreading
– Integrate platform components on chip (e.g., network interface, memory controllers)
32
WHY MULTI-CORE?
• Alternative: Bigger, more powerful single core
– Larger superscalar issue width, larger instruction window,
more execution units, large trace caches, large branch
predictors, etc
+ Improves single-thread performance transparently to
programmer, compiler
- Very difficult to design (Scalable algorithms for improving
single-thread performance elusive)
- Power hungry – many out-of-order execution structures
consume significant power/area when scaled. Why?
- Diminishing returns on performance
- Does not significantly help memory-bound application
performance (Scalable algorithms for this elusive)
33
LARGE SUPERSCALAR VS. MULTI-CORE
• Olukotun et al., “The Case for a Single-Chip
Multiprocessor,” ASPLOS 1996.
34
MULTI-CORE VS. LARGE SUPERSCALAR
• Multi-core advantages
+ Simpler cores → more power efficient, lower complexity,
easier to design and replicate, higher frequency (shorter
wires, smaller structures)
+ Higher system throughput on multiprogrammed workloads →
reduced context switches
+ Higher system throughput in parallel applications
• Multi-core disadvantages
- Requires parallel tasks/threads to improve performance
(parallel programming)
- Resource sharing can reduce single-thread performance
- Shared hardware resources need to be managed
- Number of pins limits data supply for increased demand
35
LARGE SUPERSCALAR VS. MULTI-CORE
• Olukotun et al., “The Case for a Single-Chip
Multiprocessor,” ASPLOS 1996.
• Technology push
– Instruction issue queue size limits the cycle time of the
superscalar, OoO processor → diminishing performance
• Quadratic increase in complexity with issue width
– Large, multi-ported register files to support large instruction
windows and issue widths → reduced frequency or longer RF
access, diminishing performance
• Application pull
– Integer applications: little parallelism?
– FP applications: abundant loop-level parallelism
– Others (transaction proc., multiprogramming): CMP better fit
36
COMPARISON POINTS…
37
WHY MULTI-CORE?
• Alternative: Bigger caches
+ Improves single-thread performance transparently to
programmer, compiler
+ Simple to design
- Diminishing single-thread performance returns from
cache size. Why?
- Multiple levels complicate memory hierarchy
38
CACHE VS. CORE
[Figure: Number of transistors devoted to cache vs. to the microprocessor core, over time.]
39
WHY MULTI-CORE?
• Alternative: (Simultaneous) Multithreading
• Idea: Dispatch instructions from multiple threads in the
same cycle (to keep multiple execution units utilized)
– Hirata et al., “An Elementary Processor Architecture with
Simultaneous Instruction Issuing from Multiple Threads,” ISCA
1992.
– Yamamoto et al., “Performance Estimation of Multistreamed,
Superscalar Processors,” HICSS 1994.
– Tullsen et al., “Simultaneous Multithreading: Maximizing On-Chip Parallelism,” ISCA 1995.
40
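A minimal sketch of the idea (Python; the round-robin selection and widths are illustrative assumptions, not any particular machine's fetch/issue policy): each cycle the issue slots are filled with ready instructions drawn from several threads instead of just one, so one stalled thread does not leave the execution units idle.

# Illustrative SMT issue: fill the machine's issue slots each cycle from
# whichever threads have ready instructions (simple round-robin here).
def smt_issue_cycle(thread_queues, issue_width=8):
    issued = []
    num_threads = len(thread_queues)
    t = 0
    empties_in_a_row = 0
    while len(issued) < issue_width and empties_in_a_row < num_threads:
        q = thread_queues[t % num_threads]
        if q:
            issued.append((t % num_threads, q.pop(0)))
            empties_in_a_row = 0          # this thread supplied work; keep rotating
        else:
            empties_in_a_row += 1
        t += 1
    return issued

# Example: threads 0 and 1 both contribute instructions in the same cycle.
print(smt_issue_cycle([["add", "mul", "ld"], ["sub", "st"]], issue_width=4))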
BASIC SUPERSCALAR OOO PIPELINE
[Figure: Pipeline stages Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire, with the PC, register map, register files, Icache, and Dcache; every structure is thread-blind.]
41
SMT PIPELINE
• Physical register file needs to become larger.
Why?
[Figure: The same pipeline stages (Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire), now with per-thread PCs and register maps; the queue, register files, and caches are shared.]
42
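To see why the physical register file must grow: with register renaming it has to hold the committed architectural state of every thread plus all in-flight renamed destinations, so its minimum size is roughly (threads × architectural registers) + in-flight register-writing instructions. As a rough example, 4 threads with 32 architectural registers each and a 128-entry window need on the order of 256 physical registers, versus about 160 for a single thread.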
WHY MULTI-CORE?
• Alternative: (Simultaneous) Multithreading
+ Exploits thread-level parallelism (just like multi-core)
+ Good single-thread performance with SMT
+ No need to have an entire core for another thread
+ Parallel performance aided by tight sharing of caches
- Scalability is limited: need bigger register files, larger issue
width (and associated costs) to have many threads →
complex with many threads
- Parallel performance limited by shared fetch bandwidth
- Extensive resource sharing at the pipeline and memory system
reduces both single-thread and parallel application
performance
43
WHY MULTI-CORE?
• Alternative: Integrate platform components on
chip instead
+ Speeds up many system functions (e.g., network
interface cards, Ethernet controller, memory
controller, I/O controller)
- Not all applications benefit (e.g., CPU intensive code
sections)
44
WHY MULTI-CORE?
• Other alternatives?
– Dataflow?
– Vector processors (SIMD)?
– Integrating DRAM on chip?
– Reconfigurable logic? (general purpose?)
45
COMPUTER ARCHITECTURE
CS 6354
Multi-Cores
Samira Khan
University of Virginia
Feb 18, 2016
The content and concept of this course are adapted from CMU ECE 740