Lecture 9: Multi-Core, Asymmetry, and
COMPUTER ARCHITECTURE
CS 6354
Multi-Cores
Samira Khan
University of Virginia
Feb 18, 2016
The content and concept of this course are adapted from CMU ECE 740
AGENDA
• Logistics
• Review from last lecture
• Issues in out-of-order execution
• Latency Tolerance in OoO
• Multi-Core
2
LOGISTICS
• Course feedback form: Feb 18
• Project proposal: Feb 18, will have 4 late days
• Use the LaTeX template
• Follow the format of the sample proposals
• Each group will prepare one document, but upload the file
individually via Collab
3
REVIEW: MEMORY DEPENDENCE
HANDLING
• When do you schedule a load instruction in an OOO
engine?
– Problem: A younger load can have its address ready before an
older store’s address is known
– Known as the memory disambiguation problem or the
unknown address problem
• Approaches
– Conservative: Stall the load until all previous stores have
computed their addresses (or even retired from the machine)
– Aggressive: Assume load is independent of unknown-address
stores and schedule the load right away
– Intelligent: Predict (with a more sophisticated predictor) if the
load is dependent on the/any unknown address store
4
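To make the unknown-address problem concrete, here is a minimal C sketch (the function and variable names are purely illustrative, not from the lecture): the older store's address depends on i, while the younger load's address depends only on j and may be ready first.

/* Hypothetical fragment: an older store whose address depends on i,
 * followed by a younger load whose address depends on j.                  */
void update(int *a, int i, int j, int v) {
    a[i] = v;       /* older store: address known only once i is computed  */
    int x = a[j];   /* younger load: its address may be ready much earlier */
    (void)x;
    /* If i == j, the load must wait for (or forward from) the store;
     * if i != j, delaying the load is pure lost performance. The OoO
     * scheduler must pick a policy before it knows which case holds.      */
}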
REVIEW: MEMORY DISAMBIGUATION
• Option 1: Assume load dependent on all previous stores
+ No need for recovery
-- Too conservative: delays independent loads unnecessarily
• Option 2: Assume load independent of all previous stores
+ Simple and can be common case: no delay for independent loads
-- Requires recovery and re-execution of load and dependents on misprediction
• Option 3: Predict the dependence of a load on an outstanding
store
+ More accurate: load-store dependencies persist over time
-- Still requires recovery/re-execution on misprediction
– Alpha 21264: Initially assume load independent; delay loads found to be dependent
– Moshovos et al., “Dynamic speculation and synchronization of data
dependences,” ISCA 1997.
– Chrysos and Emer, “Memory Dependence Prediction Using Store Sets,” ISCA
1998.
5
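As a rough illustration of the 21264-style "delay loads found to be dependent" idea above, here is a toy per-load-PC predictor sketch in C; the table size, function names, and training policy are assumptions, and real designs (e.g., store sets) are considerably more precise.

#include <stdbool.h>

/* Toy dependence predictor, indexed by hashed load PC (all sizes/names
 * are assumptions for illustration).                                     */
#define PRED_ENTRIES 1024

static bool predict_dependent[PRED_ENTRIES];

static unsigned idx(unsigned long load_pc) { return load_pc % PRED_ENTRIES; }

/* At schedule time: may this load issue past older unknown-address stores? */
bool may_issue_early(unsigned long load_pc) {
    return !predict_dependent[idx(load_pc)];
}

/* At squash time: once a load is caught depending on a store it bypassed,
 * make future instances of that load wait.                                */
void train(unsigned long load_pc, bool was_misordered) {
    if (was_misordered) predict_dependent[idx(load_pc)] = true;
}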
REVIEW: SCHEDULING OF LOAD
DEPENDENTS
• Assume load will hit
+ No delay for dependents (load hit is the common case)
-- Need to squash and re-schedule if load actually misses
• Assume load will miss (i.e. schedule when load data
ready)
+ No need to re-schedule (simpler logic)
-- Significant delay for load dependents if load hits
• Predict load hit/miss
+ No delay for dependents on accurate prediction
-- Need to predict and re-schedule on misprediction
• Yoaz et al., “Speculation Techniques for Improving Load Related Instruction Scheduling,” ISCA 1999.
6
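A toy model of the three scheduling options above; the latencies, the extra schedule-to-execute delay, and all names are assumptions chosen only to show the trade-off. It reports when a load's dependent reaches execute (relative to the load issuing at cycle 0) and whether a replay was needed.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative latencies only; not from the slides. */
enum { L1_HIT_LAT = 3, L2_MISS_LAT = 300, SCHED_TO_EXEC = 4 };
enum Policy { ASSUME_HIT, ASSUME_MISS, PREDICT };

typedef struct { int exec_cycle; bool replayed; } Result;

/* When does the load's dependent reach execute, and was a replay needed? */
static Result schedule_dependent(enum Policy p, bool load_hits, bool pred_hit) {
    bool guess_hit = (p == ASSUME_HIT) || (p == PREDICT && pred_hit);
    if (guess_hit && load_hits)                 /* speculative wakeup pays off */
        return (Result){ L1_HIT_LAT, false };
    if (guess_hit && !load_hits)                /* woke up too early: squash   */
        return (Result){ L2_MISS_LAT + SCHED_TO_EXEC, true };
    /* conservative: wait for the data before scheduling the dependent */
    int data_ready = load_hits ? L1_HIT_LAT : L2_MISS_LAT;
    return (Result){ data_ready + SCHED_TO_EXEC, false };
}

int main(void) {
    Result r = schedule_dependent(ASSUME_MISS, /*load_hits=*/true, false);
    printf("assume-miss on an actual hit: dependent executes at cycle %d\n",
           r.exec_cycle);   /* 7 instead of 3: the cost of being conservative */
    return 0;
}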
WHAT TO DO WITH DEPENDENTS
ON A LOAD MISS? (I)
• A load miss can take hundreds of cycles
• If there are many dependent instructions on a
load miss, these can clog the scheduling window
• Independent instructions cannot be allocated
reservation stations and scheduling can stall
7
SMALL WINDOWS: FULL-WINDOW STALLS
• When a long-latency instruction is not complete,
it blocks retirement.
• Incoming instructions fill the instruction window.
• Once the window is full, processor cannot place
new instructions into the window.
– This is called a full-window stall.
• A full-window stall prevents the processor from
making progress in the execution of the program.
8
SMALL WINDOWS: FULL-WINDOW STALLS
8-entry instruction window (oldest first):
LOAD R1 ← mem[R5]      L2 Miss! Takes 100s of cycles.
BEQ R1, R0, target
ADD R2 ← R2, 8
LOAD R3 ← mem[R2]
MUL R4 ← R4, R3        Independent of the L2 miss, executed out of
ADD R4 ← R4, R5        program order, but cannot be retired.
STOR mem[R2] ← R4
ADD R2 ← R2, 64
---- window full ----
LOAD R3 ← mem[R2]      Younger instructions cannot be executed because
                       there is no space in the instruction window.
The processor stalls until the L2 Miss is serviced.
• L2 cache misses are responsible for most full-window stalls.
9
IMPACT OF L2 CACHE MISSES
[Bar chart: normalized execution time (%) for a 128-entry instruction window, split into non-stall (compute) time and full-window stall time due to L2 misses.]
512KB L2 cache, 500-cycle DRAM latency, aggressive stream-based prefetcher
Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model
10
IMPACT OF L2 CACHE MISSES
[Bar chart: normalized execution time (%) split into non-stall (compute) time and full-window stall time due to L2 misses, comparing a 128-entry window with a 2048-entry window.]
500-cycle DRAM latency, aggressive stream-based prefetcher
Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model
11
THE PROBLEM
• Out-of-order execution requires large instruction
windows to tolerate today’s main memory latencies.
• As main memory latency increases, instruction window
size should also increase to fully tolerate the memory
latency.
• Building a large instruction window is a challenging task
if we would like to achieve
– Low power/energy consumption (tag matching logic, ld/st
buffers)
– Short cycle time (access, wakeup/select latencies)
– Low design and verification complexity
12
EFFICIENT SCALING OF INSTRUCTION WINDOW SIZE
• One of the major research issues in out-of-order execution
• How to achieve the benefits of a large window with a
small one (or in a simpler way)?
– Runahead execution?
• Upon L2 miss, checkpoint architectural state, speculatively execute
only for prefetching, re-execute when data ready
– Continual flow pipelines?
• Upon L2 miss, deallocate everything belonging to an L2 miss
dependent, reallocate/re-rename and re-execute upon data ready
– Dual-core execution?
• One core runs ahead and does not stall on L2 misses, feeds another
core that commits instructions
13
RUNAHEAD EXECUTION (I)
• A technique to obtain the memory-level parallelism
benefits of a large instruction window
• When the oldest instruction is a long-latency cache miss:
– Checkpoint architectural state and enter runahead mode
• In runahead mode:
– Speculatively pre-execute instructions
– The purpose of pre-execution is to generate prefetches
– L2-miss dependent instructions are marked INV and dropped
• Runahead mode ends when the original miss returns
– Checkpoint is restored and normal execution resumes
• Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,” HPCA 2003.
14
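A toy C sketch of the runahead mechanism just described; the types, register count, and helper names are invented for illustration. On the triggering miss the architectural state is checkpointed, younger loads are pre-executed only to generate prefetches, and anything whose source is INV is itself marked INV and dropped.

#include <stdbool.h>
#include <string.h>
#include <stdio.h>

#define NREGS 8

typedef struct {
    int  regs[NREGS];
    bool inv[NREGS];     /* INV bit: value depends on the pending L2 miss */
} State;

/* A "program" of loads: dst register <- memory[address held in src register]. */
typedef struct { int dst, src; } LoadInsn;

static void prefetch(int addr) { printf("prefetch addr %d\n", addr); }

static void runahead(State *arch, const LoadInsn *prog, int n, int miss_dst) {
    State spec;
    memcpy(&spec, arch, sizeof spec);      /* checkpoint architectural state    */
    spec.inv[miss_dst] = true;             /* result of the missing load is INV */

    for (int i = 0; i < n; i++) {          /* speculatively pre-execute         */
        const LoadInsn *ld = &prog[i];
        if (spec.inv[ld->src]) {           /* address depends on the miss:      */
            spec.inv[ld->dst] = true;      /* mark INV and drop the result      */
        } else {
            prefetch(spec.regs[ld->src]);  /* valid address: generate a prefetch */
            spec.inv[ld->dst] = false;     /* (toy model: value not tracked)    */
        }
    }
    /* When the original miss returns, 'spec' is discarded and execution
     * resumes from the checkpointed state in 'arch'.                          */
}

int main(void) {
    State s = { .regs = {0, 100, 200} };
    LoadInsn prog[] = { {3, 1}, {4, 0}, {5, 4} };  /* r0 is the missed load's dst */
    runahead(&s, prog, 3, /*miss_dst=*/0);
    return 0;
}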
RUNAHEAD EXAMPLE
[Timeline figure comparing three cases:
- Perfect caches: Load 1 hit, compute, Load 2 hit, compute
- Small window: Load 1 miss, compute, stall while Miss 1 is serviced; then Load 2 miss, compute, stall while Miss 2 is serviced (the two misses are serviced one after the other)
- Runahead: on the Load 1 miss, compute, then runahead under Miss 1, during which Load 2 misses and Miss 2 overlaps with Miss 1; afterwards Load 1 and Load 2 hit, compute → saved cycles]
15
BENEFITS OF RUNAHEAD EXECUTION
Instead of stalling during an L2 cache miss:
• Pre-executed loads and stores independent of L2-miss
instructions generate very accurate data prefetches:
– For both regular and irregular access patterns
• Instructions on the predicted program path are
prefetched into the instruction/trace cache and L2.
• Hardware prefetcher and branch predictor tables are
trained using future access information.
16
RUNAHEAD EXECUTION (III)
• Advantages:
+ Very accurate prefetches for data/instructions (all cache levels)
+ Follows the program path
+ Simple to implement, most of the hardware is already built in
+ Uses the same thread context as main thread, no waste of context
+ No need to construct a pre-execution thread
• Disadvantages/Limitations:
-- Extra executed instructions
-- Limited by branch prediction accuracy
-- Cannot prefetch dependent cache misses.
-- Effectiveness limited by available “memory-level parallelism” (MLP)
-- Prefetch distance limited by memory latency
• Implemented in IBM POWER6, Sun “Rock”
17
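A hypothetical C example of the "dependent cache miss" limitation listed above: in a linked-list traversal, each node's address comes from the previous (possibly missing) load, so runahead cannot compute the next address and therefore cannot prefetch it.

/* Hypothetical example, not from the slides. */
struct node { struct node *next; int val; };

int sum_list(const struct node *p) {
    int s = 0;
    while (p) {
        s += p->val;    /* if loading p->next misses, the address of the   */
        p = p->next;    /* next node is unknown until that miss returns,   */
    }                   /* so runahead cannot prefetch the dependent miss  */
    return s;
}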
WHAT TO DO WITH DEPENDENTS
ON A LOAD MISS? (II)
• Idea: Move miss-dependent instructions into a separate buffer
– Example: Pentium 4’s “scheduling loops”
– Lebeck et al., “A Large, Fast Instruction Window for Tolerating
Cache Misses,” ISCA 2002.
• But, dependents still hold on to the physical registers
• Cannot scale the size of the register file indefinitely since it is
on the critical path
• Possible solution: Deallocate physical registers of dependents
– Difficult to re-allocate. See Srinivasan et al., “Continual Flow
Pipelines,” ASPLOS 2004.
Can you think of any other solution?
18
GENERAL ORGANIZATION OF AN OOO PROCESSOR
Smith and Sohi, “The Microarchitecture of Superscalar Processors,” Proc. IEEE, Dec. 1995.
19
A MODERN OOO DESIGN: INTEL
PENTIUM 4
20
Boggs et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2001.
INTEL PENTIUM 4 SIMPLIFIED
Mutlu+, “Runahead Execution,”
HPCA 2003.
21
ALPHA 21264
Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro, March-April 1999.
22
MIPS R10000
Yeager, “The MIPS R10000 Superscalar Microprocessor,” IEEE Micro, April 1996
23
IBM POWER4
• Tendler et al.,
“POWER4 system
microarchitecture”,
IBM J R&D, 2002.
24
IBM POWER4
• 2 cores, out-of-order execution
• 100-entry instruction window in each core
• 8-wide instruction fetch, issue, execute
• Large, local+global hybrid branch predictor
• 1.5MB, 8-way L2 cache
• Aggressive stream-based prefetching
25
IBM POWER5
Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE
Micro 2004.
26
RECOMMENDED READINGS
• Out-of-order execution processor designs
• Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro,
March-April 1999.
• Boggs et al., “The Microarchitecture of the Pentium 4
Processor,” Intel Technology Journal, 2001.
• Yeager, “The MIPS R10000 Superscalar Microprocessor,”
IEEE Micro, April 1996
• Tendler et al., “POWER4 system microarchitecture,” IBM
Journal of Research and Development, January 2002.
27
AND MORE READINGS…
• Stark et al., “On Pipelining Dynamic Scheduling
Logic,” MICRO 2000.
• Brown et al., “Select-free Instruction Scheduling
Logic,” MICRO 2001.
• Palacharla et al., “Complexity-effective
Superscalar Processors,” ISCA 1997.
28
MULTIPLE CORES ON CHIP
• Simpler and lower power than a single large core
• Large scale parallelism on chip
AMD Barcelona: 4 cores
Intel Core i7: 8 cores
IBM Cell BE: 8+1 cores
IBM POWER7: 8 cores
Sun Niagara II: 8 cores
Nvidia Fermi: 448 “cores”
Intel SCC: 48 cores, networked
Tilera TILE Gx: 100 cores, networked
29
MOORE’S LAW
Moore, “Cramming more components onto integrated circuits,”
Electronics, 1965.
30
31
MULTI-CORE
• Idea: Put multiple processors on the same die.
• Technology scaling (Moore’s Law) enables more
transistors to be placed on the same die area
• What else could you do with the die area you
dedicate to multiple processors?
– Have a bigger, more powerful core
– Have larger caches in the memory hierarchy
– Simultaneous multithreading
– Integrate platform components on chip (e.g., network interface, memory controllers)
32
WHY MULTI-CORE?
• Alternative: Bigger, more powerful single core
– Larger superscalar issue width, larger instruction window,
more execution units, large trace caches, large branch
predictors, etc.
+ Improves single-thread performance transparently to
programmer, compiler
- Very difficult to design (Scalable algorithms for improving
single-thread performance elusive)
- Power hungry – many out-of-order execution structures
consume significant power/area when scaled. Why?
- Diminishing returns on performance
- Does not significantly help memory-bound application
performance (Scalable algorithms for this elusive)
33
LARGE SUPERSCALAR VS. MULTI-CORE
• Olukotun et al., “The Case for a Single-Chip
Multiprocessor,” ASPLOS 1996.
34
MULTI-CORE VS. LARGE SUPERSCALAR
• Multi-core advantages
+ Simpler cores → more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures)
+ Higher system throughput on multiprogrammed workloads → reduced context switches
+ Higher system throughput in parallel applications
• Multi-core disadvantages
- Requires parallel tasks/threads to improve performance
(parallel programming)
- Resource sharing can reduce single-thread performance
- Shared hardware resources need to be managed
- Number of pins limits data supply for increased demand
35
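To make the first disadvantage concrete, a minimal pthread sketch (illustrative only, not from the slides): the extra cores help only if the programmer exposes parallel work, here by splitting an array sum across two threads.

#include <pthread.h>
#include <stdio.h>

#define N 1000000
static int data[N];

typedef struct { int lo, hi; long sum; } Chunk;

static void *partial_sum(void *arg) {
    Chunk *c = arg;
    c->sum = 0;
    for (int i = c->lo; i < c->hi; i++) c->sum += data[i];
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1;
    Chunk a = {0, N / 2, 0}, b = {N / 2, N, 0};
    pthread_t ta, tb;
    pthread_create(&ta, NULL, partial_sum, &a);   /* can run on core 0 */
    pthread_create(&tb, NULL, partial_sum, &b);   /* can run on core 1 */
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    printf("sum = %ld\n", a.sum + b.sum);
    return 0;
}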
LARGE SUPERSCALAR VS. MULTI-CORE
• Olukotun et al., “The Case for a Single-Chip
Multiprocessor,” ASPLOS 1996.
• Technology push
– Instruction issue queue size limits the cycle time of the superscalar, OoO processor → diminishing performance
• Quadratic increase in complexity with issue width
– Large, multi-ported register files to support large instruction windows and issue widths → reduced frequency or longer RF access, diminishing performance
• Application pull
– Integer applications: little parallelism?
– FP applications: abundant loop-level parallelism
– Others (transaction proc., multiprogramming): CMP better fit
36
COMPARISON POINTS…
37
WHY MULTI-CORE?
• Alternative: Bigger caches
+ Improves single-thread performance transparently to
programmer, compiler
+ Simple to design
- Diminishing single-thread performance returns from
cache size. Why?
- Multiple levels complicate memory hierarchy
38
CACHE VS. CORE
[Figure: number of transistors devoted to cache vs. to the microprocessor core over time.]
39
WHY MULTI-CORE?
• Alternative: (Simultaneous) Multithreading
• Idea: Dispatch instructions from multiple threads in the
same cycle (to keep multiple execution units utilized)
– Hirata et al., “An Elementary Processor Architecture with
Simultaneous Instruction Issuing from Multiple Threads,” ISCA
1992.
– Yamamoto et al., “Performance Estimation of Multistreamed,
Superscalar Processors,” HICSS 1994.
– Tullsen et al., “Simultaneous Multithreading: Maximizing On-Chip Parallelism,” ISCA 1995.
40
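A toy sketch of that per-cycle dispatch idea (thread counts, widths, and readiness numbers are all invented): each cycle the issue slots are filled from whichever threads have ready instructions, so one thread's stall does not leave the execution units idle.

#include <stdio.h>

enum { THREADS = 2, ISSUE_WIDTH = 4, CYCLES = 3 };

int main(void) {
    /* ready[c][t] = instructions thread t has ready in cycle c
       (thread 0 is stalled on a cache miss in cycles 1 and 2)   */
    int ready[CYCLES][THREADS] = { {2, 3}, {0, 4}, {0, 4} };
    for (int c = 0; c < CYCLES; c++) {
        int slots = ISSUE_WIDTH;
        for (int t = 0; t < THREADS && slots > 0; t++) {
            int issued = ready[c][t] < slots ? ready[c][t] : slots;
            slots -= issued;
            if (issued)
                printf("cycle %d: issue %d insts from thread %d\n", c, issued, t);
        }
    }
    return 0;
}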
BASIC SUPERSCALAR OOO PIPELINE
[Pipeline diagram: Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire, with a PC, register map, register files, Icache, and Dcache. The whole pipeline is thread-blind.]
41
SMT PIPELINE
• Physical register file needs to become larger.
Why?
[Pipeline diagram: the same Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire pipeline and structures (PC, register map, register files, Icache, Dcache), now shared by multiple threads.]
42
WHY MULTI-CORE?
• Alternative: (Simultaneous) Multithreading
+ Exploits thread-level parallelism (just like multi-core)
+ Good single-thread performance with SMT
+ No need to have an entire core for another thread
+ Parallel performance aided by tight sharing of caches
- Scalability is limited: need bigger register files, larger issue width (and associated costs) to have many threads → complex with many threads
- Parallel performance limited by shared fetch bandwidth
- Extensive resource sharing at the pipeline and memory system
reduces both single-thread and parallel application
performance
43
WHY MULTI-CORE?
• Alternative: Integrate platform components on
chip instead
+ Speeds up many system functions (e.g., network
interface cards, Ethernet controller, memory
controller, I/O controller)
- Not all applications benefit (e.g., CPU intensive code
sections)
44
WHY MULTI-CORE?
• Other alternatives?
– Dataflow?
– Vector processors (SIMD)?
– Integrating DRAM on chip?
– Reconfigurable logic? (general purpose?)
45