EECC722 - Shaaban

Simultaneous Multithreading (SMT)
• An evolutionary processor architecture originally
introduced in 1995 by Dean Tullsen at the University of
Washington that aims at reducing resource waste in wide
issue processors (superscalars).
• SMT has the potential of greatly enhancing superscalar
processor computational capabilities by:
– Exploiting thread-level parallelism (TLP) in a single processor
core, simultaneously issuing, executing and retiring instructions
from different threads during the same cycle.
• A single physical SMT processor core acts as a number of logical
processors each executing a single thread
– Providing multiple hardware contexts, hardware thread scheduling
and context switching capability.
– Providing effective long-latency hiding.
• e.g., long FP operations, branch misprediction penalties, memory access latency
EECC722 - Shaaban
#1 Lec # 2 Fall 2009 9-7-2009
SMT Issues
• SMT CPU performance gain potential.
• Modifications to superscalar CPU architecture to support SMT.
• SMT performance evaluation vs. fine-grain multithreading, superscalar, and chip multiprocessors. (Ref. papers: SMT-1, SMT-2)
• Hardware techniques to improve SMT performance:
– Optimal level-one cache configuration for SMT.
– SMT thread instruction fetch and issue policies.
– Instruction recycling (reuse) of decoded instructions.
• Software techniques:
– Compiler optimizations for SMT. (SMT-3)
– Software-directed register deallocation.
– Operating system behavior and optimization. (SMT-7)
• SMT support for fine-grain synchronization. (SMT-4)
• SMT as a viable architecture for network processors.
• Current SMT implementation: Intel's Hyper-Threading (2-way SMT) microarchitecture and performance in compute-intensive workloads. (SMT-8, SMT-9)
EECC722 - Shaaban
#2 Lec # 2 Fall 2009 9-7-2009
Evolution of Microprocessors
[Figure: evolution of general-purpose processors (GPPs), from multi-cycle, to pipelined (single-issue), to multiple-issue (CPI < 1) superscalar/VLIW/SMT/CMP designs, with clock rates of 1 GHz and beyond (original 2002 Intel predictions: 15 GHz). Source: John P. Chen, Intel Labs.]
T = I x CPI x C
Single-issue processor = scalar processor
Instructions Per Cycle (IPC) = 1/CPI
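As a quick illustration of the equation above, a minimal sketch in Python; the instruction count, CPI, and clock values are assumed for illustration, not figures from the slide:

```python
# The performance equation from this slide: T = I x CPI x C,
# where I = instruction count, CPI = average cycles per instruction,
# and C = clock cycle time (C = 1 / clock rate). IPC = 1 / CPI.
# The numbers below are illustrative assumptions.

def exec_time(insts, cpi, clock_hz):
    return insts * cpi / clock_hz

t_scalar = exec_time(1e9, 1.0, 1e9)   # 1 GHz single-issue, ideal CPI = 1
t_wide   = exec_time(1e9, 0.5, 1e9)   # multiple issue reaching IPC = 2

print(f"{t_scalar:.2f} s vs {t_wide:.2f} s")  # halving CPI halves T
```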
EECC722 - Shaaban
#3 Lec # 2 Fall 2009 9-7-2009
Microprocessor Frequency Trend
[Figure: processor frequency (MHz, log scale) and gate delays per clock vs. year (1987-2005) for Intel (386, 486, Pentium, Pentium Pro, Pentium II), IBM PowerPC (601, 603, 604, 604+, MPC750), and DEC Alpha (21064A, 21066, 21164, 21164A, 21264, 21264S) processors. Processor frequency scales by 2X per generation.]
1. Frequency used to double each generation.
2. Number of gate delays/clock reduced by 25%.
3. This leads to deeper pipelines with more stages (e.g., the Intel Pentium 4E has 30+ pipeline stages).
T = I x CPI x C
Reality check: clock frequency scaling is slowing down! (Did silicon finally hit the wall?)
Why?
1. Power leakage
2. Clock distribution delays
Result: deeper pipelines, longer stalls, higher CPI (lowers effective performance per cycle).
Possible solutions?
- Exploit thread-level parallelism (TLP) at the chip level (SMT/CMP).
- Utilize/integrate more-specialized computing elements other than GPPs.
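To see why deeper pipelines can lower effective performance per cycle, a minimal sketch using T = I x CPI x C; the miss rate, penalties, and clock rates below are illustrative assumptions (only the "30+ stages" figure comes from the slide):

```python
def exec_time(insts, cpi, clock_hz):
    return insts * cpi / clock_hz  # T = I x CPI x C, with C = 1/clock_hz

INSTS = 1e9
MISS_PER_INST = 0.01  # assumed branch mispredictions per instruction

# Shallower pipeline at a lower clock vs. a much deeper pipeline at a
# higher clock; the misprediction penalty grows with pipeline depth.
t_14 = exec_time(INSTS, 1.0 + MISS_PER_INST * 14, 2.0e9)
t_30 = exec_time(INSTS, 1.0 + MISS_PER_INST * 30, 3.8e9)

print(f"14-stage @2.0GHz: {t_14:.3f} s")   # 0.570 s
print(f"30-stage @3.8GHz: {t_30:.3f} s")   # 0.342 s
print(f"net speedup: {t_14/t_30:.2f}x vs {3.8/2.0:.2f}x from clock alone")
```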
EECC722 - Shaaban
#4 Lec # 2 Fall 2009 9-7-2009
Parallelism in Microprocessor VLSI Generations
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) vs. year (1970-2005), with data points from the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000. The plot marks the shift from bit-level parallelism, to instruction-level parallelism (ILP), to thread-level parallelism (TLP): single-thread designs evolve from single-issue multi-cycle non-pipelined (CPI >> 1), to pipelined (CPI = 1), to superscalar/VLIW (CPI < 1) with multiple micro-operations per cycle; chip-level parallel processing then follows with simultaneous multithreading (SMT, e.g., Intel's Hyper-Threading) and chip multiprocessors (CMPs, e.g., IBM POWER4/5, Intel Pentium D and Core 2, AMD Athlon 64 X2 and dual-core Opteron, Sun UltraSparc T1 (Niagara)).]
Thread-level parallelism (TLP) is even more important due to the slowing clock rate increase.
Improving microprocessor generation performance by exploiting more levels of parallelism.
EECC722 - Shaaban
#5 Lec # 2 Fall 2009 9-7-2009
Microprocessor Architecture Trends
General Purpose Processor (GPP)
Single-threaded:
- CISC machines: instructions take variable times to complete.
- RISC machines (microcode): simple instructions, optimized for speed.
- RISC machines (pipelined): same individual instruction latency; greater throughput through instruction "overlap".
- Superscalar processors: multiple instructions executing simultaneously.
- VLIW: "superinstructions" grouped together; decreased HW control complexity (single- or multi-threaded).
Multithreaded processors:
- Additional HW resources (regs, PC, SP); each context gets the processor for x cycles.
Simultaneous Multithreading (SMT):
- Multiple HW contexts (regs, PC, SP); each cycle, any context may execute. e.g., Intel's Hyper-Threading (P4).
CMPs (single-chip multiprocessors):
- Duplicate entire processors (technology enabled by Moore's Law). e.g., IBM POWER4/5, AMD X2/X3/X4, Intel Core 2.
SMT/CMPs:
- e.g., IBM POWER5/6/7, Intel Pentium D, Sun Niagara (UltraSparc T1), Intel Nehalem (Core i7, introduced 4th quarter 2008).
EECC722 - Shaaban
#6 Lec # 2 Fall 2009 9-7-2009
CPU Architecture Evolution:
Single-Threaded/Single-Issue Pipeline
• Traditional 5-stage integer pipeline.
• Increases throughput: ideal CPI = 1.
[Figure: a single pipeline (Fetch, Decode, Execute, Memory, Writeback) with one PC, one SP, one register file, and the memory hierarchy (management).]
EECC722 - Shaaban
#7 Lec # 2 Fall 2009 9-7-2009
CPU Architecture Evolution:
Single-Threaded/Superscalar Architectures
• Fetch, issue, execute, etc. more than one instruction per cycle (CPI < 1).
• Limited by instruction-level parallelism (ILP).
[Figure: two parallel pipelines (Fetch, Decode, Execute, Memory, Writeback for instructions i and i+1) sharing a single PC, SP, register file, and memory hierarchy (management).]
EECC722 - Shaaban
#8 Lec # 2 Fall 2009 9-7-2009
Superscalar Architecture Limitations:
Issue Slot Waste Classification
• Empty or wasted issue slots can be classified as either vertical waste or horizontal waste:
– Vertical waste is introduced when the processor issues no instructions in a cycle.
– Horizontal waste occurs when not all issue slots can be filled in a cycle.
[Figure example: a 4-issue superscalar; ideal IPC = 4, ideal CPI = 0.25. Also applies to VLIW.]
Instructions Per Cycle (IPC) = 1/CPI
Result of issue slot waste: actual performance << peak performance.
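A minimal sketch of how vertical and horizontal waste would be tallied from a per-cycle issue trace (the trace values and the 4-wide width are assumed for illustration):

```python
# Sketch: tallying vertical vs. horizontal waste from a per-cycle issue
# trace on a hypothetical 4-issue machine (trace values are made up).

ISSUE_WIDTH = 4
issued = [3, 0, 4, 1, 0, 2, 4, 0]     # instructions issued each cycle

vertical   = sum(ISSUE_WIDTH for n in issued if n == 0)               # empty cycles
horizontal = sum(ISSUE_WIDTH - n for n in issued if 0 < n < ISSUE_WIDTH)
used       = sum(issued)
total      = ISSUE_WIDTH * len(issued)
assert used + vertical + horizontal == total

print(f"IPC = {used / len(issued):.2f} (ideal {ISSUE_WIDTH})")        # 1.75
print(f"vertical waste:   {vertical}/{total} slots")                  # 12/32
print(f"horizontal waste: {horizontal}/{total} slots")                # 6/32
```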
EECC722 - Shaaban
#9 Lec # 2 Fall 2009 9-7-2009
Sources of Unused (Wasted) Issue Cycles in an 8-issue Superscalar Processor
[Figure: breakdown of issue-slot usage by cause. "Processor busy" represents the utilized issue slots; all others represent wasted issue slots.]
Ideal IPC = 8 (CPI = 1/8). The measured average issue rate is about 1.5 instructions/cycle, i.e., real IPC << ideal IPC (1.5 << 8, or 18.75% of ideal), so ~81% of issue slots are wasted. 61% of the wasted cycles are vertical waste; the remainder are horizontal waste.
Workload: SPEC92 benchmark suite.
SMT-1
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism Dean Tullsen et al.,
Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.
EECC722 - Shaaban
#10 Lec # 2 Fall 2009 9-7-2009
Superscalar Architecture Limitations:
[Figure: all possible causes of wasted issue slots, and the latency-hiding or latency-reducing techniques that can reduce the number of cycles wasted by each cause.]
Main issue: one thread leads to limited ILP (it cannot fill the issue slots).
Solution: exploit thread-level parallelism (TLP) within a single microprocessor chip, via either or both of:
• Simultaneous Multithreaded (SMT) processor: the processor issues and executes instructions from a number of threads, creating a number of logical processors within a single physical processor. e.g., Intel's Hyper-Threading (HT), where each physical processor executes instructions from two threads.
• Chip multiprocessors (CMPs): integrate two or more complete processor cores on the same chip (die); each core runs a different thread (or program). Limited ILP is still a problem within each core (solution: combine this approach with SMT).
SMT-1
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism Dean Tullsen et al.,
Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.
EECC722 - Shaaban
#11 Lec # 2 Fall 2009 9-7-2009
Advanced CPU Architectures:
Fine-grained or Traditional Multithreaded
Processors
• Multiple hardware contexts (PC, SP, and registers).
• Only one context or thread issues instructions each cycle.
• Performance limited by Instruction-Level Parallelism
(ILP) within each individual thread:
– Can reduce some of the vertical issue slot waste.
– No reduction in horizontal issue slot waste.
• Example architecture: The Tera Computer System.
EECC722 - Shaaban
#12 Lec # 2 Fall 2009 9-7-2009
Fine-grain or Traditional Multithreaded Processors
The Tera (Cray) Computer System
• The Tera computer system is a shared memory multiprocessor
that can accommodate up to 256 processors.
• Each Tera processor is fine-grain multithreaded:
– Each processor can issue one 3-operation Long Instruction Word (LIW), from one
thread, every 3 ns cycle (333 MHz), from among as many as 128 distinct instruction
streams (hardware threads), thereby hiding up to 128 cycles (384 ns) of
memory latency.
– In addition, each stream can issue as many as eight memory references
without waiting for earlier ones to finish, further augmenting the memory
latency tolerance of the processor.
– A stream implements a load/store architecture with three addressing
modes and 31 general-purpose 64-bit registers.
– The instructions are 64 bits wide and can contain three operations: a
memory reference operation (M-unit operation or simply M-op for short),
an arithmetic or logical operation (A-op), and a branch or simple
arithmetic or logical operation (C-op).
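A back-of-envelope check of the latency-tolerance figures above (all constants come from this slide):

```python
CYCLE_NS = 3.0              # 333 MHz -> 3 ns cycle time
STREAMS = 128               # hardware instruction streams per processor
MEM_REFS_PER_STREAM = 8     # outstanding memory references per stream

# With 128 streams, issue can continue for 128 cycles while one stream
# waits on memory:
print(f"memory latency hidden: {STREAMS} cycles = {STREAMS * CYCLE_NS:.0f} ns")
# Outstanding references add memory-level parallelism on top of that:
print(f"max outstanding refs per processor: {STREAMS * MEM_REFS_PER_STREAM}")
```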
EECC722 - Shaaban
Source: http://www.cscs.westminster.ac.uk/~seamang/PAR/tera_overview.html
#13 Lec # 2 Fall 2009 9-7-2009
Advanced CPU Architectures:
VLIW: Intel/HP IA-64
Explicitly Parallel Instruction Computing
(EPIC)
• Strengths:
– Allows for a high level of instruction parallelism (ILP).
– Takes a lot of the dependency analysis out of HW and places
focus on smart compilers.
• Weaknesses:
– Limited by instruction-level parallelism (ILP) in a single thread.
– Keeping functional units (FUs) busy (control hazards).
– Static FU scheduling limits performance gains.
– Resulting overall performance heavily depends on compiler performance.
EECC722 - Shaaban
#14 Lec # 2 Fall 2009 9-7-2009
Advanced CPU Architectures:
Single Chip Multiprocessors
(CMPs)
AKA Multi-Core Processors
• Strengths:
– Create a single processor block and duplicate it.
– Exploits thread-level parallelism (TLP) at the chip level.
– Takes a lot of the dependency analysis out of HW and places
focus on smart compilers.
• Weakness:
– Performance within each processor still limited by individual
thread performance (ILP).
– High power requirements using current VLSI processes.
• Almost entire processor cores are replicated on chip.
• May run at lower clock rates to reduce heat/power consumption.
e.g., IBM POWER4/5, Intel Pentium D, Core Duo, Core 2 (Conroe),
AMD Athlon 64 X2/X3/X4, dual/quad-core Opteron,
Sun UltraSparc T1 (Niagara) …
EECC722 - Shaaban
#15 Lec # 2 Fall 2009 9-7-2009
Advanced CPU Architectures:
Single Chip Multiprocessor (CMP)
[Figure: a CMP with n cores. Each core i has its own register file, PC i, SP i, and control unit i, feeding its own superscalar (two-way, or 4-way) pipeline i; all n cores share the memory hierarchy (management).]
EECC722 - Shaaban
#16 Lec # 2 Fall 2009 9-7-2009
Current Dual-Core Chip-Multiprocessor (CMP) Architectures
1. Single die, shared L2 cache: cores communicate through the shared cache via an on-chip crossbar/switch (lowest communication latency). Examples: IBM POWER4/5; Intel Pentium Core Duo (Yonah), Conroe; Sun UltraSparc T1 (Niagara); also the (2007) quad-core AMD K10 (shared L3 cache).
2. Single die, private caches, shared system interface: cores communicate using on-chip interconnects (the shared system interface). Examples: AMD dual-core Opteron, Athlon 64 X2; Intel Itanium 2 (Montecito).
3. Two dice, shared package, private caches, private system interfaces: cores communicate over the external Front Side Bus (FSB) (highest communication latency). Example: Intel Pentium D.
Source: Real World Technologies,
http://www.realworldtech.com/page.cfm?ArticleID=RWT101405234615
EECC722 - Shaaban
#17 Lec # 2 Fall 2009 9-7-2009
SMT: Simultaneous Multithreading
• Multiple Hardware Contexts running at the same time (HW context:
registers, PC, and SP etc.).
• A single physical SMT processor core acts (and reports to the
operating system) as a number of logical processors each executing a
single thread
• Reduces both horizontal and vertical waste by having multiple
threads keeping functional units busy during every cycle.
• Builds on top of current time-proven advancements in CPU design:
superscalar, dynamic scheduling, hardware speculation, dynamic HW
branch prediction, multiple levels of cache, hardware pre-fetching etc.
• Enabling technology: VLSI logic density on the order of hundreds of
millions of transistors per chip.
– Potential performance gain is much greater than the increase in
chip area and power consumption needed to support SMT.
• Improved performance/chip area/watt (computational efficiency) vs.
single-threaded superscalar cores: a 2-way SMT processor needs only a
10-15% increase in area, vs. ~100% increase for a dual-core CMP (see the
sketch below).
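A rough throughput-per-area comparison, a sketch only: the area figures come from this slide, the ~1.5x two-thread SMT gain is borrowed from this lecture's worked example, and the 2x CMP gain assumes perfectly parallel work across both cores:

```python
designs = {
    "single-threaded core": (1.00, 1.0),   # (relative area, relative throughput)
    "2-way SMT core":       (1.15, 1.5),   # ~10-15% more area
    "dual-core CMP":        (2.00, 2.0),   # ~100% more area
}

for name, (area, throughput) in designs.items():
    print(f"{name}: {throughput / area:.2f} throughput per unit area")
# SMT: ~1.30 per unit area; CMP: 1.00 -- SMT is the more area-efficient
# way to exploit TLP, at the cost of lower peak throughput.
```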
EECC722 - Shaaban
#18 Lec # 2 Fall 2009 9-7-2009
SMT
• With multiple threads running, penalties from long-latency
operations, cache misses, and branch mispredictions will be hidden;
thus SMT is an effective long-latency-hiding technique:
– Reduction of both horizontal and vertical waste, and thus an
improved instructions-issued-per-cycle (IPC) rate.
• Functional units are shared among all contexts during every
cycle:
– More complicated register read and writeback stages.
• More threads issuing to functional units results in higher
resource utilization.
• CPU resources may have to be resized to accommodate the
additional demands of the multiple threads running.
– (e.g., caches, TLBs, branch prediction tables, rename registers)
context = hardware thread
EECC722 - Shaaban
#19 Lec # 2 Fall 2009 9-7-2009
SMT: Simultaneous Multithreading
[Figure: one n-way SMT core, a modified out-of-order superscalar core. n hardware contexts (register file i, PC i, SP i for each context i) feed shared superscalar (two-way) pipelines under a chip-wide control unit; all contexts share the memory hierarchy (management).]
EECC722 - Shaaban
#20 Lec # 2 Fall 2009 9-7-2009
The Power of SMT
[Figure: issue-slot diagrams over time (processor cycles) for a superscalar processor, a traditional (fine-grain) multithreaded processor, and a simultaneous multithreading processor running threads 1-5. Rows of squares represent instruction issue slots; a box with number x is an instruction issued from thread x; an empty box is a wasted slot.]
EECC722 - Shaaban
#21 Lec # 2 Fall 2009 9-7-2009
SMT Performance Example

Inst  Code             Description      Functional unit
A     LUI  R5,100      R5 = 100         Int ALU
B     FMUL F1,F2,F3    F1 = F2 x F3     FP ALU
C     ADD  R4,R4,8     R4 = R4 + 8      Int ALU
D     MUL  R3,R4,R5    R3 = R4 x R5     Int mul/div
E     LW   R6,R4       R6 = (R4)        Memory port
F     ADD  R1,R2,R3    R1 = R2 + R3     Int ALU
G     NOT  R7,R7       R7 = !R7         Int ALU
H     FADD F4,F1,F2    F4 = F1 + F2     FP ALU
I     XOR  R8,R1,R7    R8 = R1 XOR R7   Int ALU
J     SUBI R2,R1,4     R2 = R1 - 4      Int ALU
K     SW   ADDR,R2     (ADDR) = R2      Memory port

• 4 integer ALUs (1-cycle latency)
• 1 integer multiplier/divider (3-cycle latency)
• 3 memory ports (2-cycle latency, assume cache hit)
• 2 FP ALUs (5-cycle latency)
• Assume all functional units are fully pipelined.
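The table and resource list above, encoded as data for reference (a sketch; the dictionary names are ad hoc, the values come from the slide):

```python
from collections import Counter

FU_LATENCY = {"int_alu": 1, "int_muldiv": 3, "mem_port": 2, "fp_alu": 5}
FU_COUNT   = {"int_alu": 4, "int_muldiv": 1, "mem_port": 3, "fp_alu": 2}

# (label, functional unit) for instructions A..K
PROGRAM = [
    ("A LUI",  "int_alu"),    ("B FMUL", "fp_alu"),
    ("C ADD",  "int_alu"),    ("D MUL",  "int_muldiv"),
    ("E LW",   "mem_port"),   ("F ADD",  "int_alu"),
    ("G NOT",  "int_alu"),    ("H FADD", "fp_alu"),
    ("I XOR",  "int_alu"),    ("J SUBI", "int_alu"),
    ("K SW",   "mem_port"),
]

# Static demand per FU type over the whole program:
print(Counter(fu for _, fu in PROGRAM))
# Counter({'int_alu': 6, 'fp_alu': 2, 'mem_port': 2, 'int_muldiv': 1})
```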
EECC722 - Shaaban
#22 Lec # 2 Fall 2009 9-7-2009
SMT Performance Example (continued)

4-way superscalar issuing slots:
Cycle 1: LUI (A), FMUL (B), ADD (C)
Cycle 2: MUL (D), LW (E)
Cycle 3: (no issue)
Cycle 4: (no issue)
Cycle 5: ADD (F), NOT (G)
Cycle 6: FADD (H), XOR (I), SUBI (J)
Cycle 7: SW (K)

SMT issuing slots (2-thread SMT; the second thread T2 runs the same program as T1):
Cycle 1: T1.LUI (A), T1.FMUL (B), T1.ADD (C), T2.LUI (A)
Cycle 2: T1.MUL (D), T1.LW (E), T2.FMUL (B), T2.ADD (C)
Cycle 3: T2.MUL (D), T2.LW (E)
Cycle 4: (no issue)
Cycle 5: T1.ADD (F), T1.NOT (G)
Cycle 6: T1.FADD (H), T1.XOR (I), T1.SUBI (J), T2.ADD (F)
Cycle 7: T1.SW (K), T2.NOT (G), T2.FADD (H)
Cycle 8: T2.XOR (I), T2.SUBI (J)
Cycle 9: T2.SW (K)

SMT takes 2 additional cycles to complete program 2 (i.e., the 2nd thread).
Throughput:
– Superscalar: 11 inst / 7 cycles = 1.57 IPC
– SMT: 22 inst / 9 cycles = 2.44 IPC
– SMT is 2.44/1.57 = 1.55 times faster than the superscalar for this example (ideal speedup = 2).
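A quick check of the quoted throughput numbers:

```python
insts_per_thread, threads = 11, 2
superscalar_cycles, smt_cycles = 7, 9

ipc_ss  = insts_per_thread / superscalar_cycles            # 1.57
ipc_smt = insts_per_thread * threads / smt_cycles          # 2.44

print(f"superscalar IPC:  {ipc_ss:.2f}")                   # 1.57
print(f"2-thread SMT IPC: {ipc_smt:.2f}")                  # 2.44
print(f"speedup: {ipc_smt / ipc_ss:.2f}x (ideal {threads}x)")
# ~1.56x unrounded; the slide's 1.55 divides the rounded IPCs (2.44/1.57)
```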
EECC722 - Shaaban
#23 Lec # 2 Fall 2009 9-7-2009
Modifications to Superscalar CPUs to Support SMT
Necessary modifications:
• Multiple program counters, and some mechanism by which the fetch unit selects one
each cycle (thread instruction fetch/issue policy).
• A separate return stack for each thread, for predicting subroutine return destinations.
• Per-thread instruction issue/retirement, instruction queue flush, and trap mechanisms.
• A thread ID with each branch target buffer entry, to avoid predicting phantom branches.
Modifications to improve SMT performance:
• A larger register file, to support logical registers for all threads plus additional registers
for register renaming (may require additional pipeline stages).
• A higher available main memory fetch bandwidth may be required.
• A larger data TLB with more entries, to compensate for the increased virtual-to-physical
address translations.
• Improved caches, to offset the cache performance degradation due to cache sharing
among the threads and the resulting reduced locality.
– e.g., private per-thread vs. shared L1 caches.
SMT-2
EECC722 - Shaaban
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,
Dean Tullsen et al. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.
#24 Lec # 2 Fall 2009 9-7-2009
SMT Implementations
• Intel's implementation of Hyper-Threading (HT) Technology (2-thread SMT) in its P4 processor family.
• IBM POWER 5/6: Dual cores each 2-thread SMT.
• The Alpha EV8 (4-thread SMT), originally scheduled for production
in 2001, is currently on indefinite hold. :(
• A number of special-purpose processors targeted towards network
processor (NP) applications.
• Sun UltraSparc T1 (Niagara): Eight processor cores each executing
from 4 hardware threads (32 threads total).
– Actually not SMT but fine-grain multithreaded (each core issues one instruction from
one thread per cycle).
• Intel's Nehalem (Core i7, introduced 4th quarter 2008): 2, 4, or 8 cores
per chip, each 2-thread SMT (4-16 threads per chip).
• Current technology has the potential for 4-8 simultaneous threads per
core (based on transistor count and design complexity).
EECC722 - Shaaban
#25 Lec # 2 Fall 2009 9-7-2009
A Base SMT Hardware Architecture
[Figure: an in-order front end feeding an out-of-order core; a modified superscalar speculative Tomasulo organization.]
SMT-2
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,
Dean Tullsen et al. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.
EECC722 - Shaaban
#26 Lec # 2 Fall 2009 9-7-2009
Example SMT vs. Superscalar Pipeline
Based on the Alpha 21164. Two extra pipeline stages are added for register read/write to account for the size increase of the register file.
• The pipeline of (a) a conventional superscalar processor and (b) that pipeline
modified for an SMT processor, along with some implications of those pipelines.
SMT-2
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,
Dean Tullsen et al. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.
EECC722 - Shaaban
#27 Lec # 2 Fall 2009 9-7-2009
Intel Hyper-Threaded (2-way SMT) P4
Processor Pipeline
Source: Intel Technology Journal , Volume 6, Number 1, February 2002.
SMT-8
EECC722 - Shaaban
#28 Lec # 2 Fall 2009 9-7-2009
Intel Hyper-Threaded (2-way SMT) P4 Out-of-order Execution Engine:
Detailed Pipeline
Source: Intel Technology Journal , Volume 6, Number 1, February 2002.
SMT-8
EECC722 - Shaaban
#29 Lec # 2 Fall 2009 9-7-2009
SMT Performance Comparison
• Instruction throughput (IPC) from simulations by Eggers et al. at the
University of Washington, using both multiprogramming and parallel
workloads:

Multiprogramming workload (IPC):
Threads   Superscalar   Traditional (Fine-grain) Multithreading   SMT
1         2.7           2.6                                       3.1
2         -             3.3                                       3.5
4         -             3.6                                       5.7
8         -             2.8                                       6.2

Parallel workload (IPC):
Threads   Superscalar   MP2   MP4   Traditional Multithreading   SMT
1         3.3           2.4   1.5   3.3                          3.3
2         -             4.3   2.6   4.1                          4.7
4         -             -     4.2   4.2                          5.6
8         -             -     -     3.5                          6.1

(MP = chip multiprocessor; MP2/MP4 = 2/4 processors.)
Multiprogramming workload = multiple single-threaded programs (multitasking).
Parallel workload = a single multi-threaded program.
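Reading off the best entries of the parallel-workload table, relative throughput over the single-threaded superscalar (a small sketch):

```python
superscalar_1_thread = 3.3           # IPC of the single-threaded superscalar

ipc_at_best = {
    "Traditional multithreading (8 threads)": 3.5,
    "MP4 (4 threads, its maximum)": 4.2,
    "SMT (8 threads)": 6.1,
}

for name, ipc in ipc_at_best.items():
    print(f"{name}: {ipc / superscalar_1_thread:.2f}x")
# Traditional MT: 1.06x (and it degrades past 4 threads);
# MP4: 1.27x; SMT: 1.85x over the 1-thread superscalar.
```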
EECC722 - Shaaban
#30 Lec # 2 Fall 2009 9-7-2009
Possible Machine Models for an 8-way Multithreaded Processor
The following machine models for a multithreaded CPU that can issue 8 instructions per cycle differ in how threads use issue slots and functional units (a sketch contrasting them follows the list):
• Fine-Grain Multithreading:
– Only one thread issues instructions each cycle, but it can use the entire issue width of the
processor. This hides all sources of vertical waste, but does not hide horizontal waste.
• SM: Full Simultaneous Issue (i.e., SM: Eight Issue):
– A completely flexible simultaneous multithreaded superscalar: all eight threads
compete for each of the 8 issue slots each cycle. This is the least realistic model in terms of
hardware complexity, but provides insight into the potential for simultaneous
multithreading. The following models each represent restrictions to this scheme that
decrease hardware complexity.
• SM: Single Issue, SM: Dual Issue, and SM: Four Issue:
– These three models limit the number of instructions each thread can issue, or have active in the
scheduling window, each cycle.
– For example, in an SM: Dual Issue processor, each thread can issue a maximum of 2 instructions
per cycle; therefore, a minimum of 4 threads would be required to fill the 8 issue slots in one cycle.
• SM: Limited Connection:
– Each hardware context is directly connected to exactly one of each type of functional unit.
– For example, if the hardware supports eight threads and there are four integer units, each integer
unit could receive instructions from exactly two threads.
– The partitioning of functional units among threads is thus less dynamic than in the other models,
but each functional unit is still shared (the critical factor in achieving high utilization).
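A minimal sketch contrasting how these models fill the 8 issue slots in one cycle; the per-thread ready-instruction counts are assumed inputs, and functional-unit availability (which SM: Limited Connection constrains) is ignored here:

```python
ISSUE_WIDTH = 8

def slots_filled(ready, per_thread_cap, fine_grain=False):
    """Issue slots filled under a given machine model."""
    if fine_grain:                       # one thread uses the whole width
        return min(max(ready), ISSUE_WIDTH)
    filled = sum(min(r, per_thread_cap) for r in ready)
    return min(filled, ISSUE_WIDTH)

ready = [2, 3, 1, 2, 0, 4, 1, 2]         # ready instructions in 8 threads

print("fine-grain MT:   ", slots_filled(ready, ISSUE_WIDTH, fine_grain=True))  # 4
print("SM: Full Issue:  ", slots_filled(ready, ISSUE_WIDTH))                   # 8
print("SM: Dual Issue:  ", slots_filled(ready, 2))                             # 8
print("SM: Single Issue:", slots_filled(ready, 1))                             # 7
```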
SMT-1
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism Dean Tullsen et al.,
Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.
EECC722 - Shaaban
#31 Lec # 2 Fall 2009 9-7-2009
Comparison of Multithreaded CPU
Models Complexity
A comparison of key hardware complexity features of the various models (H=high complexity).
The comparison takes into account:
– the number of ports needed for each register file,
– the dependence checking for a single thread to issue multiple instructions,
– the amount of forwarding logic,
– and the difficulty of scheduling issued instructions onto functional units.
SMT-1
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism Dean Tullsen et al.,
Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.
EECC722 - Shaaban
#32 Lec # 2 Fall 2009 9-7-2009
Simultaneous vs. Fine-Grain Multithreading Performance
[Figure: instruction throughput (IPC) as a function of the number of threads; workload: SPEC92. Panels (a)-(c) show the throughput by thread priority for particular models, and panel (d) shows the total throughput over all threads for each of the six machine models. The lowest segment of each bar is the contribution of the highest-priority thread to the total throughput.]
SMT-1
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism Dean Tullsen et al.,
Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.
EECC722 - Shaaban
#33 Lec # 2 Fall 2009 9-7-2009
Simultaneous Multithreading (SM) vs. Single-Chip Multiprocessing (MP)
• Results for the multiprocessor (MP) vs. simultaneous multithreading (SM) comparisons. The multiprocessor always has one functional unit of each type per processor. In most cases the SM processor has the same total number of each FU type as the MP.
SMT-1
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism Dean Tullsen et al.,
Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.
EECC722 - Shaaban
#34 Lec # 2 Fall 2009 9-7-2009
Impact of Level 1 Cache Sharing on SMT Performance
• Results for the simulated cache configurations, shown relative to the throughput
(instructions per cycle) of the 64s.64p configuration (64 KB shared instruction
cache, 64 KB private data caches).
• Notation: the caches are specified as
[total I-cache size in KB][private or shared].[D-cache size][private or shared]
For instance, 64p.64s has eight private 8 KB instruction caches (8 KB per thread)
and a shared 64 KB data cache.
• Best overall performance among the configurations considered is achieved by
64s.64s (64 KB shared instruction cache, 64 KB shared data cache).
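A tiny parser for this configuration notation (a sketch; the function name and the 8-thread default are assumptions):

```python
def parse_cache_config(spec: str, threads: int = 8):
    icache, dcache = spec.split(".")
    config = {}
    for name, part in (("I-cache", icache), ("D-cache", dcache)):
        size_kb, kind = int(part[:-1]), part[-1]
        shared = (kind == "s")
        config[name] = {
            "total_KB": size_kb,
            "shared": shared,
            # private caches split the total capacity among the threads
            "per_thread_KB": size_kb if shared else size_kb // threads,
        }
    return config

print(parse_cache_config("64s.64p"))  # shared 64 KB I; eight private 8 KB D
print(parse_cache_config("64p.64s"))  # eight private 8 KB I; shared 64 KB D
```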
SMT-1
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism Dean Tullsen et al.,
Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.
EECC722 - Shaaban
#35 Lec # 2 Fall 2009 9-7-2009
The Impact of Increased Multithreading on Some Low Level
Metrics for Base SMT Architecture
Supporting more threads may lead to more demand on hardware resources
(e.g., here the D-cache and I-cache miss rates increased substantially, so those caches need to be resized).
SMT-2
EECC722 - Shaaban
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,
Dean Tullsen et al. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.
#36 Lec # 2 Fall 2009 9-7-2009
Possible SMT Thread Instruction Fetch Scheduling Policies
(A minimal ICOUNT selection sketch follows this list.)
• Round Robin:
– Instructions from Thread 1, then Thread 2, then Thread 3, etc.
(e.g., RR.1.8: each cycle, one thread fetches up to eight instructions;
RR.2.4: each cycle, two threads fetch up to four instructions each.)
• BR-Count:
– Give highest priority to those threads that are least likely to be on a wrong path,
by counting branch instructions that are in the decode stage, the rename stage, and
the instruction queues, favoring those with the fewest unresolved branches.
• MISS-Count:
– Give priority to those threads that have the fewest outstanding data cache misses.
• ICOUNT:
– Highest priority assigned to the thread with the lowest number of instructions in the
static portion of the pipeline (decode, rename, and the instruction queues).
• IQPOSN (Instruction Queue Position):
– Give lowest priority to those threads with instructions closest to the head of
either the integer or floating-point instruction queues (the oldest instruction is at
the head of the queue).
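A minimal sketch of ICOUNT-style thread selection in its ICOUNT.2.8 flavor (the function and its inputs are illustrative; real hardware maintains the counts with per-thread counters):

```python
# Each cycle, pick the 2 threads with the fewest instructions in the
# pre-issue stages (decode, rename, instruction queues); each selected
# thread may fetch up to 8 instructions.

def icount_select(preissue_counts, threads_per_cycle=2, fetch_width=8):
    """Return [(thread_id, fetch_budget)] for this cycle's fetch."""
    ranked = sorted(range(len(preissue_counts)),
                    key=lambda t: preissue_counts[t])
    return [(t, fetch_width) for t in ranked[:threads_per_cycle]]

# Thread 2 is clogging the queues, thread 1 is nearly drained:
counts = [12, 3, 25, 9]          # instructions in decode/rename/IQs, per thread
print(icount_select(counts))     # [(1, 8), (3, 8)] -- favors unclogged threads
```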
SMT-2
EECC722 - Shaaban
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,
Dean Tullsen et al. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.
#37 Lec # 2 Fall 2009 9-7-2009
Instruction Throughput for Round-Robin Instruction Fetch Scheduling
[Figure: throughput of the round-robin variants; RR.2.8 performs best.]
Best overall instruction throughput is achieved using round robin RR.2.8
(in each cycle, two threads each fetch a block of up to 8 instructions).
SMT-2
Workload: SPEC92
EECC722 - Shaaban
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,
Dean Tullsen et al. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.
#38 Lec # 2 Fall 2009 9-7-2009
Instruction Throughput & Thread Fetch Policy
[Figure: throughput for the fetch heuristics; ICOUNT.2.8 performs best.]
All other fetch heuristics provide speedup over round robin.
Instruction count (ICOUNT.2.8) provides the most improvement:
5.3 instructions/cycle vs. 2.5 for the unmodified superscalar.
Workload: SPEC92
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,
Dean Tullsen et al. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.
ICOUNT: Highest priority assigned to thread with the lowest number of
instructions in static portion of pipeline (decode, rename, and the instruction
queues).
SMT-2
EECC722 - Shaaban
#39 Lec # 2 Fall 2009 9-7-2009
Low-Level Metrics for Round Robin 2.8 and ICOUNT 2.8
ICOUNT improves on the performance of round robin by 23%,
by reducing instruction queue (IQ) clog through selecting a better mix
of instructions to queue.
SMT-2
EECC722 - Shaaban
#40 Lec # 2 Fall 2009 9-7-2009
Possible SMT Instruction Issue Policies
• OLDEST FIRST: Issue the oldest instructions (those
deepest into the instruction queue, the default).
• OPT LAST and SPEC LAST: Issue optimistic and
speculative instructions after all others have been issued.
• BRANCH FIRST: Issue branches as early as possible in
order to identify mispredicted branches quickly.
Instruction issue bandwidth is not a bottleneck in SMT as shown above
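The issue priorities above can be viewed as sort keys over the ready-instruction list; a minimal sketch (the instruction fields and policy handling are illustrative):

```python
from dataclasses import dataclass

@dataclass
class ReadyInst:
    age: int             # cycles since entering the instruction queue
    is_branch: bool
    is_speculative: bool

def issue_order(ready, policy):
    if policy == "OLDEST_FIRST":
        key = lambda i: -i.age                      # deepest in the IQ first
    elif policy == "SPEC_LAST":
        key = lambda i: (i.is_speculative, -i.age)  # speculative after all others
    elif policy == "BRANCH_FIRST":
        key = lambda i: (not i.is_branch, -i.age)   # resolve branches early
    else:
        raise ValueError(policy)
    return sorted(ready, key=key)

ready = [ReadyInst(5, False, True), ReadyInst(2, True, False),
         ReadyInst(9, False, False)]
print([(i.age, i.is_branch) for i in issue_order(ready, "BRANCH_FIRST")])
# -> [(2, True), (9, False), (5, False)]
```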
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,
Dean Tullsen et al. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.
ICOUNT.2.8 Fetch policy used for all issue policies above
SMT-2
EECC722 - Shaaban
#41 Lec # 2 Fall 2009 9-7-2009
SMT: Simultaneous Multithreading
• Strengths:
– Overcomes the limitations imposed by low single-thread
instruction-level parallelism.
– Resource-efficient support of chip-level TLP.
– Multiple threads running will hide individual control hazards
(i.e., branch mispredictions) and other long latencies (i.e., main
memory access latency on a cache miss).
• Weaknesses:
– Additional stress placed on the memory hierarchy.
– Control unit complexity.
– Sizing of resources (caches, branch prediction, TLBs, etc.).
– Accessing registers (32 integer + 32 FP for each HW context):
• Some designs devote two clock cycles for both register reads and
register writes, resulting in a deeper pipeline.
EECC722 - Shaaban
#42 Lec # 2 Fall 2009 9-7-2009