EECC722 - Shaaban


Simultaneous Multithreading (SMT)
• An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims to reduce resource waste in wide-issue processors.
• SMT has the potential of greatly enhancing
processor computational capabilities by:
– Exploiting thread-level parallelism (TLP), simultaneously
executing instructions from different threads during the
same cycle.
– Providing multiple hardware contexts, hardware thread
scheduling and context switching capability.
EECC722 - Shaaban
#1 Lec # 2 Fall 2001 9-10-2001
SMT Issues
• SMT CPU performance gain potential.
• Modifications to Superscalar CPU architecture necessary to
support SMT.
• SMT performance evaluation vs. Fine-grain multithreading
Superscalar, Chip Multiprocessors.
• Hardware techniques to improve SMT performance:
– Optimal level one cache configuration for SMT.
– SMT thread instruction fetch, issue policies.
– Instruction recycling (reuse) of decoded instructions.
• Software techniques:
– Compiler optimizations for SMT.
– Software-directed register deallocation.
– Operating system behavior and optimization.
• SMT support for fine-grain synchronization.
• SMT as a viable architecture for network processors.
Microprocessor Architecture Trends
CISC Machines
instructions take variable times to complete
RISC Machines (microcode)
simple instructions, optimized for speed
RISC Machines (pipelined)
same individual instruction latency
greater throughput through instruction "overlap"
Superscalar Processors
multiple instructions executing simultaneously
Multithreaded Processors
VLIW
additional HW resources (regs, PC, SP) "Superinstructions" grouped together
each context gets processor for x cycles decreased HW control complexity
Single Chip Multiprocessors
duplicate entire processors
(tech soon due to Moore's Law)
SIMULTANEOUS MULTITHREADING
multiple HW contexts (regs, PC, SP)
each cycle, any context may execute
Performance Increase of Workstation-Class Microprocessors 1987-1997
[Figure: Integer SPEC92 performance vs. year.]
Microprocessor Logic Density
[Figure: transistors per chip vs. year (1970-2000) for the i4004, i8080, i8086, i80286, i80386, i80486, and Pentium, tracking Moore's Law: 2X transistors/chip every 1.5 years. Representative counts: Sparc Ultra: 5.2 million; Pentium Pro: 5.5 million; PowerPC 620: 6.9 million; Alpha 21164: 9.3 million; Alpha 21264: 15 million.]
Increase of Capacity of VLSI Dynamic RAM Chips

Year    Size (Megabit)
1980    0.0625
1983    0.25
1986    1
1989    4
1992    16
1996    64
1999    256
2000    1024

Growth: 1.55X/year, or doubling every 1.6 years.
CPU Architecture Evolution:
Single Threaded Pipeline
• Traditional 5-stage pipeline.
• Increases Throughput: Ideal CPI = 1
[Diagram: a single pipeline with stages Fetch → Decode → Execute → Memory → Writeback, one PC and SP, a single register file, and the memory hierarchy (management).]
CPU Architecture Evolution:
Superscalar Architectures
• Fetch, decode, execute, etc. more than one instruction per
cycle (CPI < 1).
• Limited by instruction-level parallelism (ILP).
[Diagram: two copies of the pipeline (instructions i and i+1), each with Fetch, Decode, Execute, Memory, and Writeback stages, sharing one PC, SP, register file, and the memory hierarchy (management).]
Superscalar Architectures:
Issue Slot Waste Classification
• Empty or wasted issue slots can be defined as either vertical waste or horizontal waste:
– Vertical waste is introduced when the processor issues no instructions in a cycle.
– Horizontal waste occurs when not all issue slots can be filled in a cycle.
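The distinction can be made concrete with a toy calculation (not from the original slides); the 4-wide machine and the per-cycle issue trace below are purely illustrative:

```python
# Classify the empty issue slots of a hypothetical 4-wide superscalar as
# vertical or horizontal waste. `issued` is a per-cycle issue-count trace.
ISSUE_WIDTH = 4

def classify_waste(issued):
    vertical = horizontal = 0
    for n in issued:
        empty = ISSUE_WIDTH - n
        if n == 0:
            vertical += empty      # no instruction issued at all this cycle
        else:
            horizontal += empty    # cycle only partially filled
    return vertical, horizontal

# Example trace: cycles issuing 4, 2, 0, 1, 0, 3 instructions.
v, h = classify_waste([4, 2, 0, 1, 0, 3])
print(v, h)  # 8 6
```

Of the 24 issue slots in this trace, 8 are vertical waste (the two fully empty cycles) and 6 are horizontal waste.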
Sources of Unused Issue Cycles in an 8-issue Superscalar Processor
Processor busy represents the utilized issue slots; all others represent wasted issue slots. 61% of the wasted cycles are vertical waste; the remainder is horizontal waste.
Workload: SPEC92 benchmark suite.
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.
Superscalar Architectures:
All possible causes of wasted issue slots, and latency-hiding or latency reducing
techniques that can reduce the number of cycles wasted by each cause.
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.
Advanced CPU Architectures:
Fine-grain or Traditional
Multithreaded Processors
• Multiple HW contexts (PC, SP, and registers).
• One context gets CPU for x cycles at a time.
• Limited by thread-level parallelism (TLP):
– Can reduce some of the vertical issue slot waste.
– No reduction in horizontal issue slot waste.
• Example Architectures: HEP, Tera.
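A toy model (illustrative only; the traces and 4-wide width are assumptions, not from HEP or Tera) shows why this style of multithreading removes vertical but not horizontal waste: a stalled context's cycle is given to another thread, but each cycle still issues from a single thread.

```python
# Toy model of fine-grain multithreading: each cycle exactly one hardware
# context issues; a stalled context is skipped in round-robin order.
# Per-thread traces give the number of instructions that thread could
# issue each cycle on a hypothetical 4-wide machine (0 = stalled).

def fine_grain_issue(traces, width=4):
    n_threads = len(traces)
    cycles = max(len(t) for t in traces)
    issued = []
    turn = 0
    for c in range(cycles):
        chosen = 0                           # all threads stalled -> vertical waste
        for k in range(n_threads):           # pick next ready context
            t = (turn + k) % n_threads
            if c < len(traces[t]) and traces[t][c] > 0:
                chosen = min(traces[t][c], width)
                turn = (t + 1) % n_threads
                break
        issued.append(chosen)
    return issued

# Thread 0 stalls in cycles 1-2; thread 1 fills those cycles (vertical
# waste removed), but partially filled cycles remain (horizontal waste stays).
print(fine_grain_issue([[4, 0, 0, 2], [2, 3, 1, 2]]))  # [4, 3, 1, 2]
```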
Advanced CPU Architectures:
VLIW: Intel/HP
Explicitly Parallel Instruction Computing
(EPIC)
• Strengths:
– Allows for a high level of instruction parallelism (ILP).
– Takes a lot of the dependency analysis out of HW and places
focus on smart compilers.
• Weakness:
– Limited by instruction-level parallelism (ILP) in a single thread.
– Keeping Functional Units (FUs) busy (control hazards).
– Static FU scheduling limits performance gains.
Advanced CPU Architectures:
Single Chip Multiprocessor
• Strengths:
– Create a single processor block and duplicate.
– Takes a lot of the dependency analysis out of HW and
places focus on smart compilers.
• Weakness:
– Performance limited by individual thread performance
(ILP).
Advanced CPU Architectures:
Single Chip Multiprocessor
[Diagram: n complete processors duplicated on one chip. Each processor i has its own register file, PC, SP, control unit, and two-way superscalar pipeline; all share the memory hierarchy (management).]
SMT: Simultaneous Multithreading
• Multiple Hardware Contexts running at the same
time (HW context: registers, PC, and SP).
• Avoids both horizontal and vertical waste by having
multiple threads keeping functional units busy
during every cycle.
• Builds on top of current time-proven advancements
in CPU design: superscalar, dynamic scheduling,
hardware speculation, dynamic HW branch
prediction.
• Enabling Technology: VLSI logic density on the order of hundreds of millions of transistors per chip.
SMT
• With multiple threads running, penalties from long-latency operations, cache misses, and branch mispredictions can be hidden:
– Reduction of both horizontal and vertical waste and thus
improved Instructions Issued Per Cycle (IPC) rate.
• Pipelines are separated until issue stage.
• Functional units are shared among all contexts during every
cycle:
– More complicated writeback stage.
• More threads issuing to functional units results in higher
resource utilization.
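The per-cycle mixing of threads can be sketched with the same kind of toy traces (illustrative assumptions, not measured data): in SMT, instructions from all ready contexts compete for the issue slots each cycle, so both vertical and horizontal waste shrink.

```python
# Toy SMT issue model: every cycle, instructions from ALL contexts may
# fill the issue slots of a hypothetical 4-wide machine.
def smt_issue(traces, width=4):
    cycles = max(len(t) for t in traces)
    issued = []
    for c in range(cycles):
        slots = width
        total = 0
        for t in traces:                     # every context may issue
            take = min(slots, t[c] if c < len(t) else 0)
            total += take
            slots -= take
        issued.append(total)
    return issued

# Same traces as before: SMT packs both threads' work into each cycle,
# filling slots that a single thread would leave empty.
print(smt_issue([[4, 0, 0, 2], [2, 3, 1, 2]]))  # [4, 3, 1, 4]
```

Compare the last cycle with the fine-grain case: SMT issues from both threads in the same cycle, recovering horizontal waste as well.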
SMT: Simultaneous Multithreading
[Diagram: n hardware contexts, each with its own register file, PC, and SP, sharing a chip-wide control unit, the two-way superscalar pipelines, and the memory hierarchy (management).]
The Power Of SMT
[Diagram: instruction issue slots over time (processor cycles) for a Superscalar, a Traditional Multithreaded processor, and Simultaneous Multithreading. Rows of squares represent instruction issue slots; a box with number x is an instruction issued from thread x; an empty box is a wasted slot. The superscalar leaves many slots empty, the traditional multithreaded processor fills whole cycles from one thread at a time, and SMT may fill each cycle's slots with instructions from several of the threads 1-5.]
SMT Performance Example
Inst  Code             Description       Functional unit
A     LUI  R5,100      R5 = 100          Int ALU
B     FMUL F1,F2,F3    F1 = F2 x F3      FP ALU
C     ADD  R4,R4,8     R4 = R4 + 8       Int ALU
D     MUL  R3,R4,R5    R3 = R4 x R5      Int mul/div
E     LW   R6,R4       R6 = (R4)         Memory port
F     ADD  R1,R2,R3    R1 = R2 + R3      Int ALU
G     NOT  R7,R7       R7 = !R7          Int ALU
H     FADD F4,F1,F2    F4 = F1 + F2      FP ALU
I     XOR  R8,R1,R7    R8 = R1 XOR R7    Int ALU
J     SUBI R2,R1,4     R2 = R1 - 4       Int ALU
K     SW   ADDR,R2     (ADDR) = R2       Memory port

Functional units: 4 integer ALUs (1-cycle latency), 1 integer multiplier/divider (3-cycle latency), 3 memory ports (2-cycle latency, assume cache hit), 2 FP ALUs (5-cycle latency). Assume all functional units are fully pipelined.
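The dependences and latencies above are enough to sketch how the superscalar schedule comes out. The greedy scheduler below is an illustrative reconstruction, not the slide's exact table; functional-unit counts are not modeled because they happen not to constrain this particular program:

```python
# Greedy earliest-issue scheduler for the example program, assuming a
# 4-wide issue and the functional-unit latencies listed above.
LAT = {"int": 1, "mul": 3, "mem": 2, "fp": 5}

# (name, unit, instructions whose results this one consumes)
PROG = [
    ("A", "int", []),         ("B", "fp",  []),
    ("C", "int", []),         ("D", "mul", ["A", "C"]),
    ("E", "mem", ["C"]),      ("F", "int", ["D"]),
    ("G", "int", []),         ("H", "fp",  ["B"]),
    ("I", "int", ["F", "G"]), ("J", "int", ["F"]),
    ("K", "mem", ["J"]),
]

WIDTH = 4
done = {}                                   # name -> cycle its result is ready
issue_cycle = {}
per_cycle = {}
for name, unit, deps in PROG:               # program order = priority order
    c = max([done[d] for d in deps], default=1)
    while per_cycle.get(c, 0) >= WIDTH:     # honor the 4-wide issue limit
        c += 1
    issue_cycle[name] = c
    per_cycle[c] = per_cycle.get(c, 0) + 1
    done[name] = c + LAT[unit]

cycles = max(issue_cycle.values())
print(cycles, round(len(PROG) / cycles, 2))  # 7 1.57
```

The 11 instructions complete in 7 cycles, matching the slide's superscalar throughput of 1.57 IPC.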
SMT Performance Example (continued)
[Table: cycle-by-cycle issue slots over 9 cycles; the exact slot layout is not fully recoverable from the extracted text. The 4-slot superscalar issues the 11 instructions A-K of one program in 7 cycles. The SMT processor runs two copies of the program (threads T1 and T2), interleaving instructions such as T1.LUI (A) and T2.LUI (A) in the same cycles, completing all 22 instructions in 9 cycles; program 2 needs 2 additional cycles to complete.]
• Throughput:
– Superscalar: 11 inst / 7 cycles = 1.57 IPC
– SMT: 22 inst / 9 cycles = 2.44 IPC
Changes to Superscalar CPUs
Necessary to support SMT
• Multiple program counters and some mechanism by which one fetch unit selects one each cycle (thread instruction fetch policy).
• A separate return stack for each thread for predicting subroutine return destinations.
• Per-thread instruction retirement, instruction queue flush, and trap mechanisms.
• A thread id with each branch target buffer entry to avoid predicting phantom branches.
• A larger register file, to support logical registers for all threads plus additional registers for register renaming (may require additional pipeline stages).
• A higher available main memory fetch bandwidth may be required.
• Improved cache to offset the cache performance degradation due to cache sharing among the threads and the resulting reduced locality.
– e.g., private per-thread vs. shared L1 cache.
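The enlarged register file can be sketched as per-thread rename maps drawing from one shared pool of physical registers. This is a minimal illustration; the sizes and names below are assumptions, not from any real design:

```python
# Sketch of SMT register renaming: each hardware context has its own
# rename map, but all maps allocate from one shared physical register file.
N_PHYSICAL = 100        # illustrative: 32 logical regs x 2 threads + renaming extras

free_list = list(range(N_PHYSICAL))
rename_map = {}          # (thread_id, logical_reg) -> physical_reg

def rename_dest(thread_id, logical_reg):
    """Allocate a fresh physical register for a destination write."""
    phys = free_list.pop(0)
    rename_map[(thread_id, logical_reg)] = phys
    return phys

# Thread 0 and thread 1 both write logical R5; they receive different
# physical registers, so the threads cannot interfere through R5.
p0 = rename_dest(0, 5)
p1 = rename_dest(1, 5)
print(p0, p1, p0 != p1)  # 0 1 True
```

Keying the map by (thread id, logical register) is what lets one renamer serve every context without cross-thread aliasing.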
A Base SMT hardware Architecture.
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,
Dean Tullsen et al. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.
Example SMT Vs. Superscalar Pipeline
• The pipeline of (a) a conventional superscalar processor and (b) that pipeline modified for an SMT processor, along with some implications of those pipelines.
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,
Dean Tullsen et al. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.
SMT Performance Comparison
• Instruction throughput (IPC) from simulations by Eggers et al. at The University of Washington, using both multiprogramming and parallel workloads:

Multiprogramming workload
Threads   Superscalar   Traditional Multithreading   SMT
1         2.7           2.6                          3.1
2         -             3.3                          3.5
4         -             3.6                          5.7
8         -             2.8                          6.2

Parallel workload
Threads   Superscalar   MP2   MP4   Traditional Multithreading   SMT
1         3.3           2.4   1.5   3.3                          3.3
2         -             4.3   2.6   4.1                          4.7
4         -             -     4.2   4.2                          5.6
8         -             -     -     3.5                          6.1
Simultaneous Vs. Fine-Grain Multithreading Performance
Instruction throughput as a function of the number of threads. (a)-(c) show the throughput by thread
priority for particular models, and (d) shows the total throughput for all threads for each of the six
machine models. The lowest segment of each bar is the contribution of the highest priority thread to the
total throughput.
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.
Simultaneous Multithreading Vs. Single-Chip
Multiprocessing
• Results for the multiprocessor (MP) vs. simultaneous multithreading (SM) comparisons. The multiprocessor always has one functional unit of each type per processor. In most cases the SM processor has the same total number of each FU type as the MP.
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.
Impact of Level 1 Cache Sharing on SMT Performance
• Results for the simulated cache configurations, shown relative to the throughput (instructions per cycle) of the 64s.64p configuration.
• The caches are specified as:
[total I cache size in KB][private or shared].[D cache size][private or shared]
For instance, 64p.64s has eight private 8 KB I caches and a shared 64 KB data cache.
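The naming convention can be captured in a few lines; this parser is a sketch written for this document (the study itself used no such tool), assuming the eight-context machine described above:

```python
# Parse the cache-configuration names used in the study, e.g. "64p.64s":
# [total I-cache KB][p=private | s=shared].[total D-cache KB][p | s].
def parse_cache_config(name, n_threads=8):
    i_part, d_part = name.split(".")
    configs = []
    for part, kind in ((i_part, "I"), (d_part, "D")):
        total_kb, sharing = int(part[:-1]), part[-1]
        if sharing == "s":
            configs.append((kind, "shared", total_kb))
        else:  # private: total capacity divided among the thread contexts
            configs.append((kind, "private", total_kb // n_threads))
    return configs

print(parse_cache_config("64p.64s"))
# [('I', 'private', 8), ('D', 'shared', 64)]
```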
Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.
SMT Thread Instruction Fetch Scheduling Policies
• Round Robin:
– Instruction from Thread 1, then Thread 2, then Thread 3, etc. (e.g., RR.1.8: each cycle one thread fetches up to eight instructions; RR.2.4: each cycle two threads fetch up to four instructions each).
• BR-Count:
– Give highest priority to those threads that are least likely to be on a wrong path, by counting branch instructions that are in the decode stage, the rename stage, and the instruction queues, favoring those with the fewest unresolved branches.
• MISS-Count:
– Give priority to those threads that have the fewest outstanding data cache misses.
• I-Count:
– Highest priority assigned to the thread with the lowest number of instructions in the static portion of the pipeline (decode, rename, and the instruction queues).
• IQPOSN:
– Give lowest priority to those threads with instructions closest to the head of either the integer or floating-point instruction queues (the oldest instruction is at the head of the queue).
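The I-Count heuristic reduces to a one-line selection; the function name and the in-flight counts below are illustrative, not from the paper:

```python
# Sketch of the I-Count fetch policy: each cycle the fetch unit picks the
# thread with the fewest instructions in the pre-issue pipeline stages
# (decode, rename, and the instruction queues).
def icount_pick(inflight):
    """inflight: {thread_id: instructions in decode/rename/queues}."""
    return min(inflight, key=inflight.get)

# Thread 2 has the fewest instructions clogging the front end, i.e. it is
# making the fastest progress, so it gets fetch priority this cycle.
print(icount_pick({0: 12, 1: 7, 2: 3, 3: 9}))  # 2
```

Favoring the thread that drains its instructions fastest keeps the queues balanced and avoids fetching down a stalled thread's path.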
Instruction throughput & Thread Fetch Policy
Possible SMT Instruction Issue Policies
• OLDEST FIRST: Issue the oldest instructions (those
deepest into the instruction queue).
• OPT LAST and SPEC LAST: Issue optimistic and
speculative instructions after all others have been issued.
• BRANCH FIRST: Issue branches as early as possible in
order to identify mispredicted branches quickly.
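OLDEST FIRST can be sketched as a sort over ready instructions; the helper name and the (sequence number, name) encoding are assumptions for illustration:

```python
# Sketch of the OLDEST FIRST issue policy: among ready instructions, issue
# those deepest in the instruction queue (lowest sequence number), up to
# the issue width.
def oldest_first(ready, width=4):
    """ready: list of (sequence_number, name) tuples for ready instructions."""
    return [name for _, name in sorted(ready)[:width]]

ready = [(17, "add"), (3, "mul"), (9, "lw"), (25, "sub"), (5, "xor")]
print(oldest_first(ready))  # ['mul', 'xor', 'lw', 'add']
```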
Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,
Dean Tullsen et al. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.
Simulator (sim-SMT) @ RIT CE
• Execution-driven performance simulator.
• Derived from the SimpleScalar tool set.
• Simulates cache, branch prediction, and five pipeline stages.
• Flexible:
– Configuration file controls cache size, buffer sizes, and number of functional units.
• Cross compiler used to generate SimpleScalar assembly language.
• Binary utilities, compiler, and assembler available.
• Standard C library (libc) has been ported.
Simulator Memory Address Space
Alternate Functional Unit
Configurations
• New functional unit configurations were attempted (by adding one of each type of FU):
– +1 integer multiplier/divider:
• +2.8% IPC and issue rate.
• 74% fewer occurrences of no FU being available.
• The simulator proved very flexible (only one line in the configuration file required change).
Sim-SMT Simulator Limitations
• Does not keep precise exceptions.
• System call instructions are not tracked.
• Limited memory space:
– Four test programs' memory spaces run within one simulator memory space.
– Easy to run out of stack space.
Simulation Runs & Results
• Test programs used:
– Newton interpolation.
– Matrix solver using LU decomposition.
– Integer test program.
– FP test program.
• Simulations of a single program:
– 1, 2, and 4 threads.
• System simulations involve a combination of all programs running simultaneously:
– Several different combinations were run.
• From simulation results:
– Performance increase:
• Biggest increase occurs when changing from one to two threads.
– Higher issue rate and functional unit utilization.
Simulation Results (figures):
• Performance (IPC)
• Simulation Time
• Instruction Issue Rate
• Performance Vs. Issue BW
• Functional Unit Utilization
• No Functional Unit Available
• Horizontal Waste Rate
• Vertical Waste Rate
SMT: Simultaneous Multithreading
• Strengths:
– Overcomes the limitations imposed by low single thread
instruction-level parallelism.
– Multiple threads running will hide individual control
hazards (branch mispredictions).
• Weaknesses:
– Additional stress placed on the memory hierarchy.
– Control unit complexity.
– Sizing of resources (cache, branch prediction, etc.).
– Accessing registers (32 integer + 32 FP for each HW context):
• Some designs devote two clock cycles each to register reads and register writes.
SMT: Simultaneous Multithreading
Kernel Code
• Many, if not all, benchmarks are based upon limited interaction with kernel code.
• How can the kernel overhead (context switching, process management, etc.) be minimized?
– CHAOS (Context Hardware Accelerated Operating System).
• Introduce a lightweight dedicated kernel context to handle process management:
– When there are 4 contexts, there is a good chance that one of them will continue to run; why take an (expensive) chance on swapping it out when it will be brought right back in by the swapper (process management)?
SMT & Technology
• SMT architecture has not yet been implemented in any commercial microprocessor (first announced 4-thread SMT CPU: Alpha EV8, ~2001).
• Current technology has the potential for 4-8
simultaneous threads:
– Based on transistor count and design
complexity.
RIT-CE SMT Project Goals
• Investigate performance gains from exploiting Thread-Level
Parallelism (TLP) in addition to current Instruction-Level
Parallelism (ILP) in processor design.
• Design and simulate an architecture incorporating
Simultaneous Multithreading (SMT).
• Study operating system and compiler modifications needed
to support SMT processor architectures.
• Define a standard interface for efficient SMT-processor/OS
kernel interaction.
• Modify an existing OS kernel (Linux?) to take advantage of
hardware multithreading capabilities.
• Long term: VLSI implementation of an SMT prototype.
Current Project Status
• Architecture/OS interface definition.
• Study of design alternatives and impact on
performance.
• SMT Simulator Development:
– System call development, kernel support,
and compiler/assembler changes.
• Development of code (programs and OS kernel)
is key to getting results.
Short-Term Project Chart
[Diagram: toolchain flow for the SMT kernel simulation — Compiler (simply a hacked version of gcc, using the assembler from the host system) → Linker/Loader → Simulator (represents hardware with a kernel context) → Simulation Results (running program). A System Call Proxy (OS specific) and Kernel Code (Memory Management, Process Management) provide the thread that will be held in the HW kernel context.]
Current/Future Project Goals
• SMT simulator completion, refinement, and further testing.
• Development of an SMT-capable OS kernel.
• Extensive performance studies with various workloads
using the simulator/OS/compiler:
– Suitability for fine-grained parallel applications?
– Effect on multimedia applications?
• Architectural changes based on benchmarks.
• Cache impact on SMT performance investigation.
• Investigation of an in-order SMT processor (C or VHDL model).
• MOSIS Tiny Chip (partial/full) implementation.
• Investigate the suitability of SMT processors as building
blocks for MPPs.