CS20092 Coarse Grain Multithreading


Hardware Multithreading
COMP25212
Increasing CPU Performance
• By increasing clock frequency – pipelining
• By increasing Instructions per Clock – superscalar
• Minimizing memory access impact – caches
• Maximizing pipeline utilization – branch prediction
• Maximizing pipeline utilization – forwarding
• Maximizing instruction issue – dynamic scheduling
Increasing Parallelism
• Amount of parallelism that we can exploit is
limited by the programs
– Some areas exhibit great parallelism
– Some others are essentially sequential
• In the latter case, where can we find
additional independent instructions?
– In a different program!
Software Multithreading - Revision
• Modern Operating Systems allow several
processes/threads to run concurrently
• Transparent to the user – all of them appear
to be running at the same time
• BUT, actually, they are scheduled (and
interleaved) by the OS
OS Thread Switching - Revision
[Diagram: OS thread switching. Thread T0 executes while Thread T1 waits; on a
context switch the OS saves the running thread's state into its PCB (e.g. save
state into PCB0) and loads the next thread's state from its PCB (e.g. load
state from PCB1), after which T1 executes while T0 waits; the next switch
reverses the roles. (COMP25111 – Lect. 5)]
Process Control Block (PCB) - Revision
PCBs store information about the state of ‘alive’
processes handled by the OS
• Process ID
• Process State
• PC
• Stack Pointer
• General Registers
• Memory Management Info
• Open File List, with positions
• Network Connections
• CPU time used
• Parent Process ID
OS Process States - Revision
[Diagram: OS process state machine. New → Ready (waiting for a CPU) → Running
(on a CPU). A Running process that waits (e.g. for I/O) becomes Blocked
(waiting for an event) and returns to Ready when the event occurs; a Running
process may also become Terminated. (COMP25111 – Lect. 5)]
Hardware Multithreading
• Allow multiple threads to share a single
processor
• Requires replicating the independent state of
each thread
• Virtual memory can be used to share memory
among threads
CPU Support for Multithreading
[Diagram: a pipeline extended for two hardware threads. The fetch, decode,
execute, memory and write-back logic, the instruction and data caches, and the
address translation unit are shared; the PC (PCA, PCB), the register file
(RegA, RegB) and the virtual address mappings (VA MappingA, VA MappingB) are
replicated per thread.]
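The replicated-versus-shared split described above can be sketched in software. This is a minimal sketch using hypothetical names (`ThreadContext`, `Core`), not a real hardware description: the per-thread state (PC, registers, VA mapping) is replicated, while the caches are shared by all hardware threads.

```python
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    """State replicated per hardware thread."""
    pc: int = 0
    registers: list = field(default_factory=lambda: [0] * 32)
    page_table_base: int = 0          # each thread has its own VA mapping

@dataclass
class Core:
    """Resources shared among the hardware threads."""
    contexts: list                    # one ThreadContext per hardware thread
    icache: dict = field(default_factory=dict)   # shared instruction cache
    dcache: dict = field(default_factory=dict)   # shared data cache

core = Core(contexts=[ThreadContext(), ThreadContext()])
core.contexts[0].pc = 0x1000          # advancing one thread's PC...
assert core.contexts[1].pc == 0       # ...does not disturb the other thread
```

Because the caches live in `Core` rather than in `ThreadContext`, both threads compete for the same cache lines, which is exactly what makes the sharing issues on the next slide possible.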
Hardware Multithreading Issues
• How HW MT is presented to the OS
– Normally each hardware thread is presented as a
virtual processor (Linux, UNIX, Windows)
– Requires multiprocessor support from the OS
• Resources need to be shared or replicated
– Registers – normally replicated
– Caches – normally shared
• Each thread will use a fraction of the cache
• Cache thrashing issues – harm performance
Example of Thrashing - Revision
Direct-mapped cache: every address below maps to the same index (line 13D)
but carries a different tag, so each access evicts the other thread's line.

Thread A access   Thread B access   Result   Action taken   Line 13D tag
(initially)                                                  Invalid
0x075A13D0                          MISS     Load 0x075A    0x075A
                  0x018313D4        MISS     Load 0x0183    0x0183
0x075A13D4                          MISS     Load 0x075A    0x075A
                  0x018313D8        MISS     Load 0x0183    0x0183
0x075A13D8                          MISS     Load 0x075A    0x075A
                  0x018313DC        MISS     Load 0x0183    0x0183
Hardware Multithreading
• Different ways to exploit this new source of
parallelism
• When & how to switch threads?
– Coarse-grain Multithreading
– Fine-grain Multithreading
– Simultaneous Multithreading
Coarse-Grain Multithreading
Coarse-Grain Multithreading
• Issue instructions from a single thread
• Operate like a simple pipeline
• Switch Thread on “expensive” operation:
– E.g. I-cache miss
– E.g. D-cache miss
Switch Threads on Icache miss
Cycle     1    2    3        4    5    6    7
Inst a    IF   ID   EX       MEM  WB
Inst b         IF   ID       EX   MEM  WB
Inst c              IF-MISS  -    -    -    -
Inst X                       IF   ID   EX   MEM
Inst Y                            IF   ID   EX
Inst Z                                 IF   ID

• Remove Inst c and switch to ‘grey’ thread
• ‘Grey’ thread will continue its execution until
there is another I-cache or D-cache miss
Switch Threads on Dcache miss
Cycle     1    2    3    4       5     6     7
Inst a    IF   ID   EX   M-MISS  MISS  MISS  MISS
Inst b         IF   ID   EX      -     -     -
Inst c              IF   ID      -     -     -
Inst d                   IF      -     -     -
Inst X                           IF    ID    EX
Inst Y                                 IF    ID

• Remove Inst a and switch to ‘grey’ thread
– Abort the issued instructions (Inst b, c, d) in the
shadow of the miss from the ‘white’ thread
– Roll back the ‘white’ PC to point to Inst a
Coarse Grain Multithreading
• Good to compensate for infrequent, but
expensive pipeline disruptions
• Minimal pipeline changes
– Need to abort all the instructions in the “shadow” of
a D-cache miss → overhead
– Swap instruction streams
• Data/control hazards are not solved
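The switch-on-miss policy above can be expressed as a toy simulator. This is an illustrative model under simplifying assumptions (one instruction per cycle, instantaneous switching, a fixed miss penalty), not real hardware timing; the function name and thread description format are made up for the sketch.

```python
def run_coarse_grain(threads, miss_penalty=20):
    """Toy coarse-grain multithreading model.

    threads: list of (instruction_count, {1-based indices that miss}).
    Returns the total cycle count to finish every thread.
    """
    n = len(threads)
    pc = [0] * n                       # instructions completed per thread
    pending = [set(m) for _, m in threads]
    ready_at = [0] * n                 # cycle at which a thread's miss resolves
    cycle, cur = 0, 0
    while any(pc[t] < threads[t][0] for t in range(n)):
        runnable = [t for t in range(n)
                    if pc[t] < threads[t][0] and ready_at[t] <= cycle]
        if not runnable:               # every thread is stalled on a miss
            cycle += 1
            continue
        cur = cur if cur in runnable else runnable[0]
        if pc[cur] + 1 in pending[cur]:
            # "expensive" operation: start servicing the miss, switch away
            pending[cur].discard(pc[cur] + 1)
            ready_at[cur] = cycle + miss_penalty
            continue
        cycle += 1                     # execute one instruction this cycle
        pc[cur] += 1
    return cycle

# A lone thread pays the full penalty; with a second thread, one thread's
# miss latency is hidden behind the other thread's execution.
assert run_coarse_grain([(10, {5})]) == 30
assert run_coarse_grain([(100, {30}), (100, {70})]) == 200
```

Under this model the two-thread workload finishes in 200 cycles rather than 240, because each miss is fully overlapped with the other thread's instructions.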
Fine-Grain Multithreading
Fine-Grain Multithreading
• Interleave the execution of several threads
• Usually using Round Robin among all the
ready hardware threads
• Requires instantaneous thread switching
– Complex hardware
Fine-Grain Multithreading
• Multithreading helps alleviate fine-grain
dependencies (e.g. those otherwise needing forwarding)

Cycle     1    2    3    4    5    6    7
Inst a    IF   ID   EX   MEM  WB
Inst M         IF   ID   EX   MEM  WB
Inst b              IF   ID   EX   MEM  WB
Inst N                   IF   ID   EX   MEM
Inst c                        IF   ID   EX
Inst P                             IF   ID
I-cache misses in Fine Grain Multithreading
• An I-cache miss is overcome transparently

Cycle     1    2    3        4    5    6    7
Inst a    IF   ID   EX       MEM  WB
Inst M         IF   ID       EX   MEM  WB
Inst b              IF-MISS  -    -    -    -
Inst N                       IF   ID   EX   MEM
Inst P                            IF   ID   EX
Inst Q                                 IF   ID

Inst b is removed and ‘white’ is marked as not ‘ready’.
‘White’ thread is not ready, so ‘grey’ is executed.
D-cache misses in Fine Grain Multithreading
• Mark the thread as not ‘ready’ and issue only
from the other thread(s)

Cycle     1    2    3    4       5     6     7
Inst a    IF   ID   EX   M-MISS  Miss  Miss  WB
Inst M         IF   ID   EX      MEM   WB
Inst b              IF   ID      -     -     -
Inst N                   IF      ID    EX    MEM
Inst P                           IF    ID    EX
Inst Q                                 IF    ID

‘White’ is marked as not ‘ready’; Inst b is removed and the PC updated.
‘White’ thread is not ready, so ‘grey’ is executed.
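The ‘ready’ marking above amounts to a simple per-cycle selector. A hedged sketch (illustrative function name and data layout, not real scheduling hardware): each cycle, pick the next ready hardware thread in round-robin order, skipping any thread stalled on a miss.

```python
def next_thread(last, ready):
    """Round-robin over the ready hardware threads.

    last:  index of the thread that issued last cycle.
    ready: list of booleans, False while a thread is stalled on a miss.
    Returns the thread to issue from this cycle, or None if all are stalled.
    """
    n = len(ready)
    for step in range(1, n + 1):
        t = (last + step) % n          # try threads in circular order
        if ready[t]:
            return t
    return None                        # every thread is waiting on a miss

ready = [True, True, False]            # thread 2 is stalled on a D-cache miss
assert next_thread(0, ready) == 1      # normal round-robin step
assert next_thread(1, ready) == 0      # the stalled thread 2 is skipped
```

Because the choice must be made every cycle, real implementations do this in dedicated selection logic, which is part of the hardware cost mentioned above.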
Fine Grain Multithreading in Out-of-order processors
• In an out-of-order processor we may continue
issuing instructions from both threads
– Unless the O-o-O algorithm stalls one of the threads

[Diagram: the same two-thread sequence on an out-of-order pipeline (IF, RO =
reorder/issue, EX, MEM, WB). Inst a from the ‘white’ thread misses in the
D-cache, but instructions from both threads continue to be fetched and issued;
dependent instructions simply wait in the reorder stage, shown as (RO), until
they can execute.]
Fine Grain Multithreading
• Utilization of pipeline resources increased, i.e.
better overall performance
• Impact of short stalls is alleviated by executing
instructions from other threads
• Single-thread execution is slowed
• Requires an instantaneous thread switching
mechanism
– Expensive in terms of hardware
Simultaneous Multi-Threading
Simultaneous Multi-Threading
• The main idea is to exploit instruction-level
parallelism and thread-level parallelism at the
same time
• In a superscalar processor, instructions from
different threads are issued in the same cycle
• Instructions from different threads can be
using the same stage of the pipeline
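The idea of filling the superscalar issue slots from several threads in one cycle can be sketched as follows. This is a toy model under stated assumptions: per-thread instruction queues and a 2-wide issue width are illustrative choices, not a description of any real SMT implementation.

```python
from collections import deque

def smt_issue(queues, width=2):
    """Fill up to `width` issue slots this cycle, mixing threads.

    queues: one deque of pending instructions per hardware thread.
    Returns a list of (thread_index, instruction) pairs issued this cycle.
    """
    issued = []
    t = 0
    while len(issued) < width and any(queues):
        if queues[t]:                       # thread t has a ready instruction
            issued.append((t, queues[t].popleft()))
        t = (t + 1) % len(queues)           # rotate to the next thread
    return issued

qs = [deque(["a", "b"]), deque(["M", "N"])]
assert smt_issue(qs) == [(0, "a"), (1, "M")]   # both threads share one cycle
assert smt_issue(qs) == [(0, "b"), (1, "N")]
```

When one thread's queue is empty (e.g. it is stalled on a miss), the other thread takes all the slots, which is why SMT keeps the issue width better utilised than the single-thread schemes.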
Simultaneous Multi-Threading
[Diagram: SMT pipeline occupation over cycles 1–10. In each cycle,
instructions from the same thread (Inst a, b, then c, d, e) and from a
different thread (Inst M, N, then P, Q, R) are issued together, so
instructions from different threads occupy the same pipeline stages
(IF, ID, EX, MEM, WB) side by side in the same cycle.]
SMT issues
• Asymmetric pipeline stall (from superscalar)
– One part of the pipeline stalls – we want the other
pipeline to continue
• Overtaking – want non-stalled threads to
make progress
• Existing implementations build on O-o-O,
register-renamed architectures (similar to Tomasulo)
– e.g. Intel Hyper-Threading
SMT: Glimpse into the Future
• Scout threads
– A thread to prefetch memory – reduce cache miss
overhead
• Speculative threads
– Allow a thread to execute speculatively well past a
branch/jump/call/miss/etc.
– Needs revised O-o-O logic
– Needs extra memory support
Simultaneous Multi Threading
• Extracts the most parallelism from instructions
and threads
• Implemented only in out-of-order processors
because they are the only ones able to exploit
that much parallelism
• Has a significant hardware overhead
Example
Suppose we want to execute 2 programs with 100 instructions each. The first
program suffers an I-cache miss at instruction #30, and the second program another
at instruction #70. Assume that:
+ There is enough parallelism to execute all instructions independently
(no hazards)
+ Switching threads can be done instantaneously
+ A cache miss requires 20 cycles to get the instruction into the cache
+ The two programs would not interfere with each other’s cache lines
Calculate the execution time observed by each of the programs (cycles elapsed
between the execution of the first and the last instruction of that application) and
the total time to execute the workload:
a) Sequentially (no multithreading)
b) With coarse-grain multithreading
c) With fine-grain multithreading
d) With 2-way simultaneous multithreading
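As a sanity check for part (a), the sequential case follows directly from the stated assumptions (one instruction per cycle, and a 20-cycle stall per miss since a single-threaded machine has nothing else to run). The multithreaded cases (b)–(d) depend on exactly when the switch logic reacts, so only the sequential arithmetic is sketched here.

```python
# Part (a), sequential execution, under the slide's stated assumptions:
# one instruction per cycle plus a 20-cycle stall for each program's miss.
INSTRUCTIONS = 100
MISS_PENALTY = 20

time_program_1 = INSTRUCTIONS + MISS_PENALTY   # 120 cycles observed
time_program_2 = INSTRUCTIONS + MISS_PENALTY   # 120 cycles observed
total_time = time_program_1 + time_program_2   # 240 cycles for the workload

assert (time_program_1, time_program_2, total_time) == (120, 120, 240)
```

With multithreading, each program's miss can be overlapped with the other program's instructions, so the totals for (b)–(d) should come out below 240 cycles; the exact figures depend on the switching model assumed.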
Summary of Hardware
Multithreading
Benefits of Hardware Multithreading
• Multithreading techniques improve the
utilisation of processor resources and, hence,
the overall performance
• If the different threads are accessing the same
input data they may be using the same
regions of memory
– Cache efficiency improves in these cases
Disadvantages of Hardware Multithreading
• The single-thread performance may be degraded
compared with a single-threaded CPU
– Multiple threads interfering with each other
• Shared caches mean that, effectively, each thread
uses only a fraction of the whole cache
– Thrashing may exacerbate this issue
• Thread scheduling at the hardware level adds high
complexity to processor design
– Thread state, managing priorities, OS-level information, …
Multithreading Summary
• A cost-effective way of finding additional
parallelism for the CPU pipeline
• Available in x86, Itanium, Power and SPARC
• Presents each additional hardware thread as an
additional virtual CPU to the Operating System
• Operating Systems Beware!!! (why?)
Comparison of Multithreading Techniques – 4-way superscalar

[Diagram: issue-slot occupation over time in a 4-way superscalar machine, for
four independent threads (A, B, C, D) and then under each technique.
Coarse-grain multithreading fills the slots from one thread at a time and
switches on an expensive stall; fine-grain multithreading interleaves the
threads cycle by cycle; SMT mixes instructions from several threads within the
same cycle, leaving the fewest issue slots empty.]