
Multithreaded Architectures
Uri Weiser and Yoav Etsion
Overview

Multithreaded Software

Multithreaded Architecture

Multithreaded Micro-Architecture

Conclusions
Multithreading Basics

• Process
  • Each process has its own unique address space
  • Can consist of several threads
• Thread – each thread has its own unique execution context
  • A thread has its own PC (sequencer), registers, and stack
  • All threads within a process share the same address space
  • A private heap is optional
• Multithreaded apps: the process is broken into threads (see the pthreads sketch after this list)
  • #1 example: transactions (databases, web servers)
  • Increased concurrency
  • Partial blocking (your transaction blocks, mine doesn’t have to)
  • Centralized (smarter) resource management (by the process)
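To make the process/thread split concrete, here is a minimal pthreads sketch (an illustration, not from the slides): four threads share the process's globals and heap, while each gets its own stack and execution context.

```c
#include <pthread.h>
#include <stdio.h>

/* One address space per process: globals and heap are shared by all threads. */
static int shared_counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    int id = *(int *)arg;              /* 'id' lives on this thread's private stack */
    pthread_mutex_lock(&lock);
    int seen = ++shared_counter;       /* shared state needs synchronization */
    pthread_mutex_unlock(&lock);
    printf("thread %d saw counter=%d\n", id, seen);
    return NULL;
}

int main(void) {
    pthread_t t[4];
    int id[4];
    for (int i = 0; i < 4; i++) {
        id[i] = i;
        pthread_create(&t[i], NULL, worker, &id[i]);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);      /* each thread has its own PC, regs, stack */
    return 0;
}
```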
Superscalar Execution
Example – a 5-stage, 4-wide machine

[Pipeline diagram: four parallel pipes, each with stages F → D1 → D2 → EX/RET → WB]
Superscalar: Execution Stage
[Diagram: execution-slot occupancy over time]
Generally increases throughput, but decreases utilization
CMP – Chip Multi-Processor
Multiple threads/tasks run on multiple processors/cores

[Diagram: per-core issue slots over time]

Better utilization / lower thread-level throughput / higher overall throughput
Multiple Hardware Threads

• A thread can be viewed as a stream of instructions
  • Its state is represented by the PC, stack pointer, and general-purpose registers
• Equip multithreaded processors with multiple hardware contexts (threads)
  • Multiple register files, PCs, SPs
  • The OS views the CPU as multiple logical processors
• Instructions from different threads are interleaved in the pipeline
  • The interleaving policy is critical…
Blocked Multithreading
Multiple threads/tasks run on a single core

[Diagram: issue slots over time, one thread running at a time]

How is this different from OS scheduling?

May increase utilization and throughput, but must switch when the current thread enters a low-utilization/low-throughput phase (e.g. an L2 cache miss)
Fine Grained Multithreading
Multiple threads/tasks run on a single core

[Diagram: issue slots over time, threads interleaved cycle by cycle]

Increases utilization/throughput by reducing the impact of dependences
Simultaneous Multithreading (SMT)
Multiple threads/tasks run on a single core

[Diagram: issue slots over time, instructions from several threads mixed within each cycle]

Increases utilization/throughput
Blocked Multithreading
(SoE-MT: Switch-on-Event MT, a.k.a. “Poor Man’s MT”)

• Critical decision: when to switch threads
  • When the current thread’s utilization/throughput is about to drop (e.g. an L2 cache miss; see the sketch after this list)
• Requirements for throughput:
  • (thread switch time) + (pipe fill time) << blocking latency
    (for instance, a switch plus pipe fill of ~10 cycles is easily amortized against a memory access of hundreds of cycles)
  • We would like to get some work done before the blocked thread comes back
  • Fast thread switch: multiple register banks
  • Fast pipe fill: a short pipe
• Advantage: small changes to existing hardware
• Drawback: high single-thread performance requires a long thread switch
• Examples: the macro-dataflow machine, MIT Alewife, IBM Northstar
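A minimal C sketch of the switch-on-event idea (names and structure are illustrative assumptions, not any real design): the core keeps one register bank per hardware thread, so a “switch” is just selecting another ready bank, with no state copying.

```c
/* Illustrative switch-on-event sketch: one register bank per hardware thread. */
#define NCTX 4

struct context {              /* one register bank per hardware thread */
    unsigned long pc, sp;
    unsigned long gpr[32];
    int ready;                /* 0 while waiting on memory */
};

static struct context ctx[NCTX];
static int cur = 0;

/* Called when the running thread blocks, e.g. on an L2 cache miss. */
int switch_on_event(void) {
    ctx[cur].ready = 0;                  /* current thread now waits on memory */
    for (int i = 1; i <= NCTX; i++) {
        int t = (cur + i) % NCTX;
        if (ctx[t].ready)
            return cur = t;              /* fast switch: just select another bank */
    }
    return cur;                          /* no thread ready: pipeline idles */
}
```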
Interleaved (Fine-Grained) Multithreading

• Critical decision: none?
• Requirements for throughput:
  • Enough threads to hide data-access latencies and dependences
• Increasing the number of threads reduces single-thread performance
• Implementation options: fixed vs. flexible interleave (see the sketch after the diagrams below)
[Diagram: superscalar pipeline — instructions I1, I2 of one thread flow through stages F D E M W, with a dependent instruction stalling behind its producer]
[Diagram: multithreaded pipeline — I1 of T1, I1 of T2, then I2 of T1 enter F D E M W in successive cycles, so dependent instructions of one thread are naturally spaced apart]
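A small sketch of the two interleave options (illustrative only, not a real design): fixed interleave rotates threads blindly each cycle, while flexible interleave rotates too but skips threads that cannot issue.

```c
/* Fixed vs. flexible fine-grained interleave of hardware threads. */
#define NTHREADS 4

int fixed_pick(long cycle) {
    return (int)(cycle % NTHREADS);      /* thread id to fetch this cycle */
}

int flexible_pick(long cycle, const int ready[NTHREADS]) {
    for (int i = 0; i < NTHREADS; i++) {
        int t = (int)((cycle + i) % NTHREADS);
        if (ready[t])
            return t;                    /* first ready thread, rotating start */
    }
    return -1;                           /* nothing ready: insert a bubble */
}
```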
Interleaved (Fine-Grained) Multithreading (cont.)

• Advantages:
  • (with flexible interleave) Reasonable single-thread performance
  • High processor utilization (especially with many threads)
• Drawbacks:
  • Complicated hardware
  • Requires highly parallel software for efficient execution
  • Massive cache pressure
  • (with fixed interleave) Limits single-thread performance
• Examples:
  • HEP Denelcor: 8 threads (latencies were shorter then)
  • TERA: 128 threads
  • MicroUnity: 5 × 1 GHz threads = 200 MHz-like latency per thread
  • GPUs employ a variant of fine-grained interleaving
Fine Grained Multithreading in GPGPUs
[Diagram: warps interleaved over time on a GPGPU core]

• 32 threads per group (warp); the GPGPU core alternates between 48 warps
• Total number of threads running on a core: 32 × 48 = 1536
• Huge register file (around 128 KB)
Simultaneous Multithreading (SMT)

• Critical decision: the fetch-interleaving policy
• Requirements for throughput:
  • Enough threads to utilize resources
  • Fewer threads needed to stretch dependences (than in fine-grained MT)
  • Limit: too many threads reduce cache effectiveness
• Examples:
  • Compaq Alpha EV8 (cancelled)
  • Intel Hyper-Threading Technology
Caching and Data Sharing

• Decoupled caches = private caches
  + Guarantees each thread’s cache space
  – Requires protocols across caches
  – Cache allocation is rigid
• Shared caches
  + Faster communication among threads
  + No coherence overhead
  + Flexibility in allocating resources
  – Threads can step on each other’s toes…

[Diagram: an 8-way set-associative shared cache — ways 0–7, each way holding tag and data arrays]
SMT Case Study: EV8

• 8-issue OOO processor (a wide machine)
• SMT support:
  • Multiple sequencers (PCs): 4
  • Alternate fetch from multiple threads
  • Separate register renaming for each thread
    • More (4×) physical registers
  • Thread tags on all shared resources: ROB, LSQ, BTB, etc.
    • Allows per-thread flush/trap/retirement (see the sketch below)
  • Process tags on all address-space resources: caches, TLBs, etc.
• Notice: none of these things are in the “core”
  • Instruction queues, scheduler, and wakeup logic are SMT-blind
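An illustrative sketch of thread tags on shared back-end structures (the fields are assumptions, not the actual EV8 layout): each ROB entry carries the id of its owning thread, so one thread can be flushed without touching the others.

```c
#include <stdint.h>

#define ROB_SIZE 256

struct rob_entry {
    uint8_t tid;       /* owning hardware thread (0..3 on a 4-thread design) */
    uint8_t valid;
    uint8_t done;      /* executed, waiting to retire in order */
    /* ... destination register, exception bits, etc. ... */
};

static struct rob_entry rob[ROB_SIZE];

/* Flush everything belonging to one thread, leaving other threads alone. */
void flush_thread(uint8_t tid) {
    for (int i = 0; i < ROB_SIZE; i++)
        if (rob[i].valid && rob[i].tid == tid)
            rob[i].valid = 0;
}
```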
Basic EV8
[Pipeline diagram: Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire; supporting structures: PC, register map, register files, Icache, Dcache. The whole pipeline is thread-blind.]
SMT EV8
[Pipeline diagram: same stages as the basic EV8, but with multiple PCs and per-thread register maps; the rest of the pipeline is nominally thread-blind]

Thread-blind? Not really.
Performance Scalability
[Bar charts: speedup vs. number of threads (1T–4T), normalized to a 1T baseline of 100%:
• Multiprogrammed workload (SpecInt, SpecFP, mixed Int/FP): scaling up to roughly 250%
• Multithreaded applications (Barnes, Chess, Sort, TP): scaling up to roughly 300%
• Decomposed SPEC95 applications (Turb3d, Swm256, Tomcatv): scaling up to roughly 250%]
Performance Gain – when?

• When threads do not “step” on each other’s “predicted state”: caches ($), branch prediction, data prefetch…
• When ample resources are available
• A slow resource can hamper performance
Scheduling in SMT

• What if one thread gets “stuck”?
  • Round-robin: the stuck thread eventually fills up the machine (resource contention)
  • ICOUNT: the thread with the fewest instructions in the pipe has priority
    • A stuck thread doesn’t get to fetch until it gets “unstuck” (see the sketch below)
  • Other policies have been devised to get the best balance of fairness and overall performance
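A toy C model of the ICOUNT fetch policy (the names are illustrative, not from any real design): each cycle, fetch from the non-stalled thread that currently has the fewest instructions in the front of the pipe.

```c
#include <limits.h>

#define NTHREADS 4

struct hw_thread {
    int in_pipe;   /* instructions in decode/rename/queue stages */
    int stalled;   /* e.g. waiting on an icache miss */
};

/* Returns the thread id to fetch from this cycle, or -1 if none is ready. */
int icount_pick(const struct hw_thread t[NTHREADS]) {
    int best = -1, best_count = INT_MAX;
    for (int i = 0; i < NTHREADS; i++) {
        if (t[i].stalled)
            continue;                    /* a stuck thread stops fetching...   */
        if (t[i].in_pipe < best_count) { /* ...so it cannot fill the machine   */
            best_count = t[i].in_pipe;
            best = i;
        }
    }
    return best;
}
```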
Scheduling in SMT (cont.)

• Variation: what if one thread is spinning?
  • It is not really stuck, so it gets to keep fetching
  • It must be stopped/slowed artificially (QUIESCE)
    • Sleep until a memory location changes
    • On the Pentium 4: use the PAUSE instruction (see the sketch below)
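A minimal spin-wait sketch for a hyper-threaded x86 core: the PAUSE instruction (exposed as the `_mm_pause()` intrinsic) hints that this logical processor is spinning, letting the core favor its sibling thread.

```c
#include <immintrin.h>
#include <stdatomic.h>

void spin_wait(atomic_int *flag) {
    while (atomic_load_explicit(flag, memory_order_acquire) == 0)
        _mm_pause();   /* de-prioritize this logical processor while spinning */
}
```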
Intermediate summary

• Multithreaded software
• Multithreaded architecture
  • Advantageous cost/throughput…
  • Blocked MT: long-latency tolerance
    • Good single-thread performance, good throughput
    • Needs a fast thread switch and a short pipe
  • Interleaved MT: ILP increase
    • Bad single-thread performance, good throughput
    • Needs many threads; several loaded contexts…
  • Simultaneous MT: ILP increase
    • OK MT performance
    • Good single-thread performance
    • Good utilization…
    • Needs fewer threads
SMT vs. CMP

How to use 2× the transistors — a bigger core with SMT, or a CMP?

[Diagram: a reference core with a 1MB L2, vs. (a) the same core with SMT and a 2MB L2, and (b) a CMP: Core A + Core B sharing a 2MB L2]

• SMT:
  • Better single-thread performance
  • Better resource utilization
• CMP:
  • Much easier to implement
  • Avoids wire-delay problems
  • Overall better throughput per area/power
  • More deterministic performance
  • BUT! the program must be multithreaded

Rough numbers when doubling the number of transistors:
• SMT: ST performance 1.3× / MT performance 1.5–1.7×
• CMP: ST performance 1× / MT performance 2×
Example: Pentium® 4 Hyper-Threading

• Executes two threads simultaneously
  • Two different applications, or two threads of the same application
• The CPU maintains architectural state for two processors
  • Two logical processors per physical processor
• Implementation on the Intel® Xeon™ processor
  • Two logical processors for < 5% additional die area
• The processor presents itself as 2 cores in an MP shared-memory system
uArchitecture impact

• Replicate some resources
  • All per-CPU architectural state
  • Instruction pointers, renaming logic
  • Some smaller resources, e.g. the return-stack predictor, ITLB, etc.
• Partition some resources (share by splitting in half per thread)
  • Re-order buffer
  • Load/store buffers
  • Queues, etc.
• Share most resources
  • The out-of-order execution engine
  • Caches
Hyper-Threading Performance

• Performance varies as expected with:
  • The number of parallel threads
  • Resource utilization
• Less aggressive MT than EV8 → less impressive scalability
  • The machine is optimized for single-thread execution.

[Bar chart: Intel Xeon MP (IXP-MP) with Hyper-Threading enabled vs. disabled (baseline 1.00); speedups of 1.18–1.30 (values shown: 1.18, 1.22, 1.30, 1.23, 1.18) across Microsoft* Active Directory, Microsoft* SQL Server, Microsoft* Exchange, Microsoft* IIS, and a parallel GCC compile]

Intel Xeon Processor MP platforms are prototype systems in 2-way configurations. Applications were not tuned or optimized for Hyper-Threading Technology.
*All trademarks and brands are the property of their respective owners.
Analysis of a Multi-Thread Architecture:
A simple switch-on-event (SoE) processor

[Timeline diagram: execution phases alternate with memory accesses; each memory access stalls the thread for P cycles, and one execution burst plus one memory access forms a period]
Multi-Thread Architecture (SoE)

[Diagram: core C holding several threads’ architectural states, connected to memory]

Memory latency is shielded by executing multiple threads.
Multi-Thread Architecture (SoE): definitions

• CPI_i = ideal CPI of core C
• MI = memory accesses per instruction (so instructions per memory access = 1/MI)
• P = memory access penalty (cycles)
• n = number of threads
Multi-Thread Architecture (SoE): single thread

Period = (1/MI) · CPI_i + P   [cycles]

CPI_1 = Period / (1/MI) = [(1/MI) · CPI_i + P] / (1/MI) = CPI_i + MI · P   [cycles/instruction]
Multi-Thread Architecture (SoE): n threads

The period is unchanged, but n threads retire n · (1/MI) instructions per period:

CPI_MT = [(1/MI) · CPI_i + P] / (n · (1/MI)) = (CPI_i + MI · P) / n   [cycles/instruction]
Multi-Thread Architecture (SoE): throughput

IPC_MT = n · IPC_1 = n · (1/CPI_1) = n / (CPI_i + MI · P)
Multi-Thread Architecture (SoE)

[Plot: performance (IPC) vs. number of threads n — IPC grows linearly while memory latency is being shielded by multiple threads. What is the maximum performance?]
Multi-Thread Architecture (SoE)

[Plot: the same curve saturates at the maximum performance of 1/CPI_i. At what thread count n₁ does it saturate?]
So how many threads do we need for maximum performance?

Set the saturated throughput equal to the MT throughput and solve for n:

1/CPI_i = n / (CPI_i + MI · P)
n = (CPI_i + MI · P) / CPI_i
n = 1 + (MI · P) / CPI_i

(A small calculator for this model follows below.)
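To see the formulas in action, here is a small C calculator for the model; the parameter values (CPI_i = 1, MI = 0.05, P = 200 cycles) are invented examples, not from the slides.

```c
#include <stdio.h>

int main(void) {
    double cpi_i = 1.0;   /* ideal CPI of core C           */
    double mi    = 0.05;  /* memory accesses / instruction */
    double p     = 200.0; /* memory access penalty, cycles */

    double cpi_1 = cpi_i + mi * p;          /* single-thread CPI        */
    double n_sat = 1.0 + (mi * p) / cpi_i;  /* threads to reach the max */

    printf("CPI_1 = %.2f\n", cpi_1);               /* 11.00 */
    printf("threads for peak IPC: %.1f\n", n_sat); /* 11.0  */
    for (int n = 1; n <= 12; n++) {
        double ipc = n / cpi_1;                    /* linear while n <= n_sat */
        if (ipc > 1.0 / cpi_i) ipc = 1.0 / cpi_i;  /* saturates at 1/CPI_i    */
        printf("n=%2d  IPC=%.3f\n", n, ipc);
    }
    return 0;
}
```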
SoE: Now with a cache! (single thread)

With a cache, only misses pay the memory penalty. Let MR = cache miss rate:

Period = (1/MI) · CPI_i + MR · P   [cycles]

CPI_1 = Period / (1/MI) = CPI_i + MI · MR · P   [cycles/instruction]
The unified machine – the valley

When we have a cache, the memory penalty depends on the miss rate, and the miss rate itself grows with the number of in-flight threads:

Perf = n / [CPI_i + MI · MR(n) · P]

[Plot: performance vs. threads, showing three regions: the cache (MC) region, the valley, and the MT region]

(A sketch of this curve follows below.)
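A hedged C sketch of the valley curve: the MR(n) miss-rate model below is an invented illustration (the slides do not give one), chosen only so that the three regions appear.

```c
#include <stdio.h>

/* Assumed miss-rate model: low while the working sets fit in the cache,
 * rising toward 1.0 as n threads overflow it. Purely illustrative. */
static double mr(int n, double fit_threads) {
    double r = (double)n / fit_threads;
    return r < 1.0 ? 0.02 : (r > 5.0 ? 1.0 : 0.02 + (r - 1.0) * 0.245);
}

int main(void) {
    double cpi_i = 1.0, mi = 0.05, p = 200.0;  /* same made-up values as above */
    double fit = 8.0;   /* threads whose working sets fit in the cache */
    for (int n = 1; n <= 64; n *= 2) {
        double perf = n / (cpi_i + mi * mr(n, fit) * p);
        printf("n=%2d  MR=%.2f  Perf=%.2f\n", n, mr(n, fit), perf);
    }
    return 0;   /* output rises, dips (the valley), then rises again */
}
```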
Multi-Thread Architecture (SoE): conclusions

• Applicable when many threads are available
• Requires saving thread state
• Can reach “high” performance (~1/CPI_i)
• Bandwidth to memory is high → high power
BACKUP SLIDES
The unified machine – parameter: cache size

Increasing the cache size increases the cache’s ability to hide memory latency → favors the MC (cache) region.

[Plot: “Performance for Different Cache Sizes” — GOPS vs. number of threads, with curves for no cache, 16M, 32M, 64M, 128M, and a perfect $. At this point: unlimited BW to memory.]
Cache Size Impact

• An increase in cache size means the cache suffices for more in-flight threads
  • Extends the $ region… AND is also valuable in the MT region
• Caches reduce off-chip bandwidth → delay the BW saturation point

[Plot: “Performance for Different Cache Sizes (Limited BW)” — GOPS vs. number of threads, with curves for no $, 16M, 32M, 64M, 128M, and a perfect $; larger caches shift performance upward and delay the saturation point]
Memory Latency Impact

• An increase in memory latency hinders the MT region
  • Emphasizes the importance of caches

[Plot: “Performance for Different Memory Latencies” — GOPS vs. number of threads, with curves for zero latency and for 50, 100, 300, and 1000 cycles; unlimited BW to memory]