Multithreading


Multithreaded Architectures
Uri Weiser and Yoav Etsion
Computer Architecture 2014 - Multithreading
1/40
Overview

- Multithreaded Software
- Multithreaded Architecture
- Multithreaded Micro-Architecture
- Conclusions
Multithreading Basics

- Process
  - Each process has its own unique address space
  - Can consist of several threads
- Thread – each thread has its own execution context
  - A thread has its own PC (sequencer) + registers + stack
  - All threads (within a process) share the same address space
  - A private heap is optional
- Multithreaded apps: a process broken into threads
  - #1 example: transactions (databases, web servers)
  - Increased concurrency
  - Partial blocking (your transaction blocks, mine doesn't have to)
  - Centralized (smarter) resource management (by process)
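The process/thread model above can be illustrated with a short sketch. This is a minimal Python example (not from the slides): the threads share the process address space – the global `counter` – while each has its own stack and program counter. The worker count and loop bound are arbitrary.

```python
# Threads share the process address space (the `counter` global),
# while each has its own private stack and program counter.
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    local = 0          # lives on this thread's private stack
    for _ in range(n):
        local += 1
    with lock:         # shared address space needs synchronization
        counter += local

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 4000: all four threads updated the same shared variable
```

The lock is needed precisely because the address space is shared; the per-thread `local` variable needs no protection.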
Superscalar Execution
Example: a 5-stage, 4-wide machine

[Pipeline diagram: four instructions advance in parallel through the stages F, D1, D2, EX/RET, WB.]
Superscalar Execution

[Execution diagram over time.]
Generally increases throughput, but decreases utilization.
CMP – Chip Multi-Processor
Multiple threads/tasks run on multiple processors/cores.

[Execution diagram over time.]
Low utilization per core, but higher overall throughput.
Multiple Hardware Threads

- A thread can be viewed as a stream of instructions
  - Its state is represented by the PC, stack pointer, and GP registers
- Equip multithreaded processors with multiple hardware contexts (threads)
  - The OS views the CPU as multiple logical processors
- Execution of instructions from different threads is interleaved in the pipeline
  - The interleaving policy is critical…
Blocked Multithreading
Multithreading/multitasking on a single core

[Execution diagram over time.]
How is this different from OS scheduling?
May increase utilization and throughput, but must switch when the current thread enters a low-utilization/throughput section (e.g., an L2 cache miss).
Fine-Grained Multithreading
Multithreading/multitasking on a single core

[Execution diagram over time.]
Increases utilization/throughput by reducing the impact of dependences.
Simultaneous Multithreading (SMT)
Multithreading/multitasking on a single core

[Execution diagram over time.]
Increases utilization/throughput.
Blocked Multithreading
(SoE-MT – Switch-on-Event MT, aka the “Poor Man's MT”)

- Critical decision: when to switch threads
  - When the current thread's utilization/throughput is about to drop (e.g., an L2 cache miss)
- Requirements for throughput:
  - (Thread-switch time) + (pipe-fill time) << blocking latency
  - Would like to get some work done before the other thread comes back
  - Fast thread switch: multiple register banks
  - Fast pipe fill: a short pipe
- Advantage: small changes to existing hardware
- Drawback: high single-thread performance requires a long thread switch
- Examples:
  - Macro-dataflow machine
  - MIT Alewife
  - IBM Northstar
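The throughput requirement above – (thread switch) + (pipe fill) << blocking latency – can be checked with a back-of-the-envelope sketch. All cycle counts below are illustrative assumptions, not data from the slides:

```python
# Back-of-the-envelope model of the Switch-on-Event condition:
# switching pays off only if (switch + pipe fill) << blocking latency.

def soe_hidden_cycles(switch_cycles, pipe_fill_cycles, block_latency_cycles):
    """Cycles of useful work the other thread gets per blocking event."""
    overhead = switch_cycles + pipe_fill_cycles
    return block_latency_cycles - overhead  # > 0 means the switch helps

# A 5-cycle switch (register-bank swap) + 10-cycle pipe fill against a
# 300-cycle L2 miss hides almost the whole miss:
print(soe_hidden_cycles(5, 10, 300))   # 285 cycles of hidden latency
# Against a 20-cycle miss the overhead eats most of the benefit:
print(soe_hidden_cycles(5, 10, 20))    # only 5 cycles
```

This is why the slide calls for multiple register banks (fast switch) and a short pipe (fast fill): both shrink the overhead term.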
Interleaved (Fine-Grained) Multithreading

- Critical decision: none?
- Requirements for throughput:
  - Enough threads to hide data-access latencies and dependences
- Increasing the number of threads reduces single-thread performance

[Pipeline diagrams: in the superscalar pipeline, instructions I1 and I2 of one thread follow each other through the F, D, E, M, W stages; in the multithreaded pipeline, I1 of T1, I1 of T2, and I2 of T1 are interleaved cycle by cycle.]
Interleaved (Fine-Grained) Multithreading (cont.)

- Advantages:
  - (w/ flexible interleave:) reasonable single-thread performance
  - High processor utilization (especially with many threads)
  - Became attractive for GPUs and network processors
- Drawbacks:
  - Complicated hardware
  - Multiple contexts (states)
  - (w/ inflexible interleave:) limits single-thread performance
- Examples:
  - HEP Denelcor: 8 threads (latencies were shorter then)
  - TERA: 128 threads
  - MicroUnity: 5 × 1GHz threads = 200MHz-like latency
Simultaneous Multithreading (SMT)

- Critical decision: the fetch-interleaving policy
- Requirements for throughput:
  - Enough threads to utilize resources
  - Fewer than needed to stretch dependences
- Examples:
  - Compaq Alpha EV8 (cancelled)
  - Intel Pentium® 4 Hyper-Threading Technology
SMT Case Study: EV8

- 8-issue OOO processor (a wide machine)
- SMT support:
  - Multiple sequencers (PCs): 4
  - Alternating fetch from multiple threads
  - Separate register renaming for each thread
  - More (4×) physical registers
  - Thread tags on all shared resources (ROB, LSQ, BTB, etc.)
    - Allows per-thread flush/trap/retirement
  - Process tags on all address-space resources: caches, TLBs, etc.
- Notice: none of these things is in the “core”
  - Instruction queues, scheduler, and wakeup are SMT-blind
Basic EV8

[Pipeline diagram: Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire, with a single PC, register map, register files, Icache, and Dcache; the middle of the pipeline is labeled thread-blind.]
SMT EV8

[Same pipeline diagram, now with replicated PCs and register maps. The middle of the pipeline is labeled thread-blind – but is it really? Not really.]
Performance Scalability

[Charts: relative performance for 1T–4T configurations across three workload classes: a multiprogrammed workload (SpecInt, SpecFP, mixed Int/FP), multithreaded applications (Barnes, Chess, Sort, TP), and decomposed SPEC95 applications (Turb3d, Swm256, Tomcatv).]
Multithread
“I'm afraid of this”

[Chart: SMT speedup (y-axis 0.4–1.6) over time in Mcycles (0–25), from an EV8 simulation by R. Espasa; the speedup fluctuates and can drop below 1.]
Is CMP a solution?
Performance Gain – When?

- When threads do not “step” on each other
  - “Predicted state”: caches, branch prediction, data prefetch…
- If ample resources are available
  - A slow resource can hamper performance
Scheduling in SMT

- What if one thread gets “stuck”?
  - Round-robin: eventually the stuck thread fills up the machine (resource contention)
  - ICOUNT: the thread with the fewest instructions in the pipe gets fetch priority
    - A stuck thread doesn't get to fetch until it gets “unstuck”
- Other policies have been devised to get the best balance of fairness and overall performance
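A minimal sketch of the ICOUNT idea (an illustration, not the real fetch logic): each cycle, fetch from the hardware thread with the fewest instructions currently in the pipe, so a stuck thread that has filled its share stops receiving fetch bandwidth.

```python
# ICOUNT fetch policy, reduced to its core: prioritize the thread with
# the fewest in-flight instructions. Thread names and counts are
# illustrative.

def icount_pick(in_flight):
    """in_flight: dict thread -> instructions in the pipe.
    Returns the thread that gets to fetch this cycle."""
    return min(in_flight, key=in_flight.get)

# T0 is stuck behind a cache miss and has piled up 24 instructions,
# so fetch bandwidth is steered to T1 until T0 drains:
print(icount_pick({"T0": 24, "T1": 3}))  # T1
# Once T0 drains below T1, it regains priority:
print(icount_pick({"T0": 2, "T1": 3}))   # T0
```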
Scheduling in SMT

- Variation: what if one thread is spinning?
  - It is not really stuck, so it gets to keep fetching
  - It has to be stopped/slowed artificially (QUIESCE)
    - Sleep until a memory location changes
  - On the Pentium 4, use the PAUSE instruction
MT Application Data Sharing

- Shared-memory apps
- Decoupled caches = private caches
  - Easier to implement – reuse existing MP-like protocols
- Shared caches
  - Faster communication among threads
  - No coherence overhead
  - Flexibility in allocating resources

[Diagram: an 8-way set-associative shared cache, with tag and data arrays for ways 0 through 7.]
SMT vs. CMP

How to use 2× the transistors: a better core + SMT, or CMP?

[Diagram: reference = Core A + 1MB L2; SMT = one SMT core + 2MB L2; CMP = Core A + Core B + 2MB L2.]

- SMT
  - Better single-thread performance
  - Better resource utilization
- CMP
  - Much easier to implement
  - Avoids wire-delay problems
  - Overall better throughput per area/power
  - More deterministic performance
  - BUT! the program must be multithreaded

Rough numbers:
- SMT: ST/MT performance for 2× transistors = 1.3× / 1.5–1.7×
- CMP: ST/MT performance for 2× transistors = 1× / 2×
Intermediate summary

- Multithreaded software
- Multithreaded architecture: advantageous cost/throughput…
  - Blocked MT – long-latency tolerance
    - Good single-thread performance, good throughput
    - Needs a fast thread switch and a short pipe
  - Interleaved MT – ILP increase
    - Bad single-thread performance, good throughput
    - Needs many threads; several loaded contexts…
  - Simultaneous MT – ILP increase
    - OK MT performance
    - Good single-thread performance
    - Good utilization…
    - Needs fewer threads
Example: Pentium® 4 Hyper-Threading

- Executes two threads simultaneously
  - Two different applications, or two threads of the same application
- The CPU maintains architecture state for two processors
  - Two logical processors per physical processor
- Implementation on the Intel® Xeon™ processor
  - Two logical processors for < 5% additional die area
  - The processor presents itself as two cores in an MP shared-memory system
uArchitecture impact

- Replicate some resources
  - All per-CPU architectural state
  - Instruction pointers, renaming logic
  - Some smaller resources – e.g., the return stack predictor, ITLB, etc.
- Partition some resources (share by splitting in half per thread)
  - Re-order buffer
  - Load/store buffers
  - Queues, etc.
- Share most resources
  - Out-of-order execution engine
  - Caches
Hyper-Threading Performance

- Performance varies as expected with:
  - The number of parallel threads
  - Resource utilization
- Less aggressive MT than EV8 → less impressive scalability
  - The machine is optimized for single-thread execution

[Chart: speedup on prototype 2-way Intel Xeon Processor MP platforms with Hyper-Threading enabled vs. disabled (baseline 1.00); speedups of 1.18–1.30 across Microsoft Active Directory, Microsoft SQL Server, Microsoft Exchange, Microsoft IIS, and a parallel GCC compile. Applications not tuned or optimized for Hyper-Threading Technology. *All trademarks and brands are the property of their respective owners.]
Multi-Thread Architecture (SoE)

[Diagram: a thread alternates execution phases and memory-access phases; the memory access costs P cycles, and one execution phase plus one memory access forms a period.]
Multi-Thread Architecture (SoE)

[Diagram: the architectural states of several threads feed a core C, which is connected to memory.]
Memory latency is shielded by the execution of multiple threads.
Multi-Thread Architecture (SoE)

[Diagram: thread architectural states feed core C, connected to memory.]

Instructions per memory access = 1/MI, where:
- CPI_i = ideal CPI of core C
- MI = memory accesses per instruction
- P = memory-access penalty (cycles)
- n = number of threads
Multi-Thread Architecture (SoE)

[Diagram: thread architectural states feed core C, connected to memory.]

Period = P + (1/MI) · CPI_i

CPI_1 = (P + (1/MI) · CPI_i) / (1/MI) = CPI_i + MI · P   [cycles per instruction, single thread]

(symbols as defined above)
Multi-Thread Architecture (SoE)

[Diagram: thread architectural states feed core C, connected to memory.]

IPC_T = n / CPI_1 = n / (CPI_i + MI · P)

(symbols as defined above)
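The model above can be sketched in a few lines. The hard cap at 1/CPI_i reflects the saturation at max performance = 1/CPI_i; the parameter values are illustrative assumptions, not data from the slides:

```python
# SoE multithreading throughput model: each thread costs
# CPI_1 = CPI_i + MI * P cycles per instruction, and total
# IPC = n / CPI_1, capped at the core's peak of 1/CPI_i.

def total_ipc(n, cpi_ideal, mi, p):
    """IPC of n interleaved threads, each paying miss penalty p."""
    cpi_per_thread = cpi_ideal + mi * p
    return min(n / cpi_per_thread, 1.0 / cpi_ideal)  # core-bound ceiling

# e.g. CPI_i = 1, one memory access every 10 instructions (MI = 0.1),
# P = 100 cycles: one thread gets IPC = 1/11, and ~11 threads are
# needed to fully hide the memory latency.
for n in (1, 4, 11, 16):
    print(n, round(total_ipc(n, 1.0, 0.1, 100), 3))
```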
Multi-Thread Architecture (SoE)

[Plot: performance (IPC) vs. number of threads n. Memory latency is shielded by the execution of multiple threads. What is the max performance?]
Multi-Thread Architecture (SoE)

[Plot: performance (IPC) vs. number of threads n; performance saturates at max performance = 1/CPI_i, reached at some thread count n₁.]
The unified machine – the valley

Perf = n / (CPI_i + MI · MR(n) · P)

Three regions: the cache (MC) region, the valley, and the MT region.

[Plot: performance vs. number of threads, annotated with the MC region, the valley, and the MT region.]
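A sketch of the valley formula. The miss-rate curve MR(n) below is a made-up illustration (a cache that fully covers about 8 threads, then degrades linearly), not measured data:

```python
# Unified-machine model: Perf(n) = n / (CPI_i + MI * MR(n) * P),
# where the miss rate MR(n) grows as more threads compete for the cache.

def perf(n, cpi_i=1.0, mi=0.1, p=200, cache_threads=8):
    # Assumed miss-rate curve: zero while the cache covers all threads,
    # then rising linearly to 1 as the working sets overflow.
    mr = min(1.0, max(0.0, (n - cache_threads) / cache_threads))
    return n / (cpi_i + mi * mr * p)

# Few threads: the cache hides memory (cache/MC region). Mid range:
# the rising miss rate drags performance down (the valley). Many
# threads: sheer thread count hides the latency again (MT region).
for n in (4, 8, 16, 64, 256):
    print(n, round(perf(n), 2))
```

With these numbers, performance climbs in the cache region (n ≤ 8), collapses once the working sets overflow the cache (the valley), and only surpasses the cache-region peak deep in the MT region.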
Multi-Thread Architecture (SoE): conclusions

- Applicable when many threads are available
- Requires saving each thread's state
- Can reach “high” performance (~1/CPI_i)
- Bandwidth to memory is high → high power
BACKUP SLIDES
The unified machine
Parameter: Cache Size

Increasing the cache size → the cache can hide more memory latency → favors the MC region.

[Chart: performance (GOPS) vs. number of threads for cache sizes of no cache, 16M, 32M, 64M, 128M, and a perfect cache. At this point: unlimited BW to memory.]
Cache Size Impact

- An increase in cache size → the cache suffices for more in-flight threads
  - Extends the cache region… AND is also valuable in the MT region
- Caches reduce off-chip bandwidth → delay the BW-saturation point

[Chart: performance (GOPS) vs. number of threads for different cache sizes (no cache, 16M, 32M, 64M, 128M, perfect cache) under limited memory BW; performance improves as cache size increases.]
Memory Latency Impact

Increasing memory latency → hinders the MT region → emphasizes the importance of caches.

[Chart: performance (GOPS) vs. number of threads for memory latencies of zero, 50, 100, 300, and 1000 cycles; unlimited BW to memory.]