The Graphite Multicore Simulator

Transcript

TIME TRAVELING HARDWARE AND SOFTWARE SYSTEMS
Xiangyao Yu, Srini Devadas
CSAIL, MIT
FOR FIFTY YEARS, WE HAVE RIDDEN MOORE'S LAW
Moore's Law and the scaling of clock frequency = printing press for the currency of performance
TECHNOLOGY SCALING
[Chart: transistors (x1000), clock frequency (MHz), power (W), and core counts from 1970 to 2010, on a log scale.]
Each generation of Moore's Law doubles the number of transistors, but clock frequency has stopped increasing.
TECHNOLOGY SCALING
[Same chart as above.]
To increase performance, we need to exploit parallelism.
DIFFERENT KINDS OF PARALLELISM - 1
Instruction Level
  a = b + c
  d = e + f
  g = d + b
Transaction Level
  Read A, Read B, Compute C
  Read A, Read D, Compute E
  Read C, Read E, Compute F
DIFFERENT KINDS OF PARALLELISM - 2
Thread Level
  A x B = C: a different thread computes each entry of the product matrix C (see the sketch below).
Task Level
  Search("image") dispatched to the cloud.
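To make the thread-level case concrete, here is a minimal C++ sketch (mine, not from the talk) that computes C = A x B with one std::thread per row; the slide's version uses one thread per entry, but the point is the same: the output entries are independent, so the threads need no synchronization beyond join.

    #include <thread>
    #include <vector>

    // Thread-level parallelism: each thread computes one row of C = A * B.
    // C must already be sized n x m; the rows are disjoint, so no locking is needed.
    void matmul_parallel(const std::vector<std::vector<double>>& A,
                         const std::vector<std::vector<double>>& B,
                         std::vector<std::vector<double>>& C) {
        size_t n = A.size(), m = B[0].size(), k = B.size();
        std::vector<std::thread> workers;
        for (size_t i = 0; i < n; ++i) {
            workers.emplace_back([&, i] {
                for (size_t j = 0; j < m; ++j) {
                    double sum = 0.0;
                    for (size_t p = 0; p < k; ++p) sum += A[i][p] * B[p][j];
                    C[i][j] = sum;   // each thread writes its own row only
                }
            });
        }
        for (auto& t : workers) t.join();
    }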
DIFFERENT KINDS OF PARALLELISM - 3
Thread Level
  A x B = C: a different thread computes each entry of the product matrix C.
User Level
  Independent user requests (Search("image"), Lookup("data"), Query("record")) served by the cloud.
DEPENDENCY DESTROYS PARALLELISM
  for i = 1 to n
    a[b[i]] = (a[b[i - 1]] + b[i]) / c[i]
The ith entry can only be computed after the (i - 1)th entry has been computed.
DIFFERENT KINDS OF DEPENDENCY (illustrated in the snippet below)
• Read A → Read A: no dependency!
• RAW: Write A → Read A: the read needs the new value.
• WAW: Write A → Write A: program semantics decide the order.
• WAR: Read A → Write A: we have flexibility here!
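In code, the three dependency types look roughly like this (a hypothetical fragment, not from the talk; the values are arbitrary):

    // RAW, WAW, and WAR on a single variable A.
    int main() {
        int A = 0, B = 0, C = 0;

        // RAW (read-after-write): the read needs the new value of A.
        A = 10;        // write A
        B = A + 1;     // read A  -> must observe 10

        // WAW (write-after-write): program semantics decide which write wins.
        A = 20;        // write A
        A = 30;        // write A -> the final value must be 30

        // WAR (write-after-read): the read only needs the old value,
        // so there is flexibility in when the write is allowed to happen.
        C = A;         // read A (old value, 30)
        A = 40;        // write A

        return B + C;  // keep the compiler from discarding the example
    }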
DEPENDENCE IS ACROSS TIME, BUT WHAT IS TIME?
• Time can be physical time
• Time could correspond to logical timestamps assigned to instructions
• Time could be a combination of the above
⇒ Time is a definition of ordering
WAR DEPENDENCE
Initially A = 10.
• Thread 0: Read A (reads its local copy, A = 10)
• Thread 1: Write A (A = 13)
The Read happens later than the Write in physical time but comes before the Write in logical order.
WHAT IS CORRECTNESS?
• We define correctness of a parallel program based on its outputs in relation to the program run sequentially.
SEQUENTIAL CONSISTENCY
One thread issues A and B; the other issues C and D.
Possible global memory orders shown: A B C D, A C B D, C D A B, C B A D.
Can we exploit this freedom in correct execution to avoid dependency? (A sketch of the per-thread program-order check follows.)
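One way to read "global memory order" here: under sequential consistency, any interleaving is legal as long as each thread's own operations stay in program order. A small sketch of that check (a hypothetical helper, not from the talk):

    #include <map>
    #include <vector>

    // An operation is identified by its thread and its position in that
    // thread's program order.
    struct Op { int thread; int index; };

    // A global order is compatible with sequential consistency only if,
    // for each thread, its operations appear in program order.
    bool respects_program_order(const std::vector<Op>& global_order) {
        std::map<int, int> last_index;            // thread -> last index seen
        for (const Op& op : global_order) {
            auto it = last_index.find(op.thread);
            if (it != last_index.end() && op.index <= it->second) return false;
            last_index[op.thread] = op.index;
        }
        return true;
    }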
AVOIDING DEPENDENCY ACROSS THE STACK
• Circuit: efficient atomic instructions
• Multicore processor: Tardis coherence protocol
• Multicore database: TicToc concurrency control (with Andy Pavlo and Daniel Sanchez)
• Distributed database: Distributed TicToc
• Distributed shared memory: transaction processing with fault tolerance
SHARED MEMORY SYSTEMS
• Multi-core processor: cache coherence
• OLTP database: concurrency control
DIRECTORY-BASED COHERENCE
[Diagram: a grid of processors sharing a directory.]
• Data replicated and cached locally for access
• Uncached data copied to the local cache; writes invalidate data copies
CACHE COHERENCE SCALABILITY
[Chart: directory storage overhead versus core count (16, 128, 1024); the O(N) sharer list grows well beyond today's designs at high core counts.]
[Diagram: two cores read A and become sharers; a later write to A must invalidate every copy.]
LEASE-BASED COHERENCE
[Diagram: Core 0 issues Ld(A), Ld(A), Ld(A) and Core 1 issues St(A) along a logical-timestamp axis; each load holds a lease from its write timestamp (wts) to its read timestamp (rts), and the program timestamp (pts) tracks each core's position.]
• A read gets a lease on a cache line
• Lease renewal after a lease expires
• A store can only commit after leases expire
• Tardis: logical leases
LOGICAL TIMESTAMP
• Invalidation enforces physical time order.
• Tardis enforces logical time order, with no invalidation (a concept borrowed from databases).
[Diagram: the old and new versions of a line ordered along the logical time axis.]
TIMESTAMP MANAGEMENT
• Program Timestamp (pts), per core: timestamp of the last memory operation.
• Write Timestamp (wts), per cache line: the data was created at wts.
• Read Timestamp (rts), per cache line: the data is valid (leased) from wts to rts.
[Diagram: a core with pts = 5, its private cache holding line A in state S with wts = 0, rts = 10, and the shared LLC holding A (S, 0, 10) and B (S, 0, 5).]
(A simplified sketch of the load/store rules follows.)
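A simplified sketch of how these timestamps are used on a load and a store, based on my reading of the Tardis rules (not code from the talk; the lease of 10 matches the example that follows):

    #include <algorithm>
    #include <cstdint>

    struct CacheLine {
        uint64_t wts;   // logical time the data was written
        uint64_t rts;   // data is valid (leased) from wts to rts
        int      data;
    };

    constexpr uint64_t LEASE = 10;   // assumed lease length, as in the example

    // Load: the reader moves to at least the line's creation time and
    // extends the lease so the value stays readable up to rts.
    int tardis_load(uint64_t& pts, CacheLine& line) {
        pts = std::max(pts, line.wts);
        line.rts = std::max(line.rts, pts + LEASE);
        return line.data;
    }

    // Store (with exclusive ownership): the writer jumps past every lease
    // granted on the old version -- no invalidation is needed.
    void tardis_store(uint64_t& pts, CacheLine& line, int value) {
        pts = std::max(pts, line.rts + 1);
        line.wts = line.rts = pts;
        line.data = value;
    }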
TWO-CORE EXAMPLE
Core 0 executes: (1) store A, (2) load B, (5) load B. Core 1 executes: (3) store B, (4) load A. The numbers give the physical-time order.
Initially both cores have pts = 0, their private caches are empty, and the shared LLC holds A and B in state S with wts = 0 and rts = 0.
STORE A @ CORE 0
Core 0 sends a ST(A) request; the LLC records core 0 as the exclusive owner (M, owner: 0) and core 0 caches A in state M.
Write at pts = 1: core 0 advances its pts from 0 to 1 and sets A's wts = rts = 1.
LOAD B @ CORE 0
Core 0 sends a LD(B) request; the LLC returns B (wts = 0) and reserves the lease: rts is extended to pts + lease = 11. Core 0 caches B in state S with wts = 0, rts = 11; its pts stays at 1.
STORE B @ CORE 1
Core 1 sends a ST(B) request. Exclusive ownership is returned (M, owner: 1) with no invalidation of core 0's copy. To write past the existing lease (rts = 11), core 1 advances its pts to 12 and sets B's wts = rts = 12.
TWO VERSIONS COEXIST
Core 0 still caches the old B (wts = 0, rts = 11) while core 1 holds the new B (wts = rts = 12).
Core 1 has traveled ahead in time; the two versions are ordered in logical time.
LOAD A @ CORE 1
A is owned exclusively by core 0, so the LLC sends a write-back request to core 0, which downgrades A from M to S. The lease is reserved to core 1's pts + lease = 22, so A becomes (S, wts = 1, rts = 22) in both caches and the LLC. Core 1 reads A at pts = 12.
LOAD B @ CORE 0
Core 0 loads B again at pts = 1. Its cached copy (wts = 0, rts = 11) is still valid at logical time 1, so it reads the old version with no coherence traffic, even though core 1 has already written B at logical time 12.
Operation 5 comes after operation 4 in physical time, but before it in logical time: the global memory order ≠ the physical time order.
SUMMARY OF EXAMPLE
Directory protocol (physical time order): store A → load A is RAW; load B → store B is WAR; store B → the second load B is RAW, because the invalidation forces that load to return the new value.
Tardis (physical + logical time order): store A → load A is still RAW, but both loads of B are ordered before store B in logical time, so the second load B becomes a WAR and returns the old version.
PHYSIOLOGICAL TIME
Tardis global memory order for the example: Core 0: store A (1), load B (1), load B (1); Core 1: store B (12), load A (12).
Physiological time combines logical and physical time (a small comparator sketch follows):
  X <_PL Y := X <_L Y or (X =_L Y and X <_P Y)
Thm: Tardis obeys Sequential Consistency.
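Physiological time is simply a lexicographic order: compare logical timestamps first and break ties with physical time. A small sketch with hypothetical types (not from the talk):

    #include <cstdint>

    struct Event {
        uint64_t logical;    // logical timestamp (lease/commit timestamp)
        uint64_t physical;   // physical time at which the operation executed
    };

    // X <_PL Y  :=  X <_L Y  or  (X =_L Y and X <_P Y)
    bool before_physiological(const Event& x, const Event& y) {
        return x.logical < y.logical ||
               (x.logical == y.logical && x.physical < y.physical);
    }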
TARDIS PROS AND CONS
Pros:
• Scalability: no invalidation, no multicast or broadcast
Cons and mitigations:
• Lease renewals → speculative read
• Timestamp size → timestamp compression
• Time standing still → livelock avoidance
CONCURRENCY CONTROL
[Diagram: many transactions, each bracketed by BEGIN and COMMIT, running concurrently.]
Serializability: results should correspond to some serial order of atomic execution.
CONCURRENCY CONTROL
[Diagram: the same concurrent transactions, with one interleaving crossed out: "Can't Have This".]
Results should correspond to some serial order of atomic execution.
BOTTLENECK 1: TIMESTAMP ALLOCATION
• Centralized allocator: timestamp allocation is a scalability bottleneck
• Synchronized clock: clock skew causes unnecessary aborts
[Chart: throughput (million txn/s, up to 25) versus thread count (0 to 80) for 2PL and T/O.]
BOTTLENECK 2: STATIC ASSIGNMENT
• Timestamps are assigned before a transaction starts
• A suboptimal assignment leads to unnecessary aborts
[Diagram: transactions T1 and T2 reading and writing A; with one static assignment of timestamps (T1@ts=1, T2@ts=2) the interleaving aborts, while the other assignment (T1@ts=2, T2@ts=1) would have committed.]
KEY IDEA: DATA DRIVEN TIMESTAMP MANAGEMENT
Traditional T/O:
1. Acquire a timestamp (TS)
2. Determine tuple visibility using TS
→ timestamp allocation, static timestamp assignment
TicToc:
1. Access tuples and remember their timestamp info
2. Compute a commit timestamp (CommitTS)
→ no timestamp allocation, dynamic timestamp assignment
TICTOC TRANSACTION EXECUTION
Tuple format: data, wts (write timestamp: last data write @ wts), rts (read timestamp: last data read @ rts); the data is valid between wts and rts.
• READ PHASE: read and write tuples, execute the transaction
• VALIDATION PHASE: compute CommitTS, decide commit/abort
• WRITE PHASE: update the database
(A minimal sketch of the read phase follows.)
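A minimal sketch of the read phase under this tuple format (hypothetical data structures, not the authors' implementation): every read records the wts/rts it saw, and writes are buffered locally until the write phase.

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    struct Tuple      { uint64_t wts, rts; std::string data; };   // data valid in [wts, rts]
    struct ReadEntry  { Tuple* tuple; uint64_t wts, rts; };       // snapshot taken at read time
    struct WriteEntry { Tuple* tuple; std::string new_data; };    // buffered until commit

    struct Txn {
        std::unordered_map<Tuple*, ReadEntry>  read_set;
        std::unordered_map<Tuple*, WriteEntry> write_set;

        // Read phase: remember the timestamps seen; touch nothing global.
        std::string load(Tuple* t) {
            read_set[t] = {t, t->wts, t->rts};
            return t->data;
        }
        void store(Tuple* t, std::string v) {
            write_set[t] = {t, std::move(v)};
        }
    };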
TICTOC EXAMPLE
Database state: Tuple A at version v1 and Tuple B at version v1, shown on a logical-time axis from 0 to 4.
Transaction 1: (1) load A, (3) store B, (5) commit?
Transaction 2: (2) load A, (4) load B, (6) commit?
The numbers give the physical-time order; T1 and T2 keep transaction-local state (read and write sets).
LOAD A FROM T1
T1 loads a snapshot of tuple A: the data plus its wts and rts.
LOAD A FROM T2
T2 loads a snapshot of tuple A: the data plus its wts and rts.
STORE B FROM T1
T1 stores B into its local write set; the database copy of B is untouched.
LOAD B FROM T2
T2 loads a snapshot of tuple B: the data plus its wts and rts.
COMMIT PHASE OF T1
Compute CommitTS from the read and write sets (a sketch of this validation step follows):
• Write set: tuple.rts + 1 ≤ CommitTS
• Read set: tuple.wts ≤ CommitTS ≤ tuple.rts
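Continuing the sketch above, the validation phase computes CommitTS from exactly these two constraints and performs the rts extension on read-set tuples when needed (simplified: it ignores the locking and concurrent-validation details of the real protocol):

    #include <algorithm>

    // Returns true if the transaction can commit at commit_ts.
    bool validate(Txn& txn, uint64_t& commit_ts) {
        commit_ts = 0;
        // Write set: CommitTS must come after every lease on the tuples we overwrite.
        for (auto& kv : txn.write_set)
            commit_ts = std::max(commit_ts, kv.first->rts + 1);
        // Read set: CommitTS must not precede the version we actually read.
        for (auto& kv : txn.read_set)
            commit_ts = std::max(commit_ts, kv.second.wts);

        // Each read must still be valid at CommitTS; extend rts if possible.
        for (auto& kv : txn.read_set) {
            Tuple* t = kv.first;
            const ReadEntry& r = kv.second;
            if (commit_ts <= r.rts) continue;      // snapshot already covers CommitTS
            if (t->wts != r.wts) return false;     // a newer version exists: abort
            t->rts = std::max(t->rts, commit_ts);  // rts extension
        }
        return true;   // the write phase then installs the write set at commit_ts
    }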
COMMIT PHASE OF T1
T1 commits @ TS = 2, which requires an rts extension on tuple A so that its read remains valid at the commit timestamp.
COMMIT PHASE OF T1
T1 copies tuple B from its write set to the database, installing version v2 at TS = 2.
COMMIT PHASE OF T2
Compute CommitTS for T2: since T2 has no writes, it only needs a consistent read time that falls inside the [wts, rts] ranges of both tuples it read.
FINAL STATE
T2 commits @ TS = 0, even though T1 already committed @ TS = 2.
Txn 1 precedes Txn 2 in physical time, but Txn 2 precedes Txn 1 in logical time.
PHYSIOLOGICAL TIME ACROSS THE STACK
• Circuit: efficient atomic instructions
• Multicore processor: Tardis coherence protocol
• Multicore database: TicToc concurrency control
• Distributed database: Distributed TicToc
• Distributed shared memory: transaction processing with fault tolerance
ATOMIC INSTRUCTION (LR/SC)
• The ABA problem: between a core's LR(x) (seeing x = A) and its SC(x), another core stores x = B and then x = A again; the SC sees the same value A and cannot tell that x changed.
• Detect ABA using a timestamp (wts): the check compares the write timestamp rather than the value (a sketch follows).
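A sketch of the idea, reduced to a single word of memory (hypothetical, not the actual hardware interface): the reservation records the wts observed at LR, so SC fails on any intervening store, even one that restores the old value A.

    #include <cstdint>

    struct Line        { uint64_t wts = 0; int value = 0; };  // word plus its write timestamp
    struct Reservation { uint64_t wts = 0; bool valid = false; };

    // Load-reserved: remember the wts of the version we observed.
    int load_reserved(const Line& x, Reservation& r) {
        r = {x.wts, true};
        return x.value;
    }

    // Store-conditional: succeed only if no store happened since the LR.
    // Comparing wts (not the value) catches the A -> B -> A pattern.
    bool store_conditional(Line& x, Reservation& r, int v) {
        if (!r.valid || x.wts != r.wts) return false;
        x.wts += 1;
        x.value = v;
        r.valid = false;
        return true;
    }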
TARDIS CACHE COHERENCE
• Simple: No Invalidation
• Scalable:
– O(log N) storage
– No Broadcast, No Multicast
– No Clock Synchronization
• Support Relaxed Consistency Models
T1000: PROPOSED 1000-CORE SHARED MEMORY PROCESSOR
TICTOC CONCURRENCY CONTROL
• Data Driven Timestamp Management
• No Central Timestamp Allocation
• Dynamic Timestamp Assignment
DISTRIBUTED TICTOC
• Data Driven Timestamp Management
• Efficient Two-Phase Commit Protocol
• Support Local Caching of Remote Data
FAULT TOLERANT DISTRIBUTED SHARED MEMORY
• Transactional Programming Model
• Distributed Command Logging
• Dynamic Dependency Tracking Among Transactions (WAR dependency can be ignored)
TIME TRAVELING TO ELIMINATE WAR
Xiangyao Yu, Srini Devadas
CSAIL, MIT