Compiler-Defined Architectures

Download Report

Transcript Compiler-Defined Architectures

Spatial Computation
Computing without General-Purpose Processors
Mihai Budiu
[email protected]
Carnegie Mellon University
July 8, 2004
Spatial Computation
Spatial Computation
A computation model based on:
Mihai Budiu
• application-specific
[email protected] hardware
Mellon University
• no Carnegie
interpretation
• minimal resource sharing
2
The Engine Behind This Talk
main( )
{
signal(SIGINT, welcome);
while (slides( ) && time( )) {
talk( );
}
}
3
Research Scope
Object:
future architectures
Tool:
compilers
Evaluation:
simulators
4
Research Methodology
Y (e.g., cost)
“reasonable limits”
state-of-the-art
incremental
evolution
Constraint Space
X (e.g., power)
new solutions
5
Outline
• Introduction: problems of current architectures
100
•
•
•
•
2000
1998
1996
1994
1992
1990
1988
1986
1984
1
1982
10
1980
Performance
1000
Compiling Application-Specific Hardware
Pipelining
ASH Evaluation
Conclusions
6
Resources
[Intel]
• We do not worry about not having hardware resources
• We worry about being able to use hardware resources
7
Design Complexity
1010
Transistors
109
108
107
106
105
104
8
Communication vs. Computation
gate
wire
5ps
20ps
Power consumption on wires is also dominant
9
Power Consumption
Toasted CPU:
about 2 sec after removing cooler.
(Tom’s Hardware Guide)
10
Energy Efficiency
ALUs
Pentium 4
11
Clock Speed
3GHz
6GHz
10GHz
Cannot rely on global signals
(clock is a global signal)
12
Instruction-Set Architecture
Software
ISA
VERY rigid
to changes
(e.g. x86 vs
Itanium)
Hardware
13
Our Proposal
• ASH addresses these problems
• ASH is not a panacea
• ASH “complementary” to CPU
Low ILP computation
+ OS + VM
CPU
ASH
High-ILP
computation
$
Memory
14
Outline
• Problems of current architectures
• CASH: Compiling ASH
– program representation
– compiling C programs
• Pipelining
• ASH Evaluation
• Conclusions
15
Application-Specific Hardware
C program
Compiler
Dataflow IR
SW
ISA
HW
HW backend
Reconfigurable/custom hw
16
Application-Specific Hardware
Soft
C program
Compiler
Dataflow IR
SW backend
CPU [predication]
17
Key: Intermediate Representation
Our IR
Traditionally
• SSA + predication + speculation
• Uniform for scalars and memory
may-dep.
CFG
...
• Explicitly encodes may-depend
• Executable
• Precise semantics
def-use
• Dataflow IR
• Close to asynchronous target
18
Computation = Dataflow
Programs
Circuits
a
x = a & 7;
...
7
&
2
y = x >> 2;
x
>>
• Operations ) functional units
• Variables
) wires
• No interpretation
19
Basic Computation
+
latch
data
ack
valid
20
Asynchronous Computation
+
data
1
latch
5
+
+
+
ack
valid
2
+
3
+
6
4
+
7
+
8
21
Distributed Control Logic
global
FSM
ack
rdy
+
short, local wires
asynchronous control
22
Outline
• Problems of current architectures
• CASH: Compiling ASH
– program representation
– compiling C programs
• Pipelining
• ASH Evaluation
• Conclusions
23
MUX: Forward Branches
b
if (x > 0)
y = -x;
else
y = b*x;
x
*
0
-
f
>
!
y
SSA
= no arbitration
Conditionals ) Speculation
critical path
24
Control Flow ) Data Flow
data
f
Merge (label)
data
data
predicate
Gateway
p
Split (branch)
!
25
0
Loops
i
*
0
int sum=0, i;
for (i=0; i < 100; i++)
sum += i*i;
return
return sum;
sum;
+1
< 100
sum
+
!
ret
26
Predication and Side-Effects
addr
token
sequencing
of side-effects
no speculation
pred
Load
data
to
memory
token
27
Memory Access
LD
ST
pipelined
arbitrated
network
Monolithic
Memory
LD
local communication
global structures
Future work: fragment this!
complexity
related work
28
CASH Optimizations
• SSA-based optimizations
– unreachable/dead code, gcse, strength reduction,
loop-invariant code motion, software pipelining,
reassociation, algebraic simplifications, induction
variable optimizations, loop unrolling, inlining
• Memory optimizations
– dependence & alias analysis, register promotion,
redundant load/store elimination, memory access
pipelining, loop decoupling
• Boolean optimizations
– Espresso CAD tool, bitwidth analysis
29
Outline
• Problems of current architectures
• Compiling ASH
• Pipelining
• Evaluation: CASH vs. clocked designs
• Conclusions
30
i
Pipelining
*
+
<=
pipelined
multiplier
(8 stages)
int sum=0, i;
for (i=0; i < 100; i++)
sum += i*i;
return sum;
100
1
sum
+
step 1
31
i
Pipelining
*
100
1
+
<=
sum
+
step 2
32
i
Pipelining
*
100
1
+
<=
sum
+
step 3
33
i
Pipelining
*
100
1
+
<=
sum
+
step 4
34
i
Pipelining
i=1
100
1
+
<=
i=0
sum
+
step 5
35
i
Pipelining
*
i=1
100
1
+
<=
i=0
sum
+
step 6
36
i
Pipelining
*
+
<=
i’s loop
predicate
100
1
Long
latency
pipe
sum
sum’s loop
+
step 7
37
i
Pipelining
*
i’s loop
critical path
100
1
+
<=
Predicate ack
edge is on the
critical path.
sum
sum’s loop
+
38
Pipeline balancing
*
i
100
1
+
<=
i’s loop
decoupling
FIFO
sum
sum’s loop
+
step 7
39
i
Pipeline balancing
*
i’s loop
100
1
+
<=
critical path
decoupling
FIFO
sum
sum’s loop
+
40
Outline
•
•
•
•
Problems of current architectures
Compiling ASH
Pipelining
Evaluation: CASH vs. clocked designs
• Conclusions
41
Evaluating ASH
C
Mediabench kernels
(1 hot function/benchmark)
CASH
core
Verilog
back-end
Synopsys,
Cadence P/R
180nm std. cell
library, 2V
ModelSim
Mem
(Verilog simulation)
ASIC
~1999
technology
performance
numbers
42
normalized area
it_
e
gw
pe
it_
d
gw
pe
e
g2
_
m
pe
d
_e
_d
g2
_
m
pe
eg
jp
eg
jp
m
_e
gs
Mem access
Datapath
m
_d
21
_e
8
gs
g7
21
_d
_e
2
g7
pc
m
_d
7
ad
pc
m
ad
Square mm
ASH Area
P4: 217
6
5
4
3
minimal RISC core
1
0
43
ASH vs 600MHz CPU [.18 mm]
4
3.76
3.51
3.5
Times slower
3
2.5
2.19
1.91
2
1.73
1.61
1.5
1.62
1.65
1.48
1.17
1.08
1
0.45
0.5
0.45
g
av
m
pe
g2
_d
m
pe
g2
_e
pe
gw
it_
d
pe
gw
it_
e
g_
e
jp
e
g_
d
jp
e
_e
gs
m
_d
gs
m
_e
g7
21
_d
g7
21
m
_e
ad
pc
ad
pc
m
_d
0
44
Bottleneck: Memory Protocol
ST
LSQ
LD
Memory
•Token release to dependents: requires round-trip to memory.
•Limit study: round trip zero time ) up to 6x speed-up.
•Exploring protocol for in-order data delivery & fast token release.
45
pe
g2
_e
pe
g2
_d
mP
4000
pe
gw
it_
d
pe
gw
it_
e
m
m
eg
_e
jp
eg
_d
DSP
110
jp
_e
m
gs
_d
_e
_d
m
21
gs
g7
21
30
g7
_e
m
ad
pc
_d
m
ad
pc
Power [mW]
Power
Xeon
[+cache]
67000
25
20
15
10
5
0
46
Energy Efficiency
1000x
Dedicated hardware
ASH media kernels
Asynchronous mP
FPGAs
General-purpose DSP
Microprocessors
0.01
0.1
1
10
100
1000
Energy Efficiency [Operations/nJ]
47
Outline
Problems of current architectures
+ Compiling ASH
+ Pipelining
+ ASH Evaluation
= Future/related work & conclusions
48
Related Work
Asynchronous
circuits
Nanotechnology
Embedded
systems
Reconfigurable
computing
Dataflow
machines
High-level
synthesis
Computer
architecture
Compilation
49
Future Work
• Optimizations for
area/speed/power
• Memory partitioning
• Concurrency
• Compiler-guided layout
• Explore extensible ISAs
• Hybridization with superscalar mechanisms
• Reconfigurable hardware support for ASH
• Formal verification
50
Grand Vision:
Certified Circuit Generation
• Translation validation: input ´ output
• Preserve input properties
– e.g., C programs cannot deadlock
– e.g., type-safe programs cannot crash
• Debug, test, verify only at source-level
HLL
IR
IRopt
Verilog
gates
layout
formally validated
How far can you go?
51
Conclusions
Spatial computation strengths
Feature
No interpretation
Advantages
Energy efficiency, speed
Spatial layout
Short wires, no contention
Asynchronous
Low power, scalable
Distributed
No global signals
Automatic compilation Design productivity, no ISA
52
• Reconfigurable hardware
• Critical paths
• Control logic
• ASH vs ...
• ASH weaknesses
• Exceptions
• Normalized area
• Why C?
• Splitting memory
• More performance
• Recursive calls
Backup
Slides
53
Reconfigurable Hardware
Interconnection
network
Universal gates
and/or
storage elements
Programmable switches
54
Main RH Ingredient: RAM Cell
a0
a1
0
0
0
1
data
a0
a1
a1 & a2
Universal gate = RAM
data in
0
control
Switch controlled by a 1-bit RAM cell
back
55
Critical Paths
b
if (x > 0)
y = -x;
else
y = b*x;
x
*
0
-
>
!
y
56
Lenient Operations
b
if (x > 0)
y = -x;
else
y = b*x;
x
*
0
-
>
!
y
Solves the problem of unbalanced paths
back
back to talk
57
Asynchronous Control
ackout
rdyin
D
C
ackin
rdyout
datain
back
back to talk
Reg
=
dataout
HLL to HW
High-level
Synthesis
Reconfigurable
Computing
Asynchronous
circuits
Behavioral
HDL
C
[subsets]
Concurrent
Language
Synchronous
Hardware
Hardware
configuration
Asynchronous
Hardware
(spatial computation)
Prior work
This research
59
CASH vs High-Level Synthesis
• CASH: the only existing tool to translate
complete ANSI C to hardware
• CASH generates asynchronous circuits
• CASH does not treat C as an HDL
– no annotations required
– no reactivity model
– does not handle non-C, e.g., concurrency
back
60
ASH Weaknesses
•
•
•
•
•
•
Low efficiency for low-ILP code
Does not adapt at runtime
Monolithic memory
Resource waste
Not flexible
No support for exceptions
61
ASH Weaknesses (2)
• Both branch and join not free
• Static dataflow
(no re-issue of same instr)
• Memory is “far”
• Fully static
– No branch prediction
– No dynamic unrolling
– No register renaming
• Calls/returns not lenient
back
62
Branch Prediction
i
ASH crit path
for (i=0; i < N; i++) {
...
CPU crit path
1
+
<
exception
if (exception) break;
}
Predicted not taken
Effectively a noop for CPU!
back Predicted taken.
!
&
result available before inputs63
Exceptions
• Strictly speaking, C has no exceptions
• In practice hard to accommodate
exceptions in hardware implementations
• An advantage of software flexibility:
PC is single point of execution control
Low ILP computation
+ OS + VM + exceptions
CPU
$$$
ASH
High-ILP
computation
Memory
back
64
Why C
• Huge installed base
• Embedded specifications written in C
• Small and simple language
– Can leverage existing tools
– Simpler compiler
• Techniques generally applicable
• Not a toy language
back
65
4000
g
e
it_
d
it_
_e
_d
4500
av
gw
pe
gw
pe
g2
pe
m
g2
_e
eg
_d
_e
m
eg
pe
m
jp
jp
gs
_d
_e
m
21
gs
g7
_d
_e
m
_d
m
21
pc
pc
g7
ad
ad
Megaoperations per second
Performance
5000
MOPSall
MOPSspec
MOPS
3500
3000
2500
2000
1500
1000
500
0
66
rasta
pegwit_e
pegwit_d
mpeg2_e
mpeg2_d
mesa
jpeg_e
jpeg_d
gsm_e
gsm_d
g721_e
g721_d
epic_e
epic_d
adpcm_e
adpcm_d
Parallelism Profile
25
CPU
ASH
20
15
10
5
4
0
67
Normalized Area
120
2.5
Lines/sq mm
sq mm/kbyte
100
2
80
1.5
60
1
40
0.5
20
back to talk
av
g
e
gw
it_
d
pe
gw
it_
e
pe
pe
g
2_
d
m
pe
g
2_
e
m
eg
_
jp
d
eg
_
jp
_e
gs
m
_d
gs
m
_e
21
g7
g7
21
_e
pc
m
ad
pc
m
ad
back
_d
0
_d
0
68
Memory Partitioning
• MIT RAW project: Babb FCCM ‘99,
Barua HiPC ‘00,Lee ASPLOS ‘00
• Stanford SpC:
Semeria DAC ‘01, TVLSI ‘02
• Berkeley CCured: Necula POPL ‘02
• Illinois FlexRAM: Fraguella PPoPP ‘03
• Hand-annotations #pragma
back
back to talk
69
Memory Complexity
RAM
addr
LSQ
data
back
back to talk
70
Recursion
save live values
recursive call
restore live values
back
stack
71
Me?
72