Modeling of Imagine architecture

technische universiteit eindhoven
Modeling of Architectures
Platform-based Design
5KK70
Henk Corporaal
Bart Mesman
Hamed Fatemi
2010
Department of Electrical Engineering
Electronic Systems
‘Nothing is built on stone; all is built on sand,
but we must build as if the sand were stone.’
Jorge Luis Borges (Argentine writer 1899-1986)
Outline
• We will look at models for Area, Delay and Energy
• Processor structure
• Register files - Register cell
• Model (area, power, delay)
• details for several register file configurations
• Apply this to the Imagine architecture
• Stream register file
• Network
Platform-based Design 5KK70
Electronic Systems
Processor
• Single processor
- Instruction Memory (IM)
- Controller
- Processing Element (PE)
- Register File (RF)
- ALU
- Data Memory (DM)
• SIMD
- Multiple PEs
• VLIW
- Multiple ALUs
• Multi-Processor
- Several processors
- Connected by a bus or network
[Figure: block diagram labeled IM, Controller, PE, RF, ALU, DM, Network]
Register File (RF) Area model
• Assume:
• p = number of ports
• For a large RF, the row decoder is small compared to the cell area
• Area of a 1-bit cell without ports = w*h (in wire tracks)
• With p ports: $A_{cell}(p) = (w + p)(h + p)$
• If p is large: $A_{cell}(p) \propto p^2$
[Figure: schematic of one 1-bit register cell]
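The cell-area model above can be checked numerically. The following is a minimal sketch, not part of the original slides: the function name and the default values of w and h are hypothetical and chosen only for illustration.

```python
def cell_area(p, w=4, h=4):
    """Area (in tracks^2) of a 1-bit register cell with p ports.

    w and h are the width and height of the port-less cell in wire
    tracks. The default values are hypothetical, not from the slides.
    """
    return (w + p) * (h + p)


if __name__ == "__main__":
    for p in (1, 3, 12, 48, 192):
        # For large p the ratio area / p^2 approaches 1, i.e. area ~ p^2.
        print(p, cell_area(p), cell_area(p) / p**2)
```

For a small number of ports the base cell size (w, h) still matters; for a large number the quadratic term in p dominates.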
Register file (RF) Delay model
Delay (d):
• Wire propagation delay
• Fan-in/fan-out delay
• Wire propagation dominates the delay for a large number of ports
• R = number of registers
Register file:
- assuming square layout
- R registers of b bits
$d \propto (w + p)(bR)^{1/2} + (h + p)(bR)^{1/2} \propto p R^{1/2}$
Note: for N FUs (ALUs), p ~ 3N and R ~ N, so $d \propto N^{3/2}$
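The delay trend can be sketched as the sum of the word-line and bit-line lengths of a square layout. This is an illustrative sketch only: the function names, the word width b, the registers-per-ALU count r, and the cell dimensions w and h are assumed values, not figures from the slides.

```python
import math


def rf_wire_delay(p, R, b=32, w=4, h=4):
    """Relative wire-propagation delay of a square register file.

    Delay is taken as proportional to word-line plus bit-line length.
    b, w and h are hypothetical illustration values.
    """
    side = math.sqrt(b * R)        # cells along one side of the square
    word_line = (w + p) * side     # horizontal wire length in tracks
    bit_line = (h + p) * side      # vertical wire length in tracks
    return word_line + bit_line


def central_rf_delay(N, r=8, **kw):
    """Central RF for N ALUs: p = 3N ports, R = r*N registers (r assumed)."""
    return rf_wire_delay(p=3 * N, R=r * N, **kw)


if __name__ == "__main__":
    # Doubling N should multiply the delay by about 2^(3/2) for large N.
    for N in (4, 64, 1024):
        print(N, central_rf_delay(2 * N) / central_rf_delay(N))
```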
Register file (RF) Power model
• Power (P):
• Proportional to the capacitance that must be switched for each access
• Each access switches every bit line and one word line → bit-line capacitance dominates
• Each port drives $(bR)^{1/2}$ bit lines
• Each bit line has length $(h + p)(bR)^{1/2}$
Register file:
$P_{1\,port} \propto bR\,(h + p)\,C_w$
$P_{p\,ports} \propto R\,p^2$
If p is large: power is dominated by wire capacitance
Note: for N FUs (ALUs), p ~ 3N and R ~ N, so $P \propto N^3$
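The power model can be sketched the same way. This is a minimal illustration under stated assumptions: the function names are hypothetical, c_w is treated as a wire capacitance per track, and b, r and h are arbitrary example values.

```python
def rf_access_power(p, R, b=32, h=4, c_w=1.0):
    """Relative switching power of a register file with p active ports.

    One port switches (bR)^(1/2) bit lines of length (h + p)(bR)^(1/2),
    i.e. b*R*(h + p) tracks of wire. b, h and c_w are assumed values.
    """
    one_port = b * R * (h + p) * c_w
    return p * one_port


def central_rf_power(N, r=8, **kw):
    """Central RF for N ALUs: p = 3N ports, R = r*N registers (r assumed)."""
    return rf_access_power(p=3 * N, R=r * N, **kw)


if __name__ == "__main__":
    # Doubling N should multiply the power by about 2^3 = 8 for large N.
    for N in (4, 64, 1024):
        print(N, central_rf_power(2 * N) / central_rf_power(N))
```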
Register File organization
• Processor with a one-level register file
Central (shared) register file:
[Figure: ALU 1 ... ALU N]
DRF (distributed register file):
[Figure: ALU 1 ... ALU N]
Comparing Area model of Central and Distributed RF
Central (shared) RF:
• 2 read ports, one write port per ALU
• R = rN: number of registers of b bits
• r: number of registers per ALU
• N: number of ALUs
$A \propto rNb\,(3N + h)(3N + w)$, so $A \propto N^3$
DRF:
• Only 2 ports: one read, one write
• This would give A(1 RF) ~ N
• Area of the switch has the same area cost complexity
$A \propto N^2$
[Figure: square layout and organization of the DRF, including the 2N*N crossbar]
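To see how quickly the two organizations diverge, the area expressions can be evaluated for growing N. This is a sketch under assumptions: the function names, the defaults for r, b, w and h, and the DRF constant k are all hypothetical, since the slides give only the proportionalities.

```python
def central_rf_area(N, r=8, b=32, w=4, h=4):
    """Central RF area in tracks^2: r*N registers of b bits, 3N ports."""
    return r * N * b * (3 * N + h) * (3 * N + w)


def drf_area_trend(N, k=1.0):
    """DRF area trend, proportional to N^2 (k is an unknown constant)."""
    return k * N**2


if __name__ == "__main__":
    for N in (2, 8, 32, 128):
        growth_central = central_rf_area(2 * N) / central_rf_area(N)
        growth_drf = drf_area_trend(2 * N) / drf_area_trend(N)
        # Central growth tends to 8 (N^3); DRF growth is 4 (N^2).
        print(N, round(growth_central, 2), growth_drf)
```

Only the growth factors are meaningful here; absolute values would need the real layout constants.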
Delay and Power models of central versus distributed RF
Assume N ALUs
• Central RF:
• #registers R = rN
• #ports p = 3N
• For large N: $d \propto N^{3/2}$, $P \propto N^3$
• DRF:
• Constant #registers per ALU
• #ports p = 2 (also constant!)
• DRF has a fixed delay and power (per RF)
• Wire propagation determines delay and power (for large N)
• For large N: $d \propto N$, $P \propto N^2$
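The asymptotic trends for the two organizations can be collected in a small lookup table. The sketch below assumes only the large-N exponents stated in these slides; the names SCALING_EXPONENTS and growth are hypothetical.

```python
# Large-N scaling exponents for area, delay and power, as functions of
# the number of ALUs N (cost ~ N ** exponent).
SCALING_EXPONENTS = {
    "central": {"area": 3.0, "delay": 1.5, "power": 3.0},
    "drf": {"area": 2.0, "delay": 1.0, "power": 2.0},
}


def growth(organization, metric, factor):
    """Cost multiplier when the number of ALUs grows by `factor`."""
    return factor ** SCALING_EXPONENTS[organization][metric]


if __name__ == "__main__":
    for org in SCALING_EXPONENTS:
        for metric in ("area", "delay", "power"):
            print(org, metric, growth(org, metric, 2))
```

This ignores constant factors, so it only indicates which organization scales better, not where the crossover lies.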
Register File
Register (memory) storage and communication between ALUs are the
critical parts of a media processor for area, energy, and
performance.
→ Hierarchical register storage
Two-level register files (hierarchical)
Central:
[Figure: RF2 (level 2), RF1 (level 1), ALU 1 ... ALU N]
DRF:
[Figure: RF2 (level 2), RF1 (level 1), ALU 1 ... ALU N]
• RF1 serves the ALUs, while RF2 is used to cover the memory latency
• The overall area trend is the same as with a one-level RF
Register Files
• Processor with stream register files:
• Replace each port into the memory staging RF with a stream buffer
• All stream buffers share a single port into the memory staging RF,
allowing that single physical port to act as many logical ports.
Central:
[Figure: ALU 1 ... ALU N]
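The idea of one physical port acting as many logical ports can be sketched as a toy model. Everything here is an assumption for illustration: the class name, the block size, the refill-when-empty rule, and the round-robin arbitration are not taken from the slides.

```python
from collections import deque


class SharedPortRF:
    """Staging RF with one physical port and several stream buffers."""

    def __init__(self, streams, block=4):
        # One queue per logical stream held in the staging RF.
        self.storage = [deque(s) for s in streams]
        # One small stream buffer per logical port.
        self.buffers = [deque() for _ in streams]
        self.block = block
        self.next = 0  # round-robin pointer (assumed policy)

    def cycle(self):
        """Use the single physical port once: refill one empty buffer.

        Returns the index of the buffer served, or None if no buffer
        needed a refill.
        """
        n = len(self.buffers)
        for i in range(n):
            k = (self.next + i) % n
            if self.storage[k] and not self.buffers[k]:
                # One wide access moves a whole block into the buffer.
                for _ in range(min(self.block, len(self.storage[k]))):
                    self.buffers[k].append(self.storage[k].popleft())
                self.next = (k + 1) % n
                return k
        return None

    def read(self, k):
        """Logical port k: pop one word from stream buffer k, if any."""
        return self.buffers[k].popleft() if self.buffers[k] else None
```

Each client reads one word at a time from its own buffer, while the wide physical port is time-multiplexed among the buffers.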
Register Files
• The payoff of the transformation into a stream architecture is that we
can achieve an area proportional to N^2, since RF2 (memory storage)
needs only 1 port. We also have to add the area of the stream
buffers, which grows as N^2 with a very small constant.
DRF:
[Figure: ALU 1 ... ALU N]
Results
[Figure: area per ALU, normalized to 1 ALU]
Results
[Figure: local delay]
Results
[Figure: power overhead]
Imagine Architecture
[Figure: cell placement of Imagine]
[Figure: die photo of Imagine]
Imagine Floorplan
• 22 million transistors
• 500 MHz
• Area, Energy, Delay models
• Clusters, Microcontroller, SRF, Network
[Figure: floorplan showing the Stream Controller, Micro-Controller, ALU Clusters 0-7, SRF, Memory System, and Network Interface; dimension labels 7.8mm and 7.6mm]
Stream Register File
Network:
• Area of the network grows (like the DRF switch) as:
$A \propto C^2$
C (also written $N_c$) = number of clusters
$A_{total} = C A_{SRF} + A_{micro} + C A_{cluster} + A_{comm}$
$E_{total} = C E_{SRF} + E_{micro} + C E_{cluster} + E_{comm}$
More details in Khailany et al. [2003]
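The totals above are simple sums over the components, with the SRF and cluster terms replicated per cluster. The sketch below shows the bookkeeping only: the function names and all component values are hypothetical placeholders, not numbers from the slides or from the Khailany paper.

```python
def network_cost(C, k=1.0):
    """Inter-cluster network cost, growing with C^2 (k is assumed)."""
    return k * C**2


def total_cost(C, srf, micro, cluster, comm):
    """Total area (or energy) of a stream processor with C clusters.

    srf and cluster are per-cluster costs; micro and comm are costs of
    the microcontroller and the communication network for the whole chip.
    """
    return C * srf + micro + C * cluster + comm


if __name__ == "__main__":
    for C in (8, 16, 32, 64):
        total = total_cost(C, srf=1.0, micro=2.0, cluster=3.0,
                           comm=network_cost(C, k=0.5))
        # Per-cluster cost rises with C because the network term is C^2.
        print(C, total, total / C)
```

The same function serves for area and energy, since both totals have the same structure.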
Exploration
Intra-cluster scaling
Exploration
Inter-cluster scaling
end
• More details:
• Scott Rixner, William J. Dally, Brucek Khailany, Peter Mattson,
Ujval J. Kapasi, and John D. Owens. Register Organization for
Media Processing. In Proceedings of the 6th International
Symposium on High-Performance Computer Architecture
(HPCA), pages 375–386, Toulouse, France, January 2000. IEEE
Computer Society.
• Brucek Khailany, William Dally, Scott Rixner, Ujval Kapasi, John
Owens, and Brian Towles. Exploring the VLSI Scalability of
Stream Processors. In Proceedings of the Ninth Symposium on
High Performance Computer Architecture (HPCA), pages 153–164,
Anaheim, California, USA, February 2003. IEEE Computer
Society.