Modeling of Imagine architecture


technische universiteit eindhoven
Modeling of Architectures
Embedded Computer Architecture
5KK73
Henk Corporaal
Bart Mesman
Hamed Fatemi
2011
Department of Electrical Engineering
Electronic Systems
‘Nothing is built on stone; all is built on sand,
but we must build as if the sand were stone.’
Jorge Luis Borges (Argentine writer 1899-1986)
Outline
• We will look at models for Area, Delay and Energy
• Processor structure
• Register files - Register cell
• Model (area, power, delay)
• details for several register file configurations
• Apply this to the Imagine architecture
• Stream register file (SRF)
• Network
Processor
• Single processor
  • Instruction Memory (IM)
  • Controller
  • Processing Element (PE)
  • Register File (RF)
  • ALU
  • Data Memory (DM)
• SIMD
  • Multiple PEs
• VLIW
  • Multiple ALUs
• Multi-Processor
  • Several processors
  • Connected by a bus or network
Register File (RF) Area model
• Assume:
• p = number of ports
• For a large RF, the row decoder is small compared to the cell area
• 1-bit cell area = w*h (in wire tracks)
• 1 wordline and 1 bitline needed per port, so each port adds one track in each dimension:
$A_{cell}(p) = (w + p)(h + p)$
• If p is large: $A(p) \propto p^2$
Schematic of 1 register cell (1 bit of size w*h)
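The cell-area relation on this slide can be tried out numerically. The following is a minimal Python sketch, assuming a base cell of w = h = 4 tracks (placeholder values, not taken from the slides); the helper names `cell_area` and `rf_area` are hypothetical.

```python
def cell_area(p, w=4, h=4):
    """Area of one register cell in tracks^2.

    Each port adds one wordline and one bitline, i.e. one track per dimension.
    w and h (base cell size in tracks) are assumed placeholder values.
    """
    return (w + p) * (h + p)


def rf_area(num_regs, bits, p, w=4, h=4):
    """Register file area as (number of bits) * (cell area), ignoring the row decoder."""
    return num_regs * bits * cell_area(p, w, h)


if __name__ == "__main__":
    for p in (1, 2, 4, 8, 16, 32, 64):
        a = cell_area(p)
        # The ratio a / p^2 approaches 1 as p grows, i.e. A(p) ~ p^2.
        print(f"p={p:3d}  A_cell={a:5d}  A_cell/p^2={a / p**2:.2f}")
```

For small p the fixed w*h part still matters; the quadratic term only takes over once p is well above w and h.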
Register file (RF) Delay model
Delay (d):
• Wire propagation delay
• Fan-in/out delay
• Delay ~ wire length ~ connected cells
• R = number of registers, each b bits wide => Nbits = bR
• Assuming square bit-layout:
$d \propto (w + p)(bR)^{1/2} + (h + p)(bR)^{1/2} \propto p R^{1/2}$ (for large p wiring dominates)
Note: for N FUs (ALUs), p ~ 3N, R ~ N → $d \sim N^{3/2}$
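The delay relation can be checked with a few lines of Python. This is a sketch under stated assumptions: delay is taken as proportional to wire length only, and w = h = 4 tracks, r = 8 registers per ALU and b = 32 bits are placeholder values. The names `rf_delay` and `central_delay` are hypothetical.

```python
import math


def rf_delay(num_regs, bits, p, w=4, h=4):
    """Relative delay: word-line length plus bit-line length in a square layout."""
    side = math.sqrt(bits * num_regs)  # cells along one edge of the array
    return (w + p) * side + (h + p) * side


def central_delay(n_alus, regs_per_alu=8, bits=32):
    """Central RF serving n_alus ALUs: p = 3N ports, R = rN registers."""
    return rf_delay(regs_per_alu * n_alus, bits, 3 * n_alus)


if __name__ == "__main__":
    for n in (8, 16, 32, 64, 128):
        growth = central_delay(2 * n) / central_delay(n)
        # Doubling N should approach a factor 2^(3/2), about 2.83, for large N.
        print(f"N={n:4d} -> {2 * n:4d}: delay x{growth:.2f}")
```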
Register file (RF) Power model
• Power (P):
• Proportional to the capacitance that must be switched for each access
• Each access switches every bit line and one word line, so the bit-line capacitance dominates
• Each port drives $(bR)^{1/2}$ bit lines
• Each bit line has length $(h + p)(bR)^{1/2}$
$P_{1\,port} \propto bR\,(h + p)\,C_w$
$P_{p\,ports} \propto R p^2$
If p is large: power is dominated by wire capacitance
Note: for N FUs (ALUs), p ~ 3N, R ~ N → $P \sim N^3$
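The power relation follows from multiplying the number of bit lines per port by the length of each bit line. A small Python sketch of that product, assuming h = 4 tracks, a unit wire capacitance `c_w`, r = 8 registers per ALU and b = 32 bits (all placeholder values); the function names are hypothetical.

```python
import math


def port_power(num_regs, bits, p, h=4, c_w=1.0):
    """Relative power of one port: bit lines driven times capacitance per bit line."""
    side = math.sqrt(bits * num_regs)
    bit_lines = side                   # bit lines driven by one port
    bit_line_length = (h + p) * side   # in tracks
    return bit_lines * bit_line_length * c_w  # equals bR (h + p) C_w


def rf_power(num_regs, bits, p, h=4, c_w=1.0):
    """Power with all p ports accessed in the same cycle (an assumption of this sketch)."""
    return p * port_power(num_regs, bits, p, h, c_w)


if __name__ == "__main__":
    r, b = 8, 32
    for n in (8, 16, 32, 64, 128):
        growth = rf_power(r * 2 * n, b, 6 * n) / rf_power(r * n, b, 3 * n)
        # Doubling N should approach a factor 2^3 = 8 for large N.
        print(f"N={n:4d} -> {2 * n:4d}: power x{growth:.2f}")
```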
Register File organization
• Processor with one level of register files
Central (shared register file): ALU 1 ... ALU N
DRF (distributed register file): ALU 1 ... ALU N
Comparing Area model of Central and Distributed RF
Central (shared) RF:
• 2 read ports, one write port per ALU
• N: number of ALUs
• r: number of registers per ALU
• R = rN: number of registers of b bits
$A \propto rNb\,(3N + h)(3N + w)$
$A \propto N^3$
DRF:
• Only 2 ports: one read, one write
• This would give A(1 RF) ~ N
• Area of switch has same area cost complexity (square layout & organization)
• $A \propto N^2$ for the DRF, including the 2N*N crossbar
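To see how quickly the two organizations diverge, the area expressions can be evaluated side by side. This is only a sketch: the central formula is the one on the slide, but the DRF crossbar cost (each of the 2N*N crosspoints charged b*b tracks^2) is an assumption made here for illustration, as are r = 8, b = 32 and w = h = 4.

```python
def central_area(n, r=8, b=32, w=4, h=4):
    """Central RF: rN registers of b bits, 3N ports."""
    return r * n * b * (3 * n + h) * (3 * n + w)


def drf_area(n, r=8, b=32, w=4, h=4):
    """DRF: N two-ported register files plus a 2N x N crossbar.

    Crossbar cost is an assumption of this sketch: each of the 2N * N
    crosspoints switches a b-bit bus and is charged b * b tracks^2.
    """
    files = n * r * b * (w + 2) * (h + 2)
    crossbar = (2 * n) * n * b * b
    return files + crossbar


if __name__ == "__main__":
    for n in (4, 16, 64, 256):
        ratio = central_area(n) / drf_area(n)
        # The ratio grows roughly linearly in N (N^3 versus N^2).
        print(f"N={n:4d}  central/DRF area ratio = {ratio:.2f}")
```

With these placeholder constants the crossbar term dominates the DRF area already at modest N, which is what produces the N^2 trend.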
Delay and Power models of central versus distributed RF
Assume N ALUs
• Central RF:
• #registers R = rN
• #ports p = 3N
• For large N: $d \propto N^{3/2}$, $P \propto N^3$
• DRF:
• Constant #registers per ALU
• #ports p = 2 (also constant!)
• DRF has a fixed delay and power (per RF)
• Wire propagation determines delay and power (for large N)
• For large N: $d \propto N$, $P \propto N^2$
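The asymptotic trends on this slide (central RF: d ~ N^(3/2), P ~ N^3; DRF: d ~ N, P ~ N^2) translate directly into growth factors when the ALU count is scaled. A short Python sketch, with a hypothetical helper `growth_factor`; it only restates the exponents and ignores all constant factors.

```python
# Exponents read off the slide: cost ~ N^exponent for large N.
EXPONENTS = {
    "central": {"delay": 1.5, "power": 3.0},
    "drf": {"delay": 1.0, "power": 2.0},
}


def growth_factor(organization, metric, k):
    """Factor by which a metric grows when the ALU count is multiplied by k."""
    return k ** EXPONENTS[organization][metric]


if __name__ == "__main__":
    k = 4  # e.g. going from 8 to 32 ALUs
    for org in ("central", "drf"):
        d = growth_factor(org, "delay", k)
        p = growth_factor(org, "power", k)
        print(f"{org:8s} N x{k}: delay x{d:.0f}, power x{p:.0f}")
```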
Register File
Register (memory) storage and communication between ALUs are critical parts for area, energy and performance in a media processor.
Hierarchical register storage
2-levels register files (Hierarchical)
Central: RF2 (level 2), RF1 (level 1), ALU 1 ... ALU N
DRF: RF2 (level 2), RF1 (level 1), ALU 1 ... ALU N
• RF1 serves the ALUs, while RF2 is used to cover the memory latency
• The overall area trend is the same as with a one-level RF
Register Files
• Processor with stream register files:
• Replace each port into the memory staging RF with a stream buffer
• All stream buffers share a single port into the memory staging RF,
allowing that single physical port to act as many logical ports.
Central: ALU 1 ... ALU N
Register Files
• The payoff of the transformation into a stream architecture is that we can achieve an area proportional to N^2, since R2 (memory storage) only needs 1 port. We also have to add in the area of the stream buffers, which grows as N^2 with a very small constant.
DRF: ALU 1 ... ALU N
Results
Area per ALU (normalized to 1 ALU)
Results
Local delay
Results
Power overhead
Imagine Architecture
Cell placement of Imagine
Die Photo of Imagine
Imagine Floorplan
• 22 million transistors
• 500 MHz
• Area, Energy, Delay models
• Clusters, Microcontroller, SRF, Network
Floorplan labels: Stream Controller, Micro-Controller, ALU Cluster 0 to ALU Cluster 7, SRF, Memory System, Interface, Network Interface; dimensions labeled 7.8mm and 7.6mm
Stream Register File
Network:
• Area of the network grows (like the DRF switch) as $A \propto C^2$, where $C = N_c$ = number of clusters
$A_{total} = C\,A_{SRF} + A_{micro} + C\,A_{cluster} + A_{comm}$
$E_{total} = C\,E_{SRF} + E_{micro} + C\,E_{cluster} + E_{comm}$
More details in the Khailany paper [2003]
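The total-cost sum on this slide can be written as a small function. The sketch below is in Python and assumes the energy sum has the same form as the area sum; the class `PerUnitCost`, the function `total_cost` and all numeric values are hypothetical and chosen only to show the shape of the model, with the communication term growing as C^2.

```python
from dataclasses import dataclass


@dataclass
class PerUnitCost:
    """Component costs; use one instance for area and another for energy."""
    srf: float         # SRF share per cluster
    micro: float       # microcontroller, shared by all clusters
    cluster: float     # one ALU cluster
    comm_coeff: float  # network coefficient, comm cost = comm_coeff * C^2


def total_cost(c, cost):
    """C * SRF + micro + C * cluster + comm, with comm growing as C^2."""
    comm = cost.comm_coeff * c ** 2
    return c * cost.srf + cost.micro + c * cost.cluster + comm


if __name__ == "__main__":
    # Placeholder numbers in arbitrary units, not measured Imagine data.
    area = PerUnitCost(srf=2.0, micro=5.0, cluster=10.0, comm_coeff=0.1)
    for c in (8, 16, 32, 64, 128):
        total = total_cost(c, area)
        comm_share = area.comm_coeff * c ** 2 / total
        print(f"C={c:4d}  total={total:9.1f}  network share={comm_share:.0%}")
```

Because only the network term is quadratic in C, its share of the total keeps rising as clusters are added, while the SRF and cluster terms grow linearly.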
Exploration
Intra-cluster scaling
Exploration
Inter-cluster scaling
end
• More details:
• Scott Rixner, William J. Dally, Brucek Khailany, Peter Mattson, Ujval J. Kapasi, and John D. Owens. Register Organization for Media Processing. In Proceedings of the 6th International Symposium on High-Performance Computer Architecture (HPCA), pages 375–386, Toulouse, France, January 2000. IEEE Computer Society.
• Brucek Khailany, William Dally, Scott Rixner, Ujval Kapasi, John Owens, and Brian Towles. Exploring the VLSI Scalability of Stream Processors. In Proceedings of the Ninth Symposium on High Performance Computer Architecture (HPCA), pages 153–164, Anaheim, California, USA, February 2003. IEEE Computer Society.