
Virtual and Physical Cellular Architectures for Kilo-processor Chip Computers
Tamás Roska
Hungarian Academy of Sciences and Pázmány P. Catholic University, Budapest, Hungary
IEEE CNNA 2010
A new era in computing starts
• NOT just parallel
• NOT using the present algorithms
• NOT massaging the nanoscale devices to repeat
the functional primitives of CMOS building
blocks
• Running time is NOT the only measure
• There are NO efficient tools
Key: understanding spatial-temporal algorithmic
dynamics
Cellular Wave Computers: one way of thinking
Table of Contents
• The technology trend
• The six prototype platforms
• Virtual and Physical Cellular Machines and
the Design Scenario
• Processor and memory array representations
• Design tools and principles
• Qualitative theory of operators with local
connectedness
• New principles of Computational Complexity
The Technology Trend
• ~ 30 nm : over 10 billion transistors
• Severe limits for clock speed and power dissipation
• 25 k mixed mode processors and sensors on a
cellular visual microprocessor at 180 nm
• 45 nm: 70 k logic- and 2k arithmetic- processors
• about 1 million 8 bit microprocessors could be placed
(~5 Billion transistors) ...Why not?
• Wire delay is bigger than gate delay, hence
communication speed is limited (synchrony radius)
Hence
• The precedence of locality: a cellular, mainly locally connected architecture is a must for high computing power
International Technology Roadmap for
Semiconductors, ITRS/2007 /2009
…but it also doesn't scale terribly well.
The six prototype platforms
- related new products
1.CELL Broadband Engine Architecture
(CBEA) and high-end supercomputer
•Heterogeneous, multi-core CELL
Multiprocessor chip
–241M transistors, 235 mm²
–200 GFlops (SP) @3.2GHz
–200 GB/s bus (internal) @ 3.2GHz
–dual XDR controller (25.6GB/s)
–two configurable interfaces
(76.8GB/s)
•Power Processor Element (PPE)
–General purpose processor
•Synergistic Processor Element
(SPE)
– SIMD-only engine
–fast access to 256KB local
memories
–Globally coherent DMA to transfer
data
•Racks, cabinets, PetaFlops
systems
2. FPGAs – three cell arrays: logic- and
arithmetic- processors, and memories
XILINX VIRTEX 6 FPGA
• Over 70k Advanced Silicon Modular Logic
Blocks in a cellular array
• Over 2k DSP "slices" in a cellular array
• 36 kB memory blocks in a cellular array
• Application Specific array components
• Logic and arithmetic processor arrays and
memory arrays are tiled on a 2D plate
• Local directional preference (stream)
Xilinx Virtex-5 FPGA architecture
[Figure: the Virtex-5 fabric as a 2D tiled array of CLBs interleaved with columns of block RAM (BRAM) memories and DSP slices (18x25-bit multipliers).]
3. Stacked arrays – GPU: Graphics Processing Units
Fermi ( NVIDIA -2010):
- in 32 core units / 16 kB local memory
- 512 cores (16x32)
- IEEE 754-2008 32-bit and 64-bit
- Full 32-bit integer path with 64-bit extensions
- Memory access instructions - transition to 64bit address
Fermi's 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip containing an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).
• In each core an integer/logic (64 bit)
unit and an arithmetic unit (32/64
floating point)
• Local and Global memory in a cache
architecture
• Possibility to use several 512 core
chips
• Fully cellular architecture
• Physical Cellular Machine
4. and 5. Sensory integrated chips
4. with mixed-mode cell processors – Eye-RIS
5. with digital cell processors – XENON
Eye-RIS v.1.2, AnaFocus Ltd.
Many kilo-processor chips using a cellular architecture
Eye-RIS camera computer: 25 k processors with optical sensors, 1000 frames per second
AnaFocus Ltd. Seville
XENON V3 (MDA SBIR Phase II Project:
ROIC No2, 64x64 prototype)
[Figure: XENON V3 architecture – photo of the ASIC processor layer. A cell (processor) array of 64 processors (8x8) serves 4096 sensors (64x64); each cell contains a pixel (sensor) array, ADC, processor, and memory (connected to its neighbors), together with a MUX, a communication block, a general scheduler P, and program memory. The topographic sensor-processor arrangement is shown; the integrated InGaAs/InP sensors are not visible in the photo.]
EUTECUS, Inc. – proprietary
Processing speed (depends on algorithmic complexity): > 1000 fps
Programming: general purpose
6. Cellular big cores:
Intel Cloud 48 - 2010
48 IA processors in a cellular communication scheme
Integrated Systems
• Bi-i systems with different sensor inputs
and outputs and mobile versions
• Desktop high-end systems
• VISCUBE 3D integrated circuits
Desktop Megaprocessor Computer
Viscube 3D
architecture
Eutecus-SZTAKI-Anafocus
[Figure: VISCUBE 3D stacked architecture – a 320x240 sensor layer; a mixed-signal layer with a 160x120 pixel-parallel processor array and AD conversion (1 bit/pixel mask, ROI-parallel); a digital processing layer with a 32x24 processor array (8 bit/pixel gray values, processor-parallel); and a digital memory layer with a 32x24 memory array (ROI). ONR Grant.]
Virtual and Physical Cellular Machines
and the Design Scenario
The many-core
Cellular Wave Computer Components
The architectural element is defined as
• A processor array placed on a 2D or 3D grid
• The processors communicate with a delay proportional to the distance.
• Typically there are two speed classes: a local clock f0 within a synchrony radius r, and a global clock F0 via a Manhattan-type bus system. In the case of continuous-time dynamics, f ~ 1/T, T being the dominant time constant.
• The local and global clock speeds are adjusted to keep the power dissipation within a prescribed limit Pd.
The Virtual Cellular architecture of a single array is
shown, for a 2D case, in Figure 1 (hiding physical
details)
[Figure 1: the cellular many-core 2D Wave Computer architecture – an m x n processor array with a synchrony radius r around cell (i, j), controlled by a global programming unit.]
New principles
• the geometric address of a single processor in an array introduces new algorithmic principles:
– the precedence of geometric locality (due to
the physical and logical locality), i.e. the
cellular character – granular, topographic, etc.
– cellular wave dynamics as an instruction
– streaming activity pattern in logic
Do NOT parallelize existing single processor
algorithms – invent new cellular algorithms
The physical constraints on
communication speed and power dissipation
The basic constraints are: wire delay and power dissipation. A
side constraint is the number of communication pins (contacts).
The execution of a logic, arithmetic or symbolic elementary
array instruction is defined via r input (u(t)), m output (y(t))
and n state (x(t)) variables (t is the time instant).
This unit is characterized by its
• s, surface/area,
• e, energy,
• f, operating frequency,
• w = e·f, local power dissipation, and
• the signals travel on a wire of length L and width q, at speed vq, introducing a delay of D = L / vq.
• Ω cores can be placed on a single Chip, typically in a rectangular grid, with R input and Q output physical connectors, typically at the corners of the Chip; altogether there are K input/output connectors.
• The maximal power dissipation of the Chip is W.
• The physics is represented by the maximal values of Ω, K, and W (as well as the operating frequency). The operating frequency might be global for the whole Chip (F0), or local within the Chip (f0, fi; some parts might be switched off, fi = 0).
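To make these constraints concrete, here is a minimal back-of-the-envelope sketch in Python (not from the talk); the numeric values for W, Ω, e, the wire speed, and the cell pitch, as well as the function names, are illustrative assumptions.

```python
# Hypothetical illustration of the power and wire-delay constraints above.
# All numbers are made-up placeholders, not device data.

def max_core_frequency(W, omega, e):
    """Largest per-core clock f such that omega cores, each dissipating
    w = e * f, stay within the chip power budget W."""
    return W / (omega * e)

def synchrony_radius(f_local, v_q, pitch):
    """How many neighbor cells (one grid pitch apart) a signal can reach
    within one local clock period, given an effective wire speed v_q."""
    period = 1.0 / f_local
    reach = v_q * period          # distance covered in one clock period
    return int(reach // pitch)    # expressed in cell hops

if __name__ == "__main__":
    W = 5.0            # chip power budget [W]        (placeholder)
    omega = 1_000      # number of active cores       (placeholder)
    e = 5e-12          # energy per operation per core [J] (placeholder)
    f = max_core_frequency(W, omega, e)
    r = synchrony_radius(f, v_q=1.0e7, pitch=1.0e-3)
    print(f"max per-core clock: {f/1e9:.2f} GHz, synchrony radius: ~{r} cells")
```

The same budget logic is what drives the adjustment of the local and global clocks f0 and F0 mentioned above.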
Virtual and Physical Cellular
Machines
• The Virtual Cellular Machine is the modern
version of the Virtual Memory
– (i) to hide the physical details of the huge
number of processing elements (cores, cells,
threads, etc.),
– (ii) to provide the framework for designing the
algorithms with elementary array instructions,
and
– (iii) to serve as the starting phase for the
Virtual to Physical Cellular Machine mapping.
Virtual Cellular Machines
• Its computational building blocks are mainly
arrays, composed of operator and memory
elements, acting in space and time
• Heterogeneous and homogeneous
computational blocks
• Building blocks have no physical constraints (size, bandwidth, speed, power dissipation), except for the two classes of memory access (local and global)
• A typical logic elementary array instruction (LA) is
a binary logic function on n variables, or a memory
look-up table array
• A typical arithmetic/analog elementary array (AA)
instruction is a multiply and accumulate (add) term
(MAC core) array,
• A symbolic elementary array instruction (SA) might
be a string manipulation core array (e.g. a P system)
• a complex cell based array instruction (XA),
hosting cells with all the three above types of data and
instructions
An 8-, 16-, or 32-bit microprocessor could also be considered an elementary array instruction, with an iterative or multi-threaded implementation.
A Virtual Cellular Machine
is composed of five types of building blocks:
• (i) cellular processor arrays/layers, CP, with simple (L,
or A, or S type) or complex (X) cells and their local
memories, these are the protagonist building blocks,
• (ii) classical digital stored program computers, P
(microprocessors),
• (iii) multimodal topographic (T) or non-topographic
inputs and outputs, I/O (e.g. scalar, vector, or matrix
signals),
• (iv) global memories of different data types M,
organized in qualitatively different sizes and access
times (e.g. cache memories), and
• (v) interconnection pathways B (busses).
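As a minimal sketch of how such a machine description might be captured in code, the classes below mirror the five building-block types; the class and field names are illustrative assumptions, not part of the talk.

```python
# Hypothetical sketch of a Virtual Cellular Machine description, following the
# five building-block types listed above. Names and fields are illustrative.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class CellularProcessorArray:        # (i) CP: cellular processor array/layer
    name: str
    size: Tuple[int, int]            # virtual array size (no physical limits)
    cell_type: str                   # "L", "A", "S", or "X"
    local_memory_per_cell: int       # words of local memory per cell

@dataclass
class StoredProgramProcessor:        # (ii) P: classical stored-program computer
    name: str

@dataclass
class TopographicIO:                 # (iii) I/O: topographic or non-topographic
    name: str
    topographic: bool
    shape: Tuple[int, ...]           # scalar, vector, or matrix signal

@dataclass
class GlobalMemory:                  # (iv) M: global memory
    name: str
    size_words: int
    access_class: str                # "local" or "global" access-time class

@dataclass
class Bus:                           # (v) B: interconnection pathway
    name: str
    endpoints: Tuple[str, str]

@dataclass
class VirtualCellularMachine:
    arrays: List[CellularProcessorArray] = field(default_factory=list)
    processors: List[StoredProgramProcessor] = field(default_factory=list)
    ios: List[TopographicIO] = field(default_factory=list)
    memories: List[GlobalMemory] = field(default_factory=list)
    buses: List[Bus] = field(default_factory=list)
```

Such a description deliberately carries no physical sizes or speeds; those appear only after the mapping to a Physical Cellular Machine.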
Simple Example: Global system control & memory
[Figure: a Virtual Cellular Machine example – cellular processor arrays CP1/1 ... CP1/g and CP2/1 ... CP2/h, stored-program processors P0, P1, ..., Pn, two topographic 2D inputs, global memories M1, M2, ..., M, and I/O, connected by local buses b0, b1, b2 and a global bus B0; local clocks f0 and global clocks F0 are indicated on the blocks.]
• Tasks, i.e. the algorithms to be implemented, are defined on the Data/Memory representations of the Virtual Cellular Machine
• Various data representations for
– Topographic (e.g. picture, image, PDE,
molecule dynamics) and
– Non-topographic (e.g. Markov processes,
algebra, number theory)
problems
Physical Cellular Machines
– the model for the physical implementation
of system architectures
We have three elementary programmable Cell processor
(cell core) types used in array implementations
A) An algorithm with input, state and output vectors having real/arithmetic, binary/digital logic, and symbolic variables (typically implemented via digital circuits).
B) A real valued state and output dynamic system with
analog/continuous or arithmetic variables (typically
implemented via mixed mode/analog-and-logic circuits and
digital control processors)
C) A physical dynamic entity with well defined geometric
layout and I/O ports (function in layout) – (typical
implementations are CMOS and/or beyond CMOS nanoscale
designs, or optical architectures with programmable control)
Physical Cellular Machines
The above cell processors have their own physical parameters; hence arrays of them, in building arrangements similar to those of the Virtual Cellular Machines, will be defined by given physical parameters such as the
• Size
• Bandwidth/latency
• Operation speed
• Power dissipation
It might have the same or a different architecture
- compared to the Virtual Cellular Machine
• The geometry of the architectures reflects the physical layout of the chips or systems. A building block could be implemented as a separate chip or as part of a chip. This architectural geometry also defines the communication (interaction) speed ranges; hence physical closeness means higher speed ranges.
• The spatial location or topographic address of each elementary cell processor or cell core, as well as that of each building block within the architecture, together with the communication bandwidth, plays a crucial role.
In addition to the number of processors, these are
the most dramatic differences compared to
classical computer science.
The Design Scenario
[Figure: the design flow – Task/Problem/Workload → Data/object topography on the Virtual Cellular Machine → algorithms → COMPILER → Processor/memory topography on the Physical Cellular Machine → Physical Implementation.]
Processor and memory/Data
array representations
Array – Signals, variables, memory – Processors:
• logic/symbolic array – logic/symbolic values – logic/symbolic processor array
• arithmetic/analog array – arithmetic/analog values – arithmetic/analog processor array
[Figure: memory access distances in the three organizations, in clock cycles (c.) – FPGA organization: ~1 c. local, ~50 c. for 1D, ~200 c. for 2D access; GPU organization: 1 c. local up to ~10-100 c. for more distant memories; CELL organization: many sensor/data/memory units on one Cell processor, a separate sensory/memory plane, and a flow architecture with stages 1, 2, ..., K.]
Design Tools and Principles
• Homogeneous cellular arrays
• Decomposition in 2D
• Decomposition in 1D
• Composing complex cells for a homogeneous FPGA architecture
• Self-organizing distributed control
• The CNN Universal Machine as a compiler
• Nontopographic data representation
• Spatial distribution of arithmetic operators with a large number of terms
• Equivalent transformations for finite-size homogeneous cellular arrays (Á. Zarándy)
Four different methods apply when the size of the Physical Cellular Machine is smaller than that of the Virtual Cellular Machine.
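As an illustration of one such method (2D decomposition), here is a minimal sketch assuming a simple block-tiling scheme; the function names, the array sizes, and the per-tile operation are placeholders, not the talk's actual mapping algorithm.

```python
# Hypothetical sketch: mapping a larger Virtual Cellular Machine array onto a
# smaller Physical Cellular Machine by 2D block decomposition (tiling).
import numpy as np

def run_on_physical_array(tile: np.ndarray) -> np.ndarray:
    """Placeholder for executing one array instruction on the physical chip."""
    return tile  # identity, for illustration only

def map_virtual_to_physical(virtual: np.ndarray, phys_rows: int, phys_cols: int) -> np.ndarray:
    """Process a virtual array larger than the physical one, tile by tile."""
    out = np.empty_like(virtual)
    H, W = virtual.shape
    for r0 in range(0, H, phys_rows):
        for c0 in range(0, W, phys_cols):
            tile = virtual[r0:r0 + phys_rows, c0:c0 + phys_cols]
            out[r0:r0 + phys_rows, c0:c0 + phys_cols] = run_on_physical_array(tile)
    return out

# Example: a 320x240 virtual layer processed on a 100x80 physical processor array
virtual_layer = np.zeros((320, 240), dtype=np.float32)
result = map_virtual_to_physical(virtual_layer, phys_rows=100, phys_cols=80)
```

A real decomposition would also carry the r-wide overlapping borders (halo) that local neighborhood operators need; the 1D stripe example on the CELL BE below shows that exchange explicitly.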
Simple example of a Physical Cellular Machine
implemented on a single chip (FPGA)
[Figure: a chip of 100x80 processors tiled with CP2 sub-arrays of sizes {50x20}, {20x30}, {50x40}, {30x80}, {20x50}, and {50x20}; conditional data paths between the sub-arrays are indicated.]
Spatial decomposition in 1D
IBM CELL BE based design
Partitioning
• Horizontal stripes
• SPE-SPE
Communication
– First and last line
computation
• Row-wise data alignment
• Communication between
2 SPEs can be carried
out by a single DMA
command
[Figure: the CNN cell array partitioned into four horizontal stripes, one per SPE (SPE 1 - SPE 4).]
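A minimal sketch of this stripe partitioning with first/last-line (halo) exchange, written as plain NumPy rather than actual CELL/SPE DMA code; the array sizes and function names are illustrative.

```python
# Hypothetical illustration of 1D stripe decomposition with boundary-row
# exchange; real CELL BE code would move these rows with DMA commands.
import numpy as np

def split_into_stripes(cell_array: np.ndarray, n_spe: int):
    """Cut the CNN cell array into horizontal stripes, one per SPE."""
    return np.array_split(cell_array, n_spe, axis=0)

def exchange_boundary_rows(stripes):
    """Each stripe receives its neighbors' adjacent rows (the 'halo'),
    so a 3x3 neighborhood operator can be evaluated on its own rows."""
    padded = []
    for i, s in enumerate(stripes):
        top = stripes[i - 1][-1:] if i > 0 else s[:1]               # replicate at edges
        bottom = stripes[i + 1][:1] if i < len(stripes) - 1 else s[-1:]
        padded.append(np.vstack([top, s, bottom]))
    return padded

state = np.random.rand(128, 128).astype(np.float32)
stripes = split_into_stripes(state, n_spe=4)
padded_stripes = exchange_boundary_rows(stripes)   # one transfer per neighbor pair
```

On the actual CELL BE, the first and last rows of each stripe are exactly the data moved by the single DMA command mentioned above.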
A compiler for FPGA implementation (Z.Nagy
and P.Szolgay)
• Compiler: the FALCON architecture via the
CNN Universal Machine.
• Special cases for
– Navier Stokes 2D, 2½D, incompressible and
compressible
– Multilayer retina model
• Introducing special FIFOs to overcome the
bandwidth constraint and the global control:
Cellular Distributed Self-organizing Control
structure ==>> streaming activity pattern
Xilinx Virtex-5 FPGA architecture
[Figure: the Virtex-5 fabric as a 2D tiled array of CLBs interleaved with columns of block RAM (BRAM) memories and DSP slices (18x25-bit multipliers).]
Virtex-5 Logic Cell Element
• Configurable Logic Block (CLB) – 2 slices
[Figure: CLB structure – each slice (Slice(0), Slice(1)) contains 6-input LUTs, each configurable as LUT6, 64-bit distributed RAM (RAM64), or 32-bit shift register (SRL32), plus carry & control logic and FF/latches; inputs A1-A6 ... D1-D6 feed the LUTs, outputs AQ ... DQ leave the flip-flops, and Cin/Cout chain the carry logic.]
Arithmetic and Memory Cell
Elements
• Dual-ported memory
• DSP Slice
[Figure: the 36 Kb dual-ported block RAM (ports A and B, each with data in/out, address, write-enable, enable, and clock signals) and the DSP slice (a 25x18-bit multiplier feeding an adder/accumulator, with inputs A, B, C and output P).]
Programmable Interconnect
[Figure: programmable interconnect – each CLB slice (Slice(0), Slice(1)), RAMB36 block, and DSP48 slice connects to the routing fabric through its own switch matrix.]
[Figure: wire delay distribution (ns) over distance in hops (-5 to +5 in x and y); delays range from about 0.25-0.35 ns near the origin to 0.85-0.95 ns at the far corners.]
The CNN State Equation

$$C \dot{x}_{ij}(t) = -\frac{1}{R_x}\, x_{ij}(t) + \sum_{kl \in S_r(ij)} A(i,j,k,l)\, y_{kl}(t) + \sum_{kl \in S_r(ij)} B(i,j,k,l)\, u_{kl}(t) + z_{ij}$$

• Assumed that the input is constant or changing slowly
• Forward Euler discretization
• Feedback equation:
$$x_{ij}[n+1] = \sum_{kl \in S_r(ij)} A(i,j,k,l)\, y_{kl}[n] + g_{ij}$$
• Feed-forward equation:
$$g_{ij} = \sum_{kl \in S_r(ij)} B(i,j,k,l)\, u_{kl} + h\, z_{ij}$$
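A minimal NumPy sketch of this discretized update (not the FALCON hardware itself); the template values, the grid size, and the standard-CNN output nonlinearity y = f(x) are illustrative assumptions.

```python
# Hypothetical sketch of one Forward-Euler CNN update with 3x3 A and B templates.
import numpy as np

def output(x):
    """Standard CNN output nonlinearity y = 0.5*(|x+1| - |x-1|)."""
    return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))

def local_sum(template, field):
    """Template-weighted sum over the 3x3 neighborhood S_r(ij), r = 1,
    with zero padding at the array boundary."""
    padded = np.pad(field, 1, mode="constant")
    acc = np.zeros_like(field)
    for di in range(3):
        for dj in range(3):
            acc += template[di, dj] * padded[di:di + field.shape[0],
                                             dj:dj + field.shape[1]]
    return acc

def cnn_step(x, u, A, B, z, h=1.0):
    """x[n+1] = sum A*y[n] + g,  with  g = sum B*u + h*z  (as on the slide)."""
    g = local_sum(B, u) + h * z
    return local_sum(A, output(x)) + g

# Illustrative templates and data (placeholder values only)
A = np.zeros((3, 3)); A[1, 1] = 2.0
B = np.full((3, 3), -1.0); B[1, 1] = 8.0
u = np.random.rand(64, 64)        # input image
x = np.zeros_like(u)              # initial state
for _ in range(20):
    x = cnn_step(x, u, A, B, z=-0.5)
y = output(x)
```

One such update corresponds to one pass of the templates over the whole array; the FALCON core described next performs the per-cell version of this sum in hardware.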
The Falcon processor core
[Figure: the Falcon core datapath – StateIn, ConstIn, and TemplSelIn enter the Memory Unit and Mixer, which feed the Template Memory and the Arithmetic Unit; the core emits StateOut, ConstOut, and TemplSelOut.]
• Memory unit
– Contains a belt of the
cell array
• Mixer
– Contains cell values
for the next updates
• Template memory
• Arithmetic unit
The complex arithmetic unit for implementing the CELL dynamics
[Figure: the arithmetic unit – nine multipliers combine template values T1 ... T9 with state values S1 ... S9 taken from shift registers; an adder tree sums the products together with g_ij and x_ij, and a shift & round stage produces the new state.]
• 9 multipliers (3x3
case)
• Adder tree
– DSP slice built-in
adders
• Fully pipelined
• 1 cell update / clock
cycle
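A minimal fixed-point sketch of what one such unit computes per clock cycle (nine multiplications, an adder tree, then shift & round); the bit widths and scaling are illustrative assumptions, not the actual FALCON fixed-point format.

```python
# Hypothetical per-cell computation of the arithmetic unit: nine template
# values T1..T9 times nine neighborhood state values S1..S9, summed with g_ij,
# then rescaled by an arithmetic shift with rounding (placeholder fixed point).
FRACTION_BITS = 8   # placeholder fixed-point scaling (Q.8 integers)

def cell_update(T, S, g_ij):
    assert len(T) == len(S) == 9
    acc = g_ij << FRACTION_BITS          # align g_ij with the product scale
    for t, s in zip(T, S):               # 9 multipliers feeding an adder tree
        acc += t * s
    # shift & round back to the state representation
    return (acc + (1 << (FRACTION_BITS - 1))) >> FRACTION_BITS

new_state = cell_update(T=[0, 26, 0, 26, 154, 26, 0, 26, 0],
                        S=[10, 12, 9, 11, 13, 12, 10, 11, 9],
                        g_ij=-5)
```

Because the unit is fully pipelined, one such result is produced every clock cycle.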
The Falcon array
[Figure: the Falcon array – a grid of core processors fed by input lines and control lines at the top, with output lines at the bottom.]
• Each core processor
works on a narrow
slice of the image
• Each line computes
one CNN iteration
• Results are shifted
down to the next row
processors
• Linear speedup
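A minimal sketch of this row-pipelined dataflow in Python; the slice counts are illustrative, the per-slice iteration is a placeholder (the cnn_step sketch above could be used if the slices are extended with a one-column halo), and only the dataflow is modeled, not the FPGA timing.

```python
# Hypothetical model of the Falcon array's dataflow: the image is cut into
# narrow vertical slices (one per column of core processors), and each row of
# cores applies one CNN iteration before handing its result to the next row.
import numpy as np

def one_iteration(x_slice, u_slice):
    return x_slice  # placeholder for a single CNN iteration on one slice

def falcon_array(x, u, n_core_columns=4, n_core_rows=8):
    xs = np.array_split(x, n_core_columns, axis=1)   # one slice per core column
    us = np.array_split(u, n_core_columns, axis=1)
    for _ in range(n_core_rows):                     # each core row = one iteration
        xs = [one_iteration(xi, ui) for xi, ui in zip(xs, us)]
    return np.hstack(xs)

result = falcon_array(np.zeros((64, 64)), np.random.rand(64, 64))
```

The linear speedup noted above then corresponds to adding more core columns (wider slices processed in parallel) and more core rows (more iterations in flight).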
Performance compared to a
conventional microprocessor
[Figure: speedup over a conventional microprocessor (log scale, 10x to 10000x) versus precision (2 to 62 bits) for the XC5VSX240T and XC6VSX475T devices (*estimated data).]
Inviscid, Adiabatic, Compressible flows on the CNN Universal Machine
• Euler equations:
$$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{v}) = 0$$
$$\frac{\partial (\rho \mathbf{v})}{\partial t} + \nabla \cdot (\rho \mathbf{v}\mathbf{v} + I\, p) = 0$$
$$\frac{\partial E}{\partial t} + \nabla \cdot \big((E + p)\, \mathbf{v}\big) = 0$$
• Notations:
– t: time
– ∇: nabla operator
– ρ: density
– v(u, v): velocity vector field
– p: pressure
– I: identity matrix
– E: total energy
– γ: ratio of specific heats
• Total energy is defined as:
$$E = \frac{p}{\gamma - 1} + \frac{\rho}{2}\, \mathbf{v} \cdot \mathbf{v}, \qquad p = (\gamma - 1)\left(E - \frac{\rho}{2}\, \mathbf{v} \cdot \mathbf{v}\right)$$
1st and 2nd order solution
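As context for the first-order case, here is a minimal sketch of a single first-order (Lax-Friedrichs-type) update for the 2D Euler equations on a regular cell grid; this is a generic textbook scheme written for illustration, not the specific CNN-UM implementation, and the grid size, time step, and γ are placeholder values.

```python
# Hypothetical sketch: one first-order (Lax-Friedrichs) step for the 2D Euler
# equations on a regular cell grid with periodic boundaries.
import numpy as np

GAMMA = 1.4  # ratio of specific heats (placeholder)

def fluxes(U):
    """U = (rho, rho*u, rho*v, E); return the physical fluxes F (x) and G (y)."""
    rho, mx, my, E = U
    u, v = mx / rho, my / rho
    p = (GAMMA - 1.0) * (E - 0.5 * rho * (u * u + v * v))
    F = np.array([mx, mx * u + p, my * u, (E + p) * u])
    G = np.array([my, mx * v, my * v + p, (E + p) * v])
    return F, G

def lax_friedrichs_step(U, dt, dx, dy):
    """U_new = neighbor average - dt/(2dx)*(F_{i+1}-F_{i-1}) - dt/(2dy)*(G_{j+1}-G_{j-1})."""
    F, G = fluxes(U)
    xp = lambda A: np.roll(A, -1, axis=-2)   # shift in +x (rows)
    xm = lambda A: np.roll(A, +1, axis=-2)
    yp = lambda A: np.roll(A, -1, axis=-1)   # shift in +y (columns)
    ym = lambda A: np.roll(A, +1, axis=-1)
    avg = 0.25 * (xp(U) + xm(U) + yp(U) + ym(U))
    return avg - dt / (2 * dx) * (xp(F) - xm(F)) - dt / (2 * dy) * (yp(G) - ym(G))

# Placeholder initial condition: gas at rest with a density bump in the middle
N = 64
rho = np.ones((N, N)); rho[24:40, 24:40] = 2.0
U = np.array([rho, np.zeros_like(rho), np.zeros_like(rho),
              np.full_like(rho, 1.0 / (GAMMA - 1.0))])   # uniform pressure p = 1
for _ in range(100):
    U = lax_friedrichs_step(U, dt=0.002, dx=0.01, dy=0.01)
```

A second-order scheme would replace the simple neighbor averaging and centered fluxes with a higher-order reconstruction.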
NEW algorithms for non-topographic
problems
Particle filters (A.Horváth and M. Rásonyi)
• Topographic representation of the
nontopographic data
• Random selection within topographic
locality
• 1000 trajectories x 100 time instants x a few particles
• Convergence and stability both on a Virtual and on an 8-bit Physical Cellular Machine
Arithmetic operations on a large number of terms
• Intel Threading Building Blocks (~ log n)
• Spatial-temporal optimization and the shapes of the operators on the cellular core (thread) array
• Shape-filling optimization (~ log n/N)
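To illustrate the ~log n behavior, here is a minimal sketch of a tree (pairwise) reduction of n terms; the log2(n) depth is the point, and the sequential Python loop merely emulates what the parallel lanes would do.

```python
# Hypothetical illustration: summing n terms by pairwise (tree) reduction.
# With enough parallel lanes, each while-iteration is one parallel step,
# so the depth is ceil(log2(n)) instead of n-1 sequential additions.
import math

def tree_reduce_sum(terms):
    values = list(terms)
    steps = 0
    while len(values) > 1:
        # pair up neighbors; one parallel step on a core/thread array
        values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
        steps += 1
    return values[0], steps

total, depth = tree_reduce_sum(range(1, 1025))   # 1024 terms
assert depth == math.ceil(math.log2(1024))       # 10 parallel steps
assert total == 1024 * 1025 // 2
```

The shape-filling optimization mentioned above then concerns how such reduction trees are packed onto the finite core (thread) array so that the parallel lanes stay busy.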
Shape-filling optimization
[Figure: filling the cellular core array – blocks of width N and depth log N, four parallel arrays, a static/dynamic fill factor, and data arriving in parallel waves.]
Algorithm Representations
• UMF diagrams
• Dynamic operational graphs
• Dynamic acyclic graphs
• Dynamic process graphs
• Etc.
UMF diagrams
A single cellular array/ layer defined by
a standard CNN dynamics :
[Diagram: a single UMF layer symbol – input U and initial state X0 enter the layer; the threshold/mask z, the time constant τ, and the local instruction TEMk annotate it; the layer produces the output Y.]
U: input array, X0: initial state array, z: threshold or mask array, Y: output array, τ: time constant or clock time, TEMk: local instruction
Algorithmic structures in terms of arrays/layers
• Parallel: a typical parallel structure with two parallel flows, combined in the final layer, is shown below.
• Cascade
[Diagram: in the parallel structure, two UMF layers with inputs U1, U2, initial states X01, X02, and thresholds z produce intermediate arrays X1 and X2, which a final layer combines into the output Y; in the cascade structure the layers are chained, the output of one layer feeding the next.]
Optimization of algorithms
represented by UMF diagrams
Directed acyclic graphs representing
UMF diagrams
Genetic Programming with Indexed
Memory, GP-IM (G. Pazienza)
Dynamic operational graphs
• Nodes are memories
• Branches are either
– the operators or
– communication paths
Extremal graph problems
Qualitative theory of Operators
with local connectedness
• Fundamental analytic results for 1D Binary CNN
(L. O. Chua et al.) – the richness of patterns
• A fundamental Theorem for oscillating CNN
(F. Corinto, T.Roska, and M.Gilli):
Any oscillating fully connected dynamical system array
can be represented by a locally connected one
• Synchronization
A globally connected array of stochastic oscillator
cells can also be implemented by locally connected
ones (G. Máté, E.Á.Horváth, E.Káptalan, A. Tunyagi,
Z.Néda, and T. Roska)
- level of synchrony
- size of local neighborhood
• Logic operators
Any 2D Boolean operator can be represented
by a sequence of locally connected standard
CNN
• Grammar cells
Locally connected simple grammar cell arrays
are equivalent to Turing Machines (E. Csuhaj-Varjú et al.)
• True Randomness in space
True random spatial bit patterns can be
generated with standard CNN Universal
Machine and spatially correlated local noise
(M. Ercsey-Ravasz, Z. Néda, and T. Roska)
New principles of
Computational Complexity
• The algorithmic and physical complexity measures of these many-core architectures will eventually be different from those of single-processor systems – abandoning the asymptotic framework (what is big? 1 million?).
• With Ω cores and M total memory, the finite virtual algorithmic complexity is Ca = Ca(Ω, M).
• The finite physical computational complexity of
an algorithm is measured by a proper mix of
speed, power, area, and accuracy.
A new era in
• Computer Science
• Computer engineering
begins….