Soft Computing - Ubiquitous Computing Lab


Soft Computing
Lecture 23
Hardware neural networks
Kinds of hardware support for NN
• Using parallel general-purpose hardware
• Using signal processors
• Developing special-purpose hardware, based
on VLSI, for implementing NN
14.12.2005
Technologies for development of neural networks in VLSI
• Analog optical
– Advantages: potential for massive interconnection
– Disadvantages: no complete optical computing technology exists yet
• Analog electrical
– Advantages: simple concepts, fast data processing
– Disadvantages: demanding fabrication requirements, sensitivity to defects and external influences, low computational precision, difficulty of implementing massive interconnection
• Digital electrical
– Advantages: mature, complete technology; high computational precision; robustness to technological variations
– Disadvantages: complex circuit designs, multi-cycle execution of basic operations, difficulty of implementing massive interconnection
• Hybrid
– Advantages: analog acceleration of basic operations combined with digital interfaces to external devices; possibility of optical switching
– Disadvantages: further technological development is needed
Some first neural networks in VLSI
Name (company-producer) | Kind of elements | Number of neurons / synapses | Number of multiplications with summation per second
Silicon Retina (Synaptics) | Analog | 48 x 48 | ?
ETANN (Intel) | Analog | 64 / 10^4 | 2 x 10^9
N64000 (Inova) | Digital | 64 / 10^5 | 9 x 10^8
MA-16 (Siemens) | Digital | 16 / 256 | 4 x 10^8
RN-200 (Ricoh) | Hybrid | 16 / 256 | 3 x 10^9
NeuroClassifier (Mesa Research Institute) | Hybrid | 7 / 426 | 2 x 10^10
Some neurocomputers
Name, Producer | Type | Number of processor elements | Number of multiplications with summation per second
CNAPS/PC (Adaptive Solutions) | PC accelerator card | 2 CNAPS-1016 processors (128 neurons) | 2.5 x 10^9
CNAPS (Adaptive Solutions) | Neurocomputer | 8 CNAPS-1016 processors (512 neurons) | 10^10
SYNAPSE-1 (Siemens) | Neurocomputer | 8 MA-16 processors (512 neurons) | 3 x 10^9
Examples of Neurocomputers
• Three neurocomputer systems are listed:
• The Adaptive Solutions CNAPS uses the Inova N64000
chip on VME boards in a custom cabinet run from a
UNIX host. Boards come with 1 to 4 chips, and two
boards can process the same network to give a total of
512 PEs. The software includes a C-language library,
assembler, compiler, and a package of NN algorithms.
• Similarly, the HNC SNAP Neurocomputer typically
comes with 2 VME boards, each with four NAP 100
chips, providing 32 PEs in total. The boards are controlled
from a PC by the HNC Balboa accelerator card.
• The Siemens SYNAPSE-1 uses a systolic array of 8 MA-16
chips in a custom cabinet with a UNIX host.
Slice architecture
• Following the bit-slice concept of conventional digital processors, neural
network slice chips provide building blocks for constructing networks of arbitrary
size and precision. Such chips typically cost only about $50 per chip, perform at
moderate speeds, and have no on-chip learning.
• The Micro Devices MD1220 was probably the first commercial neural
network chip. Each chip has eight neurons with hard-limit thresholds and
eight 16-bit synapses with 1-bit inputs. With bit-serial multipliers in the
synapses, the chip provides about 9 MCPS (million connections per second).
Bigger networks and networks with higher-precision inputs can be constructed
from multiple chips. A 16-bit accumulator limits the total number of inputs
because of overflow.
• A similar chip is the Neuralogix NLX-420 Neural Processor Slice [9], which
has 16 processing elements (PEs). A common 16-bit input is multiplied
by a weight in each PE in parallel. New weights are read from off-chip. The
16-bit weights and inputs can be user-selected as 16 1-bit, 4 4-bit, 2 8-bit, or
1 16-bit value(s). The 16 neuron sums are multiplexed through a user-defined
piece-wise continuous threshold function to produce a 16-bit output.
Internal feedback allows for multi-layer networks. Multiple chips can be
combined to build large networks.
• The Philips Lneuro 1.0 chip, which is designed to be easily interfaced to
Transputers, also has 16-bit processing in which the neuron values can be
interpreted as 8 2-bit, 4 4-bit, etc., sub-values. Unlike the NLX-420, it has
a sizable (1 KByte) on-chip cache to hold weights. The transfer function is
computed off-chip, which allows multiple chips to provide synapse-input
products to the neurons to build very large networks.
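The MD1220 description above (1-bit inputs, 16-bit weights, a 16-bit accumulator, hard-limit threshold) can be illustrated with a small Python sketch. This is only a behavioural model under stated assumptions: the wrap-around behaviour on overflow, the threshold convention, and the names wrap16 and hard_limit_neuron are hypothetical and are not taken from the chip's documentation.

```python
def wrap16(value):
    """Wrap an integer to a signed 16-bit two's-complement value (assumed overflow behaviour)."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def hard_limit_neuron(weights, inputs, threshold=0):
    """Model one neuron with 16-bit weights, 1-bit inputs and a 16-bit accumulator.

    Returns (output, accumulator, overflowed), where output is 1 if the
    accumulator reaches the threshold and 0 otherwise.
    """
    exact = 0  # unlimited-precision reference sum
    acc = 0    # what a 16-bit accumulator would hold
    for weight, bit in zip(weights, inputs):
        if bit:  # with a 1-bit input, the multiply reduces to a conditional add
            exact += weight
            acc = wrap16(acc + weight)
    output = 1 if acc >= threshold else 0
    return output, acc, exact != acc


# Small sum: fits in 16 bits, neuron fires.
print(hard_limit_neuron([100, -50, 25], [1, 1, 0]))
# Two large weights: the exact sum (40000) exceeds the signed 16-bit range.
print(hard_limit_neuron([20000, 20000], [1, 1]))
```

The second call shows why the accumulator width bounds the number of inputs: once the running sum leaves the signed 16-bit range, the thresholded output no longer reflects the true weighted sum.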
Multi-processor chips
• A far more elaborate approach is to put many small processors on a
chip. Two architectures dominate such designs: single instruction,
multiple data (SIMD) and systolic arrays. In a SIMD design, each
processor executes the same instruction in parallel but on different data.
In a systolic array, each processor performs one step of a calculation
(always the same step) before passing its result on to the next processor
in a pipelined manner.
• SIMD chips include the Inova N64000 and the HNC 100 NAP. The
Adaptive Solutions CNAPS system uses the Inova N64000 to build a
SIMD array. The chip contains 64 PEs, each with a 9x16-bit
integer multiplier, a 32-bit accumulator, and 4 KBytes of on-chip memory
for weight storage. All chips execute the same instruction, and common
control and data buses allow multiple chips to be combined. The
Hecht-Nielsen Computers 100 NAP (Neurocomputer Array Processor)
contains only 4 PEs, but each PE performs true 32-bit floating-point
arithmetic. Weights are stored in off-chip memory, and multiple chips can
be cascaded.
• A systolic array system can be built with the Siemens MA-16. The MA-16
provides fast matrix-matrix operations (multiplication, subtraction, or
addition) on 4x4 matrices with 16-bit elements. The multiplier outputs and
accumulators have 48-bit precision. Weights are stored off-chip, and neuron
transfer functions are computed off-chip via lookup tables. Multiple chips
can be cascaded.
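The difference between the two architectures can be sketched in Python. This is a software simulation for illustration only, not a model of the N64000 or MA-16: the function names simd_layer and systolic_dot, the one-weight-per-PE layout of the pipeline, and the clocking scheme are all assumptions.

```python
def simd_layer(weights, x):
    """SIMD sketch: every PE runs the same multiply-accumulate on its own weight row.

    weights[pe][j] is the weight held by processing element `pe` for input j.
    """
    acc = [0] * len(weights)
    for j, xj in enumerate(x):            # one input broadcast to all PEs per step
        for pe in range(len(weights)):    # conceptually executed in parallel
            acc[pe] += weights[pe][j] * xj
    return acc


def systolic_dot(weights, vectors):
    """Systolic sketch: PE i always adds weights[i] * v[i] to the partial sum
    it receives, then passes the result to PE i+1 on the next clock tick."""
    n = len(weights)
    regs = [None] * n   # output register of each PE: (vector index, partial sum)
    results = []
    for tick in range(len(vectors) + n):
        # Update from the last PE to the first so that each PE reads the
        # value its neighbour produced on the previous tick.
        for i in range(n - 1, -1, -1):
            if i == 0:
                incoming = (tick, 0) if tick < len(vectors) else None
            else:
                incoming = regs[i - 1]
            if incoming is None:
                regs[i] = None
            else:
                k, partial = incoming
                regs[i] = (k, partial + weights[i] * vectors[k][i])
        if regs[n - 1] is not None:
            results.append(regs[n - 1][1])
    return results


print(simd_layer([[1, 2], [3, 4]], [5, 6]))
print(systolic_dot([1, 2, 3], [[1, 1, 1], [1, 0, 2]]))
```

In the pipeline, a new input vector can enter on every tick while earlier vectors are still moving through later PEs, which is where a systolic array gets its throughput.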
RBF networks:
• RBF networks provide fast learning and straightforward
interpretation. The comparison of input vectors to stored training
vectors can be done quickly if non-Euclidean distances, such as the
Manhattan block norm (sum of absolute element differences), are used,
because they require no multiplication operations.
• Two commercial RBF products are now available: the IBM ZISC036
(Zero Instruction Set Computer) chip and the Nestor Ni1000 chip.
• The ZISC036 contains 36 prototype-vector neurons, where the
vectors have 64 8-bit elements and can be assigned to categories
from 1 to 16383. Multiple chips can be easily cascaded to provide
additional prototypes. The distance norm is selectable between
the Manhattan block norm and the largest element difference. The chip
implements a Region of Influence (ROI) learning algorithm using signum
basis functions with radii of 0 to 16383. Recall is according to the
ROI identification or via nearest-neighbor readout. Recall processing
sustains a presentation rate of 250k patterns/sec.
• The Nestor Ni1000, developed jointly by Intel and Nestor, contains
1024 prototypes of 256 5-bit elements. The chip has two on-chip
learning algorithms, RCE and PNN, and other algorithms can be
microcoded. The processing rate is about 40k patterns/sec with a
40 MHz clock.
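The multiplication-free distance norms and the two recall modes described for the ZISC036 can be sketched in Python as follows. This is an illustrative model only: the prototype tuple layout, the function names, and the choice of "distance <= radius" as the firing condition are assumptions, not the chip's documented behaviour.

```python
def manhattan(a, b):
    """Manhattan block norm: sum of absolute element differences (no multiplies)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def largest_difference(a, b):
    """Alternative norm: the largest absolute element difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def recall(prototypes, x, norm=manhattan):
    """Classify x against stored prototypes.

    prototypes is a list of (vector, radius, category) tuples (assumed layout).
    Returns (firing, nearest): the categories whose region of influence
    contains x, and the category of the nearest prototype.
    """
    scored = [(norm(vec, x), radius, cat) for vec, radius, cat in prototypes]
    # Assumption: a prototype fires when the distance does not exceed its radius.
    firing = sorted({cat for dist, radius, cat in scored if dist <= radius})
    nearest = min(scored, key=lambda item: item[0])[2]
    return firing, nearest


prototypes = [([10, 10], 5, 1), ([20, 20], 8, 2)]
print(recall(prototypes, [12, 11]))                      # inside category 1's region
print(recall(prototypes, [0, 0]))                        # no region fires; nearest neighbor only
print(recall(prototypes, [16, 16], largest_difference))  # same data, other norm
```

When no region of influence contains the input, the firing list is empty and only the nearest-neighbor readout gives an answer, which mirrors the two recall modes listed above.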
Other digital designs
• Some digital neural network chips don't quite fit into the
three sub-categories above.
• Examples include the Micro Circuit Engineering MT19003
NISP Neural Instruction Set Processor and the Hitachi Wafer
Scale Integration chips.
• The NISP is basically a very simple RISC processor with
seven instructions, optimized for implementing multi-layer
networks, and loaded with small programs to direct the
processing. Feed-forward processing reaches 40 MCPS.
• At the other end of the complexity scale are the Hitachi
Wafer Scale Integration chips. Both Hopfield and back-propagation
wafers have been built. A neurocomputer with 8
of the back-prop wafers, each with 144 neurons, achieved
2.3 GCUPS.
NeuroMatrix® NM6403 RISC/DSP Microprocessor
(Research Center Module, Russia)
• The NeuroMatrix® NM6403 is a high-performance dual-core
microprocessor that combines VLIW and SIMD
architectures. The architecture includes two main units:
– a 32-bit RISC core;
– a 64-bit VECTOR co-processor that supports vector operations on
elements of variable bit length (Patent US 6,539,368 B1).
There are two identical programmable interfaces that work
with any memory type, as well as two communication
ports, hardware-compatible with the TI TMS320C4x DSP,
which make it possible to build multiprocessor systems.
RISC-Core
• 5-stage pipelined 32-bit RISC core;
• processor instructions are 32 or 64 bits wide (usually
two operations are executed by each instruction);
• two address generation units; address space: 16 GB;
• two 64-bit programmable interfaces to shared SRAM/DRAM
memory;
• data format: 32-bit integers;
• registers:
– eight 32-bit general-purpose registers;
– eight 32-bit address registers;
– special control and status registers;
• two high-speed, byte-wide I/O communication ports,
hardware-compatible with those of the TMS320C4x.
VECTOR co-processor
• word length of vector operands and products: 1 to 64 bits;
• data format: integer data packed into 64-bit blocks
as variable-length words of 1 to 64 bits each;
• hardware support for vector-matrix and
matrix-matrix multiplication;
• on-chip saturation functions;
• three on-chip 32 x 64-bit RAM blocks.
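The idea of packing variable-length integer words into a 64-bit block, together with a saturation step, can be sketched in Python. This is a generic illustration, not the NM6403's actual data layout or instruction set: the field order (lowest bits first), the two's-complement encoding, the clamping form of saturation, and the names pack, unpack and saturate are all assumptions.

```python
def pack(values, widths):
    """Pack signed integers into one 64-bit block.

    widths gives each field's size in bits, lowest field first (assumed order);
    the widths must sum to 64.
    """
    if sum(widths) != 64 or len(values) != len(widths):
        raise ValueError("fields must exactly fill a 64-bit block")
    block, shift = 0, 0
    for value, width in zip(values, widths):
        if not -(1 << (width - 1)) <= value < (1 << (width - 1)):
            raise ValueError(f"{value} does not fit in {width} signed bits")
        block |= (value & ((1 << width) - 1)) << shift  # two's-complement field
        shift += width
    return block


def unpack(block, widths):
    """Recover the signed fields from a 64-bit block produced by pack()."""
    values, shift = [], 0
    for width in widths:
        field = (block >> shift) & ((1 << width) - 1)
        if field >= 1 << (width - 1):   # sign bit set: convert to negative
            field -= 1 << width
        values.append(field)
        shift += width
    return values


def saturate(value, width):
    """Clamp value to the signed range representable in `width` bits."""
    low, high = -(1 << (width - 1)), (1 << (width - 1)) - 1
    return max(low, min(high, value))


widths = [32, 16, 8, 8]                 # mixed word lengths in one block
block = pack([-1, 1000, -128, 127], widths)
print(hex(block), unpack(block, widths))
print(saturate(300, 8), saturate(-300, 8))
```

Narrower fields let one 64-bit block carry more operands, which suggests why the quoted vector throughput would vary with the chosen word length.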
Applications
• accelerators for PCs and workstations for:
– neural net emulation;
– signal processing;
– image processing;
– acceleration of vector and matrix calculations;
• telecommunications;
• embedded systems;
• basic building block for large, massively parallel
computing systems.
Performance
• scalar operations:
– 40 MIPS;
– 120 MOPS for 32 bit data;
• vector operations:
– from 40 to 11,500+ MMAC (million multiply-accumulate
operations per second);
• I/O and interfaces:
– two programmable 64-bit external memory interfaces
with up to 800 MB/sec throughput;
– I/O communication ports with up to 20 MB/sec throughput
each.
NeuroMatrix® MC431 Single-DSP PCI Evaluation Board
• The MC431 is a single-DSP PCI board designed for software
evaluation and system prototyping on the NM6403 DSP.
It is a low-cost solution that can be used for
learning the NeuroMatrix® architecture and Software
Development Kit. It has one NM6403 DSP, 4 MB of SRAM,
and two communication ports.
• The NM6403 DSP has access to two 2 MB SRAM
banks (one bank per bus). One bank is accessible
for reading and writing from both the processor and the PCI
bus. The MC431 has two external communication ports
for connecting input/output devices.
NeuroMatrix® MC431 Single-DSP PCI
Evaluation Board (2)
• Low-cost, high-performance vector-matrix engine: NM6403 DSP
• 4 MB SRAM
• Two I/O communication ports hardware compatible with
TI 'C40 DSP
• PCI host interface
Neural networks target traffic management
Using a PCI-based video input board, image processor, and video
display board, each with an on-board NM6403 neural-network
processor, RC Module's imaging system can monitor and analyze traffic
flow on six-lane highways in real time.