Transcript d-mesfet

Page Number: 1/101
MICROPROCESSORS
DARPA EYES 100-MIPS GaAs CHIP FOR STAR WARS
PALO ALTO
For its Star Wars program, the Department of Defense
intends to push well beyond the current limits of technology. And along with lasers and particle beams, one piece of
hardware it has in mind is a microprocessor chip having as
much computing power as 100 of Digital Equipment
Corp.’s VAX-11/780 superminicomputers.
One candidate for the role of basic computing engine for
the program, officially called the Strategic Defense
Initiative [ElectronicsWeek, May 13, 1985, p. 28], is a gallium arsenide version of the Mips reduced-instruction-set
computer (RISC) developed at Stanford University. Three
teams are now working on the processor. And this month,
the Defense Advanced Research Projects Agency closed the
request-for-proposal (RFP) process for a 1.25-µm silicon
version of the chip.
Last October, Darpa awarded three contracts for a 32-bit
GaAs microprocessor and a floating-point coprocessor. One
went to McDonnell Douglas Corp., another to a team
formed by Texas Instruments Inc. and Control Data Corp.,
and the third to a team from RCA Corp. and Tektronix Inc.
The three are now working on processes to get useful
yields. After a year, the program will be reduced to one or
two teams. Darpa’s target is to have a 10,000-gate GaAs
chip by the beginning of 1988.
If it is as fast as Darpa expects, the chip will be the basic
engine for the Advanced Onboard Signal Processor, one of
the baseline machines for the SDI. “We went after RISC
because we needed something small enough to put on
GaAs,” says Sheldon Karp, principal scientist for strategic
technology at Darpa. The agency had been working with
the Motorola Inc. 68000 microprocessor, but Motorola
wouldn’t even consider trying to put the complex 68000
onto GaAs, Karp says.
A natural. The Mips chip, which was originally funded by
Darpa, was a natural for GaAs. “We have only 10,000 gates
to work with,” Karp notes. “And the Mips people had taken
every possible step to reduce hardware requirements. There
are no hardware interlocks, and only 32 instructions.”
Reprinted with permission
Even 10,000 gates is big for GaAs; the first phase of the
work is intended to make sure that the RISC architecture
can be squeezed into that size at respectable yields, Karp
says.
Mips was designed by a group under John Hennessy at
Stanford. Hennessy, who has worked as a consultant with
Darpa on the SDI project, recently took the chip into the
private sector by forming Mips Computer Systems of
Mountain View, Calif. [ElectronicsWeek, April 29, 1985,
p. 36]. Computer-aided-design software came from the
Mayo Clinic in Rochester, Minn.
The silicon Mips chip will come from a two-year effort
using the 1.25-µm design rules developed for the Very High
Speed Integrated Circuit program. (The Darpa chip was not
made part of VHSIC in order to open the RFP to
contractors outside that program.)
Both the silicon and GaAs microprocessors will be full 32-bit engines sharing 90% of a common instruction core.
Pascal and Air Force 1750A compilers will be targeted for
the core instruction set, so that all software will be interchangeable.
The GaAs requirement specifies a clock frequency of
200 MHz and a computation rate of 100 million instructions
per second. The silicon chip will be clocked at 40 MHz.
Eventually, the silicon chip must be made radiation-hard;
the GaAs chip will be intrinsically rad-hard.
Darpa will not release figures on the size of its RISC effort.
The silicon version is being funded through the Air Force’s
Air Development Center in Rome, N.Y.
–Clifford Barney
ElectronicsWeek/May 20, 1985
Figure 1.1.a. A brochure about the RCA’s 32-bit and 8-bit versions of the GaAs
RISC/MIPS processor, realized as a part of the “MIPS for Star Wars” project.
Page Number: 2/101
Phases of a Well-Structured VLSI Design
1. Generation of candidate architectures
with approximately the same VLSI area.
2. Comparison of candidate architectures,
from the point of view of the compiled HLL code speed.
3. Selection of one candidate architecture,
and finalization of its schematics.
4. Design of the VLSI chip:
a. Schematic capture
b. Logic and timing testing
c. Placement and routing
5. Generation of the mask.
6. Chip fabrication, etc...
Page Number: 3/101
Typical Development Phases for
One 32-bit Microprocessor on a VLSI Chip
(or about the development of
DARPA's 32-bit RISC MIPS processors in GaAs and silicon)
1. Announcement of project requirements
(on 1.1.1984.)
a. Type of the architecture (SU-MIPS)
b. Maximal on-chip transistor count
(30K)
c. Detailed specification of the
assembly language (Core-MIPS)
d. A set of benchmark programs typical
of the end-user application (13)
Three competitors selected by 12.13.1984.
a. McDonnell Douglas
b. CDC + TI
c. RCA (Purdue + TriQuint)
Page Number: 4/101
2. In-house research by the three competitors
(till 12.31.1985.)
a. Generation of several candidate architectures under 30K
transistors.
b. Design of an ENDOT (isp') simulator of all candidate
architectures (why isp'?).
c. All candidate architectures are ranked according to the above
mentioned benchmark programs.
d. Reasons for high/low ranking of specific candidate
architectures are analysed, and the best candidate
architectures are modified to become better.
The final architecture is determined and
"frozen" after several iterations.
Detailed RTL design is completed,
and it is proven that the total transistor count is below 30K.
Page Number: 5/101
3. Decision-making at the sponsor side
(by 1.1.1986.)
a. Final architectures of all competitors
are ranked (using the isp' simulators
and the initially provided
benchmarks).
b. A subset of competitors is selected
for further financing; the others are offered
the option to stay in the competition with their own
financing.
c. All those that stay in competition are
shown all reports generated (by others)
till that point.
Page Number: 6/101
4. In-house development by the three competitors
(till 12.31.1986.)
a. Improvements are added, after the solutions of
the competition are reviewed, and their impact
is verified with isp’ simulation
b. The architecture is frozen, forever.
c. The RTL design is redone and frozen.
d. The appropriate semi-custom standard-cell family is
selected, and the gate level design is completed. The
standard-cell family choices, in the project which is
the subject of this presentation:
- The 1 micron E/D-MESFET GaAs
- The 1.25 micron SOS-CMOS Si
e. The completed gate level (GTL) design
contains only the elements of the cells from
the selected family (which includes the input,
output, and input/output pads).
Page Number: 7/101
f. The gate level design is entered into a computer, using one of the following methods:
- Graphic entry
- HDL based entry
- Logic equation entry
- State machine entry
- Direct entry of the net-list, using a text editor
Except in the last case, the net-list (needed for further work) is obtained using the appropriate translator.
g. The net-list is tested (logic and timing), using an appropriate testing program (LOGSIM). If
errors are found, the work iterates back, as needed.
h. The net-list is processed by an appropriate placement and routing program (MP2D). No timing
errors (guaranteed) after the chip is fabricated! Logic errors are still possible after the chip is
fabricated.
The two major output files:
- Artwork file for visual analysis (for printer or plotter)
- Fab file (for shipment to a chip foundry, by regular mail or email)
At the chip foundry, the fab file is analysed, and each standard cell is substituted with its full-custom
equivalent (details are typically confidential).
Page Number: 8/101
5. Further narrowing down of the sponsored competition, and
widening up of the support technology (by 1.1.1987.)
a. Only a subset of the sponsored competition is given
further support for fabrication of a prototype at a
lower-than-nominal speed.
b. More funding made available for R&D in both,
semiconductor and packaging technologies.
c. More funding made available for the Core-MIPS
translators (for the MC680x0 and the 1750A
assembly languages) and compilers (for ADA and C).
Page Number: 9/101
6. Prototype fabrication (by 12.31.1987.)
7. Zero series at a still-lower-than-nominal speed (by 12.31.1988.)
8. Commercial series at the nominal speed (by 12.31.1989.)
9. The US epilogue!
10. The rest-of-the-world epilogue!
Page Number: 10/101
The ENDOT Package by TDT
1. First, the appropriate files are formed.
In the most general case:
a. One or more .isp (isp') files (different names; same extension)
b. One .t (topology) file (trivial if one .isp file; complex if many .isp files)
c. One .m (meta-micro) file (one jumbo case statement)
d. One .i file (information related to linking and loading)
e. One or more .b (benchmark) files (any extension allowed)
Only this, and nothing more! [Poe66]
2. Second, the formed files are treated with appropriate tools:
a. Hardware tools
b. Software tools
c. Postprocessing and utility tools
Finally, the simulator is completed.
3. Third, the simulator is run, and the statistics about the analyzed architecture(s)
are collected.
4. Fourth, if needed, a silicon compiler is run, etc...
Page Number: 11/101
ENDOT
(1) Hardware Tools
(1.1) ISP' Language
(1.2) ISP' Compiler - ic
(1.3) Topology Language
(1.4) Ecologist - ec
(1.5) Simulation Command Language
(1.6) Simulator - n2
(2) Software Tools
(2.1) Meta-assembler - micro
(2.2) Meta-loader - the linker/loader
(2.2.1) Interpreter - inter
(2.2.2) Allocator - cater
(2.3) Minor programs
(2.3.1) mdump
(2.3.2) merge
(2.3.3) mas = micro + cater
(2.3.4) mkmem
(3) Postprocessing & Utility Tools
(3.1) Statements counter - coverage
(3.2) General purpose post-processor - gpp
(3.3) N.2 help utility - nhelp
(3.4) Build utility - build
(3.5) VHDL translator - icv
Page Number: 12/101
THE N.2 DESIGN PROCESS
Step 1: Idea!!!
Step 2: Hardware (and Software) design
Step 3: Simulation
Step 4: Analysis
Step 5: IF design <> ok THEN GOTO Step 2
Step 6: End
With N.2 your design iterations become painless!!!
Page Number: 13/101
HARDWARE TOOLS
ISP' language
Purpose:
DESCRIPTION OF THE HARDWARE SYSTEMS
ISP' program:
(1) Declaration section
(2) Behavior section
Page Number: 14/101
Declaration section:
- CONTAINS STRUCTURE DECLARATIONS.
- STRUCTURES: ALL ISP' NAMED OBJECTS.
- STRUCTURE TYPES:
(1) MACRO
(2) PORT
(3) STATE
(4) MEMORY
(5) FORMAT
(6) QUEUE
MACRO subsection:
names which are used to give convenient easily
remembered names to objects.
PORT subsection:
names which are used for communication with
outside world.
STATE subsection:
internal names of the ISP' model that can store
information.
MEMORY subsection: same as a state, except that memory can be
initialized.
FORMAT subsection: convenient names for inconvenient names;
typically subranges of states.
QUEUE subsection: names which are used for synchronization with
outside world.
Page Number: 15/101
Behavior section:
- CONTAINS ONE OR MORE PROCESSES.
- PROCESS:
(1) PROCESS DECLARATION
(2) PROCESS BODY
- PROCESS BODY:
SET OF ISP' STATEMENTS.
- ISP' STATEMENTS:
PROCESS EXECUTES ALL
ITS INDEPENDENT STATEMENTS
CONCURRENTLY.
- next AND delay STATEMENTS:
CAN BE USED TO
FORCE SEQUENTIAL EXECUTION
WITHIN A PROCESS
- main:
OPERATES IN A CONTINUOUS LOOP.
- when:
WAITS FOR AN EVENT.
- procedure: SAME AS A SUBROUTINE IN A HLL;
main process INVOKES a procedure.
Page Number: 16/101
- function:
SAME AS A FUNCTION IN A HLL.
Example: “wave.isp”
port
CK 'output;
main CYCLE :=
(
CK = 0;
delay(50);
CK = 1;
delay(50);
)
Figure 3.1. File wave.isp with the description of a clock generator in the
ISP’ language.
Page Number: 17/101
File “cntr.isp”
port
CK 'input,
Q<4> 'output;
state
COUNT<4>;
when EDGE(CK:lead) :=
(
Q = COUNT + 1;
COUNT = COUNT + 1;
)
Figure 3.2. File cntr.isp with the description of clocked counter in the ISP’
language.
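The behavior described by wave.isp and cntr.isp can be sketched outside N.2 as a small Python model. This is only an illustration, not ENDOT output: the function name simulate, the initial value COUNT = 0, and the choice to ignore any topology time delay are assumptions made here.

```python
def simulate(total_time, half_period=50, width=4):
    """Return (time, Q) pairs, one per leading clock edge up to total_time."""
    ck = 0          # CK starts low, as in wave.isp
    count = 0       # COUNT<4> of cntr.isp, assumed to start at 0
    samples = []
    for t in range(half_period, total_time + 1, half_period):
        previous, ck = ck, 1 - ck                # clock toggles after each delay(50)
        if previous == 0 and ck == 1:            # EDGE(CK:lead)
            count = (count + 1) % (1 << width)   # 4-bit wrap-around
            samples.append((t, count))           # Q carries the new COUNT
    return samples


if __name__ == "__main__":
    for time, q in simulate(350):
        print(time, q)
```

With the 50-unit half period of wave.isp, leading edges fall at times 50, 150, 250, and so on, and the 4-bit counter wraps to 0 after sixteen edges.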
Page Number: 18/101
ic - The ISP' Compiler
Purpose:
COMPILES ".isp" SOURCE FILES
INTO ".sim" OBJECT FILES
- input: ".isp" file
- output: ".sim" file
wave.isp ---> ic ---> wave.sim
cntr.isp ---> ic ---> cntr.sim
Page Number: 19/101
Topology Language
Purpose:
DESCRIBES LINKS
BETWEEN THE ".sim" FILES
Topology program:
(1) SIGNAL SECTION
(2) PROCESSOR SECTION
(3) MACRO SECTION
(4) COMPOSITE SECTION
(5) INCLUDE SECTION
- SIGNAL SECTION: IF EXISTS, CONTAINS A SET
OF SIGNAL DECLARATIONS
- SIGNAL DECLARATIONS:
signal_name [<width>][,signal declarations]
Page Number: 20/101
- PROCESSOR SECTION: CONTAINS A
PROCESSOR DECLARATION.
- PROCESSOR DECLARATION:
processor_name = "filename.sim"
[time delay = integer;]
[connections signal_connections;]
[initial memory_name = l.out;]
- MACRO SECTION: USER'S CONVENIENT NAMES
FOR TOPOLOGY OBJECTS.
- COMPOSITE SECTION: THIS SECTION
MAY CONTAIN SET OF THE
TOPOLOGY LANGUAGE DECLARATIONS
IN THE FOLLOWING FORMAT:
begin
declaration {declaration}
end
- INCLUDE SECTION: SIMPLE INCLUDING OF
THE FILE WHICH CONTAINS
TOPOLOGY LANGUAGE DECLARATIONS.
Page Number: 21/101
File “clcnt.t”
signal
CLOCK,
BUS<4>;
processor CLK = "wave.sim";
time delay = 10;
connections
CK = CLOCK;
processor CNT = "cntr.sim";
connections
CK = CLOCK,
Q = BUS;
Figure 3.3. File clcnt.t with the topology language description of the
connection between the clock generator and the clocked counter, described in
the wave.isp and cntr.isp files, respectively.
Page Number: 22/101
ec - The Ecologist
Purpose:
COMPILES ".t" SOURCE FILES
INTO ".e00" FILES
- explicit input: ".t" file
- implicit input: ".sim" file(s)
- optional implicit input: "l.out" file
(derived by the software tools)
- output: ".e00" file (object file)
clcnt.t  ------->
wave.sim ------->   ec  ----->  clcnt.e00
cntr.sim ------->
[l.out   ------->]
Page Number: 23/101
n2 - The Simulator
Purpose:
SIMULATION OF THE DESCRIBED
HARDWARE
SYSTEM.
- input: ".sim" & ".e00" files
- optional input: "l.out" file
(derived by the software
tools)
- output: if exists, ".txt" file
wave.sim  ------->
cntr.sim  ------->   n2  [----->  clcnt.txt]
clcnt.e00 ------->
[l.out    ------->]
Page Number: 24/101
Simulation Command Language
Purpose:
CONTROLLING THE FLOW OF SIMULATION
Some basic simulator commands:
- run:
STARTS OR RESUMES THE SIMULATION.
- quit:
EXIT THE SIMULATOR.
- time:
QUERIES THE SIMULATION "CLOCK" TO OBTAIN THE
ELAPSED UNITS
OF SIMULATION TIME.
- examine structures:
QUERIES THE CONTENTS OF THE STRUCTURES.
- help keyword:
PROVIDES AN ON-LINE REFERENCE.
- deposit value structure:
SETS THE CONTENTS OF THE STRUCTURE WITH
THE VALUE FIELD.
- monitor structures & alert structures:
PROVIDES A VARIETY OF CAPABILITIES FOR
GETTING INFORMATION DURING SIMULATION.
Page Number: 25/101
Installation of ENDOT package on systems
running SCO UNIX
1. Login as root
2. cd /usr
3. tar xv n2.tar.Z (extract)
4. uncompress -v n2.tar.Z
5. tar xvf n2.tar (extract)
6. rm n2.tar
7. cd n2
8. tar xvf nmpc.uof
9. cp nmpc.uof /usr/USERNAME
Sequence of operations for simulation of the
clocked counter
1. vi wave.isp
2. vi cntr.isp
3. ic wave.isp
4. ic cntr.isp
5. vi clcnt.t
6. ec -h clcnt.t
7. n2 -s clcnt.txt clcnt.e00
Page Number: 26/101
SOFTWARE TOOLS
metaMicro
Purpose:
ASSEMBLING AN ASSEMBLER PROGRAM.
- input: METAMICRO ASSEMBLER SOURCE FILE AND ASSEMBLER PROGRAM
- output: ".n" FILE
arch.m    ----->|
                |---> micro ---> arch.n
program.m ----->|
- arch.m: CONTAINS DEFINITION OF THE ASSEMBLER INSTRUCTIONS AND
Begin-end Section:
begin
include program.m$
end
- program.m: CONTAINS ASSEMBLER PROGRAM
- arch.n: OBJECT FILE.
Page Number: 27/101
inter - the Interpreter
Purpose:
DESCRIPTION OF THE
INSTRUCTION WORD;
ADDRESS
RESOLUTION AND RELOCATION.
- input: LINKER/LOADER SOURCE FILE
- output: ".a" FILE
arch.i -----> inter ------> arch.a
- arch.i: CONTAINS DEFINITIONS OF THE
INSTRUCTION WORD AND
INFORMATION FOR THE
ADDRESS RESOLUTION AND RELOCATION.
- arch.a: OBJECT FILE.
Page Number: 28/101
cater - The Allocator
Purpose:
LINKING THE ".n" AND ".a" FILES;
RESOLVING ADDRESS & ALLOCATION.
- input: ".n" & ".a" files
- output: "l.out" file
- l.out: MEMORY IMAGE FILE
arch.n --->|
           |---> cater ---> l.out
arch.a --->|
Page Number: 29/101
Postprocessing & Utility Tools
coverage - ANALYZES PROCESSOR STATEMENTS
BY USAGE, HIGHLIGHTING THE
UNEXECUTED STATEMENTS.
gpp -
ANALYZES PROCESSOR
STRUCTURES BY VALUE,
PROVIDING STATISTICAL,
GRAPHICAL, OR COMPARATIVE
PRESENTATION OF RESULTS.
nhelp -
ON-LINE HELP.
build -
MANAGING OF THE SOURCE FILES.
icv -
TRANSLATING ISP' MODELS INTO VHDL
Page Number: 30/101
The Fura RISC CPU

Word length: 32 bits

Registers: sixteen 32-bit

Execution model: register-to-register
dp = register_read -> ALU_operation -> register_write

Memory access: load & store

Pipelining:
delayed branching!!!
delayed loading!

Instruction classes:
(1) ALU class
(2) branch class
(3) data memory class
(4) system class
Page Number: 31/101
Instruction cycles:
(1) INSTRUCTION FETCH (IF)
(2) INSTRUCTION DECODING
AND EXECUTION (IDX)
(3) DATA LOAD (LD)
i-1:  IF   IDX  LD
i:         IF   IDX  LD
i+1:            IF   IDX  LD
Possible isp' coding window positioning (i+1 is the current
instruction)
main := (
main:= (
IF(i+1);
IDX(i);
LD(i-1);
IF(i+1);
delay(1);
LD(i);
IDX(i+1);
)
)
main := (
main := (
IF(i+1);
delay(1);
IDX(i+1);
delay(1);
LD(i+1);
)
Page Number: 32/101
)

Instruction format:
31      24 23   20 19   16 15   12 11            0
|   OP    |  DST  | SRC#1 | SRC#2 |      X        |

31      24 23   20 19   16 15          5 4       0
|   OP    |  DST  | SRC#1 |      X      |  SIMM   |

31      24 23   20 19   16 15                    0
|   OP    |  DST  | SRC#1 |        LIMM           |
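The instruction formats above can be exercised with a short Python sketch that packs and unpacks the fields of a 32-bit word. The bit positions follow the format diagram; the helper names (FIELDS, extract, encode, sign_extend16) are hypothetical and exist only for this illustration.

```python
# Bit ranges (high, low) of each field in the 32-bit instruction word.
FIELDS = {
    "op": (31, 24),
    "dst": (23, 20),
    "src1": (19, 16),
    "src2": (15, 12),
    "imm16": (15, 0),   # LIMM
    "imm5": (4, 0),     # SIMM
}


def extract(word, name):
    """Return the unsigned value of the named field."""
    high, low = FIELDS[name]
    return (word >> low) & ((1 << (high - low + 1)) - 1)


def encode(op, dst=0, src1=0, src2=0, imm16=None, imm5=None):
    """Pack the fields; imm16 or imm5, when given, replace src2."""
    word = (op & 0xFF) << 24 | (dst & 0xF) << 20 | (src1 & 0xF) << 16
    if imm16 is not None:
        word |= imm16 & 0xFFFF
    elif imm5 is not None:
        word |= imm5 & 0x1F
    else:
        word |= (src2 & 0xF) << 12
    return word


def sign_extend16(value):
    """Interpret a 16-bit field as a two's-complement number."""
    return value - 0x10000 if value & 0x8000 else value


if __name__ == "__main__":
    print(hex(encode(0, dst=3, src1=1, src2=2)))
```

Because the three formats share the OP, DST, and SRC#1 positions, a single decoder can extract those fields before the opcode is known.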
Page Number: 33/101
ALU Class:
Add
(a) ADD Rd, Rs1, Rs2
(b) ADD Rd, Rs1, imm16
(c) ADD Rd, PC, imm16
Subtract
(a) SUB Rd, Rs1, Rs2
(b) SUB Rd, Rs1, imm16
(c) SUB Rd, PC, imm16
Move
(a) MOV Rd, Rs1
(b) MOV Rd, imm16
(c) MOV Rd, PC
Negate
(a) NEG Rd, Rs1
Logical Not
(a) LNOT Rd, Rs1
Logical And
(a) LAND Rd, Rs1, Rs2
(b) LAND Rd, Rs1, imm16
Logical Or
(a) LOR Rd, Rs1, Rs2
(b) LOR Rd, Rs1, imm16
Arithmetic Shift Left
(a) SLA Rd, Rs1, imm5
Arithmetic Shift Right
(a) SRA Rd, Rs1, imm5
Set if Equal
(a) SEQ Rd, Rs1, Rs2
Set if Greater Than
(a) SGT Rd, Rs1, Rs2
Page Number: 34/101
Branch Class:
Branch on True
(a) BT Rd, Rs1
Branch Always
(a) BA Rd
Data Memory Class:
- load & store instructions
 load:
(1) three cycles: IF, IDX & LD
(2) IDX:
register_read - ALU_operation - output_latch_write (address)
(3) LD
Load
(a) LD Rd, Rs2
 store:
(1) two cycles: IF & IDX
(2) IDX:
register_read - ALU_operation - output_latch_write (data & data address)
Store
(a) ST Rd, Rs2
Page Number: 35/101
System instructions:
Noophalt
(a) NOOPHALT
idle state of the machine; this instruction may be used for
filling slot(s) behind branches and/or loads,
or for real-time isp' programming,
or to support modular isp' programming.
Page Number: 36/101
Branching in pipelined machines:
Interlock mechanism:
hw (cisc-mostly) versus sw (risc-mostly)
i
i+1
i+75

Scoreboard branch: hw interlock
(clock slow-down)
 ALU (arithmetic-logic-unit) suspend
 RWB (register-write-unit) suspend
Page Number: 37/101
Delayed branch: sw interlock
source code:
i-1:  ADD R7, imm32
i:    JUMP R1, R2>R3
i+1:  MOVE R3, R4
i+2:  SUB R5, R6
after code generation:
i-1:  ADD R7, imm32
i:    JUMP R1+1, R2>R3
i+1:  NOOP
i+2:  MOVE R3, R4
i+3:  SUB R5, R6
after code optimization:
i-1:
i:    JUMP R1+1, R2>R3
i+1:  ADD R7, imm32
i+2:  MOVE R3, R4
i+3:  SUB R5, R6
Page Number: 38/101
condition: THE MOVED INSTRUCTION
(a) MUST BE EXECUTED (no matter if the
branch is taken or not), AND
(b) MUST NOT AFFECT THE BRANCH CONDITION AND/OR
THE JUMP TARGET ADDRESS.
parameters:
(a) PIPELINE FILL-IN DEPTH
(which is not the pipeline depth minus one!)
(b) BRANCHING-RELATED STATISTICS
(branches executed versus branches taken)
(c) BRANCH FILL-IN FUNCTION
(local versus global code optimization)
(d) CLOCK SLOW DOWN FUNCTION
(in-the-critical-path versus off-the-critical-path)
(e) TECHNOLOGY-RELATED STATISTICS
(on-chip versus off-chip delays)
(f) CACHE IMPACT (hit versus miss penalty)
NUMERICAL EXAMPLE:
What is the equation for the condition that
hw and sw interlock have the same
benchmark execution time (not clock-count)?
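One possible way to set up the question above is sketched in Python below. The model is an assumption made for illustration, not the answer given in the lecture: the hardware interlock stalls only on taken branches and pays a clock slow-down, while the software interlock pays one NOOP for every delay slot the optimizer fails to fill. All function and parameter names are hypothetical.

```python
def time_hw(n, branch_frac, taken_frac, depth, period_hw):
    """Execution time with a hw interlock: taken branches stall 'depth' cycles."""
    return n * (1 + branch_frac * taken_frac * depth) * period_hw


def time_sw(n, branch_frac, fill_frac, depth, period_sw):
    """Execution time with delayed branches: unfilled slots cost a NOOP each."""
    return n * (1 + branch_frac * depth * (1 - fill_frac)) * period_sw


def breakeven_fill(branch_frac, taken_frac, depth, slowdown):
    """Fill-in fraction at which both schemes take the same time.

    slowdown = period_hw / period_sw (the clock slow-down caused by the
    interlock hardware). Obtained by solving time_hw == time_sw for fill_frac.
    """
    return 1 - ((1 + branch_frac * taken_frac * depth) * slowdown - 1) / (
        branch_frac * depth
    )


if __name__ == "__main__":
    f = breakeven_fill(branch_frac=0.2, taken_frac=0.5, depth=1, slowdown=1.05)
    print(round(f, 3))
```

Under this model the software interlock wins whenever the optimizer fills more slots than the break-even fraction; a larger clock slow-down of the hardware scheme lowers that threshold.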
Page Number: 39/101
Loading in pipelined machines:
Interlock mechanism: hw versus sw
i:    IF   IDX  LD
i+1:       IF   IDX
Scoreboard LOAD:
- Suspend
- Bypass
Page Number: 40/101
Delayed LOAD: sw interlock
source code:
i-1:  MOVE R3,R4
i:    LOAD R7, memory
i+1:  ADD R2, R1, R7
after code generation:
i-1:  MOVE R3,R4
i:    LOAD R7, memory
i+1:  NOOP
i+2:  ADD R2, R1, R7
after code optimization:
i-1:
i:    LOAD R7, memory
i+1:  MOVE R3,R4
i+2:  ADD R2, R1, R7
condition: mutual independence
parameters: technology related,
design + organization + architecture related,
system software related,
and application related.
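The code generation and optimization steps for the delayed LOAD can be sketched as a small Python pass. This is a simplified, local (peephole) illustration under stated assumptions: a load delay of one slot, instructions represented by a hypothetical Instr tuple, and only the instruction immediately before the LOAD considered as a candidate for the slot.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dst: Optional[str]
    srcs: Tuple[str, ...]


NOOP = Instr("NOOP", None, ())


def independent(a, b):
    """True when a and b neither write the same register nor feed each other."""
    a_ok = a.dst is None or (a.dst != b.dst and a.dst not in b.srcs)
    b_ok = b.dst is None or b.dst not in a.srcs
    return a_ok and b_ok


def fill_load_slots(code):
    """Fill each needed load delay slot with the preceding instruction or a NOOP."""
    out = []
    for i, ins in enumerate(code):
        nxt = code[i + 1] if i + 1 < len(code) else None
        needs_slot = ins.op == "LOAD" and nxt is not None and ins.dst in nxt.srcs
        if not needs_slot:
            out.append(ins)
            continue
        prev = out[-1] if out else None
        movable = (
            prev is not None
            and prev.op not in ("LOAD", "NOOP", "JUMP")
            and (len(out) < 2 or out[-2].op != "LOAD")  # prev is not itself in a slot
            and independent(prev, ins)
        )
        if movable:
            out[-1] = ins        # the LOAD moves up one position
            out.append(prev)     # the independent instruction fills the slot
        else:
            out.extend([ins, NOOP])
    return out


if __name__ == "__main__":
    program = [
        Instr("MOVE", "R3", ("R4",)),
        Instr("LOAD", "R7", ()),
        Instr("ADD", "R2", ("R1", "R7")),
    ]
    for ins in fill_load_slots(program):
        print(ins.op, ins.dst, ins.srcs)
```

For the three-instruction example from the slide the pass produces LOAD, MOVE, ADD; when the preceding instruction is not independent of the LOAD, it falls back to inserting a NOOP.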
Page Number: 41/101
CURRENT WINDOW
i-1:  IF   IDX  LD
i:         IF   IDX  LD
i+1:            IF   IDX  LD
leaves PASTPC,
PASTOP (part of PASTIR)
leaves PC,
OP (part of IR)
after IF,
puts PC+1 into PC;
after IDX (when branch),
puts REG[dst] into PC;
[Flowchart: MAIN; IR=MEMRY[PASTPC]; PASTPC=PC; PC=PC+1; PASTOP=OP; DELAY(1); PC=REG[DST]; END]
Page Number: 42/101
Page Number: 43/101
The ".isp" file:
- Macro section
macro
WORD = 32&,
BYTE = 8&,
NIBBLE = 4&
;
- State section
state
reg[0:15]<WORD>,
pc<WORD>,
pastpc<WORD>,
ir<WORD>,
pastop<WORD>,
pastdst<NIBBLE>,
pastval<WORD>,
hist[0:23]<WORD>
;
!
!
- Memory section
memory
memry[0:0xfff]<WORD>
;
- Format section
format
op = ir<31:24>,
dst = ir<23:20>,
src1 = ir<19:16>,
src2 = ir<15:12>,
imm16 = ir<15:0>,
imm5 = ir<4:0>
;
Page Number: 44/101
- Main Program
main := (
    pastop = op;
    pastpc = pc;
    pc = pc + 1;
    ir = memry[pastpc];
    hist[pastop] = hist[pastop] + 1;
    delay(1);
    if pastop eql 21
        reg[pastdst] = pastval;
    case op
        0: reg[dst] = reg[src1] + reg[src2]
        instructions 1 to 20
        21: (
            pastdst = dst;
            pastval = memry[reg[src2]])
        22: memry[reg[src2]] = reg[dst]
        23:
    esac;
)
Page Number: 45/101
The complete "case":
! Instruction decode and execution is done here. The "case" statement performs
! the decode - note that the opcode bits are tested as one would expect.
! For each legal opcode, a unique action is specified.
! Only one action is performed, then the bottom of the "main" process is reached,
! and we return to the top of the process.
case op
0: reg[dst] = reg[src1] + reg[src2]               ! add (reg-reg)
1: reg[dst] = reg[src1] + imm16 sxt 32            ! add (reg-imm)
2: reg[dst] = pc + imm16 sxt 32                   ! add (pc-imm)
3: reg[dst] = reg[src1] - reg[src2]               ! sub (reg-reg)
4: reg[dst] = reg[src1] - imm16 sxt 32            ! sub (reg-imm)
5: reg[dst] = pc - imm16 sxt 32                   ! sub (pc-imm)
6: reg[dst] = reg[src1]                           ! mov (reg-reg)
7: reg[dst] = imm16 sxt 32                        ! mov (reg-imm)
8: reg[dst] = pc                                  ! mov (pc)
9: reg[dst] = - reg[src1]                         ! negate
10: reg[dst] = reg[src1] and reg[src2]            ! and (reg-reg)
11: reg[dst] = reg[src1] and imm16 sxt 32         ! and (reg-imm)
12: reg[dst] = reg[src1] or reg[src2]             ! or (reg-reg)
13: reg[dst] = reg[src1] or imm16 sxt 32          ! or (reg-imm)
14: reg[dst] = not reg[src1]                      ! not
15: reg[dst] = reg[src1] *:arith (imm5 ext 32)    ! shift left
16: reg[dst] = reg[src1] /:arith (imm5 ext 32)    ! shift right
17: if reg[src1] eql reg[src2]                    ! set if equal
        reg[dst] = - 1
    else reg[dst] = 0
18: if reg[src1] gtr reg[src2]                    ! set if greater
        reg[dst] = - 1
    else reg[dst] = 0
19: if reg[src1] eql -1                           ! branch on true
        pc = reg[dst]
20: pc = reg[dst]                                 ! branch always
21: (pastdst = dst;                               ! load
     pastval = memry[reg[src2]]
    )
22: memry[reg[src2]] = reg[dst]                   ! store
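To see how the delayed load and delayed branch of the "case" above behave at run time, the following Python sketch interprets a small subset of the opcodes (0, 1, 7, 17, 19, 20, 21, 22, 23). It is not a transcription of the ISP' model: the pipeline effects are modeled with explicit pending_load and pending_pc variables, opcode 23 is treated as a stop so that the example terminates, and the helper names word and run are hypothetical.

```python
MASK = 0xFFFFFFFF


def word(op, dst=0, src1=0, src2=0, imm16=0):
    """Pack one instruction; src2 and imm16 share the low 16 bits."""
    return op << 24 | dst << 20 | src1 << 16 | src2 << 12 | (imm16 & 0xFFFF)


def sxt16(value):
    """Sign-extend a 16-bit immediate."""
    return value - 0x10000 if value & 0x8000 else value


def run(words, max_cycles=100):
    """Execute the program in 'words'; return (reg, hist)."""
    memry = list(words)
    reg = [0] * 16
    hist = [0] * 24          # per-opcode execution counts
    pc = 0
    pending_pc = None        # branch target, effective after one delay slot
    pending_load = None      # (dst, value), written back one instruction late
    for _ in range(max_cycles):
        ir = memry[pc]
        next_pc = pc + 1 if pending_pc is None else pending_pc
        pending_pc = None
        op, dst = ir >> 24 & 0xFF, ir >> 20 & 0xF
        src1, src2 = ir >> 16 & 0xF, ir >> 12 & 0xF
        imm16 = sxt16(ir & 0xFFFF)
        hist[op] += 1
        old = list(reg)                      # operands are read before write-back
        if pending_load is not None:         # delayed load lands in this cycle
            reg[pending_load[0]] = pending_load[1]
            pending_load = None
        if op == 0:
            reg[dst] = (old[src1] + old[src2]) & MASK
        elif op == 1:
            reg[dst] = (old[src1] + imm16) & MASK
        elif op == 7:
            reg[dst] = imm16 & MASK
        elif op == 17:
            reg[dst] = MASK if old[src1] == old[src2] else 0
        elif op == 19:
            if old[src1] == MASK:            # -1 means "true"
                pending_pc = old[dst]
        elif op == 20:
            pending_pc = old[dst]
        elif op == 21:
            pending_load = (dst, memry[old[src2]])
        elif op == 22:
            memry[old[src2]] = old[dst]
        elif op == 23:
            break                            # assumption: stop the sketch here
        pc = next_pc
    return reg, hist


if __name__ == "__main__":
    program = [
        word(7, dst=1, imm16=6),        # r1 = 6 (address of the data word)
        word(21, dst=2, src2=1),        # r2 = memry[r1], available one slot later
        word(0, dst=3, src1=2, src2=2), # delay slot: still sees the old r2
        word(0, dst=4, src1=2, src2=2), # sees the loaded value
        word(23),
        0,
        21,                             # data
    ]
    regs, counts = run(program)
    print(regs[3], regs[4], counts[0])
```

The instruction in the load delay slot reads the old register value, which is why the code generator must place either an independent instruction or a no-operation there.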
Page Number: 46/101
The ".m" file:
- Instr Section
instr
I<32>$
- Format Section
format
op = I<31:24>,
dst = I<23:20>,
src1 = I<19:16>,
src2 = I<15:12>,
imm16 = I<15:0>,
imm5 = I<4:0>$
- Macro section
macro
r0 = 0&,
r1 = 1&,
...
r15 = 15&,
addr(d,s1,s2) = op=0; dst=d;
src1=s1; src2=s2$&,
instructions 1 to 22
noophalt = op=23$&$
- Begin-end section
begin
include ee666.test$
end
Page Number: 47/101
The ".i" file:
- Instr Section
instr
I<32>$
- Format Section
format
op = I<31:24>,
dst = I<23:20>,
src1 = I<19:16>,
src2 = I<15:12>,
imm16 = I<15:0>,
imm5 = I<4:0>$
- Space section
space
<0:4095>$
- Transfer section
transfer
{new}
- Mode section
mode
case op eql 7
imm16~address$
break$
esac,
default:
imm16~imm16$
Page Number: 48/101
The ".t" file
processor cpu = "ee666.sim";
time delay = 100ns;
initial memry = l.out;
Page Number: 49/101
The ".b" file:
Sample assembler language program that uses the instructions
for the RISC-like processor of the ee666 (Advanced Computer Systems),
Purdue University, Spring Semester 1987.
Filename: ee666.test
11:
12:
13:
movi(r0,100)
subri(r1,10,100)
movr(r2,r1)
seq(r3,r1,r2)
movi(r4,11)
movi(r5,12)
movi(r6,13)
bt(r4,r3)
ba(r5)
movi(r1,10)
addri(r1,r1,1)
addri(r1,r1,1)
sgt(r7,r2,r1)
bt(r6,r7)
addr(r8,r0,r2)
subri(r9,r1,10)
st(r9,r8)
ba(r5)
addri(r2,r2,2)
subri(r8,r8,2)
ld(r8,r8)
movr(r10,r8)
addr(r10,r10,r8)
sla(r10,r10,2)
halt
Page Number: 50/101
Sample Fura RISC VMS Session:
1. set def [.N2]
2. copy VL$A:[N2.E666]*.* *.*
3. @VL$A:[N2]login
4. n2 -s script.txt ee666.e00
If you want to test your own CPU:
1. @VL$A:[N2]login
2. edit cpuname.isp
3. ic cpuname.isp
4. edit cpuname.m
5. edit program.m
6. micro cpuname.m
7. edit cpuname.i
8. inter cpuname.i
9. cater cpuname.a cpuname.n
10. edit cpuname.t
11. ec -b cpuname.t
12. n2 -s script.txt cpuname.e00
Page Number: 51/101
Papers from the Open Literature:
1)
Rose, C.W., Ordy, G. M., Drongowski, P. J.,
"N.mpc: A Study in University-Industry Technology Transfer"
IEEE Design & Test of Computers, February 1984, pp 44-56.
2)
Rose, C. W., "System Design Tools - A Paradigm Shift,"
Endot Corporation Internal Report, 1986.
3)
Gay, F., "Functional Simulation Fuels System Design,"
VLSI Design Technology
4)
Kong, S., Wood, D., Gibson, G., Katz, R., Patterson, D.,
"Design Methodology of a VLSI Multiprocessor Workstation,"
VLSI Systems, February 1987.
5)
Bozanic, D., Fura, D., Milutinovic, V., "Simulation of a Simple
RISC Processor," Application Note, No. D#001/VM,
TD Technologies, Cleveland Heights, Ohio, U.S.A., 1993.
6)
Petkovic, Z., Milutinovic, V., "Simulation of the Intel i860 RISC
Processor," Application Note, No. D#003/VM, TD Technologies,
Cleveland Heights, Ohio, U.S.A., 1994.
7)
Milicev, D., Petkovic, Z., Milutinovic, V., "Simulation Study of
Uniprocessor Cache Memories," Application Note,
No. D#004/VM, TD Technologies,
Cleveland Heights, Ohio, U.S.A., 1994.
8)
Tomasevic, M., Milutinovic, V., "Using N.2 in a Simulation
Study of Snoopy Cache Coherence Protocols for Shared Memory
Multiprocessor System," Application Note, No. D#002/VM,
TD Technologies, Cleveland Heights, Ohio, U.S.A., 1993.
Page Number: 52/101
WORKLOAD CHARACTERIZATION
Important Reference:
Ferrari, D., Computer Systems Performance Evaluation, Prentice-Hall, Englewood
Cliffs, New Jersey, U.S.A., 1978.
Introduction:
Workload of a computer system has been defined as the set of all inputs
(programs, data, commands, etc...) that the system receives from its
environment.
In measurement experiments, the system is driven by a model of the workload which is just a sample of the real production workload.
The major question is how representative this sample is. Other important characteristics of a workload are:
a) simplicity of construction,
b) usage cost,
c) reproducibility,
d) compactness, and
e) system independence.
Types of Workload Models:
1. Natural workload model: A sample job stream taken from a production workload, and used to drive the system at the very time it was produced.
2. Artificial workload model: All other cases.
2a. Non executable:
Defined via statistical distributions of relevant parameters.
Usage: In analytical studies.
Typical forms: Probabilities of various instructions
(instruction mixes), memory accesses,
procedure nesting depths, etc...
Relevant issues: Mean values, variances, correlations,
autocorrelations, etc...
Standard instruction mixes: Flynn (MLL), Knuth (HLL), etc...
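A non-executable workload model of the instruction-mix kind can be evaluated with a few lines of Python. The sketch below computes the mean and variance of the per-instruction cost from a mix; the class names reuse the four Fura RISC instruction classes, while the probabilities and cycle counts are made-up numbers for illustration only, not measured data.

```python
def mix_statistics(mix, cost):
    """Return (mean, variance) of the cost per instruction for a given mix.

    mix  : mapping class name -> probability (must sum to 1)
    cost : mapping class name -> cycles per instruction of that class
    """
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("instruction mix probabilities must sum to 1")
    mean = sum(p * cost[name] for name, p in mix.items())
    variance = sum(p * (cost[name] - mean) ** 2 for name, p in mix.items())
    return mean, variance


if __name__ == "__main__":
    # Hypothetical mix and costs, chosen only to demonstrate the arithmetic.
    mix = {"alu": 0.50, "branch": 0.20, "memory": 0.25, "system": 0.05}
    cost = {"alu": 1, "branch": 2, "memory": 3, "system": 1}
    mean, variance = mix_statistics(mix, cost)
    print(round(mean, 3), round(variance, 3))
```

Correlations and autocorrelations, also listed among the relevant issues, cannot be captured by such a mix, which is one reason executable workload models are needed.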
Page Number: 53/101
2b. Executable:
Defined via one or more programs.
Usage: In empirical studies.
Typical forms: Synthetic jobs (parametric programs) and
benchmarks (semantic programs).
Relevant issues: application orientation, etc...
Standard ones: See the PC magazines, etc...
Synthetic job approaches:
Buchholz (fixed flowchart with variable parameters)
Kernighan + Hamilton (similar but more sophisticated)
Archibald + Baer (the most widely cited
computer architecture paper in 80's )
Benchmark types:
Extracted
Created
Standard (application dependent)
Page Number: 54/101
The DARPA/Stanford benchmarks:
The DARPA/Stanford Benchmark Package
consists of thirteen PASCAL programs:
1) ackp.p
2) bubblesortp.p
3) fftp.p
4) fibp.p
5) intmmp.p
6) permp.p
7) puzzlep.p
8) eightqueenp.p
9) quickp.p
10) realmmp.p
11) sievep.p
12) towresp.p
13) treep.p
These programs are located on ed machine,
and the full path name of their directory is:
/a/mips/bench
Page Number: 55/101
An Introduction to
VLSI Processor Architecture
for GaAs
This research has been sponsored by RCA
and conducted in collaboration with
the RCA Advanced Technology Laboratories,
Moorestown, New Jersey.
Page Number: 56/101
Advantages
• For the same power consumption, at least half order of magnitude faster than Silicon.
• Efficient integration of electronics and optics.
• Tolerant of temperature variations. Operating range: [-200°C, +200°C].
• Radiation hard. Several orders of magnitude more than Silicon: [>100 million RADs].
Page Number: 57/101
Disadvantages:
• High density of wafer dislocations
-> Low yield -> Small chip size -> Low transistor count.
• Noise margin not as good as in Silicon.
-> Area has to be traded in for higher reliability.
• At least two orders of magnitude more expensive than Silicon.
• Currently having problems with high-speed test equipment.
Page Number: 58/101
Basic differences of Relevance for Microprocessor Architecture
• Small area and low transistor count
(* in general, implications of this fact are dependent
on the speed of the technology *)
• High ratio of off-chip and on-chip delays
(* consequently, off-chip memory access is
much longer than on-chip memory access *)
• Limited fan-in and fan-out (?)
(* temporary differences *)
• High demand on efficient fault-tolerance (?)
(* to improve the yield for bigger chips *)
Page Number: 59/101
A Brief Look Into the GaAs IC Design
•Bipolar (TI + CDC)
•JFET (McDAC)
•GaAs MESFET Logic Families (TriQuint + RCA)
D-MESFET
(* Depletion Mode *)
E-MESFET
(* Enhancement Mode *)
Page Number: 60/101
                              Speed          Dissipation    Complexity
                              (ns)           (W)            (K transistors)
Arithmetic
32-bit adder                  2.9 total      1.2            2.5
(BFL D-MESFET)
16×16-bit multiplier          10.5 total     1.0            10.0
(DCFL E/D MESFET)
Control
1K gate array                 0.4/gate       1.0            6.0
(STL HBT)
2K gate array                 0.08/gate      0.4            8.2
(DCFL E/D MESFET)
Memory
4Kbit SRAM                    2.0 total      1.6            26.9
(DCFL E/D MODFET)
16K SRAM                      4.1 total      2.5            102.3
(DCFL E/D MESFET)
Figure 7.1. Typical (conservative) data for speed, dissipation, and complexity of digital GaAs
chips.
Page Number: 61/101
                                                    GaAs                Silicon       Silicon       Silicon          Silicon
                                                    (1 µm E/D-MESFET)   (2 µm NMOS)   (2 µm CMOS)   (1.25 µm NMOS)   (2 µm ECL)
Complexity
On-chip transistor count                            40K                 200K          200K          400K             40K (T or R)
Speed
Gate delay (minimal fan-out)                        50-150 ps           1-3 ns        800-1000 ps   500-700 ps       150-200 ps
On-chip memory access (32×32 bit capacity)          0.5-2.0 ns          20-40 ns      10-20 ns      5-10 ns          2-3 ns
Off-chip, on package memory access (256×32 bits)    4-8 ns              40-80 ns      30-40 ns      20-30 ns         6-10 ns
Off-package memory access (1k×32 bits)              10-50 ns            100-200 ns    60-100 ns     40-80 ns         20-80 ns
Figure 7.2. Comparison (conservative) of GaAs and silicon, in terms of complexity and speed of the chips (assuming equal
dissipation). Symbols T and R refer to the transistors and the resistors, respectively. Data on silicon ECL technology
complexity includes the transistor count increased for the resistor count.
Page Number: 62/101
                                 GaAs E/D-DCFL     Silicon SOS-CMOS
Minimal geometry                 1 µm              1.25 µm
Levels of metal                  2                 2
Gate delay                       250 ps            1.25 ns
Maximum fan-in                   5 NOR, 2 AND      4 NOR, 4 NAND
Maximum fan-out                  4                 20
Noise immunity level             220 mV            1.5 V
Average gate transistor count    4.5               7
On-chip transistor count         25 000            100 000-150 000
Figure 7.3. Comparison of GaAs and silicon, in the case of actual 32-bit microprocessor implementations (courtesy of RCA).
The impossibility of implementing “phantom” logic (wired-OR) is a consequence of the low noise immunity of GaAs circuits
(200 mV).
Page Number: 63/101
Figure 7.4. Processor organization based on the BS (bit-slice) components. The meaning of symbols is as follows: IN—
input, BUFF—buffer, MUX—multiplexer, DEC—decoder, L—latch, OUT—output. The remaining symbols are standard.
Page Number: 64/101
Figure 7.5. Processor organization based on the FS (function slice) components: IM—instruction memory, I_D_U—
instruction decode unit, DM_I/O_U—data memory input/output unit, DM—data memory.
Page Number: 65/101
Implication of the High Off/On Ratio
On the Choice of Processor Design Philosophy
Only a single-chip reduced architecture makes sense!
In a silicon environment, we can argue “RISC” or “CISC”.
In a GaAs environment, there is only one choice: “RISC”.
However, the RISC concept has to be significantly modified
for efficient GaAs utilization.
Page Number: 66/101
The Information Bandwidth Problem of GaAs
Assume a 10:1 advantage in on-chip switching speed, but
only a 3:1 advantage in off-chip/off-package memory access.
Will the microprocessor be 10 times faster?
Or only 3 times faster?
Why the Information Bandwidth Problem?
The Reduced Philosophy:
• Large register file
• Most or all on-chip memory is used for the register file
• An on-chip instruction cache is out of the question
Instruction fetch must be from an off-chip environment
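The question above (10 times faster, or only 3 times?) can be explored with a small Amdahl-style estimate. This is a minimal sketch, not a model from the slides: the function name and the `off_chip_fraction` parameter (the share of the original execution time spent waiting on off-chip accesses) are assumptions introduced here for illustration.

```python
def overall_speedup(off_chip_fraction, on_chip_gain=10.0, off_chip_gain=3.0):
    """Estimate overall speedup when on-chip work speeds up by on_chip_gain
    and off-chip memory access speeds up by only off_chip_gain.

    off_chip_fraction is the (assumed) share of the original run time
    that is spent in off-chip accesses, between 0.0 and 1.0.
    """
    if not 0.0 <= off_chip_fraction <= 1.0:
        raise ValueError("off_chip_fraction must be between 0 and 1")
    on_chip_fraction = 1.0 - off_chip_fraction
    # New run time, normalized to an original run time of 1.
    new_time = on_chip_fraction / on_chip_gain + off_chip_fraction / off_chip_gain
    return 1.0 / new_time


if __name__ == "__main__":
    for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"off-chip share {fraction:.2f}: speedup {overall_speedup(fraction):.2f}x")
```

Under these assumptions the answer lies between 3 and 10 and moves toward 3 as soon as instruction fetch dominates, which is why the off-chip path matters so much when no instruction cache fits on the chip.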
Page Number: 67/101
Applications for GaAs Microprocessor
• General purpose processing in defense and aerospace,
and execution of compiled HLL code.
• General purpose processing and substitution
of current CISC microprocessors.*
• Dedicated special-purpose applications
in digital control and signal processing.*
• Multiprocessing of the SIMD/MIMD type,
for numeric and symbolic applications.
Page Number: 68/101
Which Design Issues Are Affected?
On-chip issues:
•Register file
•ALU
•Pipeline organization
•Instruction set
Off-chip issues:
•Cache
•Virtual memory management
•Coprocessing
•Multiprocessing
System software issues:
•Compilation
•Code optimization
Page Number: 69/101
Adder Design
Figure 7.6. Comparison of GaAs and silicon. Symbols CL and RC refer to the basic adder types (carry look ahead and ripple carry).
Symbol B refers to the word size.
a) Complexity comparison. Symbol C[tc] refers to complexity, expressed in transistor count.
b) Speed comparison. Symbol D[ns] refers to propagation delay through the adder, expressed in nanoseconds. In the case
of silicon technology, the CL adder is faster when the word size exceeds four bits (or a somewhat lower number, depending on the
diagram in question). In the case of GaAs technology, the RC adder is faster for the word sizes up to n bits (actual value of n
depends on the actual GaAs technology used).
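The structural difference between the two adder types in Figure 7.6 can be sketched at the bit level. The code below is an illustrative model only (function names are assumptions, and it says nothing about real gate delays): the RC adder passes the carry through a serial chain of small gates, while the CL adder computes every carry directly from generate and propagate signals, which needs gates whose input count grows with the word size. Given the limited fan-in listed for GaAs in Figure 7.3, that growth is one plausible reason the CL adder loses its advantage there.

```python
def ripple_carry_add(a, b, bits):
    """RC adder: the carry ripples serially from bit 0 upward."""
    carry = 0
    result = 0
    for i in range(bits):
        x = (a >> i) & 1
        y = (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return result, carry


def carry_lookahead_add(a, b, bits):
    """CL adder: each carry is a flat AND-OR expression of g and p signals."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(bits)]  # generate
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(bits)]  # propagate
    carries = [0]  # carry into bit 0 is assumed to be zero
    for i in range(bits):
        carry_out = 0
        # c[i+1] = g[i] | p[i]g[i-1] | ... | p[i]p[i-1]...p[1]g[0]
        for j in range(i + 1):
            term = g[j]
            for k in range(j + 1, i + 1):
                term &= p[k]
            carry_out |= term
        carries.append(carry_out)
    result = 0
    for i in range(bits):
        result |= (p[i] ^ carries[i]) << i
    return result, carries[bits]
```

Both functions return the same (sum, carry-out) pair; only the shape of the logic differs.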
Page Number: 70/101
Figure 7.7. Comparison of GaAs and silicon technologies: an example of the bit-serial adder. All symbols
have their standard meanings.
Page Number: 71/101
Register File Design
Figure 7.8. Comparison of GaAs and silicon technologies: design of the register cell: (a) an example of the register cell frequently used
in the silicon technology; (b) an example of the register cell frequently used in the GaAs microprocessors. Symbol BL refers to the
unique bit line in the four-transistor cell. Symbols A BUS and B BUS refer to the double bit lines in the seven-transistor cell. Symbol F
refers to the refresh input. All other symbols have their standard meanings.
Page Number: 72/101
Pipeline design
Figure 7.9. Comparison of GaAs and silicon technologies: pipeline design, a possible design error: (a) two-stage pipeline typical of some silicon microprocessors; (b) the same two-stage pipeline when the off-chip
delays are three times longer than on-chip delays (the off-chip delays are the same as in the silicon version).
Symbols IF and DP refer to the instruction fetch and the ALU cycle (datapath). Symbol T refers to time.
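The design error in Figure 7.9 can be put in numbers with a very small model. This is a sketch under stated assumptions: the pipeline advances at the pace of its slowest stage, and the function name and time units are hypothetical.

```python
def datapath_utilization(fetch_time, datapath_time):
    """Fraction of each pipeline cycle in which the datapath (DP) does useful work.

    The two-stage pipeline is assumed to advance at the pace of its
    slower stage, so the cycle time is max(fetch_time, datapath_time).
    """
    cycle_time = max(fetch_time, datapath_time)
    return datapath_time / cycle_time


# Case (a): IF and DP take the same time, the datapath is always busy.
balanced = datapath_utilization(fetch_time=1, datapath_time=1)
# Case (b): off-chip IF is three times longer than the on-chip DP.
unbalanced = datapath_utilization(fetch_time=3, datapath_time=1)
```

In case (b) the fast datapath sits idle for two thirds of every cycle, so most of the on-chip speed advantage is lost.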
Page Number: 73/101
Figure 7.10. Comparison of GaAs and silicon technologies: pipeline design—possible solutions; (a1) timing diagrams of a pipeline
based on the IM (interleaved memory) or the MP (memory pipelining); (a2) a system based on the IM approach; (a3) a system based
on the MP approach; (b) timing diagram of the pipeline based on the IP (instruction packing) approach. Symbols P, M, and MM refer
to the processor, the memory, and the memory module. The other symbols were defined earlier.
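The effect of the IM (interleaved memory) approach from Figure 7.10 can be illustrated with a simple timing model. The sketch assumes that the processor can issue one fetch per cycle, that fetches are assigned to memory modules round-robin, and that each module is busy for the full access latency; the function and parameter names are introduced here for illustration only.

```python
def fetch_completion_times(count, latency, modules):
    """Return the cycle at which each of `count` sequential fetches completes.

    Fetch i goes to module i % modules. A module can start a new access
    only after its previous access (lasting `latency` cycles) has finished.
    """
    module_free_at = [0] * modules
    completion = []
    earliest_issue = 0
    for i in range(count):
        module = i % modules
        start = max(earliest_issue, module_free_at[module])
        module_free_at[module] = start + latency
        completion.append(start + latency)
        earliest_issue = start + 1  # at most one new request per cycle
    return completion


# One module, latency 3: one instruction arrives every three cycles.
single = fetch_completion_times(count=3, latency=3, modules=1)   # [3, 6, 9]
# Three modules, latency 3: after the initial delay, one arrives per cycle.
interleaved = fetch_completion_times(count=4, latency=3, modules=3)  # [3, 4, 5, 6]
```

With as many modules as the off-chip/on-chip delay ratio, the latency is still paid once, but the throughput returns to one instruction per on-chip cycle.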
Page Number: 74/101
32-bit
GaAs MICROPROCESSORS
Goals and project requirements:
•200 MHz clock rate
•32-bit parallel data path
•16 general purpose registers
•Reduced Instruction Set Computer (RISC) architecture
•24-bit word addressing
•Virtual memory addressing
•Up to four coprocessors connected to the CPU
(Coprocessors can be of any type and all different)
References:
1. Milutinović, V. (editor), “Special Issue on GaAs
Microprocessor Technology,” IEEE Computer, October 1986.
2. Helbig, W., Milutinović, V., “The RCA DCFL E/D-MESFET GaAs Experimental RISC Machine,” IEEE
Transactions on Computers, December 1988.
Page Number: 75/101
3. The outputs of two circuits cannot be tied together:
a. One cannot utilize phantom logic on the chip to implement functions like WIRED-OR
(all outputs active). Circuits have a low “operating noise margin”.
b. One cannot use three-state logic on the chip to implement functions
like MULTIPLE-SOURCE-BUS (only one output active). Circuits have no “off-state”.
c. Actually, if one insists on having a MULTIPLE-SOURCE-BUS on the chip,
one can have it at the cost of only one active load and the need to precharge
(both mean “constraints” and “slowdown” on the architecture level).
d. Fortunately, the logic function AND-OR is exactly what is needed to create
a multiplexer, a perfect replacement for a bus.
E
Page Number: 76/101
[Figure: MUX]
Page Number: 77/101
Figure 7.11. The technological problems that arise from the usage of GaAs technology: (a) an example of the fan-out tree, which
provides a fan-out of four, using logic elements with the fan-out of two; (b) an example of the logic element that performs a two-to-one
one-bit multiplexing. Symbols a and b refer to data inputs. Symbol c refers to the control input. Symbol o refers to data output.
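Both parts of Figure 7.11 can be sketched in a few lines. The code is illustrative only: the polarity of the control input (c = 1 selects a) is an assumption, since the caption does not state it, and the fan-out tree is modeled as a full tree, which gives an upper bound on the number of extra buffers.

```python
def mux2(a, b, c):
    """Two-to-one, one-bit multiplexer built as AND-OR logic.

    o = (a AND c) OR (b AND NOT c); c = 1 is assumed to select input a.
    """
    return (a & c) | (b & (1 - c))


def fanout_tree(required_loads, max_fanout):
    """Return (buffer_levels, buffer_count) for a full fan-out tree.

    Each element can drive at most max_fanout inputs. Levels of buffers
    are added until the tree can reach required_loads inputs.
    """
    if max_fanout < 2:
        raise ValueError("max_fanout must be at least 2 to build a tree")
    levels = 0
    buffers = 0
    reach = max_fanout  # inputs the source can drive directly
    while reach < required_loads:
        buffers += reach      # every directly driven position becomes a buffer
        reach *= max_fanout
        levels += 1
    return levels, buffers


# Fan-out of four from elements with a fan-out of two, as in Figure 7.11(a).
levels, buffers = fanout_tree(required_loads=4, max_fanout=2)  # (1, 2)
```

Each buffer level adds one gate delay to the signal path, which is the price of the limited fan-out.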
Page Number: 78/101
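(a) Microstrip (MS):
$$Z_0 = \frac{87}{\sqrt{\varepsilon_r + 1.41}} \ln\frac{5.98\,H}{0.8\,W + T}$$
$$D_0 = 1.016\,\sqrt{0.475\,\varepsilon_r + 0.67}\ \text{ns/ft}$$
(b) Stripline (SL):
$$Z_0 = \frac{60}{\sqrt{\varepsilon_r}} \ln\frac{4B}{0.67\,\pi\,(0.8\,W + T)}$$
$$D_0 = 1.016\,\sqrt{\varepsilon_r}\ \text{ns/ft}$$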
Figure 7.12. Some possible techniques for realization of PCBs (printed circuit boards): (a) The MS technique (microstrip); (b) The SL
technique (stripline).
Symbols D0 and Z0 refer to the signal delay and the characteristic impedance, respectively. The meaning of other symbols is defined in
former figures, or they have standard meanings.
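The microstrip and stripline expressions of Figure 7.12 are easy to evaluate numerically. The sketch below uses the usual closed-form approximations (including the factor π in the stripline impedance, as in the standard form of that formula); the function names and the example dimensions are assumptions, and all dimensions must be given in the same unit.

```python
import math


def microstrip(eps_r, h, w, t):
    """Return (Z0 in ohms, D0 in ns/ft) for a microstrip line.

    eps_r: relative dielectric constant; h: dielectric height;
    w: trace width; t: trace thickness.
    """
    z0 = 87.0 / math.sqrt(eps_r + 1.41) * math.log(5.98 * h / (0.8 * w + t))
    d0 = 1.016 * math.sqrt(0.475 * eps_r + 0.67)
    return z0, d0


def stripline(eps_r, b, w, t):
    """Return (Z0 in ohms, D0 in ns/ft) for a stripline.

    b: distance between the two reference planes.
    """
    z0 = 60.0 / math.sqrt(eps_r) * math.log(4.0 * b / (0.67 * math.pi * (0.8 * w + t)))
    d0 = 1.016 * math.sqrt(eps_r)
    return z0, d0


if __name__ == "__main__":
    # Hypothetical board: eps_r = 4.5, dimensions in inches.
    print(microstrip(eps_r=4.5, h=0.060, w=0.100, t=0.0014))
    print(stripline(eps_r=4.5, b=0.020, w=0.006, t=0.0014))
```

For the same dielectric the stripline delay per foot is larger than the microstrip delay, because the stripline field is entirely inside the dielectric.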
Page Number: 79/101
The CPU Architecture
1. Deep Memory Pipelining:
Optimal memory pipelining depends on the ratio of off-chip and on-chip delays, plus
many other factors. Therefore, precise input from DP and CD people was crucial.
Unfortunately, these data were not quite known at the design time, and some solutions
(e.g. PC-stack) had to work for various levels of the pipeline depth.
2. Latency Stages:
One group of latency stages (WAIT) was associated with instruction fetch; the other
group was associated with operand load.
3. Four Basic Opcode Classes:
•ALU
•LOAD/STORE
•BRANCH
•COPROCESSOR
4. Register zero is hardwired to zero.
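Point 4 can be illustrated with a minimal register file model. The class below is a sketch, not the actual design: the class and method names are hypothetical, and the sizes (16 registers, 32 bits) are taken from the project goals listed earlier.

```python
class RegisterFile:
    """General-purpose register file in which register 0 always reads as zero."""

    def __init__(self, count=16, width=32):
        self._registers = [0] * count
        self._mask = (1 << width) - 1

    def read(self, index):
        # Register zero is hardwired: it reads as 0 regardless of any writes.
        return 0 if index == 0 else self._registers[index]

    def write(self, index, value):
        # Writes to register zero are silently discarded.
        if index != 0:
            self._registers[index] = value & self._mask


grf = RegisterFile()
grf.write(0, 1234)                       # has no effect
grf.write(3, 42)
grf.write(5, grf.read(3) + grf.read(0))  # "move r5, r3" expressed as an add with r0
```

A hardwired zero lets a small instruction set synthesize moves, clears, and compares from ordinary ALU instructions, which fits the limited gate budget.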
Page Number: 80/101
[Figure: silicon CPU (IR, GRF) with memory M, compared with GaAs CPU with memories M3, M6, and M9]
Page Number: 81/101
ALU CLASS
Page Number: 82/101
CATALYTIC MIGRATION
from the
RISC ENVIRONMENT
POINT-OF-VIEW
This research was sponsored by NCR
Page Number: 83/101
DEFINITION: DIRECT MIGRATION
Migration of an entire hardware resource into the system software.
EXAMPLES:
Pipeline interlock.
Branch delay control.
ESSENCE:
Examples that result in code* speed-up are very difficult to invent.
Page Number: 84/101
DELAYED CONTROL TRANSFER
I1 fetch
I1 execution
branch address calculation
branch target calculation
I2 fetch
I2 execution
I3 fetch
time →
Delayed Branch Scheme
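The delayed branch scheme can be mimicked with a tiny interpreter in which the instruction that follows a branch (the delay slot) is always executed before control is transferred. This is a minimal sketch with a single delay slot; the program encoding and the function name are assumptions made for illustration.

```python
def run_delayed_branch(program, max_steps=100):
    """Execute a program with one branch delay slot and return the trace.

    Each instruction is a tuple: ("op", name) or ("branch", target_index).
    The instruction right after a branch is executed before the jump.
    """
    trace = []
    pc = 0
    pending_target = None  # target of a branch whose delay slot is still to run
    steps = 0
    while pc < len(program) and steps < max_steps:
        kind, argument = program[pc]
        next_pc = pc + 1
        if pending_target is not None:
            # This instruction is the delay slot; the jump happens after it.
            next_pc = pending_target
            pending_target = None
        if kind == "branch":
            pending_target = argument
            trace.append("branch")
        else:
            trace.append(argument)
        pc = next_pc
        steps += 1
    return trace


program = [
    ("op", "I1"),
    ("branch", 4),
    ("op", "I2"),       # delay slot: executed even though the branch is taken
    ("op", "skipped"),
    ("op", "I3"),       # branch target
]
trace = run_delayed_branch(program)  # ['I1', 'branch', 'I2', 'I3']
```

The hardware no longer needs to squash the instruction fetched behind the branch; it is the compiler's job to place a useful instruction (or a nop) in the slot.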
Page Number: 85/101
DEFINITION: Catalytic Migration
Migration based on the utilization of a catalyst.
MIGRANT vs CATALYST
Figure 7.13. The catalytic migration concept. Symbols M, C, and P refer to the migrant, the catalyst, and the processor, respectively.
The acceleration, achieved by the extraction of a migrant of a relatively large VLSI area, is achieved after adding a catalyst of a
significantly smaller VLSI area.
ESSENCE:
Examples that result in code speed-up are much easier to invent.
Page Number: 86/101
METHODOLOGY:
Area estimation: Migrant
Area estimation: Catalyst
Real estate to invest: Difference
Investment
strategy:
R
Compile time algorithms
Analytical analysis
Simulation analysis
Implementational analysis
NOTE:
Before the reinvestment,
the migration may result in slow-down.
Page Number: 87/101
(N-2)*W vs DMA
Figure 7.16. An example of the DW (double windows) type of catalytic migration, (a) before the migration; (b) after the migration.
Symbol M refers to the main store. The symbol L-bit DMA refers to the direct memory access which transfers L bits in one
clock cycle. Symbol NW refers to the register file with N partially overlapping windows (as in the UCB-RISC processor), while
the symbol DW refers to the register file of the same type, only this time with two partially overlapping windows. The addition
of the L-bit DMA mechanism, in parallel to the execution using one window, enables the simultaneous transfer between the
main store and the window which is currently not in use. This enables one to keep the contents of the nonexistent N – 2
windows in the main store, which not only keeps the resulting code from slowing down, but actually speeds it up, because the
transistors released through the omission of N – 2 windows can be reinvested more appropriately.
Migrant: (N-2)*W
Catalyst: L-bit DMA
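The area bookkeeping behind this example (migrant estimate, catalyst estimate, difference for reinvestment) can be sketched as follows. All numbers are hypothetical and serve only to show the arithmetic; in particular the transistor count per register cell and the cost of the DMA mechanism are assumptions, not data from the slides.

```python
def double_window_budget(n_windows, regs_per_window, bits_per_reg,
                         transistors_per_cell, catalyst_transistors):
    """Estimate the transistor budget released by the DW migration.

    Migrant: the (N - 2) windows removed from the register file.
    Catalyst: the L-bit DMA mechanism added in exchange.
    """
    migrant = (n_windows - 2) * regs_per_window * bits_per_reg * transistors_per_cell
    return {
        "migrant": migrant,
        "catalyst": catalyst_transistors,
        "reinvest": migrant - catalyst_transistors,
    }


# Hypothetical figures: 8 windows of 16 registers, 32 bits, 6 transistors per cell,
# and a DMA mechanism assumed to cost 5000 transistors.
budget = double_window_budget(n_windows=8, regs_per_window=16, bits_per_reg=32,
                              transistors_per_cell=6, catalyst_transistors=5000)
```

The migration pays off only if the "reinvest" figure is positive and the released transistors are then spent on something that speeds up the code.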
Page Number: 88/101
i: load r1, MA{MEM – 6}
i + 1: load r2, MA{MEM – 3}
Figure 7.14. An example of catalytic migration: Type HW (hand walking): (a) before the migration; (b) after the migration. Symbols P
and GRF refer to the processor and the general-purpose register file, respectively. Symbols RA and MA refer to the register address and
the memory address in the load instruction. Symbol MEM – n refers to the main store which is n clocks away from the processor.
Addition of another bus for the register address eliminates a relatively large number of nop instructions (which have to separate the
interfering load instructions).
Page Number: 89/101
Figure 7.15. An example of catalytic migration: type II (ignore instruction): (a) before the migration; (b) after the migration. Symbol t
refers to time, and symbol UI refers to the useful instruction. This figure shows the case in which the code optimizer has successfully
eliminated only two nop instructions, and has inserted the ignore instruction, immediately after the last useful instruction. The addition
of the ignore instruction and the accompanying decoder logic eliminates a relatively large number of nop instructions, and speeds up
the code, through a better utilization of the instruction cache.
Page Number: 90/101
CODE INTERLEAVING
Figure 7.17. An example of the CI (code interleaving) catalytic migration: (a) before the migration; (b) after the migration. Symbols A
and B refer to the parts of the code in two different routines that share no data dependencies. Symbols GRF and SGRF refer to the
general purpose register file (GRF), and the subset of the GRF (SGRF). The sequential code of routine A is used to fill in the slots in
routine B, and vice versa. This is enabled by adding new registers (SGRF) and some additional control logic which is quite. The speedup is achieved through the elimination of nop instructions, and the increased efficiency of the instruction cache (a consequence of the
reduced code size).
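The slot-filling idea behind code interleaving can be shown with a short sketch. It covers only the scheduling step: the nop slots of one routine are filled, in order, with instructions of the other routine. The function name and the instruction strings are hypothetical, and the sketch takes for granted that the two routines share no data dependencies (the condition stated in the caption) and use separate register subsets.

```python
def fill_nops(host, filler):
    """Fill the nop slots of `host` with instructions taken in order from `filler`.

    Instructions of `filler` that do not fit into a slot are appended at the end.
    If `filler` runs out first, the remaining nops stay in place.
    """
    remaining = iter(filler)
    merged = []
    for instruction in host:
        if instruction == "nop":
            merged.append(next(remaining, "nop"))
        else:
            merged.append(instruction)
    merged.extend(remaining)
    return merged


routine_b = ["b1", "nop", "nop", "b2"]
routine_a = ["a1", "a2", "a3"]
merged = fill_nops(routine_b, routine_a)  # ['b1', 'a1', 'a2', 'b2', 'a3']
```

The separate routines occupy seven instruction slots (four plus three), the merged code five; the two removed nops are where the shorter code and the better instruction cache efficiency come from.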
Page Number: 91/101
CLASSIFICATION:
CM
ICM
C-+
ACM
C++
-+
++
EXAMPLES:
(N-2)*W vs DMA
RDEST BUS vs CFF
IGNORE
CODE INTERLEAVING
Page Number: 92/101
for i := 1 to N do:
1. MAE
2. CAE
3. DFR
4. RSD
5. CTA
6. AAP
7. AAC
8. SAP
9. SAC
10. SLL
end do
Figure 7.18. A methodological review of catalytic migration (intended for a detailed study of a new catalytic migration example).
Symbols S and R refer to the speed-up and the initial register count. Symbol N refers to the number of generated ideas. The meaning of
other symbols is as follows: MAE—migrant area estimate, CAE—catalyst area estimate, DFR—difference for reinvestment, RSD—
reinvestment strategy developed, CTA—compile-time algorithm, AAC—analytical analysis of the complexity, AAP—analytical
analysis of the performance, SAC—simulation analysis of the complexity, SAP—simulation analysis of the performance, SLL—
summary of lessons learned.
Page Number: 93/101
RISCs FOR NN: Core + Accelerators
Figure 8.1. RISC architecture with on-chip accelerators. Accelerators are labeled ACC#1, ACC#2, …, and they are placed in parallel
with the ALU. The rest of the diagram is the common RISC core. All symbols have standard meanings.
Page Number: 94/101
Figure 8.2. Basic problems encountered during the realization of a neural computer: (a) an electronic neuron; (b) an interconnection
network for a neural network. Symbol D stands for the dendrites (inputs), symbol S stands for the synapses (resistors), symbol N stands
for the neuron body (amplifier), and symbol A stands for the axon (output). The symbols , , , and stand for the input connections, and
the symbols , , , and stand for the output connections.
Page Number: 95/101
Figure 8.3. A system architecture with N-RISC processors as nodes. Symbol PE (processing element) represents one N-RISC, and
refers to “hardware neuron.” Symbol PU (processing unit) represents the software routine for one neuron, and refers to “software
neuron.” Symbol H refers to the host processor, symbol L refers to the 16-bit link, and symbol R refers to the routing algorithm based
on the MP (message passing) method.
Page Number: 96/101
Figure 8.4. The architecture of an N-RISC processor. This figure shows two neighboring N-RISC processors, on the same ring.
Symbols A, D, and M refer to the addresses, data, and memory, respectively. Symbols PLA (comm) and PLA (proc) refer to the PLA
logic for the communication and processor subsystems, respectively. Symbol NLR refers to the register which defines the address of the
neuron (name/layer register). Symbol refers to the only register in the N-RISC processor. Other symbols are standard.
Page Number: 97/101
Figure 8.5. Example of an accelerator for neural RISC: (a) a three-layer neural network; (b) its implementation based on the reference
[Distante91]. The squares in Figure 8.5.a stand for input data sources, and the circles stand for the network nodes. Symbols W in
Figure 8.5.b stand for weights, and symbols F stand for the firing triggers. Symbols PE refer to the processing elements. Symbols W
have two indices associated with them, to define the connections of the element (for example, and so on). The exact values of the
indices are left to the reader to determine, as an exercise. Likewise, the PE symbols have one index associated with them, to determine
the node they belong to. The exact values of these indices were also left out, so the reader should determine them, too.
Page Number: 98/101
Figure 8.6. VLSI layout for the complete architecture of Figure 8.5. Symbol T refers to the delay unit, while symbols IN and OUT
refer to the inputs and the outputs, respectively.
Page Number: 99/101
Figure 8.7. Timing for the complete architecture of Figure 8.5. Symbol t refers to time, symbol F refers to the moments of triggering,
and symbol P refers to the ordinal number of the processing element.
Page Number: 100/101
http://galeb.etf.bg.ac.yu/~vm/
e-mail: [email protected]
Page Number: 101/101