
An Introduction to
VLSI Processor Architecture
for GaAs
V. Milutinović
PURDUE UNIVERSITY
This research has been sponsored by RCA
and conducted in collaboration with
the RCA Advanced Technology Laboratories,
Moorestown, New Jersey.
Page Number: 1/47
Advantages
• For the same power consumption, at least half an order of magnitude faster than Silicon.
• Efficient integration of electronics and optics.
• Tolerant of temperature variations. Operating range: [−200°C, +200°C].
• Radiation hard. Several orders of magnitude more than Silicon: [>100 million rads].
Page Number: 2/47
Disadvantages:
• High density of wafer dislocations
→ Low yield → Small chip size → Low transistor count.
• Noise margin not as good as in Silicon.
→ Area has to be traded for higher reliability.
• At least two orders of magnitude more expensive than Silicon.
• Currently having problems with high-speed test equipment.
Page Number: 3/47
Basic differences of Relevance for Microprocessor Architecture
• Small area and low transistor count
(* in general, implications of this fact are dependent
on the speed of the technology *)
• High ratio of off-chip and on-chip delays
(* consequently, off-chip memory access is
much longer than on-chip memory access *)
• Limited fan-in and fan-out (?)
(* temporary differences *)
• High demand on efficient fault-tolerance (?)
(* to improve the yield for bigger chips *)
Page Number: 4/47
A Brief Look Into the GaAs IC Design
•Bipolar (TI + CDC)
•JFET (McDAC)
•GaAs MESFET Logic Families (TriQuint + RCA)
D-MESFET
(* Depletion Mode *)
E-MESFET
(* Enhancement Mode *)
Page Number: 5/47
                                          Speed (ns)   Dissipation (W)   Complexity (K transistors)
Arithmetic
  32-bit adder (BFL D-MESFET)             2.9 total    1.2               2.5
  16×16-bit multiplier (DCFL E/D MESFET)  10.5 total   1.0               10.0
Control
  1K gate array (STL HBT)                 0.4/gate     1.0               6.0
  2K gate array (DCFL E/D MESFET)         0.08/gate    0.4               8.2
Memory
  4Kbit SRAM (DCFL E/D MODFET)            2.0 total    1.6               26.9
  16K SRAM (DCFL E/D MESFET)              4.1 total    2.5               102.3
Figure 7.1. Typical (conservative) data for speed, dissipation, and complexity of digital GaAs
chips.
Page Number: 6/47
                                    GaAs                Silicon       Silicon       Silicon          Silicon
                                    (1 µm E/D-MESFET)   (2 µm NMOS)   (2 µm CMOS)   (1.25 µm NMOS)   (2 µm ECL)
Complexity
  On-chip transistor count          40K                 200K          200K          400K             40K (T or R)
Speed
  Gate delay (minimal fan-out)      50-150 ps           1-3 ns        800-1000 ps   500-700 ps       150-200 ps
  On-chip memory access
    (32×32-bit capacity)            0.5-2.0 ns          20-40 ns      10-20 ns      5-10 ns          2-3 ns
  Off-chip, on-package memory
    access (256×32 bits)            4-8 ns              40-80 ns      30-40 ns      20-30 ns         6-10 ns
  Off-package memory access
    (1k×32 bits)                    10-50 ns            100-200 ns    60-100 ns     40-80 ns         20-80 ns
Figure 7.2. Comparison (conservative) of GaAs and silicon, in terms of complexity and speed of the chips (assuming equal
dissipation). Symbols T and R refer to the transistors and the resistors, respectively. For silicon ECL technology, the
complexity figure gives the transistor count increased by the resistor count.
Page Number: 7/47
                                 GaAs E/D-DCFL    Silicon SOS-CMOS
Minimal geometry                 1 µm             1.25 µm
Levels of metal                  2                2
Gate delay                       250 ps           1.25 ns
Maximum fan-in                   5 NOR, 2 AND     4 NOR, 4 NAND
Maximum fan-out                  4                20
Noise immunity level             220 mV           1.5 V
Average gate transistor count    4.5              7
On-chip transistor count         25 000           100 000-150 000
Figure 7.3. Comparison of GaAs and silicon, in the case of actual 32-bit microprocessor implementations (courtesy of RCA).
The impossibility of implementing “phantom” logic (wired-OR) is a consequence of the low noise immunity of GaAs circuits
(200 mV).
Page Number: 8/47
Figure 7.4. Processor organization based on the BS (bit-slice) components. The meaning of symbols is as follows: IN—
input, BUFF—buffer, MUX—multiplexer, DEC—decoder, L—latch, OUT—output. The remaining symbols are standard.
Page Number: 9/47
Figure 7.5. Processor organization based on the FS (function slice) components: IM—instruction memory, I_D_U—
instruction decode unit, DM_I/O_U—data memory input/output unit, DM—data memory.
Page Number: 10/47
Implication of the High Off/On Ratio
On the Choice of Processor Design Philosophy
Only a single-chip reduced architecture makes sense!
In a Silicon environment, we can argue “RISC” or “CISC”.
In a GaAs environment, there is only one choice: “RISC”.
However, the RISC concept has to be significantly modified
for efficient GaAs utilization.
Page Number: 11/47
The Information Bandwidth Problem of GaAs
Assume a 10:1 advantage in on-chip switching speed, but
only a 3:1 advantage in off-chip/off-package memory access.
Will the microprocessor be 10 times faster?
Or only 3 times faster?
Why the Information Bandwidth Problem?
The Reduced Philosophy:
• Large register file
• Most or all on-chip memory is used for the register file
• On-chip instruction cache is out of the question
→ Instruction fetch must be from an off-chip environment
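A rough back-of-the-envelope model makes the point (the 50/50 split below is an assumed workload, not a measurement). If a fraction f_on of the silicon machine's time goes to on-chip work (sped up 10×) and a fraction f_off to off-chip accesses (sped up only 3×), the effective speed-up is

$$ S_{\mathrm{eff}} = \frac{1}{\dfrac{f_{\mathrm{on}}}{10} + \dfrac{f_{\mathrm{off}}}{3}} $$

For f_on = f_off = 0.5 this gives S_eff ≈ 4.6, far closer to the 3:1 off-chip figure than to the 10:1 on-chip figure — which is exactly the information bandwidth problem.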
Page Number: 12/47
Applications for GaAs Microprocessor
• General purpose processing in defense and aerospace,
and execution of compiled HLL code.
• General purpose processing and substitution
of current CISC microprocessors.*
• Dedicated special-purpose applications
in digital control and signal processing.*
• Multiprocessing of the SIMD/MIMD type,
for numeric and symbolic applications.
Page Number: 13/47
Which Design Issues Are Affected?
On-chip issues:
•Register file
•ALU
•Pipeline organization
•Instruction set
Off-chip issues:
•Cache
•Virtual memory management
•Coprocessing
•Multiprocessing
System software issues:
•Compilation
•Code optimization
Page Number: 14/47
Adder Design
Figure 7.6. Comparison of GaAs and silicon. Symbols CL and RC refer to the basic adder types (carry-lookahead and ripple-carry).
Symbol B refers to the word size.
a) Complexity comparison. Symbol C[tc] refers to complexity, expressed in transistor count.
b) Speed comparison. Symbol D[ns] refers to propagation delay through the adder, expressed in nanoseconds. In the case
of silicon technology, the CL adder is faster when the word size exceeds four bits (or a somewhat lower number, depending on the
diagram in question). In the case of GaAs technology, the RC adder is faster for word sizes up to n bits (the actual value of n
depends on the actual GaAs technology used).
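A hedged way to see why the crossover moves (the delay parameters below are assumptions for illustration, not measured values): ripple-carry delay grows linearly with the word size B, while carry-lookahead delay grows roughly logarithmically but with a larger per-stage constant,

$$ D_{RC}(B) \approx B\,\tau_c, \qquad D_{CL}(B) \approx c\,\lceil\log_4 B\rceil\,\tau_g . $$

In silicon, τ_g is comparable to τ_c and the lookahead adder wins beyond about four bits. In GaAs, the limited fan-in (at most 5-input NOR gates) forces each wide lookahead gate to be decomposed into a tree of small gates, inflating τ_g relative to τ_c and pushing the crossover out to the larger word size n mentioned above.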
Page Number: 15/47
Figure 7.7. Comparison of GaAs and silicon technologies: an example of the bit-serial adder. All symbols
have their standard meanings.
Page Number: 16/47
Register File Design
a)
b)
Figure 7.8. Comparison of GaAs and silicon technologies: design of the register cell: (a) an example of the register cell frequently used
in the silicon technology; (b) an example of the register cell frequently used in the GaAs microprocessors. Symbol BL refers to the
unique bit line in the four-transistor cell. Symbols A BUS and B BUS refer to the double bit lines in the seven-transistor cell. Symbol F
refers to the refresh input. All other symbols have their standard meanings.
Page Number: 17/47
Pipeline design
Figure 7.9. Comparison of GaAs and silicon technologies: pipeline design—a possible design error: (a) two-stage pipeline
typical of some silicon microprocessors; (b) the same two-stage pipeline when the off-chip delays are three times longer
than the on-chip delays (the off-chip delays are the same as in the silicon version).
Symbols IF and DP refer to the instruction fetch and the ALU cycle (datapath). Symbol T refers to time.
Page Number: 18/47
Figure 7.10. Comparison of GaAs and silicon technologies: pipeline design—possible solutions; (a1) timing diagrams of a pipeline
based on the IM (interleaved memory) or the MP (memory pipelining); (a2) a system based on the IM approach; (a3) a system based
on the MP approach; (b) timing diagram of the pipeline based on the IP (instruction packing) approach. Symbols P, M, and MM refer
to the processor, the memory, and the memory module, respectively. The other symbols were defined earlier.
Page Number: 19/47
32-bit
GaAs MICROPROCESSORS
Goals and project requirements:
•200 MHz clock rate
•32-bit parallel data path
•16 general purpose registers
•Reduced Instruction Set Computer (RISC) architecture
•24-bit word addressing
•Virtual memory addressing
•Up to four coprocessors connected to the CPU
(Coprocessors can be of any type and all different)
References:
1. Milutinović, V. (editor), “Special Issue on GaAs Microprocessor Technology,” IEEE Computer, October 1986.
2. Helbig, W., Milutinović, V., “The RCA DCFL E/D-MESFET GaAs Experimental RISC Machine,” IEEE Transactions on Computers, December 1988.
Page Number: 20/47
MICROPROCESSORS
Darpa eyes 100-mips GaAs chip for Star Wars
Palo Alto
For its Star Wars program, the Department of Defense intends to push well beyond the current
limits of technology. And along with lasers and particle beams, one piece of hardware it has in
mind is a microprocessor chip having as much computing power as 100 of Digital Equipment
Corp.’s VAX-11/780 superminicomputers.
One candidate for the role of basic computing engine for the program, officially called the
Strategic Defense Initiative [ElectronicsWeek, May 13, 1985, p. 28], is a gallium arsenide
version of the Mips reduced-instruction-set computer (RISC) developed at Stanford University.
Three teams are now working on the processor. And this month, the Defense Advanced
Research Projects Agency closed the request-for-proposal (RFP) process for a 1.25-µm silicon
version of the chip.
Last October, Darpa awarded three contracts for a 32-bit GaAs microprocessor and a floating-point coprocessor. One went to McDonnell Douglas Corp., another to a team formed by Texas
Instruments Inc. and Control Data Corp., and the third to a team from RCA Corp. and
Tektronix Inc. The three are now working on processes to get useful yields. After a year, the
program will be reduced to one or two teams. Darpa’s target is to have a 10,000-gate GaAs
chip by the beginning of 1988.
If it is as fast as Darpa expects, the chip will be the basic engine for the Advanced Onboard
Signal Processor, one of the baseline machines for the SDI. “We went after RISC because we
needed something small enough to put on GaAs,” says Sheldon Karp, principal scientist for
strategic technology at Darpa. The agency had been working with the Motorola Inc. 68000
microprocessor, but Motorola wouldn’t even consider trying to put the complex 68000 onto
GaAs, Karp says.
A natural. The Mips chip, which was originally funded by Darpa, was a natural for GaAs. “We
have only 10,000 gates to work with,” Karp notes. “And the Mips people had taken every
possible step to reduce hardware requirements. There are no hardware interlocks, and only 32
instructions.”
Even 10,000 gates is big for GaAs; the first phase of the work is intended to make sure that the
RISC architecture can be squeezed into that size at respectable yields, Karp says.
Mips was designed by a group under John Hennessy at Stanford. Hennessy, who has worked
as a consultant with Darpa on the SDI project, recently took the chip into the private sector by
forming Mips Computer Systems of Mountain View, Calif. [ElectronicsWeek, April 29, 1985,
p. 36]. Computer-aided-design software came from the Mayo Clinic in Rochester, Minn.
The GaAs chip
will be clocked at 200 MHz,
the silicon at 40 MHz
The silicon Mips chip will come from a two-year effort using the 1.25-µm design rules
developed for the Very High Speed Integrated Circuit program. (The Darpa chip was not made
part of VHSIC in order to open the RFP to contractors outside that program.)
Both the silicon and GaAs microprocessors will be full 32-bit engines sharing 90% of a common
instruction core. Pascal and Air Force 1750A compilers will be targeted for the core instruction
set, so that all software will be interchangeable.
The GaAs requirement specifies a clock frequency of 200 MHz and a computation rate of 100
million instructions per second. The silicon chip will be clocked at 40 MHz.
Eventually, the silicon chip must be made radiation-hard; the GaAs chip will be intrinsically rad-hard.
Darpa will not release figures on the size of its RISC effort. The silicon version is being funded
through the Air Force’s Rome Air Development Center in Rome, N.Y.
–Clifford Barney
Page Number: 21/47
System software
1. Core-MIPS translators
2. Compilers
MC680x0+1750A
C + Pascal + Ada
Technology Limitations
1. Constraints on logic functions:
a. No NAND gates.
b. NOR gates can have up to 5 inputs, plus a 2-input AND on each of the 5 inputs.
2. Power Levels: High, Reference, and Low:
a. Circuits are always drawing current, and the number of such circuits on a die is severely limited, due to
power consumption.
b. If smaller-size devices are used, the circuit's internal impedance becomes higher, it needs less power to
operate, and the number of circuits on a die increases, but the fan-out gets severely limited.
Three different circuit types exist in the standard cell library. These represent the trade-offs between
power, area, and fan-out.
Page Number: 22/47
3. The outputs of two circuits cannot be tied together:
a. One cannot utilize phantom logic on the chip, to implement functions like WIRED-OR
(all outputs active). Circuits have a low “operating noise margin”.
b. One cannot use three-state logic on the chip, to implement functions
like MULTIPLE-SOURCE-BUS (only one output active). Circuits have no “off-state”.
c. Actually, if one insists on having a MULTIPLE-SOURCE-BUS on the chip,
one can have it at the cost of only one active load and the need to precharge
(both mean “constraints” and “slowdown” on the architecture level).
d. Fortunately, the logic function AND-OR is exactly what is needed to create
a multiplexer - a perfect replacement for a bus.
Page Number: 23/47
[Diagram: a two-to-one multiplexer (MUX)]
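A minimal C sketch of this replacement (purely illustrative; the gate modeled is the NOR-with-pre-ANDed-inputs primitive from the standard cell library described above, and the two-stage construction is an assumption of this example, not the actual cell design):

#include <stdint.h>
#include <stdio.h>

/* Model of the assumed gate primitive: a NOR whose inputs may each be a
   2-input AND (the AND-NOR structure allowed by the standard cell library). */
static uint32_t and_nor2(uint32_t a0, uint32_t a1, uint32_t b0, uint32_t b1)
{
    return ~((a0 & a1) | (b0 & b1));      /* NOR of two 2-input ANDs */
}

/* Two-to-one multiplexer o = (a & c) | (b & ~c), built from two such gates:
   the first produces the inverted mux, the second (used as a plain inverter)
   restores the polarity.  No bus and no tied-together outputs are needed.   */
static uint32_t mux2(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t n = and_nor2(a, c, b, ~c);   /* ~((a & c) | (b & ~c)) */
    return and_nor2(n, n, n, n);          /* inversion             */
}

int main(void)
{
    uint32_t a = 0xAAAAAAAAu, b = 0x55555555u;
    /* selects a (c = all ones), then b (c = 0) */
    printf("%08X %08X\n", (unsigned)mux2(a, b, ~0u), (unsigned)mux2(a, b, 0u));
    return 0;
}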
Page Number: 24/47
e. Consequently, in hand-crafted areas (register file and barrel shifter), busses were used
(no need for multiple active loads, and time was not critical).
f. In standard-cell areas (all the rest), multiplexers were used.
Using multiplexers frequently resulted in extra functionality on the architecture level, simply
because it was cheaper to keep them than to exclude them.
Page Number: 25/47
a)
b)
Figure 7.11. The technological problems that arise from the usage of GaAs technology: (a) an example of the fan-out tree, which
provides a fan-out of four, using logic elements with the fan-out of two; (b) an example of the logic element that performs a two-to-one
one-bit multiplexing. Symbols a and b refer to data inputs. Symbol c refers to the control input. Symbol o refers to data output.
Page Number: 26/47
(a) Microstrip (MS):
$$ Z_0 = \frac{87}{\sqrt{\varepsilon_r + 1.41}}\,\ln\!\left(\frac{5.98\,H}{0.8\,W + T}\right), \qquad D_0 = 1.016\,\sqrt{0.475\,\varepsilon_r + 0.67}\ \ \text{ns/ft} $$
(b) Stripline (SL):
$$ Z_0 = \frac{60}{\sqrt{\varepsilon_r}}\,\ln\!\left(\frac{4\,B}{0.67\,\pi\,(0.8\,W + T)}\right), \qquad D_0 = 1.016\,\sqrt{\varepsilon_r}\ \ \text{ns/ft} $$
Figure 7.12. Some possible techniques for realization of PCBs (printed circuit boards): (a) the MS technique (microstrip); (b) the SL
technique (stripline).
Symbols D0 and Z0 refer to the signal delay and the characteristic impedance, respectively. The meaning of other symbols is defined in
earlier figures, or they have standard meanings.
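A small C helper that evaluates these two pairs of expressions (a sketch of the formulas above; the example dimensions and εr are assumed values chosen only for the demonstration):

#include <math.h>
#include <stdio.h>

/* Microstrip: Z0 = 87/sqrt(er + 1.41) * ln(5.98*H / (0.8*W + T)),
               D0 = 1.016 * sqrt(0.475*er + 0.67)   [ns/ft]           */
static void microstrip(double er, double W, double H, double T,
                       double *Z0, double *D0)
{
    *Z0 = 87.0 / sqrt(er + 1.41) * log(5.98 * H / (0.8 * W + T));
    *D0 = 1.016 * sqrt(0.475 * er + 0.67);
}

/* Stripline:  Z0 = 60/sqrt(er) * ln(4*B / (0.67*pi*(0.8*W + T))),
               D0 = 1.016 * sqrt(er)                 [ns/ft]           */
static void stripline(double er, double W, double B, double T,
                      double *Z0, double *D0)
{
    const double pi = 3.14159265358979;
    *Z0 = 60.0 / sqrt(er) * log(4.0 * B / (0.67 * pi * (0.8 * W + T)));
    *D0 = 1.016 * sqrt(er);
}

int main(void)
{
    double Z0, D0;
    /* assumed example values: er = 4.7 (FR-4), W = 10, H = B = 20, T = 1.4 (mils) */
    microstrip(4.7, 10.0, 20.0, 1.4, &Z0, &D0);
    printf("microstrip: Z0 = %.1f ohm, D0 = %.2f ns/ft\n", Z0, D0);
    stripline(4.7, 10.0, 20.0, 1.4, &Z0, &D0);
    printf("stripline:  Z0 = %.1f ohm, D0 = %.2f ns/ft\n", Z0, D0);
    return 0;
}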
Page Number: 27/47
The CPU Architecture
1. Deep Memory Pipelining:
Optimal memory pipelining depends on the ratio of off-chip and on-chip delays, plus
many other factors. Therefore, precise input from DP and CD people was crucial.
Unfortunately, these data were not quite known at the design time, and some solutions
(e.g. PC-stack) had to work for various levels of the pipeline depth.
2. Latency Stages:
One group of latency stages (WAIT) was associated with instruction fetch; the other
group was associated with operand load.
3. Four Basic Opcode Classes:
•ALU
•LOAD/STORE
•BRANCH
•COPROCESSOR
4. Register zero is hardwired to zero.
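A hedged C sketch of items 3 and 4 (the 2-bit class field and its position in the instruction word are assumptions made for illustration, not the actual RCA encoding):

#include <stdint.h>
#include <stdio.h>

/* Four basic opcode classes of the architecture. */
enum op_class { OP_ALU, OP_LOAD_STORE, OP_BRANCH, OP_COPROCESSOR };

/* 16 general-purpose registers; register zero is hardwired to zero. */
static uint32_t regs[16];

static uint32_t read_reg(unsigned r)
{
    return (r == 0) ? 0u : regs[r & 0xF];      /* r0 always reads as zero    */
}

static void write_reg(unsigned r, uint32_t v)
{
    if (r != 0)                                /* writes to r0 are discarded */
        regs[r & 0xF] = v;
}

/* Assumed encoding: a 2-bit class field in the top bits of the 32-bit word. */
static enum op_class decode_class(uint32_t insn)
{
    return (enum op_class)(insn >> 30);
}

int main(void)
{
    write_reg(0, 0xFFFFFFFFu);                 /* has no effect              */
    write_reg(1, 42u);
    printf("r0=%u r1=%u class=%d\n",
           read_reg(0), read_reg(1), (int)decode_class(0x40000000u));
    return 0;
}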
Page Number: 28/47
[Diagram: Silicon CPU with IR, M, and GRF vs. GaAs CPU with M3, M6, and M9]
Page Number: 29/47
ALU CLASS
Page Number: 30/47
CATALYTIC MIGRATION
from the
RISC ENVIRONMENT
POINT-OF-VIEW
Veljko Milutinović
UNIVERSITY of BELGRADE
This research was sponsored by NCR
Page Number: 31/47
DEFINITION: DIRECT MIGRATION
Migration of an entire hardware resource into the system software.
EXAMPLES:
Pipeline interlock.
Branch delay control.
ESSENCE:
Examples that result in code* speed-up are very difficult to invent.
Page Number: 32/47
DELAYED CONTROL TRANSFER
I1 fetch
I1 execution
branch address calculation
branch target calculation
I2 fetch
I2 execution
I3 fetch
time →
Delayed Branch Scheme
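To make the software side of this migration concrete, here is a toy reorganizer in C that fills the single branch delay slot with the nearest independent preceding instruction, falling back to a nop. The instruction representation and the scheduling heuristic are assumptions of this sketch, not the actual compiler described in the text:

#include <stdio.h>

/* Toy instruction: an opcode string, one destination and two sources (-1 = unused). */
struct insn { const char *op; int dst, src1, src2; int is_branch; };

/* True if instruction b reads the result of instruction a. */
static int depends(const struct insn *a, const struct insn *b)
{
    return a->dst >= 0 && (b->src1 == a->dst || b->src2 == a->dst);
}

/* Fill the delay slot of each branch with the instruction just before it,
   if that instruction is independent of the branch; otherwise insert a nop. */
static int fill_delay_slots(const struct insn *in, int n, struct insn *out)
{
    static const struct insn nop = { "nop", -1, -1, -1, 0 };
    int m = 0;
    for (int i = 0; i < n; i++) {
        out[m++] = in[i];
        if (!in[i].is_branch)
            continue;
        if (m >= 2 && !out[m - 2].is_branch && !depends(&out[m - 2], &in[i])) {
            struct insn moved = out[m - 2];
            out[m - 2] = in[i];      /* branch issues first          */
            out[m - 1] = moved;      /* moved instruction fills slot */
        } else {
            out[m++] = nop;          /* no candidate: pay with a nop */
        }
    }
    return m;
}

int main(void)
{
    const struct insn prog[] = {
        { "add r1,r2,r3", 1, 2, 3, 0 },
        { "sub r4,r5,r6", 4, 5, 6, 0 },
        { "beq r1,r7,L", -1, 1, 7, 1 },
    };
    struct insn out[8];
    int m = fill_delay_slots(prog, 3, out);
    for (int i = 0; i < m; i++)
        printf("%s\n", out[i].op);   /* prints: add, beq, sub (sub fills the slot) */
    return 0;
}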
Page Number: 33/47
DEFINITION: Catalytic Migration
Migration based on the utilization of a catalyst.
MIGRANT vs CATALYST
Figure 7.13. The catalytic migration concept. Symbols M, C, and P refer to the migrant, the catalyst, and the processor, respectively.
The acceleration, obtained by extracting a migrant of a relatively large VLSI area, is achieved after adding a catalyst of a
significantly smaller VLSI area.
ESSENCE:
Examples that result in code speed-up are much easier to invent.
Page Number: 34/47
METHODOLOGY:
• Area estimation: Migrant
• Area estimation: Catalyst
• Real estate to invest: Difference
• Investment strategy:
  – Compile-time algorithms
  – Analytical analysis
  – Simulation analysis
  – Implementational analysis
NOTE:
Before the reinvestment,
the migration may result in slow-down.
Page Number: 35/47
(N-2)*W vs DMA
Figure 7.16. An example of the DW (double windows) type of catalytic migration, (a) before the migration; (b) after the migration.
Symbol M refers to the main store. The symbol L-bit DMA refers to the direct memory access which transfers L bits in one
clock cycle. Symbol NW refers to the register file with N partially overlapping windows (as in the UCB-RISC processor), while
the symbol DW refers to the register file of the same type, only this time with two partially overlapping windows. The addition
of the L-bit DMA mechanism, in parallel to the execution using one window, enables the simultaneous transfer between the
main store and the window which is currently not in use. This enables one to keep the contents of the nonexistent N – 2
windows in the main store, which not only keeps the resulting code from slowing down, but actually speeds it up, because the
transistors released through the omission of N – 2 windows can be reinvested more appropriately.
Migrant: (N-2)*W
Catalyst: L-bit DMA
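For a feel of the trade, a purely illustrative calculation (N, W, and L below are assumed values, not figures from the text): with N = 8 windows of W = 16 registers, 32 bits each, the migrant frees roughly

$$ A_{\text{migrant}} \approx (N-2)\times W \times 32 = 6 \times 16 \times 32 = 3072\ \text{bits of register storage,} $$

while the catalyst costs only an L-bit DMA datapath (here L = 32) plus its control; the difference is the real estate available for reinvestment.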
Page Number: 36/47
i:      load r1, MA{MEM – 6}
i + 1:  load r2, MA{MEM – 3}
Figure 7.14. An example of catalytic migration: Type HW (hand walking): (a) before the migration; (b) after the migration. Symbols P
and GRF refer to the processor and the general-purpose register file, respectively. Symbols RA and MA refer to the register address and
the memory address in the load instruction. Symbol MEM – n refers to the main store which is n clocks away from the processor.
Addition of another bus for the register address eliminates a relatively large number of nop instructions (which have to separate the
interfering load instructions).
Page Number: 37/47
Figure 7.15. An example of catalytic migration: type II (ignore instruction): (a) before the migration; (b) after the migration. Symbol t
refers to time, and symbol UI refers to the useful instruction. This figure shows the case in which the code optimizer has successfully
eliminated only two nop instructions, and has inserted the ignore instruction, immediately after the last useful instruction. The addition
of the ignore instruction and the accompanying decoder logic eliminates a relatively large number of nop instructions, and speeds up
the code, through a better utilization of the instruction cache.
Page Number: 38/47
CODE INTERLEAVING
Figure 7.17. An example of the CI (code interleaving) catalytic migration: (a) before the migration; (b) after the migration. Symbols A
and B refer to the parts of the code in two different routines that share no data dependencies. Symbols GRF and SGRF refer to the
general purpose register file (GRF), and the subset of the GRF (SGRF). The sequential code of routine A is used to fill in the slots in
routine B, and vice versa. This is enabled by adding new registers (SGRF) and some additional control logic, which is quite simple.
The speed-up is achieved through the elimination of nop instructions, and the increased efficiency of the instruction cache
(a consequence of the reduced code size).
Page Number: 39/47
APPLICATION:
1. Technologies with small on-chip transistor count.
The larger the ratio of off-chip to on-chip delays,
the better it works.
2. Technologies with dissipation-related limitations.
The larger the dissipation costs, the better it works.
EXAMPLES:
Page Number: 40/47
CLASSIFICATION:
[Classification diagram: CM (catalytic migration) divided into ICM and ACM]
EXAMPLES:
(N-2)*W vs DMA
RDEST BUS vs CFF
IGNORE
CODE INTERLEAVING
Page Number: 41/47
for i := 1 to N do:
  1. MAE
  2. CAE
  3. DFR
  4. RSD
  5. CTA
  6. AAP
  7. AAC
  8. SAP
  9. SAC
  10. SLL
end do
Figure 7.18. A methodological review of catalytic migration (intended for a detailed study of a new catalytic migration example).
Symbols S and R refer to the speed-up and the initial register count. Symbol N refers to the number of generated ideas. The meaning of
other symbols is as follows: MAE—migrant area estimate, CAE—catalyst area estimate, DFR—difference for reinvestment, RSD—
reinvestment strategy developed, CTA—compile-time algorithm, AAC—analytical analysis of the complexity, AAP—analytical
analysis of the performance, SAC—simulation analysis of the complexity, SAP—simulation analysis of the performance, SLL—
summary of lessons learned.
Page Number: 42/47
RISCs FOR NN: Core + Accelerators
Figure 8.1. RISC architecture with on-chip accelerators. Accelerators are labeled ACC#1, ACC#2, …, and they are placed in parallel
with the ALU. The rest of the diagram is the common RISC core. All symbols have standard meanings.
Page Number: 43/47
Figure 8.2. Basic problems encountered during the realization of a neural computer: (a) an electronic neuron; (b) an interconnection
network for a neural network. Symbol D stands for the dendrites (inputs), symbol S stands for the synapses (resistors), symbol N stands
for the neuron body (amplifier), and symbol A stands for the axon (output). The remaining symbols stand for the input and output
connections, respectively.
Page Number: 44/47
Figure 8.3. A system architecture with N-RISC processors as nodes. Symbol PE (processing element) represents one N-RISC, and
refers to “hardware neuron.” Symbol PU (processing unit) represents the software routine for one neuron, and refers to “software
neuron.” Symbol H refers to the host processor, symbol L refers to the 16-bit link, and symbol R refers to the routing algorithm based
on the MP (message passing) method.
Page Number: 45/47
Figure 8.4. The architecture of an N-RISC processor. This figure shows two neighboring N-RISC processors, on the same ring.
Symbols A, D, and M refer to the addresses, data, and memory, respectively. Symbols PLA (comm) and PLA (proc) refer to the PLA
logic for the communication and processor subsystems, respectively. Symbol NLR refers to the register which defines the address of the
neuron (name/layer register). Symbol refers to the only register in the N-RISC processor. Other symbols are standard.
Page Number: 46/47
Figure 8.5. Example of an accelerator for neural RISC: (a) a three-layer neural network; (b) its implementation based on the reference
[Distante91]. The squares in Figure 8.5.a stand for input data sources, and the circles stand for the network nodes. Symbols W in
Figure 8.5.b stand for weights, and symbols F stand for the firing triggers. Symbols PE refer to the processing elements. Symbols W
have two indices associated with them, to define the connections of the element (for example, and so on). The exact values of the
indices are left to the reader to determine, as an exercise. Likewise, the PE symbols have one index associated with them, to determine
the node they belong to. The exact values of these indices were also left out, so the reader should determine them, too.
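For reference, a minimal C sketch of what one processing element computes — a weighted sum of its inputs followed by a firing decision; the weights and threshold are placeholders, not values from [Distante91]:

#include <stdio.h>

#define NUM_INPUTS 3

/* One processing element: weighted sum of inputs, fire if above threshold. */
static int fires(const double w[NUM_INPUTS], const double in[NUM_INPUTS],
                 double threshold)
{
    double sum = 0.0;
    for (int i = 0; i < NUM_INPUTS; i++)
        sum += w[i] * in[i];            /* synapse weights W */
    return sum > threshold;             /* firing trigger F  */
}

int main(void)
{
    const double w[NUM_INPUTS]  = { 0.5, -0.25, 1.0 };   /* placeholder weights */
    const double in[NUM_INPUTS] = { 1.0,  1.0,  0.0 };   /* placeholder inputs  */
    printf("PE fires: %d\n", fires(w, in, 0.2));
    return 0;
}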
Page Number: 47/47
Figure 8.6. VLSI layout for the complete architecture of Figure 8.5. Symbol T refers to the delay unit, while symbols IN and OUT
refer to the inputs and the outputs, respectively.
Page Number: 48/47
Figure 8.7. Timing for the complete architecture of Figure 8.5. Symbol t refers to time, symbol F refers to the moments of triggering,
and symbol P refers to the ordinal number of the processing element.
Page Number: 49/47