Transcript 1i - MariaM

Mariam Hoseini
Advisor: Dr. Chao You
Supervisor: Dr. Mark Pavicic
Committee members:
Dr. Rajendra Katti, Dr. Subbaraya Yuvarajan, Dr. Deying Li
North Dakota State University
April 2009
• Conformal Computing
• Asynchronous circuit design
– Handshake protocols
– Data encodings
– Signaling protocol
– Asynchronous design methodologies
– Asynchronous primitives
• Constructing an array of cells
• PCC cell design and simulations
• Conclusion
• Computers are typically rigid boards or boxes with a fixed
computational capability.
• The available computers may have the undesired size or shape,
or have less computing capability than is needed.
• The program investigates a more flexible form of computer which
easily conforms to the physical and computational needs of an
application.
• Potential applications:
– Sorting, cryptography, cellular neural nets, etc
– The computational material can be integrated with arrays of sensors
and/or actuators
• Potential problems:
– Easily changing the physical shape of the computer
– Adjusting the computational capability
– Propagation delays, synchronization, power distribution, and heat
dissipation.
• One approach is:
– To form extensible arrays of simple reconfigurable computing
elements (cells) into thin wallpaper-like sheets.
– Long signal wires are eliminated.
– Communications are local and synchronized with cell to cell pulses.
• This research presents a cell design, called a pulsed conformal
computer cell (PCC cell).
• PCC cell has significant similarities to cellular automata (CA):
– Simple fine-grained elements,
– Integration of processing and storage,
– Local communication
• CA can model the elements of digital computers, using patterns of cells to
perform the functions of wires, logic, & registers
– The same model is used in the PCC cell design
• The function and connections of PCC cell are reconfigurable, similar to FPGAs.
– FPGAs are not as fine-grained
– FPGAs are not as regular
– The PCC cell array uses only short-range wires that connect adjacent cells
• Two major styles of circuit design: Synchronous & Asynchronous
• Advantages of asynchronous design, in terms of:
– Clock skew
– Speed
– Meta-stability
– Modularity
– Power
• Disadvantages of asynchronous design:
– More difficult to design for a hazard free behavior and a correct
ordering of operations.
– Additional hardware to initiate, advance, and indicate the completion of
operations.
• Asynchronous systems are specified by a handshake protocol, a data
encoding, and an underlying delay model.
• Handshaking is the alternative to clocking in asynchronous systems.
• Data transfer between two processes is synchronized with signals
that are generated by the same processes.
• Asynchronous operation can also be done without handshaking.
– Handshaking is used to separate successive uses of a component.
– It may not be necessary to separate the use of a component or the separation
can be done by delaying the operations.
• Handshaking can be done at higher levels in an asynchronous
system.
• Bundled data:
– Normal Boolean levels encode data values
– Separate request and acknowledge wires are used
• Dual rail:
– Two wires are used to carry a single bit
– Request wire is encoded in dual rail data wires
Dual rail encoding    Meaning
00                    No data
01                    0
10                    1
11                    Forbidden
– Dual rail data encoding is used in PCC cell design
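
The dual-rail table above can be sketched in a few lines of Python. This is only an illustration of the encoding, not the PCC cell circuit; the function names `encode`, `decode`, and `is_valid` are hypothetical, and a wire pair is represented as a two-character string.

```python
# Dual-rail code from the table above: each bit uses two wires.
# A wire pair is written as a two-character string, e.g. "01".
NO_DATA = "00"

def encode(bit):
    """Return the dual-rail pair for a bit, or NO_DATA for None."""
    if bit is None:
        return NO_DATA
    return "10" if bit else "01"

def decode(pair):
    """Return 0, 1, or None (no data); reject the forbidden pair."""
    if pair == "11":
        raise ValueError("forbidden dual-rail code 11")
    return {"00": None, "01": 0, "10": 1}[pair]

def is_valid(pair):
    """A pair carries data when exactly one wire is high.
    This is how the request is implied by the data wires themselves."""
    return pair in ("01", "10")

print([encode(b) for b in (0, 1, None)])  # ['01', '10', '00']
```

Because validity can be read from the data wires alone, no separate request wire is needed, which matches the "request wire is encoded in dual rail data wires" point above.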
• Pulse Signaling:
– Each request and acknowledge is a pulse
– Simple and small cycle like transition signaling
– Dealing with levels like level signaling
– Better noise immunity than single-track signaling
– Potential problem: robustness of sending pulses over long wires.
[Figure: one cycle of pulse signaling, with a start event on Request and a done event on Acknowledge]
– Pulse signaling is used in the PCC cell design, where all wires are short,
so the long-wire problem does not arise.
• Bounded delay
– Simplest model
– Delays of circuit elements and wires are assumed to be known or bounded.
• Delay insensitive (DI)
– Both gates and wires have unbounded and unknown delays.
– Completion detection mechanism is needed at receiver
• Quasi delay insensitive (QDI)
– DI + Isochronic forks = QDI
– Isochronic forks are capable of indication
– All input transitions should be indicated by an output signal transition
[Figure: fork with nodes A, B, and C and branch delays d1, d2, and d3]
• In an asynchronous system, interfaces and the internals of modules can
be designed with different timing models
• In the PCC cell design, for timing management:
– The internals of a cell are governed by a bounded delay model
– Communication between cells follows a QDI model
• In synchronous systems, Boolean circuits can be constructed from
a primitive like a NAND-gate
• Logic gates provide only logic functionality, not timing functionality,
so they are not sufficient to build asynchronous circuits
• Asynchronous systems can be made from a set of primitives
• The set of primitives must provide both universal logic and timing
functionalities
• Different sets of primitives have been introduced, such as Keller’s,
Patra’s, and Lee’s
The set of primitives used in a PCC cell:
• Wire
– Transfers the output of one component to the input of another.
• Fork
– The output of one component is the input to several components.
• Merge
– Sends one of its inputs to the output.
• Join
– Data from several independent components must be synchronized.
[Figure: symbols of the four primitives: Wire (I, O), Fork (I, O1, O2), Merge (I1, I2, O), Join (I1, I2, O)]
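
As a rough behavioral sketch of the four primitives (not the circuit-level design), a pulse can be modeled as a value of 0 or 1, with `None` meaning that no pulse has arrived. The function names and this pulse representation are assumptions made for illustration only.

```python
# Behavioral sketch only: a "pulse" is a value (0 or 1); None means no pulse.

def wire(i):
    """Wire: pass the input through unchanged."""
    return i

def fork(i, n=2):
    """Fork: copy one input to n outputs."""
    return [i] * n

def merge(i1, i2):
    """Merge: forward whichever input carries a pulse.
    Assumes at most one input is active at a time."""
    return i1 if i1 is not None else i2

def join(i1, i2):
    """Join: produce an output only when both inputs have arrived."""
    if i1 is None or i2 is None:
        return None          # still waiting for the other input
    return (i1, i2)

print(fork(1))         # [1, 1]
print(merge(None, 0))  # 0
print(join(1, None))   # None
print(join(1, 0))      # (1, 0)
```

Wire, Fork, and Merge only move pulses around; Join is the one primitive in this sketch that waits, which is the timing functionality that plain logic gates lack.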
• An array of cells each having a simple one-bit processing unit
• Von Neumann neighborhood for local connections
• A routing problem occurs:
• A possible solution:
• Another approach is to combine every two cells to make a double cell
– The same routing capability with fewer neighboring connections
• A further step is to group 4 cells together to make a quad cell
– The same routing capability with simple connections to 4 nearest
neighbors
• Logic Unit Design
• Synchronization
• Pulse Regenerator
• Top Level Design
• Configuration Circuitry
• PCC Cell Simulations
– One-bit full adder
– Ring oscillator
– Shift register
• Implementing Pipelines
• There is a logic unit (LU) and an output register in each quarter
• Each LU has two inputs and one output
• Dual rail inputs
• Dual rail outputs
• Switches should be set before the inputs arrive
• 8 switches to define a function
• 16 functions
• Pull-down resistors avoid floating nodes
• AND function
• D, E, F, G are “0001”
A    B    Z
0    0    0
0    1    0
1    0    0
1    1    1
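
The AND configuration above can be read as a four-entry lookup table. The sketch below assumes that the bits D, E, F, G correspond to the input combinations AB = 00, 01, 10, 11 in that order, which is consistent with the truth table but not stated explicitly on the slide. The function name `lu` is hypothetical, and the dual-rail switch network itself is not modeled.

```python
from itertools import product

def lu(config, a, b):
    """Look up output Z for inputs A, B in a 4-bit configuration string.
    Assumption: the bits D, E, F, G correspond to AB = 00, 01, 10, 11."""
    return int(config[2 * a + b])

AND = "0001"  # D, E, F, G from the slide

for a, b in product((0, 1), repeat=2):
    print(a, b, lu(AND, a, b))

# Four configuration bits give 2**4 = 16 possible two-input functions.
all_functions = ["".join(bits) for bits in product("01", repeat=4)]
print(len(all_functions))  # 16
```

Under the same assumption, other functions are obtained by changing only the configuration string, for example "0110" for XOR.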
• Wire: one output pulse triggers the LU inputs of the neighbor cell in the
same direction.
• Merge: realized by 2:1 muxes; pulses make right turns (90 degrees).
• Fork: each turn triggers a neighbor quarter and also a neighbor cell,
– a single computation forks into multiple parallel computations
Join
• A completion detection circuitry
• All the participating quarters should have their LU outputs ready
• Complements a fork by combining multiple parallel computations into a
single computation.
• QDI Communications
• Fork1
– Only when a pulse turns
– LU should use only the turned pulse
• Fork2 & Fork4
– No timing assumptions
• Fork3 & Fork5
– Bounded delay model
• When a pulse travels through many cells, the width of the pulse may
increase or decrease
• A pulse that is too short may not be detected at all; a pulse that is too
long may catch up with other pulses
• A pulse regenerator (PRG) produces an output pulse with a constant width,
independent of the width of the input pulse.
• D1 is the delay by which the input pulse is stretched
• D2 determines the width of the output pulse
[Figure: PRG circuit with delay elements D1 and D2 and nodes A, B, C, D, E]
In an inverter:
• Equivalent resistance of a MOS transistor: R ∝ L/W
• To match PMOS and NMOS resistances: (W/L)p / (W/L)n = 3 ~ 3.5
• tpHL = 0.69 * Rn * CL & tpLH = 0.69 * Rp * CL, so if Rn = Rp then tpHL = tpLH
• A bigger PMOS improves tpLH by increasing the charging current.
• A bigger PMOS degrades tpHL by causing a larger parasitic capacitance.
• With matched resistances, tp = (tpHL + tpLH)/2 is not minimal.
• The ratio for optimal speed equals √(Rp/Rn)
• The device can be sped up by reducing the size of the PMOS
In a PCC cell: (W/L)p / (W/L)n ≈ 1.6
• Configuration bits (16 bits for LU switches, 8 bits for Merge MUXs & 4 bits for
Join, i.e. a total of 28 bits) should be loaded
• Only some parts of the array may need to be configured
• One solution is to make a long chain of shift registers of all the cells &
configure all of them
• A better solution is to form the chain of shift registers only from the cells
that need to be configured.
• In each cell, a controller:
– decides whether the cell is to be configured or not
– directs the bit flow to one of the cell neighbors
– stops the shift registers whenever all the intended cells are configured
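
A quick way to see why the shorter chain is preferable is to count the bits that must be shifted in. The sketch below uses the 28 bits per cell given above; the array size and the number of cells to configure are hypothetical values chosen for illustration, and any extra controller bits are ignored because the slides do not quantify them.

```python
LU_BITS, MERGE_BITS, JOIN_BITS = 16, 8, 4
BITS_PER_CELL = LU_BITS + MERGE_BITS + JOIN_BITS  # 28

def bits_to_shift(cells_in_chain):
    """Number of bits shifted in to fill a chain of cells.
    Ignores any controller bits, which the slides do not quantify."""
    return cells_in_chain * BITS_PER_CELL

array_cells = 10 * 10      # hypothetical array size
needed_cells = 12          # hypothetical number of cells to configure

print(bits_to_shift(array_cells))   # 2800
print(bits_to_shift(needed_cells))  # 336
```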
[Figure: configuration circuitry with decoders and OR gates for the clock lines (clk-N, clk-W, clk-E, clk-S) and data lines (data-N, data-W, data-E, data-S), select codes 00, 01, 10, 11, and a controller]
Controller signals:
– Shows that the shift register is filled
– Shows that the cell is the last one in the chain of shift registers
– Determines whether the cell should be configured
– Defines the neighbor to which the bits should be forwarded
• PCC cell was implemented in TSMC 250 nm CMOS using S-Edit.
• The simulation was done with PSpice.
• The supply voltage is 5 V
• Input pulse widths are 400 ps
• Propagation delay through a cell is 480 ps ~ 500 ps.
• Better speed: Slope ≤ gate propagation delay
• Slope of the external inputs is 12 ps.
• No overshoots or undershoots
Voltage source =5V
Average current = 6 mA for 1.4 ns & 17 mA for 8.6 ns
For 20 pulses:
Energy = (5 * 6* 1.4) + (5 * 17 * 8.6) = 773 pJ
For 1 pulse (1-bit of operation):
Voltage source = 5 V, average current = 5 mA: Energy = 5 * 5 * 1.5 ns = 37.5 pJ
Voltage source = 3.3 V, average current = 3 mA: Energy = 3 * 3.3 * 1.8 ns = 17.8 pJ
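
The energy figures for 20 pulses and for a single pulse both follow from E = V * I * t. The short check below reproduces them; the helper name `energy_pj` is hypothetical, and it relies on the fact that volts times milliamps times nanoseconds gives picojoules.

```python
def energy_pj(volts, milliamps, nanoseconds):
    """E = V * I * t; with V, mA, and ns the result is in picojoules."""
    return volts * milliamps * nanoseconds

# 20 pulses at 5 V: 6 mA for 1.4 ns, then 17 mA for 8.6 ns
e20 = energy_pj(5, 6, 1.4) + energy_pj(5, 17, 8.6)
print(round(e20, 1))        # 773.0
print(round(e20 / 20, 2))   # 38.65 pJ per pulse on average

# Single pulse
print(round(energy_pj(5, 5, 1.5), 2))    # 37.5
print(round(energy_pj(3.3, 3, 1.8), 2))  # 17.82
```

The per-pulse average from the 20-pulse run (about 38.65 pJ) is close to the 37.5 pJ single-pulse figure at 5 V.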
• Sum = A ⊕ B ⊕ C
– 1 ⊕ 1 ⊕ 1 = 1
• Carry = AB + BC + AC = AB + (A+B)C
– 1·1 + (1+1)·1 = 1
• Sum & carry outputs are ready after 0.5 ns & 1.8 ns
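
The full-adder equations can be checked exhaustively in a few lines. This is a plain Boolean sketch, independent of how the function is mapped onto PCC cells; the function name `full_adder` is an illustrative choice.

```python
from itertools import product

def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry

for a, b, c in product((0, 1), repeat=3):
    s, carry = full_adder(a, b, c)
    # The factored form AB + (A+B)C gives the same carry.
    assert carry == (a & b) | ((a | b) & c)
    # Sum and carry together encode the arithmetic total.
    assert a + b + c == s + 2 * carry

print(full_adder(1, 1, 1))  # (1, 1)
```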
• Loops are important for many circuits such as sequential circuits,
iterative computations and For, If, and While constructs
• The ring oscillator demonstrates two capabilities of the PCC cell:
– A loop can be controlled externally (started & stopped)
– Utilizing Join of pulses, communications can be QDI
[Figure: ring oscillator started by a ‘0’ start pulse; the output is always a ‘1’]
• Ring oscillator implemented in an array of PCC cells
[Figure: array of PCC cells configured as One, One, XOR, WR, Nand, Pass, WR, One, Pass]
• ‘0’ pulses are shown in blue, ‘1’ pulses are shown in red
• The input Mux is configured to receive a ‘0’ pulse only from the external input
of the 1st cell and a ‘1’ pulse only from a turn.
Simulation results:
An input bit stream of “1010” is used.
Cell 1    Cell 2    Cell 3    Cell 4
D1        x         x         x
D2        D1        x         x
D3        D2        D1        x
D4        D3        D2        D1
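
The table above can be reproduced with a small behavioral model of a four-cell shift register. The sketch assumes that the bits of “1010” enter Cell 1 in the order they are written (so D1 is the leading ‘1’); the function name `shift_register` is hypothetical and the model says nothing about pulse timing.

```python
def shift_register(stream, n_cells=4):
    """Yield the register contents after each input bit is shifted in.
    'x' marks a cell that has not received data yet."""
    cells = ["x"] * n_cells
    for bit in stream:
        cells = [bit] + cells[:-1]   # new bit enters Cell 1, others move right
        yield list(cells)

for state in shift_register("1010"):
    print(" ".join(state))
```

After four shifts every cell holds one bit of the stream, with D1 in Cell 4.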
• If handshaking is done for every component, the components can form a pipeline.
• Each component should supply an Ack to indicate that it is available for re-use.
[Figure: pipeline of LU stages with an Ack returned at every stage]
Delay(1) = 3X + (n-2)5X + 3X = (5n - 4)X
• Some cells don’t handshake and are cascaded. The cascaded cells form one
unit of a pipeline, so handshaking is done only at a higher level.
[Figure: cascaded LUs grouped into pipeline units, with an Ack returned per unit]
Delay(2) = 3X + (n-2)2X + 3X = (2n + 2)X
Delay(2)/Delay(1) = (2n + 2)X / (5n - 4)X ≈ 2/5 for large n
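
The two delay formulas can be compared numerically to see how quickly the ratio approaches 2/5. The sketch below takes n and X as used on the slides and works in units of X; the function names `delay1` and `delay2` are hypothetical.

```python
from fractions import Fraction

def delay1(n):
    """Delay in units of X when every component handshakes."""
    return 3 + (n - 2) * 5 + 3      # = 5n - 4

def delay2(n):
    """Delay in units of X when cascaded cells form pipeline units."""
    return 3 + (n - 2) * 2 + 3      # = 2n + 2

for n in (2, 10, 100, 1000):
    ratio = Fraction(delay2(n), delay1(n))
    print(n, delay1(n), delay2(n), float(ratio))
```

For n = 2 the two schemes give the same delay, and the ratio falls toward 0.4 as n grows, so the saving applies to longer pipelines.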
PCC Cell
Technology                                TSMC 250 nm
Voltage source                            5 V (3.3 V)
Transistor count                          760
Propagation delay                         500 ps (600 ps)
Minimum input pulse width                 400 ps
Energy consumption for 1-bit operation    37.5 pJ (17.8 pJ)
Routing capability                        Data can be routed in 4 directions
QDI communications                        Yes, by performing Join
Performance                               Speed: very good; Energy: good; Area: average
Implementing comb/seq circuits            Yes
Controlling a loop externally             Yes
Implementing pipelines                    Yes
• Contribution:
– Utilizing asynchrony, reconfigurability, and the properties of CA to
make an extensible array with more regular and finer-grained cells
than those of FPGAs.
• Future work:
– Improving the performance of the cell in terms of area and thermal
management
• I express my deepest gratitude to my supervisors, Dr. Mark
Pavicic and Dr. Chao You.
• Gratitude is also due to my graduate committee, Dr. Rajendra
Katti, Dr. Subbaraya Yuvarajan, Dr. Deying Li.
• I express my love and gratitude to my beloved spouse,
Hamed.