Transcript ppt

Implementing An Associative
Processor on FPGAs
1
Cell Interconnection Network
A Conceptual View of the KSU ASC Model
Memory
PE
Memory
PE
Memory
PE
Memory
PE
Instruction
Stream
Control
Unit
2
An Example for the Data Memory Organization:
Auto Information Stored in the PE Cells
1
Burgundy
Focus
OH
170
…….
PE0
2
Blue
Taurus
OH
160
…….
PE1
3
Blue
Focus
OH
190
…….
PE2
4
Red
Focus
PA
180
ID
Color
Model
State
…….
PE3
Rebate
3
The Prototype of the Byte-serial ASC
Processor
IS Control Unit
Associative Processing Array
CPU
(for Sequential and
parallel instructions)
PE Array
32-bit
Instruction
Memory
Responder
Resolution
Circuitry
MAX/MIN
Circuitry
Data
Memory
16 8-bit Common
Registers
4
Prototype of the 4-PE Associative Processing Array
PE Cell 0
Data Memory0
At_Least_One_Responder
PE Cell 1
PE0
Responder
Resolution
Circuitry
PE Cell 2
Data Memory2
Data Memory1
PE1
MAX/MIN
Circuitry
PE Cell 3
PE2
Data Memory3
PE3
5
A Processing Element Overview
CarryOut
MUX
1bit
AL
U
MUX
Common
Registers
Logical Registers
General-Purpose
Registers
Comparator
16 8-bit
General
Purpose
Registers
16-deep
1-bit
Mask
Stack
1-bit
Responder
Register
16 1-bit
Logical
Registers
Find/Step
/ResolveFirst
6
Instruction Set and Assembling
Language (1)
• Data Transfer Instructions
- LD address, dstreg
- LDRR srcreg, dstreg
- ST srcreg, address
- LDI immediate, dstreg
- LDRRSPD srcreg
• Arithmetic and Logical Instructions
(mnemonic srcreg1, srcreg2, dstreg)
–
–
–
–
ADD
AND
SLL
SLT
SUB
OR XOR NOT
SRL
SLE SGT SGE SEQ
SNE
7
Instruction Set and Assembling
Language (2)
•
Mask Stack and Responder Instructions
–
–
–
–
–
–
–
–
–
SETMSK
TOPMSK
TOPMSKRSPD
POPMSK
POPMSKRSPD
POPTHEM
POPTHEMRSPD
RPCMSK
RPCMSKRSPD
PUSHMSK
PUSHMSKRSPD
PUSHTHEM
PUSHMSKTHEM
STKTOMEM
MEMTOSTK
–
–
–
FIND
STEP
RESFST
8
Instruction Set and Assembling
Language (3)
• Maximum and Minimum Searching
Instructions
–
–
–
–
–
SETMXMI
LDMXMI
STMXMI
MAX
MIN
• Branch/Jump Instructions
– BNR
– BRS
– J
9
Associative Operations
Related PE Components:
• The Responder Register:
to indicate whether a PE is a responder to a
particular associative search or not
• The Step/Find/ResolveFirst Unit:
to support processing multiple responders in various
ways
• The Mask Stack:
to represent at most 16 levels of association. The top
of the Mask Stack always represents the current
status of the PE – whether it is masked (‘1’) or
unmasked (‘0’)
10
Example of Associative Search:
Find all Focus cars located in Ohio
• Perform the comparison: model == “Focus”,
and store the result either ‘1’ or ‘0’ into $LR1
• Perform the comparison: location == “Ohio”,
and store the result into $LR2
• AND $LR1 with $LR2, and store the result into
the Responder Register
(Note: all the instructions above performed by all PEs in
parallel are called unmasked instructions)
11
Unmasked and Masked Instructions
 Unmasked Instruction:
Executed by all the PEs regardless of
the state of the Mask Stack
 Masked Instruction:
Executed only by those PEs with a ‘1’
on the top of their Mask Stack
12
Example of Associative Search Using
Masked Instructions:

•
•
•


(Find all Focus cars located in Ohio )
Initialize the top of the Mask Stack to ‘1’
Perform the comparison: model == “Focus”, and store
the result ‘1’ or ‘0’ into $LR1
Perform the comparison: location == “Ohio”, and store
the result into $LR2
AND $LR1 with $LR2, and store the result into the
Responder Register
AND the Responder Register with the top of the Mask
Stack, and push the ANDing result into the Mask Stack
and also store it into the Responder Register
Increase the rebate of all Focus cars in Ohio by 10
(masked instruction)
13
The MAX/MIN Circuitry, the Responder Resolution
Circuitry, and PE3
Responder Resolution
PE 0
Step
/Find
/RslvFst
Mask
Stack
From PE0
R0
V0
to PE0
From PE1
R1
V1
to PE1
From PE2
R2
V2
to PE2
From PE3
R3
V3
to PE3
V4
To CU
clr
Responder
General Purpose
Registers
MAX/MIN
From PE0 : GPR
RPD
D0
R0
MM0
From PE1: GPR
RPD
D1
R1
MM1
From PE2: GPR
RPD
D2
R2
MM2
From PE3: GPR
RPD
D3
R3
MM3
to PE0
to PE1
to PE2
to PE3
14
Using the Falkoff Algorithm for MAX/MIN Search
Maximum-Value Searching (the following steps are
performed in parallel for all the data)
• Search bit slices of the data from the most
significant bit to the least significant bit:
As each bit slice is processed, each bit is ANDed
with a corresponding MM bit (a 1-bit register used
to indicate whether or not a data item is the
maximum after processing a bit)
• Check the results of the AND to ensure that at
least one new maximum value remains:
15
Using the Falkoff Algorithm for MAX/MIN Search
(continued)
If this condition is true, then the MM bits are
updated by the results of AND; if all the results
are 0, then the MM bits are not updated at this
time
• Continue to process the remaining bit slices as
above until all bits are processed
• After the least significant bit slice is processed:
If only one MM bit is ‘1’, it marks the largest
number; if more than one MM bit is ‘1’, those
data are tied for the maximum value
16
Using the Falkoff Algorithm for MAX/MIN Search
(continued)
Minimum-Value Searching:
• Similar to maximum value searching, but
complement the bit slices each time before
ANDing it with MM bits
17
Search For the Maximum Rebate in the Data Memories
Bit Slices (7..0) of Rebates
Values in MM bits During Processing
Process bit from MSB to LSB
(rebate)
(170)
(160)
(190)
(180)
76543210
10101010
10100000
10111110
10110100
After processing each bit
Initialize 7 6 5 4 3 2 1 0
(MM0)
(MM1)
(MM2)
(MM3)
1
1
1
1
11100000
11100000
1 1 1 1 1 1 1 1 (max)
11110000
18
MAX/MIN Circuit using the Falkoff Algorithm
OP
Data 0
8-bit shift register0
“not”
to RPD0
MM0
RPD0
Data 1
8-bit shift register1
“not”
to RPD1
MM1
RPD1
Data 2
8-bit shift register2
“not”
MM2
RDP 2
Data 3
to RPD2
8-bit shift register3
“not”
MM3
RPD 3
Mask_W
to RPD3
19
The MAX/MIN Circuitry, the Responder Resolution
Circuitry, and PE3
Responder Resolution
PE 0
Step
/Find
/RslvFst
Mask
Stack
From PE0
R0
V0
to PE0
From PE1
R1
V1
to PE1
From PE2
R2
V2
to PE2
From PE3
R3
V3
to PE3
V4
To CU
clr
Responder
General Purpose
Registers
MAX/MIN
From PE0 : GPR
RPD
D0
R0
MM0
From PE1: GPR
RPD
D1
R1
MM1
From PE2: GPR
RPD
D2
R2
MM2
From PE3: GPR
RPD
D3
R3
MM3
to PE0
to PE1
to PE2
to PE3
20
Functionality of Responder Resolution Circuit
• Responder resolution:
Send an At-Least-One-Responder signal to the IS
control unit
• Support responder selection:
Send a corresponding Responder_Before_Me
signal to each PE’s Find_ Step _ResolveFirst unit
21
The Responder Resolution Circuitry for 4 PEs
V0
PE0
R0
PE1
V1
PE2
Responder Resolution Circuitry
‘0’
R1
V2
V4
R2
PE3
V3
R3
R0 to R3 : from responder registers
V0 to V3 : called Responder_Before_ME
V4
: called At_Least_One_Responder
22
Responder Processing
• Process responders in parallel:
– use masked instructions
• Process responders sequentially:
– Need some responder selection instructions
– Need a responder selection mechanism
23
Responder Selection Instructions
• Step
repetitively used to pick one responding PE
each time for further processing – “ for” loop
e.g., to step through all the Focus cars in Ohio to list
the features available on each car
• Find
select a responding PE, while still keeping all
responders identifiable – “ while” loop
e.g., retrieve the tax rate from one of the cars located
in OH, then increment the tax rate by a certain
amount, afterwards apply this new tax rate to all the
cars located in OH
24
Responder Selection Instructions
(continued)
• ResolveFirst
select a responder and only keep this
responder identifiable
e.g., resolve one PE from several PEs which have
the values tied for the maximum value
25
The Responder Resolution Circuitry, MAX/MIN
Circuitry, and PE3
Responder Resolution
PE 3
Step
/Find
/RslvFst
Mask
Stack
From PE0
R0
V0
to PE0
From PE1
R1
V1
to PE1
From PE2
R2
V2
to PE2
From PE3
R3
V3
to PE3
V4
To CU
clr
Responder
General Purpose
Registers
MAX/MIN
From PE0 : GPR
RPD
D0
R0
MM0
From PE1: GPR
RPD
D1
R1
MM1
From PE2: GPR
RPD
D2
R2
MM2
From PE3: GPR
RPD
D3
R3
MM3
to PE0
to PE1
to PE2
to PE3
26
Design Language: VHDL
• A standard hardware description language
used to model and design digital hardware
- Support concurrent events
- can be translated into hardware by some design tools
• good for managing large design structures
• Supported by many CAD tool and
programmable logic vendors
27
Altera MAX+PLUS II Development
System
Design Verification
Design Entry
Graphic Editor
Text Editor
Waveform Editor
Symbol Editor
Floorplan Editor
Other Design Entry
Tools
Simulator
Waveform Editor
Timing Analysis
Other Verification Tools
MAX+PLUS II Compiler
Design
Compilation
Programmer
Data I/O
Other Programmers
Device Programming
28
Altera FLEX 10K FPLD
IOEs
IOEs
IOEs
IOEs
IOEs
IOEs
L
A
B
E
A
B
LA
B
FastTrack
Interconnect
IOEs
• FLEX10K70 Device:
- 3,744 LEs
- 9 EABs
- 70,000 gates totally
IOEs
E
A
B
L
A
B
L
A
B
IOEs
IOEs
IOEs
IOEs
IOEs
IOEs
( IOEs – I/O elements)
Partial FLEX10K20 FPLD Architecture
29
Simulation on FLEX 10K 70 Chip
• The ISCU runs at about 10MHz using 50% logical
gates
• One EAB is used as a local memory for one PE; 4
PEs and the support circuit runs at about 14MHz
using 82% logical cells.
From the simulation result, we can see that the FLEX
10K 70 chip isn’t large enough for the 4-PE processor.
So our current work is targeting on Altera APEX 20K
devices with 1million gates in one chip.
30
Future Work
• Explore more arithmetic features and
associative operations
• Develop the complete ASC assembly
language and the ASC back-end compiler
• Implement the PE cell interconnection network
• Implement the whole ASC processor on
bigger and faster FPGA chips
• Develop the multiple instruction stream MASC
model
31