designKilla:
The 32-bit pipelined processor
Brought to you by:
Victoria Farthing
Dat Huynh
Jerry Felker
Tony Chen
Supervisor: Young Cho
32-Bit RISC Pipelined Processor
• A Reduced Instruction Set executes simple, frequently used instructions quickly; several of them can be combined to achieve the same result as a single, slower CISC instruction
• Pipelining overlaps the execution of instructions, allowing a faster clock cycle and fewer wasted resources
Datapath Pipeline Stages
• 5 Stages
– Instruction Fetch
– Instruction Decode
– Execution
– Memory Access
– Write Back
Unique Data Path Features
• Next instruction address calculation
– For sequential execution, the next address is generated by a counter
Address Jump Calculations
– For jumps and branches, the counter has a 19-bit load port
• The loaded address comes from an adder with multiplexed inputs
• The load signal is the comparator output (beq) ORed with the absolute jump control bit
Double Clocked Memory Interface
• Problem: One memory for both instructions and data
• Solution: Double clock!
• Access the memory twice during one clock cycle
Double Clocked Memory Interface
[Timing diagram: Clock, Fast Clock and Write Enable waveforms; successive memory slots labelled Fetch Instruction, Fetch Data, Fetch Instruction, Write Data]
• Fetches the instruction in the first Fast Clock cycle
• Fetches or writes data in the second Fast Clock cycle
• Data is output by the end of the (processor) Clock cycle
Unique Data Path Features
• Structural Multiplier
– 16 x 16 bit
– Hierarchical construction:
• Four 8 x 8 bit multipliers
– Each containing four 4 x 4 bit multipliers
• Each composed of a cascaded network of full and half adders, built from logic gates
16-Bit Multiplier Unit
• Based on hand (long) multiplication
• Made up of a network of AND gates and adders
Why 32  16 bit?
32bit x 32bit = 64 bits!
Multiple complex changes to existing
architecture would be required
• Only one register can be written per clock cycle
– Could hold value for next cycle or stall the
pipeline
• Would require pseudoinstruction as well
as new hardware and multiple control
signals
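Instead, use the mult32 pseudo-instruction, built from four 16 x 16 mult operations. Register roles as inferred from the code: $1:$2 and $3:$4 hold the high:low halves of the two operands, $30 = 0xFFFF0000 and $31 = 0x0000FFFF are masks, and the 64-bit product is left in $6 (high word) and $5 (low word).
mult32:
mult 20, 2, 4
mult 21, 4, 1
mult 22, 2, 3
mult 23, 1, 3
and 24, 20, 30
srli 24, 24, 16
and 25, 21, 31
add 25, 24, 25
and 24, 22, 31
add 25, 25, 24
and 5, 25, 31
slli 5, 5, 16
and 20, 20, 31
or 5, 5, 20
srli 25, 25, 16
and 24, 22, 30
srli 24, 24, 16
add 24, 24, 25
and 25, 21, 30
srli 25, 25, 16
add 24, 24, 25
and 25, 23, 31
add 24, 24, 25
and 22, 24, 30
srli 22, 22, 16
and 21, 23, 30
srli 21, 21, 16
add 6, 21, 22
slli 6, 6, 16
and 24, 24, 31
or 6, 6, 24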
Improve the Multiplier
• The latency of a combinational multiplier can be reduced by using carry-lookahead adders.
– Only a small amount of extra hardware is needed; worthwhile if the multiplier has the largest latency (is on the critical path).
Other Multiplier Topologies
• Shifting multiplication
– Shift the multiplicand several times based on the multiplier bits
– Add the intermediate shifted values
Other Multiplier Topologies
• Pipelined multiplication
– Store intermediate sums in registers between stages
– Allows a faster clock cycle if traditional combinational multiplication is the critical path
Other Multiplier Topologies
• Sequential multiplication
– Useful to minimize hardware waste if multiplication is an infrequent operation
– Continues to allow a faster clock cycle if traditional combinational multiplication is the critical path
Instruction Set Architecture
R-Type
Mem   operation  rs1  rs2  rd   shift amt  function  translation                       assembly
      (6 bits)   (5)  (5)  (5)  (5 bits)   (6 bits)
add   000000     rs1  rs2  rd   00000      000000    $rd=$rs1+$rs2                     add rd, rs1, rs2
sub   000000     rs1  rs2  rd   00000      000001    $rd=$rs1-$rs2                     sub rd, rs1, rs2
inc   000000     rs1  rs2  rd   00000      000100    $rd=$rs1+1                        inc rd, rs1, *
dec   000000     rs1  rs2  rd   00000      000101    $rd=$rs1-1                        dec rd, rs1, *
sla   000000     rs1  rs2  rd   00000      001000    $rd=$rs1<<$rs2 (arithmetic)       sla rd, rs1, rs2
sra   000000     rs1  rs2  rd   00000      001010    $rd=$rs1>>$rs2 (arithmetic)       sra rd, rs1, rs2
and   000000     rs1  rs2  rd   00000      000010    $rd=$rs1&$rs2 (bitwise)           and rd, rs1, rs2
or    000000     rs1  rs2  rd   00000      000011    $rd=$rs1|$rs2 (bitwise)           or rd, rs1, rs2
comp  000000     rs1  rs2  rd   00000      000110    $rd=~$rs1                         comp rd, rs1, *
sll   000000     rs1  rs2  rd   00000      001100    $rd=$rs1<<$rs2 (logical)          sll rd, rs1, rs2
srl   000000     rs1  rs2  rd   00000      001110    $rd=$rs1>>$rs2 (logical)          srl rd, rs1, rs2
slt   000000     rs1  rs2  rd   00000      001001    if($rs1<$rs2) $rd=1, else $rd=0   slt rd, rs1, rs2
I-Type
Mem   op      rs  rd  address/immediate  translation                          assembly
      (6)     (5) (5) (16 bits)
lw    000001  rs  rd  address            $rd=mem[immediate+$rs]               lw rd, rs, 100
sw    000010  rs  rd  address            mem[immediate+$rs]=$rd               sw rd, rs, 100
lwi   000011  rs  rd  immediate value    $rd=immediate                        lwi rd, rs, 100
addi  000101  rs  rd  immediate value    $rd=$rs+immediate                    addi rd, rs, 100
beq   000110  rs  rd  address            if($rs==$rd) PC+=address (*4?)       beq rd, rs, 100
slti  001001  rs  rd  immediate value    if($rs<immediate) $rd=1, else $rd=0  slti rd, rs, 100
slai  001000  rs  rd  immediate value    $rd=$rs<<immediate                   slai rd, rs, 100
srai  001010  rs  rd  immediate value    $rd=$rs>>immediate                   srai rd, rs, 100
slli  001100  rs  rd  immediate value    $rd=$rs<<immediate                   slli rd, rs, 100
srli  001110  rs  rd  immediate value    $rd=$rs>>immediate                   srli rd, rs, 100
J-Type
Mem   op (6 bits)  target address (26 bits)                   translation               assembly
jmp   000111       target address for jump; all 1's for halt  PC=target address (*4?)  jmp 100, *, *
The Assembler
• Converts assembly code to its binary representation
Example (R-type add):
Mem   operation  rs1  rs2  rd   shift amt  function  assembly          translation
add   000000     rs1  rs2  rd   00000      000000    add rd, rs1, rs2  $rd=$rs1+$rs2
add $3,$1,$2 => 00000000001000100001100000000000
• 16-bit wide memory modules
– Each 32-bit word is split into high and low halves for output
0000000000100010 => High
0001100000000000 => Low
Assembler Features
• Allows labels to be used in loops
• Automatically calculates offsets based on label position
LABEL:
add $1,$2,$3
jmp LABEL
• Resolves hazards created by pipelining
1. Automatically determines the appropriate number of NO-OPs to insert based on the relative positions of dependent instructions
• The design allows pseudo-instructions to be used
Pseudo-instruction   Actual instructions
HLT                  H1:
                     JMP H1
                     NOP
                     NOP
Topic 2 Design – Compiler
• Bison – Parser
– A compiler compiler
– Generates a parser from a grammar
• Flex – Lexer
– A fast lexical analyzer generator
– Tool used for pattern matching on text
Compiling The C Language
• Interface the lexer and the parser
• Flex feeds tokens to Bison (a YACC-compatible parser generator)
• A parse tree is generated
Source code to run-time
A simple program
• A simple C program:
void main ( void )
{
    int b;
    int d;
    int x;
    int y = 3;
    int g;

    x = b + d;
    g = y + x;
}
• Assembly code equivalent:
lwi 4, 0, 3
add 6, 1, 2
sw 3, 6, 0
add 6, 4, 3
sw 5, 6, 0
• Machine code instructions
Address  Memory High       Memory Low
0        0000110000000100  0000000000000011
1        0000000000100010  0011000000000000
2        0000000000000000  0000000000000000
3        0000000000000000  0000000000000000
4        0000000000000000  0000000000000000
5        0000000000000000  0000000000000000
6        0000000000000000  0000000000000000
7        0000100011000011  0000000000000000
8        0000000000000000  0000000000000000
9        0000000000000000  0000000000000000
10       0000000000000000  0000000000000000
11       0000000010000011  0011000000000000
12       0000000000000000  0000000000000000
13       0000000000000000  0000000000000000
14       0000000000000000  0000000000000000
15       0000000000000000  0000000000000000
16       0000000000000000  0000000000000000
17       0000100011000101  0000000000000000
Could Use a Little Work
• The processor works, but needs further work to improve performance.
– Reducing memory latency would be the largest and most direct improvement.
– The ALU and the multiplier unit must be optimized as well.
– All in all, the design works but is not ready for commercial use.
THE END
Questions?