
Performance
• How to measure, report, and summarize performance?
• What factors determine the performance of a computer?
• Critical to purchase and design decisions
  – best performance?
  – least cost?
  – best performance/cost?
• Questions:
  – Why is some hardware better than others for different programs?
  – What factors of system performance are hardware related?
    (e.g., do we need a new machine, or a new operating system?)
  – How does the machine's instruction set affect performance?
1998 Morgan Kaufmann Publishers
1
Computer Performance
• Response Time (execution time)
  — the time between the start and completion of a task
• Throughput
  — the total amount of work done in a given time
• Q: If we replace the processor with a faster one, what do we increase?
  A: Both response time and throughput.
• Q: If we add an additional processor to a system, what do we increase?
  A: Throughput.
1998 Morgan Kaufmann Publishers
2
Book's Definition of Performance
• For some program running on machine X,
  PerformanceX = 1 / Execution timeX
• "X is n times faster than Y" means
  n = PerformanceX / PerformanceY
• Problem: Machine A runs a program in 10 seconds and machine B in 15
  seconds. How much faster is A than B?
  Answer: n = PerformanceA / PerformanceB
            = Execution timeB / Execution timeA = 15/10 = 1.5
  A is 1.5 times faster than B.
1998 Morgan Kaufmann Publishers
3
Execution Time
• Elapsed time, wall-clock time, or response time
  – counts everything (disk and memory accesses, I/O, etc.)
  – a useful number, but often not good for comparison purposes
• CPU time
  – doesn't count I/O or time spent running other programs
  – can be broken up into system time and user time
• Our focus: user CPU time
  – time spent executing the lines of code that are "in" our program
1998 Morgan Kaufmann Publishers
4
Clock Cycles
• Instead of reporting execution time in seconds, we often use cycles:

    seconds/program = (cycles/program) × (seconds/cycle)

• Execution time = # of clock cycles × cycle time
• Clock "ticks" indicate when to start activities (one abstraction)
• cycle time (period) = time between ticks = seconds per cycle
• clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)
• A 200 MHz clock has a cycle time of 1 / (200 × 10^6) s = 5 ns
1998 Morgan Kaufmann Publishers
5
How to Improve Performance
    seconds/program = (cycles/program) × (seconds/cycle)

So, to improve performance (everything else being equal) you can either
  – reduce the # of required clock cycles for a program, or
  – decrease the clock period or, said another way,
    increase the clock frequency.
1998 Morgan Kaufmann Publishers
6
Different numbers of cycles for different instructions
• Multiplication takes more time than addition
• Floating-point operations take longer than integer ones
• Accessing memory takes more time than accessing registers
• Important point: changing the cycle time often changes the number of
  cycles required for various instructions (more later)
• Another point: the same instruction might require a different number of
  cycles on a different machine
1998 Morgan Kaufmann Publishers
7
Example
• A program runs in 10 seconds on computer A, which has a 400 MHz clock.
  We are trying to help a computer designer build a new machine B that will
  run this program in 6 seconds. The designer can use new technology to
  substantially increase the clock rate, but this increase will affect the
  rest of the CPU design, causing machine B to require 1.2 times as many
  clock cycles as machine A. What clock rate should we tell the designer
  to target?
• Clock cyclesA = 10 s × 400 MHz = 4×10^9 cycles
  Clock cyclesB = 1.2 × 4×10^9 cycles = 4.8×10^9 cycles
  Execution time = # of clock cycles × cycle time
  Clock rateB = Clock cyclesB / Execution timeB
              = 4.8×10^9 cycles / 6 s = 800 MHz
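As a quick sanity check, the calculation can be reproduced in C (an
illustrative sketch, not part of the original slides; variable names are
mine):

  #include <stdio.h>

  int main(void) {
      double time_a  = 10.0;               /* seconds on machine A */
      double clock_a = 400e6;              /* 400 MHz */
      double cycles_a = time_a * clock_a;  /* 4e9 cycles */
      double cycles_b = 1.2 * cycles_a;    /* B needs 1.2x as many cycles */
      double time_b  = 6.0;                /* target execution time on B */
      double clock_b = cycles_b / time_b;  /* required clock rate */
      printf("Machine B needs %.0f MHz\n", clock_b / 1e6);  /* 800 MHz */
      return 0;
  }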
1998 Morgan Kaufmann Publishers
8
Now that we understand cycles
• A given program will require
  – some number of instructions (machine instructions)
  – some number of cycles
  – some number of seconds
• We have a vocabulary that relates these quantities:
  – cycle time (seconds per cycle)
  – clock rate (cycles per second)
  – CPI (cycles per instruction)
    an AVERAGE value!
    a floating-point-intensive application might have a higher CPI
  – MIPS (millions of instructions per second)
    this would be higher for a program using simple instructions
1998 Morgan Kaufmann Publishers
9
Performance
• Performance is determined by execution time
• Related variables:
  – # of cycles to execute the program
  – # of instructions in the program
  – # of cycles per second
  – average # of cycles per instruction
  – average # of instructions per second
• Common pitfall: thinking one of the variables is indicative of
  performance when it really isn't.
1998 Morgan Kaufmann Publishers
10
CPI Example
• Suppose we have two implementations of the same instruction set
  architecture (ISA). For some program,
  Machine A has a clock cycle time of 10 ns and a CPI of 2.0
  Machine B has a clock cycle time of 20 ns and a CPI of 1.2
  Which machine is faster for this program, and by how much?
• Time per instruction: A: 2.0 × 10 ns = 20 ns
                        B: 1.2 × 20 ns = 24 ns
  A is 24/20 = 1.2 times faster
• If two machines have the same ISA, which of our quantities (e.g., clock
  rate, CPI, execution time, # of instructions, MIPS) will always be
  identical?
  Answer: # of instructions
1998 Morgan Kaufmann Publishers
11
# of Instructions Example
• A compiler designer has two alternatives for a certain code sequence.
  There are three different classes of instructions: A, B, and C, and they
  require one, two, and three cycles, respectively.
  The first sequence has 5 instructions: 2 of A, 1 of B, and 2 of C.
  The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.
  Which sequence will be faster? What are the CPI values?
• Sequence 1: 2×1 + 1×2 + 2×3 = 10 cycles; CPI1 = 10 / 5 = 2
• Sequence 2: 4×1 + 1×2 + 1×3 = 9 cycles; CPI2 = 9 / 6 = 1.5
• Sequence 2 is faster.
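The cycle counts and CPIs follow mechanically from the class counts; a small
C sketch (illustrative only, not from the slides) evaluates both sequences:

  #include <stdio.h>

  /* Cycles per class, from the example: A = 1, B = 2, C = 3. */
  static const int cycles_per_class[3] = {1, 2, 3};

  /* counts[i] = number of instructions of class i in the sequence */
  static void evaluate(const char *name, const int counts[3]) {
      int cycles = 0, instructions = 0;
      for (int i = 0; i < 3; i++) {
          cycles       += counts[i] * cycles_per_class[i];
          instructions += counts[i];
      }
      printf("%s: %d cycles, CPI = %.2f\n",
             name, cycles, (double)cycles / instructions);
  }

  int main(void) {
      const int seq1[3] = {2, 1, 2};    /* 2 of A, 1 of B, 2 of C */
      const int seq2[3] = {4, 1, 1};    /* 4 of A, 1 of B, 1 of C */
      evaluate("Sequence 1", seq1);     /* 10 cycles, CPI = 2.00 */
      evaluate("Sequence 2", seq2);     /*  9 cycles, CPI = 1.50 */
      return 0;
  }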
1998 Morgan Kaufmann Publishers
12
MIPS
• Million Instructions Per Second
• MIPS = instruction count / (execution time × 10^6)
• MIPS is easy to understand, but
  – does not take into account the capabilities of the instructions;
    the instruction counts of different instruction sets differ
  – varies between programs even on the same computer
  – can vary inversely with performance!
1998 Morgan Kaufmann Publishers
13
MIPS example
• Two compilers are being tested for a 100 MHz machine with three
  different classes of instructions: A, B, and C, which require one,
  two, and three cycles, respectively.
  Compiler 1: compiled code uses 5 million class A, 1 million class B,
  and 1 million class C instructions.
  Compiler 2: compiled code uses 10 million class A, 1 million class B,
  and 1 million class C instructions.
• Which sequence will be faster according to MIPS?
• Which sequence will be faster according to execution time?
1998 Morgan Kaufmann Publishers
14
MIPS example
• Cycles and instructions:
  Compiler 1: 10 million cycles, 7 million instructions
  Compiler 2: 15 million cycles, 12 million instructions
• Execution time = Clock cycles / Clock rate
  Execution time1 = 10×10^6 / 100×10^6 = 0.1 s
  Execution time2 = 15×10^6 / 100×10^6 = 0.15 s
• MIPS = Instruction count / (Execution time × 10^6)
  MIPS1 = 7×10^6 / (0.1 × 10^6) = 70
  MIPS2 = 12×10^6 / (0.15 × 10^6) = 80
• So compiler 2 looks better by MIPS, yet compiler 1's code runs faster.
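The inversion is easy to see in code; this C sketch (illustrative, not from
the slides) computes both metrics:

  #include <stdio.h>

  int main(void) {
      double clock_hz = 100e6;                /* 100 MHz machine */
      double cycles1 = 10e6, instrs1 = 7e6;   /* compiler 1 */
      double cycles2 = 15e6, instrs2 = 12e6;  /* compiler 2 */

      double t1 = cycles1 / clock_hz;         /* 0.10 s */
      double t2 = cycles2 / clock_hz;         /* 0.15 s */
      printf("times: %.2f s vs %.2f s\n", t1, t2);
      printf("MIPS:  %.0f vs %.0f\n",
             instrs1 / (t1 * 1e6),            /* 70 */
             instrs2 / (t2 * 1e6));           /* 80: higher MIPS, slower */
      return 0;
  }

Compiler 2 wins on MIPS (80 vs 70) even though its code takes 50% longer to
run, which is exactly why MIPS can vary inversely with performance.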
1998 Morgan Kaufmann Publishers
15
Benchmarks
• Performance is best determined by running a real application
  – use programs typical of the expected workload
  – or, typical of the expected class of applications,
    e.g., compilers/editors, scientific applications, graphics, etc.
• Small benchmarks
  – nice for architects and designers
  – easy to standardize
  – can be abused
• SPEC (System Performance Evaluation Cooperative)
  – companies have agreed on a set of real programs and inputs
  – can still be abused
  – valuable indicator of performance (and compiler technology)
1998 Morgan Kaufmann Publishers
16
SPEC ‘95
Benchmark   Description
go          Artificial intelligence; plays the game of Go
m88ksim     Motorola 88k chip simulator; runs test program
gcc         The GNU C compiler generating SPARC code
compress    Compresses and decompresses file in memory
li          Lisp interpreter
ijpeg       Graphic compression and decompression
perl        Manipulates strings and prime numbers in the special-purpose
            programming language Perl
vortex      A database program
tomcatv     A mesh generation program
swim        Shallow water model with 513 x 513 grid
su2cor      Quantum physics; Monte Carlo simulation
hydro2d     Astrophysics; hydrodynamic Navier-Stokes equations
mgrid       Multigrid solver in 3-D potential field
applu       Parabolic/elliptic partial differential equations
turb3d      Simulates isotropic, homogeneous turbulence in a cube
apsi        Solves problems regarding temperature, wind velocity, and
            distribution of pollutants
fpppp       Quantum chemistry
wave5       Plasma physics; electromagnetic particle simulation
1998 Morgan Kaufmann Publishers
17
SPEC ‘89
• Compiler effects on performance depend on the application.
  [Figure: SPEC performance ratio (0-800) for the SPEC '89 benchmarks gcc,
  espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, and tomcatv,
  comparing code from a baseline compiler with an enhanced compiler.]
1998 Morgan Kaufmann Publishers
18
SPEC ‘95
• Organisational enhancements improve performance; doubling the clock rate
  does not double the performance.
  [Figure: SPECint (left) and SPECfp (right) ratings, 0-10, versus clock
  rate (50-250 MHz) for the Pentium and Pentium Pro.]
1998 Morgan Kaufmann Publishers
19
Amdahl's Law
Version 1:
  Execution time after improvement =
      Execution time unaffected + Execution time affected / Amount of improvement

Version 2:
  Speedup = Performance after improvement / Performance before improvement
          = Execution time before improvement / Execution time after improvement

Execution time (n = unaffected part, a = affected part, p = improvement factor):
  before: n + a
  after:  n + a/p

  Speedup = (n + a) / (n + a/p)

Principle: Make the common case fast
1998 Morgan Kaufmann Publishers
20
Amdahl's Law
Example:
Suppose a program runs in 100 seconds on a machine, with multiply
responsible for 80 seconds of this time. How much do we have to improve
the speed of multiplication if we want the program to run 4 times faster?

  100 s / 4 = 80 s/n + 20 s
  5 s = 80 s/n
  n = 80 s / 5 s = 16
1998 Morgan Kaufmann Publishers
21
Amdahl's Law
Example:
A benchmark program spends half of its time executing floating-point
instructions. We improve the performance of the floating-point unit by a
factor of four. What is the speedup?

  Time before = 10 s
  Time after = 5 s + 5 s / 4 = 6.25 s
  Speedup = 10 / 6.25 = 1.6
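Both examples are instances of the same formula; a small C helper (an
illustrative sketch, not from the slides) reproduces them:

  #include <stdio.h>

  /* Amdahl's law: unaffected + affected / p, where p is the factor by
     which the affected part is improved. */
  static double time_after(double unaffected, double affected, double p) {
      return unaffected + affected / p;
  }

  int main(void) {
      /* Multiply example: 20 s unaffected, 80 s affected, improved 16x. */
      double t = time_after(20.0, 80.0, 16.0);
      printf("%.2f s, speedup %.1f\n", t, 100.0 / t);  /* 25.00 s, 4.0 */

      /* Floating-point example: 5 s + 5 s, FP unit improved 4x. */
      t = time_after(5.0, 5.0, 4.0);
      printf("%.2f s, speedup %.1f\n", t, 10.0 / t);   /* 6.25 s, 1.6 */
      return 0;
  }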
1998 Morgan Kaufmann Publishers
22
Machine Instructions:
• Language of the machine
• Lowest level of programming; controls the hardware directly
• Assembly instructions are symbolic versions of machine instructions
• More primitive than higher-level languages
• Very restrictive
• Programs are stored in memory; one instruction is fetched and
  executed at a time
• We'll be working with the MIPS instruction set architecture
1998 Morgan Kaufmann Publishers
23
MIPS instruction set:
• Load from memory
• Store in memory
• Logic operations
  – and, or, negation, shift, ...
• Arithmetic operations
  – addition, subtraction, ...
• Branch
1998 Morgan Kaufmann Publishers
24
Instruction types:
• 1 operand:
  jump #address
  jump $register
• 2 operands:
  multiply $reg1, $reg2
• 3 operands:
  add $reg1, $reg2, $reg3
1998 Morgan Kaufmann Publishers
25
MIPS arithmetic
• Instructions have 3 operands
• Operand order is fixed (destination first)
  Example:
    C code:    A = B + C
    MIPS code: add $s0, $s1, $s2
  ($s0, etc. are registers, associated with variables by the compiler)
26
MIPS arithmetic
•
Design Principle 1: simplicity favours regularity.
•
Of course this complicates some things...
C code:
A = B + C + D;
E = F - A;
MIPS code:
add $t0, $s1, $s2
add $s0, $t0, $s3
sub $s4, $s5, $s0
•
Operands must be registers, 32 registers provided
•
Design Principle 2: smaller is faster.
1998 Morgan Kaufmann Publishers
27
Registers vs. Memory
• Arithmetic instructions' operands are registers
• The compiler associates variables with registers
• What about programs with lots of variables?
  [Figure: block diagram of a processor (control and datapath) connected
  to memory and to input/output devices.]
1998 Morgan Kaufmann Publishers
28
Memory Organization
• Viewed as a large, single-dimension array with an address
• A memory address is an index into the array
• "Byte addressing" means that the index points to a byte of memory

  Address:  0  1  2  3  4  5  6  ...
  Each location holds 8 bits of data.
1998 Morgan Kaufmann Publishers
29
Memory Organization
• Bytes are nice, but most data items use larger "words"
• For MIPS, a word is 32 bits or 4 bytes

  Address:  0  4  8  12  ...
  Each location holds 32 bits of data.

• Registers hold 32 bits of data
• 2^32 bytes with byte addresses from 0 to 2^32 - 1
• 2^30 words with byte addresses 0, 4, 8, ..., 2^32 - 4
• Words are aligned,
  i.e., the 2 least significant bits of a word address are equal to 0
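The alignment test is a two-bit mask; a tiny C sketch (illustrative, not
from the slides):

  #include <stdint.h>
  #include <stdio.h>

  /* A word address is aligned iff its 2 least significant bits are 0. */
  static int is_word_aligned(uint32_t addr) {
      return (addr & 0x3) == 0;      /* same as addr % 4 == 0 */
  }

  int main(void) {
      printf("%d %d\n", is_word_aligned(32), is_word_aligned(33)); /* 1 0 */
      return 0;
  }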
1998 Morgan Kaufmann Publishers
30
Load and store instructions
• Example:
  C code:    A[8] = h + A[8];
  MIPS code: lw  $t0, 32($s3)
             add $t0, $s2, $t0
             sw  $t0, 32($s3)
• Word offset 8 equals byte offset 32
• Store word has destination last
• Remember: arithmetic operands are registers, not memory!
1998 Morgan Kaufmann Publishers
31
So far we’ve learned:
• MIPS
  — loading and storing words, but addressing bytes
  — arithmetic on registers only

  Instruction          Meaning
  add $s1, $s2, $s3    $s1 = $s2 + $s3
  sub $s1, $s2, $s3    $s1 = $s2 - $s3
  lw  $s1, 100($s2)    $s1 = Memory[$s2 + 100]
  sw  $s1, 100($s2)    Memory[$s2 + 100] = $s1
1998 Morgan Kaufmann Publishers
32
Machine Language
• Instructions, like registers and words of data, are also 32 bits long
• Example: add $t0, $s1, $s2
  R-type instruction format:

      op     rs    rt    rd   shamt  funct
    000000 10001 10010 01000 00000 100000

  op:    opcode, basic operation
  rs:    1st source register
  rt:    2nd source register
  rd:    destination register
  shamt: shift amount
  funct: function; selects the specific variant of the operation
1998 Morgan Kaufmann Publishers
33
Machine Language
• Introduce a new type of instruction format
  – I-type, for data transfer instructions
• Example: lw $t0, 32($s2)

    op   rs   rt   16-bit number
    35   18    8   32

  rt is the destination register ($t0 is register 8); a new instruction
  format, but fields 1-3 are the same as in the R-type
• Design Principle 3: good design demands good compromises.
1998 Morgan Kaufmann Publishers
34
Stored Program Concept
• Instructions are groups of bits
• Programs are stored in memory
  — to be read or written just like data
  (memory holds data, programs, compilers, editors, etc.)
• Fetch & Execute cycle
  – Instructions are fetched and put into a special register
  – Bits in the register "control" the subsequent actions
  – Fetch the "next" instruction and continue
1998 Morgan Kaufmann Publishers
35
Control
• Decision-making instructions
  – alter the control flow,
  – i.e., change the "next" instruction to be executed
• MIPS conditional branch instructions:
  bne $t0, $t1, Label    # branch if not equal
  beq $t0, $t1, Label    # branch if equal
• Example (if):
  if (i == j) h = i + j;

         bne $s0, $s1, Label
         add $s3, $s0, $s1
  Label: ...
1998 Morgan Kaufmann Publishers
36
Control
• MIPS unconditional branch instruction:
  j Label
• Example (if-then-else):
  if (i != j)
      h = i + j;
  else
      h = i - j;

          beq $s4, $s5, Label1
          add $s3, $s4, $s5
          j   Label2
  Label1: sub $s3, $s4, $s5
  Label2: ...
1998 Morgan Kaufmann Publishers
37
Control
• Example (loop):
  C-like code:
        ---
  Loop: i = i + j;
        if (i != h) go to Loop
        ---
  MIPS code:
  Loop: ---
        add $s1, $s1, $s2    # i = i + j
        bne $s1, $s3, Loop
        ---
1998 Morgan Kaufmann Publishers
38
So far:
• Instruction          Meaning
  add $s1, $s2, $s3    $s1 = $s2 + $s3
  sub $s1, $s2, $s3    $s1 = $s2 - $s3
  lw  $s1, 100($s2)    $s1 = Memory[$s2 + 100]
  sw  $s1, 100($s2)    Memory[$s2 + 100] = $s1
  bne $s4, $s5, L      Next instr. is at L if $s4 != $s5
  beq $s4, $s5, L      Next instr. is at L if $s4 = $s5
  j   Label            Next instr. is at Label

• Formats:
  R:  op  rs  rt  rd  shamt  funct
  I:  op  rs  rt  16-bit address
  J:  op  26-bit address
1998 Morgan Kaufmann Publishers
39
Control Flow
• We have beq and bne; what about branch-if-less-than?
• New instruction: set on less than
  slt $t0, $s1, $s2    # if $s1 < $s2 then $t0 = 1 else $t0 = 0
• slt and bne can be used to implement branch on less than:
  slt $t0, $s0, $s1
  bne $t0, $zero, Less
• Note that the assembler needs a register to do this;
  there are register conventions for the MIPS assembly language
• We can now build general control structures
1998 Morgan Kaufmann Publishers
40
MIPS Register Convention
Name      Register number   Usage
$zero     0                 the constant value 0
$v0-$v1   2-3               values for results and expression evaluation
$a0-$a3   4-7               arguments
$t0-$t7   8-15              temporaries
$s0-$s7   16-23             saved
$t8-$t9   24-25             more temporaries
$gp       28                global pointer
$sp       29                stack pointer
$fp       30                frame pointer
$ra       31                return address

• $at (register 1) is reserved for the assembler
• $k0-$k1 (registers 26-27) are reserved for the operating system
1998 Morgan Kaufmann Publishers
41
Procedure calls
• Procedures and subroutines allow reuse and structuring of code
• Steps:
  – Place parameters where the procedure can access them
  – Transfer control to the procedure
  – Acquire the storage needed for the procedure
  – Perform the desired task
  – Place the results where the calling program can access them
  – Return control to the point of origin
1998 Morgan Kaufmann Publishers
42
Register assignments for procedure calls
• $a0-$a3: four argument registers for passing parameters
• $v0-$v1: two return value registers
• $ra: return address register
• Use of the argument and return value registers: decided by the compiler
• Handling of the control passing mechanism: the machine
• Jump-and-link instruction:
  jal ProcAddress
  – saves the return address (PC + 4) in $ra (the Program Counter holds
    the address of the current instruction)
  – loads ProcAddress into the PC
• Return jump:
  jr $ra
  – loads the return address into the PC
1998 Morgan Kaufmann Publishers
43
Stack
• Used if four argument registers and two return value registers are not
  enough, or if nested subroutines (a subroutine calls another one) are used
• Can also contain temporary data
• The stack is a last-in-first-out (LIFO) structure in memory
• The stack pointer ($sp) points at the top of the stack
• Push and pop operations
• The MIPS stack grows from higher addresses to lower addresses
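A toy model of this in C (an illustrative sketch, not from the slides; the
array and names are mine): the stack pointer moves down on push and up on
pop, mirroring the downward-growing MIPS stack.

  #include <stdint.h>
  #include <stdio.h>

  static uint32_t mem[1024];
  static int sp = 1024;          /* word index; 1024 means an empty stack */

  static void push(uint32_t v) { mem[--sp] = v; }  /* grow downward */
  static uint32_t pop(void)    { return mem[sp++]; }

  int main(void) {
      push(7); push(42);
      printf("%u %u\n", pop(), pop());   /* 42 7: last in, first out */
      return 0;
  }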
1998 Morgan Kaufmann Publishers
44
Stack and Stack Pointer
[Figure: a stack in memory with elements between the bottom and the top;
the stack pointer (SP) points at the top element, pushes and pops happen
at the top, and the stack grows downward.]
1998 Morgan Kaufmann Publishers
45
Constants
• Small constants are used quite frequently, e.g.,
  A = A + 5;
  B = B - 1;
• Solution 1: put constants in memory and load them.
  To add a constant to a register:
  lw  $t0, AddrConstant($zero)
  add $sp, $sp, $t0
• Solution 2: to avoid the extra instructions, keep the constant inside
  the instruction itself (i means immediate):
  addi $29, $29, 4
  slti $8, $18, 10
  andi $29, $29, 6
• Design Principle 4: make the common case fast.
1998 Morgan Kaufmann Publishers
46
How about larger constants?
• We'd like to be able to load a 32-bit constant into a register
• Must use two instructions; a new "load upper immediate" instruction:
  lui $t0, 1010101010101010

  $t0 after lui:  1010101010101010 0000000000000000
                  (lower 16 bits filled with zeros)

• Then we must get the lower-order bits right:
  ori $t0, $t0, 1010101010101010

       1010101010101010 0000000000000000
  ori  0000000000000000 1010101010101010
     = 1010101010101010 1010101010101010
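The same two-step construction can be mimicked in C (an illustrative sketch,
not from the slides): shift the upper half into place, then OR in the lower
half.

  #include <stdint.h>
  #include <stdio.h>

  /* Build a 32-bit constant from two 16-bit halves, as lui + ori do. */
  static uint32_t build_constant(uint16_t upper, uint16_t lower) {
      uint32_t reg = (uint32_t)upper << 16; /* lui: upper half, zeros below */
      reg |= lower;                         /* ori: fill in the lower half */
      return reg;
  }

  int main(void) {
      /* 1010101010101010 in both halves = 0xAAAAAAAA */
      printf("0x%08X\n", build_constant(0xAAAA, 0xAAAA));
      return 0;
  }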
1998 Morgan Kaufmann Publishers
47
Overview of MIPS
• Simple instructions, all 32 bits wide
• Very structured, no unnecessary baggage
• Only three instruction formats:
  R:  op  rs  rt  rd  shamt  funct
  I:  op  rs  rt  16-bit address
  J:  op  26-bit address
• Rely on the compiler to achieve performance
  — what are the compiler's goals?
• Help the compiler where we can
1998 Morgan Kaufmann Publishers
48
Addresses in Branches and Jumps
• Instructions:
  bne $t4, $t5, Label    Next instruction is at Label if $t4 != $t5
  beq $t4, $t5, Label    Next instruction is at Label if $t4 = $t5
  j   Label              Next instruction is at Label
• Formats:
  I:  op  rs  rt  16-bit address
  J:  op  26-bit address
• Addresses are not 32 bits
  — how do we handle this with load and store instructions?
1998 Morgan Kaufmann Publishers
49
Addresses in Branches
• Instructions:
  bne $t4, $t5, Label    Next instruction is at Label if $t4 != $t5
  beq $t4, $t5, Label    Next instruction is at Label if $t4 = $t5
• Format:
  I:  op  rs  rt  16-bit address
• Could specify a register (as lw and sw do) and add it to the address
  – use the Instruction Address Register (PC = program counter)
  – most branches are local (principle of locality)
• Jump instructions just use the high-order bits of the PC
  – address boundaries of 256 MB
1998 Morgan Kaufmann Publishers
50
MIPS addressing mode summary
• Register addressing
  – operand is in a register
• Base or displacement addressing
  – operand is in memory
  – address is the sum of a register and a constant in the instruction
• Immediate addressing
  – operand is a constant within the instruction
• PC-relative addressing
  – address is the sum of the PC and a constant in the instruction
  – used e.g. in branch instructions
• Pseudodirect addressing
  – jump address is the 26 bits of the instruction concatenated with
    the upper bits of the PC
• Additional addressing modes exist in other computers
1998 Morgan Kaufmann Publishers
51
MIPS addressing mode summary
[Figure: the five MIPS addressing modes.
 1. Immediate addressing: the operand is the immediate field of the
    instruction.
 2. Register addressing: the operand is the register named by the
    instruction.
 3. Base addressing: a register plus the 16-bit constant gives the memory
    address of a byte, halfword, or word.
 4. PC-relative addressing: the PC plus the 16-bit constant gives a memory
    address.
 5. Pseudodirect addressing: the 26-bit field concatenated with the upper
    bits of the PC gives a memory address.]
1998 Morgan Kaufmann Publishers
52
To summarize:
MIPS operands

Name         Example                        Comments
32 registers $s0-$s7, $t0-$t9, $zero,       Fast locations for data. In MIPS, data must be in
             $a0-$a3, $v0-$v1, $gp,         registers to perform arithmetic. MIPS register $zero
             $fp, $sp, $ra, $at             always equals 0. Register $at is reserved for the
                                            assembler to handle large constants.
2^30 memory  Memory[0], Memory[4], ...,     Accessed only by data transfer instructions. MIPS uses
words        Memory[4294967292]             byte addresses, so sequential words differ by 4. Memory
                                            holds data structures, such as arrays, and spilled
                                            registers, such as those saved on procedure calls.

MIPS assembly language

Category       Instruction           Example               Meaning                     Comments
Arithmetic     add                   add  $s1, $s2, $s3    $s1 = $s2 + $s3             Three operands; data in registers
               subtract              sub  $s1, $s2, $s3    $s1 = $s2 - $s3             Three operands; data in registers
               add immediate         addi $s1, $s2, 100    $s1 = $s2 + 100             Used to add constants
Data transfer  load word             lw   $s1, 100($s2)    $s1 = Memory[$s2 + 100]     Word from memory to register
               store word            sw   $s1, 100($s2)    Memory[$s2 + 100] = $s1     Word from register to memory
               load byte             lb   $s1, 100($s2)    $s1 = Memory[$s2 + 100]     Byte from memory to register
               store byte            sb   $s1, 100($s2)    Memory[$s2 + 100] = $s1     Byte from register to memory
               load upper immediate  lui  $s1, 100         $s1 = 100 * 2^16            Loads constant in upper 16 bits
Conditional    branch on equal       beq  $s1, $s2, 25     if ($s1 == $s2) go to       Equal test; PC-relative branch
branch                                                     PC + 4 + 100
               branch on not equal   bne  $s1, $s2, 25     if ($s1 != $s2) go to       Not equal test; PC-relative
                                                           PC + 4 + 100
               set on less than      slt  $s1, $s2, $s3    if ($s2 < $s3) $s1 = 1;     Compare less than; for beq, bne
                                                           else $s1 = 0
               set less than         slti $s1, $s2, 100    if ($s2 < 100) $s1 = 1;     Compare less than constant
               immediate                                   else $s1 = 0
Unconditional  jump                  j    2500             go to 10000                 Jump to target address
jump           jump register         jr   $ra              go to $ra                   For switch, procedure return
               jump and link         jal  2500             $ra = PC + 4; go to 10000   For procedure call
1998 Morgan Kaufmann Publishers
53
Assembly Language vs. Machine Language
• Assembly provides a convenient symbolic representation
  – much easier than writing down numbers
  – e.g., destination first
• Machine language is the underlying reality
  – e.g., destination is no longer first
• Assembly can provide 'pseudoinstructions'
  – e.g., "move $t0, $t1" exists only in assembly
  – would be implemented using "add $t0, $t1, $zero"
• When considering performance you should count real instructions
1998 Morgan Kaufmann Publishers
54
Alternative Architectures
• Design alternative:
  – provide more powerful operations than found in MIPS
  – goal is to reduce the number of instructions executed
  – danger is a slower cycle time and/or a higher CPI
• Sometimes referred to as "RISC vs. CISC"
  – Reduced Instruction Set Computers
  – Complex Instruction Set Computers
  – virtually all new instruction sets since 1982 have been RISC
1998 Morgan Kaufmann Publishers
55
Reduced Instruction Set Computers
• Common characteristics of all RISCs
  – Single-cycle issue
  – Small number of fixed-length instruction formats
  – Load/store architecture
  – Large number of registers
• Additional characteristics of most RISCs
  – Small number of instructions
  – Small number of addressing modes
  – Fast control unit
1998 Morgan Kaufmann Publishers
56
An alternative architecture: 80x86
• 1978: The Intel 8086 is announced (16-bit architecture)
• 1980: The 8087 floating-point coprocessor is added
• 1982: The 80286 increases the address space to 24 bits, adds instructions
• 1985: The 80386 extends to 32 bits, new addressing modes
• 1989-1995: The 80486, Pentium, and Pentium Pro add a few instructions
  (mostly designed for higher performance)
• 1997: MMX is added
• Intel had a 16-bit microprocessor two years before its competitors'
  more elegant architectures, which led to the selection of the 8086 as
  the CPU for the IBM PC
• "This history illustrates the impact of the 'golden handcuffs' of
  compatibility" ... "an architecture that is difficult to explain and
  impossible to love"
1998 Morgan Kaufmann Publishers
57
A dominant architecture: 80x86
• See your textbook for a more detailed description
• Complexity:
  – instructions from 1 to 17 bytes long
  – one operand must act as both a source and destination
  – one operand can come from memory
  – complex addressing modes,
    e.g., "base or scaled index with 8 or 32 bit displacement"
• Saving grace:
  – the most frequently used architectural components are not too
    difficult to implement
  – compilers avoid the portions of the architecture that are slow
1998 Morgan Kaufmann Publishers
58
Summary
• Instruction complexity is only one variable
  – lower instruction count vs. higher CPI / lower clock rate
• Design principles:
  – simplicity favours regularity
  – smaller is faster
  – good design demands good compromises
  – make the common case fast
• Instruction set architecture
  – a very important abstraction indeed!
1998 Morgan Kaufmann Publishers
59
Arithmetic
• Where we've been:
  – Performance (seconds, cycles, instructions)
  – Abstractions:
    Instruction Set Architecture
    Assembly Language and Machine Language
• What's up ahead:
  – Implementing the Architecture
1998 Morgan Kaufmann Publishers
60
Arithmetic
• We start with the Arithmetic Logic Unit (ALU)
  [Figure: an ALU with 32-bit inputs a and b, an operation select input,
  and a 32-bit result output.]
1998 Morgan Kaufmann Publishers
61
Numbers
• Bits are just bits (no inherent meaning)
  — conventions define the relationship between bits and numbers
• Binary numbers (base 2):
  0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 ...
  decimal: 0 ... 2^n - 1
• Of course it gets more complicated:
  – numbers are finite (overflow)
  – fractions and real numbers
  – negative numbers
• How do we represent negative numbers?
  i.e., which bit patterns will represent which numbers?
• Octal and hexadecimal numbers
• Floating-point numbers
1998 Morgan Kaufmann Publishers
62
Possible Representations of Signed Numbers
Sign Magnitude    One's Complement    Two's Complement
000 = +0          000 = +0            000 = +0
001 = +1          001 = +1            001 = +1
010 = +2          010 = +2            010 = +2
011 = +3          011 = +3            011 = +3
100 = -0          100 = -3            100 = -4
101 = -1          101 = -2            101 = -3
110 = -2          110 = -1            110 = -2
111 = -3          111 = -0            111 = -1

• Issues: balance, number of zeros, ease of operations
• Two's complement is best
1998 Morgan Kaufmann Publishers
63
MIPS
• 32-bit signed numbers (two's complement):

  0000 0000 0000 0000 0000 0000 0000 0000two = 0ten
  0000 0000 0000 0000 0000 0000 0000 0001two = +1ten
  0000 0000 0000 0000 0000 0000 0000 0010two = +2ten
  ...
  0111 1111 1111 1111 1111 1111 1111 1110two = +2,147,483,646ten
  0111 1111 1111 1111 1111 1111 1111 1111two = +2,147,483,647ten   (maxint)
  1000 0000 0000 0000 0000 0000 0000 0000two = -2,147,483,648ten   (minint)
  1000 0000 0000 0000 0000 0000 0000 0001two = -2,147,483,647ten
  1000 0000 0000 0000 0000 0000 0000 0010two = -2,147,483,646ten
  ...
  1111 1111 1111 1111 1111 1111 1111 1101two = -3ten
  1111 1111 1111 1111 1111 1111 1111 1110two = -2ten
  1111 1111 1111 1111 1111 1111 1111 1111two = -1ten
1998 Morgan Kaufmann Publishers
64
Two's Complement Operations
• Negating a two's complement number: invert all bits and add 1
  – Remember: "negate" and "invert" are different operations.
    You negate a number but invert a bit.
• Converting n-bit numbers into numbers with more than n bits:
  – the MIPS 16-bit immediate gets converted to 32 bits for arithmetic
  – copy the most significant bit (the sign bit) into the other bits:
    0010 -> 0000 0010
    1010 -> 1111 1010
  – this is called "sign extension"
  – MIPS load byte instructions:
    lbu: no sign extension
    lb:  sign extension
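Sign extension is easy to express in C (an illustrative sketch, not from the
slides): replicate bit 15 of the immediate into the upper half.

  #include <stdint.h>
  #include <stdio.h>

  /* Sign-extend a 16-bit value to 32 bits by copying the sign bit. */
  static int32_t sign_extend16(uint16_t imm) {
      if (imm & 0x8000)            /* sign bit set: fill upper half with 1s */
          return (int32_t)(0xFFFF0000u | imm);
      return (int32_t)imm;         /* sign bit clear: upper half stays 0 */
  }

  int main(void) {
      printf("%d\n", sign_extend16(0x0002));  /*  2 */
      printf("%d\n", sign_extend16(0xFFFA));  /* -6 */
      return 0;
  }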
1998 Morgan Kaufmann Publishers
65
Addition & Subtraction
• Just like in grade school (carry/borrow 1s):

    0111     0111     0110
  + 0110   - 0110   - 0101
    1101     0001     0001

• Two's complement operations are easy
  – subtraction using addition of the negative number:

    0111
  + 1010
   10001

• Overflow (result too large for the finite computer word):
  – e.g., adding two n-bit numbers does not always yield an n-bit number:

    0111
  + 0001
    1000
1998 Morgan Kaufmann Publishers
66
Detecting Overflow
• No overflow when adding a positive and a negative number
• No overflow when subtracting numbers with the same sign
• Overflow occurs when the value affects the sign:
  – overflow when adding two positives yields a negative
  – or, adding two negatives gives a positive
  – or, subtracting a negative from a positive gives a negative
  – or, subtracting a positive from a negative gives a positive
• Consider the operations A + B and A - B
  – Can overflow occur if B is 0? No.
  – Can overflow occur if A is 0? Yes.
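These sign rules translate directly into a check on operand and result
signs; a minimal C sketch (illustrative, not from the slides):

  #include <stdint.h>
  #include <stdio.h>

  /* Signed addition overflows iff both operands have the same sign and
     the (wrapped-around) result has a different one. */
  static int add_overflows(int32_t a, int32_t b) {
      int32_t sum = (int32_t)((uint32_t)a + (uint32_t)b);  /* wraparound */
      return (a >= 0) == (b >= 0) && (sum >= 0) != (a >= 0);
  }

  int main(void) {
      printf("%d\n", add_overflows(0x7FFFFFFF, 1));  /* 1: maxint + 1 */
      printf("%d\n", add_overflows(5, -3));          /* 0: mixed signs */
      return 0;
  }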
1998 Morgan Kaufmann Publishers
67
Effects of Overflow
• An exception (interrupt) occurs
  – control jumps to a predefined address for the exception
  – the interrupted address is saved for possible resumption
• Details depend on the software system / language
  – example: flight control vs. homework assignment
• We don't always want to detect overflow
  — new MIPS instructions: addu, addiu, subu
    note: addiu still sign-extends!
    note: sltu, sltiu for unsigned comparisons
1998 Morgan Kaufmann Publishers
68
Logical Operations
• and, andi: bit-by-bit AND
• or, ori:   bit-by-bit OR
• sll:       shift left logical
• srl:       shift right logical
• Examples:
  0101 1010 shifted left two bits gives    0110 1000
  0110 1010 shifted right three bits gives 0000 1101
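The same shifts in C (an illustrative sketch, not from the slides), using an
unsigned type so the right shift fills with zeros, as srl does:

  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
      uint8_t a = 0x5A;                       /* 0101 1010 */
      printf("0x%02X\n", (uint8_t)(a << 2));  /* 0x68 = 0110 1000 */

      uint8_t b = 0x6A;                       /* 0110 1010 */
      printf("0x%02X\n", (uint8_t)(b >> 3));  /* 0x0D = 0000 1101 */
      return 0;
  }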
1998 Morgan Kaufmann Publishers
69
Logical unit
• Let's build a logical unit to support the and and or instructions
  – we'll just build a 1-bit unit and use 32 of them
  – op = 0: and; op = 1: or
  [Figure: a 1-bit logical unit with inputs a and b, an operation select
  op, and a result output.]

  op  a  b | res
   0  0  0 |  0
   0  0  1 |  0
   0  1  0 |  0
   0  1  1 |  1
   1  0  0 |  0
   1  0  1 |  1
   1  1  0 |  1
   1  1  1 |  1

• Possible implementation (sum of products):
  res = a·b + a·op + b·op
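A brute-force C check of that sum-of-products form over all eight input
combinations (illustrative only, not from the slides):

  #include <stdio.h>

  /* Verify res = a·b + a·op + b·op matches AND (op=0) and OR (op=1). */
  int main(void) {
      for (int op = 0; op <= 1; op++)
          for (int a = 0; a <= 1; a++)
              for (int b = 0; b <= 1; b++) {
                  int res = (a & b) | (a & op) | (b & op);
                  int expect = op ? (a | b) : (a & b);
                  printf("op=%d a=%d b=%d res=%d %s\n", op, a, b, res,
                         res == expect ? "ok" : "MISMATCH");
              }
      return 0;
  }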
1998 Morgan Kaufmann Publishers
70
Review: The Multiplexor
• Selects one of the inputs to be the output, based on a control input S
  [Figure: a 2-input multiplexor with inputs A and B, select S, and output
  C; also the IEC symbol of a 4-input MUX with enable (EN) and select
  G0_3.]
• Let's build our logical unit using a MUX:
  [Figure: a and b feed an AND gate (MUX input 0) and an OR gate (MUX
  input 1); the Operation signal selects which gate drives Result.]
1998 Morgan Kaufmann Publishers
71
Different Implementations
• Not easy to decide the "best" way to build something
  – don't want too many inputs to a single gate
  – don't want to have to go through too many gates
  – for our purposes, ease of comprehension is important
  – we use multiplexors
• Let's look at a 1-bit ALU for addition:
  [Figure: a 1-bit full adder with inputs a, b, and CarryIn, and outputs
  Sum and CarryOut.]

  cout = a·b + a·cin + b·cin
  sum  = a xor b xor cin

• How could we build a 1-bit ALU for AND, OR, and ADD?
• How could we build a 32-bit ALU?
1998 Morgan Kaufmann Publishers
72
Building a 32 bit ALU for AND, OR and ADD
[Figure: a 1-bit ALU built from an AND gate, an OR gate, and a full adder
feeding a multiplexor controlled by Operation; we need a 4-input MUX.
Thirty-two of these 1-bit ALUs (ALU0 ... ALU31) are chained, each CarryOut
feeding the next stage's CarryIn, to form a 32-bit ripple-carry ALU for
AND, OR, and ADD.]
1998 Morgan Kaufmann Publishers
73
What about subtraction (a – b) ?
• Two's complement approach: just negate b and add.
• A clever solution:
  [Figure: the 1-bit ALU gains a Binvert control and a multiplexor that
  selects either b or its inverse as the adder's second input.]
• In a multiple-bit ALU the least significant CarryIn has to be set to 1
  for subtraction (it supplies the "+1" of the invert-and-add-1 negation).
1998 Morgan Kaufmann Publishers
74
Tailoring the ALU to the MIPS
• Need to support the set-on-less-than instruction (slt)
  – remember: slt is an arithmetic instruction
  – produces a 1 if rs < rt, and 0 otherwise
  – use subtraction: (a - b) < 0 implies a < b
• Need to support test for equality (beq $t5, $t6, $t7)
  – use subtraction: (a - b) = 0 implies a = b
1998 Morgan Kaufmann Publishers
75
Supporting slt
[Figure: (a) the 1-bit ALU for the intermediate bits gains a Less input,
routed to a fourth MUX input (Operation = 3). (b) the most significant
1-bit ALU additionally produces a Set output (the adder's sum bit, i.e.
the sign of a - b) and contains the overflow detection logic.
In the 32-bit ALU supporting slt, Set is fed back into the Less input of
ALU0 while ALU1 ... ALU31 get Less = 0: a < b iff a - b < 0, so Set is the
sign bit of the result.]
1998 Morgan Kaufmann Publishers
77
Final ALU including test for equality
• Notice the control lines (Bnegate, Operation):
  000 = and
  001 = or
  010 = add
  110 = subtract
  111 = slt
• Note: Zero is a 1 when the result is zero!
  [Figure: the 32-bit ALU with an added Zero output derived from all the
  result bits, for the equality test; ALU31 also produces Set and
  Overflow.]
1998 Morgan Kaufmann Publishers
78
Conclusion
• We can build an ALU to support the MIPS instruction set
  – key idea: use a multiplexor to select the output we want
  – we can efficiently perform subtraction using two's complement
  – we can replicate a 1-bit ALU to produce a 32-bit ALU
• Important points about hardware
  – all of the gates are always working
  – the speed of a gate is affected by the number of inputs to the gate
  – the speed of a circuit is affected by the number of gates in series
    (on the "critical path" or the "deepest level of logic")
• Our primary focus is comprehension; however,
  – clever changes to organization can improve performance
    (similar to using better algorithms in software)
  – we'll look at examples for addition, multiplication and division
1998 Morgan Kaufmann Publishers
79
Problem: ripple carry adder is slow
• A 32-bit ALU is much slower than a 1-bit ALU.
• There is more than one way to do addition
  – the two extremes: ripple carry and sum-of-products
• Can you see the ripple? How could you get rid of it?

  c1 = b0·c0 + a0·c0 + a0·b0
  c2 = b1·c1 + a1·c1 + a1·b1
  c3 = b2·c2 + a2·c2 + a2·b2
  c4 = b3·c3 + a3·c3 + a3·b3

• Expanding the recurrence into pure sum-of-products form shows the cost:
  c2 = c2(a0, b0, c0, a1, b1)
  c3 = c3(a0, b0, c0, a1, b1, a2, b2)
  c4 = c4(a0, b0, c0, a1, b1, a2, b2, a3, b3)
  Not feasible! Too many inputs to the gates.
1998 Morgan Kaufmann Publishers
80
Carry-lookahead adder
• An approach in between the two extremes
• Motivation:
  – If we didn't know the value of carry-in, what could we do?
  – When would we always generate a carry?   gi = ai·bi
  – When would we propagate the carry?       pi = ai + bi
  – Look at the truth table!
• Did we get rid of the ripple?

  c1 = g0 + p0·c0
  c2 = g1 + p1·c1  =  g1 + p1·g0 + p1·p0·c0
  c3 = g2 + p2·c2  =  g2 + p2·g1 + p2·p1·g0 + p2·p1·p0·c0
  c4 = g3 + p3·c3  =  ...

  Feasible! A smaller number of inputs to the gates.
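A C sketch of the generate/propagate equations for a 4-bit slice
(illustrative only, not from the slides; the hardware evaluates the
expanded products in parallel, while this loop just checks the algebra):

  #include <stdio.h>

  int main(void) {
      int a[4] = {1, 1, 0, 0};   /* a = 0011, bit 0 first */
      int b[4] = {1, 0, 1, 0};   /* b = 0101 */
      int c[5];
      c[0] = 0;                  /* carry-in to bit 0 */

      for (int i = 0; i < 4; i++) {
          int g = a[i] & b[i];   /* generate: carry regardless of c[i] */
          int p = a[i] | b[i];   /* propagate: pass c[i] along */
          c[i + 1] = g | (p & c[i]);
          printf("bit %d: sum=%d carry-out=%d\n",
                 i, a[i] ^ b[i] ^ c[i], c[i + 1]);
      }
      return 0;   /* 0011 + 0101 = 1000, no final carry */
  }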
1998 Morgan Kaufmann Publishers
81
1-bit adder
a  b  cin | cout sum
0  0  0   |  0    0
0  0  1   |  0    1
0  1  0   |  0    1
0  1  1   |  1    0
1  0  0   |  0    1
1  0  1   |  1    0
1  1  0   |  1    0
1  1  1   |  1    1
1998 Morgan Kaufmann Publishers
82
Use principle to build bigger adders
• Can't build a 16-bit CLA adder directly (too big)
• Could use ripple carry between 4-bit CLA adders
• Better: use the CLA principle again!
  [Figure: four 4-bit ALUs (ALU0 ... ALU3, producing Result0-3 through
  Result12-15) each emit block propagate (P0 ... P3) and block generate
  (G0 ... G3) signals; a second-level carry-lookahead unit computes the
  block carries C1 ... C4 from them. See the textbook for details.]
1998 Morgan Kaufmann Publishers
83
Multiplication
• More complicated than addition
  – can be accomplished via shifting and addition
• More time and more area
• Let's look at 2 versions based on the grammar school algorithm:

       0010    (multiplicand)
     x 1011    (multiplier)
       ----
       0010
      0010
     0000
    0010
    -------
    0010110

• Negative numbers: the easy way is to convert to positive and multiply
  – there are better techniques
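The grammar-school loop in C (an illustrative sketch, not from the slides):
examine one multiplier bit per step and add the shifted multiplicand
whenever the bit is 1, mirroring the hardware versions below.

  #include <stdint.h>
  #include <stdio.h>

  /* Shift-and-add multiplication, one multiplier bit per iteration. */
  static uint64_t shift_add_mul(uint32_t multiplicand, uint32_t multiplier) {
      uint64_t product = 0;
      uint64_t mcand = multiplicand;    /* shifted left each step */
      for (int i = 0; i < 32; i++) {
          if (multiplier & 1)           /* multiplier bit is 1: add */
              product += mcand;
          mcand <<= 1;                  /* shift multiplicand left */
          multiplier >>= 1;             /* shift multiplier right */
      }
      return product;
  }

  int main(void) {
      printf("%llu\n", (unsigned long long)shift_add_mul(2, 11));  /* 22 */
      return 0;
  }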
1998 Morgan Kaufmann Publishers
84
Multiplication, First Version
[Figure: first version of the multiplication hardware and its flowchart.
A 64-bit Multiplicand register (shift left), a 32-bit Multiplier register
(shift right), a 64-bit ALU, and a 64-bit Product register. Each of the
32 repetitions: (1) test Multiplier0; (1a) if it is 1, add the multiplicand
to the product and place the result in the Product register; (2) shift the
Multiplicand register left 1 bit; (3) shift the Multiplier register right
1 bit.]
1998 Morgan Kaufmann Publishers
85
Multiplication, Final Version
[Figure: final version of the multiplication hardware and its flowchart.
A 32-bit Multiplicand register, a 32-bit ALU, and a 64-bit Product register
that also holds the multiplier and shifts right. Each of the 32 repetitions:
(1) test Product0; (1a) if it is 1, add the multiplicand to the left half of
the product and place the result in the left half of the Product register;
(2) shift the Product register right 1 bit.]
1998 Morgan Kaufmann Publishers
86
Booth’s Algorithm
• The grammar school method was implemented using addition and shifting
• Booth's algorithm also uses subtraction
• Based on two bits of the multiplier, either add, subtract, or do
  nothing; always shift
• Handles two's complement numbers
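A radix-2 Booth multiplier in C (a sketch under my own naming, not from the
slides): the multiplier bit pair (current, previous) selects add, subtract,
or do nothing, and two's complement operands come out right automatically.

  #include <stdint.h>
  #include <stdio.h>

  static int64_t booth_mul(int32_t multiplicand, int32_t multiplier) {
      int64_t product = 0;
      int prev = 0;                 /* implicit bit right of bit 0 */
      for (int i = 0; i < 32; i++) {
          int cur = ((uint32_t)multiplier >> i) & 1;
          if (cur == 0 && prev == 1)      /* 01: end of a run of 1s: add */
              product += (int64_t)multiplicand << i;
          else if (cur == 1 && prev == 0) /* 10: run begins: subtract */
              product -= (int64_t)multiplicand << i;
          prev = cur;                     /* 00 and 11: do nothing */
      }
      return product;
  }

  int main(void) {
      printf("%lld\n", (long long)booth_mul(5, 3));    /*  15 */
      printf("%lld\n", (long long)booth_mul(5, -3));   /* -15 */
      return 0;
  }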
1998 Morgan Kaufmann Publishers
87
Fast multipliers
• Combinational implementations
  – conventional multiplier algorithm
    • partial products with AND gates
    • adders
  – lots of modifications
• Sequential implementations
  – pipelined multiplier
    • registers between levels of logic
    • result delayed
    • effective speed of multiple multiplications increased
1998 Morgan Kaufmann Publishers
88
Four-Bit Binary Multiplication
                               B3    B2    B1    B0    Multiplicand
                             x A3    A2    A1    A0    Multiplier
                             --------------------
                              A0B3  A0B2  A0B1  A0B0   1st partial product
                        A1B3  A1B2  A1B1  A1B0         2nd partial product
                  A2B3  A2B2  A2B1  A2B0               3rd partial product
          + A3B3  A3B2  A3B1  A3B0                     4th partial product
          -------------------------------------
       P7   P6    P5    P4    P3    P2    P1    P0     Final product
1998 Morgan Kaufmann Publishers
89
Classical Implementation
[Figure: combinational 4 x 4 multiplier. AND gates form the four 4-bit
partial products PP1 ... PP4 (A0 ... A3 each ANDed with B3 ... B0), and a
tree of adders sums them into the 8-bit product P7:0.]
1998 Morgan Kaufmann Publishers
90
Pipelined Multiplier
[Figure: pipelined multiplier; registers clocked by Clk separate the levels
of adder logic, so several multiplications can be in flight at once.]
1998 Morgan Kaufmann Publishers
91
Division
• Simple method:
  – Initialise the remainder with the dividend
  – Start from the most significant end
  – Subtract the divisor from the remainder if possible (quotient bit 1)
  – Shift the divisor to the right and repeat
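A C sketch of this restoring division (illustrative only; the 33-step loop
and the names are mine):

  #include <stdint.h>
  #include <stdio.h>

  /* The divisor starts at the most significant end of a double-wide value
     and is shifted right once per step, as in the method above. */
  static void divide(uint32_t dividend, uint32_t divisor,
                     uint32_t *quotient, uint32_t *remainder) {
      uint64_t rem = dividend;                 /* remainder := dividend */
      uint64_t div = (uint64_t)divisor << 32;  /* divisor at the MS end */
      uint32_t q = 0;
      for (int i = 0; i < 33; i++) {
          q <<= 1;
          if (rem >= div) {                    /* subtract if possible */
              rem -= div;
              q |= 1;                          /* quotient bit 1 */
          }
          div >>= 1;                           /* shift divisor right */
      }
      *quotient = q;
      *remainder = (uint32_t)rem;
  }

  int main(void) {
      uint32_t q, r;
      divide(22, 5, &q, &r);
      printf("22 / 5 = %u remainder %u\n", q, r);  /* 4 remainder 2 */
      return 0;
  }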
1998 Morgan Kaufmann Publishers
92
Division, First Version
[Figure: first version of the division hardware. A 64-bit Divisor register
(shift right), a 64-bit ALU, a 32-bit Quotient register (shift left), and
a 64-bit Remainder register, steered by a control unit that tests the sign
of each subtraction.]
1998 Morgan Kaufmann Publishers
93
Division, Final Version
• The same hardware can be used for multiply and divide.
  [Figure: final version of the division hardware. A 32-bit Divisor
  register, a 32-bit ALU, and a 64-bit Remainder register that shifts
  left and accumulates the quotient bits.]
1998 Morgan Kaufmann Publishers
94
Floating Point (a brief look)
• We need a way to represent
  – numbers with fractions, e.g., 3.1416
  – very small numbers, e.g., 0.000000001
  – very large numbers, e.g., 3.15576 × 10^9
• Representation:
  – sign, exponent, significand:
    (-1)^sign × significand × 2^exponent
  – more bits for the significand gives more accuracy
  – more bits for the exponent increases range
• IEEE 754 floating-point standard:
  – single precision: 8-bit exponent, 23-bit significand
  – double precision: 11-bit exponent, 52-bit significand
1998 Morgan Kaufmann Publishers
95
IEEE 754 floating-point standard
• Leading "1" bit of the significand is implicit
• Exponent is "biased" to make sorting easier
  – all 0s is the smallest exponent, all 1s is the largest
  – bias of 127 for single precision and 1023 for double precision
  – summary: (-1)^sign × (1 + significand) × 2^(exponent - bias)
• Example:
  – decimal: -0.75 = -3/4 = -3/2^2
  – binary: -0.11 = -1.1 × 2^-1
  – floating point: exponent = 126 = 0111 1110
  – IEEE single precision: 1 01111110 10000000000000000000000
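A quick C check (illustrative, not from the slides) that extracts the three
fields from the bit pattern of -0.75f:

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  int main(void) {
      float f = -0.75f;
      uint32_t bits;
      memcpy(&bits, &f, sizeof bits);  /* reinterpret the float's bits */
      /* Fields: sign (1 bit), exponent (8 bits), significand (23 bits). */
      printf("sign=%u exp=%u frac=0x%06X\n",
             bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF);
      /* Prints: sign=1 exp=126 frac=0x400000 */
      return 0;
  }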
1998 Morgan Kaufmann Publishers
96
Floating-point addition
1. Shift the significand of the number with the lesser exponent right
until the exponents match
2. Add the significands
3. Normalise the sum, checking for overflow or underflow
4. Round the sum
1998 Morgan Kaufmann Publishers
97
Floating-point addition
[Figure: flowchart of the floating-point addition algorithm and the
corresponding hardware.]
Floating-point multiplication
1. Add the exponents
2. Multiply the significands
3. Normalise the product, checking for overflow or underflow
4. Round the product
5. Find out the sign of the product
1998 Morgan Kaufmann Publishers
99
Floating Point Complexities
• Operations are somewhat more complicated (see text)
• In addition to overflow we can have "underflow"
• Accuracy can be a big problem
  – IEEE 754 keeps two extra bits during intermediate calculations,
    guard and round
  – four rounding modes
  – positive divided by zero yields "infinity"
  – zero divided by zero yields "not a number"
  – other complexities
• Implementing the standard can be tricky
1998 Morgan Kaufmann Publishers
100
Chapter Four Summary
• Computer arithmetic is constrained by limited precision
• Bit patterns have no inherent meaning, but standards do exist
  – two's complement
  – IEEE 754 floating point
• Computer instructions determine the "meaning" of the bit patterns
• Performance and accuracy are important, so there are many complexities
  in real machines (i.e., in algorithms and implementation)
• We are ready to move on (and implement the processor)
1998 Morgan Kaufmann Publishers
101