Instructions

Download Report

Transcript Instructions

Lecture 3 MIPS ISA and Performance
Peng Liu
[email protected]
1
MIPS Instruction Encoding
• MIPS instructions are encoded in different forms,
depending upon the arguments
– R-format, I-format, J-format
• MIPS architecture has three instruction formats, all 32 bits
in length
– Regularity is simpler and improves performance
• A 6 bit opcode appears at the beginning of each
instruction
– Control logic based on decode instruction type
2
Instruction Formats
• I-format: used for instructions with immediates, lw and sw
(since the offset counts as an immediate), and the branches
(beq and bne),
– (but not the shift instructions; later)
• J-format: used for j and jal
• R-format: used for all other instructions
• It will soon become clear why the instructions have been
partitioned in this way.
3
I-Format
4
I-Format Example
5
I-Format Example: Load/Store
6
J-Format (1/2)
• Define “fields” of the following number of bits each:
6 bits
26 bits
• As usual, each field has a name:
opcode
target address
• Key Concepts
– Keep opcode field identical to R-format and I-format for
consistency.
– Combine all other fields to make room for large target address.
7
J-Format (2/2)
• Summary:
– New PC = { PC[31..28], target address, 00 }
• Understand where each part came from!
• Note: In Verilog,
{ , , } means concatenation
{ 4 bits , 26 bits , 2 bits } = 32 bit address
– { 1010, 11111111111111111111111111, 00 } =
10101111111111111111111111111100
– We use Verilog in this class
8
R-Format (1/2)
• Define “fields” of the following number of bits each: 6 + 5 + 5
+ 5 + 5 + 6 = 32
6
5
5
5
5
6
shamt
funct
• For simplicity, each field has a name:
opcode
rs
rt
rd
9
R-Format (2/2)
• More fields:
– rs (Source Register): generally used to specify register
containing first operand
– rt (Target Register): generally used to specify register
containing second operand (note that name is misleading)
– rd (Destination Register): generally used to specify register
which will receive result of computation
10
R-Format Example
• MIPS Instruction:
add
$8,$9,$10
Decimal number per field representation:
0
9
10
8
0
32
Binary number per field representation:
000000 01001 01010 01000 00000 100000
hex representation:
decimal representation:
012A 4020hex
hex
19,546,144ten
On Green Card: Format in column 1, opcodes in column 3
11
MIPS I Operation Overview
• Arithmetic Logical:
– Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT,
SLTU
– AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI
– SLL, SRL, SRA, SLLV, SRLV, SRAV
• Memory Access:
– LB, LBU, LH, LHU, LW, LWL,LWR
– SB, SH, SW, SWL, SWR
12
MIPS logical instructions
Instruction
and
or
xor
nor
and immediate
or immediate
xor immediate
shift left logical
shift right logical
shift right arithm.
shift left logical
shift right logical
shift right arithm.
Example
and $1,$2,$3
or $1,$2,$3
xor $1,$2,$3
nor $1,$2,$3
andi $1,$2,10
ori $1,$2,10
xori $1, $2,10
sll $1,$2,10
srl $1,$2,10
sra $1,$2,10
sllv $1,$2,$3
srlv $1,$2, $3
srav $1,$2, $3
Meaning
$1 = $2 & $3
$1 = $2 | $3
$1 = $2 ^ $3
$1 = ~($2 |$3)
$1 = $2 & 10
$1 = $2 | 10
$1 = ~$2 &~10
$1 = $2 << 10
$1 = $2 >> 10
$1 = $2 >> 10
$1 = $2 << $3
$1 = $2 >> $3
$1 = $2 >> $3
Comment
3 reg. operands; Logical AND
3 reg. operands; Logical OR
3 reg. operands; Logical XOR
3 reg. operands; Logical NOR
Logical AND reg, constant
Logical OR reg, constant
Logical XOR reg, constant
Shift left by constant
Shift right by constant
Shift right (sign extend)
Shift left by variable
Shift right by variable
Shift right arith. by variable
Q: Can some multiply by 2i ? Divide by 2i ? Invert?
13
M I P S Reference Data :CORE INSTRUCTION SET (1)
NAME
MNEMON-IC
FORMAT
OPERATION (in Verilog)
OPCODE/FU
NCT (hex)
Add
add
R
R[rd] = R[rs] + R[rt] (1)
Add Immediate
addi
I
R[rt] = R[rs] +
SignExtImm (1)(2)
8hex
Branch On
Equal
beq
I
if(R[rs]==R[rt])
PC=PC+4+ BranchAddr
(4)
4hex
0 / 20hex
(1) May cause overflow exception
(2) SignExtImm = { 16{immediate[15]}, immediate }
(3) ZeroExtImm = { 16{1b’0}, immediate }
(4) BranchAddr = { 14{immediate[15]}, immediate, 2’b0}
14
MIPS Data Transfer Instructions
Instruction
sw 500($4), $3
sh 502($2), $3
sb 41($3), $2
Comment
Store word
Store half
Store byte
lw $1, 30($2)
lh $1, 40($3)
lhu $1, 40($3)
lb $1, 40($3)
lbu $1, 40($3)
Load word
Load halfword
Load halfword unsigned
Load byte
Load byte unsigned
lui $1, 40
Q: Why need lui?
Load Upper Immediate (16 bits shifted left by 16)
LUI
R5
R5
0000 … 0000
15
Multiply / Divide
• Start multiply, divide
–
–
–
–
MULT rs, rt
MULTU rs, rt
DIV rs, rt
DIVU rs, rt
• Move result from multiply, divide
Registers
– MFHI rd
– MFLO rd
• Move to HI or LO
– MTHI rd
– MTLO rd
HI
LO
16
MIPS Arithmetic Instructions
Instruction
add
subtract
add immediate
add unsigned
subtract unsigned
add imm. unsign.
multiply
multiply unsigned
divide
Example
add $1,$2,$3
sub $1,$2,$3
addi $1,$2,100
addu $1,$2,$3
subu $1,$2,$3
addiu $1,$2,100
mult $2,$3
multu$2,$3
div $2,$3
divide unsigned
divu $2,$3
Move from Hi
Move from Lo
mfhi $1
mflo $1
Meaning
$1 = $2 + $3
$1 = $2 – $3
$1 = $2 + 100
$1 = $2 + $3
$1 = $2 – $3
$1 = $2 + 100
Hi, Lo = $2 x $3
Hi, Lo = $2 x $3
Lo = $2 ÷ $3,
Hi = $2 mod $3
Lo = $2 ÷ $3,
Hi = $2 mod $3
$1 = Hi
$1 = Lo
Comments
3 operands; exception possible
3 operands; exception possible
+ constant; exception possible
3 operands; no exceptions
3 operands; no exceptions
+ constant; no exceptions
64-bit signed product
64-bit unsigned product
Lo = quotient, Hi = remainder
Unsigned quotient & remainder
Used to get copy of Hi
Used to get copy of Lo
Q: Which add for address arithmetic? Which add for integers?
17
Green Card: ARITHMETIC CORE INSTRUCTION SET (2)
NAME
MNEMON-IC
FORMAT
Branch On FP
True
bc1t
FI
Load FP Single
lwc1
div
Divide
OPERATION (in Verilog) OPCODE/FMT
/ FT/ FUNCT
(hex)
if (FPcond) PC=PC + 4
+
BranchAddr (4)
11/8/1/--
I
F[rt] = M[R[rs] +
SignExtImm] (2)
11/8/1/--
R
Lo=R[rs]/R[rt];
Hi=R[rs]%R[rt]
31/--/--/--
18
When does MIPS Sign Extend?
• When value is sign extended, copy upper bit to full value:
Examples of sign extending 8 bits to 16 bits:
00001010  00000000 00001010
10001100  11111111 10001100
• When is an immediate operand sign extended?
– Arithmetic instructions (add, sub, etc.) always sign extend immediates even for the unsigned versions of the
instructions!
– Logical instructions do not sign extend immediates (They are zero extended)
– Load/Store address computations always sign extend immediates
•
Multiply/Divide have no immediate operands however:
– “unsigned”  treat operands as unsigned
•
The data loaded by the instructions lb and lh are extended as follows (“unsigned”  don’t extend):
– lbu, lhu are zero extended
– lb, lh are sign extended
Q: Then what is does
add unsigned (addu) mean
since not immediate?
19
MIPS Compare and Branch
• Compare and Branch
– BEQ rs, rt, offset
– BNE rs, rt, offset
if R[rs] == R[rt] then PC-relative branch
<>
• Compare to zero and Branch
–
–
–
–
–
–
BLEZ rs, offset
BGTZ rs, offset
BLT
BGEZ
BLTZAL rs, offset
BGEZAL
if R[rs] <= 0 then PC-relative branch
>
<
>=
if R[rs] < 0 then branch and link (into R 31)
>=!
• Remaining set of compare and branch ops take two instructions
• Almost all comparisons are against zero!
20
MIPS
jump,
branch,
compare
Instructions
Instruction
Example
Meaning
branch on equal
branch on not eq.
beq $1,$2,100
if ($1 == $2) go to PC+4+100
Equal test; PC relative branch
bne $1,$2,100
if ($1!= $2) go to PC+4+100
Not equal test; PC relative
slt $1,$2,$3
if ($2 < $3) $1=1; else $1=0
set on less than
than; 2’s comp.
set less than imm.
slti $1,$2,100
if ($2 < 100) $1=1; else $1=0
constant; 2’s comp.
set less than uns.
sltu $1,$2,$3
if ($2 < $3) $1=1; else $1=0
than; natural numbers
set l. t. imm. uns.
sltiu $1,$2,100
if ($2 < 100) $1=1; else $1=0
constant; natural numbers
jump
j 10000
go to 10000
Jump to target address
jump register
jr $31
go to $31
For switch, procedure return
jump and link
jal 10000
$31 = PC + 4; go to 10000
For procedure call
Compare less
Compare <
Compare less
Compare <
21
Signed vs. Unsigned Comparison
$1= 0…00 0000 0000 0000 0001
$2= 0…00 0000 0000 0000 0010
$3= 1…11 1111 1111 1111 1111
• After executing these instructions:
slt $4,$2,$1 ; if ($2 < $1) $4=1; else $4=0
slt $5,$3,$1 ; if ($3 < $1) $5=1; else $5=0
sltu $6,$2,$1 ; if ($2 < $1) $6=1; else $6=0
sltu $7,$3,$1 ; if ($3 < $1) $7=1; else $7=0
• What are values of registers $4 - $7? Why?
$4 =
; $5 =
; $6 =
; $7 =
;
22
MIPS Assembler Register Convention
Name
Number Usage
Preserved across
a call?
the value 0
n/a
return values
no
arguments
no
temporaries
no
saved
yes
temporaries
no
stack pointer
yes
return address
yes
$zero
$v0-$v1
$a0-$a3
$t0-$t7
$s0-$s7
$t18-$t19
$sp
$ra
0
2-3
4-7
8-15
16-23
24-25
29
31
• “caller saved”
• “callee saved”
• On Green Card in Column #2 at bottom
23
Peer Instruction: $s3=i, $s4=j, $s5=@A
Loop: addiu
sll
addu
lw
addiu
slti
beq
slti
bne
$s4,$s4,1
$t1,$s3,2
$t1,$t1,$s5
$t0,0($t1)
$s3,$s3,1
$t1,$t0,10
$t1,$0, Loop
$t1,$t0, 0
$t1,$0, Loop
#
#
#
#
#
#
#
#
#
j = j + 1
$t1 = 4 * i
$t1 = @ A[i]
$t0 = A[i]
i = i + 1
$t1 = $t0 < 10
goto Loop
$t1 = $t0 < 0
goto Loop
do j = j + 1
while (______);
What C code properly fills in the blank in loop on right?
1:
2:
3:
4:
5:
6
A[i++]
A[i++]
A[i]
A[i++]
A[i]
>=
>=
>=
>=
>=
10
10
10
10
10
|
||
||
&&
A[i]
A[i++]
A[i]
A[i++]
<
<
<
<
0
0
0
0
None of the above
24
Green Card: OPCODES, BASE CONVERSION, ASCII (3)
MIPS
opcode
(31:26)
(1) MIPS
funct (5:0)
(2) MIPS
funct (5:0)
Binary
Decimal
Hexadeci-mal
ASCII
(1)
sll
add.f
00 0000
0
0
NUL
j
srl
mul.f
00 0010
2
2
STX
lui
sync
floor.w.f
00 1111
15
f
SI
lbu
and
cvt.w.f
10 0100
36
24
$
(1) opcode(31:26) == 0
(2) opcode(31:26) == 17 ten (11 hex );
if fmt(25:21)==16 ten (10 hex ) f = s (single);
if fmt(25:21)==17 ten (11 hex ) f = d (double)
Note: 3-in-1 - Opcodes, base conversion, ASCII!
26
Green Card
• green card /n./ [after the "IBM System/360
Reference Data" card] A summary of an assembly
language, even if the color is not green. For
example,
"I'll go get my green card so I can check the
addressing mode for that instruction."
www.jargon.net
Image from Dave's Green Card Collection:
http://www.planetmvs.com/greencard/
27
Peer Instruction
Which instruction has same representation as 35ten?
opcode rs
A. add $0, $0, $0
B. subu $s0,$s0,$s0
opcode rs
C. lw $0, 0($0)
opcode
rs
D. addi $0, $0, 35
E. subu $0, $0, $0
opcode
rs
F. Trick question!
rs
Instructions are not numbers opcode
Registers numbers and names:
0: $0, 8: $t0, 9:$t1, ..15: $t7, 16: $s0, 17: $s1, .. 23: $s7
Opcodes and function fields (if necessary)
rt
rd
shamt funct
rt
rd
shamt funct
rt
offset
rt
immediate
rt
rd
shamt funct
add: opcode = 0, funct = 32
subu: opcode = 0, funct = 35
addi: opcode = 8
lw: opcode = 35
28
Branch & Pipelines
Time
li $3, #7
sub $4, $4, 1
bz $4, LL
execute
ifetch
execute
ifetch
addi $5, $3, 1
LL: slt
$1, $3, $5
execute
ifetch
Branch
execute
Branch Target
ifetch
Delay Slot
execute
By the end of Branch instruction, the CPU knows whether or not
the branch will take place.
However, it will have fetched the next instruction by then,
regardless of whether or not a branch will be taken.
Why not execute it?
30
Delayed Branches
li
$3, #7
sub
$4, $4, 1
bz
$4, LL
addi $5, $3, 1
 Delay Slot Instruction
subi $6, $6, 2
LL: slt
$1, $3, $5
• In the “Raw” MIPS, the instruction after the branch is executed even when the branch is
taken
– This is hidden by the assembler for the MIPS “virtual machine”
– allows the compiler to better utilize the instruction pipeline (???)
• Jump and link (jal inst):
– Put the return addr. Into link register ($31):
• PC+4 (logical architecture)
• PC+8 physical (“Raw”) architecture  delay slot executed
– Then jump to destination address
31
Filling Delayed Branches
Branch:
Inst Fetch
Execute
Dcd & Op Fetch
Inst Fetch
execute successor
even if branch taken!
Then branch target
or continue
Dcd & Op Fetch
Execute
Inst Fetch
•Compiler can fill a single delay slot
with a useful instruction 50% of the
time.
• try to move down from above
jump
•move up from target, if safe
Single delay slot
impacts the critical path
add $3, $1, $2
sub $4, $4, 1
bz $4, LL
NOP
...
LL: add rd, ...
Is this violating the ISA abstraction?
32
Summary: Salient features of MIPS I
• 32-bit fixed format inst (3 formats)
• 32 32-bit GPR (R0 contains zero) and 32 FP registers (and HI LO)
– partitioned by software convention
• 3-address, reg-reg arithmetic instr.
• Single address mode for load/store: base+displacement
– no indirection, scaled
• 16-bit immediate plus LUI
• Simple branch conditions
– compare against zero or two registers for =,
– no integer condition codes
• Delayed branch
– execute instruction after a branch (or jump) even if the
branch is taken
(Compiler can fill a delayed branch with useful work about
50% of the time)
33
And in Conclusion...
• Continued rapid improvement in Computing
– 2X every 1.5 years in processor speed;
every 2.0 years in memory size;
every 1.0 year in disk capacity;
Moore’s Law enables processor, memory
(2X transistors/chip/ ~1.5 ro 2.0 yrs)
• 5 classic components of all computers
Control Datapath Memory Input Output
Processor
34
MIPS Machine Instruction Review:
Instruction Format Summary
35
Addressing Modes Summary
• Register addressing
– Operand is a register (e.g. ALU)
• Base/displacement addressing (ex. load/store)
– Operand is at the memory location that is the sum of
– a base register + a constant
• Immediate addressing (e.g. constants)
– Operand is a constant within the instruction itself
• PC-relative addressing (e.g. branch)
– Address is the sum of PC and constant in instruction (e.g.
branch)
• Pseudo-direct addressing (e.g. jump)
– Target address is concatenation of field in instruction and the
PC
36
Addressing Modes Summary
37
Break
38
Computer Performance Metrics
• Response Time (latency)
– How long does it take for my job to run?
– How long does it take to execute a job?
– How long must I wait for the database query?
• Throughput
– How many jobs can the machine run at once?
– What is the average execution rate?
– How many queries per minute?
• If we upgrade a machine with a new processor what to we
increase?
• If we add a new machine to the lab what do we increase?
39
Performance = Execution Time
• Elapsed Time
– Counts everything (disk and memory accesses, I/O, etc.)
– A useful number, but often not good for comparison
purposes
• E.g., OS & multiprogramming time make it difficult to compare CPUs
• CPU time (CPU = Central Processing Unit = processor)
– Doesn’t count I/O or time spent running other programs
– Can be broken up into system time, and user time
• Our focus: user CPU time
– Time spent executing the lines of code that are “in” our
program
– Includes arithmetic, memory, and control instructions, …
40
Clock Cycles
• Instead of reporting execution time in seconds, we often
use cycles
• Clock “ticks” indicate when to start activities
• Cycle time = time between ticks = seconds per cycle
• Clock rate (frequency) = cycles per second (1Hz. =
1cycle/sec)
– A 2Ghz clock has a 500 picoseconds (ps) cycle time.
41
Performance and Execution Time
• The program should be something real people care about
– Desktop: MS office, edit, compile
– Server: web, e-commerce, database
– Scientific: physics, weather forecasting
42
Measuring Clock Cycles
• Clock cycles/program is not an intuitive or easily
determined value, so
– Clock Cycles = Instructions x Clock Cycles Per
Instruction
• Cycles Per Instruction (CPI) used often
• CPI is an average since the number of cycles per
instruction varies from instruction to instruction
– Average depends on instruction mix, latency of each inst.
Type etc.
• CPIs can be used to compare two implementations of the
same ISA, but is not useful alone for comparing different
ISAs
– An X86 add is different from a MIPS add
43
Using CPI
• Drawing on the previous equation:
ExecutionTime  Instructions  CPI  ClockCycleTime
ExecutionTime 
Instructions  CPI
ClockRate
• To improve performance (i.e. reduce execution time)
– Increase clock rate (decrease clock cycle time) OR
– Decrease CPI OR
– Reduce the number of instructions
• Designers balance cycle time against the number of cycles required
– Improving one factor may make the other one worse
44
Clock Rate ≠ Performance
• Mobile Intel Pentium 4 Vs Intel Pentium M
– 2.4GHz
– P4 is 50% faster?
1.6GHz
• Performance on Mobilemark with same memory and disk
– Word, excel, photoshop, powerpoint, etc.
– Mobile Pentium 4 is only 15% faster
• What is the relative CPI?
– ExecTime=ICxCPI/ClockRate
– ICxCPIM/1.6 = 1.15xICxCPI4/2.4
– CPI4/CPIM=2.4/(1.15x1.6)=1.3
45
CPI Calculation
• Different instruction types require different numbers of cycles
• CPI is often reported for types of instructions
n
ClockCylces   (CPI i  ICi )
i 1
• where CPIi is the CPI for the type of instructions
and ICi is the count of that type of instruction
46
CPI Calculation
• To compute the overall average CPI use
InstructionCounti 

ClockCylces    CPI i 

InstructionCount 
i 1 
n
47
Computing CPI Example
• Given this machine, the CPI is the sum of CPI x Frequency
• Average CPI is 0.5 + 0.4 + 0.4 + 0.2 = 1.5
• What fraction of the time for data transfer?
48
What is the Impact of Displacement Based
Memory Addressing Mode?
• Assume 50% of MIPS loads and stores have a zero
displacement.
49
Speedup
• Speedup allows us to compare different CPUs or optimizations
CPUtimeOld
Speedup 
CPUtimeNew
• Example
– Original CPU takes 2sec to run a program
– New CPU takes 1.5sec to run a program
– Speedup = 1.333 or speedup or 33%
50
Amdahl’s Law
• If an optimization improves a fraction f of execution time by a factor of a
Told
1
Speedup 

[(1  f )  f / a]* Told (1  f )  f / a
• This formula is known as Amdahl’s Law
• Lessons from
– If f->100%, then speedup = a
– If a->∞, the speedup = 1/(1-f)
• Summary
– Make the common case fast
– Watch out for the non-optimized component
51
Evaluating Performance
• Performance best determined by running a real application
– Use programs typical of expected workload
• e.g. compiler/editors, scientific applications, graphics, etc.
• Microbenchmarks
– Small programs, synthetic or kernels from larger applications
– Nice for architects and designers
– Can be misleading
• Benchmarks
– Collection of real programs that companies have agreed on
– Components: programs, inputs & outputs, measurements rules,
metrics
– Can still be abused
52
Benchmarks
•
•
•
•
System Performance Evaluation Cooperative (SPEC)
Scientific computing: Linpack, SpecOMP, SpecHPC,…
Embedded benchmarks: EEMBC, Dhrystone, …
Enterprise computing
– TPC-C, TPC-W, TPC-H
– SpecJbb, SpecSFS, SpecMail, Streams,
• Multiprocessor: PARSEC, SPLASH-2, EEMBC (multicore)
• Other
– 3Dmark, ScienceMark, Winstone, iBench, AquaMark, …
• Watch out: your results will be as good as your benchmarks
– Make sure you know what the benchmark is designed to
measure
– Performance is not the only metric for computing systems
• Cost, power consumption, reliability, real-time performance, …
55
Summarizing Performance
• Combining results from
multiple programs into 1
benchmark score
– Sometimes misleading,
always controversial, … and
inevitable
– We all like quoting a single
number
• 3 types of means
– Arithmetic: for times
– Harmonic: for rates
– Geometric: for rations
1 n
AM   Weighti  Timei
n i i
n
HM  n
Weighti 

Ratei
i 1
 n

GM    Ratioi 
 i 1

1
 
n
56
Using the Means
• Arithmetic mean:
– When you have individual performance scores in latency
• Harmonic mean:
– When you have individual performance scores in throughput
• Geometric mean:
– Nice property: GM(X/Y)= GM(X)/GM(Y)
– But difficult to related back to execution times
• Note
– Always look at the results for individual programs
57
Performance Summary
• Performance is specific to a particular programs
– Total execution time is a consistent summary of performance
• For a given architecture performance increases come
from:
– Increase in clock rate (without adverse CPI effects)
– Improvements in processor organization that lower CPI
– Compiler enhancements that lower CPI and/or instruction
count
– Algorithm/Language choices that affect instruction count
• Pitfall: expecting improvement in one aspect of a
machine’s performance to affect the total performance
58
Break
59
Translation Hierarchy
• High-level->Assembly->Machine
60
Compiler
Converts High-level Language to Machine Code
61
Intermediate Representation (IR)
• Three-address code (TAC)
t1= 2 * I
j= 2 * I + 1;
If (j >= n)
j = 2*I + 3;
return a[j];
t2 = t1 + 1
j = t2
t3 = j < n
If t3 goto L0
t4 = 2 * I
t5 = t4 + 3
j = t5
L0: t6 = a[j]
return t6
Single assignment
62
Optimizing Compilers
• Provide efficient mapping of program to machine
– Code selection and ordering
– Eliminating minor inefficiencies
– Register allocation
• Don’t (usually) improve asymptotic efficiency
– Up to programmer to select best overall algorithm
– Big-O savings are (often) more important than constant
factors
• But constant factors also matter
• Optimization types
– Local: inside basic blocks
– Global: across basic blocks e.g. loops
63
Limitations of Optimizing Compilers
• Operate Under Fundamental Constraints
– Improved code has same output
– Improved code has same side effects
– Improved code is as correct as original code
• Most analysis is performed only within procedures
– Whole-program analysis is too expensive in most cases
• Most analysis is based only on static information
– Compiler has difficulty anticipating run-time inputs
– Profile driven dynamic compilation becoming more prevalent
– Dynamic compilation (e.g. Java JIT)
• When in doubt, the compiler must be conservative
64
Preview: Code Optimization
• Goal: Improve performance by:
– Removing redundant work
• Unreachable code
• Common-subexpression elimination
• Induction variable elimination
– Using simpler and faster operations
• Dealing with constants in the compiler
• Strength reduction
– Managing register well
ExecutionTime = Instructions x CPI x ClockCycleTime
65
Constant Folding
• Detect & combine values that will be constant.
• Respect laws of arithmetic on target machine.
a = 10 * 5 + 6 – b;
66
Constant Propagation
• When x gets assigned a constant c, replace users of x
with uses of c, as long as x remains unchanged.
• Very useful for turning constants into immediates, for
code generator.
67
Constant Propagation
68
Reduction in Strength
• Some operations are faster (lower CPI) than others:
–
–
–
–
Multiplication slower than addition
Mult or Divide by 2n slower than shift left or right
Multiplies take 12 cycle on MIPS R2000
ALU instruction take 1 cycle
• Can replace complex instruction with equivalent simple
instructions
• Correctness:
– Overflow behavior preserved?
– Other side-effects of operation? (interrupts)
69
Reduction in Strength
• How to replace multiplication with addition?
• Often presents in for loops & array walks.
while (i<100) arr[i++] = 0;
70
Common Sub-Expression Elimination
71
Common Sub-Expression Elimination
72
Induction Variable Elimination
while (i<100) arr [i++] = 0;
• Remember replacing multiply with add in loop?
• Why not get rid of induction variable I altogether?
• Induction variable is a variable that is changed by a
constant amount on every iteration of a loop.
73
Register Allocation & Assignment
• Register provide fastest data access in the processor.
• Two terms:
– What variables belong in register? (allocation)
– What register will hold this value? (assignment)
• Once a value is in a register, we’d like not to spill &
restore it to use same value again.
74
Assembly Code Generation
75
Compiler Performance on Bubble Sort
76
Assembler
• Expands macros and pseudo instructions as well as
converts constants.
• Primary purpose is to produce an object file
– Machine language instructions
– Application data
– Information for memory organization
77
Object File
78
Linker
• Linker combines multiple object modules
– Identify where code/data will be placed in memory
– Resolve code/data cross references
– Produces executable if all references found
• Steps
– Place code and data modules in memory
– Determine the address of data and instruction labels
– Patch both the internal and external references
• Separation between compiler and linker makes standard
libraries and efficient solution to maintaining modular
code.
79
Loader
• Loader used at run-time
–
–
–
–
–
–
–
Reads executable file header for size of text/data segments
Create address space sufficiently large
Copy program from executable on disk into memory
Copy arguments to main program’s stack
Initialize machine registers and set stack pointer
Jump to start-up routine
Terminate program when execution completes
80
SPIM System Calls
• SPIM provides a small number of OS “system calls”
Service
Code
Argument
Result
print_int
1
$a0 = integer
print_float
2
$f12 = float
print_double
3
$f12 = double
print_string
4
$a0 = string
read_int
5
Integer in $v0
read_float
6
Float if $f0
read_double
7
Double in $f0
read_string
8
$a0 = buffer, $a1 = length
Data in buffer
Sbrk
9
$a0 = amount
Address in $v0
Exit
10
• Set system call code in $v0 and pass arguments as needed
• See web site for details and examples.
81
MIPS Assembler Directives
• SPIM supports a subset of the MIPS assembler directives
• Some of the directives include
–
–
–
–
–
.asciiz – Store a null-terminated string in memory
.data –Start of data segment
.global –Identify an exported symbol
.text –Start of text segment
.word –Store words in memory
82
HomeWork
• HW1
– http://mypage.zju.edu.cn/liupeng
83
Acknowledgements
• These slides contain material from courses:
– UCB CS152
– Stanford EE108B
84