Transcript Lecture 3

CS 152: Computer Architecture and Engineering
Lecture 3
Performance, Technology & Delay Modeling
Modified From the Lectures of Randy H. Katz
UC Berkeley
©UCB Fall 2002
CS 152
Lec 3.1
Review: Salient features of MIPS I
• 32-bit fixed format inst (3 formats)
• 32 32-bit GPR (R0 contains zero) and 32 FP registers
(+ HI LO)
– Partitioned by software convention
• 3-address, reg-reg arithmetic instr.
• Single address mode for load/store: base+displacement
– No indirection or scaled addressing
• 16-bit immediate plus LUI
• Simple branch conditions
– Compare against zero or two registers for =, ≠
– No integer condition codes
• Support for 8 bit, 16 bit, and 32 bit integers
• Support for 32 bit and 64 bit floating point
Lec 3.2
Review: MIPS Addr Modes/Instruction Formats
• All instructions 32 bits wide
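[Figure: the four addressing modes and the instruction fields each one uses]
• Register (direct): op, rs, rt, rd; the operand is in a register
• Immediate: op, rs, rt, immed; the operand is the immed field
• Base+index: op, rs, rt, immed; Memory address = register + immed
• PC-relative: op, rs, rt, immed; Memory address = PC + immed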
Lec 3.3
Review: When Does MIPS Sign Extend?
• When a value is sign extended, its upper (sign) bit is copied into all of the new upper bits:
Examples of sign extending 8 bits to 16 bits:
00001010 => 00000000 00001010
10001100 => 11111111 10001100
• When is an immediate value sign extended?
– Arithmetic instructions (add, sub, etc.) sign extend immediates, even for the unsigned versions of the instructions!
– Logical instructions do not sign extend
• Examples:
– addi $r2, $r3, -1 has 0xFFFF in the immediate field and will extend to 0xFFFFFFFF before adding
– andi $r2, $r3, -1 has 0xFFFF in the immediate field and will extend to 0x0000FFFF before anding
– Kinda weird to put negative numbers in logical instructions
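The two extension rules above can be sketched in a few lines of Python. This is only an illustration: the helper names sign_extend and zero_extend are made up for this note, and values are kept as unsigned 32-bit integers.

```python
def sign_extend(value: int, bits: int = 16) -> int:
    """Sign-extend the low `bits` bits of value to 32 bits (unsigned result)."""
    value &= (1 << bits) - 1               # keep only the field itself
    if value & (1 << (bits - 1)):          # sign bit set: fill upper bits with 1s
        value |= (0xFFFFFFFF << bits) & 0xFFFFFFFF
    return value


def zero_extend(value: int, bits: int = 16) -> int:
    """Zero-extend the low `bits` bits of value to 32 bits."""
    return value & ((1 << bits) - 1)


imm = -1 & 0xFFFF                          # the 16-bit immediate field for -1
print(f"addi sees 0x{sign_extend(imm):08X}")   # arithmetic: sign extended
print(f"andi sees 0x{zero_extend(imm):08X}")   # logical: zero extended

# The 8-bit to 16-bit examples from the slide
for byte in (0b00001010, 0b10001100):
    print(f"{byte:08b} => {sign_extend(byte, 8) & 0xFFFF:016b}")
```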
Lec 3.4
Review: Details of the MIPS Instruction Set
• Register zero always has the value zero (even if you try to write it)
• Branch/jump and link put the return addr.
PC+4 into the link register (R31), also called “ra”
• All instructions change all 32 bits of the destination register
(including lui, lb, lh) and all read all 32 bits of sources (add, and, …)
• Difference between signed and unsigned versions:
– For add and subtract: signed causes exception on overflow
» No difference in sign-extension behavior!
– For multiply and divide, distinguishes type of operation
• Thus, overflow can occur only in these arithmetic instructions:
– add, sub, addi
– it cannot occur in addu, subu, addiu, and, or, xor, nor, shifts, mult, multu, div, divu
• Immediate arithmetic & logical instructions are extended as follows:
– logical immediate ops are zero extended to 32 bits
– arithmetic immediate ops are sign extended to 32 bits (including addiu)
• Data loaded by the instructions lb and lh are extended as follows:
– lbu, lhu are zero extended
– lb, lh are sign extended
Lec 3.5
Performance
• Purchasing perspective
– Given a collection of machines, which has the
» Best performance ?
» Least cost ?
» Best performance / cost ?
• Design perspective
– Faced with design options, which has the
» Best performance improvement ?
» Least cost ?
» Best performance / cost ?
• Both require
– basis for comparison
– metric for evaluation
• Our goal: understand cost & performance implications
of architectural choices
Lec 3.6
Two Notions of “Performance”
Plane              DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747         6.5 hours     610 mph    470          286,700
BAC/Sud Concorde   3 hours       1350 mph   132          178,200
Which has higher performance?
• Time to do the task (Execution Time)
– execution time, response time, latency
• Tasks per day, hour, week, sec, ns, ... (Performance)
– throughput, bandwidth
Response time and throughput often are in opposition
Lec 3.7
Definitions
• Performance is in units of things-per-second
– bigger is better
• If we are primarily concerned with response time
– performance(x) = 1 / execution_time(x)
• "X is n times faster than Y" means
– n = Performance(X) / Performance(Y)
Lec 3.8
Example
• Time of Concorde vs. Boeing 747?
– Concorde is 1350 mph / 610 mph = 2.2 times faster = 6.5 hours / 3 hours
• Throughput of Concorde vs. Boeing 747?
– Concorde is 178,200 pmph / 286,700 pmph = 0.62 “times faster”
– Boeing is 286,700 pmph / 178,200 pmph = 1.60 “times faster”
• Boeing is 1.6 times (“60%”) faster in terms of throughput
• Concorde is 2.2 times (“120%”) faster in terms of flying time
We will focus primarily on execution time for a single job
Lots of instructions in a program => Instruction thruput important!
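The comparison can be reproduced with a short Python sketch. The numbers come from the plane table; the function and dictionary names are illustrative only. Note that the table's throughput column is simply speed times passengers.

```python
def times_faster(perf_x: float, perf_y: float) -> float:
    """Return n such that X is n times faster than Y: n = perf(X) / perf(Y)."""
    return perf_x / perf_y


# Figures from the table: DC-to-Paris time, speed, passengers.
planes = {
    "Boeing 747": {"hours": 6.5, "mph": 610, "passengers": 470},
    "Concorde": {"hours": 3.0, "mph": 1350, "passengers": 132},
}


def latency_performance(plane: dict) -> float:
    """Performance for response time: 1 / execution_time."""
    return 1.0 / plane["hours"]


def throughput_pmph(plane: dict) -> float:
    """Throughput in passenger-miles per hour."""
    return plane["mph"] * plane["passengers"]


boeing, concorde = planes["Boeing 747"], planes["Concorde"]
by_time = times_faster(latency_performance(concorde), latency_performance(boeing))
by_throughput = times_faster(throughput_pmph(boeing), throughput_pmph(concorde))
print(f"Concorde is {by_time:.2f} times faster in flying time")
print(f"Boeing is {by_throughput:.2f} times faster in throughput")
```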
Lec 3.9
Basis of Evaluation
• Actual Target Workload
– Pros: representative
– Cons: very specific; non-portable; difficult to run, or measure; hard to identify cause
• Full Application Benchmarks
– Pros: portable; widely used; improvements useful in reality
– Cons: less representative
• Small “Kernel” Benchmarks
– Pros: easy to run, early in design cycle
– Cons: easy to “fool”
• Microbenchmarks
– Pros: identify peak capability and potential bottlenecks
– Cons: “peak” may be a long way from application performance
Lec 3.10
SPEC95
• Eighteen application benchmarks (with
inputs) reflecting a technical computing
workload
• Eight integer
– go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
• Ten floating-point intensive
– tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
• Must run with standard compiler flags
– eliminate special undocumented incantations that
may not even generate working code for real
programs
Lec 3.11
Metrics of Performance
[Figure: levels of the system, top to bottom: Application; Programming Language; Compiler; ISA; Datapath and Control; Function Units; Transistors, Wires, Pins. Metrics used at the different levels:]
• Seconds per program
• Useful Operations per second
• (millions) of Instructions per second – MIPS
• (millions) of (F.P.) operations per second – MFLOP/s
• Megabytes per second
• Cycles per second (clock rate)
Each metric has a place and a purpose, and each can be misused
Lec 3.12
CPI
“Average cycles per instruction”
CPI_ave = (CPU Time * Clock Rate) / Instruction Count = Clock Cycles / Instruction Count
CPU time = ClockCycleTime * \sum_{i=1}^{n} (CPI_i * I_i)
CPI = \sum_{i=1}^{n} (CPI_i * F_i), where F_i = I_i / Instruction Count is the "instruction frequency"
Invest Resources where time is Spent!
Lec 3.13
Aspects of CPU Performance
CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)
[Table: rows Program, Compiler, Instr. Set, Organization, Technology; columns instr count, CPI, clock rate; marks which of the three factors each row affects]
Lec 3.14
Amdahl's Law
Speedup due to enhancement E:
Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
ExTime(with E) = ((1-F) + F/S) x ExTime(without E)
Speedup(with E) = 1 / ((1-F) + F/S)
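A minimal Python sketch of the speedup formula follows. The function name and the sample value F = 0.5 are illustrative, not from the lecture; the loop shows how the speedup saturates near 1 / (1-F) however large S becomes.

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Speedup(with E) = 1 / ((1-F) + F/S)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction F must be between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor S must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Hypothetical case: half of the task is accelerated by a growing factor S.
for s in (2, 10, 1000):
    print(f"F = 0.5, S = {s}: speedup = {amdahl_speedup(0.5, s):.3f}")
```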
Lec 3.15
Example (RISC Processor)
Base Machine (Reg / Reg), Typical Mix
Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        .5       23%
Load     20%    5        1.0      45%
Store    10%    3        .3       14%
Branch   20%    2        .4       18%
Total                    2.2
How much faster would the machine be if a better data cache
reduced the average load time to 2 cycles?
How does this compare with using branch prediction to shave a
cycle off the branch time?
What if two ALU instructions could be executed at once?
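One way to work through these three questions is to recompute the average CPI for each change and compare it with the base value of 2.2. The sketch below assumes that instruction count and clock rate stay the same, so speedup is just the ratio of CPIs, and it models "two ALU instructions at once" as an effective 0.5 cycles per ALU instruction. Both are assumptions of this note.

```python
# Instruction mix from the table: op -> (frequency, cycles)
BASE_MIX = {
    "ALU": (0.50, 1.0),
    "Load": (0.20, 5.0),
    "Store": (0.10, 3.0),
    "Branch": (0.20, 2.0),
}


def average_cpi(mix: dict) -> float:
    """CPI = sum over instruction classes of CPI_i * F_i."""
    return sum(freq * cycles for freq, cycles in mix.values())


def with_cycles(mix: dict, op: str, cycles: float) -> dict:
    """Return a copy of the mix with a new cycle count for one class."""
    changed = dict(mix)
    freq, _ = changed[op]
    changed[op] = (freq, cycles)
    return changed


def speedup(old_mix: dict, new_mix: dict) -> float:
    """Speedup at fixed instruction count and clock rate: CPI_old / CPI_new."""
    return average_cpi(old_mix) / average_cpi(new_mix)


scenarios = {
    "load in 2 cycles": with_cycles(BASE_MIX, "Load", 2.0),
    "branch in 1 cycle": with_cycles(BASE_MIX, "Branch", 1.0),
    "two ALU ops per cycle": with_cycles(BASE_MIX, "ALU", 0.5),
}

print(f"base CPI = {average_cpi(BASE_MIX):.2f}")
for name, mix in scenarios.items():
    print(f"{name}: CPI = {average_cpi(mix):.2f}, "
          f"speedup = {speedup(BASE_MIX, mix):.3f}")
```

Under these assumptions the data cache change helps most, because loads account for the largest share of the time.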
Lec 3.16
Summary: Evaluating Instruction Sets and Implementation
• Design-time metrics:
– Can it be implemented, in how long, at what cost?
– Can it be programmed? Ease of compilation?
• Static Metrics:
– How many bytes does the program occupy in memory?
• Dynamic Metrics:
– How many instructions are executed?
– How many bytes does the processor fetch to execute the program?
– How many clocks are required per instruction?
– How "lean" a clock is practical?
• Best Metric: Time to execute the program!
NOTE: Depends on instruction set, processor organization, and compilation techniques.
[Figure: Inst. Count, CPI, Cycle Time]
Lec 3.17
Administrative Matters
• Course accounts available from the TAs
• Lab #2 posted, due in 1.5 weeks—individual work on
this one; submission by midnight
Lec 3.18
Finite State Machines:
• System state is explicit in representation
• Transitions between states represented as arrows with inputs on
arcs
• Output may be either part of state or on arcs
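[Figure: “Mod 3 Machine” state diagram with states Alpha/0, Beta/1, Delta/2 and arcs labeled with input bits 0 and 1. Example: the input 106 = 1101010 (MSB first) gives running Mod 3 values 1, 0, 0, 1, 2, 2, 1.]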
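The Mod 3 machine can be sketched in Python as follows. Reading a bit b while the remainder so far is r gives a new remainder (2r + b) mod 3, so three states are enough. The function names are illustrative; the state names Alpha, Beta, Delta are taken from the figure and stand for remainders 0, 1, 2.

```python
STATE_NAMES = {0: "Alpha", 1: "Beta", 2: "Delta"}  # state -> output (remainder)


def mod3_trace(bits: str) -> list[int]:
    """Feed a binary string (MSB first) to the FSM; return the state after each bit."""
    state = 0                      # start in Alpha (remainder 0)
    trace = []
    for bit in bits:
        if bit not in ("0", "1"):
            raise ValueError(f"not a binary digit: {bit!r}")
        # Shifting in one bit doubles the value seen so far and adds the bit.
        state = (2 * state + int(bit)) % 3
        trace.append(state)
    return trace


def mod3(bits: str) -> int:
    """Final output of the machine: the value of `bits` modulo 3."""
    trace = mod3_trace(bits)
    return trace[-1] if trace else 0


bits = format(106, "b")            # "1101010"
print(bits, mod3_trace(bits), STATE_NAMES[mod3(bits)])
```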
Lec 3.19
Implementation as Combinational Logic + Latch
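[Figure: a Latch holds the state and feeds Combinational Logic, which computes the next state. “Moore Machine”: states Alpha/0, Beta/1, Delta/2. “Mealy Machine”: arcs labeled input/output (0/0, 0/1, 1/0, 1/1).]

Input   State old   State new   Div
0       00          00          0
0       01          10          0
0       10          01          1
1       00          01          0
1       01          00          1
1       10          10          1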
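The combinational logic plus latch can be mimicked with a lookup table. The sketch below encodes the six rows of the state table (Input, State old) -> (State new, Div). It then treats the Div outputs, collected MSB first, as the bits of the quotient. That reading of the Div column is an interpretation made for this note; it checks out against ordinary division.

```python
# (input bit, old state) -> (new state, Div output), states encoded 00, 01, 10
NEXT = {
    (0, 0b00): (0b00, 0),
    (0, 0b01): (0b10, 0),
    (0, 0b10): (0b01, 1),
    (1, 0b00): (0b01, 0),
    (1, 0b01): (0b00, 1),
    (1, 0b10): (0b10, 1),
}


def divide_by_3(n: int) -> tuple[int, int]:
    """Run the Mealy machine over n's bits (MSB first); return (quotient, remainder)."""
    state = 0b00                   # the latch starts in state 00
    quotient = 0
    for bit in format(n, "b"):
        state, div = NEXT[(int(bit), state)]   # combinational logic
        quotient = (quotient << 1) | div       # output on the arc
    return quotient, state


print(divide_by_3(106))
```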
Lec 3.20
Performance and Technology Trends
[Figure: Performance (log scale, 0.1 to 1000) versus Year (1965 to 2000) for Supercomputers, Mainframes, Minicomputers, and Microprocessors]
• Technology Power: 1.2 x 1.2 x 1.2 = 1.7 x / year
– Feature Size: shrinks 10% / yr. => Switching speed improves 1.2 / yr.
– Density: improves 1.2x / yr.
– Die Area: 1.2x / yr.
• Lesson of RISC is to keep the ISA as simple as possible:
– Shorter design cycle => fully exploit the advancing technology (~3yr)
– Advanced branch prediction and pipeline techniques
– Bigger and more sophisticated on-chip caches
Lec 3.21
Range of Design Styles
[Figure: three design styles compared along axes of Performance and Design Complexity (Design Time): Custom Design (Custom Control Logic, Custom ALU, Custom Register File), Standard Cell (Standard ALU, Standard Registers, Gates, Routing Channel), and Gate Array/FPGA/CPLD (Gates, Routing Channel). The range runs from “Compact” to “Longer wires”.]
Lec 3.22
Basic Technology: CMOS
• CMOS: Complementary Metal Oxide Semiconductor
– NMOS (N-Type Metal Oxide Semiconductor) transistors
– PMOS (P-Type Metal Oxide Semiconductor) transistors
• Vdd = 5V, GND = 0V
• NMOS Transistor
– Applying a HIGH (Vdd) to its gate turns the transistor into a “conductor”
– Applying a LOW (GND) to its gate shuts off the conduction path
• PMOS Transistor
– Applying a HIGH (Vdd) to its gate shuts off the conduction path
– Applying a LOW (GND) to its gate turns the transistor into a “conductor”
Lec 3.23
Basic Components: CMOS Inverter
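[Figure: inverter Symbol (In, Out) and Circuit: a PMOS and an NMOS transistor in series between Vdd and ground, both gates driven by In, with Out taken between them]
• Inverter Operation
[Figure: Vout versus Vin. With In low, the NMOS is Open and the PMOS charges Out to Vdd; with In at Vdd, the PMOS is Open and the NMOS discharges Out]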
Lec 3.24
Basic Components: CMOS Logic Gates
NAND Gate
A   B   Out
0   0   1
0   1   1
1   0   1
1   1   0

NOR Gate
A   B   Out
0   0   1
0   1   0
1   0   0
1   1   0

[Figure: transistor-level circuits of the NAND Gate and the NOR Gate, with inputs A and B, output Out, and supply Vdd]
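A switch-level sketch in Python shows why these truth tables follow from the transistor rules: an NMOS conducts when its gate is HIGH, a PMOS when its gate is LOW. The code assumes the usual CMOS arrangement, with a NAND using parallel PMOS to Vdd and series NMOS to ground, and a NOR using the opposite. All function names are illustrative.

```python
def nmos_conducts(gate: int) -> bool:
    """NMOS conducts when its gate is HIGH (1)."""
    return gate == 1


def pmos_conducts(gate: int) -> bool:
    """PMOS conducts when its gate is LOW (0)."""
    return gate == 0


def resolve(pull_up: bool, pull_down: bool) -> int:
    """Exactly one network may conduct; otherwise Out floats or is shorted."""
    if pull_up == pull_down:
        raise ValueError("output is floating or shorted")
    return 1 if pull_up else 0


def inverter(a: int) -> int:
    return resolve(pmos_conducts(a), nmos_conducts(a))


def nand2(a: int, b: int) -> int:
    pull_up = pmos_conducts(a) or pmos_conducts(b)      # PMOS in parallel
    pull_down = nmos_conducts(a) and nmos_conducts(b)   # NMOS in series
    return resolve(pull_up, pull_down)


def nor2(a: int, b: int) -> int:
    pull_up = pmos_conducts(a) and pmos_conducts(b)     # PMOS in series
    pull_down = nmos_conducts(a) or nmos_conducts(b)    # NMOS in parallel
    return resolve(pull_up, pull_down)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "NAND:", nand2(a, b), "NOR:", nor2(a, b))
```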
Lec 3.25
Voltage Waveforms versus Time
[Figure: Voltage versus Time for Vin and Vout of an inverter, swinging between 0 => GND and 1 => Vdd]
Lec 3.26
Series Connection
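[Figure: two inverters in series, Vin -> G1 -> V1 -> G2 -> Vout, with capacitance C1 on node V1 and Cout on Vout; Voltage versus Time for Vin, V1, and Vout, with delays d1 and d2 measured at Vdd/2]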
• Total Propagation Delay = Sum of individual delays = d1 + d2
• Capacitance C1 has two components:
– Capacitance of the wire connecting the two gates
– Input capacitance of the second inverter
Lec 3.27
Gate Comparison
[Figure: transistor-level NAND Gate and NOR Gate circuits with inputs A, B, output Out, and supply Vdd]
• PMOS are 3 times slower than NMOS (3 times higher
resistance) so if all devices are the same size then a
NAND Low to High will be
• Better to put NMOS transistors in series
Lec 3.28
Calculating Delays
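[Figure: gate G1 drives node V1 (capacitance C1), which fans out to G2 (output V2) and G3 (output V3); waveforms of Vin, V1, V2, and V3 between GND and Vdd]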
• Sum delays along serial paths
• Delay (Vin -> V2) != Delay (Vin -> V3)
– Delay (Vin -> V2) = Delay (Vin -> V1) + Delay (V1 -> V2)
– Delay (Vin -> V3) = Delay (Vin -> V1) + Delay (V1 -> V3)
• Critical Path = The longest delay path (Vin->V3)
• C1 = Wire Capacitance + Cin of Gate 2 + Cin of Gate 3
Lec 3.29
General C/L Cell Delay Model
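[Figure: Combinational Logic Cell with inputs A, B, ..., X and output Vout loaded by Cout. Plot of Delay (Va -> Vout) versus Cout: a straight line whose intercept is the Internal Delay and whose slope is the delay per load (Nanoseconds/femtoFarad = ns/fF), valid up to Ccritical]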
• Combinational Cell (symbol) is fully specified by:
– Functional (input -> output) behavior
» Truth-table, logic equation, VHDL
– Load at each input
– Critical prop delay from each input to each output for each
transition
» THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load
capacitance
– Linear model is good enough up to Ccritical
Lec 3.30
Characterize a Gate
• Input capacitance for each input
• For each input-to-output path:
– For each output transition (H->L, L->H)
» Internal delay (ns) - e.g. for low to high from A to O: TPAOlh
» Load dependent delay (ns / fF) - e.g. TPAOlhf
• Example: 2-input NAND Gate
For A and B: Input Load = 61 fF
For A -> O:
– TPAOlh = 0.5ns, TPAOlhf = 0.0021ns / fF
– TPAOhl = 0.1ns, TPAOhlf = 0.0020ns / fF
[Figure: 2-input NAND with inputs A, B and output O; plot of Delay A -> O (O: Low -> High) versus CO, with intercept 0.5ns and Slope = 0.0021ns / fF]
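The linear delay model with these NAND numbers can be written as a small Python sketch. The class and variable names are illustrative, and the 122 fF load is a hypothetical example (two NAND inputs' worth of capacitance), not a value from this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransitionDelay:
    """Delay = internal delay + load-dependent delay x load capacitance."""
    internal_ns: float        # e.g. TPAOlh
    per_ff_ns: float          # e.g. TPAOlhf, in ns / fF

    def delay_ns(self, load_ff: float) -> float:
        if load_ff < 0:
            raise ValueError("load capacitance cannot be negative")
        return self.internal_ns + self.per_ff_ns * load_ff


# 2-input NAND, path A -> O, one entry per output transition
NAND2_A_TO_O = {
    "lh": TransitionDelay(internal_ns=0.5, per_ff_ns=0.0021),
    "hl": TransitionDelay(internal_ns=0.1, per_ff_ns=0.0020),
}

load_ff = 122.0               # hypothetical output load in fF
for name, transition in NAND2_A_TO_O.items():
    print(f"A -> O ({name}) at {load_ff} fF: {transition.delay_ns(load_ff):.4f} ns")
```

The model only holds in the linear region, that is for loads up to Ccritical, which this sketch does not check.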
Lec 3.31
A Specific Example: 2 to 1 MUX
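[Figure: 2 x 1 Mux symbol with inputs A, B, select S, and output Y. Gate-level circuit: S drives an inverter, whose output reaches Gate 1 over Wire 0; Gate 1 (input A) and Gate 2 (inputs B and S) drive Gate 3 over Wire 1 and Wire 2; Gate 3 drives Y]
Y = (A and !S) or (B and S)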
Inv: I.L. = 50 fF; T I.D.= .01 ns; T L.D.D. = .01 ns/fF
• Input Load (I.L.)
– A, B: I.L. (NAND) = 61 fF
– S: I.L. (INV) + I.L. (NAND) = 50 fF + 61 fF = 111 fF
• Load Dependent Delay (L.D.D.): Set by Gate 3
– TAYlhf = 0.0021 ns / fF, TAYhlf = 0.0020 ns / fF
– TBYlhf = 0.0021 ns / fF, TBYhlf = 0.0020 ns / fF
– TSYlhf = 0.0021 ns / fF, TSYhlf = 0.0020 ns / fF
Lec 3.32
2 to 1 MUX: Internal Delay Calculation
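[Figure: 2 to 1 MUX gate-level circuit: inverter and Wire 0 from S to Gate 1; Gate 1 (input A) and Gate 2 (input B) driving Gate 3 over Wire 1 and Wire 2]
Y = (A and !S) or (B and S)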
L.D.D. = Load Dependent Delay
I.D. = Internal Delay
• Internal Delay (I.D.):
– A to Y: I.D. G1 + (Wire_1_C + G3_Input_C) * L.D.D G1 + I.D. G3
– B to Y: I.D. G2 + (Wire_2_C + G3_Input_C) * L.D.D. G2 + I.D. G3
– S to Y (Worst Case) : I.D. Inv + (Wire_0_C + G1_Input_C) * L.D.D. Inv
+ Internal Delay A to Y
• We can approximate the size of “Wire_1_C” by:
– Assume Wire 1 has the same C as the total input capacitance of all the gates attached to it (G3 in this case).
– Therefore the total C that Gate 1 needs to drive = 2.0 x G3_Input_C
Lec 3.33
2 to 1 MUX: Internal Delay Calculation (continue)
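[Figure: 2 to 1 MUX gate-level circuit: inverter and Wire 0 from S to Gate 1; Gate 1 (input A) and Gate 2 (input B) driving Gate 3 over Wire 1 and Wire 2]
Y = (A and !S) or (B and S)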
• Internal Delay (I.D.):
– A to Y: I.D. G1 + (Wire 1 C + G3 Input C) * L.D.D G1 + I.D. G3
– B to Y: I.D. G2 + (Wire 2 C + G3 Input C) * L.D.D. G2 + I.D. G3
– S to Y (Worst Case): I.D. Inv + (Wire 0 C + G1 Input C) * L.D.D. Inv +
Internal Delay A to Y
• Specific Example:
– TAYlh = TPhl G1 + (2.0 * 61 fF) * TPhlf G1 + TPlh G3
= 0.1ns + 122 fF * 0.0020 ns/fF + 0.5ns = 0.844 ns
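The A-to-Y calculation can be checked with a short Python sketch. It uses the 2-input NAND figures (61 fF input load, 0.5 ns / 0.1 ns internal delays, 0.0021 / 0.0020 ns per fF) and the wire approximation of 2.0 x G3 input capacitance. The dictionary layout and function name are illustrative. The high-to-low case is included on the assumption that it follows the same pattern with the transitions swapped, since a NAND inverts.

```python
# 2-input NAND characterization used for Gate 1 and Gate 3
NAND2 = {
    "input_ff": 61.0,
    "tp_lh": 0.5, "tp_lhf": 0.0021,    # output low -> high
    "tp_hl": 0.1, "tp_hlf": 0.0020,    # output high -> low
}
WIRE_FACTOR = 2.0   # wire C assumed equal to the input C of the gates it drives


def a_to_y_delay_ns(y_transition: str) -> float:
    """Internal delay A -> Y = I.D. G1 + (Wire 1 C + G3 Input C) * L.D.D. G1 + I.D. G3."""
    if y_transition not in ("lh", "hl"):
        raise ValueError("transition must be 'lh' or 'hl'")
    # Gate 3 is a NAND, so Y rising means Gate 1's output is falling, and vice versa.
    g1_transition = "hl" if y_transition == "lh" else "lh"
    load_on_g1_ff = WIRE_FACTOR * NAND2["input_ff"]          # 122 fF
    return (NAND2[f"tp_{g1_transition}"]
            + load_on_g1_ff * NAND2[f"tp_{g1_transition}f"]
            + NAND2[f"tp_{y_transition}"])


print(f"TAYlh = {a_to_y_delay_ns('lh'):.3f} ns")
print(f"TAYhl = {a_to_y_delay_ns('hl'):.4f} ns")
```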
Lec 3.34
Abstraction: 2 to 1 MUX
[Figure: the gate-level 2 to 1 MUX (Gate 1, Gate 2, Gate 3) abstracted as a single 2 x 1 Mux block with inputs A, B, S and output Y]
• Input Load: A = 61 fF, B = 61 fF, S = 111 fF
• Load Dependent Delay:
– TAYlhf = 0.0021 ns / fF, TAYhlf = 0.0020 ns / fF
– TBYlhf = 0.0021 ns / fF, TBYhlf = 0.0020 ns / fF
– TSYlhf = 0.0021 ns / fF, TSYhlf = 0.0020 ns / fF
• Internal Delay:
– TAYlh = TPhl G1 + (2.0 * 61 fF) * TPhlf G1 + TPlh G3
= 0.1ns + 122 fF * 0.0020ns/fF + 0.5ns = 0.844ns
– Fun Exercises!: TAYhl, TBYlh, TSYlh, TSYhl
Lec 3.35
Storage Element’s Timing Model
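[Figure: storage element with inputs D and Clk and output Q. Timing diagram: D must be stable during the Setup and Hold window around the clock edge and is Don’t Care otherwise; Q is Unknown until Clock-to-Q after the edge]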
• Setup Time: Input must be stable BEFORE the trigger
clock edge
• Hold Time: Input must REMAIN stable after the
trigger clock edge
• Clock-to-Q time:
– Output cannot change instantaneously at the trigger clock
edge
– Similar to delay in logic gates, two components:
» Internal Clock-to-Q
» Load dependent Clock-to-Q
Lec 3.36
CS152 Building Blocks (maybe more….)
• Logic elements
– NAND2, NAND3, NAND4
– NOR2, NOR3, NOR4
– INV1x (normal inverter)
– INV4x (inverter with large output drive)
– XOR2
– XNOR2
– PWR: Source of 1’s
– GND: Source of 0’s
– fast MUXes
• Storage Element
– D flip flop - negative edge triggered
Lec 3.37
Clocking Methodology
[Figure: storage elements clocked by Clk on both sides of a Combination Logic block]
• All storage elements are clocked by the same clock
edge (but there may be clock skews)
• The combination logic block’s:
– Inputs are updated at each clock tick
– All outputs MUST be stable before the next clock tick
Lec 3.38
Hold Time Violations
[Figure: input register clocked by Clk1, Combination Logic, output register clocked by Clk2; timing diagram comparing Clk-to-Q+Delay with the Hold Time after the clock edge]
• The worst case scenario for hold time consideration:
– The input register sees CLK1
– The output register sees CLK2
– fast FF1 output must not change input to FF2 for same clock
edge
• (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time
Lec 3.39
Clock Skew’s Effect on Hold Time Violation
[Figure: input register clocked by Clk1, Combination Logic, output register clocked by Clk2; timing diagram comparing Clk-to-Q+Delay with the Hold Time when Clk2 lags Clk1 by the Clock Skew]
• The worst case scenario for hold time consideration:
– The input register sees CLK1
– The output register sees CLK2
– fast FF1 output must not change input to FF2 for same clock
edge
• (CLK-to-Q + Shortest Delay Path) > Hold Time + Clock Skew or
(CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time
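The hold-time condition can be turned into a simple slack check. This Python sketch uses made-up timing numbers in ns, chosen only to show one passing and one failing case; the function names are illustrative.

```python
def hold_slack_ns(clk_to_q: float, shortest_path: float,
                  clock_skew: float, hold_time: float) -> float:
    """Slack = (CLK-to-Q + Shortest Delay Path - Clock Skew) - Hold Time."""
    return clk_to_q + shortest_path - clock_skew - hold_time


def meets_hold(clk_to_q: float, shortest_path: float,
               clock_skew: float, hold_time: float) -> bool:
    """The requirement is a strict inequality, so the slack must be positive."""
    return hold_slack_ns(clk_to_q, shortest_path, clock_skew, hold_time) > 0.0


# Hypothetical values in ns: the same path with a small and a large clock skew.
for skew in (0.3, 0.7):
    ok = meets_hold(clk_to_q=0.5, shortest_path=0.2, clock_skew=skew, hold_time=0.1)
    print(f"skew = {skew} ns: {'OK' if ok else 'hold time violation'}")
```

Because the shortest path sets this limit, a slower clock does not remove a hold violation; the fix is less skew or more delay on the short path.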
Lec 3.40
Critical Path & Cycle Time
[Figure: storage devices clocked by Clk with combinational logic paths between them]
• Critical path: the slowest path between any two
storage devices
• Cycle time is a function of the critical path; it must be greater than:
– Clock-to-Q + Longest Path through the Combination Logic + Setup
Lec 3.41
Clock Skew’s Effect on Cycle Time
[Figure: FF1 clocked by Clk1, logic, FF2 clocked by Clk2; timing diagram showing the Clock Skew between Clk1 and Clk2 and the Setup time before the edge at FF2]
• Worst case scenario for cycle time consideration:
– The input register sees CLK1
– The output register sees CLK2
• Cycle Time = CLK-to-Q(FF1) + Longest Delay(C/L) + Setup(FF2) + Clock Skew
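As a quick sketch, the cycle time formula translates directly into Python. The timing values below are hypothetical numbers in ns, picked only to illustrate the arithmetic; the function names are illustrative.

```python
def min_cycle_time_ns(clk_to_q: float, longest_path: float,
                      setup: float, clock_skew: float = 0.0) -> float:
    """Cycle Time = CLK-to-Q(FF1) + Longest Delay(C/L) + Setup(FF2) + Clock Skew."""
    return clk_to_q + longest_path + setup + clock_skew


def max_clock_mhz(cycle_time_ns: float) -> float:
    """Clock rate in MHz for a cycle time given in ns."""
    if cycle_time_ns <= 0.0:
        raise ValueError("cycle time must be positive")
    return 1000.0 / cycle_time_ns


# Hypothetical values in ns
cycle = min_cycle_time_ns(clk_to_q=0.5, longest_path=6.0, setup=0.5, clock_skew=0.3)
print(f"minimum cycle time = {cycle:.2f} ns, maximum clock = {max_clock_mhz(cycle):.1f} MHz")
```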
Lec 3.42
Tricks to Reduce Cycle Time
• Reduce the number of gate levels
[Figure: two gate-level implementations of the same function of inputs A, B, C, D]
• Pay attention to loading
– One gate driving many gates is a bad idea
– Avoid using a small gate to drive a long wire
• Use multiple stages to drive large load
[Figure: INV4x stages driving Clarge]
Lec 3.43
How to Avoid Hold Time Violations?
[Figure: storage elements clocked by Clk on both sides of a Combination Logic block]
• Hold time requirement:
– Input to register must NOT change immediately after the
clock tick
• This is usually easy to meet in the “edge trigger”
clocking scheme
• Hold time of most FFs is <= 0 ns
• CLK-to-Q + Shortest Delay Path must be greater than
Hold Time
Lec 3.44
Hold Time Violation
[Figure: input register clocked by Clk1, Combination Logic, output register clocked by Clk2; timing diagram in which Clk-to-Q+Delay is shorter than the Hold Time]
• The worst case scenario for hold time consideration:
– The input register sees CLK1
– The output register sees CLK2
– fast FF1 output must not change input to FF2 for same clock edge
• For no violation (CLK-to-Q + Shortest Delay Path) > Hold Time
• A violation is shown above
Lec 3.45
A Hold Time Violation Because of Clock Skew
[Figure: input register clocked by Clk1, Combination Logic, output register clocked by Clk2; timing diagram in which Clk-to-Q+Delay ends before Clock Skew + Hold Time has elapsed]
• For no violation:
(CLK-to-Q + Shortest Delay Path) > Hold Time + Clock Skew
or (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time
Lec 3.46
Summary
• Total execution time is the most reliable measure of
performance
• Amdahl’s law: Law of Diminishing Returns
• Performance and Technology Trends
– Keep the design simple (KISS rule) to take advantage of the latest
technology
– CMOS inverter and CMOS logic gates
• Delay Modeling and Gate Characterization
– Delay = Internal Delay + (Load Dependent Delay x Output Load)
• Clocking Methodology and Timing Considerations
– Simplest clocking methodology
» All storage elements use the SAME clock edge
– Cycle Time >= CLK-to-Q + Longest Delay Path + Setup + Clock Skew
– (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time
Lec 3.47
To Get More Information
• EECS 141 - Digital Integrated Circuit Design - 105 no longer a prerequisite; only EECS 40 required!
• Book: Digital Integrated Circuits - A Design Perspective - by Jan Rabaey
• Web page (slides from book)
– http://bwrc.eecs.berkeley.edu/icdesign/instructors.html
Lec 3.48