Transcript chap1-2

CEN 316
Computer Organization and Design
Assessing and Understanding Performance
Mansour AL Zuair
Chapter 2: Performance
•
•
•
•
•
Performance:
– What is it: measures of performance
The CPU performance equation:
– Execution time as the measure
– What affects execution time
– Examples
Popular alternative metrics
– Why they don’t work
Benchmarks
Amdahl's law
CEN316 - Chapter 1-2
2
Performance is Time
Plane
DC to Paris
Speed
Passengers
Throughput
(PMPH)
Boeing 747
6.5 hours
610 MPH
470
286,700
BAD/Sud
Concorde
3 hours
1350 MPH
132
178,200
• Time to do the task (execution time)
– Execution time, response time, latency
• Tasks per unit time (sec, minute, ...)
– Throughput, bandwidth
CEN316 - Chapter 1-2
3
Throughput and Response Time
• Example
– What is the effect of the following changes on throughput
and response time?
» Increasing processor speed?
» Increasing the number of processors on the same system
(multitask)?
» Is there any relation between response time and throughput?
» What about queuing?
CEN316 - Chapter 1-2
4
Performance as Response Time
• Performance is most often measured as response time or
execution time for some task.
• “X is n times faster than Y” means
Performanc e(X) Execution Time(Y)

n
Performanc e(Y) Execution Time(X)
• Example
– Execution time of program P
» X is 5 sec
» Y is 10 sec.
– X is 2 times faster than Y.
CEN316 - Chapter 1-2
5
What Time to Measure?
• Elapsed time, wall-clock time:
– Actual time from start to completion.
– Depends on CPU, system, I/O, etc.
– Often used in real benchmarks.
– Only suitable choice when I/O is included.
• CPU time:
– Measure/analyze CPU performance only.
– May be suitable when machine is timeshared.
– Possibly both user and system component.
– User CPU time is our focus for first part of course.
• Elapsed time = CPU time + idle time.
– Usually and assuming time is accurately accounted for.
CEN316 - Chapter 1-2
6
Metrics of performance
• Different performance metrics are appropriate at different levels:
Application
Answers per month
Operations per second
Programming
Language
Compiler
ISA
Datapath
Control
Function Units
(millions) of Instructions per second – MIPS
(millions) of (F.P.) operations per second – MFLOP/s
Instruction Set Architecture
Cycles per Instruction
Cycles per second (clock rate)
Transistors
CEN316 - Chapter 1-2
7
Relating Processor Metrics
• CPU execution time per program =
CPU clock cycles/program  Clock cycle time =
CPU clock cycles/program ÷ Clock rate (frequency)
• CPU clock cycles/program =
Instructions/program  Clock cycles Per Instruction
• Clock cycles Per Instruction (CPI) is an average measurement,
it depends on :
– ISA, the implementation, and the program measured
– CPI = CPU clock cycles/program ÷ Instructions/program
– Also, Instructions per clock cycle or IPC = 1 / CPI
• CPU execution time = Instructions  CPI  Clock cycle
CEN316 - Chapter 1-2
8
Aspects of CPU Performance
CPU time 
•
•
seconds instructio ns
cycles
seconds



program
program
instructio n
cycle
Instead of reporting execution time in seconds, we often use cycles
Clock “ticks” indicate when to start activities (one abstraction):
time
•
•
cycle time = time between ticks = seconds per cycle
clock rate (frequency) = cycles per second (1 Hz. = 1 cycle/sec)
A 200 Mhz. clock has a
1
200  106
 109  5 nanosecond s cycle time
CEN316 - Chapter 1-2
9
Example
• Two machines A and B with the same ISA,
• A has a clock cycle time of 1ns and a CPI=2 for certain
program.
• B has clock cycle time of 2ns and a CPI=1.2 for the same
program.
• Which machine is faster and by how much?
• CPU time (A) = I x 2 x 1 ns
• CPU time (B) = I x 1.2 x 2 ns
CPU performanc e (A)
 1.2
CPU performanc e (B)
CEN316 - Chapter 1-2
10
Example of Improving performance
Given the following Information
– program run in 10
seconds on computer A, 4
GH clock
– Build a computer that
runs the same program in
6 seconds
– Assumptions:
» Clock rate can be increased
substantially
» Increasing clock rate increases
the clock cycles for this
program by 20%
What is the target clock rate?
CPU time A 
10 seconds 
CPU Clock Cycles
Clock rate A
CPU Clock Cycles
cycles
4 x 109
seconds
A
A
CPU Clock Cycles A  40 x 109 cycles
CPU time B 
1.2xCPU Clock Cycles
Clock rate B
A
1.2x40x10 9 cycles A
6 seconds 
Clock rate B
1.2x40 x 109 cycles
clock rate B 
 8GH
6 sec onds
CEN316 - Chapter 1-2
11
Organizational Trade-offs
• Instruction set (and hence Instruction Count), CPI, and Clock
cycle time interact in complex ways:
Application
Programming
Language
Compiler
ISA
Instruction Mix
Datapath
CPI
Function Units
Transistors
Cycle Time
CEN316 - Chapter 1-2
12
Cycles Per Instruction (CPI)
• CPI = average number of clock cycles per instruction
– CPI = Clock Cycles / Instruction Count
• Affected by both :
– cost for each instruction type
– the frequency of different instructions
» called the instruction mix
• Useful way to compute CPI (for n instruction types):
n
CPI   CPIi  Fi where Fi 
i 1
Instructio n Count i
Instructio n Count
CEN316 - Chapter 1-2
13
Example comparing code segments
Given the following Info from HW designer
Instruction
class
CPI
A
B
C
1
2
3
Code
Sequence
1
2
Instruction count for
instruction class
A
B
C
2
4
1
1
2
1
Number of instructions?
5 for sequence 1 and 6 For sequence 2
Which one is faster?
CPU clock cycles1 = (2x1)+ (1x2) + (2x3)=10 cycles
CPU clock cycles2 = 9 cycles
What is the CPI for each sequence?
CPI1 = 10/5 =2 CPI2 = 9/6=1.5
CEN316 - Chapter 1-2
14
Example CPI Computation
• RISC processor: register-register ISA:
Instruction
Type
Frequency
Fi
Cycles
CPIi
CPI Contr
Fi  CPIi
Time %
This Instr
ALU
Load
Store
Branch
50%
20%
10%
20%
1
2
2
2
0.5
0.4
0.2
0.4
33%
27%
13%
27%
Typical Mix
CPI = 1.5
CEN316 - Chapter 1-2
15
Using the CPU Performance Equation
•
Example:
Consider adding ALU instructions that can have one memory operand
to the MIPS ISA to produce MIPSE:
– MIPSE = MIPS + ALU instrs with a memory operand.
– Initial mix and cycle counts on MIPS:
Instr class
Load
Store
ALU op
Branch
Overall CPI
Freq
30%
15%
40%
15%
Cycles
2
2
1
2
MIPS CPI
0.6
0.3
0.4
0.3
1.6
– Assume:
» CPI of the MIPSE instruction ALU-with-memory-instruction is 2
» Clock cycle 1.25 times the MIPS clock cycle
» One half of the load instructions and a corresponding number of ALU
instructions are replaced by ALU-with-memory
– Which machine is faster?
CEN316 - Chapter 1-2
16
Solution
•
Normalize mix to 100 instructions
– can be easier to calculate and enhance intuition
Cycles
ALU
1
Load
2
Store
2
Branch
2
ALU+mem
2
Total
MIPS
Instruc Cycle
Count Count
40
40
30
60
15
30
15
30
0
0
100
160
Instruc
Delta
-15
-15
15
MIPSE
Instruc Cycle
Count Count
25
25
15
30
15
30
15
30
15
30
85
145
MIPS execution time = 160 cycles  CCMIPS
MIPSE execution time = 145 cycles  (1.25X CCMIPS)
MIPSE takes ((145  1.25) / 160) times as long as MIPS
MIPSE is 1.13  performance of MIPS
CEN316 - Chapter 1-2
17
Alternative Performance Metrics: MIPS
• Use something other than time
– Often good intention to find a simple metric
» bigger is better, general measure, summarizes performance
• Most common metric:
MIPS (Millions of Instructions Per Second):
MIPS 
Instructio n Count Clock Rate

6
Time 10
CPI 106
• Flaws in using MIPS
– Machines with different instruction sets ?
– Programs with different instruction mixes ?
» dynamic frequency of instructions
– Can vary inversely with performance!
CEN316 - Chapter 1-2
18
Example of Problems
• Consider an optimized and unoptimized version of the same
program:
Memory
Instructions
ALU
Instructions
Branch
Instruction
FP
Instruction
Total
Instructions
Unoptimize
d Program
100M
100M
30M
40M
270M
Optimized
Program
50M
50M
30M
40M
170M
CPI
• Here are cycle counts for instructions
Memory
Cycles
ALU Cycles
Branch
Cycles
FP Cycles
Per
Instruction
2
1
3
5
Unoptimize
d Program
200M
100M
90M
200M
2.2
Optimized
Program
100M
50M
90M
200M
2.6
CEN316 - Chapter 1-2
19
Example continued
• Assuming a 200 MHz clock:
– MIPSunoptimized = 200/2.2 = 91
– MIPSoptimized = 200/2.6 = 77
– Performanceunoptimized > Performanceoptimized
• But look at Execution time!
– Execution timeunoptimized = CPI  IC / CR = 2.2  270 / 200 = 3s
– Execution timeoptimized = CPI  IC / CR = 2.6  170 / 200 = 2.2s
– Performanceoptimized > Performanceunoptimized
• MIPS measurement is inverse of reality!
CEN316 - Chapter 1-2
20
Another Alternative: MFLOPS
• MFLOPS (Millions of FLoating Operations per Second):
– common metric in scientific/engineering and supercomputer
arenas
– MFLOPS = Floating point Operations
Time X 106
– Machine dependent: what is a floating point op?
– often not where time is spent (i.e. not in FP operations)
– at best, no better than execution time
– at worst, much less informative and more deceptive
CEN316 - Chapter 1-2
21
What is benchmarks?
• Benchmarks – a set of programs that form a “workload”
specifically chosen to measure performance
• SPEC (System Performance Evaluation Cooperative)
creates standard sets of benchmarks starting with
SPEC89. The latest is SPEC CPU2006 which consists of
12 integer benchmarks (CINT2006) and 17 floating-point
benchmarks (CFP2006).
www.spec.org
• There are also benchmark collections for power workloads
(SPECpower_ssj2008), for mail workloads
(SPECmail2008), for multimedia workloads (mediabench),
…
CEN316 - Chapter 1-2
22
Comparing and Summarizing Performance

How do we summarize the performance for benchmark
set with a single number?
l
l
First the execution times are normalized giving the “SPEC ratio”
(bigger is faster, i.e., SPEC ratio is the inverse of execution time)
The SPEC ratios are then “averaged” using the geometric mean
(GM)
GM =
n
n

SPEC ratioi
i=1
• Guiding principle in reporting performance measurements is
reproducibility – list everything another experimenter would need to
duplicate the experiment (version of the operating system, compiler
settings, input set used, specific computer configuration (clock rate,
cache sizes and speed, memory size and speed, etc.))
CEN316 - Chapter 1-2
23
Amdahl's Law
• Handy for evaluating impact of a change not tied to CPU
performance equation
• Insight: No improvement of a feature enhances performance by
more than the use of the feature.
• Suppose that enhancement E accelerates fraction F of a
program by a factor S (remainder of the task is unaffected):

 F 
Execution Time enhanced   1  F      Execution Time unenhanced
 S 

F
1–F
S=
CEN316 - Chapter 1-2
24
Example on Amdahl's Law
• Assume a program runs in 100 seconds on a machine whre
multiply operations consumes 80 seconds of this time. How
much do we have to improve the speed of multiplication if we
want to run the program 5 times faster?
• Solution:
– Execution time after improvement =
Execution time affected by improvement
+ Execution time unaffected
A mount of improvement
– 20 seconds = (80/n) + 20 seconds
– What is the value of n???
CEN316 - Chapter 1-2
25
Summary : Performance
• Time is the measure of computer performance!
– Performance equation includes three parts; all three together
determine performance
• Good products created when have:
– Good benchmarks
– Good ways to summarize performance
• Will need different performance metrics as well as a different set
of applications to benchmark embedded and desktop
computers, which are more focused on response time, versus
servers, which are more focused on throughput
• Remember Amdahl’s Law: Speedup is limited by unimproved
part of program
CEN316 - Chapter 1-2
26