performance
These slides are based on chapter 2 of the following book:
D. A. Patterson and J. L. Hennessy, Computer Organization & Design: The
Hardware/Software Interface, Morgan Kaufmann, second edition, 1998.
If you need more explanation, you can find it in the book itself.
Here is the list of the relevant slide numbers (from the chapter 2 slides):
11 – 14, 18 – 22, 28 – 30.
The slides contain some examples (without solutions).
We will solve some of them in class.
We will focus on user CPU time: the time spent executing the lines of code that
are “in” our program (i.e. without I/O time, etc.).
Definition of performance: for some program running on machine X,
Performance_X = 1 / Execution time_X
Note that “machine X is n times faster than machine Y” => Performance_X / Performance_Y = n
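As a small illustration of the definition above, the following sketch computes relative performance from two execution times. The function names and the sample times (10 s and 15 s) are assumptions made for this example; they do not come from the slides.

```python
def performance(execution_time_s: float) -> float:
    """Performance is defined as the reciprocal of execution time."""
    return 1.0 / execution_time_s


def times_faster(time_x_s: float, time_y_s: float) -> float:
    """Return n such that machine X is n times faster than machine Y."""
    return performance(time_x_s) / performance(time_y_s)


# Hypothetical times: X runs the program in 10 s, Y runs it in 15 s.
n = times_faster(10.0, 15.0)
print(f"X is {n:.2f} times faster than Y")
```

Note that the ratio of performances equals the inverse ratio of execution times, so n is simply 15 / 10 here.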
Clock cycle: the time between two consecutive (machine) clock ticks.
Instead of reporting execution time in seconds, we often use cycles.
Clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec).
Example:
A machine with a 200 MHz clock runs at 200 * 10^6 Hz => it produces 2 * 10^8 clock
cycles per second => its cycle time is 1 / (2 * 10^8) seconds = 5 nanoseconds
(1 nanosecond = 10^-9 seconds).
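The conversion from clock rate to cycle time can be sketched as follows. The helper name cycle_time_ns is an assumption introduced for this example.

```python
def cycle_time_ns(clock_rate_hz: float) -> float:
    """Cycle time is the reciprocal of the clock rate; result in nanoseconds."""
    return 1e9 / clock_rate_hz


# The 200 MHz machine from the example.
t_ns = cycle_time_ns(200e6)
print(f"cycle time: {t_ns} ns")
```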
Note: different (machine) instructions take different numbers of clock cycles,
e.g. integer vs. floating-point operations, memory access vs. register access, etc.
Problem:
Some program runs in 10 seconds on computer A, which has a 400 MHz clock.
We build a new machine B, which runs at 600 MHz, but on this machine each
instruction requires 1.2 times as many clock cycles as on machine A.
How much time would it take machine B to execute the same program?
Solution:
clock rate = cycles per second
400 MHz = 4 * 10^8 Hz => machine A provides 4 * 10^8 cycles per second
The program runs for 10 seconds on machine A => program execution takes 4 * 10^9 cycles
=> on machine B it would take 1.2 * 4 * 10^9 = 4.8 * 10^9 cycles.
How much time would it run on machine B?
(4.8 * 10^9 cycles) / (6 * 10^8 Hz) = 8, i.e. 8 seconds.
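The same calculation, written out step by step as a minimal sketch. The variable names are assumptions chosen for this example; the numbers are the ones given in the problem.

```python
# Machine A: known execution time and clock rate.
clock_a_hz = 400e6
time_a_s = 10.0

# cycles = seconds * cycles per second
cycles_a = time_a_s * clock_a_hz

# Machine B needs 1.2 times as many cycles and runs at 600 MHz.
clock_b_hz = 600e6
cycles_b = 1.2 * cycles_a

# seconds = cycles / cycles per second
time_b_s = cycles_b / clock_b_hz
print(f"machine B needs {time_b_s:.1f} seconds")
```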
Problem:
There are two different classes of instructions: A and B
- machine A has a clock cycle time of 10 ns (nanoseconds) and a CPI (cycles per
instruction) of 2.0 for class A instructions and a CPI of 3.0 for class B instructions.
- machine B has a clock cycle time of 20 ns and a CPI of 1.25 for both instruction
classes.
A given program is 50% class A instructions and 50% class B instructions.
Which machine runs this program faster?
Solution:
machine A: ns per class A instruction = 2.0 * 10 = 20.
machine A: ns per class B instruction = 3.0 * 10 = 30.
machine B: ns per instruction = 1.25 * 20 = 25.
Let C be the number of instructions in the program.
execution time on machine A: C * (0.5 * 20 + 0.5 * 30) = C * 25 ns.
execution time on machine B: C * (1.0 * 25) = C * 25 ns.
=> the machines have the same performance for the given program.
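The weighted-average calculation in this solution can be sketched as below. The function name and the (fraction, CPI) list layout are assumptions made for this example, and the class B CPI of 3.0 on machine A is the value the solution uses.

```python
def avg_ns_per_instruction(cycle_time_ns, mix):
    """Average time per instruction for a mix given as (fraction, CPI) pairs."""
    weighted_cpi = sum(fraction * cpi for fraction, cpi in mix)
    return cycle_time_ns * weighted_cpi


# 50% class A and 50% class B instructions on each machine.
machine_a_ns = avg_ns_per_instruction(10.0, [(0.5, 2.0), (0.5, 3.0)])
machine_b_ns = avg_ns_per_instruction(20.0, [(0.5, 1.25), (0.5, 1.25)])

print(f"machine A: {machine_a_ns} ns per instruction")
print(f"machine B: {machine_b_ns} ns per instruction")
```

Multiplying either result by the instruction count C gives the total execution time, so equal averages mean equal performance on this program.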
Problem:
There are three different classes of instructions: class A, B and C.
They require one, three and five cycles respectively.
There are two code sequences:
- the first code contains: 1 instruction of class A, 2 of B, and 1 of C.
- the second code contains: 6 instructions of class A, 1 of B, and 1 of C.
A) Which sequence will be faster?
B) By how much?
C) What is the CPI for each sequence?
Solution:
first code: 1*1+2*3+1*5 = 12 cycles => CPI = 12 / (1+2+1) = 3
second code: 6*1+1*3+1*5 = 14 cycles => CPI = 14 / (6+1+1) = 1.75
A) first code is faster.
B) By a factor of 14/12 ≈ 1.17.
C) 3 for first code, 1.75 for second code
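A short sketch that reproduces the cycle counts and CPI values above. The dictionary layout and function names are assumptions for this example.

```python
# Cycles required by each instruction class, as stated in the problem.
CYCLES_PER_CLASS = {"A": 1, "B": 3, "C": 5}


def total_cycles(counts):
    """Sum of (instruction count * cycles per instruction) over all classes."""
    return sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())


def cpi(counts):
    """Average cycles per instruction for a code sequence."""
    return total_cycles(counts) / sum(counts.values())


first = {"A": 1, "B": 2, "C": 1}
second = {"A": 6, "B": 1, "C": 1}

for name, seq in (("first", first), ("second", second)):
    print(f"{name} code: {total_cycles(seq)} cycles, CPI = {cpi(seq)}")
```

The second code has the lower CPI but still needs more cycles, because it executes twice as many instructions.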
Amdahl’s Law:
e.t. after improvement = e.t. unaffected + (e.t. affected / amount of improvement)
(e.t. = execution time)
Problem:
A program runs in 100 seconds, with multiply instructions responsible for 80
seconds of this time (i.e. the program spends 80 seconds executing multiply
instructions). How much do we have to improve the speed of multiplication if we
want the program to run 4 times faster? How about making it 5 times faster?
Solution:
e.t. after improvement = 20 seconds + 80 seconds / x
=> 100 / 4 = 20 + 80 / x
=> x = 16
This means that multiplication should be executed 16 times faster!
Now, to make the program run 5 times faster:
100 / 5 = 20 + 80 / x
=> 80 / x = 0 => x = ∞ !!!
This means that the multiplication should take 0 time! That’s impossible.
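Solving Amdahl's Law for the required improvement x can be sketched as follows. The function name and its error handling are assumptions made for this example; returning infinity is just one way to represent the impossible case.

```python
import math


def required_improvement(total_s, affected_s, target_speedup):
    """Solve total/target = unaffected + affected/x for x."""
    unaffected_s = total_s - affected_s
    target_s = total_s / target_speedup
    # Time left for the affected part after the unaffected part has run.
    budget_s = target_s - unaffected_s
    if budget_s < 0:
        raise ValueError("target is below the unaffected time; not reachable")
    if budget_s == 0:
        return math.inf  # the affected part would have to take zero time
    return affected_s / budget_s


print(required_improvement(100.0, 80.0, 4.0))
print(required_improvement(100.0, 80.0, 5.0))
```

The unaffected 20 seconds set a hard limit: no improvement to multiplication alone can make the program more than 5 times faster.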
Problem:
Suppose we want to improve performance on a well-known benchmark. We know that
floating-point instructions account for 70% of the benchmark, and the
benchmark runs for 20 seconds.
We enhance the machine, making all floating-point instructions run 7 times faster,
but for some reason this causes the rest of the instructions to take twice as long.
What will the speedup be?
Floating-point instructions run for 14 seconds, the rest for 6 seconds.
Solution:
e.t. after improvement = 6 * 2 seconds + 14 / 7 seconds = 12 + 2 = 14 seconds
=> speedup = 20 / 14 ≈ 1.43.
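The calculation above, with one part sped up and the other slowed down, can be sketched like this. The function and parameter names are assumptions for this example.

```python
def time_after_change(total_s, fp_fraction, fp_speedup, other_slowdown):
    """New execution time when FP code gets faster and the rest gets slower."""
    fp_s = total_s * fp_fraction   # time spent in floating-point instructions
    other_s = total_s - fp_s       # time spent in everything else
    return fp_s / fp_speedup + other_s * other_slowdown


old_s = 20.0
new_s = time_after_change(old_s, 0.7, 7.0, 2.0)
speedup = old_s / new_s
print(f"new time: {new_s:.1f} s, speedup: {speedup:.2f}")
```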
Summary:
- performance is specific to a particular program or set of programs. Total execution time is a
consistent summary of performance.
- for a given architecture, performance increases come from:
- increases in clock rate (without adverse CPI effects)
- improvements in processor organization that lower CPI
- compiler enhancements that lower CPI and / or instruction count
Pitfall: expecting an improvement in one aspect of a machine’s performance to improve
the total performance by a proportional amount.