Performance (PPT Slides)


ENG3380
Computer Organization and
Architecture
“Performance Issues”
Winter 2017
S. Areibi
School of Engineering
University of Guelph
Topics

o Defining Performance
o Measuring Execution Time
o Clock Cycles per Instruction (CPI)
o Performance Factors
o Power Consumption
o Benchmarks
o Enhancing Performance
o Amdahl's Law
o Little's Law
o Summary
With thanks to W. Stallings, Hamacher, J. Hennessy, M. J. Irwin for lecture slide contents
Many slides adapted from the PPT slides accompanying the textbook and CSE331 Course
References

I. "Computer Organization and Architecture: Designing for Performance", 10th edition, by William Stallings, Pearson.
II. "Computer Organization and Design: The Hardware/Software Interface", 4th edition, by D. Patterson and J. Hennessy, Morgan Kaufmann.
III. "Computer Organization and Architecture: Themes and Variations", 2014, by Alan Clements, CENGAGE Learning.
Performance
Performance
o When trying to choose among different computers, performance is
an important attribute.
o Accurately measuring and comparing different computers is
critical to purchasers and therefore to designers.
o Evaluating the performance of computers is an essential skill that
Computer Engineers should have.
o Assessing the performance of computers can be quite challenging.
Performance Metrics

o Purchasing perspective: given a collection of machines, which has the
  - best performance?
  - least cost?
  - best cost/performance?
o Design perspective: faced with design options, which has the
  - best performance improvement?
  - least cost?
  - best cost/performance?
o Both require
  - a basis for comparison
  - a metric for evaluation

Our goal is to understand what factors in the architecture contribute to
overall system performance and the relative importance (and cost) of
these factors.
Defining Performance

When we say one computer has better performance than another, what do we mean?

[Figure: four passenger aircraft (Boeing 777, Boeing 747, BAC/Sud Concorde,
Douglas DC-8-50) compared on four different metrics: passenger capacity,
cruising speed (mph), cruising range (miles), and passengers × mph. Which
aircraft has the "best" performance depends on which metric is chosen.]
Response Time and Throughput
o Response time (execution time)
● How long it takes to do a task
● The time between the start and completion of a task
o Throughput
● Total work done per unit time
♦ e.g., tasks/transactions/… per hour
o How are response time and throughput affected by
● Replacing the processor with a faster version?
● Adding more processors?
o We’ll focus on response time for now…
Relative Performance

o Define Performance = 1 / Execution Time
o "X is n times faster than Y" means

    Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n

o Example: time taken to run a program
  ● 10 s on A, 15 s on B
  ● Execution Time_B / Execution Time_A = 15 s / 10 s = 1.5
  ● So A is 1.5 times faster than B
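The definitions above translate directly into a few lines of code. A minimal
sketch in Python (the names are mine, not from the slides):

    def performance(execution_time):
        """Performance is defined as the reciprocal of execution time."""
        return 1.0 / execution_time

    # Example from the slide: the program takes 10 s on A and 15 s on B.
    time_a, time_b = 10.0, 15.0
    n = time_b / time_a                       # ratio of execution times
    print(f"A is {n:.1f} times faster than B")   # A is 1.5 times faster than B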
Measuring Execution Time

o Elapsed time
  ● Total response time, including all aspects
  ♦ Processing, I/O, OS overhead, idle time
  ● Determines system performance
o CPU execution time
  ● Time spent processing a given job (or task)
  ♦ Discounts I/O time and other jobs' shares
  ● Comprises:
  ♦ User CPU time (spent in the program itself)
  ♦ System CPU time (spent in the OS performing tasks for the program)
  ● Different programs are affected differently by CPU and system
    performance
CPU Clocking

o Operation of digital hardware is governed by a constant-rate clock.

[Figure: clock waveform showing the clock period; data transfer and
computation take place within each cycle, and state is updated on the
clock edge.]

o Clock period (cycle time): duration of a clock cycle
  ● e.g., 250 ps = 0.25 ns = 250 × 10⁻¹² s
o Clock frequency (rate): cycles per second
  ● e.g., 4.0 GHz = 4000 MHz = 4.0 × 10⁹ Hz
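Period and rate are reciprocals, which is easy to sanity-check in code. A
small sketch using only the definitions above:

    clock_rate = 4.0e9                  # 4.0 GHz, in cycles per second
    clock_period = 1.0 / clock_rate     # duration of one cycle, in seconds
    print(f"{clock_period * 1e12:.0f} ps")   # 250 ps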
CPU Time

Assume a program (e.g., sorting) completes in 250 clock cycles.
What is the total CPU time used?

    CPU Time = # CPU Clock Cycles × Clock Cycle Time
             = # CPU Clock Cycles / Clock Rate

Performance can be improved by
o Reducing the number of clock cycles
o Increasing the clock rate
o The hardware designer must often trade off clock rate against cycle
  count: many techniques that decrease the number of clock cycles may
  also increase the clock cycle time!
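A sketch of the equation applied to the 250-cycle example above. The slide
does not give a clock rate, so the 4 GHz below is an assumed value for
illustration only:

    cycles = 250             # clock cycles for the program (from the slide)
    clock_rate = 4.0e9       # assumed 4 GHz; the slide leaves this open
    cpu_time = cycles / clock_rate   # CPU Time = cycles / clock rate
    print(f"{cpu_time * 1e9:.1f} ns")    # 62.5 ns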
CPU Time Example

Our favorite program runs in 10 seconds on computer A, which has a 4 GHz
clock. We are trying to help a computer designer build a computer B that
will run this program in 6 seconds. The designer has determined that a
substantial increase in the clock rate is possible, but this increase will
affect the rest of the CPU design, causing computer B to require 1.2 times
as many clock cycles as computer A for this program. What clock rate
should we tell the designer to target?

    CPU Time_A = CPU Clock Cycles_A / Clock Rate_A
    10 seconds = CPU Clock Cycles_A / (4 × 10⁹ cycles/second)
    CPU Clock Cycles_A = 10 seconds × 4 × 10⁹ cycles/second = 40 × 10⁹ cycles

    CPU Time_B = 1.2 × CPU Clock Cycles_A / Clock Rate_B
    6 seconds = 1.2 × 40 × 10⁹ cycles / Clock Rate_B
    Clock Rate_B = 1.2 × 40 × 10⁹ cycles / 6 seconds = 8 GHz
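A quick check of the arithmetic, as a sketch in Python:

    time_a, rate_a = 10.0, 4.0e9          # computer A: 10 s at 4 GHz
    cycles_a = time_a * rate_a            # 40e9 cycles for the program
    time_b, cycle_factor = 6.0, 1.2       # B: 6 s target, 1.2x as many cycles
    rate_b = cycle_factor * cycles_a / time_b
    print(f"{rate_b / 1e9:.0f} GHz")      # 8 GHz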
Sample of the Instruction Set of the 68HC12

An addressing mode refers to the process used by the CPU to determine the
location of an argument.

Inherent Addressing Mode

In inherent addressing, the opcode itself contains the effective address,
i.e., the instruction does not require an operand.

Immediate Addressing Mode

In immediate addressing, the argument itself follows the opcode:
EA = &opcode + 1.
Clock Cycles Per Instruction: CPI

o Not all instructions take the same amount of time to execute.
o One way to think about execution time is that it equals the number of
  instructions executed multiplied by the average time per instruction:

    # CPU Clock Cycles for a program
        = # Instructions for a program × Average Clock Cycles per Instruction

o Clock cycles per instruction (CPI): the average number of clock cycles
  each instruction takes to execute.
o CPI is a way to compare two different implementations of the same ISA.

    Instruction class:            A  B  C
    CPI for this class:           1  2  3
Instruction Count and CPI

    Clock Cycles = Instruction Count × Cycles per Instruction
    CPU Time = Instruction Count × CPI × Clock Cycle Time
             = Instruction Count × CPI / Clock Rate

o Instruction count for a program
  ● Determined by program, ISA and compiler
o Average cycles per instruction
  ● Determined by CPU hardware
  ● If different instructions have different CPI, the average CPI is
    affected by the instruction mix
CPI Example

o Computer A: Cycle Time = 250 ps, CPI = 2.0
o Computer B: Cycle Time = 500 ps, CPI = 1.2
o Same ISA
o Which is faster, and by how much?

    CPU Time_A = Instruction Count × CPI_A × Cycle Time_A
               = I × 2.0 × 250 ps = I × 500 ps      ← A is faster…
    CPU Time_B = Instruction Count × CPI_B × Cycle Time_B
               = I × 1.2 × 500 ps = I × 600 ps

    CPU Time_B / CPU Time_A = (I × 600 ps) / (I × 500 ps) = 1.2   ← …by this much
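A sketch of the same comparison in Python; the instruction count cancels in
the ratio, so any value works:

    ic = 1_000_000                    # instruction count (cancels in the ratio)
    time_a = ic * 2.0 * 250e-12       # CPU Time = IC x CPI x cycle time
    time_b = ic * 1.2 * 500e-12
    print(f"A is {time_b / time_a:.1f}x faster than B")   # 1.2x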
CPI in More Detail

o If different instruction classes take different numbers of cycles
  (assuming n classes):

    Clock Cycles = Σ (CPI_i × Instruction Count_i),  summed over i = 1 … n

o Weighted average CPI:

    CPI = Clock Cycles / Instruction Count
        = Σ (CPI_i × Instruction Count_i / Instruction Count)

  where Instruction Count_i / Instruction Count is the relative frequency
  of class i.
CPI Example

o Alternative compiled code sequences using instructions in classes A, B,
  and C:

    Class             A  B  C
    CPI for class     1  2  3
    IC in sequence 1  2  1  2
    IC in sequence 2  4  1  1

o Sequence 1: IC = 5
  ● Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  ● Avg. CPI = 10/5 = 2.0
o Sequence 2: IC = 6
  ● Clock Cycles = 4×1 + 1×2 + 1×3 = 9 (faster)
  ● Avg. CPI = 9/6 = 1.5
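The weighted-CPI computation generalizes naturally; a minimal sketch using
only the class counts from the slide:

    cpi = {"A": 1, "B": 2, "C": 3}        # CPI per instruction class

    def avg_cpi(ic_per_class):
        """Weighted average CPI = total clock cycles / total instruction count."""
        cycles = sum(cpi[c] * n for c, n in ic_per_class.items())
        insts = sum(ic_per_class.values())
        return cycles / insts

    print(avg_cpi({"A": 2, "B": 1, "C": 2}))   # sequence 1 -> 2.0
    print(avg_cpi({"A": 4, "B": 1, "C": 1}))   # sequence 2 -> 1.5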
THE Performance Equation

o Our basic performance equation is then

    CPU Time = Instruction Count × CPI × Clock Cycle Time
             = Instruction Count × CPI / Clock Rate

o These equations separate the three key factors that affect performance:
  ● We can measure the CPU execution time by running the program
  ● The clock rate is usually given
  ● We can measure the overall instruction count using profilers/
    simulators without knowing all of the implementation details
  ● CPI varies by instruction type and ISA implementation, for which we
    must know the implementation details
Performance Summary

The BIG Picture:

    CPU Time = (Instructions / Program) × (Clock Cycles / Instruction)
               × (Seconds / Clock Cycle)

Performance depends on:
o Algorithm: affects IC, possibly CPI
o Programming language: affects IC, CPI
o Compiler: affects IC, CPI
o Instruction set architecture: affects IC, CPI, Tc
Determinants of CPU Performance

    CPU time = Instruction_count x CPI x clock_cycle

                            Instruction_count   CPI   clock_cycle
    Algorithm                      X             X
    Programming language           X             X
    Compiler                       X             X
    ISA                            X             X         X
    Processor organization                       X         X
    Technology                                             X
A Simple Example

    Op      Freq  CPI_i   Freq × CPI_i
                          (base)  (better cache)  (branch pred.)  (2× ALU)
    ALU     50%    1        .5         .5              .5            .25
    Load    20%    5       1.0         .4             1.0           1.0
    Store   10%    3        .3         .3              .3            .3
    Branch  20%    2        .4         .4              .2            .4
    Total CPI =            2.2        1.6             2.0           1.95

o How much faster would the machine be if a better data cache reduced the
  average load time to 2 cycles?
  ● CPU Time_new = 1.6 × IC × CC, so 2.2/1.6 means 37.5% faster
o How does this compare with using branch prediction to shave a cycle off
  the branch time?
  ● CPU Time_new = 2.0 × IC × CC, so 2.2/2.0 means 10% faster
o What if two ALU instructions could be executed at once?
  ● CPU Time_new = 1.95 × IC × CC, so 2.2/1.95 means 12.8% faster
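A sketch that recomputes the table and the three speedups:

    freq = {"ALU": 0.50, "Load": 0.20, "Store": 0.10, "Branch": 0.20}
    base_cpi = {"ALU": 1, "Load": 5, "Store": 3, "Branch": 2}

    def avg_cpi(cpi):
        """Average CPI = sum of (relative frequency x class CPI)."""
        return sum(freq[op] * cpi[op] for op in freq)

    base         = avg_cpi(base_cpi)                      # 2.2
    better_cache = avg_cpi({**base_cpi, "Load": 2})       # 1.6
    branch_pred  = avg_cpi({**base_cpi, "Branch": 1})     # 2.0
    dual_alu     = avg_cpi({**base_cpi, "ALU": 0.5})      # 1.95

    for name, cpi in [("cache", better_cache), ("branch", branch_pred),
                      ("2x ALU", dual_alu)]:
        print(f"{name}: {(base / cpi - 1) * 100:.1f}% faster")
    # cache: 37.5%   branch: 10.0%   2x ALU: 12.8%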
Pitfall: MIPS as a Performance Metric

MIPS: Millions of Instructions Per Second

o Doesn't account for
  ♦ Differences in ISAs between computers
  ♦ Differences in complexity between instructions

    MIPS = Instruction Count / (Execution Time × 10⁶)
         = Instruction Count / ((Instruction Count × CPI / Clock Rate) × 10⁶)
         = Clock Rate / (CPI × 10⁶)

o CPI varies between programs on a given CPU
MIPS as a Performance Metric: Example

Consider the following performance measurements for a program:

    Measurement         Computer A   Computer B
    Instruction Count   10 billion    8 billion
    Clock Rate            4 GHz        4 GHz
    CPI                    1.0          1.1

1. Which computer has the higher MIPS rating?
2. Which computer is faster?

Answers:
1. Computer A has the higher MIPS rating (4000 vs. ≈3636 MIPS).
2. Computer B is faster (2.2 s vs. 2.5 s of CPU time).
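A sketch verifying both answers:

    def mips(clock_rate, cpi):
        return clock_rate / (cpi * 1e6)

    def cpu_time(ic, cpi, clock_rate):
        return ic * cpi / clock_rate

    for name, ic, cpi in [("A", 10e9, 1.0), ("B", 8e9, 1.1)]:
        print(name, mips(4e9, cpi), cpu_time(ic, cpi, 4e9))
    # A: 4000 MIPS, 2.5 s;  B: ~3636 MIPS, 2.2 s -> A wins on MIPS, B on time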
SPEC: Benchmarking
Benchmarking Processors

o Workload:
  ● A set of programs run on a computer that is either the actual
    collection of applications run by a user, or is constructed from real
    programs to approximate such a mix
o Benchmark:
  ● A program selected for use in comparing computer performance
  ● Programs specifically chosen to measure performance
System Performance Evaluation Corporation (SPEC)

o Benchmark suite
  ● A collection of programs, defined in a high-level language
  ● Together they attempt to provide a representative test of a computer
    in a particular application or system programming area
o SPEC
  ● An industry consortium
  ● An effort funded and supported by a number of computer vendors to
    create standard sets of benchmarks
  ● Defines and maintains the best-known collection of benchmark suites
    aimed at evaluating computer systems
  ● Its performance measurements are widely used for comparison and
    research purposes
Benchmark Principles
Desirable characteristics of a benchmark program:
1. It is written in a high-level language, making it portable
across different machines
2. It is representative of a particular kind of programming
domain or paradigm, such as systems programming,
numerical programming, or commercial programming
3. It can be measured easily
4. It has wide distribution
SPEC CPU2006

o Best-known SPEC benchmark suite
o Industry-standard suite for processor-intensive applications
o Appropriate for measuring performance of applications that spend most of
  their time doing computation rather than I/O
o Consists of 17 floating-point programs written in C, C++, and Fortran,
  and 12 integer programs written in C and C++
o The suite contains over 3 million lines of code
o Fifth generation of processor-intensive suites from SPEC
SPEC CPU2006 Integer Benchmarks

    Benchmark       Ref. time  Instr count  Language  Application Area             Brief Description
                    (hours)    (billion)
    400.perlbench     2.71       2,378        C       Programming Language         PERL programming language interpreter, applied to a set of three programs.
    401.bzip2         2.68       2,472        C       Compression                  General-purpose data compression with most work done in memory, rather than doing I/O.
    403.gcc           2.24       1,064        C       C Compiler                   Based on gcc Version 3.2; generates code for Opteron.
    429.mcf           2.53         327        C       Combinatorial Optimization   Vehicle scheduling algorithm.
    445.gobmk         2.91       1,603        C       Artificial Intelligence      Plays the game of Go, a simply described but deeply complex game.
    456.hmmer         2.59       3,363        C       Search Gene Sequence         Protein sequence analysis using profile hidden Markov models.
    458.sjeng         3.36       2,383        C       Artificial Intelligence      A highly ranked chess program that also plays several chess variants.
    462.libquantum    5.76       3,555        C       Physics / Quantum Computing  Simulates a quantum computer, running Shor's polynomial-time factorization algorithm.
    464.h264ref       6.15       3,731        C       Video Compression            H.264/AVC (Advanced Video Coding) video compression.
    471.omnetpp       1.74         687        C++     Discrete Event Simulation    Uses the OMNet++ discrete event simulator to model a large Ethernet campus network.
    473.astar         1.95       1,200        C++     Path-finding Algorithms      Pathfinding library for 2D maps.
    483.xalancbmk     1.92       1,184        C++     XML Processing               A modified version of Xalan-C++, which transforms XML documents to other document types.
SPEC CPU2006 Floating-Point Benchmarks

    Benchmark       Ref. time  Instr count  Language    Application Area                  Brief Description
                    (hours)    (billion)
    410.bwaves        3.78       1,176      Fortran     Fluid Dynamics                    Computes 3D transonic transient laminar viscous flow.
    416.gamess        5.44       5,189      Fortran     Quantum Chemistry                 Quantum chemical computations.
    433.milc          2.55         937      C           Physics / Quantum Chromodynamics  Simulates behavior of quarks and gluons.
    434.zeusmp        2.53       1,566      Fortran     Physics / CFD                     Computational fluid dynamics simulation of astrophysical phenomena.
    435.gromacs       1.98       1,958      C, Fortran  Biochemistry / Molecular Dynamics Simulates Newtonian equations of motion for hundreds to millions of particles.
    436.cactusADM     3.32       1,376      C, Fortran  Physics / General Relativity      Solves the Einstein evolution equations.
    437.leslie3d      2.61       1,273      Fortran     Fluid Dynamics                    Models fuel injection flows.
    444.namd          2.23       2,483      C++         Biology / Molecular Dynamics      Simulates large biomolecular systems.
    447.dealII        3.18       2,323      C++         Finite Element Analysis           Program library targeted at adaptive finite elements and error estimation.
    450.soplex        2.32         703      C++         Linear Programming, Optimization  Test cases include railroad planning and military airlift models.
    453.povray        1.48         940      C++         Image Ray-tracing                 3D image rendering.
    454.calculix      2.29       3,04`      C, Fortran  Structural Mechanics              Finite element code for linear and nonlinear 3D structural applications.
    459.GemsFDTD      2.95       1,320      Fortran     Computational Electromagnetics    Solves the Maxwell equations in 3D.
    465.tonto         2.73       2,392      Fortran     Quantum Chemistry                 Quantum chemistry package, adapted for crystallographic tasks.
    470.lbm           3.82       1,500      C           Fluid Dynamics                    Simulates incompressible fluids in 3D.
    481.wrf           3.10       1,684      C, Fortran  Weather                           Weather forecasting model.
    482.sphinx3       5.41       2,472      C           Speech Recognition                Speech recognition software.
Terms Used in SPEC Documentation

o Benchmark
  ● A program written in a high-level language that can be compiled and
    executed on any computer that implements the compiler
o System under test
  ● The system to be evaluated
o Reference machine
  ● A system used by SPEC to establish a baseline performance for all
    benchmarks
  ♦ Each benchmark is run and measured on this machine to establish a
    reference time for that benchmark
o Base metric
  ● Required for all reported results; has strict guidelines for
    compilation
o Peak metric
  ● Enables users to attempt to optimize system performance by optimizing
    the compiler output
o Speed metric
  ● Simply a measurement of the time it takes to execute a compiled
    benchmark
  ♦ Used for comparing the ability of a computer to complete single tasks
o Rate metric
  ● A measurement of how many tasks a computer can accomplish in a certain
    amount of time
  ♦ Called a throughput, capacity, or rate measure
  ♦ Allows the system under test to execute simultaneous tasks to take
    advantage of multiple processors
Figure 2.7 SPEC Evaluation Flowchart

[Flowchart: for each benchmark program, run the program three times and
select the median execution time; compute
Ratio(prog) = Tref(prog) / TSUT(prog); when no programs remain, compute
the geometric mean of all ratios.]
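As a sketch, the same procedure in Python, using the first two rows of
Table 2.7(a) below as sample input:

    from math import prod
    from statistics import median

    # benchmark -> (three measured times, reference time), from Table 2.7(a)
    results = {
        "400.perlbench": ([3077, 3076, 3080], 9770),
        "401.bzip2":     ([3260, 3263, 3260], 9650),
    }

    ratios = [ref / median(times) for times, ref in results.values()]
    geo_mean = prod(ratios) ** (1 / len(ratios))   # overall score
    print(f"geometric mean of ratios: {geo_mean:.2f}")   # ~3.07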
Table 2.7 Some SPEC CINT2006 Results

(a) Sun Blade 1000

    Benchmark        Exec.   Exec.   Exec.   Reference   Ratio
                     time    time    time    time
    400.perlbench    3077    3076    3080      9770      3.18
    401.bzip2        3260    3263    3260      9650      2.96
    403.gcc          2711    2701    2702      8050      2.98
    429.mcf          2356    2331    2301      9120      3.91
    445.gobmk        3319    3310    3308     10490      3.17
    456.hmmer        2586    2587    2601      9330      3.61
    458.sjeng        3452    3449    3449     12100      3.51
    462.libquantum  10318   10319   10273     20720      2.01
    464.h264ref      5246    5290    5259     22130      4.21
    471.omnetpp      2565    2572    2582      6250      2.43
    473.astar        2522    2554    2565      7020      2.75
    483.xalancbmk    2014    2018    2018      6900      3.42
[Figure: CINT2006 results for the Intel Core i7 920.]
Comparing and Summarizing Performance

o How do we summarize the performance of a benchmark set with a single
  number?
o The average of execution times that is directly proportional to total
  execution time is the arithmetic mean (AM):

    AM = (1/n) × Σ Time_i,  summed over i = 1 … n

  ● where Time_i is the execution time for the i-th program of a total of
    n programs in the workload
  ● A smaller mean indicates a smaller average execution time and thus
    improved performance
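A sketch of the arithmetic mean of a workload's execution times (the times
below are made up for illustration):

    times = [2.5, 2.2, 3.1]            # hypothetical execution times, seconds
    am = sum(times) / len(times)       # AM is proportional to total time
    print(f"AM = {am:.2f} s")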
o The guiding principle in reporting performance measurements is
  reproducibility: list everything another experimenter would need to
  duplicate the experiment (version of the operating system, compiler
  settings, input set used, and the specific computer configuration:
  clock rate, cache sizes and speed, memory size and speed, etc.).
Summary: Evaluating ISAs

o Design-time metrics:
  ● Can it be implemented? In how long, and at what cost?
  ● Can it be programmed? Ease of compilation?
o Static metrics:
  ● How many bytes does the program occupy in memory?
o Dynamic metrics:
  ● How many instructions are executed? How many bytes does the processor
    fetch to execute the program?
  ● CPI: how many clocks are required per instruction?
  ● How "lean" a clock is practical?
o Best metric: time to execute the program! It depends on the instruction
  set, the processor organization, and compilation techniques
  (Time = Inst. Count × CPI × Cycle Time).
CMOS Technology
MOSFET: Metal Oxide Semiconductor Field Effect Transistor

● A voltage-controlled device
● Handles less current than a BJT (slower)
● Dissipates less power
● Achieves higher density on an IC
● Has full voltage swing, 0 → 5 V
An nMOS Transistor

[Figure: a standard nMOS transistor, showing the source, control gate, and
drain terminals, the silicon dioxide gate insulator, and the silicon
substrate, together with the Ids vs. Vgs characteristic.]
Implementing Logic using nMOS vs. pMOS Devices
Static Complementary MOS (CMOS)

[Figure: a static CMOS gate. A pull-up network (PUN, pMOS only) connects
the output F(In1, In2, …, InN) to VDD, and a pull-down network (PDN, nMOS
only) connects it to VSS; the inputs In1 … InN drive both networks. PUN
and PDN are dual logic networks.]

At every point in time (except during switching transients), each gate
output is connected to either VDD or VSS via a low-resistance path.
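The PUN/PDN duality can be illustrated with a small boolean model; a sketch
for a NAND gate (the function and names are mine, not from the slides):

    def cmos_nand(a: bool, b: bool) -> int:
        pull_down = a and b              # nMOS in series: conducts when both inputs are 1
        pull_up = (not a) or (not b)     # pMOS in parallel: conducts when either input is 0
        assert pull_up != pull_down      # duality: exactly one network conducts at a time
        return 1 if pull_up else 0       # output connects to VDD (1) or GND (0)

    for a in (False, True):
        for b in (False, True):
            print(int(a), int(b), cmos_nand(a, b))   # NAND truth table: 1 1 1 0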
CMOS Inverter

[Figure: a CMOS inverter, with a pMOS pull-up network between VDD and the
output Y, and an nMOS pull-down network between Y and GND, both driven by
the input A.]

    A | Y
    0 | 1
    1 | 0

o When A = 1: the pMOS device is OFF, the nMOS device is ON, so Y = 0.
o When A = 0: the pMOS device is ON, the nMOS device is OFF, so Y = 1.
Example Gate: NAND

[Figure: CMOS NAND gate schematic.]

Example Gate: NOR

[Figure: CMOS NOR gate schematic.]
CMOS Properties

o There is always a path from one supply (VDD or GND) to the output.
o There is never a path from one supply to the other. (This is the basis
  for the low power dissipation in CMOS: virtually no static power
  dissipation.)
o There is a momentary drain of current (and thus power consumption) when
  the gate switches from one state to another.
  ● Thus, CMOS circuits have dynamic power dissipation.
  ● The amount of power depends on the switching frequency.
Technology Scaling
o Technology scaling has a threefold objective:
  ● Increase the transistor density
  ● Reduce the gate delay
  ● Stabilize dynamic power consumption
o The main challenges faced are static power consumption, power delivery,
  and power density, which determine the performance of the chip.
o Finding solutions to these challenges makes technology scaling an
  important issue for both academia and industry.
Power Consumption
Sources of Power Consumption

o Power has three components:
  ● Static power: consumed even when the input isn't switching
  ● Dynamic capacitive power: due to charging and discharging of the load
    capacitance
  ● Dynamic short-circuit power: direct current from VDD to GND while both
    transistors are momentarily on
Dynamic Power

o Dynamic power is required to charge and discharge load capacitances when
  transistors switch.
o One cycle involves a rising and a falling output:
  ● On a rising output, a charge Q = C × VDD is drawn from the supply
  ● On a falling output, that charge is dumped to GND

[Figure: a CMOS gate drawing supply current iDD(t) from VDD into a load
capacitance C switched at frequency fsw, showing (1) the charge/discharge
current and (2) the short-circuit current.]

o Short-circuit power is typically <10% of dynamic power.
Dynamic Power

o Suppose the load C is switched between VDD and GND at an average
  frequency fsw. Over a time T, the load is charged and discharged
  T × fsw times, and each complete charge/discharge cycle transfers a
  total charge Q = C × VDD between VDD and GND.
o Taking the integral of the supply current over the interval T as the
  total charge delivered during T:

    P_dynamic = (1/T) ∫₀ᵀ i_DD(t) × VDD dt
              = (VDD/T) ∫₀ᵀ i_DD(t) dt
              = (VDD/T) × (T × fsw × C × VDD)
              = C × VDD² × fsw
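A sketch that plugs representative numbers into the result (the component
values are assumptions for illustration, not from the slides):

    C   = 1e-12      # assumed load capacitance: 1 pF
    VDD = 1.0        # assumed supply voltage: 1 V
    fsw = 4.0e9      # assumed average switching frequency: 4 GHz

    p_dynamic = C * VDD**2 * fsw     # P = C x VDD^2 x fsw
    print(f"{p_dynamic * 1e3:.1f} mW per switched node")   # 4.0 mW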
Lowering Dynamic Power

    P_dyn = C_L × VDD² × fsw

o Capacitance C_L: a function of fan-out, wire length, and transistor
  sizes
o Supply voltage VDD: has been dropping with successive process
  generations
o Clock frequency fsw: increasing…
Power Trends

o In CMOS IC technology:

    Power = Capacitive Load × Voltage² × Frequency

o Over recent generations, the supply voltage has dropped from 5 V to 1 V
  while the clock frequency has grown by roughly ×1000, and power has
  grown by roughly ×30.
Power as a Performance Metric

o Power consumption matters, especially in the embedded market, where
  battery life is important.
o For power-limited applications, the most important metric is energy
  efficiency.
Reducing Power

o Suppose a new CPU has
  ● 85% of the capacitive load of the old CPU
  ● 15% voltage reduction and 15% frequency reduction

    P_new / P_old = (C_old × 0.85) × (V_old × 0.85)² × (F_old × 0.85)
                    / (C_old × V_old² × F_old)
                  = 0.85⁴ ≈ 0.52
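As a sketch, the same scaling check in Python:

    cap, volt, freq = 0.85, 0.85, 0.85    # relative scaling of C, VDD, fsw
    p_ratio = cap * volt**2 * freq        # P scales as C x V^2 x f
    print(f"P_new / P_old = {p_ratio:.2f}")   # 0.52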
o The power wall:
  ● We can't reduce voltage further
  ● We can't remove more heat
o How else can we improve performance?
Enhancing Performance
Improvements in Chip Organization and Architecture

o Increase the hardware speed of the processor
  ● Fundamentally due to shrinking logic gate size
  ♦ More gates, packed more tightly, increasing the clock rate
  ♦ Reduced propagation time for signals
o Increase the size and speed of caches
  ● Dedicating part of the processor chip to cache
  ♦ Cache access times drop significantly
o Change processor organization and architecture
  ● Increase the effective speed of instruction execution
  ● Parallelism
Processor Organization & Architecture

Techniques built into contemporary processors include:

Pipelining
• The processor moves data or instructions into a conceptual pipe, with
  all stages of the pipe processing simultaneously.

Branch prediction
• The processor looks ahead in the instruction code fetched from memory
  and predicts which branches, or groups of instructions, are likely to be
  processed next.

Superscalar execution
• The ability to issue more than one instruction in every processor clock
  cycle. (In effect, multiple parallel pipelines are used.)

Data flow analysis
• The processor analyzes which instructions are dependent on each other's
  results, or data, to create an optimized schedule of instructions.

Speculative execution
• Using branch prediction and data flow analysis, some processors
  speculatively execute instructions ahead of their actual appearance in
  the program execution, holding the results in temporary locations and
  keeping the execution engines as busy as possible.
The Latest Revolution: Multicores

• The power challenge has forced a change in the design of
  microprocessors.
  – Since 2002 the rate of improvement in the response time of programs on
    desktop computers has slowed from a factor of 1.5 per year to less
    than a factor of 1.2 per year.
• Since 2006, all desktop and server companies have been shipping
  microprocessors with multiple processors (cores) per chip.

    Product          AMD Barcelona  Intel Nehalem  IBM Power 6  Sun Niagara 2
    Cores per chip         4              4             2             8
    Clock rate          2.5 GHz       ~2.5 GHz?      4.7 GHz       1.4 GHz
    Power                120 W         ~100 W?       ~100 W?        94 W

• The plan is to double the number of cores per chip per generation (about
  every two years).
Multicore

o The use of multiple processors on the same chip provides the potential
  to increase performance without increasing the clock rate.
o The strategy is to use two simpler processors on the chip rather than
  one more complex processor.
o With two processors, larger caches are justified.
o As caches became larger, it made performance sense to create two and
  then three levels of cache on a chip.
Amdahl's Law

• A rule stating that the performance enhancement possible with a given
  improvement is limited by the amount that the improved feature is used.
• It is a quantitative version of the law of diminishing returns.
Amdahl's Law

o Deals with the potential speedup of a program using multiple processors
  compared to a single processor.
o Illustrates the problems facing industry in the development of
  multicores:
  ● Software must be adapted to a highly parallel execution environment to
    exploit the power of parallel processing.
o Can be generalized to evaluate and design technical improvements in a
  computer system.

[Figure: speedup vs. number of processors for parallel fractions f = 0.95,
0.90, 0.75, and 0.5; the speedup saturates at 1/(1 − f) as the processor
count grows.]
Amdahl's Law

• Amdahl's law is used to find the maximum expected improvement to an
  overall system when only part of the system is improved.
• It is often used in parallel computing to predict the theoretical
  maximum speedup using n processors.
• The speedup of a program using multiple processors in parallel computing
  is limited by the time needed for the sequential fraction of the
  program.
Amdahl's Law

• What amount of speedup can you possibly expect from parallelization,
  given the existence of some non-parallelizable code in every program?
• Famous equation (notation):
  – Ts = time to execute the serial version
  – Tp = time to execute the parallel version
  – 1/s = inherently serial fraction of the computation
• Limitations?
Amdahl's Law

    T = 1 / ((1 − a) + a/s)

T = overall speedup
a = fraction of the original program that can be enhanced, by executing it
    in parallel or transferring it to hardware
s = expected speedup obtained (from hardware) for that particular fraction
    of the program; in other words, the number of processors used, or the
    hardware accelerators or resources that can be used or created to
    enhance performance
Amdahl's Law: Example

• Assume you profiled an application and noticed that 12% of the
  application can be parallelized and 88% of the operations are not
  parallelizable.
• Question: What is the maximum speedup you can achieve? Assume an
  infinite number of processors is available!
• Answer: Amdahl's law states that the maximum speedup of the parallelized
  version is
  – 1/(1 − 0.12) ≈ 1.136 times as fast as the non-parallelized
    implementation
  – (we are assuming that s is infinite, so the a/s term vanishes)
Amdahl's Law: Example

• Floating-point instructions are improved to run 2× faster, but only 10%
  of the executed instructions are FP:

    Speedup_overall = 1 / ((1 − 0.1) + 0.1/2) = 1 / 0.95 = 1.053

Law of diminishing returns: focus on the common case!
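Both examples follow from one small function; a sketch in Python:

    def amdahl_speedup(a, s):
        """Overall speedup when a fraction `a` of the work is sped up by `s`."""
        return 1.0 / ((1.0 - a) + a / s)

    print(amdahl_speedup(0.12, float("inf")))  # ~1.136: 12% parallel, infinite s
    print(amdahl_speedup(0.10, 2.0))           # ~1.053: 10% FP, 2x faster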
Summary

• Cost/performance is improving
  – Due to underlying technology development
• Instruction set architecture
  – The hardware/software interface
• Execution time: the best performance measure
• Power is a limiting factor
  – Use parallelism to improve performance
• Several techniques to improve performance:
  – Pipelining
  – Superscalar execution
  – Multiprocessing
  – Memory hierarchy (cache)
(b) Sun Blade X6250

    Benchmark        Exec.  Exec.  Exec.  Reference  Ratio   Rate
                     time   time   time   time
    400.perlbench     497    497    497     9770     19.66   78.63
    401.bzip2         613    614    613     9650     15.74   62.97
    403.gcc           529    529    529     8050     15.22   60.87
    429.mcf           472    472    473     9120     19.32   77.29
    445.gobmk         637    637    637    10490     16.47   65.87
    456.hmmer         446    446    446     9330     20.92   83.68
    458.sjeng         631    632    630    12100     19.18   76.70
    462.libquantum    614    614    614    20720     33.75  134.98
    464.h264ref       830    830    830    22130     26.66  106.65
    471.omnetpp       619    620    619     6250     10.10   40.39
    473.astar         580    580    580     7020     12.10   48.41
    483.xalancbmk     422    422    422     6900     16.35   65.40
Power Dissipation

Many Integrated Core (MIC) and Graphics Processing Unit (GPU)

MIC
o A leap in performance, as well as challenges in developing software to
  exploit such a large number of cores
o The multicore and MIC strategy involves a homogeneous collection of
  general-purpose processors on a single chip

GPU
o A core designed to perform parallel operations on graphics data
o Traditionally found on a plug-in graphics card; used to encode and
  render 2D and 3D graphics as well as to process video
o Used as a vector processor for a variety of applications that require
  repetitive computations
Multicore Computer Structure

o Central processing unit (CPU)
  ● The portion of the computer that fetches and executes instructions
  ● Consists of an ALU, a control unit, and registers
  ● Referred to as a processor in a system with a single processing unit
o Core
  ● An individual processing unit on a processor chip
  ● May be equivalent in functionality to a CPU on a single-CPU system
  ● Specialized processing units are also referred to as cores
o Processor
  ● A physical piece of silicon containing one or more cores
  ● The computer component that interprets and executes instructions
  ● Referred to as a multicore processor if it contains multiple cores

[Figure 1.2 Simplified View of the Major Elements of a Multicore Computer:
a motherboard holding main memory chips, I/O chips, and a processor chip;
the processor chip contains multiple cores and a shared L3 cache; each
core contains instruction logic, an ALU, load/store logic, L1 instruction
and data caches, and L2 instruction and data caches.]