EECC551 - Shaaban

Download Report

Transcript EECC551 - Shaaban

The Von Neumann Computer Model
• Partitioning of the computing engine into components:
– Central Processing Unit (CPU): Control Unit (instruction decode , sequencing
of operations), Datapath (registers, arithmetic and logic unit, buses).
– Memory: Instruction and operand storage.
– Input/Output (I/O) sub-system: I/O bus, interfaces, devices.
– The stored program concept: Instructions from an instruction set are fetched
from a common memory and executed one at a time
Control
Input
Memory
(instructions,
data)
Computer System
Datapath
registers
ALU, buses
Output
CPU
I/O Devices
EECC551 - Shaaban
#1 Lec # 1 Fall 2000 9-7-2000
Generic CPU Machine Instruction Execution Steps
Instruction
Fetch
Instruction
Decode
Operand
Fetch
Execute
Result
Store
Next
Obtain instruction from program storage
Determine required actions and instruction size
Locate and obtain operand data
Compute result value or status
Deposit results in storage for later use
Determine successor or next instruction
Instruction
EECC551 - Shaaban
#2 Lec # 1 Fall 2000 9-7-2000
Hardware Components of Any Computer
Five classic components of all computers:
1. Control Unit; 2. Datapath; 3. Memory; 4. Input; 5. Output
}
Processor
Computer
Processor
(active)
Control
Unit
Datapath
Memory
(passive)
(where
programs,
data
live when
running)
Devices
Keyboard,
Mouse, etc.
Input
Disk
Output
Display,
Printer, etc.
EECC551 - Shaaban
#3 Lec # 1 Fall 2000 9-7-2000
CPU Organization
• Datapath Design:
– Capabilities & performance characteristics of principal
Functional Units (FUs):
• (e.g., Registers, ALU, Shifters, Logic Units, ...)
– Ways in which these components are interconnected (buses
connections, multiplexors, etc.).
– How information flows between components.
• Control Unit Design:
– Logic and means by which such information flow is controlled.
– Control and coordination of FUs operation to realize the targeted
Instruction Set Architecture to be implemented (can either be
implemented using a finite state machine or a microprogram).
• Hardware description with a suitable language, possibly
using Register Transfer Notation (RTN).
EECC551 - Shaaban
#4 Lec # 1 Fall 2000 9-7-2000
Recent Trends in Computer Design
• The cost/performance ratio of computing systems have seen a
steady decline due to advances in:
– Integrated circuit technology: decreasing feature size,
• Clock rate improves roughly proportional to improvement in 
• Number of transistors improves proportional to  (or faster).
– Architectural improvements in CPU design.
• Microprocessor systems directly reflect IC improvement in terms
of a yearly 35 to 55% improvement in performance.
• Assembly language has been mostly eliminated and replaced by
other alternatives such as C or C++
• Standard operating Systems (UNIX, NT) lowered the cost of
introducing new architectures.
• Emergence of RISC architectures and RISC-core architectures.
• Adoption of quantitative approaches to computer design based on
empirical performance observations.
EECC551 - Shaaban
#5 Lec # 1 Fall 2000 9-7-2000
Processor Performance Trends
Mass-produced microprocessors a cost-effective high-performance
replacement for custom-designed mainframe/minicomputer CPUs
1000
Supercomputers
100
Mainframes
10
Minicomputers
Microprocessors
1
0.1
1965
1970
1975
1980 1985
Year
1990
1995
2000
EECC551 - Shaaban
#6 Lec # 1 Fall 2000 9-7-2000
Microprocessor Performance
1987-97
1200
DEC Alpha 21264/600
1000
800
Integer SPEC92 Performance
600
DEC Alpha 5/500
400
200
0
DEC Alpha 5/300
DEC
HP
SunMIPSMIPSIBM 9000/AXP/
DEC Alpha 4/266
-4/ M M/ RS/ 750 500
IBM POWER 100
260 2000 1206000
87 88 89 90 91 92 93 94 95 96 97
EECC551 - Shaaban
#7 Lec # 1 Fall 2000 9-7-2000
Microprocessor Frequency Trend
10,000
100
Intel
DEC
Gate delays/clock
21264S
1,000
Mhz
21164A
21264
Pentium(R)
21064A
21164
II
21066
MPC750
604
604+
10
Pentium Pro
601, 603 (R)
Pentium(R)
100
Gate Delays/ Clock
Processor freq
scales by 2X per
generation
IBM Power PC
486
386
2005
2003
2001
1999
1997
1995
1993
1991
1989
1
1987
10
 Frequency doubles each generation
 Number of gates/clock reduce by 25%
EECC551 - Shaaban
#8 Lec # 1 Fall 2000 9-7-2000
Microprocessor Transistor
Count Growth Rate
100000000
Alpha 21264: 15 million
Pentium Pro: 5.5 million
PowerPC 620: 6.9 million
Alpha 21164: 9.3 million
Sparc Ultra: 5.2 million
10000000
Moore’s Law
Pentium
i80486
Transistors
1000000
i80386
i80286
100000
Moore’s Law:
i8086
10000
2X transistors/Chip
Every 1.5 years
i8080
i4004
1000
1970
1975
1980
1985
1990
1995
2000
Year
EECC551 - Shaaban
#9 Lec # 1 Fall 2000 9-7-2000
Increase of Capacity of VLSI Dynamic RAM Chips
size
year
size(Megabit)
1980
0.0625
1983
0.25
1986
1
1989
4
1992
16
1996
64
1999
256
2000
1024
1000000000
100000000
Bits
10000000
1000000
100000
10000
1000
1970
1975
1980
1985
Year
1990
1995
2000
1.55X/yr,
or doubling every 1.6
years
EECC551 - Shaaban
#10 Lec # 1 Fall 2000 9-7-2000
DRAM Cost Over Time
Current second half 1999 cost:
~ $1 per MB
EECC551 - Shaaban
#11 Lec # 1 Fall 2000 9-7-2000
Recent Technology Trends
(Summary)
Capacity
Speed (latency)
Logic 2x in 3 years
2x in 3 years
DRAM 4x in 3 years
2x in 10 years
Disk
2x in 10 years
4x in 3 years
EECC551 - Shaaban
#12 Lec # 1 Fall 2000 9-7-2000
Computer Technology Trends:
Evolutionary but Rapid Change
• Processor:
– 2X in speed every 1.5 years; 1000X performance in last decade.
• Memory:
– DRAM capacity: > 2x every 1.5 years; 1000X size in last decade.
– Cost per bit: Improves about 25% per year.
• Disk:
–
–
–
–
Capacity: > 2X in size every 1.5 years.
Cost per bit: Improves about 60% per year.
200X size in last decade.
Only 10% performance improvement per year, due to mechanical
limitations.
• Expected State-of-the-art PC by end of year 2000 :
– Processor clock speed:
– Memory capacity:
– Disk capacity:
> 1500 MegaHertz (1.5 GigaHertz)
> 500 MegaByte (0.5 GigaBytes)
> 100 GigaBytes (0.1 TeraBytes)
EECC551 - Shaaban
#13 Lec # 1 Fall 2000 9-7-2000
Distribution of Cost in a System: An Example
Decreasing
fraction
of total cost
Increasing
fraction
of total cost
EECC551 - Shaaban
#14 Lec # 1 Fall 2000 9-7-2000
A Simplified View of The
Software/Hardware Hierarchical Layers
EECC551 - Shaaban
#15 Lec # 1 Fall 2000 9-7-2000
A Hierarchy of Computer Design
Level Name
1
Modules
Electronics
2
Logic
3
Organization
Gates, FF’s
Registers, ALU’s ...
Processors, Memories
Primitives
Descriptive Media
Transistors, Resistors, etc.
Gates, FF’s ….
Circuit Diagrams
Logic Diagrams
Registers, ALU’s …
Register Transfer
Notation (RTN)
Low Level - Hardware
4 Microprogramming
Assembly Language
Microinstructions
Microprogram
Firmware
5 Assembly language
programming
6 Procedural
Programming
7
Application
OS Routines
Applications
Drivers ..
Systems
Assembly language
Instructions
Assembly Language
Programs
OS Routines
High-level Languages
High-level Language
Programs
Procedural Constructs
Problem-Oriented
Programs
High Level - Software
EECC551 - Shaaban
#16 Lec # 1 Fall 2000 9-7-2000
Hierarchy of Computer Architecture
High-Level Language Programs
Software
Assembly Language
Programs
Application
Operating
System
Machine Language
Program
Compiler
Software/Hardware
Boundary
Firmware
Instr. Set Proc. I/O system
Instruction Set
Architecture
Datapath & Control
Hardware
Digital Design
Circuit Design
Microprogram
Layout
Logic Diagrams
Register Transfer
Notation (RTN)
Circuit Diagrams
EECC551 - Shaaban
#17 Lec # 1 Fall 2000 9-7-2000
Computer Architecture Vs. Computer Organization
• The term Computer architecture is sometimes erroneously restricted
to computer instruction set design, with other aspects of computer
design called implementation
• More accurate definitions:
– Instruction set architecture (ISA): The actual programmervisible instruction set and serves as the boundary between the
software and hardware.
– Implementation of a machine has two components:
• Organization: includes the high-level aspects of a computer’s
design such as: The memory system, the bus structure, the
internal CPU unit which includes implementations of arithmetic,
logic, branching, and data transfer operations.
• Hardware: Refers to the specifics of the machine such as detailed
logic design and packaging technology.
• In general, Computer Architecture refers to the above three aspects:
Instruction set architecture, organization, and hardware.
EECC551 - Shaaban
#18 Lec # 1 Fall 2000 9-7-2000
Computer Architecture’s Changing
Definition
• 1950s to 1960s:
Computer Architecture Course = Computer Arithmetic.
• 1970s to mid 1980s:
Computer Architecture Course = Instruction Set Design,
especially ISA appropriate for compilers.
• 1990s:
Computer Architecture Course = Design of CPU,
memory system, I/O system, Multiprocessors.
EECC551 - Shaaban
#19 Lec # 1 Fall 2000 9-7-2000
The Task of A Computer Designer
• Determine what attributes that are important to the
design of the new machine.
• Design a machine to maximize performance while
staying within cost and other constraints and metrics.
• It involves more than instruction set design.
– Instruction set architecture.
– CPU Micro-Architecture.
– Implementation.
• Implementation of a machine has two components:
– Organization.
– Hardware.
EECC551 - Shaaban
#20 Lec # 1 Fall 2000 9-7-2000
Recent Architectural Improvements
• Increased optimization and utilization of cache systems.
• Memory-latency hiding techniques.
• Optimization of pipelined instruction execution.
• Dynamic hardware-based pipeline scheduling.
• Improved handling of pipeline hazards.
• Improved hardware branch prediction techniques.
• Exploiting Instruction-Level Parallelism (ILP) in terms of
multiple-instruction issue and multiple hardware functional
units.
• Inclusion of special instructions to handle multimedia
applications.
• High-speed bus designs to improve data transfer rates.
EECC551 - Shaaban
#21 Lec # 1 Fall 2000 9-7-2000
The Concept of Memory Hierarchy
CPU
Registers
C
a
c
h
e
< 4MB
Memory
< 4 GB
60ns
I/O dev.
>2 GB
5ms
EECC551 - Shaaban
#22 Lec # 1 Fall 2000 9-7-2000
Typical Parameters of Memory Hierarchy Levels
EECC551 - Shaaban
#23 Lec # 1 Fall 2000 9-7-2000
Current Computer Architecture Topics
Input/Output and Storage
Disks, WORM, Tape
Emerging Technologies
Interleaving
Bus protocols
DRAM
Memory
Hierarchy
Coherence,
Bandwidth,
Latency
L2 Cache
L1 Cache
VLSI
Instruction Set Architecture
Addressing,
Protection,
Exception Handling
Pipelining, Hazard Resolution, Superscalar,
Reordering, Branch Prediction, Speculation,
VLIW, Vector, DSP, ...
Multiprocessing,
Simultaneous CPU Multi-threading
RAID
Pipelining and Instruction
Level Parallelism (ILP)
Thread Level Parallelism (TLB)
EECC551 - Shaaban
#24 Lec # 1 Fall 2000 9-7-2000
Computer Performance Evaluation:
Cycles Per Instruction (CPI)
• Most computers run synchronously utilizing a CPU clock
running at a constant clock rate:
where: Clock rate = 1 / clock cycle
• A computer machine instruction is comprised of a number of
elementary or micro operations which vary in number and
complexity depending on the instruction and the exact CPU
organization and implementation.
– A micro operation is an elementary hardware operation that can be
performed during one clock cycle.
– This corresponds to one micro-instruction in microprogrammed CPUs.
– Examples: register operations: shift, load, clear, increment, ALU
operations: add , subtract, etc.
• Thus a single machine instruction may take one or more cycles to
complete termed as the Cycles Per Instruction (CPI).
EECC551 - Shaaban
#25 Lec # 1 Fall 2000 9-7-2000
Computer Performance Measures:
Program Execution Time
• For a specific program compiled to run on a specific machine “A”,
the following parameters are provided:
– The total instruction count of the program.
– The average number of cycles per instruction (average CPI).
– Clock cycle of machine “A”
•
How can one measure the performance of this machine running this
program?
– Intuitively the machine is said to be faster or has better performance
running this program if the total execution time is shorter.
– Thus the inverse of the total measured program execution time is a
possible performance measure or metric:
PerformanceA = 1 / Execution TimeA
How to compare performance of different machines?
What factors affect performance? How to improve performance?
EECC551 - Shaaban
#26 Lec # 1 Fall 2000 9-7-2000
Measuring Performance
• For a specific program or benchmark running on machine x:
Performance = 1 / Execution Timex
• To compare the performance of machines X, Y, executing specific code:
n = Executiony / Executionx
= Performance x / Performancey
•
System performance refers to the performance and elapsed time measured
on an unloaded machine.
•
•
CPU Performance refers to user CPU time on an unloaded system.
Example:
For a given program:
Execution time on machine A: ExecutionA = 1 second
Execution time on machine B: ExecutionB = 10 seconds
PerformanceA /PerformanceB = Execution TimeB /Execution TimeA = 10 /1 = 10
The performance of machine A is 10 times the performance of machine B when
running this program, or: Machine A is said to be 10 times faster than machine B
when running this program.
EECC551 - Shaaban
#27 Lec # 1 Fall 2000 9-7-2000
CPU Performance Equation
CPU time = CPU clock cycles for a program
X
Clock cycle time
or:
CPU time = CPU clock cycles for a program / clock rate
CPI (clock cycles per instruction):
CPI = CPU clock cycles for a program / I
where I is the instruction count.
EECC551 - Shaaban
#28 Lec # 1 Fall 2000 9-7-2000
CPU Execution Time: The CPU Equation
• A program is comprised of a number of instructions, I
– Measured in:
instructions/program
• The average instruction takes a number of cycles per
instruction (CPI) to be completed.
– Measured in: cycles/instruction
• CPU has a fixed clock cycle time C = 1/clock rate
– Measured in:
seconds/cycle
• CPU execution time is the product of the above three
parameters as follows:
CPU Time
=
I
x CPI
x
C
CPU time
= Seconds
Program
= Instructions x Cycles
Program
x Seconds
Instruction
Cycle
EECC551 - Shaaban
#29 Lec # 1 Fall 2000 9-7-2000
CPU Execution Time
For a given program and machine:
CPI = Total program execution cycles / Instructions count

CPU clock cycles = Instruction count x CPI
CPU execution time =
= CPU clock cycles x Clock cycle
= Instruction count x CPI x Clock cycle
=
I
x CPI x C
EECC551 - Shaaban
#30 Lec # 1 Fall 2000 9-7-2000
CPU Execution Time: Example
• A Program is running on a specific machine with the
following parameters:
– Total instruction count: 10,000,000 instructions
– Average CPI for the program: 2.5 cycles/instruction.
– CPU clock rate: 200 MHz.
• What is the execution time for this program:
CPU time
= Seconds
Program
= Instructions x Cycles
Program
Instruction
x Seconds
Cycle
CPU time = Instruction count x CPI x Clock cycle
= 10,000,000
x 2.5 x 1 / clock rate
= 10,000,000
x 2.5 x 5x10-9
= .125 seconds
EECC551 - Shaaban
#31 Lec # 1 Fall 2000 9-7-2000
Aspects of CPU Execution Time
CPU Time = Instruction count x CPI x Clock cycle
Depends on:
Program Used
Compiler
ISA
Instruction Count I
Depends on:
Program Used
Compiler
ISA
CPU Organization
CPI
Clock
Cycle
C
Depends on:
CPU Organization
Technology
EECC551 - Shaaban
#32 Lec # 1 Fall 2000 9-7-2000
Factors Affecting CPU Performance
CPU time
= Seconds
= Instructions x Cycles
Program
Program
Instruction
Instruction
Count I
CPI
Program
X
X
Compiler
X
X
Instruction Set
Architecture (ISA)
X
X
Organization
Technology
x Seconds
X
Cycle
Clock Cycle C
X
X
EECC551 - Shaaban
#33 Lec # 1 Fall 2000 9-7-2000
Performance Comparison: Example
• From the previous example: A Program is running on a specific
machine with the following parameters:
– Total instruction count: 10,000,000 instructions
– Average CPI for the program: 2.5 cycles/instruction.
– CPU clock rate: 200 MHz.
• Using the same program with these changes:
– A new compiler used: New instruction count 9,500,000
New CPI: 3.0
– Faster CPU implementation: New clock rate = 300 MHZ
• What is the speedup with the changes?
Speedup
= Old Execution Time = Iold x
New Execution Time Inew x
CPIold
x Clock cycleold
CPInew
x Clock Cyclenew
Speedup = (10,000,000 x 2.5 x 5x10-9) / (9,500,000 x 3 x 3.33x10-9 )
= .125 / .095 = 1.32
or 32 % faster after changes.
EECC551 - Shaaban
#34 Lec # 1 Fall 2000 9-7-2000
Instruction Types And CPI
• Given a program with n types or classes of
instructions with:
– Ci = Count of instructions of typei
– CPIi = Average cycles per instruction of typei
n
CPU clock cycles  
i 1
CPI  C 
i
i
EECC551 - Shaaban
#35 Lec # 1 Fall 2000 9-7-2000
Instruction Types And CPI: An Example
• An instruction set has three instruction classes:
Instruction class
A
B
C
CPI
1
2
3
• Two code sequences have the following instruction counts:
Code Sequence
1
2
Instruction counts for instruction class
A
B
C
2
1
2
4
1
1
• CPU cycles for sequence 1 = 2 x 1 + 1 x 2 + 2 x 3 = 10 cycles
CPI for sequence 1 = clock cycles / instruction count
= 10 /5 = 2
• CPU cycles for sequence 2 = 4 x 1 + 1 x 2 + 1 x 3 = 9 cycles
CPI for sequence 2 = 9 / 6 = 1.5
EECC551 - Shaaban
#36 Lec # 1 Fall 2000 9-7-2000
Instruction Frequency & CPI
• Given a program with n types or classes of
instructions with the following characteristics:
Ci = Count of instructions of typei
CPIi = Average cycles per instruction of typei
Fi = Frequency of instruction typei
= Ci / total instruction count
Then:
CPI   CPI i  F i 
n
i 1
EECC551 - Shaaban
#37 Lec # 1 Fall 2000 9-7-2000
Instruction Type Frequency & CPI:
A RISC Example
Base Machine (Reg / Reg)
Op
Freq Cycles CPIxF
ALU
50%
1
.5
Load
20%
5
1.0
Store
10%
3
.3
Branch
20%
2
.4
% Time
23%
45%
14%
18%
Typical Mix
CPI   CPI i  F i 
n
i 1
CPI = .5 x 1 + .2 x 5 + .1 x 3 + .2 x 2 = 2.2
EECC551 - Shaaban
#38 Lec # 1 Fall 2000 9-7-2000
Metrics of Computer Performance
Execution time: Target workload,
SPEC95, etc.
Application
Programming
Language
Compiler
ISA
(millions) of Instructions per second – MIPS
(millions) of (F.P.) operations per second – MFLOP/s
Datapath
Control
Megabytes per second.
Function Units
Transistors Wires Pins
Cycles per second (clock rate).
Each metric has a purpose, and each can be misused.
EECC551 - Shaaban
#39 Lec # 1 Fall 2000 9-7-2000
Choosing Programs To Evaluate Performance
Levels of programs or benchmarks that could be used to evaluate
performance:
– Actual Target Workload: Full applications that run on the
target machine.
– Real Full Program-based Benchmarks:
• Select a specific mix or suite of programs that are typical of
targeted applications or workload (e.g SPEC95).
– Small “Kernel” Benchmarks:
• Key computationally-intensive pieces extracted from real
programs.
– Examples: Matrix factorization, FFT, tree search, etc.
• Best used to test specific aspects of the machine.
– Microbenchmarks:
• Small, specially written programs to isolate a specific aspect
of performance characteristics: Processing: integer, floating
point, local memory, input/output, etc.
EECC551 - Shaaban
#40 Lec # 1 Fall 2000 9-7-2000
Types of Benchmarks
Cons
Pros
• Representative
• Portable.
• Widely used.
• Measurements
useful in reality.
Actual Target Workload
Full Application Benchmarks
• Easy to run, early in
the design cycle.
• Identify peak
performance and
potential bottlenecks.
Small “Kernel”
Benchmarks
Microbenchmarks
• Very specific.
• Non-portable.
• Complex: Difficult
to run, or measure.
• Less representative
than actual workload.
• Easy to “fool” by
designing hardware
to run them well.
• Peak performance
results may be a long
way from real application
performance
EECC551 - Shaaban
#41 Lec # 1 Fall 2000 9-7-2000
SPEC: System Performance
Evaluation Cooperative
• The most popular and industry-standard set of CPU benchmarks.
• SPECmarks, 1989:
– 10 programs yielding a single number (“SPECmarks”).
• SPEC92, 1992:
– SPECInt92 (6 integer programs) and SPECfp92 (14 floating point
programs).
• SPEC95, 1995:
– Eighteen new application benchmarks selected (with given inputs)
reflecting a technical computing workload.
– SPECint95 (8 integer programs):
• go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
– SPECfp95 (10 floating-point intensive programs):
• tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fppp, wave5
– Source code must be compiled with standard compiler flags.
EECC551 - Shaaban
#42 Lec # 1 Fall 2000 9-7-2000
SPEC95
Benchmark
Integer
Floating
Point
go
m88ksim
gcc
compress
li
ijpeg
perl
vortex
tomcatv
swim
su2cor
hydro2d
mgrid
applu
trub3d
apsi
fpppp
wave5
Description
Artificial intelligence; plays the game of Go
Motorola 88k chip simulator; runs test program
The Gnu C compiler generating SPARC code
Compresses and decompresses file in memory
Lisp interpreter
Graphic compression and decompression
Manipulates strings and prime numbers in the special-purpose programming language Perl
A database program
A mesh generation program
Shallow water model with 513 x 513 grid
quantum physics; Monte Carlo simulation
Astrophysics; Hydrodynamic Naiver Stokes equations
Multigrid solver in 3-D potential field
Parabolic/elliptic partial differential equations
Simulates isotropic, homogeneous turbulence in a cube
Solves problems regarding temperature, wind velocity, and distribution of pollutant
Quantum chemistry
Plasma physics; electromagnetic particle simulation
EECC551 - Shaaban
#43 Lec # 1 Fall 2000 9-7-2000
SPEC95 For High-End CPUs
Fourth Quarter 1999
EECC551 - Shaaban
#44 Lec # 1 Fall 2000 9-7-2000
Expected SPEC95 For High-End CPUs
First Quarter 2000
EECC551 - Shaaban
#45 Lec # 1 Fall 2000 9-7-2000
Comparing and Summarizing
Performance
• Total execution time of the compared machines.
• If n program runs or n programs are used:
– Arithmetic mean:
1 n

n Time
i
i 1
– Weighted Execution Time:
n
Weight  Time
i 1
i
i
– Normalized Execution time (arithmetic or geometric
mean). Formula for geometric mean:
n
n
 Execution_ time_ ratio
i 1
i
EECC551 - Shaaban
#46 Lec # 1 Fall 2000 9-7-2000
Computer Performance Measures :
MIPS (Million Instructions Per Second)
• For a specific program running on a specific computer is a measure of
millions of instructions executed per second:
MIPS =
=
=
=
Instruction count / (Execution Time x 106)
Instruction count / (CPU clocks x Cycle time x 106)
(Instruction count x Clock rate) / (Instruction count x CPI x 106)
Clock rate / (CPI x 106)
• Faster execution time usually means faster MIPS rating.
• Problems:
– No account for instruction set used.
– Program-dependent: A single machine does not have a single MIPS
rating.
– Cannot be used to compare computers with different instruction
sets.
– A higher MIPS rating in some cases may not mean higher
performance or better execution time. i.e. due to compiler design
variations.
EECC551 - Shaaban
#47 Lec # 1 Fall 2000 9-7-2000
Compiler Variations, MIPS, Performance:
An Example
• For the machine with instruction classes:
Instruction class
A
B
C
CPI
1
2
3
• For a given program two compilers produced the
following instruction counts:
Code from:
Compiler 1
Compiler 2
Instruction counts (in millions)
for each instruction class
A
B
C
5
1
1
10
1
1
• The machine is assumed to run at a clock rate of 100 MHz
EECC551 - Shaaban
#48 Lec # 1 Fall 2000 9-7-2000
Compiler Variations, MIPS, Performance:
An Example (Continued)
MIPS = Clock rate / (CPI x 106) = 100 MHz / (CPI x 106)
CPI = CPU execution cycles
/ Instructions count
n
CPU clock cycles   CPI i  Ci
i 1


CPU time = Instruction count x CPI / Clock rate
•
For compiler 1:
– CPI1 = (5 x 1 + 1 x 2 + 1 x 3) / (5 + 1 + 1) = 10 / 7 = 1.43
– MIP1 = 100 / (1.428 x 106) = 70.0
– CPU time1 = ((5 + 1 + 1) x 106 x 1.43) / (100 x 106) = 0.10 seconds
•
For compiler 2:
– CPI2 = (10 x 1 + 1 x 2 + 1 x 3) / (10 + 1 + 1) = 15 / 12 = 1.25
– MIP2 = 100 / (1.25 x 106) = 80.0
– CPU time2 = ((10 + 1 + 1) x 106 x 1.25) / (100 x 106) = 0.15 seconds
EECC551 - Shaaban
#49 Lec # 1 Fall 2000 9-7-2000
Computer Performance Measures :
MFOLPS (Million FLOating-Point Operations Per Second)
• A floating-point operation is an addition, subtraction, multiplication,
or division operation applied to numbers represented by a single or
double precision floating-point representation.
• MFLOPS, for a specific program running on a specific computer, is a
measure of millions of floating point-operation (megaflops) per
second:
MFLOPS = Number of floating-point operations / (Execution time x 106 )
• A better comparison measure between different machines than
MIPS.
• Program-dependent: Different programs have different percentages
of floating-point operations present. i.e compilers have no such
operations and yield a MFLOPS rating of zero.
• Dependent on the type of floating-point operations present in the
program.
EECC551 - Shaaban
#50 Lec # 1 Fall 2000 9-7-2000
Quantitative Principles
of Computer Design
• Amdahl’s Law:
The performance gain from improving some portion of
a computer is calculated by:
Speedup =
Performance for entire task using the enhancement
Performance for the entire task without using the enhancement
or Speedup =
Execution time without the enhancement
Execution time for entire task using the enhancement
EECC551 - Shaaban
#51 Lec # 1 Fall 2000 9-7-2000
Performance Enhancement Calculations:
Amdahl's Law
• The performance enhancement possible due to a given design
improvement is limited by the amount that the improved feature is used
• Amdahl’s Law:
Performance improvement or speedup due to enhancement E:
Execution Time without E
Speedup(E) = -------------------------------------Execution Time with E
Performance with E
= --------------------------------Performance without E
– Suppose that enhancement E accelerates a fraction F of the
execution time by a factor S and the remainder of the time is
unaffected then:
Execution Time with E = ((1-F) + F/S) X Execution Time without E
Hence speedup is given by:
Execution Time without E
1
Speedup(E) = --------------------------------------------------------- = -------------------((1 - F) + F/S) X Execution Time without E
(1 - F) + F/S
EECC551 - Shaaban
#52 Lec # 1 Fall 2000 9-7-2000
Pictorial Depiction of Amdahl’s Law
Enhancement E accelerates fraction F of execution time by a factor of S
Before:
Execution Time without enhancement E:
Unaffected, fraction: (1- F)
Affected fraction: F
Unchanged
Unaffected, fraction: (1- F)
F/S
After:
Execution Time with enhancement E:
Execution Time without enhancement E
1
Speedup(E) = ------------------------------------------------------ = -----------------Execution Time with enhancement E
(1 - F) + F/S
EECC551 - Shaaban
#53 Lec # 1 Fall 2000 9-7-2000
Performance Enhancement Example
• For the RISC machine with the following instruction mix given earlier:
Op
ALU
Load
Store
Branch
Freq
50%
20%
10%
20%
Cycles
1
5
3
2
CPI(i)
.5
1.0
.3
.4
% Time
23%
45%
14%
18%
CPI = 2.2
• If a CPU design enhancement improves the CPI of load instructions
from 5 to 2, what is the resulting performance improvement from this
enhancement:
Fraction enhanced = F = 45% or .45
Unaffected fraction = 100% - 45% = 55% or .55
Factor of enhancement = 5/2 = 2.5
Using Amdahl’s Law:
1
1
Speedup(E) = ------------------ = --------------------- =
(1 - F) + F/S
.55 + .45/2.5
1.37
EECC551 - Shaaban
#54 Lec # 1 Fall 2000 9-7-2000
An Alternative Solution Using CPU Equation
Op
ALU
Load
Store
Branch
Freq
50%
20%
10%
20%
Cycles
1
5
3
2
CPI(i)
.5
1.0
.3
.4
% Time
23%
45%
14%
18%
CPI = 2.2
• If a CPU design enhancement improves the CPI of load instructions
from 5 to 2, what is the resulting performance improvement from this
enhancement:
Old CPI = 2.2
New CPI = .5 x 1 + .2 x 2 + .1 x 3 + .2 x 2 = 1.6
Original Execution Time
Speedup(E) = ----------------------------------New Execution Time
Instruction count x old CPI x clock cycle
= ---------------------------------------------------------------Instruction count x new CPI x clock cycle
old CPI
= ------------ =
new CPI
2.2
--------1.6
= 1.37
Which is the same speedup obtained from Amdahl’s Law in the first solution.
EECC551 - Shaaban
#55 Lec # 1 Fall 2000 9-7-2000
Performance Enhancement Example
• A program runs in 100 seconds on a machine with multiply
operations responsible for 80 seconds of this time. By how much
must the speed of multiplication be improved to make the program
four times faster?
Desired speedup = 4 =
100
----------------------------------------------------Execution Time with enhancement
Execution time with enhancement
= 25 seconds
25 seconds = (100 - 80 seconds) + 80 seconds / n
25 seconds =
20 seconds
+ 80 seconds / n
 5

n
= 80 seconds / n
= 80/5 = 16
Hence multiplication should be 16 times faster to get a speedup of 4.
EECC551 - Shaaban
#56 Lec # 1 Fall 2000 9-7-2000
Performance Enhancement Example
• For the previous example with a program running in 100 seconds on
a machine with multiply operations responsible for 80 seconds of this
time. By how much must the speed of multiplication be improved
to make the program five times faster?
Desired speedup = 5 =
100
----------------------------------------------------Execution Time with enhancement
Execution time with enhancement =
20 seconds
20 seconds = (100 - 80 seconds) + 80 seconds / n
20 seconds =
20 seconds
+ 80 seconds / n
 0
= 80 seconds / n
No amount of multiplication speed improvement can achieve this.
EECC551 - Shaaban
#57 Lec # 1 Fall 2000 9-7-2000
Extending Amdahl's Law To Multiple Enhancements
• Suppose that enhancement Ei accelerates a fraction Fi of the
execution time by a factor Si and the remainder of the time is
unaffected then:
Speedup 
Original Execution Time
((1   F )   F ) XOriginal
i
i
Speedup 
i
i
S
Execution Time
i
1
((1   F )   F )
i
i
i
i
S
i
Note: All fractions refer to original execution time.
EECC551 - Shaaban
#58 Lec # 1 Fall 2000 9-7-2000
Amdahl's Law With Multiple Enhancements:
Example
•
Three CPU performance enhancements are proposed with the following
speedups and percentage of the code execution time affected:
Speedup1 = S1 = 10
Speedup2 = S2 = 15
Speedup3 = S3 = 30
•
•
Percentage1 = F1 = 20%
Percentage1 = F2 = 15%
Percentage1 = F3 = 10%
While all three enhancements are in place in the new design, each
enhancement affects a different portion of the code and only one
enhancement can be used at a time.
What is the resulting overall speedup?
Speedup 
1
((1   F )   F )
i
i
•
i
i
S
i
Speedup = 1 / [(1 - .2 - .15 - .1) + .2/10 + .15/15 + .1/30)]
= 1/ [
.55
+
.0333
]
= 1 / .5833 = 1.71
EECC551 - Shaaban
#59 Lec # 1 Fall 2000 9-7-2000
Pictorial Depiction of Example
Before:
Execution Time with no enhancements: 1
Unaffected, fraction: .55
S1 = 10
F1 = .2
S2 = 15
S3 = 30
F2 = .15
F3 = .1
/ 15
/ 10
/ 30
Unchanged
Unaffected, fraction: .55
After:
Execution Time with enhancements: .55 + .02 + .01 + .00333 = .5833
Speedup = 1 / .5833 = 1.71
Note: All fractions refer to original execution time.
EECC551 - Shaaban
#60 Lec # 1 Fall 2000 9-7-2000