COA2011PKP-1 - coapkp-ukm


Computer Organization
and Architecture
Lecture 1: Introduction
KT6213
1
Architecture & Organization (Stallings)

• Architecture comprises those attributes visible to the programmer:
◦ Instruction set, number of bits used for data representation, I/O mechanisms, addressing techniques.
◦ e.g. Is there a multiply instruction?

• Organization is how those features are implemented:
◦ Control signals, interfaces, memory technology.
◦ e.g. Is there a hardware multiply unit, or is multiplication done by repeated addition?
KT6213
2
Computer Architecture Is …

“… the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flow and controls, the logic design, and the physical implementation.”

Amdahl, Blaauw, and Brooks, 1964
KT6213
3
Cont.

Computer Architecture is the design of computers, including their instruction sets, hardware components, and system organization [Patterson].

Thus there are two essential parts of computer architecture:
◦ Instruction-set Architecture (ISA)
◦ Hardware-system Architecture (HSA)
[Figure: Computer Architecture (instruction set design, organization, hardware) at the center of its influences — Technology, Parallelism, Programming Languages, Applications, Operating Systems, Interface Design (Inst. Set Arch.), Measurement & Evaluation, and History.]
4
Instruction-set Architecture (ISA)

• The instruction set architecture of a computer includes anything a programmer would need to know to make the computer run correctly. This includes:
◦ (a) The number and types of registers
◦ (b) Instruction set (what operations can be performed?)
◦ (c) Instruction format (how are they specified?)
◦ (d) Addressing mode (how is data obtained? - direct vs. indirect)
◦ (e) Exception handling (what happens when something goes wrong?)

• Instruction-set architecture includes the specifications that determine how machine-language programs will interact with the computer. That is, in general, two computers with the same ISA will run the same programs. This is the notion of a computer-family architecture.
KT6213
5
Hardware-system Architecture (HSA)

• The hardware-system architecture deals with the computer's major hardware subsystems, including the central processing unit (CPU), its storage system, and its input-output system.

• The computer hardware design determines the implementation of the various computer components. This includes:
◦ (a) Capabilities and performance of the functional units (e.g., registers, ALUs, shifters)
◦ (b) Methods for connecting the functional units (e.g., data bus)
◦ (c) Control logic for the functional units

• Typically, the computer hardware is designed based on the instruction set architecture.
KT6213
6
• A successful ISA generally has many implementations (a computer family) which differ in their HSA.

• Compatibility is the ability of different computers to run the same programs.
◦ Upward compatibility allows high-performance members of a family to run the same programs as the low-performance members.
◦ Downward compatibility is not always possible, since high-performance family members often have features not available on lower-performance members.
KT6213
7
Computer Family

• A computer family is a set of implementations that share the same or a similar ISA (using a variety of technologies, memory sizes, and speeds). Examples: the IBM System/360 (1960s), the PDP-8 family (1965), the PDP-11 family (1970), and the IBM System/370 (1970s).
• All Intel x86 family members share the same basic architecture.
• The IBM System/370 family shares the same basic architecture.
• This gives code compatibility
◦ At least backwards
• Organization differs between different versions.
KT6213
8
Computer Evolution
KT6213
9
Historical
Perspective
KT6213
10
Early Computing

1946: ENIAC, US Army, 18,000 vacuum tubes
1949: UNIVAC I, $250K, 48 systems sold
1954: IBM 701, core memory
1957: Moving-head disk
1958: Transistor, FORTRAN, ALGOL; CDC & DEC founded
1964: IBM 360, CDC 6600, DEC PDP-8
1969: UNIX
1970: Floppy disk
1981: IBM PC, 1st successful portable (Osborne 1)
1986: Connection Machine, Max Headroom debut
KT6213
11
Underlying Technologies

Generation     Year  Logic                     Storage             Prog. Lang.       O/S
               54    Tubes                     core (8 ms)         FORTRAN
               58    Transistor (10 µs)                            ALGOL, COBOL      Batch
               60
               64    Hybrid (1 µs)             thin film (200 ns)  Lisp, APL, Basic  Multiprog.
               66    IC (100 ns)                                   PL/1, Simula, C
               67                                                                    V.M.
               71    LSI (10 ns) (8-bit µP)    1k DRAM
Evolutionary   73                              4k DRAM
               75    (16-bit µP)               16k DRAM                              Networks
               78                              64k DRAM
               80    VLSI (10 ns)              256k DRAM           ADA
               84    (32-bit µP)               1M DRAM             C++
Parallelism    87    ULSI                      4M DRAM
               89    GaAs                      16M DRAM            Fortran90
               92    (64-bit µP)                                   O.O.
12
What has happened in the 1990s

• “Network-Integrated Computing”
◦ Wide-area AND local-area integration of cluster-based computing and high-performance networks

• Scalable Technologies for Computing, Networking, and Information Systems
◦ Systems that scale DOWN as well as UP
◦ High-performance workstations
◦ Clusters and distributed systems
◦ Massively parallel I/O and computer servers
◦ National Information Infrastructure
KT6213
13
What has been predicted for the Late 1990s and Early 2000s

• Technology
◦ Very large dynamic RAM: 64 Mbits and beyond
◦ Large, fast static RAM: 1 MB, 10 ns

• Complete systems on a chip
◦ 10+ million transistors

• Parallelism
◦ Superscalar, superpipelined, vector, multiprocessors
◦ Processor arrays
KT6213
14
What has been predicted for the Late 1990s and Early 2000s

• Low Power
◦ 50% of PCs portable by 1995
◦ Performance per watt

• Parallel I/O
◦ Many applications are I/O limited, not computation limited
◦ Computation is scaling up, but memory and I/O bandwidth are not keeping pace

• Multimedia
◦ New interface technologies
◦ Video, speech, handwriting, virtual reality, …
KT6213
15
Review of Technology Trends and Cost/Performance
KT6213
16
Original Food Chain Picture
Big Fishes Eating Little Fishes
KT6213
17
1988 Computer Food Chain

[Figure: big fish eating little fish — PC, Workstation, Minicomputer, Minisupercomputer, Mainframe, Supercomputer, Massively Parallel Processors]
KT6213
18
1998 Computer Food Chain

[Figure: PC, Workstation, Server, Mainframe, Supercomputer — with Minicomputer, Minisupercomputer, and Massively Parallel Processors now gone. Now who is eating whom?]
19
Why Such Change in 10 Years?

• Performance
◦ Technology advances
 - CMOS VLSI dominates older technologies (TTL, ECL) in cost AND performance
◦ Computer architecture advances improve the low end
 - RISC, superscalar, RAID, …

• Price: lower costs due to …
◦ Simpler development
 - CMOS VLSI: smaller systems, fewer components
◦ Higher volumes
 - CMOS VLSI: the same development cost spread over 10,000 vs. 10,000,000 units

• Function
◦ Rise of networking / local interconnection technology
KT6213
20
Moore’s Law

• Gordon Moore – co-founder of Intel
• Observed increasing density of components on chips:
◦ The number of transistors on a chip doubles every year
◦ Since the 1970s development has slowed a little: the number of transistors doubles every 18 months

• Consequences:
◦ The cost of a chip has remained almost unchanged
◦ Higher packing density means shorter electrical paths, giving higher performance
◦ Smaller size gives increased flexibility
◦ Reduced power and cooling requirements
◦ Fewer interconnections increase reliability
KT6213
21
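A quick sanity check of the doubling rates above, as a Python sketch (the 1971 Intel 4004 with 2,300 transistors is used as the baseline; the doubling period is a parameter):

# Project transistor counts under Moore's Law.
base_year, base_count = 1971, 2300  # Intel 4004 baseline

def transistors(year, doubling_months=18):
    """Count predicted by doubling every `doubling_months` months."""
    months = (year - base_year) * 12
    return base_count * 2 ** (months / doubling_months)

for year in (1980, 1990, 2000):
    print(year, f"{transistors(year):,.0f}")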
Performance Mismatch

• Processor speed has increased
• Memory capacity has increased
• Memory speed lags behind processor speed
KT6213
22
DRAM and Processor Characteristics
KT6213
23
Trends in DRAM use
KT6213
24
Memory Capacity (Single-Chip DRAM)

Year   Size (Mb)   Cycle time
1980   0.0625      250 ns
1983   0.25        220 ns
1986   1           190 ns
1989   4           165 ns
1992   16          145 ns
1996   64          120 ns
2000   256         100 ns
KT6213
25
Throughput: the total amount of work done in a given time.
Latency (execution time): the time between the start and the completion of an event.
KT6213
26
Performance milestone
KT6213
27
Technology Trends (Summary)

         Capacity        Speed (latency)
Logic    2x in 3 years   2x in 3 years
DRAM     4x in 3 years   2x in 10 years
Disk     4x in 3 years   2x in 10 years
KT6213
28
Growth in CPU Transistor Count
KT6213
29
Technology Trends: Microprocessor Capacity

[Figure: transistors per chip vs. year (1970–2000) following Moore’s Law, from the i4004, i8080, i8086, i80286, i80386, and i80486 up to the Pentium; a “graduation window” is marked.]

Transistor counts: Alpha 21264: 15 million; Alpha 21164: 9.3 million; PowerPC 620: 6.9 million; Pentium Pro: 5.5 million; Sparc Ultra: 5.2 million.

CMOS improvements:
• Die size: 2X every 3 yrs
• Line width: halve / 7 yrs
KT6213
30
Growth in Processor Performance
KT6213
31
Performance Trends (Summary)

• Workstation performance (measured in SPECmarks) improves roughly 50% per year (2X every 18 months).

• Improvement in cost/performance is estimated at 70% per year.
KT6213
32
Measurement and Evaluation

Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems

[Figure: the design cycle — Creativity feeds Design; Design feeds Cost/Performance Analysis; Analysis sorts results into Good Ideas, Mediocre Ideas, and Bad Ideas, which feed back into further Analysis.]
33
Computer Engineering Methodology

[Figure, built up over four slides: a design cycle — Technology Trends feed in; Evaluate Existing Systems for Bottlenecks (using Benchmarks); Simulate New Designs and Organizations (using Workloads); Implement Next Generation System (bounded by Implementation Complexity); and around again.]
KT6213
37
Summary: Price vs. Cost

[Figure: two bar charts for Mini, W/S, and PC. The first shows the price breakdown (component costs, direct costs, gross margin, average discount) as percentages of list price. The second shows price as a multiple of component cost, with multiples ranging from about 1.5 for PCs up to about 4.7 for minicomputers.]
45
System Performance
KT6213
46
Measuring and Reporting Performance

• Designing high-performance computers is one of the major goals of any computer architect.
• As a result, assessing the performance of computer hardware is at the heart of computer design, and it greatly affects the demand for and market value of the computer.
• However, measuring the performance of a computer system is not a straightforward task:
 - Metrics – how do we describe the performance of a computer numerically?
 - What tools do we use to find those metrics?
 - How do we summarize the performance?
KT6213
48
Measuring and Reporting Performance

• What do we mean by one computer being faster than another?
◦ Its programs run in less time.

• Response time or execution time
◦ The time at which users see the output.

• Throughput
◦ The total amount of work done in a given time.
KT6213
49
Performance

• “Increasing and decreasing”?
• We use the terms “improve performance” and “improve execution time” when we mean increase performance and decrease execution time:
 - improve performance = increase performance
 - improve execution time = decrease execution time
KT6213
50
What is performance?

• … how fast does this computer run MY program?
• … is machine A faster than machine B, and if so, how much faster?
• … one of the three factors driving architecture:
◦ … effective use of new technology
 - Should I use it to enhance the architecture, or to improve the performance of the existing architecture?
◦ … can a desired performance improvement be achieved by a given set of implementation or organization changes?
KT6213
51
Measuring Performance

Definition of time:
• Wall-clock time
• Response time
• Elapsed time
◦ The latency to complete a task, including disk accesses, memory accesses, I/O activities, and operating system overhead.
KT6213
52
Measuring Performance – user CPU time

• Response time may include disk access, memory access, input/output activities, CPU events, and operating system overhead—everything.
• In order to get an accurate measure of performance, we use CPU time instead of response time.
• CPU time is the time the CPU spends computing a program, and does not include time spent waiting for I/O or running other programs.
• CPU time can be further divided into user CPU time (the program) and system CPU time (the OS).
• Typing the UNIX command time yields, for example:
90.7u 12.9s 2:39 65% (user CPU, system CPU, total elapsed time, percent of elapsed time that is CPU time)
• In our performance measures, we use user CPU time – because of its independence from the OS and other factors.
53
Performance Metrics
KT6213
54
Benchmarking: Performance Measure
KT6213
55
What is the computer user interested in?

• Reducing the time to run a certain task
◦ Execution time (response time)
◦ The time between the start and the completion of an event.

• Increasing the number of tasks per week, day, hour, sec, ns …
◦ Throughput
◦ The total amount of work done in a given time.
KT6213
56
Example

Do the following changes to a computer system increase throughput, reduce response time, or both?
1) Replacing the processor in a computer with a faster version
2) Adding additional processors to a system that uses multiple processors for separate tasks – for example, handling an airline reservation system

Answer
1) Both response time and throughput are improved.
2) Only throughput increases.
57
Performance Measure

Plane        DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747   6.5 hours     610 mph    470          286,700
Concorde     3 hours       1350 mph   132          178,200

• Time to run the task (ExTime)
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns … (Performance)
– Throughput, bandwidth
KT6213
58
Execution Time

• In this subject, we will primarily be interested in execution time as a measure of performance.

• The relationship between performance and execution time on a computer X is reciprocal:

    Performance_X = 1 / Execution Time_X

To maximize the performance of an application, we need to minimize its execution time.
59
Performance comparison

• To compare design alternatives, we use the following equation:

    n = Execution Time_Y / Execution Time_X = Performance_X / Performance_Y

• “X is n times faster than Y” (or “the throughput of X is n times higher than Y”) means that the execution time is n times less on X than on Y.
KT6213
60
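As a minimal sketch of this relation (the execution times are made up for illustration):

def times_faster(extime_y, extime_x):
    """n such that X is n times faster than Y."""
    return extime_y / extime_x

# Hypothetical: Y takes 15 s, X takes 10 s -> X is 1.5 times faster than Y.
print(times_faster(15.0, 10.0))  # 1.5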
Metrics of Performance

What is actually measured, at each level of the system:

Application                    Answers per month; operations per second
Programming Language
Compiler
ISA                            (Millions of) instructions per second: MIPS
                               (Millions of) FP operations per second: MFLOP/s
Datapath / Control             Megabytes per second
Function Units
Transistors, Wires, Pins       Cycles per second (clock rate)
61
Aspects of CPU Performance

    CPU time = Seconds / Program
             = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)

What affects what:

                Inst Count   CPI   Clock Rate
Program         X
Compiler        X            (X)
Inst. Set       X            X
Organization                 X     X
Technology                         X
62
CPU Performance

• All computers are constructed using a clock to operate their circuits. The clock is typically characterized by two basic metrics:
◦ Clock rate – today in MHz and GHz
◦ Clock cycle time: clock cycle time = 1 / clock rate
◦ E.g., a 1 GHz clock rate corresponds to a 1 ns cycle time

• Thus the CPU time for a program is given by:

    CPU Time = CPU clock cycles for a program × Clock cycle time

or,

    CPU Time = CPU clock cycles for a program / Clock rate
63
CPU Performance

• More typically, we count the number of instructions executed, known as the Instruction Count (IC).
• CPI: the average number of clock cycles per instruction. It depends on the instruction set of the computer and its compiler:

    CPI = CPU clock cycles for a program / IC

• Hence, an alternative method to get the CPU time:

    CPU time = IC × CPI × Clock cycle time = (IC × CPI) / Clock rate

• CPU performance is equally dependent upon three characteristics: clock cycle time, CPI, and IC. They are not independent: making one better often makes another worse, because the basic technologies involved in changing each characteristic are interdependent:
◦ Clock cycle time: hardware technology and organization
◦ CPI: organization and instruction set architecture
◦ IC: instruction set architecture and compiler technology
KT6213
64
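The CPU time equation, written out as a small Python sketch (the instruction count, CPI, and clock rate are made-up values):

def cpu_time(ic, cpi, clock_rate_hz):
    """CPU time = IC x CPI / clock rate."""
    return ic * cpi / clock_rate_hz

# Hypothetical program: 1 billion instructions, CPI 2.0, on a 1 GHz clock.
print(cpu_time(1e9, 2.0, 1e9))  # 2.0 seconds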
CPU Performance

Example: Suppose we have two implementations of the same instruction set architecture. Computer A has a clock cycle time of 10 ns and a CPI of 2.0 for some program, and computer B has a clock cycle time of 20 ns and a CPI of 1.2 for the same program. Which computer is faster for this program?

Answer
Assume the program requires IN instructions to be executed:
CPU clock cycles_A = IN × 2.0
CPU clock cycles_B = IN × 1.2
CPU time_A = IN × 2.0 × 10 ns = 20 × IN ns
CPU time_B = IN × 1.2 × 20 ns = 24 × IN ns
So computer A is faster than computer B.
65
CPU Performance

• Often the overall performance is easier to deal with on a per-instruction-class basis:

    CPU clock cycles = Σ (i = 1 to n) IC_i × CPI_i

where IC_i is the number of times instruction class i is executed, and CPI_i is the CPI for instruction class i.

    CPU time = [ Σ (i = 1 to n) IC_i × CPI_i ] × Clock cycle time

• The overall CPI can be expressed as:

    CPI = [ Σ (i = 1 to n) IC_i × CPI_i ] / Instruction count
        = Σ (i = 1 to n) (IC_i / Instruction count) × CPI_i
KT6213
66
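The per-class summation, as a sketch (the instruction mix below is invented for illustration):

def overall_cpi(mix):
    """CPI = sum(IC_i * CPI_i) / total instruction count."""
    total_cycles = sum(ic * cpi for ic, cpi in mix)
    total_ic = sum(ic for ic, _ in mix)
    return total_cycles / total_ic

# Hypothetical (IC_i, CPI_i) pairs for three instruction classes.
mix = [(500, 1.0), (200, 2.0), (300, 1.5)]
print(overall_cpi(mix))  # 1.35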
Cycles Per Instruction

Example: Suppose we have a machine where we can count the frequency with which instructions are executed. We also know how many cycles each instruction type takes.

Base Machine (Reg / Reg)
Instruction   Freq   CPI   (% Time)
ALU           50%    1     (33%)
Load          20%    2     (27%)
Store         10%    2     (13%)
Branch        20%    2     (27%)
Total CPI: 1.5

How do we get total CPI?
How do we get %time?
KT6213
68
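One way to answer the two questions, working directly from the table (a sketch):

# Frequencies and CPIs from the Base Machine table above.
table = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

total_cpi = sum(freq * cpi for freq, cpi in table.values())
print("Total CPI:", total_cpi)  # 0.5*1 + 0.2*2 + 0.1*2 + 0.2*2 = 1.5

for name, (freq, cpi) in table.items():
    # A class's %time is its share of all cycles: freq * CPI / total CPI.
    print(name, f"{freq * cpi / total_cpi:.0%}")  # 33%, 27%, 13%, 27%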
CPU Performance

Example: Suppose we have made the following measurements:
Frequency of FP operations (other than FPSQR) = 25%
Avg. CPI of FP operations = 4.0
Avg. CPI of other instructions = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20
Compare two designs: 1) decrease the CPI of FPSQR to 2; 2) decrease the avg. CPI of all FP operations to 2.5.

Answer
First, find the original CPI:

    CPI_original = Σ (IC_i / Instruction count) × CPI_i = 27% × 4 + 73% × 1.33 = 2.051

The CPI with design 1:

    CPI_1 = CPI_original − 2% × (improvement in CPI for FPSQR) = 2.051 − 2% × (20 − 2) = 1.691

The CPI with design 2:

    CPI_2 = 27% × 2.5 + 73% × 1.33 = 1.646

So design 2 is slightly better.
69
MIPS: Millions of Instructions per Second
KT6213
70
Relative MIPS and Speedup
KT6213
71
• CPU Performance Measure – how many millions of instructions per second (MIPS) a CPU can execute.
◦ MIPS – million instructions per second
◦ (a) MIPS = instruction count / (execution time × 10^6) = clock rate / (CPI × 10^6)
◦ (b) Example: a program that executes 3 million instructions in 2 seconds has a MIPS rating of 1.5
◦ (c) Advantages
 - easy to understand
 - easy to measure
◦ (d) Disadvantages
 - cannot accurately compare machines with different instruction sets (machines with powerful instruction sets are penalized)
 - varies from program to program (programs with lots of simple instructions will have higher MIPS ratings)
 - can vary inversely with performance
◦ (e) Different types
 - native: as defined in (a), for a given program
 - peak: uses the instruction mix that minimizes CPI (may be very unrealistic)
 - relative: compared to a reference machine
KT6213
72
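The definitions in (a) and the example in (b), as a sketch:

def mips_from_time(instruction_count, exec_time_s):
    return instruction_count / (exec_time_s * 1e6)

def mips_from_clock(clock_rate_hz, cpi):
    return clock_rate_hz / (cpi * 1e6)

print(mips_from_time(3e6, 2.0))     # example (b): 1.5 MIPS
print(mips_from_clock(400e6, 2.0))  # hypothetical 400 MHz machine at CPI 2: 200 MIPS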
MFLOPS: Focus on One Type of Work
KT6213
73
Normalized MFLOPS
KT6213
74
Some “Misleading” Performance Measures

• There are certain computer performance measures which are popular with computer manufacturers and sellers – but may be misleading.
• MIPS (Million Instructions Per Second)
 - MIPS depends on the instruction set, which makes it difficult to compare the MIPS of computers with different instruction sets.
 - MIPS varies between programs on the same computer – different programs use different instruction mixes.
 - Most importantly, MIPS can vary inversely with performance.
77
Some “Misleading” Performance Measures

• MFLOPS: focus on one type of work
 - MFLOPS (Million Floating-point Operations Per Second) depends on the program: it must be FP-intensive.
 - MFLOPS depends on the computer as well.
 - Floating-point operations vary in complexity (e.g., add vs. divide).
• Peak performance: the performance the manufacturer guarantees you won’t exceed.
 - The difference between peak performance and average performance is huge.
 - Peak performance is not useful in predicting observed performance.
78
Programs for Measuring Performance

• Real applications: text-processing software (Word), compilers (C), and other applications like Photoshop – they have inputs, outputs, and options that users exercise.
 - One major downside: real applications often encounter portability problems arising from dependences on the OS or compiler.
• Modified (or scripted) applications: modified to enhance portability or to focus on one particular aspect of system performance. Scripts are used to simulate interaction behavior.
• Kernels: small, key pieces of real programs. Typically used to evaluate individual features of the machine.
• Toy benchmarks: typically between 10 and 100 lines of code, producing a known result.
• Synthetic benchmarks: artificially created code matching an average execution profile.
KT6213
79
Benchmark Suites

• A benchmark suite is a collection of programs (a workload) that tries to explore and capture the strengths and weaknesses of a computer system (real programs, kernels).
• A key advantage of such suites is that the weakness of any one benchmark is lessened by the presence of the other benchmarks.
• Good vs. bad benchmarks:
– Improving the product for real programs vs. improving the product for benchmarks to get more sales.
– If benchmarks are inadequate, then sales wins!
KT6213
80
SPEC Benchmarks

SPEC: System Performance Evaluation Cooperative

• The most successful and widely adopted attempt.
• First generation, 1989
◦ 10 programs yielding a single number (“SPECmarks”)
• Second generation, 1992
◦ SPECint92 (6 integer programs) and SPECfp92 (14 floating-point programs)
◦ Unlimited compiler flags
• Third generation, 1995
◦ New set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating-point)
◦ Single flag setting for all programs: SPECint_base95, SPECfp_base95
◦ “Benchmarks useful for 3 years”
KT6213
81
SPEC Benchmarks

CINT2000 (Integer Component of SPEC CPU2000)

Program       Language   What It Is
164.gzip      C          Compression
175.vpr       C          FPGA Circuit Placement and Routing
176.gcc       C          C Programming Language Compiler
181.mcf       C          Combinatorial Optimization
186.crafty    C          Game Playing: Chess
197.parser    C          Word Processing
252.eon       C++        Computer Visualization
253.perlbmk   C          PERL Programming Language
254.gap       C          Group Theory, Interpreter
255.vortex    C          Object-oriented Database
256.bzip2     C          Compression
300.twolf     C          Place and Route Simulator

http://www.spec.org/osg/cpu2000/CINT2000/
82
SPEC Benchmarks

CFP2000 (Floating Point Component of SPEC CPU2000)

Program        Language     What It Is
168.wupwise    Fortran 77   Physics / Quantum Chromodynamics
171.swim       Fortran 77   Shallow Water Modeling
172.mgrid      Fortran 77   Multi-grid Solver: 3D Potential Field
173.applu      Fortran 77   Parabolic / Elliptic Differential Equations
177.mesa       C            3-D Graphics Library
178.galgel     Fortran 90   Computational Fluid Dynamics
179.art        C            Image Recognition / Neural Networks
183.equake     C            Seismic Wave Propagation Simulation
187.facerec    Fortran 90   Image Processing: Face Recognition
188.ammp       C            Computational Chemistry
189.lucas      Fortran 90   Number Theory / Primality Testing
191.fma3d      Fortran 90   Finite-element Crash Simulation
200.sixtrack   Fortran 77   High Energy Physics Accelerator Design
301.apsi       Fortran 77   Meteorology: Pollutant Distribution

http://www.spec.org/osg/cpu2000/CFP2000/
83
SPEC Benchmarks

Sample Results for CINT2000: Intel OR840 (1 GHz Pentium III processor)
http://www.spec.org/osg/cpu2000/results/res2000q3/cpu2000-20000718-00168.asc

Benchmark     Base Ref   Base Run   Base    Peak Ref   Peak Run   Peak
              Time       Time       Ratio   Time       Time       Ratio
164.gzip      1400       277        505*    1400       270        518*
175.vpr       1400       419        334*    1400       417        336*
176.gcc       1100       275        399*    1100       272        405*
181.mcf       1800       621        290*    1800       619        291*
186.crafty    1000       191        522*    1000       191        523*
197.parser    1800       500        360*    1800       499        361*
252.eon       1300       267        486*    1300       267        486*
253.perlbmk   1800       302        596*    1800       302        596*
254.gap       1100       249        442*    1100       248        443*
255.vortex    1900       268        710*    1900       264        719*
256.bzip2     1500       389        386*    1500       375        400*
300.twolf     3000       784        382*    3000       776        387*

SPECint_base2000: 438
SPECint2000: 442
84
More Benchmarks
TPC: Transaction Processing Council
– Measure the ability of a system to handle transactions,
which consist of database accesses and updates.
– Many variants depending on transaction complexity
– TPC-A: simple bank teller transaction style
– TPC-C: complex database query
EDN Embedded Microprocessor Benchmark Consortium (EEMBC,
“embassy”)
– 34 kernels in 5 classes
– 16 automotive/industrial; 5 consumer; 3 networking; 4 office
automation; 6 telecommunications
87
How to Summarize Performance

Management would like to have one number.
Technical people want more:
1. They want evidence of reproducibility – there should be enough information so that you or someone else can repeat the experiment.
2. There should be consistency when doing the measurements multiple times.

How would you report these results?

                    Computer A   Computer B   Computer C
Program P1 (secs)   1            10           20
Program P2 (secs)   1000         100          20
Total Time (secs)   1001         110          40
KT6213
88
Comparing and Summarizing Performance

• Comparing performance by looking at individual programs is not fair.
• Total execution time: a consistent summary measure.
• Arithmetic Mean – provides a simple average:

    AM = (1/n) × Σ (i = 1 to n) Time_i

– Time_i: execution time for program i in the workload
– Doesn’t account for weight: all programs are treated as equal
89
Normalized Time Metrics

• Normalized execution time metrics measure performance by normalizing to a reference machine, giving an execution time ratio for each program.

• Geometric Mean:

    GM = [ Π (i = 1 to n) Execution Time Ratio_i ] ^ (1/n)

• The geometric mean is consistent no matter which machine is the reference.
• The arithmetic mean should not be used to average normalized execution times.
• However, the geometric mean still doesn’t form an accurate prediction model (it doesn’t predict execution time).
90
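A sketch of the consistency property, using the P1/P2 times for machines A, B, and C from the earlier table:

from math import prod

times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}  # P1, P2 (secs)

def gm_normalized(machine, reference):
    """Geometric mean of execution time ratios relative to `reference`."""
    ratios = [t / r for t, r in zip(times[machine], times[reference])]
    return prod(ratios) ** (1 / len(ratios))

# The B-vs-C comparison comes out the same whichever machine is the reference.
print(gm_normalized("B", "A") / gm_normalized("C", "A"))  # ~1.58
print(gm_normalized("B", "C") / gm_normalized("C", "C"))  # ~1.58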
Comparing the performance of two computers using SPECRatio
The relationship between the geometric mean and the performance ratio
91
The ratio of the geometric means is equal to the geometric mean of the
performance ratios, which implies that the choice of the reference computer is
irrelevant.
92
Weighted Variants

• What is the proper mixture of programs for the workload?
• A weight is a weighting factor attached to each program to indicate the relative frequency of that program in the workload (% of use).
• Weighted Arithmetic Mean:

    WAM = Σ (i = 1 to n) Weight_i × Time_i

– Weight_i: frequency of program i in the workload
– May be better, but beware the dominant program time
93
Example

                               A       B       C      W(1)   W(2)    W(3)
Program P1 (secs)              1       10      20     0.5    0.909   0.999
Program P2 (secs)              1000    100     20     0.5    0.091   0.001
Arithmetic Mean                500.5   55      20
Weighted Arithmetic Mean (1)   500.5   55      20
Weighted Arithmetic Mean (2)   91.82   18.18   20
Weighted Arithmetic Mean (3)   2       10.09   20
94
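A sketch reproducing the weighted means above (the table's weights look like rounded fractions, so exact values are assumed here: 10/11 and 1/11 for W(2), 1000/1001 and 1/1001 for W(3)):

times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}  # P1, P2 (secs)
weightings = {
    "W(1)": [1/2, 1/2],
    "W(2)": [10/11, 1/11],        # displayed as 0.909 / 0.091
    "W(3)": [1000/1001, 1/1001],  # displayed as 0.999 / 0.001
}

def weighted_am(weights, ts):
    return sum(w * t for w, t in zip(weights, ts))

for name, ws in weightings.items():
    print(name, {m: round(weighted_am(ws, ts), 2) for m, ts in times.items()})
# W(1): 500.5, 55, 20   W(2): 91.82, 18.18, 20   W(3): 2.0, 10.09, 20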
Example

                               A       B       C      W(1)
Program P1 (secs)              1       10      20     100/101
Program P2 (secs)              1000    100     20     1/101

Normalized to A:
Program P1                     1       10      20
Program P2                     1       0.1     0.02
Geometric Mean                 1       1       0.63

Arithmetic Mean                500.5   55      20
Weighted Arithmetic Mean (1)   10.89   10.89   20

Machines A and B have the same performance according to the Geometric Mean measure, yet this would only be true for a workload where P1 runs 100 times more often than P2, according to the Weighted Arithmetic Mean measure.
95
Locality of Reference

• Other important fundamental observations come from the properties of programs.

• Principle of locality: programs tend to reuse data and instructions they have used recently.

• There are two different types of locality:
 - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (loops, reuse, etc.)
 - Spatial locality (locality in space/location): if an item is referenced, items whose addresses are close to one another tend to be referenced soon (straight-line code, array access, etc.)

• We can predict with reasonable accuracy what instructions and data a program will use in the near future, based on its accesses in the recent past.
KT6213
96
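A tiny illustration of both kinds of locality (Python stands in for the memory access pattern):

data = list(range(1000))

total = 0
for i in range(len(data)):  # spatial locality: data[0], data[1], ... sit at adjacent addresses
    total += data[i]        # temporal locality: `total` and `i` are re-referenced on every pass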
Quantitative Principles of Computer Design

Now that we know how to define, measure, and summarize performance, we can explore some of the principles and guidelines for the design and analysis of computers.

Make the common case fast
• In making a design trade-off, favor the frequent case over the infrequent case.
• Improving the frequent event, rather than the rare event, will obviously help performance more.
• The frequent case is often simpler and can be made faster than the infrequent case.
• We have to decide what the frequent case is and how much performance can be improved by making it faster.
KT6213
97
Two equations to evaluate design alternatives

Amdahl’s Law
• The performance gain that can be obtained by improving some portion of a computer can be calculated using Amdahl’s Law.
• Amdahl’s Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
• Amdahl’s Law defines the speedup that can be gained by using a particular feature.

The CPU Performance Equation
• Essentially all computers are constructed using a clock running at a constant rate.
• CPU time can then be expressed in terms of clock cycles.
KT6213
98
Amdahl's Law

Speedup due to enhancement E:

    Speedup(E) = Execution Time without E / Execution Time with E

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:

• Fraction_enhanced: the fraction of the execution time in the original machine that can be converted to take advantage of the enhancement
• Speedup_enhanced: the improvement gained by the enhanced execution mode
99
Amdahl's Law

    ExTime_new = ExTime_old × [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

    Speedup_overall = ExTime_old / ExTime_new
                    = 1 / [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
100
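Amdahl's Law as a small helper function (a sketch):

def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only `fraction_enhanced` of the time is sped up."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(amdahl_speedup(0.10, 2))  # matches the FP example that follows: ~1.053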
Amdahl's Law
KT6213
101
Amdahl's Law

Example: floating point (FP) instructions are improved to run faster by a factor of 2, but only 10% of the execution time is spent on FP instructions. What is the overall speedup gained by this improvement?

Answer

    ExTime_new = ExTime_old × (0.9 + 0.1/2) = 0.95 × ExTime_old

    Speedup_overall = 1 / 0.95 = 1.053

• Amdahl’s Law can serve as a guide to how much an enhancement will improve performance and how to distribute resources to improve cost/performance.
• It is particularly useful for comparing the performance, both of the overall system and of the CPU, of two design alternatives.
102
Example of Amdahl’s Law

• Floating point instructions are improved to run 3X faster, but only 5% of the execution time is spent on FP instructions:

    ExTime_new = ExTime_old × (0.95 + 0.05/3) = 0.kkk × ExTime_old

    Speedup_overall = 1 / 0.kkk = 1.yyy
103
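Working out the placeholder values under the stated assumptions (5% FP fraction, 3X speedup):

new_fraction = 0.95 + 0.05 / 3     # 0.kkk = ~0.9667 of the old execution time
print(round(new_fraction, 4))      # 0.9667
print(round(1 / new_fraction, 4))  # 1.yyy = ~1.0345 overall speedup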
Example
Suppose that we want to enhance the processor
used for Web serving. The new processor is 10
times faster on computation in the Web serving
application than the original processor. Assuming
that the original processor is busy with
computation 40% of the time and is waiting for
I/O 60% of the time, what is the overall speedup
gained by incorporating the enhancement?
KT6213
104
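A sketch of the computation (fraction enhanced = 0.4, speedup of that fraction = 10):

speedup = 1 / ((1 - 0.4) + 0.4 / 10)
print(speedup)  # 1 / 0.64 = 1.5625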
Example
A common transformation required in graphics processors is square
root. Implementations of floating-point (FP) square root vary
significantly in performance, especially among processors designed for
graphics. Suppose FP square root (FPSQR) is responsible for 20% of
the execution time of a critical graphics benchmark. One proposal is
to enhance the FPSQR hardware and speed up this operation by a
factor of 10. The other alternative is just to try to make all FP
instructions in the graphics processor run faster by a factor of 1.6; FP
instructions are responsible for half of the execution time for the
application. The design team believes that they can make all FP
instructions run 1.6 times faster with the same effort as required for
the fast square root. Compare these two design alternatives.
KT6213
105
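Working both alternatives with Amdahl's Law (a sketch):

def amdahl(fraction, factor):
    return 1 / ((1 - fraction) + fraction / factor)

print(amdahl(0.20, 10))   # enhance FPSQR only: 1 / (0.8 + 0.02)   = ~1.22
print(amdahl(0.50, 1.6))  # enhance all FP:     1 / (0.5 + 0.3125) = ~1.23 (slightly better)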
Amdahl's Law - cont
KT6213
106
Amdahl's Law - cont
KT6213
107
Amdahl's Law - cont
KT6213
108
Exercise

• E.g.: Our program takes 10 s to run on computer A, which has a 400 MHz clock. We want it to run in 6 s. The designer says that the clock rate can be increased, but doing so will cause the total number of cycles for the program to increase to 1.2 times the previous value. What is the minimum clock rate required to get the desired speedup?
KT6213
109
Solution

• Using the formula:

    ExTime = (INS/PROG) × CPI / clock rate

Initial: 10 = (INS/PROG) × CPI / 400 MHz
Final:    6 = (INS/PROG) × 1.2 CPI / CLK_min

==> CLK_min / (1.2 × 400) = 10 / 6
==> CLK_min = (10/6) × (1.2 × 400) = 800 MHz
KT6213
110
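The same solution checked numerically (a sketch):

cycles_old = 10 * 400e6        # 10 s at 400 MHz = 4e9 cycles
cycles_new = 1.2 * cycles_old  # 20% more cycles after the redesign
clk_min = cycles_new / 6       # must finish in 6 s
print(clk_min / 1e6, "MHz")    # 800.0 MHz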
• Exercise: A program runs in 100 s. Multiplies account for 80% of the program. Designer M can improve the speed of multiply operations. Now, I am a user and I need to make MY program 5 times faster. How much speedup of multiply instructions should M achieve to allow me to reach my overall speedup goal?
KT6213
111
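A sketch for attacking this exercise: solve Amdahl's Law for the required enhancement factor, and note what happens as the goal approaches 1 / (1 − 0.8) = 5:

def required_speedup(fraction, overall_target):
    """Enhancement factor needed so the overall speedup hits the target."""
    denom = 1 / overall_target - (1 - fraction)
    if denom <= 0:
        return float("inf")  # target is beyond the 1/(1-fraction) limit
    return fraction / denom

print(required_speedup(0.8, 4))  # a 4x overall goal needs multiplies 16x faster
print(required_speedup(0.8, 5))  # inf: 5x is unattainable while 20% is untouched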
Summary, #1

• Designing to last through trends:

        Capacity        Speed (latency)
Logic   2x in 3 years   2x in 3 years
DRAM    4x in 3 years   2x in 10 years
Disk    4x in 3 years   2x in 10 years

• 6 yrs to graduate => 16X CPU speed, DRAM/Disk size
• Time to run the task
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns, …
– Throughput, bandwidth
• “X is n times faster than Y” means:

    ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = n
112
Summary, #2

• Amdahl’s Law:

    Speedup_overall = ExTime_old / ExTime_new
                    = 1 / [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

• CPI Law:

    CPU time = Seconds / Program
             = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)

• Execution time is the REAL measure of computer performance!
• Good products are created when you have good benchmarks and good ways to summarize performance.
• Die cost goes roughly with (die area)^4.
• Can the PC industry support engineering/research investment?
KT6213
113