EE-F011 Computer Architecture 計算機結構


EEF011 Computer Architecture
計算機結構
Chapter 1
Fundamentals of Computer Design
吳俊興
Department of Computer Science and Information Engineering, National University of Kaohsiung
September 2004
Chapter 1. Fundamentals of
Computer Design
1.1 Introduction
1.2 The Changing Face of Computing and the Task of
the Computer Designer
1.3 Technology Trends
1.4 Cost, Price, and Their Trends
1.5 Measuring and Reporting Performance
1.6 Quantitative Principles of Computer Design
1.7 Putting It All Together
1.8 Another View
1.9 Fallacies and Pitfalls
1.10 Concluding Remarks
1.1 Introduction
• Rapid rate of improvement over roughly the last 55 years, coming from
– the technology used to build computers
– innovation in computer design
Beginning in about 1970, computer designers became largely
dependent upon integrated circuit (IC) technology
• Emergence of microprocessor: late 1970s
– C: virtual elimination of assembly language programming
(portability)
– UNIX: standardized, vendor-independent OS
Lowered the cost and risk of bringing out a new architecture
1.1 Introduction
(cont.)
• Development of RISC (Reduced Instruction
Set Computer): early 1980s
Focused the attention of designers on
– The exploitation of instruction-level parallelism
• Initially through pipelining, and
• later through multiple instruction issue
– The use of caches
• initially in simple forms and
• later using more sophisticated organizations and
optimizations
Figure 1.1: 20 years of
sustained growth in
performance at an annual
rate of over 50% (two
different versions of SPEC,
i.e., SPEC92 and SPEC95)
Effects
Significantly enhanced the capability available to users
Dominance of microprocessor-based computers across the entire
range of computer design
• Minicomputers and mainframes replaced
• Supercomputers built with collections of microprocessors
1.2 Changing Face of Computing
• 1960s: dominated by mainframes
– business data processing, large-scale scientific computing
• 1970s: the birth of minicomputers
– scientific laboratories, time-sharing multiple-user
environment
• 1980s: the rise of the desktop computer (personal
computer and workstation)
• 1990s: the emergence of the Internet and the World
Wide Web
– handheld computing devices
– emergence of high-performance digital consumer
electronics
Segmented Markets
• Desktop computing: to optimize price-performance
– the largest market in dollar terms
– Web-centric, interactive applications
• Servers
– Web-based services, replacing traditional mainframes
– Key requirements: availability, scalability, throughput
• Embedded Computers
– Features
• The presence of the computers is not immediately obvious
• The application can usually be carefully tuned for the processor and system
– Widest range of processing power and cost
• Price is a key factor in the design of these computers
• Real-time performance requirements
– Key characteristics
• The need to minimize memory and power
• The use of processor cores together with application-specific circuitry
Task of Computer Designer
• Tasks
– Determine what attributes are important for a new machine,
then
– design a machine to maximize performance while staying
within cost and power constraints
• Efforts
– instruction set design, functional organization, logic design,
and implementation (IC design, packaging, power, and
cooling)
Design a computer to meet functional requirements as well
as price, power, and performance goals
(see Figure 1.4)
Task of a Computer Designer
[Design-cycle diagram: evaluate existing systems for bottlenecks (using benchmarks and workloads), simulate new designs and organizations (guided by technology trends and implementation complexity), then implement the next-generation system, and repeat.]
Effect of trends
The designer often needs to design for the next technology, because
during a single product cycle (typically 4-5 years), key technologies,
such as memory, change sufficiently that the designer must plan for
these changes.
1.3 Technology Trends
The designer must be especially aware of rapidly
occurring changes in implementation technology
1.Integrated circuit logic technology
– transistor density increases by about 35% per year
– increases in die size are less predictable and slower
combined effect: growth rate in transistor count of about 55% per year
2.Semiconductor DRAM
– density increases by between 40% and 60% per year
– cycle time has improved slowly, decreasing by about one-third in 10 years
– bandwidth per chip increases about twice as fast as latency decreases
3.Magnetic disk technology
4.Network technology
Moore’s Law
Gordon Moore (Founder of Intel) observed in 1965 that the number
of transistors that could be crammed on a chip doubles every year.
http://www.intel.com/research/silicon/mooreslaw.htm
Technology Trends
In terms of speed, the CPU has improved dramatically,
but memory and disk have improved only a little.
This has led to dramatic changes in architecture,
operating systems, and programming practices.

          Capacity        Speed (latency)
Logic     2x in 3 years   2x in 3 years
DRAM      4x in 3 years   2x in 10 years
Disk      4x in 3 years   2x in 10 years
1.4 Cost, Price, and Their Trends
• Price: what you sell a finished good for
– 1999: more than half the PCs sold were
priced at less than $1000
• Cost: the amount spent to produce it,
including overhead
Cost and Its Trend
• Time
– Learning curve: manufacturing cost reduces over time
– Yield: the percentage of the manufactured devices that
survive the testing procedure
– As the technology matures over time, the yield improves and
hence, things get cheaper
• Volume
– Increasing volume decreases cost: it shortens the time needed to get
down the learning curve and
– increases purchasing and manufacturing efficiency
• Commodities
– Products that are sold by multiple vendors in large volumes and
are essentially identical
– The low-end business such as DRAMs, disks, monitors
– Improved competition → reduced cost
Prices of six generations of DRAMs
Between the start of a project and the shipping of a product, say, two years, the cost of
a new DRAM drops by a factor of between 5 and 10 in constant dollars.
Price of an Intel Pentium III at a Given Frequency
It decreases over time as yield enhancements decrease the cost of a good die and
competition forces price reductions.
Wafer and Dies
Exponential cost decrease – technology basically the same:
A wafer is tested and chopped into dies that are packaged
Figure 1.8. This wafer contains 564 MIPS64 R20K processors implemented in a 0.18-micron process
Figure 1.7. Intel Pentium 4 microprocessor Die
Cost of an Integrated Circuit (IC)
Cost of IC = (Cost of die + Cost of testing die + Cost of packaging and final test) / Final test yield
(the die cost is the portion of the cost that varies most between machines)

Cost of die = Cost of wafer / (Dies per wafer × Die yield)   (sensitive to die size)

Dies per wafer = π × (Wafer diameter / 2)^2 / Die area − π × Wafer diameter / sqrt(2 × Die area)
(the second term subtracts the partial dies along the edge of the wafer)

Die yield = Wafer yield × (1 + Defect density × Die area / α)^(−α)

Today's technology: α ≈ 4, defect density between 0.4 and 0.8 per cm2
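As a rough illustration (not from the slides), the following C sketch plugs representative numbers into these formulas; the wafer cost, wafer diameter, die area, and defect density below are assumed values, not figures from the text.

#include <math.h>
#include <stdio.h>

/* Die cost model from the slide: dies per wafer, die yield, and die cost.
   All parameter values in main() are illustrative assumptions. */

static const double PI = 3.14159265358979;

static double dies_per_wafer(double wafer_diam_cm, double die_area_cm2)
{
    double r = wafer_diam_cm / 2.0;
    /* wafer area / die area, minus the partial dies along the edge */
    return (PI * r * r) / die_area_cm2
         - (PI * wafer_diam_cm) / sqrt(2.0 * die_area_cm2);
}

static double die_yield(double wafer_yield, double defects_per_cm2,
                        double die_area_cm2, double alpha)
{
    return wafer_yield * pow(1.0 + defects_per_cm2 * die_area_cm2 / alpha, -alpha);
}

int main(void)
{
    double wafer_cost = 5000.0; /* assumed cost of one processed wafer ($) */
    double wafer_diam = 30.0;   /* assumed 30 cm wafer */
    double die_area   = 1.0;    /* assumed 1 cm^2 die */
    double defects    = 0.6;    /* per cm^2, within the 0.4-0.8 range above */
    double alpha      = 4.0;

    double dpw  = dies_per_wafer(wafer_diam, die_area);
    double dy   = die_yield(1.0, defects, die_area, alpha);
    double cost = wafer_cost / (dpw * dy);

    printf("dies/wafer = %.0f, die yield = %.2f, cost per good die = $%.2f\n",
           dpw, dy, cost);
    return 0;
}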
Cost of an IC (cont.)
Cost Example in a $1000 PC in 2001
1.5 Measuring and Reporting
Performance
• Designing high performance computers is one of the major
goals of any computer architect.
• As a result, assessing the performance of computer
hardware is at the heart of computer design and greatly
affects the demand for and market value of the computer.
• However, measuring performance of a computer system is
not a straightforward task:
 Metrics – How do we describe in a numerical way the
performance of a computer?
 What tools do we use to find those metrics?
 How do we summarize the performance?
What is the computer user interested in?
• Reduce the time to run certain task
–Execution time (response time)
–The time between the start and the completion of
an event
• Increase the tasks per week, day, hour, sec, ns …
–Throughput
–The total amount of work done in a given time
Example
Do the following changes to a computer system increase
throughput, reduce response time, or both?
1) Replacing the processor in a computer with a faster version
2) Adding additional processors to a system that uses multiple
processors for separate tasks –for example handling an
airline reservation system
Answer
1) Both response time and throughput are improved
2) Only throughput increases
Execution Time and Performance Comparison
• In this subject, we will be primarily interested in execution time
as a measure of performance
• The relationship between performance and execution time on a
computer X (reciprocal) is defined as follows:
Performance_X = 1 / ExecutionTime_X

• To compare design alternatives, we use the following equation:

n = ExecutionTime_Y / ExecutionTime_X = Performance_X / Performance_Y
• “X is n times faster than Y” or “the throughput of X is n times
higher than Y” means that the execution time is n times less on
X than Y
• To maximize performance of an application, we need to minimize
its execution time
Measure Performance – user CPU time
• Response time may include disk access, memory access,
input/output activities, CPU event and operating system overhead
- everything
• In order to get an accurate measure of performance, we use CPU
time instead of using response time
• CPU time is the time the CPU spends computing a program and
does not include time spent waiting for I/O or running other
programs
• CPU time can also be divided into user CPU time (program) and
system CPU time (OS)
• Keying in the UNIX command time, we get, for example,
90.7u 12.9s 2:39 65% (user CPU, system CPU, total elapsed time, % of elapsed time spent in the CPU)
• In our performance measures, we use user CPU time – because
of its independence from the OS and other factors
Programs for Measuring Performance
• Real applications: text processing software (Word), compilers (C),
and other applications like Photoshop – have inputs, outputs, and
options when the user wants to use them
 One major downside: Real applications often encounter portability problems
arising from dependences on the OS or compiler
• Modified (or scripted) applications: Modification is to enhance
portability or focus on one particular aspect of system performance.
Scripts are used to simulate the interaction behaviors
• Kernels: small, key pieces from real programs. Typically used to
evaluate individual features of the machine
• Toy benchmarks: typically between 10 and 100 lines of code and
produce a known result
• Synthetic benchmarks: artificially created code to match an average
execution profile
Benchmark Suites
• They are a collection of programs (workload) that try to
explore and capture all the strengths and weaknesses of a
computer system (real programs, kernels)
• A key advantage of such suites is that the weakness of any
one benchmark is lessened by the presence of the other
benchmarks
• Good vs. Bad benchmarks
– Improving product for real programs vs. improving product for
benchmarks to get more sales
– If benchmarks are inadequate, then sales wins!
SPEC Benchmarks
SPEC: Standard Performance Evaluation Corporation
• The most successful attempt, and widely adopted
• First generation 1989
– 10 programs yielding a single number (“SPECmarks”)
• Second generation 1992
– SPECInt92 (6 integer programs) and SPECfp92 (14 floating point
programs)
– Unlimited compiler flags
• Third generation 1995
– New set of programs: SPECint95 (8 integer programs) and SPECfp95
(10 floating point)
– Single flag setting for all programs: SPECint_base95,
SPECfp_base95
– “benchmarks useful for 3 years”
• SPEC CPU 2000
– CINT2000 (12 integer programs) and
CFP2000 (14 floating point programs)
SPEC Benchmarks
CINT2000 (Integer Component of SPEC CPU2000)
Program       Language   What Is It
164.gzip      C          Compression
175.vpr       C          FPGA Circuit Placement and Routing
176.gcc       C          C Programming Language Compiler
181.mcf       C          Combinatorial Optimization
186.crafty    C          Game Playing: Chess
197.parser    C          Word Processing
252.eon       C++        Computer Visualization
253.perlbmk   C          PERL Programming Language
254.gap       C          Group Theory, Interpreter
255.vortex    C          Object-oriented Database
256.bzip2     C          Compression
300.twolf     C          Place and Route Simulator
http://www.spec.org/osg/cpu2000/CINT2000/
SPEC Benchmarks
CFP2000 (Floating Point Component of SPEC CPU2000)
Program        Language     What Is It
168.wupwise    Fortran 77   Physics / Quantum Chromodynamics
171.swim       Fortran 77   Shallow Water Modeling
172.mgrid      Fortran 77   Multi-grid Solver: 3D Potential Field
173.applu      Fortran 77   Parabolic / Elliptic Differential Equations
177.mesa       C            3-D Graphics Library
178.galgel     Fortran 90   Computational Fluid Dynamics
179.art        C            Image Recognition / Neural Networks
183.equake     C            Seismic Wave Propagation Simulation
187.facerec    Fortran 90   Image Processing: Face Recognition
188.ammp       C            Computational Chemistry
189.lucas      Fortran 90   Number Theory / Primality Testing
191.fma3d      Fortran 90   Finite-element Crash Simulation
200.sixtrack   Fortran 77   High Energy Physics Accelerator Design
301.apsi       Fortran 77   Meteorology: Pollutant Distribution
http://www.spec.org/osg/cpu2000/CFP2000/
More Benchmarks
TPC: Transaction Processing Performance Council
– Measure the ability of a system to handle transactions, which
consist of database accesses and updates.
– Many variants depending on transaction complexity
– TPC-A: simple bank teller transaction style
– TPC-C: complex database query
EDN Embedded Microprocessor Benchmark Consortium
(EEMBC, “embassy”)
– 34 kernels in 5 classes
– 16 automotive/industrial; 5 consumer; 3 networking; 4 office
automation; 6 telecommunications
How to Summarize Performance
Management would like to have one number
Technical people want more:
1.They want to have evidence of reproducibility – there should be
enough information so that you or someone else can repeat the
experiment
2.There should be consistency when doing the measurements
multiple times
How would you report these results?
                     Computer A   Computer B   Computer C
Program P1 (secs)          1           10           20
Program P2 (secs)       1000          100           20
Total Time (secs)       1001          110           40
Comparing and Summarizing Performance
 Comparing the performance by looking at individual
programs is not fair
 Total execution time: a consistent summary measure
 Arithmetic Mean – provides a simple average
Arithmetic Mean = (1/n) × Σ(i=1..n) Time_i

– Time_i: execution time for program i in the workload
– Doesn't account for weight: all programs are treated equal
Weighted Variants
 What is the proper mixture of programs for the workload?
 Weight is a weighting factor to each program to indicate
the relative frequency of the program in the workload: %
of use
 Weighted Arithmetic Mean
Weighted Arithmetic Mean = Σ(i=1..n) Weight_i × Time_i

– Weight_i: frequency of program i in the workload
– May be better, but beware the dominant program time
Example: Figure 1.16
                              A       B       C      W(1)    W(2)    W(3)
Program P1 (secs)             1       10      20     0.5     0.909   0.999
Program P2 (secs)             1000    100     20     0.5     0.091   0.001
Arithmetic Mean               500.5   55      20
Weighted Arithmetic Mean (1)  500.5   55      20
Weighted Arithmetic Mean (2)  91.82   18.18   20
Weighted Arithmetic Mean (3)  2       10.09   20

Weighted Arithmetic Mean = Σ(i=1..n) Weight_i × Time_i
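The following C sketch (not part of the original slides) recomputes the arithmetic and weighted arithmetic means above from the P1/P2 execution times; the W(2) weights are entered as the exact fractions 10/11 and 1/11 that the rounded 0.909/0.091 stand for.

#include <stdio.h>

/* Recompute the arithmetic and weighted arithmetic means of Figure 1.16. */
int main(void)
{
    const char *machine[] = {"A", "B", "C"};
    double time[3][2] = {{1, 1000}, {10, 100}, {20, 20}};  /* P1, P2 (secs) */
    double weight[3][2] = {
        {0.5, 0.5},                  /* W(1) */
        {10.0 / 11.0, 1.0 / 11.0},   /* W(2): approx. 0.909 / 0.091 */
        {0.999, 0.001}               /* W(3) */
    };

    for (int m = 0; m < 3; m++) {
        double am = (time[m][0] + time[m][1]) / 2.0;  /* simple average */
        printf("Machine %s: AM = %6.2f ", machine[m], am);
        for (int w = 0; w < 3; w++) {
            double wam = weight[w][0] * time[m][0] + weight[w][1] * time[m][1];
            printf(" WAM(%d) = %6.2f", w + 1, wam);
        }
        printf("\n");
    }
    return 0;
}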
Normalized Time Metrics
 Normalized execution time metrics
Measure the performance by normalizing it to a reference
machine: Execution time ratio_i
Geometric Mean

Geometric Mean = ( Π(i=1..n) Execution time ratio_i )^(1/n)   (the nth root of the product of the n ratios)

The geometric mean is consistent no matter which machine is the
reference
The arithmetic mean should not be used to average
normalized execution times
However, the geometric mean still does not form an accurate
prediction model (it does not predict execution time). It
encourages designers to improve the easiest benchmarks
rather than the slowest benchmarks (since there is no weighting)
Example
                              A       B       C      W(1)
Program P1 (secs)             1       10      20     100/101
Program P2 (secs)             1000    100     20     1/101

Normalized to A:
Program P1                    1       10      20
Program P2                    1       0.1     0.02

Geometric Mean                1       1       0.63
Arithmetic Mean               500.5   55      20
Weighted Arithmetic Mean (1)  10.89   10.89   20
Machines A and B have the same performance according to the
geometric mean measure, yet the weighted arithmetic mean shows this
would be true only for a workload in which P1 is run 100 times more
often than P2.
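A small C sketch (not from the slides) that normalizes each machine's times to machine A and takes the geometric mean, reproducing the 1, 1, 0.63 row above:

#include <math.h>
#include <stdio.h>

/* Normalize each machine's times to machine A, then take the geometric mean. */
int main(void)
{
    const char *machine[] = {"A", "B", "C"};
    double t[3][2] = {{1, 1000}, {10, 100}, {20, 20}};  /* P1, P2 (secs) */
    double ref[2]  = {1, 1000};                         /* machine A is the reference */

    for (int m = 0; m < 3; m++) {
        double product = 1.0;
        for (int p = 0; p < 2; p++)
            product *= t[m][p] / ref[p];      /* execution time ratio */
        double gm = pow(product, 1.0 / 2.0);  /* nth root, n = 2 programs */
        printf("Machine %s: geometric mean of normalized times = %.2f\n",
               machine[m], gm);
    }
    return 0;
}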
1.6 Quantitative Principles of
Computer Design
Now that we know how to define, measure, and summarize
performance, we can explore some of the principles and
guidelines for the design and analysis of computers.
Make the common case fast
In making a design trade-off, favor the frequent case over the infrequent case
Improving the frequent event, rather than the rare event, will obviously help
performance
Frequent case is often simpler and can be done faster than the infrequent
case
We have to decide what the frequent case is and how much performance
can be improved by making the case faster
Two equations to evaluate design alternatives
Amdahl’s Law
 Amdahl’s Law states that the performance improvement to be
gained from using some fast mode of execution is limited by the
fraction of the time the faster mode can be used
 Amdahl’s Law defines the speedup that can be gained by using a
particular feature
 The performance gain that can be obtained by improving some
portion of a computer can be calculated using Amdahl's Law
The CPU Performance Equation
 Essentially all computers are constructed using a clock running
at a constant rate
 CPU time then can be expressed by the amount of clock cycles
Amdahl's Law
Speedup due to enhancement E:

Speedup(E) = Execution_Time_without_enhancement / Execution_Time_with_enhancement

Suppose that enhancement E accelerates a fraction F of the task by a
factor S, and the remainder of the task is unaffected.
• Fraction_enhanced: the fraction of the execution time in the original machine that can
be converted to take advantage of the enhancement
• Speedup_enhanced: the improvement gained by the enhanced execution mode
Amdahl's Law
ExTime_new = ExTime_old × ( (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced )

Speedup_overall = ExTime_old / ExTime_new
                = 1 / ( (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced )
Amdahl's Law
Example: Floating point (FP) instructions are improved to run faster by
a factor of 2; but only 10% of actual instructions are FP. What’s the
overall speedup gained by this improvement?
Answer
ExTime_new = ExTime_old × (0.9 + 0.1/2) = 0.95 × ExTime_old

Speedup_overall = 1 / 0.95 ≈ 1.053
 Amdahl’s Law can serve as a guide to how much an enhancement
will improve performance and how to distribute resource to improve
cost-performance
 It is particularly useful for comparing performances both of the
overall system and the CPU of two design alternatives
More examples: pages 41 and 42
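As a quick check of the formula, here is a minimal C sketch (not from the slides) of the overall-speedup computation, applied to the FP example above (fraction 0.1, speedup 2):

#include <stdio.h>

/* Amdahl's Law: overall speedup when a fraction of the execution time
   is sped up by a given factor and the rest is unchanged. */
static double amdahl_speedup(double fraction_enhanced, double speedup_enhanced)
{
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced);
}

int main(void)
{
    /* The FP example above: 10% of the time is FP, FP made 2x faster. */
    printf("overall speedup = %.3f\n", amdahl_speedup(0.10, 2.0));  /* 1.053 */
    return 0;
}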
CPU Performance
 All computers are constructed using a clock to operate its
circuits
• Typically measured by two basic metrics
• Clock rate – today in MHz and GHz
• Clock cycle time: clock cycle time = 1/clock rate
• E.g., 1 GHz rate corresponds to 1 ns cycle time
 Thus the CPU time for a program is given by:
CPU time = CPU clock cycles for a program × Clock cycle time

Or,

CPU time = CPU clock cycles for a program / Clock rate
CPU Performance
 More typically, we tend to count # instructions executed, known as
Instruction Count (IC)
 CPI: average number of clock cycles per instruction
CPI = CPU clock cycles for a program / IC

 Hence, an alternative method to get the CPU time:

CPU time = IC × CPI × Clock cycle time = (IC × CPI) / Clock rate

(IC depends on the instruction set of the computer and its compiler)
 CPU performance is equally dependent upon 3 characteristics: clock cycle time,
CPI, and IC. These three characteristics are not independent of one another: making one
better often makes another worse, because the basic technologies involved in changing
each characteristic are interdependent.
 Clock cycle time: hardware technology and organization
 CPI: organization and instruction set architecture
 IC: instruction set architecture and compiler technology
CPU Performance
CPU time  IC  CPI  Clock cycle time
Example: Suppose we have 2 implementations of the same instruction set
architecture. Computer A has a clock cycle time of 10 nsec and a CPI of 2.0
for some program, and computer B has a clock cycle time of 20 nsec and a
CPI of 1.2 for the same program.
What machine is faster for this program?
Answer
Assume the program requires IN instructions to be executed:
CPU clock cycleA = IN x 2.0
CPU clock cycleB = IN x 1.2
CPU timeA = IN x 2.0 x 10 = 20 x IN nsec
CPU timeB = IN x 1.2 x 20 = 24 x IN nsec
So computer A is faster than computer B.
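A minimal C sketch (not from the slides) of the comparison above; the instruction count is an assumed value, since only the ratio matters:

#include <stdio.h>

/* CPU time = IC x CPI x clock cycle time, for machines A and B above. */
int main(void)
{
    double ic = 1e9;                    /* assumed instruction count (cancels out) */
    double time_a = ic * 2.0 * 10e-9;   /* CPI 2.0, 10 ns cycle time */
    double time_b = ic * 1.2 * 20e-9;   /* CPI 1.2, 20 ns cycle time */

    printf("A: %.1f s, B: %.1f s, B/A = %.2f\n", time_a, time_b, time_b / time_a);
    return 0;                           /* A is 1.2 times faster than B */
}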
CPU Performance
 Often the overall performance is easy to deal with on a
per-instruction basis:

CPU clock cycles = Σ(i=1..n) IC_i × CPI_i

– IC_i: the number of times instruction i is executed
– CPI_i: the average number of clock cycles for instruction i

CPU time = ( Σ(i=1..n) IC_i × CPI_i ) × Clock cycle time

 The overall CPI can be expressed as:

CPI = ( Σ(i=1..n) IC_i × CPI_i ) / Instruction count = Σ(i=1..n) (IC_i / Instruction count) × CPI_i
Cycles Per Instruction
Example: Suppose we have a machine where we can count the
frequency with which instructions are executed. We also know
how many cycles it takes for each instruction type.
Base Machine (Reg / Reg)
Instruction   Freq   CPI   (% Time)
ALU           50%    1     (33%)
Load          20%    2     (27%)
Store         10%    2     (13%)
Branch        20%    2     (27%)
                           (Total CPI: 1.5)

How do we get total CPI?
How do we get % time?
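One way to answer the two questions, as a C sketch (not from the slides): total CPI is the frequency-weighted sum of the per-class CPIs, and each class's share of time is its cycle contribution divided by the total CPI.

#include <stdio.h>

/* Total CPI and %time per instruction class for the base machine above. */
int main(void)
{
    const char *name[] = {"ALU", "Load", "Store", "Branch"};
    double freq[] = {0.50, 0.20, 0.10, 0.20};  /* instruction mix */
    double cpi[]  = {1.0, 2.0, 2.0, 2.0};      /* cycles per instruction */
    int n = 4;

    double total_cpi = 0.0;
    for (int i = 0; i < n; i++)
        total_cpi += freq[i] * cpi[i];         /* 0.5 + 0.4 + 0.2 + 0.4 = 1.5 */
    printf("Total CPI = %.2f\n", total_cpi);

    for (int i = 0; i < n; i++)                /* share of cycles = share of time */
        printf("%-6s %%time = %2.0f%%\n",
               name[i], 100.0 * freq[i] * cpi[i] / total_cpi);
    return 0;
}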
CPU Performance
Example: Suppose we have made the following measurements:
Frequency of FP operations (other than FPSQR) = 25%
Avg. CPI of FP operations = 4.0
Avg. CPI of other instructions = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20
Compare two designs:
1) decrease CPI of FPSQR to 2;
2) decrease the avg. CPI of all FP operations to 2.5.
Answer
First, find the original CPI:
CPI_original = Σ(i=1..n) (IC_i / Instruction count) × CPI_i = 25% × 4 + 75% × 1.33 = 2.0

The CPI with design 1:

CPI_1 = CPI_original − 2% × (original CPI of FPSQR − improved CPI of FPSQR) = 2.0 − 2% × (20 − 2) = 1.64

The CPI with design 2:

CPI_2 = 25% × 2.5 + 75% × 1.33 = 1.625

So design 2 is better
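The same comparison as a C sketch (not from the slides), recomputing the three CPIs directly from the measured frequencies:

#include <stdio.h>

/* Recompute the original CPI and the CPI of the two proposed designs. */
int main(void)
{
    double freq_fp = 0.25, cpi_fp = 4.0;         /* all FP ops, incl. FPSQR */
    double freq_other = 0.75, cpi_other = 1.33;  /* all other instructions */
    double freq_fpsqr = 0.02, cpi_fpsqr = 20.0;

    double cpi_orig    = freq_fp * cpi_fp + freq_other * cpi_other;
    double cpi_design1 = cpi_orig - freq_fpsqr * (cpi_fpsqr - 2.0);  /* FPSQR CPI -> 2 */
    double cpi_design2 = freq_fp * 2.5 + freq_other * cpi_other;     /* all FP CPI -> 2.5 */

    printf("original CPI = %.3f, design 1 = %.3f, design 2 = %.3f\n",
           cpi_orig, cpi_design1, cpi_design2);
    return 0;
}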
Principle of Locality
 Other important fundamental observations come from the properties
of programs.
 Principle of locality: Programs tend to reuse data and instructions
they have used recently.
 There are two different types of locality: (Ref. Chapter 5)
 Temporal Locality (locality in time): If an item is referenced, it
will tend to be referenced again soon (loops, reuse, etc.)
 Spatial Locality (locality in space/location): If an item is
referenced, items whose addresses are close to one another tend
to be referenced soon (straight line code, array access, etc.)
 We can predict with reasonable accuracy what instructions and
data a program will use in the near future based on its accesses in
the past.
 A program spends 90% of its execution time in only 10% of the code.
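A small C example (not from the slides) of both kinds of locality in ordinary code: the loop index and the running sum are reused every iteration (temporal locality), and the array elements are touched at consecutive addresses (spatial locality):

#include <stdio.h>

#define N 1024

/* Summing an array: 'sum' and 'i' are reused on every iteration (temporal
   locality), and a[0], a[1], ... sit at consecutive addresses (spatial
   locality), so a cache captures almost all of these references. */
int main(void)
{
    static int a[N];
    long sum = 0;

    for (int i = 0; i < N; i++)
        a[i] = i;

    for (int i = 0; i < N; i++)
        sum += a[i];            /* sequential, cache-friendly access pattern */

    printf("sum = %ld\n", sum);
    return 0;
}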
Some “Misleading” Performance Measures
 There are certain computer performance measures which are
popular with computer manufacturers and sellers – but may be
misleading.
 MIPS (Million Instructions Per Second)
 MIPS depends on the instruction set, which makes it difficult to
compare the MIPS of computers with different instruction sets.
 MIPS varies between programs on the same computer – different
programs use different instruction mixes.
 Most importantly, MIPS can vary inversely to performance.
Some “Misleading” Performance Measures
 MFLOPS: Focus on one type of work
 MFLOPS (Million Floating-point Operations Per Second) depends on the
program. Must be FP intensive.
 MFLOPS depends on the computer as well.
 The floating-point operations vary in complexity (e.g., add & divide).
 Peak Performance: performance that the manufacturer guarantees
you won't exceed
 Difference between peak performance and average performance is
huge.
 Peak performance is not useful in predicting observed performance.
Summary
1.1 Introduction
1.2 The Changing Face of Computing and the Task of
the Computer Designer
1.3 Technology Trends
1.4 Cost, Price, and Their Trends
1.5 Measuring and Reporting Performance
– Benchmarks
– (Weighted) Arithmetic Mean, Normalized Geometric Mean
1.6 Quantitative Principles of Computer Design
– Amdahl’s Law
– CPU Performance: CPU Time, CPI
– Locality