CMPUT429/CMPE382 Winter 2001
Topic 2: Technology Trends and Cost/Performance
(Adapted from David A. Patterson’s CS252
lecture slides at Berkeley)
1/17/01
CS252/Patterson
Lec 1.1
Technology Trends: Microprocessor Capacity
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) vs. year (1970 to 2000), following Moore’s Law. Chips plotted: i4004, i8080, i8086, i80286, i80386, i80486, Pentium.]
“Graduation Window”:
• Alpha 21264: 15 million
• Pentium Pro: 5.5 million
• PowerPC 620: 6.9 million
• Alpha 21164: 9.3 million
• Sparc Ultra: 5.2 million
CMOS improvements:
• Die size: 2X every 3 yrs
• Line width: halve / 7 yrs
Memory Capacity (Single Chip DRAM)
[Figure: bits per chip (log scale, 1,000 to 1,000,000,000) vs. year (1970 to 2000).]
Year    Size (Mb)   Cycle time
1980    0.0625      250 ns
1983    0.25        220 ns
1986    1           190 ns
1989    4           165 ns
1992    16          145 ns
1996    64          120 ns
2000    256         100 ns
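The DRAM table above can be turned into average growth rates. The following is a minimal sketch, assuming steady exponential growth between the first and last rows; the helper name `factor_per_period` is made up for illustration.

```python
# DRAM generations transcribed from the table above:
# (year, size in Mb, cycle time in ns).
DRAM = [
    (1980, 0.0625, 250),
    (1983, 0.25, 220),
    (1986, 1, 190),
    (1989, 4, 165),
    (1992, 16, 145),
    (1996, 64, 120),
    (2000, 256, 100),
]


def factor_per_period(total_factor, total_years, period_years):
    """Improvement per `period_years`, assuming steady exponential growth."""
    return total_factor ** (period_years / total_years)


(first_year, first_size, first_cycle) = DRAM[0]
(last_year, last_size, last_cycle) = DRAM[-1]
span_years = last_year - first_year

# Capacity grows, so the total factor is new / old.
capacity_per_3yr = factor_per_period(last_size / first_size, span_years, 3)
# Cycle time shrinks, so the speed improvement is old / new.
speed_per_10yr = factor_per_period(first_cycle / last_cycle, span_years, 10)

print(f"capacity: {capacity_per_3yr:.2f}x per 3 years")
print(f"speed:    {speed_per_10yr:.2f}x per 10 years")
```

Averaged over the whole 1980 to 2000 span, the table gives about 3.5x capacity per 3 years and about 1.6x speed per 10 years, in the same range as the rounded "4x in 3 years" and "2x in 10 years" rules of thumb.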
Technology Trends (Summary)
          Capacity         Speed (latency)
Logic     2x in 3 years    2x in 3 years
DRAM      4x in 3 years    2x in 10 years
Disk      4x in 3 years    2x in 10 years
Processor Performance Trends
[Figure: log-scale plot (0.1 to 1000) against year (1965 to 2000), with curves for Supercomputers, Mainframes, Minicomputers, and Microprocessors.]
Processor Performance (1.35X before, 1.55X now)
[Figure: performance (0 to 1200) vs. year (1987 to 1997), with a 1.54X/yr trend annotation. Machines plotted: Sun-4/260, MIPS M/120, MIPS M/2000, IBM RS/6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, DEC Alpha 21264/600.]
Performance Trends
(Summary)
• Workstation performance (measured in SPECmarks) improves roughly 50% per year
(2X every 18 months)
• Improvement in cost performance estimated
at 70% per year
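The relation between an annual improvement rate and a doubling time can be checked directly. This is a small sketch, assuming steady compound growth; `doubling_time_years` is an illustrative helper, not something from the lecture.

```python
import math


def doubling_time_years(annual_growth):
    """Years needed to double at a steady annual growth rate.

    `annual_growth` is a fraction: 0.5 means 50% per year.
    """
    return math.log(2) / math.log(1 + annual_growth)


for label, rate in [("performance", 0.50), ("cost/performance", 0.70)]:
    months = doubling_time_years(rate) * 12
    print(f"{label}: {rate:.0%} per year doubles in about {months:.1f} months")
```

Under this assumption, 50% per year doubles in roughly 20 months, so "2X every 18 months" is best read as a round figure.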
Computer Architecture Topics
• Input/Output and Storage: Disks, WORM, Tape; RAID; Emerging Technologies
• Memory Hierarchy: DRAM (Interleaving, Bus protocols); L2 Cache; L1 Cache; Coherence, Bandwidth, Latency
• VLSI
• Instruction Set Architecture: Addressing, Protection, Exception Handling
• Pipelining and Instruction Level Parallelism: Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, DSP
Computer Architecture Topics
[Diagram: processor-memory (P, M) nodes connected through a switch (S) to an Interconnection Network.]
• Processor-Memory-Switch
• Multiprocessors: Shared Memory, Message Passing, Data Parallelism
• Networks and Interconnections: Network Interfaces; Topologies, Routing, Bandwidth, Latency, Reliability
Course Focus
Computer Architecture:
• Instruction Set Design
• Organization
• Hardware
[Diagram: Computer Architecture shown together with Technology, Parallelism, Programming Languages, Applications, Operating Systems, History, Interface Design (ISA), and Measurement & Evaluation.]
Measurement Tools
• Benchmarks, Traces, Mixes
• Hardware: Cost, delay, area, power
estimation
• Simulation (many levels)
– ISA, RT, Gate, Circuit
• Queueing Theory
• Rules of Thumb
• Fundamental “Laws”/Principles
Which is faster?
Plane              DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747         6.5 hours     610 mph    470          286,700
BAC/Sud Concorde   3 hours       1350 mph   132          178,200
• Time to run the task (ExTime)
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns … (Performance)
– Throughput, bandwidth
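The two answers to "which is faster?" can be reproduced from the table. This is a minimal sketch, assuming pmph means passenger-miles per hour (passengers times speed), which matches the table's numbers.

```python
# Figures from the table above: trip time (hours), speed (mph), passengers.
planes = {
    "Boeing 747": {"hours": 6.5, "mph": 610, "passengers": 470},
    "Concorde": {"hours": 3.0, "mph": 1350, "passengers": 132},
}


def throughput_pmph(plane):
    """Passenger-miles per hour: passengers carried times speed."""
    return plane["passengers"] * plane["mph"]


b747 = planes["Boeing 747"]
concorde = planes["Concorde"]

# Latency view: how many times shorter is the Concorde trip?
latency_ratio = b747["hours"] / concorde["hours"]
# Throughput view: how many times more passenger-miles per hour does the 747 deliver?
throughput_ratio = throughput_pmph(b747) / throughput_pmph(concorde)

print(f"Concorde trip time advantage: {latency_ratio:.2f}x")
print(f"747 throughput advantage:     {throughput_ratio:.2f}x")
```

The Concorde wins on execution time (about 2.2x) while the 747 wins on throughput (about 1.6x), so "faster" depends on which metric matters.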
Definitions
• Performance is in units of things per sec
– bigger is better
• If we are primarily concerned with response time
– $\text{performance}(x) = \dfrac{1}{\text{execution\_time}(x)}$
"X is n times faster than Y" means
$$n = \frac{\text{Execution\_time}(Y)}{\text{Execution\_time}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$
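The definition of "n times faster" can be sketched in a few lines. The timings below are hypothetical, chosen only for illustration.

```python
def performance(execution_time):
    """Performance as the reciprocal of execution time (bigger is better)."""
    return 1.0 / execution_time


def times_faster(time_x, time_y):
    """Return n such that 'X is n times faster than Y'."""
    return time_y / time_x


# Hypothetical timings in seconds (not from the lecture).
time_x, time_y = 10.0, 15.0

n_from_time = times_faster(time_x, time_y)
n_from_perf = performance(time_x) / performance(time_y)

print(f"from execution times: {n_from_time:.2f}")
print(f"from performance:     {n_from_perf:.2f}")
```

Both routes give the same n, since the ratio of performances is the inverse ratio of execution times.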
Cycles Per Instruction
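IC = Instruction Count
CPI = Clock cycles Per Instruction
$$\text{CPU time} = \text{Number of clock cycles} \times \text{Clock cycle time} = \frac{\text{Number of clock cycles}}{\text{Clock frequency}}$$
$$\text{CPI} = \frac{\text{Number of clock cycles}}{\text{IC}}$$
$$\text{CPU time} = \text{IC} \times \text{CPI} \times \text{Clock cycle time} = \frac{\text{IC} \times \text{CPI}}{\text{Clock rate}}$$
$$\text{CPU time} = \text{Cycle time} \times \sum_{j=1}^{n} \text{CPI}_j \times \text{IC}_j$$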
Cycles Per Instruction
We may separate the contribution of each type of instruction to the execution time, defining:
$$\text{Number of clock cycles} = \sum_{j=1}^{n} \text{CPI}_j \times \text{IC}_j$$
where $\text{IC}_j$ is the number of times that instruction $j$ is executed, and $\text{CPI}_j$ is the average number of clocks required to execute instruction $j$.
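The per-class sum can be sketched as follows. The instruction counts and the 500 MHz clock are hypothetical values chosen for illustration, and the function names are not from the lecture.

```python
def total_cycles(classes):
    """Sum CPI_j * IC_j over instruction classes given as (cpi, count) pairs."""
    return sum(cpi * count for cpi, count in classes)


def cpu_time_seconds(classes, clock_rate_hz):
    """CPU time = number of clock cycles / clock rate."""
    return total_cycles(classes) / clock_rate_hz


# Hypothetical program: one (CPI_j, IC_j) pair per instruction class.
classes = [(1, 5_000_000), (2, 2_000_000), (2, 1_000_000), (2, 2_000_000)]
clock_rate_hz = 500e6  # hypothetical 500 MHz clock

cycles = total_cycles(classes)
instructions = sum(count for _, count in classes)
average_cpi = cycles / instructions
cpu_time = cpu_time_seconds(classes, clock_rate_hz)

print("cycles:", cycles)
print("average CPI:", average_cpi)
print("CPU time (s):", cpu_time)
```

Dividing the total cycles by the total instruction count recovers the overall CPI, which ties the per-class sum back to CPU time = IC x CPI x cycle time.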
Example: Calculating CPI
Base Machine (Reg / Reg)
Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        .5       (33%)
Load     20%    2        .4       (27%)
Store    10%    2        .2       (13%)
Branch   20%    2        .4       (27%)
Total                    1.5
Typical Mix of instruction types in program
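The CPI(i) and % Time columns follow mechanically from the frequency and cycle columns. A short sketch, using the mix from the table above (variable names are illustrative):

```python
# Instruction mix from the table above: op -> (frequency, cycles).
MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}


def cpi_contributions(mix):
    """Per-op contribution to CPI: frequency times cycles."""
    return {op: freq * cycles for op, (freq, cycles) in mix.items()}


contrib = cpi_contributions(MIX)
cpi = sum(contrib.values())
# Share of execution time spent in each op class.
time_share = {op: c / cpi for op, c in contrib.items()}

for op in MIX:
    print(f"{op:6s} CPI(i)={contrib[op]:.1f}  time={time_share[op]:.0%}")
print(f"CPI = {cpi:.1f}")
```

ALU operations are half of all instructions but only a third of the time, because the other classes take two cycles each.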
Aspects of CPU Performance (CPU Law)
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
               Inst Count   CPI   Clock Rate
Program        X
Compiler       X            (X)
Inst. Set.     X            X
Organization                X     X
Technology                        X
Amdahl's Law
Speedup due to enhancement E:
$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$
Suppose that enhancement E accelerates a fraction
F of the task by a factor S, and the remainder of
the task is unaffected
Amdahl’s Law
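$$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]$$
$$\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$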
Amdahl’s Law
• Example: Floating point instructions improved to run 2X faster, but only 10% of actual instructions are FP
$$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[ (1 - 0.1) + \frac{0.1}{2} \right] = 0.95 \times \text{ExTime}_{\text{old}}$$
$$\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{0.95} = 1.053$$
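The floating-point example can be checked with a small function. This is a sketch of Amdahl's Law as stated on the slides; the function name and the input validation are additions for illustration.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of the task is sped up by a given factor."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# The slide's example: 10% of the work runs 2X faster.
fp_example = amdahl_speedup(0.1, 2)
# Limit case: the same 10% made (almost) infinitely fast.
fp_limit = amdahl_speedup(0.1, 1e12)

print(f"FP 2X faster:       {fp_example:.3f}")
print(f"FP arbitrarily fast: {fp_limit:.3f}")
```

Even an arbitrarily large speedup of the FP instructions cannot push the overall speedup past 1/0.9, because 90% of the task is unaffected.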
Metrics of Performance
[Diagram: levels of a system, top to bottom, with the metrics shown beside them.]
Levels: Application; Programming Language; Compiler; ISA; Datapath, Control; Function Units; Transistors, Wires, Pins
Metrics:
• Answers per month
• Operations per second
• (millions) of Instructions per second: MIPS
• (millions) of (FP) operations per second: MFLOP/s
• Megabytes per second
• Cycles per second (clock rate)
SPEC: System Performance Evaluation
Cooperative
• First Round 1989
– 10 programs yielding a single number (“SPECmarks”)
• Second Round 1992
– SPECint92 (6 integer programs) and SPECfp92 (14 floating point programs)
» Compiler flags unlimited. March 93 flags for DEC 4000 Model 610:
spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)=
memcpy(b,a,c)”
wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
• Third Round 1995
– new set of programs: SPECint95 (8 integer programs) and
SPECfp95 (10 floating point)
– “benchmarks useful for 3 years”
– Single flag setting for all programs: SPECint_base95,
SPECfp_base95
How to Summarize Performance
• Arithmetic mean (weighted arithmetic mean) tracks execution time: $\sum_i T_i / n$ or $\sum_i W_i T_i$
• Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: $n / \sum_i (1/R_i)$ or $1 / \sum_i (W_i / R_i)$, with weights $W_i$ summing to 1
• Normalized execution time is handy for scaling performance (e.g., X times faster than SPARCstation 10)
• But do not take the arithmetic mean of normalized execution time, use the geometric mean: $\left( \prod_{i=1}^{n} x_i \right)^{1/n}$, where $x_i$ is the normalized execution time of program $i$
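The three means behave differently on normalized data, which a short experiment makes visible. This is a sketch with hypothetical execution times and rates; none of the numbers come from the lecture.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def harmonic_mean(rates):
    return len(rates) / sum(1.0 / r for r in rates)


def geometric_mean(values):
    return math.prod(values) ** (1.0 / len(values))


def normalized(times, reference):
    """Execution times divided by the reference machine's times."""
    return [t / r for t, r in zip(times, reference)]


# Hypothetical execution times (seconds) of two programs on machines A and B.
times_a = [1.0, 1000.0]
times_b = [10.0, 100.0]

b_vs_a = normalized(times_b, times_a)  # B normalized to A
a_vs_b = normalized(times_a, times_b)  # A normalized to B

# Arithmetic mean of normalized times: each machine looks slower than the other.
print("arithmetic:", arithmetic_mean(b_vs_a), arithmetic_mean(a_vs_b))
# Geometric mean: the two views are reciprocals of each other, so they agree.
print("geometric: ", geometric_mean(b_vs_a), geometric_mean(a_vs_b))

# Harmonic mean of rates: two programs doing equal work at 10 and 40 MFLOPS.
print("harmonic:  ", harmonic_mean([10.0, 40.0]))
```

With the arithmetic mean, the conclusion flips depending on which machine is chosen as the reference; the geometric mean is independent of that choice. The harmonic mean of 10 and 40 MFLOPS is 16, which equals total work divided by total time when both programs do the same amount of work.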
Performance Evaluation
• “For better or worse, benchmarks shape a field”
• Good products are created when we have:
– Good benchmarks
– Good ways to summarize performance
• Given that sales are in part a function of performance relative to the competition, companies invest in improving the product as reported by the performance summary
• If the benchmarks/summary are inadequate, then the choice is between improving the product for real programs vs. improving the product to get more sales; sales almost always wins!
• Execution time is the measure of computer
performance!
Instruction Set Architecture (ISA)
software
instruction set
hardware
Interface Design
A good interface:
• Lasts through many implementations (portability, compatibility)
• Is used in many different ways (generality)
• Provides convenient functionality to higher levels
• Permits an efficient implementation at lower levels
[Diagram: a single Interface with several uses above it and implementations imp 1, imp 2, imp 3 following one another over time below it.]
Summary, #1
• Designing to Last through Trends
          Capacity         Speed
Logic     2x in 3 years    2x in 3 years
DRAM      4x in 3 years    2x in 10 years
Disk      4x in 3 years    2x in 10 years
• 6yrs to graduate => 16X CPU speed, DRAM/Disk size
• Time to run the task
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns, …
– Throughput, bandwidth
• “X is n times faster than Y” means
$$n = \frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$
Summary, #2
• Amdahl’s Law:
$$\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$
• CPI Law:
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
• Execution time is the REAL measure of computer performance!
• Good products are created when we have:
– Good benchmarks, good ways to summarize performance
• Die cost goes roughly with $(\text{die area})^4$
• Can PC industry support engineering/research investment?