Lecture 1: Course Introduction and Overview

Lecture 1: Review of Technology Trends and Cost/Performance
Prof. Jan M. Rabaey
Computer Science 252
Spring 2000
“Computer Architecture in Cory Hall”
JR.S00 1
CS 252 Course Focus
Understanding the design techniques, machine structures, technology factors, and evaluation methods that will determine the form of programmable processors in the 21st century.
[Figure: Computer Architecture (Instruction Set Design, Organization, Hardware) at the center, surrounded by the forces that shape it: Technology, Applications, Programming Languages, Operating Systems, and History, together with Interface Design (ISA) and Measurement & Evaluation.]
JR.S00 2
Related Courses
• CS 152 (strong prerequisite): how to build it, implementation details
• CS 252: why, analysis, evaluation
• CS 258: parallel architectures, languages, systems
• EE 141: Digital Integrated Circuits
• CS 250: Integrated Circuit Design
Basic knowledge of the organization of a computer is assumed!
JR.S00 3
Topic Coverage
Textbook: Hennessy and Patterson, Computer
Architecture: A Quantitative Approach, 2nd Ed., 1996.
• 1.5 weeks: Review: Fundamentals of Computer Architecture (Ch. 1), Instruction Set Architecture (Ch. 2), Pipelining (Ch. 3)
• 1.5 weeks: Pipelining and Instruction-Level Parallelism (Ch. 4)
• 2 weeks: Vector Processors and DSPs (Appendix B)
• 2 weeks: Configurable processors and computing
• 1 week: Memory Hierarchy (Ch. 5)
• 1 week: Input/Output and Storage (Ch. 6)
• 1 week: Networks and Interconnection Technology (Ch. 7)
• 1 week: Multiprocessors (Ch. 8)
• 2 weeks: Design space exploration and embedded processors
JR.S00 4
CS252: Administrative Information
Instructors:
Prof. Jan M. Rabaey
Office: 231 Cory Hall, 666-3111, jan@eecs
Office Hours: M 1:30-3pm, Tu 12:30-2pm
Prof. Kurt Keutzer
Office: 566 Cory Hall, 642-9267, keutzer@eecs
Office Hours: W 10-11:30am
TA: TBD
Class: TuTh 2-3:30pm, Hogan Room, Cory Hall
Text: Computer Architecture: A Quantitative Approach, Second Edition (1996) (second printing)
Web page: http://bwrc.eecs.berkeley.edu/Classes/CS252
Newsgroup: ucb.class.c252
JR.S00 5
Course Style
• Reduce the pressure of taking quizzes
– Only 2 Graded Quizzes: Thursday Mar. 3 and Th. Apr. 13
– Our goal: test knowledge vs. speed writing
– Take home!
• Major emphasis on research project
– Transition from undergrad to grad student
– Berkeley wants you to succeed, but you need to show initiative
– Pick a topic
– Meet 3 times with faculty/TA to review progress
– Give an oral presentation
– Give a poster session
– Written report like a conference paper
– ~3 weeks of full-time work for 2 people
– Opportunity to do “research in the small” to help make the transition from good student to research colleague
JR.S00 6
Course Style
• Everything is on the course Web page:
bwrc.eecs.berkeley.edu/Classes/CS252
• Notes:
– Lecture notes will be available on the web page by noon on the day of the lecture at the latest
– Midterms and pointers to old exams can be found on the web pages of previous offerings (pointers on the web site)
• Schedule:
– 2 graded quizzes: Thursday Mar. 3 and Thursday Apr. 13
– Project reviews/checkpoints: Tu Feb 15, Tu Mar 14, Tu Apr 11
– Oral presentations: Tu/Th Apr 25/27
– 252 poster session: Tu May 2
– 252 last lecture: Th May 4
– Project papers/URLs due: Tu May 9
JR.S00 7
Grading
• 5% Homeworks (work in pairs)
• 35% Examinations (2 Midterms)
• 60% Research Project (work in pairs)
JR.S00 8
1988 Computer Food Chain
[Figure: the 1988 computer food chain: Mainframe, Supercomputer, Minisupercomputer, Minicomputer, Workstation, PC, and Massively Parallel Processors.]
JR.S00 9
1997 Computer Food Chain
[Figure: the 1997 computer food chain: Mainframe, Server, Supercomputer, Workstation, PC, and PDA. Massively Parallel Processors, Minisupercomputer, and Minicomputer are shown separately, outside the chain.]
JR.S00 10
Why Such Change?
• Performance
– Technology Advances
» CMOS VLSI dominates older technologies (TTL, ECL) in
cost AND performance and is progressing rapidly
– Computer architecture advances improve the low end
» RISC, superscalar, RAID, …
• Price: Lower costs due to …
– Simpler development
» CMOS VLSI: smaller systems, fewer components
– Higher volumes
» CMOS VLSI: same device cost at 10,000 vs. 10,000,000 units
– Lower margins by class of computer, due to fewer services
• Function
– Rise of networking/local interconnection technology
JR.S00 11
Technology Trends: Microprocessor Capacity
[Figure: transistors per microprocessor (log scale, 1,000 to 100,000,000) versus year (1970 to 2000), tracking Moore's Law: i4004, i8080, i8086, i80286, i80386, i80486, Pentium, Sparc Ultra (5.2 million), Pentium Pro (5.5 million), PowerPC 620 (6.9 million), Alpha 21164 (9.3 million), Alpha 21264 (15 million).]
CMOS improvements:
• Die size: 2X every 3 years
• Line width: halves every 7 years
ISSCC 2000: 25M+ transistor processors (Intel)
JR.S00 12
Memory Capacity
(Single Chip DRAM)
[Figure: single-chip DRAM capacity in bits (log scale, 1,000 to 1,000,000,000) versus year (1970 to 2000).]
Year   Size (Mb)   Cycle time
1980   0.0625      250 ns
1983   0.25        220 ns
1986   1           190 ns
1989   4           165 ns
1992   16          145 ns
1996   64          120 ns
2000   256         100 ns
JR.S00 13
Technology Trends
(Summary)
        Capacity         Speed (latency)
Logic   2x in 3 years    2x in 3 years
DRAM    4x in 3 years    2x in 10 years
Disk    4x in 3 years    2x in 10 years
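The rates in the table compound, so their effect over a longer horizon is easy to misjudge. The sketch below is a minimal illustration, assuming each trend holds steadily; the function name `growth_factor` and the six-year horizon are choices made for this example and do not come from the slides.

```python
def growth_factor(factor: float, period_years: float, elapsed_years: float) -> float:
    """Total improvement after elapsed_years for a trend of `factor`x every `period_years`."""
    return factor ** (elapsed_years / period_years)


# Trends taken from the summary table: (factor, period in years).
TRENDS = {
    "logic capacity": (2, 3),
    "logic speed": (2, 3),
    "DRAM/disk capacity": (4, 3),
    "DRAM/disk speed": (2, 10),
}

if __name__ == "__main__":
    horizon = 6  # years; an illustrative horizon, not a value from the table
    for name, (factor, period) in TRENDS.items():
        total = growth_factor(factor, period, horizon)
        print(f"{name}: {total:.2f}x over {horizon} years")
```

Under these assumptions, DRAM and disk capacity grow 16x in six years while their latency improves only about 1.5x, which is the gap the rest of the course keeps returning to.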
JR.S00 14
Processor frequency trend
[Figure: processor clock frequency in MHz (log scale, 10 to 10,000) and gate delays per clock (log scale, 1 to 100) versus year (1987 to 2005) for Intel (386, 486, Pentium, Pentium Pro, Pentium II), DEC (21064A, 21066, 21164, 21164A, 21264, 21264S), and IBM PowerPC (601, 603, 604, 604+, MPC750) processors. Annotation: processor frequency scales by 2X per generation.]
• Frequency doubles each generation
• Number of gates/clock reduces by 25%
JR.S00 15
Processor Performance Trends
[Figure: relative performance (log scale, 0.1 to 1000) versus year (1965 to 2000) for Supercomputers, Mainframes, Minicomputers, and Microprocessors.]
JR.S00 16
Processor Performance
(1.35X before, 1.55X now)
[Figure: performance (0 to 1200) versus year (1987 to 1997) with a 1.54X/yr trend line. Machines plotted: Sun-4/260, MIPS M/120, MIPS M/2000, IBM RS/6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, DEC Alpha 21264/600.]
JR.S00 17
Performance Trends
(Summary)
• Workstation performance (measured in SPECmarks) improves roughly 50% per year (2X every 18 months)
• Improvement in cost-performance is estimated at 70% per year
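Annual improvement rates and doubling times are two views of the same compound growth. The sketch below converts between them, assuming a constant rate from year to year; the function names are invented for this illustration.

```python
import math


def doubling_time_years(annual_improvement: float) -> float:
    """Years needed to double at a constant annual improvement (0.5 means 50% per year)."""
    return math.log(2) / math.log(1 + annual_improvement)


def annual_rate_for_doubling(years: float) -> float:
    """Constant annual improvement that doubles performance every `years` years."""
    return 2 ** (1 / years) - 1


if __name__ == "__main__":
    print(f"50%/yr doubles every {doubling_time_years(0.5):.2f} years")
    print(f"70%/yr doubles every {doubling_time_years(0.7):.2f} years")
    print(f"2X every 1.5 years is {annual_rate_for_doubling(1.5):.0%} per year")
```

With a constant rate, 50% per year doubles in about 1.7 years, so the "2X every 18 months" figure is a rounded version of the same trend.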
JR.S00 18
A glimpse into the future
Silicon in 2010
Die area: 2.5 x 2.5 cm
Voltage: 0.6 - 0.9 V
Technology: 0.07 µm
Compared with today: 15 times denser, 2.5 times the power density, 5 times the clock rate

               Density (Gbits/cm2)   Access Time (ns)
DRAM           8.5                   10
DRAM (Logic)   2.5                   10
SRAM (Cache)   0.3                   1.5

                 Density (Mgates/cm2)   Max. Ave. Power (W/cm2)   Clock Rate (GHz)
Custom           25                     54                        3
Std. Cell        10                     27                        1.5
Gate Array       5                      18                        1
Single-Mask GA   2.5                    12.5                      0.7
FPGA             0.4                    4.5                       0.25
JR.S00 19
What is the next wave?
Source: Richard Newton
JR.S00 20
The Embedded Processor
What?
A programmable processor whose programming interface is not accessible to the end user of the product.
The only user interaction is through the actual application.
Examples:
- Sharp PDAs are encapsulated products with fixed functionality
- 3Com Palm Pilots were originally intended as embedded systems. Opening up the programmer's interface turned them into more generic computer systems.
JR.S00 21
Some interesting numbers
• The Intel 4004 was intended for an embedded application (a calculator)
• Of today's microprocessors
– 95% go into embedded applications
» SH3/4 (Hitachi): best-selling RISC microprocessor
– 50% of microprocessor revenue stems from embedded systems
• Often focused on a particular application area
– Microcontrollers
– DSPs
– Media Processors
– Graphics Processors
– Network and Communication Processors
JR.S00 22
Some different evaluation metrics
• Components of Cost
– Area of die / yield
– Code density (memory is the major part of die size)
– Packaging
– Design effort
– Programming cost
– Time-to-market
– Reusability
[Figure: design trade-off among Power, Cost, and Flexibility.]
Performance as a Functionality Constraint (“Just-in-Time Computing”)
JR.S00 23
The Secret of Architecture Design:
Measurement and Evaluation
Architecture design is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems
[Figure: design loop in which Creativity feeds Design, and Cost/Performance Analysis sorts the results into Good Ideas, Mediocre Ideas, and Bad Ideas.]
JR.S00 24
Computer Architecture Topics
• Input/Output and Storage
– Disks, WORM, Tape
– RAID
– Emerging Technologies
• Memory Hierarchy
– DRAM, L2 Cache, L1 Cache
– Interleaving, Bus protocols
– Coherence, Bandwidth, Latency
• VLSI
• Instruction Set Architecture
– Addressing, Protection, Exception Handling
• Pipelining and Instruction Level Parallelism
– Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, VLIW, DSP, Reconfiguration
JR.S00 25
Computer Architecture Topics
[Figure: processor-memory (P M) nodes connected through switches (S) to an Interconnection Network.]
• Processor-Memory-Switch
• Multiprocessors
– Shared Memory, Message Passing, Data Parallelism
• Networks and Interconnections
– Network Interfaces
– Topologies, Routing, Bandwidth, Latency, Reliability
JR.S00 26
Computer Engineering Methodology
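[Figure: design cycle with three phases labeled Analysis, Design, and Implementation. Evaluate Existing Systems for Bottlenecks (input: Benchmarks), then Simulate New Designs and Organizations (inputs: Technology Trends, Workloads), then Implement Next Generation System (input: Implementation Complexity), and back to evaluation.]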
JR.S00 27
Measurement Tools
• Hardware: Cost, delay, area, power estimation
• Benchmarks, Traces, Mixes
• Simulation (many levels)
– ISA, RT, Gate, Circuit
• Queuing Theory
• Rules of Thumb
• Fundamental “Laws”/Principles
JR.S00 28
Review:
Performance, Cost, Power
JR.S00 29
Metric 1: Performance
Plane        DC to Paris   Speed      Passengers   Throughput (passenger-miles/hour)
Boeing 747   6.5 hours     610 mph    470          286,700
Concorde     3 hours       1350 mph   132          178,200
• Time to run the task
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns …
– Throughput, bandwidth
JR.S00 30
The Performance Metric
"X is n times faster than Y" means
$$ n = \frac{\mathrm{ExTime}(Y)}{\mathrm{ExTime}(X)} = \frac{\mathrm{Performance}(X)}{\mathrm{Performance}(Y)} $$
• Speed of Concorde vs. Boeing 747
• Throughput of Boeing 747 vs. Concorde
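The two comparisons above can be worked out directly from the plane data. This is a small sketch using only the figures on the previous slide; the `Plane` class and `times_faster` helper are names made up for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plane:
    name: str
    trip_hours: float  # DC to Paris
    speed_mph: float
    passengers: int

    @property
    def throughput(self) -> float:
        """Passenger-miles per hour."""
        return self.passengers * self.speed_mph


def times_faster(perf_x: float, perf_y: float) -> float:
    """n such that 'X is n times faster than Y', given performance values."""
    return perf_x / perf_y


b747 = Plane("Boeing 747", 6.5, 610, 470)
concorde = Plane("Concorde", 3.0, 1350, 132)

if __name__ == "__main__":
    # Latency view: performance is speed (or 1 / trip time).
    print(f"Concorde speed vs. 747: {times_faster(concorde.speed_mph, b747.speed_mph):.2f}x")
    # Throughput view: performance is passenger-miles per hour.
    print(f"747 throughput vs. Concorde: {times_faster(b747.throughput, concorde.throughput):.2f}x")
```

Which plane is "faster" depends on the metric: the Concorde wins on latency by roughly 2.2x, while the 747 wins on throughput by roughly 1.6x.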
JR.S00 31
Amdahl's Law
Speedup due to enhancement E:
$$ \mathrm{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E} $$
Suppose that enhancement E accelerates a fraction F
of the task by a factor S, and the remainder of the
task is unaffected
JR.S00 32
Amdahl’s Law
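$$ \mathrm{ExTime}_{new} = \mathrm{ExTime}_{old} \times \left[ (1 - \mathrm{Fraction}_{enhanced}) + \frac{\mathrm{Fraction}_{enhanced}}{\mathrm{Speedup}_{enhanced}} \right] $$
$$ \mathrm{Speedup}_{overall} = \frac{\mathrm{ExTime}_{old}}{\mathrm{ExTime}_{new}} = \frac{1}{(1 - \mathrm{Fraction}_{enhanced}) + \dfrac{\mathrm{Fraction}_{enhanced}}{\mathrm{Speedup}_{enhanced}}} $$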
JR.S00 33
Amdahl’s Law
• Floating point instructions improved to run 2X faster, but only 10% of actual instructions are FP
$$ \mathrm{ExTime}_{new} = \mathrm{ExTime}_{old} \times (0.9 + 0.1/2) = 0.95 \times \mathrm{ExTime}_{old} $$
$$ \mathrm{Speedup}_{overall} = \frac{1}{0.95} = 1.053 $$
Law of diminishing returns:
Focus on the common case!
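The floating-point example is easy to check numerically. The sketch below is one way to code Amdahl's Law as stated on the previous slide; the function name and the input validation are additions for this illustration.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when `fraction_enhanced` of the task runs `speedup_enhanced` times faster."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


if __name__ == "__main__":
    # The slide's example: 10% of instructions are FP, made 2X faster.
    print(f"FP 2X faster: {amdahl_speedup(0.10, 2):.3f}")
    # Even an enormous FP speedup is capped by the untouched 90%.
    print(f"FP 1000X faster: {amdahl_speedup(0.10, 1000):.3f}")
```

Pushing the FP speedup toward infinity only approaches 1/0.9, about 1.11, which is the diminishing return the slide warns about.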
JR.S00 34
Metrics of Performance
[Figure: levels of a computer system, each paired with the performance metric natural to it.]
• Application: answers per month, operations per second
• Programming Language, Compiler, ISA: (millions) of instructions per second (MIPS), (millions) of (FP) operations per second (MFLOP/s)
• Datapath, Control: megabytes per second
• Function Units, Transistors, Wires, Pins: cycles per second (clock rate)
JR.S00 35
Aspects of CPU Performance
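$$ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} $$

               Inst Count   CPI   Clock Rate
Program        X
Compiler       X            (X)
Inst. Set      X            X
Organization                X     X
Technology                        X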
JR.S00 36
Cycles Per Instruction
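“Average Cycles per Instruction”
$$ \mathrm{CPI} = \frac{\text{Cycles}}{\text{Instruction Count}} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} $$
$$ \text{CPU time} = \text{CycleTime} \times \sum_{i=1}^{n} \mathrm{CPI}_i \times I_i $$
“Instruction Frequency”
$$ \mathrm{CPI} = \sum_{i=1}^{n} \mathrm{CPI}_i \times F_i \quad \text{where} \quad F_i = \frac{I_i}{\text{Instruction Count}} $$
Invest resources where time is spent!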
JR.S00 37
Example: Calculating CPI
Base Machine (Reg / Reg), typical mix
Op       Freq   CPI_i   CPI_i*F_i   (% Time)
ALU      50%    1       0.5         (33%)
Load     20%    2       0.4         (27%)
Store    10%    2       0.2         (13%)
Branch   20%    2       0.4         (27%)
Total                   1.5
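The table can be reproduced in a few lines, which also makes it easy to try other mixes. This is a minimal sketch using the slide's frequencies and per-class CPIs; the function names, and the instruction count and clock rate in the usage lines, are hypothetical values chosen for the illustration.

```python
# Instruction mix from the slide: op -> (frequency F_i, cycles CPI_i).
MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}


def average_cpi(mix: dict) -> float:
    """CPI = sum over instruction classes of CPI_i * F_i."""
    return sum(freq * cpi for freq, cpi in mix.values())


def time_share(mix: dict) -> dict:
    """Fraction of execution time spent in each instruction class."""
    total = average_cpi(mix)
    return {op: freq * cpi / total for op, (freq, cpi) in mix.items()}


def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = instruction count * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


if __name__ == "__main__":
    print(f"CPI = {average_cpi(MIX):.2f}")
    for op, share in time_share(MIX).items():
        print(f"{op}: {share:.0%} of time")
    # Hypothetical program: 1e9 instructions on a 500 MHz clock.
    print(f"CPU time = {cpu_time(1e9, average_cpi(MIX), 500e6):.2f} s")
```

Note that ALU operations are half of the instructions but only a third of the time, so frequency alone does not show where the time goes.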
JR.S00 38
Creating Benchmark Sets
• Real programs
• Kernels
• Toy benchmarks
• Synthetic benchmarks
– e.g. Whetstones and Dhrystones
JR.S00 39
SPEC: System Performance Evaluation
Cooperative
• First Round 1989
– 10 programs yielding a single number (“SPECmarks”)
• Second Round 1992
– SPECint92 (6 integer programs) and SPECfp92 (14 floating point programs)
» Compiler flags unlimited. March 93 flags for the DEC 4000 Model 610:
spice: unix.c:/def=(sysv,has_bcopy,"bcopy(a,b,c)=memcpy(b,a,c)"
wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
• Third Round 1995
– new set of programs: SPECint95 (8 integer programs) and
SPECfp95 (10 floating point)
– “benchmarks useful for 3 years”
– Single flag setting for all programs: SPECint_base95,
SPECfp_base95
JR.S00 40
How to Summarize Performance
• Arithmetic mean (weighted arithmetic mean) tracks execution time: $\frac{1}{n}\sum_i T_i$ or $\sum_i W_i T_i$
• Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: $n / \sum_i (1/R_i)$ or $1 / \sum_i (W_i/R_i)$, with weights $W_i$ that sum to 1
• Normalized execution time is handy for scaling performance (e.g., X times faster than SPARCstation 10)
– Arithmetic mean is affected by the choice of reference machine
• Use the geometric mean for comparison: $\left(\prod_i T_i\right)^{1/n}$
– Independent of the chosen reference machine
– But not a good metric for total execution time
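The dependence on the reference machine is easiest to see with numbers. The sketch below uses made-up execution times for two programs on two hypothetical machines, A and B (none of this data is from the slides), and shows the arithmetic mean of normalized times flipping its verdict while the geometric mean does not.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def harmonic_mean(rates):
    """Unweighted harmonic mean, appropriate for rates such as MFLOPS."""
    return len(rates) / sum(1.0 / r for r in rates)


def geometric_mean(values):
    return math.prod(values) ** (1.0 / len(values))


def normalize(times, reference_times):
    """Execution times expressed relative to a reference machine."""
    return [t / r for t, r in zip(times, reference_times)]


# Hypothetical execution times in seconds for programs P1 and P2.
TIMES_A = [1.0, 1000.0]
TIMES_B = [10.0, 100.0]

if __name__ == "__main__":
    for ref_name, ref in (("A", TIMES_A), ("B", TIMES_B)):
        for name, times in (("A", TIMES_A), ("B", TIMES_B)):
            norm = normalize(times, ref)
            print(
                f"ref={ref_name} machine={name}: "
                f"arith={arithmetic_mean(norm):.2f} geo={geometric_mean(norm):.2f}"
            )
```

Normalized to A, machine B looks about 5 times slower by arithmetic mean; normalized to B, machine A looks about 5 times slower. The geometric mean reports 1.0 both ways, yet the total execution times (1001 s vs. 110 s) are far from equal, which is why it is a poor proxy for total time.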
JR.S00 41
SPEC First Round
• One program: 99% of time in single line of code
• New front-end compiler could improve dramatically
[Figure: SPEC performance (0 to 800) per benchmark (gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv) on the IBM Powerstation 550 for 2 different compilers.]
JR.S00 42
Impact of Means
on SPECmark89 for IBM 550
(without and with special compiler option)
Program      Ratio to VAX      Time              Weighted Time
             Before   After    Before   After    Before   After
gcc          30       29       49       51       8.91     9.22
espresso     35       34       65       67       7.64     7.86
spice        47       47       510      510      5.69     5.69
doduc        46       49       41       38       5.81     5.45
nasa7        78       144      258      140      3.43     1.86
li           34       34       183      183      7.86     7.86
eqntott      40       40       28       28       6.68     6.68
matrix300    78       730      58       6        3.43     0.37
fpppp        90       87       34       35       2.97     3.07
tomcatv      33       138      20       19       2.01     1.94
Mean         54       72       124      108      54.42    49.99
             Geometric         Arithmetic        Weighted Arith.
Ratio        1.33              1.16              1.09
JR.S00 43
Performance Evaluation
• “For better or worse, benchmarks shape a field”
• Good products are created when we have:
– Good benchmarks
– Good ways to summarize performance
• Since sales are in part a function of performance relative to the competition, companies invest in improving the product as reported by the performance summary
• If the benchmarks or summary are inadequate, then companies must choose between improving the product for real programs and improving the product to get more sales;
sales almost always wins!
• Execution time is the measure of computer
performance!
JR.S00 44
Integrated Circuits Costs
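$$ \text{IC cost} = \frac{\text{Die cost} + \text{Testing cost} + \text{Packaging cost}}{\text{Final test yield}} $$
$$ \text{Die cost} = \frac{\text{Wafer cost}}{\text{Dies per wafer} \times \text{Die yield}} $$
$$ \text{Dies per wafer} = \frac{\pi \times (\text{Wafer\_diam}/2)^2}{\text{Die\_Area}} - \frac{\pi \times \text{Wafer\_diam}}{\sqrt{2 \times \text{Die\_Area}}} - \text{Test\_Die} $$
$$ \text{Die yield} = \text{Wafer\_yield} \times \left( 1 + \frac{\text{Defect\_Density} \times \text{Die\_Area}}{\alpha} \right)^{-\alpha} $$
Die cost goes roughly with $(\text{die area})^4$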
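The cost model can be exercised with a short script. The sketch below follows the formulas as given in the Hennessy and Patterson textbook; the value α = 3 is the textbook's typical setting and an assumption here, and the wafer diameter, defect density, and wafer cost in the usage lines are hypothetical inputs, not data from the slides.

```python
import math


def dies_per_wafer(wafer_diam_cm: float, die_area_cm2: float, test_dies: int = 0) -> int:
    """Whole dies per wafer: wafer area over die area, minus edge loss and test dies."""
    gross = math.pi * (wafer_diam_cm / 2) ** 2 / die_area_cm2
    edge_loss = math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2)
    return math.floor(gross - edge_loss - test_dies)


def die_yield(wafer_yield: float, defects_per_cm2: float, die_area_cm2: float,
              alpha: float = 3.0) -> float:
    """Fraction of good dies; alpha = 3.0 is an assumed process-complexity parameter."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)


def die_cost(wafer_cost: float, dies: int, yield_fraction: float) -> float:
    return wafer_cost / (dies * yield_fraction)


def ic_cost(die: float, testing: float, packaging: float, final_test_yield: float) -> float:
    return (die + testing + packaging) / final_test_yield


if __name__ == "__main__":
    # Hypothetical inputs: 20 cm wafer, 1 cm^2 die, 0.8 defects/cm^2, $1000 wafer.
    n = dies_per_wafer(20, 1.0)
    y = die_yield(1.0, 0.8, 1.0)
    print(f"dies per wafer = {n}, die yield = {y:.2f}, die cost = ${die_cost(1000, n, y):.2f}")
```

Because larger dies reduce both the die count and the yield, cost climbs much faster than area, consistent with the rough fourth-power rule above.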
JR.S00 45
Real World Examples
Chip          Metal    Line width   Wafer   Defects   Area    Dies/   Yield   Die cost
              layers   (µm)         cost    /cm2      (mm2)   wafer
386DX         2        0.90         $900    1.0       43      360     71%     $4
486DX2        3        0.80         $1200   1.0       81      181     54%     $12
PowerPC 601   4        0.80         $1700   1.3       121     115     28%     $53
HP PA 7100    3        0.80         $1300   1.0       196     66      27%     $73
DEC Alpha     3        0.70         $1500   1.2       234     53      19%     $149
SuperSPARC    3        0.70         $1700   1.6       256     48      13%     $272
Pentium       3        0.80         $1500   1.5       296     40      9%      $417
– From “Estimating IC Manufacturing Costs,” by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15
JR.S00 46
Cost/Performance
What is Relationship of Cost to Price?
• Recurring Costs
– Component Costs
– Direct Costs (add 25% to 40%): recurring costs such as labor, purchasing, scrap, warranty
• Non-Recurring Costs or Gross Margin (add 82% to 186%): R&D, equipment maintenance, rental, marketing, sales, financing cost, pretax profits, taxes
• Average Discount to get List Price (add 33% to 66%): volume discounts and/or retailer markup
[Figure: list price broken down into Component Cost (15% to 33%), Direct Cost (6% to 8%), Gross Margin (34% to 39%), and Average Discount (25% to 40%); the Average Selling Price is the list price less the average discount.]
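The "add X%" markups and the percentage breakdown in the figure describe the same structure from two directions. The sketch below assumes each markup is applied to the running total of the previous steps, which is an interpretation that happens to reproduce the figure's ranges; the function name and the $100 component cost are illustrative.

```python
def price_breakdown(component_cost: float, direct_markup: float,
                    gross_margin_markup: float, discount_markup: float) -> dict:
    """Build list price from component cost by stacking the three markups in turn."""
    direct = component_cost * direct_markup
    margin = (component_cost + direct) * gross_margin_markup
    avg_selling_price = component_cost + direct + margin
    discount = avg_selling_price * discount_markup
    return {
        "component cost": component_cost,
        "direct cost": direct,
        "gross margin": margin,
        "average discount": discount,
        "average selling price": avg_selling_price,
        "list price": avg_selling_price + discount,
    }


def shares_of_list_price(breakdown: dict) -> dict:
    """Each layer as a fraction of list price."""
    layers = ("component cost", "direct cost", "gross margin", "average discount")
    return {k: breakdown[k] / breakdown["list price"] for k in layers}


if __name__ == "__main__":
    # Low and high ends of the markup ranges on the slide, for a $100 bill of materials.
    for label, markups in (("low", (0.25, 0.82, 0.33)), ("high", (0.40, 1.86, 0.66))):
        b = price_breakdown(100.0, *markups)
        shares = ", ".join(f"{k} {v:.0%}" for k, v in shares_of_list_price(b).items())
        print(f"{label}: list price ${b['list price']:.2f} ({shares})")
```

Under this reading, component cost ends up between roughly 15% and 33% of list price, matching the range shown in the figure.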
JR.S00 47
Chip Prices (August 1993)
• Assume purchase 10,000 units
Chip          Area (mm2)   Mfg. cost   Price   Multiplier   Comment
386DX         43           $9          $31     3.4          Intense Competition
486DX2        81           $35         $245    7.0          No Competition
PowerPC 601   121          $77         $280    3.6
DEC Alpha     234          $202        $1231   6.1          Recoup R&D?
Pentium       296          $473        $965    2.0          Early in shipments
JR.S00 48
Summary: Price vs. Cost
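[Figure: two charts for minicomputers (Mini), workstations (W/S), and PCs. The first splits list price by percentage (0% to 100%) into Component Costs, Direct Costs, Gross Margin, and Average Discount. The second shows the same four layers on a scale of 0 to 5; data labels on the chart: 4.7, 3.8, 3.5, 2.5, 1.8, 1.5.]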
JR.S00 49
Power/Energy
[Figure: maximum power in watts (log scale, 1 to 100) by processor generation: 386, 486, Pentium, Pentium MMX, Pentium Pro, Pentium II, with a “?” for the next generation. Source: Intel.]
• Lead processor power increases every generation
• Compactions provide higher performance at lower power
JR.S00 50
Energy/Power
• Power dissipation: rate at which energy is
taken from the supply (power source) and
transformed into heat
P = E/t
• Energy dissipation for a given instruction
depends upon type of instruction (and state
of the processor)
$$ P = \frac{1}{\text{CPU Time}} \times \sum_{i=1}^{n} E_i \times I_i $$
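The power expression mirrors the CPI sum: weight each instruction class by its energy instead of its cycle count. The sketch below reads E_i as the energy per instruction of class i and I_i as the number of such instructions executed; the per-instruction energies, counts, and CPU time are invented numbers for illustration only.

```python
def total_energy_joules(energy_per_instr: dict, instr_counts: dict) -> float:
    """Sum over instruction classes of E_i * I_i."""
    return sum(energy_per_instr[op] * count for op, count in instr_counts.items())


def average_power_watts(energy_per_instr: dict, instr_counts: dict, cpu_time_s: float) -> float:
    """P = (1 / CPU time) * sum of E_i * I_i."""
    return total_energy_joules(energy_per_instr, instr_counts) / cpu_time_s


# Hypothetical energy per instruction (joules) and dynamic instruction counts.
ENERGY = {"ALU": 1e-9, "Load": 2e-9}
COUNTS = {"ALU": 5e8, "Load": 2e8}

if __name__ == "__main__":
    cpu_time_s = 0.3  # hypothetical run time
    print(f"energy = {total_energy_joules(ENERGY, COUNTS):.2f} J")
    print(f"average power = {average_power_watts(ENERGY, COUNTS, cpu_time_s):.2f} W")
```

Finishing the same work in less time leaves the energy unchanged but raises the average power, which is why the two metrics are tracked separately.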
JR.S00 51
Summary, #1
• Designing to Last through Trends
               Capacity         Speed
Logic          2x in 3 years    2x in 3 years
SPEC rating                     2x in 1.5 years
DRAM           4x in 3 years    2x in 10 years
Disk           4x in 3 years    2x in 10 years
• 6 years to graduate => 16X CPU speed, DRAM/disk size
• Time to run the task
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns, …
– Throughput, bandwidth
• “X is n times faster than Y” means
$$ n = \frac{\mathrm{ExTime}(Y)}{\mathrm{ExTime}(X)} = \frac{\mathrm{Performance}(X)}{\mathrm{Performance}(Y)} $$
JR.S00 52
Summary, #2
• Amdahl’s Law:
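$$ \mathrm{Speedup}_{overall} = \frac{\mathrm{ExTime}_{old}}{\mathrm{ExTime}_{new}} = \frac{1}{(1 - \mathrm{Fraction}_{enhanced}) + \dfrac{\mathrm{Fraction}_{enhanced}}{\mathrm{Speedup}_{enhanced}}} $$
• CPI Law:
$$ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} $$
• Execution time is the REAL measure of computer performance!
• Good products are created when we have:
– Good benchmarks, good ways to summarize performance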
• Different set of metrics apply to embedded
systems
JR.S00 53