Introduction to computer architecture


Chapter 1
Introduction to Computer Architecture
Computer Architecture (COE 501)
Course Objective

The course objective is to gain the knowledge required to design and
analyze high-performance computer systems.

[Diagram: Computer Architecture — Instruction Set Design, Machine
Organization, Hardware — at the intersection of Technology, Parallelism,
Programming Languages, Applications, Operating Systems, Interface
Design, and History.]
Topic Coverage

Textbook: Hennessy and Patterson, Computer Architecture: A Quantitative
Approach, 3rd Ed., 2004.

• Fundamentals of Computer Architecture (Chapter 1)
• Instruction Set Architecture (Chapter 2, Appendix D, Reading)
• ILP and Dynamic Exploitation (Chapter 3)
• Exploiting ILP with Software Approaches (Chapter 4)
• Memory Hierarchy (Chapter 5)
• Multiprocessors (Chapter 6)
• Networks and Interconnection Technology (Overview, Chapter 7)
• Clusters (Overview, Chapter 8)

Plus, several computer architecture research papers.
Computer Architecture Topics

[Diagram of topics, from storage up to the processor:]
• Input/Output and Storage: disks and tape, RAID, DRAM, emerging
technologies
• Memory Hierarchy: L1 cache, L2 cache, VLSI; interleaving, coherence,
bandwidth, latency; cache design (block size, associativity)
• Instruction Set Architecture: addressing modes, formats
• Processor Design: pipelining, hazard resolution, superscalar,
reordering, ILP, branch prediction, speculation
Computer Architecture Topics

Networks, Interconnections, and Multiprocessors

[Diagram: processor–memory (P–M) nodes connected by an interconnection
network — shared memory or message passing.]
• Network interfaces
• Topologies, routing, bandwidth, latency, reliability
Context for Designing New Architectures
• Application Area
– Special Purpose / General Purpose
– Scientific (FP intensive) / Commercial (Integer Intensive)
• Level of Software Compatibility Required
– Object Code / Binary Compatible
– Assembly Language
– Programming Language
Context for Designing New Architectures

• Operating System Requirements
– Size of Address Space
– Memory Management / Protection
– Context Switch Capability
– Interrupts and Traps
• Standards:
– IEEE 754 Floating Point
– I/O Bus
– Networks
• Technology Improvements
– Increased Processor Performance
– Larger Memory and I/O devices
– Software/Compiler Innovations
Technology Trends
• Functionality Enhancements
– Improvements in networking technology
– Increase in communication and multimedia support
– Support for larger programs
• Performance
– Technology Advances
» Decreases in feature size, increase in wafer size, lower
voltages
– Computer architecture advances improve the low end
» RISC, superscalar, pipelining, …
• Price: Lower costs due to …
– Simpler architectures
– Higher volumes
– More reliable systems
– Improved technology
Technology Trends: Microprocessor Capacity

[Chart: transistors per chip, 1970–2000 (log scale), from the i4004,
i8080, i8086, i80286, i80386, and i80486 through the Pentium and
beyond — Sparc Ultra: 5.2 million, Pentium Pro: 5.5 million,
PowerPC 620: 6.9 million, Alpha 21164: 9.3 million,
Alpha 21264: 15 million.]

CMOS improvements:
• Die size: 2X every 3 years
• Line width: halves every 7 years
Memory Capacity (Single-Chip DRAM)

[Chart: bits per DRAM chip, 1970–2000 (log scale).]

Moore’s Law for Memory:
Memory capacity increases by 4x every 3 years.

Year   Size     Cycle time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1996   64 Mb    125 ns
2000   256 Mb   100 ns
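The 4x-every-3-years trend in the table can be checked with a small
compound-growth sketch (the function name and base year are
illustrative, not from the slides):

```python
def dram_capacity_kb(year, base_year=1980, base_kb=64):
    """Project DRAM capacity (in Kb) assuming 4x growth every 3 years."""
    return base_kb * 4 ** ((year - base_year) / 3)

# 1992 is four 3-year generations after 1980:
print(dram_capacity_kb(1992))  # -> 16384.0 Kb, i.e. 16 Mb, matching the table
```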
Technology Improvements

       Capacity        Speed
Logic  2x in 3 years   2x in 3 years
DRAM   4x in 3 years   1.4x in 10 years
Disk   2x in 3 years   1.4x in 10 years
Speed increases of memory and I/O have not
kept pace with processor speed increases.
Why does DRAM capacity increase faster than
logic capacity?
Processor Performance
(1.35X per year before, 1.55X per year now)

[Chart: integer performance, 1987–1997, for the Sun-4/260, MIPS M/120,
MIPS M/2000, IBM RS/6000, HP 9000/750, DEC AXP/500, IBM POWER 100,
DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, and
DEC Alpha 21264/600.]
Processor Frequency Trend

[Chart, 1987–2005: clock frequency (MHz, log scale) and gate delays per
clock for Intel (386, 486, Pentium, Pentium Pro, Pentium II), DEC Alpha
(21064A, 21066, 21164, 21164A, 21264, 21264S), and IBM PowerPC (601,
603, 604, 604+, MPC750) processors. Processor frequency scales by 2X
per generation.]

• Frequency doubles each generation
• Number of gates/clock reduces by 25%
Processor Performance Trends

[Chart: relative performance, 1965–2000 (log scale), for
supercomputers, mainframes, minicomputers, and microprocessors.]
1988 Computer Food Chain

[Diagram: PC, Workstation, Minicomputer, Mainframe, Minisupercomputer,
Supercomputer, Massively Parallel Processors.]

Current Computer Food Chain

[Diagram: PDA, PC, Workstation, Server, Mainframe, Supercomputer — the
Minicomputer, Minisupercomputer, and Massively Parallel Processors have
disappeared.]
Processor Perspective

• Putting performance growth in perspective:

         Pentium 3 (Coppermine)   Cray YMP
Type     Desktop                  Supercomputer
Year     2000                     1988
Clock    1130 MHz                 167 MHz
MIPS     > 1000 MIPS              < 50 MIPS
Cost     $2,000                   $1,000,000
Cache    256 KB                   0.25 KB
Memory   512 MB                   256 MB
Where Has This Performance Improvement Come From?
• Technology
– More transistors per chip
– Faster logic
• Machine Organization/Implementation
– Deeper pipelines
– More instructions executed in parallel
• Instruction Set Architecture
– Reduced Instruction Set Computers (RISC)
– Multimedia extensions
– Explicit parallelism
• Compiler technology
– Finding more parallelism in code
– Greater levels of optimization
What is Ahead?

• Greater instruction level parallelism
• Bigger caches, and more levels of cache
• Multiple processors per chip
• Complete systems on a chip
• High performance interconnect
Current Computers
• Technology
– Very large dynamic RAM: 10 GB and beyond
– Large fast static RAM: 500 MB, 1-2ns
– Very large disks: Approaching 300 GB
• Complete systems on a chip
– 50+ Million Transistors
• Parallelism
– Superscalar, VLIW
– Superpipelining
– Explicitly parallel
– Multiprocessors
– Distributed systems
Current Computers
• Low Power
– 60% of PCs portable
– Performance per watt is now of interest
• Parallel I/O
– Many applications are I/O limited, not computation limited
– Processor speeds increase faster than memory and I/O speeds
• Multimedia
– New interface technologies
– Video, speech, handwriting, virtual reality, …
• Embedded systems extremely important
– 90% of computers manufactured and 50% of processor
revenue is in the embedded market (e.g., microcontrollers,
DSPs, graphics processors, etc.)
Hardware Technology

              1980      1990       Current
Memory chips  64 KB     4 MB       2 GB
Clock rate    1-2 MHz   20-40 MHz  3-6 GHz
Hard disks    40 MB     1 GB       300 GB
Floppies      0.256 MB  1.5 MB     1-4 GB
Computing in the 21st Century
• Continue quadrupling memory about every 3 years
• Single-chip multiprocessor systems
• High-speed communication networks
• These improvements will create the need for new
and innovative computer systems.
Measurement and Evaluation
Architecture is an iterative process:
• Search the possible design space
• Make selections
• Evaluate the selections made
Good measurement tools are required
to accurately evaluate the selection.
[Diagram: cost/performance analysis filters good, mediocre, and bad
ideas.]
Measurement Tools

• Benchmarks
• Simulation (many levels)
– ISA, RTL, Gate, Transistor
• Cost, Delay, and Area Estimates
• Queuing Theory
• Rules of Thumb
• Fundamental Laws
The Bottom Line: Performance (and Cost)

Plane             DC to Paris  Speed     Passengers  Throughput (pmph)
Boeing 747        6.5 hours    610 mph   470         286,700
BAD/Sud Concorde  3 hours      1350 mph  132         178,200

• Time to run the task (Execution Time)
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns … (Performance)
– Throughput, bandwidth
Performance and Execution Time

Execution time and performance are reciprocals:

ExTime(Y)   Performance(X)
--------- = --------------
ExTime(X)   Performance(Y)

• Speed of Concorde vs. Boeing 747
• Throughput of Boeing 747 vs. Concorde
Performance Terminology

“X is n% faster than Y” means:

ExTime(Y)   Performance(X)        n
--------- = -------------- = 1 + ---
ExTime(X)   Performance(Y)       100

n = 100 x (Performance(X) - Performance(Y)) / Performance(Y)

n = 100 x (ExTime(Y) - ExTime(X)) / ExTime(X)

Example: Y takes 15 seconds to complete a task, X takes 10 seconds.
What % faster is X?
Example

Example: Y takes 15 seconds to complete a task, X takes 10 seconds.
What % faster is X?

n = 100 x (ExTime(Y) - ExTime(X)) / ExTime(X)
n = 100 x (15 - 10) / 10
n = 50%
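The n% formula above is easy to encode directly; a minimal sketch (the
function name is illustrative):

```python
def percent_faster(extime_y, extime_x):
    """Return n such that 'X is n% faster than Y', given execution times."""
    return 100.0 * (extime_y - extime_x) / extime_x

# The example from the slide: Y takes 15 s, X takes 10 s.
print(percent_faster(15.0, 10.0))  # -> 50.0
```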
Amdahl's Law

Speedup due to enhancement E:

             ExTime w/o E    Performance w/ E
Speedup(E) = ------------- = ------------------
             ExTime w/ E     Performance w/o E

Suppose that enhancement E accelerates a fraction Fraction_enhanced of
the task by a factor Speedup_enhanced, and the remainder of the task is
unaffected. What are the new execution time and the overall speedup due
to the enhancement?
Amdahl’s Law

ExTime_new = ExTime_old x [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Speedup_overall = ExTime_old / ExTime_new
                = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
Example of Amdahl’s Law

• Floating point instructions improved to run 2X; but only 10% of the
time was spent on these instructions.

ExTime_new =

Speedup_overall =
Example of Amdahl’s Law

• Floating point instructions improved to run 2X; but only 10% of the
time was spent on these instructions.

ExTime_new = ExTime_old x (0.9 + 0.1/2) = 0.95 x ExTime_old

Speedup_overall = ExTime_old / ExTime_new = 1 / 0.95 = 1.053

• The new machine is 5.3% faster for this mix of instructions.
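The worked example generalizes to any enhancement; a small sketch of
Amdahl's Law (the function name is illustrative):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when fraction_enhanced of the old execution time
    is sped up by speedup_enhanced and the rest is unaffected."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# The FP example: 2X faster, but only 10% of the time.
print(round(amdahl_speedup(0.10, 2.0), 3))  # -> 1.053
```

Note how little the 2X enhancement helps overall: with only 10% of the
time affected, even an infinitely fast FP unit could not exceed a 1/0.9
= 1.11X speedup.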
Make The
Common Case Fast
• All instructions require an instruction fetch,
only a fraction require a data fetch/store.
– Optimize instruction access over data access
• Programs exhibit locality
- 90% of time in 10% of code
- Temporal Locality (items referenced recently)
- Spatial Locality (items referenced nearby)
• Access to small memories is faster
– Provide a storage hierarchy such that the most frequent
accesses are to the smallest (closest) memories.
[Diagram: storage hierarchy — Registers → Cache → Memory → Disk / Tape.]
Hardware/Software Partitioning
• The simple case is usually the most frequent and
the easiest to optimize!
• Do simple, fast things in hardware and be sure the
rest can be handled correctly in software.
Would you handle these in hardware or software:
• Integer addition?
• Accessing data from disk?
• Floating point square root?
Performance Factors

CPU time = Seconds / Program
         = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

• The number of instructions/program is called the instruction count
(IC).
• The average number of cycles per instruction is called the CPI.
• The number of seconds per cycle is the clock period.
• The clock rate is the multiplicative inverse of the clock period and
is given in cycles per second (or MHz).
• For example, if a processor has a clock period of 5 ns, what is its
clock rate? (1 / 5 ns = 200 MHz.)
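The three factors multiply directly; a quick sketch with made-up
example numbers (the function name and values are illustrative):

```python
def cpu_time_s(instruction_count, cpi, clock_period_s):
    """CPU time = IC x CPI x clock period (seconds)."""
    return instruction_count * cpi * clock_period_s

period = 5e-9              # the 5 ns clock period from the question
clock_rate = 1.0 / period  # 2e8 cycles/s = 200 MHz

# Illustrative workload: 1 billion instructions at an average CPI of 2.
print(cpu_time_s(1e9, 2.0, period))  # -> 10.0 seconds
```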
Aspects of CPU Performance

CPU time = Seconds / Program
         = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

              Instr. Cnt   CPI   Clock Rate
Program       X            X
Compiler      X            X
Instr. Set    X            X
Organization               X     X
Technology                       X
Marketing Metrics

MIPS = Instruction Count / (Time x 10^6) = Clock Rate / (CPI x 10^6)
• Not effective for machines with different instruction sets
• Not effective for programs with different instruction mixes
• Uncorrelated with performance

MFLOPS = FP Operations / (Time x 10^6)
• Machine dependent
• Often not where time is spent
• Peak - maximum able to achieve
• Native - average for a set of benchmarks
• Relative - compared to another platform

Normalized MFLOPS weights:
add, sub, compare, mult   1
divide, sqrt              4
exp, sin, …               8
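Both metrics are simple ratios; a sketch using the weight table above
(the function names and the operation-count dictionary are
illustrative):

```python
def mips(instruction_count, time_s):
    """Millions of instructions per second."""
    return instruction_count / (time_s * 1e6)

# Cost classes from the normalized-MFLOPS weight table above.
FLOP_WEIGHTS = {"add": 1, "sub": 1, "compare": 1, "mult": 1,
                "divide": 4, "sqrt": 4, "exp": 8, "sin": 8}

def normalized_mflops(op_counts, time_s):
    """Normalized MFLOPS: weight each FP operation by its cost class."""
    total = sum(FLOP_WEIGHTS[op] * count for op, count in op_counts.items())
    return total / (time_s * 1e6)

# 100 million instructions in 1 second:
print(mips(1e8, 1.0))  # -> 100.0
```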
Programs to Evaluate Processor Performance
• (Toy) Benchmarks
– 10-100 line program
– e.g.: sieve, puzzle, quicksort
• Synthetic Benchmarks
– Attempt to match average frequencies of real workloads
– e.g., Whetstone, Dhrystone
• Kernels
– Time critical excerpts
• Real Benchmarks
Benchmarks

• Benchmark mistakes
– Only average behavior represented in test workload
– Loading level controlled inappropriately
– Caching effects ignored
– Buffer sizes not appropriate
– Ignoring monitoring overhead
– Not ensuring same initial conditions
– Collecting too much data but doing too little analysis
• Benchmark tricks
– Compiler wired to optimize the workload
– Very small benchmarks used
– Benchmarks manually translated to optimize performance
SPEC: System Performance Evaluation Cooperative
• First Round SPEC CPU89
– 10 programs yielding a single number
• Second Round SPEC CPU92
– SPEC CINT92 (6 integer programs) and SPEC CFP92 (14 floating
point programs)
– Compiler flags can be set differently for different programs
• Third Round SPEC CPU95
– new set of programs: SPEC CINT95 (8 integer programs) and
SPEC CFP95 (10 floating point)
– Single flag setting for all programs
• Fourth Round SPEC CPU2000
– new set of programs: SPEC CINT2000 (12 integer programs) and
SPEC CFP2000 (14 floating point)
– Single flag setting for all programs
– Programs in C, C++, Fortran 77, and Fortran 90
SPEC 2000

• 12 integer programs:
– 2 Compression
– 2 Circuit Placement and Routing
– C Programming Language Compiler
– Combinatorial Optimization
– Chess, Word Processing
– Computer Visualization
– PERL Programming Language
– Group Theory Interpreter
– Object-oriented Database
– Written in C (11) and C++ (1)

• 14 floating point programs:
– Quantum Physics
– Shallow Water Modeling
– Multi-grid Solver
– 3D Potential Field
– Parabolic / Elliptic PDEs
– 3-D Graphics Library
– Computational Fluid Dynamics
– Image Recognition
– Seismic Wave Simulation
– Image Processing
– Computational Chemistry
– Number Theory / Primality Testing
– Finite-element Crash Simulation
– High Energy Nuclear Physics
– Pollutant Distribution
– Written in Fortran (10) and C (4)
Other SPEC Benchmarks
• JVM98:
– Measures performance of Java Virtual Machines
• SFS97:
– Measures performance of network file server (NFS)
protocols
• Web99:
– Measures performance of World Wide Web applications
• HPC96:
– Measures performance of large, industrial applications
• APC, MEDIA, OPC
– Measure performance of graphics applications
• For more information about the SPEC
benchmarks see: http://www.spec.org.
Conclusions on Performance
• A fundamental rule in computer architecture
is to make the common case fast.
• The most accurate measure of performance is
the execution time of representative real
programs (benchmarks).
• Execution time is dependent on the number
of instructions per program, the number of
cycles per instruction, and the clock rate.
• When designing computer systems, both cost
and performance need to be taken into
account.