CS775: Computer Architecture - Computer Science and Engineering

Download Report

Transcript CS775: Computer Architecture - Computer Science and Engineering

CIS775: Computer Architecture
Chapter 1: Fundamentals of
Computer Design
1
Course Objectives
• To evaluate the issues involved in choosing and
designing instruction set.
• To learn concepts behind advanced pipelining
techniques.
• To understand the “hitting the memory wall”
problem and the current state-of-art in memory
system design.
• To understand the qualitative and quantitative
tradeoffs in the design of modern computer
systems
2
What is Computer Architecture?
• Functional operation of the individual HW
units within a computer system, and the
flow of information and control among
them.
Technology
Parallelism
Computer
Hardware Organization Architecture:
Measurement &
Evaluation
Programming
Language
Interface
Interface Design
(ISA)
Applications
OS
3
Computer Architecture Topics
Input/Output and Storage
Disks, WORM, Tape
Emerging Technologies
Interleaving Memories
DRAM
Memory
Hierarchy
Coherence,
Bandwidth,
Latency
L2 Cache
L1 Cache
VLSI
Instruction Set Architecture
RAID
Addressing,
Protection,
Exception Handling
Pipelining, Hazard Resolution,
Superscalar, Reordering,
Prediction, Speculation,
Vector, DSP
Pipelining and Instruction
Level Parallelism
4
Computer Architecture Topics
P M
P M
S
°°°
P M
P M
Interconnection Network
Processor-Memory-Switch
Multiprocessors
Networks and Interconnections
Shared Memory,
Message Passing,
Data Parallelism
Network Interfaces
Topologies,
Routing,
Bandwidth,
Latency,
Reliability
5
Measurement and Evaluation
Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems
Design
Analysis
Creativity
Cost /
Performance
Analysis
Good Ideas
Bad Ideas
Mediocre Ideas
6
Issues for a Computer Designer
• Functional Requirements Analysis (Target)
– Scientific Computing – HiPerf floating pt.
– Business – transactional support/decimal arith.
– General Purpose –balanced performance for a range of
tasks
• Level of software compatibility
– PL level
• Flexible, Need new compiler, portability an issue
– Binary level (x86 architecture)
• Little flexibility, Portability requirements minimal
• OS requirements
– Address space issues, memory management, protection
• Conformance to Standards
– Languages, OS, Networks, I/O, IEEE floating pt.
7
Computer Systems: Technology
Trends
• 1988
–
–
–
–
–
–
Supercomputers
Massively Parallel Processors
Mini-supercomputers
Minicomputers
Workstations
PC’s
• 2002
– Powerful PC’s and
SMP Workstations
– Network of SMP
Workstations
– Mainframes
– Supercomputers
– Embedded Computers
8
Why Such Change in 10 years?
• Performance
– Technology Advances
• CMOS (complementary metal oxide semiconductor) VLSI
dominates older technologies like TTL (transistor transistor
logic) in cost AND performance
– Computer architecture advances improves low-end
• RISC, pipelining, superscalar, RAID, …
• Price: Lower costs due to …
– Simpler development
• CMOS VLSI: smaller systems, fewer components
– Higher volumes
– Lower margins by class of computer, due to fewer
services
• Function :Rise of networking/local
interconnection technology
9
Growth in Microprocessor
Performance
10
Six Generations of DRAMs
11
Updated Technology Trends
(Summary)
Capacity
Speed (latency)
Logic
4x in 4 years
2x in 3 years
DRAM
4x in 3 years
2x in 10 years
Disk
4x in 2 years
2x in 10 years
Network (bandwidth) 10x in 5 years
• Updates during your study period??
BS (4 yrs)
MS (2 yrs)
PhD (5 yrs)
12
Cost of Microprocessors
12
13
Integrated Circuits Costs
IC cost = Die cost + Testing cost + Packaging cost
Final test yield
Die cost =
Wafer cost
Dies per Wafer * Die yield
Dies per wafer = š * ( Wafer_diam / 2)2 – š * Wafer_diam – Test dies
Die Area
¦ 2 * Die Area
{
Die Yield = Wafer yield * 1 +
Defects_per_unit_area * Die_Area

Die Cost goes roughly with die area4

}
14 1
DAP.S98
Performance Trends
(Summary)
• Workstation performance (measured in Spec
Marks) improves roughly 50% per year
(2X every 18 months)
• Improvement in cost performance estimated
at 70% per year
15
Computer Engineering
Methodology
Implementation
Complexity
Evaluate Existing
Systems for
Bottlenecks
Benchmarks
Technology
Trends
Implement Next
Generation System
Simulate New
Designs and
Organizations
Workloads
16
How to Quantify Performance?
Plane
DC to Paris
Speed
Passengers
Throughput
(pmph)
Boeing 747
6.5 hours
610 mph
470
286,700
BAD/Sud
Concodre
3 hours
1350 mph
132
178,200
• Time to run the task (ExTime)
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns … (Performance)
– Throughput, bandwidth
17
The Bottom Line:
Performance and Cost or Cost
and Performance?
"X is n times faster than Y" means
ExTime(Y)
--------ExTime(X)
=
Performance(X)
--------------Performance(Y)
• Speed of Concorde vs. Boeing 747
• Throughput of Boeing 747 vs. Concorde
• Cost is also an important parameter in the
equation which is why concordes are being put
to pasture!
18
Measurement Tools
• Benchmarks, Traces, Mixes
• Hardware: Cost, delay, area, power estimation
• Simulation (many levels)
– ISA, RT, Gate, Circuit
•
•
•
•
Queuing Theory
Rules of Thumb
Fundamental “Laws”/Principles
Understanding the limitations of any
measurement tool is crucial.
19
Metrics of Performance
Application
Answers per month
Operations per second
Programming
Language
Compiler
ISA
(millions) of Instructions per second: MIPS
(millions) of (FP) operations per second: MFLOP/s
Datapath
Control
Function Units
Transistors Wires Pins
Megabytes per second
Cycles per second (clock rate)
20
Cases of Benchmark Engineering
• The motivation is to tune the system to the benchmark
to achieve peak performance.
• At the architecture level
– Specialized instructions
• At the compiler level (compiler flags)
– Blocking in Spec89  factor of 9 speedup
– Incorrect compiler optimizations/reordering.
– Would work fine on benchmark but not on other programs
• I/O level
– Spec92 spreadsheet program (sp)
– Companies noticed that the produced output was always out
put to a file (so they stored the results in a memory buffer) and
then expunged at the end (which was not measured). 21
– One company eliminated the I/O all together.
After putting in a blazing performance on the benchmark test,
Sun issued a glowing press release claiming that it had
outperformed Windows NT systems on the test.
Pendragon president Ivan Phillips cried foul, saying the results
weren't representative of real-world Java performance and that
Sun had gone so far as to duplicate the test's code within Sun's
Just-In-Time compiler. That's cheating, says Phillips, who claims
that benchmark tests and real-world applications aren't
the same thing.
Did Sun issue a denial or a mea culpa? Initially, Sun neither
denied optimizing for the benchmark test nor apologized for
it. "If the test results are not representative of real-world Java
applications, then that's a problem with the benchmark,"
Sun's Brian Croll said.
After taking a beating in the press, though, Sun retreated and
issued an apology for the optimization.[Excerpted from PC Online221997]
Issues with Benchmark
Engineering
• Motivated by the bottom dollar, good
performance on classic suites  more
customers, better sales.
• Benchmark Engineering  Limits the
longevity of benchmark suites
• Technology and Applications Limits the
longevity of benchmark suites.
23
SPEC: System Performance
Evaluation Cooperative
• First Round 1989
– 10 programs yielding a single number (“SPECmarks”)
• Second Round 1992
– SPECInt92 (6 integer programs) and SPECfp92 (14
floating point programs)
• Compiler Flags unlimited. March 93
• new set of programs: SPECint95 (8 integer programs) and
SPECfp95 (10 floating point)
– “benchmarks useful for 3 years”
– Single flag setting for all programs: SPECint_base95,
SPECfp_base95
– SPEC CPU2000 (11 integer benchmarks – CINT2000,
and 14 floating-point benchmarks – CFP2000
24
SPEC 2000 (CINT 2000)Results
25
SPEC 2000 (CFP 2000)Results
26
Reporting Performance Results
• Reproducability
•  Apply them on publicly available
benchmarks. Pecking/Picking order
–
–
–
–
Real Programs
Real Kernels
Toy Benchmarks
Synthetic Benchmarks
27
How to Summarize Performance
• Arithmetic mean (weighted arithmetic mean)
tracks execution time: sum(Ti)/n or sum(Wi*Ti)
• Harmonic mean (weighted harmonic mean) of
rates (e.g., MFLOPS) tracks execution time:
n/sum(1/Ri) or 1/sum(Wi/Ri)
• Normalized execution time is handy for scaling
performance (e.g., X times faster than
SPARCstation 10)
• But do not take the arithmetic mean of
normalized execution time,
use the geometric mean = (Product(Ri)^1/n)
28
Performance Evaluation
• “For better or worse, benchmarks shape a field”
• Good products created when have:
– Good benchmarks
– Good ways to summarize performance
• Given sales is a function in part of performance
relative to competition, investment in improving
product as reported by performance summary
• If benchmarks/summary inadequate, then choose
between improving product for real programs vs.
improving product to get more sales;
Sales almost always wins!
• Execution time is the measure of computer
performance!
29
Simulations
• When are simulations useful?
• What are its limitations, I.e. what real world
phenomenon does it not account for?
• The larger the simulation trace, the less
tractable the post-processing analysis.
30
Queueing Theory
• What are the distributions of arrival rates
and values for other parameters?
• Are they realistic?
• What happens when the parameters or
distributions are changed?
31
Quantitative Principles of Computer
Design
• Make the Common Case Fast
– Amdahl’s Law
• CPU Performance Equation
– Clock cycle time
– CPI
– Instruction Count
• Principles of Locality
• Take advantage of Parallelism
32
Amdahl's Law
Speedup due to enhancement E:
ExTime w/o E
Speedup(E) = ------------ExTime w/ E
=
Performance w/ E
----------------Performance w/o
Suppose that enhancement E accelerates a fraction F
of the task by a factor S, and the remainder of the
task is unaffected
33
Amdahl’s Law
ExTimenew = ExTimeold x (1 - Fractionenhanced) + Fractionenhanced
Speedupenhanced
Speedupoverall =
ExTimeold
ExTimenew
1
=
(1 - Fractionenhanced) + Fractionenhanced
Speedupenhanced
34
Amdahl’s Law
• Floating point instructions improved to run 2X;
but only 10% of actual instructions are FP
ExTimenew =
Speedupoverall =
35
CPU Performance Equation
CPU time
= Seconds
= Instructions x
Program
Program
Program
Inst Count CPI
X
X
(X)
Inst. Set.
X
X
Technology
x Seconds
Instruction
Compiler
Organization
Cycles
X
Cycle
Clock Rate
X
X
36
Cycles Per Instruction
“Average Cycles per Instruction”
CPI = (CPU Time * Clock Rate) / Instruction Count
= Cycles / Instruction Count
n
CPU time = CycleTime *
•CPI
i =1
* I
i
i
“Instruction Frequency”
n
CPI =
•CPI
i =1
*i F
i
where F
=i
I
i
Instruction Count
Invest Resources where time is Spent!
37
Example: Calculating CPI
Base Machine (Reg / Reg)
Op
Freq Cycles CPI(i)
ALU
50%
1
.5
Load
20%
2
.4
Store
10%
2
.2
Branch
20%
2
.4
1.5
(% Time)
(33%)
(27%)
(13%)
(27%)
Typical Mix
38
Chapter Summary, #1
• Designing to Last through Trends
Capacity
•
Speed
Logic
2x in 3 years
2x in 3 years
DRAM
4x in 3 years
2x in 10 years
Disk
4x in 3 years
2x in 10 years
6yrs to graduate => 16X CPU speed, DRAM/Disk size
• Time to run the task
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns, …
– Throughput, bandwidth
• “X is n times faster than Y” means
ExTime(Y)
--------ExTime(X)
=
Performance(X)
-------------Performance(Y)
39
Chapter Summary, #2
• Amdahl’s Law:
Speedupoverall =
ExTimeold
ExTimenew
1
=
(1 - Fractionenhanced) + Fractionenhanced
Speedupenhanced
• CPI Law:
CPU time
= Seconds
Program
= Instructions x
Program
Cycles
x Seconds
Instruction
Cycle
• Execution time is the REAL measure of computer
performance!
• Good products created when have:
– Good benchmarks, good ways to summarize
performance
• Die Cost goes roughly with die
area4
40
Food for thought
• Two companies reports results on two benchmarks
one on a Fortran benchmark suite and the other on
a C++ benchmark suite.
• Company A’s product outperforms Company B’s
on the Fortran suite, the reverse holds true for the
C++ suite. Assume the performance differences
are similar in both cases.
• Do you have enough information to compare the
two products. What information will you need?
41
Food for Thought II
• In the CISC vs. RISC debate a key argument of the
RISC movement was that because of its simplicity,
RISC would always remain ahead.
• If there were enough transistors to implement a CISC
on chip, then those same transistors could implement
a pipelined RISC
• If there was enough to allow for a pipelined CISC
there would be enough to have an on-chip cache for
RISC. And so on.
• After 20 years of this debate what do you think?
• Hint: Think of commercial PC’s, Moore’s law and
some of the data in the first chapter of the book (and
on these slides)
42
Amdahl’s Law (answer)
• Floating point instructions improved to run 2X;
but only 10% of actual instructions are FP
ExTimenew = ExTimeold x (0.9 + .1/2) = 0.95 x ExTimeold
Speedupoverall =
1
0.95
=
1.053
43