Chapter 1: Fundamentals of Quantitative Design and Analysis
Preface
Second-handers do not examine, they imitate; they do not do, they pretend to do; they do not create, they put on a show; they care nothing for their own ability or anyone else's, thinking instead of friendships and connections.
What would become of the world without those who act, think, work, and produce?
Ayn Rand (The Fountainhead)
1
Computer Architecture
A Quantitative Approach, Fifth Edition
Chapter 1
Fundamentals of Quantitative
Design and Analysis
2
Introduction
Computer Technology
• Performance improvements have come from:
– Improvements in semiconductor technology: feature size, clock speed
– Improvements in computer architectures: enabled by HLL (High Level Language) compilers and UNIX-based OSes, which led to RISC architectures
• Together these have enabled:
– Lightweight computers
– Productivity-based programming languages: C#, Java, Python
– SaaS, Virtualization, Cloud
• Applications evolution: speech, sound, images, video, “augmented/extended reality”, “big data”
3
Introduction
Single Processor Performance
[Figure: growth in single-processor performance; rapid growth in the RISC era, then the move to multi-processors]
4
Introduction
Current Trends in Architecture
• Cannot continue to leverage Instruction-Level Parallelism (ILP)
• Single-processor performance improvement ended in 2003
• New models for performance:
– Data-level parallelism (DLP)
– Thread-level parallelism (TLP)
– Request-level parallelism (RLP)
• These require explicit restructuring of the application
5
Classes of Computers
• Personal Mobile Device (PMD): smart phones, tablet computers (1.8 billion sold 2010)
– Emphasis on energy efficiency and real-time
• Desktop Computers (0.35 billion)
– Emphasis on price-performance
• Servers (20 million)
– Emphasis on availability (very costly downtime!), scalability, throughput
• Clusters / Warehouse Scale Computers
– Used for “Software as a Service (SaaS)”, etc.
– Emphasis on availability ($6M/hour downtime at Amazon.com!) and price-performance (power = 80% of Total Cost!)
– Sub-class: Supercomputers; emphasis: floating-point performance, fast internal networks, big data analytics
• Embedded Computers (19 billion in 2010)
– Emphasis: price
6
Classes of Computers
Parallelism
Classes of parallelism in applications:
– Data-Level Parallelism (DLP): many data items that can be operated on at the same time
– Task-Level Parallelism (TLP): tasks that can run in parallel
Classes of architectural parallelism:
– Instruction-Level Parallelism (ILP): instruction pipelining
– Vector architectures / Graphics Processor Units (GPUs): applying a single instruction to a collection of data
– Thread-Level Parallelism: exploits DLP or TLP via tightly coupled hardware that allows multithreading
– Request-Level Parallelism: largely decoupled tasks running in parallel
7
Classes of Computers
Flynn’s Taxonomy
• Single instruction stream, single data stream (SISD)
– Uniprocessor system, possibly with ILP
• Single instruction stream, multiple data streams (SIMD)
– Vector architectures
– Multimedia extensions
– Graphics processor units
• Multiple instruction streams, single data stream (MISD)
– No commercial implementation
• Multiple instruction streams, multiple data streams (MIMD)
– Tightly-coupled MIMD: thread-level parallelism
– Loosely-coupled MIMD: request-level parallelism
8
Defining Computer Architecture
• “Old” view of computer architecture:
– Instruction Set Architecture (ISA) design
– i.e. decisions regarding: registers, memory addressing, addressing modes, instruction operands, available operations, control flow instructions, instruction encoding
• “Real” computer architecture:
– Specific requirements of the target machine
– Design to maximize performance within constraints: cost, power, and availability
– Includes ISA, microarchitecture (the memory system and its interconnection with the CPU), and hardware
9
Trends in Technology
• Integrated circuit (MOS) technology:
– Transistor density: 35%/year
– Die size: 10-20%/year
– Integration overall: 40-55%/year, enabling on-chip cache and multi-core
• DRAM capacity: 25-40%/year (slowing)
• Flash capacity: 50-60%/year
– 15-20X cheaper/bit than DRAM; the standard for PMDs
• Magnetic disk technology: 40%/year
– 15-25X cheaper/bit than Flash, 300-500X cheaper/bit than DRAM
• Networking technology: discussed in another course
10
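The annual rates above compound year over year. A minimal sketch (Python; the rates plugged in are taken from the list above, the helper names are mine) of converting an annual growth rate into a multi-year factor and a doubling time:

```python
import math

def growth_factor(annual_rate, years):
    # Total improvement factor after `years` of compound annual growth.
    return (1 + annual_rate) ** years

def doubling_time(annual_rate):
    # Years for a quantity to double at the given compound annual rate.
    return math.log(2) / math.log(1 + annual_rate)

# Transistor density at 35%/year:
print(growth_factor(0.35, 4))   # ~3.32x after 4 years
print(doubling_time(0.35))      # ~2.31 years to double
```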
Trends in Technology
Bandwidth and Latency
• Bandwidth or throughput: total work done in a given time
– 10,000-25,000X improvement for processors
– 300-1200X improvement for memory and disks
• Latency or response time: time between start and completion of an event
– 30-80X improvement for processors
– 6-8X improvement for memory and disks
11
[Figure: log-log plot of bandwidth and latency milestones]
12
Trends in Technology
Transistors and Wires
• Feature size: minimum size of a transistor or a wire
– 10 microns in 1971 to 0.032 microns in 2011
• Transistor performance scales linearly: switching delay improves with feature size, but wire delay does not
• Integration density scales quadratically
• Linear performance and quadratic density growth present two challenges:
1. Power
2. Signal propagation delay
Copyright © 2012, Elsevier Inc. All rights reserved.
13
Trends in Power and Energy
Power and Energy
• Problem: get power in, get power out
• Three power concerns:
1. Maximum power: the power needed to maintain the supply voltage
2. Thermal Design Power (TDP)
– Characterizes sustained power consumption
– Used as the target for the power supply and cooling system
– Lower than peak power, higher than average power consumption
– Power control via voltage- or temperature-dependent clock rate + thermal overload trip
3. Energy efficiency: energy consumption per task
– Example: a CPU with 20% more power but 30% less time per task uses 1.2 × 0.7 = 0.84 of the energy, i.e. better energy efficiency
14
Trends in Power and Energy
Dynamic Energy and Power
• Dynamic energy: dissipated per transistor switch (0 → 1 or 1 → 0)
– ½ × Capacitive load × Voltage²
• Dynamic power
– ½ × Capacitive load × Voltage² × Frequency switched
• Reducing the clock rate reduces power, but not energy
• Reducing voltage lowers both: from 5 V to under 1 V in 20 years
15
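The two formulas above can be sketched directly; the capacitive load, voltage, and frequency values below are made-up illustrations, not figures from the text:

```python
def dynamic_energy(cap_load, voltage):
    # Energy per 0->1 or 1->0 transition: 1/2 * C * V^2
    return 0.5 * cap_load * voltage ** 2

def dynamic_power(cap_load, voltage, freq_switched):
    # Power = 1/2 * C * V^2 * f
    return 0.5 * cap_load * voltage ** 2 * freq_switched

# Halving the frequency halves power but not energy per task:
# the same number of transitions happen, just spread over more time.
# Reducing voltage cuts both quadratically, e.g. a 15% voltage drop:
p_nominal = dynamic_power(1e-9, 1.00, 2e9)
p_reduced = dynamic_power(1e-9, 0.85, 2e9)
print(p_reduced / p_nominal)   # 0.85^2 = 0.7225
```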
Trends in Power and Energy
Example
[Figure: worked power/energy example]
16
Trends in Power and Energy
Power
• The Intel 80386 consumed ~4 W; a 3.3 GHz Intel Core i7 consumes 130 W
• Heat must be dissipated from a 1.5 × 1.5 cm chip
• This is about the limit of what can be cooled by air
17
Trends in Power and Energy
Increasing Energy Efficiency
• Do nothing well: turn off the clock of idle units or cores
• Dynamic Voltage-Frequency Scaling (DVFS)
• Low-power state for DRAM and disks: imposes a wake-up delay
• Overclocking: turning off some cores and running the others faster, typically 10% above the nominal clock
18
Trends in Power and Energy
Static Power
• Static power consumption = Current_static (leakage current) × Voltage
• Scales with the number of transistors and on-chip cache (SRAM)
• To reduce it:
– Power gating of idle sub-modules
– Race-to-halt: operate at maximum speed to prolong idle periods
• The new primary metrics for design innovation:
– Tasks per joule
– Performance per watt
– (instead of performance per mm²)
19
Trends in Cost
• Cost-related issues:
– Yield: the percentage of manufactured devices that pass the tests
– Volume: increasing the volume decreases cost
– Becoming a commodity: increases competition and lowers the cost
20
Chip manufacturing process
• Silicon ingots: 8–12 inches in diameter and 12–24 inches long
• Wafers: 0.1 inches thick
• 1 layer of transistors with 2–8 levels of metal conductor, separated by layers of insulators
21
Trends in Cost
Integrated Circuit Cost
• Bose-Einstein formula:
– Die yield = Wafer yield × 1 / (1 + Defects per unit area × Die area)^N
– Wafer yield = 100%
– Defects per unit area = 0.016-0.057 defects per square cm for 40 nm (2010)
– N = process-complexity factor = 11.5-15.5 (40 nm, 2010)
• The manufacturing process dictates the wafer cost, wafer yield, and defects per unit area
• The architect’s design affects the die area, which in turn affects the defects and cost per die
22
23
Cost of Die
• Processed wafer cost = $5500
• Cost of a 1 cm² die ≈ $13
• Cost of a 2.25 cm² die ≈ $51
• Cost grows faster than die area, roughly with the square of the area increase
• Additional costs: testing, packaging, test after packaging, and multilayer fabrication masks
24
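The die costs above follow from the dies-per-wafer approximation and the Bose-Einstein yield model. In this sketch the defect density (0.031/cm²) and process-complexity factor (N = 13.5) are assumed midpoints of the 40 nm ranges quoted earlier, and the wafer is taken to be 30 cm in diameter:

```python
import math

def dies_per_wafer(wafer_diam_cm, die_area_cm2):
    # pi*(d/2)^2 / A for the gross count, minus pi*d / sqrt(2*A)
    # for the partial dies lost around the wafer's edge.
    return int(math.pi * (wafer_diam_cm / 2) ** 2 / die_area_cm2
               - math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2))

def die_yield(defects_per_cm2, die_area_cm2, n):
    # Bose-Einstein model, assuming 100% wafer yield.
    return 1.0 / (1 + defects_per_cm2 * die_area_cm2) ** n

def cost_per_die(wafer_cost, wafer_diam_cm, die_area_cm2,
                 defects_per_cm2=0.031, n=13.5):
    good_dies = (dies_per_wafer(wafer_diam_cm, die_area_cm2)
                 * die_yield(defects_per_cm2, die_area_cm2, n))
    return wafer_cost / good_dies

# Reproduces the slide's figures for a $5500 processed wafer:
print(round(cost_per_die(5500, 30, 1.0)))    # ~$13 for a 1 cm^2 die
print(round(cost_per_die(5500, 30, 2.25)))   # ~$51 for a 2.25 cm^2 die
```

Note that the 2.25x larger die costs ~4x more: fewer dies fit on the wafer and a larger die is more likely to contain a defect.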
Dependability
• Systems alternate between two states of service:
1. Service accomplishment, where service is delivered as required
2. Service interruption, where the delivered service is corrupted
• Failure (F) = transition from state 1 to state 2; repair (R) = transition from state 2 to state 1
• Module reliability:
– Failures In Time (FIT): number of failures per 1 billion hours
– Mean time to failure (MTTF) = 10^9 / FIT
– Mean time to repair (MTTR): time required for service restoration
– Mean time between failures (MTBF) = MTTF + MTTR
• Module availability = MTTF / MTBF
25
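A minimal sketch of the reliability definitions above; the disk MTTF, MTTR, and module count are hypothetical:

```python
def mttf_hours(fit):
    # FIT = failures per 10^9 device-hours, so MTTF = 10^9 / FIT.
    return 1e9 / fit

def availability(mttf, mttr):
    # Fraction of time the module delivers service: MTTF / (MTTF + MTTR).
    return mttf / (mttf + mttr)

def system_fit(*module_fits):
    # Failure rates of independent modules simply add.
    return sum(module_fits)

# Hypothetical subsystem: 10 disks, each with a 1,000,000-hour MTTF.
disk_fit = 1e9 / 1_000_000            # 1000 FIT per disk
total_fit = system_fit(*[disk_fit] * 10)
print(mttf_hours(total_fit))          # 100,000 hours for the set of 10
print(availability(1_000_000, 24))    # one disk with a 24-hour MTTR
```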
Dependability
• Total system failure rate = ∑ (failure rate of each part)
26
Redundancy improves dependability
27
Measuring Performance
• Typical performance metrics:
– Response time = execution time (desktop)
– Throughput = total amount of work done in a given time (warehouse)
• Speed of X relative to Y = Execution timeY / Execution timeX
• Execution time:
– Wall clock time: includes all system overheads
– CPU time: only computation time
• Benchmarks:
– Kernels (e.g. matrix multiply): small, key pieces of real applications
– Toy programs (e.g. sorting): less than 100 lines
– Benchmark suites: Standard Performance Evaluation Corporation, www.spec.org & Transaction Processing Council, www.tpc.org
28
SPEC desktop benchmark programs
29
Summarizing Performance Results
• SPECRatio = execution time on the reference machine / execution time on the measured machine
• Per-benchmark ratios are summarized with the geometric mean
30
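Assuming the usual SPEC convention that SPECRatio is the reference machine's time divided by the measured machine's time, a sketch with made-up execution times shows why the geometric mean is used: it gives the same relative ranking regardless of which machine is the reference.

```python
import math

def spec_ratio(ref_time, measured_time):
    # How many times faster than the reference machine.
    return ref_time / measured_time

def geometric_mean(ratios):
    # exp of the mean of the logs; unlike the arithmetic mean,
    # this is independent of the choice of reference machine.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Two hypothetical benchmarks: 4x and 2x faster than the reference.
ratios = [spec_ratio(2000, 500), spec_ratio(3000, 1500)]
print(geometric_mean(ratios))   # sqrt(4 * 2) ~ 2.83
```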
AMD Opteron vs. Intel Itanium 2
31
Principles
• Principle of Locality: programs spend 90% of execution time in only 10% of the code
• Focus on the Common Case
• Amdahl’s Law: quantifies the performance improvement obtained from optimizing the common case
32
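Amdahl's Law can be written as Speedup = 1 / ((1 - f) + f / s), where f is the fraction of execution time affected by the enhancement and s is the speedup of that fraction. A sketch:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Overall speedup = 1 / ((1 - f) + f / s)
    return 1.0 / ((1 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Speeding up 90% of the execution time by 10x gives only ~5.3x overall:
print(amdahl_speedup(0.9, 10))    # ~5.26
# Even an unbounded speedup of that 90% is capped near 1/(1-f) = 10x:
print(amdahl_speedup(0.9, 1e12))  # ~10.0
```

This is why the common case matters: the unenhanced fraction always bounds the overall gain.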
33
Using Amdahl’s law to compare design alternatives
34
The effect of a 4150X improvement in power supply reliability on overall system reliability.
Amdahl’s law requires knowing the fraction of time or other resources consumed by the new version.
35
Principles
The Processor Performance Equation
• CPU time = Instruction count (IC) × Cycles per instruction (CPI) × Clock cycle time
36
Principles
Average CPI
• Different instruction types have different CPIs
• ICᵢ = the number of times instruction type i is executed in a program
• Average CPI = ∑ (ICᵢ × CPIᵢ) / Instruction count
37
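The processor performance equation (CPU time = IC × CPI × clock cycle time) and the average-CPI computation can be sketched as follows; the instruction mix below is hypothetical:

```python
def average_cpi(mix):
    # mix: list of (instruction_count_i, cpi_i) per instruction class.
    total_ic = sum(ic for ic, _ in mix)
    total_cycles = sum(ic * cpi for ic, cpi in mix)
    return total_cycles / total_ic

def cpu_time(instruction_count, avg_cpi, clock_hz):
    # CPU time = IC x CPI x clock cycle time (= 1 / clock rate).
    return instruction_count * avg_cpi / clock_hz

# Hypothetical mix: 50% ALU @ 1 CPI, 30% loads/stores @ 2, 20% branches @ 3.
mix = [(50, 1.0), (30, 2.0), (20, 3.0)]
cpi = average_cpi(mix)
print(cpi)                       # 1.7
print(cpu_time(1e9, cpi, 2e9))   # 0.85 s for 10^9 instructions at 2 GHz
```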
38
Comparing Performance, Price, and Power
• Comparing performance/price of three small servers (unnamed in this transcript) with SPECpower (ssj-ops = server-side Java operations per second):

                    Server 1   Server 2   Server 3
ssj-ops/W           3034       2357       2696
ssj-ops/W/$1000     324        254        213
39
Instruction Set Architecture (ISA)
• Serves as an interface between software and hardware.
• Provides a mechanism by which the software tells the hardware what should be done.

High level language code: C, C++, Java, Fortran
        ↓ compiler
Assembly language code: architecture-specific statements
        ↓ assembler
Machine language code: architecture-specific bit patterns

(software sits above the instruction set boundary; hardware below it)
CSCE430/830
ISA
Instruction Set Design Issues
• Instruction set design issues include:
– Where are operands stored?
» registers, memory, stack, accumulator
– How many explicit operands are there?
» 0, 1, 2, or 3
– How is the operand location specified?
» register, immediate, indirect, . . .
– What type & size of operands are supported?
» byte, int, float, double, string, vector. . .
– What operations are supported?
» add, sub, mul, move, compare . . .
Classifying ISAs
• Accumulator (before 1960, e.g. 68HC11): 1-address
  add A              acc ← acc + mem[A]
• Stack (1960s to 1970s): 0-address
  add                tos ← tos + next
• Memory-Memory (1970s to 1980s): 2-address or 3-address
  add A, B           mem[A] ← mem[A] + mem[B]
  add A, B, C        mem[A] ← mem[B] + mem[C]
• Register-Memory (1970s to present, e.g. 80x86): 2-address
  add R1, A          R1 ← R1 + mem[A]
  load R1, A         R1 ← mem[A]
• Register-Register (Load/Store, RISC) (1960s to present, e.g. MIPS): 3-address
  add R1, R2, R3     R1 ← R2 + R3
  load R1, R2        R1 ← mem[R2]
  store R1, R2       mem[R1] ← R2
Code Sequence C = A + B for Four Instruction Sets
Stack:
  Push A
  Push B
  Add
  Pop C
Accumulator:
  Load A
  Add B              acc ← acc + mem[B]
  Store C
Register (register-memory):
  Load R1, A
  Add R1, B          R1 ← R1 + mem[B]
  Store C, R1
Register (load-store):
  Load R1, A
  Load R2, B
  Add R3, R1, R2     R3 ← R1 + R2
  Store C, R3
Types of Addressing Modes (VAX)

Addressing Mode        Example               Action
1. Register direct     Add R4, R3            R4 ← R4 + R3
2. Immediate           Add R4, #3            R4 ← R4 + 3
3. Displacement        Add R4, 100(R1)       R4 ← R4 + M[100 + R1]
4. Register indirect   Add R4, (R1)          R4 ← R4 + M[R1]
5. Indexed             Add R4, (R1 + R2)     R4 ← R4 + M[R1 + R2]
6. Direct              Add R4, (1000)        R4 ← R4 + M[1000]
7. Memory indirect     Add R4, @(R3)         R4 ← R4 + M[M[R3]]
8. Autoincrement       Add R4, (R2)+         R4 ← R4 + M[R2]; R2 ← R2 + d
9. Autodecrement       Add R4, -(R2)         R2 ← R2 - d; R4 ← R4 + M[R2]
10. Scaled             Add R4, 100(R2)[R3]   R4 ← R4 + M[100 + R2 + R3 × d]

• Studies by [Clark and Emer] indicate that modes 1-4 account for 93% of all operands on the VAX.
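The most common modes (1-4 above) can be illustrated with a toy memory model; all register and memory values below are made up for the example:

```python
# Hypothetical machine state: register file and a sparse byte-addressed memory.
regs = {"R1": 100, "R2": 8, "R3": 200}
mem = {100: 7, 108: 9, 200: 108, 1000: 3}

def displacement(base_reg, disp):
    return mem[disp + regs[base_reg]]    # M[disp + R], e.g. 100(R1)

def register_indirect(reg):
    return mem[regs[reg]]                # M[R], e.g. (R1)

def indexed(r1, r2):
    return mem[regs[r1] + regs[r2]]      # M[R1 + R2], e.g. (R1 + R2)

def memory_indirect(reg):
    return mem[mem[regs[reg]]]           # M[M[R]], e.g. @(R3)

print(displacement("R2", 100))   # M[100 + 8] = M[108] = 9
print(register_indirect("R1"))   # M[100] = 7
print(indexed("R1", "R2"))       # M[100 + 8] = 9
print(memory_indirect("R3"))     # M[M[200]] = M[108] = 9
```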