2017 Ch1 Fundemantal 文件 - 天津大学研究生e

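Modern Computer Architecture
Course materials, assignments, and discussion site: http://glearning.tju.edu.cn/
Instructor: Prof. Zhang Gang
School of Computer Science, Tianjin University
Contact email: [email protected]
2017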
Main References (1)
• Computer Architecture: A Quantitative Approach
– English edition, 5th edition
– John L. Hennessy, David A. Patterson
– China Machine Press (机械工业出版社)
– E-book URL: http://www.doc88.com/p-112663203506.html
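Main References (2)
• 计算机体系结构：量化研究方法 (Chinese translation of Computer Architecture: A Quantitative Approach)
– 5th edition
– John L. Hennessy, David A. Patterson
– Translated by Jia Hongfeng (贾洪峰)
– Posts & Telecom Press (人民邮电出版社)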
Main References (3)
• Computer Architecture: A Quantitative Approach
– English edition, 4th edition
– John L. Hennessy, David A. Patterson
– China Machine Press (机械工业出版社)
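Main References (4)
• 计算机系统结构：一种定量的方法 (Chinese translation of Computer Architecture: A Quantitative Approach)
– 4th edition
– John L. Hennessy, David A. Patterson
– Translated by Bai Yuebin (白跃彬)
– Publishing House of Electronics Industry (电子工业出版社)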
Hennessy's profile on the Stanford website
Main References (5)
• 可扩展并行计算：技术、结构与编程 (Chinese translation)
– Scalable Parallel Computing: Technology, Architecture, Programming
– Kai Hwang (黄铠), Zhiwei Xu (徐志伟)
– Translated by Lu Xinda (陆鑫达) et al.
– China Machine Press (机械工业出版社)
Main References (6)
• 计算机系统结构 (Computer Architecture), 2nd edition
– Zheng Weimin (郑纬民) et al.
– Tsinghua University Press (清华大学出版社)
Course Schedule
• Start date: February 20, 2017
• Class time: weeks 1-8, every Monday evening, 6:30-9:45
• Location: Room 203, Zone B, Building 44
Main Course Contents
• Chapter 1. Fundamentals of Quantitative Design and Analysis
• Chapter 2. Memory Hierarchy Design
• Chapter 3. Instruction-Level Parallelism and Its Exploitation
• Chapter 4. Data-Level Parallelism in Vector, SIMD, and GPU Architectures
• Chapter 5. Thread-Level Parallelism
• Chapter 6. Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism
• Appendix A. Pipelining: Basic and Intermediate Concepts
Prerequisites
• Undergraduate courses:
– Computer Organization
– Computer Architecture
– Operating Systems
– Computer Networks
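Exams and Grading
• Attendance (including quizzes and answering questions): 20%
• Assignments (submitted online): 20%
• Final exam (closed book): 60%
• Assignment submission requirements:
– State your name and the assignment number clearly, in the form "张某某作业几" (name, then assignment number)
– Submit the assignment as an attachment; do not use WPS format
• Submission deadline:
– Before 8:00 a.m. on Saturday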
Computer Technology
• Performance improvements:
– Improvements in semiconductor technology
• Feature size, clock speed
– Improvements in computer architectures
• Enabled by High Level Language (HLL) compilers, UNIX
• Led to RISC architectures
– Together have enabled:
• Lightweight computers
• Productivity-based managed/interpreted programming languages
Uniprocessor Performance
[Figure: performance relative to the VAX-11/780 (log scale, 1 to 10000) versus year, 1978 to 2006. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006.]
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: 20%/year, 2002 to present
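To get a feel for what these annual rates mean, the following minimal sketch compounds the two closed periods quoted on the slide. The helper name `cumulative_factor` is made up for illustration; only the rates and years come from the slide.

```python
# Compound the per-year growth rates quoted on the slide over their periods.
periods = [
    ("VAX, 1978 to 1986", 0.25, 1986 - 1978),
    ("RISC + x86, 1986 to 2002", 0.52, 2002 - 1986),
]


def cumulative_factor(rate: float, years: int) -> float:
    """Total improvement after compounding `rate` for `years` years."""
    return (1.0 + rate) ** years


for label, rate, years in periods:
    print(f"{label}: about {cumulative_factor(rate, years):.0f}x overall")
```

Sixteen years at 52%/year compounds to a factor of several hundred, which is why the later drop to 20%/year matters so much.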
Single Processor Performance
[Figure: single-processor performance over time, annotated with "RISC" and "Move to multi-processor"]
Original Food Chain
Big Fish Eating Little Fish
1986 Computer Food Chain
[Figure labels: Mainframe, Supercomputer, Minisupercomputer, Minicomputer, Workstation, PC, Massively Parallel Processors]
2002 Computer Food Chain
[Figure labels: Mainframe, Server, Supercomputer, Workstation, PC; shown apart from these: Massively Parallel Processors, Minisupercomputer, Minicomputer]
Now who is eating whom?
Why Such Change in 16 Years?
• Performance
– Technology advances
• CMOS VLSI dominates older technologies (TTL, ECL) in cost AND performance
– Computer architecture advances improve the low end
• RISC, superscalar, RAID, …
Assignment 1:
List the new technologies that have appeared in computer architecture over the past 20 years.
Why Such Change in 16 Years?
• Price: lower costs due to …
– Simpler development
• CMOS VLSI: smaller systems, fewer components
– Higher volumes
• CMOS VLSI: same development cost for 10,000 vs. 10,000,000 units
– Lower margins by class of computer, due to fewer services
Why Such Change in 16 Years?
• Function
– Rise of networking/local interconnection technology
Moore’s Law
Exponential Growth – doubling of transistors every couple of years
Growth in CPU Transistor Count
Moore’s Law Graph
Gordon Moore's prediction (made in 1965 and revised in 1975), popularly known as Moore's Law, states that the number of transistors on a chip will double about every two years.
Moore’s Law Graph
• Is a larger die better, or a smaller one?
• The gray circle in the figure is the wafer
• The yellow dots in the figure are impurities (defects)
Moore’s Law Graph
• Imagine: what would happen if a whole wafer produced only one chip?
Moore’s Law Graph
• Total cost is lowest with an appropriate number of chips per wafer
Do you want to be a millionaire?
• You double your investment every day
– Starting investment: one cent.
• How long does it take to become a millionaire?
a) 20 days
b) 27 days
c) 37 days
d) 365 days
e) Lifetime ++
Do you want to be a millionaire?
• You double your investment every day
– Starting investment: one cent.
• How long does it take to become a millionaire?
a) 20 days: one million cents
b) 27 days: millionaire
c) 37 days: billionaire
• Doubling transistors every 18 months
– This growth rate is hard to imagine
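The answers on this slide can be checked with a few lines of code. This is a minimal sketch that assumes the investment starts at one cent and doubles once per day; the function name `days_to_reach` is chosen here only for illustration.

```python
def days_to_reach(target_cents: int, start_cents: int = 1) -> int:
    """Number of daily doublings needed before the amount reaches the target."""
    days = 0
    amount = start_cents
    while amount < target_cents:
        amount *= 2
        days += 1
    return days


CENTS_PER_DOLLAR = 100
print("one million cents:", days_to_reach(1_000_000), "days")
print("one million dollars:", days_to_reach(1_000_000 * CENTS_PER_DOLLAR), "days")
print("one billion dollars:", days_to_reach(1_000_000_000 * CENTS_PER_DOLLAR), "days")
```

Going from millionaire to billionaire takes only ten more doublings, which is the point of the comparison with transistor counts.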
Why has the improvement dropped?
• The End of the Uniprocessor Era
• Single biggest change in the history of computing systems
Current Trends in Architecture
• Cannot continue to leverage Instruction-Level Parallelism (ILP)
– Single processor performance improvement ended in 2003
• New models for performance:
– Data-level parallelism (DLP)
– Thread-level parallelism (TLP)
– Request-level parallelism (RLP)
• These require explicit restructuring of the application
Trends in Technology
• Integrated circuit technology
– Transistor density: 35%/year
– Die size: 10-20%/year
– Integration overall: 40-55%/year
• DRAM capacity: 25-40%/year (slowing)
• Flash capacity: 50-60%/year
– 15-20X cheaper/bit than DRAM
• Magnetic disk technology: 40%/year
– 15-25X cheaper/bit than Flash
– 300-500X cheaper/bit than DRAM
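Annual growth rates are easier to compare once converted to doubling times. The sketch below applies the standard compound-growth relation, doubling time = ln 2 / ln(1 + rate), to a few of the rates listed above; the helper name `doubling_time_years` is an assumption of this example.

```python
import math


def doubling_time_years(annual_rate: float) -> float:
    """Years needed to double at a constant compound annual growth rate."""
    return math.log(2.0) / math.log(1.0 + annual_rate)


# Rates taken from the slide; ranges are given as (low, high).
trends = {
    "Transistor density": (0.35, 0.35),
    "Integration overall": (0.40, 0.55),
    "DRAM capacity": (0.25, 0.40),
    "Flash capacity": (0.50, 0.60),
}

for name, (low, high) in trends.items():
    slow = doubling_time_years(low)
    fast = doubling_time_years(high)
    print(f"{name}: doubles every {fast:.1f} to {slow:.1f} years")
```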
Memory Capacity (Single Chip DRAM)
[Figure: DRAM chip size in bits (log scale, 1,000 to 1,000,000,000) versus year, 1970 to 2000]
Year    Size (Mb)    Cycle time
1980    0.0625       250 ns
1983    0.25         220 ns
1986    1            190 ns
1989    4            165 ns
1992    16           145 ns
1996    64           120 ns
2000    256          100 ns
Bandwidth and Latency
• Bandwidth or throughput
– Total work done in a given time
– 10,000-25,000X improvement for processors
– 300-1200X improvement for memory and disks
• Latency or response time
– Time between start and completion of an event
– 30-80X improvement for processors
– 6-8X improvement for memory and disks
Bandwidth and Latency
Log-log plot of bandwidth and latency milestones
Transistors and Wires
• Feature size
– Minimum size of transistor or wire in x or y dimension
– 10 microns in 1971 to 0.032 microns in 2011
– Transistor performance scales linearly
• Wire delay does not improve with feature size!
– Integration density scales quadratically
Power and Energy
• Thermal Design Power (TDP)
– Characterizes sustained power consumption
– Used as target for power supply and cooling system
– Lower than peak power, higher than average power consumption
• Clock rate can be reduced dynamically to limit power consumption
• Energy per task is often a better measurement
Power and Energy
For the Core i7 processor, Intel specifies the maximum TDP (Max TDP), not the TDP.
Dynamic Energy and Power
• Dynamic energy
– Transistor switch from 0 -> 1 or 1 -> 0
– ½ × Capacitive load × Voltage²
• Dynamic power
– ½ × Capacitive load × Voltage² × Frequency switched
• Reducing clock rate reduces power, not energy
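The two formulas above can be turned into a small calculation. This is a sketch only: the baseline capacitive load, voltage, and frequency are hypothetical values picked for illustration, and the 15% scaling step is an assumed example of lowering voltage and frequency together.

```python
def dynamic_energy(capacitive_load: float, voltage: float) -> float:
    """Energy of one 0->1 or 1->0 transition: 1/2 x C x V^2 (joules)."""
    return 0.5 * capacitive_load * voltage ** 2


def dynamic_power(capacitive_load: float, voltage: float, frequency: float) -> float:
    """Power when switching at `frequency`: 1/2 x C x V^2 x f (watts)."""
    return dynamic_energy(capacitive_load, voltage) * frequency


# Hypothetical baseline values, chosen only for illustration.
C, V, F = 1e-9, 1.0, 3.3e9

scale = 0.85  # assumed: voltage and frequency are both lowered by 15%
energy_ratio = dynamic_energy(C, V * scale) / dynamic_energy(C, V)
power_ratio = dynamic_power(C, V * scale, F * scale) / dynamic_power(C, V, F)
print(f"energy per transition: {energy_ratio:.2f} of baseline")
print(f"power: {power_ratio:.2f} of baseline")

# Lowering only the clock rate halves power; energy per transition is unchanged.
half_clock_power_ratio = dynamic_power(C, V, F / 2) / dynamic_power(C, V, F)
print(f"power at half clock rate: {half_clock_power_ratio:.2f} of baseline")
```

Because voltage enters squared, lowering voltage is what saves energy; lowering the clock rate alone only stretches the same energy over a longer time.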
Power
• Intel 80386 consumed ~2 W
• 3.3 GHz Intel Core i7 (1st generation) consumes 130 W
• Heat must be dissipated from a 1.5 cm × 1.5 cm chip
• This is the limit of what can be cooled by air
Static Power
• Static power consumption
– Current_static × Voltage
– Scales with number of transistors
– To reduce: power gating, i.e., turning off the power supply to idle circuits to reduce leakage
Energy Saving
• Do nothing well
– Turn off the clock of inactive modules, e.g., floating-point unit, cores
• Dynamic Voltage-Frequency Scaling (DVFS)
• Design for the typical case
• Overclocking
Energy Saving
• Dynamic Voltage-Frequency Scaling (DVFS)
• When CPU utilization is only 3%, does the CPU really have to run at full speed?
Energy Saving
• Why DVS rather than DVFS?
– “Figure 5.11 shows the potential power savings of CPU dynamic voltage scaling (DVS) for that same server by plotting the power usage across a varying compute load for three frequency-voltage steps.”
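Energy Saving
• Design for the typical case
– Memory and storage offer low power modes
– “Emergency slowdown”
• Overclocking
– Intel has offered Turbo mode in its chips since 2008.
– In Turbo mode, a few cores are allowed to run for a short time at a frequency higher than the nominal clock rate.
– For example, the 3.3 GHz Core i7 is a multicore microprocessor; different Core i7 models have between 2 and 8 cores, and a Core i7 can run some of its cores at 3.6 GHz for a short time.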
Energy Saving
• The primary metric now is tasks per joule or performance per watt
• It is no longer performance per mm² of silicon
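Question for Thought
• An observation: when the same program runs on the same computer, changes in room temperature affect how fast the program executes.
• Why does room temperature affect program execution speed? In other words, why does room temperature affect the performance of a computer system?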
Trends in Cost
• Cost driven down by the learning curve
• DRAM: price closely tracks cost
• Microprocessors: price depends on volume
– 10% less for each doubling of volume
Dependability
• Module reliability
– Mean time to failure (MTTF)
– Mean time to repair (MTTR)
– Mean time between failures (MTBF) = MTTF + MTTR
– Availability = MTTF / MTBF
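As a quick illustration of the definitions above, the sketch below computes MTBF and availability for one module. The MTTF and MTTR figures are hypothetical, picked only to show the arithmetic.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the module is in service: MTTF / MTBF."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


# Hypothetical module: fails on average every 1,000,000 hours, takes 24 hours to repair.
MTTF_HOURS = 1_000_000
MTTR_HOURS = 24

print(f"MTBF = {mtbf(MTTF_HOURS, MTTR_HOURS):,.0f} hours")
print(f"Availability = {availability(MTTF_HOURS, MTTR_HOURS):.6f}")
```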
Conventional Wisdom in Computer Architecture
• Old Conventional Wisdom: Power is free, transistors expensive
• New Conventional Wisdom: “Power wall”: power expensive, transistors free (can put more on chip than can afford to turn on)
Conventional Wisdom in Computer Architecture
• Old CW: Sufficiently increasing Instruction-Level Parallelism via compilers and innovation (out-of-order, speculation, VLIW, …)
• New CW: “ILP wall”: law of diminishing returns on more HW for ILP
Conventional Wisdom in Computer Architecture
• Old CW: Multiplies are slow, memory access is fast
• New CW: “Memory wall”: memory slow, multiplies fast (200 clock cycles to DRAM memory, 4 clocks for multiply)
Conventional Wisdom in Computer Architecture
• Old CW: Uniprocessor performance 2X / 1.5 yrs
• New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
– Uniprocessor performance now 2X / 5(?) yrs
⇒ Sea change in chip design: multiple “cores” (2X processors per chip / ~2 years)
• More, simpler processors are more power efficient
Content of Computer Architecture Courses
• 1950s to 1960s: Computer Architecture Course: Computer Arithmetic
• 1970s to mid 1980s: Computer Architecture Course: Instruction Set Design, especially ISA appropriate for compilers
• 1990s: Computer Architecture Course: Design of CPU, memory system, I/O system, Multiprocessors, Networks
• 2010s: Computer Architecture Course: Self-adapting systems? Self-organizing structures? DNA Systems/Quantum Computing?
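Research Topics in Computer Architecture
• Further improving the performance of a single microprocessor (the speed-of-light limit).
• Microprocessor-based multiprocessor architectures.
• Improving overall system qualities: availability, maintainability, scalability.
• Processors built from new kinds of devices, such as optical computers; computers based on new principles (biological and molecular, with DNA computers also proposed).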
What is Computer Architecture?
[Figure: Application at the top and Physics at the bottom, with the note "Gap too large to bridge in one step (but there are exceptions, e.g. magnetic compass)"]
In its broadest definition, computer architecture is the design of the abstraction layers that allow us to implement information processing applications efficiently using available manufacturing technologies.
Abstraction Layers in Modern Systems
[Figure: stack of layers, top to bottom]
• Application
• Algorithm
• Programming Language
• Operating System/Virtual Machine
• Instruction Set Architecture (ISA)
• Microarchitecture
• Gates/Register-Transfer Level (RTL)
• Circuits
• Devices
• Physics
[Figure annotations]
• Original domain of the computer architect (‘50s-’80s)
• Domain of recent computer architecture (‘90s)
• Parallel computing, security, …
• Reliability, power, …
• Reinvigoration of computer architecture, mid-2000s onward.
Computer Engineering Methodology
[Figure: design cycle with three steps and four inputs]
• Steps: Evaluate Existing Systems for Bottlenecks; Simulate New Designs and Organizations; Implement Next Generation System
• Inputs: Implementation Complexity, Technology Trends, Benchmarks, Workloads
Types of Computers
• Computers come in many shapes and sizes
– Supercomputers
– Mainframes
– Minicomputers
– Microcomputers, also known as PCs
– Palm computers, also known as PDAs
– Embedded computers
Supercomputers
• Designed for ultra-high performance tasks
• Weather analysis
• Large
• Expensive
• Massively parallel processing
Mainframes
• Require high performance
• Generate and process large numbers of transactions
• IBM S/390
– 126 MIPS in a single-processor configuration.
Minicomputers
• Designed for real-time dedicated applications or for high-performance, multi-user applications
– Digital Alpha
– IBM RS/6000
– Sun Ultra
Microcomputers
• The most prevalent form
• Sitting on a standard desktop or even laptop
• The first PC was built by IBM
Apple
Palm computers
• These computers are about the size of a human hand
– word processing
– spreadsheet calculations
– handwriting recognition
– game playing
– faxing
Types of Computers Now
• Personal Mobile Device (PMD)
• Desktop Computing
• Servers
• Clusters/Warehouse-Scale Computers (WSC)
– Many desktop computers or servers are connected by local area networks to act as a single larger computer
– The largest of the clusters
• Embedded Computers
– What are embedded computers?
Classes of Parallelism and Parallel Architectures
• In applications
– Data-Level Parallelism (DLP)
– Task-Level Parallelism (TLP)
• Hardware support
– Instruction-Level Parallelism
– Vector Architectures and Graphics Processing Units (GPUs)
– Thread-Level Parallelism
– Request-Level Parallelism
Flynn Categories
• Single instruction stream, single data stream (SISD)
• Single instruction stream, multiple data streams (SIMD)
• Multiple instruction streams, single data stream (MISD)
• Multiple instruction streams, multiple data streams (MIMD)
Flynn Categories
• Some further divide the MIMD category into SPMD (Single Program, Multiple Data) and MPMD (Multiple Program, Multiple Data)
• SPMD
– Multiple autonomous processors simultaneously executing the same program on different data
• MPMD
– Multiple autonomous processors simultaneously executing at least two independent programs
Copy of Flynn’s Web Page at Stanford University
Intel 4004
Intel 8008
Intel 80286
Intel 80386
Intel 80486
Intel Pentium
Intel Pentium Pro
Intel Pentium II
Pentium Evolution (1)
• 8080
– First general-purpose microprocessor
– 8-bit data path
– Used in the first personal computer, the Altair
• 8086
– Much more powerful
– 16-bit
– Instruction cache, prefetches a few instructions
– 8088 (8-bit external bus) used in the first IBM PC
• 80286
– 16 Mbyte memory addressable
– Up from 1 Mbyte
• 80386
– 32-bit
– Support for multitasking
Pentium Evolution (2)
• 80486
– Sophisticated, powerful cache and instruction pipelining
– Built-in maths co-processor
• Pentium
– Superscalar
– Multiple instructions executed in parallel
• Pentium Pro
– Increased superscalar organization
– Aggressive register renaming
– Branch prediction
– Data flow analysis
– Speculative execution
Pentium Evolution (3)
• Pentium II
– MMX technology
– Graphics, video & audio processing
• Pentium III
– Additional floating point instructions for 3D graphics
• Pentium 4
– Note Arabic rather than Roman numerals
– Further floating point and multimedia enhancements
Sea Change in Chip Design
• Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm² chip
• RISC II (1983): 32-bit, 5 stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm² chip
• 125 mm² chip, 0.065 micron CMOS = 2312 RISC II+FPU+Icache+Dcache
– RISC II shrinks to ~0.02 mm² at 65 nm
– Caches via DRAM or 1 transistor SRAM?
• Processor is the new transistor?
Problems with Sea Change
• Algorithms, Programming Languages, Compilers, Operating Systems, Architectures, Libraries, … not ready to supply Thread-Level Parallelism or Data-Level Parallelism for 1000 CPUs / chip
Problems with Sea Change
• Architectures not ready for 1000 CPUs / chip
– Unlike Instruction-Level Parallelism, cannot be solved by computer architects and compiler writers alone, but also cannot be solved without participation of architects
Problems with Sea Change
• This edition of our course and the 4th edition of the textbook “Computer Architecture: A Quantitative Approach” explore the shift from Instruction-Level Parallelism to Thread-Level Parallelism / Data-Level Parallelism
Measurement and Evaluation
Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems
[Figure: cycle between Design and Analysis]
Measurement and Evaluation
[Figure labels: Creativity; Cost/Performance Analysis; Good Ideas, Mediocre Ideas, Bad Ideas]
Note: the ratio commonly used in English, cost/performance, is the inverse of the one commonly used in Chinese, performance/price (性能/价格)!
Performance and Cost
• “X is n times faster than Y” means
$\frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$
Amdahl’s Law
• Speedup = (Performance for entire task using the enhancement) / (Performance for entire task without using the enhancement)
• Speedup = (Execution time for entire task without using the enhancement) / (Execution time for entire task using the enhancement)
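Amdahl’s Law
Depends on Two Factors
• Fraction enhanced
– The fraction of the computation time in the original machine that can be converted to take advantage of the enhancement
– (Time taken by the part that can be enhanced) / (Execution time of the entire task before the enhancement) < 1
– Example: if the entire task takes 60 seconds before the enhancement and the part that can be enhanced takes 20 seconds, then Fraction enhanced = 20/60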
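Amdahl’s Law
Depends on Two Factors
• Speedup enhanced
– The improvement gained by the enhanced execution mode
– (Execution time of the enhanced part before the enhancement) / (Execution time of the enhanced part after the enhancement) > 1
– Example: if the enhanced part takes 5 seconds before the enhancement and 2 seconds after it, then Speedup enhanced = 5/2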
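Conclusions from Amdahl’s Law (1)
$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$
Here Fraction_enhanced / Speedup_enhanced
= [(time taken by the part that can be enhanced) / (execution time of the entire task before the enhancement)] / [(execution time of the enhanced part before the enhancement) / (execution time of the enhanced part after the enhancement)]
= (execution time of the enhanced part after the enhancement) / (execution time of the entire task before the enhancement)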
Conclusions from Amdahl’s Law (2)
$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$
From conclusion (1):
Speedup overall = 1 / [(1 - Fraction enhanced) + (Fraction enhanced / Speedup enhanced)]
Example of Amdahl’s Law (1)
• Floating point instructions improved to run 2X; but only 10% of actual instructions are FP
ExTime_new = ExTime_old × (0.9 + 0.1/2) = 0.95 × ExTime_old
Speedup_overall = 1 / 0.95 = 1.053
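The formula from the preceding slides is short enough to check in code. The sketch below reproduces the floating-point example above; the function name `speedup_overall` is chosen here for illustration, and the second call, which combines the 20/60 fraction and the 5/2 speedup from the two definition slides into one scenario, is an assumption of this example rather than something the slides state.

```python
def speedup_overall(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Amdahl's Law: 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Slide example: FP runs 2X faster, FP accounts for 10% of the original time.
print(f"FP example: {speedup_overall(0.10, 2.0):.3f}")

# Assumed combination of the two definition-slide examples: F = 20/60, S = 5/2.
print(f"F = 20/60, S = 5/2: {speedup_overall(20 / 60, 5 / 2):.3f}")

# Even an enormous speedup of the enhanced part is capped at 1 / (1 - F).
print(f"Upper bound for F = 0.10: {1.0 / (1.0 - 0.10):.3f}")
```

The last line shows why the slides stress the fraction: with only 10% of the time enhanced, no speedup of that part can push the overall gain past about 1.11.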
Example of Amdahl’s Law (2)
Example of Amdahl’s Law (3)
CPU Time
CPU Time = CPU clock cycles for a program / Clock rate
or
CPU Time = CPU clock cycles for a program × Clock cycle time
Cycles Per Instruction
(Throughput)
“Average Cycles per Instruction”
CPI = (CPU Time * Clock Rate) / Instruction Count
= Cycles / Instruction Count
CPU Time = Instruction Count * CPI * Clock cycle Time
= Instruction Count * CPI / Clock Rate
Cycles Per Instruction
(Throughput)
$\text{CPU clock cycles} = \sum_{j=1}^{n} \text{CPI}_j \times I_j$
$\text{CPU time} = \text{Cycle Time} \times \sum_{j=1}^{n} \text{CPI}_j \times I_j$
Cycles Per Instruction
(Throughput)
“Instruction Frequency”
$\text{CPI} = \sum_{j=1}^{n} \text{CPI}_j \times F_j$, where $F_j = \frac{I_j}{\text{Instruction Count}}$
Invest resources where time is spent!
Aspects of CPU Performance
$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$
                Inst Count    CPI    Clock Rate
Program         X
Compiler        X             (X)
Inst. Set       X             X
Organization                  X      X
Technology                           X
Example: Calculating CPI
Base Machine (Reg / Reg), Typical Mix
Op        Freq    Cycles    CPI(i)    (% Time)
ALU       50%     1         0.5       (33%)
Load      20%     2         0.4       (27%)
Store     10%     2         0.2       (13%)
Branch    20%     2         0.4       (27%)
Total                       1.5
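The table above can be recomputed directly from the instruction mix. This sketch uses only the frequencies and cycle counts shown on the slide; the dictionary layout and variable names are choices made for this example.

```python
# Instruction mix from the slide: operation -> (frequency, cycles per instruction).
mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}

# CPI = sum over instruction classes of CPI_j x F_j.
contributions = {op: freq * cycles for op, (freq, cycles) in mix.items()}
cpi = sum(contributions.values())

# Share of execution time spent in each class = CPI_j x F_j / CPI.
time_share = {op: c / cpi for op, c in contributions.items()}

print(f"CPI = {cpi:.1f}")
for op in mix:
    print(f"{op:<7} CPI(i) = {contributions[op]:.1f}  time = {time_share[op]:.0%}")
```

ALU operations are half of all instructions but only a third of the time, which is the sense of "invest resources where time is spent".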
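Performance Metrics
MIPS (Million Instructions Per Second)
= Instruction count / (Execution time × 10^6)
Drawbacks:
• Depends on the instruction set
• On the same machine, varies from program to program
• May vary inversely with actual performance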
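The last drawback is easy to demonstrate with numbers. In this sketch, the two builds of "the same program" and their instruction counts and run times are entirely hypothetical, invented to show how a higher MIPS rating can go with a longer execution time.

```python
def mips(instruction_count: float, execution_time_s: float) -> float:
    """MIPS = instruction count / (execution time x 10^6)."""
    return instruction_count / (execution_time_s * 1e6)


# Hypothetical builds of one program on one machine.
# Build A executes many simple instructions; build B executes fewer, slower ones.
build_a = {"instructions": 10e9, "time_s": 5.0}
build_b = {"instructions": 4e9, "time_s": 4.0}

mips_a = mips(build_a["instructions"], build_a["time_s"])
mips_b = mips(build_b["instructions"], build_b["time_s"])

print(f"Build A: {mips_a:.0f} MIPS, {build_a['time_s']} s")
print(f"Build B: {mips_b:.0f} MIPS, {build_b['time_s']} s")
```

Build A has twice the MIPS rating yet finishes later, so execution time, not MIPS, is the number to compare.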
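Performance Metrics
MFLOPS (Million Floating-Point Operations Per Second)
= Number of floating-point operations in the program / (Execution time × 10^6)
Advantage: allows different machines to be compared
Drawbacks:
• Does not reflect overall performance
• Depends on the type of floating-point operations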
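Performance Metrics
• Benchmark programs
– Real applications
– Kernels
– Toy benchmarks
– Synthetic benchmarks
The only consistent and reliable measure of performance is the execution time of real programs.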
Benchmark Suites
• Desktop
– SPEC CPU2006: 12 integer, 17 floating-point
– SPECviewperf, SPECapc: graphics benchmarks
• Server
– SPEC CPU2006: running multiple copies, SPECrate
– SPECSFS: for NFS performance
– SPECWeb: Web server benchmark
– TPC-x: measure transaction-processing, queries, and decision making database applications
• Embedded Processor
– New area
– EEMBC: EDN Embedded Microprocessor Benchmark Consortium
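Performance Comparison
• Execution times of two programs on three computers
              Computer A    Computer B    Computer C
Program 1     1 s           10 s          20 s
Program 2     1000 s        100 s         20 s
Total time    1001 s        110 s         40 s
• Total execution time: a consistent measure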
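Performance Comparison
• Average execution time
– The arithmetic mean of the execution times
$A_m = \frac{1}{n} \sum_{i=1}^{n} T_i$
• where T_i is the execution time of the i-th program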
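Performance Comparison
• Harmonic mean of execution rates
$H_m = \frac{1}{A_m} = \frac{n}{\sum_{i=1}^{n} \frac{1}{R_i}}$
• where R_i = 1/T_i, and T_i is the execution time of the i-th program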
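Performance Comparison
• Weighted execution time
– The weighted arithmetic mean
$A_m = \sum_{i=1}^{n} W_i \times T_i$
• where W_i is the weight of the i-th program in the workload and T_i is the execution time of that program.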
Performance Comparison
• Geometric Mean
$\sqrt[n]{\prod_{i=1}^{n} \text{Execution Time Ratio}_i}$
– Execution time ratio is normalized to a base machine
– Is used to compute SPECrate
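The different means can be compared on the two-program, three-computer table from the earlier slide. This is a sketch under stated assumptions: the function names are made up for this example, and normalizing to computer A as the base machine is an arbitrary choice that the slides do not make.

```python
import math

# Execution times in seconds from the earlier table: [program 1, program 2].
times = {
    "A": [1.0, 1000.0],
    "B": [10.0, 100.0],
    "C": [20.0, 20.0],
}


def arithmetic_mean(ts):
    """A_m = (1/n) * sum of T_i."""
    return sum(ts) / len(ts)


def harmonic_mean_rate(ts):
    """H_m = n / sum(1/R_i), with R_i = 1/T_i; this equals 1 / A_m."""
    rates = [1.0 / t for t in ts]
    return len(rates) / sum(1.0 / r for r in rates)


def geometric_mean_ratio(ts, base):
    """n-th root of the product of execution time ratios against a base machine."""
    ratios = [t / b for t, b in zip(ts, base)]
    return math.prod(ratios) ** (1.0 / len(ratios))


for name, ts in times.items():
    print(
        f"{name}: arithmetic mean = {arithmetic_mean(ts):.1f} s, "
        f"geometric mean ratio vs A = {geometric_mean_ratio(ts, times['A']):.3f}"
    )
```

With A as the base, the geometric mean rates B as equal to A even though B's total time is about nine times shorter, which is one reason the slides call total execution time the consistent measure.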
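Assignment 2
• Read English-language literature on the Power Wall, the ILP Wall, and the Memory Wall
• Requirements:
– Each student reads at least one English-language paper;
– Write a reading report in the style of an extended abstract (Chinese or English), citing the source of the paper;
– Submit the paper you read together with the reading report (file name: "作业2+姓名", i.e., Assignment 2 + your name)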
Assignment 3
• 5th edition
– Case Studies 1.4
• The full problem statement is on the next page