Advanced Computer Architecture - Peking University Microprocessor Research and Development Center


Advanced Computer Architecture
Introduction (Lecture 1)
Cheng Xu
February 21, 2011
Peking University Microprocessor Research and Development Center
Textbook and Instructors
Main textbook:
Computer Architecture: A Quantitative Approach,
4th Edition (Oct. 2006), Patterson and Hennessy
Instructors:
Cheng Xu, Peking University Microprocessor Research and Development Center
Liu Xianhua, Peking University Microprocessor Research and Development Center
Teaching assistants: Cai Mingchen, Pan Xingxing, Liu Xiaobing
Time and place: Mondays, 15:10-18:00, Room 102, Classroom Building No. 2
http://mprc.pku.edu.cn
Teaching Goals of "Advanced Computer Architecture"
Learn and master the design techniques, machine structures, technology factors, and evaluation methods that will determine what twenty-first-century computers look like.
(Diagram: computer architecture -- instruction set design, organization, hardware -- shaped by technology, parallelism, programming languages, applications, operating systems, history, measurement and evaluation, and the hardware/software interface (ISA).)
The Place of This Course in the Curriculum
(Diagram: prerequisites feeding into Advanced Computer Architecture -- computing and application fundamentals (data structures, C programming), computer organization and architecture (basic logic units, processor fundamentals, digital logic), operating systems (memory management, scheduling, concurrency), and compilers (code generation, optimization); the course leads onward to parallel computer architecture.)
• What do computer applications need?
• What functional support does the operating system require?
• Which features can an optimizing compiler exploit and implement?
• What kinds of machines can we build?
• What will future computers look like?
° Computer architecture researchers must have broad professional knowledge!
1. How to build it -- the concrete details: knowing what is so!
2. Analysis + evaluation: knowing why it is so!
Charles Babbage 1791-1871
Lucasian Professor of Mathematics,
Cambridge University, 1827-1839
Charles Babbage
° Difference Engine, 1823
° Analytic Engine, 1833
• The forerunner of the modern digital computer!
Application
– Mathematical Tables – Astronomy
– Nautical Tables – Navy
Background
– Any continuous function can be approximated by a polynomial -- Weierstrass
Technology
– mechanical: gears, Jacquard's loom, simple calculators
Difference Engine
A machine to compute mathematical tables
Weierstrass:
• Any continuous function can be approximated by a polynomial
• Any polynomial can be computed from difference tables
An example:
f(n) = n^2 + n + 41
d1(n) = f(n) - f(n-1) = 2n
d2(n) = d1(n) - d1(n-1) = 2
so f(n) = f(n-1) + d1(n) = f(n-1) + (d1(n-1) + 2)
All you need is an adder!

n     | 0  | 1  | 2  | 3  | 4
d2(n) |    |    | 2  | 2  | 2
d1(n) |    | 2  | 4  | 6  | 8
f(n)  | 41 | 43 | 47 | 53 | 61
Babbage's Difference Engine 1, 1832
Analytic Engine
1833: Babbage's paper was published
• conceived during a hiatus in the development of the difference engine
Inspiration: Jacquard looms
• looms were controlled by punched cards
- The set of cards with fixed punched holes dictated the pattern of weave => program
- The same set of cards could be used with different colored threads => numbers
1871: Babbage dies
• The machine remains unrealized. It is not clear whether the analytic engine could be built even today using only mechanical technology.
Babbage's Difference Engine 2 and Analytical Engine
1834: Babbage Analytical Engine
(Diagram: the Mill and the Store, fed by Operation Cards and Variable Cards (the program), with output to a printer and a card punch.)
Babbage Analytical Engine
• The Store: memory unit consisting of counter wheels
• The Mill: the arithmetic unit, capable of 4 operations; it operated on a pair of registers and stored the result in another register in the store
• Operation Cards: specified one of the four operations
• Variable Cards: specified the memory location to be used
• Output: printer or punch
Analytic Engine
The first conception of a general-purpose computer:
1. The store, in which all variables to be operated upon, as well as all those quantities which have arisen from the results of the operations, are placed.
2. The mill, into which the quantities about to be operated upon are always brought.
The program: Operation, variable1, variable2, variable3
An operation in the mill required feeding two punched cards and producing a new punched card for the store.
An operation to alter the sequence was also provided!
The first programmer
Ada Byron, aka "Lady Lovelace", 1815-52
Ada's tutor was Babbage himself!
1937, Alan Turing
While not using the practical technology of the era, Alan Turing developed the idea of a "Universal Machine" capable of executing any describable algorithm, forming the basis for the concept of "computability".
Perhaps more importantly, Turing's ideas differed from those of others who were solving arithmetic problems by introducing the concept of "symbol processing".
The First General-Purpose Electronic Computer: ENIAC
Electronic Numerical Integrator and Calculator
February 14, 1946
J. Presper Eckert & John Mauchly
Moore School, University of Pennsylvania
Size: 80 feet long, 8.5 feet high
18,000 vacuum tubes
5000 additions/sec
The world's first general-purpose electronic computer; conditional jumps and programmability distinguished it from earlier machines.
Used for computing artillery firing tables.
Accumulator
° 28 vacuum tubes
A WW-2 effort. ENIAC's application: ballistic calculations
angle = f(location, tail wind, cross wind, air density, temperature, weight of shell, propellant charge, ...)
ENIAC was NOT a "stored program" device
° For each problem, someone analyzed the arithmetic processing needed and prepared wiring diagrams for the operators ("computors") to use when wiring the machine
° The process was time consuming and error prone
° Cleaning personnel often knocked cables out of their place and just put them back somewhere
Wiring the machine
Electronic Discrete Variable Automatic Computer (EDVAC)
° ENIAC's programming system was external
• Sequences of instructions were executed independently of the results of the calculation
• Human intervention was required to take instructions "out of order"
° Eckert, Mauchly, John von Neumann and others designed EDVAC (1944) to solve this problem
• The solution was the stored program computer => "a program can be manipulated as data"
° "First Draft of a Report on the EDVAC" was published in 1945, but bore only von Neumann's name!
• In 1973 the federal court in Minneapolis attributed the honor of inventing the computer to John Atanasoff
The von Neumann Machine, 1946
Stored Program Computer
IAS (Institute for Advanced Study) Computer
(Diagram: Main Memory holding the program, connected to the Arithmetic Logic Unit, Control Unit, and I/O Equipment.)
The stored-program idea: the instructions that make up a program can be placed in memory ahead of time, just like data, and the computer then fetches and executes them one by one on its own.
This idea naturally leads to branch instructions and to the notion of modifying the address part of an instruction, so that a sequence of instructions can automatically be executed many times, each time meaningfully.
The First Full-Scale, Operational, Stored-Program Computer: EDSAC
EDSAC: Electronic Delay Storage Automatic Calculator
Maurice Wilkes, Cambridge University
EDSAC began operation in 1949. Its accumulator-based structure and its instruction set design strongly influenced machine design for years afterward.
Bell Labs
° 1940: Ohl develops the PN junction
° 1945: Shockley's laboratory established
° 1947: Bardeen and Brattain create the point-contact transistor (U.S. Patent 2,524,035)
Diagram from patent application
Bell Labs
° 1951: Shockley develops a junction transistor manufacturable in quantity (U.S. Patent 2,623,105)
Diagram from patent application
The Integrated Circuit
° 1959: Jack Kilby, working at TI, dreams up the idea of a monolithic "integrated circuit"
• Components connected by hand-soldered wires and isolated by "shaping"; PN diodes used as resistors (U.S. Patent 3,138,743)
Diagram from patent application
Integrated Circuits
° 1961: TI and Fairchild introduce the first logic ICs ($50 in quantity)
° 1962: RCA develops the first MOS transistor
(Photos: Fairchild bipolar RTL flip-flop; RCA 16-transistor MOSFET IC)
The Microprocessor
° 1971: Intel introduces the 4004
• A general-purpose programmable computer instead of a custom chip for a Japanese calculator company
Microprocessor Performance
(Chart: clock rate and MIPS across Intel generations)
4004: 108 kHz, 0.06 MIPS
8080: 2 MHz, 0.64 MIPS
8088: 8 MHz, 0.75 MIPS
Intel386 SX: 33 MHz, 2.9 MIPS
Intel486 DX: 50 MHz, 41 MIPS
Pentium: 233 MHz
Intel Celeron: 1.3 GHz
Pentium 4: 3 GHz
Sea Change in Chip Design
° Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm2 chip
° RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm2 chip
° A 125 mm2 chip in 0.065 micron CMOS = 2312 copies of RISC II + FPU + Icache + Dcache
• RISC II shrinks to ~0.02 mm2 at 65 nm
• Is the processor the new transistor?
Multicore
IBM Power4, 2001; Sun T-1 (Niagara), 2005; AMD true quad-core die, 2007
° Small number of cores, shared memory
° Some systems have multithreaded cores
° Trend toward simplicity in cores (e.g., no branch prediction)
° Multiple threads share resources (L2 cache, maybe FP units)
° Deployment in the embedded market as well as other sectors
Cell from IBM and Sony
Intel 80-Core Chip (2007)
• 80 processing cores
• 1 teraflop
• 10 billion operations per watt
• 3.1 GHz clock
• 300 mm2 die area (13.75 mm x 22 mm)
• Each CPU core is connected 1-to-1 to its own memory, with 256 MB/s of memory bandwidth per core
• 32 MB of on-chip static RAM
• Total per-chip memory bandwidth reaches 1 TB/s
IBM POWER7 (2010)
A Brief History of CPU Technology
• Charles Babbage's engines (1823)
• Turing machine (1937)
• ENIAC (1946), EDSAC (1949)
• CPU
• Microprocessor
• General-purpose microprocessor vs. special CPU for HPC
• Multicore, manycore, or ...
Improvements in Internet Access
(Timeline figure, 1985 to 2005: standalone PCs (email), then connected PCs (WWW), then information appliances, hand-helds, cellphones and phone access, set-top boxes and network computers (NCs), and game consoles. Sources: Network Computer Inc. & IDC)
The Internet: A Network of Networks
• Public and private wide-area networks
• Local-area and home networks
• System-area networks
• Body-area networks
• Networks in cars, airplanes, and so on
(Figure: content and platforms at every scale; cyberspace as the digital interface between people and the rest of the physical world.)
Cyberspace: An Upward Spiral
(Figure: computing and communication spiral upward through the digitization of programs, content, and messages, toward content-based services!)
The Post-PC Era (the PC+ Era)
Two technologies drive the post-PC era:
1) Mobile consumer devices
– e.g., next-generation PDAs, next-generation mobile communication devices, wearable computers
2) The infrastructure supporting those devices
– e.g., a new generation of Big Fat Web Servers and Database Servers
The New Wave: Microprocessors Will Be Everywhere
Source: Richard Newton
Embedded Microprocessors
What?
A programmable processor whose programming interface is not accessible to the end user of the product.
The only user interaction is through the actual application.
Examples:
- Sharp PDAs are encapsulated products with fixed functionality
- 3COM Palm Pilots were originally intended as embedded systems; opening up the programmer's interface turned them into more generic computer systems.
Some interesting numbers
° The Intel 4004 was intended for an embedded application (a calculator)
° Of today's microprocessors:
• 95% go into embedded applications
• 50% of microprocessor revenue stems from embedded systems
SH-3/SH-4 (Hitachi): best-selling RISC microprocessor (1997)
ARM: best-selling embedded microprocessor (2001-)
° Often focused on a particular application area
• Microcontrollers
• DSPs
• Media processors
• Graphics processors
• Network and communication processors
Different Evaluation Criteria
° Components of cost:
• Area of die / yield
• Code density (memory is the major part of die size)
• Packaging
• Design effort
° Flexibility:
• Programming cost
• Time-to-market
• Reusability
° Power
° Performance as a functionality constraint ("Just-in-Time Computing")
VLSI Process Scaling Is Accelerating (Gate Length)
The Chip Fabrication Flow
The Cost of Integrated Circuits

Die cost = Wafer cost / (Dies per wafer x Die yield)

Dies per wafer = [pi x (Wafer diameter / 2)^2] / Die area
               - [pi x Wafer diameter] / sqrt(2 x Die area)

Die yield = Wafer yield x [1 + (Defects per unit area x Die area) / alpha]^(-alpha)

If alpha = 3, die cost grows roughly as the fourth power of die size.
Other Costs

IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield

Packaging cost depends on the number of pins and on heat-dissipation requirements.

Chip        | Die cost | Pins | Package type | Package cost | Test & Assembly | Total
386DX       | $4       | 132  | QFP          | $1           | $4              | $9
486DX2      | $12      | 168  | PGA          | $11          | $12             | $35
PowerPC 601 | $53      | 304  | QFP          | $3           | $21             | $77
HP PA 7100  | $73      | 504  | PGA          | $35          | $16             | $124
DEC Alpha   | $149     | 431  | PGA          | $30          | $23             | $202
SuperSPARC  | $272     | 293  | PGA          | $20          | $34             | $326
Pentium     | $417     | 273  | PGA          | $19          | $37             | $473
Cost/Performance
What is the relationship of cost to price?
° Component costs
° Direct costs (add 25% to 40%): recurring costs such as labor, purchasing, scrap, warranty
° Gross margin (add 82% to 186%): nonrecurring costs such as R&D, marketing, sales, equipment maintenance, rental, financing cost, pretax profits, taxes
° Average discount to get list price (add 33% to 66%): volume discounts and/or retailer markup
(Figure: stacked breakdown of list price vs. average selling price: component cost 15% to 33%, direct cost 6% to 8%, gross margin 34% to 39%, average discount 25% to 40% of list price.)
iPad: Apple's profit comes from margins in hardware
(Figure: breakdown of the $499 retail price: cost of materials and manufacturing $230; cost of sales (approx. 30%) $70; average industry margin (approx. 30%) $90; Apple's additional margin $110, for a total Apple margin of about 40%. Source: iSuppli)
Power Density Keeps Getting Worse
(Chart: power density in Watts/cm2 vs. feature size from 1.5 um down to 0.07 um for the i386, i486, Pentium, Pentium Pro, Pentium II, and Pentium III, with reference lines for a hot plate, a rocket nozzle, a nuclear reactor, and the Sun's surface.)
Power density surpassed that of a hot plate at 0.5 um; at this rate it will not take long to reach nuclear-reactor levels.
Optimizing Energy Efficiency
° High-performance general-purpose microprocessors (e.g., Pentiums)
• 10-100 Watts, 100-1000 MIPS = 0.01 MIPS/mW
° Energy-efficient general-purpose microprocessors (e.g., StrongARM)
• 0.5 Watts, 160 MIPS = 0.3 MIPS/mW
° Energy-efficient application-specific processors (e.g., MPEG2)
• 100 MOPS/mW
Switching Energy
Multiprocessors and Multicomputer Clusters
MIMD machines divide into:
• Multiprocessors (single address space with shared memory)
- UMA: PVP (Cray T90); SMP (Intel SHV, Sun, SGI, IBM)
- NUMA: COMA (KSR-1); CC-NUMA (Stanford Dash, SGI Origin, Sequent NUMA, HP Exemplar); NCC-NUMA (Cray T3E, etc.); DSM or SC-NUMA (TreadMarks, Wind Tunnel, Shrimp)
• Multicomputers (multiple address spaces with no remote memory access: NORMA)
- Cluster (IBM SP2, TruCluster, Solaris MC, Tandem Himalaya, Wolfpack, NOW, PearlCluster)
- MPP (Intel Option Red, IBM Blue Pacific, SGI/Cray Blue Mountain)
Nine Computer Price Tiers (2000)
$1: embeddables, e.g., greeting cards
$10: wrist watch & wallet computers
$100: pocket/palm computers
$1,000: portable computers
$10,000: personal computers (desktop)
$100,000: departmental computers (closet)
$1,000,000: site computers (glass house)
$10,000,000: regional computers (glass castle)
$100,000,000: national centers
Super server: costs more than $100,000
"Mainframe": costs more than $1 million -- an array of processors, disks, tapes, comm ports
What is Computer Architecture?
Application
(a gap too large to bridge in one step, but there are exceptions, e.g., the magnetic compass)
Physics
In its broadest definition, computer architecture is the design of the abstraction layers that allow us to implement information processing applications efficiently using available manufacturing technologies.
Abstraction Layers in Modern Systems
Application
Algorithm
Programming Language
Operating System/Virtual Machine
Instruction Set Architecture (ISA)
Microarchitecture
Gates/Register-Transfer Level (RTL)
Circuits
Devices
Physics
(Annotations from the figure: the layers around the ISA were the original domain of the computer architect ('50s-'80s); recent computer architecture ('90s) widened that domain, adding concerns such as reliability and power; parallel computing, security, and other whole-stack issues mark the reinvigoration of computer architecture from the mid-2000s onward.)
The End of the Uniprocessor Era
Single biggest change in the history of
computing systems
Conventional Wisdom in Computer Architecture
° Old Conventional Wisdom: Power is free, transistors expensive
° New Conventional Wisdom: the "Power wall": power expensive, transistors free (can put more on a chip than you can afford to turn on)
° Old CW: Sufficient gains from increasing Instruction-Level Parallelism via compilers and hardware innovation (out-of-order, speculation, VLIW, ...)
° New CW: the "ILP wall": law of diminishing returns on more HW for ILP
° Old CW: Multiplies are slow, memory access is fast
° New CW: the "Memory wall": memory slow, multiplies fast (200 clock cycles to DRAM memory, 4 clocks for multiply)
° Old CW: Uniprocessor performance 2X / 1.5 yrs
° New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
• Uniprocessor performance now 2X / 5(?) yrs
=> Sea change in chip design: multiple "cores" (2X processors per chip / ~2 years)
- More, simpler processors are more power efficient
Uniprocessor Performance
(Chart: performance relative to the VAX-11/780, 1978-2006; from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006.)
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present
Problems with Sea Change
° Algorithms, programming languages, compilers, operating systems, architectures, libraries, ... are not ready to supply Thread-Level Parallelism or Data-Level Parallelism for 1000 CPUs / chip
° Architectures are not ready for 1000 CPUs / chip
• Unlike Instruction-Level Parallelism, this cannot be solved by computer architects and compiler writers alone, but also cannot be solved without the participation of architects
° The 4th edition of the textbook "Computer Architecture: A Quantitative Approach" explores the shift from Instruction-Level Parallelism to Thread-Level Parallelism / Data-Level Parallelism
Instruction Set Architecture: Critical Interface
(Figure: the instruction set as the interface between software above and hardware below.)
° Properties of a good abstraction:
• Lasts through many generations (portability)
• Used in many different ways (generality)
• Provides convenient functionality to higher levels
• Permits an efficient implementation at lower levels
Instruction Set Architecture
"... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation."
– Amdahl, Blaauw, and Brooks, 1964
-- Organization of programmable storage
-- Data types & data structures: encodings & representations
-- Instruction formats
-- Instruction (or operation code) set
-- Modes of addressing and accessing data items and instructions
-- Exceptional conditions
Example: MIPS
Programmable storage:
• 2^32 x bytes of memory
• 31 x 32-bit GPRs (R0 = 0)
• 32 x 32-bit FP registers (paired for DP)
• HI, LO, PC
Questions an ISA answers: data types? instruction format? addressing modes?
Arithmetic/logical:
Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU,
AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI,
SLL, SRL, SRA, SLLV, SRLV, SRAV
Memory access:
LB, LBU, LH, LHU, LW, LWL, LWR,
SB, SH, SW, SWL, SWR
Control (32-bit instructions on word boundaries):
J, JAL, JR, JALR,
BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL
Computer Architecture is an Integrated Approach
°What really matters is the functioning of the
complete system
• hardware, runtime system, compiler, operating
system, and application
• In networking, this is called the “End to End
argument”
°Computer architecture is not just about
transistors, individual instructions, or particular
implementations
• E.g., Original RISC projects replaced complex
instructions with a compiler + simple instructions
Computer Architecture is Design and Analysis
Defining an architecture is an iterative process:
-- Design: at all levels of the computer system, searching the space of possible designs
-- Analysis: cost/performance analysis that sorts good ideas from mediocre and bad ones
(Figure: a design cycle in which creativity generates candidate designs and analysis filters them.)
Topics in Computer Architecture
Input/output and storage: disks, WORM, tape; RAID; emerging new technologies; interleaving; bus protocols; DRAM
Memory hierarchy: L1 cache, L2 cache; coherence, bandwidth, latency; addressing modes, protection, exception handling; VLSI
Instruction set architecture; pipelining and instruction-level parallelism: pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, DSP
Topics in Computer Architecture (continued)
Multiprocessors; networks and interconnect: processor-memory-switch (P-M-S) organizations; network interfaces; interconnection networks; topology, routing, bandwidth, latency, reliability; shared memory, message passing, data parallelism
Course Themes
Every computer scientist and engineer should understand the inner workings of computers.
° Why?
• Some people will design and build computers
• Everyone will use computers: the more you understand, the more effectively you can use them
° User-centered computer systems
• This course will enable you to better enlighten users
• Consider the whole computer system: processor, memory, I/O, storage systems, networks
° Useful takeaways
• How to make programs run faster
• What kinds of hardware support applications need
• How technology and architecture will change
What Computer Architecture Brings to the Table
° Quantitative principles of design:
1. Take advantage of parallelism
2. Principle of locality
3. Focus on the common case
4. Amdahl's Law
5. The processor performance equation
° Careful, quantitative comparisons:
• Define, quantify, and summarize relative performance
• Define and quantify relative cost
• Define and quantify dependability
• Define and quantify power
° A culture of anticipating and exploiting advances in technology
° A culture of well-defined interfaces that are carefully implemented and thoroughly checked
° Other fields often borrow ideas from architecture
1) Taking Advantage of Parallelism
° Increasing throughput of server computers via multiple processors or multiple disks
° Detailed HW design
• Carry-lookahead adders use parallelism to speed up computing sums from linear to logarithmic in the number of bits per operand
• Multiple memory banks searched in parallel in set-associative caches
° Pipelining: overlap instruction execution to reduce the total time to complete an instruction sequence
• Not every instruction depends on its immediate predecessor => executing instructions completely/partially in parallel is possible
• Classic 5-stage pipeline: 1) Instruction Fetch (Ifetch), 2) Register Read (Reg), 3) Execute (ALU), 4) Data Memory Access (Dmem), 5) Register Write (Reg)
Pipelined Instruction Execution
(Figure: several instructions in flight across clock cycles 1-7; in each cycle an instruction occupies one stage (Ifetch, Reg, ALU, DMem, Reg), and a new instruction starts every cycle.)
Limits to pipelining
° Hazards prevent the next instruction from executing during its designated clock cycle
• Structural hazards: attempt to use the same hardware to do two different things at once
• Data hazards: instruction depends on the result of a prior instruction still in the pipeline
• Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
2) The Principle of Locality
° The Principle of Locality:
• Programs access a relatively small portion of the address space at any instant of time
° Two different types of locality:
• Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
• Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
° For the last 30 years, HW has relied on locality for memory performance
(Figure: processor P, cache $, main memory MEM)
Levels of the Memory Hierarchy

Level       | Capacity       | Access time             | Cost       | Transfer unit ("staging")                   | Managed by
Registers   | 100s of bytes  | 300-500 ps (0.3-0.5 ns) |            | instr. operands, 1-8 bytes                  | program/compiler
L1/L2 cache | 10s-100s of KB | ~1 ns - ~10 ns          | $1000s/GB  | blocks, 32-64 bytes (L1), 64-128 bytes (L2) | cache controller
Main memory | GBs            | 80 ns - 200 ns          | ~$100/GB   | pages, 4K-8K bytes                          | OS
Disk        | 10s of TB      | 10 ms (10,000,000 ns)   | ~$1/GB     | files, MBs                                  | user/operator
Tape        | infinite       | sec-min                 | ~$1/GB     |                                             |

Levels get faster toward the upper level (registers) and larger toward the lower level (tape).
3) Focus on the Common Case
° Common sense guides computer design
• Since we're engineering, common sense is valuable
° In making a design trade-off, favor the frequent case over the infrequent case
• E.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first
• E.g., if a database server has 50 disks per processor, storage dependability dominates system dependability, so optimize it first
° The frequent case is often simpler and can be done faster than the infrequent case
• E.g., overflow is rare when adding 2 numbers, so improve performance by optimizing the more common case of no overflow
• May slow down overflow, but overall performance is improved by optimizing for the normal case
° What is the frequent case, and how much can performance improve by making it faster? => Amdahl's Law
4) Amdahl's Law

ExTime_new = ExTime_old x [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Speedup_overall = ExTime_old / ExTime_new
               = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Best you could ever hope to do:

Speedup_maximum = 1 / (1 - Fraction_enhanced)
Amdahl's Law example
° New CPU 10X faster
° I/O-bound server, so 60% of time is spent waiting for I/O

Speedup_overall = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
               = 1 / [(1 - 0.4) + 0.4 / 10]
               = 1 / 0.64
               = 1.56

° Apparently, it's human nature to be attracted by 10X faster, versus keeping in perspective that it's just 1.6X faster
5) Processor Performance Equation: CPI
"Iron Law of Performance"

CPU time = Seconds / Program
         = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

What each factor depends on:

                | Inst count | CPI | Clock rate
Program         | X          |     |
Compiler        | X          | (X) |
Instruction set | X          | X   |
Organization    |            | X   | X
Technology      |            |     | X
Metrics used to Compare Designs
°Cost
• Die cost and system cost
°Execution Time
• average and worst-case
• Latency vs. Throughput
°Energy and Power
• Also peak power and peak switching current
°Reliability
• Resiliency to electrical noise, part failure
• Robustness to bad software, operator error
°Maintainability
• System administration costs
°Compatibility
• Software costs dominate
Cost of Processor
° Design cost (Non-recurring Engineering Costs, NRE)
• dominated by engineer-years
• also mask costs (exceeding $1M per spin)
° Cost of die
• die area
• die yield (maturity of manufacturing process, redundancy
features)
• cost/size of wafers
• die cost ~= f(die area^4) with no redundancy
° Cost of packaging
• number of pins (signal + power/ground pins)
• power dissipation
° Cost of testing
• built-in test features?
• logical complexity of design
• choice of circuits (minimum clock rates, leakage currents,
I/O drivers)
Architect affects all of these
System-Level Cost Impacts
°Power supply and cooling
°Support chipset
°Off-chip SRAM/DRAM/ROM
°Off-chip peripherals
What is Performance?
°Latency (or response time or execution time)
• time to complete one task
°Bandwidth (or throughput)
• tasks completed per unit time
Definition: Performance
° Performance is in units of things per second
• bigger is better
° If we are primarily concerned with response time:

Performance(X) = 1 / ExTime(X)

"X is n times faster than Y" means

n = ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y)
Types of Benchmark
°Toy Programs
• Small, easy to port. Output often known before program is
run, e.g., Nqueens, Bubblesort, Towers of Hanoi
°Kernels
• Common subroutines in real programs, e.g., matrix
multiply, FFT, sorting, Livermore Loops, Linpack
°Synthetic Benchmarks
• Designed to have same mix of operations as real
workloads, e.g., Dhrystone, Whetstone
°Simplified Applications
• Extract main computational skeleton of real application
to simplify porting, e.g., NAS parallel benchmarks, TPC
°Real Applications
• Things people actually use their computers for, e.g., car
crash simulations, relational databases, Photoshop, Quake
Performance: What to measure
° Usually rely on benchmarks vs. real workloads
° To increase predictability, collections of benchmark applications, called benchmark suites, are popular
° SPECCPU: popular desktop benchmark suite
• CPU only, split between integer and floating-point programs
• SPECint2000 has 12 integer programs, SPECfp2000 has 14 floating-point programs
• SPECCPU2006 to be announced Spring 2006
• SPECSFS (NFS file server) and SPECWeb (web server) added as server benchmarks
° The Transaction Processing Council measures server performance and cost-performance for databases
• TPC-C: complex queries for online transaction processing
• TPC-H: models ad hoc decision support
• TPC-W: a transactional web benchmark
• TPC-App: application server and web services benchmark
Summarizing Performance

System | Rate (Task 1) | Rate (Task 2)
A      | 10            | 20
B      | 20            | 10

Which system is faster?
… depends who's selling

Average throughput:
System | Rate (Task 1) | Rate (Task 2) | Average
A      | 10            | 20            | 15
B      | 20            | 10            | 15

Throughput relative to B:
System | Rate (Task 1) | Rate (Task 2) | Average
A      | 0.50          | 2.00          | 1.25
B      | 1.00          | 1.00          | 1.00

Throughput relative to A:
System | Rate (Task 1) | Rate (Task 2) | Average
A      | 1.00          | 1.00          | 1.00
B      | 2.00          | 0.50          | 1.25
Summarizing Performance over a Set of Benchmark Programs

Arithmetic mean of execution times t_i (in seconds): (1/n) x SUM_i t_i
Harmonic mean of execution rates r_i (MIPS/MFLOPS): n / [SUM_i (1/r_i)]

° Both are equivalent to a workload where each program is run the same number of times
° Can add weighting factors to model other workload distributions
Normalized Execution Time and Geometric Mean
° Measure speedup relative to a reference machine: ratio = t_Ref / t_A
° Average the time ratios using the geometric mean: (PROD_i ratio_i)^(1/n)
° Insensitive to the machine chosen as reference
° Insensitive to the run time of individual benchmarks
° Used by SPEC89, SPEC92, SPEC95, …, SPEC2006
SPEC: System Performance Evaluation Cooperative
° First round, 1989
• 10 programs (6 FP + 4 integer) yielding a single number (SPECmarks)
° Second round, 1992
• SPECint92 (6 integer) and SPECfp92 (14 FP)
- Compiler switches were unlimited. DEC 4000 Model 610, March 1993:
spice: unix.c: /def=(sysv,has_bcopy,"bcopy(a,b,c)=memcpy(b,a,c)")
wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
° Third round, 1995
• A new set of programs: SPECint95 (8 integer) and SPECfp95 (10 FP)
• Benchmarks "valid" for three years
• A single switch setting for all programs: SPECint_base95, SPECfp_base95
The First Round of SPEC
° 1989: first round; 10 programs, summarizing performance with a single number (6 FP + 4 integer), relative to the VAX 11/780
° One of the programs spent 99% of its time in a single line of code
° A new front-end compiler could improve its performance dramatically
(Chart: SPEC performance per benchmark: gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv.)
SPEC95

Benchmark | Description
go        | Artificial intelligence; plays the game of Go
m88ksim   | Motorola 88k chip simulator; runs test program
gcc       | The GNU C compiler generating SPARC code
compress  | Compresses and decompresses a file in memory
li        | Lisp interpreter
ijpeg     | Graphic compression and decompression
perl      | Manipulates strings and prime numbers in the special-purpose programming language Perl
vortex    | A database program
tomcatv   | A mesh generation program
swim      | Shallow water model with 513 x 513 grid
su2cor    | Quantum physics; Monte Carlo simulation
hydro2d   | Astrophysics; hydrodynamic Navier-Stokes equations
mgrid     | Multigrid solver in 3-D potential field
applu     | Parabolic/elliptic partial differential equations
turb3d    | Simulates isotropic, homogeneous turbulence in a cube
apsi      | Solves problems regarding temperature, wind velocity, and distribution of pollutants
fpppp     | Quantum chemistry
wave5     | Plasma physics; electromagnetic particle simulation
SPEC CPU2000

Benchmark | Type    | Source | Description
gzip      | Integer | C      | Compression using the Lempel-Ziv algorithm
vpr       | Integer | C      | FPGA circuit placement and routing
gcc       | Integer | C      | The GNU C compiler generating optimized machine code
mcf       | Integer | C      | Combinatorial optimization of public transit scheduling
crafty    | Integer | C      | Chess-playing program
parser    | Integer | C      | Syntactic English language parser
eon       | Integer | C++    | Graphics visualization using probabilistic ray tracing
perlbmk   | Integer | C      | Perl (an interpreted string-processing language) with four input scripts
gap       | Integer | C      | A group theory application package
vortex    | Integer | C      | An object-oriented database system
bzip2     | Integer | C      | A block-sorting compression algorithm
twolf     | Integer | C      | Timberwolf: a simulated annealing algorithm for VLSI place and route
wupwise   | FP      | F77    | Lattice gauge theory model of quantum chromodynamics
swim      | FP      | F77    | Solves shallow water equations using finite difference equations
mgrid     | FP      | F77    | Multigrid solver over a three-dimensional field
applu     | FP      | F77    | Parabolic and elliptic partial differential equation solver
mesa      | FP      | C      | Three-dimensional graphics library
galgel    | FP      | F90    | Computational fluid dynamics
art       | FP      | C      | Image recognition of a thermal image using neural networks
equake    | FP      | C      | Simulation of seismic wave propagation
facerec   | FP      | C      | Face recognition using wavelets and graph matching
ammp      | FP      | C      | Molecular dynamics simulation of a protein in water
lucas     | FP      | F90    | Performs primality testing for Mersenne primes
fma3d     | FP      | F90    | Finite element modeling of crash simulation
sixtrack  | FP      | F77    | High-energy physics accelerator design simulation
apsi      | FP      | F77    | A meteorological simulation of pollution distribution
EEMBC

Benchmark type        | Number of kernels | Example benchmarks
Automotive/industrial | 16 | 6 microbenchmarks (arithmetic operations, pointer chasing, memory performance, matrix arithmetic, table lookup, bit manipulation), 5 automobile control benchmarks, and 5 filter or FFT benchmarks
Consumer              | 5  | 5 multimedia benchmarks (JPEG compress/decompress, filtering, and RGB conversions)
Networking            | 3  | Shortest-path calculation, IP routing, and packet flow operations
Office automation     | 4  | Graphics and text benchmarks (Bézier curve calculation, dithering, image rotation, text processing)
Telecommunications    | 6  | Filtering and DSP benchmarks (autocorrelation, FFT, decoder, encoder)
How to Summarize Performance
° The arithmetic mean (or weighted arithmetic mean) tracks execution time: SUM(Ti)/n or SUM(Wi*Ti)
° The harmonic mean (or weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: n/SUM(1/Ri) or n/SUM(Wi/Ri)
° To scale performance, normalized execution time is very convenient! E.g., the reference machine's time divided by the evaluated machine's time
° Note: do not take the arithmetic mean of normalized execution times; use the geometric mean instead!
° The geometric mean treats all improvements equally: program A going from 2 seconds down to 1 second counts the same as program B going from 2000 seconds down to 1000 seconds!
Why take the geometric mean of normalized values?

                | Time on A | Time on B | Normalized to A (A, B) | Normalized to B (A, B)
Program 1       | 1         | 10        | 1, 10                  | 0.1, 1
Program 2       | 1000      | 100       | 1, 0.1                 | 10, 1
Arithmetic mean | 500.5     | 55        | 1, 5.05                | 5.05, 1
Geometric mean  | 31.6      | 31.6      | 1, 1                   | 1, 1
Performance Evaluation
° For better or worse, benchmarks shape a field
° Good products are created when you have:
• Good benchmarks
• Good ways to summarize performance
° Given that sales are in part a function of performance relative to the competition, there is investment in improving the product as reported by the performance summary
° If the benchmarks/summary are inadequate, then choose between improving the product for real programs vs. improving the product to get more sales; sales almost always win!
° Execution time is the measure of computer performance!
How to Mislead with Performance Reports
° Select pieces of workload that work well on your design, ignore others
° Use unrealistic data set sizes for application (too big or too small)
° Report throughput numbers for a latency benchmark
° Report latency numbers for a throughput benchmark
° Report performance on a kernel and claim it represents an entire application
° Use 16-bit fixed-point arithmetic (because it’s fastest on your system) even
though application requires 64-bit floating-point arithmetic
° Use a less efficient algorithm on the competing machine
° Report speedup for an inefficient algorithm (bubblesort)
° Compare hand-optimized assembly code with unoptimized C code
° Compare your design using next year’s technology against competitor’s year
old design (1% performance improvement per week)
° Ignore the relative cost of the systems being compared
° Report averages and not individual results
° Report speedup over unspecified base system, not absolute times
° Report efficiency not absolute times
° Report MFLOPS not absolute times (use inefficient algorithm)
[ David Bailey “Twelve ways to fool the masses when giving performance
results for parallel supercomputers” ]
Power and Energy
°Energy to complete operation (Joules)
• Corresponds approximately to battery life
• (Battery energy capacity actually depends on rate of
discharge)
°Peak power dissipation (Watts = Joules/second)
• Affects packaging (power and ground pins, thermal
design)
°di/dt, peak change in supply current
(Amps/second)
• Affects power supply noise (power and ground pins,
decoupling capacitors)
Peak Power versus Lower Energy
(Figure: power-vs-time curves for systems A and B; integrate the power curve to get energy.)
° System A has higher peak power, but lower total energy
° System B has lower peak power, but higher total energy
A "Typical" RISC ISA
°32-bit fixed format instruction (3 formats)
°32 32-bit GPR (R0 contains zero, DP take pair)
°3-address, reg-reg arithmetic instruction
°Single address mode for load/store:
base + displacement
• no indirection
°Simple branch conditions
°Delayed branch
see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC,
CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
Example: MIPS Instruction Formats

Register-Register:
  Op (31:26) | Rs1 (25:21) | Rs2 (20:16) | Rd (15:11) | Opx (10:0)
Register-Immediate:
  Op (31:26) | Rs1 (25:21) | Rd (20:16) | immediate (15:0)
Branch:
  Op (31:26) | Rs1 (25:21) | Rs2/Opx (20:16) | immediate (15:0)
Jump / Call:
  Op (31:26) | target (25:0)
Datapath vs Control
(Figure: a datapath plus a controller; the controller drives control points in the datapath and receives signals back.)
° Datapath: storage, functional units, and interconnect sufficient to perform the desired functions
• Inputs are control points
• Outputs are signals
° Controller: a state machine to orchestrate operation on the datapath
• Based on the desired function and the signals
Approaching an ISA
° Instruction Set Architecture
• Defines the set of operations, instruction format, hardware-supported data types, named storage, addressing modes, sequencing
° The meaning of each instruction is described by RTL on architected registers and memory
° Given technology constraints, assemble an adequate datapath:
• Architected storage mapped to actual storage
• Function units to do all the required operations
• Possible additional storage (e.g., MAR, MBR, …)
• Interconnect to move information among registers and function units
° Map each instruction to a sequence of RTL operations
° Collate the sequences into a symbolic controller state transition diagram (STD)
° Lower the symbolic STD to control points
° Implement the controller
5 Steps of MIPS Datapath (Figure A.2, Page A-8)
(Figure: the datapath through five steps: Instruction Fetch; Instruction Decode / Register Fetch; Execute / Address Calculation; Memory Access; Write Back. It includes the PC and its +4 adder, instruction memory, register file, sign extension, the ALU with zero test, data memory, and the write-back mux.)
RTL for fetch and for a register-register instruction:
IR <= mem[PC]; PC <= PC + 4
Reg[IRrd] <= Reg[IRrs] opIRop Reg[IRrt]
5 Steps of MIPS Datapath (Figure A.3, Page A-9)
(Figure: the same five-step datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages.)
RTL, step by step:
IR <= mem[PC]; PC <= PC + 4
A <= Reg[IRrs]; B <= Reg[IRrt]
rslt <= A opIRop B
WB <= rslt
Reg[IRrd] <= WB
Inst. Set Processor Controller
(State transition diagram: every instruction begins with instruction fetch and operand fetch/decode, then branches by opcode.)
Ifetch: IR <= mem[PC]; PC <= PC + 4
opFetch-DCD: A <= Reg[IRrs]; B <= Reg[IRrt]
Then, depending on the opcode:
br:  if bop(A,b) then PC <= PC + IRim
jmp: PC <= IRjaddr
RR:  r <= A opIRop B;    WB <= r;      Reg[IRrd] <= WB
RI:  r <= A opIRop IRim; WB <= r;      Reg[IRrd] <= WB
LD:  r <= A + IRim;      WB <= Mem[r]; Reg[IRrd] <= WB
(plus the ST, JSR, and JR paths)
5 Steps of MIPS Datapath (Figure A.3, Page A-9)
(Figure: the pipelined datapath again, with IF/ID, ID/EX, EX/MEM, and MEM/WB registers.)
• Data stationary control
– local decode for each instruction phase / pipeline stage
Visualizing Pipelining (Figure A.2, Page A-8)
(Figure: four instructions in flight across clock cycles 1-7; each row is an instruction flowing through Ifetch, Reg, ALU, DMem, Reg, offset by one cycle per instruction.)
Pipelining is not quite that easy!
° Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
• Structural hazards: HW cannot support this combination of instructions (a single person to fold and put clothes away)
• Data hazards: instruction depends on the result of a prior instruction still in the pipeline (missing sock)
• Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
One Memory Port / Structural Hazards (Figure A.4, Page A-14)
(Figure: a Load followed by Instr 1-4 in the pipeline; in cycle 4 the Load's memory access (DMem) and Instr 3's instruction fetch (Ifetch) both need the single memory port.)
One Memory Port / Structural Hazards (similar to Figure A.5, Page A-15)
(Figure: the same sequence, resolved by stalling: Instr 3 is delayed one cycle, inserting a bubble that propagates down the pipeline.)
How do you "bubble" the pipe?
Speed Up Equation for Pipelining

CPI_pipelined = Ideal CPI + Average stall cycles per instruction

Speedup = [ (Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI) ]
          x (Cycle time_unpipelined / Cycle time_pipelined)

For a simple RISC pipeline, Ideal CPI = 1:

Speedup = [ Pipeline depth / (1 + Pipeline stall CPI) ]
          x (Cycle time_unpipelined / Cycle time_pipelined)
Example: Dual-port vs. Single-port
° Machine A: dual-ported memory ("Harvard architecture")
° Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
° Ideal CPI = 1 for both
° Loads are 40% of instructions executed

SpeedUp_A = Pipeline depth / (1 + 0) x (clock_unpipe / clock_pipe)
          = Pipeline depth
SpeedUp_B = Pipeline depth / (1 + 0.4 x 1) x (clock_unpipe / (clock_unpipe / 1.05))
          = (Pipeline depth / 1.4) x 1.05
          = 0.75 x Pipeline depth
SpeedUp_A / SpeedUp_B = Pipeline depth / (0.75 x Pipeline depth) = 1.33

° Machine A is 1.33 times faster
Data Hazard on R1 (Figure A.6, Page A-17)
(Figure: the sequence below in the pipeline; every instruction after the add reads r1 before the add writes it back in cycle 5.)
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
Three Generic Data Hazards
° Read After Write (RAW): Instr_J tries to read an operand before Instr_I writes it
I: add r1,r2,r3
J: sub r4,r1,r3
° Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.
Three Generic Data Hazards
° Write After Read (WAR): Instr_J writes an operand before Instr_I reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
° Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
° Can't happen in the MIPS 5-stage pipeline because:
• All instructions take 5 stages, and
• Reads are always in stage 2, and
• Writes are always in stage 5
Three Generic Data Hazards
° Write After Write (WAW): Instr_J writes an operand before Instr_I writes it
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
° Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1".
° Can't happen in the MIPS 5-stage pipeline because:
• All instructions take 5 stages, and
• Writes are always in stage 5
° We will see WAR and WAW in more complicated pipes
Forwarding to Avoid Data Hazard (Figure A.7, Page A-19)
(Figure: the same add/sub/and/or/xor sequence, with forwarding paths feeding the add's ALU output directly into the ALU inputs of the following instructions.)
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
HW Change for Forwarding (Figure A.23, Page A-37)
(Figure: muxes at the ALU inputs select among the ID/EX register operands, the EX/MEM result, and the MEM/WB result; the immediate path and data memory connect as before.)
What circuit detects and resolves this hazard?
Forwarding to Avoid LW-SW Data Hazard (Figure A.8, Page A-20)
(Figure: forwarding also covers a load followed by a store: the loaded value is forwarded from the load's MEM stage to the store's MEM stage.)
add r1,r2,r3
lw  r4, 0(r1)
sw  r4,12(r1)
or  r8,r6,r9
xor r10,r9,r11
Data Hazard Even with Forwarding (Figure A.9, Page A-21)
(Figure: a load followed immediately by a dependent instruction; the loaded value is not available until the end of MEM, too late for the sub's EX in the very next cycle.)
lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9
Data Hazard Even with Forwarding (similar to Figure A.10, Page A-21)
(Figure: the same sequence with the required one-cycle stall: a bubble is inserted so the sub's EX lines up with the end of the load's MEM.)
lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9
How is this detected?
Software Scheduling to Avoid Load Hazards
Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:
LW   Rb,b
LW   Rc,c
ADD  Ra,Rb,Rc
SW   a,Ra
LW   Re,e
LW   Rf,f
SUB  Rd,Re,Rf
SW   d,Rd

Fast code:
LW   Rb,b
LW   Rc,c
LW   Re,e
ADD  Ra,Rb,Rc
LW   Rf,f
SW   a,Ra
SUB  Rd,Re,Rf
SW   d,Rd

Compiler optimizes for performance. Hardware checks for safety.
Control Hazard on Branches: Three-Stage Stall
(Figure: a taken branch at address 10 followed by the instructions at 14, 18, and 22, and the target at 36; the three instructions after the branch are fetched before the branch resolves.)
10: beq r1,r3,36
14: and r2,r3,r5
18: or  r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
What do you do with the 3 instructions in between?
How do you do it?
Where is the "commit"?
Branch Stall Impact
° If CPI = 1 and 30% of instructions are branches, a 3-cycle stall => new CPI = 1.9!
° Two-part solution:
• Determine whether the branch is taken or not sooner, AND
• Compute the taken-branch address earlier
° MIPS branches test whether a register = 0 or != 0
° MIPS solution:
• Move the zero test to the ID/RF stage
• Add an adder to calculate the new PC in the ID/RF stage
• 1 clock cycle penalty for branches instead of 3
Pipelined MIPS Datapath (Figure A.24, page A-38)
(Figure: the pipelined datapath with the branch zero test and the branch-target adder moved into the Instruction Decode / Register Fetch stage, so the next PC is selected one cycle after fetch.)
• Interplay of instruction set design and cycle time.
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
• Execute successor instructions in sequence
• "Squash" instructions in the pipeline if the branch is actually taken
• Advantage of late pipeline state update
• 47% of MIPS branches are not taken on average
• PC+4 is already calculated, so use it to get the next instruction
#3: Predict Branch Taken
• 53% of MIPS branches are taken on average
• But the branch target address is not yet calculated in MIPS
- MIPS still incurs a 1-cycle branch penalty
Four Branch Hazard Alternatives
#4: Delayed Branch
• Define the branch to take place AFTER a following instruction:
branch instruction
sequential successor_1
sequential successor_2
........
sequential successor_n    <- branch delay of length n
branch target if taken
• A 1-slot delay allows a proper decision and branch target address in the 5-stage pipeline
• MIPS uses this
Scheduling Branch Delay Slots (Fig A.14)

A. From before the branch:
  add $1,$2,$3
  if $2=0 then [delay slot]
becomes:
  if $2=0 then [add $1,$2,$3]

B. From the branch target:
  sub $4,$5,$6
  ...
  add $1,$2,$3
  if $1=0 then [delay slot]
becomes:
  add $1,$2,$3
  if $1=0 then [sub $4,$5,$6]

C. From fall-through:
  add $1,$2,$3
  if $1=0 then [delay slot]
  sub $4,$5,$6
becomes:
  add $1,$2,$3
  if $1=0 then [sub $4,$5,$6]

° A is the best choice: it fills the delay slot and reduces instruction count (IC)
° In B, the sub instruction may need to be copied, increasing IC
° In B and C, it must be okay to execute sub when the branch fails
Delayed Branch
° Compiler effectiveness for a single branch delay slot:
• Fills about 60% of branch delay slots
• About 80% of instructions executed in branch delay slots are useful in computation
• So about 50% (60% x 80%) of slots are usefully filled
° Delayed branch downside: as processors go to deeper pipelines and multiple issue, the branch delay grows and more than one delay slot is needed
• Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches
• Growth in available transistors has made dynamic approaches relatively cheaper
Evaluating Branch Alternatives

Pipeline speedup = Pipeline depth / (1 + Branch frequency x Branch penalty)

Assume 4% unconditional branches, 6% conditional branches untaken, 10% conditional branches taken:

Scheduling scheme | Branch penalty | CPI  | Speedup v. unpipelined | Speedup v. stall
Stall pipeline    | 3              | 1.60 | 3.1                    | 1.0
Predict taken     | 1              | 1.20 | 4.2                    | 1.33
Predict not taken | 1              | 1.14 | 4.4                    | 1.40
Delayed branch    | 0.5            | 1.10 | 4.5                    | 1.45
Problems with Pipelining
° Exception: an unusual event happens to an instruction during its execution
• Examples: divide by zero, undefined opcode
° Interrupt: a hardware signal to switch the processor to a new instruction stream
• Example: a sound card interrupts when it needs more audio output samples (an audio "click" happens if it is left waiting)
° Problem: the exception or interrupt must appear to occur between 2 instructions (I_i and I_i+1)
• The effect of all instructions up to and including I_i is totally complete
• No effect of any instruction after I_i can have taken place
° The interrupt (exception) handler either aborts the program or restarts it at instruction I_i+1
Precise Exceptions in In-Order Pipelines
Key observation: architected state changes only in the memory and register write stages.
Summary: Metrics and Pipelining
° Machines are compared over many metrics
• Cost, performance, power, reliability, compatibility, …
° It is difficult to compare widely differing machines on a benchmark suite
° Control via state machines and microprogramming
° Pipelining just overlaps tasks; easy if tasks are independent
° Speedup <= pipeline depth; if ideal CPI is 1, then:

Speedup = [ Pipeline depth / (1 + Pipeline stall CPI) ]
          x (Cycle time_unpipelined / Cycle time_pipelined)

° Hazards limit performance on computers:
• Structural: need more HW resources
• Data (RAW, WAR, WAW): need forwarding, compiler scheduling
• Control: delayed branch, prediction
° Exceptions and interrupts add complexity
° Next time: read Appendix C!
Since 1980, CPU has outpaced DRAM ...
A four-issue 2 GHz superscalar accessing 100 ns DRAM could execute 800 instructions during the time for one memory access!
(Chart: performance (1/latency) vs. year; CPU improves 60% per year (2X in 1.5 years), DRAM 9% per year (2X in 10 years), so the gap grew about 50% per year.)
Addressing the Processor-Memory Performance Gap
° Goal: the illusion of a large, fast, cheap memory. Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access
° Solution: put smaller, faster "cache" memories between the CPU and DRAM. Create a "memory hierarchy".
Levels of the Memory Hierarchy (today's focus: the cache)

Level       | Capacity      | Access time           | Cost                    | Transfer unit ("staging")  | Managed by
Registers   | 100s of bytes | <10s of ns            |                         | instr. operands, 1-8 bytes | program/compiler
Cache       | KBs           | 10-100 ns             | 1-0.1 cents/bit         | blocks, 8-128 bytes        | cache controller
Main memory | MBs           | 200 ns - 500 ns       | $.0001-.00001 cents/bit | pages, 512-4K bytes        | OS
Disk        | GBs           | 10 ms (10,000,000 ns) | 10^-5 - 10^-6 cents/bit | files, MBs                 | user/operator
Tape        | infinite      | sec-min               | 10^-8 cents/bit         |                            |

Levels get faster toward the upper level (registers) and larger toward the lower level (tape).
Common Predictable Patterns
Two predictable properties of memory references:
°Temporal Locality: If a location is referenced, it is
likely to be referenced again in the near future
(e.g., loops, reuse).
°Spatial Locality: If a location is referenced it is
likely that locations near it will be referenced in
the near future (e.g., straightline code, array
access).
Caches
Caches exploit both types of predictability:
• Exploit temporal locality by
remembering the contents of recently
accessed locations.
• Exploit spatial locality by fetching
blocks of data around recently
accessed locations.
Cache Algorithm (Read)
Look at the processor address and search the cache tags to find a match. Then either:
• HIT: found in the cache; return a copy of the data from the cache
• MISS: not in the cache; read the block of data from main memory, wait, then return the data to the processor and update the cache
Hit rate = fraction of accesses found in the cache
Miss rate = 1 - hit rate
Hit time = RAM access time + time to determine HIT/MISS
Miss time = time to replace a block in the cache + time to deliver the block to the processor
Inside a Cache
(Figure: the cache sits between the processor and main memory; each cache line pairs an address tag with a data block that is a copy of the corresponding main-memory locations.)
4 Questions for Memory Hierarchy
°Q1: Where can a block be placed in the cache?
(Block placement)
°Q2: How is a block found if it is in the cache?
(Block identification)
°Q3: Which block should be replaced on a miss?
(Block replacement)
°Q4: What happens on a write?
(Write strategy)
Q1: Where can a block be placed?
(Figure: a 32-block memory and an 8-block cache.)
Memory block 12 can be placed:
• Fully associative: anywhere in the cache
• (2-way) set associative: anywhere in set 0 (12 mod 4)
• Direct mapped: only into block 4 (12 mod 8)
Q2: How is a block found?
° The index selects which set to look in
° There is a tag on each block
• No need to check the index or block offset
° Increasing associativity shrinks the index and expands the tag. Fully associative caches have no index field.

Memory address: [ Tag | Index | Block offset ], where tag + index form the block address.
Direct-Mapped Cache
(Figure: the address splits into a t-bit tag, a k-bit index, and a b-bit block offset; the index selects one of 2^k lines, each holding a valid bit, a tag, and a data block; the stored tag is compared with the address tag, and a match signals HIT and selects the data word or byte.)
2-Way Set-Associative Cache
(Figure: the index selects a set containing two ways; both ways' tags are compared with the address tag in parallel, and the matching way's data block supplies the word or byte; either match signals HIT.)
Fully Associative Cache
(Figure: no index field; the address tag is compared against the tags of all lines in parallel, and any match signals HIT and selects the data word or byte via the block offset.)
What causes a MISS?
°Three Major Categories of Cache Misses:
•Compulsory Misses: first access to a block
•Capacity Misses: cache cannot contain all
blocks needed to execute the program
•Conflict Misses: block replaced by another
block and then later retrieved - (affects
set assoc. or direct mapped caches)
Nightmare Scenario: ping pong effect!
Block Size and Spatial Locality
A block is the unit of transfer between the cache and memory.
(Figure: a 4-word block (Word0-Word3) with b = 2; the CPU address splits into a block address (32-b bits, tag + index) and a b-bit offset; 2^b = block size, a.k.a. line size, in bytes.)
Larger block sizes have distinct hardware advantages:
• less tag overhead
• exploit fast burst transfers from DRAM
• exploit fast burst transfers over wide busses
What are the disadvantages of increasing block size?
Fewer blocks => more conflicts. Can waste bandwidth.
Q3: Which block should be replaced on a miss?
° Easy for direct mapped
° Set associative or fully associative:
• Random
• Least Recently Used (LRU)
- LRU cache state must be updated on every access
- a true implementation is only feasible for small sets (2-way)
- a pseudo-LRU binary tree is often used for 4-8 way
• First In, First Out (FIFO), a.k.a. round-robin
- used in highly associative caches
° The replacement policy has only a second-order effect, since replacement happens only on misses
Q4: What happens on a write?
° Cache hit:
• write through: write both cache & memory
- generally higher traffic, but simplifies cache coherence
• write back: write the cache only (memory is written only when the entry is evicted)
- a dirty bit per block can further reduce the traffic
° Cache miss:
• no write allocate: only write to main memory
• write allocate (a.k.a. fetch on write): fetch the block into the cache
° Common combinations:
• write through and no write allocate
• write back with write allocate
5 Basic Cache Optimizations
° Reducing miss rate:
1. Larger block size (compulsory misses)
2. Larger cache size (capacity misses)
3. Higher associativity (conflict misses)
° Reducing miss penalty:
4. Multilevel caches
° Reducing hit time:
5. Giving reads priority over writes
• E.g., a read completes before earlier writes still in the write buffer