Computer Organization & Design
Weidong Wang (王维东)
[email protected]
College of Information Science & Electronic Engineering
Institute of Information and Communication Engineering
Zhejiang University
Zhejiang University
Course Outline
• Instructor
• Prerequisites
• Topics of Study
• Course Expectations
• Grading
• Logic Review
2
Course Information
• Instructor: Weidong WANG
– Email: [email protected]
– Tel(O): 0571-87953170;
– Office Hours: TBD, Yuquan Campus, Xindian (High-Tech) Building 306, or email any time
• TAs (mobile, email):
» Lu Huang (黄露), 13516719473/6719473; [email protected]
» Hanqi Shen (沈翰祺), 15067115046; [email protected]
» Office Hours: Wednesday & Saturday 14:00-16:30, Xindian (High-Tech) Building 308 (you can also reach the TAs by text message or email)
3
Prerequisites
• Logic Design
• C Programming
• Compilers, OS, Circuits/VLSI background is a plus but not required
• How to learn this course?
– Not only by listening, thinking and waiting…
– But by exercises, simulation, and practice!
4
Other Course Info
• Course Text:
– David A. Patterson and John L. Hennessy, Computer
Organization & Design: The Hardware/Software
Interface, 4th edition, Morgan Kaufmann press
– CD includes manuals, appendices, in-depth sections,
– “Green card” summarizes MIPS ISA
• Website:
– http://mypage.zju.edu.cn/wdwd/
– ftp://10.13.71.58/数字系统设计2/2015/ (temporarily out of service)
– Check frequently
• Classroom
– Teaching Building 7 (教七), Room 202; Wednesday, periods 3 to 5
5
Grading (Tentative)
Final grades will be computed approximately as follows:
• Homework Sets + Quizzes + Class - 30%
– Classroom check
– 6 Homework Sets
– 2 Programming Assignments
» Use the SPIM simulator on Linux & Windows machines
• Project - 10%
– 2 projects (teams of 1 or 2 members)
» Simulation or assembly programming
» Cache controller design
• Final Exam - 60%
6
Major Topics
• Hardware-software interface
– Machine language and assembly language programming
• Compiler optimizations and performance
• Processor design
– Pipelined processor design
• Memory hierarchy design
– Caches
• I/O devices and systems
• Virtual memory & operating systems support
• Multiprocessors and Multithreading
7
Why take this class?
• Learn how modern computing hardware works
• Understand where computing hardware is going
in the future
– And learn how to contribute to this future…
• How does this impact system software and
applications?
– Essential to understand OS/compilers/PL
– For everyone else, it can help you write better code!
• How are future technologies going to impact
computing systems?
8
Topics of Study
• Focus on what modern computer architects worry about (both academia and industry)
• Get through the basics of modern processor design
• Look at technology trends: multithreading, CMP, power-aware and reliability-aware design
• Recent research ideas, and the future of computing
hardware
9
Lecture 1
Introduction to Programmable Digital Systems
Computer Organization & Design
-Fall 2015
Zhejiang University
10
Current State of the World: Thinking via Electronics
• Electronic systems dominate almost everything
– And of these, most systems use processors and memory
• Why?
– Break this question into three questions
» Why electronics?
» Why use digital integrated circuits (ICs) to build electronics?
» Why use processors in ICs?
• Why use electronics?
– Electrons are easy to move / control
– Easier than the current alternatives
• Result is that we move information, not real physical stuff
– Think phone, email, fax, TV, WWW, etc.
11
Programmable Components, aka Processors (aka = also known as)
• An old approach to “solve” the complexity problem
– Build a generic device and customize with memory
» Through a process called programming
– (Re)use device in a large number of systems
– Best way to do this is with a general purpose processor
• Processor complexity grows with technology
– But software model stays roughly the same
» C, C++, Java, … run on Pentium 2, 3, 4, M, Core, Core 2, …
» True for sequential programs
– This is getting much tougher to do
» Recent hardware developments require software model changes
» Multi-core
12
The Complexity Problem
• Complexity is the limiting factor in modern chip design
– Two problems
1. How do you make use of all those IC resources?
– Uberappliance
» Cellphone, PDA, iPod, mobile TV, video camera
– Too many applications to cast all into hardware logic
– Takes too long to finish the design
2. How do you make sure it works?
– Verification problem
– How do you fix bugs?
• Only way to survive complexity:
– Hide complexity in “general-purpose” components
– Reuse components
13
What is Computer Architecture?
14
Challenges in the 21st Century
15
Modeling + Design
• First Component (Modeling/Measurement):
– Come up with a way to:
» Diagnose where power is going in your system
» Quantify potential savings
• Second Component (Design)
– Try out lots of ideas
– Or characterize tradeoffs of ideas…
• This class will focus on both of these at many levels of the computing hierarchy
16
What is Computer Architecture?
[Figure: Application at the top and Physics at the bottom. The gap between them is too large to bridge in one step (but there are exceptions, e.g. magnetic compass).]
In its broadest definition, computer architecture is the
design of the abstraction layers that allow us to implement
information processing applications efficiently using
available manufacturing technologies.
17
Abstraction Layers in Modern Systems
Application
Algorithm
Programming Language
Operating System/Virtual Machines
Instruction Set Architecture (ISA)
Microarchitecture
Gates/Register-Transfer Level (RTL)
Circuits
Devices
Physics
18
Architecture continually changing
[Figure: feedback loop between Applications and Technology]
• Applications suggest how to improve technology, provide revenue to fund development
• Improved technologies make new applications possible
• Cost of software development makes compatibility a major force in market
19
Computing Devices Now
Sensor Nets, Cameras, Media Players, Games, Set-top boxes, Laptops, Servers, Routers, Robots, Smart phones, Automobiles, Supercomputers
20
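iPhone 6s
• 1. The iPhone 6s series was announced on September 10, 2015 (Beijing time). In addition to the existing gold, silver, and space gray, a rose gold (pink) finish was introduced. The screen uses high-strength Ion-X glass. The processor is the A9, with CPU performance 70% higher and graphics performance 90% higher than the A8. The rear camera has 12 megapixels and the front camera 5 megapixels. Focusing is more accurate, the CMOS sensor uses “deep trench isolation” to reduce noise, and 4K video recording is supported. For connectivity, it supports 23 LTE bands and Wi-Fi at twice the speed. It went on sale on September 25, 2015.
• 2. Key technologies
– 3D Touch
– 12-megapixel photos, 4K video, and Live Photos.
– A9 chip: a purpose-built 64-bit chip with up to 70% faster CPU performance and up to 90% faster GPU performance than before.
– iPhone 6s: 4.7-inch Retina HD display with 3D Touch; iPhone 6s Plus: 5.5-inch Retina HD display with 3D Touch.
– Touch ID: an advanced fingerprint sensor that is faster and easier to use than before, for unlocking the phone easily and securely.
– 4G LTE and Wi-Fi are both faster.
– iOS 9: an advanced, intelligent, and secure mobile operating system. Powerful new built-in apps, advanced Siri features, and system-wide improvements make iOS 9 smarter than before, and deep integration with Apple hardware keeps everything working together smoothly.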
21
Apple
• Jobs and Woz (Wozniak) founded Apple in 1976
• The original logo had a frame carrying a line from a short poem by William Wordsworth:
• “Newton… a mind forever voyaging through strange seas of thought… alone.”
22
Major Technology Generations
[Figure: technology generations: Electromechanical, Relays, Vacuum Tubes, Bipolar, pMOS, nMOS, CMOS, and “?”; from Kurzweil]
23
Uniprocessor Performance
[Figure: performance (vs. VAX-11/780), log scale from 1 to 10000, over 1978 to 2006. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006]
• VAX: 25%/year 1978 to 1986
• RISC + x86: 52%/year 1986 to 2002
• RISC + x86: ??%/year 2002 to present
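A quick way to read the growth rates above is to compound them over each period. The sketch below is only arithmetic on the two stated rates; the function name is made up for illustration, and the results are approximations of what the chart shows, not values read off it.

```python
# Compound annual growth: performance multiplies by (1 + rate) every year.
def cumulative_speedup(rate_per_year: float, years: int) -> float:
    return (1.0 + rate_per_year) ** years

vax_era = cumulative_speedup(0.25, 1986 - 1978)   # 25%/year for 8 years
risc_era = cumulative_speedup(0.52, 2002 - 1986)  # 52%/year for 16 years

print(f"1978-1986: about {vax_era:.1f}x")
print(f"1986-2002: about {risc_era:.0f}x")
print(f"1978-2002 combined: about {vax_era * risc_era:.0f}x")
```

The point of the comparison: 8 years at 25%/year gives roughly a 6x gain, while 16 years at 52%/year gives a gain of several hundred times.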
24
The End of the Uniprocessor Era
Single biggest change in the history
of computing systems
25
Course Focus
Understanding the design techniques, machine
structures, technology factors, evaluation methods that
will determine the form of computers in 21st Century
[Figure: Computer Architecture (organization, hardware/software boundary) at the center, surrounded by Technology, Applications, Parallelism, Programming Languages, Operating Systems, Measurement & Evaluation, Interface Design (ISA), Compilers, and History]
26
Computer Architecture:
A Little History
Throughout the course we’ll use a historical narrative to
help understand why certain ideas arose
Why worry about old ideas?
• Helps to illustrate the design process, and explains
why certain decisions were taken
• Because future technologies might be as constrained
as older ones
• Those who ignore history are doomed to repeat it
– Every mistake made in mainframe design was also made in
minicomputers, then microcomputers, where next?
27
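Foundations of the Computer
• In 1642, the French scientist Blaise Pascal (1623-1662), then only 19 years old, built the first working calculating machine. It was a purely mechanical device, driven by a hand crank with gear transmission, and could add and subtract. The programming language Pascal is named after him.
• In 1945, the American mathematician Dr. John von Neumann published a paper on the logical design of an electronic computing instrument (《电子计算工具逻辑设计》), proposing binary representation and the stored-program computer concept.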
28
Difference Engine
1823
– Charles Babbage’s paper is published
1834
– The paper is read by Scheutz & his son in Sweden
1842
– Babbage gives up the idea of building it; he is on to the Analytical Engine!
1855
– Scheutz displays his machine at the Paris World’s Fair
– Can compute any 6th degree polynomial
– Speed: 33 to 44 32-digit numbers per minute!
Now the machine is at the Smithsonian (a US museum)
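Difference engines tabulate polynomials by the method of finite differences, which needs only additions once the starting column is set up. The sketch below illustrates that idea in software; it is not a model of the Scheutz hardware, and the function names and the example polynomial are made up for illustration.

```python
def difference_table(poly, start, order):
    """Initial value and forward differences of poly at start (step 1)."""
    values = [poly(start + i) for i in range(order + 1)]
    column = []
    while values:
        column.append(values[0])
        values = [b - a for a, b in zip(values, values[1:])]
    return column  # [f(start), first difference, second difference, ...]


def tabulate(column, count):
    """Produce count successive polynomial values using additions only."""
    regs = list(column)
    out = []
    for _ in range(count):
        out.append(regs[0])
        # Each register absorbs its right-hand neighbour, left to right,
        # so every addition still sees the neighbour's old value.
        for i in range(len(regs) - 1):
            regs[i] += regs[i + 1]
    return out


def f(x):
    return x**3 - 2 * x + 5  # arbitrary degree-3 example


table = tabulate(difference_table(f, 0, 3), 8)
print(table)
```

A degree-n polynomial needs n + 1 registers, so a 6th degree polynomial needs seven.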
29
Linear Equation Solver
John Atanasoff, Iowa State University
1930’s:
– Atanasoff built the Linear Equation Solver.
– It had 300 tubes!
– Special-purpose binary digital calculator
– Dynamic RAM (stored values on refreshed capacitors)
Application:
– Linear and integral differential equations
Background:
– Vannevar Bush’s Differential Analyzer, an analog computer
Technology:
– Tubes and electromechanical relays
Atanasoff decided that the correct mode of computation was using electronic binary digits.
30
Harvard Mark I
• Built in 1944 in IBM Endicott laboratories
– Howard Aiken – Professor of Physics at Harvard
– Essentially mechanical but had some electromagnetically controlled relays and gears
– Weighed 5 tons and had 750,000 components
– A synchronizing clock that beat every 0.015
seconds (66Hz)
Performance:
0.3 seconds for addition
6 seconds for multiplication
1 minute for a sine calculation
Decimal arithmetic
No Conditional Branch!
Broke down once a week!
31
ENIAC - The first electronic computer (1946)
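In 1946, the University of Pennsylvania completed the electronic digital computer ENIAC. It weighed 28 tons, consumed 150 kW, occupied 170 square meters, used 18,800 vacuum tubes, and performed 5,000 additions per second.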
32
Computing Devices Then…
EDSAC, University of Cambridge, UK, 1949
33
And then there was IBM 701
IBM 701 -- 30 machines were sold in 1953-54
used CRTs as main memory, 72 tubes of 32x32b each
IBM 650 -- a cheaper, drum based machine,
more than 120 were sold in 1954
and there were orders for 750 more!
Users stopped building their own machines.
Why was IBM late getting into computer technology?
IBM was making too much money!
Even without computers, IBM revenues
were doubling every 4 to 5 years in 40’s
and 50’s.
34
Intel 4004
Micro-Processor
@1971
1000 transistors
1 MHz operation
35
IBM 360: 47 years later…
The zSeries z11 Microprocessor
• 5.2 GHz in IBM 45nm PD-SOI CMOS technology
• 1.4 billion transistors in 512 mm²
• 64-bit virtual addressing
– original S/360 was 24-bit, and S/370 was 31-bit extension
• Quad-core design
• Three-issue out-of-order superscalar pipeline
• Out-of-order memory accesses
• Redundant datapaths
– every instruction performed in two parallel datapaths and results compared
• 64KB L1 I-cache, 128KB L1 D-cache on-chip
• 1.5MB private L2 unified cache per core, on-chip
• On-Chip 24MB eDRAM L3 cache
• Scales to 96-core multiprocessor with 768MB of shared L4 eDRAM
[IBM, HotChips, 2010]
36
And in conclusion …
• Computer Architecture >> ISAs and RTL
• Computer architecture is shaped by technology and
applications
– History provides lessons for the future
• Computer Science at the crossroads from sequential
to parallel computing
– Salvation requires innovation in many fields, including computer
architecture
37
Burroughs B5000 Stack Architecture:
An ALGOL Machine, Robert Barton, 1960
• Machine implementation can be completely hidden if the programmer is provided only a high-level language interface.
• Stack machine organization because stacks are convenient for:
1. expression evaluation;
2. subroutine calls, recursion, nested interrupts;
3. accessing variables in block-structured languages.
• B6700, a later model, had many more innovative features
– tagged data
– virtual memory
– multiple processors and memories
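Computer-Centered Digital Electronic Systems
• Hardware is the material foundation of a computer; without hardware, the computer would not exist. Software is what puts the computer's capabilities to work; without software, the computer cannot be put to use.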
39
A Stack Machine
A stack machine has a stack as a part of the processor state.
Typical operations: push, pop, +, *, ...
Instructions like + implicitly specify the top 2 elements of the stack as operands.
[Figure: processor with a stack, connected to the main store. Starting from a stack holding a: push b gives b, a; push c gives c, b, a; pop gives b, a (top element listed first).]
Evaluation of Expressions
(a + b * c) / (a + d * c - e)
[Figure: expression tree with / at the root; left subtree a + (b * c); right subtree (a + (d * c)) - e]
Reverse Polish: a b c * + a d c * + e - /
Evaluation stack, step by step:
– push a, push b, push c: the stack holds c, b, a (top first)
– multiply: c and b are replaced by b * c
– add: b * c and a are replaced by a + b * c
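The evaluation order for the Reverse Polish string can be checked with a few lines of software. This is a minimal sketch of a stack evaluator, not a model of any particular machine; the variable values in `env` are arbitrary and chosen only for illustration.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def eval_rpn(tokens, env):
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()          # top of stack is the right operand
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(env[tok])       # operand: push its value
    assert len(stack) == 1, "malformed expression"
    return stack[0]


env = {"a": 2, "b": 3, "c": 4, "d": 5, "e": 6}   # illustrative values
rpn = "a b c * + a d c * + e - /".split()
result = eval_rpn(rpn, env)
print(result)  # same value as (a + b * c) / (a + d * c - e)
```

Operators take their operands implicitly from the top of the stack, so the instruction stream carries no operand addresses for them.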
Hardware organization of the stack
• Stack is part of the processor state
⇒ stack must be bounded and small: about the number of registers, not the size of main memory
• Conceptually stack is unbounded
⇒ a part of the stack is included in the processor state; the rest is kept in the main memory
Stack Operations and Implicit Memory References
• Suppose the top 2 elements of the stack are kept in registers and the rest is kept in the memory.
Each push operation ⇒ 1 memory reference
Each pop operation ⇒ 1 memory reference
No Good!
• Better performance can be obtained if the top N elements are kept in registers and memory references are made only when the register stack overflows or underflows.
Issue: when to load/unload registers?
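The overflow/underflow scheme can be made concrete by counting memory references for different register counts. The sketch below assumes one specific policy that the slide does not prescribe: spill the deepest register only on overflow, and refill one entry only when the registers are empty. Class and function names are hypothetical.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


class RegisterStack:
    """Top n_regs stack entries live in registers; the rest spill to memory."""

    def __init__(self, n_regs):
        self.n_regs = n_regs
        self.regs = []       # last element is the top of stack
        self.memory = []     # deeper entries, spilled to main memory
        self.mem_refs = 0

    def push(self, value):
        if len(self.regs) == self.n_regs:   # overflow: spill deepest register
            self.memory.append(self.regs.pop(0))
            self.mem_refs += 1
        self.regs.append(value)

    def pop(self):
        if not self.regs:                   # underflow: refill from memory
            self.regs.append(self.memory.pop())
            self.mem_refs += 1
        return self.regs.pop()


def run_rpn(tokens, env, n_regs):
    stack = RegisterStack(n_regs)
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.push(OPS[tok](left, right))
        else:
            stack.push(env[tok])
    return stack.pop(), stack.mem_refs


env = {"a": 2, "b": 3, "c": 4, "d": 5, "e": 6}   # illustrative values
rpn = "a b c * + a d c * + e - /".split()
for n in (2, 4):
    value, refs = run_rpn(rpn, env, n)
    print(f"{n} registers: value={value}, implicit memory references={refs}")
```

The expression reaches a stack depth of four, so four registers avoid all spills here, while two registers force several implicit memory references. Other load/unload policies would give different counts.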
Stack versus GPR Organization
Amdahl, Blaauw and Brooks, 1964
1. The performance advantage of push down stack organization
is derived from the presence of fast registers and not the way
they are used.
2.“Surfacing” of data in stack which are “profitable” is
approximately 50% because of constants and common
subexpressions.
3. Advantage of instruction density because of implicit addresses
is equaled if short addresses to specify registers are allowed.
4. Management of finite depth stack causes complexity.
5. Recursive subroutine advantage can be realized only with the
help of an independent stack for addressing.
6. Fitting variable-length fields into fixed-width word is awkward.
Stack Machines (Mostly) Died by 1980
1. Stack programs are not smaller if short (Register)
addresses are permitted.
2. Modern compilers can manage fast register space better
than the stack discipline.
GPR’s and caches are better than stack and displays
Early language-directed architectures often did not
take into account the role of compilers!
B5000, B6700, HP 3000, ICL 2900, Symbolics 3600
Some would claim that an echo of this mistake is visible in the SPARC architecture register windows (more on this later…)
Stacks post-1980
• Inmos Transputers (1985-2000)
– Designed to support many parallel processes in Occam language
– Fixed-height stack design simplified implementation
– Stack trashed on context swap (fast context switches)
– Inmos T800 was world’s fastest microprocessor in late 80’s
• Forth machines
– Direct support for Forth execution in small embedded real-time environments
– Several manufacturers (Rockwell, Patriot Scientific)
• Java Virtual Machine
– Designed for software emulation, not direct hardware execution
– Sun PicoJava implementation + others
• Intel x87 floating-point unit
– Severely broken stack model for FP arithmetic
– Deprecated in Pentium-4, replaced with SSE2 FP registers
Microprogramming
• A brief look at microprogrammed machines
– To show how to build very small processors with complex ISAs
– To help you understand where CISC machines came from
– Because it is still used in the most common machines (x86,
PowerPC, IBM360)
– As a gentle introduction into machine structures
– To help understand how technology drove the move to RISC
ISA to Microarchitecture Mapping
• ISA often designed with particular microarchitectural style in mind, e.g.,
– CISC ⇒ microcoded
– RISC ⇒ hardwired, pipelined
– VLIW ⇒ fixed-latency in-order pipelines
– JVM ⇒ software interpretation
• But can be implemented with any microarchitectural style
– Core 2 Duo: hardwired pipelined CISC (x86) machine (with some microcode support)
– This lecture: a microcoded RISC (MIPS) machine
– Intel could implement a dynamically scheduled out-of-order VLIW (IA-64) processor
– ARM Jazelle: A hardware JVM processor
– Simics: Software-interpreted SPARC RISC machine
Microarchitecture: Implementation of an ISA
[Figure: a controller connected to the data path by control points and status lines]
Structure: How components are connected (static).
Behavior: How data moves between components (dynamic).
Microcontrol Unit (Maurice Wilkes, 1954)
Embed the control logic state table in a memory array
[Figure: op code, conditional flip-flop, and next state address feed a decoder that drives Matrix A and Matrix B; control lines go to the ALU, MUXs, and registers]
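Microcoded Microarchitecture
[Figure: a microcode controller (ROM), which holds fixed microcode instructions, takes busy?, zero?, and opcode as inputs and drives the datapath; the datapath connects through Addr and Data to memory (RAM), which holds the user program written in macrocode instructions (e.g., MIPS, x86, etc.); memory control signals are enMem and MemWrt]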
The MIPS32 ISA
• Processor State
– 32 32-bit GPRs, R0 always contains a 0
– 16 double-precision/32 single-precision FPRs
– FP status register, used for FP compares & exceptions
– PC, the program counter
– some other special registers
(See H&P Appendix B for full description)
• Data types
– 8-bit byte, 16-bit half word
– 32-bit word for integers
– 32-bit word for single precision floating point
– 64-bit word for double precision floating point
• Load/Store style instruction set
– data addressing modes: immediate & indexed
– branch addressing modes: PC relative & register indirect
– Byte addressable memory: big-endian mode
– All instructions are 32 bits
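To make the fixed 32-bit instruction width and big-endian byte order concrete, here is a small sketch that splits one instruction word into fields. The R-type field layout (6/5/5/5/5/6 bits) and the example encoding come from the standard MIPS references, not from this slide, and the function name is made up for illustration.

```python
import struct


def decode_r_type(word):
    """Split a 32-bit MIPS R-type instruction word into its fields."""
    return {
        "opcode": (word >> 26) & 0x3F,   # bits 31..26
        "rs": (word >> 21) & 0x1F,       # bits 25..21
        "rt": (word >> 16) & 0x1F,       # bits 20..16
        "rd": (word >> 11) & 0x1F,       # bits 15..11
        "shamt": (word >> 6) & 0x1F,     # bits 10..6
        "funct": word & 0x3F,            # bits 5..0
    }


word = 0x012A4020                # add $t0, $t1, $t2 ($t0=8, $t1=9, $t2=10)
fields = decode_r_type(word)
big_endian_bytes = struct.pack(">I", word)   # most significant byte first

print(fields)
print(big_endian_bytes.hex(" "))
```

In big-endian mode the most significant byte of the word sits at the lowest byte address, so the bytes appear in memory in the same order as the hexadecimal digits are written.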
54
55
56
57
58
59
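Von Neumann Architecture
The modern computer model proposed by von Neumann:
• A computer consists of five parts: arithmetic unit, control unit, memory, input devices, and output devices.
• It uses the stored-program approach: programs and data are kept in the same memory, and a program, being made up of instructions, can be modified.
• Data is represented in binary.
• An instruction consists of an opcode and an address field.
• Instructions are stored in memory in execution order; the instruction counter holds the address of the instruction to be executed and normally increments sequentially.
• The machine is centered on the arithmetic unit; all data transfers pass through it.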
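The stored-program idea can be sketched as a toy accumulator machine. Everything here is a made-up illustration (the four opcodes, the 8-bit address field, and the encoding) rather than any real instruction set; it only shows instructions and data sharing one memory, an opcode-plus-address format, and a sequentially incrementing instruction counter.

```python
HALT, LOAD, ADD, STORE = 0, 1, 2, 3   # hypothetical opcodes


def encode(opcode, address=0):
    """Pack an instruction as opcode (high bits) plus 8-bit address."""
    return (opcode << 8) | address


def run(memory):
    pc, acc = 0, 0                     # instruction counter, accumulator
    while True:
        instr = memory[pc]             # fetch from the same memory as data
        pc += 1                        # sequential increment
        opcode, address = instr >> 8, instr & 0xFF
        if opcode == LOAD:
            acc = memory[address]
        elif opcode == ADD:
            acc += memory[address]
        elif opcode == STORE:
            memory[address] = acc
        elif opcode == HALT:
            return memory
        else:
            raise ValueError(f"unknown opcode {opcode}")


# Addresses 0..3 hold the program, addresses 5..7 hold data.
memory = [encode(LOAD, 5), encode(ADD, 6), encode(STORE, 7), encode(HALT),
          0, 20, 22, 0]
run(memory)
print(memory[7])
```

Because the program is just numbers in memory, a STORE aimed at addresses 0..3 would rewrite the program itself, which is the sense in which a stored program can be modified.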
60
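Typical von Neumann Computer Structure
[Figure 1: basic structure of a computer: memory, input, arithmetic unit, output, and control unit, connected by data paths and control signals]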
61
Clocked Logic Review
62
Review: Edge-Triggered D Flip Flops
D
Q
Value of D is sampled on positive
clock edge.
Q outputs sampled value for rest of
cycle.
CLK
D
Q
63
Review: Edge-Triggering in Verilog
[Figure: D flip-flop with inputs D and CLK and output Q]
Value of D is sampled on positive clock edge.
Q outputs sampled value for rest of cycle.
Module code has two bugs. Where?
module ff(D, Q, CLK);
  input D, CLK;
  output Q;
  always @ (CLK)
    Q <= D;
endmodule
64
Review: Edge-Triggered D Flip Flops
[Figure: D flip-flop with inputs D and CLK and output Q]
Value of D is sampled on positive clock edge.
Q outputs sampled value for rest of cycle.
Correct?
module ff(D, Q, CLK);
  input D, CLK;
  output Q;
  reg Q;
  always @ (posedge CLK)
    Q <= D;
endmodule
65
Logic styles used in lectures
All state elements in a design are edge-triggered, on the positive edge of a single global clock.
or
All state elements in a design are edge-triggered, on the negative edge of a single global clock.
66
State Machine Review
67
Specification: Traffic Light Controller
Inputs: CLK, Change, Rst. Outputs: R (red), Y (yellow), G (green).
If Change == 1 on positive CLK edge, traffic light changes.
If Rst == 1 on positive CLK edge, RYG = 100.
68
State Diagram: Traffic Light Controller
[Figure: state diagram. Rst == 1 leads to RYG = 100; each Change == 1 advances RYG 100 ⇒ 001 ⇒ 010 ⇒ 100.]
69
Timing Diagram: Traffic Light Controller
[Figure: the state diagram (Rst == 1 to RYG = 100; Change == 1 steps through 001 and 010) next to waveforms for CLK, Change, and RYG, with RYG taking the values 100, 001, 010, 100]
70
State Assignment: Traffic Light Controller
“One-Hot Encoding”: one D flip-flop per state bit, with outputs R, G, and Y
[Figure: the state diagram next to three D flip-flops]
71
Next State Logic: Traffic Light Controller
[Figure: Rst and Change feed the next-state combinational logic, which drives the D inputs of the R, G, and Y flip-flops]
72
State Verilog: Traffic Light Controller
[Figure: three D flip-flops with outputs R, G, and Y]
wire next_R, next_Y, next_G;
output R, Y, G;
???
73
Verilog: Edge-Triggered D Flip Flops
Value of D is sampled on positive clock edge.
Q outputs sampled value for rest of cycle.
module ff(D, Q, CLK);
  input D, CLK;
  output Q;
  reg Q;
  always @ (posedge CLK)
    Q <= D;
endmodule
74
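State Elements: Traffic Light Controller
[Figure: three D flip-flops with outputs R, G, and Y]
wire next_R, next_Y, next_G;
output R, Y, G;
// ff ports are (D, Q, CLK): the next-state wire drives D, the state output is Q
ff ff_R(next_R, R, CLK);
ff ff_Y(next_Y, Y, CLK);
ff ff_G(next_G, G, CLK);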
75
Next State Logic: Traffic Light Controller
[Figure: Rst, Change, and the current R, G, Y feed the next-state combinational logic, which produces next_R, next_G, and next_Y]
wire next_R, next_Y, next_G;
assign next_R = rst ? 1'b1 : (change ? Y : R);
assign next_Y = rst ? 1'b0 : (change ? G : Y);
assign next_G = rst ? 1'b0 : (change ? R : G);
76
Verilog: Complete Traffic Light Controller
wire next_R, next_Y, next_G;
output R, Y, G;
assign next_R = rst ? 1'b1 : (change ? Y : R);
assign next_Y = rst ? 1'b0 : (change ? G : Y);
assign next_G = rst ? 1'b0 : (change ? R : G);
// ff ports are (D, Q, CLK): the next-state wire drives D, the state output is Q
ff ff_R(next_R, R, CLK);
ff ff_Y(next_Y, Y, CLK);
ff ff_G(next_G, G, CLK);
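The three `assign` equations can be sanity-checked against the timing diagram with a cycle-by-cycle software model. This is a sketch, not a Verilog simulation: each loop iteration stands for one positive clock edge, and the input sequence is made up for illustration.

```python
def next_state(state, rst, change):
    """Mirror of the three assign statements; state is (R, Y, G)."""
    r, y, g = state
    next_r = 1 if rst else (y if change else r)
    next_y = 0 if rst else (g if change else y)
    next_g = 0 if rst else (r if change else g)
    return (next_r, next_y, next_g)


def simulate(inputs, state=(0, 0, 0)):
    """Apply one (rst, change) pair per clock edge; return RYG after each."""
    trace = []
    for rst, change in inputs:
        state = next_state(state, rst, change)   # flip-flops load on the edge
        trace.append("".join(str(bit) for bit in state))
    return trace


# Reset, change, hold, change, change
trace = simulate([(1, 0), (0, 1), (0, 0), (0, 1), (0, 1)])
print(trace)
```

With one-hot encoding, exactly one of R, Y, G is 1 after reset, and a change rotates that single 1 from red to green to yellow and back to red.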
77
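Logic Diagram: Traffic Light Controller
[Figure: the state diagram (Rst == 1 to RYG = 100; Change == 1 steps through 001 and 010) next to the logic diagram: Rst and Change feed the next-state combinational logic, which drives the D inputs of the R, G, and Y flip-flops]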
78
In conclusion -- Design Descriptions
Schematics: visually
coherent logic structure.
Timing diagrams: logic in motion.
Verilog: Precise semantics
and structure.
79
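Six Classes of Computers
• Supercomputers: produced by only a few companies worldwide; the fastest reaches 1.4 trillion operations per second and is built from 9,000 CPUs. Examples: Cray-1, Cray-2, Cray-3, and China's Yinhe (银河) I, II, and III. China's Shenwei (神威) currently reaches 348 billion operations per second.
• Mini-supercomputers: similar in function to supercomputers but relatively cheaper, and developing very rapidly. The C series from the US company Convex is representative.
• Mainframes: used by large and medium-sized enterprises and institutions as the host of a computing center, with host resources scheduled centrally. Representative products include the IBM 360, 370, and 4300.
• Minicomputers: meet departmental needs and serve small enterprises and institutions. Typical products include the IBM AS/400, the DEC VAX series, and China's Taiji (太极).
• Workstations: used in specialized professional fields such as image processing and computer-aided design. Typical products include HP Apollo and Sun workstations.
• Microcomputers: for personal or home use; the PC (personal computer), low in price.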
80
IBM System/360
81
Supercomputers of the World
1. IBM: Seaborg
6,080 CPUs
Maximum average speed 7.304 TF (10^12)
82
Supercomputers
2. IBM: ASCI White
8,192 CPUs
Maximum average speed 7.304 TF (10^12)
83
Supercomputers
3. Linux NetworX: MCR Linux Cluster
2,304 CPUs
Maximum average speed 7.634 TF (10^12)
84
Supercomputers
4. HP: ASCI Q
4,096 CPUs
Maximum average speed 13.88 TF (10^12)
85
Supercomputers
5. NEC: Earth Simulator
5,120 CPUs
Maximum average speed 35.86 TF (10^12)
86
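Tianhe-2
• The Tianhe-2 supercomputer system, developed by the National University of Defense Technology (NUDT), ranked first with a peak speed of 54.9 petaflops and a sustained speed of 33.9 petaflops (double-precision floating point), becoming the world's fastest supercomputer. In November 2010, Tianhe-1 had reached a peak speed of 4.7 petaflops, putting China at the top of the supercomputing world for the first time.
• On November 18, 2013, the international TOP500 organization published its latest list of the world's 500 fastest supercomputers. Tianhe-2 again took first place, nearly twice as fast as the second-place US system Titan. US experts predicted that Tianhe-2 would remain the world's fastest supercomputer for another year.
• On the TOP500 list published on June 23, 2014, Tianhe-2 took first place for the third consecutive time, again nearly twice as fast as second-place Titan.
• Computing capability: in a LINPACK run using 14,336 nodes with a total of 50 GB of memory, the theoretical performance was 49.19 Pflops and the measured performance was 30.65 Pflops, an efficiency of 62.3%. This efficiency is not especially high, so there is considerable room for optimization.
• The efficiency may also be limited by insufficient bandwidth, since the Xeon Phi supports only PCI Express 2.0.
• The performance components of Tianhe-2 (processors, memory, interconnect) draw 17.6 MW in total for 30.65 PFlops of computing capability; by this calculation the performance per watt is 1.935 Gflops, a performance/power ratio that would rank in the top five of the TOP500, which is excellent overall.
• The overall system power is 17.6 MW, and this figure does not include cooling such as the water-cooling system; with cooling included, total power would reach 24 MW.
• Tianhe-2 was developed by 280 people over more than two years at a cost of about 100 million US dollars.
Inside a Supercomputer
Chapter 1: Key Points to Learn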
• Performance
• Benchmark
• CPI
89
Homework
• Readings:
– D. Brooks, P. Bose, S. Schuster, H. Jacobson, P. Kudva, A.
Buyuktosunoglu, J.D. Wellman, V. Zyuban, M. Gupta, and P.
Cook, “Power-Aware Microarchitecture: Design and Modeling
Challenges for Next-Generation Microprocessors,” IEEE Micro,
Nov/Dec, 2000.
– T. Mudge, “Power: A First-Class Architectural Design
Constraint,” IEEE Computer, 2001.
– Read Chapter 1, and Chapter 2.1-2.4, then Appendix B.
90
Acknowledgements
• These slides contain material from courses:
– UCBerkeley CS152.
– Stanford EE108B
91
