computer organization and design
Computer
Organization AND Design
The Hardware/Software Interface
Chapter 1
Computer Abstractions
and Technology
Xin LI (李新)
Shandong University
Contents of Chapter 1
1.1 Introduction
1.2 Below your program
1.3 Under the covers
1.4 Performance
1.5 Power wall
1.6 The Sea Change
1.7 Real Stuff: Manufacturing Chips
1.1 Introduction
Computers have led to a third revolution for
civilization
The following applications used to be “computer
science fiction”
Automatic teller machines
Computers in automobiles
Laptop computers
Human genome project
World Wide Web
?
Tomorrow’s science fiction computer
applications
Cashless society
Digital cash from 2004
failed
Automated intelligent highways
ITS from 2003
failed
Genuinely ubiquitous computing
Embedded system from 1999 ?
Will mobile phones go kilo-core?
GPU: 1600 cores
Cloud computing
The influence of hardware on software
In the past
Memory size was very small
Programmers must minimize memory space to
make programs fast
Nowadays
The hierarchical nature of memories
The parallel nature of processors
Programmers must understand computer
organization more
Computer major
Theory/Software
Hardware/System
Organization, architecture…
Application
Algorithm, Language principle…
Database, Web, Embedded systems, graphics, …
SCI categories
HARDWARE & ARCHITECTURE
ARTIFICIAL INTELLIGENCE
CYBERNETICS
INFORMATION SYSTEMS
INTERDISCIPLINARY APPLICATIONS
SOFTWARE ENGINEERING
THEORY & METHODS
Hardware vs. Software
Which develops faster?
Round 1: hardware wins (machine code/ASM)
Round 2: software wins (C/C++/Java)
Round 3: hardware wins (multicore/manycore)
Round 4: ?
Why do we need to learn hardware?
CS vs. EE
What distinguishes a professionally trained person
from graduates of other majors?
Programming skill?
Tools
What is the threshold for non-computer majors
to work in IT?
Classes of Computer applications
Personal Computer
E.g. Desktop, laptop
Server
High performance
E.g. Mainframes, minicomputers,
supercomputers, data center
Application
WWW, search engine, weather broadcast
Embedded Computers
a computer system with a dedicated function within a larger
mechanical or electrical system
E.g. Cell phone, microprocessors in cars/ television
Embedded computers in a car
Growth of Sales of Embedded Computers
1.2 Below your programs
A simplified view of hardware and software as
hierarchical layers
[Figure: concentric layers, from outermost to innermost: applications software, systems software, hardware]
1.2 Below your programs
Systems software
aimed at programmers
E.g. operating systems, databases, compilers
Applications software
aimed at users
E.g. Word, IE, QQ, WeChat
Computer Language and Software System
Computer language
Computers only understand electrical signals
Binary numbers express machine instructions
e.g. 1000110010100000 means to add two numbers
Easiest signals: on and off
Very tedious to write
Assembly language
Symbolic notations
e.g. add a, b, c #a=b+c
The assembler translates them into machine
instruction
Programmers have to think like the machine
The Instruction Set Architecture (ISA)
software
instruction set architecture
hardware
The interface description separating the
software and hardware
High-level programming language
Notation closer to natural language
The compiler translates them into assembly language
statements
Advantages over assembly language
Programmers can think in a more natural language
Improved programming productivity
Programs can be independent of hardware
Subroutine library ---- reusing programs
Which one is faster?
Asm, C, C++, Java
In general, the lower the level, the faster the program can run
1.3 Under the covers
Mouse
The mouse, 1968
The mechanical version
Moving the mouse rolls the large ball
inside
The ball makes contact with an x-wheel and a y-wheel
The rotation of the wheels determines the distance
and direction the mouse moves
The photoelectric version
Better orientation and better
precision
Father of the mouse: Douglas Engelbart (1925-2013)
Display
CRT (raster cathode ray tube) display
Scan an image one line at a time, 30 to 75
times / s
Pixels and the bit map, 512×340 to
1560×1280
The more bits per pixel, the more colors to be
displayed
LCD
(liquid crystal display)
Thin and low-power
The LCD pixel is not the source of light
Rod-shaped molecules in a liquid that form
a twisting helix that bends light entering the
display
Hardware support for graphics ---- raster
refresh buffer (frame buffer) to store bit map
Goal of bit map ---- to faithfully represent what
is on the screen
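As a rough illustration of the bit map idea above, the Python sketch below estimates how much frame buffer memory one screen needs and how many colors a given pixel depth allows. The function names and the chosen bit depths are assumptions made for illustration; only the resolutions 512×340 and 1560×1280 come from the slide.

```python
def color_count(bits_per_pixel: int) -> int:
    """Number of distinct colors representable with the given pixel depth."""
    return 2 ** bits_per_pixel


def frame_buffer_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Bytes needed to hold one full-screen bit map, rounded up to whole bytes."""
    total_bits = width * height * bits_per_pixel
    return (total_bits + 7) // 8


if __name__ == "__main__":
    # Resolutions are from the slide; the bit depths (1 and 8) are assumed.
    print(frame_buffer_bytes(512, 340, 1))    # monochrome, 1 bit per pixel
    print(frame_buffer_bytes(1560, 1280, 8))  # 8 bits per pixel
    print(color_count(8))                     # colors available at 8 bits per pixel
```

More bits per pixel means more colors but a proportionally larger frame buffer.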
Frame buffer
[Figure: frame buffer bit map and the corresponding raster scan CRT display, both labeled with coordinates X0, X1 and Y0, Y1]
The System Unit
What are common components inside
the system unit?
Processor
Memory module
Expansion cards
• Sound card
• Modem card
• Video card
• Network interface card
Ports and connectors
Motherboard and the hardware on it
Motherboard
Thin, green, plastic, covered with dozens of small
rectangles which contain integrated circuits (chips)
Three pieces: the piece connecting to the I/O devices,
memory, and processor
Memory
Place to keep running programs and the data they need
Each memory board contains 8 integrated circuits
DRAM and cache
Processor
Adds numbers, tests numbers, signals I/O devices to
activate, and so on
CPU (central processing unit)
Programming platform of the motherboard
UEFI, ASM/C/C++
What is the motherboard?
Close-up of PC motherboard
[Figure labels: audio/MIDI; four ISA card slots; four PCI card slots; parallel/serial ports; processor; four SIMM slots; two IDE connectors]
What is the difference between Slot 0 … Slot 3?
Speed: 0 > 1 > 2 > 3
Primary > slave
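[Photo labels: CPU, memory module, CPU cooling fan, power supply, system case]
[Photo labels: floppy drive, graphics card, hard disk, optical drive, sound card]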
The five classic components of a computer
Abstractions
Lower-level details are hidden to higher levels
Instruction set architecture ---- the interface between
hardware and lowest-level software
Many implementations of varying cost and performance
can run identical software
A safe place for data ---- secondary memory
Main memory is volatile
Secondary memory is nonvolatile
Below the Program
High-level language program (in C)
    swap (int v[], int k)
    . . .
C compiler
Assembly language program (for MIPS)
    swap:
        sll $2, $5, 2
        add $2, $4, $2
        lw  $15, 0($2)
        lw  $16, 4($2)
        sw  $16, 0($2)
        sw  $15, 4($2)
        jr  $31
assembler
Machine (object) code (for MIPS)
    000000 00000 00101 0001000010000000
    000000 00100 00010 0001000000100000
    100011 00010 01111 0000000000000000
    100011 00010 10000 0000000000000100
    101011 00010 10000 0000000000000000
    101011 00010 01111 0000000000000100
    000000 11111 00000 0000000000001000
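To make the link between the assembly and the object code concrete, here is a minimal Python sketch that packs MIPS instruction fields into 32-bit words. It covers only the R-type and I-type layouts used by swap; the helper names are hypothetical and this is not a full assembler.

```python
def encode_r_type(rs: int, rt: int, rd: int, shamt: int, funct: int) -> str:
    """Pack an R-type instruction (opcode 0) into a 32-bit binary string."""
    word = (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct
    return format(word, "032b")


def encode_i_type(opcode: int, rs: int, rt: int, imm: int) -> str:
    """Pack an I-type instruction (opcode, rs, rt, 16-bit immediate)."""
    word = (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)
    return format(word, "032b")


if __name__ == "__main__":
    # sll $2, $5, 2   -> rt=5, rd=2, shamt=2, funct=0
    print(encode_r_type(rs=0, rt=5, rd=2, shamt=2, funct=0b000000))
    # add $2, $4, $2  -> rs=4, rt=2, rd=2, funct=0b100000
    print(encode_r_type(rs=4, rt=2, rd=2, shamt=0, funct=0b100000))
    # lw $15, 0($2)   -> opcode=0b100011, base rs=2, rt=15, offset 0
    print(encode_i_type(0b100011, rs=2, rt=15, imm=0))
```

Each printed word can be compared, field by field, with the corresponding row of the object code.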
Input Device Inputs Object Code
[Figure: block diagram of devices (input, output, network), memory, and the processor (control and datapath); the seven object-code words of swap arrive through the input device]
Object Code Stored in Memory
[Figure: block diagram of devices (input, output, network), memory, and the processor (control and datapath); the seven object-code words of swap are held in memory]
Processor Fetches an Instruction
Processor fetches an instruction from memory
[Figure: block diagram of devices (input, output, network), memory, and the processor (control and datapath); the object code sits in memory and one instruction is fetched into the processor]
Control Decodes the Instruction
Control decodes the instruction to determine what to execute
[Figure: block diagram of devices (input, output, network), memory, and the processor (control and datapath); control holds the instruction 000000 00100 00010 0001000000100000]
Datapath Executes the Instruction
Datapath executes the instruction as directed by
control
[Figure: block diagram of devices (input, output, network), memory, and the processor (control and datapath); control holds 000000 00100 00010 0001000000100000, and the datapath adds the contents of Reg #4 and Reg #2 and puts the result in Reg #2]
Integrated Circuits
Relative performance / unit cost of technologies used in computers

Year   Technology used in computers           Relative performance / unit cost
1951   Vacuum tube                            1
1965   Transistor                             35
1975   Integrated circuit                     900
1995   Very large-scale integrated circuit    2,400,000

Scale of integration:
1962, SSI (Small-Scale Integration), 12 transistors
1966, MSI (Medium-Scale Integration), 100-1k transistors
1967-1973, LSI (Large-Scale Integration), 1k~100k transistors
1977, VLSI (Very Large-Scale Integration), 30 mm², 150k transistors
1993, ULSI (Ultra Large-Scale Integration), 16M flash and 256M DRAM, integrating 10M transistors
1994, GSI (Giga-Scale Integration), 1G DRAM, integrating 100M transistors
2007: 2 TFLOPS 80-core CPU
Growth of capacity per DRAM chip
over time
[Figure: Kbit capacity (log scale, 10 to 100,000) versus year of introduction (1976 to 1996); labeled points 16K, 64K, 256K, 1M, 4M, 16M, 64M]
History of Computer Development
1946 ENIAC (Electronic Numerical Integrator and Computer)
History of Computer Development
The first electronic computers
ENIAC (Electronic Numerical Integrator and Computer)
J. Presper Eckert and John Mauchly
Publicly known in 1946
30 tons, 80 feet long, 8.5 feet high, several feet wide
18,000 vacuum tubes
EDVAC (Electronic Discrete Variable Automatic Computer)
John von Neumann’s memo about stored-program
computer
von Neumann Computer
EDSAC (Electronic Delay Storage Automatic Calculator)
Operational in 1949
First full-scale, operational, stored-program
computer in the world
John Atanasoff’s small-scale electronic computer in
the early 1940s
A special-purpose machine by Konrad Zuse in
Germany
Colossus built in 1943
Harvard architecture
Whirlwind project
Commercial Developments
Eckert-Mauchly Computer Corporation
Formed in 1947
$1 million for each of the 48 computers
IBM computers
First one, the IBM 701, shipped in 1952
Investing $5 billion for System/360 in 1964
Digital Equipment Corporation (DEC)
The first commercial minicomputer PDP-8 in 1965
Low-cost design, under $20,000
CDC 6600
The first supercomputer, built in 1963
Cray Research, Inc.
Cray-1 in 1976
The fastest, the most expensive, the best cost/performance
for scientific programs
Personal computer
Apple II
In 1977
Low cost, high volume, high reliability
IBM Personal Computer
Announced in 1981
Best-selling computer of any kind
Microprocessors of Intel and operating systems of
Microsoft became popular
Computer Generations
First generation
1950-1959, vacuum tubes, commercial electronic
computer
Second generation
1960-1968, transistors, cheaper computers
Third generation
1969-1977, integrated circuit, minicomputer
Fourth generation
1978-1997, LSI and VLSI, PCs and workstations
Fifth generation
1998-?, miniaturization and very large scale
Self-study: new directions in computer hardware
Multicore
From 2006
IBM/Sun/AMD/Intel
Number of cores: 2, 4, 8, 16, 48
Software: adequately prepared?
Embedded system
Embed a computer into an electronic product
pervasive computing/Ubiquitous computing
I/O
Device innovation
WII, PS3, Xbox 360
Communication
3G/4G/WIMAX
Computer performance tools
CPU
Memory
DISK
Task manager
Service
System info
1.4 Performance
Performance metrics:
Response time, wall-clock time, or elapsed time (execution time)
The time between the start and the completion of an event
CPU time
The time the CPU spends computing, not including time
spent waiting
Throughput
The total amount of work done in a given time.
“X is faster than Y”
The execution time on Y is longer than that on X.
“X is n times faster than Y”
Execution time on Y / Execution time on X = n
“The throughput of X is 1.3 times higher than Y”
The number of tasks completed per unit time on machine
X is 1.3 times the number completed on Y.
Example
If computer A runs a program in 10 seconds and
computer B runs the same program in 15 seconds,
how much faster is A than B?
The performance ratio is 15/10=1.5
A is therefore 1.5 times faster than B
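The same calculation as a small Python sketch; the function name is an illustrative choice, and the two times are those of the example.

```python
def times_faster(time_slower: float, time_faster: float) -> float:
    """Performance ratio n such that the faster machine is n times faster."""
    return time_slower / time_faster


if __name__ == "__main__":
    time_a = 10.0  # seconds on computer A
    time_b = 15.0  # seconds on computer B
    print(f"A is {times_faster(time_b, time_a):.1f} times faster than B")
```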
Measuring Performance
Wall-clock time, response time, or elapsed time
Includes disk accesses, memory accesses, input/output
activities, operating system overhead: everything.
CPU time= user CPU time+ system CPU time.
User CPU time: the CPU time spent in a program itself
System CPU time: the CPU time spent in the operating
system
Clock
Clock cycle (also tick): the time for one clock period, e.g. 250 ps (picoseconds)
Clock rate: the number of clock cycles per second, e.g. 4 GHz (gigahertz)
Reciprocal relationship?
Clock cycle time and clock rate are inverses: 1 / 250 ps = 4 GHz.
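A short Python sketch of the reciprocal relationship, using the 250 ps and 4 GHz figures from the slide. The function names and the choice of picoseconds and hertz as units are assumptions for illustration.

```python
PS_PER_SECOND = 1e12  # picoseconds in one second


def clock_rate_hz(cycle_time_ps: float) -> float:
    """Clock rate in Hz for a clock cycle time given in picoseconds."""
    return PS_PER_SECOND / cycle_time_ps


def cycle_time_ps(rate_hz: float) -> float:
    """Clock cycle time in picoseconds for a clock rate given in Hz."""
    return PS_PER_SECOND / rate_hz


if __name__ == "__main__":
    print(f"{clock_rate_hz(250) / 1e9:.1f} GHz")  # rate for a 250 ps cycle
    print(f"{cycle_time_ps(4e9):.0f} ps")         # cycle time at 4 GHz
```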
Example
One program runs in 10 seconds on computer A,
which has a 2 GHz clock. Computer B requires
1.2 times as many clock cycles as computer A for
this program. To run this program in 6 seconds,
what clock rate should the computer B supply?
CPU clock cycles_A = 10 × 2×10⁹ = 2×10¹⁰
Clock rate_B = 1.2 × 2×10¹⁰ / 6 = 4×10⁹ Hz = 4 GHz
To run the program in 6 seconds, B must have twice
the clock rate of A.
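The steps of this example can be checked with a few lines of Python. The function and variable names are chosen here for illustration; the numbers are those given in the example.

```python
def required_clock_rate(time_a_s: float, rate_a_hz: float,
                        cycle_ratio_b: float, target_time_b_s: float) -> float:
    """Clock rate B needs, given A's run time and rate and B's relative cycle count."""
    cycles_a = time_a_s * rate_a_hz      # CPU clock cycles on A
    cycles_b = cycle_ratio_b * cycles_a  # B needs 1.2 times as many cycles
    return cycles_b / target_time_b_s    # cycles per second required on B


if __name__ == "__main__":
    rate_b = required_clock_rate(10.0, 2e9, 1.2, 6.0)
    print(f"Clock rate of B: {rate_b / 1e9:.1f} GHz")
```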
Instruction Performance
Clock cycles per instruction (CPI): average number
of clock cycles per instruction for a program
Different instructions may take different amounts of time
depending on what they do.
CPI provides one way of comparing two different
implementations of the same instruction set
architecture.
Instruction set architecture (ISA)
The number of instructions executed for a program is
the same when the program runs on two different
implementations of the same instruction set architecture.
CPU Performance Equation
CPU time=Instruction count ×CPI×Clock cycle time
CPU time=Instruction count ×CPI/Clock rate
Example
Suppose we have two implementations of the same
instruction set architecture. Computer A has a clock cycle
time of 250 ps and a CPI of 2.0 for some program, and
computer B has a clock cycle time of 500 ps and a CPI of
1.2 for the same program. Which computer is faster for
this program and by how much?
I: the number of instructions for the program
CPU time_A = I × 2.0 × 250 ps = 500 × I ps
CPU time_B = I × 1.2 × 500 ps = 600 × I ps
CPU time_B / CPU time_A = 600 / 500 = 1.2
Computer A is 1.2 times as fast as computer B for
this program.
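A minimal Python sketch of the CPU performance equation applied to this example. The instruction count of one million is an arbitrary assumption, since it cancels in the ratio; the function name is illustrative.

```python
def cpu_time(instruction_count: float, cpi: float, cycle_time_s: float) -> float:
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_s


if __name__ == "__main__":
    instructions = 1_000_000  # assumed; any positive value gives the same ratio
    time_a = cpu_time(instructions, 2.0, 250e-12)
    time_b = cpu_time(instructions, 1.2, 500e-12)
    print(f"A is {time_b / time_a:.1f} times as fast as B")
```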
Choosing Programs to Evaluate Performance
Five levels of programs:
Real applications
Modified (or scripted) applications
Kernels
Toy benchmarks
Synthetic benchmarks
Desktop Benchmarks
CPU-intensive benchmarks
SPEC89
SPEC92
SPEC95
SPEC2000
SPEC2006
graphics-intensive benchmarks
SPEC2000
SPECviewperf
is used for benchmarking systems supporting the OpenGL graphics library
SPECapc
consists of applications that make extensive use of graphics.
Server Benchmarks
SPECrate: processing rate of a multiprocessor
SPECSFS: file server benchmark
SPECWeb: Web server benchmark
Transaction-processing (TP) benchmarks
TPC benchmarks: Transaction Processing Performance Council
TPC-A, 1985
TPC-C, 1992
TPC-H, TPC-R, TPC-W
Embedded Benchmarks
EDN Embedded Microprocessor Benchmark Consortium
(or EEMBC, pronounced “embassy”).
Quantitative Principles
Make the Common Case Fast
Perhaps it is the most important and pervasive
principle of computer design.
A fundamental law, called Amdahl’s Law, can be
used to quantify this principle.
Amdahl’s Law
states that the performance improvement to be
gained from using some faster mode of execution is
limited by the fraction of the time the faster mode
can be used.
Fraction_enhanced: the fraction of the computation time in the
original machine that can be converted to
take advantage of the enhancement
Speedup_enhanced: the improvement gained by the enhanced
execution mode; that is, how much faster
the task would run if the enhanced mode
were used for the entire program
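Writing Fraction_enhanced for the first quantity above and Speedup_enhanced for the second, Amdahl's Law is commonly stated as follows. This is the standard textbook form, given here as a reconstruction because the slide's own equations are not present in this transcript.

```latex
\[
\text{Execution time}_{\text{new}}
  = \text{Execution time}_{\text{old}} \times
    \left( (1 - \text{Fraction}_{\text{enhanced}})
         + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right)
\]
\[
\text{Speedup}_{\text{overall}}
  = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}}
  = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}})
         + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}
\]
```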
Example 1.2
Suppose that we are considering an enhancement to the
processor of a server system used for Web serving. The
new CPU is 10 times faster on computation in the Web
serving application than the original processor.
Assuming that the original CPU is busy with
computation 40% of the time and is waiting for I/O 60%
of the time, what is the overall speedup gained by
incorporating the enhancement?
Answer
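The worked answer is not present in this transcript. A minimal sketch of the calculation, assuming the usual form of Amdahl's Law, Speedup_overall = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced), with Fraction_enhanced = 0.4 and Speedup_enhanced = 10 taken from the example (the function name is illustrative):

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only part of the execution time is accelerated."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


if __name__ == "__main__":
    # 40% of the time is computation (enhanced 10x); 60% is I/O wait (unchanged).
    overall = amdahl_speedup(0.4, 10.0)
    print(f"Overall speedup: {overall:.2f}")
```

Because 60% of the time is spent waiting for I/O, the overall gain stays far below the factor of 10 achieved on computation alone.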
Example 1.3
A common transformation required in graphics engines is square
root. Implementations of floating-point (FP) square root vary
significantly in performance, especially among processors
designed for graphics. Suppose FP square root (FPSQR) is
responsible for 20% of the execution time of a critical graphics
benchmark. One proposal is to enhance the FPSQR hardware and
speed up this operation by a factor of 10. The other alternative is
just to try to make all FP instructions in the graphics processor
run faster by a factor of 1.6; FP instructions are responsible for a
total of 50% of the execution time for the application. The
design team believes that they can make all FP instructions run
1.6 times faster with the same effort as required for the fast
square root. Compare these two design alternatives.
Answer
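The worked answer is not present in this transcript. A sketch of the comparison, assuming the usual form of Amdahl's Law and the fractions and speedups stated in the example (the function and variable names are illustrative):

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only part of the execution time is accelerated."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


if __name__ == "__main__":
    # Alternative 1: FPSQR (20% of execution time) made 10 times faster.
    speedup_fpsqr = amdahl_speedup(0.2, 10.0)
    # Alternative 2: all FP instructions (50% of execution time) made 1.6 times faster.
    speedup_fp = amdahl_speedup(0.5, 1.6)
    print(f"Speedup with faster FPSQR: {speedup_fpsqr:.2f}")
    print(f"Speedup with faster FP:    {speedup_fp:.2f}")
```

Improving all FP instructions comes out slightly ahead because it applies to a larger fraction of the execution time.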
The CPU Performance Equation
CPU performance is dependent upon three
characteristics:
clock cycle (or rate)
clock cycles per instruction
and instruction count.
It is difficult to change one parameter in complete
isolation from others because the basic
technologies involved in changing each
characteristic are interdependent:
Clock cycle time: hardware technology and organization
CPI: organization and instruction set architecture
Instruction count: instruction set architecture
and compiler technology
Example 1.4: Suppose we have made the
following measurements:
Frequency of FP operations (other than FPSQR) = 25%
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR= 2%
CPI of FPSQR = 20
Assume that the two design alternatives are to decrease the
CPI of FPSQR to 2 or to decrease the average CPI of all
FP operations to 2.5. Compare these two design
alternatives using the CPU performance equation.
Answer
Since the CPI of the overall FP enhancement is slightly lower, its
performance will be marginally better.
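The intermediate arithmetic is not present in this transcript. One way to reproduce it, sketched in Python, assumes the original CPI is the frequency-weighted sum 25% × 4.0 + 75% × 1.33 (that is, the 25% FP share with average CPI 4.0 is treated as covering all FP work) and that instruction count and clock rate are unchanged, so speedup equals the ratio of CPIs. Variable names are illustrative.

```python
# Measurements from the example
FP_FREQ, FP_CPI = 0.25, 4.0        # FP operations and their average CPI
OTHER_FREQ, OTHER_CPI = 0.75, 1.33 # all other instructions
FPSQR_FREQ, FPSQR_CPI = 0.02, 20.0 # FP square root

# Original CPI as a frequency-weighted average (assumption stated above)
cpi_original = FP_FREQ * FP_CPI + OTHER_FREQ * OTHER_CPI

# Alternative 1: lower the CPI of FPSQR from 20 to 2
cpi_new_fpsqr = cpi_original - FPSQR_FREQ * (FPSQR_CPI - 2.0)

# Alternative 2: lower the average CPI of all FP operations to 2.5
cpi_new_fp = OTHER_FREQ * OTHER_CPI + FP_FREQ * 2.5

# With instruction count and clock rate fixed, speedup is the CPI ratio
speedup_fpsqr = cpi_original / cpi_new_fpsqr
speedup_fp = cpi_original / cpi_new_fp

if __name__ == "__main__":
    print(f"CPI original:      {cpi_original:.2f}")
    print(f"CPI with new FPSQR: {cpi_new_fpsqr:.2f}")
    print(f"CPI with new FP:    {cpi_new_fp:.2f}")
    print(f"Speedup (new FP):   {speedup_fp:.2f}")
```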
This is the same speedup we obtained using Amdahl’s
Law:
Principle of Locality
Programs tend to reuse data and instructions
they have used recently.
a program spends 90% of its execution time in
only 10% of the code.
Temporal locality
states that recently accessed items are likely to be
accessed in the near future.
Spatial locality
says that items whose addresses are near one
another tend to be referenced close together in time.
Power Consumption Trends
Power=Dynamic power+ Leakage power
• Dynamic power ∝ activity × capacitance × voltage² × frequency
• Capacitance per transistor and voltage are decreasing,
but number of transistors and frequency are increasing at a faster rate
• Leakage power is also rising and will soon match dynamic
power
Power consumption is already around 100 W in some high-performance processors today
Power wall
Power = K × (Capacitive load) × (Voltage)² × (Frequency switched)
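To see why voltage matters most in this relation, the Python sketch below computes the power of a new design relative to an old one; the constant K cancels in the ratio. The 15% reductions in voltage and frequency are hypothetical numbers chosen for illustration, not figures from the slides.

```python
def relative_dynamic_power(cap_scale: float, volt_scale: float, freq_scale: float) -> float:
    """New power / old power for scaled capacitive load, voltage, and frequency."""
    return cap_scale * volt_scale ** 2 * freq_scale


if __name__ == "__main__":
    # Hypothetical design change: same capacitive load, voltage and frequency both cut by 15%.
    ratio = relative_dynamic_power(1.0, 0.85, 0.85)
    print(f"New design uses {ratio:.0%} of the original dynamic power")
```

Voltage enters squared, so lowering it gives a larger saving than an equal relative cut in frequency or capacitive load.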
1.7 Real Stuff: Manufacturing AMD Chips
AMD Barcelona
65nm
463 million transistors
each core has a 128KB
L1 cache and a 512KB
L2 cache, with all four
cores sharing a 2MB L3
cache
Wafers and Dies
The semiconductor silicon and the chip
manufacturing process
Manufacturing Process
• Silicon wafers undergo many processing steps so that
different parts of the wafer behave as insulators,
conductors, and transistors (switches)
• Multiple metal layers on the silicon enable connections
between transistors
• The wafer is chopped into many dies – the size of the die
determines yield and cost
Processor Technology Trends
• Shrinking of transistor sizes: 250 nm (1997),
130 nm (2002), 70 nm (2008), 35 nm (2014)
• Transistor density increases by 35% per year and die size
increases by 10-20% per year… functionality improvements!
• Transistor speed improves linearly with size (complex
equation involving voltages, resistances, capacitances)
• Wire delays do not scale down at the same rate as
transistor delays
Assignments
P56
1.1
1.3.1-1.3.3
1.3.1-1.4.3
END