COMPUTER ORGANIZATION AND DESIGN
The Hardware/Software Interface
5th Edition

Chapter 1
Computer Abstractions and Technology

§1.1 Introduction
The Computer Revolution

• Progress in computer technology
  • Underpinned by Moore's Law
• Makes novel applications feasible
  • Computers in automobiles
  • Cell phones
  • Human genome project
  • World Wide Web
  • Search engines
• Computers are pervasive
Chapter 1 — Computer Abstractions and Technology — 2
Classes of Computers

• Personal computers
  • General purpose, variety of software
  • Subject to cost/performance tradeoff
• Server computers
  • Network based
  • High capacity, performance, reliability
  • Range from small servers to building sized
Chapter 1 — Computer Abstractions and Technology — 3
Classes of Computers

• Supercomputers
  • High-end scientific and engineering calculations
  • Highest capability but represent a small fraction of the overall computer market
• Embedded computers
  • Hidden as components of systems
  • Stringent power/performance/cost constraints
Chapter 1 — Computer Abstractions and Technology — 4
The PostPC Era
Chapter 1 — Computer Abstractions and Technology — 5
The PostPC Era

• Personal Mobile Device (PMD)
  • Battery operated
  • Connects to the Internet
  • Hundreds of dollars
  • Smart phones, tablets, electronic glasses
• Cloud computing
  • Warehouse Scale Computers (WSC)
  • Software as a Service (SaaS)
  • Portion of software runs on a PMD and a portion runs in the Cloud
  • Amazon and Google
Chapter 1 — Computer Abstractions and Technology — 6
What is CSCI-365?

[Figure: levels (layers) of abstraction in a computer system.
  Software: Application (ex: browser); Compiler; Assembler; Operating System (Mac OSX)
  Hardware: Processor, Memory, I/O system (CSCI-263); Instruction Set Architecture; Datapath & Control; Digital Design; Circuit Design; transistors]

• Coordination of many levels (layers) of abstraction
What You Will Learn

• How programs are translated into the machine language
  • And how the hardware executes them
• The hardware/software interface
• What determines program performance
  • And how it can be improved
• How hardware designers improve performance
• What is parallel processing
Chapter 1 — Computer Abstractions and Technology — 8
Understanding Performance

• Algorithm
  • Determines number of operations executed
• Programming language, compiler, architecture
  • Determine number of machine instructions executed per operation
• Processor and memory system
  • Determine how fast instructions are executed
• I/O system (including OS)
  • Determines how fast I/O operations are executed
Chapter 1 — Computer Abstractions and Technology — 9

§1.2 Eight Great Ideas in Computer Architecture
Eight Great Ideas

• Design for Moore's Law
• Use abstraction to simplify design
• Make the common case fast
• Performance via parallelism
• Performance via pipelining
• Performance via prediction
• Hierarchy of memories
• Dependability via redundancy
Chapter 1 — Computer Abstractions and Technology — 10

§1.3 Below Your Program
Below Your Program

• Application software
  • Written in high-level language
• System software
  • Compiler: translates HLL code to machine code
  • Operating System: service code
    • Handling input/output
    • Managing memory and storage
    • Scheduling tasks & sharing resources
• Hardware
  • Processor, memory, I/O controllers
Chapter 1 — Computer Abstractions and Technology — 11
Levels of Program Code

• High-level language
  • Level of abstraction closer to problem domain
  • Provides for productivity and portability
• Assembly language
  • Textual representation of instructions
• Hardware representation
  • Binary digits (bits)
  • Encoded instructions and data
Chapter 1 — Computer Abstractions and Technology — 12
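As a rough, optional illustration of these levels (not the textbook's example), Python's standard dis module can show the lower-level, encoded instructions behind a high-level function:

# Minimal sketch: peek at the lower-level instructions behind a high-level statement.
# Uses only the Python standard library; bytecode details vary by Python version.
import dis

def swap(a, b):
    # High-level view: exchange two values
    return b, a

dis.dis(swap)                  # textual, assembly-like listing of the bytecode
print(swap.__code__.co_code)   # the raw encoded instruction bytes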
§1.4 Under the Covers
Components of a Computer

The BIG Picture
• Same components for all kinds of computer
  • Desktop, server, embedded

• Input/output includes
  • User-interface devices
    • Display, keyboard, mouse
  • Storage devices
    • Hard disk, CD/DVD, flash
  • Network adapters
    • For communicating with other computers
Chapter 1 — Computer Abstractions and Technology — 13
Opening the Box

[Figure callouts: capacitive multitouch LCD screen; 3.8 V, 25 watt-hour battery; computer board]
Chapter 1 — Computer Abstractions and Technology — 16
Inside the Processor (CPU)

• Datapath: performs operations on data
• Control: sequences datapath, memory, ...
• Cache memory
  • Small fast SRAM memory for immediate access to data
Chapter 1 — Computer Abstractions and Technology — 17
Inside the Processor

• Apple A5
Chapter 1 — Computer Abstractions and Technology — 18
Abstractions
The BIG Picture

• Abstraction helps us deal with complexity
  • Hide lower-level detail
• Instruction set architecture (ISA)
  • The hardware/software interface
• Application binary interface
  • The ISA plus system software interface
• Implementation
  • The details underlying the interface
Chapter 1 — Computer Abstractions and Technology — 19
A Safe Place for Data

• Volatile main memory
  • Loses instructions and data when power off
• Non-volatile secondary memory
  • Magnetic disk
  • Flash memory
  • Optical disk (CDROM, DVD)
Chapter 1 — Computer Abstractions and Technology — 20
Networks

• Communication, resource sharing, nonlocal access
• Local area network (LAN): Ethernet
• Wide area network (WAN): the Internet
• Wireless network: WiFi, Bluetooth
Chapter 1 — Computer Abstractions and Technology — 21

§1.5 Technologies for Building Processors and Memory
Technology Trends

• Electronics technology continues to evolve
  • Increased capacity and performance
  • Reduced cost

[Figure: DRAM capacity growth over time]

  Year   Technology                   Relative performance/cost
  1951   Vacuum tube                  1
  1965   Transistor                   35
  1975   Integrated circuit (IC)      900
  1995   Very large scale IC (VLSI)   2,400,000
  2013   Ultra large scale IC         250,000,000,000
Chapter 1 — Computer Abstractions and Technology — 22
Microprocessor Complexity

[Figure: number of transistors on an IC vs. year]

• Gordon Moore, Intel cofounder
• 2X transistors / chip every 1.5 years
• Called "Moore's Law"
Memory Capacity (Single-Chip DRAM)

[Figure: DRAM bits per chip vs. year, 1970–2000]

• Now 1.4X/yr, or 2X every 2 years
• 8000X since 1980!

  Year   Size (Mbit)
  1980   0.0625
  1983   0.25
  1986   1
  1989   4
  1992   16
  1996   64
  1998   128
  2000   256
  2002   512
  2004   1024 (1 Gbit)
  2006   2048 (2 Gbit)
Computer Technology – Dramatic Change!

• Memory
  • DRAM capacity: 2x / 2 years (since '96); 64x size improvement in last decade
• Processor
  • Speed: 2x / 1.5 years (since '85) [slowing!]; 100X performance in last decade
• Disk
  • Capacity: 2x / 1 year (since '97); 250X size in last decade
Performance Metrics

• Purchasing perspective: given a collection of machines, which has the
  • best performance?
  • least cost?
  • best cost/performance?
• Design perspective: faced with design options, which has the
  • best performance improvement?
  • least cost?
  • best cost/performance?
• Both require
  • a basis for comparison
  • a metric for evaluation
• Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors

§1.6 Performance
Defining Performance

• Which airplane has the best performance?

[Figure: bar charts comparing the Boeing 777, Boeing 747, BAC/Sud Concorde, and Douglas DC-8-50 on passenger capacity, cruising range (miles), cruising speed (mph), and passengers × mph]
Chapter 1 — Computer Abstractions and Technology — 27
Response Time and Throughput

• Response time
  • How long it takes to do a task
• Throughput
  • Total work done per unit time
    • e.g., tasks/transactions/… per hour
• How are response time and throughput affected by
  • Replacing the processor with a faster version?
  • Adding more processors?
• We'll focus on response time for now…
Chapter 1 — Computer Abstractions and Technology — 28
Relative Performance

• Define Performance = 1/Execution Time
• "X is n times faster than Y"

    Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n

• Example: time taken to run a program
  • 10s on A, 15s on B
  • Execution Time_B / Execution Time_A = 15s / 10s = 1.5
  • So A is 1.5 times faster than B
Chapter 1 — Computer Abstractions and Technology — 29
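A minimal sketch of the definition above, using the 10s/15s example (variable names are illustrative):

# Performance = 1 / Execution time, so
# Performance_X / Performance_Y = ExecTime_Y / ExecTime_X = n
exec_time_A = 10.0   # seconds on computer A
exec_time_B = 15.0   # seconds on computer B

n = exec_time_B / exec_time_A
print(f"A is {n:.1f} times faster than B")   # 1.5, matching the example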
Measuring Execution Time

• Elapsed time
  • Total response time, including all aspects
    • Processing, I/O, OS overhead, idle time
  • Determines system performance
• CPU time
  • Time spent processing a given job
    • Discounts I/O time, other jobs' shares
  • Comprises user CPU time and system CPU time
• Different programs are affected differently by CPU and system performance
Chapter 1 — Computer Abstractions and Technology — 30
CPU Clocking

• Operation of digital hardware governed by a constant-rate clock

[Figure: clock signal; within each clock period, data transfer and computation occur, then state is updated]

• Clock period: duration of a clock cycle
  • e.g., 250 ps = 0.25 ns = 250×10⁻¹² s
• Clock frequency (rate): cycles per second
  • e.g., 4.0 GHz = 4000 MHz = 4.0×10⁹ Hz
Chapter 1 — Computer Abstractions and Technology — 31
Review: Machine Clock Rate

• Clock rate (clock cycles per second, in MHz or GHz) is the inverse of clock cycle time (clock period): CC = 1 / CR

    10 nsec clock cycle          => 100 MHz clock rate
     5 nsec clock cycle          => 200 MHz clock rate
     2 nsec clock cycle          => 500 MHz clock rate
     1 nsec (10⁻⁹) clock cycle   =>   1 GHz (10⁹) clock rate
   500 psec clock cycle          =>   2 GHz clock rate
   250 psec clock cycle          =>   4 GHz clock rate
   200 psec clock cycle          =>   5 GHz clock rate
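A quick arithmetic check of the period/rate pairs above (plain Python, assuming nothing beyond CR = 1/CC):

# Clock rate is the inverse of clock period: 1 / (period in ns) gives GHz.
for period_ns in (10, 5, 2, 1, 0.5, 0.25, 0.2):
    rate_ghz = 1 / period_ns
    print(f"{period_ns} ns cycle -> {rate_ghz:g} GHz ({rate_ghz * 1000:g} MHz)")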
CPU Time

    CPU Time = CPU Clock Cycles × Clock Cycle Time
             = CPU Clock Cycles / Clock Rate

• Performance improved by
  • Reducing number of clock cycles
  • Increasing clock rate
  • Hardware designer must often trade off clock rate against cycle count
Chapter 1 — Computer Abstractions and Technology — 33
CPU Time Example

• Computer A: 2GHz clock, 10s CPU time
• Designing Computer B
  • Aim for 6s CPU time
  • Can do faster clock, but causes 1.2 × clock cycles
• How fast must Computer B clock be?

    Clock Rate_B = Clock Cycles_B / CPU Time_B = (1.2 × Clock Cycles_A) / 6s
    Clock Cycles_A = CPU Time_A × Clock Rate_A = 10s × 2GHz = 20 × 10⁹
    Clock Rate_B = (1.2 × 20 × 10⁹) / 6s = (24 × 10⁹) / 6s = 4GHz
Chapter 1 — Computer Abstractions and Technology — 34
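The same calculation as a short sketch (variable names are illustrative):

# Computer A: 2 GHz clock, 10 s CPU time. Computer B must finish in 6 s
# but needs 1.2x as many clock cycles. Find B's required clock rate.
clock_rate_A = 2e9     # Hz
cpu_time_A   = 10.0    # s
cpu_time_B   = 6.0     # s

clock_cycles_A = cpu_time_A * clock_rate_A     # 20e9 cycles
clock_cycles_B = 1.2 * clock_cycles_A          # 24e9 cycles
clock_rate_B   = clock_cycles_B / cpu_time_B   # Hz

print(f"Computer B needs a {clock_rate_B / 1e9:.0f} GHz clock")   # 4 GHz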
Instruction Count and CPI

    Clock Cycles = Instruction Count × Cycles per Instruction
    CPU Time = Instruction Count × CPI × Clock Cycle Time
             = Instruction Count × CPI / Clock Rate

• Instruction Count for a program
  • Determined by program, ISA and compiler
• Average cycles per instruction
  • Determined by CPU hardware
  • If different instructions have different CPI
    • Average CPI affected by instruction mix
Chapter 1 — Computer Abstractions and Technology — 35
CPI Example

• Computer A: Cycle Time = 250ps, CPI = 2.0
• Computer B: Cycle Time = 500ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?

    CPU Time_A = Instruction Count × CPI_A × Cycle Time_A
               = I × 2.0 × 250ps = I × 500ps                      ← A is faster…
    CPU Time_B = Instruction Count × CPI_B × Cycle Time_B
               = I × 1.2 × 500ps = I × 600ps
    CPU Time_B / CPU Time_A = (I × 600ps) / (I × 500ps) = 1.2     ← …by this much
Chapter 1 — Computer Abstractions and Technology — 36
CPI in More Detail

• If different instruction classes take different numbers of cycles:

    Clock Cycles = Σ (i = 1 to n) CPI_i × Instruction Count_i

• Weighted average CPI:

    CPI = Clock Cycles / Instruction Count
        = Σ (i = 1 to n) CPI_i × (Instruction Count_i / Instruction Count)
                                  └──── relative frequency ────┘
Chapter 1 — Computer Abstractions and Technology — 37
CPI Example

• Alternative compiled code sequences using instructions in classes A, B, C

  Class              A   B   C
  CPI for class      1   2   3
  IC in sequence 1   2   1   2
  IC in sequence 2   4   1   1

• Sequence 1: IC = 5
  • Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  • Avg. CPI = 10/5 = 2.0
• Sequence 2: IC = 6
  • Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  • Avg. CPI = 9/6 = 1.5
Chapter 1 — Computer Abstractions and Technology — 38
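The same bookkeeping in a short sketch (class CPIs and instruction counts taken from the table above):

# CPI per instruction class and per-sequence instruction counts
cpi = {"A": 1, "B": 2, "C": 3}
sequences = {"Sequence 1": {"A": 2, "B": 1, "C": 2},
             "Sequence 2": {"A": 4, "B": 1, "C": 1}}

for name, counts in sequences.items():
    ic = sum(counts.values())
    cycles = sum(cpi[c] * n for c, n in counts.items())
    print(f"{name}: IC = {ic}, clock cycles = {cycles}, avg CPI = {cycles / ic:.1f}")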
A Simple Example

  Op       Freq   CPI_i   Freq × CPI_i
                          base   better cache   branch pred.   dual ALU
  ALU      50%    1        .5        .5             .5            .25
  Load     20%    5       1.0        .4            1.0           1.0
  Store    10%    3        .3        .3             .3            .3
  Branch   20%    2        .4        .4             .2            .4
  Σ =                     2.2       1.6            2.0           1.95

• How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
    CPU time_new = 1.6 × IC × CC, so 2.2/1.6 means 37.5% faster
• How does this compare with using branch prediction to shave a cycle off the branch time?
    CPU time_new = 2.0 × IC × CC, so 2.2/2.0 means 10% faster
• What if two ALU instructions could be executed at once?
    CPU time_new = 1.95 × IC × CC, so 2.2/1.95 means 12.8% faster
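A sketch that reproduces the four CPI columns and the quoted speedups (the scenario labels follow the three questions above):

# Instruction mix (frequency) and cycles per instruction for each scenario
freq     = {"ALU": 0.50, "Load": 0.20, "Store": 0.10, "Branch": 0.20}
base_cpi = {"ALU": 1,    "Load": 5,    "Store": 3,    "Branch": 2}

def weighted_cpi(cpi):
    return sum(freq[op] * cpi[op] for op in freq)

scenarios = {
    "base":           base_cpi,
    "better cache":   {**base_cpi, "Load": 2},     # load time cut to 2 cycles
    "branch predict": {**base_cpi, "Branch": 1},   # one cycle shaved off branches
    "dual ALU":       {**base_cpi, "ALU": 0.5},    # two ALU instructions at once
}

base = weighted_cpi(base_cpi)                      # 2.2
for name, cpi in scenarios.items():
    new = weighted_cpi(cpi)
    print(f"{name:15s} CPI = {new:.2f}, speedup over base = {base / new:.3f}")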
Determinants of CPU Performance

    CPU time = Instruction_count × CPI × clock_cycle

                          Instruction_count   CPI   clock_cycle
  Algorithm
  Programming language
  Compiler
  ISA
  Core organization
  Technology

Determinants of CPU Performance

    CPU time = Instruction_count × CPI × clock_cycle

                          Instruction_count   CPI   clock_cycle
  Algorithm                       X            X
  Programming language            X            X
  Compiler                        X            X
  ISA                             X            X         X
  Core organization                            X         X
  Technology                                              X
Performance Summary
The BIG Picture

    CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)

• Performance depends on
  • Algorithm: affects IC, possibly CPI
  • Programming language: affects IC, CPI
  • Compiler: affects IC, CPI
  • Instruction set architecture: affects IC, CPI, Tc
Chapter 1 — Computer Abstractions and Technology — 43
§1.7 The Power Wall
Power Trends

[Figure: clock rate and power for Intel x86 microprocessors over time]

• In CMOS IC technology:

    Power = Capacitive load × Voltage² × Frequency
    (power grew ×40)          (5V → 1V)  (×1000)
Chapter 1 — Computer Abstractions and Technology — 44
Reducing Power

• Suppose a new CPU has
  • 85% of capacitive load of old CPU
  • 15% voltage and 15% frequency reduction

    P_new / P_old = [(C_old × 0.85) × (V_old × 0.85)² × (F_old × 0.85)] / [C_old × V_old² × F_old]
                  = 0.85⁴ ≈ 0.52

• The power wall
  • We can't reduce voltage further
  • We can't remove more heat
• How else can we improve performance?
Chapter 1 — Computer Abstractions and Technology — 45
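The same ratio as a tiny sketch:

# New CPU: 85% of the capacitive load, 15% lower voltage and frequency.
# P = C * V^2 * f, so the old C, V, f terms cancel in the ratio.
ratio = 0.85 * 0.85**2 * 0.85        # = 0.85**4
print(f"P_new / P_old = {ratio:.2f}")  # ~0.52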
§1.8 The Sea Change: The Switch to Multiprocessors
Uniprocessor Performance

[Figure: growth in uniprocessor performance over time]

• Constrained by power, instruction-level parallelism, memory latency
Chapter 1 — Computer Abstractions and Technology — 46
Multiprocessors

• Multicore microprocessors
  • More than one processor per chip
• Requires explicitly parallel programming
  • Compare with instruction level parallelism
    • Hardware executes multiple instructions at once
    • Hidden from the programmer
  • Hard to do
    • Programming for performance
    • Load balancing
    • Optimizing communication and synchronization
Chapter 1 — Computer Abstractions and Technology — 47
SPEC CPU Benchmark

• Programs used to measure performance
  • Supposedly typical of actual workload
• Standard Performance Evaluation Corp (SPEC)
  • Develops benchmarks for CPU, I/O, Web, …
• SPEC CPU2006
  • Elapsed time to execute a selection of programs
    • Negligible I/O, so focuses on CPU performance
  • Normalize relative to reference machine
  • Summarize as geometric mean of performance ratios
    • CINT2006 (integer) and CFP2006 (floating-point)

    Geometric mean = n-th root of ( ∏ (i = 1 to n) Execution time ratio_i )
Chapter 1 — Computer Abstractions and Technology — 48
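A sketch of the geometric-mean summary; the ratio values below are made up purely for illustration:

import math  # math.prod requires Python 3.8+

# Execution time ratios relative to the reference machine (illustrative numbers)
ratios = [12.5, 9.8, 20.1, 15.3]

geometric_mean = math.prod(ratios) ** (1 / len(ratios))
print(f"Summary (geometric mean of ratios) = {geometric_mean:.1f}")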
Semiconductor Technology

• Silicon: semiconductor
• Add materials to transform properties:
  • Conductors
  • Insulators
  • Switch
Chapter 1 — Computer Abstractions and Technology — 49
Manufacturing ICs

• Yield: proportion of working dies per wafer
Chapter 1 — Computer Abstractions and Technology — 50
CINT2006 for Intel Core i7 920
Chapter 1 — Computer Abstractions and Technology — 51
Intel Core i7 Wafer

• 300mm wafer, 280 chips, 32nm technology
• Each chip is 20.7 x 10.5 mm
Chapter 1 — Computer Abstractions and Technology — 52
SPEC Power Benchmark

• Power consumption of server at different workload levels
  • Performance: ssj_ops/sec
  • Power: Watts (Joules/sec)

    Overall ssj_ops per Watt = ( Σ (i = 0 to 10) ssj_ops_i ) / ( Σ (i = 0 to 10) power_i )
Chapter 1 — Computer Abstractions and Technology — 53
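A sketch of the overall metric; the eleven load-level measurements below are placeholders, not real benchmark data:

# ssj_ops and power measured at the 11 target load levels (0% .. 100%);
# these values are illustrative only.
ssj_ops = [0, 30000, 60000, 90000, 120000, 150000,
           180000, 210000, 240000, 270000, 300000]
power_w = [80, 95, 105, 115, 125, 135, 145, 160, 175, 190, 210]

overall = sum(ssj_ops) / sum(power_w)
print(f"Overall ssj_ops per Watt = {overall:.1f}")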
Integrated Circuit Cost

    Cost per die = Cost per wafer / (Dies per wafer × Yield)
    Dies per wafer ≈ Wafer area / Die area
    Yield = 1 / (1 + Defects per area × Die area / 2)²

• Nonlinear relation to area and defect rate
  • Wafer cost and area are fixed
  • Defect rate determined by manufacturing process
  • Die area determined by architecture and circuit design
Chapter 1 — Computer Abstractions and Technology — 54
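A sketch of the cost model with made-up wafer and defect numbers (purely illustrative):

# Illustrative inputs, not actual process data
wafer_cost       = 5000.0   # $ per wafer
wafer_area       = 70000.0  # mm^2 (roughly a 300 mm wafer)
die_area         = 100.0    # mm^2
defects_per_area = 0.002    # defects per mm^2 (hypothetical)

dies_per_wafer = wafer_area / die_area
yield_rate = 1 / (1 + defects_per_area * die_area / 2) ** 2
cost_per_die = wafer_cost / (dies_per_wafer * yield_rate)

print(f"dies/wafer ~ {dies_per_wafer:.0f}, yield = {yield_rate:.2f}, "
      f"cost/die ~ ${cost_per_die:.2f}")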
SPECpower_ssj2008 for Xeon X5650
Chapter 1 — Computer Abstractions and Technology — 55

§1.10 Fallacies and Pitfalls
Pitfall: Amdahl's Law

• Improving an aspect of a computer and expecting a proportional improvement in overall performance

    T_improved = T_affected / improvement factor + T_unaffected

• Example: multiply accounts for 80s out of 100s
  • How much improvement in multiply performance to get 5× overall?

    20 = 80/n + 20   →  Can't be done!

• Corollary: make the common case fast
Chapter 1 — Computer Abstractions and Technology — 56
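The same limit checked in code; no improvement factor n reaches a 5× overall speedup:

# T_improved = T_affected / n + T_unaffected; multiply takes 80 s of a 100 s run.
t_affected, t_unaffected = 80.0, 20.0

for n in (2, 5, 10, 100, 1e6):
    t_new = t_affected / n + t_unaffected
    print(f"n = {n:>9}: overall speedup = {100.0 / t_new:.2f}x")
# As n grows without bound, the speedup approaches 100/20 = 5x but never reaches it.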
Fallacy: Low Power at Idle

• Look back at i7 power benchmark
  • At 100% load: 258W
  • At 50% load: 170W (66%)
  • At 10% load: 121W (47%)
• Google data center
  • Mostly operates at 10% – 50% load
  • At 100% load less than 1% of the time
• Consider designing processors to make power proportional to load
Chapter 1 — Computer Abstractions and Technology — 57
Pitfall: MIPS as a Performance Metric

• MIPS: Millions of Instructions Per Second
  • Doesn't account for
    • Differences in ISAs between computers
    • Differences in complexity between instructions

    MIPS = Instruction count / (Execution time × 10⁶)
         = Instruction count / ((Instruction count × CPI / Clock rate) × 10⁶)
         = Clock rate / (CPI × 10⁶)

• CPI varies between programs on a given CPU
Chapter 1 — Computer Abstractions and Technology — 58
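A sketch showing how MIPS can mislead; the two machines and their instruction counts and CPIs are hypothetical:

# MIPS = clock rate / (CPI * 10^6); execution time = IC * CPI / clock rate.
# Same program compiled for two hypothetical ISAs on 4 GHz machines:
machines = {
    # name: (instruction count, CPI, clock rate in Hz) -- made-up values
    "Machine X": (4_000_000_000, 1.0, 4e9),   # many simple instructions
    "Machine Y": (1_500_000_000, 2.0, 4e9),   # fewer, more complex instructions
}

for name, (ic, cpi, rate) in machines.items():
    mips = rate / (cpi * 1e6)
    exec_time = ic * cpi / rate
    print(f"{name}: {mips:.0f} MIPS, execution time = {exec_time:.2f} s")
# X posts the higher MIPS (4000 vs 2000) yet takes longer (1.00 s vs 0.75 s).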

§1.9 Concluding Remarks
Concluding Remarks

• Cost/performance is improving
  • Due to underlying technology development
• Hierarchical layers of abstraction
  • In both hardware and software
• Instruction set architecture
  • The hardware/software interface
• Execution time: the best performance measure
• Power is a limiting factor
  • Use parallelism to improve performance
Chapter 1 — Computer Abstractions and Technology — 59