Computer Organization CS224

Computer Organization
Fall 2012
“Welcome to my course”
Moxtar Mohammadi
With thanks to M.J. Irwin, D. Patterson, and J. Hennessy for some lecture slide contents
Course Contents
Overview of computer technologies, instruction set architecture
(ISA), ISA design considerations, RISC vs. CISC, assembly and
machine language, translation and program start-up. Computer
arithmetic, arithmetic logic unit, floating-point numbers and their
arithmetic implementations. Processor design, data path and control
implementation, pipelining, hazards, pipelined processor design,
hazard detection and forwarding, branch prediction and exception
handling. Memory hierarchy, principles, structure, and performance
of caches, virtual memory, segmentation and paging. I/O devices,
I/O performance, interfacing I/O. Intro to multiprocessors, multicores,
and cluster computing.
Policies
Everything is on the Web site:
Numerical average will be calculated from:
• 4-5 homeworks 10%
• X pop quizzes 10%
• 2 projects 30%
• Midterm 20%
• Final exam 30%
TO PASS, you must:
• have exam average >= 35% (weighted average)
• have overall course performance that is passing
Introduction
• This course is all about how computers work
• But what do we mean by a computer?
– Different types: desktop, servers, embedded devices
– Different uses: automobiles, graphics, finance, genomics…
– Different manufacturers: Intel, Apple, IBM, Microsoft, Sun…
– Different underlying technologies and different costs
• Best way to learn:
– Focus on a specific instance and learn how it works
– While learning general principles and historical perspectives
Why learn this stuff?
• You want to call yourself a “computer engineer”
• You want to build software people use (need performance)
• You need to make a purchasing decision or offer “expert” advice
• Both Hardware and Software affect performance:
– Algorithm determines number of source-level statements
– Language/Compiler/Architecture determine number of machine
instructions (Chapter 2 and 3)
– Processor/Memory determine how fast instructions are executed
(Chapter 4 and 5)
– I/O and Number_of_Cores determine overall system performance
(Chapter 6 and 7)
Organization of a Computer
• Five classic components of a computer – input, output, memory,
datapath, and control
– datapath + control = processor
What is a computer?
• Components:
– input (mouse, keyboard, camera, microphone...)
– output (display, printer, speakers....)
– memory (caches, DRAM, SRAM, hard disk drives, Flash....)
– network (both input and output)
• Our primary focus: the processor (datapath and control)
– implemented using billions of transistors
– Impossible to understand by looking at each transistor
– We need...abstraction!
An abstraction omits unneeded detail,
helps us cope with complexity.
How do computers work?
• Each of the following abstracts everything below it:
– Applications software
– Systems software
– Assembly Language
– Machine Language
– Architectural Approaches: Caches, Virtual Memory, Pipelining
– Sequential logic, finite state machines
– Combinational logic, arithmetic circuits
– Boolean logic, 1s and 0s
– Transistors used to build logic gates (e.g. CMOS)
– Semiconductors/Silicon used to build transistors
– Properties of atoms, electrons, and quantum dynamics
• Notice how abstraction hides the detail of lower levels, yet gives a
useful view for a given purpose
Computer Architecture
[Figure: layers of a computer system, top to bottom: Application; Operating System; Compiler; Firmware; Instruction Set Architecture; Instr. Set Proc. and I/O system; Logic Design; Implementation; Circuit Design; Layout. A side label reads "Computer Architecture".]
The Instruction Set: a Critical Interface
[Figure: the instruction set architecture drawn as the interface between software and hardware]
Instruction Set Architecture
• A very important abstraction
– interface between hardware and low-level software
– standardizes instructions, machine language bit patterns, etc.
– advantage: different implementations of the same architecture
– disadvantage: sometimes prevents using new innovations
• Common instruction set architectures:
– IA-64, IA-32, PowerPC, MIPS, SPARC, ARM, and others
– All are multi-sourced, with different implementations for the same
ISA
Instruction Set Architecture (ISA)
• ISA, or simply architecture: the abstract interface between hardware and the
lowest level of software that encompasses all the information necessary to
write a machine language program, including instructions, registers,
memory access, IO, …
• ISA Includes
– Organization of storage
– Data types
– Encoding and representing instructions
– Instruction Set (i.e. opcodes)
– Modes of addressing data items/instructions
– Program visible exception handling
• ISA together with OS interface specifies the requirements for binary
compatibility across implementations (ABI: application binary interface)
Case Study: MIPS ISA
• Instruction Categories
– Load/Store
– Computational
– Jump and Branch
– Floating Point
– Memory Management
– Special
• Registers: R0 - R31, PC, HI, LO
• 3 Instruction Formats, 32 bits wide
  OP | rs | rt | rd | sa | funct
  OP | rs | rt | immediate
  OP | jump target
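As a rough illustration of how the fields of the first format fit into one 32-bit word, the sketch below packs and unpacks them. The field widths (6, 5, 5, 5, 5, 6 bits) are the standard MIPS32 layout rather than something stated on the slide, and the helper names encode_r and decode_r are made up for this example.

```python
# Sketch only: field widths follow the standard MIPS32 register format
# (op:6, rs:5, rt:5, rd:5, sa:5, funct:6); helper names are hypothetical.

def encode_r(op, rs, rt, rd, sa, funct):
    """Pack the six fields into one 32-bit instruction word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (sa << 6) | funct

def decode_r(word):
    """Split a 32-bit word back into the six fields."""
    return {
        "op": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "sa": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }

# add $2, $4, $2  ->  op=0, rs=4, rt=2, rd=2, sa=0, funct=0x20 (add)
word = encode_r(0, 4, 2, 2, 0, 0x20)
print(f"{word:032b}")
print(decode_r(word))
```

The other two formats reuse the same OP position and replace the low-order fields with a 16-bit immediate or a 26-bit jump target.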
Computer Organization
Logic Designer's View
• Capabilities & Performance Characteristics
of Principal Functional Units (FUs) (e.g.,
Registers, ALU, Shifters, Logic Units, ...)
• Ways in which these components are
interconnected
• Information flows between components
• Logic and means by which such information
flow is controlled
• Choreography of FUs to realize the ISA
• Register Transfer Level (RTL) Description
[Figure: the ISA level sitting on top of FUs & Interconnect]
Function Units in a Computer
Classes of Computers
• Desktop computers: Designed to deliver good performance to a
single user at low cost, usually executing third-party software and usually
incorporating a graphics display, a keyboard, and a mouse
• Servers: Used to run larger programs for multiple, simultaneous
users, typically accessed only via a network, with a
greater emphasis on dependability and (often) security
• Supercomputers: A high performance, high cost class of servers
with hundreds to thousands of processors, terabytes of memory
and petabytes of storage that are used for high-end scientific and
engineering applications
• Embedded computers (processors): A computer inside another
device, used for running one predetermined application
Digital Cell Phone--Front Side (Nokia 8260)
Growth in Embedded Processor Sales
(embedded growth >> desktop growth !!!)
• Where else are embedded processors found?
Embedded Processor Characteristics
The largest class of computers spanning the widest range of
applications and performance
• Often have minimum performance requirements.
• Often have stringent limitations on cost.
• Often have stringent limitations on power consumption.
• Often have low tolerance for failure.
In all these ways, embedded processors are very different than
supercomputers, servers, or desktops/laptops
Below the Program
[Figure: layers from Applications software through Systems software down to Hardware]
• System software
– Operating system – supervising program that interfaces the user’s
program with the hardware (e.g., Linux, MacOS, Windows)
• Handles basic input and output operations
• Allocates storage and memory
• Provides for protected sharing among multiple applications
– Compiler – translates programs written in a high-level language (e.g., C,
Java) into instructions that the hardware can execute
Below the Program
• High-level language program (in C)
void swap(int v[], int k)
{
    int temp;          /* holds v[k] while it is overwritten */
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
}
C compiler (one-to-many)
• Assembly language program (for MIPS)
swap:
    sll $2, $5, 2      # $2 = k * 4
    add $2, $4, $2     # $2 = address of v[k]
    lw  $15, 0($2)     # $15 = v[k]
    lw  $16, 4($2)     # $16 = v[k+1]
    sw  $16, 0($2)     # v[k] = $16
    sw  $15, 4($2)     # v[k+1] = $15
    jr  $31            # return to caller
assembler (one-to-one)
• Machine (object, binary) code (for MIPS)
000000 00000 00101 0001000010000000
000000 00100 00010 0001000000100000
. . .
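To see what the seven MIPS instructions do, the sketch below replays them one at a time on a dictionary of registers and a byte-addressed memory. It assumes the usual MIPS convention that $4 holds the address of v and $5 holds k; the function name run_swap and the memory model are invented for illustration.

```python
# Sketch only: a hand-written replay of the swap assembly, not a MIPS simulator.
# Assumes $4 = address of v[0] and $5 = k (standard MIPS argument registers).

def run_swap(memory, v_addr, k):
    """Replay the swap routine on `memory`, a dict of byte address -> word."""
    reg = {4: v_addr, 5: k}
    reg[2] = reg[5] << 2           # sll $2, $5, 2   : $2 = k * 4 (bytes per int)
    reg[2] = reg[4] + reg[2]       # add $2, $4, $2  : $2 = address of v[k]
    reg[15] = memory[reg[2] + 0]   # lw  $15, 0($2)  : $15 = v[k]
    reg[16] = memory[reg[2] + 4]   # lw  $16, 4($2)  : $16 = v[k+1]
    memory[reg[2] + 0] = reg[16]   # sw  $16, 0($2)  : v[k]   = old v[k+1]
    memory[reg[2] + 4] = reg[15]   # sw  $15, 4($2)  : v[k+1] = old v[k]
    return memory                  # jr  $31         : return to caller

# v = [10, 20, 30, 40] stored at byte address 1000, four bytes per element
mem = {1000 + 4 * i: value for i, value in enumerate([10, 20, 30, 40])}
run_swap(mem, 1000, 1)
swapped = [mem[1000 + 4 * i] for i in range(4)]
print(swapped)
```

Three C statements become seven instructions, which is the one-to-many step; each of those instructions then maps to exactly one 32-bit machine word.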
Advantages of HLLs
• Higher-level languages (HLLs)
– Allow the programmer to think in a more natural language that is
tailored to the intended use (Fortran for scientific computation,
Cobol for business programming, Lisp for symbol manipulation, Java
for web programming, …)
– Improve programmer productivity – more understandable code that
is easier to debug and validate
– Improve program maintainability
– Allow programs to be independent of the computer on
which they are developed (compilers and assemblers can translate
high-level language programs to the binary instructions of any
machine)
– Emergence of optimizing compilers that produce very efficient
assembly code optimized for the target machine
• As a result, very little programming is done today at the
assembler level.
Compiler Basics
• High-level languages
– Programmers do not think in 0s and 1s
• Languages can also be specific to target applications, such
as Cobol (business) or Fortran (scientific)
– Applications are more concise → fewer bugs
– Programs can be independent of system on which they are
developed
• Compilers convert source code to object code
• Libraries simplify common tasks
Levels of Representation
• High Level Language Program
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
Compiler
• Assembly Language Program
    lw $15, 0($2)
    lw $16, 4($2)
    sw $16, 0($2)
    sw $15, 4($2)
Assembler
• Machine Language Program
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111
Machine Interpretation
• Control Signal Specification
    ALUOP[0:3] <= InstReg[9:11] & MASK
    [i.e. high/low on control lines]
Execution Cycle
• Instruction Fetch: Obtain instruction from program storage
• Instruction Decode: Determine required actions and instruction size
• Operand Fetch: Locate and obtain operand data
• Execute: Compute result value or status
• Result Store: Deposit results in storage for later use
• Next Instruction: Determine successor instruction
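The six steps of the execution cycle can be sketched as a loop. The machine below is a toy with a made-up instruction format (opcode, dest, src1, src2); it is not MIPS and does not come from the slides, and it only aims to show where each step sits.

```python
# Sketch only: a toy register machine with an invented instruction format
# (opcode, dest, src1, src2). Each comment names the cycle step it stands for.

def run(program, registers):
    """Execute `program` (a list of tuples) and return the final registers."""
    pc = 0
    while pc < len(program):
        instr = program[pc]                      # instruction fetch
        opcode, dest, src1, src2 = instr         # instruction decode
        if opcode == "halt":
            break
        a, b = registers[src1], registers[src2]  # operand fetch
        if opcode == "add":                      # execute
            result = a + b
        elif opcode == "sub":
            result = a - b
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        registers[dest] = result                 # result store
        pc += 1                                  # next instruction
    return registers

program = [
    ("add", "r3", "r1", "r2"),   # r3 = r1 + r2
    ("sub", "r4", "r3", "r1"),   # r4 = r3 - r1
    ("halt", None, None, None),
]
final = run(program, {"r1": 5, "r2": 7, "r3": 0, "r4": 0})
print(final)
```

In this toy every instruction has the same size, so "next instruction" is just pc + 1; a branch would instead overwrite pc in that step.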
AMD’s Barcelona Multicore Chip
• Four out-of-order cores on one chip
• 1.9 GHz clock rate
• 65nm technology
• Three levels of caches (L1, L2, L3) on chip
• Integrated Northbridge
[Figure: die photo labeled Core 1 to Core 4, each with a 512KB L2, plus a 2MB shared L3 Cache and the Northbridge]
http://www.techwarelabs.com/reviews/processors/barcelona/
Magnetic Storage
Source: Quantum Corp
Disk capacity increasing 60%/year for common form factor
Communication
• The Information Age
• The Internet’s changes to communication are unlike any past
medium (printing press, radio, television)
– Estimated 400 million users in 2005
– An astounding 1.2 billion wireless users!!
[Chart: Residential Internet Subscribers, in millions, 2000 to 2005; data labels 145, 160, 180, 205, 240, 270. Source: Ovum]
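Moore’s Law
• In 1965, Gordon Moore (who later co-founded Intel) predicted that the
number of transistors that can be integrated on a single chip would
double about every year; in 1975 he revised the rate to about every two years
[Figure: transistors per chip over time, driven by feature size and die size, up to the Dual Core Itanium with 1.7B transistors. Courtesy, Intel ®]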
Moore’s Law for CPUs and DRAMs
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.
Technology Scaling Road Map
Year                   2004   2006   2008   2010   2012
Feature size (nm)        90     65     45     32     22
Intg. Capacity (BT)       2      4      6     16     32
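The capacity row roughly tracks a doubling every two years. A small sketch, taking the 2004 entry (2 billion transistors) as the starting point; the function name projected_capacity and the clean doubling period are assumptions made for this illustration:

```python
# Sketch only: projects integration capacity assuming an exact doubling
# every two years from the road map's 2004 entry. Names are hypothetical.

def projected_capacity(year, base_year=2004, base_capacity=2.0, doubling_years=2.0):
    """Capacity in billions of transistors under a fixed doubling period."""
    return base_capacity * 2 ** ((year - base_year) / doubling_years)

road_map = {2004: 2, 2006: 4, 2008: 6, 2010: 16, 2012: 32}  # values from the table
for year, listed in road_map.items():
    print(year, listed, projected_capacity(year))
```

Under this assumption every entry matches the table except 2008, where the road map lists 6 rather than the projected 8.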
• Fun facts about 45nm transistors
– 30 million can fit on the head of a pin
– You could fit more than 2,000 across the width of a human
hair
– If car prices had fallen at the same rate as the price of a
single transistor has since 1968, a new car today would cost
about 1 cent
Semiconductors
• 50 year old industry
– Still has continuous improvements
– New generation every 2-3 years
• 30% reduction in dimension → 50% reduction in area
• 30% reduction in delay → 50% speed increase
• Current generation: Reduces cost and increases performance
– Processors are fabricated on wafers sliced from silicon ingots; the
wafers are then etched to create transistors
– Wafers are then diced to form chips, some of which have
defects
– Yield is the fraction of chips that are good
• Next generation: Larger with more functions
– Each generation is an incremental improvement
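The 30% and 50% figures above come from simple scaling arithmetic, sketched here. The 0.7 factor is the slide's "30% reduction"; the variable names are illustrative only.

```python
# Sketch only: arithmetic behind the per-generation scaling rule of thumb.

shrink = 0.7                      # linear dimension (or delay) after a 30% reduction

area_factor = shrink ** 2         # area scales with the square of the dimension
speed_factor = 1 / shrink         # speed is the inverse of delay

print(f"area:  {area_factor:.2f} of the previous generation")
print(f"speed: {speed_factor:.2f}x the previous generation")
```

The area result (about 0.49) matches the 50% figure. The speed result (about 1.43x) is somewhat below a full 50% increase, so that second number is best read as a rounded rule of thumb.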
Semiconductor Manufacturing Process
for Silicon ICs
Main driver: device scaling ...
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.
But What Happened to Clock Rates?
• Clock rates hit a “power wall”
Hitting the Power Wall
“For the P6, success criteria included performance above a certain
level and failure criteria included power dissipation above some
threshold.”
Bob Colwell, Pentium Chronicles
Processor performance growth flattens!
The Latest Revolution: Multicores
The power challenge has forced a change in the design of
microprocessors
--Since 2002 the rate of improvement in the response time of programs
on desktop computers has slowed from a factor of 1.5 per year to less
than a factor of 1.2 per year
--In 2011 all desktop and server companies are shipping
microprocessors with multiple processors – cores – per chip
Product          AMD Barcelona   Intel Nehalem   IBM Power 6   Sun Niagara 2
Cores per chip   4               4               2             8
Clock rate       2.5 GHz         ~2.5 GHz?       4.7 GHz       1.4 GHz
Power            120 W           ~100 W?         ~100 W?       94 W
The plan is to double the number of cores per chip per
generation (about every two years)
Workloads and Benchmarks
• Benchmarks – a set of programs that form a “workload” specifically
chosen to measure performance. With standard inputs, the
benchmarks are run and execution time is measured.
• SPEC (System Performance Evaluation Cooperative) creates
standard sets of benchmarks starting with SPEC89. The latest is
SPEC CPU2006 which consists of 12 integer benchmarks
(CINT2006) and 17 floating-point benchmarks (CFP2006).
www.spec.org
• There are also benchmark collections for power workloads
(SPECpower_ssj2008), for email workloads (SPECmail2008), for
multimedia workloads (mediabench), …
2002 SPEC Benchmarks
Integer benchmarks                        FP benchmarks
gzip      compression                     wupwise   Quantum chromodynamics
vpr       FPGA place & route              swim      Shallow water model
gcc       GNU C compiler                  mgrid     Multigrid solver in 3D fields
mcf       Combinatorial optimization      applu     Parabolic/elliptic pde
crafty    Chess program                   mesa      3D graphics library
parser    Word processing program         galgel    Computational fluid dynamics
eon       Computer visualization          art       Image recognition (NN)
perlbmk   perl application                equake    Seismic wave propagation simulation
gap       Group theory interpreter        facerec   Facial image recognition
vortex    Object oriented database        ammp      Computational chemistry
bzip2     compression                     lucas     Primality testing
twolf     Circuit place & route           fma3d     Crash simulation fem
                                          sixtrack  Nuclear physics accel
                                          apsi      Pollutant distribution
SPEC CINT2006 on Barcelona (2.5 GHz)
Name         IC ×10^9   CPI     ExTime (s)   RefTime (s)   SPEC ratio
perl         2,118      0.75    637          9,770         15.3
bzip2        2,389      0.85    817          9,650         11.8
gcc          1,050      1.72    724          8,050         11.1
mcf          336        10.00   1,345        9,120         6.8
go           1,658      1.09    721          10,490        14.6
hmmer        2,783      0.80    890          9,330         10.5
sjeng        2,176      0.96    837          12,100        14.5
libquantum   1,623      1.61    1,047        20,720        19.8
h264avc      3,102      0.80    993          22,130        22.3
omnetpp      587        2.94    690          6,250         9.1
astar        1,082      1.79    773          7,020         9.1
xalancbmk    1,058      2.70    1,143        6,900         6.0
Geometric Mean                                             11.7
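The columns of the Barcelona table are tied together by the classic CPU time equation (instruction count × CPI / clock rate), which the slide does not spell out. The sketch below recomputes one row under that assumption; the function names are hypothetical.

```python
# Sketch only: recomputes two table columns from the others.
# Assumes execution time = instruction count x CPI / clock rate.

CLOCK_RATE_HZ = 2.5e9  # Barcelona clock rate from the slide title

def execution_time(instr_count_billions, cpi, clock_rate_hz=CLOCK_RATE_HZ):
    """Execution time in seconds."""
    return instr_count_billions * 1e9 * cpi / clock_rate_hz

def spec_ratio(ref_time, ex_time):
    """Reference time divided by measured time (bigger is faster)."""
    return ref_time / ex_time

# mcf row: IC = 336 x 10^9, CPI = 10.00, ExTime = 1,345 s, RefTime = 9,120 s
print(execution_time(336, 10.00))   # close to the listed 1,345 s
print(spec_ratio(9120, 1345))       # close to the listed 6.8
```

The recomputed times differ from the listed ones by a few seconds because the CPI column is rounded to two decimal places.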
Comparing & Summarizing Performance
• How do we summarize the performance for a benchmark
set with a single number?
– First the execution times are normalized giving the “SPEC ratio”
of reference time to measured execution time (bigger is faster)
– The SPEC ratios are then “averaged” using the geometric mean
(GM)
$$\mathrm{GM} = \sqrt[n]{\prod_{i=1}^{n} \text{SPEC ratio}_i}$$
• Reference time is the execution time on a reference computer
• Guiding principle in reporting performance measurements is
reproducibility – list everything another experimenter would need to
duplicate the experiment (version of the operating system, compiler
settings, input set used, specific computer configuration (clock rate,
cache sizes and speed, memory size and speed, etc)).
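As a quick check of the geometric mean, the sketch below applies it to the twelve SPEC ratios listed for Barcelona. Computing it through logarithms is a choice made here for numerical convenience, and geometric_mean is a made-up helper name.

```python
import math

# Sketch only: geometric mean of the SPEC CINT2006 ratios from the Barcelona table.
def geometric_mean(values):
    """n-th root of the product of `values`, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

spec_ratios = [15.3, 11.8, 11.1, 6.8, 14.6, 10.5, 14.5, 19.8, 22.3, 9.1, 9.1, 6.0]
print(round(geometric_mean(spec_ratios), 1))   # agrees with the table's 11.7
```

One reason to use the geometric mean for ratios is that the ranking of two machines does not depend on which computer is chosen as the reference.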
Other Performance Metrics
Power consumption – especially used in the embedded market where battery
life is important. For power-limited applications, the most important metric is
energy efficiency
Course Content
Computer Architecture and Engineering
Instruction Set Design      Computer Organization
Interfaces                  Hardware Components
Compiler/System View        Logic Designer’s View
“Building Architect”        “Construction Engineer”
So what's in it for me?
• In-depth understanding of the inner-workings of modern computers,
their evolution, and trade-offs present at the hardware/software
boundary.
– Insight into fast/slow operations that are easy/hard to implement
in hardware
• Experience with the design process in the context of a large complex
(hardware) design.
– Functional Spec --> Control & Datapath --> Simulation -->
Physical implementation
– Modern CAD tools
• Designer's "Conceptual" toolbox
Conceptual tool box
• Evaluation Techniques
• Levels of Translation (e.g. Compilation, Assembly)
• Hierarchy (e.g. registers, cache, memory, disk)
• Pipelining and Parallelism
• Static / Dynamic Scheduling
• Indirection and Address Translation
• Timing, Clocking, and Latching
• CAD Programs, Hardware Description Languages, Simulation
• Physical Building Blocks (e.g. CLA)
• Understanding Technology Trends