Computer Architecture
Computer Architecture
(“MAMAS”, 234267)
Spring 2014
Lecturer: Yoav Etsion
Reception: Mon 15:00, Fishbach 306-8
TAs: Nadav Amit, Gil Einziger, Franck Sala
Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, Adi Yoaz and Dan Tsafrir
Computer Architecture 2014 – Introduction
Computer System Structure
COMPUTER SYSTEM COMPONENTS: Archaic
[Diagram components: CPU, cache, CPU-memory bus, main memory, bus adapter, I/O bus, I/O controllers (disk, printer, scanner, keyboard, mouse, ...)]
COMPUTER SYSTEM COMPONENTS: Yesterday
[Diagram components: CPU, cache, main memory, North Bridge, South Bridge, I/O controllers (network + WLAN, disk, printer, scanner, keyboard, mouse, ...)]
COMPUTER SYSTEM COMPONENTS: Now
[Diagram components: CPU with MC+cache+G, main memory, South Bridge (printer, scanner, keyboard, mouse, ..., network + WLAN, disk/SSD)]
Classical Motherboard Diagram
More to the “north” = closer to the CPU = faster
[Diagram components: CPU, cache, CPU bus; North Bridge (IOMMU, on-board graphics, memory controller); memory bus to DDR2 or DDR3 channel 1 and channel 2; PCI express 2.0 to an external graphics card; South Bridge; PCI express ×1; PCI (sound card with speakers, LAN adapter to LAN); USB controller (keyboard, mouse); SATA controller (DVD drive, hard disk); IO controller (serial port, parallel port, floppy drive)]
Course Focus
Start from CPU (=processor)
• Instruction set, performance
• Pipeline, hazards
• Branch prediction
• Out-of-order execution
Move on to Memory Hierarchy
• Caching
• Main memory
• Virtual Memory
Move on to PC Architecture
• System & chipset, DRAM, I/O, Disk, peripherals
End with some Advanced Topics
The Processor
Architecture vs. Microarchitecture
Architecture:
= The processor features as seen by its user
= Interface
Instruction set, number of registers, addressing modes, …
Example: ARM V8, ARM V9
Microarchitecture:
= Manner by which the processor implements the Architecture
= Implementation details
Cache size and structure, number of execution units, …
Note: different processors with different u-archs can support the same arch
We will address both
Why Should We Care?
Abstractions enhance productivity, so:
If we know the arch (=interface), why should we care about the u-arch (=internals)?
Same goes for arch
Just details for a programmer of a high-level language
Recent Processor Trends
Source: http://www.scidacreview.org/0904/html/multicore.html
Well-Known Moore’s Law
Graph taken from: http://www.intel.com/technology/mooreslaw/index.htm
The Story in a Nutshell
[Plot series: transistors (1000s), clock speed (MHz), power (W), instructions/cycle (ILP)]
Took the Industry by Surprise
Dire Implications: Performance
Dire Implications: Sales
Dire Implications: Programmers
Supercomputing: “Top 500 list”
Dire Implications: Supercomputing
Processor Performance
Metrics: IC, CPI, IPC
CPUs work according to a clock signal
Clock cycle: measured in nanoseconds ($10^{-9}$ of a second)
Clock frequency = 1/|clock cycle|: in GHz ($10^{9}$ cycles/sec)
Instruction Count (IC)
Total number of instructions executed in the program
Cycles Per Instruction (CPI)
Average #cycles per instruction (in a given program)
$CPI = \frac{\#\text{cycles required to execute the program}}{IC}$
IPC (= 1/CPI): instructions per cycle
Can be > 1; see the “Story in a Nutshell” slide
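The metric definitions can be sketched in a few lines of Python. This is only an illustration: the function names and the cycle and instruction counts are hypothetical and do not come from the slides.

```python
def cpi(cycles, instruction_count):
    """Average number of cycles per instruction for one program run."""
    return cycles / instruction_count


def ipc(cycles, instruction_count):
    """Instructions per cycle, the reciprocal of CPI."""
    return instruction_count / cycles


def clock_frequency_ghz(clock_cycle_ns):
    """A cycle time given in nanoseconds yields a frequency in GHz."""
    return 1.0 / clock_cycle_ns


# Hypothetical run: 2,000,000 instructions executed in 1,000,000 cycles,
# on a processor with a 0.5 ns clock cycle.
example_cpi = cpi(1_000_000, 2_000_000)
example_ipc = ipc(1_000_000, 2_000_000)
example_ghz = clock_frequency_ghz(0.5)
print(example_cpi, example_ipc, example_ghz)
```

In this made-up run the IPC is above 1, which is the situation the slide refers to.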
Minimizing Execution Time
CPU Time - time required to execute a program
$\text{CPU Time} = IC \times CPI \times \text{clock cycle}$
Our goal:
minimize CPU Time (any of the above components)
Minimize clock cycle: increase GHz (processor design)
Minimize CPI:
u-arch (e.g.: more execution units)
Minimize IC:
arch (e.g. SSE instructions)
SSE = Streaming SIMD Extensions (Intel)
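A small sketch of the CPU Time product, assuming the clock cycle is given in nanoseconds. The baseline numbers and the three "improved" variants are hypothetical; they only show that each factor scales execution time linearly.

```python
def cpu_time_seconds(ic, cpi, clock_cycle_ns):
    """CPU Time = IC x CPI x clock cycle, with the cycle converted from ns to s."""
    return ic * cpi * clock_cycle_ns * 1e-9


# Hypothetical baseline: 1e9 instructions, CPI of 1.5, 0.5 ns cycle (2 GHz).
baseline = cpu_time_seconds(ic=1_000_000_000, cpi=1.5, clock_cycle_ns=0.5)

# Improve one factor at a time by 20% (all values are made up):
faster_clock = cpu_time_seconds(1_000_000_000, 1.5, 0.4)  # shorter clock cycle
lower_cpi = cpu_time_seconds(1_000_000_000, 1.2, 0.5)     # u-arch improvement
lower_ic = cpu_time_seconds(800_000_000, 1.5, 0.5)        # arch improvement

for name, value in [("baseline", baseline), ("faster clock", faster_clock),
                    ("lower CPI", lower_cpi), ("lower IC", lower_ic)]:
    print(f"{name}: {value:.3f} s")
```

Because the three factors multiply, a 20% reduction in any one of them gives the same 20% reduction in CPU Time.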
Alternative Way to Calculate CPI
$IC_i$ = #times instruction of type-i is executed in program
IC = #instructions executed in program:
$IC = \sum_{i=1}^{n} IC_i$
$F_i$ = relative frequency of type-i instruction = $IC_i / IC$
$CPI_i$ = #cycles to execute type-i instruction
e.g.: $CPI_{add} = 1$, $CPI_{mul} = 3$
#cycles required to execute the program:
$\#cyc = \sum_{i=1}^{n} CPI_i \cdot IC_i$
CPI:
$CPI = \frac{\#cyc}{IC} = \frac{\sum_{i=1}^{n} CPI_i \cdot IC_i}{IC} = \sum_{i=1}^{n} CPI_i \cdot \frac{IC_i}{IC} = \sum_{i=1}^{n} CPI_i \cdot F_i$
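The two routes to CPI (total cycles over IC, and the frequency-weighted sum) can be checked against each other. The sketch below uses the slide's CPIadd = 1 and CPImul = 3; the instruction counts (700 adds, 300 muls) are hypothetical.

```python
def cpi_from_cycles(mix):
    """CPI = #cyc / IC, where mix maps type -> (IC_i, CPI_i)."""
    total_ic = sum(ic_i for ic_i, _ in mix.values())
    total_cycles = sum(cpi_i * ic_i for ic_i, cpi_i in mix.values())
    return total_cycles / total_ic


def cpi_from_frequencies(mix):
    """CPI = sum of CPI_i * F_i, where F_i = IC_i / IC."""
    total_ic = sum(ic_i for ic_i, _ in mix.values())
    return sum(cpi_i * (ic_i / total_ic) for ic_i, cpi_i in mix.values())


# CPI values follow the slide's example; the counts are made up.
mix = {"add": (700, 1), "mul": (300, 3)}
print(cpi_from_cycles(mix), cpi_from_frequencies(mix))
```

Both functions compute the same quantity; the second form is convenient when only the instruction mix (the F_i values) is known.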
Performance Evaluation: How?
Performance depends on
• Application
• Input
Mathematical analysis
Benchmarks
Use benchmarks & measure how long it takes
Use real applications (=> no absolute answers)
Preferably standardized benchmarks (+input), e.g.,
SPEC INT: integer apps
• Compression, C compiler, Perl, text-processing, …
SPEC FP: floating point apps (mostly scientific)
TPC benchmarks: measure transaction throughput (DB)
SPEC JBB: models wholesale company (Java server, DB)
Sometimes you see FLOPS (“peak” or “sustained”)
Supercomputers (top500 list), against LINPACK
Evaluating Performance
Use a performance simulator to evaluate the
performance of a new feature / algorithm
Models the u-arch in great detail
Run 100s of representative applications
Produce the performance s-curve
Sort the applications according to the IPC increase
Baseline (0%) is the processor without the new feature
[Figure: two example S-curves of per-application IPC change. Bad S-curve: positive outliers and negative outliers. Good S-curve: positive outliers and only small negative outliers.]
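The S-curve construction described above (per-application IPC change relative to the baseline, then sorted) might be sketched as follows. The application names and IPC values are hypothetical placeholders, not simulator output.

```python
def s_curve(baseline_ipc, feature_ipc):
    """Return (application, % IPC change) pairs sorted from worst to best."""
    changes = []
    for app, base in baseline_ipc.items():
        percent = 100.0 * (feature_ipc[app] - base) / base
        changes.append((app, percent))
    return sorted(changes, key=lambda pair: pair[1])


# Hypothetical IPC per application, without and with the new feature.
baseline = {"app_a": 1.00, "app_b": 2.00, "app_c": 0.50, "app_d": 1.25}
with_feature = {"app_a": 1.02, "app_b": 1.96, "app_c": 0.53, "app_d": 1.25}

curve = s_curve(baseline, with_feature)
for app, percent in curve:
    print(f"{app}: {percent:+.1f}%")
```

Plotting the sorted percentages left to right gives the S shape; applications at the two ends are the negative and positive outliers.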
Amdahl’s Law
Suppose we accelerate the computation such that
P = portion of computation we make faster
S = speedup experienced by the portion we improved
For example
If an improvement can speed up 40% of the computation
=> P = 0.4
If the improvement makes the portion run twice as fast
=> S = 2
Then overall speedup
$= \frac{1}{(1 - P) + \frac{P}{S}}$
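As a worked instance, plugging the example values P = 0.4 and S = 2 into the formula gives:

```latex
\text{speedup} = \frac{1}{(1 - 0.4) + \frac{0.4}{2}} = \frac{1}{0.6 + 0.2} = \frac{1}{0.8} = 1.25
```

So doubling the speed of 40% of the computation makes the whole program only 1.25 times faster.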
Amdahl’s Law - Example
FP operations improved to run 2x faster
S = 2, but…
P = 0.1: only affects 10% of the program
Speedup:
$\frac{1}{(1 - P) + \frac{P}{S}} = \frac{1}{(1 - 0.1) + \frac{0.1}{2}} = \frac{1}{0.95} = 1.053$
Conclusion
Better to make common case fast…
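The example can be reproduced with a short function. `amdahl_speedup` is a name chosen here for illustration; the infinite-speedup case and the P = 0.9 case are hypothetical additions that show why the common case matters.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the run is made s times faster."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a fraction between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


# The slide's example: FP is 10% of the program and becomes 2x faster.
fp_case = amdahl_speedup(p=0.1, s=2)

# Hypothetical: even infinitely fast FP cannot beat 1 / (1 - 0.1).
fp_infinite = amdahl_speedup(p=0.1, s=float("inf"))

# Hypothetical: the same 2x improvement applied to 90% of the program.
common_case = amdahl_speedup(p=0.9, s=2)

print(round(fp_case, 3), round(fp_infinite, 3), round(common_case, 3))
```

Speeding up the rare case is capped by the untouched 90%, while the same factor applied to the common case yields a much larger overall gain.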
Amdahl’s Law – Parallelism
When parallelizing a program
P = proportion of program that can be made parallel
1 - P = inherently serial
N = number of processing elements (say, cores)
Speedup:
$\frac{1}{(1 - P) + \frac{P}{N}}$
Serial component imposes a hard limit
$\lim_{N \to \infty} \frac{1}{(1 - P) + \frac{P}{N}} = \frac{1}{1 - P}$
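A rough sketch of how the speedup approaches the serial limit as cores are added. The parallel fraction P = 0.95 and the core counts are hypothetical values picked for illustration.

```python
def parallel_speedup(p, n):
    """Speedup on n processing elements when a fraction p is parallelizable."""
    return 1.0 / ((1.0 - p) + p / n)


def speedup_limit(p):
    """Upper bound on speedup as n grows without bound (requires p < 1)."""
    return 1.0 / (1.0 - p)


p = 0.95  # hypothetical: 95% of the program can run in parallel
core_counts = [1, 2, 8, 64, 1024]
speedups = [parallel_speedup(p, n) for n in core_counts]

for n, value in zip(core_counts, speedups):
    print(f"N={n}: speedup {value:.2f}")
print(f"limit: {speedup_limit(p):.2f}")
```

Even with 1024 cores the speedup stays below 20 in this setup, because the 5% serial part never shrinks.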
Instruction Set Design
The ISA is what the user & compiler see
The HW implements the ISA
[Diagram: software, instruction set, hardware]
Considerations in ISA Design
Instruction size
Long instructions take more time to fetch from memory
Longer instructions require a larger memory
• Important for small (embedded) devices, e.g., cell phones
Number of instructions (IC)
Reduce IC => reduce runtime (at a given CPI & frequency)
Virtues of instruction simplicity
Simpler HW allows for: higher frequency & lower power
Optimization can be applied better to simpler code
Cheaper HW
Basing Design Decisions on Workload
Immediate argument’s size in bits (histogram)
[Histogram: x-axis is immediate data bits (0 to 15); y-axis runs from 0% to 30%; series: Int. Avg., FP Avg.]
1% of data values > 16-bits
Having 16 bits is likely good enough
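The kind of measurement behind such a histogram can be sketched as a coverage check: what fraction of the constants in a workload fits in an immediate field of a given width. The sketch assumes signed (two's complement) immediates, which the slide does not state, and the sample values are hypothetical.

```python
def fits_signed_immediate(value, bits=16):
    """True if value fits in a two's-complement field of the given width."""
    low = -(1 << (bits - 1))
    high = (1 << (bits - 1)) - 1
    return low <= value <= high


def immediate_coverage(values, bits=16):
    """Fraction of the given constants that fit in a bits-wide immediate."""
    return sum(fits_signed_immediate(v, bits) for v in values) / len(values)


# Hypothetical constants collected from some program (not measured data).
sample = [0, 1, -1, 4, 255, 1024, -32768, 32767, 65536, 1_000_000]

print(immediate_coverage(sample, bits=8))
print(immediate_coverage(sample, bits=16))
```

Running the same check over real traces, for several widths, is one way a designer could decide how many immediate bits are worth encoding.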
CISC Processors
CISC - Complex Instruction Set Computer
Example: x86
The idea: a high level machine language
• Back when people programmed in assembly, CISC was supposedly easier
Characteristics
Many instruction types, with many addressing modes
Some of the instructions are complex
• Execute complex tasks
• Require many cycles
ALU operations directly on memory (e.g., arr[j] = arr[i]+n)
• Registers not used (and, accordingly, only a few registers exist)
Variable length instructions
• common instructions get short codes to save code length
But it Turns Out…
Rank   Instruction              % of total executed
1      load                     22%
2      conditional branch       20%
3      compare                  16%
4      store                    12%
5      add                      8%
6      and                      6%
7      sub                      5%
8      move register-register   4%
9      call                     1%
10     return                   1%
       Total                    96%
Simple instructions dominate instruction frequency
CISC Drawbacks
Complex instructions and complex addressing modes
complicate the processor
slow down the simple, common instructions
contradict Make The Common Case Fast
Compilers don’t use complex instructions / indexing methods
Variable length instructions are a real pain in the neck
Difficult to decode several instructions in parallel
• As long as an instruction is not decoded, its length is unknown
It is unknown where the instruction ends
It is unknown where the next instruction starts
An instruction may be longer than a cache line
• Or even longer than a page (in theory)
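The decoding problem can be illustrated with a toy encoding. Everything here is hypothetical: the opcode-to-length table is invented for the sketch and has nothing to do with real x86 encodings.

```python
# Hypothetical toy ISA: the first byte of an instruction determines its length.
TOY_LENGTHS = {0x01: 1, 0x02: 2, 0x03: 4}


def variable_length_starts(code):
    """Find instruction boundaries; each one depends on decoding the previous."""
    starts = []
    offset = 0
    while offset < len(code):
        starts.append(offset)
        # The next start is unknown until this opcode has been decoded.
        offset += TOY_LENGTHS[code[offset]]
    return starts


def fixed_length_starts(code, width=4):
    """With fixed-length instructions every boundary is known up front."""
    return list(range(0, len(code), width))


# Bytes 4..6 look like opcodes but are operand bytes of the instruction at 3.
code = bytes([0x02, 0xAA, 0x01, 0x03, 0x01, 0x02, 0x03, 0x01])
print(variable_length_starts(code))
print(fixed_length_starts(code))
```

The sequential loop is the point: a decoder cannot start on instruction k+1 in parallel without first knowing the length of instruction k, whereas the fixed-length boundaries are independent of the contents.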
RISC Processors
RISC - Reduced Instruction Set Computer
The idea: simple instructions enable fast hardware
Characteristics
A small instruction set, with only a few instruction formats
Simple instructions
• execute simple tasks
• Most of them require a single cycle (with pipeline)
A few indexing methods
Load/Store machine: ALU operations on registers only
• Memory is accessed using Load and Store instructions only
• Many orthogonal registers
• Three address machine:
Add dst, src1, src2
Fixed length instructions
Examples: MIPS™, SPARC™, Alpha™, Power™
RISC Processors (Cont.)
Simple arch => simple u-arch
Room for larger on-die caches
Smaller => faster
Easier to design & validate (=> cheaper to manufacture)
Shorten time-to-market
More general-purpose registers (=> fewer memory refs)
Compiler can be smarter
Better pipeline usage
Better register allocation
Existing RISC processors are not “pure” RISC
Various complex operations added along the way
Compilers and ISA
Ease of compilation
Orthogonality:
• no special registers
• few special cases
• all operand modes available with any data type or instruction type
Regularity:
• no overloading for the meanings of instruction fields
Streamlined:
• resource needs easily determined
Register assignment is critical too
Easier if lots of registers
Still, CISC Is Dominant
x86 (CISC) dominates the processor market
Not necessarily because it is CISC…
Legacy
A vast amount of existing software
Intel, AMD, Microsoft benefit
But they invest a lot of money to compensate for the disadvantage
CISC arch internally emulates RISC
Starting at Pentium II and K6, x86 processors translate
CISC instructions into RISC-like operations internally
Inside, the core is a RISC machine
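The idea of translating a CISC instruction into RISC-like operations can be sketched with a toy example. The instruction name `add_mem`, the operation names, and the temporary register `t0` are all hypothetical; real x86 internal operations are not public in this form.

```python
def translate(instr):
    """Split a toy memory-to-memory add into load/add/store operations."""
    if instr[0] == "add_mem":  # hypothetical: mem[dst] = mem[src] + imm
        _, dst_addr, src_addr, imm = instr
        return [
            ("load", "t0", src_addr),    # t0 <- mem[src]
            ("addi", "t0", "t0", imm),   # t0 <- t0 + imm (registers only)
            ("store", "t0", dst_addr),   # mem[dst] <- t0
        ]
    raise ValueError(f"unknown instruction: {instr[0]}")


def run(ops, memory):
    """Execute the RISC-like operations on a dict-based memory."""
    regs = {}
    for op in ops:
        if op[0] == "load":
            regs[op[1]] = memory[op[2]]
        elif op[0] == "addi":
            regs[op[1]] = regs[op[2]] + op[3]
        elif op[0] == "store":
            memory[op[2]] = regs[op[1]]
    return memory


memory = {0: 10, 1: 0}
micro_ops = translate(("add_mem", 1, 0, 5))
run(micro_ops, memory)
print(micro_ops, memory)
```

After translation only loads and stores touch memory and the ALU operation works on registers, which is the load/store discipline a RISC core expects.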
Software Specific Extensions
Extend arch to accelerate exec of specific apps
Example: SSE™ – Streaming SIMD Extensions
128-bit packed (vector) / scalar single precision FP (4×32)
Introduced with Pentium® III in ’99
8 new 128 bit registers (XMM0 – XMM7)
Accelerates graphics, video, scientific calculations, …
Packed (128 bits): [x3, x2, x1, x0] + [y3, y2, y1, y0] = [x3+y3, x2+y2, x1+y1, x0+y0]
Scalar (128 bits): [x3, x2, x1, x0] + [y3, y2, y1, y0] = [y3, y2, y1, x0+y0]
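The difference between packed and scalar operation can be modeled with plain Python lists standing in for 128-bit registers of four 32-bit lanes. This is only a behavioral sketch, not SIMD code: lists are indexed with lane 0 first, and the scalar variant copies the upper lanes from y, following the slide's figure.

```python
LANES = 4  # four single-precision values per 128-bit register


def packed_add(x, y):
    """Add all four lanes pairwise, as one packed instruction would."""
    assert len(x) == LANES and len(y) == LANES
    return [a + b for a, b in zip(x, y)]


def scalar_add(x, y):
    """Add only lane 0; lanes 1..3 are taken from y, as drawn on the slide."""
    assert len(x) == LANES and len(y) == LANES
    return [x[0] + y[0]] + list(y[1:])


# Lane order is [lane0, lane1, lane2, lane3]; the values are arbitrary.
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
print(packed_add(x, y))
print(scalar_add(x, y))
```

A single packed add replaces four scalar adds, which is how such extensions reduce the instruction count (IC) for data-parallel code.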
BACKUP
Compatibility
Backward compatibility (HW responsibility)
When buying new hardware, it can run existing software:
• i5 can run SW written for Core2 Duo, Pentium4, PentiumM, Pentium III, Pentium II, Pentium, 486, 386, 286
BTW:
Forward compatibility (SW responsibility)
For example: MS Word 2003 can open MS Word 2010 doc
Commonly supports one or two generations behind
Architecture-independent SW
Run SW on top of VM that does JIT (just-in-time compilation):
JVM for Java and CLR for .NET
Interpreted languages: Perl, Python