Computer Architecture

Computer Architecture
(“MAMAS”, 234267)
Spring 2014
Lecturer: Yoav Etsion
Reception: Mon 15:00, Fishbach 306-8
TAs: Nadav Amit, Gil Einziger, Franck Sala
Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, Adi Yoaz and Dan Tsafrir
Computer Architecture 2014 – Introduction
Computer System Structure
COMPUTER SYSTEM COMPONENTS: Archaic
[Figure: block diagram. Labels: CPU, cache, CPU-memory bus, main memory, bus adapter, I/O bus, I/O controllers (disk, printer, scanner, keyboard, mouse, ...). Figure credit: U. Weiser.]

COMPUTER SYSTEM COMPONENTS: Yesterday
[Figure: block diagram. Labels: CPU, cache, main memory, North Bridge, South Bridge, I/O controllers (network + WLAN, disk, printer, scanner, keyboard, mouse, ...). Figure credit: U. Weiser.]

COMPUTER SYSTEM COMPONENTS: Now
[Figure: block diagram. Labels: CPU (MC+cache+G), main memory, South Bridge (printer, scanner, keyboard, mouse, network + WLAN, disk/SSD). Figure credit: U. Weiser.]

Classical Motherboard Diagram
More to the “north” = closer to the CPU = faster
[Figure: motherboard block diagram. Labels, roughly from north to south: CPU with cache; CPU bus; North Bridge; memory bus with DDR2 or DDR3 channels 1 and 2; PCI Express 2.0 with external graphics card; on-board graphics controller and on-board memory; IOMMU; South Bridge; PCI Express ×1; PCI with sound card (speakers) and LAN adapter (LAN); USB controller (keyboard, mouse); SATA controller (DVD drive, hard disk); IO controller (floppy drive, serial port, parallel port).]

Course Focus
Start from CPU (=processor)
  Instruction set, performance
  Pipeline, hazards
  Branch prediction
  Out-of-order execution
Move on to Memory Hierarchy
  Caching
  Main memory
  Virtual Memory
Move on to PC Architecture
  System & chipset, DRAM, I/O, Disk, peripherals
End with some Advanced Topics

The Processor
Architecture vs. Microarchitecture
Architecture:
= The processor features as seen by its user
= Interface
  Instruction set, number of registers, addressing modes, …
  Example: ARM V8, ARM V9
Microarchitecture:
= Manner in which the processor implements the architecture
= Implementation details
  Cache size and structure, number of execution units, …
  Note: different processors with different u-archs can support the same arch
We will address both

Why Should We Care?
Abstractions enhance productivity, so:
  If we know the arch (=interface),
  why should we care about the u-arch (=internals)?
Same goes for arch
  Just details for a programmer of a high-level language

Recent Processor Trends
Source: http://www.scidacreview.org/0904/html/multicore.html
Well-Known Moore’s Law
Graph taken from: http://www.intel.com/technology/mooreslaw/index.htm
The Story in a Nutshell
[Figure: processor trends over time. Series: transistors (1000s), clock speed (MHz), power (W), instructions/cycle (ILP).]

Took the Industry by Surprise
Dire Implications: Performance
Dire Implications: Sales
Dire Implications: Sales
Dire Implications: Programmers
Supercomputing: “Top 500 list”
Dire Implications: Supercomputing
Processor Performance
Metrics: IC, CPI, IPC
CPUs work according to a clock signal
  Clock cycle: measured in nanoseconds ($10^{-9}$ of a second)
  Clock frequency = 1/|clock cycle|: in GHz ($10^{9}$ cycles/sec)
Instruction Count (IC)
  Total number of instructions executed in the program
Cycles Per Instruction (CPI)
  Average #cycles per instruction (in a given program)
  $$CPI = \frac{\#\text{cycles required to execute the program}}{IC}$$
IPC (= 1/CPI): instructions per cycle
  Can be > 1; see the “Story in a Nutshell” slide

Minimizing Execution Time

CPU Time - time required to execute a program
CPU Time = IC × CPI × clock cycle

Our goal:
minimize CPU Time (any of above components)

Minimize clock cycle: increase GHz (processor design)

Minimize CPI:
u-arch (e.g.: more execution units)

Minimize IC:
arch (e.g. SSE instruction)
SSE = streaming SIMD extension (Intel)
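
As a rough illustration of the CPU-time product above, the following sketch computes it for a made-up program and shows how changing a single factor changes the result. The function names and all numbers (instruction count, CPI, frequency) are assumptions chosen for illustration, not measurements.

```python
def cpu_time_seconds(ic, cpi, clock_hz):
    """CPU Time = IC x CPI x clock cycle, where clock cycle = 1 / frequency."""
    if ic < 0 or cpi <= 0 or clock_hz <= 0:
        raise ValueError("ic must be >= 0; cpi and clock_hz must be > 0")
    clock_cycle = 1.0 / clock_hz  # seconds per cycle
    return ic * cpi * clock_cycle


def ipc(cpi):
    """IPC is simply the reciprocal of CPI."""
    return 1.0 / cpi


# Hypothetical program: 2 billion instructions, CPI of 1.5, 2 GHz clock.
baseline = cpu_time_seconds(ic=2e9, cpi=1.5, clock_hz=2e9)

# Lower CPI (a u-arch improvement), same IC and frequency.
lower_cpi = cpu_time_seconds(ic=2e9, cpi=0.75, clock_hz=2e9)

# Lower IC (an arch improvement, e.g. wider instructions), same CPI and frequency.
lower_ic = cpu_time_seconds(ic=1e9, cpi=1.5, clock_hz=2e9)

print(f"baseline:  {baseline:.3f} s")
print(f"lower CPI: {lower_cpi:.3f} s")
print(f"lower IC:  {lower_ic:.3f} s")
```

Because the three factors multiply, halving any one of them halves CPU time, provided the other two stay fixed; in practice they often interact (for example, a more complex instruction may lower IC but raise CPI).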
Alternative Way to Calculate CPI
$IC_i$ = #times an instruction of type i is executed in the program
IC = #instructions executed in the program:
$$IC = \sum_{i=1}^{n} IC_i$$
$F_i$ = relative frequency of type-i instructions = $IC_i / IC$
$CPI_i$ = #cycles to execute a type-i instruction
  e.g.: $CPI_{add} = 1$, $CPI_{mul} = 3$
#cycles required to execute the program:
$$\#cyc = \sum_{i=1}^{n} CPI_i \cdot IC_i$$
CPI:
$$CPI = \frac{\#cyc}{IC} = \frac{\sum_{i=1}^{n} CPI_i \cdot IC_i}{IC} = \sum_{i=1}^{n} CPI_i \cdot \frac{IC_i}{IC} = \sum_{i=1}^{n} CPI_i \cdot F_i$$

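
The per-type CPI calculation can be checked numerically. The sketch below computes CPI both ways, as total cycles over IC and as the frequency-weighted sum of per-type CPIs. The per-type CPIs (add = 1, mul = 3) come from the slide; the instruction counts and the function name are assumptions made up for the example.

```python
def average_cpi(mix):
    """mix maps instruction type -> (IC_i, CPI_i).

    Returns (IC, total cycles, CPI as cycles/IC, CPI as sum of CPI_i * F_i).
    """
    ic = sum(ic_i for ic_i, _ in mix.values())
    if ic == 0:
        raise ValueError("the program must execute at least one instruction")
    cycles = sum(cpi_i * ic_i for ic_i, cpi_i in mix.values())
    cpi_direct = cycles / ic
    # F_i = IC_i / IC is the relative frequency of type-i instructions.
    cpi_weighted = sum(cpi_i * (ic_i / ic) for ic_i, cpi_i in mix.values())
    return ic, cycles, cpi_direct, cpi_weighted


# Hypothetical counts: 800 adds (1 cycle each) and 200 muls (3 cycles each).
mix = {"add": (800, 1), "mul": (200, 3)}
ic, cycles, cpi_direct, cpi_weighted = average_cpi(mix)
print(ic, cycles, round(cpi_direct, 3), round(cpi_weighted, 3))
```

Both routes give the same CPI up to floating-point rounding, which is exactly what the chain of equalities states.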
Performance Evaluation: How?

Performance depends on



Application
Input
Mathematical analysis
Benchmarks
Use benchmarks & measure how long it takes
  Use real applications (=> no absolute answers)
  Preferably standardized benchmarks (+input), e.g.,
    SPEC INT: integer apps
    • Compression, C compiler, Perl, text-processing, …
    SPEC FP: floating point apps (mostly scientific)
    TPC benchmarks: measure transaction throughput (DB)
    SPEC JBB: models wholesale company (Java server, DB)
Sometimes you see FLOPS (“peak” or “sustained”)
  Supercomputers (top500 list), against LINPACK

Evaluating Performance
Use a performance simulator to evaluate the performance of a new feature / algorithm
  Models the u-arch in great detail
  Run hundreds of representative applications
Produce the performance S-curve
  Sort the applications according to the IPC increase
  Baseline (0%) is the processor without the new feature
[Figure: two example S-curves of per-application IPC change relative to the baseline. Good S-curve: positive outliers and only small negative outliers. Bad S-curve: positive outliers and sizable negative outliers.]

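
A minimal sketch of how the S-curve data could be produced from simulator output: compute the per-application IPC change against the baseline and sort. The application names and IPC values are invented for illustration, and `s_curve` is a hypothetical helper, not part of any simulator.

```python
def s_curve(baseline_ipc, feature_ipc):
    """Return (application, IPC change in %) pairs sorted by ascending change.

    The baseline (0%) is the processor without the new feature.
    """
    changes = {
        app: 100.0 * (feature_ipc[app] - baseline_ipc[app]) / baseline_ipc[app]
        for app in baseline_ipc
    }
    return sorted(changes.items(), key=lambda item: item[1])


# Hypothetical IPC measurements for four applications.
baseline_ipc = {"app_a": 1.00, "app_b": 2.00, "app_c": 1.00, "app_d": 0.50}
feature_ipc = {"app_a": 1.05, "app_b": 1.96, "app_c": 1.00, "app_d": 0.51}

curve = s_curve(baseline_ipc, feature_ipc)
for app, change in curve:
    print(f"{app}: {change:+.1f}%")
```

Plotting the sorted changes gives the S shape: negative outliers on the left, a flat middle, and positive outliers on the right.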
Amdahl’s Law
Suppose we accelerate the computation such that
  P = portion of the computation we make faster
  S = speedup experienced by the portion we improved
For example
  If an improvement can speed up 40% of the computation => P = 0.4
  If the improvement makes that portion run twice as fast => S = 2
Then the overall speedup is
$$\text{speedup} = \frac{1}{(1 - P) + \frac{P}{S}}$$

Amdahl’s Law - Example
FP operations improved to run 2x faster
  S = 2, but…
  P = 0.1: the improvement affects only 10% of the program
Speedup:
$$\frac{1}{(1 - P) + \frac{P}{S}} = \frac{1}{(1 - 0.1) + \frac{0.1}{2}} = \frac{1}{0.95} \approx 1.053$$
Conclusion
  Better to make the common case fast…

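
The arithmetic of the example can be reproduced with a few lines of code. This is a sketch of the formula only; the function name is an assumption, and the third call (P = 0.9) is an added hypothetical case to contrast with the slide's P = 0.1.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a portion p of the computation runs s times faster."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


# Values from the slides: 10% of the program made 2x faster.
rare_case = amdahl_speedup(p=0.1, s=2)      # approximately 1.053

# Values from the definition slide: 40% of the program made 2x faster.
forty_percent = amdahl_speedup(p=0.4, s=2)  # approximately 1.25

# Hypothetical contrast: the same 2x improvement applied to 90% of the program.
common_case = amdahl_speedup(p=0.9, s=2)

print(f"{rare_case:.3f} {forty_percent:.3f} {common_case:.3f}")
```

The same local speedup S pays off far more when it applies to a large portion P, which is the point of "make the common case fast".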
Amdahl’s Law – Parallelism
When parallelizing a program
  P = proportion of the program that can be made parallel
  1 - P = inherently serial
  N = number of processing elements (say, cores)
Speedup:
$$\frac{1}{(1 - P) + \frac{P}{N}}$$
Serial component imposes a hard limit
$$\lim_{N \to \infty} \frac{1}{(1 - P) + \frac{P}{N}} = \frac{1}{1 - P}$$

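
To see the hard limit numerically, the sketch below evaluates the parallel speedup for a growing number of cores and compares it with 1/(1 - P). The value P = 0.95 and the core counts are assumptions picked for illustration.

```python
def parallel_speedup(p, n):
    """Speedup on n processing elements when a proportion p is parallelizable."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


def speedup_limit(p):
    """Limit of the speedup as n grows without bound: 1 / (1 - p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


p = 0.95  # hypothetical: 95% of the program can be made parallel
core_counts = [1, 2, 4, 16, 64, 1024]
speedups = [parallel_speedup(p, n) for n in core_counts]
limit = speedup_limit(p)

for n, s in zip(core_counts, speedups):
    print(f"N={n:5d}  speedup={s:6.2f}")
print(f"limit as N grows: {limit:.2f}")
```

Even with 95% of the work parallelized, the speedup stays below 20 no matter how many cores are added, because the 5% serial part never shrinks.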
Instruction Set Design
[Figure: the instruction set drawn as the layer between software (above) and hardware (below).]
The ISA is what the user & compiler see
The HW implements the ISA

Considerations in ISA Design
Instruction size
  Long instructions take more time to fetch from memory
  Longer instructions require a larger memory
  • Important for small (embedded) devices, e.g., cell phones
Number of instructions (IC)
  Reduce IC => reduce runtime (at a given CPI & frequency)
Virtues of instruction simplicity
  Simpler HW allows for: higher frequency & lower power
  Optimization can be applied better to simpler code
  Cheaper HW

Basing Design Decisions on Workload
[Figure: histogram of immediate argument size in bits. x-axis: immediate data bits (0 to 15); y-axis: percentage (0% to 30%); series: Int. Avg., FP Avg.]
  1% of data values > 16-bits
  Having 16 bits is likely good enough

CISC Processors
CISC - Complex Instruction Set Computer
  Example: x86
  The idea: a high level machine language
  • Back when people programmed in assembly, CISC was supposedly easier
Characteristics
  Many instruction types, with many addressing modes
  Some of the instructions are complex
  • Execute complex tasks
  • Require many cycles
  ALU operations directly on memory (e.g., arr[j] = arr[i]+n)
  • Registers not used (and, accordingly, only a few registers exist)
  Variable length instructions
  • Common instructions get short codes => save code length

But it Turns Out…
Rank   Instruction              % of total executed
1      load                     22%
2      conditional branch       20%
3      compare                  16%
4      store                    12%
5      add                      8%
6      and                      6%
7      sub                      5%
8      move register-register   4%
9      call                     1%
10     return                   1%
Total                           96%
Simple instructions dominate instruction frequency

CISC Drawbacks
Complex instructions and complex addressing modes
  => complicate the processor
  => slow down the simple, common instructions
  => contradict “Make The Common Case Fast”
Compilers don’t use complex instructions / indexing methods
Variable length instructions are a real pain in the neck
  Difficult to decode several instructions in parallel
  • As long as an instruction is not decoded, its length is unknown
    => It is unknown where the instruction ends
    => It is unknown where the next instruction starts
  An instruction may be longer than a cache line
  • Or even longer than a page (in theory)

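
The decoding problem with variable-length instructions can be illustrated with a toy model. The encoding below is hypothetical (the first byte of each instruction stores its total length in bytes) and is not how x86 works; it only shows that each instruction's start depends on decoding the one before it, whereas fixed-length boundaries are known up front.

```python
def boundaries_variable(code):
    """Start offsets in a toy variable-length encoding.

    Assumption: byte 0 of every instruction holds its total length in bytes.
    The start of instruction k+1 is unknown until instruction k is decoded,
    so the scan is inherently sequential.
    """
    starts = []
    pc = 0
    while pc < len(code):
        length = code[pc]
        if length == 0 or pc + length > len(code):
            raise ValueError(f"malformed instruction at offset {pc}")
        starts.append(pc)
        pc += length
    return starts


def boundaries_fixed(code, width=4):
    """With fixed-length instructions, all start offsets are known immediately."""
    if len(code) % width != 0:
        raise ValueError("code size must be a multiple of the instruction width")
    return list(range(0, len(code), width))


# Eight bytes of made-up machine code: lengths 2, 1, 3, 2 in the toy encoding.
stream = bytes([2, 0xAA, 1, 3, 0xBB, 0xCC, 2, 0xDD])

variable_starts = boundaries_variable(stream)
fixed_starts = boundaries_fixed(stream, width=4)
print(variable_starts, fixed_starts)
```

In the fixed-length case every decoder can be handed its own offset at once, which is what makes decoding several instructions in parallel straightforward.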
RISC Processors
RISC - Reduced Instruction Set Computer
  The idea: simple instructions enable fast hardware
Characteristics
  A small instruction set, with only a few instruction formats
  Simple instructions
  • Execute simple tasks
  • Most of them require a single cycle (with pipeline)
  A few indexing methods
  Load/Store machine: ALU operations on registers only
  • Memory is accessed using Load and Store instructions only
  • Many orthogonal registers
  • Three address machine: Add dst, src1, src2
  Fixed length instructions
Examples: MIPS™, Sparc™, Alpha™, Power™

RISC Processors (Cont.)
Simple arch => simple u-arch
  Room for larger on-die caches
  Smaller => faster
  Easier to design & validate (=> cheaper to manufacture)
  Shorter time-to-market
  More general-purpose registers (=> fewer memory refs)
Compiler can be smarter
  Better pipeline usage
  Better register allocation
Existing RISC processors are not “pure” RISC
  Various complex operations added along the way

Compilers and ISA
Ease of compilation
  Orthogonality:
  • no special registers
  • few special cases
  • all operand modes available with any data type or instruction type
  Regularity:
  • no overloading for the meanings of instruction fields
  Streamlined
  • resource needs easily determined
Register assignment is critical too
  Easier if lots of registers

Still, CISC Is Dominant
x86 (CISC) dominates the processor market
  Not necessarily because it is CISC…
Legacy
  A vast amount of existing software
  Intel, AMD, Microsoft benefit
  But they invest a lot of money to compensate for the disadvantages
CISC arch internally emulates RISC
  Starting at Pentium II and K6, x86 processors translate CISC instructions into RISC-like operations internally
  Inside, the core is a RISC machine

Software Specific Extensions
Extend arch to accelerate exec of specific apps
Example: SSE™ – Streaming SIMD Extensions
  128-bit packed (vector) / scalar single precision FP (4×32)
  Introduced on Pentium® III in ’99
  8 new 128-bit registers (XMM0 – XMM7)
  Accelerates graphics, video, scientific calculations, …
[Figure: packed vs. scalar addition on 128-bit registers, each holding four 32-bit values.
 Packed: (x3, x2, x1, x0) + (y3, y2, y1, y0) = (x3+y3, x2+y2, x1+y1, x0+y0).
 Scalar: only the lowest elements are added; the result shown is (y3, y2, y1, x0+y0).]

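
The difference between the packed and scalar forms can be modeled in a few lines. This is a plain-Python model of the lane behavior drawn in the figure, not SSE code; the function names are made up, and list index 0 stands for the lowest lane (x0 / y0). In C, the corresponding intrinsics are `_mm_add_ps` (packed) and `_mm_add_ss` (scalar).

```python
LANES = 4  # a 128-bit register holds four 32-bit single-precision values


def add_packed(x, y):
    """Packed add: every lane is added element-wise."""
    if len(x) != LANES or len(y) != LANES:
        raise ValueError("each operand must have exactly four lanes")
    return [a + b for a, b in zip(x, y)]


def add_scalar(dst, src):
    """Scalar add: only lane 0 is added; lanes 1-3 are copied from dst."""
    if len(dst) != LANES or len(src) != LANES:
        raise ValueError("each operand must have exactly four lanes")
    return [dst[0] + src[0]] + list(dst[1:])


# Made-up lane values, listed from lane 0 to lane 3.
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]

packed = add_packed(x, y)
scalar = add_scalar(y, x)  # y is the destination, as in the figure
print(packed, scalar)
```

One packed instruction does the work of four scalar additions, which is how such extensions reduce IC for data-parallel code.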
BACKUP
Compatibility
Backward compatibility (HW responsibility)
  When buying new hardware, it can run existing software:
  • i5 can run SW written for Core2 Duo, Pentium4, PentiumM, Pentium III, Pentium II, Pentium, 486, 386, 286
BTW: forward compatibility (SW responsibility)
  For example: MS Word 2003 can open MS Word 2010 doc
  Commonly supports one or two generations behind
Architecture-independent SW
  Run SW on top of a VM that does JIT (just-in-time compilation): JVM for Java and CLR for .NET
  Interpreted languages: Perl, Python