1-ca-intro - Webcourse (tm)

Download Report

Transcript 1-ca-intro - Webcourse (tm)

Computer Architecture
(“MAMAS”, 234267)
Spring 2015
Lecturer: Yoav Etsion
Reception: Mon 15:00, Fishbach 306-8
TAs: Tomer Gurevich, Franck Sala,
Andrey Zhitnikov
Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, Adi Yoaz and Dan Tsafrir
1
Computer Architecture 2015 – Introduction
General Info

Grade



25% Exercise (mandatory) ‫תקף‬
75% Final exam
Textbook
 “Computer Architecture:
A Quantitative Approach” (4th Edition)
by: Patterson & Hennessy

Other course information


2
Course web site:
http://webcourse.cs.technion.ac.il/234267/Spring2013
Lectures will be upload to the web a day before the class
Computer Architecture 2015 – Introduction
Assignments

Five mandatory assignments during the semester


NO CHEATING!




Copying any part of the assignments from another student or using a
reference from a previous year
Letting someone else copy from you
Posting code/solutions to a shared forum

We can track the code back to the posters!

3
Suspected parties will be sent to a Technion trial
Typical outcome: course disqualified, which means your
graduation will be postponed by at least one semester
Possible examples of cheating (a few of many):


Only programming assignments
Be honest and work by yourselves!
Computer Architecture 2015 – Introduction
The Computer System
4
Computer Architecture 2015 – Introduction
Classical System Diagram
Cache
More to the “north”
= closer to the CPU
= faster
CPU
CPU BUS
North Bridge
External
Graphics
Card
DDR2 or DDR3
Channel 1
PCI express 2.0
IOMMU
On-board Memory
Graphics controller
Serial Port
Parallel Port
IO Controller
5
DDR2 or DDR3
Channel 2
PCI express ×1
South Bridge
Floppy
Drive
Mem BUS
keybrd
USB
controller
mouse
SATA
controller
DVD
Drive
Hard
Disk
PCI
Sound
Card
speakers
Lan
Adap
LAN
Computer Architecture 2015 – Introduction
Intel Nehalem Core i3 i5 i7
For high-end i-Series chips,
Northbridge functionality
moved onto processor
(=> made faster)
45 to 32 nm
6
Computer Architecture 2015 – Introduction
Intel Sandy Bridge Core i3 i5 i7
32 to 22 nm
7
The trend
continues
Computer Architecture 2015 – Introduction
Computer and System Architecture


The physical design of the computer system is
commonly known as Computer Architecture
The definition of computer architecture:
Prof. Christos Kozyrakis, Stanford
Computer theorists invent algorithms that solve
important problems and analyze their asymptotic
behavior (e.g. O(NlogN) or O(N3)).
Computer architects set the constant factors for
these algorithms…
8
Computer Architecture 2015 – Introduction
Computer and System Architecture


Axiom:
The ultimate goal is performance (ops/sec)
Constraints:



Practical goal:
Highest performance in light of constraints
Today, the most common goal is Performance/Watt

9
Power, Number of transistors, Bandwidth, …
e.g. ops/Joule
Computer Architecture 2015 – Introduction
Course Focus

Start from CPU (=processor)





Move on to Memory Hierarchy



Caching and Main memory
Virtual Memory
Finish with latency tolerance and parallelism

10
Instruction set, performance
Pipeline, hazards
Branch prediction
Out-of-order execution
Hardware multithreading
Computer Architecture 2015 – Introduction
The Processor
11
Computer Architecture 2015 – Introduction
Architecture vs. Microarchitecture

Architecture:
= The processor features as seen by its user
= Interface


Microarchitecture:
= Manner by which the processor is implemented
= Implementation details


Caches size and structure, number of execution units, …
Note: different processors with different u-archs
can support the same arch


12
Instruction set, number of registers, addressing modes,…
Intel Pentium-IV vs. Intel Core2 Duo
ARMv9 implemented by Qualcomm, TI, Samsung
Computer Architecture 2015 – Introduction
Why Should We Care?

Abstractions enhance productivity, so:



Same goes for arch


Just details for a programmer of a high-level language
Abstractions only work so long as what’s below
works

13
If we know the arch (=interface),
Why should we care about the u-arch (=internals)?
The taxi story: http://vimeo.com/11478146 (4:50-6:00)
Computer Architecture 2015 – Introduction
Recent Processor Trends
Source: http://www.scidacreview.org/0904/html/multicore.html
14
Computer Architecture 2015 – Introduction
Well-Known Moore’s Law
Graph taken from: http://www.intel.com/technology/mooreslaw/index.htm
15
Computer Architecture 2015 – Introduction
16
Computer Architecture 2015 – Introduction
The Story in a Nutshell
Transistors
(1000s)
clock speed
(MHz)
power (W)
Instructions/cycle
(ILP)
17
Computer Architecture 2015 – Introduction
Took the Industry by Surprise
18
Computer Architecture 2015 – Introduction
Dire Implications: Performance
19
Computer Architecture 2015 – Introduction
Dire Implications: Programmers
22
Computer Architecture 2015 – Introduction
Supercomputing: “Top 500 list”
23
Computer Architecture 2015 – Introduction
Dire Implications: Supercomputing
24
Computer Architecture 2015 – Introduction
Processor Performance
25
Computer Architecture 2015 – Introduction
Metrics: IC, CPI, IPC

CPUs work according to a clock signal



Instruction Count (IC)


Clock cycle: measured in nanoseconds (10-9 of a second)
Clock frequency = 1/|clock cycle|: in GHz (109 cycles/sec)
Total number of instructions executed in the program
Cycles Per Instruction (CPI)

Average #cycles per Instruction (in a given program)
CPI =

26
#cycles required to execute the program
IC
IPC (= 1/CPI) : Instructions per cycles.
Can be > 1; see the “story in a nutshell slide”
Computer Architecture 2015 – Introduction
Minimizing Execution Time

CPU Time - time required to execute a program
CPU Time = IC  CPI  clock cycle

Our goal:
minimize CPU Time (any of above components)

Minimize clock cycle: increase GHz (processor design)

Minimize CPI:
u-arch (e.g.: more execution units)

Minimize IC:
arch + u-arch (e.g.: SSETM)
SSE = streaming SIMD extension (Intel)
27
Computer Architecture 2015 – Introduction
Alternative Way to Calculate CPI


ICi = #times instruction of type-i is executed in program
n
IC = #instruction executed in program =
IC   ICi
i 1


Fi = relative frequency of type-i instruction = ICi/IC
CPIi = #cycles to execute type-i instruction


e.g.: CPIadd = 1, CPImul = 3
n
#cycles required to execute the program:
# cyc   CPI i  ICi
i 1

CPI:
n
# cyc
CPI 

IC
28
 CPI  IC
i 1
i
IC
i
n
ICi n
  CPIi 
  CPIi  Fi
IC i 1
i 1
Computer Architecture 2015 – Introduction
Performance Evaluation: How?

No simple answer

Performance depends on



Mathematical analysis



Typically impossible
Systems is too complex to model accurately
So how do we evaluate systems?

29
Application
Input
Empirical analysis
Computer Architecture 2015 – Introduction
Benchmarks

Use benchmarks & measure how long it takes


Use real applications (=> no absolute answers)
Preferably standardized benchmarks (+input), e.g.,

SPEC INT: integer apps
• Compression, C complier, Perl, text-processing, …




Sometimes you see FLOPS (“peak” or “sustained”)

30
SPEC FP: floating point apps (mostly scientific)
TPC benchmarks: measure transaction throughput (DB)
SPEC JBB: models wholesale company (Java server, DB)
Supercomputers (top500 list), against LINPACK
Computer Architecture 2015 – Introduction
Evaluating Performance

Use a performance simulator to evaluate the
performance of a new feature / algorithm



Models the uarch to a great detail
Run 100’s of representative applications
Produce the performance s-curve


Sort the applications according to the IPC increase
Baseline (0%) is the processor without the new feature
3%
Bad S-curve
2%
6%
Positive
outliers
Good S-curve
Positive
outliers
4%
1%
0%
2%
-1%
-2%
Negative
outliers
-3%
-4%
0%
Small negative
outliers
-2%
31
Computer Architecture 2015 – Introduction
Amdahl’s Law


Suppose we accelerate the computation such that
 P = portion of computation we make faster
 S = speedup experienced by the portion we improved
For example

If an improvement can speedup 40% of the computation
=> P = 0.4


32
If the improvement makes the portion run twice as fast
=> S = 2
Then overall speedup
=
1
(1  P)  P
S
Computer Architecture 2015 – Introduction
Amdahl’s Law - Example

FP operations improved to run 2x faster
 S = 2, but…
 P = only affects 10% of the program

Speedup:
1
1


 1.053
0.1
0.95
(1  P)  P
(1

0.1)

S
2

Conclusion

33
1
Better to make common case fast…
Computer Architecture 2015 – Introduction
Amdahl’s Law – Parallelism

When parallelizing a program

P = proportion of program that can be made parallel
 1 - P = inherently serial


N = number of processing elements (say, cores)
Speedup:
1
(1  P)  P

N
Serial component imposes a hard limit


1
1


lim

N  
(1  P)
(1  P)  P 
N

34
Computer Architecture 2015 – Introduction
Instruction Set Design
software
The ISA is what the user
& compiler see
instruction set
hardware
35
The HW implements the
ISA
Computer Architecture 2015 – Introduction
Considerations in ISA Design

Instruction size

Long instructions take more time to fetch from memory

Longer instructions require a larger memory
• Important for small (embedded) devices

Number of instructions (IC)


36
Reduce IC => reduce runtime (at a given CPI & frequency)
Virtues of instructions simplicity

Simpler HW allows for: higher frequency & lower power

Optimization can be applied better to simpler code

Cheaper HW
Computer Architecture 2015 – Introduction
Basing Design Decisions on Workload
Immediate argument’s size in bits (histogram)
30%
Int. Avg.
FP Avg.
20%
10%
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
0%
Immediate data bits


37
1% of data values > 16-bits
Having 16 bits is likely good enough
Computer Architecture 2015 – Introduction
CISC Processors

CISC - Complex Instruction Set Computer


Example: x86
The idea: a high level machine language
• When people programmed in assembly, CISC supposedly easier

Characteristic


Many instruction types, with a many addressing modes
Some of the instructions are complex
• Execute complex tasks
• Require many cycles

ALU operations directly on memory (e.g., arr[j] = arr[i]+n)
• Registers not used (and, accordingly, only a few registers exist)

Variable length instructions
• common instructions get short codes  save code length
38
Computer Architecture 2015 – Introduction
But it Turns Out…
Rank
instruction
% of total executed
1
load
22%
2
conditional branch
20%
3
compare
16%
4
store
12%
5
add
8%
6
and
6%
7
sub
5%
8
move register-register
4%
9
call
1%
10
return
1%
Total
96%
Simple instructions dominate instruction frequency
39
Computer Architecture 2015 – Introduction
CISC Drawbacks

Complex instructions and complex addressing modes
 complicates the processor
 slows down the simple, common instructions
 contradicts Make The Common Case Fast

Compilers don’t use complex instructions / indexing methods

Variable length instructions are real pain in the neck


Difficult to decode few instructions in parallel
• As long as instruction is not decoded, its length is unknown
 It is unknown where the instruction ends
 It is unknown where the next instruction starts
An instruction may be longer than a cache line
• Or even longer longer than a page (in theory)
40
Computer Architecture 2015 – Introduction
RISC Processors

RISC - Reduced Instruction Set Computer


The idea: simple instructions enable fast hardware
Characteristic



A small instruction set, with only a few instructions formats
A few indexing methods
Load/Store machine: operate only on registers
• Memory is accessed using Load and Store instructions only
• Many orthogonal registers
• Three address machine:
Add dst, src1, src2


41
Fixed length instructions
Examples:
ARMTM MIPSTM, SparcTM, AlphaTM, PowerTM
Computer Architecture 2015 – Introduction
RISC Processors (Cont.)

Simple arch => simple u-arch





Compiler can be smarter



Better pipeline usage
Better register allocation
Existing RISC processor are not “pure” RISC

42
Smaller => faster
Easier to design & validate (=> cheaper to manufacture)
Shorten time-to-market
More general-purpose registers (=> less memory refs)
Various complex operations added along the way
Computer Architecture 2015 – Introduction
Compilers and ISA

Ease of compilation

Orthogonality:
• no special registers
• few special cases
• all operand modes available with any data type or instruction
type

Regularity:
• no overloading for the meanings of instruction fields

streamlined
• resource needs easily determined

Register assignment is critical too

43
Easier if lots of registers
Computer Architecture 2015 – Introduction
CISC or RISC?

ARM (RISC) dominates the personal market


x86 (CISC) dominates the PC and server market



Programs are more compact (fewer instructions)
Dynamic optimizations
Pro RISC:

44
CISC instructions are broken down to RISC micro-ops
Pro CISC:


Not necessarily because it is CISC…
Intel processors have a RISC arch


Not necessarily because it is RISC…
Simpler and faster uarch
Computer Architecture 2015 – Introduction
ISA Extensions

Extend arch to accelerate exec of specific apps

Example: SSETM – Streaming SIMD Extensions




128-bit packed (vector) / scalar single precision FP (4×32)
Introduced on Pentium® III on ’99
8 new 128 bit registers (XMM0 – XMM7)
Accelerates graphics, video, scientific calculations, …
128-bits
x3
x2
x1
x0
y0
+
45
y3
y2
y1
x3+y3
x2+y2
x1+y1
x0+y0
Computer Architecture 2015 – Introduction
BACKUP
46
Computer Architecture 2015 – Introduction
Compatibility

Backward compatibility (HW responsibility)

When buying new hardware, it can run existing software:
• i5 can run SW written for Core2 Duo, Pentium4, PentiumM,
Pentium III, Pentium II, Pentium, 486, 386, 268
BTW:

Forward compatibility (SW responsibility)



Architecture-independent SW


47
For example: MS Word 2003 can open MS Word 2010 doc
Commonly supports one or two generations behind
Run SW on top of VM that does JIT (just in time compiler):
JVM for Java and CLR for .NET
Interpreted languages: Perl, Python
Computer Architecture 2015 – Introduction