ca-2012-03-12-intro

Transcript ca-2012-03-12-intro

Computer Architecture
(“MAMAS”, 234267)
Spring 2012
Lecturer: Dan Tsafrir
Reception: Mon 18:30, Taub 611
12/3/2012
Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, and Adi Yoaz
Computer Architecture 2012 – Introduction (lec1)
General Info

Grade
  20% Exercise (mandatory, binding)
  80% Final exam
Textbook
  “Computer Architecture: A Quantitative Approach” (4th Edition)
  by: Hennessy & Patterson
Other course information
  Course web site:
  http://webcourse.cs.technion.ac.il/234267/Spring2012
  Lectures will be uploaded to the web site a day before class
Computer System Structure
Classical Motherboard Diagram
[Diagram labels: CPU with cache; CPU bus; North Bridge (memory controller, on-board graphics, IOMMU); memory bus with DDR2 or DDR3 channels 1 and 2; PCI Express 2.0 to an external graphics card; South Bridge; PCI Express ×1; PCI (sound card with speakers, LAN adapter to the LAN); USB controller; SATA controller (hard disk, DVD drive); IO controller (floppy drive, serial port, parallel port); keyboard; mouse]
More to the “north” = closer to the CPU = faster
Intel Core 2
Northbridge = MCH = memory controller hub
Southbridge = ICH = I/O controller hub
Notice bandwidths
65 to 45 nm
Intel Nehalem Core i3 i5 i7
For high-end i-Series chips, Northbridge functionality moved onto the processor (=> made faster)
45 to 32 nm
Intel Sandy Bridge Core i3 i5 i7
32 to 22 nm
The trend continues
Course Focus

Start from CPU (=processor)
  Instruction set, performance
  Pipeline, hazards
  Branch prediction
  Out-of-order execution
Move on to Memory Hierarchy
  Caching
  Main memory
  Virtual Memory
Move on to PC Architecture
  Motherboard & chipset, DRAM, I/O, Disk, peripherals
End with some Advanced Topics
The Processor
Architecture vs. Microarchitecture

Architecture:
= The processor features as seen by its user
= Interface
  Instruction set, number of registers, addressing modes, …
Microarchitecture:
= Manner by which the processor is implemented
= Implementation details
  Cache size and structure, number of execution units, …
Note: different processors with different u-archs can support the same arch
  Example: Intel Pentium-IV vs. Intel Core2 Duo
We will address both
Why Should We Care?

Abstractions enhance productivity, so:
  If we know the arch (=interface),
  why should we care about the u-arch (=internals)?
Same goes for arch
  Just details for a programmer of a high-level language
Abstractions only work so long as what’s below works
  The taxi story: http://vimeo.com/11478146 (4:50-6:00)
Recent Processor Trends
Source: http://www.scidacreview.org/0904/html/multicore.html
Well-Known Moore’s Law
Graph taken from: http://www.intel.com/technology/mooreslaw/index.htm
The Story in a Nutshell
[Figure: curves for transistors (1000s), clock speed (MHz), power (W), and instructions/cycle (ILP)]
Took the Industry by Surprise
Dire Implications: Performance
Dire Implications: Sales
Dire Implications: Sales
Dire Implications: Programmers
Supercomputing: “Top 500 list”
Dire Implications: Supercomputing
Processor Performance
Metrics: IC, CPI, IPC

CPUs work according to a clock signal
  Clock cycle: measured in nanoseconds ($10^{-9}$ of a second)
  Clock frequency = 1/|clock cycle|: in GHz ($10^{9}$ cycles/sec)
Instruction Count (IC)
  Total number of instructions executed in the program
Cycles Per Instruction (CPI)
  Average #cycles per instruction (in a given program)
  $CPI = \frac{\#\text{cycles required to execute the program}}{IC}$
  IPC (= 1/CPI): instructions per cycle
  Can be > 1; see the “Story in a Nutshell” slide
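The metric definitions above can be sketched in a few lines of Python. The function names and the sample numbers are hypothetical, chosen only to illustrate the definitions; they are not taken from the slides.

```python
def clock_frequency_ghz(cycle_time_ns):
    """Clock frequency in GHz for a clock cycle given in nanoseconds."""
    return 1.0 / cycle_time_ns


def cycles_per_instruction(total_cycles, instruction_count):
    """CPI = #cycles required to execute the program / IC."""
    return total_cycles / instruction_count


def instructions_per_cycle(total_cycles, instruction_count):
    """IPC = 1 / CPI."""
    return instruction_count / total_cycles


if __name__ == "__main__":
    # Hypothetical program: 2 million instructions executed in 1 million cycles.
    print(clock_frequency_ghz(0.5))                       # 0.5 ns cycle
    print(cycles_per_instruction(1_000_000, 2_000_000))   # CPI below 1
    print(instructions_per_cycle(1_000_000, 2_000_000))   # IPC above 1
```

In this made-up case the processor retires two instructions per cycle on average, which is the "IPC can be > 1" situation mentioned above.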
Minimizing Execution Time

CPU Time - time required to execute a program
  CPU Time = IC × CPI × clock cycle
Our goal: minimize CPU Time (any of above components)
  Minimize clock cycle: increase GHz (processor design)
  Minimize CPI: u-arch (e.g.: more execution units)
  Minimize IC: arch + u-arch (e.g.: SSE™)
    SSE = streaming SIMD extension (Intel)
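As a rough illustration of the CPU Time formula, the sketch below evaluates it for a baseline and for three variants, each improving one component by 20%. All numbers and the function name `cpu_time_seconds` are assumptions made up for this example.

```python
def cpu_time_seconds(instruction_count, cpi, cycle_time_ns):
    """CPU Time = IC x CPI x clock cycle (cycle given in nanoseconds)."""
    return instruction_count * cpi * cycle_time_ns * 1e-9


if __name__ == "__main__":
    # Hypothetical scenarios: (IC, CPI, clock cycle in ns).
    scenarios = {
        "baseline": (1_000_000_000, 1.5, 0.5),
        "shorter clock cycle": (1_000_000_000, 1.5, 0.4),
        "lower CPI": (1_000_000_000, 1.2, 0.5),
        "lower IC": (800_000_000, 1.5, 0.5),
    }
    for name, (ic, cpi, cycle_ns) in scenarios.items():
        print(f"{name}: {cpu_time_seconds(ic, cpi, cycle_ns):.2f} s")
```

Because the three factors are multiplied, a 20% reduction in any one of them shortens the run time by the same amount, provided the other two stay fixed.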
Alternative Way to Calculate CPI


ICi = #times instruction of type-i is executed in program
IC = #instructions executed in program:
  $IC = \sum_{i=1}^{n} IC_i$
Fi = relative frequency of type-i instruction = ICi/IC
CPIi = #cycles to execute type-i instruction
  e.g.: CPIadd = 1, CPImul = 3
#cycles required to execute the program:
  $\#cyc = \sum_{i=1}^{n} CPI_i \cdot IC_i$
CPI:
  $CPI = \frac{\#cyc}{IC} = \frac{\sum_{i=1}^{n} CPI_i \cdot IC_i}{IC} = \sum_{i=1}^{n} CPI_i \cdot \frac{IC_i}{IC} = \sum_{i=1}^{n} CPI_i \cdot F_i$
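A small sketch of the two equivalent ways to compute CPI described above. The instruction mix (800 adds, 200 multiplies) is a made-up example; only CPIadd = 1 and CPImul = 3 come from the slide.

```python
def cpi_from_counts(mix):
    """CPI = (sum of CPI_i * IC_i) / IC, where mix maps type -> (IC_i, CPI_i)."""
    total_ic = sum(count for count, _ in mix.values())
    total_cycles = sum(count * cycles for count, cycles in mix.values())
    return total_cycles / total_ic


def cpi_from_frequencies(mix):
    """CPI = sum of CPI_i * F_i, where F_i = IC_i / IC."""
    total_ic = sum(count for count, _ in mix.values())
    return sum(cycles * (count / total_ic) for count, cycles in mix.values())


if __name__ == "__main__":
    # Hypothetical program: 800 adds (1 cycle each) and 200 multiplies (3 cycles each).
    mix = {"add": (800, 1), "mul": (200, 3)}
    print(cpi_from_counts(mix))
    print(cpi_from_frequencies(mix))
```

Both functions give about 1.4 for this mix (1400 cycles over 1000 instructions), up to floating-point rounding.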
Performance Evaluation: How?

No simple answer
Performance depends on
  Application
  Input
Mathematical analysis
  Typically impossible
What to do?
Benchmarks

Use benchmarks & measure how long it takes
  Use real applications (=> no absolute answers)
  Preferably standardized benchmarks (+input), e.g.,
    SPEC INT: integer apps
      • Compression, C compiler, Perl, text-processing, …
    SPEC FP: floating point apps (mostly scientific)
    TPC benchmarks: measure transaction throughput (DB)
    SPEC JBB: models wholesale company (Java server, DB)
Sometimes you see FLOPS (“peak” or “sustained”)
  Supercomputers (top500 list), against LINPACK
Evaluating Performance

Use a performance simulator to evaluate the performance of a new feature / algorithm
  Models the u-arch in great detail
  Run 100’s of representative applications
  Produce the performance s-curve
    Sort the applications according to the IPC increase
    Baseline (0%) is the processor without the new feature
[Figure: two example s-curves of IPC change (%) per application. Bad S-curve: positive outliers and negative outliers. Good S-curve: positive outliers and only small negative outliers.]
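The s-curve procedure above amounts to computing the relative IPC change per application and sorting. A minimal sketch, assuming the simulator has already produced per-application IPC values; the application names and numbers below are invented for illustration.

```python
def s_curve(baseline_ipc, feature_ipc):
    """Return (application, % IPC change) pairs sorted from worst to best.

    The baseline (0%) is the processor without the new feature.
    """
    changes = {
        app: 100.0 * (feature_ipc[app] - baseline_ipc[app]) / baseline_ipc[app]
        for app in baseline_ipc
    }
    return sorted(changes.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical simulator results for three applications.
    baseline = {"app_a": 1.00, "app_b": 2.00, "app_c": 1.00}
    with_feature = {"app_a": 1.05, "app_b": 1.96, "app_c": 1.00}
    for app, change in s_curve(baseline, with_feature):
        print(f"{app}: {change:+.1f}%")
```

Plotting the sorted values left to right gives the s-curve; the leftmost entries are the negative outliers and the rightmost the positive ones.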
Amdahl’s Law


Suppose we accelerate the computation such that
  P = proportion of computation we make faster
  S = speedup experienced by the proportion we improved
For example
  If an improvement can speed up 40% of the computation => P = 0.4
  If the improvement makes the portion run twice as fast => S = 2
Then overall speedup = $\frac{1}{(1 - P) + \frac{P}{S}}$
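Plugging the example values P = 0.4 and S = 2 into the formula completes the calculation:

```latex
\text{speedup} = \frac{1}{(1 - P) + \frac{P}{S}}
               = \frac{1}{(1 - 0.4) + \frac{0.4}{2}}
               = \frac{1}{0.6 + 0.2}
               = 1.25
```

So doubling the speed of 40% of the computation makes the whole computation only 1.25 times faster.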
Amdahl’s Law - Example

FP operations improved to run 2x faster
  S = 2, but…
  P = only affects 10% of the program
Speedup:
  $\frac{1}{(1 - P) + \frac{P}{S}} = \frac{1}{(1 - 0.1) + \frac{0.1}{2}} = \frac{1}{0.95} \approx 1.053$
Conclusion
  Better to make common case fast…
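The same calculation as a short Python sketch. The function name `amdahl_speedup` is just a label for this example; the third call uses a hypothetical P = 0.9 to contrast a common case with the rare one.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a proportion p of the computation runs s times faster."""
    return 1.0 / ((1.0 - p) + p / s)


if __name__ == "__main__":
    print(f"{amdahl_speedup(0.1, 2):.3f}")  # FP example: only 10% of the program affected
    print(f"{amdahl_speedup(0.4, 2):.3f}")  # 40% of the computation affected
    print(f"{amdahl_speedup(0.9, 2):.3f}")  # hypothetical common case: 90% affected
```

With the same S = 2, the overall speedup grows from about 1.05 to about 1.82 as the affected proportion goes from 10% to 90%, which is the point of making the common case fast.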
Amdahl’s Law – Parallelism

When parallelizing a program
  P = proportion of program that can be made parallel
  1 - P = inherently serial
  N = number of processing elements (say, cores)
Speedup:
  $\frac{1}{(1 - P) + \frac{P}{N}}$
Serial component imposes a hard limit
  $\lim_{N \to \infty} \frac{1}{(1 - P) + \frac{P}{N}} = \frac{1}{1 - P}$
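To see the hard limit numerically, the sketch below evaluates the parallel speedup for a growing number of cores. The value P = 0.9 and the function names are assumptions chosen for illustration.

```python
def parallel_speedup(p, n):
    """Speedup on n processing elements when a proportion p is parallelizable."""
    return 1.0 / ((1.0 - p) + p / n)


def speedup_limit(p):
    """Limit of the speedup as n grows without bound: 1 / (1 - p)."""
    return 1.0 / (1.0 - p)


if __name__ == "__main__":
    p = 0.9  # hypothetical: 90% of the program can be made parallel
    for n in (1, 2, 4, 16, 256, 4096):
        print(f"N={n:5d}: speedup {parallel_speedup(p, n):.2f}")
    print(f"limit: {speedup_limit(p):.2f}")
```

Even with thousands of cores the speedup stays below 1/(1 - P), which is 10 for P = 0.9.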
Instruction Set Design
[Diagram: three layers, software on top, instruction set in the middle, hardware at the bottom]
The ISA is what the user & compiler see
The HW implements the ISA
Considerations in ISA Design

Instruction size
  Long instructions take more time to fetch from memory
  Longer instructions require a larger memory
    • Important for small (embedded) devices, e.g., cell phones
Number of instructions (IC)
  Reduce IC => reduce runtime (at a given CPI & frequency)
Virtues of instruction simplicity
  Simpler HW allows for: higher frequency & lower power
  Optimization can be applied better to simpler code
  Cheaper HW
Basing Design Decisions on Workload
[Figure: histogram of immediate argument size in bits; x-axis: immediate data bits (0 to 15), y-axis: 0% to 30%; series: Int. Avg. and FP Avg.]
1% of data values > 16-bits
Having 16 bits is likely good enough
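The kind of workload measurement behind such a histogram can be sketched as follows. This is only an illustration: the sample immediates are invented, and the assumption that immediates are encoded as signed two's-complement values is ours, not stated on the slide.

```python
def bits_needed(value):
    """Bits needed to hold value as a two's-complement number (assumed encoding)."""
    if value >= 0:
        return value.bit_length() + 1      # magnitude bits plus a sign bit
    return (~value).bit_length() + 1


def fraction_fitting(values, field_bits):
    """Fraction of the immediates that fit in a field of field_bits bits."""
    fitting = sum(1 for v in values if bits_needed(v) <= field_bits)
    return fitting / len(values)


if __name__ == "__main__":
    # Hypothetical immediates collected from a workload trace.
    immediates = [0, 1, -1, 4, 255, -128, 1000, 32767, 40000, -70000]
    for width in (8, 16):
        print(f"{width}-bit field covers {fraction_fitting(immediates, width):.0%}")
```

Running this over a real trace, per field width, is one way to check whether a given immediate size is "likely good enough".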
CISC Processors

CISC - Complex Instruction Set Computer
  Example: x86
  The idea: a high level machine language
    • Back when people programmed in assembly, CISC was supposedly easier
Characteristic
  Many instruction types, with many addressing modes
  Some of the instructions are complex
    • Execute complex tasks
    • Require many cycles
  ALU operations directly on memory (e.g., arr[j] = arr[i]+n)
    • Registers not used (and, accordingly, only a few registers exist)
  Variable length instructions
    • Common instructions get short codes => save code length
But it Turns Out…
Rank  Instruction             % of total executed
1     load                    22%
2     conditional branch      20%
3     compare                 16%
4     store                   12%
5     add                      8%
6     and                      6%
7     sub                      5%
8     move register-register   4%
9     call                     1%
10    return                   1%
      Total                   96%
Simple instructions dominate instruction frequency
CISC Drawbacks

Complex instructions and complex addressing modes
  => complicate the processor
  => slow down the simple, common instructions
  => contradict Make The Common Case Fast
Compilers don’t use complex instructions / indexing methods
Variable length instructions are a real pain in the neck
  Difficult to decode several instructions in parallel
    • As long as an instruction is not decoded, its length is unknown
      => It is unknown where the instruction ends
      => It is unknown where the next instruction starts
  An instruction may span more than a single cache line
  An instruction may span more than a single page
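A toy sketch of why variable length hinders parallel decoding. The encoding here is hypothetical (the first byte of each instruction holds its total length in bytes) and is not x86; it only illustrates that each instruction start depends on decoding the previous one, whereas fixed-length starts can be computed independently.

```python
def variable_length_starts(code):
    """Find instruction starts by decoding sequentially (toy encoding)."""
    starts = []
    offset = 0
    while offset < len(code):
        starts.append(offset)
        length = code[offset]  # toy rule: first byte = instruction length in bytes
        if length == 0:
            raise ValueError("invalid zero-length instruction")
        offset += length       # the next start is known only after this decode
    return starts


def fixed_length_starts(code, width=4):
    """With fixed-length instructions, every start is known up front."""
    return list(range(0, len(code), width))


def crosses_boundary(start, length, boundary=64):
    """True if an instruction spans two aligned blocks (e.g., 64-byte cache lines)."""
    return start // boundary != (start + length - 1) // boundary


if __name__ == "__main__":
    code = bytes([1, 3, 0xAA, 0xBB, 2, 0xCC, 4, 1, 2, 3])
    print(variable_length_starts(code))
    print(fixed_length_starts(bytes(12)))
    print(crosses_boundary(62, 4))
```

The 64-byte boundary in `crosses_boundary` is an assumed cache line size; the same check with a page size models an instruction that spans two pages.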
RISC Processors

RISC - Reduced Instruction Set Computer
  The idea: simple instructions enable fast hardware
Characteristic
  A small instruction set, with only a few instruction formats
  Simple instructions
    • Execute simple tasks
    • Most of them require a single cycle (with pipeline)
  A few indexing methods
  ALU operations on registers only
    • Memory is accessed using Load and Store instructions only
    • Many orthogonal registers
    • Three address machine: Add dst, src1, src2
  Fixed length instructions
Examples: MIPS™, Sparc™, Alpha™, Power™
RISC Processors (Cont.)

Simple arch => simple u-arch
  Room for larger on-die caches
  Smaller => faster
  Easier to design & validate (=> cheaper to manufacture)
  Shorter time-to-market
  More general-purpose registers (=> fewer memory refs)
Compiler can be smarter
  Better pipeline usage
  Better register allocation
Existing RISC processors are not “pure” RISC
  e.g., support division which takes many cycles
Compilers and ISA

Ease of compilation
  Orthogonality:
    • no special registers
    • few special cases
    • all operand modes available with any data type or instruction type
  Regularity:
    • no overloading for the meanings of instruction fields
  Streamlined
    • resource needs easily determined
Register assignment is critical too
  Easier if lots of registers
Still, CISC Is Dominant

x86 (CISC) dominates the processor market
Legacy
  A vast amount of existing software
  Intel, AMD, Microsoft benefit
  But they invest a lot of money to compensate for the disadvantages
CISC arch internally emulates RISC
  Starting at Pentium II and K6, x86 processors translate CISC instructions into RISC-like operations internally
  Inside, the core looks much like that of a RISC processor
Software Specific Extensions

Extend arch to accelerate exec of specific apps
Example: SSE™ – Streaming SIMD Extensions
  128-bit packed (vector) / scalar single precision FP (4×32)
  Introduced on Pentium® III in ’99
  8 new 128 bit registers (XMM0 – XMM7)
  Accelerates graphics, video, scientific calculations, …
[Figure: packed vs. scalar add on 128-bit registers]
  Packed: (x3, x2, x1, x0) + (y3, y2, y1, y0) = (x3+y3, x2+y2, x1+y1, x0+y0)
  Scalar: (x3, x2, x1, x0) + (y3, y2, y1, y0) = (y3, y2, y1, x0+y0)
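The difference between the packed and scalar forms can be modeled in plain Python. This is only a software model of the figure, not actual SSE code: registers are represented as 4-element lists with index 0 holding lane 0 (x0 or y0), and the function names are made up for this sketch.

```python
def packed_add(x, y):
    """Packed add: all four 32-bit lanes are added in one operation."""
    return [a + b for a, b in zip(x, y)]


def scalar_add(x, y):
    """Scalar add: only lane 0 is added; lanes 1 to 3 are taken from y, as in the figure."""
    return [x[0] + y[0]] + list(y[1:])


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]      # lanes x0..x3
    y = [10.0, 20.0, 30.0, 40.0]  # lanes y0..y3
    print(packed_add(x, y))
    print(scalar_add(x, y))
```

A packed add does the work of four scalar adds with one instruction, which is how such an extension reduces IC for vector-style code.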
BACKUP
Compatibility

Backward compatibility (HW responsibility)
  When buying new hardware, it can run existing software:
    • i5 can run SW written for Core2 Duo, Pentium4, PentiumM, Pentium III, Pentium II, Pentium, 486, 386, 286
BTW:
  Forward compatibility (SW responsibility)
    For example: MS Word 2003 can open MS Word 2010 doc
    Commonly supports one or two generations behind
Architecture-independent SW
  Run SW on top of VM that does JIT (just in time compiler): JVM for Java and CLR for .NET
  Interpreted languages: Perl, Python
