MAMAS – Computer Architecture
234267
Lecturer: Dr. Lihu Rappoport
Some of the slides were taken from Avi Mendelson, Randi Katz, Patterson, Gabriel Loh
Computer Architecture 2010 – Introduction

General Course Information
• Grade
  • 20% Exercise (mandatory, binding)
  • 80% Final exam
  • No midterm exam
• Textbooks
  • Computer Architecture: A Quantitative Approach, Hennessy & Patterson
• Other course information
  • Course web site: http://webcourse.cs.technion.ac.il/234267
  • Foils will be on the web several days before the class

Lecturer details
• Name: Lihu Rappoport (ליהוא רפופורט)
• Phone: 04-865-1554
• Email: [email protected]

Class Focus
• CPU
  • Introduction: performance, instruction set (RISC vs. CISC)
  • Pipeline, hazards
  • Branch prediction
  • Out-of-order execution
• Memory Hierarchy
  • Cache
  • Main memory
  • Virtual Memory
• Advanced Topics
• PC Architecture
  • Motherboard & chipset, DRAM, I/O, Disk, peripherals

Computer System Structure
[Figure: block diagram of a PC system. The CPU (with cache) connects over the CPU bus to the North Bridge, which contains the memory controller (two DDRII channels on the memory bus) and on-board graphics, and links to an external graphics card over PCI Express ×16. The South Bridge provides PCI Express ×1, PCI (sound card with speakers, LAN adapter), USB, IDE and SATA controllers (old DVD/HD drive, hard disk), and an IO controller (serial port, parallel port, floppy drive, keyboard, mouse).]

Architecture & Microarchitecture
• Architecture
  The processor features seen by the “user”
  • Instruction set, addressing modes, data width, …
• Micro-architecture
  The way a processor is implemented
  • Cache size and structure, number of execution units, …
  • Timing is considered uArch (though it is user visible)
• Processors with different uArch can support the same architecture

Compatibility
• Backward compatibility
  • New hardware can run existing software
    • Core2 Duo can run SW written for Pentium4, PentiumM, Pentium III, Pentium II, Pentium, 486, 386, 286
• Forward compatibility
  • New software can run on existing hardware
  • Example: new software written with SSE2™ runs on an older processor which does not support SSE2™
  • Commonly supports one or two generations behind
• Architecture independent SW
  • JIT – just in time compiler: Java and .NET
  • Binary translation

Performance

Technology Trends and Performance
[Figure: two log-scale charts, 1980–2007, comparing Logic and DRAM. Speed: logic 2× in 3 years, DRAM 1.1× in 3 years; CPU speed and memory speed grow apart. Capacity: DRAM 4× in 3 years, logic 2× in 3 years.]
• Computing capacity: 4× per 3 years
  ⇒ If we could keep all the transistors busy all the time
  ⇒ Actual: 3.3× per 3 years
• Moore’s Law: Performance is doubled every ~18 months
  ⇒ Trend is slowing: process scaling declines, power is up
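
The growth figures on this slide can be put on a common basis. The following is a minimal sketch, not part of the original slides, that converts an "N× per 3 years" figure into a per-year factor and a doubling time; the function names are illustrative.

```python
import math


def annual_growth(factor: float, years: float) -> float:
    """Equivalent per-year growth factor for `factor` growth over `years`."""
    return factor ** (1.0 / years)


def doubling_time_months(factor: float, years: float) -> float:
    """Months needed to double, given `factor` growth over `years`."""
    return 12.0 * years * math.log(2.0) / math.log(factor)


if __name__ == "__main__":
    # The three "per 3 years" figures quoted on the slide.
    for label, factor in [("ideal capacity", 4.0),
                          ("actual", 3.3),
                          ("DRAM speed", 1.1)]:
        print(f"{label}: x{annual_growth(factor, 3):.2f} per year, "
              f"doubles every {doubling_time_months(factor, 3):.1f} months")
```

Under this conversion, 4× per 3 years is the same rate as doubling every 18 months, which is how the two figures on the slide relate.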

Moore’s Law
[Figure: Moore’s Law graph.]
Graph taken from: http://www.intel.com/technology/mooreslaw/index.htm

CPI – Cycles Per Instruction
• CPUs work according to a clock signal
  • Clock cycle is measured in nsec ($10^{-9}$ of a second)
  • Clock frequency (= 1/clock cycle) is measured in GHz ($10^{9}$ cycles/sec)
• Instruction Count (IC)
  • Total number of instructions executed in the program
• CPI – Cycles Per Instruction
  • Average #cycles per instruction (in a given program)
    $\mathrm{CPI} = \frac{\#\text{cycles required to execute the program}}{\mathrm{IC}}$
  • IPC (= 1/CPI): instructions per cycle
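
As a small illustration of the CPI and IPC definitions, here is a sketch in Python. The cycle and instruction counts are made-up numbers, not measurements from the course.

```python
def cpi(total_cycles: int, instruction_count: int) -> float:
    """Average cycles per instruction: #cycles / IC."""
    if instruction_count <= 0:
        raise ValueError("instruction_count must be positive")
    return total_cycles / instruction_count


def ipc(total_cycles: int, instruction_count: int) -> float:
    """Instructions per cycle: the reciprocal of CPI."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return instruction_count / total_cycles


if __name__ == "__main__":
    # Hypothetical program: 2,000,000 instructions taking 3,000,000 cycles.
    print(cpi(3_000_000, 2_000_000))  # 1.5
    print(ipc(3_000_000, 2_000_000))
```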

CPU Time
• CPU Time: time required to execute a program
  $\text{CPU Time} = \mathrm{IC} \times \mathrm{CPI} \times \text{clock cycle}$
• Our goal: minimize CPU Time
  • Minimize clock cycle: more GHz (process, circuit, uArch)
  • Minimize CPI: uArch (e.g., more execution units)
  • Minimize IC: architecture (e.g., SSE™)
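
The CPU Time formula can be tried out with a few lines of Python. This is a sketch with hypothetical numbers (instruction count, CPI, and clock frequency are all assumed), meant only to show how the three factors multiply.

```python
def cpu_time_seconds(instruction_count: float, cpi: float, clock_ghz: float) -> float:
    """CPU Time = IC x CPI x clock cycle, with the clock given in GHz."""
    if clock_ghz <= 0:
        raise ValueError("clock_ghz must be positive")
    clock_cycle = 1.0 / (clock_ghz * 1e9)  # seconds per cycle
    return instruction_count * cpi * clock_cycle


if __name__ == "__main__":
    # Hypothetical program: 2e9 instructions, CPI of 1.5, 3 GHz clock.
    base = cpu_time_seconds(2e9, 1.5, 3.0)
    # Lowering CPI (a uArch improvement) shortens CPU time proportionally.
    better_cpi = cpu_time_seconds(2e9, 1.2, 3.0)
    print(f"baseline: {base:.3f} s, lower CPI: {better_cpi:.3f} s")
```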

Amdahl’s Law
Suppose enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
$ExTime_{new} = ExTime_{old} \times \left[ (1 - Fraction_{enhanced}) + \frac{Fraction_{enhanced}}{Speedup_{enhanced}} \right]$
$Speedup_{overall} = \frac{ExTime_{old}}{ExTime_{new}} = \frac{1}{(1 - Fraction_{enhanced}) + \frac{Fraction_{enhanced}}{Speedup_{enhanced}}}$

Amdahl’s Law: Example
• Floating point instructions improved to run at 2×, but only 10% of executed instructions are FP
  $ExTime_{new} = ExTime_{old} \times (0.9 + 0.1 / 2) = 0.95 \times ExTime_{old}$
  $Speedup_{overall} = \frac{1}{0.95} = 1.053$
• Corollary: Make The Common Case Fast
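
The example above can be checked numerically. The sketch below is an illustrative helper, not course material; it evaluates Amdahl's Law for the 10% / 2× case and shows how little is gained even if the FP part became enormously faster.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when `fraction_enhanced` of the task runs
    `speedup_enhanced` times faster and the rest is unchanged."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


if __name__ == "__main__":
    print(f"{amdahl_speedup(0.1, 2.0):.3f}")   # 1.053, as on the slide
    # Even a huge FP speedup is capped near 1 / 0.9.
    print(f"{amdahl_speedup(0.1, 1e9):.3f}")
```

The second call illustrates the corollary: with only 10% of the work affected, the overall speedup cannot exceed 1/0.9 ≈ 1.11.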

Calculating the CPI of a Program
• $IC_i$: #times an instruction of type i is executed in the program
• IC: #instructions executed in the program: $IC = \sum_{i=1}^{n} IC_i$
• $F_i$: relative frequency of instructions of type i: $F_i = IC_i / IC$
• $CPI_i$: #cycles to execute an instruction of type i
  • e.g.: $CPI_{add} = 1$, $CPI_{mul} = 3$
• #cycles required to execute the program:
  $\#cyc = \sum_{i=1}^{n} CPI_i \times IC_i = CPI \times IC$
• CPI:
  $CPI = \frac{\#cyc}{IC} = \frac{\sum_{i=1}^{n} CPI_i \times IC_i}{IC} = \sum_{i=1}^{n} CPI_i \times \frac{IC_i}{IC} = \sum_{i=1}^{n} CPI_i \times F_i$
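
A short sketch of this calculation in Python follows. The per-type CPI values for add and mul come from the slide; the instruction counts are a hypothetical mix chosen for illustration. Both forms of the formula (via counts and via frequencies) are computed to show they agree.

```python
def program_cpi(counts: dict, cpi_by_type: dict) -> float:
    """CPI = sum(CPI_i * IC_i) / IC."""
    ic = sum(counts.values())
    if ic == 0:
        raise ValueError("program executes no instructions")
    total_cycles = sum(cpi_by_type[kind] * n for kind, n in counts.items())
    return total_cycles / ic


def program_cpi_from_frequencies(counts: dict, cpi_by_type: dict) -> float:
    """Same result via relative frequencies: CPI = sum(CPI_i * F_i)."""
    ic = sum(counts.values())
    if ic == 0:
        raise ValueError("program executes no instructions")
    return sum(cpi_by_type[kind] * (n / ic) for kind, n in counts.items())


if __name__ == "__main__":
    cpi_by_type = {"add": 1, "mul": 3}   # values from the slide
    counts = {"add": 800, "mul": 200}    # hypothetical instruction mix
    print(program_cpi(counts, cpi_by_type))  # 1.4
    print(program_cpi_from_frequencies(counts, cpi_by_type))
```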

Evaluating Performance
• Use a performance simulator to evaluate the performance of a new feature / algorithm
  • Models the uArch in great detail
  • Runs hundreds of representative applications
• Produce the performance S-curve
  • Sort the applications according to the IPC increase
  • Baseline (0) is the processor without the new feature
[Figure: two example S-curves of per-application IPC change. Bad S-curve (roughly -4% to 3%): positive outliers and negative outliers. Good S-curve (roughly -2% to 6%): positive outliers and only small negative outliers.]
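
The S-curve construction described above can be sketched as follows. The application names and IPC values are invented for illustration; a real study would take them from simulator runs.

```python
def s_curve(baseline_ipc: dict, feature_ipc: dict) -> list:
    """Return (application, IPC change in %) pairs sorted by increasing gain.

    The baseline (0%) is the processor without the new feature.
    """
    gains = []
    for app, base in baseline_ipc.items():
        change_pct = (feature_ipc[app] - base) / base * 100.0
        gains.append((app, change_pct))
    return sorted(gains, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical simulator results for four applications.
    baseline = {"a": 1.00, "b": 2.00, "c": 1.25, "d": 0.50}
    with_feature = {"a": 1.02, "b": 1.96, "c": 1.25, "d": 0.53}
    for app, change in s_curve(baseline, with_feature):
        print(f"{app}: {change:+.1f}%")
```

Plotting the sorted values left to right gives the S shape: negative outliers at the left end, positive outliers at the right.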

Comparing Performance
• Peak Performance
  • MIPS, MFLOPS
  • Often not useful: unachievable / unsustainable in practice
• Benchmarks
  • Real applications, or representative parts of real apps
  • Targeted at the specific system usages
  • SPEC INT – integer applications
    • Data compression, C compiler, Perl interpreter, database system, chess-playing, text-processing, …
  • SPEC FP – floating point applications
    • Mostly important scientific applications
  • TPC Benchmarks
    • Measure transaction-processing throughput

Instruction Set Design
[Figure: the instruction set drawn as the interface layer between software and hardware.]
• The ISA is what the user / compiler sees
• The HW implements the ISA

ISA Considerations
• Code size
  • Long instructions take more time to fetch
  • Longer instructions require a larger memory
    • Important in small devices, e.g., cell phones
• Number of instructions (IC)
  • Reducing IC reduces execution time
    • At a given CPI and frequency
• Code “simplicity”
  • Simple HW implementation
    • Higher frequency and lower power
  • Code optimization can better be applied to “simple code”

Architectural Consideration Example
Immediate data size
[Figure: histogram of immediate data size. X-axis: immediate data bits (0 to 15). Y-axis: share of immediates (0% to 30%). Series: Int. Avg., FP Avg.]
• 1% of data values > 16 bits
• 12 – 16 bits are needed
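
The kind of measurement behind this slide can be sketched as follows: for each immediate value, compute how many bits it needs, then see what share of values a given field width covers. This is an illustrative sketch that assumes signed (two's-complement) immediates; the sample values are invented, not the data behind the figure.

```python
def signed_bits_needed(value: int) -> int:
    """Minimum two's-complement width that can hold `value`."""
    if value >= 0:
        return value.bit_length() + 1      # +1 for the sign bit
    return (~value).bit_length() + 1       # e.g. -128 fits in 8 bits


def coverage(immediates: list, width: int) -> float:
    """Fraction of immediates that fit in a signed field of `width` bits."""
    if not immediates:
        raise ValueError("no immediates given")
    fits = sum(1 for v in immediates if signed_bits_needed(v) <= width)
    return fits / len(immediates)


if __name__ == "__main__":
    # Hypothetical immediates collected from a program trace.
    sample = [0, 1, -1, 4, 100, -128, 1000, 32767, 40000, -70000]
    for width in (8, 12, 16):
        print(f"{width}-bit field covers {coverage(sample, width):.0%}")
```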

CISC Processors
• CISC – Complex Instruction Set Computer
  • The idea: a high level machine language
  • Example: x86
• Characteristics
  • Many instruction types, with many addressing modes
  • Some of the instructions are complex
    • Execute complex tasks
    • Require many cycles
  • ALU operations directly on memory
    • Only a few registers, in many cases not orthogonal
  • Variable length instructions
    • Common instructions get short codes ⇒ save code length

Top 10 x86 Instructions

Rank   Instruction               % of total executed
1      load                      22%
2      conditional branch        20%
3      compare                   16%
4      store                     12%
5      add                       8%
6      and                       6%
7      sub                       5%
8      move register-register    4%
9      call                      1%
10     return                    1%
Total                            96%

Simple instructions dominate instruction frequency

CISC Drawbacks
• Complex instructions and complex addressing modes
  ⇒ complicate the processor
  ⇒ slow down the simple, common instructions
  ⇒ contradict Make The Common Case Fast
• Compilers don’t use complex instructions / indexing methods
• Variable length instructions are a real pain in the neck
  • Difficult to decode several instructions in parallel
    • As long as an instruction is not decoded, its length is unknown
      ⇒ It is unknown where the instruction ends
      ⇒ It is unknown where the next instruction starts
  • An instruction may span more than a single cache line
  • An instruction may span more than a single page
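
The decoding problem can be illustrated with a toy model. The sketch below is not x86: it assumes a made-up encoding in which the first byte of each instruction holds its total length. With fixed-length instructions every start address is known up front, so decoders can work in parallel; with variable-length instructions each start depends on decoding the previous instruction.

```python
def fixed_length_starts(code: bytes, width: int = 4) -> list:
    """Every start offset is computable independently of the contents."""
    return list(range(0, len(code), width))


def variable_length_starts(code: bytes) -> list:
    """Start offsets must be found serially, one instruction at a time.

    Toy encoding (an assumption for illustration): the first byte of each
    instruction gives its total length in bytes.
    """
    starts = []
    pc = 0
    while pc < len(code):
        starts.append(pc)
        length = code[pc]          # unknown until this instruction is decoded
        if length == 0:
            raise ValueError(f"invalid length byte at offset {pc}")
        pc += length               # only now is the next start known
    return starts


if __name__ == "__main__":
    stream = bytes([2, 0xAA, 3, 0xBB, 0xCC, 1, 4, 0, 0, 0])
    print(variable_length_starts(stream))
    print(fixed_length_starts(bytes(12)))
```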

RISC Processors
• RISC – Reduced Instruction Set Computer
  • The idea: simple instructions enable fast hardware
• Characteristics
  • A small instruction set, with only a few instruction formats
  • Simple instructions
    • Execute simple tasks
    • Most of them require a single cycle (with pipeline)
  • A few indexing methods
  • ALU operations on registers only
    • Memory is accessed using Load and Store instructions only
    • Many orthogonal registers
    • Three address machine: Add dst, src1, src2
  • Fixed length instructions
• Examples: MIPS™, SPARC™, Alpha™, Power™

RISC Processors (Cont.)
• Simple architecture ⇒ Simple micro-architecture
  • Simple, small and fast control logic
  • Simpler to design and validate
  • Room for large on-die caches
  • Shorter time-to-market
• Using a smart compiler
  • Better pipeline usage
  • Better register allocation
• Existing RISC processors are not “pure” RISC
  • e.g., they support division, which takes many cycles

Compilers and ISA
• Ease of compilation
  • Orthogonality:
    • no special registers
    • few special cases
    • all operand modes available with any data type or instruction type
  • Regularity:
    • no overloading for the meanings of instruction fields
  • Streamlined:
    • resource needs easily determined
• Register assignment is critical too
  • Easier if lots of registers

CISC Is Dominant
• The x86 architecture, which is a CISC architecture, dominates the processor market
  • A vast amount of existing software
  • Intel, AMD, Microsoft and others benefit from this
    • Intel and AMD invest a lot of money to make high performance x86 processors, despite the architectural disadvantage
    • Current x86 processors give the best cost/performance
• CISC processors use arch ideas from the RISC world
  • Starting at Pentium II and K6, x86 processors translate CISC instructions into RISC-like operations internally
    • the inside core looks much like that of a RISC processor
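
To make the idea of translating CISC instructions into RISC-like operations concrete, here is a toy sketch. It is purely illustrative: the tuple format, the `tmp0`/`tmp1` register names, and the `crack` function are hypothetical and do not reflect the actual internal operation formats of any Intel or AMD processor. It handles only two-operand ALU instructions.

```python
def is_memory(operand: str) -> bool:
    """In this toy syntax, memory operands are written as "[address]"."""
    return operand.startswith("[")


def crack(instr: tuple) -> list:
    """Split a two-operand ALU instruction (op, dst, src), where dst is
    also a source, into load / three-address ALU / store operations."""
    op, dst, src = instr
    uops = []
    if is_memory(src):
        uops.append(("load", "tmp0", src))       # bring the source into a register
        src = "tmp0"
    if is_memory(dst):
        uops.append(("load", "tmp1", dst))       # read the old destination value
        uops.append((op, "tmp1", "tmp1", src))   # register-only ALU operation
        uops.append(("store", dst, "tmp1"))      # write the result back
    else:
        uops.append((op, dst, dst, src))
    return uops


if __name__ == "__main__":
    for uop in crack(("add", "[0x100]", "eax")):
        print(uop)
```

A register-to-register instruction maps to a single operation, while a read-modify-write on memory becomes a load, an ALU operation on registers only, and a store, which is the load/store style described for RISC processors.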

Software Specific Extensions
• Extend arch to accelerate exec of specific apps
• Example: SSE™ – Streaming SIMD Extensions
  • 128-bit packed (vector) / scalar single precision FP (4×32)
  • Introduced on Pentium® III in 1999
  • 8 new 128-bit registers (XMM0 – XMM7)
  • Accelerates graphics, video, scientific calculations, …
[Figure: packed vs. scalar operation on 128-bit registers.
 Packed: [x3 x2 x1 x0] + [y3 y2 y1 y0] = [x3+y3 x2+y2 x1+y1 x0+y0]
 Scalar: [x3 x2 x1 x0] + [y3 y2 y1 y0] = [y3 y2 y1 x0+y0]]
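
The difference between the packed and scalar forms can be modeled in a few lines of Python. This is a behavioral sketch only: registers are modeled as 4-element lists ordered [lane3, lane2, lane1, lane0], the function names are illustrative, and Python floats are used rather than true 32-bit single precision.

```python
def packed_add(x: list, y: list) -> list:
    """Packed: all four lanes are added in one operation."""
    if len(x) != 4 or len(y) != 4:
        raise ValueError("a 128-bit register holds four 32-bit lanes")
    return [a + b for a, b in zip(x, y)]


def scalar_add(x: list, y: list) -> list:
    """Scalar: only lane 0 is added; the upper three lanes are passed
    through from y, as drawn in the slide's figure."""
    if len(x) != 4 or len(y) != 4:
        raise ValueError("a 128-bit register holds four 32-bit lanes")
    return y[:3] + [x[3] + y[3]]


if __name__ == "__main__":
    x = [4.0, 3.0, 2.0, 1.0]       # [x3, x2, x1, x0]
    y = [40.0, 30.0, 20.0, 10.0]   # [y3, y2, y1, y0]
    print(packed_add(x, y))   # [44.0, 33.0, 22.0, 11.0]
    print(scalar_add(x, y))   # [40.0, 30.0, 20.0, 11.0]
```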