2 - Webcourse
MAMAS – Computer Architecture
234267
Lecturer: Dr. Lihu Rappoport
Some of the slides were taken from Avi Mendelson, Randi Katz, Patterson, Gabriel Loh
Computer Architecture 2010 – Introduction
General Course Information
Grade
• 20% Exercise (mandatory, binding)
• 80% Final exam
• No midterm exam
Textbooks
• Computer Architecture: A Quantitative Approach, Hennessy & Patterson
Other course information
• Course web site:
http://webcourse.cs.technion.ac.il/234267
• Foils will be on the web several days before the class
Lecturer details
Name: Lihu Rappoport
ליהוא רפופורט
Phone: 04-865-1554
Email: [email protected]
Class Focus
CPU
• Introduction: performance, instruction set (RISC vs. CISC)
• Pipeline, hazards
• Branch prediction
• Out-of-order execution
Memory Hierarchy
• Cache
• Main memory
• Virtual Memory
Advanced Topics
PC Architecture
• Motherboard & chipset, DRAM, I/O, Disk, peripherals
Computer System Structure
[Figure: block diagram of a PC system]
CPU with cache, connected over the CPU bus to the North Bridge
North Bridge: DDRII memory controller (channels 1 and 2, over the memory bus), on-board graphics, PCI express ×16 to an external graphics card
South Bridge: PCI express ×1; PCI (sound card with speakers, LAN adapter to the LAN); USB controller; IDE controller (old DVD / HD drive); SATA controller (hard disk); IO controller (serial port, parallel port, floppy drive); keyboard and mouse
Architecture & Microarchitecture
Architecture
• The processor features seen by the “user”
• Instruction set, addressing modes, data width, …
Micro-architecture
• The way a processor is implemented
• Cache size and structure, number of execution units, …
• Timing is considered uArch (though it is user visible)
Processors with different uArch can support the
same Architecture
Compatibility
Backward compatibility
New hardware can run existing software
• Core2 Duo can run SW written for Pentium4, PentiumM,
Pentium III, Pentium II, Pentium, 486, 386, 286
Forward compatibility
New software can run on existing hardware
• Example: new software written with SSE2™ runs on an older
processor which does not support SSE2™
• Commonly supports one or two generations behind
Architecture independent SW
• JIT – just in time compiler: Java and .NET
• Binary translation
Performance
Technology Trends and Performance
[Figure: two log-scale plots over 1980–2007]
Speed: Logic 2× in 3 years, DRAM 1.1× in 3 years. CPU speed and memory speed grow apart
Capacity: Logic 2× in 3 years, DRAM 4× in 3 years
Computing capacity: 4× per 3 years
If we could keep all the transistors busy all the time
Actual: 3.3× per 3 years
Moore’s Law: Performance is doubled every ~18 months
Trend is slowing: process scaling declines, power is up
Moore’s Law
Graph taken from: http://www.intel.com/technology/mooreslaw/index.htm
CPI – Cycles Per Instruction
CPUs work according to a clock signal
• Clock cycle is measured in nsec ($10^{-9}$ of a second)
• Clock frequency (= 1/clock cycle) is measured in GHz ($10^{9}$ cyc/sec)
Instruction Count (IC)
• Total number of instructions executed in the program
CPI – Cycles Per Instruction
• Average #cycles per instruction (in a given program)
$$\text{CPI} = \frac{\#\text{cycles required to execute the program}}{\text{IC}}$$
IPC (= 1/CPI): Instructions per cycle
CPU Time
CPU Time - time required to execute a program
CPU Time = IC × CPI × clock cycle
Our goal: minimize CPU Time
Minimize clock cycle: more GHz (process, circuit, uArch)
Minimize CPI:
uArch (e.g.: more execution units)
Minimize IC:
architecture (e.g.: SSE™)
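The CPU Time relation can be tried out numerically. The following is a minimal sketch, assuming a hypothetical program of 2 billion instructions with a CPI of 1.5 on a 2 GHz clock; the function names and the numbers are illustrative and do not come from the slides.

```python
def cpu_time_seconds(instruction_count, cpi, clock_hz):
    """CPU Time = IC x CPI x clock cycle, where clock cycle = 1 / clock frequency."""
    total_cycles = instruction_count * cpi
    return total_cycles / clock_hz


def ipc(cpi):
    """IPC is the reciprocal of CPI."""
    return 1.0 / cpi


# Hypothetical workload (assumed values, for illustration only).
baseline = cpu_time_seconds(2_000_000_000, 1.5, 2e9)

# Each of the three levers shortens CPU time when the other two are held fixed.
faster_clock = cpu_time_seconds(2_000_000_000, 1.5, 3e9)   # shorter clock cycle
lower_cpi = cpu_time_seconds(2_000_000_000, 1.0, 2e9)      # better uArch
fewer_instr = cpu_time_seconds(1_500_000_000, 1.5, 2e9)    # richer instruction set

print(baseline, faster_clock, lower_cpi, fewer_instr)
print(ipc(1.5))
```

In practice the three factors interact: for example, a richer instruction set may lower IC while raising CPI, so each change has to be judged by its effect on the product.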
Amdahl’s Law
Suppose enhancement E accelerates a fraction F of the task by a
factor S, and the remainder of the task is unaffected, then:
$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$$
$$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$
Amdahl’s Law: Example
• Floating point instructions improved to run at 2×,
but only 10% of executed instructions are FP
ExTimenew = ExTimeold × (0.9 + 0.1 / 2) = 0.95 × ExTimeold
$$\text{Speedup}_{overall} = \frac{1}{0.95} = 1.053$$
Corollary:
Make The Common Case Fast
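The floating point example can be checked with a few lines of code. This is a minimal sketch of Amdahl's Law; the function name `amdahl_speedup` is an assumption made for illustration, and the second call uses an assumed infinite speedup to show the upper bound.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of the task is accelerated by a given factor."""
    new_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / new_time


# The slide's example: 10% of executed instructions are FP, and FP runs 2x faster.
print(round(amdahl_speedup(0.10, 2.0), 3))

# Even an infinitely fast FP unit is bounded by the untouched 90% of the task.
print(round(amdahl_speedup(0.10, float("inf")), 3))
```

The bound in the second call is why the corollary holds: the fraction that is not enhanced limits the overall gain, so effort is best spent on the common case.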
Calculating the CPI of a Program
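$\text{IC}_i$: #times an instruction of type $i$ is executed in the program
IC: #instructions executed in the program:
$$\text{IC} = \sum_{i=1}^{n} \text{IC}_i$$
$F_i$: relative frequency of instructions of type $i$: $F_i = \text{IC}_i / \text{IC}$
$\text{CPI}_i$: #cycles to execute an instruction of type $i$
e.g.: $\text{CPI}_{add} = 1$, $\text{CPI}_{mul} = 3$
#cycles required to execute the program:
$$\#cyc = \sum_{i=1}^{n} \text{CPI}_i \times \text{IC}_i = \text{CPI} \times \text{IC}$$
CPI:
$$\text{CPI} = \frac{\#cyc}{\text{IC}} = \frac{\sum_{i=1}^{n} \text{CPI}_i \times \text{IC}_i}{\text{IC}} = \sum_{i=1}^{n} \text{CPI}_i \times \frac{\text{IC}_i}{\text{IC}} = \sum_{i=1}^{n} \text{CPI}_i \times F_i$$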
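The two equivalent ways of computing CPI (total cycles over IC, or the frequency-weighted sum of per-type CPIs) can be compared in code. This is a minimal sketch: the per-type CPIs for add and mul are the ones given above, while the instruction counts (800 adds, 200 muls) are assumed for illustration.

```python
def cpi_from_cycles(mix):
    """CPI = #cyc / IC, where mix maps instruction type -> (IC_i, CPI_i)."""
    total_ic = sum(ic_i for ic_i, _ in mix.values())
    total_cycles = sum(ic_i * cpi_i for ic_i, cpi_i in mix.values())
    return total_cycles / total_ic


def cpi_from_frequencies(mix):
    """CPI = sum over i of CPI_i * F_i, with F_i = IC_i / IC."""
    total_ic = sum(ic_i for ic_i, _ in mix.values())
    return sum(cpi_i * (ic_i / total_ic) for ic_i, cpi_i in mix.values())


# CPI_add = 1 and CPI_mul = 3 as in the slide; the counts are hypothetical.
mix = {"add": (800, 1), "mul": (200, 3)}

print(cpi_from_cycles(mix))
print(cpi_from_frequencies(mix))
```

Both functions agree up to floating point rounding, which mirrors the last step of the derivation.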
Evaluating Performance
Use a performance simulator to evaluate the
performance of a new feature / algorithm
• Models the uArch in great detail
• Runs hundreds of representative applications
Produce the performance S-curve
• Sort the applications according to the IPC increase
• Baseline (0) is the processor without the new feature
[Figure: two example S-curves of per-application IPC change relative to the baseline]
Bad S-curve (roughly -4% to 3%): positive outliers and negative outliers
Good S-curve (roughly -2% to 6%): positive outliers and only small negative outliers
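Building an S-curve from simulator output amounts to computing the per-application IPC change and sorting it. This is a minimal sketch, assuming the simulator results are available as two dictionaries of IPC values keyed by application name; the application names and IPC numbers below are hypothetical.

```python
def s_curve(baseline_ipc, feature_ipc):
    """Return (application, IPC change in %) pairs sorted from worst to best."""
    changes = []
    for app, base in baseline_ipc.items():
        change_pct = (feature_ipc[app] / base - 1.0) * 100.0
        changes.append((app, change_pct))
    return sorted(changes, key=lambda pair: pair[1])


# Hypothetical simulator results, for illustration only.
baseline = {"app_a": 1.00, "app_b": 2.00, "app_c": 1.00}
with_feature = {"app_a": 1.05, "app_b": 1.96, "app_c": 1.00}

for app, pct in s_curve(baseline, with_feature):
    print(f"{app}: {pct:+.1f}%")
```

Plotting the sorted values gives the S shape: negative outliers on the left, applications near the 0% baseline in the middle, and positive outliers on the right.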
Comparing Performance
Peak Performance
MIPS, MFLOPS
Often not useful: unachievable / unsustainable in practice
Benchmarks
Real applications, or representative parts of real apps
Targeted at the specific system usages
SPEC INT – integer applications
• Data compression, C compiler, Perl interpreter, database
system, chess-playing, text-processing, …
SPEC FP – floating point applications
• Mostly important scientific applications
TPC Benchmarks
• Measure transaction-processing throughput
Instruction Set Design
The ISA is what the user / compiler see
The HW implements the ISA
[Figure: the instruction set as the interface layer between software (above) and hardware (below)]
ISA Considerations
Code size
Long instructions take more time to fetch
Longer instructions require a larger memory
• Important in small devices, e.g., cell phones
Number of instructions (IC)
Reducing IC reduces execution time
• At a given CPI and frequency
Code “simplicity”
Simple HW implementation
• Higher frequency and lower power
Code optimization can better be applied to “simple code”
Architectural Consideration Example
Immediate data size
[Figure: histogram of immediate operand sizes. X-axis: immediate data bits (0 to 15); Y-axis: share of immediates (0% to 30%); series: Int. Avg. and FP Avg.]
1% of data values > 16-bits
12 – 16 bits are needed
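A measurement of this kind can be sketched as follows: for each immediate value, find the smallest field that holds it, then check what share of values fits a candidate field width. The sketch assumes signed two's-complement immediates, and the sample values are hypothetical rather than measured data.

```python
def signed_bits_needed(value):
    """Smallest two's-complement width (in bits) that can represent value."""
    if value >= 0:
        return value.bit_length() + 1      # one extra bit for the sign
    return (~value).bit_length() + 1       # e.g. -128 needs 8 bits


def share_fitting(immediates, field_bits):
    """Fraction of the given immediates that fit in a field of field_bits bits."""
    fitting = sum(1 for v in immediates if signed_bits_needed(v) <= field_bits)
    return fitting / len(immediates)


# Hypothetical sample of immediates (assumed values, not a real trace).
sample = [0, 1, -1, 4, 8, 100, -128, 4096, 32767, 70000]

print(share_fitting(sample, 8))
print(share_fitting(sample, 16))
```

An ISA designer would run this over immediates collected from real programs and choose the narrowest field that covers nearly all of them.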
CISC Processors
CISC - Complex Instruction Set Computer
The idea: a high level machine language
Example: x86
Characteristic
Many instruction types, with many addressing modes
Some of the instructions are complex
• Execute complex tasks
• Require many cycles
ALU operations directly on memory
• Only a few registers, in many cases not orthogonal
Variable length instructions
• Common instructions get short codes, which saves code length
Top 10 x86 Instructions
Rank   Instruction              % of total executed
1      load                     22%
2      conditional branch       20%
3      compare                  16%
4      store                    12%
5      add                      8%
6      and                      6%
7      sub                      5%
8      move register-register   4%
9      call                     1%
10     return                   1%
Total                           96%
Simple instructions dominate instruction frequency
CISC Drawbacks
Complex instructions and complex addressing modes
complicate the processor
slow down the simple, common instructions
contradict Make The Common Case Fast
Compilers don’t use complex instructions / indexing methods
Variable length instructions are a real pain in the neck
Difficult to decode a few instructions in parallel
• As long as an instruction is not decoded, its length is unknown
It is unknown where the instruction ends
It is unknown where the next instruction starts
An instruction may span more than a single cache line
An instruction may span more than a single page
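The decoding problem can be illustrated with a toy model. The sketch below assumes a made-up variable-length encoding in which the first byte of every instruction holds that instruction's total length in bytes; this is not the x86 encoding, only a stand-in that shows the serial dependency. With fixed-length instructions, every start address is known without decoding anything.

```python
def starts_variable_length(code):
    """Find instruction start offsets in the toy variable-length encoding.

    Each instruction must be (partially) decoded before the start of the
    next one is known, so the loop is inherently sequential.
    """
    starts = []
    pc = 0
    while pc < len(code):
        length = code[pc]              # toy rule: first byte = instruction length
        if length == 0:
            raise ValueError(f"invalid length 0 at offset {pc}")
        starts.append(pc)
        pc += length
    return starts


def starts_fixed_length(code_size, width=4):
    """With fixed-length instructions, all start offsets are known up front."""
    return list(range(0, code_size, width))


# Hypothetical byte stream: instructions of length 1, 3, 2 and 4.
stream = bytes([1, 3, 0, 0, 2, 0, 4, 0, 0, 0])

print(starts_variable_length(stream))
print(starts_fixed_length(12))
```

A parallel decoder for the variable-length case has to guess or precompute instruction boundaries, which is the hardware cost the slide refers to.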
RISC Processors
RISC - Reduced Instruction Set Computer
The idea: simple instructions enable fast hardware
Characteristic
A small instruction set, with only a few instruction formats
Simple instructions
• execute simple tasks
• Most of them require a single cycle (with pipeline)
A few indexing methods
ALU operations on registers only
• Memory is accessed using Load and Store instructions only
• Many orthogonal registers
• Three address machine:
Add dst, src1, src2
Fixed length instructions
Examples: MIPS™, Sparc™, Alpha™, Power™
RISC Processors (Cont.)
Simple architecture ⇒ Simple micro-architecture
• Simple, small and fast control logic
• Simpler to design and validate
• Room for large on die caches
• Shorten time-to-market
Using a smart compiler
• Better pipeline usage
• Better register allocation
Existing RISC processors are not “pure” RISC
• e.g., support division which takes many cycles
Compilers and ISA
Ease of compilation
Orthogonality:
• no special registers
• few special cases
• all operand modes available with any data type or instruction
type
Regularity:
• no overloading for the meanings of instruction fields
• streamlined
• resource needs easily determined
Register Assignment is critical too
Easier if lots of registers
CISC Is Dominant
The x86 architecture, which is a CISC
architecture, dominates the processor market
A vast amount of existing software
Intel, AMD, Microsoft and others benefit from this
• Intel and AMD invest heavily in making high performance
x86 processors, despite the architectural disadvantage
• Current x86 processors give the best cost/performance
CISC processors use arch ideas from the RISC world
Starting at Pentium II and K6, x86 processors translate
CISC instructions into RISC-like operations internally
• the inside core looks much like that of a RISC processor
Software Specific Extensions
Extend arch to accelerate exec of specific apps
Example: SSE™ – Streaming SIMD Extensions
128-bit packed (vector) / scalar single precision FP (4×32)
Introduced on Pentium® III in ’99
8 new 128 bit registers (XMM0 – XMM7)
Accelerates graphics, video, scientific calculations, …
[Figure: packed vs. scalar add on 128-bit registers holding four 32-bit elements]
Packed: (x3, x2, x1, x0) + (y3, y2, y1, y0) = (x3+y3, x2+y2, x1+y1, x0+y0)
Scalar: (x3, x2, x1, x0) + (y3, y2, y1, y0) = (y3, y2, y1, x0+y0)
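The difference between the packed and scalar forms can be modeled in a few lines. This is a software sketch only, not SSE code: a 128-bit register is modeled as a Python list of four numbers with index 0 as the lowest element, and the function names are assumptions. Following the figure, the scalar form adds only the lowest elements and passes the upper three elements through from y.

```python
def add_packed(x, y):
    """Packed add: all four element pairs are added in one operation."""
    return [a + b for a, b in zip(x, y)]


def add_scalar(x, y):
    """Scalar add: only element 0 is added; elements 1..3 are copied from y,
    as drawn in the slide's figure."""
    return [x[0] + y[0]] + list(y[1:])


# Registers modeled as [element0, element1, element2, element3].
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]

print(add_packed(x, y))
print(add_scalar(x, y))
```

The packed form does four additions per instruction, which is where the reduction in instruction count for graphics, video and scientific code comes from.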