MAMAS – Computer Structure
234267
Lecturers:
Lihu Rappoport
Adi Yoaz
Some of the slides were taken from Avi Mendelson, Randy Katz, Patterson, Gabriel Loh
Computer Structure 2012 – Introduction
General Course Information
Grade
20% Exercise (mandatory, binding)
80% Final exam
No midterm exam
Course web site
http://webcourse.cs.technion.ac.il/234267
Foils will be on the web several days before the class
Class Focus
CPU
Introduction: performance, instruction set (RISC vs. CISC)
Pipeline, hazards
Branch prediction
Out-of-order execution
Memory Hierarchy
Cache
Main memory
Virtual Memory
Advanced Topics
PC Architecture
Motherboard & chipset, DRAM, I/O, Disk, peripherals
Computer System – Sandy Bridge
[Figure: block diagram of a Sandy Bridge based system. Labels on the processor: two cores, GFX, cache, system agent, and a memory controller with a memory bus to two DDR3 channels (Channel 1, Channel 2). Other labels: external graphics card on PCI Express ×16; display link; South Bridge (PCH) with HDMI, PCI Express ×1, PCI (sound card with speakers, LAN adapter to the LAN), a USB controller, two SATA controllers (hard disk, DVD drive), and an IO controller (serial port, parallel port, floppy drive, keyboard, mouse).]
Architecture & Microarchitecture
Architecture
The processor features seen by the “user”
Instruction set, addressing modes, data width, …
Micro-architecture
The way a processor is implemented
Cache sizes and structure, number of execution units, …
Timing is considered uArch (though it is user visible)
Processors with different uArch can support the same Architecture
Compatibility
Backward compatibility
New hardware can run existing software
• Core2 Duo can run SW written for Pentium 4, Pentium M, Pentium III, Pentium II, Pentium, 486, 386, 286
Forward compatibility
New software can run on existing hardware
Example: new software written with SSE2™ runs on an older processor that does not support SSE2™
Commonly supports one or two generations behind
Architecture independent SW
JIT – just in time compiler: Java and .NET
Binary translation
Moore’s Law
The number of transistors doubles every ~2 years
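As a rough illustration of what doubling every ~2 years implies, the sketch below projects a transistor count forward in time. The function name, the exact 2-year period, and the starting count are assumptions made for illustration; they are not data from the slides.

```python
def projected_transistors(initial_count: float, years: float,
                          doubling_period: float = 2.0) -> float:
    """Project a transistor count, assuming it doubles every `doubling_period` years."""
    return initial_count * 2 ** (years / doubling_period)


# Hypothetical starting point: 1 million transistors.
# Ten years is five doublings, so the count grows by a factor of 32.
print(projected_transistors(1e6, 10))  # 32000000.0
```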
CPI – Cycles Per Instruction
CPUs work according to a clock signal
Clock cycle is measured in nsec ($10^{-9}$ of a second)
Clock frequency (= 1/clock cycle) is measured in GHz ($10^{9}$ cyc/sec)
Instruction Count (IC)
Total number of instructions executed in the program
CPI – Cycles Per Instruction
Average #cycles per instruction (in a given program)
$CPI = \dfrac{\#\text{cycles required to execute the program}}{IC}$
IPC (= 1/CPI): Instructions per cycle
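The definitions above can be sketched in a few lines. The function names and the sample numbers are illustrative assumptions, not values from the course.

```python
def cpi(total_cycles: int, instruction_count: int) -> float:
    """Average cycles per instruction for one program run."""
    return total_cycles / instruction_count


def ipc(total_cycles: int, instruction_count: int) -> float:
    """Instructions per cycle, the reciprocal of CPI."""
    return instruction_count / total_cycles


def clock_cycle_ns(frequency_ghz: float) -> float:
    """Clock cycle in nanoseconds for a clock frequency given in GHz."""
    return 1.0 / frequency_ghz


# Hypothetical run: 3,000,000 cycles to execute 2,000,000 instructions.
print(cpi(3_000_000, 2_000_000))  # 1.5
print(clock_cycle_ns(2.0))        # 0.5 (nsec per cycle at 2 GHz)
```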
Calculating the CPI of a Program
$IC_i$: #times an instruction of type $i$ is executed in the program
$IC$: #instructions executed in the program:
$IC = \sum_{i=1}^{n} IC_i$
$F_i$: relative frequency of instructions of type $i$: $F_i = IC_i / IC$
$CPI_i$: #cycles to execute an instruction of type $i$
e.g.: $CPI_{add} = 1$, $CPI_{mul} = 3$
#cycles required to execute the entire program:
$\#cyc = \sum_{i=1}^{n} CPI_i \times IC_i = CPI \times IC$
CPI:
$CPI = \dfrac{\#cyc}{IC} = \dfrac{\sum_{i=1}^{n} CPI_i \times IC_i}{IC} = \sum_{i=1}^{n} CPI_i \times \dfrac{IC_i}{IC} = \sum_{i=1}^{n} CPI_i \times F_i$
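A minimal sketch of the weighted-average calculation, using the per-type cycle counts from the example (add = 1, mul = 3). The instruction mix and the function name are hypothetical.

```python
def program_cpi(counts: dict[str, int], cpi_per_type: dict[str, float]) -> float:
    """CPI as the sum over instruction types of CPI_i * F_i, with F_i = IC_i / IC."""
    ic = sum(counts.values())  # IC is the sum of all IC_i
    return sum(cpi_per_type[kind] * (n / ic) for kind, n in counts.items())


# Hypothetical mix: 800 adds (CPI_add = 1) and 200 multiplies (CPI_mul = 3).
counts = {"add": 800, "mul": 200}
cpi_per_type = {"add": 1, "mul": 3}
print(program_cpi(counts, cpi_per_type))  # about 1.4
```

The same number falls out of dividing total cycles by IC: (800 × 1 + 200 × 3) / 1000 = 1.4.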
CPU Time
CPU Time - time required to execute a program
$\text{CPU Time} = IC \times CPI \times \text{clock cycle}$
Our goal: minimize CPU Time
Minimize clock cycle: more GHz (process, circuit, uArch)
Minimize CPI: uArch (e.g.: more execution units)
Minimize IC: architecture (e.g.: SSE™)
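The CPU Time product can be evaluated directly. The sketch below assumes the clock is given as a frequency in GHz; the program parameters are made up for illustration.

```python
def cpu_time_seconds(ic: int, cpi: float, frequency_ghz: float) -> float:
    """CPU Time = IC * CPI * clock cycle, where clock cycle = 1 / frequency."""
    clock_cycle_s = 1.0 / (frequency_ghz * 1e9)
    return ic * cpi * clock_cycle_s


# Hypothetical program: 2 billion instructions, CPI of 1.5, 3 GHz clock.
base = cpu_time_seconds(2_000_000_000, 1.5, 3.0)
# Halving the CPI (for example through uArch changes) halves the CPU Time.
better = cpu_time_seconds(2_000_000_000, 0.75, 3.0)
print(base, better)  # about 1.0 and 0.5 seconds
```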
Amdahl’s Law
Suppose enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. With $F = Fraction_{enhanced}$ and $S = Speedup_{enhanced}$:
$t'_{exe} = t_{exe} \times \left[ (1 - Fraction_{enhanced}) + \dfrac{Fraction_{enhanced}}{Speedup_{enhanced}} \right]$
$Speedup_{overall} = \dfrac{t_{exe}}{t'_{exe}} = \dfrac{1}{(1 - Fraction_{enhanced}) + \dfrac{Fraction_{enhanced}}{Speedup_{enhanced}}}$
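A small sketch of the overall-speedup formula as a function. The name `amdahl_speedup`, the input checks, and the sample inputs are assumptions for illustration.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when `fraction_enhanced` of the task runs `speedup_enhanced` times faster."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Hypothetical case: half of the task is made twice as fast.
print(round(amdahl_speedup(0.5, 2.0), 3))  # 1.333
```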
Amdahl’s Law: Example
• Floating point instructions improved to run at 2×, but only 10% of executed instructions are FP
$t'_{exe} = t_{exe} \times (0.9 + 0.1 / 2) = 0.95 \times t_{exe}$
$Speedup_{overall} = \dfrac{1}{0.95} = 1.053$
Corollary: Make The Common Case Fast
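To see why the corollary follows, the sketch below compares the FP example with two other cases: making the 10% FP part infinitely fast, and applying the same 2× improvement to a hypothetical 90% fraction. The function names and the 90% case are illustrative assumptions.

```python
def overall_speedup(fraction: float, speedup: float) -> float:
    """Amdahl's Law: overall speedup for an enhanced fraction and its speedup factor."""
    return 1.0 / ((1.0 - fraction) + fraction / speedup)


def max_speedup(fraction: float) -> float:
    """Limit of the overall speedup as the enhanced part becomes infinitely fast."""
    return 1.0 / (1.0 - fraction)


print(round(overall_speedup(0.1, 2), 3))  # 1.053, the FP example
print(round(max_speedup(0.1), 3))         # 1.111, the best case for a 10% fraction
print(round(overall_speedup(0.9, 2), 3))  # 1.818, same 2x applied to a 90% fraction
```

Even an unbounded speedup of the rare case gains about 11%, while a modest speedup of the common case gains far more.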
Comparing Performance
Peak Performance
MIPS, MFLOPS
Often not useful: unachievable / unsustainable in practice
Benchmarks
Real applications, or representative parts of real apps
Targeted at the specific system usages
SPEC INT – integer applications
• Data compression, C compiler, Perl interpreter, database system, chess-playing, text-processing, …
SPEC FP – floating point applications
• Mostly important scientific applications
TPC Benchmarks
• Measure transaction-processing throughput
Evaluating Performance of future CPUs
Use a performance simulator to evaluate the performance of a new feature / algorithm
Models the uArch in great detail
Run hundreds of representative applications
Produce the performance S-curve
Sort the applications according to the IPC increase
Baseline (0) is the processor without the new feature
[Figure: two example S-curves plotting the per-application IPC change in percent. Good S-curve: positive outliers and only small negative outliers. Bad S-curve: positive outliers and negative outliers.]
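The S-curve construction (compute each application's IPC change against the baseline, then sort) can be sketched as follows. The application names and IPC values are invented placeholders, not simulator output.

```python
def s_curve(baseline_ipc: dict[str, float],
            feature_ipc: dict[str, float]) -> list[tuple[str, float]]:
    """Return (application, IPC change in %) pairs sorted from worst to best."""
    changes = [
        (app, 100.0 * (feature_ipc[app] - base) / base)
        for app, base in baseline_ipc.items()
    ]
    return sorted(changes, key=lambda pair: pair[1])


# Hypothetical results: IPC without and with the new feature.
baseline = {"app_a": 1.00, "app_b": 2.00, "app_c": 0.50, "app_d": 1.25}
with_feature = {"app_a": 1.02, "app_b": 1.98, "app_c": 0.53, "app_d": 1.25}

for app, change in s_curve(baseline, with_feature):
    print(f"{app}: {change:+.1f}%")
```

Plotting the sorted changes gives the S shape: negative outliers on the left, positive outliers on the right.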
Instruction Set Design
The ISA is what the user / compiler sees
The HW implements the ISA
[Figure: the instruction set drawn as the interface layer between software (above) and hardware (below)]
ISA Considerations
Reduce the IC to reduce execution time
E.g., a single vector instruction performs the work of multiple scalar instructions
Simple instructions → simpler HW implementation
Higher frequency, lower power, lower cost
Code size
Long instructions take more time to fetch
Longer instructions require a larger memory
• Important in small devices, e.g., cell phones
Architectural Consideration Example
Immediate data size
[Figure: chart with x-axis "Immediate data bits" (0 to 15), y-axis from 0% to 30%, and two series: Int. Avg. and FP Avg.]
1% of data values > 16 bits
12 – 16 bits are needed
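To make the field-width trade-off concrete, the sketch below computes how many bits a value needs and whether it fits in an immediate field of a given width. It assumes two's-complement signed immediates, which the slide does not state; the function names are hypothetical.

```python
def signed_bits_needed(value: int) -> int:
    """Minimum width, in bits, of a two's-complement field that can hold `value`."""
    if value >= 0:
        return value.bit_length() + 1  # one extra bit for the sign
    return (~value).bit_length() + 1


def fits_in_immediate(value: int, bits: int) -> bool:
    """True if `value` can be encoded in a signed immediate field of `bits` bits."""
    return signed_bits_needed(value) <= bits


print(signed_bits_needed(127))       # 8
print(fits_in_immediate(1000, 12))   # True
print(fits_in_immediate(40000, 16))  # False
```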
CISC Processors
CISC – Complex Instruction Set Computer
The idea: a high level machine language
Example: x86
Characteristics
Many instruction types, with many addressing modes
Some of the instructions are complex
• Execute complex tasks
• Require many cycles
ALU operations directly on memory
• Only a few registers, in many cases not orthogonal
Variable length instructions
• Common instructions get short codes → save code length
Top 10 x86 Instructions
Rank   Instruction              % of total executed
1      load                     22%
2      conditional branch       20%
3      compare                  16%
4      store                    12%
5      add                      8%
6      and                      6%
7      sub                      5%
8      move register-register   4%
9      call                     1%
10     return                   1%
Total                           96%
Simple instructions dominate instruction frequency
CISC Drawbacks
Complex instructions and complex addressing modes
complicate the processor
slow down the simple, common instructions
contradict Make The Common Case Fast
Not compiler friendly
Non-orthogonal registers
Unused complex addressing modes
Variable length instructions are a pain
Difficult to decode several instructions in parallel
• As long as an instruction is not decoded, its length is unknown
• It is unknown where the instruction ends and where the next instruction starts
An instruction may cross a cache line or a page
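The decoding problem can be illustrated with a toy encoding in which the first byte of each instruction holds its total length. This encoding is a made-up simplification (real x86 length decoding is far more involved); it only shows that each instruction's start depends on decoding the previous one, whereas fixed-length starts are known up front.

```python
def variable_length_starts(code: bytes) -> list[int]:
    """Start offsets when each instruction's first byte gives its length (toy encoding)."""
    starts = []
    offset = 0
    while offset < len(code):
        starts.append(offset)
        length = code[offset]  # known only after inspecting this instruction
        if length == 0:
            raise ValueError("zero-length instruction")
        offset += length       # the next start depends on this length
    return starts


def fixed_length_starts(code: bytes, width: int = 4) -> list[int]:
    """Start offsets for fixed-width instructions: all computable independently."""
    return list(range(0, len(code), width))


toy_code = bytes([2, 0xAA, 3, 0xBB, 0xCC, 1, 4, 0xDD, 0xEE, 0xFF])
print(variable_length_starts(toy_code))  # [0, 2, 5, 6]
print(fixed_length_starts(bytes(12)))    # [0, 4, 8]
```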
RISC Processors
RISC - Reduced Instruction Set Computer
The idea: simple instructions enable fast hardware
Characteristics
A small instruction set, with few instruction formats
Simple instructions that execute simple tasks
• Most of them require a single cycle (with pipeline)
Few addressing modes
ALU operations on registers only
• Memory is accessed using Load and Store instructions only
Many orthogonal registers
Three address machine:
Add dst, src1, src2
Fixed length instructions
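A minimal sketch of a load/store, three-address machine: memory is touched only by `load` and `store`, and `add` works on registers only. The tuple-based instruction format and the register count are assumptions for illustration, not a real ISA.

```python
def run(program: list[tuple], memory: list[int], num_registers: int = 8) -> list[int]:
    """Execute a toy load/store program and return the final register values."""
    regs = [0] * num_registers
    for op, *args in program:
        if op == "load":       # load dst, addr
            dst, addr = args
            regs[dst] = memory[addr]
        elif op == "store":    # store src, addr
            src, addr = args
            memory[addr] = regs[src]
        elif op == "add":      # add dst, src1, src2 (registers only)
            dst, src1, src2 = args
            regs[dst] = regs[src1] + regs[src2]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs


# memory[2] = memory[0] + memory[1], written with explicit loads and a store.
memory = [5, 7, 0]
program = [
    ("load", 1, 0),
    ("load", 2, 1),
    ("add", 3, 1, 2),
    ("store", 3, 2),
]
run(program, memory)
print(memory)  # [5, 7, 12]
```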
RISC Processors (Cont.)
Simple architecture → Simple micro-architecture
Simple, small and fast control logic
Simpler to design and validate
Leave space for large on-die caches
Shorten time-to-market
Using a smart compiler
Better pipeline usage
Better register allocation
Existing RISC processors are not “pure” RISC
e.g., support division which takes many cycles
Examples: MIPS™, Sparc™, Alpha™, Power™
Compilers and ISA
Ease of compilation
Orthogonality:
• no special registers
• few special cases
• all operand modes available with any data type or instruction type
Regularity:
• no overloading for the meanings of instruction fields
Streamlined:
• resource needs easily determined
Register assignment is critical too
Easier if lots of registers
CISC Is Dominant
The x86 architecture, which is a CISC
architecture, dominates the processor market
A vast amount of existing software
Intel, AMD, Microsoft and others benefit from this
• Intel and AMD invest heavily in making high performance x86 processors, despite the architectural disadvantage
• Current x86 processors give the best cost/performance
CISC processors use arch ideas from the RISC world
Starting at Pentium II and K6, x86 processors translate
CISC instructions into RISC-like operations internally
• the inside core looks much like that of a RISC processor
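The idea of translating a CISC instruction into RISC-like operations can be sketched with a toy two-operand `add` that allows a memory operand. The operand syntax, the `tmp` register, and the resulting operation tuples are invented for illustration and do not reflect the actual internal formats used by Intel or AMD.

```python
def is_memory(operand: str) -> bool:
    """Treat operands written like '[0x10]' as memory references (toy syntax)."""
    return operand.startswith("[")


def crack_add(dst: str, src: str) -> list[tuple]:
    """Translate a toy 'add dst, src' into load / add / store steps."""
    if is_memory(dst) and is_memory(src):
        raise ValueError("at most one memory operand is supported in this sketch")
    uops = []
    if is_memory(src):
        uops.append(("load", "tmp", src))        # bring the source into a register
        src = "tmp"
    if is_memory(dst):
        uops.append(("load", "tmp", dst))        # read-modify-write on memory
        uops.append(("add", "tmp", "tmp", src))
        uops.append(("store", "tmp", dst))
    else:
        uops.append(("add", dst, dst, src))      # register-only three-address add
    return uops


print(crack_add("[0x10]", "r1"))
# [('load', 'tmp', '[0x10]'), ('add', 'tmp', 'tmp', 'r1'), ('store', 'tmp', '[0x10]')]
print(crack_add("r2", "r1"))
# [('add', 'r2', 'r2', 'r1')]
```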
Software Specific Extensions
Extend arch to accelerate exec of specific apps
Example: SSE™ – Streaming SIMD Extensions
128-bit packed (vector) / scalar single precision FP (4×32)
Introduced with the Pentium® III in 1999
8 new 128 bit registers (XMM0 – XMM7)
Accelerates graphics, video, scientific calculations, …
[Figure: packed vs. scalar addition on 128-bit registers. Packed: (x3, x2, x1, x0) + (y3, y2, y1, y0) = (x3+y3, x2+y2, x1+y1, x0+y0). Scalar: only the low elements are added, giving (y3, y2, y1, x0+y0).]
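The difference between the packed and scalar forms can be modeled on plain Python lists of four lanes, with index 0 standing for the low element (x0, y0). This is a behavioral sketch only, not SSE code; the function names are hypothetical, and the scalar case passes the upper lanes through from y, matching the figure.

```python
def packed_add(x: list[float], y: list[float]) -> list[float]:
    """Packed form: all four lanes are added in one operation."""
    return [a + b for a, b in zip(x, y)]


def scalar_add(x: list[float], y: list[float]) -> list[float]:
    """Scalar form: only lane 0 is added; lanes 1..3 are taken unchanged from y."""
    return [x[0] + y[0]] + y[1:]


x = [1.0, 2.0, 3.0, 4.0]      # lanes x0..x3
y = [10.0, 20.0, 30.0, 40.0]  # lanes y0..y3
print(packed_add(x, y))  # [11.0, 22.0, 33.0, 44.0]
print(scalar_add(x, y))  # [11.0, 20.0, 30.0, 40.0]
```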