MAMAS – Computer Structure
234267
Lecturers:
Lihu Rappoport
Adi Yoaz
Some of the slides were taken from Avi Mendelson, Randi Katz, Patterson, Gabriel Loh
Computer Structure 2012 – Introduction
General Course Information
Grade
  20% Exercise (mandatory, binding)
  80% Final exam
  No midterm exam
Course web site
  http://webcourse.cs.technion.ac.il/234267
  Foils will be on the web several days before the class
Class Focus
CPU
  Introduction: performance, instruction set (RISC vs. CISC)
  Pipeline, hazards
  Branch prediction
  Out-of-order execution
Memory Hierarchy
  Cache
  Main memory
  Virtual Memory
Advanced Topics
PC Architecture
  Motherboard & chipset, DRAM, I/O, Disk, peripherals
Computer System – Sandy Bridge
[Figure: block diagram of a Sandy Bridge based system. Labels recoverable from the diagram: cores, GFX, cache, System Agent, memory controller, memory bus with two DDRIII channels, PCI express ×16 to an external graphics card, display link, South Bridge (PCH), HDMI, PCI express ×1, USB controller, two SATA controllers (hard disk, DVD drive), PCI (sound card with speakers, LAN adapter to the LAN), IO controller (serial port, parallel port, floppy drive), keyboard, mouse.]
Architecture & Microarchitecture
Architecture
  The processor features seen by the “user”
  Instruction set, addressing modes, data width, …
Micro-architecture
  The way a processor is implemented
  Cache sizes and structure, number of execution units, …
  Timing is considered uArch (though it is user visible)
Processors with different uArch can support the same Architecture
Compatibility
Backward compatibility
  New hardware can run existing software
  • Core2 Duo can run SW written for Pentium4, PentiumM, Pentium III, Pentium II, Pentium, 486, 386, 286
Forward compatibility
  New software can run on existing hardware
  Example: new software written with SSE2™ runs on an older processor which does not support SSE2™
  Commonly supports one or two generations behind
Architecture independent SW
  JIT – just in time compiler: Java and .NET
  Binary translation
Moore’s Law
The number of transistors doubles every ~2 years
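The doubling rule can be turned into a rough projection. The sketch below assumes an exact 2-year doubling period (the slide says "~2 years"); the function name and the starting count are made up for illustration.

```python
def projected_transistors(initial_count, years, doubling_period_years=2.0):
    """Project a transistor count assuming it doubles every doubling_period_years."""
    return initial_count * 2 ** (years / doubling_period_years)

# Hypothetical starting point: 1 million transistors, projected 10 years ahead.
# 10 years / 2 years per doubling = 5 doublings, i.e. a factor of 32.
print(projected_transistors(1_000_000, 10))
```

Because the growth is exponential, a small change in the assumed doubling period changes a long-range projection a lot, so the result is an order-of-magnitude estimate at best.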
CPI – Cycles Per Instruction
CPUs work according to a clock signal
  Clock cycle is measured in nsec ($10^{-9}$ of a second)
  Clock frequency (= 1/clock cycle) is measured in GHz ($10^{9}$ cyc/sec)
Instruction Count (IC)
  Total number of instructions executed in the program
CPI – Cycles Per Instruction
  Average #cycles per instruction (in a given program)
  $CPI = \dfrac{\text{\#cycles required to execute the program}}{IC}$
  IPC (= 1/CPI): instructions per cycle
Calculating the CPI of a Program
$IC_i$: #times an instruction of type $i$ is executed in the program
$IC$: #instructions executed in the program:
$$IC = \sum_{i=1}^{n} IC_i$$
$F_i$: relative frequency of instructions of type $i$: $F_i = IC_i / IC$
$CPI_i$: #cycles to execute an instruction of type $i$
  e.g.: $CPI_{add} = 1$, $CPI_{mul} = 3$
#cycles required to execute the entire program:
$$\#cyc = \sum_{i=1}^{n} CPI_i \times IC_i = CPI \times IC$$
CPI:
$$CPI = \frac{\#cyc}{IC} = \frac{\sum_{i=1}^{n} CPI_i \times IC_i}{IC} = \sum_{i=1}^{n} CPI_i \times \frac{IC_i}{IC} = \sum_{i=1}^{n} CPI_i \times F_i$$
CPU Time
CPU Time - time required to execute a program
CPU Time = IC × CPI × clock cycle
Our goal: minimize CPU Time
  Minimize clock cycle: more GHz (process, circuit, uArch)
  Minimize CPI: uArch (e.g.: more execution units)
  Minimize IC: architecture (e.g.: SSE™)
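The three factors can be compared with a short calculation. This is a minimal sketch; the function name and all numbers (instruction count, CPI, frequency) are assumptions chosen to give round results.

```python
def cpu_time_seconds(ic, cpi, clock_hz):
    """CPU Time = IC x CPI x clock cycle, where clock cycle = 1 / clock frequency."""
    clock_cycle = 1.0 / clock_hz
    return ic * cpi * clock_cycle

# Hypothetical baseline: 2e9 instructions, CPI of 1.5, 3 GHz clock.
base = cpu_time_seconds(2e9, 1.5, 3e9)
# Three independent improvements, each attacking one factor:
faster_clock = cpu_time_seconds(2e9, 1.5, 4e9)   # shorter clock cycle
lower_cpi = cpu_time_seconds(2e9, 1.0, 3e9)      # better uArch
fewer_instr = cpu_time_seconds(1.5e9, 1.5, 3e9)  # richer instruction set
print(base, faster_clock, lower_cpi, fewer_instr)
```

In practice the factors are not independent: for example, a richer instruction set may lower IC while raising CPI or lengthening the clock cycle, so only the product matters.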
Amdahl’s Law
Suppose enhancement E accelerates a fraction F of the task by a
factor S, and the remainder of the task is unaffected, then:
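With $Fraction_{enhanced} = F$ and $Speedup_{enhanced} = S$:
$$t'_{exe} = t_{exe} \times \left[ (1 - Fraction_{enhanced}) + \frac{Fraction_{enhanced}}{Speedup_{enhanced}} \right]$$
$$Speedup_{overall} = \frac{t_{exe}}{t'_{exe}} = \frac{1}{(1 - Fraction_{enhanced}) + \dfrac{Fraction_{enhanced}}{Speedup_{enhanced}}}$$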
Amdahl’s Law: Example
• Floating point instructions improved to run at 2×,
but only 10% of executed instructions are FP
$$t'_{exe} = t_{exe} \times (0.9 + 0.1 / 2) = 0.95 \times t_{exe}$$
$$Speedup_{overall} = \frac{1}{0.95} = 1.053$$
Corollary: Make The Common Case Fast
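The example can be reproduced with a few lines of code. The sketch below assumes the 10% figure is the fraction of execution time spent in FP instructions, which is what Amdahl's Law requires; the function name is made up.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of the execution time is accelerated."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# The slide's example: 10% of the time is FP, and FP becomes 2x faster.
print(round(amdahl_speedup(0.10, 2.0), 3))           # 1.053
# Upper bound: even an infinitely fast FP unit leaves the other 90% untouched.
print(round(amdahl_speedup(0.10, float("inf")), 3))  # 1.111
```

The second call shows why the corollary holds: the speedup is capped at 1 / (1 - fraction), so the enhanced fraction matters more than the local speedup.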
Comparing Performance
Peak Performance
  MIPS, MFLOPS
  Often not useful: unachievable / unsustainable in practice
Benchmarks
  Real applications, or representative parts of real apps
  Targeted at the specific system usages
  SPEC INT – integer applications
  • Data compression, C compiler, Perl interpreter, database system, chess-playing, text-processing, …
  SPEC FP – floating point applications
  • Mostly important scientific applications
  TPC Benchmarks
  • Measure transaction-processing throughput
Evaluating Performance of future CPUs
Use a performance simulator to evaluate the performance of a new feature / algorithm
  Models the uArch in great detail
  Run 100s of representative applications
Produce the performance S-curve
  Sort the applications according to the IPC increase
  Baseline (0) is the processor without the new feature
[Figure: two example S-curves of per-application IPC change relative to the baseline (0%). Good S-curve: positive outliers and only small negative outliers. Bad S-curve: positive outliers and negative outliers.]
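Building the S-curve amounts to computing the relative IPC change per application and sorting. The sketch below assumes the simulator output is available as two dictionaries of IPC values; the application names and numbers are invented.

```python
def s_curve(baseline_ipc, feature_ipc):
    """Return [(app, percent IPC change)] sorted from worst loss to best gain."""
    changes = []
    for app, base in baseline_ipc.items():
        pct = (feature_ipc[app] - base) / base * 100.0
        changes.append((app, pct))
    return sorted(changes, key=lambda item: item[1])

# Hypothetical simulator results (IPC without and with the new feature).
baseline = {"app_a": 1.00, "app_b": 2.00, "app_c": 0.50, "app_d": 1.25}
feature = {"app_a": 1.02, "app_b": 1.98, "app_c": 0.53, "app_d": 1.25}

curve = s_curve(baseline, feature)
for app, pct in curve:
    print(f"{app}: {pct:+.1f}%")
```

Plotting the sorted values left to right gives the S shape: the negative outliers sit at the left end and the positive outliers at the right end.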
Instruction Set Design
[Figure: layered diagram with software on top, the instruction set in the middle, and hardware at the bottom]
The ISA is what the user / compiler sees
The HW implements the ISA
ISA Considerations
Reduce the IC to reduce execution time
  E.g., a single vector instruction performs the work of multiple scalar instructions
Simple instructions ⇒ simpler HW implementation
  Higher frequency, lower power, lower cost
Code size
  Long instructions take more time to fetch
  Longer instructions require a larger memory
  • Important in small devices, e.g., cell phones
Architectural Consideration Example
Immediate data size
[Figure: distribution of immediate data sizes. x-axis: immediate data bits (0 to 15); y-axis: percentage (0% to 30%); series: Int. Avg. and FP Avg.]
1% of data values > 16 bits
12 – 16 bits are needed
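A measurement like this can be approximated by counting how many bits each immediate needs. The sketch below assumes immediates are two's-complement signed values (the slide does not say how its bits were counted), and the sample list is invented rather than taken from real program traces.

```python
def signed_bits_needed(value):
    """Smallest two's-complement width, in bits, that can represent value."""
    if value >= 0:
        return value.bit_length() + 1     # one extra bit for the sign
    return (-value - 1).bit_length() + 1

def coverage(immediates, width):
    """Fraction of the immediates that fit in an immediate field of width bits."""
    fits = sum(1 for v in immediates if signed_bits_needed(v) <= width)
    return fits / len(immediates)

# Hypothetical immediates collected from a program.
samples = [0, 1, -1, 4, 100, -128, 4095, 32767, 70000, 8]
for width in (8, 12, 16):
    print(width, coverage(samples, width))
```

An ISA designer would pick the smallest field width whose coverage is high enough; the rare larger constants can then be built with extra instructions.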
CISC Processors
CISC – Complex Instruction Set Computer
  The idea: a high level machine language
  Example: x86
Characteristics
  Many instruction types, with many addressing modes
  Some of the instructions are complex
  • Execute complex tasks
  • Require many cycles
  ALU operations directly on memory
  • Only a few registers, in many cases not orthogonal
  Variable length instructions
  • Common instructions get short codes ⇒ save code length
Top 10 x86 Instructions
Rank   Instruction               % of total executed
1      load                      22%
2      conditional branch        20%
3      compare                   16%
4      store                     12%
5      add                       8%
6      and                       6%
7      sub                       5%
8      move register-register    4%
9      call                      1%
10     return                    1%
       Total                     96%
Simple instructions dominate instruction frequency
CISC Drawbacks
Complex instructions and complex addressing modes
  ⇒ complicate the processor
  ⇒ slow down the simple, common instructions
  ⇒ contradict Make The Common Case Fast
Not compiler friendly
  Non orthogonal registers
  Unused complex addressing modes
Variable length instructions are a pain
  Difficult to decode several instructions in parallel
  • As long as an instruction is not decoded, its length is unknown
    ⇒ Unknown where the inst. ends, and where the next inst. starts
  An instruction may cross a cache line or a page
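The decoding problem can be illustrated with a toy encoding. In the sketch below the first byte of each instruction holds its total length in bytes; this encoding is hypothetical and far simpler than real x86, where the length depends on prefixes, opcode and addressing bytes.

```python
def boundaries_variable(code):
    """Find instruction start offsets in the toy variable-length encoding."""
    starts, pc = [], 0
    while pc < len(code):
        starts.append(pc)
        length = code[pc]   # must inspect this instruction to locate the next one
        if length == 0:
            raise ValueError("invalid length byte")
        pc += length
    return starts

def boundaries_fixed(code, width=4):
    """With fixed-length instructions every start offset is known up front."""
    return list(range(0, len(code), width))

variable_code = bytes([1, 3, 0xAA, 0xBB, 2, 0xCC, 1])
fixed_code = bytes(12)  # three 4-byte instructions
print(boundaries_variable(variable_code))
print(boundaries_fixed(fixed_code))
```

The loop in `boundaries_variable` is inherently serial, since each start offset depends on the previous instruction's length, while the fixed-length offsets are independent of the instruction contents and can all be decoded in parallel.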
RISC Processors
RISC - Reduced Instruction Set Computer
  The idea: simple instructions enable fast hardware
Characteristics
  A small instruction set, with few instruction formats
  Simple instructions that execute simple tasks
  • Most of them require a single cycle (with pipeline)
  A few indexing methods
  ALU operations on registers only
  • Memory is accessed using Load and Store instructions only
  Many orthogonal registers
  Three address machine: Add dst, src1, src2
  Fixed length instructions
RISC Processors (Cont.)
Simple architecture ⇒ Simple micro-architecture
  Simple, small and fast control logic
  Simpler to design and validate
  Leave space for large on die caches
  Shorten time-to-market
Using a smart compiler
  Better pipeline usage
  Better register allocation
Existing RISC processors are not “pure” RISC
  e.g., support division which takes many cycles
  Examples: MIPS™, Sparc™, Alpha™, Power™
Compilers and ISA
Ease of compilation
  Orthogonality:
  • no special registers
  • few special cases
  • all operand modes available with any data type or instruction type
  Regularity:
  • no overloading for the meanings of instruction fields
  Streamlined
  • resource needs easily determined
Register Assignment is critical too
  Easier if lots of registers
CISC Is Dominant
The x86 architecture, which is a CISC architecture, dominates the processor market
  A vast amount of existing software
  Intel, AMD, Microsoft and others benefit from this
  • Intel and AMD invest a lot of money to make high performance x86 processors, despite the architectural disadvantage
  • Current x86 processors give the best cost/performance
CISC processors use arch ideas from the RISC world
  Starting at Pentium II and K6, x86 processors translate CISC instructions into RISC-like operations internally
  • the inside core looks much like that of a RISC processor
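The translation step can be pictured with a toy model. Everything in the sketch below is hypothetical: the tuple format, the temporary register names and the micro-operation names are invented for illustration and do not describe the internal formats of any real processor.

```python
def is_mem(operand):
    """In this toy syntax a memory operand is written in brackets, e.g. "[rbx]"."""
    return operand.startswith("[")

def crack(instr):
    """Split a toy CISC instruction (op, dst, src) into RISC-like micro-ops."""
    op, dst, src = instr
    uops = []
    src_reg = src
    if is_mem(src):                                # memory source: load it first
        uops.append(("load", "tmp0", src))
        src_reg = "tmp0"
    if is_mem(dst):                                # read-modify-write on memory
        uops.append(("load", "tmp1", dst))
        uops.append((op, "tmp1", "tmp1", src_reg))
        uops.append(("store", dst, "tmp1"))
    else:                                          # register-only ALU operation
        uops.append((op, dst, dst, src_reg))
    return uops

print(crack(("add", "[rbx]", "rax")))  # load, add, store
print(crack(("add", "rax", "rcx")))    # a single register-register add
```

Each resulting micro-op either accesses memory or operates on registers, never both, which is the load/store discipline of a RISC machine.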
Software Specific Extensions
Extend arch to accelerate exec of specific apps
Example: SSE™ – Streaming SIMD Extensions
  128-bit packed (vector) / scalar single precision FP (4×32)
  Introduced on Pentium® III in ’99
  8 new 128 bit registers (XMM0 – XMM7)
  Accelerates graphics, video, scientific calculations, …
Packed (128 bits):
    x3      x2      x1      x0
  + y3      y2      y1      y0
  = x3+y3   x2+y2   x1+y1   x0+y0
Scalar (128 bits):
    x3      x2      x1      x0
  + y3      y2      y1      y0
  = y3      y2      y1      x0+y0
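The difference between the packed and scalar forms can be modeled on plain lists. The sketch below treats a 128-bit register as a list of four lanes with lane 0 first, and follows the figure in passing the upper three lanes of y through unchanged in the scalar case. Python floats are double precision, so this models the lane behavior only, not single-precision rounding; the function names are made up.

```python
def add_packed(x, y):
    """Packed add: all four lanes are added independently."""
    return [xi + yi for xi, yi in zip(x, y)]

def add_scalar(x, y):
    """Scalar add: only lane 0 is added; lanes 1..3 are copied from y."""
    return [x[0] + y[0]] + list(y[1:])

x = [1.0, 2.0, 3.0, 4.0]      # lanes x0, x1, x2, x3
y = [10.0, 20.0, 30.0, 40.0]  # lanes y0, y1, y2, y3
print(add_packed(x, y))
print(add_scalar(x, y))
```

One packed instruction does the work of four scalar additions, which is how SSE reduces the instruction count for data-parallel code.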