
Advanced Computer Architecture
5MD00 / 5Z033
Instruction Set Design
Henk Corporaal
www.ics.ele.tue.nl/~heco/courses/aca
TU Eindhoven
November 2011
Lecture overview
• ISA and Evolution
• Architecture classes
• Addressing
• Operands
• Operations
• Encoding
• RISC
• SIMD extensions
Instruction Set Architecture
• The instruction set architecture =
interface between software and hardware
• It provides the mechanism by which the software tells
the hardware what should be done
• Architecture definition:
“the architecture of a system/processor is (a minimal
description of) its behavior as observed by its
immediate users”
software
instruction set architecture
hardware
Instruction Set Design Issues
• Where are operands stored?
– registers, memory, stack, accumulator
• How many explicit operands are there?
– 0, 1, 2, or 3
• How is the operand location specified?
– register, immediate, indirect, . . .
• What type & size of operands are supported?
– byte, int, float, double, string, vector. . .
• What operations are supported?
– basic operations: add, sub, mul, move, compare . . .
– or also very complex operations?
Operands
• How are operands designated?
– fixed – always in the same place
– by opcode – always the same for groups of instructions
– by a field in the instruction – requires decode first
• What is the format of the data?
– binary
– character
– decimal (packed and unpacked)
– floating-point – IEEE 754 (others are used less and less)
– size – 8, 16, 32, 64, or 128 bits
– or vectors of the above types and sizes
• What is the influence on the ISA (= Instruction-Set
Architecture)?
Operand Locations
(figure: where operands live in the stack, accumulator, register-memory, and register-register (load/store) architecture classes)
Classifying ISAs
Accumulator (before 1960): 1 address
  add A             acc ← acc + mem[A]

Stack (1960s to 1970s): 0 address
  add               tos ← tos + next

Memory-Memory (1970s to 1980s): 2 or 3 address
  add A, B          mem[A] ← mem[A] + mem[B]
  add A, B, C       mem[A] ← mem[B] + mem[C]

Register-Memory (1970s to present): 2 address
  add R1, A         R1 ← R1 + mem[A]
  load R1, A        R1 ← mem[A]

Register-Register (Load/Store) (1960s to present): 3 address
  add R1, R2, R3    R1 ← R2 + R3
  load R1, R2       R1 ← mem[R2]
  store R1, R2      mem[R1] ← R2
Evolution of Architectures
Single Accumulator (EDSAC, 1950)
  → Accumulator + Index Registers (Manchester Mark I, IBM 700 series, 1953)
    → separation of programming model from implementation:
        High-Level Language Based (B5000, 1963)
        Concept of a Processor Family (IBM 360, 1964)
    → General Purpose Register Machines:
        Complex Instruction Sets (VAX, Intel 8086, 1977-80)
        Load/Store Architecture (CDC 6600, Cray-1, 1963-76)
          → RISC (MIPS, SPARC, 88000, IBM RS/6000, ... 1987+)
Addressing Modes
• Types
– Register – data in a register
– Immediate – data in the instruction
– Memory – data in memory
• Calculation of Effective Address
– Direct – address in instruction
– Indirect – address in register
– Displacement – address = register or PC + offset
– Indexed – address = register + register
– Memory Indirect – address at address in register
• Question: What is the influence on ISA?
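As a rough illustration (a sketch, not taken from the slides), the C fragment below shows which of these modes a compiler for a displacement-style load/store machine would typically pick; the function and variable names are invented for the example, and the mode actually chosen depends on the compiler and the ISA.

    /* Illustrative only: typical addressing-mode choices of a compiler.
     * Which mode is really used depends on the ISA and the compiler.      */
    struct point { int x, y; };

    int example(struct point *p, int a[], int i, int *q)
    {
        int k;
        k  = 10;       /* immediate: the constant is encoded in the instruction   */
        k += p->y;     /* displacement: base register p + small constant offset   */
        k += a[i];     /* indexed (or scaled): base register a + index register i */
        k += *q;       /* register indirect: address taken directly from q        */
        return k;      /* k itself lives in a register: register mode             */
    }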
Types of Addressing Mode (VAX)
Addressing Mode         Example               Action
1. Register direct      Add R4, R3            R4 ← R4 + R3
2. Immediate            Add R4, #3            R4 ← R4 + 3
3. Displacement         Add R4, 100(R1)       R4 ← R4 + M[100 + R1]
4. Register indirect    Add R4, (R1)          R4 ← R4 + M[R1]
5. Indexed              Add R4, (R1 + R2)     R4 ← R4 + M[R1 + R2]
6. Direct               Add R4, (1000)        R4 ← R4 + M[1000]
7. Memory indirect      Add R4, @(R3)         R4 ← R4 + M[M[R3]]
8. Autoincrement        Add R4, (R2)+         R4 ← R4 + M[R2]; R2 ← R2 + d
9. Autodecrement        Add R4, -(R2)         R2 ← R2 - d; R4 ← R4 + M[R2]
10. Scaled              Add R4, 100(R2)[R3]   R4 ← R4 + M[100 + R2 + R3*d]
• Studies by [Clark and Emer] indicate that modes 1-4 account for 93% of all
operands on the VAX
Operations
• Types
– ALU – Integer arithmetic and logical functions
– Data transfer – Loads/stores
– Control – Branch, jump, call, return, traps, interrupts
– System – O/S calls, virtual memory management
– Floating point – Floating point arithmetic
– Decimal – Decimal arithmetic (BCD: binary coded decimal)
– String – moves, compares, search, etc.
– Graphics – Pixel/vertex operations
– Vector – Vector (SIMD) functions
– more complex ones
• Addressing
– Which addressing modes for which operands are supported?
80x86 Instruction Frequency
Rank   Instruction      Frequency
1      load             22%
2      branch           20%
3      compare          16%
4      store            12%
5      add               8%
6      and               6%
7      sub               5%
8      register move     4%
9      call              1%
10     return            1%
       Total            96%
Relative Frequency of
Control Instructions
Operation      SPECint92   SPECfp92
Call/Return    13%         11%
Jumps           6%          4%
Branches       81%         87%
• Design hardware to handle branches
quickly, since these occur most frequently
Frequency of Operand Sizes
on 32-bit Load-Store Machines
Size       SPECint92   SPECfp92
64 bits     0%          69%
32 bits    74%          31%
16 bits    19%           0%
8 bits      7%           0%
• For floating point we want good performance for 64-bit operands.
• For integer operations we want good performance for 32-bit operands.
• Recent architectures also support 64-bit integers.
Instruction Encoding
• Variable
– Instruction length varies based on opcode and address
specifiers
– For example, VAX instructions vary between 1 and 53 bytes, while x86 instructions vary between 1 and 17 bytes.
– Good code density, but difficult to decode and pipeline
• Fixed
– Only a single size for all instructions
– For example, MIPS, PowerPC, and SPARC all have 32-bit instructions
– Not as good code density, but easier to decode and pipeline
• Hybrid
– Have multiple format lengths specified by the opcode
– For example: IBM 360/370, MIPS16, ARM
– Compromise between code density and ease of decode
Example: MIPS
• Operands mostly at fixed positions
• Fixed instruction size; only a few formats
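To make the "fixed positions" concrete, here is a small sketch (not from the slides) of how the 32-bit MIPS instruction word can be decoded: the opcode and register specifiers sit at fixed bit positions, so decoding needs only shifts and masks. The helper names are invented.

    #include <stdint.h>

    /* Field positions of the 32-bit MIPS R- and I-type formats:
     *   R-type: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6)
     *   I-type: op(6) rs(5) rt(5) immediate(16)                          */
    static inline uint32_t op(uint32_t insn)    { return insn >> 26; }
    static inline uint32_t rs(uint32_t insn)    { return (insn >> 21) & 0x1F; }
    static inline uint32_t rt(uint32_t insn)    { return (insn >> 16) & 0x1F; }
    static inline uint32_t rd(uint32_t insn)    { return (insn >> 11) & 0x1F; }
    static inline int32_t  imm16(uint32_t insn) { return (int16_t)(insn & 0xFFFF); }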
Compilers and ISA
• Compiler Goals
– All correct programs compile correctly
– Most compiled programs execute quickly
– Most programs compile quickly
– Achieve small code size
– Provide debugging support
• Multiple Source Compilers
– Same compiler can compile different languages
• Multiple Target Compilers
– Same compiler can generate code for different machines
– 'cross-compiler'
Compiler basics: trajectory
Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program
(error messages are reported by the compiler; library code is pulled in by the loader/linker)
Compiler basics: structure / passes
Source code
  Lexical analyzer           – token generation
  Parsing                    – check syntax, check semantics, parse tree generation
  Intermediate code
  Code optimization          – data flow analysis, local optimizations, global optimizations
  Code generation            – code selection, peephole optimizations
  Register allocation        – building the interference graph, graph coloring,
                               spill code insertion, caller/callee save and restore code
  Sequential code
  Scheduling and allocation  – exploiting ILP
Object code
Compiler basics: structure
Simple compilation example

Source statement:          position := initial + rate * 60

Lexical analyzer:          id1 := id2 + id3 * 60

Syntax analyzer:           parse tree for  id1 := id2 + (id3 * 60)

Intermediate code
generator:                 temp1 := inttoreal(60)
                           temp2 := id3 * temp1
                           temp3 := id2 + temp2
                           id1 := temp3

Code optimizer:            temp1 := id3 * 60.0
                           id1 := id2 + temp1

Code generator:            movf  id3, r2
                           mulf  #60.0, r2, r2
                           movf  id2, r1
                           addf  r2, r1
                           movf  r1, id1
Designing ISA to Improve Compilation
• Provide enough general purpose registers to ease register allocation (at least 16)
• Provide regular instruction sets by keeping the operations, data types, and addressing modes largely orthogonal
  Note: this is not good for code density!
• Provide primitive constructs rather than trying to map
to a high-level language
• Allow compilers to help make the common case fast
A "Typical" RISC
• 32-bit fixed-length instructions
• Only a few instruction formats
• 32 32-bit GPRs (general purpose registers)
• 3-address, reg-reg-reg / reg-imm-reg arithmetic instructions
• Single addressing mode for load/store: base + displacement
  – no indirection
• Simple branch conditions
• Pipelined implementation
• Separate instruction and data level-1 caches
• Delayed branch?
Comparison MIPS with 80x86
• How would you expect the x86 and MIPS
architectures to compare on the following ?
– CPI on SPEC benchmarks
– Ease of design and implementation
– Ease of writing assembly language & compilers
– Code density
– Overall performance
• What other advantages/disadvantages are there
to the two architectures?
Instruction Set Extensions
Subword parallelism
• Support graphics and multimedia applications
– Intel’s MMX Technology (introduced in 1997)
– Intel’s Internet Streaming SIMD Extensions (SSE – SSE4)
– AMD’s 3DNow! Technology
– Sun’s Visual Instruction Set
– Motorola’s and IBM’s AltiVec Technology
• These extensions improve the performance of
– Computer-aided design
– Internet applications
– Computer visualization
– Video games
– Speech recognition
MMX Data Types
MMX Technology supports operations on
the following 64-bit integer data types:
Packed byte (eight 8-bit elements)
Packed word (four 16-bit elements)
Packed double word (two 32-bit elements)
Packed quad word (one 64-bit element)
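One way to picture these packed types (an illustrative C sketch, not Intel's definitions) is as alternative views of the same 64-bit container:

    #include <stdint.h>

    /* A 64-bit MMX-style container viewed as packed bytes, words,
     * doublewords, or one quadword.                                      */
    typedef union {
        uint8_t  b[8];   /* packed byte:        eight  8-bit elements */
        uint16_t w[4];   /* packed word:        four  16-bit elements */
        uint32_t d[2];   /* packed double word: two   32-bit elements */
        uint64_t q;      /* packed quad word:   one   64-bit element  */
    } packed64_t;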
SIMD Operations
• MMX Technology allows a Single Instruction to work
on Multiple pieces of Data (SIMD)
     A3       A2       A1       A0
  +  B3       B2       B1       B0
  = A3+B3    A2+B2    A1+B1    A0+B0

PADD[W]: Packed add word
• In the above example, 4 parallel adds are performed on
16-bit elements
• Most MMX instructions only require a single cycle
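The effect of such a packed add can be modeled in plain C (a functional sketch only; real code would use the PADDW instruction or compiler intrinsics):

    #include <stdint.h>

    /* Functional model of a packed 16-bit wrap-around add (PADDW-like):
     * four independent adds on the 16-bit lanes of two 64-bit words.     */
    uint64_t padd_w(uint64_t a, uint64_t b)
    {
        uint64_t result = 0;
        for (int lane = 0; lane < 4; lane++) {
            uint16_t x = (uint16_t)(a >> (16 * lane));
            uint16_t y = (uint16_t)(b >> (16 * lane));
            uint16_t s = (uint16_t)(x + y);          /* wraps around on overflow */
            result |= (uint64_t)s << (16 * lane);
        }
        return result;
    }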
Saturating Arithmetic
• Both wrap-around and saturating adds are supported
• With saturating arithmetic, results that
overflow/underflow are set to the largest/smallest
value
PADD[W]: Packed wrap-around add
PADDUS[W]: Packed saturating add
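The difference can be sketched with scalar C helpers for one signed 16-bit lane (hypothetical helper names; PADDSW applies this per lane, and PADDUSW does the same for the unsigned range):

    #include <stdint.h>

    /* Wrap-around add: overflow silently wraps modulo 2^16. */
    int16_t add_wrap16(int16_t a, int16_t b)
    {
        return (int16_t)((uint16_t)a + (uint16_t)b);
    }

    /* Saturating add: results are clamped to the signed 16-bit range,
     * as a PADDSW-style instruction would do per lane.                   */
    int16_t add_sat16(int16_t a, int16_t b)
    {
        int32_t s = (int32_t)a + (int32_t)b;
        if (s >  32767) return  32767;
        if (s < -32768) return -32768;
        return (int16_t)s;
    }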
Pack and Unpack Instructions
• Pack and unpack instructions provide
conversion between standard data types and
packed data types
PACKSS[DW]: Pack signed with saturation, doubleword to packed word
Multiply-Add Operations
• Many graphics applications require multiply-accumulate operations
  – Vector dot products: a • b
  – Matrix multiplies
  – Fast Fourier Transforms (FFTs)
  – Filter implementations
PMADDWD: Packed multiply-add word to double
Vector Dot Product
• A dot product on an 8-element vector can
be performed using 9 MMX instructions
– Without MMX 40 instructions are required
PMADDWD partial sums:  a0*c0 + ... + a3*c3   and   a4*c4 + ... + a7*c7
Final result:          a0*c0 + ... + a7*c7
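A plain-C sketch of the same structure (illustrative, not the actual MMX code): each PMADDWD-style step multiplies pairs of 16-bit elements and adds adjacent products into 32-bit partial sums, which a few more adds then combine.

    #include <stdint.h>

    /* Dot product of two 8-element 16-bit vectors, structured the way a
     * PMADDWD-based version would compute it: pairwise multiply-add
     * into 32-bit partial sums, then a final reduction.                  */
    int32_t dot8(const int16_t a[8], const int16_t c[8])
    {
        int32_t partial[4];
        for (int i = 0; i < 4; i++)    /* multiply-add adjacent pairs */
            partial[i] = (int32_t)a[2*i] * c[2*i] + (int32_t)a[2*i+1] * c[2*i+1];
        return (partial[0] + partial[1]) + (partial[2] + partial[3]);
    }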
Packed Compare Instructions
• Packed compare instructions allow a bit mask
to be set or cleared
• This is useful when images with certain
qualities need to be extracted
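A per-element C sketch of the typical usage pattern (a hypothetical chroma-key example, not from the slides): the compare produces an all-ones or all-zeros mask, which then merges two sources with AND/OR logic.

    #include <stdint.h>

    /* Per-element sketch of packed-compare + mask merge: where the source
     * pixel equals the key value, take the replacement pixel, otherwise
     * keep the source pixel (PCMPEQW-like compare result used as a mask). */
    uint16_t select_pixel(uint16_t src, uint16_t key, uint16_t replacement)
    {
        uint16_t mask = (src == key) ? 0xFFFF : 0x0000;
        return (uint16_t)((mask & replacement) | (~mask & src));
    }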
MMX Instructions
• MMX Technology adds 57 new instructions to the x86
architecture.
• Some of these instructions include (b = byte, w = 16-bit word, d = 32-bit doubleword, q = 64-bit quadword):
  – PADD(b, w, d)         Packed addition
  – PSUB(b, w, d)         Packed subtraction
  – PCMPE(b, w, d)        Packed compare equal
  – PMULLw                Packed word multiply low
  – PMULHw                Packed word multiply high
  – PMADDwd               Packed word multiply-add
  – PSRL(w, d, q)         Packed shift right logical
  – PACKSS(wb, dw)        Pack data
  – PUNPCK(bw, wd, dq)    Unpack data
  – PAND, POR, PXOR       Packed logical operations
MMX Performance Comparison
Application        Without MMX   With MMX   Speedup
Video              155.52        268.70     1.72
Image processing   159.03        743.90     4.67
3D geometry        161.52        166.44     1.03
Audio              149.80        318.90     2.13
Overall            156.00        255.43     1.64
MMX Technology Summary
• MMX technology extends the Intel x86 architecture to improve
the performance of multimedia and graphics applications.
• It provides a speedup of 1.5 to 2.0 for certain applications.
The speedup can be higher for larger vectors (more subwords per word).
• MMX instructions are hand-coded in assembly or implemented
as libraries to achieve high performance.
• MMX data types use the x86 floating point registers to avoid
adding state to the processor.
– Makes it easy to handle context switches
– Makes it hard to perform MMX and floating point
instructions at the same time
• MMX only increases the chip area by about 5%.
Questions on MMX
• What are the strengths and weaknesses of MMX
Technology?
• How could MMX Technology potentially be
improved?
• How did the developers of MMX preserve backward
compatibility with the x86 architecture?
– Why was this important?
– What are the disadvantages of this approach?
• What restrictions/limitations are there on the use of
MMX Technology?
Internet Streaming SIMD
Extensions (SSE)
• Help improve the performance of video and 3D applications
• Are designed for streaming data, which is used once and then
discarded.
• 70 new instructions beyond MMX Technology
• Adds 8 new 128-bit vector registers (XMM0 – XMM7)
• Provide the ability to perform multiple floating point operations
– Four parallel operations on 32-bit numbers
– Reciprocal and reciprocal square-root instructions – normalization
– Packed average instruction – Motion compensation
• Provide data prefetch instructions
• Make certain applications 1.5 to 2.0 times faster
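As a minimal sketch of the programming model (assuming the <xmmintrin.h> intrinsics and an array length that is a multiple of 4), four single-precision additions are performed per instruction:

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Add two float arrays four elements at a time with ADDPS.
     * n is assumed to be a multiple of 4 to keep the sketch short.       */
    void add_arrays(float *dst, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats (unaligned ok) */
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&dst[i], _mm_add_ps(va, vb));
        }
    }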
Beyond SSE
• SSE2: SIMD on any data type from 8-bit int to 64-bit
double, using XMM vector registers
• SSE4: dot-product operation
• AVX (Advanced Vector Extensions): 2011
– 16 256-bit vector registers, YMM0-YMM15
  • later extended to 512 bits (AVX-512)
– 3-operand, non-destructive instructions (instead of 2-operand forms in which one register is both a source and the destination)
– 2011: first in the Intel Sandy Bridge and AMD Bulldozer architectures
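The same kind of loop in AVX form, as a sketch (assuming <immintrin.h> and n a multiple of 8): a 256-bit YMM register holds eight single-precision elements, and the three-operand form leaves both source registers intact.

    #include <immintrin.h>   /* AVX intrinsics */

    /* Eight single-precision adds per VADDPS; n assumed to be a multiple of 8. */
    void add_arrays_avx(float *dst, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(&a[i]);
            __m256 vb = _mm256_loadu_ps(&b[i]);
            _mm256_storeu_ps(&dst[i], _mm256_add_ps(va, vb));  /* dst = a + b */
        }
    }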