Class Presentation

Intel Confidential
Pentium® III Processor
Initial Technical Disclosure
1999 Game Developer Conference
Intel Corp.
Pete Baker - [email protected]
Kim Pallister - [email protected]

Pentium® III Processor Initial Disclosure Agenda

What is the Pentium® III processor?
 Pentium® II processor / Pentium® III architecture comparison
 New floating point registers
 New status / control word
 New processor state / OS support

Intel Streaming SIMD Extensions
 New instructions
 Data types
 Instruction categories

How to get a Pentium® III application
 What makes a good Pentium® III application?
 Software development methodology

Pentium® III Processor: Combining Innovative Technology
Pentium® II technology + Intel streaming SIMD extensions
3D Graphics Just Got Faster!

What’s New And Different?

Feature                           Pentium® II Processor   Pentium® III Processor
MHz                               <=450                   450 / 500 / 550
Execution Type                    Dynamic                 Dynamic
System Bus                        100 MHz                 100 / 133 MHz
L1 Cache                          32KB                    32KB
MMX™ Technology                   Yes                     Yes
Intel Streaming SIMD Extensions   No                      Yes

Architecture Changes
100% compatible with all existing IA application software
 8 x 128-bit flat register file
 Complete new state in IA-32 architecture
 Extension is not transparent to the OS
 Allows for simultaneous execution of Intel streaming SIMD extensions and x87 or MMX™ technology instructions
New status / control word

Pentium® III Register Sets

IA-INT Registers (32 bits wide, EAX ... EDI)
 Eight 32-bit registers
 Direct access to the registers
 Scalar data only

MMX™ Technology / IA-FP Registers (80 / 64 bits wide, FP0 or MM0 ... FP7 or MM7)
 Eight 64-bit registers
 Direct access to the registers
 Referred to as MM0-MM7
 No MMX™ technology / FP interoperability
 Hold data only

Pentium® III New Registers (128 bits wide, XMM0 ... XMM7)
 Eight 128-bit registers
 Single precision
 Direct access to the registers
 Referred to as XMM0-XMM7
 Use simultaneously with FP / MMX™ technology
 Hold data only
 New state / require OS support

Combined Status / Control Word

Fields, from bit 0 upward:
 Exception flags: Invalid (IE), Denormal (DE), Zero (ZE), Overflow (OE), Underflow (UE), Precision (PE)
 Reserved
 Exception masks: Invalid (IM), Denormal (DM), Zero (ZM), Overflow (OM), Underflow (UM), Precision (PM)
 Rounding Control (RC), two bits
 Flush to Zero (FZ), bit 15
 Bits 16-31 reserved

 New Load / Store instructions
 Rounding modes
  Round to nearest even, round down, round up, round toward zero
 Flush-to-zero

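The rounding-control and flush-to-zero fields can also be reached from C++ through compiler intrinsics. The following is a minimal sketch, assuming a compiler whose `<xmmintrin.h>` provides `_mm_getcsr` / `_mm_setcsr` and the `_MM_SET_ROUNDING_MODE` / `_MM_SET_FLUSH_ZERO_MODE` helper macros. The function names `round_with_mode` and `enable_flush_to_zero` are hypothetical and exist only for illustration.

```cpp
#include <xmmintrin.h>

// Convert a float to int under the rounding mode given in `mode`
// (_MM_ROUND_NEAREST, _MM_ROUND_DOWN, _MM_ROUND_UP or
// _MM_ROUND_TOWARD_ZERO), then restore the previous control word.
int round_with_mode(float value, unsigned int mode) {
    unsigned int saved = _mm_getcsr();   // read the status / control word
    _MM_SET_ROUNDING_MODE(mode);         // rewrite only the RC field
    int result = _mm_cvtss_si32(_mm_set_ss(value));  // conversion honors RC
    _mm_setcsr(saved);                   // put the caller's settings back
    return result;
}

// Turn flush-to-zero on and report whether the FZ bit reads back as set.
bool enable_flush_to_zero() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    return _MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON;
}
```

Compilers may move floating-point code across control-word writes when optimizing, so code that depends on a non-default rounding mode should be checked in the generated assembly.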
Operating System Support
Required to save and restore the state of the processor during a context switch
 To use the new instructions:
  Must be using a processor with Intel streaming SIMD extensions
  And
  Must be using an operating system that supports the save / restore for the new state
Support already exists for Windows* 98, Windows NT* 4.0** and Windows* 2000

*Other brands and products are the property of their respective owners
** With service pack 4 or greater and special driver

Intel Streaming SIMD Extensions

70 new instructions total
 50 new SIMD single precision floating point instructions
 12 new integer instructions
 8 new cacheability instructions

Fully integrated into the Intel architecture
 Uses previously reserved opcodes
 Uses same addressing modes

Great for 3D and other floating point intensive applications!
[Figure: typical CPU utilization, 3-D application; labels: Audio, Physics, AI, etc.]

1 Minute SIMD Refresher

Single instruction, multiple data
 Perform the same operations on multiple data items in parallel

Traditional processing (Pentium® II processor)
 Data items moved into the processor one at a time
 Processed sequentially with multiple instructions

SIMD processing (Pentium® III processor w/ Streaming SIMD Extensions)
 Multiple data items moved into the processor as one value
 All processed in parallel by a single instruction

SIMD FP Instructions
Operate on all elements of a packed data type, in parallel, in SIMD fashion
 Some instructions have scalar or packed versions

[Figure: 128-bit packed data type with four elements, P4 P3 P2 P1 / Scalar. Each element is a 32-bit single precision value: sign S (bit 31), exponent (bits 30-23), significand (bits 22-0)]

IEEE 754 compatible FP arithmetic
 Unmasked exception support requires new handlers
 Not bit exact with x87 (IEEE 754)
Are useful in all modes: real, virtual, SMM, and protected (16-bit & 32-bit)

Data Types

Packed & scalar FP instructions operate on packed single precision floating point elements
 Packed instructions operate on 4 numbers (addps)

     X4       X3       X2       X1
 op  Y4       Y3       Y2       Y1
  =  X4opY4   X3opY3   X2opY2   X1opY1

 Scalar instructions operate on the least-significant number (addss)

     X4       X3       X2       X1
 op  Y4       Y3       Y2       Y1
  =  Y4       Y3       Y2       X1opY1

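The packed / scalar distinction can be sketched with the C++ intrinsics that map to addps and addss. This is an illustrative sketch only: the helper names `add_packed` and `add_scalar` are hypothetical, unaligned loads are used so that no alignment is assumed, and array index 0 corresponds to the least-significant element (X1 / Y1 in the diagrams).

```cpp
#include <xmmintrin.h>

// Packed add (addps): all four lanes are summed.
void add_packed(const float x[4], const float y[4], float out[4]) {
    __m128 vx = _mm_loadu_ps(x);   // unaligned load of four floats
    __m128 vy = _mm_loadu_ps(y);
    _mm_storeu_ps(out, _mm_add_ps(vx, vy));
}

// Scalar add (addss): only lane 0 is summed; lanes 1-3 are copied
// unchanged from the first operand.
void add_scalar(const float x[4], const float y[4], float out[4]) {
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    _mm_storeu_ps(out, _mm_add_ss(vx, vy));
}
```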
Instruction Categories
 Computation
 Branching
 Cacheability
 Data movement and ordering
 Type conversion
 State management
[Callout: We’ll talk about these today]

Computation

Full Precision
ADD, SUB, MUL, DIV, SQRT
 Floating Point (Packed/Scalar)
 Full 23 bit precision

Approximate Precision
RCP - Reciprocal
RSQRT - Reciprocal Square Root
 Perspective correction / projection
 Vector normalization
 Very fast
 Return at least 11 bits of precision

Integer
PMULHUW
 16-bit Integer (MMX™ technology)
 Blending
 Graphics

Branching

Removal
C++:
a = (a < b) ? c : d     ;Only doing ONE compare here AND there is a branch.
Assembly:
cmpps xmm0, xmm1, 1     ;4 compares (“a” and “b”) w/ one instruction - creates
                        ;mask. This is also the beginning of the branch removal.
movaps xmm2, xmm0       ;Save a copy of the mask.
andps xmm0, xmm3        ;and(mask, c) | andnot(mask, d)
andnps xmm2, xmm4       ;Where c=xmm3 and d=xmm4
orps xmm0, xmm2         ;Final result as in the above C++ statement, but 4X.

Compression via mask utilization
 Select “op” capability (EQ, LT, LE, GT, GE, NEQ, NLT, NLE, NGT, NGE)
cmpps xmm0, xmm1, 1     ;Generates a mask in XMM0
movmskps eax, xmm0      ;Move mask into eax
test eax, 2             ;Compare w/ desired result
jne “BRANCH TARGET”

 xmm0   111…111   000...000   111…111   000...000
 eax    000..00 1 0 1 0

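The same two patterns can be written with intrinsics. The sketch below is illustrative only: `select_lt` mirrors the cmpps / andps / andnps / orps sequence and `less_than_bits` mirrors cmpps followed by movmskps. Both function names are hypothetical, and array index 0 is the least-significant element.

```cpp
#include <xmmintrin.h>

// Per lane: out = (a < b) ? c : d, computed without a branch.
void select_lt(const float a[4], const float b[4],
               const float c[4], const float d[4], float out[4]) {
    // cmpps with predicate 1 (less-than): all-ones where a < b, else zeros
    __m128 mask = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    __m128 from_c = _mm_and_ps(mask, _mm_loadu_ps(c));      // andps
    __m128 from_d = _mm_andnot_ps(mask, _mm_loadu_ps(d));   // andnps: (~mask) & d
    _mm_storeu_ps(out, _mm_or_ps(from_c, from_d));          // orps
}

// Returns a 4-bit integer with bit i set when a[i] < b[i] (movmskps).
int less_than_bits(const float a[4], const float b[4]) {
    return _mm_movemask_ps(_mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

A caller can test the integer from `less_than_bits` once and branch for the whole group of four, instead of branching per element.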
Cacheability Operations

Prefetch
 Indicates the desire to load a particular piece of data into the cache in a given time frame

Streaming capabilities
 Indicates the desire to have particular data not placed in the cache when written
 E.g. data-dependent lookup operations

[Figure: block diagram with labels: Pentium® III processor (regs XMM0 ... XMM7, L1 cache, L2 cache), front-side bus (FSB, 66/100/133 MHz), chipset, system memory, AGP, graphics controller, PIO, DMA]

What Makes a Good Pentium® III Application?
Audio
Graphics
Physics
Video Creativity
Image Manipulation

Programming in ASM
FACT: There are principles around which processors are designed
BUT: ISAs, platforms, and microarchitectures are all moving targets which affect the performance of both normal and optimized code
Example: Partial stalls are primarily a result of hand coded assembly optimizations

The 4 Fundamental Principles for Pentium® III Processor
1 Exploit Parallelism
2 Minimize Unpredictable Branching
3 Ensure Locality
4 Maximize Sequential Memory Accesses

[Figure: execution window by processor generation (i486™ processor, Pentium® processor, Pentium processor with MMX™ technology, Pentium II processor); labeled values: 1 inst., 2 inst., 40 ops]

Development Methodology

Start with a solid foundation of basics
 Greatest return for least pain
Expand basics into the Intel Streaming SIMD Extensions
 Understand possible gain & what is required

[Figure: three methods ranked by difficulty, with a “Preferred Method” marker]
C++ class library: Performance for the masses
 a[i]=b[i]+c[i]
Intrinsics: Assembly for the fast food generation
 a[i]=_mm_add_ps(b[i],c[i])
Assembly: Bit Bangers Only!
 movaps xmm0, b[i]
 movaps xmm1, c[i]
 addps xmm0, xmm1
 movaps a[i], xmm0

Assembly Development
MASM 6.11d or later required for correct macro support
Functionality macros simulate Intel streaming SIMD extensions
 Test & validate functional correctness of Intel streaming SIMD extensions source code before silicon
Performance macros generate Intel streaming SIMD extensions
 Execute the same instructions as an actual Pentium® III processor

Intrinsics
C “functions” in place of inline asm
 Available for both Intel streaming SIMD extensions and MMX™ technology
 At least 75% of “speed of light”
Hide details, such as register allocation and scheduling
Intel® C/C++ Compiler supports Intel streaming SIMD extensions intrinsics now!

C++ Class Libraries
C++ classes
 Utilize C++ operator overloading to abstract underlying technology
 Encapsulates 128-bit data type
  4 single precision floats
  Methods / operators based on intrinsics
 Intrinsic support is required for classes

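#include <xmmintrin.h>   // __m128 and the _mm_* intrinsics

class Fvec32 {
    __m128 vec;          // four packed single precision floats
public:
    Fvec32() {}
    Fvec32(__m128 m) { vec = m; }
    Fvec32(float f3, float f2, float f1, float f0)
        { vec = _mm_set_ps(f3, f2, f1, f0); }
    Fvec32(float f)      // element 0 = f, elements 1-3 = 0
        { vec = _mm_set_ss(f); }
    operator __m128() const { return vec; }
    // Arithmetic Operators (const references so temporaries can be operands)
    friend Fvec32 operator +(const Fvec32 &a, const Fvec32 &b)
        { return _mm_add_ps(a, b); }
    // Branch Elimination Operators
    friend Fvec32 operator &(const Fvec32 &a, const Fvec32 &b)
        { return _mm_and_ps(a, b); }
};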
Complete set of “Packed Data” classes released with Pentium® III SDK

Summary & Opportunity
Differentiators between Pentium® II processor & Pentium® III processor:
Intel streaming SIMD extensions
 Compelling graphics performance
Target 550 MHz core frequency
 Additional line items <550 MHz
100+ MHz front side bus speed

Questions?

Example: Arithmetic Operation
MULPS: Multiply Packed Single-FP
mulps xmm1, xmm2

 xmm1   X4      X3      X2      X1
        *       *       *       *
 xmm2   Y4      Y3      Y2      Y1
 xmm1   X4*Y4   X3*Y3   X2*Y2   X1*Y1

Example: Compare Operation
CMPPS: Compare Packed Single-FP
cmpps xmm0, xmm1, 1

 xmm0   1.1      7.3      2.3      5.6
        <        <        <        <
 xmm1   8.6      2.3      3.5      1.2
 xmm0   111…11   000…00   111…11   000...00

Two Kinds of Shuffle
SHUFPS
 Moves two SP-FP numbers from each of the two sources to the destination under control of imm8 mask
 Performs: rotate, shift, swap and broadcast

 sources   b3 b2 b1 b0     a3 a2 a1 a0
 result    b0..b3  b0..b3  a0..a3  a0..a3

PSHUFW
 Each 16-bit element independently loaded under control of an 8-bit mask

 source    a3 a2 a1 a0
 result    a0..a3  a0..a3  a0..a3  a0..a3

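Broadcast and swap with SHUFPS can be sketched through the `_mm_shuffle_ps` intrinsic and the `_MM_SHUFFLE` macro from `<xmmintrin.h>`. The three helper names below are hypothetical. In `_MM_SHUFFLE(z, y, x, w)`, `w` and `x` pick result lanes 0 and 1 from the first operand, and `y` and `z` pick result lanes 2 and 3 from the second operand.

```cpp
#include <xmmintrin.h>

// Broadcast lane 0 of v to all four lanes.
__m128 broadcast_lane0(__m128 v) {
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
}

// Reverse the lane order of v (lane 3 <-> lane 0, lane 2 <-> lane 1).
__m128 reverse_lanes(__m128 v) {
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}

// Low two lanes from a (a0, a1), high two lanes from b (b2, b3).
__m128 low_a_high_b(__m128 a, __m128 b) {
    return _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 2, 1, 0));
}
```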
Example: Conversion Operation
CVTSI2SS: Convert signed INT32 to Scalar SP-FP
cvtsi2ss xmm0, eax

 xmm0 (before)   X4   X3   X2   X1
 eax                            Y1
 xmm0 (after)    X4   X3   X2   (float)Y1

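In C++ this conversion is reachable through the `_mm_cvtsi32_ss` intrinsic. A minimal sketch follows; the wrapper name `insert_int_as_float` is hypothetical.

```cpp
#include <xmmintrin.h>

// Replace lane 0 of v with (float)n; lanes 1-3 pass through unchanged,
// matching the cvtsi2ss diagram.
__m128 insert_int_as_float(__m128 v, int n) {
    return _mm_cvtsi32_ss(v, n);
}
```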
Example: Insert / Extract Instructions
pinsrw mm0, [edi], 2
 Immediate specifies which word of the MMX™ technology operand to use

 MM0 (before)   Y3   X2   Y1   Y0
 [edi]               Y2
 MM0 (after)    Y3   Y2   Y1   Y0

pextrw eax, mm0, 0

 MM0   Y3   Y2   Y1   Y0
 eax   00...00   Y0

Accelerates data dependent lookup operations
[Figure also shows memory locations labeled addr3, addr2, addr1, addr0]

Cacheability Control Instructions (cont’d)

Cache Hints: PREFETCH(T0-T2,NTA) m8
 Fetches 32 bytes, or a multiple of 32 bytes, specified by address m8 closer to the CPU:
  T0: brings data into L1 and L2
  T1, T2: brings data into L2 only
  NTA: brings data into L1 only
 Retires quickly to free up machine resources

Store fence: SFENCE
 Enforce correct ordering for weakly ordered memory writes

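A loop that combines a prefetch hint, streaming stores and a store fence might be sketched as follows. The function name `scale_streaming` and the prefetch distance of 32 floats are assumptions made for illustration, not tuned values. The sketch also assumes `count` is a multiple of 4 and that `dst` is 16-byte aligned, which `_mm_stream_ps` requires.

```cpp
#include <xmmintrin.h>
#include <cstddef>

// Multiply `count` floats from src by k and write them to dst with
// stores that bypass the cache (dst is not expected to be read soon).
void scale_streaming(const float* src, float* dst, std::size_t count, float k) {
    const __m128 vk = _mm_set1_ps(k);
    const std::size_t ahead = 32;   // prefetch distance in floats (assumed)
    for (std::size_t i = 0; i < count; i += 4) {
        if (i + ahead < count) {
            // Hint only: ask for data needed a few iterations from now.
            _mm_prefetch(reinterpret_cast<const char*>(src + i + ahead),
                         _MM_HINT_NTA);
        }
        __m128 v = _mm_mul_ps(_mm_loadu_ps(src + i), vk);
        _mm_stream_ps(dst + i, v);  // non-temporal store
    }
    _mm_sfence();   // order the weakly ordered stores before later writes
}
```

The right prefetch distance depends on memory latency and on how much work each iteration does, so it is normally found by measurement.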
Detecting Intel streaming SIMD extensions Support

CPUID
 FXSR set if CPU supports FXSAVE/FXRSTOR
 A separate feature bit is set if CPU supports Intel streaming SIMD extensions

CR4 bits
 OSFXSR set if both OS and CPU support FXSAVE/FXRSTOR for context switches
 OSXMMEXCPT set if OS supports unmasked Intel streaming SIMD extensions exceptions

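The CPUID half of this check can be sketched in C++. The example assumes a GCC or Clang toolchain, whose `<cpuid.h>` header provides `__get_cpuid`; other compilers expose CPUID differently. The struct and function names are hypothetical. The bit positions used are the architectural ones: leaf 1, EDX bit 24 for FXSR and EDX bit 25 for the streaming SIMD extensions.

```cpp
#include <cpuid.h>   // GCC / Clang helper header (toolchain assumption)

struct SimdSupport {
    bool fxsr;   // CPUID leaf 1, EDX bit 24: FXSAVE / FXRSTOR available
    bool sse;    // CPUID leaf 1, EDX bit 25: streaming SIMD extensions available
};

SimdSupport query_simd_support() {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    SimdSupport s = {false, false};
    // __get_cpuid returns 0 when the requested leaf is not supported.
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        s.fxsr = ((edx >> 24) & 1u) != 0;
        s.sse  = ((edx >> 25) & 1u) != 0;
    }
    return s;
}
```

CR4 can only be read by privileged code, so an application cannot inspect OSFXSR directly. This sketch therefore covers only the processor side of the check, not the operating system side.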
Video Benefits

Encoding:
 Real-time MPEG-1 encode at 352x240x30 feasible
 Real-time authoring/archiving to HDD, CD-R, DVD-R or WWW
 Real-time MPEG-2 encode at 720x480x30 feasible
 Trade off compression ratio for speed
 Near real-time MPEG-2 encode for very high quality content production

Decoding:
 Decode of MPEG-2 video at DVD quality is here!
 Use it! Intel streaming SIMD extensions will bring further performance improvements

Pentium® III Video Instructions

Two new SIMD integer instructions specifically designed to increase digital video compression speed
 pavg: Packed Average
 psad: Packed Sum of Absolute Differences

PAVG instruction useful in both digital video decompression and compression algorithms
 Primarily used for motion compensation during decode
 I.e. averages two half-pels into one pel

PSAD instruction useful in digital video compression / encode algorithms
 Primarily used for motion estimation during encode

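What these two instructions compute can be shown with a plain scalar model. The sketch below is a reference model in ordinary C++ and does not use the instructions themselves. It works on 8 unsigned bytes, the width of one MMX™ technology register, and the function names `sad8` and `avg8` are hypothetical.

```cpp
#include <cstdint>
#include <cstdlib>

// Scalar model of a packed sum of absolute differences over 8 bytes:
// the block-matching cost that motion estimation tries to minimize.
unsigned sad8(const std::uint8_t a[8], const std::uint8_t b[8]) {
    unsigned sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += static_cast<unsigned>(std::abs(int(a[i]) - int(b[i])));
    return sum;
}

// Scalar model of a packed byte average with rounding: (a + b + 1) >> 1.
void avg8(const std::uint8_t a[8], const std::uint8_t b[8], std::uint8_t out[8]) {
    for (int i = 0; i < 8; ++i)
        out[i] = static_cast<std::uint8_t>(
            (unsigned(a[i]) + unsigned(b[i]) + 1u) >> 1);
}
```

The packed instructions do the work of each of these loops in a single operation, which is where the encode and decode speedups come from.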