Class Presentation
Intel Confidential
Pentium® III Processor
Initial Technical Disclosure
1999 Game Developer Conference
Intel Corp.
Pete Baker - [email protected]
Kim Pallister - [email protected]
1
Pentium® III Processor Initial Disclosure Agenda
What is the Pentium® III processor?
Pentium® II processor / Pentium® III architecture comparison
New floating point registers
New status / control word
New processor state / OS support
Intel Streaming SIMD Extensions
New instructions
Data types
Instruction categories
How to get a Pentium® III application
What makes a good Pentium® III application?
Software development methodology
2
Pentium® III Processor: Combining Innovative Technology
Pentium® II technology + Intel streaming SIMD extensions
3D Graphics Just Got Faster!
3
What’s New And Different?
Feature                            Pentium® II Processor   Pentium® III Processor
MHz                                <=450                   450 / 500 / 550
Execution Type                     Dynamic                 Dynamic
System Bus                         100 MHz                 100 / 133 MHz
L1 Cache                           32KB                    32KB
MMX™ Technology                    Yes                     Yes
Intel Streaming SIMD Extensions    No                      Yes
4
Architecture Changes
100% compatible with all existing IA application software
8 x 128 bit flat register file
Complete new state in IA-32 architecture
Extension is not transparent to OS
Allows for simultaneous execution of Intel streaming SIMD extensions and x87 or MMX™ technology instructions
New status / control word
5
Pentium® III Register Sets
IA-INT Registers (32 bits wide, EAX ... EDI): eight 32-bit registers; direct access to the registers; scalar data only
MMX™ Technology / IA-FP Registers (80 bits wide as FP0 ... FP7, 64 bits wide as MM0 ... MM7): eight 64 bit registers; direct access to the registers; referred to as MM0-MM7; no MMX™ Technology / FP interoperability; hold data only
Pentium® III New Registers (128 bits wide, XMM0 ... XMM7): eight 128 bit registers; single precision; direct access to the registers; referred to as XMM0-XMM7; use simultaneously with FP / MMX™ Technology; hold data only; new state / require OS support
6
Combined Status / Control Word
Bits 31-16: Reserved
Bit 15: Flush to Zero (FZ)
Bits 14-13: Rounding Control (RC)
Bits 12-7: exception masks: Precision (PM), Underflow (UM), Overflow (OM), Zero (ZM), Denormal (DM), Invalid (IM)
Bit 6: Reserved
Bits 5-0: exception flags: Precision (PE), Underflow (UE), Overflow (OE), Zero (ZE), Denormal (DE), Invalid (IE)
New Load / Store instructions
Rounding modes: round to nearest even, round down, round up, round toward zero
Flush-to-zero
7
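The control word fields on slide 7 can be reached from C++ through the intrinsics header. The following is a minimal sketch, assuming a compiler that ships `<xmmintrin.h>`; the helper names (`enable_truncate_and_ftz`, `restore_control_word`, `convert_with_current_rounding`) are made up for illustration and are not part of any Intel SDK.

```cpp
#include <xmmintrin.h>  // _mm_getcsr, _mm_setcsr, rounding / flush-to-zero macros

// Hypothetical helper: select round-toward-zero (RC field) and turn on
// flush-to-zero (FZ bit). Returns the previous 32-bit control/status word
// so the caller can restore it later.
inline unsigned int enable_truncate_and_ftz() {
    unsigned int saved = _mm_getcsr();             // read the whole word
    _MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO);  // updates only the RC bits
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);    // updates only the FZ bit
    return saved;
}

// Hypothetical helper: write back a previously saved control/status word.
inline void restore_control_word(unsigned int saved) {
    _mm_setcsr(saved);
}

// Scalar float-to-int conversion (CVTSS2SI); it rounds according to
// whatever rounding mode is currently selected in the RC field.
inline int convert_with_current_rounding(float f) {
    return _mm_cvtss_si32(_mm_set_ss(f));
}
```

Saving and restoring the word around a block of SIMD code keeps the change local, which matters because the rounding mode and flush-to-zero setting affect every later SIMD floating point instruction on that thread.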
Operating System Support
Required to save and restore the state of the processor during a context switch
To use the new instructions:
Must be using a processor with Intel streaming SIMD extensions
And
Must be using an operating system that supports the save / restore for the new state
Support already exists for Windows* ‘98, NT* 4.0** and Windows* 2000
*Other brands and products are the property of their respective owners
** With service pack 4 or greater and special driver
8
Intel Streaming SIMD Extensions
70 new instructions total
50 new SIMD single precision floating point instructions
12 new integer instructions
8 new cacheability instructions
Fully integrated into the Intel architecture
Uses previously reserved opcodes
Uses same addressing modes
Great for 3D and other floating point intensive applications!
[Chart: Typical CPU Utilization 3-D; segments labeled Audio, Physics, AI, etc.]
10
1 Minute SIMD Refresher
Single instruction, multiple data
Perform the same operations on multiple data items in parallel
Traditional Processing (Pentium® II Processor)
Data items moved into the processor one at a time
Processed sequentially with multiple instructions
SIMD Processing (Pentium® III processor w/ Streaming SIMD Extensions)
Multiple data items moved into the processor as one value
All processed in parallel by a single instruction
11
SIMD FP Instructions
Operate on all elements of a packed datatype, in parallel, in SIMD fashion
Some instructions have scalar or packed versions
Packed layout (most to least significant): P4 | P3 | P2 | P1 / Scalar
Each element: S (bit 31) | Exponent (bits 30-23) | Significand (bits 22-0)
IEEE 754 compatible FP arithmetic
Unmasked exception support requires new handlers
Not bit exact with x87 (IEEE 754) results
Usable in all modes: real, virtual, SMM, and protected (16-bit & 32-bit)
12
Data Types
Packed & scalar FP instructions operate on packed single precision floating point elements
Packed instructions operate on 4 numbers (addps)
  X4      X3      X2      X1
  op      op      op      op
  Y4      Y3      Y2      Y1
  X4opY4  X3opY3  X2opY2  X1opY1
Scalar instructions operate on least-significant number (addss)
  X4      X3      X2      X1
                          op
  Y4      Y3      Y2      Y1
  Y4      Y3      Y2      X1opY1
13
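The packed and scalar forms on slide 13 map onto the `_mm_add_ps` and `_mm_add_ss` intrinsics. The sketch below assumes arrays of four floats with index 0 as the least-significant element (X1 / Y1); the function names are illustrative only.

```cpp
#include <xmmintrin.h>  // __m128, _mm_add_ps, _mm_add_ss, unaligned load/store

// Packed add (ADDPS): all four element pairs are summed in one instruction.
inline void add_packed(const float x[4], const float y[4], float out[4]) {
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    _mm_storeu_ps(out, _mm_add_ps(vx, vy));
}

// Scalar add (ADDSS): only the least-significant pair is summed. The upper
// three elements are copied from the first operand, so y is passed first
// here to reproduce the "Y4 Y3 Y2 X1opY1" result shown on the slide.
inline void add_scalar(const float x[4], const float y[4], float out[4]) {
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    _mm_storeu_ps(out, _mm_add_ss(vy, vx));
}
```

The unaligned loads and stores are used only to keep the sketch free of alignment requirements; aligned data with `_mm_load_ps` / `_mm_store_ps` (MOVAPS) is the usual choice in performance code.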
Instruction Categories
Computation
Branching
Cacheability
Data movement and ordering
Type conversion
State management
We’ll talk about these today
14
Computation
Floating Point (Packed/Scalar)
Full Precision: ADD, SUB, MUL, DIV, SQRT; full 23 bit precision
Approximate Precision: RCP (Reciprocal), RSQRT (Reciprocal Square Root); very fast; return at least 11 bits of precision
Graphics uses: perspective correction / projection, vector normalization
Integer
PMULHUW: 16-bit integer (MMX™ technology); blending
15
Branching
Removal
C++:
a = (a < b) ? c : d   // Only doing ONE compare here AND there is a branch.
Assembly:
cmpps xmm0, xmm1, 1   ;4 compares (“a” and “b”) w/ one instruction - creates mask. This is also the beginning of the branch removal.
movaps xmm2, xmm0     ;Save a copy of the mask.
andps xmm0, xmm3      ;and(mask, c) | andnot(mask, d)
andnps xmm2, xmm4     ;Where c=xmm3 and d=xmm4
orps xmm0, xmm2       ;Final result as in the above C++ statement, but 4X.
Compression via mask utilization
Select “op” capability (EQ, LT, LE, GT, GE, NEQ, NLT, NLE, NGT, NGE)
cmpps xmm0, xmm1, 1   ;Generates a mask in XMM0
movmskps eax, xmm0    ;Move mask into eax
test eax, 2           ;Compare w/ desired result
jne “BRANCH TARGET”
xmm0 = 111…111 | 000...000 | 111…111 | 000...000
eax = 000..00 1 0 1 0
16
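The two assembly sequences on slide 16 have direct intrinsic equivalents. This is a sketch under the assumption that the "less than" predicate is wanted (immediate 1 in the `cmpps` examples); `select_lt` and `lanes_less_than` are names invented here.

```cpp
#include <xmmintrin.h>  // _mm_cmplt_ps, _mm_and_ps, _mm_andnot_ps, _mm_or_ps, _mm_movemask_ps

// Branch removal: per element, result = (a < b) ? c : d.
// The compare produces all-ones or all-zeros per element, which is then
// used as a bit mask: (mask AND c) OR (NOT mask AND d).
inline __m128 select_lt(__m128 a, __m128 b, __m128 c, __m128 d) {
    __m128 mask = _mm_cmplt_ps(a, b);
    return _mm_or_ps(_mm_and_ps(mask, c), _mm_andnot_ps(mask, d));
}

// Mask compression: MOVMSKPS packs the sign bit of each element into the
// low four bits of an integer (bit 0 = least-significant element), so a
// single integer test can decide whether any or all elements passed.
inline int lanes_less_than(__m128 a, __m128 b) {
    return _mm_movemask_ps(_mm_cmplt_ps(a, b));
}
```

With the operand values from the compare example later in the deck (1.1, 7.3, 2.3, 5.6 against 8.6, 2.3, 3.5, 1.2, most-significant element first), `lanes_less_than` returns binary 1010, the same pattern shown in eax on the slide.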
Cacheability Operations
Prefetch
Indicates the desire to load a particular piece of data into the cache in a given time frame
Streaming capabilities
Indicates the desire to have particular data not placed in the cache when written
E.g. data-dependent lookup operations
[Diagram: Pentium® III processor (regs XMM0 ... XMM7, L1 cache), L2 cache, front-side bus (FSB, 66/100/133 MHz), chipset, system memory, AGP graphics controller; PIO and DMA paths]
17
What Makes a Good Pentium® III Application?
Audio
Graphics
Physics
Video Creativity
Image Manipulation
19
Programming in ASM
FACT: There are principles around which processors are designed
BUT: ISAs, platforms, and microarchitectures are all moving targets which affect the performance of both normal and optimized code
Example: Partial stalls are primarily a result of hand coded assembly optimizations
20
The 4 Fundamental Principles for Pentium® III Processor
1 Exploit Parallelism
2 Minimize Unpredictable Branching
3 Ensure Locality
4 Maximize Sequential Memory Accesses
[Chart: Execution Window; values 1 inst., 2 inst., 40 ops; processors i486™ Processor, Pentium® Processor, Pentium Processor MMX™ Tech., Pentium II Processor]
21
Development Methodology
Start with a solid foundation of basics
Greatest return for least pain
Expand basics into the Intel Streaming SIMD Extensions
Understand possible gain & what is required
Three approaches, by increasing difficulty:
C++ class library: Performance for the masses
a[i]=b[i]+c[i]
Intrinsics: Assembly for the fast food generation
a[i]=_mm_add_ps(b[i],c[i])
Assembly: Bit Bangers Only!
movaps xmm0, b[i]
movaps xmm1, c[i]
addps xmm0, xmm1
movaps a[i], xmm0
Preferred Method
22
Assembly Development
MASM 6.11d or later required for correct macro support
Functionality macros simulate Intel streaming SIMD extensions
Test & validate functional correctness of Intel streaming SIMD extensions source code before silicon
Performance macros generate Intel streaming SIMD extensions
Execute the same instructions as an actual Pentium® III processor
23
Intrinsics
C “functions” in place of inline asm
Available for both Intel streaming SIMD extensions and MMX™ technology
At least 75% of “speed of light”
Hide details, such as register allocation and scheduling
Intel® C/C++ Compiler supports Intel streaming SIMD extensions intrinsics now!
24
C++ Class Libraries
C++ classes
Utilize C++ operator overloading to abstract underlying technology
Encapsulates 128 bit data type
4 single precision floats
Methods / operators based on intrinsics
Intrinsic support is required for classes
#include <xmmintrin.h>  // __m128 and the _mm_* intrinsics
class Fvec32 {
    __m128 vec;  // four packed single precision floats
public:
    Fvec32() {}
    Fvec32(__m128 m) { vec = m; }
    Fvec32(float f3, float f2, float f1, float f0) { vec = _mm_set_ps(f3, f2, f1, f0); }
    Fvec32(float f) { vec = _mm_set_ss(f); }
    operator __m128() const { return vec; }
    // Arithmetic Operators (const references so temporaries can be passed)
    friend Fvec32 operator +(const Fvec32 &a, const Fvec32 &b) { return _mm_add_ps(a, b); }
    // Branch Elimination Operators
    friend Fvec32 operator &(const Fvec32 &a, const Fvec32 &b) { return _mm_and_ps(a, b); }
};
Complete set of “Packed Data” classes released with Pentium® III SDK
25
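To show what operator overloading buys at the call site, here is a small self-contained wrapper in the spirit of the class on slide 25. `Float4`, its `store` method, and `multiply_add` are hypothetical names for this sketch; they are not the classes shipped in the SDK.

```cpp
#include <xmmintrin.h>  // __m128, _mm_set_ps, _mm_add_ps, _mm_mul_ps

// Hypothetical wrapper around one 128-bit register of four floats.
class Float4 {
    __m128 vec;
public:
    Float4() : vec(_mm_setzero_ps()) {}
    Float4(__m128 m) : vec(m) {}
    // Arguments run from most-significant (f3) to least-significant (f0).
    Float4(float f3, float f2, float f1, float f0) : vec(_mm_set_ps(f3, f2, f1, f0)) {}
    operator __m128() const { return vec; }

    friend Float4 operator+(const Float4& a, const Float4& b) { return _mm_add_ps(a.vec, b.vec); }
    friend Float4 operator*(const Float4& a, const Float4& b) { return _mm_mul_ps(a.vec, b.vec); }

    // Copy the four elements to memory, least-significant element first.
    void store(float out[4]) const { _mm_storeu_ps(out, vec); }
};

// Reads like scalar code, but each operator processes four floats.
inline Float4 multiply_add(const Float4& a, const Float4& b, const Float4& c) {
    return a * b + c;
}
```

The calling code never mentions registers or instruction names, which is the abstraction the slide describes; the cost is that the compiler, not the programmer, decides register allocation and scheduling.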
Summary & Opportunity
Differentiators between Pentium® II processor & Pentium® III processor = Intel streaming SIMD extensions
Compelling graphics performance
Target 550 MHz core frequency
Additional line items <550 MHz
100+ MHz front side bus speed
26
Questions?
27
Example: Arithmetic Operation
MULPS: Multiply Packed Single-FP
mulps xmm1, xmm2
xmm1 (before):  X4     X3     X2     X1
                *      *      *      *
xmm2:           Y4     Y3     Y2     Y1
xmm1 (after):   X4*Y4  X3*Y3  X2*Y2  X1*Y1
28
Example: Compare Operation
CMPPS: Compare Packed Single-FP
cmpps xmm0, xmm1, 1
xmm0 (before):  1.1     7.3     2.3     5.6
                <       <       <       <
xmm1:           8.6     2.3     3.5     1.2
xmm0 (after):   111…11  000…00  111…11  000...00
29
Two Kinds of Shuffle
SHUFPS
Moves two SP-FP numbers from each source to the destination under control of imm8 mask
Performs: rotate, shift, swap and broadcast
Sources: b3 b2 b1 b0 and a3 a2 a1 a0
Destination: b0..b3 | b0..b3 | a0..a3 | a0..a3
PSHUFW
Each 16-bit element independently loaded under control of an 8-bit mask
Source: a3 a2 a1 a0
Destination: a0..a3 | a0..a3 | a0..a3 | a0..a3
30
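The SHUFPS behavior on slide 30 (two low result elements chosen from the first source, two high ones from the second) is exposed through `_mm_shuffle_ps` and the `_MM_SHUFFLE` helper macro. The three functions below are an illustrative sketch; their names are not from the original deck.

```cpp
#include <xmmintrin.h>  // _mm_shuffle_ps, _MM_SHUFFLE

// _MM_SHUFFLE(z, y, x, w) builds the imm8: result[0] = a[w], result[1] = a[x],
// result[2] = b[y], result[3] = b[z], where index 0 is the least-significant
// element.

// Broadcast: copy element 0 into all four positions.
inline __m128 broadcast_lane0(__m128 v) {
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
}

// Swap: reverse the order of the four elements.
inline __m128 reverse_lanes(__m128 v) {
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}

// Two-source form: low half of a in the low half of the result,
// low half of b in the high half.
inline __m128 low_a_low_b(__m128 a, __m128 b) {
    return _mm_shuffle_ps(a, b, _MM_SHUFFLE(1, 0, 1, 0));
}
```

Passing the same register as both sources, as the first two helpers do, is how a single-register rotate, swap, or broadcast is obtained from the two-source instruction.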
Example: Conversion Operation
CVTSI2SS: Convert signed INT32 to Scalar SP-FP
cvtsi2ss xmm0, eax
xmm0 (before):  X4  X3  X2  X1
eax:                        Y1
xmm0 (after):   X4  X3  X2  (float)Y1
31
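The conversion on slide 31 corresponds to the `_mm_cvtsi32_ss` intrinsic. The sketch below pairs it with the truncating reverse conversion `_mm_cvttss_si32`, which is an addition for illustration; both function names are made up for this example.

```cpp
#include <xmmintrin.h>  // _mm_cvtsi32_ss, _mm_cvttss_si32

// CVTSI2SS: convert a signed 32-bit integer to float and place it in the
// least-significant element; the upper three elements of v are unchanged.
inline __m128 int_to_low_lane(__m128 v, int n) {
    return _mm_cvtsi32_ss(v, n);
}

// CVTTSS2SI: convert the least-significant element back to an integer,
// always truncating toward zero regardless of the rounding control field.
inline int low_lane_to_int_truncate(__m128 v) {
    return _mm_cvttss_si32(v);
}
```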
Example: Insert / Extract Instructions
pinsrw mm0, [edi], 2
Immediate specifies which MMX™ technology operand word to use
MM0 (before):  Y3  X2  Y1  Y0
[edi]:             Y2
MM0 (after):   Y3  Y2  Y1  Y0
pextrw eax, mm0, 0
MM0:  Y3  Y2  Y1  Y0
eax:  00...00  Y0
[Diagram labels: addr3 addr2 addr1 addr0]
Accelerates data dependent lookup operations
32
Cacheability Control Instructions (cont’d)
Cache Hints: PREFETCH(T0-T2,NTA) m8
Fetches 32 bytes or a multiple of 32 bytes specified by address m8 closer to the CPU:
T0: brings data into L1 and L2
T1, T2: brings data into L2 only
NTA: brings data into L1 only
Retires quickly to free up machine resources
Store fence: SFENCE
Enforce correct ordering for weakly ordered memory writes
33
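The prefetch hint, streaming store, and store fence described on slides 17 and 33 are typically combined in a loop that reads data once and writes results that will not be reused soon. The following is a sketch only: the function name, the prefetch distance of 64 floats, and the choice of the NTA hint are assumptions for illustration and would need tuning on real hardware.

```cpp
#include <xmmintrin.h>  // _mm_prefetch, _mm_stream_ps, _mm_sfence
#include <cstddef>

// dst[i] = src[i] * scale for count floats.
// Assumptions for this sketch: count is a multiple of 4 and dst is
// 16-byte aligned (required by the streaming store).
inline void scale_stream(float* dst, const float* src, std::size_t count, float scale) {
    const std::size_t kPrefetchAhead = 64;  // illustrative distance, in floats
    const __m128 s = _mm_set1_ps(scale);
    for (std::size_t i = 0; i < count; i += 4) {
        // Hint that data a few iterations ahead will be needed soon.
        if (i + kPrefetchAhead < count) {
            _mm_prefetch(reinterpret_cast<const char*>(src + i + kPrefetchAhead), _MM_HINT_NTA);
        }
        __m128 v = _mm_mul_ps(_mm_loadu_ps(src + i), s);
        // Streaming (non-temporal) store: write without placing the data in the cache.
        _mm_stream_ps(dst + i, v);
    }
    // Make the weakly ordered streaming stores visible in order before returning.
    _mm_sfence();
}
```

A prefetch is only a hint and changes no program-visible state, so removing it never affects correctness; the fence, by contrast, matters whenever another agent (another thread or a device) will read the streamed data.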
Detecting Intel streaming SIMD extensions Support
CPUID
FXSR bit (EDX bit 24) set if CPU supports FXSAVE/FXRSTOR
SSE bit (EDX bit 25) set if CPU supports Intel streaming SIMD extensions
CR4 bits
OSFXSR set if both OS and CPU support FXSAVE/FXRSTOR for context switches
OSXMMEXCPT set if OS supports unmasked Intel streaming SIMD extensions’ exceptions
34
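The checks on slide 34 reduce to testing a few bits. The sketch below decodes raw register values that the caller has already obtained; how they are obtained is left out on purpose, because the CPUID wrapper differs between compilers and CR4 can be read only by privileged (operating system) code. The bit positions are the documented ones for CPUID function 1 (EDX) and CR4; the constant and function names are invented for this example.

```cpp
#include <cstdint>

// CPUID function 1, EDX feature flags.
constexpr std::uint32_t kCpuidEdxFxsr = 1u << 24;  // FXSAVE/FXRSTOR supported
constexpr std::uint32_t kCpuidEdxSse  = 1u << 25;  // streaming SIMD extensions supported

// CR4 control register flags (set by the operating system).
constexpr std::uint32_t kCr4OsFxsr     = 1u << 9;   // OS uses FXSAVE/FXRSTOR on context switch
constexpr std::uint32_t kCr4OsXmmExcpt = 1u << 10;  // OS handles unmasked SIMD FP exceptions

// True if the processor reports both FXSR and the new instructions.
inline bool cpu_has_sse(std::uint32_t cpuid1_edx) {
    return (cpuid1_edx & kCpuidEdxFxsr) != 0 && (cpuid1_edx & kCpuidEdxSse) != 0;
}

// True if the new instructions can actually be executed: the CPU has them
// and the OS has enabled save/restore of the new state.
inline bool sse_usable(std::uint32_t cpuid1_edx, std::uint32_t cr4) {
    return cpu_has_sse(cpuid1_edx) && (cr4 & kCr4OsFxsr) != 0;
}

// True if unmasked SIMD floating point exceptions will reach an OS handler.
inline bool unmasked_exceptions_supported(std::uint32_t cr4) {
    return (cr4 & kCr4OsXmmExcpt) != 0;
}
```

Keeping the decoding separate from the register access makes the logic testable on any machine, including one without the extensions.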
Video Benefits
Encoding:
Real-time MPEG-1 encode at 352x240x30 feasible
Real-time authoring/archiving to HDD, CD-R, DVD-R or WWW
Real-time MPEG-2 encode at 720x480x30 feasible
Tradeoff compression ratio for speed
Near real-time MPEG-2 encode for very high quality content production
Decoding:
Decode of MPEG-2 video at DVD quality is here!
Use it!! Intel streaming SIMD extensions will bring further performance improvements
35
Pentium® III Video Instructions
Two new SIMD integer instructions specifically designed to increase digital video compression speed
pavg: Packed Average
psad: Packed Sum of Absolute Differences
PAVG instruction useful in both digital video decompression and compression algorithms
Primarily used for motion compensation during decode
I.e. averages two half-pels into one pel
PSAD instruction useful in digital video compression / encode algorithms
Primarily used for motion estimation during encode
36
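What the two video instructions on slide 36 compute can be written out as scalar reference code. This is a plain C++ model of the byte forms operating on eight unsigned bytes (one 64-bit MMX™ register's worth), not the instructions themselves; the function names are hypothetical. The corresponding intrinsics in `<xmmintrin.h>` are `_mm_avg_pu8` and `_mm_sad_pu8`.

```cpp
#include <cstdint>

// Packed average on unsigned bytes: each result is (a + b + 1) >> 1,
// i.e. the average rounded up, computed without overflow.
inline void packed_average_u8(const std::uint8_t a[8], const std::uint8_t b[8],
                              std::uint8_t out[8]) {
    for (int i = 0; i < 8; ++i) {
        out[i] = static_cast<std::uint8_t>((a[i] + b[i] + 1) >> 1);
    }
}

// Packed sum of absolute differences: |a[i] - b[i]| summed over all eight
// byte pairs, producing a single small integer.
inline unsigned sum_abs_diff_u8(const std::uint8_t a[8], const std::uint8_t b[8]) {
    unsigned sum = 0;
    for (int i = 0; i < 8; ++i) {
        sum += (a[i] > b[i]) ? static_cast<unsigned>(a[i] - b[i])
                             : static_cast<unsigned>(b[i] - a[i]);
    }
    return sum;
}
```

In motion estimation the sum of absolute differences is evaluated for many candidate block positions, which is why collapsing eight subtract, absolute-value, and add steps into one instruction pays off.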