Transcript VU-Assembly

Vector Unit Assembly
Overview
Architecture Review
VU0 Macro Mode Instruction Set
Building a Vector Library
Review
PlayStation 2 has two vector units that are similar
but not identical
VU0 is the CPU’s alternate processing unit
VU1 is the GS’s alternate processing unit
Each unit has a direct pipeline to its respective
processor
Vector units are designed for 4D x 32-bit vectors
Review
VU0/1 each have access to 32 float registers
and 16 integer registers
Float registers are not like PC registers; they are
128 bits in size (PC is 32-bit)
128 bits can fit 4 float values at once (4D vector)
Integer registers are typically used as loop
counters and address calculators
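A minimal sketch of how one 128-bit float register maps onto a C type. The slides use a Vec4T type later on but never show its definition, so the layout below is an assumption: four 32-bit floats, aligned to 16 bytes so that the whole vector can be moved with a single quadword load or store. The helper name vec4_is_aligned is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed definition of Vec4T (not shown in the slides):
   four 32-bit floats packed into 128 bits. The GCC-style
   attribute requests 16-byte alignment. */
typedef struct Vec4T {
    float x, y, z, w;
} __attribute__((aligned(16))) Vec4T;

/* Compile-time check that the struct really is 128 bits wide. */
static_assert(sizeof(Vec4T) == 16, "Vec4T must be exactly 128 bits");

/* Hypothetical helper: returns 1 when p sits on a 16-byte boundary. */
static int vec4_is_aligned(const Vec4T *p)
{
    return ((uintptr_t)p % 16u) == 0u;
}
```

The alignment attribute is a GCC extension; other compilers spell it differently.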
Review
VU0 has two bus lines
One bus is dedicated to
the CPU
The other bus is used to
communicate with all
other devices
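VU0 has 4KB of instruction memory (I$) and 4KB of data memory (D$)
[Diagram: VU0 (I$ 4KB, D$ 4KB) linked to the CPU core by a dedicated bus and to system RAM by a shared bus]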
Vector Unit Processing Speed
The graph shows some
vector-math intensive
function calls
200K calls were made
to each function
[Chart: time in ms (0 to 70) for VU0 vs. EE on the Add, Scale, and Cross functions]
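A measurement like the one described (200K calls per function) could be taken with a loop along these lines. This is only a sketch: the slides do not say how the timings were collected, so the use of the standard clock() function, the name TimeCallsMs, and the Vec4T layout are all assumptions.

```c
#include <assert.h>
#include <time.h>

/* Assumed layout; the slides do not define Vec4T. */
typedef struct Vec4T { float x, y, z, w; } Vec4T;

/* Signature shared by the vector functions in these slides. */
typedef void (*Vec4BinaryFn)(Vec4T *v0, Vec4T *v1, Vec4T *v2);

/* Scalar add on the main CPU, as in the EEVectorAdd slide. */
static void EEVectorAdd(Vec4T *v0, Vec4T *v1, Vec4T *v2)
{
    v2->x = v0->x + v1->x;
    v2->y = v0->y + v1->y;
    v2->z = v0->z + v1->z;
    v2->w = v0->w + v1->w;
}

/* Hypothetical helper: calls fn `calls` times and returns the
   elapsed CPU time in milliseconds. */
static double TimeCallsMs(Vec4BinaryFn fn, long calls)
{
    Vec4T a = {1.0f, 2.0f, 3.0f, 4.0f};
    Vec4T b = {4.0f, 3.0f, 2.0f, 1.0f};
    Vec4T out = {0.0f, 0.0f, 0.0f, 0.0f};
    volatile float sink = 0.0f;  /* keeps the result observable */
    clock_t start, end;
    long i;

    assert(fn != 0 && calls > 0);
    start = clock();
    for (i = 0; i < calls; ++i) {
        fn(&a, &b, &out);
        sink = out.x;
    }
    end = clock();
    return 1000.0 * (double)(end - start) / (double)CLOCKS_PER_SEC;
}
```

Calling TimeCallsMs(EEVectorAdd, 200000L) and then the same with a VU0 version gives the two numbers to compare. The resolution of clock() is coarse, so short runs may report 0 ms.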
Macro and Micro Modes
Vector Unit Zero (VU0) has two modes
Micro mode is a mode that allows your vector
processor to act as an independent CPU
A mini program is uploaded and executed in parallel
to the main CPU
Macro mode allows your CPU to directly offload
heavy vector computation with low overhead
Most popular method, hands down.
Micro Mode
When uploaded, the micro program is executed
independently of the CPU
This means that we must time our execution so that
the result is fetched by the CPU after the program is
completed by the Vector Unit
Micro mode can cause serious stalls and timing issues
since execution time is nearly impossible to
determine in advance
Macro Mode
Macro mode is a much easier method of
executing fast math functionality
Assembly can be used as inline instructions,
telling the compiler to offload the math to VU0
Notes
Just because it’s in assembly does not mean it will
be faster
Switching CPU focus has its overhead
Assembly Structure
There is typically a specific method to writing assembly
routines
Load the variable data/addresses to registers
Apply vector computations to those registers
Store the result back into a variable address
Overhead of using assembly is in the load and store
Make sure that the computation stage will improve
performance enough to offset the load/store overhead
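The three stages above can be sketched in portable C, with local variables standing in for the vector registers. The function VectorScaleAdd and the Vec4T layout are hypothetical, chosen only to show the pattern: two operations (a scale and an add) share one set of loads and one store, which is how the fixed load/store cost gets spread over more computation.

```c
#include <assert.h>

/* Assumed layout; the slides do not define Vec4T. */
typedef struct Vec4T { float x, y, z, w; } Vec4T;

/* Hypothetical routine: out = a * s + b, written as one
   load / compute / store pass. */
static void VectorScaleAdd(const Vec4T *a, float s, const Vec4T *b, Vec4T *out)
{
    Vec4T r0, r1;

    assert(a != 0 && b != 0 && out != 0);

    /* 1. Load: copy the operands into locals (register stand-ins). */
    r0 = *a;
    r1 = *b;

    /* 2. Compute: both operations reuse the values already loaded. */
    r0.x = r0.x * s + r1.x;
    r0.y = r0.y * s + r1.y;
    r0.z = r0.z * s + r1.z;
    r0.w = r0.w * s + r1.w;

    /* 3. Store: write the result back once. */
    *out = r0;
}
```

Had the scale and the add been separate routines, the intermediate vector would have been stored and loaded again between them.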
Vector Unit MIPS Instructions
Coprocessor Transfer Instructions
Store / Load
Coprocessor Branch Instructions
Macro (primitive) calculation instructions
Add / Subtract / Multiply / Divide / etc.
Micro subroutine execution instructions
(VU Macro Instructions)
EEVectorAdd
Adding two vectors using the EE Core (CPU)
// (Vec4T *v0, Vec4T *v1, Vec4T *v2)
{
    v2->x = v0->x + v1->x;
    v2->y = v0->y + v1->y;
    v2->z = v0->z + v1->z;
    v2->w = v0->w + v1->w;
}
VectorAdd
Adding two vectors using the VU0
// (Vec4T *v0, Vec4T *v1, Vec4T *v2)
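{
    asm __volatile__ (
        "lqc2      vf05, 0x0(%0)    \n"  // vf05 = *v0
        "lqc2      vf06, 0x0(%1)    \n"  // vf06 = *v1
        "vadd.xyzw vf07, vf05, vf06 \n"  // vf07 = vf05 + vf06
        "sqc2      vf07, 0x0(%2)    \n"  // *v2 = vf07
        : // No Output
        : "r" (v0), "r" (v1), "r" (v2)
        : "memory"                       // the asm writes through v2
    );
}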
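Before trusting a VU0 routine, it helps to compare its output with the plain EE version on the same inputs. The sketch below is one possible way to do that; the names Vec4NearlyEqual and CheckAgainstEE, the sample values, and the tolerance are all assumptions. A tolerance is used rather than exact equality because VU floating point is generally understood not to follow IEEE 754 in every detail.

```c
#include <assert.h>

/* Assumed layout; the slides do not define Vec4T. */
typedef struct Vec4T { float x, y, z, w; } Vec4T;

typedef void (*Vec4BinaryFn)(Vec4T *v0, Vec4T *v1, Vec4T *v2);

/* Reference implementation on the main CPU. */
static void EEVectorAdd(Vec4T *v0, Vec4T *v1, Vec4T *v2)
{
    v2->x = v0->x + v1->x;
    v2->y = v0->y + v1->y;
    v2->z = v0->z + v1->z;
    v2->w = v0->w + v1->w;
}

static float AbsF(float f) { return f < 0.0f ? -f : f; }

/* Hypothetical helper: 1 if every field differs by at most eps. */
static int Vec4NearlyEqual(const Vec4T *a, const Vec4T *b, float eps)
{
    return AbsF(a->x - b->x) <= eps && AbsF(a->y - b->y) <= eps &&
           AbsF(a->z - b->z) <= eps && AbsF(a->w - b->w) <= eps;
}

/* Hypothetical helper: 1 if candidate matches EEVectorAdd on a sample. */
static int CheckAgainstEE(Vec4BinaryFn candidate)
{
    Vec4T a = {1.5f, -2.0f, 3.25f, 1.0f};
    Vec4T b = {0.5f, 4.0f, -1.25f, 0.0f};
    Vec4T expected, actual;

    assert(candidate != 0);
    EEVectorAdd(&a, &b, &expected);
    candidate(&a, &b, &actual);
    return Vec4NearlyEqual(&expected, &actual, 1e-6f);
}
```

On hardware, CheckAgainstEE(VectorAdd) would exercise the inline-assembly version; on any other machine the check can be tried with CheckAgainstEE(EEVectorAdd).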
EECrossProduct
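Notice how we must use a temp: cross may point at v1 or v2, and
each component reads fields that the other assignments write
// (Vec4T *v1, Vec4T *v2, Vec4T *cross)
{
    Vec4T temp;
    temp.x = v1->y * v2->z - v1->z * v2->y;
    temp.y = v1->z * v2->x - v1->x * v2->z;
    temp.z = v1->x * v2->y - v1->y * v2->x;
    temp.w = 0.0f;  // not part of the cross product; clear it before copying
    VectorCopy(&temp, cross);
}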
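The reason for the temp shows up when the output pointer is the same as one of the inputs. The sketch below compares a version that writes straight into the output with one that goes through a temp; the names CrossNoTemp and CrossWithTemp and the Vec4T layout are assumptions made for illustration.

```c
#include <assert.h>

/* Assumed layout; the slides do not define Vec4T. */
typedef struct Vec4T { float x, y, z, w; } Vec4T;

/* Writes each field directly. If cross points at v1 or v2, the
   first assignment overwrites a value the later ones still need. */
static void CrossNoTemp(const Vec4T *v1, const Vec4T *v2, Vec4T *cross)
{
    assert(v1 != 0 && v2 != 0 && cross != 0);
    cross->x = v1->y * v2->z - v1->z * v2->y;
    cross->y = v1->z * v2->x - v1->x * v2->z;
    cross->z = v1->x * v2->y - v1->y * v2->x;
}

/* Computes into a temp first, so the inputs stay intact until
   all three fields are known. */
static void CrossWithTemp(const Vec4T *v1, const Vec4T *v2, Vec4T *cross)
{
    Vec4T temp;

    assert(v1 != 0 && v2 != 0 && cross != 0);
    temp.x = v1->y * v2->z - v1->z * v2->y;
    temp.y = v1->z * v2->x - v1->x * v2->z;
    temp.z = v1->x * v2->y - v1->y * v2->x;
    temp.w = 0.0f;
    *cross = temp;
}
```

With v1 = (1, 0, 0) and v2 = (0, 1, 0), the correct cross product is (0, 0, 1). Calling CrossNoTemp with cross equal to v1 clears v1.x in the first assignment, so the z field comes out as 0 instead of 1; CrossWithTemp gives the right answer in the same situation.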
CrossProduct
// (Vec4T *v1, Vec4T *v2, Vec4T *cross)
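{
    asm __volatile__ (
        "lqc2        vf05, 0x0(%0)    \n"  // vf05 = *v1
        "lqc2        vf06, 0x0(%1)    \n"  // vf06 = *v2
        "vopmula.xyz ACC, vf05, vf06  \n"  // first
        "vopmsub.xyz vf06, vf06, vf05 \n"  // - second
        "vsub.w      vf06, vf00, vf00 \n"  // w = 0
        "sqc2        vf06, 0x0(%2)    \n"  // *cross = vf06
        : // No Output
        : "r" (v1), "r" (v2), "r" (cross)
        : "memory"                         // the asm writes through cross
    );
}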
Vector Outer Product
The vopmula instruction
performs an outer
product
The result is stored into
the special purpose ACC
register
[Diagram: the X, Y, Z fields of VF05 and VF06 combining into the X, Y, Z fields of ACC]
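The instruction pair used in CrossProduct can be emulated in plain C to see which fields get multiplied. This is a sketch based on the usual description of these instructions: vopmula fills ACC with the first product of each cross product term, and vopmsub subtracts the second product from ACC. The Emu* function names and the Vec4T layout are hypothetical.

```c
#include <assert.h>

/* Assumed layout; the slides do not define Vec4T. */
typedef struct Vec4T { float x, y, z, w; } Vec4T;

/* Emulates vopmula.xyz ACC, fs, ft. */
static void EmuVopmula(Vec4T *acc, const Vec4T *fs, const Vec4T *ft)
{
    acc->x = fs->y * ft->z;
    acc->y = fs->z * ft->x;
    acc->z = fs->x * ft->y;
}

/* Emulates vopmsub.xyz fd, fs, ft. The result is built in a local
   first because fd may be the same register as fs or ft. */
static void EmuVopmsub(Vec4T *fd, const Vec4T *acc,
                       const Vec4T *fs, const Vec4T *ft)
{
    Vec4T r = *fd;  /* the w field is left untouched */
    r.x = acc->x - fs->y * ft->z;
    r.y = acc->y - fs->z * ft->x;
    r.z = acc->z - fs->x * ft->y;
    *fd = r;
}

/* Same register usage as the CrossProduct slide. */
static void EmuCrossProduct(const Vec4T *v1, const Vec4T *v2, Vec4T *cross)
{
    Vec4T vf05, vf06;
    Vec4T acc = {0.0f, 0.0f, 0.0f, 0.0f};

    assert(v1 != 0 && v2 != 0 && cross != 0);
    vf05 = *v1;
    vf06 = *v2;
    EmuVopmula(&acc, &vf05, &vf06);         /* first  */
    EmuVopmsub(&vf06, &acc, &vf06, &vf05);  /* - second */
    vf06.w = 0.0f;                          /* w = 0 */
    *cross = vf06;
}
```

Note the swapped operand order in the second call: that swap is what turns the second set of products into v1.z * v2.y, v1.x * v2.z, and v1.y * v2.x.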
For Next Time
Read Chapters 7.3.2 – 7.4.2
Read Chapter 9.3