Disclaimer: I do not work for Sony, despite the disturbing percentage of my shirts, jackets, and bookbags that are PlayStation dev-related. I do, however, have many friends who work at Sony, some of whom I hope will call off the corporate lawyers. JayStation is in no way associated with Sony or PlayStation, and any stupid things I say represent only my own ineptitude and silliness.
The road to full GL pipeline support is fraught with peril. Quite a few things are involved, such as using a completely different pipeline, using VPM and VCD, and writing vert shaders that read attributes and write varyings and vertices in the right format. Luckily, before jumping to full-on vert shaders, there is an intermediate step that allows us to try out some of the useful things like uniforms and varyings, all without leaving our familiar NV pipeline mode. If you haven’t read the previous post on how to initialize the GPU and set up basic binning and rendering command buffers, now is the time. You’re gonna need it.
Before getting into how to set up uniforms and varyings, we have to go over the basics of writing shaders. This will mainly cover the QPU ISA and instruction encoding, the register file, and some basic rules and limitations. There are two register files: A and B. The first 32 registers in each regfile [r0a .. r31a] and [r0b .. r31b] are the physically backed registers, and locations above that (from r32a/b to r63a/b) are for register-space IO. There are also four general purpose accumulators and two special purpose accumulators, whose magic power is that unlike the physically backed registers, their value can be used immediately after being written.
Each register file is single ported, so instructions can’t read or write two different registers in the same file. The register map looks like this:
The address on the left is the register number, from 0 to 63, and the other columns tell you what each register means when reading and writing A and B. Sometimes A and B have the same meaning, for example reading from register r35a and r35b will both read a varying from a FIFO. Sometimes they differ, like r41a and r41b which give you the X and Y pixel coordinate respectively.
Instructions are all fixed size 64-bit, and fall into the following classes: ALU, ALU with small immediate, branch, 32-bit immediate loads, and semaphores. All of the shaders covered in this post use ALU and ALU with small immediate, so let’s focus on those two. The ALU instruction encoding looks like this
That is a lot of fields, so let’s take it one by one. The ALU has two pipelines. The add pipeline handles add-like operations, integer shift, and bitwise operations, while the mul pipeline does multiplication-like things and operations on individual bytes in a word. Each ALU instruction can encode up to one add pipe operation via the op_add field, and one mul pipe operation via the op_mul field, to be executed together.
Where an instruction pulls its inputs from is determined by a multiplexer, which can select from register file A, register file B, or any of the six accumulators. The fields specifying the two input muxes for the add pipe are confusingly called add_a and add_b. Here the ‘a’ and ‘b’ suffixes refer to whether it’s the first or second input operand and have nothing to do with the A and B register files. Similarly, the input muxes for the mul pipe are given with mul_a and mul_b.
So for example if you wanted to add together a value in accumulator 5 with some register in the A file, and multiply together some register in the A file with accumulator 4, you might do something like this
op_add = ADD_PIPE_INST_FADD ; add pipe instruction is a floating point add
add_a = INPUT_MUX_ACC5 ; input 1 operand is ACC5
add_b = INPUT_MUX_REGFILE_A ; input 2 operand is *some* register in the A file
op_mul = MUL_PIPE_INST_FMUL ; mul pipe instruction is a floating point mul
mul_a = INPUT_MUX_REGFILE_A ; input 1 operand is *some* register in the A file
mul_b = INPUT_MUX_ACC4 ; input 2 operand is ACC4
Because each register file is single ported, an instruction can read one register from the A file and one from the B file, but not two different registers in the same file. This means that every input mux field set to INPUT_MUX_REGFILE_A must necessarily refer to the same register, as will every mux field set to INPUT_MUX_REGFILE_B. The specific register number is given with raddr_a (read address regfile A) and raddr_b (read address regfile B). In the above example we could set raddr_a = 17, and then any mux using INPUT_MUX_REGFILE_A would mean register r17a.
There is a variation on ALU instructions that only allows you to get inputs from accumulators and register file A, but in exchange lets you reuse the six bits in raddr_b to encode some commonly used literals. This is the “small immediate” form, and some allowed literals are [-16..15], [1.0, 2.0, 4.0, 8.0, … , 128.0 ], and [ 1.0/256.0, 1.0/128.0, 1.0/64.0, …, 1.0/2.0 ]. To use one of these literals as an input operand, the input mux should be set to INPUT_MUX_REGFILE_B.
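To make the small immediate ranges concrete, here is a rough Python sketch that maps a 6-bit code to the literal it stands for. The exact code-to-value layout is an assumption taken from my reading of the VideoCore IV reference guide rather than anything spelled out above (the only value this post confirms is that code 32 means 1.0f), and decode_small_immediate is a made-up helper name.

```python
def decode_small_immediate(code: int):
    """Map a 6-bit small-immediate code to its literal value.

    Assumed layout (from the VideoCore IV reference guide, not this post):
       0..15 -> integers 0..15
      16..31 -> integers -16..-1
      32..39 -> floats 1.0, 2.0, 4.0, ..., 128.0
      40..47 -> floats 1/256, 1/128, ..., 1/2
    """
    if not 0 <= code <= 47:
        # Higher codes are not plain literals, so this sketch rejects them.
        raise ValueError("not a plain literal code: %r" % code)
    if code < 16:
        return code                    # non-negative integers
    if code < 32:
        return code - 32               # negative integers
    if code < 40:
        return float(2 ** (code - 32))  # powers of two >= 1.0
    return 2.0 ** (code - 48)          # powers of two < 1.0


if __name__ == "__main__":
    for code in (0, 15, 16, 31, 32, 39, 40, 47):
        print(code, decode_small_immediate(code))
```

The code goes in the raddr_b field of the instruction, which is why the B-file mux value doubles as "use the literal".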
Outputs are a bit simpler. When the write select (ws) bit is cleared, the add pipe writes to a register in regfile A and the mul pipe to B. Setting ws to 1 reverses that so add writes to B and mul to A. The specific destination register numbers are specified with waddr_add and waddr_mul. Let’s say waddr_add = 7 and waddr_mul = 13. With the ws bit cleared, the add pipe will write to r7a and the mul pipe to r13b. Setting ws would make the add pipe write to r7b and the mul pipe to r13a. If an instruction writes out a value that is needed by the next instruction, then you need to use the accumulators, as this isn’t allowed with the physically backed registers.
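If the ws swap is hard to keep straight, a tiny Python model may help. This is only a sketch of the rule described above for the physically backed registers, and write_destinations is a hypothetical helper, not part of any real toolchain.

```python
def write_destinations(ws: int, waddr_add: int, waddr_mul: int):
    """Return (add_dest, mul_dest) register names for a given ws bit.

    ws == 0: add pipe writes regfile A, mul pipe writes regfile B.
    ws == 1: the two are swapped.
    """
    add_file, mul_file = ("b", "a") if ws else ("a", "b")
    return "r%d%s" % (waddr_add, add_file), "r%d%s" % (waddr_mul, mul_file)


if __name__ == "__main__":
    print(write_destinations(0, 7, 13))  # the ws-cleared case from the text
    print(write_destinations(1, 7, 13))  # the ws-set case from the text
```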
The hardware has two unpack/pack modes, controlled by the pm bit. When pm is 0 the add ALU can unpack various 8 and 16-bit inputs to 32-bit types, as well as pack 32-bit outputs into 8 and 16-bit types. By setting the pm bit to 1, the mul ALU can saturate and pack 32-bit normalized floats to 8-bit, and insert the result in the R, G, B, A, or all channels of the destination word. However, when the pm bit is set unpack only does some limited conversions that are useful for things like textures, and only works with accumulator 4. Regardless of the value of pm, packing and unpacking is only supported for register file A, registers 0 through 15.
Finally, here are some miscellaneous bits you might find useful. Both add and mul pipe instructions can be conditionally executed based on the Z, N, and C flags, allowing some branchless coding. Whether or not an instruction updates the flags is controlled by the set flags (sf) bit. Usually it’s the result of the add ALU that sets the flags, except if the add operation is a NOP or its condition code is set to never. In that case flags are updated based on the mul ALU result. There is also a 4-bit signal field used to signal certain conditions to the GPU. Some of the ones we’ll need for this post are software breakpoint, program end, small immediate, and scoreboard unlock.
That’s a lot to take in, so let’s look at some examples. Uniforms are 32-bit words that can be read from a shader, and whose values are uniform across all invocations of the shader (as opposed to varyings, which can vary across the triangle). All uniforms for a shader should be packed contiguously in a list, and the list address is then set in an NV Shader State record field.
.align 4 ; 128-Bit Align
NV_SHADER_STATE_RECORD:
  .byte 0 ; Flag Bits: 0 = Fragment Shader Single Threaded
  .byte 3 * 4 ; Shaded Vertex Data Stride
  .byte 0 ; Fragment Shader Num Uniforms (unused)
  .byte 0 ; Fragment Shader Num Varyings
  .word FRAGMENT_SHADER_CODE ; Fragment Shader Code Address
  .word UNIFORM_DATA ; Frag Shader Uniforms Addr, now non-zero
  .word VERTEX_DATA ; Shaded Vertex Data Address

.align 4 ; RGBA as 4 floats
UNIFORM_DATA:
  .single 0.0 ; red
  .single 1.0 ; green
  .single 0.0 ; blue
  .single 1.0 ; alpha
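If you build your command buffers from a host-side script instead of an assembler, the same two blobs can be sketched with Python's struct module. This assumes a little-endian target, and the addresses passed in below are placeholders I made up for illustration; the field order simply mirrors the listing above.

```python
import struct


def nv_shader_state_record(stride, num_varyings, code_addr,
                           uniforms_addr, vertex_addr, flags=0):
    """Pack the 16-byte NV Shader State record from the listing above.

    Layout: four single-byte fields followed by three 32-bit addresses.
    The third byte (number of uniforms) is unused, so it is left at zero.
    """
    return struct.pack("<BBBBIII", flags, stride, 0, num_varyings,
                       code_addr, uniforms_addr, vertex_addr)


def uniform_list(*values):
    """Pack uniforms contiguously as 32-bit floats."""
    return struct.pack("<%df" % len(values), *values)


if __name__ == "__main__":
    # Placeholder addresses, purely for illustration.
    record = nv_shader_state_record(stride=3 * 4, num_varyings=0,
                                    code_addr=0x1000, uniforms_addr=0x2000,
                                    vertex_addr=0x3000)
    uniforms = uniform_list(0.0, 1.0, 0.0, 1.0)  # opaque green
    print(record.hex(), uniforms.hex())
```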
This example is a bit contrived, as it could be done more efficiently with one uint uniform. However, doing it this way allows us to demonstrate mul ALU packing (pm=1).
To read uniforms from a shader, just specify the IO-space register UNIFORM_READ (r32a or r32b) as an input operand. Each successive read from UNIFORM_READ will get you the next uniform from the FIFO. If you want to reset the stream and re-read the uniforms, the uniform base pointer can be written from SIMD element 0. Let’s look at an instruction to read a uniform and pack to a color channel.
; read in a uniform and pack to R
; add op: No operation, add cond: never
; mul pipe: fmul, dest: ACC5, src1: UNIFORM_READ, src2: float 1.f, cond: always
; pack mode: 32->8a Convert mul float result to 8-bit color in range [0, 1.0] (PM1)
.word 0x20820DF7, 0xD14059E5
The instruction is fmul, the first input operand is the IO-space UNIFORM_READ register, and the second is the small immediate 1.0f. The pack mode saturates the mul ALU result, converts to a byte, and inserts the byte into the R channel. Let’s look at the bits
m_mul_b 7 (mul input 2 uses value from register file B, or small immediate)
m_mul_a 6 (mul input 1 uses value from register file A)
m_add_b 7 (add input 2 uses value from register file B, or small immediate)
m_add_a 6 (add input 1 uses value from register file A)
m_small_imm 32 (1.0f)
m_raddr_a 32 (regfile A uses register 32: uniform read)
m_op_add 0 (add ALU is NOP)
m_op_mul 1 (mul ALU is fmul)
m_waddr_mul 37 (mul ALU writes to accumulator 5)
m_waddr_add 39 (add ALU writes to NOP, no write)
m_ws 1 (write swap: add ALU writes to regfile B, mul ALU writes to regfile A)
m_sf 0 (don’t set flags)
m_cond_mul 1 (cond always)
m_cond_add 0 (cond never)
m_pack 4 (normalized float to u8, write only R channel)
m_pm 1 (use mul ALU packing)
m_unpack 0 (no unpacking of input operands)
m_1101 13 (small immediate signal)
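As a sanity check, here is a small Python sketch that packs those field values back into the two 32-bit words. The bit positions in ALU_FIELDS are an assumption based on my reading of the VideoCore IV reference guide (the encoding diagram is where they come from, they are not spelled out in the text), and encode_alu is a hypothetical helper. The small immediate code goes in the raddr_b slot, and the m_1101 value goes in the signal field.

```python
# (name, least significant bit, width). Assumed layout, see the note above.
ALU_FIELDS = [
    ("mul_b", 0, 3), ("mul_a", 3, 3), ("add_b", 6, 3), ("add_a", 9, 3),
    ("raddr_b", 12, 6), ("raddr_a", 18, 6), ("op_add", 24, 5),
    ("op_mul", 29, 3), ("waddr_mul", 32, 6), ("waddr_add", 38, 6),
    ("ws", 44, 1), ("sf", 45, 1), ("cond_mul", 46, 3), ("cond_add", 49, 3),
    ("pack", 52, 4), ("pm", 56, 1), ("unpack", 57, 3), ("sig", 60, 4),
]


def encode_alu(**fields):
    """Pack named ALU fields into (low_word, high_word). Missing fields are 0."""
    known = {name for name, _, _ in ALU_FIELDS}
    unknown = set(fields) - known
    if unknown:
        raise ValueError("unknown fields: %s" % sorted(unknown))
    word = 0
    for name, lsb, width in ALU_FIELDS:
        value = fields.get(name, 0)
        if value < 0 or value >> width:
            raise ValueError("%s=%r does not fit in %d bits" % (name, value, width))
        word |= value << lsb
    return word & 0xFFFFFFFF, word >> 32


if __name__ == "__main__":
    low, high = encode_alu(
        mul_b=7, mul_a=6, add_b=7, add_a=6,
        raddr_b=32,   # small immediate code for 1.0f
        raddr_a=32,   # UNIFORM_READ
        op_add=0, op_mul=1, waddr_mul=37, waddr_add=39,
        ws=1, sf=0, cond_mul=1, cond_add=0,
        pack=4, pm=1, unpack=0,
        sig=13,       # small immediate signal
    )
    print(".word 0x%08X, 0x%08X" % (low, high))
```

With the values from the listing this reproduces the .word pair from the instruction above, which is at least some evidence that the assumed field positions are right.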
Once R, G, B, and A are converted and packed in this way, the resulting color is written to TLB_COLOUR_ALL (r46a or r46b) to export the fragment color.
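For reference, this is roughly the packed word the four instructions should end up with, modeled on the CPU in Python. Two things here are my assumptions and not something the hardware docs in this post confirm: that R lives in the lowest byte of the word, and that the conversion rounds to nearest. pack_unorm8 and pack_rgba are made-up names.

```python
def pack_unorm8(value: float) -> int:
    """Saturate a float to [0, 1] and convert it to an 8-bit channel value.

    Rounding to nearest is an assumption; the hardware may round differently.
    """
    clamped = min(max(value, 0.0), 1.0)
    return int(clamped * 255.0 + 0.5)


def pack_rgba(r: float, g: float, b: float, a: float) -> int:
    """Insert four channels into one 32-bit word, R in the lowest byte (assumed)."""
    return (pack_unorm8(r)
            | pack_unorm8(g) << 8
            | pack_unorm8(b) << 16
            | pack_unorm8(a) << 24)


if __name__ == "__main__":
    # The uniform values from the example: opaque green.
    print("0x%08X" % pack_rgba(0.0, 1.0, 0.0, 1.0))
```

Under those assumptions, this is also the single uint you would pass if you went the one-uniform route instead of four floats.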
Varyings are slightly more involved. Like uniforms they are stored in lists of 32-bit words, but unlike uniforms they are specified per-vertex and get barycentrically interpolated across the triangle. This interpolation requires a bit of manual work in the shader.
Varyings are interpolated with an equation of the form (A*(x-x0)+B*(y-y0))*W+C, but the GPU is generous enough to calculate V=(A*(x-x0)+B*(y-y0)) in the hardware for us. Specifically the PSE sets up the varying coefficients and sends them to coefficients memory for each QPU slice’s VRI interpolator, but that’s just some fun trivia and not something you necessarily need to worry about right now. The frag shader is responsible for reading the partially interpolated value V from the VARYING_READ register (r35b), multiplying by W, and then adding C to get the final value. W is initialized for a particular pixel shader instance a few cycles before the shader instance launches, and arrives ready to read in r15a. Reading from VARYING_READ also automatically triggers a write of C to accumulator 5, which is available to read by the next instruction. Therefore the flow for a single varying will be
ACC0 = VARYING_READ * r15a // ACC0 = V * W, trigger write of C to ACC5
ACC0 = ACC0 + ACC5 // V * W + C
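The side effect on ACC5 is the part that trips people up, so here is a toy Python model of the two-step flow. It is only a sketch: VaryingFifo and interpolate are hypothetical names, and the (V, C) pairs stand in for whatever the hardware has queued up for the current pixel.

```python
from collections import deque


class VaryingFifo:
    """Toy model of VARYING_READ: each read pops V and latches C into ACC5."""

    def __init__(self, pairs):
        self._pairs = deque(pairs)  # (V, C) per varying, in read order
        self.acc5 = None            # written as a side effect of read()

    def read(self):
        v, c = self._pairs.popleft()
        self.acc5 = c
        return v


def interpolate(fifo: VaryingFifo, w: float) -> float:
    acc0 = fifo.read() * w      # ACC0 = V * W, C lands in ACC5
    return acc0 + fifo.acc5     # ACC0 = V * W + C


if __name__ == "__main__":
    fifo = VaryingFifo([(0.5, 0.25), (2.0, 1.0)])
    print(interpolate(fifo, 2.0))
    print(interpolate(fifo, 2.0))
```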
Remember we can execute a mul and add ALU instruction together, so in the case of multiple varyings we can do the previous varying’s add and the next varying’s multiply together. Let’s take a look at how this is done
// ACC0 = R * W + C, ACC1 = G * W (r15a)
// add pipe: fadd, dest: ACC0, src1: acc r0, src2: acc r5, cond: always
// mul pipe: fmul, dest: ACC1, src1: r15a, src2: VARYING_READ, cond: always
// signal: no signal
m_mul_b 7 (mul input 2 uses value from register file B)
m_mul_a 6 (mul input 1 uses value from register file A)
m_add_b 5 (add input 2 uses accumulator ACC5)
m_add_a 0 (add input 1 uses accumulator ACC0)
m_raddr_b 35 (r35b is VARYING_READ)
m_raddr_a 15 (r15a is pre-setup with the W for varying interp)
m_op_add 1 (add ALU does fadd)
m_op_mul 1 (mul ALU does fmul)
m_waddr_mul 33 (mul ALU writes to r33b, is ACC1)
m_waddr_add 32 (add ALU writes to r32a, is ACC0)
m_ws 0 (ws off, add ALU writes regfile A, mul ALU writes regfile B)
m_sf 0 (don’t set result flags)
m_cond_mul 1 (cond always)
m_cond_add 1 (cond always)
m_pack 0 (no packing)
m_pm 1 (we aren’t packing or unpacking, so who cares what this bit is?)
m_unpack 0 (no unpacking)
m_signal 1 (no signal)
Of course this gives us an interpolated float. If you’re working with colors, you’d still need to convert and pack in the same way as the uniforms example. Also like uniforms, each read of the VARYING_READ register advances an internal pointer, so subsequent reads fetch the next varying in the FIFO.
The CPU-side setup for varyings isn’t much more complicated than uniforms. The varyings themselves are stored after each vertex’s data. In the NV Shader State Record, you must now specify the number of varyings. The stride field must also be changed to the total size of a vertex plus the per-vertex varyings.
.align 4 ; 128-Bit Align
NV_SHADER_STATE_RECORD:
  .byte 0 ; Flag Bits: 0 = Single Threaded
  .byte 24 ; Shaded Vert Stride. 2 * hword + 5 * word
  .byte 0 ; Fragment Shader Num Uniforms (unused)
  .byte 3 ; Fragment Shader Num Varyings
  .word FRAGMENT_SHADER_CODE ; Fragment Shader Code Address
  .word 0 ; Fragment Shader Uniforms Address
  .word VERTEX_DATA ; Shaded Vertex Data Address

.align 4 ; 128-Bit Align
VERTEX_DATA:
  ; Vertex: Top
  .hword 320 * 16 ; X In 12.4 Fixed Point
  .hword 32 * 16 ; Y In 12.4 Fixed Point
  .single 0e1.0 ; Z
  .single 0e1.0 ; 1 / W
  .single 1.0 ; varying R
  .single 0.0 ; varying G
  .single 0.0 ; varying B
  ; Vertex: Right
  …
  ; Vertex: Left
  …
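The stride arithmetic is easy to get wrong, so here is a hedged Python sketch that packs one vertex the same way and lets you check the size. It assumes a little-endian target and treats X and Y as signed 16-bit values, which is my assumption; pack_vertex is a hypothetical helper, and only the top vertex from the listing is filled in.

```python
import struct


def pack_vertex(x_pixels, y_pixels, z, inv_w, varyings):
    """Pack one shaded vertex: X, Y in 12.4 fixed point, Z, 1/W, then varyings."""
    x_fixed = int(round(x_pixels * 16))  # 4 fractional bits
    y_fixed = int(round(y_pixels * 16))
    fmt = "<hhff%df" % len(varyings)
    return struct.pack(fmt, x_fixed, y_fixed, z, inv_w, *varyings)


if __name__ == "__main__":
    top = pack_vertex(320, 32, 1.0, 1.0, [1.0, 0.0, 0.0])
    # The length of one packed vertex is the value for the stride field.
    print(len(top))
```

Whatever len() reports for one vertex is what belongs in the Shaded Vert Stride byte, so adding or removing a varying changes both the num varyings field and the stride.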
Finally, shaders, much like life, have some rules you need to follow if you don’t want to fail mysteriously. The full list can be found in the VideoCore IV manual, but here are a few important ones that directly affect the examples in this blog post.
- The last three instructions of any program (Thread End plus the following two delay-slot instructions) must not do varyings read, uniforms read or any kind of VPM, VDR, or VDW read or write.
- The Thread End instruction must not write to either physical regfile A or B.
- The Thread End instruction and the following two delay slot instructions must not write or read address 14 in either regfile A or B. The reason for this is kind of fun. The W value for the next pixel is set up in r14a during the current pixel’s two final delay slots. Then the LSB is flipped so it is accessed as r15a for the next pixel.
- A scoreboard wait must not occur in the first two instructions of a fragment shader. This is either the explicit Wait for Scoreboard signal or an implicit wait with the first tile-buffer read or write instruction.
- An instruction must not read from a location in physical regfile A or B that was written to by the previous instruction. To do this you must use the accumulators.
- VPM cannot be accessed in Fragment shaders, because the FIFOs in the interface hardware are shared with the varyings interpolation system.
- When a shader is started back-to-back with the preceding program, the interpolation can start as early as the Program End instruction of the previous program. For this reason, a fragment shader must finish reading varyings before issuing the Program End instruction. All of the set up varyings must be read before the shader completes.
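Rules like these are easy to break by accident when hand-encoding, so a tiny lint pass can pay for itself. Below is a minimal Python sketch that checks just one of them, the "don't read a physical register written by the previous instruction" rule. The instruction representation (dicts with "reads" and "writes" lists of names like "r7a" or "acc0") is something I invented for illustration and not a real disassembler format.

```python
def is_physical(reg: str) -> bool:
    """True for physically backed registers r0a..r31a and r0b..r31b."""
    if len(reg) < 3 or reg[0] != "r" or reg[-1] not in "ab":
        return False
    index = reg[1:-1]
    return index.isdigit() and 0 <= int(index) <= 31


def find_read_after_write_hazards(program):
    """Return (instruction index, registers) for reads of a register that
    the immediately preceding instruction wrote."""
    hazards = []
    for i in range(1, len(program)):
        written = {r for r in program[i - 1]["writes"] if is_physical(r)}
        clash = written & set(program[i]["reads"])
        if clash:
            hazards.append((i, sorted(clash)))
    return hazards


if __name__ == "__main__":
    program = [
        {"reads": ["r15a"], "writes": ["r7a"]},
        {"reads": ["r7a"], "writes": ["acc0"]},   # too soon: r7a was just written
        {"reads": ["acc0"], "writes": []},        # fine: accumulators are immediate
    ]
    print(find_read_after_write_hazards(program))
```

The other rules (no uniform or varying reads in the last three instructions, no touching address 14 near Thread End, and so on) could be added as further passes over the same list.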
With varyings and uniforms out of the way, you now have everything you need to do texturing.