Vertex and Coordinate Shaders

Disclaimer: I do not work for Sony, despite the disturbing percentage of my shirts, jackets, and bookbags that are PlayStation dev-related. I do, however, have many friends that work at Sony, some of which I hope will call off the corporate lawyers. JayStation is in no way associated with Sony or PlayStation, and any stupid things I say represent only my own ineptitude and silliness.

Once NV mode rendering is working, moving to vertex and coordinate shaders is fairly easy. If you’ve read the posts on GPU init, shaders, uniforms, and varyings, texturing, and VPM, you should have almost everything you need to get started. The main change will be replacing the NV shader state with a GL shader state, which consists of a fixed length segment describing shaders, and a variable length part to configure vertex array streams.

Vertex Array Streams

These allow us to specify up to eight different sources for vertex attributes, and describe how the data is to be laid out in VPM for the shaders to read.

Stream size, stride, VPM offset, and total attributes size

In the above image each capital letter A through F represents a vertex attribute, and subscripts indicate vertex numbers (i.e. B1 is vertex 1’s B attribute). There are three streams containing attributes [A,B], [C,D,E], and [F] respectively. As discussed in a previous post, each attribute gets packed in a horizontal row in VPM, up to sixteen vertices across, such that each entry in that row is the same attribute but for a different vertex.
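To make the row layout concrete, here is a rough host-side Python sketch (not GPU code) of the transposition described above. The function name and the tuple-based vertex representation are my own illustration, not something the hardware or the earlier posts define.

```python
def pack_rows(vertices):
    """Transpose per-vertex attribute tuples into per-attribute rows.

    vertices: up to 16 tuples, one per vertex, each holding that vertex's
    attribute words in order. Returns one row per attribute, where entry i
    of a row belongs to vertex i.
    """
    if len(vertices) > 16:
        raise ValueError("a VPM row holds at most 16 vertices")
    return [list(row) for row in zip(*vertices)]


# Three vertices with attributes A and B, named as in the diagram.
rows = pack_rows([("A0", "B0"), ("A1", "B1"), ("A2", "B2")])
print(rows)  # [['A0', 'A1', 'A2'], ['B0', 'B1', 'B2']]
```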

When setting up the GL shader state record, there are a few per-stream fields that need to be set. First is the stream size minus one. Using stream 0 as an example, and assuming all attributes are word-sized, this is sizeof(A) + sizeof(B) - 1, or 7 bytes. This is different from the Total Attributes size on the right side of the image, which shows the total VPM reserved to store all attributes from all streams (6 attributes * 4 bytes = 24 bytes).

Next is the stride, or distance between subsequent vertices. The stride for stream 0 is 12, because there are 12 bytes between A0 and A1. Often the size and stride will be the same, but allowing them to be different means we can leave gaps, skip unwanted attributes, interleave, and do some other cool tricks without modifying the actual vertex data.

After that is the VPM offset, indicated by the arrows, which defines what offset into a shader’s VPM block a stream’s data will DMA to. These offsets can be set independently for vertex and coordinate shaders, giving you a bit more flexibility to arrange things differently for different stages.

Finally, we need the base address of the stream.

In my samples, I am using two vertex streams. One is only used by the vertex shader, and so is in the PSE-expected format. The other is only used by the coordinate shader, and therefore is in the PTB-expected format.

.align 6 ; PSE format
VERTEX_DATA_FOR_VERTSHADER:
	; Vertex: Top
	.hword 320 * 16 ; X In 12.4 Fixed Point
	.hword  32 * 16 ; Y In 12.4 Fixed Point
	.single 0e1.0   ; Z
	.single 0e1.0   ; 1 / W
	
	; Vertex: Bottom Left
	.hword  32 * 16 ; X In 12.4 Fixed Point
	.hword 448 * 16 ; Y In 12.4 Fixed Point
	.single 0e1.0   ; Z
	.single 0e1.0   ; 1 / W
	
	...

.align 6 ; PTB format
VERTEX_DATA_FOR_COORDSHADER:
	; Vertex: Top
	.single 0.00156494522	; Xc
	.single -0.86638830897  ; Yc
	.single 1.0             ; Zc
	.single 1.0             ; Wc
	.hword 320 * 16         ; X In 12.4 Fixed Point
	.hword  32 * 16         ; Y In 12.4 Fixed Point
	.single 0e1.0           ; Z
	.single 0e1.0           ; 1 / Wc
	
	; Vertex: Bottom Left
	.single -0.89984350547  ; Xc
	.single 0.87056367432   ; Yc
	.single 1.0             ; Zc
	.single 1.0             ; Wc
	.hword  32 * 16         ; X In 12.4 Fixed Point
	.hword 448 * 16         ; Y In 12.4 Fixed Point
	.single 0e1.0           ; Z
	.single 0e1.0           ; 1 / W

	...

With the above two streams in mind, the GL shader state record’s first two streams should be configured as follows, with the other six streams optionally set to zero:

; vert array slot 0: verts for the vert shader
.word VERTEX_DATA_FOR_VERTSHADER ; bytes 36–39 : stream 0 Addr
.byte 11        ; byte 40 : stream 0 Number of Bytes-1
.byte 12        ; byte 41 : stream 0 Memory Stride
.byte 0         ; byte 42 : stream 0 Vert Shader VPM Offset
.byte 0         ; byte 43 : stream 0 Coord Shader VPM Offset

; vert array slot 1: verts for the coord shader
.word VERTEX_DATA_FOR_COORDSHADER ; bytes 44–47 : stream 1 Addr
.byte 27        ; byte 48 : stream 1 Number of Bytes-1
.byte 28        ; byte 49 : stream 1 Memory Stride
.byte 0         ; byte 50 : stream 1 Vert Shader VPM Offset
.byte 0         ; byte 51 : stream 1 Coord Shader VPM Offset

The VPM offset is zero because only one of the streams will be enabled per shader type, and I want both streams to start at the beginning of VPM.
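If you generate these records from a host-side tool rather than by hand, the eight bytes per stream can be packed as in the sketch below. The field order follows the listing above; the function name and the placeholder address 0x1000 are assumptions for illustration only.

```python
import struct


def pack_stream(base_addr, size_bytes, stride, vs_vpm_offset=0, cs_vpm_offset=0):
    """Pack one 8-byte vertex array stream description (little-endian).

    Layout, as in the listing: 32-bit address, size minus one, stride,
    vert shader VPM offset, coord shader VPM offset.
    """
    if not 1 <= size_bytes <= 256:
        raise ValueError("size must fit in a byte after subtracting one")
    return struct.pack("<IBBBB", base_addr, size_bytes - 1, stride,
                       vs_vpm_offset, cs_vpm_offset)


# Stream 0 from the listing: 12 bytes per vertex, tightly packed.
# 0x1000 is a placeholder address, not one from the original sample.
print(pack_stream(0x1000, 12, 12).hex())  # 001000000b0c0000
```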

Shady Behavior

Unfortunately, even though I chose my vertex shader inputs to be the same format as the expected outputs, we can’t just have a shader that does nothing. Every attribute must be read from and written to VPM exactly once, or undefined behavior will occur. And by undefined behavior, I mean your triangle will come out randomly looking something like this.

As a result of this restriction, that beautiful short NOP shader you thought you could use will have to read and write all attributes, and therefore transform into something like this:


.align 4
VERT_CODE:
	; vert shader does nothing, VPM in == VPM out
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

	; must read all attributes from VPM exactly once
	; horizontal read,  elem size 32, stride 1, num 3
	; y = 0

	; VPM read setup fields:
	; add write addr: 49 (VPMVCD_RD_SETUP A), cond: always
	; write swap: 0, set flags: 0, pm: 0
	; pack mode: 32->32 No pack (NOP) (PM0)
	; immediate type is 0x70, loaded 32 immediate is 0x1A341AC0
	.word 0x1A341AC0, 0xE0020C67

	; ready to read in 3... 2... 1...
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7
	.word 0x9E7000, 0x100009E7
	.word 0x9E7000, 0x100009E7

	; r0a = read screen xy (12.4 x2)
	; add pipe: Bitwise OR, R0, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020027

	; r1a = read Zs
	; add pipe: Bitwise OR, R1, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020067

	; r2a = read 1/W
	; add pipe: Bitwise OR, R2, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x100200A7

	; must write all attributes to VPM exactly once
	; horizontal write,  elem size 32, stride 1
	; y = 0

	; VPM write setup fields:
	; add write addr: 49 (VPMVCD_WR_SETUP B), cond: always
	; write swap: 1, set flags: 0, pm: 0
	; pack mode: 32->32 No pack (NOP) (PM0)
	; immediate type is 0x70, loaded 32 immediate is 0x17BC1AC0
	.word 0x17BC1AC0, 0xE0021C67

	; write screen xy (12.4 x2)
	; add pipe: Bitwise OR, VPM_WRITE, R0, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15027DF7, 0x10020C27

	; write Zs
	; add pipe: Bitwise OR, VPM_WRITE, R1, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15067DF7, 0x10020C27

	; write 1/W
	; add pipe: Bitwise OR, VPM_WRITE, R2, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x150A7DF7, 0x10020C27

	; scoreboard done
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: scoreboard unlock
	.word 0x9E7000, 0x500009E7

	; thread end
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: program end
	.word 0x9E7000, 0x300009E7

	; branch delay NOP 1
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

	; branch delay NOP 2
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

.align 4
COORD_CODE:
	; coord shader does nothing, VPM in == VPM out
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

	; must read all 7 attributes from VPM exactly once
	; horizontal read,  elem size 32, stride 1, num 7
	; y = 0

	; VPM read setup fields:
	; add write addr: 49 (VPMVCD_RD_SETUP A), cond: always
	; write swap: 0, set flags: 0, pm: 0
	; pack mode: 32->32 No pack (NOP) (PM0)
	; immediate type is 0x70, loaded 32 immediate is 0x1A741AC0
	.word 0x1A741AC0, 0xE0020C67

	; ready to read in 3... 2... 1...
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7
	.word 0x9E7000, 0x100009E7
	.word 0x9E7000, 0x100009E7

	; r0a = read clip X
	; add pipe: Bitwise OR, R0, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020027

	; r1a = read clip Y
	; add pipe: Bitwise OR, R1, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020067

	; r2a = read clip Z
	; add pipe: Bitwise OR, R2, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x100200A7

	; r3a = clip W
	; add pipe: Bitwise OR, R3, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x100200E7

	; r4a = read screen xy (12.4 x2)
	; add pipe: Bitwise OR, R4, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020127

	; r5a = read Zs
	; add pipe: Bitwise OR, R5, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020167

	; r6a = read 1/W
	; add pipe: Bitwise OR, R6, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x100201A7

	; must write all 7 attributes to VPM exactly once
	; horizontal write,  elem size 32, stride 1
	; y = 0

	; VPM write setup fields:
	; add write addr: 49 (VPMVCD_WR_SETUP B), cond: always
	; write swap: 1, set flags: 0, pm: 0
	; pack mode: 32->32 No pack (NOP) (PM0)
	; immediate type is 0x70, loaded 32 immediate is 0x17BC1AC0
	.word 0x17BC1AC0, 0xE0021C67

	; write clip X
	; add pipe: Bitwise OR, VPM_WRITE, R0, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15027DF7, 0x10020C27

	; write clip Y
	; add pipe: Bitwise OR, VPM_WRITE, R1, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15067DF7, 0x10020C27

	; write clip Z
	; add pipe: Bitwise OR, VPM_WRITE, R2, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x150A7DF7, 0x10020C27

	; write clip W
	; add pipe: Bitwise OR, VPM_WRITE, R3, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x150E7DF7, 0x10020C27

	; write screen xy (12.4 x2)
	; add pipe: Bitwise OR, VPM_WRITE, R4, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15127DF7, 0x10020C27

	; write Zs
	; add pipe: Bitwise OR, VPM_WRITE, R5, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15167DF7, 0x10020C27

	; write 1/W
	; add pipe: Bitwise OR, VPM_WRITE, R6, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x151A7DF7, 0x10020C27

	; scoreboard done
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: scoreboard unlock
	.word 0x9E7000, 0x500009E7

	; thread end
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: program end
	.word 0x9E7000, 0x300009E7

	; branch delay NOP 1
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

	; branch delay NOP 2
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

 

This brings us to the part of the GL shader state record that describes the actual shaders.

.align 4
GL_SHADER_STATE_RECORD:
	.hword 4        ; bytes 0–1 : flag bits, enable clipping

	; stuff describing frag shader
	.byte 0         ; byte 2 : Frag Shader Number of Uniforms
	.byte 0         ; byte 3 : Frag Shader Number of Varyings
	.word FRAG_CODE ; bytes 4–7 : Frag Shader Code Address
	.word 0         ; bytes 8–11 : Frag Shader Uniforms Address

	; stuff describing vert shader
	.hword 0                ; bytes 12–13 : Number of Uniforms
	.byte 1                 ; byte 14 : Stream select mask
	.byte 12                ; byte 15 : Total Attributes Size
	.word VERT_CODE         ; bytes 16–19 : Code Address
	.word VERT_UNIFORM_DATA ; bytes 20–23 : Uniforms Address

	; stuff describing coord shader			     
	.hword 0                 ; bytes 24–25 : Num Uniforms
	.byte 2                  ; byte 26 : Stream select mask
	.byte 28                 ; byte 27 : Total Attributes Size
	.word COORD_CODE         ; bytes 28–31 : Code Address
	.word COORD_UNIFORM_DATA ; bytes 32–35 : Uniforms Address

The shader microcode addresses, uniforms addresses, and number of uniforms and varyings should be familiar from the NV shader state record. What’s new are the fields for total attributes size and stream select. Total attributes size tells the system how much VPM to reserve for all attributes across all streams. This is not the same as the per-stream size (number of bytes minus one) we set up in the stream descriptions.

Stream select is an 8-bit mask where each bit enables or disables one of the eight vertex array streams. The vert shader mask is set to 0b00000001 because it only uses the first stream, and the coordinate shader mask is set to 0b00000010 because it only uses the second.
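As a quick sanity check on those two fields, here is a small host-side Python sketch. Both helper names are made up for this example, and I am assuming the total attributes size is simply the sum of the sizes of the streams enabled in the mask, which agrees with the 12 and 28 used in the record above.

```python
def stream_select_mask(stream_indices):
    """Build the 8-bit stream select mask from a list of stream numbers."""
    mask = 0
    for index in stream_indices:
        if not 0 <= index < 8:
            raise ValueError("only streams 0..7 exist")
        mask |= 1 << index
    return mask


def total_attributes_size(stream_sizes, mask):
    """Sum the byte sizes of the streams enabled in mask (assumed rule)."""
    return sum(size for i, size in enumerate(stream_sizes) if mask & (1 << i))


sizes = [12, 28]  # stream 0 (PSE format), stream 1 (PTB format)
vert_mask = stream_select_mask([0])
coord_mask = stream_select_mask([1])
print(bin(vert_mask), total_attributes_size(sizes, vert_mask))    # 0b1 12
print(bin(coord_mask), total_attributes_size(sizes, coord_mask))  # 0b10 28
```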

Kicking It All Off

The only remaining change to make is in the binning command list. The NV shader state command (0x41) is replaced with the GL version (0x40). I am using a macro that looks like this:

; Control ID Code 64: GL Shader State
; bits      offs   desc
;  28        4     GL Shader Record addr in 16 byte blocks
;   1        3     Extended shader record
;   3        0     Num attribute arrays (0 = all 8).
.macro GL_Shader_State address numarrays
	.byte 0x40
	.word (\address | \numarrays)
.endm

The address of the shader state record must be 16-byte aligned because its least significant 4 bits are reused: bit 3 toggles the extended shader record, and bits 0 to 2 give the number of vertex array streams to activate. This stream count also determines the length of the GL shader state record, because you must have at least this many stream descriptions.
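For anyone building the command list from a host-side script instead of the assembler macro, the same five bytes can be produced as in this Python sketch. The function name is hypothetical and the bit positions are taken from the macro comments above; treat it as an illustration rather than a reference encoder.

```python
import struct

GL_SHADER_STATE = 0x40  # control ID 64


def gl_shader_state_cmd(record_addr, num_arrays, extended=False):
    """Encode the GL Shader State command: one ID byte plus a 32-bit word."""
    if record_addr & 0xF:
        raise ValueError("shader state record must be 16-byte aligned")
    if not 1 <= num_arrays <= 8:
        raise ValueError("between 1 and 8 attribute arrays")
    # Three bits for the array count, where a count of 8 encodes as 0.
    low_bits = (num_arrays & 0x7) | (0x8 if extended else 0)
    return struct.pack("<BI", GL_SHADER_STATE, record_addr | low_bits)


# 0x2000 is a placeholder record address for the example.
print(gl_shader_state_cmd(0x2000, 2).hex())  # 4002200000
print(gl_shader_state_cmd(0x2000, 8).hex())  # 4000200000
```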

That’s all there is. If you set everything up correctly, you should have the same triangle you had in NV mode, but it should now take over 3x longer to render. Enjoy your newfound freedom.

This week, extra special thanks goes out to Zyte for taking the time to remind me about that VPM restriction. He’s doing some very cool bare metal GPU work, so be sure to see what he’s up to here.

VPM, VCD, and VCM For Vertex Shaders

This is the last piece of the puzzle before we finally move on to vertex shaders.

VPM… Why?

They say a picture is worth a thousand words. The following image probably has a thousand words appearing in it, therefore making it worth 10^3000 words, a very valuable image indeed.

That’s a lot to take in, but try to focus on the following simplified GL mode (as opposed to NV) flow.

  1. The Vertex Cache Manager (VCM) gets vertices from *somewhere*
  2. The Vertex Cache DMA (VCD) then DMAs the vertices to Vertex Pipe Memory (VPM)
  3. Your vert shader reads these verts from VPM and transforms them
  4. Your shader writes these verts back to VPM in a certain expected format
  5. The verts are then read directly by the Primitive Tile Binner (PTB) or Primitive Setup Engine (PSE)

And that’s how triangles are made! Sort of. The ambiguity in steps 1, 4, and 5 arises because there are two vertex shader-like things: full vertex shaders and coordinate shaders. Full vertex shaders transform verts and export them, along with the varyings to be interpolated, to the PSE (Primitive Setup Engine) for eventual rasterization/interpolation. Coordinate shaders are like mini versions of your full vertex shader whose only job is to output what’s needed for the PTB (Primitive Tile Binner) to determine which primitives touch which tiles, so that the render lists discussed in a previous post can be generated by the binning thread. This makes a bit more sense when you look at the expected output formats.

The Primitive Setup Engine cares very deeply about varyings

The Primitive Tile Binner only cares about what it needs to determine what tiles a primitive touches

So in a nutshell, VPM is there so your vert-like shaders have a place to read in the vert they are processing, and a place to output the expected format for whatever stage they are.

VPM… What Can It Do For Me?

Before dropping a bunch of shader code on you, I want to visually show the kinds of magical things VPM can do. The GPU wave size is 16 threads, and so 16 will always be the number of things written, be it 16 bytes, halfwords, or words. In 32-bit mode, you write out 16 words, either as horizontal rows or vertical columns. With 8 and 16-bit modes, you have the option to use packed or laned. 8-bit packed means your wave writes out 16 contiguous bytes to some location, while laned would write one byte to a specified lane for 16 contiguous words. 16-bit modes are also available. Let’s look at a visual representation of the horizontal modes.

Here VPM is a block of 16×64 words, with the words going from left to right, and the 4 bytes of each word depicted vertically. If a wave has 16 threads and each thread writes one word, the max you can write is 16 words. This is why the X-axis labeling goes from words 0 to 15.  The Y-axis has two labels: one to tell you what row you are on [0..63] and the other to identify each of the 4 bytes (or lanes) in a word.

Let’s start with the easiest case: 32 bit (e). In 32-bit mode, every wave thread writes out one word, left to right, to a specified row. Here Y = row 63, so that’s where we write. The numbers represented in each word tell you which thread writes it, hence the word processed by thread 0 would contain {0,0,0,0}, and so on.

Next up is 8-bit laned mode (b). Y=1, so we will be using the words in row 1. The B parameter specifies which lane of each word to write, and is 1 here. Thread 0 will take the first word from row 1 and write some byte to lane 1, thread 1 will take the second word from row 1 and write some byte to lane 1, and so on. 16 bit laned modes function similarly, but instead of writing one byte to lane 0, 1, 2, or 3, each word is partitioned into two 16-bit lanes, and you can write a halfword to either lane 0 or 1 (c).

Finally we have packed modes. Y still tells you what word row to start with, but now instead of B we have the block number parameter H. If your 16 threads were to write one byte each in packed mode, you would only write 16 bytes, or enough to fill a block of four words. However, with 16 words per row, there are four of these blocks. The H parameter controls what block you address. H=0 accesses the 16 bytes in words [0..3], H=1 accesses the 16 bytes in words [4..7], and in the example above H=2 accesses the 16 bytes in words [8..11]. Again, 16-bit modes function similarly, but with each thread writing out twice the data, only two blocks are possible. H=0 accesses the 32 bytes at words [0..7] and H=1 accesses the 32 bytes at words [8..15].
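The block arithmetic for horizontal packed modes is easy to get wrong by one, so here is a tiny host-side Python sketch of it. The function name is made up; the numbers it produces are just the word ranges listed in the paragraph above.

```python
def packed_block_words(elem_bits, h):
    """Return (first_word, last_word) touched by a horizontal packed access.

    16 threads each move one element, so a block is 16 * elem_bits / 32
    words wide, and H picks which block of the 16-word row is used.
    """
    if elem_bits not in (8, 16):
        raise ValueError("packed modes exist only for 8 and 16-bit elements")
    words_per_block = 16 * elem_bits // 32
    blocks_per_row = 16 // words_per_block
    if not 0 <= h < blocks_per_row:
        raise ValueError("H out of range for this element size")
    first = h * words_per_block
    return first, first + words_per_block - 1


print(packed_block_words(8, 2))   # (8, 11)
print(packed_block_words(16, 1))  # (8, 15)
```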

And then there are the vertical modes. You still have a block of 16×64 words, but in this diagram the four lanes of each word are represented horizontally. This visually makes much more sense when you start talking about lanes in a word.

All the old concepts still apply. You have 32-bit mode, as well as 8 and 16-bit laned, and packed modes. However now, instead of block parameter H sliding your start offset left and right, it slides it up and down. This puts a new constraint on Y to save bits, avoid redundancy, and make address incrementing more intuitive. Looking at 8-bit packed mode (a), you could get the exact same start address specifying Y=4 as your starting row as you could by specifying Y=0 and block H=1. Therefore in vertical modes, Y values are encoded as multiples of 16. There is now also an X parameter to the address that specifies what column you are going down.

Programming VPM and VCD

There are a few VPM and VCD related registers in the QPU regfile.

VPM_READ and VPM_WRITE are for reading values from and writing values to VPM. However, before doing so you must first program the appropriate setup register (VPMVCD_RD_SETUP or VPMVCD_WR_SETUP) with things like VPM address, access mode, and stride. The *_BUSY and *_WAIT registers are for either polling the status of or waiting on VCD DMA operations to complete. VPM_LD_ADDR and VPM_ST_ADDR are the load and store addresses for VCD DMA operations.

VPMVCD_RD_SETUP and VPMVCD_WR_SETUP are programmed with the following bits. If you understood the vertical and horizontal memory layout diagrams, these register fields should be fairly self explanatory.

SIZE, LANED, and HORIZ are the fields controlling bits per write, laned/packed mode, and whether you are reading and writing in horizontal mode or vertical. ADDR encodes the X, Y, B, and H address bits; however, there are a few errors in the manual. For example:

Horizontal 8-bit: ADDR[7:0] = {Y[5:0], B[1:0]}
↓↓↓↓↓↓↓↓
Horizontal 8-bit: ADDR[7:0] = {Y[7:2], B[1:0]}

The number of bits is correct but the starting offset is wrong in a few cases.

You can program the above registers once and then read or write VPM repeatedly. This is where stride comes in. Stride is just the amount that is added to the address field after each read or write.

Finally, reads require an additional parameter, NUM, that specifies how many vectors you intend to read from VPM. Reading anything other than the specified amount can lead to garbage or hangs. After you program a read setup, you must wait a minimum of three QPU cycles before trying to read the requested data from VPM. After three cycles, any read will stall at the SP pipeline stage until the data is ready, but waiting less than three cycles will return garbage and continue.
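If you want to double check a hand-assembled setup immediate, a small decoder helps. The Python sketch below is host-side only, and the bit positions are my reading of the register layout (NUM in bits 23:20 for reads, STRIDE in 17:12, HORIZ in bit 11, LANED in bit 10, SIZE in 9:8, ADDR in 7:0), so treat them as an assumption to verify against the manual. They do at least agree with the num, stride, and y comments next to the setup immediates used in these posts. For 32-bit horizontal access I am also assuming only the low six ADDR bits select the row.

```python
def decode_vpm_setup(word, is_read):
    """Split a VPM generic block setup word into its raw fields."""
    size_code = (word >> 8) & 0x3
    fields = {
        "stride": (word >> 12) & 0x3F,
        "horizontal": bool((word >> 11) & 1),
        "laned": bool((word >> 10) & 1),
        "elem_bits": {0: 8, 1: 16, 2: 32}.get(size_code),
        "addr": word & 0xFF,
    }
    if is_read:
        fields["num"] = (word >> 20) & 0xF  # only read setups carry NUM
    return fields


read3 = decode_vpm_setup(0x1A341AC0, is_read=True)
write2 = decode_vpm_setup(0x17BC1AC2, is_read=False)
print(read3["num"], read3["stride"], read3["elem_bits"])  # 3 1 32
print(write2["addr"] & 0x3F)  # 2
```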

In the case of vert shaders, you shouldn’t have to manually program the VCD DMA config yourself. The system ensures that the data is ready in VPM before your vert shader instance runs, and it will DMA the output data from VPM to the units that need it automatically when your shader finishes. However, if you are going to be doing compute-like shaders via user programs, you must manually DMA things. Examples of how to do this are in the next section.

Sample Coordinate Shader

Here is a sample compute shader I wrote to test what coordinate shaders will eventually write to VPM. It takes in vertices of the same format as all my other samples, except that the “verts” are specified via uniforms.

; VPM in for vert attributes:
;	0) screen space X,Y in 12.4 fixed point
;	1) screen space Z in f32
;	2) clip space 1/W in f32
FAKE_VERT_UNIFORM_DATA:
	.hword  32 * 16		; X In 12.4 Fixed Point
	.hword 448 * 16		; Y In 12.4 Fixed Point
	.single 0e1.0		; Z
	.single 0e1.0		; 1 / W

The shader then outputs the PTB expected format: Clip space X, Y, Z, and W as  f32, screen space X and Y as 12.4 fixed point, screen space Z as f32, and the reciprocal of the clip space W as f32. This format is shown visually in the above image Shader Coordinates Format in VPM for PTB.

VPM_VCD_COORDINATE_SHADER_TEST_1:
; coordinate shader, very sad
; add op: No operation, add cond: never
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x9E7000, 0x100009E7

; r0a = read screen xy (12.4 x2)
; add pipe: Bitwise OR, R0, UNIFORM_READ, NOP, cond: always
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x15827DF7, 0x10020027

; r1a = read screen z (f32)
; add pipe: Bitwise OR, R1, UNIFORM_READ, NOP, cond: always
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x15827DF7, 0x10020067

; r2a = read clip 1/w (f32)
; add pipe: Bitwise OR, R2, UNIFORM_READ, NOP, cond: always
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x15827DF7, 0x100200A7

; VPM write setup: start with vec 2 (clip space Z = 0.0f)
; horizontal write,  elem size 32, stride 1
; y = 2

; VPM write setup fields:
; add write addr: 49 (VPMVCD_WR_SETUP B), cond: always
; write swap: 1, set flags: 0, pm: 0
; pack mode: 32->32 No pack (NOP) (PM0)
; immediate type is 0x70, loaded 32 immediate is 0x17BC1AC2
.word 0x17BC1AC2, 0xE0021C67

; store clip space Z = 0.0f to VPM slot 2:
; add write addr: 48 (VPM_WRITE A), cond: always
; write swap: 0, set flags: 0, pm: 0
; pack mode: 32->32 No pack (NOP) (PM0)
; immediate type is 0x70, loaded 32 immediate is 0.000000
.word 0x0, 0xE0020C27

; store clip space W = 1.0f to VPM slot 3:
; add write addr: 48 (VPM_WRITE A), cond: always
; write swap: 0, set flags: 0, pm: 0
; pack mode: 32->32 No pack (NOP) (PM0)
; immediate type is 0x70, loaded 32 immediate is 1.000000
.word 0x3F800000, 0xE0020C27

; write screen xy (12.4 x2) to VPM slot 4
; add pipe: Bitwise OR, VPM_WRITE, R0, NOP, cond: always
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x15027DF7, 0x10020C27

; write screen z (f32) to VPM slot 5
; add pipe: Bitwise OR, VPM_WRITE, R1, NOP, cond: always
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x15067DF7, 0x10020C27

; write clip 1/w (f32) to VPM slot 6
; add pipe: Bitwise OR, VPM_WRITE, R2, NOP, cond: always
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x150A7DF7, 0x10020C27

; acc0 = extract 16 lsb (X) and shift right by 4 to get rid of x's frac bits
; add pipe: Integer shift right, ACC0, R0, int 4, cond: always
;      unpack mode: 16a->32 Float16float32 if any ALU consuming data
;      executes float instruction, else signed int16->signed int32 (PM0)
; mul op: No operation, mul cond: never
.word 0xE004DC0, 0xD2020827

; acc1 = extract 16 msb (Y) and shift right by 4 to get rid of y's frac bits
; add pipe: Integer shift right, ACC1, R0, int 4, cond: always
;      unpack mode: 16b->32 Float16float32 if any ALU consuming data 
;      executes float instruction, else signed int16->signed int32 (PM0)
; mul op: No operation, mul cond: never
.word 0xE004DC0, 0xD4020867

; acc0 = to_float( acc0 )
; add pipe: Signed integer to floating point, ACC0, acc r0, R0, cond: always
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x80001F7, 0x10022827

; acc1 = to_float( acc1 )
; add pipe: Signed integer to floating point, ACC1, acc r1, R0, cond: always
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x80003F7, 0x10022867

; acc2 = scaling factor 2/639:
; add write addr: 34 (ACC2 A), cond: always
; write swap: 0, set flags: 0, pm: 0
; pack mode: 32->32 No pack (NOP) (PM0)
; immediate type is 0x70, loaded 32 immediate is 0.003130
.word 0x3B4D1ED9, 0xE00208A7

; acc3 = scaling factor 2/479:
; add write addr: 35 (ACC3 A), cond: always
; write swap: 0, set flags: 0, pm: 0
; pack mode: 32->32 No pack (NOP) (PM0)
; immediate type is 0x70, loaded 32 immediate is 0.004175
.word 0x3B88D181, 0xE00208E7

; acc0 *= 2/639
; add op: No operation, add cond: never
; mul pipe: Floating point multiply, ACC0, acc r0, acc r2, cond: always
; signal: no signal
.word 0x20000DC2, 0x100079E0

; acc1 *= 2/479
; add op: No operation, add cond: never
; mul pipe: Floating point multiply, ACC1, acc r1, acc r3, cond: always
; signal: no signal
.word 0x20000DCB, 0x100079E1

; acc0 = acc0 - 1.0f
; add pipe: Floating point subtract, ACC0, acc r0, float 1.f, cond: always
; mul op: No operation, mul cond: never
.word 0x20201C0, 0xD0020827

; acc1 = acc1 - 1.0f
; add pipe: Floating point subtract, ACC1, acc r1, float 1.f, cond: always
; mul op: No operation, mul cond: never
.word 0x20203C0, 0xD0020867

; VPM write setup: start with vec 0 (clip space X)
; horizontal write,  elem size 32, stride 1
; y = 0

; VPM write setup fields:
; add write addr: 49 (VPMVCD_WR_SETUP B), cond: always
; write swap: 1, set flags: 0, pm: 0
; pack mode: 32->32 No pack (NOP) (PM0)
; immediate type is 0x70, loaded 32 immediate is 0x17BC1AC0
.word 0x17BC1AC0, 0xE0021C67

; write clip x (f32) to VPM slot 0
; add pipe: Bitwise OR, VPM_WRITE, acc r0, acc r0, cond: always
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x159E7000, 0x10020C27

; write clip y (f32) to VPM slot 1
; add pipe: Bitwise OR, VPM_WRITE, acc r1, acc r1, cond: always
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x159E7249, 0x10020C27

; VCD store setup word
; horiz DMA store, 7 rows of length 16, XY addr (0, 0), 32-bit, offset 0

; 32 bit DMA store config to VPMVCD_WR_SETUP:
; add write addr: 49 (VPMVCD_WR_SETUP B), cond: always
; write swap: 1, set flags: 0, pm: 0
; pack mode: 32->32 No pack (NOP) (PM0)
; immediate type is 0x70, loaded 32 immediate is 0x83904000
.word 0x83904000, 0xE0021C67

; load uniform to DMA store address, initiate DMA
; add pipe: Bitwise OR, VPM_ST_ADDR, UNIFORM_READ, int 0, cond: always
; mul op: No operation, mul cond: never
.word 0x15800DF7, 0xD0021CA7

; scoreboard done
; add op: No operation, add cond: never
; mul op: No operation, mul cond: never
; signal: scoreboard unlock
.word 0x9E7000, 0x500009E7

; thread end
; add op: No operation, add cond: never
; mul op: No operation, mul cond: never
; signal: program end
.word 0x9E7000, 0x300009E7

; branch delay NOP 1
; add op: No operation, add cond: never
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x9E7000, 0x100009E7

; branch delay NOP 2
; add op: No operation, add cond: never
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x9E7000, 0x100009E7

First we read in the fake vert data and store it to the same VPM locations the system would DMA to if this were a real coordinate shader. Zc, Wc, Xs, Ys, Zs, and 1/W are already in the correct output format, so we are able to store them as-is directly to their VPM destination slots [2..6]. That just leaves the clip space Xc and Yc. These are read in as 12.4 fixed point screen space positions, converted to float, mapped to [-1..1] by assuming a screen size of 640×480, and written to VPM slots 0 and 1. With all the output data in its correct place, we DMA the 7 words * 16 verts back to main memory, where we can print out what the PTB will eventually see.

In real life, each of the 16 words would represent the same attribute but for one of the 16 verts being processed this wave. However, since we are having all threads start with the same vert, the same values just repeat 16 times.

First up is the clip space X, Y, Z, and W. TTY is showing values of 0xBF665C24 (-0.8998f), 0x3F5EDD42 (0.87f), 0x00000000 (0.0f), and 0x3F800000 (1.0f), which is exactly what we’d expect. Up next is the 12.4 encoding of the screen space X and Y. The TTY shows 0x1C000200: Y is 0x1C00 and X is 0x0200, which after dropping the four fractional bits are 0x1C0 (448) and 0x020 (32).
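If you want to check these numbers without a GPU handy, the same math is easy to redo on the host. This Python sketch mirrors what the shader does (drop the 4 fraction bits, scale by 2/(size - 1), subtract 1); the function names are mine, it assumes non-negative coordinates, and it runs in double precision rather than the QPU's f32, so expect agreement only to a few decimal places.

```python
def unpack_screen_xy(word):
    """Split a packed XY word into integer pixels (drops the 4 fraction bits)."""
    x_fixed = word & 0xFFFF          # X lives in the low halfword
    y_fixed = (word >> 16) & 0xFFFF  # Y lives in the high halfword
    return x_fixed >> 4, y_fixed >> 4


def screen_to_clip(pixel, size):
    """Map a pixel coordinate in [0, size - 1] to clip space [-1, 1]."""
    return pixel * (2.0 / (size - 1)) - 1.0


x, y = unpack_screen_xy(0x1C000200)
print(x, y)  # 32 448
print(round(screen_to_clip(x, 640), 4), round(screen_to_clip(y, 480), 4))  # -0.8998 0.8706
```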

Sharing VPM With Non-Graphics Work

There is one last point to be careful of when running user programs at the same time as graphics. The manual states:

The minimum VPM size is 8Kbytes, which is the amount required for normal pipelined 3D operation with concurrent vertex and coordinate shading and worst case vertex data size. With this size of memory it is not sensible to divide the VPM between 3D shading functions and general-purpose processing at the same time. Fully configured systems may have up to 16Kbytes of VPM for higher vertex shading performance. With this size of VPM it is practical for some of the memory to be reserved for general-purpose processing whilst 3D is operating so long as at least 8Kbytes is left for 3D use.

So it might be a good idea to query how much VPM is available before trying to set some aside for compute programs. It is also not possible to use VPM from pixel shaders, as the manual explains:

The FIFO used for the partially interpolated varying results is also used for VPM read and write accesses and VCD control from QPU programs. For this reason a fragment shader program cannot access the VPM or VCD.

This kills one of my best ideas for full shader debugging, but I have other possibilities in mind.

Texturing

This is a bittersweet post for me to write. At the end of August I will be temporarily pausing my eight year Japanese adventure and returning to the States for personal reasons, making this the last JayStation blog update from beautiful Kyoto, Japan. As part of the moving process, I am selling my only computer tomorrow, giving me just 24 hours to bang out this post before I no longer have any means of doing so. It won’t be as in depth or interesting as I had hoped, and it will probably be rushed and error-riddled with unclear wording, but it will show everything you need to know to render textured triangles. The next update probably won’t come until September or October when I settle in, unless of course I am murdered in the street for riding my bike to work. #America

There are three steps to getting textured polygons rendering. First you set up your vert data in memory, including two varyings for the normalized S and T texture coordinates. Next you set up between 1 and 4 uniforms, corresponding to texture config parameters 0 through 3. Finally you write a fragment shader that does interpolation for the ST coordinate varyings, and reads the texture data. The first step, setting up verts and varyings, was covered in the previous post so it won’t be duplicated here.

I Just Love A Config In Uniform

Texture config params are shader uniforms that are used to specify things like base address, dimensions, pixel format, mip levels, min/mag filters, and wrap mode for texture unit memory accesses. They roughly correspond to some combination of T#’s and samplers on GCN. The number of config params needed is dictated by the type of data being accessed, with one word required for 1D buffers and general memory accesses, two words for 2D textures, and three or four words for cubemaps and child images.

No matter the type of data, you must at least specify config param 0. Bits [31:12] give the base address of the LOD0 image in units of 4 KiB blocks, so all images must be at least 4 KiB aligned. Bits [11:10] give the cache swizzle mode, and will be discussed in depth in a later post. Bits [7:4] are the four least significant bits of the 5-bit pixel format value, and bits [3:0] give the number of mip levels minus one.

The supported pixel formats are:

Value  Name       Bits/pixel  Description
0      RGBA8888   32          8-bit per channel red, green, blue, alpha
1      RGBX8888   32          8-bit per channel RGB, alpha set to 1.0
2      RGBA4444   16          4-bit per channel red, green, blue, alpha
3      RGBA5551   16          5-bit per channel red, green, blue, 1-bit alpha
4      RGB565     16          Alpha channel set to 1.0
5      LUMINANCE  8           8-bit luminance (alpha channel set to 1.0)
6      ALPHA      8           8-bit alpha (RGB channels set to 0)
7      LUMALPHA   16          8-bit luminance, 8-bit alpha
8      ETC1       4           Ericsson Texture Compression format
9      S16F       16          16-bit float sample (blending supported)
10     S8         8           8-bit integer sample (blending supported)
11     S16        16          16-bit integer sample (point sampling only)
12     BW1        1           1-bit black and white
13     A4         4           4-bit alpha
14     A1         1           1-bit alpha
15     RGBA64     64          16-bit float per RGBA channel
16     RGBA32R    32          Raster format 8-bit per channel red, green, blue, alpha
17     YUYV422R   32          Raster format 8-bit per channel Y, U, Y, V

For 2D texture data, a uniform for config param 1 must also be given. This will include things useful for 2D texture reads such as width and height, wrap mode, min/mag mode, and the MSB of the 5-bit pixel format value.
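To make the config param 0 layout concrete, here is a small Python sketch that packs the fields described above into one word. The function names are hypothetical, and only the fields covered in this post are filled in; any other bits of the word are simply left at zero. Config param 1 is not packed here because its exact bit positions are not given in this post.

```python
def pack_texture_config0(base_addr, pixel_format, mip_levels, cache_swizzle=0):
    """Pack texture config param 0 from the fields described in the text.

    Assumption: bits not mentioned in the post are left as zero.
    """
    if base_addr % 4096 != 0:
        raise ValueError("LOD0 base address must be 4 KiB aligned")
    if not 0 <= pixel_format <= 17:
        raise ValueError("unknown pixel format")
    if not 1 <= mip_levels <= 16:
        raise ValueError("mip level count must fit in 4 bits after minus one")
    word = base_addr & 0xFFFFF000           # bits [31:12]: address in 4 KiB blocks
    word |= (cache_swizzle & 0x3) << 10     # bits [11:10]: cache swizzle mode
    word |= (pixel_format & 0xF) << 4       # bits [7:4]: low 4 bits of the format
    word |= (mip_levels - 1) & 0xF          # bits [3:0]: mip levels minus one
    return word

def pixel_format_msb(pixel_format):
    """The fifth format bit, which belongs in config param 1 instead."""
    return (pixel_format >> 4) & 1

# RGB565 texture at 0x00100000 with three mip levels
print(hex(pack_texture_config0(0x00100000, 4, 3)))
```

Note that for the raster formats (16 and 17) the low four bits alone are ambiguous with formats 0 and 1, which is why the MSB in config param 1 matters.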

As with the uniforms in previous examples, the texture config params are stored as a word-aligned list in memory, and the address of the list is encoded in the uniform data address field of the NV shader state record. The main difference here is that you won’t be manually reading these uniforms in the fragment shader. Rather, every time your shader writes to a texture unit register (TMUn_S, TMUn_T, TMUn_R, TMUn_B), the texture unit automatically fetches the next config param from the uniform FIFO and pushes it to the texture FIFO. If all four config params are needed, you have to write to all four TMU registers, writing a zero to TMUn_B if no bias is actually needed. Since the S register is the only one required by all access types, writing it also kicks off texture processing, and therefore it must always be written last.

The Shader

The high level shader flow is as follows: read and interpolate the S and T varyings, write T to TMU0_T causing the TMU to auto fetch config param 0, write S to TMU0_S causing the TMU to auto fetch config param 1 and kick off texture processing, send the GPU the texture read signal, and read back the packed pixel data from accumulator r4. Let’s start with the S and T varyings interpolation:

; Tex S: ACC0 = S * W (r15a)
; add op: No operation, add cond: never
; mul pipe: Floating point multiply, ACC0, R15, VARYING_READ, cond: always
.word 0x203E3DF7, 0x110059E0

; Tex S coord: ACC0 = S * W + C, Tex T: ACC1 = T * W (r15a)
; add pipe: Floating point add, ACC0, acc r0, acc r5, cond: always
; mul pipe: Floating point multiply, ACC1, R15, VARYING_READ, cond: always
.word 0x213E3177, 0x11024821

; Tex T write reg = T * W + C, triggering first sampler param uniform read
; add pipe: Floating point add, TMU0_T, acc r1, acc r5, cond: always
; mul op: No operation, mul cond: never
.word 0x13E3377, 0x11020E67

Notice how the destination for the T coordinate’s add is TMU0_T. This not only feeds the texture unit the T coordinate, but also causes the TMU to fetch config param 0 from the uniform FIFO.

; moving S coord (in ACC0) to S register
; triggering read of second tex param uniform, and kicking it all off
; add pipe: Bitwise OR, TMU0_S_RETIRING, acc r0, acc r0, cond: always
; mul op: No operation, mul cond: never
.word 0x159E7000, 0x10020E27

Next we move the S value into TMU0_S. This feeds the texture unit the S coordinate, causes the TMU to fetch config param 1 from the uniform FIFO, and kicks off texture processing. Note that because S has to be written last, but usually comes first in the vert data, storing the texture coordinates as TS instead of ST can avoid an extra mov and be a potential optimization in some cases.

; signal TMU texture read. Can this be done with prev instruction?
; add op: No operation, add cond: never
; mul op: No operation, mul cond: never
; signal: load data from tmu0 to r4
.word 0x9E7000, 0xA00009E7

This instruction doesn’t execute any ALU operations, but it does signal the GPU to load data from the texture unit into accumulator r4. Whether or not the signal can occur in the same instruction as the write to TMU0_S is a bit unclear, and may need a bit of testing. For now, just to be safe it’s being done in the instruction after the S coordinate is written.

Also worth noting, data from the texture units is always returned as either packed 8-bit RGBA8888 or RG1616 and BA1616 values in r4, depending on the bits per channel. The special r4 unpack mode (pm=1) mentioned in the previous post exists to convert this packed texture data to normalized [0..1] data. Reading 64-bit pixel data requires two 32-bit reads to get all four channels. For general 1D buffer access, 32-bit data is always fetched.

; exporting read texture data to MRT0
; add pipe: Bitwise OR, TLB_COLOUR_ALL, acc r4, acc r4, cond: always
; mul op: No operation, mul cond: never
.word 0x159E7924, 0x10020BA7

After the signal, the data arrives in r4 and is immediately available for the next instruction to use. Here the sampled texture data in r4 is just directly copied to TLB_COLOUR_ALL for export.

Be careful not to overfill the texture receive FIFO, as it is only 8 entries deep, and each write to a TMU register constitutes one entry. For 2D textures you only write to S and T, and so you can queue up four texture requests. If you only need to write S (1D buffer case), you can queue up eight requests. Finally in the child image case where S, T, R, and B are all needed, you can only queue up two. If your shader runs in multithreaded mode and you suspend, both threads share the same FIFO and should only use half.
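The arithmetic behind those limits is just integer division of the FIFO depth by the number of TMU register writes per request. A tiny sketch, with a made-up helper name, assuming the two threads in multithreaded mode split the FIFO evenly as described above:

```python
TMU_FIFO_DEPTH = 8  # entries, per the text above

def max_queued_requests(writes_per_request, multithreaded=False):
    """How many texture requests fit in the FIFO before it overfills."""
    if writes_per_request not in (1, 2, 3, 4):
        raise ValueError("a request writes between 1 and 4 TMU registers")
    depth = TMU_FIFO_DEPTH // 2 if multithreaded else TMU_FIFO_DEPTH
    return depth // writes_per_request

for writes in (1, 2, 4):
    print(writes, max_queued_requests(writes), max_queued_requests(writes, True))
```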

I Wanna See You Swizzle It Just A Little Bit

Finally we need to cover how texture data is laid out in memory. Texture reads support both linear (raster order) formats and microtile-based T and LT formats. Microtiles are 64 byte 2D blocks of pixels, whose geometry depends on the number of bits per pixel. In the common case of 32-bit pixels, a microtile would be a 4×4 block. In the 64-bit pixel case, microtiles are 2×4 blocks, and 1-bit pixels are laid out as 32×16 blocks, for example. The pixels are stored in simple raster scan order, left to right and bottom to top, with the origin in the lower left corner.

In T-format, microtiles are grouped into subtiles, where a subtile is a 1 KiB block of 16 microtiles, arranged 4×4 in simple raster scan order. In the 32bpp case, that’s 4×4 pixels per microtile times 4×4 microtiles per subtile, or 16×16 pixels per subtile. 256 pixels times 4 bytes per pixel is 1 KiB, as expected.

Next, four 1 KiB subtiles are grouped into one image tile (sometimes called a 4K tile). The ordering of subtiles within a 4K tile depends on the 4K tile row. Even rows (0, 2, 4,…) order the four subtiles lower left, upper left, upper right, lower right. Odd rows of 4K tiles (1, 3, 5,…) order them upper right, lower right, lower left, upper left.

4K tiles on even lines have subtiles wound clockwise from lower left
4K tiles on odd lines have subtiles wound clockwise from upper right

Finally, the 4K tiles themselves go left to right for even rows, and right to left for odd rows. Again, the origin is in the lower left corner. The image below shows a texture three 4K tiles wide and two 4K tiles high. The pattern repeats, alternating every row.

This is the tiling order for T-format. Because the format is 4K tile based, the texture dimensions must always be padded out to a multiple of 4K tiles. This can be wasteful for smaller textures, so the hardware assumes any mip level less than a 4K tile in size will be stored in LT (or linear tiled) format. LT-format is also microtile based, but the image is stored as a series of microtiles in linear scan order, without any concept of subtiles and 4K tiles.
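The T-format rules above can be turned into an address calculation. What follows is a sketch that transcribes the description in this post for the 32bpp case only, with the origin in the lower left. The function and table names are my own, and it has not been checked against hardware, so treat it as a model of the text rather than a reference swizzler.

```python
# Subtile position (sx, sy) -> index within a 4K tile, with sy = 0 at the bottom.
EVEN_ROW_ORDER = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}
ODD_ROW_ORDER = {(1, 1): 0, (1, 0): 1, (0, 0): 2, (0, 1): 3}

def t_format_offset_32bpp(x, y, width_in_4k_tiles):
    """Byte offset of pixel (x, y) in a 32bpp T-format image."""
    # A 4K tile is 2x2 subtiles of 16x16 pixels, so 32x32 pixels.
    tile_col, px = divmod(x, 32)
    tile_row, py = divmod(y, 32)
    odd_row = tile_row & 1
    if odd_row:
        # Odd rows of 4K tiles run right to left.
        tile_col = width_in_4k_tiles - 1 - tile_col
    tile_index = tile_row * width_in_4k_tiles + tile_col

    # Pick the subtile, whose ordering depends on the 4K tile row parity.
    sub_x, px = divmod(px, 16)
    sub_y, py = divmod(py, 16)
    order = ODD_ROW_ORDER if odd_row else EVEN_ROW_ORDER
    subtile_index = order[(sub_x, sub_y)]

    # Microtiles are 4x4 pixels, stored 4x4 per subtile in raster order.
    micro_x, px = divmod(px, 4)
    micro_y, py = divmod(py, 4)
    microtile_index = micro_y * 4 + micro_x

    # Pixels inside a microtile are also in raster order.
    pixel_index = py * 4 + px
    return (tile_index * 4096 + subtile_index * 1024
            + microtile_index * 64 + pixel_index * 4)

print(t_format_offset_32bpp(16, 16, 3))  # upper right subtile of the first tile
```

For the three-wide, two-high texture in the image, pixel (64, 32) is the lower left corner of the rightmost tile in the odd row, so it lands in the fourth 4K tile in memory, in its third subtile.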

That’s about all I have time for.  I’ll have to save cache swizzling, mipmap layout, cubemaps, and hardware utilization for another post this fall.

Shoutouts as always to Peter Lemon for continuing to push low level and bare metal on RPI, Cort Danger Stratton for being recognizable even as a 32×32 texture, Graham Wihlidal (father of the never-gonna-be-released GrahamBox), and new kid in the console game Mike Nicølella who I can only assume is making the Miketendo Nii…

 

Shaders, Uniforms, and Varyings

The road to full GL pipeline support is fraught with peril. Quite a few things are involved, such as using a completely different pipeline, using VPM and VCD, and writing vert shaders that read attributes and write varyings and vertices in the right format. Luckily before jumping to full-on vert shaders, there is an intermediate step that allows us to try out some of the useful things like uniforms and varyings, all without leaving our familiar NV pipeline mode. If you haven’t read the previous post on how to initialize the GPU and set up basic binning and rendering command buffers, now is the time. You’re gonna need it.

Shader ISA

Before getting into how to set up uniforms and varyings, we have to go over the basics of writing shaders. This will mainly cover the QPU ISA and instruction encoding, the register file, and some basic rules and limitations. There are two register files: A and B. The first 32 registers in each regfile [r0a .. r31a] and [r0b .. r31b] are the physically backed registers, and locations above that (from r32a/b to r63a/b) are for register-space IO. There are also four general purpose accumulators and two special purpose accumulators, whose magic power is that unlike the physically backed registers, their value can be used immediately after being written.

Each register file is single ported, so instructions can’t read or write two different registers in the same file. The register map looks like this:

The address on the left is the register number, from 0 to 63, and the other columns tell you what each register means when reading and writing A and B. Sometimes A and B have the same meaning, for example reading from register r35a and r35b will both read a varying from a FIFO. Sometimes they differ, like r41a and r41b which give you the X and Y pixel coordinate respectively.

Instructions are all fixed size 64-bit, and fall into the following classes: ALU, ALU with small immediate, branch, 32-bit immediate loads, and semaphores. All of the shaders covered in this post use ALU and ALU with small immediate, so let’s focus on those two. The ALU instruction encoding looks like this

That is a lot of fields, so let’s take it one by one. The ALU has two pipelines. The add pipeline handles add-like operations, integer shift, and bitwise operations, while the mul pipeline does multiplication-like things and operations on individual bytes in a word. Each ALU instruction can encode up to one add pipe operation via the op_add field, and one mul pipe operation via the op_mul field, to be executed together.

Where an instruction pulls its inputs from is determined by a multiplexer, which can select from register file A, register file B, or any of the six accumulators. The fields specifying the two input muxes for the add pipe are confusingly called add_a and add_b. Here the ‘a’ and ‘b’ suffixes refer to whether it’s the first or second input operand and have nothing to do with the A and B register files. Similarly, the input muxes for the mul pipe are given with mul_a and mul_b.

So for example if you wanted to add together a value in accumulator 5 with some register in the A file, and multiply together some register in the A file with accumulator 4, you might do something like this

op_add = ADD_PIPE_INST_FADD    ; add pipe instruction is a floating point add
add_a  = INPUT_MUX_ACC5        ; input 1 operand is ACC5
add_b  = INPUT_MUX_REGFILE_A   ; input 2 operand is *some* register in the A file
op_mul = MUL_PIPE_INST_FMUL    ; mul pipe instruction is a floating point mul
mul_a  = INPUT_MUX_REGFILE_A   ; input 1 operand is *some* register in the A file
mul_b  = INPUT_MUX_ACC4        ; input 2 operand is ACC4

Because each register file is single ported, an instruction can read one register from the A file and one from the B file, but not two different registers in the same file. This means that any input mux field set to INPUT_MUX_REGFILE_A must necessarily refer to the same register, as will any mux field set to INPUT_MUX_REGFILE_B. The specific register number is given with raddr_a (read address regfile A) and raddr_b (read address regfile B). In the above example we could set raddr_a = 17, and then any mux using INPUT_MUX_REGFILE_A would mean register r17a.

There is a variation on ALU instructions that only allows you to get inputs from accumulators and register file A, but in exchange lets you reuse the six bits in raddr_b to encode some commonly used literals. This is the “small immediate” form, and some allowed literals are [-16..15], [1.0, 2.0, 4.0, 8.0, … , 128.0 ], and [ 1.0/256.0, 1.0/128.0, 1.0/64.0, …, 1.0/2.0 ]. To use one of these literals as an input operand, the input mux should be set to INPUT_MUX_REGFILE_B.
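As a rough illustration of how a 6-bit small immediate code might map to those literals, here is a Python sketch. The only code this post confirms is 32 meaning 1.0f (see the uniforms example). The rest of the mapping is written from my reading of the VideoCore IV manual and should be treated as an assumption to verify, and the function name is made up.

```python
def decode_small_immediate(code):
    """Map a 6-bit small immediate code to the literal it stands for.

    Assumed layout: 0..15 are themselves, 16..31 are -16..-1,
    32..39 are 1.0 up to 128.0, and 40..47 are 1/256 up to 1/2.
    Higher codes are not handled in this sketch.
    """
    if 0 <= code <= 15:
        return code
    if 16 <= code <= 31:
        return code - 32
    if 32 <= code <= 39:
        return float(2 ** (code - 32))
    if 40 <= code <= 47:
        return 2.0 ** (code - 48)
    raise ValueError("code not covered by this sketch")

print(decode_small_immediate(32))  # the 1.0f used in the uniforms example
```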

Outputs are a bit simpler. When the write select (ws) bit is cleared, the add pipe writes to a register in regfile A and the mul pipe to B. Setting ws to 1 reverses that so add writes to B and mul to A. The specific destination register numbers are specified with waddr_add and waddr_mul. Let’s say waddr_add = 7 and waddr_mul = 13. With the ws bit cleared, the add pipe will write to r7a and the mul pipe to r13b. Setting ws would make the add pipe write to r7b and the mul pipe to r13a. If an instruction writes out a value that is needed by the next instruction, then you need to use the accumulators, as this isn’t allowed with the physically backed registers.

The hardware has two unpack/pack modes, controlled by the pm bit. When pm is 0 the add ALU can unpack various 8 and 16-bit inputs to 32-bit types, as well as pack 32-bit outputs into 8 and 16-bit types. By setting the pm bit to 1, the mul ALU can saturate and pack 32-bit normalized floats to 8-bit, and insert the result in the R, G, B, A, or all channels of the destination word. However, when the pm bit is set unpack only does some limited conversions that are useful for things like textures, and only works with accumulator 4. Regardless of the value of pm, packing and unpacking is only supported for register file A, registers 0 through 15.

Finally, here are some miscellaneous bits you might find useful. Both add and mul pipe instructions can be conditionally executed based on the Z, N, and C flags, allowing some branchless coding. Whether or not an instruction updates the flags is controlled by the set flags (sf) bit. Usually it’s the result of the add ALU that sets the flags, except if the add operation is a NOP or its condition code is set to never. In that case flags are updated based on the mul ALU result. There is also a 4-bit signal field used to signal certain conditions to the GPU. Some of the ones we’ll need for this post are software breakpoint, program end, small immediate, and scoreboard unlock.

Uniforms

That’s a lot to take in, so let’s look at some examples. Uniforms are 32-bit words that can be read from a shader, and whose values are uniform across all invocations of the shader (as opposed to varyings, which can vary across the triangle). All uniforms for a shader should be packed contiguously in a list, and the list address is then set in an NV Shader State record field.

.align 4 ; 128-Bit Align
NV_SHADER_STATE_RECORD:
     .byte 0                    ; Flag Bits: 0 = Fragment Shader Single Threaded
     .byte 3 * 4                ; Shaded Vertex Data Stride
     .byte 0                    ; Fragment Shader Num Uniforms (unused)
     .byte 0                    ; Fragment Shader Num Varyings
     .word FRAGMENT_SHADER_CODE ; Fragment Shader Code Address
     .word UNIFORM_DATA         ; Frag Shader Uniforms Addr, now non-zero
     .word VERTEX_DATA          ; Shaded Vertex Data Address

.align 4 ; RGBA as 4 floats
UNIFORM_DATA:
     .single 0.0               ; red
     .single 1.0               ; green
     .single 0.0               ; blue
     .single 1.0               ; alpha

This example is a bit contrived, as it could be done more efficiently with one uint uniform. However, doing it this way allows us to demonstrate mul ALU packing (pm=1).

To read uniforms from a shader, just specify the IO-space register UNIFORM_READ (r32a or r32b) as an input operand. Each successive read from UNIFORM_READ will get you the next uniform from the FIFO. If you want to reset the stream and re-read the uniforms, the uniform base pointer can be written from SIMD element 0. Let’s look at an instruction to read a uniform and pack to a color channel.

; read in a uniform and pack to R
; add op: No operation, add cond: never
; mul pipe: fmul, dest: ACC5, src1: UNIFORM_READ, src2: float 1.f, cond: always
;    pack mode: 32->8a Convert mul float result to 8-bit color in range [0, 1.0] (PM1)
.word 0x20820DF7, 0xD14059E5

The instruction is fmul, the first input operand is the IO-space UNIFORM_READ register, and the second is the small immediate 1.0f. The pack mode saturates the mul ALU result, converts to a byte, and inserts the byte into the R channel. Let’s look at the bits

m_mul_b          7  (mul input 2 uses value from register file B, or small immediate)
m_mul_a          6  (mul input 1 uses value from register file A)
m_add_b          7  (add input 2 uses value from register file B, or small immediate)
m_add_a          6  (add input 1 uses value from register file A)
m_small_imm      32 (1.0f)
m_raddr_a        32 (regfile A uses register 32: uniform read)
m_op_add         0  (add ALU is NOP)
m_op_mul         1  (mul ALU is fmul)
m_waddr_mul      37 (mul ALU writes to accumulator 5)
m_waddr_add      39 (add ALU writes to NOP, no write)
m_ws             1  (write swap: add ALU writes to regfile B, mul ALU writes to regfile A)
m_sf             0  (don’t set flags)
m_cond_mul       1  (cond always)
m_cond_add       0  (cond never)
m_pack           4  (normalized float to u8, write only R channel)
m_pm             1  (use mul ALU packing)
m_unpack         0  (no unpacking of input operands)
m_1101           13 (small immediate signal)
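If you’d rather not pick bits apart by hand, the field breakdown above can be reproduced with a small decoder. This is a sketch that assumes the first .word is the low 32 bits of the instruction and the second is the high 32 bits, with field positions taken from my reading of the ALU encoding in the VideoCore IV manual. The function name and dictionary keys are made up for illustration.

```python
def decode_alu(low, high):
    """Split the two 32-bit halves of an ALU instruction into its fields."""
    return {
        "mul_b": low & 0x7,
        "mul_a": (low >> 3) & 0x7,
        "add_b": (low >> 6) & 0x7,
        "add_a": (low >> 9) & 0x7,
        "raddr_b": (low >> 12) & 0x3F,   # doubles as small_imm when sig == 13
        "raddr_a": (low >> 18) & 0x3F,
        "op_add": (low >> 24) & 0x1F,
        "op_mul": (low >> 29) & 0x7,
        "waddr_mul": high & 0x3F,
        "waddr_add": (high >> 6) & 0x3F,
        "ws": (high >> 12) & 0x1,
        "sf": (high >> 13) & 0x1,
        "cond_mul": (high >> 14) & 0x7,
        "cond_add": (high >> 17) & 0x7,
        "pack": (high >> 20) & 0xF,
        "pm": (high >> 24) & 0x1,
        "unpack": (high >> 25) & 0x7,
        "sig": (high >> 28) & 0xF,
    }

fields = decode_alu(0x20820DF7, 0xD14059E5)
for name, value in fields.items():
    print(name, value)
```

Running it on the uniform read instruction gives the same values as the list above, which is a handy sanity check when hand-assembling shaders.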

Once R, G, B, and A are converted and packed in this way, the resulting color is written to TLB_COLOUR_ALL (r46a or r46b) to export the fragment color.

Varyings

Varyings are slightly more involved. Like uniforms they are stored in lists of 32-bit words, but unlike uniforms they are specified per-vertex and get barycentrically interpolated across the triangle. This interpolation requires a bit of manual work in the shader.

Varyings are interpolated with an equation of the form (A*(x-x0)+B*(y-y0))*W+C, but the GPU is generous enough to calculate V=(A*(x-x0)+B*(y-y0)) in the hardware for us. Specifically the PSE sets up the varying coefficients and sends them to coefficients memory for each QPU slice’s VRI interpolator, but that’s just some fun trivia and not something you necessarily need to worry about right now. The frag shader is responsible for reading the partially interpolated value V from the VARYING_READ register (r35b), multiplying by W, and then adding C to get the final value. W is initialized for a particular pixel shader instance a few cycles before the shader instance launches, and arrives ready to read in r15a. Reading from VARYING_READ also automatically triggers a write of C to accumulator 5, which is available to read by the next instruction. Therefore the flow for a single varying will be

ACC0 = VARYING_READ * r15a  // ACC0 = V * W, trigger write of C to ACC5
ACC0 = ACC0 + ACC5          // V * W + C

Remember we can execute a mul and add ALU instruction together, so in the case of multiple varyings we can do the previous varying’s add and the next varying’s multiply together. Let’s take a look at how this is done

// ACC0 = R * W + C, ACC1 = G * W (r15a)
// add pipe: fadd, dest: ACC0, src1: acc r0, src2: acc r5, cond: always
// mul pipe: fmul, dest: ACC1, src1: r15a, src2: VARYING_READ, cond: always
// signal: no signal

m_mul_b          7  (mul input 2 uses value from register file B)
m_mul_a          6  (mul input 1 uses value from register file A)
m_add_b          5  (add input 2 uses accumulator ACC5)
m_add_a          0  (add input 1 uses accumulator ACC0)
m_raddr_b        35 (r35b is VARYING_READ)
m_raddr_a        15 (r15a is pre-setup with the W for varying interp)
m_op_add         1  (add ALU does fadd)
m_op_mul         1  (mul ALU does fmul)
m_waddr_mul      33 (mul ALU writes to r33b, is ACC1)
m_waddr_add      32 (add ALU writes to r32a, is ACC0)
m_ws             0  (ws off, add ALU writes regfile A, mul ALU writes regfile B)
m_sf             0  (don’t set result flags)
m_cond_mul       1  (cond always)
m_cond_add       1  (cond always)
m_pack           0  (no packing)
m_pm             1  (we aren’t packing or unpacking, so who cares what this bit is?)
m_unpack         0  (no unpacking)
m_signal         1  (no signal)
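Going the other direction, the field values listed above can be packed back into two instruction words. As before this is a sketch under assumptions: the field positions come from my reading of the VideoCore IV manual's ALU encoding, the low word is emitted first, and the function name is hypothetical.

```python
def encode_alu(f):
    """Pack a dict of ALU instruction fields into (low, high) 32-bit words."""
    low = ((f["op_mul"] & 0x7) << 29) | ((f["op_add"] & 0x1F) << 24)
    low |= ((f["raddr_a"] & 0x3F) << 18) | ((f["raddr_b"] & 0x3F) << 12)
    low |= ((f["add_a"] & 0x7) << 9) | ((f["add_b"] & 0x7) << 6)
    low |= ((f["mul_a"] & 0x7) << 3) | (f["mul_b"] & 0x7)

    high = ((f["sig"] & 0xF) << 28) | ((f["unpack"] & 0x7) << 25)
    high |= ((f["pm"] & 0x1) << 24) | ((f["pack"] & 0xF) << 20)
    high |= ((f["cond_add"] & 0x7) << 17) | ((f["cond_mul"] & 0x7) << 14)
    high |= ((f["sf"] & 0x1) << 13) | ((f["ws"] & 0x1) << 12)
    high |= ((f["waddr_add"] & 0x3F) << 6) | (f["waddr_mul"] & 0x3F)
    return low, high

# Field values copied from the varying interpolation instruction above
varying_fields = {
    "mul_b": 7, "mul_a": 6, "add_b": 5, "add_a": 0,
    "raddr_b": 35, "raddr_a": 15, "op_add": 1, "op_mul": 1,
    "waddr_mul": 33, "waddr_add": 32, "ws": 0, "sf": 0,
    "cond_mul": 1, "cond_add": 1, "pack": 0, "pm": 1,
    "unpack": 0, "sig": 1,
}
low, high = encode_alu(varying_fields)
print(".word 0x%08X, 0x%08X" % (low, high))
```

The result matches the .word pair used for the second instruction of the texturing shader, which performs the same add-plus-mul pairing on the S and T coordinates.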

Of course this gives us an interpolated float. If you’re working with colors, you’d still need to convert and pack in the same way as the uniforms example. Also similar to uniforms, each read of the VARYING_READ register advances an internal pointer, so subsequent reads fetch the next varying in the FIFO.
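The overlap of one varying’s add with the next varying’s multiply can be modeled in a few lines of Python. This is purely a software model with hypothetical names: `partials` stands in for the V values the hardware hands back from VARYING_READ, `constants` for the C values that show up in ACC5, and `w` for the value in r15a.

```python
def interpolate_varyings(partials, constants, w):
    """Model the paired add/mul schedule for a list of varyings.

    Returns the interpolated values (V * W + C) and the number of
    instructions the schedule would take.
    """
    results = []
    product = None   # V * W waiting for its add, like ACC0/ACC1
    acc5 = None      # C delivered by the previous VARYING_READ
    instructions = 0
    for v, c in zip(partials, constants):
        # One instruction: the add pipe finishes the previous varying
        # while the mul pipe starts the next one.
        if product is not None:
            results.append(product + acc5)
        product = v * w
        acc5 = c
        instructions += 1
    if product is not None:
        # A final add-only instruction drains the last varying.
        results.append(product + acc5)
        instructions += 1
    return results, instructions

values, count = interpolate_varyings([0.5, 0.25, 0.0], [0.1, 0.2, 0.3], 2.0)
print(values, count)
```

With this pairing, N varyings cost N + 1 instructions instead of 2N.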

The CPU-side setup for varyings isn’t much more complicated than uniforms. The varyings themselves are stored after each vertex’s data. In the NV Shader State Record, you must now specify the number of varyings. The stride field must also be changed to the total size of a vertex plus the per-vertex varyings.

.align 4 ; 128-Bit Align
NV_SHADER_STATE_RECORD:
     .byte 0                     ; Flag Bits: 0 = Single Threaded
     .byte 24                    ; Shaded Vert Stride. 2 * hword + 5 * word
     .byte 0                     ; Fragment Shader Num Uniforms (unused)
     .byte 3                     ; Fragment Shader Num Varyings
     .word FRAGMENT_SHADER_CODE  ; Fragment Shader Code Address
     .word 0                     ; Fragment Shader Uniforms Address
     .word VERTEX_DATA           ; Shaded Vertex Data Address

.align 4 ; 128-Bit Align
VERTEX_DATA:
     ; Vertex: Top
     .hword 320 * 16             ; X In 12.4 Fixed Point
     .hword  32 * 16             ; Y In 12.4 Fixed Point
     .single 0e1.0               ; Z
     .single 0e1.0               ; 1 / W
     .single 1.0                 ; varying R
     .single 0.0                 ; varying G
     .single 0.0                 ; varying B
     ; Vertex: Right …
     ; Vertex: Left …
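As a quick check on the 24-byte stride, the same vertex can be packed with Python’s struct module. This is only a sketch for verifying sizes offline; the function name is made up, and it assumes little-endian layout with no padding between fields, matching the assembler listing above.

```python
import struct

def pack_nv_vertex(x_pixels, y_pixels, z, inv_w, varyings):
    """Pack one NV-mode vertex: 12.4 X and Y, float Z and 1/W, then varyings."""
    x_fixed = int(round(x_pixels * 16)) & 0xFFFF   # 12.4 fixed point
    y_fixed = int(round(y_pixels * 16)) & 0xFFFF
    fmt = "<HHff%df" % len(varyings)
    return struct.pack(fmt, x_fixed, y_fixed, z, inv_w, *varyings)

top = pack_nv_vertex(320, 32, 1.0, 1.0, [1.0, 0.0, 0.0])
print(len(top))  # stride for a vertex with three varyings
```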

Finally, shaders, much like life, have some rules you need to follow if you don’t want to fail mysteriously. The full list can be gotten from the Videocore IV manual, but here are a few important ones that directly affect the examples in this blog post.

  • The last three instructions of any program (Thread End plus the following two delay-slot instructions) must not do varyings read, uniforms read or any kind of VPM, VDR, or VDW read or write.
  • The Thread End instruction must not write to either physical regfile A or B.
  • The Thread End instruction and the following two delay slot instructions must not write or read address 14 in either regfile A or B. The reason for this is kind of fun. The W value for the next pixel is set up in r14b during the current pixel’s two final delay slots. Then the LSB is flipped so it is accessed as r15b for the next pixel.
  • A scoreboard wait must not occur in the first two instructions of a fragment shader. This is either the explicit Wait for Scoreboard signal or an implicit wait with the first tile-buffer read or write instruction.
  • An instruction must not read from a location in physical regfile A or B that was written to by the previous instruction. To do this you must use the accumulators.
  • VPM cannot be accessed in Fragment shaders, because the FIFOs in the interface hardware are shared with the varyings interpolation system.
  • When a shader is started back-to-back with the preceding program, the interpolation can start as early as the Program End instruction of the previous program. For this reason, a fragment shader must finish reading varyings before issuing the Program End instruction. All of the set up varyings must be read before the shader completes.

With varyings and uniforms out of the way, you now have everything you need to do texturing.

First Triangle

Initializing the GPU and Framebuffer

It begins, like so many things, by initializing the GPU. We communicate with the GPU via the mailbox by writing a single word: the upper 28 bits hold the 16-byte-aligned address of the commands we want to send, and the lower 4 bits hold the destination mailbox channel.

// Run Tags To Initialize V3D
mov32 r0, (PERIPHERAL_BASE + MAIL_BASE + MAIL_WRITE)  ; mailbox
mov32 r1, TAGS_STRUCT    ; address of the commands to send
orr r1, #MAIL_TAGS       ; lower 4 bits are channel number
dmb                      ; data memory barrier
str r1, [r0]             ; write to the mailbox
dmb

.align 4 ; 16 bytes
TAGS_STRUCT:
	; total size of thing sent to the mailbox
	.word TAGS_END - TAGS_STRUCT
	.word 0x00000000
	; command to enable the QPU
	.word Enable_QPU
	.word 0x00000004
	.word 0x00000004
	.word 1
	; end of commands
	.word 0x00000000
TAGS_END:

We wait for the GPU to reply by polling the MAIL_STATUS register until the MAIL_EMPTY bit is no longer set, and then we keep reading from the MAIL_READ register until the FIFO is empty, being careful to check the lower four bits of the response for the expected channel. The GPU will return a message of 0x80000000 on success.
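The address-plus-channel packing is simple enough to model in Python. This sketch uses hypothetical helper names, and the channel number 8 in the example is an assumption standing in for whatever MAIL_TAGS is defined as in your headers.

```python
def make_mailbox_word(buffer_addr, channel):
    """Combine a 16-byte-aligned buffer address with a 4-bit channel number."""
    if buffer_addr & 0xF:
        raise ValueError("buffer address must be 16-byte aligned")
    if not 0 <= channel <= 15:
        raise ValueError("channel must fit in 4 bits")
    return buffer_addr | channel

def split_mailbox_word(word):
    """Separate a mailbox word back into (address, channel)."""
    return word & 0xFFFFFFF0, word & 0xF

word = make_mailbox_word(0x00010040, 8)
print(hex(word), split_mailbox_word(word))
```

The split helper mirrors the check described above: after reading MAIL_READ, compare the lower four bits against the channel you wrote to before trusting the response.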

Now that the GPU is initialized, we have to create a framebuffer. This is also done through the mailbox with the following commands:

.align 4
FB_STRUCT:
	.word FB_STRUCT_END - FB_STRUCT
	.word 0x00000000

	// Sequence Of Concatenated Tags
	.word Set_Physical_Display
	.word 0x00000008
	.word 0x00000000
	physical_display_x: .word SCREEN_X
	physical_display_y: .word SCREEN_Y

	.word Set_Virtual_Buffer
	.word 0x00000008
	.word 0x00000008
	virtual_display_x: .word SCREEN_X
	virtual_display_y: .word SCREEN_Y

	.word Set_Depth
	.word 0x00000004
	.word 0x00000004
	bits_per_pixel: .word BITS_PER_PIXEL

	.word Set_Virtual_Offset
	.word 0x00000008
	.word 0x00000008
	.word 0
	.word 0

	.word Get_VC_Memory
	.word 8
	.word 0
	vc_mem_base_addr: .word 0
	vc_mem_size: .word 0

	.word Get_ARM_Memory
	.word 8
	.word 0
	arm_mem_base_addr: .word 0
	arm_mem_size: .word 0

	.word Allocate_Buffer
	.word 0x00000008
	.word 0x00000008
	fb_ptr: .word 0
	fb_size: .word 0

; 0x0 (End Tag)
	.word 0x00000000
FB_STRUCT_END:

While getting the VC and ARM memory sizes and offsets is not necessary for initializing the framebuffer, it’s fun stuff to know. After running these mailbox tags and waiting for the GPU response, the address and size of the framebuffer will be written to fb_ptr and fb_size respectively.
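The tag buffer layout is regular enough to generate programmatically. Below is a Python sketch of a builder that produces the same shape: a size word, a request code, concatenated tags, and an end tag. The builder name is hypothetical, and the numeric tag IDs are my recollection of the Raspberry Pi firmware’s mailbox property interface documentation, so treat them as assumptions and check them against your own definitions of Set_Physical_Display and Set_Depth.

```python
import struct

# Assumed tag IDs (verify against the firmware documentation).
SET_PHYSICAL_DISPLAY = 0x00048003
SET_DEPTH = 0x00048005

def build_property_message(tags):
    """Build a mailbox property message from (tag_id, [values]) pairs."""
    words = [0, 0]  # total size placeholder, then request code 0
    for tag_id, values in tags:
        size = 4 * len(values)
        # tag id, value buffer size, request/response size, then the values
        words += [tag_id, size, size] + list(values)
    words.append(0)               # end tag
    words[0] = 4 * len(words)     # total size in bytes, like END - START
    return struct.pack("<%dI" % len(words), *words)

msg = build_property_message([
    (SET_PHYSICAL_DISPLAY, [640, 480]),
    (SET_DEPTH, [32]),
])
print(len(msg), struct.unpack_from("<I", msg)[0])
```

In real use the buffer also has to sit at a 16-byte-aligned address so its pointer fits in the upper 28 bits of the mailbox word.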

Command Buffers

This brings us to the real core of today’s entry: command buffers. Even a high level overview has so much to cover, so this is going to have to be another multi-part post. Today I’m focusing on the front end’s two threads: binning and rendering. The binning thread is responsible for setting up the tile binning mode configuration, supplying blocks of binning memory for the render thread command buffers, and specifying state data, shaders, and primitive lists. The rendering thread then goes through the commands generated by the binning thread and… well… renders stuff.

Binning and rendering threads
Figure N: The binning thread command buffer allocates 32-byte blocks of commands and inline geometry for each tile. Then the render thread command buffer does a call and return to execute each tile’s block. If more than 32 bytes are needed, the commands can contain jumps to other blocks.

The render thread goes through the following flow. For each tile XY, we add a Tile_Coords X, Y command. We then branch to the command list created for that particular tile by the binning thread. Finally we end with a Store_Tile command, and flush if it’s the last tile. Doing this for 80 tiles is super tedious, so through the magic of GNU assembler macros, I proudly present the lazy way:

// wanted format is 
;    Tile_Coordinates x, y
;    Branch_To_Sub_List TILE_ALLOC_ADDRESS + ((y * 10 + x) * 32)
;    Store_Multi_Sample / Store_Multi_Sample_End
.macro  BINNING_ENTRIES
    .set countery, 0
    .rept NUM_TILES_Y
        .set counterx, 0
        .rept NUM_TILES_X
            Tile_Coordinates counterx, countery
            Branch_To_Sub_List TILE_ALLOC_ADDRESS + ((countery * NUM_TILES_X + counterx) * 32)
            .if((counterx == (NUM_TILES_X-1)) && (countery == (NUM_TILES_Y-1)))
                Store_Multi_Sample_End
            .else
                Store_Multi_Sample
            .endif
            .set counterx, counterx + 1
        .endr
        .set countery, countery + 1
    .endr
.endm
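For anyone who finds the address arithmetic easier to read outside of assembler macros, here is the same calculation as a small C sketch. The function names are hypothetical, and the 64-pixel tile size is an assumption inferred from 80 tiles at ten tiles per row; the 32-byte block size comes from the text.

```c
#include <stdint.h>

#define TILE_SIZE        64u  /* assumed tile edge length in pixels */
#define TILE_BLOCK_BYTES 32u  /* initial per-tile block size */

/* Number of tiles needed to cover `pixels`, rounding up. */
uint32_t tiles_across(uint32_t pixels) {
    return (pixels + TILE_SIZE - 1) / TILE_SIZE;
}

/* Address of the block the binning thread allocated for tile (x, y),
 * mirroring the Branch_To_Sub_List operand in the macro. */
uint32_t tile_list_address(uint32_t tile_alloc_address,
                           uint32_t num_tiles_x,
                           uint32_t x, uint32_t y) {
    return tile_alloc_address + (y * num_tiles_x + x) * TILE_BLOCK_BYTES;
}

/* True for the final tile, which gets the end-of-frame store instead. */
int is_last_tile(uint32_t num_tiles_x, uint32_t num_tiles_y,
                 uint32_t x, uint32_t y) {
    return x == num_tiles_x - 1 && y == num_tiles_y - 1;
}
```

With a 640 by 480 target this gives a 10 by 8 grid, so the last tile's block sits 79 * 32 bytes past the start of the pool.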

Even if you ignore the messy macro, at least look at the comment above showing the three commands the render thread must execute for every tile. The Branch_To_Sub_List command branches to the command buffer created for the tile by the binning thread. Using this macro, the final render command buffer will look like this:

.align 2
RENDER_COMMAND_BUFFER_START:
     Wait_On_Semaphore

     Clear_Colors 0xFF00FFFF, 0, 0, 0
     
TILE_MODE_ADDRESS:
     Tile_Rendering_Mode_Configuration 0x00000000, SCREEN_X, SCREEN_Y, Frame_Buffer_Color_Format_RGBA8888
     
     Tile_Coordinates 0, 0
     Store_Tile_Buffer_General 0, 0, 0
     
     BINNING_ENTRIES
RENDER_COMMAND_BUFFER_END:

Ignoring the semaphore for now, we have a Clear_Colors command with the color 0xFF00FFFF, a Tile_Rendering_Mode_Configuration to set up the framebuffer address, screen dimensions, and color format, and finally our macro to generate all the per-tile branches. That 0x00000000 in Tile_Rendering_Mode_Configuration is for the framebuffer address, but since we don’t know the address at assemble time, we have to patch it in after initializing the GPU and setting up the framebuffer. Again, don’t forget to flush so the patched-in framebuffer address is visible to the GPU.
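The patch itself is only a four-byte store, but the destination is not word-aligned. A minimal C sketch, assuming the command is encoded as a single opcode byte followed immediately by the little-endian 32-bit address (so the address lives at TILE_MODE_ADDRESS + 1); the function names are hypothetical:

```c
#include <stdint.h>

/* Writes a 32-bit value one byte at a time in little-endian order,
 * which is safe even when dst is not word-aligned. */
void write_u32_le(uint8_t *dst, uint32_t value) {
    dst[0] = (uint8_t)(value & 0xFFu);
    dst[1] = (uint8_t)((value >> 8) & 0xFFu);
    dst[2] = (uint8_t)((value >> 16) & 0xFFu);
    dst[3] = (uint8_t)((value >> 24) & 0xFFu);
}

/* tile_mode_cmd points at the command's opcode byte, i.e. the
 * TILE_MODE_ADDRESS label. Assumes the address field follows directly. */
void patch_framebuffer_address(uint8_t *tile_mode_cmd, uint32_t fb_address) {
    write_u32_le(tile_mode_cmd + 1, fb_address);
}
```

Writing byte by byte sidesteps unaligned word stores, which may fault or misbehave depending on how the CPU is configured.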

Now let’s take a look at the binning thread command buffer format.

.align 2
CONTROL_LIST_BIN_STRUCT:
	Tile_Binning_Mode_Configuration TILE_ALLOC_ADDRESS, BIN_MEM_SIZE, TILE_STATE_DATA_ARRAY_ADDRESS, NUM_TILES_X, NUM_TILES_Y, Auto_Initialise_Tile_State_Data_Array
	Start_Tile_Binning
	Increment_Semaphore
	Clip_Window 0, 0, SCREEN_X, SCREEN_Y
	Configuration_Bits Enable_Forward_Facing_Primitive + Enable_Reverse_Facing_Primitive, Early_Z_Updates_Enable
	Viewport_Offset 0, 0
	NV_Shader_State NV_SHADER_STATE_RECORD
	Vertex_Array_Primitives Mode_Triangles, 9, 0
	Flush
CONTROL_LIST_BIN_END:

Most of this is pretty self-explanatory. We’re setting some state, starting binning, and flushing when done. That semaphore only increments when binning is done and everything is flushed, so if you put it after the flush it would never increment. The two most interesting things in this command buffer are the binning mode config and the NV shader state. TILE_ALLOC_ADDRESS and BIN_MEM_SIZE are the address and size of the memory pool the binning thread uses to create render command buffers. In experiments, the allocation block size seems to be 32 bytes, and the remaining size can be read from the BMPRS register. Out-of-memory conditions can be handled with an interrupt. TILE_STATE_DATA_ARRAY_ADDRESS is the address of the tile state data.

The NV in NV_Shader_State has nothing to do with Nvidia; it probably means something like “no vertex”. The chip has three pipeline modes:

  1. GL is your normal vert+frag thing
  2. NV mode has no vert shader and uses pre-shaded vertices stored in memory
  3. VG mode takes vertices directly from the input primitive list as XY coordinates only

Wanting to get something on screen sooner rather than later, I went with NV mode. Moving on to the NV shader state:

.align 4 // 128-Bit Align
NV_SHADER_STATE_RECORD:
	; Flag Bits:
	; 0 = Fragment Shader Is Single Threaded
	; 1 = Point Size Included In Shaded Vertex Data
	; 2 = Enable Clipping
	; 3 = Clip Coordinates Header Included In Shaded Vertex Data
	.byte 0                    ; Flag Bits (none set)
	.byte 3 * 4                ; Shaded Vertex Data Stride
	.byte 0                    ; Fragment Shader Num Uniforms (unused)
	.byte 0                    ; Fragment Shader Num Varyings
	.word FRAGMENT_SHADER_CODE ; Fragment Shader Code Address
	.word 0                    ; Fragment Shader Uniforms Address
	.word VERTEX_DATA          ; Shaded Vertex Data Address

.align 4 ; 128-Bit Align
VERTEX_DATA:
	; Vertex: Top
	.hword 320 * 16            ; X In 12.4 Fixed Point
	.hword  32 * 16            ; Y In 12.4 Fixed Point
	.single 0e1.0              ; Z
	.single 0e1.0              ; 1 / W
	
	// Vertex: Bottom Left
	.hword  32 * 16            ; X In 12.4 Fixed Point
	.hword 448 * 16            ; Y In 12.4 Fixed Point
	.single 0e1.0              ; Z
	.single 0e1.0              ; 1 / W
	
	// Vertex: Bottom Right
	.hword 608 * 16            ; X In 12.4 Fixed Point
	.hword 448 * 16            ; Y In 12.4 Fixed Point
	.single 0e1.0              ; Z
	.single 0e1.0              ; 1 / W

.align 4 ; 128-Bit Align
FRAGMENT_SHADER_CODE:
	; Fill Color Shader
	.word 0x009E7000 ;
	.word 0x100009E7 ; nop // nop // nop
	
	.word 0xFFFFFFFF ; RGBA White
	.word 0xE0020BA7 ; ldi tlbc, 0xFFFFFFFF
	.word 0x009E7000 ;
	.word 0x500009E7 ; nop // nop // sbdone
	.word 0x009E7000 ;
	.word 0x300009E7 ; nop // nop // thrend
	
	.word 0x009E7000 ;
	.word 0x100009E7 ; nop // nop // nop
	.word 0x009E7000 ;
	.word 0x100009E7 ; nop // nop // nop
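If the vertices are generated at run time instead of being assembled in, the 12-byte record from VERTEX_DATA can be packed in C. This is only a sketch: the function names are hypothetical, the conversion truncates rather than rounds, and the byte order assumes a little-endian CPU.

```c
#include <stdint.h>
#include <string.h>

#define NV_VERTEX_STRIDE 12  /* 2 + 2 + 4 + 4 bytes, matching 3 * 4 above */

/* Converts a pixel coordinate to signed 12.4 fixed point (truncating). */
int16_t to_fixed_12_4(float pixels) {
    return (int16_t)(pixels * 16.0f);
}

/* Packs one pre-shaded vertex: X and Y as 12.4 fixed point, then Z and
 * 1/W as 32-bit floats. Uses memcpy so alignment never matters. */
void pack_nv_vertex(uint8_t out[NV_VERTEX_STRIDE],
                    float x, float y, float z, float inv_w) {
    int16_t fx = to_fixed_12_4(x);
    int16_t fy = to_fixed_12_4(y);
    memcpy(out + 0, &fx, sizeof fx);
    memcpy(out + 2, &fy, sizeof fy);
    memcpy(out + 4, &z, sizeof z);
    memcpy(out + 8, &inv_w, sizeof inv_w);
}
```

Packing the top vertex, for example, stores 320 * 16 = 5120 in the X field, the same value the `.hword 320 * 16` line produces.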

Don’t worry too much about that frag shader, as I have another post coming up that dives deep into the details of writing them. So now we have built two command buffers, but how do we submit them? The GPU has two registers per thread, one for the command buffer start address and another for the end address, and execution continues as long as the start address and end address are not equal. For binning thread 0, these are CT0CA and CT0EA respectively. Likewise for render thread 1, it’s CT1CA and CT1EA.
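Submission can be sketched in C as two register writes per thread. The byte offsets below are my recollection of the V3D register layout and should be treated as assumptions to check against the documentation, and the function names are hypothetical. The sketch takes the register block as a pointer so it can be exercised against a plain array on a host.

```c
#include <stdint.h>

/* Word indices into the V3D register block (assumed byte offsets / 4). */
enum {
    V3D_CT0EA = 0x108 / 4,  /* thread 0 (binning) end address */
    V3D_CT1EA = 0x10C / 4,  /* thread 1 (render) end address */
    V3D_CT0CA = 0x110 / 4,  /* thread 0 current (start) address */
    V3D_CT1CA = 0x114 / 4   /* thread 1 current (start) address */
};

/* Writes the start address first and the end address second, so the
 * thread only sees start != end once both values are in place. */
void submit_control_list(volatile uint32_t *v3d, int thread,
                         uint32_t start, uint32_t end) {
    v3d[thread == 0 ? V3D_CT0CA : V3D_CT1CA] = start;
    v3d[thread == 0 ? V3D_CT0EA : V3D_CT1EA] = end;
}

/* A thread has work left while its current and end addresses differ. */
int thread_has_work(const volatile uint32_t *v3d, int thread) {
    uint32_t current = v3d[thread == 0 ? V3D_CT0CA : V3D_CT1CA];
    uint32_t end     = v3d[thread == 0 ? V3D_CT0EA : V3D_CT1EA];
    return current != end;
}
```

For the binning list above, the call would pass the addresses of CONTROL_LIST_BIN_STRUCT and CONTROL_LIST_BIN_END for thread 0.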

There are a few ways of synchronizing the threads. If you’re lazy, you can have the CPU wait on BMFCT, which is incremented when the binning thread flushes all tile lists to memory. When the count is right, you can let the CPU go on to kick the rendering thread. On the other side, RMFCT is incremented whenever the last tile store is completed. Making the CPU spin wait on these is probably a very bad idea for performance. Slightly better is kicking the rendering thread when a binning flush interrupt happens. An even better way is to use semaphores (see above). There seem to be two front end semaphores, one that the render thread waits on and the binning thread increments, and another that the binning thread waits on and the render thread increments. This is a great way to stop either thread from getting too far ahead of the other. There are also markers, but that’s a topic for another day.
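For completeness, the lazy spin-wait option might be sketched like this in C. The function name is hypothetical, and treating the counter as the low 8 bits of its register is an assumption. The spin limit is there so a hung GPU cannot hang the CPU as well.

```c
#include <stdint.h>

#define FLUSH_COUNT_MASK 0xFFu  /* assumption: count is the low 8 bits */

/* Polls a counter register (BMFCT or RMFCT) until it equals `expected`,
 * giving up after max_spins reads. Returns 1 on success, 0 on timeout. */
int wait_for_count(const volatile uint32_t *count_reg,
                   uint32_t expected, uint32_t max_spins) {
    for (uint32_t i = 0; i < max_spins; i++) {
        if ((*count_reg & FLUSH_COUNT_MASK) ==
            (expected & FLUSH_COUNT_MASK)) {
            return 1;
        }
    }
    return 0;
}
```

This burns CPU cycles for the entire wait, which is exactly the performance problem mentioned above, so it is best kept for bring-up and debugging.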

Shoutouts and props

Andrew, Colin, and Neil from Codeplay, Graham Wihlidal (GrahamBox system architect), Peter Lemon whose examples and header files saved me hours of manually typing MMIO register offsets, and all the poor guinea pigs who had to proofread this: Tom Forsyth (@tom_forsyth), @RapidGS, Jason Proctor