Vertex and Coordinate Shaders

Disclaimer: I do not work for Sony, despite the disturbing percentage of my shirts, jackets, and bookbags that are PlayStation dev-related. I do, however, have many friends that work at Sony, some of whom I hope will call off the corporate lawyers. JayStation is in no way associated with Sony or PlayStation, and any stupid things I say represent only my own ineptitude and silliness.

Once NV mode rendering is working, moving to vertex and coordinate shaders is fairly easy. If you’ve read the posts on GPU init, shaders, uniforms, and varyings, texturing, and VPM, you should have almost everything you need to get started. The main change will be replacing the NV shader state with a GL shader state, which consists of a fixed-length segment describing the shaders, followed by a variable-length part that configures the vertex array streams.

Vertex Array Streams

These allow us to specify up to eight different sources for vertex attributes, and describe how the data is to be laid out in VPM for the shaders to read.

Stream size, stride, VPM offset, and total attributes size

In the above image each capital letter A through F represents a vertex attribute, and subscripts indicate vertex numbers (i.e. B1 is vertex 1’s B attribute). There are three streams containing attributes [A,B], [C,D,E], and [F] respectively. As discussed in a previous post, each attribute gets packed in a horizontal row in VPM, up to sixteen vertices across, such that each entry in that row is the same attribute but for a different vertex.

When setting up the GL shader state record, there are a few per-stream fields that need to be set. First is the stream size minus one. Using stream 0 as an example, and assuming all attributes are word-sized, this is sizeof(A) + sizeof(B) - 1, or 7 bytes. This is different from the Total Attributes size on the right side of the image, which shows the total VPM reserved to store all attributes from all streams (6 attributes * 4 bytes = 24 bytes).

Next is the stride, or distance between subsequent vertices. The stride for stream 0 is 12, because there are 12 bytes between A0 and A1. Often the size and stride will be the same, but allowing them to be different means we can leave gaps, skip unwanted attributes, interleave, and do some other cool tricks without modifying the actual vertex data.

After that is the VPM offset, indicated by the arrows, which defines what offset into a shader’s VPM block a stream’s data will DMA to. These offsets can be set independently for vertex and coordinate shaders, giving you a bit more flexibility to arrange things differently for different stages.

Finally, we need the base address of the stream.
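To make the size and stride arithmetic concrete, here is a small sketch. Python is only being used as a calculator, and the helper names `size_minus_one` and `stride` are my own, not part of any toolchain.

```python
def size_minus_one(attr_sizes):
    """Value for the 'Number of Bytes-1' field: bytes fetched per vertex, minus one."""
    return sum(attr_sizes) - 1


def stride(attr_sizes, gap=0):
    """Distance in bytes between consecutive vertices in memory.

    gap is any padding or skipped data between vertices (0 = tightly packed).
    """
    return sum(attr_sizes) + gap


# A stream carrying two word-sized attributes, like [A, B] above.
print(size_minus_one([4, 4]))  # 7

# The same stream with one unused word after every vertex:
# the size is unchanged, only the stride grows.
print(size_minus_one([4, 4]), stride([4, 4], gap=4))  # 7 12
```

The point of keeping the two values separate is that the GPU only fetches `size` bytes per vertex but steps through memory by `stride`, so padding or unwanted attributes never reach VPM.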

In my samples, I am using two vertex streams. One is only used by the vertex shader, and so is in the PSE-expected format. The other is only used by the coordinate shader, and therefore is in the PTB-expected format.

.align 6 ; PSE format
VERTEX_DATA_FOR_VERTSHADER:
	; Vertex: Top
	.hword 320 * 16 ; X In 12.4 Fixed Point
	.hword  32 * 16 ; Y In 12.4 Fixed Point
	.single 0e1.0   ; Z
	.single 0e1.0   ; 1 / W
	
	; Vertex: Bottom Left
	.hword  32 * 16 ; X In 12.4 Fixed Point
	.hword 448 * 16 ; Y In 12.4 Fixed Point
	.single 0e1.0   ; Z
	.single 0e1.0   ; 1 / W
	
	...

.align 6 ; PTB format
VERTEX_DATA_FOR_COORDSHADER:
	; Vertex: Top
	.single 0.00156494522	; Xc
	.single -0.86638830897  ; Yc
	.single 1.0             ; Zc
	.single 1.0             ; Wc
	.hword 320 * 16         ; X In 12.4 Fixed Point
	.hword  32 * 16         ; Y In 12.4 Fixed Point
	.single 0e1.0           ; Z
	.single 0e1.0           ; 1 / Wc
	
	; Vertex: Bottom Left
	.single -0.89984350547  ; Xc
	.single 0.87056367432   ; Yc
	.single 1.0             ; Zc
	.single 1.0             ; Wc
	.hword  32 * 16         ; X In 12.4 Fixed Point
	.hword 448 * 16         ; Y In 12.4 Fixed Point
	.single 0e1.0           ; Z
	.single 0e1.0           ; 1 / W

	...
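As a sanity check on the two layouts, the following sketch packs one vertex in each format and confirms the byte counts. It assumes little-endian memory and signed 12.4 halfwords; the function names are hypothetical helpers of mine.

```python
import struct


def to_12_4(pixels):
    """Pixel coordinate -> 12.4 fixed point (four fractional bits)."""
    return int(round(pixels * 16))


def pse_vertex(xs, ys, zs, inv_w):
    """Screen X/Y as two 12.4 halfwords, then Z and 1/W as f32: 12 bytes."""
    return struct.pack("<hhff", to_12_4(xs), to_12_4(ys), zs, inv_w)


def ptb_vertex(xc, yc, zc, wc, xs, ys, zs, inv_w):
    """Clip X/Y/Z/W as f32, followed by the same 12 bytes as above: 28 bytes."""
    return struct.pack("<ffff", xc, yc, zc, wc) + pse_vertex(xs, ys, zs, inv_w)


bottom_left = pse_vertex(32, 448, 1.0, 1.0)
print(len(bottom_left))                                   # 12
print(len(ptb_vertex(-0.8998, 0.8706, 1.0, 1.0, 32, 448, 1.0, 1.0)))  # 28
# X lands in the low halfword and Y in the high halfword of the first word.
print(hex(struct.unpack_from("<I", bottom_left)[0]))      # 0x1c000200
```

Those two lengths, 12 and 28, are the per-vertex sizes the stream descriptions have to agree with.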

With the above two streams in mind, the GL shader state record’s first two streams should be configured as follows, with the other six streams optionally set to zero:

; vert array slot 0: verts for the vert shader
.word VERTEX_DATA_FOR_VERTSHADER ; bytes 36–39 : stream 0 Addr
.byte 11        ; byte 40 : stream 0 Number of Bytes-1
.byte 12        ; byte 41 : stream 0 Memory Stride
.byte 0         ; byte 42 : stream 0 Vert Shader VPM Offset
.byte 0         ; byte 43 : stream 0 Coord Shader VPM Offset

; vert array slot 1: verts for the coord shader
.word VERTEX_DATA_FOR_COORDSHADER ; bytes 44–47 : stream 1 Addr
.byte 27        ; byte 48 : stream 1 Number of Bytes-1
.byte 28        ; byte 49 : stream 1 Memory Stride
.byte 0         ; byte 50 : stream 1 Vert Shader VPM Offset
.byte 0         ; byte 51 : stream 1 Coord Shader VPM Offset

The VPM offset is zero because only one of the streams will be enabled per shader type, and I want both streams to start at the beginning of VPM.
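If you build the record from a host-side tool rather than the assembler, each stream entry is just eight bytes appended after the 36-byte fixed part. A rough sketch, assuming little-endian packing; `pack_stream_entry` and the address 0x1000 are made up for illustration.

```python
import struct

FIXED_PART_BYTES = 36   # frag, vert, and coord shader fields (bytes 0-35)
STREAM_ENTRY_BYTES = 8  # address (4) + size-1, stride, two VPM offsets (1 each)


def stream_entry_offset(index):
    """Byte offset of stream `index` inside the GL shader state record."""
    return FIXED_PART_BYTES + STREAM_ENTRY_BYTES * index


def pack_stream_entry(addr, vertex_bytes, stride, vs_vpm_offset=0, cs_vpm_offset=0):
    """One vertex array slot, in the same field order as the listing above."""
    return struct.pack("<IBBBB", addr, vertex_bytes - 1, stride,
                       vs_vpm_offset, cs_vpm_offset)


print(stream_entry_offset(0), stream_entry_offset(1))  # 36 44
print(pack_stream_entry(0x1000, 12, 12).hex())         # 001000000b0c0000
```

The offsets 36 and 44 line up with the byte ranges in the listing's comments.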

Shady Behavior

Unfortunately, even though I chose my vertex shader inputs to be the same format as the expected outputs, we can’t just have a shader that does nothing. Every attribute must be read from and written to VPM exactly once, or undefined behavior will occur. And by undefined behavior, I mean your triangle will come out randomly looking something like this.

As a result of this restriction, that beautiful short NOP shader you thought you could use will have to read and write all attributes, and therefore transform into something like this:


.align 4
VERT_CODE:
	; vert shader does nothing, VPM in == VPM out
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

	; must read all attributes from VPM exactly once
	; horizontal read,  elem size 32, stride 1, num 3
	; y = 0

	; VPM read setup fields:
	; add write addr: 49 (VPMVCD_RD_SETUP A), cond: always
	; write swap: 0, set flags: 0, pm: 0
	; pack mode: 32->32 No pack (NOP) (PM0)
	; immediate type is 0x70, loaded 32 immediate is 0x1A341AC0
	.word 0x1A341AC0, 0xE0020C67

	; ready to read in 3... 2... 1...
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7
	.word 0x9E7000, 0x100009E7
	.word 0x9E7000, 0x100009E7

	; r0a = read screen xy (12.4 x2)
	; add pipe: Bitwise OR, R0, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020027

	; r1a = read Zs
	; add pipe: Bitwise OR, R1, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020067

	; r2a = read 1/W
	; add pipe: Bitwise OR, R2, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x100200A7

	; must write all attributes to VPM exactly once
	; horizontal write,  elem size 32, stride 1
	; y = 0

	; VPM write setup fields:
	; add write addr: 49 (VPMVCD_WR_SETUP B), cond: always
	; write swap: 1, set flags: 0, pm: 0
	; pack mode: 32->32 No pack (NOP) (PM0)
	; immediate type is 0x70, loaded 32 immediate is 0x17BC1AC0
	.word 0x17BC1AC0, 0xE0021C67

	; write screen xy (12.4 x2)
	; add pipe: Bitwise OR, VPM_WRITE, R0, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15027DF7, 0x10020C27

	; write Zs
	; add pipe: Bitwise OR, VPM_WRITE, R1, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15067DF7, 0x10020C27

	; write 1/W
	; add pipe: Bitwise OR, VPM_WRITE, R2, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x150A7DF7, 0x10020C27

	; scoreboard done
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: scoreboard unlock
	.word 0x9E7000, 0x500009E7

	; thread end
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: program end
	.word 0x9E7000, 0x300009E7

	; branch delay NOP 1
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

	; branch delay NOP 2
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

.align 4
COORD_CODE:
	; coord shader does nothing, VPM in == VPM out
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

	; must read all 7 attributes from VPM exactly once
	; horizontal read,  elem size 32, stride 1, num 7
	; y = 0

	; VPM read setup fields:
	; add write addr: 49 (VPMVCD_RD_SETUP A), cond: always
	; write swap: 0, set flags: 0, pm: 0
	; pack mode: 32->32 No pack (NOP) (PM0)
	; immediate type is 0x70, loaded 32 immediate is 0x1A741AC0
	.word 0x1A741AC0, 0xE0020C67

	; ready to read in 3... 2... 1...
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7
	.word 0x9E7000, 0x100009E7
	.word 0x9E7000, 0x100009E7

	; r0a = read clip X
	; add pipe: Bitwise OR, R0, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020027

	; r1a = read clip Y
	; add pipe: Bitwise OR, R1, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020067

	; r2a = read clip Z
	; add pipe: Bitwise OR, R2, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x100200A7

	; r3a = clip W
	; add pipe: Bitwise OR, R3, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x100200E7

	; r4a = read screen xy (12.4 x2)
	; add pipe: Bitwise OR, R4, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020127

	; r5a = read Zs
	; add pipe: Bitwise OR, R5, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020167

	; r6a = read 1/W
	; add pipe: Bitwise OR, R6, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x100201A7

	; must write all 7 attributes to VPM exactly once
	; horizontal write,  elem size 32, stride 1
	; y = 0

	; VPM write setup fields:
	; add write addr: 49 (VPMVCD_WR_SETUP B), cond: always
	; write swap: 1, set flags: 0, pm: 0
	; pack mode: 32->32 No pack (NOP) (PM0)
	; immediate type is 0x70, loaded 32 immediate is 0x17BC1AC0
	.word 0x17BC1AC0, 0xE0021C67

	; write clip X
	; add pipe: Bitwise OR, VPM_WRITE, R0, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15027DF7, 0x10020C27

	; write clip Y
	; add pipe: Bitwise OR, VPM_WRITE, R1, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15067DF7, 0x10020C27

	; write clip Z
	; add pipe: Bitwise OR, VPM_WRITE, R2, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x150A7DF7, 0x10020C27

	; write clip W
	; add pipe: Bitwise OR, VPM_WRITE, R3, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x150E7DF7, 0x10020C27

	; write screen xy (12.4 x2)
	; add pipe: Bitwise OR, VPM_WRITE, R4, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15127DF7, 0x10020C27

	; write Zs
	; add pipe: Bitwise OR, VPM_WRITE, R5, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15167DF7, 0x10020C27

	; write 1/W
	; add pipe: Bitwise OR, VPM_WRITE, R6, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x151A7DF7, 0x10020C27

	; scoreboard done
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: scoreboard unlock
	.word 0x9E7000, 0x500009E7

	; thread end
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: program end
	.word 0x9E7000, 0x300009E7

	; branch delay NOP 1
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

	; branch delay NOP 2
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

 

This brings us to the part of the GL shader state record that describes the actual shaders.

.align 4
GL_SHADER_STATE_RECORD:
	.hword 4        ; bytes 0–1 : flag bits, enable clipping

	; stuff describing frag shader
	.byte 0         ; byte 2 : Frag Shader Number of Uniforms
	.byte 0         ; byte 3 : Frag Shader Number of Varyings
	.word FRAG_CODE ; bytes 4–7 : Frag Shader Code Address
	.word 0         ; bytes 8–11 : Frag Shader Uniforms Address

	; stuff describing vert shader
	.hword 0                ; bytes 12–13 : Number of Uniforms
	.byte 1                 ; byte 14 : Stream select mask
	.byte 12                ; byte 15 : Total Attributes Size
	.word VERT_CODE         ; bytes 16–19 : Code Address
	.word VERT_UNIFORM_DATA ; bytes 20–23 : Uniforms Address

	; stuff describing coord shader			     
	.hword 0                 ; bytes 24–25 : Num Uniforms
	.byte 2                  ; byte 26 : Stream select mask
	.byte 28                 ; byte 27 : Total Attributes Size
	.word COORD_CODE         ; bytes 28–31 : Code Address
	.word COORD_UNIFORM_DATA ; bytes 32–35 : Uniforms Address

The shader microcode addresses, uniforms addresses, and number of uniforms and varyings should be familiar from the NV shader state record. What's new are the fields for total attribute size and stream select. Total attribute size tells the system how much VPM to reserve for all attributes across all streams. This is not the same as the per-stream size we set up in the stream descriptions.

Stream select is an 8-bit mask where each bit enables or disables one of the eight vertex array streams. The vert shader mask is set to 0b00000001 because it only uses the first stream, and the coordinate shader mask is set to 0b00000010 because it only uses the second.
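Here is a minimal sketch of how the stream select mask and the 12-byte vert/coord blocks could be produced by a host-side tool. It assumes little-endian packing and follows the field order in the listing; the function names and the code address 0x2000 are placeholders of mine.

```python
import struct


def stream_mask(stream_indices):
    """Build the 8-bit stream select mask from a list of stream numbers."""
    mask = 0
    for i in stream_indices:
        if not 0 <= i < 8:
            raise ValueError("only streams 0..7 exist")
        mask |= 1 << i
    return mask


def pack_vertex_stage(num_uniforms, streams, total_attr_bytes, code_addr, uniforms_addr):
    """12-byte vert or coord shader block, laid out as in the listing."""
    return struct.pack("<HBBII", num_uniforms, stream_mask(streams),
                       total_attr_bytes, code_addr, uniforms_addr)


print(format(stream_mask([0]), "08b"))  # 00000001
print(format(stream_mask([1]), "08b"))  # 00000010
print(len(pack_vertex_stage(0, [0], 12, 0x2000, 0)))  # 12
```

A shader that pulls from several streams would simply pass more indices, for example `stream_mask([0, 2])`, and its total attribute size would be the sum of those streams' sizes.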

Kicking It All Off

The only remaining change to make is in the binning command list. The NV shader state command (0x41) is replaced with the GL version (0x40). I am using a macro that looks like this:

; Control ID Code 64: GL Shader State
; bits      offs   desc
;  28        4     GL Shader Record addr in 16 byte blocks
;   1        3     Extended shader record
;   3        0     Num attribute arrays (0 = all 8).
.macro GL_Shader_State address numarrays
	.byte 0x40
	.word (\address | \numarrays)
.endm

The address of the shader state record must be 16-byte aligned as the least significant 4 bits are used to toggle extended shader record, and for specifying the number of vertex array streams to activate. This stream count also determines the length of the GL shader state record, because you must have at least this many stream descriptions.
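The same encoding, sketched outside the assembler with the alignment check made explicit. This is an illustration based on the bit layout in the macro comments; `gl_shader_state` and the record address are hypothetical.

```python
GL_SHADER_STATE = 0x40  # control ID code 64


def gl_shader_state(record_addr, num_streams, extended=False):
    """Encode the 5-byte GL Shader State command (little-endian word assumed)."""
    if record_addr & 0xF:
        raise ValueError("shader state record must be 16-byte aligned")
    if not 1 <= num_streams <= 8:
        raise ValueError("num_streams must be 1..8")
    # Bits 0-2 hold the stream count; 8 wraps to 0, which means "all 8".
    low_bits = (num_streams & 0x7) | (0x8 if extended else 0)
    word = record_addr | low_bits
    return bytes([GL_SHADER_STATE]) + word.to_bytes(4, "little")


print(gl_shader_state(0x00010040, 2).hex())  # 4042000100
```

With two streams active the record must be at least 36 + 2 * 8 = 52 bytes long, since each active stream needs its own description.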

That’s all there is. If you set everything up correctly, you should have the same triangle you had in NV mode, but it should now take over 3x longer to render. Enjoy your newfound freedom.

This week, extra special thanks goes out to Zyte for taking the time to remind me about that VPM restriction. He’s doing some very cool bare metal GPU work, so be sure to see what he’s up to here.

VPM, VCD, and VCM For Vertex Shaders

This will be the final piece of the puzzle before finally moving on to Vertex Shaders.

VPM… Why?

They say a picture is worth a thousand words. The following image probably has a thousand words appearing in it, therefore making it worth 10^3000 words, a very valuable image indeed.

That’s a lot to take in, but try to focus on the following simplified GL mode (as opposed to NV) flow.

  1. The Vertex Cache Manager (VCM) gets vertices from *somewhere*
  2. The Vertex Cache DMA (VCD) then DMAs the vertices to Vertex Pipe Memory (VPM)
  3. Your vert shader reads these verts from VPM and transforms them
  4. Your shader writes these verts back to VPM in a certain expected format
  5. The verts are then read directly by the Primitive Tile Binner (PTB) or Primitive Setup Engine (PSE)

And that’s how triangles are made! Sort of. The ambiguity in steps 1, 4, and 5 arises because there are two vertex shader-like things: full vertex shaders and coordinate shaders. Full vertex shaders transform verts and export them, along with the varyings to be interpolated, to the PSE (Primitive Setup Engine) for eventual rasterization/interpolation. Coordinate shaders are like mini versions of your full vertex shader whose only job is to output what's needed for the PTB (Primitive Tile Binner) to determine which primitives touch which tiles, so that the render lists discussed in a previous post can be generated by the binning thread. This makes a bit more sense when you look at the expected output formats.

The Primitive Setup Engine cares very deeply about varyings

The Primitive Tile Binner only cares about what it needs to determine what tiles a primitive touches

So in a nutshell, VPM is there so your vert-like shaders have a place to read in the vert they are processing, and a place to output the expected format for whatever stage they are.

VPM… What Can It Do For Me?

Before dropping a bunch of shader code on you, I want to visually show the kinds of magical things VPM can do. The GPU wave size is 16 threads, and so 16 will always be the number of things written, be it 16 bytes, halfwords, or words. In 32-bit mode, you write out 16 words, either as horizontal rows or vertical columns. With 8 and 16-bit modes, you have the option to use packed or laned. 8-bit packed means your wave writes out 16 contiguous bytes to some location, while laned would write one byte to a specified lane for 16 contiguous words. 16-bit modes are also available. Let’s look at a visual representation of the horizontal modes.

Here VPM is a block of 16×64 words, with the words going from left to right, and the 4 bytes of each word depicted vertically. If a wave has 16 threads and each thread writes one word, the max you can write is 16 words. This is why the X-axis labeling goes from words 0 to 15.  The Y-axis has two labels: one to tell you what row you are on [0..63] and the other to identify each of the 4 bytes (or lanes) in a word.

Let’s start with the easiest case: 32-bit (e). In 32-bit mode, every wave thread writes out one word, left to right, to a specified row. Here Y = row 63, so that’s where we write. The numbers represented in each word tell you which thread writes it, hence the word processed by thread 0 would contain {0,0,0,0}, and so on.

Next up is 8-bit laned mode (b). Y=1, so we will be using the words in row 1. The B parameter specifies which lane of each word to write, and is 1 here. Thread 0 will take the first word from row 1 and write some byte to lane 1, thread 1 will take the second word from row 1 and write some byte to lane 1, and so on. 16-bit laned modes function similarly, but instead of writing one byte to lane 0, 1, 2, or 3, each word is partitioned into two 16-bit lanes, and you can write a halfword to either lane 0 or 1 (c).

Finally we have packed modes. Y still tells you what word row to start with, but now instead of B we have the block number parameter H. If your 16 threads were to write one byte each in packed mode, you would only write 16 bytes, or enough to fill a block of four words. However, with 16 words per row, there are four of these blocks. The H parameter controls what block you address. H=0 accesses the 16 bytes in words [0..3], H=1 accesses the 16 bytes in words [4..7], and in the example above H=2 accesses the 16 bytes in words [8..11]. Again, 16-bit modes function similarly, but with each thread writing out twice the data, only two blocks are possible. H=0 accesses the 32 bytes at words [0..7] and H=1 accesses the 32 bytes at words [8..15].
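The horizontal modes described above can be modelled in a few lines. This is only a sketch of the description in this post, not of the hardware: it assumes threads fill a packed block in ascending byte order and that lane 0 is the lowest byte of a word, and `horizontal_targets` is a name I made up.

```python
def horizontal_targets(mode, y, sel=0):
    """List, per thread 0..15, the (row, word, first_byte, n_bytes) it touches.

    mode: "32", "8-laned", "16-laned", "8-packed", or "16-packed"
    sel:  B (byte lane), the 16-bit lane, or block number H, depending on mode
    """
    out = []
    for t in range(16):
        if mode == "32":
            out.append((y, t, 0, 4))                      # one whole word per thread
        elif mode == "8-laned":
            out.append((y, t, sel, 1))                    # byte lane B of word t
        elif mode == "16-laned":
            out.append((y, t, 2 * sel, 2))                # halfword lane of word t
        elif mode == "8-packed":
            out.append((y, 4 * sel + t // 4, t % 4, 1))   # 16 bytes in block H
        elif mode == "16-packed":
            out.append((y, 8 * sel + t // 2, 2 * (t % 2), 2))  # 32 bytes in block H
        else:
            raise ValueError(mode)
    return out


# 8-bit packed with H=2 should land in words 8..11 of the chosen row.
print(sorted({word for _, word, _, _ in horizontal_targets("8-packed", 0, sel=2)}))
# [8, 9, 10, 11]
```

Laned modes always spread across all 16 words of the row, while packed modes squeeze the same 16 values into a quarter or half of it.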

And then there are the vertical modes. You still have a block of 16×64 words, but in this diagram the four lanes of each word are represented horizontally. This visually makes much more sense when you start talking about lanes in a word.

All the old concepts still apply. You have 32-bit mode, as well as 8 and 16-bit laned, and packed modes. However now, instead of block parameter H sliding your start offset left and right, it slides it up and down. This puts a new constraint on Y to save bits, avoid redundancy, and make address incrementing more intuitive. Looking at 8-bit packed mode (a), you could get the exact same start address specifying Y=4 as your starting row as you could by specifying Y=0 and block H=1. Therefore in vertical modes, Y values are encoded as multiples of 16. There is now also an X parameter to the address that specifies what column you are going down.

Programming VPM and VCD

There are a few VPM and VCD related registers in the QPU regfile.

VPM_READ and VPM_WRITE are for reading and writing values to and from VPM. However, before doing so you must first program the appropriate setup register (VPMVCD_RD_SETUP or VPMVCD_WR_SETUP) with things like VPM address, access mode, and stride. The *_BUSY and *_WAIT  registers are for either polling the status of or waiting on VCD DMA operations to complete. VPM_LD_ADDR and VPM_ST_ADDR are the load and store addresses for VCD DMA operations.

VPMVCD_RD_SETUP and VPMVCD_WR_SETUP are programmed with the following bits. If you understood the vertical and horizontal memory layout diagrams, these register fields should be fairly self explanatory.

SIZE, LANED, and HORIZ are the fields controlling bits per write, laned/packed mode, and whether you are reading and writing in horizontal mode or vertical. ADDR encodes the X, Y, B, and H address bits, however there are a few errors in the manual. For example

Horizontal 8-bit: ADDR[7:0] = {Y[5:0], B[1:0]}
↓↓↓↓↓↓↓↓
Horizontal 8-bit: ADDR[7:0] = {Y[7:2], B[1:0]}

The number of bits is correct but the starting offset is wrong in a few cases.

You can program the above registers once and then read or write VPM repeatedly. This is where stride comes in. Stride is just the amount that is added to the address field after each read or write.

Finally, reads require an additional parameter, NUM, that specifies how many vectors you intend to read from VPM. Reading anything other than the specified amount can lead to garbage or hangs. After you program a read setup, you must wait a minimum of three QPU cycles before trying to read the requested data from VPM. After three cycles, any read will stall at the SP pipeline stage until the data is ready, but waiting fewer than three cycles will return garbage and continue.
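To check a setup word by hand, a small decoder helps. The bit positions below are my reading of the setup register layout (NUM in bits 23:20, STRIDE in 17:12, HORIZ in bit 11, LANED in bit 10, SIZE in 9:8, ADDR in 7:0), so treat them as an assumption. For 32-bit horizontal access I am also assuming only the low six ADDR bits select the row.

```python
def decode_vpm_setup(word, is_read):
    """Split a VPM read/write setup word into its fields (assumed layout)."""
    fields = {
        "stride": (word >> 12) & 0x3F,
        "horiz": (word >> 11) & 1,
        "laned": (word >> 10) & 1,
        "size": (word >> 8) & 3,   # 0 = 8-bit, 1 = 16-bit, 2 = 32-bit
        "addr": word & 0xFF,
    }
    if is_read:
        fields["num"] = (word >> 20) & 0xF
    return fields


def row_32bit_horizontal(fields):
    """Row Y for a 32-bit horizontal access (low six ADDR bits, assumed)."""
    return fields["addr"] & 0x3F


# Setup immediates taken from the shaders in these posts.
rd = decode_vpm_setup(0x1A341AC0, is_read=True)
wr = decode_vpm_setup(0x17BC1AC2, is_read=False)
print(rd["num"], rd["stride"], rd["horiz"], rd["size"], row_32bit_horizontal(rd))
print(wr["stride"], wr["horiz"], wr["size"], row_32bit_horizontal(wr))
```

Under those assumptions the first word decodes to a horizontal 32-bit read of 3 vectors with stride 1 starting at row 0, and the second to a horizontal 32-bit write with stride 1 starting at row 2, which agrees with the comments next to those immediates.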

In the case of vert shaders, you shouldn’t have to manually program the VCD DMA config yourself. The system ensures that the data is ready in VPM before your vert shader instance runs, and it will DMA the output data from VPM to the units that need it automatically when your shader finishes. However, if you are going to be doing compute-like shaders via user programs, you must manually DMA things. Examples of how to do this are in the next section.

Sample Coordinate Shader

Here is a sample compute shader I wrote to test what coordinate shaders will eventually write to VPM. It takes in vertices of the same format as all my other samples, except that the “verts” are specified via uniforms.

; VPM in for vert attributes:
;	0) screen space X,Y in 12.4 fixed point
;	1) screen space Z in f32
;	2) clip space 1/W in f32
FAKE_VERT_UNIFORM_DATA:
	.hword  32 * 16		; X In 12.4 Fixed Point
	.hword 448 * 16		; Y In 12.4 Fixed Point
	.single 0e1.0		; Z
	.single 0e1.0		; 1 / W

The shader then outputs the PTB expected format: Clip space X, Y, Z, and W as  f32, screen space X and Y as 12.4 fixed point, screen space Z as f32, and the reciprocal of the clip space W as f32. This format is shown visually in the above image Shader Coordinates Format in VPM for PTB.

VPM_VCD_COORDINATE_SHADER_TEST_1:
; coordinate shader, very sad
; add op: No operation, add cond: never
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x9E7000, 0x100009E7

; r0a = read screen xy (12.4 x2)
; add pipe: Bitwise OR, R0, UNIFORM_READ, NOP, cond: always
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x15827DF7, 0x10020027

; r1a = read screen z (f32)
; add pipe: Bitwise OR, R1, UNIFORM_READ, NOP, cond: always
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x15827DF7, 0x10020067

; r2a = read clip 1/w (f32)
; add pipe: Bitwise OR, R2, UNIFORM_READ, NOP, cond: always
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x15827DF7, 0x100200A7

; VPM write setup: start with vec 2 (clip space Z = 0.0f)
; horizontal write,  elem size 32, stride 1
; y = 2

; VPM write setup fields:
; add write addr: 49 (VPMVCD_WR_SETUP B), cond: always
; write swap: 1, set flags: 0, pm: 0
; pack mode: 32->32 No pack (NOP) (PM0)
; immediate type is 0x70, loaded 32 immediate is 0x17BC1AC2
.word 0x17BC1AC2, 0xE0021C67

; store clip space Z = 0.0f to VPM slot 2:
; add write addr: 48 (VPM_WRITE A), cond: always
; write swap: 0, set flags: 0, pm: 0
; pack mode: 32->32 No pack (NOP) (PM0)
; immediate type is 0x70, loaded 32 immediate is 0.000000
.word 0x0, 0xE0020C27

; store clip space W = 1.0f to VPM slot 3:
; add write addr: 48 (VPM_WRITE A), cond: always
; write swap: 0, set flags: 0, pm: 0
; pack mode: 32->32 No pack (NOP) (PM0)
; immediate type is 0x70, loaded 32 immediate is 1.000000
.word 0x3F800000, 0xE0020C27

; write screen xy (12.4 x2) to VPM slot 4
; add pipe: Bitwise OR, VPM_WRITE, R0, NOP, cond: always
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x15027DF7, 0x10020C27

; write screen z (f32) to VPM slot 5
; add pipe: Bitwise OR, VPM_WRITE, R1, NOP, cond: always
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x15067DF7, 0x10020C27

; write clip 1/w (f32) to VPM slot 6
; add pipe: Bitwise OR, VPM_WRITE, R2, NOP, cond: always
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x150A7DF7, 0x10020C27

; acc0 = extract 16 lsb (X) and shift right by 4 to get rid of x's frac bits
; add pipe: Integer shift right, ACC0, R0, int 4, cond: always
;      unpack mode: 16a->32 Float16float32 if any ALU consuming data
;      executes float instruction, else signed int16->signed int32 (PM0)
; mul op: No operation, mul cond: never
.word 0xE004DC0, 0xD2020827

; acc1 = extract 16 msb (Y) and shift right by 4 to get rid of y's frac bits
; add pipe: Integer shift right, ACC1, R0, int 4, cond: always
;      unpack mode: 16b->32 Float16float32 if any ALU consuming data 
;      executes float instruction, else signed int16->signed int32 (PM0)
; mul op: No operation, mul cond: never
.word 0xE004DC0, 0xD4020867

; acc0 = to_float( acc0 )
; add pipe: Signed integer to floating point, ACC0, acc r0, R0, cond: always
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x80001F7, 0x10022827

; acc1 = to_float( acc1 )
; add pipe: Signed integer to floating point, ACC1, acc r1, R0, cond: always
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x80003F7, 0x10022867

; acc2 = scaling factor 2/639:
; add write addr: 34 (ACC2 A), cond: always
; write swap: 0, set flags: 0, pm: 0
; pack mode: 32->32 No pack (NOP) (PM0)
; immediate type is 0x70, loaded 32 immediate is 0.003130
.word 0x3B4D1ED9, 0xE00208A7

; acc3 = scaling factor 2/479:
; add write addr: 35 (ACC3 A), cond: always
; write swap: 0, set flags: 0, pm: 0
; pack mode: 32->32 No pack (NOP) (PM0)
; immediate type is 0x70, loaded 32 immediate is 0.004175
.word 0x3B88D181, 0xE00208E7

; acc0 *= 2/639
; add op: No operation, add cond: never
; mul pipe: Floating point multiply, ACC0, acc r0, acc r2, cond: always
; signal: no signal
.word 0x20000DC2, 0x100079E0

; acc1 *= 2/479
; add op: No operation, add cond: never
; mul pipe: Floating point multiply, ACC1, acc r1, acc r3, cond: always
; signal: no signal
.word 0x20000DCB, 0x100079E1

; acc0 = acc0 - 1.0f
; add pipe: Floating point subtract, ACC0, acc r0, float 1.f, cond: always
; mul op: No operation, mul cond: never
.word 0x20201C0, 0xD0020827

; acc1 = acc1 - 1.0f
; add pipe: Floating point subtract, ACC1, acc r1, float 1.f, cond: always
; mul op: No operation, mul cond: never
.word 0x20203C0, 0xD0020867

; VPM write setup: start with vec 0 (clip space X)
; horizontal write,  elem size 32, stride 1
; y = 0

; VPM write setup fields:
; add write addr: 49 (VPMVCD_WR_SETUP B), cond: always
; write swap: 1, set flags: 0, pm: 0
; pack mode: 32->32 No pack (NOP) (PM0)
; immediate type is 0x70, loaded 32 immediate is 0x17BC1AC0
.word 0x17BC1AC0, 0xE0021C67

; write clip x (f32) to VPM slot 0
; add pipe: Bitwise OR, VPM_WRITE, acc r0, acc r0, cond: always
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x159E7000, 0x10020C27

; write clip y (f32) to VPM slot 1
; add pipe: Bitwise OR, VPM_WRITE, acc r1, acc r1, cond: always
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x159E7249, 0x10020C27

; VCD store setup word
; horiz DMA store, 7 rows of length 16, XY addr (0, 0), 32-bit, offset 0

; 32 bit DMA store config to VPMVCD_WR_SETUP:
; add write addr: 49 (VPMVCD_WR_SETUP B), cond: always
; write swap: 1, set flags: 0, pm: 0
; pack mode: 32->32 No pack (NOP) (PM0)
; immediate type is 0x70, loaded 32 immediate is 0x83904000
.word 0x83904000, 0xE0021C67

; load uniform to DMA store address, initiate DMA
; add pipe: Bitwise OR, VPM_ST_ADDR, UNIFORM_READ, int 0, cond: always
; mul op: No operation, mul cond: never
.word 0x15800DF7, 0xD0021CA7

; scoreboard done
; add op: No operation, add cond: never
; mul op: No operation, mul cond: never
; signal: scoreboard unlock
.word 0x9E7000, 0x500009E7

; thread end
; add op: No operation, add cond: never
; mul op: No operation, mul cond: never
; signal: program end
.word 0x9E7000, 0x300009E7

; branch delay NOP 1
; add op: No operation, add cond: never
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x9E7000, 0x100009E7

; branch delay NOP 2
; add op: No operation, add cond: never
; mul op: No operation, mul cond: never
; signal: no signal
.word 0x9E7000, 0x100009E7

First we read in the fake vert data and store it to the same VPM locations the system would DMA to if this were a real coordinate shader. Zc, Wc, Xs, Ys, Zs, and 1/W are already in the correct output format, so we are able to store them as-is to their correct VPM destination slots [2..6]. That just leaves the clip space Xc and Yc. These are read in as 12.4 fixed point screen space positions, converted to float, mapped to [-1..1] by assuming a screen size of 640×480, and written to VPM slots 0 and 1. With all the output data in its correct place, we DMA the 7 words * 16 verts back to main memory, where we can print out what the PTB will eventually see.

In real life, each of the 16 words would represent the same attribute but for one of the 16 verts being processed this wave. However, since we are having all threads start with the same vert, the same values just repeat 16 times.

First up is the clip space X, Y, Z, and W. TTY is showing values of 0xBF665C24 (-0.8998f), 0x3F5EDD42 (0.87f), 0x00000000 (0.0f), and 0x3F800000 (1.0f), which is exactly what we’d expect. Up next is the 12.4 encoding of the screen space X and Y. The TTY shows 0x1C000200: once the four fractional bits are dropped, the upper halfword is 0x1C0 (448 in decimal) for Y and the lower halfword is 0x020 (32 in decimal) for X.
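The expected values are easy to reproduce on a desktop. This sketch mirrors the shader's steps (shift out the fraction, convert to float, scale by 2/(extent-1), subtract 1) in double precision, so it agrees with the f32 results only to within rounding. The helper names are mine.

```python
import struct


def bits_to_f32(bits):
    """Reinterpret a 32-bit pattern as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def screen_to_clip(fixed_12_4, extent):
    """12.4 screen coordinate -> clip space, as the test shader does it."""
    pixels = fixed_12_4 >> 4                     # drop the four fractional bits
    return float(pixels) * (2.0 / (extent - 1)) - 1.0


x_fixed, y_fixed = 32 * 16, 448 * 16             # the fake vert from the uniforms

print(screen_to_clip(x_fixed, 640), bits_to_f32(0xBF665C24))
print(screen_to_clip(y_fixed, 480), bits_to_f32(0x3F5EDD42))
print(hex((y_fixed << 16) | x_fixed))            # 0x1c000200
```

Each of the first two lines prints a pair that should match to about six decimal places, and the last line rebuilds the packed X/Y word seen on the TTY.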

Sharing VPM With Non-Graphics Work

There is one last point to be careful of when running user programs at the same time as graphics. The manual states:

The minimum VPM size is 8Kbytes, which is the amount required for normal pipelined 3D operation with concurrent vertex and coordinate shading and worst case vertex data size. With this size of memory it is not sensible to divide the VPM between 3D shading functions and general-purpose processing at the same time. Fully configured systems may have up to 16Kbytes of VPM for higher vertex shading performance. With this size of VPM it is practical for some of the memory to be reserved for general-purpose processing whilst 3D is operating so long as at least 8Kbytes is left for 3D use.

So it might be a good idea to query how much VPM is available before trying to set some aside for compute programs. It is also not possible to use VPM from pixel shaders, as the manual explains:

The FIFO used for the partially interpolated varying results is also used for VPM read and write accesses and VCD control from QPU programs. For this reason a fragment shader program cannot access the VPM or VCD.

This kills one of my best ideas for full shader debugging, but I have other possibilities in mind.