Vertex and Coordinate Shaders

Disclaimer: I do not work for Sony, despite the disturbing percentage of my shirts, jackets, and bookbags that are PlayStation dev-related. I do, however, have many friends who work at Sony, some of whom I hope will call off the corporate lawyers. JayStation is in no way associated with Sony or PlayStation, and any stupid things I say represent only my own ineptitude and silliness.

Once NV mode rendering is working, moving to vertex and coordinate shaders is fairly easy. If you’ve read the posts on GPU init, shaders, uniforms and varyings, texturing, and VPM, you should have almost everything you need to get started. The main change is replacing the NV shader state with a GL shader state, which consists of a fixed-length segment describing the shaders and a variable-length part that configures the vertex array streams.

Vertex Array Streams

These allow us to specify up to eight different sources for vertex attributes, and describe how the data is to be laid out in VPM for the shaders to read.

[Figure: stream size, stride, VPM offset, and total attributes size]

In the above image each capital letter A through F represents a vertex attribute, and subscripts indicate vertex numbers (i.e. B1 is vertex 1’s B attribute). There are three streams containing attributes [A,B], [C,D,E], and [F] respectively. As discussed in a previous post, each attribute gets packed in a horizontal row in VPM, up to sixteen vertices across, such that each entry in that row is the same attribute but for a different vertex.
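If the row packing is hard to picture, here is a tiny host-side Python model of it. This is not GPU code, and the name `vpm_rows` and the list-of-lists representation are my own illustrative assumptions. It only mimics the idea that each attribute becomes one row and each column is a vertex.

```python
def vpm_rows(vertices, width=16):
    """Model the VPM packing: one row per attribute, one column per vertex.

    `vertices` is a list of per-vertex attribute lists, e.g. [[A0, B0], [A1, B1]].
    At most `width` vertices fit side by side in a row.
    """
    if len(vertices) > width:
        raise ValueError("a VPM row holds at most %d vertices" % width)
    if not vertices:
        return []
    # zip(*...) transposes vertex-major data into attribute-major rows.
    return [list(row) for row in zip(*vertices)]


rows = vpm_rows([["A0", "B0"], ["A1", "B1"], ["A2", "B2"]])
# rows[0] holds attribute A for every vertex, rows[1] holds attribute B.
```

In other words, memory is vertex-major, while VPM is attribute-major.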

When setting up the GL shader state record, there are a few per-stream fields that need to be set. First is the stream size minus one. Using stream 0 as an example, and assuming all attributes are word-sized, this is sizeof(A) + sizeof(B) - 1, or 7 bytes. This is different from the Total Attributes size on the right side of the image, which shows the total VPM reserved to store all attributes from all streams (6 attributes * 4 bytes = 24 bytes).

Next is the stride, or distance between subsequent vertices. The stride for stream 0 is 12, because there are 12 bytes between A0 and A1. Often the size and stride will be the same, but allowing them to be different means we can leave gaps, skip unwanted attributes, interleave, and do some other cool tricks without modifying the actual vertex data.
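As a quick sanity check on the stride arithmetic, here is a small Python sketch. The helper name `attribute_address` and the base address of 0 are assumptions made for illustration. It uses the stream 0 numbers from above: word-sized A and B, and a stride of 12 bytes.

```python
def attribute_address(base, stride, vertex, offset=0):
    """Byte address of an attribute: vertex `vertex`, `offset` bytes into it."""
    return base + vertex * stride + offset


# Stream 0 from the example: A and B are one word each, stride is 12 bytes.
a1 = attribute_address(0, 12, 1)            # A1
b1 = attribute_address(0, 12, 1, offset=4)  # B1
```

With a size of 8 and a stride of 12, the last 4 bytes of each vertex are never fetched, which is exactly the kind of gap the stride lets you leave.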

After that is the VPM offset, indicated by the arrows, which defines what offset into a shader’s VPM block a stream’s data will DMA to. These offsets can be set independently for vertex and coordinate shaders, giving you a bit more flexibility to arrange things differently for different stages.

Finally, we need the base address of the stream.
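Putting the four per-stream fields together, one stream description could be packed on the host like this. It is a minimal sketch: the function name `pack_stream`, the example address 0x1000, and the little-endian byte order are my assumptions. The field order follows the record layout used later in this post (address, size minus one, stride, vertex shader VPM offset, coordinate shader VPM offset).

```python
import struct


def pack_stream(base_addr, size_bytes, stride, vs_vpm_offset=0, cs_vpm_offset=0):
    """Pack one 8-byte vertex array stream description (little-endian)."""
    if not 1 <= size_bytes <= 256:
        raise ValueError("size must fit in a byte after subtracting one")
    # Layout: u32 address, u8 size-1, u8 stride, u8 VS offset, u8 CS offset.
    return struct.pack("<IBBBB", base_addr, size_bytes - 1, stride,
                       vs_vpm_offset, cs_vpm_offset)


# Hypothetical stream: 8 bytes of attributes per vertex, 12-byte stride.
desc = pack_stream(0x1000, size_bytes=8, stride=12)
```

Taking the size in bytes and subtracting one inside the helper keeps the "minus one" encoding in a single place, where it is harder to forget.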

In my samples, I am using two vertex streams. One is only used by the vertex shader, and so is in the PSE-expected format. The other is only used by the coordinate shader, and therefore is in the PTB-expected format.

.align 6 ; PSE format
VERTEX_DATA_FOR_VERTSHADER:
	; Vertex: Top
	.hword 320 * 16 ; X In 12.4 Fixed Point
	.hword  32 * 16 ; Y In 12.4 Fixed Point
	.single 0e1.0   ; Z
	.single 0e1.0   ; 1 / W
	
	; Vertex: Bottom Left
	.hword  32 * 16 ; X In 12.4 Fixed Point
	.hword 448 * 16 ; Y In 12.4 Fixed Point
	.single 0e1.0   ; Z
	.single 0e1.0   ; 1 / W
	
	...

.align 6 ; PTB format
VERTEX_DATA_FOR_COORDSHADER:
	; Vertex: Top
	.single 0.00156494522	; Xc
	.single -0.86638830897  ; Yc
	.single 1.0             ; Zc
	.single 1.0             ; Wc
	.hword 320 * 16         ; X In 12.4 Fixed Point
	.hword  32 * 16         ; Y In 12.4 Fixed Point
	.single 0e1.0           ; Z
	.single 0e1.0           ; 1 / Wc
	
	; Vertex: Bottom Left
	.single -0.89984350547  ; Xc
	.single 0.87056367432   ; Yc
	.single 1.0             ; Zc
	.single 1.0             ; Wc
	.hword  32 * 16         ; X In 12.4 Fixed Point
	.hword 448 * 16         ; Y In 12.4 Fixed Point
	.single 0e1.0           ; Z
	.single 0e1.0           ; 1 / W

	...
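If you would rather generate these tables than type them, something along these lines should work. This is a host-side Python sketch, not part of my samples. The helper names are made up, the unsigned 16-bit packing is an assumption (the listing only uses positive values), and `pixel_to_clip` is simply a formula that happens to reproduce the clip-space numbers above for a 640x480 target.

```python
import struct


def to_12_4(pixels):
    """Convert a pixel coordinate to 12.4 fixed point (4 fractional bits)."""
    return int(round(pixels * 16))


def pixel_to_clip(p, size):
    """Map a pixel coordinate to clip space; matches the listing for 640x480."""
    half = (size - 1) / 2.0
    return (p - half) / half


def pack_pse_vertex(x, y, z=1.0, inv_w=1.0):
    """Vertex shader stream entry: X, Y in 12.4, then Z and 1/W as floats."""
    return struct.pack("<HHff", to_12_4(x), to_12_4(y), z, inv_w)


def pack_ptb_vertex(x, y, width, height, z=1.0, w=1.0):
    """Coordinate shader stream entry: clip X, Y, Z, W, then the PSE fields."""
    clip = struct.pack("<ffff", pixel_to_clip(x, width),
                       pixel_to_clip(y, height), z, w)
    return clip + pack_pse_vertex(x, y, z, 1.0 / w)


top_pse = pack_pse_vertex(320, 32)
top_ptb = pack_ptb_vertex(320, 32, 640, 480)
```

The two entry sizes, 12 and 28 bytes, are what end up in the size and stride fields of the stream descriptions.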

With the above two streams in mind, the GL shader state record’s first two streams should be configured as follows, with the other six streams optionally set to zero:

; vert array slot 0: verts for the vert shader
.word VERTEX_DATA_FOR_VERTSHADER ; bytes 36–39 : stream 0 Addr
.byte 11        ; byte 40 : stream 0 Number of Bytes-1
.byte 12        ; byte 41 : stream 0 Memory Stride
.byte 0         ; byte 42 : stream 0 Vert Shader VPM Offset
.byte 0         ; byte 43 : stream 0 Coord Shader VPM Offset

; vert array slot 1: verts for the coord shader
.word VERTEX_DATA_FOR_COORDSHADER ; bytes 44–47 : stream 1 Addr
.byte 27        ; byte 48 : stream 1 Number of Bytes-1
.byte 28        ; byte 49 : stream 1 Memory Stride
.byte 0         ; byte 50 : stream 1 Vert Shader VPM Offset
.byte 0         ; byte 51 : stream 1 Coord Shader VPM Offset

The VPM offset is zero because only one of the streams will be enabled per shader type, and I want both streams to start at the beginning of VPM.

Shady Behavior

Unfortunately, even though I chose my vertex shader inputs to be the same format as the expected outputs, we can’t just have a shader that does nothing. Every attribute must be read from and written to VPM exactly once, or undefined behavior will occur. And by undefined behavior, I mean your triangle will come out randomly looking something like this.

As a result of this restriction, that beautiful short NOP shader you thought you could use has to read and write every attribute, and therefore turns into something like this, quite possibly the longest minimal shader ever:


.align 4
VERT_CODE:
	; vert shader does nothing, VPM in == VPM out
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

	; must read all attributes from VPM exactly once
	; horizontal read,  elem size 32, stride 1, num 3
	; y = 0

	; VPM read setup fields:
	; add write addr: 49 (VPMVCD_RD_SETUP A), cond: always
	; write swap: 0, set flags: 0, pm: 0
	; pack mode: 32->32 No pack (NOP) (PM0)
	; immediate type is 0x70, loaded 32 immediate is 0x1A341AC0
	.word 0x1A341AC0, 0xE0020C67

	; ready to read in 3... 2... 1...
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7
	.word 0x9E7000, 0x100009E7
	.word 0x9E7000, 0x100009E7

	; r0a = read screen xy (12.4 x2)
	; add pipe: Bitwise OR, R0, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020027

	; r1a = read Zs
	; add pipe: Bitwise OR, R1, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020067

	; r2a = read 1/W
	; add pipe: Bitwise OR, R2, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x100200A7

	; must write all attributes to VPM exactly once
	; horizontal write,  elem size 32, stride 1
	; y = 0

	; VPM write setup fields:
	; add write addr: 49 (VPMVCD_WR_SETUP B), cond: always
	; write swap: 1, set flags: 0, pm: 0
	; pack mode: 32->32 No pack (NOP) (PM0)
	; immediate type is 0x70, loaded 32 immediate is 0x17BC1AC0
	.word 0x17BC1AC0, 0xE0021C67

	; write screen xy (12.4 x2)
	; add pipe: Bitwise OR, VPM_WRITE, R0, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15027DF7, 0x10020C27

	; write Zs
	; add pipe: Bitwise OR, VPM_WRITE, R1, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15067DF7, 0x10020C27

	; write 1/W
	; add pipe: Bitwise OR, VPM_WRITE, R2, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x150A7DF7, 0x10020C27

	; scoreboard done
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: scoreboard unlock
	.word 0x9E7000, 0x500009E7

	; thread end
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: program end
	.word 0x9E7000, 0x300009E7

	; branch delay NOP 1
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

	; branch delay NOP 2
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

.align 4
COORD_CODE:
	; coord shader does nothing, VPM in == VPM out
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

	; must read all 7 attributes from VPM exactly once
	; horizontal read,  elem size 32, stride 1, num 7
	; y = 0

	; VPM read setup fields:
	; add write addr: 49 (VPMVCD_RD_SETUP A), cond: always
	; write swap: 0, set flags: 0, pm: 0
	; pack mode: 32->32 No pack (NOP) (PM0)
	; immediate type is 0x70, loaded 32 immediate is 0x1A741AC0
	.word 0x1A741AC0, 0xE0020C67

	; ready to read in 3... 2... 1...
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7
	.word 0x9E7000, 0x100009E7
	.word 0x9E7000, 0x100009E7

	; r0a = read clip X
	; add pipe: Bitwise OR, R0, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020027

	; r1a = read clip Y
	; add pipe: Bitwise OR, R1, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020067

	; r2a = read clip Z
	; add pipe: Bitwise OR, R2, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x100200A7

	; r3a = clip W
	; add pipe: Bitwise OR, R3, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x100200E7

	; r4a = read screen xy (12.4 x2)
	; add pipe: Bitwise OR, R4, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020127

	; r5a = read Zs
	; add pipe: Bitwise OR, R5, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020167

	; r6a = read 1/W
	; add pipe: Bitwise OR, R6, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x100201A7

	; must write all 7 attributes to VPM exactly once
	; horizontal write,  elem size 32, stride 1
	; y = 0

	; VPM write setup fields:
	; add write addr: 49 (VPMVCD_WR_SETUP B), cond: always
	; write swap: 1, set flags: 0, pm: 0
	; pack mode: 32->32 No pack (NOP) (PM0)
	; immediate type is 0x70, loaded 32 immediate is 0x17BC1AC0
	.word 0x17BC1AC0, 0xE0021C67

	; write clip X
	; add pipe: Bitwise OR, VPM_WRITE, R0, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15027DF7, 0x10020C27

	; write clip Y
	; add pipe: Bitwise OR, VPM_WRITE, R1, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15067DF7, 0x10020C27

	; write clip Z
	; add pipe: Bitwise OR, VPM_WRITE, R2, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x150A7DF7, 0x10020C27

	; write clip W
	; add pipe: Bitwise OR, VPM_WRITE, R3, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x150E7DF7, 0x10020C27

	; write screen xy (12.4 x2)
	; add pipe: Bitwise OR, VPM_WRITE, R4, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15127DF7, 0x10020C27

	; write Zs
	; add pipe: Bitwise OR, VPM_WRITE, R5, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15167DF7, 0x10020C27

	; write 1/W
	; add pipe: Bitwise OR, VPM_WRITE, R6, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x151A7DF7, 0x10020C27

	; scoreboard done
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: scoreboard unlock
	.word 0x9E7000, 0x500009E7

	; thread end
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: program end
	.word 0x9E7000, 0x300009E7

	; branch delay NOP 1
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

	; branch delay NOP 2
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

 

This brings us to the part of the GL shader state record that describes the actual shaders.

.align 4
GL_SHADER_STATE_RECORD:
	.hword 4        ; bytes 0–1 : flag bits, enable clipping

	; stuff describing frag shader
	.byte 0         ; byte 2 : Frag Shader Number of Uniforms
	.byte 0         ; byte 3 : Frag Shader Number of Varyings
	.word FRAG_CODE ; bytes 4–7 : Frag Shader Code Address
	.word 0         ; bytes 8–11 : Frag Shader Uniforms Address

	; stuff describing vert shader
	.hword 0                ; bytes 12–13 : Number of Uniforms
	.byte 1                 ; byte 14 : Stream select mask
	.byte 12                ; byte 15 : Total Attributes Size
	.word VERT_CODE         ; bytes 16–19 : Code Address
	.word VERT_UNIFORM_DATA ; bytes 20–23 : Uniforms Address

	; stuff describing coord shader			     
	.hword 0                 ; bytes 24–25 : Num Uniforms
	.byte 2                  ; byte 26 : Stream select mask
	.byte 28                 ; byte 27 : Total Attributes Size
	.word COORD_CODE         ; bytes 28–31 : Code Address
	.word COORD_UNIFORM_DATA ; bytes 32–35 : Uniforms Address

The shader microcode addresses, uniforms addresses, and number of uniforms and varyings should be familiar from the NV shader state record. What’s new are the fields for total attribute size and stream select. Total attribute size tells the system how much VPM to reserve for all attributes across all streams. This is not the same as the per-stream size (minus one) we set up in the stream descriptions.
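For anyone building the record from a host tool instead of an assembler, the fixed 36-byte part could be packed roughly like this. It is a sketch under assumptions: the function name `pack_shader_header`, the tuple shapes, and every address below are placeholders I made up, and little-endian byte order is assumed. The field order and widths are taken from the listing above.

```python
import struct


def pack_shader_header(flags, frag, vert, coord):
    """Pack the 36-byte fixed part of a GL shader state record.

    frag  = (num_uniforms, num_varyings, code_addr, uniforms_addr)
    vert  = (num_uniforms, stream_mask, total_attr_size, code_addr, uniforms_addr)
    coord = same shape as vert
    """
    header = struct.pack("<H", flags)         # bytes 0-1
    header += struct.pack("<BBII", *frag)     # bytes 2-11
    header += struct.pack("<HBBII", *vert)    # bytes 12-23
    header += struct.pack("<HBBII", *coord)   # bytes 24-35
    return header


# Placeholder addresses; real ones come from wherever the code and uniforms live.
header = pack_shader_header(
    flags=4,                                  # enable clipping
    frag=(0, 0, 0x2000, 0),
    vert=(0, 0b00000001, 12, 0x3000, 0x3800),
    coord=(0, 0b00000010, 28, 0x4000, 0x4800),
)
```

The stream descriptions then follow immediately after these 36 bytes.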

Stream select is an 8-bit mask where each bit enables or disables one of the eight vertex array streams. The vert shader mask is set to 0b00000001 because it only uses the first stream, and the coordinate shader mask is set to 0b00000010 because it only uses the second.
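The mask itself is just one bit per stream index. A throwaway Python helper (the name `stream_mask` is hypothetical) makes that explicit:

```python
def stream_mask(*streams):
    """Build the 8-bit stream select mask from stream indices 0..7."""
    mask = 0
    for index in streams:
        if not 0 <= index <= 7:
            raise ValueError("stream index must be between 0 and 7")
        mask |= 1 << index
    return mask


vert_mask = stream_mask(0)   # vertex shader reads stream 0 only
coord_mask = stream_mask(1)  # coordinate shader reads stream 1 only
```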

Kicking It All Off

The only remaining change to make is in the binning command list. The NV shader state command (0x41) is replaced with the GL version (0x40). I am using a macro that looks like this:

; Control ID Code 64: GL Shader State
; bits      offs   desc
;  28        4     GL Shader Record addr in 16 byte blocks
;   1        3     Extended shader record
;   3        0     Num attribute arrays (0 = all 8).
.macro GL_Shader_State address numarrays
	.byte 0x40
	.word (\address | \numarrays)
.endm

The address of the shader state record must be 16-byte aligned as the least significant 4 bits are used to toggle extended shader record, and for specifying the number of vertex array streams to activate. This stream count also determines the length of the GL shader state record, because you must have at least this many stream descriptions.
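The same encoding, written as a host-side Python sketch in case the bit packing is easier to read that way. The function names and the example address are assumptions, and `record_length` only covers the non-extended record layout shown in this post.

```python
import struct

GL_SHADER_STATE = 0x40  # control ID code 64


def gl_shader_state_cmd(record_addr, num_arrays, extended=False):
    """Encode the 5-byte GL Shader State control list entry."""
    if record_addr & 0xF:
        raise ValueError("shader state record must be 16-byte aligned")
    if not 1 <= num_arrays <= 8:
        raise ValueError("between 1 and 8 attribute arrays are supported")
    # A count of 8 is encoded as 0 in the 3-bit field.
    low_bits = (num_arrays & 0x7) | (0x8 if extended else 0)
    return struct.pack("<BI", GL_SHADER_STATE, record_addr | low_bits)


def record_length(num_arrays):
    """36 fixed bytes plus one 8-byte description per stream."""
    return 36 + 8 * num_arrays


# Hypothetical record address with the two streams used in this post.
cmd = gl_shader_state_cmd(0x00010000, 2)
```

With two streams the record is 52 bytes long, which matches the byte offsets in the listings above (the second stream description ends at byte 51).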

That’s all there is. If you set everything up correctly, you should have the same triangle you had in NV mode, but it should now take over 3x longer to render. Enjoy your newfound freedom.

This week, extra special thanks goes out to Zyte for taking the time to remind me about that VPM restriction. He’s doing some very cool bare metal GPU work, so be sure to see what he’s up to here.

Author: okonomiyonda

SPU whisperer, VU1 fetishist, physics enthusiast, GCN asm fanboy, low-level optimizationist. JayStation2 system architect. Jaymin Dreams Of SHUFB 京都市, Kyoto, Japan @okonomiyonda
