Porting From RPI2 to RPI3

Disclaimer: I do not work for Sony, despite the disturbing percentage of my shirts, jackets, and bookbags that are PlayStation dev-related. I do, however, have many friends who work at Sony, some of whom I hope will call off the corporate lawyers. JayStation is in no way associated with Sony or PlayStation, and any stupid things I say represent only my own ineptitude and silliness.

Minor Revisions

It all started when GDC superhero and primary source of my inferiority complex Graham Wihilididalolz had asked me about getting JayStation2 running on his Pi. Graham had wisely chosen to attempt this with an RPI2, since RPI3 moved to a 64-bit Cortex-A53, and therefore my RPI2-based stuff surely wouldn’t boot on such a totally different CPU.

Funny story. It didn’t boot anyway. Turns out my code was depending on a minor HW difference that was introduced between RPI2’s v1.1 and v1.2 board revisions. That ever so small change was a swapping of the old Cortex-A7 CPU for the same A53/2837 found in the RPI3. Allegedly they were having a hard time sourcing the old part, and switched to the new CPU without bothering to make a big deal of it.

At this point, I said screw it, and ported the whole thing to RPI3. This post is an open topic that I will keep adding to as I find more differences. If I’m missing anything, please let me know and I will update.

UART Clock Speed

The UART clock has gotten a bit of a speed bump on RPI3, from 3MHz to 48MHz. This was originally pointed out to me by Mike Nicolella, and later confirmed by querying clock ID 2 (UART) via the mailbox property interface. However, if you are lazy and don’t feel like querying, all that’s required is the following change to the integer and fractional parts of the baud rate divisor:

.if RPI_VERSION == 3
	; Divider = 48000000 / (16 * 115200) = 26.0416666667 = ~26.
	; Frac part = (0.0416666667 * 64) + 0.5 = 3.1666666688 = ~3.
	mov r2, #26
	str r2, [r0, #UART0_IBRD]
	mov r2, #3
	str r2, [r0, #UART0_FBRD]
.endif
.if RPI_VERSION == 2 
	; Divider = 3000000 / (16 * 115200) = 1.627 = ~1.
	; Frac part = (0.627 * 64) + 0.5 = 40.6 = ~40.
	mov r2, #1
	str r2, [r0, #UART0_IBRD]
	mov r2, #40
	str r2, [r0, #UART0_FBRD]
.endif

Exact same code, you just use a different clock in the calculation.
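If you want to double check the arithmetic in those comments on a host machine, here is a minimal Python sketch. It is not part of the JayStation source, and `uart_divisors` is a name made up for illustration. It rounds the divisor to the nearest 1/64 and splits it into the integer and fractional register values.

```python
def uart_divisors(uart_clock_hz, baud):
    """Return (IBRD, FBRD) for a divisor of clock / (16 * baud)."""
    # Work in units of 1/64: divisor * 64 == clock * 4 / baud.
    # Computing clock * 8 / baud first lets us round to nearest using integers only.
    scaled = (8 * uart_clock_hz // baud + 1) // 2
    return scaled >> 6, scaled & 0x3F


print(uart_divisors(48_000_000, 115200))  # RPI3 clock
print(uart_divisors(3_000_000, 115200))   # RPI2 clock
```

Keeping the integer and fractional parts in one scaled value means a fractional part that rounds up to 64 carries into the integer part on its own.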

UART GPIO Pin Config

For some reason I don’t fully understand, GPIO pins 14 and 15 had their default functions changed to something Bluetooth related. This is even stranger because they chose to use the pins from the only UART in the system with a stable clock. It’s easy enough to fix: just set GPIO pins 14 and 15 to alt func 0 (0b100).

ldr r1, =GPIO_BASE_ADDR

; pins 14 and 15 must be manually mapped to txd0 and rxd0
; because now they are blue teeth. Use alt func 0 (0b100)
; word 0 is config for pins [0..9],
; and word 1 is config for pins [10..19]
ldr r2, [r1, #4]	; load word 1 for pins [10..19]
mov r3, #0b100100	; pins 14 and 15 want to be 0b100 (alt func 0)
bfi r2, r3, #12, #6	; insert config into bits [17..12]
str r2, [r1, #4]	; store it back
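The bit positions used by the bfi above follow from ten pins per word and three bits per pin. A small Python sketch of that bookkeeping, with hypothetical helper names and a zero stand-in for the value that would be loaded from the register:

```python
def gpfsel_location(pin):
    """Return (byte offset from the GPIO base, bit shift) for a GPIO pin."""
    # Each 32-bit function select word holds ten pins, three bits per pin.
    return (pin // 10) * 4, (pin % 10) * 3


def set_function(word, pin, func):
    """Return the select word with the 3-bit field for `pin` replaced by `func`."""
    _, shift = gpfsel_location(pin)
    return (word & ~(0b111 << shift)) | (func << shift)


ALT0 = 0b100

word = 0  # stand-in for the value loaded from GPIO base + 4
for pin in (14, 15):
    word = set_function(word, pin, ALT0)
print(hex(word))
```

Pins 14 and 15 land at shifts 12 and 15 of the word at offset 4, which is the same six-bit window [17..12] the assembly fills in one go.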

MMU Registers

The SMP bit, responsible for marking a CPU as part of the inner shareable domain, has changed location. It used to be bit 6 of the ACTLR register, and is now bit 6 of the CPUECTLR register.

.if RPI_VERSION == 3
	; ACTLR register changed from A7 to A53.
	; The SMP bit went to CPU ECTLR
	mrrc p15, 1, r0, r1, c15
	orr r0, r0, #( 1 << 6 )
	mcrr p15, 1, r0, r1, c15
.endif
.if RPI_VERSION == 2 
	; The SMP bit is only in ACTLR on the A7
	mrc p15, 0, r0, c1, c0, 1
	orr r0, r0, #( 1 << 6 )
	mcr p15, 0, r0, c1, c0, 1
.endif

RGB <=> BGR

I’m not sure whether this is a hardware change, or if it’s related to using different GPU firmware and boot files, but the pixel order seems to have been swapped going from RPI2 to RPI3. This doesn’t affect you if you are writing render targets via the GPU, but anything written by the CPU has to be careful. The order can be queried and changed via the mailbox property interface. For example:

Get pixel order
Tag: 0x00040006
Request:
	Length: 0
Response:
	Length: 4
	Value:
		u32: state
State:
	0x0: BGR
	0x1: RGB

Set pixel order
Tag: 0x00048006
Request:
	Length: 4
	Value:
		u32: state (as above)
Response:
	Length: 4
	Value:
		u32: state (as above)
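To make the tag layout a little more concrete, here is a rough Python sketch that assembles a single-tag property buffer as a list of 32-bit words. The overall layout (buffer size, request code, tag, value buffer size, tag request code, value, end tag) is my reading of the firmware's mailbox property interface documentation, so treat it as an assumption. The helper names are made up. Actually handing the buffer to the mailbox, and the 16-byte alignment of the buffer in memory, are left out.

```python
import struct

GET_PIXEL_ORDER = 0x00040006
SET_PIXEL_ORDER = 0x00048006


def property_message(tag, value_words, value_buffer_bytes):
    """Build a single-tag property buffer as a list of 32-bit words."""
    slots = value_buffer_bytes // 4
    values = list(value_words) + [0] * (slots - len(value_words))
    words = [
        0,                   # total buffer size in bytes, patched below
        0,                   # buffer request code
        tag,
        value_buffer_bytes,  # room for the larger of request and response
        0,                   # tag request code
        *values,
        0,                   # end tag
    ]
    words[0] = 4 * len(words)
    return words


def to_bytes(words):
    """Serialize the words little-endian, as the ARM core would store them."""
    return struct.pack("<%dI" % len(words), *words)


get_msg = property_message(GET_PIXEL_ORDER, [], 4)   # response carries the state
set_rgb = property_message(SET_PIXEL_ORDER, [1], 4)  # 0x1 selects RGB
print([hex(w) for w in get_msg])
```

The get request reserves four bytes of value buffer even though its request length is zero, since the response needs somewhere to put the state word.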

I’m not currently in a position to try this on my RPI2, but I would love for someone else to try and let me know what the default is.

LED Blinker

I still haven’t looked into this, but it’s on my list. It seems the ACT LED has been moved off the GPIOs and must now be controlled via mailbox. It’s not super high priority for me, but be aware this might be why your LED no longer works.

Vertex and Coordinate Shaders

Once NV mode rendering is working, moving to vertex and coordinate shaders is fairly easy. If you’ve read the posts on GPU init, shaders, uniforms and varyings, texturing, and VPM, you should have almost everything you need to get started. The main change will be replacing the NV shader state with a GL shader state, which consists of a fixed-length segment describing shaders, and a variable-length part to configure vertex array streams.

Vertex Array Streams

These allow us to specify up to eight different sources for vertex attributes, and describe how the data is to be laid out in VPM for the shaders to read.

[Figure: stream size, stride, VPM offset, and total attributes size]

In the above image each capital letter A through F represents a vertex attribute, and subscripts indicate vertex numbers (i.e. B1 is vertex 1’s B attribute). There are three streams containing attributes [A,B], [C,D,E], and [F] respectively. As discussed in a previous post, each attribute gets packed in a horizontal row in VPM, up to sixteen vertices across, such that each entry in that row is the same attribute but for a different vertex.

When setting up the GL shader state record, there are a few per-stream fields that need to be set. First is the stream size minus one. Using stream 0 as an example, and assuming all attributes are word-sized, this is sizeof(A) + sizeof(B) - 1, or 7 bytes. This is different from the Total Attributes size on the right side of the image, which shows the total VPM reserved to store all attributes from all streams (6 attributes * 4 bytes = 24 bytes).

Next is the stride, or distance between subsequent vertices. The stride for stream 0 is 12, because there are 12 bytes between A0 and A1. Often the size and stride will be the same, but allowing them to be different means we can leave gaps, skip unwanted attributes, interleave, and do some other cool tricks without modifying the actual vertex data.

After that is the VPM offset, indicated by the arrows, which defines what offset into a shader’s VPM block a stream’s data will DMA to. These offsets can be set independently for vertex and coordinate shaders, giving you a bit more flexibility to arrange things differently for different stages.

Finally, we need the base address of the stream.
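As a quick illustration of how size, stride, and base address fit together, here is a small Python sketch using stream 0 from the figure (attributes A and B, one word each, stride 12). The base address and function names are hypothetical.

```python
def stream_size_field(attribute_sizes):
    """Value for the per-stream 'number of bytes minus one' field."""
    return sum(attribute_sizes) - 1


def attribute_address(base, stride, vertex, offset_in_vertex=0):
    """Memory address of one attribute of one vertex within a stream."""
    return base + vertex * stride + offset_in_vertex


base = 0x1000  # hypothetical base address of stream 0
print(stream_size_field([4, 4]))               # size field for [A, B]
print(hex(attribute_address(base, 12, 1)))     # A1
print(hex(attribute_address(base, 12, 1, 4)))  # B1
```

With a size of 8 bytes and a stride of 12, the last 4 bytes of every vertex are simply skipped, which is the gap-leaving trick described above.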

In my samples, I am using two vertex streams. One is only used by the vertex shader, and so is in the PSE-expected format. The other is only used by the coordinate shader, and therefore is in the PTB-expected format.

.align 6 ; PSE format
VERTEX_DATA_FOR_VERTSHADER:
	; Vertex: Top
	.hword 320 * 16 ; X In 12.4 Fixed Point
	.hword  32 * 16 ; Y In 12.4 Fixed Point
	.single 0e1.0   ; Z
	.single 0e1.0   ; 1 / W
	
	; Vertex: Bottom Left
	.hword  32 * 16 ; X In 12.4 Fixed Point
	.hword 448 * 16 ; Y In 12.4 Fixed Point
	.single 0e1.0   ; Z
	.single 0e1.0   ; 1 / W
	
	...

.align 6 ; PTB format
VERTEX_DATA_FOR_COORDSHADER:
	; Vertex: Top
	.single 0.00156494522   ; Xc
	.single -0.86638830897  ; Yc
	.single 1.0             ; Zc
	.single 1.0             ; Wc
	.hword 320 * 16         ; X In 12.4 Fixed Point
	.hword  32 * 16         ; Y In 12.4 Fixed Point
	.single 0e1.0           ; Z
	.single 0e1.0           ; 1 / Wc
	
	; Vertex: Bottom Left
	.single -0.89984350547  ; Xc
	.single 0.87056367432   ; Yc
	.single 1.0             ; Zc
	.single 1.0             ; Wc
	.hword  32 * 16         ; X In 12.4 Fixed Point
	.hword 448 * 16         ; Y In 12.4 Fixed Point
	.single 0e1.0           ; Z
	.single 0e1.0           ; 1 / Wc

	...
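The numbers in both listings can be reproduced with a few lines of Python. This is a sketch under two assumptions that are inferred from the values rather than stated anywhere: the render target is 640x480, and clip coordinates map pixel 0 to -1 and the last pixel to +1. The helper names are made up.

```python
WIDTH, HEIGHT = 640, 480  # assumed render target size


def to_fixed_12_4(pixels):
    """Screen coordinate in 12.4 fixed point (four fractional bits)."""
    return int(round(pixels * 16))


def to_clip(pixel, extent):
    """Map a pixel coordinate in [0, extent - 1] onto [-1, 1]."""
    return 2.0 * pixel / (extent - 1) - 1.0


for x, y in ((320, 32), (32, 448)):  # top, bottom left
    print(to_fixed_12_4(x), to_fixed_12_4(y), to_clip(x, WIDTH), to_clip(y, HEIGHT))
```

The `320 * 16` style expressions in the listings are the same 12.4 conversion, just done by the assembler instead.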

With the above two streams in mind, the GL shader state record’s first two streams should be configured as follows, with the other six streams optionally set to zero:

; vert array slot 0: verts for the vert shader
.word VERTEX_DATA_FOR_VERTSHADER ; bytes 36–39 : stream 0 Addr
.byte 11        ; byte 40 : stream 0 Number of Bytes-1
.byte 12        ; byte 41 : stream 0 Memory Stride
.byte 0         ; byte 42 : stream 0 Vert Shader VPM Offset
.byte 0         ; byte 43 : stream 0 Coord Shader VPM Offset

; vert array slot 1: verts for the coord shader
.word VERTEX_DATA_FOR_COORDSHADER ; bytes 44–47 : stream 1 Addr
.byte 27        ; byte 48 : stream 1 Number of Bytes-1
.byte 28        ; byte 49 : stream 1 Memory Stride
.byte 0         ; byte 50 : stream 1 Vert Shader VPM Offset
.byte 0         ; byte 51 : stream 1 Coord Shader VPM Offset

The VPM offset is zero because only one of the streams will be enabled per shader type, and I want both streams to start at the beginning of VPM.
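If you generate these records from a tool instead of writing them by hand, each stream entry is just eight bytes. A minimal Python sketch, assuming little-endian packing and using made-up addresses in place of the two labels:

```python
import struct


def stream_descriptor(address, size_bytes, stride, vs_vpm_offset=0, cs_vpm_offset=0):
    """Pack one 8-byte vertex array stream entry, little-endian."""
    return struct.pack("<IBBBB", address, size_bytes - 1, stride,
                       vs_vpm_offset, cs_vpm_offset)


# Hypothetical addresses standing in for the two vertex data labels.
vert_stream = stream_descriptor(0x00010000, 12, 12)
coord_stream = stream_descriptor(0x00010040, 28, 28)
print(vert_stream.hex(), coord_stream.hex())
```

Passing the real size and letting the helper subtract one avoids the easy mistake of storing 12 where the hardware expects 11.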

Shady Behavior

Unfortunately, even though I chose my vertex shader inputs to be the same format as the expected outputs, we can’t just have a shader that does nothing. Every attribute must be read from and written to VPM exactly once, or undefined behavior will occur. And by undefined behavior, I mean your triangle will come out randomly looking something like this.

As a result of this restriction, that beautiful short NOP shader you thought you could use will have to read and write all attributes, and therefore transform into something like this:

.align 4
VERT_CODE:
	; vert shader does nothing, VPM in == VPM out
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

	; must read all attributes from VPM exactly once
	; horizontal read,  elem size 32, stride 1, num 3
	; y = 0

	; VPM read setup fields:
	; add write addr: 49 (VPMVCD_RD_SETUP A), cond: always
	; write swap: 0, set flags: 0, pm: 0
	; pack mode: 32->32 No pack (NOP) (PM0)
	; immediate type is 0x70, loaded 32 immediate is 0x1A341AC0
	.word 0x1A341AC0, 0xE0020C67

	; ready to read in 3... 2... 1...
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7
	.word 0x9E7000, 0x100009E7
	.word 0x9E7000, 0x100009E7

	; r0a = read screen xy (12.4 x2)
	; add pipe: Bitwise OR, R0, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020027

	; r1a = read Zs
	; add pipe: Bitwise OR, R1, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020067

	; r2a = read 1/W
	; add pipe: Bitwise OR, R2, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x100200A7

	; must write all attributes to VPM exactly once
	; horizontal write,  elem size 32, stride 1
	; y = 0

	; VPM write setup fields:
	; add write addr: 49 (VPMVCD_WR_SETUP B), cond: always
	; write swap: 1, set flags: 0, pm: 0
	; pack mode: 32->32 No pack (NOP) (PM0)
	; immediate type is 0x70, loaded 32 immediate is 0x17BC1AC0
	.word 0x17BC1AC0, 0xE0021C67

	; write screen xy (12.4 x2)
	; add pipe: Bitwise OR, VPM_WRITE, R0, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15027DF7, 0x10020C27

	; write Zs
	; add pipe: Bitwise OR, VPM_WRITE, R1, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15067DF7, 0x10020C27

	; write 1/W
	; add pipe: Bitwise OR, VPM_WRITE, R2, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x150A7DF7, 0x10020C27

	; scoreboard done
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: scoreboard unlock
	.word 0x9E7000, 0x500009E7

	; thread end
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: program end
	.word 0x9E7000, 0x300009E7

	; branch delay NOP 1
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

	; branch delay NOP 2
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

.align 4
COORD_CODE:
	; coord shader does nothing, VPM in == VPM out
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

	; must read all 7 attributes from VPM exactly once
	; horizontal read,  elem size 32, stride 1, num 7
	; y = 0

	; VPM read setup fields:
	; add write addr: 49 (VPMVCD_RD_SETUP A), cond: always
	; write swap: 0, set flags: 0, pm: 0
	; pack mode: 32->32 No pack (NOP) (PM0)
	; immediate type is 0x70, loaded 32 immediate is 0x1A741AC0
	.word 0x1A741AC0, 0xE0020C67

	; ready to read in 3... 2... 1...
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7
	.word 0x9E7000, 0x100009E7
	.word 0x9E7000, 0x100009E7

	; r0a = read clip X
	; add pipe: Bitwise OR, R0, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020027

	; r1a = read clip Y
	; add pipe: Bitwise OR, R1, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020067

	; r2a = read clip Z
	; add pipe: Bitwise OR, R2, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x100200A7

	; r3a = read clip W
	; add pipe: Bitwise OR, R3, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x100200E7

	; r4a = read screen xy (12.4 x2)
	; add pipe: Bitwise OR, R4, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020127

	; r5a = read Zs
	; add pipe: Bitwise OR, R5, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x10020167

	; r6a = read 1/W
	; add pipe: Bitwise OR, R6, VPM_READ, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15C27DF7, 0x100201A7

	; must write all 7 attributes to VPM exactly once
	; horizontal write,  elem size 32, stride 1
	; y = 0

	; VPM write setup fields:
	; add write addr: 49 (VPMVCD_WR_SETUP B), cond: always
	; write swap: 1, set flags: 0, pm: 0
	; pack mode: 32->32 No pack (NOP) (PM0)
	; immediate type is 0x70, loaded 32 immediate is 0x17BC1AC0
	.word 0x17BC1AC0, 0xE0021C67

	; write clip X
	; add pipe: Bitwise OR, VPM_WRITE, R0, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15027DF7, 0x10020C27

	; write clip Y
	; add pipe: Bitwise OR, VPM_WRITE, R1, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15067DF7, 0x10020C27

	; write clip Z
	; add pipe: Bitwise OR, VPM_WRITE, R2, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x150A7DF7, 0x10020C27

	; write clip W
	; add pipe: Bitwise OR, VPM_WRITE, R3, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x150E7DF7, 0x10020C27

	; write screen xy (12.4 x2)
	; add pipe: Bitwise OR, VPM_WRITE, R4, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15127DF7, 0x10020C27

	; write Zs
	; add pipe: Bitwise OR, VPM_WRITE, R5, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x15167DF7, 0x10020C27

	; write 1/W
	; add pipe: Bitwise OR, VPM_WRITE, R6, NOP, cond: always
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x151A7DF7, 0x10020C27

	; scoreboard done
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: scoreboard unlock
	.word 0x9E7000, 0x500009E7

	; thread end
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: program end
	.word 0x9E7000, 0x300009E7

	; branch delay NOP 1
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7

	; branch delay NOP 2
	; add op: No operation, add cond: never
	; mul op: No operation, mul cond: never
	; signal: no signal
	.word 0x9E7000, 0x100009E7
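The only words that differ between the two shaders' read setups are the VPM read setup immediates. Here is a hedged Python sketch that pulls the commented fields back out of them. The bit positions are my understanding of the generic block read setup format and should be treated as an assumption. Bits outside these fields are ignored.

```python
def decode_vpm_read_setup(word):
    """Split a VPM read setup word into the fields named in the comments."""
    return {
        "num": (word >> 20) & 0xF,       # number of rows to read
        "stride": (word >> 12) & 0x3F,   # row step between reads
        "horizontal": (word >> 11) & 1,  # 1 = horizontal access
        "size": (word >> 8) & 0x3,       # 2 is assumed to mean 32-bit elements
        "y": word & 0x3F,                # start row, assumed to be the low six bits
    }


print(decode_vpm_read_setup(0x1A341AC0))  # vertex shader setup
print(decode_vpm_read_setup(0x1A741AC0))  # coordinate shader setup
```

Under those assumptions the two immediates differ only in the row count, 3 versus 7, which matches the number of attributes each shader has to read.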

This brings us to the part of the GL shader state record that describes the actual shaders.

.align 4
GL_SHADER_STATE_RECORD:
	.hword 4        ; bytes 0–1 : flag bits, enable clipping

	; stuff describing frag shader
	.byte 0         ; byte 2 : Frag Shader Number of Uniforms
	.byte 0         ; byte 3 : Frag Shader Number of Varyings
	.word FRAG_CODE ; bytes 4–7 : Frag Shader Code Address
	.word 0         ; bytes 8–11 : Frag Shader Uniforms Address

	; stuff describing vert shader
	.hword 0                ; bytes 12–13 : Number of Uniforms
	.byte 1                 ; byte 14 : Stream select mask
	.byte 12                ; byte 15 : Total Attributes Size
	.word VERT_CODE         ; bytes 16–19 : Code Address
	.word VERT_UNIFORM_DATA ; bytes 20–23 : Uniforms Address

	; stuff describing coord shader			     
	.hword 0                 ; bytes 24–25 : Num Uniforms
	.byte 2                  ; byte 26 : Stream select mask
	.byte 28                 ; byte 27 : Total Attributes Size
	.word COORD_CODE         ; bytes 28–31 : Code Address
	.word COORD_UNIFORM_DATA ; bytes 32–35 : Uniforms Address

The shader microcode addresses, uniforms addresses, and number of uniforms and varyings should be familiar from the NV shader state record. What’s new are the fields for total attribute size and stream select. Total attribute size tells the system how much VPM to reserve for all attributes across all of the shader’s streams. This is not the same as the per-stream size (number of bytes minus one) we set up in the stream descriptions.

Stream select is an 8-bit mask where each bit enables or disables one of the eight vertex array streams. The vert shader mask is set to 0b00000001 because it only uses the first stream, and the coordinate shader mask is set to 0b00000010 because it only uses the second.
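A tiny Python sketch of the two bookkeeping details here: building a stream select mask, and working out how long the whole record is. The 36-byte fixed part and 8 bytes per stream come from the byte offsets in the listings above. The function names are made up.

```python
def stream_select_mask(streams):
    """Bit mask with one bit set per enabled vertex array stream (0..7)."""
    mask = 0
    for index in streams:
        if not 0 <= index < 8:
            raise ValueError("stream index out of range")
        mask |= 1 << index
    return mask


def record_length(num_streams):
    """Bytes in the record: 36 fixed bytes plus 8 per stream description."""
    return 36 + 8 * num_streams


print(stream_select_mask([0]), stream_select_mask([1]), record_length(2))
```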

Kicking It All Off

The only remaining change to make is in the binning command list. The NV shader state command (0x41) is replaced with the GL version (0x40). I am using a macro that looks like this:

; Control ID Code 64: GL Shader State
; bits      offs   desc
;  28        4     GL Shader Record addr in 16 byte blocks
;   1        3     Extended shader record
;   3        0     Num attribute arrays (0 = all 8).
.macro GL_Shader_State address numarrays
	.byte 0x40
	.word (\address | \numarrays)
.endm

The address of the shader state record must be 16-byte aligned, because the least significant 4 bits of the word are used to toggle the extended shader record and to specify the number of vertex array streams to activate. This stream count also determines the length of the GL shader state record, because you must have at least this many stream descriptions.
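For anyone building the command list from a host-side tool, the same encoding might look like the Python sketch below. It follows the bit layout in the macro comments. The function name, the alignment check, and the handling of a count of 8 as 0 are my own additions rather than anything from the original macro.

```python
import struct


def gl_shader_state_command(record_address, num_arrays, extended=False):
    """Encode control ID 0x40 followed by its 32-bit operand, little-endian."""
    if record_address & 0xF:
        raise ValueError("shader state record must be 16-byte aligned")
    if not 1 <= num_arrays <= 8:
        raise ValueError("num_arrays must be between 1 and 8")
    # A count of 8 does not fit in three bits, so it is stored as 0 (all 8).
    operand = record_address | (int(extended) << 3) | (num_arrays & 0x7)
    return struct.pack("<BI", 0x40, operand)


print(gl_shader_state_command(0x1000, 2).hex())
```

The alignment check is the useful part: OR-ing a misaligned address into the word would silently corrupt the stream count instead of failing loudly.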

That’s all there is. If you set everything up correctly, you should have the same triangle you had in NV mode, but it should now take over 3x longer to render. Enjoy your newfound freedom.

This week, extra special thanks goes out to Zyte for taking the time to remind me about that VPM restriction. He’s doing some very cool bare metal GPU work, so be sure to see what he’s up to here.