First Triangle

Disclaimer: I do not work for Sony, despite the disturbing percentage of my shirts, jackets, and bookbags that are PlayStation dev-related. I do, however, have many friends who work at Sony, some of whom I hope will call off the corporate lawyers. JayStation is in no way associated with Sony or PlayStation, and any stupid things I say represent only my own ineptitude and silliness.

Initializing the GPU and Framebuffer

It begins, like so many things, by initializing the GPU. We communicate with the GPU through a mailbox: we write a single word that holds the 16-byte-aligned address of the commands we want to send in its upper 28 bits and the destination mailbox channel in its lower 4 bits.

; Run Tags To Initialize V3D
mov32 r0, (PERIPHERAL_BASE + MAIL_BASE + MAIL_WRITE)  ; mailbox
mov32 r1, TAGS_STRUCT    ; address of the commands to send
orr r1, #MAIL_TAGS       ; lower 4 bits are channel number
dmb                      ; data memory barrier
str r1, [r0]             ; write to the mailbox
dmb

.align 4 ; 16 bytes
TAGS_STRUCT:
	; total size of thing sent to the mailbox
	.word TAGS_END - TAGS_STRUCT
	.word 0x00000000
	; command to enable the QPU
	.word Enable_QPU
	.word 0x00000004
	.word 0x00000004
	.word 1
	; end of commands
	.word 0x00000000
TAGS_END:

We wait for the GPU to reply by polling the MAIL_STATUS register until the MAIL_EMPTY bit is no longer set, and then we keep reading from the MAIL_READ register until the FIFO is empty, being careful to check the lower four bits of each response for the expected channel. The GPU will return a message of 0x80000000 on success.
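
The packing and unpacking of the mailbox word can be sketched in C. This is only an illustration, not the code used in this post: the function names are made up, and the channel value 8 for MAIL_TAGS is an assumption that should be checked against your own header file.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed channel number behind MAIL_TAGS; verify against your headers. */
#define MAIL_TAGS_CHANNEL 8u

/* Pack a 16-byte-aligned buffer address and a channel into one mailbox word. */
static uint32_t mailbox_compose(uint32_t buffer_addr, uint32_t channel)
{
    assert((buffer_addr & 0xFu) == 0); /* low 4 bits must be free for the channel */
    assert(channel <= 0xFu);
    return buffer_addr | channel;
}

/* Lower 4 bits of a word read back from the mailbox: the channel. */
static uint32_t mailbox_channel(uint32_t word)
{
    return word & 0xFu;
}

/* Upper 28 bits of a word read back from the mailbox: the data/address. */
static uint32_t mailbox_data(uint32_t word)
{
    return word & ~0xFu;
}
```

When draining the FIFO, a reply would only be accepted if mailbox_channel() returns the channel that was written; anything else belongs to a different conversation.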

Now that the GPU is initialized, we have to create a framebuffer. This is also done through the mailbox with the following commands:

.align 4
FB_STRUCT:
	.word FB_STRUCT_END - FB_STRUCT
	.word 0x00000000

	; Sequence Of Concatenated Tags
	.word Set_Physical_Display
	.word 0x00000008
	.word 0x00000000
	physical_display_x: .word SCREEN_X
	physical_display_y: .word SCREEN_Y

	.word Set_Virtual_Buffer
	.word 0x00000008
	.word 0x00000008
	virtual_display_x: .word SCREEN_X
	virtual_display_y: .word SCREEN_Y

	.word Set_Depth
	.word 0x00000004
	.word 0x00000004
	bits_per_pixel: .word BITS_PER_PIXEL

	.word Set_Virtual_Offset
	.word 0x00000008
	.word 0x00000008
	.word 0
	.word 0

	.word Get_VC_Memory
	.word 8
	.word 0
	vc_mem_base_addr: .word 0
	vc_mem_size: .word 0

	.word Get_ARM_Memory
	.word 8
	.word 0
	arm_mem_base_addr: .word 0
	arm_mem_size: .word 0

	.word Allocate_Buffer
	.word 0x00000008
	.word 0x00000008
	fb_ptr: .word 0
	fb_size: .word 0

	; 0x0 (End Tag)
	.word 0x00000000
FB_STRUCT_END:

While getting the VC and ARM memory sizes and offsets is not necessary for initializing the framebuffer, it’s fun stuff to know. After running these mailbox tags and waiting for the GPU response, the address and size of the framebuffer will be written to fb_ptr and fb_size respectively.
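
If the same buffer were handled from C rather than through assembler labels, the response could be walked tag by tag to find the framebuffer values. The sketch below assumes the layout used in the listing (total size, response code, then tags of ID, value buffer size, request/response word, values, closed by a zero end tag). The helper name and the Allocate_Buffer ID of 0x00040001 are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed tag ID for Allocate_Buffer; check it against your own header. */
#define TAG_ALLOCATE_BUFFER 0x00040001u

/* Return a pointer to the value words of the first tag with the given ID,
   or NULL if the end tag (0) or the end of the buffer is reached first. */
static const uint32_t *find_tag_values(const uint32_t *buf, uint32_t tag_id)
{
    assert(buf != NULL);
    size_t total_words = buf[0] / 4;  /* word 0: total size in bytes */
    size_t i = 2;                     /* skip size and request/response code */
    while (i + 3 <= total_words && buf[i] != 0) {
        size_t value_words = (buf[i + 1] + 3) / 4; /* value size, padded to words */
        if (buf[i] == tag_id)
            return &buf[i + 3];
        i += 3 + value_words;         /* ID + size + code + values */
    }
    return NULL;
}
```

For the Allocate_Buffer tag, the first returned value word corresponds to fb_ptr and the second to fb_size.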

Command Buffers

This brings us to the real core of today’s entry: command buffers. Even a high-level overview has a lot to cover, so this is going to have to be another multi-part post. Today I’m focusing on the front end’s two threads: binning and rendering. The binning thread is responsible for setting up the tile binning mode configuration, supplying blocks of binning memory for the render thread command buffers, and specifying state data, shaders, and primitive lists. The rendering thread then goes through the commands generated by the binning thread and… well… renders stuff.

Binning and rendering threads
Figure N: The binning thread command buffer allocates 32-byte blocks of commands and inline geometry for each tile. The render thread command buffer then does a call and return to execute each tile’s block. If more than 32 bytes are needed, the commands can contain jumps to other blocks.

The render thread goes through the following flow. For each tile (x, y), we add a Tile_Coordinates x, y command. We then branch to the command list that the binning thread created for that particular tile. Finally we end with a store command: Store_Multi_Sample for every tile except the last, which uses Store_Multi_Sample_End to finish the frame. Doing this for 80 tiles is super tedious, so through the magic of GNU assembler macros, I proudly present the lazy way:

; wanted format is
;    Tile_Coordinates x, y
;    Branch_To_Sub_List TILE_ALLOC_ADDRESS + ((y * 10 + x) * 32)
;    Store_Multi_Sample / Store_Multi_Sample_End
.macro  BINNING_ENTRIES
    .set countery, 0
    .rept NUM_TILES_Y
        .set counterx, 0
        .rept NUM_TILES_X
            Tile_Coordinates counterx, countery
            Branch_To_Sub_List TILE_ALLOC_ADDRESS + ((countery * NUM_TILES_X + counterx) * 32)
            .if((counterx == (NUM_TILES_X-1)) && (countery == (NUM_TILES_Y-1)))
                Store_Multi_Sample_End
            .else
                Store_Multi_Sample
            .endif
            .set counterx, counterx + 1
        .endr
        .set countery, countery + 1
    .endr
.endm
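
The address arithmetic that the macro performs at assemble time can be sketched in C as follows. The 64-pixel tile size and the 640x480 screen used in the usage note are assumptions (they are consistent with the 80 tiles mentioned above), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TILE_SIZE        64u /* assumed tile edge in pixels */
#define TILE_BLOCK_BYTES 32u /* per-tile block size used by the macro */

/* Number of tiles needed to cover `pixels`, rounding up for partial tiles. */
static uint32_t tiles_for(uint32_t pixels)
{
    return (pixels + TILE_SIZE - 1u) / TILE_SIZE;
}

/* Address of the sub-list the render thread branches to for tile (x, y). */
static uint32_t tile_list_address(uint32_t tile_alloc_address,
                                  uint32_t num_tiles_x,
                                  uint32_t x, uint32_t y)
{
    assert(x < num_tiles_x);
    return tile_alloc_address + (y * num_tiles_x + x) * TILE_BLOCK_BYTES;
}
```

With a 640x480 screen this gives tiles_for(640) = 10 columns and tiles_for(480) = 8 rows, which is where the 80 tiles come from.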

Even if you ignore the messy macro, at least look at the comment above showing the three commands the render thread must execute for every tile. The Branch_To_Sub_List command branches to the command buffer created for the tile by the binning thread. Using this macro, the final render command buffer will look like this:

.align 2
RENDER_COMMAND_BUFFER_START:
     Wait_On_Semaphore

     Clear_Colors 0xFF00FFFF, 0, 0, 0
     
TILE_MODE_ADDRESS:
     Tile_Rendering_Mode_Configuration 0x00000000, SCREEN_X, SCREEN_Y, Frame_Buffer_Color_Format_RGBA8888
     
     Tile_Coordinates 0, 0
     Store_Tile_Buffer_General 0, 0, 0
     
     BINNING_ENTRIES
RENDER_COMMAND_BUFFER_END:

Ignoring the semaphore for now, we have a Clear_Colors command with the color 0xFF00FFFF, a Tile_Rendering_Mode_Configuration to set up the framebuffer address, screen dimensions, and color format, and finally our macro to generate all the per-tile branches. That 0x00000000 in Tile_Rendering_Mode_Configuration is for the framebuffer address, but since we don’t know the address at assemble time we have to patch it in after initializing the GPU and setting up the framebuffer. Don’t forget to flush afterwards so the patched-in framebuffer address is visible to the GPU.
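
One way the patching step might look in C is shown below. It assumes the Tile_Rendering_Mode_Configuration macro emits a one-byte opcode followed directly by the 32-bit address, so the address field sits at TILE_MODE_ADDRESS + 1 and is not word aligned. That encoding and the helper name are assumptions, so check them against your macro definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value little-endian at an arbitrary (possibly unaligned)
   byte offset inside a command buffer, one byte at a time. */
static void patch_u32_le(uint8_t *cmds, size_t offset, uint32_t value)
{
    assert(cmds != NULL);
    cmds[offset + 0] = (uint8_t)(value & 0xFFu);
    cmds[offset + 1] = (uint8_t)((value >> 8) & 0xFFu);
    cmds[offset + 2] = (uint8_t)((value >> 16) & 0xFFu);
    cmds[offset + 3] = (uint8_t)((value >> 24) & 0xFFu);
}
```

Writing byte by byte avoids an unaligned word store, which matters here because the field follows a single opcode byte.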

Now let’s take a look at the binning thread command buffer format:

.align 2
CONTROL_LIST_BIN_STRUCT:
	Tile_Binning_Mode_Configuration TILE_ALLOC_ADDRESS, BIN_MEM_SIZE, TILE_STATE_DATA_ARRAY_ADDRESS, NUM_TILES_X, NUM_TILES_Y, Auto_Initialise_Tile_State_Data_Array
	Start_Tile_Binning
	Increment_Semaphore
	Clip_Window 0, 0, SCREEN_X, SCREEN_Y
	Configuration_Bits Enable_Forward_Facing_Primitive + Enable_Reverse_Facing_Primitive, Early_Z_Updates_Enable
	Viewport_Offset 0, 0
	NV_Shader_State NV_SHADER_STATE_RECORD
	Vertex_Array_Primitives Mode_Triangles, 3, 0 ; mode, vertex count, index of first vertex
	Flush
CONTROL_LIST_BIN_END:

Most of this is pretty self-explanatory. We’re setting some state, starting binning, and flushing when done. That semaphore only increments when binning is done and everything is flushed, so if you put it after the flush it would never increment. The two most interesting things in this command buffer are the binning mode config and the NV shader state. TILE_ALLOC_ADDRESS and BIN_MEM_SIZE are the address and size of the memory pool the binning thread uses to create render command buffers. In experiments, the allocation block size seems to be 32 bytes, and the remaining size can be read from the BMPRS register. Out-of-memory conditions can be handled with an interrupt. TILE_STATE_DATA_ARRAY_ADDRESS is the address of the tile state data.
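
As a rough sizing aid, the two memory regions can be estimated from the tile counts. The sketch below uses the 32-byte initial block per tile observed above and assumes 48 bytes of tile state per tile; that second figure is my recollection of the hardware documentation rather than something stated in this post, so treat it as an assumption. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define INITIAL_BLOCK_BYTES 32u /* per-tile allocation block seen in experiments */
#define TILE_STATE_BYTES    48u /* assumed per-tile state size; verify in the docs */

/* Smallest binning pool that still holds one initial block per tile. */
static uint32_t min_bin_mem_size(uint32_t num_tiles_x, uint32_t num_tiles_y)
{
    assert(num_tiles_x > 0u && num_tiles_y > 0u);
    return num_tiles_x * num_tiles_y * INITIAL_BLOCK_BYTES;
}

/* Size of the tile state data array under the 48-byte assumption. */
static uint32_t tile_state_array_size(uint32_t num_tiles_x, uint32_t num_tiles_y)
{
    assert(num_tiles_x > 0u && num_tiles_y > 0u);
    return num_tiles_x * num_tiles_y * TILE_STATE_BYTES;
}
```

A real BIN_MEM_SIZE would normally be much larger than the minimum, since any tile whose list outgrows its first block needs further blocks from the same pool.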

The NV in NV_Shader_State has nothing to do with Nvidia; it probably means something like “no vertex”. The chip has three pipeline modes:

  1. GL mode is your normal vert+frag thing
  2. NV mode has no vert shader and uses pre-shaded vertices stored in memory
  3. VG mode takes vertices directly from the input primitive list as XY coordinates only

Wanting to get something on screen sooner rather than later, I went with NV mode. Moving on to the NV shader state:

.align 4 ; 128-Bit Align
NV_SHADER_STATE_RECORD:
	; Flag Bits:
	; 0 = Fragment Shader Is Single Threaded
	; 1 = Point Size Included In Shaded Vertex Data
	; 2 = Enable Clipping
	; 3 = Clip Coordinates Header Included In Shaded Vertex Data
	.byte 0                    ; Flag Bits (all clear)
	.byte 3 * 4                ; Shaded Vertex Data Stride
	.byte 0                    ; Fragment Shader Num Uniforms (unused)
	.byte 0                    ; Fragment Shader Num Varyings
	.word FRAGMENT_SHADER_CODE ; Fragment Shader Code Address
	.word 0                    ; Fragment Shader Uniforms Address
	.word VERTEX_DATA          ; Shaded Vertex Data Address

.align 4 ; 128-Bit Align
VERTEX_DATA:
	; Vertex: Top
	.hword 320 * 16            ; X In 12.4 Fixed Point
	.hword  32 * 16            ; Y In 12.4 Fixed Point
	.single 0e1.0              ; Z
	.single 0e1.0              ; 1 / W
	
	; Vertex: Bottom Left
	.hword  32 * 16            ; X In 12.4 Fixed Point
	.hword 448 * 16            ; Y In 12.4 Fixed Point
	.single 0e1.0              ; Z
	.single 0e1.0              ; 1 / W
	
	; Vertex: Bottom Right
	.hword 608 * 16            ; X In 12.4 Fixed Point
	.hword 448 * 16            ; Y In 12.4 Fixed Point
	.single 0e1.0              ; Z
	.single 0e1.0              ; 1 / W

.align 4 ; 128-Bit Align
FRAGMENT_SHADER_CODE:
	; Fill Color Shader
	.word 0x009E7000 ;
	.word 0x100009E7 ; nop // nop // nop
	
	.word 0xFFFFFFFF ; RGBA White
	.word 0xE0020BA7 ; ldi tlbc, 0xFFFFFFFF
	.word 0x009E7000 ;
	.word 0x500009E7 ; nop // nop // sbdone
	.word 0x009E7000 ;
	.word 0x300009E7 ; nop // nop // thrend
	
	.word 0x009E7000 ;
	.word 0x100009E7 ; nop // nop // nop
	.word 0x009E7000 ;
	.word 0x100009E7 ; nop // nop // nop
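
The vertex layout in VERTEX_DATA can be mirrored in C to make the 12-byte stride and the 12.4 fixed-point conversion explicit. This is a sketch with hypothetical type and function names. The range check is deliberately conservative because whether the hardware treats the X/Y fields as signed is not stated here.

```c
#include <assert.h>
#include <stdint.h>

/* One pre-shaded vertex: 2 + 2 + 4 + 4 = 12 bytes, matching the 3 * 4 stride. */
typedef struct {
    uint16_t x;     /* X in 12.4 fixed point */
    uint16_t y;     /* Y in 12.4 fixed point */
    float    z;     /* Z */
    float    inv_w; /* 1 / W */
} NvVertex;

/* Convert a pixel coordinate to 12.4 fixed point (4 fractional bits). */
static uint16_t to_fixed_12_4(float pixels)
{
    /* Range that is valid whether the field is read as signed or unsigned. */
    assert(pixels >= 0.0f && pixels < 2048.0f);
    return (uint16_t)(pixels * 16.0f);
}

/* Build a vertex at screen position (x, y) with Z = 1 and 1/W = 1,
   as in the listing above. */
static NvVertex make_vertex(float x, float y)
{
    NvVertex v;
    v.x = to_fixed_12_4(x);
    v.y = to_fixed_12_4(y);
    v.z = 1.0f;
    v.inv_w = 1.0f;
    return v;
}
```

The top vertex of the triangle would be make_vertex(320.0f, 32.0f), which stores 320 * 16 and 32 * 16 in the two halfwords.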

Don’t worry too much about that frag shader, as I have another post coming up that dives deep into the details of writing them. So now we have built two command buffers, but how do we submit them? The GPU has two registers per thread, one for the command buffer start address (which the GPU advances as it executes) and another for the end address, and execution continues as long as the two are not equal. For binning thread 0, these are CT0CA and CT0EA respectively. Likewise for render thread 1, they are CT1CA and CT1EA.
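
The submit-and-run rule can be modelled in a few lines of C. The struct below is a hypothetical stand-in for one thread's pair of registers so that the logic can be exercised on a host machine. It is not the real register map, where the two registers are separate MMIO locations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical view of one control-list thread's registers. */
typedef struct {
    volatile uint32_t current_addr; /* stands in for CTnCA */
    volatile uint32_t end_addr;     /* stands in for CTnEA */
} ControlThread;

/* Hand a command buffer [start, end) to a thread. */
static void thread_submit(ControlThread *t, uint32_t start, uint32_t end)
{
    assert(t != 0);
    assert(start <= end);
    t->current_addr = start;
    t->end_addr = end; /* once both are written and differ, the thread runs */
}

/* The thread keeps executing while the two addresses are not equal. */
static int thread_busy(const ControlThread *t)
{
    return t->current_addr != t->end_addr;
}
```

For the binning thread this would be called with CONTROL_LIST_BIN_STRUCT and CONTROL_LIST_BIN_END, and for the render thread with RENDER_COMMAND_BUFFER_START and RENDER_COMMAND_BUFFER_END.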

There are a few ways of synchronizing the threads. If you’re lazy you can have the CPU wait on BMFCT, which is incremented when the binning thread flushes all tile lists to memory. When the count is right you can let the CPU go on to kick the rendering thread. On the other side, RMFCT is incremented whenever the last tile store is completed. Making the CPU spin-wait on these is probably a very bad idea for performance. Slightly better is kicking the rendering thread when a binning flush interrupt happens. An even better way is to use semaphores (see above). There seem to be two front-end semaphores, one that the render thread waits on and the binning thread increments, and another that the binning thread waits on and the render thread increments. This is a great way to stop either thread from getting too far ahead of the other. There are also markers, but that’s a topic for another day.
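
The lazy spin-wait option might look like the following in C. The function name and the poll limit are hypothetical. The limit is only there so that a stuck GPU does not hang the CPU forever, and the counter pointer would be the address of the register holding BMFCT or RMFCT.

```c
#include <assert.h>
#include <stdint.h>

/* Poll a flush counter until it reaches `expected` or `max_polls` reads
   have been made. Returns 1 if the count was seen, 0 on giving up. */
static int wait_for_count(const volatile uint32_t *counter,
                          uint32_t expected, uint32_t max_polls)
{
    assert(counter != 0);
    for (uint32_t i = 0; i < max_polls; i++) {
        if (*counter == expected)
            return 1;
    }
    return 0;
}
```

The volatile qualifier forces a fresh read on every iteration, which is what makes the loop meaningful when the pointer refers to a hardware register.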

Shoutouts and props

Andrew, Colin, and Neil from Codeplay, Graham Wihlidal (GrahamBox system architect), Peter Lemon whose examples and header files saved me hours of manually typing MMIO register offsets, and all the poor guinea pigs who had to proofread this: Tom Forsyth (@tom_forsyth), @RapidGS, Jason Proctor

Author: okonomiyonda

SPU whisperer, VU1 fetishist, physics enthusiast, GCN asm fanboy, low-level optimizationist. JayStation2 system architect. Jaymin Dreams Of SHUFB 京都市, Kyoto, Japan @okonomiyonda
