Disclaimer: I do not work for Sony, despite the disturbing percentage of my shirts, jackets, and bookbags that are PlayStation dev-related. I do, however, have many friends that work at Sony, some of which I hope will call off the corporate lawyers. JayStation is in no way associated with Sony or PlayStation, and any stupid things I say represent only my own ineptitude and silliness.

This is a bittersweet post for me to write. At the end of August I will be temporarily pausing my eight year Japanese adventure and returning to the States for personal reasons, making this the last JayStation blog update from beautiful Kyoto, Japan. As part of the moving process, I am selling my only computer tomorrow, giving me just 24 hours to bang out this post before I no longer have any means of doing so. It won’t be as in depth or interesting as I had hoped, and it will probably be rushed and error-riddled with unclear wording, but it will show everything you need to know to render textured triangles. The next update probably won’t come until September or October when I settle in, unless of course I am murdered in the street for riding my bike to work. #America

There are three steps to getting textured polygons rendering. First you set up your vert data in memory, including two varyings for the normalized S and T texture coordinates. Next you set up between 1 and 4 uniforms, corresponding to texture config parameters 0 through 3. Finally you write a fragment shader that does interpolation for the ST coordinate varyings, and reads the texture data. The first step, setting up verts and varyings, was covered in the previous post so it won’t be duplicated here.

I Just Love A Config In Uniform

Texture config params are shader uniforms that are used to specify things like base address, dimensions, pixel format, mip levels, min/mag filters, and wrap mode for texture unit memory accesses. They roughly correspond to some combination of T#’s and samplers on GCN. The number of config params needed is dictated by the type of data being accessed, with one word required for 1D buffers and general memory accesses, two words for 2D textures, and three or four words for cubemaps and child images.

No matter the type of data, you must at least specify config param 0. Bits [31:12] give the base address of the LOD0 image in units of 4 KiB blocks, so all images must be at least 4 KiB aligned. Bits [11:10] give the cache swizzle mode, and will be discussed in depth in a later post. Bits [7..4] are the four LSB bits of the 5-bit pixel format value, and [3..0] give the number of mip levels minus one.

The supported pixel formats are

0 RGBA8888 32 8-bit per channel red, green, blue, alpha
1 RGBX8888 32 8-bit per channel RGA, alpha set to 1.0
2 RGBA4444 16 4-bit per channel red, green, blue, alpha
3 RGBA5551 16 5-bit per channel red, green, blue, 1-bit alpha
4 RGB565 16 Alpha channel set to 1.0
5 LUMINANCE 8 8-bit luminance (alpha channel set to 1.0)
6 ALPHA 8 8-bit alpha (RGA channels set to 0)
7 LUMALPHA 16 8-bit luminance, 8-bit alpha
8 ETC1 4 Ericsson Texture Compression format
9 S16F 16 16-bit float sample (blending supported)
10 S8 8 8-bit integer sample (blending supported)
11 S16 16 16-bit integer sample (point sampling only)
12 BW1 1 1-bit black and white
13 A4 4 4-bit alpha
14 A1 1 1-bit alpha
15 RGBA64 64 16-bit float per RGBA channel
16 RGBA32R 32 Raster format 8-bit per channel red, green, blue, alpha
17 YUYV422R 32 Raster format 8-bit per channel Y, U, Y, V

For 2D texture data, a uniform for config param 1 must also be given. This will include things useful for 2D texture reads such as width and height, wrap mode, min/mag mode, and the MSB of the 5-bit pixel format value.

As with the uniforms in previous examples, the texture config params are stored as a word-aligned list in memory, and the address of the list is encoded in the uniform data address field of the NV shader state record. The main difference here is that you won’t be manually reading these uniforms in the fragment shader. Rather, every time your shader writes to a texture unit register (TMUn_S, TMUn_T, TMUn_R, TMUn_B), the texture unit automatically fetches the next config param from the uniform FIFO and pushes it to the texture FIFO. If all four config params are needed, you have to write to all four TMU registers, writing a zero to TMUn_B if no bias is actually needed. Since the S register is the only one required by all access types, writing it also kicks off texture processing, and therefore it must always be written last.

The Shader

The high level shader flow is as follows: read and interpolate the S and T varyings, write T to TMU0_T causing the TMU to auto fetch config param 0, write S to TMU0_S causing the TMU to auto fetch config param 1 and kick off texture processing, send the GPU the texture read signal, and read back the packed pixel data from accumulator r4. Let’s start with the S and T varyings interpolation

; Tex S: ACC0 = S * W (r15a)
; add op: No operation, add cond: never
; mul pipe: Floating point multiply, ACC0, R15, VARYING_READ, cond: always
.word 0x203E3DF7, 0x110059E0

; Tex S coord: ACC0 = S * W + C, Tex T: ACC1 = T * W (r15a)
; add pipe: Floating point add, ACC0, acc r0, acc r5, cond: always
; mul pipe: Floating point multiply, ACC1, R15, VARYING_READ, cond: always
.word 0x213E3177, 0x11024821

; Tex T write reg = T * W + C, triggering first sampler param uniform read
; add pipe: Floating point add, TMU0_T, acc r1, acc r5, cond: always
; mul op: No operation, mul cond: never
.word 0x13E3377, 0x11020E67

Notice how the destination for the T coordinate’s add is TMU0_T. This not only feeds the texture unit the T coordinate, but also causes the TMU to fetch config param 0 from the uniform FIFO.

; moving S coord (in ACC0) to S register
; triggering read of second tex param uniform, and kicking it all off
; add pipe: Bitwise OR, TMU0_S_RETIRING, acc r0, acc r0, cond: always
; mul op: No operation, mul cond: never
.word 0x159E7000, 0x10020E27

Next we move the S value into TMU0_S. This feeds the texture unit the S coordinate, causes the TMU to fetch config param 1 from the uniform FIFO, and kicks off texture processing. Note that because S has to be written last, but usually comes first in the vert data, storing the texture coordinates as TS instead of ST can avoid an extra mov and be a potential optimization in some cases.

; signal TMU texture read. Can this be done with prev instruction?
; add op: No operation, add cond: never
; mul op: No operation, mul cond: never
; signal: load data from tmu0 to r4
.word 0x9E7000, 0xA00009E7

This instruction doesn’t execute any ALU operations, but it does signal the GPU to load data from the texture unit into accumulator r4. Whether or not the signal can occur in the same instruction as the write to TMU0_S is a bit unclear, and may need a bit of testing. For now, just to be safe it’s being done in the instruction after the S coordinate is written.

Also worth noting, data from the texture units is always returned as either packed 8-bit RGBA8888 or RG1616 and BA1616 values in r4, depending on the bits per channel. The special r4 unpack mode (pm=1) mentioned in the previous post exists to convert this packed texture data to normalized [0..1] data. Reading 64-bit pixel data requires two 32-bit reads to get all four channels. For general 1D buffer access, 32-bit data is always fetched.

; exporting read texture data to MRT0
; add pipe: Bitwise OR, TLB_COLOUR_ALL, acc r4, acc r4, cond: always
; mul op: No operation, mul cond: never
.word 0x159E7924, 0x10020BA7

After the signal, the data arrives in r4 and is immediately available for the next instruction to use. Here the sampled texture data in r4 is just directly copied to TLB_COLOUR_ALL for export.

Be careful not to overfill the texture receive FIFO, as it is only 8 entries deep, and each write to a TMU register constitutes one entry. For 2D textures you only write to S and T, and so you can queue up four texture requests. If you only need to write S (1D buffer case), you can queue up eight requests. Finally in the child image case where S, T, R, and B are all needed, you can only queue up two. If your shader runs in multithreaded mode and you suspend, both threads share the same FIFO and should only use half.

I Wanna See You Swizzle It Just A Little Bit

Finally we need to cover how texture data is laid out in memory. Texture reads support both linear (raster order) formats and microtile-based T and LT formats. Microtiles are 64 byte 2D blocks of pixels, whose geometry depends on the number of bits per pixel. In the common case of 32-bit pixels, a microtile would be a 4×4 block. In the 64-bit pixel case, microtiles are 2×4 blocks, and 1-bit pixels are laid out as 32×16 blocks, for example. The pixels are stored in simple raster scan order, left to right and bottom to top, with the origin in the lower left corner.

In T-format, microtiles are grouped into subtiles, where a subtile is a 1 KiB block of 16 microtiles, arranged 4×4 in simple raster scan order. In the 32bpp case, that’s 4×4 pixels per microtile times 4×4 microtiles per subtile, or 16×16 pixels per subtile. 256 pixels times 4 bytes per pixel is 1 KiB, as expected.

Next, four 1 KiB subtiles are grouped into one image tile (sometimes called a 4K tile). The ordering of subtiles within a 4K tile depends on the 4K tile row. Even rows (0, 2, 4,…) orders the four subtiles lower left, upper left, upper right, lower right. Odd rows of 4K tiles (1, 3, 5,…) orders them upper right, lower right, lower left, upper left.

4K tiles on even lines have subtiles wound clockwise from lower left
4K tiles on odd lines have subtiles wound clockwise from upper right

Finally, the 4K tiles themselves go left to right for even rows, and right to left for odd rows. Again, the origin is in the lower left corner. The image below shows a texture three 4K tiles wide and two 4K tiles high. The pattern repeats, alternating every row

This is the tiling order for T-format. Because the format is 4K tile based, the texture dimensions must always be padded out to a multiple of 4K tiles. This can be wasteful for smaller textures, so the hardware assumes any mip level less than a 4K tile in size will be stored in LT (or linear tiled) format. LT-format is also microtile based, but the image is stored as a series of microtiles in linear scan order, without any concept of subtiles and 4K tiles.

That’s about all I have time for.  I’ll have to save cache swizzling, mipmap layout, cubemaps, and hardware utilization for another post this fall.

Shoutouts as always to Peter Lemon for continuing to push low level and bare metal on RPI, Cort Danger Stratton for being recognizable even as a 32×32 texture, Graham Wihlidal (father of the never-gonna-be-released GrahamBox), and new kid in the console game Mike Nicølella who I can only assume is making the Miketendo Nii…


Author: okonomiyonda

SPU whisperer, VU1 fetishist, physics enthusiast, GCN asm fanboy, low-level optimizationist. JayStation2 system architect. Jaymin Dreams Of SHUFB 京都市, Kyoto, Japan @okonomiyonda

1 thought on “Texturing”

Leave a Reply

Your email address will not be published. Required fields are marked *