Renderer — legend-of-legaia-re

How it works

The PSX has dedicated 3D hardware called the GTE (Geometry Transformation Engine), which transforms vertices, and a 2D-only GPU that rasterises into its own 1 MB of VRAM. Games typically draw a 3D model like this: walk the mesh primitive by primitive, push each one through the GTE for projection, then link the resulting 2D primitive (as a "GP0" packet) into an ordering table (OT) that the GPU reads and rasterises. Textures and palettes (CLUTs) live as rectangular regions inside VRAM, addressed by their top-left corner.
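To make the ordering-table idea concrete, here is a minimal Rust sketch. `OrderingTable` and its methods are hypothetical names, not part of the engine port; a real OT is a linked list in main RAM rather than a vector of vectors, and the order of primitives inside one bucket is an arbitrary choice here.

```rust
/// Depth-bucketed primitive list: far buckets are drawn first so nearer
/// primitives overwrite them (painter's algorithm, no depth buffer).
pub struct OrderingTable<P> {
    buckets: Vec<Vec<P>>,
}

impl<P> OrderingTable<P> {
    /// `len` is the number of depth buckets and must be at least 1.
    pub fn new(len: usize) -> Self {
        assert!(len > 0, "an ordering table needs at least one bucket");
        Self { buckets: (0..len).map(|_| Vec::new()).collect() }
    }

    /// Files a primitive under its depth key, clamping to the last bucket.
    pub fn insert(&mut self, otz: usize, prim: P) {
        let i = otz.min(self.buckets.len() - 1);
        self.buckets[i].push(prim);
    }

    /// Empties the table from the farthest bucket to the nearest.
    pub fn drain_back_to_front(&mut self) -> Vec<P> {
        self.buckets
            .iter_mut()
            .rev()
            .flat_map(|bucket| bucket.drain(..))
            .collect()
    }
}
```

The depth key would normally come from the GTE's average-Z result for the primitive.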

Legaia's TMDs are almost the standard format Sony documented, but with a custom primitive-group header layout and a custom renderer. Each primitive group has an 8-byte header [u16 count, u16 flags, u8 olen, u8 ilen, u8 flag, u8 mode] followed by count × ilen × 4 bytes of per-prim data. The vertex-index byte offset within each prim isn't a fixed field; it's looked up from a 6-entry descriptor table indexed by ((flags >> 1) - 8) >> 1. The descriptor table also encodes which OT-packet shape the renderer emits.
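The header layout above can be read with a few lines of Rust. This is a sketch only: `PrimGroupHeader` is an illustrative type (not the parser in crates/tmd), the field names are copied from the layout in the text, and little-endian byte order is assumed because the PSX CPU is little-endian.

```rust
/// The 8-byte primitive-group header described in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PrimGroupHeader {
    pub count: u16,
    pub flags: u16,
    pub olen: u8,
    pub ilen: u8,
    pub flag: u8,
    pub mode: u8,
}

impl PrimGroupHeader {
    pub const SIZE: usize = 8;

    /// Returns `None` when fewer than 8 bytes are available.
    pub fn parse(bytes: &[u8]) -> Option<Self> {
        let h = bytes.get(..Self::SIZE)?;
        Some(Self {
            count: u16::from_le_bytes([h[0], h[1]]),
            flags: u16::from_le_bytes([h[2], h[3]]),
            olen: h[4],
            ilen: h[5],
            flag: h[6],
            mode: h[7],
        })
    }

    /// Size of one primitive record: `ilen` counts 32-bit words.
    pub fn prim_stride(&self) -> usize {
        self.ilen as usize * 4
    }

    /// Total bytes of per-prim data that follow the header.
    pub fn data_len(&self) -> usize {
        self.count as usize * self.prim_stride()
    }
}
```

A caller would typically parse the header, then skip `PrimGroupHeader::SIZE + data_len()` bytes to reach the next group.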

The clean-room engine port emulates PSX VRAM as a 1024×512 R16Uint texture. The fragment shader reads the texture coordinates and CLUT base from the per-primitive bind, decodes 4/8/15bpp, indexes the CLUT, and outputs the final colour. That gives the same per-prim mode-switching the PSX hardware does, but in a single GPU draw call.

Per-mode descriptor table

The renderer indexes into an 8-byte-stride table at 0x8007326C using ((flags >> 1) - 8) >> 1:

`flags`	table idx	byte 0	byte 4
`0x10/11`	0	`0x04`	`0x05`
`0x12/13`	1	`0x09`	`0x07`
`0x14/15`	2	`0x04`	`0x00`
`0x16/17`	3	`0x06`	`0x06`
`0x18/19`	4	`0x07`	`0x07`
`0x1A/1B`	5	`0x09`	`0x0B`
`0x20-23`	4	(same entry as `0x18/19`)

Each entry's first u32 has bytes [?, ?, ?, type_bits] where the low 2 bits of byte 3 select the OT packet shape (0/1/2/3 → different DrawPolyXX variants). Each entry's second u32 has the vertex-index offset (in u16 units) within the prim in its low byte. See formats/tmd for the full layout.
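A hedged sketch of the lookup follows. The index expression is transcribed literally from the text, and the byte positions follow the prose (byte 3 of the first word, low byte of the second word, little-endian); it does not try to reproduce the table rows. All names are illustrative rather than taken from the codebase.

```rust
pub const DESCRIPTOR_STRIDE: usize = 8;
pub const DESCRIPTOR_COUNT: usize = 6;

/// Evaluates `((flags >> 1) - 8) >> 1` and rejects results outside the table.
pub fn descriptor_index(flags: u16) -> Option<usize> {
    let shifted = (flags >> 1).checked_sub(8)?;
    let idx = (shifted >> 1) as usize;
    if idx < DESCRIPTOR_COUNT { Some(idx) } else { None }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DescriptorEntry {
    /// Low 2 bits of byte 3: selects the OT packet shape (0..=3).
    pub shape_selector: u8,
    /// Vertex-index offset within the prim, converted from u16 units to bytes.
    pub vertex_index_offset_bytes: usize,
}

/// Decodes one 8-byte table entry (two little-endian u32 words).
pub fn decode_entry(entry: &[u8; DESCRIPTOR_STRIDE]) -> DescriptorEntry {
    DescriptorEntry {
        shape_selector: entry[3] & 0b11,
        vertex_index_offset_bytes: entry[4] as usize * 2,
    }
}
```

Returning `Option` from `descriptor_index` keeps out-of-range `flags` values from indexing past the six entries.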

TMD pointer table

FUN_80026B4C writes registered TMDs to *(int **)(idx * 4 + 0x8007C018). Consumers in retail (4 functions, all setup-not-render):

Function	Role
`FUN_80021B04`	Actor-spawn helper, builds per-actor OBJECT pointer table.
`FUN_80024D78`	Per-actor OBJECT-table rebuild.
`FUN_8001EBEC`	Per-frame OBJECT[10/11] swap (pose select for player TMDs).
`FUN_8001E890`	DATA_FIELD player loader; loads `data_field_player_lzs` chains, registers TMDs.

The per-actor OBJECT[i] is a 28-byte struct (sizeof(OBJECT) = 28) copied into actor[0x44][i+1] from tmd + 12 + i*28.
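The address arithmetic in this section is small enough to spell out. The sketch below uses the constants quoted in the text; the function names are hypothetical, and `read_object` simply copies raw bytes without interpreting the OBJECT fields.

```rust
pub const TMD_TABLE_BASE: u32 = 0x8007_C018;
pub const OBJECT_TABLE_OFFSET: usize = 12;
pub const OBJECT_SIZE: usize = 28;

/// RAM address of the pointer slot for registered TMD `idx`.
pub fn tmd_slot_addr(idx: u32) -> u32 {
    TMD_TABLE_BASE.wrapping_add(idx.wrapping_mul(4))
}

/// Byte offset of OBJECT[i] from the start of the TMD.
pub fn object_offset(i: usize) -> usize {
    OBJECT_TABLE_OFFSET + i * OBJECT_SIZE
}

/// Copies the raw 28 bytes of OBJECT[i], or `None` if the buffer is too short.
pub fn read_object(tmd: &[u8], i: usize) -> Option<[u8; OBJECT_SIZE]> {
    let start = object_offset(i);
    let src = tmd.get(start..start + OBJECT_SIZE)?;
    let mut out = [0u8; OBJECT_SIZE];
    out.copy_from_slice(src);
    Some(out)
}
```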

VRAM emulation in the engine port

crates/engine-render emulates a 1024×512 R16Uint VRAM page so the per-prim CBA/TSB selectors plus 4/8/15bpp + CLUT decoding can happen in a fragment shader. The viewer uploads every sibling TIM into VRAM so multi-page meshes render correctly.
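For reference, here is a CPU-side sketch of the texel decode that the fragment shader performs. It is not the shader itself: the names are illustrative, coordinates are taken directly in VRAM pixels, and the packing rules (four 4bpp texels or two 8bpp texels per 16-bit word, lowest bits first) are the standard PSX ones.

```rust
pub const VRAM_W: usize = 1024;
pub const VRAM_H: usize = 512;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Depth {
    Bpp4,
    Bpp8,
    Bpp15,
}

/// Returns the 16-bit colour for texel (u, v) of a texture page.
/// `page` and `clut` are top-left corners in VRAM pixel coordinates.
pub fn sample(
    vram: &[u16],
    depth: Depth,
    page: (usize, usize),
    clut: (usize, usize),
    u: usize,
    v: usize,
) -> u16 {
    // Wrap coordinates so the sketch never indexes outside the VRAM array.
    let at = |x: usize, y: usize| vram[(y % VRAM_H) * VRAM_W + (x % VRAM_W)];
    match depth {
        // Direct colour: one texel per VRAM word.
        Depth::Bpp15 => at(page.0 + u, page.1 + v),
        // Two 8-bit indices per word, low byte first.
        Depth::Bpp8 => {
            let word = at(page.0 + u / 2, page.1 + v);
            let index = (word >> ((u % 2) * 8)) & 0xFF;
            at(clut.0 + index as usize, clut.1)
        }
        // Four 4-bit indices per word, low nibble first.
        Depth::Bpp4 => {
            let word = at(page.0 + u / 4, page.1 + v);
            let index = (word >> ((u % 4) * 4)) & 0xF;
            at(clut.0 + index as usize, clut.1)
        }
    }
}
```

Because the CLUT row is just another region of the same array, a mesh whose CLUT was never uploaded reads zeros, which is the failure mode the sibling-TIM upload avoids.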

CLUT data scatters across PROT entries — many character meshes reference CLUT rows that live in different PROT entries from their TMD source. The viewer's --vram-extra-dir is the workaround until the runtime asset chain is fully traced. Battle is fully traced (the bundle loader handles this); field, town, and level-up still rely on the workaround. See Asset loader → "CLUT-data scattering".

PSX-faithful rendering knobs

Renderer::set_psx_mode(true) enables two retail-faithful rasterisation modes on the VRAM-mesh pipeline:

Affine UV interpolation. Per-vertex UVs interpolate linearly in screen space (no perspective-correct division). This reproduces the texture warping you see on retail surfaces with steep depth gradients: GP0(0x24)-class triangle commands transmit only (u, v) per vertex, so the rasteriser has no per-vertex depth to divide by. WGSL @interpolate(linear) gives the same behaviour.
Sub-pixel vertex snap ("vertex jitter"). Clip-space x / y are snapped to integer pixel positions inside the vertex shader (NDC → pixel grid → NDC round-trip). Reproduces the GTE's per-vertex sub-pixel-truncation jitter that gives PSX rendering its characteristic shimmer on slow-moving geometry.
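Both effects can be illustrated on the CPU. The sketch below is not the shader code: the function names are hypothetical, and the choice of `floor` for the snap is an assumption based on the "truncation" wording above.

```rust
/// Snaps one NDC coordinate to the pixel grid and converts it back.
/// `extent_px` is the framebuffer width (for x) or height (for y).
pub fn snap_ndc(ndc: f32, extent_px: f32) -> f32 {
    let px = (ndc * 0.5 + 0.5) * extent_px; // NDC -> pixel position
    (px.floor() / extent_px) * 2.0 - 1.0 // integer pixel -> NDC
}

/// Screen-space linear interpolation (what affine texturing does).
pub fn affine(a0: f32, a1: f32, t: f32) -> f32 {
    a0 + (a1 - a0) * t
}

/// Perspective-correct interpolation, shown only for comparison:
/// interpolate a/w and 1/w linearly, then divide.
pub fn perspective_correct(a0: f32, w0: f32, a1: f32, w1: f32, t: f32) -> f32 {
    let num = affine(a0 / w0, a1 / w1, t);
    let den = affine(1.0 / w0, 1.0 / w1, t);
    num / den
}
```

Halfway along an edge that runs from w = 1 to w = 3, the affine result for u going 0 to 1 is 0.5 while the perspective-correct result is 0.25; that gap is the visible warp.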

Texture page (tsb) and CLUT base address (cba) remain @interpolate(flat): they are per-primitive in retail because a GP0(0x24) packet carries one texture page and one CLUT value for the whole primitive, not one per vertex.
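As a sketch of what those two selectors contain, the following decodes them using the standard PSX GPU bit layout (texture page base in 64-pixel and 256-line steps, colour depth in bits 7-8; CLUT X in 16-pixel steps, CLUT Y in lines). The struct and function names are illustrative, not the engine's.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TexPage {
    /// Top-left X of the page in VRAM pixels (multiple of 64).
    pub base_x: u16,
    /// Top-left Y of the page in VRAM lines (0 or 256).
    pub base_y: u16,
    /// Raw depth selector: 0 = 4bpp, 1 = 8bpp, 2 = 15bpp, 3 = reserved.
    pub depth_mode: u8,
}

/// Decodes the texture-page half-word (tsb).
pub fn decode_tsb(tsb: u16) -> TexPage {
    TexPage {
        base_x: (tsb & 0xF) * 64,
        base_y: ((tsb >> 4) & 1) * 256,
        depth_mode: ((tsb >> 7) & 3) as u8,
    }
}

/// Decodes the CLUT half-word (cba) into a VRAM (x, y) position.
pub fn decode_cba(cba: u16) -> (u16, u16) {
    ((cba & 0x3F) * 16, (cba >> 6) & 0x1FF)
}
```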

A fixed-point GTE math module at crates/engine-render/src/gte.rs mirrors the retail accumulator shape: q3.12 rotation matrices, q19.12 translation vectors, i64-widened multiply-add to absorb three-term sums without overflow. The module exposes the GTE's higher-level primitives — a Camera bundle that runs RTPT (rotate-translate-perspective) end-to-end with PSX-correct saturation on behind-camera vertices, nclip for back-face rejection, avsz3 / avsz4 for OT-bucket selection, and a small CPU rasterizer scaffold (top-left fill rule, integer-pixel bounding-box iterator) that downstream tooling uses to validate captured traces.
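The fixed-point shape can be sketched in a few functions. This is not the code in gte.rs: the names and signatures are assumptions, the translation is taken to carry 12 fractional bits as the q19.12 label suggests, and the saturation and FLAG handling that the real module performs is reduced to a single clamp.

```rust
/// 1.0 in q3.12 fixed point.
pub const ONE: i16 = 0x1000;

/// Rotates `v` by a q3.12 matrix, adds a translation with 12 fractional
/// bits, and returns the integer part. Sums are widened to i64.
pub fn rotate_translate(rot: &[[i16; 3]; 3], tr: &[i32; 3], v: &[i16; 3]) -> [i32; 3] {
    let mut out = [0i32; 3];
    for row in 0..3 {
        let mut acc: i64 = tr[row] as i64;
        for col in 0..3 {
            acc += rot[row][col] as i64 * v[col] as i64;
        }
        out[row] = (acc >> 12) as i32;
    }
    out
}

/// Signed doubled area of a screen-space triangle; the sign gives the winding.
pub fn nclip(p0: (i32, i32), p1: (i32, i32), p2: (i32, i32)) -> i64 {
    let (x0, y0) = (p0.0 as i64, p0.1 as i64);
    let (x1, y1) = (p1.0 as i64, p1.1 as i64);
    let (x2, y2) = (p2.0 as i64, p2.1 as i64);
    x0 * y1 + x1 * y2 + x2 * y0 - x0 * y2 - x1 * y0 - x2 * y1
}

/// Scaled average of three screen Z values, clamped to the u16 OTZ range.
pub fn avsz3(zsf3: i16, sz: [u16; 3]) -> u16 {
    let sum: i64 = sz.iter().map(|&z| z as i64).sum();
    let scaled = (zsf3 as i64 * sum) >> 12;
    scaled.clamp(0, 0xFFFF) as u16
}
```

With ZSF3 set to roughly 0x1000 / 3, `avsz3` returns the mean depth, which is the usual way an OT bucket index is derived.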

The same module also ships a register-state emulator Gte that mirrors the PSX cop2 register file at the layer below Camera: V0..V2 input vectors, MAC0..MAC3 wide accumulators, IR0..IR3 saturated short results, the SXY / SZ / RGB FIFOs (3-deep / 4-deep / 3-deep), OTZ, and the FLAG sticky-saturation register with bit positions matching the hardware (engines comparing against a captured FLAG dump can mask the same bits via gte::flag_bits). Control registers cover the rotation / light-source / light-color matrices, translation, focal length H, screen offset OFX/OFY, the average-Z scale factors ZSF3 / ZSF4, depth-cue slope/intercept DQA / DQB, and the back_color / far_color triplets used by the depth-cue pipeline.
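The FIFO registers behave like small shift queues: pushing a new value discards the oldest. A minimal sketch, with `Fifo` as a hypothetical type rather than the emulator's own representation:

```rust
/// Fixed-depth shift FIFO; slot 0 is the oldest entry, slot N-1 the newest.
/// N must be at least 1.
pub struct Fifo<T: Copy + Default, const N: usize> {
    slots: [T; N],
}

impl<T: Copy + Default, const N: usize> Fifo<T, N> {
    pub fn new() -> Self {
        Self { slots: [T::default(); N] }
    }

    /// Shifts every entry one slot towards the front and stores `value` last.
    pub fn push(&mut self, value: T) {
        self.slots.rotate_left(1);
        self.slots[N - 1] = value;
    }

    pub fn get(&self, i: usize) -> T {
        self.slots[i]
    }
}
```

Under this sketch the SXY FIFO would be `Fifo<(i16, i16), 3>` and the SZ FIFO `Fifo<u16, 4>`, matching the depths listed above.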

Instructions implemented at the register level: RTPS / RTPT (single / triple-vertex rotate-translate-perspective), NCLIP (signed area), AVSZ3 / AVSZ4 (OT-bucket selection), MVMVA (generic matrix-vector multiply with the SF / LM flags), NCDS / NCDT (normal-color depth shading), NCS / NCT (normal-color, no depth-fade), NCCS / NCCT (normal-color color, double light pass), CDP (color depth-cued, no normal pass), CC (color color, no normal / depth), DCPL (depth-cued primary-color blend), DPCS / DPCT (depth-cued color blend), INTPL (far-color interpolation primitive), SQR (squares IR1..IR3), OP (cross-product of the rotation-matrix diagonal with IR), and GPF / GPL (general-purpose IR×IR0 multiply / accumulate). Together these cover the full retail cop2 instruction set. Each public op charges its hardware cycle count (Nocash PSX reference table) into a Gte::cycles accumulator so emulators can pace MIPS execution against cop2 stalls; CopOp::cycles() exposes the table directly for engines that want to budget without running ops. The emulator is used for offline regression checks against captured GTE traces and as the substrate for effect spawners / hit-detection / animation re-targeting that need per-vertex visibility into the cop2 state. Production rendering still uses f32 wgpu math.
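The cycle-charging idea can be sketched for a handful of ops. `SketchOp` and `CycleCounter` are hypothetical stand-ins, not the module's `CopOp` or `Gte`, and the counts shown are the commonly cited Nocash figures for these five instructions; they should be checked against the reference table before being relied on.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchOp {
    Rtps,
    Rtpt,
    Nclip,
    Avsz3,
    Avsz4,
}

impl SketchOp {
    /// Cycle cost per op (assumed values, see the note above).
    pub fn cycles(self) -> u32 {
        match self {
            SketchOp::Rtps => 15,
            SketchOp::Rtpt => 23,
            SketchOp::Nclip => 8,
            SketchOp::Avsz3 => 5,
            SketchOp::Avsz4 => 6,
        }
    }
}

/// Running total that an emulator can compare against its CPU clock.
pub struct CycleCounter {
    pub cycles: u64,
}

impl CycleCounter {
    pub fn charge(&mut self, op: SketchOp) {
        self.cycles += op.cycles() as u64;
    }
}
```

Keeping the table on the op enum lets a caller budget a sequence of ops without executing any of them.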

Beyond the cop2 instruction set the module exposes the four MIPS register-transfer ops (MFC2 / MTC2 / CFC2 / CTC2) plus the two memory ops (LWC2 / SWC2) so engines can replay a captured GTE trace without re-deriving the cop2 register layout. read_data / write_data map the 32 cop2 data registers (V0..V2 packed pairs, RGBC, OTZ, IR0..IR3, the SXY-FIFO push slot SXYP, SZ-FIFO entries, RGB-FIFO entries, MAC0..MAC3, packed IRGB / ORGB, LZCS / LZCR) to the typed register fields; read_ctrl / write_ctrl handle the 32 control registers (rotation / light / light-color matrices packed two-per-word, translation triple, H / OFX / OFY / DQA / DQB / ZSF3 / ZSF4 / FLAG). LWC2 / SWC2 thread through a Cop2Mem trait so engines plug their main-memory implementation behind it; a VecMem implementation is shipped for replay against captured RAM snapshots, and a NullMem for tests that don't exercise memory at all.
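One way to picture the memory indirection is the sketch below. The trait shape is an assumption: the text names Cop2Mem, VecMem, and NullMem but does not give their signatures, so `MemBus` and `SnapshotMem` are deliberately different, hypothetical names. The sketch also writes straight into a raw register array, ignoring the per-register side effects that a real write_data path handles.

```rust
/// Word-granular memory access used by the two cop2 memory ops.
pub trait MemBus {
    fn read_u32(&self, addr: u32) -> Option<u32>;
    fn write_u32(&mut self, addr: u32, value: u32) -> bool;
}

/// A flat RAM snapshot mapped at `base`.
pub struct SnapshotMem {
    pub base: u32,
    pub bytes: Vec<u8>,
}

impl SnapshotMem {
    /// Translates an address to a byte index; rejects unaligned or
    /// out-of-range accesses.
    fn index(&self, addr: u32) -> Option<usize> {
        let off = addr.checked_sub(self.base)? as usize;
        if off % 4 == 0 && off + 4 <= self.bytes.len() { Some(off) } else { None }
    }
}

impl MemBus for SnapshotMem {
    fn read_u32(&self, addr: u32) -> Option<u32> {
        let i = self.index(addr)?;
        let b = &self.bytes[i..i + 4];
        Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
    }

    fn write_u32(&mut self, addr: u32, value: u32) -> bool {
        match self.index(addr) {
            Some(i) => {
                self.bytes[i..i + 4].copy_from_slice(&value.to_le_bytes());
                true
            }
            None => false,
        }
    }
}

/// LWC2: load a word from memory into data register `rt`.
pub fn lwc2<M: MemBus>(data: &mut [u32; 32], mem: &M, rt: usize, addr: u32) -> bool {
    match mem.read_u32(addr) {
        Some(word) => {
            data[rt & 31] = word;
            true
        }
        None => false,
    }
}

/// SWC2: store data register `rt` to memory.
pub fn swc2<M: MemBus>(data: &[u32; 32], mem: &mut M, rt: usize, addr: u32) -> bool {
    mem.write_u32(addr, data[rt & 31])
}
```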

Trace capture & replay harness

A companion module at crates/engine-render/src/gte_trace.rs turns the cop2 emulator into a regression-test harness. GteSnapshot::capture serialises the entire register file (data + control + cycle counter) to a plain struct that round-trips through restore; diff returns a typed list of per-field divergences. TraceRecorder wraps a live Gte — engines configure it with rotation matrices and vertex inputs, then call record(op) per cop2 operation; the recorder pushes one TraceStep per op containing the before / after snapshots.
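The snapshot-and-diff idea can be sketched as follows. The types here are simplified stand-ins: the real GteSnapshot uses typed register fields, whereas this sketch treats the register file as two raw 32-word arrays, and every name in it is hypothetical.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Snapshot {
    pub data: [u32; 32],
    pub ctrl: [u32; 32],
    pub cycles: u64,
}

/// One divergence between two snapshots.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum FieldDiff {
    Data { index: usize, before: u32, after: u32 },
    Ctrl { index: usize, before: u32, after: u32 },
    Cycles { before: u64, after: u64 },
}

/// Lists every field that differs, data registers first.
pub fn diff(before: &Snapshot, after: &Snapshot) -> Vec<FieldDiff> {
    let mut out = Vec::new();
    for i in 0..32 {
        if before.data[i] != after.data[i] {
            out.push(FieldDiff::Data { index: i, before: before.data[i], after: after.data[i] });
        }
    }
    for i in 0..32 {
        if before.ctrl[i] != after.ctrl[i] {
            out.push(FieldDiff::Ctrl { index: i, before: before.ctrl[i], after: after.ctrl[i] });
        }
    }
    if before.cycles != after.cycles {
        out.push(FieldDiff::Cycles { before: before.cycles, after: after.cycles });
    }
    out
}
```

An empty diff between the recorded "after" snapshot and the replayed one is what a passing step looks like.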

Recorded traces serialise to JSON via Cop2Trace::write_json_pretty and round-trip through read_json. Cop2Trace::replay runs each step against a fresh emulator and surfaces any per-field divergence as a StepMismatch with the op name + diff list. The legaia-engine gte-replay --trace FILE subcommand drives this from the CLI: pass a captured retail RAM trace and the harness reports any cop2 emulator regression.

The per-mode descriptor table from DAT_8007326C is also exposed as a typed lookup at crates/tmd/src/descriptor.rs: Descriptor::for_flags(flags) returns the resolved PacketShape (one of F3 / FT3 / G3 / GT3 / F4 / FT4 / G4 / GT4) and the per-prim vertex-index offset. The lookup matches the older legaia_prims::vertex_offset_bytes free function on every valid flags value — both read the same on-disc table — but exposes the shading mode (flat vs gouraud) and texture flag as typed fields so consumers can branch on them without re-deriving the bit math.
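To show what "typed fields" buys a consumer, here is a sketch of a packet-shape enum with derived properties. It follows Sony's primitive naming (F = flat, G = gouraud, T = textured, 3 or 4 vertices); the methods are illustrative and are not the API of crates/tmd, and the mapping from flags to a shape is left out because it lives in the descriptor table.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PacketShape {
    F3,
    FT3,
    G3,
    GT3,
    F4,
    FT4,
    G4,
    GT4,
}

impl PacketShape {
    /// Triangles have 3 vertices, quads 4.
    pub fn vertex_count(self) -> usize {
        match self {
            PacketShape::F3 | PacketShape::FT3 | PacketShape::G3 | PacketShape::GT3 => 3,
            PacketShape::F4 | PacketShape::FT4 | PacketShape::G4 | PacketShape::GT4 => 4,
        }
    }

    /// Gouraud shapes carry one colour per vertex instead of one per primitive.
    pub fn is_gouraud(self) -> bool {
        matches!(
            self,
            PacketShape::G3 | PacketShape::GT3 | PacketShape::G4 | PacketShape::GT4
        )
    }

    /// Textured shapes carry UVs plus the texture page and CLUT selectors.
    pub fn is_textured(self) -> bool {
        matches!(
            self,
            PacketShape::FT3 | PacketShape::GT3 | PacketShape::FT4 | PacketShape::GT4
        )
    }
}
```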

Stage geometry detector (legacy, signal only)

A "12-byte fixed prefix 00 F0 84 7F 01 F0 1F 00 00 F1 00 00 repeated at 20-byte stride" detector lives at crates/asset/src/stage_geom.rs. It's not real stage geometry; the pattern is the standard primitive-group header for Legaia TMD primitive-group data when ((flags >> 1) - 8) >> 1 == K (where K is the group type that uses 20-byte stride).

The detector is preserved as a signal during exploration ("this buffer contains a TMD with effect-style primitives") but for actual geometry extraction use the TMD parser (crates/tmd::legaia_prims).
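For readers who want to reproduce the signal, a minimal version of such a detector might look like this. It is a sketch written from the description above, not the code in stage_geom.rs; `find_runs` and its return shape are hypothetical.

```rust
pub const PREFIX: [u8; 12] = [
    0x00, 0xF0, 0x84, 0x7F, 0x01, 0xF0, 0x1F, 0x00, 0x00, 0xF1, 0x00, 0x00,
];
pub const STRIDE: usize = 20;

/// Returns (start offset, repeat count) for every run of the prefix that
/// repeats at a 20-byte stride at least `min_repeats` times.
pub fn find_runs(buf: &[u8], min_repeats: usize) -> Vec<(usize, usize)> {
    let has_prefix = |off: usize| buf.get(off..off + PREFIX.len()) == Some(&PREFIX[..]);
    let mut runs = Vec::new();
    let mut off = 0;
    while off + PREFIX.len() <= buf.len() {
        // Count consecutive records starting at `off`.
        let mut count = 0;
        while has_prefix(off + count * STRIDE) {
            count += 1;
        }
        if count >= min_repeats.max(1) {
            runs.push((off, count));
            off += count * STRIDE; // skip past the whole run
        } else {
            off += 1;
        }
    }
    runs
}
```

A hit only says that TMD-style primitive data is present in the buffer; it does not locate the TMD header or validate the group.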