Renderer (Legaia TMD)
Legaia uses a custom variant of the PSX TMD format for its 3D meshes, and a custom renderer to draw it. The renderer is FUN_8002735C - 60 GTE ops, with a per-mode descriptor table that encodes how each primitive group is laid out. The engine port emulates PSX VRAM in a compute-friendly format so multi-page meshes render correctly.
How it works
The PSX has dedicated 3D hardware called the GTE (Geometry Transformation Engine), which transforms vertices, and a separate 2D GPU that rasterises into its own 1 MB of VRAM. Games typically draw a 3D model like this: walk the mesh primitive-by-primitive, push each one through the GTE for projection, then drop the resulting 2D primitive (as a "GP0" packet) into an ordering table (OT) that the GPU reads and rasterises. Textures and palettes (CLUTs) live as rectangular regions inside VRAM, addressed by their top-left corner.
Legaia's TMDs are almost the standard format Sony documented, but with a custom primitive-group header layout and a custom renderer. Each primitive group has an 8-byte header [u16 count, u16 flags, u8 olen, u8 ilen, u8 flag, u8 mode] followed by count × ilen*4 bytes of per-prim data. The vertex-index byte offset within each prim isn't a fixed field - it's looked up from a 6-entry descriptor table indexed by ((flags >> 1) - 8) >> 1. The descriptor table also encodes which OT-packet shape the renderer emits.
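The header layout above can be sketched as a small parser. This is a minimal illustration, not the code in crates/tmd: the GroupHeader type and its method names are hypothetical, and little-endian field order is assumed because the PSX CPU is little-endian.

```rust
/// 8-byte primitive-group header, fields in on-disc order.
/// (Hypothetical type for illustration; not the crate's real struct.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct GroupHeader {
    pub count: u16,
    pub flags: u16,
    pub olen: u8,
    pub ilen: u8,
    pub flag: u8,
    pub mode: u8,
}

impl GroupHeader {
    /// Parses a header from the first 8 bytes of `bytes` (little-endian).
    pub fn parse(bytes: &[u8]) -> Option<Self> {
        if bytes.len() < 8 {
            return None;
        }
        Some(Self {
            count: u16::from_le_bytes([bytes[0], bytes[1]]),
            flags: u16::from_le_bytes([bytes[2], bytes[3]]),
            olen: bytes[4],
            ilen: bytes[5],
            flag: bytes[6],
            mode: bytes[7],
        })
    }

    /// Size of the per-prim payload that follows: count x ilen*4 bytes.
    pub fn payload_len(&self) -> usize {
        self.count as usize * self.ilen as usize * 4
    }

    /// Descriptor-table index, using the formula quoted in the text.
    /// Returns None when `flags >> 1` is below 8 (the subtraction would underflow).
    pub fn descriptor_index(&self) -> Option<usize> {
        ((self.flags >> 1) as usize).checked_sub(8).map(|v| v >> 1)
    }
}
```

Returning Option from parse and descriptor_index keeps truncated buffers and out-of-range flags from panicking, which is convenient when scanning unknown data.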
The clean-room engine port emulates PSX VRAM as a 1024×512 R16Uint texture. The fragment shader reads the texture coordinates and CLUT base from the per-primitive bind, decodes 4/8/15bpp, indexes the CLUT, and outputs the final colour. That gives the same per-prim mode-switching the PSX hardware does, but in a single GPU draw call.
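For reference, the decode the fragment shader performs can be sketched on the CPU. This is an illustrative software version, not the shader in crates/engine-render; the function names are hypothetical, and the tsb / cba bit layouts are the standard PSX ones (texture page X in 64-pixel steps, Y in 256-line steps, depth in bits 7-8; CLUT X in 16-pixel steps, Y in bits 6-14).

```rust
pub const VRAM_W: usize = 1024;
pub const VRAM_H: usize = 512;

/// Top-left VRAM pixel of the texture page selected by `tsb`.
pub fn tpage_origin(tsb: u16) -> (usize, usize) {
    ((tsb & 0xF) as usize * 64, ((tsb >> 4) & 1) as usize * 256)
}

/// Top-left VRAM pixel of the CLUT selected by `cba`.
pub fn clut_origin(cba: u16) -> (usize, usize) {
    ((cba & 0x3F) as usize * 16, ((cba >> 6) & 0x1FF) as usize)
}

/// Returns the BGR555 colour for texel (u, v) of one primitive.
/// `vram` is the 1024x512 array of 16-bit words, row-major.
pub fn sample_texel(vram: &[u16], tsb: u16, cba: u16, u: u8, v: u8) -> u16 {
    assert_eq!(vram.len(), VRAM_W * VRAM_H);
    let (px, py) = tpage_origin(tsb);
    let (cx, cy) = clut_origin(cba);
    let (u, v) = (u as usize, v as usize);
    let row = (py + v) * VRAM_W;
    // Palette lookup; X wraps at the VRAM edge.
    let clut = |idx: usize| vram[cy * VRAM_W + (cx + idx) % VRAM_W];
    match (tsb >> 7) & 3 {
        // 4bpp: four texels per 16-bit word, low nibble first.
        0 => {
            let word = vram[row + (px + u / 4) % VRAM_W];
            clut(((word >> ((u & 3) * 4)) & 0xF) as usize)
        }
        // 8bpp: two texels per word, low byte first.
        1 => {
            let word = vram[row + (px + u / 2) % VRAM_W];
            clut(((word >> ((u & 1) * 8)) & 0xFF) as usize)
        }
        // 15bpp: the word is the colour itself.
        _ => vram[row + (px + u) % VRAM_W],
    }
}
```

Because every input is per-primitive state plus a texel coordinate, the same logic maps directly onto a fragment shader with tsb and cba passed as flat attributes.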
Per-mode descriptor table
The renderer indexes into an 8-byte-stride table at 0x8007326C using ((flags >> 1) - 8) >> 1:
| flags | table idx | byte 0 | byte 4 |
|---|---|---|---|
| 0x10/11 | 0 | 0x04 | 0x05 |
| 0x12/13 | 1 | 0x09 | 0x07 |
| 0x14/15 | 2 | 0x04 | 0x00 |
| 0x16/17 | 3 | 0x06 | 0x06 |
| 0x18/19 | 4 | 0x07 | 0x07 |
| 0x1A/1B | 5 | 0x09 | 0x0B |
| 0x20-23 | 4 | (same) | (same) |
Each entry's first u32 has bytes [?, ?, ?, type_bits] where the low 2 bits of byte 3 select the OT packet shape (0/1/2/3 → different DrawPolyXX variants). Each entry's second u32 has the vertex-index offset (in u16 units) within the prim in its low byte. See formats/tmd for the full layout.
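As a sketch of how one 8-byte entry might be decoded under that description: the DescriptorEntry type and decode_entry function below are hypothetical names, and only the two fields the text describes are extracted (the remaining bytes are left uninterpreted).

```rust
/// The two fields the renderer is described as reading from a table entry.
/// (Hypothetical type; the real typed lookup lives in crates/tmd.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DescriptorEntry {
    /// Low 2 bits of byte 3: selects the OT packet shape (0..=3).
    pub shape_bits: u8,
    /// Vertex-index offset within the prim, converted from u16 units to bytes.
    pub vertex_offset_bytes: usize,
}

/// Decodes one 8-byte descriptor entry (two little-endian u32 words).
pub fn decode_entry(entry: &[u8; 8]) -> DescriptorEntry {
    DescriptorEntry {
        // Byte 3 is the top byte of the first u32.
        shape_bits: entry[3] & 0b11,
        // Byte 4 is the low byte of the second u32; the stored unit is u16.
        vertex_offset_bytes: entry[4] as usize * 2,
    }
}
```

For the idx 0 row of the table (byte 4 = 0x05), this yields a vertex-index offset of 10 bytes into each prim.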
TMD pointer table
FUN_80026B4C writes registered TMDs to *(int **)(idx * 4 + 0x8007C018). Consumers in retail (4 functions, all setup-not-render):
| Function | Role |
|---|---|
| FUN_80021B04 | Actor-spawn helper, builds per-actor OBJECT pointer table. |
| FUN_80024D78 | Per-actor OBJECT-table rebuild. |
| FUN_8001EBEC | Per-frame OBJECT[10/11] swap (pose select for player TMDs). |
| FUN_8001E890 | “DATA_FIELD player loader”. The retail-PROT branch targets PROT 876 (player_data), which is a streaming-format VAB+TIM_LIST+SEQ payload - not a TMD pack. The dev string data\field\player.lzs maps to that same PROT 876 entry. The DAT_8007C018[0..4] character TMDs actually come from PROT 0874 (befect_data) section 0; see formats / world-map-overlay § Disc-side source of [0..4]. What FUN_8001E890 does end up writing into DAT_8007C018[0..2] is the post-install group-count cap (entry[+0x08] = 10) and the equipment-conditional patch dispatch into FUN_8001EBEC. |
The per-actor OBJECT[i] is a 28-byte struct (sizeof(OBJECT) = 28) copied into actor[0x44][i+1] from tmd + 12 + i*28.
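The offset arithmetic can be illustrated with a bounds-checked slice helper. The function and constant names are hypothetical; only the 12-byte header skip and the 28-byte stride come from the text.

```rust
/// Bytes before the OBJECT table in a TMD buffer.
pub const TMD_HEADER_LEN: usize = 12;
/// Size of one OBJECT record.
pub const OBJECT_LEN: usize = 28;

/// Returns the 28 raw bytes of OBJECT[i], or None if the buffer is too short.
pub fn object_bytes(tmd: &[u8], i: usize) -> Option<&[u8]> {
    // start = 12 + i * 28, with overflow checks for hostile indices.
    let start = TMD_HEADER_LEN.checked_add(i.checked_mul(OBJECT_LEN)?)?;
    let end = start.checked_add(OBJECT_LEN)?;
    tmd.get(start..end)
}
```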
VRAM emulation in the engine port
crates/engine-render emulates a 1024×512 R16Uint VRAM page so the per-prim CBA/TSB selectors plus 4/8/15bpp + CLUT decoding can happen in a fragment shader. The viewer uploads every sibling TIM into VRAM so multi-page meshes render correctly.
CLUT data scatters across PROT entries - many character meshes reference CLUT rows that live in different PROT entries from their TMD source. The viewer's --vram-extra-dir is the workaround until the runtime asset chain is fully traced. Battle is fully traced (the bundle loader handles this); field, town, and level-up still rely on the workaround. See Asset loader → "CLUT-data scattering".
Targeted VRAM upload
The TIM corpus on a single PROT entry can run into the hundreds. Uploading every TIM into the 1 MB VRAM clobbers regions a different mesh references as its CLUT row, and the paletted decode reads image pixels as palette entries (rainbow noise). The asset viewer and the tmd CLI both go through legaia_tmd::vram_targeted::build_vram_targeted: for every TIM, the image block and CLUT block are decided independently against the prim-target rectangles for the current TMD - a TIM can contribute one block, both, or neither.
legaia_tim::vram::Vram::prim_texture_status then classifies each prim's (cba, tsb, uv) lookup as Ok / MissingClut / ClutDepthMismatch { populated_width, expected_width } / MissingTexturePage so the viewer can drop bad prims at mesh-build time and the CLI can explain why a prim was dropped (the most common case is a 4bpp prim referencing a CLUT row that's been populated as a 256-entry 8bpp palette by a different TIM).
The same filter is wired into engine-side scene loads through ResolvedTmd::build_filtered_vram_mesh, so battle / field actor meshes inherit the same cleanup the asset viewer has.
Engine-side targeted upload + shared blocks
SceneResources::build_targeted is the engine-side mirror of the asset-viewer's targeted-upload path: it parses every TMD in a scene, collects the union of all prim-target rectangles (CLUT rows + texture-page UV bboxes), then walks every TIM and decides per-block whether to write it. Only the texture bytes the current scene's meshes need are written, which avoids the CLUT-row collisions that drop 80%+ of textured prims under the naive "upload every TIM" path. (The retail field loader itself DMAs every scene TIM; see the render-vs-parity note below.)
build_targeted also accepts a list of shared CDNAME blocks via the FIELD_SHARED_BLOCKS constant (init_data + player_data). These are the blocks the retail engine keeps resident across field-scene transitions - player_data (PROT 876) is a streaming-format file whose 0x01 (TIM_LIST) chunk carries the 256x256 player atlas at VRAM fb=(768, 0) with CLUT at (0, 500) (the other chunks are a VAB header and a small SEQ-magic trailer; the file carries no TMDs - character meshes come from PROT 0874, see formats / world-map-overlay § Disc-side source of [0..4]); init_data (PROT 0) holds shared UI / sprite tiles. The shared blocks are uploaded first, so scene-local TIMs win any slot collision (mirrors the retail boot-then-scene order).
SceneHost::enter_field_scene calls build_targeted with the field shared blocks by default; the legacy SceneResources::build / build_with_shared paths remain for tests and engines that want the unfiltered upload for diagnostic purposes.
Render vs parity: targeted vs DMA-every-TIM. The targeted upload is a render optimisation - it writes only the texture bytes the current meshes sample. The retail field loader DMAs every scene TIM to VRAM regardless of which prim samples it. For the VRAM parity oracle, BuildOptions { upload_all_tims: true } switches build_targeted to build_vram_full_from_buffers: every parseable collected TIM is written to its header destination (images first as sequential DMA, then CLUTs with merge-zeros to preserve the row-479 palette split). On town01 this lifts oracle coverage from ~4% (targeted) to ~38% of the runtime texture region, with wrong (engine-only) texels dropping from ~11.5k to ~250. The flag defaults false, so the render path is unchanged.
The TIM scan walks both raw entry bytes and any LZS-decompressed sections (via legaia_asset::tim_scan::scan_entry), so battle / level-up bundles that pack their character TIMs inside an LZS container don't need a raw-byte fallback path.
legaia-engine info --scene <name> --tmd-stats reports per-TMD kept / miss_clut / depth_mm / miss_page counts so future regressions in the targeted-upload pipeline are visible without firing up the windowed viewer. --vram-png / --vram-bin write the engine VRAM as a 1024x512 PNG / raw BGR555 blob; --runtime-vram <bin> (paired with mednafen-state vram-dump --out-bin) reports per-region pixel-coverage statistics against the runtime ground truth, and --vram-diff-png writes a colour-coded diff (red = runtime has, engine missing; green = engine extras; blue = both populated but different).
Two-pass upload ordering
Inside build_vram_targeted_from_buffers the targeted upload runs in two passes:
- Image pass writes every useful TIM image block (image overlaps a mesh's tex page region AND does NOT overlap another mesh's CLUT row).
- CLUT pass writes every useful TIM CLUT block (CLUT overlaps a mesh's CLUT row), unconditionally with respect to image-page collisions.
Earlier versions filtered CLUT uploads with a clut_collides_page suppression that dropped legitimate palette rows whenever any mesh's UV bbox happened to brush the CLUT row's y-coordinate. The town01 character TMDs hit this: their 256-pixel-wide palette at y=479 overlapped a separate scene mesh's texture-page rectangle, so the CLUT upload was suppressed and 388 prims dropped as MissingClut. Splitting into image-then-CLUT order keeps the palette rows that PSX games place on the bottom of texture pages coherent without the per-prim heuristic.
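The two per-block decisions can be sketched with plain rectangle tests. This is a simplified illustration, not the body of build_vram_targeted_from_buffers: the Rect and Targets types and both function names are hypothetical, and the image rule checks against every collected CLUT row rather than only those of other meshes.

```rust
/// Axis-aligned VRAM rectangle in pixels.
#[derive(Debug, Clone, Copy)]
pub struct Rect {
    pub x: u32,
    pub y: u32,
    pub w: u32,
    pub h: u32,
}

impl Rect {
    /// True when the two rectangles share at least one pixel.
    pub fn overlaps(&self, o: &Rect) -> bool {
        self.x < o.x + o.w && o.x < self.x + self.w
            && self.y < o.y + o.h && o.y < self.y + self.h
    }
}

/// Union of prim-target rectangles collected from the scene's meshes.
pub struct Targets {
    pub tex_pages: Vec<Rect>,
    pub clut_rows: Vec<Rect>,
}

/// Image pass: keep a TIM image block that feeds some texture page
/// and would not overwrite any referenced CLUT row.
pub fn image_pass_keeps(img: &Rect, t: &Targets) -> bool {
    t.tex_pages.iter().any(|p| img.overlaps(p))
        && !t.clut_rows.iter().any(|c| img.overlaps(c))
}

/// CLUT pass: keep a TIM CLUT block that covers some referenced CLUT row,
/// regardless of any texture-page rectangle it also touches.
pub fn clut_pass_keeps(clut: &Rect, t: &Targets) -> bool {
    t.clut_rows.iter().any(|c| clut.overlaps(c))
}
```

With a texture page spanning y = 256..512 and a referenced CLUT row at y = 479, a 256-wide palette at y = 479 is kept by the CLUT pass even though it sits inside the page rectangle, which is the town01 case described above.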
CLUT-trace + VRAM-oracle diagnostics
Two legaia-engine subcommands surface where the engine's loader still has gaps against a captured runtime VRAM:
- legaia-engine clut-trace --scene <name> --disc <bin> [--runtime-vram <bin>] walks every prim dropped as MissingClut, groups by (cba, depth), and reports which PROT entries carry a TIM whose CLUT block covers each missing row by rectangle containment (PSX TIMs commonly pack 16 distinct 16-entry palettes into one 256-wide CLUT block, so a CBA's 16-pixel slot sits inside a wider supplier rect). --runtime-vram distinguishes "row absent from engine but present at runtime" (engine loader gap) from "row absent from runtime too" (mesh references unreachable CLUT - likely a sub-pack walker port needed).
- legaia-engine vram-oracle --scene <name> --disc <bin> --runtime-vram <bin> [--diff-png <path>] [--tiles] rebuilds the scene's engine VRAM and reports per-band overlap counts plus an optional 64x64-tile breakdown. --diff-png writes the same colour-coded diff as info --vram-diff-png. The standalone VRAM build picks its load kind via oracle_load_kind, mirroring the live enter_field_scene choice: world-map scenes (map\d\d) build with SceneLoadKind::WorldMap so the kingdom slot-0 terrain atlas (opaque to the generic TIM scanner) lands in VRAM, instead of reporting the grass/water pages as a phantom gap (roughly doubles map01 texpage residency).
Both work without any pre-extracted tim_scan/ tree - they operate straight off PROT.DAT + CDNAME.TXT (extracted-root or in-place disc image).
CLUT-depth-mismatch threshold
Vram::prim_texture_status flags ClutDepthMismatch when a CLUT row is populated past what the prim's color depth could legitimately fill: for 4bpp prims the threshold is 16 * 16 = 256 entries (16 distinct 16-entry palettes packed in one row, picked by the prim's CBA low 6 bits - the standard Legaia character-TIM layout); for 8bpp it's 2 * 256 (one palette plus slack for stray pixels). Anything past that indicates another TIM's image bytes have spilled onto the CLUT row, and the paletted decode would index into pixel data. The targeted-upload path in build_targeted prevents this spillage, so engine-side scenes hit the mismatch threshold only when a regression breaks the per-TIM block-arbitration.
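The threshold rule can be written out as a small predicate. This is a sketch under assumptions: the Depth enum and function names are hypothetical, and "populated past" is read as a strict greater-than comparison, which may not match the exact boundary handling in Vram::prim_texture_status.

```rust
/// Paletted colour depths that reference a CLUT row.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Depth {
    Bpp4,
    Bpp8,
}

/// Largest populated width (in CLUT entries) still considered legitimate.
pub fn max_clut_entries(depth: Depth) -> usize {
    match depth {
        // 16 distinct 16-entry palettes packed in one row.
        Depth::Bpp4 => 16 * 16,
        // One 256-entry palette plus slack for stray pixels.
        Depth::Bpp8 => 2 * 256,
    }
}

/// True when the row is populated further than the depth can explain.
/// Assumption: the comparison is strict; the real boundary may differ.
pub fn is_depth_mismatch(depth: Depth, populated_width: usize) -> bool {
    populated_width > max_clut_entries(depth)
}
```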
Texture-window register (GP0 0xE2)
Renderer::set_texture_window(mask_x, mask_y, off_x, off_y) maps to GP0(0xE2) "Texture Window setting": four 5-bit values in 8-pixel steps that clamp / wrap texture-coordinate sampling to a smaller window inside the texture page. Default is all-zero (no-op). Retail Legaia leaves the register at zero almost everywhere; the API is wired primarily so future runtime LoadImage / DMA-to-VRAM trace work can replay the register state faithfully. The fragment shader applies the per-pixel coord = (coord & ~(mask*8)) | ((offset & mask)*8) transformation before texture-page lookup.
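The per-coordinate transformation can be checked on the CPU with a direct transcription of that formula. The function name is hypothetical; it handles one axis, so it would be applied once to u with (mask_x, off_x) and once to v with (mask_y, off_y).

```rust
/// Applies the GP0(0xE2) texture-window transform to one texture coordinate.
/// `mask` and `offset` are the 5-bit register fields, in 8-pixel steps.
pub fn apply_texture_window(coord: u8, mask: u8, offset: u8) -> u8 {
    let mask = mask & 0x1F;
    let offset = offset & 0x1F;
    // 0x1F * 8 = 248, so the scaled values always fit in a u8.
    let mask_px = mask * 8;
    let offset_px = (offset & mask) * 8;
    (coord & !mask_px) | offset_px
}
```

With mask = 0 and offset = 0 the function is the identity, matching the all-zero default being a no-op.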
Asset-viewer flat-shaded fallback
asset-viewer tmd <PATH> --no-textures (alias --flat-shaded) suppresses the VRAM path entirely and renders unlit flat geometry. Useful for inspecting mesh silhouettes without battling palette guesses (the runtime LoadImage trace for field / town scenes is not yet captured, so some palette rows always render as garbage in textured mode).
tmd CLI VRAM diagnostics
tmd prims <PATH> --vram-dir extracted/tim_scan/<entry> simulates the targeted upload and adds a per-prim verdict trailer (-> Ok / -> MISSING CLUT (row N) / -> DEPTH MISMATCH (row N populated with K entries; prim expects M) / -> MISSING TEXTURE PAGE (tpage 0xNN)).
tmd vram-dump <PATH> -o vram.png [--vram-dir ...] [--annotate] exports the post-upload software VRAM as a 1024x512 PNG with optional red CLUT-row + green texture-page outlines, so collisions are obvious without firing up the GUI.
PSX-faithful rendering knobs
Renderer::set_psx_mode(true) enables a set of retail-faithful rasterisation modes on the 3D mesh pipelines (in legaia-engine play-window, opt in with LEGAIA_PSX_RENDER=1):
- Affine UV interpolation. Per-vertex UVs interpolate linearly in screen space (no perspective-correct division). This reproduces the texture warping you see on retail surfaces with steep depth gradients - GP0(0x24)-class triangle commands transmit only (u, v) per vertex, the rasteriser does not divide by 1/w. WGSL @interpolate(linear) gives the same behaviour.
- Sub-pixel vertex snap ("vertex jitter"). Clip-space x/y are snapped to integer pixel positions inside the vertex shader (NDC → pixel grid → NDC round-trip). Reproduces the GTE's per-vertex sub-pixel-truncation jitter that gives PSX rendering its characteristic shimmer on slow-moving geometry.
- 15-bit ordered dithering. Packing the 24-bit shaded colour into the 15-bit (BGR555) framebuffer, the PSX GPU adds a signed 4×4 ordered-dither offset per pixel before truncating each channel to 5 bits. The shader helper PSX_DITHER_WGSL (prepended to every shaded 3D shader) reproduces it and mirrors the unit-tested CPU psx_dither module; the composed shaders are naga-validated in the test suite (a GPU-free guard that the WGSL stays well-formed).
- No synthetic lighting. Outside psx_mode the mesh shaders multiply the texel / vertex colour by a per-frame directional Lambert from a fixed engine light, purely so untextured silhouettes read. Retail bakes its GTE lighting into the per-vertex colours and texels and the GPU only interpolates, so psx_mode drops the synthetic Lambert (shade = 1.0) and shows the source data unlit. The default keeps the readable shade.
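The ordered-dither step from the list above can be sketched on the CPU as follows. This is not the psx_dither module itself: the function names are hypothetical, and the 4×4 offset matrix is the one given in the Nocash PSX specifications, assumed here to be what the engine reproduces.

```rust
/// Signed 4x4 dither offsets, indexed [y & 3][x & 3] (Nocash PSX reference).
const DITHER: [[i16; 4]; 4] = [
    [-4, 0, -3, 1],
    [2, -2, 3, -1],
    [-3, 1, -4, 0],
    [3, -1, 2, -2],
];

/// Dithers one 8-bit channel at pixel (x, y) and truncates it to 5 bits.
pub fn dither_channel(c8: u8, x: u32, y: u32) -> u8 {
    let offset = DITHER[(y & 3) as usize][(x & 3) as usize];
    // Add the offset, saturate to 0..=255, then drop the low 3 bits.
    let v = (c8 as i16 + offset).clamp(0, 255);
    (v >> 3) as u8
}

/// Packs a dithered 24-bit colour into a BGR555 word (red in the low bits).
pub fn pack_bgr555(r: u8, g: u8, b: u8, x: u32, y: u32) -> u16 {
    let r5 = dither_channel(r, x, y) as u16;
    let g5 = dither_channel(g, x, y) as u16;
    let b5 = dither_channel(b, x, y) as u16;
    r5 | (g5 << 5) | (b5 << 10)
}
```

Saturating before the shift matters: without the clamp, near-black and near-white pixels would wrap around instead of staying at the ends of the range.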
Texture page (tsb) and CLUT base address (cba) remain @interpolate(flat) - they are per-primitive in retail because GP0(0x24) sets them once per draw call, not per vertex.
A fixed-point GTE math module at crates/engine-render/src/gte.rs mirrors the retail accumulator shape: q3.12 rotation matrices, q19.12 translation vectors, i64-widened multiply-add to absorb three-term sums without overflow. The module exposes the GTE's higher-level primitives - a Camera bundle that runs RTPT (rotate-translate-perspective) end-to-end with PSX-correct saturation on behind-camera vertices, nclip for back-face rejection, avsz3 / avsz4 for OT-bucket selection, and a small CPU rasterizer scaffold (top-left fill rule, integer-pixel bounding-box iterator) that downstream tooling uses to validate captured traces.
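A minimal sketch of that fixed-point shape, assuming the translation is already stored in q19.12 as stated above. The function names and signatures are illustrative, not the gte.rs API, and saturation flags are omitted; the NCLIP and AVSZ3 formulas follow the standard GTE definitions.

```rust
/// 3x3 rotation matrix in q3.12 (0x1000 == 1.0).
pub type Mat3 = [[i16; 3]; 3];

/// Rotates `v` by `m` and adds the q19.12 translation `t`.
/// Sums are widened to i64 so three products plus `t` cannot overflow.
pub fn rotate_translate(m: &Mat3, t: [i32; 3], v: [i16; 3]) -> [i32; 3] {
    let mut out = [0i32; 3];
    for r in 0..3 {
        let mut acc = t[r] as i64;
        for c in 0..3 {
            acc += m[r][c] as i64 * v[c] as i64;
        }
        // Drop the 12 fractional bits.
        out[r] = (acc >> 12) as i32;
    }
    out
}

/// NCLIP: signed area term of a screen-space triangle.
/// The sign flips with winding order, which is what back-face rejection tests.
pub fn nclip(p0: (i32, i32), p1: (i32, i32), p2: (i32, i32)) -> i64 {
    let (x0, y0) = (p0.0 as i64, p0.1 as i64);
    let (x1, y1) = (p1.0 as i64, p1.1 as i64);
    let (x2, y2) = (p2.0 as i64, p2.1 as i64);
    x0 * y1 + x1 * y2 + x2 * y0 - x0 * y2 - x1 * y0 - x2 * y1
}

/// AVSZ3: scaled sum of three screen Z values, saturated to 16 bits.
pub fn avsz3(zsf3: i16, sz: [u16; 3]) -> u16 {
    let sum: i64 = sz.iter().map(|&z| z as i64).sum();
    ((zsf3 as i64 * sum) >> 12).clamp(0, 0xFFFF) as u16
}
```

With zsf3 set near 0x1000 / 3, avsz3 returns roughly the average depth of the triangle, which is then usable as an OT bucket index.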
The same module also ships a register-state emulator Gte that mirrors the PSX cop2 register file at the layer below Camera: V0..V2 input vectors, MAC0..MAC3 wide accumulators, IR0..IR3 saturated short results, the SXY / SZ / RGB FIFOs (3-deep / 4-deep / 3-deep), OTZ, and the FLAG sticky-saturation register with bit positions matching the hardware (engines comparing against a captured FLAG dump can mask the same bits via gte::flag_bits). Control registers cover the rotation / light-source / light-color matrices, translation, focal length H, screen offset OFX/OFY, the average-Z scale factors ZSF3 / ZSF4, depth-cue slope/intercept DQA / DQB, and the back_color / far_color triplets used by the depth-cue pipeline.
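As an illustration of the FIFO behaviour, a 3-deep screen-XY FIFO can be modelled as a shift on every push. The SxyFifo type here is a hypothetical stand-in, not the field layout of the Gte struct.

```rust
/// 3-deep screen-coordinate FIFO: slots[0] is the oldest entry (SXY0),
/// slots[2] the newest (SXY2).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SxyFifo {
    pub slots: [(i16, i16); 3],
}

impl SxyFifo {
    /// Pushes a new projected vertex: SXY0 <- SXY1 <- SXY2 <- `xy`.
    pub fn push(&mut self, xy: (i16, i16)) {
        self.slots = [self.slots[1], self.slots[2], xy];
    }
}
```

After three pushes the FIFO holds one full triangle, which is the state NCLIP-style ops read.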
Instructions implemented at the register level: RTPS / RTPT (single / triple-vertex rotate-translate-perspective), NCLIP (signed area), AVSZ3 / AVSZ4 (OT-bucket selection), MVMVA (generic matrix-vector multiply with the SF / LM flags), NCDS / NCDT (normal-color depth shading), NCS / NCT (normal-color, no depth-fade), NCCS / NCCT (normal-color color, double light pass), CDP (color depth-cued, no normal pass), CC (color color, no normal / depth), DCPL (depth-cued primary-color blend), DPCS / DPCT (depth-cued color blend), INTPL (far-color interpolation primitive), SQR (squares IR1..IR3), OP (cross-product of the rotation-matrix diagonal with IR), and GPF / GPL (general-purpose IR×IR0 multiply / accumulate) - the full retail cop2 instruction set. Each public op charges its hardware cycle count (Nocash PSX reference table) into a Gte::cycles accumulator so emulators can pace MIPS execution against cop2 stalls; CopOp::cycles() exposes the table directly for engines that want to budget without running ops. Used for offline regression checks against captured GTE traces and as the substrate for effect spawners / hit-detection / animation re-targeting that need per-vertex visibility into the cop2 state. Production rendering still uses f32 wgpu math.
Beyond the cop2 instruction set the module exposes the four MIPS register-transfer ops (MFC2 / MTC2 / CFC2 / CTC2) plus the two memory ops (LWC2 / SWC2) so engines can replay a captured GTE trace without re-deriving the cop2 register layout. read_data / write_data map the 32 cop2 data registers (V0..V2 packed pairs, RGBC, OTZ, IR0..IR3, the SXY-FIFO push slot SXYP, SZ-FIFO entries, RGB-FIFO entries, MAC0..MAC3, packed IRGB / ORGB, LZCS / LZCR) to the typed register fields; read_ctrl / write_ctrl handle the 32 control registers (rotation / light / light-color matrices packed two-per-word, translation triple, H / OFX / OFY / DQA / DQB / ZSF3 / ZSF4 / FLAG). LWC2 / SWC2 thread through a Cop2Mem trait so engines plug their main-memory implementation behind it; a VecMem implementation is shipped for replay against captured RAM snapshots, and a NullMem for tests that don't exercise memory at all.
Trace capture & replay harness
A companion module at crates/engine-render/src/gte_trace.rs turns the cop2 emulator into a regression-test harness. GteSnapshot::capture serialises the entire register file (data + control + cycle counter) to a plain struct that round-trips through restore; diff returns a typed list of per-field divergences. TraceRecorder wraps a live Gte - engines configure it with rotation matrices and vertex inputs, then call record(op) per cop2 operation; the recorder pushes one TraceStep per op containing the before / after snapshots.
Recorded traces serialise to JSON via Cop2Trace::write_json_pretty and round-trip through read_json. Cop2Trace::replay runs each step against a fresh emulator and surfaces any per-field divergence as a StepMismatch with the op name + diff list. The legaia-engine gte-replay --trace FILE subcommand drives this from the CLI: pass a captured retail RAM trace and the harness reports any cop2 emulator regression.
The per-mode descriptor table from DAT_8007326C is also exposed as a typed lookup at crates/tmd/src/descriptor.rs: Descriptor::for_flags(flags) returns the resolved PacketShape (one of F3 / FT3 / G3 / GT3 / F4 / FT4 / G4 / GT4) and the per-prim vertex-index offset. The lookup matches the older legaia_prims::vertex_offset_bytes free function on every valid flags value - both read the same on-disc table - but exposes the shading mode (flat vs gouraud) and texture flag as typed fields so consumers can branch on them without re-deriving the bit math.
Stage geometry detector (legacy, signal only)
A "12-byte fixed prefix 00 F0 84 7F 01 F0 1F 00 00 F1 00 00 repeated at 20-byte stride" detector lives at crates/asset/src/stage_geom.rs. It's not real stage geometry - it's the standard primitive-group header for Legaia TMD primitive group data when ((flags >> 1) - 8) >> 1 == K (where K is the group type that uses 20-byte stride).
The detector is preserved as a signal during exploration ("this buffer contains a TMD with effect-style primitives") but for actual geometry extraction use the TMD parser (crates/tmd::legaia_prims).
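The signal itself is simple enough to sketch: count how many consecutive 20-byte records begin with the 12-byte prefix. The constant and function names below are hypothetical and the real detector in crates/asset/src/stage_geom.rs may apply additional checks.

```rust
/// The fixed 12-byte prefix quoted above.
pub const PREFIX: [u8; 12] = [
    0x00, 0xF0, 0x84, 0x7F, 0x01, 0xF0, 0x1F, 0x00, 0x00, 0xF1, 0x00, 0x00,
];
/// Distance between successive records.
pub const STRIDE: usize = 20;

/// Counts consecutive full 20-byte records, starting at `start`,
/// that begin with PREFIX. Returns 0 if `start` is out of range.
pub fn count_prefix_run(buf: &[u8], start: usize) -> usize {
    let tail = match buf.get(start..) {
        Some(t) => t,
        None => return 0,
    };
    tail.chunks_exact(STRIDE)
        .take_while(|rec| rec.starts_with(&PREFIX))
        .count()
}
```

A long run is a hint that the buffer holds a TMD primitive group of this type, at which point the TMD parser is the right tool for the actual extraction.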