This document covers the feasibility analysis and implementation plan for adding
SIMD acceleration to the Yamagi Quake II software renderer
(src/client/refresh/soft/).
- Architecture Overview
- Current Bottlenecks
- Platform SIMD Comparison
- Per-Loop SIMD Analysis
- R_ApplyLight Rework
- Implementation Phases
- File Organization
- Build System Changes
- Multicore Feasibility (Deferred)
The software renderer is a classic Quake-era scanline renderer:
- 8-bit paletted internal framebuffer (`pixel_t = unsigned char`)
- Double-buffered: `swap_frames[0]` and `swap_frames[1]`
- Z-buffer: `d_pzbuffer` (`zvalue_t = int`), same dimensions as the framebuffer
- Active Edge Table (AET) scanline rasterizer in `sw_edge.c`
- Surface cache system in `sw_surf.c` (texture + lightmap composited once, then reused across frames)
- BSP front-to-back traversal for occlusion via the edge/surface stack
- 17 source files, zero threading, zero SIMD
The per-frame call graph:

```
R_SetupFrame()
-> R_MarkLeaves()           (PVS decompression, leaf/node visibility marking)
-> R_PushDlights()          (mark surfaces affected by dynamic lights)
-> R_EdgeDrawing()
     R_BeginEdgeFrame()
     R_RenderWorld()        (BSP traversal -> R_RenderFace -> Global Edge Table)
     R_DrawBEntitiesOnList()
     R_ScanEdges()          (scanline AET processing -> R_GenerateSpans
                             -> D_DrawSurfaces dispatches to D_SolidSurf,
                                D_TurbulentSurf, D_SkySurf, D_BackgroundSurf)
-> R_DrawEntitiesOnList()   (alias models via z-buffer read)
-> R_DrawParticles()
-> R_DrawAlphaSurfaces()
-> D_WarpScreen()           (underwater warp, if applicable)
-> R_CalcPalette()          (screen tint/blend)
RE_EndFrame()
-> RE_CopyFrame()           (8-bit -> 32-bit palette conversion -> SDL texture)
```
| Table | Type | Size | Purpose |
|---|---|---|---|
| `sdl_palette` | `Uint32[]` | 1 KB | 256-entry palette for the final 8->32 convert |
| `vid_colormap` | `pixel_t[]` | 16 KB | 256 texels x 64 light grades |
| `d_8to24table` | `byte[]` | 1 KB | Indexed color -> 32-bit RGBA |
| `d_16to8table` | `byte[]` | 64 KB | RGB565 -> best-match 8-bit palette index |
The main bottleneck loops:

- `D_DrawSpansPow2()` — `sw_scan.c:539` — World surface texturing. Perspective-correct with 16-pixel subdivision. Inner loop: fixed-point (s,t) stepping + texture gather + byte store.
- `R_PolysetDrawSpans8_Opaque()` — `sw_polyset.c:725` — Alias model (MD2) pixel loop. Z-test + `R_ApplyLight()` + affine texture walk with carry.
- `R_ScanEdges()` scanline loop — `sw_edge.c` — Per-scanline AET maintenance and span generation. Fundamentally sequential.
- `D_DrawZSpans()` — `sw_scan.c:801` — Z-buffer fill. Linear ramp of 32-bit ints.
- `R_DrawSurfaceBlock8_anymip()` / `R_DrawSurfaceBlock_Light()` — `sw_surf.c:78-114` — Surface cache building (texture + lightmap compositing).
- `R_BuildLightMap()` — `sw_light.c:443-481` — Lightmap accumulation and bound/invert/shift post-processing.
- `RE_CopyFrame()` — `sw_main.c:2249` — 8-bit to 32-bit palette conversion for display.
`R_ApplyLight()` is located at `sw_image.c:453-487`. The source code itself carries a `TODO: -22% fps lost` comment. The function performs 7 chained table lookups per pixel:
```c
// 1-3. Three vid_colormap[] lookups (16 KB table), one per RGB channel
i_r = vid_colormap[light_masked[0] + pix];
i_g = vid_colormap[light_masked[1] + pix];
i_b = vid_colormap[light_masked[2] + pix];

// 4-6. Three d_8to24table[] lookups (1 KB table) to get RGB components
b_r = d_8to24table[i_r * 4 + 0];
b_g = d_8to24table[i_g * 4 + 1];
b_b = d_8to24table[i_b * 4 + 2];

// 7. Pack to RGB565, look up d_16to8table[] (64 KB table) for the palette index
i_c = (b_r >> 3) | ((b_g >> 2) << 5) | ((b_b >> 3) << 11);
return d_16to8table[i_c & 0xFFFF];
```

There is a fast path (line 466) when all three light channels are equal, which falls back to a single `vid_colormap[]` lookup. The slow path is the problem.
Called from:
- `R_DrawSurfaceBlock_Light()` in `sw_surf.c:109` (surface cache building)
- `R_PolysetDrawSpans8_Opaque()` in `sw_polyset.c:735` (alias model rendering)
| Capability | x86 SSE2 | x86 SSE4.1 | x86 AVX2 | ARM NEON | POWER9 VMX/VSX |
|---|---|---|---|---|---|
| Vector width | 128-bit | 128-bit | 256-bit | 128-bit | 128-bit |
| Vector registers | 16 XMM | 16 XMM | 16 YMM | 32 Q | 32 VMX (64 VSX) |
| 32-bit int multiply | Workaround | `pmulld` | `vpmulld` | `vmulq_s32` | `vmuluwm` (ISA 3) |
| Byte->int zero-extend | Manual | `pmovzxbd` | `vpmovzxbd` | `vmovl` x2 | `vec_mergeh` + zero |
| Gather (table lookup) | None | None | `vpgatherdd` | None | None |
| Int min/max (signed) | Workaround | `pmaxsd` | `vpmaxsd` | `vmaxq_s32` | `vmaxsw` |
| Multiply-accumulate | `pmaddwd` | `pmaddwd` | `vpmaddwd` | `vmlal` | `vmsumubm`/`uhm` |
| SMT per core | 2 (HT) | 2 (HT) | 2 (HT) | Typically 1 | 4-8 (SMT4/8) |
x86 AVX2 is the clear winner for gather-dependent workloads. vpgatherdd
can load 8 non-contiguous 32-bit values in a single instruction. This directly
accelerates palette lookups, texture fetches, and the R_ApplyLight lookup
chain. The 256-bit width also doubles throughput for pure arithmetic.
POWER9 VMX (ISA 3.0) matches NEON for pure arithmetic but has several distinguishing characteristics:

- `vmsumubm`/`vmsumuhm` (multiply-sum): processes 16 byte MACs into 4 word accumulators in a single instruction. Ideal for the lightmap accumulation pass. Can match or exceed AVX2 throughput when the scale factor fits in a byte.
- `vmuluwm` (32x32->32 multiply): this ISA 3.0 addition is critical. Prior POWER ISA required a 3-4 instruction sequence (`vmulesw`/`vmulosw` + merge) for what is now 1 instruction.
- No gather instruction (not even in POWER10/ISA 3.1). For gather-dependent loops, POWER9 must fall back to scalar extract-load-insert.
- SMT4/SMT8 compensates: 4-8 hardware threads per core can hide table-lookup latency by interleaving scalar work from multiple threads. This makes the multicore threading strategy (deferred below) more impactful on POWER9 than on other architectures.
- 32 VMX registers (vs 16 YMM on x86) reduce register pressure in complex inner loops.
ARM NEON is on par with POWER9 VMX for most operations. No gather
instruction. No unique advantages or disadvantages for this workload. vtbl
can do small table lookups (up to 64 bytes) in-register but the tables here
are too large.
x86 SSE2 is the minimum x86 baseline. Lacks pmulld (32-bit multiply)
and pmovzxbd (byte->int extend), requiring multi-instruction workarounds.
Still worthwhile for the arithmetic-only loops.
For the 8-bit to 32-bit palette conversion:

| Platform | Strategy | Relative Throughput |
|---|---|---|
| AVX2 | 8 bytes -> `vpmovzxbd` -> `vpgatherdd` -> store 8x uint32 | 8-12x |
| SSE2 | 4 bytes, scalar lookup, pack `__m128i` | ~1x |
| NEON | 4 bytes, scalar lookup, pack `uint32x4_t` | ~1x |
| POWER9 VMX | 4 bytes, scalar lookup, pack `vector unsigned int` | ~1x |
AVX2 has a massive advantage. All other platforms must do scalar loads from the 1 KB palette table (which fits entirely in L1 cache, keeping latency low).
For the z-buffer fill, all platforms are equally capable. Pure arithmetic: compute `{base, base+step, base+2*step, ...}`, shift right by 16, store. AVX2 gets 2x from its 256-bit width. POWER9's `vmuluwm` (ISA 3.0) makes the step-vector setup efficient.
For the lightmap scale-and-accumulate:

| Platform | Key Instruction | Elements per Instruction |
|---|---|---|
| AVX2 | `vpmulld` + `vpaddd` | 8 |
| SSE4.1 | `pmulld` + `paddd` | 4 |
| NEON | `vmulq_u32` + `vaddq_u32` | 4 |
| POWER9 VMX | `vmsumubm` (byte MAC) | 16 -> 4 accumulators |
POWER9's vmsumubm is genuinely strong here — when the scale factor fits in a
byte, a single instruction processes 16 byte multiply-accumulates.
The lightmap bound/invert/shift pass is identical across all platforms: max, sub, sra, max. AVX2 gets 2x from width. SSE2 needs a workaround for signed 32-bit max (no `pmaxsd` until SSE4.1).
For span texturing, address computation is fine on all platforms. The bottleneck is the texture byte gather. Same story as the palette lookup: AVX2 wins with `vpgatherdd`; all others fall back to scalar.
`R_ApplyLight` is fundamentally serial per pixel due to its dependency chain. SIMD helps only by processing multiple independent pixels in parallel (cross-pixel vectorization). AVX2's gather enables this; other platforms must use scalar lookups. POWER9's SMT advantage applies here: multiple hardware threads hide lookup latency.
**RE_CopyFrame(): framebuffer palette conversion**

- File: `sw_main.c:2249-2255`
- Inner loop: `*dst = sdl_palette[*src]; src++; dst++;`
- Data types: byte input, uint32 output, 256-entry uint32 palette (1 KB)
- Access pattern: sequential read/write, random palette lookup (L1-resident)
- Inter-pixel dependencies: none (embarrassingly parallel)
- Typical count: `width * height` (e.g. 2M pixels at 1920x1080)
| Platform | Strategy | Speedup |
|---|---|---|
| AVX2 | 8 bytes -> `vpmovzxbd` -> `vpgatherdd` -> store 8x uint32 | 4-6x |
| SSE2 | 4 bytes, scalar lookup, pack into `__m128i`, store | 1.5-2x |
| NEON | 4 bytes, scalar lookup, pack into `uint32x4_t`, store | 1.5-2x |
| POWER9 VMX | 4 bytes, scalar lookup, pack into `vector unsigned int`, store | 1.5-2x |
Effort: ~30-50 lines of intrinsics per platform.
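As a concreteness check, here is a minimal sketch of the AVX2 strategy from the table above. The helper name and the `count % 8 == 0` assumption are illustrative, not the final code:

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: convert one run of 8-bit palette indices to 32-bit
 * pixels, 8 at a time, via vpmovzxbd + vpgatherdd. Assumes count % 8 == 0;
 * real code would add a scalar tail loop. */
static void copy_row_avx2(const uint8_t *src, uint32_t *dst,
                          const uint32_t *palette, int count)
{
	for (int i = 0; i < count; i += 8)
	{
		/* load 8 palette indices, zero-extend bytes to 32-bit lanes */
		__m128i bytes = _mm_loadl_epi64((const __m128i *)(src + i));
		__m256i idx = _mm256_cvtepu8_epi32(bytes);

		/* gather 8 non-contiguous uint32 palette entries (vpgatherdd) */
		__m256i pix = _mm256_i32gather_epi32((const int *)palette, idx, 4);

		_mm256_storeu_si256((__m256i *)(dst + i), pix);
	}
}
```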
**D_DrawZSpans(): z-buffer fill**

- File: `sw_scan.c:801-806`
- Inner loop: `*pdest++ = izi >> 16; izi += izistep;`
- Data types: 32-bit fixed-point input/output
- Access pattern: purely sequential write
- Inter-pixel dependencies: linear accumulation (trivially vectorizable by computing `izi + 0*step, izi + 1*step, ...`)
- Typical count: full span width (hundreds of pixels)
| Platform | Strategy | Speedup |
|---|---|---|
| AVX2 | Init 8-wide ramp, `vpsrld` by 16, `vmovdqu` store | 6-8x |
| SSE2 | Same, 4-wide | 3-4x |
| NEON | `vshrq_n_s32` + `vst1q_s32`, 4-wide | 3-4x |
| POWER9 VMX | `vec_sr` + `vec_st`, 4-wide | 3-4x |
Effort: ~20-30 lines per platform. All platforms equally capable.
Note: the "safe-step" path at sw_scan.c:787-797 fills repeated values, which
is just a memset-style 32-bit fill (_mm_set1_epi32 / vec_splats).
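A minimal SSE2 sketch of the ramp strategy, assuming the span length is a multiple of 4 (the real loop needs a scalar tail; the helper name is illustrative):

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical helper: fill a z-span with (izi + n*izistep) >> 16, 4 at a time. */
static void draw_zspan_sse2(int32_t *pdest, int32_t izi, int32_t izistep,
                            int count)
{
	/* ramp {izi, izi+step, izi+2*step, izi+3*step}; advance 4*step per iteration */
	__m128i v = _mm_setr_epi32(izi, izi + izistep,
	                           izi + 2 * izistep, izi + 3 * izistep);
	__m128i vstep = _mm_set1_epi32(4 * izistep);

	for (int i = 0; i < count; i += 4)
	{
		_mm_storeu_si128((__m128i *)(pdest + i), _mm_srli_epi32(v, 16));
		v = _mm_add_epi32(v, vstep);
	}
}
```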
**R_BuildLightMap(): bound/invert/shift pass**

- File: `sw_light.c:465-481`
- Inner loop: `t = max(0, t); t = (255*256 - t) >> 2; t = max(64, t);`
- Data types: 32-bit int array (`light_t`)
- Access pattern: sequential read/write
- Inter-pixel dependencies: none
- Typical count: `smax * tmax * 3` (48 to 3072)
| Platform | Strategy | Speedup |
|---|---|---|
| AVX2 | `vpmaxsd` / `vpsubd` / `vpsrad` / `vpmaxsd`, 8-wide | 6-8x |
| SSE4.1 | `pmaxsd` / `psubd` / `psrad` / `pmaxsd`, 4-wide | 3-4x |
| NEON | `vmaxq_s32` / `vsubq_s32` / `vshrq` / `vmaxq_s32`, 4-wide | 3-4x |
| POWER9 VMX | `vec_max` / `vec_sub` / `vec_sra` / `vec_max`, 4-wide | 3-4x |
Effort: ~15-20 lines per platform. Note: SSE2 lacks pmaxsd (signed 32-bit
max); requires a compare+blend workaround or targeting SSE4.1 minimum.
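A minimal SSE4.1 sketch of this pass. The helper name is illustrative; the real loop operates in place on the renderer's light buffer and assumes `count % 4 == 0` here:

```c
#include <smmintrin.h>
#include <stdint.h>

/* Hypothetical helper: t = max(0,t); t = (255*256 - t) >> 2; t = max(64,t);
 * applied 4 elements at a time. */
static void bound_light_sse41(int32_t *light, int count)
{
	const __m128i zero = _mm_setzero_si128();
	const __m128i bias = _mm_set1_epi32(255 * 256);
	const __m128i floor_ = _mm_set1_epi32(64);

	for (int i = 0; i < count; i += 4)
	{
		__m128i t = _mm_loadu_si128((__m128i *)(light + i));
		t = _mm_max_epi32(t, zero);                   /* pmaxsd (SSE4.1) */
		t = _mm_srai_epi32(_mm_sub_epi32(bias, t), 2); /* (255*256 - t) >> 2 */
		t = _mm_max_epi32(t, floor_);
		_mm_storeu_si128((__m128i *)(light + i), t);
	}
}
```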
**R_BuildLightMap(): accumulation pass**

- File: `sw_light.c:443-449`
- Inner loop: `*curr_light += *lightmap * scale;`
- Data types: byte input (`lightmap`), uint32 accumulator, uint32 scalar
- Access pattern: sequential
- Inter-element dependencies: none
- Typical count: same as the bound/invert pass
| Platform | Strategy | Speedup |
|---|---|---|
| AVX2 | `vpmovzxbd` + `vpmulld` + `vpaddd`, 8-wide | 4-6x |
| SSE4.1 | `pmovzxbd` + `pmulld` + `paddd`, 4-wide | 2-3x |
| NEON | `vmovl` chain + `vmulq_u32` + `vaddq_u32`, 4-wide | 2-3x |
| POWER9 VMX | `vmsumubm` if scale fits in a byte: 16 MACs -> 4 accumulators in 1 insn; otherwise `vec_mul` + `vec_add` | 2-4x |
POWER9 note: vmsumubm is uniquely powerful here. A single VMX instruction
processes 16 byte multiply-accumulates into 4 word accumulators. When the
scale factor fits in a byte (common — it's r_modulate scaled by light style
intensity), this can match AVX2 throughput from a 128-bit instruction.
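A sketch of the primitive via the `vec_msum` intrinsic, which compiles to `vmsumubm` for unsigned-char inputs. One caveat: the instruction's native shape is a partial dot product (each of the 4 word lanes accumulates the sum of 4 adjacent byte products), so mapping it onto the per-element `*curr_light += *lightmap * scale` loop requires reorganizing the data layout or accepting grouped accumulation. This only demonstrates the raw instruction; all names are hypothetical:

```c
#include <altivec.h>

/* 16 byte multiply-accumulates in one instruction: each word lane of 'acc'
 * receives the sum of 4 adjacent (lightmap[i] * scale) byte products. */
static vector unsigned int
msum_block(const unsigned char *lightmap, unsigned char scale,
           vector unsigned int acc)
{
	vector unsigned char texels = vec_xl(0, lightmap); /* 16 lightmap bytes */
	vector unsigned char vscale = vec_splats(scale);   /* scale in all 16 lanes */
	return vec_msum(texels, vscale, acc);              /* vmsumubm */
}
```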
**R_DrawSurfaceBlock fast path: surface cache compositing**

- File: `sw_surf.c:78-83`
- Inner loop: `*dest = vid_colormap[*src + light_masked_right];`
- Data types: byte source + constant offset -> 16 KB table lookup -> byte dest
- Access pattern: sequential source/dest, random table lookup (L1-resident)
- Inter-pixel dependencies: none
- Typical count: 16 pixels per block (at mip 0)
| Platform | Strategy | Speedup |
|---|---|---|
| AVX2 | Byte add, `vpmovzxbd`, `vpgatherdd` from `vid_colormap` | 2-3x |
| SSE2/NEON/VMX | Batch byte load/store, scalar lookups | 1.3-1.5x |
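A hedged AVX2 sketch of the lookup `vid_colormap[*src + light]` for 8 pixels. Names are illustrative; since the gather loads dwords at byte offsets, the 16 KB table needs a few bytes of readable tail padding:

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: 8 lit texels through the colormap at once. */
static void lit_block_8_avx2(const uint8_t *src, uint8_t *dest,
                             const uint8_t *vid_colormap, int32_t light_offset)
{
	__m256i idx = _mm256_add_epi32(
		_mm256_cvtepu8_epi32(_mm_loadl_epi64((const __m128i *)src)),
		_mm256_set1_epi32(light_offset));

	/* gather 8 dwords at byte offsets, keep the low byte of each */
	__m256i v = _mm256_and_si256(
		_mm256_i32gather_epi32((const int *)vid_colormap, idx, 1),
		_mm256_set1_epi32(0xFF));

	int32_t tmp[8];
	_mm256_storeu_si256((__m256i *)tmp, v);
	for (int i = 0; i < 8; i++) { dest[i] = (uint8_t)tmp[i]; }
}
```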
**D_DrawSpansPow2(): world surface texturing**

- File: `sw_scan.c:539-544`
- Inner loop: `*pdest++ = *(pbase + (s >> 16) + (t >> 16) * cachewidth); s += sstep; t += tstep;`
- Data types: 32-bit fixed-point coords, byte texture, byte dest
- Access pattern: sequential dest, random texture access (surface cache)
- Inter-pixel dependencies: linear (s,t) accumulation (parallelizable)
- Typical count: 16 pixels per sub-span (`SPANSTEP_SHIFT = 4`)
- Variants: horizontal-only (line 518), vertical-only (line 528), diagonal
| Platform | Strategy | Speedup |
|---|---|---|
| AVX2 | 8-wide address compute (`vpsrld`+`vpmulld`+`vpaddd`), gather | 2-3x |
| SSE4.1 | 4-wide address compute, scalar texture fetch | 1.3-1.5x |
| NEON | Same as SSE4.1 | 1.3-1.5x |
| POWER9 VMX | 4-wide address compute with `vmuluwm`, scalar fetch | 1.3-1.5x |
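A hedged AVX2 sketch of the 8-wide address computation plus texel gather. Names are illustrative; the gather loads a dword per texel, so it over-reads up to 3 bytes past each texel and the surface cache allocation must tolerate that:

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: compute (s>>16) + (t>>16)*cachewidth for 8 pixels,
 * gather one dword per texel, keep the low byte of each. */
static void fetch_texels_avx2(const uint8_t *pbase, int32_t cachewidth,
                              int32_t s, int32_t t,
                              int32_t sstep, int32_t tstep, uint8_t out[8])
{
	const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);

	__m256i vs = _mm256_add_epi32(_mm256_set1_epi32(s),
		_mm256_mullo_epi32(lane, _mm256_set1_epi32(sstep)));
	__m256i vt = _mm256_add_epi32(_mm256_set1_epi32(t),
		_mm256_mullo_epi32(lane, _mm256_set1_epi32(tstep)));

	/* offset = (s >> 16) + (t >> 16) * cachewidth (coords are non-negative) */
	__m256i off = _mm256_add_epi32(_mm256_srli_epi32(vs, 16),
		_mm256_mullo_epi32(_mm256_srli_epi32(vt, 16),
		                   _mm256_set1_epi32(cachewidth)));

	__m256i texels = _mm256_and_si256(
		_mm256_i32gather_epi32((const int *)pbase, off, 1),
		_mm256_set1_epi32(0xFF));

	int32_t tmp[8];
	_mm256_storeu_si256((__m256i *)tmp, texels);
	for (int i = 0; i < 8; i++) { out[i] = (uint8_t)tmp[i]; }
}
```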
**R_PolysetDrawSpans8_Opaque(): alias model spans**

- File: `sw_polyset.c:725-754`
- Per-pixel z-test (branch), `R_ApplyLight()` (7+ table lookups), carry-based texture walk (`ltfrac & 0x10000`). Small spans (5-50 px). SIMD-hostile.
- Benefit: ~1.2x — the `R_ApplyLight` rework (Phase 4) is the real fix.
**Turbulent surface spans (`D_TurbulentSurf` path)**

- File: `sw_scan.c:169-178`
- 3 data-dependent table lookups per pixel (2x `turb[]`, 1x texture). Cross-coordinate turbulence dependency. 16 pixels per sub-span.
- Benefit: ~1.5x with AVX2, marginal on other platforms.
**D_WarpScreen(): underwater warp**

- File: `sw_scan.c:86-89`
- Double pointer indirection (`row[turb[u]][col[u]]`). Fundamentally serial memory access pattern. No SIMD instruction can do pointer-chasing gathers.
- Benefit: negligible.
R_ApplyLight() (sw_image.c:453-487) is a systemic bottleneck acknowledged
in the source with a TODO: -22% fps lost comment. It performs 7 chained
table lookups per pixel and is called from two performance-critical paths:
- Surface cache building (`sw_surf.c:109`)
- Alias model rendering (`sw_polyset.c:735`)
**Option A: Precomputed combined table**
Direct `(light_r, light_g, light_b, texel) -> pixel` table. Size: 64^3 x 256 = ~67M entries. Rejected: too large.
**Option B: Intermediate RGB table**
Precompute `rgb_from_lit_texel[64][256]` per channel (16 KB each, 48 KB total). Reduces 7 lookups to 4. Marginal improvement; not worth the complexity.
**Option C: 16-bit RGB internal framebuffer**
Switch from 8-bit paletted to RGB565 internally. Eliminates `R_ApplyLight` entirely — lighting becomes a direct RGB multiply. Large architectural change. Trade-offs: doubles framebuffer/cache size, changes the visual aesthetic, requires texture format conversion. Best long-term option but highest risk.
**Option D: Batch R_ApplyLight across pixels (recommended)**
Restructure callers to process 4-8 pixels at a time. Each pixel's lookup chain is independent, enabling cross-pixel vectorization:

- With AVX2: 3 `vpgatherdd` for `vid_colormap[]`, 3 for `d_8to24table[]`, 1 for `d_16to8table[]` = 7 gathers for 8 pixels (vs 7 scalar lookups per pixel currently)
- With SSE2/NEON/POWER9: extract-load-insert pattern, but it still amortizes loop overhead and enables SIMD for the arithmetic portions (shifts, masks, OR-packing for RGB565)
| Platform | Strategy | Speedup |
|---|---|---|
| AVX2 | 7 gathers for 8 pixels + SIMD packing | ~4x |
| SSE2 | Scalar lookups + SIMD packing | ~1.5x |
| NEON | Scalar lookups + SIMD packing | ~1.5x |
| POWER9 VMX | Scalar lookups + SIMD packing | ~1.5x |
Recommendation: Implement Option D. It works within the existing 8-bit architecture, requires no table precomputation, and gives the biggest win on AVX2 while still providing modest gains elsewhere.
Callers to modify:
- `R_DrawSurfaceBlock_Light()` in `sw_surf.c` — process block rows in batches of 4-8 pixels
- `R_PolysetDrawSpans8_Opaque()` in `sw_polyset.c` — process span pixels in batches (more complex due to z-test branching and texture carry)
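A hedged AVX2 sketch of the batched lookup chain for 8 pixels. The function and parameter names are illustrative; the gathers read dwords at byte offsets, so `vid_colormap`, `d_8to24table`, and `d_16to8table` each need a few bytes of readable tail padding, and the real code must match `R_ApplyLight`'s exact indexing:

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical 8-pixel batch of R_ApplyLight's slow path. 'pix' holds 8 texel
 * indices; light_r/g/b are the per-channel offsets, masked as in the scalar code. */
static void apply_light_8_avx2(const uint8_t pix[8],
                               int32_t light_r, int32_t light_g, int32_t light_b,
                               const uint8_t *vid_colormap,
                               const uint8_t *d_8to24table,
                               const uint8_t *d_16to8table,
                               uint8_t out[8])
{
	const __m256i byte_mask = _mm256_set1_epi32(0xFF);
	__m256i vpix = _mm256_cvtepu8_epi32(_mm_loadl_epi64((const __m128i *)pix));

	/* 1-3: vid_colormap[light_c + pix], one gather per channel */
	__m256i i_r = _mm256_and_si256(_mm256_i32gather_epi32((const int *)vid_colormap,
		_mm256_add_epi32(vpix, _mm256_set1_epi32(light_r)), 1), byte_mask);
	__m256i i_g = _mm256_and_si256(_mm256_i32gather_epi32((const int *)vid_colormap,
		_mm256_add_epi32(vpix, _mm256_set1_epi32(light_g)), 1), byte_mask);
	__m256i i_b = _mm256_and_si256(_mm256_i32gather_epi32((const int *)vid_colormap,
		_mm256_add_epi32(vpix, _mm256_set1_epi32(light_b)), 1), byte_mask);

	/* 4-6: d_8to24table[i_c * 4 + channel] */
	__m256i b_r = _mm256_and_si256(_mm256_i32gather_epi32(
		(const int *)(d_8to24table + 0), _mm256_slli_epi32(i_r, 2), 1), byte_mask);
	__m256i b_g = _mm256_and_si256(_mm256_i32gather_epi32(
		(const int *)(d_8to24table + 1), _mm256_slli_epi32(i_g, 2), 1), byte_mask);
	__m256i b_b = _mm256_and_si256(_mm256_i32gather_epi32(
		(const int *)(d_8to24table + 2), _mm256_slli_epi32(i_b, 2), 1), byte_mask);

	/* 7: pack to RGB565, final d_16to8table gather */
	__m256i rgb565 = _mm256_or_si256(_mm256_srli_epi32(b_r, 3),
		_mm256_or_si256(_mm256_slli_epi32(_mm256_srli_epi32(b_g, 2), 5),
		                _mm256_slli_epi32(_mm256_srli_epi32(b_b, 3), 11)));
	__m256i c = _mm256_and_si256(_mm256_i32gather_epi32(
		(const int *)d_16to8table,
		_mm256_and_si256(rgb565, _mm256_set1_epi32(0xFFFF)), 1), byte_mask);

	int32_t tmp[8];
	_mm256_storeu_si256((__m256i *)tmp, c);
	for (int i = 0; i < 8; i++) { out[i] = (uint8_t)tmp[i]; }
}
```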
1a. SIMD detection header (header/simd.h)
Compile-time detection macros:
```c
/* x86 */
#if defined(__AVX2__)
#define YQ2_SIMD_AVX2 1
#endif
#if defined(__SSE4_1__)
#define YQ2_SIMD_SSE41 1
#endif
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#define YQ2_SIMD_SSE2 1
#endif

/* ARM */
#if defined(__ARM_NEON) || defined(__aarch64__)
#define YQ2_SIMD_NEON 1
#endif

/* POWER */
#if defined(__POWER9_VECTOR__) || \
    (defined(__ALTIVEC__) && defined(_ARCH_PWR9))
#define YQ2_SIMD_VMX_P9 1
#elif defined(__ALTIVEC__)
#define YQ2_SIMD_VMX 1
#endif
```

Runtime dispatch goes via function pointers, initialized once at renderer startup. On x86, use `SDL_HasAVX2()` / `SDL_HasSSE41()` for safe runtime detection. On ARM and POWER, compile-time detection is sufficient (the features are guaranteed by the target ABI/CPU).
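A minimal sketch of the dispatch pattern. The function-pointer and implementation names (`copy_frame`, `RE_CopyFrame_*`) are hypothetical; `SDL_HasAVX2()` is a real SDL2 call:

```c
#include <SDL.h>   /* SDL_HasAVX2(); include path per the build's SDL cflags */
#include <stdint.h>

/* Hypothetical dispatch entry for the framebuffer copy. */
typedef void (*copy_frame_fn)(const uint8_t *src, uint32_t *dst, int count);

extern void RE_CopyFrame_scalar(const uint8_t *, uint32_t *, int);
extern void RE_CopyFrame_sse2(const uint8_t *, uint32_t *, int);
extern void RE_CopyFrame_avx2(const uint8_t *, uint32_t *, int);

static copy_frame_fn copy_frame = RE_CopyFrame_scalar;

/* Called once at renderer startup; picks the best available implementation. */
void SIMD_Init(void)
{
#if defined(__x86_64__) || defined(_M_X64)
	/* SSE2 is part of the x86_64 baseline ABI */
	copy_frame = RE_CopyFrame_sse2;
	if (SDL_HasAVX2())
	{
		copy_frame = RE_CopyFrame_avx2; /* compiled in sw_simd_avx2.c */
	}
#endif
}
```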
1b. Build system changes (see Build System Changes)
Phase 2: implement SIMD for the four highest-impact loops:
| Target | ~Lines per platform |
|---|---|
| `RE_CopyFrame` | 30-50 |
| `D_DrawZSpans` | 20-30 |
| `R_BuildLightMap` (bound) | 15-20 |
| `R_BuildLightMap` (accumulate) | 25-35 |
Estimated total: ~400-500 lines across all platforms.
Phase 3 targets the medium-complexity loops:

| Target | ~Lines per platform |
|---|---|
| `R_DrawSurfaceBlock` fast path | 30-40 |
| `D_DrawSpansPow2` | 50-80 |
Estimated total: ~400-600 lines across all platforms. The span drawer is more complex due to three code paths (horizontal/vertical/diagonal).
Phase 4: restructure `R_DrawSurfaceBlock_Light()` and `R_PolysetDrawSpans8_Opaque()` to batch-process pixels (Option D above), then apply SIMD to the batched lookups.
Estimated total: ~300-500 lines (most complexity is in restructuring the scalar code, not the intrinsics).
```
src/client/refresh/soft/
├── header/
│   ├── local.h        (existing — add SIMD dispatch function pointer decls)
│   ├── model.h        (existing — unchanged)
│   └── simd.h         (NEW — detection macros, dispatch init prototype)
├── sw_main.c          (modify — SIMD dispatch in RE_CopyFrame, init call)
├── sw_scan.c          (modify — SIMD dispatch in D_DrawZSpans, D_DrawSpansPow2)
├── sw_surf.c          (modify — SIMD dispatch in R_DrawSurfaceBlock)
├── sw_light.c         (modify — SIMD dispatch in R_BuildLightMap)
├── sw_image.c         (modify — batch R_ApplyLight rework)
├── sw_polyset.c       (modify — batch pixel processing for alias models)
├── sw_simd_sse2.c     (NEW — SSE2 implementations)
├── sw_simd_sse41.c    (NEW — SSE4.1, compiled with -msse4.1)
├── sw_simd_avx2.c     (NEW — AVX2, compiled with -mavx2)
├── sw_simd_neon.c     (NEW — NEON, AArch64 only)
└── sw_simd_vmx.c      (NEW — VMX/VSX, compiled with -mcpu=power9 -mvsx)
```
Each sw_simd_*.c file implements the same set of functions with
platform-specific intrinsics. The scalar fallback remains in the original
source files. At init time, function pointers are set to the best available
implementation.
Architecture-specific SIMD files need per-file compiler flags since the base
CFLAGS do not enable AVX2/SSE4.1/POWER9:
```make
# SSE4.1 (x86/x86_64 only)
build/ref_soft/sw_simd_sse41.o: src/client/refresh/soft/sw_simd_sse41.c
	$(CC) -c $(CFLAGS) -msse4.1 $(SDLCFLAGS) $(INCLUDE) -o $@ $<

# AVX2 (x86/x86_64 only)
build/ref_soft/sw_simd_avx2.o: src/client/refresh/soft/sw_simd_avx2.c
	$(CC) -c $(CFLAGS) -mavx2 $(SDLCFLAGS) $(INCLUDE) -o $@ $<

# POWER9 VMX/VSX (ppc64le only)
build/ref_soft/sw_simd_vmx.o: src/client/refresh/soft/sw_simd_vmx.c
	$(CC) -c $(CFLAGS) -mcpu=power9 -mvsx $(SDLCFLAGS) $(INCLUDE) -o $@ $<
```

NEON on AArch64 and SSE2 on x86_64 require no special flags; both are part of their baseline ABIs.
Conditional inclusion in REFSOFT_OBJS_ based on YQ2_ARCH:
```make
ifeq ($(YQ2_ARCH),x86_64)
REFSOFT_OBJS_ += sw_simd_sse2.o sw_simd_sse41.o sw_simd_avx2.o
else ifeq ($(YQ2_ARCH),i386)
REFSOFT_OBJS_ += sw_simd_sse2.o
else ifeq ($(YQ2_ARCH),aarch64)
REFSOFT_OBJS_ += sw_simd_neon.o
else ifneq (,$(findstring powerpc,$(YQ2_ARCH)))
REFSOFT_OBJS_ += sw_simd_vmx.o
endif
```

Similar conditional logic applies in CMake via `CMAKE_SYSTEM_PROCESSOR`. Note: CMakeLists.txt is marked as unmaintained in the project; the Makefile is the primary build system.
| Optimization | x86 SSE2 | x86 AVX2 | ARM NEON | POWER9 VMX | Effort |
|---|---|---|---|---|---|
| RE_CopyFrame | 1.5x | 4-6x | 1.5x | 1.5x | Low |
| D_DrawZSpans | 3-4x | 6-8x | 3-4x | 3-4x | Low |
| R_BuildLightMap (bound) | 3-4x | 6-8x | 3-4x | 3-4x | Low |
| R_BuildLightMap (accumulate) | 2-3x | 4-6x | 2-3x | 2-4x* | Low |
| R_DrawSurfaceBlock fast | 1.3x | 2-3x | 1.3x | 1.3x | Medium |
| D_DrawSpansPow2 | 1.3x | 2-3x | 1.3x | 1.3x | Medium |
| R_ApplyLight batch (Option D) | 1.5x | ~4x | 1.5x | 1.5x | High |
*POWER9's vmsumubm can match AVX2 throughput when scale fits in a byte.
These are per-loop speedups. The overall frame time improvement depends on what fraction of time is spent in each loop, which varies by scene complexity, resolution, and whether the surface cache is warm.
A full multicore analysis was performed as a precursor to this SIMD plan. Threading is deferred because it requires significant architectural refactoring (encapsulating ~50+ global variables into a render context struct), whereas SIMD can be applied incrementally to existing code.
The renderer uses massive shared mutable global state:
- View vectors (`vpn`, `vup`, `vright`) are mutated mid-frame during brush model rendering and restored afterward
- Edge/surface allocators (`edge_p++`, `surface_p++`, `span_p`) are unsynchronized bump allocators
- Texture state (`cacheblock`, `cachewidth`, `d_sdivzstepu`, etc.) is global, set before each surface draw
- Lighting buffer (`blocklights[]`) is a single shared accumulation buffer
- The scanline loop in `R_ScanEdges` flushes spans mid-loop when the span buffer fills, coupling span generation with span drawing
- Encapsulate state into `render_context_t` — prerequisite for everything
- Horizontal band parallelism — divide the screen into bands, run the full pipeline per band with independent contexts. Architecturally cleanest but requires near-complete state isolation.
- Parallel entity rendering — alias models are independent. Per-thread vertex/span buffers, shared z-buffer with atomic or per-band partitioning.
- Parallel surface cache building — each surface is independent. Per-thread `blocklights[]`, thread-safe cache allocator.
- Parallel `RE_CopyFrame` — trivial band decomposition.
POWER9's SMT4/SMT8 makes threading particularly attractive on this platform.
For gather-heavy operations where SIMD provides little benefit (palette lookup,
R_ApplyLight), running 4-8 scalar threads on the same core naturally hides
memory latency. A modest 2-4 thread split of entity rendering or framebuffer
conversion would already utilize the POWER9 hardware better than single-threaded
scalar code, without requiring the full state-encapsulation refactoring needed
for band-based parallelism.