
SIMD Optimization Plan for the Software Renderer

This document covers the feasibility analysis and implementation plan for adding SIMD acceleration to the Yamagi Quake II software renderer (src/client/refresh/soft/).

Table of Contents

  1. Architecture Overview
  2. Current Bottlenecks
  3. Platform SIMD Comparison
  4. Per-Loop SIMD Analysis
  5. R_ApplyLight Rework
  6. Implementation Phases
  7. File Organization
  8. Build System Changes
  9. Estimated Impact Summary
  10. Multicore Feasibility (Deferred)

Architecture Overview

The software renderer is a classic Quake-era scanline renderer:

  • 8-bit paletted internal framebuffer (pixel_t = unsigned char)
  • Double-buffered: swap_frames[0] and swap_frames[1]
  • Z-buffer: d_pzbuffer (zvalue_t = int), same dimensions as framebuffer
  • Active Edge Table (AET) scanline rasterizer in sw_edge.c
  • Surface cache system in sw_surf.c (texture + lightmap composited once, then reused across frames)
  • BSP front-to-back traversal for occlusion via edge/surface stack
  • 17 source files, zero threading, zero SIMD

Rendering Pipeline (per frame, in RE_RenderFrame)

R_SetupFrame()
  -> R_MarkLeaves()         (PVS decompression, leaf/node visibility marking)
  -> R_PushDlights()        (mark surfaces affected by dynamic lights)
  -> R_EdgeDrawing()
       R_BeginEdgeFrame()
       R_RenderWorld()       (BSP traversal -> R_RenderFace -> Global Edge Table)
       R_DrawBEntitiesOnList()
       R_ScanEdges()         (scanline AET processing -> R_GenerateSpans
                              -> D_DrawSurfaces dispatches to D_SolidSurf,
                                 D_TurbulentSurf, D_SkySurf, D_BackgroundSurf)
  -> R_DrawEntitiesOnList()  (alias models via z-buffer read)
  -> R_DrawParticles()
  -> R_DrawAlphaSurfaces()
  -> D_WarpScreen()          (underwater warp, if applicable)
  -> R_CalcPalette()         (screen tint/blend)
RE_EndFrame()
  -> RE_CopyFrame()          (8-bit -> 32-bit palette conversion -> SDL texture)

Lookup Tables Used by the Renderer

Table          Type        Size    Purpose
sdl_palette    Uint32[]    1 KB    256-entry palette for the final 8->32 convert
vid_colormap   pixel_t[]   16 KB   256 texels x 64 light grades
d_8to24table   byte[]      1 KB    Indexed color -> 32-bit RGBA
d_16to8table   byte[]      64 KB   RGB565 -> best-match 8-bit palette index

Current Bottlenecks

Hot Inner Loops (ranked by impact)

  1. D_DrawSpansPow2() (sw_scan.c:539). World surface texturing. Perspective-correct with 16-pixel subdivision. Inner loop: fixed-point (s,t) stepping + texture gather + byte store.

  2. R_PolysetDrawSpans8_Opaque() (sw_polyset.c:725). Alias model (MD2) pixel loop. Z-test + R_ApplyLight() + affine texture walk with carry.

  3. R_ScanEdges() scanline loop (sw_edge.c). Per-scanline AET maintenance and span generation. Fundamentally sequential.

  4. D_DrawZSpans() (sw_scan.c:801). Z-buffer fill. Linear ramp of 32-bit ints.

  5. R_DrawSurfaceBlock8_anymip() / R_DrawSurfaceBlock_Light() (sw_surf.c:78-114). Surface cache building (texture + lightmap compositing).

  6. R_BuildLightMap() (sw_light.c:443-481). Lightmap accumulation and bound/invert/shift post-processing.

  7. RE_CopyFrame() (sw_main.c:2249). 8-bit to 32-bit palette conversion for display.

Systemic Bottleneck: R_ApplyLight()

Located at sw_image.c:453-487. The source code itself has a TODO: -22% fps lost comment. This function performs 7 chained table lookups per pixel:

// 1-3. Three vid_colormap[] lookups (16KB table), one per RGB channel
i_r = vid_colormap[light_masked[0] + pix];
i_g = vid_colormap[light_masked[1] + pix];
i_b = vid_colormap[light_masked[2] + pix];

// 4-6. Three d_8to24table[] lookups (1KB table) to get RGB components
b_r = d_8to24table[i_r * 4 + 0];
b_g = d_8to24table[i_g * 4 + 1];
b_b = d_8to24table[i_b * 4 + 2];

// 7. Pack to RGB565, lookup d_16to8table[] (64KB table) to get palette index
i_c = (b_r >> 3) | ((b_g >> 2) << 5) | ((b_b >> 3) << 11);
return d_16to8table[i_c & 0xFFFF];

There is a fast path (line 466) when all three light channels are equal, which reduces to a single vid_colormap[] lookup. The slow path is the problem.

Called from:

  • R_DrawSurfaceBlock_Light() in sw_surf.c:109 (surface cache building)
  • R_PolysetDrawSpans8_Opaque() in sw_polyset.c:735 (alias model rendering)

Platform SIMD Comparison

Instruction Set Overview

Capability              x86 SSE2     x86 SSE4.1   x86 AVX2     ARM NEON      POWER9 VMX/VSX
Vector width            128-bit      128-bit      256-bit      128-bit       128-bit
Vector registers        16 XMM       16 XMM       16 YMM       32 Q          32 VMX (64 VSX)
32-bit int multiply     Workaround   pmulld       vpmulld      vmulq_s32     vmuluwm (ISA 2.07)
Byte->int zero-extend   Manual       pmovzxbd     vpmovzxbd    vmovl x2      vec_mergeh+zero
Gather (table lookup)   None         None         vpgatherdd   None          None
Int min/max (signed)    Workaround   pmaxsd       vpmaxsd      vmaxq_s32     vmaxsw
Multiply-accumulate     pmaddwd      pmaddwd      vpmaddwd     vmlal         vmsumubm/uhm
SMT per core            2 (HT)       2 (HT)       2 (HT)       Typically 1   4-8 (SMT4/8)

Key Per-Platform Notes

x86 AVX2 is the clear winner for gather-dependent workloads. vpgatherdd can load 8 non-contiguous 32-bit values in a single instruction. This directly accelerates palette lookups, texture fetches, and the R_ApplyLight lookup chain. The 256-bit width also doubles throughput for pure arithmetic.

POWER9 VMX (ISA 3.0) matches NEON for pure arithmetic but has several distinguishing characteristics:

  • vmsumubm / vmsumuhm (multiply-sum): processes 16 byte MACs into 4 word accumulators in a single instruction. Ideal for the lightmap accumulation pass. Can match or exceed AVX2 throughput when the scale factor fits in a byte.
  • vmuluwm (32x32->32 multiply): added in ISA 2.07 (POWER8) and critical here. Earlier POWER ISAs required a 3-4 instruction sequence (vmulesw/vmulosw + merge) for what is now 1 instruction.
  • No gather instruction (not even in POWER10/ISA 3.1). For gather-dependent loops, POWER9 must fall back to scalar extract-load-insert.
  • SMT4/SMT8 compensates: 4-8 hardware threads per core can hide table lookup latency by interleaving scalar work from multiple threads. This makes the multicore threading strategy (deferred below) more impactful on POWER9 than on other architectures.
  • 32 VMX registers (vs 16 YMM on x86) reduce register pressure in complex inner loops.

ARM NEON is on par with POWER9 VMX for most operations. No gather instruction. No unique advantages or disadvantages for this workload. vtbl can do small table lookups (up to 64 bytes) in-register but the tables here are too large.

x86 SSE2 is the minimum x86 baseline. Lacks pmulld (32-bit multiply) and pmovzxbd (byte->int extend), requiring multi-instruction workarounds. Still worthwhile for the arithmetic-only loops.

Per-Operation Platform Comparison

1. 256-Entry Palette Lookup (8-bit -> 32-bit)

Platform     Strategy                                                Relative throughput
AVX2         8 bytes -> vpmovzxbd -> vpgatherdd -> store 8x uint32   8-12x
SSE2         4 bytes, scalar lookup, pack __m128i                    ~1x
NEON         4 bytes, scalar lookup, pack uint32x4_t                 ~1x
POWER9 VMX   4 bytes, scalar lookup, pack vector unsigned int        ~1x

AVX2 has a massive advantage. All other platforms must do scalar loads from the 1 KB palette table (which fits entirely in L1 cache, keeping latency low).

2. Linear Ramp Fill (z-buffer)

All platforms are equally capable. Pure arithmetic: {base, base+step, base+2*step, ...}, shift right by 16, store. AVX2 gets 2x from 256-bit width. POWER9's vmuluwm (ISA 2.07) makes the setup efficient.

3. Byte->Int Multiply-Accumulate (lightmap)

Platform     Key instructions        Elements per instruction
AVX2         vpmulld + vpaddd        8
SSE4.1       pmulld + paddd          4
NEON         vmulq_u32 + vaddq_u32   4
POWER9 VMX   vmsumubm (byte MAC)     16 -> 4 accumulators

POWER9's vmsumubm is genuinely strong here — when the scale factor fits in a byte, a single instruction processes 16 byte multiply-accumulates.

4. Clamp/Subtract/Shift Chain (lightmap post-process)

Identical across all platforms: max, sub, sra, max. AVX2 gets 2x from width. SSE2 needs a workaround for signed 32-bit max (no pmaxsd until SSE4.1).

5. Texture Coordinate Stepping + Gather

Address computation is fine on all platforms. The bottleneck is the texture byte gather. Same story as palette lookup: AVX2 wins with vpgatherdd, all others fall back to scalar.

6. R_ApplyLight Multi-Table Lookup Chain

Fundamentally serial per pixel due to dependency chains. SIMD helps only by processing multiple independent pixels in parallel (cross-pixel vectorization). AVX2's gather enables this; other platforms must use scalar lookups. POWER9's SMT advantage applies here — multiple hardware threads hide lookup latency.


Per-Loop SIMD Analysis

Tier 1: High Benefit, Low Effort

RE_CopyFrame() — Palette Conversion

  • File: sw_main.c:2249-2255
  • Inner loop: *dst = sdl_palette[*src]; src++; dst++;
  • Data types: byte input, uint32 output, 256-entry uint32 palette (1 KB)
  • Access pattern: Sequential read/write, random palette lookup (L1-resident)
  • Inter-pixel dependencies: None (embarrassingly parallel)
  • Typical count: width * height (e.g. 2M pixels at 1920x1080)

Platform     Strategy                                                Speedup
AVX2         8 bytes -> vpmovzxbd -> vpgatherdd -> store 8x uint32   4-6x
SSE2         4 bytes, scalar lookup, pack into __m128i, store        1.5-2x
NEON         4 bytes, scalar lookup, pack into uint32x4_t, store     1.5-2x
POWER9 VMX   4 bytes, scalar lookup, pack into vector unsigned int   1.5-2x

Effort: ~30-50 lines of intrinsics per platform.
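
A minimal sketch of the AVX2 path, assuming a plain linear destination buffer; the function and parameter names are illustrative, not the renderer's:

#include <immintrin.h>
#include <stdint.h>

static void
CopyFrame_AVX2(const uint8_t *src, uint32_t *dst, int count,
               const uint32_t *palette /* sdl_palette, 256 entries */)
{
    int i = 0;

    for (; i + 8 <= count; i += 8)
    {
        /* Load 8 palette indices and zero-extend each byte to a 32-bit lane. */
        __m128i idx8 = _mm_loadl_epi64((const __m128i *)(src + i));
        __m256i idx = _mm256_cvtepu8_epi32(idx8);

        /* One vpgatherdd fetches all 8 palette entries. */
        __m256i px = _mm256_i32gather_epi32((const int *)palette, idx, 4);
        _mm256_storeu_si256((__m256i *)(dst + i), px);
    }

    /* Scalar tail; the 1 KB palette is L1-resident, so this stays cheap. */
    for (; i < count; i++)
    {
        dst[i] = palette[src[i]];
    }
}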

D_DrawZSpans() — Z-Buffer Fill

  • File: sw_scan.c:801-806
  • Inner loop: *pdest++ = izi >> 16; izi += izistep;
  • Data types: 32-bit fixed-point input/output
  • Access pattern: Purely sequential write
  • Inter-pixel dependencies: Linear accumulation (trivially vectorizable by computing izi + 0*step, izi + 1*step, ...)
  • Typical count: Full span width (hundreds of pixels)

Platform     Strategy                                        Speedup
AVX2         Init 8-wide ramp, vpsrld by 16, vmovdqu store   6-8x
SSE2         Same, 4-wide                                    3-4x
NEON         vshrq_n_s32 + vst1q_s32, 4-wide                 3-4x
POWER9 VMX   vec_sr + vec_st, 4-wide                         3-4x

Effort: ~20-30 lines per platform. All platforms equally capable.

Note: the "safe-step" path at sw_scan.c:787-797 fills repeated values, which is just a memset-style 32-bit fill (_mm_set1_epi32 / vec_splats).

R_BuildLightMap() — Bound/Invert Pass

  • File: sw_light.c:465-481
  • Inner loop: t = max(0,t); t = (255*256 - t) >> 2; t = max(64, t);
  • Data types: 32-bit int array (light_t)
  • Access pattern: Sequential read/write
  • Inter-pixel dependencies: None
  • Typical count: smax * tmax * 3 (48 to 3072)

Platform     Strategy                                            Speedup
AVX2         vpmaxsd / vpsubd / vpsrad / vpmaxsd, 8-wide         6-8x
SSE4.1       pmaxsd / psubd / psrad / pmaxsd, 4-wide             3-4x
NEON         vmaxq_s32 / vsubq_s32 / vshrq / vmaxq_s32, 4-wide   3-4x
POWER9 VMX   vec_max / vec_sub / vec_sra / vec_max, 4-wide       3-4x

Effort: ~15-20 lines per platform. Note: SSE2 lacks pmaxsd (signed 32-bit max); requires a compare+blend workaround or targeting SSE4.1 minimum.
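
A 4-wide SSE4.1 sketch of the chain; the array and count names are illustrative, and the scalar tail reuses the original logic:

#include <smmintrin.h>  /* SSE4.1: _mm_max_epi32 */

static void
BoundLight_SSE41(int *light, int count)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i top = _mm_set1_epi32(255 * 256);
    const __m128i floor64 = _mm_set1_epi32(64);
    int i = 0;

    for (; i + 4 <= count; i += 4)
    {
        __m128i t = _mm_loadu_si128((const __m128i *)(light + i));

        t = _mm_max_epi32(t, zero);      /* t = max(0, t) */
        t = _mm_sub_epi32(top, t);       /* t = 255*256 - t */
        t = _mm_srai_epi32(t, 2);        /* t >>= 2 */
        t = _mm_max_epi32(t, floor64);   /* t = max(64, t) */
        _mm_storeu_si128((__m128i *)(light + i), t);
    }

    /* Scalar tail. On plain SSE2, _mm_max_epi32 does not exist; it can be
     * emulated with _mm_cmpgt_epi32 plus and/andnot blending. */
    for (; i < count; i++)
    {
        int t = light[i] > 0 ? light[i] : 0;
        t = (255 * 256 - t) >> 2;
        light[i] = t > 64 ? t : 64;
    }
}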

R_BuildLightMap() — Accumulation Pass

  • File: sw_light.c:443-449
  • Inner loop: *curr_light += *lightmap * scale;
  • Data types: byte input (lightmap), uint32 accumulator, uint32 scalar
  • Access pattern: Sequential
  • Inter-element dependencies: None
  • Typical count: same as bound/invert pass

Platform     Strategy                                      Speedup
AVX2         vpmovzxbd + vpmulld + vpaddd, 8-wide          4-6x
SSE4.1       pmovzxbd + pmulld + paddd, 4-wide             2-3x
NEON         vmovl chain + vmulq_u32 + vaddq_u32, 4-wide   2-3x
POWER9 VMX   vmsumubm if scale fits in a byte (16 MACs ->
             4 accumulators in one instruction), else
             vec_mul + vec_add                             2-4x

POWER9 note: vmsumubm is uniquely powerful here. A single VMX instruction processes 16 byte multiply-accumulates into 4 word accumulators. When the scale factor fits in a byte (common — it's r_modulate scaled by light style intensity), this can match AVX2 throughput from a 128-bit instruction.
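
A sketch of the vmsumubm (vec_msum) trick, assuming the scale fits in a byte. vec_msum sums the four byte products inside each 4-byte group into one word lane, so the source bytes are first spread out with vec_perm (one data byte per group, the rest zero); each vec_msum then performs four independent MACs. The permute pattern below uses big-endian element numbering and would need adjusting on little-endian targets; names and bounds handling are illustrative:

#include <altivec.h>

static void
AccumulateLight_VMX(unsigned int *curr_light, const unsigned char *lightmap,
                    unsigned char scale, int count)
{
    const vector unsigned char vscale = vec_splats(scale);
    const vector unsigned char vzero = vec_splats((unsigned char)0);
    /* Spread bytes 0-3 into separate groups; indices 0x10+ select zeros. */
    const vector unsigned char spread = {
        0, 0x10, 0x10, 0x10, 1, 0x10, 0x10, 0x10,
        2, 0x10, 0x10, 0x10, 3, 0x10, 0x10, 0x10
    };
    int i, k;

    for (i = 0; i + 16 <= count; i += 16)
    {
        vector unsigned char src = vec_xl(0, lightmap + i);

        for (k = 0; k < 4; k++)
        {
            /* Shift the pattern so it covers bytes 4k..4k+3. */
            vector unsigned char pat =
                vec_add(spread, vec_splats((unsigned char)(4 * k)));
            vector unsigned char grp = vec_perm(src, vzero, pat);
            vector unsigned int acc = vec_xl(0, curr_light + i + 4 * k);

            /* One vmsumubm: four byte multiply-accumulates. */
            acc = vec_msum(grp, vscale, acc);
            vec_xst(acc, 0, curr_light + i + 4 * k);
        }
    }

    for (; i < count; i++)
    {
        curr_light[i] += lightmap[i] * scale;
    }
}

In the real accumulation pass the accumulators would stay in registers across lightmap styles rather than round-tripping through memory on every step.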

Tier 2: Medium Benefit, Moderate Effort

R_DrawSurfaceBlock Fast Path (Greyscale Light)

  • File: sw_surf.c:78-83
  • Inner loop: *dest = vid_colormap[*src + light_masked_right];
  • Data types: byte source + constant offset -> 16KB table lookup -> byte dest
  • Access pattern: Sequential source/dest, random table lookup (L1-resident)
  • Inter-pixel dependencies: None
  • Typical count: 16 pixels per block (at mip 0)

Platform        Strategy                                            Speedup
AVX2            Byte add, vpmovzxbd, vpgatherdd from vid_colormap   2-3x
SSE2/NEON/VMX   Batch byte load/store, scalar lookups               1.3-1.5x
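
A compact sketch of the AVX2 row, 8 texels at a time. The gather performs 32-bit loads at byte offsets, so the 16 KB table would need 3 bytes of tail padding (an assumption of this sketch); names are illustrative:

#include <immintrin.h>
#include <stdint.h>

static void
SurfBlockRow_AVX2(uint8_t *dest, const uint8_t *src,
                  const uint8_t *colormap, int light_offset)
{
    /* Zero-extend 8 texel indices and add the constant light offset. */
    __m128i s8 = _mm_loadl_epi64((const __m128i *)src);
    __m256i idx = _mm256_add_epi32(_mm256_cvtepu8_epi32(s8),
                                   _mm256_set1_epi32(light_offset));

    /* vpgatherdd from vid_colormap; keep only the low byte per lane. */
    __m256i lit = _mm256_i32gather_epi32((const int *)colormap, idx, 1);
    lit = _mm256_and_si256(lit, _mm256_set1_epi32(0xFF));

    /* Pack 8 dwords -> 8 words -> 8 bytes and store. */
    __m128i w = _mm_packus_epi32(_mm256_castsi256_si128(lit),
                                 _mm256_extracti128_si256(lit, 1));
    _mm_storel_epi64((__m128i *)dest, _mm_packus_epi16(w, w));
}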

D_DrawSpansPow2() — World Surface Texturing

  • File: sw_scan.c:539-544
  • Inner loop:
    *pdest++ = *(pbase + (s >> 16) + (t >> 16) * cachewidth);
    s += sstep; t += tstep;
  • Data types: 32-bit fixed-point coords, byte texture, byte dest
  • Access pattern: Sequential dest, random texture access (surface cache)
  • Inter-pixel dependencies: Linear (s,t) accumulation (parallelizable)
  • Typical count: 16 pixels per sub-span (SPANSTEP_SHIFT = 4)
  • Variants: horizontal-only (line 518), vertical-only (line 528), diagonal

Platform     Strategy                                                 Speedup
AVX2         8-wide address compute (vpsrld+vpmulld+vpaddd), gather   2-3x
SSE4.1       4-wide address compute, scalar texture fetch             1.3-1.5x
NEON         Same as SSE4.1                                           1.3-1.5x
POWER9 VMX   4-wide address compute with vmuluwm, scalar fetch        1.3-1.5x
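
A sketch of one 8-pixel step of the inner loop with AVX2. s, t, sstep, tstep, cachewidth and pbase follow the scalar code and are assumed non-negative fixed-point; the gather overreads up to 3 bytes past each texel, so the surface cache would need small padding (an assumption of this sketch):

#include <immintrin.h>
#include <stdint.h>

static void
DrawSpan8_AVX2(uint8_t *pdest, const uint8_t *pbase, int cachewidth,
               int s, int t, int sstep, int tstep)
{
    const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i vs = _mm256_add_epi32(_mm256_set1_epi32(s),
        _mm256_mullo_epi32(lane, _mm256_set1_epi32(sstep)));
    __m256i vt = _mm256_add_epi32(_mm256_set1_epi32(t),
        _mm256_mullo_epi32(lane, _mm256_set1_epi32(tstep)));

    /* addr = (s >> 16) + (t >> 16) * cachewidth */
    __m256i addr = _mm256_add_epi32(
        _mm256_srli_epi32(vs, 16),
        _mm256_mullo_epi32(_mm256_srli_epi32(vt, 16),
                           _mm256_set1_epi32(cachewidth)));

    /* Gather 8 texels (32-bit loads, keep the low byte of each). */
    __m256i texel = _mm256_i32gather_epi32((const int *)pbase, addr, 1);
    texel = _mm256_and_si256(texel, _mm256_set1_epi32(0xFF));

    /* Pack 8 dwords down to 8 bytes and store. */
    __m128i w = _mm_packus_epi32(_mm256_castsi256_si128(texel),
                                 _mm256_extracti128_si256(texel, 1));
    _mm_storel_epi64((__m128i *)pdest, _mm_packus_epi16(w, w));
}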

Tier 3: Low Benefit / High Effort (Likely Skip SIMD)

R_PolysetDrawSpans8_Opaque() — Alias Model Pixels

  • File: sw_polyset.c:725-754
  • Per-pixel z-test (branch), R_ApplyLight() (7+ table lookups), carry-based texture walk (ltfrac & 0x10000). Small spans (5-50 px). SIMD-hostile.
  • Benefit: ~1.2x — the R_ApplyLight rework (Phase 4) is the real fix.

TurbulentPow2() — Water/Warp Surfaces

  • File: sw_scan.c:169-178
  • 3 data-dependent table lookups per pixel (2x turb[], 1x texture). Cross-coordinate turbulence dependency. 16 pixels per sub-span.
  • Benefit: ~1.5x with AVX2, marginal on other platforms.

D_WarpScreen() — Full-Screen Warp

  • File: sw_scan.c:86-89
  • Double pointer indirection (row[turb[u]][col[u]]). Fundamentally serial memory access pattern. No SIMD instruction can do pointer-chasing gathers.
  • Benefit: negligible.

R_ApplyLight Rework

Problem Statement

R_ApplyLight() (sw_image.c:453-487) is a systemic bottleneck acknowledged in the source with a TODO: -22% fps lost comment. It performs 7 chained table lookups per pixel and is called from two performance-critical paths:

  • Surface cache building (sw_surf.c:109)
  • Alias model rendering (sw_polyset.c:735)

Rework Options Considered

Option A: Precomputed combined table Direct (light_r, light_g, light_b, texel) -> pixel table. Size: 64^3 x 256 = 67M entries. Rejected: too large.

Option B: Intermediate RGB table Precompute rgb_from_lit_texel[64][256] per channel (16 KB each, 48 KB total). Reduces 7 lookups to 4. Marginal improvement. Not worth the complexity.

Option C: 16-bit RGB internal framebuffer Switch from 8-bit paletted to RGB565 internally. Eliminates R_ApplyLight entirely — lighting becomes direct RGB multiply. Large architectural change. Trade-offs: doubles framebuffer/cache size, changes the visual aesthetic, requires texture format conversion. Best long-term option but highest risk.

Option D: Batch R_ApplyLight across pixels (recommended) Restructure callers to process 4-8 pixels at a time. Each pixel's lookup chain is independent, enabling cross-pixel vectorization:

  • With AVX2: 3 vpgatherdd for vid_colormap[], 3 for d_8to24table[], 1 for d_16to8table[] = 7 gathers for 8 pixels (vs 7 scalar lookups per pixel currently)
  • With SSE2/NEON/POWER9: extract-load-insert pattern, but still amortizes loop overhead and enables SIMD for the arithmetic portions (shifts, masks, OR-packing for RGB565)

Platform     Strategy                                Speedup
AVX2         7 gathers for 8 pixels + SIMD packing   ~4x
SSE2         Scalar lookups + SIMD packing           ~1.5x
NEON         Scalar lookups + SIMD packing           ~1.5x
POWER9 VMX   Scalar lookups + SIMD packing           ~1.5x

Recommendation: Implement Option D. It works within the existing 8-bit architecture, requires no table precomputation, and gives the biggest win on AVX2 while still providing modest gains elsewhere.
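
A shape sketch of the AVX2 path under Option D, assuming 8 texels and the three light_masked values already widened to one 32-bit lane per pixel. Each gather performs a 32-bit load, so the byte tables need 3 bytes of tail padding; names are illustrative, not the sw_image.c code:

#include <immintrin.h>
#include <stdint.h>

static void
ApplyLight8_AVX2(const uint8_t *vid_colormap, const uint8_t *d_8to24table,
                 const uint8_t *d_16to8table,
                 __m256i pix,          /* 8 texels, one per 32-bit lane */
                 __m256i light_r, __m256i light_g, __m256i light_b,
                 uint8_t *out)
{
    const __m256i bytemask = _mm256_set1_epi32(0xFF);

    /* 1-3: vid_colormap[light_masked[ch] + pix], one gather per channel. */
    __m256i i_r = _mm256_and_si256(_mm256_i32gather_epi32(
        (const int *)vid_colormap, _mm256_add_epi32(light_r, pix), 1), bytemask);
    __m256i i_g = _mm256_and_si256(_mm256_i32gather_epi32(
        (const int *)vid_colormap, _mm256_add_epi32(light_g, pix), 1), bytemask);
    __m256i i_b = _mm256_and_si256(_mm256_i32gather_epi32(
        (const int *)vid_colormap, _mm256_add_epi32(light_b, pix), 1), bytemask);

    /* 4-6: d_8to24table[i * 4 + channel], one gather per channel. */
    __m256i b_r = _mm256_and_si256(_mm256_i32gather_epi32(
        (const int *)d_8to24table, _mm256_slli_epi32(i_r, 2), 1), bytemask);
    __m256i b_g = _mm256_and_si256(_mm256_i32gather_epi32(
        (const int *)(d_8to24table + 1), _mm256_slli_epi32(i_g, 2), 1), bytemask);
    __m256i b_b = _mm256_and_si256(_mm256_i32gather_epi32(
        (const int *)(d_8to24table + 2), _mm256_slli_epi32(i_b, 2), 1), bytemask);

    /* 7: pack to RGB565 in-register, then the final d_16to8table gather. */
    __m256i i_c = _mm256_or_si256(_mm256_srli_epi32(b_r, 3),
        _mm256_or_si256(
            _mm256_slli_epi32(_mm256_srli_epi32(b_g, 2), 5),
            _mm256_slli_epi32(_mm256_srli_epi32(b_b, 3), 11)));
    __m256i c = _mm256_and_si256(_mm256_i32gather_epi32(
        (const int *)d_16to8table, i_c, 1), bytemask);

    /* Pack the 8 palette indices down to bytes. */
    __m128i w = _mm_packus_epi32(_mm256_castsi256_si128(c),
                                 _mm256_extracti128_si256(c, 1));
    _mm_storel_epi64((__m128i *)out, _mm_packus_epi16(w, w));
}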

Callers to modify:

  • R_DrawSurfaceBlock_Light() in sw_surf.c — process block rows in batches of 4-8 pixels
  • R_PolysetDrawSpans8_Opaque() in sw_polyset.c — process span pixels in batches (more complex due to z-test branching and texture carry)

Implementation Phases

Phase 1: Infrastructure

1a. SIMD detection header (header/simd.h)

Compile-time detection macros:

/* x86 */
#if defined(__AVX2__)
  #define YQ2_SIMD_AVX2 1
#endif
#if defined(__SSE4_1__)
  #define YQ2_SIMD_SSE41 1
#endif
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
  #define YQ2_SIMD_SSE2 1
#endif

/* ARM */
#if defined(__ARM_NEON) || defined(__aarch64__)
  #define YQ2_SIMD_NEON 1
#endif

/* POWER */
#if defined(__POWER9_VECTOR__) || \
    (defined(__ALTIVEC__) && defined(_ARCH_PWR9))
  #define YQ2_SIMD_VMX_P9 1
#elif defined(__ALTIVEC__)
  #define YQ2_SIMD_VMX 1
#endif

Runtime dispatch via function pointers, initialized once at renderer startup. On x86, use SDL_HasAVX2() / SDL_HasSSE41() for safe runtime detection. On ARM and POWER, compile-time detection is sufficient (features are guaranteed by the target ABI/CPU).
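
An illustrative sketch of the startup dispatch. The CopyFrame_* kernel names are hypothetical; each kernel lives in its own sw_simd_*.c file compiled with the right -m flags, so only declarations are needed here:

#include <SDL.h>  /* SDL_HasAVX2, SDL_HasSSE41 */

typedef void (*copyframe_fn)(const unsigned char *src, Uint32 *dst, int count);

extern void CopyFrame_Scalar(const unsigned char *, Uint32 *, int);
extern void CopyFrame_SSE41(const unsigned char *, Uint32 *, int);
extern void CopyFrame_AVX2(const unsigned char *, Uint32 *, int);

copyframe_fn copyframe_impl = CopyFrame_Scalar;  /* always-available fallback */

void
R_InitSIMDDispatch(void)
{
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
    if (SDL_HasSSE41())
    {
        copyframe_impl = CopyFrame_SSE41;
    }
    if (SDL_HasAVX2())
    {
        copyframe_impl = CopyFrame_AVX2;   /* best available wins */
    }
#endif
    /* On ARM and POWER the NEON/VMX kernels are selected at compile time,
     * so no runtime check is needed (per the note above). */
}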

1b. Build system changes (see Build System Changes)

Phase 2: Tier 1 Targets

Implement SIMD for the four highest-impact loops:

Target                         ~Lines per platform
RE_CopyFrame                   30-50
D_DrawZSpans                   20-30
R_BuildLightMap (bound)        15-20
R_BuildLightMap (accumulate)   25-35

Estimated total: ~400-500 lines across all platforms.

Phase 3: Tier 2 Targets

Target                         ~Lines per platform
R_DrawSurfaceBlock fast path   30-40
D_DrawSpansPow2                50-80

Estimated total: ~400-600 lines across all platforms. The span drawer is more complex due to three code paths (horizontal/vertical/diagonal).

Phase 4: R_ApplyLight Rework

Restructure R_DrawSurfaceBlock_Light() and R_PolysetDrawSpans8_Opaque() to batch-process pixels, then apply SIMD to the batched lookups.

Estimated total: ~300-500 lines (most complexity is in restructuring the scalar code, not the intrinsics).


File Organization

src/client/refresh/soft/
├── header/
│   ├── local.h            (existing — add SIMD dispatch function pointer decls)
│   ├── model.h            (existing — unchanged)
│   └── simd.h             (NEW — detection macros, dispatch init prototype)
├── sw_main.c              (modify — SIMD dispatch in RE_CopyFrame, init call)
├── sw_scan.c              (modify — SIMD dispatch in D_DrawZSpans, D_DrawSpansPow2)
├── sw_surf.c              (modify — SIMD dispatch in R_DrawSurfaceBlock)
├── sw_light.c             (modify — SIMD dispatch in R_BuildLightMap)
├── sw_image.c             (modify — batch R_ApplyLight rework)
├── sw_polyset.c           (modify — batch pixel processing for alias models)
├── sw_simd_sse2.c         (NEW — SSE2 implementations)
├── sw_simd_sse41.c        (NEW — SSE4.1, compiled with -msse4.1)
├── sw_simd_avx2.c         (NEW — AVX2, compiled with -mavx2)
├── sw_simd_neon.c         (NEW — NEON, AArch64 only)
└── sw_simd_vmx.c          (NEW — VMX/VSX, compiled with -mcpu=power9 -mvsx)

Each sw_simd_*.c file implements the same set of functions with platform-specific intrinsics. The scalar fallback remains in the original source files. At init time, function pointers are set to the best available implementation.


Build System Changes

Makefile

Architecture-specific SIMD files need per-file compiler flags since the base CFLAGS do not enable AVX2/SSE4.1/POWER9:

# SSE4.1 (x86/x86_64 only)
build/ref_soft/sw_simd_sse41.o: src/client/refresh/soft/sw_simd_sse41.c
	$(CC) -c $(CFLAGS) -msse4.1 $(SDLCFLAGS) $(INCLUDE) -o $@ $<

# AVX2 (x86/x86_64 only)
build/ref_soft/sw_simd_avx2.o: src/client/refresh/soft/sw_simd_avx2.c
	$(CC) -c $(CFLAGS) -mavx2 $(SDLCFLAGS) $(INCLUDE) -o $@ $<

# POWER9 VMX/VSX (ppc64le only)
build/ref_soft/sw_simd_vmx.o: src/client/refresh/soft/sw_simd_vmx.c
	$(CC) -c $(CFLAGS) -mcpu=power9 -mvsx $(SDLCFLAGS) $(INCLUDE) -o $@ $<

NEON on AArch64 and SSE2 on x86_64 need no special flags; both are baseline features of their respective architectures.

Conditional inclusion in REFSOFT_OBJS_ based on YQ2_ARCH:

ifeq ($(YQ2_ARCH),x86_64)
REFSOFT_OBJS_ += sw_simd_sse2.o sw_simd_sse41.o sw_simd_avx2.o
else ifeq ($(YQ2_ARCH),i386)
REFSOFT_OBJS_ += sw_simd_sse2.o
else ifeq ($(YQ2_ARCH),aarch64)
REFSOFT_OBJS_ += sw_simd_neon.o
else ifneq (,$(findstring powerpc,$(YQ2_ARCH)))
REFSOFT_OBJS_ += sw_simd_vmx.o
endif

CMakeLists.txt

Similar conditional logic using CMAKE_SYSTEM_PROCESSOR. Note: CMakeLists.txt is marked as unmaintained in the project; Makefile is the primary build system.


Estimated Impact Summary

Optimization                    x86 SSE2   x86 AVX2   ARM NEON   POWER9 VMX   Effort
RE_CopyFrame                    1.5x       4-6x       1.5x       1.5x         Low
D_DrawZSpans                    3-4x       6-8x       3-4x       3-4x         Low
R_BuildLightMap (bound)         3-4x       6-8x       3-4x       3-4x         Low
R_BuildLightMap (accumulate)    2-3x       4-6x       2-3x       2-4x*        Low
R_DrawSurfaceBlock fast path    1.3x       2-3x       1.3x       1.3x         Medium
D_DrawSpansPow2                 1.3x       2-3x       1.3x       1.3x         Medium
R_ApplyLight batch (Option D)   1.5x       ~4x        1.5x       1.5x         High
*POWER9's vmsumubm can match AVX2 throughput when scale fits in a byte.

These are per-loop speedups. The overall frame time improvement depends on what fraction of time is spent in each loop, which varies by scene complexity, resolution, and whether the surface cache is warm.


Multicore Feasibility (Deferred)

A full multicore analysis was performed as a precursor to this SIMD plan. Threading is deferred because it requires significant architectural refactoring (encapsulating ~50+ global variables into a render context struct), whereas SIMD can be applied incrementally to existing code.

Why Threading Is Hard

The renderer uses massive shared mutable global state:

  • View vectors (vpn, vup, vright) are mutated mid-frame during brush model rendering and restored afterward
  • Edge/surface allocators (edge_p++, surface_p++, span_p) are unsynchronized bump allocators
  • Texture state (cacheblock, cachewidth, d_sdivzstepu, etc.) is global, set before each surface draw
  • Lighting buffer (blocklights[]) is a single shared accumulation buffer
  • The scanline loop in R_ScanEdges flushes spans mid-loop when the span buffer fills, coupling span generation with span drawing

Threading Strategies (for future consideration)

  1. Encapsulate state into render_context_t — prerequisite for everything (a first-cut sketch follows this list)
  2. Horizontal band parallelism — divide screen into bands, run full pipeline per band with independent contexts. Architecturally cleanest but requires near-complete state isolation.
  3. Parallel entity rendering — alias models are independent. Per-thread vertex/span buffers, shared z-buffer with atomic or per-band partitioning.
  4. Parallel surface cache building — each surface is independent. Per-thread blocklights[], thread-safe cache allocator.
  5. Parallel RE_CopyFrame — trivial band decomposition.
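
A hypothetical first cut of the render context from strategy 1. Field names mirror the globals listed above; the types come from the renderer's existing headers, the blocklights size is illustrative, and the real struct would have to absorb all ~50 globals before band-based threading is attempted:

#include "header/local.h"  /* existing renderer types: pixel_t, light_t, ... */

typedef struct render_context_s
{
    /* View vectors, currently mutated mid-frame for brush models. */
    vec3_t vpn, vup, vright;

    /* Per-thread bump allocators replacing edge_p / surface_p / span_p. */
    edge_t *edge_p;
    surf_t *surface_p;
    espan_t *span_p;

    /* Per-surface texture state set before each surface draw. */
    pixel_t *cacheblock;
    int cachewidth;

    /* Per-thread lighting accumulation buffer. */
    light_t blocklights[1024 * 3];
} render_context_t;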

POWER9 Threading Advantage

POWER9's SMT4/SMT8 makes threading particularly attractive on this platform. For gather-heavy operations where SIMD provides little benefit (palette lookup, R_ApplyLight), running 4-8 scalar threads on the same core naturally hides memory latency. A modest 2-4 thread split of entity rendering or framebuffer conversion would already utilize the POWER9 hardware better than single-threaded scalar code, without requiring the full state-encapsulation refactoring needed for band-based parallelism.
