
SIMD Optimization Plan for the Software Renderer

This document covers the feasibility analysis and implementation plan for adding SIMD acceleration to the Yamagi Quake II software renderer (src/client/refresh/soft/).

Table of Contents

  1. Architecture Overview
  2. Current Bottlenecks
  3. Platform SIMD Comparison
  4. Per-Loop SIMD Analysis
  5. R_ApplyLight Rework
  6. Implementation Phases
  7. File Organization
  8. Build System Changes
  9. Estimated Impact Summary
  10. Multicore Feasibility (Deferred)

Architecture Overview

The software renderer is a classic Quake-era scanline renderer:

  • 8-bit paletted internal framebuffer (pixel_t = unsigned char)
  • Double-buffered: swap_frames[0] and swap_frames[1]
  • Z-buffer: d_pzbuffer (zvalue_t = int), same dimensions as framebuffer
  • Active Edge Table (AET) scanline rasterizer in sw_edge.c
  • Surface cache system in sw_surf.c (texture + lightmap composited once, then reused across frames)
  • BSP front-to-back traversal for occlusion via edge/surface stack
  • 17 source files, zero threading, zero SIMD

Rendering Pipeline (per frame, in RE_RenderFrame)

R_SetupFrame()
  -> R_MarkLeaves()         (PVS decompression, leaf/node visibility marking)
  -> R_PushDlights()        (mark surfaces affected by dynamic lights)
  -> R_EdgeDrawing()
       R_BeginEdgeFrame()
       R_RenderWorld()       (BSP traversal -> R_RenderFace -> Global Edge Table)
       R_DrawBEntitiesOnList()
       R_ScanEdges()         (scanline AET processing -> R_GenerateSpans
                              -> D_DrawSurfaces dispatches to D_SolidSurf,
                                 D_TurbulentSurf, D_SkySurf, D_BackgroundSurf)
  -> R_DrawEntitiesOnList()  (alias models via z-buffer read)
  -> R_DrawParticles()
  -> R_DrawAlphaSurfaces()
  -> D_WarpScreen()          (underwater warp, if applicable)
  -> R_CalcPalette()         (screen tint/blend)
RE_EndFrame()
  -> RE_CopyFrame()          (8-bit -> 32-bit palette conversion -> SDL texture)

Lookup Tables Used by the Renderer

Table          Type        Size    Purpose
sdl_palette    Uint32[]    1 KB    256-entry palette for the final 8->32 convert
vid_colormap   pixel_t[]   16 KB   256 texels x 64 light grades
d_8to24table   byte[]      1 KB    Indexed color -> 32-bit RGBA
d_16to8table   byte[]      64 KB   RGB565 -> best-match 8-bit palette index

Current Bottlenecks

Hot Inner Loops (ranked by impact)

  1. D_DrawSpansPow2() (sw_scan.c:539). World surface texturing. Perspective-correct with 16-pixel subdivision. Inner loop: fixed-point (s,t) stepping + texture gather + byte store.

  2. R_PolysetDrawSpans8_Opaque() (sw_polyset.c:725). Alias model (MD2) pixel loop. Z-test + R_ApplyLight() + affine texture walk with carry.

  3. R_ScanEdges() scanline loop (sw_edge.c). Per-scanline AET maintenance and span generation. Fundamentally sequential.

  4. D_DrawZSpans() (sw_scan.c:801). Z-buffer fill. Linear ramp of 32-bit ints.

  5. R_DrawSurfaceBlock8_anymip() / R_DrawSurfaceBlock_Light() (sw_surf.c:78-114). Surface cache building (texture + lightmap compositing).

  6. R_BuildLightMap() (sw_light.c:443-481). Lightmap accumulation and bound/invert/shift post-processing.

  7. RE_CopyFrame() (sw_main.c:2249). 8-bit to 32-bit palette conversion for display.

Systemic Bottleneck: R_ApplyLight()

Located at sw_image.c:453-487. The source code itself has a TODO: -22% fps lost comment. This function performs 7 chained table lookups per pixel:

// 1-3. Three vid_colormap[] lookups (16KB table), one per RGB channel
i_r = vid_colormap[light_masked[0] + pix];
i_g = vid_colormap[light_masked[1] + pix];
i_b = vid_colormap[light_masked[2] + pix];

// 4-6. Three d_8to24table[] lookups (1KB table) to get RGB components
b_r = d_8to24table[i_r * 4 + 0];
b_g = d_8to24table[i_g * 4 + 1];
b_b = d_8to24table[i_b * 4 + 2];

// 7. Pack to RGB565, lookup d_16to8table[] (64KB table) to get palette index
i_c = (b_r >> 3) | ((b_g >> 2) << 5) | ((b_b >> 3) << 11);
return d_16to8table[i_c & 0xFFFF];

There is a fast path (line 466) when all three light channels are equal, which reduces to a single vid_colormap[] lookup. The slow path is the problem.

Called from:

  • R_DrawSurfaceBlock_Light() in sw_surf.c:109 (surface cache building)
  • R_PolysetDrawSpans8_Opaque() in sw_polyset.c:735 (alias model rendering)

Platform SIMD Comparison

Instruction Set Overview

Capability              x86 SSE2     x86 SSE4.1   x86 AVX2     ARM NEON      POWER9 VMX/VSX
Vector width            128-bit      128-bit      256-bit      128-bit       128-bit
Vector registers        16 XMM       16 XMM       16 YMM       32 Q          32 VMX (64 VSX)
32-bit int multiply     Workaround   pmulld       vpmulld      vmulq_s32     vmuluwm (ISA 2.07)
Byte->int zero-extend   Manual       pmovzxbd     vpmovzxbd    vmovl x2      vec_mergeh+zero
Gather (table lookup)   None         None         vpgatherdd   None          None
Int min/max (signed)    Workaround   pmaxsd       vpmaxsd      vmaxq_s32     vmaxsw
Multiply-accumulate     pmaddwd      pmaddwd      vpmaddwd     vmlal         vmsumubm/uhm
SMT per core            2 (HT)       2 (HT)       2 (HT)       Typically 1   4-8 (SMT4/8)

Key Per-Platform Notes

x86 AVX2 is the clear winner for gather-dependent workloads. vpgatherdd can load 8 non-contiguous 32-bit values in a single instruction. This directly accelerates palette lookups, texture fetches, and the R_ApplyLight lookup chain. The 256-bit width also doubles throughput for pure arithmetic.

POWER9 VMX (ISA 3.0) matches NEON for pure arithmetic but has several distinguishing characteristics:

  • vmsumubm / vmsumuhm (multiply-sum): processes 16 byte MACs into 4 word accumulators in a single instruction. Ideal for the lightmap accumulation pass. Can match or exceed AVX2 throughput when the scale factor fits in a byte.
  • vmuluwm (32x32->32 multiply): added in ISA 2.07 (POWER8) and critical here. Earlier POWER ISAs required a 3-4 instruction sequence (vmulesw/vmulosw + merge) for what is now 1 instruction.
  • No gather instruction (not even in POWER10/ISA 3.1). For gather-dependent loops, POWER9 must fall back to scalar extract-load-insert.
  • SMT4/SMT8 compensates: 4-8 hardware threads per core can hide table lookup latency by interleaving scalar work from multiple threads. This makes the multicore threading strategy (deferred below) more impactful on POWER9 than on other architectures.
  • 32 VMX registers (vs 16 YMM on x86) reduce register pressure in complex inner loops.

ARM NEON is on par with POWER9 VMX for most operations. No gather instruction. No unique advantages or disadvantages for this workload. vtbl can do small table lookups (up to 64 bytes) in-register but the tables here are too large.

x86 SSE2 is the minimum x86 baseline. Lacks pmulld (32-bit multiply) and pmovzxbd (byte->int extend), requiring multi-instruction workarounds. Still worthwhile for the arithmetic-only loops.

Per-Operation Platform Comparison

1. 256-Entry Palette Lookup (8-bit -> 32-bit)

Platform     Strategy                                                Relative throughput
AVX2         8 bytes -> vpmovzxbd -> vpgatherdd -> store 8x uint32   8-12x
SSE2         4 bytes, scalar lookup, pack __m128i                    ~1x
NEON         4 bytes, scalar lookup, pack uint32x4_t                 ~1x
POWER9 VMX   4 bytes, scalar lookup, pack vector unsigned int        ~1x

AVX2 has a massive advantage. All other platforms must do scalar loads from the 1 KB palette table (which fits entirely in L1 cache, keeping latency low).

2. Linear Ramp Fill (z-buffer)

All platforms are equally capable. Pure arithmetic: {base, base+step, base+2*step, ...}, shift right by 16, store. AVX2 gets 2x from 256-bit width. POWER9's vmuluwm (ISA 2.07) makes the setup efficient.

3. Byte->Int Multiply-Accumulate (lightmap)

Platform     Key instructions        Elements per instruction
AVX2         vpmulld + vpaddd        8
SSE4.1       pmulld + paddd          4
NEON         vmulq_u32 + vaddq_u32   4
POWER9 VMX   vmsumubm (byte MAC)     16 -> 4 accumulators

POWER9's vmsumubm is genuinely strong here — when the scale factor fits in a byte, a single instruction processes 16 byte multiply-accumulates.

4. Clamp/Subtract/Shift Chain (lightmap post-process)

Identical across all platforms: max, sub, sra, max. AVX2 gets 2x from width. SSE2 needs a workaround for signed 32-bit max (no pmaxsd until SSE4.1).

5. Texture Coordinate Stepping + Gather

Address computation is fine on all platforms. The bottleneck is the texture byte gather. Same story as palette lookup: AVX2 wins with vpgatherdd, all others fall back to scalar.

6. R_ApplyLight Multi-Table Lookup Chain

Fundamentally serial per pixel due to dependency chains. SIMD helps only by processing multiple independent pixels in parallel (cross-pixel vectorization). AVX2's gather enables this; other platforms must use scalar lookups. POWER9's SMT advantage applies here — multiple hardware threads hide lookup latency.


Per-Loop SIMD Analysis

Tier 1: High Benefit, Low Effort

RE_CopyFrame() — Palette Conversion

  • File: sw_main.c:2249-2255
  • Inner loop: *dst = sdl_palette[*src]; src++; dst++;
  • Data types: byte input, uint32 output, 256-entry uint32 palette (1 KB)
  • Access pattern: Sequential read/write, random palette lookup (L1-resident)
  • Inter-pixel dependencies: None (embarrassingly parallel)
  • Typical count: width * height (e.g. 2M pixels at 1920x1080)

Platform     Strategy                                                Speedup
AVX2         8 bytes -> vpmovzxbd -> vpgatherdd -> store 8x uint32   4-6x
SSE2         4 bytes, scalar lookup, pack into __m128i, store        1.5-2x
NEON         4 bytes, scalar lookup, pack into uint32x4_t, store     1.5-2x
POWER9 VMX   4 bytes, scalar lookup, pack into vector unsigned int   1.5-2x

Effort: ~30-50 lines of intrinsics per platform.
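
A minimal sketch of the AVX2 path, assuming a plain linear destination buffer; the function and parameter names are illustrative, not the renderer's:

#include <immintrin.h>
#include <stdint.h>

static void
CopyFrame_AVX2(const uint8_t *src, uint32_t *dst, int count,
               const uint32_t *palette /* sdl_palette, 256 entries */)
{
    int i = 0;

    for (; i + 8 <= count; i += 8)
    {
        /* Load 8 palette indices and zero-extend each byte to a 32-bit lane. */
        __m128i idx8 = _mm_loadl_epi64((const __m128i *)(src + i));
        __m256i idx = _mm256_cvtepu8_epi32(idx8);

        /* One vpgatherdd fetches all 8 palette entries. */
        __m256i px = _mm256_i32gather_epi32((const int *)palette, idx, 4);
        _mm256_storeu_si256((__m256i *)(dst + i), px);
    }

    /* Scalar tail; the 1 KB palette is L1-resident, so this stays cheap. */
    for (; i < count; i++)
    {
        dst[i] = palette[src[i]];
    }
}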

D_DrawZSpans() — Z-Buffer Fill

  • File: sw_scan.c:801-806
  • Inner loop: *pdest++ = izi >> 16; izi += izistep;
  • Data types: 32-bit fixed-point input/output
  • Access pattern: Purely sequential write
  • Inter-pixel dependencies: Linear accumulation (trivially vectorizable by computing izi + 0*step, izi + 1*step, ...)
  • Typical count: Full span width (hundreds of pixels)

Platform     Strategy                                        Speedup
AVX2         Init 8-wide ramp, vpsrld by 16, vmovdqu store   6-8x
SSE2         Same, 4-wide                                    3-4x
NEON         vshrq_n_s32 + vst1q_s32, 4-wide                 3-4x
POWER9 VMX   vec_sr + vec_st, 4-wide                         3-4x

Effort: ~20-30 lines per platform. All platforms equally capable.

Note: the "safe-step" path at sw_scan.c:787-797 fills repeated values, which is just a memset-style 32-bit fill (_mm_set1_epi32 / vec_splats).

R_BuildLightMap() — Bound/Invert Pass

  • File: sw_light.c:465-481
  • Inner loop: t = max(0,t); t = (255*256 - t) >> 2; t = max(64, t);
  • Data types: 32-bit int array (light_t)
  • Access pattern: Sequential read/write
  • Inter-pixel dependencies: None
  • Typical count: smax * tmax * 3 (48 to 3072)

Platform     Strategy                                            Speedup
AVX2         vpmaxsd / vpsubd / vpsrad / vpmaxsd, 8-wide         6-8x
SSE4.1       pmaxsd / psubd / psrad / pmaxsd, 4-wide             3-4x
NEON         vmaxq_s32 / vsubq_s32 / vshrq / vmaxq_s32, 4-wide   3-4x
POWER9 VMX   vec_max / vec_sub / vec_sra / vec_max, 4-wide       3-4x

Effort: ~15-20 lines per platform. Note: SSE2 lacks pmaxsd (signed 32-bit max); requires a compare+blend workaround or targeting SSE4.1 minimum.
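
A 4-wide SSE4.1 sketch of the chain; the array and count names are illustrative, and the scalar tail reuses the original logic:

#include <smmintrin.h>  /* SSE4.1: _mm_max_epi32 */

static void
BoundLight_SSE41(int *light, int count)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i top = _mm_set1_epi32(255 * 256);
    const __m128i floor64 = _mm_set1_epi32(64);
    int i = 0;

    for (; i + 4 <= count; i += 4)
    {
        __m128i t = _mm_loadu_si128((const __m128i *)(light + i));

        t = _mm_max_epi32(t, zero);      /* t = max(0, t) */
        t = _mm_sub_epi32(top, t);       /* t = 255*256 - t */
        t = _mm_srai_epi32(t, 2);        /* t >>= 2 */
        t = _mm_max_epi32(t, floor64);   /* t = max(64, t) */
        _mm_storeu_si128((__m128i *)(light + i), t);
    }

    /* Scalar tail. On plain SSE2, _mm_max_epi32 does not exist; it can be
     * emulated with _mm_cmpgt_epi32 plus and/andnot blending. */
    for (; i < count; i++)
    {
        int t = light[i] > 0 ? light[i] : 0;
        t = (255 * 256 - t) >> 2;
        light[i] = t > 64 ? t : 64;
    }
}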

R_BuildLightMap() — Accumulation Pass

  • File: sw_light.c:443-449
  • Inner loop: *curr_light += *lightmap * scale;
  • Data types: byte input (lightmap), uint32 accumulator, uint32 scalar
  • Access pattern: Sequential
  • Inter-element dependencies: None
  • Typical count: same as bound/invert pass

Platform     Strategy                                      Speedup
AVX2         vpmovzxbd + vpmulld + vpaddd, 8-wide          4-6x
SSE4.1       pmovzxbd + pmulld + paddd, 4-wide             2-3x
NEON         vmovl chain + vmulq_u32 + vaddq_u32, 4-wide   2-3x
POWER9 VMX   vmsumubm if scale fits in a byte (16 MACs ->
             4 accumulators in one instruction), else
             vec_mul + vec_add                             2-4x

POWER9 note: vmsumubm is uniquely powerful here. A single VMX instruction processes 16 byte multiply-accumulates into 4 word accumulators. When the scale factor fits in a byte (common — it's r_modulate scaled by light style intensity), this can match AVX2 throughput from a 128-bit instruction.
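
A sketch of the vmsumubm (vec_msum) trick, assuming the scale fits in a byte. vec_msum sums the four byte products inside each 4-byte group into one word lane, so the source bytes are first spread out with vec_perm (one data byte per group, the rest zero); each vec_msum then performs four independent MACs. The permute pattern below uses big-endian element numbering and would need adjusting on little-endian targets; names and bounds handling are illustrative:

#include <altivec.h>

static void
AccumulateLight_VMX(unsigned int *curr_light, const unsigned char *lightmap,
                    unsigned char scale, int count)
{
    const vector unsigned char vscale = vec_splats(scale);
    const vector unsigned char vzero = vec_splats((unsigned char)0);
    /* Spread bytes 0-3 into separate groups; indices 0x10+ select zeros. */
    const vector unsigned char spread = {
        0, 0x10, 0x10, 0x10, 1, 0x10, 0x10, 0x10,
        2, 0x10, 0x10, 0x10, 3, 0x10, 0x10, 0x10
    };
    int i, k;

    for (i = 0; i + 16 <= count; i += 16)
    {
        vector unsigned char src = vec_xl(0, lightmap + i);

        for (k = 0; k < 4; k++)
        {
            /* Shift the pattern so it covers bytes 4k..4k+3. */
            vector unsigned char pat =
                vec_add(spread, vec_splats((unsigned char)(4 * k)));
            vector unsigned char grp = vec_perm(src, vzero, pat);
            vector unsigned int acc = vec_xl(0, curr_light + i + 4 * k);

            /* One vmsumubm: four byte multiply-accumulates. */
            acc = vec_msum(grp, vscale, acc);
            vec_xst(acc, 0, curr_light + i + 4 * k);
        }
    }

    for (; i < count; i++)
    {
        curr_light[i] += lightmap[i] * scale;
    }
}

In the real accumulation pass the accumulators would stay in registers across lightmap styles rather than round-tripping through memory on every step.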

Tier 2: Medium Benefit, Moderate Effort

R_DrawSurfaceBlock Fast Path (Greyscale Light)

  • File: sw_surf.c:78-83
  • Inner loop: *dest = vid_colormap[*src + light_masked_right];
  • Data types: byte source + constant offset -> 16KB table lookup -> byte dest
  • Access pattern: Sequential source/dest, random table lookup (L1-resident)
  • Inter-pixel dependencies: None
  • Typical count: 16 pixels per block (at mip 0)

Platform        Strategy                                            Speedup
AVX2            Byte add, vpmovzxbd, vpgatherdd from vid_colormap   2-3x
SSE2/NEON/VMX   Batch byte load/store, scalar lookups               1.3-1.5x
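
A compact sketch of the AVX2 row, 8 texels at a time. The gather performs 32-bit loads at byte offsets, so the 16 KB table would need 3 bytes of tail padding (an assumption of this sketch); names are illustrative:

#include <immintrin.h>
#include <stdint.h>

static void
SurfBlockRow_AVX2(uint8_t *dest, const uint8_t *src,
                  const uint8_t *colormap, int light_offset)
{
    /* Zero-extend 8 texel indices and add the constant light offset. */
    __m128i s8 = _mm_loadl_epi64((const __m128i *)src);
    __m256i idx = _mm256_add_epi32(_mm256_cvtepu8_epi32(s8),
                                   _mm256_set1_epi32(light_offset));

    /* vpgatherdd from vid_colormap; keep only the low byte per lane. */
    __m256i lit = _mm256_i32gather_epi32((const int *)colormap, idx, 1);
    lit = _mm256_and_si256(lit, _mm256_set1_epi32(0xFF));

    /* Pack 8 dwords -> 8 words -> 8 bytes and store. */
    __m128i w = _mm_packus_epi32(_mm256_castsi256_si128(lit),
                                 _mm256_extracti128_si256(lit, 1));
    _mm_storel_epi64((__m128i *)dest, _mm_packus_epi16(w, w));
}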

D_DrawSpansPow2() — World Surface Texturing

  • File: sw_scan.c:539-544
  • Inner loop:
    *pdest++ = *(pbase + (s >> 16) + (t >> 16) * cachewidth);
    s += sstep; t += tstep;
  • Data types: 32-bit fixed-point coords, byte texture, byte dest
  • Access pattern: Sequential dest, random texture access (surface cache)
  • Inter-pixel dependencies: Linear (s,t) accumulation (parallelizable)
  • Typical count: 16 pixels per sub-span (SPANSTEP_SHIFT = 4)
  • Variants: horizontal-only (line 518), vertical-only (line 528), diagonal

Platform     Strategy                                                 Speedup
AVX2         8-wide address compute (vpsrld+vpmulld+vpaddd), gather   2-3x
SSE4.1       4-wide address compute, scalar texture fetch             1.3-1.5x
NEON         Same as SSE4.1                                           1.3-1.5x
POWER9 VMX   4-wide address compute with vmuluwm, scalar fetch        1.3-1.5x
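
A sketch of one 8-pixel step of the inner loop with AVX2. s, t, sstep, tstep, cachewidth and pbase follow the scalar code and are assumed non-negative fixed-point; the gather overreads up to 3 bytes past each texel, so the surface cache would need small padding (an assumption of this sketch):

#include <immintrin.h>
#include <stdint.h>

static void
DrawSpan8_AVX2(uint8_t *pdest, const uint8_t *pbase, int cachewidth,
               int s, int t, int sstep, int tstep)
{
    const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i vs = _mm256_add_epi32(_mm256_set1_epi32(s),
        _mm256_mullo_epi32(lane, _mm256_set1_epi32(sstep)));
    __m256i vt = _mm256_add_epi32(_mm256_set1_epi32(t),
        _mm256_mullo_epi32(lane, _mm256_set1_epi32(tstep)));

    /* addr = (s >> 16) + (t >> 16) * cachewidth */
    __m256i addr = _mm256_add_epi32(
        _mm256_srli_epi32(vs, 16),
        _mm256_mullo_epi32(_mm256_srli_epi32(vt, 16),
                           _mm256_set1_epi32(cachewidth)));

    /* Gather 8 texels (32-bit loads, keep the low byte of each). */
    __m256i texel = _mm256_i32gather_epi32((const int *)pbase, addr, 1);
    texel = _mm256_and_si256(texel, _mm256_set1_epi32(0xFF));

    /* Pack 8 dwords down to 8 bytes and store. */
    __m128i w = _mm_packus_epi32(_mm256_castsi256_si128(texel),
                                 _mm256_extracti128_si256(texel, 1));
    _mm_storel_epi64((__m128i *)pdest, _mm_packus_epi16(w, w));
}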

Tier 3: Low Benefit / High Effort (Likely Skip SIMD)

R_PolysetDrawSpans8_Opaque() — Alias Model Pixels

  • File: sw_polyset.c:725-754
  • Per-pixel z-test (branch), R_ApplyLight() (7+ table lookups), carry-based texture walk (ltfrac & 0x10000). Small spans (5-50 px). SIMD-hostile.
  • Benefit: ~1.2x — the R_ApplyLight rework (Phase 4) is the real fix.

TurbulentPow2() — Water/Warp Surfaces

  • File: sw_scan.c:169-178
  • 3 data-dependent table lookups per pixel (2x turb[], 1x texture). Cross-coordinate turbulence dependency. 16 pixels per sub-span.
  • Benefit: ~1.5x with AVX2, marginal on other platforms.

D_WarpScreen() — Full-Screen Warp

  • File: sw_scan.c:86-89
  • Double pointer indirection (row[turb[u]][col[u]]). Fundamentally serial memory access pattern. No SIMD instruction can do pointer-chasing gathers.
  • Benefit: negligible.

R_ApplyLight Rework

Problem Statement

R_ApplyLight() (sw_image.c:453-487) is a systemic bottleneck acknowledged in the source with a TODO: -22% fps lost comment. It performs 7 chained table lookups per pixel and is called from two performance-critical paths:

  • Surface cache building (sw_surf.c:109)
  • Alias model rendering (sw_polyset.c:735)

Rework Options Considered

Option A: Precomputed combined table Direct (light_r, light_g, light_b, texel) -> pixel table. Size: 64^3 x 256 = 67M entries. Rejected: too large.

Option B: Intermediate RGB table Precompute rgb_from_lit_texel[64][256] per channel (16 KB each, 48 KB total). Reduces 7 lookups to 4. Marginal improvement. Not worth the complexity.

Option C: 16-bit RGB internal framebuffer Switch from 8-bit paletted to RGB565 internally. Eliminates R_ApplyLight entirely — lighting becomes direct RGB multiply. Large architectural change. Trade-offs: doubles framebuffer/cache size, changes the visual aesthetic, requires texture format conversion. Best long-term option but highest risk.

Option D: Batch R_ApplyLight across pixels (recommended) Restructure callers to process 4-8 pixels at a time. Each pixel's lookup chain is independent, enabling cross-pixel vectorization:

  • With AVX2: 3 vpgatherdd for vid_colormap[], 3 for d_8to24table[], 1 for d_16to8table[] = 7 gathers for 8 pixels (vs 7 scalar lookups per pixel currently)
  • With SSE2/NEON/POWER9: extract-load-insert pattern, but still amortizes loop overhead and enables SIMD for the arithmetic portions (shifts, masks, OR-packing for RGB565)

Platform     Strategy                                Speedup
AVX2         7 gathers for 8 pixels + SIMD packing   ~4x
SSE2         Scalar lookups + SIMD packing           ~1.5x
NEON         Scalar lookups + SIMD packing           ~1.5x
POWER9 VMX   Scalar lookups + SIMD packing           ~1.5x

Recommendation: Implement Option D. It works within the existing 8-bit architecture, requires no table precomputation, and gives the biggest win on AVX2 while still providing modest gains elsewhere.
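
A shape sketch of the AVX2 path under Option D, assuming 8 texels and the three light_masked values already widened to one 32-bit lane per pixel. Each gather performs a 32-bit load, so the byte tables need 3 bytes of tail padding; names are illustrative, not the sw_image.c code:

#include <immintrin.h>
#include <stdint.h>

static void
ApplyLight8_AVX2(const uint8_t *vid_colormap, const uint8_t *d_8to24table,
                 const uint8_t *d_16to8table,
                 __m256i pix,          /* 8 texels, one per 32-bit lane */
                 __m256i light_r, __m256i light_g, __m256i light_b,
                 uint8_t *out)
{
    const __m256i bytemask = _mm256_set1_epi32(0xFF);

    /* 1-3: vid_colormap[light_masked[ch] + pix], one gather per channel. */
    __m256i i_r = _mm256_and_si256(_mm256_i32gather_epi32(
        (const int *)vid_colormap, _mm256_add_epi32(light_r, pix), 1), bytemask);
    __m256i i_g = _mm256_and_si256(_mm256_i32gather_epi32(
        (const int *)vid_colormap, _mm256_add_epi32(light_g, pix), 1), bytemask);
    __m256i i_b = _mm256_and_si256(_mm256_i32gather_epi32(
        (const int *)vid_colormap, _mm256_add_epi32(light_b, pix), 1), bytemask);

    /* 4-6: d_8to24table[i * 4 + channel], one gather per channel. */
    __m256i b_r = _mm256_and_si256(_mm256_i32gather_epi32(
        (const int *)d_8to24table, _mm256_slli_epi32(i_r, 2), 1), bytemask);
    __m256i b_g = _mm256_and_si256(_mm256_i32gather_epi32(
        (const int *)(d_8to24table + 1), _mm256_slli_epi32(i_g, 2), 1), bytemask);
    __m256i b_b = _mm256_and_si256(_mm256_i32gather_epi32(
        (const int *)(d_8to24table + 2), _mm256_slli_epi32(i_b, 2), 1), bytemask);

    /* 7: pack to RGB565 in-register, then the final d_16to8table gather. */
    __m256i i_c = _mm256_or_si256(_mm256_srli_epi32(b_r, 3),
        _mm256_or_si256(
            _mm256_slli_epi32(_mm256_srli_epi32(b_g, 2), 5),
            _mm256_slli_epi32(_mm256_srli_epi32(b_b, 3), 11)));
    __m256i c = _mm256_and_si256(_mm256_i32gather_epi32(
        (const int *)d_16to8table, i_c, 1), bytemask);

    /* Pack the 8 palette indices down to bytes. */
    __m128i w = _mm_packus_epi32(_mm256_castsi256_si128(c),
                                 _mm256_extracti128_si256(c, 1));
    _mm_storel_epi64((__m128i *)out, _mm_packus_epi16(w, w));
}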

Callers to modify:

  • R_DrawSurfaceBlock_Light() in sw_surf.c — process block rows in batches of 4-8 pixels
  • R_PolysetDrawSpans8_Opaque() in sw_polyset.c — process span pixels in batches (more complex due to z-test branching and texture carry)

Implementation Phases

Phase 1: Infrastructure

1a. SIMD detection header (header/simd.h)

Compile-time detection macros:

/* x86 */
#if defined(__AVX2__)
  #define YQ2_SIMD_AVX2 1
#endif
#if defined(__SSE4_1__)
  #define YQ2_SIMD_SSE41 1
#endif
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
  #define YQ2_SIMD_SSE2 1
#endif

/* ARM */
#if defined(__ARM_NEON) || defined(__aarch64__)
  #define YQ2_SIMD_NEON 1
#endif

/* POWER */
#if defined(__POWER9_VECTOR__) || \
    (defined(__ALTIVEC__) && defined(_ARCH_PWR9))
  #define YQ2_SIMD_VMX_P9 1
#elif defined(__ALTIVEC__)
  #define YQ2_SIMD_VMX 1
#endif

Runtime dispatch via function pointers, initialized once at renderer startup. On x86, use SDL_HasAVX2() / SDL_HasSSE41() for safe runtime detection. On ARM and POWER, compile-time detection is sufficient (features are guaranteed by the target ABI/CPU).
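
An illustrative sketch of the startup dispatch. The CopyFrame_* kernel names are hypothetical; each kernel lives in its own sw_simd_*.c file compiled with the right -m flags, so only declarations are needed here:

#include <SDL.h>  /* SDL_HasAVX2, SDL_HasSSE41 */

typedef void (*copyframe_fn)(const unsigned char *src, Uint32 *dst, int count);

extern void CopyFrame_Scalar(const unsigned char *, Uint32 *, int);
extern void CopyFrame_SSE41(const unsigned char *, Uint32 *, int);
extern void CopyFrame_AVX2(const unsigned char *, Uint32 *, int);

copyframe_fn copyframe_impl = CopyFrame_Scalar;  /* always-available fallback */

void
R_InitSIMDDispatch(void)
{
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
    if (SDL_HasSSE41())
    {
        copyframe_impl = CopyFrame_SSE41;
    }
    if (SDL_HasAVX2())
    {
        copyframe_impl = CopyFrame_AVX2;   /* best available wins */
    }
#endif
    /* On ARM and POWER the NEON/VMX kernels are selected at compile time,
     * so no runtime check is needed (per the note above). */
}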

1b. Build system changes (see Build System Changes)

Phase 2: Tier 1 Targets

Implement SIMD for the four highest-impact loops:

Target                         ~Lines per platform
RE_CopyFrame                   30-50
D_DrawZSpans                   20-30
R_BuildLightMap (bound)        15-20
R_BuildLightMap (accumulate)   25-35

Estimated total: ~400-500 lines across all platforms.

Phase 3: Tier 2 Targets

Target                         ~Lines per platform
R_DrawSurfaceBlock fast path   30-40
D_DrawSpansPow2                50-80

Estimated total: ~400-600 lines across all platforms. The span drawer is more complex due to three code paths (horizontal/vertical/diagonal).

Phase 4: R_ApplyLight Rework

Restructure R_DrawSurfaceBlock_Light() and R_PolysetDrawSpans8_Opaque() to batch-process pixels, then apply SIMD to the batched lookups.

Estimated total: ~300-500 lines (most complexity is in restructuring the scalar code, not the intrinsics).


File Organization

src/client/refresh/soft/
├── header/
│   ├── local.h            (existing — add SIMD dispatch function pointer decls)
│   ├── model.h            (existing — unchanged)
│   └── simd.h             (NEW — detection macros, dispatch init prototype)
├── sw_main.c              (modify — SIMD dispatch in RE_CopyFrame, init call)
├── sw_scan.c              (modify — SIMD dispatch in D_DrawZSpans, D_DrawSpansPow2)
├── sw_surf.c              (modify — SIMD dispatch in R_DrawSurfaceBlock)
├── sw_light.c             (modify — SIMD dispatch in R_BuildLightMap)
├── sw_image.c             (modify — batch R_ApplyLight rework)
├── sw_polyset.c           (modify — batch pixel processing for alias models)
├── sw_simd_sse2.c         (NEW — SSE2 implementations)
├── sw_simd_sse41.c        (NEW — SSE4.1, compiled with -msse4.1)
├── sw_simd_avx2.c         (NEW — AVX2, compiled with -mavx2)
├── sw_simd_neon.c         (NEW — NEON, AArch64 only)
└── sw_simd_vmx.c          (NEW — VMX/VSX, compiled with -mcpu=power9 -mvsx)

Each sw_simd_*.c file implements the same set of functions with platform-specific intrinsics. The scalar fallback remains in the original source files. At init time, function pointers are set to the best available implementation.


Build System Changes

Makefile

Architecture-specific SIMD files need per-file compiler flags since the base CFLAGS do not enable AVX2/SSE4.1/POWER9:

# SSE4.1 (x86/x86_64 only)
build/ref_soft/sw_simd_sse41.o: src/client/refresh/soft/sw_simd_sse41.c
	$(CC) -c $(CFLAGS) -msse4.1 $(SDLCFLAGS) $(INCLUDE) -o $@ $<

# AVX2 (x86/x86_64 only)
build/ref_soft/sw_simd_avx2.o: src/client/refresh/soft/sw_simd_avx2.c
	$(CC) -c $(CFLAGS) -mavx2 $(SDLCFLAGS) $(INCLUDE) -o $@ $<

# POWER9 VMX/VSX (ppc64le only)
build/ref_soft/sw_simd_vmx.o: src/client/refresh/soft/sw_simd_vmx.c
	$(CC) -c $(CFLAGS) -mcpu=power9 -mvsx $(SDLCFLAGS) $(INCLUDE) -o $@ $<

NEON on AArch64 and SSE2 on x86_64 need no special flags; both are baseline features of their respective architectures.

Conditional inclusion in REFSOFT_OBJS_ based on YQ2_ARCH:

ifeq ($(YQ2_ARCH),x86_64)
REFSOFT_OBJS_ += sw_simd_sse2.o sw_simd_sse41.o sw_simd_avx2.o
else ifeq ($(YQ2_ARCH),i386)
REFSOFT_OBJS_ += sw_simd_sse2.o
else ifeq ($(YQ2_ARCH),aarch64)
REFSOFT_OBJS_ += sw_simd_neon.o
else ifneq (,$(findstring powerpc,$(YQ2_ARCH)))
REFSOFT_OBJS_ += sw_simd_vmx.o
endif

CMakeLists.txt

Similar conditional logic using CMAKE_SYSTEM_PROCESSOR. Note: CMakeLists.txt is marked as unmaintained in the project; Makefile is the primary build system.


Estimated Impact Summary

Optimization                    x86 SSE2   x86 AVX2   ARM NEON   POWER9 VMX   Effort
RE_CopyFrame                    1.5x       4-6x       1.5x       1.5x         Low
D_DrawZSpans                    3-4x       6-8x       3-4x       3-4x         Low
R_BuildLightMap (bound)         3-4x       6-8x       3-4x       3-4x         Low
R_BuildLightMap (accumulate)    2-3x       4-6x       2-3x       2-4x*        Low
R_DrawSurfaceBlock fast path    1.3x       2-3x       1.3x       1.3x         Medium
D_DrawSpansPow2                 1.3x       2-3x       1.3x       1.3x         Medium
R_ApplyLight batch (Option D)   1.5x       ~4x        1.5x       1.5x         High
*POWER9's vmsumubm can match AVX2 throughput when scale fits in a byte.

These are per-loop speedups. The overall frame time improvement depends on what fraction of time is spent in each loop, which varies by scene complexity, resolution, and whether the surface cache is warm.


Multicore Feasibility (Deferred)

A full multicore analysis was performed as a precursor to this SIMD plan. Threading is deferred because it requires significant architectural refactoring (encapsulating ~50+ global variables into a render context struct), whereas SIMD can be applied incrementally to existing code.

Why Threading Is Hard

The renderer uses massive shared mutable global state:

  • View vectors (vpn, vup, vright) are mutated mid-frame during brush model rendering and restored afterward
  • Edge/surface allocators (edge_p++, surface_p++, span_p) are unsynchronized bump allocators
  • Texture state (cacheblock, cachewidth, d_sdivzstepu, etc.) is global, set before each surface draw
  • Lighting buffer (blocklights[]) is a single shared accumulation buffer
  • The scanline loop in R_ScanEdges flushes spans mid-loop when the span buffer fills, coupling span generation with span drawing

Threading Strategies (for future consideration)

  1. Encapsulate state into render_context_t — prerequisite for everything (a first-cut sketch follows this list)
  2. Horizontal band parallelism — divide screen into bands, run full pipeline per band with independent contexts. Architecturally cleanest but requires near-complete state isolation.
  3. Parallel entity rendering — alias models are independent. Per-thread vertex/span buffers, shared z-buffer with atomic or per-band partitioning.
  4. Parallel surface cache building — each surface is independent. Per-thread blocklights[], thread-safe cache allocator.
  5. Parallel RE_CopyFrame — trivial band decomposition.
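
A hypothetical first cut of the render context from strategy 1. Field names mirror the globals listed above; the types come from the renderer's existing headers, the blocklights size is illustrative, and the real struct would have to absorb all ~50 globals before band-based threading is attempted:

#include "header/local.h"  /* existing renderer types: pixel_t, light_t, ... */

typedef struct render_context_s
{
    /* View vectors, currently mutated mid-frame for brush models. */
    vec3_t vpn, vup, vright;

    /* Per-thread bump allocators replacing edge_p / surface_p / span_p. */
    edge_t *edge_p;
    surf_t *surface_p;
    espan_t *span_p;

    /* Per-surface texture state set before each surface draw. */
    pixel_t *cacheblock;
    int cachewidth;

    /* Per-thread lighting accumulation buffer. */
    light_t blocklights[1024 * 3];
} render_context_t;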

POWER9 Threading Advantage

POWER9's SMT4/SMT8 makes threading particularly attractive on this platform. For gather-heavy operations where SIMD provides little benefit (palette lookup, R_ApplyLight), running 4-8 scalar threads on the same core naturally hides memory latency. A modest 2-4 thread split of entity rendering or framebuffer conversion would already utilize the POWER9 hardware better than single-threaded scalar code, without requiring the full state-encapsulation refactoring needed for band-based parallelism.
