Large vectors allocated during ReadRafSumcheckProver::gen:

| Object | Size driver | Lifetime | Residency |
|---|---|---|---|
| lookup_indices, lookup_indices_uninterleave | T packed keys / interleaved subsets | Needed until Stage‑2 start | GPU-only per shard once Stage‑2 materialization moves to GPU; host can drop them because ra never rematerializes on CPU |
| lookup_indices_identity | Up to T indices | Needed per phase and again in cache_openings (to sum RAF-flag EQs) | CPU master (required for final RAF flag) + GPU shard slices |
| lookup_indices_by_table | T total entries grouped by opcode | Used every phase in init_suffix_polys and later in cache_openings | CPU master (needed for lookup-table flags) + GPU shard buckets |
| lookup_tables, is_interleaved_operands | T entries | Needed for Stage‑1 suffixes and Stage‑2 value materialization | GPU-only per shard after Stage‑2 materialization; CPU holds them only if it must redo Stage‑2 locally |
| u_evals_rv, u_evals_raf | ≈ 2T field elements | Rescaled at each phase, dropped before Stage‑2 | GPU only |
| suffix_polys | NUM_TABLE × suffixes × M (256) | Rebuilt each phase, bound every round | GPU produces, CPU consumes (no GPU copy once sent) |
| Prefix-suffix Q buffers (left/right/identity_ps) | ORDER × M each | Rebuilt/bound every round | CPU only |
| Expanding tables v[phase] | 16 × M | Maintain prefix products, feed Stage‑2 | CPU only |
| GruenSplitEqPolynomial (eq_r_spartan, eq_r_branch) | O(T) | Used to form u_evals_* and later Gruen rounds | CPU master; per-phase slices can be cached on GPU if needed |
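As a rough sanity check on the sizes above, here is a hedged sizing sketch. T, the GPU count, the table count, and the 32-byte field element are illustrative assumptions, not values taken from the codebase:

```rust
// Hypothetical sizing helper: estimates per-GPU resident bytes for the
// dominant Stage-1 vectors — the two u_evals_* vectors (~2T field
// elements split across shards) plus one phase's suffix polynomials
// (num_tables x 256 elements). All constants are assumptions.
const FIELD_BYTES: usize = 32; // e.g. a 256-bit scalar field

fn per_gpu_bytes(t: usize, num_gpus: usize, num_tables: usize) -> usize {
    let shard = t / num_gpus;
    let u_evals = 2 * shard * FIELD_BYTES; // u_evals_rv + u_evals_raf shard
    let suffix = num_tables * 256 * FIELD_BYTES; // one phase's suffix polys
    u_evals + suffix
}

fn main() {
    // e.g. T = 2^24 cycles over 4 GPUs with 16 lookup tables
    let bytes = per_gpu_bytes(1 << 24, 4, 16);
    println!("{} MiB per GPU", bytes >> 20); // prints "256 MiB per GPU"
}
```

The u_evals shards dominate; the 256-entry suffix polynomials are negligible by comparison, which is why they are the natural objects to ship GPU→CPU each phase.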
Per-phase flow (16 phases, 8 rounds each):
- CPU sends the 8 challenges from the previous phase to the GPUs (8 field elements).
- Each GPU rescales its local u_evals_*, buckets cycles, and computes suffix polynomials (NUM_TABLE × 256 values per GPU).
- GPUs ship their suffix polynomials to the CPU; the CPU stitches the shards.
- CPU runs the 8 rounds: binds prefix/suffix structures, updates prefix_registry, and evolves v[phase].
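The per-phase loop above can be sketched as follows. This is a hedged toy model, not the Jolt API: each "GPU" is a function over its shard, field elements are stand-in `u64` values mod a toy 61-bit prime, and only a single table's suffix polynomial is shown:

```rust
// Toy modulus for illustration only; the real prover works over a
// cryptographic scalar field.
const P: u64 = (1 << 61) - 1;

/// One GPU's work for a phase (illustrative): rescale the local
/// u_evals shard by a factor derived from the 8 previous-phase
/// challenges, then bucket into a 256-entry suffix polynomial.
fn gpu_phase(u_evals: &mut [u64], challenges: &[u64; 8], keys: &[u8]) -> [u64; 256] {
    let scale = challenges
        .iter()
        .fold(1u64, |acc, &c| ((acc as u128 * c as u128) % P as u128) as u64);
    let mut suffix = [0u64; 256];
    for (u, &k) in u_evals.iter_mut().zip(keys) {
        *u = ((*u as u128 * scale as u128) % P as u128) as u64; // rescale in place
        suffix[k as usize] = (suffix[k as usize] + *u) % P; // bucket by suffix key
    }
    suffix // 256 field elements shipped GPU -> CPU
}

/// CPU side: stitch the per-GPU suffix shards by elementwise addition.
fn stitch(shards: &[[u64; 256]]) -> [u64; 256] {
    let mut out = [0u64; 256];
    for s in shards {
        for (o, v) in out.iter_mut().zip(s) {
            *o = (*o + *v) % P;
        }
    }
    out
}

fn main() {
    let challenges = [2, 3, 5, 7, 11, 13, 17, 19];
    let mut shard_a = vec![1, 2, 3, 4];
    let mut shard_b = vec![5, 6, 7, 8];
    let sa = gpu_phase(&mut shard_a, &challenges, &[0, 1, 0, 1]);
    let sb = gpu_phase(&mut shard_b, &challenges, &[1, 0, 1, 0]);
    let total = stitch(&[sa, sb]);
    println!("stitched suffix[0] = {}", total[0]);
}
```

Note the traffic asymmetry this models: the challenges going down are tiny, and only the fixed-size suffix polynomials come back up, regardless of T.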
Dual-resident objects (Stage‑1): only the data needed at proof finalization keeps CPU copies: lookup_indices_by_table and lookup_indices_identity. Everything else (e.g., lookup_indices, lookup_tables) can be GPU-only once Stage‑2 materializes on device.
Improved plan keeps large vectors on GPUs until domains shrink:

| Object | Size driver | Phase‑2 lifetime | Residency |
|---|---|---|---|
| ra, combined_val, combined_raf | T each | Materialize at Stage‑2 start, bound every round | GPU-only while shard length stays a power of two; once small, download final slices to CPU and free GPU buffers |
| prefixes (Vec<PrefixEval>) | #prefixes (~tens) | Broadcast once | CPU storage, copied to GPUs as constants |
| prev_round_poly_spartan/branch | Degree‑3 round polynomials | Maintained per shard while GPUs run | GPU while active; CPU recomputes after fallback |
| eq_r_spartan, eq_r_branch | O(T) | Needed throughout log_T rounds | CPU master copy; upload per-GPU slices as needed |
| eq_r_cycle_prime | T | Only used when caching lookup-table openings | CPU-only vector created in cache_openings; no GPU copy is kept |
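The "bound every round" column above refers to the standard sumcheck binding step, sketched here under illustrative assumptions (a toy 61-bit modulus and a plain `Vec`; the real prover binds field elements in place on the device):

```rust
// Toy Mersenne modulus for illustration only.
const P: u64 = (1 << 61) - 1;

/// Bind one variable of a GPU-resident vector such as ra or
/// combined_val: v'[i] = v[2i] + r * (v[2i+1] - v[2i]) mod P,
/// halving the length each round.
fn bind(v: &mut Vec<u64>, r: u64) {
    assert!(v.len().is_power_of_two() && v.len() >= 2);
    for i in 0..v.len() / 2 {
        let (lo, hi) = (v[2 * i], v[2 * i + 1]);
        let diff = (hi + P - lo) % P; // hi - lo without underflow
        v[i] = ((lo as u128 + r as u128 * diff as u128) % P as u128) as u64;
    }
    v.truncate(v.len() / 2);
}

fn main() {
    let mut v = vec![1, 2, 3, 4];
    bind(&mut v, 10); // [1 + 10*1, 3 + 10*1] = [11, 13]
    bind(&mut v, 10); // [11 + 10*2] = [31]
    println!("{:?}", v); // prints "[31]"
}
```

Because each round halves the vector, keeping ra/Val/RafVal on the GPU is only worthwhile while the shard is long; this is what motivates the fallback rule below.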
GPU→CPU fallback: once a shard length is no longer a power of two (or below a bandwidth threshold), copy the residual ra/Val/RafVal to CPU, merge shards, and run the remaining small number of rounds on CPU exclusively. After the copy, those vectors are CPU-only; GPUs release the storage.
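The fallback trigger can be stated as a small predicate. The threshold value and the function name are assumptions for illustration:

```rust
// Hedged sketch of the GPU->CPU fallback condition described above.
fn should_fall_back(shard_len: usize, min_gpu_len: usize) -> bool {
    // Fall back once the shard is no longer a power of two, or is so
    // small that transfer latency dominates GPU round time.
    !shard_len.is_power_of_two() || shard_len < min_gpu_len
}

fn main() {
    assert!(should_fall_back(3, 1024)); // odd length: merge on CPU
    assert!(should_fall_back(512, 1024)); // below bandwidth threshold
    assert!(!should_fall_back(1 << 20, 1024)); // keep on GPU
    println!("fallback checks ok");
}
```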
- Per-phase Stage‑1 traffic: 8 field elements CPU→GPU; NUM_TABLE × 256 field elements GPU→CPU.
- Stage‑2 handoff: only the final compact shards are copied once, when the GPU domain shrinks.
- Dual-resident data is limited to lookup_indices_by_table and lookup_indices_identity (both needed for the final cache_openings); every other large vector lives on exactly one side at any moment.
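The traffic figures above can be made concrete with a hedged calculation, assuming 32-byte field elements; NUM_TABLE and the GPU count are parameters here, not values taken from the codebase:

```rust
// Illustrative field-element size, e.g. a 256-bit scalar.
const FIELD_BYTES: usize = 32;

/// Returns (CPU->GPU bytes, GPU->CPU bytes) for one Stage-1 phase.
fn phase_traffic(num_tables: usize, num_gpus: usize) -> (usize, usize) {
    let down = 8 * FIELD_BYTES * num_gpus; // 8 challenges to each GPU
    let up = num_tables * 256 * FIELD_BYTES * num_gpus; // suffix polys back
    (down, up)
}

fn main() {
    let (down, up) = phase_traffic(16, 4);
    println!("{} B down, {} KiB up per phase", down, up >> 10);
    // prints "1024 B down, 512 KiB up per phase"
}
```

Both directions are independent of T, which is the point of the plan: the O(T) vectors never cross the bus during Stage‑1.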