Large vectors allocated during ReadRafSumcheckProver::gen:

| Object | Size driver | Lifetime | Residency |
|---|---|---|---|
| lookup_indices, lookup_indices_uninterleave | T packed keys / interleaved subsets | Needed until Stage‑2 start | GPU-only per shard once Stage‑2 materialization moves to GPU; host can drop them because ra never rematerializes on CPU |
| lookup_indices_identity | Up to T indices | Needed per phase and again in cache_openings (to sum RAF-flag EQs) | CPU master (required for final RAF flag) + GPU shard slices |
| lookup_indices_by_table | T total entries grouped by opcode | Used every phase in init_suffix_polys and later in cache_openings | CPU master (needed for lookup-table flags) + GPU shard buckets |
| lookup_tables, is_interleaved_operands | T entries | Needed for Stage‑1 suffixes and Stage‑2 value materialization | GPU-only per shard after Stage‑2 materialization; CPU holds them only if it must redo Stage‑2 locally |
| u_evals_rv, u_evals_raf | ≈ 2T field elements | Rescaled at each phase, dropped before Stage‑2 | GPU only |
| suffix_polys | NUM_TABLE × suffixes × M (256) | Rebuilt each phase, bound every round | GPU produces, CPU consumes (no GPU copy once sent) |
| Prefix-suffix Q buffers (left/right/identity_ps) | ORDER × M each | Rebuilt/bound every round | CPU only |
| Expanding tables v[phase] | 16 × M | Maintain prefix products, feed Stage‑2 | CPU only |
| GruenSplitEqPolynomial (eq_r_spartan, eq_r_branch) | O(T) | Used to form u_evals_* and later Gruen rounds | CPU master; per-phase slices can be cached on GPU if needed |
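As a rough sanity check on the sizes above, here is a hedged sizing sketch. T, the GPU count, the table count, and the 32-byte field element are illustrative assumptions, not values taken from the codebase:

```rust
// Hypothetical sizing helper: estimates per-GPU resident bytes for the
// dominant Stage-1 vectors — the two u_evals_* vectors (~2T field
// elements split across shards) plus one phase's suffix polynomials
// (num_tables x 256 elements). All constants are assumptions.
const FIELD_BYTES: usize = 32; // e.g. a 256-bit scalar field

fn per_gpu_bytes(t: usize, num_gpus: usize, num_tables: usize) -> usize {
    let shard = t / num_gpus;
    let u_evals = 2 * shard * FIELD_BYTES; // u_evals_rv + u_evals_raf shard
    let suffix = num_tables * 256 * FIELD_BYTES; // one phase's suffix polys
    u_evals + suffix
}

fn main() {
    // e.g. T = 2^24 cycles over 4 GPUs with 16 lookup tables
    let bytes = per_gpu_bytes(1 << 24, 4, 16);
    println!("{} MiB per GPU", bytes >> 20); // prints "256 MiB per GPU"
}
```

The u_evals shards dominate; the 256-entry suffix polynomials are negligible by comparison, which is why they are the natural objects to ship GPU→CPU each phase.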
Per-phase flow (16 phases, 8 rounds each):
- CPU sends the 8 challenges from the previous phase to the GPUs (8 field elements).
- Each GPU rescales its local u_evals_*, buckets cycles, and computes suffix polynomials (NUM_TABLE × 256 values per GPU).
- GPUs ship their suffix polynomials to the CPU; the CPU stitches the shards.
- CPU runs the 8 rounds: binds prefix/suffix structures, updates prefix_registry, and evolves v[phase].
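The per-phase loop above can be sketched as follows. This is a hedged toy model, not the Jolt API: each "GPU" is a function over its shard, field elements are stand-in `u64` values mod a toy 61-bit prime, and only a single table's suffix polynomial is shown:

```rust
// Toy modulus for illustration only; the real prover works over a
// cryptographic scalar field.
const P: u64 = (1 << 61) - 1;

/// One GPU's work for a phase (illustrative): rescale the local
/// u_evals shard by a factor derived from the 8 previous-phase
/// challenges, then bucket into a 256-entry suffix polynomial.
fn gpu_phase(u_evals: &mut [u64], challenges: &[u64; 8], keys: &[u8]) -> [u64; 256] {
    let scale = challenges
        .iter()
        .fold(1u64, |acc, &c| ((acc as u128 * c as u128) % P as u128) as u64);
    let mut suffix = [0u64; 256];
    for (u, &k) in u_evals.iter_mut().zip(keys) {
        *u = ((*u as u128 * scale as u128) % P as u128) as u64; // rescale in place
        suffix[k as usize] = (suffix[k as usize] + *u) % P; // bucket by suffix key
    }
    suffix // 256 field elements shipped GPU -> CPU
}

/// CPU side: stitch the per-GPU suffix shards by elementwise addition.
fn stitch(shards: &[[u64; 256]]) -> [u64; 256] {
    let mut out = [0u64; 256];
    for s in shards {
        for (o, v) in out.iter_mut().zip(s) {
            *o = (*o + *v) % P;
        }
    }
    out
}

fn main() {
    let challenges = [2, 3, 5, 7, 11, 13, 17, 19];
    let mut shard_a = vec![1, 2, 3, 4];
    let mut shard_b = vec![5, 6, 7, 8];
    let sa = gpu_phase(&mut shard_a, &challenges, &[0, 1, 0, 1]);
    let sb = gpu_phase(&mut shard_b, &challenges, &[1, 0, 1, 0]);
    let total = stitch(&[sa, sb]);
    println!("stitched suffix[0] = {}", total[0]);
}
```

Note the traffic asymmetry this models: the challenges going down are tiny, and only the fixed-size suffix polynomials come back up, regardless of T.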
Dual-resident objects (Stage‑1): only the data needed at proof finalization keeps CPU copies: lookup_indices_by_table and lookup_indices_identity. Everything else (e.g., lookup_indices, lookup_tables) can be GPU-only once Stage‑2 materializes on device.
Improved plan keeps large vectors on GPUs until domains shrink:

| Object | Size driver | Phase‑2 lifetime | Residency |
|---|---|---|---|
| ra, combined_val, combined_raf | T each | Materialize at Stage‑2 start, bound every round | GPU-only while shard length stays a power of two; once small, download final slices to CPU and free GPU buffers |
| prefixes (Vec<PrefixEval>) | #prefixes (~tens) | Broadcast once | CPU storage, copied to GPUs as constants |
| prev_round_poly_spartan/branch | Degree‑3 round polynomials | Maintained per shard while GPUs run | GPU while active; CPU recomputes after fallback |
| eq_r_spartan, eq_r_branch | O(T) | Needed throughout log_T rounds | CPU master copy; upload per-GPU slices as needed |
| eq_r_cycle_prime | T | Only used when caching lookup-table openings | CPU-only vector created in cache_openings; no GPU copy is kept |
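The "bound every round" column above refers to the standard sumcheck binding step, sketched here under illustrative assumptions (a toy 61-bit modulus and a plain `Vec`; the real prover binds field elements in place on the device):

```rust
// Toy Mersenne modulus for illustration only.
const P: u64 = (1 << 61) - 1;

/// Bind one variable of a GPU-resident vector such as ra or
/// combined_val: v'[i] = v[2i] + r * (v[2i+1] - v[2i]) mod P,
/// halving the length each round.
fn bind(v: &mut Vec<u64>, r: u64) {
    assert!(v.len().is_power_of_two() && v.len() >= 2);
    for i in 0..v.len() / 2 {
        let (lo, hi) = (v[2 * i], v[2 * i + 1]);
        let diff = (hi + P - lo) % P; // hi - lo without underflow
        v[i] = ((lo as u128 + r as u128 * diff as u128) % P as u128) as u64;
    }
    v.truncate(v.len() / 2);
}

fn main() {
    let mut v = vec![1, 2, 3, 4];
    bind(&mut v, 10); // [1 + 10*1, 3 + 10*1] = [11, 13]
    bind(&mut v, 10); // [11 + 10*2] = [31]
    println!("{:?}", v); // prints "[31]"
}
```

Because each round halves the vector, keeping ra/Val/RafVal on the GPU is only worthwhile while the shard is long; this is what motivates the fallback rule below.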
GPU→CPU fallback: once a shard length is no longer a power of two (or below a bandwidth threshold), copy the residual ra/Val/RafVal to CPU, merge shards, and run the remaining small number of rounds on CPU exclusively. After the copy, those vectors are CPU-only; GPUs release the storage.
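The fallback trigger can be stated as a small predicate. The threshold value and the function name are assumptions for illustration:

```rust
// Hedged sketch of the GPU->CPU fallback condition described above.
fn should_fall_back(shard_len: usize, min_gpu_len: usize) -> bool {
    // Fall back once the shard is no longer a power of two, or is so
    // small that transfer latency dominates GPU round time.
    !shard_len.is_power_of_two() || shard_len < min_gpu_len
}

fn main() {
    assert!(should_fall_back(3, 1024)); // odd length: merge on CPU
    assert!(should_fall_back(512, 1024)); // below bandwidth threshold
    assert!(!should_fall_back(1 << 20, 1024)); // keep on GPU
    println!("fallback checks ok");
}
```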
- Per-phase Stage‑1 traffic: 8 field elements CPU→GPU; NUM_TABLE × 256 field elements GPU→CPU.
- Stage‑2 handoff: only the final compact shards are copied once, when the GPU domain shrinks.
- Dual-resident data is limited to lookup_indices_by_table and lookup_indices_identity (both needed for the final cache_openings); every other large vector lives on exactly one side at any moment.
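The traffic figures above can be made concrete with a hedged calculation, assuming 32-byte field elements; NUM_TABLE and the GPU count are parameters here, not values taken from the codebase:

```rust
// Illustrative field-element size, e.g. a 256-bit scalar.
const FIELD_BYTES: usize = 32;

/// Returns (CPU->GPU bytes, GPU->CPU bytes) for one Stage-1 phase.
fn phase_traffic(num_tables: usize, num_gpus: usize) -> (usize, usize) {
    let down = 8 * FIELD_BYTES * num_gpus; // 8 challenges to each GPU
    let up = num_tables * 256 * FIELD_BYTES * num_gpus; // suffix polys back
    (down, up)
}

fn main() {
    let (down, up) = phase_traffic(16, 4);
    println!("{} B down, {} KiB up per phase", down, up >> 10);
    // prints "1024 B down, 512 KiB up per phase"
}
```

Both directions are independent of T, which is the point of the plan: the O(T) vectors never cross the bus during Stage‑1.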