Feat: jagged pcs#40

Merged
kunxian-xia merged 26 commits into
mainfrom
feat/jagged_pcs
May 13, 2026
Conversation


@kunxian-xia kunxian-xia commented Apr 24, 2026

Summary

This PR bundles work from several separate PRs and commits.

| PR/Commit | Note |
| --- | --- |
| #31 | compute $q(b)$ from the input row-major matrices |
| #32 | run the jagged sumcheck $v = \sum_b q(b) * f(b)$ on a memory-constrained device (note that both $q$ and $f$ have around 30 to 32 variables) |
| #39 | evaluator for the indicator function $g(z_1, z_2, z_3, z_4)$ |
| 4c3a23f | verifier for $f(z) = \sum_c \textrm{eq}(z_c, c) * g(z_r, z, t_c, t_{c+1})$ |
| #42 | jagged assist sumcheck |
| #48 | stacked PCS (reshape $q(b)$ as several smaller multilinear polynomials) |
| #47 | batch open multiple sets of matrices |

Key difference from SP1's Jagged PCS

Unlike SP1's jagged PCS (eprint 2025/917), which uses the raw heights $h_i$ as block sizes in the packed polynomial $q'$, our implementation uses $2^{s_i}$ (where $s_i = \lceil \log_2 h_i \rceil$). The reason is that Ceno's main sumcheck evaluates polynomials at a suffix of the challenge point, so a bit-reversal permutation (suffix-to-prefix transformation) is needed to make the jagged sumcheck work with prefix-aligned points.

Before bit-reversal, each polynomial $p_i$ is zero-padded from $h_i$ to $2^{s_i}$ entries. After bit-reversal, these zeros are scattered throughout the block (not contiguous at the end), so each polynomial must occupy a full $2^{s_i}$-entry block in $q'$. This means $q'$ contains $\sum_i (2^{s_i} - h_i)$ extra zeros compared to SP1's approach.
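
To make the padding cost concrete, here is a minimal standalone sketch (not the code in this PR; the helper names `pad_and_bit_reverse` and `extra_zeros` are made up for illustration). It zero-pads a column of height $h_i$ to $2^{s_i}$, applies the bit-reversal permutation, and counts the extra zeros $\sum_i (2^{s_i} - h_i)$.

```rust
/// Reverse the lowest `bits` bits of `i`.
fn bit_reverse(i: usize, bits: u32) -> usize {
    if bits == 0 {
        0
    } else {
        i.reverse_bits() >> (usize::BITS - bits)
    }
}

/// Zero-pad `p` to the next power of two, then permute by bit-reversed index.
/// (Hypothetical helper, for illustration only.)
fn pad_and_bit_reverse(p: &[u64]) -> Vec<u64> {
    let size = p.len().next_power_of_two();
    let s = size.trailing_zeros();
    let mut out = vec![0u64; size];
    for (i, &v) in p.iter().enumerate() {
        out[bit_reverse(i, s)] = v;
    }
    out
}

/// Extra zeros in q' relative to packing raw heights: sum_i (2^{s_i} - h_i).
fn extra_zeros(heights: &[usize]) -> usize {
    heights.iter().map(|&h| h.next_power_of_two() - h).sum()
}

fn main() {
    // h = 5, so s = 3 and the block has 8 entries.
    let block = pad_and_bit_reverse(&[1, 2, 3, 4, 5]);
    // The three padding zeros land at indices 3, 5 and 7, not at the end.
    println!("{:?}", block);
    println!("extra zeros: {}", extra_zeros(&[5, 3, 8]));
}
```

Because the zeros end up interleaved with real entries, the block cannot be truncated back to $h_i$ entries after the permutation, which is why the full $2^{s_i}$ block is kept.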

Integration Tests

The main integration test (test_jagged_batch_open_verify_small and variants) exercises the full protocol pipeline end-to-end:

  1. Commit — builds matrices, calls jagged_commit (bit-reversal, transpose, concatenation, inner PCS commit)
  2. Evaluate — computes true column polynomial evaluations at random points (ground truth, independent of the jagged machinery)
  3. Batch Open — calls jagged_batch_open (transcript setup, jagged sumcheck, inner PCS open)
  4. Batch Verify — calls jagged_batch_verify (transcript replay, sumcheck verify, ROBP ĝ evaluation, inner PCS verify)

This touches every submodule: types (structs), sumcheck (streaming prover), evaluator (ROBP ĝ), and mod.rs (commit/open/verify).

Three variants cover different scenarios:

  • _small — 3 matrices with different heights ($2^{10}$, $2^{11}$, $2^{9}$), each with 1 column: exercises different-height polynomial handling with correction factors $C_i = \textrm{eq}(z_r[s_i..], 0)$
  • _single_poly — 1 column: edge case where num_polys = 1
  • _soundness — tampers evals[1] += 1 and asserts the verifier rejects: confirms the protocol actually catches cheating
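
The correction factor $C_i = \textrm{eq}(z_r[s_i..], 0)$ can be checked numerically. The sketch below is standalone and rests on assumptions that are not taken from the PR: it uses plain `u64` arithmetic modulo the BabyBear prime instead of the real field types, and it assumes `z[j]` binds bit `j` of the index (least significant first). Under those assumptions, evaluating a zero-padded table at the full point equals evaluating the unpadded table at the first $s_i$ coordinates, multiplied by $\prod_{j \ge s_i} (1 - z_j)$.

```rust
const P: u64 = 2013265921; // BabyBear prime, used here only as a convenient modulus

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// Evaluate the multilinear extension of `evals` (length 2^k) at `z` (length k).
/// Assumed convention: z[j] binds bit j of the index, least significant first.
fn mle_eval(evals: &[u64], z: &[u64]) -> u64 {
    assert_eq!(evals.len(), 1 << z.len());
    let mut cur = evals.to_vec();
    for &zj in z {
        // Fold on the lowest remaining bit: new[i] = cur[2i]*(1 - zj) + cur[2i+1]*zj.
        cur = cur
            .chunks(2)
            .map(|c| add(mul(c[0], sub(1, zj)), mul(c[1], zj)))
            .collect();
    }
    cur[0]
}

/// C = eq(z[s..], 0) = prod_{j >= s} (1 - z[j]).
fn correction_factor(z: &[u64], s: usize) -> u64 {
    z[s..].iter().fold(1, |acc, &zj| mul(acc, sub(1, zj)))
}

fn main() {
    let p = [3u64, 1, 4, 1]; // s = 2
    let p_pad = [3u64, 1, 4, 1, 0, 0, 0, 0]; // zero-padded to 3 variables
    let z = [5u64, 7, 11];

    let lhs = mle_eval(&p_pad, &z);
    let rhs = mul(mle_eval(&p, &z[..2]), correction_factor(&z, 2));
    assert_eq!(lhs, rhs);
    println!("padded evaluation = {lhs}");
}
```

This is the identity that lets a single eq table over the longest point serve polynomials of different heights.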

kunxian-xia and others added 10 commits April 7, 2026 18:34
* claude code impl plan

* implement the jagged_sumcheck using time-space tradeoff sumcheck prover algorithm

* remove debugging codes

* ref to the original paper

* add jagged sumcheck bench

* #32 parallelize (#34)

* par wip

* check f(z) * g(z) matches expected evaluation

* fix clippy

* fix clippy: add #[cfg(test)] to test-only method and fix unused import/variable warnings

Agent-Logs-Url: https://github.com/scroll-tech/gkr-backend/sessions/4a61316e-cf28-47b8-a43e-fb6ab432701e

Co-authored-by: hero78119 <3962077+hero78119@users.noreply.github.com>

* refactor test

* apply functional programming style

* avoid unnecessary BaseField-to-ExtensionField conversion in q_evals access

Use E * BaseField multiplication directly instead of converting q_evals
elements to extension field with .into() first.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* replace col_row binary search with incremental ColRowIter

Add ColRowIter that does one binary search at construction and O(1)
per step, replacing per-element binary searches in build_m_table,
bind_and_materialize, compute_claimed_sum, and final_evaluations_slow.
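
A rough sketch of the idea, not the `ColRowIter` from this commit: the type name `ColRowIterSketch` and the `cum` layout (column start offsets followed by the total length) are assumptions made for illustration. It does one binary search at construction and then advances incrementally; with empty columns the per-step cost is amortized O(1).

```rust
/// Incremental (col, row) iterator over a jagged layout.
/// `cum[i]` is the start offset of column i; the last entry is the total length.
struct ColRowIterSketch<'a> {
    cum: &'a [usize],
    idx: usize, // current global index
    col: usize, // column containing `idx` (or a lower bound on it)
}

impl<'a> ColRowIterSketch<'a> {
    fn new(cum: &'a [usize], start: usize) -> Self {
        assert!(!cum.is_empty());
        // One binary search: first column whose end offset exceeds `start`.
        let col = cum[1..].partition_point(|&end| end <= start);
        Self { cum, idx: start, col }
    }
}

impl<'a> Iterator for ColRowIterSketch<'a> {
    type Item = (usize, usize);

    fn next(&mut self) -> Option<(usize, usize)> {
        let total = *self.cum.last()?;
        if self.idx >= total {
            return None;
        }
        // Skip past finished (or empty) columns.
        while self.cum[self.col + 1] <= self.idx {
            self.col += 1;
        }
        let item = (self.col, self.idx - self.cum[self.col]);
        self.idx += 1;
        Some(item)
    }
}

fn main() {
    // Column 0 has 3 entries, column 1 is empty, column 2 has 2 entries.
    let cum = [0usize, 3, 3, 5];
    for (col, row) in ColRowIterSketch::new(&cum, 2) {
        println!("col = {col}, row = {row}");
    }
}
```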

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: hero78119 <3962077+hero78119@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* extend jagged sumcheck benchmark to cover n=25..31

* switch jagged sumcheck benchmark to BabyBearExt4

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* remove jagged sumcheck plan doc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: hero78119 <3962077+hero78119@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…rove (#38)

* Initial plan

* Make EPOCH_SIZES configurable in jagged_sumcheck_prove via optional parameter

Agent-Logs-Url: https://github.com/scroll-tech/gkr-backend/sessions/b18805c7-6e15-44c0-ab03-a1905068d964

Co-authored-by: hero78119 <3962077+hero78119@users.noreply.github.com>

* Fix doc spacing and add debug_assert for epoch_sizes validation

Agent-Logs-Url: https://github.com/scroll-tech/gkr-backend/sessions/b18805c7-6e15-44c0-ab03-a1905068d964

Co-authored-by: hero78119 <3962077+hero78119@users.noreply.github.com>

---------

Co-authored-by: kunxian xia <xiakunxian130@gmail.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: hero78119 <3962077+hero78119@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* implement evaluate_g using width-4 ROBP from Section 3.2/4

Evaluate the MLE of indicator function g(a,b,c,d) = [a+c=b AND b<d]
using a width-4 read-once branching program with state (carry, lt).
Placed in separate jagged_evaluator.rs module. Includes brute-force
test for correctness verification (n=1..4).
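
For intuition, the predicate can be checked on plain integers before any MLE is involved. The sketch below is a standalone illustration, not the code in jagged_evaluator.rs; it assumes bits are consumed least significant first, which the commit message does not state. It tracks only the two state bits (carry, lt) and compares against the direct definition for n=1..4.

```rust
/// Direct definition on n-bit integers: g(a, b, c, d) = [a + c == b && b < d].
fn g_direct(a: u32, b: u32, c: u32, d: u32) -> bool {
    a + c == b && b < d
}

/// Same predicate computed bit by bit (least significant bit first, an assumption),
/// carrying only two bits of state: (carry, lt).
fn g_streaming(a: u32, b: u32, c: u32, d: u32, n: u32) -> bool {
    let (mut carry, mut lt) = (0u32, false);
    for i in 0..n {
        let (ai, bi, ci, di) = ((a >> i) & 1, (b >> i) & 1, (c >> i) & 1, (d >> i) & 1);
        let sum = ai + ci + carry;
        if (sum & 1) != bi {
            return false; // addition check fails at this bit: reject
        }
        carry = sum >> 1;
        if bi != di {
            lt = bi < di; // a higher differing bit overrides all lower ones
        }
    }
    // Accept only if the addition did not overflow n bits and b < d.
    carry == 0 && lt
}

/// Exhaustively compare both versions for all bit widths 1..=max_n.
fn agree_up_to(max_n: u32) -> bool {
    (1..=max_n).all(|n| {
        let m = 1u32 << n;
        (0..m).all(|a| {
            (0..m).all(|b| {
                (0..m).all(|c| {
                    (0..m).all(|d| g_direct(a, b, c, d) == g_streaming(a, b, c, d, n))
                })
            })
        })
    })
}

fn main() {
    assert!(agree_up_to(4));
    println!("streaming evaluation matches the direct definition for n = 1..4");
}
```

The two state bits give four live states, which matches the stated width of 4; rejection is handled here by returning early rather than by an explicit state.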

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* implement forward and backward algorithms for evaluate_g

Separate evaluate_g into two functions following the ROBP-based MLE
evaluation: forward (source→sinks, MLE definition) and backward
(sinks→source, Claim 4.2.1/Lemma 4.2 from jagged PCS paper).
Extract shared transition_weights helper. Default to backward for
future batch/symbolic evaluation support.
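
The relationship between the two directions can be sketched in a few dozen lines. Everything below is an illustrative assumption rather than the PR's implementation: `u64` arithmetic modulo the BabyBear prime, state index `2*carry + lt`, least-significant-bit-first layers, and the helper names `step`, `layer_matrix`, `eval_forward`, `eval_backward`, `eval_brute`. Both passes multiply the same per-layer 4x4 matrices, just from opposite ends, and both should agree with the MLE definition.

```rust
const P: u64 = 2013265921; // BabyBear prime, used only as a convenient modulus
fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// eq1(z, x) for a single bit x: z if x = 1, 1 - z if x = 0.
fn eq1(z: u64, x: u64) -> u64 { if x == 1 { z } else { sub(1, z) } }

/// One step of the state machine; state = 2*carry + lt. `None` means reject.
fn step(state: usize, a: u64, b: u64, c: u64, d: u64) -> Option<usize> {
    let (carry, lt) = ((state >> 1) as u64, (state & 1) as u64);
    let sum = a + c + carry;
    if (sum & 1) != b { return None; }
    let new_lt = if b != d { (b < d) as u64 } else { lt };
    Some((2 * (sum >> 1) + new_lt) as usize)
}

/// m[from][to] = sum over symbols (a,b,c,d) of the eq weights of symbols
/// that move `from` to `to` in layer i.
fn layer_matrix(z: &[Vec<u64>; 4], i: usize) -> [[u64; 4]; 4] {
    let mut m = [[0u64; 4]; 4];
    for sym in 0..16u64 {
        let bits = [sym & 1, (sym >> 1) & 1, (sym >> 2) & 1, (sym >> 3) & 1];
        let w = (0..4).fold(1, |acc, k| mul(acc, eq1(z[k][i], bits[k])));
        for from in 0..4 {
            if let Some(to) = step(from, bits[0], bits[1], bits[2], bits[3]) {
                m[from][to] = add(m[from][to], w);
            }
        }
    }
    m
}

/// Forward: push weight from the source state through the layers.
fn eval_forward(z: &[Vec<u64>; 4], n: usize) -> u64 {
    let mut v = [1u64, 0, 0, 0]; // source: carry = 0, lt = 0
    for i in 0..n {
        let m = layer_matrix(z, i);
        let mut next = [0u64; 4];
        for from in 0..4 {
            for to in 0..4 {
                next[to] = add(next[to], mul(v[from], m[from][to]));
            }
        }
        v = next;
    }
    v[1] // accepting sink: carry = 0, lt = 1
}

/// Backward: pull acceptance weight from the sink back to the source.
fn eval_backward(z: &[Vec<u64>; 4], n: usize) -> u64 {
    let mut u = [0u64, 1, 0, 0];
    for i in (0..n).rev() {
        let m = layer_matrix(z, i);
        let mut prev = [0u64; 4];
        for from in 0..4 {
            for to in 0..4 {
                prev[from] = add(prev[from], mul(m[from][to], u[to]));
            }
        }
        u = prev;
    }
    u[0]
}

/// Brute force straight from the MLE definition (exponential in n).
fn eval_brute(z: &[Vec<u64>; 4], n: usize) -> u64 {
    let eqn = |zk: &[u64], x: u64| (0..n).fold(1, |acc, i| mul(acc, eq1(zk[i], (x >> i) & 1)));
    let (m, mut total) = (1u64 << n, 0u64);
    for a in 0..m { for b in 0..m { for c in 0..m { for d in 0..m {
        if a + c == b && b < d {
            let w = mul(mul(eqn(&z[0], a), eqn(&z[1], b)), mul(eqn(&z[2], c), eqn(&z[3], d)));
            total = add(total, w);
        }
    }}}}
    total
}

fn main() {
    let z = [vec![2, 3], vec![5, 7], vec![11, 13], vec![17, 19]];
    assert_eq!(eval_forward(&z, 2), eval_brute(&z, 2));
    assert_eq!(eval_backward(&z, 2), eval_brute(&z, 2));
    println!("forward, backward and brute-force evaluations agree");
}
```

One reason to prefer the backward pass, in line with the commit message, is that the backward vectors depend only on the suffix of layers and can be reused when several evaluations share a prefix.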

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Add the opening protocol for the jagged PCS: given K evaluation claims
on individual column polynomials, prove they're consistent with the
commitment to the giga polynomial q'. Uses the jagged sumcheck to reduce
to a single evaluation of q', then opens via the inner PCS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kunxian-xia kunxian-xia changed the title Feat/jagged pcs Feat: jagged pcs Apr 24, 2026
@kunxian-xia kunxian-xia mentioned this pull request Apr 24, 2026
kunxian-xia and others added 5 commits April 24, 2026 14:51
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… polynomials

The zero-padding interpretation (p_i^pad) explains why C_i = eq(z_r[s_i..], 0)
is needed when batching polynomials of different heights with a single eq table.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread crates/mpcs/src/jagged/mod.rs Outdated
Clarify that each polynomial in q' occupies a full 2^{s_i} block
(not the original h_i entries) because bit-reversal scatters the
zero-padding throughout the block. Add note comparing with SP1's
jagged PCS which uses raw h_i, highlighting the Σ(2^{s_i} - h_i)
extra zeros as the cost of suffix-to-prefix bit-reversal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread crates/mpcs/src/jagged/mod.rs
kunxian-xia and others added 2 commits April 28, 2026 19:13
* feat: implement assist sumcheck to reduce K ROBP evaluations to one

Implements the assist sumcheck protocol (Lemma 5.1 of eprint 2025/917)
which batches K indicator function evaluations into a single opening,
using an interleaved variable ordering and forward-backward ROBP state
decomposition.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* perf: parallelize backward precomputation and fuse round accumulations

- Parallelize the O(K × n_robp) backward vector precomputation with rayon
- Derive bwd_sum_d from bwd_sum in O(1) instead of a second O(K) pass
- Fuse the two per-step weight updates into one
- Add assist_sumcheck benchmark (K=100/500/1000)

K=1000 drops from 14.1ms to 4.6ms (3.1x speedup with --features parallel).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: annotate key MLE telescoping identity in assist sumcheck

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* bench: update assist sumcheck benchmark to K=1000,2000,4000,8000

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: derive ROBP transition matrices from raw transition table

Separate the ROBP state machine logic from the eq-weighting algebra.
The raw transition table ROBP_TRANSITION encodes the state machine
directly, and symbol_transition_matrices derives M_i^{(c,d)} via the
closed formula Σ_{a,b} eq₁(z1,a)·eq₁(z2,b)·[transition(from,(a,b,c,d))=to].

Also document why z₁/z₂ are bound first (reducing alphabet from {0,1}⁴
to {0,1}²) and z₃/z₄ interleaved to match ROBP step order.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: suppress needless_range_loop clippy lint and apply fmt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* bench: add end-to-end jagged PCS benchmark (commit, open, verify)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: suppress clippy needless_range_loop in evaluator tests, add K=16000 bench

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* perf: transpose bwd/c_bits/d_bits to step-major layout for cache-friendly access

Reorganize data structures from poly-major bwd[y][i] to step-major
bwd[i][y], and similarly for c_bits and d_bits. The main loop now
scans contiguous memory when accumulating over K polynomials, reducing
cache misses at large K (~38% improvement at K=16000). Backward
precomputation uses split_at_mut + par_iter for safe parallel writes.

Also adds the round polynomial formula (§2.3, Eq. 4) as a code comment
to clarify the bwd_sum accumulation logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* perf: parallelize assist sumcheck round-polynomial computation and weight update

Thread-local bwd_sum buckets + per-thread p(0),p(1),p(2) evaluation
eliminates the sequential O(K) bottleneck in the main step loop.
Each thread builds local buckets, computes local round polynomial
values, and only the scalars are summed across threads.

Benchmark (BabyBearExt4, --features parallel, vs HEAD):
  K=1000:  7.8ms → 6.8ms  (-14%)
  K=2000:  13.5ms → 10.2ms (-24%)
  K=4000:  25.6ms → 16.2ms (-38%)
  K=8000:  50.7ms → 28.0ms (-46%)
  K=16000: 100.5ms → 53.6ms (-48%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* perf: eliminate bwd_sum merge by reusing per-thread buckets in round 2i+1

Instead of merging thread-local bwd_sums after round 2i, cache them and
absorb alpha directly into each thread's local bwd_sum for round 2i+1.
Only scalar p(0),p(1),p(2) values are summed across threads in both rounds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: gate IndexedParallelIterator import behind parallel feature, remove tautological debug_assert

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fmt

* fix: use wildcard maybe_rayon import, remove unused direct rayon dep

Wildcard `use p3::maybe_rayon::prelude::*` avoids unused-import warnings
caused by Cargo feature unification: `p3/parallel` can be activated by
other workspace crates even when `mpcs/parallel` is off, making a
`#[cfg(feature = "parallel")]` gate on IndexedParallelIterator incorrect.

Also remove `dep:rayon` from mpcs — all parallelism goes through
`p3::maybe_rayon`, so the direct rayon dependency was unnecessary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@kunxian-xia

@copilot Let $T$ denote the total number of evaluations in the giga MLE $q(x)$ (we pad it to the next power of 2, so that $n = \lceil \log_2 T \rceil$). Please carefully check whether the following two properties hold:

  1. In the ROBP based evaluator for $g(z_{\textrm{row}}, z_\rho, t_{i}, t_{i+1})$: $z_{\textrm{row}}, z_\rho$ are padded to have length $n+1$.
  2. The final evaluated point in jagged assist sumcheck has length $2*(n+1)$, the jagged assist sumcheck has $2*(n+1)$ rounds.

Please think with highest scrutiny. Correct me if I'm wrong about the above checks.

…usion in prover

Agent-Logs-Url: https://github.com/scroll-tech/gkr-backend/sessions/22b3ba65-8e69-45eb-ae09-0a5267f8b518

Co-authored-by: kunxian-xia <1082586+kunxian-xia@users.noreply.github.com>

Copilot AI commented Apr 30, 2026

@copilot Let $T$ denote the number of total evaluations in the giga mle $q(x)$ (we pad it to next power of 2 s.t. $n = \lceil \log T \rceil$). Please run a careful check to see if we have the followi...

The checks are only partially correct — both depend on whether total_evals is a power of 2.

Let n = num_giga_vars = ⌈log₂(total_evals)⌉. The ROBP needs to represent cumulative heights t_i ∈ [0, total_evals]. The key constraint is whether total_evals itself fits in n bits:

  • total_evals < 2^n (not a power of 2): total_evals ≤ 2^n − 1, fits in n bits → n_robp = n
  • total_evals = 2^n (is a power of 2): total_evals = 2^n requires n+1 bits → n_robp = n + 1

The code captures this correctly:

let n_robp = num_giga_vars + if total_evals.is_power_of_two() { 1 } else { 0 };

Check 1 ("z_row, z_ρ are padded to length n+1"): Only true when total_evals = 2^n. Otherwise they have length n. The current code pads to n_robp (n or n+1).

Check 2 ("assist sumcheck has 2*(n+1) rounds"): Same condition. Rounds = 2 * n_robp = 2n or 2(n+1).

So your checks are too strong in general — they hold only for the power-of-two case. The conditional n_robp logic is the correct and efficient approach.
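
As a quick sanity check of the case analysis, the sketch below recomputes `n_robp` with the same conditional as the quoted line and compares it with the bit length of `total_evals`. The helper names `num_giga_vars`, `n_robp` and `bit_len` are written here as free functions for illustration and are not claimed to match the code's structure.

```rust
/// n = ceil(log2(total_evals)): number of variables of the padded giga polynomial.
fn num_giga_vars(total_evals: usize) -> usize {
    total_evals.next_power_of_two().trailing_zeros() as usize
}

/// Same conditional as the quoted line.
fn n_robp(total_evals: usize) -> usize {
    num_giga_vars(total_evals) + if total_evals.is_power_of_two() { 1 } else { 0 }
}

/// Number of bits needed to write `x` in binary (for x >= 1).
fn bit_len(x: usize) -> usize {
    (usize::BITS - x.leading_zeros()) as usize
}

fn main() {
    for t in [1usize, 5, 7, 8, 9, 1 << 20, (1 << 20) + 1] {
        // n_robp is exactly the number of bits needed to represent total_evals.
        assert_eq!(n_robp(t), bit_len(t));
        println!(
            "total_evals = {t}: n = {}, n_robp = {}, assist rounds = {}",
            num_giga_vars(t),
            n_robp(t),
            2 * n_robp(t)
        );
    }
}
```

In other words, under this reading `n_robp` is simply the bit length of `total_evals`, which is n+1 exactly when `total_evals` is a power of two.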

@kunxian-xia

@copilot read this issue scroll-tech/ceno#1334, give me a plan to revise the fn jagged_batch_open() to support at least two sets of (point, evals). Create a new PR on top of this PR that contains the initial design / plan in markdown. Do not start the implementation before we're aligned on the design.


Copilot AI commented Apr 30, 2026

@kunxian-xia I've opened a new pull request, #47, to work on those changes. Once the pull request is ready, I'll request review from you.

kunxian-xia and others added 3 commits May 1, 2026 01:08
…trix

- Add comparison benchmark (comparison.rs) measuring commit, batch_open,
  batch_verify, and proof size for jagged PCS vs direct inner PCS
- Refactor jagged_commit to accept p3::matrix::dense::RowMajorMatrix instead
  of witness::RowMajorMatrix, supporting non-power-of-two matrix heights
  (internally padded to next power of two before bit-reversal)
- Update jagged_pcs bench to use BabyBearExt4, parallel make_rmm, jittered
  non-power-of-two heights, and eq-table-based column evaluation
- Cap reshape_log_height to 25 to fit BabyBear two-adicity constraint

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* misc: prefix align without bit reverse

* WIP integrate jagged pcs cpu

* Fix jagged PCS padded opening normalization

* misc: clippy fix

* avoid reconstruct q_mles

* more docs

* misc: fmt and clippy

@hero78119 hero78119 left a comment

Awesome great job 🔥 👍

@kunxian-xia kunxian-xia merged commit e8f8f5c into main May 13, 2026
2 checks passed
@kunxian-xia kunxian-xia deleted the feat/jagged_pcs branch May 13, 2026 07:11