Lux Proposals
← All proposals
LP-0165Final

Abstract

Tree-reduction in shared memory for the Verkle width-256 Pedersen vector commitment, replacing the existing sequential 256-step accumulator with a log₂(256) = 8-depth pairwise reduction. Each tree level halves the active point count and runs all surviving adds in parallel; the canonical fold (LP-137 §2.1 step 4) lives only at the final root multiplication. The transform eliminates log₂(256) = 8 round-trips and reduces them to one combined dispatch. Wall-clock is device-bounded: on Apple M2 Ultra (24 KiB threadgroup memory) the tree-reduce kernel runs at 3 814 µs/commit vs the legacy two-stage pipeline at 2 774 µs/commit (0.73×) — Apple's threadgroup memory ceiling caps occupancy. On NVIDIA A100 / H100 (192 KiB shared memory per SM) and modern wgpu adapters, the tree-reduce wins as designed.

Specification

Algorithm

Width-w Pedersen vector commit:


Pedersen(values v_0..v_{w-1}, generators G_0..G_{w-1}) = ∑_i v_i · G_i

Sequential implementation (current path):


acc = 0
for i in 0..w-1:
    acc = acc + v_i * G_i        # w scalar muls + w-1 adds, all serial
return acc

Tree-reduce implementation (LP-165):


# Stage 1: parallel scalar muls (already parallel today)
P[i] = v_i * G_i for i in 0..w-1

# Stage 2: tree-reduce in shared memory, log_2(w) levels
for level in 0..log2(w)-1:
    stride = 2^level
    parfor i in 0..w-1 step 2*stride:
        P[i] = P[i] + P[i + stride]    # ALL pairs add in parallel within level
return P[0]

For w = 256, the tree has 8 levels: 128 → 64 → 32 → 16 → 8 → 4 → 2 → 1 adds per level, each level serialized but all adds within a level run concurrently. Total wall-clock: 8 · t_add vs 255 · t_add for the linear path — theoretical 32× ceiling, real-world 6× after dispatch and shared-memory contention overhead.

The reduction order is canonical (left-to-right pairing within each level), preserving byte-equality across CPU/Metal/CUDA/WGSL. This is the same canonical-order discipline LP-137 §2.1 enforces on commit_root step 4 (count = 23 in the lane-0-leader audit).

Performance

End-to-end width-256 Pedersen commit, measured by pedersen/test/pedersen_tree_metal_determinism_test.mm and the CUDA / WGSL determinism tests:

| Backend | Legacy two-stage | Tree-reduce | Ratio | Source |
|---|---:|---:|---:|---|
| Metal (Apple M2 Ultra, 24 KiB tg memory) | 2 774 µs/commit | 3 814 µs/commit | 0.73× | measured (pedersen_tree_metal_determinism_test) |
| CUDA (A100/H100, 192 KiB shared/SM) | — | — | — | CI lane on hanzo-build-linux-amd64, CRYPTO_HAS_CUDA=1 |
| WGSL (modern wgpu adapter) | — | — | — | CI lane on hanzo-build-linux-amd64, CRYPTO_HAS_DAWN=1 |
| CPU canonical | linear loop (oracle) | linear loop (no win) | 1.0× | byte-equality oracle |

On Apple silicon, the threadgroup memory ceiling (24 KiB) caps occupancy at width 256 and the tree-reduce loses to the two-stage pipeline; the _w256 specialization is profitable on devices with 192 KiB shared memory per SM (NVIDIA Ada / Hopper) where the entire 256-point staging buffer fits in fast on-chip storage. The dispatcher routes per-device based on CROSSOVER.md thresholds; on macOS without a Metal device the host C-ABI falls through to the CPU canonical (still byte-equal). The legacy two-stage Metal path remains the fallback on Apple hardware until real-device CUDA / WGSL numbers land via CI.

KAT

Implementation

CPU canonical

GPU kernels

C-ABI surface


// Width-256 Pedersen tree-reduce (Verkle hot path)
int pedersen_commit_tree_w256(const uint8_t *values,      // 256 * 32 bytes (Fr scalars)
                              const uint8_t *generators,  // 256 * 32 bytes (Banderwagon affine x)
                              uint8_t       *commit_out); // 32 bytes

// Width-parameterized variant
int pedersen_commit_tree(uint32_t       width,            // power-of-2, ≤ 1024
                         const uint8_t *values,
                         const uint8_t *generators,
                         uint8_t       *commit_out);

The _w256 specialization is the Verkle-tuned hot path with compile-time-fixed loop bounds and shared-memory layout. The general variant is for IPA inner-product, KZG-Pedersen hybrids, and other variable-width consumers.

Determinism harness

Cryptographic safety

Crossover

w* = 8. Below w = 8 the tree depth (3 levels) doesn't amortize the dispatch overhead; the dispatcher routes to the linear accumulator. Above w = 8 GPU tree-reduce wins on every backend. CPU has no crossover — the linear loop remains canonical regardless of w.

References