Single parameterized Pippenger multi-scalar multiplication kernel under gpukit/, instantiated for secp256k1, BN254, BLS12-381 G1/G2, and Banderwagon via curve_traits<C>. Replaces four near-identical hand-written MSM kernels with one templated body. Cost: O(λ·n / log n) point additions for λ-bit scalars and n points, with Pippenger window size c = ⌊log₂ n⌋ - 2. Headline win is code consolidation (3 200 LOC → ~600 LOC kernel + ~150 LOC per curve_traits). The shared kernel is the dispatch point, not a re-implementation: per-curve traits delegate to the proven first-party curve libraries.
Pippenger's bucket method:
# Inputs: points P_0..P_{n-1}, scalars k_0..k_{n-1} ∈ [0, 2^λ)
# Window size c, number of windows W = ⌈λ / c⌉
# 1. Per-window bucket fill (parallel over points)
for w in 0..W-1:
for i in 0..n-1:
b = (k_i >> (w * c)) & ((1 << c) - 1)
if b != 0:
buckets[w][b] += P_i # affine mixed add via batch_inv
# 2. Per-window bucket reduce (parallel over windows)
for w in 0..W-1:
acc = 0; running = 0
for b in (1 << c) - 1 .. 1:
running += buckets[w][b]
acc += running
window_sum[w] = acc
# 3. Window combine (serial, log-depth shift-and-add)
result = window_sum[W-1]
for w in W-2 .. 0:
result = (result << c) + window_sum[w]
return result
Step 1 uses LP-160 batched Montgomery inversion to amortize the affine-mix-add denominators. Step 2 is the bucket "running sum" — parallel across windows, serial within. Step 3 is the canonical fold (LP-137 §2.1 step 4).
curve_traits<C> surface
template<typename C>
struct curve_traits {
using point_affine = ...;
using point_jacobian = ...;
using scalar = ...;
static constexpr uint32_t scalar_bits; // λ
static constexpr uint32_t bucket_max_bits; // c upper bound
static point_jacobian add (point_jacobian, point_jacobian);
static point_jacobian add_mixed (point_jacobian, point_affine);
static point_jacobian dbl (point_jacobian);
static point_affine to_affine_batch (...); // calls gpukit_batch_inv
};
Specializations live at luxcpp/crypto/gpukit/gpu/curve_traits/{secp256k1,bn254,bls12_381,banderwagon}.{metal,cu,wgsl}.
luxcpp/crypto@741f7c3f) | KAT pass |luxcpp/crypto@741f7c3f) | KAT pass |multiexp.cpp signed-digit body) | KAT pass |GPUKIT_ERR_NOTIMPL | sibling commit in flight |Status is Active rather than Final until BLS12-381 G1 lands a body — the current dispatcher honestly returns GPUKIT_ERR_NOTIMPL for that curve and the multi_pippenger_test::test_bls_notimpl test asserts on that contract. Promotion to Final will record the wiring commit URL here once it lands and the BLS12-381 KAT vector passes byte-equal.
End-to-end throughput numbers go in BENCHMARKS.md per backend. Real-device CUDA / WGSL numbers come from the hanzo-build-linux-amd64 CI lane with CRYPTO_HAS_CUDA=1 / CRYPTO_HAS_DAWN=1; Metal numbers come from the macOS dev sweep recorded in CROSSOVER.md.
luxcpp/crypto/gpukit/cpp/cpu_reference/pippenger_msm.{hpp,cpp} — templated over curve_traits<C>.luxcpp/crypto/{secp256k1,bn254,bls,banderwagon}/cpp/msm.cpp (those become 5-line forwarding wrappers).luxcpp/crypto/gpukit/gpu/metal/pippenger_msm.metal + curve specializations under gpu/curve_traits/.luxcpp/crypto/gpukit/gpu/cuda/pippenger_msm.cu.luxcpp/crypto/gpukit/gpu/wgsl/pippenger_msm.wgsl.luxcpp/crypto/gpukit/gpu/{metal,cuda,wgsl}/pippenger_msm_driver.*.
int gpukit_pippenger_msm(uint8_t curve_id,
const uint8_t *points, // n * point_bytes (affine)
const uint8_t *scalars, // n * scalar_bytes
uint8_t *out, // 1 * point_bytes (Jacobian)
uint32_t n,
uint32_t window_c); // 0 = auto-pick
Per-curve wrappers re-exported under each curve's c-abi (secp256k1_msm, bn254_msm, bls12_381_g1_msm, bls12_381_g2_msm, banderwagon_msm). The Banderwagon variable-time MSM secrecy contract from LP-137 §2.5 is preserved: secret scalars MUST go through gpukit_pippenger_msm_blinded.
luxcpp/crypto/gpukit/test/pippenger_msm_test.{cpp,mm,cu} — for each curve, N ∈ {16, 64, 256, 1024, 4096}, 50 random (points, scalars) batches, asserted byte-equal CPU vs Metal vs CUDA vs WGSL.MultiExp (BN254, BLS12-381), arkworks VariableBaseMSM (second oracle), bitcoin-core/secp256k1 secp256k1_ecmult_multi_var (secp256k1).b = (k_i >> (w·c)) & mask. For verifier-side public scalars this is fine. For prover-side secret scalars the caller MUST use gpukit_pippenger_msm_blinded (Pedersen-style scalar randomization before MSM, unblind after — same contract LP-137 §2.5 enforces for Banderwagon).The per-curve N* thresholds at which Pippenger beats naive serial scalar-mul:
Below N* the dispatcher routes to per-point scalar multiplication (LP-146/147/152). Recorded in luxcpp/crypto/CROSSOVER.md.
luxcpp/crypto@<sha> impl commit (CTO agent #2, in flight)