Lux Proposals
← All proposals
LP-0161Active

Abstract

Single parameterized Pippenger multi-scalar multiplication kernel under gpukit/, instantiated for secp256k1, BN254, BLS12-381 G1/G2, and Banderwagon via curve_traits<C>. Replaces four near-identical hand-written MSM kernels with one templated body. Cost: O(λ·n / log n) point additions for λ-bit scalars and n points, with Pippenger window size c = ⌊log₂ n⌋ - 2. Headline win is code consolidation (3 200 LOC → ~600 LOC kernel + ~150 LOC per curve_traits). The shared kernel is the dispatch point, not a re-implementation: per-curve traits delegate to the proven first-party curve libraries.

Specification

Algorithm

Pippenger's bucket method:


# Inputs: points P_0..P_{n-1}, scalars k_0..k_{n-1} ∈ [0, 2^λ)
# Window size c, number of windows W = ⌈λ / c⌉

# 1. Per-window bucket fill (parallel over points)
for w in 0..W-1:
    for i in 0..n-1:
        b = (k_i >> (w * c)) & ((1 << c) - 1)
        if b != 0:
            buckets[w][b] += P_i           # affine mixed add via batch_inv

# 2. Per-window bucket reduce (parallel over windows)
for w in 0..W-1:
    acc = 0; running = 0
    for b in (1 << c) - 1 .. 1:
        running += buckets[w][b]
        acc += running
    window_sum[w] = acc

# 3. Window combine (serial, log-depth shift-and-add)
result = window_sum[W-1]
for w in W-2 .. 0:
    result = (result << c) + window_sum[w]
return result

Step 1 uses LP-160 batched Montgomery inversion to amortize the affine-mix-add denominators. Step 2 is the bucket "running sum" — parallel across windows, serial within. Step 3 is the canonical fold (LP-137 §2.1 step 4).

curve_traits<C> surface


template<typename C>
struct curve_traits {
    using point_affine    = ...;
    using point_jacobian  = ...;
    using scalar          = ...;
    static constexpr uint32_t scalar_bits;       // λ
    static constexpr uint32_t bucket_max_bits;   // c upper bound
    static point_jacobian add (point_jacobian, point_jacobian);
    static point_jacobian add_mixed (point_jacobian, point_affine);
    static point_jacobian dbl (point_jacobian);
    static point_affine   to_affine_batch (...);  // calls gpukit_batch_inv
};

Specializations live at luxcpp/crypto/gpukit/gpu/curve_traits/{secp256k1,bn254,bls12_381,banderwagon}.{metal,cu,wgsl}.

Status

| Curve | CPU canonical | Determinism harness |
|---|---|---|
| secp256k1 | wired (luxcpp/crypto@741f7c3f) | KAT pass |
| BN254 G1 | wired (luxcpp/crypto@741f7c3f) | KAT pass |
| Banderwagon | wired (delegates to multiexp.cpp signed-digit body) | KAT pass |
| BLS12-381 G1 | dispatcher wired, body returns GPUKIT_ERR_NOTIMPL | sibling commit in flight |

Status is Active rather than Final until BLS12-381 G1 lands a body — the current dispatcher honestly returns GPUKIT_ERR_NOTIMPL for that curve and the multi_pippenger_test::test_bls_notimpl test asserts on that contract. Promotion to Final will record the wiring commit URL here once it lands and the BLS12-381 KAT vector passes byte-equal.

End-to-end throughput numbers go in BENCHMARKS.md per backend. Real-device CUDA / WGSL numbers come from the hanzo-build-linux-amd64 CI lane with CRYPTO_HAS_CUDA=1 / CRYPTO_HAS_DAWN=1; Metal numbers come from the macOS dev sweep recorded in CROSSOVER.md.

Implementation

CPU canonical

GPU kernels

C-ABI surface


int gpukit_pippenger_msm(uint8_t curve_id,
                         const uint8_t *points,    // n * point_bytes (affine)
                         const uint8_t *scalars,   // n * scalar_bytes
                         uint8_t       *out,       // 1 * point_bytes (Jacobian)
                         uint32_t       n,
                         uint32_t       window_c); // 0 = auto-pick

Per-curve wrappers re-exported under each curve's c-abi (secp256k1_msm, bn254_msm, bls12_381_g1_msm, bls12_381_g2_msm, banderwagon_msm). The Banderwagon variable-time MSM secrecy contract from LP-137 §2.5 is preserved: secret scalars MUST go through gpukit_pippenger_msm_blinded.

Determinism harness

Cryptographic safety

Crossover

The per-curve N* thresholds at which Pippenger beats naive serial scalar-mul:

| Curve | N* (CPU) | N* (Metal) | N* (CUDA) |
|---|---:|---:|---:|
| secp256k1 | 32 | 64 | 128 |
| BN254 G1 | 24 | 48 | 96 |
| BLS12-381 G1 | 32 | 96 | 192 |
| Banderwagon | 16 | 32 | 64 |

Below N* the dispatcher routes to per-point scalar multiplication (LP-146/147/152). Recorded in luxcpp/crypto/CROSSOVER.md.

References