Lux Proposals
← All proposals
LP-0160Final

Abstract

Montgomery's batched modular inversion replaces N independent field inversions with 1 inversion + 3·(N-1) multiplications. The kernel was previously authored only on Metal (secp256k1_batch_inv.metal); LP-160 ports the algorithm to CUDA + WGSL under the canonical three-backend layout, and lifts the canonical body into gpukit/ so every curve (secp256k1, BN254 Fp/Fr, BLS12-381 Fp/Fr, Banderwagon Fp) shares one kernel. Apple M1 Max measured 10–20× over N independent Fermat exponentiations at N = 1024 (see CROSSOVER.md §batch_inv). CUDA + WGSL real-device benchmarks land in CI on the hanzo-build-linux-amd64 runner with CRYPTO_HAS_CUDA=1 and CRYPTO_HAS_DAWN=1; macOS dev builds skip CUDA/WGSL legs by design and run Metal exclusively.

Specification

Algorithm

Given inputs a_0, …, a_{N-1} in a field F_q (Montgomery form):


# Forward prefix product
p_0 = a_0
for i in 1..N-1:
    p_i = p_{i-1} * a_i

# Single inversion
inv = p_{N-1}^{-1}        # one Fermat exponentiation: a^(q-2) mod q

# Backward sweep
for i in N-1 .. 1:
    out_i  = inv * p_{i-1}
    inv    = inv * a_i
out_0 = inv

Cost: 1 inv + 3·(N-1) muls. For q ≈ 2^256, one inversion ≈ 256 squarings + ~128 multiplications under windowed Fermat; one mul ≈ 1 Montgomery REDC. Crossover happens at N ≥ 8 on every backend; below N = 8 the per-call dispatch overhead beats the savings.

Determinism

The forward prefix product and backward sweep are serial chains by construction — p_i depends on p_{i-1}, and the running inv depends on the previous inv. The kernel runs as a single workgroup of one thread (lane-0-leader pattern, classified as "Batch-inversion serial chain" in LP-137 §2.1 lane-0 audit). Byte-equality across CPU/Metal/CUDA/WGSL is unconditional.

Performance

Speedup vs N-many independent Fermat inversions, identical curve, identical hardware:

| Backend | N=64 | N=256 | N=1024 | Source |
|---|---:|---:|---:|---|
| Metal (M1 Max) | 6× | 11× | 14× | measured (CROSSOVER.md §batch_inv, secp256k1_batch_inv 2026-02 baseline) |
| CUDA (Ada/Hopper) | — | — | — | CI lane on hanzo-build-linux-amd64, CRYPTO_HAS_CUDA=1 |
| WGSL (wgpu, RTX 4090) | — | — | — | CI lane on hanzo-build-linux-amd64, CRYPTO_HAS_DAWN=1 |

macOS dev builds run Metal only; CUDA / WGSL legs are exercised on the linux-amd64 CI runner with the corresponding env flags and recorded directly in BENCHMARKS.md as they land. No projections here.

Implementation

CPU canonical

GPU kernels

C-ABI surface


// gpukit C-ABI shim (one entry, curve_id selects parameter set)
int gpukit_batch_inv(uint8_t curve_id,
                     const uint8_t *in,    // N * field_bytes Montgomery form
                     uint8_t       *out,   // N * field_bytes Montgomery form
                     uint32_t       n);

// Per-curve thin wrappers re-exported under each curve's c-abi:
int secp256k1_batch_inv_fp(const uint8_t *in, uint8_t *out, uint32_t n);
int bn254_batch_inv_fr    (const uint8_t *in, uint8_t *out, uint32_t n);
int bls12_381_batch_inv_fp(const uint8_t *in, uint8_t *out, uint32_t n);
int banderwagon_batch_inv (const uint8_t *in, uint8_t *out, uint32_t n);

Determinism harness

Cryptographic safety

Crossover

Below N* = 8 the per-dispatch overhead exceeds the algorithmic savings; the dispatcher routes to the per-element CPU path. Above N* = 8 GPU on Metal, N* = 16 on CUDA, N* = 32 on WGSL. The threshold is profiled per device and recorded in luxcpp/crypto/CROSSOVER.md.

References