Lux Proposals
← All proposals
LP-0162Final

Abstract

Fused k-pair Miller loop for BLS12-381 batch pairing verification. Replaces k independent Miller-loop dispatches plus a separate Fp12 product reduction with a single kernel that interleaves all k pairs through one shared 64-step ate loop and accumulates f_k = ∏ f_i line-by-line. Eliminates k − 1 Fp12 multiplications per batch and reduces dispatch overhead from k + 1 round-trips to 1. Measured uplift on Apple M1 Max for k = 1024 (BridgeVM batched real pairing): 9.5× mean (LP-137 §2.3). Pre-existing bls_combined_miller.metal is the Metal reference; LP-162 ports to CUDA + WGSL and parameterizes by pair count.

Specification

Algorithm

The BLS12-381 ate Miller loop runs over the binary expansion of |x| = 0xD201000000010000 (NAF length 64). For a single pair (P, Q):


f = 1
T = Q
for i in 62..0:
    f = f^2 * line(T, T, P)
    T = 2T
    if x_i == 1:
        f = f * line(T, Q, P)
        T = T + Q

For k pairs (P_i, Q_i), the combined loop interleaves k line evaluations per Miller bit and folds them into one accumulator:


f = 1
T_0..T_{k-1} = Q_0..Q_{k-1}
for i in 62..0:
    f = f^2
    for j in 0..k-1:
        f = f * line(T_j, T_j, P_j)
        T_j = 2 * T_j
    if x_i == 1:
        for j in 0..k-1:
            f = f * line(T_j, Q_j, P_j)
            T_j = T_j + Q_j

The k line evaluations per bit are independent (no cross-pair dependency) and parallelize cleanly. The f updates are serial within a bit (one Fp12 chain) but the same Fp12 chain that the unfused version would do at the very end — net savings: k − 1 extra Fp12-mul-by-Fp12 operations are eliminated, replaced by k Fp12-mul-by-sparse-line operations (~6× cheaper each).

Determinism

Line accumulation order is fixed: j = 0 first, ascending. Squaring is the same per-bit canonical squaring as single-pair Miller. Final exponentiation runs once on the combined f. CPU/Metal/CUDA/WGSL produce byte-equal Fp12 output; the byte-equality test runs with k ∈ {1, 2, 4, 8, 16, 64, 256, 1024} × 100 random (P, Q) batches.

Performance

| k | Prior (per-pair Miller + product) | LP-162 fused | Source |
|---:|---|---|---|
| 1024 | 99 040 ms | 10 425 ms (9.50×) | BridgeVM v0.60 measured, n=1024 mean (LP-137 §2.3) |

The k = 1024 row is the only row backed by a number on this host. Smaller k rows are intentionally omitted: the per-pair Miller-bit cost reduction is the same algorithmic transform but the wall-clock ratio is dispatch-overhead-bounded at small k and depends on the device. CUDA / WGSL real-device numbers land in BENCHMARKS.md from the hanzo-build-linux-amd64 CI lane with CRYPTO_HAS_CUDA=1 / CRYPTO_HAS_DAWN=1. The full bench ladder (k ∈ {1, 2, 4, 8, 16, 64, 256, 1024} per backend) lives in CI; this LP records only what is measured today.

Implementation

CPU canonical

GPU kernels

C-ABI surface


// k-pair fused Miller + final exponentiation
int bls12_381_pairing_check_batch(const uint8_t *P_pairs,   // k * 96 bytes G1 affine
                                  const uint8_t *Q_pairs,   // k * 192 bytes G2 affine
                                  uint32_t       k,
                                  uint8_t       *verdict);  // 0/1 byte

// k-pair fused Miller WITHOUT final exp (caller wants raw Fp12 product)
int bls12_381_combined_miller(const uint8_t *P_pairs,
                              const uint8_t *Q_pairs,
                              uint32_t       k,
                              uint8_t       *fp12_out);     // 576 bytes

Determinism harness

Cryptographic safety

Crossover

k* = 2. Below k=2 there's nothing to fuse; the dispatcher routes to single-pair Miller. Recorded in luxcpp/crypto/CROSSOVER.md per backend; the CTO impl commit will tune pair_count template specializations for k ∈ {2, 4, 8, 16} and fall back to dynamic for larger k.

References