Fused k-pair Miller loop for BLS12-381 batch pairing verification. Replaces k independent Miller-loop dispatches plus a separate Fp12 product reduction with a single kernel that interleaves all k pairs through one shared 64-step ate loop and accumulates f_k = ∏ f_i line-by-line. Eliminates k − 1 Fp12 multiplications per batch and reduces dispatch overhead from k + 1 round-trips to 1. Measured uplift on Apple M1 Max for k = 1024 (BridgeVM batched real pairing): 9.5× mean (LP-137 §2.3). Pre-existing bls_combined_miller.metal is the Metal reference; LP-162 ports to CUDA + WGSL and parameterizes by pair count.
The BLS12-381 ate Miller loop runs over the binary expansion of |x| = 0xD201000000010000 (NAF length 64). For a single pair (P, Q):
f = 1
T = Q
for i in 62..0:
f = f^2 * line(T, T, P)
T = 2T
if x_i == 1:
f = f * line(T, Q, P)
T = T + Q
For k pairs (P_i, Q_i), the combined loop interleaves k line evaluations per Miller bit and folds them into one accumulator:
f = 1
T_0..T_{k-1} = Q_0..Q_{k-1}
for i in 62..0:
f = f^2
for j in 0..k-1:
f = f * line(T_j, T_j, P_j)
T_j = 2 * T_j
if x_i == 1:
for j in 0..k-1:
f = f * line(T_j, Q_j, P_j)
T_j = T_j + Q_j
The k line evaluations per bit are independent (no cross-pair dependency) and parallelize cleanly. The f updates are serial within a bit (one Fp12 chain) but the same Fp12 chain that the unfused version would do at the very end — net savings: k − 1 extra Fp12-mul-by-Fp12 operations are eliminated, replaced by k Fp12-mul-by-sparse-line operations (~6× cheaper each).
Line accumulation order is fixed: j = 0 first, ascending. Squaring is the same per-bit canonical squaring as single-pair Miller. Final exponentiation runs once on the combined f. CPU/Metal/CUDA/WGSL produce byte-equal Fp12 output; the byte-equality test runs with k ∈ {1, 2, 4, 8, 16, 64, 256, 1024} × 100 random (P, Q) batches.
k | Prior (per-pair Miller + product) | LP-162 fused | Source |The k = 1024 row is the only row backed by a number on this host. Smaller k rows are intentionally omitted: the per-pair Miller-bit cost reduction is the same algorithmic transform but the wall-clock ratio is dispatch-overhead-bounded at small k and depends on the device. CUDA / WGSL real-device numbers land in BENCHMARKS.md from the hanzo-build-linux-amd64 CI lane with CRYPTO_HAS_CUDA=1 / CRYPTO_HAS_DAWN=1. The full bench ladder (k ∈ {1, 2, 4, 8, 16, 64, 256, 1024} per backend) lives in CI; this LP records only what is measured today.
luxcpp/crypto/bls/cpp/bls_combined_miller.{hpp,cpp} — templated over pair_count (compile-time) and dynamic n_pairs (runtime fallback for k > template ceiling).luxcpp/crypto/bls/cpp/bls_pairing.cpp Fp12 ops; only the loop-and-accumulate skeleton is new.bls_fused.cpp (already shipped, LP-137 §2.3 reference) becomes a thin wrapper around bls_combined_miller.luxcpp/crypto/bls/gpu/metal/bls_combined_miller.metal (already exists, canonical Metal reference).luxcpp/crypto/bls/gpu/cuda/bls_combined_miller.cu (NEW, port of Metal).luxcpp/crypto/bls/gpu/wgsl/bls_combined_miller.wgsl (NEW, port of Metal).luxcpp/crypto/bls/gpu/{metal,cuda,wgsl}/bls_combined_miller_driver.*.
// k-pair fused Miller + final exponentiation
int bls12_381_pairing_check_batch(const uint8_t *P_pairs, // k * 96 bytes G1 affine
const uint8_t *Q_pairs, // k * 192 bytes G2 affine
uint32_t k,
uint8_t *verdict); // 0/1 byte
// k-pair fused Miller WITHOUT final exp (caller wants raw Fp12 product)
int bls12_381_combined_miller(const uint8_t *P_pairs,
const uint8_t *Q_pairs,
uint32_t k,
uint8_t *fp12_out); // 576 bytes
luxcpp/crypto/bls/test/bls_combined_miller_test.{cpp,mm,cu} — k ∈ {1, 2, 4, 8, 16, 64, 256, 1024}, 100 random pair batches per k, byte-equal CPU vs Metal vs CUDA vs WGSL.blst v0.3.11 blst_miller_loop_n (test-only, vendored under bls/test/cmake/blst.cmake); arkworks Bls12_381::multi_pairing (second oracle).T_j = 2 T_j and T_j = T_j + Q_j updates are independent across j, so a malformed Q_j cannot influence the line evaluation for a different pair j' != j. Subgroup checks on every Q_i are still required up-front (LP-137 §8).f; this preserves the security argument from Vercauteren's optimal-ate paper because e(P, Q) = f^{(p^12 - 1)/r} is multiplicative — final_exp(∏ f_i) = ∏ final_exp(f_i) modulo the easy/hard split.k* = 2. Below k=2 there's nothing to fuse; the dispatcher routes to single-pair Miller. Recorded in luxcpp/crypto/CROSSOVER.md per backend; the CTO impl commit will tune pair_count template specializations for k ∈ {2, 4, 8, 16} and fall back to dynamic for larger k.
luxcpp/crypto@<sha> impl commit (CTO agent #3, in flight)