Single-pass batch kernel for FROST (Schnorr) and CGGMP21 (ECDSA) threshold pre-signing rounds, processing M signers × N pre-signature slots in one GPU dispatch. Pre-signing is the round-one-only portion of each protocol — nonce generation, commitment broadcast, and per-signer scalar-mul — that does NOT depend on the message. By running M·N independent pre-signatures in parallel, the per-ceremony amortized cost drops from O(M·N) sequential rounds to O(M) GPU passes (N parallelizes within a pass). Projected speedup vs the per-slot sequential MPCVM v0.62 baseline: 5.0–7.2× at M=7 N=64, scaling near-linearly in N. Round-by-round signing (after the message arrives) remains structurally CPU-only per LP-137 §3.8 (category C).
Each pre-signature slot needs a pair of nonces (d, e) per signer, with commitments (D = d·G, E = e·G):
    frost_batch_presign(M signers, N slots):
        # Stage 1: parallel nonce generation, M*N draws
        parfor (i, j) in M x N:
            d_{i,j} = sample_uniform(Fn)
            e_{i,j} = sample_uniform(Fn)
        # Stage 2: parallel scalar-muls, M*N points
        parfor (i, j) in M x N:
            D_{i,j} = d_{i,j} * G
            E_{i,j} = e_{i,j} * G
        # Stage 3: per-slot binding factor (canonical fold)
        for j in 0..N-1:
            rho_j = H_b(j, [D_{i,j}, E_{i,j} for i in 0..M-1])   # serial within slot
        return {(d_{i,j}, e_{i,j}, D_{i,j}, E_{i,j}, rho_j)}
Stages 1 + 2 fan out to M·N threads; stage 3 is the canonical commit_root fold (one thread per slot, lane-0-leader pattern).
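The M·N fan-out of stages 1 and 2 can be pictured as a flat thread index mapped to a (signer, slot) pair. The following is a minimal host-side sketch only: the row-major layout (slot index varying fastest) and the names `SlotIndex`, `flat_to_pair`, and `pair_to_flat` are assumptions for illustration, not taken from the kernel source.

```cpp
#include <cstdint>

// Hypothetical (signer, slot) pair handled by one GPU thread.
struct SlotIndex {
    uint32_t signer;  // i in [0, M)
    uint32_t slot;    // j in [0, N)
};

// Assumed row-major layout: thread t handles signer t / N, slot t % N.
// Precondition: N > 0.
inline SlotIndex flat_to_pair(uint64_t t, uint32_t N) {
    return SlotIndex{static_cast<uint32_t>(t / N),
                     static_cast<uint32_t>(t % N)};
}

// Inverse mapping, widened to 64 bits so signer * N cannot overflow.
inline uint64_t pair_to_flat(SlotIndex p, uint32_t N) {
    return static_cast<uint64_t>(p.signer) * N + p.slot;
}
```

With N = 64, thread 130 would map to signer 2, slot 2 under this layout. Whether the real kernels index signer-major or slot-major is not stated in the text.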
Each slot needs a pre-signature (R, k_i⁻¹·χ_i) per signer:
    cggmp21_batch_presign(M signers, N slots):
        # Stage 1: parallel nonce + chi-share generation
        parfor (i, j) in M x N:
            k_{i,j} = sample_uniform(Fn)
            gamma_{i,j} = sample_uniform(Fn)
        # Stage 2: MtA pair products (the M*M*N quadratic blow-up)
        parfor (i, i', j) in M x M x N where i != i':
            share_{i,i',j} = mta_send(k_{i,j}, gamma_{i',j})
        # Stage 3: per-slot R combination + chi distribution
        for j in 0..N-1:
            Gamma_j = sum_i (gamma_{i,j} * G)   # serial within slot
            R_j = k_j^{-1} * Gamma_j            # k_j = sum_i k_{i,j}
            chi_{i,j} = k_{i,j} * x_i + sum_{i'} share_{i,i',j}
        return {(k_{i,j}, R_j, chi_{i,j})}
The M·M·N MtA quadratic in stage 2 is the dominant cost; running it in one kernel pass saves M-1 round-trips per slot. Stage 3 is the canonical fold, run once per slot.
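To make the size of the stage 2 work list concrete: because the pseudocode excludes i = i', the exact number of MtA products is M·(M−1)·N, which is what the M·M·N shorthand approximates. The sketch below enumerates those tasks on the host. `MtaTask`, `mta_task_count`, `enumerate_mta_tasks`, and the slot-major ordering are hypothetical and chosen only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical work item for one MtA product in stage 2.
struct MtaTask {
    uint32_t sender;    // i  (contributes k_{i,j})
    uint32_t receiver;  // i' (contributes gamma_{i',j}), i' != i
    uint32_t slot;      // j
};

// Ordered pairs (i, i') with i != i', times N slots.
inline uint64_t mta_task_count(uint32_t M, uint32_t N) {
    return static_cast<uint64_t>(M) * (M - 1) * N;
}

// Enumerate every task in a fixed, slot-major order, skipping i == i'.
inline std::vector<MtaTask> enumerate_mta_tasks(uint32_t M, uint32_t N) {
    std::vector<MtaTask> tasks;
    tasks.reserve(static_cast<std::size_t>(mta_task_count(M, N)));
    for (uint32_t j = 0; j < N; ++j)
        for (uint32_t i = 0; i < M; ++i)
            for (uint32_t ip = 0; ip < M; ++ip)
                if (i != ip) tasks.push_back(MtaTask{i, ip, j});
    return tasks;
}
```

At M = 7, N = 64 this gives 7 · 6 · 64 = 2688 independent products, each of which could be assigned to its own thread.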
The randomness source sample_uniform(Fn) is deterministic on a session-bound seed, following the RFC 8032 §5.1.6 nonce derivation pattern applied per (session_id, signer_id, slot_id). No reproducible pre-signature shall escape the session: slot consumption is single-shot.
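The per-(session, signer, slot) binding can be sketched as a fixed-width preimage that a hash or KDF would then consume. Everything concrete here is an assumption: the field widths (32-byte seed, 8-byte signer id, 4-byte big-endian slot id), the use of the session seed in place of session_id, and the names `NoncePreimage` and `nonce_preimage`. The hash step itself is deliberately left out.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed widths: 32-byte session seed, 8-byte signer id, 4-byte slot id.
constexpr std::size_t kSeedLen = 32, kSignerIdLen = 8, kSlotIdLen = 4;
using NoncePreimage =
    std::array<uint8_t, kSeedLen + kSignerIdLen + kSlotIdLen>;

// Builds session_seed || signer_id || slot_id (slot id big-endian).
// Fixed-width fields keep the concatenation unambiguous.
inline NoncePreimage nonce_preimage(const uint8_t* session_seed,
                                    const uint8_t* signer_id,
                                    uint32_t slot_id) {
    NoncePreimage out{};
    std::memcpy(out.data(), session_seed, kSeedLen);
    std::memcpy(out.data() + kSeedLen, signer_id, kSignerIdLen);
    for (std::size_t b = 0; b < kSlotIdLen; ++b)
        out[kSeedLen + kSignerIdLen + b] =
            static_cast<uint8_t>(slot_id >> (8 * (kSlotIdLen - 1 - b)));
    return out;
}
```

Two slots of the same signer yield different preimages, so their derived nonces differ. A real derivation would also have to separate the several draws made within one slot (for example d and e), which the text does not spell out.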
The canonical fold (stage 3 of both protocols) runs as a single thread per slot. Byte-equality across CPU/Metal/CUDA/WGSL is asserted on stage 3 outputs; stage 1 and 2 outputs are checked by Schnorr / ECDSA verification of the signatures completed from the pre-signatures.
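A byte-equality assertion of this kind can be sketched as a per-slot comparison of two backends' stage 3 buffers. The helper name `first_mismatching_slot` and the default 32-byte record length are assumptions; this is test-harness style code and makes no attempt to be constant-time.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Returns the index of the first slot whose stage 3 record differs between
// two backends, or -1 if all n_slots records are byte-identical.
inline long first_mismatching_slot(const uint8_t* a, const uint8_t* b,
                                   std::size_t n_slots,
                                   std::size_t record_len = 32) {
    for (std::size_t j = 0; j < n_slots; ++j) {
        const std::size_t off = j * record_len;
        if (std::memcmp(a + off, b + off, record_len) != 0)
            return static_cast<long>(j);
    }
    return -1;
}
```

Reporting the first differing slot, rather than a single pass/fail flag, makes it easier to tell a layout bug (every slot differs) from an arithmetic bug (isolated slots differ).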
- frost-secp256k1-tr v2.1 (Rust, test-only via frost/test/cmake/zcash_frost.cmake); 5-of-7 threshold, N ∈ {1, 4, 16, 64, 256}, 50 random (session, signers) per N.
- multi-party-ecdsa v0.8.1 (Rust, test-only); same threshold and slot counts.
- luxcpp/crypto/{frost,cggmp21}/test/batch_presign_kat.json.

End-to-end pre-sign throughput at threshold 5-of-7, secp256k1, median of 25 runs:
| N slots | Prior (sequential v0.62) | LP-166 batch | Ratio |
|---|---|---|---|

Round-by-round message signing (after the message lands) is unchanged: the per-signature combination that follows pre-signing remains a serial protocol round and falls under LP-137 §3.8 category C. Final numbers: see BENCHMARKS.md in the impl commit.
- 8e8fb102. Throughput observed locally: 128 commitments/s × 2 backends at M=10 N=64.
- cggmp21/cpp/paillier.{hpp,cpp} lands the 2048-bit Paillier scheme (keygen via Miller-Rabin 40-round 1024-bit safe-prime search, encrypt, decrypt, Π^enc sigma proof prove + verify), all Z_{N^2} arithmetic delegating to the LP-163 Karatsuba 4096-bit modexp primitive. presign_one() produces status=0 records with real K_i = enc_N(k_i, ρ_k_i), G_cmt = enc_N(γ_i, ρ_g_i), and pi_enc binding (K_i, k_i, ρ_k_i) when the caller provisions a valid PaillierKey. Zero-pk emits status=0xFF (a legitimate "this signer's pk not provisioned" marker; the aggregator routes around it). 4/4 cggmp21_presign_test PASS on commit f35eedd2.

So the speedup numbers above are the *target*; the FROST ratios are reachable as the per-backend dispatch infrastructure (Metal .metallib, Dawn host runtime) lands in CI. The CGGMP21 ratios will follow LP-163 completion, since the Paillier sub-step dominates beyond the curve scalar mul.
- luxcpp/crypto/frost/cpp/frost_batch_presign.{hpp,cpp} (NEW).
- luxcpp/crypto/cggmp21/cpp/cggmp21_batch_presign.{hpp,cpp} (NEW).
- curve_traits from LP-161.
- luxcpp/crypto/frost/gpu/metal/frost_batch_presign.metal (NEW).
- luxcpp/crypto/cggmp21/gpu/metal/cggmp21_batch_presign.metal (NEW).
- luxcpp/crypto/frost/gpu/cuda/frost_batch_presign.cu (NEW).
- luxcpp/crypto/cggmp21/gpu/cuda/cggmp21_batch_presign.cu (NEW).
- luxcpp/crypto/frost/gpu/wgsl/frost_batch_presign.wgsl (NEW).
- luxcpp/crypto/cggmp21/gpu/wgsl/cggmp21_batch_presign.wgsl (NEW).
    // FROST batch pre-sign: outputs M*N nonce pairs + N binding factors
    int frost_batch_presign(uint32_t M,                  // signers
                            uint32_t N,                  // slots
                            const uint8_t *session_seed, // 32 bytes
                            const uint8_t *signer_ids,   // M * 8 bytes
                            uint8_t *nonce_pairs_out,    // M*N * 64 bytes (d,e)
                            uint8_t *commit_pairs_out,   // M*N * 66 bytes (D,E compressed)
                            uint8_t *binding_out);       // N * 32 bytes (rho)

    // CGGMP21 batch pre-sign: outputs M*N k-shares + N R values + M*N chi shares
    int cggmp21_batch_presign(uint32_t M,
                              uint32_t N,
                              const uint8_t *session_seed,
                              const uint8_t *key_shares, // M * 32 bytes (x_i)
                              uint8_t *k_shares_out,     // M*N * 32 bytes
                              uint8_t *R_out,            // N * 33 bytes (compressed secp256k1)
                              uint8_t *chi_out);         // M*N * 32 bytes
Output buffers are pre-signature material — the caller MUST treat them as one-shot, single-use, session-scoped key material. Reusing a slot across two messages is catastrophic (reveals the long-term key).
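Since the one-shot rule falls on the caller, a caller-side guard might look like the following. This is a sketch under stated assumptions: `SlotLedger` is a hypothetical in-memory helper, not part of the MPCVM session manager, and it ignores persistence across restarts and zeroization of consumed material, both of which a real caller would need.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical caller-side ledger: each slot may be consumed exactly once.
class SlotLedger {
public:
    explicit SlotLedger(uint32_t n_slots) : used_(n_slots, false) {}

    // Marks slot j as consumed. Returns true on first use; returns false if
    // j is out of range or already consumed, in which case the caller must
    // refuse to sign with that slot.
    bool consume(uint32_t j) {
        if (j >= used_.size() || used_[j]) return false;
        used_[j] = true;
        return true;
    }

    // Number of slots that have not been consumed yet.
    uint32_t remaining() const {
        uint32_t n = 0;
        for (bool u : used_) n += u ? 0u : 1u;
        return n;
    }

private:
    std::vector<bool> used_;
};
```

A signing path would call `consume(j)` before touching slot j's buffers and abort on `false`, so a second message can never be paired with the same pre-signature.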
luxcpp/crypto/{frost,cggmp21}/test/batch_presign_test.{cpp,mm,cu}: M ∈ {3, 5, 7}, N ∈ {1, 4, 16, 64}, 25 random (session_seed, signer_ids, key_shares) triples per (M, N). Stage 3 outputs byte-equal CPU vs Metal vs CUDA vs WGSL; stage 1+2 outputs Schnorr/ECDSA-verified after a synthetic message is supplied.

- (d, e, k, gamma) is secret: leakage of any one of these reveals the long-term signing key. Constant-time scalar mul (LP-147 secp256k1, branchless pt_cmov) is mandatory. The blinded MSM path (LP-137 §2.5) does NOT apply here because pre-sign scalar muls have nothing to blind against; the scalars themselves ARE the secret.
- The RFC 8032 §5.1.6 derivation MUST seed each sample_uniform call with (session_id || signer_id || slot_id). Slot reuse is forbidden. The kernel does not enforce slot-uniqueness; the caller (MPCVM session manager) is responsible.
- MtA (mta_send) requires both signers' Paillier-style ZK proofs of well-formedness. LP-166 covers the scalar arithmetic for stage 2; the Paillier proofs run on the CPU side per round (category C, intra-ceremony) and are NOT batched.
- M ≤ 32, N ≤ 1024 is enforced by the kernel; larger batches force a re-dispatch. This defends against a hostile session manager pushing M·N > 2^20 and exhausting GPU memory.

| (M, N) | Backend | Path |
|---|---|---|
| M=1 (single-signer ECDSA / Schnorr) | any | LP-147 single-key path |
| (M, N=1) | CPU | sequential pre-sign (existing MPCVM path) |
| (M, N≥4) | Metal/CUDA/WGSL | LP-166 batch kernel |
| (M, N≥4) | CPU | tiled batch on goroutine pool (no GPU) |

Round-by-round threshold signing (after pre-signing) remains CPU-only per LP-137 §3.8 category C: a round can't start until the previous round's transcript lands, so there's no parallelism to harvest above the pre-sign stage.
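The batch limits and the dispatch table above can be read as a small host-side selector. The sketch below is illustrative only: `PresignPath`, `select_path`, and the constants are hypothetical names, rejecting M = 0 or N = 0 is an added assumption, and combinations the table does not list (such as N = 2 or 3) are reported as unspecified rather than guessed.

```cpp
#include <cstdint>

enum class PresignPath {
    Rejected,       // outside the M <= 32, N <= 1024 limits (or empty batch)
    SingleKey,      // M = 1: LP-147 single-key path
    SequentialCpu,  // N = 1: existing sequential MPCVM path
    BatchGpu,       // N >= 4 with a Metal/CUDA/WGSL backend
    TiledCpu,       // N >= 4 without a GPU
    Unspecified     // combination not covered by the table
};

constexpr uint32_t kMaxSigners = 32;
constexpr uint32_t kMaxSlots = 1024;

inline PresignPath select_path(uint32_t M, uint32_t N, bool has_gpu) {
    // Limits first: an oversized request must be split and re-dispatched.
    if (M == 0 || N == 0 || M > kMaxSigners || N > kMaxSlots)
        return PresignPath::Rejected;
    if (M == 1) return PresignPath::SingleKey;
    if (N == 1) return PresignPath::SequentialCpu;
    if (N >= 4) return has_gpu ? PresignPath::BatchGpu : PresignPath::TiledCpu;
    return PresignPath::Unspecified;
}
```

Checking the limits before anything else means a request over the cap never reaches buffer allocation; at the cap, M·N is 32 · 1024 = 2^15 slots.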
luxcpp/crypto@<sha> impl commit (CTO agent #7, in flight)