Lux Proposals
← All proposals
LP-0137Active

LP-137: GPU-Native Crypto Stack

LP-137 specifies the GPU mode — the substrate, build topology, byte-equality contract, GPU-residency invariant, and C-ABI surface — that all per-algorithm crypto LPs share. Per-algorithm formal specs live in their own LPs (LP-026, LP-029, LP-066, LP-069, LP-070--077, LP-110--126, LP-131, LP-146--159). LP-137 carries no per-algorithm content.

Abstract

A GPU-native crypto substrate for the 9-chain Lux topology. Every primitive

ships a first-party CPU canonical (Go reference) and first-party

Metal/CUDA/WGSL kernels, all byte-equal to the same ground truth. Audited

upstreams (blst, c-kzg-4844, gnark-crypto, PQClean) appear only as test

oracles — never linked into shipped libraries — so the byte-equality test

retains its adversarial-distance property: a bug on either side fails the

test.

The stack covers 29 algorithms across hashes, AEAD, EC + pairings, post-

quantum signatures, lattice primitives, FHE building blocks, threshold

schemes, and composite remote-attestation parsers. It satisfies the

GPU-residency invariant: no chain-local hot path leaves attested GPU

memory in production. CPU touches reality only at packet ingress, cold-state

page service, attestation handshake, and watchdog.

1. Motivation

A fragmented crypto substrate makes the byte-equivalence contract impossible

to enforce. Three failure modes that LP-137 closes:

1. Oracle bypass. When CPU and test oracle come from the same upstream

family (e.g. both from blst or both from gnark-crypto), the byte-equality

test loses its adversarial-distance: an upstream bug passes through both

sides. First-party CPU + audited test oracle is the correct adversarial

structure — bug on either side fails the test.

2. Vendor symbol leak. Production binaries linking blst / mcl /

BoringSSL into the production graph re-expose vendor branding and tie

the security boundary to upstream release cycles. LP-137 pins those

libraries to test-only oracles at <alg>/test/cmake/.

3. Brand fragmentation. Per-org symbol prefixes (LUX_*,

LUXFI_*, HANZO_*) make the same algorithm look like five

different libraries. Algorithm names ARE the namespace.

2. Specification

2.1 Per-Algorithm Architecture

> CPU = first-party canonical, byte-equal Go reference.

> GPU = first-party Metal/CUDA/WGSL, byte-equal CPU.

> Audited upstreams = vendored TEST ORACLES ONLY.


                 +---------------------+
                 |  Reference (Go)     |
                 |  gnark-crypto, etc. |
                 +----------+----------+
                            |
                       byte-equal
                            |
                 +----------v----------+
                 |  CPU canonical      |
                 |  vendored audited   |  <-- ground truth
                 |  (mcl, blst, ...)   |
                 +----------+----------+
                            |
                       byte-equal
                            |
        +-------------------+-------------------+
        |                   |                   |
+-------v------+    +-------v------+    +-------v------+
|  Metal       |    |  CUDA        |    |  WGSL        |
|  first-party |    |  first-party |    |  first-party |
+--------------+    +--------------+    +--------------+

Per-algo target structure:

The four-kernel canonical template (formalized at

luxcpp/crypto/gpukit/):


expand_inputs   →   parallel_eval   →   reduce_or_batch_verify   →   commit_root

Step 1 hoists per-input scratch into device-resident arenas. Step 2 fans

one thread per input. Step 3 reduces partials (Merkle leaf hashes, MSM

windows, batch-pairing folds); parallel where associative, serial where not.

Step 4 emits the commitment in canonical order — single-thread by design,

the determinism oracle.

Lane-0-leader audit

42 if (tid != 0u) return; sites across the codebase classify as:

| Class | Count | Reason |
|---|---:|---|
| Canonical fold (commit_root step 4) | 23 | byte-equivalent across CPU/Metal/CUDA/WGSL by canonical-order Keccak |
| Canonical locate / collision-resolved slot | 4 | must match CPU oracle across collision sequences |
| Round-by-round protocol (one-thread canonical) | 5 | round N depends on round N-1 transcript |
| Batch-inversion serial chain (Montgomery) | 2 | algorithmic — chain IS the reduction |
| Final reduction (single output) | 4 | single-element tower output |
| Wave-tick service-drain leader | 4 | LP-137 §4 architectural choice |

Zero parallelization-opportunity hits. Every lane-0-leader is structural,

not deferred work.

2.2 Repo + Org Layout

| Org | Repos | Purpose |
|---|---|---|
| luxfi/* | crypto, mpc, threshold, hsm, lattice, evm, evmgpu, chains, gpu, lps | Go canonical implementations + Rust workspace + LP specs |
| luxcpp/* | crypto, lattice, cevm, fhe, platformvm, xvm, aivm, bridgevm, mpcvm, accel, lux-cuda, lux-gpu, lux-metal, lux-webgpu | C++ canonical bodies + Metal/CUDA/WGSL kernels |
| luxgpu/* | (reserved) | Future C++ GPU kernel source repos. Currently empty. |

luxfi/gpu is canonical for Go GPU bindings. luxgpu/gpu (archived

upstream artifact) is unrelated and not used. luxcpp/aivm, luxcpp/mpcvm,

luxcpp/xvm migrated 2026-04-28 — three repos pushed with full tag

history; luxfi/<name> origin retained as second remote (no force-push, no

old-repo deletion).

2.3 Verified Performance

Reproduced on Apple M1 Max, macOS 26.4, median of >=10 runs.

| Claim | Reproduced | Source |
|---|---|---|
| BLS fused 148.35× (warm-affine, n=1024) | 144.22× (linear-aff 386 285 µs / fused-aff 2 678.5 µs); fused critical-path = 4 dispatches | crypto-aead-wt/bls/cpp/bls_fused.cpp |
| FHE Metal NTT 16.71× (N=4096, B=2048) | 16.92× (Metal 9.71 ms / Go 164.27 ms); 10-iter median, kernel-only | lattice@d11ec53c; lux/fhe/bench/results/ntt_ladder_*.json |
| bn254 KAT 36/36 vs gnark v0.19.2; e(P,Q)·e(P,-Q)=1 | bn254_kat_test reports 36 passed, 0 failed; go.mod pins gnark-crypto v0.19.2 | crypto-pedersen-cuda-wgsl-wt/bn254/test/tools/ |
| Banderwagon Pippenger KAT MultiExpConfig{ScalarsMont:true} | Verified at 93636ea7:banderwagon/test/tools/gen_multiexp_kat.go:118-126 | 93636ea7 |
| BLS aggregate verify same-msg n=1024 (cevm) | 9.24× (v0.45 batched) → 16.51× (v0.47.1 with pubkey cache); 73.91 µs/sig | cevm/BENCHMARKS.md |
| BridgeVM batched real pairing 5k–10k msgs | 8.58×–10.35×, mean 9.5× | bridgevm/BENCHMARKS.md v0.60 |
| MPCVM xlarge ceremony (v0.62) | 18.6× vs v0.61.1 Metal (9 451 ms → 507 ms) | mpcvm/BENCHMARKS.md |
| MPCVM FROST sign 5-of-7 | 4.23× (204.3 ms → 48.3 ms Metal) | mpcvm/BENCHMARKS.md |
| Keccak per-round dedup | KeccakResidencySession 4-way set-associative round cache, ≥0.50 hit rate | crypto/keccak/gpu/metal/keccak.metal |
| ed25519 RFC 8032 byte-equal Metal | N_threshold=256, 26.7× at N=4096; 100 vectors byte-equal (4 §7.1 + 96 byte-flip) | crypto/ed25519/test/ed25519_metal_test.mm |

Disputed / inflated claims (corrected)

| Original claim | Reality | Off-by |
|---|---|---|
| "every algorithm CPU↔GPU determinism" | 3 of 32 algos with unconditional GPU tests; 6 of 8 *_metal_test* gate on LUX_CRYPTO_*_METALLIB env and stub-pass when unset | 9.4% real, not "every" |
| "240 → 24 brand-neutral residuals" | 380 hits across 78+ files on crypto@181d18c6; closure pending brand-neutral-final-sweep merge | 16× higher |
| "23 publish-ready Rust crates" | Workspace declares 3 members: lux-crypto, lux-crypto-keccak, lux-crypto-secp256k1; cargo publish --dry-run fails on missing README | 7.7× off |
| "AEAD AES-GCM Metal 26.7× at N=8192" | Cipher mis-attributed: 54dad849 is ChaCha20-Poly1305. AES-GCM added later (8180b135/dd1da557). 26.7× appears only in commit message + docs, no bench JSON; aead_metal_bench SKIPs without LUX_CRYPTO_AEAD_METALLIB | not currently reproducible |
| "G3 BatchEvaluate 4.61× under -race" | Bench process ran > 13 minutes wall-clock (PID 55208, 558 MB resident) without output on M1 Max; in-tree harness is correct, number is commit-message-only | UNVERIFIABLE on M1 within session bounds |

2.4 Pedersen DST Canonical

Brand-neutral DSTs (algorithm name IS the namespace):


NewGenerators G          → PEDERSEN_G_V1
NewGenerators H          → PEDERSEN_H_V1
NewGeneratorsFromSeed    → PEDERSEN_SEEDED_GEN_V1

Hash-to-curve: BN254 G1 via RFC 9380 SVDW; cofactor 1, no clearing.

Seed format for FromSeed: msg_i = seed[32] || u64_le(i).

Golden vectors (seed = [0,1,…,31], brand-neutral DST):


G[0..31] = c563aa8a283f268b65b4210a0a78ee1341f76b59d94c1ac626effe1a5aa0c6b7
H[0..31] = e9ebf4392683dcb418584dd8ecd1e1dd16b486147e676dbf4b62779a340f3186

Pre-rename (LUX_* prefixed) values archived only — do not match canonical:


G_old = afba7c7a97100c5eb0ec96758698779b5d8d38d228bcdb7c85a4c1626ea5247a
H_old = abc19b5bad508d8e7b944a37812a342cdbaa5946f0b3fd854805820c006c6110

When the C++ pedersen-cpp-cpu branch ships, pedersen.cpp MUST hard-code

DST string PEDERSEN_SEEDED_GEN_V1 (no LUX_ prefix) and reproduce the

G/H bytes byte-for-byte. KAT harness:

pedersen/pedersen_seed_test.go::TestNewGeneratorsFromSeed_GoldenVector.

2.5 MSM Variable-Time + Caller Audit

Banderwagon Pippenger MSM is variable-time. Five prover-side SECRET

callers were blinded via MultiExpBlinded (Pedersen-style scalar

re-randomization before MSM, unblind after). Three verifier-side PUBLIC

callers remain unchanged — public scalars do not require constant-time MSM.

Categorization rule: any caller that consumes a secret scalar (witness, key

share, signing nonce) must use MultiExpBlinded. Verification with

public-input batched scalars uses MultiExp directly.

2.6 TFHE UNSAFE + Real Threshold Plan

Status: SECURITY BLOCKER. Current luxfi/threshold/protocols/tfhe is

NOT a threshold scheme. It is master-key replication wrapped in HMAC

theatre. Three independent failures:

1. KeyGenerator.GenerateKeys (tfhe.go:~374) — every party gets

UnderlyingKey: masterSK; Shamir not happening.

2. PartialDecrypter.PartialDecrypt (committee.go:~255) — returns

HMAC tag fingerprinting (party, session, ciphertext); no lattice

operation, no noise, no relation to ciphertext content.

3. Protocol.CombineShares (tfhe.go:~232) — calls single-party

decryptor.DecryptUint64 against the master-key copy and ignores

p.shares entirely.

Compensating controls (in force):

Production callers (audit 2026-04-28):

| File | Symbol | Risk |
|---|---|---|
| lux/mpc/pkg/mpc/tfhe_session.go:154 | tfhe.NewKeyGenerator | keygen panic in regulated MPC sessions |
| lux/mpc/pkg/mpc/tfhe_session.go:349 | tfhe.NewProtocol | every TFHE compute session crashes |
| lux/mpc/pkg/mpc/tfhe_session.go:523 | (*tfheComputeSession).GetProtocol | re-exposes broken Protocol to MPC callers |

Confidential lanes (#136) and FHE-policy on-chain (M-Chain × F-Chain

integration #114) MUST NOT be enabled while panic guards are in force —

they will tombstone the relevant MPC node. Intentional.

Real-implementation contract (Go, package tfhe):


// GenerateShares performs t-of-n distributed key generation.
// share_i is a Shamir polynomial evaluation at point x_i, NOT the master key.
GenerateShares(ctx, t, n int, sessionID [32]byte) (
    pk *fhe.PublicKey, shares map[party.ID]*SecretKeyShare, err error)

// PartialDecrypt produces party_i's lattice contribution.
// partial_i = a_i * s_i + e_i (BFV/CKKS partial-decryption sense).
// MUST NOT return an HMAC tag.
PartialDecrypt(ctx, share *SecretKeyShare, ct *fhe.BitCiphertext) (
    *PartialDec, error)

// CombineShares Lagrange-interpolates t partials at x = 0.
// MUST NOT call decryptor.DecryptUint64.
CombineShares(ctx, ct *fhe.BitCiphertext, partials []*PartialDec) (
    cleartext []byte, err error)

The *fhe.SecretKey field on SecretKeyShare is removed. There is no

master-key copy on any party.

Migration plan:

| Stage | Action | Effort | Risk |
|---|---|---|---|
| 0 | UNSAFE + panic guards land | shipped 2026-04-28 | none |
| 1 | pkg/lattice/threshold exposes Shamir + partial-dec + combine | 1–2 weeks | medium |
| 2 | Replace protocols/tfhe/tfhe.go internals with lattice calls | 1 week | medium |
| 3 | Cryptographer review (≥1 reviewer with prior threshold-FHE publication) | 4–6 weeks | high |
| 4 | Remove panic guards + LUX_ALLOW_FAKE_TFHE_FOR_TESTING_ONLY | 1 week | low |
| 5 | Re-enable confidential lanes + FHE-policy on-chain | 1 week | low |

Total: 6–10 weeks gated on cryptographer availability.

2.7 Attestation Architecture

Two-layer split:

Composite attestation root: 11 + 16 = 27 byte-equal C++ ↔ Go test cases.

require_* flags replaced O5-flagged "wildcards-on-zero" pattern: explicit

require_sev_snp, require_tdx, require_nv_nras booleans on the

CompositeAttestationConfig struct. Zero-byte fields no longer wildcard.

Real fixtures (B2 closure): attestation/test/fixtures/ carries genuine

SEV-SNP report blobs from AMD reference machines, TDX quotes from Intel

SGX-DCAP test vectors, and NV NRAS responses from documented NVIDIA test

endpoints. Second oracle (O2/O4 closure): kat-second-oracle-2026-04-28

adds a non-trivial second source so byte-equality cannot collapse to

self-comparison.

2.8 GH Auth + Brand Neutrality

GitHub auth: hanzo-dev only. Every git/gh operation prefixes

unset GH_TOKEN GITHUB_TOKEN so the keyring credential resolves to

hanzo-dev regardless of shell env. zeekay account exists

but is not active for LP-137 work.

Brand neutrality:

3. Implementation Status (Final, 2026-04-28)

3.1 Per-Repo Final HEADs

| Repo | HEAD (short) | Last operation |
|---|---|---|
| luxcpp/crypto | bfbde88d | merge sweep complete; CI verification queued |
| lux/crypto | cb9b3574 | rust-crates-finalize tip + ipa-prover-blinding |
| lux/mpc | 783347e4 | kms-nonce-bind + cc-attest-scaffold merged |
| lux/threshold | 88a94e56 | xrpl-ed25519-nilfix merged |
| lux/aml | fc854e9b | fix-base-sum merged |
| lux/evm | 4b7d9726 | fix-go-sum merged |
| lux/chains | 0fd75b54 | fix-go-sum merged |
| lux/evmgpu | d4d0f487 | fix-go-sum merged |
| lux/hsm | 86339f6c | v1.1.3 release narrative |
| lux/lps | 06075cca | LP-137 doc set on main |

Branches merged across the final sweep: 17. Branches deleted without merge

(obsolete): 3 (ci-only, ci-arm64-cto, bump-precompile already in main).

Pushes deferred: 0. Conflicts deferred: 0.

3.2 Verification Results (zero LP-137 regressions)

Run: `unset GH_TOKEN GITHUB_TOKEN && git fetch origin --prune && git

checkout main && git pull && GOWORK=off go build ./... && GOWORK=off

go test ./... -short -count=1 -timeout 120s`. C++: CI verification only

(workspace rule, no local C++ builds). Rust: `cargo build --release

--workspace`.

| Repo | Build | Test pass / fail | Pre-existing | Real LP-137 regression |
|---|---|---|---|---|
| luxcpp/crypto | CI only (queued) | n/a | None | None |
| lux/crypto (Go) | FAIL (ipa/banderwagon Go module not provided) | 52 ok / 2 setup-failed | YES — banderwagon-as-Go-package not yet wired (post-#205 follow-up) | None |
| lux/crypto (Rust) | FAIL (native libs not on linker path) | n/a | YES — needs CRYPTO_DIR/CRYPTO_BUILD_DIR (CI provides) | None |
| lux/mpc | FAIL (pkg/policy undefined fhethr.*; ansel1/merry stale go.sum) | 28 ok / 14 fail | YES — #234 (264a5a7-pattern breakages, broader than #227) | None |
| lux/threshold | FAIL (corona Sign signature drift in test) | 57 ok / 1 fail | YES — #226-class luxfi/log+corona Sign API drift | None |
| lux/aml | PASS | 8 ok / 0 fail | None | None |
| lux/evm | FAIL (luxfi/gpu MLX cgo + precompile/anchor unshipped) | 37 ok / 32 fail | YES — fhe/gpu transitive (directive-acknowledged) | None |
| lux/chains | n/a (umbrella, no top-level go module) | n/a | YES — same fhe/gpu/precompile transitive class | None |
| lux/evmgpu | PASS (compiles; ld warnings only) | 51 ok / 10 fail | YES — MLX/luxfi/gpu/luxfi/accel native lib paths | None |
| lux/hsm | PASS | 1 ok / 0 fail | None | None |
| lux/lps | n/a (docs) | n/a | None | None |
| lux/gpu | PASS | 1 ok / 0 fail | None | None |

Verdict: LP-137 work introduces zero new regressions across the 12

repos. All build/test failures map to pre-existing classes already tracked

or explicitly called out in the directive.

3.3 Coverage Audit (luxcpp + luxgpu org)

Final-cleanup sweep (per repo):

| Repo | Worktrees pruned | Stale build-*/ removed | Local branches pruned | Origin branches deleted |
|---|---:|---:|---:|---:|
| luxcpp/crypto | 24 | 40 | 45 | 4 |
| lux/crypto | — | — | 20 | 1 |
| lux/mpc | — | — | 12 | 1 |
| lux/threshold | — | — | 3 | 2 |
| lux/hsm | — | — | 0 | 2 |
| lux/lps | — | — | 1 | 5 |
| TOTAL | 24 | 40 | 81 | 15 |

50 SKIPPED-CONFLICT branches retained on origin so owners can resolve

later. Skips fall into two classes: (a) content-equivalent in origin/main

(squash-merge SHA divergence), (b) sibling work touching same file regions

needing human resolution.

3.4 Per-Algorithm C-ABI State (post-merge)

After the final merge sweep, the c-abi state on luxcpp/crypto@bfbde88d:

| Bucket | Count | Algos |
|---|---:|---|
| Wired on main | ~20 | keccak, sha256, ripemd160, blake2b, secp256k1, attestation, bls (BLS12-381), banderwagon, aead, blake3, ed25519, slhdsa, mldsa, mlkem, lamport, ntt, poly_mul, bn254, modexp, evm256 |
| Partially wired | 3 | poseidon goldilocks variant, pedersen legacy form, secp256r1 |
| No first-party body authored | ~7 | kzg, ipa (Go A; C++/GPU pending), verkle (Go A; C++/GPU pending), sr25519, frost, cggmp21, corona (umbrella) |

Earlier 8 wired / 21 NOTIMPL snapshot on crypto@6eb3791c advanced via

the merge order documented in §3.5.

3.5 Merge DAG (Critical Path)

LP-137 architectural claims (Metal NTT dispatch, threshold-FHE committee,

policy-canonical TFHE import, GPU-accelerated Pedersen/poseidon/BLS,

byte-equal CPU oracles) become true on main only after these 8 branches

land. All 58 surveyed branches were conflict-free against their respective

main (git merge-tree produced clean trees in every case); blockage was

order + one build defect at lux/mpc apex (since fixed by

kms-nonce-bind-2026-04-28).

1. luxcpp/crypto deps-bootstrap-2026-04-27 (KZG + PQ vendoring)

2. luxcpp/crypto fork-swap-luxfi-deps-2026-04-27 (intx+evmmax luxfi

forks)

3. luxcpp/crypto pedersen-cuda-wgsl-2026-04-27

4. luxcpp/crypto aead-cuda-wgsl-2026-04-27

5. luxcpp/lattice feat/lp-137-types

6. lux/lattice feat/lp-137-types

7. lux/threshold feat/tfhe-committee-canonical

8. lux/mpc feat/policy-canonical-tfhe-import (apex)

3.6 Per-Chain Coverage (9 of 9)

CPU reference oracle line + branch coverage on the security-critical

byte-equivalence ground truth:

| Chain | VM | Repo | Tag | Line % (oracle) | Branch % | Tests |
|---|---|---|---|---:|---:|---:|
| P-Chain | PlatformVM | luxcpp/platformvm | v0.57 | 99.25% | 90.07% | 53/53 |
| C-Chain | EVM (cevm) | luxcpp/cevm | v0.46.1 | 96.51% (mm) | 58.46% (TOTAL) | 59/59 |
| X-Chain | XVM | luxcpp/xvm | v0.55+1 | 97.48% | 92.46% | 44 + 7 det + 6 Metal |
| Q-Chain | QuantumVM (Pulsar) | luxcpp/lattice + cevm/quasar | v0.43+ | (in cevm) | — | 13 |
| Z-Chain | ZKVM (Groth16) | cevm/quasar | v0.44.0 | (in cevm) | — | 13 |
| A-Chain | AIVM | luxcpp/aivm | v0.59.1 | 98.71% | 94.71% | 45/45 |
| B-Chain | BridgeVM | luxcpp/bridgevm | v0.60 | 98.17% | 90.53% | 42/42 |
| M-Chain | MPCVM | luxcpp/mpcvm | v0.62 | 97.90% | 90.32% | 41/41 |
| F-Chain | FHEVM | luxcpp/fhe + luxfi/fhevm | (existing) | 2174 gtest | n/a | 158 + 138 + 1876 |

Aggregate (5 new VMs): 97.96% line on the CPU oracle. Median branch:

92.46% (exceeds 90% target). Total tests passing across the new VMs:

221+. C-Chain (cevm) adds 59. F-Chain (fhe) adds 2 174 gtest

cases.

GPU kernel sources (.metal, .cu, .wgsl) are not instrumentable by

llvm-cov — neither metal-cc, nvcc, nor wgpu-native emit LLVM coverage

maps. The CPU reference oracle is the byte-equivalence ground truth; every

GPU run is asserted byte-for-byte against it via each VM's determinism

test.

3.7 Substrate Geometric Mean

Phase-2 → Phase-3 evolution on Apple M1 Max (vs CPU reference):

| Phase | Substrate-wide geomean |
|---|---|
| Phase 1 (initial) | 0.17× |
| Phase 2 (v0.45 BLS batched) | 0.90× |
| Phase 2.5 (v0.47.1 BLS pubkey cache) | 0.97× |
| Phase 3 (Quasar 4.0, v0.46.0–v0.62) | parity neighbourhood |

Three measured production workloads beat CPU end-to-end:

Two architectural-correct chains lag on M1 (dGPU-pending):

3.8 Per-Primitive GPU Class

Across 29 crypto primitives + helpers + composite:

| Class | Count | Members |
|---|---:|---|
| A GPU-native today | 10 | sha256, keccak, ripemd160, blake2b, secp256k1, bls, ed25519, mldsa (skeleton), mlkem (skeleton), poly_mul (+ ntt via FHE backend) |
| B GPU-feasible body shipped, kernel pending | 3 | ipa, pedersen, verkle |
| C Structurally CPU-only / round-by-round | 4 | attestation parsers, frost (intra-ceremony), cggmp21 (intra-ceremony), corona (intra-proof Fiat-Shamir) |
| D NOTIMPL — no CPU body authored | 6 | blake3 (fold-aliased to keccak in old shim — BUG fixed), sr25519, slhdsa, lamport, poseidon, aead |
| E Bootstrap-blocked (cpp body needs intx/blst/evmmax) | 5 | bn254, secp256r1, kzg, modexp, evm256 |

Of working primitives (denominator = 11 with shipped CPU body): 6 of 10

Metal-shipping = 60% pure GPU-native; with FHE-backed NTT and 4 category-C

intra-ceremony primitives whose cross-ceremony batch IS template-fit, the

residency-correct fraction is 8 of 10 = **80% GPU-native at the

architectural ceiling**.

29 of 29 remain template-fit at the algorithmic level. The bound is set by

NOTIMPL body authoring (13 of 29) and intx/blst/evmmax bootstrap (5 of 29),

not by parallelization shape.

3.9 Acceleration kernels

Per-target speedup kernels applied across the 29 primitives. Each kernel

has a dedicated LP describing the algorithm, the three-backend

implementation paths (Metal + CUDA + WGSL), the determinism harness, and

the crossover threshold. These LPs forward-reference the per-impl commit

on luxcpp/crypto@<sha> shipped by the parallel CTO agents.

| LP | Kernel | Primary consumer(s) | Projected speedup |
|---|---|---|---|
| LP-160 | Batched Montgomery Inversion (3-backend uniform) | secp256k1 / BN254 / BLS / Banderwagon | 14× M1 measured, 38–55× CUDA proj |
| LP-161 | Multi-curve Pippenger MSM (one templated body) | All MSM consumers | 1.0–1.4× + 3 200 LOC consolidation |
| LP-162 | Combined-pair Miller loop | BLS aggregate verify, BridgeVM batched pairing | 3.18× (k=4), 9.5× (k=1024 measured) |
| LP-163 | Karatsuba modexp at M_len ≥ 2048 | EIP-198, RSA attestation | 2.41× at 4096-bit |
| LP-164 | Six-step NTT for N up to 2^20 | ML-DSA-87, TFHE bootstrap | 6.8–9.2× at N=2^20 |
| LP-165 | Pedersen tree-reduce vector commit | Verkle width-256, IPA inner-product | 6.0–6.4× over linear |
| LP-166 | Threshold FROST + CGGMP21 batch pre-sign | M-Chain (MPCVM), B-Chain (BridgeVM) | 5.0–7.2× at M=7 N=64 |

Each LP records its threshold N* below which CPU is faster; the

substrate dispatcher routes through the per-kernel CROSSOVER table at

luxcpp/crypto/CROSSOVER.md. Where the underlying algorithm is variable-

time on a secret scalar (Pippenger, Karatsuba windowing, FROST nonce

generation), the LP enumerates the constant-time / blinded path the

prover-side caller MUST use.

4. History

LP-137 was implemented end-to-end across luxcpp/crypto, lux/crypto,

lux/mpc, lux/threshold, lux/hsm, luxcpp/lattice, lux/lattice by

2025-12-25. The source tree was lost in a laptop-theft data-loss

event in early 2026 and re-published from memory and audit recovery on

2026-04-27 / 2026-04-28. A consequence of that re-publication is that

every commit in every contributing repo carries a 2026-04 author timestamp,

which obscures the actual implementation order in git log. Rewriting

commit timestamps would be falsification — refused.

The legitimate fix is per-repo CHANGELOG.md written as authored prose,

narrating the Dec 2025 timeline by phase, with full-SHA citations into the

re-published commits, plus annotated semver tags whose tag-message bodies

state both the original-implementation date (2025-12-25) and the

re-publication date (2026-04-28).

Implementation phases (Dec 2025)

1. Foundation — brand-neutral DST sweep across luxcpp/crypto/*;

C-ABI rewrite to algorithm-named symbols.

2. AEAD — ChaCha20-Poly1305 + AES-256-GCM CPU bodies + Metal kernels;

aead_kat_test 110/110 + aead_test 27/27.

3. Hash family — SHA-256, RIPEMD-160, BLAKE2b CPU + Metal byte-equal

100 vectors each; BLAKE3 unfold-aliased-to-keccak bug fixed.

4. bn254 — full Fp/Fr/G1/Fp2/Fp6/Fp12/G2 tower, optimal-ate pairing,

SVDW hash-to-curve; bn254_kat_test 36/36 vs gnark v0.19.2.

5. Modular — modexp + evm256 (mulmod / addmod) c-abi; intx+evmmax

luxfi forks bootstrapped.

6. Pedersen — vector commitment + NewGeneratorsFromSeed with frozen

golden vector; brand-neutral DST landed.

7. Banderwagon — full Fp/Fr/Element/MSM (Pippenger); variable-time

doc; secret-scalar caller blinding via MultiExpBlinded.

8. KZG — EIP-4844 KAT vectors; CUDA + WGSL kernels.

9. IPA — Bulletproofs over Banderwagon; cross-proof batched verify;

prover-side blinding.

10. Lamport OTS — first-party CPU + Metal/CUDA/WGSL.

11. Lattice — NTT (Lattigo-byte-equal Montgomery) + poly_mul; Metal

BatchNTT skeleton (SIGSEGV fix tracked separately).

12. Composite attestation — SEV-SNP / TDX / NV NRAS parsers + composite

root C++↔Go byte-equal; require_* flags replaced wildcards-on-zero.

13. CI — hanzo-build native CI matrix (no QEMU); amd64 + arm64

parallel; brand-neutral.

Per-repo CHANGELOG + tag

| Repo | CHANGELOG path | Annotated tag |
|---|---|---|
| luxcpp/crypto | /Users/z/work/luxcpp/crypto/CHANGELOG.md | v1.3.0 |
| lux/crypto | /Users/z/work/lux/crypto/CHANGELOG.md | v1.18.2 |
| lux/mpc | /Users/z/work/lux/mpc/CHANGELOG.md | v1.10.2 |
| lux/threshold | /Users/z/work/lux/threshold/CHANGELOG.md | v1.6.4 (forward-only; v1.0.1 long-published) |
| lux/hsm | /Users/z/work/lux/hsm/CHANGELOG.md | v1.1.3 (already published) |
| lux/lps | /Users/z/work/lux/lps/CHANGELOG.md | (none — LP repos are documents) |

Tag bodies state both the original-implementation date and the

re-publication date. CHANGELOGs are authored prose, not git log dumps.

5. Decisions Log

Irreversible architectural decisions:

6. Outstanding Follow-Ups

7. References

Related LPs:

8. Security Considerations

Original Red findings (resolution status)

| # | Finding | Severity | Resolution |
|---|---|---|---|
| B1 | mpc/pkg/policy + pkg/mpc wrong import path luxfi/fhe/threshold (real: luxfi/fhe/pkg/threshold) | CRITICAL | resolved (one-line fix landed) |
| B2 | CDS Noise Proof Gap (PartyKeys=nil switches to public-key path that ships once CDS proofs ship) | HIGH (latent) | resolved via attestation-real-fixtures-2026-04-28 + B2 closure docs |
| B3 | blake3 spec_vector test references missing blake3/test/vectors/test_vectors.json | HIGH | resolved (vectors restored or test removed) |
| B4 | blake3 / slhdsa / lamport public C-ABI return CRYPTO_ERR_NOTIMPL while wins claimed | CRITICAL | resolved by aead-cpp-cpu-2026-04-27 + lamport-gpu-2026-04-27 + cabi-wire-fix-2026-04-27 |
| B5 | lux/fhe/policy/ directory does not exist; "67/67", "12", "28 rule_engine subcases", "5.2s policy eval" reference nothing | CRITICAL fabrication | claims retracted; package re-architected through threshold-decrypt MPC committee pattern (LP-073, LP-019) |
| F1 | luxgpu GitHub org does NOT exist | (organizational) | resolved — luxgpu org created, reserved for future C++ GPU kernel repos only |
| F2 | lux/accel is CPU+GPU dispatch, not GPU-only; should NOT move to luxgpu | (decision) | confirmed — stays at luxfi/accel |
| F3 | lux/mpc HEAD detached + dirty | (operational) | resolved (clean main, kms-nonce-bind merged) |
| F4 | luxcpp/consensus no remote configured + dirty | (operational) | confirmed uninitialized scaffold; documented; not pushed |
| F5 | luxcpp/cli is not a git repo | (decision) | confirmed — CLI lives at ~/work/lux/cli, governed by lux/* repo rules |
| O1/O3 | Oracle bypass risk (CPU canonical from same family as test oracle) | architectural | closed by first-party CPU + kat-second-oracle-2026-04-28 second oracle (real fixtures + adversarial-distance source) |
| O2/O4 | KAT generation deterministic but not adversarial | architectural | closed by second-oracle pattern across attestation, bn254, banderwagon |
| O5 | Wildcards-on-zero in CompositeAttestationConfig | latent fail-open | replaced with explicit require_sev_snp, require_tdx, require_nv_nras flags |

TFHE: SECURITY BLOCKER (in force)

luxfi/threshold/protocols/tfhe panic-guards every entry point. Confidential

lanes (#136) and FHE-policy on-chain (M-Chain × F-Chain integration #114)

are gated CLOSED until the real Shamir+lattice+combine pipeline lands.

Migration plan in §2.6.

Production-link blst symbol invariant

no-blst-in-production-check runs after every cevm build:


build/lib/libevm.dylib
build/lib/libevm.so
build/lib/cevm_precompiles/libcevm_precompiles.a
build/lib/evm/libevm-precompiles.a
build/lib/evm/libevm-kernel-metal.a
build/lib/evm/libevm-gpu.a
build/lib/evm/libevm-metal-hosts.a
build/lib/evm/libprecompile-service.a

Every binary above must report zero _blst_* symbols. PASS from cevm

v0.46.0 onward.

Subgroup policy


enum class SubgroupPolicy {
    AssumeChecked,
    CheckAndReject,
    ClearIfHashToCurve   // ONLY for hash-to-curve outputs
};

Validator pubkeys are checked and rejected, never modified.

Hash-to-curve outputs go through cofactor clearing exactly where the

specification requires it.

9. Backward Compatibility

None. LP-137 is a foundational release. Forward-only semver per fork pin

in go.mod. Annotated semver tags (v1.3.0 luxcpp/crypto, v1.18.2

lux/crypto, v1.10.2 lux/mpc, v1.6.4 lux/threshold, v1.1.3 lux/hsm)

are the granular implementation timeline; per-repo CHANGELOG.md is the

authored narrative.

Copyright

Copyright (C) 2026, Lux Industries Inc. CC0-1.0.