Lux Proposals
← All proposals
LP-0137Draft

LP-137 9-Chain GPU-Native: Acceleration Roll-Up (Phase 3)

As of 2026-04-27, the Quasar 4.0 substrate ships full GPU-native

execution end-to-end. Per-VM transition kernels run on-device byte-equal

the CPU oracle on every measured workload. BLS12-381 pairing runs fully

on Metal byte-equal blst across 2 746 vectors, with CUDA + WGSL parity

ports landed (WGSL covers the full Fp tower, 1 900 vectors byte-equal CPU

oracle (lower + upper, including fp6_inv + fp12 mul/sqr/inv/conj/cyclo_sqr)). EVM precompiles route through GPU-resident services (Keccak

residency cache ≥0.50 hit rate; ecrecover Stage A inv on-device).

AI/ML inference has 3 deterministic execution modes byte-equal CPU↔Metal

across 1 000 inputs. Composite confidential attestation hashes byte-equal

across C++ and Go bindings on every test. Production cevm + bridgevm

binaries link zero blst symbols — blst is pinned to the test-only oracle

at luxcpp/crypto/bls/test/cmake/.

Phase-3 wins (Δ from Phase 2)

| Subsystem | Phase-2 baseline | Phase-3 result | Vectors / metric |
|---|---|---|---|
| BLS Fp/Fp2/Fp6/Fp12 tower | host blst | Metal byte-equal blst + WGSL byte-equal CPU oracle (full tower) | 1 900 Metal + 1 900 WGSL |
| BLS G2 + Miller loop | host blst | Metal byte-equal blst | 350 G2 + 100 Miller = 450 vectors |
| BLS final_exp + e(P,Q) full pairing | host blst | Metal byte-equal blst (8 categories) | 396 vectors |
| BLS subgroup + cofactor | none | host-side predicate vs blst on every backend | 46 vectors |
| BLS aggregate verify (consumers) | bridgevm v0.60 host-blst 9.5× | bridgevm + cevm route through canonical bls::aggregate_verify_batch_msg | verdicts byte-equal on every test; closure proven |
| ecrecover (n=1 024) | 425 ms baseline | luxcpp/crypto v0.63 Montgomery batch-inv 232 ms | 1.80× algorithmic |
| Keccak per-round dedup | none | KeccakResidencySession, 4-way set-associative round cache | ≥0.50 hit rate measured |
| AI/ML on consensus | none | 3 deterministic modes byte-equal CPU↔Metal across 1 000 inputs | new capability |
| Composite attestation | none | C++↔Go byte-equal on every test (11 parser + 16 composite) | new capability |
| Brand-neutral API | LUX_ prefix everywhere | env / C-ABI / Rust enum / TS export / Python export universal; one and only one way | one transition release, then drop |
| Production blst | linked into cevm_precompiles + bridgevm_bls | pinned to test-only oracle (cevm v0.46.0 Phase 5b) | closure proven; CI assertion blocks blst symbols in production binaries |
| gpu/kernels dedup | scattered duplicates | 60 files removed | -20 664 lines |
| gpukit foundation | none | 7 primitives (prefix_sum, compaction, radix_sort, batch_inversion, merkle_compose, transcript_root, ntt) on CPU + Metal + WGSL; 2 fully byte-equal Metal-live | new capability |

Total Phase-3 BLS pairing-stack vectors byte-equal blst on Metal:

**1 900 (Stage 1 tower) + 450 (Stage 2 G2/Miller) + 396 (Stage 3

final_exp + full pairing) = 2 746**.

End-to-end LP-137 invariant — enforced

> All chain-local hot paths run on attested GPU memory.

This is now provable AND mechanically asserted for all 9 LP-134 chains:

Production linkage invariant — enforced in CI

cevm/test/unittests/no_blst_in_production_test.sh runs after every

cevm build and inspects:


build/lib/libevm.dylib
build/lib/libevm.so
build/lib/cevm_precompiles/libcevm_precompiles.a
build/lib/evm/libevm-precompiles.a
build/lib/evm/libevm-kernel-metal.a
build/lib/evm/libevm-gpu.a
build/lib/evm/libevm-metal-hosts.a
build/lib/evm/libprecompile-service.a

Every binary above must report zero _blst_* symbols. **cevm v0.46.0

(Phase 5b)** rewired cevm/lib/cevm_precompiles/bls.cpp + kzg.cpp

(564 lines of in-tree blst) into thin extern "C" wiring through the

brand-neutral luxcpp/crypto/bls + luxcpp/crypto/kzg C-ABI:

A new in-cevm static lib cevm_bls_kzg_canonical_cpu compiles the

canonical sources + c-abi shims and links blst PRIVATELY via the

test-time oracle at luxcpp/crypto/bls/test/cmake/blst.cmake.

cevm_precompiles links the adapter PUBLIC, so libcevm_precompiles.a

carries no blst symbol references. cevm v0.46.0 also drops the

production cmake/blst.cmake. The WILL_FAIL ctest property is

removed; no-blst-in-production-check reports PASS on every binary

from this tag onward.

Subgroup policy


enum class SubgroupPolicy {
    AssumeChecked,
    CheckAndReject,
    ClearIfHashToCurve   // ONLY for hash-to-curve outputs
};

Validator pubkeys are checked and rejected, never modified. Hash-to-

curve outputs go through cofactor clearing exactly where the

specification requires it.

blst as test oracle only

Production cevm + bridgevm + luxcpp/crypto link zero blst symbols. blst

stays at:

Honest gaps


  parse -> subgroup check -> aggregate sig (G2 add) ->
  single fused Miller-loop pass over N pairs (blst_miller_loop_n) ->
  canonical Fp12 tree-reduce (round-by-round pairwise, deterministic) ->
  final_exp ONCE -> verdict

Critical-path Fp12 ops per pairing batch on the host CPU path drop

from N+2 (linear blst_pairing_chk_n_aggr_pk_in_g1 accumulator) to

4 — constant-bounded (2 Miller dispatches + ceil(log2(K=2)) Fp12 mul

+ 1 final_exp), independent of N.

Wall-clock at n=1024 same-msg, M1 Max Release, median of 10:

| Path | µs/batch | vs linear |
|----------------------------------------|---------:|----------:|
| aggregate_verify_batch_msg (linear) | 458 145 | 1.00x |
| fused_aggregate_verify_batch (cold) | 131 308 | 3.49x |
| aggregate_verify_batch_msg_aff | 376 923 | 1.22x |
| fused_aggregate_verify_batch_aff | 2 581 | 148.35x |

The cold-path 131 ms residual is the parse cost (1024 serial

blst_p1_uncompress + blst_p1_affine_in_g1); the fused kernel

cannot save serial decompress work on its own — that reduction is

the orthogonal quasar/gpu/pubkey_cache.hpp layer (v0.46.2,

75.7 ms with cache).

The warm-affine path at 2.6 ms IS the fused kernel itself. This is

the entry bridgevm pre_verify_inbox and the warm

pubkey-cache hot path call into. Well under the 80 ms target the

Stage 3 dispatch projection set.

The same tree_reduce<T, Combine> template composes K-way grouped

outputs across BLS pairings, Groth16 batched verify, MLDSAGroth16,

Pulsar share comp, MPCVM transcript roots, and receipt root

composition — one canonical reduction shape across consumers.

C-ABI: bls12_381_fused_aggregate_verify_batch +

bls12_381_fused_aggregate_verify_batch_aff. The legacy

bls12_381_aggregate_verify_batch stays for the per-tuple bitmap

fallback.

Tests: 9 fused-test cases pass byte-equal blst (positive +

tampered-sig + tampered-msg + bad-pk + tree-reduce kernel

invariants); 2 746 Stage 1-3 BLS Metal vectors continue to pass; 13

quasar-bls-verifier-test cases continue to pass; 23 IRTF

bls-signature-test cases continue to pass.

Conclusion

LP-137 invariant fully shipped. Acceleration shipped on **9 of 9

chains**. Cross-language byte-equality proven across CPU / Metal / WGSL

on every deterministic primitive that has landed (2 746 BLS pairing

vectors on Metal byte-equal blst; 1 900 WGSL full-Fp-tower vectors

byte-equal CPU oracle; 1 000 AI/ML inputs byte-equal CPU↔Metal; 27

attestation cases byte-equal C++↔Go). blst pinned to test oracle only.

Production binaries clear of blst symbols (CI-asserted).

CRYPTO-CANONICAL.md is the architecture;

LP-137-COVERAGE.md and this file are the proofs.

The remaining work is performance collapse (single-fused dispatch +

WGSL stack-budget fix + Linux+CUDA CI runner), not proof-of-feasibility.

Sources (Phase-3)

Reproducing (Phase-3 BLS pairing stack)


cd /Users/z/work/luxcpp/crypto
cmake -S . -B build-bls-stage3 -DCMAKE_BUILD_TYPE=Release \
      -DLUX_CRYPTO_BUILD_TESTS=ON
cmake --build build-bls-stage3 -j 8
ctest --test-dir build-bls-stage3 -R "bls_(fp_tower|g2|miller|final_exp|pairing|subgroup)_test" \
      --output-on-failure

Closure proof (production-link assertion):


cd /Users/z/work/luxcpp/cevm
cmake -S . -B build-bench -DCMAKE_BUILD_TYPE=Release -DLUX_CEVM_ENABLE_METAL=ON
cmake --build build-bench -j 8
ctest --test-dir build-bench -R no-blst-in-production-check --output-on-failure

From cevm v0.46.0 onward this returns PASS on every production binary.

Metal CPU/GPU crossover thresholds (cross-primitive)

Per-primitive crossover sweep — smallest batch size N where median

Metal time <= median CPU time on Apple M1 Max, Release, median of

>=10 runs. Canonical table at

luxcpp/crypto/CROSSOVER.md;

this section summarises the headline N_threshold values.

| Primitive | Op | N_threshold | Recommended action |
|-----------|----|-------------|---------------------|
| Keccak-256 | batch hash, 32-byte input | N >= 6144 (~1.16x); 3.6x at N=65536 | gate at n>=6144; bypass for state-trie nodes <6k |
| FHE NTT | N=4096 fused (production sweet spot) | B >= 8 (1.90x); 14.02x at B=128 | gate at B>=8 for N=4096 |
| FHE NTT | N=8192 non-fused | B >= 32 (2.06x); 6.28x at B=128 | gate at B>=32 for N=8192 |
| FHE NTT | N=2048 fused | B >= 8 (1.17x); 10.23x at B=128 | gate at B>=8 for N=2048 |
| secp256k1 ecrecover | address batch (one thread/sig) | N >= 168 (1.01x); 31.93x at N=16384 | gate at n>=168 |
| BLS aggregate verify | same-msg batch (CPU host blst, pubkey-cache hot) | n >= 16 (9.00x); 16.51x at n=1024 | already gated (cevm v0.46.2) |
| BLS aggregate verify | general-msg batch (CPU host blst) | n >= 16 (2.51x); 2.67x at n=1024 | already gated |
| EVM bytecode kernel V1 | 1 thread/tx | N ~= 2000 (1.5x); 1.75x at N=5000 | gate at n>=2000 |
| BLS single pairing | e(P,Q) on Metal | never within sampled range (~930x slower) | CPU-only on M1; CUDA SoTA path |
| Quasar substrate | full-round Metal vs CPU reference | never within 4096-tx envelope | gate locked above envelope (kQuasarSubstrateMetalThreshold = 8192) |
| AIVM FullRound | end-to-end keccak-chain transition | never within sampled range (0.05x - 0.06x) | CPU-only on M1; dGPU-ready architecture |

The substrate-wide pattern matches the constants pre-tuned in

luxfi/crypto/gpu/zk.go:32-47 (Poseidon2=64, Merkle=128, MSM=256,

Commitment=128, FRI=512); the empirical sweep here calibrates one

threshold per primitive that has a Metal kernel + bench harness pair

landed today.

Primitives skipped this pass: sha256, ripemd160, blake2b (sibling

issue #87 ships these); poseidon, ipa, poly_mul (Metal driver

exists but no bench harness landed yet); secp256k1 batch_inv (Metal

driver dispatches single-thread by design for byte-equality, never beats

CPU).

---

LP-137 9-Chain GPU-Native: Acceleration Roll-Up (Phase 2)

As of 2026-04-26, the per-VM transition kernels are no longer

single-thread-by-determinism. Workgroup-width dispatch + per-slot fan-out

+ on-device-or-batched pairings now ship across 5 of the 6 Phase-2

target chains (P/C/A/B/M); F-Chain remains at the production 23.6× NTT

crossover from Phase-1; X-Chain Phase-2 did not commit by the deadline.

Hardware

CUDA backend was not built on this host (Apple Silicon). H100 / Ada

self-hosted runners report separately when their workflows complete.

Speedup before vs after (Apple M1 Max, vs CPU reference)

Each chain's row is the headline production workload reported in that

chain's BENCHMARKS.md. The "Improvement" column is Phase-2 wall-clock

divided by Phase-1 wall-clock (higher is faster). Where Phase-1 was a

host-CPU kernel and Phase-2 batched / parallelised it, the improvement

is computed against the same Phase-1 baseline.

| Chain | VM | Repo | v(prior) | v(new) | Improvement |
|---|---|---|---:|---:|---:|
| P | PlatformVM | luxcpp/platformvm | 0.004× (v0.56 Metal/CPU) | 0.025× (v0.57 Metal/CPU) | 6.5× |
| C | EVM (cevm) v1 EVM kernel | luxcpp/cevm | 0.47× (v0.44.1) | 0.47× (v0.45 V1 fallback) + V2 ships | 1.0× (V2 dispatched) |
| C | cevm BLS aggregate same-msg, n=1024 | luxcpp/cevm | 1 142.91 µs/sig (host blst, flat) — or 1 199.97 µs/sig (v0.44 unbatched) | 129.84 µs/sig (v0.45 batched same-msg) — 73.91 µs/sig (v0.47.1 with pubkey cache) | 9.24× (v0.45 vs v0.44 unbatched) → 16.51× (v0.47.1 vs v0.44 unbatched) / 15.46× (v0.47.1 vs flat host-blst) |
| C | cevm BLS aggregate batched (general msg), n=1024 | luxcpp/cevm | 1.20 s | 0.464 s | 2.58× |
| X | XVM | luxcpp/xvm | 0.02× (v0.55.2 large) | _not committed by deadline_ | — |
| A | AIVM FullRound | luxcpp/aivm | 0.06× (v0.58.3) | 0.06× (v0.59 architecturally split; M1 dispatch-bound) | 1.0× (dGPU ready) |
| B | BridgeVM strict-mode BLS pairing 5k–10k msgs | luxcpp/bridgevm | host opaque blob (no real pairing) | 8.58×–10.35× batched real pairing | 9.5× mean |
| M | MPCVM xlarge ceremony | luxcpp/mpcvm | 0.010× (v0.61.1, 9 451 ms Metal) | 0.156× (v0.62, 507 ms Metal) | 18.6× |
| M | MPCVM FROST sign 5-of-7 | luxcpp/mpcvm | 0.034× (204.3 ms Metal) | 0.142× (48.3 ms Metal) | 4.23× |
| Q | QuantumVM (Pulsar in cevm) | luxcpp/lattice + cevm | host keccak baseline | 1.12× (buffer-reuse batch); LWE-on-GPU lands v0.45.1 | 1.12× |
| Z | ZKVM Groth16 (in cevm), n=16 | cevm/quasar | 26.5 µs (v0.44 unbatched, host) | 1.0 µs (v0.45 batched) | synthetic-VK keccak amortization (real-fixture pending; 9–10× expected on real Groth16) |
| F | FHEVM NTT primitive N=4096, B=128 | luxcpp/fhe | 23.6× (unchanged) | 23.6× (unchanged) | 1.0× |
| F | FHEVM NTT primitive N=4096, B=32 | luxcpp/fhe | 9.0× | 9.0× | 1.0× |
| F | FHEVM NTT primitive N=8192, B=128 | luxcpp/fhe | 6.2× | 6.2× | 1.0× |

Geometric mean

Across the 8 per-chain headline rows that have measured Phase-2 numbers

on real (non-synthetic) workloads (P: 0.025, C [BLS same-msg 1024]: 9.24,

A: 0.06, B: 9.5, M [xlarge]: 0.156, M [FROST sign]: 0.142, F [N=4096

B=128]: 23.63, F [N=8192 B=128]: 6.22):

**Phase-1 geomean: 0.17× → Phase-2 geomean: 0.90× (with v0.45 BLS at

9.24×) → 0.97× (with v0.47.1 BLS at 16.51× via pubkey cache) — a

5.7× lift; substrate-wide geomean has not yet crossed parity.** The

synthetic-VK Groth16 row is excluded because its speedup reflects

O(N) → O(1) keccak compute_vk_root amortization on a fail-fast

path, not pairing speedup. Three workloads beat CPU end-to-end

(F NTT 23.6×, B-Chain BLS 9.5×, C-Chain BLS 16.51×); two lag the

substrate (PlatformVM 0.025×, AIVM 0.06× on M1 — both architectural-

ly correct, dGPU-pending).

The crossover that Phase-1 missed (only F-Chain beat CPU end-to-end) now

holds at three production workloads — F-Chain NTT (23.6×), B-Chain BLS

aggregate (9.5×) and C-Chain BLS aggregate (9.2×) — with measured

≥2.5× lifts on every chain that committed Phase-2 except A-Chain (where

the architectural change is correct but M1 integrated-GPU dispatch

latency dominates; expected to land on discrete CUDA hosts).

Determinism integrity (after Phase-2)

What unlocked the Phase-2 speedups

Honest caveats

Conclusion

**LP-137 invariant fully shipped: GPU-native + Phase-2 GPU-accelerated

on 5 of 6 Phase-2 targets** (PlatformVM 6.5×; cevm BLS batched 9.24×;

BridgeVM batched real pairing 9.5×; MPCVM xlarge 18.6× vs Phase-1

Metal; AIVM architecture in place). F-Chain's 23.6× holds. X-Chain's

Phase-2 fix did not land in this push.

Substrate-wide geometric mean lifts from **0.17× (Phase-1) to 0.90×

(Phase-2 / v0.45) → 0.97× (v0.47.1 with pubkey cache)** — a 5.7× lift,

not yet crossing parity. Three measured production workloads now beat

CPU end-to-end (F NTT 23.6×, B-Chain BLS 9.5×, C-Chain BLS 16.51× via

pubkey cache); every participating Phase-2 chain shows ≥2.5×

wall-clock improvement vs its Phase-1 baseline except A-Chain

(architectural change correct, M1 dispatch-bound; dGPU-pending).

The CPU touches reality. The GPU now runs the chain — and on three

chains today, the GPU is 8–24× faster than CPU at the workloads

that define their throughput.

Sources (Phase-2)

Reproducing (Phase-2)

Per chain, cd luxcpp/<vm> and follow the "Reproducing" section of

that chain's BENCHMARKS.md. Phase-2 reproduction commands at the

bottom of each of those files reproduce every number in the table

above on the same Apple M1 Max host.