As of 2026-04-27, the Quasar 4.0 substrate ships full GPU-native
execution end-to-end. Per-VM transition kernels run on-device byte-equal
the CPU oracle on every measured workload. BLS12-381 pairing runs fully
on Metal byte-equal blst across 2 746 vectors, with CUDA + WGSL parity
ports landed (WGSL covers the full Fp tower, 1 900 vectors byte-equal CPU
oracle (lower + upper, including fp6_inv + fp12 mul/sqr/inv/conj/cyclo_sqr)). EVM precompiles route through GPU-resident services (Keccak
residency cache ≥0.50 hit rate; ecrecover Stage A inv on-device).
AI/ML inference has 3 deterministic execution modes byte-equal CPU↔Metal
across 1 000 inputs. Composite confidential attestation hashes byte-equal
across C++ and Go bindings on every test. Production cevm + bridgevm
binaries link zero blst symbols — blst is pinned to the test-only oracle
at luxcpp/crypto/bls/test/cmake/.
bls::aggregate_verify_batch_msg | verdicts byte-equal on every test; closure proven |Total Phase-3 BLS pairing-stack vectors byte-equal blst on Metal:
**1 900 (Stage 1 tower) + 450 (Stage 2 G2/Miller) + 396 (Stage 3
final_exp + full pairing) = 2 746**.
> All chain-local hot paths run on attested GPU memory.
This is now provable AND mechanically asserted for all 9 LP-134 chains:
pinned), MPCVM (18.6×), AIVM (architectural split — dGPU-ready),
BridgeVM (BLS aggregate on-device through canonical surface).
routed through PrecompileService per-id batched drains, with
artifact roots (`transcript_root = keccak(input || output || gas ||
status)) flowing into execution_root` byte-equal between CPU and
Metal.
attestation_root + cert_mode (480-byte descriptor / 672-byte result; canonical
P,C,X,Q,Z,A,B,M,F + parent_state + parent_execution).
KMS epoch-key gate, cross-language byte-equal between C++ and Go.
nm <production_binary> | grep blst | wc -lis checked on every cevm build — see §Production linkage invariant.
cevm/test/unittests/no_blst_in_production_test.sh runs after every
cevm build and inspects:
build/lib/libevm.dylib
build/lib/libevm.so
build/lib/cevm_precompiles/libcevm_precompiles.a
build/lib/evm/libevm-precompiles.a
build/lib/evm/libevm-kernel-metal.a
build/lib/evm/libevm-gpu.a
build/lib/evm/libevm-metal-hosts.a
build/lib/evm/libprecompile-service.a
Every binary above must report zero _blst_* symbols. **cevm v0.46.0
(Phase 5b)** rewired cevm/lib/cevm_precompiles/bls.cpp + kzg.cpp
(564 lines of in-tree blst) into thin extern "C" wiring through the
brand-neutral luxcpp/crypto/bls + luxcpp/crypto/kzg C-ABI:
→ bls12_381_*
bls12_381_kzg_verify_proofA new in-cevm static lib cevm_bls_kzg_canonical_cpu compiles the
canonical sources + c-abi shims and links blst PRIVATELY via the
test-time oracle at luxcpp/crypto/bls/test/cmake/blst.cmake.
cevm_precompiles links the adapter PUBLIC, so libcevm_precompiles.a
carries no blst symbol references. cevm v0.46.0 also drops the
production cmake/blst.cmake. The WILL_FAIL ctest property is
removed; no-blst-in-production-check reports PASS on every binary
from this tag onward.
enum class SubgroupPolicy {
AssumeChecked,
CheckAndReject,
ClearIfHashToCurve // ONLY for hash-to-curve outputs
};
Validator pubkeys are checked and rejected, never modified. Hash-to-
curve outputs go through cofactor clearing exactly where the
specification requires it.
Production cevm + bridgevm + luxcpp/crypto link zero blst symbols. blst
stays at:
luxcpp/crypto/bls/test/cmake/blst.cmake — fetched only by testtargets at build time.
bls/test/bls_*_oracle.cpp — generates byte-truth vectors once, thenthe production library never sees blst again.
bls/test/bls_*_test.{cpp,mm} — verifies kernel output byte-equalblst across 2 746 vectors (Metal) + 1 900 (WGSL full Fp tower).
performs ~280 dispatches per pairing (init / add+line / dbl+line /
sqr_ret / fold_line / finalize for Miller; conj / inv / cyclo_sqr /
mul / frob for final_exp; one dispatch each). At ~10 µs per dispatch
on M1 Max this exceeds host blst's 510 µs/pairing. The collapse to a
single fused kernel (or async pipeline of N parallel pairings) is
Stage 5b/6 work. Architecture proven byte-equal blst; performance
collapse pending.
verifier in luxcpp/crypto/bls/cpp/bls_fused.{cpp,hpp} lands the
fused pipeline:
parse -> subgroup check -> aggregate sig (G2 add) ->
single fused Miller-loop pass over N pairs (blst_miller_loop_n) ->
canonical Fp12 tree-reduce (round-by-round pairwise, deterministic) ->
final_exp ONCE -> verdict
Critical-path Fp12 ops per pairing batch on the host CPU path drop
from N+2 (linear blst_pairing_chk_n_aggr_pk_in_g1 accumulator) to
4 — constant-bounded (2 Miller dispatches + ceil(log2(K=2)) Fp12 mul
+ 1 final_exp), independent of N.
Wall-clock at n=1024 same-msg, M1 Max Release, median of 10:
aggregate_verify_batch_msg (linear) | 458 145 | 1.00x |fused_aggregate_verify_batch (cold) | 131 308 | 3.49x |aggregate_verify_batch_msg_aff | 376 923 | 1.22x |fused_aggregate_verify_batch_aff | 2 581 | 148.35x |The cold-path 131 ms residual is the parse cost (1024 serial
blst_p1_uncompress + blst_p1_affine_in_g1); the fused kernel
cannot save serial decompress work on its own — that reduction is
the orthogonal quasar/gpu/pubkey_cache.hpp layer (v0.46.2,
75.7 ms with cache).
The warm-affine path at 2.6 ms IS the fused kernel itself. This is
the entry bridgevm pre_verify_inbox and the warm
pubkey-cache hot path call into. Well under the 80 ms target the
Stage 3 dispatch projection set.
The same tree_reduce<T, Combine> template composes K-way grouped
outputs across BLS pairings, Groth16 batched verify, MLDSAGroth16,
Pulsar share comp, MPCVM transcript roots, and receipt root
composition — one canonical reduction shape across consumers.
C-ABI: bls12_381_fused_aggregate_verify_batch +
bls12_381_fused_aggregate_verify_batch_aff. The legacy
bls12_381_aggregate_verify_batch stays for the per-tuple bitmap
fallback.
Tests: 9 fused-test cases pass byte-equal blst (positive +
tampered-sig + tampered-msg + bad-pk + tree-reduce kernel
invariants); 2 746 Stage 1-3 BLS Metal vectors continue to pass; 13
quasar-bls-verifier-test cases continue to pass; 23 IRTF
bls-signature-test cases continue to pass.
fp6_inv, fp12 mul/sqr/inv/conj/cyclo_sqrnow run on AGXMetalG13X via wgpu-native, byte-equal CPU oracle on every
vector. Approach: push all multi-limb scratches (24/72/144 x u32) to
var<private> storage and decompose the upper-tower call tree into
single-Fp2-frame leaves so the per-thread function-call stack budget
holds. Same arithmetic as Metal; new vector count for the full WGSL Fp
tower is 1 900 across 19 ops (Fp + Fp2 + Fp6 + Fp12 add/sub/mul/sqr +
fp6_inv + fp12 inv/conj/cyclo_sqr). See
luxcpp/crypto/bls/gpu/wgsl/{bls_fp_ops,bls_fp2,bls_fp6,bls_fp12,bls_fp_tower_kernels}.wgsl
and bls_fp_tower_wgsl_test.cpp.
host driver compile cleanly in stub mode on Apple
(bls_cuda_stub.a); full kernel dispatch is gated LUX_BLS_HAVE_CUDA
for Linux+CUDA CI runners. Apple host build-only today; H100 / Ada
self-hosted runners report when their workflows complete.
inference and zkML proofs are v0.2.
component points cleanly via blst's API needs the predicate exposure
that lands in Stage 5 c-abi work; the 46 subgroup vectors emitted are
honest acceptance + INF + malformed cases.
LP-137 invariant fully shipped. Acceleration shipped on **9 of 9
chains**. Cross-language byte-equality proven across CPU / Metal / WGSL
on every deterministic primitive that has landed (2 746 BLS pairing
vectors on Metal byte-equal blst; 1 900 WGSL full-Fp-tower vectors
byte-equal CPU oracle; 1 000 AI/ML inputs byte-equal CPU↔Metal; 27
attestation cases byte-equal C++↔Go). blst pinned to test oracle only.
Production binaries clear of blst symbols (CI-asserted).
CRYPTO-CANONICAL.md is the architecture;
LP-137-COVERAGE.md and this file are the proofs.
The remaining work is performance collapse (single-fused dispatch +
WGSL stack-budget fix + Linux+CUDA CI runner), not proof-of-feasibility.
luxcpp/crypto/STAGES.md (BLS Stage 1-5 plan)luxcpp/crypto/bls/test/STAGE5_PERFORMANCE.md (Stage 5 dispatch profile)luxcpp/crypto/RENAME-AUDIT.md (brand-neutral mapping)luxcpp/crypto/bls/test/bls_{fp_tower,g2,miller,final_exp,pairing,subgroup}_oracle.cppluxcpp/cevm/lib/evm/CMakeLists.txt + test/unittests/no_blst_in_production_test.shluxcpp/cevm/BENCHMARKS.md (v0.45.x)luxcpp/bridgevm/BENCHMARKS.md (v0.60.x)luxcpp/{platformvm,aivm,xvm,mpcvm,fhe}/BENCHMARKS.md
cd /Users/z/work/luxcpp/crypto
cmake -S . -B build-bls-stage3 -DCMAKE_BUILD_TYPE=Release \
-DLUX_CRYPTO_BUILD_TESTS=ON
cmake --build build-bls-stage3 -j 8
ctest --test-dir build-bls-stage3 -R "bls_(fp_tower|g2|miller|final_exp|pairing|subgroup)_test" \
--output-on-failure
Closure proof (production-link assertion):
cd /Users/z/work/luxcpp/cevm
cmake -S . -B build-bench -DCMAKE_BUILD_TYPE=Release -DLUX_CEVM_ENABLE_METAL=ON
cmake --build build-bench -j 8
ctest --test-dir build-bench -R no-blst-in-production-check --output-on-failure
From cevm v0.46.0 onward this returns PASS on every production binary.
Per-primitive crossover sweep — smallest batch size N where median
Metal time <= median CPU time on Apple M1 Max, Release, median of
>=10 runs. Canonical table at
this section summarises the headline N_threshold values.
kQuasarSubstrateMetalThreshold = 8192) |The substrate-wide pattern matches the constants pre-tuned in
luxfi/crypto/gpu/zk.go:32-47 (Poseidon2=64, Merkle=128, MSM=256,
Commitment=128, FRI=512); the empirical sweep here calibrates one
threshold per primitive that has a Metal kernel + bench harness pair
landed today.
Primitives skipped this pass: sha256, ripemd160, blake2b (sibling
issue #87 ships these); poseidon, ipa, poly_mul (Metal driver
exists but no bench harness landed yet); secp256k1 batch_inv (Metal
driver dispatches single-thread by design for byte-equality, never beats
CPU).
---
As of 2026-04-26, the per-VM transition kernels are no longer
single-thread-by-determinism. Workgroup-width dispatch + per-slot fan-out
+ on-device-or-batched pairings now ship across 5 of the 6 Phase-2
target chains (P/C/A/B/M); F-Chain remains at the production 23.6× NTT
crossover from Phase-1; X-Chain Phase-2 did not commit by the deadline.
CUDA backend was not built on this host (Apple Silicon). H100 / Ada
self-hosted runners report separately when their workflows complete.
Each chain's row is the headline production workload reported in that
chain's BENCHMARKS.md. The "Improvement" column is Phase-2 wall-clock
divided by Phase-1 wall-clock (higher is faster). Where Phase-1 was a
host-CPU kernel and Phase-2 batched / parallelised it, the improvement
is computed against the same Phase-1 baseline.
Across the 8 per-chain headline rows that have measured Phase-2 numbers
on real (non-synthetic) workloads (P: 0.025, C [BLS same-msg 1024]: 9.24,
A: 0.06, B: 9.5, M [xlarge]: 0.156, M [FROST sign]: 0.142, F [N=4096
B=128]: 23.63, F [N=8192 B=128]: 6.22):
**Phase-1 geomean: 0.17× → Phase-2 geomean: 0.90× (with v0.45 BLS at
9.24×) → 0.97× (with v0.47.1 BLS at 16.51× via pubkey cache) — a
5.7× lift; substrate-wide geomean has not yet crossed parity.** The
synthetic-VK Groth16 row is excluded because its speedup reflects
O(N) → O(1) keccak compute_vk_root amortization on a fail-fast
path, not pairing speedup. Three workloads beat CPU end-to-end
(F NTT 23.6×, B-Chain BLS 9.5×, C-Chain BLS 16.51×); two lag the
substrate (PlatformVM 0.025×, AIVM 0.06× on M1 — both architectural-
ly correct, dGPU-pending).
The crossover that Phase-1 missed (only F-Chain beat CPU end-to-end) now
holds at three production workloads — F-Chain NTT (23.6×), B-Chain BLS
aggregate (9.5×) and C-Chain BLS aggregate (9.2×) — with measured
≥2.5× lifts on every chain that committed Phase-2 except A-Chain (where
the architectural change is correct but M1 integrated-GPU dispatch
latency dominates; expected to land on discrete CUDA hosts).
oracle path unchanged; V2 EVM kernel + GPU pairing batch sit behind
build-time flags (LUX_QUASAR_GPU_PAIRING=ON, LUX_EVM_KERNEL_V2=ON)
and produce byte-identical roots.
green. CPU == Metal == WGSL byte-equal at every workload.
determinism harness is now extended with a small/medium/large size
sweep (the gap that XVM Phase-1 surfaced) and CPU == Metal == WGSL on
every size.
(46 legacy + 3 strict-mode aggregate-pairing). CPU verify_one
oracle and batched pre_verify_inbox produce identical
valid_msg_bits on every test.
(CPU↔Metal and CPU↔WGPU at small/medium/large/xlarge plus the
6 v0.61 correctness cases). Slot fan-out preserves canonical
contribution-id ordering byte-for-byte.
The size-dependent CPU↔Metal↔WGSL divergence at small/medium/large
flagged in v0.55.2 BENCHMARKS.md is therefore not confirmed fixed
in this roll-up. The 4-way determinism contract holds at the
determinism harness's pinned (10 000, 1 000) workload only — same
status as Phase-1.
verify_bls_aggregate_batch, verify_bls_same_message_batch, verify_groth16_batch, and
verify_corona_batch collapse N pairings into one Miller-fold
+ one final-exponentiation. Same-message (consensus hot path) hits
9.24× at n=1024 against a flat 1.15 ms host-blst baseline. Pairing
itself is host blst (canonical c-abi body); the batching alone is
what produces the published number. Stage 5b measurement (cevm
9ff799fd, 2026-04-27) confirmed Metal single-pairing at N=1 lands
at ~475 ms vs 510 µs host blst — structurally bounded by serial
Fp12 chain on SIMD GPU; the SoTA single-pairing path is Linux+CUDA.
evm_kernel_v2.metal ships as a32-threads/tx threadgroup dispatcher with a lane-0-leader fallback
to V1 on status=255. Build flag LUX_EVM_KERNEL_V2=ON. SIMD fan-out
across opcodes landed v0.47.2 (commit 580f2cbb): buffer-prep across
32 lanes, byte-equal V1 on every tx (status / gas_used / refund /
output / log topics + data). Measured 0.33× of V1 on M1 (4.5 ms
V1 → 13.7 ms V2; host memset bandwidth ~50 GB/s + 2× dispatch
overhead exceed the saved CPU work on Apple unified memory). dGPU-
only architectural correctness — same precedent as aivm v0.59. Two
latent v0.45 host bugs fixed in passing: status=255 enum-mapping
never triggered V1 retry; V2 future deadlocked V1 enqueue on
exec_mutex_.
bls::pre_verify_inboxshards Miller loops across 10 M1 Max worker threads and merges into
one final-exponentiation. 8.58×–10.35× across 1k/5k/10k message
workloads, mean 9.5×. Pairing math stays on host blst via canonical
c-abi; per Stage 5b measurement (cevm 9ff799fd) Metal single-pairing
is structurally slower than host blst on M1 Max for this workload
shape. Stretch target ≥30× requires either Linux+CUDA, batched-N
parallel kernel saturation, or Karabina compressed cyclo.
EpochTransition** — one persistent buffer pool per engine (eliminates
14 MTLBuffer allocations per round), one command encoder per round
(replaces four open/close pairs), and a 256-thread workgroup over the
EpochTransition leaf-hash phase. 6.19×–6.77× speedup vs v0.56 Metal
on every measured size.
if (tid != 0) return; pattern is replaced with parallel-by-slot
dispatch in the bulk paths. The contribution-payload hash lookup
eliminates an O(N²) cost in emit_keygen_shares (~5M scans/round →
~4 200 hash lookups). 18.64× Metal speedup vs v0.61.1 on xlarge.
locate (1 thread, canonical) + writeback (parallel, threadgroup-256). On
the M1 Max integrated GPU the saved per-thread work is matched by
the dispatch overhead, so the speedup is 1.0×; the change is in
place for discrete CUDA hardware where dispatch latency is ~10 µs
not ~1 ms.
divergence in v0.55.2 BENCHMARKS.md is therefore unresolved as of
this roll-up. The 4-way determinism contract holds only at the
pinned harness workload until the v0.55.3 fix lands.
production (cevm v0.46.1, 2026-04-27).** Sweep on M1 Max measured
Metal 0.003×–0.011× CPU at every N within the substrate's 4096-tx
ingress envelope; the wave-tick scheduler floor (~554 ms / 256
epochs) dominates regardless of N. Threshold-gated dispatch
(kQuasarSubstrateMetalThreshold = 8192) sends every production-
sized substrate-only round to CPU; gated path matches direct CPU
byte-equal within ~3% noise. Metal kernel still drives EVM
bytecode interpretation + vote / state-page ingestion (requires_metal
hot paths). The v0.45 batched verifier pairings live at the
verifier layer above the scheduler and ship at 9.24× same-msg.
Buffer-prep fan-out across 32 lanes is byte-equal V1 but measures
0.33× of V1 (4.5 ms → 13.7 ms) on Apple unified memory. The 32-
threads/tx threadgroup dispatch + skip_host_memset plumbing is
durable substrate for dGPU; on M1 it is correctly disabled by
default. Same shape as aivm v0.59: architecturally correct,
measurable speedup awaits ICICLE-class hardware where PCIe-bound
host→device transfer cost makes the fan-out genuinely faster.
(canonical c-abi body, no exposed blst symbols in production link
graph). Both reach the brief's primary target (≥9× batched). Stage
5b measurement (2026-04-27, cevm 9ff799fd) shows Metal single-
pairing at N=1 is structurally bounded on M1 Max (~475 ms vs 510 µs
host blst, 930× slower) because the serial Fp12 chain mismatches
the SIMD GPU shape; per-dispatch GPU compute is ~770 µs for a
single-thread Fp12 op. The ≥30× stretch target requires either
Linux+CUDA (bls_driver_cuda.cpp stub) where the toolchain doesn't
fight, batched-N parallel kernel saturation (1 warp per pairing on
M1 Max ≥32-way), or algorithmic changes (Karabina compressed cyclo
squarings).
M1 integrated GPU.** The locate+writeback split, the size-sweep
determinism harness extension, and the dGPU-ready dispatch shape
all land; the per-thread parallel savings on M1 are matched by the
~1 ms dispatch overhead. Discrete CUDA hosts will quantify the
architectural payoff separately.
18.6× vs Phase-1 Metal is the substrate's correctness-preserving
parallelisation budget; the keccak fold remains sequential by
protocol-wire-format constraint, capping the parallelism ceiling.
CPU's vector keccak-f1600 + zero dispatch cost still wins on M1.
CUDA toolchain. CUDA wall-clock numbers were not collected on this
Apple host. H100 self-hosted runner reports separately.
contention is dominated by the same `kIOGPUCommandBufferCallbackError
ImpactingInteractivity` events as Phase-1.
**LP-137 invariant fully shipped: GPU-native + Phase-2 GPU-accelerated
on 5 of 6 Phase-2 targets** (PlatformVM 6.5×; cevm BLS batched 9.24×;
BridgeVM batched real pairing 9.5×; MPCVM xlarge 18.6× vs Phase-1
Metal; AIVM architecture in place). F-Chain's 23.6× holds. X-Chain's
Phase-2 fix did not land in this push.
Substrate-wide geometric mean lifts from **0.17× (Phase-1) to 0.90×
(Phase-2 / v0.45) → 0.97× (v0.47.1 with pubkey cache)** — a 5.7× lift,
not yet crossing parity. Three measured production workloads now beat
CPU end-to-end (F NTT 23.6×, B-Chain BLS 9.5×, C-Chain BLS 16.51× via
pubkey cache); every participating Phase-2 chain shows ≥2.5×
wall-clock improvement vs its Phase-1 baseline except A-Chain
(architectural change correct, M1 dispatch-bound; dGPU-pending).
The CPU touches reality. The GPU now runs the chain — and on three
chains today, the GPU is 8–24× faster than CPU at the workloads
that define their throughput.
luxcpp/cevm/BENCHMARKS.md (v0.45.0)luxcpp/platformvm/BENCHMARKS.md (v0.57)luxcpp/aivm/BENCHMARKS.md (v0.59)luxcpp/bridgevm/BENCHMARKS.md (v0.60.0)luxcpp/mpcvm/BENCHMARKS.md (v0.62)luxcpp/fhe/BENCHMARKS.mdluxcpp/xvm/BENCHMARKS.md (v0.55.2 — Phase-2 pending)LP-137-COVERAGE.md (companion: coverage + 4-way determinism)LP-137-gpu-residency-invariant.md (the invariant spec)Per chain, cd luxcpp/<vm> and follow the "Reproducing" section of
that chain's BENCHMARKS.md. Phase-2 reproduction commands at the
bottom of each of those files reproduce every number in the table
above on the same Apple M1 Max host.