This LP audits the parallel-execution decomposition of every Lux chain.
For each of the nine chains in LP-134 we record (a) whether the
QuasarSTM (Block-STM 3.0) ordered MVCC fabric is in the hot path, (b)
the parallelism granularity, (c) the conflict-detection mechanism, and
(d) the GPU dispatch points wired through the QuasarGPU substrate
service ring (LP-132). The aim is to verify the substrate-residency
invariant from LP-137 — no chain-local hot path leaves attested GPU
memory in production — across all nine chains in one place.
This is a survey LP. It does not redesign anything. It reads the code
that already exists in luxcpp/cevm and the per-VM repos (platformvm,
xvm, aivm, bridgevm, mpcvm, fhe, plus the cert-lane verifiers
under cevm/lib/consensus/quasar/gpu/) and writes down what is wired,
what is partial, and what is CPU-only by design.
QuasarGPU (LP-132) routes every chain through the same wave-tick service
ring. The ring header is one enum class ServiceId : uint32_t with 17
slots (file: cevm/lib/consensus/quasar/gpu/quasar_gpu_layout.hpp:41).
Slots 0..11 are the substrate proper:
Slots 12..16 are per-chain transition fans:
- pchain_validator_root
- xchain_execution_root
- achain_state_root
- bchain_state_root
- mchain_state_root

C/Q/Z/F do not need transition slots because their roots already exist
in the round descriptor (parent_block_hash, qchain_ceremony_root,
zchain_vk_root, fchain_state_root).
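The slot split described above can be pictured with a small C++ sketch. Only the ranges (0..11 and 12..16), the total of 17, and the root field names come from this LP. The enumerator names PlatformVMTransition and MPCVMTransition, the order of slots within 12..16, and the helper functions are assumptions for illustration and are not copied from quasar_gpu_layout.hpp.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical mirror of the ring header. Enumerator names and the exact
// slot numbers inside 12..16 are assumed, not taken from the real header.
enum class ServiceId : std::uint32_t {
    SubstrateFirst       = 0,   // first substrate slot
    SubstrateLast        = 11,  // last substrate slot
    PlatformVMTransition = 12,  // assumed name and position
    XVMTransition        = 13,  // assumed position
    AIVMTransition       = 14,  // assumed position
    BridgeVMTransition   = 15,  // assumed position
    MPCVMTransition      = 16,  // assumed name and position
};

inline constexpr std::uint32_t kServiceSlotCount = 17;

// True for the per-chain transition slots (12..16).
constexpr bool is_transition_slot(ServiceId id) {
    const auto v = static_cast<std::uint32_t>(id);
    return v >= 12 && v < kServiceSlotCount;
}

// Root field written by each transition slot; empty for substrate slots.
constexpr std::string_view transition_root(ServiceId id) {
    switch (id) {
        case ServiceId::PlatformVMTransition: return "pchain_validator_root";
        case ServiceId::XVMTransition:        return "xchain_execution_root";
        case ServiceId::AIVMTransition:       return "achain_state_root";
        case ServiceId::BridgeVMTransition:   return "bchain_state_root";
        case ServiceId::MPCVMTransition:      return "mchain_state_root";
        default:                              return {};
    }
}

// Array index for a slot; guards against ids outside the 17-slot ring.
inline std::uint32_t slot_index(ServiceId id) {
    const auto v = static_cast<std::uint32_t>(id);
    assert(v < kServiceSlotCount && "ServiceId outside the 17-slot ring");
    return v;
}
```

A fixed-width enum keeps the slot id layout identical on the host and in the CUDA, Metal, and WGSL kernels, which is presumably why the header pins the underlying type to uint32_t.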
The Block-STM scheduler proper is in
cevm/lib/evm/gpu/scheduler.{hpp,cpp} (115 LOC) plus
cevm/lib/evm/gpu/parallel_engine.{hpp,cpp} (204 LOC). The GPU port
lives at cevm/lib/evm/gpu/cuda/block_stm{.cu,_host.{hpp,cpp}} and
cevm/lib/evm/gpu/metal/block_stm.metal. The ConflictSpec ABI lives in
cevm/lib/evm/stm/conflict_spec.hpp (LP-090 v0.50).
| Chain | Dispatch path and root | GPU kernels and sources |
|---|---|---|
| P-Chain | pchain_validator_root | platformvm_*.{cu,metal,wgsl} (staking, slashing, validator-set, transition). BLS sig batch via quasar_bls_verifier.cpp. |
| C-Chain | drain_exec (EVM fiber VM, LP-009) | Precompiles: ecrecover (cuda+metal), bls12-381 (cuda+metal), point_eval (metal), dex_match (metal), keccak (cuda+metal). modexp/blake2f/sha256/ripemd160/bn256 currently CPU-only. |
| X-Chain | xchain_execution_root via XVMTransition slot | xvm_utxo.{cu,metal,wgsl}, xvm_membership.*, xvm_asset.*, xvm_roots.*. |
| Q-Chain | drain_cert_lane via quasar_corona_verifier.cpp | crypto/corona/gpu/metal/corona_{verify,sign,ops}.metal. Lattice ops in crypto/lattice (NTT, ring). |
| Z-Chain | drain_cert_lane via quasar_groth16_verifier.cpp (uses vk_root from round descriptor) | crypto/bn254/gpu/metal/bn254.metal for pairings; crypto/kzg, crypto/ipa, crypto/banderwagon for adjacent rollup proofs. |
| A-Chain | drain_attest → achain_state_root via AIVMTransition | aivm_attestation.{cu,metal,wgsl}, aivm_anchor.*, ai_precompile_metal.mm. Composite attestation parsers in crypto/attestation. |
| B-Chain | drain_bridge → bchain_state_root via BridgeVMTransition | bridgevm_bls.cpp, bridgevm_liquidity.cu, bridgevm_kernels_common.{cuh,metal,wgsl}. |
| M-Chain | drain_cert_lane then transition → mchain_state_root | mpcvm_cggmp21.{cu,metal,wgsl}, mpcvm_frost.*, mpcvm_ceremony.*. CGGMP21/FROST kernels in crypto/{cggmp21,frost}/gpu. |
| F-Chain | drain_fhe → fchain_state_root field | OpenFHE-derived fhe/src/{binfhe,core,pke}/, NTT in crypto/ntt, lattice in luxcpp/lattice. MLX/Metal poly-mul kernels per BENCHMARKS_METAL_NTT.txt. |
P-Chain: tx → validator-set delta → pchain_validator_root
C-Chain: tx → ConflictSpec lane → Block-STM (Exec → Validate → Repair → Commit)
│ inside Exec: per-fiber EVM
│ inside Validate: per-read-set check
X-Chain: utxo → membership proof → xchain_execution_root
Q-Chain: share → Pulsar aggregate → qchain_ceremony_root
Z-Chain: proof → Groth16 pairing → zchain_vk_root
A-Chain: quote → attestation parser → achain_state_root
B-Chain: message → inbox replay + BLS verify → bchain_state_root
M-Chain: share → round aggregate → mchain_state_root
F-Chain: op-node → RLWE / NTT → fchain_state_root
The smallest serial-execution unit is bolded above. Everything below it
is parallelizable. The C-Chain is the only chain with full Block-STM
read-set MVCC. The other eight chains have **trivial conflict
structures** (UTXO double-spend, ceremony-round ordering, FHE op-graph
DAG, attestation append) so the heavyweight 3-tier validation pipe is
not the right tool — a single conflict check per item is enough.
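To make "a single conflict check per item" concrete, here is a minimal C++ sketch of the UTXO double-spend case. The OutpointId type and the accept_spends helper are hypothetical and do not come from the xvm sources. The point is only that each item needs one membership test, with no read-set and no re-execution.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical outpoint identifier. A real UTXO id would combine a
// transaction hash with an output index.
using OutpointId = std::uint64_t;

// Returns, per item, whether the spend was accepted. Each item performs
// exactly one check against the spent set.
inline std::vector<bool> accept_spends(const std::vector<OutpointId>& spends,
                                       std::unordered_set<OutpointId>& spent) {
    std::vector<bool> accepted;
    accepted.reserve(spends.size());
    for (OutpointId id : spends) {
        // insert().second is false when the outpoint was already spent.
        accepted.push_back(spent.insert(id).second);
    }
    return accepted;
}
```

This loop is serial for clarity. On a GPU the same check would more plausibly be an atomic insert into a device-side set, but the per-item cost stays at one conflict test.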
| CPU canonical | GPU backends |
|---|---|
| crypto/secp256k1/cpp | crypto/secp256k1/gpu/{cuda,metal,wgsl} + cevm/.../precompiles/ecrecover_{cuda,metal} |
| crypto/keccak/cpp | crypto/keccak/gpu/{cuda,metal,wgsl} + cevm/.../cuda/keccak256.cu, cevm/.../metal/keccak256.metal |
| crypto/blake2b/cpp | crypto/blake2b/gpu/{cuda,metal,wgsl} |
| crypto/bn254/cpp | crypto/bn254/gpu/metal/bn254.metal (cuda/wgsl present) |
| crypto/bls/cpp | cevm/.../precompiles/bls12_381_{cuda,metal} |
| crypto/kzg/cpp | cevm/.../precompiles/point_eval_metal.mm (cuda absent, TODO) |
| crypto/{ipa,banderwagon,pedersen}/cpp | per-alg gpu/{cuda,metal,wgsl} |
| crypto/corona/cpu | crypto/corona/gpu/metal/corona_*.metal |
| quasar_groth16_verifier.cpp | dispatches into bn254 GPU |
| crypto/mldsa/cpp | crypto/mldsa/gpu/{cuda,metal,wgsl} |
| fhe/src (OpenFHE-derived) + crypto/ntt/cpp | fhe/build_mlx, crypto/ntt/gpu, luxcpp/lattice |
| crypto/{cggmp21,frost}/c-abi | crypto/{cggmp21,frost}/gpu |

Threshold sigs on M-Chain are network-bound (round-trip share exchange);
the GPU role there is batch-verify, not serial wall-clock reduction.
No violations. Two partial entries (A-Chain replay-window, M-Chain
round-transcript) are by design — those stores are append-only and
larger than GPU residency budgets. Both go through the StateRequest
service so the GPU never blocks on them in the hot loop.
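The non-blocking reach-back pattern can be sketched as a post-and-poll queue. This is a single-threaded illustration only. The class name StateRequestQueue, the ticket scheme, and the 64-bit key and value types are all assumptions, and the real StateRequest service interface is not described in this LP.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <optional>
#include <unordered_map>

// Hypothetical stand-in for the StateRequest service.
class StateRequestQueue {
public:
    using Ticket = std::uint64_t;

    // Hot-loop side: enqueue a lookup and return immediately.
    Ticket post(std::uint64_t key) {
        pending_.push_back({next_, key});
        return next_++;
    }

    // Hot-loop side: non-blocking poll. Empty until the host has answered.
    std::optional<std::uint64_t> poll(Ticket t) {
        auto it = done_.find(t);
        if (it == done_.end()) return std::nullopt;
        const std::uint64_t value = it->second;
        done_.erase(it);
        return value;
    }

    // Host side: answer the oldest pending request from the off-device
    // store. Returns false when nothing is pending.
    template <class Lookup>
    bool service_one(Lookup&& lookup) {
        if (pending_.empty()) return false;
        const Request req = pending_.front();
        pending_.pop_front();
        done_[req.ticket] = lookup(req.key);
        return true;
    }

private:
    struct Request { Ticket ticket; std::uint64_t key; };
    std::deque<Request> pending_;
    std::unordered_map<Ticket, std::uint64_t> done_;
    Ticket next_ = 0;
};
```

The hot loop calls post, keeps working on other items, and checks poll on a later tick, so a slow lookup in the replay-window or round-transcript store delays only the item that asked for it.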
Block-STM execution itself is per-tx, bounded by MAX_INCARNATIONS=16,
MAX_READS_PER_TX=64, MAX_WRITES_PER_TX=64, and a hash table of 65536
slots (cevm/lib/evm/gpu/cuda/block_stm.cu:42-50). The CUDA kernel runs
one GPU thread per worker, with the same dispatch shape as the Metal backend.
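The following host-side C++ sketch shows how fixed per-tx limits like these shape a read-set validation step. The two constants are the values quoted above. The struct layout, the flat version table, and the behaviour when the incarnation budget runs out are simplifying assumptions, not a transcription of block_stm.cu.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Limits quoted in the text; everything else is an assumed simplification.
constexpr std::uint32_t MAX_INCARNATIONS = 16;
constexpr std::uint32_t MAX_READS_PER_TX = 64;

struct ReadEntry { std::uint64_t key; std::uint32_t version; };

struct TxReadSet {
    std::array<ReadEntry, MAX_READS_PER_TX> reads{};
    std::uint32_t count = 0;
    std::uint32_t incarnation = 0;

    // Records a read; returns false once the per-tx cap is reached.
    bool record(std::uint64_t key, std::uint32_t version) {
        if (count == MAX_READS_PER_TX) return false;
        reads[count++] = ReadEntry{key, version};
        return true;
    }
};

// Hypothetical flat version table standing in for the multi-version store.
using VersionTable = std::unordered_map<std::uint64_t, std::uint32_t>;

// Validate step: every recorded read must still observe the same version.
// A key missing from the table is treated as version 0.
inline bool validate(const TxReadSet& rs, const VersionTable& current) {
    for (std::uint32_t i = 0; i < rs.count; ++i) {
        auto it = current.find(rs.reads[i].key);
        const std::uint32_t now = (it == current.end()) ? 0 : it->second;
        if (now != rs.reads[i].version) return false;
    }
    return true;
}

// Repair step: bump the incarnation and clear the reads so the tx can
// re-execute. Returns false once the incarnation budget is spent (assumed
// behaviour; the real kernel may handle exhaustion differently).
inline bool schedule_retry(TxReadSet& rs) {
    if (rs.incarnation + 1 >= MAX_INCARNATIONS) return false;
    ++rs.incarnation;
    rs.count = 0;
    return true;
}
```

Fixed-capacity arrays avoid dynamic allocation, which is what makes a one-thread-per-worker GPU dispatch practical.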
1. C-Chain is the only chain that needs and uses full Block-STM 3.0 with
six-source ConflictSpec. The other chains have lower-arity conflict
structures and use simpler pipelines (UTXO double-spend, ceremony
round ordering, FHE op-graph DAG, append-only attestations).
2. Every chain has a CUDA + Metal + WGSL kernel triple for its hot ops.
The triple is enforced by the byte-equality test in LP-137 (one CPU
canonical, three byte-equal GPU backends).
3. Two non-hot reach-backs exist (A-Chain replay-window, M-Chain
round-transcript). Both go through the StateRequest service so the
GPU hot loop is non-blocking.
4. Two precompile gaps on C-Chain: point_eval lacks a CUDA backend
(Metal only) and modexp/blake2f/sha256/ripemd160/bn256 are
currently CPU-only. None of these are common-path; bn256 is legacy
(the production pairing is bn254). They are low-priority.
5. No substrate-residency invariant violations.
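The byte-equality rule in finding 2 (one CPU canonical, three byte-equal GPU backends) amounts to a comparison like the one sketched below. The function name mismatched_backends and the way outputs are collected are hypothetical; the actual test harness is specified in LP-137 and is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Returns the names of backends whose output differs from the CPU canonical
// output. An empty result means every backend is byte-equal.
inline std::vector<std::string> mismatched_backends(
    const Bytes& cpu_canonical,
    const std::vector<std::pair<std::string, Bytes>>& backend_outputs) {
    std::vector<std::string> bad;
    for (const auto& entry : backend_outputs) {
        // std::vector::operator!= compares length and every byte.
        if (entry.second != cpu_canonical) bad.push_back(entry.first);
    }
    return bad;
}
```

Reporting the failing backend names, instead of a single pass/fail flag, makes it easier to see which of the CUDA, Metal, or WGSL kernels drifted from the CPU reference.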
luxcpp/cevm/lib/evm/gpu/scheduler.hpp:24, scheduler.cpp:1
luxcpp/cevm/lib/evm/gpu/parallel_engine.hpp:32, parallel_engine.cpp:1
luxcpp/cevm/lib/evm/gpu/mv_memory.hpp:1
luxcpp/cevm/lib/evm/stm/conflict_spec.hpp:1
luxcpp/cevm/lib/evm/gpu/cuda/block_stm.cu:1
luxcpp/cevm/lib/evm/gpu/metal/block_stm.metal:1
luxcpp/cevm/lib/consensus/quasar/gpu/quasar_wave.metal:642..2342
luxcpp/cevm/lib/consensus/quasar/gpu/quasar_gpu_layout.hpp:41
luxcpp/cevm/lib/consensus/quasar/gpu/quasar_groth16_verifier.{hpp,cpp}
luxcpp/cevm/lib/consensus/quasar/gpu/quasar_corona_verifier.{hpp,cpp}
luxcpp/cevm/lib/consensus/quasar/gpu/quasar_bls_verifier.{hpp,cpp}
luxcpp/{platformvm,xvm,aivm,bridgevm,mpcvm,fhe}/src
luxcpp/cevm/lib/evm/gpu/precompiles/precompile_dispatch.hpp:1
luxcpp/crypto/{secp256k1,keccak,blake2b,bn254,kzg,ipa,banderwagon,pedersen,corona,bls,mldsa,cggmp21,frost,ntt}/{cpp,gpu/{cuda,metal,wgsl}}