QuasarGPU is **CEVM's GPU-side execution adapter for Quasar-certified
rounds**. It executes the round on GPU — Block-STM, EVM fibers, batched
crypto, root construction — and produces the artifacts the Quasar
consensus engine certifies: block_hash, state_root, receipts_root,
execution_root, mode_root. It optionally aggregates per-lane
certificates from inbound vote/share/proof artifacts.
Critically, this is an execution adapter, not a replacement for
luxfi/consensus. The boundary is:
lux/consensus orders, votes, certifies, finalizes
cevm executes, validates, roots, receipts
quasar/gpu/ GPU-side execution adapter for Quasar rounds
> One sentence: **consensus decides what must be executed; cevm proves
> what execution produced; Quasar certifies the resulting commitment.**
Classical GPU EVM models dispatch a separate kernel per phase
(exec → wait → validate → wait → commit → wait). QuasarGPU replaces
all of that with one bounded kernel re-launched once per wave tick:
host: GPU wave-tick kernel:
───── ──────────────────────
begin_round(QuasarRoundDescriptor) → │ 12 service drains run in parallel:
push_txs(...) │
push_certs(...) (P/Q/Z lanes) │ gid 0: Ingress → Decode
poll_state_requests() → host LSM │ gid 1: Decode → Crypto/StateRequest
push_state_pages(...) │ gid 2: Crypto → Commit/DagReady/Exec
poll_quasar_certs() ← from CertOut │ gid 3: DagReady → Exec
│ gid 4: Exec → Validate (EVM fibers)
run_wave_tick() — re-dispatch ──────┤ gid 5: Validate → Commit/Repair (Block-STM)
│ gid 6: Repair → Exec (incarnation++)
poll_round_result() ← QuasarRoundResult │ gid 7: Commit → root material
│ gid 8: StateRequest (host poll)
│ gid 9: StateResp → resume Crypto
│ gid 10: CertLane → batch verify + aggregate
│ gid 11: CertOut (host poll)
No persistent hot-spinning kernel. Each wave tick is a bounded
dispatch; workgroups exit when their service ring is empty or their
budget is exhausted. The GPU scheduler is free to interleave other
kernels between ticks.
enum class ServiceId : uint32_t {
// 3.0 services (12) — substrate, shipped 2025-12-25
Ingress = 0, // host tx blobs
Decode = 1, // sender recovery / admission gate
Crypto = 2, // sig verify, route to Commit / DagReady / Exec
DagReady = 3, // MVCC ready set (Nebula mode)
Exec = 4, // EVM fiber VM
Validate = 5, // Block-STM read-set check
Repair = 6, // re-execute conflicting txs
Commit = 7, // commit + per-tx receipt keccak
StateRequest = 8, // GPU → host page faults (out)
StateResp = 9, // host → GPU page replies (in)
CertLane = 10, // BLS / Pulsar / MLDSAGroth16 lane artifacts
CertOut = 11, // GPU-emitted per-lane QuasarCert commitments
// 4.0 services (4 new) — production, shipped 2026-02-14 (v0.43, v0.47, v0.48)
AdaptiveSchedule = 12, // ConflictSpec admission, predictor consult (A)
BridgeAttest = 13, // cross-chain attestation drain (B)
MarketAuction = 14, // auction-rule batched matching for hot DEX lanes (M)
FiberCheckpoint = 15, // per CALL-frame checkpoint emit / GC (F)
Count = 16
};
The 4.0 service set is referred to as A/B/M/F for the four new
drains. Each new service uses a dedicated ring with the same RingHeader
+ items_arena layout as the 3.0 services and the same wave-tick budget
contract. The 4.0 activation height (2026-02-14) is when Count flips
from 12 to 16 in the layout type.
Every ring is a fixed-capacity device buffer with a RingHeader at the
front (head/tail/capacity/mask/items_ofs/pushed/consumed) and items
laid out back-to-back in a per-round items_arena. Cross-workgroup
visibility uses the relaxed-atomic + threadgroup_barrier(mem_device)
pattern proven in v0.30 (V3 wave-dispatch).
RingHeader | per-ring metadata + pushed/consumed counters | 48 B |IngressTx | raw tx envelope | 32 B |DecodedTx | post-recovery tx | 48 B |VerifiedTx | post-admission tx (carries blob_offset/blob_size for v0.39 EVM) | 32 B |RWSetEntry | one read or write per Block-STM tx | 24 B |ExecResult | tx execution output + RW set | 216 B (8 RW slots) |CommitItem | committable tx + per-tx receipt hash | 64 B |MvccSlot | open-addressing MVCC arena slot | 32 B |DagNode | Nebula DAG node + parents + children + pending envelope | 128 B |DagWriterSlot | most-recent-writer-per-key for v0.40 DAG construction | 32 B |StateRequest | GPU → host page-fault descriptor | 32 B |StatePage | host → GPU page-fault reply (≤64 B inline payload) | 96 B |QuasarCertIngress | per-lane cert artifact ingress | 96 B |QuasarCert | per-lane cert emission | 432 B |QuasarRoundDescriptor | host writes once per round | 304 B |QuasarRoundResult | GPU writes; host reads | 272 B |The kernel ships an inline FIPS-202 keccak permutation (24 rounds,
RC table, rotation table). Used for:
block_hash — finalization digeststate_root — accumulated commit chainreceipts_root — per-tx receipt hash chainexecution_root — Block-STM trace commitmentsubject re-derivation (replay protection)Zero CPU keccak fallback on the round path. The metal::keccak256_cpu
helper in lib/evm/gpu/metal/keccak_host.mm is reserved for legacy
non-Quasar consumers.
drain_exec runs the in-kernel EVM interpreter — a per-tx fiber
with:
pc, sp, gas, statusstack[64] of 256-bit values (4 × 64-bit limbs each = 13.6 MB for 4096 fibers)memory[1024] per-fiber scratchpending_key_*[4] cold-state suspend slotrw[8] Block-STM read/write setCurrently: 118 opcodes (full arithmetic incl. ADDMOD / MULMOD /
EXP, all signed/unsigned compares, bitwise incl. SAR/BYTE,
KECCAK256, env opcodes, PUSH0..PUSH32, DUP1..16, SWAP1..16,
MLOAD / MSTORE / MSTORE8 / MSIZE, SLOAD / SSTORE with cold-
miss suspend, JUMP / JUMPI / PC / JUMPDEST / GAS, STOP /
RETURN / REVERT / INVALID).
Cold-miss SLOAD → suspend (status=2, key in pending_key_*, push
StateRequest, restore stack, return). drain_state_resp resumes by
re-injecting and stamping the MVCC slot's `last_writer_tx |=
0x80000000` as the "loaded" sentinel.
Deferred to v0.40+: CALL family, CREATE/CREATE2, LOGn,
EXTCODE*, RETURNDATA*, TLOAD/TSTORE, MCOPY, BLOBHASH/BLOBBASEFEE.
drain_exec records read+write entries into the tx's RW set.
drain_validate runs MVCC version-check (mvcc_check_consistent); on
mismatch → conflict → repair queue with bumped incarnation. On match →
mvcc_apply_writes (atomic version bump) → push CommitItem.
Telemetry on QuasarRoundResult: conflict_count, repair_count,
fibers_suspended, fibers_resumed. Real measured behavior: 16 same-
key contending txs produce **120 conflicts → 120 repairs → 16
commits**. That's textbook Block-STM running entirely on GPU.
The host services GPU-emitted state requests via its LSM/cache/disk
path. The GPU never blocks on host I/O:
host loop:
while round in progress:
run_wave_tick()
reqs = poll_state_requests()
if reqs.empty(): continue
pages = state_db.batch_get(reqs)
push_state_pages(pages)
A faulted tx exits Decode into StateRequest and the workgroup moves on.
Other txs continue through the fast path while the slow tx awaits host
service. Finalization gate (commit.consumed == ingress.pushed) holds
across both fast and slow lanes.
drain_cert_lane reads inbound QuasarCertIngress items, dispatches
to a per-lane verifier, and accumulates stake into per-lane counters.
At 2/3 quorum threshold, a QuasarCert is emitted onto CertOut.
Three verifiers (LP-132 §§drain_cert_lane), shipped state per lane:
verify_bls_aggregate | v0.44 real BLS pairing (BLS12-381) — replaces 3.0 HMAC-keccak placeholder | lib/consensus/quasar/gpu/crypto/bls12_381.metal / .cu |verify_corona_share | v0.45 real Corona Module-LWE share verifier against Q-Chain ceremony key — replaces 3.0 HMAC-keccak placeholder | lib/consensus/quasar/gpu/crypto/corona.metal / .cu |verify_mldsa_groth16 | v0.45 real Groth16 over BLS12-381 against Z-Chain VK — replaces 3.0 HMAC-keccak placeholder | lib/consensus/quasar/gpu/crypto/groth16.metal / .cu |The 3.0 launch (2025-12-25) shipped HMAC-keccak placeholders across all
three lanes — real cryptographic verification (one-way with a master
secret; cross-lane domain tags reject replay), structured so the swap
to real BLS / Module-LWE (Corona) / Groth16 was a single function-pointer change.
That swap landed under QuasarSTM 4.0 (LP-135) in v0.44 (BLS) and v0.45
(Pulsar + Groth16) and activated on 2026-02-14. The HMAC-keccak
path is preserved as a development-only mode for deterministic test
vectors and is gated behind EVM_DEV_HMAC_VERIFIER=1. See
LP-010-quasar-stm-4 (4.0 paper, 2026-02-14) for the migration.
namespace quasar::gpu {
class QuasarGPUEngine {
public:
static std::unique_ptr<QuasarGPUEngine> create();
QuasarRoundHandle begin_round(const QuasarRoundDescriptor&);
void push_txs(QuasarRoundHandle, std::span<const HostTxBlob>);
void push_certs(QuasarRoundHandle, std::span<const HostCertIngress>);
void push_state_pages(QuasarRoundHandle, std::span<const HostStatePage>);
std::vector<HostStateRequest> poll_state_requests(QuasarRoundHandle);
std::vector<HostQuasarCert> poll_quasar_certs(QuasarRoundHandle);
QuasarRoundResult run_wave_tick(QuasarRoundHandle);
QuasarRoundResult run_until_done(QuasarRoundHandle, std::size_t max_ticks);
QuasarRoundResult poll_round_result(QuasarRoundHandle);
void request_close(QuasarRoundHandle);
void end_round(QuasarRoundHandle);
// ...
};
} // namespace quasar::gpu
quasar_wave.metal, quasar_gpu_engine.mm |__threadfence, same API | quasar_wave.cu, quasar_gpu_engine_cuda.cpp |EVM_CUDA, else nullptr | quasar_gpu_runtime.cpp |The Metal and CUDA kernels share the layout types (quasar_gpu_layout.hpp)
byte-for-byte; structural drift is caught by static_assert at compile
time on both sides.
T ≈ min(
E_gpu, // GPU event execution capacity
B_net / bytes_per_event, // network bandwidth
B_state / bytes_per_state_access, // hot-state memory bandwidth
L_independent / conflict_amplification, // lane independence
C_cert / cert_cost_per_root, // certificate throughput
D_da / data_availability_bytes // data availability
)
QuasarGPU's design thesis:
> **Increase T by maximizing L_independent and minimizing
> cert_cost_per_root.**
L_independent comes from lane partitioning (LP-010 § Lanes).
cert_cost_per_root is constant in TPS because validators sign roots,
not individual txs.
The pitch:
> Billion-event throughput, EVM-settled, Quasar-certified.
Not "1 B EVM TPS on a single shared state machine" — that's bounded by
contention and data availability, not GPU compute.
The substrate ships with a comprehensive test surface
(test/unittests/quasar_gpu_engine_test.mm):
empty_round | begin/close/finalize with no txs |single_tx_real_roots | block_hash, receipts_root, execution_root all non-zero |multi_tx_counters | per-stage pushed == consumed for 128 txs |bounded_backpressure | 1024 txs across small per-tick budget; finalizes |end_to_end_stress | 1024 txs in 8 wave ticks, ~150 ms on M1 Max |state_page_fault | host-serviced cold misses round-trip |root_determinism | same input → same roots across runs |block_stm_independent_txs | 64 disjoint-key txs, conflict_count=0 |block_stm_conflict_repair | 16 same-key txs, conflict_count=120, repair_count=120 |evm_fiber_arithmetic | PUSH1 1; PUSH1 2; ADD; STOP commits with correct gas |evm_fiber_storage | SLOAD cold-miss → suspend → resume; fibers_suspended ≥ 1 |evm_fiber_revert | REVERT path drains cleanly |quasar_quorum_round_trip | 3 BLS + 1 ML-DSA + 2 Pulsar; tampered & cross-lane replay rejected |Plus the broader 19-binary GPU test surface (parity, modes, host-bridge,
pipeline, unified, gpu-state, opcodes, dispatch, refund, access-list,
stack-depth, storage-overflow, etc.). All 19 PASS on Apple M1 Max.
QuasarGPU must not use:
malloc (use the per-round arena)__threadfence, layout-byte-identical with Metal |certificate_subject includes P/Q/Z roots), KnownTotalOrder, mode roots separated |The QuasarSTM 4.0 production spec is LP-135. The 4.0 production paper
(LP-010-quasar-stm-4, 2026-02-14) is the canonical reference.
cevm/lib/consensus/quasar/gpu/ |cevm/test/unittests/quasar_gpu_engine_test.mm |cevm/lib/consensus/quasar/gpu/quasar_gpu_layout.hpp |cevm/lib/consensus/quasar/gpu/quasar_wave.metal |cevm/lib/consensus/quasar/gpu/quasar_wave.cu |cevm/lib/consensus/quasar/gpu/quasar_gpu_engine.mm |cevm/lib/consensus/quasar/gpu/quasar_gpu_engine_cuda.cpp |cevm/lib/consensus/quasar/gpu/quasar_gpu_runtime.cpp |cert verifiers, 118-opcode EVM fiber, single-GPU; activated 2025-12-25.
A/B/M/F), real BLS12-381 / Corona Module-LWE / Groth16 verifiers
(v0.44/v0.45), 175-opcode EVM with full CALL/CREATE/LOG/EXTCODE/
RETURNDATA/TLOAD/TSTORE/MCOPY/BLOBHASH/BLOBBASEFEE coverage (v0.46),
predictive scheduling + Chiron hint roots (v0.47), Motor VersionBlock
(v0.48), CSMV scaffold + formal CPU reference + cross-backend
determinism harness (v0.49). Wire schema unchanged. See LP-135 and
LP-010-quasar-stm-4.
Copyright (C) 2025-2026, Lux Partners Limited. All rights reserved.