This LP defines the GPU-native EVM (cevm) for Lux Network: a
C++ EVM, forked from evmone, that executes the full bytecode interpreter
on GPU as a per-tx fiber VM, with Block-STM parallel scheduling
(LP-010) and async cold-state page faults. Production backends ship for
Apple Metal (M1/M2/M3 silicon), NVIDIA CUDA, and Dawn/WebGPU.
The previous v1 LP described a 60-opcode switch-dispatch path with
12× CPU speedup on a synthetic loop benchmark. This v2 update
documents the fiber VM that landed in the QuasarGPU substrate
(LP-132): 118 opcodes, suspend-on-cold-state, MVCC integration with
Block-STM, and the wave-tick scheduler that lets every consensus mode
(Nova linear / Nebula DAG) share one execution adapter.
CEVM ships two GPU-native execution surfaces today:
1. Wave-dispatch one-shot (LP-009 § One-Shot Wave) — one Metal/CUDA
dispatch per wave, one workgroup per tx, simple opcode coverage.
Used by the standalone evm-bench-kernel and evm-test-pipeline
binaries. Proven model: 32K-tx waves complete in 1 ms on M1 Max.
2. QuasarGPU wave-tick scheduler (LP-132) — bounded scheduler
kernel with 12 service rings. Includes the EVM fiber VM in
drain_exec, Block-STM in drain_validate / drain_repair, async
page faults via StateRequest/StateResp, and per-lane Quasar
cert verification. **This is the production execution path for
Quasar-certified rounds.**
The fiber VM is the headline new piece. It runs the full per-tx
interpreter on GPU with proper suspend/resume semantics so EVM
contracts can issue cold SLOAD/SSTORE without stalling the wave.
struct FiberSlot {
    uint32_t tx_index, pc, sp, status;
    uint64_t gas;
    uint32_t rw_count, incarnation;
    uint32_t pending_key_lo_lo, pending_key_lo_hi; // SLOAD suspend slot
    uint32_t pending_key_hi_lo, pending_key_hi_hi;
    uint32_t msize;                                // memory size
    RWSetEntry rw[8];
    uint64_t stack[64 * 4];                        // 64 entries × 4 limbs (256-bit)
    uint8_t memory[1024];                          // per-fiber scratch
    uint32_t blob_offset, blob_size;               // bytecode pointer (host-shared MTLBuffer)
};
Fiber state machine:
| Status | Meaning |
|---|---|
| 0 (Ready) | not yet executing |
| 1 (Running) | currently in drain_exec |
| 2 (WaitingState) | suspended on cold-state SLOAD |
| 3 (Committable) | exec done, awaiting Block-STM validate |
| 4 (Reverted) | EVM REVERT or runtime fault |

A single fiber occupies ~3.4 KB; 4096 fibers fit in ~13.9 MB device
memory.
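The size figures can be sanity-checked with a small host-side sketch. The LP does not show the layout of `RWSetEntry`, so the 32-byte definition below is purely an assumption chosen for illustration; the `FiberSlot` fields are copied from the struct above, and `kFiberCount`, `kFiberBytes` and `kArenaBytes` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// ASSUMPTION: the real RWSetEntry is not shown in this LP. This 32-byte
// layout is hypothetical and only serves to make the arithmetic concrete.
struct RWSetEntry {
    uint64_t key_lo, key_hi;
    uint32_t version_tx, version_incarnation;
    uint32_t kind, reserved;
};

// Field list copied from the FiberSlot definition in this LP.
struct FiberSlot {
    uint32_t tx_index, pc, sp, status;
    uint64_t gas;
    uint32_t rw_count, incarnation;
    uint32_t pending_key_lo_lo, pending_key_lo_hi;
    uint32_t pending_key_hi_lo, pending_key_hi_hi;
    uint32_t msize;
    RWSetEntry rw[8];
    uint64_t stack[64 * 4];
    uint8_t memory[1024];
    uint32_t blob_offset, blob_size;
};

constexpr std::size_t kFiberCount = 4096;                       // fiber count quoted above
constexpr std::size_t kFiberBytes = sizeof(FiberSlot);          // per-fiber footprint
constexpr std::size_t kArenaBytes = kFiberCount * kFiberBytes;  // whole fiber arena
```

Under these assumptions, and on a typical 64-bit ABI where `uint64_t` is 8-byte aligned, `sizeof(FiberSlot)` comes to 3392 bytes and the arena to 13,893,632 bytes, which is consistent with the ~3.4 KB and ~13.9 MB figures. The device-side layout may differ if the real `RWSetEntry` has a different size.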
struct U256 { ulong v[4]; }; // little-endian limbs
Implemented:
- add/sub with carry propagation across limbs
- mul: 4×4 schoolbook multiply (16 partial products)
- div/mod/sdiv/smod: bit-at-a-time (256 iterations) for correctness; specialized fast paths for divisor < 2^64
- shl/shr/sar: limb-level shift + bit shift, sign-extending SAR
- eq/lt/gt/slt/sgt/iszero: straight comparisons

Tests confirm exact match against the CPU reference on EIP-150 vectors.
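The limb arithmetic listed above can be sketched on the host as follows. This is an illustrative C++ version, not the kernel code: the function names are hypothetical, `uint64_t` stands in for the shader's `ulong`, and the multiply uses the GCC/Clang `unsigned __int128` extension for the 64×64 partial products, which the GPU kernels would have to express differently. Because the result is reduced modulo 2^256, the multiply skips the partial products that land entirely above bit 255.

```cpp
#include <cstdint>

struct U256 { uint64_t v[4]; };  // little-endian limbs, as in the LP

// a + b mod 2^256, carry rippling from limb 0 upward.
inline U256 u256_add(const U256& a, const U256& b) {
    U256 r{};
    uint64_t carry = 0;
    for (int i = 0; i < 4; ++i) {
        uint64_t s = a.v[i] + b.v[i];
        uint64_t c1 = s < a.v[i];   // overflow of a + b
        uint64_t t = s + carry;
        uint64_t c2 = t < s;        // overflow of adding the incoming carry
        r.v[i] = t;
        carry = c1 | c2;            // at most one of the two can be set
    }
    return r;
}

// a - b mod 2^256, borrow rippling from limb 0 upward.
inline U256 u256_sub(const U256& a, const U256& b) {
    U256 r{};
    uint64_t borrow = 0;
    for (int i = 0; i < 4; ++i) {
        uint64_t d = a.v[i] - b.v[i];
        uint64_t b1 = a.v[i] < b.v[i];
        uint64_t b2 = d < borrow;
        r.v[i] = d - borrow;
        borrow = b1 | b2;
    }
    return r;
}

// Schoolbook multiply mod 2^256; products with i + j >= 4 are skipped.
inline U256 u256_mul(const U256& a, const U256& b) {
    U256 r{};
    for (int i = 0; i < 4; ++i) {
        uint64_t carry = 0;
        for (int j = 0; i + j < 4; ++j) {
            unsigned __int128 p =
                (unsigned __int128)a.v[i] * b.v[j] + r.v[i + j] + carry;
            r.v[i + j] = (uint64_t)p;
            carry = (uint64_t)(p >> 64);
        }
    }
    return r;
}

inline bool u256_lt(const U256& a, const U256& b) {
    for (int i = 3; i >= 0; --i)
        if (a.v[i] != b.v[i]) return a.v[i] < b.v[i];
    return false;
}

// Bit-at-a-time (shift-subtract) division, 256 iterations.
// EVM semantics: division by zero yields quotient 0 and remainder 0.
inline void u256_divmod(const U256& n, const U256& d, U256& q, U256& r) {
    q = U256{};
    r = U256{};
    if ((d.v[0] | d.v[1] | d.v[2] | d.v[3]) == 0) return;
    for (int bit = 255; bit >= 0; --bit) {
        // r = (r << 1) | next bit of n, most significant bit first
        r.v[3] = (r.v[3] << 1) | (r.v[2] >> 63);
        r.v[2] = (r.v[2] << 1) | (r.v[1] >> 63);
        r.v[1] = (r.v[1] << 1) | (r.v[0] >> 63);
        r.v[0] = (r.v[0] << 1) | ((n.v[bit / 64] >> (bit % 64)) & 1);
        if (!u256_lt(r, d)) {
            r = u256_sub(r, d);
            q.v[bit / 64] |= uint64_t{1} << (bit % 64);
        }
    }
}
```

The shift in `u256_divmod` cannot overflow: before the final iteration the running remainder is at most `n >> 1`, so it always fits in 255 bits. The fast path for divisors below 2^64 mentioned above is not shown here.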
Deferred to v0.40+ (with documented fallthrough returning Error):
CALL family, CREATE/CREATE2, LOGn, EXTCODE*, RETURNDATA*, TLOAD/TSTORE,
MCOPY, BLOBHASH/BLOBBASEFEE, SIGNEXTEND, BASEFEE, COINBASE, BLOCKHASH,
SELFBALANCE.
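The "fallthrough returning Error" behaviour for deferred opcodes can be pictured with a minimal dispatch sketch. Everything here is an illustrative assumption rather than the kernel's code: `MiniFiber`, `StepResult`, `step` and `run` are hypothetical names, the stack holds 64-bit words instead of 4-limb values, and only STOP (0x00), ADD (0x01, 3 gas) and PUSH1 (0x60, 3 gas) are handled.

```cpp
#include <cstddef>
#include <cstdint>

enum class StepResult { Continue, Stop, Error };

constexpr uint32_t kFiberStackDepth = 64;  // stack limit quoted in this LP

// Hypothetical, simplified fiber: 64-bit stack words to keep the sketch short.
struct MiniFiber {
    uint32_t pc = 0;
    uint32_t sp = 0;  // number of live stack entries
    uint64_t gas = 0;
    uint64_t stack[kFiberStackDepth] = {};
};

// Executes one instruction. Any opcode without a case falls through to Error.
inline StepResult step(MiniFiber& f, const uint8_t* code, std::size_t size) {
    if (f.pc >= size) return StepResult::Stop;  // running past the code acts as STOP
    switch (code[f.pc]) {
    case 0x00:  // STOP
        return StepResult::Stop;
    case 0x01:  // ADD, 3 gas
        if (f.gas < 3 || f.sp < 2) return StepResult::Error;
        f.gas -= 3;
        f.stack[f.sp - 2] += f.stack[f.sp - 1];
        --f.sp;
        ++f.pc;
        return StepResult::Continue;
    case 0x60:  // PUSH1, 3 gas; a missing immediate byte reads as zero
        if (f.gas < 3 || f.sp >= kFiberStackDepth) return StepResult::Error;
        f.gas -= 3;
        f.stack[f.sp++] = (f.pc + 1 < size) ? code[f.pc + 1] : 0;
        f.pc += 2;
        return StepResult::Continue;
    default:  // deferred or unknown opcode, e.g. CALL (0xF1)
        return StepResult::Error;
    }
}

// Runs until the fiber stops or faults.
inline StepResult run(MiniFiber& f, const uint8_t* code, std::size_t size) {
    StepResult r;
    while ((r = step(f, code, size)) == StepResult::Continue) {}
    return r;
}
```

Running the bytecode `60 02 60 03 01 00` through `run` leaves 5 on the stack, while a program that starts with `F1` (CALL) returns `StepResult::Error` without consuming gas. Stack overflow is reported through the same `Error` path.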
Per-opcode gas costs (Berlin-ish):
Memory-expansion gas accounting is deferred to v0.40; v0.39 uses a
fixed budget of kFiberMemoryBytes.
on SLOAD slot=k:
    look up MvccSlot for k
    if slot empty (key_lo == 0 && key_hi == 0):
        // cold miss
        set fiber.status = WaitingState
        pack k into fiber.pending_key_*
        push StateRequest{tx_index, k} onto StateRequest ring
        leave fiber.pc at the SLOAD opcode (resume re-runs)
        return out of drain_exec for this fiber
on host StatePage arrival:
    drain_state_resp claims the MvccSlot for k, sets last_writer_tx |= 0x80000000
    re-injects the fiber into Crypto/Exec
    resume sees status==WaitingState; advances to next instr after SLOAD
    next SLOAD finds slot warm, proceeds normally
The high bit of last_writer_tx is the "loaded" sentinel: it
distinguishes a never-loaded slot from one that was legitimately loaded
with the value zero. See LP-132 § async page faults.
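The suspend/resume flow and the role of the sentinel can be sketched on the host as below. This is a simplified model under stated assumptions: keys and values are 64-bit rather than 256-bit, `std::unordered_map` stands in for the MVCC arena, a `std::vector` stands in for the StateRequest ring, and the fiber is assumed to re-run the SLOAD at the same `pc` on resume, as the cold-miss branch describes. All type and function names are hypothetical.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

constexpr uint32_t kLoadedBit = 0x80000000u;  // "loaded" sentinel from the text

enum class Status : uint32_t {
    Ready = 0, Running = 1, WaitingState = 2, Committable = 3, Reverted = 4
};

// Hypothetical simplified types; the real structures use 256-bit keys.
struct MvccSlot { uint64_t value; uint32_t last_writer_tx; };
struct StateRequest { uint32_t tx_index; uint64_t key; };
struct Fiber { uint32_t tx_index; uint32_t pc; Status status; uint64_t pending_key; };

using MvccMap = std::unordered_map<uint64_t, MvccSlot>;

// SLOAD: returns the value when the slot is warm. On a cold miss the fiber
// is parked, a request is queued, and pc is left untouched so that the
// SLOAD is executed again after the page arrives.
inline std::optional<uint64_t> sload(Fiber& f, uint64_t key, const MvccMap& mvcc,
                                     std::vector<StateRequest>& ring) {
    auto it = mvcc.find(key);
    if (it == mvcc.end()) {
        f.status = Status::WaitingState;
        f.pending_key = key;
        ring.push_back(StateRequest{f.tx_index, key});
        return std::nullopt;
    }
    return it->second.value;
}

// Host page arrival: claim the slot, mark it loaded, re-inject the fiber.
inline void on_state_page(Fiber& f, uint64_t key, uint64_t value, MvccMap& mvcc) {
    MvccSlot& slot = mvcc[key];  // value-initialised on first access
    slot.value = value;
    slot.last_writer_tx |= kLoadedBit;
    f.status = Status::Ready;
}
```

A slot loaded with the value 0 still has `kLoadedBit` set, so a later `sload` on that key returns 0 instead of raising a second cold miss.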
drain_exec populates each fiber's RWSetEntry rw[8] with read+write
versions for every storage / memory / state access. drain_validate
walks the RW set and compares observed versions against current MVCC
versions; mismatch → conflict → repair queue with bumped incarnation.
Measured behavior on contention: **16 same-key txs → 120 conflicts →
120 repairs → 16 commits** (textbook Block-STM).
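The 16 → 120 → 120 → 16 figures match the worst-case count n(n-1)/2 for n transactions that all read and write one key. The sketch below is a deliberately serialized model of that case, not the scheduling logic of drain_validate / drain_repair; `Tx`, `Stats` and `simulate_same_key` are hypothetical names, and each transaction carries a single read version instead of a full RW set.

```cpp
#include <cstdint>
#include <vector>

struct Stats { uint32_t conflicts = 0, repairs = 0, commits = 0; };

// One-entry read set: the key version a tx observed when it last executed.
struct Tx {
    uint32_t observed_version = 0;
    uint32_t incarnation = 0;
    bool committed = false;
};

// n transactions all read and write the same key. Each validate pass walks
// the txs in order and compares the observed version with the current one.
inline Stats simulate_same_key(uint32_t n) {
    std::vector<Tx> txs(n);  // every tx initially executed against version 0
    uint32_t key_version = 0;
    Stats s;
    while (s.commits < n) {
        for (Tx& t : txs) {
            if (t.committed) continue;
            if (t.observed_version == key_version) {
                t.committed = true;  // validation passed: commit the write
                ++key_version;
                ++s.commits;
            } else {
                ++s.conflicts;       // stale read: send to repair
                ++t.incarnation;     // bumped incarnation
                t.observed_version = key_version;  // re-execute against current state
                ++s.repairs;
            }
        }
    }
    return s;
}
```

In this model exactly one transaction commits per pass and every remaining one conflicts, so `simulate_same_key(16)` reports 15 + 14 + ... + 1 = 120 conflicts, 120 repairs and 16 commits.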
- cevm/lib/consensus/quasar/gpu/quasar_wave.metal
- cevm/lib/consensus/quasar/gpu/quasar_wave.cu
- cevm/lib/evm/gpu/kernel/evm_kernel.metal (Dawn shim TODO)

Runtime dispatcher: QuasarGPUEngine::create() selects Metal on
__APPLE__, CUDA on EVM_CUDA, else returns nullptr (LP-132).
10K transactions × 5K loop iterations (550M opcodes), retained as a
sanity benchmark:
Gas match: byte-identical CPU vs GPU.
End-to-end stress on quasar-gpu-engine-test (Apple M1 Max):
(End-to-end TPS includes: ingress + decode + admission + EVM exec +
Block-STM validate + commit + receipt keccak chain + cert verify +
quorum aggregation. Pure-execution micro-benchmarks are higher; these
numbers are honest end-to-end.)
Pitch: billion-event throughput, EVM-settled, Quasar-certified. Not
1 B EVM TPS on a single shared state machine — that's bounded by
contention and data availability, not GPU compute.
- kFiberStackDepth=64 × 4 limbs in v0.39; reverts to status=Error if exceeded.
- kFiberMemoryBytes=1024 in v0.39; larger expansion deferred to v0.40 with proper gas accounting.
- 8192 slots; full arena returns kMvccInvalidIdx and the tx faults cleanly (no infinite probe loop).
- only (LP-132 §Forbidden Patterns).
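The "no infinite probe loop" property of the 8192-slot arena can be illustrated with a bounded open-addressing lookup. This is a single-threaded host sketch under assumptions: the probing scheme, the hash, the numeric value of `kMvccInvalidIdx`, and the names `Slot`, `Arena`, `slot_hash` and `find_or_claim` are all hypothetical, and a GPU version would need an atomic compare-and-swap to claim a slot.

```cpp
#include <cstdint>
#include <vector>

constexpr uint32_t kMvccSlots = 8192;               // arena size from the text
constexpr uint32_t kMvccInvalidIdx = 0xFFFFFFFFu;   // ASSUMPTION: actual value not given

// An all-zero key marks an empty slot, as in the SLOAD pseudocode, so the
// key (0, 0) itself cannot be stored in this sketch.
struct Slot { uint64_t key_lo = 0, key_hi = 0; };

struct Arena {
    std::vector<Slot> slots = std::vector<Slot>(kMvccSlots);
};

// Hypothetical multiplicative hash; the top 13 bits index 2^13 = 8192 slots.
inline uint32_t slot_hash(uint64_t key_lo, uint64_t key_hi) {
    uint64_t h = (key_lo ^ (key_hi + 0x9E3779B97F4A7C15ull)) * 0x9E3779B97F4A7C15ull;
    return static_cast<uint32_t>(h >> 51);
}

// Linear probing bounded by the arena size: after kMvccSlots probes the
// arena is known to be full and the caller receives kMvccInvalidIdx.
inline uint32_t find_or_claim(Arena& a, uint64_t key_lo, uint64_t key_hi) {
    const uint32_t start = slot_hash(key_lo, key_hi);
    for (uint32_t probe = 0; probe < kMvccSlots; ++probe) {
        const uint32_t idx = (start + probe) & (kMvccSlots - 1);
        Slot& s = a.slots[idx];
        if (s.key_lo == key_lo && s.key_hi == key_hi) return idx;  // already present
        if (s.key_lo == 0 && s.key_hi == 0) {                      // empty: claim it
            s.key_lo = key_lo;
            s.key_hi = key_hi;
            return idx;
        }
    }
    return kMvccInvalidIdx;  // full arena: the tx faults instead of spinning
}
```

Bounding the probe count by the arena size is what makes the failure mode deterministic: a lookup terminates after at most 8192 probes whether or not the key is found.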
┌───────────────────────────────────────────────────────┐
│ cevm: Ethereum-compatible EVM (forked from evmone) │
│ │
│ cevm/lib/evm/ the EVM core │
│ vm.cpp, baseline_*.cpp, advanced_*.cpp │
│ │
│ cevm/lib/evm/gpu/ one-shot wave path │
│ evm_kernel.metal │
│ evm_kernel.cu │
│ kernel/ │
│ │
│ cevm/lib/consensus/quasar/gpu/ │
│ ┌───────────────────────────────────────────┐ │
│ │ QuasarGPU wave-tick scheduler (LP-132) │ │
│ │ quasar_wave.metal / .cu │ │
│ │ drain_exec ← runs the fiber VM │ │
│ │ drain_validate / drain_repair │ │
│ │ drain_cert_lane / drain_state_resp │ │
│ └───────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────┘
LP-009 covers the EVM proper. LP-132 covers the wave-tick adapter that
embeds the EVM fiber VM into Quasar consensus rounds. LP-010 covers the
Block-STM scheduling fabric the EVM fibers run under.
- drain_exec, 118 opcodes, SLOAD cold-miss
- cevm/lib/evm/
- cevm/lib/consensus/quasar/gpu/
- cevm/test/unittests/quasar_gpu_engine_test.mm (13/13 PASS as of v0.39)
- (118 opcodes), suspend/resume, Block-STM integration, QuasarGPU adapter cross-reference (Quasar 3.0 ships 2025-12-25)
Copyright (C) 2025, Lux Partners Limited. All rights reserved.