Draft. No backwards compatibility. No flags. No CPU-by-default mode.
Activated at the genesis of the new final Lux network: **2025-12-25 16:20
Pacific (unix 1766708400)**. Predates every block on the new network.
The pre-Quasar Edition Lux network (2020–2025) ran CPU-only verify and
is a separate network out of scope.
The Lux ZAP Stack is GPU-native by design. The ZAP buffer (LP-022) lives
in Unified Memory Architecture (UMA), accessible to both CPU and GPU
without a single cudaMemcpy. Signature verification (BLS12-381,
Ed25519, secp256k1, ML-DSA), Merkle root computation, and lattice signing
(Corona, Pulsar) all run as CUDA / Metal / ROCm kernels in
luxfi/accel and luxfi/lattice. Network bytes go from NIC RAM straight
to GPU memory via GPUDirect RDMA. Kernel-bypass networking via DPDK
delivers 10M packets/sec to a single validator. There is no CPU
bottleneck.
This is not an optimization layer bolted onto a CPU-first design. It is
the architecture. The same ZAP byte stream that frames the wire (LP-200,
LP-201), pipelines verify stages (LP-202), and lands in ZapDB is the
byte stream the GPU reads directly. Eight years of "marshal between CPU
and GPU" disappears because there is no marshal: there is one buffer in
UMA, and both processors share it.
The components exist on disk today. luxfi/accel (v1.1.9) ships CUDA,
Metal, and ROCm kernels for BLS pairings and ML-DSA NTT. luxfi/lattice
(v7.1.4) ships GPU lattice ops for Corona. luxfi/zap/transport
(uma_*.go, dpdk*.go, gpudirect*.go) implements UMA allocation,
DPDK kernel-bypass ingest, and GPUDirect RDMA. luxfi/consensus/protocol/quasar/gpu_accel.go
exposes the Quasar consensus GPU hook. luxfi/threshold/protocols/corona/gpu/
holds the Corona GPU implementation.
What this LP specifies is the wire-up: the contract by which the
production hot path enters these kernels, the memory ownership model
between ZAP buffers and GPU memory, the dispatch shape per crypto
primitive, the throughput targets per validator class, and the fallback
contract when a GPU is unavailable.
The ZAP buffer is the single source of bytes for the wire, the mempool,
the verify stage, the consensus transcript, and the state diff. The GPU
reads those bytes directly. There is no DTO, no copy, no encode/decode.
┌─────────────┐ GPUDirect RDMA ┌─────────────┐
│ NIC RAM │ ─────────────────► │ GPU memory │
│ (Mellanox/ │ (DMA, no CPU │ (UMA-mapped)│
│ Broadcom) │ involvement) │ │
└─────────────┘ └─────────────┘
▲
│ UMA: CPU + GPU see same buffer
▼
┌─────────────┐
│ CPU cores │
│ (ZapDB, │
│ consensus │
│ state mgmt)│
└─────────────┘
UMA allocation. ZAP buffers used by the verify hot path are
allocated through zap/transport/uma.go's UMAAlloc(size). On Linux
the backing is cudaMallocManaged (CUDA) or hipMallocManaged (ROCm);
on Apple Silicon the backing is a Metal MTLBuffer with
storageModeShared (uma_darwin.go); the fallback is a heap allocation
with no GPU mapping (uma_other.go). The Go-visible type is
zap.UMABuffer, a []byte whose backing memory is page-locked and
visible to the active GPU device. CPU writes are immediately visible to
the GPU and vice versa, modulo the device's memory consistency model;
the spec requires the implementation to insert the device fence at
hand-off points (CPU→GPU and GPU→CPU).
GPUDirect ingress. For top-of-rack validators with a supported NIC
(Mellanox ConnectX-5/6/7, Broadcom Stingray), zap/transport/gpudirect_linux.go
registers a CUDA buffer with the NIC driver via nv_peer_mem (or the
RDMA MEMORY_REGION registration). Received packets DMA directly into
GPU memory without traversing CPU caches. The verify kernel reads the
ZAP buffer from the same memory region the NIC just wrote. End-to-end
network → GPU verify → state commit without a single memcpy.
DPDK ingest. For deployments without GPUDirect-capable NICs, DPDK
(zap/transport/dpdk_linux.go) provides the next-best path: kernel-bypass
poll-mode driver into a hugepage-backed mempool. The mempool is
registered as UMA, so the GPU still reads packet bytes without an
intermediate copy. CPU latency stays under 1μs from wire-arrival to
first-byte ZAP-magic-check.
Memory ownership rules.
Hand-offs are explicit fence points. There is no shared writer state.
A block arrives via DPDK or GPUDirect. The block carries N transactions
(N typically 100–10,000). The verify pipeline (LP-202) hands the block's
UMA buffer to the GPU verify stage.
1. Magic + schema-kind dispatch — CPU reads byte 0..4 of each tx
envelope. Two reasons this stays on CPU: branch density is too high
for GPU warp efficiency, and the work per tx (~50ns) is dominated by
warp launch overhead. The CPU produces a per-scheme tx-index list:
{bls: [3, 7, 18, ...], ed25519: [0, 1, 2, 4, ...], ml_dsa: [5, 12, ...], ...}.
2. Signature batch verify on GPU — one CUDA stream per scheme,
dispatched in parallel. Each kernel batches all tx of its scheme:
luxfi/accel (ops_crypto.go → accel.BatchVerifyBLS). Miller loop + final exponentiation per sig
in one block; N blocks in one grid.
per warp via coalesced reads.
ops_crypto.go.
luxfi/lattice/mldsa-gpu and luxfi/accel/ops_lattice.go. NTT-based polynomial verify, one
block per signature, threads = NTT butterflies, shared memory for
twiddle factors.
independent verifies parallelize one-thread-per-verify. CPU/GPU
hybrid: for small N, CPU wins on launch overhead; for N ≥ 32, GPU
wins. Crossover threshold is profile-configured.
3. Merkle root compute on GPU for the block's state-diff tree:
sha256 over N leaves in one kernel; tree reduction in log₂(N)
levels; each level is a kernel call reading the prior level's output
from UMA. Result lands in UMA where ZapDB reads it directly.
4. Threshold sig aggregation on GPU for consensus certs:
validator. One CUDA block per validator; threads handle the ring
polynomial coefficients; cross-validator reduction via global
memory. See ~/work/lux/threshold/protocols/corona/gpu/.
luxfi/lattice/v7 GPU kernels.
luxfi/consensus/protocol/quasar/gpu_accel.godispatches the per-leg verify (BLS leg + Pulsar leg + Corona leg
+ Magnetar leg) in parallel CUDA streams; a final CPU step reads
the four boolean results from UMA and computes the QuasarCert
verdict.
After all GPU kernels return, the CPU reads the verdict from UMA,
attaches the GPU-computed Merkle root to the next block proposal, and
ZapDB commits the state diff that already sits in UMA.
All numbers below are batch throughput (the production case — verify is
always batched). Single-sig latencies are CPU-launch-bound and are
listed separately. "Measured" means observed in the implementation
referenced in the row's source path; "estimated" means projected from
GPU peak FLOPS and kernel arithmetic intensity.
luxfi/accel Metal kernel |luxfi/lattice/v7 |luxfi/accel |threshold/corona/gpu |Measured 2026-06-04 on spark (CUDA 13.0.88, driver 580.159.03, 121 GB unified memory, 48 SMs). Bench source: /tmp/gpu-vtbl-bench/bench.c calling the public lux_gpu_* C API; backend = ~/work/lux-private/gpu-kernels/backends/cuda/build-wire/libluxgpu_backend_cuda.so (ABI v12, sm_120 native).
cuda_op_corona_combine_batch uses #pragma omp parallel for |Power efficiency: idle 16 W; full BLS-pairing or keccak-batch load 19-20 W. ΔP ≈ 4 W under load. At 49 197 pairings/sec / 4 W = ~12 300 pairings per watt — orders of magnitude better than x86 CPU baselines.
Per-pairing crossover: GPU per-pair (19 μs) beats single-thread CPU blst (~1 ms) from N=1; beats 20-core parallel CPU (~50 μs/pair) from N≥3-4.
Per-hash crossover (keccak256/sha3): GPU baseline ~200 μs kernel-launch overhead vs CPU 0.4-0.9 μs/op single-thread; GPU wins from N≥~500 single-thread or N≥~10 000 vs 20-core parallel. The kernel is small-input-optimized; large-input single-keccak (1 MiB) hits only 19 MB/s — not the intended workload.
Honest state of the CUDA plugin (backends/cuda/src/plugin.cpp at lux-private/gpu-kernels): of 76 vtbl ops, 16 are explicit LUX_BACKEND_ERROR_NOT_SUPPORTED stubs — primarily the direct-verify entries for signature schemes (Ed25519, sr25519, ML-DSA, SLH-DSA, ML-KEM) and the projective-coords BLS/BN254 ops. Native GPU paths exist for: BLS12-381 EIP-2537 (G1 add, G1 MSM, G2 add, G2 MSM, pairing, hash-to-curve), BN254 G1 ops + pairing-check, keccak256, sha3_256, blake3, modexp, the Magnetar 4 SLH-DSA primitives, and the Corona partial-sign step. The composite signature-verify wrappers are the gap to close before LP-203 numbers turn fully empirical.
// Allocate ZAP buffer in unified memory — visible to CPU and GPU
buf, err := zap.UMAAlloc(1024)
if err != nil {
return err
}
defer buf.Free()
// CPU writes ZAP fields at known offsets
v := zap.Wrap[QuasarCert](buf)
if err := v.Write(QuasarCertFields.BLS, blsSig); err != nil {
return err
}
// ... fill remaining fields
// Hand the buffer to the GPU verify kernel; no cudaMemcpy
result, err := accel.BatchVerify(ctx, []zap.UMABuffer{buf}, []bls.PublicKey{pk})
if err != nil {
return err
}
// CPU reads the kernel's verdict from UMA; CPU is free to mutate other
// UMA buffers concurrently (the kernel only reads `buf`).
The zap.UMABuffer type wraps the unified-memory allocation. Passing it
to accel.BatchVerify results in the GPU kernel reading the ZAP
buffer's bytes directly. The CPU can keep mutating other UMA buffers
concurrently — each kernel only reads the buffers it is handed. There
is no shared writer state.
For the consensus hot path, the call site is
consensus/protocol/quasar/gpu_accel.go's VerifyCertGPU(cert *QuasarCert),
which dispatches the four leg verifications in parallel CUDA streams,
waits on a single CUDA event, and reads the combined verdict from UMA.
For top-of-rack validators on Linux:
involvement, no syscall on the receive path.
through cudaHostRegister (or the equivalent ROCm/Metal call). The
GPU can read packet bytes directly.
isolcpus / nohz_full). No interrupts.
are dropped at the PMD thread before any further work.
UMA pointer to a ring slot. The verify kernel polls the ring on the
GPU side.
to first-byte ZAP-magic-check.
zap/transport/dpdk.go exposes the platform-agnostic interface;
dpdk_linux.go is the Linux PMD binding; dpdk_other.go is a stub for
non-Linux platforms (developer workstations fall back to the kernel
socket path).
GPU verify is the hot path. CPU verify is the fallback. The fallback is
not optional — every primitive in luxfi/accel and luxfi/lattice
ships with a CPU implementation, and the dispatcher chooses at runtime.
Detection. Per-kernel timeout based on the profile's round budget.
If the GPU does not return within the timeout, the dispatcher cancels
the in-flight kernel (CUDA stream destroy / Metal command buffer abort),
re-dispatches on CPU, and increments the lux_verify_cpu_fallback_total
metric.
Triggers.
kernel socket path).
Continuation. A validator on CPU fallback continues participating in
consensus. Throughput degrades to the CPU-baseline numbers in the
"Throughput targets" tables. In the strict-PQ profile (LP-177), CPU
fallback is not survivable — the 64-validator Corona aggregate
takes ~350 ms on CPU and the strict-PQ round budget is tighter than
that. See "Architectural decision: strict-PQ requires GPU" below.
Metrics (Prometheus).
lux_verify_gpu_calls_total{scheme="bls|ed25519|secp256k1|mldsa|slhdsa|corona|pulsar"}lux_verify_cpu_fallback_total{scheme,reason="absent|timeout|error|init_failed"}lux_verify_kernel_latency_seconds{scheme} (histogram)lux_verify_throughput_sigs_per_sec{scheme,device}lux_gpu_uma_pool_bytes_allocatedlux_gpu_uma_pool_bytes_freelux_dpdk_packets_received_totallux_dpdk_packets_dropped_total{reason="bad_magic|ring_full|verify_failed"}lux_gpudirect_bytes_dma_totalThe strict-PQ profile (LP-177) drops the BLS leg from Quasar and
runs Pulsar + Corona only. The Corona aggregate over 64 validators
takes ~350 ms on CPU and ~50 ms on M4 Max GPU (~15 ms on H100). The
strict-PQ round budget assumes the GPU number.
Decision. The strict-PQ profile requires GPU acceleration on
every validator participating in the strict-PQ round. A validator
running strict-PQ without GPU acceleration cannot keep up with the
round budget and will repeatedly time out, causing strict-PQ rounds to
fail. The Lux Operator (luxfi/operator) refuses to admit a validator
to a strict-PQ-enabled chain unless the validator's gpu.enabled is
true in config and the boot-time GPU probe succeeds.
Rationale. The strict-PQ profile exists for environments that
cannot rely on BLS pairings (post-quantum-only zones). Those
environments are validator-class deployments by definition — they are
not developer laptops. Requiring GPU on validators in those zones is
not a constraint; it is the only sensible deployment.
Trade-off. Developer workstations and CI runners cannot run the
strict-PQ profile without a GPU. The default profile (BLS + Pulsar +
Corona) does not require GPU; CPU fallback is survivable for the
default profile because the BLS leg is fast on CPU and Corona is not
in the critical path. Developers test strict-PQ either on Apple Silicon
(Metal) or on a GPU-equipped CI runner.
Enforcement.
consensus/protocol/quasar/gpu_accel.go exposes RequireGPU() whichreturns an error if the active profile is strict-PQ and the GPU probe
failed.
RequireGPU() early and refusesto come up on a strict-PQ chain without GPU.
LuxNetwork CRD validates that strict-PQ chains pin GPU-equipped node pools (GKE: cloud.google.com/gke-accelerator
label set; DOKS: GPU node pool).
Operators choose the GPU profile at boot via a config file (not an env
var; configuration is value-explicit and reviewable).
gpu:
enabled: true # set false to force CPU-only verify
cuda_device: 0 # GPU index for multi-GPU hosts
uma_pool_size: 4GB # unified-memory pool for ZAP buffers
dpdk_nic_queue: 0 # DPDK queue to bind for kernel-bypass ingest
gpudirect: true # enable RDMA NIC → GPU memory path
fallback_timeout_ms: 50 # per-kernel timeout before CPU fallback
strict_pq_required: true # refuses to boot on strict-PQ chain without GPU
Default for new validators: enabled: true, gpudirect: false (only
flipped on by operators with confirmed-capable NICs), `dpdk_nic_queue:
0 (DPDK off; flip on for top-of-rack), strict_pq_required: true`.
End-to-end pipeline timings for a 1000-tx block, composed across LP-200
(ZAP frame), LP-201 (QUIC + DPDK transport), LP-202 (verify
pipelining), and this LP-203 (GPU kernels).
The CPU-only column is the baseline that the pre-Quasar Edition Lux
network would have hit. The GPU-native columns are the new final
network's operating regime. The 24× speed-up on M4 Max and ~24× on H100
is what the LP-200 stack pays for in elegance: one byte stream means the
GPU reads it directly.
This spec is explicit about the line between code that exists and
integration that does not.
luxfi/accel package | ~/work/lux/accel | v1.1.9 (go.mod); CUDA / Metal / ROCm BLS pairing kernels; ML-DSA NTT in ops_lattice.go; secp256k1 in crypto_secp256k1_cgo.go |~/work/lux/accel/ops_crypto.go | Per-scheme batch entry points |~/work/lux/accel/ops_lattice.go | ML-DSA NTT, Corona-adjacent ops |~/work/lux/accel/ops_fhe.go | Reserved for LP-205 |luxfi/lattice package | ~/work/lux/lattice | v7.1.4; GPU lattice ops for Corona and Pulsar |~/work/lux/zap/transport/uma.go + uma_linux.go / uma_darwin.go / uma_other.go | Platform-split UMA allocation |~/work/lux/zap/transport/dpdk.go + dpdk_linux.go / dpdk_other.go | Linux PMD; non-Linux stub |~/work/lux/zap/transport/gpudirect.go + gpudirect_linux.go / gpudirect_other.go | RDMA NIC → GPU memory path |~/work/lux/consensus/protocol/quasar/gpu_accel.go (+ test) | Per-leg dispatch entry point |~/work/lux/threshold/protocols/corona/gpu/ | Corona R1/R2 GPU kernels |zap.UMAAlloc into the mempool buffer allocator | consensus/mempool (and per-VM mempools) | Mempool ZAP buffers land in UMA from creation |BatchVerify the default verify path in Quasar | consensus/protocol/quasar/cert.go | CPU verify becomes the explicit fallback, not the default |network/p2p consumer of zap/transport/dpdk | DPDK is a kernel-bypass NIC-ingestion mechanism that composes with LP-201's QUIC default; it does NOT register against the Transport.Pick() selector — DPDK changes how UDP packets reach the process, not how peers map to transports |Transport.Pick() — composes under whichever transport (QUIC, LP-207 RDMA-IB) the selector chose for the peer |consensus/profile boot check + Operator CRD validation | strict-PQ validator can't come up GPU-less |luxfi/accel dispatcher | Automatic degrade on GPU failure |luxfi/accel + consensus/protocol/quasar/gpu_accel.go | Operator visibility into GPU vs CPU pathing |Each integration task is a single concrete site; none introduce new
abstractions. The pattern is: read the ZAP buffer, dispatch to GPU,
read result from UMA. No DTOs, no marshal, no codec, no new Go
interface.
GPU kernels read)
Merkle compute)
this LP wires up)
hot-path wire-up)
the architectural decision above)
LP specs the crossover threshold)
LP's GPU kernels read)
LP consumes)
this LP's GPU kernels saturate that width)
activates: 2025-12-25T16:20:00-08:00
activates-unix: 1766708400
Predates every block on the new final Lux network. No backwards
compatibility with the pre-Quasar Edition CPU-only verify path.
across chain VMs — one GPU per validator host serving multiple chain
VMs' verify queues fairly.
luxfi/fhe (LP-167) and accel/ops_fhe.go into the consensus hot
path so encrypted votes can be tallied without decryption.
Intel TDX attestation of the GPU verify kernels so that GPU-computed
Merkle roots and consensus verdicts carry a remote-attestation
receipt.