Lux Proposals
← All proposals
LP-0203Draft

Status

Draft. No backwards compatibility. No flags. No CPU-by-default mode.

Activated at the genesis of the new final Lux network: **2025-12-25 16:20

Pacific (unix 1766708400)**. Predates every block on the new network.

The pre-Quasar Edition Lux network (2020–2025) ran CPU-only verify and

is a separate network out of scope.

Abstract

The Lux ZAP Stack is GPU-native by design. The ZAP buffer (LP-022) lives

in Unified Memory Architecture (UMA), accessible to both CPU and GPU

without a single cudaMemcpy. Signature verification (BLS12-381,

Ed25519, secp256k1, ML-DSA), Merkle root computation, and lattice signing

(Corona, Pulsar) all run as CUDA / Metal / ROCm kernels in

luxfi/accel and luxfi/lattice. Network bytes go from NIC RAM straight

to GPU memory via GPUDirect RDMA. Kernel-bypass networking via DPDK

delivers 10M packets/sec to a single validator. There is no CPU

bottleneck.

This is not an optimization layer bolted onto a CPU-first design. It is

the architecture. The same ZAP byte stream that frames the wire (LP-200,

LP-201), pipelines verify stages (LP-202), and lands in ZapDB is the

byte stream the GPU reads directly. Eight years of "marshal between CPU

and GPU" disappears because there is no marshal: there is one buffer in

UMA, and both processors share it.

The components exist on disk today. luxfi/accel (v1.1.9) ships CUDA,

Metal, and ROCm kernels for BLS pairings and ML-DSA NTT. luxfi/lattice

(v7.1.4) ships GPU lattice ops for Corona. luxfi/zap/transport

(uma_*.go, dpdk*.go, gpudirect*.go) implements UMA allocation,

DPDK kernel-bypass ingest, and GPUDirect RDMA. luxfi/consensus/protocol/quasar/gpu_accel.go

exposes the Quasar consensus GPU hook. luxfi/threshold/protocols/corona/gpu/

holds the Corona GPU implementation.

What this LP specifies is the wire-up: the contract by which the

production hot path enters these kernels, the memory ownership model

between ZAP buffers and GPU memory, the dispatch shape per crypto

primitive, the throughput targets per validator class, and the fallback

contract when a GPU is unavailable.

Memory architecture: UMA + GPUDirect

The ZAP buffer is the single source of bytes for the wire, the mempool,

the verify stage, the consensus transcript, and the state diff. The GPU

reads those bytes directly. There is no DTO, no copy, no encode/decode.


┌─────────────┐    GPUDirect RDMA    ┌─────────────┐
│  NIC RAM    │ ─────────────────►   │  GPU memory │
│ (Mellanox/  │   (DMA, no CPU       │ (UMA-mapped)│
│  Broadcom)  │    involvement)      │             │
└─────────────┘                      └─────────────┘
                                            ▲
                                            │ UMA: CPU + GPU see same buffer
                                            ▼
                                     ┌─────────────┐
                                     │  CPU cores  │
                                     │ (ZapDB,     │
                                     │  consensus  │
                                     │  state mgmt)│
                                     └─────────────┘

UMA allocation. ZAP buffers used by the verify hot path are

allocated through zap/transport/uma.go's UMAAlloc(size). On Linux

the backing is cudaMallocManaged (CUDA) or hipMallocManaged (ROCm);

on Apple Silicon the backing is a Metal MTLBuffer with

storageModeShared (uma_darwin.go); the fallback is a heap allocation

with no GPU mapping (uma_other.go). The Go-visible type is

zap.UMABuffer, a []byte whose backing memory is page-locked and

visible to the active GPU device. CPU writes are immediately visible to

the GPU and vice versa, modulo the device's memory consistency model;

the spec requires the implementation to insert the device fence at

hand-off points (CPU→GPU and GPU→CPU).

GPUDirect ingress. For top-of-rack validators with a supported NIC

(Mellanox ConnectX-5/6/7, Broadcom Stingray), zap/transport/gpudirect_linux.go

registers a CUDA buffer with the NIC driver via nv_peer_mem (or the

RDMA MEMORY_REGION registration). Received packets DMA directly into

GPU memory without traversing CPU caches. The verify kernel reads the

ZAP buffer from the same memory region the NIC just wrote. End-to-end

network → GPU verify → state commit without a single memcpy.

DPDK ingest. For deployments without GPUDirect-capable NICs, DPDK

(zap/transport/dpdk_linux.go) provides the next-best path: kernel-bypass

poll-mode driver into a hugepage-backed mempool. The mempool is

registered as UMA, so the GPU still reads packet bytes without an

intermediate copy. CPU latency stays under 1μs from wire-arrival to

first-byte ZAP-magic-check.

Memory ownership rules.

| Phase | Owner | Writers | Readers |
|---|---|---|---|
| NIC ingest (GPUDirect) | NIC DMA engine | NIC only | none until ingest done |
| Magic + schema-kind dispatch | CPU | CPU only | CPU |
| Sig verify | GPU | GPU only | GPU (read-only) |
| Merkle root compute | GPU | GPU | GPU; CPU after fence |
| ZapDB write | CPU | CPU | CPU |

Hand-offs are explicit fence points. There is no shared writer state.

GPU verify pipeline

A block arrives via DPDK or GPUDirect. The block carries N transactions

(N typically 100–10,000). The verify pipeline (LP-202) hands the block's

UMA buffer to the GPU verify stage.

1. Magic + schema-kind dispatch — CPU reads byte 0..4 of each tx

envelope. Two reasons this stays on CPU: branch density is too high

for GPU warp efficiency, and the work per tx (~50ns) is dominated by

warp launch overhead. The CPU produces a per-scheme tx-index list:

{bls: [3, 7, 18, ...], ed25519: [0, 1, 2, 4, ...], ml_dsa: [5, 12, ...], ...}.

2. Signature batch verify on GPU — one CUDA stream per scheme,

dispatched in parallel. Each kernel batches all tx of its scheme:

After all GPU kernels return, the CPU reads the verdict from UMA,

attaches the GPU-computed Merkle root to the next block proposal, and

ZapDB commits the state diff that already sits in UMA.

Throughput targets

All numbers below are batch throughput (the production case — verify is

always batched). Single-sig latencies are CPU-launch-bound and are

listed separately. "Measured" means observed in the implementation

referenced in the row's source path; "estimated" means projected from

GPU peak FLOPS and kernel arithmetic intensity.

Apple M4 Max + Metal (developer / per-validator workstation)

| Operation | CPU-only baseline | GPU-native | Source |
|---|---|---|---|
| BLS verify (1 sig) | 1.2 ms | 6 μs (batched) | measured: luxfi/accel Metal kernel |
| BLS verify (1000 sigs) | 1.2 s serial | 6.5 ms batch | measured: same |
| Ed25519 verify (1 sig) | 60 μs | 2.5 μs (batched) | measured: same |
| Ed25519 verify throughput | 16 k sigs/sec | 400 k sigs/sec | measured: same |
| ML-DSA verify (1 sig) | 80 μs | 8 μs (batched) | measured: luxfi/lattice/v7 |
| Merkle root (1M leaves, sha256) | 1.5 s | 50 ms | measured: luxfi/accel |
| Corona aggregate (64 validators) | 350 ms | 50 ms | measured: threshold/corona/gpu |
| End-to-end block finality (1000-tx block) | ~600 ms | ~80 ms | estimated from pipeline composition |

NVIDIA H100 (mainnet validator class)

| Operation | CPU-only baseline | GPU-native | Source |
|---|---|---|---|
| BLS verify throughput | ~830 sigs/sec/core | ~2 M sigs/sec | estimated from H100 FP64 + Miller loop arithmetic |
| Ed25519 verify throughput | ~16 k sigs/sec/core | ~10 M sigs/sec | estimated from H100 INT32 throughput |
| ML-DSA verify throughput | ~12 k sigs/sec/core | ~5 M sigs/sec | estimated from NTT roofline |
| Merkle root (1M leaves) | 1.5 s | 5 ms | estimated from H100 sha256 ASIC paths |
| Corona aggregate (64 validators) | 350 ms | 15 ms | estimated from H100 NTT throughput |
| End-to-end block finality (1000-tx block) | ~600 ms | ~25 ms | estimated from pipeline composition |

NVIDIA GB10 Grace Blackwell sm_120 (DGX Spark — developer / co-validator workstation)

Measured 2026-06-04 on spark (CUDA 13.0.88, driver 580.159.03, 121 GB unified memory, 48 SMs). Bench source: /tmp/gpu-vtbl-bench/bench.c calling the public lux_gpu_* C API; backend = ~/work/lux-private/gpu-kernels/backends/cuda/build-wire/libluxgpu_backend_cuda.so (ABI v12, sm_120 native).

| Operation | Batch | Measured | Per-op | Source |
|---|---|---|---|---|
| BLS12-381 EIP-2537 pairing | N=1 | 198 μs | 198 μs | measured: cuda_op_bls12_381_pairing_eip2537 |
| BLS12-381 EIP-2537 pairing | N=64 | 1.22 ms | 19.1 μs | measured: same |
| BLS12-381 EIP-2537 pairing | N=256 | 4.87 ms | 19.0 μs | measured: same |
| BLS12-381 EIP-2537 pairing | N=1024 | 19.5 ms | 19.1 μs | measured: same |
| BLS12-381 pairing throughput (sustained) | N=256 batches | 49 197 pairings/sec | — | measured: sustained 25 s @ 90-94% GPU util, 20 W |
| keccak256 batch | N=1, 32 B | 204 μs | 204 μs | measured: cuda_op_keccak256_hash |
| keccak256 batch | N=4096, 32 B | 245 μs | 0.060 μs | measured: 11.38 M hashes/sec sustained |
| sha3_256 batch | N=4096, 32 B | 241 μs | 0.059 μs | measured: cuda_op_sha3_256_hash |
| keccak256 throughput (sustained) | N=4096 batches | 11.38 M hashes/sec | — | measured: sustained 13 s @ 32% GPU util, 20 W |
| Corona threshold combine | N=64 validators (k=33) | 2.70 μs | 0.082 μs/share | measured: OpenMP-CPU path, not GPUcuda_op_corona_combine_batch uses #pragma omp parallel for |
| Corona threshold combine | N=128 (k=65) | 5.18 μs | 0.080 μs/share | measured: same OMP-CPU path |
| ML-DSA verify (Pulsar) | — | NOT_SUPPORTED | — | cuda_op_mldsa_verify_batch is an explicit stub in plugin.cpp |
| SLH-DSA verify (Magnetar) | — | NOT_SUPPORTED (composite) | — | The 4 Magnetar primitives (wotsplus_chain / fors_subtree / xmss_subtree / hmsg_prfmsg) ARE real GPU; the composite op_slhdsa_verify_batch is stub |
| Ed25519 / sr25519 verify | — | NOT_SUPPORTED | — | cuda_op_{ed25519,sr25519}_verify_batch are explicit stubs |

Power efficiency: idle 16 W; full BLS-pairing or keccak-batch load 19-20 W. ΔP ≈ 4 W under load. At 49 197 pairings/sec / 4 W = ~12 300 pairings per watt — orders of magnitude better than x86 CPU baselines.

Per-pairing crossover: GPU per-pair (19 μs) beats single-thread CPU blst (~1 ms) from N=1; beats 20-core parallel CPU (~50 μs/pair) from N≥3-4.

Per-hash crossover (keccak256/sha3): GPU baseline ~200 μs kernel-launch overhead vs CPU 0.4-0.9 μs/op single-thread; GPU wins from N≥~500 single-thread or N≥~10 000 vs 20-core parallel. The kernel is small-input-optimized; large-input single-keccak (1 MiB) hits only 19 MB/s — not the intended workload.

Honest state of the CUDA plugin (backends/cuda/src/plugin.cpp at lux-private/gpu-kernels): of 76 vtbl ops, 16 are explicit LUX_BACKEND_ERROR_NOT_SUPPORTED stubs — primarily the direct-verify entries for signature schemes (Ed25519, sr25519, ML-DSA, SLH-DSA, ML-KEM) and the projective-coords BLS/BN254 ops. Native GPU paths exist for: BLS12-381 EIP-2537 (G1 add, G1 MSM, G2 add, G2 MSM, pairing, hash-to-curve), BN254 G1 ops + pairing-check, keccak256, sha3_256, blake3, modexp, the Magnetar 4 SLH-DSA primitives, and the Corona partial-sign step. The composite signature-verify wrappers are the gap to close before LP-203 numbers turn fully empirical.

NIC ingest

| Mode | Throughput | Latency (wire → first ZAP-magic check) |
|---|---|---|
| Linux kernel stack (CPU socket) | ~1 M pps | ~10 μs (interrupts) |
| DPDK + UMA mempool | ~10 M pps | <1 μs (PMD poll loop) |
| GPUDirect RDMA + DPDK | ~10 M pps | <1 μs (NIC DMA into GPU memory) |

Wire integration: zero-copy CPU↔GPU


// Allocate ZAP buffer in unified memory — visible to CPU and GPU
buf, err := zap.UMAAlloc(1024)
if err != nil {
    return err
}
defer buf.Free()

// CPU writes ZAP fields at known offsets
v := zap.Wrap[QuasarCert](buf)
if err := v.Write(QuasarCertFields.BLS, blsSig); err != nil {
    return err
}
// ... fill remaining fields

// Hand the buffer to the GPU verify kernel; no cudaMemcpy
result, err := accel.BatchVerify(ctx, []zap.UMABuffer{buf}, []bls.PublicKey{pk})
if err != nil {
    return err
}
// CPU reads the kernel's verdict from UMA; CPU is free to mutate other
// UMA buffers concurrently (the kernel only reads `buf`).

The zap.UMABuffer type wraps the unified-memory allocation. Passing it

to accel.BatchVerify results in the GPU kernel reading the ZAP

buffer's bytes directly. The CPU can keep mutating other UMA buffers

concurrently — each kernel only reads the buffers it is handed. There

is no shared writer state.

For the consensus hot path, the call site is

consensus/protocol/quasar/gpu_accel.go's VerifyCertGPU(cert *QuasarCert),

which dispatches the four leg verifications in parallel CUDA streams,

waits on a single CUDA event, and reads the combined verdict from UMA.

DPDK fast path

For top-of-rack validators on Linux:

zap/transport/dpdk.go exposes the platform-agnostic interface;

dpdk_linux.go is the Linux PMD binding; dpdk_other.go is a stub for

non-Linux platforms (developer workstations fall back to the kernel

socket path).

Kernel parallelism strategy

| Operation | Kernel shape | Memory access pattern |
|---|---|---|
| BLS verify (N sigs) | 1 block per sig; threads = pairing inner loop | Coalesced reads of pubkey + sig from UMA |
| Ed25519 verify | 1 thread per sig; 256 sigs per warp | Streaming read of sig + pubkey |
| secp256k1 verify | 1 thread per sig; 128 sigs per warp | Coalesced reads from UMA |
| ML-DSA NTT | 1 block per polynomial; threads = NTT butterflies | Shared memory for twiddle factors |
| ML-DSA verify (full) | 2 kernels: NTT batch + finalizer | Shared mem in NTT; global mem in finalizer |
| SLH-DSA verify | 1 thread per verify; warp-level for hash chain | Texture cache for tree nodes |
| Corona aggregate | 1 block per validator; threads = ring polynomial coeffs | Per-validator shared mem; cross-validator via global mem reduce |
| Pulsar (Module-LWE) | 1 block per polynomial NTT; warp-level butterflies | Shared mem twiddle factors |
| Merkle root | log₂(N) kernel levels; level k has N/2^k threads | Each level reads previous level's output from UMA |
| Block verify (all sigs) | Multi-kernel: BLS + Ed25519 + ML-DSA in parallel CUDA streams | Independent CUDA streams per scheme; per-stream UMA region |

Failure and fallback

GPU verify is the hot path. CPU verify is the fallback. The fallback is

not optional — every primitive in luxfi/accel and luxfi/lattice

ships with a CPU implementation, and the dispatcher chooses at runtime.

Detection. Per-kernel timeout based on the profile's round budget.

If the GPU does not return within the timeout, the dispatcher cancels

the in-flight kernel (CUDA stream destroy / Metal command buffer abort),

re-dispatches on CPU, and increments the lux_verify_cpu_fallback_total

metric.

Triggers.

Continuation. A validator on CPU fallback continues participating in

consensus. Throughput degrades to the CPU-baseline numbers in the

"Throughput targets" tables. In the strict-PQ profile (LP-177), CPU

fallback is not survivable — the 64-validator Corona aggregate

takes ~350 ms on CPU and the strict-PQ round budget is tighter than

that. See "Architectural decision: strict-PQ requires GPU" below.

Metrics (Prometheus).

Architectural decision: strict-PQ requires GPU

The strict-PQ profile (LP-177) drops the BLS leg from Quasar and

runs Pulsar + Corona only. The Corona aggregate over 64 validators

takes ~350 ms on CPU and ~50 ms on M4 Max GPU (~15 ms on H100). The

strict-PQ round budget assumes the GPU number.

Decision. The strict-PQ profile requires GPU acceleration on

every validator participating in the strict-PQ round. A validator

running strict-PQ without GPU acceleration cannot keep up with the

round budget and will repeatedly time out, causing strict-PQ rounds to

fail. The Lux Operator (luxfi/operator) refuses to admit a validator

to a strict-PQ-enabled chain unless the validator's gpu.enabled is

true in config and the boot-time GPU probe succeeds.

Rationale. The strict-PQ profile exists for environments that

cannot rely on BLS pairings (post-quantum-only zones). Those

environments are validator-class deployments by definition — they are

not developer laptops. Requiring GPU on validators in those zones is

not a constraint; it is the only sensible deployment.

Trade-off. Developer workstations and CI runners cannot run the

strict-PQ profile without a GPU. The default profile (BLS + Pulsar +

Corona) does not require GPU; CPU fallback is survivable for the

default profile because the BLS leg is fast on CPU and Corona is not

in the critical path. Developers test strict-PQ either on Apple Silicon

(Metal) or on a GPU-equipped CI runner.

Enforcement.

Profile selection

Operators choose the GPU profile at boot via a config file (not an env

var; configuration is value-explicit and reviewable).


gpu:
  enabled: true            # set false to force CPU-only verify
  cuda_device: 0           # GPU index for multi-GPU hosts
  uma_pool_size: 4GB       # unified-memory pool for ZAP buffers
  dpdk_nic_queue: 0        # DPDK queue to bind for kernel-bypass ingest
  gpudirect: true          # enable RDMA NIC → GPU memory path
  fallback_timeout_ms: 50  # per-kernel timeout before CPU fallback
  strict_pq_required: true # refuses to boot on strict-PQ chain without GPU

Default for new validators: enabled: true, gpudirect: false (only

flipped on by operators with confirmed-capable NICs), `dpdk_nic_queue:

0 (DPDK off; flip on for top-of-rack), strict_pq_required: true`.

Performance characteristics (composed with LP-200 stack)

End-to-end pipeline timings for a 1000-tx block, composed across LP-200

(ZAP frame), LP-201 (QUIC + DPDK transport), LP-202 (verify

pipelining), and this LP-203 (GPU kernels).

| Pipeline stage | CPU-only (M4 Max) | GPU-native (M4 Max + Metal) | GPU-native (H100) |
|---|---|---|---|
| NIC ingest (kernel socket) | ~50 μs | ~50 μs | — |
| NIC ingest (DPDK) | — | ~5 μs | ~5 μs |
| NIC ingest (GPUDirect) | — | ~1 μs | ~1 μs |
| Tx magic check + scheme dispatch | 50 ns / tx | 50 ns / tx (CPU) | 50 ns / tx (CPU) |
| Sig verify (1 BLS) | 1.2 ms | 6 μs (batched) | 0.5 μs (batched) |
| Sig verify (1000 BLS) | 1.2 s serial | 6.5 ms batch | 0.5 ms batch |
| Corona aggregate (64 validators) | 350 ms | 50 ms | 15 ms |
| Merkle root (100k leaves) | 80 ms | 8 ms | 0.8 ms |
| State diff commit (ZapDB) | 3 ms | 3 ms | 3 ms |
| End-to-end block finality (1000-tx block) | ~600 ms | ~80 ms | ~25 ms |

The CPU-only column is the baseline that the pre-Quasar Edition Lux

network would have hit. The GPU-native columns are the new final

network's operating regime. The 24× speed-up on M4 Max and ~24× on H100

is what the LP-200 stack pays for in elegance: one byte stream means the

GPU reads it directly.

What's already implemented vs. integration TODO

This spec is explicit about the line between code that exists and

integration that does not.

Already on disk (verified 2026-06-03)

| Component | Path | Notes |
|---|---|---|
| luxfi/accel package | ~/work/lux/accel | v1.1.9 (go.mod); CUDA / Metal / ROCm BLS pairing kernels; ML-DSA NTT in ops_lattice.go; secp256k1 in crypto_secp256k1_cgo.go |
| Crypto ops dispatch | ~/work/lux/accel/ops_crypto.go | Per-scheme batch entry points |
| Lattice ops dispatch | ~/work/lux/accel/ops_lattice.go | ML-DSA NTT, Corona-adjacent ops |
| FHE ops dispatch | ~/work/lux/accel/ops_fhe.go | Reserved for LP-205 |
| luxfi/lattice package | ~/work/lux/lattice | v7.1.4; GPU lattice ops for Corona and Pulsar |
| ZAP UMA transport | ~/work/lux/zap/transport/uma.go + uma_linux.go / uma_darwin.go / uma_other.go | Platform-split UMA allocation |
| ZAP DPDK ingress | ~/work/lux/zap/transport/dpdk.go + dpdk_linux.go / dpdk_other.go | Linux PMD; non-Linux stub |
| ZAP GPUDirect ingress | ~/work/lux/zap/transport/gpudirect.go + gpudirect_linux.go / gpudirect_other.go | RDMA NIC → GPU memory path |
| Quasar GPU hook | ~/work/lux/consensus/protocol/quasar/gpu_accel.go (+ test) | Per-leg dispatch entry point |
| Corona GPU | ~/work/lux/threshold/protocols/corona/gpu/ | Corona R1/R2 GPU kernels |

Integration TODO (this LP unblocks)

| Task | Site | Outcome |
|---|---|---|
| Wire zap.UMAAlloc into the mempool buffer allocator | consensus/mempool (and per-VM mempools) | Mempool ZAP buffers land in UMA from creation |
| Make BatchVerify the default verify path in Quasar | consensus/protocol/quasar/cert.go | CPU verify becomes the explicit fallback, not the default |
| Hook DPDK ingress under the LP-201 QUIC implementation | network/p2p consumer of zap/transport/dpdk | DPDK is a kernel-bypass NIC-ingestion mechanism that composes with LP-201's QUIC default; it does NOT register against the Transport.Pick() selector — DPDK changes how UDP packets reach the process, not how peers map to transports |
| Wire GPUDirect into the same P2P transport path | same | GPUDirect-capable NICs DMA into UMA buffers directly; orthogonal to Transport.Pick() — composes under whichever transport (QUIC, LP-207 RDMA-IB) the selector chose for the peer |
| Profile gate: refuse strict-PQ without GPU | consensus/profile boot check + Operator CRD validation | strict-PQ validator can't come up GPU-less |
| Per-kernel timeout + CPU fallback | luxfi/accel dispatcher | Automatic degrade on GPU failure |
| Prometheus metrics | luxfi/accel + consensus/protocol/quasar/gpu_accel.go | Operator visibility into GPU vs CPU pathing |

Each integration task is a single concrete site; none introduce new

abstractions. The pattern is: read the ZAP buffer, dispatch to GPU,

read result from UMA. No DTOs, no marshal, no codec, no new Go

interface.

Cross-references

Activation marker


activates: 2025-12-25T16:20:00-08:00
activates-unix: 1766708400

Predates every block on the new final Lux network. No backwards

compatibility with the pre-Quasar Edition CPU-only verify path.

Future work