LP-0206 Draft

Status

Draft. No backwards compatibility. No flag day.

Activated at the genesis of the new final Lux network:

**2025-12-25 16:20 Pacific (unix 1766708400)**. The pre-Quasar Edition Lux network

(2020–2025) had no CXL fast path and is a separate network out of scope.

Abstract

Validators co-located in the same physical rack pay a measurable

penalty when they replicate state over Layer-A network transport

(LP-201 QUIC + ZAP), even on a perfect 100 Gbps fabric: kernel transit,

TLS framing, NIC PCIe traversal, and userspace copy compose to a floor

of ~50-100 μs end-to-end for a single state read. For validators in

the same rack — a common topology for Tier-1 single-DC primary

networks (LP-204) — this floor is unnecessary. CXL 3.0 (Compute Express

Link, August 2022) exposes a coherent shared memory region across

multiple host CPUs and GPUs over a 64 GB/s per-link interconnect with

sub-100ns latency for cache-line reads. The ZAP buffer pool that

LP-200 §"The buffer is the value" describes is the natural object to

share — buffers are immutable values, identity is sha256(buf), and

the LP-203 GPU verifier already reads them from a UMA-style region.

LP-206 specifies the CXL Coherent State Pool: a rack-scale Type-3

memory device hosting a single ZAP buffer pool that every validator

in the rack mounts. Validator-A writes a buffer once; validator-B,

-C, -D read it at L3-cache miss latency (~80ns on Sapphire Rapids).

The pool composes with LP-210 (Block-STM MVCC snapshots become shared

reads), LP-211 (cross-shard staged overlays visible to every shard

validator in the rack), and LP-203 (GPU verifier reads pool memory

directly via NVLink-C2C on GB10/GB200). Global DHT replication via

LP-201 continues — the pool is a fast path for co-racked validators,

not a substitute for global content addressing.

The deployment target is the Tier-1 single-DC mode of LP-204: a

sovereign L1 whose validator set runs in one or two data centers,

with sub-rack fast-path consensus and global QUIC for any cross-DC

peer. CXL hardware is not yet on the bench cluster — DGX Spark has

NVLink-C2C between Grace CPU and Blackwell GPU on one node but does

not pool across multiple Sparks — so the latency numbers in

§"Performance characteristics" are split between Measured (the

NVLink-C2C 80ns Grace↔Blackwell figure) and Estimated (the

multi-host pooled-CXL 100ns figure cited from CXL 3.0 spec and Intel

Sapphire Rapids product briefs). The spec stands; the bench numbers

graduate when CXL 3.0 racks reach the lab.

Hardware requirements

A validator participating in the CXL Coherent State Pool needs:

| Component | Requirement | Reason |
|---|---|---|
| Host CPU | Intel Sapphire Rapids (Xeon 4th gen) or AMD Genoa (EPYC 9004) or later | CXL 1.1 host root complex with CXL.cache + CXL.mem semantics |
| CXL switch | Type-3 capable (CXL 3.0 preferred) | Pool topology requires Type-3 switch for memory pooling across hosts |
| Pooled memory device | CXL 3.0 Type-3, ≥ 256 GiB capacity | Hosts the shared ZAP buffer pool |
| Host CXL link | CXL 3.0 x16 (64 GB/s) or x8 (32 GB/s) | Bandwidth for buffer fan-out |
| GPU CXL bridge (optional) | NVIDIA GB10 / GB200 NVLink-C2C OR AMD MI300A CXL.mem | Direct GPU access to pool memory without PCIe hop |
| OS | Linux 6.6+ with CXL Type-3 driver | /sys/bus/cxl/devices/decoder* enumeration; libcxlmi or DAX access |

A validator without CXL falls back to the standard LP-201 QUIC path —

nothing breaks, only the in-rack fast path is unavailable. CXL is

opportunistic; it does not change the protocol.

Memory topology


                  Rack-scale CXL 3.0 fabric
   ┌──────────────────────────────────────────────────┐
   │                                                  │
   │      ┌──────────────────────────────────┐        │
   │      │  Pooled CXL Type-3 device        │        │
   │      │  256 GiB shared coherent memory  │        │
   │      │  ZAP buffer pool root            │        │
   │      └──────────────────────────────────┘        │
   │       ▲          ▲          ▲          ▲         │
   │       │          │          │          │         │
   │   ┌───┴──┐    ┌──┴───┐  ┌───┴──┐  ┌────┴─┐       │
   │   │ Val-A │    │ Val-B │  │ Val-C │  │ Val-D │  │
   │   │ CPU+  │    │ CPU+  │  │ CPU+  │  │ CPU+  │  │
   │   │ GPU   │    │ GPU   │  │ GPU   │  │ GPU   │  │
   │   └───────┘    └───────┘  └───────┘  └───────┘  │
   │                                                  │
   └──────────────────────────────────────────────────┘
                          │
                          │ LP-201 QUIC + DHT (global)
                          ▼
                  WAN peers / other racks

The pool exposes a single coherent address range. Every validator

mounts it at the same virtual address (mmap(MAP_SHARED) via the

Linux CXL DAX driver). Writes are coherent via CXL.cache (CXL 1.1+);

no software synchronization is needed beyond standard memory ordering

fences. Reads at cache-line granularity (64 bytes typical) return

in ~80-100ns from the pool device's media, ~10ns if the line is

already in the reader's L3.
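A minimal sketch of the mapping step, assuming a Linux host. An ordinary temporary file stands in for the pool's DAX device, whose path is host-specific and not given in this LP. `map_pool` and `demo` are hypothetical names. Python's `mmap` cannot request a fixed virtual address, so a reader following this sketch would address buffers by pool-relative offset rather than by raw pointer.

```python
import mmap
import os
import tempfile


def map_pool(path: str, length: int) -> mmap.mmap:
    """Map `length` bytes of the backing object read-write with MAP_SHARED."""
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(fd, length, flags=mmap.MAP_SHARED,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        os.close(fd)  # the mapping keeps its own duplicate of the descriptor


def demo() -> bytes:
    """Two mappings of one backing object: a write via one is visible via the other."""
    with tempfile.NamedTemporaryFile() as backing:  # stand-in for the DAX device
        backing.truncate(8192)
        backing.flush()
        writer = map_pool(backing.name, 8192)
        reader = map_pool(backing.name, 8192)
        writer[0:8] = b"ZAPPOOL\x00"
        seen = bytes(reader[0:8])
        writer.close()
        reader.close()
        return seen
```

On a real pool the two mappings would live in different hosts, and visibility would come from CXL coherency rather than from the page cache that backs this stand-in.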

The ZAP buffer pool root

The pool's first 4 KiB is a fixed metadata block:


offset  size   field                description
0       8      magic                "ZAPPOOL\0"
8       4      version              0x01000000
12      4      rack_id              unique rack identifier (16-bit value in a 4-byte field)
16      8      head_offset          offset of next free byte for new buffers
24      8      free_list_head       offset of first free buffer (LIFO)
32      8      epoch                monotonic; bumped per pool reset
40      32     root_buffer_hash     sha256 of the pool's root buffer (typically
                                    a directory of chain-state roots)
72      32     reserved             future
104     16     coordinator_lock     atomic CAS slot for write coordination
120     ...    reserved to 4096     future metadata
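The layout above can be expressed as a fixed `struct` format. This is a sketch under two assumptions the LP does not state: fields are little-endian with no padding, and offset 0 serves as the "empty" value for `free_list_head`. The helper names `parse_pool_root` and `new_pool_root` are hypothetical.

```python
import struct

# Assumption: little-endian, packed. The source gives offsets and sizes only.
POOL_ROOT = struct.Struct("<8sIIQQQ32s32s16s")  # 120 bytes of defined fields
POOL_ROOT_FIELDS = (
    "magic", "version", "rack_id", "head_offset", "free_list_head",
    "epoch", "root_buffer_hash", "reserved", "coordinator_lock",
)
METADATA_BLOCK_SIZE = 4096


def new_pool_root(rack_id: int, root_buffer_hash: bytes) -> bytes:
    """Build a fresh 4 KiB metadata block for an empty pool."""
    fields = POOL_ROOT.pack(
        b"ZAPPOOL\x00",
        0x01000000,
        rack_id,
        METADATA_BLOCK_SIZE,  # head_offset: first free byte follows this block
        0,                    # free_list_head: 0 used as "empty" (assumption)
        0,                    # epoch
        root_buffer_hash,
        bytes(32),            # reserved
        bytes(16),            # coordinator_lock, unlocked
    )
    return fields.ljust(METADATA_BLOCK_SIZE, b"\x00")


def parse_pool_root(block: bytes) -> dict:
    """Decode the defined fields of the metadata block and check the magic."""
    if len(block) < METADATA_BLOCK_SIZE:
        raise ValueError("metadata block must be 4 KiB")
    fields = dict(zip(POOL_ROOT_FIELDS, POOL_ROOT.unpack_from(block, 0)))
    if fields["magic"] != b"ZAPPOOL\x00":
        raise ValueError("not a ZAP pool: bad magic")
    return fields
```

`POOL_ROOT.size` evaluates to 120, matching the offset at which the table's trailing reserved region begins.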

After 4 KiB, the pool is a free-form allocator. Each ZAP buffer is

prefixed with a 64-byte header:


offset  size   field                description
0       4      buffer_len           length in bytes (incl. header)
4       4      tx_kind              LP-022 TxKind discriminator
8       32     buffer_hash          sha256 of the buffer body
40      8      next_free            offset of next free buffer (when on free list)
48      16     reserved             future
64      ...    buffer_body          ZAP envelope bytes

Buffer identity is its sha256 per LP-200 §"One wire, one identity,

one transcript". Hash collisions are not a concern at sha256 strength.

Two validators that compute the same buffer write it once; the second

writer's CAS on head_offset fails harmlessly and the second writer

discards its local copy.
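A sketch of the 64-byte buffer header and the identity check, again assuming little-endian packed fields (the LP does not state byte order). `encode_buffer` and `decode_buffer` are hypothetical helper names, and the `tx_kind` values used with them are placeholders rather than real LP-022 discriminators.

```python
import hashlib
import struct

# buffer_len, tx_kind, buffer_hash, next_free, reserved -> 64 bytes
BUF_HEADER = struct.Struct("<II32sQ16s")


def encode_buffer(tx_kind: int, body: bytes) -> bytes:
    """Prefix a ZAP envelope body with its header; buffer_len includes the header."""
    header = BUF_HEADER.pack(
        BUF_HEADER.size + len(body),
        tx_kind,
        hashlib.sha256(body).digest(),  # buffer_hash covers the body only
        0,                              # next_free is unused while the buffer is live
        bytes(16),
    )
    return header + body


def decode_buffer(raw: bytes):
    """Return (tx_kind, body), rejecting a body that does not match buffer_hash."""
    buffer_len, tx_kind, buffer_hash, _next_free, _reserved = BUF_HEADER.unpack_from(raw, 0)
    body = raw[BUF_HEADER.size:buffer_len]
    if hashlib.sha256(body).digest() != buffer_hash:
        raise ValueError("sha256(body) != buffer_hash")
    return tx_kind, body
```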

Allocation discipline

The allocator is append-only with a LIFO free list. New buffers bump

head_offset via CAS. Reclaimed buffers (e.g. mempool eviction, MVCC

snapshot retirement) push onto the free list via CAS on

free_list_head. The allocator has no fragmentation concern because

buffers are uniform-typed (ZAP envelopes are bounded by LP-022

schemas) and the operating set is bounded by the pool size — when the

pool fills, the rack participates in a coordinated reset via

LP-211-style 2PC across the rack's validators (§"Failure modes" row 4

below).
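The bump-and-push discipline can be sketched as follows. Python has no hardware compare-and-swap, so `AtomicU64` simulates one with a lock; on real pool memory this would be an atomic instruction on the metadata word. All class and method names are hypothetical, `next_free` is modelled as a dictionary instead of the header field, and 0 is assumed to mean "empty free list".

```python
import threading


class AtomicU64:
    """Stand-in for a 64-bit word in pool memory that supports CAS."""

    def __init__(self, value: int = 0):
        self._value = value
        self._lock = threading.Lock()

    def load(self) -> int:
        with self._lock:
            return self._value

    def compare_and_swap(self, expected: int, new: int) -> bool:
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


class PoolAllocator:
    def __init__(self, pool_size: int, metadata_size: int = 4096):
        self.pool_size = pool_size
        self.head_offset = AtomicU64(metadata_size)
        self.free_list_head = AtomicU64(0)
        self.next_free = {}  # offset -> next_free value of a reclaimed buffer

    def allocate(self, length: int) -> int:
        """Reserve `length` bytes by bumping head_offset; retry when the CAS loses."""
        while True:
            old = self.head_offset.load()
            if old + length > self.pool_size:
                raise MemoryError("pool full: coordinated reset required")
            if self.head_offset.compare_and_swap(old, old + length):
                return old

    def free(self, offset: int) -> None:
        """Push a reclaimed buffer onto the LIFO free list."""
        while True:
            old_head = self.free_list_head.load()
            self.next_free[offset] = old_head
            if self.free_list_head.compare_and_swap(old_head, offset):
                return
```

Popping from the free list is left out on purpose: the LP does not describe reuse, and a lock-free pop needs protection against the ABA problem that this sketch does not attempt.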

Concurrent writers

Multiple validators may write distinct buffers concurrently; the CAS

on head_offset serializes the allocation point but not the body

write (the writer reserves a range via CAS, then writes the body at

its leisure, then issues a final atomic store to publish the buffer's

buffer_len field). Readers that observe a buffer with

buffer_len == 0 MUST treat it as "in flight" and re-read after a

yield. This is the standard single-writer-multiple-reader (SWMR)

lock-free publishing pattern, ported to CXL.cache semantics.
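The reserve, write, publish sequence can be sketched over a plain `bytearray` standing in for pool memory. The function names are hypothetical and the header layout assumes little-endian fields. The final store is an ordinary write here; a real implementation would need an atomic store with release ordering so that the body is visible before `buffer_len`.

```python
import hashlib
import struct

HEADER = struct.Struct("<II32sQ16s")  # buffer_len, tx_kind, hash, next_free, reserved


def write_unpublished(pool: bytearray, offset: int, tx_kind: int, body: bytes) -> int:
    """After the head_offset CAS: fill header and body, leaving buffer_len at 0."""
    total = HEADER.size + len(body)
    HEADER.pack_into(pool, offset, 0, tx_kind,
                     hashlib.sha256(body).digest(), 0, bytes(16))
    pool[offset + HEADER.size:offset + total] = body
    return total


def publish(pool: bytearray, offset: int, total: int) -> None:
    """Final step: store buffer_len, which makes the buffer visible to readers."""
    struct.pack_into("<I", pool, offset, total)


def try_read(pool: bytearray, offset: int):
    """Return the body, or None while the buffer is still in flight."""
    (buffer_len,) = struct.unpack_from("<I", pool, offset)
    if buffer_len == 0:
        return None  # in flight: the caller yields and re-reads
    return bytes(pool[offset + HEADER.size:offset + buffer_len])
```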

Composition with LP-200 ZAP stack

LP-200 §"What's NOT yet decomplected" lists the

buffer-pool-per-validator UMA region as the current state. CXL Coherent State Pool

upgrades that region from per-validator UMA to rack-pooled CXL:

| Concern | LP-200 today | LP-206 with CXL |
|---|---|---|
| Where the ZAP buffer lives | per-validator UMA region | rack-pooled CXL Type-3 device |
| Who can read it | the owning validator + its GPU via NVLink-C2C | every validator in the rack + every validator's GPU |
| Replication latency | LP-201 QUIC (~50-100 μs per hop on 100 Gbps fabric) | CXL coherent read (~80-100ns per cache line) |
| Buffer identity | sha256(buf) | unchanged — sha256(buf) |
| Buffer immutability | by convention (LP-200 §"The buffer IS the value") | enforced by allocator (append-only; reclamation is whole-buffer free) |

The LP-200 stack guarantee is preserved: the buffer is still the

value, identity is still sha256(buf), the transcript is still

TupleHash256 over typed offsets. The only change is the storage

medium and the read latency. A validator that produces a buffer

allocates it in the pool; a peer that needs it reads it from the

pool. No marshalling, no copy, no codec.

Composition with LP-203 GPU-native verify

LP-203 §"GPUDirect path" describes the verifier running on

Blackwell sm_120 reading ZAP buffers from a UMA-style region without

crossing the PCIe boundary. CXL Coherent State Pool extends that

guarantee across validators in the rack.

| LP-203 GPU integration | Path |
|---|---|
| Standalone Blackwell on one validator | GPU NVLink-C2C → host UMA buffer (LP-200 today) |
| GB10 / GB200 in a CXL rack | GPU NVLink-C2C → host CXL bridge → pool memory (LP-206 with CXL) |
| AMD MI300A in a CXL rack | GPU CXL.mem direct → pool memory (no host hop) |

The MI300A row is the cleanest topology — the GPU has its own CXL.mem

port and reads the pool directly. NVIDIA's GB10 / GB200 path takes

one extra hop through the Grace CPU's CXL root complex, adding ~30ns

to each cache-line read. Both are dramatically faster than the LP-203

fallback path (cross-validator gossip + PCIe DMA, ~100 μs).

Composition with LP-210 Block-STM MVCC

LP-210 §"Phase 3 — multi-version concurrency control" maintains a

versioned overlay per executed tx in the wave. Under the standard

deployment, each validator runs Block-STM independently and the

MVCC snapshot lives in that validator's local memory. With CXL

Coherent State Pool:

| Block-STM stage | Without CXL | With CXL |
|---|---|---|
| Wave begins | Each validator initializes its own MVCC snapshot stack | One validator initializes the snapshot stack in the pool; peers point at the same root |
| Per-tx execute (Phase 2) | Each validator re-executes Phase 2 independently | Validators can observe peer-computed overlay versions without re-execution (validation-only path) |
| Conflict re-execution (Phase 3 detect) | Each validator detects its own conflicts | Conflict detection is rack-global; one rack validator's conflict invalidation is observable by every peer |
| Phase 4 commit | Each validator commits to its local ZapDB | The committed overlay is pool-resident; ZapDB consumes from the pool with no copy |

The architectural decision is non-trivial: under CXL Coherent

State Pool, do co-racked validators share Block-STM execution, or

do they each replicate it for safety? LP-206 chooses **each

replicates locally, observes peer overlays for early

conflict-detection**. Sharing execution would be a single point of

failure (one buggy validator poisons the rack's overlay); replicating

preserves the security model. The peer-observable overlays accelerate

conflict detection — a validator that sees a peer's read-set

intersect its own write-set retires its retry immediately rather than

waiting for its own re-execution.
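The early conflict check described above reduces to a set intersection. In this sketch `early_conflicts` is a hypothetical helper, and state keys are plain strings purely for illustration; the real read-set and write-set representations belong to LP-210.

```python
def early_conflicts(own_write_set, peer_read_sets):
    """Return the tx indices whose peer-observed read set touches our write set.

    own_write_set: iterable of state keys this validator has written.
    peer_read_sets: mapping of tx index -> iterable of state keys read by a peer.
    """
    own = set(own_write_set)
    return sorted(tx for tx, reads in peer_read_sets.items() if own & set(reads))


# Example: tx 2 read "acct:bob", which this validator wrote, so it is flagged.
flagged = early_conflicts(
    {"acct:bob"},
    {1: {"acct:alice"}, 2: {"acct:bob", "acct:carol"}},
)
```

The validator still executes and validates locally; the peer overlay only tells it sooner which transactions are worth invalidating.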

Composition with LP-211 cross-shard atomic

LP-211 §"Step 2 — per-shard prepare" stages an overlay locally on

each shard validator until the parent-L1 cert commits. Under CXL

Coherent State Pool, the staged overlay is pool-resident; every

shard validator in the rack observes the staging without round-trip:

| LP-211 step | Without CXL | With CXL |
|---|---|---|
| Step 2 stage overlay | Each shard validator stages locally | Pool stages once; every shard validator in the rack sees it |
| Step 3 prepare-ack collection | Coordinator collects acks over QUIC | Coordinator observes the staged overlay directly in pool; no ack round-trip needed for co-racked shards |
| Step 4 cross-shard commit | Each shard applies its overlay independently | Pool overlay graduates to ZapDB on every co-racked validator simultaneously |
| Step 4' cross-shard abort | Each shard LP-202 unwinds locally | Pool overlay reclaimed once; the LIFO free-list reclaim is global |

The cross-shard 2PC ack round-trip (LP-211 §Latency row "Prepare-acks

arrive at coordinator: ~130 ms") shrinks for co-racked shards from

~10 ms (gossip RTT) to ~10 μs (one CXL cache-line read by the

coordinator to verify the overlay is staged). For an all-in-one-rack

4-shard primary network running cross-shard at PQ-strict, end-to-end

cross-shard latency drops from ~260 ms to ~125 ms — bounded entirely

by the LP-209 wave commit cadence, not by 2PC coordination.

Composition with LP-201 global DHT

The pool is rack-local. Global replication remains LP-201's

responsibility:

| Scope | Mechanism |
|---|---|
| Intra-rack (same CXL fabric) | CXL Coherent State Pool — ~80ns read |
| Inter-rack (same DC) | LP-201 QUIC + LP-207 RDMA (sibling LP) |
| Inter-DC (WAN) | LP-201 QUIC + DHT |

Pool snapshots replicate to the DHT periodically. Each rack pool

emits a snapshot every N waves: serialize the pool's current set of

live buffers (skip free-list), chunk by LP-201 §"Content types and

TTL" rules, advertise to DHT. A validator joining a fresh rack pulls

the latest snapshot from DHT, populates its local pool, and synchronizes

forward via standard LP-201 catch-up.
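A sketch of the snapshot walk: collect the offsets on the free list, scan buffers from the end of the metadata block up to `head_offset`, and keep the ones that are not free. It assumes little-endian headers, contiguous buffers, and 0 as the free-list terminator, none of which the LP states explicitly. The chunk size passed to `chunk` is a placeholder; the real chunking rules are LP-201's.

```python
import struct

HEADER = struct.Struct("<II32sQ16s")  # buffer_len, tx_kind, hash, next_free, reserved
METADATA_SIZE = 4096


def free_offsets(pool, free_list_head: int) -> set:
    """Follow next_free links from free_list_head; 0 terminates the list."""
    seen = set()
    off = free_list_head
    while off != 0 and off not in seen:  # the `seen` check guards against cycles
        seen.add(off)
        off = HEADER.unpack_from(pool, off)[3]
    return seen


def live_buffers(pool, head_offset: int, free_list_head: int):
    """Yield every published buffer (header + body) that is not on the free list."""
    free = free_offsets(pool, free_list_head)
    off = METADATA_SIZE
    while off < head_offset:
        buffer_len = HEADER.unpack_from(pool, off)[0]
        if buffer_len == 0:
            break  # unpublished buffer: its length is unknown, so stop the walk
        if off not in free:
            yield bytes(pool[off:off + buffer_len])
        off += buffer_len


def chunk(blob: bytes, size: int) -> list:
    """Split a serialized snapshot into fixed-size pieces for advertisement."""
    return [blob[i:i + size] for i in range(0, len(blob), size)]
```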

A rack that crashes does not lose data because the snapshots are

already in DHT. The pool is a fast path, not a system of record.

The system of record is ZapDB (LP-200 §"Layer 1") plus the DHT.

Failure modes

| Failure | Trigger | Effect | Recovery |
|---|---|---|---|
| CXL link error on one validator | hardware fault | That validator's pool reads fail; falls back to LP-201 QUIC | Operator replaces the CXL cable / module; validator catches up via LP-201 DHT |
| Pool device media failure | flash / RAM error on pooled CXL device | The pool's buffers may return corrupt data (caught by sha256(buf) != buffer_hash mismatch) | Rack-coordinated pool reset (epoch bump in metadata); peers re-populate from DHT |
| Coherency loss (stale read) | CXL.cache invalidation lost (firmware bug) | A reader observes a partially-published buffer with mismatching buffer_hash | Reader re-reads on hash mismatch; treat as transient; if persistent, fall back to QUIC |
| Pool fills | sustained traffic without snapshot cadence | Allocator fails on next head_offset CAS | Coordinated rack pool reset (LP-211-style 2PC across the rack's validators on a PoolReset tx); peers re-populate from DHT |
| Pool device firmware update | scheduled maintenance | Pool offline during update | Validators on that rack temporarily fall back to LP-201 QUIC; rejoin pool when device returns |
| Adversary writes to the pool | only possible if attacker has rack-physical access (CXL is physical-layer trusted) | n/a in threat model | n/a — physical security is operator-domain; the network-layer threat model (LP-201 §Sybil resistance) is unchanged |
| Cross-rack pool desynchronization | network partition between rack pools and the DHT | Each rack continues to produce buffers in its own pool; DHT replication catches up post-partition | Standard LP-201 partition recovery — rack pools are eventually consistent via DHT |

Performance characteristics

| Metric | Hardware | Latency | Source |
|---|---|---|---|
| Cache-line read from pool (warm L3) | Sapphire Rapids host, CXL 3.0 pool device | ~10 ns | Estimated (CPU L3 latency floor) |
| Cache-line read from pool (cold) | Sapphire Rapids host, CXL 3.0 pool device | ~80-100 ns | Estimated (CXL 3.0 spec §3.1.4, Intel product brief) |
| Cache-line read from pool (cold) | Genoa host, CXL 3.0 pool device | ~100-120 ns | Estimated (AMD CXL 1.1 + product roadmap) |
| GPU-side read via NVLink-C2C on GB10 | DGX Spark single-node bench | ~80 ns | Measured (Grace ↔ Blackwell, single host) |
| GPU-side read via NVLink-C2C on GB200 + pooled CXL | not yet measured | ~110 ns | Estimated (GB200 NVLink-C2C base + one CXL bridge hop) |
| GPU-side read via MI300A CXL.mem direct | not yet measured | ~80 ns | Estimated (AMD MI300A product brief — direct CXL.mem path) |
| Bandwidth per CXL 3.0 x16 link | spec | 64 GB/s | CXL 3.0 Specification, August 2022 |
| Aggregate pool bandwidth (4-host rack) | spec | 256 GB/s | 4 × 64 GB/s host links |
| Full 1 MiB buffer read | warm pool, CXL 3.0 x16 | ~16 μs | Estimated (1 MiB / 64 GB/s + 80ns first-byte) |
| LP-201 QUIC equivalent (same 1 MiB) | Mellanox ConnectX-7 NDR 400 Gbps | ~25 μs | Estimated (50 GB/s peak, kernel + TLS overhead) |
| LP-201 QUIC equivalent (cross-rack, same DC) | 100 Gbps Ethernet | ~80 μs | Estimated (10 GB/s effective + RTT) |

Honest gap. CXL 3.0 hardware is not on the bench cluster. DGX Spark

provides the NVLink-C2C 80ns Grace↔Blackwell measurement (single

host), which is the closest available analog. The multi-host pooled-CXL

numbers are Estimated from the CXL 3.0 specification and from Intel /

AMD product briefs. The estimates graduate to Measured when a CXL

3.0 rack reaches the lab (target: Q3 2026 Sapphire Rapids refresh

cycle with CXL 3.0 switch availability).

The win over LP-201 QUIC at the intra-rack scale is roughly 1000×

on cache-line reads (~80 ns versus ~50-100 μs) and far smaller on full-buffer reads

(~16 μs versus ~25 μs for 1 MiB in the table above). The 1000× factor

dominates Block-STM MVCC conflict detection and cross-shard 2PC

prepare-ack observation — the use cases LP-206 was built for.
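The estimates in the table follow from simple arithmetic, reproduced here from the table's own inputs (64 GB/s link, 80 ns first byte, 50-100 μs QUIC floor). `transfer_us` is a hypothetical helper.

```python
MIB = 1 << 20


def transfer_us(size_bytes: int, bandwidth_bytes_per_s: float,
                first_byte_ns: float = 0.0) -> float:
    """Serialization time plus first-byte latency, in microseconds."""
    return size_bytes / bandwidth_bytes_per_s * 1e6 + first_byte_ns / 1e3


# 1 MiB over a 64 GB/s link with an 80 ns first byte: about 16.46 us.
cxl_1mib_us = transfer_us(MIB, 64e9, first_byte_ns=80)

# Cache-line speed-up: the 50-100 us QUIC floor against an 80 ns pool read.
ratio_low = 50_000 / 80    # 625x
ratio_high = 100_000 / 80  # 1250x
```

The 625× to 1250× range is where the "roughly 1000×" figure for cache-line reads comes from.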

Security

CXL operates at the physical-layer trust boundary. A validator joining

a rack pool implicitly trusts the rack's physical security — the

operator confirms that no untrusted host is attached to the CXL

fabric. This is the standard data-center physical security model and

is unchanged by LP-206.

Memory-poisoning attack surface

A CXL device that returns corrupt data — whether by firmware bug or

adversary control — could poison every reader in the rack. LP-206

mitigates via per-buffer sha256 hash verification at the buffer

header. A reader that observes sha256(body) != buffer_hash rejects

the buffer and falls back to LP-201 QUIC. The hash check is the same

check LP-200 §"One wire, one identity, one transcript" prescribes;

LP-206 inherits it.
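A sketch of the reader-side policy, combining this check with the "re-read on hash mismatch, then fall back" recovery in the failure-modes table. `read_verified` is a hypothetical helper; `read_pool` and `fetch_quic` are caller-supplied callables that stand in for the pool read and the LP-201 fetch, and the retry count is an arbitrary placeholder.

```python
import hashlib


def read_verified(read_pool, fetch_quic, expected_hash: bytes, retries: int = 2):
    """Return (body, source). Try the pool first; fall back to QUIC on persistent mismatch."""
    for _ in range(retries + 1):
        body = read_pool()
        if body is not None and hashlib.sha256(body).digest() == expected_hash:
            return body, "cxl-pool"
        # Mismatch or in-flight buffer: treat as transient and re-read.
    body = fetch_quic()
    if hashlib.sha256(body).digest() != expected_hash:
        raise ValueError("buffer failed hash verification on both paths")
    return body, "quic"
```

The hash is checked on the fallback path as well, since buffer identity is sha256 regardless of which transport delivered the bytes.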

ECC + scrub interval

The pooled CXL device MUST run hardware ECC (CXL 3.0 spec §6.3.2

requires it for Type-3 devices). Operators SHOULD enable periodic

memory scrubbing at the device level — recommended scrub interval is

24 hours, balancing scrub bandwidth (one full pass per day is ~4% of

device capacity per hour, about 3 MiB/s on a 256 GiB device) against soft-error half-life. Uncorrectable

ECC errors raise a CXL.io interrupt; the device's Linux driver

surfaces the affected page; the pool allocator marks the affected

buffer range as quarantined and the rack falls back to LP-201 QUIC

for that range until snapshot replacement.

Coherency-loss replay safety

LP-077 round digest binding (chain_id, epoch, height) protects

against replay across networks. A pool that returns a stale buffer

from a previous epoch (e.g. after a firmware bug skipped a coherency

invalidation) cannot trigger a stale-state attack because the

consumer is reading the buffer for a specific

`(chain_id, epoch, height)` and a stale buffer fails the transcript check at LP-200

§"Layer 4 — Transcript".

Composition with other LPs

| LP | Role |
|---|---|
| LP-022 | The wire protocol; buffers in the pool are LP-022 envelopes |
| LP-200 | The umbrella; pool storage is an upgrade to LP-200 §"Layer 1" UMA |
| LP-201 | The contract this LP implements — defines Transport.Pick(), Transport.Register(), the TransportCapability enum (including CapCXLPool), and the priority list. This LP registers cxl-pool against that contract |
| LP-202 | Cert tier degradation continues to apply; the pool does not change cert observability |
| LP-203 | GPU verifier reads pool memory via NVLink-C2C / CXL.mem |
| LP-207 | Sibling LP-201 implementation — registers rdma-ib / rdma-rocev2 for cross-rack peers; this LP registers cxl-pool for same-rack peers |
| LP-210 | Block-STM MVCC snapshots become pool-resident; peer-observable conflict detection |
| LP-211 | Cross-shard staged overlays become pool-resident; coordinator observes prepare-ack via pool read |
| LP-218 | Z-Chain rollup sequencer's batch buffers may also live in the pool (rack-co-located rollup sequencer) |

LP-201 owns the Transport.Pick() selector contract. This LP and

LP-207 register implementations against it. The three-tier transport

stack for Tier-1 single-DC mode of LP-204 is one contract and three

registered implementations:

| Distance | Transport | Registered name | Latency | Source LP |
|---|---|---|---|---|
| Intra-rack | CXL Coherent State Pool | cxl-pool | ~80 ns | this LP |
| Cross-rack within DC | RDMA over InfiniBand or RoCEv2 | rdma-ib / rdma-rocev2 | ~3 μs | LP-207 |
| Global / WAN | QUIC + DHT | quic | ~50-100 μs LAN, ~50-100 ms WAN | LP-201 |

Each tier composes — a validator in the rack uses the pool; the same

validator uses RDMA for cross-rack peers; the same validator uses

QUIC for WAN peers. The chosen transport is a function of the peer's

location, not of the protocol layer above. Selector semantics

(priority, registration, capability probe) are defined in LP-201

§"Transport.Pick() contract." Consensus, mempool, and state-sync at

layers above are transport-agnostic.
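An illustrative sketch of location-based selection. The real `Transport.Pick()` and `Transport.Register()` contract is defined in LP-201 and may differ; every type, field, and priority value below is hypothetical. Only the registered names and the distance tiers come from the table above.

```python
from dataclasses import dataclass
from enum import Enum


class Locality(Enum):
    SAME_RACK = "intra-rack"
    SAME_DC = "cross-rack within DC"
    WAN = "global / WAN"


@dataclass(frozen=True)
class Transport:
    name: str
    priority: int       # lower value is preferred (assumption)
    reaches: frozenset  # localities this transport can serve


class TransportRegistry:
    def __init__(self):
        self._transports = []

    def register(self, transport: Transport) -> None:
        self._transports.append(transport)
        self._transports.sort(key=lambda t: t.priority)

    def pick(self, peer: Locality, local_capabilities: set) -> Transport:
        """Choose the best registered transport the local host supports for this peer."""
        for transport in self._transports:
            if peer in transport.reaches and transport.name in local_capabilities:
                return transport
        raise LookupError(f"no transport reaches {peer.value}")


def default_registry() -> TransportRegistry:
    registry = TransportRegistry()
    registry.register(Transport("cxl-pool", 0, frozenset({Locality.SAME_RACK})))
    registry.register(Transport("rdma-ib", 1, frozenset({Locality.SAME_DC})))
    registry.register(Transport("quic", 2, frozenset(Locality)))
    return registry
```

With all three capabilities present, a same-rack peer resolves to `cxl-pool`; a host whose capability set is only `{"quic"}` resolves the same peer to `quic`, which mirrors the fallback behaviour described for validators without CXL hardware.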

Activation marker


activates: 2025-12-25T16:20:00-08:00
activates-unix: 1766708400

CXL Coherent State Pool is callable from the genesis block of the new

final Lux network. Validators without CXL hardware silently fall back

to LP-201 QUIC; nothing on the wire changes.
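The two activation values can be cross-checked with the standard library. The sketch assumes "Pacific" means UTC-08:00 on that date, as the marker's offset states; `is_active` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

PACIFIC_STANDARD = timezone(timedelta(hours=-8))
ACTIVATES = datetime(2025, 12, 25, 16, 20, tzinfo=PACIFIC_STANDARD)
ACTIVATES_UNIX = 1766708400


def is_active(now_unix: int) -> bool:
    """True once the given unix time has reached the activation marker."""
    return now_unix >= ACTIVATES_UNIX


# The ISO timestamp and the unix value in the marker agree.
assert int(ACTIVATES.timestamp()) == ACTIVATES_UNIX
```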

Cross-references