Draft. No backwards compatibility. No flag day.
Activated at the genesis of the new final Lux network: **2025-12-25 16:20
Pacific (unix 1766708400)**. The pre-Quasar Edition Lux network
(2020–2025) had no CXL fast path and is a separate network out of scope.
Validators co-located in the same physical rack pay a measurable
penalty when they replicate state over Layer-A network transport
(LP-201 QUIC + ZAP), even on a perfect 100 Gbps fabric: kernel transit,
TLS framing, NIC PCIe traversal, and userspace copy compose to a floor
of ~50-100 μs end-to-end for a single state read. For validators in
the same rack — a common topology for Tier-1 single-DC primary
networks (LP-204) — this floor is unnecessary. CXL 3.0 (Compute Express
Link, July 2022) exposes a coherent shared memory region across
multiple host CPUs and GPUs over a 64 GB/s per-link interconnect with
sub-100ns latency for cache-line reads. The ZAP buffer pool that
LP-200 §"The buffer is the value" describes is the natural object to
share — buffers are immutable values, identity is sha256(buf), and
the LP-203 GPU verifier already reads them from a UMA-style region.
LP-206 specifies the CXL Coherent State Pool: a rack-scale Type-3
memory device hosting a single ZAP buffer pool that every validator
in the rack mounts. Validator-A writes a buffer once; validator-B,
-C, -D read it at L3-cache miss latency (~80ns on Sapphire Rapids).
The pool composes with LP-210 (Block-STM MVCC snapshots become shared
reads), LP-211 (cross-shard staged overlays visible to every shard
validator in the rack), and LP-203 (GPU verifier reads pool memory
directly via NVLink-C2C on GB10/GB200). Global DHT replication via
LP-201 continues — the pool is a fast path for co-racked validators,
not a substitute for global content addressing.
The deployment target is the Tier-1 single-DC mode of LP-204: a
sovereign L1 whose validator set runs in one or two data centers,
with sub-rack fast-path consensus and global QUIC for any cross-DC
peer. CXL hardware is not yet on the bench cluster — DGX Spark has
NVLink-C2C between Grace CPU and Blackwell GPU on one node but does
not pool across multiple Sparks, so the latency numbers in
§"Performance characteristics" are split between Measured (the
NVLink-C2C 80 ns Grace↔Blackwell figure) and Estimated (the
multi-host pooled-CXL 100ns figure cited from CXL 3.0 spec and Intel
Sapphire Rapids product briefs). The spec stands; the bench numbers
graduate when CXL 3.0 racks reach the lab.
A validator participating in the CXL Coherent State Pool needs:
/sys/bus/cxl/devices/decoder* enumeration; libcxlmi or DAX access

A validator without CXL falls back to the standard LP-201 QUIC path:
nothing breaks, only the in-rack fast path is unavailable. CXL is
opportunistic; it does not change the protocol.
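
The opportunistic fallback can be sketched as a capability probe. This is a minimal illustration, not the LP-201 selector: the function name `local_fast_path` is hypothetical, and enumeration under `/sys/bus/cxl/devices/decoder*` is treated here as the only requirement, whereas a real validator would also need working DAX or libcxlmi access.

```python
import glob


def local_fast_path(sysfs_pattern="/sys/bus/cxl/devices/decoder*"):
    """Return "cxl-pool" if any CXL decoder enumerates, else "quic".

    Hypothetical helper: the probe only checks that the kernel exposes
    at least one decoder node; it does not open or map the device.
    """
    # glob returns an empty list when nothing matches, including when
    # the sysfs directory does not exist on this host.
    return "cxl-pool" if glob.glob(sysfs_pattern) else "quic"
```

On a host without CXL the probe finds nothing and the validator stays on the QUIC path, so no protocol-level behavior changes.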
             Rack-scale CXL 3.0 fabric
┌──────────────────────────────────────────────────┐
│                                                  │
│        ┌──────────────────────────────────┐      │
│        │   Pooled CXL Type-3 device       │      │
│        │   256 GiB shared coherent memory │      │
│        │   ZAP buffer pool root           │      │
│        └──────────────────────────────────┘      │
│         ▲         ▲         ▲         ▲          │
│         │         │         │         │          │
│     ┌───┴───┐ ┌───┴───┐ ┌───┴───┐ ┌───┴───┐      │
│     │ Val-A │ │ Val-B │ │ Val-C │ │ Val-D │      │
│     │ CPU+  │ │ CPU+  │ │ CPU+  │ │ CPU+  │      │
│     │ GPU   │ │ GPU   │ │ GPU   │ │ GPU   │      │
│     └───────┘ └───────┘ └───────┘ └───────┘      │
│                                                  │
└──────────────────────────────────────────────────┘
                          │
                          │ LP-201 QUIC + DHT (global)
                          ▼
               WAN peers / other racks
The pool exposes a single coherent address range. Every validator
mounts it at the same virtual address (mmap(MAP_SHARED) via the
Linux CXL DAX driver). Writes are coherent via CXL.cache (CXL 1.1+);
no software synchronization is needed beyond standard memory ordering
fences. Reads at cache-line granularity (64 bytes typical) return
in ~80-100 ns from the pool device's media, or ~10 ns if the line is
already in the reader's L3.
The pool's first 4 KiB is a fixed metadata block:
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 8 | magic | "ZAPPOOL\0" |
| 8 | 4 | version | 0x01000000 |
| 12 | 4 | rack_id | 16-bit unique rack identifier |
| 16 | 8 | head_offset | offset of next free byte for new buffers |
| 24 | 8 | free_list_head | offset of first free buffer (LIFO) |
| 32 | 8 | epoch | monotonic; bumped per pool reset |
| 40 | 32 | root_buffer_hash | sha256 of the pool's root buffer (typically a directory of chain-state roots) |
| 72 | 32 | reserved | future |
| 104 | 16 | coordinator_lock | atomic CAS slot for write coordination |
| 120 | ... | reserved to 4096 | future metadata |
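
The metadata layout above can be sketched with Python's `struct` module. This is an illustration under stated assumptions: the text does not specify a byte order, so little-endian is assumed, and the helper names `pack_metadata` and `parse_metadata` are hypothetical.

```python
import struct

# Field order follows the metadata table. Little-endian ("<") is an
# assumption; the specification text does not state a byte order.
META_FMT = "<8sIIQQQ32s32s16s"
META_FIXED = struct.calcsize(META_FMT)  # 120 bytes of defined fields
META_SIZE = 4096                        # whole metadata block
MAGIC = b"ZAPPOOL\0"
VERSION = 0x01000000


def pack_metadata(rack_id, head_offset, free_list_head, epoch, root_hash):
    """Build a 4 KiB metadata block with zeroed reserved fields and lock."""
    fixed = struct.pack(META_FMT, MAGIC, VERSION, rack_id, head_offset,
                        free_list_head, epoch, root_hash,
                        bytes(32), bytes(16))
    return fixed.ljust(META_SIZE, b"\0")  # zero-fill the tail up to 4096


def parse_metadata(block):
    """Decode the defined fields and reject blocks with the wrong magic."""
    if len(block) < META_SIZE:
        raise ValueError("metadata block is shorter than 4 KiB")
    (magic, version, rack_id, head_offset, free_list_head, epoch,
     root_hash, _reserved, coordinator_lock) = struct.unpack_from(
        META_FMT, block, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    return {
        "version": version,
        "rack_id": rack_id,
        "head_offset": head_offset,
        "free_list_head": free_list_head,
        "epoch": epoch,
        "root_buffer_hash": root_hash,
        "coordinator_lock": coordinator_lock,
    }
```

A freshly initialized pool would carry `head_offset == 4096`, since buffers start immediately after the metadata block.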
After 4 KiB, the pool is a free-form allocator. Each ZAP buffer is
prefixed with a 64-byte header:
| Offset | Size | Field | Description |
|---|---|---|---|
| 0 | 4 | buffer_len | length in bytes (incl. header) |
| 4 | 4 | tx_kind | LP-022 TxKind discriminator |
| 8 | 32 | buffer_hash | sha256 of the buffer body |
| 40 | 8 | next_free | offset of next free buffer (when on free list) |
| 48 | 16 | reserved | future |
| 64 | ... | buffer_body | ZAP envelope bytes |
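
As a sketch of the 64-byte header above, the following encodes and decodes one pool buffer. Little-endian byte order is an assumption, and `encode_buffer` / `decode_buffer` are hypothetical names used only for illustration.

```python
import hashlib
import struct

# buffer_len, tx_kind, buffer_hash, next_free, reserved.
# Little-endian is an assumption; the text does not state a byte order.
HDR_FMT = "<II32sQ16s"
HDR_SIZE = struct.calcsize(HDR_FMT)  # 64


def encode_buffer(tx_kind, body):
    """Prefix a ZAP envelope body with the 64-byte pool header."""
    header = struct.pack(
        HDR_FMT,
        HDR_SIZE + len(body),            # buffer_len includes the header
        tx_kind,
        hashlib.sha256(body).digest(),   # buffer_hash covers the body only
        0,                               # next_free: unused while live
        bytes(16),                       # reserved
    )
    return header + body


def decode_buffer(raw):
    """Return (tx_kind, buffer_hash, body) from a header-prefixed buffer."""
    buffer_len, tx_kind, buffer_hash, _next_free, _reserved = \
        struct.unpack_from(HDR_FMT, raw, 0)
    body = bytes(raw[HDR_SIZE:buffer_len])
    return tx_kind, buffer_hash, body
```

Because `buffer_len` counts the header, an empty body still yields a length of 64, which keeps zero free to mean "not yet published".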
Buffer identity is its sha256 per LP-200 §"One wire, one identity,
one transcript". Hash collisions are not a concern at sha256 strength.
Two validators that compute the same buffer write it once; the second
writer's CAS on head_offset fails harmlessly and the second writer
discards its local copy.
The allocator is append-only with a LIFO free list. New buffers bump
head_offset via CAS. Reclaimed buffers (e.g. mempool eviction, MVCC
snapshot retirement) push onto the free list via CAS on
free_list_head. The allocator has no fragmentation concern because
buffers are uniform-typed (ZAP envelopes are bounded by LP-022
schemas) and the operating set is bounded by the pool size — when the
pool fills, the rack participates in a coordinated reset via
LP-211-style 2PC across the rack's validators (§"Failure modes" row 4
below).
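
The allocator described above can be simulated in a few lines. This is a sketch only: Python has no hardware compare-and-swap, so a `threading.Lock` stands in for CAS atomicity; the class name `PoolSim`, the use of offset 0 as the empty-list sentinel, and the `pop_free` reuse path are assumptions not stated in the text.

```python
import struct
import threading

META_SIZE = 4096       # buffers start after the metadata block
HEAD_OFFSET = 16       # metadata field: next free byte
FREE_LIST_HEAD = 24    # metadata field: first free buffer
NEXT_FREE = 40         # buffer-header field: next free buffer


class PoolSim:
    """In-process stand-in for the pooled region (hypothetical)."""

    def __init__(self, size):
        self.mem = bytearray(size)
        self._cas_lock = threading.Lock()  # emulates CAS atomicity only
        self._store(HEAD_OFFSET, META_SIZE)

    def _load(self, off):
        return struct.unpack_from("<Q", self.mem, off)[0]

    def _store(self, off, value):
        struct.pack_into("<Q", self.mem, off, value)

    def cas(self, off, expected, new):
        """Compare-and-swap on a 64-bit slot; True if the swap happened."""
        with self._cas_lock:
            if self._load(off) != expected:
                return False
            self._store(off, new)
            return True

    def alloc(self, length):
        """Bump head_offset by length; None means the pool is full."""
        while True:
            head = self._load(HEAD_OFFSET)
            if head + length > len(self.mem):
                return None  # the text calls for a coordinated reset here
            if self.cas(HEAD_OFFSET, head, head + length):
                return head

    def free(self, off):
        """Push the buffer at off onto the LIFO free list."""
        while True:
            top = self._load(FREE_LIST_HEAD)
            self._store(off + NEXT_FREE, top)
            if self.cas(FREE_LIST_HEAD, top, off):
                return

    def pop_free(self):
        """Pop the most recently freed buffer, or None if the list is empty."""
        while True:
            top = self._load(FREE_LIST_HEAD)
            if top == 0:  # assumed sentinel: offset 0 is never a buffer
                return None
            nxt = self._load(top + NEXT_FREE)
            if self.cas(FREE_LIST_HEAD, top, nxt):
                return top
```

A production lock-free pop would also need to address the ABA problem (for example with a tagged pointer); the sketch sidesteps it because the lock serializes every swap.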
Multiple validators may write distinct buffers concurrently; the CAS
on head_offset serializes the allocation point but not the body
write (the writer reserves a range via CAS, then writes the body at
its leisure, then issues a final atomic store to publish the buffer's
buffer_len field). Readers that observe a buffer with
buffer_len == 0 MUST treat it as "in flight" and re-read after a
yield. This is the standard single-writer-multiple-reader (SWMR)
lock-free publishing pattern, ported to CXL.cache semantics.
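
The reserve, write, publish sequence can be sketched against a plain `bytearray`. The function names are hypothetical, the final store is an ordinary write here rather than the atomic release store real hardware would need, and the sketch assumes a reserved range starts out zeroed so that `buffer_len == 0` reliably means "in flight".

```python
import hashlib
import struct

HDR_SIZE = 64


def write_unpublished(mem, off, tx_kind, body):
    """Fill tx_kind, buffer_hash and the body; buffer_len stays 0."""
    # tx_kind lives at header offset 4, buffer_hash at header offset 8.
    struct.pack_into("<I32s", mem, off + 4, tx_kind,
                     hashlib.sha256(body).digest())
    mem[off + HDR_SIZE: off + HDR_SIZE + len(body)] = body


def publish(mem, off, body_len):
    """Final step: one store of buffer_len makes the buffer visible."""
    struct.pack_into("<I", mem, off, HDR_SIZE + body_len)


def try_read(mem, off):
    """Return the body, or None while the buffer is still in flight."""
    (buffer_len,) = struct.unpack_from("<I", mem, off)
    if buffer_len == 0:
        return None  # caller yields and re-reads later
    return bytes(mem[off + HDR_SIZE: off + buffer_len])
```

A reader that gets `None` simply retries after yielding; it never observes a partially written body because the length is written last.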
LP-200 §"What's NOT yet decomplected" lists the buffer-pool-per-validator
UMA region as the current state. CXL Coherent State Pool
upgrades that region from per-validator UMA to rack-pooled CXL:

sha256(buf) | unchanged: sha256(buf) |

The LP-200 stack guarantee is preserved: the buffer is still the
value, identity is still sha256(buf), the transcript is still
TupleHash256 over typed offsets. The only change is the storage
medium and the read latency. A validator that produces a buffer
allocates it in the pool; a peer that needs it reads it from the
pool. No marshalling, no copy, no codec.
LP-203 §"GPUDirect path" describes the verifier running on
Blackwell sm_120 reading ZAP buffers from a UMA-style region without
crossing the PCIe boundary. CXL Coherent State Pool extends that
guarantee across validators in the rack.
The MI300A row is the cleanest topology — the GPU has its own CXL.mem
port and reads the pool directly. NVIDIA's GB10 / GB200 path takes
one extra hop through the Grace CPU's CXL root complex, adding ~30ns
to each cache-line read. Both are dramatically faster than the LP-203
fallback path (cross-validator gossip + PCIe DMA, ~100 μs).
LP-210 §"Phase 3 — multi-version concurrency control" maintains a
versioned overlay per executed tx in the wave. Under the standard
deployment, each validator runs Block-STM independently and the
MVCC snapshot lives in that validator's local memory. With CXL
Coherent State Pool:
The architectural decision is non-trivial: under CXL Coherent
State Pool, do co-racked validators share Block-STM execution, or
do they each replicate it for safety? LP-206 chooses **each
replicates locally, observes peer overlays for early
conflict-detection**. Sharing execution would be a single point of
failure (one buggy validator poisons the rack's overlay); replicating
preserves the security model. The peer-observable overlays accelerate
conflict detection — a validator that sees a peer's read-set
intersect its own write-set retires its retry immediately rather than
waiting for its own re-execution.
LP-211 §"Step 2 — per-shard prepare" stages an overlay locally on
each shard validator until the parent-L1 cert commits. Under CXL
Coherent State Pool, the staged overlay is pool-resident; every
shard validator in the rack observes the staging without round-trip:
The cross-shard 2PC ack round-trip (LP-211 §Latency row "Prepare-acks
arrive at coordinator: ~130 ms") shrinks for co-racked shards from
~10 ms (gossip RTT) to ~10 μs (one CXL cache-line read by the
coordinator to verify the overlay is staged). For an all-in-one-rack
4-shard primary network running cross-shard at PQ-strict, end-to-end
cross-shard latency drops from ~260 ms to ~125 ms — bounded entirely
by the LP-209 wave commit cadence, not by 2PC coordination.
The pool is rack-local. Global replication remains LP-201's
responsibility:
Pool snapshots replicate to the DHT periodically. Each rack pool
emits a snapshot every N waves: serialize the pool's current set of
live buffers (skip free-list), chunk by LP-201 §"Content types and
TTL" rules, advertise to DHT. A validator joining a fresh rack pulls
the latest snapshot from DHT, populates its local pool, and synchronizes
forward via standard LP-201 catch-up.
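
A snapshot pass over the pool might look like the sketch below. It rests on several assumptions that the text does not confirm: buffers are laid out back to back from offset 4096, a freed buffer keeps its `buffer_len`, offset 0 terminates the free list, and chunking uses a fixed size (the actual rules live in LP-201 §"Content types and TTL"). All function names are hypothetical.

```python
import struct

META_SIZE = 4096
HEAD_OFFSET = 16
FREE_LIST_HEAD = 24
NEXT_FREE = 40


def free_offsets(mem):
    """Follow free_list_head through the next_free links."""
    seen = set()
    (off,) = struct.unpack_from("<Q", mem, FREE_LIST_HEAD)
    while off and off not in seen:  # 0 is the assumed end-of-list sentinel
        seen.add(off)
        (off,) = struct.unpack_from("<Q", mem, off + NEXT_FREE)
    return seen


def live_buffers(mem):
    """Yield every published buffer (header + body) not on the free list."""
    (head,) = struct.unpack_from("<Q", mem, HEAD_OFFSET)
    free = free_offsets(mem)
    off = META_SIZE
    while off < head:
        (buffer_len,) = struct.unpack_from("<I", mem, off)
        if buffer_len == 0:
            break  # unpublished buffer: extent unknown, stop the walk
        if off not in free:
            yield bytes(mem[off:off + buffer_len])
        off += buffer_len


def chunks(blob, size):
    """Split a serialized snapshot into fixed-size pieces (assumed rule)."""
    return [blob[i:i + size] for i in range(0, len(blob), size)]
```

Stopping at the first unpublished buffer keeps the sketch simple; a real snapshotter would need some way to skip in-flight reservations without losing the buffers behind them.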
A rack that crashes does not lose data because the snapshots are
already in DHT. The pool is a fast path, not a system of record.
The system of record is ZapDB (LP-200 §"Layer 1") plus the DHT.
sha256(buf) != buffer_hash mismatch) | Rack-coordinated pool reset (epoch bump in metadata); peers re-populate from DHT |
buffer_hash | Reader re-reads on hash mismatch; treat as transient; if persistent, fall back to QUIC |
head_offset CAS | Coordinated rack pool reset (LP-211-style 2PC across the rack's validators on a PoolReset tx); peers re-populate from DHT |

Honest gap. CXL 3.0 hardware is not on the bench cluster. DGX Spark
provides the NVLink-C2C 80ns Grace↔Blackwell measurement (single
host), which is the closest available analog. The multi-host pooled-CXL
numbers are Estimated from the CXL 3.0 specification and from Intel /
AMD product briefs. The estimates graduate to Measured when a CXL
3.0 rack reaches the lab (target: Q3 2026 Sapphire Rapids refresh
cycle with CXL 3.0 switch availability).
The win over LP-201 QUIC at the intra-rack scale is roughly 1000×
on cache-line reads, 2× on full-buffer reads. The 1000× factor
dominates Block-STM MVCC conflict detection and cross-shard 2PC
prepare-ack observation — the use cases LP-206 was built for.
CXL operates at the physical-layer trust boundary. A validator joining
a rack pool implicitly trusts the rack's physical security — the
operator confirms that no untrusted host is attached to the CXL
fabric. This is the standard data-center physical security model and
is unchanged by LP-206.
A CXL device that returns corrupt data — whether by firmware bug or
adversary control — could poison every reader in the rack. LP-206
mitigates via per-buffer sha256 hash verification at the buffer
header. A reader that observes sha256(body) != buffer_hash rejects
the buffer and falls back to LP-201 QUIC. The hash check is the same
check LP-200 §"One wire, one identity, one transcript" prescribes;
LP-206 inherits it.
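
The verify-then-fall-back read path can be sketched as follows. The two callables are hypothetical stand-ins: `read_raw` returns the header-prefixed bytes currently visible in the pool, and `fetch_via_quic` represents the LP-201 path. The default of two attempts is an assumption chosen to illustrate treating a first mismatch as transient.

```python
import hashlib
import struct

HDR_FMT = "<II32sQ16s"   # little-endian layout is an assumption
HDR_SIZE = struct.calcsize(HDR_FMT)  # 64


def read_verified(read_raw, fetch_via_quic, attempts=2):
    """Return a hash-checked body from the pool, else use the QUIC path."""
    for _ in range(max(1, attempts)):
        raw = read_raw()
        buffer_len, _kind, buffer_hash, _next_free, _reserved = \
            struct.unpack_from(HDR_FMT, raw, 0)
        body = bytes(raw[HDR_SIZE:buffer_len])
        if hashlib.sha256(body).digest() == buffer_hash:
            return body
        # Mismatch: treat as transient and re-read before giving up.
    return fetch_via_quic()
```

The check only proves the body matches the header's own hash; a caller that already knows the expected identity would additionally compare `buffer_hash` against it.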
The pooled CXL device MUST run hardware ECC (CXL 3.0 spec §6.3.2
requires it for Type-3 devices). Operators SHOULD enable periodic
memory scrubbing at the device level. The recommended scrub interval is
24 hours, balancing scrub bandwidth against soft-error half-life: one
full pass per day reads about 4% of device capacity per hour, roughly
3 MiB/s sustained for the 256 GiB device shown above. Uncorrectable
ECC errors raise a CXL.io interrupt; the device's Linux driver
surfaces the affected page; the pool allocator marks the affected
buffer range as quarantined and the rack falls back to LP-201 QUIC
for that range until snapshot replacement.
LP-077 round digest binding (chain_id, epoch, height) protects
against replay across networks. A pool that returns a stale buffer
from a previous epoch (e.g. after a firmware bug skipped a coherency
invalidation) cannot trigger a stale-state attack because the
consumer is reading the buffer for a specific `(chain_id, epoch,
height)` and a stale buffer fails the transcript check at LP-200
§"Layer 4 — Transcript".
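
The binding argument can be illustrated with a small digest sketch. It is not the LP-200 transcript: SHA-256 over length-prefixed fields stands in for TupleHash256, which is not in the Python standard library, and the function name and 8-byte integer encoding are assumptions.

```python
import hashlib


def bound_digest(chain_id, epoch, height, buf):
    """Digest over length-prefixed (chain_id, epoch, height, buf).

    Stand-in only: the specification's transcript is TupleHash256 over
    typed offsets; SHA-256 with explicit length prefixes is used here
    purely to show that the context tuple is part of the hashed input.
    """
    h = hashlib.sha256()
    parts = (
        chain_id.to_bytes(8, "big"),
        epoch.to_bytes(8, "big"),
        height.to_bytes(8, "big"),
        bytes(buf),
    )
    for part in parts:
        h.update(len(part).to_bytes(8, "big"))  # unambiguous framing
        h.update(part)
    return h.digest()
```

A consumer that expects the digest for `(chain_id, epoch, height)` will see a mismatch if the pool hands back either different bytes or bytes that were bound to an earlier epoch.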
Transport.Pick(), Transport.Register(), the TransportCapability enum (including CapCXLPool), and the priority list. This LP registers cxl-pool against that contract |
rdma-ib / rdma-rocev2 for cross-rack peers; this LP registers cxl-pool for same-rack peers |

LP-201 owns the Transport.Pick() selector contract. This LP and
LP-207 register implementations against it. The three-tier transport
stack for Tier-1 single-DC mode of LP-204 is one contract and three
registered implementations:
| Transport | Latency | Defined in |
|---|---|---|
| cxl-pool | ~80 ns | this LP |
| rdma-ib / rdma-rocev2 | ~3 μs | LP-207 |
| quic | ~50-100 μs LAN, ~50-100 ms WAN | LP-201 |

Each tier composes: a validator in the rack uses the pool; the same
validator uses RDMA for cross-rack peers; the same validator uses
QUIC for WAN peers. The chosen transport is a function of the peer's
location, not of the protocol layer above. Selector semantics
(priority, registration, capability probe) are defined in LP-201
§"Transport.Pick() contract." Consensus, mempool, and state-sync at
layers above are transport-agnostic.
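
Location-driven selection can be sketched as below. This is not the LP-201 `Transport.Pick()` contract, whose exact semantics are defined elsewhere: the `Location` type, the function name, and the fall-through from a missing `cxl-pool` to RDMA for same-rack peers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical peer location: data center plus rack within it."""
    dc: str
    rack: str


def pick_transport(local, peer, registered):
    """Choose the fastest registered transport the peer's location allows."""
    same_dc = local.dc == peer.dc
    same_rack = same_dc and local.rack == peer.rack
    if same_rack and "cxl-pool" in registered:
        return "cxl-pool"
    if same_dc:
        # Assumed priority order between the two RDMA flavors.
        for name in ("rdma-ib", "rdma-rocev2"):
            if name in registered:
                return name
    return "quic"  # always available; used for WAN peers
```

For example, a validator in rack `r1` of `dc1` would pick `cxl-pool` for a rack neighbor, an RDMA transport for a peer in rack `r2`, and `quic` for any peer in another data center, without the layers above noticing the difference.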
activates: 2025-12-25T16:20:00-08:00
activates-unix: 1766708400
CXL Coherent State Pool is callable from the genesis block of the new
final Lux network. Validators without CXL hardware silently fall back
to LP-201 QUIC; nothing on the wire changes.
rack-pooled CXL)
Transport.Pick() contract this LP implements; defines the CapCXLPool priority slot; pool
snapshots replicate via LP-201's Kademlia DHT)
pool-agnostic)
deployment target)
for cross-rack peers)
2022, §3.1.4 latency, §6.3.2 Type-3 ECC requirement
latency, GB10 single-host + GB200 multi-host topology