Draft. No backwards compatibility. No flag day.
Activated at the genesis of the new final Lux network: **2025-12-25 16:20
Pacific (unix 1766708400)**. The pre-Quasar Edition Lux network
(2020–2025) had no RDMA fast path and is a separate network out of scope.
This LP defines the RDMA-over-InfiniBand transport, an implementation
of the LP-201 Transport.Pick() contract for cross-rack peers within
a single data center. It registers two factories via
Transport.Register("rdma-ib", rdmaIBFactory) and
Transport.Register("rdma-rocev2", rdmaRoCEv2Factory). It does NOT
re-specify the selector — selector semantics, the priority list, the
capability enum (including CapRDMAIB and CapRDMARoCEv2), and the
fall-through to QUIC are all owned by LP-201 §"Transport.Pick()
contract."
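The registration step can be sketched as follows. This is a minimal illustration, not the LP-201 API: the `Registry` type, the `Factory` signature, the `Conn` interface, and the `Dial` helper are all assumptions made for this example, and the placeholder factories do not open real Verbs queue pairs.

```go
package main

import (
	"errors"
	"fmt"
)

// Conn is a hypothetical handle to an established peer connection.
type Conn interface {
	TransportName() string
}

// Factory is a hypothetical constructor signature; LP-201 owns the real one.
type Factory func(peerAddr string) (Conn, error)

// Registry is a stand-in for the registry behind LP-201's Transport.Register.
type Registry struct {
	factories map[string]Factory
}

func NewRegistry() *Registry {
	return &Registry{factories: make(map[string]Factory)}
}

// Register stores a factory under a unique name and rejects duplicates.
func (r *Registry) Register(name string, f Factory) error {
	if _, exists := r.factories[name]; exists {
		return errors.New("transport already registered: " + name)
	}
	r.factories[name] = f
	return nil
}

// Dial builds a connection with the factory registered under name.
func (r *Registry) Dial(name, peerAddr string) (Conn, error) {
	f, ok := r.factories[name]
	if !ok {
		return nil, errors.New("unknown transport: " + name)
	}
	return f(peerAddr)
}

// rdmaConn is a placeholder; a real factory would set up a Verbs queue pair.
type rdmaConn struct{ name string }

func (c rdmaConn) TransportName() string { return c.name }

func rdmaIBFactory(peerAddr string) (Conn, error) {
	return rdmaConn{name: "rdma-ib"}, nil
}

func rdmaRoCEv2Factory(peerAddr string) (Conn, error) {
	return rdmaConn{name: "rdma-rocev2"}, nil
}

func main() {
	reg := NewRegistry()
	if err := reg.Register("rdma-ib", rdmaIBFactory); err != nil {
		panic(err)
	}
	if err := reg.Register("rdma-rocev2", rdmaRoCEv2Factory); err != nil {
		panic(err)
	}
	conn, err := reg.Dial("rdma-ib", "peer-a") // "peer-a" is a made-up address
	if err != nil {
		panic(err)
	}
	fmt.Println(conn.TransportName())
}
```

Keeping the two factories behind names means the selector never needs to import this transport directly.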
Motivation. QUIC pays a kernel-transit + TLS-framing + userspace-copy
penalty that floors at ~50-100 μs end-to-end on a 100 Gbps fabric. For
Tier-1 single-DC mode of LP-204 — a sovereign L1 whose validator set
runs in one or two data centers and pays for InfiniBand or RoCEv2
fabric — RDMA over InfiniBand removes the kernel and the framing. The
wire payload is the same LP-022 ZAP frame, with the same LP-201
stream-type bytes (0xD0..0xDF). The delivery medium is InfiniBand
Verbs (ibv_post_send of an RDMA WRITE into a remote pre-registered
memory region) or RoCEv2 (the same Verbs over a UDP/IP fabric, specified
in IBTA Annex A17). One-way NIC-to-NIC latency on Mellanox ConnectX-7 NDR
(400 Gbps) is ~600 ns; cross-rack RTT through one switch hop is
~3 μs — 20-30× faster than QUIC on the same fabric.
This LP specifies: the RDMA wire mapping for LP-201 stream types, the
RDMA-IB-specific failure modes (NIC link down, IB QP exhaustion,
registered-buffer exhaustion), the GPUDirect RDMA integration with
LP-203 (NIC writes directly into GPU memory; CPU not on the path),
and the byzantine-resistance posture (RDMA assumes a trusted
validator-set within the DC; LP-201's QUIC fall-through provides the
TLS 1.3 security envelope for cross-DC peers automatically).
DGX Spark, the current bench cluster, has no Mellanox NIC — the
GPUDirect probe failed because of this. The latency numbers in
§"Performance" are cited from Mellanox / NVIDIA product briefs
(ConnectX-7 NDR, ConnectX-6 HDR) and from published RDMA-Verbs
microbenchmarks. The numbers graduate from cited to measured when
a Mellanox-equipped validator reaches the lab (target: Q3 2026
ConnectX-7 NDR refresh on rack-scale clusters).
LP-201's Transport.Pick() selects this transport when both peers
report CapRDMAIB (or CapRDMARoCEv2) and the higher-priority
CapCXLPool (LP-206) is unavailable. The locality table below is
informational — the selector codifies it in priority order:
The transport is a function of the peer's location, not of the
message type. A ConsensusVote to a same-DC peer with IB lands
via RDMA-IB; the same ConsensusVote to a WAN peer lands via QUIC.
The application layer above does not know which transport carried
the message. See LP-201 §"Transport.Pick() contract" for the
selector semantics.
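To make the locality rule concrete, here is a hedged sketch of a capability-driven selector. It is not LP-201's implementation. The values 4 and 5 for `CapRDMAIB` and `CapRDMARoCEv2` come from this LP; treating them as bit positions in a mask, the placeholder value for `CapCXLPool`, and the relative order of the two RDMA entries are assumptions for illustration only.

```go
package main

import "fmt"

// CapSet is a bitmask of transport capabilities a peer reports.
type CapSet uint32

const (
	CapRDMAIB     = 4 // value given in this LP
	CapRDMARoCEv2 = 5 // value given in this LP
	CapCXLPool    = 6 // placeholder: the real value is defined in LP-201
)

// Has reports whether the capability at the given bit position is set.
func (s CapSet) Has(bit uint) bool { return s&(1<<bit) != 0 }

type candidate struct {
	name string
	bit  uint
}

// priority is an assumed ordering: CXL pool first, then RDMA, then QUIC.
var priority = []candidate{
	{"cxl-pool", CapCXLPool},
	{"rdma-ib", CapRDMAIB},
	{"rdma-rocev2", CapRDMARoCEv2},
}

// Pick returns the first transport both peers support, else "quic".
func Pick(local, remote CapSet) string {
	both := local & remote
	for _, c := range priority {
		if both.Has(c.bit) {
			return c.name
		}
	}
	return "quic" // fall-through for WAN peers or missing hardware
}

func main() {
	local := CapSet(1<<CapRDMAIB | 1<<CapCXLPool)
	sameDC := CapSet(1 << CapRDMAIB)
	wan := CapSet(0)
	fmt.Println(Pick(local, sameDC)) // same-DC peer with IB
	fmt.Println(Pick(local, wan))    // WAN peer
}
```

The message type never appears in `Pick`; only the peer's reported capabilities do, which is the property the paragraph above describes.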
The recommended baseline is ConnectX-7 NDR for new Tier-1 single-
DC validator deployments. Cost per port is dominated by the NIC and
the IB switch; the recommended IB switch is the NVIDIA Quantum-2 NDR-200
(64 ports, ~$15-30k street).
InfiniBand fabric or RoCEv2 over Ethernet. Both expose the same Verbs
interface (libibverbs on Linux). The wire-level differences
(InfiniBand Link Layer vs Ethernet + RoCE encapsulation) are invisible
above the Verbs layer.
Recommended deployment: InfiniBand fabric for greenfield Tier-1
single-DC sites (better congestion control, lower jitter at high
fan-in); RoCEv2 for existing Ethernet-built DCs (no fabric forklift).
Every RDMA peer pre-registers a set of memory buffers with the NIC.
The NIC obtains pinned physical pages and a key (rkey) that remote
peers use to address those buffers. LP-207 registers a pool of ZAP
buffers at boot, of total size equal to the worst-case in-flight
message set (typically 1-8 GiB per peer).
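A boot-time buffer pool of this kind might look like the sketch below. It is illustrative only: the `Pool` type and its methods are hypothetical, the sizes are toy values, and the actual NIC registration (for example via libibverbs `ibv_reg_mr`) is represented by a comment because it needs cgo and real hardware.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrPoolExhausted models the registered-buffer exhaustion failure mode.
var ErrPoolExhausted = errors.New("registered buffer pool exhausted")

// Pool hands out fixed-size slices carved from one slab allocated at boot.
type Pool struct {
	slab []byte
	free chan []byte
}

// NewPool allocates count buffers of size bytes in a single slab.
func NewPool(count, size int) *Pool {
	p := &Pool{
		slab: make([]byte, count*size),
		free: make(chan []byte, count),
	}
	// A real implementation would register p.slab with the NIC once, here,
	// and keep the resulting keys for the lifetime of the process.
	for i := 0; i < count; i++ {
		// Three-index slice caps each buffer so it cannot grow into its neighbour.
		p.free <- p.slab[i*size : (i+1)*size : (i+1)*size]
	}
	return p
}

// Acquire returns a free buffer without blocking.
func (p *Pool) Acquire() ([]byte, error) {
	select {
	case b := <-p.free:
		return b, nil
	default:
		return nil, ErrPoolExhausted
	}
}

// Release returns a buffer to the pool; extra releases are ignored.
func (p *Pool) Release(b []byte) {
	select {
	case p.free <- b[:cap(b)]:
	default:
	}
}

func main() {
	pool := NewPool(2, 4096) // toy sizes; the LP cites 1-8 GiB per peer
	a, _ := pool.Acquire()
	_, _ = pool.Acquire()
	_, err := pool.Acquire()
	fmt.Println(len(a), err)
	pool.Release(a)
}
```

Because no buffer is allocated or registered after `NewPool` returns, the hot path never pays the registration cost.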
Memory registration is expensive (~milliseconds per registration);
LP-207 amortizes by registering once at boot and never re-registering
during normal operation. Registered buffers ARE the ZAP buffers
(LP-200 §"The buffer IS the value") — no separate "RDMA buffer" and
"ZAP buffer" structures.
Every RDMA payload is a ZAP frame, byte-identical to the QUIC payload
LP-201 carries. The RDMA operation envelope adds three pieces of
metadata on top:
LP-207 uses RDMA WRITE_WITH_IMM as the primary verb for unidirectional
gossip (consensus votes, tx gossip, snapshot advertise — LP-201's
unidirectional stream types). RDMA WRITE_WITH_IMM combines a one-sided
write to a remote registered buffer with a 32-bit immediate value
delivered to the receiver's completion queue. The immediate carries:
bits 0..19 sender NodeID hash (low 20 bits)
bits 20..27 stream-type byte (0xD0..0xDF)
bits 28..31 reserved
The receiver polls the completion queue; on completion it knows
(a) which buffer was written by wr_id, (b) the sender from the
immediate, and (c) the stream type from the immediate. The receiver
then validates the buffer's ZAP frame body via the standard LP-022
parser. No additional framing.
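The bit layout of the immediate can be expressed as a pair of pack and unpack helpers. This is a sketch under stated assumptions: the function names are hypothetical, reserved bits are written as zero and ignored on decode, and the conversion to network byte order that Verbs expects for the immediate field is left out.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	senderBits  = 20
	senderMask  = 1<<senderBits - 1 // bits 0..19
	streamShift = 20                // stream type occupies bits 20..27
	streamMin   = 0xD0
	streamMax   = 0xDF
)

var ErrStreamType = errors.New("stream type outside 0xD0..0xDF")

// EncodeImm packs the low 20 bits of the sender hash and the stream type.
func EncodeImm(nodeIDHash uint32, streamType byte) (uint32, error) {
	if streamType < streamMin || streamType > streamMax {
		return 0, ErrStreamType
	}
	return (nodeIDHash & senderMask) | (uint32(streamType) << streamShift), nil
}

// DecodeImm extracts the sender hash bits and the stream type.
func DecodeImm(imm uint32) (sender uint32, streamType byte, err error) {
	sender = imm & senderMask
	streamType = byte((imm >> streamShift) & 0xFF)
	if streamType < streamMin || streamType > streamMax {
		return 0, 0, ErrStreamType
	}
	return sender, streamType, nil
}

func main() {
	imm, err := EncodeImm(0xABCDE, 0xD0) // 0xD0 is ConsensusVote
	if err != nil {
		panic(err)
	}
	sender, st, err := DecodeImm(imm)
	if err != nil {
		panic(err)
	}
	fmt.Printf("imm=0x%08X sender=0x%05X stream=0x%02X\n", imm, sender, st)
}
```

Only 20 bits of the sender hash fit, so the immediate narrows the candidate set; the frame's signature remains the authoritative sender identity.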
For bidirectional RPC (BlockRequest/Response, TxRequest/Response,
ChunkRequest/Response — LP-201's bidirectional stream types) LP-207
uses a pair of RDMA SENDs, one for the request and one for the response. SEND is
two-sided (the receiver posts receive buffers in advance) and matches
the request-response semantic naturally.
The receiver pre-posts receive buffers from the registered pool. When
an RDMA SEND completes, the receiver's polling thread observes a
completion-queue entry with the buffer's wr_id; the buffer is in
pinned RDMA memory, ready for parse via the LP-022 schema. No memcpy,
no kernel hop, no userspace allocation.
InfiniBand fabric is physically trusted. RoCEv2 over Ethernet may not
be. LP-207 supports two authentication modes:
The default for greenfield InfiniBand is IB-trusted. The default for
RoCEv2 over a shared L2 is IPsec.
Cross-DC peers ALWAYS fall back to QUIC (LP-201) regardless. The
operator never opens RDMA across a WAN edge; the security and
fragmentation cost is not worth the latency win.
LP-203 §"GPUDirect path" specifies the NIC writing directly into GPU
memory without crossing the host CPU. LP-207 RDMA + LP-203 GPUDirect
compose:
Sender: GPU produces ZAP frame in registered GPU buffer → ibv_post_send WRITE_WITH_IMM to receiver's GPU buffer.
Receiver: NIC writes into GPU memory via GPUDirect RDMA → GPU polls receive buffer for new frames.
CPU is not on the path. End-to-end NIC-to-GPU-memory latency on
ConnectX-7 NDR + Blackwell sm_120 is ~1 μs (estimated; matches
Mellanox + NVIDIA published GPUDirect RDMA benchmarks).
The GPU-side verifier (LP-203) processes frames directly from the
RDMA-landed buffer with no host-side roundtrip. For a wave of N
consensus votes, every vote lands in GPU memory in ~1 μs; the GPU
verifies all N in parallel; the host CPU sees only the aggregated
result.
GPUDirect RDMA requires the NIC and the GPU to be on the same PCIe
root complex (or connected via NVLink/CXL on GB10/GB200). On Sapphire
Rapids hosts, ensure NIC and GPU are in the same NUMA node. On AMD
Genoa hosts, ensure they share a CCD. Cross-NUMA GPUDirect RDMA
silently degrades to a kernel-mediated path; verify with `nvidia-smi
topo -m` at deployment.
LP-201 specifies sybil resistance at the validator level via stake
(LP-170 / LP-171) and rate limiting at the QUIC stream layer. LP-207
inherits both, with one caveat: **RDMA assumes a trusted validator-set
within the DC.**
The IB fabric is physical — only validators with a port on the IB
switch can participate. Operator policy controls who connects. This
is the same trust assumption as LP-206 CXL Coherent State Pool. For
sovereign L1 single-DC deployments where the operator runs every
validator, this is acceptable.
For mixed-tenancy deployments where some DC peers are not operator-
controlled (a third-party validator co-located in the same DC),
the operator pins the per-peer transport in the validator config.
The pin overrides LP-201's Transport.Pick() priority list for that
peer only:
peers:
  - nodeID: NodeID-1
    transport: rdma-ib             # full operator control
  - nodeID: NodeID-2
    transport: rdma-rocev2-ipsec   # same DC, third party
  - nodeID: NodeID-3
    transport: quic                # cross-DC or untrusted
The pin is per-peer, not per-DC; the choice is the operator's risk
posture. Mechanically, LP-201's Pick consults the pinned name
first and bypasses capability negotiation for pinned peers.
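The pin-first lookup can be sketched in a few lines. The `Selector` type and the `Negotiate` callback are hypothetical stand-ins; LP-201 defines the real selector, and this example only illustrates the ordering of the two checks.

```go
package main

import "fmt"

// Selector pairs operator pins with a negotiation fallback.
type Selector struct {
	Pins      map[string]string          // nodeID -> pinned transport name
	Negotiate func(nodeID string) string // stands in for capability negotiation
}

// Pick returns the pinned transport if one exists, else negotiates.
func (s Selector) Pick(nodeID string) string {
	if name, ok := s.Pins[nodeID]; ok {
		return name // pinned peers bypass negotiation entirely
	}
	return s.Negotiate(nodeID)
}

func main() {
	sel := Selector{
		Pins: map[string]string{
			"NodeID-1": "rdma-ib",
			"NodeID-2": "rdma-rocev2-ipsec",
			"NodeID-3": "quic",
		},
		// Assumed result for an unpinned peer, for demonstration only.
		Negotiate: func(string) string { return "rdma-ib" },
	}
	for _, id := range []string{"NodeID-2", "NodeID-9"} {
		fmt.Println(id, sel.Pick(id))
	}
}
```

Since the pin short-circuits negotiation, a misconfigured pin cannot be corrected by the peer's advertised capabilities; that is the operator's responsibility.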
If an adversary obtains a port on the IB fabric (e.g. via physical
intrusion or firmware compromise of a peer NIC), they can:
1. Forge RDMA WRITEs to other peers' buffers. The forged frames
are LP-022 envelopes and pass the parser, but they MUST be signed
per LP-200 §"Layer 5 — Signature". An unsigned or invalidly-signed
frame is rejected at the application layer.
2. Replay frames. Mitigated by LP-077 round digest binding
(chain_id, epoch, height) — a replayed cert from a previous
round fails the transcript check.
3. Denial of service via buffer exhaustion. The receiver's
registered buffer pool is finite; an attacker can fill it. LP-207
rate-limits per-peer at the Verbs layer (max in-flight WRITEs per
peer, configurable; default 256). A peer that exceeds the limit
has further WRITEs silently dropped.
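The per-peer in-flight cap from item 3 could be tracked with a simple counter, as in the sketch below. The `InflightLimiter` type and its method names are hypothetical; only the default of 256 comes from this LP, and how the limit is enforced against the NIC is outside the scope of the example.

```go
package main

import (
	"fmt"
	"sync"
)

// DefaultMaxInflight is the default per-peer limit stated in this LP.
const DefaultMaxInflight = 256

// InflightLimiter counts outstanding WRITEs per peer.
type InflightLimiter struct {
	mu       sync.Mutex
	limit    int
	inflight map[string]int
}

func NewInflightLimiter(limit int) *InflightLimiter {
	return &InflightLimiter{limit: limit, inflight: make(map[string]int)}
}

// Admit reports whether one more WRITE from peer may be accepted.
// A false result means the caller drops the WRITE.
func (l *InflightLimiter) Admit(peer string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inflight[peer] >= l.limit {
		return false
	}
	l.inflight[peer]++
	return true
}

// Done releases one slot after the buffer has been consumed.
func (l *InflightLimiter) Done(peer string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inflight[peer] > 0 {
		l.inflight[peer]--
	}
}

func main() {
	lim := NewInflightLimiter(DefaultMaxInflight)
	admitted, dropped := 0, 0
	for i := 0; i < 300; i++ {
		if lim.Admit("NodeID-1") {
			admitted++
		} else {
			dropped++
		}
	}
	fmt.Println("admitted:", admitted, "dropped:", dropped)
}
```

Counting per peer rather than globally keeps one noisy peer from consuming the slots of the others.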
The protocol-layer signature check (LP-200) is the security boundary,
not the transport-layer trust assumption. RDMA over an adversarial
fabric is no more vulnerable than QUIC over an adversarial network —
both rely on the application-layer signature.
A Tier-1 single-DC primary network operating under LP-204 has a mix
of locality classes:
A single validator runs all three transports simultaneously. The
operator does not have to choose at deploy time — the per-peer
transport selection is dynamic. Consensus uses all three in
parallel; the cert leg aggregation across the three classes is
identical (BLS aggregate, Pulsar threshold, Magnetar threshold).
The latency wins compound: intra-rack peers reach quorum in
microseconds via CXL; cross-rack peers reach quorum in hundreds of
microseconds via RDMA; cross-DC peers reach quorum in tens of
milliseconds via QUIC. The cert observability tier (LP-202) advances
through PQ-off → PQ-fast → PQ-strict in the order legs arrive — and
the order is now a function of transport latency, not protocol
ordering.
This LP carries LP-201's stream-type bytes (0xD0..0xDF — ConsensusVote
0xD0, ConsensusProposal 0xD1, ..., DHTStore 0xDF) unchanged. The
stream-type byte rides in bits 20..27 of the RDMA WRITE_WITH_IMM
32-bit immediate (see §"Wire format" above for unidirectional gossip)
or in the first byte of the SEND payload (bidirectional RPC). A
validator that receives a frame does not know whether it arrived via
QUIC (LP-201), RDMA-IB (this LP), or CXL pool read (LP-206) — it sees
a ZAP frame with stream-type 0xD0..0xDF and processes it.
Selector ownership: Transport.Pick(peer) is defined in LP-201
§"Transport.Pick() contract." This LP registers two factories
(rdma-ib, rdma-rocev2) via LP-201's
`Transport.Register(name, factory)` API. The priority slot, capability
bits (`CapRDMAIB` = 4, `CapRDMARoCEv2` = 5), and fall-through behaviour are all
specified in LP-201. This LP MUST NOT re-specify them.
Honest gap. Numbers cited from Mellanox / NVIDIA product briefs
and from published ib_write_lat / ib_send_lat / ib_send_bw
RDMA Verbs benchmarks. DGX Spark has no Mellanox NIC — the LP-203
GPUDirect probe failed for this reason. The bench numbers graduate
from cited to measured when a Mellanox-equipped validator
reaches the lab.
The ~20-30× win over QUIC in the same DC is the operational case for
LP-207. For an LP-209 wave running at PQ-strict in the same DC, the
40-validator cert aggregation latency drops from ~3 ms (QUIC) to
~150 μs (RDMA) — a factor that compounds into LP-211 cross-shard 2PC
and LP-218 rollup batch finality.
- IBV_WC_RNR_RETRY_EXC_ERR; transport returns "unreachable" to selector.
- Transport.Pick(peer) retries via QUIC fallback for that peer until NIC link returns.
- nvidia-smi topo -m at deploy; ensures NIC and GPU on same NUMA node.
- Transport.Pick(), Transport.Register(), the TransportCapability enum (including CapRDMAIB, CapRDMARoCEv2), and the priority list. This LP only registers two factories against that contract.
- cxl-pool for same-rack peers; this LP registers rdma-ib / rdma-rocev2 for cross-rack peers.
The three-tier transport stack (LP-206 CXL → LP-207 RDMA → LP-201
QUIC) is one contract (LP-201) and three registered implementations.
There is no fourth transport spec'd today; future implementations
register against LP-201 without amending this LP, LP-206, or any
consumer call site. Every peer on every chain at every tier is mapped
to a registered implementation by LP-201's Transport.Pick() per
the priority list in LP-201 §"Transport.Pick() contract."
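The fall-through from RDMA to QUIC when a peer becomes unreachable can be sketched as a small piece of per-peer state. The `LinkState` type and its methods are hypothetical; the sketch assumes the transport calls `MarkUnreachable` when it sees a completion error such as IBV_WC_RNR_RETRY_EXC_ERR and `MarkReachable` when the link returns. How LP-201 actually tracks this is not specified here.

```go
package main

import (
	"fmt"
	"sync"
)

// LinkState remembers which peers currently have no usable RDMA path.
type LinkState struct {
	mu   sync.Mutex
	down map[string]bool
}

func NewLinkState() *LinkState {
	return &LinkState{down: make(map[string]bool)}
}

// MarkUnreachable records that RDMA to peer failed.
func (s *LinkState) MarkUnreachable(peer string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.down[peer] = true
}

// MarkReachable clears the failure once the link is back.
func (s *LinkState) MarkReachable(peer string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.down, peer)
}

// Effective returns the transport to use right now for peer, given the
// transport the selector would prefer under normal conditions.
func (s *LinkState) Effective(peer, preferred string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.down[peer] && preferred != "quic" {
		return "quic" // temporary fallback for this peer only
	}
	return preferred
}

func main() {
	ls := NewLinkState()
	fmt.Println(ls.Effective("NodeID-1", "rdma-ib"))
	ls.MarkUnreachable("NodeID-1")
	fmt.Println(ls.Effective("NodeID-1", "rdma-ib"))
	ls.MarkReachable("NodeID-1")
	fmt.Println(ls.Effective("NodeID-1", "rdma-ib"))
}
```

The fallback is scoped to the failing peer, so other peers on the same fabric keep their RDMA path.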
activates: 2025-12-25T16:20:00-08:00
activates-unix: 1766708400
RDMA over InfiniBand is callable from the genesis block of the new
final Lux network. Validators without RDMA hardware silently fall
back to LP-201 QUIC; nothing on the wire changes.
Transport.Pick() contract this LP implements; defines CapRDMAIB and CapRDMARoCEv2 priority
slots; carries the same 0xD0..0xDF stream-type schema range)
is transport-agnostic)
memory)
deployment target)
registered as cxl-pool for same-rack peers)
benefits from RDMA latency)
benefits from RDMA latency)
~600 ns one-way NIC latency
ibv_post_send semantics