LP-0207: RDMA over InfiniBand Transport — LP-201 Implementation for Co-Located Validators


Status

Draft. No backwards compatibility. No flag day.

Activated at the genesis of the new final Lux network:

**2025-12-25 16:20 Pacific (unix 1766708400)**. The pre-Quasar Edition Lux

network (2020–2025) had no RDMA fast path and is a separate network, out of scope.

Abstract

This LP defines the RDMA-over-InfiniBand transport, an implementation

of the LP-201 Transport.Pick() contract for cross-rack peers within

a single data center. It registers two factories via

Transport.Register("rdma-ib", rdmaIBFactory) and

Transport.Register("rdma-rocev2", rdmaRoCEv2Factory). It does NOT

re-specify the selector — selector semantics, the priority list, the

capability enum (including CapRDMAIB and CapRDMARoCEv2), and the

fall-through to QUIC are all owned by LP-201 §"Transport.Pick()

contract."

Motivation. QUIC pays a kernel-transit + TLS-framing + userspace-copy

penalty that floors at ~50-100 μs end-to-end on a 100 Gbps fabric. For

Tier-1 single-DC mode of LP-204 — a sovereign L1 whose validator set

runs in one or two data centers and pays for InfiniBand or RoCEv2

fabric — RDMA over InfiniBand removes the kernel and the framing. The

wire payload is the same LP-022 ZAP frame, with the same LP-201

stream-type bytes (0xD0..0xDF). The delivery medium is InfiniBand

Verbs (ibv_post_send of an RDMA WRITE into a remote pre-registered

memory region) or RoCEv2 (the same Verbs over a UDP/IP fabric, IBTA

Annex A17). One-way NIC-to-NIC latency on Mellanox ConnectX-7 NDR

(400 Gbps) is ~600 ns; cross-rack RTT through one switch hop is

~3 μs — 20-30× faster than QUIC on the same fabric.

This LP specifies: the RDMA wire mapping for LP-201 stream types, the

RDMA-IB-specific failure modes (NIC link down, IB QP exhaustion,

registered-buffer exhaustion), the GPUDirect RDMA integration with

LP-203 (NIC writes directly into GPU memory; CPU not on the path),

and the byzantine-resistance posture (RDMA assumes a trusted

validator-set within the DC; LP-201's QUIC fall-through provides the

TLS 1.3 security envelope for cross-DC peers automatically).

Honest hardware gap

DGX Spark, the current bench cluster, has no Mellanox NIC — the

GPUDirect probe failed because of this. The latency numbers in

§"Performance" are cited from Mellanox / NVIDIA product briefs

(ConnectX-7 NDR, ConnectX-6 HDR) and from published RDMA-Verbs

microbenchmarks. The numbers graduate from cited to measured when

a Mellanox-equipped validator reaches the lab (target: Q3 2026

ConnectX-7 NDR refresh on rack-scale clusters).

Locality table (informational)

LP-201's Transport.Pick() selects this transport when both peers

report CapRDMAIB (or CapRDMARoCEv2) and the higher-priority

CapCXLPool (LP-206) is unavailable. The locality table below is

informational — the selector codifies it in priority order:

| Peer locality | Selected transport | Source LP |

|---|---|---|

| Same rack | CXL Coherent State Pool | LP-206 |

| Same DC, cross-rack, InfiniBand | RDMA-IB (this LP) | LP-207 |

| Same DC, cross-rack, RoCEv2 | RDMA-RoCEv2 (this LP) | LP-207 |

| Cross-DC, same continent | QUIC | LP-201 |

| WAN / global | QUIC | LP-201 |

The transport is a function of the peer's location, not of the

message type. A ConsensusVote to a same-DC peer with IB lands

via RDMA-IB; the same ConsensusVote to a WAN peer lands via QUIC.

The application layer above does not know which transport carried

the message. See LP-201 §"Transport.Pick() contract" for the

selector semantics.
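The priority order in the table can be sketched as follows. This is an illustration only, not LP-201's selector: the `pick` helper, the string capability names used as set members, and the relative order of the two RDMA variants are assumptions made for the example. LP-201 §"Transport.Pick() contract" remains the normative source.

```python
# Illustrative only: LP-201 owns the real selector and priority list.
# Assumed priority, highest first (IB ahead of RoCEv2 is a guess).
PRIORITY = [
    ("CapCXLPool", "cxl-pool"),
    ("CapRDMAIB", "rdma-ib"),
    ("CapRDMARoCEv2", "rdma-rocev2"),
]
FALLBACK = "quic"  # LP-201's fall-through transport


def pick(local_caps: set[str], peer_caps: set[str]) -> str:
    """Return the first transport whose capability both peers report."""
    shared = local_caps & peer_caps
    for cap, transport in PRIORITY:
        if cap in shared:
            return transport
    return FALLBACK
```

With these assumptions, two peers that both report only CapRDMAIB resolve to rdma-ib, while a pair with no shared capability (for example one IB-only and one RoCEv2-only peer) falls through to quic.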

Hardware

NIC requirements


The recommended baseline is ConnectX-7 NDR for new Tier-1

single-DC validator deployments. Cost per port is dominated by the NIC and

the IB switch; recommended IB switch is the NVIDIA Quantum-2 NDR-200

(64 ports, ~$15-30k street).

Fabric

InfiniBand fabric or RoCEv2 over Ethernet. Both expose the same Verbs

interface (libibverbs on Linux). The wire-level differences

(InfiniBand Link Layer vs Ethernet + RoCE encapsulation) are invisible

above the Verbs layer.

Recommended deployment: InfiniBand fabric for greenfield Tier-1

single-DC sites (better congestion control, lower jitter at high

fan-in); RoCEv2 for existing Ethernet-built DCs (no fabric forklift).

Memory registration

Every RDMA peer pre-registers a set of memory buffers with the NIC.

The NIC obtains pinned physical pages and a key (rkey) that remote

peers use to address those buffers. LP-207 registers a pool of ZAP

buffers at boot, of total size equal to the worst-case in-flight

message set (typically 1-8 GiB per peer).

Memory registration is expensive (~milliseconds per registration);

LP-207 amortizes by registering once at boot and never re-registering

during normal operation. Registered buffers ARE the ZAP buffers

(LP-200 §"The buffer IS the value") — no separate "RDMA buffer" and

"ZAP buffer" structures.

Wire format

Every RDMA payload is a ZAP frame, byte-identical to the QUIC payload

LP-201 carries. The RDMA operation envelope has three fields: two

metadata fields followed by the frame body:

| Field | Size (bytes) | Source |

|---|---|---|

| Stream-type byte | 1 | LP-201 stream-type byte (0xD0..0xDF) — same range, same meaning |

| Sender NodeID | 20 | The sender's identity (sha256(pubkey)[:20]) |

| ZAP frame body | variable | LP-022 envelope (the message itself) |
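As a rough sketch, the envelope in the table can be encoded and decoded as below. The field order (stream-type byte, then NodeID, then body) follows the table rows and is an assumption, as are the helper names `node_id`, `encode_envelope`, and `decode_envelope`. The ZAP frame body is treated as opaque bytes; LP-022 parsing is not modelled.

```python
import hashlib

NODE_ID_LEN = 20
HEADER_LEN = 1 + NODE_ID_LEN  # stream-type byte + sender NodeID


def node_id(pubkey: bytes) -> bytes:
    """NodeID as given in the table: sha256(pubkey)[:20]."""
    return hashlib.sha256(pubkey).digest()[:NODE_ID_LEN]


def encode_envelope(stream_type: int, sender: bytes, zap_body: bytes) -> bytes:
    """Concatenate the three fields in (assumed) table order."""
    if not 0xD0 <= stream_type <= 0xDF:
        raise ValueError("stream type outside LP-201 range 0xD0..0xDF")
    if len(sender) != NODE_ID_LEN:
        raise ValueError("NodeID must be 20 bytes")
    return bytes([stream_type]) + sender + zap_body


def decode_envelope(buf: memoryview) -> tuple[int, bytes, memoryview]:
    """Split a received buffer; the body is returned as a view, not a copy."""
    if len(buf) < HEADER_LEN:
        raise ValueError("short envelope")
    stream_type = buf[0]
    if not 0xD0 <= stream_type <= 0xDF:
        raise ValueError("stream type outside LP-201 range 0xD0..0xDF")
    return stream_type, bytes(buf[1:HEADER_LEN]), buf[HEADER_LEN:]
```

Returning the body as a `memoryview` slice keeps the sketch in line with the zero-copy intent: the parser reads the bytes where they landed.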

RDMA operation choice

LP-207 uses RDMA WRITE_WITH_IMM as the primary verb for unidirectional

gossip (consensus votes, tx gossip, snapshot advertise — LP-201's

unidirectional stream types). RDMA WRITE_WITH_IMM combines a one-sided

write to a remote registered buffer with a 32-bit immediate value

delivered to the receiver's completion queue. The immediate carries:


bits 0..19   sender NodeID hash (low 20 bits)
bits 20..27  stream-type byte (0xD0..0xDF)
bits 28..31  reserved

The receiver polls the completion queue; on completion it knows

(a) which buffer was written by wr_id, (b) the sender from the

immediate, and (c) the stream type from the immediate. The receiver

then validates the buffer's ZAP frame body via the standard LP-022

parser. No additional framing.
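The bit layout of the immediate can be sketched as a pack/unpack pair. How the "low 20 bits" of the NodeID hash are taken is not spelled out above; the sketch assumes the 20-byte NodeID is read as a big-endian integer and masked. The function names are hypothetical.

```python
SENDER_MASK = 0xFFFFF   # bits 0..19
STREAM_SHIFT = 20       # stream-type byte occupies bits 20..27
RESERVED_SHIFT = 28     # bits 28..31 are reserved and kept at zero


def pack_immediate(node_id: bytes, stream_type: int) -> int:
    """Build the 32-bit immediate from a NodeID and a stream-type byte."""
    if len(node_id) != 20:
        raise ValueError("NodeID must be 20 bytes")
    if not 0xD0 <= stream_type <= 0xDF:
        raise ValueError("stream type outside LP-201 range 0xD0..0xDF")
    # Assumption: low 20 bits of the NodeID interpreted as a big-endian integer.
    sender_tag = int.from_bytes(node_id, "big") & SENDER_MASK
    return (stream_type << STREAM_SHIFT) | sender_tag


def unpack_immediate(imm: int) -> tuple[int, int]:
    """Return (sender_tag, stream_type); reject non-zero reserved bits."""
    if not 0 <= imm <= 0xFFFFFFFF:
        raise ValueError("immediate must fit in 32 bits")
    if imm >> RESERVED_SHIFT:
        raise ValueError("reserved bits 28..31 must be zero")
    return imm & SENDER_MASK, (imm >> STREAM_SHIFT) & 0xFF
```

A 20-bit tag cannot identify a sender uniquely on its own, so a receiver would still need the full NodeID (or the frame signature) to attribute a frame; the tag is only a fast-path hint in this sketch.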

For bidirectional RPC (BlockRequest/Response, TxRequest/Response,

ChunkRequest/Response — LP-201's bidirectional stream types) LP-207

uses RDMA SEND + RDMA SEND in a request-response pair. SEND is

two-sided (the receiver posts receive buffers in advance) and matches

the request-response semantic naturally.

| LP-201 stream direction | LP-207 RDMA verb |

|---|---|

| Unidirectional (gossip / push) | WRITE_WITH_IMM (one-sided) |

| Bidirectional (RPC) | SEND + SEND (two-sided) |

Zero-copy receive

The receiver pre-posts receive buffers from the registered pool. When

an RDMA SEND completes, the receiver's polling thread observes a

completion-queue entry with the buffer's wr_id; the buffer is in

pinned RDMA memory, ready for parse via the LP-022 schema. No memcpy,

no kernel hop, no userspace allocation.
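A minimal sketch of the idea in plain Python rather than Verbs: one backing allocation stands in for the registered pool, each slot index plays the role of wr_id, and a completion hands back a `memoryview` slice so the parser reads the bytes in place. The class name `RecvPool`, the slot size, and the slot count are hypothetical; real registration and completion-queue polling are not modelled.

```python
class RecvPool:
    """Fixed pool of receive slots carved from a single allocation."""

    def __init__(self, slots: int, slot_size: int) -> None:
        self._slot_size = slot_size
        self._backing = bytearray(slots * slot_size)  # allocated once, up front
        self._view = memoryview(self._backing)
        self._free = list(range(slots))               # slot index doubles as wr_id

    def post_recv(self) -> int:
        """Hand one free slot to the (simulated) NIC and return its wr_id."""
        if not self._free:
            raise RuntimeError("receive pool exhausted")
        return self._free.pop()

    def slot(self, wr_id: int) -> memoryview:
        """Writable view of one slot; no bytes are copied."""
        start = wr_id * self._slot_size
        return self._view[start:start + self._slot_size]

    def on_completion(self, wr_id: int, byte_len: int) -> memoryview:
        """Return the landed frame as a zero-copy view into the pool."""
        return self.slot(wr_id)[:byte_len]

    def release(self, wr_id: int) -> None:
        """Return a slot to the free list once the frame has been parsed."""
        self._free.append(wr_id)
```

Usage follows the text: call `post_recv()` ahead of time, let the write land in `slot(wr_id)`, then pass `on_completion(wr_id, n)` straight to the parser and `release` the slot afterwards.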

Authentication

InfiniBand fabric is physically trusted. RoCEv2 over Ethernet may not

be. LP-207 supports two authentication modes:

| Mode | Use | Auth |

|---|---|---|

| IB-trusted | InfiniBand fabric within a single DC | None at the per-message layer; trust is fabric-physical |

| RoCEv2 + IPsec | RoCEv2 over potentially-shared Ethernet | IPsec ESP on the RoCEv2 UDP frames; per-message TLS would defeat zero-copy |

The default for greenfield InfiniBand is IB-trusted. The default for

RoCEv2 over a shared L2 is IPsec.

Cross-DC peers ALWAYS fall back to QUIC (LP-201) regardless. The

operator never opens RDMA across a WAN edge; the security and

fragmentation cost is not worth the latency win.

GPUDirect RDMA integration

LP-203 §"GPUDirect path" specifies the NIC writing directly into GPU

memory without crossing the host CPU. LP-207 RDMA + LP-203 GPUDirect

compose:


Sender:                         Receiver:
  GPU produces ZAP frame          NIC writes into GPU memory
  in registered GPU buffer        via GPUDirect RDMA
        ↓                                ↓
  ibv_post_send WRITE_WITH_IMM    GPU polls receive
  to receiver's GPU buffer        buffer for new frames

CPU is not on the path. End-to-end NIC-to-GPU-memory latency on

ConnectX-7 NDR + Blackwell sm_120 is ~1 μs (estimated; matches

Mellanox + NVIDIA published GPUDirect RDMA benchmarks).

The GPU-side verifier (LP-203) processes frames directly from the

RDMA-landed buffer with no host-side roundtrip. For a wave of N

consensus votes, every vote lands in GPU memory in ~1 μs; the GPU

verifies all N in parallel; the host CPU sees only the aggregated

result.

Hardware requirement

GPUDirect RDMA requires the NIC and the GPU to be on the same PCIe

root complex (or connected via NVLink/CXL on GB10/GB200). On Sapphire

Rapids hosts, ensure NIC and GPU are in the same NUMA node. On AMD

Genoa hosts, ensure they share a CCD. Cross-NUMA GPUDirect RDMA

silently degrades to a kernel-mediated path; verify with

`nvidia-smi topo -m` at deployment.

Sybil / byzantine resistance

LP-201 specifies sybil resistance at the validator level via stake

(LP-170 / LP-171) and rate limiting at the QUIC stream layer. LP-207

inherits both, with one caveat:

**RDMA assumes a trusted validator-set within the DC.**

The IB fabric is physical — only validators with a port on the IB

switch can participate. Operator policy controls who connects. This

is the same trust assumption as LP-206 CXL Coherent State Pool. For

sovereign L1 single-DC deployments where the operator runs every

validator, this is acceptable.

For mixed-tenancy deployments where some DC peers are not

operator-controlled (a third-party validator co-located in the same DC),

the operator pins the per-peer transport in the validator config.

The pin overrides LP-201's Transport.Pick() priority list for that

peer only:


peers:
  - nodeID: NodeID-1
    transport: rdma-ib           # full operator control
  - nodeID: NodeID-2
    transport: rdma-rocev2-ipsec # same DC, third party
  - nodeID: NodeID-3
    transport: quic              # cross-DC or untrusted

The pin is per-peer, not per-DC; the choice is the operator's risk

posture. Mechanically, LP-201's Pick consults the pinned name

first and bypasses capability negotiation for pinned peers.
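The pin-first lookup described above might look like the following sketch. The config is shown as an already-parsed list of dictionaries mirroring the example; `build_pins`, `pick_with_pins`, and the `negotiate` callback (standing in for LP-201's capability negotiation) are hypothetical names, not part of any LP.

```python
from typing import Callable

# Parsed form of the example peer config above.
PEERS = [
    {"nodeID": "NodeID-1", "transport": "rdma-ib"},
    {"nodeID": "NodeID-2", "transport": "rdma-rocev2-ipsec"},
    {"nodeID": "NodeID-3", "transport": "quic"},
]


def build_pins(peers: list[dict[str, str]]) -> dict[str, str]:
    """Index pinned transports by NodeID, rejecting duplicate entries."""
    pins: dict[str, str] = {}
    for entry in peers:
        if entry["nodeID"] in pins:
            raise ValueError(f"duplicate pin for {entry['nodeID']}")
        pins[entry["nodeID"]] = entry["transport"]
    return pins


def pick_with_pins(
    node_id: str,
    pins: dict[str, str],
    negotiate: Callable[[str], str],
) -> str:
    """Pinned peers skip negotiation; everyone else goes through it."""
    pinned = pins.get(node_id)
    if pinned is not None:
        return pinned
    return negotiate(node_id)
```

Because the pin is looked up before `negotiate` is called, a pinned peer never triggers capability negotiation, which is the per-peer override behaviour the text describes.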

What an adversary on the IB fabric can do

If an adversary obtains a port on the IB fabric (e.g. via physical

intrusion or firmware compromise of a peer NIC), they can:

1. Forge RDMA WRITEs to other peers' buffers. The forged frames

are LP-022 envelopes and pass the parser, but they MUST be signed

per LP-200 §"Layer 5 — Signature". An unsigned or invalidly-signed

frame is rejected at the application layer.

2. Replay frames. Mitigated by LP-077 round digest binding

(chain_id, epoch, height) — a replayed cert from a previous

round fails the transcript check.

3. Denial of service via buffer exhaustion. The receiver's

registered buffer pool is finite; an attacker can fill it. LP-207

rate-limits per-peer at the Verbs layer (max in-flight WRITEs per

peer, configurable; default 256). A peer that exceeds the limit

has further WRITEs silently dropped.
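The per-peer in-flight cap in item 3 can be sketched as a simple counter. The default of 256 comes from the text; the class name `InflightLimiter` and its methods are hypothetical, and the sketch counts drops only so that an operator could observe them.

```python
from collections import defaultdict

DEFAULT_MAX_INFLIGHT = 256  # default stated in the text; configurable


class InflightLimiter:
    """Caps the number of outstanding WRITEs accepted from each peer."""

    def __init__(self, max_inflight: int = DEFAULT_MAX_INFLIGHT) -> None:
        self._max = max_inflight
        self._inflight: dict[str, int] = defaultdict(int)
        self.dropped: dict[str, int] = defaultdict(int)

    def admit(self, peer: str) -> bool:
        """Return True if the WRITE is accepted, False if it is dropped."""
        if self._inflight[peer] >= self._max:
            self.dropped[peer] += 1  # silent to the peer, visible locally
            return False
        self._inflight[peer] += 1
        return True

    def complete(self, peer: str) -> None:
        """Call when a previously admitted WRITE has been consumed."""
        if self._inflight[peer] > 0:
            self._inflight[peer] -= 1
```

Keeping the counter per peer means one peer filling its quota does not reduce what other peers may send.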

The protocol-layer signature check (LP-200) is the security boundary,

not the transport-layer trust assumption. RDMA over an adversarial

fabric is no more vulnerable than QUIC over an adversarial network —

both rely on the application-layer signature.

Hybrid deployment

A Tier-1 single-DC primary network operating under LP-204 has a mix

of locality classes:

| Validator class | Count (example) | Transport |

|---|---|---|

| Same-rack co-located validators | 4-8 | LP-206 CXL pool |

| Same-DC cross-rack validators | 20-40 | LP-207 RDMA / RoCEv2 |

| Cross-DC validators (DR site, partner) | 5-10 | LP-201 QUIC |

| Global validators (third parties, archival) | 10+ | LP-201 QUIC |

A single validator runs all three transports simultaneously. The

operator does not have to choose at deploy time — the per-peer

transport selection is dynamic. Consensus uses all three in

parallel; the cert leg aggregation across the three classes is

identical (BLS aggregate, Pulsar threshold, Magnetar threshold).

The latency wins compound: intra-rack peers reach quorum in

microseconds via CXL; cross-rack peers reach quorum in hundreds of

microseconds via RDMA; cross-DC peers reach quorum in tens of

milliseconds via QUIC. The cert observability tier (LP-202) advances

through PQ-off → PQ-fast → PQ-strict in the order legs arrive — and

the order is now a function of transport latency, not protocol

ordering.

Composition with LP-201

This LP carries LP-201's stream-type bytes (0xD0..0xDF — ConsensusVote

0xD0, ConsensusProposal 0xD1, ..., DHTStore 0xDF) unchanged. The

stream-type byte rides in bits 20..27 of the RDMA WRITE_WITH_IMM

32-bit immediate (see §"Wire format" above for unidirectional gossip)

or in the first byte of the SEND payload (bidirectional RPC). A

validator that receives a frame does not know whether it arrived via

QUIC (LP-201), RDMA-IB (this LP), or CXL pool read (LP-206) — it sees

a ZAP frame with stream-type 0xD0..0xDF and processes it.

Selector ownership: Transport.Pick(peer) is defined in LP-201

§"Transport.Pick() contract." This LP registers two factories

(rdma-ib, rdma-rocev2) via LP-201's `Transport.Register(name, factory)`

API. The priority slot, capability bits (`CapRDMAIB` = 4,

`CapRDMARoCEv2` = 5), and fall-through behaviour are all

specified in LP-201. This LP MUST NOT re-specify them.

Performance characteristics


Honest gap. Numbers cited from Mellanox / NVIDIA product briefs

and from published ib_write_lat / ib_send_lat / ib_send_bw

RDMA Verbs benchmarks. DGX Spark has no Mellanox NIC — the LP-203

GPUDirect probe failed for this reason. The bench numbers graduate

from cited to measured when a Mellanox-equipped validator

reaches the lab.

The ~20-30× win over QUIC in the same DC is the operational case for

LP-207. For an LP-209 wave running at PQ-strict in the same DC, the

40-validator cert aggregation latency drops from ~3 ms (QUIC) to

~150 μs (RDMA) — a factor that compounds into LP-211 cross-shard 2PC

and LP-218 rollup batch finality.
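The ratios quoted in this LP can be checked with a few lines of arithmetic. The inputs are the cited figures from the text (not measurements), and the variable names are illustrative.

```python
def speedup(baseline_us: float, improved_us: float) -> float:
    """Ratio of baseline latency to improved latency."""
    return baseline_us / improved_us


# 40-validator cert aggregation: ~3 ms over QUIC vs ~150 us over RDMA.
cert_speedup = speedup(3_000.0, 150.0)

# QUIC floor of ~50-100 us vs ~3 us cross-rack RDMA round trip.
rtt_speedup = tuple(speedup(q, 3.0) for q in (50.0, 100.0))
```

The cert-aggregation figures give a factor of 20, and the QUIC floor against the RDMA round trip gives roughly 17 to 33, which brackets the "~20-30×" headline.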

Failure modes

| Failure mode | Cause | Behaviour | Recovery |

|---|---|---|---|

| NIC link down | cable / SFP failure | RDMA work completions fail with IBV_WC_RETRY_EXC_ERR (transport retry counter exceeded); transport returns "unreachable" to selector | Transport.Pick(peer) retries via QUIC fallback for that peer until NIC link returns |

| Switch port congestion | overload | RDMA completion latency rises; per-peer rate-limit triggers; QUIC fallback for excess load | Operator scales switch fabric; congestion control via DCQCN (RoCEv2) or IB credit-based (IB) |

| Pre-registered buffer exhaustion | adversarial fill or bursty workload | Receiver drops further WRITEs from the offending peer | Per-peer rate limit; offender gradually backs off |

| Memory deregistration during operation | OOM kill or driver reset | Verbs return invalid-rkey; transport returns "unreachable" | Re-register on next admit; expensive but rare |

| Cross-NUMA GPUDirect silent degradation | misconfigured PCIe topology | GPU-side reads take ~80 μs instead of ~1 μs | Operator runs nvidia-smi topo -m at deploy; ensures NIC and GPU on same NUMA node |

| Adversary on IB fabric | physical or firmware compromise | Forged frames reach receivers but fail signature check | LP-200 §"Layer 5 — Signature" rejects; transport-layer trust does not gate protocol-layer validity |
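Several rows above end with the transport reporting "unreachable" and the peer being served over QUIC instead. A minimal sketch of that hand-off, assuming each transport is a callable that raises on failure; the `Unreachable` exception and `send_with_fallback` helper are hypothetical, and the real retry policy belongs to LP-201.

```python
from typing import Callable


class Unreachable(Exception):
    """Raised by a transport that cannot currently reach the peer."""


def send_with_fallback(
    frame: bytes,
    primary: Callable[[bytes], None],
    fallback: Callable[[bytes], None],
) -> str:
    """Try the fast path first; on Unreachable, send the same frame via fallback."""
    try:
        primary(frame)
        return "primary"
    except Unreachable:
        # The frame is byte-identical on both transports, so no re-encoding is needed.
        fallback(frame)
        return "fallback"
```

Only `Unreachable` triggers the fallback in this sketch; any other exception propagates, so programming errors are not masked as link failures.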

Composition with other LPs

| LP | Role |

|---|---|

| LP-022 | Wire format for RDMA payloads (unchanged from QUIC payload) |

| LP-200 | The buffer-is-the-value guarantee; RDMA buffers ARE LP-200 buffers |

| LP-201 | The contract this LP implements — defines Transport.Pick(), Transport.Register(), the TransportCapability enum (including CapRDMAIB, CapRDMARoCEv2), and the priority list. This LP only registers two factories against that contract |

| LP-202 | Cert observability inherits whichever transport delivered the legs |

| LP-203 | GPUDirect RDMA writes consensus votes directly into GPU memory for verify |

| LP-204 | Tier-1 single-DC mode is the deployment target |

| LP-206 | Sibling LP-201 implementation — registers cxl-pool for same-rack peers; this LP registers rdma-ib / rdma-rocev2 for cross-rack peers |

| LP-209 | Wave commit cadence is RDMA-latency-bounded for in-DC validators |

| LP-211 | Cross-shard 2PC ack collection latency benefits directly |

| LP-218 | Rollup sequencer fan-in (multiple sequencer nodes assembling a batch) uses RDMA |

The three-tier transport stack (LP-206 CXL → LP-207 RDMA → LP-201

QUIC) is one contract (LP-201) and three registered implementations.

There is no fourth transport spec'd today; future implementations

register against the same contract without changing any consumer call site. Every peer on every chain at every tier is mapped

to a registered implementation by LP-201's Transport.Pick() per

the priority list in LP-201 §"Transport.Pick() contract."

Activation marker


activates: 2025-12-25T16:20:00-08:00
activates-unix: 1766708400

RDMA over InfiniBand is callable from the genesis block of the new

final Lux network. Validators without RDMA hardware silently fall

back to LP-201 QUIC; nothing on the wire changes.
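The two forms of the activation marker can be cross-checked against each other with the standard library. The `is_active` helper is a hypothetical illustration of how a node might gate on the marker, not a specified API.

```python
from datetime import datetime, timedelta

# Values copied from the activation marker above.
ACTIVATES_ISO = "2025-12-25T16:20:00-08:00"
ACTIVATES_UNIX = 1766708400

activates = datetime.fromisoformat(ACTIVATES_ISO)
activates_unix = int(activates.timestamp())


def is_active(now_unix: int) -> bool:
    """True once the wall clock has reached the activation instant."""
    return now_unix >= ACTIVATES_UNIX
```

Parsing the ISO form yields the same instant as the unix form, so the two lines of the marker agree.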

Cross-references

LP-022 — ZAP wire protocol (RDMA payloads are LP-022 envelopes)
LP-200 — ZAP Stack umbrella (RDMA buffers ARE LP-200 buffers)
LP-201 — ZAP P2P (defines the Transport.Pick() contract this

LP implements; defines CapRDMAIB and CapRDMARoCEv2 priority

slots; carries the same 0xD0..0xDF stream-type schema range)

LP-202 — ZAP Pipelining and Atomic Unwind (cert observability

is transport-agnostic)

LP-203 — GPU-native verify (GPUDirect RDMA writes into GPU

memory)

LP-204 — Network of blockchains (Tier-1 single-DC mode is the

deployment target)

LP-206 — CXL Coherent State Pool (sibling LP-201 implementation

registered as cxl-pool for same-rack peers)

LP-209 — Mysticeti-style total order (wave commit cadence

benefits from RDMA latency)

LP-211 — Cross-shard atomic PQ-heavy cert (2PC ack collection

benefits from RDMA latency)

LP-218 — Z-Chain PQ rollups (sequencer fan-in via RDMA)
Mellanox / NVIDIA ConnectX-7 NDR product brief — 400 Gbps,

~600 ns one-way NIC latency

Mellanox / NVIDIA ConnectX-6 HDR product brief — 200 Gbps
InfiniBand Trade Association IB Verbs specification v1.7 —

ibv_post_send semantics

InfiniBand Trade Association, InfiniBand Architecture Specification Volume 1, Annex A17 — RoCEv2 over UDP/IP