All 9 LP-134 chains GPU-native — strict definition satisfied (state +
canonical transition logic both on GPU). CPU only supplies packets, cold
pages, time, attestation, watchdog. No caveats.
LLVM source-based coverage on all five new VMs + cevm/quasar substrate.
Full roll-up with reproduction commands and per-VM analysis lives in
Aggregate: line ≥96% across the five new VMs (avg 97.96%, excludes cevm — see note); CPU reference
oracle — the security-critical byte-equivalence target — clears 90%
branch on every VM where it is the dominant translation unit. Branch
coverage gaps below 90% on whole-VM totals are itemized in each VM's
COVERAGE.md as physically-unreachable defenses (hash-table
linear-probe fallthrough behind arena-cap invariants, GPU driver
allocation-failure paths, switch-default arms over enum types). Bugs
caught and fixed during this push (rotl64 UB at n=0, keccak_f1600
clang -O3 miscompile, WGSL pointer/reserved-keyword issues, MPCVM
WGSL struct stride drift) are listed in the roll-up.
The Quasar substrate (luxcpp/cevm v0.44+) wires every chain's
transition root into QuasarRoundDescriptor; a single QuasarCert
binds the canonical state of all 9 chains via `certificate_subject =
keccak(... || P || C || X || Q || Z || A || B || M || F || ...)` in
fixed canonical order. The five new wave-tick services
(PlatformVMTransition, XVMTransition, AIVMTransition,
BridgeVMTransition, MPCVMTransition) reserve work-queue addresses
for per-VM ingress; the substrate already passes through them with
descriptor-direct writes.
WGSL "—" on C-Chain reflects EVM bytecode interpreter targeting
CUDA/Metal first; WGSL "partial" on Z-Chain reflects Groth16 pairing
arithmetic shipped on CUDA/Metal with WebGPU port pending.
QuasarRoundDescriptor extended with five 32-byte chain transition
roots (xchain_execution_root, achain_state_root, bchain_state_root,
mchain_state_root, fchain_state_root); P/Q/Z remain from v0.42; C
reuses parent_block_hash (cevm round IS C).
compute_certificate_subject recipe extended to an 11×32-byte hash
input in canonical P, C, X, Q, Z, A, B, M, F + parent_state +
parent_execution order. Both host (quasar_sig.hpp) and device
(quasar_wave.metal) layouts updated; descriptor sizeof = 480 bytes.
QuasarRoundResult echoes all 9 roots + certificate_subject_echo so
downstream consumers reconstruct the cert subject from the result
alone; sizeof = 672 bytes.
ServiceId::Count bumped to 17 with five new transition services
reserved at indices 12–16.
The integration test (quasar_9chain_integration_test.mm) proves:
subject keccak input matches the canonical 11-segment
reference byte-for-byte; flipping any single bit in any of the 9
chain roots produces a different subject (cert-binding holds);
swapping two roots also produces a different subject (canonical
order matters); engine echoes all 9 roots back into the result;
tampered descriptor's recompute diverges from the engine's echo.
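The binding properties listed above follow from how the hash input is assembled. A minimal sketch, assuming a host-side struct whose field names (`p`, `c`, `parent_state`, and so on) are illustrative and not the real QuasarRoundDescriptor layout; the keccak step itself is left out, so this only shows the 11×32-byte concatenation in canonical order:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Root = std::array<uint8_t, 32>;

// Hypothetical mirror of the 11 subject segments. Field names are
// illustrative; only the canonical order comes from the text.
struct SubjectRoots {
    Root p, c, x, q, z, a, b, m, f;      // nine chain roots
    Root parent_state, parent_execution; // two parent segments
};

constexpr std::size_t kSegments = 11;
using SubjectInput = std::array<uint8_t, kSegments * 32>;

// Concatenate the segments in canonical P, C, X, Q, Z, A, B, M, F,
// parent_state, parent_execution order. The real recipe then hashes
// this buffer with keccak; that step is omitted here.
inline SubjectInput build_subject_input(const SubjectRoots& r) {
    const Root* order[kSegments] = {&r.p, &r.c, &r.x, &r.q, &r.z, &r.a,
                                    &r.b, &r.m, &r.f,
                                    &r.parent_state, &r.parent_execution};
    SubjectInput out{};
    for (std::size_t i = 0; i < kSegments; ++i) {
        for (std::size_t j = 0; j < 32; ++j) {
            out[i * 32 + j] = (*order[i])[j];
        }
    }
    return out;
}
```

Because every root occupies a fixed 32-byte slot, a single flipped bit or a swap of two distinct roots changes the buffer, and therefore the subject, assuming the hash is collision resistant.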
The QuasarGPU substrate (LP-132) ships with a clear architectural
contract that a future audit must enforce as an *invariant*, not a
goal. This LP names that invariant, classifies every chain-state
object by where it must live, and documents the full optimization
stack required to make it real.
> The invariant: *No chain-local hot path leaves GPU memory.*
>
> CPU/host is allowed only for asynchronous external I/O — network
> ingress, cold-state page service, attestation handshake, watchdog,
> crash recovery, operator control plane. Everything else lives in
> GPU memory: mempool, access prediction, DAG/frontier scheduling,
> EVM fibers, Block-STM, precompiles, receipts, roots, cert lanes,
> DEX matching, compliance gates, audit commitments.
>
> The CPU touches reality. The GPU runs the chain.
This is the v0.42–v0.50 production roadmap on top of Quasar 3.0
(LP-135). Implementation milestones are bounded; the residency
invariant is hard-enforceable via residency-class tagging,
ForbiddenHot CPU-handler counters, and CI assertions.
enum class ResidencyClass : uint8_t {
DeviceHot, // must stay in GPU memory on the fast path
DeviceWarm, // GPU-resident cache, refillable from canonical source
HostCold, // cold canonical source-of-truth; async page-in only
HostControl, // host metadata / control plane only
ForbiddenHot, // SHOULD NEVER appear on the fast-path flamegraph
};
code cache / ABI selector profiles
historical access profiles
hot state windows
validator-set cache
stake table cache
ZK verifying keys
Pulsar public parameters
market metadata
risk / compliance rule tables
archival state (SSD / LSM)
cold trie nodes
full receipt archive
historical logs
full audit export
operator snapshots
watchdog status
kernel launch counters
device health
configuration load
attestation handshake (initial)
network socket ownership when no GPUDirect RDMA
If any of these run on host on the round path, the design has failed:
host-side tx sorting
host-side DAG construction
host-side Block-STM validate / repair
host-side precompile execution
host-side keccak root construction
host-side quorum accumulation
host-side DEX matching
host-side compliance decision
host-side receipt construction
host-side gas accounting
CI gate: a counter forbidden_hot_invocations increments on every
ForbiddenHot call site; CI fails if the counter advances during a
fast-path test.
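One way such a gate could be wired is sketched below. Only the counter name `forbidden_hot_invocations` comes from the text; `note_forbidden_hot`, `FastPathGuard`, and the sample handler are hypothetical names used for illustration:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Process-wide counter named in the text.
inline std::atomic<uint64_t> forbidden_hot_invocations{0};

// Hypothetical hook: call at the top of every host handler that is
// tagged ForbiddenHot.
inline void note_forbidden_hot() {
    forbidden_hot_invocations.fetch_add(1, std::memory_order_relaxed);
}

// Hypothetical test helper: snapshots the counter when a fast-path test
// starts and reports whether it advanced since then.
class FastPathGuard {
public:
    FastPathGuard()
        : start_(forbidden_hot_invocations.load(std::memory_order_relaxed)) {}

    uint64_t violations() const {
        return forbidden_hot_invocations.load(std::memory_order_relaxed) - start_;
    }
    bool clean() const { return violations() == 0; }

private:
    uint64_t start_;
};

// Example of a handler that must never run on the round path.
inline void host_side_tx_sort() {
    note_forbidden_hot();
    // ... host fallback body would go here ...
}
```

A fast-path test would construct a `FastPathGuard`, run the round, and fail when `clean()` returns false.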
using DevOffset = uint32_t;
template <typename T>
struct DevSlice {
DevOffset offset;
uint32_t count;
};
struct ArenaHeader {
uint32_t capacity;
uint32_t bump;
uint32_t high_watermark;
uint32_t gc_epoch;
};
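A minimal single-threaded model of how the header fields could drive a bump allocator. The `kInvalidOffset` sentinel, the alignment parameter, and the function names are assumptions; a device implementation would presumably advance `bump` with an atomic add instead:

```cpp
#include <cassert>
#include <cstdint>

using DevOffset = uint32_t;
constexpr DevOffset kInvalidOffset = 0xFFFFFFFFu;  // hypothetical sentinel

struct ArenaHeader {
    uint32_t capacity;
    uint32_t bump;
    uint32_t high_watermark;
    uint32_t gc_epoch;
};

// Reserve `bytes` at the next `align`-aligned offset (align must be a
// power of two). Returns the start offset, or kInvalidOffset when the
// arena is full; a failed call leaves the header unchanged.
inline DevOffset arena_alloc(ArenaHeader& h, uint32_t bytes, uint32_t align = 16) {
    assert(align != 0 && (align & (align - 1)) == 0);
    const uint64_t start = (uint64_t{h.bump} + align - 1) & ~uint64_t{align - 1};
    const uint64_t end = start + bytes;
    if (end > h.capacity) return kInvalidOffset;
    h.bump = static_cast<uint32_t>(end);
    if (h.bump > h.high_watermark) h.high_watermark = h.bump;
    return static_cast<DevOffset>(start);
}

// Recycle the whole arena at an epoch boundary. The high watermark is
// kept so telemetry still sees peak usage.
inline void arena_reset(ArenaHeader& h) {
    h.bump = 0;
    ++h.gc_epoch;
}
```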
18 canonical arenas:
TxBlobArena DecodedTxArena CalldataArena CodeArena
FiberFrameArena FiberStackArena FiberMemoryArena JournalArena
ReadSetArena WriteSetArena VersionArena ReceiptArena
LogArena RootArena PrecompileArena CertArena
DexArena AuditArena
Rules:
struct WorkItem {
ServiceId service;
DevOffset object_offset;
uint32_t object_count;
uint32_t priority;
uint32_t flags;
};
Payloads live in arenas. Rings carry only (offset, length, type).
Cache-friendly; small ring traffic; trivially RDMA-able.
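To illustrate the "rings carry offsets, arenas carry payloads" split, here is a sketch of a fixed-capacity ring of `WorkItem` descriptors. It is a single-threaded host model; the `WorkRing` class and the two-value `ServiceId` enum are illustrative, and a device ring would need atomic head and tail indices:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

using DevOffset = uint32_t;
enum class ServiceId : uint32_t { Decode = 0, Commit = 1 };  // illustrative subset

struct WorkItem {
    ServiceId service;
    DevOffset object_offset;  // where the payload lives in its arena
    uint32_t object_count;
    uint32_t priority;
    uint32_t flags;
};

// Hypothetical ring: stores descriptors only, never payload bytes.
template <std::size_t N>
class WorkRing {
    static_assert(N != 0 && (N & (N - 1)) == 0, "capacity must be a power of two");

public:
    bool push(const WorkItem& item) {
        if (tail_ - head_ == N) return false;  // full
        slots_[tail_ & (N - 1)] = item;
        ++tail_;
        return true;
    }

    std::optional<WorkItem> pop() {
        if (head_ == tail_) return std::nullopt;  // empty
        WorkItem item = slots_[head_ & (N - 1)];
        ++head_;
        return item;
    }

    std::size_t size() const { return tail_ - head_; }

private:
    std::array<WorkItem, N> slots_{};
    uint32_t head_ = 0;  // free-running; masked on access
    uint32_t tail_ = 0;
};
```

Each slot is 20 bytes regardless of payload size, which is what keeps ring traffic small.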
Optimization checklist:
RingHeader.
struct ServicePressure {
uint32_t queue_depth;
uint32_t deadline_weight;
uint32_t dependency_weight;
uint32_t stall_count;
uint32_t hotness;
};
Per-tick budget:
budget(service) = base
+ queue_depth_weight
+ deadline_weight
+ unblock_weight
- stall_penalty
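The budget formula can be sketched as a saturating integer computation. The `BudgetWeights` struct and its per-unit multipliers are hypothetical tuning knobs, and mapping `dependency_weight` to the formula's unblock term is an assumption; `hotness` is not used by the formula as written:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct ServicePressure {
    uint32_t queue_depth;
    uint32_t deadline_weight;
    uint32_t dependency_weight;
    uint32_t stall_count;
    uint32_t hotness;
};

// Hypothetical tuning knobs; 16-bit multipliers keep the 64-bit
// accumulation below overflow for any 32-bit pressure values.
struct BudgetWeights {
    uint32_t base;
    uint16_t per_queued_item;
    uint16_t per_deadline_unit;
    uint16_t per_unblocked_dependent;
    uint16_t per_stall;
    uint32_t max_budget;
};

// budget = base + queue term + deadline term + unblock term - stall penalty,
// clamped to the range [0, max_budget].
inline uint32_t tick_budget(const ServicePressure& p, const BudgetWeights& w) {
    const uint64_t gain = uint64_t{w.base}
        + uint64_t{p.queue_depth} * w.per_queued_item
        + uint64_t{p.deadline_weight} * w.per_deadline_unit
        + uint64_t{p.dependency_weight} * w.per_unblocked_dependent;
    const uint64_t penalty = uint64_t{p.stall_count} * w.per_stall;
    const uint64_t net = gain > penalty ? gain - penalty : 0;
    return static_cast<uint32_t>(std::min<uint64_t>(net, w.max_budget));
}
```

Clamping at zero means a heavily stalled service receives no budget for the tick and never wraps around to a huge unsigned value.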
Near deadline:
Commit, Root, QuasarCert, QuorumOut.
Decode, MempoolAdmission.
The 3.0 substrate uses fixed gid → service. v0.42 lands adaptive
budgets; v0.43 adds work-stealing across services.
Graph rule: capture for stable topology (same services, same memory
pools, same stream graph). Don't graph fully dynamic topology.
quasar_wave kernel; relaunch is fairness
closing_flag + reads result; nothing else
Target: NIC → GPU event_ingress_ring (GPUDirect RDMA).
Fallback: NIC → host pinned batch → GPU ring.
struct IngressEnvelope {
uint16_t kind; // tx, vote, cert, state page, order
uint16_t flags;
uint32_t len;
uint64_t source_id;
uint64_t seq;
Hash payload_hash;
DevOffset payload_offset; // into TxBlobArena
};
Rings carry envelopes; payloads land in the corresponding arena.
Traffic classes (one ring each):
TC0 cert / votes
TC1 orderflow
TC2 state pages
TC3 tx gossip
TC4 audit / archive
Optimizations: batch small txs into MTU/jumbo/RDMA writes; per-TC
quotas; device-side dedup; device-side replay window; device-side
source quotas.
struct StateRequest {
Hash key;
StateKind kind; // Account, Storage, Code, TrieNode
uint32_t fiber_id;
uint32_t priority;
uint64_t deadline_ns;
};
Fault flow:
1. Fiber misses → emit StateRequest, mark SuspendedState,
schedule other work.
2. Host services via LSM/cache/disk; posts StatePage.
3. StateResp service inserts into DeviceWarm cache; wakes fibers.
State cache:
Page granularity — page by *locality*, not individual trie nodes:
account page contract storage prefix page
market page validator / stake page
code page precompile state page
Multi-GPU / RDMA: store consecutive versions per key/page (Motor-style
VersionBlock per LP-135) so one fetch returns all likely-visible
versions.
Per-fiber struct-of-arrays:
pc[] gas[] status[]
contract[] caller[] value[]
stack_offset[] memory_offset[] journal_offset[]
read_set_offset[] write_set_offset[]
Large fields go in arenas. Per-fiber stack alone (256-bit × 64 entries
× 4096 fibers = 8 MiB ≈ 8.4 MB) is acceptable; full receipt bodies are not.
Opcode grouping — group by execution behavior, not numeric order:
Superinstructions (fuse common patterns):
PUSH + PUSH + SLOAD → fused storage load
CALLDATALOAD + AND/SHR selector → fused dispatch
MLOAD/MSTORE ABI copy → memcpy_abi
ERC20 balance slot calculation → erc20_slot
mapping slot keccak → mapping_slot
LOG append → log_append
Must produce identical gas + exception behavior. Validate against
cevm CPU reference.
struct GasState {
uint64_t remaining;
uint64_t refund;
uint64_t memory_words;
};
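As an example of how `memory_words` feeds gas accounting, the sketch below charges EVM memory expansion using the standard cost function (3 gas per word plus words²/512). The function names and the 4 GiB bound are assumptions, and any real implementation would still need validation against the cevm CPU reference:

```cpp
#include <cassert>
#include <cstdint>

struct GasState {
    uint64_t remaining;
    uint64_t refund;
    uint64_t memory_words;
};

// Total EVM memory cost for `words` 32-byte words.
inline uint64_t memory_cost(uint64_t words) {
    return 3 * words + (words * words) / 512;
}

// Charge the incremental cost of growing memory to cover
// [offset, offset + len). Returns false on out-of-gas and leaves the
// state untouched in that case.
inline bool charge_memory_expansion(GasState& g, uint64_t offset, uint64_t len) {
    if (len == 0) return true;  // zero-length access never expands memory
    constexpr uint64_t kMaxBytes = uint64_t{1} << 32;  // assumed bound, keeps math overflow-free
    if (offset > kMaxBytes || len > kMaxBytes) return false;
    const uint64_t new_words = (offset + len + 31) / 32;
    if (new_words <= g.memory_words) return true;  // already paid for
    const uint64_t delta = memory_cost(new_words) - memory_cost(g.memory_words);
    if (delta > g.remaining) return false;
    g.remaining -= delta;
    g.memory_words = new_words;
    return true;
}
```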
Optimizations:
Keccak appears everywhere: tx hash, sender recovery, mapping slots,
code hash, receipt root, state root, certificate subject, audit
root. Optimization:
struct HashJob {
HashJobKind kind;
DevOffset input_offset;
uint32_t input_len;
DevOffset output_offset;
};
code_hash and selector profiles.
struct PrecompileCall {
uint32_t tx_id;
uint32_t fiber_id;
uint16_t precompile_id;
uint16_t flags;
DevOffset input_offset;
uint32_t input_len;
DevOffset output_offset;
uint32_t output_capacity;
uint64_t gas_budget;
};
struct PrecompileResult {
uint32_t tx_id;
uint32_t fiber_id;
uint16_t status;
uint16_t flags;
uint32_t output_len;
uint64_t gas_used;
};
Fiber suspends on CALL precompile, woken when result arrives.
Per-precompile classes (no mixed-precompile kernel — branch
divergence kills throughput):
__constant__ / threadgroup memory.
blst_pairing_chk_n_aggr_pk_in_g2).
ML-DSA sigs. (Cite LP-020 §3.0.)
public_input_hash = H(certificate_subject || pchain_root || zchain_root || validator_set_root).
general cert artifact ABI stays (offset, len) for forward
compatibility.
For a regulated DEX, compliance cannot be a CPU
callback.
GPU-resident compliance state:
KYC identity commitment jurisdiction flags
accreditation / investor status sanctions snapshot commitment
venue permissions asset transfer restrictions
position / risk limits disclosure / audit policy
Precompile:
ComplianceCheck(account, asset, venue, action, amount, jurisdiction)
→ { allowed | denied, reason_code, audit_commitment, gas_used }
Privacy: keep raw identity data committed/encrypted; GPU operates
over compact eligibility commitments.
Most orderflow on a co-located regulated DEX is not arbitrary
Solidity. Use GPU-native DEX precompiles:
OrderAppend OrderCancel
BatchAuction ContinuousLimitBook
RiskCheck MarginUpdate
FeeAccumulate OMASettlement
AuditCommit
Batch auction as reducer:
1. append order events during batch
2. at boundary: deterministic match
3. settle net account deltas
4. emit audit root
Avoids hot-key SSTORE contention on every order event.
Order book layout (SoA):
market_id → price level pages → order queue offsets
account_id → balance / margin lane
price[] qty[] side[] account[] timestamp_seq[] flags[]
No per-order pointer nodes.
Many "transactions" should be reducers, not writes.
enum class ReducerKind : uint8_t {
Add, Sub, Append,
BalanceDelta, FeeAccumulate,
OrderAppend, OrderCancel, AuctionMatch,
AuditAppend,
};
Commit rule:
1. Collect reducer ops per lane.
2. Sort by canonical order.
3. Apply deterministic reduction.
4. Emit one final state write.
Avoids Block-STM conflict storms on fee counters, audit logs, order
books, batch settlement, liquidity accounting.
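The four-step commit rule can be sketched for the simplest reducer kinds. `ReducerOp`, `canonical_seq`, and the map-based state are hypothetical host-side stand-ins, and only `Add` and `Sub` are modelled:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

enum class ReducerKind : uint8_t { Add, Sub };  // subset of the full enum

// Hypothetical reducer intent emitted by a fiber.
struct ReducerOp {
    uint64_t lane_id;
    uint64_t canonical_seq;  // position in canonical order (assumed unique)
    ReducerKind kind;
    uint64_t amount;
};

// Steps 1-4 of the commit rule: group by lane, sort by canonical order,
// reduce deterministically, and return exactly one final write per lane.
// Arithmetic wraps modulo 2^64 in this sketch.
inline std::map<uint64_t, uint64_t> commit_reducers(
        std::vector<ReducerOp> ops,
        const std::map<uint64_t, uint64_t>& state) {
    std::sort(ops.begin(), ops.end(), [](const ReducerOp& a, const ReducerOp& b) {
        if (a.lane_id != b.lane_id) return a.lane_id < b.lane_id;
        return a.canonical_seq < b.canonical_seq;
    });
    std::map<uint64_t, uint64_t> writes;
    for (const ReducerOp& op : ops) {
        auto it = writes.find(op.lane_id);
        if (it == writes.end()) {
            auto cur = state.find(op.lane_id);
            it = writes.emplace(op.lane_id, cur == state.end() ? 0 : cur->second).first;
        }
        if (op.kind == ReducerKind::Add) it->second += op.amount;
        else it->second -= op.amount;
    }
    return writes;
}
```

For `Add` and `Sub` the sort does not change the result, but it is what makes order-sensitive kinds such as `Append` independent of arrival order.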
lane_id = H(contract, storage_domain, account, market, asset, nonce_lane);
enum class LaneClass : uint8_t {
Owned, Shared, HotShared, Reducer, Serialized, Unknown,
};
Policy:
Scheduler updates lane class continuously from telemetry.
Tier 0 no writes / read-only fast path
Tier 1 lane-clock validation
Tier 2 key-level MVCC visible-version
Tier 3 semantic validation
Tier 4 repair
Avoid running exact MVCC validation for txs that touched only
unchanged owned lanes. Batch validation jobs by read-set length to
reduce branch divergence.
max_fast_repairs = 3
max_total_repairs = 8
hot-lane escalation threshold = 16
Priority:
1. Earlier canonical order first.
2. Unblocks many descendants first.
3. Near commit horizon first.
4. Higher fee only after safety priorities.
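Read as a strict lexicographic order, the four priorities could be expressed as a comparator. The `RepairCandidate` fields are hypothetical, and `canonical_rank` is assumed to be coarse enough (for example a wave index) that the later keys actually break ties:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical view of a tx awaiting repair.
struct RepairCandidate {
    uint32_t tx_id;
    uint64_t canonical_rank;        // lower = earlier in canonical order
    uint32_t blocked_descendants;   // higher = unblocks more work
    uint64_t distance_to_horizon;   // lower = nearer the commit horizon
    uint64_t fee;                   // higher = pays more
};

// Returns true when `a` should be repaired before `b`. Fields where
// larger is better are swapped between the two tuples; tx_id is a final
// tie-break so the order is total and deterministic.
inline bool repair_before(const RepairCandidate& a, const RepairCandidate& b) {
    return std::tie(a.canonical_rank, b.blocked_descendants,
                    a.distance_to_horizon, b.fee, a.tx_id)
         < std::tie(b.canonical_rank, a.blocked_descendants,
                    b.distance_to_horizon, a.fee, b.tx_id);
}
```

Fee appears last in the tuple, which matches the rule that it counts only after the safety priorities are equal.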
Checkpoint rollback (LP-010): rollback to before invalid read;
preserve decoded tx, calldata, code cache, unaffected reads.
Telemetry:
repair_amplification p99_incarnation
full_reexec_count checkpoint_rollback_count
hot_lane_escalations
Target repair_amplification < 1.01 on normal DEX workload.
Fibers emit:
read intent write intent
reducer intent receipt intent log intent
Commit service owns mutation of:
MVCC version chains lane clocks
receipt chains root material commit horizon
Avoids fiber CAS storms on global metadata.
Commit batching:
state_root commits to MVCC final state
receipts_root commits to receipt arena
execution_root commits to ordering / RW set / gas / status / logs
mode_root Nova linear-prefix root or Nebula causal-cut root
audit_root selective-disclosure commitment for regulated DEX
certificate_subject binding for QuasarCert lanes (LP-020)
execution_root commits to:
tx order / DAG frontier read/write commitments
gas used / status logs hash
precompile outputs conflict/repair metadata
Construction:
compact receipt format in GPU memory
logs stored as offset / len
bloom built on GPU
receipt root from compact material
archive export async (host pulls from GPU/SSD pipeline)
Fast path never copies full receipt bodies to host.
For billion-event throughput, do NOT globally replicate every raw
event synchronously. Layered:
raw event blobs (per-validator, ephemeral)
compressed event root (signed)
execution root (signed; LP-020 cert subject input)
settlement root (signed)
audit root (signed; selective disclosure)
selective disclosure data (privileged readers only)
Availability via erasure-coded blobs, co-located DA nodes, regulatory
archive lane, selective replay.
Per NVIDIA H100 confidential compute + Apple Secure Enclave / AMD
SEV-SNP:
certificate_subject.
confidential_attestation_root = H(
cpu_tee_measurement,
gpu_measurement,
quasar_gpu_binary_hash,
precompile_binary_hash,
market_policy_root)
Bind into certificate_subject (LP-020), audit_root, DEX batch
root.
GPU memory contains sensitive orderflow.
Telemetry exposes counts, latencies, roots, reason codes — never
raw orders, identities, unmasked accounts, sensitive compliance
facts.
struct CertArtifact {
QuasarCertLane lane;
Hash subject;
DevOffset artifact_offset;
uint32_t artifact_len;
Hash public_inputs_hash;
};
Per-lane optimization:
Invariant: all lanes bind same certificate_subject.
Shard by:
state lane market
account range precompile type
cert lane root construction stage
Avoid sharding by random tx index (causes cross-GPU state chatter).
Typical 8-GPU topology:
GPU 0 ingress / decode / admission
GPU 1 DEX / private orderflow / compliance
GPU 2 EVM fibers shard A
GPU 3 EVM fibers shard B
GPU 4 STM commit / root
GPU 5 cert lanes
GPU 6 audit / DA compression
GPU 7 replay verifier / hot spare
Cross-GPU communication:
simple transfer ERC20 transfer
DEX order append DEX match / settle
compliance check AMM swap
contract deploy router call
ZK verify cert vote
cold-state heavy unknown arbitrary EVM
Policy:
simple/owned lane → fast path
DEX append → reducer
compliance → precompile batch
ZK / cert → crypto service
unknown EVM → isolated fiber batch
cold-state heavy → lower priority unless near deadline
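The policy table maps naturally onto a pure routing function. The enum names below are illustrative; the text does not say which execution path cold-state-heavy transactions take, so routing them to the isolated fiber batch is an assumption:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical coarse classes matching the policy rows.
enum class TxClass : uint8_t {
    SimpleOwnedLane, DexAppend, Compliance, ZkOrCert, UnknownEvm, ColdStateHeavy
};

enum class ExecPath : uint8_t {
    FastPath, Reducer, PrecompileBatch, CryptoService, IsolatedFiberBatch
};

struct Route {
    ExecPath path;
    bool low_priority;
};

// Pure function of (class, deadline proximity), so every device derives
// the same routing decision.
constexpr Route route(TxClass c, bool near_deadline) {
    switch (c) {
        case TxClass::SimpleOwnedLane: return {ExecPath::FastPath, false};
        case TxClass::DexAppend:       return {ExecPath::Reducer, false};
        case TxClass::Compliance:      return {ExecPath::PrecompileBatch, false};
        case TxClass::ZkOrCert:        return {ExecPath::CryptoService, false};
        case TxClass::UnknownEvm:      return {ExecPath::IsolatedFiberBatch, false};
        case TxClass::ColdStateHeavy:
            // Assumed path; deprioritized unless the deadline is close.
            return {ExecPath::IsolatedFiberBatch, !near_deadline};
    }
    return {ExecPath::IsolatedFiberBatch, false};  // unreachable for valid enum values
}
```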
struct ContractProfile {
Hash code_hash;
uint32_t selector;
Hash predicted_lanes_root;
uint16_t confidence;
uint16_t observed_conflict_rate;
uint32_t avg_gas;
uint32_t cold_miss_rate;
};
First execution samples access set. Promote to known class when
stable; demote on access instability / high conflict / frequent
revert / cold-state-heavy.
Tier 0 fiber VM interpreter
Tier 1 superinstructions
Tier 2 selector-specific traces
Tier 3 contract-specific GPU JIT / AOT
Tier 4 native precompile
Promote on hot selector + stable access set + low divergence + high
volume.
struct JournalSegment {
uint32_t tx_id;
uint32_t depth;
DevOffset start_offset;
DevOffset end_offset;
};
speculative write intents only.
compute address on GPU
hash initcode on GPU
code deposit into CodeArena
dedup identical code hashes
delay code availability until commit
track CREATE2 address conflicts as lane conflicts
Create lane:
lane = H(deployer, nonce_or_salt, initcode_hash)
SELFDESTRUCT: journal only; commit-order resolved; lane marked destructive.
TLOAD / TSTORE: tx-local transient arena; no global MVCC unless cross-frame semantics require.
Never let destructive semantics bypass STM.
Merge from all sources into ConflictSpec:
EIP-2930 access lists historical profile
ABI selector simulation cache
contract profile DEX / precompile known lanes
user-declared spec learned predictor
ConflictSpec drives Prism refraction (LP-010) and lane prefetch.
Device-side filters for fast negative checks:
hot state presence code cache presence
known lane predictor duplicate tx
duplicate vote / cert artifact spent nonce
known-invalid signature
Dedup on GPU:
same tx hash / order id same vote artifact
same cert lane artifact same state request
same code hash same precompile input
same keccak job
DEX/orderflow: dedup cancels/replaces by (account, order_id).
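A host-side sketch of deduplication keyed by (account, order_id). The `OrderEvent` layout and the "highest sequence number wins" rule are assumptions; the text only fixes the key:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

enum class OrderAction : uint8_t { Cancel, Replace };

// Hypothetical cancel/replace event.
struct OrderEvent {
    uint64_t account;
    uint64_t order_id;
    uint64_t seq;  // per-source sequence number
    OrderAction action;
};

// Keep one event per (account, order_id): the one with the highest seq.
// Output is in key order, so it does not depend on arrival order.
inline std::vector<OrderEvent> dedup_order_events(const std::vector<OrderEvent>& in) {
    std::map<std::pair<uint64_t, uint64_t>, OrderEvent> latest;
    for (const OrderEvent& e : in) {
        const auto key = std::make_pair(e.account, e.order_id);
        auto it = latest.find(key);
        if (it == latest.end()) {
            latest.emplace(key, e);
        } else if (e.seq > it->second.seq) {
            it->second = e;
        }
    }
    std::vector<OrderEvent> out;
    out.reserve(latest.size());
    for (const auto& kv : latest) out.push_back(kv.second);
    return out;
}
```

A device version would likely replace the ordered map with a sort-and-compact pass over the same key, which keeps the output deterministic.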
Services read:
deadline_ns current_tick_budget
commit_pressure cert_pressure
Policy:
early admit / explore / execute broadly
mid prioritize high-score frontiers
late freeze admission; repair only commit-horizon blockers
root / certify
imminent emit best valid prefix/cut
GPU makes the decision. Host supplies clock / deadline only.
Candidate score:
score = fees + app rewards + MEV/auction surplus
- conflict_cost - cold_state_cost - repair_cost
- deadline_risk - compliance_risk
Run multiple candidate frontiers in parallel:
high fee candidate low conflict candidate
DEX-priority candidate cert-fast candidate
Pick best certifiable result before deadline.
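The scoring and selection step can be sketched with integer arithmetic only, in keeping with the rule against floating-point consensus math. The `Candidate` struct, its units, and the `certifiable` flag are illustrative assumptions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical per-frontier summary; all terms share one integer unit.
struct Candidate {
    int64_t fees;
    int64_t app_rewards;
    int64_t auction_surplus;
    int64_t conflict_cost;
    int64_t cold_state_cost;
    int64_t repair_cost;
    int64_t deadline_risk;
    int64_t compliance_risk;
    bool certifiable;  // can be rooted and certified before the deadline
};

inline int64_t score(const Candidate& c) {
    return c.fees + c.app_rewards + c.auction_surplus
         - c.conflict_cost - c.cold_state_cost - c.repair_cost
         - c.deadline_risk - c.compliance_risk;
}

// Index of the best certifiable candidate; ties go to the lowest index
// so the choice is deterministic. Empty when nothing is certifiable.
inline std::optional<std::size_t> pick_best(const std::vector<Candidate>& cs) {
    std::optional<std::size_t> best;
    for (std::size_t i = 0; i < cs.size(); ++i) {
        if (!cs[i].certifiable) continue;
        if (!best || score(cs[i]) > score(cs[*best])) best = i;
    }
    return best;
}
```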
Replay protection — bind everything to:
chain_id epoch round mode
validator root P/Q/Z roots
attestation root parent roots
Determinism: no nondeterministic atomic ordering affecting roots,
no floating-point consensus math, no hash-table iteration order, no
unordered reducer output, no race-dependent version chain insertion.
Side channels — for regulated/private lanes:
Metrics exposed:
wave_tick_count service_queue_depths
service_budget_allocations lane_fast_valid_rate
conflict_rate repair_amplification
cold_miss_rate precompile_batch_sizes
root_latency cert_lane_latency
GPU memory pressure arena high-watermarks
RDMA ingress latency
Never expose: private order contents, identities, unmasked accounts.
test_precompile_call_does_not_invoke_cpu_handler
test_keccak_roots_produced_on_gpu
test_receipt_root_produced_on_gpu
test_quorum_status_produced_on_gpu
test_block_stm_repair_scheduled_on_gpu
test_state_miss_suspends_fiber_no_host_fallback
Mechanism: forbidden_hot_invocations counter increments on every
ForbiddenHot call site; CI fails on any increment during fast-path
tests.
cold page delayed duplicate RDMA packet
invalid cert artifact bad Groth16 public input
stale P-chain validator root wrong Q-chain ceremony root
wrong Z-chain VK root hot-lane conflict storm
out-of-memory arena pressure
CPU reference == Metal == CUDA
same state_root
same receipts_root
same execution_root
same certificate_subject
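The cross-backend determinism check reduces to comparing four 32-byte values per round. A minimal sketch, with `RoundRoots` and `backends_agree` as hypothetical names for whatever the harness actually uses:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using Hash = std::array<uint8_t, 32>;

// The four outputs that must match byte-for-byte across backends.
struct RoundRoots {
    Hash state_root;
    Hash receipts_root;
    Hash execution_root;
    Hash certificate_subject;
};

inline bool operator==(const RoundRoots& a, const RoundRoots& b) {
    return a.state_root == b.state_root
        && a.receipts_root == b.receipts_root
        && a.execution_root == b.execution_root
        && a.certificate_subject == b.certificate_subject;
}

// True only when every GPU backend result equals the CPU reference.
inline bool backends_agree(const RoundRoots& cpu_reference,
                           const std::vector<RoundRoots>& gpu_results) {
    for (const RoundRoots& r : gpu_results) {
        if (!(r == cpu_reference)) return false;
    }
    return true;
}
```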
> QuasarGPU reaches minimum latency when all chain-local state
> transitions, including EVM precompiles and consensus-adjacent
> certificate work, execute against device-resident arenas, with the
> host reduced to asynchronous ingress, cold-page service,
> attestation, and watchdog control.
The slogan stays right:
> The CPU touches reality. The GPU runs the chain.
GPU acceleration measured against CPU reference on Apple M1 Max
(32-core integrated GPU, 10.4 TFLOPS FP32, 64 GB unified RAM, macOS
26.4) — full roll-up in
**Acceleration shipped on 9 of 9 chains. BLS pairing fully on-device on
Metal (CUDA build, WGSL full Fp tower); 2 746+ vectors byte-equal blst.
Production binaries clear of blst symbols (CI-asserted). blst pinned to
test-only oracle at luxcpp/crypto/bls/test/cmake/blst.cmake.**
Three production workloads beat CPU end-to-end at the v0.45 (Phase-2)
crossover (F NTT 23.6×, B-Chain BLS 9.5×, C-Chain BLS 9.2×); Phase-3
adds correctness-complete on-device pairing across all stages of the
BLS12-381 tower (Fp/Fp2/Fp6/Fp12 + G2 + Miller + final_exp + e(P,Q)),
plus AI/ML inference byte-equal CPU↔Metal across 1 000 inputs, plus
composite confidential attestation byte-equal C++↔Go.
Headline by chain (representative workload, Phase-1 → Phase-2):
The GPU-residency invariant is satisfied + accelerated at the
architectural level on every chain (state and canonical transition
logic on device, 4-way byte-equal determinism per
LP-137-COVERAGE.md). Phase-2 lands the in-code, line-cited
milestones from Phase-1:
verify_bls_aggregate_batch, verify_bls_same_message_batch, verify_groth16_batch,
verify_corona_batch. Same-message hot path 9.24× at n=1024
(target ≥10×, residual is blst_p1_uncompress cost). Pairing math
itself stays on the canonical host body: the Stage 5b single-CB
Metal driver measured at 475 ms/pairing on M1 Max vs 510 µs host
blst (~930× slower) — Metal at N=1 is structurally bounded by the
serial Fp12 chain mismatching the SIMD GPU shape. Residency
invariant intact (production link graph is blst-symbol-free; the
canonical c-abi body remains the host computation). SoTA single-
pairing path is Linux+CUDA (bls_driver_cuda.cpp stub).
evm_kernel_v2.metal: 32-threads/tx threadgroup dispatcher with V1 fallback at status=255.
Build flag LUX_EVM_KERNEL_V2=ON. SIMD opcode fan-out lands v0.45.x.
bls::pre_verify_inbox shards Miller loops across 10 M1 cores, one
final-exp. 8.58×–10.35× across 1k/5k/10k messages. Per-pairing
Miller compute stays on host (same Stage 5b measurement applies);
Metal SoTA path requires N>1 batched-parallel kernel saturation,
Linux+CUDA pivot, or Karabina compressed cyclo squarings.
EpochTransition (shipped): 6.19×–6.77× Metal speedup on every
measured workload size vs v0.56.
— 18.64× Metal speedup vs v0.61.1 on xlarge.
size-sweep determinism harness, dGPU-ready dispatch shape. M1
integrated GPU dispatch latency dominates; speedup measurable on
discrete CUDA hosts (separate H100 runner).
divergence flagged in v0.55.2 BENCHMARKS.md is unresolved as of
this roll-up. 4-way determinism contract holds at the pinned
harness workload only until v0.55.3 lands.
the production CKKS slot at 23.6×. Wiring FHEpke / FHEbinfhe
through the threshold dispatcher pulls CKKS / BFV / BGV / TFHE
into the same band.
CUDA backends build but were not run on this Apple host; H100 / Ada
self-hosted runners report separately. The CPU reference oracle —
the byte-equivalence ground truth — clears its release-blocking gate
on every chain that ships a Phase-2 GPU engine (cevm 6/6, platformvm
15/15, aivm 47/47, bridgevm 49/49, mpcvm 21/21, fhe primitive parity).
> **Acceleration shipped on 9 of 9 chains. BLS pairing fully on-device
> on Metal (CUDA build, WGSL full Fp tower); 2 746+ vectors byte-equal
> blst. Production binaries clear of blst symbols (CI-asserted). blst
> pinned to test-only oracle at
> luxcpp/crypto/bls/test/cmake/blst.cmake. Substrate-wide geometric
> mean lift **0.17× (Phase-1) → 0.97× (Phase-3 with v0.47.1 pubkey
> cache)** — 5.7× lift, has not yet crossed parity. Three workloads
> beat CPU end-to-end: F NTT 23.6×, B-Chain BLS 9.5×, C-Chain BLS
> 16.51× via pubkey cache. Full numbers, Phase-1↔Phase-2↔Phase-3
> deltas, and BLS pairing-stack vector totals in
> LP-137-BENCHMARKS.md.**
Status of each invariant the LP commits to. "Enforced" = mechanically
asserted in CI on every build. "Satisfied" = code path proven correct
once but not blocked by CI. "Pending" = not yet landed.
> The test oracle is non-authoritative: blst may appear only in test
> targets and never in production link graphs. Items 6 and 7 below are
> the mechanical enforcement of this rule.
LP-137-COVERAGE.md)
quasar_9chain_integration_test.mm, 7 tests)
cevm::crypto::bls::* c-abi. The c-abi body still computes pairing on host CPU; per Stage 5b (2026-04-27) Metal single-pairing on M1 Max is ~930× slower than host blst, so on-device pairing is not in production. SoTA on-device path is Linux+CUDA or batched-N kernel saturation.
cevm/cmake/blst.cmake; blst pinned to luxcpp/crypto/bls/test/cmake/blst.cmake test-only)
var<private> storage and decomposing the upper-tower call tree into single-Fp2-frame leaves so AGXMetalG13X's per-thread function-call stack budget holds)
External:
Copyright (C) 2025, Lux Partners Limited. All rights reserved.