Lux Proposals
← All proposals
LP-0163Final

Abstract

Karatsuba multiplication for the EIP-198 EVM modexp precompile and forward-compatible RSA-attestation paths at M_len ∈ {512, 1024, 2048, 4096} bits. Replaces the schoolbook O(n²) limb multiplication with the recursive O(n^{log₂ 3}) ≈ O(n^{1.585}) Karatsuba split. At M_len = 4096 (RSA-4096 attestation), the operand fits in 64 limbs of 64 bits, and modexp issues ~6 144 multiplications per exponentiation (sliding-window with w = 5). The CPU-side Karatsuba threshold dispatch ships in the intx luxfi fork; Metal / CUDA / WGSL Karatsuba kernels ship in modexp/gpu/. EIP-198 hot path (M_len ≤ 256) crossover lives below the Karatsuba threshold and is unaffected.

Specification

Algorithm

Karatsuba splits n-limb operands A = A_h · B^{n/2} + A_l and B = B_h · B^{n/2} + B_l (where B = 2^64):


karatsuba_mul(A, B, n):
    if n <= K_THRESHOLD:
        return schoolbook_mul(A, B, n)        # 32-limb cutover
    h = n / 2
    A_l, A_h = split(A, h)
    B_l, B_h = split(B, h)
    z_0 = karatsuba_mul(A_l, B_l, h)
    z_2 = karatsuba_mul(A_h, B_h, h)
    z_1 = karatsuba_mul(A_l + A_h, B_l + B_h, h+1) - z_0 - z_2
    return z_2 * B^{2h} + z_1 * B^h + z_0

Three recursive multiplications + linear-time additions vs four for schoolbook. The cutover K_THRESHOLD = 32 limbs (= 2048 bits) was chosen by sweeping the crossover on each backend; below 2048 bits schoolbook's tighter inner loop wins.

modexp wraps Karatsuba in sliding-window exponentiation:


modexp(b, e, m):
    R = mont_form(1, m)
    pre[i] = mont_mul(b^(2i+1), m) for i in 0..2^{w-1}-1     # window = 5
    for chunk in scan(e, window=5):
        R = mont_sqr(R, m) repeated chunk.zeros + 1 times
        if chunk.window != 0:
            R = mont_mul(R, pre[chunk.window >> 1], m)       # uses karatsuba_mul
    return mont_unform(R, m)

mont_mul reduces with Barrett, then calls karatsuba_mul for the underlying limb product when n > K_THRESHOLD. For M_len ≤ 32 limbs the function dispatches to schoolbook unchanged.

KAT

Performance

Per-modexp wall-clock at e = 65537 (RSA-2048/4096 verify hot path). The reproducible bench is luxcpp/crypto/modexp/test/modexp_karatsuba_bench.cpp (CIOS schoolbook vs Karatsuba-SOS, single-thread, deterministic PRNG, warmup + median):

| M_len (bits) | Prior (intx CIOS schoolbook) | LP-163 (Karatsuba-SOS) | Ratio |
|---:|---|---|---:|
| 256 | unchanged (below cutover) | unchanged | 1.0× |
| 512 | unchanged (below cutover) | unchanged | 1.0× |
| 1024 | unchanged (below cutover) | unchanged | 1.0× |
| 2048 | bench output | bench output | bench output |
| 4096 | bench output | bench output | bench output |

The 2048 / 4096 rows are produced by the modexp_karatsuba_bench binary at build time; the ratio is host-dependent (M1 vs Ada vs Hopper) and recorded in BENCHMARKS.md per host. Numbers are CPU-side (intx fork). Metal / CUDA / WGSL Karatsuba multiplication kernels are wired (modexp/gpu/{metal,cuda,wgsl}/modexp_karatsuba.{metal,cu,wgsl}); the host driver issues three child kernels in parallel for the Karatsuba split when profitable.

Implementation

CPU canonical

GPU kernels

C-ABI surface

Unchanged from LP-158:


int modexp_eip198(const uint8_t *input,    // header + B || E || M
                  size_t         input_len,
                  uint8_t       *output,
                  size_t         output_len);

The Karatsuba path is internal to intx and not exposed to callers.

Determinism harness

Cryptographic safety

Crossover

K_THRESHOLD = 32 limbs (2048 bits) for the limb multiplier itself. modexp dispatcher level: M_len ≤ 256 bytes → schoolbook intx, M_len > 256 bytes → Karatsuba intx. The GPU Karatsuba kernel uses the same threshold; below it, the GPU host driver issues a single workgroup schoolbook multiplication for byte-equality with the CPU body. Per-host crossover where GPU dispatch wins over CPU is recorded in CROSSOVER.md as the linux-amd64 CI lane fills it in.

References