Lux Proposals
← All proposals
LP-0164Final

Abstract

Six-step Number-Theoretic Transform (Bailey 1990) lifts the canonical NTT (LP-029) past the N = 2^16 ceiling at which the radix-2 Cooley-Tukey path becomes memory-bandwidth-bound on every backend. By factoring N = N_1 · N_2 and running two passes of √N-size sub-NTTs with a transpose between them, all sub-transforms fit in L1 cache (≤ 8 KB for N_1 = N_2 = 2^10) and only one global-memory transpose is required. Required for ML-DSA-87 polynomial cores at N = 2^17 and TFHE bootstrap LWE re-keying at N = 2^20. CPU throughput is 6–9 Mops/s at N = 2^20 measured on Apple M2 Ultra; cross-backend GPU equality is validated structurally on macOS via the CPU-fallthrough oracle. Real-device GPU benchmarks land in CI on the hanzo-build-linux-amd64 runner with CRYPTO_HAS_CUDA=1 / CRYPTO_HAS_DAWN=1.

Specification

Algorithm

Factor N = N_1 · N_2 (typically N_1 = N_2 = √N), then:


six_step_ntt(a, N, omega):
    # 1. Reshape a as N_1 x N_2 row-major matrix
    A[i, j] = a[i * N_2 + j]
    # 2. Column NTTs (size N_1, parallel over N_2 columns)
    for j in 0..N_2-1:
        A[:, j] = ntt(A[:, j], N_1, omega^N_2)
    # 3. Twiddle multiplication
    for i, j:
        A[i, j] *= omega^(i * j)
    # 4. Transpose -> N_2 x N_1
    A = A.T
    # 5. Row NTTs (now columns after transpose, size N_2)
    for j in 0..N_1-1:
        A[:, j] = ntt(A[:, j], N_2, omega^N_1)
    # 6. (Output left in transposed layout for downstream consumers)
    return A.flatten()

Inverse uses omega^{-1} and a final * N^{-1} scale. The two flavors:

Determinism

Step 4 (transpose) is the only memory-bandwidth hit; steps 1, 2, 5 are stream-coalesced. The output layout is transposed by design and downstream consumers (poly-mul pointwise, INTT) consume the transposed form directly. Byte-equality across CPU/Metal/CUDA/WGSL is asserted on the post-step-6 transposed buffer.

KAT

Performance

| N | (N_1, N_2) | Modulus | Backend | Source |
|---:|---|---|---|---|
| 2^16 | (2^8, 2^8) | TFHE 2^64 | Metal | 16.92× over Go ref (LP-137 §2.3 measured) |
| 2^20 | (2^10, 2^10) | TFHE 2^64 | CPU | 6–9 Mops/s on Apple M2 Ultra (single-thread, six-step canonical) |
| 2^17 | (2^9, 2^8) | ML-DSA q | CPU | bench output (ntt/test/six_step_test) |

GPU equality at N ∈ {2^17, 2^18, 2^19, 2^20} is validated structurally on macOS via the CPU-fallthrough oracle: when no GPU device or env flag is set, the dispatcher routes the kernel byte-equal to the CPU canonical, asserting the determinism contract. Real-device CUDA / WGSL throughput at the new sizes lands in BENCHMARKS.md from the hanzo-build-linux-amd64 CI lane with CRYPTO_HAS_CUDA=1 / CRYPTO_HAS_DAWN=1. N = 2^20 is the new ceiling; below N = 2^14 the radix-2 path stays canonical.

Implementation

CPU canonical

GPU kernels

C-ABI surface


// Prime-modulus six-step NTT
int ntt_six_step_prime(uint8_t   *coeffs,        // N * 8 bytes (in/out)
                       uint32_t   N,             // = N_1 * N_2, power-of-2
                       uint64_t   q,             // prime, q ≡ 1 (mod 2N)
                       uint64_t   omega_2N,      // primitive 2N-th root in F_q
                       uint8_t    forward);      // 1 = forward, 0 = inverse

// Power-of-two-modulus six-step NTT (TFHE)
int ntt_six_step_pow2(uint64_t  *coeffs,         // N words (in/out)
                      uint32_t   N,
                      uint8_t    forward);

Determinism harness

Cryptographic safety

Crossover

| N | Backend | Path |
|---:|---|---|
| ≤ 2^13 | any | schoolbook (small-N branch in poly_mul) |
| 2^14 – 2^16 | any | radix-2 Cooley-Tukey (LP-029, existing) |
| 2^17 – 2^20 | CPU/Metal/CUDA/WGSL | six-step (LP-164) |
| > 2^20 | (future) | nine-step (3-D factor), not in this LP |

Recorded in luxcpp/crypto/CROSSOVER.md.

References