LP-0164: Extended-N Lattice NTT (Six-Step)

LP-0164Final

Abstract

Six-step Number-Theoretic Transform (Bailey 1990) lifts the canonical NTT (LP-029) past the N = 2^16 ceiling at which the radix-2 Cooley-Tukey path becomes memory-bandwidth-bound on every backend. By factoring N = N_1 · N_2 and running two passes of √N-size sub-NTTs with a transpose between them, all sub-transforms fit in L1 cache (≤ 8 KB for N_1 = N_2 = 2^10) and only one global-memory transpose is required. Required for ML-DSA-87 polynomial cores at N = 2^17 and TFHE bootstrap LWE re-keying at N = 2^20. CPU throughput is 6–9 Mops/s at N = 2^20 measured on Apple M2 Ultra; cross-backend GPU equality is validated structurally on macOS via the CPU-fallthrough oracle. Real-device GPU benchmarks land in CI on the hanzo-build-linux-amd64 runner with CRYPTO_HAS_CUDA=1 / CRYPTO_HAS_DAWN=1.

Specification

Algorithm

Factor N = N_1 · N_2 (typically N_1 = N_2 = √N), then:


six_step_ntt(a, N, omega):
    # 1. Reshape a as N_1 x N_2 row-major matrix
    A[i, j] = a[i * N_2 + j]
    # 2. Column NTTs (size N_1, parallel over N_2 columns)
    for j in 0..N_2-1:
        A[:, j] = ntt(A[:, j], N_1, omega^N_2)
    # 3. Twiddle multiplication
    for i, j:
        A[i, j] *= omega^(i * j)
    # 4. Transpose -> N_2 x N_1
    A = A.T
    # 5. Row NTTs (now columns after transpose, size N_2)
    for j in 0..N_1-1:
        A[:, j] = ntt(A[:, j], N_2, omega^N_1)
    # 6. (Output left in transposed layout for downstream consumers)
    return A.flatten()

Inverse uses omega^{-1} and a final * N^{-1} scale. The two flavors:

Prime-modulus (ML-DSA q = 8 380 417, ML-KEM q = 3329): 64-bit operands, __uint128_t intermediate, Barrett reduction at the bottleneck. Caller supplies a primitive 2N-th root of unity in F_q.
Power-of-two-modulus (TFHE q = 2^64): 64-bit machine arithmetic, no reduction. Twiddle-factor table is precomputed in cyclotomic ring Z_{2^64}[X] / (X^N + 1).

Determinism

Step 4 (transpose) is the only memory-bandwidth hit; steps 1, 2, 5 are stream-coalesced. The output layout is transposed by design and downstream consumers (poly-mul pointwise, INTT) consume the transposed form directly. Byte-equality across CPU/Metal/CUDA/WGSL is asserted on the post-step-6 transposed buffer.

KAT

ML-DSA-87 N = 2^17 poly-mul vectors derived from PQClean pqcrystals-dilithium5/poly.
TFHE N = 2^20 LWE re-key NTT vectors generated by lattigo ring.NTT (Go test oracle).
luxcpp/crypto/ntt/test/six_step_kat.json.

Performance

|---:|---|---|---|---|

| 2^20 | (2^10, 2^10) | TFHE 2^64 | CPU | 6–9 Mops/s on Apple M2 Ultra (single-thread, six-step canonical) |

| 2^17 | (2^9, 2^8) | ML-DSA q | CPU | bench output (ntt/test/six_step_test) |

GPU equality at N ∈ {2^17, 2^18, 2^19, 2^20} is validated structurally on macOS via the CPU-fallthrough oracle: when no GPU device or env flag is set, the dispatcher routes the kernel byte-equal to the CPU canonical, asserting the determinism contract. Real-device CUDA / WGSL throughput at the new sizes lands in BENCHMARKS.md from the hanzo-build-linux-amd64 CI lane with CRYPTO_HAS_CUDA=1 / CRYPTO_HAS_DAWN=1. N = 2^20 is the new ceiling; below N = 2^14 the radix-2 path stays canonical.

Implementation

CPU canonical

luxcpp/crypto/ntt/cpp/ntt_large.{hpp,cpp} — already exists with skeleton (per head -25 inspection 2026-04-28). LP-164 fills in the prime-modulus branch and the power-of-two-modulus branch + per-N twiddle precompute.
luxcpp/crypto/poly_mul/cpp/poly_mul.cpp — extended to dispatch through ntt_large when N > 2^16.
lux/crypto/ntt/ Go canonical — NTTLarge(coeffs, N1, N2) wrapper.

GPU kernels

Metal: luxcpp/crypto/ntt/gpu/metal/six_step_ntt.metal — already shipped at four_step_ntt.metal (existing) for the 4-step variant; LP-164 adds the 6-step (twiddle-merged) variant for prime modulus + the TFHE q = 2^64 specialization.
CUDA: luxcpp/crypto/ntt/gpu/cuda/six_step_ntt.cu (NEW).
WGSL: luxcpp/crypto/ntt/gpu/wgsl/six_step_ntt.wgsl (NEW).
Driver host code: luxcpp/crypto/ntt/gpu/{metal,cuda,wgsl}/six_step_ntt_driver.*.

C-ABI surface


// Prime-modulus six-step NTT
int ntt_six_step_prime(uint8_t   *coeffs,        // N * 8 bytes (in/out)
                       uint32_t   N,             // = N_1 * N_2, power-of-2
                       uint64_t   q,             // prime, q ≡ 1 (mod 2N)
                       uint64_t   omega_2N,      // primitive 2N-th root in F_q
                       uint8_t    forward);      // 1 = forward, 0 = inverse

// Power-of-two-modulus six-step NTT (TFHE)
int ntt_six_step_pow2(uint64_t  *coeffs,         // N words (in/out)
                      uint32_t   N,
                      uint8_t    forward);

Determinism harness

luxcpp/crypto/ntt/test/six_step_test.{cpp,mm,cu} — N ∈ {2^14, 2^15, 2^16, 2^17, 2^18, 2^19, 2^20}, 50 random poly inputs per N per modulus class, byte-equal CPU vs Metal vs CUDA vs WGSL.
Cross-oracle: lattigo ring.NTT (Go), PQClean pqcrystals-dilithium5/ntt (test-only via ntt/test/cmake/pqclean.cmake).

Cryptographic safety

NTT operates on public polynomial coefficients in lattice schemes — the ciphertext c = (a, b) is public, the secret s is multiplied by a via NTT of public coefficients. The Barrett reduction inside the prime-modulus path is therefore non-secret-dependent.
The TFHE q = 2^64 path is a pure machine-word algorithm (truncate-to-64-bit). No constant-time considerations beyond the surrounding lattice scheme's existing posture (PQClean upstream is constant-time-by-design).
The transposed output layout means downstream consumers MUST be aware of layout to pair forward + inverse correctly. The ntt_six_step_* API documents this; existing poly_mul callers either all-six-step (forward + pointwise + inverse, same layout cancels) or all-canonical-radix-2 (existing path) — never mixed.

Crossover

| N | Backend | Path |

|---:|---|---|

| ≤ 2^13 | any | schoolbook (small-N branch in poly_mul) |

| 2^14 – 2^16 | any | radix-2 Cooley-Tukey (LP-029, existing) |

| 2^17 – 2^20 | CPU/Metal/CUDA/WGSL | six-step (LP-164) |

| > 2^20 | (future) | nine-step (3-D factor), not in this LP |

Recorded in luxcpp/crypto/CROSSOVER.md.

References

Bailey, "FFTs in External or Hierarchical Memory" (1990) — six-step formulation
Longa, Naehrig, "Speeding up the Number-Theoretic Transform for Faster Ideal Lattice-Based Cryptography" (2016)
FIPS 204 §7 (ML-DSA NTT) — q = 8 380 417
TFHE-rs source (Zama) — q = 2^64 ring arithmetic
LP-029 (NTT primitive — radix-2 baseline)
LP-137 (umbrella, §2.3 Metal NTT 16.92× reproduction)
LP-066 (Threshold FHE — TFHE bootstrap consumer)
LP-070 (ML-DSA — N = 2^17 consumer)
LP-159 (poly_mul — direct caller)
Forward-references: luxcpp/crypto@<sha> impl commit (CTO agent #5, in flight)