Six-step Number-Theoretic Transform (Bailey 1990) lifts the canonical NTT (LP-029) past the N = 2^16 ceiling at which the radix-2 Cooley-Tukey path becomes memory-bandwidth-bound on every backend. By factoring N = N_1 · N_2 and running two passes of √N-size sub-NTTs with a transpose between them, all sub-transforms fit in L1 cache (≤ 8 KB for N_1 = N_2 = 2^10) and only one global-memory transpose is required. Required for ML-DSA-87 polynomial cores at N = 2^17 and TFHE bootstrap LWE re-keying at N = 2^20. CPU throughput is 6–9 Mops/s at N = 2^20 measured on Apple M2 Ultra; cross-backend GPU equality is validated structurally on macOS via the CPU-fallthrough oracle. Real-device GPU benchmarks land in CI on the hanzo-build-linux-amd64 runner with CRYPTO_HAS_CUDA=1 / CRYPTO_HAS_DAWN=1.
Factor N = N_1 · N_2 (typically N_1 = N_2 = √N), then:
six_step_ntt(a, N, omega):
# 1. Reshape a as N_1 x N_2 row-major matrix
A[i, j] = a[i * N_2 + j]
# 2. Column NTTs (size N_1, parallel over N_2 columns)
for j in 0..N_2-1:
A[:, j] = ntt(A[:, j], N_1, omega^N_2)
# 3. Twiddle multiplication
for i, j:
A[i, j] *= omega^(i * j)
# 4. Transpose -> N_2 x N_1
A = A.T
# 5. Row NTTs (now columns after transpose, size N_2)
for j in 0..N_1-1:
A[:, j] = ntt(A[:, j], N_2, omega^N_1)
# 6. (Output left in transposed layout for downstream consumers)
return A.flatten()
Inverse uses omega^{-1} and a final * N^{-1} scale. The two flavors:
q = 8 380 417, ML-KEM q = 3329): 64-bit operands, __uint128_t intermediate, Barrett reduction at the bottleneck. Caller supplies a primitive 2N-th root of unity in F_q.q = 2^64): 64-bit machine arithmetic, no reduction. Twiddle-factor table is precomputed in cyclotomic ring Z_{2^64}[X] / (X^N + 1).Step 4 (transpose) is the only memory-bandwidth hit; steps 1, 2, 5 are stream-coalesced. The output layout is transposed by design and downstream consumers (poly-mul pointwise, INTT) consume the transposed form directly. Byte-equality across CPU/Metal/CUDA/WGSL is asserted on the post-step-6 transposed buffer.
N = 2^17 poly-mul vectors derived from PQClean pqcrystals-dilithium5/poly.N = 2^20 LWE re-key NTT vectors generated by lattigo ring.NTT (Go test oracle).luxcpp/crypto/ntt/test/six_step_kat.json.N | (N_1, N_2) | Modulus | Backend | Source |ntt/test/six_step_test) |GPU equality at N ∈ {2^17, 2^18, 2^19, 2^20} is validated structurally on macOS via the CPU-fallthrough oracle: when no GPU device or env flag is set, the dispatcher routes the kernel byte-equal to the CPU canonical, asserting the determinism contract. Real-device CUDA / WGSL throughput at the new sizes lands in BENCHMARKS.md from the hanzo-build-linux-amd64 CI lane with CRYPTO_HAS_CUDA=1 / CRYPTO_HAS_DAWN=1. N = 2^20 is the new ceiling; below N = 2^14 the radix-2 path stays canonical.
luxcpp/crypto/ntt/cpp/ntt_large.{hpp,cpp} — already exists with skeleton (per head -25 inspection 2026-04-28). LP-164 fills in the prime-modulus branch and the power-of-two-modulus branch + per-N twiddle precompute.luxcpp/crypto/poly_mul/cpp/poly_mul.cpp — extended to dispatch through ntt_large when N > 2^16.lux/crypto/ntt/ Go canonical — NTTLarge(coeffs, N1, N2) wrapper.luxcpp/crypto/ntt/gpu/metal/six_step_ntt.metal — already shipped at four_step_ntt.metal (existing) for the 4-step variant; LP-164 adds the 6-step (twiddle-merged) variant for prime modulus + the TFHE q = 2^64 specialization.luxcpp/crypto/ntt/gpu/cuda/six_step_ntt.cu (NEW).luxcpp/crypto/ntt/gpu/wgsl/six_step_ntt.wgsl (NEW).luxcpp/crypto/ntt/gpu/{metal,cuda,wgsl}/six_step_ntt_driver.*.
// Prime-modulus six-step NTT
int ntt_six_step_prime(uint8_t *coeffs, // N * 8 bytes (in/out)
uint32_t N, // = N_1 * N_2, power-of-2
uint64_t q, // prime, q ≡ 1 (mod 2N)
uint64_t omega_2N, // primitive 2N-th root in F_q
uint8_t forward); // 1 = forward, 0 = inverse
// Power-of-two-modulus six-step NTT (TFHE)
int ntt_six_step_pow2(uint64_t *coeffs, // N words (in/out)
uint32_t N,
uint8_t forward);
luxcpp/crypto/ntt/test/six_step_test.{cpp,mm,cu} — N ∈ {2^14, 2^15, 2^16, 2^17, 2^18, 2^19, 2^20}, 50 random poly inputs per N per modulus class, byte-equal CPU vs Metal vs CUDA vs WGSL.ring.NTT (Go), PQClean pqcrystals-dilithium5/ntt (test-only via ntt/test/cmake/pqclean.cmake).c = (a, b) is public, the secret s is multiplied by a via NTT of public coefficients. The Barrett reduction inside the prime-modulus path is therefore non-secret-dependent.q = 2^64 path is a pure machine-word algorithm (truncate-to-64-bit). No constant-time considerations beyond the surrounding lattice scheme's existing posture (PQClean upstream is constant-time-by-design).ntt_six_step_* API documents this; existing poly_mul callers either all-six-step (forward + pointwise + inverse, same layout cancels) or all-canonical-radix-2 (existing path) — never mixed.N | Backend | Path |poly_mul) |Recorded in luxcpp/crypto/CROSSOVER.md.
q = 8 380 417q = 2^64 ring arithmeticluxcpp/crypto@<sha> impl commit (CTO agent #5, in flight)