
#### 15.1.6 FFT Multiplication

At large to very large sizes a Fermat style FFT multiplication is used, following Schönhage and Strassen (see References). Descriptions of FFTs in various forms can be found in many textbooks, for instance Knuth section 4.3.3 part C or Lipson chapter IX. A brief description of the form used in GMP is given here.

The multiplication done is x*y mod 2^N+1, for a given *N*. A full product x*y is obtained by choosing N >= bits(x)+bits(y) and padding *x* and *y* with high zero limbs. The modular product is the native form for the algorithm, so padding to get a full product is unavoidable.

The algorithm follows a split, evaluate, pointwise multiply, interpolate and combine similar to that described above for Karatsuba and Toom-3. A *k* parameter controls the split, with an FFT-*k* splitting into *2^k* pieces of *M=N/2^k* bits each. *N* must be a multiple of 2^k * `mp_bits_per_limb` so the split falls on limb boundaries, avoiding bit shifts in the split and combine stages.

The evaluations, pointwise multiplications, and interpolation, are all done modulo *2^N'+1* where *N'* is *2M+k+3* rounded up to a multiple of *2^k* and of `mp_bits_per_limb`. The results of interpolation will be the following negacyclic convolution of the input pieces, and the choice of *N'* ensures these sums aren't truncated.

               ---
               \         b
        w[n] = /     (-1)  * x[i]*y[j]
               ---
           i+j==b*2^k+n
              b=0,1

The points used for the evaluation are *g^i* for *i=0* to *2^k-1* where *g=2^(2N'/2^k)*. *g* is a *2^k*'th root of unity mod *2^N'+1*, which produces necessary cancellations at the interpolation stage, and it's also a power of 2 so the fast Fourier transforms used for the evaluation and interpolation do only shifts, adds and negations.

The pointwise multiplications are done modulo *2^N'+1* and either recurse into a further FFT or use a plain multiplication (Toom-3, Karatsuba or basecase), whichever is optimal at the size *N'*. The interpolation is an inverse fast Fourier transform. The resulting set of sums of *x[i]\*y[j]* are added at appropriate offsets to give the final result.

Squaring is the same, but *x* is the only input so it’s one transform atthe evaluate stage and the pointwise multiplies are squares. Theinterpolation is the same.

For a mod *2^N+1* product, an FFT-*k* is an *O(N^(k/(k-1)))* algorithm, the exponent representing *2^k* recursed modular multiplies each *1/2^(k-1)* the size of the original. Each successive *k* is an asymptotic improvement, but overheads mean each is only faster at bigger and bigger sizes. In the code, `MUL_FFT_TABLE` and `SQR_FFT_TABLE` are the thresholds where each *k* is used. Each new *k* effectively swaps some multiplying for some shifts, adds and overheads.

A mod *2^N+1* product can be formed with a normal *NxN->2N* bit multiply plus a subtraction, so an FFT and Toom-3 etc can be compared directly. A *k=4* FFT at *O(N^1.333)* can be expected to be the first faster than Toom-3 at *O(N^1.465)*. In practice this is what's found, with `MUL_FFT_MODF_THRESHOLD` and `SQR_FFT_MODF_THRESHOLD` being between 300 and 1000 limbs, depending on the CPU. So far it's been found that only very large FFTs recurse into pointwise multiplies above these sizes.

When an FFT is to give a full product, the change of *N* to *2N* doesn't alter the theoretical complexity for a given *k*, but for the purposes of considering where an FFT might be first used it can be assumed that the FFT is recursing into a normal multiply and that on that basis it's doing *2^k* recursed multiplies each *1/2^(k-2)* the size of the inputs, making it *O(N^(k/(k-2)))*. This would mean *k=7* at *O(N^1.4)* would be the first FFT faster than Toom-3. In practice `MUL_FFT_THRESHOLD` and `SQR_FFT_THRESHOLD` have been found to be in the *k=8* range, somewhere between 3000 and 10000 limbs.

The way *N* is split into *2^k* pieces and then *2M+k+3* is rounded up to a multiple of *2^k* and `mp_bits_per_limb` means that when *2^k>=mp_bits_per_limb* the effective *N* is a multiple of *2^(2k-1)* bits. The *+k+3* means some values of *N* just under such a multiple will be rounded up to the next. The complexity calculations above assume that a favourable size is used, meaning one which isn't padded through rounding, and it's also assumed that the extra *+k+3* bits are negligible at typical FFT sizes.

The practical effect of the *2^(2k-1)* constraint is to introduce a step-effect into measured speeds. For example *k=8* will round *N* up to a multiple of 32768 bits, so for a 32-bit limb there'll be 512 limb groups of sizes for which `mpn_mul_n` runs at the same speed. Or for *k=9* groups of 2048 limbs, *k=10* groups of 8192 limbs, etc. In practice it's been found each *k* is used at quite small multiples of its size constraint and so the step effect is quite noticeable in a time versus size graph.

The threshold determinations currently measure at the mid-points of size steps, but this is sub-optimal since at the start of a new step it can happen that it's better to go back to the previous *k* for a while. Something more sophisticated for `MUL_FFT_TABLE` and `SQR_FFT_TABLE` will be needed.
