

15.1.6 FFT Multiplication

At large to very large sizes a Fermat style FFT multiplication is used, following Schönhage and Strassen (see References). Descriptions of FFTs in various forms can be found in many textbooks, for instance Knuth section 4.3.3 part C or Lipson chapter IX. A brief description of the form used in GMP is given here.

The multiplication done is x*y mod 2^N+1, for a given N. A full product x*y is obtained by choosing N>=bits(x)+bits(y) and padding x and y with high zero limbs. The modular product is the native form for the algorithm, so padding to get a full product is unavoidable.
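As a small sanity check of why that padding recovers the full product, the following illustrative fragment (plain C on machine words, not GMP's limb arrays, and not part of GMP) shows that once N >= bits(x)+bits(y) the product is below 2^N, so reducing mod 2^N+1 changes nothing:

     #include <assert.h>
     int
     main (void)
     {
       unsigned long x = 0xB3, y = 0x5D;     /* two 8-bit operands           */
       unsigned long m = (1UL << 16) + 1;    /* 2^N+1 with N = 16 >= 8+8     */
       assert (x * y % m == x * y);          /* modular product == full one  */
       return 0;
     }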

The algorithm follows a split, evaluate, pointwise multiply, interpolate and combine similar to that described above for Karatsuba and Toom-3. A k parameter controls the split, with an FFT-k splitting into 2^k pieces of M=N/2^k bits each. N must be a multiple of (2^k)*mp_bits_per_limb so the split falls on limb boundaries, avoiding bit shifts in the split and combine stages.
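A minimal sketch of how those split parameters might be computed (the helper name is hypothetical, not GMP's internal code; mp_bits_per_limb is the global from gmp.h):

     #include <gmp.h>
     /* Round N up so the FFT-k split falls on limb boundaries, and
        return the piece size M = N/2^k in bits.  Illustration only.  */
     static unsigned long
     fft_split_params (unsigned long N, int k, unsigned long *M)
     {
       unsigned long step = ((unsigned long) 1 << k) * mp_bits_per_limb;
       N = (N + step - 1) / step * step;   /* multiple of 2^k * mp_bits_per_limb */
       *M = N >> k;                        /* each of the 2^k pieces is M bits   */
       return N;
     }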

The evaluations, pointwise multiplications and interpolation are all done modulo 2^N'+1 where N' is 2M+k+3 rounded up to a multiple of 2^k and of mp_bits_per_limb. The results of interpolation will be the following negacyclic convolution of the input pieces, and the choice of N' ensures these sums aren't truncated.

     w[n] = sum of (-1)^b * x[i]*y[j],  over i+j == b*2^k + n,  b = 0, 1
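For reference, a naive version of this negacyclic convolution on small word-sized pieces might look as follows (illustration only; in GMP the pieces are multi-limb numbers handled modulo 2^N'+1):

     /* Compute w[n] = sum of (-1)^b * x[i]*y[j] over i+j == b*2^k + n,
        b = 0,1, for 2^k word-sized pieces.  Illustration only.  */
     static void
     negacyclic_naive (const long *x, const long *y, long *w, int k)
     {
       int pieces = 1 << k;
       for (int n = 0; n < pieces; n++)
         w[n] = 0;
       for (int i = 0; i < pieces; i++)
         for (int j = 0; j < pieces; j++)
           {
             if (i + j < pieces)
               w[i + j] += x[i] * y[j];            /* b = 0               */
             else
               w[i + j - pieces] -= x[i] * y[j];   /* b = 1, sign (-1)^b  */
           }
     }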

The points used for the evaluation are g^i for i=0 to 2^k-1 where g=2^(2N'/2^k). g is a 2^k'th root of unity mod 2^N'+1, which produces necessary cancellations at the interpolation stage, and it's also a power of 2 so the fast Fourier transforms used for the evaluation and interpolation do only shifts, adds and negations.
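A tiny numeric check of the root-of-unity property, with k=2 and N'=8 so everything fits in a machine word (illustration only; in GMP the operands are limb arrays and multiplying by a power of 2 is just a shift):

     /* g = 2^(2N'/2^k) = 2^4 = 16;  g^(2^k) == 1 mod 2^N'+1, because
        2^N' == -1 mod 2^N'+1.  */
     #include <assert.h>
     int
     main (void)
     {
       unsigned long m = (1UL << 8) + 1;   /* 2^N'+1 = 257                 */
       unsigned long g = 1UL << 4;         /* 2^(2*8/4) = 16               */
       unsigned long t = 1;
       for (int i = 0; i < 4; i++)         /* raise g to 2^k = 4           */
         t = t * g % m;
       assert (t == 1);                    /* g is a 2^k'th root of unity  */
       return 0;
     }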

The pointwise multiplications are done modulo 2^N'+1 and either recurse into a further FFT or use a plain multiplication (Toom-3, Karatsuba or basecase), whichever is optimal at the size N'. The interpolation is an inverse fast Fourier transform. The resulting set of sums of x[i]*y[j] are added at appropriate offsets to give the final result.
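Written out, the combine step follows from the piece decompositions x = sum of x[i]*2^(i*M) and y = sum of y[j]*2^(j*M):

     x*y  =  sum over i,j of x[i]*y[j] * 2^((i+j)*M)
         ==  sum over n=0..2^k-1 of w[n] * 2^(n*M)     (mod 2^N+1)

since 2^(2^k*M) = 2^N == -1 mod 2^N+1, terms with i+j = 2^k+n fold back to offset n*M with a negated sign, which is exactly the (-1)^b in w[n] above.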

Squaring is the same, but x is the only input so it's one transform at the evaluate stage and the pointwise multiplies are squares. The interpolation is the same.

For a mod 2^N+1 product, an FFT-k is an O(N^(k/(k-1))) algorithm, the exponent representing 2^k recursed modular multiplies each 1/2^(k-1) the size of the original. Each successive k is an asymptotic improvement, but overheads mean each is only faster at bigger and bigger sizes. In the code, MUL_FFT_TABLE and SQR_FFT_TABLE are the thresholds where each k is used. Each new k effectively swaps some multiplying for some shifts, adds and overheads.
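In recurrence form (a rough model which ignores the shift and add work):

     T(N) = 2^k * T(N/2^(k-1))   =>   T(N) = O(N^(k/(k-1)))
     k=4: N^1.333   k=5: N^1.250   k=6: N^1.200   k=7: N^1.167   k=8: N^1.143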

A mod 2^N+1 product can be formed with a normal NxN->2N bit multiply plus a subtraction, so an FFT and Toom-3 etc can be compared directly. A k=4 FFT at O(N^1.333) can be expected to be the first faster than Toom-3 at O(N^1.465). In practice this is what's found, with MUL_FFT_MODF_THRESHOLD and SQR_FFT_MODF_THRESHOLD being between 300 and 1000 limbs, depending on the CPU. So far it's been found that only very large FFTs recurse into pointwise multiplies above these sizes.
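A small illustration of that reduction, on machine words rather than limb arrays (not GMP code): writing the full product as P = H*2^N + L, the identity 2^N == -1 mod 2^N+1 gives P == L - H.

     #include <assert.h>
     int
     main (void)
     {
       unsigned long m = (1UL << 8) + 1;        /* 2^N+1 with N = 8     */
       unsigned long x = 0xE7, y = 0xC5;        /* N-bit operands       */
       unsigned long p = x * y;                 /* full 2N-bit product  */
       unsigned long l = p & 0xFF, h = p >> 8;  /* P = H*2^N + L        */
       long r = (long) l - (long) h;            /* the subtraction L-H  */
       if (r < 0)
         r += (long) m;                         /* bring into range     */
       assert ((unsigned long) r == p % m);     /* equals P mod 2^N+1   */
       return 0;
     }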

When an FFT is to give a full product, the change of N to 2N doesn't alter the theoretical complexity for a given k, but for the purposes of considering where an FFT might be first used it can be assumed that the FFT is recursing into a normal multiply and that on that basis it's doing 2^k recursed multiplies each 1/2^(k-2) the size of the inputs, making it O(N^(k/(k-2))). This would mean k=7 at O(N^1.4) would be the first FFT faster than Toom-3. In practice MUL_FFT_THRESHOLD and SQR_FFT_THRESHOLD have been found to be in the k=8 range, somewhere between 3000 and 10000 limbs.
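Under that full-product view the rough recurrence becomes:

     T(N) = 2^k * T(N/2^(k-2))   =>   T(N) = O(N^(k/(k-2)))
     k=7:  7/5 = 1.400  <  log(5)/log(3) = 1.465  (Toom-3)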

The way N is split into 2^k pieces and then 2M+k+3 is rounded up to a multiple of 2^k and mp_bits_per_limb means that when 2^k>=mp_bits_per_limb the effective N is a multiple of 2^(2k-1) bits. The +k+3 means some values of N just under such a multiple will be rounded to the next. The complexity calculations above assume that a favourable size is used, meaning one which isn't padded through rounding, and it's also assumed that the extra +k+3 bits are negligible at typical FFT sizes.
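Roughly, ignoring the +k+3, the constraint comes from N' being close to 2M and stepping in units of 2^k:

     N' ~= 2M = 2N/2^k  steps by 2^k   =>   N steps by 2^k * 2^k / 2 = 2^(2k-1) bits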

The practical effect of the 2^(2k-1) constraint is to introduce a step-effect into measured speeds. For example k=8 will round N up to a multiple of 32768 bits, so for a 32-bit limb there'll be 512 limb groups of sizes for which mpn_mul_n runs at the same speed. Or for k=9 groups of 2048 limbs, k=10 groups of 8192 limbs, etc. In practice it's been found each k is used at quite small multiples of its size constraint and so the step effect is quite noticeable in a time versus size graph.
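Since N covers bits(x)+bits(y), each mpn_mul_n operand size steps by roughly half the step in N, which is where those group sizes come from:

     step in N:         2^(2k-1) bits
     step per operand:  2^(2k-2) bits
     32-bit limb:       k=8: 2^14/32 = 512 limbs   k=9: 2048 limbs   k=10: 8192 limbs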

The threshold determinations currently measure at the mid-points of size steps, but this is sub-optimal since at the start of a new step it can happen that it's better to go back to the previous k for a while. Something more sophisticated for MUL_FFT_TABLE and SQR_FFT_TABLE will be needed.
