XMCFFT NAME XMCFFT - multiple multitasked complex Fast Fourier Transform (FFT) SYNOPSIS CALL XMCFFT(isign, n, m, scale, x, inc1x, inc2x, y, inc1y, inc2y, table, ntable, work, nwork) DESCRIPTION XMCFFT computes the Fourier Transform of each column of the complex matrix X, and it stores the results in the columns of matrix Y. Author: Greg Hammett (hammett@princeton.edu) Princeton University Plasma Physics Laboratory. 4 October, 1993. (with some help from Gary Kerbel at NERSC, gdk@kerbel.nersc.gov) I have written a new XMCFFT routine which multitasks on multiple CPU's much more efficiently than CRAY's original MCFFT routine. The new routine is plug compatible with CRAY's MCFFT routine (i.e., it has all of the same calling arguments as CRAY's MCFFT routine, and it calls the same cfftmlt kernel routine), so type "man mcfft" to see a description of how to use the routine. A few key differences: *** nwork must be 4*n*m or larger (unlike the original mcfft which allowed smaller nwork, at the cost of slowing down the code). *** Make sure to set the environment variable NCPUS, the maximum number of CPUS you would like to try to use. NCPUS can be as large as 16 on the NERSC C-90, but during heavy time sharing it is unlikely to be able to get more than 2-4 processors, so it usually doesn't pay to set NCPUS larger than 4. If not set, XMCFFT will assume NCPUS is 1. *** The CRAY mcfft man pages claims that table (which holds trig results) will be initialized properly on the first call, and will reinitialize table if necessary. However, I don't know of any foolproof way for xmcfft to ensure that table has been previously initialized properly, without reinitializing it. Thus, I would recommend always making an initial call of xmcfft with isign=0 (and n and ntable properly set) so that table can be properly initialized. If table(4) = n, then xmcfft will assume that the rest of the table array has already been initialized properly. However, with stack allocation of variables, it is possible that table(4) remains at n while the rest of table gets trashed between calls to mcfft. Then xmcfft would think the rest of the table is initialized properly, even though it might be garbage. This is why I recommend always making an isign=0 call first... Note that if xmcfft is called with different values of n, you can save different table arrays for each n to avoid recalculating it. xmcfft initializes table by calling cftfax (as recommended in the man pages for cfftmlt). However, cftfax and mcfft initialize table to slightly different values. The difference between the two is in the last decimal place, so the error is extrememly small. If you want to get identical FFTs from xmcfft and mcfft, you should initialize table by a call to mcfft with isign=0 before calling xmcfft with isign=+-1 to do the actual FFT. OPTIMIZATION NOTES: The optimization notes in the mcfft man pages apply here as well. It is very useful to have odd inc2x and inc2y in order to reduce memory bank conflicts (which can slow the code down by a factor of 4!). mcfft apparently makes some multitasking decisions very poorly, so that it actually uses more CPU time as NCPUS is increased! xmcfft does much better. The following compares the cpu time in milliseconds per fft call (where n=128 FFT's are done m=1024 times): NCPUS Vector mcfft xmcfft Length cpu time cpu time m/NCPUS (ms) (ms) 1 1024 6.0 6.4 4 256 11.9 6.8 MP_HOLDTIME= 50000 16 64 36.7 7.9 MP_HOLDTIME= 50000 16 64 65. 10.9 MP_HOLDTIME=10000000 Thus xmcfft should be about 36.7/4.9 = 4.6 to 65/10.9 = 6.0 times faster than mcfft a dedicated 16 processor machine. (Indeed, our gyrofluid plasma turbulence code is 4.5 times faster using xmcfft rather than mcfft during Special Parallel Processing (SPP) runs with all 16 processors dedicated.) (Curiosly, for some unknown reason, on 10-14-93 around 15 minutes after midnight EDT, the mcfft cpu time for NCPUS=16 and MP_HOLDTIME=50000 dropped from 36.7 ms to 18.8 ms on one run, though for most runs it fluctuates between 20 and 26 ms. xmcfft is still significantly faster, especially for large MP_HOLDTIME (see "man fpp" for info) but I don't know how mcfft suddenly got faster...) The code both vectorizes and parallelizes in the second index (while the FFT's are performed along the first index), so m must be sufficiently large for this to be efficient. The operation count for this routine is about m*alpha*n*log2(n), where alpha ranges from around 3 for n=8 up to 4 for n=512. If the vector length m/NCPUS > 128, and n>32, this routine can achieve 450-550 megaflops (dominated by the time spent in the cray subroutine cfftmlt). The new xmcftt routine will reduce the actual number parallel processes (NUCPUS) so that the length of each vector m/NUCPUS > 16. (If NUCPUS is made larger than this, then the vectors get so short that the speedup in parallelization is wiped out by the slowdown in vectorization.) The following table shows the megaflop speed on one processor of the C-90 of the underlying cfftmlt routine for a problem with n=128, as the vector length m is varied: m Megaflops 512 495 128 492 64 423 32 326 16 211 8 123