XMCFFT

NAME
     XMCFFT - multiple multitasked complex Fast Fourier Transform (FFT)

SYNOPSIS
     CALL XMCFFT(isign, n, m, scale, x, inc1x, inc2x, y, inc1y, inc2y,
     table, ntable, work, nwork)

DESCRIPTION
     XMCFFT computes the Fourier Transform of each column of the complex
     matrix X, and it stores the results in the columns of matrix Y.

Author: Greg Hammett (hammett@princeton.edu)
	Princeton University Plasma Physics Laboratory.  4 October, 1993.
	(with some help from Gary Kerbel at NERSC, gdk@kerbel.nersc.gov)

I have written a new XMCFFT routine which multitasks on multiple CPU's much
more efficiently than CRAY's original MCFFT routine.  The new routine is
plug compatible with CRAY's MCFFT routine (i.e., it has all of the same
calling arguments as CRAY's MCFFT routine, and it calls the same cfftmlt
kernel routine), so type "man mcfft" to see a description of how to use the
routine.

A few key differences:

*** nwork must be 4*n*m or larger (unlike the original mcfft which allowed
smaller nwork, at the cost of slowing down the code).

*** Make sure to set the environment variable NCPUS, the maximum number of
CPUS you would like to try to use.  NCPUS can be as large as 16 on the
NERSC C-90, but during heavy time sharing it is unlikely to be able to get
more than 2-4 processors, so it usually doesn't pay to set NCPUS larger
than 4.  If not set, XMCFFT will assume NCPUS is 1.

*** The CRAY mcfft man pages claims that table (which holds trig results)
will be initialized properly on the first call, and will reinitialize table
if necessary.  However, I don't know of any foolproof way for xmcfft to
ensure that table has been previously initialized properly, without
reinitializing it.  

Thus, I would recommend always making an initial call of xmcfft with
isign=0 (and n and ntable properly set) so that table can be properly
initialized. 

If table(4) = n, then xmcfft will assume that the rest of the table array
has already been initialized properly.  However, with stack allocation of
variables, it is possible that table(4) remains at n while the rest of
table gets trashed between calls to mcfft.  Then xmcfft would think the
rest of the table is initialized properly, even though it might be garbage.
This is why I recommend always making an isign=0 call first...  Note that
if xmcfft is called with different values of n, you can save different
table arrays for each n to avoid recalculating it.

xmcfft initializes table by calling cftfax (as recommended in the man pages
for cfftmlt). However, cftfax and mcfft initialize table to slightly
different values.  The difference between the two is in the last decimal
place, so the error is extrememly small.  If you want to get identical FFTs
from xmcfft and mcfft, you should initialize table by a call to mcfft with
isign=0 before calling xmcfft with isign=+-1 to do the actual FFT.

OPTIMIZATION NOTES:

The optimization notes in the mcfft man pages apply here as well.
It is very useful to have odd inc2x and inc2y in order to reduce memory
bank conflicts (which can slow the code down by a factor of 4!).

mcfft apparently makes some multitasking decisions very poorly, so that it
actually uses more CPU time as NCPUS is increased!  xmcfft does much
better.  The following compares the cpu time in milliseconds per fft call
(where n=128 FFT's are done m=1024 times):

NCPUS	Vector 		mcfft		xmcfft
	Length		cpu time	cpu time
	m/NCPUS		(ms)		(ms)

1	1024		 6.0		 6.4 
4	256		11.9		 6.8	  MP_HOLDTIME=   50000
16	64		36.7		 7.9	  MP_HOLDTIME=   50000
16	64		65.		10.9	  MP_HOLDTIME=10000000

Thus xmcfft should be about 36.7/4.9 = 4.6 to 65/10.9 = 6.0 times faster
than mcfft a dedicated 16 processor machine.  (Indeed, our gyrofluid plasma
turbulence code is 4.5 times faster using xmcfft rather than mcfft during
Special Parallel Processing (SPP) runs with all 16 processors dedicated.)
(Curiosly, for some unknown reason, on 10-14-93 around 15 minutes after
midnight EDT, the mcfft cpu time for NCPUS=16 and MP_HOLDTIME=50000 dropped
from 36.7 ms to 18.8 ms on one run, though for most runs it fluctuates
between 20 and 26 ms. xmcfft is still significantly faster, especially for
large MP_HOLDTIME (see "man fpp" for info) but I don't know how mcfft
suddenly got faster...)

The code both vectorizes and parallelizes in the second index (while the
FFT's are performed along the first index), so m must be sufficiently large
for this to be efficient.  The operation count for this routine is about
m*alpha*n*log2(n), where alpha ranges from around 3 for n=8 up to 4 for
n=512. If the vector length m/NCPUS > 128, and n>32, this routine can
achieve 450-550 megaflops (dominated by the time spent in the cray
subroutine cfftmlt).  The new xmcftt routine will reduce the actual number
parallel processes (NUCPUS) so that the length of each vector m/NUCPUS >
16.  (If NUCPUS is made larger than this, then the vectors get so short
that the speedup in parallelization is wiped out by the slowdown in
vectorization.)  The following table shows the megaflop speed on one
processor of the C-90 of the underlying cfftmlt routine for a problem with
n=128, as the vector length m is varied:

m	Megaflops

512	495
128	492
 64	423
 32	326
 16	211
  8	123