Details on how I used the NAS benchmarks.

The speeds reported are from the NPB1 benchmarks for the class A FT problem, which is for FFTs on a 256x256x128 grid (class B uses a 512x256x256 grid and so would scale to a larger computer). The figure I've put together is a composite taken from a couple of different sources. In essence, it is a selective merger of the NPB2 results at:
http://science.nas.nasa.gov/Software/NPB/NPB2Results/971117/g2a.ft.A.html
and the NPB1 results at
http://science.nas.nasa.gov/Software/NPB/NPB1Results/971126/g2a.ft.A.html.

Most of the actual numbers came out of tables in http://science.nas.nasa.gov/Software/NPB/Reports/NAS-96-018.fm.html. I converted their normalized results to MFLOP/sec rates, using their quoted figure of 196 MFLOP/sec for the Cray Y-MP on this FT problem. The NEC SX4 results for the class A FT are missing, so I used the results from the class B FT (which is on a larger grid) instead.
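The conversion above amounts to a single multiplication. Here is a minimal sketch of it, assuming the normalized results in the report are performance ratios relative to the Cray Y-MP (so a larger ratio means a faster machine). The function name and the example ratio are made up for illustration; only the 196 MFLOP/sec reference figure comes from the text.

```python
# Reference rate quoted in the text for the Cray Y-MP on the class A FT problem.
YMP_FT_CLASS_A_MFLOPS = 196.0


def normalized_to_mflops(ratio_to_ymp, reference_mflops=YMP_FT_CLASS_A_MFLOPS):
    """Convert a Y-MP-normalized result to an MFLOP/sec rate.

    Assumption: the normalized value is (machine rate) / (Y-MP rate),
    so multiplying by the Y-MP rate recovers the machine's rate.
    """
    if ratio_to_ymp < 0:
        raise ValueError("a performance ratio cannot be negative")
    return ratio_to_ymp * reference_mflops


# Hypothetical ratio, not a value taken from the NAS tables.
example_ratio = 2.5
example_rate = normalized_to_mflops(example_ratio)
print(f"{example_ratio} x Y-MP is about {example_rate:.0f} MFLOP/sec")
```

A machine listed at exactly 1.0 would come out at 196 MFLOP/sec, which is a quick sanity check on the direction of the ratio.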

Results for a T3E-900 are not directly given for the NPB1 class A FT. Instead, I used the results for the T3E-900 on the NPB2 class A FT problem, as given in the NPB2 section of the NAS web site, and corrected them by the ratio of NPB1 to NPB2 compute time also given there for the T3E-900. The NPB2 benchmarks are written in MPI, with minimal modifications to run on each computer. The NPB1 FFT benchmarks are allowed to make heavy use of vendor-written optimized machine code. For the class A FT problem on the T3E-900, NPB1 is about 2-4 times faster than NPB2 (depending on the number of processors).
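The T3E-900 correction can be sketched the same way. This assumes the correction is simply the NPB2 rate scaled by how much less time NPB1 takes on the same problem and processor count. The function name and the numbers in the example are hypothetical, not values from the NAS site.

```python
def estimate_npb1_rate(npb2_mflops, npb2_seconds, npb1_seconds):
    """Estimate an NPB1 rate from an NPB2 rate and the two compute times.

    Assumption: both runs solve the same class A FT problem on the same
    number of processors, so the rate scales inversely with compute time.
    """
    if npb2_mflops <= 0 or npb2_seconds <= 0 or npb1_seconds <= 0:
        raise ValueError("rates and times must be positive")
    speedup = npb2_seconds / npb1_seconds  # about 2-4 per the text
    return npb2_mflops * speedup


# Hypothetical inputs: NPB2 at 100 MFLOP/sec, NPB1 three times faster.
estimate = estimate_npb1_rate(npb2_mflops=100.0, npb2_seconds=6.0, npb1_seconds=2.0)
print(f"estimated NPB1 rate: {estimate:.0f} MFLOP/sec")
```

Since the speedup depends on the number of processors, the ratio would need to be taken at the same processor count as the NPB2 result being corrected.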