Comparison of Pentium4 and AMD 1900+ performance

(petrel002 & petrel025 nodes at pppl.gov)

Benchmarks:
  daxpy loop:    y(i) = y(i) + a*x(i)    for vector lengths n = 100 or 10^7.
                 Small vectors fit in cache and measure CPU speed; large
                 vectors don't fit and measure memory bandwidth.
                 [speed benchmark source files]
  STREAM triad:  y(i) = w(i) + a*x(i)    for large vectors (> 2M words).
                 http://www.cs.virginia.edu/stream/
                 [my version of stream benchmark]
  (Both loops are sketched in the code after the table.)
  ifc (Intel) and lf95 (Lahey) are both Fortran95 compilers.  MW/s means
  millions of 64-bit words per second, so 1 MW/s = 8 MB/s.

Processor (node)          CPU / Bus speed      daxpy MFLOPS (small/large)      STREAM triad bandwidth
                                               Intel ifc      Lahey lf95
-----------------------------------------------------------------------------------------------------
Pentium4/Xeon             1.7 GHz / 400 MHz    1523 / 176     671 / 173        1600 MB/s = 200 MW/s
  (petrel002)
AMD Athlon MP1900+        1.6 GHz / 266 MHz    1032 /  70     1011 /  85        800 MB/s = 100 MW/s
  (petrel025)
Cray C-90 (circa 1991)         --                  --             --           9500 MB/s/proc = 1187 MW/s/proc
NEC SX-6                       --                  --             --           4000 MW/s/proc
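For reference, the two timed loops have the following shape.  This is a
minimal sketch, not the actual benchmark source linked above; the timing
harness and array setup here are my own.

    ! Sketch of the two kernels in the table above.  Arrays hold 64-bit
    ! words; n = 100 fits in cache, n = 10**7 does not.  (For n = 100 the
    ! loops must be repeated many times to accumulate a measurable time.)
    program kernels
      implicit none
      integer, parameter :: n = 10**7
      real(kind=8), allocatable :: w(:), x(:), y(:)
      real(kind=8) :: a, t0, t1
      integer :: i
      allocate(w(n), x(n), y(n))
      a = 3d0; w = 1d0; x = 2d0; y = 0d0
      call cpu_time(t0)
      do i = 1, n                  ! daxpy: 2 flops, 3 words moved per iteration
         y(i) = y(i) + a*x(i)
      end do
      call cpu_time(t1)
      print *, 'daxpy MFLOPS:', 2.0d-6*n/(t1 - t0)
      call cpu_time(t0)
      do i = 1, n                  ! STREAM triad: 2 flops, 3 words moved
         y(i) = w(i) + a*x(i)
      end do
      call cpu_time(t1)
      print *, 'triad MFLOPS:', 2.0d-6*n/(t1 - t0)
      print *, y(1)                ! keeps the compiler from eliding the loops
    end program kernels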

Conclusions:

(1) While small problems that fit in cache can get within a factor of 2-3 of the theoretical peak speed (for example, 1523 MFLOPS sustained was achieved on the P4), the bottleneck for large problems that don't fit in cache is the bandwidth to main memory, which drops the sustained rate to only 176 MFLOPS.  This is the more typical range of performance for computer simulations of 3-D physical systems, which usually involve processing large arrays of information.  For this daxpy loop, with 2 reads + 1 write and 2 flops (add + multiply) per iteration, the sustained memory transfer rate is about 50% (AMD) to 66% (Pentium4) of the peak bus transfer rate.  (For example, the 400 MHz bus used by the Pentium4 can transfer 400 MW/s of 64-bit words.)
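The percentages follow from simple arithmetic on the table entries.  Below is a minimal sketch of that check, using the large-vector Pentium4 numbers (176 MFLOPS, 400 MHz bus) from above; the program and variable names are my own:

    ! Conclusion (1) arithmetic: convert the sustained daxpy MFLOPS into a
    ! memory transfer rate and compare with the peak bus rate.
    program bus_fraction
      implicit none
      real(kind=8) :: mflops, iters, mwords, peak
      mflops = 176d0         ! sustained large-vector daxpy on the Pentium4
      iters  = mflops/2d0    ! 2 flops per iteration  -> 88e6 iterations/s
      mwords = 3d0*iters     ! 3 words per iteration  -> 264 MW/s sustained
      peak   = 400d0         ! 400 MHz bus moving one 64-bit word per cycle
      print *, 'fraction of peak bus rate:', mwords/peak    ! ~0.66
    end program bus_fraction

The same arithmetic with the AMD numbers (85 MFLOPS, 266 MHz bus) gives about 0.48, i.e. the ~50% quoted above.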

(2) For large problems, the AMD machine had lower performance than the Pentium4 machine, roughly consistent with its slower bus speed (266 MHz vs. 400 MHz).

(3) For problems that fit in cache, programs compiled with the Intel Fortran95 compiler can sometimes be about a factor of 2 faster on a Pentium4 chip than programs compiled with the Lahey-Fujitsu compiler (though this depends on the contents of the loop, and the two compilers often give comparable performance).  On the AMD chip, the two compilers are comparable in performance.  I found that "-O2" was actually better than "-O3" in the Intel compiler for the cases I looked at.

(4) These results are fairly consistent with the STREAM memory bandwidth benchmark (last column of the table above).

(5) The Cray C-90, introduced in the early 1990s, had a much higher memory bandwidth because its architecture employed multiple memory banks (and didn't need a cache).  To make up for the lower memory bandwidth per processor of commodity chips, one needs to parallelize across many processors...

(6) One can roughly double the speed of bandwidth-limited problems that don't fit in cache by going from double to single precision.
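As a minimal illustration of this point (the program and array names are my own, and timings will vary by machine), the single-precision loop below moves 12 bytes per iteration instead of 24, so out of cache it should run roughly twice as fast:

    ! Conclusion (6): halving the word size halves the memory traffic of a
    ! bandwidth-limited loop.
    program precision_test
      implicit none
      integer, parameter :: n = 10**7          ! too large to fit in cache
      real(kind=8), allocatable :: xd(:), yd(:)
      real(kind=4), allocatable :: xs(:), ys(:)
      real(kind=8) :: t0, t1
      integer :: i
      allocate(xd(n), yd(n), xs(n), ys(n))
      xd = 1d0; yd = 2d0; xs = 1.0; ys = 2.0
      call cpu_time(t0)
      do i = 1, n
         yd(i) = yd(i) + 3d0*xd(i)             ! 8-byte words: 24 bytes/iteration
      end do
      call cpu_time(t1)
      print *, 'double precision time:', t1 - t0
      call cpu_time(t0)
      do i = 1, n
         ys(i) = ys(i) + 3.0*xs(i)             ! 4-byte words: 12 bytes/iteration
      end do
      call cpu_time(t1)
      print *, 'single precision time:', t1 - t0
      print *, yd(1), ys(1)                    ! keep the loops from being elided
    end program precision_test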

(7) For certain special problems, involving for example matrix-matrix multiplies with dense matrices, one can speed up a code by reading a section of the matrix from main memory and reusing it many times before reading the next section.  This is one of the ways LINPACK & LAPACK get high performance on some problems.  But the types of 3-D PDE problems we are often interested in correspond to manipulating sparse matrices, for which one is limited by memory bandwidth.
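The cache-reuse idea can be sketched as a blocked matrix multiply.  This is a minimal illustration, not the actual LINPACK/LAPACK code; the block size bs = 64 is an assumption, chosen so that three bs x bs double-precision blocks fit comfortably in cache:

    ! Blocked C = C + A*B for dense n x n matrices.  Each bs x bs block of
    ! A and B is read from main memory once and then reused ~bs times from
    ! cache, cutting main-memory traffic by roughly a factor of bs.
    subroutine blocked_matmul(a, b, c, n)
      implicit none
      integer, intent(in) :: n
      real(kind=8), intent(in)    :: a(n,n), b(n,n)
      real(kind=8), intent(inout) :: c(n,n)
      integer, parameter :: bs = 64
      integer :: ii, jj, kk, i, j, k
      do jj = 1, n, bs
        do kk = 1, n, bs
          do ii = 1, n, bs
            do j = jj, min(jj+bs-1, n)         ! multiply one pair of blocks
              do k = kk, min(kk+bs-1, n)       ! while both are cache-resident
                do i = ii, min(ii+bs-1, n)
                  c(i,j) = c(i,j) + a(i,k)*b(k,j)
                end do
              end do
            end do
          end do
        end do
      end do
    end subroutine blocked_matmul

    program test_block                         ! tiny driver
      implicit none
      integer, parameter :: n = 256
      real(kind=8), allocatable :: a(:,:), b(:,:), c(:,:)
      allocate(a(n,n), b(n,n), c(n,n))
      a = 1d0; b = 2d0; c = 0d0
      call blocked_matmul(a, b, c, n)
      print *, c(1,1)                          ! should print 2*n = 512
    end program test_block

In contrast, the sparse-matrix operations typical of 3-D PDE codes touch each matrix entry only once per pass, so blocking cannot hide the memory traffic.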