Comparison of Pentium4 and AMD 1900+ performance

(petrel002 & petrel025 nodes at pppl.gov)

Benchmarks:
  daxpy loop:    y(i) = y(i) + a*x(i)    for vector lengths n = 100 or 10^7.
                 Small vectors fit in cache and measure CPU speed; large
                 vectors don't fit and measure memory bandwidth.
                 [speed benchmark source files]
  STREAM triad:  y(i) = w(i) + a*x(i)    for large vectors (> 2M words).
                 http://www.cs.virginia.edu/stream/
                 [my version of stream benchmark]
  (Both loops are sketched in the code after the table.)
  ifc (Intel) and lf95 (Lahey) are both Fortran95 compilers.  MW/s means
  millions of 64-bit words per second, so 1 MW/s = 8 MB/s.

Processor (node)          CPU / Bus speed      daxpy MFLOPS (small/large)      STREAM triad bandwidth
                                               Intel ifc      Lahey lf95
-----------------------------------------------------------------------------------------------------
Pentium4/Xeon             1.7 GHz / 400 MHz    1523 / 176     671 / 173        1600 MB/s = 200 MW/s
  (petrel002)
AMD Athlon MP1900+        1.6 GHz / 266 MHz    1032 /  70     1011 /  85        800 MB/s = 100 MW/s
  (petrel025)
Cray C-90 (circa 1991)         --                  --             --           9500 MB/s/proc = 1187 MW/s/proc
NEC SX-6                       --                  --             --           4000 MW/s/proc
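For reference, the two timed loops have the following shape.  This is a
minimal sketch, not the actual benchmark source linked above; the timing
harness and array setup here are my own.

    ! Sketch of the two kernels in the table above.  Arrays hold 64-bit
    ! words; n = 100 fits in cache, n = 10**7 does not.  (For n = 100 the
    ! loops must be repeated many times to accumulate a measurable time.)
    program kernels
      implicit none
      integer, parameter :: n = 10**7
      real(kind=8), allocatable :: w(:), x(:), y(:)
      real(kind=8) :: a, t0, t1
      integer :: i
      allocate(w(n), x(n), y(n))
      a = 3d0; w = 1d0; x = 2d0; y = 0d0
      call cpu_time(t0)
      do i = 1, n                  ! daxpy: 2 flops, 3 words moved per iteration
         y(i) = y(i) + a*x(i)
      end do
      call cpu_time(t1)
      print *, 'daxpy MFLOPS:', 2.0d-6*n/(t1 - t0)
      call cpu_time(t0)
      do i = 1, n                  ! STREAM triad: 2 flops, 3 words moved
         y(i) = w(i) + a*x(i)
      end do
      call cpu_time(t1)
      print *, 'triad MFLOPS:', 2.0d-6*n/(t1 - t0)
      print *, y(1)                ! keeps the compiler from eliding the loops
    end program kernels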

Conclusions:

(1) While small problems that fit in cache can get within a factor of 2-3 of the theoretical peak speed (for example, 1523 MFLOPS sustained was achieved on the P4), the bottleneck for large problems that don't fit in cache is the bandwidth to main memory, which drops the sustained rate to only 176 MFLOPS.  This is the more typical range of performance for computer simulations of 3-D physical systems, which usually involve processing large arrays of information.  For this daxpy loop, with 2 reads + 1 write and 2 flops (add + multiply) per iteration, the sustained memory transfer rate is about 50% (AMD) to 66% (Pentium4) of the peak bus transfer rate.  (For example, the 400 MHz bus used by the Pentium4 can transfer 400 MW/s of 64-bit words.)
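The percentages follow from simple arithmetic on the table entries.  Below is a minimal sketch of that check, using the large-vector Pentium4 numbers (176 MFLOPS, 400 MHz bus) from above; the program and variable names are my own:

    ! Conclusion (1) arithmetic: convert the sustained daxpy MFLOPS into a
    ! memory transfer rate and compare with the peak bus rate.
    program bus_fraction
      implicit none
      real(kind=8) :: mflops, iters, mwords, peak
      mflops = 176d0         ! sustained large-vector daxpy on the Pentium4
      iters  = mflops/2d0    ! 2 flops per iteration  -> 88e6 iterations/s
      mwords = 3d0*iters     ! 3 words per iteration  -> 264 MW/s sustained
      peak   = 400d0         ! 400 MHz bus moving one 64-bit word per cycle
      print *, 'fraction of peak bus rate:', mwords/peak    ! ~0.66
    end program bus_fraction

The same arithmetic with the AMD numbers (85 MFLOPS, 266 MHz bus) gives about 0.48, i.e. the ~50% quoted above.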

(2) For large problems, the AMD machine had lower performance than the Pentium4 machine, roughly consistent with its slower bus speed (266 MHz vs. 400 MHz).

(3) For problems that fit in cache, programs compiled with the Intel Fortran95 compiler can sometimes be about a factor of 2 faster on a Pentium4 chip than programs compiled with the Lahey-Fujitsu compiler (though this depends on the contents of the loop, and the two compilers often give comparable performance).  On the AMD chip, the two compilers are comparable in performance.  I found that "-O2" was actually better than "-O3" in the Intel compiler for the cases I looked at.

(4) These results are fairly consistent with the STREAM memory bandwidth benchmark (last column of the table above).

(5) The Cray C-90, introduced in the early 1990s, had a much higher memory bandwidth because its architecture employed multiple memory banks (and didn't need a cache).  To make up for the lower memory bandwidth per processor of commodity chips, one needs to parallelize across many processors...

(6) One can roughly double the speed of bandwidth-limited problems that don't fit in cache by going from double to single precision.
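As a minimal illustration of this point (the program and array names are my own, and timings will vary by machine), the single-precision loop below moves 12 bytes per iteration instead of 24, so out of cache it should run roughly twice as fast:

    ! Conclusion (6): halving the word size halves the memory traffic of a
    ! bandwidth-limited loop.
    program precision_test
      implicit none
      integer, parameter :: n = 10**7          ! too large to fit in cache
      real(kind=8), allocatable :: xd(:), yd(:)
      real(kind=4), allocatable :: xs(:), ys(:)
      real(kind=8) :: t0, t1
      integer :: i
      allocate(xd(n), yd(n), xs(n), ys(n))
      xd = 1d0; yd = 2d0; xs = 1.0; ys = 2.0
      call cpu_time(t0)
      do i = 1, n
         yd(i) = yd(i) + 3d0*xd(i)             ! 8-byte words: 24 bytes/iteration
      end do
      call cpu_time(t1)
      print *, 'double precision time:', t1 - t0
      call cpu_time(t0)
      do i = 1, n
         ys(i) = ys(i) + 3.0*xs(i)             ! 4-byte words: 12 bytes/iteration
      end do
      call cpu_time(t1)
      print *, 'single precision time:', t1 - t0
      print *, yd(1), ys(1)                    ! keep the loops from being elided
    end program precision_test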

(7) For certain special problems, involving for example matrix-matrix multiplies with dense matrices, one can speed up a code by reading a section of the matrix from main memory and reusing it many times before reading the next section.  This is one of the ways LINPACK & LAPACK get high performance on some problems.  But the types of 3-D PDE problems we are often interested in correspond to manipulating sparse matrices, for which one is limited by memory bandwidth.
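The cache-reuse idea can be sketched as a blocked matrix multiply.  This is a minimal illustration, not the actual LINPACK/LAPACK code; the block size bs = 64 is an assumption, chosen so that three bs x bs double-precision blocks fit comfortably in cache:

    ! Blocked C = C + A*B for dense n x n matrices.  Each bs x bs block of
    ! A and B is read from main memory once and then reused ~bs times from
    ! cache, cutting main-memory traffic by roughly a factor of bs.
    subroutine blocked_matmul(a, b, c, n)
      implicit none
      integer, intent(in) :: n
      real(kind=8), intent(in)    :: a(n,n), b(n,n)
      real(kind=8), intent(inout) :: c(n,n)
      integer, parameter :: bs = 64
      integer :: ii, jj, kk, i, j, k
      do jj = 1, n, bs
        do kk = 1, n, bs
          do ii = 1, n, bs
            do j = jj, min(jj+bs-1, n)         ! multiply one pair of blocks
              do k = kk, min(kk+bs-1, n)       ! while both are cache-resident
                do i = ii, min(ii+bs-1, n)
                  c(i,j) = c(i,j) + a(i,k)*b(k,j)
                end do
              end do
            end do
          end do
        end do
      end do
    end subroutine blocked_matmul

    program test_block                         ! tiny driver
      implicit none
      integer, parameter :: n = 256
      real(kind=8), allocatable :: a(:,:), b(:,:), c(:,:)
      allocate(a(n,n), b(n,n), c(n,n))
      a = 1d0; b = 2d0; c = 0d0
      call blocked_matmul(a, b, c, n)
      print *, c(1,1)                          ! should print 2*n = 512
    end program test_block

In contrast, the sparse-matrix operations typical of 3-D PDE codes touch each matrix entry only once per pass, so blocking cannot hide the memory traffic.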