Memory/Remote-Processor Bandwidth Problems.

Summary: Raw CPU MFLOP rates have been increasing exponentially, following Moore's Law, but memory bandwidth and interprocessor communication rates have been increasing very slowly, or even declining. A collision is inevitable unless these bandwidths can be improved as well.

More info at:

  • http://www.cs.virginia.edu/stream/
  • http://www.nersc.gov/research/FTG/pcp/index.html

    The following table summarizes some key results from http://www.cs.virginia.edu/stream/standard/Bandwidth.html and elsewhere. The cited bandwidths are in 8-byte MegaWords per second (MWord/s):

    -------------------------------------------------------------

    Computer                # of    Peak MFLOPS    Bandwidth to           Balance
                            procs   (per proc)     memory per processor   (Words/FLOP)
                                                   (MWord/s)

    Cray T932                  32      1800        1403                      0.78

    Cray C90                   16       960         811                      0.84

    Cray T3E-900              512       900          65  (local memory)      0.072
                                                     38  (remote memory      0.042
                                                          @ 300 MB/s/proc**)

    NEC SX4                    32      2000        1707                      0.85

    SGI Origin2000-300        128       600          26  (local memory)      0.043

    IBM SP-2                 2048      1460          16  (remote memory)     0.011
    (Power3, Dec 2000)

    400 MHz Pentium             1       400          39                      0.097

    Apple PowerG3 400 MHz       1       600          26                      0.043

    Estimated parameters:

    Beowulf cluster          max?       800           1.6 (remote memory)    0.002
    (100 Mbit ethernet)

    Beowulf cluster          max?       800          16   (remote memory)    0.02
    (Gbit ethernet; Myrinet similar?)

    (The 100 Mbit Beowulf cluster is something of a straw man, since Gbit
    ethernet and Myrinet are available to experts.)

    -------------------------------------------------------------
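
    The balance column is simply the bandwidth column divided by the peak
    MFLOPS. As a sanity check on the two estimated Beowulf rows, the sketch
    below (my own illustration, not measured data) converts a network link
    rate in Mbit/s into 8-byte MWord/s per processor and then into a balance
    figure; it optimistically assumes the full wire rate is delivered, which
    real TCP/IP stacks of course do not manage:

        /* Convert a network link rate in Mbit/s to 8-byte MWord/s per
         * processor, then to a balance figure against an 800 MFLOPS peak.
         * A sketch only; assumes the full link rate is achievable. */
        #include <stdio.h>

        int main(void)
        {
            const double peak_mflops = 800.0;             /* per-proc peak from the table */
            const double links[]     = { 100.0, 1000.0 }; /* 100 Mbit and Gbit ethernet */

            for (int i = 0; i < 2; i++) {
                double mword = links[i] / 8.0 / 8.0;  /* Mbit/s -> MB/s -> MWord/s */
                printf("%6.0f Mbit/s -> %4.1f MWord/s, balance %.3f\n",
                       links[i], mword, mword / peak_mflops);
            }
            return 0;
        }
        /* Prints 1.6 MWord/s (balance 0.002) and 15.6 MWord/s (balance 0.020),
         * in agreement with the two Beowulf rows above. */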
    

    For some types of problems, where there are many floating point operations per memory access (such as Linpack runs on large matrices), memory bandwidth is not a problem. But there are other problems for which the memory bandwidths reported above are the limiting factor.
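
    To make the distinction concrete, below is a minimal STREAM-style "triad"
    kernel (a sketch of my own, not the official STREAM code): it performs
    only two floating point operations for every three 8-byte words of memory
    traffic, so on any machine in the table its speed is set by the bandwidth
    column, not the MFLOPS column.

        /* STREAM-style triad: 2 flops per 3 words of memory traffic, so the
         * loop is memory-bandwidth bound.  N = 1M doubles per array (8 MB
         * each) is an arbitrary size chosen to be larger than typical caches. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define N       1000000L
        #define NTIMES  10

        int main(void)
        {
            double *a = malloc(N * sizeof *a);
            double *b = malloc(N * sizeof *b);
            double *c = malloc(N * sizeof *c);
            if (!a || !b || !c) return 1;

            for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

            clock_t t0 = clock();
            for (int k = 0; k < NTIMES; k++) {
                double s = 3.0 + k;            /* vary the scalar so the
                                                  compiler cannot skip passes */
                for (long i = 0; i < N; i++)
                    a[i] = b[i] + s * c[i];    /* 2 flops, 3 words moved */
            }
            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

            /* report in 8-byte MWord/s and MFLOP/s, to match the table */
            printf("triad: %.1f MWord/s, %.1f MFLOP/s\n",
                   3.0 * N * NTIMES / secs / 1e6,
                   2.0 * N * NTIMES / secs / 1e6);
            return a[0] == 0.0;                /* use the result */
        }

    A blocked matrix multiply, by contrast, can reuse each word many times
    once a block fits in cache, which is why Linpack on large matrices can
    run near peak MFLOPS.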

    There are plans to move more of the memory onto the CPU chip (essentially making the cache very large), which helps solve the local memory bandwidth problem, but the problem of bandwidth to remote memory on other processors remains, and may get worse as multiple CPUs are put on a single chip...

    ** I've heard that because of some extra layers of caching or copying, the bandwidth to remote memory on the Cray T3E may drop from 300 to 150 MB/s per processor. In general this bandwidth number should be the bisection bandwidth divided by the number of processors. Another performance measure, important for many types of problems, is the latency.
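
    As a worked example of that rule (using only the T3E numbers quoted
    above), the sketch below recovers the "38 MWord/s" and "0.042" entries
    in the table from the 300 MB/s/proc figure:

        /* Per-processor remote bandwidth = bisection bandwidth / P.
         * The T3E row quotes 300 MB/s per processor, i.e. a bisection
         * bandwidth of P * 300 MB/s for P = 512. */
        #include <stdio.h>

        int main(void)
        {
            const int    P            = 512;
            const double bisection_mb = P * 300.0;  /* total bisection BW, MB/s */
            const double peak_mflops  = 900.0;      /* T3E-900 per-proc peak */

            double mword = bisection_mb / P / 8.0;  /* MB/s -> 8-byte MWord/s */
            printf("%.0f MWord/s/proc, balance %.3f\n",
                   mword, mword / peak_mflops);
            /* prints "38 MWord/s/proc, balance 0.042", matching the table */
            return 0;
        }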