The following table summarizes some key results from http://www.cs.virginia.edu/stream/standard/Bandwidth.html and elsewhere. The cited bandwidth is in 8-byte MegaWords per second:
-----------------------------------------------------------------------------
Computer               # of        Peak MFLOPS      Bandwidth to         Balance
                       processors  (per processor)  memory per           (Bandwidth/
                                                    processor (MWord/s)  MFLOP)
-----------------------------------------------------------------------------
Cray T932              32          1800             1403                 0.78
Cray C90               16           960              811                 0.84
Cray T3E-900           512          900               65 (local memory)  0.072
                                                      38 (remote memory  0.042
                                                       @300 MB/s/proc**)
NEC SX4                32          2000             1707                 0.8
SGI Origin2000-300     128          600               26 (local memory)  0.043
IBM SP-2               2048        1460               16 (remote memory) 0.011
  (Power3, Dec 2000)
400 MHz Pentium        1            400               39                 0.097
Apple PowerG3 400 MHz  1            600               26                 0.043

Estimated parameters:
Beowulf cluster        max?         800              1.6 (remote memory) 0.002
  (100 Mbit ethernet)
Beowulf cluster        max?         800               16 (remote memory) 0.02
  (Gbit ethernet, Myrinet similar?)
-----------------------------------------------------------------------------
(The 100 Mbit Beowulf cluster is something of a straw man, since Gbit
ethernet and Myrinet are available for experts.)
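The Balance column is simply the per-processor bandwidth divided by the per-processor peak rate. A minimal C sketch of that calculation, using a few rows from the table (nothing here beyond the table's own numbers):

    #include <stdio.h>

    /* Balance = memory bandwidth (MWord/s) / peak rate (MFLOPS), per processor. */
    struct machine { const char *name; double mflops, mwords; };

    int main(void)
    {
        struct machine m[] = {
            { "Cray T932",       1800.0, 1403.0 },  /* values from the table above */
            { "Cray C90",         960.0,  811.0 },
            { "400 MHz Pentium",  400.0,   39.0 },
        };
        for (int i = 0; i < 3; i++)
            printf("%-20s balance = %.3f words/flop\n",
                   m[i].name, m[i].mwords / m[i].mflops);
        return 0;
    }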
For some types of problems, where there are a large number of floating-point operations per memory access (such as Linpack tests on large matrices), memory bandwidth is not a problem. But for other problems the memory bandwidths reported above are a limiting factor.
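One way to see the difference is in terms of arithmetic intensity: floating-point operations per word of memory traffic. The C sketch below is purely illustrative (not from the cited measurements); a STREAM-style triad does 2 flops per 3 words moved and is bandwidth-bound, while a matrix multiply does about 2n/3 flops per word of data and can run compute-bound:

    #include <stddef.h>

    /* STREAM-style triad: 2 flops per iteration against 3 words of memory
     * traffic (load b[i], load c[i], store a[i]) -- bandwidth-bound. */
    void triad(double *a, const double *b, const double *c, double s, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }

    /* Naive n x n matrix multiply: 2*n^3 flops over about 3*n^2 distinct
     * words, i.e. roughly 2n/3 flops per word -- compute-bound for large n,
     * given enough cache reuse. */
    void matmul(double *c, const double *a, const double *b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                double sum = 0.0;
                for (size_t k = 0; k < n; k++)
                    sum += a[i*n + k] * b[k*n + j];
                c[i*n + j] = sum;
            }
    }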
There are plans to move more of the memory onto CPU chips (essentially making the cache very big), which helps solve the local memory bandwidth problem. But the problem of bandwidth for communicating with remote memory on other processors remains, and may get worse as multiple CPUs are put on a single chip...
** I've heard that because of some extra layers of caching or copying, the bandwidth to remote memory for the Cray T3E may drop from 300 to 150 MB/s per processor. In general this bandwidth figure should be the bisection bandwidth divided by the number of processors. Another performance measure that is important for many types of problems is latency.
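A common first-order model combines the two measures: the time to move a message is a fixed latency plus its size divided by bandwidth, so latency dominates for small messages and bandwidth for large ones. A sketch in C, with hypothetical numbers chosen only for illustration:

    #include <stdio.h>

    /* First-order communication model: t = latency + bytes / bandwidth. */
    double msg_time(double latency_s, double bandwidth_Bps, double bytes)
    {
        return latency_s + bytes / bandwidth_Bps;
    }

    int main(void)
    {
        /* Hypothetical values for illustration only:
         * 10 us latency, 100 MB/s bandwidth. */
        double lat = 10e-6, bw = 100e6;
        for (double n = 8; n <= 8e6; n *= 1000)
            printf("%10.0f bytes: %.6f s\n", n, msg_time(lat, bw, n));
        return 0;
    }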