hydra.pppl.gov is a DEC Alpha Workstation 4/275 (model 4 with a 275 MHz
clock), purchased circa 1995-1996.  For comparison, the new Cray T3E-900
(1997) uses a 450 MHz chip.

Compiled with f77 -O -fast:

hydra% speed
Vector length = 1000
did 10 million * ops in 0.2352 cpu secs = 42.5141 MFLOPS
did 10 million / ops in 0.6949 cpu secs = 14.3903 MFLOPS
did 70 million multiple * and + ops in 0.4987 cpu secs = 140.3548 MFLOPS
did 10 million logs ops in 2.8177 cpu secs = 3.5490 MFLOPS
did 10 million sqrts ops in 1.3566 cpu secs = 7.3712 MFLOPS
5.519u 0.185s 0:06.30 90.3% 0+2k 0+0io 0pf+0w

hydra% speed
Vector length = 1000000
did 10 million * ops in 2.1238 cpu secs = 4.7086 MFLOPS
did 10 million / ops in 2.2672 cpu secs = 4.4106 MFLOPS
did 70 million multiple * and + ops in 2.8001 cpu secs = 24.9987 MFLOPS
did 10 million logs ops in 3.3369 cpu secs = 2.9968 MFLOPS
did 10 million sqrts ops in 2.9134 cpu secs = 3.4325 MFLOPS
13.465u 1.104s 0:17.40 83.6% 0+152k 0+0io 0pf+0w

***********************************

Compiled with f77 -O -fast -r8 (double precision):

(There is no performance degradation for double precision as long as the
problem fits in cache, since the DEC Alpha is a 64-bit processor.)

hydra% speed
Vector length = 500
did 10 million * ops in 0.2869 cpu secs = 34.8500 MFLOPS
did 10 million / ops in 0.9438 cpu secs = 10.5956 MFLOPS
did 70 million multiple * and + ops in 0.4939 cpu secs = 141.7417 MFLOPS
did 10 million logs ops in 2.8509 cpu secs = 3.5077 MFLOPS
did 10 million sqrts ops in 1.5206 cpu secs = 6.5763 MFLOPS

hydra% speed
Vector length = 1000000
did 10 million * ops in 3.7303 cpu secs = 2.6808 MFLOPS
did 10 million / ops in 4.4184 cpu secs = 2.2633 MFLOPS
did 70 million multiple * and + ops in 5.1806 cpu secs = 13.5119 MFLOPS
did 10 million logs ops in 4.4066 cpu secs = 2.2693 MFLOPS
did 10 million sqrts ops in 4.3032 cpu secs = 2.3239 MFLOPS

***********************************

Run on a newer and faster DEC Alpha at JAERI:

Compiled with f77 -O -fast -r8 (double precision):

wave.naka.jaeri.go.jp> speed
Vector length = 500
did 10 million * ops in 0.0830 cpu secs = 120.5400 MFLOPS
did 10 million / ops in 0.4724 cpu secs = 21.1692 MFLOPS
did 70 million multiple * and + ops in 0.2186 cpu secs = 320.1844 MFLOPS
did 10 million logs ops in 1.2971 cpu secs = 7.7095 MFLOPS
did 10 million sqrts ops in 0.5641 cpu secs = 17.7265 MFLOPS

wave.naka.jaeri.go.jp> speed
Vector length = 5000
did 10 million * ops in 0.2460 cpu secs = 40.6583 MFLOPS
did 10 million / ops in 0.4929 cpu secs = 20.2889 MFLOPS
did 70 million multiple * and + ops in 0.4011 cpu secs = 174.5044 MFLOPS
did 10 million logs ops in 1.3069 cpu secs = 7.6519 MFLOPS
did 10 million sqrts ops in 0.5993 cpu secs = 16.6871 MFLOPS

wave.naka.jaeri.go.jp> speed
Vector length = 1000000
did 10 million * ops in 1.3566 cpu secs = 7.3712 MFLOPS
did 10 million / ops in 1.4093 cpu secs = 7.0955 MFLOPS
did 70 million multiple * and + ops in 1.8066 cpu secs = 38.7473 MFLOPS
did 10 million logs ops in 1.6026 cpu secs = 6.2399 MFLOPS
did 10 million sqrts ops in 1.0346 cpu secs = 9.6659 MFLOPS

*****************************************
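(Aside: the speed.f and speedsub.f sources are not reproduced here, but
each timing kernel presumably looks something like the sketch below.  The
total operation count is held near 10 million while the vector length is
varied, so short vectors stay in cache and long vectors stream from main
memory.  The array names, the constant 2.5d0, and the assumption that
second(), supplied by second.o, returns CPU seconds are all illustrative.)

      program speedsketch
c     Illustrative sketch (assumed, not the actual speed.f) of one
c     timing kernel: repeat a vector multiply until about 10 million
c     operations have been done, then report MFLOPS.
      integer nvec, ntimes, i, it
      parameter (nvec = 1000)
      real*8 x(nvec), y(nvec), t0, t1, flops, second
      external second

      do i = 1, nvec
         x(i) = 1.0d0 + 1.0d-6*i
      enddo

      ntimes = 10000000/nvec
      t0 = second()
      do it = 1, ntimes
         do i = 1, nvec
            y(i) = 2.5d0*x(i)
         enddo
c        A real benchmark must use y here (e.g. feed it back into x)
c        so the optimizer cannot delete the loop.
      enddo
      t1 = second()

      flops = dble(ntimes)*dble(nvec)
      write(*,*) 'did 10 million * ops in', t1 - t0,
     &     'cpu secs =', 1.0d-6*flops/(t1 - t0), 'MFLOPS'
      end

This fixed-work design is why the MFLOPS figures drop at the long vector
lengths: the same 10 million operations become memory-bound once the
vectors no longer fit in cache.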
Run on termita.pppl.gov, a 1.5 GHz Pentium 4 (according to /proc/cpuinfo),
with the Lahey-Fujitsu compiler:

lf95 --dbl -I/usr/local/include -O --tpp --wide -o speed speed.f speedsub.o \
   second.o

Vector length = 100
did 10 million * ops in 0.1000 cpu secs = 100.0000 MFLOPS
did 10 million / ops in 0.2500 cpu secs = 40.0000 MFLOPS
did 70 million multiple * and + ops in 0.1700 cpu secs = 411.7647 MFLOPS
did 10 million logs ops in 0.6900 cpu secs = 14.4928 MFLOPS
did 10 million sqrts ops in 0.2700 cpu secs = 37.0370 MFLOPS

Vector length = 10000000
did 10 million * ops in 0.1800 cpu secs = 55.5556 MFLOPS
did 10 million / ops in 0.2700 cpu secs = 37.0370 MFLOPS
did 70 million multiple * and + ops in 0.2600 cpu secs = 269.2308 MFLOPS
did 10 million logs ops in 0.7400 cpu secs = 13.5135 MFLOPS
did 10 million sqrts ops in 0.2600 cpu secs = 38.4615 MFLOPS

*****************************************

Run on loki.pppl.gov, a 750 MHz Compaq Alpha EV67:

fort -extend_source -r8 -fast -assume no2underscores -o speed speed.f \
   speedsub.o second.o

Vector length = 100
did 10 million * ops in 0.0410 cpu secs = 243.8132 MFLOPS
did 10 million / ops in 0.1641 cpu secs = 60.9522 MFLOPS
did 70 million multiple * and + ops in 0.0938 cpu secs = 746.6665 MFLOPS
did 10 million logs ops in 0.5059 cpu secs = 19.7684 MFLOPS
did 10 million sqrts ops in 0.2090 cpu secs = 47.8503 MFLOPS

Vector length = 10000000
did 10 million * ops in 0.2314 cpu secs = 43.2068 MFLOPS
did 10 million / ops in 0.2324 cpu secs = 43.0252 MFLOPS
did 70 million multiple * and + ops in 0.3096 cpu secs = 226.1201 MFLOPS
did 10 million logs ops in 0.5410 cpu secs = 18.4837 MFLOPS
did 10 million sqrts ops in 0.2666 cpu secs = 37.5091 MFLOPS

****************************************

Run on petrel001.pppl.gov, a 1.7 GHz dual-processor Xeon Pentium 4
(according to /proc/cpuinfo), with the Lahey-Fujitsu compiler:

lf95 --dbl -O --tpp --wide -o speed speed.f speedsub.o second.o

Vector length = 100
did 10 million * ops in 0.0254 cpu secs = 393.8559 MFLOPS
did 10 million / ops in 0.2266 cpu secs = 44.1378 MFLOPS
did 70 million multiple * and + ops in 0.0918 cpu secs = 762.5522 MFLOPS
did 10 million logs ops in 0.5723 cpu secs = 17.4744 MFLOPS
did 10 million sqrts ops in 0.2285 cpu secs = 43.7606 MFLOPS

Vector length = 10000000
did 10 million * ops in 0.1719 cpu secs = 58.1818 MFLOPS
did 10 million / ops in 0.2324 cpu secs = 43.0252 MFLOPS
did 70 million multiple * and + ops in 0.2188 cpu secs = 320.0000 MFLOPS
did 10 million logs ops in 0.5762 cpu secs = 17.3559 MFLOPS
did 10 million sqrts ops in 0.2266 cpu secs = 44.1380 MFLOPS

Tried more compiler options but found little difference:

lf95 --dbl -O --tpp --wide --nap --nchk --npca --nsav --ntrace --prefetch 2 \
   --staticlink -o speed speed.f speedsub.o second.o

Vector length = 100
did 10 million * ops in 0.0312 cpu secs = 320.0000 MFLOPS
did 10 million / ops in 0.2246 cpu secs = 44.5218 MFLOPS
did 70 million multiple * and + ops in 0.0801 cpu secs = 874.1476 MFLOPS
did 10 million logs ops in 0.6211 cpu secs = 16.1006 MFLOPS
did 10 million sqrts ops in 0.2266 cpu secs = 44.1380 MFLOPS

Vector length = 10000000
did 10 million * ops in 0.1699 cpu secs = 58.8509 MFLOPS
did 10 million / ops in 0.2363 cpu secs = 42.3139 MFLOPS
did 70 million multiple * and + ops in 0.2148 cpu secs = 325.8193 MFLOPS
did 10 million logs ops in 0.6250 cpu secs = 16.0000 MFLOPS
did 10 million sqrts ops in 0.2285 cpu secs = 43.7606 MFLOPS

************************************************

Run on petrel050, a 1.6 GHz AMD Athlon:
Vector length = 100
did 10 million * ops in 0.0195 cpu secs = 512.0065 MFLOPS
did 10 million / ops in 0.1074 cpu secs = 93.0908 MFLOPS
did 70 million multiple * and + ops in 0.0508 cpu secs = 1378.4684 MFLOPS
did 10 million logs ops in 0.7480 cpu secs = 13.3681 MFLOPS
did 10 million sqrts ops in 0.1465 cpu secs = 68.2668 MFLOPS

Vector length = 10000000
did 10 million * ops in 0.3281 cpu secs = 30.4762 MFLOPS
did 10 million / ops in 0.3340 cpu secs = 29.9416 MFLOPS
did 70 million multiple * and + ops in 0.3945 cpu secs = 177.4254 MFLOPS
did 10 million logs ops in 0.7480 cpu secs = 13.3682 MFLOPS
did 10 million sqrts ops in 0.2578 cpu secs = 38.7878 MFLOPS

--------------------------------------------

Similar results were obtained with the lff95 BLAS library linked in:

lf95 --dbl -O --tpp --wide --nap --nchk --npca --nsav --ntrace --prefetch 2 \
   --staticlink -L/usr/local/lff95/lib -lblas -o speed speed.f speedsub.o \
   second.o /usr/local/lff95/lib/libblas.a

On petrel002, a 1.7 GHz Pentium 4, with a simple Fortran BLAS daxpy (a
sketch of such a routine appears after the Intel compiler results below):

lf95 --dbl -O --tpp --wide --nap --nchk --npca --nsav --ntrace --prefetch 2 \
   --staticlink -X9 -o speed speed.f speedsub.o daxpy.o

Vector length = 100
did 40 million * ops in 0.1250 cpu secs = 320.0000 MFLOPS
did 79 million sweep ops in 0.4199 cpu secs = 190.5118 MFLOPS
did 79 million fast sweep ops in 0.2051 cpu secs = 390.0932 MFLOPS
did 40 million / ops in 0.9023 cpu secs = 44.3290 MFLOPS
did 280 million multiple *+ ops in 0.4785 cpu secs = 585.1424 MFLOPS
did 80 million Dot *+ ops in 0.1367 cpu secs = 585.1418 MFLOPS
did 40 million logs ops in 2.5000 cpu secs = 16.0000 MFLOPS
did 40 million sqrts ops in 0.9160 cpu secs = 43.6674 MFLOPS
did 80 million daxpy *+ ops in 0.1094 cpu secs = 731.4286 MFLOPS

Vector length = 10000000
did 40 million * ops in 0.7070 cpu secs = 56.5746 MFLOPS
did 79 million sweep ops in 0.4219 cpu secs = 189.6294 MFLOPS
did 79 million fast sweep ops in 0.2070 cpu secs = 386.4152 MFLOPS
did 40 million / ops in 0.9180 cpu secs = 43.5745 MFLOPS
did 280 million multiple *+ ops in 1.0195 cpu secs = 274.6361 MFLOPS
did 80 million Dot *+ ops in 0.3809 cpu secs = 210.0515 MFLOPS
did 40 million logs ops in 2.5312 cpu secs = 15.8025 MFLOPS
did 40 million sqrts ops in 0.9199 cpu secs = 43.4820 MFLOPS
did 80 million daxpy *+ ops in 0.4902 cpu secs = 163.1870 MFLOPS

---------------------------------------------

On petrel002 with the Intel Fortran Compiler:

ifc -r8 -O3 -tpp7 -axiMKW -pad -unroll -parallel -o speed speed.f \
   speedsub.o daxpy.o

Vector length = 100
did 40 million * ops in 0.0547 cpu secs = 731.0483 MFLOPS
did 79 million sweep ops in 0.2974 cpu secs = 268.9735 MFLOPS
did 79 million fast sweep ops in 0.2056 cpu secs = 389.0168 MFLOPS
did 40 million / ops in 0.8310 cpu secs = 48.1345 MFLOPS
did 280 million multiple *+ ops in 0.1715 cpu secs = 1632.6051 MFLOPS
did 80 million Dot *+ ops in 0.0492 cpu secs = 1624.9306 MFLOPS
did 40 million logs ops in 1.0771 cpu secs = 37.1372 MFLOPS
did 40 million sqrts ops in 0.8290 cpu secs = 48.2491 MFLOPS
did 80 million daxpy *+ ops in 0.0873 cpu secs = 916.5917 MFLOPS

Vector length = 10000000
did 40 million * ops in 0.7175 cpu secs = 55.7453 MFLOPS
did 79 million sweep ops in 0.2986 cpu secs = 267.9583 MFLOPS
did 79 million fast sweep ops in 0.2035 cpu secs = 393.1624 MFLOPS
did 40 million / ops in 0.8859 cpu secs = 45.1534 MFLOPS
did 280 million multiple *+ ops in 0.9347 cpu secs = 299.5722 MFLOPS
did 80 million Dot *+ ops in 0.3121 cpu secs = 256.3605 MFLOPS
did 40 million logs ops in 1.1246 cpu secs = 35.5690 MFLOPS
did 40 million sqrts ops in 0.8782 cpu secs = 45.5487 MFLOPS
did 80 million daxpy *+ ops in 0.7142 cpu secs = 112.0097 MFLOPS
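(The daxpy.o referred to above as "simple Fortran BLAS" is presumably
something like the netlib reference daxpy.  A minimal unit-stride sketch,
not the actual daxpy.f used in these runs:)

      subroutine daxpy(n, da, dx, incx, dy, incy)
c     Sketch of a simple Fortran BLAS daxpy: dy := dy + da*dx.
c     Only the unit-stride case is written out; the reference BLAS
c     routine also handles general increments incx and incy.
      integer n, incx, incy, i
      real*8 da, dx(*), dy(*)
      if (n .le. 0 .or. da .eq. 0.0d0) return
      do i = 1, n
         dy(i) = dy(i) + da*dx(i)
      enddo
      return
      end

At two floating-point operations per element (one * and one +), processing
40 million elements corresponds to the 80 million daxpy *+ ops counted
above.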
------------------------------------------

On petrel002 with the Intel Fortran Compiler, omitting the -parallel switch
actually makes things a little faster:

ifc -r8 -O3 -tpp7 -axiMKW -pad -unroll -o speed speed.f speedsub.o daxpy.o

Vector length = 100
did 40 million * ops in 0.0521 cpu secs = 767.3434 MFLOPS
did 79 million sweep ops in 0.2914 cpu secs = 274.5826 MFLOPS
did 79 million fast sweep ops in 0.1631 cpu secs = 490.5870 MFLOPS
did 40 million / ops in 0.8289 cpu secs = 48.2577 MFLOPS
did 280 million multiple *+ ops in 0.1708 cpu secs = 1638.8730 MFLOPS
did 80 million Dot *+ ops in 0.0470 cpu secs = 1702.5958 MFLOPS
did 40 million logs ops in 1.0711 cpu secs = 37.3448 MFLOPS
did 40 million sqrts ops in 0.8298 cpu secs = 48.2064 MFLOPS
did 80 million daxpy *+ ops in 0.0940 cpu secs = 851.3746 MFLOPS

Vector length = 10000000
did 40 million * ops in 0.9257 cpu secs = 43.2120 MFLOPS
did 79 million sweep ops in 0.3081 cpu secs = 259.6536 MFLOPS
did 79 million fast sweep ops in 0.1637 cpu secs = 488.7421 MFLOPS
did 40 million / ops in 0.8874 cpu secs = 45.0752 MFLOPS
did 280 million multiple *+ ops in 0.8738 cpu secs = 320.4460 MFLOPS
did 80 million Dot *+ ops in 0.3608 cpu secs = 221.7087 MFLOPS
did 40 million logs ops in 1.1171 cpu secs = 35.8067 MFLOPS
did 40 million sqrts ops in 0.9037 cpu secs = 44.2637 MFLOPS
did 80 million daxpy *+ ops in 0.5031 cpu secs = 159.0131 MFLOPS

Running two "speed" jobs simultaneously (with nops doubled to keep them busy
for a while) alongside a third big job from another user shows some
slowdown.  "top" reports that sometimes the two "speed" jobs are on the same
processor, each getting 50% of the CPU time, while at other times one
"speed" job gets 100% of one CPU and the other gets 50% of the CPU it shares
with the third big job.  Some performance degradation is to be expected when
a job does not have a processor all to itself (since it must share cache,
memory bandwidth, etc.), but this does indicate that an individual job is
not parallelized across processors, and so is unaffected by another job
running on the other processor.  Indeed, running one "speed" job while one
other big job runs on the other processor gives undegraded performance for
the "speed" job.