This is a note about how parallelism is implemented in gs2, in response
to a question from Darin Ernst.
**************************************************************************
(written by Bill Dorland, 14-Jan-1999)

Most of the communication is handled by calls to subroutines in the MPI
library, which is documented at .

There are only a few significant parallel constructs used:

   send           ! Send data from one processor to another
   receive        ! Receive data on one processor that is sent from another
   broadcast      ! Send data from one processor to all
   synchronize    ! Wait for all processors before continuing
   reductions     ! Find global sum, min or max of a quantity that is
                  ! different on each processor
   identify self  ! Find out the number of the local processor

All calls to MPI subroutines in the code are contained in the module mp.
To use MPI, one needs to ensure that mp.f90 is a file with the real MPI
subroutine calls -- i.e., a copy of (or link to) mp_mpi.f90.  On some
computers, one may have to include information on the compile line to
link to the MPI library properly.  On the other hand, if one wishes to
run the code on a serial machine that does not have the MPI library
available, one should copy or link the contents of mp_stub.f90 into
mp.f90.

MPI is a very portable, widely used library from Argonne.  I recommend
using it.

SGI/Cray computers generally have MPI libraries available.  They also
support SHMEM, which is a proprietary set of routines found only on
SGI/Cray machines.  For some operations (send and receive), SHMEM
routines are faster than the analogous MPI calls.  Thus, send and
receive calls in some places in the code may be replaced by SHMEM calls
on computers that support SHMEM.  This is accomplished by ensuring that
the contents of shmem.f90 are a copy of (or link to) shmem_cray.f90.
Otherwise, one may stick to straight MPI calls by using shmem_stub.f90.
The SHMEM interface has only been checked on the T3E at NERSC.

What issues influence whether to use SHMEM?

-- SHMEM may be faster for some runs.  In fact, we included SHMEM calls
   in an attempt to optimize the code.  However, even though the SHMEM
   puts and gets are faster than the MPI sends and receives, I find that
   the difference in run time for the cases I have checked tends to be
   very small.  This can happen when the send and receive operations are
   not the bottleneck in execution.

-- SHMEM is less portable than MPI.  This is a significant drawback.
   Although the man pages on the T3E at NERSC indicate that SHMEM calls
   have the same interfaces on all SGI/Cray products, I believe Darin is
   discovering that this is not the case.  A symptom of non-portability:
   the Origin compiler fails to recognize the !DIR$ SYMMETRIC layout
   directive.  SHMEM calls are faster essentially because the
   communicated data is laid out in memory in a specific way, which is
   specified on the T3E with this syntax.  Darin reported that the
   Origin compiler does not support this directive.  Without figuring
   out how the special memory layouts are specified, one should expect
   SHMEM to fail, and therefore one should use only the more portable
   MPI library calls.

For these reasons, I recommend getting the MPI version running on a new
computer before worrying about SHMEM.  I had hoped that the SHMEM
version would run on an Origin 2000, but this appears to be problematic.
I know that the MPI version has been compiled and run successfully on
the Origin 2000 at the ACL at LANL, so I think this is the best way to
go forward.
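
To make the module-swapping scheme concrete, here is a minimal sketch of
what an mp-style wrapper around MPI might look like.  It is not the
actual mp_mpi.f90: the routine names (init_mp, broadcast, sum_reduce,
barrier, finish_mp) and argument lists are illustrative only, and it
assumes the standard MPI-1 Fortran interface provided by mpif.h.

   ! Sketch only: a minimal mp-like wrapper around MPI.  Names are
   ! illustrative, not necessarily those used in the real mp_mpi.f90.
   module mp
     implicit none
     include 'mpif.h'
     integer :: iproc = 0   ! number of the local processor ("identify self")
     integer :: nproc = 1   ! total number of processors
   contains
     subroutine init_mp
       integer :: ierror
       call mpi_init (ierror)
       call mpi_comm_rank (mpi_comm_world, iproc, ierror)
       call mpi_comm_size (mpi_comm_world, nproc, ierror)
     end subroutine init_mp

     subroutine broadcast (x)
       ! send data from processor 0 to all
       real, intent (in out) :: x
       integer :: ierror
       call mpi_bcast (x, 1, mpi_real, 0, mpi_comm_world, ierror)
     end subroutine broadcast

     subroutine sum_reduce (x)
       ! find the global sum of a quantity that differs on each processor
       real, intent (in out) :: x
       real :: xsum
       integer :: ierror
       call mpi_allreduce (x, xsum, 1, mpi_real, mpi_sum, &
            mpi_comm_world, ierror)
       x = xsum
     end subroutine sum_reduce

     subroutine barrier
       ! wait for all processors before continuing
       integer :: ierror
       call mpi_barrier (mpi_comm_world, ierror)
     end subroutine barrier

     subroutine finish_mp
       integer :: ierror
       call mpi_finalize (ierror)
     end subroutine finish_mp
   end module mp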
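
The serial stand-in (in the spirit of mp_stub.f90, again with
hypothetical names) simply provides the same interfaces as do-nothing
routines, so the rest of the code compiles and runs unchanged on a
machine without MPI:

   ! Sketch only: a serial stub with the same interfaces.  Every routine
   ! is a no-op because there is only one processor.
   module mp
     implicit none
     integer :: iproc = 0   ! the one and only processor is number 0
     integer :: nproc = 1
   contains
     subroutine init_mp
     end subroutine init_mp

     subroutine broadcast (x)
       real, intent (in out) :: x
       ! nothing to do: there is no other processor to send to
     end subroutine broadcast

     subroutine sum_reduce (x)
       real, intent (in out) :: x
       ! the local value is already the global sum
     end subroutine sum_reduce

     subroutine barrier
     end subroutine barrier

     subroutine finish_mp
     end subroutine finish_mp
   end module mp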
I'm sorry for the errors in the shmem_stub.f90 file -- I had no computer
to test it on after Peter left the program, since he was the one with
accounts at the ACL.