Excerpted from the history file.

With much thanks to Peter Liu, Gryffin is now ported to machines with MPI
or SHMEM libraries.  There are no substantive changes to the input files.
Here, the steps required to compile and run the code are documented, along
with some of the details of the parallelism and a list of changes in
functionality and capability.  A few important points are marked with '!!'.

A preliminary version of the electromagnetic terms of Snyder and Hammett is
also included in this version, but should not be used without direct
consultation with one of those two individuals first.

The recommended way to compile on the T3E is with MPI and SHMEM.  However,
for runs with N_processors >= 2*md AND iperiod = 3, do not use SHMEM.  !!

After checking out the code, go to the src directory.  Then:

1. Links  !!

   To compile on the T3E with the MPI and SHMEM libraries:

   % make t3e_shmem

   Or, to compile on the T3E with only MPI:

   % make t3e

   To compile on the C90:

   % make c90

2. Load the right NetCDF module:  !!

   T3E version:
   % module load USG netcdf3.5a

   C90 version:
   % module load netcdf

3. Make:  !!

   % make itgc

Running:

C90: Running the code on the C90 is the same as before.

T3E: To run on the T3E, you must decide how many processors to use before
running.  You also have to execute a special command which allows the
NetCDF I/O to be done in parallel.  This command is issued automatically if
you use the xitgc script.

Number of processors:

If the number of processors (NPROC) satisfies NPROC <= (nffty-1)/3 + 1 = md
(using Fortran integer arithmetic), the work is parallelized over only the
ky index.  For md <= NPROC < 2*md, there should be no significant
improvement in parallel performance relative to NPROC = md.  For
NPROC = N*md, where N >= 2, the work is parallelized over both ky and kx.
No significant improvement in parallel performance is expected for
non-integer values of N when N >= 2; that is, NPROC = 2*md will probably
give about the same performance as NPROC = 3*md - 1.

To see how many processors are available, use the mppview command.  Further
information about the run-time environment on the T3E may be found at
[http://www.nersc.gov].

As before, the code produces a file called runname.nc.  This file can be
processed with postc, which has not been ported to the T3E because NCAR
graphics are not available there.

Developmental notes:

Overview:

The code is parallelized over k_theta (ky) for NPROC < 2*md, and over both
k_theta and k_x for NPROC > 2*md-1.  The style of programming is "data
parallel."  The FFT's needed for the nonlinear terms are in kx and ky; the
necessary redistributions are done so that each transform is performed
on-processor.  Thus, there are no calls to multiprocessor FFT's.  Almost
all communication occurs when the data is redistributed.

The moments and the fields are distributed over the processors.  Most of
the control arrays and quantities like rkperp are replicated on all
processors.

The communication is handled by MPI through a Fortran-90 interface
(mp_mpi.f90) written by Peter Liu.  There is a dummy file (mp_stub.f90)
for all these calls that is used when MPI is not available.  In future
development, all calls to MPI should be directed through the mp module, to
keep the code as portable as possible.

Most of the layout directives and local processor information are found in
Peter's gryffin_layouts.f90.  This module needs to be used whenever one
makes reference to array sections, processor number, etc.
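For instance, a loop over the distributed ky (m) and kx (n) indices is
written over the local index ranges provided by this module (the m_low,
m_high, n_low, and n_high names are described under "2. Layouts in Gryffin"
below).  A minimal sketch, with a hypothetical subroutine and array name:

   subroutine zero_moment (dens)
     use gryffin_layouts
     implicit none
     ! hypothetical distributed moment, dimensioned (theta, ky, kx);
     ! note the explicit lower bounds on the distributed indices
     complex, dimension(:, m_low:, n_low:) :: dens
     integer :: m, n

     ! loop only over the (m, n) indices owned by this processor
     do n = n_low, n_high
        do m = m_low, m_high
           dens(:, m, n) = 0.
        end do
     end do
   end subroutine zero_moment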
The module gryffin_redist.f90 contains the routines that initialize the
redistributions of the data for the nonlinear terms (and for the boundary
conditions if NPROC > 2*md-1).

Differences in the command line syntax are handled by command_line_x.f90.
The function iargc() returns the number of arguments, and the subroutine
cl_getarg (mnemonic: command_line_getarg) returns the arguments.

The NetCDF interface can use either parallel i/o (only available on the
T3E) or serial i/o.  The binary and ascii formatted i/o routines are serial
(they work on parallel machines, but ship all their data to/from processor
0, which does the actual writing/reading).

Restarts work in exactly the same way as in the past.

The nmlstrip script had to be changed to be compatible with the T3E.

Details:

1. MPI

The mp_mpi.f90 module was written by Peter Liu.  It serves as a convenient
interface to the MPI library, which has been ported to many machines.  More
information about MPI can be found at
[http://www.mcs.anl.gov/mpi/index.html].  Peter has used Fortran-90 to
overload many MPI calls and to simplify the calling syntax.

The routines available from mp are:

   init_mp
   finish_mp
   broadcast
   sum_reduce
   sum_allreduce
   max_reduce
   max_allreduce
   min_reduce
   min_allreduce
   send
   receive
   barrier

A few variables made available by this module are:

   nproc
   iproc
   proc0

A short description of each follows.

nproc:     This is the integer number of processors being used in the run.
           It is assumed to be unchanging as the code runs.

iproc:     This integer is the number of the local processor,
           0 <= iproc <= nproc-1.

proc0:     This logical variable is true if the local processor has
           iproc == 0.

init_mp:   This should be called before any other MPI calls.  An MPI
           initialization routine is called, and the variables iproc,
           nproc, and proc0 are assigned.

finish_mp: This should be called after all other MPI calls.  Typically, it
           will appear at the end of the main block of code.

broadcast: This subroutine broadcasts information from one processor to all
           other processors.  The syntax is:

              call broadcast(a[, i])

           where "a" can be any one of: character, integer, one-dimensional
           integer array, real, one-dimensional real array, complex,
           one-dimensional complex array, logical, one-dimensional logical
           array, and the optional argument "i" is the number of the
           processor from which the data is broadcast.  If the routine is
           called with only one argument ("a"), the data is broadcast from
           processor zero (iproc == 0).

           !! Calls to broadcast should be executed on all processors.  The
           MPI broadcast command causes the non-sending processors to wait
           until the data arrives.  The broadcasting processor continues on
           as soon as the data is sent.

           Correct example (assign a variable i on one processor and
           broadcast the value to the remaining processors):

              if (proc0) i = 5
              call broadcast(i)

           Incorrect example:

              if (proc0) then
                 i = 5
                 call broadcast(i)
              endif

           In the latter example, processors other than processor zero do
           not know that the data is coming, and will not receive it.

sum_reduce: Sum across processors and send the result to one processor.
           The syntax of this call is:

              call sum_reduce(a, i)

           where "a" is any of: integer, one-dimensional integer array,
           real, one-dimensional real array, complex, one-dimensional
           complex array, and "i" is the number of the processor that
           receives the answer.

           The routine sum_allreduce does the same thing, except that the
           result is sent to all the processors.  The syntax is:

              call sum_allreduce(a)

           Examples of the correct usage of sum_allreduce may be found in
           diagnostics.f90.  The thing to remember is that the part of the
           data in the sum that is local (on-processor) should be summed
           first, and then the partial results combined.  [For specific
           implementation notes about global sums in Gryffin, search for
           "sum intrinsic" below.]
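A minimal sketch of this pattern, assuming a hypothetical array of per-mode
values, wlocal, that has already been computed on each processor:

   real function total_energy (wlocal)
     use mp, only: sum_allreduce
     use gryffin_layouts
     implicit none
     ! hypothetical per-mode energies, distributed like the moments
     real, dimension(m_low:, n_low:) :: wlocal
     integer :: m, n

     ! sum the on-processor part first, with explicit loops over the
     ! local index ranges ...
     total_energy = 0.
     do n = n_low, n_high
        do m = m_low, m_high
           total_energy = total_energy + wlocal(m, n)
        end do
     end do

     ! ... then combine the partial sums; every processor ends up with
     ! the global total
     call sum_allreduce (total_energy)
   end function total_energy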
max_reduce, min_reduce, etc., all work analogously to sum_reduce, except
that complex data types are not allowed.

send:      This routine sends data from one processor to another.  The
           syntax is:

              call send(a, dest[, tag])

           where "a" is any one of: integer, one-dimensional integer array,
           real, one-dimensional real array, complex, one-dimensional
           complex array, logical, one-dimensional logical array, "dest" is
           the number of the processor that the data is being sent to, and
           "tag" is something I am not familiar with.  Peter says it is
           redundant.

receive:   This is the companion call to send.  It has the syntax:

              call receive(a, dest[, tag])

           where "a" is as with the send command, and "dest" is here the
           number of the processor that the data is coming from.

           !! Send and receive calls should be made in pairs.

barrier:   This routine causes all processors to wait until all other
           processors are at this point in the code.  It is rarely needed.
           There are three occasions that require it:

           a. at key points in the netcdf calls (see below),
           b. when lots of data is being sent to one processor, and
           c. when race conditions may occur.

           For examples of (b) and (c), see init.f, where all data is sent
           to processor zero for writing, and gryffin_redist_shmem, where
           race conditions would otherwise be a problem for the rflags and
           sflags variables.

2. Layouts in Gryffin

The data layouts are stored in a derived type variable, m_lo (mnemonic:
moment layout).  There are shorthand names available as well:

   m_low     Lowest local value of m index (ky)
   m_high    Highest local value of m index (ky)
   m_alloc   Appropriate upper m index for allocating arrays
   n_low     Lowest local value of n index (kx)
   n_high    Highest local value of n index (kx)
   n_alloc   Appropriate upper n index for allocating arrays

The routines available in this module are:

   init_gryffin_layouts
   proc_id
   idx_local

and the data available is:

   m_layout_type
   m_lo

Details:

init_gryffin_layouts:

   This routine initializes the m_lo structure, and thereby determines how
   the data is laid out across processors.  It has two integer arguments,
   md = (nffty-1)/3 + 1 and nd = 2*((nfftx-1)/3) + 1.

   The m_layout_type has the information needed for communications and
   parallel coding.  The pieces are:

   m_lo%iproc:       integer, present processor number
   m_lo%md:          integer, argument passed to initialization call
   m_lo%llim_world:  integer, overall lower limit to "m" index (= 1)
   m_lo%ulim_world:  integer, overall upper limit to "m" index (= md)
   m_lo%llim_proc:   integer, lower limit of "m" index for this processor
   m_lo%ulim_proc:   integer, upper limit of "m" index for this processor
   m_lo%ulim_alloc:  integer, upper limit of "m" index for this processor
                     in all allocation statements
   m_lo%blocksize:   integer, number of "m" indices on most processors

   !! Why does the m_lo%ulim_alloc variable appear?  It is needed to
   prevent zero-extent arrays from being passed as arguments of procedures,
   since this may be an error for calls to Fortran-77 or C library
   routines.  m_lo%ulim_alloc is the larger of m_lo%ulim_proc and
   m_lo%llim_proc.  Thus, arrays allocated with this index as an upper
   limit never have zero size, even when NPROC > md.
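   For example, a distributed array would be allocated along the lines of
   the following sketch (the array name phi and the theta-grid extent nz
   are hypothetical; the shorthand m_low:m_alloc is equivalent to
   m_lo%llim_proc:m_lo%ulim_alloc):

      ! hypothetical distributed field, dimensioned (theta, ky, kx)
      complex, dimension(:,:,:), allocatable :: phi

      allocate (phi(nz, m_lo%llim_proc:m_lo%ulim_alloc, n_low:n_alloc))
      ! the m extent is at least 1 even when NPROC > md, so phi is
      ! never zero-sized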
Regarding zero-extent arrays: Fortran-90 generally knows how to deal with
them and does the logically reasonable thing with them (this makes it
easier to deal with boundary conditions, for example).  It would probably
be possible to simplify some of the coding logic by allowing for the
possibility that arrays might be zero-sized on some processors, by writing
a Fortran-90 wrapper around any Fortran-77/C routines involved.  For
example, an mp_mpi.f90 wrapper for MPI could catch cases where it has been
sent a zero-sized array and do the right thing.  (In the routine
max_reduce, it could replace zero-extent arrays on some processors with
-infinity, and for sum_reduce it could replace an empty array with 0.)

!! Warning: sum intrinsic:  Because we are using this construct, note that
the use of intrinsics such as sum, max, etc., may give wrong answers when
used over the "m" index, since some parts of the arrays may contain
garbage.  These operations should be carried out explicitly with
Fortran-77 style (indexed) loops.

proc_id:

   This integer function returns the number of the processor that has a
   particular piece of data on it.  The syntax is:

      j = proc_id(m_lo, m, n)

   where m_lo is an initialized variable of the type m_layout_type, "m" is
   the m index of the data of interest, and "n" is the n index.  This
   function is particularly useful for specifying processor indices in
   send, receive and broadcast calls.

idx_local:

   This logical function is true if the data indexed by "m" and "n" is on
   the local processor.  The syntax is:

      logical :: local
      local = idx_local(m_lo, m, n)

   idx_local is useful for making sure that a section of code that refers
   to only one part of the distributed data is carried out only on the
   processor that has that data.  In Gryffin, it is often used when
   handling the m = 1 mode (the ky = 0 mode), which is often qualitatively
   different from the other ky modes.

!! Warning:  A hard-to-find but very easy bug to introduce is to neglect
to specify the lower bounds of a passed or declared assumed-shape array.
Fortran-90 will assume the lower bound is 1, but the different processors
expect lower bounds of m_lo%llim_proc.

3. Nonlinear layouts

The nonlinear layouts are found in gryffin_redist.  Presently, the two
directions not being transformed at a given point in time are lumped
together and distributed among the NPROC processors.  The ulim_alloc idea
is used here, too.

4. Data redistribution

The data is redistributed in redistribute.  redistribute.f90 has several
gather, scatter and fill routines overloaded for ease of use.  All coding
is based on routines originally written by Peter Liu.  The generalizations
and initialization routines were written by Bill Dorland.

5. Nonlinear terms

The nonlinear terms are evaluated in nlps.f90.  On the T3E, this file is a
link to nlps_par.f90, which uses the gryffin_redist module.  The logic is
exactly the same as in nlps_serial; the only substantive changes are in the
way the data is redistributed.

!! Before using nlps_serial, a call to subroutine 'reality' needs to be
included.  The structure can be copied from the nlps_par.f90 module.

!! The variable nredist can be set larger than 2 if more than two sizes of
arrays are expected.  Right now, 2 is appropriate, since in general kdpass
and ldb are not equal.

6. I/O

!! Only processor zero reads the input namelist.  The results are broadcast
to the remaining processors.  Thus, if you add a variable to the input
namelist, you must broadcast it after the namelist is read.  This is
important, and a common source of error.  It may be possible in general to
have all processors read the input files, but for portability we chose this
method.
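A minimal sketch of the pattern, with a hypothetical namelist /newterms/
and variable epsn (in the code these would live in the appropriate module,
and the input file is assumed to be open already on unit in_unit):

   subroutine read_newterms (in_unit)
     use mp, only: proc0, broadcast
     implicit none
     integer, intent (in) :: in_unit  ! unit on which the input file is open
     real :: epsn                     ! hypothetical new input variable
     namelist /newterms/ epsn

     if (proc0) read (unit=in_unit, nml=newterms)  ! only processor zero reads
     call broadcast (epsn)  ! executed on all processors; everyone gets the value
   end subroutine read_newterms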
There is a portable version of NetCDF available on mcurie.nersc.gov.  Here
are some notes about it from Greg Hammett, who had good interactions with
rkowens@nersc.gov:

------------------------------------------------------------------
Using NetCDF on the Cray T3E:

# Set up netcdf environment variables:  !!
module load USG netcdf3.5a

# link the code with the netCDF library:
f90 options... $NETCDF

# interpose the "global FFIO layer" between the system I/O and the
# executable:  !!
assign -F global f:netcdf_file_name

see coding requirements at the end of
http://www.nersc.gov/~rkowen/netcdf/usage.html

more info at
http://www.nersc.gov/~rkowen/
http://www.nersc.gov/~rkowen/netcdf
http://www.nersc.gov/~rkowen/netcdf/usage.html
man netcdf
------------------------------------------------------------------

Each processor reads and writes its own section of the distributed data
to/from the file.  Only processor zero writes the arrays that are
replicated across all processors.  All processors read the replicated
arrays.

!! Restarts work as before.  That is, if you just finished a run called
'run' and wish to continue the run as 'run_1', you execute:

   % cp run.in run_1.in

In run_1.in, change nread = 0 to nread = 1.  Then, type:

   % ln -s run.nc run_1.ncp
   % assign -F global f:run_1.ncp
   % ./xitgc [-n nproc] run_1

It would be convenient to have a script to handle this common operation.

!! Things that were broken in the port:

   1. Correlation functions are not calculated.
   2. It is no longer possible to specify mlo and mhi -- i.e., there is
      always a k_theta = 0 mode in the system.

!! Things that were fixed in the port:

   1. All the testsuite cases work now.  The problem with the previous .f90
      code was apparently related to uninitialized arrays.

!! Things that were noted in the port:

   1. The l_right variable is handled in possibly inconsistent ways in some
      diagnostics.

!! Things that were changed in the port:

   1. The FFT's in kparfft are done in a different order.

!! New files:

   command_line_nag.f90    Works with NAG f90 compiled codes
   command_line_unix.f90   Works on the C90
   command_line_posix.f90  Works on the T3E
   gryffin_layouts.f90
   gryffin_redist.f90
   io_ascii.f90
   io_binary.f90
   io_netcdf.f90
   mp_mpi.f90
   mp_stub.f90
   nlps_par.f90            (replaces old nl.f90 for parallel runs)
   nlps_serial.f90         (same as old nl.f90)

!! Unexpectedly strongly changed files:

   io.f90
   nmlstrip

At the time of this upgrade, the code was also brought more in line with
the Fortran 90 standard by placing more subroutines in modules.  This
resulted in the following new files:

   bounce_avg.f90
   bounds.f90
   constants.f90
   diagnostics_arrays.f90
   dmix.f90
   field_eq.f90
   fields.f90
   fields_init.f90
   linear.f90
   nlps_par.f90
   nlps_serial.f90
   nonlinear.f90
   redistribute_mpi.f90
   redistribute_shmem.f90

and these files are no longer part of the distribution:

   baphi.f90       moved to bounce_avg.f90
   kapav.f90       moved to bounce_avg.f90
   bcon.f90        moved to bounds.f90
   connect.f90     moved to bounds.f90
   filter.f90      moved to linear.f90
   flr_terms.f90   moved to linear.f90
   kparfft.f90     moved to linear.f90
   nl.f90          became nlps_par.f90 and nlps_serial.f90
   nonlin.f90      moved to nonlinear.f90
   nonline.f90     moved to nonlinear.f90
   pertb.f90       moved to fields_init.f90
   poisson.f90     moved to field_eq.f90