Excerpted from the history file.

With much thanks to Peter Liu, Gryffin is now ported to machines with MPI
or SHMEM libraries.  There are no substantive changes to the input files.
Here, the steps required to compile and run the code are documented, along
with some of the details of the parallelism and a list of changes in
functionality and capability.  A few important points are marked with '!!'.

A preliminary version of the electromagnetic terms of Snyder and Hammett is
also included in this version, but should not be used without direct
consultation with one of those two individuals first.

The recommended way to compile on the T3E is with MPI and SHMEM.  However,
for runs with N_processors >= 2*md AND iperiod = 3, do not use SHMEM.  !!

After checking out the code, go to the src directory.  Then:

1. Links  !!

   To compile on the T3E with the MPI and SHMEM libraries:

   % make t3e_shmem

   Or, to compile on the T3E with only MPI:

   % make t3e

   To compile on the C90:

   % make c90

2. Load the right NetCDF module:  !!

   T3E version:
   % module load USG netcdf3.5a

   C90 version:
   % module load netcdf

3. Make:  !!

   % make itgc

Running:

C90: Running the code on the C90 is the same as before.

T3E: To run on the T3E, you must decide how many processors to use before
running.  You also have to execute a special command which allows the
NetCDF I/O to be done in parallel.  This command is issued automatically if
you use the xitgc script.

Number of processors:

If the number of processors (NPROC) satisfies NPROC <= (nffty-1)/3 + 1 = md
(using Fortran integer arithmetic), the work is parallelized over only the
ky index.  For md <= NPROC < 2*md, there should be no significant
improvement in parallel performance relative to NPROC = md.  For
NPROC = N*md, where N >= 2, the work is parallelized over both ky and kx.
No significant improvement in parallel performance is expected for
non-integer values of N when N >= 2; that is, NPROC = 2*md will probably
give about the same performance as NPROC = 3*md - 1.

To see how many processors are available, use the mppview command.  Further
information about the run-time environment on the T3E may be found at
[http://www.nersc.gov].

As before, the code produces a file called runname.nc.  This file can be
processed with postc, which has not been ported to the T3E because NCAR
graphics are not available there.

Developmental notes:

Overview:

The code is parallelized over k_theta (ky) for NPROC < 2*md, and over both
k_theta and k_x for NPROC > 2*md-1.  The style of programming is "data
parallel."  The FFT's needed for the nonlinear terms are in kx and ky; the
necessary redistributions are done so that each transform is performed
on-processor.  Thus, there are no calls to multiprocessor FFT's.  Almost
all communication occurs when the data is redistributed.

The moments and the fields are distributed over the processors.  Most of
the control arrays and quantities like rkperp are replicated on all
processors.

The communication is handled by MPI through a Fortran-90 interface
(mp_mpi.f90) written by Peter Liu.  There is a dummy file (mp_stub.f90)
for all these calls that is used when MPI is not available.  In future
development, all calls to MPI should be directed through the mp module, to
keep the code as portable as possible.

Most of the layout directives and local processor information are found in
Peter's gryffin_layouts.f90.  This module needs to be used whenever one
makes reference to array sections, processor number, etc.
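For instance, a loop over the distributed ky (m) and kx (n) indices is
written over the local index ranges provided by this module (the m_low,
m_high, n_low, and n_high names are described under "2. Layouts in Gryffin"
below).  A minimal sketch, with a hypothetical subroutine and array name:

   subroutine zero_moment (dens)
     use gryffin_layouts
     implicit none
     ! hypothetical distributed moment, dimensioned (theta, ky, kx);
     ! note the explicit lower bounds on the distributed indices
     complex, dimension(:, m_low:, n_low:) :: dens
     integer :: m, n

     ! loop only over the (m, n) indices owned by this processor
     do n = n_low, n_high
        do m = m_low, m_high
           dens(:, m, n) = 0.
        end do
     end do
   end subroutine zero_moment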
The module gryffin_redist.f90 contains the routines that initialize the
redistributions of the data for the nonlinear terms (and for the boundary
conditions if NPROC > 2*md-1).

Differences in the command line syntax are handled by command_line_x.f90.
The function iargc() returns the number of arguments, and the subroutine
cl_getarg (mnemonic: command_line_getarg) returns the arguments.

The NetCDF interface can use either parallel i/o (only available on the
T3E) or serial i/o.  The binary and ascii formatted i/o routines are serial
(they work on parallel machines, but ship all their data to/from processor
0, which does the actual writing/reading).

Restarts work in exactly the same way as in the past.

The nmlstrip script had to be changed to be compatible with the T3E.

Details:

1. MPI

The mp_mpi.f90 module was written by Peter Liu.  It serves as a convenient
interface to the MPI library, which has been ported to many machines.  More
information about MPI can be found at
[http://www.mcs.anl.gov/mpi/index.html].  Peter has used Fortran-90 to
overload many MPI calls and to simplify the calling syntax.

The routines available from mp are:

   init_mp
   finish_mp
   broadcast
   sum_reduce
   sum_allreduce
   max_reduce
   max_allreduce
   min_reduce
   min_allreduce
   send
   receive
   barrier

A few variables made available by this module are:

   nproc
   iproc
   proc0

A short description of each follows.

nproc:     This is the integer number of processors being used in the run.
           It is assumed to be unchanging as the code runs.

iproc:     This integer is the number of the local processor,
           0 <= iproc <= nproc-1.

proc0:     This logical variable is true if the local processor has
           iproc == 0.

init_mp:   This should be called before any other MPI calls.  An MPI
           initialization routine is called, and the variables iproc,
           nproc, and proc0 are assigned.

finish_mp: This should be called after all other MPI calls.  Typically, it
           will appear at the end of the main block of code.

broadcast: This subroutine broadcasts information from one processor to all
           other processors.  The syntax is:

              call broadcast(a[, i])

           where "a" can be any one of: character, integer, one-dimensional
           integer array, real, one-dimensional real array, complex,
           one-dimensional complex array, logical, one-dimensional logical
           array, and the optional argument "i" is the number of the
           processor from which the data is broadcast.  If the routine is
           called with only one argument ("a"), the data is broadcast from
           processor zero (iproc == 0).

           !! Calls to broadcast should be executed on all processors.  The
           MPI broadcast command causes the non-sending processors to wait
           until the data arrives.  The broadcasting processor continues on
           as soon as the data is sent.

           Correct example (assign a variable i on one processor and
           broadcast the value to the remaining processors):

              if (proc0) i = 5
              call broadcast(i)

           Incorrect example:

              if (proc0) then
                 i = 5
                 call broadcast(i)
              endif

           In the latter example, processors other than processor zero do
           not know that the data is coming, and will not receive it.

sum_reduce: Sum across processors and send the result to one processor.
           The syntax of this call is:

              call sum_reduce(a, i)

           where "a" is any of: integer, one-dimensional integer array,
           real, one-dimensional real array, complex, one-dimensional
           complex array, and "i" is the number of the processor that
           receives the answer.

           The routine sum_allreduce does the same thing, except that the
           result is sent to all the processors.  The syntax is:

              call sum_allreduce(a)

           Examples of the correct usage of sum_allreduce may be found in
           diagnostics.f90.  The thing to remember is that the part of the
           data in the sum that is local (on-processor) should be summed
           first, and then the partial results combined.  [For specific
           implementation notes about global sums in Gryffin, search for
           "sum intrinsic" below.]
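A minimal sketch of this pattern, assuming a hypothetical array of per-mode
values, wlocal, that has already been computed on each processor:

   real function total_energy (wlocal)
     use mp, only: sum_allreduce
     use gryffin_layouts
     implicit none
     ! hypothetical per-mode energies, distributed like the moments
     real, dimension(m_low:, n_low:) :: wlocal
     integer :: m, n

     ! sum the on-processor part first, with explicit loops over the
     ! local index ranges ...
     total_energy = 0.
     do n = n_low, n_high
        do m = m_low, m_high
           total_energy = total_energy + wlocal(m, n)
        end do
     end do

     ! ... then combine the partial sums; every processor ends up with
     ! the global total
     call sum_allreduce (total_energy)
   end function total_energy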
max_reduce, min_reduce, etc., all work analogously to sum_reduce, except
that complex data types are not allowed.

send:      This routine sends data from one processor to another.  The
           syntax is:

              call send(a, dest[, tag])

           where "a" is any one of: integer, one-dimensional integer array,
           real, one-dimensional real array, complex, one-dimensional
           complex array, logical, one-dimensional logical array, "dest" is
           the number of the processor that the data is being sent to, and
           "tag" is something I am not familiar with.  Peter says it is
           redundant.

receive:   This is the companion call to send.  It has the syntax:

              call receive(a, dest[, tag])

           where "a" is as with the send command, and "dest" is here the
           number of the processor that the data is coming from.

           !! Send and receive calls should be made in pairs.

barrier:   This routine causes all processors to wait until all other
           processors are at this point in the code.  It is rarely needed.
           There are three occasions that require it:

           a. at key points in the netcdf calls (see below),
           b. when lots of data is being sent to one processor, and
           c. when race conditions may occur.

           For examples of (b) and (c), see init.f, where all data is sent
           to processor zero for writing, and gryffin_redist_shmem, where
           race conditions would otherwise be a problem for the rflags and
           sflags variables.

2. Layouts in Gryffin

The data layouts are stored in a derived type variable, m_lo (mnemonic:
moment layout).  There are shorthand names available as well:

   m_low     Lowest local value of m index (ky)
   m_high    Highest local value of m index (ky)
   m_alloc   Appropriate upper m index for allocating arrays
   n_low     Lowest local value of n index (kx)
   n_high    Highest local value of n index (kx)
   n_alloc   Appropriate upper n index for allocating arrays

The routines available in this module are:

   init_gryffin_layouts
   proc_id
   idx_local

and the data available is:

   m_layout_type
   m_lo

Details:

init_gryffin_layouts:

   This routine initializes the m_lo structure, and thereby determines how
   the data is laid out across processors.  It has two integer arguments,
   md = (nffty-1)/3 + 1 and nd = 2*((nfftx-1)/3) + 1.

   The m_layout_type has the information needed for communications and
   parallel coding.  The pieces are:

   m_lo%iproc:       integer, present processor number
   m_lo%md:          integer, argument passed to initialization call
   m_lo%llim_world:  integer, overall lower limit to "m" index (= 1)
   m_lo%ulim_world:  integer, overall upper limit to "m" index (= md)
   m_lo%llim_proc:   integer, lower limit of "m" index for this processor
   m_lo%ulim_proc:   integer, upper limit of "m" index for this processor
   m_lo%ulim_alloc:  integer, upper limit of "m" index for this processor
                     in all allocation statements
   m_lo%blocksize:   integer, number of "m" indices on most processors

   !! Why does the m_lo%ulim_alloc variable appear?  It is needed to
   prevent zero-extent arrays from being passed as arguments of procedures,
   since this may be an error for calls to Fortran-77 or C library
   routines.  m_lo%ulim_alloc is the larger of m_lo%ulim_proc and
   m_lo%llim_proc.  Thus, arrays allocated with this index as an upper
   limit never have zero size, even when NPROC > md.
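   For example, a distributed array would be allocated along the lines of
   the following sketch (the array name phi and the theta-grid extent nz
   are hypothetical; the shorthand m_low:m_alloc is equivalent to
   m_lo%llim_proc:m_lo%ulim_alloc):

      ! hypothetical distributed field, dimensioned (theta, ky, kx)
      complex, dimension(:,:,:), allocatable :: phi

      allocate (phi(nz, m_lo%llim_proc:m_lo%ulim_alloc, n_low:n_alloc))
      ! the m extent is at least 1 even when NPROC > md, so phi is
      ! never zero-sized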
Regarding zero-extent arrays: Fortran-90 generally knows how to deal with
them and does the logically reasonable thing with them (this makes it
easier to deal with boundary conditions, for example).  It would probably
be possible to simplify some of the coding logic by allowing for the
possibility that arrays might be zero-sized on some processors, by writing
a Fortran-90 wrapper around any Fortran-77/C routines involved.  For
example, an mp_mpi.f90 wrapper for MPI could catch cases where it has been
sent a zero-sized array and do the right thing.  (In the routine
max_reduce, it could replace zero-extent arrays on some processors with
-infinity, and for sum_reduce it could replace an empty array with 0.)

!! Warning: sum intrinsic:  Because we are using this construct, note that
the use of intrinsics such as sum, max, etc., may give wrong answers when
used over the "m" index, since some parts of the arrays may contain
garbage.  These operations should be carried out explicitly with
Fortran-77 style (indexed) loops.

proc_id:

   This integer function returns the number of the processor that has a
   particular piece of data on it.  The syntax is:

      j = proc_id(m_lo, m, n)

   where m_lo is an initialized variable of the type m_layout_type, "m" is
   the m index of the data of interest, and "n" is the n index.  This
   function is particularly useful for specifying processor indices in
   send, receive and broadcast calls.

idx_local:

   This logical function is true if the data indexed by "m" and "n" is on
   the local processor.  The syntax is:

      logical :: local
      local = idx_local(m_lo, m, n)

   idx_local is useful for making sure that a section of code that refers
   to only one part of the distributed data is carried out only on the
   processor that has that data.  In Gryffin, it is often used when
   handling the m = 1 mode (the ky = 0 mode), which is often qualitatively
   different from the other ky modes.

!! Warning:  A hard-to-find but very easy bug to introduce is to neglect
to specify the lower bounds of a passed or declared assumed-shape array.
Fortran-90 will assume the lower bound is 1, but the different processors
expect lower bounds of m_lo%llim_proc.

3. Nonlinear layouts

The nonlinear layouts are found in gryffin_redist.  Presently, the two
directions not being transformed at a given point in time are lumped
together and distributed among the NPROC processors.  The ulim_alloc idea
is used here, too.

4. Data redistribution

The data is redistributed in redistribute.  redistribute.f90 has several
gather, scatter and fill routines overloaded for ease of use.  All coding
is based on routines originally written by Peter Liu.  The generalizations
and initialization routines were written by Bill Dorland.

5. Nonlinear terms

The nonlinear terms are evaluated in nlps.f90.  On the T3E, this file is a
link to nlps_par.f90, which uses the gryffin_redist module.  The logic is
exactly the same as in nlps_serial; the only substantive changes are in the
way the data is redistributed.

!! Before using nlps_serial, a call to subroutine 'reality' needs to be
included.  The structure can be copied from the nlps_par.f90 module.

!! The variable nredist can be set larger than 2 if more than two sizes of
arrays are expected.  Right now, 2 is appropriate, since in general kdpass
and ldb are not equal.

6. I/O

!! Only processor zero reads the input namelist.  The results are broadcast
to the remaining processors.  Thus, if you add a variable to the input
namelist, you must broadcast it after the namelist is read.  This is
important, and a common source of error.  It may be possible in general to
have all processors read the input files, but for portability we chose this
method.
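A minimal sketch of the pattern, with a hypothetical namelist /newterms/
and variable epsn (in the code these would live in the appropriate module,
and the input file is assumed to be open already on unit in_unit):

   subroutine read_newterms (in_unit)
     use mp, only: proc0, broadcast
     implicit none
     integer, intent (in) :: in_unit  ! unit on which the input file is open
     real :: epsn                     ! hypothetical new input variable
     namelist /newterms/ epsn

     if (proc0) read (unit=in_unit, nml=newterms)  ! only processor zero reads
     call broadcast (epsn)  ! executed on all processors; everyone gets the value
   end subroutine read_newterms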
There is a portable version of NetCDF available on mcurie.nersc.gov.  Here
are some notes about it from Greg Hammett, who had good interactions with
rkowens@nersc.gov:

------------------------------------------------------------------
Using NetCDF on the Cray T3E:

# Set up netcdf environment variables:  !!
module load USG netcdf3.5a

# link the code with the netCDF library:
f90 options... $NETCDF

# interpose the "global FFIO layer" between the system I/O and the
# executable:  !!
assign -F global f:netcdf_file_name

see coding requirements at the end of
http://www.nersc.gov/~rkowen/netcdf/usage.html

more info at
http://www.nersc.gov/~rkowen/
http://www.nersc.gov/~rkowen/netcdf
http://www.nersc.gov/~rkowen/netcdf/usage.html
man netcdf
------------------------------------------------------------------

Each processor reads and writes its own section of the distributed data
to/from the file.  Only processor zero writes the arrays that are
replicated across all processors.  All processors read the replicated
arrays.

!! Restarts work as before.  That is, if you just finished a run called
'run' and wish to continue the run as 'run_1', you execute:

   % cp run.in run_1.in

In run_1.in, change nread = 0 to nread = 1.  Then, type:

   % ln -s run.nc run_1.ncp
   % assign -F global f:run_1.ncp
   % ./xitgc [-n nproc] run_1

It would be convenient to have a script to handle this common operation.

!! Things that were broken in the port:

   1. Correlation functions are not calculated.
   2. It is no longer possible to specify mlo and mhi -- i.e., there is
      always a k_theta = 0 mode in the system.

!! Things that were fixed in the port:

   1. All the testsuite cases work now.  The problem with the previous .f90
      code was apparently related to uninitialized arrays.

!! Things that were noted in the port:

   1. The l_right variable is handled in possibly inconsistent ways in some
      diagnostics.

!! Things that were changed in the port:

   1. The FFT's in kparfft are done in a different order.

!! New files:

   command_line_nag.f90    Works with NAG f90 compiled codes
   command_line_unix.f90   Works on the C90
   command_line_posix.f90  Works on the T3E
   gryffin_layouts.f90
   gryffin_redist.f90
   io_ascii.f90
   io_binary.f90
   io_netcdf.f90
   mp_mpi.f90
   mp_stub.f90
   nlps_par.f90            (replaces old nl.f90 for parallel runs)
   nlps_serial.f90         (same as old nl.f90)

!! Unexpectedly strongly changed files:

   io.f90
   nmlstrip

At the time of this upgrade, the code was also brought more in line with
the Fortran 90 standard by placing more subroutines in modules.  This
resulted in the following new files:

   bounce_avg.f90
   bounds.f90
   constants.f90
   diagnostics_arrays.f90
   dmix.f90
   field_eq.f90
   fields.f90
   fields_init.f90
   linear.f90
   nlps_par.f90
   nlps_serial.f90
   nonlinear.f90
   redistribute_mpi.f90
   redistribute_shmem.f90

and these files are no longer part of the distribution:

   baphi.f90       moved to bounce_avg.f90
   kapav.f90       moved to bounce_avg.f90
   bcon.f90        moved to bounds.f90
   connect.f90     moved to bounds.f90
   filter.f90      moved to linear.f90
   flr_terms.f90   moved to linear.f90
   kparfft.f90     moved to linear.f90
   nl.f90          became nlps_par.f90 and nlps_serial.f90
   nonlin.f90      moved to nonlinear.f90
   nonline.f90     moved to nonlinear.f90
   pertb.f90       moved to fields_init.f90
   poisson.f90     moved to field_eq.f90