Doing a search below will usually lead straight to the problem.
We continually update this guide, so please click here to get the most recent version of the troubleshooting guide.
Problem: This can be a result of not using a parallel system that is suitable for sparse linear solvers.
Problem: In a previous call to VecSetSizes(), MatSetSizes(), VecCreateXXX() or MatCreateXXX() you passed in local and global sizes that do not make sense for the number of processes you are running on. For example, if you pass in a local size of 2 and a global size of 100 and run on two processes, this cannot work, since the sum of the local sizes is 4, not 100.
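A minimal sketch of one common fix: give only the global size (100 here is purely illustrative) and let PETSc compute consistent local sizes.
  Vec x;

  VecCreate(PETSC_COMM_WORLD, &x);
  VecSetSizes(x, PETSC_DECIDE, 100);  /* PETSc splits the global size of 100 across the processes */
                                      /* or: give each process its local size and pass PETSC_DETERMINE for the global size */
  VecSetFromOptions(x);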
Problem: An argument to a function is invalid. In Fortran this may be caused by forgetting to list an argument in the call, especially the final ierr. Otherwise it is usually caused by memory corruption; that is, somewhere the code is writing out of array bounds. To track this down, rerun the debug version of the code with the option -malloc_debug. Occasionally the code may crash only with the optimized version; in that case run the optimized version with -malloc_debug. If you determine the problem is from memory corruption, you can put the macro CHKMEMQ in the code near the crash to determine exactly what line is causing the problem. If -malloc_debug does not help, on GNU/Linux you can try using http://valgrind.org to look for memory corruption; on Apple systems do "man libgmalloc" to see how to detect memory corruption.
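As an illustration of the CHKMEMQ approach, a minimal sketch (the out-of-bounds write here is deliberate, only to show where the check reports damage when run with -malloc_debug; the include name may differ slightly between PETSc versions):
  #include "petsc.h"

  int main(int argc, char **argv)
  {
    PetscErrorCode ierr;
    PetscInt       *a;

    ierr = PetscInitialize(&argc, &argv, PETSC_NULL, PETSC_NULL);CHKERRQ(ierr);
    ierr = PetscMalloc(10*sizeof(PetscInt), &a);CHKERRQ(ierr);
    CHKMEMQ;                /* heap still intact here */
    a[10] = 1;              /* deliberate write past the end of the array */
    CHKMEMQ;                /* this check reports the corruption introduced above */
    ierr = PetscFree(a);CHKERRQ(ierr);
    ierr = PetscFinalize();CHKERRQ(ierr);
    return 0;
  }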
Problem: This is most likely due to memory corruption; see Corrupt Argument above.
Problem: A zero pivot in an LU, ILU, Cholesky, or ICC sparse factorization does not always mean that the matrix is singular. You can use -pc_factor_shift_nonzero or -pc_factor_shift_positive_definite (or -[level]_pc_factor_shift_nonzero, -[level]_pc_factor_shift_positive_definite) to prevent the zero pivot. These apply to the lu, ilu, cholesky, or icc preconditioners; [level] is sub for a block in the bjacobi or ASM preconditioner, and mg_levels or mg_coarse for inside the multigrid smoothers or the coarse grid solver. See PCFactorSetShiftNonzero(), PCFactorSetShiftPd().
This error can also happen if your matrix is singular; see KSPSetNullSpace() for how to handle this.
If this error occurs in the zeroth row of the matrix, it is likely you have an error in the code that generates the matrix.
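A minimal sketch of setting the shift from the code rather than the command line (it assumes ksp is an existing KSP whose preconditioner does the factorization, and that the calling sequence matches the PETSc version this guide describes):
  PC             pc;
  PetscErrorCode ierr;

  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCFactorSetShiftNonzero(pc, PETSC_DECIDE);CHKERRQ(ierr);  /* PETSC_DECIDE selects the default shift */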
Problem: You have not set the variable PETSC_ARCH to the architecture of your machine (e.g., sun4, rs6000).
Cure: Include in your .cshrc file some code to set it automatically, or remember to include PETSC_ARCH on the command line every time you run make, for instance: PETSC_ARCH=sun4 example4
makefile:12: /bmake/common/base: No such file or directory
makefile:13: /bmake/common/test: No such file or directory
make: *** No rule to make target `/bmake/common/test'. Stop.
Problem: The variable PETSC_DIR is not set or does not point to the PETSc directory; in this case it points to the directory /home/joe.
Cure: Make sure the variable PETSC_DIR in the makefile points to the PETSc directory. Be aware that at many sites, your home directory may have different names on different machines so it is usually better to make the path relative, rather than absolute. That is, use PETSC_DIR = ../../petsc rather than PETSC_DIR = /c/cafa/username/petsc.
Problem: The file ex1.o was compiled on a different architecture or with a different compiler.
Cure: Remove all .o files and recompile from scratch.
Problem: This is due to a bug in a call to an MPI routine.
Cure: Run the program with the option -start_in_debugger. In the debugger, type "break p4_error" (or "stop in p4_error" for dbx); then type "cont". When the program aborts, use debugger commands such as "where" to track down the problem with the call.
Problem: We have seen this on HP-UX when using the native (vendor-provided) make.
Cure: Install and use GNU make. To force PETSc to use an alternative make, edit the file petsc/bmake/$PETSC_ARCH/base and change OMAKE to your alternative.
Problem: The libraries on the IBM SP front-end for X may be different than on the nodes.
Cure: Get your system administrator to make sure the dynamic libraries on the nodes are IDENTICAL to those on the compiler server.
Problem: You have compiled some of your code with the option to check for out-of-bounds array accesses (on the IBM rs6000 this is the -C option).
Cure: Recompile all code, making sure it does not check for out-of-bounds array accesses. The use of VecGetArray(), etc., requires accessing arrays out of bounds; this is done safely.
Problem: The libraries were compiled without support for X windows.
Cure: Make sure that config/configure.py was run with the option --with-x=1
Problem: PETSc cannot work on a machine where the length of C integers does not equal the length of Fortran integers.
Cure: Use compilers that have the same length for integers, or check the compiler flags to see whether you can change the default integer length to match.
Problem: The system has detected an unaligned variable. This is usually an unaligned double.
Cure: In Fortran, make sure you always write double precision numbers as 10.d0 and the like, not just as 10., because otherwise they are stored as single precision numbers and may not be properly aligned.
Problem: This occurs when trying to access a PETSc array from Fortran. The array may have been obtained with VecGetArray(), MatGetArray(), etc. On the IRIX64 this is because the Fortran addresses are so far from the C addresses that you cannot move between them with an integer offset (integers are just not big enough). On other machines this is because the distance between the Fortran array starting point and the C array starting point is not divisible by the length of a double (or complex), so one cannot access the other with an integer offset.
Cure: 1) Rewrite the Fortran code to not use the particular XXXGetXXX() routine; for example, use VecSetValues() instead of directly stuffing the values into the array. 2) Determine how to force the Fortran and/or C compiler to commonly align doubles or complex numbers. That is, if all doubles are double aligned then this won't be a problem; if all complexes are quad aligned then it is not a problem. If you determine how to do this for a particular machine, please let us know so we can add it to PETSc.
Problem: As shown below, libxlf.a contains the Fortran routine getenv(), which is being used instead of the UNIX routine that we really need. This seems to occur when using gcc/g++ instead of xlc.
Cure: Edit the file petsc/bmake/rs6000/bpackages and define FC_LIB as follows, making sure to list "-lbsd -lc" BEFORE libxlf.a and any other Fortran libraries: FC_LIB = -lbsd -lc /usr/lib/libxlf.a
Problem: Possibly some of the following:
Cures:
Problem: This is due to the case insensitivity of Windows file systems. Instead of using string.h, the compiler is picking up String.h, a C++ include file, causing these errors.
Cure: In the gcc include directory do "cp string.h string_bak.h"; edit petsc/src/sys/src/str.c to replace string.h with string_bak.h; edit petsc/src/sys/src/memc.c to replace memory.h with string_bak.h; then recompile.
Problem: A bug in SGI's implementation of MPI, version 2.0 (confirmed by SGI).
Cure: Upgrade to SGI's version 3.0 of MPI.
Problem: The message seems harmless.
Cure: Change the CLINKER and FLINKER in bmake/IRIX64/base to
CLINKER = cc -64 ${COPTFLAGS} -Wl,-woff,84,-woff,85,-woff,134 -rpath ${LDIR}:${DYLIBPATH}
FLINKER = f77 -64 ${FOPTFLAGS} -Wl,-woff,84,-woff,85,-woff,134 -rpath ${LDIR}:${DYLIBPATH}
Problem: The graph of the matrix you are using is not symmetric.
Cure: You must use matrices with a symmetric nonzero structure for partitioning.
Problem: Actually this is not surprising. GMRES computes the norm of the residual at each iteration via a recurrence relation between the norms of the residuals at the previous iterations and quantities computed at the current iteration; it does not compute it directly as || b - A x^{n} ||. Sometimes, especially with an ill-conditioned matrix, or when the matrix-vector product is computed via differencing, the residual norms computed by GMRES start to "drift" from the correct values. At the restart, we compute the residual norm directly, hence the "strange stuff": the difference printed. The drifting, if it remains small, is harmless (it doesn't affect the accuracy of the solution that GMRES computes).
Cure: There really isn't a cure, but if you use a more powerful preconditioner the drift will often be smaller and less noticeable. Or if you are running matrix-free you may need to tune the matrix-free parameters.
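If you want to see how large the drift actually is, you can compute the true residual norm yourself after the solve; a minimal sketch (A, x, and b stand for your existing matrix, computed solution, and right-hand side):
  Vec            r;
  PetscReal      norm;
  PetscErrorCode ierr;

  ierr = VecDuplicate(b, &r);CHKERRQ(ierr);
  ierr = MatMult(A, x, r);CHKERRQ(ierr);          /* r = A x     */
  ierr = VecAYPX(r, -1.0, b);CHKERRQ(ierr);       /* r = b - A x */
  ierr = VecNorm(r, NORM_2, &norm);CHKERRQ(ierr); /* true residual norm */
  /* destroy the work vector r with VecDestroy() when done */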
Probable problems:
Cures:
Problem: You are trying to assemble a matrix that has been factored. Normally this does not make sense, unless you are using an in-place factorization and want to reuse the space.
Cure: Call MatSetUnfactored() on the matrix before calling the MatSetValues() routines.
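A minimal sketch (A is the in-place-factored matrix being refilled; the row, column, and value shown are hypothetical):
  PetscInt       row = 0, col = 0;
  PetscScalar    val = 1.0;
  PetscErrorCode ierr;

  ierr = MatSetUnfactored(A);CHKERRQ(ierr);  /* mark the factored matrix as a normal matrix again */
  ierr = MatSetValues(A, 1, &row, 1, &col, &val, INSERT_VALUES);CHKERRQ(ierr);
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);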
Symptom: Get the following errors when using PETSc graphics on Windows/Cygwin-X11:
X Error of failed request: BadMatch (invalid parameter attributes)
Major opcode of failed request: 78 (X_CreateColormap)
Serial number of failed request: 8
Current serial number in output stream: 9
Problem: This problem might occur when using 256-color mode or 32-bit color mode on Windows.
Cure: This can be fixed by changing the display settings on Windows to 16-bit or 24-bit colors.
Problem: Some Krylov methods, for example tfqmr, actually have a "sub-iteration" of size 2 inside the loop; each of the two substeps has its own matrix-vector product and application of the preconditioner and updates the residual approximations. This is why you get this "funny" output where it looks like there are two residual norms per iteration. You can also think of it as twice as many iterations.
[0]PETSC ERROR: PetscInitialize_DynamicLibraries() line 63 in src/sys/src/dll/reg.c
[0]PETSC ERROR: Unable to locate PETSc dynamic library /home/balay/spetsc/lib/libg/linux/libpetsc
You cannot move the dynamic libraries!
Problem: When using DYNAMIC libraries, the libraries cannot be moved after they are installed. This can also happen on clusters where the paths on the (run) nodes are different from those on the (compile) front-end.
Cure: Do not use dynamic or shared libraries: run config/configure.py with --with-shared=0 --with-dynamic=0.
PETSC: Attaching gdb to /opt/procast_mpich/procast051003/./procast of pid 31603 on display linux.:0.0 on machine linux. : Can't get address for linux. Xt error: Can't open display: linux.:0.0
Problem: The remote nodes do not know where to display the debugger window.
Cure: Run with the additional option -display displayname, where displayname is something like mymachine:0.0.
Problem: As the error message indicates, 'mpd', which is required by the version of MPICH you have installed, is not started.
Cure: Start the mpd daemon [should be at MPI_DIR/bin/mpdboot].
Problem: You have not set up the same account credentials on all of the nodes.
Cure: You must have the same account credentials on all the nodes participating in the mpich job. If your cluster is set up with a domain controller then you can use a domain account to launch an mpich job. If you do not have a domain controller then you must set up user accounts on all the nodes individually with the same credentials on each node. Each user can have whatever password they choose, but they must use the same password on all the nodes. In other words, UserA-PasswordA must be the same on all the nodes and UserB-PasswordB must be the same on all the nodes, etc. In addition, you must put the username/password in the registry or make it available in the environment.