Docs:   Troubleshooting

Doing a search below will usually lead straight to the problem.

We continually update this guide, so please click here to get the most recent version of the troubleshooting guide.


  1. Symptom: No or little speedup:

    Problem: This can result from running on a parallel system that is not well suited to sparse linear solvers.

  2. Symptom: Error detected in PetscSplitOwnership() about "sum of local lengths ...":

    Problem: In a previous call to VecSetSizes(), MatSetSizes(), VecCreateXXX(), or MatCreateXXX() you passed in local and global sizes that do not make sense for the given number of processors. For example, if you pass in a local size of 2 and a global size of 100 and run on two processors, this cannot work, since the sum of the local sizes is 4, not 100.
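
    For example, a minimal sketch of consistent usage (the sizes here are hypothetical; the same rule applies to MatSetSizes()):

      Vec x;
      ierr = VecCreate(PETSC_COMM_WORLD,&x);CHKERRQ(ierr);
      /* local size 50 on each of 2 processes gives global size 100 */
      ierr = VecSetSizes(x,50,100);CHKERRQ(ierr);
      /* or let PETSc pick the local sizes:  VecSetSizes(x,PETSC_DECIDE,100);   */
      /* or let PETSc sum the local sizes:   VecSetSizes(x,50,PETSC_DETERMINE); */
      ierr = VecSetFromOptions(x);CHKERRQ(ierr);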

  3. Symptom: Corrupt argument:

    Problem: An argument to a function is invalid. In Fortran this may be caused by forgetting to list an argument in the call, especially the final ierr.
        
    Otherwise it is usually caused by memory corruption; that is, somewhere the code is writing out of array bounds. To track this down, rerun the debug version of the code with the option -malloc_debug. Occasionally the code may crash only with the optimized version; in that case, run the optimized version with -malloc_debug. If you determine the problem is from memory corruption, you can put the macro CHKMEMQ in the code near the crash to determine exactly what line is causing the problem.
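
    For example, a minimal sketch (MyRoutine1() and MyRoutine2() are hypothetical stand-ins for your own code):

      ierr = MyRoutine1();CHKERRQ(ierr);
      CHKMEMQ;   /* checks the heap and reports the file and line if it is corrupted */
      ierr = MyRoutine2();CHKERRQ(ierr);
      CHKMEMQ;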
     
    If -malloc_debug does not help, on GNU/Linux you can try using http://valgrind.org to look for memory corruption; on Apple systems, run "man libgmalloc" to see how to detect memory corruption.

  4. Symptom: Caught signal:

    Problem: This is most likely due to memory corruption; see Corrupt argument above.
     

  5. Symptom: Detected zero pivot in LU factorization:

    Problem: A zero pivot in an LU, ILU, Cholesky, or ICC sparse factorization does not always mean that the matrix is singular. You can use -pc_factor_shift_nonzero or -pc_factor_shift_positive_definite, or -[level]_pc_factor_shift_nonzero or -[level]_pc_factor_shift_positive_definite, to prevent the zero pivot. Here [level] is sub for a block in the bjacobi or ASM preconditioner, and mg_levels or mg_coarse for the multigrid smoothers or the coarse grid solver. See PCFactorSetShiftNonzero() and PCFactorSetShiftPd().

    This error can also happen if your matrix is singular; see KSPSetNullSpace() for how to handle this.

    If this error occurs in the zeroth row of the matrix it is likely you have an error in the code that generates the matrix.
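
    The shift can also be set directly from the code; a minimal sketch, assuming a KSP solver ksp has already been created (PETSC_DECIDE requests the default shift):

      PC pc;
      ierr = KSPGetPC(ksp,&pc);CHKERRQ(ierr);
      ierr = PCFactorSetShiftNonzero(pc,PETSC_DECIDE);CHKERRQ(ierr);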

  6. Symptom: alder> make
    make: Warning: Can't find `../../bmake/':

    Problem: You have not set the variable PETSC_ARCH to the architecture of your machine (e.g., sun4, rs6000).

    Cure: Include code in your .cshrc file to set it automatically, or remember to include PETSC_ARCH on the command line every time you run make; for instance, make PETSC_ARCH=sun4 example4.

  7. Symptom: alder> make 

    makefile:12: /bmake/common/base: No such file or directory
    makefile:13: /bmake/common/test: No such file or directory
    make: *** No rule to make target `/bmake/common/test'.  Stop.

    or makefile:12: home/joe/bmake/common/base: No such file or directory
    makefile:13: home/joe/bmake/common/test: No such file or directory
    make: *** No rule to make target `home/joe/bmake/common/test'.  Stop.

    Problem: The variable PETSC_DIR is not set or does not point to the PETSc directory; in the second case above it points to the directory home/joe.

    Cure: Make sure the variable PETSC_DIR in the makefile points to the PETSc directory. Be aware that at many sites your home directory may have different names on different machines, so it is usually better to make the path relative rather than absolute. That is, use PETSC_DIR = ../../petsc rather than PETSC_DIR = /c/cafa/username/petsc.

  8. Symptom: alder> make ex1
    f77 -g -o ex1 ex1.o affine2d.o \
      ../../libs/libsg/sun4/domain.a ../../libs/libsg/sun4/Xtools.a ../../libs/libsg/sun4/tools.a ../../libs/libsg/sun4/liblapack.a ../../libs/libsg/sun4/blas.a ../../libs/libsg/sun4/system.a -lX11 -lm
    ld: ex1.o: bad magic number
    Compilation failed *** Error code 4

    Problem: The file ex1.o was compiled on a different architecture or with a different compiler.

    Cure: Remove all .o files and recompile from scratch.

  9. Symptom: using MPICH
    [0] Truncated message (in CHK_MSGLEN)
    [0] Aborting program!
    p0_8959: p4_error: (null): 1

    Problem: This is due to a bug in a call to an MPI routine.

    Cure: Run the program with the option -start_in_debugger. In the debugger, type "break p4_error" (or "stop in p4_error" for dbx); then type "cont". When the program aborts, use debugger commands such as "where" to track down the problem with the call.

  10. Symptom: on HP-UX
    Make: Unknown flag argument -. Stop.
    Make: Unknown flag argument -. Stop.
    Make: Unknown flag argument -. Stop.

    Problem: We have seen this on HP-UX when using the native (vendor-provided) make.

    Cure: Install and use GNU make. To force PETSc to use an alternative make, edit the file petsc/bmake/$PETSC_ARCH/base and change OMAKE to your alternative.

  11. Symptom: on IBM SP
    Could not load program /afs/rpi.edu/big/00/0000/hongwl/petsc/petsc/src/ksp/examples/ex1
    Symbol XSetWMProperties in pmd2 is undefined
    Symbol XSetWMName in pmd2 is undefined
    Error was: Exec format error

    Problem: The X libraries on the IBM SP front-end may be different from those on the nodes.

    Cure: Get your system administrator to make sure the dynamic libraries on the nodes are IDENTICAL to those on the compiler server.

  12. Symptom: using Fortran, while using VecGetArray(), MatGetArray(), or ISGetArray(), the program stops with
    /usr/local/mpi/bin/mpirun.ch_p4: 17545 Breakpoint

    Problem: You have compiled some of your code with the option that checks for array accesses out of bounds (on the IBM rs6000 this is the -C option).

    Cure: Recompile all code, making sure it does not check for array accesses out of bounds. The use of VecGetArray(), etc., requires accessing arrays out of their declared bounds; this is done safely.

  13. Symptom: You create Draw windows or ViewerDraw windows or use the options -ksp_monitor_draw or -snes_monitor_draw, and the program seems to run OK but the windows never open.

    Problem: The libraries were compiled without support for X windows.

    Cure: Make sure that config/configure.py was run with the option --with-x=1.

  14. Symptom:

    Problem: PETSc cannot work on a machine where the length of C integers does not equal the length of Fortran integers.

    Cure: Use compilers that have the same length for integers, or check the compiler flags to see if you can change the default integer lengths to match.

  15. Symptom: on DEC Alpha
    Unaligned access pid=15199 va=140021674 pc=12001e8d8 ra=12001e8c0 type=ldt

    Problem: The system has detected an unaligned variable. This is usually an unaligned double.

    Cure: Make sure in Fortran that you always write double precision numbers as 10.d0, etc., not just as 10., because otherwise the number will be stored in single precision and may not be properly aligned.

  16. Symptom: PetscScalarAddressToFortran:C and Fortran arrays are not commonly aligned
    or are too far apart to be indexed by an integer.
    Locations: C 1920156 Fortran 2438656
    [0] MPI Abort by user Aborting program!
    [0] Aborting program!

    Problem: This occurs when trying to access a PETSc array from Fortran. The array may have been obtained with VecGetArray(), MatGetArray(), etc. On the IRIX64 this is because the Fortran addresses are so far away from the C addresses that you cannot move between them with an integer offset (integers are just not big enough). On other machines this is because the distance between the Fortran array starting point and the C array starting point is not divisible by the length of a double (or complex). Thus one cannot access the other with an integer offset.

    Cures:

    1. Rewrite the Fortran code to not use the particular XXXGetXXX() routine; for example, use VecSetValues() instead of directly stuffing the values into the array (see the sketch below).
    2. Determine how to force the Fortran and/or C compiler to commonly align doubles or complex numbers. That is, if all doubles are double aligned then this won't be a problem; if all complex are quad aligned then it is not a problem. If you determine how to do this for a particular machine, please let us know so we can add it to PETSc.
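
    A minimal sketch of cure 1, assuming a vector x has already been created (the indices and values are hypothetical):

      PetscInt    ix[2] = {0,1};
      PetscScalar y[2]  = {10.0,20.0};
      ierr = VecSetValues(x,2,ix,y,INSERT_VALUES);CHKERRQ(ierr);
      ierr = VecAssemblyBegin(x);CHKERRQ(ierr);   /* required after VecSetValues() */
      ierr = VecAssemblyEnd(x);CHKERRQ(ierr);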

  17. Symptom: On rs6000 machines, the program encounters a segmentation fault when initializing MPI.
    [light] mpirun ex1
    /light_home2/lmcinnes/mpich/lib/rs6000/ch_p4/mpirun: 23817 Memory fault
    See, e.g., the following debugger session:
    [light] 525%gdb ex1
    GDB is free software and you are welcome to distribute copies of it under certain conditions; type "show copying" to see the conditions. There is absolutely no warranty for GDB; type "show warranty" for details. GDB 4.13 (rs6000-ibm-aix3.2), Copyright 1994 Free Software Foundation, Inc...
    (gdb) run -p4pg joe
    Starting program: /light_home2/lmcinnes/petsc-2.0.15/src/sles/examples/tutorials/ex1 -p4pg joe
    Program received signal SIGSEGV, Segmentation fault. 0x10003750 in getenv ()
    (gdb) where
    #0 0x10003750 in getenv ()
    #1 0x10001438 in MPIR_Init (=0x2ff7f630, =0x2ff7f634)
    #2 0x10001384 in MPI_Init (=0x2ff7f630, =0x2ff7f634)
    #3 0x100004d8 in main (argc=3, args=0x2ff7f65c) at ex1.c:37
    #4 0x10000430 in __start ()

    Problem: The Fortran library libxlf.a contains a routine getenv(), which is being linked in instead of the UNIX routine that is really needed. This seems to occur when using gcc/g++ instead of xlc.

    Cure: Edit the file petsc/bmake/rs6000/bpackages and define FC_LIB as follows, making sure to list "-lbsd -lc" BEFORE libxlf.a and any other Fortran libraries: FC_LIB = -lbsd -lc /usr/lib/libxlf.a

  18. Symptom: The program seems to use more and more memory as it runs, even though you don't think you are allocating more memory.

    Problem: Possibly some of the following:

    1. You are creating new PETSc objects but never freeing them.
    2. There is a memory leak in PETSc or your code.
    3. Something much more subtle (if you are using Fortran): when you declare a large array in Fortran, the operating system does not allocate all the memory pages for that array until you start using the different locations in the array. Thus if, at each step, your code starts using later values in the array, its virtual memory usage as measured by ps or top will appear to keep increasing.
    4. You are running with the -log, -log_mpe, or -log_all option. Here a great deal of logging information is stored in memory until the conclusion of the run.
    5. You are linking with the MPI profiling libraries; these cause logging of all MPI activities. Another symptom is that at the conclusion of the run it may print some message about writing log files.

    Cures:

    1. Run with the -malloc_debug and -malloc_dump options, or use the commands PetscMallocDump() and PetscMallocLogDump() sprinkled in your code to track memory that is allocated and not later freed. Use the commands PetscMallocSpace() and PetscGetResidentSetSize() to monitor memory allocated and total memory used as the code progresses (see the sketch after this list).
    2. This is just the way Unix works and is harmless.
    3. Do not use the -log, -log_mpe, or -log_all option, or use PLogEventDeactivate(), PLogEventDeactivateClass(), or PLogEventMPEDeactivate() to turn off logging of specific events.
    4. Make sure you do not link with the MPI profiling libraries. 
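
    A minimal sketch of cure 1 (the exact calling sequence of PetscMallocDump() may differ slightly between PETSc versions):

      /* run the program with -malloc_debug; then, at a point of interest,
         print all memory that has been allocated and not yet freed */
      ierr = PetscMallocDump(stdout);CHKERRQ(ierr);
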
  19. Symptom: under Windows, when installing using g++
    libfast in: /users/petsc/petsc_prj/petsc/src/sys/src
    str.c: In function `void PetscStrncpy(char *, char *, int)':
    str.c:36: warning: implicit declaration of function `int strncpy(...)'
    ...

    Problem: This is due to the case insensitivity of Windows file systems. Instead of using string.h, the compiler is picking up String.h, a C++ include file, causing these errors.

    Cure: In the gcc include directory, do "cp string.h string_bak.h". Then edit petsc/src/sys/src/str.c to replace string.h with string_bak.h, edit petsc/src/sys/src/memc.c to replace memory.h with string_bak.h, and recompile.

  20. Symptom: on SGI (or Origin) using SGI MPI version 2.0
    MPI Error, rank:0, function:MPI_ERRHANDLER_SET, Invalid communicator
    MPI_Abort() called, aborting program!
    Other, random crashes in MPI.

    Problem: A bug in version 2.0 of SGI's MPI implementation (confirmed by SGI).

    Cure: Upgrade to SGI's version 3.0 of MPI.

  21. Symptom: on SGI Powerchallenge running 6.2
    ld64: WARNING 134: weak definition of __dcis in /usr/lib64/mips4/libftn.so preempts that weak definition in /usr/lib64/mips4/libm.so.
    ld64: WARNING 134: weak definition of __rcis in /usr/lib64/mips4/libftn.so preempts that weak definition in /usr/lib64/mips4/libm.so.

    Problem: The message seems harmless.

    Cure: Change the CLINKER and FLINKER in bmake/IRIX64/base to 
    CLINKER = cc -64 ${COPTFLAGS} -Wl,-woff,84,-woff,85,-woff,134 -rpath ${LDIR}:${DYLIBPATH}
    FLINKER = f77 -64 ${FOPTFLAGS} -Wl,-woff,84,-woff,85,-woff,134 -rpath ${LDIR}:${DYLIBPATH}

  22. Symptom: when calling MatPartitioningApply() you get a message Error! Key 16615 not found

    Problem: The graph of the matrix you are using is not symmetric.

    Cure: You must use matrices with a symmetric nonzero structure for partitioning; the sketch below shows the basic calling sequence.
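
    A minimal sketch of the basic partitioning calling sequence, assuming A is an assembled matrix whose nonzero structure is symmetric (the destroy routines take pointer arguments in later PETSc versions):

      MatPartitioning part;
      IS              is;
      ierr = MatPartitioningCreate(PETSC_COMM_WORLD,&part);CHKERRQ(ierr);
      ierr = MatPartitioningSetAdjacency(part,A);CHKERRQ(ierr);  /* A must have a symmetric graph */
      ierr = MatPartitioningSetFromOptions(part);CHKERRQ(ierr);
      ierr = MatPartitioningApply(part,&is);CHKERRQ(ierr);       /* new process assignment for each row */
      ierr = ISDestroy(is);CHKERRQ(ierr);
      ierr = MatPartitioningDestroy(part);CHKERRQ(ierr);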

  23. Symptom: With GMRES At restart the second residual norm printed does not match the first
    26 KSP Residual norm 3.421544615851e-04
    27 KSP Residual norm 2.973675659493e-04
    28 KSP Residual norm 2.588642948270e-04
    29 KSP Residual norm 2.268190747349e-04
    30 KSP Residual norm 1.977245964368e-04
    30 KSP Residual norm 1.994426291979e-04 <----- At restart the residual norm is printed a second time

    Problem: Actually this is not surprising. GMRES computes the norm of the residual at each iteration via a recurrence relation between the norms of the residuals at the previous iterations and quantities computed at the current iteration; it does not compute it directly as || b - A x^{n} ||. Sometimes, especially with an ill-conditioned matrix or a matrix-vector product computed via differencing, the residual norms computed by GMRES start to "drift" from the correct values. At the restart, the residual norm is computed directly, hence the difference printed. The drift, if it remains small, is harmless (it doesn't affect the accuracy of the solution that GMRES computes).

    Cure: There really isn't a cure, but if you use a more powerful preconditioner the drift will often be smaller and less noticeable. Or if you are running matrix-free you may need to tune the matrix-free parameters.

  24. Symptom: on Cray T3E/T3D, my mixed Fortran/(C or C++) code works fine on other machines but does not link (or links but crashes) on the Cray T3D or T3E.

    Probable problems:

    1. NOT LINKING: the Cray Fortran compiler changes all Fortran routine names to all caps, so when you call them from C/C++ with all small letters, the linker cannot find them.
    2. STRANGE CRASHING: the Cray Fortran compiler uses double precision to denote quad precision and single precision to denote "regular" double precision.

    Cures:

    1. You must make sure that when you call Fortran routines from C/C++, the name of the routine called (in C/C++) is in all caps. The PETSc macro HAVE_FORTRAN_CAPS is defined on machines like the Cray, so you can use it in your C/C++ code like this:

       #if defined(HAVE_FORTRAN_CAPS)
       #define myfortranroutine_ MYFORTRANROUTINE
       #elif !defined(HAVE_FORTRAN_UNDERSCORE)
       #define myfortranroutine_ myfortranroutine
       #endif
       /* some C code that calls Fortran */
       myfortranroutine_(.....);

       See src/fortran/custom/zoptions.c for examples.
    2. To get the Fortran compiler to behave like a normal Unix Fortran compiler, you must make sure that all of your Fortran routines are compiled with the -dp flag. If you use the PETSc makefiles and the macro FC to compile your Fortran code, this is handled automatically.

  25. Symptom: [0] PETSC ERROR: MatAssemblyBegin() line 1858 in src/mat/interface/matrix.c Not for factored matrix

    Problem: You are trying to assemble a matrix that has been factored. Normally this does not make sense, unless you are using an in-place factorization and want to reuse the space.

    Cure: Call MatSetUnfactored() before calling the MatSetValues() routines, as in the sketch below.
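
    A minimal sketch, assuming A holds an in-place factorization whose space you wish to reuse (row, col, and val are hypothetical):

      ierr = MatSetUnfactored(A);CHKERRQ(ierr);   /* mark A as no longer factored */
      ierr = MatSetValues(A,1,&row,1,&col,&val,INSERT_VALUES);CHKERRQ(ierr);
      ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);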

  26. Symptom: Error while running program

    > mpirun ex1
    p0_27137: p4_error: semget failed for setnum=%d: 0

    Problem: Improperly installed or configured MPICH. Often this results from compiling the socket-based version of MPICH (with device ch_p4) but using the mpirun associated with the shared memory version, or the other way around.

    Cure: First, make sure that you can run plain MPI programs (those without PETSc). Make sure you are using the correct version of mpirun for the installed version of MPICH, or reinstall MPICH.
  27. Symptom: Error when compiling PETSc examples

    > ld : -lg2c no such file

    Problem: Your Fortran compiler is probably using libf2c.a instead of libg2c.a.

    Cure: Edit bmake/${PETSC_ARCH}/petscconf and replace -lg2c with -lf2c

  28. Symptom: You get the following errors when using PETSc graphics on Windows/Cygwin-X11

    X Error of failed request: BadMatch (invalid parameter attributes)
    Major opcode of failed request: 78 (X_CreateColormap)
    Serial number of failed request: 8
    Current serial number in output stream: 9

    Problem: This problem might occur when using 256 color mode or 32 bit color mode on Windows.

    Cure: This can be fixed by changing the display settings on Windows to 16 bit colors or 24 bit colors.

  29. Symptom: Some Krylov methods seem to print two residual norms per iteration, for example

    > 1198 KSP Residual norm 1.366052062216e-04
    > 1198 KSP Residual norm 1.931875025549e-04
    > 1199 KSP Residual norm 1.366026406067e-04
    > 1199 KSP Residual norm 1.931819426344e-04 

    Problem: Some Krylov methods, for example tfqmr, actually have a "sub-iteration"
    of size 2 inside the loop; each of the two substeps has its own matrix-vector
    product and application of the preconditioner and updates the residual
    approximations. This is why you get this "funny" output where it looks like 
    there are two residual norms per iteration. You can also think of it as twice
    as many iterations.

  30. Symptom: The example compiles fine - but at runtime gives the following error:

    [0]PETSC ERROR: PetscInitialize_DynamicLibraries() line 63 in src/sys/src/dll/reg.c
    [0]PETSC ERROR: Unable to locate PETSc dynamic library /home/balay/spetsc/lib/libg/linux/libpetsc 
    You cannot move the dynamic libraries!

    Problem: When using dynamic libraries, the libraries cannot be moved after they are installed. This can also happen on clusters, where the paths on the (run) nodes differ from those on the (compile) front-end.

    Cure: Do not use dynamic and shared libraries; run config/configure.py with --with-shared=0 --with-dynamic=0.

  31. Symptom: When running with -start_in_debugger one gets the error message

    PETSC: Attaching gdb to /opt/procast_mpich/procast051003/./procast of pid 31603 on display linux.:0.0 on machine linux.
    Can't get address for linux.
    Xt error: Can't open display: linux.:0.0

    Problem: The remote nodes do not know where to display the debugger window.

    Cure: Run with the additional option -display displayname, where displayname is something like mymachine:0.0.

  32. Symptom:
    # make PETSC_ARCH=asterix-mpd test
    Running test examples to verify correct installation
    Possible error running C/C++ src/snes/examples/tutorials/ex19 with 1 MPI process
    See http://www.mcs.anl.gov/petsc/petsc-as/documentation/troubleshooting.html
    mpdrun_asterix: cannot connect to local mpd (/tmp/mpd2.console_balay); possible causes:
      1. no mpd is running on this host
      2. an mpd is running but was started without a "console" (-n option)

    Problem: As the error message indicates, mpd, which is required by the version of MPICH you've installed, is not running.

    Cure: Start the mpd daemon [it should be at MPI_DIR/bin/mpdboot].

  33. Symptom: On Windows, make test fails to launch programs with the error "Logon failure: unknown user name or bad password"

    Problem: You do not have the same account credentials on all the nodes participating in the MPICH job.

    Cure: You must have the same account credentials on all the nodes participating in the MPICH job. If your cluster is set up with a domain controller, then you can use a domain account to launch an MPICH job. If you do not have a domain controller, then you must set up user accounts on all the nodes individually, with the same credentials on each node. Each user can have whatever password they choose, but they must use the same password on all the nodes. In other words, UserA-PasswordA must be the same on all the nodes, UserB-PasswordB must be the same on all the nodes, etc. In addition, you must put the username/password in the registry or make it available in the environment.