1. Use transpgrid.pppl.gov or one of the TRANSP nodes swift*.
2. Do NOT run anything as pshare or as yourself in a tr_<user> directory!
   a) Login as "pshare" and ssh to tr_<user>, the run production account allocated to the owner
      of the crashed run (this account name is visible on the PPPL grid monitor). The standard
      TRANSP environment ($TRANSPROOT, etc.) will be defined for you.
   b) Check for the presence of <runid>.trexe. If it contains tshare, type: tshare_setup
3. To find the crashed run, go to:
   a) Normal crash: $RESULTDIR/<tok>/<runid>
   b) If a grid run never started and there is no .REQUEST file, do instead: cd ~/incoming
      to find the .REQUEST file. Then follow step 9.
   c) If TRANSP was hanging and got killed by pbs, you see
         $QSHARE/<runid>_<TOK>.stuck_nors
      -and-
         $QSHARE/<runid>_<TOK>.pbs.processed
      The run should be in $RESULTDIR/<tok>/<runid>
   d) If files could not be retrieved because of privilege issues (someone messing as pshare
      in a tr_ directory), the run will still be in $WORKDIR.
      1) get the host from $QSHARE/<runid>_<TOK>.stopped
      2) cd /l/<host>/tr_<user>/transp_compute/<TOK>/<runid>
      3) chmod/chown the offending directory
4. Interim output of the crashed run is available in $LOOKDIR/<TOK> or via MDSplus:
      Server = transpgrid.pppl.gov
      Tree   = trlook_<tok>
5. Input data can be examined using trdat:
      trdat <tok> .m <runid>    -- to get the menu
6. Logfiles are available in the crashed run directory, or via the web interface. If you
   suspect a pbs or NFS problem, look at $LOGDIR/runlog/<tok>/<runid>.dbg
7. The run can be debugged using standard "developer" methods (dbxadd, totalview, ...).
   Use the triage program to see if other developers are working on this stopped run, or to
   lock it if you plan on debugging the run yourself.
8. Final disposition of the run: there are four possibilities. For all of these, you may want
   to email the run's owner. The owner's email address can be found in <RUNID>_<TOK>.REQUEST
   in the crashed run directory.
   a) run is repaired (see the example session after this list):
      Note: if you want to change the log-level output, setenv LOG_LEVEL info
      1. start the run
         -either- tr_restart <runid> link    to restart the run via PBS (at the point where it crashed)
         -or-     tr_restart <runid> delete  to discard everything and start fresh via PBS
         -or-     tr_restart <runid> move    if the run is still in $WORKDIR
      2. consider `cvs commit' of any bugfixes. If pshare: make sure you update tshare.
      3. optionally, email the owner of the run.
   b) run is hopeless:
      1. email advice to the owner of the run.
      2. tr_cleanup <runid> <tok> <year>    to delete the run.
   c) run cannot be repaired but is worth saving:
      1. email the owner of the run; note that the run is incomplete.
      2. on the node where the aborted run resides:
         trlook <tok> <runid> archive    to archive the run "as is"
   d) run aborted during "rebuild of pshare": restarting the run might not work; it is better
      to re-run it from scratch, starting from the beginning.
      1. on the node where the aborted run resides: tr_restart <runid> delete
9. If there was a pbs problem, or a bug in a script:
   a) login to transpgrid as tr_<user>
   b) cd into the directory where the run is:
      1) if it aborted during pretr, it should be in: $RESULTDIR/<tok>/<runid>
      2) if a script bug, it might still be in: $WORKDIR/<tok>/<runid>
   c) restart the run
      I.   normal run (input read from mdsplus or tarfile):
           1) you must have a REQUEST file (standard runs produce the REQUEST file from standard input)
           2) tr_restart <runid> delete
      II.  GA job:
           1) need to extract .REQUEST from the tar file?
           2) check the REQUEST file for the option
              tr_restart <runid> delete [start_remote | create | trees]
      III. if you need to start from the beginning:
           1) standard run with input from stdin:
              transp_enqueue < ~/incoming/<runid>_<tok>.REQUEST
           2) with a tar file in ~/incoming:
              transp_enqueue <runid> <tok> [ NOMDSPLUS | CREATE | UFILE ]
           Note: transp_enqueue passes "UFILE" as "start_remote" to trp_pre2 & tr_pretr.pl
           See: Submit manually
10. Additional notes on debugging Collaboratory runs:
    a) Do not run anything as pshare or under your own userid from within the abort directory,
       $RESULTDIR/<tok>/<runid>! This will cause privilege problems later on.
    b) The source code is owned by pshare, not tr_<user>! You need a pshare window to edit the
       main source code. But be careful: this is the "live" TRANSP production source code!
    c) Debugging should be done as tr_<user>. The $DBGDIR directory is owned by tr_<user> and
       the temporary copy of the source code residing there can be modified. Such modifications
       are seen only by the current tr_<user> session.
    d) If you made changes to the source code, be sure to run uplink to update pshare's and/or
       tshare's binaries. See Rob's Notes.
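For illustration, a typical "run is repaired" session (step 8a) might look like the following sketch. The account tr_someuser, tokamak NSTX, and runid 12345A01 are hypothetical placeholders, and the exact form of the hop to the tr_<user> account may differ at your site:

   # from a pshare login on transpgrid.pppl.gov, hop to the run production account
   ssh tr_someuser                  # hypothetical account; the real name is on the grid monitor
   # go to the crashed run directory (normal crash, step 3a)
   cd $RESULTDIR/NSTX/12345A01
   # optionally raise the log verbosity before restarting
   setenv LOG_LEVEL info
   # restart via PBS from the point where the run crashed (step 8a)
   tr_restart 12345A01 link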
1. login as pshare or tr_<user>
2. tshare_setup
1. login as pshare on portalr5
2. tshare_gcc_setup (defines $SWAPGCC for trqsub)
3. use the R5 pbs queues: mque or dawson
All data will automatically be retrieved from the compute nodes and the jobs will be restarted. If there were many jobs running, it can take quite a while to retrieve them; some can be 10 GB in size. Do NOT manually interfere with the cron job, enq_all.
To handle scheduled outages see $CODESYSDIR/source/doc/power_outage.doc
Note: If transp is hanging, e.g. in TEQ, pbs will "legitimately" kill the job after 3 days. In this case you might wish to figure out why it was hanging.

Runs with a restart file older than 3 days will not be automatically re-started. This is indicated by the status file
   $QSHARE/<runid>_<TOK>.stuck_nors
The cronjob handles pbs problems automatically, but to prevent it from repeatedly restarting a run, it creates the file
   $QSHARE/<runid>_<TOK>.pbs.processed
So if you want the cronjob to rescue the run again, remove this file. Also, for the cronjob to process a problem run, all mpi-nodes MUST be up again so the stranded files can be retrieved. If a node is down, you can find the bad node name in:
   $QSHARE/<runid>_<TOK>.down

Only if you really think the cronjob cannot handle the situation:
1. You must prevent the cron job from processing the run simultaneously; otherwise you will lose it.
   a) Verify that the cron job is not running (enq_all runs pbs_requeue & pbs_recover):
         ps -efw | grep pbs_
      It runs every 5 min (.02, .07, ...).
   b) rm <runid>.pbs
      mv <runid>.active <runid>.stopped
2. Run verify_jobs to see if all "active" jobs are really running and all "stopped" jobs are really stopped.
   If you see "active" but "UNKNOWN", check whether the job is still on its host in /l/<host>/pshr###/transp_compute/<TOK>
   a) If the run is stuck on the host, you need to move the run manually (see the sketch after this list):
      1) login as tr_<user>
      2) mv $QSHARE/<runid>_<TOK>.active $QSHARE/<runid>_<TOK>.stopped
      3) cd /l/<host>/pshr###/transp_compute/<TOK>/<runid>
      4) tr_restart <runid> move
         This will copy the run into $RESULTDIR/<TOK>/<runid> and submit it for re-start.
   b) If the run is in $RESULTDIR/<TOK>/<runid>:
      1) login as tr_<user>
      2) mv $QSHARE/<runid>_<TOK>.active $QSHARE/<runid>_<TOK>.stopped
      3) cd $RESULTDIR/<TOK>/<runid>
      4) tr_restart <runid>
3. Phantom jobs: qstat returns "R", but the job is not running. Very likely "qdel" is not working. You can't restart the job, since TRANSP does not allow multiple runs. You must ask Kevin to kill the phantom job first.
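As a sketch of the "stuck on host" recovery in step 2a above: the host swift03, directory pshr001, tokamak NSTX, and runid 12345A01 are hypothetical placeholders.

   # as tr_<user>: mark the job as stopped so the cron job does not process it in parallel
   mv $QSHARE/12345A01_NSTX.active $QSHARE/12345A01_NSTX.stopped
   # go to the stranded run on the compute node's local disk
   cd /l/swift03/pshr001/transp_compute/NSTX/12345A01
   # copy the run back into $RESULTDIR/NSTX/12345A01 and resubmit it for restart
   tr_restart 12345A01 move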
This is now obsolete
1. The user must get a new grid proxy and send it to transpgrid (see the example session after this list).
   a) The user's proxy will be in /tmp/x509up_*; they should find it via 'ls /tmp/x509up_*'
   b) The user must send it via
         $CODESYSDIR/qcsh/tr_griduserproxy_send /tmp/<file-name>
      or
         globus-url-copy -nodcau file:///tmp/x509up_<file-name> gsiftp://transpgrid.pppl.gov/~/.globus/griduserproxy.x509
2. At PPPL, as the appropriate tr_<user>:
      cp .globus/griduserproxy.x509 $QSHARE/<tok>/<runid>/globus_gass_cache
      re_proxy <runid> <tok>
   You find the tr_<user> on the monitor or in $QSHARE/<runid>_<tok>.globus
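As an illustration of both halves of this procedure; the proxy file name x509up_u1234, tokamak NSTX, and runid 12345A01 are hypothetical placeholders:

   # on the user's machine: locate the fresh proxy and copy it to transpgrid
   ls /tmp/x509up_*
   globus-url-copy -nodcau file:///tmp/x509up_u1234 \
       gsiftp://transpgrid.pppl.gov/~/.globus/griduserproxy.x509
   # at PPPL, as the appropriate tr_<user>: install the proxy for the run and re-proxy it
   cp .globus/griduserproxy.x509 $QSHARE/NSTX/12345A01/globus_gass_cache
   re_proxy 12345A01 NSTX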
1. Check logs:
   a) run verify_jobs
   b) examine $LOGDIR/runlog/<TOK>/<runid>*
   c) cat $LOGDIR/<TOK>_<runid>.log
   d) cat $RESULTDIR/<tok>/<runid>.pbs{out,err}
   e) grep <runid> $LOGDIR/transptemp.log
   f) cat $LOGDIR/pbslog/<runid>*
   g) grep <runid> $LOGDIR/cronlog/*
   h) on isis: tracejob -n<Days> <pbs-id>
2. Run completed somehow, but post-processing needs to be done (see the example after this list):
   a) be sure to do: source ~pshare/globus/.userrc_csh
   b) tr_recover.pl <runid> <tok> <year> all
      tr_recover.pl with "all" will prompt you for plotcon, etc., and performs steps 3 and 4 below.
3. Run completed, but stopped before writing to mdsplus, or failed during mdsplot:
   a) <runid>mds.sh
      If the crash directory is empty, run manually, e.g.:
         mdsplot T T transp_nstx s transpgrid.pppl.gov n 123450101 q 12345A01
      Get the server, tree, and mds-shot from: $QSHARE/<tok>/<runid>/<runid>_<tok>.REQUEST
      Be sure to check in the REQUEST file that the user really wants MDSplus output.
      If the run did not complete:
   b) source ~pshare/globus/.userrc_csh   (or . ~pshare/globus/.userrc)
      finishup <runid>
   c) follow step 4.
4. Run stopped after finishup:
      source ~pshare/globus/.userrc_csh   (or . ~pshare/globus/.userrc)
      tr_recover.pl <runid> <tok> <year>
   This copies all output files to $ARCDIR, corrects the status, and sends email to the user.
5. Job was lost, or the user did something accidentally and wants it restored:
   1) copy the /u/tr_/transp/backup/<TOK>/<runid>.zip file into a save directory
   2) ask the user to run tr_cleanup <runid> <tok> <year>
   3) tr_cleanup <runid>
   4) tr_start <runid> with all options
   5) tr_send <runid>
   6) after the job has started, kill it with
   7) qdel <pbs.jobid>
   8) move the saved <runid>.zip file to $RESULTDIR/<TOK>/<runid>
   9) unzip <runid>.zip
   10) now this is a crashed run that is stuck on a host; follow procedure 2 under "PBS problems",
       but instead of tr_restart <runid> move, run tr_restart <runid> link
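For example, a full post-processing recovery as in case 2 above might be run as follows; the tokamak NSTX, runid 12345A01, and year 2016 are hypothetical placeholders:

   # set up the globus environment first (csh form; use ". ~pshare/globus/.userrc" in sh-family shells)
   source ~pshare/globus/.userrc_csh
   # recover the completed run; "all" prompts for plotcon etc. and performs steps 3 and 4 above
   tr_recover.pl 12345A01 NSTX 2016 all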
The triage program manages bug tracking for a stopped run. A stopped run can be locked to prevent other developers from simultaneously debugging it, a cause can be attached (which is communicated to the user through the stopped-job web page), and an action can be assigned that controls when a stopped job is cleaned up. See the triage documentation for more info.
$NTCCHOME/{lib,mod,bin} is automatically updated every night via an xshare cronjob on transpgrid.
For details see: NTCC Software for PPPL
If you can't wait, you can update it manually:
1. login as xshare
2. be sure that the source has been 'git' updated
3. compile/link the code
   a) on transpgrid: RedHat6 for intel/2015.u1 / openmpi
   b) on sunfire06: CentOS6 for gcc/6.1.0 / openmpi
   c) Note: the CentOS6 intel/2015.u1 / openmpi version is "tshare"
4. on portal or transpgrid: cd $TRANSPROOT/ntcc
5. ./update_ntcc
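A minimal manual update session, assuming the xshare environment already defines $TRANSPROOT, might look like:

   # as xshare on portal or transpgrid, after confirming the source has been git-updated
   cd $TRANSPROOT/ntcc
   ./update_ntcc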
To update the "tshare"-based "swim" module, built as pshare with tshare_setup (the tshare version is also supported on stix):
1. login as xshare on transpgrid
2. cd $TRANSPROOT/ntcc
3. ./update_ntcc_tshare