1. Use transpgrid.pppl.gov or one of the production nodes
   petrel011.pppl.gov -- petrel018.pppl.gov.

2. Log in as "pshare". If examining a crashed Collaboratory run, then ssh
   from "pshare" to pshr####, the run production account which is allocated
   to the owner of the crashed run (this account name is visible on the
   PPPL grid monitor). The standard TRANSP environment ($TRANSPROOT, etc.)
   will be defined for you.

3. To find the crashed run, go to
       $RESULTDIR/<tok>/<runid>/<runid>
   If a grid run never started and there is no .REQUEST file, do instead:
       cd ~/incoming
   to find the .REQUEST file. Then follow step 10.

4. Collaboratory runs: acquire the run owner's Globus proxy so you can
   examine his/her MDSplus data (e.g. with trdat):
       re_proxy <runid> <tok>

5. Interim output of the crashed run is available in $LOOKDIR/<TOK>
   or via MDSplus:
       Server = transpgrid.pppl.gov or _transpgrid.pppl.gov (globus)
       Tree   = trlook_<tok>

6. Input data can be examined using `trdat':
       trdat <tok> .m <runid>      -- to get menu

7. The logfile is available in the crashed run directory, or via the
   web interface.

8. The run can be debugged using standard "developer" methods (dbxadd,
   totalview...). Use the triage program to see if other developers are
   working on this stopped run, or to lock it if you plan on debugging
   the run yourself.

9. Final disposition of run: there are four possibilities. For all of
   these, you may want to email the run's owner. The run owner's email
   address can be found in <RUNID>_<TOK>.REQUEST in the crashed run
   directory.

   a) run is repaired:
      1. for grid users: be sure you have a proxy
      2. start run
         -either-
             tr_restart <runid> link
         to restart the run via PBS (at the point where it crashed).
         -or-
             tr_restart <runid> delete
         to discard everything and start fresh via PBS.
      3. consider `cvs commit' of any bugfixes.
      4. optionally, email owner of run.

   b) run is hopeless:
      1. email advice to owner of run.
      2. tr_cleanup <runid> <tok> <year>
         to delete the run.

   c) run cannot be repaired but is worth saving:
      1. email owner of run; note that run is incomplete.
      2. On the node where the aborted run resides:
             trlook <tok> <runid> archive
         to archive the run "as is".

   d) run aborted during "rebuild of pshare":
      Restarting the run might not work; it is better to re-run it
      from scratch.
      1. On the node where the aborted run resides:
             tr_restart <runid> delete

10. If there was a PBS problem, or a bug in a script:
   a) log in to transpgrid as pshr####
   b) save the .REQUEST file from ~/incoming/
          cd $TMPDIR
          cp ~/incoming/<runid>.REQUEST .
   c) re_proxy <runid> <tok>
   d) save proxy file = $X509_USER_PROXY (it will get lost in $RESULTDIR)
          cp $X509_USER_PROXY ~/.globus/
   e) /u/pshare/globus/transp__pbssubmit <runid> <tok> <full path of REQUEST>

11. Additional notes on debugging Collaboratory runs:
   a) do not run anything under your own userid from within the abort
      directory, $RESULTDIR/<tok>/<runid>/<runid> ! This will cause
      privilege problems later on.
   b) the source code is owned by pshare, not pshr####! You need a pshare
      window to edit the main source code. But be careful, this is the
      "live" TRANSP production source code!
   c) debugging can be done as pshr####. The $DBGDIR directory is owned
      by pshr#### and the temporary copy of the source code residing there
      can be modified. Such modifications are seen only by the current
      pshr#### session.
   d) if you made changes to source code, be sure to run uplink to update
      pshare's binaries.

See Rob's Notes
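
The lookup in step 3 can be sketched as a small shell helper. This is only an illustration, assuming a POSIX shell with $RESULTDIR defined by the standard TRANSP environment. The function name find_crashed_run and the INCOMING_DIR override are hypothetical and not part of TRANSP; the REQUEST file name <runid>.REQUEST is taken from step 10b.

```shell
#!/bin/sh
# Hypothetical helper (not part of TRANSP): report where to start looking
# for a crashed run, following step 3 above.
# Usage: find_crashed_run <tok> <runid>
find_crashed_run() {
    tok=$1
    runid=$2
    run_dir="${RESULTDIR:?RESULTDIR is not set}/$tok/$runid/$runid"
    # INCOMING_DIR is an assumed override (useful for dry runs);
    # the default is the ~/incoming directory named in step 3.
    incoming="${INCOMING_DIR:-$HOME/incoming}"

    if [ -d "$run_dir" ]; then
        # The run started: debug from its crash directory.
        printf 'RUN_DIR %s\n' "$run_dir"
    elif [ -f "$incoming/$runid.REQUEST" ]; then
        # The run never started: continue with step 10.
        printf 'REQUEST %s\n' "$incoming/$runid.REQUEST"
    else
        printf 'nothing found for %s\n' "$runid" >&2
        return 1
    fi
}
```

For example, `find_crashed_run NSTX 12345A01` prints a RUN_DIR line when the crash directory exists, and a REQUEST line when only the file in ~/incoming is present.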
1. The user must get a new grid proxy and send it to transpgrid.
   a) The user's proxy will be in /tmp/x509up_*; he/she should find it via
          ls /tmp/x509up_*
   b) The user must send it via
          $CODESYSDIR/qcsh/tr_griduserproxy_send /tmp/<file-name>
      or
          globus-url-copy file:///tmp/x509up_<file-name> gsiftp://transpgrid.pppl.gov/~/incoming/.globus/griduserproxy.x509

2. At PPPL, as the appropriate pshr####, run:
       re_proxy <runid> <tok>
   You can find the pshr#### on the monitor or in
       $QSHARE/<runid>_<tok>.globus
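
On the user's side, step 1 can be sketched as follows. This is a hedged illustration, assuming a POSIX shell; the function name proxy_send_commands is hypothetical. It only prints the tr_griduserproxy_send command from step 1b for each candidate proxy file and does not run anything, so the user can pick the right file.

```shell
#!/bin/sh
# Hypothetical helper (not part of TRANSP): list candidate proxy files
# and print, without running it, the send command from step 1b.
# Usage: proxy_send_commands [dir]     (dir defaults to /tmp)
proxy_send_commands() {
    proxy_dir="${1:-/tmp}"
    found=0
    for f in "$proxy_dir"/x509up_*; do
        [ -f "$f" ] || continue      # skip the unexpanded pattern when nothing matches
        found=1
        # $CODESYSDIR is printed literally; it is expanded when the user runs the line.
        printf '%s %s\n' '$CODESYSDIR/qcsh/tr_griduserproxy_send' "$f"
    done
    if [ "$found" -eq 0 ]; then
        printf 'no proxy file in %s; get a new grid proxy first\n' "$proxy_dir" >&2
        return 1
    fi
}
```

If several files are listed, the user should choose his/her own proxy file before sending.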
1. Check logs:
   a) $LOGDIR/<petrel*>/<runid>.log
   b) $RESULTDIR/<tok>/<runid>.pbsout
   c) $LOGDIR/pbslog/

2. Run completed somehow but post-processing needs to be done:
   a) be sure to do:
          source ~pshare/globus/.userrc_csh
   b) tr_recover.pl <runid> <tok> <year> all
      tr_recover.pl with all will prompt you for plotcon, etc. and
      does Steps 3 and 4 below.

3. Run completed, but stopped before writing to mdsplus, or failed
   during mdsplot:
   a) <runid>mds.sh
      if crash directory is empty, run manually, e.g.:
          mdsplot T T transp_nstx s transpgrid.pppl.gov n 123450101 q 12345A01
      Get server, tree and mds-shot from:
          $QSHARE/<tok>/<runid>/<runid>_<tok>.REQUEST
      Be sure to check in the REQUEST file the user really wants
      MDSplus output.
   If run did not complete:
   b) source ~pshare/globus/.userrc_csh (or . ~pshare/globus/.userrc)
      finishup <runid>
   c) follow 4.

4. Run stopped after finishup:
       source ~pshare/globus/.userrc_csh (or . ~pshare/globus/.userrc)
       tr_recover.pl <runid> <tok> <year>
           copies all Output files to $ARCDIR
           corrects status
           sends email to user
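
The log check in step 1 can be sketched as a helper that reports which of the three locations actually exist for a run. This is a minimal sketch, assuming a POSIX shell with $LOGDIR and $RESULTDIR set by the TRANSP environment; the function name list_run_logs is hypothetical, and the per-node directories are assumed to match petrel*, as in step 1a.

```shell
#!/bin/sh
# Hypothetical helper (not part of TRANSP): print the log locations from
# step 1 that exist for a given run.
# Usage: list_run_logs <tok> <runid>
list_run_logs() {
    tok=$1
    runid=$2
    # a) per-node logs (node directory is globbed), then b) the PBS output file
    for f in "$LOGDIR"/petrel*/"$runid".log "$RESULTDIR/$tok/$runid.pbsout"; do
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
    # c) the PBS log directory is shared by all runs, so only the directory is reported
    if [ -d "$LOGDIR/pbslog" ]; then
        printf '%s\n' "$LOGDIR/pbslog/"
    fi
    return 0
}
```

An empty result means none of the locations exist, which suggests the run never reached a production node.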
The triage program manages the bug tracking of a stopped run.
The stopped run can be locked to prevent other developers from debugging
it at the same time, a cause can be attached, which is communicated to
the user through the stopped job web page, and an action can be assigned
to control when the stopped job is cleaned up.