Notes for PPPL TRANSP Support Personnel
Index
Crashed Runs
Expired Proxy
Recovering from other problems
Other Tools
Crashed Runs

1. Use transpgrid.pppl.gov or one of the production nodes
petrel011.pppl.gov -- petrel018.pppl.gov.
2. Login as "pshare". If examining a crashed Collaboratory run,
then, ssh from "pshare" to pshr####, the run production account which
is allocated to the owner of the crashed run (this account name is
visible on the PPPL grid monitor). The standard TRANSP environment
($TRANSPROOT, etc.) will be defined for you.
3. To find the crashed run, go to
$RESULTDIR/<tok>/<runid>/<runid>
If a grid run never started and the .REQUEST file is not there, instead
cd ~/incoming to find the .REQUEST file, then follow step 10.
4. Collaboratory runs: acquiring the run owner's Globus proxy so you
can examine his/her MDSplus data (e.g. with trdat):
re_proxy <runid> <tok>
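For example, with placeholder values for steps 3-4 (runid 12345A01 and tokamak NSTX are illustrative only):

  cd $RESULTDIR/NSTX/12345A01/12345A01    # the crashed run directory (placeholder tok/runid)
  re_proxy 12345A01 NSTX                  # acquire the owner's Globus proxy
  trdat NSTX .m 12345A01                  # then inspect the input data, as in step 6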
5. Interim output of the crashed run is available
in $LOOKDIR/<TOK>
or via MDSplus
Server = transpgrid.pppl.gov or _transpgrid.pppl.gov (globus)
Tree = trlook_<tok>
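A hedged sketch of one way to browse the interim tree with standard MDSplus tools; the tokamak and shot number below are placeholders (the actual MDS shot can be taken from the run's .REQUEST file or the grid monitor):

  # csh syntax; "nstx" and the shot number are illustrative
  setenv trlook_nstx_path transpgrid.pppl.gov::
  mdstcl
  TCL> set tree trlook_nstx/shot=123450101
  TCL> dir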
6. Input data can be examined using `trdat'.
trdat <tok> .m <runid> -- to get menu
7. The logfile is available in the crashed run directory, or via the
web interface.
8. The run can be debugged using standard "developer" methods
(dbxadd, totalview...). Use the triage program to see if
other developers are working on this stopped run or to lock
it if you plan on debugging the run yourself.
9. Final disposition of the run: there are four possibilities. For all
of these, you may want to email the run's owner. The owner's email
address can be found in <RUNID>_<TOK>.REQUEST in the crashed run
directory (see the example after this item).
a) run is repaired:
1. for grid users: be sure you have a proxy
2. start run
-either-
tr_restart <runid> link
to restart the run via PBS (at the point where it crashed).
-or-
tr_restart <runid> delete
to discard everything and start fresh via PBS.
3. consider `cvs commit' of any bugfixes.
4. optionally, email owner of run.
b) run is hopeless:
1. email advice to owner of run.
2. tr_cleanup <runid> <tok> <year>
to delete the run.
c) run cannot be repaired but is worth saving:
1. email owner of run; note that run is incomplete.
2. On node where aborted run resides
trlook <tok> <runid> archive
to archive the run "as is".
d) run aborted during "rebuild of pshare":
Restarting the run might not work; it is better to re-run it from scratch.
1. On node where aborted run resides
tr_restart <runid> delete
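For example (placeholder names; the exact field layout of the .REQUEST file may differ), the owner's address can usually be pulled out with grep:

  cd $RESULTDIR/NSTX/12345A01/12345A01
  grep -i mail 12345A01_NSTX.REQUEST      # look for the owner's email address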
10. If there was a PBS problem, or a bug in a script:
a) login to transpgrid as pshr####
b) save .REQUEST file from ~/incoming/
cd $TMPDIR
cp ~/incoming/<runid>.REQUEST .
c) re_proxy <runid> <tok>
d) save proxy file = $X509_USER_PROXY (it will get lost in $RESULTDIR)
cp $X509_USER_PROXY ~/.globus/
e) /u/pshare/globus/transp__pbssubmit <runid> <tok> <full path of REQUEST>
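Putting steps a)-e) together with placeholder values (runid 12345A01, tokamak NSTX, account pshr1234):

  ssh pshr1234@transpgrid.pppl.gov
  cd $TMPDIR
  cp ~/incoming/12345A01.REQUEST .
  re_proxy 12345A01 NSTX
  cp $X509_USER_PROXY ~/.globus/          # save the proxy before it gets lost
  /u/pshare/globus/transp__pbssubmit 12345A01 NSTX $TMPDIR/12345A01.REQUEST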
11. Additional notes on debugging Collaboratory runs:
a) do not run anything under your own userid from within
the abort directory, $RESULTDIR/<tok>/<runid>/<runid> !
This will cause privilege problems later on.
b) the source code is owned by pshare, not pshr####! You need a
pshare window to edit the main source code. But be careful,
this is the "live" TRANSP production source code!
c) debugging can be done as pshr####. The $DBGDIR directory is
owned by pshr#### and the temporary copy of the source code
residing there can be modified. Such modifications are seen
only by the current pshr#### session.
d) if you made changes to the source code, be sure to run uplink
to update pshare's binaries. See Rob's Notes.
Expired Proxy

1. User must get a new grid proxy and send it to transpgrid.
a) The user's proxy will be in /tmp/x509up_*;
they can locate it with 'ls /tmp/x509up_*'
b) The user must send it via
$CODESYSDIR/qcsh/tr_griduserproxy_send /tmp/<file-name>
or
globus-url-copy file:///tmp/x509up_<file-name> \
    gsiftp://transpgrid.pppl.gov/~/incoming/.globus/griduserproxy.x509
2. At PPPL, as the appropriate pshr####:
re_proxy <runid> <tok>
The pshr#### account is shown on the monitor or in $QSHARE/<runid>_<tok>.globus
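Putting both halves together with placeholder values (grid-proxy-init and grid-proxy-info are standard Globus Toolkit commands; the proxy file name, runid, tokamak, and pshr account below are illustrative):

  # on the user's machine
  grid-proxy-init                    # create a fresh proxy (prompts for the grid passphrase)
  grid-proxy-info -timeleft          # confirm the proxy has time remaining
  $CODESYSDIR/qcsh/tr_griduserproxy_send /tmp/x509up_u1234

  # then at PPPL, as the matching pshr#### account
  re_proxy 12345A01 NSTX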
Recovering from other problems

1. Check logs:
a) $LOGDIR/<petrel*>/<runid>.log
b) $RESULTDIR/<tok>/<runid>.pbsout
c) $LOGDIR/pbslog/
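For example, with placeholder runid, tokamak, and production node:

  tail -50 $LOGDIR/petrel011/12345A01.log     # run log on the node where it executed
  tail -50 $RESULTDIR/NSTX/12345A01.pbsout    # PBS output for the run
  ls -lt $LOGDIR/pbslog/ | head               # most recent PBS logs first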
2. Run completed somehow but post-processing needs to be done:
a) be sure to do: source ~pshare/globus/.userrc_csh
b) tr_recover.pl <runid> <tok> <year> all
tr_recover.pl with "all" will prompt you for plotcon, etc.,
and performs steps 3 and 4 below.
3. Run completed, but stopped before writing to MDSplus, or failed during
mdsplot:
a) <runid>mds.sh
if crash directory is empty, run manually, e.g.:
mdsplot T T transp_nstx s transpgrid.pppl.gov n 123450101 q 12345A01
Get the server, tree, and MDS shot from
$QSHARE/<tok>/<runid>/<runid>_<tok>.REQUEST
(see the example after this item). Be sure to check in the REQUEST file
that the user really wants MDSplus output.
If the run did not complete:
b) source ~pshare/globus/.userrc_csh (or . ~pshare/globus/.userrc)
finishup <runid>
c) follow step 4.
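A hedged way to pull the MDSplus settings out of the REQUEST file (the keywords grepped for are guesses; check the actual file layout):

  grep -iE 'mds|server|tree|shot' $QSHARE/NSTX/12345A01/12345A01_NSTX.REQUEST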
4. Run stopped after finishup:
source ~pshare/globus/.userrc_csh (or . ~pshare/globus/.userrc)
tr_recover.pl <runid> <tok> <year>
copies all Output files to $ARCDIR
corrects status
sends email to user
Other Tools

- triage
- The triage program manages bug tracking for a stopped run. A stopped run can
be locked to prevent other developers from simultaneously debugging it, a cause
can be attached (which is communicated to the user through the stopped-job web
page), and an action can be assigned to control when the stopped job is cleaned up.
- Common Errors
- a list of TRANSP fatal errors
- tr_save
- If you made a "private run" with your own TRANSP version, and want to permanently archive it, use tr_save
- Prerequisites:
- TR.DAT, TR.INF, and *.CDF files must be in a cluster directory that can be accessed by pshare;
you must provide a TR.INF file
- Ufiles, as pointed to by TR.DAT, must be accessible by pshare
- Now do:
- login as pshare
- cd < location of your run >
- tr_save <runid> <tok> <yr>
- If you restarted another user's aborted run with your private TRANSP code as "yourself" and want to archive it, see Archiving Notes
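For example, assuming the run files are already in a pshare-readable directory (the directory, runid, tokamak, and year format below are placeholders):

  # as pshare
  cd /p/mygroup/runs/12345A01
  tr_save 12345A01 NSTX 2005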
- mds_get_inf
- to retrieve Namelist, Contents (TF.PLN), or other text nodes from
MDSplus, or to get names of Ufiles.
- Looking up the user behind a pshr#### account
- On transpgrid:/etc/grid-security/mdsip.hosts
- On cluster: ~pshare/PSHR.LIST
- To extract email:
grep pshr#### ~pshare/PSHR.LIST | cut -d: -f3
- Submitting Collaboratory Runs
- See Scripts for Experts