Notes for PPPL TRANSP Support Personnel

Index
Crashed Runs
Expired Proxy
Recovering from other problems
Other Tools

Crashed Runs:


1.  Use transpgrid.pppl.gov or one of the production nodes 
    petrel011.pppl.gov -- petrel018.pppl.gov.

2.  Log in as "pshare".  If examining a crashed Collaboratory run,
    then ssh from "pshare" to pshr####, the run production account
    allocated to the owner of the crashed run (this account name is
    visible on the PPPL grid monitor).  The standard TRANSP environment
    ($TRANSPROOT, etc.) will be defined for you.

3.  To find the crashed run, go to

    $RESULTDIR/<tok>/<runid>/<runid>

    If a grid run never started and there is no .REQUEST file in the
    run directory, look for the .REQUEST file in ~/incoming instead:
    cd ~/incoming
    Then follow step 10.

4.  Collaboratory runs: acquire the run owner's Globus proxy so you
    can examine their MDSplus data (e.g. with trdat):

    re_proxy <runid> <tok> 

5.  Interim output of the crashed run is available
        in $LOOKDIR/<TOK>
    or  via MDSplus 
    Server = transpgrid.pppl.gov or _transpgrid.pppl.gov (globus)
    Tree   = trlook_<tok>

6.  Input data can be examined using `trdat'.
    trdat <tok> .m <runid>   -- to get menu

7.  The logfile is available in the crashed run directory, or via the
    web interface.

8.  The run can be debugged using standard "developer" methods
    (dbxadd, totalview...).  Use the triage program to see if 
    other developers are working on this stopped run or to lock
    it if you plan on debugging the run yourself.

9.  Final disposition of run:  there are four possibilities.  For all
    of these, you may want to email the run's owner.  The run owner's
    email address can be found in <RUNID>_<TOK>.REQUEST  in the 
    crashed run directory.
    
    a) run is repaired:
        1. for grid users: be sure you have a proxy
        2.  start run
            -either-
             tr_restart <runid> link 
             to restart the run via PBS (at the point where it crashed).
            -or-
             tr_restart <runid> delete 
             to discard everything and start fresh via PBS.
        3.  consider `cvs commit' of any bugfixes.
        4.  optionally, email owner of run.

    b) run is hopeless:

        1.  email advice to owner of run.
        2.  tr_cleanup <runid> <tok> <year>
            to delete the run.

    c) run cannot be repaired but is worth saving:

        1.  email owner of run; note that run is incomplete.
        2.  On node where aborted run resides
            trlook <tok> <runid> archive
            to archive the run "as is".

    d) run aborted during "rebuild of pshare":
       Restarting the run from the crash point might not work; it is
       safer to rerun it from scratch.

       1. On the node where the aborted run resides:
          tr_restart <runid> delete 


10.  If there was a PBS problem, or a bug in a script:
     a) log in to transpgrid as pshr####
     b) save .REQUEST file from ~/incoming/
        cd $TMPDIR
        cp ~/incoming/<runid>.REQUEST .
     c) re_proxy <runid> <tok>
     d) save proxy file = $X509_USER_PROXY (it will get lost in $RESULTDIR)
        cp  $X509_USER_PROXY ~/.globus/
     e) /u/pshare/globus/transp__pbssubmit <runid> <tok> <full path of REQUEST>

11.  Additional notes on debugging Collaboratory runs:

    a) do not run anything under your own userid from within
       the abort directory, $RESULTDIR/<tok>/<runid>/<runid> !
       This will cause privilege problems later on.

    b) the source code is owned by pshare, not pshr####! You need a
       pshare window to edit the main source code.  But be careful,
       this is the "live" TRANSP production source code!

    c) debugging can be done as pshr####.  The $DBGDIR directory is 
       owned by pshr#### and the temporary copy of the source code
       residing there can be modified.  Such modifications are seen
       only by the current pshr#### session.
    
    d) if you made changes to the source code, be sure to run uplink
       to update pshare's binaries. See Rob's Notes.
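The lookup in step 3 (run directory first, then ~/incoming) can be sketched as a small shell helper. This is only an illustrative sketch, not part of the TRANSP tools: the function name find_request is hypothetical, it assumes $RESULTDIR is set by the standard TRANSP environment, and it assumes the .REQUEST file name begins with the run id (matched without regard to case).

```shell
# Hypothetical helper (not a TRANSP tool): locate the .REQUEST file for a run.
# Assumes RESULTDIR is set and that the file name starts with the run id.
find_request() {
    tok=$1
    runid=$2
    rundir="$RESULTDIR/$tok/$runid/$runid"
    want=$(printf '%s' "$runid" | tr '[:upper:]' '[:lower:]')
    # Look in the crashed run directory first, then in ~/incoming.
    for dir in "$rundir" "$HOME/incoming"; do
        for f in "$dir"/*.REQUEST; do
            [ -f "$f" ] || continue
            name=$(basename "$f" | tr '[:upper:]' '[:lower:]')
            case "$name" in
                "$want"*) echo "$f"; return 0 ;;
            esac
        done
    done
    echo "no .REQUEST file found for $runid" >&2
    return 1
}
```

For example, `find_request nstx 12345A01` prints the path of the first matching file, or returns a non-zero status if neither location holds one.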


Expired Proxy:

1. The user must get a new grid proxy and send it to transpgrid.
   a) The user's proxy will be in /tmp/x509up_*; the user can find
      it via 'ls /tmp/x509up_*'.
   b) The user must send it via
      $CODESYSDIR/qcsh/tr_griduserproxy_send  /tmp/<file-name>
      or
      globus-url-copy file:///tmp/x509up_<file-name> \
      gsiftp://transpgrid.pppl.gov/~/incoming/.globus/griduserproxy.x509

2. At PPPL, as the appropriate pshr#### account:
    re_proxy <runid> <tok>
    You can find the pshr#### account on the monitor or in
    $QSHARE/<runid>_<tok>.globus
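The globus-url-copy form of step 1b can be assembled from the proxy file path. The helper below is a hedged sketch: proxy_send_cmd is a hypothetical name, not a TRANSP or Globus tool, and it only prints the command for the user to review instead of running it. It assumes the proxy path is absolute, so that prefixing "file://" yields the "file:///tmp/..." form shown above.

```shell
# Hypothetical helper (not a TRANSP or Globus tool): print, without running
# it, the globus-url-copy command from step 1b for a given proxy file.
# Assumes an absolute proxy path, e.g. one found via 'ls /tmp/x509up_*'.
proxy_send_cmd() {
    proxy=$1
    if [ ! -f "$proxy" ]; then
        echo "proxy file not found: $proxy" >&2
        return 1
    fi
    dest="gsiftp://transpgrid.pppl.gov/~/incoming/.globus/griduserproxy.x509"
    echo "globus-url-copy file://$proxy $dest"
}
```

Printing the command keeps the sketch side-effect free; the user can check the source file before sending anything to transpgrid.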


Recovering from other problems:

1. Check logs:
    a) $LOGDIR/<petrel*>/<runid>.log
    b) $RESULTDIR/<tok>/<runid>.pbsout
    c) $LOGDIR/pbslog/

2. Run completed somehow but post-processing needs to be done:
   a) be sure to do: source ~pshare/globus/.userrc_csh 
   b) tr_recover.pl <runid> <tok> <year> all
      With "all", tr_recover.pl prompts you for plotcon, etc.,
      and performs steps 3 and 4 below.

3. Run completed, but stopped before writing to MDSplus, or failed during
   mdsplot:
    a) <runid>mds.sh
       If the crash directory is empty, run mdsplot manually, e.g.:
       mdsplot T T transp_nstx s transpgrid.pppl.gov n 123450101 q 12345A01
       Get the server, tree and mds-shot from:
       $QSHARE/<tok>/<runid>/<runid>_<tok>.REQUEST
       Check the REQUEST file to be sure the user really wants MDSplus
       output.
    If the run did not complete:
    b) source ~pshare/globus/.userrc_csh (or . ~pshare/globus/.userrc)
       finishup <runid>
    c) follow step 4.

4. Run stopped after finishup:
     source ~pshare/globus/.userrc_csh (or . ~pshare/globus/.userrc)
     tr_recover.pl <runid> <tok> <year> 
       copies all Output files to $ARCDIR
       corrects status
       sends email to user
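The log locations listed in step 1 can be checked in one pass. The following is a minimal sketch, not an existing TRANSP script: list_run_logs is a hypothetical name, $LOGDIR and $RESULTDIR are assumed to be set by the environment, and <petrel*> is assumed to mean per-node subdirectories whose names start with "petrel".

```shell
# Hypothetical helper (not a TRANSP tool): print the step 1 log locations
# that exist for a run. Assumes LOGDIR and RESULTDIR are set, and that
# <petrel*> denotes per-node subdirectories named petrel...
list_run_logs() {
    tok=$1
    runid=$2
    for f in "$LOGDIR"/petrel*/"$runid".log "$RESULTDIR/$tok/$runid.pbsout"; do
        if [ -f "$f" ]; then
            echo "$f"
        fi
    done
    # The PBS log directory is not specific to one run; list it if present.
    if [ -d "$LOGDIR/pbslog" ]; then
        echo "$LOGDIR/pbslog/"
    fi
    return 0
}
```

For example, `list_run_logs nstx 12345A01` prints only the files that are actually present, so a missing .pbsout shows up as an absent line.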


Other Tools:

triage
The triage program manages bug tracking for a stopped run. It can lock the stopped run to prevent other developers from debugging it at the same time, attach a cause that is communicated to the user through the stopped job web page, and assign an action that controls when the stopped job is cleaned up.

Common Errors
a list of TRANSP fatal errors

tr_save
If you made a "private run" with your own TRANSP version, and want to permanently archive it, use tr_save
  1. Prerequisites:
    1. TR.DAT, TR.INF and *.CDF files must be in a cluster directory that can be accessed by pshare;
      you must provide a TR.INF file
    2. Ufiles, as pointed to by TR.DAT, must be accessible by pshare
  2. Now do:
    1. login as pshare
    2. cd < location of your run >
    3. tr_save <runid> <tok> <yr>

If you restarted another user's aborted run under your own userid, using your private TRANSP code, and want to archive it, see Archiving Notes

mds_get_inf
Retrieves the Namelist, Contents (TF.PLN), or other text nodes from MDSplus, or gets the names of Ufiles.

Looking up user - pshr####
Submitting Collaboratory Runs
See Scripts for Experts
