Notes for PPPL TRANSP Support Personnel

Index
Crashed Runs
Tshare Setup
Power Outages
PBS Problems
Expired Proxy
Recovering from other problems
Other Tools
Running mpi jobs with a debugger
Tracing mpi jobs
Manually updating ntcc libs and Transp Utilities
JCodeSys in pshare
NSTX SQL Logbook

Crashed Runs:


1.  Use transpgrid.pppl.gov or one of the TRANSP nodes swift* 
    

2.   Do NOT run anything as pshare or yourself in a tr_<user> directory !!!

    a)Login as "pshare" and ssh to tr_<user>, the run production account which
      is allocated to the owner of the crashed run (this account name is 
      visible on the PPPL grid monitor).  The standard TRANSP environment
      ($TRANSPROOT, etc.) will be defined for you.
    b)check for the presence of <runid>.trexe
      If it contains tshare, type:
      tshare_setup
    
3.  To find the crashed run, go to:

    a)Normal crash
      $RESULTDIR/<tok>/<runid>

    b)If a grid run never started and there is no .REQUEST file in the
      run directory, instead:
      cd ~/incoming to find the .REQUEST file.
      Then follow step 9.
    
    c)If TRANSP was hanging and got killed by PBS, you will see 
        $QSHARE/<runid>_<TOK>.stuck_nors  
        -and-
        $QSHARE/<runid>_<TOK>.pbs.processed  
      The run should be in $RESULTDIR/<tok>/<runid>

    d)If files could not be retrieved because of privilege issues
      (someone working as pshare in a tr_ directory), 
      the run will still be in $WORKDIR.
      1) get host from $QSHARE/<runid>_<TOK>.stopped
      2) cd /l/<host>/tr_<user>/transp_compute/<TOK>/<runid>
      3) chmod/chown the offending directory
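    Worked sketch of steps 2-3 (the account name, tokamak, and runid below are
    hypothetical examples):
       ssh pshare@transpgrid.pppl.gov      # step 2a: login as pshare
       ssh tr_smith@transpgrid             # then to the run production account
       cd $RESULTDIR/nstx/12345A01         # step 3a: normal crash location
       grep tshare 12345A01.trexe          # step 2b (assuming the .trexe file is kept here);
                                           # if it matches, run tshare_setup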


4.  Interim output of the crashed run is available
        in $LOOKDIR/<TOK>
    or  via MDSplus 
    Server = transpgrid.pppl.gov
    Tree   = trlook_<tok>

5.  Input data can be examined using `trdat'.
    trdat <tok> .m <runid>   -- to get menu
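    For example, to bring up the input-data menu for a hypothetical NSTX run:
       trdat NSTX .m 12345A01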

6.  Logfiles are available in the crashed run directory, or via the
    web interface.
    If you suspect a pbs or NFS problem, look at 
    $LOGDIR/runlog/<tok>/<runid>.dbg
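    For example, to scan the debug log for PBS or NFS symptoms
    (tokamak and runid hypothetical):
       less $LOGDIR/runlog/NSTX/12345A01.dbg
       grep -i -e pbs -e nfs $LOGDIR/runlog/NSTX/12345A01.dbg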

7.  The run can be debugged using standard "developer" methods
    (dbxadd, totalview...).  Use the triage program to see if 
    other developers are working on this stopped run or to lock
    it if you plan on debugging the run yourself.

8.  Final disposition of run:  there are four possibilities.  For all
    of these, you may want to email the run's owner.  The run owner's
    email address can be found in <RUNID>_<TOK>.REQUEST  in the 
    crashed run directory.
    
    a) run is repaired:
       Note: If you want to change the log-level output,
       setenv LOG_LEVEL info

        1.  start run
            -either-
             tr_restart <runid> link 
             to restart the run via PBS (at the point where it crashed).
            -or-
             tr_restart <runid> delete 
             to discard everything and start fresh via PBS.
            -or-
             if run is still in $WORKDIR
             tr_restart <runid> move

        2.  consider `cvs commit' of any bugfixes.
            if pshare: make sure you update tshare

        3.  optionally, email owner of run.
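        A minimal sketch of 8a.1, using a hypothetical runid:
            setenv LOG_LEVEL info        # optional: more verbose logging
            tr_restart 12345A01 link     # resume via PBS from the crash point
            # -or-
            tr_restart 12345A01 delete   # discard everything and start fresh
            # -or- (if the run is still in $WORKDIR)
            tr_restart 12345A01 move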

    b) run is hopeless:

        1.  email advice to owner of run.
        2.  tr_cleanup <runid> <tok> <year>
            to delete the run.

    c) run cannot be repaired but is worth saving:

        1.  email owner of run; note that run is incomplete.
        2.  On the node where the aborted run resides, run
	    trlook <tok> <runid> archive 
            to archive the run "as is".

    d) run aborted during "rebuild of pshare":
       Restarting the run might not work; it is better to
       re-run it from scratch.

       1. On the node where the aborted run resides, run
          tr_restart <runid> delete 



9.  If there was a pbs problem, or a bug in a script:
     a) login to transpgrid as tr_<user> 
     b) cd into directory where run is:
         1) if aborted during pretr it should be in:
           $RESULTDIR/<tok>/<runid>
         2) if script bug, might be still in:
           $WORKDIR/<tok>/<runid>

     c) restart run
        I. normal run (input read from mdsplus or tarfile):
          1) you must have a REQUEST file
            (standard runs produce the REQUEST file from standard input) 
          2) tr_restart <runid> delete
        II. GA job:
          1) you may need to extract the .REQUEST file from the tar file
          2) check REQUEST file for option
             tr_restart <runid> delete [start_remote | create | trees]
        III. if you need to start from the beginning
          1) standard run with input from stdin:
             transp_enqueue < ~/incoming/<runid>_<tok>.REQUEST
          2) with a tar file in ~/incoming:
             transp_enqueue <runid> <tok> [ NOMDSPLUS | CREATE | UFILE ]
          Note: transp_enqueue passes "UFILE" as "start_remote" to trp_pre2 & tr_pretr.pl
          (see the worked example below)

     See: Submit manually
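     Worked example for step 9 (runid and tokamak are hypothetical):
        cd ~/incoming
        transp_enqueue < 12345A01_NSTX.REQUEST     # standard run, input from stdin
        # or, for a tar-file submission:
        transp_enqueue 12345A01 NSTX UFILE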
 
10.  Additional notes on debugging Collaboratory runs:

    a) do not run anything as pshare or under your own userid from within
       the abort directory, $RESULTDIR/<tok>/<runid> !
       This will cause privilege problems later on.

    b) the source code is owned by pshare, not tr_<user>! You need a
       pshare window to edit the main source code.  But be careful,
       this is the "live" TRANSP production source code!

    c) debugging should be done as tr_<user>.  The $DBGDIR directory is 
       owned by tr_<user> and the temporary copy of the source code
       residing there can be modified.  Such modifications are seen
       only by the current tr_<user> session.
    
    d) if you made changes to source code, be sure to run uplink
       to update pshare's and/or tshare's binaries.
       See  Rob's Notes   

Top

Tshare Setup:

Top

Power Outages:

All data will automatically be retrieved from the compute nodes and jobs will be restarted.
If there were many jobs running, it can take quite a while to retrieve them - some can be 10 GB in size.
Do NOT manually interfere with the cron job, enq_all.    

To handle scheduled outages see $CODESYSDIR/source/doc/power_outage.doc


PBS Problems:

Note:
If TRANSP is hanging, e.g. in TEQ, PBS will "legitimately" kill the job after 3 days.
In this case you might wish to figure out why it was hanging.
Runs with a Restart file older than 3 days will not be automatically re-started.
This is indicated by the status file $QSHARE/<runid>_<TOK>.stuck_nors 

The cronjob handles PBS problems automatically, but to prevent it from repeatedly restarting a run, it creates the file 
$QSHARE/<runid>_<TOK>.pbs.processed
If you want the cronjob to rescue the run again, remove this file.  

Also, for the cronjob to process a problem run, all mpi-nodes MUST be up again to retrieve the stranded files. 
If a node is down, you can find the bad node name in: $QSHARE/<runid>_<TOK>.down


Only if you really think the cronjob cannot handle the situation:

1. You must prevent the cron job from processing the run simultaneously - otherwise you will lose it. 

  a)  Verify the cron job is not running (enq_all runs pbs_requeue & pbs_recover):
      ps -efw | grep pbs_
      It runs every 5 min (at :02, :07, ...)

  b)  rm <runid>.pbs
      mv <runid>.active  <runid>.stopped

2.  Run  verify_jobs, to see if all "active" jobs are really running,
    and all "stopped" jobs are really stopped.
    If you see:
    "active"  but "UNKNOWN", see if job is still on host in
    /l/<host>/pshr###/transp_compute/<TOK>

    a) If the run is stuck on a host, 
       you need to move the run manually: 
       1) login as tr_<user> 
       2) mv $QSHARE/<runid>_<TOK>.active $QSHARE/<runid>_<TOK>.stopped
       3) cd /l/<host>/pshr###/transp_compute/<TOK>/<runid>
       4) tr_restart <runid> move 
          this will copy the run into $RESULTDIR/<TOK>/<runid>
          and submit it for re-start.
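       Worked sketch of a) with hypothetical host, account, tokamak, and runid
       (replace pshr### with the actual directory name on the host):
          ssh tr_smith@transpgrid
          mv $QSHARE/12345A01_NSTX.active $QSHARE/12345A01_NSTX.stopped
          cd /l/swift03/pshr###/transp_compute/NSTX/12345A01
          tr_restart 12345A01 move    # copies to $RESULTDIR/NSTX/12345A01 and resubmits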


    b) Run is in $RESULTDIR/<TOK>/<runid>
       1) login as tr_<user> 
       2) mv $QSHARE/<runid>_<TOK>.active $QSHARE/<runid>_<TOK>.stopped
       3) cd  $RESULTDIR/<TOK>/<runid>
       4) tr_restart <runid>

3. Phantom jobs
   qstat returns "R", but the job is not running.
   Very likely "qdel" is not working.
   You cannot restart the job, since TRANSP does not allow multiple runs.
   You must ask Kevin to kill the phantom job first.
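   A quick check, with a hypothetical PBS job id:
      qstat -f 987654              # still reports job_state = R
      ps -efw | grep 12345A01      # on the execution host: no TRANSP processes remain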
Top

Expired Proxy:

This is now obsolete

1. User must get a new grid proxy and send it to transpgrid.
   a) The user's proxy will be in /tmp/x509up_*
      he/she should find it via 'ls /tmp/x509up_*'
   b) The user must send it via
      $CODESYSDIR/qcsh/tr_griduserproxy_send  /tmp/<file-name>
      or
      globus-url-copy -nodcau file:///tmp/x509up_<file-name> \
        gsiftp://transpgrid.pppl.gov/~/.globus/griduserproxy.x509 

2. At PPPL, as the appropriate tr_<user>: 
    cp .globus/griduserproxy.x509 $QSHARE/<tok>/<runid>/globus_gass_cache
    re_proxy <runid> <tok> 
    You find the tr_<user> on the monitor or in $QSHARE/<runid>_<tok>.globus

Top

Recovering from other problems:

1. Check logs:
    a) Run verify_jobs
    b) examine $LOGDIR/runlog/<TOK>/<runid>*
    c) cat $LOGDIR/<TOK>_<runid>.log
    d) cat $RESULTDIR/<tok>/<runid>.pbs{out,err}
    e) grep <runid> $LOGDIR/transptemp.log
    f) cat $LOGDIR/pbslog/<runid>*
    g) grep <runid> $LOGDIR/cronlog/*
    h) on isis:
       tracejob -n<Days> <pbs-id>
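    A one-pass sketch of the checks above, for a hypothetical NSTX run:
       verify_jobs
       ls $LOGDIR/runlog/NSTX/12345A01*
       cat $LOGDIR/NSTX_12345A01.log
       cat $RESULTDIR/nstx/12345A01.pbsout $RESULTDIR/nstx/12345A01.pbserr
       grep 12345A01 $LOGDIR/transptemp.log $LOGDIR/cronlog/*
       cat $LOGDIR/pbslog/12345A01*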

2. Run completed somehow but post-processing needs to be done:
   a) be sure to do: source ~pshare/globus/.userrc_csh 
   b) tr_recover.pl <runid> <tok> <year> all 
      tr_recover.pl with "all" will prompt you for plotcon, etc.,
      and performs Steps 3 and 4 below.
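   For example (runid, tokamak, and year hypothetical):
      source ~pshare/globus/.userrc_csh
      tr_recover.pl 12345A01 NSTX 2016 all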

3. Run completed, but stopped before writing to mdsplus, or failed during
   mdsplot:
    a) <runid>mds.sh
       if crash directory is empty, run manually, e.g.:
       mdsplot T T transp_nstx s transpgrid.pppl.gov n 123450101 q 12345A01
       Get server, tree and mds-shot from:
       $QSHARE/<tok>/<runid>/<runid>_<tok>.REQUEST
       Be sure to check in the REQUEST file that the user really wants
       MDSplus output.
    If run did not complete:
    b) source ~pshare/globus/.userrc_csh (or . ~pshare/globus/.userrc)
       finishup <runid>
    c) follow 4.

4. Run stopped after finishup:
     source ~pshare/globus/.userrc_csh (or . ~pshare/globus/.userrc)
     tr_recover.pl <runid> <tok> <year> 
       copies all Output files to $ARCDIR
       corrects status
       sends email to user

5. Job was lost or user did something accidentally and wants to restore it:
   1) copy /u/tr_/transp/backup/<TOK>/<runid>.zip file into save directory
   2) ask user to run  tr_cleanup <runid> <tok> <year> 
   3)  tr_cleanup <runid> 
   4)  tr_start <runid>   with all options
   5)  tr_send <runid>  
   6) after the job has started, kill it with
   7)  qdel <pbs.jobid> 
   8) move saved <runid>.zip file to
      $RESULTDIR/<TOK>/<runid>
   9)  unzip <runid>.zip 
  10) now this is a crashed run that is stuck on a host;
      follow procedure  2. PBS problems   but instead of 
      tr_restart <runid> move 
      run
      tr_restart <runid> link 

Top

Other Tools:

triage
The triage program manages the bug tracking of a stopped run. A stopped run can be locked to prevent other developers from simultaneously debugging it, a cause can be attached (which is communicated to the user through the stopped-job web page), and an action can be assigned to control when a stopped job is cleaned up. more info.
To check status of triage:
plook
plook -h

Common Errors
a list of TRANSP fatal errors

tr_change_year
If a run is archived in the wrong year directory, it must be re-written into mdsplus. It is not sufficient to move the files in $TRINF & $ARCDIR.
As pshare, run
tr_change_year <runid> <tok> <year> <new-year>
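For example, to move a hypothetical run archived under 2015 into 2016:
   tr_change_year 12345A01 NSTX 2015 2016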

tr_save
If you made a "private run" with your own TRANSP version, and want to permanently archive it, use tr_save
  1. Pre-requisite:
    1. TR.DAT, TR.INF and *.CDF files must be in a cluster directory that can be accessed by pshare;
      you must provide a TR.INF file
    2. Ufiles, as pointed to by TR.DAT must be accessible by pshare
  2. Now do:
    1. login as pshare
    2. cd < location of your run >
    3. tr_save <runid> <tok> <yr>
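A minimal tr_save sketch (path, runid, tokamak, and year are hypothetical):
    ssh pshare@transpgrid.pppl.gov
    cd /p/transpusers/smith/myrun      # directory holding TR.DAT, TR.INF and *.CDF
    tr_save 12345A01 NSTX 2016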

If you restarted another user's aborted run with your private TRANSP code as "yourself" and want to archive it, see Archiving Notes

mds_get_inf
to retrieve Namelist, Contents (TF.PLN), or other text nodes from MDSplus, or to get names of Ufiles.

Looking up user - tr_<user>

Searching Production Log
Submitting Collaboratory Runs
See Scripts for Experts

Suspending a Collaboratory Run
ssh -l tr_<user> transpgrid
echo <time> > $QSHARE/<runid>_<tok>.HALT_RQST
chmod g+w $QSHARE/<runid>_<tok>.HALT_RQST

Stopping a Collaboratory Run & Restart
tr_kill <runid> <tok> trp_run
this will qdel the job, and the cron job will move/restart it

Stopping a Collaboratory Run
tr_kill <runid> <tok>
the run is handled as if TRANSP aborted
Top

Running mpi jobs with a debugger


Top

Tracing mpi jobs

Some info about mpi jobs
Top

Manually updating ntcc Libraries and Transp Utilities:

$NTCCHOME/{lib,mod,bin} is automatically updated every night via an xshare cronjob on transpgrid.

For details see: NTCC Software for PPPL

If you can't wait, you can update it manually:

1. login as xshare 
2. be sure that the source has been updated via git 
3. compile/link the code
   a) on transpgrid  RedHat6 for intel/2015.u1 / openmpi
   b) on sunfire06  CentOS6 for gcc/6.1.0 / openmpi
   c) Note:  the CentOS6 for intel/2015.u1 / openmpi version is "tshare"

4. On portal or transpgrid:
   cd $TRANSPROOT/ntcc
5. ./update_ntcc

To update the "tshare" based "swim" module, built as pshare with tshare_setup:
   (tshare version is also supported on stix)
   1. login as xshare on transpgrid:
   2. cd $TRANSPROOT/ntcc
   3. ./update_ntcc_tshare

To update the ntcc module .tar/.zip files for the web (in /p/xshare/RIB/files/)
Top

JCodeSys

Some NOTES about the shared library and java code in pshare.
Top

NSTX SQL Logbook

Some NOTES about adding comments to the NSTX logbook.
Top

Home