1. Use transpgrid.pppl.gov or one of the TRANSP nodes swift*.
2. Do NOT run anything as pshare or as yourself in a tr_<user> directory!
   a) Login as "pshare" and ssh to tr_<user>, the run production account allocated to the owner
      of the crashed run (this account name is visible on the PPPL grid monitor). The standard
      TRANSP environment ($TRANSPROOT, etc.) will be defined for you.
   b) Check for the presence of <runid>.trexe. If it contains tshare, type: tshare_setup
3. To find the crashed run, go to:
   a) Normal crash: $RESULTDIR/<tok>/<runid>
   b) If a grid run never started and there is no .REQUEST file, do instead: cd ~/incoming
      to find the .REQUEST file. Then follow step 9.
   c) If TRANSP was hanging and got killed by pbs, you see
         $QSHARE/<runid>_<TOK>.stuck_nors
      -and-
         $QSHARE/<runid>_<TOK>.pbs.processed
      The run should be in $RESULTDIR/<tok>/<runid>
   d) If files could not be retrieved because of privilege issues (someone messing as pshare
      in a tr_ directory), the run will still be in $WORKDIR.
      1) get the host from $QSHARE/<runid>_<TOK>.stopped
      2) cd /l/<host>/tr_<user>/transp_compute/<TOK>/<runid>
      3) chmod/chown the offending directory
4. Interim output of the crashed run is available in $LOOKDIR/<TOK> or via MDSplus:
      Server = transpgrid.pppl.gov
      Tree   = trlook_<tok>
5. Input data can be examined using trdat:
      trdat <tok> .m <runid>    -- to get the menu
6. Logfiles are available in the crashed run directory, or via the web interface. If you
   suspect a pbs or NFS problem, look at $LOGDIR/runlog/<tok>/<runid>.dbg
7. The run can be debugged using standard "developer" methods (dbxadd, totalview, ...).
   Use the triage program to see if other developers are working on this stopped run, or to
   lock it if you plan on debugging the run yourself.
8. Final disposition of the run: there are four possibilities. For all of these, you may want
   to email the run's owner. The owner's email address can be found in <RUNID>_<TOK>.REQUEST
   in the crashed run directory.
   a) run is repaired (see the example session after this list):
      Note: if you want to change the log-level output, setenv LOG_LEVEL info
      1. start the run
         -either- tr_restart <runid> link    to restart the run via PBS (at the point where it crashed)
         -or-     tr_restart <runid> delete  to discard everything and start fresh via PBS
         -or-     tr_restart <runid> move    if the run is still in $WORKDIR
      2. consider `cvs commit' of any bugfixes. If pshare: make sure you update tshare.
      3. optionally, email the owner of the run.
   b) run is hopeless:
      1. email advice to the owner of the run.
      2. tr_cleanup <runid> <tok> <year>    to delete the run.
   c) run cannot be repaired but is worth saving:
      1. email the owner of the run; note that the run is incomplete.
      2. on the node where the aborted run resides:
         trlook <tok> <runid> archive    to archive the run "as is"
   d) run aborted during "rebuild of pshare": restarting the run might not work; it is better
      to re-run it from scratch, starting from the beginning.
      1. on the node where the aborted run resides: tr_restart <runid> delete
9. If there was a pbs problem, or a bug in a script:
   a) login to transpgrid as tr_<user>
   b) cd into the directory where the run is:
      1) if it aborted during pretr, it should be in: $RESULTDIR/<tok>/<runid>
      2) if a script bug, it might still be in: $WORKDIR/<tok>/<runid>
   c) restart the run
      I.   normal run (input read from mdsplus or tarfile):
           1) you must have a REQUEST file (standard runs produce the REQUEST file from standard input)
           2) tr_restart <runid> delete
      II.  GA job:
           1) need to extract .REQUEST from the tar file?
           2) check the REQUEST file for the option
              tr_restart <runid> delete [start_remote | create | trees]
      III. if you need to start from the beginning:
           1) standard run with input from stdin:
              transp_enqueue < ~/incoming/<runid>_<tok>.REQUEST
           2) with a tar file in ~/incoming:
              transp_enqueue <runid> <tok> [ NOMDSPLUS | CREATE | UFILE ]
           Note: transp_enqueue passes "UFILE" as "start_remote" to trp_pre2 & tr_pretr.pl
           See: Submit manually
10. Additional notes on debugging Collaboratory runs:
    a) Do not run anything as pshare or under your own userid from within the abort directory,
       $RESULTDIR/<tok>/<runid>! This will cause privilege problems later on.
    b) The source code is owned by pshare, not tr_<user>! You need a pshare window to edit the
       main source code. But be careful: this is the "live" TRANSP production source code!
    c) Debugging should be done as tr_<user>. The $DBGDIR directory is owned by tr_<user> and
       the temporary copy of the source code residing there can be modified. Such modifications
       are seen only by the current tr_<user> session.
    d) If you made changes to the source code, be sure to run uplink to update pshare's and/or
       tshare's binaries. See Rob's Notes.
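For illustration, a typical "run is repaired" session (step 8a) might look like the following sketch. The account tr_someuser, tokamak NSTX, and runid 12345A01 are hypothetical placeholders, and the exact form of the hop to the tr_<user> account may differ at your site:

   # from a pshare login on transpgrid.pppl.gov, hop to the run production account
   ssh tr_someuser                  # hypothetical account; the real name is on the grid monitor
   # go to the crashed run directory (normal crash, step 3a)
   cd $RESULTDIR/NSTX/12345A01
   # optionally raise the log verbosity before restarting
   setenv LOG_LEVEL info
   # restart via PBS from the point where the run crashed (step 8a)
   tr_restart 12345A01 link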
1. login as pshare or tr_<user>
2. tshare_setup
1. login as pshare on portalr5
2. tshare_gcc_setup (defines $SWAPGCC for trqsub)
3. use the R5 pbs queues: mque or dawson
All data will automatically be retrieved from the compute nodes and the jobs will be restarted. If there were many jobs running, it can take quite a while to retrieve them; some can be 10 GB in size. Do NOT manually interfere with the cron job, enq_all.
To handle scheduled outages see $CODESYSDIR/source/doc/power_outage.doc
Note: If transp is hanging, e.g. in TEQ, pbs will "legitimately" kill the job after 3 days. In this case you might wish to figure out why it was hanging.

Runs with a restart file older than 3 days will not be automatically re-started. This is indicated by the status file
   $QSHARE/<runid>_<TOK>.stuck_nors
The cronjob handles pbs problems automatically, but to prevent it from repeatedly restarting a run, it creates the file
   $QSHARE/<runid>_<TOK>.pbs.processed
So if you want the cronjob to rescue the run again, remove this file. Also, for the cronjob to process a problem run, all mpi-nodes MUST be up again so the stranded files can be retrieved. If a node is down, you can find the bad node name in:
   $QSHARE/<runid>_<TOK>.down

Only if you really think the cronjob cannot handle the situation:
1. You must prevent the cron job from processing the run simultaneously; otherwise you will lose it.
   a) Verify that the cron job is not running (enq_all runs pbs_requeue & pbs_recover):
         ps -efw | grep pbs_
      It runs every 5 min (.02, .07, ...).
   b) rm <runid>.pbs
      mv <runid>.active <runid>.stopped
2. Run verify_jobs to see if all "active" jobs are really running and all "stopped" jobs are really stopped.
   If you see "active" but "UNKNOWN", check whether the job is still on its host in /l/<host>/pshr###/transp_compute/<TOK>
   a) If the run is stuck on the host, you need to move the run manually (see the sketch after this list):
      1) login as tr_<user>
      2) mv $QSHARE/<runid>_<TOK>.active $QSHARE/<runid>_<TOK>.stopped
      3) cd /l/<host>/pshr###/transp_compute/<TOK>/<runid>
      4) tr_restart <runid> move
         This will copy the run into $RESULTDIR/<TOK>/<runid> and submit it for re-start.
   b) If the run is in $RESULTDIR/<TOK>/<runid>:
      1) login as tr_<user>
      2) mv $QSHARE/<runid>_<TOK>.active $QSHARE/<runid>_<TOK>.stopped
      3) cd $RESULTDIR/<TOK>/<runid>
      4) tr_restart <runid>
3. Phantom jobs: qstat returns "R", but the job is not running. Very likely "qdel" is not working. You can't restart the job, since TRANSP does not allow multiple runs. You must ask Kevin to kill the phantom job first.
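As a sketch of the "stuck on host" recovery in step 2a above: the host swift03, directory pshr001, tokamak NSTX, and runid 12345A01 are hypothetical placeholders.

   # as tr_<user>: mark the job as stopped so the cron job does not process it in parallel
   mv $QSHARE/12345A01_NSTX.active $QSHARE/12345A01_NSTX.stopped
   # go to the stranded run on the compute node's local disk
   cd /l/swift03/pshr001/transp_compute/NSTX/12345A01
   # copy the run back into $RESULTDIR/NSTX/12345A01 and resubmit it for restart
   tr_restart 12345A01 move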
This is now obsolete
1. The user must get a new grid proxy and send it to transpgrid (see the example session after this list).
   a) The user's proxy will be in /tmp/x509up_*; they should find it via 'ls /tmp/x509up_*'
   b) The user must send it via
         $CODESYSDIR/qcsh/tr_griduserproxy_send /tmp/<file-name>
      or
         globus-url-copy -nodcau file:///tmp/x509up_<file-name> gsiftp://transpgrid.pppl.gov/~/.globus/griduserproxy.x509
2. At PPPL, as the appropriate tr_<user>:
      cp .globus/griduserproxy.x509 $QSHARE/<tok>/<runid>/globus_gass_cache
      re_proxy <runid> <tok>
   You find the tr_<user> on the monitor or in $QSHARE/<runid>_<tok>.globus
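As an illustration of both halves of this procedure; the proxy file name x509up_u1234, tokamak NSTX, and runid 12345A01 are hypothetical placeholders:

   # on the user's machine: locate the fresh proxy and copy it to transpgrid
   ls /tmp/x509up_*
   globus-url-copy -nodcau file:///tmp/x509up_u1234 \
       gsiftp://transpgrid.pppl.gov/~/.globus/griduserproxy.x509
   # at PPPL, as the appropriate tr_<user>: install the proxy for the run and re-proxy it
   cp .globus/griduserproxy.x509 $QSHARE/NSTX/12345A01/globus_gass_cache
   re_proxy 12345A01 NSTX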
1. Check logs:
   a) run verify_jobs
   b) examine $LOGDIR/runlog/<TOK>/<runid>*
   c) cat $LOGDIR/<TOK>_<runid>.log
   d) cat $RESULTDIR/<tok>/<runid>.pbs{out,err}
   e) grep <runid> $LOGDIR/transptemp.log
   f) cat $LOGDIR/pbslog/<runid>*
   g) grep <runid> $LOGDIR/cronlog/*
   h) on isis: tracejob -n<Days> <pbs-id>
2. Run completed somehow, but post-processing needs to be done (see the example after this list):
   a) be sure to do: source ~pshare/globus/.userrc_csh
   b) tr_recover.pl <runid> <tok> <year> all
      tr_recover.pl with "all" will prompt you for plotcon, etc., and performs steps 3 and 4 below.
3. Run completed, but stopped before writing to mdsplus, or failed during mdsplot:
   a) <runid>mds.sh
      If the crash directory is empty, run manually, e.g.:
         mdsplot T T transp_nstx s transpgrid.pppl.gov n 123450101 q 12345A01
      Get the server, tree, and mds-shot from: $QSHARE/<tok>/<runid>/<runid>_<tok>.REQUEST
      Be sure to check in the REQUEST file that the user really wants MDSplus output.
      If the run did not complete:
   b) source ~pshare/globus/.userrc_csh   (or . ~pshare/globus/.userrc)
      finishup <runid>
   c) follow step 4.
4. Run stopped after finishup:
      source ~pshare/globus/.userrc_csh   (or . ~pshare/globus/.userrc)
      tr_recover.pl <runid> <tok> <year>
   This copies all output files to $ARCDIR, corrects the status, and sends email to the user.
5. Job was lost, or the user did something accidentally and wants it restored:
   1) copy the /u/tr_/transp/backup/<TOK>/<runid>.zip file into a save directory
   2) ask the user to run tr_cleanup <runid> <tok> <year>
   3) tr_cleanup <runid>
   4) tr_start <runid> with all options
   5) tr_send <runid>
   6) after the job has started, kill it with
   7) qdel <pbs.jobid>
   8) move the saved <runid>.zip file to $RESULTDIR/<TOK>/<runid>
   9) unzip <runid>.zip
   10) now this is a crashed run that is stuck on a host; follow procedure 2 under "PBS problems",
       but instead of tr_restart <runid> move, run tr_restart <runid> link
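For example, a full post-processing recovery as in case 2 above might be run as follows; the tokamak NSTX, runid 12345A01, and year 2016 are hypothetical placeholders:

   # set up the globus environment first (csh form; use ". ~pshare/globus/.userrc" in sh-family shells)
   source ~pshare/globus/.userrc_csh
   # recover the completed run; "all" prompts for plotcon etc. and performs steps 3 and 4 above
   tr_recover.pl 12345A01 NSTX 2016 all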
The triage program manages bug tracking for a stopped run. A stopped run can be locked to prevent other developers from simultaneously debugging it, a cause can be attached (which is communicated to the user through the stopped-job web page), and an action can be assigned that controls when a stopped job is cleaned up. See the triage documentation for more info.
$NTCCHOME/{lib,mod,bin} is automatically updated every night via an xshare cronjob on transpgrid.
For details see: NTCC Software for PPPL
If you can't wait, you can update it manually:
1. login as xshare
2. be sure that the source has been 'git' updated
3. compile/link the code
   a) on transpgrid: RedHat6 for intel/2015.u1 / openmpi
   b) on sunfire06: CentOS6 for gcc/6.1.0 / openmpi
   c) Note: the CentOS6 intel/2015.u1 / openmpi version is "tshare"
4. on portal or transpgrid: cd $TRANSPROOT/ntcc
5. ./update_ntcc
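A minimal manual update session, assuming the xshare environment already defines $TRANSPROOT, might look like:

   # as xshare on portal or transpgrid, after confirming the source has been git-updated
   cd $TRANSPROOT/ntcc
   ./update_ntcc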
To update the "tshare"-based "swim" module, built as pshare with tshare_setup (the tshare version is also supported on stix):
1. login as xshare on transpgrid
2. cd $TRANSPROOT/ntcc
3. ./update_ntcc_tshare