triage program


Introduction

The triage program is a developer tool for recording the cause of a stopped run and for communicating to other developers that a stopped run is being analyzed. For basic help info,
  triage -h
To start the triage program in a gui for a particular run,
  triage <runid> <tok> <year>
If the <year> argument is left off, the year is extracted from the $QSHARE/<runid>_<tok>.year file. If triage is run from the crash directory of a stopped job ($RESULTDIR/<tok>/<runid>/<runid>), all of the arguments can be omitted.
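
The crash-directory rule above can be illustrated with a small shell sketch. This is not the triage program's own code: the function name args_from_crash_dir is hypothetical, and it only assumes the $RESULTDIR/<tok>/<runid>/<runid> layout given above.

```shell
# Hypothetical helper (not part of triage): derive "<runid> <tok>" from a
# crash directory of the form $RESULTDIR/<tok>/<runid>/<runid>.
args_from_crash_dir() {
    dir=${1:-$PWD}
    rel=${dir#"$RESULTDIR"/}            # path relative to $RESULTDIR
    [ "$rel" != "$dir" ] || return 1    # not under $RESULTDIR
    tok=${rel%%/*}                      # first path component
    rest=${rel#*/}                      # should now be <runid>/<runid>
    case "$rest" in
        */*) ;;
        *) return 1 ;;                  # path is too short
    esac
    runid=${rest%%/*}
    last=${rest#*/}
    [ "$last" = "$runid" ] || return 1  # the last two components must match
    printf '%s %s\n' "$runid" "$tok"
}
```

Called with no argument, the function inspects the current directory and prints nothing (with a non-zero status) when the directory does not look like a crash directory.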

Environment

Because the triage program is typically run as pshare, the environment variable TRIAGE_DEVELOPER should be set to your developer name before running the triage program. This name is used for file locking and for new reports. The triage program expects to find $QSHARE, $CODESYSDIR/codata (containing the files triageCause.txt, triageAction.txt and triageTime.txt) and the $QSHARE/<runid>_<tok>.year file.
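
A pre-flight check of these requirements might look like the sketch below. The function check_triage_env is hypothetical and is not shipped with triage; it only tests for the variable and files named above.

```shell
# Hypothetical pre-flight check (not part of triage): report anything the
# triage program expects but cannot find. Returns non-zero if something
# is missing.
check_triage_env() {
    runid=$1
    tok=$2
    missing=0
    if [ -z "${TRIAGE_DEVELOPER:-}" ]; then
        echo "TRIAGE_DEVELOPER is not set" >&2
        missing=1
    fi
    for f in \
        "${CODESYSDIR:-}/codata/triageCause.txt" \
        "${CODESYSDIR:-}/codata/triageAction.txt" \
        "${CODESYSDIR:-}/codata/triageTime.txt" \
        "${QSHARE:-}/${runid}_${tok}.year"
    do
        if [ ! -r "$f" ]; then
            echo "missing or unreadable: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

It could be used before starting the gui, for example
  export TRIAGE_DEVELOPER=<your developer name>
  check_triage_env <runid> <tok> && triage <runid> <tok>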

Reports

The triage program manages "Reports", each of which records the cause of a stopped run, the time of the report, the developer's name and the action to be taken for the stopped job. The top half of the triage program lists the shot info and the existing reports, which can be stepped through with the buttons,
  Back    - move back one report
  Forward - move forward one report
  Last    - jump to the last report
The numerical label on the far right is the index of the current report. The report information is stored as ASCII in the $QSHARE/<runid>_<tok>.reason file.

Locking

To prevent two developers from changing the reason file simultaneously, a lock file called $QSHARE/<runid>_<tok>.reason_lock is used. When the lock file does not exist, the reason file is considered unlocked and the label in the upper right of the triage program reads "Unlocked" in green. You can then lock the file by hitting the "Lock" button, which, if successful, changes the label to yellow with your developer name (from the TRIAGE_DEVELOPER environment variable). If someone else has locked the reason file, the label is red and shows that developer's name. A lock is also acquired if you try to add a new report. If you own the lock on exiting, the triage program asks whether you want to release it. You generally want to release the lock, but you can keep it if you want to dissuade other developers from looking at the run. The lock is checked every 3 seconds and the reports are updated whenever the lock is released.
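
The locking scheme can be sketched in shell as follows. This is an illustration only, not the triage program's implementation: the function names are hypothetical, and the sketch assumes that the lock file simply contains the owner's developer name, which is not documented here.

```shell
# Hypothetical sketch of the lock protocol (not the triage implementation).
# Assumption: the lock file contains the owner's developer name.
lock_path() {
    printf '%s/%s_%s.reason_lock\n' "$QSHARE" "$1" "$2"
}

# Print "Unlocked" or the name stored in the lock file.
lock_status() {
    lock=$(lock_path "$1" "$2")
    if [ -e "$lock" ]; then
        cat "$lock"
    else
        echo "Unlocked"
    fi
}

# Try to take the lock; fails if the lock file already exists.
acquire_lock() {
    lock=$(lock_path "$1" "$2")
    # With noclobber set, the redirection fails when the file exists, so a
    # second developer cannot overwrite an existing lock. Behaviour on
    # network file systems may differ.
    ( set -o noclobber; printf '%s\n' "$TRIAGE_DEVELOPER" > "$lock" ) 2>/dev/null
}

# Remove the lock, but only if the current developer owns it.
release_lock() {
    lock=$(lock_path "$1" "$2")
    [ "$(cat "$lock" 2>/dev/null)" = "$TRIAGE_DEVELOPER" ] && rm -f "$lock"
}
```

Creating the file and failing if it already exists in a single step is what keeps two developers from both believing they hold the lock; testing for the file first and writing it afterwards would leave a window between the two steps.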

Adding a new report

The bottom half of the triage gui is devoted to the new report editor. Here you can set the developer name (if you forgot to set the TRIAGE_DEVELOPER environment variable), the cause of the stopped run and the action to be taken, and you can add extra comments. The causes and actions are selected by keyword from drop-down lists. These lists are read from the files $CODESYSDIR/codata/triageCause.txt and $CODESYSDIR/codata/triageAction.txt. Feel free to edit and add to these files. The button in the lower left causes the reason file to be locked, the new report to be added and the entire reason file to be reread. The developer should then perform, if necessary, the action represented in the report, such as restarting the run or cleaning it up. In the future, our scripts will archive the triage reports (depending on the action) for later reference.

Run Expiration

The last action in the reason file determines when a stopped run will be automatically cleaned up by the scripts (this is not currently implemented). The expiration time for a given action is looked up in the file $CODESYSDIR/codata/triageTime.txt. When the age of the reason file exceeds this expiration time, the run is cleaned up. The "Expiration" label at the top of the triage program shows the amount of time left before the run is cleaned up. A run will not be cleaned up if it is locked.
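
The expiration rule can be sketched as follows. Since the cleanup is not implemented, this is purely illustrative: the function names are hypothetical, and the sketch assumes that triageTime.txt holds one "<action> <hours>" pair per line, which may not match the real file format or units.

```shell
# Hypothetical sketch of the expiration rule (not an existing script).
# Assumption: the time file holds one "<action> <hours>" pair per line.

# Print the expiration time in hours for an action; fail if it is not listed.
# Arguments: action, path to the time file.
expiration_hours() {
    awk -v a="$1" '
        $1 == a { print $2; found = 1; exit }
        END     { if (!found) exit 1 }
    ' "$2"
}

# Print the seconds left before cleanup; a negative value means expired.
# Arguments: expiration in hours, reason file modification time and current
# time, both in seconds since the epoch.
seconds_left() {
    echo $(( $1 * 3600 - ($3 - $2) ))
}

# Succeed only if the run is expired and not locked.
# Arguments: runid, tok, hours, mtime, now.
may_clean_up() {
    [ ! -e "$QSHARE/${1}_${2}.reason_lock" ] &&
        [ "$(seconds_left "$3" "$4" "$5")" -lt 0 ]
}
```

The times are passed in as arguments so that the sketch does not depend on a particular way of reading a file's modification time, which differs between systems.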

Web Interface

The stopped jobs monitor at
  https://w3.pppl.gov/transp/transpgrid_monitor_stopped
shows the current state of the triage in the "analysis" section. A listing of all the triage reports can be seen by following the link for a stopped job (click the "..." link on the far left) and then clicking on the "Stopped Job Analysis" link. This is equivalent to running
  triage -l <runid> <tok> <year>

Quick Triage

Some causes for a stopped job are so common that a special switch can be given on the triage command line, so that the triage gui does not need to be used. The switch "-c <comment>" can also be used to add a comment. As an example (here with the switch -i and a comment),
  triage -i -c 'unable to read CUR ufile' 37065Z61 TFTR 88
To read the triage reports without going through the gui,
  triage -l <runid> <tok> <year>

Where is it?

The triage program is written in Java and launched from the script /p/beast/bin/triage. The Java binaries and source can be found somewhere in /p/beast/java.