triage program
Introduction
The triage program is a developer tool for recording the cause of a stopped run
and for communicating to other developers that a stopped run is being analyzed.
For basic help info,
triage -h
To start the triage program in a gui for a particular run,
triage <runid> <tok> <year>
If the <year> argument is left off, the year will be extracted from the
$QSHARE/<runid>_<year>.year file.
If run in the crash directory of a stopped job
($RESULTDIR/<tok>/<runid>/<runid>), all of the arguments can be omitted.
Environment
Because the triage program is typically run as pshare, the environment variable
TRIAGE_DEVELOPER should be set to your developer name before running the triage
program. This name will be used for file locking and new reports.
The triage program expects to be able to find $QSHARE, $CODESYSDIR/codata
(with the files triageCause.txt, triageAction.txt and triageTime.txt)
and the <runid>_<tok>.year file.
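A quick pre-flight check of this environment can be sketched in shell. The function name triage_preflight is hypothetical and is not part of the triage distribution; it tests only for the variables and files named above, and it does not check the per-run .year file.

```shell
# Hypothetical helper (not shipped with triage): report anything the triage
# program expects to find but which is missing. Prints one line per problem
# and returns non-zero if any problem was found.
triage_preflight() {
    rc=0
    if [ -z "${TRIAGE_DEVELOPER:-}" ]; then
        echo "TRIAGE_DEVELOPER is not set"
        rc=1
    fi
    if [ -z "${QSHARE:-}" ] || [ ! -d "$QSHARE" ]; then
        echo "QSHARE is not set or is not a directory"
        rc=1
    fi
    # The three lookup files the text says live under $CODESYSDIR/codata.
    for f in triageCause.txt triageAction.txt triageTime.txt; do
        if [ ! -f "${CODESYSDIR:-}/codata/$f" ]; then
            echo "missing ${CODESYSDIR:-}/codata/$f"
            rc=1
        fi
    done
    return $rc
}
```

After exporting TRIAGE_DEVELOPER, a call such as "triage_preflight && triage <runid> <tok> <year>" would start the gui only when nothing is reported missing.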
Reports
The triage program manages "Reports" which record the cause of a stopped run,
the time of the report, the developer's name and the action which will be
taken for the stopped job. The top half of the triage program lists the shot info
and the existing reports, which can be stepped through with the buttons:
Back - move back one report
Forward - move forward one report
Last - jump to the last report
The numerical label on the far right is the index of the current report. The
report information is stored as ASCII text in the $QSHARE/<runid>_<tok>.reason file.
Locking
To prevent two developers from simultaneously changing the reason file, a lock file called
$QSHARE/<runid>_<tok>.reason_lock is used. When the lock file does not exist,
the reason file is considered unlocked and the label in the upper right of the
triage program will read "Unlocked" in green. You can then lock the file by hitting
the "Lock" button which, if successful, will change the label to yellow with
your developer name (from the TRIAGE_DEVELOPER environment variable). If someone else
has locked the reason file, the label will be red and show that developer's name.
A lock will also be acquired if you try to add a new report. If you own the lock
on exiting, the triage program will ask whether you want to release it.
You generally want to release the lock, but you can keep it to dissuade
other developers from looking at the run. The lock is checked every 3 seconds
and the reports will be updated whenever the lock is released.
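The three label states can be sketched as a small shell function. This is an illustration only, not the Java implementation: the function name lock_state is hypothetical, and it assumes that the first line of the lock file holds the name of the developer who owns the lock, which this document does not confirm.

```shell
# Hypothetical sketch: classify the lock on a reason file.
# usage: lock_state <runid> <tok>
# ASSUMPTION: the first line of the lock file is the owner's developer name.
lock_state() {
    lock="$QSHARE/${1}_${2}.reason_lock"
    if [ ! -e "$lock" ]; then
        echo "Unlocked"                      # shown in green
        return 0
    fi
    owner=$(head -n 1 "$lock")
    if [ -n "${TRIAGE_DEVELOPER:-}" ] && [ "$owner" = "$TRIAGE_DEVELOPER" ]; then
        echo "Locked by you ($owner)"        # shown in yellow
    else
        echo "Locked by $owner"              # shown in red
    fi
}
```

For example, "lock_state 37065Z61 TFTR" prints "Unlocked" when no lock file exists for that run in $QSHARE.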
Adding a new report
The bottom half of the triage gui is devoted to the new report editor. You can
set the developer name (if you forgot to set the TRIAGE_DEVELOPER environment variable),
the cause of the stopped run and the action to be taken, and you can add extra
comments. The causes and actions are selected by keyword from drop-down lists. These
lists are read from the files $CODESYSDIR/codata/triageCause.txt and
$CODESYSDIR/codata/triageAction.txt. Feel free to edit and add to these files. The button
in the lower left will cause the reason file to be locked, the new report added
and the entire reason file to be reread. The developer should then perform, if necessary,
the action represented in the report such as restarting the run or cleaning it up.
In the future, our scripts will archive the triage reports (depending on the action) for
future reference.
Run Expiration
The last action in the reason file determines when a stopped run will be automatically
cleaned up by the scripts (not currently implemented). The expiration time for a given
action is looked up in the file $CODESYSDIR/codata/triageTime.txt. When the age of the
reason file exceeds this expiration time, the run is cleaned up. The "Expiration" label at the
top of the triage program lists the amount of time left before the run is cleaned up.
A run will not be cleaned up if it is locked.
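The expiration rule can be written out as a short shell sketch. The function name time_left is hypothetical, and both values are assumed to be in seconds; this document does not specify the format or units of triageTime.txt, nor how the age of the reason file is measured, so the caller supplies both numbers.

```shell
# Hypothetical sketch of the expiration rule described above.
# usage: time_left <expiration_seconds> <reason_file_age_seconds> [locked]
# ASSUMPTION: both values are already known and expressed in seconds.
time_left() {
    expiration=$1
    age=$2
    if [ "${3:-}" = "locked" ]; then
        echo "held (locked)"            # a locked run is not cleaned up
    elif [ "$age" -gt "$expiration" ]; then
        echo "expired"                  # age exceeds the expiration time
    else
        echo $((expiration - age))      # seconds left before cleanup
    fi
}
```

With a one-hour expiration and a reason file that is ten minutes old, "time_left 3600 600" prints 3000.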
Web Interface
The stopped jobs monitor at
https://w3.pppl.gov/transp/transpgrid_monitor_stopped
will show the current state of the triage in the "analysis" section. A listing of
all the triage reports can be seen by following the link for a stopped job (click the "..."
link on the far left) then clicking on the "Stopped Job Analysis" link. This is equivalent
to running
triage -l <runid> <tok> <year>
Quick Triage
Some causes for a stopped job are so common that a special switch can be given on the
triage command line so the triage gui does not need to be used. The switch "-c <comment>"
can also be used to add a comment.
As an example,
triage -i -c 'unable to read CUR ufile' 37065Z61 TFTR 88
To read the triage reports without going through the gui,
triage -l <runid> <tok> <year>
Where is it?
The triage program is written in Java and launched from the script /p/beast/bin/triage.
The Java binaries and source can be found somewhere in /p/beast/java.