- Restarting the Parallel Service Health Checker
If the "Last check time" is in red, then the Parallel Service Health
Checker needs restarting.
As root on transpgrid:
$ /usr/pppl/nfsboot-hosts/scripts/cppg_pfgs restart
- Manually starting the Parallel Service Queue Server
As pshare on transpgrid:
$ /u/pshare/PFGS/sbin/pfgs_stoppbs
$ /u/pshare/PFGS/sbin/pfgs_startpbs -V kestrel 8 396:00
This provides eight nodes with an idle timeout of 396 hours.
Set the number of nodes as appropriate. For the present (as of
Nov. 5, 2007) the timeout should be long. The Parallel
Service Monitor shows the status of the Queue Server.
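The stop/start sequence above can be sketched as a small shell function
that prints the two commands for a chosen node count and timeout, so they
can be reviewed before being run as pshare. The function name
pfgs_restart_cmds is hypothetical, and the sketch assumes the timeout
argument has the form HOURS:MINUTES, as "396:00" for 396 hours suggests.

```shell
# Hypothetical helper: print (do not run) the Queue Server restart commands.
# Assumption: the last pfgs_startpbs argument is HOURS:MINUTES.
pfgs_restart_cmds() {
    nodes=$1
    hours=$2
    # Accept only non-empty strings of digits for both arguments.
    case $nodes in
        ''|*[!0-9]*) echo "nodes must be a whole number" >&2; return 1 ;;
    esac
    case $hours in
        ''|*[!0-9]*) echo "hours must be a whole number" >&2; return 1 ;;
    esac
    echo "/u/pshare/PFGS/sbin/pfgs_stoppbs"
    echo "/u/pshare/PFGS/sbin/pfgs_startpbs -V kestrel ${nodes} ${hours}:00"
}
```

Calling pfgs_restart_cmds 8 396 prints the same two commands shown above.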
- Getting results from a step
Script pfgs_getresults waits for a marker file to be created by the
step and then returns that marker file and a tar file containing
results from the step. A Globus proxy must be active.
If no marker file is available from the step and the job is
no longer running, this script creates a marker file
with a first line containing "1" and a second line containing
"remark=explanation". If the job is still running, no marker
file is returned or created.
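The marker file layout described above can be read with a few lines of
shell. This is a minimal sketch: the function name read_marker is
hypothetical, and it assumes only what is stated here, namely that the
first line holds a status value and the second line may hold
"remark=explanation". The layout of marker files written by the step
itself is not described in these notes and may differ.

```shell
# Hypothetical helper: summarize a marker file on one line.
# Assumption: line 1 is a status value, line 2 may be "remark=<text>".
read_marker() {
    marker_file=$1
    if [ ! -f "$marker_file" ]; then
        echo "no marker file: $marker_file" >&2
        return 2
    fi
    # Line 1: the status value as written.
    marker_status=$(sed -n '1p' "$marker_file")
    # Line 2: the text after "remark=", empty if the prefix is absent.
    marker_remark=$(sed -n '2s/^remark=//p' "$marker_file")
    echo "status=$marker_status remark=$marker_remark"
}
```

A missing file returns exit code 2, which matches the case where the job
is still running and no marker file has been returned or created.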
- Starting a step
Script pfgs_enqueue sends the user tar file and submits a step command.
A Globus proxy must be active.
The Parallel Service queue server does not have to be running for a
step command to be queued.
- Cleaning up a run
Script pfgs_cleanjob cleans up a failed run by removing active or
pending step queue entries, killing any running processes, and
cleaning the run directories. Both the owner and the package/tokyy/runid
ident are checked before any deletion.
- Log files for a step
Queue server log: /p/transpgrid/pfgs/logs/pfgs.log
Step status log: /u//PFGS/logs/.log
Step run log (stdout): /u//PFGS/logs/pfgs_.stdout
Step run log (stderr): /u//PFGS/logs/pfgs_.stderr
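When checking on a step, it is often enough to look at the end of one of
the logs listed above. The following is a minimal sketch of a helper for
that. The function name show_log_tail and the default of 20 lines are
assumptions made for illustration, not part of PFGS.

```shell
# Hypothetical helper: print the last lines of a log file,
# or report on stderr that it cannot be read.
show_log_tail() {
    log_file=$1
    count=${2:-20}    # number of lines to show, 20 if not given
    if [ -r "$log_file" ]; then
        tail -n "$count" "$log_file"
    else
        echo "cannot read $log_file" >&2
        return 1
    fi
}
```

For example, show_log_tail /p/transpgrid/pfgs/logs/pfgs.log 50 would show
the last 50 lines of the queue server log.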