Abnormal Job Termination
Termination because of CPU limit exceeded
- Jobs receive an XCPU signal (SIGXCPU), which the job can catch. In that case termination procedures can be executed before the SIGKILL signal is sent.
- SIGKILL is sent a few minutes after XCPU. It cannot be caught.
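Catching XCPU in a shell job script could look roughly like the sketch below. The file name in STATE_FILE and the function name on_xcpu are made up for illustration, and what "saving state" means depends entirely on your job.

```shell
#!/bin/sh
# Hypothetical location for a small state file; adjust to your job.
STATE_FILE="${TMPDIR:-/tmp}/job_state.$$"
xcpu_seen=0

# Handler that runs when the batch system delivers XCPU.
# Keep it short: SIGKILL follows a few minutes later.
on_xcpu() {
    xcpu_seen=1
    echo "CPU limit reached, saving state" >&2
    echo "interrupted" > "$STATE_FILE"
}

# Install the handler for XCPU.
trap on_xcpu XCPU

# ... the real work of the job goes here ...
```

Note that a shell runs a trap handler only after the current foreground command has finished. For a long-running program it is therefore common to start it in the background and use wait, so that the handler can run promptly.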
Restart after the execution host has crashed
- If a host crashes while a job is running on it, the job will be restarted. In that case the variable RESTARTED is set to 1.
- The job will be re-executed from the beginning on any free host. If the job can instead continue from the results achieved so far, check the variable RESTARTED and force the job to be executed on the same host by inserting
qalter -q $QUEUE $JOB_ID
in your job script.
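The check described above might be written like this near the top of the job script. This is only a sketch: the function name pin_if_restarted is invented here, and it assumes that RESTARTED, QUEUE and JOB_ID are set in the job environment as described in this section.

```shell
#!/bin/sh
# Re-pin the job to its queue only if it was restarted
# after a crash of the execution host.
pin_if_restarted() {
    # RESTARTED is 1 after a restart; treat an unset variable as 0.
    [ "${RESTARTED:-0}" = "1" ] || return 0
    echo "job was restarted, requesting queue $QUEUE again" >&2
    qalter -q "$QUEUE" "$JOB_ID"
}

pin_if_restarted

# ... continue from the saved results if RESTARTED is 1 ...
```

On a first run the function does nothing, so the same script can be used for both the initial start and a restart.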
Signalling the end of the job
- With the qsub option -notify, a SIGUSR1 signal is sent to the job one minute before it is suspended, and a SIGUSR2 signal one minute before it is killed (the interval is set by the configurable queue attribute notify).
(see: http://www-zeuthen.desy.de/www_users/rz/maillists/linux/msg00005.html)
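A job script that reacts to these notifications could be sketched as follows. The handler names and the last_notice variable are illustrative only. The "#$ -notify" line assumes that qsub options are embedded in the script, which is equivalent to passing -notify on the qsub command line.

```shell
#!/bin/sh
#$ -notify
# Remember which notification arrived last (illustrative variable).
last_notice=none

# SIGUSR1: the job is about to be suspended.
on_suspend_notice() {
    last_notice=suspend
    echo "suspend notification received" >&2
}

# SIGUSR2: the job is about to be killed; clean up quickly.
on_kill_notice() {
    last_notice=kill
    echo "kill notification received, cleaning up" >&2
}

trap on_suspend_notice USR1
trap on_kill_notice USR2

# ... the real work of the job goes here ...
```

Without a handler, the default action for SIGUSR1 and SIGUSR2 is to terminate the process, so a job submitted with -notify should normally install traps for both signals.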