Condor-G: Dealing with Held Jobs
This tutorial covers how to recover from errors on the grid while using Condor-G.
When a problem occurs in the middleware, Condor-G places your job on "Hold". Held jobs remain in the queue but wait for user intervention. Once you have resolved the problem, use condor_release to free the job so it can continue.
You can also place jobs on hold yourself with condor_hold, for example if you want to delay a run.
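As a rough sketch of delaying a run, the helper below holds a job by ID, waits, and then releases it. It assumes condor_hold and condor_release are on your PATH and are given a job ID in cluster.proc form (for example 77546.0). The name delay_job and its arguments are made up for illustration and are not part of Condor.

```shell
# Hypothetical helper: hold a job, wait, then release it.
#   $1 = job ID in cluster.proc form, e.g. 77546.0
#   $2 = number of seconds to keep the job on hold
delay_job() {
    condor_hold "$1" || return 1    # stop here if the job could not be held
    sleep "$2"
    condor_release "$1"
}
```

For example, "delay_job 77546.0 600" would keep job 77546.0 on hold for ten minutes before releasing it.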
For this example, we'll make the output file non-writable. The job will be unable to copy the results back and will be placed on hold.
Submit the job again, but this time mark the output file as read-only immediately after submitting it:
$ condor_submit myjob.submit ; chmod a-w job.output
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 77546
Watch the job with tail. When the job goes on hold, use Ctrl-C to exit tail. Note that condor_q then reports the job in the "H" (Held) state.
$ tail -f --lines=500 job.log
000 (77546.000.000) 03/08 15:44:01 Job submitted from host: <172.16.149.233:58263>
...
017 (77546.000.000) 03/08 15:44:12 Job submitted to Globus
RM-Contact: red.unl.edu:/jobmanager-fork
JM-Contact: https://red.unl.edu:41228/6523/1141854092/
Can-Restart-JM: 1
...
027 (77546.000.000) 03/08 15:44:12 Job submitted to grid resource
GridResource: gt2 red.unl.edu:/jobmanager-fork
GridJobId: gt2 red.unl.edu:/jobmanager-fork https://red.unl.edu:41228/6523/1141854092/
...
001 (77546.000.000) 03/08 15:44:13 Job executing on host: gt2 red.unl.edu:/jobmanager-fork
...
012 (77546.000.000) 03/08 15:46:50 Job was held.
Globus error 155: the job manager could not stage out a file
Code 2 Subcode 155
...
Ctrl-C
$ condor_q mfurukaw
-- Submitter: osg-test2.unl.edu : <172.16.149.233:58263> : osg-test2.unl.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
77546.0 mfurukaw 3/8 15:43 0+00:02:37 H 0 0.0 myscript.sh TestJo
1 jobs; 0 idle, 0 running, 1 held
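When the log is long, it can help to pull out just the hold events and their reasons. The sketch below assumes the log layout shown above: a hold event starts with a line beginning "012", is followed by its reason lines, and ends at a line containing only "...". The function name hold_reasons is hypothetical.

```shell
# Print each hold event in a Condor user log together with its reason lines.
#   $1 = path to the job log, e.g. job.log
hold_reasons() {
    awk '
        /^012 /    { in_hold = 1; print; next }   # hold event header
        /^\.\.\.$/ { in_hold = 0; next }          # end of the current event
        in_hold    { print }                      # reason lines of the hold event
    ' "$1"
}
```

Running "hold_reasons job.log" on the log above would print the "Job was held." line followed by the Globus error 155 and Code 2 Subcode 155 lines.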
Fix the problem (make the file writable again), then release the job. You can specify the job's ID, or just use "-all" to release all held jobs.
$ chmod u+w job.output
$ condor_release -all
All jobs released.
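Note that the two chmod calls in this example are not exact opposites: "a-w" removes write permission for everyone, while "u+w" restores it only for the file's owner, which is enough for the stage-out to succeed. The sketch below shows the effect on a scratch file rather than on job.output; the starting mode of 644 is an assumption chosen so the result is predictable.

```shell
# Work on a scratch file so the real job.output is not touched.
f=$(mktemp)
chmod 644 "$f"                            # start from a known mode: -rw-r--r--
chmod a-w "$f"                            # remove write permission for user, group, and other
after_lock=$(ls -l "$f" | cut -c1-10)     # keep only the permission string
chmod u+w "$f"                            # restore write permission for the owner only
after_unlock=$(ls -l "$f" | cut -c1-10)
echo "after chmod a-w: $after_lock"       # prints: after chmod a-w: -r--r--r--
echo "after chmod u+w: $after_unlock"     # prints: after chmod u+w: -rw-r--r--
rm -f "$f"
```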
Again, watch the log until the job finishes:
$ tail -f --lines=500 job.log
000 (77546.000.000) 03/08 15:44:01 Job submitted from host: <172.16.149.233:58263>
...
017 (77546.000.000) 03/08 15:44:12 Job submitted to Globus
RM-Contact: red.unl.edu:/jobmanager-fork
JM-Contact: https://red.unl.edu:41228/6523/1141854092/
Can-Restart-JM: 1
...
027 (77546.000.000) 03/08 15:44:12 Job submitted to grid resource
GridResource: gt2 red.unl.edu:/jobmanager-fork
GridJobId: gt2 red.unl.edu:/jobmanager-fork https://red.unl.edu:41228/6523/1141854092/
...
001 (77546.000.000) 03/08 15:44:13 Job executing on host: gt2 red.unl.edu:/jobmanager-fork
...
012 (77546.000.000) 03/08 15:46:50 Job was held.
Globus error 155: the job manager could not stage out a file
Code 2 Subcode 155
...
013 (77546.000.000) 03/08 15:49:06 Job was released.
via condor_release (by user mfurukaw)
...
001 (77546.000.000) 03/08 15:49:19 Job executing on host: gt2 red.unl.edu:/jobmanager-fork
...
005 (77546.000.000) 03/08 15:49:24 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
Ctrl-C
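Instead of watching tail interactively, a script can read the job's most recent state from the log. The sketch below relies only on the event codes visible in the log above (001 executing, 012 held, 013 released, 005 terminated) and assumes the log holds events for a single job. The function name last_job_state is hypothetical.

```shell
# Report the most recent state recorded in a Condor user log.
#   $1 = path to the job log, e.g. job.log
last_job_state() {
    awk '
        /^001 / { state = "executing" }
        /^012 / { state = "held" }
        /^013 / { state = "released" }
        /^005 / { state = "terminated" }
        END     { print (state == "" ? "unknown" : state) }   # last matching event wins
    ' "$1"
}
```

A wrapper could call "last_job_state job.log" in a loop and stop once it prints "terminated", or alert you when it prints "held".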
Your job has finished, and the results have been retrieved successfully:
$ cat job.output
I'm process id 17407 on node010
Wed Mar 8 15:41:40 CST 2006
Running as binary /home/localGridUser/.globus/.gass_cache/local/md5/1f/205f630d4aca51dbdaa8c38f89606a/md5/e0/9b5779fe3bae4f34472467d2015bf1/data TestJob 10
My name (argument 1) is TestJob
My sleep duration (argument 2) is 10
Sleep of 10 seconds finished. Exiting
RESULT: 0 SUCCESS
Before continuing, clean up the results:
$ rm job.*