[Neurodebian-users] condor_qsub sentinel jobs release dependencies early in some cases
Chase,Philip B
pbc at ufl.edu
Fri Nov 2 13:10:42 UTC 2012
I have a hint as to what might be causing the problems with my sentinel jobs. Late yesterday I saw this:
$ condor_q
-- Failed to fetch ads from: <127.0.0.1:50116> : name.domain
Moments later all the sentinel jobs released their dependents.
The condor docs suggest my use of BIND_ALL_INTERFACES = TRUE is not enough and that I must explicitly set NETWORK_INTERFACE to the non-loopback address. The pool is reconfiguring right now. I'll retest.
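For illustration, the change described above might look like the following in the local configuration file. This is only a sketch: the address is a placeholder from the documentation range, not the pool's real address, and should be replaced with the host's non-loopback IP.

```
# HTCondor local configuration (sketch; 192.0.2.10 is a placeholder address)
BIND_ALL_INTERFACES = TRUE
NETWORK_INTERFACE = 192.0.2.10
```

Running condor_config_val NETWORK_INTERFACE on the node prints the value currently in effect, which is a quick way to confirm the setting was picked up.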
Philip
Philip B. Chase
Assistant Director
Clinical and Translational Science IT
University of Florida
pbc at ufl.edu
352-294-5164
From: Philip Chase <pbc at ufl.edu>
Date: Thursday, November 1, 2012 2:51 PM
To: neurodebian-users at lists.alioth.debian.org
Subject: [Neurodebian-users] condor_qsub sentinel jobs release dependencies early in some cases
I am seeing a transient failure in Condor job dependencies. The sentinel jobs created by condor_qsub are releasing dependencies early in some cases. I added logging to the sentinel job scripts and can see the count of "hold_jids" returned by the following line drop dramatically to zero:
condor_q -long -attributes Owner \$hold_jids | grep "$USER" | wc -l
The sentinel job then releases the held job and exits. As the count of hold_jids is in reality non-zero, those jobs proceed to normal completion, but out of sequence.
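One way a failed query can look like "no jobs left": in a plain pipeline the exit status is that of the last command (wc), so if condor_q fails and prints nothing on stdout, the count is simply 0. The sketch below shows one possible guard. The helper name count_pending is hypothetical and not part of condor_qsub, and it assumes condor_q exits with a non-zero status when it cannot fetch ads.

```shell
#!/bin/sh
# Hypothetical helper (not part of condor_qsub).
# Usage: count_pending OWNER QUERY_COMMAND [ARGS...]
# Prints the number of query output lines that mention OWNER.
# Returns 1 without printing a count if the query command itself fails,
# so a failed query is never mistaken for "zero jobs remaining".
count_pending() {
    owner=$1
    shift
    # Capture the output first so the query's exit status is not
    # hidden behind the rest of a pipeline.
    if ! output=$("$@" 2>/dev/null); then
        return 1
    fi
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps that case from looking like an error.
    printf '%s\n' "$output" | grep -c "$owner" || true
    return 0
}

# Example use in a sentinel loop (job IDs 123 and 124 are placeholders):
# while :; do
#     if n=$(count_pending "$USER" condor_q -long -attributes Owner 123 124); then
#         [ "$n" -eq 0 ] && break   # release only on a successful, empty query
#     fi
#     sleep 5                       # failed query: wait and ask again
# done
```

With this shape, the sentinel releases the held job only when the query succeeded and returned no matching jobs; a transient failure just causes another poll.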
In most of my tests the hold is released at the correct time. Has anyone seen transient failures on condor_q that might cause this?
In my configuration the submit node is the head node and also a worker node. Is it possible the server is too busy to make a timely response so the client gives up?
Thanks,
Philip