[Neurodebian-users] condor_qsub sentinel jobs release dependencies early in some cases
Chase,Philip B
pbc at ufl.edu
Thu Nov 1 18:51:28 UTC 2012
I am seeing a transient failure in Condor job dependencies. The sentinel jobs created by condor_qsub release dependencies early in some cases. I added logging to the sentinel job scripts and can see the count of "hold_jids" returned by the following line drop abruptly to zero:
condor_q -long -attributes Owner \$hold_jids | grep "$USER" | wc -l
The sentinel job then releases the held job and exits. Because the real count of hold_jids is still non-zero, the released job starts before the jobs it depends on have finished. All of the jobs proceed to normal completion, but out of sequence.
In most of my tests the hold is released at the correct time. Has anyone seen transient failures on condor_q that might cause this?
In my configuration the submit node is the head node and also a worker node. Is it possible the server is too busy to make a timely response so the client gives up?
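To make the failure mode concrete, here is a minimal sketch of a more defensive check. It assumes that condor_q exits non-zero when it cannot get an answer from the schedd, which I have not confirmed is what happens here. The pipeline above only reports the exit status of wc, so a failed query and a genuine count of zero look the same to the sentinel. The function names count_pending and wait_for_deps and the POLL_INTERVAL variable are hypothetical and are not part of condor_qsub.

```shell
#!/bin/sh
# Sketch only: count_pending, wait_for_deps and POLL_INTERVAL are
# hypothetical names, not part of condor_qsub.

# Print how many of the job IDs in "$@" condor_q still reports for $USER.
# Returns non-zero and prints nothing if condor_q itself fails, so the
# caller can tell "no jobs left" apart from "the query did not work".
count_pending() {
    out=$(condor_q -long -attributes Owner "$@") || return 1
    # grep -c prints the count but exits 1 when the count is zero.
    printf '%s\n' "$out" | grep -c "$USER" || true
}

# Block until the dependencies are gone. A failed query resets the
# counter, and zero must be seen on two consecutive polls before it
# is believed.
wait_for_deps() {
    zeros=0
    while :; do
        if n=$(count_pending "$@") && [ "$n" -eq 0 ]; then
            zeros=$((zeros + 1))
            if [ "$zeros" -ge 2 ]; then
                return 0
            fi
        else
            zeros=0
        fi
        sleep "${POLL_INTERVAL:-30}"
    done
}
```

A sentinel script would call something like wait_for_deps $hold_jids before releasing the held job. Requiring two consecutive zero counts is an arbitrary choice; it only guards against a single bad response.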
Thanks,
Philip
Philip B. Chase
Assistant Director
Clinical and Translational Science IT
University of Florida
pbc at ufl.edu
352-294-5164