[Neurodebian-users] condor_qsub sentinel jobs release dependencies early in some cases

Chase,Philip B pbc at ufl.edu
Thu Nov 1 18:51:28 UTC 2012


I am seeing a transient failure in condor job dependencies.  The sentinel jobs created by condor_qsub are releasing dependencies early in some cases.  I added logging to the sentinel job scripts and can see the count of "hold_jids" returned by the following line drop abruptly to zero:

  condor_q -long -attributes Owner \$hold_jids | grep "$USER" | wc -l

The sentinel job then releases the held job and exits.  Since the real count of hold_jids is non-zero at that point, the jobs all run to normal completion, but out of sequence.

In most of my tests the hold is released at the correct time.  Has anyone seen transient failures on condor_q that might cause this?

In my configuration the submit node is the head node and also a worker node.  Is it possible the server is too busy to respond in time, so the condor_q client gives up?
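
If that is what is happening, the pipeline above would report a failed query as a count of zero, because the exit status of condor_q is lost in the pipe. A minimal sketch of a guard, not the actual condor_qsub sentinel code: count_hold_jids is a hypothetical helper name, and the sketch assumes condor_q exits non-zero when it cannot get an answer from the schedd, which I have not confirmed for the timeout case.

```shell
# Hypothetical helper (not part of condor_qsub): print the number of lines
# in the query output that match $USER, or return 1 if the query itself
# failed, so that a failed query is never read as "0 jobs left".
count_hold_jids() {
    # "$@" is the query command,
    # e.g. condor_q -long -attributes Owner <job ids>
    if ! _chj_out=$("$@" 2>/dev/null); then
        return 1    # query failed: the count is unknown, not zero
    fi
    # grep -c prints 0 and exits 1 when nothing matches; ignore that status
    printf '%s\n' "$_chj_out" | grep -c "$USER" || true
}

# Possible use inside the sentinel's polling loop (commented out here):
#   if n=$(count_hold_jids condor_q -long -attributes Owner $hold_jids) \
#        && [ "$n" -eq 0 ]; then
#       ... release the held job ...
#   fi
```

With this shape the sentinel would release the held job only when the query succeeded and matched nothing; a failed query would simply be retried on the next poll.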

Thanks,
Philip

Philip B. Chase
Assistant Director
Clinical and Translational Science IT
University of Florida
pbc at ufl.edu
352-294-5164

