[Neurodebian-users] condor_qsub sentinel jobs release dependencies early in some cases

Chase,Philip B pbc at ufl.edu
Fri Nov 2 13:10:42 UTC 2012


I have a hint as to what might be causing the problems with my sentinel jobs.  Late yesterday I saw this:

  $ condor_q

  -- Failed to fetch ads from: <127.0.0.1:50116> : name.domain

Moments later all the sentinel jobs released their dependents.

Note that condor_q was trying to reach the daemon on the loopback address.  The Condor docs suggest that my use of BIND_ALL_INTERFACES = TRUE is not enough and that I must explicitly set NETWORK_INTERFACE to the non-loopback address.  The pool is reconfiguring right now.  I'll retest.

Philip

Philip B. Chase
Assistant Director
Clinical and Translational Science IT
University of Florida
pbc at ufl.edu
352-294-5164

From: Philip Chase <pbc at ufl.edu>
Date: Thursday, November 1, 2012 2:51 PM
To: neurodebian-users at lists.alioth.debian.org
Subject: [Neurodebian-users] condor_qsub sentinel jobs release dependencies early in some cases

I am seeing a transient failure in Condor job dependencies.  The sentinel jobs created by condor_qsub are releasing their dependents early in some cases.  I added logging to the sentinel job scripts and can see the count of "hold_jids" returned by the following line drop abruptly to zero:

  condor_q -long -attributes Owner \$hold_jids | grep "$USER" | wc -l

The sentinel job then releases the held job and exits.  As the count of hold_jids is in reality non-zero, those jobs proceed to normal completion, but out of sequence.
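
One way the pipeline above can report zero is that condor_q itself fails: grep then sees no input and wc prints 0, which looks the same as "all dependencies finished".  Below is a minimal sketch of a check that keeps the two cases apart.  It is not the code condor_qsub generates; the function names count_pending and wait_for_dependencies and the POLL_INTERVAL variable are hypothetical, and it assumes condor_q exits non-zero when it cannot fetch ads.

```shell
#!/bin/sh
# Sketch only: treat a failed condor_q as "unknown", not as "zero jobs left".

# count_pending OWNER JOBID...
# Prints the number of matching ads; returns non-zero if condor_q failed.
count_pending() {
    owner=$1
    shift
    # Capture the output first so the exit status of condor_q is not
    # masked by the rest of the pipeline.
    if ! ads=$(condor_q -long -attributes Owner "$@" 2>/dev/null); then
        return 1
    fi
    # grep -c prints 0 and exits 1 when nothing matches; that is a valid
    # result here, so its exit status is ignored.
    printf '%s\n' "$ads" | grep -c "$owner" || true
}

# wait_for_dependencies OWNER JOBID...
# Returns only once a successful query reports zero remaining jobs.
wait_for_dependencies() {
    while :; do
        if n=$(count_pending "$@"); then
            [ "$n" -eq 0 ] && return 0
        fi
        # Either jobs remain or the query failed: wait and ask again.
        sleep "${POLL_INTERVAL:-5}"
    done
}
```

A sentinel script would call wait_for_dependencies "$USER" followed by the job ids, and release the held job only after it returns.  A failed query then delays the release instead of triggering it.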

In most of my tests the hold is released at the correct time.  Has anyone seen transient failures on condor_q that might cause this?

In my configuration the submit node is the head node and also a worker node.  Is it possible the server is too busy to respond in time, so the client gives up?

Thanks,
Philip

Philip B. Chase
Assistant Director
Clinical and Translational Science IT
University of Florida
pbc at ufl.edu
352-294-5164