[Piuparts-devel] Bug#670150: piuparts: rescheduling old logs by piuparts-master

Mon Apr 23 12:21:44 UTC 2012

Package: piuparts
Version: 0.43
Severity: wishlist

Hi,

There have been thoughts about moving the rescheduling of old logs into
the master; here are a few more.

The method should be more powerful than just evaluating
max-reschedule-old-age and max-reschedule-old-count. It should be
possible to do full or partial rebuilds (without needing to rm -rf the
whole tree).

Therefore I'd suggest keeping the reschedule_old_logs script, but
instead of rm'ing the logs it would just mark them for recycling. Time
limits should stay; count limits can be dropped from the config, as the
master will automatically adjust to the available processing power.

One possibility to mark the logs for recycling is to hardlink them into
a recycle/ subdirectory. That should be pretty much race-free and
lock-free. So reschedule_old_logs would just link the candidates there.
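
A minimal Python sketch of that marking step, just to illustrate the
idea. The function name mark_for_recycling, the section_dir layout and
the max_age_days parameter are assumptions made for this example, not
existing piuparts code:

```python
import os
import time


def mark_for_recycling(section_dir, max_age_days, subdirs=("pass",)):
    """Hardlink logs older than max_age_days into recycle/.

    Returns the names of the logs that were newly marked.
    Hypothetical helper, not part of piuparts.
    """
    recycle_dir = os.path.join(section_dir, "recycle")
    os.makedirs(recycle_dir, exist_ok=True)
    cutoff = time.time() - max_age_days * 86400
    linked = []
    for sub in subdirs:
        src_dir = os.path.join(section_dir, sub)
        if not os.path.isdir(src_dir):
            continue
        for name in sorted(os.listdir(src_dir)):
            if not name.endswith(".log"):
                continue
            src = os.path.join(src_dir, name)
            try:
                if os.stat(src).st_mtime >= cutoff:
                    continue  # log is still fresh enough
                # link() either succeeds or fails as a whole, no lock needed
                os.link(src, os.path.join(recycle_dir, name))
                linked.append(name)
            except FileExistsError:
                pass  # already marked by an earlier run
            except FileNotFoundError:
                pass  # log was moved or removed in the meantime
    return linked
```

Running it twice is harmless: the second run finds the links already in
place and marks nothing new.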

I'm not exactly sure how to enable recycling in the master-slave
protocol. The slave probably needs to be in a recycling mode as well and
will need to send some "recycle N" command at the beginning of the
communication. Recycling must happen before the master computes package
states. The master may only accept recycle commands if it "remembers"
that it was out of packages during the previous slave contact (otherwise
it skips the command). Due to package dependencies, even if N logs were
recycled, fewer than N packages may be available for testing afterwards.

Pseudocode for master implementation:

proc recycle($N):
  $i = 0
  while ($i < $N && select random logfile from "recycle/*.log" as $log)
    if find $log in "{pass,bugged,affected,fail,untestable}/*.log" as $oldlog
      if $log.INODE == $oldlog.INODE
        rm $oldlog
        ++$i
      else
        # INODE mismatch => the log was updated in between, skip recycle
    else
      # the log is missing - maybe was outdated, skip recycle
    rm $log
  return $i
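
The same procedure as a rough Python sketch, assuming the marks in
recycle/ are hardlinks with the same file name as the original log. The
function name recycle, the section_dir argument and the STATE_DIRS
constant are made up for this illustration:

```python
import glob
import os
import random

# state directories taken from the pseudocode above
STATE_DIRS = ("pass", "bugged", "affected", "fail", "untestable")


def recycle(section_dir, n):
    """Delete up to n logs that are marked in recycle/.

    Returns the number of logs actually deleted. Hypothetical sketch.
    """
    marks = glob.glob(os.path.join(section_dir, "recycle", "*.log"))
    random.shuffle(marks)  # "select random logfile"
    recycled = 0
    for mark in marks:
        if recycled >= n:
            break  # leave the remaining marks for a later call
        name = os.path.basename(mark)
        for state in STATE_DIRS:
            old = os.path.join(section_dir, state, name)
            try:
                # samefile() compares device and inode number
                if os.path.samefile(mark, old):
                    os.remove(old)
                    recycled += 1
                # otherwise the log was updated in between: skip it
                break
            except FileNotFoundError:
                continue  # not in this state directory, try the next
        # the mark is consumed whether or not a log was deleted
        os.remove(mark)
    return recycled
```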

So for a larger manual reschedule we can just link all pass/*.log files
to recycle/ (or all that are older than some event).

Manual rescheduling of "a few" packages should still happen by plain rm
(and will get prioritized that way).

For the slave I have no clear plan for how to enter/leave "recycle mode".
Entering is probably easy: if we went through all sections without doing
anything, turn on recycling and try again.
How should recycling and precedence interact? I think in the recycling
case we should do a round-robin, ignoring precedence.
There could also be a max-recycled setting, which may differ from
max-reserved.

Leaving recycle mode should somehow be indicated by the master: once
there is more work (from a non-recycling origin) to be done, the client
should leave recycling mode immediately and continue regularly.

On the slave side, a section can be
* config-disabled - retry in 12 hours
* local-busy - another slave instance is processing it, retry in 15 min
* master-busy - retry in 2 min + random(180 s)
* master-failed - retry in 15 min
* master-idle - no work in normal mode, retry in 1 hour
* master-idle-recycle - no work in recycle mode, retry ...
It looks like the slave needs three timeouts to work with these
conditions:
if mode==normal
  # obey sleep_recycle, too: if there is nothing to do in recycle mode
  # there won't be anything for normal mode, too
  if now < max(sleep_error, sleep_idle, sleep_recycle)
    return 0
else # mode==recycle
  if now < max(sleep_error, sleep_recycle)
    return 0
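
As a small Python sketch of the same check (the class SectionTimers and
the method may_contact_master are names invented here; the three
attributes hold absolute wake-up times in seconds):

```python
import time


class SectionTimers:
    """Per-section wake-up times, as seconds since the epoch.

    Hypothetical sketch of the three timeouts described above.
    """

    def __init__(self):
        self.sleep_error = 0.0    # set after master-busy / master-failed
        self.sleep_idle = 0.0     # set after master-idle
        self.sleep_recycle = 0.0  # set after master-idle-recycle

    def may_contact_master(self, recycle_mode, now=None):
        """Return True if the section may be processed right now."""
        if now is None:
            now = time.time()
        if recycle_mode:
            wake = max(self.sleep_error, self.sleep_recycle)
        else:
            # normal mode obeys sleep_recycle, too: nothing to recycle
            # implies nothing to do in normal mode either
            wake = max(self.sleep_error, self.sleep_idle,
                       self.sleep_recycle)
        return now >= wake
```

A section that is idle in normal mode can then still be tried in recycle
mode, but not the other way round.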

Contacting the master is an expensive operation, especially if we want
to compute the state:
* fetch Packages
* read master tree (~35000 files)
* compute package states
Contacting the master too often should be avoided. Alternatively, there
could be a "cheap" cached "idle" status that a client could request.
That cached idle state needs to be cleared
- after timeout (1h?)
- if slave transmitted a logfile
- if recycle deleted some logfile

There could be an IDLE command that returns
  OK N
where N is the number of seconds the slave should wait before
reconnecting to the master (+ random(360 s) to avoid races) or
  OK 0
if the slave can continue now.
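
On the slave, handling such a reply could look roughly like this Python
sketch. The function idle_delay and its jitter_max parameter are
hypothetical; only the "OK N" reply format and the random(360 s) come
from the proposal above:

```python
import random


def idle_delay(reply, jitter_max=360, rng=random):
    """Return how many seconds to wait before reconnecting.

    reply is the master's answer to the proposed IDLE command,
    e.g. "OK 0" or "OK 42". Hypothetical helper.
    """
    parts = reply.split()
    if len(parts) != 2 or parts[0] != "OK":
        raise ValueError("unexpected reply: %r" % reply)
    seconds = int(parts[1])
    if seconds < 0:
        raise ValueError("negative delay: %r" % reply)
    if seconds == 0:
        return 0  # the slave can continue now
    # spread reconnects so that slaves do not all return at once
    return seconds + rng.randint(0, jitter_max)
```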

So an optimized master-slave communication may look like this:

M: hello
#if in recycle mode
S: recycle 50
M: OK
#endif
#if has logs
S: pass some.log
# master reads directory tree
M: OK
#endif
S: idle
M: OK 0
# or OK 42 if there is probably nothing to do for the next 42 seconds
S: reserve
# master reads directory tree (if not yet done) and computes state
M: foobar 1.0
S: reserve
M: error
# master creates idle flag file
S: status
M: OK .....

On the master this could be implemented using an idle.stamp file
- submitting logs deletes it
- deleting logs during recycle deletes it
- computing status (as needed for status or reserve commands) deletes
  it, but may create it again
- reserve may create it
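
A rough Python sketch of such a stamp file, assuming it lives in the
section directory and expires after the one-hour timeout floated above.
The helper names and the IDLE_TIMEOUT constant are made up for this
example:

```python
import os
import time

IDLE_TIMEOUT = 3600  # seconds; the "1h?" from above, an assumption


def stamp_path(section_dir):
    return os.path.join(section_dir, "idle.stamp")


def set_idle(section_dir):
    """Create (or refresh) the stamp, e.g. when reserve finds no work."""
    with open(stamp_path(section_dir), "w"):
        pass


def clear_idle(section_dir):
    """Remove the stamp, e.g. after a log was submitted or recycled."""
    try:
        os.remove(stamp_path(section_dir))
    except FileNotFoundError:
        pass  # already cleared


def idle_remaining(section_dir, now=None):
    """Seconds the section is still considered idle; 0 if not idle."""
    if now is None:
        now = time.time()
    try:
        age = now - os.stat(stamp_path(section_dir)).st_mtime
    except FileNotFoundError:
        return 0
    return max(0, int(IDLE_TIMEOUT - age))
```

The value from idle_remaining() could be what the master sends back as N
in the "OK N" reply, without reading the tree or computing any state.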

The following unwanted behavior could happen:
- slave1 submits some logfile and clears the stamp file
- slave1 quits
- slave2 is in recycle mode, recycle command is skipped by master, no
  packages available, but packages to recycle available
  ==> slave2 should retry the section instead of sleeping

Having an efficient way to minimize master communication is important
for me: > 50 sections, 4-7 slaves running in parallel (slaves sharing a
slave tree (for concurrent processing of different sections) as well as
slaves on different slave trees (for concurrent processing of the same
section, e.g. on a full recheck)).

Comments, suggestions (and (partial) implementation) welcome!

Andreas