[Pkg-nagios-devel] Bug#725268: nagios3: nagios.log - misleading errors about check results queue

Vaclav Ovsik vaclav.ovsik at gmail.com
Thu Oct 3 13:45:39 UTC 2013


Package: nagios3
Version: 3.4.1-5+b1
Severity: minor
Tags: upstream

Dear Maintainer,
this is a reincarnation of bug #522538, which was closed as
unreproducible some time ago.

I'm affected by this bug too, but fortunately I have got further in
observing the problem. It appeared while I was preparing a new backup
monitoring virtual host based on Debian Wheezy amd64. I got to the
point where the new server had nearly the same configuration as the
production server. The difference I noticed between the servers'
nagios.log files is that the new node regularly logs:

[1380716982] Error: Unable to rename file '/var/lib/nagios3/spool/checkresults/checkP5jroS' to '/var/lib/nagios3/spool/checkresults/c84vzxe': No such file or directory
[1380716982] Warning: Unable to move file '/var/lib/nagios3/spool/checkresults/checkP5jroS' to check results queue.

I'm certain the Nagios configurations on both servers are the same,
because I'm using Unison to synchronize them. I installed Systemtap to
see the problem at the syscall level. I don't know how Nagios processes
check results, but I adapted an example Systemtap script to monitor
the open, rename and unlink syscalls.

11978246 6312 (nagios3) open /var/lib/nagios3/spool/checkresults/checkP5jroS returned 8
11986098 26537 (nagios3) rename( /var/lib/nagios3/spool/checkresults/checkP5jroS -> /var/lib/nagios3/spool/checkresults/cMlI3we ) returned 0
11988931 26532 (nagios3) rename( /var/lib/nagios3/spool/checkresults/checkP5jroS -> /var/lib/nagios3/spool/checkresults/c84vzxe ) returned -2
11989054 26532 (nagios3) unlink /var/lib/nagios3/spool/checkresults/checkP5jroS returned -2

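A nagios process (pid 26537) renamed the result file, and another
nagios process (pid 26532) tried to rename the same result file about
3 ms later (by the trace timestamps). That second rename failed with
-2 (ENOENT), which matches the "No such file or directory" in the log.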
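
The -2 returned by the second rename in the trace can be reproduced
outside Nagios. The following is only a minimal Python sketch of the same
sequence; the temporary directory and the file names in it are made-up
stand-ins, not the real spool:

```python
import errno
import os
import tempfile


def rename_twice():
    """Rename one file twice in a row, as the two nagios3 processes did.

    Returns the errno of the second rename, or None if it succeeded.
    The directory and file names are hypothetical stand-ins.
    """
    with tempfile.TemporaryDirectory() as spool:
        src = os.path.join(spool, "checkXXXXXX")
        with open(src, "w") as fh:
            fh.write("dummy check result\n")

        # The first rename succeeds, like pid 26537 in the trace.
        os.rename(src, os.path.join(spool, "cAAAAAA"))

        # The second rename of the same source finds nothing to move,
        # like pid 26532 in the trace.
        try:
            os.rename(src, os.path.join(spool, "cBBBBBB"))
        except OSError as exc:
            return exc.errno
    return None
```

rename_twice() returns errno.ENOENT, which is 2 on Linux; the syscall
trace shows the same number negated.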

This did not explain why the two Nagios boxes behave differently.
I started comparing the installed packages and found that smbclient was
missing on the new server. I had installed the software without
recommended packages, to keep the number of installed packages small.
I'm monitoring a Samba share, so I installed smbclient on the new
Nagios server too. The errors in nagios.log disappeared. :)

I have set up another Nagios server on my desktop (Debian Sid with
nagios3 3.4.1-5+b1) and simplified check_disk_smb until I understood
the problem. Pieces of the configuration and the script are attached
so you can reproduce the problem. In short:

	A Perl check running in the embedded Perl interpreter can do
	a fork() syscall, but if the child process fails to exec() some
	external binary and leaves the Perl interpreter through exit(),
	then the cleanup phase that calls move_check_result_to_queue()
	(base/checks.c) is run in two places: in the parent process
	and also in the child process. This is a bug in
	base/checks.c and should be fixed upstream. It would probably be
	sufficient to test whether the pid of the running process has
	changed and to call move_check_result_to_queue() only in the
	parent process.
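
A minimal sketch of that pid test, written in Python rather than the C of
base/checks.c and not taken from the Nagios source. Here cleanup() is
a hypothetical stand-in for move_check_result_to_queue(), and
/nonexistent/smbclient is a deliberately missing path used to make the
exec fail:

```python
import os
import tempfile


def cleanup(log_path, parent_pid, guard):
    """Hypothetical stand-in for the result-moving step."""
    # With the guard on, only the process that did the fork continues.
    if guard and os.getpid() != parent_pid:
        return
    with open(log_path, "a") as fh:
        fh.write("%d\n" % os.getpid())


def run_check(guard):
    """Fork a child whose exec fails, then run the shared cleanup.

    Returns how many processes performed the cleanup step.
    """
    fd, log_path = tempfile.mkstemp()
    os.close(fd)
    parent_pid = os.getpid()  # remembered before the fork

    pid = os.fork()
    if pid == 0:
        try:
            # Assumed-missing binary, so exec fails with ENOENT.
            os.execv("/nonexistent/smbclient", ["smbclient"])
        except OSError:
            pass
        # exec failed: the child still runs a copy of the parent's code
        # and reaches the same cleanup path.
        try:
            cleanup(log_path, parent_pid, guard)
        finally:
            os._exit(1)

    os.waitpid(pid, 0)
    cleanup(log_path, parent_pid, guard)

    with open(log_path) as fh:
        attempts = len(fh.read().split())
    os.unlink(log_path)
    return attempts
```

Without the guard, run_check(False) counts two cleanup runs (parent and
child); with it, run_check(True) counts one.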

Thanks for your time on packaging Nagios!
Best Regards
-- 
Zito


-- System Information:
Debian Release: jessie/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 3.10-3-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=cs_CZ.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages nagios3 depends on:
ii  nagios3-cgi   3.4.1-5+b1
ii  nagios3-core  3.4.1-5+b1

nagios3 recommends no packages.

Versions of packages nagios3 suggests:
ii  nagios-nrpe-plugin  2.13-3

-- no debconf information
-------------- next part --------------
A non-text attachment was scrubbed...
Name: check_nagtest
Type: text/x-perl
Size: 208 bytes
Desc: not available
URL: <http://lists.alioth.debian.org/pipermail/pkg-nagios-devel/attachments/20131003/79146fcd/attachment.pl>
-------------- next part --------------
define command {
	command_name	nagtest_check
	command_line	/usr/local/bin/check_nagtest
}

define host {
  host_name			nagtest
  alias				nagtest
  check_command			return-ok
  max_check_attempts		1
  notifications_enabled         0
}

define service {
  host_name			nagtest
  service_description		nagtest_service
  check_command			nagtest_check
  max_check_attempts		1
  check_interval		1
  retry_interval		1
  normal_check_interval         1
  notifications_enabled         0
}
-------------- next part --------------
#! /usr/bin/env stap

global start

function timestamp:long() { return gettimeofday_us() - start }

function proc:string() { return sprintf("%d (%s)", pid(), execname()) }

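# True if the path lies in the check results spool directory
# (36 is the length of the prefix string below).
function filename_filter:long(filename) {
    return substr(filename, 0, 36) == "/var/lib/nagios3/spool/checkresults/"
}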

probe begin { start = gettimeofday_us() }

probe syscall.open.return {
  filename = user_string($filename)
  if ( filename_filter(filename) ) {
      printf("%d %s open %s returned %d\n", timestamp(), proc(), filename, $return)
  }
}

probe syscall.unlink.return {
  pathname = user_string($pathname)
  if ( filename_filter(pathname) ) {
      printf("%d %s unlink %s returned %d\n", timestamp(), proc(), pathname, $return)
  }
}

probe syscall.rename.return {
  oldname = user_string($oldname)
  newname = user_string($newname)
  if ( filename_filter(oldname) || filename_filter(newname) ) {
      printf("%d %s rename %s -> %s returned %d\n", timestamp(), proc(), oldname, newname, $return)
  }
}

# Log every exec on the system (not filtered by path or process).
probe kprocess.exec {
  printf("%d %s exec %s\n", timestamp(), proc(), filename)
}

