[Pkg-nagios-devel] Bug#361261: more information
Kurt Yoder
ktydebbug at richard-group.com
Thu Apr 27 13:59:35 UTC 2006
I've been watching this for a while, and the problems seem to start
when there is a lot of I/O on the machine. The nagios machine also
runs backups, so at times it is very busy copying files back and
forth between various disks. That is when Nagios starts misbehaving.
Usually what I see is zombied "insert.pl" processes. This is probably
the script that adds data to my RRD files. I worked around this by
checking for zombied insert.pl's every minute and killing any that I
find. During backups, this can happen as often as once every 5
minutes. It rarely, if ever, happens when no backups are running.
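For anyone wanting to reproduce the once-a-minute check, here is a
minimal sketch in Python. It is not the script I actually run; the
function name find_zombies and the sample listing are made up for
illustration, and it assumes process listings in the
"PID STAT COMMAND" form that `ps -e -o pid= -o stat= -o args=` prints.

```python
def find_zombies(ps_output, name="insert.pl"):
    """Return PIDs of zombie processes whose command mentions `name`.

    Expects one process per line as "PID STAT COMMAND...", the layout
    printed by `ps -e -o pid= -o stat= -o args=` (an assumption of
    this sketch).
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        pid, stat, args = fields
        # A process state starting with "Z" marks a zombie (defunct) process.
        if stat.startswith("Z") and name in args:
            pids.append(int(pid))
    return pids


# Hypothetical listing, for illustration only.
sample = (
    "  101 Ss   /usr/sbin/nagios -d /etc/nagios/nagios.cfg\n"
    "  202 Z    [insert.pl] <defunct>\n"
    "  303 Z    [other] <defunct>\n"
)
print(find_zombies(sample))
```

Note that a process in state Z has already exited, so a signal sent to
it has no effect; it only disappears once its parent reaps it. The PIDs
found this way are mostly useful for logging or for tracking down the
parent.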
However, about once a week the main nagios process simply stops
spawning check processes. It uses as much CPU as it can get but
apparently doesn't do anything else. At these times, it is generating
no system calls according to strace. Currently, I am looking at this
situation, and nagios hasn't budged for the last 2 hours. The last
bit I see from strace is this (probably happened right before nagios
stopped spawning children):
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_DFL}, 8) = 0
send(8, "Q\0\0\2\1INSERT INTO hostretention ("..., 514, 0) = 514
rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0
poll([{fd=8, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recv(8, "C\0\0\0\27INSERT 565606538 1\0Z\0\0\0\5T", 16384, 0) = 30
time([1146136794]) = 1146136794
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_DFL}, 8) = 0
send(8, "Q\0\0\1\347INSERT INTO hostretention ("..., 488, 0) = 488
rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0
poll([{fd=8, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recv(8, "C\0\0\0\27INSERT 565606539 1\0Z\0\0\0\5T", 16384, 0) = 30
time([1146136794]) = 1146136794
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_DFL}, 8) = 0
send(8, "Q\0\0\1\375INSERT INTO hostretention ("..., 510, 0) = 510
rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0
poll([{fd=8, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recv(8, "E\0\0\0bSERROR\0C22003\0Mvalue \"41992"..., 16384, 0) = 105
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_DFL}, 8) = 0
send(8, "Q\0\0\0\rROLLBACK\0", 14, 0) = 14
rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0
poll([{fd=8, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recv(8, "C\0\0\0\rROLLBACK\0Z\0\0\0\5I", 16384, 0) = 20
I am at a loss as to what is happening here. I purged and
re-installed all my nagios packages to ensure that there weren't any
stray "bad" files lying around. I don't understand why this worked
fine before I upgraded to testing and then downgraded back to stable,
but no longer works now. Is there anything else I can check here?
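One thing that can be read out of the trace: the last reply before the
ROLLBACK starts with "E", which in the PostgreSQL frontend/backend
protocol (version 3) is an ErrorResponse. The byte "b" after the three
zero bytes is the message length (0x62 = 98), and the fields that
follow are S (severity) = ERROR and C (SQLSTATE) = 22003, which is
numeric_value_out_of_range. The M (message) field is cut off by strace,
so I have not tried to guess the rest of it. A minimal sketch of
decoding such a message, with a made-up sample payload rather than the
real one:

```python
import struct


def parse_error_response(buf):
    """Parse a PostgreSQL v3 ErrorResponse into {field_code: value}."""
    if buf[:1] != b"E":
        raise ValueError("not an ErrorResponse message")
    # The 4-byte big-endian length counts itself but not the type byte.
    (length,) = struct.unpack("!I", buf[1:5])
    body = buf[5:1 + length]
    fields = {}
    # Each field is a one-byte code plus a NUL-terminated string;
    # a lone NUL byte ends the list.
    for item in body.split(b"\0"):
        if item:
            fields[chr(item[0])] = item[1:].decode("utf-8", "replace")
    return fields


# Hypothetical payload: same S and C fields as in the trace, but the
# message text is invented because the real one is truncated.
body = b"SERROR\0C22003\0Mexample message\0\0"
sample = b"E" + struct.pack("!I", len(body) + 4) + body
print(parse_error_response(sample))
```

Running strace with a larger -s value should show the full message
text, including the value that was rejected.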