[Pkg-nagios-devel] Bug#361261: more information

Kurt Yoder ktydebbug at richard-group.com
Thu Apr 27 13:59:35 UTC 2006


So I've been watching this for a while, and the problems seem to start
when there is a lot of I/O on the machine. The nagios machine also runs
backups, so sometimes it is very busy copying files back and forth
between various disks. That is when Nagios starts misbehaving.

Usually what I see is zombied "insert.pl" processes. This is probably  
the script that adds data to my rrd files. I worked around this by  
checking for zombied insert.pl's every minute and killing any that I  
find. During backups, this can happen as often as once every 5  
minutes. It rarely or never happens if there are no backups running.
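
A minimal sketch of that kind of per-minute check, in Python. It is only an illustration under stated assumptions: the function names are made up, it assumes a procps-style `ps` that accepts `-eo pid,ppid,stat,args`, and it assumes the zombie shows up with "insert.pl" somewhere in its command column. A process in state Z is only removed once its parent reaps it, so the sketch reports the parent PID as well instead of assuming that a signal to the zombie itself is enough.

```python
import subprocess


def find_zombies(ps_output, name="insert.pl"):
    """Return (pid, ppid) pairs for zombie processes whose command mentions name.

    ps_output is text in the layout printed by `ps -eo pid,ppid,stat,args`:
    one header line, then one process per line.
    """
    zombies = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # ignore blank or malformed lines
        pid, ppid, stat, args = fields
        # The STAT column starts with "Z" for a zombie (defunct) process.
        if stat.startswith("Z") and name in args:
            zombies.append((int(pid), int(ppid)))
    return zombies


def current_ps_output():
    """Run ps and return its output (assumes a procps-style ps is installed)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Run from cron once a minute, `find_zombies(current_ps_output())` would give the list of defunct insert.pl PIDs together with the parents that still have to reap them.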

However, about once a week the main nagios process simply stops
spawning check processes. It uses as much CPU as it can get but
apparently doesn't do anything else. At these times, strace shows it
making no system calls at all. I am looking at such a hang right now,
and nagios hasn't budged for the last 2 hours. The last bit I see from
strace is this (it probably happened right before nagios stopped
spawning children):

rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_DFL}, 8) = 0
send(8, "Q\0\0\2\1INSERT INTO hostretention ("..., 514, 0) = 514
rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0
poll([{fd=8, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recv(8, "C\0\0\0\27INSERT 565606538 1\0Z\0\0\0\5T", 16384, 0) = 30
time([1146136794])                      = 1146136794
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_DFL}, 8) = 0
send(8, "Q\0\0\1\347INSERT INTO hostretention ("..., 488, 0) = 488
rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0
poll([{fd=8, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recv(8, "C\0\0\0\27INSERT 565606539 1\0Z\0\0\0\5T", 16384, 0) = 30
time([1146136794])                      = 1146136794
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_DFL}, 8) = 0
send(8, "Q\0\0\1\375INSERT INTO hostretention ("..., 510, 0) = 510
rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0
poll([{fd=8, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recv(8, "E\0\0\0bSERROR\0C22003\0Mvalue \"41992"..., 16384, 0) = 105
rt_sigaction(SIGPIPE, {SIG_IGN}, {SIG_DFL}, 8) = 0
send(8, "Q\0\0\0\rROLLBACK\0", 14, 0)   = 14
rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0
poll([{fd=8, events=POLLIN|POLLERR, revents=POLLIN}], 1, -1) = 1
recv(8, "C\0\0\0\rROLLBACK\0Z\0\0\0\5I", 16384, 0) = 20
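
For reading the trace: the traffic on fd 8 looks like the PostgreSQL version 3 wire protocol, and the last INSERT is answered with an ErrorResponse ("E") message whose fields are S=ERROR and C=22003 (the SQLSTATE for numeric_value_out_of_range), followed by a ROLLBACK. The length byte "b" is 0x62 = 98, and 1 + 98 bytes plus a 6-byte ReadyForQuery would account for the 105 bytes recv reports, though that last part is an inference. Below is a minimal Python sketch of how such a message splits into fields. The function name and the sample message text are hypothetical; the real message is truncated by strace and is not reconstructed here.

```python
import struct


def parse_error_response(msg):
    """Split a PostgreSQL v3 ErrorResponse message into {field_code: value}.

    Layout: 1 type byte ("E"), a 4-byte big-endian length that counts itself
    but not the type byte, then fields of the form <code byte><text>\\0,
    terminated by a single zero byte.
    """
    if msg[:1] != b"E":
        raise ValueError("not an ErrorResponse message")
    (length,) = struct.unpack("!I", msg[1:5])
    body = msg[5:1 + length]
    fields = {}
    pos = 0
    while pos < len(body) and body[pos] != 0:
        code = chr(body[pos])             # e.g. "S", "C", "M"
        end = body.index(b"\0", pos + 1)  # each value is NUL-terminated
        fields[code] = body[pos + 1:end].decode("utf-8", "replace")
        pos = end + 1
    return fields


# Hypothetical sample: same S and C fields as in the trace, made-up M text.
body = b"SERROR\0C22003\0Mexample message\0\0"
sample = b"E" + struct.pack("!I", 4 + len(body)) + body
print(parse_error_response(sample))
```

The C field is the one to look up in the PostgreSQL error code table; 22003 means that a numeric value did not fit the target column type.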




I am at a loss as to what is happening here. I purged and
reinstalled all my nagios packages to ensure that there weren't any
errant "bad" files lying around. I don't understand why this was
working fine before I upgraded to testing and then downgraded back to
stable, but has stopped working now. Is there anything else I can
check here?



