[Pkg-nagios-devel] Bug#615133: [lm-sensors] FAULT status of sensors

Jan Wagner waja at cyconet.org
Tue May 17 09:33:44 UTC 2011


Hi Jonathan,

thanks for your bug report.

On Tuesday 17 May 2011 09:36:00 Jean Delvare wrote:
> On Mon, 16 May 2011 19:32:52 -0700, Guenter Roeck wrote:
> > On Mon, May 16, 2011 at 02:23:44PM -0400, Jan Wagner wrote:
> > > we got a bug report[1] against our nagios-plugins package. Unfortunately
> > > we are unsure what "FAULT" means.
> > > If it indicates a hardware problem, such as a damaged sensor, we would
> > > report "CRITICAL", since a problem occurred.
> > > If it means there is a problem detecting the sensor, or some
> > > software-related problem, we would report "UNKNOWN", as this does not
> > > mean a hardware problem happened.
> > 
> > It is supposed to indicate a HW problem. Here is the text describing the
> > sysfs attribute:
> > 
> > "Each input channel may have an associated fault file. This can be used
> >  to notify open diodes, unconnected fans etc. where the hardware
> >  supports it. When this boolean has value 1, the measurement for that
> >  channel should not be trusted."
> > 
> > Note that "critical" in the hwmon ABI means that a critical limit has
> > been reached. You would get a "critical" alarm in this case. You might
> > have a terminology problem if you use "critical" for a hardware failure.
> > 
> > An undetected sensor should not show up in the first place.
> 
> In fact, FAULT can happen in two different cases. The first case (most
> common) is a channel left unused by the manufacturer, and the channel
> should indeed be ignored. The second case is a thermal diode dying or a
> fan stalling, and reporting this makes sense. So I would:
> * Ignore sensor channels which report FAULT when you start monitoring.
> * Report FAULT as an actual problem if it happens later during
>   monitoring, for a channel which reported real values before. The
>   terminology is up to you.

(Jean: many thanks for your clarification)

This means that FAULT can be reported both when the hardware is fine (an
unused channel) and when hardware is actually failing.
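
To make Jean's two rules concrete, here is a minimal Python sketch. It is an
illustration only and not part of check_sensors: the function name
classify_faults, the dictionaries, and the channel names are all made up. It
assumes something remembers which channels reported FAULT when monitoring
started, which a stateless plugin run cannot do on its own.

```python
def classify_faults(baseline, current):
    """Split currently faulted channels into (ignored, problems).

    baseline: channel name -> True if it reported FAULT when monitoring started
    current:  channel name -> True if it reports FAULT now
    Both mappings are hypothetical inputs for this sketch.
    """
    ignored, problems = [], []
    for channel in sorted(current):
        if not current[channel]:
            continue  # channel reads a real value right now
        if baseline.get(channel) is False:
            # Reported real values at start, FAULT now: a real problem.
            problems.append(channel)
        else:
            # FAULT from the start (or never seen before): treat as unused.
            ignored.append(channel)
    return ignored, problems


# Hypothetical channel names, for illustration only.
baseline = {"temp1": False, "temp2": True, "fan1": False}
current = {"temp1": False, "temp2": True, "fan1": True}
ignored, problems = classify_faults(baseline, current)
print("ignored:", ignored, "problems:", problems)
```

Here temp2 was already in FAULT at start and is ignored, while fan1 worked
before and is flagged.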

On Saturday 26 February 2011 00:07:22 Jonathan Wiltshire wrote:
> The attached patch causes check_sensors to return a critical status if
> faulty sensors are detected.

For nagios-plugins this means we cannot tell whether there is actually a
problem. check_sensors should therefore report "UNKNOWN" if a sensor reports
"FAULT". As the cause may not be a problem with the hardware at all,
something like --ignore-fault needs to be implemented too.
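
As a rough sketch of that behaviour (in Python for illustration; it is not
the actual check_sensors code), the mapping could look like the following.
The function name fault_status, the message strings, and the sample output
lines are assumptions; only the Nagios exit codes (OK=0, UNKNOWN=3) are
standard. All other checks the plugin performs are left out.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def fault_status(sensors_output, ignore_fault=False):
    """Map FAULT readings in `sensors` output to a Nagios (code, message).

    sensors_output: text as printed by the `sensors` command
    ignore_fault:   models the proposed --ignore-fault option
    """
    faulted = [line.split(":")[0].strip()
               for line in sensors_output.splitlines()
               if "FAULT" in line]
    if faulted and not ignore_fault:
        return UNKNOWN, "SENSOR UNKNOWN - fault reported by: " + ", ".join(faulted)
    return OK, "SENSOR OK"


# Hypothetical sample output, for illustration only.
sample = "fan1: 1200 RPM (min = 600 RPM)\nfan2: FAULT (min = 600 RPM)\n"
print(fault_status(sample))
print(fault_status(sample, ignore_fault=True))
```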

With kind regards, Jan.
-- 
Never write mail to <waja at spamfalle.info>, you have been warned!
-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GIT d-- s+: a C+++ UL++++ P+ L+++ E--- W+++ N+++ o++ K++ w--- O M V- PS PE Y++
PGP++ t-- 5 X R tv- b+ DI D+ G++ e++ h---- r+++ y++++ 
------END GEEK CODE BLOCK------