[sane-devel] Mandrake 9.1 and ServeRAID 5i

Sun, 14 Sep 2003 23:11:48 +0200

Raf Schietekat wrote:
> Let me drop this in your lap. It's an extremely serious problem for 
> users of a specific RAID system (you know, the people who are paranoid 
> that anything should go wrong with their data or the availability of 
> their server): it crashes the server and messes up that data. Evidence 
> suggests (also look at that bug report mentioned below!) that SANE may 
> be the guilty party. I will just reproduce the last message I sent to 
> some parties involved (with [...] used to omit some irrelevant text and 
> to avoid divulging some identities), which may be rather verbose, but 
> you never *really* know what's exactly relevant and/or convincing enough 
> and/or interesting.

Your report does indeed sound quite nasty...

> No reaction, BTW, and now it's too late (I probably should have come 
> here before, but I did not know what SANE was, and then my message was 
> blocked for a while before I sent it again), or it would have to be to 
> help with a very targeted, convincing, and quick intervention (I would 
> have to invest time in a complete reinstallation, which I am obviously 
> reluctant to do). My workaround will be a cron(-like?) task that will 
> disable anything related to scanners every minute or so (the frequency 
> of the existing msec security check), to protect against accidental 
> updates that reinstate the code.

Well, unless you actually have a scanner attached to your server, there 
is no point in having Sane-related programs installed (scanimage, 
xscanimage, saned and sane-find-scanner come to mind -- I can't comment 
on any Mandrake-specific stuff, because I have never used or installed 
Mandrake).

>>> After some further research using www.google.com for ["Mandrake 9.1" 
>>> ServeRAID], which at first didn't seem necessary because I had 
>>> repeatedly and successfully done all these steps before and the only 
>>> new thing were the two NICs on bus B (the same bus that carries the 
>>> ServeRAID 5i card), it appears that I may have been bitten by what's 
>>> dealt with on the following page:
>>>
>>> http://qa.mandrakesoft.com/show_bug.cgi?id=3421

From William Borrelli's comment on this page, it seems that parts of 
the RAID system identify themselves as scanners (or processors -- I 
don't know how this Perl script gets the information about a "media 
type"; I couldn't find anything about "media type" in the SCSI-2 draft. 
But I must admit that I did not search very thoroughly):

           {
             'info' => 'IBM YGHv3 S2',
             'bus' => 'SCSI',
             'raw_type' => 'Processor            ANSI SCSI revision: 02',
             'channel' => '01',
             'device' => 'sg5',
             'host' => 2,
             'lun' => '00',
             'devfs_prefix' => 'scsi/host2/bus1/target9/lun0',
             'media_type' => 'scanner',
             'id' => '09'
           }

The command "cat /proc/scsi/scsi" is perhaps a simpler way to see which 
SCSI devices the kernel has identified.
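For anyone who prefers to post-process that listing, here is a minimal sketch of a parser for the usual three-line-per-device layout of /proc/scsi/scsi. The sample text is illustrative only: the vendor, model and type strings are taken from the dump above, while the "Rev" value and the function name parse_proc_scsi are assumptions of mine.

```python
import re

# Illustrative sample in the usual /proc/scsi/scsi layout.
# Vendor/model/type come from the dump above; the Rev value is made up.
SAMPLE = """\
Attached devices:
Host: scsi2 Channel: 01 Id: 09 Lun: 00
  Vendor: IBM      Model: YGHv3 S2         Rev: 1
  Type:   Processor                        ANSI SCSI revision: 02
"""

HOST_RE = re.compile(
    r"Host:\s*(\S+)\s+Channel:\s*(\d+)\s+Id:\s*(\d+)\s+Lun:\s*(\d+)")
VENDOR_RE = re.compile(r"Vendor:\s*(.*?)\s+Model:\s*(.*?)\s+Rev:\s*(.*)")
TYPE_RE = re.compile(r"Type:\s*(.+?)\s+ANSI")


def parse_proc_scsi(text):
    """Return one dict per device found in /proc/scsi/scsi-style text."""
    devices = []
    current = None
    for line in text.splitlines():
        match = HOST_RE.search(line)
        if match:
            # A "Host:" line starts a new device record.
            current = dict(zip(("host", "channel", "id", "lun"),
                               match.groups()))
            devices.append(current)
            continue
        if current is None:
            continue  # header line before the first device
        match = VENDOR_RE.search(line)
        if match:
            vendor, model, rev = (g.strip() for g in match.groups())
            current.update(vendor=vendor, model=model, rev=rev)
            continue
        match = TYPE_RE.search(line)
        if match:
            current["type"] = match.group(1).strip()
    return devices


# On a real machine: parse_proc_scsi(open("/proc/scsi/scsi").read())
```

Filtering the result for entries whose "type" is "Scanner" or "Processor" gives the set of devices a scanner probe would be likely to touch.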

>>> (this is where I saw Thierry Vignaud's address; I've found [...]'s 
>>> address in /usr/sbin/scannerdrake on a Mandrake 9.0 installation)

As already mentioned, I have never worked with Mandrake, so I can't 
comment on scannerdrake based on real knowledge of this program. But I 
assume that it tries to identify scanners either by calling the standard 
Sane programs sane-find-scanner or scanimage, or by using techniques 
similar to those of these programs: (1) read /proc/scsi/scsi, (2) try to 
open those SG device files which belong to SCSI devices of the types 
"scanner" and "processor", (3) issue a SCSI INQUIRY command, (4) decide 
from the result of this command whether this is a scanner (or a 
"processor" in the case of some HP scanners), (5) check whether the 
vendor and model information returned by the INQUIRY command matches 
that of a scanner supported by a Sane backend, and (6) close the device 
file if there is no match.

Since scanimage may load many backends, and since I haven't read the 
source code of every Sane backend, I am not 100% but "only" 99% sure 
that these backends will not try to work any further with the processor 
devices belonging to the RAID controller. It is highly unlikely that 
"IBM YGHv3 S2" is mentioned as the vendor and/or device ID anywhere in 
a Sane backend. Hence it seems that your RAID controller does not like 
INQUIRY commands sent too often -- which would be a violation of the 
SCSI standard. SCSI devices should be able to respond to a few commands 
like INQUIRY and TEST UNIT READY under any circumstances. And these two 
commands in particular should not alter the state of a SCSI device in 
any way.
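To make steps (4) to (6) of the probing sequence concrete, here is a rough sketch of how the answer to a standard INQUIRY could be evaluated. It does not talk to any hardware: fake_inquiry only builds a 36-byte response for demonstration. The field offsets (device type in the low 5 bits of byte 0, vendor in bytes 8-15, product in bytes 16-31, revision in bytes 32-35) follow the standard INQUIRY data layout; the SUPPORTED table, its "ACME" entry and all function names are placeholders of mine, not anything taken from a Sane backend.

```python
# Peripheral device type codes from the standard INQUIRY data.
PROCESSOR = 0x03
SCANNER = 0x06

# Hypothetical table of (vendor, model) pairs a backend would accept.
SUPPORTED = {("ACME", "SCAN-1000")}


def parse_inquiry(data: bytes) -> dict:
    """Extract type, vendor, model and revision from INQUIRY data."""
    if len(data) < 36:
        raise ValueError("standard INQUIRY data is at least 36 bytes")
    return {
        "device_type": data[0] & 0x1F,
        "vendor": data[8:16].decode("ascii", "replace").strip(),
        "model": data[16:32].decode("ascii", "replace").strip(),
        "revision": data[32:36].decode("ascii", "replace").strip(),
    }


def should_keep_open(info: dict, supported=SUPPORTED) -> bool:
    """Steps (4)-(6): keep the device only if it looks like a known scanner."""
    if info["device_type"] not in (SCANNER, PROCESSOR):
        return False
    return (info["vendor"], info["model"]) in supported


def fake_inquiry(device_type, vendor, model, revision) -> bytes:
    """Build a synthetic 36-byte INQUIRY response (demonstration only)."""
    # Byte 0: device type, byte 2: version, byte 3: response format,
    # byte 4: additional length (36 - 5 = 31); the rest is left at zero.
    header = bytes([device_type & 0x1F, 0, 2, 2, 31, 0, 0, 0])
    return (header
            + vendor.ljust(8).encode("ascii")
            + model.ljust(16).encode("ascii")
            + revision.ljust(4).encode("ascii"))


raid = parse_inquiry(fake_inquiry(PROCESSOR, "IBM", "YGHv3 S2", "0"))
print(raid["vendor"], raid["model"], should_keep_open(raid))
```

With a table like this, the RAID controller's processor device is opened and queried, found not to match, and closed again, which is why the probe itself should be harmless to a well-behaved device.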

>> I've got my system up and running again. This involved the following:
>> - /dev/sda8 had disappeared, although its neighbours were still there,
>> - I tried MAKEDEV, but this uses /usr/bin/perl, on /dev/sda7, which 
>> was not yet mounted,
>> - I did "mount /usr",
>> - I did ./MAKEDEV,
>> - I rebooted, and things seemed fine.
>> Then I wanted to try a few things to see whether I could pinpoint the 
>> problem. Here is a complete account of what I did, probably erring on 
>> the side of giving too much information, but in the hope that it will 
>> be helpful for you to fix Mandrake's configuration managers etc. (I 
>> suggest that a probe for ServeRAID precedes and disables a probe for a 
>> scanner, perhaps with user input, unless the scanner probe can be 
>> changed so that it does no damage to the ServeRAID controller card 
>> configuration).
>> The system now only has a(n extra) NIC on bus A, which is separate 
>> from bus B which also carries the ServeRAID controller card. If I do 
>> "# scannerdrake" from a remote ssh -X session (I like to work from my 
>> laptop; the server is in a little server room), the system wants to 
>> install some packages, but I refuse to cooperate. It then says that it 
>> is scanning, or something (gone too fast for me to be able to read), 
>> and then it says "IBM SERVERAID is not in the scanner database, 
>> configure it manually?" (an obvious sign that something is going wrong 
>> with the scanner probe). I respond No. It then says "IBM 32P0042a S320 
>> 1 is not in the scanner database, configure it manually?". Don't even 
>> know what that is. I respond No. Then it does the same for "IBM 
>> SERVERAID" again, I respond No. And the same again for the other one, 
>> I respond No. Then I get the following panel:
>> - title: Scannerdrake
>> - text: There are no scanners found which are available on your system.
>> - button: Search for new scanners
>> - button: Add a scanner manually
>> - button: Scanner sharing
>> - button: Quit

Well, the Mandrake scanner installation software should simply shut up. 
A SCSI device of the type "processor" is absolutely reasonable for a 
RAID array -- you need a device (or at least a LUN) which gives you 
access to information about the RAID system: number of read or write 
errors, general state of the array, like "all ECC information up to 
date", which disk is broken, whether a spare disk is already in use, and 
whatever else. (As a side note, this shows the drawbacks of installation 
systems which try to detect every device attached to a computer 
automatically... Sometimes it is better to be able to explicitly tell 
the installer "yes, I have a scanner and I want to get the related 
software installed".)

>> I persevered, and clicked "Search for new scanners", well, that's the 
>> same as before, from just after the scanning. No crash yet. I did Quit.
>> Then I did vi `which harddrake2`, and I tried to add the line that 
>> [...] suggested (next if $Ident =~ "SCANNER";), but then vi froze 
>> (perhaps some of the file was still in memory from a previous vi 
>> session, but then it wanted to access the disk?). The other ssh 
>> sessions continued to work, unlike during the previous failure; I 
>> tried man perl in another one to try and see an explanation for double 
>> quotes ([...]) vs. slashes (harddrake2), but I got the error "-bash: 
>> /usr/bin/man: Input/output error", repeatedly. I can still open other 
>> ssh sessions, and the console itself works, but I see that all 3 
>> drives have an amber status light (not the green activity light, and 
>> if I remember correctly the status light is normally off), and that 
>> the "System-error LED" is lit on the "Operator information panel" 
>> (only other lit signs are "Power-on LED" and "POST-complete LED"), 
>> with also one LED lit in the "diagnostic LED panel" inside the 
>> computer, next to a symbol of a disk and the letters "DASD". When I 
>> look next, the console has gone from graphics to text mode, and is 
>> filling with messages about "EXT3-fs error", 

Do you see any messages from the Linux driver of the RAID controller? If 
the controller or the driver becomes confused, file system errors are 
unavoidable, I think.

>> Was data lost during the reset/power cycle (hopefully not during the 
>> rebuild, because that would defeat the purpose of having a RAID), or 
>> as early as the corruption of the ServeRAID controller card that 
>> (ultimately?) set the drives to defunct state? Apparently the boot 
>> doesn't even get to the stage where it would decide about clean state 
>> of the file systems, so this is not something we can afford on a 
>> system in production (evidence that recovery is not a simple matter 
>> and may involve data recovery from backup, unless *perhaps* if a boot 
>> floppy takes the system past this stage, after which ext2/ext3 gets a 
>> chance to repair itself, but I have no boot floppy... (will make one 
>> now, though, next chance I get)).

Well, I think you have been struck by a failure of one of the 
non-redundant parts of a RAID array: either the kernel driver or the 
controller itself...

>> I reboot into diagnostics (PC DOCTOR 2.0, apparently a specific 
>> feature of the IBM server), and the SCSI/RAID Controller test category 
>> passed.
>> Next I will proceed to reinstall the whole system from scratch.
>>
>>>
>>> I'm not sure yet, though (why hasn't this happened before, and has a 
>>> conclusion been reached?), that's why I've also cc'ed [...]. It seems 
>>> strange however, if this is indeed the problem, that a hardware 
>>> adapter card should prove so vulnerable to a probing method used for 
>>> a different device (a scanner), but then again I have no close 
>>> knowledge of these issues.
>>>
>>> BTW, the machine is not yet in production (I was going to do that, 
>>> but I guess I can now wait a few days), and available for tests.
>>>
>>> I still think it's really unfortunate that there is no list of known 
>>> *in*compatibilities, because who would suspect, with ServeRAID 
>>> support, or drivers anyway, available for SuSE, TurboLinux, Caldera 
>>> (SCO Group, the enemy!), and RedHat, that Mandrake would pose a 
>>> problem? The same goes for Mandrake's site, of course (all of IBM is 
>>> just "known hardware", and xSeries 235 and ServeRAID 5i are just 
>>> absent).

I think it is highly unlikely that a Sane program or backend or this 
special Mandrake "scanner search and installation" program is to blame 
for your problem. If you need your server up and running quite soon, I'd 
recommend using another RAID controller. (Sorry, I can't recommend a 
particular model...)

If you want to dig a bit deeper into the problem, you may try to run 
this Mandrake scanner installation program with the environment variable 
SANE_DEBUG_SANEI_SCSI set to 255. This will produce quite a lot of debug 
output (which should probably be sent to an IDE hard disk on the server 
or to your notebook, because the file systems on the RAID array will 
probably break again). The most interesting things are lines like

      rb>> rcv: id=0 blen=96 dur=10ms sgat=0 op=0x12

"op=..." is the SCSI command code sent to a device. 0x12 is INQUIRY; 
0x00 is TEST UNIT READY; these two commands should not cause any harm to 
a decent SCSI device. If you see anything else, we may have found a bug 
in Sane.
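Since the log can get long, a small filter may help to pick out anything other than these two commands. This is only a sketch: it assumes the debug lines carry the opcode as "op=0x.." exactly as in the line quoted above, and the second and third sample lines are invented for the demonstration.

```python
import re

OP_RE = re.compile(r"\bop=0x([0-9a-fA-F]+)")

# The two opcodes that a probe is expected to send.
HARMLESS = {0x00: "TEST UNIT READY", 0x12: "INQUIRY"}


def unexpected_opcodes(lines):
    """Return (line number, opcode, text) for every other SCSI command."""
    found = []
    for number, line in enumerate(lines, 1):
        match = OP_RE.search(line)
        if match is None:
            continue  # not a line that reports a SCSI command
        opcode = int(match.group(1), 16)
        if opcode not in HARMLESS:
            found.append((number, opcode, line.rstrip()))
    return found


sample_log = [
    "rb>> rcv: id=0 blen=96 dur=10ms sgat=0 op=0x12",  # from the text above
    "rb>> rcv: id=1 blen=0 dur=1ms sgat=0 op=0x00",    # invented
    "rb>> rcv: id=2 blen=0 dur=1ms sgat=0 op=0x1b",    # invented
]

for number, opcode, text in unexpected_opcodes(sample_log):
    print("line %d: unexpected opcode 0x%02x: %s" % (number, opcode, text))
```

An empty result for a real log would support the view that only INQUIRY and TEST UNIT READY reached the controller.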

Of course, this test will only make sense if the Mandrake software 
either calls sane-find-scanner or scanimage, or uses the sanei_scsi 
library.
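To keep the debug output off the RAID array, the program could be started through a small wrapper like the following. This is a sketch under two assumptions: that the program honours SANE_DEBUG_SANEI_SCSI at all (see above), and that the debug messages arrive on stderr, as they normally do for Sane programs started from a shell. The function name and the log path in the usage comment are placeholders.

```python
import os
import subprocess


def run_with_scsi_debug(command, log_path, level=255):
    """Run command with SANE_DEBUG_SANEI_SCSI set; send stderr to log_path."""
    env = dict(os.environ)
    env["SANE_DEBUG_SANEI_SCSI"] = str(level)
    # The log should live on a disk that is NOT behind the RAID controller.
    with open(log_path, "wb") as log:
        result = subprocess.run(command, env=env, stderr=log)
    return result.returncode


# Hypothetical usage; the path is a placeholder for a mounted IDE disk:
# run_with_scsi_debug(["scannerdrake"], "/mnt/ide/sanei_scsi.log")
```

Writing the log over an ssh session to the notebook would work just as well; the point is only that the file survives if the array goes offline again.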

Abel