[sane-devel] Mandrake 9.1 and ServeRAID 5i
Raf Schietekat
sky92136 at skynet.be
Sun Sep 14 17:06:31 BST 2003
Let me drop this in your lap. It's an extremely serious problem for
users of a specific RAID system (you know, the people who are paranoid
about anything going wrong with their data or the availability of
their server): it crashes the server and messes up that data. The
evidence (also look at the bug report mentioned below!) suggests that
SANE may be the guilty party. I will simply reproduce the last message
I sent to some of the parties involved (with [...] used to omit
irrelevant text and to avoid divulging identities). It may be rather
verbose, but you never *really* know what is relevant and/or
convincing and/or interesting.
Raf Schietekat wrote:
> Note: ***Urgent***: If you (Mandrake and maybe IBM) would like to have
> me perform specific tests on my system, perhaps with Mandrake 9.2 RC1,
> it will almost have to be this week, because next week I'd like to bring
> the server into production. Please use this opportunity!
No reaction, BTW, and now it's too late (I probably should have come
here before, but I did not know what SANE was, and then my message was
blocked for a while before I sent it again), unless it would be to
help with a very targeted, convincing, and quick intervention (I would
have to invest time in a complete reinstallation, which I am obviously
reluctant to do). My workaround will be a cron(-like?) task that
disables anything related to scanners every minute or so (the
frequency of the existing msec security check), to protect against
accidental updates that reinstate the code.
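Roughly what I have in mind, as an untested sketch (treat the cron.d
entry and the exact paths as assumptions; scannerdrake is where I found
it on Mandrake 9.0, and sane-find-scanner comes with sane-backends):

    # /etc/cron.d/no-scanner-probe (hypothetical): every minute, strip the
    # execute bit from the scanner probing tools, in case an update restores them.
    * * * * * root chmod a-x /usr/sbin/scannerdrake /usr/bin/sane-find-scanner 2>/dev/null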
>
> Brief description: Mandrake 9.1 crashes systems with ServeRAID.
> Extensive report below, including a reference to a previous bug report,
> currently marked as needing further information (well, here is the info).
>
> Raf Schietekat wrote:
>[...]
>> For [...], whom I've
>> included in cc, a summary, in case you want to step in: I've been
>> test-running an IBM xSeries 235 with ServeRAID 5i for several weeks,
>> with Mandrake 9.1 (probably still the most recent version). Yesterday,
>> I inserted two 3COM NICs in bus B, which also carries the ServeRAID 5i
>> card. To test that the latter was still independently running at full
>> 100 MHz speed as in the documentation and not dragged down to the
>> NICs' 33 MHz, I did "time tar cf - / | wc -l", which showed about 7.5
>> MB/s throughput as before (unless it was more like 10 MB/s before, I'm
>> not exactly sure). I then used drakconf to see whether the NICs were
>> identified correctly. I did this from a remote ssh -X session, which
>> froze up. I could not open another ssh connection. On the console
>> itself, the mouse pointer was still moving, but I could not type
>> anything into the logon screen. The bottom two drives were spinning
>> continuously, while the top one wasn't doing anything, this for a RAID
>> 5 setting involving all three drives. Since nothing seemed to work, I
>> did a reset (the small button; I hope I wasn't supposed to use, e.g., the
>> power button instead). During reboot, the file system proved to be
>> corrupted, and could not be repaired (I will have to find out how to
>> do that, or reinstall everything).
>>
>> After some further research using www.google.com for ["Mandrake 9.1"
>> ServeRAID], which at first didn't seem necessary because I had
>> repeatedly and successfully done all these steps before and the only
>> new thing was the two NICs on bus B (the same bus that carries the
>> ServeRAID 5i card), it appears that I may have been bitten by what's
>> dealt with on the following page:
>>
>> http://qa.mandrakesoft.com/show_bug.cgi?id=3421
>> (this is where I saw Thierry Vignaud's address; I've found [...]'s
>> address in /usr/sbin/scannerdrake on a Mandrake 9.0 installation)
>
>
> I've got my system up and running again. This involved the following:
> - /dev/sda8 had disappeared, although its neighbours were still there,
> - I tried MAKEDEV, but it uses /usr/bin/perl, which lives on /dev/sda7
> (/usr), and that was not yet mounted,
> - I did "mount /usr",
> - I did ./MAKEDEV,
> - I rebooted, and things seemed fine.
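> For the record, the whole recovery then amounts to something like this
> (a sketch from memory; the MAKEDEV argument and the mknod numbers are my
> assumptions, so double-check them before relying on this):
>
>   mount /usr                  # /usr/bin/perl lives there and MAKEDEV needs it
>   cd /dev && ./MAKEDEV sda    # recreate the /dev/sda* nodes, including sda8
>   # by hand it would presumably be: mknod /dev/sda8 b 8 8
>   reboot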
> Then I wanted to try a few things to see whether I could pinpoint the
> problem. Here is a complete account of what I did, probably erring on
> the side of giving too much information, but in the hope that it will be
> helpful for you to fix Mandrake's configuration managers etc. (I suggest
> that a probe for ServeRAID precedes and disables a probe for a scanner,
> perhaps with user input, unless the scanner probe can be changed so that
> it does no damage to the ServeRAID controller card configuration).
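> One crude way for the probe (or whatever drives it) to check first would
> be something like the following sketch on my part; I'm assuming the logical
> drives show up with model SERVERAID in /proc/scsi/scsi, which matches what
> the probe itself reported:
>
>   # refuse to probe for SCSI scanners if a ServeRAID controller is visible
>   if grep -qi SERVERAID /proc/scsi/scsi 2>/dev/null; then
>       echo "ServeRAID detected, skipping the SCSI scanner probe" >&2
>       exit 0
>   fi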
> The system now only has a(n extra) NIC on bus A, which is separate from
> bus B which also carries the ServeRAID controller card. If I do "#
> scannerdrake" from a remote ssh -X session (I like to work from my
> laptop; the server is in a little server room), the system wants to
> install some packages, but I refuse to cooperate. It then says that it
> is scanning, or something (it goes by too fast for me to read), and
> then it says "IBM SERVERAID is not in the scanner database, configure it
> manually?" (an obvious sign that something is going wrong with the
> scanner probe). I respond No. It then says "IBM 32P0042a S320 1 is not
> in the scanner database, configure it manually?". I don't even know what
> that is. I respond No. Then it does the same for "IBM SERVERAID" again,
> I respond No. And the same again for the other one, I respond No. Then I
> get the following panel:
> - title: Scannerdrake
> - text: There are no scanners found which are available on your system.
> - button: Search for new scanners
> - button: Add a scanner manually
> - button: Scanner sharing
> - button: Quit
> I persevered and clicked "Search for new scanners"; that just repeats the
> same sequence as before, from just after the scanning. No crash yet. I did Quit.
> Then I did vi `which harddrake2`, and I tried to add the line that
> [...] suggested (next if $Ident =~ "SCANNER";), but then vi froze
> (perhaps some of the file was still in memory from a previous vi
> session, but then it wanted to access the disk?). The other ssh sessions
> continued to work, unlike during the previous failure; I tried man perl
> in another one to try and see an explanation for double quotes ([...])
> vs. slashes (harddrake2), but I got the error "-bash: /usr/bin/man:
> Input/output error", repeatedly. I can still open other ssh sessions,
> and the console itself works, but I see that all 3 drives have an amber
> status light (not the green activity light, and if I remember correctly
> the status light is normally off), and that the "System-error LED" is
> lit on the "Operator information panel" (only other lit signs are
> "Power-on LED" and "POST-complete LED"), with also one LED lit in the
> "diagnostic LED panel" inside the computer, next to a symbol of a disk
> and the letters "DASD". When I look next, the console has gone from
> graphics to text mode, and is filling with messages about "EXT3-fs
> error", "ext3_reserve_inode_write: IO failure", "ext3_get_inode_lock:
> unable to get inode block". Meanwhile, the remote ssh sessions are still
> responsive. I don't try anything on the console, and use a remote ssh
> session to try "# shutdown -h now" as root, but obviously the command
> cannot be read from disk (error message "bash: shutdown: command not
> found"). ctl-alt-del on the console's keyboard: same thing (this causes
> init to (try to) invoke shutdown). I then did a reset (actually a power
> cycle; just a reset would have been better). The three drives were still
> marked defunct (status lights on). I used the ServeRAID support CD to
> boot, and could set two of the physical drives online, but the last one
> did not have that right-click menu option (I even set the second one
> defunct again, was able to bring the third online, but then the option
> was missing on the second one). So then I briefly removed the second
> drive from its hot-swap bay, and when I inserted it again it started
> getting rebuilt from the other drives, and (according to the log)
> completed a little over an hour later (for 30+ GB disk capacity, of
> which maybe less than 1 GB in use, if that matters). I tell ServeRAID
> Manager (?) to reboot, and then I'm stuck with a garbled Mandrake splash
> screen and a succession of:
> Boot: linux-secure
> Loading linux-secure
> Error 0x00
> and then a succession of just:
> Boot: linux-securere
> Ctrl-Alt-Del works (but brings no salvation).
> Was data lost during the reset/power cycle (hopefully not during the
> rebuild, because that would defeat the purpose of having a RAID), or as
> early as the corruption of the ServeRAID controller card that
> (ultimately?) set the drives to the defunct state? Apparently the boot
> doesn't even get to the stage where it would decide about the clean state
> of the file systems, so this is not something we can afford on a system in
> production (evidence that recovery is not a simple matter and may
> involve data recovery from backup, unless *perhaps* a boot floppy
> takes the system past this stage, after which ext2/ext3 gets a chance to
> repair itself, but I have no boot floppy... (I will make one now, though,
> next chance I get)).
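> For reference, I believe a boot floppy can be made on Mandrake with
> mkbootdisk, roughly like this (just a sketch, I have not tried it yet; the
> kernel version argument must of course match the installed kernel):
>
>   mkbootdisk --device /dev/fd0 `uname -r`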
> I reboot into diagnostics (PC DOCTOR 2.0, apparently a specific feature
> of the IBM server), and the SCSI/RAID Controller test category passed.
> Next I will proceed to reinstall the whole system from scratch.
>
>>
>> I'm not sure yet, though (why hasn't this happened before, and has a
>> conclusion been reached?), which is why I've also cc'ed [...].
>> It seems strange, however, if this is indeed the
>> problem, that a hardware adapter card should prove so vulnerable to a
>> probing method used for a different device (a scanner), but then again
>> I have no close knowledge of these issues.
>>
>> BTW, the machine is not yet in production (I was going to do that, but
>> I guess I can now wait a few days), and available for tests.
>>
>> I still think it's really unfortunate that there is no list of known
>> *in*compatibilities, because who would suspect, with ServeRAID
>> support, or drivers anyway, available for SuSE, TurboLinux, Caldera
>> (SCO Group, the enemy!), and RedHat, that Mandrake would pose a
>> problem? The same goes for Mandrake's site, of course (all of IBM is
>> just "known hardware", and xSeries 235 and ServeRAID 5i are just absent).
>>
>> http://www-1.ibm.com/servers/enable/site/xinfo/linux/servraid
>> (this is where I saw the address ipslinux at us.ibm.com)
>>
>> http://www.mandrakelinux.com/en/hardware.php3
>>
>> [...]
Raf Schietekat <Raf_Schietekat at ieee.org>