Bug#239111: following etch->lenny grub upgrade instructions locks up a system with XFS root
Robert McQueen
robot101 at hadesian.co.uk
Fri Dec 12 03:43:31 UTC 2008
severity 239111 grave
found 239111 0.97-47
thanks
Sorry to drag this ancient bug up to RC status, but I just got *totally*
screwed by it. I've just upgraded an etch system to lenny, and following
the instructions in grub's NEWS.Debian file, I ran 'grub-install (hd0)'
after verifying my device.map in order to update the bootloader, under
the threat that my system would not be able to boot a >= 2.6.23 kernel.
Unfortunately, because of this grub-install hanging while my XFS root
filesystem is frozen, my entire (remote, headless) system is now
unresponsive on all of the ssh logins I had open to it, as well as on
its serial console, because any login activity ends up trying to write
to disk. (I could sysrq or powercycle it, but I'm not actually certain
whether the bootloader is currently intact. Thankfully it's the dom0
of a Xen system, and the mission-critical domUs are working still, so I
can risk a reboot over the weekend.)
Previously this bug has just been a minor annoyance because under d-i,
you can work around this freeze by going to the shell (which is in
memory in a different filesystem) and killing grub-install, unfreezing
the filesystem, and doing it manually. However, this is the first time
I've been enticed into using grub-install on a live system, and now I've
realised quite what a screwup this actually is. It's moved from a mild
annoyance for deviants like me who insist on choosing grub from the d-i
menu when using an XFS root, to a way to lock-up and potentially even
brick (I can tell you tomorrow evening) a running system.
As I noted, er, four and a half years ago[1], and as many previous
reporters have also pointed out, what the script currently does will
*always* fail where /boot and / are the same filesystem:
xfs_freeze -f
grub --bullshit... # tries to write to the filesystem and log file too,
# will never make any progress because / is frozen
xfs_freeze -u
On a live system, not only does this just hang the grub-install process,
but it effectively stops you from being able to log in to fix the
situation too.
1: http://lists.debian.org/debian-boot/2004/06/msg02301.html
Previous reporters have provided patches[2] to change the behaviour to this:
xfs_freeze -f
xfs_freeze -u
grub --bullshit...
The drawbacks of this approach (see below) are that grub might not
succeed in upgrading/installing, or that it might install an older
version of grub. It will, however, never freeze your system.
2: http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=152;filename=xfs_freeze.diff;att=1;bug=239111
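To make the re-ordering concrete, here is a minimal sketch of the freeze / unfreeze / grub sequence. It is not the code from grub-install or from the patch in [2]. The function names and the RUN dry-run hook are hypothetical, made up for illustration. By default it only prints the commands it would run, so nothing gets frozen.

```shell
#!/bin/sh
# Sketch only: not the real grub-install code nor the patch from [2].
# RUN is a hypothetical hook. It defaults to "echo", so commands are
# printed rather than executed; set RUN="" (as root) to really run them.
RUN="${RUN-echo}"

# flush_boot_fs MOUNTPOINT FSTYPE
# Freeze and immediately thaw an XFS filesystem, so that its contents
# are pushed out to disk but it is never left frozen while something
# else needs to write to it.
flush_boot_fs() {
    mountpoint=$1
    fstype=$2
    if [ "$fstype" = "xfs" ]; then
        $RUN xfs_freeze -f "$mountpoint" || return 1
        $RUN xfs_freeze -u "$mountpoint" || return 1
    fi
}

# install_stages MOUNTPOINT FSTYPE COMMAND...
# Flush first, then run COMMAND (the grub shell invocation). The
# filesystem is already writable again at that point, so the grub
# shell can write its log file without deadlocking.
install_stages() {
    mountpoint=$1
    fstype=$2
    shift 2
    flush_boot_fs "$mountpoint" "$fstype" || return 1
    $RUN "$@"
}
```

Called as `install_stages / xfs grub --batch`, the dry run prints the two xfs_freeze commands followed by the grub command. For a non-XFS filesystem it prints only the grub command.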
Earlier reporters on the bug in 2006 commented that the above doesn't
work reliably. This is due to #306966, which is present in etch's
kernel, and means xfs_freeze -f can return before the data is actually
synced. I've spoken to the XFS developers about this on IRC, and
apparently it was fixed upstream around 2.6.19 or 2.6.20. So actually,
applying that patch to re-order things like that will make this freezing
bug go away on systems running lenny's kernel once and for all, and d-i
won't need to force LILO on XFS users any more. :)
However it /will/ re-introduce the race condition so that grub-install
fails when you're running under etch's kernel, but that's /much/ less of
a problem than locking up or bricking the system, because you can work
around it by just installing grub manually. So, definitely for lenny we
must apply a patch which fixes the script to do freeze / unfreeze / grub.
The only remaining issue is then: how old does a grub have to be before
it really cannot read the 2.6.26 kernel that comes with lenny?
If etch's grub /can/ read lenny's kernel, and it's only /older/ grubs
that have problems, then this race bug will affect a pretty small number
of people. We can just edit NEWS.Debian then to say "If grub-install
fails and you have XFS /boot and etch's kernel, reboot and try it
again". Any people with >3 year old Debian systems with XFS root and
grub are probably able to figure out how to reinstall grub manually.
If etch's grub /can't/ read lenny's kernel, then I guess we need to
provide a smoother migration path for people who are doing as I did and
running grub-install, so we can apply the patch with the sleep
workaround[3] and include a message:
"If you got a file not found error and use an XFS root filesystem,
please try re-running grub-install with the --sync-sleep=60 option to
work around a bug in the 2.6.18 kernel in etch."
3: http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=179;filename=xfs_freeze.diff;att=1;bug=239111
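For illustration, this is roughly how a --sync-sleep option could fit into the sequence. It is an assumption about the behaviour, not the contents of the patch in [3]. The function names, the sync_sleep variable and the RUN dry-run hook are all hypothetical. The idea is just to wait a while after thawing, so that an etch kernel affected by #306966 has time to get the data onto disk before grub reads the block device.

```shell
#!/bin/sh
# Sketch only: an assumption about how a --sync-sleep option might
# behave, not the actual patch from [3].
# RUN is a hypothetical hook. It defaults to "echo", so commands are
# printed rather than executed; set RUN="" (as root) to really run them.
RUN="${RUN-echo}"

sync_sleep=0    # seconds to wait after thawing; 0 means no wait

# parse_sync_sleep ARG...
# Pick up --sync-sleep=N from the arguments; other arguments are ignored.
parse_sync_sleep() {
    for arg in "$@"; do
        case "$arg" in
            --sync-sleep=*)
                value=${arg#--sync-sleep=}
                case "$value" in
                    ''|*[!0-9]*)
                        echo "invalid --sync-sleep value: $value" >&2
                        return 1 ;;
                esac
                sync_sleep=$value ;;
        esac
    done
}

# thaw_and_wait MOUNTPOINT
# Freeze, thaw, then optionally sleep before the caller runs grub.
thaw_and_wait() {
    $RUN xfs_freeze -f "$1" || return 1
    $RUN xfs_freeze -u "$1" || return 1
    if [ "$sync_sleep" -gt 0 ]; then
        $RUN sleep "$sync_sleep"
    fi
}
```

After `parse_sync_sleep --sync-sleep=60 '(hd0)'`, a dry run of `thaw_and_wait /` prints the freeze, the unfreeze and then `sleep 60`. Without the option, no sleep is printed.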
Regards,
Rob
P.S. If we were being "enterprise", we'd probably push an etch kernel
patch which fixed the race condition with xfs_freeze, so people could
reliably upgrade their kernel and grub at the same time before rebooting
into lenny. But, eh, screw it.
P.P.S. The XFS developers would like to take this opportunity to remind
viewers that reading/writing the block device underneath a filesystem is
the true bug here, and that the grub shell should just be made to use
FIBMAP to find the stage 1.5 block list like lilo has done happily for
over 10 years...