Bug#743512: grub-pc: grub-probe fails to locate md device (no such disk) when grub-installing

Adam Allred adam at gtisc.gatech.edu
Thu Apr 3 15:06:16 UTC 2014


Package: grub-pc
Version: 1.99-27+deb7u2
Severity: normal

Dear Maintainer,

I have a Debian Wheezy amd64 server consisting of a Supermicro X8DTi-F motherboard (BIOS version 2.1), a Supermicro 846E16-R900B with a BPN-SAS2-846EL1 backplane, an LSI 9211-8i HBA with P14 IT firmware, and two md RAID arrays. The first was a RAID10 array of 6 x 300GB Intel 320 SSDs (md0) using msdos partition tables; the second is a RAID10 array of 18 Seagate ST4000DM000 HDDs (md1) using GPT.

I need to increase the size of md0, so I began replacing each of the 300GB 320s with a Samsung 840 EVO 1TB drive. My standard process is as follows:

root@server:/home/adam# cat /proc/mdstat
Personalities : [raid10] 
md1 : active raid10 sdn2[19] sdj2[0] sdx2[17] sdv2[16] sdw2[15] sdu2[14] sdt2[13] sds2[12] sdm2[11] sdo2[9] sdr2[18] sdp2[7] sdq2[6] sdg2[5] sdl2[4] sdk2[3] sdh2[2] sdi2[1]
      35161966080 blocks super 1.2 512K chunks 2 near-copies [18/18] [UUUUUUUUUUUUUUUUUU]
      
md0 : active raid10 sde1[10] sdd1[9] sdc1[8] sdb1[6] sda1[7] sdf1[5]
      878710272 blocks super 1.2 512K chunks 2 near-copies [6/6] [UUUUUU]
      
unused devices: <none>
root@server:/home/adam# mdadm --fail /dev/md0 /dev/sdf1
mdadm: set /dev/sdf1 faulty in /dev/md0
root@server:/home/adam# mdadm --remove /dev/md0 /dev/sdf1
mdadm: hot removed /dev/sdf1 from /dev/md0
root@server:/home/adam# cat /proc/mdstat
Personalities : [raid10] 
md1 : active raid10 sdn2[19] sdj2[0] sdx2[17] sdv2[16] sdw2[15] sdu2[14] sdt2[13] sds2[12] sdm2[11] sdo2[9] sdr2[18] sdp2[7] sdq2[6] sdg2[5] sdl2[4] sdk2[3] sdh2[2] sdi2[1]
      35161966080 blocks super 1.2 512K chunks 2 near-copies [18/18] [UUUUUUUUUUUUUUUUUU]
      
md0 : active raid10 sde1[10] sdd1[9] sdc1[8] sdb1[6] sda1[7]
      878710272 blocks super 1.2 512K chunks 2 near-copies [6/5] [UUUUU_]
      
unused devices: <none>

<physically replace the drive>

root@server:/home/adam# dd if=/dev/sde of=/dev/sdf bs=512 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.00303809 s, 169 kB/s
root@server:/home/adam# echo w | fdisk /dev/sdf

Command (m for help): The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
root@server:/home/adam# mdadm --add /dev/md0 /dev/sdf1
mdadm: added /dev/sdf1
root@server:/home/adam# cat /proc/mdstat
Personalities : [raid10] 
md1 : active raid10 sdn2[19] sdj2[0] sdx2[17] sdv2[16] sdw2[15] sdu2[14] sdt2[13] sds2[12] sdm2[11] sdo2[9] sdr2[18] sdp2[7] sdq2[6] sdg2[5] sdl2[4] sdk2[3] sdh2[2] sdi2[1]
      35161966080 blocks super 1.2 512K chunks 2 near-copies [18/18] [UUUUUUUUUUUUUUUUUU]
      
md0 : active raid10 sdf1[11] sde1[10] sdd1[9] sdc1[8] sdb1[6] sda1[7]
      878710272 blocks super 1.2 512K chunks 2 near-copies [6/5] [UUUUU_]
      [>....................]  recovery =  0.0% (60160/292903424) finish=243.3min speed=20053K/sec
      
unused devices: <none>
root@server:/home/adam# grub-install --recheck /dev/sdf


At this point, grub-install failed, but only for the final four SSDs (sd[c-f]):

root@server:/home/adam# grub-install --recheck /dev/sdf
/usr/sbin/grub-probe: error: no such disk.
Auto-detection of a filesystem of /dev/md0 failed.
Try with --recheck.
If the problem persists please report this together with the output of "/usr/sbin/grub-probe --device-map="/boot/grub/device.map" --target=fs -v /boot/grub" to <bug-grub at gnu.org>

The output of grub-probe is as follows:

root@server:/home/adam# /usr/sbin/grub-probe --device-map="/boot/grub/device.map" --target=fs -v /boot/grub
/usr/sbin/grub-probe: info: Scanning for dmraid_nv RAID devices on disk hd0.
/usr/sbin/grub-probe: info: the size of hd0 is 1953525168.
/usr/sbin/grub-probe: info: the size of hd0 is 1953525168.
/usr/sbin/grub-probe: info: Scanning for dmraid_nv RAID devices on disk hd1.
/usr/sbin/grub-probe: info: the size of hd1 is 1953525168.
....
....
....
/usr/sbin/grub-probe: info: the size of hd23 is 7814037168.
/usr/sbin/grub-probe: info: no LVM signature found.
/usr/sbin/grub-probe: info: opening mduuid/48fdedd049d45f9a3963420a24b16940.
/usr/sbin/grub-probe: error: no such disk.

The mduuid is correct:

root@server:/home/adam# mdadm --examine --scan
ARRAY /dev/md/1 metadata=1.2 UUID=532ef032:cd182e40:575dfcd5:d6c2aa73 name=server:1
ARRAY /dev/md/0 metadata=1.2 UUID=48fdedd0:49d45f9a:3963420a:24b16940 name=server:0
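
To make the comparison explicit: the name grub-probe tries to open looks like the mdadm array UUID with the colons removed and an "mduuid/" prefix. Below is a small sketch of that check. The helper name to_grub_mduuid is made up for illustration and is not part of grub or mdadm; the mapping itself is only inferred from the two outputs above.

```shell
#!/bin/sh
# Hypothetical helper (not a grub or mdadm command): turn an
# mdadm-style array UUID into the "mduuid/..." name that appears
# in the grub-probe output.
to_grub_mduuid() {
    # Strip the colons, then prepend the mduuid/ prefix.
    printf 'mduuid/%s\n' "$(printf '%s' "$1" | tr -d ':')"
}

# Array UUID of md0 as reported by "mdadm --examine --scan".
to_grub_mduuid 48fdedd0:49d45f9a:3963420a:24b16940
```

For md0 this yields the same string that grub-probe reports opening just before it fails.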

What's really mind-boggling is that if I first examine the new member:

root@server:/home/adam# mdadm --examine /dev/sdf1
/dev/sdf1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x2
     Array UUID : 48fdedd0:49d45f9a:3963420a:24b16940
           Name : server:0  (local to host server)
  Creation Time : Thu Nov 29 21:55:02 2012
     Raid Level : raid10
   Raid Devices : 6

 Avail Dev Size : 585807872 (279.33 GiB 299.93 GB)
     Array Size : 878710272 (838.00 GiB 899.80 GB)
  Used Dev Size : 585806848 (279.33 GiB 299.93 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
Recovery Offset : 329516928 sectors
          State : active
    Device UUID : ae156010:d7785167:59f3471c:8a17568d

    Update Time : Thu Apr  3 14:57:29 2014
       Checksum : 31857985 - correct
         Events : 135343

         Layout : near=2
     Chunk Size : 512K

   Device Role : Active device 5
   Array State : AAAAAA ('A' == active, '.' == missing)

and then run grub-install again:

root@server:/home/adam# grub-install --recheck /dev/sdf
Installation finished. No error reported.

Then it completes with no problem. This worked consistently for all four of the drives that exhibited this behavior. I have, on rare occasions, encountered the same problem in the past; rebooting the server also "fixed" it.
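
For anyone who wants to script the workaround, here is a minimal sketch. The function names and the DRY_RUN variable are illustrative assumptions, not part of any package. By default it only prints the commands; set DRY_RUN=0 to actually run them (as root). I do not know why examining the member first helps, so treat this as a record of what worked here rather than a fix.

```shell
#!/bin/sh
# Sketch of the workaround: examine the md member, then install grub.
# DRY_RUN and the function names are illustrative, not from any package.
DRY_RUN=${DRY_RUN:-1}   # default: print the commands instead of running them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

grub_install_with_examine() {
    disk=$1     # whole disk to install grub on, e.g. /dev/sdf
    member=$2   # md member partition on that disk, e.g. /dev/sdf1
    # Running --examine on the member first is the step after which
    # grub-install succeeded in the cases described; the reason is unknown.
    run mdadm --examine "$member" || return 1
    run grub-install --recheck "$disk"
}

grub_install_with_examine /dev/sdf /dev/sdf1
```

The dry-run default is deliberate, since grub-install writes to the disk's boot sector.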

I checked for bug reports upstream to see if this was corrected in a newer version, but I did not see anything that pertained to my issue. Given I have a simple and effective workaround, it's not a big problem; it's just weird.


