Replacing a failed disk in RAID on an OVHcloud dedicated server

Yesterday morning, I woke up to some Discord messages saying that one of my websites was offline. Not a great way to start off the morning! I tried to first ascertain the scope of the issue—turns out none of my services were accessible at all from that one dedicated server. Panic started to set in a bit… did I forget to pay my monthly server bill? Did the server get hacked? Or did the server burn down?

Interestingly, SSH worked. Phew—sigh of relief—things are still working to some extent. I run Proxmox VE on my server, and its web control panel was partially functioning, so I attempted to restart to get it fully working:

root@garibaldi:/home/dennis# systemctl restart pveproxy
Failed to restart pveproxy.service: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
See system logs and 'systemctl status pveproxy.service' for details.

Hmm odd… at least it gave some suggestions at the bottom. I started with the service status:

root@garibaldi:/home/dennis# systemctl status pveproxy.service
Failed to get properties: Connection timed out

Not too helpful here. Following the other suggestion to see the system logs, I found the following:

May 16 04:12:46 garibaldi kernel: [44452858.296298] nvme nvme1: I/O 819 QID 1 timeout, aborting
May 16 04:13:16 garibaldi kernel: [44452888.500562] nvme nvme1: I/O 819 QID 1 timeout, reset controller
May 16 04:13:16 garibaldi kernel: [44452888.548755] nvme nvme1: Abort status: 0x371
May 16 04:13:19 garibaldi kernel: [44452891.245965] nvme nvme1: 8/0/0 default/read/poll queues
May 16 04:13:51 garibaldi kernel: [44452923.060884] nvme nvme1: I/O 576 QID 3 timeout, disable controller
May 16 04:13:51 garibaldi kernel: [44452923.877064] blk_update_request: I/O error, dev nvme1n1, sector 45088784 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
May 16 04:13:51 garibaldi kernel: [44452923.877065] blk_update_request: I/O error, dev nvme1n1, sector 312645672 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
May 16 04:13:51 garibaldi kernel: [44452923.877070] md: super_written gets error=-5
May 16 04:13:51 garibaldi kernel: [44452923.903587] md/raid1:md4: nvme1n1p4: rescheduling sector 267294760
May 16 04:13:51 garibaldi kernel: [44452923.903591] blk_update_request: I/O error, dev nvme1n1, sector 329846032 op 0x0:(READ) flags 0x0 phys_seg 4 prio class 0
May 16 04:13:51 garibaldi kernel: [44452923.929761] md/raid1:md4: Disk failure on nvme1n1p4, disabling device.
May 16 04:13:51 garibaldi kernel: [44452923.929761] md/raid1:md4: Operation continuing on 1 devices.
May 16 04:13:52 garibaldi kernel: [44452924.197711] nvme nvme1: failed to mark controller live state
May 16 04:13:52 garibaldi kernel: [44452924.197740] Read-error on swap-device (259:6:43028320)
May 16 04:13:52 garibaldi kernel: [44452924.197955] md/raid1:md4: redirecting sector 267294760 to other mirror: nvme0n1p4
May 16 04:13:52 garibaldi kernel: [44452924.217137] nvme nvme1: Removing after probe failure status: -19
May 16 04:13:52 garibaldi kernel: [44452924.560909] nvme1n1: detected capacity change from 879097968 to 0
May 16 04:13:52 garibaldi kernel: [44452924.598023] md: super_written gets error=-5
May 16 04:13:52 garibaldi kernel: [44452924.613193] md/raid1:md2: Disk failure on nvme1n1p2, disabling device.
May 16 04:13:52 garibaldi kernel: [44452924.613193] md/raid1:md2: Operation continuing on 1 devices.
May 16 04:13:52 garibaldi kernel: [44452924.739850] FAT-fs (nvme1n1p1): unable to read boot sector to mark fs as dirty

May 16 04:22:29 garibaldi kernel: [44453441.429045] fwbr112i0: port 2(tap112i0) entered disabled state
May 16 04:22:29 garibaldi kernel: [44453441.430212] Read-error on swap-device (259:6:44450424)

May 16 04:42:12 garibaldi kernel: [44454624.593680] vmbr0: port 2(tap101i0) entered disabled state
May 16 04:42:12 garibaldi kernel: [44454624.610177] vmbr0: port 2(tap101i0) entered disabled state
May 16 04:42:12 garibaldi kernel: [44454624.684771] vmbr1: port 1(tap101i1) entered disabled state
May 16 04:42:12 garibaldi kernel: [44454624.700848] vmbr1: port 1(tap101i1) entered disabled state

May 16 04:53:17 garibaldi kernel: [44455289.898789] fwbr111i0: port 2(tap111i0) entered disabled state
May 16 04:53:17 garibaldi kernel: [44455289.918070] fwbr111i0: port 2(tap111i0) entered disabled state
May 16 04:53:17 garibaldi kernel: [44455290.002365] fwbr111i1: port 2(tap111i1) entered disabled state
May 16 04:53:17 garibaldi kernel: [44455290.021230] fwbr111i1: port 2(tap111i1) entered disabled state

May 16 06:49:41 garibaldi kernel: [44462273.966854] vmbr0: port 3(tap102i0) entered disabled state
May 16 06:49:41 garibaldi kernel: [44462273.979660] vmbr0: port 3(tap102i0) entered disabled state
May 16 06:49:41 garibaldi kernel: [44462274.056903] vmbr1: port 2(tap102i1) entered disabled state
May 16 06:49:41 garibaldi kernel: [44462274.068993] vmbr1: port 2(tap102i1) entered disabled state

Disk failure! Not great! For some reason, the network also went down a bit afterwards. I didn’t exactly figure out why, but I thought that might have been more of a consequence than the cause, so I didn’t focus on that first.

So what’s going on with the disks? I used lsblk to list the disks:

root@garibaldi:/home/dennis# lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda             8:0    0   3.6T  0 disk
└─sda1          8:1    0   3.6T  0 part  /mnt/storagesda
sdb             8:16   0   3.6T  0 disk
└─sdb1          8:17   0   3.6T  0 part  /mnt/storagesdb
nvme0n1       259:0    0 419.2G  0 disk
├─nvme0n1p1   259:1    0   511M  0 part
├─nvme0n1p2   259:2    0    20G  0 part
│ └─md2         9:2    0    20G  0 raid1 /
├─nvme0n1p3   259:3    0     1G  0 part  [SWAP]
├─nvme0n1p4   259:4    0 397.7G  0 part
│ └─md4         9:4    0 397.6G  0 raid1
│   └─vg-data 253:0    0 397.6G  0 lvm   /var/lib/vz
└─nvme0n1p5   259:5    0     2M  0 part

I should have 2 NVME disks, but only one is still alive. nvme1 is completely gone!

The two NVME disks were in a software RAID1 array, so I looked at /proc/mdstat to check the status of those:

root@garibaldi:/home/dennis# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md4 : active raid1 nvme0n1p4[1]
      416871424 blocks super 1.2 [2/1] [_U]
      bitmap: 4/4 pages [16KB], 65536KB chunk

md2 : active raid1 nvme0n1p2[0]
      20955136 blocks super 1.2 [2/1] [U_]

unused devices: <none>

No surprise that only one disk is reporting in each of the RAID arrays. In particular the underscores in [_U] and [U_] indicate that there’s a disk down in each array. To get a bit more detail on the RAID status, I used mdadm:

root@garibaldi:/home/dennis# /sbin/mdadm -D /dev/md2
/dev/md2:
           Version : 1.2
     Creation Time : Fri Sep  3 10:46:30 2021
        Raid Level : raid1
        Array Size : 20955136 (19.98 GiB 21.46 GB)
     Used Dev Size : 20955136 (19.98 GiB 21.46 GB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent

       Update Time : Tue May 16 08:08:07 2023
             State : clean, degraded
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              Name : md2
              UUID : 075e9fdd:281f430d:edbc2afc:ff4db6bc
            Events : 5318

    Number   Major   Minor   RaidDevice State
       0     259        2        0      active sync   /dev/nvme0n1p2
       -       0        0        1      removed

root@garibaldi:/home/dennis# /sbin/mdadm -D /dev/md4
/dev/md4:
           Version : 1.2
     Creation Time : Fri Sep  3 10:46:31 2021
        Raid Level : raid1
        Array Size : 416871424 (397.56 GiB 426.88 GB)
     Used Dev Size : 416871424 (397.56 GiB 426.88 GB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Tue May 16 06:50:00 2023
             State : clean, degraded
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : md4
              UUID : 82738eb2:c636d623:25662d34:b53efdf9
            Events : 15756

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1     259        4        1      active sync   /dev/nvme0n1p4

I’m not really a sysadmin expert, but to me this looked like the disk somewhat gracefully removed itself from operation. Just to recap, a RAID1 is a mirroring array, where both disks store the same data. Loss of one disk is tolerable, and obviously only losing both at the same time would be catastrophic to the array. This meant I had to be absolutely sure of what I was doing to not cause any more issues with the one good disk.

At this point, the only thing I could do was to request my host, OVHcloud, to replace the failed disk. After double checking backups were recent, I submitted the ticket. OVHcloud handled the ticket within a few hours, after some back and forth to confirm the disk and risk of data loss, they replaced the disk and booted up the system in the rescue OS.

First I used fdisk to check out the new disk:

root@rescue-customer-ca (ns000000.ip-000-000-000.net) ~ # fdisk -l

Disk /dev/nvme1n1: 419.2 GiB, 450098159616 bytes, 879097968 sectors
Disk model: INTEL SSDPE2MX450G7
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/nvme0n1: 419.2 GiB, 450098159616 bytes, 879097968 sectors
Disk model: INTEL SSDPE2MX450G7
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: CDE841D3-6953-474E-8837-37325C7D9269

Device             Start       End   Sectors   Size Type
/dev/nvme0n1p1      2048   1048575   1046528   511M EFI System
/dev/nvme0n1p2   1048576  42991615  41943040    20G Linux RAID
/dev/nvme0n1p3  42991616  45088767   2097152     1G Linux filesystem
/dev/nvme0n1p4  45088768 879093871 834005104 397.7G Linux RAID
/dev/nvme0n1p5 879093872 879097934      4063     2M Linux filesystem

Great, /dev/nvme1n1 is back and awaiting action! First, I needed to copy the partition table from the existing disk to the new disk so that it would have the same partitions to put back into the RAID1 arrays. From a bit of searching, sfdisk was the best way to do this. I did a lsblk before and after just to compare the before (and check I had the correct disks in my command) and after.

root@rescue-customer-ca (ns000000.ip-000-000-000.net) ~ # lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda             8:0    0   3.7T  0 disk
└─sda1          8:1    0   3.7T  0 part
sdb             8:16   0   3.7T  0 disk
└─sdb1          8:17   0   3.7T  0 part
nvme1n1       259:0    0 419.2G  0 disk
nvme0n1       259:1    0 419.2G  0 disk
├─nvme0n1p1   259:2    0   511M  0 part
├─nvme0n1p2   259:3    0    20G  0 part
│ └─md126       9:126  0    20G  0 raid1
├─nvme0n1p3   259:4    0     1G  0 part
├─nvme0n1p4   259:5    0 397.7G  0 part
│ └─md127       9:127  0 397.6G  0 raid1
│   └─vg-data 251:0    0 397.6G  0 lvm
└─nvme0n1p5   259:6    0     2M  0 part

root@rescue-customer-ca (ns000000.ip-000-000-000.net) ~ # sfdisk -d /dev/nvme0n1 | sfdisk /dev/nvme1n1
Checking that no-one is using this disk right now ... OK

Disk /dev/nvme1n1: 419.2 GiB, 450098159616 bytes, 879097968 sectors
Disk model: INTEL SSDPE2MX450G7
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Created a new GPT disklabel (GUID: CDE841D3-6953-474E-8837-37325C7D9269).
/dev/nvme1n1p1: Created a new partition 1 of type 'EFI System' and of size 511 MiB.
/dev/nvme1n1p2: Created a new partition 2 of type 'Linux RAID' and of size 20 GiB.
/dev/nvme1n1p3: Created a new partition 3 of type 'Linux filesystem' and of size 1 GiB.
/dev/nvme1n1p4: Created a new partition 4 of type 'Linux RAID' and of size 397.7 GiB.
/dev/nvme1n1p5: Created a new partition 5 of type 'Linux filesystem' and of size 2 MiB.
/dev/nvme1n1p6: Done.

New situation:
Disklabel type: gpt
Disk identifier: CDE841D3-6953-474E-8837-37325C7D9269

Device             Start       End   Sectors   Size Type
/dev/nvme1n1p1      2048   1048575   1046528   511M EFI System
/dev/nvme1n1p2   1048576  42991615  41943040    20G Linux RAID
/dev/nvme1n1p3  42991616  45088767   2097152     1G Linux filesystem
/dev/nvme1n1p4  45088768 879093871 834005104 397.7G Linux RAID
/dev/nvme1n1p5 879093872 879097934      4063     2M Linux filesystem

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.

root@rescue-customer-ca (ns000000.ip-000-000-000.net) ~ # lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda             8:0    0   3.7T  0 disk
└─sda1          8:1    0   3.7T  0 part
sdb             8:16   0   3.7T  0 disk
└─sdb1          8:17   0   3.7T  0 part
nvme1n1       259:0    0 419.2G  0 disk
├─nvme1n1p1   259:7    0   511M  0 part
├─nvme1n1p2   259:8    0    20G  0 part
├─nvme1n1p3   259:9    0     1G  0 part
├─nvme1n1p4   259:10   0 397.7G  0 part
└─nvme1n1p5   259:11   0     2M  0 part
nvme0n1       259:1    0 419.2G  0 disk
├─nvme0n1p1   259:2    0   511M  0 part
├─nvme0n1p2   259:3    0    20G  0 part
│ └─md126       9:126  0    20G  0 raid1
├─nvme0n1p3   259:4    0     1G  0 part
├─nvme0n1p4   259:5    0 397.7G  0 part
│ └─md127       9:127  0 397.6G  0 raid1
│   └─vg-data 251:0    0 397.6G  0 lvm
└─nvme0n1p5   259:6    0     2M  0 part

With the partitions good to go, they could then be re-added back into the software RAID1 arrays using mdadm:

root@rescue-customer-ca (ns000000.ip-000-000-000.net) ~ # mdadm --manage /dev/md126 --add /dev/nvme1n1p2
mdadm: added /dev/nvme1n1p2
root@rescue-customer-ca (ns000000.ip-000-000-000.net) ~ # mdadm --manage /dev/md127 --add /dev/nvme1n1p4
mdadm: added /dev/nvme1n1p4

That turned out to be quick and easy. Then it was just a matter of waiting for the RAID to rebuild. I monitored its progress using /proc/mdstat:

root@rescue-customer-ca (ns000000.ip-000-000-000.net) ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md126 : active raid1 nvme1n1p2[2] nvme0n1p2[0]
      20955136 blocks super 1.2 [2/1] [U_]
      [=========>...........]  recovery = 45.6% (9574272/20955136) finish=0.9min speed=205555K/sec

md127 : active raid1 nvme1n1p4[2] nvme0n1p4[1]
      416871424 blocks super 1.2 [2/1] [_U]
      	resync=DELAYED
      bitmap: 4/4 pages [16KB], 65536KB chunk

unused devices: <none>

As these were solid state disks, they finished pretty quickly with the following /proc/mdstat result:

root@rescue-customer-ca (ns000000.ip-000-000-000.net) ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md126 : active raid1 nvme1n1p2[2] nvme0n1p2[0]
      20955136 blocks super 1.2 [2/2] [UU]

md127 : active raid1 nvme1n1p4[2] nvme0n1p4[1]
      416871424 blocks super 1.2 [2/2] [UU]
      bitmap: 0/4 pages [0KB], 65536KB chunk

unused devices: <none>

Notice now the two entries per array, and the [UU] meaning both disks were active in each array.

I then tried to reboot the server, but it was unresponsive. Connecting to the IPMI console revealed that the boot sequence was dropping into the GRUB rescue console.

There was one additional step needed to reconfigure the GRUB bootloader to work with the new disk. I’m not exactly sure why. OVHcloud did have a guide for repairing the GRUB bootloader, however, the only change in the command was to point it to the correct boot partition:

grub-install /dev/nvme0n1p1

And when I rebooted this time, it booted all the way through, and everything was fine! Mission accomplished!

This recovery process was relatively painless, much thanks to the remaining disk in the RAID array keeping things intact, and all the standardized tooling around managing and recovering RAID arrays. It’s not often that I have a hardware failure on my server so I was a little slow having to figure out and go through the steps without messing anything up further.

Also I realize that the things running on my server aren’t “highly available.” I never really meant them to be nor are they necessary to be running 100% of the time, so it wasn’t the end of the world that my server was down for a day. I suppose though it would be a good exercise to try setting up one or two of my most used websites such that they would still be available if one server were to be brought down.

Related

Leave a Reply

Your email address will not be published. Required fields are marked *


The reCAPTCHA verification period has expired. Please reload the page.