Yesterday morning, I woke up to some Discord messages saying that one of my websites was offline. Not a great way to start the morning! I first tried to ascertain the scope of the issue: it turned out that none of my services were accessible from that dedicated server at all. Panic started to set in a bit… did I forget to pay my monthly server bill? Did the server get hacked? Or did the server burn down?
Interestingly, SSH worked. Phew, a small sigh of relief: things were still working to some extent. I run Proxmox VE on my server, and its web control panel was only partially functioning, so I attempted to restart the pveproxy service to get it fully working:
root@garibaldi:/home/dennis# systemctl restart pveproxy
Failed to restart pveproxy.service: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
See system logs and 'systemctl status pveproxy.service' for details.
Hmm, odd… at least it gave some suggestions at the bottom. I started with the service status:
root@garibaldi:/home/dennis# systemctl status pveproxy.service
Failed to get properties: Connection timed out
Not too helpful either. Following the other suggestion and checking the system logs, I found the following:
May 16 04:12:46 garibaldi kernel: [44452858.296298] nvme nvme1: I/O 819 QID 1 timeout, aborting
May 16 04:13:16 garibaldi kernel: [44452888.500562] nvme nvme1: I/O 819 QID 1 timeout, reset controller
May 16 04:13:16 garibaldi kernel: [44452888.548755] nvme nvme1: Abort status: 0x371
May 16 04:13:19 garibaldi kernel: [44452891.245965] nvme nvme1: 8/0/0 default/read/poll queues
May 16 04:13:51 garibaldi kernel: [44452923.060884] nvme nvme1: I/O 576 QID 3 timeout, disable controller
May 16 04:13:51 garibaldi kernel: [44452923.877064] blk_update_request: I/O error, dev nvme1n1, sector 45088784 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
May 16 04:13:51 garibaldi kernel: [44452923.877065] blk_update_request: I/O error, dev nvme1n1, sector 312645672 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
May 16 04:13:51 garibaldi kernel: [44452923.877070] md: super_written gets error=-5
May 16 04:13:51 garibaldi kernel: [44452923.903587] md/raid1:md4: nvme1n1p4: rescheduling sector 267294760
May 16 04:13:51 garibaldi kernel: [44452923.903591] blk_update_request: I/O error, dev nvme1n1, sector 329846032 op 0x0:(READ) flags 0x0 phys_seg 4 prio class 0
May 16 04:13:51 garibaldi kernel: [44452923.929761] md/raid1:md4: Disk failure on nvme1n1p4, disabling device.
May 16 04:13:51 garibaldi kernel: [44452923.929761] md/raid1:md4: Operation continuing on 1 devices.
May 16 04:13:52 garibaldi kernel: [44452924.197711] nvme nvme1: failed to mark controller live state
May 16 04:13:52 garibaldi kernel: [44452924.197740] Read-error on swap-device (259:6:43028320)
May 16 04:13:52 garibaldi kernel: [44452924.197955] md/raid1:md4: redirecting sector 267294760 to other mirror: nvme0n1p4
May 16 04:13:52 garibaldi kernel: [44452924.217137] nvme nvme1: Removing after probe failure status: -19
May 16 04:13:52 garibaldi kernel: [44452924.560909] nvme1n1: detected capacity change from 879097968 to 0
May 16 04:13:52 garibaldi kernel: [44452924.598023] md: super_written gets error=-5
May 16 04:13:52 garibaldi kernel: [44452924.613193] md/raid1:md2: Disk failure on nvme1n1p2, disabling device.
May 16 04:13:52 garibaldi kernel: [44452924.613193] md/raid1:md2: Operation continuing on 1 devices.
May 16 04:13:52 garibaldi kernel: [44452924.739850] FAT-fs (nvme1n1p1): unable to read boot sector to mark fs as dirty
May 16 04:22:29 garibaldi kernel: [44453441.429045] fwbr112i0: port 2(tap112i0) entered disabled state
May 16 04:22:29 garibaldi kernel: [44453441.430212] Read-error on swap-device (259:6:44450424)
May 16 04:42:12 garibaldi kernel: [44454624.593680] vmbr0: port 2(tap101i0) entered disabled state
May 16 04:42:12 garibaldi kernel: [44454624.610177] vmbr0: port 2(tap101i0) entered disabled state
May 16 04:42:12 garibaldi kernel: [44454624.684771] vmbr1: port 1(tap101i1) entered disabled state
May 16 04:42:12 garibaldi kernel: [44454624.700848] vmbr1: port 1(tap101i1) entered disabled state
May 16 04:53:17 garibaldi kernel: [44455289.898789] fwbr111i0: port 2(tap111i0) entered disabled state
May 16 04:53:17 garibaldi kernel: [44455289.918070] fwbr111i0: port 2(tap111i0) entered disabled state
May 16 04:53:17 garibaldi kernel: [44455290.002365] fwbr111i1: port 2(tap111i1) entered disabled state
May 16 04:53:17 garibaldi kernel: [44455290.021230] fwbr111i1: port 2(tap111i1) entered disabled state
May 16 06:49:41 garibaldi kernel: [44462273.966854] vmbr0: port 3(tap102i0) entered disabled state
May 16 06:49:41 garibaldi kernel: [44462273.979660] vmbr0: port 3(tap102i0) entered disabled state
May 16 06:49:41 garibaldi kernel: [44462274.056903] vmbr1: port 2(tap102i1) entered disabled state
May 16 06:49:41 garibaldi kernel: [44462274.068993] vmbr1: port 2(tap102i1) entered disabled state
Disk failure! Not great! For some reason, the network also went down a little while afterwards. I never exactly figured out why, but I suspected it was more of a consequence than the cause, so I didn’t focus on it first.
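As an aside, digging these messages back out later is easy enough; a quick sketch of how I’d filter for them, assuming a systemd-based system like this one:

# Kernel messages only, pulled from the systemd journal
journalctl -k --since today | grep -i nvme

# Or straight from the kernel ring buffer, with readable timestamps
dmesg -T | grep -iE 'nvme|raid1'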
So what’s going on with the disks? I used lsblk to list the disks:
root@garibaldi:/home/dennis# lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda             8:0    0   3.6T  0 disk
└─sda1          8:1    0   3.6T  0 part  /mnt/storagesda
sdb             8:16   0   3.6T  0 disk
└─sdb1          8:17   0   3.6T  0 part  /mnt/storagesdb
nvme0n1       259:0    0 419.2G  0 disk
├─nvme0n1p1   259:1    0   511M  0 part
├─nvme0n1p2   259:2    0    20G  0 part
│ └─md2         9:2    0    20G  0 raid1 /
├─nvme0n1p3   259:3    0     1G  0 part  [SWAP]
├─nvme0n1p4   259:4    0 397.7G  0 part
│ └─md4         9:4    0 397.6G  0 raid1
│   └─vg-data 253:0    0 397.6G  0 lvm   /var/lib/vz
└─nvme0n1p5   259:5    0     2M  0 part
I should have two NVMe disks, but only one is still alive; nvme1 is completely gone!
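If nvme-cli happens to be installed, the missing controller can be confirmed directly as well; just a sketch, since lsblk already tells the same story:

# List the NVMe devices the kernel can still see; nvme1 should be absent
nvme list
ls -l /dev/nvme*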
The two NVMe disks were in software RAID1 arrays, so I looked at /proc/mdstat to check the status of those:
root@garibaldi:/home/dennis# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md4 : active raid1 nvme0n1p4[1]
      416871424 blocks super 1.2 [2/1] [_U]
      bitmap: 4/4 pages [16KB], 65536KB chunk

md2 : active raid1 nvme0n1p2[0]
      20955136 blocks super 1.2 [2/1] [U_]

unused devices: <none>
No surprise that only one disk is reporting in each of the RAID arrays. In particular, the underscores in [_U] and [U_] indicate that there’s a disk down in each array. To get a bit more detail on the RAID status, I used mdadm:
root@garibaldi:/home/dennis# /sbin/mdadm -D /dev/md2
/dev/md2:
           Version : 1.2
     Creation Time : Fri Sep 3 10:46:30 2021
        Raid Level : raid1
        Array Size : 20955136 (19.98 GiB 21.46 GB)
     Used Dev Size : 20955136 (19.98 GiB 21.46 GB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent

       Update Time : Tue May 16 08:08:07 2023
             State : clean, degraded
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              Name : md2
              UUID : 075e9fdd:281f430d:edbc2afc:ff4db6bc
            Events : 5318

    Number   Major   Minor   RaidDevice State
       0     259        2        0      active sync   /dev/nvme0n1p2
       -       0        0        1      removed

root@garibaldi:/home/dennis# /sbin/mdadm -D /dev/md4
/dev/md4:
           Version : 1.2
     Creation Time : Fri Sep 3 10:46:31 2021
        Raid Level : raid1
        Array Size : 416871424 (397.56 GiB 426.88 GB)
     Used Dev Size : 416871424 (397.56 GiB 426.88 GB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Tue May 16 06:50:00 2023
             State : clean, degraded
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : md4
              UUID : 82738eb2:c636d623:25662d34:b53efdf9
            Events : 15756

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1     259        4        1      active sync   /dev/nvme0n1p4
I’m not really a sysadmin expert, but to me it looked like the disk had removed itself from operation somewhat gracefully. Just to recap, RAID1 is a mirroring setup where both disks store the same data: losing one disk is tolerable, and only losing both at the same time would be catastrophic to the array. This meant I had to be absolutely sure of what I was doing so as not to cause any more issues with the one good disk.
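It also wouldn’t hurt to sanity-check the health of the surviving disk before doing anything else. Something along these lines should work (a sketch, assuming smartmontools and nvme-cli are available):

# SMART/health data for the remaining NVMe disk
smartctl -a /dev/nvme0

# Or the NVMe-native health log (media errors, spare capacity, temperature, ...)
nvme smart-log /dev/nvme0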
At this point, the only thing I could do was ask my host, OVHcloud, to replace the failed disk. After double-checking that my backups were recent, I submitted the ticket. OVHcloud handled it within a few hours; after some back and forth to confirm which disk had failed and the risk of data loss, they replaced the disk and booted the system into the rescue OS.
First I used fdisk to check out the new disk:
root@rescue-customer-ca (ns000000.ip-000-000-000.net) ~ # fdisk -l
Disk /dev/nvme1n1: 419.2 GiB, 450098159616 bytes, 879097968 sectors
Disk model: INTEL SSDPE2MX450G7
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk /dev/nvme0n1: 419.2 GiB, 450098159616 bytes, 879097968 sectors
Disk model: INTEL SSDPE2MX450G7
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: CDE841D3-6953-474E-8837-37325C7D9269

Device             Start       End   Sectors   Size Type
/dev/nvme0n1p1      2048   1048575   1046528   511M EFI System
/dev/nvme0n1p2   1048576  42991615  41943040    20G Linux RAID
/dev/nvme0n1p3  42991616  45088767   2097152     1G Linux filesystem
/dev/nvme0n1p4  45088768 879093871 834005104 397.7G Linux RAID
/dev/nvme0n1p5 879093872 879097934      4063     2M Linux filesystem
Great, /dev/nvme1n1 is back and awaiting action! First, I needed to copy the partition table from the existing disk to the new disk so that it would have the same partitions to put back into the RAID1 arrays. From a bit of searching, sfdisk looked like the best way to do this. I ran lsblk before (to double-check I had the correct disks in my command) and after (to confirm the result).
root@rescue-customer-ca (ns000000.ip-000-000-000.net) ~ # lsblk
NAME            MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda               8:0    0   3.7T  0 disk
└─sda1            8:1    0   3.7T  0 part
sdb               8:16   0   3.7T  0 disk
└─sdb1            8:17   0   3.7T  0 part
nvme1n1         259:0    0 419.2G  0 disk
nvme0n1         259:1    0 419.2G  0 disk
├─nvme0n1p1     259:2    0   511M  0 part
├─nvme0n1p2     259:3    0    20G  0 part
│ └─md126         9:126  0    20G  0 raid1
├─nvme0n1p3     259:4    0     1G  0 part
├─nvme0n1p4     259:5    0 397.7G  0 part
│ └─md127         9:127  0 397.6G  0 raid1
│   └─vg-data   251:0    0 397.6G  0 lvm
└─nvme0n1p5     259:6    0     2M  0 part

root@rescue-customer-ca (ns000000.ip-000-000-000.net) ~ # sfdisk -d /dev/nvme0n1 | sfdisk /dev/nvme1n1
Checking that no-one is using this disk right now ... OK

Disk /dev/nvme1n1: 419.2 GiB, 450098159616 bytes, 879097968 sectors
Disk model: INTEL SSDPE2MX450G7
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Created a new GPT disklabel (GUID: CDE841D3-6953-474E-8837-37325C7D9269).
/dev/nvme1n1p1: Created a new partition 1 of type 'EFI System' and of size 511 MiB.
/dev/nvme1n1p2: Created a new partition 2 of type 'Linux RAID' and of size 20 GiB.
/dev/nvme1n1p3: Created a new partition 3 of type 'Linux filesystem' and of size 1 GiB.
/dev/nvme1n1p4: Created a new partition 4 of type 'Linux RAID' and of size 397.7 GiB.
/dev/nvme1n1p5: Created a new partition 5 of type 'Linux filesystem' and of size 2 MiB.
/dev/nvme1n1p6: Done.

New situation:
Disklabel type: gpt
Disk identifier: CDE841D3-6953-474E-8837-37325C7D9269

Device             Start       End   Sectors   Size Type
/dev/nvme1n1p1      2048   1048575   1046528   511M EFI System
/dev/nvme1n1p2   1048576  42991615  41943040    20G Linux RAID
/dev/nvme1n1p3  42991616  45088767   2097152     1G Linux filesystem
/dev/nvme1n1p4  45088768 879093871 834005104 397.7G Linux RAID
/dev/nvme1n1p5 879093872 879097934      4063     2M Linux filesystem

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.

root@rescue-customer-ca (ns000000.ip-000-000-000.net) ~ # lsblk
NAME            MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda               8:0    0   3.7T  0 disk
└─sda1            8:1    0   3.7T  0 part
sdb               8:16   0   3.7T  0 disk
└─sdb1            8:17   0   3.7T  0 part
nvme1n1         259:0    0 419.2G  0 disk
├─nvme1n1p1     259:7    0   511M  0 part
├─nvme1n1p2     259:8    0    20G  0 part
├─nvme1n1p3     259:9    0     1G  0 part
├─nvme1n1p4     259:10   0 397.7G  0 part
└─nvme1n1p5     259:11   0     2M  0 part
nvme0n1         259:1    0 419.2G  0 disk
├─nvme0n1p1     259:2    0   511M  0 part
├─nvme0n1p2     259:3    0    20G  0 part
│ └─md126         9:126  0    20G  0 raid1
├─nvme0n1p3     259:4    0     1G  0 part
├─nvme0n1p4     259:5    0 397.7G  0 part
│ └─md127         9:127  0 397.6G  0 raid1
│   └─vg-data   251:0    0 397.6G  0 lvm
└─nvme0n1p5     259:6    0     2M  0 part
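As a quick sanity check that the two partition tables really match, the sfdisk dumps can be diffed directly; just a sketch, where the sed only masks the differing device names:

# Dump both partition tables and compare them, ignoring the device names
diff <(sfdisk -d /dev/nvme0n1 | sed 's/nvme0n1/nvmeXn1/g') \
     <(sfdisk -d /dev/nvme1n1 | sed 's/nvme1n1/nvmeXn1/g')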
With the partitions good to go, they could then be added back into the software RAID1 arrays using mdadm:
root@rescue-customer-ca (ns000000.ip-000-000-000.net) ~ # mdadm --manage /dev/md126 --add /dev/nvme1n1p2
mdadm: added /dev/nvme1n1p2
root@rescue-customer-ca (ns000000.ip-000-000-000.net) ~ # mdadm --manage /dev/md127 --add /dev/nvme1n1p4
mdadm: added /dev/nvme1n1p4
That turned out to be quick and easy. Then it was just a matter of waiting for the RAID to rebuild. I monitored its progress using /proc/mdstat:
root@rescue-customer-ca (ns000000.ip-000-000-000.net) ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md126 : active raid1 nvme1n1p2[2] nvme0n1p2[0]
      20955136 blocks super 1.2 [2/1] [U_]
      [=========>...........]  recovery = 45.6% (9574272/20955136) finish=0.9min speed=205555K/sec

md127 : active raid1 nvme1n1p4[2] nvme0n1p4[1]
      416871424 blocks super 1.2 [2/1] [_U]
        resync=DELAYED
      bitmap: 4/4 pages [16KB], 65536KB chunk

unused devices: <none>
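Rather than re-running cat by hand, the rebuild can also be polled continuously; a small convenience:

# Refresh the RAID status every few seconds until recovery hits 100%
watch -n 5 cat /proc/mdstat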
As these were solid-state disks, the rebuild finished pretty quickly, with the following /proc/mdstat result:
root@rescue-customer-ca (ns000000.ip-000-000-000.net) ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md126 : active raid1 nvme1n1p2[2] nvme0n1p2[0]
      20955136 blocks super 1.2 [2/2] [UU]

md127 : active raid1 nvme1n1p4[2] nvme0n1p4[1]
      416871424 blocks super 1.2 [2/2] [UU]
      bitmap: 0/4 pages [0KB], 65536KB chunk

unused devices: <none>
Notice that there are now two devices listed per array, and the [UU] means both disks are active in each array.
I then tried to reboot the server, but it was unresponsive. Connecting to the IPMI console revealed that the boot sequence was dropping into the GRUB rescue console.
There was one additional step needed: reconfiguring the GRUB bootloader to work with the new disk. I’m not exactly sure why. OVHcloud did have a guide for repairing the GRUB bootloader; the only change I needed was to point the command at the correct boot partition:
grub-install /dev/nvme0n1p1
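I won’t reproduce OVHcloud’s guide here, but repairs like this from a rescue OS generally follow a mount-and-chroot pattern. The sketch below is my own rough reconstruction rather than the guide’s exact steps; the device names and mount points are assumptions based on this particular layout:

# Assemble the arrays and mount the installed system from the rescue OS (rough sketch)
mdadm --assemble --scan                # bring up the md arrays if they aren't already running
mount /dev/md126 /mnt                  # the root filesystem (named md2 on the live system)
mount /dev/nvme0n1p1 /mnt/boot/efi     # the boot/EFI partition on the good disk
mount --bind /dev  /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys  /mnt/sys
chroot /mnt
# ...then, inside the chroot:
grub-install /dev/nvme0n1p1            # the command above, pointed at the correct boot partition
update-grub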
And when I rebooted this time, it booted all the way through, and everything was fine! Mission accomplished!
This recovery process was relatively painless, thanks largely to the remaining disk in each RAID array keeping things intact, and to the standardized tooling for managing and recovering RAID arrays. It’s not often that I have a hardware failure on my server, so I was a little slow in figuring out the steps and working through them without messing anything up further.
Also, I realize that the things running on my server aren’t “highly available.” I never really meant for them to be, nor do they need to be running 100% of the time, so it wasn’t the end of the world that my server was down for a day. I suppose, though, that it would be a good exercise to set up one or two of my most-used websites such that they would still be available if one server went down.