Saturday, March 08, 2008

Hard drive crash

I just lost a drive on a CentOS / OpenVZ server. The system had three 750GB SATA drives: two mirrored and one spare. The sdb drive failed and brought the system down with it. I thought a RAID1 system was supposed to keep running even if a drive failed. Evidently not, or at least not with Linux software RAID.

Anyway, the spare drive was partitioned as a place to store system backups, but the system was so new that it hadn't been used much. So I just moved the data off it and then used the spare to replace the bad sdb drive. The system was configured with LVM on top of software RAID for sda and sdb, and plain LVM for sdc (the spare).

I used fdisk to verify the drive partitions. I did a lot of testing and checking to make sure that I wasn't going to blow away the only good data I had.

fdisk -l /dev/sda
fdisk -l /dev/sdb
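
Kernel names like sdb are handed out at boot and can shift when a drive drops out, so it can help to cross-check them against the model and serial number. This is only a sketch, not something from my notes: it assumes the system populates /dev/disk/by-id (udev does this on most modern distributions), and list_disks_by_id is a made-up helper name.

```shell
# Hypothetical helper: print each persistent disk ID next to the
# kernel device name it currently points to.
# $1: directory of symlinks (assumed default: /dev/disk/by-id)
list_disks_by_id() {
    dir="${1:-/dev/disk/by-id}"
    for link in "$dir"/*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$link" ] || continue
        # The link target looks like ../../sdb; keep only the last component.
        printf '%s -> %s\n' "$(basename "$link")" "$(basename "$(readlink "$link")")"
    done
}
```

Running list_disks_by_id with no arguments prints one line per ID, and the serial number in each ID can be compared with the label on the physical drive before anything gets wiped.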


Once I was absolutely sure that sda had the good data, sdb was bad, and sdc was the spare, I first removed the LVM settings from the spare drive. Note that the system was reporting sdc as sdb because it didn't recognize the bad drive anymore. Unfortunately, I didn't keep great notes, but I think I did this:

lvremove /dev/vg1/lvbackup
vgremove /dev/vg1
pvremove /dev/sdb1


With those settings out of LVM, I repartitioned the spare drive (now reporting as sdb) to match the partition table of sda exactly. This command dumps the partition table of sda and writes it over the partition table of sdb.

sfdisk -d /dev/sda | sfdisk /dev/sdb


I tried to remove the old sdb drive from the RAID array, but the --fail and --remove commands reported errors:

mdadm --manage /dev/md1 --fail /dev/sdb1
mdadm --manage /dev/md1 --remove /dev/sdb1


So I just added the new drive partitions to the RAID array:

mdadm --manage /dev/md0 --add /dev/sdb1
mdadm --manage /dev/md1 --add /dev/sdb2


Checking /proc/mdstat showed the partitions syncing. I rebooted just to make sure it all came back up properly, which it did.
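
For checking the arrays repeatedly, something like the following might save some squinting. It is a rough sketch that assumes the usual /proc/mdstat layout, where each array's status line ends in a marker such as [UU] (all members up) or [U_] (one member missing or still rebuilding). degraded_arrays is a name I made up for illustration.

```shell
# Hypothetical helper: print the name of every md array whose status
# marker contains an underscore, i.e. at least one member is not up.
# $1: mdstat-format file to read (assumed default: /proc/mdstat)
degraded_arrays() {
    awk '
        # Remember the array name from lines like "md1 : active raid1 ..."
        /^md[0-9]+ :/ { name = $1 }
        # A marker such as [U_] or [_U] means a member is missing or syncing.
        /\[[U_]*_[U_]*\]/ { print name }
    ' "${1:-/proc/mdstat}"
}
```

With no arguments it reads /proc/mdstat directly. It prints nothing once every array is back to all U's, so an empty result after the sync finishes is the good outcome.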

Then, about an hour later, the load on the server skyrocketed into the 30-50 range. This server has dual quad-core CPUs (8 cores total), 16GB of RAM, and three 750GB Seagate Barracuda ES.2 SATA drives. I'm not sure what caused it, but I was worried it might be the RAID sync. I then found out that my customer was running myisamchk on their database at the same time, so maybe it was just that. The load came back down to around 1.5 and the sync is still running. I guess it is all good now...
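
If the sync ever does turn out to be the culprit, the kernel exposes resync speed limits under /proc/sys/dev/raid. I didn't need to touch them this time, so treat this as an untested sketch: raid_speed_limits is a made-up helper that just prints the current values, which are in KB/s per device.

```shell
# Hypothetical helper: show the md resync speed limits.
# $1: directory holding the limit files (assumed default: /proc/sys/dev/raid)
raid_speed_limits() {
    dir="${1:-/proc/sys/dev/raid}"
    for f in speed_limit_min speed_limit_max; do
        printf '%s=%s\n' "$f" "$(cat "$dir/$f")"
    done
}
```

Writing a smaller number into /proc/sys/dev/raid/speed_limit_max as root caps how fast the rebuild is allowed to run. The trade-off is that the array stays degraded for longer.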

I found this resource a helpful guide while solving this problem:
http://www.howtoforge.com/replacing_hard_disks_in_a_raid1_array