Wednesday, April 30, 2008

Removing messages from postfix mail queue

Mail is backing up in the mail queue on my postfix server. I've found that many of the messages are ones I can remove, but I don't want to manually delete them one at a time. I found a posting on howtoforge that helped.

However, the command didn't work as posted; I think it left the -n out of the tail command. So I used this command instead, and it worked:

mailq | tail -n +2 | awk 'BEGIN { RS = "" } / falko@example\.com$/ { print $1 }' | tr -d '*!' | postsuper -d -
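To see what the pipeline is doing (and to preview which queue IDs would be deleted), here is the same awk/tr stage run against a small canned sample of mailq output — the queue IDs, addresses, and sizes below are made up for illustration. Dropping the final postsuper -d - stage prints the IDs instead of deleting anything:

```shell
# Canned mailq-style output (hypothetical queue IDs and addresses).
# A trailing '*' on the queue ID means the message is in the active
# queue; '!' means it is on hold.
sample='-Queue ID- --Size-- ----Arrival Time---- -Sender/Recipient-------
A1B2C3D4E5*     4096 Wed Apr 30 10:00:00  sender@example.org
                                          falko@example.com

F6A7B8C9D0      2048 Wed Apr 30 10:05:00  other@example.org
                                          someone@else.net

-- 6 Kbytes in 2 Requests.'

# Same pipeline as above, minus the destructive postsuper stage:
# tail -n +2 drops the header line, RS="" makes awk treat each
# blank-line-separated block as one record, and tr strips the
# active/hold status flag off the queue ID.
ids=$(printf '%s\n' "$sample" \
  | tail -n +2 \
  | awk 'BEGIN { RS = "" } /falko@example\.com$/ { print $1 }' \
  | tr -d '*!')
echo "$ids"
```

Only the first message ends with the matching recipient, so only its queue ID comes out. Once the list looks right, piping it into postsuper -d - deletes those messages.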

Unfortunately, this didn't work right away. The mailq command was reporting nothing in the mail queue. I discovered that /usr/bin/mailq was pointing at the sendmail version of mailq. I fixed this by updating the mta-mailq symlink:

cd /etc/alternatives
rm mta-mailq
ln -s /usr/bin/mailq.postfix mta-mailq

Now running mailq gives the same results that postqueue -p does. I'm not sure whether the output of mailq is always identical to postqueue -p (and I didn't take the time to look), but it works this way.

I found the following command useful to see how many messages are currently in the queue:

postqueue -p | tail -1
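That works because the last line of the queue listing is a summary ("-- N Kbytes in M Requests."). An alternative sketch is to count the queue-ID lines directly; this assumes the default short queue IDs, which are uppercase hex, so each message's first line starts with a hex digit or letter. Again the sample data below is made up:

```shell
# Canned mailq-style output (hypothetical data), standing in for a
# real queue listing:
sample='-Queue ID- --Size-- ----Arrival Time---- -Sender/Recipient-------
A1B2C3D4E5*     4096 Wed Apr 30 10:00:00  sender@example.org
                                          falko@example.com

F6A7B8C9D0      2048 Wed Apr 30 10:05:00  other@example.org
                                          someone@else.net

-- 6 Kbytes in 2 Requests.'

# Each message starts with an uppercase-hex queue ID at column one,
# so counting those lines counts the messages. Against a live server
# this would just be: mailq | grep -c '^[0-9A-F]'
count=$(printf '%s\n' "$sample" | grep -c '^[0-9A-F]')
echo "$count messages in the queue"
```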

Lastly, for some reason I couldn't get the above command to remove mail from MAILER-DAEMON. Again, I didn't look into the reason too much, but I found another solution in a howtoforge forum posting. The following deletes all mail in the queue that is from MAILER-DAEMON:

mailq | tail -n +2 | awk 'BEGIN { RS = "" } { if ($7 == "MAILER-DAEMON" ) print $1 }' | tr -d '*!' | postsuper -d -
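The $7 in that awk script is the sender field: with RS="" each blank-line-separated block is one record, and its whitespace-separated fields run queue ID (1), size (2), arrival time (3-6), sender (7). A quick check against a made-up bounce record shows the field landing where expected:

```shell
# One hypothetical mailq record for a bounce sitting on hold ('!'):
sample='-Queue ID- --Size-- ----Arrival Time---- -Sender/Recipient-------
0B1C2D3E4F!     1024 Wed Apr 30 11:00:00  MAILER-DAEMON
                                          nobody@example.org

-- 1 Kbytes in 1 Requests.'

# With RS="" the whole block is one awk record; field 7 is the sender
# (1=queue ID, 2=size, 3-6=arrival date/time, 7=sender address).
id=$(printf '%s\n' "$sample" \
  | tail -n +2 \
  | awk 'BEGIN { RS = "" } { if ($7 == "MAILER-DAEMON") print $1 }' \
  | tr -d '*!')
echo "$id"
```

The summary record at the bottom has only six fields, so it never matches.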

Also, I found out that I should NOT have issued the following command to flush all queued mail. It moves all the deferred mail back to the active queue and tries to resend everything. Most of those messages are deferred because delivery failed for some reason, so retrying them all takes a long time. It took well over a day to get through the entire queue.

postqueue -f (DON'T RUN THIS without knowing what it does)

Saturday, March 08, 2008

Hard drive crash

I just lost a drive on a CentOS / OpenVZ server. The system had three 750GB SATA drives: two mirrored and one spare. The sdb drive failed and brought the system down when it did. I thought a RAID1 system was supposed to keep running even if a drive failed. Evidently not, or at least not with Linux software RAID.

Anyway, the spare drive was partitioned to be used as a place to store system backups. But the system was so new that it hadn't been used much. So I just moved the data from it and then used the spare to replace the bad sdb drive. The system was configured with LVM on top of software raid for sda and sdb, and just LVM for sdc (the spare).

I used fdisk to verify the drive partitions. I did a lot of testing and checking to make sure that I wasn't going to blow away the only good data I had.

fdisk -l /dev/sda
fdisk -l /dev/sdb

Once I was absolutely sure that sda had the good data, sdb was bad, and sdc was the spare, I first removed the LVM settings from the spare drive. Note that the system was reporting sdc as sdb because it didn't recognize the bad drive anymore. Unfortunately, I didn't keep great notes, but I think I did this:

lvremove /dev/vg1/lvbackup
vgremove /dev/vg1
pvremove /dev/sdb1

With those settings out of LVM, I repartitioned the spare drive (now reporting as sdb) to match the partition layout of sda exactly. This command dumps the partition table of sda and overwrites the partition table of sdb with the same partitioning:

sfdisk -d /dev/sda | sfdisk /dev/sdb
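As a sanity check after the copy, the sfdisk -d dumps of the two drives should be identical apart from the device names. The dump below is hypothetical (though Id=fd is the type Linux software RAID autodetect partitions use), and the sdb dump is simulated rather than read from a real disk:

```shell
# Hypothetical `sfdisk -d /dev/sda` output, saved for comparison:
cat > /tmp/sda.parts <<'EOF'
# partition table of /dev/sda
unit: sectors

/dev/sda1 : start=       63, size=   208782, Id=fd, bootable
/dev/sda2 : start=   208845, size=1465144067, Id=fd
EOF

# Stand-in for `sfdisk -d /dev/sdb > /tmp/sdb.parts` after the copy:
sed 's/sda/sdb/g' /tmp/sda.parts > /tmp/sdb.parts

# Mask the device names, then the layouts should diff clean:
sed 's/sda/sdX/g' /tmp/sda.parts > /tmp/sda.masked
sed 's/sdb/sdX/g' /tmp/sdb.parts > /tmp/sdb.masked
match=$(diff /tmp/sda.masked /tmp/sdb.masked >/dev/null && echo yes || echo no)
echo "layouts match: $match"
```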

I tried to remove the old sdb drive from the RAID array, but both the --fail and --remove operations reported errors:

mdadm --manage /dev/md1 --fail /dev/sdb1
mdadm --manage /dev/md1 --remove /dev/sdb1

So I just added the new drive's partitions to the RAID arrays:

mdadm --manage /dev/md0 --add /dev/sdb1
mdadm --manage /dev/md1 --add /dev/sdb2

Checking /proc/mdstat showed the partitions syncing. I rebooted just to make sure it all came back up properly, which it did.
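The sync progress can be watched with something like watch cat /proc/mdstat. The recovery line looks roughly like the sample below (the percentages, block counts, and speeds are made up), and the progress figure can be pulled out with grep:

```shell
# Hypothetical /proc/mdstat snippet while md1 rebuilds onto the new sdb2:
mdstat='md1 : active raid1 sdb2[2] sda2[0]
      732313600 blocks [2/1] [U_]
      [==>..................]  recovery = 12.6% (92271616/732313600) finish=123.4min speed=86400K/sec'

# Pull out just the recovery percentage; on a live box this would be:
#   grep -o "recovery = [0-9.]*%" /proc/mdstat
pct=$(printf '%s\n' "$mdstat" | grep -o 'recovery = [0-9.]*%')
echo "$pct"
```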

Then, about an hour later, the load on the server skyrocketed to the 30-50 range. This server has dual 4-core CPUs (8 cores total) with 16GB RAM and three 750GB Seagate Barracuda ES.2 SATA drives. I'm not sure what caused it, but I was worried it might be the RAID sync. Then I found out that my customer was running myisamchk on their database at the same time, so maybe it was just that. The load came back down to around 1.5 and the sync is still running. I guess it is all good now...

I found this resource a helpful guide while solving this problem: