I recently lost power to a NAS system (UPS failure), and it just happened to go down right when it was writing to one of its drives. So when I powered the system back up, it had marked one of the drives as failed and my notification system sent me an alert. What do you do in this scenario? The drive is apparently good, but mdadm thinks it is not.

You have to get down and dirty with mdadm and some CLI commands to manually remove the drive, clear its superblock, and re-add it to the arrays.

Here is a snippet of how the failed drive looked when I checked /proc/mdstat:
md0 : active raid1 sdh1[7] sdc1[8] sdf1[5] sda1[0] sdd1[3] sdg1[6] sde1[4]
      102388 blocks super 1.0 [8/7] [U_UUUUUU]

md1 : active raid5 sda2[0] sdc2[9] sdg2[6] sdh2[8] sdd2[3] sdf2[5] sde2[4]
      114677248 blocks super 1.1 level 5, 512k chunk, algorithm 2 [8/7] [U_UUUUUU]

The underscore (_) indicates a missing member, and since sdb is the only device absent from the member lists (each array shows [8/7], seven of eight members present), it is sdb that has failed. The first thing I want to do is make a backup of the RAID status:
mdadm --examine /dev/sd[a-z] > result.txt
mdadm --examine /dev/sd[a-z]1 >> result.txt
mdadm --examine /dev/sd[a-z]2 >> result.txt
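
As an extra sanity check before touching anything, mdadm --detail will also show which member each array considers failed or missing; you can append that to the same file if you like:
mdadm --detail /dev/md0 >> result.txt
mdadm --detail /dev/md1 >> result.txt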

Next, I will remove the failed drive's partitions from each array with mdadm:
mdadm --manage /dev/md0 -r /dev/sdb1
mdadm --manage /dev/md1 -r /dev/sdb2

Then I try to add the drive straight back (mdadm --manage /dev/md0 -a /dev/sdb1), but mdadm does not consider it clean and gives me this error:
mdadm: /dev/sdb1 reports being an active member for /dev/md0, but a --re-add fails.
mdadm: not performing --add as that would convert /dev/sdb1 in to a spare.
mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sdb1" first.

So let's erase the superblock on the drive's partitions as it suggests. This wipes the stale md metadata, so the drive will rejoin each array as a fresh member and go through a full rebuild:
mdadm --zero-superblock /dev/sdb1
mdadm --zero-superblock /dev/sdb2
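
If you want to double-check before re-adding, running mdadm --examine against the partitions again should now report that no md superblock is found:
mdadm --examine /dev/sdb1
mdadm --examine /dev/sdb2

Now add the partitions back to their arrays: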

mdadm --manage /dev/md0 -a /dev/sdb1
mdadm --manage /dev/md1 -a /dev/sdb2

And it works! Here is what it looks like when rebuilding (cat /proc/mdstat):
md1 : active raid5 sdb2[10] sda2[0] sdc2[9] sdg2[6] sdh2[8] sdd2[3] sdf2[5] sde2[4]
      114677248 blocks super 1.1 level 5, 512k chunk, algorithm 2 [8/7] [U_UUUUUU]
      [======>..............]  recovery = 34.5% (5659600/16382464) finish=3.6min speed=48646K/sec
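
If you want to keep an eye on the rebuild without re-running cat by hand, something like this works fine (the 5-second interval is just a preference):
watch -n 5 cat /proc/mdstat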

Notes

I was able to recover the drive even though it was marked as failed, but of course you may not be able to (it might actually have failed). If you have to replace the drive, it helps to remove the failed partitions from the arrays first, using the mdadm -r commands shown above, before powering the system off.
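
For reference, a rough sketch of the replacement flow, assuming the new disk comes up as /dev/sdb again and the disks use MBR-style partition tables (for GPT disks, sgdisk -R is the usual way to clone the layout):
# after removing the failed partitions, powering off and swapping in the new disk:
sfdisk -d /dev/sda | sfdisk /dev/sdb
mdadm --manage /dev/md0 -a /dev/sdb1
mdadm --manage /dev/md1 -a /dev/sdb2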

Also note that the default rebuild speed limits are quite conservative; you can make the rebuild go faster by raising the values in /proc/sys/dev/raid/speed_limit_min and /proc/sys/dev/raid/speed_limit_max.
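
For example (the numbers here are just illustrative; the values are roughly KiB/s per device, so pick whatever your disks can handle):
echo 50000 > /proc/sys/dev/raid/speed_limit_min
echo 200000 > /proc/sys/dev/raid/speed_limit_max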
