#1 Software for Data Storage, Backup & Business Continuity

0 Liked

Why a HOT SPARE Hard Disk Is a Bad Idea

Updated on 03/08/2021

Due, in part, to the different views and opinions regarding the usage of hot spare disks in our previous post, we’ve decided to add an update for clarification.

The Problematic Aspects of Using a Hot Spare Disk

As is said in almost every theory, using a hot spare disk with ZFS, Solaris FMA or in any other data storage environment is a good solution as it will automatically react to damage in a Redundant Array of Independent Disks (RAID) array and a hot spare disk indeed helps to minimize the duration of a degraded array state.

That being said, our goal of creating a RAID is to continue operation and not lose data in the event of a disk failure. Anything that increases the risk of data loss could be a bad idea. Let’s have a look at some of these problematic aspects of hot spare disks.

Hot Spare Disks Add Stress to Vulnerable Systems

The main problem with hot spare disks is that they allow the rebuilding (resilvering) of a system that is still actively being used as a production server. This means that, while the resilvering process is taking place, the system will also still be occupied with the usual production data reads and writes.

Resilvering is a process that needs a lot of server resources so when it’s executed while the server is still in use, it has to compete with the production loads. Since it’s a low-priority task, it can make the entire process of resilvering take very long (even up to a few weeks). This results in the server working at maximum achievable throughput for weeks, which can have dire consequences for the disks (especially HDDs).

Having decades worth of experience, we’ve realized that the use of hot spare disks in complex enterprise systems increases the probability of additional disks failing as the resilvering process starts to put more and more stress on the existing disks and the system itself.

Problems in Overall Hot Spare Disk Design

The next flaw of a hot spare disk is that it degrades over time. From the moment it is connected to the system, it keeps on working. And when, eventually, it’s time for it to be used as a damaged disk’s replacement; the hot spare disk itself could simply not be in a good enough state to actually replace the damaged disk.

Another problematic aspect of hot spare disks is that they are used automatically once the disk failure is detected so the corrupted disk might still be connected to the system. It could still try to reconnect and start working again while the hot spare disk is trying to take over its role thus adding even more stress to the system. This is yet another factor that can affect the system’s overall performance and could potentially lead to data loss.

Hot Spare Disks Create a Single Point of Failure

If you’re looking to create a system with no single point of failure, a hot spare disk will not provide you with much confidence given that the process of automatically replacing a failed disk has been known to occasionally fail, either partially or fully, and result in data loss.

Having spent decades providing customers with data storage solutions, we’ve heard of a lot of examples where a hot spare disk was the reason for the entire server failure and even data loss occurring. Automation here is risky since it can start the domino effect, especially when the data storage infrastructure has been working for years and the hardware is worn out.

Our Solution

These problematic aspects of hot spare disks are why our advice would be to not rely on hot spare disks in complex data storage architectures and to use other business continuity solutions instead like High Availability (HA) clusters, backups and On- & Off-site Data Protection (ideally all of the aforementioned).

Using the ZFS file system, it’s much easier to monitor the system and create a proper backup, with that you have the ability to retrieve data from a damaged disk and write it onto a new one. In addition to that, when using a HA cluster, there is an option of manually switching the production from the affected node to a second one so that you could perform maintenance on the affected node.

We’d advise following this procedure once the array shows that a degraded state has occurred as a result of a disk failure:

Move resources to the second node in your HA cluster if possible.
Run a full data backup.
Verify the backed-up data for consistency, and verify whether the data restore mechanism works.
Identify the problem source, i.e., find the erroneous hard disk. If possible, shut down the server and make sure the serial number of the hard disk matches the one that’s reported by the event viewer or system logs..
Replace the hard disk identified as bad with a new, unused one. If the replacement hard disk had already been used within another RAID array; make sure that any residual RAID metadata on it has been deleted via the original RAID controller.
Start a rebuild of the system.

So, if using this approach, the rebuild would consist of 6 steps! Using a hot spare disk, your RAID will skip the first four significant steps and then automatically run steps 5, and 6. Thus the rebuild will be completed before you can do these other critical steps; steps that could be the difference between your data being safe and lost.

Anyway, it’s still completely up to you as to how to build a proper system. However, we’d suggest not relying on hot spare disks in a ZFS RAID array due to the potential data loss it can cause.

Hard disk Hot spare

49 Comments

Reply
Hard Drives External
September 14, 09 2010 09:31:35
I found your resource via Google on Tuesday while searching for hard drive and your post regarding Why a HOT-SPARE Hard Disk is a bad idea? | Open-E Blog looked very interesting to me. I just wanted to write to say that you have a great site and a wonderful resource for all to share.
VA:D [1.9.22_1171]
Rating: -60 (from 70 votes)
- Reply
  Janusz
  September 17, 09 2010 05:39:06
  We try our best. Thank you!
  VA:D [1.9.22_1171]
  Rating: -42 (from 48 votes)
Reply
José Rocha
October 16, 10 2010 10:27:34
Had never really thought about the possibility of failure during a rebuild. Excellent approach.Thanks for the tip.
VA:F [1.9.22_1171]
Rating: -18 (from 34 votes)
Reply
Joe McDoaks
January 10, 01 2011 03:52:21
Had never really thought of this approach, but if you get a failure during the rebuild, you would also get a failure during the backup as this stresses the disks just as much. Trying to understand but for me I’m not sure how you are further ahead?
VA:D [1.9.22_1171]
Rating: +61 (from 65 votes)
- Reply
  Data Daddy
  February 15, 02 2011 09:13:29
  Well, rebuilding the RAID is going with each byte on the disk, and that require read/write alot on the disk, while backup could only require a read , as the write could be done on another disk, media.
  So that make the backup less stress than the rebuild.
  as the quick format for a HDD is less stress than a Full format,as the full will keep write more than the quick which only will erase the File Allocation Table (TOC table of contents).
  VA:D [1.9.22_1171]
  Rating: -36 (from 50 votes)
  - Reply
    linc
    July 16, 07 2013 01:04:23
    Not all RAID arrays rebuild byte by byte, most rebuild on a block level. Additionally, Hot Spares deliver a fully redundant disk. If the physical disk fails (note: the most common failure of disk is physical, not data) then a hot spare should rebuild successfully. It’s all about playing with the laws of averages, and not having a hot spare is just downright stupid in light of this.
    VA:D [1.9.22_1171]
    Rating: +134 (from 146 votes)
Reply
Matthias
February 13, 02 2011 07:07:49
Well…often I think you’re really posting good things. Here I am strongly disagree!
If somebody does not care about full backups, checking restore mechanisms and so on it doesnt make a difference if there is a spare or not.
And to Murphys law there is a risk that another disk will fail sooner when you have no spare. Thats my long-year experience. I was happy to have spares in place for a lot of times in my life.
Sure, RAID is no backup…and if somebody mixes that up…well…noobs are always there 😉
VA:D [1.9.22_1171]
Rating: +53 (from 57 votes)
Reply
BoonHong
August 24, 08 2011 07:04:33
Another harddisk may failed during backup, making the whole array unusable as well. Ideally we should have RAID 6 that allows 2 harddisks failure. Having 3 harddisks failure within the rebuild windows is highly unlikely.
Moreover, most users can’t afford to have a double set of storage for doing a full backup that already contain a previous full backup.
VA:D [1.9.22_1171]
Rating: +19 (from 23 votes)
Reply
peter
August 24, 08 2011 11:46:45
Hot Spare a bad idea?
Depending on the type and size of the RAID array it can also be a very good idea.
If for example you have a mirrored configuration where the primary array gets critical, you dont need to run a full backup.’
If for example you have a RAID 6 configuration there is still no critical need for a backup when one diskk fails.
If for example you have a RAID 50 array the time to make a full backup can exeed the time permitted to be in a critiacla situation.
So if a hotspare is a good idea depends on the configuration used, the type of arry chosen and the SLA’s
VA:D [1.9.22_1171]
Rating: +29 (from 31 votes)
Reply
Matthias
September 07, 09 2011 05:13:35
At this time we are evaluating a new iSCSI solution for a costumer. The performance of open-e is verry good. When the tests has been finished and our costumer is happy with this solution, the system goes to Africa. In Africa we have nobody, who’s able to make a exchange like this, when a disk has been crashed. But they are at least able to change a hard disk… The raid controllers we build in this systems, are able to make a RAID 5EE. I haven’t tested the performance until now, but I think that the stress for the system is fewer than a totally rebuild of a hotspare disk.
Sorry about my english, it’s not the best….
VA:F [1.9.22_1171]
Rating: +1 (from 3 votes)
Reply
Kai-Uwe
September 07, 09 2011 05:32:42
I strongly disagree. Any IT admin keeping data on ANY kind of disk be it a simple disk or a RAID or a complex SAN storage subsystem should ALWAYS have a complete backup which is AT MOST one work day old!
So if you start with a backup only after a disk has failed, you have definitely not done your job right before.
I personally tend to use RAID-6 nowadays as an additional disk will not cost much and will leave room for an additional failure. Sometimes I use RAID-6 without a hot spare in addition to it but sometimes (if a drive slot and the money for the extra drive do not matter) I even add a hot spare to a RAID-6, too.
Also, modern storage systems use “background scrubbing” to detect bad sectors in advance so that you are not hit by one in the event of a rebuild. In addition, this causes kind of a “healthy” stress on all disks to sort out the flaky ones rather soon …
VA:D [1.9.22_1171]
Rating: +48 (from 50 votes)
Reply
xpresshred
September 19, 09 2011 12:31:55
A hot spare disk is a disk used to be automatically, depending upon the hot spare policy, replace a failing or failed disk in a RAID configuration.
VA:D [1.9.22_1171]
Rating: -6 (from 10 votes)
Reply
IcebergTitanic
February 03, 02 2012 04:18:29
I have to agree with some of the other comments. You shouldn’t need to worry about whether or not Step 1 and 2 are done prior to the array automatically rebuilding. If you’re doing your job, then you have already done those two steps as a matter of course. You should also have some mechanism for actively monitoring the array, with notifications being sent to you in the event of a drive failure, so that you can immediately replace the failed drive.
Remember that when you purchase an array with drives all at the same time, they often come from the same batch, and are the same age. As a result, their MTBF is going to be fairly close, and you should be ready for further potential drive failures to follow on the heels of the first. That’s why it’s important to have the monitoring in place so that you can replace that failed drive within 24 hours, ideally. With a hot spare in place, that buys you a little extra time in case of a second drive failure. A hot spare is definitely a GOOD idea, not a bad one.
A lackluster backup and monitoring policy is a BAD idea.
VA:D [1.9.22_1171]
Rating: +26 (from 28 votes)
Reply
grin
February 09, 02 2012 04:56:15
Well backing up daily a multi-terabyte array is really fun. I mean, you haven’t even finished and it’s time to start again. 😉
As well as “halting a server, making a backup, do a consistency check, then…”, I’m sure your customers would be happy in that few days your server is offline. I’d prefer clustered servers and replicated storage but it still bites you if there’s an online breakage which gets mirrored while daily backups aren’t really feasible.
VA:F [1.9.22_1171]
Rating: +6 (from 12 votes)
Reply
Hard Disk Recovery
February 28, 02 2012 10:21:46
A lot of people get a false sense of security from having a RAID. I’ve seen many RAIDs where multiple drives failed at once. RAID sets are also a lot more expensive to recover the data from. Always backup the data, even from a RAID. Some data recovery companies charge up to $5000-$25,000 to recover a RAID set! Backing up the data can prevent the need for data recovery.
VA:F [1.9.22_1171]
Rating: +3 (from 3 votes)
Reply
Martino87r
May 03, 05 2012 09:54:50
From personal experience, working several years on Enterprise storage like Equallogic, dell MD3000, EMC and others… I cannot agree with not having a hot spare!
Normally if you run integrity checks regularly along with patrol reads all the time the probability of broken bytes on the disks is not that big.
Not having a hot spare on a raid system composed of 15/20 disks means that potentially you’re using a RAID 0 once a drive has failed and the more disks you have, higher is the probability of another failing shortly, especially today that people build up arrays composed of disks from the same production batches (likely to have the same MTBF).
In some arrays i even configured 2 hot spare drives as the arrays were located far and the replacement time for a drive was quite high (at least a day)
I don’t agree as well backing up an array which is in degraded state as this stresses the drives in the same manner (and even more, due to the random data positioning on the platters and the excessive head movement in order to read fragmented files), normally you should have a backup strategy that keeps your data safe (like asynchronous replication or snapshots on another array)
VA:F [1.9.22_1171]
Rating: +14 (from 14 votes)
- Reply
  Janusz Bak
  May 07, 05 2012 08:13:37
  In case your data is continuously protected and you use very good quality hardware and good monitoring system, shorter time of running array in degraded mode statistically proof for the hot spare. This is why hot spare was good selling point.
  The blog was created to make people understand that relying ONLY on hot-spare disks is not a good idea. You should always have your data backed up somewhere in different location.
  VN:F [1.9.22_1171]
  Rating: -23 (from 25 votes)
  - Reply
    user
    July 02, 07 2013 12:56:54
    You should have made that clear in the article then, because the article implies it is better to backup a degraded disk directly before a rebuild.
    VA:F [1.9.22_1171]
    Rating: +28 (from 30 votes)
Reply
Walter
May 15, 05 2012 02:57:06
Thanks for the info!
Regards
VA:F [1.9.22_1171]
Rating: -9 (from 9 votes)
Reply
athar13
February 05, 02 2013 04:38:05
Being the IT manager of a group of companies in the VAS industry, I beg to differ very strongly. I guess what is a bliss for one can be a blister for someone else. We do not have the luxury to down our database for a backup, swap and rebuild as the 8 disks are 1TB each and a rebuild would be way too painstakingly time consuming, (approximately a day!!!). The best RAID for an active server would be RAID5+hot spare and an archive server would be RAID10+Global spare. The RAID5+Hot Spare would give you a fault tolerance of 2 disk failures and the RAID10+GSP would give a fault tolerance of 5 disk failures (provided 1 disk of each group fails).
VA:F [1.9.22_1171]
Rating: +4 (from 8 votes)
Reply
Christophe
March 08, 03 2013 07:48:48
I do agree with the risk of one extra disk failing while rebuilding, however I do not agree at all with your procedure.
First of all the thing of taking a backup is not a good idea:
– You are as well putting stress on all your disks
– You need additional storage for your “full backup”, which is not possible depending on your RAID size. I must admit I don’t have a spare machine with a spare 50TB (used data).
– Your backup strategy should be designed beforehand and backups should be existing, independently from the state of your RAID. A RAID system IS a single point of failure in the case of a power-surge, so you should consider it could die every day.
From our experience shutting down/booting a system poses also a very big risk for disks. So if you’re already in a degraded state you don’t want to take that additional risk.
Your system should support hot-swappable disks, which is the case for any SATA these days.
The goal of a RAID is to have high availability so you don’t really want to bring your system down for a simple disk-swap.
Hoping that this experience gives another view.
VA:F [1.9.22_1171]
Rating: +9 (from 9 votes)
Reply
Matthias
August 16, 08 2013 02:08:49
I disagree, too….
In a “normal” productive environment, you HAVE to have a backup; most likely every day. So steps 1 & 2 should ALWAYS be set, no matter what the Admin or server does…
I already had the issue, that 2 of my drives died in my RAID 5 (WITH HOT SPARE), so the hot spare saved me plenty of time not to have to recover all my data from backup.
Another point of arguing could be the use of RAID 5 with hot spare or RAID 6.
Greetings from Germany….
VA:F [1.9.22_1171]
Rating: +8 (from 8 votes)
Reply
Neil
January 09, 01 2015 07:22:04
I have to disagree with the logic that says that hot adding a drive to a degraded RAID is a higher risk than doing a backup. Doing a backup causes writes to the disks that are still active in the array. The writes involve updating the access time of each file as well as usually the backup flag(s). Rebuilding the array only reads the active drives and writes to the new drive. Also the risk of failure goes up with drive seeks, not with reading or writing. Restoration of a RAID involves sequential seeking, which is the safest seeking that can be done. Data backup involves random seeking. Normally files are stored in one or more places, directory, metadata, and journaling all tend to be stored in certain areas which requires a large number of seeks, usually over one third of the travel of the heads.
My experience has taught me that when I see multiple drive failures, it is usually environmental, the three top drive killers are: heat (improper cooling, failed fans, or dust clogging), smoke (cigarette or similar), faulty power supply (noise in the power from bad or poor capacitors in the power supply, or wrong voltages).
VA:F [1.9.22_1171]
Rating: +13 (from 13 votes)
Reply
Florian Manach
April 01, 04 2015 10:10:02
Backup and RAID are two very different things and should not be faced against each other.
RAID prevent hardware failure to impact the service availability.
Backup prevents for data corruption or loss.
RAID is not made to protect the loss of data. Backup is.
RAIS is only there to ensure the service availability despite a disk failure.
A degraded RAID array is a threat to the continuity of service. The sooner the RAID array is rebuild, the less this threat is important. This is why NOT HAVING A HOT SPARE IS A BAD IDEA. The only case where a hot spare is not mandatory is when a technician is 24/7 at the same place as the machine and can go replace the faulty disk minutes after the fault.
A RAID array should ALWAYS be configured with a hot spare drive and with a notification mechanism so the fault unit can be replaced ASAP.
This is totally uncorrelated with backups which must ABSOLUTELY made for EVERY piece of data that needs to be protected. If your RAID controller fails, or if you change host architecture, don’t count on the RAID for your data.
Conclusion : ALWAYS use a hot spare. ALWAYS backup your data.
VA:D [1.9.22_1171]
Rating: +30 (from 30 votes)
Reply
Yowser
January 19, 01 2016 10:12:36
Silly not to have a hot spare if you can. You should rebuild the array ASAP. Not only does a backup put more strain on the existing degraded array, but the backup process will likely take longer than the rebuild, meaning your data is actually more vulnerable. You’re not protecting your data by performing a backup first, you’re actually increasing the risk of a second drive failure as it will take so much longer to get a stable array again.
VA:F [1.9.22_1171]
Rating: +3 (from 3 votes)
Reply
1T15
January 22, 01 2016 08:36:02
I guess the article needs to be revisited and revised with warnings.
Having a bad disk part of an array and trying to backup data is only increasing the risk of losing another disk and then the whole data. If there was a hot spare the data rebuild would have started already and there is no need to schedule a downtime of a server as well.
Again backups should be part of any organisation with critical/important data.
I’m all for hot spares and global hot spares.
VA:D [1.9.22_1171]
Rating: +2 (from 2 votes)
Reply
Ed Gorbunov
February 09, 02 2016 10:28:46
Wrong conclusion based on an incorrect assumption.
Upon the moment when RAID controller pulls a hot spare in to restore the data redundancy, it reads from all but the failed drive and writes calculated or parity blocks to the hot spare closing the window when the whole array may fail due to another drive failure. So, it is no different to what you call taking a backup. It is unlikely that your backup will take less time than the array’s recovery. A good controller will rebuild an array using a single hot spare, given they are of 3GB in size each, in less than 5 hours. Contrarily, a one terabyte worth of data could be copied/moved elsewhere in a little shy of 3 hours , granted your array could sustain reading 100MB/s in the degraded mode or the network moving data at that speed (unless you have 10GigE network or better). If you have more than 2TB to backup then restoring redundancy would be your best bet, thus the sooner a hot spare is taken into rebuild the better! Simple math, no gimmicks, no cheating with numbers.
Whenever you can afford to run a hot spare in your storage, indulge yourself – do so, and you will be sleeping better at night! 🙂
VA:D [1.9.22_1171]
Rating: +5 (from 5 votes)
Reply
MJI
February 20, 02 2016 10:52:45
I have some sympathy for this argument, but wouldn’t taking a full system backup and then a full verification of that backup stress the HDDs that remain in the array even more than rebuilding the array with the hot spare?
As other posters have pointed out, RAID is not a proper backup in the first place, but rather a way to ensure continuity of service – and a hot spare is the quickest way to ensure that the array is reconstructed and brought back into full operation as quickly as possible.
VA:F [1.9.22_1171]
Rating: 0 (from 0 votes)
Reply
Frank
March 24, 03 2016 02:24:07
Hi, I’m getting this message even before the system start normally:
1785- Slot 0 Drive Array Not Configured. Run hpssa.
When I run HPSSA it asks for configuring a new array. i wasn’t the first person who configured it so i don’t know which raid system they were using. does it affects stored data?
Is it really bad? What’s the best solution?
VA:F [1.9.22_1171]
Rating: 0 (from 0 votes)
- Reply
  karolina.pletnia
  April 11, 04 2016 10:23:36
  Hey Frank! Apparently, the problem you are facing does not concern the software, as the message didn’t come from the software, but from RAID controller during the beginning of the boot. It is a hardware issue, and we would suggest you contact your hardware provider. Especially if you are not sure of your RAID configuration. Regards!
  VN:F [1.9.22_1171]
  Rating: +1 (from 1 vote)
Reply
matthew
August 10, 08 2016 10:58:59
Let me posit some reasons why a hot spare is a *really* good idea.
RAID array fails while no-one is at work (does your site remain manned 24×365, even in the event of a fire alarm or other security alert?), you’re running at risk of data loss until the situation is addressed.
SMART monitoring on the controller spots the disk is failing *before* it goes offline and fails to the hot spare with no risk of data loss and no downtime. I know it won’t catch all failures, but on the HP kit I’ve worked on the majority of failures (media failures as opposed to electronics failures) have been spotted and fixed on the fly while the disk was still usable.
In over 20 years I have *once* had an array fail a second disk while recovering onto a hot spare. I have lost count of the number of times that a hot spare has saved the day…
I have worked on one customer’s system where the transactional traffic was so high that they would not restore from backup… the lost income from the restore time outweighed the costs of abandoning the data and starting again with an empty database… for them, hot spares provided a better financial risk than an outage to perform steps (1) and (2)
If your data is that critical, you should use some RAID that allows multiple devices to fail… and probably have more than one hot spare available.
So I’m afraid my real-world experience teaches me that your article is really not a good way to go for every business. There may be edge cases where a hot spare proves bad, but there are edge cases where not wearing a seatbelt in a car proved beneficial.
VA:F [1.9.22_1171]
Rating: +6 (from 6 votes)
Reply
Luke
October 11, 10 2016 04:21:30
I also think hot-spare is a bad idea with RAID5. Here is my reasoning….
OptionA: RAID 5 without hot-spare (3 drives in total)
OptionB: RAID 5 with a hot-spare (4 drives in total)
In OptionA: if one drive fails, you simply replace that drive ASAP and then system will rebuild failed drive. Let’s say whole rebuild will take 24hrs.
In OptionB: if one drive fails, immediately rebuild begins on a hot-spare. It will also take 24hrs. Once rebuild is complete you still need to swap the broken drive. Which will trigger another rebuild this time from hot-spare assigned drive to the freshly inserted drive – this process will also take 24hrs.
So not only you did the rebuild twice… it also took twice as much time. Effectively doubling amount of time where your Raid Array is danger of having another faulty drive. With all the extra stress caused by doing rebuild twice it’s walking on a thing ice really.
It almost feels it’s better to have a spare drive ready, kept on shelf, not as part of system (and not assigned as a hot-spare). As soon as faulty drive is detected start rebuild onto that drive. Obviously, if you cannot be next to your system everyday then maybe a hot-spare is a better option.
If you really insist on RAID 5 then maybe not having hot-spare is a safer option in this case. Unless I am missing something really obvious here.
Would love more feedback on this case.
** I know in some cases you can mark freshly rebuild hot-spare as your new drive. And then simply add another drive as a hot spare. But I am not sure if this is a default behavior for Raid Controllers. I think most raid controllers usually just rebuild back from hot-spare into a freshly slotted drive.
VA:F [1.9.22_1171]
Rating: -10 (from 12 votes)
- Reply
  MJI
  November 01, 11 2016 01:09:37
  I don’t understand why option B would trigger two rebuilds. If you have a hot spare, the NAS will start an automatic rebuild, at the end of which you have a restored 3 drive setup. When you replace the broken drive, you can add the new drive as a hot spare if you want it, there is nothing that forces you to turn it into a 4 drive NAS.
  VA:F [1.9.22_1171]
  Rating: +11 (from 11 votes)
- Reply
  Scott in Texas
  December 28, 12 2016 01:45:46
  If you use real RAID, not motherboard RAID or software RAID, you need not move the drive, and if you DID move the drive from the hot swap port to the primary port it would not result in a rebuild, it is called “Disk Roaming”.
  VA:F [1.9.22_1171]
  Rating: +3 (from 3 votes)
Reply
Bill
November 13, 11 2016 09:25:40
I agree with many of the comments here. Not having a spare drive, as a POLICY, is retarded.
If you can’t afford a spare drive straight away, then go without one for a while. But add one later!
A company that actually values it’s data will have am automated backup mechanism in place that matches the required RPO and RTO as defined by the IT strategy – and signed off by management/executive levels.
If a disk fails in a RAID array, replace it IMMEDIATELY. Having a hot-spare accomplishes this for you.
And since you have a backup strategy in place that already satisfies the RPO/RTO of the organization, there is no need to take another backup from a DEGRADED array. If the array is a parity RAID (heaven forbid you are using this on your primary storage) then the performance will be lower than normal and you are leaving your array in this state for longer – further increasing the likelihood that another drive will fail before the rebuild is done.
If you don’t have a backup of your primary data that meets organization RPO/RTO, or your organization hasn’t even though of these things, then obviously data integrity doesn’t matter and you basically just have a load of junk on disk – so why bother with the backup if the data is just junk anyway?
Oh, and don’t forget that as well as a proper backup mechanism, you also need monitoring/notifications of the array status – so you KNOW that a disk has failed and can organize the replacement immediately.
VA:F [1.9.22_1171]
Rating: +2 (from 2 votes)
Reply
Scott in Texas
December 28, 12 2016 01:35:26
Struggling with your approach. I agree with the comments above about how a failure during a write of ALL your data to a back up set is just as risky as going straight to a rebuild. I also think it moronic of you to recommend data validation AFTER you have had a failure. I run validation EVERY NIGHT for a couple of hours a night, resulting in a full validation every week. Does it stress the drives, yes it does, and if any drives get flakey, S.M.A.R.T will identify them and warn me… more importantly, I would rather validate the data BEFORE it becomes critical that it is validated, and risky to perform the validation.
I also run Raid 6, so that in the event of a failure, it will take two additional failures to lose data. So no, a 14 drive RAID 6 array is not truly a “backup” but it is damned close… not to mention the problems backing up a 22Tb array would present.
FWIW, I also run a REAL RAID controller card (3Ware 9650SE), not motherboard raid and not software raid… so yes, I sleep quite well at night having a hot swap drive and skipping your first two steps.
VA:F [1.9.22_1171]
Rating: 0 (from 0 votes)
Reply
Jon Redwood
February 12, 02 2017 07:01:17
Why not use RAID 6. You could do this with or without a hot spare but if one disk goes then whenever you decide to rebuild the array (straight away with a hot spare or when you have replaced the malfunctioning disk) then the array will need 2 more failed disks before you lose your data.
VA:F [1.9.22_1171]
Rating: +3 (from 3 votes)
Reply
DTEAdmin
February 13, 02 2017 11:40:09
Two years after posting. A worthwhile read, and we went with a 4TB Hot Spare for a (3)-3TB RAID1E.
The logic behind the OP was solid, but also was the fact of “It should be backed up scheduled nightly versus a workday past”.
Thanks for the civil forum and superb input.
VA:F [1.9.22_1171]
Rating: 0 (from 0 votes)
Reply
PassingBy
April 18, 04 2017 02:25:52
In my experience, RAIDs consisting of disks from the same manufacturer are prone to have cluster failures.
The time frame can be days, maybe weeks. Spare (or RAID6) is a must. Do not wait until the end of the backup.
Actually I do not understand the full data backup statement at all. The RAID requirements and backup policies should be based on risk analysis, not some home grown ideas.
VA:F [1.9.22_1171]
Rating: +1 (from 1 vote)
Reply
Confused
May 12, 05 2017 07:55:28
This really doesn’t make any sense at all.
Most controllers rebuild using a very sequential process. You advise to back-up and check consistency first. How on earth is that less stressful to the now broken RAID? What do you think will happen if it finds a bad block on the RAID-set during the backup/verify now that it no longer has parity to correct it with? Pretty much the same as it would during the rebuild…
VA:F [1.9.22_1171]
Rating: +3 (from 3 votes)
Reply
Mike Uchima
September 13, 09 2018 12:26:55
I tend to agree with those who are questioning the logic here.
Running a full backup is probably going to stress the remaining disks just as much (if not more) than doing a RAID rebuild. A RAID rebuild is going to involve sequentially reading all of the other disks to reconstruct the contents of the failed one. A file level backup — while it will only require reading the parts of the disks which contain valid data — is going to involve more random seeking, stressing the head actuator assemblies and causing the drives to heat up more.
Another thing I question is how this article implies you’re not keeping your backups current! You shouldn’t be waiting until you’ve got a degraded array to do a full backup; if anything, all you should need to do is an incremental backup of anything that has changed since the last full backup, which is (hopefully) not a large amount of data. Steps 1 and 2 (full backup and verify) on a large array could take many hours, possibly even days; that’s an unacceptably long period of time where your degraded array could turn into a failed array, exposing you to downtime and data loss.
I think a much better approach would be a sensible backup regime combined with a RAID-6 (or raidz2) array. The double redundancy of RAID-6/raidz2 protects you against a second failure during the rebuild. In this scheme, whether you use a cold spare vs. hot spare, and whether you run an incremental backup prior to the rebuild, are judgement calls that I’m not going to take a strong stance on.
The only reasons I can think of to avoid hot spares are that keeping the drive powered up is causing additional wear on the spare drive itself and consuming additional power. If your RAID controller/software keeps the spare drive in a low-power (spun down) state until it is needed, then even these justifications go away.
VA:F [1.9.22_1171]
Rating: +3 (from 3 votes)
Reply
zman
April 26, 04 2019 11:42:46
This is actually pretty naive to think re-building an array will stress the disks. You know on a lot of servers disks run almost 24/7 for years w/o failing. So thinking that an enterprise grade will fail in a few hours is just silly. You should always already have a backup from couple of hours ago or even 15 minutes or so. You don’t run a back when a disk has failed. I have run backups on consumers grade external HDs for up to 48 hours non-stop and one has never failed.
VA:F [1.9.22_1171]
Rating: +1 (from 1 vote)
Reply
Tristyn Russelo
August 03, 08 2019 02:07:48
This is flawed logic.
The procedure in this article tells you that rebuilding is hard on the remaining drives, true, but then tells you to back up all data, verify backups, then rebuild.
This procedure puts 3x the workload on the drives. the drives will be spun up and heads moving for 3x longer.
Backing up is just as hard on the drives as a rebuild. Verifying is another complete read of all data. Then rebuilding.
It will also be 3x longer before your system is back up to normal operating condition
VA:F [1.9.22_1171]
Rating: +4 (from 4 votes)
Reply
Koe
November 05, 11 2019 02:22:49
I don’t get this. Sure a rebuild is stressing the controller and the harddrives.
If you make a backup from a degraded array this will stress your harddrives like the same, or not ?
So why not rebuilding it as fast as possible ?
A backup should exist befor a array is getting degraded.
Maybe you can explain a litte bit more about the logic behind.
VA:F [1.9.22_1171]
Rating: +3 (from 3 votes)
Reply
Boyan
December 27, 12 2019 04:28:42
Many of the replies don’t get why is using a Hot-Spare Drive a bad idea, because… it isn’t and the the opinion expressed here is imho is false. Wait so you stay away from high stress events – YES totally correct yet you recommend TWO of them instead of ONE?
Run full back – the highest stress event possible, then swap the bad HD and let it rebuild – second highest stress event possible instead of letting the hot spare rebuild and end up with ONE and only ONE high stress event on the books? Disagree?
VA:D [1.9.22_1171]
Rating: +6 (from 6 votes)
- Reply
  Michael
  February 21, 02 2021 08:31:09
  Boyan, I completely agree with you! Also make note to all of you that the modern server today allows 2-disk fault taulerance which means that if one disk fails one may be inserted for rebuild with minimal stress. Even if a 2nd drive dies during rebuild in the above example, the rebuild will finish and will say ‘You are still in need to insert a 2nd drive to finish rebuild’ after it reaches 100%.
  Have a good week.
  VA:F [1.9.22_1171]
  Rating: 0 (from 0 votes)
Reply
Michael
February 21, 02 2021 08:17:36
Back in 2010 when this article was posted, I would definitely agree that having a HS is a bad idea. However, today, disk failure during rebuild is much less likely and I have been using Hotspare the last 2 years since 2019 and its worked miracles for me. I recommend using 1, even 2 hotspares in your free slots.
VA:F [1.9.22_1171]
Rating: 0 (from 0 votes)
Reply
DD
August 26, 08 2021 10:11:09
Great article!
VA:F [1.9.22_1171]
Rating: 0 (from 0 votes)
Reply
InuYasha
January 16, 01 2022 09:39:14
I don’t know why there are so many negative comments here.
As soon as I found out tht hot spare drive is SPINNING in a HP server, I moved to RAID6. Not only it wasted energy into heat, but have been worn out WITHOUT DATA. This is just terrible. Why in the hell did HP make it spin!? Also, I always set rebuild priority to maximum because the choice between performance and data loss is kind of obvious to me )
Also, make your own CAPTCHA please!
VA:F [1.9.22_1171]
Rating: 0 (from 0 votes)

The Problematic Aspects of Using a Hot Spare Disk

Hot Spare Disks Add Stress to Vulnerable Systems

Problems in Overall Hot Spare Disk Design

Hot Spare Disks Create a Single Point of Failure

Our Solution

Related Posts

Related Project

49 Comments

Hard Drives External

Janusz

José Rocha

Joe McDoaks

Data Daddy

linc

Matthias

BoonHong

peter

Matthias

Kai-Uwe

xpresshred

IcebergTitanic

grin

Hard Disk Recovery

Martino87r

Janusz Bak

user

Walter

athar13

Christophe

Matthias

Neil

Florian Manach

Yowser

1T15

Ed Gorbunov

MJI

Frank

karolina.pletnia

matthew

Luke

MJI

Scott in Texas

Bill

Scott in Texas

Jon Redwood

DTEAdmin

PassingBy

Confused

Mike Uchima

zman

Tristyn Russelo

Koe

Boyan

Michael

Michael

DD

InuYasha

Leave a Reply

Janusz Bąk