
    Why is a HOT-SPARE Hard Disk a bad idea?

    People are a bit surprised every time they hear this question from us.

    Conventional wisdom about Hot-Spares tells us they are a very good idea: they minimize the time an array spends in a degraded state, and so on.

    So, why is using a Hot-Spare Drive a bad idea?

    It’s true that a Hot-Spare helps to minimize the time an array spends in a degraded state, but the goal of a Redundant Array of Inexpensive Disks is to keep operating, without losing data, in the event of a drive failure. Anything that increases the risk of data loss is a bad idea.

    Over our long years of experience we have learned that the probability of an additional drive failure during a RAID rebuild is quite high – a rebuild is stressful on the remaining drives. This is why we advise following the procedure below once the array reports a degraded state as a result of a drive failure.

    1. Run a full data backup.
    2. Verify the backed-up data for consistency, and verify whether the data restore mechanism works.
    3. Identify the source of the problem, i.e. find the faulty hard disk. If possible, shut down the server and make sure the serial number of the hard disk matches the one reported by the RAID controller.
    4. Replace the hard disk identified as bad with a new, unused one. If the replacement hard drive has already been used in another RAID array, make sure that any residual RAID metadata on it has been deleted via the original RAID controller.
    5. Start the rebuild of the RAID.

    So, using this approach, the rebuild is the 5th step! With a Hot-Spare, your RAID skips the first two very important steps and runs steps 3, 4 and 5 automatically. The rebuild therefore happens before the critical steps that ensure your data is safe.
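
    For example, a monitoring script can detect the degraded state and alert an administrator instead of kicking off a rebuild. The following is a minimal sketch, assuming Linux software RAID (md) and the standard /proc/mdstat layout; hardware RAID controllers expose the same information through their own CLI tools.

        import re
        import sys

        def degraded_md_arrays(mdstat_path="/proc/mdstat"):
            """Return (array, member_states) pairs for md arrays running with a missing member."""
            with open(mdstat_path) as f:
                text = f.read()
            found = []
            # Each array block begins like: "md0 : active raid5 sda1[0] sdb1[1] sdc1[2](F)"
            for block in re.split(r"\n(?=md\d+ :)", text):
                name = re.match(r"(md\d+) :", block)
                # The status line contains e.g. "[3/2] [UU_]" when one member is missing.
                state = re.search(r"\[(\d+)/(\d+)\]\s*\[([U_]+)\]", block)
                if name and state and state.group(1) != state.group(2):
                    found.append((name.group(1), state.group(3)))
            return found

        if __name__ == "__main__":
            degraded = degraded_md_arrays()
            for array, members in degraded:
                # Alert only; the backup and verification steps come before any rebuild.
                print(f"WARNING: {array} is degraded ({members}) - back up and verify before rebuilding")
            sys.exit(1 if degraded else 0)

    Run from cron or a monitoring agent, a check like this pages a human and leaves the decision about when to rebuild in the administrator’s hands.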

    Anyone aware of Murphy’s Law would never risk an immediate rebuild after a drive failure – yet with a Hot-Spare that is exactly what happens. If you stop and think about the integrity of your data, you will come to the same conclusion: a Hot-Spare Drive is a very bad idea.

     


    37 Comments

    • Hard Drives External

      September 14, 2010 09:31:35

      I found your site via Google on Tuesday while searching for hard drives, and your post “Why a HOT-SPARE Hard Disk is a bad idea? | Open-E Blog” looked very interesting to me. I just wanted to write to say that you have a great site and a wonderful resource for all to share.

      • Janusz

        September 17, 2010 05:39:06

        We try our best. Thank you!

    • José Rocha

      October 16, 2010 10:27:34

      Had never really thought about the possibility of failure during a rebuild. Excellent approach. Thanks for the tip.

    • Joe McDoaks

      January 10, 2011 03:52:21

      Had never really thought of this approach, but if you get a failure during the rebuild, you would also get a failure during the backup, as this stresses the disks just as much. I’m trying to understand, but I’m not sure how you are further ahead?

      • Data Daddy

        February 15, 2011 09:13:29

        Well, rebuilding the RAID goes over every byte on the disk, which requires a lot of reading and writing on the array, while a backup only requires reads, since the writes go to another disk or medium.

        So that makes the backup less stressful than the rebuild, just as a quick format of an HDD is less stressful than a full format: the full format keeps writing, while the quick format only erases the file allocation table (the TOC, table of contents).

        • linc

          July 16, 2013 01:04:23

          Not all RAID arrays rebuild byte by byte; most rebuild at the block level. Additionally, hot spares deliver a fully redundant disk. If the physical disk fails (note: the most common disk failure is physical, not data corruption) then a hot spare should rebuild successfully. It’s all about playing the law of averages, and not having a hot spare is just downright stupid in light of this.

    • Matthias

      February 13, 2011 07:07:49

      Well… I often think you’re posting really good things. Here I strongly disagree!
      If somebody does not care about full backups, checking restore mechanisms and so on, it doesn’t make a difference whether there is a spare or not.
      And as for Murphy’s law, there is a risk that another disk will fail sooner when you have no spare. That’s my long experience; I have been happy to have spares in place many times in my life.

      Sure, RAID is no backup… and if somebody mixes that up… well… noobs are always there 😉

    • BoonHong

      August 24, 2011 07:04:33

      Another hard disk may fail during the backup, making the whole array unusable as well. Ideally we should have RAID 6, which tolerates 2 disk failures. Having 3 disks fail within the rebuild window is highly unlikely.

      Moreover, most users can’t afford a second, duplicate set of storage for a full backup on top of the storage that already contains a previous full backup.

    • peter

      August 24, 2011 11:46:45

      Hot Spare a bad idea?
      Depending on the type and size of the RAID array it can also be a very good idea.
      If, for example, you have a mirrored configuration where the primary array becomes critical, you don’t need to run a full backup.
      If, for example, you have a RAID 6 configuration, there is still no critical need for a backup when one disk fails.
      If, for example, you have a RAID 50 array, the time to make a full backup can exceed the time you are allowed to remain in a critical situation.
      So whether a hot spare is a good idea depends on the configuration used, the type of array chosen and the SLAs.

    • Matthias

      September 07, 2011 05:13:35

      At the moment we are evaluating a new iSCSI solution for a customer. The performance of Open-E is very good. When the tests are finished and our customer is happy with this solution, the system goes to Africa. In Africa we have nobody who is able to carry out an exchange like this when a disk has crashed, but they are at least able to change a hard disk… The RAID controllers we build into these systems can do RAID 5EE. I haven’t tested the performance yet, but I think the stress on the system is lower than a complete rebuild onto a hot-spare disk.
      Sorry about my English, it’s not the best….

    • Kai-Uwe

      September 07, 2011 05:32:42

      I strongly disagree. Any IT admin keeping data on ANY kind of disk, be it a single disk, a RAID or a complex SAN storage subsystem, should ALWAYS have a complete backup that is AT MOST one work day old!

      So if you start with a backup only after a disk has failed, you have definitely not done your job right before.

      I personally tend to use RAID-6 nowadays as an additional disk will not cost much and will leave room for an additional failure. Sometimes I use RAID-6 without a hot spare in addition to it but sometimes (if a drive slot and the money for the extra drive do not matter) I even add a hot spare to a RAID-6, too.

      Also, modern storage systems use “background scrubbing” to detect bad sectors in advance so that you are not hit by one in the event of a rebuild. In addition, this causes kind of a “healthy” stress on all disks to sort out the flaky ones rather soon …

    • xpresshred

      September 19, 2011 12:31:55

      A hot spare disk is a disk that is used to automatically replace a failing or failed disk in a RAID configuration, depending on the hot spare policy.

    • IcebergTitanic

      February 03, 2012 04:18:29

      I have to agree with some of the other comments. You shouldn’t need to worry about whether or not Steps 1 and 2 are done prior to the array automatically rebuilding. If you’re doing your job, then you have already done those two steps as a matter of course. You should also have some mechanism for actively monitoring the array, with notifications being sent to you in the event of a drive failure, so that you can immediately replace the failed drive.

      Remember that when you purchase an array with drives all at the same time, they often come from the same batch, and are the same age. As a result, their MTBF is going to be fairly close, and you should be ready for further potential drive failures to follow on the heels of the first. That’s why it’s important to have the monitoring in place so that you can replace that failed drive within 24 hours, ideally. With a hot spare in place, that buys you a little extra time in case of a second drive failure. A hot spare is definitely a GOOD idea, not a bad one.

      A lackluster backup and monitoring policy is a BAD idea.

    • grin

      February 09, 2012 04:56:15

      Well, backing up a multi-terabyte array daily is really fun. I mean, you haven’t even finished and it’s time to start again. 😉
      As for “halting a server, making a backup, doing a consistency check, then…”, I’m sure your customers would be happy with the few days your server is offline. I’d prefer clustered servers and replicated storage, but that still bites you if an online breakage gets mirrored while daily backups aren’t really feasible.

    • Hard Disk Recovery

      February 28, 2012 10:21:46

      A lot of people get a false sense of security from having a RAID. I’ve seen many RAIDs where multiple drives failed at once. RAID sets are also a lot more expensive to recover the data from. Always backup the data, even from a RAID. Some data recovery companies charge up to $5000-$25,000 to recover a RAID set! Backing up the data can prevent the need for data recovery.

    • Martino87r

      May 03, 2012 09:54:50

      From personal experience, working several years on enterprise storage like EqualLogic, Dell MD3000, EMC and others… I cannot agree with not having a hot spare!
      Normally, if you run integrity checks regularly along with patrol reads, the probability of broken bytes on the disks is not that big.
      Not having a hot spare on a RAID system composed of 15/20 disks means that you are potentially running a RAID 0 once a drive has failed, and the more disks you have, the higher the probability of another one failing shortly after, especially today when people build arrays from disks of the same production batch (likely to have the same MTBF).
      In some arrays I even configured 2 hot spare drives, as the arrays were located far away and the replacement time for a drive was quite long (at least a day).
      I also don’t agree with backing up an array which is in a degraded state, as this stresses the drives in the same manner (and even more, due to the random positioning of data on the platters and the excessive head movement needed to read fragmented files). Normally you should have a backup strategy that keeps your data safe anyway (like asynchronous replication or snapshots on another array).

      • Janusz Bak

        May 07, 2012 08:13:37

        If your data is continuously protected and you use very good quality hardware and a good monitoring system, the shorter time the array spends running in degraded mode statistically speaks for the hot spare. This is why the hot spare has been a good selling point.
        The blog post was written to make people understand that relying ONLY on hot-spare disks is not a good idea. You should always have your data backed up somewhere in a different location.

        • user

          July 02, 2013 12:56:54

          You should have made that clear in the article then, because the article implies it is better to back up a degraded array directly before a rebuild.

    • Walter

      May 15, 2012 02:57:06

      Thanks for the info!
      Regards

    • athar13

      February 05, 2013 04:38:05

      Being the IT manager of a group of companies in the VAS industry, I beg to differ very strongly. I guess what is bliss for one can be a blister for someone else. We do not have the luxury of taking our database down for a backup, swap and rebuild, as the 8 disks are 1 TB each and a rebuild would be way too painstakingly time-consuming (approximately a day!!!). The best RAID for an active server would be RAID 5 + hot spare, and for an archive server RAID 10 + global spare. RAID 5 + hot spare gives you a fault tolerance of 2 disk failures, and RAID 10 + GSP gives a fault tolerance of 5 disk failures (provided 1 disk of each mirror group fails).

    • Christophe

      March 08, 2013 07:48:48

      I do agree with the risk of one extra disk failing while rebuilding, however I do not agree at all with your procedure.

      First of all, taking a backup at that point is not a good idea:
      – You are putting stress on all your disks as well.
      – You need additional storage for your “full backup”, which is not possible depending on your RAID size. I must admit I don’t have a spare machine with a spare 50TB (used data).
      – Your backup strategy should be designed beforehand and backups should already exist, independently of the state of your RAID. A RAID system IS a single point of failure in the case of a power surge, so you should assume it could die any day.

      In our experience, shutting down/booting a system also poses a very big risk to disks. So if you’re already in a degraded state you don’t want to take that additional risk.

      Your system should support hot-swappable disks, which is the case for any SATA these days.
      The goal of RAID is high availability, so you don’t really want to bring your system down for a simple disk swap.

      I hope this experience gives another view.

    • Matthias

      August 16, 2013 02:08:49

      I disagree, too….
      In a “normal” production environment, you HAVE to have a backup, most likely every day. So steps 1 & 2 should ALWAYS already be covered, no matter what the admin or server does…

      I have already had 2 drives die in my RAID 5 (WITH HOT SPARE), so the hot spare saved me plenty of time by not having to recover all my data from backup.

      Another point worth arguing about could be the use of RAID 5 with a hot spare versus RAID 6.

      Greetings from Germany….

    • Neil

      January 09, 2015 07:22:04

      I have to disagree with the logic that says that hot-adding a drive to a degraded RAID is a higher risk than doing a backup. Doing a backup causes writes to the disks that are still active in the array. The writes involve updating the access time of each file as well as usually the backup flag(s). Rebuilding the array only reads the active drives and writes to the new drive. Also, the risk of failure goes up with drive seeks, not with reading or writing. Restoration of a RAID involves sequential seeking, which is the safest seeking that can be done. Data backup involves random seeking. Files are normally stored in one or more places; directory data, metadata and journaling all tend to be stored in certain areas, which requires a large number of seeks, usually over one third of the travel of the heads.
      My experience has taught me that when I see multiple drive failures, it is usually environmental, the three top drive killers are: heat (improper cooling, failed fans, or dust clogging), smoke (cigarette or similar), faulty power supply (noise in the power from bad or poor capacitors in the power supply, or wrong voltages).

    • Florian Manach

      April 01, 2015 10:10:02

      Backup and RAID are two very different things and should not be set against each other.

      RAID prevents a hardware failure from impacting service availability.
      Backup protects against data corruption or loss.

      RAID is not made to protect against the loss of data. Backup is.
      RAID is only there to ensure service availability despite a disk failure.

      A degraded RAID array is a threat to the continuity of service. The sooner the RAID array is rebuilt, the smaller this threat becomes. This is why NOT HAVING A HOT SPARE IS A BAD IDEA. The only case where a hot spare is not mandatory is when a technician is on site with the machine 24/7 and can replace the faulty disk minutes after the fault.

      A RAID array should ALWAYS be configured with a hot spare drive and with a notification mechanism so the faulty unit can be replaced ASAP.

      This is totally uncorrelated with backups, which must ABSOLUTELY be made for EVERY piece of data that needs to be protected. If your RAID controller fails, or if you change host architecture, don’t count on the RAID for your data.

      Conclusion: ALWAYS use a hot spare. ALWAYS back up your data.

    • Yowser

      January 19, 2016 10:12:36

      Silly not to have a hot spare if you can. You should rebuild the array ASAP. Not only does a backup put more strain on the existing degraded array, but the backup process will likely take longer than the rebuild, meaning your data is actually more vulnerable. You’re not protecting your data by performing a backup first, you’re actually increasing the risk of a second drive failure as it will take so much longer to get a stable array again.

    • 1T15

      January 22, 2016 08:36:02

      I guess the article needs to be revisited and revised with warnings.
      Having a bad disk as part of an array and trying to back up data only increases the risk of losing another disk and then all the data. If there were a hot spare, the rebuild would already have started, and there would be no need to schedule server downtime either.
      Again, backups should be part of any organisation with critical/important data.
      I’m all for hot spares and global hot spares.

    • Ed Gorbunov

      February 09, 2016 10:28:46

      Wrong conclusion based on an incorrect assumption.

      The moment the RAID controller pulls a hot spare in to restore data redundancy, it reads from all but the failed drive and writes the calculated data or parity blocks to the hot spare, closing the window in which the whole array could fail due to another drive failure. So it is no different from what you call taking a backup. It is unlikely that your backup will take less time than the array’s recovery. A good controller will rebuild an array onto a single hot spare, with drives of 3 TB each, in less than 5 hours. By contrast, a terabyte’s worth of data can be copied or moved elsewhere in a little shy of 3 hours, granted your array can sustain 100 MB/s reads in degraded mode and the network can move data at that speed (unless you have a 10GigE network or better). If you have more than 2 TB to back up, then restoring redundancy is your best bet, so the sooner a hot spare is taken into the rebuild, the better! Simple math, no gimmicks, no cheating with numbers.
      Whenever you can afford to run a hot spare in your storage, indulge yourself – do so, and you will be sleeping better at night! :)
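
      A quick back-of-the-envelope check of that arithmetic, treating the figures above (3 TB drives, a roughly 5-hour rebuild, ~100 MB/s sustained reads from the degraded array) purely as assumptions:

          TB = 1e12  # bytes

          def hours(data_bytes, rate_mb_per_s):
              """Hours needed to move data_bytes at a sustained rate of rate_mb_per_s MB/s."""
              return data_bytes / (rate_mb_per_s * 1e6) / 3600.0

          print(f"copy 1 TB off the degraded array @ 100 MB/s: {hours(1 * TB, 100):.1f} h")  # ~2.8 h
          print(f"copy 2 TB off the degraded array @ 100 MB/s: {hours(2 * TB, 100):.1f} h")  # ~5.6 h
          # Beyond roughly 2 TB, the ~5 h hot-spare rebuild finishes before a full copy
          # would, which is the point being made above.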

    • MJI

      February 20, 2016 10:52:45

      I have some sympathy for this argument, but wouldn’t taking a full system backup and then a full verification of that backup stress the HDDs that remain in the array even more than rebuilding the array with the hot spare?

      As other posters have pointed out, RAID is not a proper backup in the first place, but rather a way to ensure continuity of service – and a hot spare is the quickest way to ensure that the array is reconstructed and brought back into full operation as quickly as possible.

    • Frank

      March 24, 2016 02:24:07

      Hi, I’m getting this message even before the system starts normally:
      1785- Slot 0 Drive Array Not Configured. Run hpssa.
      When I run HPSSA it asks me to configure a new array. I wasn’t the person who originally configured it, so I don’t know which RAID level they were using. Does this affect the stored data?
      Is it really bad? What’s the best solution?

      • karolina.pletnia

        April 11, 2016 10:23:36

        Hey Frank! Apparently the problem you are facing does not concern the software, as the message didn’t come from the software but from the RAID controller at the very beginning of the boot. It is a hardware issue, and we would suggest you contact your hardware provider, especially if you are not sure of your RAID configuration. Regards!

    • matthew

      August 10, 2016 10:58:59

      Let me posit some reasons why a hot spare is a *really* good idea.

      If the RAID array fails while no-one is at work (does your site remain manned 24×365, even in the event of a fire alarm or other security alert?), you’re running at risk of data loss until the situation is addressed.

      SMART monitoring on the controller spots the disk is failing *before* it goes offline and fails to the hot spare with no risk of data loss and no downtime. I know it won’t catch all failures, but on the HP kit I’ve worked on the majority of failures (media failures as opposed to electronics failures) have been spotted and fixed on the fly while the disk was still usable.

      In over 20 years I have *once* had an array fail a second disk while recovering onto a hot spare. I have lost count of the number of times that a hot spare has saved the day…

      I have worked on one customer’s system where the transactional traffic was so high that they would not restore from backup… the lost income from the restore time outweighed the costs of abandoning the data and starting again with an empty database… for them, hot spares provided a better financial risk than an outage to perform steps (1) and (2)

      If your data is that critical, you should use some RAID that allows multiple devices to fail… and probably have more than one hot spare available.

      So I’m afraid my real-world experience teaches me that your article is really not a good way to go for every business. There may be edge cases where a hot spare proves bad, but there are edge cases where not wearing a seatbelt in a car proved beneficial.

    • Luke

      October 11, 2016 04:21:30

      I also think hot-spare is a bad idea with RAID5. Here is my reasoning….

      Option A: RAID 5 without a hot-spare (3 drives in total)
      Option B: RAID 5 with a hot-spare (4 drives in total)

      In Option A: if one drive fails, you simply replace that drive ASAP and the system rebuilds the failed drive. Let’s say the whole rebuild takes 24 hrs.

      In Option B: if one drive fails, the rebuild immediately begins onto the hot-spare. It will also take 24 hrs. Once the rebuild is complete you still need to swap the broken drive, which triggers another rebuild, this time from the drive assigned as hot-spare to the freshly inserted drive – this process will also take 24 hrs.

      So not only did you do the rebuild twice, it also took twice as much time, effectively doubling the amount of time during which your RAID array is in danger of another faulty drive. With all the extra stress caused by doing the rebuild twice, it’s really walking on thin ice.

      It almost feels better to have a spare drive ready, kept on a shelf, not as part of the system (and not assigned as a hot-spare). As soon as a faulty drive is detected, start the rebuild onto that drive. Obviously, if you cannot be next to your system every day, then maybe a hot-spare is the better option.

      If you really insist on RAID 5, then maybe not having a hot-spare is the safer option in this case. Unless I am missing something really obvious here.
      Would love more feedback on this case.

      ** I know in some cases you can mark the freshly rebuilt hot-spare as your new drive and then simply add another drive as a hot spare. But I am not sure whether this is the default behavior for RAID controllers. I think most RAID controllers usually just rebuild back from the hot-spare onto a freshly slotted drive.
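
      Putting rough numbers on the two options, under the assumptions above (a 24-hour rebuild, and a controller that copies back from the hot-spare; as the footnote and the replies below note, many controllers can simply keep the hot-spare as the new member instead):

          REBUILD_HOURS = 24  # assumed rebuild time from the comment, not a measurement

          def rebuild_time(hot_spare: bool, copy_back: bool) -> int:
              """Total hours of rebuild activity before the array is back to its normal layout."""
              if not hot_spare:
                  return REBUILD_HOURS                        # one rebuild onto the replacement disk
              return REBUILD_HOURS * (2 if copy_back else 1)  # second pass only if the controller copies back

          print("Option A, no hot-spare           :", rebuild_time(False, False), "h")
          print("Option B, hot-spare kept in place:", rebuild_time(True, False), "h")
          print("Option B, hot-spare copied back  :", rebuild_time(True, True), "h")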

      • MJI

        November 01, 2016 01:09:37

        I don’t understand why option B would trigger two rebuilds. If you have a hot spare, the NAS will start an automatic rebuild, at the end of which you have a restored 3-drive setup. When you replace the broken drive, you can add the new drive as a hot spare if you want; nothing forces you to turn it into a 4-drive NAS.

      • Scott in Texas

        December 28, 2016 01:45:46

        If you use real RAID, not motherboard RAID or software RAID, you need not move the drive; and if you DID move the drive from the hot-swap port to the primary port, it would not result in a rebuild – that is called “Disk Roaming”.

    • Bill

      November 13, 2016 09:25:40

      I agree with many of the comments here. Not having a spare drive, as a POLICY, is retarded.
      If you can’t afford a spare drive straight away, then go without one for a while. But add one later!

      A company that actually values its data will have an automated backup mechanism in place that matches the required RPO and RTO as defined by the IT strategy – and signed off by management/executive levels.

      If a disk fails in a RAID array, replace it IMMEDIATELY. Having a hot-spare accomplishes this for you.
      And since you have a backup strategy in place that already satisfies the RPO/RTO of the organization, there is no need to take another backup from a DEGRADED array. If the array is a parity RAID (heaven forbid you are using this on your primary storage) then the performance will be lower than normal and you are leaving your array in this state for longer – further increasing the likelihood that another drive will fail before the rebuild is done.

      If you don’t have a backup of your primary data that meets the organization’s RPO/RTO, or your organization hasn’t even thought of these things, then obviously data integrity doesn’t matter and you basically just have a load of junk on disk – so why bother with the backup if the data is just junk anyway?

      Oh, and don’t forget that as well as a proper backup mechanism, you also need monitoring/notifications of the array status – so you KNOW that a disk has failed and can organize the replacement immediately.

    • Jon Redwood

      February 12, 2017 07:01:17

      Why not use RAID 6? You could do this with or without a hot spare, but if one disk goes, then whenever you decide to rebuild the array (straight away with a hot spare, or once you have replaced the malfunctioning disk), the array will need 2 more failed disks before you lose your data.

    • DTEAdmin

      February 13, 2017 11:40:09

      Two years after posting. A worthwhile read, and we went with a 4TB Hot Spare for a (3)-3TB RAID1E.

      The logic behind the OP was solid, but so was the point that the data should already be backed up on a nightly schedule rather than being a workday old.

      Thanks for the civil forum and superb input.

