Short Story about IO-Errors

Hardware failure is not an exception; it is the norm. I guess most of you would agree with this statement.
A real case we experienced a few weeks ago is worth describing on our blog as a reminder of that norm. The norm is most obvious when we consider hard disk failures.

Hard disks are usually the number one cause of failure events. This is why all of us use some kind of RAID to minimize the negative impact of a hard disk failure.

In theory, a hard disk that is a RAID member can fail, and the RAID will simply reject the failed drive and rebuild onto a new one.

This is often the case in practice, but unfortunately not always. In some cases the drive failure can cause the RAID array to show IO-errors to the operating system or, even worse, the whole server may stop working. Exactly this problem happened to our server. On a nice Friday morning we were faced with a very frustrating problem: one of our servers was not available at all, and it did not even respond to ping. ☹

So we had to hard power the system OFF and ON again, and we saw that the RAID was in degraded mode and the OS did not boot. Luckily, after the next OFF/ON cycle the OS booted. Immediately after the boot we made a fresh backup of our SQL application database, and since the server was running and the application was working properly, we decided to postpone the RAID rebuild until after hours.

Unfortunately, after a few hours the server hung again. This time ping was still answering, but the SQL application and the console did not react at all. After powering OFF we removed the failed hard disk and started the server again. Now it was able to boot without a problem with the degraded RAID array.

So the faulty hard disk, while present in the RAID, was able to hang the whole system. I am not going to name the hardware vendors of the server, RAID and hard disks, as it would not help anyone avoid such a problem: in my experience it can happen with any vendor.

We have addressed this problem in our iSCSI and NAS (NFS) Failover solution. If any IO-errors occur, the storage fails over transparently for applications.
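To illustrate why a ping-style check was not enough in the story above (ping still answered while the SQL application was frozen), here is a minimal Python sketch of a storage health probe that separates a reported IO-error from a hang. It is a hypothetical illustration only, not the implementation of the failover solution mentioned above; the names `write_probe` and `check_storage`, the result strings and the timeout value are all assumptions made for this example.

```python
import concurrent.futures
import os


def write_probe(path):
    """Write a few bytes to `path` and force them to the device (hypothetical probe)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, b"probe")
        os.fsync(fd)  # without fsync the write may only reach the page cache
    finally:
        os.close(fd)


def check_storage(probe, timeout_s=5.0):
    """Run `probe` in a worker thread and classify the outcome.

    Returns "ok", "io_error" (the probe raised OSError) or "hung"
    (the probe did not finish within `timeout_s` seconds).
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(probe)
    done, _ = concurrent.futures.wait([future], timeout=timeout_s)
    # Do not block on the worker: a thread stuck in disk I/O cannot be
    # cancelled from Python, so we only stop waiting for it.
    executor.shutdown(wait=False)
    if not done:
        return "hung"
    exc = future.exception()
    if exc is None:
        return "ok"
    if isinstance(exc, OSError):
        return "io_error"
    raise exc  # anything else is a bug in the probe, not a storage fault
```

A monitor could call something like `check_storage(lambda: write_probe("/mnt/data/.probe"))` periodically (the path is made up) and treat "hung" and "io_error" differently when deciding whether to fail over. Note that a worker thread stuck on a truly blocked device may keep the Python process from exiting cleanly, so a real monitor would more likely run the probe in a separate process.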

If you need real business continuity, please consider HA cluster systems, protect your data with a professional backup plan, and remember: “Hardware failure is not an exception, it is the norm”.


2 Comments

Toni B.
June 04, 2012 08:36:38
And Friday is the norm, too! Usually at 3 PM.
Can this particular problem appear with any type of RAID (10, 5, 6)?
Stefan
June 13, 2012 02:39:14
Well, Toni’s question is not well thought through. As Janusz attributes the problem to the hard disk, the RAID level does not matter at all. Even a single system drive may well be in an even better position to hang the system, considering that the filesystem driver may take the root filesystem offline because of a non-blocking read error.
Still, I fear this is not the complete truth. Disks may well present non-fatal I/O-errors to the OS, so it may be a bad idea to trigger failover on every I/O-error. I personally think one should consider finer-grained decision making based on the type of I/O-error.
Also you should consider: if a disk completely blocks the bus, the system (in this case the primary) needs to find out what exactly happened to it, which takes time. Most probably the heartbeat will keep working during this time, so the secondary won’t notice that the primary is experiencing problems. This is a very serious problem, as the time from “a disk starts to block” until “the OS recognizes the blocking” may be enough to kill virtual machines hosted on the storage. If one were to add functionality to detect such blocking, a great deal of care would be needed to make sure that split brains do not occur. I personally think a disk that dies while blocking the bus is a real worst-case scenario. It happens very rarely. Implementing good protection against it may not be worth the development time. Trying to solve this by triggering failover on any I/O-error is counterproductive.