What happens when a RAID rebuild goes wrong, and what can you do to keep it from happening to you? To answer that question, you need to understand how data is written to a RAID array and what happens when a drive fails and a rebuild starts. I am going to use a Windows NTFS volume and a four-drive RAID 5 array as the example system.
Windows splits the volume into metadata and user data. The figure below shows a simplified view of a contiguous NTFS volume on a single hard drive. The metadata is represented in blue and the user data in green.
Now let’s say we want to protect our data by using a RAID 5 array. To understand what this does to the data and how it is protected, we need to take a closer look at the RAID 5 array. When a RAID 5 array is created, the RAID controller breaks the array into chunks of data we will call stripes. Each stripe uses all of the disks in the array. For each stripe of data, the controller also adds some redundancy called parity. An empty RAID 5 array is illustrated in the figure below. The data stripe is in yellow and the parity is in orange.
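To make the stripe layout concrete, here is a minimal Python sketch. It assumes the parity block is the byte-wise XOR of the data blocks in the stripe, which is the usual RAID 5 scheme, and that parity moves one disk to the left on each new stripe, which matches the example used later in this article. Real controllers differ in block size and rotation order, and the names `xor_blocks` and `build_stripe` are made up for this illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def build_stripe(data_blocks, stripe_index, num_disks=4):
    """Lay out one stripe: num_disks - 1 data blocks plus one parity block.

    Assumed rotation: parity sits on the last disk for stripe index 0 and
    moves one disk to the left for each following stripe.
    """
    if len(data_blocks) != num_disks - 1:
        raise ValueError("need exactly num_disks - 1 data blocks")
    parity_disk = (num_disks - 1 - stripe_index) % num_disks
    stripe = list(data_blocks)
    stripe.insert(parity_disk, xor_blocks(data_blocks))
    return stripe, parity_disk


# Stripe index 0 is "stripe 1" in the article: parity lands on HDD 4 (index 3).
stripe, parity_disk = build_stripe([b"\x01\x02", b"\x04\x08", b"\x10\x20"], 0)
print(parity_disk, stripe[parity_disk].hex())
```

With four disks, each stripe holds three blocks of data and one block of parity, so a quarter of the raw capacity goes to redundancy.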
When our RAID array is formatted NTFS, the data for the NTFS volume is striped across the disks.
You might say to yourself, that’s great, but how does it protect my data, and what are the pitfalls to avoid? Well, in the event of a drive failure, the RAID controller can use the information stored in parity to rebuild the data from the missing drive.
In our example, if HDD 1 fails, then the RAID controller can use the parity for each individual stripe to rebuild what is missing. In stripe 1, the controller would use the data from HDD 2 and HDD 3 and the parity from HDD 4 to rebuild the missing metadata from HDD 1. For stripe 2, the controller would use the data from HDD 2 and HDD 4 and the parity from HDD 3 to rebuild the missing metadata from HDD 1.
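The rebuild described above can be sketched in a few lines of Python. This assumes XOR parity, the usual RAID 5 scheme; the block contents and the helper names `xor_blocks` and `rebuild_missing` are invented for the illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe, failed_disk):
    """Recover the block on failed_disk by XORing every surviving block."""
    survivors = [block for i, block in enumerate(stripe) if i != failed_disk]
    return xor_blocks(survivors)


# Stripe 1: data on HDD 1-3, parity on HDD 4.
stripe1 = [b"META", b"DAT2", b"DAT3"]
stripe1.append(xor_blocks(stripe1))

# Stripe 2: data on HDD 1, 2 and 4, parity on HDD 3.
data2 = [b"MET2", b"DAT5", b"DAT6"]
stripe2 = [data2[0], data2[1], xor_blocks(data2), data2[2]]

# HDD 1 (index 0) fails; each stripe is rebuilt from the three survivors.
print(rebuild_missing(stripe1, failed_disk=0))
print(rebuild_missing(stripe2, failed_disk=0))
```

The same function works no matter which disk holds the parity, because XORing all surviving blocks in a stripe always yields the missing one.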
When a RAID is working as designed, it will efficiently protect your data when a hard drive fails. Now let’s look at a couple of scenarios where data can still be damaged if these RAID systems are not used appropriately.
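In the scenario below we also have a single drive failure. Normally a RAID controller would handle this failure as shown above. However, data can be lost if the wrong type of RAID rebuild occurs, such as when the controller recalculates parity from the drives as they are instead of rebuilding the contents of the replacement drive.
In the example above, the controller does not reconstruct HDD 1 from parity. Instead, it recalculates the parity from whatever is currently on the drives, including the blank replacement drive. In stripe 1, the parity on HDD 4 is overwritten with a value computed from the data on HDD 2 and HDD 3 and the zeroed data on the new HDD 1. The old parity was the only remaining record of what HDD 1 held, so once it is overwritten the array looks consistent again, but the metadata that lived on HDD 1 is gone.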
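The difference between the two kinds of rebuild can be shown with a small Python sketch. It assumes XOR parity and a replacement drive filled with zeros; the block contents and function names are invented for the illustration and do not model any particular controller.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def correct_rebuild(hdd2, hdd3, parity):
    """Reconstruct HDD 1 from the surviving data and the ORIGINAL parity."""
    return xor_blocks([hdd2, hdd3, parity])


def parity_only_rebuild(new_hdd1, hdd2, hdd3):
    """Overwrite parity using whatever the replacement drive contains."""
    return xor_blocks([new_hdd1, hdd2, hdd3])


hdd1, hdd2, hdd3 = b"META", b"DAT2", b"DAT3"
parity = xor_blocks([hdd1, hdd2, hdd3])  # stored on HDD 4

blank = bytes(len(hdd1))                 # zeroed replacement for HDD 1

good = correct_rebuild(hdd2, hdd3, parity)
bad_parity = parity_only_rebuild(blank, hdd2, hdd3)

# After the bad rebuild, parity "proves" that HDD 1 should contain zeros.
after_bad = xor_blocks([hdd2, hdd3, bad_parity])
print(good, after_bad)
```

After the parity-only rebuild, every consistency check on the array passes, which is why this kind of damage often goes unnoticed until the file system fails to mount.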
How can you prevent this from happening and what can you do if this happens to you? The best way to prevent data loss is to create sound backups. Test them often to ensure that if you have a drive failure, your backups will help you to recover from a failed RAID rebuild. In the event the RAID array goes into a degraded mode, stop all activity on the volume and take a backup immediately to prevent data loss if a second drive fails and takes down the entire array. If you are unable to take a backup, then clone or image all of the disks before rebuilding the array. These images will preserve the data on the disks in the event the rebuild fails, allowing for a full recovery of critical data.
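As a rough illustration of what imaging a disk involves, the Python sketch below copies a source to an image file in fixed-size chunks and records a SHA-256 hash so the copy can be verified later. The function name and the example paths are hypothetical, and the sketch assumes the source reads without errors. A drive that is actually failing needs a dedicated imaging tool that can retry and skip bad sectors.

```python
import hashlib


def image_disk(source_path, image_path, chunk_size=1024 * 1024):
    """Copy source_path to image_path in chunks.

    Returns (bytes_copied, sha256_hex). Read errors are NOT handled here.
    """
    digest = hashlib.sha256()
    copied = 0
    with open(source_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            digest.update(chunk)
            copied += len(chunk)
    return copied, digest.hexdigest()


# Hypothetical usage; reading a raw device normally needs administrator rights:
# size, checksum = image_disk(r"\\.\PhysicalDrive1", r"E:\images\hdd1.img")
```

Image every member of the array, not just the failed drive, and write the images to storage that is not part of the array being rebuilt.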
If you are unable to take a backup (or your backups are not usable) and your rebuild fails, there is still hope. Working with a data recovery company can make recovery possible in a case like this. A good data recovery company will ask for the failed disks to be sent to its lab. Once the disks are received, the company should image all of them, including the failed disk. Make sure the company you choose has a Class 100 clean room for this type of work. Once the disks are imaged, the company should be able to reassemble the array, check for logical volume corruption, repair the damage, and then recover the data. Be wary of companies that request the RAID controller and hardware to assist with the recovery. Unless you have a unique system or situation, this is often a sign of an inexperienced data recovery company that will put your data at risk.