Tuesday, August 30, 2011

RAID-5 Data Recovery Depends on Doing This, Not That

In many cases that eventually come before a data recovery lab, the essential information for a small business, a school district or an organization is stored centrally on a black or silver box in some dim room filled with whirring fans, colorful cables and little blinking lights. The box itself may say something like Buffalo DriveStation, Seagate BlackArmor, or Western Digital ShareSpace, but in any case, it is often a RAID-5 storage device that is trusted with holding the network's data.
Most RAID devices were designed and built for data reliability. RAID stands for Redundant Array of Inexpensive Disks (or later Independent Disks). And the technology, which was developed at the University of California, Berkeley, allows enough redundancy of the data that if one hard drive in the system fails, the information it contained can be reconstructed from the remaining hard drives.
RAID-5 devices not only offer this redundant data reliability, they spread the data in such a way that it can be read and used faster. In terms of hard drive technology, they are high performance machines, offering both greater speed and greater resistance to data loss.
But they have their limits. The cases that come to data recovery labs are widely varied, but for the sake of example, let's say the read/write heads of one of the drives in a RAID-5 device can no longer detect the magnetic rails that would guide them along the tracks of data, so they instead click back and forth uselessly. Because of the clever redundancy built into the device, the machine can keep limping along, running in a degraded state as it reconstructs the failed drive's data by logical analysis of the remaining drives. Perhaps some people notice that things aren't as quick as before, but this situation persists for months until a second drive burns out the chip that controls its drive motor.
Now, the IT person in charge of the network is in a panic, since none of the data is accessible. In the repair efforts that ensue, a new hard drive is inserted into the device, a remaining drive is reformatted, and all the hard drives are taken out of the device and put back in the wrong order. Meanwhile, a whole organization's most important data is inaccessible. It's a horrible situation.
At this point no one, obviously, wants to hear about what would have been the best solution - which is prevention by automatic remote data backup. So it's off to the recovery lab in the desperate hope that what was lost can be recovered.
If you are in the cold-sweat-inducing stages of data loss on your RAID-5 system, there is some useful advice, provided it's still timely. First, if your data is not accessible, you should never rebuild the array; this will not repair anything. It will take the current state of affairs and make it permanent. Another common mistake is to force drives back online after observing that only one of three drives or two of four drives are up. The RAID controller took these drives offline for a reason. They're probably failed drives. When you force these drives online, data on the healthy drives likely will be corrupted. Worse, file system repair utilities will start seeing this mess and will start "repairing" all recent data. The effect is that the most critical data on the healthy drives will be gone. The best thing to do when your RAID-5 fails is to back out of that dim room with the blinking lights and the colorful cables and call a recovery lab.
At the few top data recovery labs in the country, engineers and computer scientists have pioneered techniques for RAID recovery cases. Successful RAID-5 recoveries depend on reassembling the logical structure of the file system, which is necessary to get meaningful data back from a failed RAID device.
In the example above, after replacing the damaged read/write heads and calibrating them to read the platters, they would create full binary copies of all the drives in the system. They would look at each drive independently with a binary hex editor, which shows where the 1s and 0s lie, to determine how the data was being divided or striped among the drives and in what order. Each RAID controller is different, and it's a logic puzzle to determine how the data was being handled and what the file structure was before the system failed.
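To make the striping puzzle concrete, here is a minimal Python sketch. It assumes one particular layout out of the many that controllers use: parity rotates from the last disk toward the first, one stripe at a time, and data chunks fill the remaining disks left to right. The function names, the chunk size and the toy volume are all hypothetical, chosen only for illustration; a real controller's layout would have to be worked out from the drive images themselves.

```python
from functools import reduce
from operator import xor

def stripe(volume, n_disks, chunk):
    """Toy RAID-5 writer: split a volume into per-disk images (assumed layout)."""
    per_stripe = (n_disks - 1) * chunk
    disks = [bytearray() for _ in range(n_disks)]
    for s, start in enumerate(range(0, len(volume), per_stripe)):
        row = volume[start:start + per_stripe].ljust(per_stripe, b"\0")
        chunks = [row[i:i + chunk] for i in range(0, per_stripe, chunk)]
        parity = bytes(reduce(xor, col) for col in zip(*chunks))
        parity_disk = (n_disks - 1 - s) % n_disks  # parity rotates right to left
        data_chunks = iter(chunks)
        for d in range(n_disks):
            disks[d] += parity if d == parity_disk else next(data_chunks)
    return [bytes(d) for d in disks]

def destripe(images, chunk):
    """Reassemble the logical volume from disk images, assuming the same layout."""
    n_disks = len(images)
    out = bytearray()
    for s in range(len(images[0]) // chunk):
        parity_disk = (n_disks - 1 - s) % n_disks
        for d in range(n_disks):
            if d != parity_disk:  # skip the parity chunk in each stripe
                out += images[d][s * chunk:(s + 1) * chunk]
    return bytes(out)

volume = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
images = stripe(volume, n_disks=4, chunk=4)

correct_order = destripe(images, chunk=4)        # disks in their original slots
wrong_order = destripe(images[::-1], chunk=4)    # disks put back in reverse
```

With the disks in their original slots the volume comes back intact; with the order reversed the same bytes come out scrambled, which is why the disk order and stripe layout have to be settled before anything else.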
It would be crucial to determine which drive failed first. As mentioned earlier, RAID-5 systems depend on a logic calculation to store their redundant data. It's called the "exclusive or" (XOR) binary operator. You might intuitively expect that if you had four disks full of data that you'd need another four disks to have a redundant copy. The "exclusive or" operator is a clever way to allow four disks to have their data redundantly stored on one disk. But to reconstruct data using this operator, it's necessary to understand in what order the disks failed and exactly how data was being written to them.
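The parity idea can be shown in a few lines of Python. This is only a sketch with made-up four-byte blocks standing in for the contents of four data disks; the names and values are hypothetical. It shows one lost block being recovered from the survivors plus parity, and then why a drive that dropped out early and went stale must not be used in the calculation.

```python
from functools import reduce
from operator import xor

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(xor, column) for column in zip(*blocks))

# Four hypothetical data blocks, one per data disk, plus their parity.
data = [b"ACCT", b"MAIL", b"DOCS", b"LOGS"]
parity = xor_blocks(data)

# Lose disk 2: XOR of the three survivors and the parity restores it.
rebuilt = xor_blocks([data[0], data[1], data[3], parity])

# Failure order: disk 0 drops out first, but the array keeps running degraded.
# A later write meant for disk 0 can only be recorded through the parity.
new_block0 = b"ACC2"
parity_now = xor_blocks([new_block0, data[1], data[2], data[3]])
stale_disk0 = data[0]  # what is physically still on the first failed drive

# Wrong: trusting the stale drive to rebuild disk 3 produces garbage.
wrong = xor_blocks([stale_disk0, data[1], data[2], parity_now])

# Right: ignore the stale drive and rebuild its current contents instead.
right = xor_blocks([data[1], data[2], data[3], parity_now])
```

Here `rebuilt` equals the lost block, `wrong` does not match what disk 3 actually held, and `right` recovers the up-to-date block that never reached the first failed drive.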
Only after all this analysis, and a correct diagnosis on the drive failure order, could data recovery experts begin to write the code that would rebuild this data system. They would then test their hypothesis by checking the integrity of a large recent file and proceed to reassemble all the pieces in the puzzle into one contiguous physical volume.
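One way to picture the integrity test is with a file format that carries its own checksums. The sketch below uses Python's standard zipfile module, whose `testzip()` method checks the stored CRC of each archive member. The archive here is built in memory as a stand-in for a large recent file pulled from the reassembled volume; the file name and contents are hypothetical, and this is an illustration of the idea rather than a description of any lab's actual tooling.

```python
import io
import zipfile

def zip_is_intact(blob):
    """Return True if every member of the ZIP archive passes its CRC check."""
    try:
        with zipfile.ZipFile(io.BytesIO(blob)) as archive:
            return archive.testzip() is None  # None means no bad members
    except zipfile.BadZipFile:
        return False

# Build a small stand-in archive in memory (hypothetical contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("report.txt", b"quarterly figures " * 100)
good = buffer.getvalue()

# Simulate a wrong reassembly guess by flipping one byte inside the file data.
damaged = bytearray(good)
damaged[100] ^= 0xFF
damaged = bytes(damaged)

good_ok = zip_is_intact(good)
damaged_ok = zip_is_intact(damaged)
```

If the layout hypothesis is right, a file like this passes its check; if chunks are out of order or taken from a stale drive, the check fails and the hypothesis needs another look.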
RAID-5 systems have a lot of appeal: speed, reliability and ease of use. Many organizations trust them to hold up the entire network's data without employing an automatic remote backup system. That puts a great deal of faith in the idea that your RAID-5 device will never fail. If it does, don't let panic complicate matters - there is good reason to hope for a successful RAID-5 data recovery with the right data recovery lab.
Lee Sensenbrenner is a former journalist and speech writer in Madison, Wis., with technical expertise and a background in math. He works in the data recovery industry.
