The on-call IT tech is jolted awake from a terrible dream, his heart pounding. Lightning crashes overhead as he glances at the clock: 2:59 a.m. The server isn't down; it was just a dream.
3:00 a.m. The IT on-call pager goes off. This could mean any number of things: a fire, a break-in, a failed air-conditioner in the server room, or even a main business server crash.
3:25 a.m. The on-call IT tech arrives at the site and evaluates the situation. There is no fire, no evidence of a break-in, and the server room temperature reads a cool 18°C. A quick check of the servers shows that most of them are at a login screen. After checking two or three machines, it is obvious that the room lost power at some point. The UPS units confirm it: all three massive battery units are showing failures and heavy load percentages.
3:40 a.m. The on-call IT tech calls the lead technician and department manager and informs them of the situation; both are on their way to the site. They instruct him to check the main business application servers: one holds the company's customer database, payroll, and accounting system, and the other is the company's messaging server.
3:55 a.m. The on-call IT tech discovers that the RAID array for the business database server is not coming back online. The messaging server has rebooted, but the messaging application returns errors when it starts up. The tech realizes that the messaging server was performing incremental backups at the time of the outage and decides to leave that machine to the lead technician when he arrives.
4:00 a.m. The lead tech and manager arrive. Assessments of the other servers are made. The lead tech begins working with the messaging server; the on-call tech works with the failed RAID array. The firmware shows the array has failed; the controller only recognizes three of the ten drives. After a complete power down and restart of the server and drive enclosure, the firmware shows the drives are back online; however, the array is still shown as 'Failed'.
4:30 a.m. The on-call technician calls the RAID array manufacturer's technical support. The choices in the firmware menu are vague, and the IT tech wants to know if forcing the drives online will bring the array back. The manufacturer's technical support says that the array will come back; however, there is a slight possibility that the data on the volume may be corrupted. Technical support asks how recent the latest backup is. The IT tech responds that the backup is one week old, and that is unacceptable; they cannot lose a week of transactions. The IT tech hesitates in deciding what to do next…
Business system disasters like this happen every day. Despite redundant backup systems and storage arrays, failures occur. Some failures are hardware related, others are due to software, and still others are the result of human error or natural disaster.
More and more businesses rely on their corporate server infrastructure and document storage volumes. Some businesses rely completely on their database system, which may hold financial data, job tracking data, or customer contact data. For others, the messaging database is the critical business system; some telephone systems convert voice messages to email notifications, making the email messaging server part of the communication system. Today's systems are also storage systems for all of the documents that users create.
Common Scenarios of Server Data Disasters
Scenario: Windows 2000 workstation with severe damage to the partition/volume information; third-party recovery software had not worked, and the OS had been reinstalled, but the user was still looking for the second partition/volume. We found it; 100% recovery. Evaluation Time: 46 minutes.
Causes of Specific File Error Disasters
- Corrupted business system database; file system is fine
- Corrupted message database; file system is fine
- Corrupted user files
Scenario: Windows 2000 server; a volume repair tool damaged the file system, leaving the target directories unavailable. Complete access to the original files was critical. Remote Data Recovery safely repaired the volume and restored the original data; 100% recovery. Evaluation Time: 20 minutes.
Scenario: Exchange 2000 server with a severely corrupted Information Store; the cause of the corruption was unknown. We scanned the Information Store file for valid user mailboxes; results took up to 48 hours due to the corruption. The backup was one month old and not valid for the users. Evaluation Time: 96 hours (4 days).
Possible Causes of Hardware-Related Disasters
- Server hardware upgrades (Storage Controller Firmware, BIOS, RAID Firmware)
- Expanding Storage Array capacity by adding larger drives to controller
- Failed Array Controller
- Failed drive on Storage Array
- Multiple failed drives on Storage Array
- Storage Array failure but drives are working
- Failed boot drive
- Migration to new Storage Array system
Scenario: NetWare volume server, traditional NWFS; a failing hard drive made the volume inaccessible, and NetWare would not mount it. The errors on the hard drive were not in the data area, and the drive was still functional. We copied all of the data to another volume; 100% recovery. Evaluation Time: 1 hour.
Causes of Software-Related Disasters
- Business System Software Upgrades (Service Packs, Patches to Business system)
- Anti-virus software deleted or truncated a suspect file in error; the data was deleted, overwritten, or both
Scenario: Partial drive-copy overwrite using third-party tools; the overwrite started and then crashed 1% into the process. We found a large portion of the original data, rebuilt the file system, and provided reports on the recoverable data; the customer asked us to test some files to verify the quality of the recovery. Evaluation Time: 1 hour.
Causes of User Error Disasters
- During a data loss disaster, restored backup data to the original location, thereby overwriting the very data that needed recovery
- Deleted files
- Overwrote the original data by reinstalling the OS or application software
Scenario: A user's machine had the OS reinstalled from a restore CD; the user was looking for an Outlook PST file. Because the original file system was completely overwritten, we searched the drive for PST data and found three files that might contain the user's data. Using PST recovery tools, we found that one of those files held the user's email; some messages were missing, but the majority of the messages and attachments came back. Evaluation Time: 5 hours.
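Recoveries like this come down to signature scanning: when the file system that pointed to the data is gone, the only option is to read the raw drive and look for known file headers. The sketch below is a minimal, hypothetical illustration of the idea, not Ontrack's tooling; it scans a raw disk image (the path "drive.img" is a placeholder) for the four-byte "!BDN" magic value that begins an Outlook PST file and reports candidate offsets.

```python
# Minimal signature scan for Outlook PST headers in a raw disk image.
# Hypothetical sketch: real recovery tools also validate the header
# structure and carve out file contents; this only locates candidates.
PST_MAGIC = b"!BDN"      # first four bytes of a PST file header
CHUNK = 4 * 1024 * 1024  # read 4 MB at a time
SECTOR = 512             # file data normally starts on a sector boundary

def find_pst_candidates(image_path):
    """Return byte offsets in the image where a PST header may start."""
    offsets = []
    with open(image_path, "rb") as img:
        pos, tail = 0, b""
        while True:
            chunk = img.read(CHUNK)
            if not chunk:
                break
            data = tail + chunk     # prepend leftover bytes so a match
            base = pos - len(tail)  # can span a chunk boundary
            idx = data.find(PST_MAGIC)
            while idx != -1:
                off = base + idx
                if off % SECTOR == 0:  # keep only sector-aligned hits
                    offsets.append(off)
                idx = data.find(PST_MAGIC, idx + 1)
            tail = data[-(len(PST_MAGIC) - 1):]
            pos += len(chunk)
    return offsets

if __name__ == "__main__":
    for off in find_pst_candidates("drive.img"):  # placeholder image path
        print(f"possible PST header at offset {off:#x}")
```

A real recovery tool would then validate each candidate header and carve out the file contents; the sector-alignment check is just a heuristic to cut false positives, since files normally begin on a sector boundary.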
Causes of Operating-System-Related Disasters
- Server OS upgrades (Service Packs, Patches to OS)
- Migration to different OS
Scenario: NetWare traditional volume, 2TB; the file system was damaged while trying to expand the size of the volume. Repaired on the drive; volume mountable. Evaluation Time: 4 hours.
Server Recovery Tips
Data disasters will happen; accepting that reality is the first step in preparing a comprehensive disaster plan. Time is always against an IT team when disaster strikes, so the details of the disaster plan are critical to success.
Here are some suggestions from Ontrack Data Recovery engineers on what to do, and what not to do:
- In a disaster recovery, never restore data to the server that lost it; always restore to a separate server or location.
- In Microsoft Exchange or SQL failures, never try to repair the original Information Store or database files; work on a copy (a copy-and-verify sketch follows this list).
- In a deleted data situation, power off the machine immediately rather than shutting down Windows; a normal shutdown writes to the disk and risks overwriting recoverable data.
- Use a volume defragmenter regularly.
- If a drive fails on a RAID system, never replace the failed drive with a drive that was part of a previous RAID system; always zero out the replacement drive before using it.
- If a drive is making unusual mechanical noises, turn it off immediately and get assistance.
- Have a valid backup before making hardware or software changes.
- Label the drives with their position in a RAID array (a simple slot-map sketch also follows this list).
- Do not run volume repair utilities on suspected bad drives.
- Do not run defragmenter utilities on suspected bad drives.
- In a power-loss situation with a RAID array, if the file system looks suspicious, is unmountable, or the data is inaccessible after power is restored, do not run volume repair utilities.
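The "work on a copy" rule above is easy to script. The following is a minimal sketch with hypothetical placeholder paths (the Exchange database and work-directory locations are examples only): it copies the database file to a separate volume and verifies the copy with a SHA-256 hash before any repair utility touches it.

```python
# Copy a database file before repair, then verify the copy bit-for-bit.
# Hypothetical sketch; all paths are placeholders. Repair tools should
# only ever be run against the verified copy, never the original.
import hashlib
import shutil
from pathlib import Path

def sha256_of(path, chunk=4 * 1024 * 1024):
    """Hash a file in chunks so large databases don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def copy_for_repair(original, work_dir):
    """Copy `original` into `work_dir` and verify the copy matches."""
    original, work_dir = Path(original), Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    copy_path = work_dir / original.name
    shutil.copy2(original, copy_path)  # copy2 preserves timestamps
    if sha256_of(original) != sha256_of(copy_path):
        raise RuntimeError("copy does not match original; do not proceed")
    return copy_path

if __name__ == "__main__":
    # Placeholder paths: the copy goes to a different volume entirely.
    work_copy = copy_for_repair(r"D:\exchsrvr\mdbdata\priv1.edb",
                                r"E:\recovery_work")
    print(f"run repair tools against: {work_copy}")
```

The hash check matters because a copy made through a flaky controller or off a failing drive may itself be damaged; verifying first means the original is never the file at risk.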
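The drive-labeling tip can be backed up digitally as well. As a small sketch (the array name and serial numbers below are invented placeholders), this snippet records each drive's slot position and serial number to a JSON file kept somewhere off the array, so the original drive order can be reconstructed if the drives are ever pulled or shuffled.

```python
# Record RAID slot positions and drive serial numbers to a file that is
# kept OFF the array itself. Hypothetical sketch; the array name and
# serial numbers are invented placeholders.
import json
from datetime import date

array_map = {
    "array": "DB-RAID5-01",                # invented array name
    "recorded": date.today().isoformat(),  # when the map was taken
    "slots": [
        {"slot": 0, "serial": "WD-XXXX0001"},  # placeholder serials
        {"slot": 1, "serial": "WD-XXXX0002"},
        {"slot": 2, "serial": "WD-XXXX0003"},
    ],
}

# Write the map somewhere that survives an array failure.
with open("raid_slot_map.json", "w") as f:
    json.dump(array_map, f, indent=2)
```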
Early last year, one client gave Ontrack Data Recovery this challenge: “We have a backup restoration going on right now and we need the data available as soon as possible. If you want the job, you have to beat the tape.” Recovery engineers worked the entire weekend and recovered more than 2TB of data before the start of the work week.
Summary and Conclusion
The fictional but true-to-life IT scenario at the beginning of this article illustrates the situations IT staff face and the decisions they must make. Without access to their data, businesses and institutions like yours risk losing millions in revenue every day. The fact is, today’s systems are relied on more than ever for consistent, available data.