Virtualization technology dominates the enterprise landscape. According to Gartner, most firms report 75% or higher virtualization. Improvements in hypervisors have reduced the complexity of setting up and maintaining physical servers, greatly improved server utilization and increased IT flexibility and responsiveness to the needs of the business. It’s no wonder that the bulk of modern IT systems are virtualized. But, whether you use VMware, Hyper-V, Citrix, Oracle or any of the other hypervisors, there is a potential downside to virtualization.
The downside to virtualization
In order to transform a physical server into many virtual machines (VMs), an additional software layer is added. While simplifying the admin user experience, virtualization raises the overall complexity of the IT environment as the underlying hardware is obfuscated, making it more difficult for admins to know which physical system their VMs are running on or which storage is used for a particular machine in the event of data loss. With fewer people to maintain and monitor a larger number of virtual machines (compared to physical servers), there are greater chances for problems and data loss.
The main reasons for data loss in virtual environments
Data gathered across the globe by Ontrack Data Recovery reveals several causes of data loss incidents in virtualized environments. The leading reasons for virtual machine data loss are user error, ransomware, hardware failures, and RAID corruption. The purpose of this paper is to lay out what leads to virtual data loss and explain how global data recovery providers are able to resolve a high percentage of even the most challenging data loss situations in virtualized environments.
Hardware / RAID Issues
To help prevent data loss, modern systems will often replicate data across multiple physical drives (HDD or SSD) that are consolidated into a single logical unit. This data protection can be a hardware or software-based solution. RAID combines multiple drives and stripes data across them to improve redundancy, increase data reliability and boost I/O (input/output) performance. RAID effectively fragments data across many disks and reassembles it when requested by the user or needed by the system.
It takes a robust RAID system to keep track of everything and manage the data. The hardware problems facing virtual systems are basically the same as in physical systems, such as failing drives, failing controllers, failing server components and power issues. But RAID corruption is a far greater challenge with VMs due to the nature of virtualization. Unfortunately, data loss is not uncommon with RAID storage. Deduplication and compression add to the complexity of modern hardware and software RAID. Now factor in an additional virtualization layer and the likelihood of a fault increases. RAID controllers are responsible for mapping where all information resides across the many disks at their disposal. But if a RAID configuration becomes corrupted, files can’t be rebuilt. When that happens, the interconnectivity of multiple systems can potentially cause significant data loss and downtime.
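To make the dependence on RAID configuration concrete, here is a minimal Python sketch of striping with a dedicated parity disk. It is an illustration only, not how any particular controller works: the function names, the chunk size and the fixed parity disk are all assumptions chosen for brevity. It shows that a single lost disk can be rebuilt from the survivors, but that the data only reassembles correctly when the disk order and chunk size are known.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR several equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_with_parity(data, n_data_disks, chunk):
    """Stripe data round-robin across n_data_disks and append one parity disk.

    Hypothetical layout for illustration: parity always lives on the last disk.
    """
    stripe_size = n_data_disks * chunk
    padded = data + b"\x00" * (-len(data) % stripe_size)  # pad to whole stripes
    disks = [bytearray() for _ in range(n_data_disks + 1)]
    for offset in range(0, len(padded), stripe_size):
        chunks = [
            padded[offset + i * chunk: offset + (i + 1) * chunk]
            for i in range(n_data_disks)
        ]
        for i, piece in enumerate(chunks):
            disks[i] += piece
        disks[-1] += xor_blocks(chunks)  # parity for this stripe
    return [bytes(d) for d in disks]


def rebuild_disk(disks, lost_index):
    """Recompute one missing disk by XOR-ing all surviving disks."""
    survivors = [d for i, d in enumerate(disks) if i != lost_index]
    return xor_blocks(survivors)


def reassemble(data_disks, chunk, length):
    """Interleave chunks from the data disks, in the given order, back into a stream."""
    out = bytearray()
    for offset in range(0, len(data_disks[0]), chunk):
        for disk in data_disks:
            out += disk[offset: offset + chunk]
    return bytes(out[:length])


data = b"virtual machine disk image contents"
disks = stripe_with_parity(data, n_data_disks=3, chunk=4)

# A single failed disk can be rebuilt from the others.
recovered = rebuild_disk(disks, lost_index=1)

# With the right configuration the data comes back intact...
good = reassemble(disks[:3], chunk=4, length=len(data))
# ...but with the disk order wrong, the same bytes reassemble into garbage.
bad = reassemble([disks[1], disks[0], disks[2]], chunk=4, length=len(data))
```

In this toy model the raw bytes on every disk are undamaged in the second case; only the knowledge of how they fit together is wrong. That is the situation described above when a RAID configuration becomes corrupted.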
Formatting / Software Issues
Reformatting a disk, virtual disk (vDisk), array, LUN, volume or other storage media and re-installing software are additional causes of data loss in virtualized environments. Specific to VMs, for example, there can be reformatting at the Guest or Host level. Corruption can also come about due to buggy patches and updates applied without an offline backup, poorly planned implementation of new software, integration issues and database corruption.
These issues can also cause host file corruption and guest file system damage. Thin provisioning data loss, too, should be considered. Instead of allocating all the space the VM will need and positioning the file system structures at their specified physical offsets, thin provisioning only provisions the amount of space immediately needed and adds blocks to the virtual disk as it grows. This can result in a more complex and fragmented virtual environment on the disk. If the metadata pointers to the data are missing or damaged, it is challenging to locate the various fragments and rebuild the virtual disk. Alternatively, the mapping layer within the virtual disk may be damaged or overwritten, making reassembly extremely difficult.
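The role of the mapping layer can be sketched with a toy thin-provisioned disk in Python. This is a simplified model under stated assumptions, not the on-disk format of any real hypervisor: the ThinDisk class, its block_map dictionary and the 4-byte block size are hypothetical. Blocks are allocated in the backing store in the order they are first written, so the logical order exists only in the map.

```python
class ThinDisk:
    """Toy thin-provisioned disk: physical blocks are allocated on first write."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.backing = bytearray()  # physical storage, grows on demand
        self.block_map = {}         # metadata: logical block -> physical block

    def write_block(self, logical, payload):
        if len(payload) != self.block_size:
            raise ValueError("payload must be exactly one block")
        if logical not in self.block_map:
            # Allocate the next free physical block at the end of the backing store.
            self.block_map[logical] = len(self.backing) // self.block_size
            self.backing += bytes(self.block_size)
        start = self.block_map[logical] * self.block_size
        self.backing[start: start + self.block_size] = payload

    def read_block(self, logical):
        physical = self.block_map.get(logical)
        if physical is None:
            return bytes(self.block_size)  # unallocated blocks read as zeros
        start = physical * self.block_size
        return bytes(self.backing[start: start + self.block_size])

    def read_all(self, n_blocks):
        return b"".join(self.read_block(i) for i in range(n_blocks))


disk = ThinDisk()
# The guest writes logical blocks out of order.
disk.write_block(2, b"CCCC")
disk.write_block(0, b"AAAA")
disk.write_block(1, b"BBBB")

logical_view = disk.read_all(3)        # what the VM sees
physical_view = bytes(disk.backing)    # what is actually on the storage

# Simulate metadata loss: the data is still on disk, but the map is gone.
disk.block_map.clear()
after_loss = disk.read_all(3)
```

After the map is cleared, the backing store still holds every byte the guest wrote, yet the disk reads back as empty. Recovering it means working out which physical fragment belongs at which logical offset, which is why damaged metadata makes reassembly so difficult.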
Virtual File System Metadata Corruption
Yet another source of data loss is metadata corruption. Metadata is even more important in virtualized environments due to the number of layers and VMs that exist. A small problem with VMFS metadata can have serious repercussions on data availability.
User Error
Many of these sources of data loss can be categorized as user error by administrators. Access privileges allow admins the capability to delete VMs by mistake. But even if access rights are correctly managed, errors remain commonplace. A surprisingly large number of failures are due to virtual disks being deleted by mistake, VMs being overwritten or their space reassigned. There can also be snapshot chain corruption, i.e. one of a series of snapshots is corrupted, deleted or unavailable for some other reason. This can foul up backups and make it difficult to recover data.
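Why one bad snapshot affects everything after it can be shown with a small Python sketch. The data structure here is an assumption made for illustration (a dictionary of snapshots, each with a hypothetical "parent" reference and a set of changed "blocks"); real hypervisors store this differently. Each snapshot records only the blocks changed since its parent, so reading the current state means walking the whole chain back to the base disk.

```python
def resolve_chain(snapshots, leaf):
    """Return snapshot names from base to leaf, following 'parent' links.

    Raises LookupError if any link in the chain is missing.
    """
    chain = []
    seen = set()
    current = leaf
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at snapshot {current!r}")
        if current not in snapshots:
            raise LookupError(f"snapshot {current!r} is missing; chain is broken")
        seen.add(current)
        chain.append(current)
        current = snapshots[current]["parent"]
    return list(reversed(chain))


def read_block(snapshots, leaf, block):
    """Return the newest version of a block, searching from leaf back to base."""
    for name in reversed(resolve_chain(snapshots, leaf)):
        if block in snapshots[name]["blocks"]:
            return snapshots[name]["blocks"][block]
    return None  # block was never written


snapshots = {
    "base":  {"parent": None,    "blocks": {0: b"a", 1: b"b"}},
    "snap1": {"parent": "base",  "blocks": {1: b"B"}},
    "snap2": {"parent": "snap1", "blocks": {0: b"A"}},
}

chain = resolve_chain(snapshots, "snap2")
block1 = read_block(snapshots, "snap2", 1)  # newest copy lives in snap1

# Simulate the middle snapshot being deleted by mistake.
del snapshots["snap1"]
try:
    resolve_chain(snapshots, "snap2")
    chain_intact = True
except LookupError:
    chain_intact = False
```

In this model, removing the middle snapshot leaves the base disk and the newest snapshot untouched, but the current state of the VM can no longer be resolved because the link between them is gone.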
Ironically, the ease of use of modern hypervisors is causing organizations to invest in less training. Inexperienced staff are being handed responsibility for managing large and ever-growing virtualized environments. Small and mid-size managed service providers (MSPs) may not have a sufficient number of experienced staff to monitor virtual environments frequently enough to catch issues as they develop. In some cases, IT admins may fail to put adequate security measures in place on a database or omit the documentation of changes. If encryption is enabled and a volume is deleted, for example, data becomes difficult to recover. Employee turnover is another source of problems. A new incumbent who can’t figure out the intricacies of the virtualized architecture may inadvertently delete VMs or introduce changes that result in data loss.
Other cases
In other cases, the original flat-file may be stored but nobody can find it when data loss occurs. Neglect of backups, too, is a common reason for virtual data loss. And how about different storage, hypervisor and guest teams working in silos? One team might create a volume, another might attach the hypervisor and a guest admin then sets up the virtual machine. This type of organizational structure provides an opportunity for gaps and mistakes. Reformatting, overwriting and deletions can more easily occur. What can enterprises do, then, when they experience data loss from a virtualized environment? There is no back or undo button. A deleted VM is gone. Backups? They are often incomplete or corrupted.