Simplify File Recovery with Data Protection Manager
Microsoft IT Showcase and Laura Euler
At a Glance:
- Using DPM for backup at Microsoft
- The DPM architecture
- End-user data recovery
- Configuring backup and recovery
Most large companies use magnetic tape for backing up data. Tape is cheap and useful for long-term retention and off-site storage, but it can be inefficient and unreliable. It’s physically fragile and vulnerable to misalignments. It also takes a long time to back up data to a tape. Even when data has been backed up successfully, tapes are subject to loss, damage, or prolonged delivery times due to off-site storage.
Some industry analysts estimate that more than 40 percent of companies have had restorations fail because the data was not correctly written to tape, was corrupted, or was in other ways unusable. Verification procedures are often necessary to ensure that data has been properly backed up, but that adds more time and overhead.
Is there a better way? Hard disk space gets cheaper all the time, so it’s now more cost-effective to back up on disk and avoid tape backups altogether. Disks aren’t subject to the same pitfalls as tapes. They aren’t as fragile, they can be overwritten more easily, and they’re faster for accessing data. And management tools can make the task of keeping track of your data easier.
Microsoft® Data Protection Manager (DPM) 2006 is a server software application that optimizes disk-based backup and restoration. The DPM server routinely synchronizes with your production servers, capturing only the changes. This replication occurs at the byte level; instead of replicating an entire file when a change occurs, DPM replicates only the bytes that actually change within each file.
A test by the Microsoft Data Protection Services Group using a beta release demonstrated that DPM could protect a branch office with 300GB of data to back up every day through a single nightly synchronization session that lasted only 10 minutes. This session replaced the eight-hour tape backup session formerly required.
For users, DPM offers instant recoverability. If a user’s data becomes overwritten or lost, a backup copy is available without the need to request a tape. An employee can use Windows® Explorer or any Microsoft Office 2003 (or higher) application to right-click the file and initiate the restoration. If a catastrophic server failure occurs, the DPM image of the data can be set to read-only mode and shared over the network to provide users access to their data while the server is rebuilt.
Microsoft Case Study
With about 150 sites, including 115 branch offices around the world, Microsoft needed a backup and restoration solution that would enable it to centrally manage all remote locations. A big part of centrally managing support for remote locations is having the ability to restore data without requiring local personnel to pull a tape from a library and mount the media for restoration. Loading tape cannot be done from a central location. And a WAN-based solution didn’t work, either. It combined the unreliability of a tape backup with the slow transmission rates of the WAN.
Microsoft also wanted to reduce the cost of maintenance and repair support, as well as minimize backup errors. The monitoring tools used at Microsoft to verify tape-based backups identified about 16,000 errors each month for more than 5,000 servers. About 14,000 could be resolved via automation, but the remaining 2,000 had to be addressed manually. With at least 150 different error codes, resolving each one was time-consuming.
Yet another consideration was the size of the Microsoft datacenter servers. Some of these servers contain so much information that the Microsoft Data Protection Services Group found that tape was simply unable to protect all the information.
The solution was DPM. Microsoft IT configured its DPM servers to store between 14 and 21 days of backup data in the form of easily accessible snapshots—dedicating 180 terabytes of storage connected to the DPM servers to cover all 115 branch offices. Tape backups for long-term retention and off-site backup are made from the DPM servers, completely freeing the branch offices from the expense of managing tapes locally. The 12 DPM servers replaced 115 tape libraries, media servers, and associated infrastructure at these sites.
An architectural overview of a typical DPM installation can be found in Figure 1. The DPM server software is installed on a dedicated server running Windows Server™ 2003 Service Pack 1 (SP1) or Microsoft Windows Storage Server 2003 SP1. Windows Server 2003 R2 also supports DPM. The DPM setup process also installs components of SQL Server 2000 (the DPM product includes a restricted license of SQL Server for use only with DPM).
Figure 1 Typical DPM Installation
An Active Directory® service domain is required for the discovery of servers and to maintain the security settings of files and folders through access control lists. The Active Directory schema also holds the configuration settings required for the user recovery client to retrieve shadow copies from the DPM server. The DPM server must be in the same domain as the servers being protected.
The DPM agent software is installed on each protected server; it captures and logs all the changes made to the file system. These servers may run Windows 2000 (SP4 with the update rollup), Windows Server 2003, Windows Storage Server 2003, Windows Server 2003 R2, or Windows Storage Server 2003 R2. The agent is installed from within the DPM administrator console.
The clients run Windows XP or Windows Server 2003. The Previous Versions Client and Volume Shadow Copy Service (VSS) on these systems enable users to access and recover previous versions of their files. (Note that enabling end-user recoveries via DPM requires an update for computers running Windows XP SP2 and Windows Server 2003 SP1. For more information, see "An update is available to optimize the way that the Shadow Copy Client accesses shadow copies in Windows Server 2003 and in Windows XP.")
You might have noticed the final tape backup solution in Figure 1. This last step is optional, but for off-site archiving or long-term retention (especially for legal and financial purposes), tape is still the most cost-effective medium. However, for fast backups and restorations of recent data, it’s more efficient to restore directly from the DPM disk.
Synchronization is the process by which DPM transfers only the changes made on file servers to the DPM server and applies those changes to the replica of protected data. DPM updates data at the byte level within protected files. File operations such as renaming, deleting, and creating also are replicated for protected files.
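To make the idea of byte-level updates concrete, here is a minimal Python sketch. It is not how DPM works internally (the DPM agent logs changes as they happen rather than comparing file versions); the function names `diff_bytes` and `apply_changes` are hypothetical and exist only to show that a replica can be brought up to date by shipping the changed byte ranges instead of the whole file.

```python
from typing import List, Tuple

# A change is (offset, replacement bytes). Hypothetical format, for illustration only.
Change = Tuple[int, bytes]


def diff_bytes(old: bytes, new: bytes) -> List[Change]:
    """Return the byte ranges of `new` that differ from `old`."""
    changes: List[Change] = []
    i = 0
    n = len(new)
    while i < n:
        if i < len(old) and old[i] == new[i]:
            i += 1
            continue
        # Start of a changed run; extend it until the bytes match again.
        start = i
        while i < n and (i >= len(old) or old[i] != new[i]):
            i += 1
        changes.append((start, new[start:i]))
    return changes


def apply_changes(replica: bytes, changes: List[Change], new_length: int) -> bytes:
    """Apply changed ranges to a replica, resizing it to `new_length` first."""
    buf = bytearray(replica[:new_length].ljust(new_length, b"\0"))
    for offset, data in changes:
        buf[offset:offset + len(data)] = data
    return bytes(buf)


old = b"hello world, this is a file"
new = b"hello WORLD, this is a file!"
changes = diff_bytes(old, new)
sent = sum(len(data) for _, data in changes)
print(f"{sent} changed bytes sent instead of {len(new)}")
```

In this example only six bytes cross the wire, and applying them to the old replica reproduces the new file exactly.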
DPM synchronization is asynchronous, meaning that synchronization occurs without blocking disk I/O on the protected objects. Changes are stored in the agent synchronization log and then replicated across the network to the DPM server, where they are stored in the transfer log. Data in the transfer log is later used to construct complete replicas that can be used to create a shadow copy of the protected data.
The Windows Server 2003 VSS creates consistent point-in-time copies of data known as shadow copies. After the DPM agent has synchronized the changed data with the DPM server, VSS creates replicas within the DPM server. Users can then browse and recover copies of deleted or corrupted files from various points in time. Eventually old shadow copies are deleted—as soon as the size of all shadow copies reaches either a configurable maximum or 64 shadow copies per volume (whichever occurs first).
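The retention rule described above (discard the oldest shadow copies once a size cap or the 64-copy limit is reached) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not VSS or DPM code: the `ShadowCopyStore` class and its fields are hypothetical, and the sketch assumes the newest copy is always kept even if it alone exceeds the size cap.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List

MAX_SHADOW_COPIES_PER_VOLUME = 64  # per-volume limit cited in the article


@dataclass
class ShadowCopy:
    label: str
    size_bytes: int


class ShadowCopyStore:
    """Hypothetical model of one volume's shadow copies, oldest first."""

    def __init__(self, max_total_bytes: int) -> None:
        self.max_total_bytes = max_total_bytes  # the configurable maximum
        self.copies: Deque[ShadowCopy] = deque()

    def total_bytes(self) -> int:
        return sum(c.size_bytes for c in self.copies)

    def add(self, copy: ShadowCopy) -> List[ShadowCopy]:
        """Add a copy, then evict the oldest copies until both limits hold."""
        self.copies.append(copy)
        evicted: List[ShadowCopy] = []
        while len(self.copies) > 1 and (
            len(self.copies) > MAX_SHADOW_COPIES_PER_VOLUME
            or self.total_bytes() > self.max_total_bytes
        ):
            evicted.append(self.copies.popleft())
        return evicted


store = ShadowCopyStore(max_total_bytes=10**12)
for i in range(70):
    store.add(ShadowCopy(label=f"copy-{i}", size_bytes=1))
print(len(store.copies), store.copies[0].label)
```

After 70 copies are added, the six oldest have been dropped and the store holds the most recent 64.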
Prior to VSS, there was no standard way to produce uncorrupted snapshots of a volume, and you had to repair corruptions caused by interrupted writes with tools such as Chkdsk.exe. VSS prevents incomplete writes by enabling applications to flush partially committed data from memory.
Allowing your users to selectively restore their own data improves productivity and satisfaction. Instead of depending on someone from IT to locate and mount the appropriate backup tape, the user just navigates to the DPM file share and downloads the lost information. End-user recovery also lowers operating costs because help desk staff and administrators aren’t involved. Various industry studies have found that more than 90 percent of all tape restores are for single files, and nearly the same percentage are for files that are less than 14 days old. So it makes sense to let your users recover their own data.
End users can recover files by browsing through shadow copies on the DPM server, either by using Windows Explorer or by using the Recover Previous Version command on the Microsoft Office 2003 Tools menu (see Figure 2).
Figure 2 Folder Recovery
Because DPM is disk based, it can take advantage of a redundant array of independent disks (RAID) for protection, unlike tape, which can be a single point of failure. IT administrators can easily browse through the server running DPM to confirm that the shadow copy has been made, and they can check and verify the backup online. Being able to confirm the redundancy of data without continually restoring from tape may also benefit auditors doing compliance checks in regard to business continuity or disaster recovery planning.
DPM monitors its backup jobs to ensure that they complete without error. If an error is detected, DPM applies a two-stage error-correction process. First, DPM automatically validates the replica against the production server to ensure that the replication is consistent and has occurred as planned. Second, if inconsistencies between a data source and its replica are found, a fix-up activity resends the affected object or objects from the data source to the replica.
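The validate-then-fix-up pattern can be sketched as follows. The article does not say how DPM compares a replica with its data source, so the use of SHA-256 digests here is an assumption, and the functions `find_inconsistencies` and `fix_up` are hypothetical names. Files are modeled as a simple mapping from path to contents.

```python
import hashlib
from typing import Dict, List


def digest(data: bytes) -> str:
    """Content fingerprint used for comparison (an assumption, not DPM's method)."""
    return hashlib.sha256(data).hexdigest()


def find_inconsistencies(source: Dict[str, bytes], replica: Dict[str, bytes]) -> List[str]:
    """Stage 1: list paths that are missing, stale, or different on the replica."""
    paths = set(source) | set(replica)
    return sorted(
        p for p in paths
        if p not in source
        or p not in replica
        or digest(source[p]) != digest(replica[p])
    )


def fix_up(source: Dict[str, bytes], replica: Dict[str, bytes]) -> List[str]:
    """Stage 2: resend each inconsistent object from the source to the replica."""
    repaired = find_inconsistencies(source, replica)
    for path in repaired:
        if path in source:
            replica[path] = source[path]   # resend the object
        else:
            del replica[path]              # object no longer exists on the source
    return repaired


source = {"reports/q1.doc": b"final", "reports/q2.doc": b"draft 2"}
replica = {"reports/q1.doc": b"final", "reports/q2.doc": b"draft 1", "old.tmp": b"x"}
print(fix_up(source, replica))
```

Only the objects that fail validation are resent, so a single damaged file does not force a full resynchronization of the replica.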
With tape-based systems, about 40 percent of the IT staff’s time is spent monitoring and correcting backup operations. DPM eliminates this chore because it automatically validates and reworks the replica to help ensure consistency with the production server.
You can deploy DPM to perform a nightly synchronization and shadow copy; you can also schedule nearly continuous protection. The scheduling engine supports a wide range of protection options and frequencies.
In its default settings, DPM provides hourly synchronizations so that even complete server failure results in less than an hour of data loss. With shadow copy scheduled at regular points throughout the day, users can roll back to a previous copy of data that is less than two hours old—instead of being limited to the previous night’s tape. Other organizations might configure for less frequent synchronizations and perhaps only daily shadow copies. With 64 instances per volume, a once-a-day schedule would enable more than two months of recovery capability from disk.
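The trade-off between shadow-copy frequency and recovery history is simple arithmetic, shown here as a short Python sketch. The function name `recovery_window_days` is hypothetical, and the sketch assumes the 64-copy limit is reached before any size cap on shadow-copy storage.

```python
MAX_SHADOW_COPIES_PER_VOLUME = 64  # per-volume limit cited in the article


def recovery_window_days(copies_per_day: float) -> float:
    """Days of history retained if every copy is kept until the 64-copy limit."""
    if copies_per_day <= 0:
        raise ValueError("copies_per_day must be positive")
    return MAX_SHADOW_COPIES_PER_VOLUME / copies_per_day


for per_day in (1, 2, 3):
    print(f"{per_day} per day: about {recovery_window_days(per_day):.1f} days of history")
```

One copy a day yields 64 days of history, which is the "more than two months" figure above; each additional daily copy shortens the window proportionally.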
DPM essentially eliminates having to schedule backup windows because it logs and replicates byte-level changes to the files on the production servers. This also makes backups more efficient. Because DPM captures changes as they happen instead of copying entire files, DPM backups place less load on production servers than conventional backup tools that have to copy entire files if even a single byte changes. (Note: the DPM administrator console provides a throttling feature that enables you to define the maximum amount of total bandwidth that the backup procedure can use.) Collapsing the backup window from hours to minutes also lets you use fewer servers than you’d need using slower tape backups.
DPM also provides an alternative to tape backup for servers that hold so many millions of files or terabytes of data that attempting to use a tape-based backup solution would not be practical. Full tape backups take more than 24 hours on such systems, and they are often unsuccessful due to the millions of files and folders that need to be cataloged. On these systems, even incremental backups are problematic, due simply to the sheer volume of files.
The IT department at Microsoft found that the tape-based solutions it tried with a 10-million-file server failed consistently because crawling through that directory structure took so long. They also found that as the backup length increases, so do the odds of encountering a drive failure or some other issue that causes the backup to fail. Because DPM captures the data changes as they occur, no crawling of the directory structure is needed. Using this approach, Microsoft was able to successfully protect these very large servers within the already existing backup window.
Selecting the Data
After you select the data you want to protect, you can plan how to collect the data in protection groups. A protection group is a set of volumes, folders, or shares under the same protection policy. Items included are called members. The protection policy contains both the synchronization schedule and the shadow-copy schedule.
The key consideration in creating protection groups is tolerance for data loss. Typically, data with a relatively low loss tolerance is in one protection group, data with a relatively high loss tolerance in another, and medium loss-tolerance data in another. A given protection group may contain members from different volumes and servers, although each protected object can be associated with only one protection group, and only one protection group can protect data sources on any single volume.
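The two membership rules can be checked mechanically when you plan protection groups on paper. The following Python sketch is a planning aid, not part of DPM: the `Member` type, the `validate_groups` function, and the server and group names are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Member(NamedTuple):
    server: str
    volume: str
    path: str


def validate_groups(groups: Dict[str, List[Member]]) -> List[str]:
    """Return a list of rule violations for a proposed set of protection groups."""
    problems: List[str] = []
    owner: Dict[Member, str] = {}
    groups_by_volume = defaultdict(set)
    for group_name, members in groups.items():
        for m in members:
            # Rule 1: a protected object belongs to only one protection group.
            previous = owner.setdefault(m, group_name)
            if previous != group_name:
                problems.append(
                    f"{m.path} on {m.server} is in both {previous} and {group_name}"
                )
            groups_by_volume[(m.server, m.volume)].add(group_name)
    # Rule 2: only one protection group may protect data on any single volume.
    for (server, volume), names in sorted(groups_by_volume.items()):
        if len(names) > 1:
            problems.append(
                f"volume {volume} on {server} is protected by multiple groups: "
                + ", ".join(sorted(names))
            )
    return problems


plan = {
    "LowTolerance": [Member("FS01", "D:", "\\Finance"), Member("FS02", "E:", "\\Legal")],
    "HighTolerance": [Member("FS01", "D:", "\\Scratch")],
}
for problem in validate_groups(plan):
    print(problem)
```

In this plan, volume D: on FS01 is split across two groups, so the check reports it; moving \Scratch to another volume or into LowTolerance would resolve the conflict.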
Although the complete data of a server can be easily restored through DPM, special steps are required to enable DPM to support recovery that includes restoring the operating system and system-state data. To accomplish this, DPM recommends that you back up the protected server using the backup feature of that server’s OS (meaning the backup utility included with Windows 2000 Server or Windows Server 2003).
You can use these backups to restore the server to a bootable state. Store them on a volume that is added to a protection group on the DPM server. When you require a recovery, retrieve the system state data from the DPM server and write the data to restore media. You can then use the media with the appropriate backup tool to restore the system to a bootable state. Then you can restore protected data to the server using the normal DPM workflow.
Allocating Disk Space
Obviously, the amount of disk space that an organization allocates on the DPM server will directly affect how extensive its backup history will be. The set of available storage for replicas and shadow copies is known as the storage pool. DPM enables an organization to distribute a single server’s storage pool across multiple disks, adding more space to the pool when necessary.
The default DPM cumulative space allocation for a given protected volume includes the following:
- A replica allocation of the smaller of 1.5 times the space used on the protected volume or the total volume capacity (with a minimum of 1.5GB). For a 10GB volume that is 40 percent full, this allocation equals 6GB.
- A shadow-copy allocation of 20 percent of the size of the replica allocation (with a minimum of 550MB).
- A transfer log allocation of 1.4 times the synchronization log space allocated on the protected server (with a minimum of 700MB).
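The default allocations listed above can be expressed as a short Python sketch. This is not the DPM calculator; the function names are hypothetical, sizes are in bytes, and the sketch assumes each minimum is applied as a floor after the main formula.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def replica_allocation(used_bytes: float, capacity_bytes: float) -> float:
    """Smaller of 1.5 x used space or the volume capacity, with a 1.5GB floor."""
    return max(min(1.5 * used_bytes, capacity_bytes), 1.5 * GB)


def shadow_copy_allocation(replica_bytes: float) -> float:
    """20 percent of the replica allocation, with a 550MB floor."""
    return max(0.20 * replica_bytes, 550 * MB)


def transfer_log_allocation(sync_log_bytes: float) -> float:
    """1.4 x the synchronization log space on the protected server, 700MB floor."""
    return max(1.4 * sync_log_bytes, 700 * MB)


# The article's example: a 10GB volume that is 40 percent full.
replica = replica_allocation(used_bytes=4 * GB, capacity_bytes=10 * GB)
shadow = shadow_copy_allocation(replica)
print(f"replica: {replica / GB:.1f}GB, shadow copies: {shadow / GB:.1f}GB")
```

For the 10GB example, the replica allocation comes to 6GB, matching the figure above, and the shadow-copy allocation to 1.2GB. The transfer log depends on the synchronization log size configured on the protected server, which the article does not specify.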
In addition, DPM includes a calculator that can more precisely estimate the amount of space required to protect a given volume of data, and you can always manually specify space allocations either when creating the protection group or afterward.
Figure 3 shows actual DPM server utilization on the Microsoft corporate network. As you can see, the DPM server that protects the most branch offices is DPM #2, which is located in the Redmond datacenter and uses 1.55 terabytes of its allocated 3.69 terabytes of storage. The 12 datacenter DPM servers protect a total of 149 servers and hold 20.79 terabytes of data; the smallest DPM server has a 3-terabyte capacity, and the largest has a 10-terabyte capacity.
Figure 3 Group Utilization by DPM Server
|Data Center DPM Server|Branch Servers Protected|Disk Capacity (in terabytes)|Disk Allocated (in terabytes)|Disk Used (in terabytes)|
|Dublin DPM #1
|Dublin DPM #2
|Dublin DPM #3
|Dublin DPM #4
|Dublin DPM #5
|Redmond DPM #1
|Redmond DPM #2
|Redmond DPM #3
|Redmond DPM #4
|Singapore DPM #1
|Singapore DPM #2
|Singapore DPM #3
Figure 4 shows the average number of files and the amount of total data stored on branch servers. Some protected servers hold more than 6 million files and have an average disk capacity of 218GB.
Figure 4 Average Branch Office Server Workload
|Average number of files
|Maximum number of files
|Average capacity in GB
|Maximum capacity in GB
Analyzing the change workload—how much data on a protected volume or server actually changes between backups—underscores the value of DPM, which is based upon copying only data blocks that have changed. Whereas Figure 4 showed that the average protected server holds 218GB of data, Figure 5 shows that the average branch office server has a data change of only about 1.8GB per backup session.
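A quick calculation puts those two averages in proportion. The Python sketch below uses only the figures quoted above and treats the 218GB average as the amount of protected data per server; the function name `change_fraction` is hypothetical.

```python
def change_fraction(changed_gb: float, protected_gb: float) -> float:
    """Fraction of the protected data that changes in one backup session."""
    if protected_gb <= 0:
        raise ValueError("protected_gb must be positive")
    return changed_gb / protected_gb


fraction = change_fraction(changed_gb=1.8, protected_gb=218)
print(f"{fraction:.2%} of the data changes per backup session")
```

On these averages, less than 1 percent of the protected data needs to move in a typical session.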
Figure 5 Average Branch Office Change Workload per Server
|Average change in GB
|Longest changes in GB
|Heavy day changes in GB
|Maximum changes in GB
DPM is a handy solution for organizations that need advanced data protection. DPM enables users to recover their own documents quickly and easily. Recoveries that once took hours, while IT staff located and mounted tapes from a library, can now be achieved in seconds. Because tape backups can easily be made from the DPM server, organizations can still use tapes for long-term and offsite storage, while eliminating the long backup windows that tape requires.
DPM is especially useful if you need to manage backup and restoration services for branch offices and other remote locations from a high-security centralized datacenter, while at the same time reducing the cost of local administration.
Microsoft IT Showcase presents an inside view of the Microsoft IT process for developing, deploying, and managing Microsoft solutions—from Microsoft IT professionals to IT professionals—peer to peer. The resources they provide reveal how Microsoft uses technology to solve specific business problems. Find out more at Microsoft IT: Showcase.
Laura Euler runs the universe from an underground fortress in an undisclosed location. You can contact her at firstname.lastname@example.org.
© 2008 Microsoft Corporation and CMP Media, LLC. All rights reserved; reproduction in part or in whole without permission is prohibited.