Data Deduplication Technical Preview
Published: February 29, 2012
Updated: February 29, 2012
Applies To: Windows Server 2012
The past decade has seen rapid growth in file-based data in enterprise environments. Although storage costs have been steadily dropping, they are not dropping fast enough to offset this growth, which makes storage efficiency a critical requirement for most enterprise IT departments. Further, efficiencies need to happen wherever the data is—whether it is sitting in a data store or moving through a wide area network. To cope with such growth, customers are rapidly consolidating file servers and making capacity scaling and optimization one of the primary requirements for a consolidation platform.
The goal of data deduplication is to store more data in less space by segmenting files into small, variable-sized chunks (32-128 KB), identifying duplicate chunks, and then maintaining a single copy of each chunk. Redundant copies of a chunk are replaced by a reference to the single copy. Chunks are also compressed for further space optimization.
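The chunk-and-reference idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the boundary rule in `split_into_chunks`, the `BOUNDARY_MASK` constant, the use of SHA-256 to identify chunks, and zlib for compression are all hypothetical choices, not the algorithms that the Data Deduplication feature actually uses. Only the 32-128 KB chunk size range comes from the text.

```python
import hashlib
import zlib

MIN_CHUNK = 32 * 1024    # lower chunk size bound, from the text
MAX_CHUNK = 128 * 1024   # upper chunk size bound, from the text
BOUNDARY_MASK = 0xFFFF   # hypothetical: tunes how often a content-based cut occurs


def split_into_chunks(data):
    """Yield variable-sized chunks using a toy content-defined boundary rule."""
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        # Toy rolling value that depends on the most recent bytes only.
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        at_boundary = length >= MIN_CHUNK and (rolling & BOUNDARY_MASK) == 0
        if at_boundary or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
            rolling = 0
    if start < len(data):
        yield data[start:]  # trailing chunk may be shorter than MIN_CHUNK


def deduplicate(files):
    """Return (chunk_store, stubs) for a mapping of file name -> bytes."""
    chunk_store = {}  # digest -> compressed chunk, stored once
    stubs = {}        # file name -> ordered list of chunk digests
    for name, data in files.items():
        refs = []
        for chunk in split_into_chunks(data):
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in chunk_store:
                chunk_store[digest] = zlib.compress(chunk)
            refs.append(digest)  # duplicates become references only
        stubs[name] = refs
    return chunk_store, stubs
```

Two files with identical content end up with identical reference lists, and each distinct chunk appears in `chunk_store` exactly once.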
The result is an on-disk transformation of each file as shown in Figure 1. After deduplication, files are no longer stored as independent streams of data. Instead, they are replaced with stubs that point to data blocks stored within a common chunk store. Because these files share blocks, those blocks are stored only once, which reduces the disk space needed to store all files. During file access, the correct blocks are transparently assembled to serve the data, without the calling application or the user having any knowledge of the on-disk transformation of the file. This enables administrators to apply deduplication to files without having to worry about any change in application behavior or any impact on users who are accessing those files.
Figure 1 On-disk transformation of files during data deduplication
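As a rough illustration of the stub and chunk store layout, the following Python sketch reassembles file data on read. The dictionary-based `chunk_store`, the `stubs` mapping, the chunk identifiers, and the `read_file` function are all hypothetical names chosen for this example. The actual on-disk format and the filter driver interface are not described in this document.

```python
import zlib

# Hypothetical in-memory chunk store: chunk id -> compressed chunk bytes.
chunk_store = {
    "c1": zlib.compress(b"shared header "),
    "c2": zlib.compress(b"report body A"),
    "c3": zlib.compress(b"report body B"),
}

# Each stub replaces a file's data with ordered references into the store.
# Both files reference "c1", so that chunk is stored only once.
stubs = {
    "a.txt": ["c1", "c2"],
    "b.txt": ["c1", "c3"],
}


def read_file(name, offset=0, length=None):
    """Reassemble a file from its stub; the caller sees ordinary bytes."""
    data = b"".join(zlib.decompress(chunk_store[c]) for c in stubs[name])
    end = None if length is None else offset + length
    return data[offset:end]
```

A caller of `read_file("a.txt")` receives the complete original bytes and has no way to tell that the first chunk is shared with `b.txt`.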
The Data Deduplication feature consists of a filter driver that monitors local or remote I/O and a deduplication service that controls the three types of jobs that are available (Optimization, Garbage Collection, and Scrubbing).
Deduplication is limited to a single volume that is portable and cluster aware, and the best results will be with primary data that uses efficient, policy-driven, scheduled background optimization.
Resiliency during hardware failures is inherent in the deduplication architecture, which provides full checksum validation on data and metadata, as well as redundancy for metadata and the most accessed data chunks.
Data Deduplication can potentially process all of the data on a selected volume (except files smaller than 32 KB, files in folders that are excluded, or files that have age settings applied). You should carefully determine whether a server and its attached volumes are suitable candidates for deduplication before you enable the feature. We strongly recommend that you regularly back up important data during deduplication.
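The exclusion rules above can be expressed as a simple predicate. This Python sketch is an assumption-laden illustration: the `is_candidate` function and its parameters are hypothetical, and it interprets "age settings" as a minimum file age in days, which this document does not spell out. Only the 32 KB size threshold comes from the text.

```python
from pathlib import PureWindowsPath

MIN_FILE_SIZE = 32 * 1024  # files smaller than 32 KB are skipped, per the text


def is_candidate(path, size_bytes, age_days, excluded_folders, min_age_days):
    """Return True if a file would be considered under these assumed rules."""
    if size_bytes < MIN_FILE_SIZE:
        return False
    file_path = PureWindowsPath(path)
    for folder in excluded_folders:
        # Windows path comparison is case-insensitive.
        if PureWindowsPath(folder) in file_path.parents:
            return False
    # Assumed meaning of "age settings": skip files newer than min_age_days.
    return age_days >= min_age_days
```

For example, `is_candidate(r"D:\Shares\Docs\disk.vhd", 10 * 1024 * 1024, 30, [r"D:\Shares\Temp"], 5)` evaluates to `True`, while the same file placed under `D:\Shares\Temp` would be skipped.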