Data Deduplication Overview
Published: February 29, 2012
Updated: August 29, 2012
Applies To: Windows Server 2012, Windows Storage Server 2012
This topic describes the data deduplication feature in Windows Server 2012, and it explains practical applications for the feature.
Data deduplication involves finding and removing duplication within data without compromising its fidelity or integrity. The goal is to store more data in less space by segmenting files into small variable-sized chunks (32–128 KB), identifying duplicate chunks, and maintaining a single copy of each chunk. Redundant copies of the chunk are replaced by a reference to the single copy. The chunks are compressed and then organized into special container files in the System Volume Information folder.
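The chunk-and-reference idea described above can be illustrated with a minimal Python sketch. It is not the Windows Server 2012 implementation, which this topic does not specify. The toy rolling hash, the `BOUNDARY_MASK` value, the use of SHA-256 to identify chunks, the use of zlib for compression, and the names `split_chunks` and `ChunkStore` are all assumptions made for illustration. Only the 32 KB and 128 KB chunk bounds come from the text.

```python
import hashlib
import zlib

MIN_CHUNK = 32 * 1024    # lower chunk bound from the text (32 KB)
MAX_CHUNK = 128 * 1024   # upper chunk bound from the text (128 KB)
BOUNDARY_MASK = 0xFFFF   # hypothetical: cut when the low 16 hash bits are zero


def split_chunks(data: bytes):
    """Yield variable-size chunks whose boundaries depend on the content."""
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        # Toy rolling hash: older bytes shift out of the 32-bit value.
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        at_boundary = length >= MIN_CHUNK and (rolling & BOUNDARY_MASK) == 0
        if at_boundary or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
            rolling = 0
    if start < len(data):
        yield data[start:]  # final partial chunk, may be shorter than MIN_CHUNK


class ChunkStore:
    """Keeps a single compressed copy of each distinct chunk."""

    def __init__(self):
        self.chunks = {}  # SHA-256 hex digest -> compressed chunk bytes

    def optimize(self, data: bytes):
        """Store the data and return its chunk map (ordered list of digests)."""
        chunk_map = []
        for chunk in split_chunks(data):
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:  # duplicates become references only
                self.chunks[digest] = zlib.compress(chunk)
            chunk_map.append(digest)
        return chunk_map

    def restore(self, chunk_map):
        """Rebuild the original bytes from a chunk map."""
        return b"".join(zlib.decompress(self.chunks[d]) for d in chunk_map)


if __name__ == "__main__":
    store = ChunkStore()
    data = bytes(200 * 1024)  # 200 KB of zeros: highly redundant content
    chunk_map = store.optimize(data)
    print(len(chunk_map), "references,", len(store.chunks), "stored chunks")
    assert store.restore(chunk_map) == data
```

In this sketch the chunk map plays the role that the text assigns to the pointers held by an optimized file: the file keeps only an ordered list of references, and the data itself lives once in the store.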
After a volume is enabled for deduplication and the data is optimized, the volume contains the following:
- Unoptimized files. These can include files that do not meet the selected file-age policy setting, system state files, alternate data streams, encrypted files, files with extended attributes, files smaller than 32 KB, other reparse point files, or files in use by other applications.
- Optimized files. Files that are stored as reparse points. Each reparse point contains pointers to a map of the chunks in the chunk store that are needed to restore the file when it is requested.
- Chunk store. Location for the optimized file data.
- Additional free space. The optimized files and chunk store occupy much less space than the original files did prior to optimization.
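The exclusions listed for unoptimized files can be read as a simple eligibility check. The following Python sketch is one way to express that check. It is an illustration only: the `FileInfo` record, its field names, the `skip_reason` function, and the order of the checks are hypothetical, and alternate data streams are not modeled. The 32 KB threshold comes from the text. The file-age threshold is passed in as a parameter because the text does not state a value.

```python
from dataclasses import dataclass
from typing import Optional

MIN_FILE_SIZE = 32 * 1024  # from the text: files smaller than 32 KB stay unoptimized


@dataclass
class FileInfo:
    """Hypothetical summary of the file properties named in the list above."""
    size_bytes: int
    age_days: float
    encrypted: bool = False
    has_extended_attributes: bool = False
    is_reparse_point: bool = False
    is_system_state: bool = False
    in_use: bool = False


def skip_reason(info: FileInfo, min_age_days: float) -> Optional[str]:
    """Return why a file would stay unoptimized, or None if it is a candidate."""
    if info.is_system_state:
        return "system state file"
    if info.encrypted:
        return "encrypted"
    if info.has_extended_attributes:
        return "has extended attributes"
    if info.is_reparse_point:
        return "reparse point"
    if info.in_use:
        return "in use by another application"
    if info.size_bytes < MIN_FILE_SIZE:
        return "smaller than 32 KB"
    if info.age_days < min_age_days:
        return "does not meet the file-age policy"
    return None


if __name__ == "__main__":
    candidate = FileInfo(size_bytes=5 * 1024 * 1024, age_days=30)
    small = FileInfo(size_bytes=4 * 1024, age_days=30)
    print(skip_reason(candidate, min_age_days=7))  # None
    print(skip_reason(small, min_age_days=7))      # smaller than 32 KB
```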
To cope with data storage growth in the enterprise, administrators are consolidating servers and making capacity scaling and data optimization key goals. Data deduplication provides practical ways to achieve these goals, including:
- Capacity optimization. Data deduplication in Windows Server 2012 stores more data in less physical space. It achieves greater storage efficiency than was possible by using features such as Single Instance Storage (SIS) or NTFS compression. Data deduplication uses subfile variable-size chunking and compression, which deliver optimization ratios of 2:1 for general file servers and up to 20:1 for virtualization data.
- Scale and performance. In Windows Server 2012, data deduplication is highly scalable, resource efficient, and nonintrusive. It can process about 20 MB of data per second, and it can run on multiple volumes simultaneously without affecting other workloads on the server. Low impact on the server workloads is maintained by throttling the CPU and memory resources that are consumed. If the server gets very busy, deduplication can stop completely. In addition, administrators have the flexibility to run data deduplication jobs at any time, set schedules for when data deduplication should run, and establish file selection policies.
- Reliability and data integrity. When data deduplication is applied, the integrity of the data is maintained. Windows Server 2012 uses checksum, consistency, and identity validation to ensure data integrity. For all metadata and the most frequently referenced data, data deduplication maintains redundancy to ensure that the data is recoverable in the event of data corruption.
- Bandwidth efficiency with BranchCache. Through integration with BranchCache, the same optimization techniques are applied to data transferred over the WAN to a branch office. The result is faster file download times and reduced bandwidth consumption.
- Optimization management with familiar tools. Windows Server 2012 has optimization functionality built into Server Manager and Windows PowerShell. Default settings can provide savings immediately, or administrators can fine-tune the settings to see more gains. Administrators can use Windows PowerShell cmdlets to start an optimization job or schedule one to run in the future. The Data Deduplication feature can also be installed, and deduplication enabled on selected volumes, by using an Unattend.xml file that calls a Windows PowerShell script. Used with Sysprep, this approach deploys deduplication when a system first boots.
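The ratios and throughput quoted in the list above translate into space savings and processing time with simple arithmetic. The Python sketch below works through that arithmetic. The function names are hypothetical, and the time estimate assumes a single job that sustains the quoted rate of about 20 MB per second with no throttling or pauses, which a busy server may not achieve.

```python
def space_savings(ratio: float) -> float:
    """Fraction of physical space saved for an optimization ratio of ratio:1."""
    return 1 - 1 / ratio


def hours_to_process(data_gb: float, mb_per_second: float = 20.0) -> float:
    """Hours needed to process data_gb gigabytes at a sustained rate."""
    seconds = data_gb * 1024 / mb_per_second
    return seconds / 3600


if __name__ == "__main__":
    print(f"{space_savings(2):.0%}")         # 50% for a general file server at 2:1
    print(f"{space_savings(20):.0%}")        # 95% for virtualization data at 20:1
    print(f"{hours_to_process(1024):.1f}")   # 14.6 hours for 1 TB at 20 MB/s
```

A 2:1 ratio therefore halves the physical space needed, and a 20:1 ratio reduces it to one twentieth. At the quoted rate, a first optimization pass over 1 TB of data would take roughly 14.6 hours under these assumptions.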
To take advantage of data deduplication in Windows Server 2012, the environment must meet the following requirements:
- Server: One computer or virtual machine running Windows Server 2012 with at least one data volume
- (Optional) Another computer: One computer running Windows Server 2012 or Windows® 8 that is connected to the server over a network
