Saving Storage Costs Using Windows Server 2012 Deduplication
Quick Reference Guide
Microsoft IT has found great value in using deduplication, the Server Data Storage Optimization service included in Windows Server 2012. Also known as Dedup, deduplication finds and removes duplication within data without compromising fidelity or integrity. Deduplication has also improved server storage capacity and the user experience, and it has increased productivity by streamlining the storage management process.
What Is Deduplication?
The basic purpose of deduplication is to eliminate duplicate data and reduce redundant data storage. Deduplication facilitates the storage of more data in less space and removes redundant copies of data by replacing them with a reference to a single copy.
Microsoft IT used deduplication, in conjunction with Microsoft BranchCache and the optimization management features found in Windows Server 2012, to increase data storage capacity while gaining scale, performance, reliability, data integrity, and bandwidth efficiency.
Why Did Microsoft IT Need Deduplication?
Enterprise data storage increases at tremendous rates, which results in increased costs—a challenge for any IT department. Deduplication reduces storage costs and is a significant storage solution available in Windows Server 2012.
To cope with data storage growth in the enterprise, Microsoft IT used data deduplication to consolidate servers and scale their capacity to meet their data storage optimization goals.
Features of Deduplication
Capacity optimization. Data deduplication in Windows Server 2012 stores more data in less physical space. It achieves greater storage efficiency than was possible by using features such as Single Instance Storage or NTFS compression. Data deduplication uses subfile variable-size chunking and compression, which deliver optimization ratios of 2:1 for general file servers and up to 20:1 for virtualization data.
Scale and performance. In Windows Server 2012, data deduplication is highly scalable, resource efficient, and nonintrusive. It can process about 20 MB of data per second, and it can run on multiple volumes simultaneously without affecting other workloads on the server. Low impact on the server workloads is maintained by throttling the CPU and memory resources consumed. If the server is busy, deduplication can stop completely. In addition, administrators have the flexibility to run data deduplication jobs at any time, set schedules for when data deduplication should run, and establish file-selection policies.
Reliability and data integrity. When data deduplication is applied, the integrity of the data is maintained. Windows Server 2012 uses checksum, consistency, and identity validation to ensure data integrity. For all metadata and the most frequently referenced data, data deduplication maintains redundancy to ensure that the data is recoverable in the event of data corruption.
Bandwidth efficiency with BranchCache. Through integration with BranchCache, the same optimization techniques are applied to data transferred over the WAN to a branch office. The result is faster file download times and reduced bandwidth consumption.
Optimization management with familiar tools. Windows Server 2012 has optimization functionality built into Server Manager and Windows PowerShell cmdlets. Default settings can provide savings immediately, or administrators can fine-tune the settings to see more gains. Windows PowerShell cmdlets can be easily used to start an optimization job or schedule one to run in the future. Installing the Data Deduplication feature and enabling deduplication on selected volumes can also be done by using an Unattend.xml file that calls a Windows PowerShell script or with the System Preparation Tool (Sysprep) to deploy deduplication when a system first starts.
Available as part of Windows Server 2012, deduplication made it easy for Microsoft IT to leverage both existing data management technologies and physical infrastructure. Inherent in the deduplication architecture is resiliency during hardware failures—with full checksum validation on data and metadata, including redundancy for metadata and the most accessed data chunks.
Dedup has three layers:
- Management Interface. Deduplication can be managed through Windows Server 2012 Server Manager, Windows PowerShell, or Windows Management Instrumentation.
- Deduplication job. The actual deduplication job optimizes storage and includes two additional advanced features, Garbage Collection and Scrubbing:
  - Garbage Collection processes deleted or modified data on the volume so that any data chunks no longer referenced are cleaned up.
  - Scrubbing analyzes the chunk store corruption logs and, when possible, makes repairs.
- Data access layer. When deduplication is enabled, applications and users can store, retrieve, and index data on the deduplicated volume with no noticeable performance degradation.
How Deduplication Works
Data deduplication works by finding and removing duplication within data without compromising data fidelity or integrity. The goal is to store more data in less space by:
- Segmenting files into small, variable-sized chunks (32–128 KB)
- Identifying duplicate chunks and maintaining a single copy of each chunk
- Replacing redundant copies of the chunk with a reference to the single copy
- Compressing the chunks, and then organizing them into special container files in the System Volume Information folder
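The single-copy-plus-references idea behind the steps above can be sketched as a toy PowerShell script. This is only an illustration using fixed-size chunks and hypothetical names (`Add-ToyFile`, `$chunkStore`); the actual feature uses subfile variable-size chunking and compression:

```powershell
# Toy chunk store: hash -> chunk bytes. One copy is kept per unique chunk.
$chunkStore = @{}

function Add-ToyFile([byte[]]$data, [int]$chunkSize = 4) {
    $refs = @()
    $sha = [System.Security.Cryptography.SHA256]::Create()
    for ($i = 0; $i -lt $data.Length; $i += $chunkSize) {
        $len   = [Math]::Min($chunkSize, $data.Length - $i)
        $chunk = [byte[]]$data[$i..($i + $len - 1)]
        $hash  = [BitConverter]::ToString($sha.ComputeHash($chunk))
        if (-not $chunkStore.ContainsKey($hash)) { $chunkStore[$hash] = $chunk }
        $refs += $hash    # the stored "file" is just a list of chunk references
    }
    return $refs
}
```

Two files that share content end up referencing the same entries in `$chunkStore`, so the shared chunks are stored only once.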
After a volume is enabled for deduplication and the data is optimized, the volume contains the following elements:
- Unoptimized files. For example, unoptimized files could include files that do not meet the selected file-age policy setting, system state files, alternate data streams, encrypted files, files with extended attributes, files less than 32 KB in size, other reparse point files, or files in use by other applications.
- Optimized files. These files are stored as reparse points that contain pointers to a map of the respective chunks in the chunk store that are needed to restore the file when it is requested.
- Chunk store. This is the location for the optimized file data.
- Additional free space. The optimized files and chunk store occupy much less space than they did prior to optimization.
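Once a volume has been optimized, these elements can be inspected with the Get-DedupMetadata cmdlet. A sketch, assuming the Data Deduplication feature is installed and volume E: (an example drive letter) is deduplicated:

```powershell
# Report chunk store statistics for volume E: (drive letter is an example):
# counts and sizes of data chunks, containers, and stream map entries.
Get-DedupMetadata -Volume E:
```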
No Impact on Application Performance
After deduplication, files are no longer stored as independent streams of data; instead, they are replaced with stubs that point to data blocks stored within a common chunk store. Because files share blocks, each block is stored only once, which reduces the disk space needed to store all files. During file access, the correct blocks are transparently assembled to serve the data, without the application or the user having any knowledge of the on-disk transformation of the file. This enabled Microsoft IT to apply deduplication to files without having to worry about changes in application behavior or any impact on the users who access those files.
Before Microsoft IT deployed the deduplication feature, it first evaluated candidate servers and data for potential savings. After candidates were identified, Microsoft IT developed rollout plans, scale targets, and policies.
Microsoft IT used two methods to estimate the rate of savings that would help evaluate whether a drive was a good candidate for deduplication:
- At a command prompt, by running DDPEval.exe <VolumePath>
- In Windows PowerShell, by using the Measure-DedupFileMetadata cmdlet
An example of the Data Deduplication Savings Evaluation Tool results is illustrated below.
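As a sketch, the command-prompt evaluation looks like this (E:\Shares is an example path):

```powershell
# Estimate expected deduplication savings for a volume or folder before
# enabling the feature. DDPEval.exe ships in \Windows\System32 on
# Windows Server 2012.
DDPEval.exe E:\Shares
```

The tool reports the evaluated file count, the processed size, and the expected savings rate for the target path.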
Enable and Configure
Microsoft IT used Server Manager to install and enable the deduplication feature on the chosen volumes:
- To install, click Server Manager -> Add Roles and Features -> File and Storage Services -> File Services -> Data Deduplication -> Install.
- To enable deduplication on a volume, click Server Manager -> File and Storage Services -> Volumes; right-click the intended volume, and then click Configure Data Deduplication -> Enable data deduplication, Set schedule, or Exclude folders.
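The same install-and-enable steps can also be scripted with Windows PowerShell. A sketch, assuming volume E: is the target (the drive letter, file-age setting, and excluded folder are examples):

```powershell
# Install the Data Deduplication role service.
Import-Module ServerManager
Add-WindowsFeature -Name FS-Data-Deduplication

# Enable deduplication on volume E: and adjust the file-selection policy.
Import-Module Deduplication
Enable-DedupVolume E:
Set-DedupVolume E: -MinimumFileAgeDays 5 -ExcludeFolder "E:\Temp"
```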
Although deduplication does not affect performance while running, it can take anywhere from several hours to several days to finish, depending on the amount of data. Microsoft IT customized the optimization schedule to start right before the end of peak business hours and run during off-peak hours.
- To schedule a task, type Task Scheduler in the Search bar. In Task Scheduler, use the drop-down menu to access the Task Scheduler Library and the list of tasks that can be scheduled. For example, Task Scheduler Library -> Microsoft -> Windows -> Deduplication -> Garbage Collection or Scrubbing.
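The deduplication cmdlets offer an alternative to Task Scheduler for running and scheduling jobs. A sketch, assuming volume E: (the schedule name, days, start time, and duration are examples):

```powershell
# Run a one-off optimization job immediately on volume E:.
Start-DedupJob -Volume E: -Type Optimization

# Create a recurring optimization window that starts at the end of
# peak business hours and runs overnight.
New-DedupSchedule -Name "NightlyOptimization" -Type Optimization `
    -Days Monday,Tuesday,Wednesday,Thursday,Friday `
    -Start "18:00" -DurationHours 10
```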
Checking Deduplication Results
Microsoft IT used Windows PowerShell to check the results of deduplication by running Get-DedupVolume <VolumePath> | Format-List (fl).
The returned information includes drive capacity, saving rates, excluded folders and file types, and chunk redundancy thresholds as well as other volume data that helps Microsoft IT validate deduplication results.
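A sketch of the check, assuming volume E: (the property names are those returned by the Get-DedupVolume cmdlet):

```powershell
# Full listing of savings, policy, and exclusion settings for the volume.
Get-DedupVolume -Volume E: | Format-List

# Or pull just the headline numbers.
Get-DedupVolume -Volume E: |
    Select-Object Volume, Capacity, FreeSpace, SavedSpace, SavingsRate
```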
Monitoring, Maintenance, and Troubleshooting
Through ongoing monitoring, Microsoft IT is able to track the rate of optimization savings and follow the status and progress of ongoing optimizations. Monitoring also includes inspecting the data scrubbing reports.
Microsoft IT also performs maintenance activities that include running additional Scrubbing and Garbage Collection as required for further optimization.
For troubleshooting, Microsoft IT can query Dedup metadata and unoptimize a volume.
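These monitoring, maintenance, and troubleshooting tasks map onto the deduplication cmdlets. A sketch, assuming volume E: (an example drive letter):

```powershell
# Monitor: overall savings per volume and the status of deduplication jobs.
Get-DedupStatus
Get-DedupJob

# Maintain: run Garbage Collection and Scrubbing on demand.
Start-DedupJob -Volume E: -Type GarbageCollection
Start-DedupJob -Volume E: -Type Scrubbing

# Troubleshoot: query Dedup metadata, or undo optimization on the volume.
Get-DedupMetadata -Volume E:
Start-DedupJob -Volume E: -Type Unoptimization
```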
Microsoft IT has successfully deployed deduplication on 12 Windows Server 2012 file servers, focusing on the deduplication of primary data, such as general file shares, software deployment shares, and virtual hard disk (VHD) libraries. With an average of 10 TB of storage capacity for each server, the initial overall savings rate averaged 40 percent. The two VHD drives that store archival data had a saving rate that was close to 90 percent.
In addition to the reduced data volume, Microsoft IT saw performance and capacity improvements as well as cost savings.
Reduced Data Volume
Microsoft IT reduced its data volume through deduplication without compromising the data contents, fidelity, or integrity. Depending on the data type, deduplication optimized data volumes by more than 80 percent. For example, for 10 TB of VHD data, deduplication can save 8 TB in drive space, requiring only 2 TB of drive space to store 10 TB of data.
Typical Deduplication Storage Savings
|Scenario|Content type|Space savings|
|---|---|---|
|User documents|Documents, photos, music, videos|30–50%|
|Deployment shares|Software binaries, cab files, symbols files|70–80%|
|Virtualization libraries|VHD files|80–95%|
|General file share|All of the above|50–60%|
Improved Performance and Capacity
These savings provided Microsoft IT with smaller and faster data backup-and-restore processes as well as efficient archiving and quick data migration. As a result of the storage space savings, Microsoft IT could add more data using less hardware.
With the storage savings Microsoft IT enjoyed after implementing deduplication, there is now capacity for 200 TB of storage in an environment that used to support 120 TB, without the addition of physical capacity. Deduplication has enabled Microsoft IT to expand its file share and client data backup services, scale those services to accommodate additional storage needs, simplify storage management, and reduce costs.
Based on the current rate of deduplication savings, Microsoft IT forecasted deduplication savings for the current and next two fiscal years across the following measures:
- File share data storage capacity
- Physical storage and management cost
For More Information
- Data Deduplication Overview: http://technet.microsoft.com/en-us/library/hh831602.aspx
- About Data Deduplication: http://msdn.microsoft.com/en-us/library/windows/desktop/Hh769303(v=vs.85).aspx
- Windows Server 2012: http://www.microsoft.com/en-us/server-cloud/windows-server/default.aspx