.gif)
SQL Server
Technical Article
Writer: Sanjay Mishra
Contributors: Marcel van der Holst, Peter Carlin,
Sunil Agarwal
Technical Reviewer: Stuart Ozer, Lindsey Allen, Juergen
Thomas, Thomas Kejser, Burzin Patel, Prem Mehra, Joseph Sack, Jimmy May,
Cameron Gardiner, Mike Ruthruff, Glenn Berry (SQL Server MVP), Paul S Randal
(SQLskills.com), David P Smith (ServiceU Corporation)
Published: May 2009
Applies to: SQL Server 2008
Summary: The data compression feature in SQL
Server 2008 helps compress the data inside a database, and it can help reduce
the size of the database. Apart from the space savings, data compression
provides another benefit: Because compressed data is stored in fewer pages, queries
need to read fewer pages from the disk, thereby improving the performance of
I/O intensive workloads. However, extra CPU resources are required on the
database server to compress and decompress the data, while data is exchanged
with the application. Therefore, it is important to understand the workload
characteristics when deciding which tables to compress.
Introduction
The data compression feature in the Microsoft® SQL Server®
2008 database software can help reduce the size of the database as well as
improve the performance of I/O intensive workloads. However, extra CPU
resources are required on the database server to compress and decompress the
data, while data is exchanged with the application. Therefore, it is important
to understand the workload characteristics when deciding which tables to
compress. This white paper provides guidance on the following:
- How to decide which tables and indexes to compress
- How to estimate the resources required to compress a
table
- How to reclaim space released by data compression
- The performance impacts of data compression on
typical workloads
About Data Compression
SQL Server
2008 provides two levels of data compression – row compression and page
compression. Row compression helps store data more efficiently in
a row by storing fixed-length data types in variable-length storage format. A
compressed row uses 4 bits per compressed column to store the length of the
data in the column. NULL and 0 values across all data types take no additional space
other than these 4 bits.
Page compression is a superset of row compression. In
addition to storing data efficiently inside a row, page compression optimizes
storage of multiple rows in a page, by minimizing the data redundancy. Page
compression uses prefix compression and dictionary compression. Prefix compression looks for common patterns in the
beginning of the column values on a given column across all rows on each page. Dictionary compression looks for exact value matches across
all columns and rows on each page. Both dictionary and prefix are type-agnostic
and see every column value as a bag of bytes.
For more
information about the data compression feature, see SQL Server Books Online. The SQL Server Storage Engine blog is also a great resource for the
internals of data compression.
The data
compression feature is available in the Enterprise and Developer editions of
SQL Server 2008. Databases with compressed tables or indexes cannot be
restored, attached, or in any way used on other editions. To determine whether
a database is using compression, query the dynamic management view (DMV)sys.dm_db_persisted_sku_features.
To determine what is compressed,
and how (row or page), query the data_compression_desc column in the catalog view sys.partitions.
Deciding What to Compress
SQL Server
2008 provides great flexibility in how data compression is used. Row and page
compression can be configured at the table, index, indexed view, or partition level. Some examples of the flexibility in applying
data compression are to:
- Row-compress some tables, page-compress some others,
and don’t compress the rest.
- Page-compress a heap or clustered index, but have no
compression on its nonclustered indexes.
- Row-compress one index, and have no compression on
another index.
- Row-compress some partitions of a table, page-compress
some others, and don’t compress the rest.
With this
flexibility comes the challenge in deciding what to compress. This section
provides some guidelines to assist in deciding what to compress. Some of the
factors that influence this decision are:
- Estimated space savings
- Application workload
Estimated Space Savings
The stored
procedure sp_estimate_data_compression_savings estimates the amount of space saved
by compressing a table and its indexes. It functions by taking a sample of the
data and then compressing it in tempdb.
Estimate the space savings for the largest tables and indexes in a database,
and consider compressing only those tables and indexes that yield significant
space savings.
The stored
procedure sp_estimate_data_compression_savings estimates the space savings one table
at a time. This can be wrapped in a script to estimate the space savings for
all the tables and indexes in a database – as shown in these two blogs: Whole Database - Data Compression
Procs and Procedure used for applying Database
Compression to Microsoft SAP ERP system. Be aware that estimating data compression savings on an
entire database may take a long time in a database with several thousand tables
and indexes, such as an SAP ERP database.
Data and Data Types
The amount of
space saved by compressing a table depends on the “data” the table contains
(after all, it is called “data” compression!). Some data compresses
significantly, while some other doesn’t. Tables that contain the following
patterns of data compress very well:
- Columns with numeric or fixed-length character data
types where most values don’t require all the allocated bytes: For
example, integers where most values are less than 1000
- Nullable columns where a significant number of the
rows have a NULL value for the column
- Significant amounts of repeating data values or
repeating prefix values in the data
Some patterns
of data that do not benefit much from compression are:
- Columns with numeric or fixed-length character data
types where most values require all the bytes allocated for the specific
data type
- Not much repeating data
- Repeating data with non-repeating prefixes
- Data stored out of the row
- FILESTREAM data
A table or a
partition can have three allocation units - IN_ROW_DATA, LOB_DATA, and
ROW_OVERFLOW_DATA.
The data stored in LOB_DATA and ROW_OVERFLOW_DATA allocation units is not
compressed. Only the data that is stored in the IN_ROW_DATA allocation unit is
compressed. Use Appendix
B to understand how
much data is stored in each of these three allocation units.
FILESTREAM data is stored outside the database
in a FILESTREAM data container on an NTFS volume. This data is not compressed.
Application
Workload
Compressed pages are persisted as compressed
on disk and stay compressed when read into memory. Data is decompressed (not
the entire page, but only the data values of interest) when it meets one of the
following conditions:
- It is read
for filtering, sorting, joining, as part of a query response.
- It is updated
by an application.
There is no in-memory, decompressed
copy of the compressed page. Decompressing data consumes CPU. However, because
compressed data uses fewer data pages, it also saves:
- Physical I/O:
Because physical I/O is expensive from a workload perspective, reduced
physical I/O often results in a bigger saving than the additional CPU cost
to compress and decompress the data. Note that physical I/O is saved both
because a smaller volume of data is read from or written to disk, and because more data can remain
cached in buffer pool memory.
- Logical I/O
(if data is in memory): Because logical I/O consumes CPU, reduced logical
I/O can sometimes compensate for the CPU cost to compress and decompress
the data.
The savings in logical and physical I/O
is largest when tables or indexes are scanned. When singleton lookups (for read
or write) are performed, I/O savings from compression are smaller - they only
occur if compression causes more requests to target the same page, and this
leads to reduced physical I/O.
The CPU overhead of row compression is
usually minimal (generally less than or equal to 10 percent in our experience).
If row compression results in space savings and the system can accommodate a 10
percent increase in CPU usage, all data should be row-compressed. For example, SAP
ERP NetWeaver 7.00 Business Suite 7 and above use row compression on all
tables.
The CPU overhead of page compression
can be higher than row compression, and therefore deciding what to page-compress
is more difficult. Some general guidelines for page compression are:
- Start with
less frequently used tables and indexes to ensure the understanding of the
system behavior is accurate.
- If CPU
headroom is not available, do not use page compression without thorough
testing.
- Query
operations such as filtering, joins, aggregates, and sorting see
decompressed data, and hence the cost of these operations are not affected
by data compression. A query, whose cost is dominated by complex
processing operations (for example, a query involving multiple table joins
or complex aggregate operations) is much less likely to see any
significant change in performance or CPU utilization, due to data
compression. These complex queries usually occur in data warehousing
applications, but they can occur in other applications as well. If CPU
time for an application consists primarily of complex queries, page
compression may not impact CPU measurably. In such scenarios, space
savings may be the prime driver for data compression.
- Most
large-scale data warehousing workloads are scan-intensive, and storage is
typically a premium as well. In a data warehouse or large-scale data mart,
if there is CPU headroom available, we recommend page-compressing all
objects in the database, rather than evaluating object-by-object as
outlined below.
A more detailed approach to deciding
what to compress involves analyzing the workload characteristics for each table
and index. It is based on the following two metrics:
- U: The
percentage of update operations on a specific table, index, or partition,
relative to total operations on that object. The lower the value of U
(that is, the table, index, or partition is infrequently updated), the
better candidate it is for page compression.
- S: The
percentage of scan operations on a table, index, or partition, relative to
total operations on that object. The higher the value of S (that is, the table,
index, or partition is mostly scanned), the better candidate it is for page
compression.
U:
Percent of Update Operations on the Object
To compute U, use the statistics in
the DMV sys.dm_db_index_operational_stats. U is the ratio (expressed in
percent) of updates performed on a table or index to the sum of all operations
(scans + DMLs + lookups) on that table or index. The following query reports U
for each table and index in the database.
SELECT o.name AS [Table_Name], x.name AS [Index_Name],
i.partition_number AS [Partition],
i.index_id AS [Index_ID], x.type_desc AS [Index_Type],
i.leaf_update_count * 100.0 /
(i.range_scan_count + i.leaf_insert_count
+ i.leaf_delete_count + i.leaf_update_count
+ i.leaf_page_merge_count + i.singleton_lookup_count
) AS [Percent_Update]
FROM sys.dm_db_index_operational_stats (db_id(), NULL, NULL, NULL) i
JOIN sys.objects o ON o.object_id = i.object_id
JOIN sys.indexes x ON x.object_id = i.object_id AND x.index_id = i.index_id
WHERE (i.range_scan_count + i.leaf_insert_count
+ i.leaf_delete_count + leaf_update_count
+ i.leaf_page_merge_count + i.singleton_lookup_count) != 0
AND objectproperty(i.object_id,'IsUserTable') = 1
ORDER BY [Percent_Update] ASC
S:
Percent of Scan Operations on the Object
To compute S, use the statistics in
the DMV sys.dm_db_index_operational_stats. S is the ratio (expressed in
percent) of scans performed on a table or index to the sum of all operations
(scans + DMLs + lookups) on that table or index. In other words, S represents
how heavily the table or index is scanned. The following query reports S for
each table, index, and partition in the database.
SELECT o.name AS [Table_Name], x.name AS [Index_Name],
i.partition_number AS [Partition],
i.index_id AS [Index_ID], x.type_desc AS [Index_Type],
i.range_scan_count * 100.0 /
(i.range_scan_count + i.leaf_insert_count
+ i.leaf_delete_count + i.leaf_update_count
+ i.leaf_page_merge_count + i.singleton_lookup_count
) AS [Percent_Scan]
FROM sys.dm_db_index_operational_stats (db_id(), NULL, NULL, NULL) i
JOIN sys.objects o ON o.object_id = i.object_id
JOIN sys.indexes x ON x.object_id = i.object_id AND x.index_id = i.index_id
WHERE (i.range_scan_count + i.leaf_insert_count
+ i.leaf_delete_count + leaf_update_count
+ i.leaf_page_merge_count + i.singleton_lookup_count) != 0
AND objectproperty(i.object_id,'IsUserTable') = 1
ORDER BY [Percent_Scan] DESC
Be aware that the counters in the DMV sys.dm_db_index_operational_statsreflect
the operational statistics of the tables and indexes in the metadata cache.
Carefully consider the timeframe of the statistics used. If you prefer to err
on the side of too little compression, use performance statistics during the
period where CPU usage is the highest (for example, month-end processing).
For active tables and indexes, the
statistics reflected are cumulative since the time the SQL Server service was
restarted, or since the last time the database was opened. Less active objects
may not have operational statistics in this DMV; however, the less active
objects may be good candidates for page compression, provided they are of
significant size and the estimated space saving is significant.
Inserts to append-only (inserted at
the end of the table) tables that are rarely used, do not have much overhead
for page compression. Such tables may perform well with page compression even if
S is low. Examples of such tables are logging or audit tables, which are
written once and rarely read. These are good candidates for page
compression.
Example
Here is an example how a customer used
these measurements to decide which tables to page-compress. The customer had an
OLTP database running on a server with average CPU utilization of approximately
20 percent. The large amount of available CPU, the significant amount of
planned database growth, and the expense of storage provided motivation for
data compression. The customer computed space savings, and the U and S
measurements for the largest tables in the database, and targeted tables with S
greater than 75 percent, U less than 20 percent for page compression. Table 1
shows the estimated row and page savings, the values of S and U, and the
decision as to whether to row or page compress.
Table
1: Deciding what to compress
Based on the metrics shown in Table 1,
the customer decided to page-compress tables T2, T5, T6, T7, T8, and T10. All
other tables in the database were row-compressed. Following this plan, the
customer achieved 50 percent space savings, and approximately 10 percent
increase in CPU utilization.
Planning Compression:
Estimating Workspace, CPU, I/O
Tables and
indexes are compressed using the ALTER TABLE… REBUILD and ALTER INDEX … REBUILD statements respectively. Compressing
a table or an index requires workspace, CPU, and I/O. Compressing a table or
index uses the same mechanism as rebuilding an index. This section provides
estimated resource requirements for compressing a clustered index, alongside
the resource requirements for rebuilding the same uncompressed index, for
comparison. The resource requirements depend upon:
- Whether you are compressing a heap, a clustered
index, or a nonclustered index
- Whether you set
the SORT_IN_TEMPDB option to ON
- Whether you are performing the compression operation
with the ONLINE option set to ON
- Whether you are using the simple, bulk logged, or full
recovery model
Workspace
Free workspace is required in the
following:
- User
database
- Transaction
log
- tempdb
Estimate the workspace requirements
before starting the compression and have enough free space to avoid potentially
expensive autogrow of the database files or failed compression due to lack of
disk space. Use the script in Appendix C to determine the available free space
in each filegroup.
Workspace
Required in the User Database
In the user database, free workspace
is required for the following:
- The
compressed table or index
- The mapping
index
if you are compressing a heap or a clustered index with the ONLINE option
set to ON and the SORT_IN_TEMPDB option set to OFF. (The recommendation is
to set SORT_IN_TEMPDB to ON. The workspace requirement for the mapping
index is discussed in the tempdb
section.).
While a table is being compressed, both
the uncompressed table and the compressed table exist together until the
compression is successful and committed. After the table or the index is
compressed, the uncompressed table is dropped, and the space is released to the
filegroup. To estimate the size of the compressed table, use the output of sp_estimate_data_compression_savings.
Transaction
Log Space
The amount of transaction log space
needed depends on whether ONLINE is set to ON or OFF, and the recovery model used
(full,
bulk-logged, or simple).
Workspace
Required in tempdb
In the tempdb database, free workspace is required if ONLINE is set to ON:
- For the mapping
index,
an internal structure used to map old bookmarks to new bookmarks, enabling
concurrent DML transactions. This is stored in tempdb if SORT_IN_TEMPDB is set to ON.
- For the
version store. This is only used if there are concurrent DML operations.
Size depends upon the volume of ongoing modifications and duration of long
running DML transactions.
I/O
I/O is generally proportional to the
workspace used.
CPU
On average, row compression takes 1.5
times the CPU time used for rebuilding an index, whereas page compression takes
2 to 5 times the CPU time used for rebuilding an index. As an example data
point, 41 CPU seconds per GB for index rebuild, 48 CPU seconds per GB for row compression,
and 182 CPU seconds per GB for page compression were observed with ONLINE set
to OFF, on the hardware listed in Appendix A. ONLINE operations require more
CPU. These numbers are meant to give a general ballpark, not to be precise, because
the performance varies depending upon the characteristics of the data and the
hardware.
Rebuilding and compression can be
parallelized and take advantage of multiple CPUs. An example of using the
MAXDOP option is shown in this blog. There are two caveats:
- SQL Server
uses statistics on the leading column to distribute work amongst multiple
CPUs, thus multiple CPUs are not beneficial when creating, rebuilding, or
compressing an index where the leading column of the index has relatively
few unique values or when the data is heavily skewed to just a small
number of leading key values – only limited effective parallelism will be achieved
in this case.
- Compressing
or rebuilding a heap with ONLINE set to ON uses a single CPU for
compression or rebuild. However, SQL Server first needs to scan the
table—the scan is parallelized, and after the table scan is complete, the
rest of the compression processing of the heap is single-threaded.
Summary
of Resource Requirements for Data Compression
Table 2 shows a summary of workspace,
CPU, and I/O requirements for compressing a clustered index as compared to
rebuilding the same uncompressed index. Measurements used:
- X = number
of pages before compression (or rebuild)
- P = number
of pages after compression (P < X)
- Y = number
of new or updated pages (by a concurrent application, applies only to the
ONLINE case)
- M = size of
the mapping index (estimate based on guidelines in the TEMPDB Capacity
Planning white paper)
- C = the CPU
time taken to rebuild the uncompressed index
| Workspace | I/O | CPU |
TEMPDB | UserDB | UserDB Tran Log | TEMPDB | UserDB | UserDB Tran Log |
OFFLINE with BULK_LOGGED or SIMPLE
Recovery Model |
Rebuild | 0 | X | ~0 | 0 | X+2X | ~0 | C |
Compress | 0 | P | ~0 | 0 | X+2P | ~0 | 1.5C
to 5C |
OFFLINE with FULL Recovery Model |
Rebuild | 0 | X | X | 0 | X+X | X | ~C |
Compress | 0 | P | P | 0 | X+P | P | 1.5C
to 5C |
ONLINE with FULL Recovery Model |
Rebuild | M+Y | X+Y | 2X+Y | M+4Y | X+X+Y | 2X+Y | ~2C |
Compress | M+Y | P+Y | 2P+Y | M+4Y | X+P+Y | 2P+Y | 3C
to 10C |
Table
2: Workspace, CPU, and I/O summary for compressing a clustered index
Some
important notes from Table 2:
- Comparing to
rebuild without compression, compression uses less workspace and I/O, but
more CPU. The reduced workspace/I/O is due to the smaller size of the
resulting structure.
- The lowest
resource utilization is when running OFFLINE with the bulk-logged or
simple recovery model. Offline, bulk operations involve reading the
existing data, and writing new, compressed data. Minimal log is written—only
the allocations are logged.
- If it is not
possible to use the BULK_LOGGED or SIMPLE recovery model (but still using
OFFLINE), additional I/O is needed to fully log the compressed data, with the
full recovery model. No additional workspace (compared to bulk-logged) is
allocated, because log space is reserved even in the bulk-logged and
simple recovery models.
- Many production
systems cannot run in bulk-logged or simple mode, and many systems cannot
afford downtime for OFFLINE compression. ONLINE operations require about
twice as much CPU as OFFLINE operations.
How
and When to Compress
The key
decisions to be made on when and how to compress are:
- Online vs.
Offline: Whether to set the value of ONLINE to ON or OFF depends upon what
else is running on the database at the same time. OFF is faster and
requires less resources than ON, but the table is locked for the duration
of the compression operation. Be aware of the restrictions of online
operations as discussed in SQL
Server Books Online.
- One table,
index, or partition at a time vs. many concurrently: Compressing one table,
index, or partition at a time is recommended in most cases. When doing
multiple simultaneous compressions, workspace, I/O, and CPU requirements
outlined earlier must be available for all the compressions together. Be
mindful of the risk of insufficient free space and the need to expand data
files, as well as the likelihood of large amount of unused space in the
data files at the end of compression. If you have sufficient resources
(workspace, I/O, and CPU), and have an efficient mechanism to reclaim the
unused space (refer to section 6 later in this white paper), executing
multiple compressions simultaneously may be acceptable.
- Order of
compressing the tables: After you have decided on the list of tables and
indexes to compress, compress the objects starting with the smallest in
this list. Compressing smaller objects requires less workspace, and it
releases space to the data files that can be used as workspace for
compressing the larger objects subsequently. This approach minimizes the
need for additional disk space during the compression process.
- Setting SORT_IN_TEMPDB
to ON or OFF: ON is recommended. This makes use of tempdb space for the mapping index, and therefore, it requires
less workspace in the user database.
Side
Effects of Compressing a Table or Index
When you
compress a table or an index, you should be aware of two side effects:
- Compression
includes a rebuild, thus removing fragmentation from the table or index.
- When a heap
is compressed, if there are any nonclustered indexes on the heap, they are
rebuilt as follows:
o
With ONLINE set to OFF, the nonclustered indexes are rebuilt one by one.
o
With ONLINE set to ON, all the nonclustered indexes are rebuilt
simultaneously.
You must account
for the workspace required to rebuild the nonclustered indexes, because the
space for the uncompressed heap is not released until the rebuild of the nonclustered
indexes is complete.
Manipulating Compressed Data
This section
explains what happens when compressed data is changed: new rows inserted, rows
deleted or updated.
Newly-Inserted
Rows
With row
compression, newly inserted rows are row-compressed. With page compression, a
newly inserted row is row-compressed or page-compressed, depending upon the
following:
- The table
organization: heap or clustered index
- How and
where the new row is inserted
Table 3
summarizes the compression state of the newly inserted rows into a compressed
table.
Table organization | Table compression setting |
ROW | PAGE |
Heap | The newly
inserted row is row-compressed. | The newly
inserted row is page-compressed: ·
if new row goes to an existing page with page
compression ·
if the new row is inserted through BULK INSERT
with TABLOCK ·
if the new row is inserted through INSERT INTO ...
(TABLOCK) SELECT ... FROM Otherwise,
the row is row-compressed.* |
Clustered
index | The newly
inserted row is row-compressed. | The newly
inserted row is page-compressed if new row goes to an existing page with page
compression Otherwise, it is row compressed until the page fills up. Page
compression is attempted before a page split.** |
Table 3: Compression of the newly
inserted rows into a compressed heap or clustered index
* The
resulting row-compressed pages can be page-compressed by running a heap rebuild
with page compression.
** With page
compression, all the pages in the table may not actually be page-compressed. A
page is page-compressed only if the space savings on that page exceeds an
internally defined threshold.
Updating
or Deleting Compressed Rows
All updates
to the rows in a row-compressed table or partition will maintain the rows in
row-compressed format. Not every update to the rows in a page-compressed table
or partition will cause the column prefix and page dictionary to be recomputed.
When the number of changes on a given page-compressed page exceeds an
internally defined threshold, the column prefix and page dictionary are
recomputed.
What
Happens to the Supporting Data Structures
When an
application manipulates (INSERT, UPDATE, DELETE, CREATE/REBUILD INDEX, and so
on) data in a table, some supporting data structures may be created by SQL
Server, which may temporarily hold a subset of the data. Some of these data
structures are:
- Transaction
log
- Mapping index
- Version store
- Sort pages
Whether or
not these supporting structures are compressed if the data in a compressed
table is manipulated depends upon the type of the data structure and the type
of data compression used on the table. Table 4 summarizes the compression
characteristics of the supporting data structures when a compressed table is
manipulated.
Table 4: Compression characteristics
of supporting data structures when data in a compressed table is modified
On
a Page-Compressed Index, Nonleaf Pages are Row-Compressed
All updates
to the rows in a row-compressed table or partition will maintain the rows in
row-compressed format. Not every update to the rows in a page-compressed table
or partition will cause the column prefix and page dictionary to be recomputed.
When the number of changes on a given page-compressed page exceeds an
internally defined threshold, the column prefix and page dictionary are
recomputed.
On a page-compressed
index (clustered or nonclustered), the leaf level pages are page-compressed;
but the nonleaf pages are row-compressed, not page-compressed (see Appendix D).
This is for efficiency reasons.
- The number
of nonleaf pages in an index are relatively small, and the space saving
that would result from page-compressing these pages is relatively
insignificant.
- The nonleaf
pages are accessed much more frequently, and having them row-compressed (instead
of page-compressed) reduces the cost of decompressing them on each access.
Reclaiming
the Space Released by Data Compression
After data compression
has completed, the space saved is released to the respective data file(s).
However, the space is not released to the filesystem, because the file size
doesn’t reduce automatically as part of data compression. There are several
options to release the space to the filesystem by reducing the size of the
file(s):
Option
1: You can decide not to reclaim the
released space, keeping the free space in the filegroup for the future data
growth. This is a simple option for databases where the data volume within the existing
filegroups is expected to grow in the future. It is not an option for
partitioned tables where each partition is allocated on a different filegroup,
and you want to compress the older read-only partitions to save disk space.
Option
2: DBCC SHRINKFILE (or DBCC
SHRINKDATABASE) is an option, but shrinking a database file severely fragments
its contents. Also be aware that DBCC SHRINKFILE is single-threaded and may
take a long time to complete. In a test, after a clustered index was
page-compressed, a filegroup had about 68 percent free space. After DBCC
SHRINKFILE, the fragmentation became 100 percent.
.jpg)
Figure
1: Database free space and percent fragmentation after DBCC SHRINKFILE
After ALTER INDEX … REORGANIZE, the fragmentation
reduced to 0. Use REORGANIZE (which does not require additional data file
space), and not REBUILD (which creates a new index and then drops the existing
index, requiring more workspace in the filegroup and extending the data file
again).
Option
3: If compressing all the tables in a
filegroup:
- Create a new filegroup.
- Move tables and indexes to the new filegroup
while compressing.
- You need to create (or re-create, if one already exists) a clustered
index (to move a heap, refer to Option 4) on the table and move it to the new
filegroup in one operation. Use the DROP_EXISTING option of CREATE CLUSTERED
INDEX command, if you already have a clustered index. For example:
CREATE UNIQUE CLUSTERED INDEX [PK_TRADE]
ON [TRADE_BULK] ([T_ID] ASC)
WITH (DATA_COMPRESSION=PAGE, DROP_EXISTING=ON, SORT_IN_TEMPDB=ON) ON [FG_Data2]
- After all the tables and indexes from the
old filegroup have been compressed and moved to the new filegroup, the old
filegroup and its file(s) can be removed to release space to the filesystem.
- Be aware of a caveat in this method. If the
table has a LOB_DATA allocation unit in the same filegroup, then this method
will not move the LOB_DATA to the new filegroup (only the IN_ROW_DATA and
ROW_OVERFLOW_DATA allocation units are moved when a clustered index is
re-created in a different filegroup). So the old filegroup will not be
completely empty, and hence cannot be dropped.
Option
4: There is another solution if you are
compressing all the tables in a filegroup. Create an empty table in the new
filegroup, compress it, and then copy the data over to the new table using
INSERT … SELECT.
-- Create a new empty table in the new filegroup
ALTER DATABASE [TestDB] MODIFY FILEGROUP FG_COMP DEFAULT;
SELECT * INTO [Tab1] FROM [Tab] WHERE 1 = 2;
-- Compress the newly created empty table
ALTER TABLE [Tab1] REBUILD WITH (DATA_COMPRESSION = PAGE);
-- Create the appropriate indexes on the table
-- Copy the data over to the new table
INSERT INTO [Tab1] WITH (TABLOCK) SELECT * FROM [Tab]
-- The incoming data will be compressed as it gets inserted
-- TABLOCK is required if the target table is a heap
-- Trace flag 610 may help enable minimal logging
-- The LOB_DATA allocation unit will also be copied
-- Drop the old table and rename the new table as old table
-- After all the tables are copied like this, remove the old filegroup
Application
Performance with Data Compression
As discussed earlier, data compression
reduces logical and physical I/O, but it increases CPU consumed. In this
section we discuss some workloads and their performance behavior under data
compression.
Workload
1: An OLTP Application with High Volumes of DML Operations
Figure 2
shows the performance achieved by a customer on an OLTP application with high
volumes of DML (INSERT, UPDATE, and DELETE) operations. The average response
time of four different types of business transactions were measured with NONE,
ROW, and PAGE compression.
.jpg)
Figure 2: A Customer OLTP workload
performance with data compression
As
illustrated in Figure 2, this workload achieved similar or better performance
with row compression as compared to no compression, for all business
transactions, except the fourth one. However, with page compression, all the
business transactions took longer to complete as compared to no compression or
row compression. Based on these tests, this customer decided to use row
compression with all their tables and indexes in this application.
Workload
2: A Reporting Application with Large Queries
Figure 3
shows the performance achieved by a customer on a reporting application. The
response times of eight queries were measured with no compression and
page-compressed data in the customer’s test environment.
.jpg)
Figure 3: A Customer reporting
workload performance with data compression
As
illustrated in Figure 3, this workload achieved better performance for all
queries with page compression, as compared to no compression. For some queries
the response time was cut in half or better. Based on these tests, this
customer is planning to deploy page compression in production for this
application.
Rebuilding
Compressed Indexes
Rebuilding
compressed indexes takes longer compared to rebuilding corresponding
uncompressed indexes. Note that if the table is compressed, but the index
(nonclustered) is not compressed, rebuilding that index will not require
additional processing time.
Rebuilding a
compressed index involves decompressing the index pages and then rebuilding the
index with compression. This requires longer time and more CPU resources.
Figures 4 and 5 show the CPU usage and time taken to rebuild a clustered and a
nonclustered index, respectively.
.jpg)
Figure 4: CPU usage and time taken to
build a clustered index
.jpg)
Figure 5: CPU usage and time taken to
build a nonclustered index
Note the
significant increase in the CPU usage for rebuilding the page-compressed index.
The reason for the high CPU usage is that when you rebuild a page-compressed
index, the compression information (prefix and dictionary) on each (leaf-level)
page is recomputed.
BULK
INSERT with Data Compression
Bulk loading
data into a compressed table involves compressing the data while performing the
load. Therefore, BULK INSERT takes longer on a compressed table. Figure 6 shows
the performance of BULK INSERT on a heap with and without data compression. The
first three data points display BULK INSERT performance without the TABLOCK
option. Notice that without the TABLOCK option, the size of the compressed heap
table is the same with both row and page compression. If the TABLOCK option is
not used while performing BULK INSERT on a page-compressed heap, the newly
inserted pages will be row-compressed, instead of being page-compressed (refer
to Table 3 in section 5.1). Use the query in Appendix E to confirm.
.jpg)
Figure 6: BULK INSERT into a heap with
and without data compression
The last
data point in Figure 6 shows the BULK INSERT performance (on a page-compressed
heap) with the TABLOCK option. Note that the TABLOCK option results in a
smaller table size, because almost all the pages are page-compressed during the
load. Therefore, remember to use the TABLOCK hint if you are loading data into
a page-compressed heap.
The same
considerations apply if you are performing an INSERT … SELECT operation into a
page compressed heap. Be sure to use the TABLOCK hint to ensure that new pages are
compressed.
BULK
INSERT Followed by CREATE CLUSTERED INDEX
In many bulk
loading scenarios, loading data is typically followed by creating a clustered
index. In a sliding window scenario, usually new data is loaded into an empty
staging table, and then a clustered index (and other appropriate indexes and
constraints) is created on the staging table, to make it ready for switching
into an empty partition in a partitioned table. If you are loading data into an
empty table and then creating a clustered index, and the data needs to be
compressed, there are multiple options:
- Option 1:
BULK INSERT into uncompressed heap, followed by CREATE CLUSTERED INDEX
WITH (DATA_COMPRESSION = PAGE). Because the data is loaded into an
uncompressed heap, it allows for faster loading compared to the other two
options. This option allows compressing the data at the same time as
creating the clustered index, thereby reducing the total time. However,
more free space is needed in the user database compared to option 3,
because, the uncompressed heap and the compressed clustered index need to
reside in the user database simultaneously while the index is being
created.
- Option 2:
BULK INSERT into a page-compressed heap, followed by CREATE CLUSTERED
INDEX. Because the data is loaded into a heap, the loading is faster
compared to option 3; however, because the heap is compressed, loading
takes longer compared to option 1 (the data is compressed while being
loaded). And, because the heap and the clustered index need to reside in
the user database while the index is created, more free space is needed
than option 3; but less than option 1, because both the heap and the
clustered index are compressed.
- Option 3:
BULK INSERT into a page-compressed clustered index. This option takes
longer, because the data is loaded into a clustered index, and data is
compressed during loading, but all the tasks (loading, compressing,
creating clustered index) are completed together. Because there is no post-processing
involved in the form of creating the clustered index or compressing the
data, no extra free space is required in the user database.
Figure 7 illustrates the time required
for bulk loading data, creating a clustered index, and compressing the data.
Figure 8 illustrates the workspace required for these tasks.
.jpg)
Figure 7: Time required for bulk data
loading, creating a clustered index, and compressing the data (simple recovery
model, ONLINE = OFF, SORT_IN_TEMPDB = ON)
.jpg)
Figure 8: Workspace required for bulk
data loading, creating a clustered index, and compressing the data (simple
recovery model, ONLINE = OFF, SORT_IN_TEMPDB = ON)
Data
Compression and Partition Manipulation
Partitioning provides flexibility for
compressing data selectively. Each partition in a table can have a different
compression setting. Be aware of the compression settings on partition
manipulation operations, such as switch, split, and merge.
Switch
Switching a
partition requires that the source and the target have the same compression
setting. The target of the switch partition operation is always an empty
partition (or table). Changing the compression setting of an empty partition or
table is very quick, because this is a metadata-only operation. Therefore,
check and change the compression setting of the target, if needed, prior to
executing the switch command.
Split
New
partition inherits the data compression property of the partition being split.
Merge
Merging two
partitions essentially removes the boundary between the two partitions, thereby
dropping one partition and moving all the data in that partition to the other
partition. Therefore merging partitions involves a source partition (the
partition being dropped) and a destination partition (the partition into which
the data from the dropped partition is moving). The destination partition
retains its compression property.
If the
destination partition’s compression setting is NONE, the data coming in from
the source partition will be decompressed during the merge operation. If the
destination partition’s compression setting is ROW, the data coming in from the
source partition will be row-compressed during the merge operation. If the
destination partition’s compression setting is PAGE, the data coming in from
the source partition will be either row-compressed (if the table is a heap) or
page-compressed (if the table is a clustered index), as shown in Table 5.
Table 5: Partition merge and data
compression
During a
partition merge operation, the source partition and the destination partition
are identified based on the partition function used. The example in Figure 9
illustrates the compression behavior of partition merge for LEFT and RIGHT
partition functions.
.jpg)
Figure 9: Data compression and
partition manipulation
Some
partitioning scenarios in large data warehouses dedicate an entire separate
filegroup to each partition, so that individual partitions can be marked as
read-only to minimize backup requirements. In this scenario, to avoid using
DBCC SHRINKFILE, it is important that a partition is initially loaded as
compressed, as discussed in Section 7. For heaps, this requires that the
partition is bulk loaded using the TABLOCK hint, and for clustered indexes, it
means that the compressed clustered index should be in place as the data is
loaded, rather than created later, leaving unused empty space in the filegroup.
Data
Compression and Transparent Data Encryption
Transparent data encryption (TDE) is
another very useful feature in SQL Server 2008. TDE provides encryption of data
in a database at the storage level without requiring any application changes. A
common question related to this is “How does data compression perform against
an encrypted database?”
TDE is transparent to data
compression. Figure 10 displays the amount of time it takes to page-compress a
clustered index, and the amount of space savings obtained by page compression,
with and without TDE. As illustrated in Figure 10, TDE has negligible, if any,
impact on data compression in either compression time or space saved.
.jpg)
Figure
10: Data compression with transparent data encryption
TDE encrypts the pages when they are
written to disk and decrypts them when they are read from disk into memory.
Because data compression (as well as decompression) is performed on in-memory
pages, data compression always sees unencrypted data, and hence the
effectiveness and efficiency of data compression is not impacted by TDE.
Conclusion
Data
compression provides multiple benefits. It saves disk space, and it can help
improve the performance of certain workloads. The benefits of data compression
come at the cost of higher CPU usage for compressing and decompressing the
data. Therefore, it is important to understand the workload characteristics on
a table before deciding on a compression strategy. Data compression provides
flexibility in terms of levels of compression (row or page) and the objects you
can compress (table, index, partition). This enables fine-tuning the
compression based on the characteristics of data and the workload.
Another
important advantage of data compression is that it works transparently to the
application, and it works well with other SQL Server features, such as TDE and
backup compression.
The results
shown in this white paper are based on the data and the hardware used in our
tests. Your results will vary based on your own data, workload and hardware.
Perform thorough testing when deciding what tables and indexes to compress.
For more information:
http://www.microsoft.com/sqlserver/: SQL Server Web site
http://technet.microsoft.com/en-us/sqlserver/: SQL Server TechCenter
http://msdn.microsoft.com/en-us/sqlserver/: SQL Server DevCenter
Did this
paper help you? Please give us your feedback. Tell us on a scale of 1 (poor) to
5 (excellent), how would you rate this paper and why have you given it this
rating? For example:
- Are
you rating it high due to having good examples, excellent screen shots, clear
writing, or another reason?
- Are
you rating it low due to poor examples, fuzzy screen shots, or unclear writing?
This feedback
will help us improve the quality of white papers we release.
Send feedback.
Appendix A: Test Hardware
and Software
All the tests
(except those in Figure 2 and Figure 3, which were run in customer test
environments) were performed on the following hardware and software
environment.
Server
DELL R805 with:
- 2
socket quad core
- AMD
Opteron Processor 2354 @2.20 GHz
- 32
GB RAM
Storage
EqualLogic iSCSI Storage
3 Disk
shelves – Each Containing 14 SAS drives. Each drive had:
Software
- The
64-bit edition of the Windows Server 2008 Enterprise operating system
- The
64-bit edition of SQL Server 2008 Enterprise
Appendix B: Data Volume in
Each Allocation Unit Type
A table or a
partition can have three allocation units - IN_ROW_DATA, LOB_DATA and
ROW_OVERFLOW_DATA.
Usually, most of the data in the table is stored in the IN_ROW_DATA allocation unit. Depending upon the
row size and the table options, some data can be stored outside the row in ROW_OVERFLOW_DATA or LOB_DATA allocation units. Use the following
script to determine how much data is stored in each of the three allocation
units.
-- Provided AS IS, without warranty of any kind
SELECT OBJECT_NAME(p.object_id) AS Object_Name
, i.name AS Index_Name
, ps.in_row_used_page_count AS IN_ROW_DATA
, ps.row_overflow_used_page_count AS ROW_OVERFLOW_DATA
, ps.lob_used_page_count AS LOB_DATA
FROM sys.dm_db_partition_stats ps
JOIN sys.partitions p ON ps.partition_id = p.partition_id
JOIN sys.indexes i ON p.index_id = i.index_id AND p.object_id = i.object_id
WHERE OBJECTPROPERTY (p.[object_id], 'IsUserTable') = 1
Appendix C: Free Space in
Database Files
--Provided AS IS, without warranty of any kind
SELECT
a.file_id,
LOGICAL_NAME = a.name,
PHYSICAL_FILENAME = a.physical_name,
FILEGROUP_NAME = b.name,
FILE_SIZE_MB = CONVERT(DECIMAL(12,2),ROUND(a.size/128.000,2)),
SPACE_USED_MB = CONVERT(DECIMAL(12,2),ROUND(FILEPROPERTY(a.name,'SpaceUsed')/128.000,2)),
FREE_SPACE_MB = CONVERT(DECIMAL(12,2),ROUND((a.size-FILEPROPERTY(a.name,'SpaceUsed'))/128.000,2))
FROM sys.database_files a LEFT OUTER JOIN sys.data_spaces b
ON a.data_space_id = b.data_space_id
Appendix D: On a
Page-Compressed Index, Verifying That Nonleaf Pages are Row-Compressed
Use the
following query on a page-compressed index (clustered or nonclustered) to
verify that the leaf-level pages are page-compressed; but the nonleaf pages are
row-compressed, not page-compressed.
SELECT
o.name, ips.index_type_desc, p.partition_number, p.data_compression_desc,
ips.index_level, ips.page_count, ips.compressed_page_count
FROM sys.dm_db_index_physical_stats
(DB_ID(), object_id(<insert index name here>), NULL, NULL, 'DETAILED') ips
JOIN sys.objects o ON o.object_id = ips.object_id
JOIN sys.partitions p ON p.object_id = o.object_id
ORDER BY ips.index_level
The output of
this query on a page-compressed clustered index looks like the following.
Table 6
Observe the
columns index_level and compressed_page_count in the output. Note that almost
all the leaf pages (index_level = 0) are page-compressed; but none of the
nonleaf pages (index_level >=1) are page-compressed (compressed_page_count
column is 0 for nonleaf level pages).
Appendix E: BULK INSERT into
a Heap
The following
query can be used to determine how many pages in a table, index, or partition
are page-compressed.
SELECT
o.name, ips.index_type_desc, p.partition_number, p.data_compression_desc,
ips.page_count, ips.compressed_page_count
FROM sys.dm_db_index_physical_stats
(DB_ID(), object_id(<insert table name here>), NULL, NULL, 'DETAILED') ips
JOIN sys.objects o ON o.object_id = ips.object_id
JOIN sys.partitions p ON p.object_id = o.object_id
After
performing BULK INSERT on a page-compressed heap, the output of the above query
will look similar to the following.
Table 7
Note the
columns data_compression_desc and compressed_page_count. The column
data_compression_desc in sys.partitions shows the data compression setting
(metadata) for a given table, index, or partition. To get the actual number of
page-compressed pages, use the column compressed_page_count in the dynamic
management function (DMF) sys.dm_db_index_physical_stats. Note that in this
output, even though the data compression setting of the heap is PAGE, the newly
loaded pages are not page-compressed.
After BULK INSERT
is performed with TABLOCK on a page-compressed heap, the output of this query
will look similar to the following.
Table 8
Note that
almost all the pages are page-compressed. Remember to use the TABLOCK hint if
you are loading data into a page-compressed heap, or if you are performing an
INSERT … SELECT operation into a page-compressed heap.
Note: The compressed_page_count column is
displayed only when the DETAILED mode is used with the
sys.dm_db_index_physical_stats DMF. When the LIMITED mode (the default) is
used, the output of this column is NULL. Be careful when using the DETAILED
mode, because it scans all the pages in the table, and it takes a significant
amount of time on large tables. When you use the DETAILED mode, we recommended
that you specify a value for object_id; do not specify NULL for this parameter.
Specifying object_id as NULL (an invalid or nonexistent object reference to
object_id translates to NULL) will scan all the pages in all the tables in the
database, and it may significantly impact performance.
Appendix F: Additional
References
SQL
Server Storage Engine Team Blog
SAP
on SQL Server blog on data compression
SQL
Server 2008 Technologies for SAP Solutions
HP white paper
on SQL Server data compression
Unisys
white paper on SQL Server data compression
NetApp white
paper on SQL Server data compression
Linchi
Shea’s blog on data compression
Paul
Nielsen’s blog on data compression
Kalen
Delaney’s blog on data compression
Microsoft
Case Study on Data Compression