Statistics Used by the Query Optimizer in Microsoft SQL Server 2008
SQL Server Technical Article
Writer: Eric N. Hanson and Yavor Angelov
Contributor: Lubor Kollar
Published: February 2009
Applies to: SQL Server 2008
Summary: Microsoft SQL Server 2008 collects statistical information about indexes and column data stored in the database. These statistics are used by the SQL Server query optimizer to choose the most efficient plan for retrieving or updating data. This paper describes what data is collected, where it is stored, and which commands create, update, and delete statistics. By default, SQL Server 2008 also creates and updates statistics automatically, when such an operation is considered to be useful. This paper also outlines how these defaults can be changed on different levels (column, table, and database). In addition, it presents how certain query language features, such as Transact-SQL variables, interact with use of statistics by the optimizer, and it provides guidance for using these features when writing queries so you can obtain good query performance.
Statistical Data in SQL Server 2008
Microsoft SQL Server 2008 collects statistics about individual columns (single-column statistics) or sets of columns (multicolumn statistics). Statistics are used by the query optimizer to estimate the selectivity of expressions, and thus the size of intermediate and final query results. Good statistics allow the optimizer to accurately assess the cost of different query plans and then choose a high-quality plan. All information about a single statistics object is stored in several columns of a single row in the sysindexes table, and in a statistics binary large object (statblob) kept in an internal-only table. In addition, information about statistics can be found in the new metadata views sys.stats and sys.indexes.
Summary of Statistics Features
SQL Server 2008 has a number of features for maintaining statistics. The most important one is the ability to automatically create and update statistics. This feature, also called auto stats, is on by default starting in SQL Server 2000. Approximately 98% of SQL Server 2000 installations leave this feature enabled, which is a best practice. For the majority of databases and applications, developers and administrators can rely on the automatic creation and update of statistics to provide comprehensive and accurate statistics about their data. This allows the SQL Server 2008 query optimizer to produce good query plans consistently, while keeping development and administration costs low. If you need more control over statistics creation and update to get good query plans and manage the overhead of gathering statistics, you can use manual statistics creation and update capabilities.
An important feature for high-throughput online transaction processing application environments is the ability to asynchronously update statistics automatically. This can improve predictability of query response time in such environments.
SQL Server 2008 statistics features allow you to:
· Implicitly create and update statistics with the default sampling rate (in the SELECT, INSERT, DELETE, UPDATE and MERGE commands, use of a column in a query condition such as a WHERE or JOIN clause causes statistics to be created or updated if necessary when automatic statistics creation and update is enabled).
· Manually create and update statistics with any desired sampling rate, and drop statistics (CREATE STATISTICS, UPDATE STATISTICS, DROP STATISTICS, CREATE INDEX, DROP INDEX).
· Create and maintain filtered statistics, which are defined over a subset of the rows in the table by using a WHERE clause.
· Automatically create filtered statistics whenever a filtered index is created.
· Manually create statistics in bulk for all columns of all tables in a database (sp_createstats).
· Manually update all existing statistics in the database (sp_updatestats).
· List statistics objects that exist for a table or database (sp_helpstats, catalog views sys.stats, sys.stats_columns).
· Display descriptive information about statisticsobjects (DBCC SHOW_STATISTICS).
· Enable and disable automatic creation and update of statistics database-wide or for a specific table or statistics object (ALTER DATABASE options AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS, sp_autostats, and the NORECOMPUTE option on CREATE STATISTICS and UPDATE STATISTICS).
· Enable and disable asynchronous automatic update of statistics (ALTER DATABASE option AUTO_UPDATE_STATISTICS_ASYNC).
In addition, SQL Server Management Studio allows you to graphically browse and control statistics objects within its Object Explorer view. Statistics are listed in Object Explorer in a folder underneath each table object.
What's New for Statistics in SQL Server 2008?
SQL Server 2008 provides the option to create filtered statistics, which are restricted to a subset of the rows in the table defined by a predicate, also called a filter. Filtered statistics are either created explicitly, or more commonly, through creating a filtered index.
- Automatic creation: As mentioned, filtered statistics are automatically created as a by-product of filtered indexes. Unlike regular statistics, filtered statistics are never automatically created by the query optimizer as a result of query and update statements.
- Automatic update: To determine when statistics need to be automatically updated, the query optimizer uses column modification counters. For filtered statistics, the counter is multiplied by the selectivity of the filter, which ranges from 0 to 1, to compensate for the fact that the column counter tracks changes to all rows in the table, not the changes to the rows in the filtered statistics object.
- Sampling: To build the statistics, SQL Server 2008 reads every nth data page, according to the sampling rate, and then it applies the filter, if any.
- If you don’t specify sampling rate, the default behavior is to sample based on the number of rows in the table and, for filtered statistics, based on the selectivity of the filter. The more selective the filter, the more rows need to be sampled to gather enough data to build the histogram.
- Cardinality estimate: As with regular statistics, the optimizer attempts to pick the most relevant statistics when performing cardinality estimate. There is an additional check that the optimizer must make for filtered statistics: whether the statistics predicate contains the query predicate (that is, whether the query predicate must be true for a row if the statistics predicate is true for that row). Out of the possible filtered statistics, the query optimizer favors those that are as close to the query predicate as possible.
Other Important Enhancements in Statistics
The previous version, SQL Server 2005, introduced a number of important statistics features that allow the query optimizer to choose better query plans for a broader range of queries, or otherwise improve the management of statistics:
· Large object support: Columns of type ntext, text, and image, as well as the new types nvarchar(max), varchar(max), and varbinary(max) can be specified as statistics columns.
· Improved statistics loading framework: The optimizer internally improves the loading of statistics compared with SQL Server 2000. It now loads all and only the statistics it needs, improving optimization result quality and performance.
· Minimum sample size: A minimum of 8 megabytes (MB) of data, or the size of the table if smaller, are now sampled during statistics gathering.
· Increased limit on number of statistics: The number of column statistics objects allowed per table has been increased to 2,000. An additional 249 index statistics may also be present, for a maximum of 2,249 statistics per table.
· Clearer and more consistent display of histograms: DBCC SHOW_STATISTICS is improved because histograms are always scaled before they are stored in the catalogs.
· Enhanced DBCC SHOW_STATISTICS output: DBCC SHOW_STATISTICS now displays the name of the statistics object being displayed, to avoid ambiguity.
· Single rowset output for DBCC SHOW_STATISTICS: DBCC SHOW_STATISTICS supports options to output the header, density vector, and histogram individually as single rowsets. This allows easier programming when DBCC SHOW_STATISTICS output is automatically processed.
· Statistics on internal tables: Statistics are fully supported on the tables listed in sys.internal_tables, including XML and full-text indexes, Service Broker queues, and query notification tables.
· Statistics on up to 32 columns: The limit on the number of columns in a statistics object has been increased to 32 from 16.
· Statistics on partitioned tables: Statistics are fully supported on partitioned tables, which were introduced in SQL Server 2005. Histograms are maintained on a per-table basis (not per-partition).
· Parallel statistics gathering for fullscan: For statistics gathered with fullscan, creation of a single statistics object can be done in parallel for both nonpartitioned and partitioned tables.
· Improved recompiles and statistics creation in case of missing statistics: In the case where auto create of statistics failed, on a subsequent execution of a plan generated with missing statistics, auto create is performed and the plan is recompiled; the missing statistics condition does not persist. For more information, see the Batch Compilation, Recompilation, and Plan Caching Issues in SQL Server 2005 white paper by Marathe.
· Improved recompilation logic and statistics update for empty tables: Changing from 0 to > 0 rows in a table results in query recompilation and update of statistics. For more information, see the Batch Compilation, Recompilation, and Plan Caching Issues in SQL Server 2005 white paper by Marathe.
· Inferred date correlation constraints: By enabling the DATE_CORRELATION_OPTIMIZATION database setting, you can cause SQL Server to maintain information about the correlation between datetime fields across a pair of tables linked by a foreign key. This information is used to allow implied predicates to be determined for some queries. The information is not used directly for selectivity estimation or costing by the optimizer, so it is not statistics in the strictest sense, but it is closely related to statistics because it is auxiliary information used to help obtain a better query plan.
· sp_updatestats: In SQL Server 2005 and SQL Server 2008, sp_updatestats updates only those statistics that require updating based on the rowmodctr information in the sys.sysindexes compatibility view, thus eliminating unnecessary updates of unchanged items. For databases running under compatibility level 90 or higher, sp_updatestats preserves the automatic UPDATE STATISTICS setting for any particular index or statistics.
There are also some other minor changes in behavior for statistics. In particular, the statblob column in sys.sysindexes is now always given as NULL. The statblob itself is kept in an internal-only catalog table.
We define the following terms related to SQL Server 2008 statistics:
· statblob: Statistics binary large object. This object is stored in the system catalogs in an internal catalog view, sys.sysobjvalues.
· String summary: A form of statistics that summarizes the frequency distribution of substrings in a string column. This is used to help estimate selectivity of LIKE predicates. It is stored in the statblob for a string column.
· sysindexes: The sys.sysindexes catalog view, which contains information about tables and indexes.
· Predicate: A condition that evaluates to true or false. Predicates appear in a WHERE or JOIN clause in a database query.
· Selectivity: The fraction of rows from the input set of the predicate that satisfy the predicate. More sophisticated selectivity measures are also used to estimate the number of rows produced by joins, DISTINCT, and other operators. For example, SQL Server 2008 estimates the selectivity of the predicate "Sales.SalesOrderHeader.OrderID = 43659" in the AdventureWorks database as 1/31465 = 0.00003178.
· Cardinality estimate: An estimate of the size of a result set. For example, if a table T has 100,000 rows and a query contains a selection predicate of the form T.a=10, and a histogram shows that the selectivity of T.a=10 is 10%, then the cardinality estimate for the fraction of rows of T that must be considered by the query is 10% * 100,000 = 10,000.
· LOB: Large object (value of type text, ntext, image, varchar(max),nvarchar(max), varbinary(max)).
· Filter: a condition that is evaluated to determine whether a row must be part of the filtered statistics. The predicate appears in the WHERE clause of the CREATE STATISTICS or CREATE INDEX statements (in the case when statistics are automatically created as a side effect of creating an index).
Statistics Collected by SQL Server 2008
SQL Server 2008 maintains the following information at the table level. These are not part of a statistics object, but SQL Server 2008 uses them in some cases during query cost estimation:
· Number of rows in the table or index (rows column in sys.sysindexes)
· Number of pages occupied by the table or index (dpages column in sys.sysindexes)
SQL Server 2008 collects the following statistics about table columns and stores them in a statistics object (statblob):
· The time the statistics were collected.
· The number of rows used to produce the histogram and density information (described later in this section). (The row heading is “Rows sampled.”)
· The average key length.
· A single-column histogram, including the number of steps.
· A string summary, if the column contains character data. DBCC SHOW_STATISTICS output contains a column String Index, which has the value YES if a statistics object contains a string summary.
· The estimated number of rows matching the filter (for filtered statistics); or all rows in the table (for regular statistics). (The row heading is “Rows.”)
All rows in the table. (The row heading is
A histogram is a set of up to 200 values of a given column. All or a sample of the values in a given column are sorted; the ordered sequence is divided into up to 199 intervals so that the most statistically significant information is captured. In general, these intervals are of nonequal size. The following values, or information sufficient to derive them, are stored with each step of the histogram.
Shows the upper boundary of a histogram step. This is a key value.
Specifies how many rows are inside the range (they are smaller than this RANGE_HI_KEY, but bigger than the previous smaller RANGE_HI_KEY).
Specifies how many rows are exactly equal to RANGE_HI_KEY.
Specifies the average number of rows per distinct value inside the range.
Specifies how many distinct key values are inside this range (not including the previous key before RANGE_HI_KEY and RANGE_HI_KEY itself).
Histograms in SQL Server 2008 are only built for a single column—the first column in the set of key columns of the statistics object.
SQL Server 2008 builds the histogram from the sorted set of column values in three steps:
1. Histogram initialization: In the first step, a sequence of values starting at the beginning of the sorted set is processed, and up to 200 values of RANGE_HI_KEY, EQ_ROWS, RANGE_ROWS, and DISTINCT_RANGE_ROWS are collected (RANGE_ROWS and DISTINCT_RANGE_ROWS are always zero during this step). The first step ends either when all input has been exhausted, or when 200 values have been found.
2. Scan with bucket merge: Each additional value from the leading column of the statistics key is processed in the second step, in sorted order; for each successive value, either the value is added to the last range, or a new range at the end is created (this is possible because the input values are sorted). If a new range is created, one pair of existing, neighboring ranges is collapsed into a single range. This pair of ranges is selected in order to minimize information loss. The number of histogram steps after the ranges are collapsed stays at 200 throughout this step. This method is based on a variation of the maxdiff histogram.
3. Histogram consolidation: In the third
step, more ranges can be collapsed if a significant amount of information is
not lost. Therefore, even if the column has more than 200 unique values, the
histogram might have less than 200 steps.
If the histogram has been built using a sample, the values of RANGE_ROWS, EQ_ROWS, DISTINCT_RANGE_ROWS and AVG_RANGE_ROWS are estimated and therefore they do not need to be whole integers.
Density is information about the number of duplicates in a given column or combination of columns, and it is calculated as 1/(number of distinct values). If a column is used in an equality predicate, the number of qualifying rows is estimated by using the density derived from the histogram. Histograms are also used to estimate the selectivity of nonequality selection predicates, joins, and other operators.
In addition to the timestamp showing the time the statistics were gathered, the number of rows in the table, the number of rows estimated to match the filter, the number of rows sampled to produce the histogram, the Density information and average key length, and the histogram itself, single-column statistics information includes an All density value for each set of columns forming a prefix of the statistics column set. This is shown in the second rowset output by DBCC SHOW_STATISTICS. All density is an estimate of 1/(number of distinct values in the prefix column set). The next section gives an example of this.
Note: The Density value included in the first row returned by dbcc show_statistics is the density of all values sampled other than the RANGE_HI_KEY values. The RANGE_HI_KEY values are typically the more frequent values in the distribution. Hence, the displayed Density value gives potentially useful information about the density of nonfrequent values.
The multicolumnstatistics for one set of columns consist of one histogram for the first column in the statistics definition, one density value for the first column, and an All Density value for each prefix combination of columns (including the first column alone). Each set of multicolumn statistics (a histogram and two or more density values) is stored in one statblob together with the timestamp of the last statistics update, the number of rows in the sample used to produce the statistical information, the number of steps in the histogram, and the average length of the key. A string summary is included for only the first column, only if it contains character data. Multicolumn densities are not supported for filtered statistics.
Use sp_helpindex and sp_helpstats to display the list of all statistics available for a given table. sp_helpindex lists all indexes on the table, and sp_helpstats lists all the statistics on the table. Each index also carries the statistical information for its columns. The statistical information created by using the CREATE STATISTICS command is equivalent to the statistics built by a CREATE INDEX command on the same columns. The only difference is that the CREATE STATISTICS command uses sampling by default while the CREATE INDEX command gathers the statistics with fullscan, because it has to process all rows for the index anyway.
The following example illustrates how statistics are created both automatically and manually, and shows how to list and display information about statistics. Results are given for some but not all commands; the output produced by SQL Server 2008 is shown when this is useful to illustrate its behavior. You can run this example yourself to see the complete set of output.
USE tempdb GO -- Clean up objects from any previous runs. IF object_id(N'Person.Contact','U') IS NOT NULL DROP TABLE Person.Contact GO IF EXISTS (SELECT * FROM sys.schemas WHERE name = N'Person') DROP SCHEMA Person GO -- Create a sample schema and table. CREATE SCHEMA Person GO CREATE TABLE Person.Contact( FirstName nvarchar(60), LastName nvarchar(60), Phone nvarchar(15), Title nvarchar(15) ) GO -- Populate the table with a few rows. INSERT INTO Person.Contact VALUES(N'John',N'Smith',N'425-555-1234',N'Mr') INSERT INTO Person.Contact VALUES(N'Erik',N'Andersen',N'425-555-1111',N'Mr') INSERT INTO Person.Contact VALUES(N'Erik',N'Andersen',N'425-555-3333',N'Mr') INSERT INTO Person.Contact VALUES(N'Jeff',N'Williams',N'425-555-0000',N'Dr') INSERT INTO Person.Contact VALUES(N'Larry',N'Zhang',N'425-555-2222',N'Mr') GO -- Show that there are no statistics yet on the Person.Contact table. sp_helpstats N'Person.Contact', 'ALL' GO -- Implicitly create statistics on LastName. SELECT * FROM Person.Contact WHERE LastName = N'Andersen' GO -- Show that statistics were automatically created on LastName. sp_helpstats N'Person.Contact', 'ALL' GO
-- Create an index, which also creates statistics. CREATE NONCLUSTERED INDEX Phone on Person.Contact(Phone) GO -- Show that creating the index created an associated statistics object. sp_helpstats N'Person.Contact', 'ALL' GO
-- Create a multicolumn statistics object on first and last name. CREATE STATISTICS FirstLast ON Person.Contact(FirstName,LastName) GO -- Show that there are now three statistics objects on the table. sp_helpstats N'Person.Contact', 'ALL' GO
-- Display the statistics for LastName. DBCC SHOW_STATISTICS (N'Person.Contact', LastName) GO
|Name||Updated||Rows||Rows Sampled||Steps||Density||Average key length||String Index||Filter Expression||Unfiltered Rows|
Jun 20 2008 2:00PM
Column set prefixes and associated densities and lengths:
|All Density||Average Length||Columns|
|RANGE_HI_ KEY||RANGE_ ROWS||EQ_ROWS||DISTINCT_ RANGE_ROWS||AVG_ RANGE_ROWS|
-- If you take the name of the statistics object displayed by -- the command above and subsitute it in as the second argument of -- DBCC SHOW_STATISTICS, you can form a command like the following one --(the exact name of the automatically created statistics object -- will typically be different for you). DBCC SHOW_STATISTICS (N'Person.Contact', _WA_Sys_00000002_2D7CBDC4) -- Executing the above command illustrates that you can show statistics by -- column name or statistics object name. GO -- The following displays multicolumn statistics. Notice the two -- different density groups for the second rowset in the output. DBCC SHOW_STATISTICS (N'Person.Contact', FirstLast)
Results (second rowset only):
Column set prefixes and associated densities and lengths:
|All density||Average length||Columns|
To see a fully populated histogram for a larger table, run the following commands:
USE AdventureWorks -- Clean up objects from previous runs. IF EXISTS (SELECT * FROM sys.stats WHERE object_id = object_id('Sales.SalesOrderHeader') AND name = 'TotalDue') DROP STATISTICS Sales.SalesOrderHeader.TotalDue GO CREATE STATISTICS TotalDue ON Sales.SalesOrderHeader(TotalDue) GO DBCC SHOW_STATISTICS(N'Sales.SalesOrderHeader', TotalDue) GO
To try out filtered statistics on the AccountNumber column for all purchase orders that are not NULL, run:
-- Clean up objects from previous runs. IF EXISTS (SELECT * FROM sys.stats WHERE object_id = object_id('Sales.SalesOrderHeader') AND name = 'ActNumberPONotNull') DROP STATISTICS Sales.SalesOrderHeader.ActNumberPONotNull GO CREATE STATISTICS ActNumberPONotNull ON Sales.SalesOrderHeader (AccountNumber) WHERE PurchaseOrderNumber IS NOT NULL GO DBCC SHOW_STATISTICS(N'Sales.SalesOrderHeader', ActNumberPONotNull) GO
Creating Statistics with SQL Server 2008
You can create statistics in SQL Server 2008 in a number of different ways, as described below:
· The optimizer automatically creates single-column statistics as needed as a side effect of optimizing SELECT, INSERT, UPDATE, DELETE, and MERGE statements if AUTO_CREATE_STATISTICS is enabled, which is the default setting.
Note: The optimizer only creates nonfiltered statistics in these cases.
· There are two basic statements in SQL Server 2008 that explicitly generate the statistical information described above: CREATE INDEX generates the declared index in the first place, and it also creates one set of statistics for the column combinations constituting the index keys (but not other included columns). CREATE STATISTICS only generates the statistics for a given column or combination of columns.
Note: If the CREATE INDEX defines a predicate, the corresponding statistics are created with the same predicate.
· In addition, there are several other ways to create statistics or indexes. Ultimately, though, each issues one of the above two commands. Use sp_createstatsto create statistics for all eligible columns (all except XML columns) for all user tables in the current database. A new statistics object will not be created for columns that already have a statistics object.
· Use dbcc dbreindexto rebuild one or more indexes for a table in the specified database.
· In SQL Server Management Studio, expand the folder under a Table object, right click the Statistics folder, and choose New Statistics.
· Use the Database Engine Tuning Advisor to create indexes.
Here is an example of a CREATE STATISTICS command on the AdventureWorks.Person.Contact table.
CREATE STATISTICS FirstLast2 ON Person.Contact(FirstName,LastName) WITH SAMPLE 50 PERCENT
Usually, statistics with default sampling are sufficient to allow good query plans to be produced. However, there may be cases when statistics with larger sample sizes can benefit query optimization, such as when the values in the given column sample are not random. Nonrandom samples can occur if the data is sorted or clustered. Sorting or clustering can be caused by the creation of indexes, or by loading data into a heap structure that is already sorted or clustered. The most commonly used larger sample size is fullscan, because it gives the most accurate statistics. The price for using statistics with larger sample sizes is the time required to create the statistics.
The above command creates a single two-column statistics object. In this case, SAMPLE 50 PERCENT is ignored and a full scan is performed because the table is too small. Sampling is used primarily to avoid excessive scans of data and affects only tables and indexes with 1,024 or more pages (8 MB).
In SQL Server 2008, statistics are created for all indices at the index creation time. SQL Server creates single-column statistics automatically when it compiles queries. These statistics are created for columns where the optimizer would otherwise have to estimate the approximate density or distribution. There are the following exceptions to this rule. Statistics might not be created automatically if (1) the database is read-only, (2) there are too many outstanding compilations in progress, and (3) the column's data type is not supported for auto stats.
The automatic statistics creation function can be disabled at the database level by executing the following command.
Similarly, automatic statistics creation can be enabled at the database level by executing the following command.
It's recommended that you leave this setting ON. Only disable it if you need to work around known performance issues, such as the need to specify a sampling rate other than the default for some tables.
By default, statistics are created by sampling the data set when the CREATE STATISTICS command is executed or when the statistics are automatically created. CREATE INDEX scans the whole data set anyway; therefore, the index statistics are initially created without sampling (the equivalent of fullscan). The CREATE STATISTICS command allows you to set the sample size in the WITH clause either by specifying fullscan or the percentage of data or number of rows to scan. The latter is interpreted as an approximation. It is also possible to inherit the previous sample size when you specify WITH RESAMPLE on the UPDATE STATISTICS command. This is particularly useful when there are indexes on some columns or column sets (originally created with fullscan statistics) and only statistics on some other columns or column sets (originally created with SAMPLE statistics). Using the resample option on UPDATE STATISTICS will then maintain the fullscan statistics for the indexes and sample statistics for the rest of the columns.
For small tables, a minimum of 8 MB of data is sampled. If a table starts small, and you sample with the default sampling rate, and thereafter use the resample option when updating the statistics, you will get the equivalent of fullscan even as the table grows beyond 8 MB. Avoid the use of resample if you want the default sampling rate as table size varies.
The resample sampling rate is computed as a function of the number of rows sampled and the total number of rows in the table during the previous statistics calculation. Because the actual sampling rate may vary due to the random nature of sampling, the resample rate is only an approximation of the prior sampling rate for non-fullscan samples. For consistently repeatable sampling, explicitly specify the same sampling rate you used before with UPDATE STATISTICS instead of using resample.
The dbcc show_statistics command displays the sample size under the Rows Sampled heading. Statistics created automatically, or updated automatically (as described in the next section) are always generated using default sampling. The default sampling rate is a slow-growing function of the table size, which allows statistics to be gathered relatively quickly even for very large tables.
When statistics are created and updated, the optimizer must choose an access path to gather the statistics. An access path may include a heap, clustered index, or nonclustered index. For sampled statistics, the optimizer tries to avoid access paths that are physically sorted on the leading statistics key column. This helps provide a more random sample, which leads to more accurate statistics. For those access paths that are not sorted on the statistics key (if such an access path exists), the lowest cost access path is chosen. This is the narrowest index or the heap. For fullscan statistics, the sort order of the access paths is not significant for statistical accuracy, so the lowest cost access path is chosen.
SQL Server Profiler can monitor automatic statistics creation. The Auto Stats event is in the group of Performance trace events. Select the IntegerData, Success, and ObjectID columns for the Auto Stats event when you define the trace. After the Auto Stats event is captured, the Integer Data column contains the number of statistics created or updated for a given table, the Object ID is the ID of the table and the TextData column (included in the trace definition by default) contains names of the columns for which statistics creation or update was performed, together with either an Updated: or Created: prefix. The Success column contains an indication of the success or failure of the Auto Stats operation. Specifically, Success has three possible values.
Auto Stats creation or update failed for reasons other than THROTTLED (see below). For example, the database was read-only.
Auto Stats creation or update succeeded.
Auto Stats creation or update failed because too many optimizations were in progress.
In some cases, you might also observe an Auto Stats event in which no statistics are created or updated. Such an event is generated when auto update statistics is turned off, or when there have been a substantial number of changes to a table referenced in a query, and the optimizer has been able to remove all references to that table from the query plan because of the structure of the query and the existence of a foreign key constraint.
The DROP STATISTICS command is used to drop statistics, but it is not possible to drop statistics that are a byproduct of an index. Such statistics are removed only when the index is dropped.
Maintaining Statistics in SQL Server 2008
After a series of INSERTs, DELETEs, and/or UPDATEs are performed on a table the statistics may not reflect the true data distribution in a given column or index. If the SQL Server query optimizer requires statistics for a particular column in a table that has undergone substantial update activity since the last time the statistics were created or updated, SQL Server automatically updates the statistics by sampling the column values (by using auto update statistics). The statistics auto update is triggered by query optimization or by execution of a compiled plan, and it involves only a subset of the columns referred to in the query. Statistics are updated before query compilation if AUTO_UPDATE_STATISTCS_ASYNC is OFF, and asynchronously if it is ON. When statistics are updated asynchronously, the query that triggered the update proceeds using the old statistics. This provides more predictable query response time for some work loads, particularly those with short running queries and very large tables.
When a query is first compiled, if the optimizer needs a particular statistics object, and that statistics object exists, the statistics object is updated if it is out of date. When a query is executed and its plan is in the cache, the statistics the plan depends on are checked to see if they are out of date. If so, the plan is removed from the cache, and during recompilation of the query, the statistics are updated. The plan also is removed from the cache if any of the statistics it depends on have changed.
SQL Server 2008 determines whether to update statistics based on changes to column modification counters (colmodctrs).
A statistics object is considered out of date in the following cases:
· If the statistics is defined on a regular table, it is out of date if:
o The table size has gone from 0 to >0 rows (test 1).
o The number of rows in the table when the statistics were gathered was 500 or less, and the colmodctr of the leading column of the statistics object has changed by more than 500 since then (test 2).
o The table had more than 500 rows when the statistics were gathered, and the colmodctr of the leading column of the statistics object has changed by more than 500 + 20% of the number of rows in the table when the statistics were gathered (test 3).
· For filtered statistics, the colmodctr is first adjusted by the selectivity of the filter before these conditions are tested. For example, for filtered statistics with predicate selecting 50% of the rows, the colmodctr is multiplied by 0.5.
· One limitation of the automatic update logic is that it tracks changes to columns in the statistics, but not changes to columns in the predicate. If there are many changes to the columns used in predicates of filtered statistics, consider using manual updates to keep up with the changes.
If the statistics object is defined on a
temporary table, it is out of date as discussed above, except that there is an
additional threshold for recomputation at 6 rows, with a test otherwise
identical to test 2 in the previous list.
Table variables do not have statistics at all.
The auto update statistics feature described above may be turned off at different levels.
· On the database level, disable auto update statistics by using the following command.
ALTER DATABASE dbname SET AUTO_UPDATE_STATISTICS OFF
· At the table level, disable auto update statistics using the NORECOMPUTE option of the UPDATE STATISTICS command or CREATE STATISTICS command.
sp_autostats to display and change the auto update statistics
setting for a table, index, or statistics object.
Re-enabling the automatic updating of statistics can be done similarly using ALTER DATABASE, UPDATE STATISTICS, or sp_autostats.
SQL Server 2008 maintains the automatic statistics update setting on per-database, per-table, and per-index or statistics object levels. Although you can change this setting for all statistics on one table by using a single sp_autostats command, this is implemented by changing the setting individually for each statistics object and index on a given table. No metadata explicitly records that auto update statistics is ON or OFF for the whole table.
The following table shows the combined effect of different database, table, and index settings.
|Database setting||Index or statistics object setting||Statistics auto update is in effect for the object|
It is not possible to override the database setting of OFF for auto update statistics by setting it ON at the statistics object level.
Auto statistics update is always performed by sampling the index or table using the default sampling rate. To set the sampling rate explicitly, run CREATE STATISTICS or UPDATE STATISTICS.
Statistics update is covered by the same SQL Server Profiler event as statistics creation.
String Summary Statistics
SQL Server 2008 includes patented technology for estimating the selectivity of LIKE conditions. It builds a statistical summary of substring frequency distribution for character columns (a string summary). This includes columns of type text, ntext, char, varchar, and nvarchar. Using the string summary, SQL Server can accurately estimate the selectivity of LIKE conditions where the pattern may have any number of wildcards in any combination. For example, SQL Server can estimate the selectivity of predicates of the following form.
Column LIKE 'string%' Column LIKE '%string' Column LIKE '%string%' Column LIKE 'string' Column LIKE 'str_ing' Column LIKE 'str[abc]ing' Column LIKE '%abc%xy'
If there is a user-specified escape character in a LIKE pattern (that is, the pattern is of the form LIKE pattern ESCAPE escape_character), SQL Server 2008 guesses selectivity.
This is an improvement over SQL Server 2000, which uses a guess for selectivity when any wildcard other than a trailing wildcard % is used in the LIKE pattern, and it has limited accuracy in its estimates in that case.
The String Index field in the first row set returned by DBCC SHOW_STATISTICS includes the value YES if the statistics object also includes a string summary. The contents of the string summary are not shown. The string summary includes additional information beyond what is shown in the histogram.
For strings longer than 80 characters, the first and last 40 characters are extracted from the string and concatenated prior to considering the string in the creation of the string summary. Hence, accurate frequency estimates for substrings that appear only in the ignored portion of a string are not available.
Statistics on Computed Columns
SQL Server 2008 and SQL Server 2005 support the creation, update, and use of statistics on computed columns even when a query doesn't contain the computed column by name, but rather contains the computed column expression. Earlier versions can only automatically create, update, and use statistics on computed columns if the column is named explicitly in the query.
You can observe auto creation of computed column statistics for the AdventureWorks database table column Sales.SalesOrderHeader.TotalDue if you execute the following Transact-SQL script in SQL Server 2008.
USE AdventureWorks GO -- Remove all statistics for Sales.SalesOrderHeader DECLARE c CURSOR FOR SELECT name FROM sys.stats WHERE object_id = object_id('Sales.SalesOrderHeader') AND auto_created <> 0 AND user_created <> 0 DECLARE @name NVARCHAR(255) OPEN c FETCH next FROM c INTO @name WHILE @@FETCH_STATUS = 0 BEGIN EXEC ('drop statistics Sales.SalesOrderHeader.' + @name) FETCH NEXT FROM c INTO @name END CLOSE c DEALLOCATE c -- Query Sales.SalesOrderHeader with an expression equivalent -- to the TotalDue computed column, -- ((isnull(([SubTotal]+[TaxAmt])+[Freight],(0))). SELECT * FROM Sales.SalesOrderHeader h, Sales.SalesOrderDetail d WHERE h.SalesOrderID = d.SalesOrderID AND (isnull(([SubTotal]+[TaxAmt])+[Freight],(0))) > 200000.00 ORDER BY TotalDue DESC -- List the created statistics. Observe that statistics -- are created for TotalDue even though it is not explicitly -- referenced in the query. sp_helpstats 'Sales.SalesOrderHeader'
SQL Server 2008 does not support statistics on non-persisted computed columns that involve a common language runtime (CLR) expression, such as an invocation of a CLR user-defined function. To create statistics on a CLR computed column, the column must be marked PERSISTED.
Statistics on CLR User-Defined Type Columns
SQL Server 2008 supports the creation, update, and use of statistics on CLR user-defined type columns if the type supports binary ordering. Statistics are not supported on user-defined types that don't support binary ordering. A type is binary ordered if it has the IsByteOrdered flag set to true in the SqlUserDefinedType attribute that is specified as part of the type definition. If a type supports binary ordering, standard binary sort ordering is a semantically correct order for the type.
Statistics and Indexed Views
Normally, statistics are not required on indexed views. This is because the substitution of the indexed views into the query plan is considered only after all the statistics for the underlying tables and indexes are attached to the query plan. There is one exception: Statistics are used if the view is directly referenced by the NOEXPAND hint in a FROM clause. Note that an error is generated and the plan is not created if the NOEXPAND hint is used on a view that does not also contain an index.
Because of their limited use, the statistics on indexed views are not created by using sp_createstats or updated by using sp_updatestats. The auto update and auto create statistics features work for indexed views. But as noted before, such statistics would be required by the optimizer, and they are subsequently created only if the indexed view is used with the NOEXPAND hint in the query and if the AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS options are set to ON. You can also manually perform CREATE STATISTICS on the indexed view columns or use UPDATE STATISTICS to update column or index statistics on indexed views.
Best Practices for Managing Statistics
The goal for using statistics in SQL Server is to allow the optimizer to get good cardinality estimates so it can find good query plans, while at the same time keeping any overhead or delay associated with gathering statistics reasonable. Below we list best practices for managing statistics in SQL Server, with the most important ones first.
Use auto create and auto update statistics
For a large majority of SQL Server installations, the most important best practice is to use auto create and auto update statistics database-wide. Auto create and auto update statistics are on by default. If you observe bad plans and suspect that missing or out of date statistics are at fault, verify that auto create and auto update statistics are on.
When statistics are created or maintained automatically, SQL Server samples the rows by default, instead of performing a full scan. (An exception is when statistics are created as a by-product of index creation, when SQL Server uses full scan.) Update of statistics with a different sampling rate (from full scan to sampled, or vice versa), can cause query plan changes. If query plan stability is a high priority, drop and re-create with sampling any statistics that are the by-product of index creation.
If needed, selectively use FULLSCAN statistics
If you are using auto create and auto update statistics and you are getting a bad query plan because the statistics are not accurate or current, do the following:
1. Leave auto create and auto update statistics on.
Only for those statistics that are not accurate
or current enough, use
CREATE STATISTICS … WITH FULLSCAN, NORECOMPUTE
and a batch job that does
UPDATE STATISTICS … WITH FULLSCAN, NORECOMPUTE periodically.
The frequency with which you should update statistics depends on your application, and it may require some experimentation to determine. A good starting point for the frequency of fullscan update is your update rate. If the table under consideration has a high update rate, run a fullscan statistics update nightly. If the table has a low update rate, run a fullscan statistics update weekly.
If you use a local variable in a query predicate instead of a parameter or literal, the optimizer resorts to a reduced-quality estimate, or a guess for selectivity of the predicate. Use parameters or literals in the query instead of local variables, and the optimizer typically can pick a better query plan. For example, consider this query, which uses a local variable.
declare @StartOrderDate datetime set @StartOrderDate = '20040731' select * from Sales.SalesOrderHeader h, Sales.SalesOrderDetail d WHERE h.SalesOrderID = d.SalesOrderId AND h.OrderDate >= @StartOrderDate
The number of rows from Sales.SalesOrderHeader that the optimizer estimates will qualify versus the condition h.OrderDate >= @StartOrderDate is 9439.5, which is exactly 30 percent of the size of the table. You can use the graphical showplan for the query and right-click the plan node for Sales.SalesOrderHeader to display this cardinality estimate. In a prerelease version of SQL Server 2008 used while preparing this paper, the plan chosen uses a merge join (the observations that follow are based on this same SQL Server 2008 version; your results may differ depending on your SQL Server version, available memory, and so on.). Now, consider this equivalent query, which doesn't use a local variable:
SELECT * FROM Sales.SalesOrderHeader h, Sales.SalesOrderDetail d WHERE h.SalesOrderID = d.SalesOrderId AND h.OrderDate >= '20040731'
The cardinality of the result set for the predicate "h.OrderDate >= '20040731'" is estimated as 40 in the graphical showplan for the query (right-click the Filter operator), for a selectivity of 0.13%. The plan chosen for this query uses a nested loop join instead of a merge join because of this improved estimate.
Even when local variables are used in a query, an estimate that is better than a guess is used in the case of equality predicates. Selectivity for conditions of the form "@local_variable = column_name" is estimated using the average value frequency from the histogram for column_name. So, for example, if the column column_name contains all unique values, a selectivity estimate of 1/(number of unique values in column) is used, which is accurate.
To eliminate the use of local variables, consider (1) rewriting the query to use literals instead of variables, (2) using sp_executesql with parameters that replace your use of local variables, or (3) using a stored procedure with parameters that replace your use of local variables. Dynamic SQL via EXEC may also be useful for eliminating local variables, but it typically results in higher compilation overhead and more complex programming. A new enhancement in SQL Server 2008 is that the OPTION(RECOMPILE) hint expands both variables and parameters inline as constant literals before optimization. If you can tolerate the extra compile time, this is an excellent way to work around the problem of poor selectivity estimates associated with variables in query predicates.
Consider limiting use of multistatement table-valued user-defined functions and table variables
Multistatement table-valued user-defined functions (also known as TVFs) have no statistics. The optimizer must guess the size of their results. Similarly, table variables don't have statistics, and the optimizer must resort to guesses for their cardinality. If a bad plan results because of these guesses, consider using a standard table or temporary table as a temporary holding place for the results of the table-valued user-defined function, or a replacement for the table variable. This allows the optimizer to use better cardinality table-valued user-defined function estimates.
Use multicolumn statistics when queries have multicolumn conditions
If a query has a multicolumn condition, consider using multicolumn statistics if you suspect that the optimizer is not producing the best plan for the query. You get multicolumn statistics as a by-product of creating a multicolumn index, so if there is already a multicolumn index that supports the multicolumn condition, there is no need to create statistics explicitly. Auto create statistics only creates single-column statistics, never multicolumn statistics. So if you need multicolumn statistics, create them manually, or create a multicolumn index.
If one of the columns has a low distinct value count, you may achieve higher accuracy if you create multiple filtered statistics, one per each distinct value. For more information, see the following best practice, “Consider filtered statistics when querying correlated columns.”
Consider a query that accesses the AdventureWorks.Person.Contact table, and contains the following condition.
To make selectivity estimation more accurate for this query, create the following statistics object:
This statistics object will be useful for queries that contain predicates on LastName and FirstName, as well as LastName alone. In general, the selectivity of a predicate on any prefix of the set of columns in a multicolumn statistics object can be estimated using that statistics object.
For a statistics object to fully support a multi-column condition, a prefix of the columns in the multicolumn statistics object must contain the columns in the condition. For example, a multicolumn statistics object on columns (a,b,c) only partially supports the condition a=1 AND c=1; the histogram is used to estimate the selectivity for a=1, but the density information for c is not used, because b is missing from the condition. Multicolumn statistics on (a,c) or (a,c,b) would support the condition a=1 AND c=1, and density information could be used to improve selectivity estimation.
Consider filtered statistics when querying correlated columns
If your table contains columns correlated to each other, the query optimizer will have difficulties predicting the selectivity of any expressions involving two or more such columns. Think about your favorite car make and model. Is there a statistical correlation between the two? For example, if a table contains the columns CarMake and CarModel, a query with the predicate WHERE CarMake=constant and CarModel=constant can result in underestimated expected rows in the query plan. This is because the query optimizer does not know how closely the two conditions are related (multicolumn densities can only be of limited use, because they count the total distinct values across the table). Filtered statistics can be a much better option in such cases. Define separate filtered statistics on the CarModel column for each CarMake.
CREATE STATISTICS model1 ON autos(CarModel) WHERE CarMake=constant1 CREATE STATISTICS model2 ON autos(CarModel) WHERE CarMake=constant2 ...
At query compilation time, the optimizer can make a much better estimate by picking the correct statistics object, based on the value of CarMake, and then looking up the number of models from the steps of the histogram.
Be alert for any situation where SQL Server falls back to guessing selectivity
SQL Server guesses selectivity in a number of situations. Often, the guesses are reasonable, the data sizes are small enough that this is not a problem, or the guess doesn't lead to a bad query plan. Nevertheless, when SQL Server must guess selectivity of a query predicate, sometimes less-than-optimal query plans result. If a query is not performing as well as you would like, and you suspect that a suboptimal query plan is being selected, look at the query and its resulting plan for signs that selectivity was guessed rather than estimated based on statistics. In many cases, you can modify the query or a small part of your application to avoid the guess. The following list gives some situations that can lead to guesses, and possible workarounds:
· Missing statistics: Verify that auto create statistics is enabled, or manually create statistics by using CREATE STATISTICS or sp_createstats. Check to see whether the database is read-only; auto create statistics does not work on read-only databases.
· Use of local variables in the query condition (this was addressed previously).
· Non-constant-foldable expressions in the query condition (for example, T.x+1 = 0). Rewrite the query to eliminate the expression, or evaluate the expression prior to executing the query and pass the result into the query as a parameter (not a local variable). In the case of T.x+1 = 0, rewrite the expression as T.x = -1. This is an equivalent expression, and it allows an accurate estimate instead of a guess.
· Complex expressions such as "Price + Tax > 100" or "Price * (1+TaxRate) > 100". If you see slower-than-desired query performance in such a case, consider creating a computed column with the equivalent expression, and creating statistics or an index on the computed column. Auto create statistics also creates statistics for the computed column if it exists, so you need not create the computed column statistics manually if auto create statistics is enabled.
Avoid modification of parameters in a stored procedure prior to use in a query
For best query performance, in some situations you'll need to avoid assigning a new value to a parameter of a stored procedure within the procedure body, and then using the parameter value in a query. The stored procedure and all queries in it are initially compiled with the parameter value first passed in as a parameter to the query. This is sometimes called parameter sniffing. Consider the following stored procedure, which gets sales on or after a certain date, or if NULL is passed as an argument, the last three months of sales.
CREATE PROCEDURE GetRecentSales (@date datetime) WITH RECOMPILE AS BEGIN IF @date IS NULL SET @date = dateadd("mm",-3,(SELECT MAX(OrderDATE) FROM Sales.SalesOrderHeader)) SELECT * FROM Sales.SalesOrderHeader h, Sales.SalesOrderDetail d WHERE h.SalesOrderID = d.SalesOrderID AND h.OrderDate > @date END
This stored procedure, if called with NULL, has the final SELECT statement optimized for @date = NULL. Because no rows have NULL OrderDate, the cardinality estimate for the result of applying this filter to SalesOrderHeader is very low (1 row). However, at run time, the date is not NULL, but three months before the latest OrderDate. The actual number of SalesOrderHeader rows that qualify is 5,736. The optimizer chooses a nested loop join for the query when NULL is passed to GetRecentSales, whereas the optimal plan contains a merge join. You can see the plan selected, and the expected and actual cardinalities, using this script.
SET STATISTICS PROFILE ON GO EXEC GetRecentSales NULL GO SET STATISTICS PROFILE OFF GO
Notice that the WITH RECOMPILE option specified on the GetRecentSales stored procedure above does not eliminate the cardinality estimation error. One way to ensure that the queries in this example are optimized with appropriate parameter values that allow good estimates to be obtained is to modify the stored procedure as follows, breaking it down into parts.
CREATE PROCEDURE GetRecentSales (@date datetime) AS BEGIN IF @date IS NULL SET @date = dateadd("mm",-3,(SELECT MAX(OrderDATE) FROM Sales.SalesOrderHeader)) EXEC GetRecentSalesHelper @date END CREATE PROCEDURE GetRecentSalesHelper (@date datetime) WITH RECOMPILE AS BEGIN SELECT * FROM Sales.SalesOrderHeader h, Sales.SalesOrderDetail d WHERE h.SalesOrderID = d.SalesOrderID AND h.OrderDate > @date -- @date is unchanged from compile time, -- so a good plan is obtained. END
Consider more frequent statistics gathering for ascending keys
Ascending key columns, such as IDENTITY columns or datetime columns representing real-world timestamps, can cause inaccurate statistics in tables with frequent INSERTs because new values all lie outside the histogram. Consider updating statistics on such columns frequently with a batch job if your application seems to be getting inadequate query plans for queries that have a condition on the ascending key column. How often to run the batch job depends on your application. Consider daily or weekly intervals, or more often if needed for your application.
Use asynchronous statistics update if synchronous update causes undesired delay
If you have a large database and an OLTP workload, and if you enable AUTO_UPDATE_STATISTICS, some transactions that normally run in a fraction of a second may very infrequently take several seconds or more because they cause statistics to be updated. If you want to avoid the possibility of this noticeable delay, enable AUTO_UPDATE_STATISTICS_ASYNC. For workloads with long-running queries, getting the best plan is more important than an infrequent delay in compilation. In such cases, use synchronous rather than asynchronous auto update statistics.
Consider filtered statistics for heterogeneous data
Sometimes rows with different schema are mapped to a single physical table, with multipurpose columns such as ntext1, ntext2, bigint1, bigint2 storing semantically unrelated data. Typically, there is also a special-purpose rowtype column that defines what is the semantic meaning of the data stored in each column. Such design is useful for storing arbitrary user-defined lists without changing the underlying database schema. As a result, the same column may end up storing telephone numbers and city names, and a histogram on such column may not be very useful, due to the limit of 200 steps. To avoid this, define separate statistics for each rowtype in this table.
Consider filtered statistics for partitioned tables
Statistics are defined at the table level. Changes to partitions affect statistics only indirectly, through the column modification counters. Switching in a partition is treated as an insert of its rows into the table, triggering statistics update based on the 20% rule, as outlined above. Filtered statistics, through their predicates, can target only rows in certain partition or partitions. There is no requirement to align to the boundaries of the partitions when defining the statistics.
Often, customers partition a table by the Date column, keeping a partition for every month, and updating only the last partition; older months receive the majority of the complex, read-only queries. In this scenario, creating separate, full-scan statistics on the read-only region of the table results in more accurate cardinality estimates. In order to benefit from the separate statistics object, queries must be contained within the read-only region. Similarly, separate statistics objects can be created for different regions based on the different access patterns.
SQL Server 2008 contains many enhancements to its statistics management capabilities. Most importantly, you can rely on the automatic creation and update of statistics to help ensure good query plans in most cases. When automatic statistics with the default sampling rate are not sufficient, you can control statistics sampling rates and creation and update time explicitly. If you are observing what appears to be a suboptimal query plan related to statistics or cost estimation, consider the best practices described in this paper.
For more information:
http://www.microsoft.com/sqlserver/: SQL Server Web site
http://technet.microsoft.com/en-us/sqlserver/: SQL Server TechCenter
http://msdn.microsoft.com/en-us/sqlserver/: SQL Server DevCenter
Did this paper help you? Please give us your feedback. Tell us on a scale of 1 (poor) to 5 (excellent), how would you rate this paper and why have you given it this rating? For example:
- Are you rating it high due to having good examples, excellent screen shots, clear writing, or another reason?
- Are you rating it low due to poor examples, fuzzy screen shots, or unclear writing?
This feedback will help us improve the quality of white papers we release.
SQL Server 2008 contains many enhancements to its statistics management capabilities. Most importantly, you can rely on the automatic creation and update of statistics to help ensure good query plans in most cases. When automatic statistics with the default sampling rate are not sufficient, you can control statistics sampling rates and creation and update time explicitly. If you are observing what appears to be a suboptimal query plan related to statistics or cost estimation, consider the best practices described in this paper.
[Mar04] Arun Marathe, Batch Compilation, Recompilation, and Plan Caching Issues in SQL Server 2005, http://www.microsoft.com/technet/prodtechnol/sql/2005/recomp.mspx, July 2004.