Statistics Used by the Query Optimizer in Microsoft SQL Server 2000
Statistics Used by the Query Optimizer in Microsoft SQL Server 2000
Statistical Data in SQL Server 2000
Statistics Collected by SQL Server 2000
Statistics for INDEX ''PK_Order_Details''
Creating Statistics with SQL Server 2000
Maintaining Statistics in SQL Server 2000
Statistics and Indexed Views
Summary: Describes what data is collected, where it is stored, and which commands create, update, and delete statistics about indexes and column data stored in the Microsoft SQL Server 2000 database. These statistics are used by the SQL Server query optimizer to choose the most efficient plan for retrieving data and performing INSERT, SELECT, DELETE or UPDATE queries. It also outlines how SQL Server default statistics creation and maintenance settings can be changed on different levels (index, table, and database). (15 printed pages)
Microsoft® SQL Server™ 2000 collects statistical information about indexes and column data stored in the database. These statistics are used by the SQL Server query optimizer to choose the most efficient plan for performing INSERT, SELECT, DELETE or UPDATE queries. This article describes what data is collected, where it is stored, and which commands create, update, and delete statistics. By default, SQL Server 2000 also creates and updates statistics automatically (when such an operation is considered to be useful). This article also outlines how these defaults can be changed on different levels (index, table, and database).
SQL Server 2000 collects statistics about individual columns (single column statistics) or sets of columns (multi-column statistics). All information about a single statistics object is stored in several columns of a single row in the SYSINDEXES table. Computed columns and columns of the ntext, text, or image data types cannot be specified as statistics columns. The combined width of all columns constituting a single statistics set must not be greater than 900 bytes.
- Time of the last statistics collection (inside STATBLOB).
- Number of rows in the table or index (rows column in SYSINDEXES).
- Number of pages occupied by the table or index (dpages column in SYSINDEXES).
- Number of rows used to produce the histogram and density information (inside STATBLOB, described below).
- Average key length (inside STATBLOB).
- Single column histogram, including the number of steps (inside STATBLOB).
Note A histogram is a set of up to 200 values of a given column. All (or selected, if statistics are collected by sampling) values in a given column are sorted; the ordered sequence is divided into intervals up to 199 so that the most statistically significant information is captured. In general, these intervals are of non-equal size. The following values are stored with each step of the histogram.
Table 1. Histogram values.
|EQ_ROWS||Specifies how many rows are exactly equal to RANGE_HI_KEY.|
|RANGE_ROWS||Specifies how many rows are inside the range (they are smaller than this RANGE_HI_KEY, but bigger than the previous smaller range key).|
|DENSITY||Specifies 1 / Number of distinct values (inside the range).|
When a dbcc show_statistics command is used, instead of the DENSITY information, two derived values are shown.
Table 2. Histogram showing dbcc show_statistics with two derived values.
|DISTINCT_RANGE_ROWS||Specifies how many distinct rows are inside this range (not accounting for the RANGE_HI_KEY value itself); DISTINCT_RANGE_ROWS = 1 / DENSITY.|
|AVG_RANGE_ROWS||Average number of rows per distinct value inside the range; AVG_RANGE_ROWS = DENSITY * RANGE_ROWS.|
Histograms in SQL Server 2000 are only built for a single column, the first column in the case of multi-column statistics, or an index.
SQL Server 2000 builds the histogram from the sorted set of column values in three steps. In the first step, up to 200 values of RANGE_HI_KEY, EQ_ROWS, RANGE_ROWS and DISTINC_RANGE_ROWS are collected. Each additional column value is processed in the second step; the value is either added to the last range (the values are sorted), or a new range is created. If the new range is created, one pair of existing, neighboring ranges is collapsed into a single range. The collapsed ranges are selected by considering the density information so that two neighboring ranges with the closest densities are collapsed in order to minimize information loss. In the third step, more ranges may be collapsed if they have close densities. Therefore, even if the column has more than 200 unique values, the histogram may have less than 200 steps.
If the histogram has been built using a sample, then the values of RANGE_ROWS, EQ_ROWS, DISTINCT_RANGE_ROWS and AVG_RANGE_ROWS are estimated and therefore they do not need to be whole integers.
Density is information about the number of duplicates in a given column or combination of columns and it is calculated as 1/(number of distinct values). When a column is used in an equal predicate the number of qualifying rows is estimated using the density derived from the histogram. Histograms are always used to estimate filtering by non-equality predicates.
Note A separate density value is also displayed in the first line of dbcc show_statistics, but this is not used by the optimizer in SQL Server 2000.
The multicolumn statistics for one set of columns consists of one histogram for the first column in the statistics definition, one density value for the first column, and an all density value for each prefix combination of columns (including the first column alone). Each set of statistics (a histogram and two or more density values) is stored in one row of SYSINDEXES together with the timestamp of the last statistics update, the number of rows in the sample used to produce the statistical information, the number of steps in the histogram, and the average length of the key. The number of rows value (rowcnt column) is maintained only for index number 0 or 1 (heap or clustered index) and it is replicated across all indexes on the table. Similarly, the dpages is maintained for each table and index. The statistical information is empty for tables without any rows at the time of statistics collection.
Use sp_helpindex and sp_helpstats to display the list of all statistics available for a given table; sp_helpindex lists all indexes on the table, and sp_helpstats lists all the statistics on the table. Each index also carries the statistical information for its columns. The statistical information created using the CREATE STATISTICS command is equivalent to the statistics built by a CREATE INDEX command in the same columns. The only difference is that the CREATE STATISTICS command uses sampling by default while the CREATE INDEX command gathers the statistics with fullscan since it has to process all rows for the index anyway.
Here is an example of all indexes and statistics for the Order Details table in the Northwind database. Since there are no statistics on the non-indexed columns initially, run sp_createstats while connected to the Northwind database first.
Table 3. Order Details table in the Northwind database.
sp_helpindex [Order Details]
|OrderID||nonclustered located on PRIMARY||OrderID|
|PK_Order_Details||clustered, unique, primary key located on PRIMARY||OrderID, ProductID|
|ProductID||nonclustered located on PRIMARY||ProductID|
sp_helpstats [Order Details]
Statistics can also be displayed using a dbcc show_statistics command like
dbcc show_statistics ([Order Details],PK_Order_Details)
If no information is returned, it means that the statistics were updated last time or the index was created when there were no rows in the table. To update the statistics for Order Details table, run UPDATE STATISTICS [Order Details]; to update statistics for all tables in the Northwind database, run sp_updatestats.
Note The output of the command has been edited for better readability.
Table 4. Statistics for INDEX ''PK_Order_Details''
|May 17 2000 10:38PM||2155||2155||192||1.1090337E-3||8.0|
|All Density||Average Length||Columns|
|4.6403712E-4||8.0||OrderID, Product ID|
|(192 row(s) affected)|
Note that while the sp_helpindex output shows only one column in the City statistics (City), the output of show_statistics displays the All Density value for column combination City,Customer_id as well. This is because there is a clustering index on a single column of Customer_id on the table, and each secondary index also contains the clustering key columns. Normally this fact is transparent to the user, but the query optimizer knows about the clustering columns and may avoid secondary fetches if only those column values are required on top of the secondary index columns for the query execution. The statistics contain the clustering key columns as well.
There are two basic statements in SQL Server 2000 that generate the statistical information described above: CREATE INDEX generates the declared index in the first place, and then as a byproduct creates one set of statistics for the column combination constituting the index. CREATE STATISTICS only generates the statistics for a given column or combination of columns.
In addition, there are several other ways to create statistics or indexes. Ultimately, though, each issues one of the above two commands.
- Use sp_createstats to create statistics for all eligible columns (except image and text data) for all user tables in the current database. A new histogram will not be created for columns that already have a histogram.
- Use dbcc dbreindex to rebuild one or more indexes for a table in the specified database.
- In the Query Analyzer, type in a query, select Show Execution Plan and then execute the query. Right-click on any icon in the displayed plan and choose Manage Indexes or Create/Update Statistics.
- Use Create Index wizard (described in a separate article).
Here is an example of a CREATE STATISTICS command on the pubs..authors table:
CREATE STATISTICS s1 ON authors (state, au_lname) WITH SAMPLE 50 PERCENT
Usually, statistics with default sampling are the best. However, there may be cases when statistics with larger sample sizes (ideally with fullscan) may benefit the query optimization, such as when the values in the given column are skewed (several values with very high frequency, the rest with low frequency). The price for using statistics with larger sample sizes is the time to create the statistics.
The above command creates single two-column statistics. In this case, the SAMPLE 50 PERCENT is ignored and a full scan is performed because the table is too small. Sampling is used primarily to avoid excessive scans of data and affects only tables and indices with 1024 or more pages (8 MB).
In SQL Server 2000, statistics are created for all indexes at the index creation time. SQL Server creates single column statistics automatically when compiling queries. These statistics are created for columns where the optimizer would have to estimate the approximate density or distribution otherwise. There are two exceptions to this rule: first, statistics may not be created for tables where the cost of the plan execution would be lower than the statistics creation itself, and secondly, in the case when the server is too busy (too many outstanding compilations in progress).
To avoid long term maintenance of unused statistics, SQL Server 2000 ages the automatically created statistics (only those that are not a byproduct of the index creation). After several automatic updates the column statistics are dropped rather than updated. If they are needed in the future, they may be created again. There is no substantial cost difference between statistics that are updated and created. The aging does not affect user-created statistics.
The automatic statistics creation function can be disabled at the database level by executing sp_dboption dbname, 'auto create statistics', 'OFF'.
By default, the statistics are created by sampling the data set when executing the CREATE STATISTICS command or when the statistics are automatically created. CREATE INDEX scans the whole data set anyway; therefore, the index statistics are initially created without sampling. The CREATE STATISTICS command allows you to set the sample size in the WITH clause either by specifying FULLSCAN or the percentage of data to scan. The latter is interpreted as an approximation. It is also possible to inherit the previous sample size when specifying WITH RESAMPLE on the UPDATE STATISTICS command. This is particularly useful when there are both indexes (originally created with fullscan statistics) and statistics only on some other columns (originally created with sample statistics). Using the RESAMPLE option on UPDATE STATISTICS will then maintain the fullscan statistics for the indexes and sample statistics for the rest of the columns.
The dbcc show_statistics command displays the sample size under the Rows Sampled heading. Statistics created automatically, or updated automatically are always generated using default sampling. The default sampling is a slow-growing logarithmic function of the table size.
The SQL Server query profiler can also monitor automatic statistics creation. The AutoStats event is in the group of Object trace events. When defining the trace, also select the Integer Data, Success, and Object ID columns. Once the AutoStats event is captured, the Integer Data column contains the number of statistics updated for a given table, the Object ID is the ID of the table, and the TextData column (included in the trace definition by default) contains names of the columns together with either an Updated: or Created: prefix. The Success column contains potential Failure indication. In some cases, you may also observe an AutoStats event with no statistics created or updated. Such an event is generated when the auto update statistics is turned off, or when there are substantial changes in a table and a query referring to the changed table is optimized so that there are no indexes or statistics on the table at that moment.
The DROP STATISTICS command is used to drop statistics, but it is not possible to drop statistics that are a byproduct of an index. Such statistics are removed only when the index is dropped.
After a series of INSERT, DELETE and/or UPDATE queries are performed on a table, the statistics may not reflect the true data distribution in a given column or index. If the SQL Server query optimizer requires statistics for a particular column in a table that has undergone substantial update activity since the last time the statistics were created or updated, SQL Server automatically updates the statistics by sampling the column values (using auto update statistics). The statistics auto update is triggered by query optimization and involves only a subset of columns referred to in the query. The rowmodctr column in SYSINDEXES value shows the number of changes on the table since the last time the statistics were updated or created. There is one row in SYSINDEXES for each index and statistics set (if the table does not have a clustered index then there is a row corresponding to the heap as well), and SQL Server 2000 maintains the amount of change for indexes and statistics separately (although this was not true in SQL Server 7.0). Throughout the INSERT, UPDATE, and DELETE queries performed in a table, the rowmodctr value is increased only for the index ID 0 or 1 (there is always only one on a table). For the rest of the indexes and statistics, it shows only a relative value that has to be added to the rowmodctr of the index 0 or 1 to get the true number of changed rows for this index.
This logic enables the maintenance rowmodctr through the INSERT, UPDATE, and DELETE queries in a single row per table (the index 0/1), while at the same time allowing for the changed rows to be tracked individually for each index. Therefore if auto update statistics is on, rows are updated only on the indexes and columns that are necessary for a given query.
Five hundred and six rows are inserted into the table t1, and then non-clustered indexes i1 and i2 are created. The interesting entries in the SYSINDEXES table after these operations are:
After another 213 rows are inserted, the counts are:
After the statistics for the index i1 are refreshed using UPDATE STATISTICS t1 (i1):
The auto update statistics feature described above may be turned off at different levels.
- Use sp_dboption dbname, 'auto create statistics', 'OFF' on the database level. Use this sp also to display the current setting for a particular database.
- Use the NORECOMPUTE option of the UPDATE STATISTICS command for table, index, or statistics.
- Use the NORECOMPUTE option of the CREATE STATISTICS command for table or statistics.
- Use sp_autostats to display and change setting for table, index, or statistics.
Re-enabling the automatic updating of statistics can be done similarly using the sp_dboption, UPDATE STATISTICS or sp_autostats.
SQL Server 2000 keeps the automatic statistics update setting on per database, as well as per index and table levels. While you can switch it to all statistics on one table using a single CREATE STATISTICS command, this must be done by changing the setting for all statistics and indexes on a given table. Table 6 shows the combined effect of different database, table, and index settings.
Table 6. Combined effect of different database, table, and index settings
|Statistics Auto Update |
is in effect for the object
Auto statistics update is always performed by sampling the index or table. Manually running CREATE and UPDATE statistics allows you to change the sample size. The statistics update is covered by the same SQL Profiler event as the above statistics creation.
Normally, statistics are not required on indexed views. This is because substitution of the indexed views into the query plan is considered only after all the statistics for the underlying tables and indexes are attached to the query plan. However, there is one exception: statistics will be used if the view is directly referenced in the FROM clauses using the NOEXPAND hint. If the NOEXPAND hint is used on a view that does not contain an index, an error will be generated and the plan will not be created.
Because of their very limited use, the statistics on indexed views are not created using the sp_createstats or updated using the sp_updatestats. The auto update and auto create statistics work for indexed views. But as noted before, such statistics would be required by the optimizer and subsequently created only if the indexed view is used with the NOEXPAND hint in the query and the AUTO UPDATE or CREATE STATISTICS option is turned on. You can also manually perform CREATE STATISTICS on the indexed view columns or use UPDATE STATISTICS to update column or index statistics on indexed views.
Statistics about the data in the database are important input for the query optimizer. By default, SQL Server is creating and maintaining the statistics without any user intervention. The majority of SQL Server users will achieve optimal performance without any changes to this default. SQL Server provides several interfaces to change the default statistics creation and maintenance policies, and these should be used only in special circumstances when the default policies do not provide the optimal performance.